[redland-dev] raptor turtle serializer scalability patch

Dave Beckett dave at dajobe.org
Mon Nov 16 18:20:33 CET 2009


On Mon, 9 Nov 2009, Chris Cannam wrote:
> Attached is a patch for raptor svn r15635 (also works with the 1.4.19
> release) which addresses a scalability problem in Turtle serialization
> by replacing raptor_sequence with raptor_avltree for the subject and
> blanks containers passed to raptor_abbrev_node_lookup.  This attempts
> to fix the "this should really be a hash, not a list" FIXME previously
> noted in that function.

Thanks a lot for this.  It does indeed fix that problem.

>
> The tricky bit is handling the way blank nodes were removed from the
> sequence when written, if they were not going to be needed again.
> Because you can't just replace an item in the tree with NULL (as was
> done in the sequence) without breaking tree ordering, and you can't
> remove an item from the tree while iterating over it, I've instead
> added a "valid" flag to the subject struct itself which is reset on
> writing and subsequently tested to prevent duplicate writes.  I'm not
> hugely keen on this, should anyone have any better ideas.

Seems adequate for now.  I guess we should add a new FIXME for that :)

> This patch reduces the runtime of my own test case (c. 400K triples
> constructed and serialized) from about 25 minutes to about 14 seconds.
>
> It also reduces the runtime for the Turtle test suite from about 12.1
> to 10.4 seconds on this machine in release trim, and it passes
> valgrind --leak-check=full with no errors or leaks.
>
> The bad news is that it causes a number of unit tests to fail because
> of changes to the ordering in output.  One test (ex-38) in the rdfxml
> and turtle test suites fails (I think because rdfdiff is wrongly
> seeing a difference that isn't there), and the entire feeds test suite
> fails (I think because it doesn't use rdfdiff at all).  I haven't
> spotted anything that looks like a "real" failure, but I might be
> missing something.  Thoughts welcome.

I've had a bit of a look at this and it seems the rdfdiff 
rdf-graph-compare algorithm is naive/broken and relies on the semantics of 
the current raptor_abbrev (rdfxml-abbrev and turtle serializing) module 
too much.  I'm not sure what I'll do about this but I'll carry on looking.

At this stage I'd like to commit the patch (with some minor code style 
fixes) but I would really like to get rdfdiff working first.  Hmm.

Dave


More information about the redland-dev mailing list