[redland-dev] raptor turtle serializer scalability patch

Dave Beckett dave at dajobe.org
Tue Nov 10 02:26:16 CET 2009


Chris Cannam wrote:
> Attached is a patch for raptor svn r15635 (also works with the 1.4.19
> release) which addresses a scalability problem in Turtle serialization
> by replacing raptor_sequence with raptor_avltree for the subject and
> blanks containers passed to raptor_abbrev_node_lookup.  This attempts
> to fix the "this should really be a hash, not a list" FIXME previously
> noted in that function.
> 
> The tricky bit is handling the way blank nodes were removed from the
> sequence when written, if they were not going to be needed again.
> Because you can't just replace an item in the tree with NULL (as was
> done in the sequence) without breaking tree ordering, and you can't
> remove an item from the tree while iterating over it, I've instead
> added a "valid" flag to the subject struct itself which is reset on
> writing and subsequently tested to prevent duplicate writes.  I'm not
> hugely keen on this, should anyone have any better ideas.
> 
> This patch reduces the runtime of my own test case (c. 400K triples
> constructed and serialized) from about 25 minutes to about 14 seconds.
> 
> It also reduces the runtime for the Turtle test suite from about 12.1
> to 10.4 seconds on this machine in release trim, and it passes
> valgrind --leak-check=full with no errors or leaks.
> 
> The bad news is that it causes a number of unit tests to fail because
> of changes to the ordering in output.  One test (ex-38) in the rdfxml
> and turtle test suites fails (I think because rdfdiff is wrongly
> seeing a difference that isn't there), and the entire feeds test suite
> fails (I think because it doesn't use rdfdiff at all).  I haven't
> spotted anything that looks like a "real" failure, but I might be
> missing something.  Thoughts welcome.

Thanks for the patch Chris.  It certainly was something that we knew was a
performance problem for serializing large turtle and rdf/xml-abbrev graphs.

I had a quick look at the rdfdiff logic and it does look dubious.  I may see
if I can fix it so the tests still run.

If I get it working, I'll roll a new raptor1 release with this (and another
crash fix patch already in SVN) and commit it to the raptor2 trunk also.

Thanks again.

Dave


More information about the redland-dev mailing list