[redland-dev] turtle serializer isn't scalable at all

Alexander I. Gordeev lasaine at lvk.cs.msu.su
Wed May 20 12:44:24 CEST 2009


On Tue, May 19, 2009 at 11:44:19AM -0400, Dave Robillard wrote:
> > > The problem is that the serializer does not know if the triple stream is
> > > sorted.  This is solvable easily enough, it just needs doing...
> > 
> > Well, why should it bother if the triple stream is sorted? I've just
> > checked the spec one more time and there is no single word about it.
> 
> ... because it has to be sorted to be able to write abbreviated turtle?

But if it is not sorted and serialized as is (probably not optimally)
the resulting file will be a valid turtle, right? And the fact is that
rapper currently fails to sort input if it is big enough! I think
rapper should do it's job only - serializing - and not try to do what
'sort' does good enough and fast. Or maybe there should be an option
somewhere for switching this behaviour off...

> > > P.S. 'sort' on an ntriples file won't actually give you properly sorted
> > > triples, and the problem remains that the serialiser needs to know
> > > anyway
> > 
> > Why not? And what do you mean by 'properly sorted'?
> 
> Uh... alphabetical order by text is obviously not equal to the (SPO)
> sorted order

True.

> needed to write abbreviated turtle:

I think not. I've attached an archive with six files:
 * notsorted.ntriples: your original example
 * notsorted-rapper.turtle: first file converted to turtle by rapper
 * notsorted-mytool.turtle: first file converted to turtle by my tool
 * sorted.ntriples: your example sorted by 'sort'
 * sorted-rapper.turtle: third file converted to turtle by rapper
 * sorted-mytool.turtle: third file converted to turtle by my tool

Note that my tool doesn't sort input at all, just like I proposed. BTW,
rapper doesn't sort as well. It just tries to abbreviate everything
optimally by size (but not speed).
You can see that everything is abbreviated ok. This example only shows
that alphabetical sort is ok. I don't have time at the moment to provide
a better example with lots of triples but I can do it later if you wish.

IMO much of the slowdown in rapper is because rapper puts subject in
an array (I checked the code) which has O(n) seek time instead of some
binary search tree which will give O(log N) seek time. BTW STL has very
fast rbtree implementation. std::set or std::map are based on them.

> <http://a> <http://b> <http://c>
> <http://ab> <http://b> <http://c>
> 
> The first triple should be the first in increasing sorted order (because
> the subject URI is < the second), but textually sorted this is wrong
> (because the character '>' is > 'b').

--
  Alexander
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sorttest.tgz
Type: application/x-gtar
Size: 400 bytes
Desc: not available
Url : http://lists.librdf.org/pipermail/redland-dev/attachments/20090520/7093a369/attachment.tgz 


More information about the redland-dev mailing list