[redland-dev] turtle serializer isn't scalable at all

Alexander I. Gordeev lasaine at lvk.cs.msu.su
Tue May 19 10:47:37 CEST 2009


Sorry for the delay...

On Wed, May 13, 2009 at 07:07:24PM -0400, Dave Robillard wrote:
> On Thu, 2009-05-14 at 01:44 +0400, Alexander Gordeev wrote:
> > Hi All!
> > 
> > I tried to use raptor-utils a while ago to convert very big files in ntriples 
> > format to turtle but it went too slow. I've discovered in the code that 
> > turtle serializer collects all triples in memory and outputs once in the end. 
> > (If this is not true please correct me.) I think this is done to make 
> > subjects appear only once. You shouldn't do this IMO because this means the 
> > performance is really BAD!
> > 
> > Turtle is really meant to be a stream format i.e. the serializer should not 
> > collect lots of triples. Collect triples while the subject is the same and 
> > write them down as soon as the subject changes. This is IMO the right way to 
> > do. If you want to optimize the output you can just use 'sort' on ntriples 
> > file before the conversion. sort does this job MUCH better.
> > 
> > Sorry, I don't have a patch and I'm not going to write it because I don't use 
> > rapper anymore. But I decided to write about this issue because it was the 
> > only shortcoming I've noticed. Thanks for the great software!
> 
> This has been discussed occasionally for a while.

Hmm, I've searched the archives and found nothing. Could you please
give me a link to the discussion?

> The problem is that the serializer does not know if the triple stream is
> sorted.  This is solvable easily enough, it just needs doing...

Well, why should it care whether the triple stream is sorted? I've just
checked the spec one more time and there is not a single word about it.

> The rdfxml-abbrev serialiser works in the exact same way, BTW.  If you
> really need performance, use ntriples.

We needed both performance and small size. This is where turtle is good
(if done right ;) ).

> P.S. 'sort' on an ntriples file won't actually give you properly sorted
> triples, and the problem remains that the serialiser needs to know
> anyway

Why not? And what do you mean by 'properly sorted'?
I had to do a nightly backup of an RDF database. Also we wanted to be
able to see which changes went into the database. I did it this way:
  * dump the DB to ntriples files
  * sort these files
  * convert them into turtle using our own converter (because rapper was
    too slow)
  * put the resulting files under version control using SVN or Git and
    then update them nightly
And sort does its job perfectly in this scheme. If I do 'svn diff' it
shows exactly the real changes, not the garbage you would get from
improper sorting. This has worked perfectly for about a month.
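
To illustrate the streaming idea (a minimal sketch only, not our actual
converter and not raptor code): collect triples while the subject stays
the same and write them out as soon as it changes. The function names
are made up for this example, and the line parsing is deliberately
simplified; it assumes one well-formed triple per line. N-Triples terms
are copied through unchanged, which should be fine since N-Triples is a
subset of Turtle.

```python
import sys
from itertools import groupby


def parse_ntriples_line(line):
    """Split one N-Triples line into (subject, predicate, object) strings.

    Simplified on purpose: assumes a single well-formed triple per line
    and no whitespace inside the subject or predicate terms.
    """
    subject, predicate, rest = line.strip().split(None, 2)
    rest = rest.rstrip()
    if not rest.endswith("."):
        raise ValueError("missing terminating '.': %r" % line)
    # Drop the statement terminator; the object term itself is kept verbatim.
    return subject, predicate, rest[:-1].rstrip()


def ntriples_to_turtle(lines, out):
    """Write Turtle to `out`, buffering only the current subject's triples."""
    triples = (
        parse_ntriples_line(line)
        for line in lines
        if line.strip() and not line.lstrip().startswith("#")
    )
    # groupby() only merges *consecutive* triples with an equal subject, so
    # a block is flushed as soon as the subject changes.
    for subject, group in groupby(triples, key=lambda t: t[0]):
        pairs = ["%s %s" % (p, o) for _, p, o in group]
        out.write("%s\n    %s .\n\n" % (subject, " ;\n    ".join(pairs)))


if __name__ == "__main__":
    # Example: sort dump.nt | python nt2ttl.py > dump.ttl
    ntriples_to_turtle(sys.stdin, sys.stdout)
```

Unsorted input still gives valid Turtle with this approach; a subject
that comes back later simply starts a new block, so the output is less
compact but the memory use stays flat.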

--
  Alexander

