[redland-dev] [Raptor RDF Parsing and Serializing Library 0000355]: libraptor serializer very slow for large number of objects - No flush API
Mantis Bug Tracker
mantis-bug-sender at librdf.org
Tue Feb 23 22:46:26 CET 2010
The following issue has been SUBMITTED.
======================================================================
http://bugs.librdf.org/mantis/view.php?id=355
======================================================================
Reported By: scudette
Assigned To:
======================================================================
Project: Raptor RDF Parsing and Serializing Library
Issue ID: 355
Category: api
Reproducibility: always
Severity: tweak
Priority: normal
Status: new
Syntax Name: Turtle
======================================================================
Date Submitted: 2010-02-23 21:46
Last Modified: 2010-02-23 21:46
======================================================================
Summary: libraptor serializer very slow for large number of
objects - No flush API
Description:
When serializing into Turtle, the raptor_turtle_serialize_statement()
function maintains an AVL tree to group all the statements by subject.
This is necessary to ensure that all statements about the same subject
are emitted together, even if the statements are serialized in random
order.
This behaviour is reasonable when the statements are issued in random
order. In many applications, however, the statements are issued
essentially in the correct order already - that is,
raptor_turtle_serialize_statement() is called for all the predicates of
each subject in turn, as in the sketch below.
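To make this calling pattern concrete, here is a minimal sketch against
the raptor 1.x public API; the example.org URIs, the two predicates and
the subject count are made up for illustration only:

  /* Feed all statements of each subject to the serializer in turn -
   * the "correct order" case described above.  URIs are made up. */
  #include <stdio.h>
  #include <string.h>
  #include <raptor.h>

  int main(void)
  {
    raptor_serializer *serializer;
    raptor_uri *base_uri, *name_pred, *size_pred;
    raptor_statement st;
    int i;

    raptor_init();

    serializer = raptor_new_serializer("turtle");
    base_uri = raptor_new_uri((const unsigned char *)"http://example.org/");
    name_pred = raptor_new_uri((const unsigned char *)"http://example.org/name");
    size_pred = raptor_new_uri((const unsigned char *)"http://example.org/size");

    raptor_serialize_start_to_file_handle(serializer, base_uri, stdout);

    for(i = 0; i < 3; i++) {
      unsigned char subj_str[64];
      raptor_uri *subject;

      sprintf((char *)subj_str, "http://example.org/object/%d", i);
      subject = raptor_new_uri(subj_str);

      /* first predicate of this subject */
      memset(&st, 0, sizeof(st));
      st.subject = subject;
      st.subject_type = RAPTOR_IDENTIFIER_TYPE_RESOURCE;
      st.predicate = name_pred;
      st.predicate_type = RAPTOR_IDENTIFIER_TYPE_RESOURCE;
      st.object = (const unsigned char *)"example name";
      st.object_type = RAPTOR_IDENTIFIER_TYPE_LITERAL;
      raptor_serialize_statement(serializer, &st);

      /* second predicate of the same subject, before moving on */
      st.predicate = size_pred;
      st.object = (const unsigned char *)"42";
      raptor_serialize_statement(serializer, &st);

      /* drop our reference; the serializer keeps its own copy */
      raptor_free_uri(subject);
    }

    raptor_serialize_end(serializer);

    raptor_free_uri(size_pred);
    raptor_free_uri(name_pred);
    raptor_free_uri(base_uri);
    raptor_free_serializer(serializer);
    raptor_finish();
    return 0;
  }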
The problem is that maintaining the AVL tree slows down
raptor_turtle_serialize_statement() significantly for a large number of
subjects. Memory is consumed by the AVL tree until serialize_end() is
called; only then is the tree walked and any output produced. For a very
large number of subjects this is very slow, and no output at all appears
until the very end.
The API really needs a flush() function which can be called when you know
you are done serializing a subject. When flush() is called, the tree can
be freed and all subjects collected so far can be dumped into the stream.
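For illustration, such an API could look like the sketch below;
raptor_serialize_flush() is a hypothetical name for the proposed
function and does not exist in raptor today:

  /* Hypothetical addition to the public serializer API.  Write out
   * everything accumulated so far for subjects the caller has finished
   * with, free the internal AVL tree, and leave the serializer ready
   * to accept further statements. */
  int raptor_serialize_flush(raptor_serializer *rdf_serializer);

  /* intended usage, once per completed subject (or batch of subjects):
   *   ... raptor_serialize_statement() for all triples of a subject ...
   *   raptor_serialize_flush(serializer);
   */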
For now I have simulated a flush() function by allowing the
serialize_end() function to be called as many times as needed:
- I have added a code block to free and rebuild the AVL tree in
raptor_turtle_emit(), which is called from raptor_serialize_end()
- I have removed the iostream freeing in raptor_serialize_end() (this
might leak - maybe it should be moved to raptor_free_serializer() ),
and removed the context->written_header=0 assignment in
raptor_turtle_serialize_end()
http://code.google.com/p/aff4/source/browse/libraptor/raptor_serialize_turtle.c
So the end result is that I can call raptor_serialize_end() as
frequently as I want without the serializer's state being torn down.
Each time I call it, the serializer flushes more data into the
iostream, which reduces memory consumption (and also means progress is
made in writing the file). I am calling it about every 100 subjects, as
sketched below.
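On the application side the resulting calling pattern looks roughly
like this; serialize_one_subject() is a hypothetical caller-side helper
that emits all statements of one subject via
raptor_serialize_statement(), and the pattern only works against the
patched raptor_serialize_turtle.c described above:

  #include <raptor.h>

  /* hypothetical helper: emits every statement of subject number i */
  void serialize_one_subject(raptor_serializer *serializer, int i);

  /* serialize subject_count subjects, flushing every 100 subjects by
   * calling raptor_serialize_end() on the patched serializer */
  void serialize_all(raptor_serializer *serializer, int subject_count)
  {
    int i;

    for(i = 0; i < subject_count; i++) {
      serialize_one_subject(serializer, i);

      /* flush accumulated statements to the iostream and free the
       * internal AVL tree every 100 subjects */
      if((i + 1) % 100 == 0)
        raptor_serialize_end(serializer);
    }

    /* final flush for any remaining subjects */
    raptor_serialize_end(serializer);
  }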
To give you an idea of the speed improvement: my simple unit test
serialises about 22k objects. Prior to the change it took about 2 minutes
to write the file (during which time nothing at all was written until
the very end). After the change it takes about 20 seconds to do the same,
and the file is written progressively. Memory demand is obviously much
more modest with the new code.
======================================================================
Issue History
Date Modified Username Field Change
======================================================================
2010-02-23 21:46 scudette New Issue
2010-02-23 21:46 scudette Syntax Name => Turtle
======================================================================