[redland-dev] Raptor Turtle parser memory usage

"Martin J. Dürst" duerst at it.aoyama.ac.jp
Tue Jul 10 02:06:54 EDT 2012


Hello Dave,

On 2012/07/10 1:01, Dave Beckett wrote:
> Nick is correct about the serializer but the question was about the turtle
> parser, and it is also valid.
>
> The Raptor turtle (n3, trig) parser relies on flex and bison (aka lex+yacc),
> of which bison:
> a) has to have the entire input in memory in one block in order to parse

This is really the first time I have heard something like this about 
bison. flex definitely doesn't need all its input in memory; it has a 
well-organized buffer mechanism (check for YY_BUFFER_STATE, yyin, 
yy_scan_string, YY_INPUT, ...). Therefore, bison itself can't require 
the whole input to be in memory. There may be an application- or 
implementation-specific reason for having everything in memory in 
raptor, but that would be a different story.

Regards,   Martin.

> b) uses 32 bit unsigned int offsets
>
> So Raptor has to assemble the input in memory (lots of alloc / realloc) and
> end up with a max 2G size.  A 5G file is not going to parse.
>
> I have looked at fixing this several times but writing a streaming lexer
> and parser is damn hard - months of work.  Using ANTLR and other things
> that do the same job looks like it would make things a lot more complex
> (it's C++).  I've also tried looking at sqlite's lemon but it doesn't stream
> so it seems the only road to this is a lot of work.
>
> Dave
>
>
> On 7/9/12 1:30 AM, Nicholas Humfrey wrote:
>> Hello,
>>
>> Yes, the Turtle serialiser puts everything into RAM in order to build a tree of the data and output a nice, pretty file, with all the triples sharing the same subject next to each other.
>>
>> If you output as ntriples, then the output will be much faster and it won't try to load everything into RAM.
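Nick's suggestion can be tried from the command line; a hedged sketch, where the file names are illustrative and `-i`/`-o` are rapper's input- and output-syntax flags:

```shell
# Convert RDF/XML to N-Triples instead of Turtle. N-Triples is emitted
# one triple per line as parsing proceeds, so the serializer does not
# need to build the whole model in RAM first.
rapper -i rdfxml -o ntriples data.rdf > data.nt
```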
>>
>> nick.
>>
>>
>> On 9 Jul 2012, at 02:15, Medha Atre wrote:
>>
>>> Hello,
>>>
>>> I am trying to use the Raptor RDF parser library to parse a very large RDF/XML file of the LUBM dataset (synthetically generated) and convert it into a Turtle representation. The gzipped RDF/XML file itself is 5.1 GB (I am feeding its contents through a fifo, and "rapper" reads from this fifo).
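The fifo setup described above can be reduced to a plain pipe; a sketch under the assumption that rapper accepts `-` for standard input (file names are illustrative):

```shell
# Decompress on the fly and stream straight into rapper, avoiding both
# a named fifo and a 5+ GB decompressed temporary file on disk.
gunzip -c lubm.rdf.gz | rapper -i rdfxml -o ntriples - > lubm.nt
```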
>>>
>>> When I run "rapper" command to convert RDF/XML into Turtle on this file, the memory utilization shoots up very high (it consumes almost all of my RAM leaving me unable to do anything else on the computer).
>>>
>>> I was wondering if there is any option to restrict the memory used by "rapper" tool? I checked "configure" and "rapper --help", but didn't find any such option.
>>>
>>> Can someone please let me know what the best and easiest workaround for this is?
>>>
>>> Thanks.
>>>
>>> Medha
>>>
>>> _______________________________________________
>>> redland-dev mailing list
>>> redland-dev at lists.librdf.org
>>> http://lists.librdf.org/mailman/listinfo/redland-dev
>>
>

