[redland-dev] Redland scalability
Bruno Barberi Gnecco
brunobg at gmail.com
Wed Aug 20 23:48:59 BST 2008
carmen r wrote:
>> Hi,
>>
>> Has anyone here used Redland with a large number of triples (>10 million)?
>> How does it scale?
>
> Once or twice I loaded some small DBpedia sets in that range.
>
> I can't really say, as Rasqal has seen some commits since then.
Was it too bad back then? Did you use a MySQL backend?
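For reference, here's a minimal sketch of opening a MySQL-backed model and
bulk-loading it through the C API; the connection options are placeholders, and
wrapping the load in one transaction is what I understand matters most for load
speed on that store:

    /* build: gcc load.c $(redland-config --cflags --libs) */
    #include <redland.h>

    int main(void)
    {
      librdf_world *world = librdf_new_world();
      librdf_world_open(world);

      /* new='yes' creates the tables on first use; drop it afterwards.
         Host/database/user/password are placeholders. */
      librdf_storage *storage = librdf_new_storage(world, "mysql", "bench",
          "new='yes',host='localhost',database='redland',"
          "user='rdf',password='secret'");
      librdf_model *model = librdf_new_model(world, storage, NULL);

      /* Load everything inside one transaction instead of paying for
         a commit per triple. */
      librdf_parser *parser = librdf_new_parser(world, "ntriples", NULL, NULL);
      librdf_uri *uri = librdf_new_uri(world,
          (const unsigned char *)"file:dump.nt");
      librdf_model_transaction_start(model);
      librdf_parser_parse_into_model(parser, uri, NULL, model);
      librdf_model_transaction_commit(model);

      librdf_free_uri(uri);
      librdf_free_parser(parser);
      librdf_free_model(model);
      librdf_free_storage(storage);
      librdf_free_world(world);
      return 0;
    }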
> Perhaps dajobe can comment on the data loads Y! has put Redland through?
It would be wonderful...
> Proper indexing is essential at that size, and it did get performance closer to
> Virtuoso's (which usually still won).
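In case it helps anyone following along: adding indexes is plain DDL against the
storage database, e.g. via the MySQL C API. The table and column names below are
only my guess at the store's schema for one model; check SHOW TABLES and adjust
before running anything:

    /* build: gcc index.c $(mysql_config --cflags --libs) */
    #include <mysql.h>
    #include <stdio.h>

    int main(void)
    {
      MYSQL *db = mysql_init(NULL);
      if (!mysql_real_connect(db, "localhost", "rdf", "secret",
                              "redland", 0, NULL, 0)) {
        fprintf(stderr, "connect: %s\n", mysql_error(db));
        return 1;
      }

      /* Covering indexes for the common access patterns (s p ?) and
         (? p o); "Statements" is a hypothetical table name. */
      const char *ddl[] = {
        "CREATE INDEX sp ON Statements (Subject, Predicate)",
        "CREATE INDEX po ON Statements (Predicate, Object)",
      };
      for (unsigned i = 0; i < 2; i++)
        if (mysql_query(db, ddl[i]))
          fprintf(stderr, "%s: %s\n", ddl[i], mysql_error(db));

      mysql_close(db);
      return 0;
    }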
> Be aware that Virtuoso is a gigantic monolithic beast with XML processing stuff,
> a SQL DB, etc.
I've seen the benchmarks. Virtuoso seems to support DBpedia well, but I
couldn't find details on the hardware they're using.
> On that note, I think the best route to more scalable semweb stuff is more
> modularity:
>
> e.g., access to the internal Rasqal set-intersection machinery, so one can do
> offline (ahead-of-time) aggregations hinted by usage patterns,
>
> or overload functions via some class-inheritance/super() technique and provide
> optimized SQL, etc.
I agree, but I'm worried that a MySQL backend may start to choke somewhere
before 100 million triples. Besides, to do that sort of tweaking it'd be best to
create a more complex SQL schema, partitioning data by type and hashes, etc. I think
it could work very well, but it's a project more complex than I can handle now.
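That said, the ahead-of-time aggregation idea quoted above can be approximated
today without touching Rasqal internals: run the expensive query once, offline,
and dump the bindings somewhere cheap to read at request time. A rough sketch
(the query string and file names are placeholders):

    #include <redland.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Run one SPARQL query and write the bindings, tab-separated, to a
       file the application can serve without touching the store.
       Usage: dump_query(world, model, "SELECT ...", "agg.tsv"); */
    static void dump_query(librdf_world *world, librdf_model *model,
                           const char *sparql, const char *outfile)
    {
      librdf_query *q = librdf_new_query(world, "sparql", NULL,
                                         (const unsigned char *)sparql, NULL);
      librdf_query_results *res = librdf_model_query_execute(model, q);
      FILE *out = fopen(outfile, "w");

      while (res && out && !librdf_query_results_finished(res)) {
        int n = librdf_query_results_get_bindings_count(res);
        for (int i = 0; i < n; i++) {
          librdf_node *v = librdf_query_results_get_binding_value(res, i);
          if (v) {
            unsigned char *s = librdf_node_to_string(v);
            fputs((const char *)s, out);
            free(s);  /* caller owns the formatted string */
            librdf_free_node(v);
          }
          fputc(i + 1 < n ? '\t' : '\n', out);
        }
        librdf_query_results_next(res);
      }

      if (out)
        fclose(out);
      if (res)
        librdf_free_query_results(res);
      librdf_free_query(q);
    }

Rebuilding such dumps periodically is essentially the cache idea I come back to
below, so the open question is still how long each query takes at that scale.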
> Currently I've switched to an FS-based store to support a flexible
> optimization/aggregation strategy and remove the 'black box beast' components
> from the system. If god took away my FS, I'd definitely look at Redland again,
> before anything else.
>
> Hope that helps.
Thanks, that helps a lot. I was considering moving most of the data to the
FS, and it's good to know that it works. May I ask how you're doing that?
Are you using Redland's file storage?
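For what it's worth, the persistent on-disk option I know of in Redland itself
is the 'hashes' storage over Berkeley DB files; the plain 'file' storage
re-parses the whole serialization on every open. Opening one looks like this
(the directory is a placeholder):

    #include <redland.h>

    int main(void)
    {
      librdf_world *world = librdf_new_world();
      librdf_world_open(world);

      /* Creates dataset-sp2o.db, dataset-po2s.db and dataset-so2p.db
         under dir; contexts='yes' enables named graphs. */
      librdf_storage *storage = librdf_new_storage(world, "hashes", "dataset",
          "hash-type='bdb',dir='/var/rdf',contexts='yes'");
      librdf_model *model = librdf_new_model(world, storage, NULL);

      /* ... load and query as usual ... */

      librdf_free_model(model);
      librdf_free_storage(storage);
      librdf_free_world(world);
      return 0;
    }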
I was considering partitioning the data by subject, which would solve most
of my needs, but I'd lose the ability to run SPARQL over the whole set. I have
also considered caching the aggregations and query results, but to do that I
still need to know how well the store scales: if queries take hours to run,
there's no way I can build the cache and keep it up to date.
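To make the partitioning idea concrete, something like this is what I have in
mind: hash the subject URI to pick one of N shard stores (shard count, names
and paths are all made up). The cost is exactly the one above: each query then
runs against a single shard rather than the union:

    #include <redland.h>
    #include <stdio.h>

    #define NSHARDS 16

    /* FNV-1a hash of the subject URI picks a shard. */
    static unsigned shard_for_subject(const char *subject_uri)
    {
      unsigned h = 2166136261u;
      for (const char *p = subject_uri; *p; p++) {
        h ^= (unsigned char)*p;
        h *= 16777619u;
      }
      return h % NSHARDS;
    }

    /* Open all shard models up front, one BDB-backed store each. */
    static void open_shards(librdf_world *world, librdf_model *models[NSHARDS])
    {
      for (unsigned i = 0; i < NSHARDS; i++) {
        char name[32];
        snprintf(name, sizeof name, "shard%02u", i);
        librdf_storage *s = librdf_new_storage(world, "hashes", name,
            "hash-type='bdb',dir='/var/rdf'");
        models[i] = librdf_new_model(world, s, NULL);
      }
    }

    /* Route a statement to its shard by subject. */
    static void add_partitioned(librdf_model *models[NSHARDS],
                                librdf_statement *st)
    {
      librdf_node *subj = librdf_statement_get_subject(st);
      librdf_uri *uri = librdf_node_get_uri(subj);
      if (!uri)
        return;  /* blank-node subjects need a rule of their own */
      unsigned shard = shard_for_subject(
          (const char *)librdf_uri_as_string(uri));
      librdf_model_add_statement(models[shard], st);
    }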
Thanks a lot for your reply!
--
Bruno Barberi Gnecco <brunobg_at_users.sourceforge.net>
My only love sprung from my only hate!
Too early seen unknown, and known too late!
-- William Shakespeare, "Romeo and Juliet"