[redland-dev] Redland scalability

Bruno Barberi Gnecco brunobg at gmail.com
Wed Aug 20 23:48:59 BST 2008


carmen r wrote:
>>Hi,
>>
>>	Has anyone here used Redland with a large number of triples (>10 million)?
>>How does it scale?
>
> once or twice i loaded some small dbpedia sets in that range
> 
> i cant really say, as rasqal has seen some commits since then.

	Was it too bad then? Did you use a MySQL back end?

> perhaps dajobe can comment on the dataloads Y! has put redland through?

	It would be wonderful...

> proper indexing is essential at that size, and it did get performance closer to Virtuoso's
> (which usually still won)
> be aware Virtuoso is a gigantic monolithic beast with XML processing stuff, a SQL DB,
> etc..

	I've seen the benchmarks. Virtuoso seems to support DBpedia well, but I
couldn't find details on what hardware they're running.

> on that note, i think the best solution towards more scalable semweb stuff is more
> modularity
> 
> eg, access to internal rasqal set-intersection stuff so one can do offline (ahead of
> time) aggregations hinted on use patterns
> 
> or overload functions using some class-inheritance/super() technique and provide
> optimized SQL, etc
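
The overriding idea above could be sketched roughly as follows in Python. This is a hypothetical illustration, not Redland's actual API: the class and method names are invented, and the "optimized" path here is just an in-memory subject index standing in for backend-specific SQL.

```python
class TripleStore:
    """Generic in-memory store; query() scans every triple."""
    def __init__(self, triples):
        self.triples = list(triples)

    def query(self, subject=None):
        # generic (slow) path: linear scan, filtering by subject if given
        return [t for t in self.triples if subject is None or t[0] == subject]

class IndexedTripleStore(TripleStore):
    """Overrides query() with an index lookup, falling back via super()."""
    def __init__(self, triples):
        super().__init__(triples)
        self.by_subject = {}
        for t in self.triples:
            self.by_subject.setdefault(t[0], []).append(t)

    def query(self, subject=None):
        if subject is not None:
            return self.by_subject.get(subject, [])  # optimized path
        return super().query(subject)  # generic fallback for full scans
```

The point is the shape: subclasses supply an optimized path where they can, and delegate to the generic implementation everywhere else.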

	I agree, but I'm worried that a MySQL backend may start to choke somewhere
before 100 million triples. Besides, to do that sort of tweaking it'd be best to
create a more complex SQL schema, partitioning data by type and by hashes, etc. I
think it could work very well, but it's a more complex project than I can handle
right now.
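
For what it's worth, the hash-partitioning part of that scheme is simple to sketch. A minimal stdlib example, assuming subjects are URI strings and partitions map to tables or files (the function name and partition count are made up):

```python
import hashlib

def partition_for(subject, n_partitions=16):
    """Map a subject URI to a partition number by hashing it.

    md5 is used only for its stable, uniform spread, not for security;
    the same subject always lands in the same partition.
    """
    h = hashlib.md5(subject.encode("utf-8")).hexdigest()
    return int(h, 16) % n_partitions
```

Queries bound to a known subject then touch only one partition; the hard part, as noted above, is everything else (cross-partition joins, SPARQL over the whole set).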

> currently i switched to a FS based store to support flexible optimization/aggregation
> strategy and remove the 'black box beast' components from the system. if god took away
> my FS, id definitely look at redland again, before anything else
> 
> hope that helps

	Thanks, that helps a lot. I was considering moving most of the data to
the FS, and it's good to know that it works. May I ask how you're doing that?
Are you using Redland's file storage?

	I was considering partitioning the data by subject, which would solve most
of my needs, but I'd lose the ability to run SPARQL over the whole set. I have
also considered caching the aggregations and query results, but in order to do
that I still need to know how well it will scale: if queries take hours to run,
there's no way I can build the cache and keep it up to date.
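
The caching idea only works if repopulating after an invalidation is cheap, which is exactly the scaling question. A minimal sketch of the scheme (names hypothetical; `run_query` stands in for whatever actually executes the SPARQL):

```python
class QueryCache:
    """Cache query results keyed by query string; cleared on writes."""
    def __init__(self, run_query):
        self.run_query = run_query  # function: query string -> results
        self.cache = {}

    def query(self, q):
        # return the cached result, computing it on first use
        if q not in self.cache:
            self.cache[q] = self.run_query(q)
        return self.cache[q]

    def invalidate(self):
        # called after the triples change; viable only if run_query is
        # fast enough that the cache can be rebuilt between updates
        self.cache.clear()
```

If individual queries take hours, `invalidate()` effectively throws away hours of work on every update, which is the concern above.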

	Thanks a lot for your reply!

-- 
Bruno Barberi Gnecco <brunobg_at_users.sourceforge.net>
My only love sprung from my only hate!
Too early seen unknown, and known too late!
		-- William Shakespeare, "Romeo and Juliet"

