[redland-dev] Creating additional storage hashes

Thu May 15 19:27:11 BST 2003

On Thu, 15 May 2003 10:48:55 -0600
Jason Johnston <redland at lojjic.net> wrote:

> I'm writing an application on top of Redland which often makes requests 
> for *all* statements about a given resource, i.e. the Subject node is 
> the only known node.  Currently the only way I can find to do this (in 
> Perl) is $model->find_statements($subjectNode,null,null).  As I 
> understand it, since the BDB hashes are indexed for when only *one* node 
> is unknown, Redland has to traverse the entire database and do a 
> comparison on each and every Statement to see if it matches.  Needless 
> to say, this gets very slow with large storages.

Currently, yes they index for one unknown by default.

> The arcs_out C method would seem to be what I need, but if I follow the 
> source correctly it just ends up doing the same thing anyway, because of 
> the way the hashes are indexed.
> 
> So what I need is a s2po indexed BDB hash, but I have no idea how to go 
> about creating such a thing.
> 
> I found the following passage in the section on Storage Modules (6.2) in 
> the article "The Design and Implementation of the Redland RDF 
> Application Framework" (http://www10.org/cdrom/papers/490/):
> 
> > There are other potentially useful hashes that might be maintained
> > including incoming and outgoing arcs indexed for particular nodes.
> > These choices might be suitable for an option on the storage hash or
> > for user configuration of which statement parts are indexed. The
> > current hash storage module has hooks for such a facility but no
> > current interface to it.
> 
> I also remember reading something about enabling a p2so index as a 
> configure option a while back, but can't find it now.

Yeah, it is documented ... somewhere ...  I never could find a good
place to put descriptions of the language-independent stuff.  So
here's the perl:

  index-predicates='yes' (hashes storage only)
      Enable indexing from predicates to (subject,object) which can
      in particular be useful for rdf:type relations.
  -- http://www.redland.opensource.ac.uk/docs/pod/RDF/Redland/Storage.html

> So... is there a simple way to create a s2po hash?  And, if not, is 
> there a difficult way to create it? ;-)  I'm not opposed to modifying 
> the C source to accomplish what I need.

Oh yes :)

Let me enter tutorial mode.

        Adding Extra Indexing to Redland Hashed (Indexed) Storage

The rdf_storage_hashes is now entirely dynamically constructed but
the current state is made from a table in the C, moving towards when
it can be fully dynamic via the storage parameters above.

So let's delve into the source at librdf/rdf_storage_hashes.c

  Redland Hashed (Indexed) Storage Module
  CVS http://cvs.ilrt.org/cvsweb/redland/librdf/librdf/rdf_storage_hashes.c

At the very top of the file is something like:

----------------------------------------------------------------------
static const librdf_hash_descriptor librdf_storage_hashes_descriptions[]= {
  {"sp2o",
   LIBRDF_STATEMENT_SUBJECT|LIBRDF_STATEMENT_PREDICATE,
   LIBRDF_STATEMENT_OBJECT},  /* For 'get targets' */
  {"po2s",
   LIBRDF_STATEMENT_PREDICATE|LIBRDF_STATEMENT_OBJECT,
   LIBRDF_STATEMENT_SUBJECT},  /* For 'get sources' */
  {"so2p", 
   LIBRDF_STATEMENT_SUBJECT|LIBRDF_STATEMENT_OBJECT,
   LIBRDF_STATEMENT_PREDICATE},  /* For 'get arcs' */
  {"p2so", 
   LIBRDF_STATEMENT_PREDICATE,
   LIBRDF_STATEMENT_SUBJECT|LIBRDF_STATEMENT_OBJECT},  /* For '(?, p, ?)' */
  {"contexts",
   0L, /* for contexts - do not touch when storing statements! */
   0L},
  {NULL,0L,0L}
};
----------------------------------------------------------------------

which maps directly to the storage interface and then model methods:
  librdf_model_get_targets
  librdf_model_get_sources
  librdf_model_get_arcs

They also provide all the other store functionality via the base
storage class implementation code in rdf_storage.c or the
model class rdf_model.c

So, if you want to add a new index, this is the place to do it. Here,
you should add a new one after so2p giving the fields you want to
index (any combination is allowed), like this possibly:

----------------------------------------------------------------------
  {"s2po", 
   LIBRDF_STATEMENT_SUBJECT,
   LIBRDF_STATEMENT_PREDICATE|LIBRDF_STATEMENT_OBJECT},  /* For 'arcs_out' */
----------------------------------------------------------------------

Then you need to add this to the code by looking for the magic number
'3' in the code below, change it to 4.  (This is the stuff that will
change to be truly dynamic).  A quick glance shows me stuff like:

----------------------------------------------------------------------
  ...
  /* Work out the number of hashes for allocating stuff below */
  hash_count=3;

  ...

  for(i=0; i<3; i++) {
      status=librdf_storage_hashes_register(storage, name,
                                            &librdf_storage_hashes_descriptions[i]);
----------------------------------------------------------------------

and if you notice just after that, the optional 4th (p2so) and 5th
(contexts) indexes are handled.

So that will cover all the hash creation, indexing and destroying.

Really!

The interfaces to use this extra indexing could work in two ways.
If you want all the statements then that can only be done
by the find_statements model method:
  stream of statements:=model.find_statements(subject_node, NULL, NULL)
more of which below.

If you just want the properties, then that is the arcs_out model
method as you mentioned:
  iterator of nodes:=model.arcs_out(subject_node)

So considering implementing the latter.

The first step is to tell the storage factory implementation that
there is new code providing this.  If you look at
librdf_storage_hashes_register_factory you will see there is no
factory implementation registered for the storage factory method

----------------------------------------------------------------------
  librdf_iterator* (*get_arcs_out)(librdf_storage *storage, librdf_node *node);
----------------------------------------------------------------------
  (defined in rdf_storage.h as struct librdf_storage_factory_s)

so that means it is implemented by the base storage class
code in rdf-storage.c - see librdf_storage_get_arcs_out there.

So you need to create a new method librdf_storage_hashes_get_arcs_out
with the signature above and register it in
librdf_storage_hashes_register_factory:
----------------------------------------------------------------------
  factory->get_arcs_out          = librdf_storage_hashes_get_arcs_out;
----------------------------------------------------------------------

And now onwards to implementing it.

This will be very similar to the code for
  librdf_storage_hashes_find_sources (and _arcs, _targets) 
but you will need to pick the right hash index (in this case, number
3) which you might consider calculating when it is registered, as is
done for the other methods (such as sources_index) and storing in
something like arcs_out_index rather than add another hard-coded
integer.

If you add this, you will end up with something like this (UNTESTED)

----------------------------------------------------------------------
static librdf_iterator*
librdf_storage_hashes_get_arcs_out(librdf_storage* storage, 
                                   librdf_node* source) 
{
  librdf_storage_hashes_context *scontext=(librdf_storage_hashes_context*)storage->context;
  return librdf_storage_hashes_node_iterator_create(storage, source, NULL,
                                                    scontext->arcs_out_index,
                                                    LIBRDF_STATEMENT_PREDICATE);
}
----------------------------------------------------------------------

At this point, it is clear that although the index is s2po, only the
'p' is used, not the full statement - so the index could happily be
s2p. 

Which isn't quite what you wanted probably

So lets assume the above doesn't work for you.  Ignore generating
that new method (keep the indexing of course) and lets consider how
to improve the model method find_statements (which calls the
storage method find_statements).

At this point you'll have to look at rdf_storage.c method
librdf_storage_find_statements where it optimises certain queries to
use the indexes if they are available.  You will need to use similar
code but moved to librdf_storage_hashes_find_statements to recognise
the (s, ?, ?) pattern inside the storage hashes implementation only.

Happily there is an existing use of similar code right
there in librdf_storage_hashes_find_statements
used for Redland contexts  that is close to what you need.
You need to modify it to

1) recognise (s, ? ?) with a new if clause
2) serialize the s2po index (rather than p2so)
3) return the predicate and object (rather than 
   subject and object)

and you are pretty much done.  The
librdf_storage_hashes_serialise_common helper function will do all
the rest.

So that's pretty much all of the code that I think is needed to
implement it.

The APIs above won't change, but more indexing does mean things run a
bit slower.  You might have worked out that you can *remove* indexes
from the list above and let the higher levels work it out (in future
this will be configurable via storage parameters).  This works fine,
and indeed rdf_storage_list.c implements none of the
arcs/sources/targets and lets the base class rdf_storage.c deal with
the consequences.

All of this is untested of course :)

Dave