Re: [RT] Moving Piggy Bank forward... from Jeen Broekstra on 2005-07-22 (stdin)

From: Jeen Broekstra <jeen_at_aduna.biz>
Date: Fri, 22 Jul 2005 14:42:09 +0200

Stefano Mazzocchi wrote:

[snip]

> 3) use the Sesame Memory store instead of the native one (which
> supports RDFSchema entailment and hopefully OWL-tiny with some
> custom rules)
>
> [Jeen, can you tell us more about that?]

Sure, what would you like to know? :)

Here is some trivia: Sesame's in-memory store uses an internal Java
object model for nodes (URIs, bNodes, literals) and statements. There
are three main hashmap indexes: one for URIs, one for bNodes, and one
for literals. Each RDF node links to the lists of statements in which
it is used as subject, predicate, or object. URIs share the
representation of their namespace to minimize memory consumption for
strings. The per-node statement lists make querying very fast,
especially for triple patterns in which at least one position is
fixed.
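
To make the layout above a bit more concrete, here is a minimal sketch in
Java. It is not Sesame's actual code: the class, field, and method names
are all made up for illustration, and it only models URIs (bNodes and
literals would get their own maps in the same way).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the described layout; none of these names are Sesame's.
final class MemStoreSketch {

    /** A URI node that knows every statement it takes part in. */
    static final class UriNode {
        final String namespace;  // shared String instance, see uri()
        final String localName;
        // One statement list per position the node can occupy.
        final List<Statement> asSubject = new ArrayList<>();
        final List<Statement> asPredicate = new ArrayList<>();
        final List<Statement> asObject = new ArrayList<>();

        UriNode(String namespace, String localName) {
            this.namespace = namespace;
            this.localName = localName;
        }
    }

    static final class Statement {
        final UriNode subject, predicate, object;

        Statement(UriNode subject, UriNode predicate, UriNode object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }
    }

    private final Map<String, String> namespaces = new HashMap<>();
    private final Map<String, UriNode> uris = new HashMap<>();

    /** Returns the unique node for a URI, reusing one String per namespace. */
    UriNode uri(String namespace, String localName) {
        String shared = namespaces.computeIfAbsent(namespace, ns -> ns);
        return uris.computeIfAbsent(shared + localName,
                key -> new UriNode(shared, localName));
    }

    /** Adds a statement and registers it with each of its three nodes. */
    Statement add(UriNode s, UriNode p, UriNode o) {
        Statement st = new Statement(s, p, o);
        s.asSubject.add(st);
        p.asPredicate.add(st);
        o.asObject.add(st);
        return st;
    }

    /** Matches the pattern (s, p, ?o) by scanning only the subject's own list. */
    List<UriNode> objects(UriNode s, UriNode p) {
        List<UriNode> result = new ArrayList<>();
        for (Statement st : s.asSubject) {
            if (st.predicate == p) {
                result.add(st.object);
            }
        }
        return result;
    }
}
```

The point of the sketch is that a pattern with a fixed subject never
touches statements that do not mention that subject.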

Unfortunately, there is currently no production-ready custom
inferencer for the in-memory store (the available custom inferencer
only operates on MySQL databases). There is some raw code on a new
custom inferencer that uses SeRQL queries as the rule format, but we
are short of hands to make it work properly. An alternative is
perhaps Ontotext's OWLIM package, which is an adapted in-memory store
that can do simple OWL-Lite entailment.

> PROs:
>
> - we get basic inferencing on equivalences and subclassing

As an aside: basic inferencing is still on the ToDo list for the
native store. We are also awaiting a number of third-party
contributions which will hopefully improve native store performance
significantly (better indexing). I'll keep you informed on progress if
you want.

> - it's considerably faster
>
> CONs:
>
> - memory consumption grows linearly with the amount of data stored
> (or worse?)

About linear. Roughly 170 bytes per triple (this is an average
observed on a ~30 million triple memory store, on a 64-bit machine, so
it will probably be a bit less on a regular 32-bit architecture).

Note that this includes inferred triples: a 1000-triple document may
result in 2000 actual triples in the store. The ratio depends on your
ontology, of course; for simple schemas we usually find that the
number of inferred triples is 30-60% of the original number of
triples.
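
As a back-of-the-envelope aid, the figures above can be turned into a
small estimate. This is only a sketch: the 170 bytes per triple is the
64-bit average mentioned above, the inferred ratio is something you
would have to measure for your own schema, and the class and method
names are hypothetical.

```java
// Hypothetical helper; the constant is the observed average quoted above.
final class MemoryEstimate {
    static final long BYTES_PER_TRIPLE = 170;

    /**
     * Estimates memory use in bytes for a number of explicit triples plus
     * the inferred triples they give rise to.
     *
     * @param explicitTriples number of triples in the source data
     * @param inferredRatio   inferred triples as a fraction of explicit ones
     *                        (for example 0.3 to 0.6 for simple schemas)
     */
    static long estimateBytes(long explicitTriples, double inferredRatio) {
        long inferred = Math.round(explicitTriples * inferredRatio);
        return (explicitTriples + inferred) * BYTES_PER_TRIPLE;
    }
}
```

For example, one million explicit triples with a 50% inferred ratio come
out at 255,000,000 bytes, so roughly a quarter of a gigabyte.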

> - data is saved on disk *only* after regular shutdown. In case of
> system collapse there is data loss. (Jeen, is there a workaround
> for this problem? like saving the new RDF right away before
> returning)

This is actually no longer true. In the newer versions of Sesame, data
is saved to disk immediately after each commit by default, and the
behavior is configurable. Quoting from the configuration manual
(http://www.openrdf.org/doc/sesame/users/ch04.html#d0e651):

  The 'syncDelay' parameter specifies the time (in milliseconds) to
  wait after a transaction was committed before writing the changed data
  to file. Setting this variable to '0' (the default value) will force
  a file sync immediately after each commit. A negative value will
  deactivate file synchronization until the Sail is shut down. A
  positive value will postpone the synchronization for at least that
  amount of milliseconds. If in the meantime a new transaction is
  started, the file synchronization will be rescheduled to wait for
  another syncDelay ms. This way, bursts of transaction events can be
  combined in one file sync, improving performance.

You can also explicitly force a disk sync if you want, by invoking the
RdfRepository.sync() method.
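
To illustrate the three syncDelay cases from the manual text, here is a
minimal sketch of such a delayed, rescheduling sync. It is not how
Sesame implements it: the DelayedSync class and its methods are
invented for this example, the actual write to disk is just a Runnable
you pass in, and for simplicity the timer is reset on each commit
rather than when a new transaction starts.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the syncDelay behavior; not Sesame code.
final class DelayedSync {
    private final long syncDelayMs;
    private final Runnable syncToDisk;  // whatever actually writes the file
    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor();
    private ScheduledFuture<?> pending;

    DelayedSync(long syncDelayMs, Runnable syncToDisk) {
        this.syncDelayMs = syncDelayMs;
        this.syncToDisk = syncToDisk;
    }

    /** Call after each commit. */
    synchronized void committed() {
        if (syncDelayMs < 0) {
            return;                 // negative: no sync until shutdown
        }
        if (syncDelayMs == 0) {
            syncToDisk.run();       // zero: sync immediately
            return;
        }
        if (pending != null) {
            pending.cancel(false);  // positive: push the pending sync back
        }
        pending = timer.schedule(syncToDisk, syncDelayMs, TimeUnit.MILLISECONDS);
    }

    /** Call on shutdown: drop any pending sync and write once, now. */
    synchronized void shutdown() {
        if (pending != null) {
            pending.cancel(false);
        }
        timer.shutdown();
        syncToDisk.run();
    }
}
```

With a positive delay, a burst of commits keeps pushing the write back,
so the whole burst ends up in a single file sync.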

HTH.

Jeen

-- 
Jeen Broekstra          Aduna BV
Knowledge Engineer      Julianaplein 14b, 3817 CS Amersfoort
http://aduna.biz        The Netherlands
tel. +31 33 46599877


Received on Fri Jul 22 2005 - 12:40:41 EDT
