Re: [RT] Moving Piggy Bank forward... from Stefano Mazzocchi on 2005-07-23

From: Stefano Mazzocchi <stefanom_at_mit.edu>
Date: Sat, 23 Jul 2005 21:44:25 -0400

Jeen Broekstra wrote:
> Stefano Mazzocchi wrote:
>
> [snip]
>
>> 3) use the Sesame Memory store instead of the native one (which
>> supports RDFSchema entailment and hopefully OWL-tiny with some
>> custom rules)
>>
>> [Jeen, can you tell us more about that?]
>
>
> Sure, what would you like to know? :)

Everything :-) How about a nice RDF-dump of your brain about Sesame and
friends ;-)

> Here are some trivia: Sesame's in-memory store uses an internal Java
> object model for remembering nodes (URIs, bNodes, literals) and
> statements. There are three main hashmap indexes, one for URIs, one
> for bNodes, one for literals. Each RDF node links to a list of
> statements in which it is used as a subject/predicate/object. URIs use
> shared representations for their namespace to minimize memory
> consumption for strings. This makes querying, especially on triple
> patterns for which at least one variable is fixed, very fast.

Ok (this is very similar to what we do in Welkin, btw, maybe we should
think about ditching that and using the Sesame memory store directly, hmmmm...)
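
To make the layout Jeen describes concrete, here is a minimal sketch in
Python. It is not Sesame's code: the names Node, MemoryStore, add and
match are hypothetical, terms are passed as (kind, value) pairs, and the
shared namespace representation is left out. It only illustrates one
hash index per node kind, with each node linking to the statements that
use it.

```python
class Node:
    """An RDF node (URI, bNode or literal) that links to the statements using it."""

    def __init__(self, kind, value):
        self.kind = kind
        self.value = value
        self.as_subject = []
        self.as_predicate = []
        self.as_object = []


class MemoryStore:
    """Hypothetical in-memory triple store, for illustration only."""

    def __init__(self):
        # One hash index per node kind, keyed by the node's lexical value.
        self.nodes = {"uri": {}, "bnode": {}, "literal": {}}

    def node(self, kind, value):
        index = self.nodes[kind]
        if value not in index:
            index[value] = Node(kind, value)
        return index[value]

    def add(self, s, p, o):
        """Add a statement; s, p and o are (kind, value) pairs."""
        subj, pred, obj = self.node(*s), self.node(*p), self.node(*o)
        stmt = (subj, pred, obj)
        subj.as_subject.append(stmt)
        pred.as_predicate.append(stmt)
        obj.as_object.append(stmt)
        return stmt

    def match(self, s=None, p=None, o=None):
        """Return statements matching a pattern; None acts as a wildcard."""
        pattern, lists = [], []
        attrs = ("as_subject", "as_predicate", "as_object")
        for term, attr in zip((s, p, o), attrs):
            if term is None:
                pattern.append(None)
                continue
            node = self.nodes[term[0]].get(term[1])
            if node is None:
                return []  # the term was never stored, so nothing can match
            pattern.append(node)
            lists.append(getattr(node, attr))
        if not lists:
            raise ValueError("this sketch needs at least one fixed term")
        # Scan only the shortest statement list among the fixed terms.
        candidates = min(lists, key=len)
        return [st for st in candidates
                if all(want is None or want is got
                       for want, got in zip(pattern, st))]
```

With at least one fixed term, a lookup touches only the statements
attached to that node rather than the whole store, which would explain
why such patterns are fast.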

> Unfortunately, there is currently no production-ready custom
> inferencer for the in-memory store (the available custom inferencer
> only operates on MySQL databases). There is some raw code on a new
> custom inferencer that uses SeRQL queries as the rule format, but we
> are short a number of hands to make that thing work properly. An
> alternative is perhaps Ontotext's OWLIM package, which is an adapted
> in-memory store that can do simple OWL-Lite entailment.

If you can point us to the code, maybe we can supply those needed hands.

We *really* need to perform basic OWL-Lite (or even tiny?) entailment.

>> PROs:
>>
>> - we get basic inferencing on equivalences and subclassing
>
>
> As an aside: basic inferencing is still on the ToDo list for the
> native store. We are also awaiting a number of third party
> contributions which will hopefully significantly improve native store
> performance (better indexing). I'll keep you informed on progress if
> you want.

Please do. We want to be more active in helping shape this space in the
future, but we are unsure what to do or how to do it. More collaboration
would be great for us, and yes, we are not just those who demand a fix:
we do it ourselves if we know we are not stepping on other people's
toes.

Ah, and I hate web forums with a passion ;-)

>> - it's considerably faster
>>
>> CONs:
>>
>> - memory consumption grows linearly with the amount of data stored
>> (or worse?)
>
>
> About linear. Roughly 170 bytes per triple (this is an average
> observed on a ~30 million triple memory store, on a 64-bit machine, so
> it will probably be a bit less on a regular 32-bit architecture).

Awesome.

> Note that this includes inferred triples: a 1000 triple document may
> result in 2000 actual triples in the store (the ratio depends on your
> ontology of course, we usually find that for simple schemas the number
> of inferred triples is 30-60% of the original number of triples).

This is aligned with our experience too.
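
For a rough sense of scale, the two figures above can be combined into a
back-of-envelope estimate. This is only a sketch: the function name and
parameters are made up for illustration, and the 170 bytes per triple is
the 64-bit average Jeen quotes, not a guarantee.

```python
BYTES_PER_TRIPLE = 170  # average quoted in this thread for a 64-bit machine


def estimate_store_bytes(explicit_triples, inferred_ratio=0.0,
                         bytes_per_triple=BYTES_PER_TRIPLE):
    """Estimate memory use, counting inferred triples as stored triples.

    inferred_ratio is the number of inferred triples divided by the number
    of explicit ones (0.3 to 0.6 for the simple schemas mentioned above).
    """
    inferred = round(explicit_triples * inferred_ratio)
    return (explicit_triples + inferred) * bytes_per_triple


# A store holding 30 million triples in total:
print(estimate_store_bytes(30_000_000))  # 5100000000 bytes, about 5.1 GB
```

So a store of that size needs roughly 5 GB of heap on a 64-bit machine,
and a data set with a 50% inference ratio costs about 1.5 times what its
explicit triples alone would.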

>> - data is saved on disk *only* after regular shutdown. In case of
>> system collapse there is data loss. (Jeen, is there a workaround
>> for this problem? like saving the new RDF right away before
>> returning)
>
>
> This is actually no longer true. In the newer versions of Sesame, data
> is saved to disk immediately after each commit by default, and the
> behavior is configurable. Quoting from the configuration manual
> (http://www.openrdf.org/doc/sesame/users/ch04.html#d0e651):
>
> The 'syncDelay' parameter specifies the time (in milliseconds) to
> wait after a transaction was committed before writing the changed data
> to file. Setting this variable to '0' (the default value) will force
> a file sync immediately after each commit. A negative value will
> deactivate file synchronization until the Sail is shut down. A
> positive value will postpone the synchronization for at least that
> amount of milliseconds. If in the meantime a new transaction is
> started, the file synchronization will be rescheduled to wait for
> another syncDelay ms. This way, bursts of transaction events can be
> combined in one file sync, improving performance.

Awesome!!!
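
My reading of the quoted syncDelay semantics, as a small simulation. It
is not Sesame code: sync_times is a hypothetical helper, it treats each
commit time as the event that reschedules the pending sync (the manual
speaks of a new transaction being started), and it assumes the Sail
stays up long enough for the last timer to fire.

```python
def sync_times(commit_times_ms, sync_delay_ms):
    """Return the times (in ms) at which a file sync fires before shutdown."""
    commits = sorted(commit_times_ms)
    if sync_delay_ms < 0:
        return []        # negative: no sync until the Sail is shut down
    if sync_delay_ms == 0:
        return commits   # zero: sync immediately after each commit
    syncs = []
    pending = None       # time at which the currently scheduled sync fires
    for t in commits:
        if pending is not None and pending <= t:
            syncs.append(pending)    # timer fired before this commit arrived
        pending = t + sync_delay_ms  # schedule, or reschedule, the sync
    if pending is not None:
        syncs.append(pending)
    return syncs


# Three commits in a burst, then a late one, with syncDelay = 1000 ms:
print(sync_times([0, 100, 200, 5000], 1000))  # [1200, 6000]
```

The burst of three commits collapses into a single sync, which is the
performance gain the manual describes. The price is that up to syncDelay
milliseconds of committed data can be lost if the system goes down.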

> You can also explicitly force a disk sync if you want, by invoking the
> RdfRepository.sync() method.

This is great info (how did I overlook that!?!). I'll start looking into
it ASAP.

-- 
Stefano Mazzocchi
Research Scientist                 Digital Libraries Research Group
Massachusetts Institute of Technology            location: E25-131C
77 Massachusetts Ave                   telephone: +1 (617) 253-1096
Cambridge, MA  02139-4307              email: stefanom at mit . edu
-------------------------------------------------------------------

Received on Sun Jul 24 2005 - 01:41:21 EDT
