Re: [update] piggybank performance profiling from Jeen Broekstra on 2005-09-07 (stdin)

From: Jeen Broekstra <jeen_at_aduna.biz>
Date: Wed, 07 Sep 2005 11:13:15 +0200

Stefano Mazzocchi wrote:
> Jeen Broekstra wrote:
>
>> Stefano Mazzocchi wrote:
>>
>>> Well, it's worse than this, I suspect that while the "subj pred
>>> ?x" queries are hashed, the "?x pred obj" queries are iterated,
>>> meaning that it's not a fixed cost that we are paying but a cost
>>> that is proportional to the amount of data in the triple store...
>>> and this is *really* bad news!
>>>
>>> Jeen, are we doing something wrong? or is this really Sesame's
>>> limitation?
>>
>>
>>
>> This is probably indeed a limitation of the current native store -
>> Arjohn can comment on this in more detail (I'll ask him about it
>> as soon as he checks in today) as he designed the native store's
>> indexes, but IIRC we limited the number and type of indexes in the
>> native store to speed up upload, at the cost of bad performance on
>> some query patterns (which were uncommon patterns in the use cases
>> we had in mind for the native store).
>
>
> I see. Very interesting that a "?x pred obj" query is something you
> considered uncommon... but maybe it is because we are very picky
> about allowing users to crawl graphs in both directions.

To clarify: the native store was developed with a particular use case in
our own products in mind. In this application, we expected queries to be
typically of the form "s p ?x" so we optimized for that.

>> This is not insurmountable by the way. It should be doable to add
>> an extra index relatively quickly. I'll come back to this when I've
>> had more coffee and details.
>
>
> Thanks, it would be very much appreciated.

The native store has a BTree index using an SPO (subject, predicate,
object) key. To add an index for your use case, we'd need to add, say,
an OPS-based index (the implementation is a pure BTree, so we cannot
currently support foreign keys; we'd need a B+tree for that). In
practice that means creating an extra BTree and using that for
evaluating patterns where the subject is unknown.
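
To illustrate why the key order matters, here is a minimal sketch. It is
not the native store's code: the class and method names are made up, a
sorted in-memory set stands in for each on-disk BTree, and resources are
plain int ids. With an SPO key, the bound values of "s p ?x" form a key
prefix, so the matches sit in one contiguous range. For "?x p o" they do
not, so every key has to be visited unless a second, OPS-ordered index
exists.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

// Illustrative sketch only; this is NOT the native store's code.
// A sorted set of int triples stands in for each on-disk BTree.
final class TripleIndexSketch {
    // Compare keys component by component, like a composite index key.
    private static final Comparator<int[]> LEX = (x, y) -> {
        for (int i = 0; i < 3; i++) {
            int c = Integer.compare(x[i], y[i]);
            if (c != 0) return c;
        }
        return 0;
    };

    private final TreeSet<int[]> spo = new TreeSet<>(LEX); // keys are {s, p, o}
    private final TreeSet<int[]> ops = new TreeSet<>(LEX); // keys are {o, p, s}

    // Every triple is written to both indexes (hence the extra disk space).
    void add(int s, int p, int o) {
        spo.add(new int[] {s, p, o});
        ops.add(new int[] {o, p, s});
    }

    // Collect the third component of every key that starts with (a, b).
    private static List<Integer> prefixScan(TreeSet<int[]> index, int a, int b) {
        int[] from = {a, b, Integer.MIN_VALUE};
        int[] to = {a, b, Integer.MAX_VALUE};
        List<Integer> out = new ArrayList<>();
        for (int[] key : index.subSet(from, true, to, true)) {
            out.add(key[2]);
        }
        return out;
    }

    // "s p ?x": the bound values are a prefix of the SPO key.
    List<Integer> objects(int s, int p) {
        return prefixScan(spo, s, p);
    }

    // "?x p o" with only the SPO index: every key must be inspected.
    List<Integer> subjectsByFullScan(int p, int o) {
        List<Integer> out = new ArrayList<>();
        for (int[] key : spo) {
            if (key[1] == p && key[2] == o) {
                out.add(key[0]);
            }
        }
        return out;
    }

    // "?x p o" with the extra OPS index: again a prefix range scan.
    List<Integer> subjectsByOps(int p, int o) {
        return prefixScan(ops, o, p);
    }

    public static void main(String[] args) {
        TripleIndexSketch idx = new TripleIndexSketch();
        idx.add(1, 10, 100);
        idx.add(2, 10, 100);
        idx.add(3, 10, 200);
        idx.add(1, 11, 100);
        System.out.println(idx.objects(1, 10));              // [100]
        System.out.println(idx.subjectsByFullScan(10, 100)); // [1, 2]
        System.out.println(idx.subjectsByOps(10, 100));      // [1, 2]
    }
}
```

Both lookups for "?x p o" return the same subjects; the difference is
that the full scan touches every stored triple, while the OPS lookup
only touches the matching range. The price is that each triple is
stored twice.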

Arjohn guesstimated that implementing this would take approximately a
day's work, but we can make no guesses or guarantees about the
performance. For each extra BTree an extra 'triples.dat' file will be
created, which gives you an indication of the disk space demands. I
noticed that Vineet offered to take a stab at implementing this; Arjohn
will contact him with more details on how to get started.

>> FWIW: both the in-memory store and the RDBMS-backed store do not
>> have this limitation.
>
>
> Ok, let's talk more about this.
>
> David, Ryan and I had a meeting today about how to move forward and
> remove the current performance limitations of piggybank.
>
> Two things are big turn offs for us and the native store:
>
> 1) slow (I suspect O(n)-complex) subject-based queries
> 2) lack of OWL-lite inferencing
>
> Both are not impossible to add, but require time and effort, while
> they are already available in the other two stores.
>
> The memory store, especially the one in 1.2.1 which saves on disk
> before returning guaranteeing a minimal acidity, is appealing for its
> speed but we are kind of afraid of the memory consumption, as we
> don't expect people to clean up their piggybanks often but just to
> keep throwing coins in (just like you do with real piggybanks),
> having some 50/100 MB of stuff loaded every time in your browser
> because of PB might be a little overwhelming for some... even if
> memory is so cheap these days and modern browsers tend to use a ton
> of memory on their own.

It does indeed not sound like a good idea for a browser plugin :)

> The RDBMS store is appealing to us on the server side (think semantic
> bank and standalone longwell) but not really on the client side,
> unless we can support native java RDBMSs like Derby or HSQL. I've
> looked at the code and as far as I understood, Sesame supports MySQL,
> Oracle and Postgres as databases, but adding support for another SQL
> dialect should be a single class away, is that correct?

That is correct.

Having said that, we are planning to move away from the current design
of a 'generic RDBMS' layer with a minimal specialization for each
database product, as it limits the types of optimization we can do. In
Sesame 2.0 we are moving to database-specific implementations. Each of
these is still only a single class, but it makes us more flexible in
changing the database and index structure for each product.

Of course, the generic stuff can still be used as a bootstrap.

> If so, would you guys be able to help us out in case we would like to
> try to run sesame on top of Derby or HSQL?

Definitely! In fact Derby is something we've been eyeballing for a while
as well. We haven't made any effort to support it yet but it seems a
likely candidate and well suited to your use case, and useful to Sesame
users in general. If you can pick this up that would be great! We'd be
more than happy to support it.

> Also, Jeen, another question came up today: all our communication to
> the triple store goes thru SeRQL queries, but they are rather simple
> exploration ones (subj pred ?x) or (?x pred obj) or (?x pred ?y), do
> you think it would be faster to use the API directly and avoid the
> SeRQL parsing time? or you think this would be of minimal help?

There is a constant, but quite small, parser overhead for each query, so
usually it does not make much of a difference compared to query
evaluation time itself - and of course for larger query patterns there
is query optimization going on in the backend that you miss if you do
API calls.

If all your queries are single path matches, it might give a small gain
but I wouldn't expect too much of it, unless your queries are already
very fast (in which case the parse overhead is a significant part of the
total).

> thanks so much in advance for your support.
>
> Ah, also, how's work on 2.0 going? anything we can already try
> out/test?

The state is that we have a revised sail API and access API, including
full transactional support and context support, have extended SeRQL to
work with contexts, and have an in-memory store with RDFS inferencing
support working. The current focus is on the client/server stuff and
porting the RDBMS and native stores. Oh and we need to update the
documentation of course :)

The goal is still to have a first prerelease by the end of September,
and it looks likely we'll make that, although it may not be
all-singing-all-dancing (for example the custom inferencer may have to
wait a bit).

If you want to have a look at what's there or test it, check out module
'openrdf2' from cvs.sourceforge.net:/cvsroot/sesame . Feedback is of
course most welcome.

Cheers,

Jeen
Received on Wed Sep 07 2005 - 09:09:39 EDT
