Re: [update] piggybank performance profiling

From: Stefano Mazzocchi <stefanom_at_mit.edu>
Date: Tue, 06 Sep 2005 17:10:56 -0400

Jeen Broekstra wrote:
> Stefano Mazzocchi wrote:
>
>> Well, it's worse than this, I suspect that while the "subj pred ?x"
>> queries are hashed, the "?x pred obj" queries are iterated, meaning
>> that it's not a fixed cost that we are paying but a cost that is
>> proportional to the amount of data in the triple store... and this is
>> *really* bad news!
>>
>> Jeen, are we doing something wrong? or is this really Sesame's
>> limitation?
>
>
> This is probably indeed a limitation of the current native store -
> Arjohn can comment on this in more detail (I'll ask him about it as soon
> as he checks in today) as he designed the native store's indexes, but
> IIRC we limited the number and type of indexes in the native store to
> speed up upload, at the cost of bad performance on some query patterns
> (which were uncommon patterns in the use cases we had in mind for the
> native store).

I see. Very interesting that a "?x pred obj" query is something you
considered uncommon... but maybe it is because we are very picky about
allowing users to crawl graphs in both directions.

> This is not insurmountable by the way. It should be doable to add an
> extra index relatively quickly. I'll come back to this when I've had
> more coffee and details.

Thanks, it would be very much appreciated.
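
To make the difference between a hashed "subj pred ?x" lookup and an
iterated "?x pred obj" lookup concrete, here is a minimal sketch of a toy
in-memory triple store. It is not Sesame's native store: the class name,
the index layout, and the `index_objects` switch are all assumptions made
for illustration. It only shows how a store indexed on (subject,
predicate) has to scan every statement to find subjects, and how an
extra (predicate, object) index removes the scan at the price of more
work on upload.

```python
from collections import defaultdict


class ToyTripleStore:
    """Toy store for illustration only; not how Sesame's native store works."""

    def __init__(self, index_objects=False):
        self.triples = []                      # every (subj, pred, obj) statement
        self.by_subj_pred = defaultdict(list)  # (subj, pred) -> [obj, ...]
        # Hypothetical extra index: (pred, obj) -> [subj, ...]
        self.by_pred_obj = defaultdict(list) if index_objects else None

    def add(self, subj, pred, obj):
        self.triples.append((subj, pred, obj))
        self.by_subj_pred[(subj, pred)].append(obj)
        if self.by_pred_obj is not None:
            # The extra index costs one more write per uploaded statement.
            self.by_pred_obj[(pred, obj)].append(subj)

    def objects(self, subj, pred):
        """Answer "subj pred ?x" with a single hash lookup."""
        return list(self.by_subj_pred.get((subj, pred), []))

    def subjects(self, pred, obj):
        """Answer "?x pred obj": hashed if the extra index exists, else a scan."""
        if self.by_pred_obj is not None:
            return list(self.by_pred_obj.get((pred, obj), []))
        # No suitable index: cost grows with the number of stored statements.
        return [s for (s, p, o) in self.triples if p == pred and o == obj]


store = ToyTripleStore(index_objects=True)
store.add("ex:alice", "ex:knows", "ex:bob")
store.add("ex:carol", "ex:knows", "ex:bob")
print(store.subjects("ex:knows", "ex:bob"))
```

With `index_objects=False` the same call returns the same subjects, but
only after walking the whole statement list.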

> FWIW: both the in-memory store and the RDBMS-backed store do not have
> this limitation.

Ok, let's talk more about this.

David, Ryan and I had a meeting today about how to move forward and
remove the current performance limitations of piggybank.

Two things are big turn-offs for us with the native store:

  1) slow (I suspect O(n)) "?x pred obj" queries, i.e. looking up
     subjects
  2) lack of OWL-lite inferencing

Neither is impossible to add, but both require time and effort, while
both are already available in the other two stores.

The memory store is appealing for its speed, especially the one in
1.2.1, which saves to disk before returning and so gives minimal ACID
guarantees. But we are kind of afraid of the memory consumption: we
don't expect people to clean up their piggybanks often, just to keep
throwing coins in (as you do with real piggybanks). Having some
50-100 MB of stuff loaded in your browser every time because of PB
might be a little overwhelming for some, even if memory is cheap these
days and modern browsers tend to use a ton of memory on their own.

The RDBMS store is appealing to us on the server side (think Semantic
Bank and standalone Longwell) but not really on the client side, unless
we can support native Java RDBMSs like Derby or HSQL. I've looked at the
code and, as far as I understand, Sesame supports MySQL, Oracle and
Postgres as databases, and adding support for another SQL dialect should
be a single class away. Is that correct? If so, would you guys be able
to help us out if we try to run Sesame on top of Derby or HSQL?

Also, Jeen, another question came up today: all our communication with
the triple store goes through SeRQL queries, but they are rather simple
exploration ones: (subj pred ?x), (?x pred obj) or (?x pred ?y). Do you
think it would be faster to use the API directly and avoid the SeRQL
parsing time? Or do you think this would be of minimal help?
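
For reference, the three exploration patterns above could be written as
SeRQL path expressions along the following lines. This is only a sketch:
`serql_for_pattern` is a hypothetical helper, not part of Sesame or
piggybank, and it handles URI terms only (no literals or blank nodes).
It says nothing about how much time parsing actually takes; it just
shows how small the query strings are.

```python
def serql_for_pattern(subj=None, pred=None, obj=None):
    """Build a SeRQL SELECT for one statement pattern.

    Hypothetical helper: a position left as None becomes a variable;
    any other value is treated as a full URI.
    """
    variables = []

    def term(value, var_name, braces):
        if value is None:
            variables.append(var_name)   # unbound position -> selected variable
            text = var_name
        else:
            text = "<%s>" % value        # SeRQL writes full URIs in angle brackets
        # Subject and object nodes are wrapped in braces; the predicate is not.
        return "{%s}" % text if braces else text

    s = term(subj, "s", True)
    p = term(pred, "p", False)
    o = term(obj, "o", True)
    if not variables:
        raise ValueError("at least one position must be unbound")
    return "SELECT %s FROM %s %s %s" % (", ".join(variables), s, p, o)


# (?x pred obj): find every subject pointing at a given object.
print(serql_for_pattern(pred="http://example.org/knows",
                        obj="http://example.org/bob"))
```

The example URIs under `http://example.org/` are placeholders.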

Thanks so much in advance for your support.

Ah, also, how's work on 2.0 going? Anything we can already try out/test?

-- 
Stefano Mazzocchi
Research Scientist                 Digital Libraries Research Group
Massachusetts Institute of Technology            location: E25-131C
77 Massachusetts Ave                   telephone: +1 (617) 253-1096
Cambridge, MA  02139-4307              email: stefanom at mit . edu
-------------------------------------------------------------------
Received on Tue Sep 06 2005 - 21:06:37 EDT
