Re: Querying and caching with large datasets

From: Rickard Öberg <rickard.oberg_at_senselogic.se>
Date: Mon, 23 Jan 2006 11:29:54 +0100

Jeen Broekstra wrote:
> You could even do it more specific than that: since your cache is a
> (stacked) SAIL and therefore is aware of all operations that take place
> on the underlying store, you can let it 'sniff' all update transactions
> and invalidate cache objects as soon as it detects an operation that
> changes data relevant to that cache object. Or more simply: refresh the
> cache whenever data is added/removed. No real need for timeouts.

Indeed, and preferably I want to combine the two:
1) if data is changed -> purge the cache
2) if timeout is reached and data has changed -> refresh the cache

I do not want to recompute the cache when the actual data change happens
since I might be doing bulk updates, and don't want to recompute for
each insert. For 2), if the timeout has been reached and the cache has
been purged since the last refresh, refresh it to preemptively load it
so that no user will have to "take the hit".
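
A minimal sketch of the two rules in Java, just to make the idea concrete. The class name, the method names and the Supplier standing in for "run the query against the store" are all hypothetical; none of this is Sesame API, and the timer that would call onTimeout() is left out.

```java
import java.util.function.Supplier;

/**
 * Hypothetical sketch, not part of Sesame: a single cached query result
 * that is purged on data changes and refreshed on a timeout tick.
 */
final class PurgeAndRefreshCache<T> {
    private final Supplier<T> query; // assumed stand-in for the expensive query
    private T value;                 // cached result, null while purged
    private boolean purged = true;   // true if data changed since the last refresh

    PurgeAndRefreshCache(Supplier<T> query) {
        this.query = query;
    }

    /** Rule 1: called on every add/remove. Only drops the result, never recomputes. */
    synchronized void onDataChanged() {
        value = null;
        purged = true;
    }

    /**
     * Rule 2: called when the timeout is reached. Recomputes only if the
     * cache was purged since the last refresh; returns true if it did.
     */
    synchronized boolean onTimeout() {
        if (!purged) {
            return false;
        }
        refresh();
        return true;
    }

    /** Readers normally get the cached result; a purged cache is recomputed on demand. */
    synchronized T get() {
        if (purged) {
            refresh();
        }
        return value;
    }

    private void refresh() {
        value = query.get();
        purged = false;
    }
}
```

With this shape a bulk update of a thousand inserts calls onDataChanged() a thousand times but runs the query at most once afterwards, either on the next timeout tick or on the first read, whichever comes first.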

> The general idea looks good. The caching structure itself takes some
> thinking through. There are two types of query results (variable
> bindings and triples) so you need a cache that takes that into account -
> *or* you can choose to simply treat each query as a triple-returning
> query and only do the conversion to variable-bindings last-minute, that
> way you can use an in-memory SAIL as the cache object.

That would be preferred.

> I know that some people at the Vrije Universiteit Amsterdam have played
> with caching query results for often-occurring queries on top of a
> MySQL-backed Sesame store, as part of the BuRST project
> (http://www.cs.vu.nl/~pmika/research/burst/BuRST.html). We talked
> through their main ideas at the time (which were fairly similar to your
> own outline), but I'm not sure if they have actually implemented the
> caching scheme.

For me it is simple: if I am to use this, then I HAVE to do it :-) Or it
will be too slow for real usage. Example: first page of a website shows
aggregated news, which would be an RDF query. Everyone sees the same
data, and everyone hits that page, and it results in the same query. So,
even if it's only 50ms to compute or less that is way way way too much
time in this case. Compare it with our total average page rendering time
today which is 20ms. And no, I can't do caching of the HTML, because the
actual rendering is often personalized.

> We've often considered implementing a query cache as a SAIL ourselves
> though (in fact, the main reason the architecture has stacked SAILs in
> the first place is precisely because we wanted to cater for this kind of
> improvement). The main reason there is not something like it in Sesame
> already is lack of time. So yeah, we think it's a great idea, and
> especially on persistent (MySQL, native) stores it may give great speedups.

I looked through the Sesame 1 API though, and it doesn't seem like I can
do this as a Sail feature, since the query handling is not done by each
Sail, but rather by a separate query engine (if I understood things
correctly). Looking at Sesame 2 it does indeed look like it's a chained
architecture that would allow such caching, so I'm assuming you're
talking about Sesame 2.

My only problem with using Sesame 2 is that it's an alpha, and I don't
know when I'll have to go into production with this stuff. Preferably
within the next six months. Any thoughts on that? Sesame 1 or 2?

/Rickard

-- 
Rickard Öberg
rickard.oberg_at_senselogic.se
_at_work +46-(0)19-173036



Received on Mon Jan 23 2006 - 10:35:32 EST
