Re: Querying and caching with large datasets

From: Jeen Broekstra <jeen_at_aduna.biz>
Date: Mon, 23 Jan 2006 10:32:37 +0100

Rickard Öberg wrote:

> A general question about performance that we have is query speed. The
> test I did with importing 3.000.000 tuples into a store allowed us
> to make many interesting queries and use our data in ways that we
> could not have done before.
>
> But, the fastest of those queries were around 100ms, and in a CMS
> that is simply not good enough. In most cases it needs to be almost
> instantaneous, and the only way to do that is to cache, cache, cache.
>
>
> I have been thinking about how to do this in Sesame, and it seems
> like a decent approach would be to implement it as a Sail layer. The
> layer would delegate to the underlying "real" layers and *if* the
> query contains special tuples indicating that a cached response is
> wanted, then it could store the result so that the next time the
> results can be returned immediately from the cache. The next step is
> to also recognize a cache-timeout and when that timeout is reached
> the cache should preemptively perform the query and re-fill the
> cache.

You could even be more specific than that: since your cache is a
(stacked) SAIL and therefore is aware of all operations that take place
on the underlying store, you can let it 'sniff' all update transactions
and invalidate cache objects as soon as it detects an operation that
changes data relevant to that cache object. Or more simply: refresh the
cache whenever data is added/removed. No real need for timeouts.
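To sketch what I mean (this is *not* the real Sesame SAIL API -- `Sail`,
`CachingSail` and `MemorySail` below are made-up stand-ins, just enough to
show a stacked layer that serves cached results and flushes on writes):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical minimal Sail interface, for illustration only.
interface Sail {
    List<String> evaluate(String query);
    void addStatement(String subj, String pred, String obj);
    void removeStatement(String subj, String pred, String obj);
}

// A caching layer stacked on top of another Sail. Because every update
// passes through it, the simplest policy is: flush the cache on any write.
class CachingSail implements Sail {
    private final Sail delegate;
    private final Map<String, List<String>> cache = new HashMap<>();

    CachingSail(Sail delegate) { this.delegate = delegate; }

    public List<String> evaluate(String query) {
        // Serve from cache if possible; otherwise delegate and remember.
        return cache.computeIfAbsent(query, delegate::evaluate);
    }

    public void addStatement(String subj, String pred, String obj) {
        cache.clear();  // coarse invalidation: any write empties the cache
        delegate.addStatement(subj, pred, obj);
    }

    public void removeStatement(String subj, String pred, String obj) {
        cache.clear();
        delegate.removeStatement(subj, pred, obj);
    }
}

// Trivial in-memory store as the bottom of the stack (illustration only).
class MemorySail implements Sail {
    private final List<String[]> triples = new ArrayList<>();
    int evaluations = 0;  // lets us observe cache hits vs. misses

    public List<String> evaluate(String query) {
        evaluations++;
        List<String> out = new ArrayList<>();
        for (String[] t : triples)
            if (t[1].equals(query))  // toy "query": match on predicate
                out.add(String.join(" ", t));
        return out;
    }

    public void addStatement(String subj, String pred, String obj) {
        triples.add(new String[] { subj, pred, obj });
    }

    public void removeStatement(String subj, String pred, String obj) {
        triples.removeIf(t -> t[0].equals(subj) && t[1].equals(pred) && t[2].equals(obj));
    }
}
```

A finer-grained variant would inspect the statements in each transaction and
invalidate only the cache entries whose queries touch the affected data, but
flush-on-write is a correct (and much simpler) starting point.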

The general idea looks good. The caching structure itself takes some
thinking through. There are two types of query results (variable
bindings and triples), so you need a cache that takes that into account -
*or* you can choose to simply treat each query as a triple-returning
query and only do the conversion to variable bindings at the last
minute; that way you can use an in-memory SAIL as the cache object.
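Roughly like this (again a sketch, not the real Sesame classes -- `Triple`
and `TripleCacheEntry` are made up): the cache stores raw triples, and the
projection onto variable bindings happens only when a result is read back.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical triple record, not Sesame's actual Statement class.
record Triple(String subj, String pred, String obj) {}

// A cache entry holding raw triples; conversion to variable bindings is
// deferred until read time, so one storage format serves both result types.
class TripleCacheEntry {
    private final List<Triple> triples;

    TripleCacheEntry(List<Triple> triples) { this.triples = triples; }

    // CONSTRUCT-style result: just hand back the triples.
    List<Triple> asTriples() { return triples; }

    // SELECT-style result: project the cached triples onto the variable
    // names the original query asked for, at the last minute.
    List<Map<String, String>> asBindings(String subjVar, String objVar) {
        List<Map<String, String>> rows = new ArrayList<>();
        for (Triple t : triples) {
            Map<String, String> row = new LinkedHashMap<>();
            row.put(subjVar, t.subj());
            row.put(objVar, t.obj());
            rows.add(row);
        }
        return rows;
    }
}
```

The point is that the same cached object can back both result types, which
is what makes an in-memory SAIL a natural container for it.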

> This seems like a good way, in theory, to get caching of queries. The
> main assumptions are that there are some queries that are performed
> often, that data is not updated so frequently that the cached data
> will be immediately outdated, and that by allowing the caller to
> decide when to use the cache by including cache information in the
> query we can easily bypass the cache if necessary. For queries that
> return large result sets it would definitely be a bad idea to cache
> them. For example, our daily link checker would do a query for all
> links, and to query those results would be pointless.
>
> Has this been done before? Is there anything wrong with the basic
> idea? Any other ways to perform good caching of query results?

I know that some people at the Vrije Universiteit Amsterdam have played
with caching query results for often-occurring queries on top of a
MySQL-backed Sesame store, as part of the BuRST project
(http://www.cs.vu.nl/~pmika/research/burst/BuRST.html). We talked
through their main ideas at the time (which were fairly similar to your
own outline), but I'm not sure if they have actually implemented the
caching scheme.

We've often considered implementing a query cache as a SAIL ourselves
though (in fact, the main reason the architecture has stacked SAILs in
the first place is precisely that we wanted to cater for this kind of
improvement). The main reason there is not something like it in Sesame
already is lack of time. So yeah, we think it's a great idea, and
especially on persistent (MySQL, native) stores it may give great speedups.

Regards,

Jeen
Received on Mon Jan 23 2006 - 09:30:52 EST
