Querying and caching with large datasets

From: Rickard Öberg <rickard.oberg_at_senselogic.se>
Date: Mon, 23 Jan 2006 09:47:35 +0100

A general performance question we have concerns query speed. The test I
did, importing 3,000,000 tuples into a store, allowed us to make many
interesting queries and use our data in ways that we could not have
done before.

But the fastest of those queries took around 100 ms, and in a CMS that
is simply not good enough. In most cases a response needs to be almost
instantaneous, and the only way to do that is to cache, cache, cache.

I have been thinking about how to do this in Sesame, and it seems like a
decent approach would be to implement it as a Sail layer. The layer
would delegate to the underlying "real" layers, and *if* the query
contains special tuples indicating that a cached response is wanted, it
would store the result so that the next time it can be returned
immediately from the cache. The next step is to also recognize a cache
timeout: when that timeout is reached, the cache should preemptively
re-run the query and refill itself.
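
To make the idea concrete, here is a rough sketch in plain Java. It
deliberately does not use Sesame's real Sail interfaces: QueryLayer,
CachingQueryLayer and refreshExpired are hypothetical names, queries are
plain strings, and results are lists of strings. The clock is passed in
only so that the timeout can be exercised without waiting.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Hypothetical stand-in for the underlying "real" layer (not a Sesame interface). */
interface QueryLayer {
    List<String> evaluate(String query);
}

/** Wraps another layer and serves stored results until they time out. */
final class CachingQueryLayer implements QueryLayer {
    private static final class Entry {
        final List<String> rows;
        final long storedAtMillis;

        Entry(List<String> rows, long storedAtMillis) {
            this.rows = rows;
            this.storedAtMillis = storedAtMillis;
        }
    }

    private final QueryLayer delegate;
    private final long timeoutMillis;
    private final LongSupplier clock; // assumption: injected so tests can control time
    private final Map<String, Entry> cache = new ConcurrentHashMap<>();

    CachingQueryLayer(QueryLayer delegate, long timeoutMillis, LongSupplier clock) {
        this.delegate = delegate;
        this.timeoutMillis = timeoutMillis;
        this.clock = clock;
    }

    @Override
    public List<String> evaluate(String query) {
        Entry entry = cache.get(query);
        if (entry != null && !isExpired(entry)) {
            return entry.rows; // hit: no call to the underlying layer
        }
        return load(query);    // miss or stale: delegate and store
    }

    /** Re-runs every timed-out query so that later callers find a warm cache. */
    void refreshExpired() {
        for (Map.Entry<String, Entry> e : cache.entrySet()) {
            if (isExpired(e.getValue())) {
                load(e.getKey());
            }
        }
    }

    private boolean isExpired(Entry entry) {
        return clock.getAsLong() - entry.storedAtMillis >= timeoutMillis;
    }

    private List<String> load(String query) {
        List<String> rows =
                Collections.unmodifiableList(new ArrayList<>(delegate.evaluate(query)));
        cache.put(query, new Entry(rows, clock.getAsLong()));
        return rows;
    }
}
```

In a real deployment refreshExpired would presumably be driven by a
timer, for example ScheduledExecutorService.scheduleAtFixedRate, and the
clock would be System::currentTimeMillis. The sketch ignores updates to
the store, so stale results can be served until the timeout passes.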

This seems like a good way, in theory, to get caching of queries. The
main assumptions are that there are some queries that are performed
often, that data is not updated so frequently that the cached data will
be immediately outdated, and that by allowing the caller to decide when
to use the cache by including cache information in the query we can
easily bypass the cache if necessary. For queries that return large
result sets it would definitely be a bad idea to cache them. For
example, our daily link checker would do a query for all links, and to
cache those results would be pointless.
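
The two rules above (the caller opts in, and large results are never
stored) could be sketched roughly as follows. This is again plain Java
with made-up names: HintedQuery, SelectiveCache and maxCachedRows are
assumptions for illustration, and the boolean flag stands in for the
"special tuples" in the query.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical query object: the query text plus a caller-supplied cache hint. */
final class HintedQuery {
    final String text;
    final boolean cacheRequested;

    HintedQuery(String text, boolean cacheRequested) {
        this.text = text;
        this.cacheRequested = cacheRequested;
    }
}

/** Caches only when the caller asks for it and the result is small enough. */
final class SelectiveCache {
    private final Function<String, List<String>> backend; // stand-in for the real store
    private final int maxCachedRows;                       // assumed size limit
    private final Map<String, List<String>> cache = new HashMap<>();

    SelectiveCache(Function<String, List<String>> backend, int maxCachedRows) {
        this.backend = backend;
        this.maxCachedRows = maxCachedRows;
    }

    // Synchronized for simplicity; this holds the lock while the backend runs.
    synchronized List<String> evaluate(HintedQuery query) {
        if (!query.cacheRequested) {
            return backend.apply(query.text); // bypass: cache is neither read nor written
        }
        List<String> hit = cache.get(query.text);
        if (hit != null) {
            return hit;
        }
        List<String> rows =
                Collections.unmodifiableList(new ArrayList<>(backend.apply(query.text)));
        if (rows.size() <= maxCachedRows) {
            cache.put(query.text, rows);      // small result: keep it for next time
        }
        return rows;                          // large result: returned but not stored
    }

    synchronized boolean isCached(String text) {
        return cache.containsKey(text);
    }
}
```

With a limit of, say, 100 rows, a navigation-menu query would be served
from the cache after its first run, while the link checker's "all
links" query would go to the store every time, whether or not it asked
for caching.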

Has this been done before? Is there anything wrong with the basic idea?
Any other ways to perform good caching of query results?

/Rickard

-- 
Rickard Öberg
rickard.oberg_at_senselogic.se
_at_work +46-(0)19-173036



Received on Mon Jan 23 2006 - 08:53:14 EST
