Re: Just-in-time scraping, queries?

From: Rickard Öberg <rickard.oberg_at_senselogic.se>
Date: Sun, 23 Oct 2005 22:55:18 +0200

Danny Ayers wrote:
> I was wondering if anyone had come up with any strategies that might
> be useful in a scenario that came up on the SIMILE list [1]. Rickard
> is using Piggy Bank's scraper to harvest moderately large amounts of
> data into its store (30,000 items, 10 properties each), and is running
> into performance issues. I'm not sure, but he mentioned Wikipedia
> earlier, so that may be the data source.

Right now I'm using the databases at usgs.gov (earthquakes, volcanoes,
etc.) as a start. Wikipedia is my next target though.

> I think it's reasonable to consider a triplestore as merely a cache of
> a certain chunk of the Semantic Web at large. So in a case like this,
> maybe it makes more sense to forget trying to cache /everything/, just
> grabbing things into the working model as required. But say there's a
> setup like a SPARQL interface to a store, and a scraper (HTTP GET+
> whatever translation is appropriate). How might you figure out what's
> needed to fulfil the query, what joins are required, especially where
> there isn't any direct subject-object kind of connection to the
> original data? (i.e. where there's lots of bnodes). Querying Wikipedia
> as-is via SPARQL is probably a good use case.

Indeed, I've been thinking about multi-layered triplestores as well,
such as:
1) Multiple distributed specialized banks
fronted by:
2) Local persistent bank acting as proxy and file-store cache
fronted by:
3) In-memory cache that contains queried data
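
To make that concrete, here is a very rough sketch in Java of what the
three layers could look like. All names are made up for illustration;
this is not the actual Piggy Bank code, and plain maps stand in for
real RDF items and for the file-store cache:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// An "item" is just a property->value map here; a filter is a set of
// required property values.
interface Bank {
    List<Map<String, String>> query(Map<String, String> filter);
}

// Layer 2: local persistent bank acting as proxy and cache in front of
// the remote specialized banks (layer 1). A HashMap stands in for the
// file-store cache.
class ProxyBank implements Bank {
    private final List<Bank> remoteBanks;
    private final Map<Map<String, String>, List<Map<String, String>>> cache =
            new HashMap<>();

    ProxyBank(List<Bank> remoteBanks) { this.remoteBanks = remoteBanks; }

    public List<Map<String, String>> query(Map<String, String> filter) {
        return cache.computeIfAbsent(filter, f -> {
            // Cache miss: fan out to the specialized banks and aggregate.
            List<Map<String, String>> aggregated = new ArrayList<>();
            for (Bank remote : remoteBanks) {
                aggregated.addAll(remote.query(f));
            }
            return aggregated;
        });
    }
}

// Layer 3: the aggregated working set, held in memory; further drilldown
// filtering runs over this list without touching the network.
class MemoryBank implements Bank {
    private final List<Map<String, String>> items;

    MemoryBank(List<Map<String, String>> items) { this.items = items; }

    public List<Map<String, String>> query(Map<String, String> filter) {
        List<Map<String, String>> result = new ArrayList<>();
        for (Map<String, String> item : items) {
            if (item.entrySet().containsAll(filter.entrySet())) {
                result.add(item);
            }
        }
        return result;
    }
}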

When you start working with data in Piggy Bank, for example, you begin
with some basic filtering, like "I want all earthquakes". This is sent
to 2), which can then use 1) to get the data, drawing on the local
cache whenever possible. The aggregated dataset is then put into 3) and
presented to the user. Since we then have a reasonably small subset of
all the data in memory, further drilldown filtering, such as showing
"all earthquakes in 1980, of magnitude 6-8", is very fast, because such
operations are done entirely in memory. 3) can even live on the user's
computer, so that user sessions do not eat vast amounts of server
resources. This way each layer is used to the maximum.
1) might even be a "fake" semantic bank, where scraping would be
performed in real time as the queries come in. It is easy to imagine
all of this being combined with P2P techniques as well.
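
Continuing the sketch above, roughly what a user session might look
like -- again purely illustrative, with invented property names and
values:

class Session {
    static void run(List<Bank> remoteBanks) {
        Bank local = new ProxyBank(remoteBanks);   // layer 2 in front of layer 1

        // "I want all earthquakes" -- answered by layer 2, which hits the
        // remote banks the first time and its cache afterwards.
        List<Map<String, String>> earthquakes =
                local.query(Map.of("type", "earthquake"));

        // The aggregated result becomes the in-memory working set (layer 3).
        MemoryBank workingSet = new MemoryBank(earthquakes);

        // Drilldown ("all earthquakes in 1980, of magnitude 6-8") never
        // leaves memory; the magnitude range is checked in a plain loop
        // since the toy filter only handles exact matches.
        List<Map<String, String>> drilldown = new ArrayList<>();
        for (Map<String, String> quake : workingSet.query(Map.of("year", "1980"))) {
            double m = Double.parseDouble(quake.getOrDefault("magnitude", "0"));
            if (m >= 6 && m <= 8) {
                drilldown.add(quake);
            }
        }
        System.out.println(drilldown.size() + " matching earthquakes");
    }
}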

/Rickard
Received on Sun Oct 23 2005 - 20:49:58 EDT
