Re: Piggy Bank and Semantic Bank: scalability and performance

From: Stefano Mazzocchi <>
Date: Mon, 24 Oct 2005 11:03:35 -0400

Rickard Öberg wrote:
> Hey
> I have been running into some performance problems and am wondering what
> the general thoughts are on the subject.
> The first one is a problem with the general way Piggy Bank works right
> now. First of all everything is done in a memory store. I wrote a
> scraper that scraped ~30.000 items with 10 properties each. There
> were several problems I ran into:
> 1) Everything was scraped in one transaction (see
> which means that after a while I ran out of memory -> crash
> 2) If I explicitly added getRepository() calls to force new transactions
> it got much better, but due to the data size I still run out of memory
> -> crash
> 3) If I chop it up into smaller pieces the scraper works, but then
> submitting it to the central bank causes error messages like "Form too
> large" -> crash
> 4) To get around the above I had to start the scraper manually on
> smaller chunks of the data -> works, and I now have all the data, but it
> is a workaround and points to a larger problem


we did not expect Piggy Bank scraping to generate that many statements;
as David mentioned, we imagined that all sorts of other projects would
handle direct "RDF-ization" of datasets.

We started a collection of standalone tools for RDFization in

but since we wanted to keep submissions open to many languages, we
didn't build a framework for scraping.

I'm positive such a framework will be required and yes, we are fully
aware that 300 Kt (kilotriples) is a tiny thing if you ever want to get
anywhere with this.

So much so that we are going to hire somebody in the next few months to
work specifically on scalability issues.

> Questions:
> 1) Are there any thoughts on doing scraping not in-memory? Perhaps the
> default should be in-memory, but for people like me who do "heavy-duty"
> scrapes there should be an option of doing it in a slightly slower but
> working manner using file-backed repositories

Yes, this would be good, but we don't have the energy to do it ourselves
at the moment.

> 2) Why is there a restriction on the size of the data submission?

Probably Jetty enforces a maximum form/upload size (its
maxFormContentSize limit, if I remember the setting correctly).

> And if
> there is, why doesn't Piggy Bank itself chop it up into smaller pieces
> and submit them one after the other? (or is this just a side-effect
> of all of this being work-in-progress?)

Because we didn't expect this kind of usage :-)
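A client-side workaround could be sketched like this (hypothetical; `post` stands in for whatever HTTP call Piggy Bank uses to submit statements to the bank):

```python
# Hypothetical workaround, not current Piggy Bank behavior: chop a large
# set of statements into fixed-size chunks and submit each separately,
# so no single POST exceeds the server's form-size limit.

def chunked(items, size):
    """Yield successive chunks of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def submit_all(statements, post, chunk_size=5000):
    """Submit statements one chunk at a time via the `post` callable."""
    for chunk in chunked(statements, chunk_size):
        post(chunk)  # e.g. an HTTP POST to the Semantic Bank endpoint
```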

> This is for Piggy Bank. Then, once all the data is in the bank (in my
> case I got 30.000 earthquakes from 1974-2005 to play with) there's the
> issue of query performance. It is really really slow to do filtering of
> that size of data. If this thing is to be used for really really large
> data sets (and I mean, 30.000 is TINY if you consider just how much data
> can be filled into a semantic database) then either the databases have
> to get substantially faster, or performance tricks and optimizations
> will have to be done. What are the largest databases other people are
> using? What performance issues have others found?

We have not tried anything bigger than 600 Kt (350 Kt of data and 250 Kt
inferred), using Longwell 1.x, and we needed to go "in memory" to get
acceptable query performance in a reasonable timeframe.

We are targeting something along the lines of 150 Mt for the next
iteration and, like I said, we are going to hire somebody to try to get
there.

Finding the balance between indexing, caching and runtime query
evaluation will be a very tricky thing.

> One problem right now seems to be that the UI has to go through the
> entire database to find out what values are used in order to present
> selection lists. In some cases I know what the accepted value ranges are
> (e.g. earthquake magnitudes are in the range 1-9), and could hence
> provide specialized selectors for that. This should avoid doing the
> "search for all values in data set" problem quite nicely.


> Is there a way to associate property types with viewers like that?

Currently there isn't, but this is one of the things we have planned.
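To sketch your earthquake example (hypothetical code, not the Longwell UI): when a property's range is known in advance, the facet can present fixed buckets and count members in a single pass, instead of scanning the whole dataset just to discover which values exist.

```python
# Hypothetical range-based facet: the bucket list is fixed up front
# (magnitudes are known to fall in 1-9), so no scan is needed to build
# the selector; one pass over the items fills in the counts.
from collections import Counter

MAGNITUDE_BUCKETS = range(1, 10)  # known range: magnitudes 1..9

def magnitude_facet(items):
    """Return (magnitude, count) pairs for each fixed bucket."""
    counts = Counter(int(item["magnitude"]) for item in items)
    return [(m, counts.get(m, 0)) for m in MAGNITUDE_BUCKETS]
```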

> Would this be possible to do?

Some of us have been working on the Fresnel visualization ontology, a
vocabulary for creating "lenses", which are, roughly, views over some
RDF data.

We plan to integrate Fresnel into the next major release of Longwell,
which will make it available to both Piggy Bank and Semantic Bank,
since they both use Longwell.
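To give a flavor of it, a minimal lens for your earthquake data might look something like this (Turtle syntax; the property names come from the Fresnel draft vocabulary, while the ex: class and properties are made up for illustration):

```turtle
@prefix fresnel: <http://www.w3.org/2004/09/fresnel#> .
@prefix ex:      <http://example.org/quake#> .

ex:quakeLens a fresnel:Lens ;
    fresnel:classLensDomain ex:Earthquake ;
    fresnel:showProperties ( ex:magnitude ex:date ex:location ) .
```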

> To be clear, I really really like this stuff, and I can see TONS of neat
> applications of it, but it'd be nice to have some idea of whether it
> will scale to much larger applications and data sets, and what can be
> done to handle it if not.


as much as I'd love to say that this stuff is rock solid, unfortunately,
it's not. We have been working on the concepts and architectural vision,
implementing prototypes, and when we had to decide between spending a
lot of time and energy to "get it right" the first time or to "get it
out there", we chose the "release early and often" strategy.

"Good ideas and bad code build communities, the other three combinations
do not", I said this years ago and I still strongly believe it.

On the other hand, SIMILE just got funding for the next two years and we
are going to hire several new people to start this much needed process
of "software solidification", along with the process of "community
incubation". Both will be very important for the success of this project
and, more generally, of the semantic web technology visions we are
implementing.

There are huge issues in the semantic web architecture that are not
solved and that are very far away from being solid enough to be viable
for the software industry at large.

Things like version control of triple stores, scalability of querying,
inferencing scalability, trust, ontology harmonization and linking, RDF
extraction and its emergence out of semi-structured data... these are
all research topics that are very far from closed.

I also know (and I'm happy to have you around because you know this very
well too) that academia is not the only place where hard research
problems get solved; sometimes some shared pragmatism is what is needed
to bring problems to closure.

With SIMILE, we are trying to do a very hard thing: mixing research,
development and open source, all done well. It has worked so far, as we
were able to get enough traction to get pragmatic people like you
excited about ideas that many considered academic and overdesigned.

But there is *tons* of work to do, and the number of corners we can cut
is getting smaller and smaller, since we need to move some of our stuff
into a state where it can be used for real.

That said, we are currently 2 developers and 1 grad student. Not exactly
a workforce. We are going to hire 3 more developers in the next few
months and one will be tasked with the scalability and performance issues.

But what we want is to start a community, so that we can share the costs
and benefits of all this.

Well, the above was a very verbose way to say "patches welcome" :-)

Stefano Mazzocchi
Research Scientist                 Digital Libraries Research Group
Massachusetts Institute of Technology            location: E25-131C
77 Massachusetts Ave                   telephone: +1 (617) 253-1096
Cambridge, MA  02139-4307              email: stefanom at mit . edu
Received on Mon Oct 24 2005 - 14:58:08 EDT
