Piggy Bank and Semantic Bank: scalability and performance

From: Rickard Öberg <rickard.oberg_at_senselogic.se>
Date: Sun, 23 Oct 2005 18:54:10 +0200

Hey

I have been running into some performance problems and am wondering what
the general thoughts are on the subject.

The first problem concerns the general way Piggy Bank works right now:
everything is done in a memory store. I wrote a scraper that scraped
~30,000 items with 10 properties each, and I ran into several
problems:
1) Everything was scraped in one transaction (see WorkingModel.java),
which means that after a while I ran out of memory -> crash
2) If I explicitly added getRepository() calls to force new
transactions it got much better, but due to the data size I still ran
out of memory -> crash (see the batching sketch after this list)
3) If I chopped the data up into smaller pieces the scraper worked,
but then submitting it to the central bank caused error messages like
"Form too large" -> crash
4) To get around the above I had to start the scraper manually on
smaller chunks of the data -> works, and I now have all the data, but
it is a workaround and points to a larger problem
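For illustration, here is roughly what I mean by committing in
batches. This is only a sketch; the Store interface is my own
stand-in, not Piggy Bank's actual API (the real code path goes through
WorkingModel and getRepository()):

    // Sketch only: commit scraped items in fixed-size batches instead
    // of one giant transaction. "Store" is a stand-in interface, not
    // Piggy Bank's real API.
    import java.util.List;

    interface Store {
        void add(String subject, String predicate, String object);
        void commit(); // flush the current transaction to the store
    }

    class BatchedScraper {
        private static final int BATCH_SIZE = 500; // tune to memory

        void scrape(List<String[]> triples, Store store) {
            int pending = 0;
            for (String[] t : triples) {
                store.add(t[0], t[1], t[2]);
                // Commit periodically so the set of uncommitted
                // statements (and the memory it pins) stays bounded,
                // instead of growing to 30,000 items x 10 properties
                // before a single commit at the end.
                if (++pending >= BATCH_SIZE) {
                    store.commit();
                    pending = 0;
                }
            }
            if (pending > 0) {
                store.commit(); // flush the final partial batch
            }
        }
    }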

Questions:
1) Are there any thoughts on doing scraping not in-memory? Perhaps the
default should be in-memory, but for people like me who do
"heavy-duty" scrapes there should be an option of doing it in a
slightly slower but working manner, using file-backed repositories.
2) Why is there a restriction on the size of the data submission? And
if there has to be one, why doesn't Piggy Bank itself chop the data up
into smaller pieces and submit them one after the other (see the
sketch after this list)? Or is this just a side-effect of all of this
being work-in-progress?
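As a sketch of what chopped-up submission could look like (the
endpoint URL handling and content type here are assumptions on my
part; I don't know the actual Semantic Bank submission protocol
details):

    // Sketch only: split a big RDF payload into pieces and POST each
    // one separately, so no single request trips the server's form
    // size limit. The content type is an assumption.
    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.List;

    class ChunkedSubmitter {
        private final URL endpoint;

        ChunkedSubmitter(URL endpoint) {
            this.endpoint = endpoint;
        }

        // Each chunk is a self-contained RDF fragment, e.g. a few
        // hundred items' worth of statements.
        void submitAll(List<String> chunks) throws Exception {
            for (String chunk : chunks) {
                HttpURLConnection conn =
                        (HttpURLConnection) endpoint.openConnection();
                conn.setRequestMethod("POST");
                conn.setDoOutput(true);
                conn.setRequestProperty("Content-Type",
                        "application/rdf+xml");
                OutputStream out = conn.getOutputStream();
                out.write(chunk.getBytes("UTF-8"));
                out.close();
                if (conn.getResponseCode() >= 400) {
                    throw new RuntimeException("Submission failed: HTTP "
                            + conn.getResponseCode());
                }
                conn.disconnect();
            }
        }
    }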

This is for Piggy Bank. Then, once all the data is in the bank (in my
case I got 30,000 earthquakes from 1974-2005 to play with), there's
the issue of query performance. It is really, really slow to filter a
data set of that size. If this thing is to be used for really large
data sets (and 30,000 is TINY compared to how much data can be put
into a semantic database), then either the databases have to get
substantially faster, or performance tricks and optimizations will
have to be applied. What are the largest databases other people are
using? What performance issues have others found?

One problem right now seems to be that the UI has to go through the
entire database to find out which values are used, in order to present
selection lists. In some cases I know the accepted value ranges up
front (e.g. earthquake magnitudes are in the range 1-9) and could
hence provide specialized selectors for them, which would neatly avoid
the "search for all values in the data set" problem. Is there a way to
associate property types with viewers like that, and would it be
possible to add one if not?
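As a sketch of what such an association could look like (none of these
names exist in Piggy Bank; they just illustrate declaring a value
range instead of scanning for one):

    // Sketch only: a registry mapping a property URI to a declared
    // value range, so the UI can render a range selector directly
    // instead of scanning every statement for distinct values. All
    // names here are made up for illustration.
    import java.util.HashMap;
    import java.util.Map;

    class SelectorRegistry {
        // Declared [min, max] bounds for properties with known ranges.
        private final Map<String, double[]> ranges =
                new HashMap<String, double[]>();

        void registerRange(String propertyUri, double min, double max) {
            ranges.put(propertyUri, new double[] { min, max });
        }

        // The UI consults this before falling back to a full scan.
        double[] rangeFor(String propertyUri) {
            return ranges.get(propertyUri);
        }
    }

    // Usage: declare the magnitude range up front, e.g.
    //   registry.registerRange("http://example.org/quake#magnitude",
    //           1, 9);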

To be clear, I really, really like this stuff, and I can see TONS of
neat applications of it, but it'd be nice to have some idea of whether
it will scale to much larger applications and data sets, and, if not,
what can be done about it.

regards,
   Rickard
Received on Sun Oct 23 2005 - 16:48:45 EDT
