Re: Piggy Bank and Semantic Bank: scalability and performance

From: David Huynh <dfhuynh_at_csail.mit.edu>
Date: Sun, 23 Oct 2005 15:50:10 -0400

Rickard Öberg wrote:

> Hey
>
> I have been running into some performance problems and am wondering
> what the general thoughts are on the subject.
>
> The first one is a problem with the general way Piggy Bank works right
> now. First of all everything is done in a memory store. I wrote a
> scraper that scraped around 30,000 items with 10 properties each.
> There were several problems I ran into:
> 1) Everything was scraped in one transaction (see WorkingModel.java)
> which means that after a while I ran out of memory -> crash
> 2) If I explicitly added getRepository() calls to force new
> transactions it got much better, but due to the data size I still ran
> out of memory -> crash
> 3) If I chop it up into smaller pieces the scraper works, but then
> submitting it to the central bank causes error messages like "Form too
> large" -> crash
> 4) To get around the above I had to start the scraper manually on
> smaller chunks of the data -> works, and I now have all the data, but
> it is a workaround and points to a larger problem

~30,000 items... You're brave :-)

You can try this (I haven't tried it myself):

// for each reasonably-sized chunk of pages to scrape {
    var aModel = scrapingUtilities.createWorkingModel();
    // scrape into aModel;

    var myPiggyBankProfile =
        PB_Extension.getPiggyBankServer().getDefaultProfile();
    myPiggyBankProfile.addData(aModel.getRepository(), false);

    // presumably aModel will get garbage collected
// }

Of course, this code dumps everything into the My Piggy Bank store (a
native, file-backed Sesame store). You can also open a file and write
the data out to it yourself; your scraper JavaScript code can do pretty
much anything.
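
For example, something like the following might work for writing a file
from scraper code, using standard Mozilla XPCOM calls. I haven't tried
this inside a scraper; it assumes the scraper context can reach
Components (i.e. runs with chrome privileges), and the file path and the
serialization step are just placeholders:

    var file = Components.classes["@mozilla.org/file/local;1"]
        .createInstance(Components.interfaces.nsILocalFile);
    file.initWithPath("/tmp/scraped-data.n3");   // placeholder path

    var stream = Components
        .classes["@mozilla.org/network/file-output-stream;1"]
        .createInstance(Components.interfaces.nsIFileOutputStream);
    // 0x02 = write, 0x08 = create, 0x20 = truncate
    stream.init(file, 0x02 | 0x08 | 0x20, 0644, 0);

    var data = "";   // serialize your scraped statements here, e.g. as N3 text
    stream.write(data, data.length);
    stream.close();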

> Questions:
> 1) Are there any thoughts on doing scraping not in-memory? Perhaps the
> default should be in-memory, but for people like me who do
> "heavy-duty" scrapes there should be an option of doing it in a
> slightly slower but working manner using file-backed repositories

I think heavy-duty scraping should not be supported inside Piggy Bank.
We can have an entirely stand-alone application for that. You would
probably also want a sophisticated status-monitoring UI, and so on.

> 2) Why is there a restriction on the size of the data submission? And
> if there is, why doesn't Piggy Bank itself chop it up into smaller
> pieces and submit them one after the other? (Or is this just a
> side-effect of all of this being work-in-progress?)

It's probably a restriction in Firefox or in Jetty (Jetty, for instance,
caps the size of posted form content by default). We've never hit that
limit before.

> This is for Piggy Bank. Then, once all the data is in the bank (in my
> case I got 30,000 earthquakes from 1974-2005 to play with) there's the
> issue of query performance. It is really really slow to do filtering
> of that size of data. If this thing is to be used for really really
> large data sets (and I mean, 30,000 is TINY if you consider just how
> much data can be filled into a semantic database) then either the
> databases have to get substantially faster, or performance tricks and
> optimizations will have to be done. What are the largest databases
> other people are using? What performance issues have others found?

I'd say export the data out to N3 or RDF/XML, then run Longwell2 on it
using a MySQL database.
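
To give a rough sense of what an N3 export might look like, here is a
hypothetical earthquake item; the namespace and property names are made
up for illustration, not what Piggy Bank actually emits:

    @prefix quake: <http://example.org/earthquake/> .

    <http://example.org/earthquake/1974-0001>
        quake:magnitude "5.4" ;
        quake:date      "1974-01-05" ;
        quake:latitude  "35.62" ;
        quake:longitude "-118.43" .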

We do intend to handle large datasets but we haven't gotten around to
that yet.

> One problem right now seems to be that the UI has to go through the
> entire database to find out what values are used in order to present
> selection lists. In some cases I know what the accepted value ranges
> are (e.g. earthquake magnitudes are in the range 1-9), and could hence
> provide specialized selectors for that. This would nicely avoid the
> "search for all values in the data set" problem. Is there a
> way to associate property types with viewers like that? Would this be
> possible to do?

There are two sets of tools that we are interested in providing:
- tools that add value to existing Web information for naive users in
their everyday use of the Web (--> Piggy Bank)
- tools that let domain experts make sense of their information
(--> Welkin, ...)
The first category should not require much configuration, while the
latter should take advantage of domain knowledge to optimize speed,
memory use, and UI.

> To be clear, I really really like this stuff, and I can see TONS of
> neat applications of it, but it'd be nice to have some idea of whether
> it will scale to much larger applications and data sets, and what can
> be done to handle it if not.

Glad you're pushing it to the limit :-) Just curious, have you tried
plotting 30,000 items on Google Maps?!

David
Received on Sun Oct 23 2005 - 19:44:41 EDT