Re: Piggy Bank and Semantic Bank: scalability and performance

From: Rickard Öberg <rickard.oberg_at_senselogic.se>
Date: Sun, 23 Oct 2005 21:58:31 +0200

David Huynh wrote:
> ~30,000 items... You're brave :-)

Well, that's just it... if you think about it, 30,000 is NOTHING. Tiny.
Minuscule. Insignificant. I mean, if you are really, really serious
about building semantic banks and semantic webs, then you should add a
couple of zeroes to that number...

> You can try this (I haven't tried it myself):
>
> // for each reasonably-sized chunk of pages to scrape {
> var aModel = scrapingUtilities.createWorkingModel();
> // scrape into aModel;
> var myPiggyBankProfile =
> PB_Extension.getPiggyBankServer().getDefaultProfile();
> myPiggyBankProfile.addData(aModel.getRepository(), false);
>
> // presumably aModel will get garbage collected
> // }
>
> Of course, this code dumps everything into the My Piggy Bank store
> (native file-backed Sesame store). You can also open a file and write
> out to it. Your scraper JavaScript code can do pretty much anything.

Thanks, that should help some. As I've noted, I want to scrape
Wikipedia, which means a shitload of data 8-)
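
Based on your snippet, something like this is roughly what I have in
mind (just a sketch, untested; CHUNK_SIZE, urls and scrapePage() are
placeholders for my own scraper logic):

    var CHUNK_SIZE = 100; // pages per working model
    var profile = PB_Extension.getPiggyBankServer().getDefaultProfile();

    for (var i = 0; i < urls.length; i += CHUNK_SIZE) {
        // fresh in-memory model for this chunk only
        var model = scrapingUtilities.createWorkingModel();
        var end = Math.min(i + CHUNK_SIZE, urls.length);
        for (var j = i; j < end; j++) {
            scrapePage(urls[j], model); // my scraper fills the model
        }
        // flush this chunk into My Piggy Bank, then drop the model
        profile.addData(model.getRepository(), false);
        model = null; // hopefully garbage collected before the next chunk
    }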

>> Questions:
>> 1) Are there any thoughts on doing scraping not in-memory? Perhaps
>> the default should be in-memory, but for people like me who do
>> "heavy-duty" scrapes there should be an option to do it in a slightly
>> slower but workable manner, using file-backed repositories.
>
>
> I think heavy-duty scraping should not be supported inside Piggy Bank.
> We can have an entirely stand-alone application for that. You probably
> want a sophisticated status-monitoring UI, etc., etc., too.

Indeed. Any ideas on how to do that? For heavy-duty scraping I will
probably also want to run it on a timer, revisiting websites
reasonably often to pick up new data.
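
For the timer part I am thinking of something very simple to start
with (again only a sketch; sites and scrapeSite() are placeholders,
and the interval is arbitrary):

    var SIX_HOURS = 6 * 60 * 60 * 1000;

    function rescrapeAll() {
        // revisit each site and submit any new data in chunks,
        // as in the snippet above
        for (var i = 0; i < sites.length; i++) {
            scrapeSite(sites[i]);
        }
    }

    // naive scheduling; a real stand-alone tool would want per-site
    // intervals plus some status monitoring, as you say
    setInterval(rescrapeAll, SIX_HOURS);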

>> 2) Why is there a restriction on the size of the data submission? And
>> if there is, why doesn't Piggy Bank itself chop the data up into
>> smaller pieces and submit them one after the other? (Or is this just
>> a side-effect of all of this being work in progress?)
>
> It's probably a Firefox restriction or a Jetty restriction. We've
> never hit that limit before.

Hm... I will have to try some more, then.
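
One thing I can do on my side, independent of where the limit is, is
to keep each submission small by flushing whenever my own statement
counter passes some budget (a sketch; MAX_STATEMENTS is a guess, and
statementsAdded is a counter my scraper maintains itself):

    var MAX_STATEMENTS = 5000; // rough budget per submission
    var statementsAdded = 0;
    var model = scrapingUtilities.createWorkingModel();

    function maybeFlush(profile) {
        if (statementsAdded >= MAX_STATEMENTS) {
            // push what we have so far and start over
            // with a fresh working model
            profile.addData(model.getRepository(), false);
            model = scrapingUtilities.createWorkingModel();
            statementsAdded = 0;
        }
    }

The scraper would bump statementsAdded for every statement it adds and
call maybeFlush() after each page.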

>> This is for Piggy Bank. Then, once all the data is in the bank (in my
>> case I got 30,000 earthquakes from 1974-2005 to play with), there's
>> the issue of query performance. Filtering a data set of that size is
>> really, really slow. If this thing is to be used for really large
>> data sets (and I mean, 30,000 is TINY if you consider just how much
>> data can fit into a semantic database), then either the databases
>> have to get substantially faster, or performance tricks and
>> optimizations will have to be done. What are the largest databases
>> other people are using? What performance issues have others found?
>
> I'd say export the data out to N3 or RDF/XML. Then run Longwell2 on it
> using a MySQL database.
>
> We do intend to handle large datasets but we haven't gotten around to
> that yet.

Well, if you want to, I can really recommend the earthquake data :-) I
can send you the scraper if you're interested. It's good as a reality
check, if nothing else ;-)

>> One problem right now seems to be that the UI has to go through the
>> entire database to find out what values are used in order to present
>> selection lists. In some cases I know what the accepted value ranges
>> are (e.g. earthquake magnitudes are in the range 1-9), and could hence
>> provide specialized selectors for them. That should neatly avoid the
>> "search for all values in the data set" problem. Is there a way to
>> associate property types with viewers like that? Would this be
>> possible to do?
>
> There are two sets of tools that we are interested in providing:
> - tools that add value to existing Web information for naive users in
>   their everyday use of the Web (--> Piggy Bank)
> - tools that let domain experts make sense of their information
>   (--> Welkin, ...)
> The first category should not require much configuration, while the
> latter should take advantage of domain knowledge for optimization:
> speed, memory, and UI.

Well, if Piggy Bank could be extended using RDF that describes how
properties should be visualized, that would just be another kind of
data to get in there. I want to allow non-technical people to work
with huge datasets through nice visualizations (yes, it's a challenge,
I know), so being able to extend the PB interface is worth a lot,
since it is web-based. If I can embed applets as the interface for
some of the more complicated things (like timelines), then that is
fine.
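
I am imagining something along these lines; the vocabulary here is
completely made up, just to illustrate the kind of visualization hints
I mean:

    @prefix quake: <http://example.org/quake/> .
    @prefix viz:   <http://example.org/viz/> .   # invented vocabulary

    quake:magnitude
        viz:viewer   viz:RangeSlider ;
        viz:minValue "1"^^<http://www.w3.org/2001/XMLSchema#decimal> ;
        viz:maxValue "9"^^<http://www.w3.org/2001/XMLSchema#decimal> .

A viewer could then pick up these hints and show a fixed 1-9 magnitude
slider instead of scanning the whole store for distinct values.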

>> To be clear, I really, really like this stuff, and I can see TONS of
>> neat applications for it, but it'd be nice to have some idea of
>> whether it will scale to much larger applications and data sets, and
>> what can be done to handle it if not.
>
> Glad you're pushing it to the limit :-) Just curious, have you tried
> plotting 30,000 items on Google Maps?!

Yes. Doesn't work :-)

/Rickard
Received on Sun Oct 23 2005 - 19:53:06 EDT
