Re: Piggy Bank and Semantic Bank: scalability and performance

From: Rickard Öberg <rickard.oberg_at_senselogic.se>
Date: Mon, 24 Oct 2005 17:13:58 +0200

Stefano Mazzocchi wrote:
> Rickard,
>
> we did not expect Piggy Bank scraping to generate that many statements,
> as we imagined, as David mentioned, that all sorts of other projects
> exist for direct "RDF-ization" of datasets.

Okidoki. I just kinda liked the interface of Piggy Bank as it is very
rapid to work with. With some backdoors to allow reasonable scalability
that will be enough for me, for now.

> We started a collection of standalone tools for RDFization in
>
> http://simile.mit.edu/RDFizers/
>
> but since we wanted to keep submissions open for many languages, we
> didn't start a framework for scraping.
>
> I'm positive such a thing will be required and yes, we are fully aware
> that 300Kt (kilo triples) is a tiny thing if you ever want to get
> anywhere with this.
>
> So much that we are going to hire somebody in the next few months, to
> work specifically on scalability issues.

Oh great, then we are on the same page. (I knew we were though, just
wanted to confirm)

>> Questions:
>> 1) Are there any thoughts on doing scraping not in-memory? Perhaps the
>> default should be in-memory, but for people like me who do
>> "heavy-duty" scrapes there should be an option of doing it in a
>> slightly slower but working manner using file-backed repositories
>
> Yes, this would be good, but we don't have the energy to do it ourselves
> at the moment.

Fair enough.
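
To make the file-backed idea in question 1 concrete, here is a rough sketch in Python (chosen only for brevity). It is not how Piggy Bank works today. The class name FileBackedStatements and the one-statement-per-line N-Triples layout are assumptions made up for illustration.

```python
import os
import tempfile
from typing import Iterator


class FileBackedStatements:
    """Hypothetical append-only buffer: scraped statements go to disk, not RAM."""

    def __init__(self) -> None:
        # Assumption: one N-Triples statement per line in a temporary file.
        fd, self.path = tempfile.mkstemp(suffix=".nt")
        self._out = os.fdopen(fd, "w", encoding="utf-8")
        self.count = 0

    def add(self, statement: str) -> None:
        """Append one statement; only the counter stays in memory."""
        self._out.write(statement.rstrip("\n") + "\n")
        self.count += 1

    def __iter__(self) -> Iterator[str]:
        """Stream the statements back from disk one line at a time."""
        self._out.flush()
        with open(self.path, encoding="utf-8") as src:
            for line in src:
                yield line.rstrip("\n")

    def close(self) -> None:
        """Close the buffer and delete the temporary file."""
        self._out.close()
        os.remove(self.path)
```

Slower than a list in memory, but the footprint stays flat no matter how large the scrape gets, which is the trade-off I was asking about.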

>> 2) Why is there a restriction on the size of the data submission?
>
> Probably jetty has an upload max size.

Quite probably, yes.

>> And if there is, why doesn't Piggy Bank itself chop it up into smaller
>> pieces and submit the pieces one after the other? (Or is this just a
>> side-effect of all of this being work-in-progress?)
>
> Because we didn't expect this kind of usage :-)

:-)
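
For what it's worth, the chopping-up I had in mind could look roughly like the sketch below (Python, for brevity). The function name chunk_statements and the max_bytes parameter are hypothetical, and the actual upload limit is whatever the server enforces; I have not checked the real number. It assumes the statements are serialized one per line, N-Triples style.

```python
from typing import Iterable, Iterator, List


def chunk_statements(lines: Iterable[str], max_bytes: int) -> Iterator[List[str]]:
    """Group statement lines into batches whose UTF-8 size stays within max_bytes.

    A single line larger than max_bytes is still emitted, alone in its own batch.
    """
    batch: List[str] = []
    size = 0
    for line in lines:
        line_size = len(line.encode("utf-8")) + 1  # +1 for the newline separator
        if batch and size + line_size > max_bytes:
            # Adding this line would exceed the limit: ship the current batch.
            yield batch
            batch = []
            size = 0
        batch.append(line)
        size += line_size
    if batch:
        yield batch
```

Each batch would then be submitted as its own request, one after the other, so no single upload trips the size limit.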

> We have not tried anything bigger than 600Kt (350Kt data and 250Kt
> inferenced) and we used Longwell 1.x and we needed to go "in memory" to
> reach quick performance in a reasonable timeframe.
>
> We are targeting something along the lines of 150Mt for the next
> iteration and, like I said, we are going to hire somebody to try to get
> there.

Good, then we are, again, on the same page. (I realize there's a
difference between vision and reality, heck I know that from my own
experience, but at least it's good to know the approximate target)

>> Is there a way to associate property types with viewers like that?
>
> Currently there isn't, but this is one of the things that are planned.

Fair enough, that's what I wanted to hear.

>> Would this be possible to do?
>
> Some of us have been working on the Fresnel visualization ontology, a
> vocabulary to create "lenses", sort of views of some RDF data.
>
> http://simile.mit.edu/fresnel/
>
> We plan to integrate Fresnel in the next major release of Longwell, so
> it will be available for both Piggy Bank and Semantic Bank, since they
> both use Longwell.

Excellent!

> Rickard,
>
> as much as I'd love to say that this stuff is rock solid, unfortunately,
> it's not. We have been working on the concepts and architectural visions
> and implemented prototypes and when we had to decide between spending a
> lot of time and energy to "get it right" the first time or "get it out
> there", we decided to use the "release early and often" strategy.
>
> "Good ideas and bad code build communities, the other three combinations
> do not", I said this years ago and I still strongly believe it.

Well... as long as we can agree that there is a difference between "bad
code" and "rotten code" I can see what you're going for... ;-)

> Things like version control of triple stores, scalability of querying,
> inferencing scalability, trust, ontology harmonization and linking, RDF
> extraction and emergence out of semi-structured data.... these are all
> research topics that are very far from being closed.
>
> I also know (and I'm happy to have you around because you know that very
> well too) that not only academia can solve hard research problems and
> sometimes some shared pragmatism is what is needed to bring the problems
> to a closure.

Yup. We are going to need much of this stuff for our commercial uses
(apart from my hobby interests) as well, so if there's anything I can
do, I will.

> For SIMILE, we try to do a very hard thing: mix research, development
> and open source, all done well. It has worked well so far, as we were
> able to get enough traction to get pragmatic people like you excited
> about ideas that were considered academic and overdesigned by many.
>
> But there is *tons* of work to do and the number of corners we can cut
> is getting smaller and smaller since we need to move some of our stuff
> into a state that can be used for real.
>
> That said, we are currently 2 developers and 1 grad student. Not exactly
> a workforce. We are going to hire 3 more developers in the next few
> months and one will be tasked with the scalability and performance issues.
>
> But what we want is to start a community, so that we can share the costs
> and benefits of all this.
>
> Well, the above was a very verbose way to say "patches welcome" :-)

And it was well put, as always. I wasn't worried though, but I think you
knew that.

Cool stuff ahead. Steaming on.

regards,
   Rickard

-- 
Rickard Öberg
rickard.oberg_at_senselogic.se
_at_work +46-(0)19-173036



Received on Mon Oct 24 2005 - 15:11:06 EDT
