Re: Piggy Bank, timeline visualization, and scrapers

From: Rickard Öberg <>
Date: Wed, 19 Oct 2005 09:46:32 +0200


I managed to figure out the flow of multi-page scrapers, but now that I try it on a slightly larger test I run into memory problems. First, if you are doing multi-page stuff (as in nr > 1000) it appears to be a good idea to call model.getRepository() for every page in order to flush the transaction; otherwise the process grinds to a halt for lack of memory. Second, since the repository that the transaction is flushed to is itself a memory model, that would set a limit on how much data can be scraped in one go as well. For example, I want to do a scraper that gets data from, and then we're not talking 1000 pages or so, but much, much more. To get that to work, the data has to be flushed down into a persistent store once in a while. Is there any way to do that?
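To make concrete what I'm after, here is a rough sketch of the scrape loop. Only getRepository() comes from the actual API as far as I know; addStatementsFrom() and persistToStore() are made-up stand-ins for "scrape one page" and the persistent-store flush I'm asking about:

```javascript
// Sketch of a multi-page scrape loop with periodic flushing.
// `model` is a stand-in for the scraper's model object; only
// getRepository() is real -- addStatementsFrom() and persistToStore()
// are hypothetical names for the missing pieces.

const FLUSH_EVERY = 100; // pages per persistent flush; tune to available memory

function scrapeAll(pageUrls, model) {
  let scrapedCount = 0;
  for (const url of pageUrls) {
    model.addStatementsFrom(url); // hypothetical: scrape one page into the model
    scrapedCount++;
    // Flush the transaction after every page so statements don't pile up.
    model.getRepository();
    if (scrapedCount % FLUSH_EVERY === 0) {
      // Hypothetical hook: push accumulated data down to a persistent
      // store and free the in-memory repository.
      model.persistToStore();
    }
  }
  return scrapedCount;
}
```

The per-page getRepository() call is what keeps the transaction small; the persistToStore() step every FLUSH_EVERY pages is the part I don't know how to do with a memory-backed repository.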

Other than that I'm happy for now. It all works reasonably well, except that I can't get my Google Maps to work on the Semantic Bank server. Do I have to have an API key for that as well? If so, where do I put it?

Received on Wed Oct 19 2005 - 07:41:15 EDT

This archive was generated by hypermail 2.3.0 : Thu Aug 09 2012 - 16:39:18 EDT