Re: Piggy Bank, timeline visualization, and scrapers

From: David Huynh <dfhuynh_at_csail.mit.edu>
Date: Tue, 18 Oct 2005 09:29:38 -0400

Rickard,

We are about to start designing richer visualizations, such as
timelines, into our architecture, but it seems that the demands are
getting ahead of us. :-)

Question 1:

If you have your dates in the right format (2005-10-18T09:03:00Z), then
Longwell 2 (embedded in both Piggy Bank and Semantic Bank) already
provides a date-sensitive facet. You can see this for yourself by going
to http://simile.mit.edu/bank/ and clicking on Web Pages in the middle
box. Right now, the date facet breaks down the date range of the items
being shown into centuries, decades, years, months, weeks, days...
whichever is the largest unit of time that gives you more than one
choice. The dropdown
list in the date facet also gives you different options for breaking
down the date range. If all items fall within one day, we also provide
"Before hour" and "After hour". This support is so far incomplete and
only meant to be demonstrative. If you want to add more options, please
take a look at the function internalSuggestBuckets in
DateTimeBucketer.java in the Longwell 2 codebase.
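
In case it helps, here's a rough JavaScript sketch of that bucketing
idea. The unit lengths are approximate and the names are mine, not the
actual logic in DateTimeBucketer.java:

    // Approximate unit lengths in milliseconds, largest first.
    var UNITS = [
        { name: "century", millis: 100 * 365.25 * 86400000 },
        { name: "decade",  millis: 10 * 365.25 * 86400000 },
        { name: "year",    millis: 365.25 * 86400000 },
        { name: "month",   millis: 30 * 86400000 },
        { name: "week",    millis: 7 * 86400000 },
        { name: "day",     millis: 86400000 }
    ];

    // Pick the largest unit that splits [minDate, maxDate] into more
    // than one bucket; if everything falls within one day, we'd move
    // on to hour-based options like "Before hour" / "After hour".
    function suggestUnit(minDate, maxDate) {
        var span = maxDate.getTime() - minDate.getTime();
        for (var i = 0; i < UNITS.length; i++) {
            if (span / UNITS[i].millis > 1) {
                return UNITS[i].name;
            }
        }
        return "day";
    }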

If you want to build your own UI, e.g., a scrollbar, that works with the
map view, then you need to look into this file to start with:
    longwell2/src/ui/html/server-side/templates/panes/map-results-pane.vt
This is not something one can do with little time investment.
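
That said, the core of the scrollbar idea is just filtering by a date
window before plotting. A minimal sketch, assuming each item carries an
ISO 8601 date string (this is not Longwell's actual data model):

    // Keep only the items whose dates fall inside the current window.
    function filterByDateWindow(items, windowStart, windowEnd) {
        var visible = [];
        for (var i = 0; i < items.length; i++) {
            var d = new Date(items[i].date); // e.g., "2005-10-18T09:03:00Z"
            if (d >= windowStart && d <= windowEnd) {
                visible.push(items[i]);
            }
        }
        return visible; // re-plot these on the map as the window moves
    }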

Question 2:

A lot of the scrapers I've written are actually multipage scrapers. The
VacancyGuide.com scraper
    
http://people.csail.mit.edu/dfhuynh/research/downloads/screen-scrapers/vacancyguide-dot-com-scraper.js
goes from one search results page to the next by finding the "Next
Page" link. For each search results page, it accumulates the URL of each
search result's details page.
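
The general pattern looks roughly like this. This is a generic
DOM/XPath sketch, not Piggy Bank's actual scraper API; fetchDocument
and the XPath expressions are placeholders for whatever the target
site needs:

    // Follow "Next Page" links, accumulating each result's details URL.
    function collectDetailUrls(startUrl, fetchDocument) {
        var urls = [];
        var pageUrl = startUrl;
        while (pageUrl != null) {
            var doc = fetchDocument(pageUrl); // assumed to return a parsed DOM
            // Accumulate the URL of each search result's details page.
            var results = doc.evaluate("//a[@class='result']/@href", doc,
                null, XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
            for (var i = 0; i < results.snapshotLength; i++) {
                urls.push(results.snapshotItem(i).value);
            }
            // Find the "Next Page" link, if any, and move on.
            var next = doc.evaluate("//a[text()='Next Page']/@href", doc,
                null, XPathResult.FIRST_ORDERED_NODE_TYPE, null).singleNodeValue;
            pageUrl = next ? next.value : null;
        }
        return urls;
    }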

You might want to try out this little extension (warning: alpha version)
for writing scrapers:
    http://people.csail.mit.edu/dfhuynh/research/downloads/solvent-0.1.0.xpi

Happy scraping! :-)

Cheers,

David


Rickard Öberg wrote:

> Hey
>
> After having thought about building an RDF database/visualizer I
> happened upon Stefano's blog and found Piggy Bank, and after having
> played with it some I ditched all my own ideas and started writing
> scrapers. This stuff is just way way cool! Many thanks for providing it!
>
> I have lots of questions etc., but to start off with there are two core
> things I need, and I wanted to check with you how difficult they would
> be to fix. The majority of the data I want to have uses both locations
> and dates, specifically, I have lots and lots of historical data so
> it's kind of a "four dimensional" thing. So, for my purposes I want to
> combine the Google Maps visualizer with a scrollbar thingy where I can
> set a "window" of dates and then move that scrollbar and have the
> events for that window be shown. If I were to start adding that, how
> would I go about it? Is this something that anyone else could do with
> a relatively small time investment? In any case, I would imagine that
> such a thing would be kind of useful not only for my purposes, but for
> other things as well, e.g. earthquake visualization over a period of
> time (btw, I touched up David's commented-out, but Google-cached, USGS
> scraper so it works), RSS news items over a period of time, etc. Any
> feedback on how to get something like that going would be MUCH
> appreciated.
> Second, I am considering writing a scraper for Wikipedia, to bootstrap
> the historical database. Basically just going in there and scraping up
> all the events, deaths, births etc. that they have. However, instead
> of visiting each page individually and clicking the coin, it would be
> preferable if I could point a scraper at it and just say "fetch
> website"; in other words, the script needs to be able to fetch pages
> on its own based on generated URLs. How easy would it be to do
> something like that? Has anyone done things like that (i.e. multipage
> scraping) before? Examples available?
>
> I've read through the email archives, and it seems like this project
> is kind of in a starting position and that there are performance and
> technical issues to be fixed, but I am betting that those will be
> resolved further down the line :-) I seriously like the approach, and
> will focus on writing scrapers for now.
>
> Alright, good enough for an intro post I guess. Again, thanks for
> providing this great tool!
>
> regards,
> Rickard
>
Received on Tue Oct 18 2005 - 13:26:18 EDT
