Standalone Scraper [was Re: Perl Scraper] from Stefano Mazzocchi on 2006-01-05 (stdin)

From: Stefano Mazzocchi <stefanom_at_mit.edu>
Date: Thu, 05 Jan 2006 13:52:42 -0800

ilango gurusamy wrote:
> David
> I was also wondering if I can add Piggy Bank as a plugin to the Nutch
> crawler I have running on my machine.
>
> Do you think such a thing is possible?

Hmmm, very interesting. That's clearly something that would be very
useful to have.

On the surface, it seems easy, since piggy bank is mostly written in
java and nutch is written in java as well. But there is a problem: what
you are interested in, if I understand your intentions correctly, is
using the scraping part of piggy-bank to obtain richer metadata out of
nutch crawls.

The problem with that is that the part of piggy-bank that does the
javascript scraping is executed by firefox, not by the java virtual
machine. This means that in order for nutch to use piggy-bank as a
metadata extractor, you would have to have nutch call firefox as a
native library but I don't know if that is feasible.

A more elegant approach would be to call firefox's XPCOM subsystem from
java and use the HTML parser and the javascript engine directly,
therefore using only the minimal pieces to replicate the scraping
functionality.

There is some (old, and probably obsolete) code to implement a
java-xpcom component bridge in the mozilla CVS repository at

http://lxr.mozilla.org/mozilla/source/java/xpcom/

but I haven't tried to see if it works.

Another alternative is to write a command line scraping tool that
receives a URL (or list of URLs) at STDIN and spits out RDF statements
at STDOUT. Nutch (or anything else) could invoke that relatively easily
and that doesn't require all the java-xpcom or JNI machinery.
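
To make the idea concrete, here is a minimal sketch of what the
STDIN/STDOUT contract of such a tool could look like. It is only an
illustration under loud assumptions: it uses nothing but the python
standard library, it does not run any piggy bank scraper (the
hypothetical scrape() function merely pulls out the page title), and
the choice of N-Triples output and of the Dublin Core title property
is mine, not something piggy bank prescribes.

```python
import sys
from html.parser import HTMLParser
from urllib.request import urlopen

# Assumption: Dublin Core "title" is used purely as an example predicate.
DC_TITLE = "http://purl.org/dc/elements/1.1/title"


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def escape_literal(text):
    """Escape backslashes, quotes and line breaks for an N-Triples literal."""
    return (text.replace("\\", "\\\\").replace('"', '\\"')
                .replace("\n", "\\n").replace("\r", "\\r"))


def scrape(url, html):
    """Hypothetical stand-in for a real scraper: one triple for the title.

    The URL is written out as-is; a real tool would validate it as an IRI.
    """
    parser = TitleParser()
    parser.feed(html)
    title = " ".join("".join(parser.parts).split())  # collapse whitespace
    if not title:
        return []
    return ['<%s> <%s> "%s" .' % (url, DC_TITLE, escape_literal(title))]


def main(stdin=sys.stdin, stdout=sys.stdout, fetch=None):
    """Read one URL per line from stdin, write N-Triples to stdout."""
    fetch = fetch or (lambda u: urlopen(u).read().decode("utf-8", "replace"))
    for line in stdin:
        url = line.strip()
        if not url:
            continue
        try:
            html = fetch(url)
        except Exception as exc:  # report the failure and keep going
            print("skipping %s: %s" % (url, exc), file=sys.stderr)
            continue
        for triple in scrape(url, html):
            print(triple, file=stdout)


if __name__ == "__main__":
    main()
```

Saved as, say, scrape.py (a made-up name), a crawler could pipe URLs
into it, one per line, and read the statements back from the other
end. The real work would be replacing scrape() with something that
runs the actual piggy bank scrapers.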

Maybe using the PyXPCOM extensions would allow writing such a command
line app in python instead of C++ (for the C++ phobic like me, this
is a plus).

In short, it's anything but easy to add piggy-bank to nutch, although it
would be very interesting to have a command line scraping tool that used
the same exact piggy bank scrapers.

Another alternative is to use JTidy+XML+Rhino in java to reproduce an
equivalent environment in pure java without using any of the xpcom
mozilla code. In theory it should provide the same exact response; in
practice, I believe that a DOM generated by the mozilla HTML parser and
a DOM generated by a JTidy+XML parser would not be identical all the
time, leading to all sorts of cases where the scraper would work on
piggy-bank inside the browser but not on the command line scraping tool.

-- 
Stefano Mazzocchi
Research Scientist                 Digital Libraries Research Group
Massachusetts Institute of Technology            location: E25-131C
77 Massachusetts Ave                   telephone: +1 (617) 253-1096
Cambridge, MA  02139-4307              email: stefanom at mit . edu
-------------------------------------------------------------------

Received on Thu Jan 05 2006 - 21:52:35 EST
