David Karger wrote:
> Stefano, there are at least two separate haystack theses that have
> useful code that could perhaps be applied to this problem (although that
> code is java in one case and perl in the other).
Awesome! (No worries about the code: I think finding the proper
algorithm/methodology is probably the hard part, and we can
port/incorporate the code when we need it.)
> Kai Shih wrote winnow, which solves the "static parts of the page"
> detection problem you mentioned (as a first step towards learning which
> things to pull out of the page based on your historical interests in
> different parts of the page). This code is up and running at
> http://shrike.csail.mit.edu/cgi-bin/winnow/frontpage.cgi.
bummer, I get a 404.
> If you set up
> a page and read news from it for a few days, it will start recommending
> news you will like. You'll also notice the ads are missing---image OR
> text---winnow pulls that off by noticing which parts of the page are
> usually redirects. Kai's thesis is on the haystack site.
hmmm, no occurrence of "shih" in
http://haystack.csail.mit.edu/publications.html
I found this instead
http://www2004.org/proceedings/docs/1p193.pdf
> Andrew Hogue wrote thresher. It lets an end user highlight a region of
> a web page, declare what "type" of object it is, then highlight
> subregions of the region and declare what "predicates" those regions
> specify. Based on that example (usually one is sufficient) the system
> automatically learns a scraper to extract that type of object into RDF
> from elsewhere on the same page, or from other pages generated by the
> same site. Thresher is running inside of haystack. It is a separately
> developed module with a clean API, and already integrated with mozilla,
> so it should be very easy to pull it out and back into piggy bank.
I think I crossed paths with Andrew when I arrived and he was
finishing... but I thought it was IE-only. Is the code in the haystack
repository?
> Andrew's thesis is also on the haystack web site.
got it.
Thanks much.
Don't hold your breath for any implementation of all this, though; I'm
already swamped trying to understand why PB doesn't clean up its memory mess.
--
Stefano Mazzocchi
Research Scientist                 Digital Libraries Research Group
Massachusetts Institute of Technology            location: E25-131C
77 Massachusetts Ave                   telephone: +1 (617) 253-1096
Cambridge, MA  02139-4307              email: stefanom at mit . edu
-------------------------------------------------------------------
Received on Wed Aug 03 2005 - 23:08:53 EDT