Re: [RT] Learning from Greasemonkey + Platypus

From: Stefano Mazzocchi <stefanom_at_mit.edu>
Date: Wed, 03 Aug 2005 16:12:27 -0700

David Karger wrote:
> Stefano, there are at least two separate haystack theses that have
> useful code that could perhaps be applied to this problem (although that
> code is java in one case and perl in the other).

Awesome! (No worries about the code: I think finding the proper
algorithm/methodology is probably the hard part, and we can
port/incorporate the code when we need it.)

> Kai Shih wrote winnow, which solves the "static parts of the page"
> detection problem you mentioned (as a first step towards learning which
> things to pull out of the page based on your historical interests in
> different parts of the page). This code is up and running at
> http://shrike.csail.mit.edu/cgi-bin/winnow/frontpage.cgi.

Bummer, I get a 404.

> If you set up
> a page and read news from it for a few days, it will start recommending
> news you will like. You'll also notice the ads are missing---image OR
> text---winnow pulls that off by noticing which parts of the page are
> usually redirects. Kai's thesis is on the haystack site.

hmmm, no occurrence of "shih" in

http://haystack.csail.mit.edu/publications.html

I found this instead

http://www2004.org/proceedings/docs/1p193.pdf

> Andrew Hogue wrote thresher. It lets an end user highlight a region of
> a web page, declare what "type" of object it is, then highlight
> subregions of the region and declare what "predicates" those regions
> specify. Based on that example (usually one is sufficient) the system
> automatically learns a scraper to extract that type of object into RDF
> from elsewhere on the same page, or from other pages generated by the
> same site. Thresher is running inside of haystack. It is a separately
> developed module with a clean API, and already integrated with mozilla,
> so it should be very easy to pull it out and back into piggy bank.

I think I crossed paths with Andrew when I arrived and he was
finishing... but I thought it was IE-only. Is the code in the haystack
repository?

> Andrew's thesis is also on the haystack web site.

got it.

Thanks much.

Don't hold your breath for any implementation of all this, though: I'm
already swamped trying to understand why PB doesn't clean up its memory mess.

-- 
Stefano Mazzocchi
Research Scientist                 Digital Libraries Research Group
Massachusetts Institute of Technology            location: E25-131C
77 Massachusetts Ave                   telephone: +1 (617) 253-1096
Cambridge, MA  02139-4307              email: stefanom at mit . edu
-------------------------------------------------------------------
Received on Wed Aug 03 2005 - 23:08:53 EDT
