Re: [RT] Learning from Greasemonkey + Platypus from David Karger on 2005-08-03

From: David Karger <karger_at_mit.edu>
Date: Wed, 03 Aug 2005 19:19:54 -0400

Stefano Mazzocchi wrote:
> David Karger wrote:
>
>> Stefano, there are at least two separate haystack theses that have
>> useful code that could perhaps be applied to this problem (although
>> that code is java in one case and perl in the other).
>
>
> Awesome! (no worries about the code, I think finding the proper
> algorithm/methodology is probably the hard part, we can port/incorporate
> the code when we need it)
>
>> Kai Shih wrote winnow, which solves the "static parts of the page"
>> detection problem you mentioned (as a first step towards learning
>> which things to pull out of the page based on your historical
>> interests in different parts of the page). This code is up and
>> running at http://shrike.csail.mit.edu/cgi-bin/winnow/frontpage.cgi.
>
>
> bummer, I get a 404.

Typo:
http://shrike.csail.mit.edu/~kai/cgi-bin/winnow/frontpage.cgi
>
>> If you set up a page and read news from it for a few days, it will
>> start recommending news you will like. You'll also notice the ads are
>> missing---image OR text---winnow pulls that off by noticing which
>> parts of the page are usually redirects. Kai's thesis is on the
>> haystack site.
>
>
> hmmm, no occurrence of "shih" in
>
> http://haystack.csail.mit.edu/publications.html
>
> I found this instead
>
> http://www2004.org/proceedings/docs/1p193.pdf
>
Yes, that is the paper. The thesis is around somewhere; I can dig it up
if we need more details than are in the paper.
>> Andrew Hogue wrote thresher. It lets an end user highlight a region
>> of a web page, declare what "type" of object it is, then highlight
>> subregions of the region and declare what "predicates" those regions
>> specify. Based on that example (usually one is sufficient) the system
>> automatically learns a scraper to extract that type of object into RDF
>> from elsewhere on the same page, or from other pages generated by the
>> same site. Thresher is running inside of haystack. It is a
>> separately developed module with a clean API, and already integrated
>> with mozilla, so it should be very easy to pull it out and back into
>> piggy bank.
>
>
> I think I crossed paths with Andrew when I arrived and he was
> finishing... but thought it was only IE. Is the code in the haystack
> repository?

I scrambled things. Andrew implemented it on IE. A following student, Ryan
Manuel, extended the work to scraping web forms (i.e., after scraping you
are able to just specify the arguments and it takes care of building a
proper form submission for you) and also worked on porting it to
Mozilla. I think he succeeded, but we will need to check
(rfmanuel_at_alum.mit.edu).
Received on Wed Aug 03 2005 - 23:16:47 EDT
