Re: [RT] Learning from Greasemonkey + Platypus

From: David Karger <karger_at_mit.edu>
Date: Wed, 03 Aug 2005 18:06:42 -0400

Stefano, there are at least two separate Haystack theses with useful
code that could perhaps be applied to this problem (although that code
is Java in one case and Perl in the other).

Kai Shih wrote Winnow, which solves the "static parts of the page"
detection problem you mentioned (as a first step towards learning which
things to pull out of a page based on your historical interest in its
different parts). The code is up and running at
http://shrike.csail.mit.edu/cgi-bin/winnow/frontpage.cgi. If you set up
a page and read news from it for a few days, it will start recommending
news you will like. You'll also notice the ads are missing (image or
text alike): Winnow pulls that off by noticing which parts of the page
are usually redirects. Kai's thesis is on the Haystack site.
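
To give the flavor of the redirect heuristic, here is a sketch I am
improvising (it is not Kai's code, and the URL patterns are purely
illustrative; Winnow learns which regions redirect by watching the page
over time rather than by matching fixed patterns):

function looksLikeRedirect(href) {
  // Ad links typically route through a click-tracking redirector,
  // e.g. http://ads.example.com/click?url=http%3A%2F%2Fsponsor.com%2F
  return /[?&][^=&]*=https?(%3A|:)/i.test(href) ||
         /\/(click|redirect|adclick)\b/i.test(href);
}

function hideLikelyAds(doc) {
  var anchors = doc.getElementsByTagName('a');
  for (var i = 0; i < anchors.length; i++) {
    if (looksLikeRedirect(anchors[i].href)) {
      anchors[i].style.display = 'none'; // hide the suspected ad link
    }
  }
}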

Andrew Hogue wrote Thresher. It lets an end user highlight a region of
a web page, declare what "type" of object it is, then highlight
subregions of that region and declare what "predicates" those
subregions specify. From that example (usually one is sufficient) the
system automatically learns a scraper that extracts that type of object
into RDF from elsewhere on the same page, or from other pages generated
by the same site. Thresher is running inside of Haystack. It is a
separately developed module with a clean API, already integrated with
Mozilla, so it should be very easy to pull it out and drop it into
Piggy Bank. Andrew's thesis is also on the Haystack web site.
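
The simplest way to picture the learning step (the thesis describes the
real algorithm, which is considerably more robust; this toy version of
mine just relaxes one positional index in the example's XPath so that it
matches every sibling record):

function generalize(xpath, step) {
  // e.g. generalize('/HTML[1]/BODY[1]/DIV[2]/TR[3]', 4)
  //   -> '/HTML[1]/BODY[1]/DIV[2]/TR', which matches every row
  var parts = xpath.split('/');
  parts[step] = parts[step].replace(/\[\d+\]$/, '');
  return parts.join('/');
}

Evaluating the relaxed path with document.evaluate (using
XPathResult.ORDERED_NODE_ITERATOR_TYPE) then yields one node per record,
and the subregion paths, taken relative to each record, supply the
predicate values.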


Stefano Mazzocchi wrote:
> Greasemonkey is cool. It's an extension framework built as an extension
> (a meta-extension?) that allows you to customize/personalize particular
> web sites on the client side.
>
> http://greasemonkey.mozdev.org
>
> Greasemonkey has been spreading like wildfire (especially after Mark
> Pilgrim's "Dive into Greasemonkey" book), but it still requires users to
> know enough JavaScript (and related technologies like the DOM and XPath)
> to achieve the functionality they want.
>
> Here is where Platypus comes in.
>
> http://platypus.mozdev.org
>
> Platypus (no, not the RDF-based wiki!) is a Firefox extension that lets
> you edit and modify any given page *directly inside the browser*, but,
> get this, instead of saving the page as HTML (which would be totally
> useless) it saves your edits as a Greasemonkey script and loads it into
> Greasemonkey!
>
> For example, say you don't care about Google ads. Here is what you do:
>
> 1) go to google.com
> 2) search for something
> 3) turn on Platypus (one click)
> 4) click on the Platypus erase button (one click, Platypus activates)
> 5) move your mouse over the area of the page that you want to get rid of
> (Platypus highlights the blocks of the page as you mouse over them)
> 6) remove the unwanted content (one click)
> 7) click save (one click)
> 8) add wildcards to the URL so that it matches more than just that page
> (one selection, one keystroke)
> 9) save (one click)
>
> The generated script looks like this:
>
> function do_platypus_script() {
>   platypus_do_function(window, 'erase',
>     document.evaluate(
>       '/HTML[1]/BODY[1]/DIV[1]/TABLE[1]/TBODY[1]/TR[3]/TD[1]',
>       document, null,
>       XPathResult.FIRST_ORDERED_NODE_TYPE, null).singleNodeValue,
>     'null');
> }; // Ends do_platypus_script
> window.addEventListener("load", function() { do_platypus_script() },
>   false); //.user.js
>
> - o -
>
> I think you are already starting to see where I'm going with this.
>
> David (Karger) suggested that DavidH and I consider writing another
> Firefox plugin that would allow users to write Piggy Bank scrapers.
>
> I first dismissed the idea because I wanted to attract more developers
> than end-users, but now I think he is right in pointing out that we are
> really talking about two different classes of people.
>
> Now, the question on the table is: how would such a scraper-editing
> plugin work?
>
> DavidH and I talked about it for a while, and we have two different
> approaches:
>
> 1) His idea was to let the user select rectangles on the page; the DOM
> elements that intersect those rectangles would be captured. The problem
> is that computing those DOM element intersections is not an easy task.
> In this regard, I think the Platypus mouseover-highlight style is easier
> to implement and more effective (see the first sketch below).
>
> 2) My idea was a little different: since what we are really after is the
> content that changes between pages we believe to be similar, I thought
> about algorithmic ways to let the browser tell the template apart from
> its content, by allowing the user to indicate two or more pages that
> "look alike", for example, two paginated views. This pre-filtering would
> remove all the 'templating' and static content that we are normally not
> interested in when scraping (see the second sketch below).
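>
> To make (1) concrete, a mouseover-highlight picker can be quite small.
> This is only a sketch (the styling is mine, and captureNode is a
> hypothetical callback, not anything that exists in Piggy Bank):
>
> var current = null;
> document.addEventListener('mouseover', function(e) {
>   if (current) current.style.outline = '';  // un-highlight previous block
>   current = e.target;
>   current.style.outline = '2px solid red';  // highlight hovered block
> }, false);
> document.addEventListener('click', function(e) {
>   e.preventDefault();
>   captureNode(e.target);  // hypothetical: hand the picked node to the editor
> }, false);
>
> And for (2), the template/content split could start as a parallel walk
> over two "look-alike" pages, marking whatever differs as content (again
> a sketch, not an implementation; 'x-varies' is a made-up marker
> attribute):
>
> function markContent(a, b) {
>   // Structures diverge: everything below here is probably content.
>   if (a.nodeType != b.nodeType ||
>       a.childNodes.length != b.childNodes.length) {
>     if (a.nodeType == 1) a.setAttribute('x-varies', 'true');
>     return;
>   }
>   // Same text node position but different text: content as well.
>   if (a.nodeType == 3 && a.nodeValue != b.nodeValue) {
>     a.parentNode.setAttribute('x-varies', 'true');
>     return;
>   }
>   for (var i = 0; i < a.childNodes.length; i++)
>     markContent(a.childNodes[i], b.childNodes[i]);  // same shape: recurse
> }
>
> Run it on the two documents' root elements and whatever is left
> unmarked is presumably template, which the UI could dim or hide before
> the user starts selecting.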
>
> Note how the two approaches are really orthogonal: there has to be a
> selection process at some point if we want to enable scraping. Whether
> we do it with rectangles, text selection, or block mouseover
> highlighting is secondary at this point.
>
> Also, whether or not we 'prefilter' the page to remove all the static
> content is an independent choice: some people might like it, others
> might not, feeling that the templating still guides their selection
> choices.
>
> Anyway, should this be a separate plugin, or should it be built inside
> Piggy Bank?
>
> Also, once we have the content we want and the URL and XPath (should we
> say XPointers?) locations, how do we turn them into RDF statements? How
> do we guide users to select among the available ontologies, or empower
> them to make their own?
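>
> Just to make the target concrete, I imagine each captured (URL, XPath,
> value) tuple becoming a statement roughly like this (the vocabulary URI
> here is made up; which ontology to use is exactly the open question):
>
> function toStatement(pageURL, property, value) {
>   return { subject:   pageURL,                        // the scraped page
>            predicate: 'http://example.org/ns#' + property,
>            object:    value };
> }
> // e.g. toStatement('http://example.com/item/42', 'price', '$1200')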
>