[RT] Learning from Greasemonkey + Platypus

From: Stefano Mazzocchi <stefanom_at_mit.edu>
Date: Wed, 03 Aug 2005 14:46:24 -0700

Greasemonkey is cool. It's an extension framework built as an extension
(meta-extensions?) that allows you to customize/personalize particular
web sites, on the client side.


Greasemonkey has been taking off like wildfire (especially after Mark
Pilgrim's "Dive into Greasemonkey" book) but it still requires users to
know enough JavaScript (and related technologies like DOM and XPath) to
achieve the desired functionality.

This is where Platypus comes in.


Platypus (no, not the RDF-based wiki!) is a Firefox extension that lets
you edit and modify any given page *directly inside the browser* but,
hear hear, instead of saving the page as HTML (which would be totally
useless) it saves it as a Greasemonkey script and loads it into
Greasemonkey.

For example, you don't care about Google ads? Here is what you have to do:

  1) go to google.com
  2) search for something
  3) turn on Platypus (one click)
  4) click on the Platypus erase button (one click, Platypus activates)
  5) move your mouse to the area in the page that you want to get rid of
     (here Platypus highlights the blocks of the page as you mouse over them)
  6) remove the unwanted content (one click)
  7) click save (one click)
  8) add the wildcards to the URL so that it matches not just that page
     (one selection, one keystroke)
  9) save (one click)

The generated script looks like this:

function do_platypus_script() {
document, null,
}; // Ends do_platypus_script
window.addEventListener("load", function() { do_platypus_script() }, false);
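
To make the idea concrete, here is a minimal sketch of what a page-erasing user script of this kind could look like. It is not Platypus's actual output: the `eraseMatching` helper, the `UNWANTED_XPATH` value, and the `@include` pattern are all assumptions made for illustration. It relies only on the standard DOM XPath `evaluate` call and on the Greasemonkey metadata block, where the `*` in `@include` plays the role of the wildcard added in step 8.

```javascript
// ==UserScript==
// @name        Erase unwanted block (illustrative sketch)
// @include     http://www.google.com/search*
// ==/UserScript==

// 7 is the numeric value of XPathResult.ORDERED_NODE_SNAPSHOT_TYPE.
const ORDERED_NODE_SNAPSHOT_TYPE = 7;

// Remove every node matched by `xpath` and return how many were removed.
// `doc` is any object offering the DOM XPath `evaluate` method.
function eraseMatching(doc, xpath) {
  const hits = doc.evaluate(xpath, doc, null, ORDERED_NODE_SNAPSHOT_TYPE, null);
  let removed = 0;
  for (let i = 0; i < hits.snapshotLength; i++) {
    const node = hits.snapshotItem(i);
    if (node && node.parentNode) {
      node.parentNode.removeChild(node);
      removed++;
    }
  }
  return removed;
}

// Hypothetical XPath for the block chosen in step 6; a real one depends on the page.
const UNWANTED_XPATH = "//div[@id='ads']";

// Only register the load handler when running inside a browser page.
if (typeof window !== "undefined" && typeof document !== "undefined") {
  window.addEventListener("load", function () {
    eraseMatching(document, UNWANTED_XPATH);
  }, false);
}
```

Keeping the removal logic in a function that takes the document as a parameter is a design choice of the sketch: it lets the same code be exercised against a stand-in document outside the browser.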

                                    - o -

I think you are already starting to see where I'm going with this.

David (Karger) suggested that DavidH and I consider writing another
Firefox plugin that would allow users to write Piggy Bank scrapers.

I first dismissed the idea, because I wanted to attract more developers
than end users, but now I think he is right in pointing out that we are
really talking about two different classes of people.

Now, the question on the table is: how would such a scraper-editing
plugin work?

DavidH and I talked about it for a while, and we have two different ideas:

  1) he was going to allow the user to select rectangles on the page, and
the DOM elements that intersect those rectangles would be captured. The
problem is that finding those DOM element intersections is not an
easy task. In this regard, I think Platypus's mouseover-highlight style is
easier to implement and more effective.

  2) my idea was a little different: since what we are after is really
the content that changes between pages that we believe to be similar, I
thought about algorithmic ways to let the browser differentiate between
the template and its content, by allowing the user to indicate two or
more pages that "look alike", for example, two paginated views. This
pre-filtering would allow us to remove all the 'templating' and static
content that normally is not something we are interested in when scraping.
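
As a rough sketch of the second idea, suppose each page has already been flattened into a map from a location (say, an XPath string) to the text found there. That flat model, the `varyingLocations` name, and the sample pages are assumptions made for illustration only; a real implementation would have to align DOM trees that do not line up so neatly.

```javascript
// Each page is a plain object mapping a location string to the text found there.
// Returns the locations whose text is not identical across all pages.
function varyingLocations(pages) {
  if (pages.length < 2) {
    throw new Error("need at least two pages to compare");
  }
  const [first, ...rest] = pages;
  const varying = [];
  for (const path of Object.keys(first)) {
    // A location counts as template when every page has the same text there.
    const isStatic = rest.every((page) => page[path] === first[path]);
    if (!isStatic) varying.push(path);
  }
  // Locations that exist only on later pages are treated as content too.
  for (const page of rest) {
    for (const path of Object.keys(page)) {
      if (!(path in first) && !varying.includes(path)) varying.push(path);
    }
  }
  return varying;
}

// Two hypothetical "look alike" pages, e.g. consecutive result pages.
const pageOne = {
  "/html/body/h1": "Search results",
  "/html/body/ul/li[1]": "Item A",
  "/html/body/div[@id='footer']": "Copyright"
};
const pageTwo = {
  "/html/body/h1": "Search results",
  "/html/body/ul/li[1]": "Item K",
  "/html/body/div[@id='footer']": "Copyright"
};

console.log(varyingLocations([pageOne, pageTwo]));
// → [ '/html/body/ul/li[1]' ]
```

The heading and footer drop out because they are identical on both pages, which is the pre-filtering effect described above.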

Note how the two approaches are really orthogonal: there has to be a
selection process at some point, if we want to enable scraping. Whether
we do it using rectangles, or text selection, or block mouseover
highlight, is secondary at this point.
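
For the rectangle flavour of selection, the purely geometric part can be sketched as below, assuming the bounding box of each candidate element is already known. The `intersects` and `idsInSelection` names and the sample boxes are hypothetical. The sketch deliberately ignores the harder parts, such as obtaining reliable boxes from the browser and deciding which of several nested elements the user actually meant.

```javascript
// Boxes are plain objects with left, top, right and bottom in pixels.
// Two boxes intersect when they overlap on both axes.
function intersects(a, b) {
  return a.left < b.right && b.left < a.right &&
         a.top < b.bottom && b.top < a.bottom;
}

// `items` is a list of { id, box } records; returns the ids whose box
// overlaps the selection rectangle drawn by the user.
function idsInSelection(items, selection) {
  return items
    .filter((item) => intersects(item.box, selection))
    .map((item) => item.id);
}

// Hypothetical layout of three blocks on a page.
const blocks = [
  { id: "title",   box: { left: 0,   top: 0,  right: 200, bottom: 40 } },
  { id: "price",   box: { left: 0,   top: 50, right: 100, bottom: 70 } },
  { id: "sidebar", box: { left: 300, top: 0,  right: 400, bottom: 500 } }
];
const dragged = { left: 10, top: 30, right: 150, bottom: 60 };

console.log(idsInSelection(blocks, dragged));
// → [ 'title', 'price' ]
```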

Also, whether or not we want to 'prefilter' the page to remove all
static content is an independent choice: some people might like that, some
others might not, feeling that the templating is still guiding their
selection choices.

Anyway, should this be a different plugin or should this be built inside
Piggy Bank?

Also, once we have the content we want and the URL and the XPath (should
we say XPointer?) locations, how do we turn them into RDF statements?
How do we guide users to select the available ontologies or empower them
to make their own?
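
One possible shape for the last step, offered only as a sketch: if the user has paired each scraped field with a predicate URI, the statements can be written out as N-Triples with the page URL as the subject. The `toNTriples` and `escapeLiteral` names, the example URL, and the field names are assumptions; the Dublin Core title URI is used only as a familiar example predicate. How users would pick or create those predicates is exactly the open question above, and the sketch does not answer it.

```javascript
// Escape a string for use inside a double-quoted N-Triples literal.
function escapeLiteral(text) {
  return text
    .replace(/\\/g, "\\\\")
    .replace(/"/g, '\\"')
    .replace(/\n/g, "\\n")
    .replace(/\r/g, "\\r");
}

// `fields` maps a field name to its scraped text; `mapping` maps a field
// name to the predicate URI chosen for it. Unmapped fields are skipped.
function toNTriples(subjectUrl, fields, mapping) {
  const lines = [];
  for (const [name, value] of Object.entries(fields)) {
    const predicate = mapping[name];
    if (!predicate) continue;
    lines.push(`<${subjectUrl}> <${predicate}> "${escapeLiteral(value)}" .`);
  }
  return lines.join("\n");
}

// Hypothetical scrape result and user-chosen mapping.
const scraped = { title: 'The "Platypus" book', price: "12" };
const chosen = { title: "http://purl.org/dc/elements/1.1/title" };

console.log(toNTriples("http://example.org/item/1", scraped, chosen));
// → <http://example.org/item/1> <http://purl.org/dc/elements/1.1/title> "The \"Platypus\" book" .
```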

Stefano Mazzocchi
Research Scientist                 Digital Libraries Research Group
Massachusetts Institute of Technology            location: E25-131C
77 Massachusetts Ave                   telephone: +1 (617) 253-1096
Cambridge, MA  02139-4307              email: stefanom at mit . edu
Received on Wed Aug 03 2005 - 21:42:51 EDT
