Re: model.addTag equivalent for xslt scraper - other piggybank issues

From: David Huynh <dfhuynh_at_csail.mit.edu>
Date: Tue, 31 May 2005 21:17:15 -0400

Brad Clements wrote:

>On 31 May 2005 at 12:28, Ryan Lee wrote:
>[snip]
>
>
>>>4. In an XSLTScreenScraper, can I use the document() method to traverse
>>>"child documents", just like utilities.processDocuments
>>>
>>>How about getLLsFromAddress using xslt? Perhaps utility and model could
>>>be exposed as exslt functions..
>>>
>>>
>>I don't know. I don't expect we've blocked that function.
>>
>>
>
>Well, xslt doesn't really have functions unless you're using exslt. So I don't
>think it's a matter of having "blocked" a function, so much as having not
>"exposed" the functions.
>
>--
>
>I haven't found any documentation on using xslt to screen-scrape, I found
>two examples, but not being very RDF knowledgeable at this point, I'm kind
>of guessing at what the examples do.
>
>
Our support for XSLT-based screen scrapers was a preliminary step toward
supporting GRDDL. I'd personally recommend Javascript for
screen-scraping unless the webpage to scrape is valid or almost-valid
(jtidy-able) XHTML and you're just performing simple transformations of
the XHTML to RDF. Could you tell us why Javascript might not be suitable
for your screen-scraping needs?

>>>5. I can't debug javascript screen scrapers using Venkman. I can't set a
>>>breakpoint in them, and if Venkman is running and I pause it, piggybank
>>>gets a bit confused
>>>
>>>
>>Not being a Venkman user, I can't provide any insight here.
>>
>>
>
>
>Venkman is the Firefox JavaScript debugger. My point is, there seems to be
>no way to debug JavaScript screen scrapers without using alert and debugPrint
>and trying over and over again.
>
>If JavaScript debugging were available, that'd save a lot of time.
>
>
You're right. We just call eval(...) on the screen-scraper script. Do
you know how to call the script such that Venkman can work on it? We can
put that in if we know how.
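One stopgap short of real Venkman integration might look like the following sketch. This is not Piggy Bank's actual loader; the function and parameter names are assumptions for illustration. Instead of a bare eval(...), the host could compile the scraper source with the Function constructor and catch failures, so any error at least gets reported tagged with the scraper's name:

```javascript
// Hypothetical sketch -- not Piggy Bank's real code. Compiling the scraper
// source with new Function (instead of bare eval) lets the host catch and
// label errors per scraper, which makes failures traceable without a debugger.
function runScraper(name, source, api) {
  try {
    // Expose the host API to the scraper under a fixed parameter name.
    var scraper = new Function("utilities", source);
    return { ok: true, value: scraper(api) };
  } catch (e) {
    // Re-report the failure tagged with the scraper's name.
    return { ok: false, error: "scraper '" + name + "' failed: " + e.message };
  }
}
```

For example, `runScraper("delicious", "return utilities.x + 1;", { x: 41 })` returns `{ ok: true, value: 42 }`, while a scraper that throws comes back as an error string naming the scraper.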

>>I think we hadn't anticipated scraping to be so popular as to merit
>>functions for picking between them...
>>
>>
>
>When looking at my del.icio.us page, Piggybank seems to have a built-in
>RSS scraper that doesn't do exactly what I want, so I wrote a JavaScript one
>to process the RSS directly.
>
>So, there are two scrapers, the built-in one, and mine.
>
>and you said:
>
>
>
>
>>I don't know how ultimately successful a generic scraper would be, but I
>>suppose you could make the URL list '^http://.*$' or something for now.
>>
>>
>
>Now imagine I have a scraper for "any wordpress blog" or any "Trac scm
>site".
>
>Their specific URLs could be anything, but the format of the data that's
>output on the page is the same. In this case, the scraper is specific to the
>underlying information provider software, but the url is generic.
>
>Therefore, sticking a bunch of match URLs in an .n3 file is difficult to maintain.
>
>On the other hand, wildcarding a bunch of scrapers seems like overkill too.
>What happens if a scraper "finds something" but it's not exactly the correct
>one for the page I'm currently viewing? What if two scrapers wildcarded as
>you suggest could both produce *something* from the current page?
>
>I think it would be a) slow and a waste of resources to have every single URL I
>visit get scraped, and b) hard to determine which scraper's output
>is the correct one.
>
>In these cases, I'd like to be able to explicitly pick a particular scraper and
>say "apply this".
>
>Maybe "wildcard scrapers" should only be usable via explicit selection:
>click on the data coin, choose a scraper from a list.
>
>
This is a very interesting case. Perhaps we can also make
screen-scrapers match against meta tags inside the HTML; presumably
any WordPress blog has some meta tag that indicates it is a
WordPress blog. This could be quite expensive right when a page loads,
though, as several screen-scrapers might need to be run against the page's HTML DOM.
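A sketch of what meta-tag matching could look like, assuming a hypothetical scraper registry (the field names here are illustrative, not Piggy Bank's real data model). WordPress pages typically carry a `<meta name="generator" content="WordPress ...">` tag, so a scraper could declare a generator pattern instead of a URL pattern:

```javascript
// Hypothetical registry: each scraper declares a pattern on the page's
// "generator" meta tag instead of (or in addition to) a URL pattern.
var scrapers = [
  { name: "wordpress", generatorPattern: /^WordPress/ },
  { name: "trac",      generatorPattern: /^Trac/ }
];

// metaTags: array of { name, content } pairs pulled from the page's <head>.
// Returns the scrapers whose generator pattern matches the page.
function matchScrapers(metaTags, registry) {
  var generator = "";
  for (var i = 0; i < metaTags.length; i++) {
    if (metaTags[i].name === "generator") { generator = metaTags[i].content; }
  }
  return registry.filter(function (s) {
    return s.generatorPattern.test(generator);
  });
}
```

A page whose head contains `<meta name="generator" content="WordPress 1.5.1">` would match only the "wordpress" entry; a page with no generator tag matches nothing, avoiding the run-everything cost on most pages.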

>Also, I think javascript scrapers should be able to signal to piggybank the
>difference between:
>
>1. I found no data
>
>2. I don't match this page
>
>
>Currently, if I'm using xpath and I don't match the page, I don't add anything
>to the model (hopefully)
>
>But to the end user, that appears to be the same as just not finding any
>data.
>
>But I suggest they are not the same, and the end user may be helped
>(especially in the case of a "generic scraper") by being informed that "this
>scraper doesn't match this page".
>
>So, going back to my weak example of a "trac scraper", I could add some
>simple javascript that looks for particular elements/classes and what-not,
>and if not found, I could throw a javascript exception, or return a particular
>value or call a utility method -- do something that tells piggybank "I don't
>match this page".
>
>Piggybank could then do a number of things with that knowledge, like tell
>the user "the scraper you selected won't work on this page" or in the case of
>multiple wildcard scrapers, maybe try the next one.
>
>This is different from scraping a page and finding no matching data.
>
>
The generic screen-scrapers would cause the data coin icon to show up
all the time. Is this desirable? It might raise false hope at the
prospect of being able to collect RDF data.
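The no-match/no-data distinction Brad describes could be signaled with a simple return convention. This is a hypothetical protocol sketch, not an existing Piggy Bank API: the scraper first checks for the page structure it expects, reports NO_MATCH if it is absent, and otherwise scrapes and reports whatever it found (possibly nothing):

```javascript
// Hypothetical return convention for distinguishing "wrong page" from
// "right page, no data" -- an assumption, not an existing Piggy Bank API.
var NO_MATCH = "no-match";

function scrapeTrac(doc) {
  // doc is a minimal stand-in for the page: { classes: [...], rows: [...] }.
  if (doc.classes.indexOf("trac-content") === -1) {
    return { status: NO_MATCH, items: [] };       // not a Trac page at all
  }
  return { status: "matched", items: doc.rows };  // may legitimately be empty
}
```

The host could then tell the user "the scraper you selected won't work on this page" on NO_MATCH, or move on to the next wildcard scraper, while "matched with zero items" stays a distinct, honest answer.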

>oh, what about javascript scrapers calling other javascript scrapers?
>
>Suppose I have a bookmarked list of websites that can be scraped.
>Suppose I want to periodically update piggybank from all these pages.
>
>I could have a scraper run down my bookmark list, then "call" a new utility
>method that says "grab this url and scrape whatever you can from it".
>
>Or "grab this url and apply scraper 'x' "
>
>(though, now we need some way of providing configuration data to the outer
>scraper)
>
>But anyway, it seems like the API isn't complete without true recursion. The
>utility methods assume that the currently running script knows how to
>scrape the recursed document.
>
>I suppose for now that's generally true, but down the road, will it be helpful
>to be able to 'recurse' back into piggybank at a higher level?
>
>
Oh boy. I'm sure you're aware of the Greasemonkey and Chickenfoot
extensions. I wonder what's the best way to combine them with PB.
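Brad's "grab this URL and apply scraper 'x'" utility might be sketched like this. Everything here is an assumption for illustration (the registry shape, the names, and the stand-in for a fetched page are not real Piggy Bank APIs); the point is just that an outer scraper hands work back to the host, which dispatches to a named scraper:

```javascript
// Hypothetical dispatch utility -- an assumed API, not Piggy Bank's.
// The host looks up the named scraper and runs it on the given document.
function scrapeWith(registry, scraperName, doc) {
  var scraper = registry[scraperName];
  if (!scraper) { throw new Error("unknown scraper: " + scraperName); }
  return scraper(doc);
}

// An outer "bookmark list" scraper could then recurse back into the host,
// applying a (possibly different) scraper to each bookmarked page.
function scrapeBookmarks(registry, bookmarks) {
  // bookmarks: array of { scraper, doc } pairs; doc stands in for a fetched page.
  return bookmarks.map(function (b) {
    return scrapeWith(registry, b.scraper, b.doc);
  });
}
```

This keeps the recursion at the host level, so the outer scraper never needs to know how the inner page is scraped, which is exactly the gap Brad points out in the current utility methods.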

David
Received on Wed Jun 01 2005 - 01:15:28 EDT

This archive was generated by hypermail 2.3.0 : Thu Aug 09 2012 - 16:39:18 EDT