Re: model.addTag equivalent for xslt scraper - other piggybank issues

From: Brad Clements <bkc_at_murkworks.com>
Date: Tue, 31 May 2005 13:43:01 -0400

On 31 May 2005 at 12:28, Ryan Lee wrote:

> > using piggybank. So, I'll have to modify that pop-up tag form and underlying
> > java xpcom code to support adding a dc:description property.
>
> Submit a patch :) Or add it as a feature request in our issues tracker

Noted

> > 4. In an XSLTScreenScraper, can I use the document() method to traverse
> > "child documents", just like utilities.processDocuments
> >
> > How about getLLsFromAddress using xslt? Perhaps utility and model could
> > be exposed as exslt functions..
>
> I don't know. I don't expect we've blocked that function.

Well, XSLT doesn't really have extension functions unless you're using 
EXSLT. So I don't think it's a matter of having "blocked" a function so 
much as having not "exposed" the functions.

--
I haven't found any documentation on using XSLT to screen-scrape. I found 
two examples, but not being very RDF-knowledgeable at this point, I'm 
mostly guessing at what the examples do.

> > 5. I can't debug javascript screen scrapers using Venkman. I can't set a 
> > breakpoint in them, and if Venkman is running and I pause it, piggybank 
> > gets a bit confused
> 
> Not being a Venkman user, I can't provide any insight here.

Venkman is the Firefox JavaScript debugger. My point is that there seems 
to be no way to debug JavaScript screen scrapers without alert() and 
debugPrint() and trying over and over again. If JavaScript debugging were 
available, that would save a lot of time.

> I think we hadn't anticipated scraping to be so popular as to merit 
> functions for picking between them...

When looking at my del.icio.us page, Piggybank seems to have a built-in 
RSS scraper that doesn't do exactly what I want, so I wrote a JavaScript 
one to process the RSS directly. So there are two scrapers: the built-in 
one and mine.
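
(For the curious, the guts of mine are roughly the sketch below. 
document.evaluate() is standard Firefox DOM and debugPrint() I mentioned 
above; the model.addTag() call at the end is only a stand-in for 
whatever the underlying xpcom layer actually expects.)

    // Walk the feed's <item> elements; local-name() sidesteps the
    // RSS namespace so no namespace resolver is needed.
    var items = document.evaluate("//*[local-name()='item']",
        document, null, XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
    for (var i = 0; i < items.snapshotLength; i++) {
        var item = items.snapshotItem(i);
        // STRING_TYPE yields the string-value of the first match.
        var link = document.evaluate("*[local-name()='link']", item,
            null, XPathResult.STRING_TYPE, null).stringValue;
        var title = document.evaluate("*[local-name()='title']", item,
            null, XPathResult.STRING_TYPE, null).stringValue;
        debugPrint("scraped " + link);
        // Stand-in: whatever call actually adds this to the model.
        model.addTag(link, title);
    }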

And you said:
> I don't know how ultimately successful a generic scraper would be, but I 
> suppose you could make the URL list '^http://.*$' or something for now.

Now imagine I have a scraper for "any WordPress blog" or "any Trac SCM 
site". The specific URLs could be anything, but the format of the data 
output on the page is the same. In this case the scraper is specific to 
the underlying information-provider software, but the URL is generic. 
Therefore sticking a bunch of match URLs in an .n3 file is difficult to 
maintain.
On the other hand, wildcarding a bunch of scrapers seems like overkill 
too. What happens if a scraper "finds something" but it's not exactly the 
correct one for the page I'm currently viewing? What if two scrapers 
wildcarded as you suggest could each produce *something* from the current 
page? I think it would be a) slow and a waste of resources to have every 
single URL I visit get scraped, and b) problematic to determine which 
scraper's results are the correct ones.
In these cases, I'd like to be able to explicitly pick a particular 
scraper and say "apply this". Maybe "wildcard scrapers" should only be 
usable via explicit selection: click on the data coin, choose a scraper 
from a list.
--
Also, I think JavaScript scrapers should be able to signal to piggybank 
the difference between:
1. "I found no data"
2. "I don't match this page"
Currently, if I'm using XPath and I don't match the page, I don't add 
anything to the model (hopefully). But to the end user, that looks the 
same as just not finding any data. I suggest they are not the same, and 
the end user may be helped (especially in the "generic scraper" case) by 
being told "this scraper doesn't match this page".
So, going back to my weak example of a "Trac scraper": I could add some 
simple JavaScript that looks for particular elements, classes, and 
what-not, and if they aren't found, throw a JavaScript exception, return 
a particular value, or call a utility method -- do something that tells 
piggybank "I don't match this page".
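
For instance (the "NoMatch" signal is made up here, and the table class 
is just a guess at Trac's markup):

    // Look for a Trac-ish marker element before scraping anything.
    var marker = document.evaluate(
        "//table[contains(@class, 'listing')]", document, null,
        XPathResult.FIRST_ORDERED_NODE_TYPE, null).singleNodeValue;
    if (marker == null) {
        // "I don't match this page" -- distinct from "no data";
        // piggybank would have to define what this signal really is.
        throw { name: "NoMatch", message: "not a Trac page" };
    }
    // Past this point, finding zero rows genuinely means "no data".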

Piggybank could then do a number of things with that knowledge, like 
tell the user "the scraper you selected won't work on this page", or, in 
the case of multiple wildcard scrapers, maybe try the next one. This is 
different from scraping a page and finding no matching data.
--
Trac is just an example I picked out of the blue:
http://www.edgewall.com/trac/
--
Oh, and what about JavaScript scrapers calling other JavaScript scrapers? 
Suppose I have a bookmarked list of websites that can be scraped, and I 
want to periodically update piggybank from all these pages. I could have 
a scraper run down my bookmark list, then "call" a new utility method 
that says "grab this URL and scrape whatever you can from it", or "grab 
this URL and apply scraper 'x'" (though now we need some way of providing 
configuration data to the outer scraper).
But anyway, it seems like the API isn't complete without true recursion. 
The utility methods assume that the currently running script knows how to 
scrape the recursed-into document. I suppose for now that's generally 
true, but down the road, will it be helpful to be able to "recurse" back 
into piggybank at a higher level?
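
Something like this sketch is what I'm picturing, where scrapeURL() and 
scrapeURLWith() are the utility methods I'm asking for (they don't exist 
today):

    // Run down a bookmark page and hand each URL back to piggybank
    // at the top level, rather than scraping it in this script.
    var links = document.evaluate("//a/@href", document, null,
        XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
    for (var i = 0; i < links.snapshotLength; i++) {
        var url = links.snapshotItem(i).value;
        // "grab this url and scrape whatever you can from it"
        utilities.scrapeURL(url);
        // ...or, "grab this url and apply scraper 'x'":
        // utilities.scrapeURLWith(url, "x");
    }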
-- 
Brad Clements,                bkc_at_murkworks.com    (315)268-1000
http://www.murkworks.com                          
AOL-IM or SKYPE: BKClements
We must come down from our heights, and leave our straight 
paths, for the byways and low places of life, if we would 
learn truths by strong contrasts; and in hovels, in forecastles, 
and among our own outcasts in foreign lands, see what has been 
wrought upon our fellow-creatures by accident, hardship, or vice. 
- Richard Henry Dana, Jr. 1836
Received on Tue May 31 2005 - 17:41:25 EDT
