-- I haven't found any documentation on using XSLT to screen-scrape. I found two examples, but not being very RDF-knowledgeable at this point, I'm kind of guessing at what the examples do.

> > 5. I can't debug javascript screen scrapers using Venkman. I can't set a
> > breakpoint in them, and if Venkman is running and I pause it, piggybank
> > gets a bit confused
>
> Not being a Venkman user, I can't provide any insight here.

Venkman is the Firefox javascript debugger. My point is, there seems to be no way to debug javascript screen scrapers w/o using alert and debugPrint and trying over and over again. If javascript debugging were available, that'd save a lot of time.

> I think we hadn't anticipated scraping to be so popular as to merit
> functions for picking between them...

When looking at my del.icio.us page, Piggybank seems to have a built-in RSS scraper that doesn't do exactly what I want, so I wrote a javascript one to process the RSS directly. So there are two scrapers: the built-in one, and mine. And you said:

> I don't know how ultimately successful a generic scraper would be, but I
> suppose you could make the URL list '^http://.*$' or something for now.

Now imagine I have a scraper for "any wordpress blog" or any "Trac scm site". Their specific URLs could be anything, but the format of the data that's output on the page is the same. In this case, the scraper is specific to the underlying information provider software, but the URL is generic. Therefore sticking a bunch of match URLs in an .n3 file is difficult to maintain. On the other hand, wildcarding a bunch of scrapers seems like overkill too. What happens if a scraper "finds something" but it's not exactly the correct one for the page I'm currently viewing? What if two scrapers wildcarded as you suggest could produce *something* from the current page?
I think it would be a) slow and a waste of resources having every single URL I visit get scraped, and b) problematic to determine which scraper's output is the correct one. In these cases, I'd like to be able to explicitly pick a particular scraper and say "apply this". Maybe "wildcard scrapers" should only be usable via explicit selection: click on datacoin, choose scraper from list.

-- Also, I think javascript scrapers should be able to signal to piggybank the difference between:

1. I found no data
2. I don't match this page

Currently, if I'm using xpath and I don't match the page, I don't add anything to the model (hopefully). But to the end user, that appears to be the same as just not finding any data. I suggest they are not the same, and the end user may be helped (especially in the case of a "generic scraper") by being informed that the "scraper doesn't match this page".

So, going back to my weak example of a "trac scraper", I could add some simple javascript that looks for particular elements/classes and whatnot, and if they're not found, I could throw a javascript exception, return a particular value, or call a utility method: something that tells piggybank "I don't match this page". Piggybank could then do a number of things with that knowledge, like tell the user "the scraper you selected won't work on this page" or, in the case of multiple wildcard scrapers, maybe try the next one. This is different from scraping a page and finding no matching data.

-- trac is just an example I picked out of the blue: http://www.edgewall.com/trac/

-- oh, what about javascript scrapers calling other javascript scrapers?

Suppose I have a bookmarked list of websites that can be scraped. Suppose I want to periodically update piggybank from all these pages. I could have a scraper run down my bookmark list, then "call" a new utility method that says "grab this url and scrape whatever you can from it".
Or "grab this url and apply scraper 'x'" (though now we need some way of providing configuration data to the outer scraper).

But anyway, it seems like the API isn't complete w/o true recursion. The utility methods assume that the currently running script knows how to scrape the recursed document. I suppose for now that's generally true, but down the road, will it be helpful to be able to 'recurse' back into piggybank at a higher level?

--
Brad Clements, bkc_at_murkworks.com (315)268-1000
http://www.murkworks.com
AOL-IM or SKYPE: BKClements

We must come down from our heights, and leave our straight paths, for the byways and low places of life, if we would learn truths by strong contrasts; and in hovels, in forecastles, and among our own outcasts in foreign lands, see what has been wrought upon our fellow-creatures by accident, hardship, or vice.
- Richard Henry Dana, Jr. 1836

Received on Tue May 31 2005 - 17:41:25 EDT
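
The difference between "I found no data" and "I don't match this page", raised in the message above, could be sketched roughly as follows. This is only an illustration in plain modern JavaScript under stated assumptions: none of the names (`ScraperMismatch`, `tracScraper`, `blogScraper`, `runScrapers`) come from Piggy Bank, and `page` is a plain object standing in for a real DOM document. A scraper throws a dedicated exception when the page is not one it understands, and the caller uses that to fall through to the next wildcard scraper.

```javascript
// Hypothetical sketch: none of these names come from Piggy Bank's real API.

// Thrown by a scraper to say "I don't match this page".
class ScraperMismatch extends Error {
  constructor(scraperName) {
    super(scraperName + " does not match this page");
    this.name = "ScraperMismatch";
    this.scraperName = scraperName;
  }
}

// Stand-in for a generic "Trac site" scraper. `page` is a plain object
// ({ cssClasses, tickets }) used here in place of a real DOM document.
const tracScraper = {
  name: "trac",
  scrape(page) {
    // Cheap structural check first: bail out if Trac's markup is absent.
    if (!page.cssClasses.includes("trac-banner")) {
      throw new ScraperMismatch(this.name);
    }
    // Matched. An empty array here means "I match, but found no data".
    return (page.tickets || []).map((title) => ({ type: "ticket", title }));
  },
};

// Stand-in for a generic "any blog" scraper, built the same way.
const blogScraper = {
  name: "blog",
  scrape(page) {
    if (!page.cssClasses.includes("blog-post")) {
      throw new ScraperMismatch(this.name);
    }
    return (page.posts || []).map((title) => ({ type: "post", title }));
  },
};

// Try each scraper in order; on a mismatch, move on to the next one.
function runScrapers(scrapers, page) {
  for (const scraper of scrapers) {
    try {
      const items = scraper.scrape(page);
      const status = items.length > 0 ? "data" : "no-data";
      return { status, scraper: scraper.name, items };
    } catch (err) {
      // Only swallow the mismatch signal; genuine bugs still surface.
      if (!(err instanceof ScraperMismatch)) throw err;
    }
  }
  return { status: "no-match", scraper: null, items: [] };
}
```

With this shape, a caller can tell the three outcomes apart: `"data"`, `"no-data"` (a scraper recognised the page but extracted nothing), and `"no-match"` (no scraper recognised the page at all). An exception is used rather than a special return value so that a scraper cannot confuse an empty result with a mismatch by accident; a sentinel return value or a utility method call, the other options mentioned in the message, would work as well.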