Re: scraper evolution

From: Eric Miller <>
Date: Fri, 27 Jan 2006 12:58:13 -0500

On Jan 27, 2006, at 12:38 PM, David Huynh wrote:

> I'm working on making scrapers more "declarative", thus easier to
> write, easier to update and adapt, easier to find errors in them.
> Update errors might thus be detectable automatically. And the users
> themselves (not the original scraper authors) can try to update the
> scrapers.

Interesting! :) But I'm not quite sure how this would help the use
case exactly. In the case below the scraper didn't die per se - minor
HTML tweaks simply caused the scraper not to collect all of the RDF
data that it was originally able to gather. In this case, the user
might not know there was an error, and thus would not know to correct
the scraper.
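
To make the "silent" failure concrete, here is a rough sketch of the
kind of check I have in mind: compare the properties a scrape actually
produced against the ones it used to produce. Everything here is an
assumption for illustration only (the names check_scrape and
EXPECTED_PROPERTIES, the property list, and the sample triples are made
up and are not part of any existing scraper).

```python
# Hypothetical sketch: flag a scraper that still runs but gleans less
# RDF than it used to. All names and data below are illustrative.

# Properties the scraper produced when it was last known to work (assumed).
EXPECTED_PROPERTIES = {"dc:title", "dc:creator", "dc:date"}


def check_scrape(triples, expected=EXPECTED_PROPERTIES):
    """Return the expected properties that the scrape failed to produce."""
    found = {predicate for (_subject, predicate, _obj) in triples}
    return sorted(expected - found)


# Made-up result of a scrape that silently lost dc:creator after an HTML tweak.
triples = [
    ("book:1", "dc:title", "Example Title"),
    ("book:1", "dc:date", "2006"),
]

missing = check_scrape(triples)
if missing:
    print("scraper may be out of date; missing:", ", ".join(missing))
```

A check like this only notices properties that disappear entirely; it
would not catch values that are present but wrong.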

> We should also try to detect whether a scraper is "safe"
> automatically.

yes please! :)

eric miller                    
semantic web activity lead     
w3c world wide web consortium  
> David
> Eric Miller wrote:
>> The HTML pages harvested using an Open Worldcat scraper [1]
>> changed and as a consequence the scraper broke. To be clear, the
>> scraper when invoked didn't stop working per se, but rather it
>> didn't glean all of the relevant RDF that it did originally. I've
>> updated the scraper accordingly, but it's unclear to me the best
>> way to propagate these changes to others who might be using the
>> scraper.
>> I can think of several possible options, all of which have various
>> pros / cons:
>> 1) do nothing ... if folks realize it's broken they'll look for an
>> update
>> 2) real time auto-update ... every time the scraper is invoked it
>> checks to see if a new version is available
>> 3) periodic update ... check for updates nightly, monthly,
>> etc. and then offer the user some sort of notification to update
>> I'm inclined to suggest 3, but I'm curious about the thoughts of
>> others who might have been able to spend more time thinking about
>> this than I have :)
>> [1]
>> --
>> eric miller                    
>> semantic web activity lead     
>> w3c world wide web consortium  
Received on Fri Jan 27 2006 - 17:57:34 EST
