Re: scraper evolution from Eric Miller on 2006-01-27 (stdin)

From: Eric Miller <em_at_w3.org>
Date: Fri, 27 Jan 2006 12:58:13 -0500

On Jan 27, 2006, at 12:38 PM, David Huynh wrote:

> I'm working on making scrapers more "declarative", thus easier to
> write, easier to update and adapt, easier to find errors in them.
> Update errors might thus be detectable automatically. And the users
> themselves (not the original scraper authors) can try to update the
> scrapers.

Interesting! :) But I'm not quite sure how this would help the use
case exactly. In the case below the scraper didn't die per se - minor
HTML tweaks simply caused the scraper not to collect all of the RDF
data that it was originally able to gather. In this case, the user
might not know there was an error, and thus would not know to correct
the scraper.
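
To make the concern concrete, here is a rough sketch of one way a
silent failure like this might be surfaced: compare the predicates a
scraper actually emitted against a baseline of predicates it is
expected to emit. This is only an illustration under assumptions, not
how the existing scrapers work. The function name, the EXPECTED set,
and the sample triples are all hypothetical.

```python
from typing import Iterable, Set, Tuple

# A triple is modelled as a plain (subject, predicate, object) tuple.
Triple = Tuple[str, str, str]


def missing_predicates(triples: Iterable[Triple], expected: Set[str]) -> Set[str]:
    """Return the expected predicates that never appear in the scraped triples."""
    seen = {predicate for _, predicate, _ in triples}
    return expected - seen


# Hypothetical baseline: predicates the scraper produced when it last worked.
EXPECTED = {"dc:title", "dc:creator", "dc:date"}

# Hypothetical output after the HTML changed: the date is no longer gleaned.
scraped = [
    ("urn:example:book1", "dc:title", "Example Title"),
    ("urn:example:book1", "dc:creator", "Example Author"),
]

gaps = missing_predicates(scraped, EXPECTED)
if gaps:
    print("scraper may be out of date; missing:", sorted(gaps))
```

A non-empty result would not prove the scraper is broken (some pages
may legitimately lack a field), but it could be enough to prompt the
user to look for an update.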

> We should also try to detect whether a scraper is "safe"
> automatically.

yes please! :)

--
eric miller                              http://www.w3.org/people/em/
semantic web activity lead               http://www.w3.org/2001/sw/
w3c world wide web consortium            http://www.w3.org/
>
> David
>
> Eric Miller wrote:
>
>> The HTML pages harvested using an Open Worldcat scraper [1]
>> changed and as a consequence the scraper broke. To be clear, the
>> scraper when invoked didn't stop working per se, but rather it
>> didn't glean all of the relevant RDF that it did originally. I've
>> updated the scraper accordingly, but it's unclear to me the best
>> way to propagate these changes to others who might be using the
>> scraper.
>> I can think of several possible options, all of which have various
>> pros / cons:
>>
>> 1) do nothing ... if folks realize it's broken they'll look for an
>> update
>> 2) real time auto-update ... every time the scraper is invoked it
>> checks to see if a new version is available
>> 3) periodically update ... check for updates nightly, monthly,
>> etc. and then offer the user some sort of notification to update
>>
>> I'm inclined to suggest 3, but curious as to the thoughts of others
>> who might have been able to spend more time thinking about this
>> than I have :)
>>
>> [1] http://potlach.org/2005/10/scrapers/
>>
>>
>>
>

Received on Fri Jan 27 2006 - 17:57:34 EST
