Re: [RT] Learning from Greasemonkey + Platypus

From: David Huynh <dfhuynh_at_csail.mit.edu>
Date: Wed, 03 Aug 2005 20:56:05 -0400

Bruce D'Arcus wrote:

>
> On Aug 3, 2005, at 8:12 PM, Eric Miller wrote:
>
>> ...
>> - http://simile.mit.edu/piggy-bank/screen-scrapers-howto.html
>>
>> for additional details.
>
>
> Still, given that most web pages are not valid XML, if there was some
> way to pipe pages through Tidy first, that might open up more options?

XSLT-based scrapers automatically receive HTML that has already been
piped through Tidy. However, Tidy chokes on some HTML. Furthermore, some
pages use JavaScript to modify their HTML at load time to produce the
final rendering. XSLT-based scrapers cannot see the resulting HTML
because we can't run that JavaScript ourselves. For this reason, I'd
recommend using JavaScript-based scrapers instead: they get the final
DOM that the browser holds after all JavaScript code has run.
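
To illustrate the point about the final DOM, here is a minimal sketch
of what a JavaScript-based scraper might look like. The function name
scrapeLinks and the shape of the returned records are assumptions made
up for this example, not Piggy Bank's actual scraper API; the only DOM
calls used are the standard getElementsByTagName, getAttribute and
textContent.

```javascript
// Hypothetical sketch: collect link titles and URLs from a document.
// `doc` is assumed to be the browser's document object, i.e. the DOM as
// it stands after the page's own scripts have finished modifying it.
function scrapeLinks(doc) {
  var results = [];
  var anchors = doc.getElementsByTagName("a");
  for (var i = 0; i < anchors.length; i++) {
    var a = anchors[i];
    var href = a.getAttribute("href");
    // Skip anchors that carry no href (e.g. named anchors).
    if (href) {
      results.push({ title: (a.textContent || "").trim(), url: href });
    }
  }
  return results;
}
```

In a browser this would be called as scrapeLinks(document) once the
page has loaded, so anything the page's scripts inserted is included.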

Certain things are also much easier to do in JavaScript than in XSLT,
including loading additional pages to elaborate on the data (e.g.,
translating addresses into geo coordinates, or following subsequent
search result pages). For this reason, although I started out
implementing scrapers in XSLT, I switched over to JavaScript.
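
As a rough sketch of the "subsequent search result pages" case: the
code below keeps following a next-page link until there is none or a
page limit is hit. Both scrapeAllPages and the loadPage helper are
hypothetical names invented for this example; loadPage stands in for
whatever mechanism actually fetches and scrapes one page, and is
assumed to report { items, nextUrl } through a callback.

```javascript
// Hypothetical sketch: accumulate items across paginated results.
// loadPage(url, callback) is an assumed helper that loads `url`, scrapes
// it, and calls callback({ items: [...], nextUrl: "..." or null }).
function scrapeAllPages(startUrl, loadPage, maxPages, done) {
  var all = [];
  var visited = 0;

  function step(url) {
    // Stop when there is no further page or the page limit is reached.
    if (!url || visited >= maxPages) {
      done(all);
      return;
    }
    visited++;
    loadPage(url, function (page) {
      all = all.concat(page.items);
      step(page.nextUrl);
    });
  }

  step(startUrl);
}
```

The maxPages cap is a design choice for the sketch, so that a site
whose "next" links loop back on themselves cannot keep the scraper
running forever.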

David
Received on Thu Aug 04 2005 - 00:50:36 EDT
