On Aug 3, 2005, at 8:12 PM, Eric Miller wrote:
> ...
> - http://simile.mit.edu/piggy-bank/screen-scrapers-howto.html
>
> for additional details.
Still, given that most web pages are not valid XML, if there were some
way to pipe pages through Tidy first, that might open up more options?
About a year ago I wrote a simple workflow that did this, to download
articles from news sites like the New York Times. A simple AppleScript
in my newsreader flags URLs for download; a Ruby script then goes
through them, downloads the files, and converts them to XHTML. An XSLT
(2.0) stylesheet then removes all the crap and inserts additional
metadata in the headers.
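
A minimal sketch of the download-and-Tidy steps might look like the
following. It is in Python rather than the Ruby of the original script,
and it assumes the `tidy` command-line tool is installed and on the
PATH. The function names and the choice of Tidy options are
illustrative assumptions, not the original code, and the XSLT 2.0
cleanup step is left out.

```python
import subprocess
import urllib.request

# Assumed Tidy options: emit XHTML, use numeric entities, stay quiet,
# and treat input and output as UTF-8.
TIDY_ARGS = ["tidy", "-asxhtml", "-numeric", "-quiet", "-utf8"]


def fetch(url):
    """Download a page and return its raw bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def to_xhtml(html_bytes, run=subprocess.run):
    """Pipe raw HTML through Tidy and return the XHTML it writes to stdout.

    `run` is injectable (hypothetical convenience) so the function can be
    exercised without Tidy installed.
    """
    result = run(
        TIDY_ARGS,
        input=html_bytes,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    # Tidy exits with 0 when the input was clean and 1 when it only issued
    # warnings, which is the usual case for real-world pages. Anything
    # else means it could not produce usable output.
    if result.returncode not in (0, 1):
        raise RuntimeError("tidy failed: " + result.stderr.decode("utf-8", "replace"))
    return result.stdout


def download_as_xhtml(url, path):
    """Fetch one flagged URL, tidy it, and save the XHTML to `path`."""
    with open(path, "wb") as out:
        out.write(to_xhtml(fetch(url)))
```

Calling `download_as_xhtml` once per flagged URL would leave a
directory of XHTML files that an XSLT processor can then work on.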
Bruce
Received on Thu Aug 04 2005 - 00:25:40 EDT