Re: [RT] Learning from Greasemonkey + Platypus

From: Bruce D'Arcus <bdarcus_at_gmail.com>
Date: Wed, 3 Aug 2005 20:29:02 -0400

On Aug 3, 2005, at 8:12 PM, Eric Miller wrote:

> ...
> - http://simile.mit.edu/piggy-bank/screen-scrapers-howto.html
>
> for additional details.

Still, given that most web pages are not valid XML, if there were some
way to pipe pages through Tidy first, might that open up more options?

About a year ago I wrote a simple workflow that did this, to download
articles from news sites like the New York Times. A simple AppleScript
in my newsreader flags URLs for download; a Ruby script then goes
through, downloads the files, and converts them to XHTML. An XSLT (2.0)
stylesheet then removes all the crap and inserts additional metadata in
the headers.
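
For anyone who wants to try the download-and-tidy step, here is a rough
sketch in Python rather than the Ruby of the original script, which is not
shown here. It assumes the HTML Tidy command-line tool is installed as
"tidy" on the PATH; the function names (tidy_command, to_xhtml, fetch) are
made up for illustration. The XSLT 2.0 cleanup pass is left out, since it
would need an external processor.

```python
import subprocess
import urllib.request


def tidy_command(tidy_bin="tidy"):
    """Argument list asking Tidy to read stdin and write XHTML to stdout.

    Assumption: the Tidy binary is called "tidy" and is on the PATH.
    """
    return [
        tidy_bin,
        "-quiet",                  # suppress non-essential output
        "-asxhtml",                # convert HTML to well-formed XHTML
        "-numeric",                # numeric entities, friendlier to XML parsers
        "-utf8",                   # treat input and output as UTF-8
        "--force-output", "yes",   # still emit output for badly broken pages
    ]


def to_xhtml(html_bytes, run=subprocess.run):
    """Pipe raw HTML bytes through Tidy and return the XHTML bytes.

    `run` is injectable so the function can be exercised without Tidy.
    Tidy exits with 0 on success, 1 on warnings and 2 on errors, so a
    non-zero status alone is not treated as failure here.
    """
    result = run(
        tidy_command(),
        input=html_bytes,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    if result.returncode > 1 and not result.stdout:
        raise RuntimeError("tidy produced no output")
    return result.stdout


def fetch(url):
    """Download a flagged URL and return the response body as bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()
```

Something like to_xhtml(fetch(url)) would then yield a document that an
XSLT stylesheet, or a scraper expecting XML, can parse.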

Bruce
Received on Thu Aug 04 2005 - 00:25:40 EDT
