Re: model.addTag equivalent for xslt scraper - other piggybank issues

From: Brad Clements <bkc_at_murkworks.com>
Date: Sat, 28 May 2005 10:31:43 -0400

On 28 May 2005 at 11:06, Danny Ayers wrote:

> I'm curious about what you would need (not actually for simile, I'm
> doing some other XSLT to RDF/XML).
>
> Do you please have a link to a sample of the source data, and an
> example of the kind of statements you want to derive?
>

Well, my use case is pretty weak now, since I wrote a scraper in JavaScript.

Basically http://del.icio.us/rss/bkc returns my most recent 32 bookmark
entries in rss format.

Here's an example copied from Firefox (fold markers removed, wrapped URLs
rejoined):

<item rdf:about="http://www.amazon.com/exec/obidos/ASIN/0960726802/103-6779802-8193463">
  <title>work horse handbook</title>
  <link>http://www.amazon.com/exec/obidos/ASIN/0960726802/103-6779802-8193463</link>
  <dc:creator>bkc</dc:creator>
  <dc:date>2005-05-17T16:26:00Z</dc:date>
  <dc:subject>booklist</dc:subject>
  <taxo:topics>
    <rdf:Bag>
      <rdf:li resource="http://del.icio.us/tag/booklist"/>
    </rdf:Bag>
  </taxo:topics>
</item>


Piggybank shows the datacoin by default on this page, but treats it as
generic "RSS", so it misses a lot of data, including the tags I've applied to
each entry.

What I wanted to do was

a) create web#Page entries for each of the RSS items

b) take the tags I have and associate them with each new web#Page item.

That is, whitespace split dc:subject and call model.addTag for each, which
is what I did in javascript.
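
For readers following along, the whitespace-split step could look roughly
like the sketch below. It is not the actual scraper: the stub model object,
the helper name addTagsFromSubject, and the (uri, tag) argument order for
model.addTag are all assumptions made for illustration, since the real
signature is not shown in this thread.

```javascript
// Hypothetical stand-in for Piggy Bank's model object. The real
// model.addTag signature is not shown here; (uri, tag) is an assumption.
function makeStubModel() {
  const calls = [];
  return {
    calls,
    addTag(uri, tag) {
      calls.push([uri, tag]);
    },
  };
}

// Split a dc:subject string on whitespace and add one tag per token.
// Empty tokens (from leading, trailing, or repeated whitespace) are dropped.
function addTagsFromSubject(model, itemUri, subject) {
  const tags = (subject || "").split(/\s+/).filter((t) => t.length > 0);
  for (const tag of tags) {
    model.addTag(itemUri, tag);
  }
  return tags;
}

// Usage with made-up data: one call to addTag per tag in the subject.
const model = makeStubModel();
addTagsFromSubject(model, "http://example.org/page", "booklist horses  farming");
console.log(model.calls);
```

In a real scraper the stub would be replaced by the model object that
piggybank passes in, and the subject string would come from each item's
dc:subject element.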

I couldn't find the spot in the source where piggybank parses the results of
an XSLTHarvester. I thought maybe I could just stick a collection of tags
into the containing Web#Page item.

However since tags are user specific (and hashed w/ email address) there'd
have to be special support in the piggybank parser to handle this properly.

--
So, I have 32 of 107 bookmarks imported into Piggybank and my thoughts 
are:
1. If piggybank, publishing to a databank, is going to replace my use of
del.icio.us tagging, I'll need a way to specify a comment when I tag a page
using piggybank. So I'll have to modify that pop-up tag form and the
underlying Java XPCOM code to support adding a dc:description property.
2. There seems to be no way to edit any data in piggybank. Is this a
planned feature? I should be able to edit any existing property and add new
ones. I can change tags on an item, but I would like to be able to add new
properties to existing items, not just additional tags.
3. What's the difference between XSLTHarvester and 
XSLTScreenScraper?
4. In an XSLTScreenScraper, can I use the document() function to traverse
"child documents", just like utilities.processDocuments?
How about getLLsFromAddress using XSLT? Perhaps utility and model could
be exposed as EXSLT functions.
5. I can't debug JavaScript screen scrapers using Venkman. I can't set a
breakpoint in them, and if Venkman is running and I pause it, piggybank
gets a bit confused.
6. What if I have two scrapers that can operate on a page, plus the default
RSS handler? Clicking on the datacoin should give me the option of picking a
particular "generic handler" (like RSS) or a custom screen scraper.
Perhaps I would like to have a "generic scraper" that could work on many
pages. I don't want to have to edit the .n3 file to list all URLs; I should
be able to pick a candidate scraper from a list (say, by right-clicking the
datacoin).
--
Guess that's enough for now.
-- 
Brad Clements,                bkc_at_murkworks.com    (315)268-1000
http://www.murkworks.com                          
AOL-IM or SKYPE: BKClements
We must come down from our heights, and leave our straight 
paths, for the byways and low places of life, if we would 
learn truths by strong contrasts; and in hovels, in forecastles, 
and among our own outcasts in foreign lands, see what has been 
wrought upon our fellow-creatures by accident, hardship, or vice. 
- Richard Henry Dana, Jr. 1836
Received on Sat May 28 2005 - 14:30:04 EDT
