[announcement] dspace item screen scraper for Piggy Bank

From: Stefano Mazzocchi <stefanom_at_mit.edu>
Date: Tue, 09 Aug 2005 12:14:20 -0700

SIMILE was born as a way to enable dspace with Semantic Web technologies
in order to extend its metadata support. While we are not really there
yet (for reasons that do not depend only on us), we do want to help in
that regard.

I've written a dspace item screen scraper and it's now located at

http://simile.mit.edu/repository/piggy-bank/trunk/src/scrapers/screen-scrapers.n3

(a minimal installation howto is contained inside that file), along with
a few other, 'institutional' ones.

Unlike previous dspace scrapers that have floated around, this one is
general, in the sense that it should work on any dspace installation, no
matter where it is located (and, of course, given the Piggy Bank
architecture, no matter whether it is served over https or http).

WARNING: it works *only* on the 'full metadata' view of items! For
example, it works on something like

  http://www.dspace.cam.ac.uk/handle/1810/1471?mode=full

and *not* on something like

  http://www.dspace.cam.ac.uk/handle/1810/1471
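
Getting from the second form to the first is just a matter of adding the
mode=full query parameter. If you want to automate that rewrite, a
trivial Java sketch (class and method names here are made up for
illustration) could be:

  // Hypothetical helper: rewrite a dspace item URL so it points at the
  // 'full metadata' view, the only view this scraper currently handles.
  public class FullViewUrl {
      static String toFullMetadataView(String itemUrl) {
          // Append mode=full, respecting any query string already present.
          return itemUrl + (itemUrl.contains("?") ? "&" : "?") + "mode=full";
      }

      public static void main(String[] args) {
          System.out.println(toFullMetadataView(
              "http://www.dspace.cam.ac.uk/handle/1810/1471"));
      }
  }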

NOTE: yes, I know this is suboptimal and I'm working to enhance the
scraper so that:

  1) you can do faceted browsing of a search, not just of individual items.

  2) it works on both 'normal view' and 'full view'.

WARNING 2: the scraper keys on class="" attributes inside the HTML markup
that dspace generates, which makes it quite resistant to UI changes; but,
being a scraper, it is by nature a little fragile. It should indeed be
possible to use another, more data-oriented interface (SRW, OAI-PMH),
but AFAIK there is no way to automatically discover where those web
services are.
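
To make the class-based approach concrete outside of the N3 scraper
itself, here is a rough Java sketch using the jsoup HTML parser; the
metadataFieldLabel / metadataFieldValue class names are my assumption
about the markup a dspace JSPUI installation emits, so check your own
installation's HTML before relying on them:

  import org.jsoup.Jsoup;
  import org.jsoup.nodes.Document;
  import org.jsoup.nodes.Element;

  public class DSpaceFullViewScraper {
      public static void main(String[] args) throws Exception {
          // Fetch the 'full metadata' view of a dspace item.
          Document doc = Jsoup.connect(
              "http://www.dspace.cam.ac.uk/handle/1810/1471?mode=full").get();

          // Walk the metadata table by class attribute rather than by
          // layout, which is what makes this style of scraping relatively
          // robust against cosmetic UI changes.
          for (Element label : doc.select("td.metadataFieldLabel")) {
              Element value = label.nextElementSibling();
              if (value != null && value.hasClass("metadataFieldValue")) {
                  System.out.println(label.text() + " -> " + value.text());
              }
          }
      }
  }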

SUGGESTIONS: here are my humble suggestions for the dspace community:

  1) indicate that a particular page was generated by dspace. The best
way to do this is to set a new header in the HTTP response; I would
suggest something like "X-DSpace: 1.2.0". That way it is possible to
tell whether a URL is served by dspace without resorting to regexp
matching of URLs (which is a more fragile approach). NOTE: since this is
potential help for intruders, make it configurable so that it can be
reduced to "X-DSpace: 1.x" (to avoid giving attackers too much detail),
but prevent it from being easily disabled altogether. (A client-side
sketch of how a scraper could use this follows the list.)

  2) create a way for a client to discover where the web service URLs
are. A simple way to do this is to respond to the OPTIONS HTTP request
with sufficient information in the headers (also shown in the sketch
after this list). There may be more "standard" ways of achieving this,
but anyone who wants to crawl or harvest needs something like it (I
think Google expressed the same concern, but proposed the use of
robots.txt files, which I find highly impractical as a solution, since a
dspace installation is not always at the root of the web server and
doesn't necessarily own the robots.txt file).

  3) well, of course, send some RDF data along with the items so that
we don't have to scrape it ourselves :-) [and yes, we can help on that
one if the need emerges]
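
To make suggestions 1 and 2 concrete from the client side, here is a
rough Java sketch; the X-DSpace, X-DSpace-OAI-PMH and X-DSpace-SRW
headers are names invented here for illustration, not anything dspace
sends today:

  import java.net.HttpURLConnection;
  import java.net.URL;

  public class DSpaceProbe {
      public static void main(String[] args) throws Exception {
          URL url = new URL("http://www.dspace.cam.ac.uk/handle/1810/1471");

          // Suggestion 1: a HEAD request is enough to read a (hypothetical)
          // "X-DSpace" response header and decide whether to run the scraper.
          HttpURLConnection head = (HttpURLConnection) url.openConnection();
          head.setRequestMethod("HEAD");
          String version = head.getHeaderField("X-DSpace");
          System.out.println(version != null
              ? "dspace detected, version hint: " + version
              : "no X-DSpace header; fall back to URL/markup heuristics");

          // Suggestion 2: an OPTIONS request whose response headers advertise
          // where the data-oriented services live (again, purely hypothetical).
          HttpURLConnection options = (HttpURLConnection) url.openConnection();
          options.setRequestMethod("OPTIONS");
          System.out.println("OAI-PMH: " + options.getHeaderField("X-DSpace-OAI-PMH"));
          System.out.println("SRW:     " + options.getHeaderField("X-DSpace-SRW"));
      }
  }

Until something like this exists, the regexp and markup heuristics
mentioned above remain the only option.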

Enjoy.

-- 
Stefano Mazzocchi
Research Scientist                 Digital Libraries Research Group
Massachusetts Institute of Technology            location: E25-131C
77 Massachusetts Ave                   telephone: +1 (617) 253-1096
Cambridge, MA  02139-4307              email: stefanom at mit . edu
-------------------------------------------------------------------