Re: [announcement] DSpace Scraper - Reloaded

From: Stefano Mazzocchi <stefanom_at_mit.edu>
Date: Wed, 10 Aug 2005 10:52:47 -0700

Eric Miller wrote:
>
> On Aug 10, 2005, at 12:48 AM, Stefano Mazzocchi wrote:
>
> I started to see if I could transform the existing xslt scraper into a
>> javascript one and, step after step, I turned it into a full-
>> featured scraper that:
>>
>> 1) works on all dspace installations worldwide! (well, all that I
>> tried)
>> 2) works on both the simple and the complete item view
>> 3) works on search results too! (so that you can have a facetted
>> browsing experience of a dspace installation... well, at least for a
>> limited amount of items)
>>
>> Point your piggy-bank-enabled firefox to
>>
>> http://simile.mit.edu/repository/piggy-bank/trunk/src/scrapers/
>> screen-scrapers.n3
>>
>> and follow the instructions.
>>
>> Happy scraping (again).
>
>
> Excellent! Well done!
>
> The regular expression matching you've introduced is more complex than
> previous *scrapers. Might be worth mentioning this a bit more on the
> list for those interested.

The scraper will react on all URLs that have the form

^http(s)?://.*/((handle/[0-9\\.]+/[0-9\\.]+)|(simple-search)).*$

For those who don't know regular expressions, the above is a very
condensed way of saying something like

  http or https
  any web domain
  any location in the path of that domain
  either
    handle/[handle identifier][something else]
  or
    simple-search[something else]

The 'handle/' URLs indicate items and the 'simple-search' URLs indicate
search results (the scraper then internally differentiates between the
two to trigger different behavior).
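To make the dispatch concrete, here is a minimal sketch of how a single regexp can both match and classify the two URL kinds. This is illustrative only (the function name is hypothetical, not the actual scraper code):

```javascript
// Hypothetical sketch of the single-regexp dispatch described above.
// Group 3 captures the 'handle/...' alternative, group 4 'simple-search'.
const DSPACE_URL =
  /^http(s)?:\/\/.*\/((handle\/[0-9.]+\/[0-9.]+)|(simple-search)).*$/;

function classifyDSpaceUrl(url) {
  const match = DSPACE_URL.exec(url);
  if (!match) return null;          // not a DSpace page this scraper handles
  return match[3] ? "item" : "search";
}
```

An item URL like the one below then classifies as "item", while a 'simple-search' URL classifies as "search", and everything else is ignored.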

It would have been possible to write two different scrapers that reacted
on different (and simpler) regexps, but since they both shared some
code, I thought it was easier to write and maintain one. (Performance
and complexity are virtually the same.)

> Also, I'm concerned about the use of 'http://www.ontoweb.org/ontology/
> 1#' namespace in this (and other) scrapers. Looking at the code, it
> seems clear you're using this to type 'Publications'. I'd suggest we
> either work with the ontoweb folks for making this available or use
> another namespace for this term that resolves to something useful
> (ideally one that has some sort of clear persistence policy).

I follow the 'data first' approach. Get the data out and see what happens.

I'm happy to change that in anything else you want, but honestly I don't
care what it is.

> If there are not readily available terms the Dspace community needs I
> think we should work with them to find out what they are and simply
> write these to the web. Other than the class 'Publication' is there
> anything else you've come across in modeling the data?

Oh yeah, it's a jungle out there.

The idea that 'dspace is just dublin core' is practically laughable: if
you open the firefox 'javascript console' you'll see all the warnings
about fields that I don't know how to interpret and that are not listed
in DC or DCT.

NOTE: when dspace says 'dc' it does not mean just the 15 elements that
we all know, it's DC + DCT, which is more than 15 fields.

The scraper currently recognizes 17 fields:

  creator
  contributor.author
  date.available
  date.created
  date.issued
  description.abstract
  format.extent
  format.mimetype
  language.iso
  subject
  subject.other
  title
  publisher
  rights
  contributor.department
  contributor.institution
  type
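A hypothetical sketch of how such a field table might be wired up, with unrecognized fields surfacing as the javascript-console warnings mentioned above (the helper name and the specific DC/DCT predicate choices are my assumptions, not the actual scraper code):

```javascript
// Hypothetical mapping from scraped DSpace field names to predicate URIs;
// the DC vs. DCT split mirrors the note above. Only a few of the 17
// recognized fields are shown here.
const DC  = "http://purl.org/dc/elements/1.1/";
const DCT = "http://purl.org/dc/terms/";

const KNOWN_FIELDS = {
  "creator":              DC  + "creator",
  "title":                DC  + "title",
  "date.issued":          DCT + "issued",
  "description.abstract": DCT + "abstract",
  // ... remaining recognized fields would go here
};

function predicateFor(field) {
  const uri = KNOWN_FIELDS[field];
  if (!uri) {
    // this is what shows up in the firefox javascript console
    console.warn("unknown DSpace field: " + field);
    return null;
  }
  return uri;
}
```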

The interesting one is the last: I do *NOT* model it as rdf:type but as
dc:type, the reason being that all the above fields have literals as
objects, not resources, and rdf:type is something that asks very
strongly for resources as objects.

So, one item, for example

   http://divinity.acadiau.ca/dspace/handle/1952/115

will have

  rdf:type -> <...#Publication>
  dc:type -> "Video"

Therefore I interpreted the above as "that item was 'published', and
this is what I model with rdf:type", then "the item is a video, and
this is what I model with dc:type".
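In triple terms, the distinction looks roughly like this (an illustrative sketch, not the actual scraper code; the helper name is hypothetical, the namespaces are the ones discussed in this thread):

```javascript
// Sketch of the rdf:type vs. dc:type distinction described above:
// rdf:type gets a resource as object, dc:type keeps the scraped literal.
const RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
const DC  = "http://purl.org/dc/elements/1.1/";

function typeTriples(itemUri, dspaceTypeValue) {
  return [
    // "this item was published" -> a resource object
    { s: itemUri, p: RDF + "type",
      o: { uri: "http://www.ontoweb.org/ontology/1#Publication" } },
    // "this item is a video" -> a plain literal object
    { s: itemUri, p: DC + "type",
      o: { literal: dspaceTypeValue } },
  ];
}
```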

Sure, it's entirely possible to collect a bunch of this information from
all the various dspace sites, provide a list of all the values found in
the wild for those fields, and assign URIs to those... but again, 'data
first': I'm not interested in spending time on the 'right thing' before
I even know what I need to do with this data.

But if somebody has a better idea (or even better, patches), I'm wide
open to suggestions.

-- 
Stefano Mazzocchi
Research Scientist                 Digital Libraries Research Group
Massachusetts Institute of Technology            location: E25-131C
77 Massachusetts Ave                   telephone: +1 (617) 253-1096
Cambridge, MA  02139-4307              email: stefanom at mit . edu
-------------------------------------------------------------------
Received on Wed Aug 10 2005 - 17:48:58 EDT
