Re: [announcement] DSpace Scraper - Reloaded

From: Stefano Mazzocchi <>
Date: Wed, 10 Aug 2005 15:43:34 -0700

Matthew Cockerill wrote:
> Stefano,
> My apologies - misreading on my part as a result of overzealous
> mental multithreading, and not quite grasping the meaning of:
> "semantic linking by field collision" or "implicit ontological collision"

Oh, no need to apologize. Just wanted to make sure you understood what I

> [In fact, I could still do with some clarification on what you mean by
> those phrases: "ontological collision" has an impressively small
> number of hits as a Google search....]

There is a common implicit assumption when people try to do data
interoperability: two identical symbols found in different systems are
implicitly related by a "sameAs" equivalence. Read: their meaning is
self-referencing and independent of the context in which they are used.

Example: 'creator' in a DSpace system in China and 'creator' in an
e:print system in the UK.

As we know, this was not enough to avoid the opposite problem: people use
the same symbol to mean different things (homonyms). So people went out
of their way to make sure those symbols would not collide unless they were
meant to. UUIDs, URIs, DOIs, handles, LSIDs, you name it: all part of a
scheme to keep two symbols from 'colliding' without a specific intention.

In piggy-bank we designed a system where anything you say is your own
symbol. Semantic collision (when two people independently use the same
symbol) will be extremely rare.

As you very well understand, without a way to 'draw relationships'
between these explicitly disconnected symbols, we didn't solve anything;
we actually made it worse. There is no interoperability in a symbol soup.

But now we could say that when "dspace at mit" and "dspace at cambridge"
mean dc:title, they really mean the same thing.
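
To make that concrete, here is a minimal sketch in Python. It is not
piggy-bank's actual code: the (source, name) pairs and the same_as set are
made up for illustration. Every symbol is qualified by who said it, so
nothing collides by accident, and equivalence exists only where someone
states it.

```python
# Hypothetical illustration: a symbol is a (source, name) pair, so two
# sources using the same name do not collide unless someone says so.
mit_title = ("dspace_at_mit", "dc:title")
cam_title = ("dspace_at_cambridge", "dc:title")
mit_creator = ("dspace_at_mit", "dc:creator")
cam_creator = ("dspace_at_cambridge", "dc:creator")

# Explicit equivalences, stored as unordered pairs of symbols.
same_as = {frozenset([mit_title, cam_title])}


def equivalent(a, b):
    """True only if a and b are identical or explicitly declared equivalent."""
    return a == b or frozenset([a, b]) in same_as


print(equivalent(mit_title, cam_title))      # declared, so True
print(equivalent(mit_creator, cam_creator))  # same name, never declared: False
```

The two dc:creator symbols stay unrelated until somebody decides they
mean the same thing and says so.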

Some people believe that even using something as simple as Dublin Core,
those 'explicit mappings' could go away, because people will just read
the doc and do the right thing.

Well, even without counting mistakes (which can be corrected and
normally affect only single records rather than the entire dataset),
there is disagreement.

DSpace, for example, is a system that was designed to be the 'simplest
thing that could possibly work' and, in fact, locally, it works fine,
although people complain about the fact that they want to add their own
'metadata' to their data.

Simile was created to allow that, and to research ways to do this and
implement them.

People think that when everybody uses DC, a little bit of
interoperability is already there. Well, yes and no. Like I said, the
fact that dspace_at_mit's use of dc:creator is 'similar' to the
dspace_at_cambridge's use of dc:creator might not be something that is so

Sure, there are fields that are just wrong or misused, and you can correct
those, but where disagreement on the interpretation comes along (like
rdf:type vs. dc:type) things start to get hairy.

> Anyway, great to hear that we're agreed (I think) on the importance of
> mapping and ontological glue, to make a kind of sense of the many
> overlapping ontologies that are going to be in use.


> I do certainly agree with the sentiment behind your "data first" blog
> posting, and can think of many examples showing the success of this
> approach.

So here's the deal: capture data first, no matter what ontology, as long
as it's differentiating enough. Then encode mappings and/or rules
between ontologies. Then allow the systems to work on those but keep the
'inferred data' separated from the 'original data' so that different
people can use different mappings, and without bothering the other two
parties involved (this is, I think, the only way a system like this
would work on a global scale).
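
A rough sketch of that separation in Python, under my own assumptions:
the triples, the property names (mit:creator, cam:creator) and the two
mappings are invented for illustration and do not come from any real
system. The point is only that inference produces a separate set and
never touches the captured data.

```python
# Original data: (subject, property, value) triples, captured as-is.
# All names here are hypothetical.
original = frozenset({
    ("doc1", "mit:creator", "A. Author"),
    ("doc2", "cam:creator", "B. Writer"),
})


def infer(triples, mapping):
    """Return only the new triples obtained by rewriting properties."""
    derived = {(s, mapping[p], v) for (s, p, v) in triples if p in mapping}
    return derived - triples


# Two people who disagree about which properties map onto dc:creator.
alice_mapping = {"mit:creator": "dc:creator", "cam:creator": "dc:creator"}
bob_mapping = {"mit:creator": "dc:creator"}

alice_inferred = infer(original, alice_mapping)
bob_inferred = infer(original, bob_mapping)

# Each person queries the original data plus their own inferred layer.
alice_view = original | alice_inferred
bob_view = original | bob_inferred
```

Alice and Bob get different views of the same records, and neither of
them had to ask the two repositories to change anything.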

Stefano Mazzocchi
Research Scientist                 Digital Libraries Research Group
Massachusetts Institute of Technology            location: E25-131C
77 Massachusetts Ave                   telephone: +1 (617) 253-1096
Cambridge, MA  02139-4307              email: stefanom at mit . edu
Received on Wed Aug 10 2005 - 22:39:40 EDT