RE: examples of linking bibliographic RDF to articles from MacKenzie Smith on 2005-07-16 (stdin)

From: MacKenzie Smith <kenzie_at_MIT.EDU>
Date: Sat, 16 Jul 2005 15:38:46 -0400

I'm kind of surprised that no one has brought up the Library of Congress's
Name Authority database -- 5.3 million authoritative, unique records for
personal and corporate names, based on several criteria including the
actual name, birth and death dates, geographic area, and titles of authored
works (see http://authorities.loc.gov/). These records also include
variant/related forms and information about the sources used to establish
the form. It's a very old practice so they don't assign URIs to each
authority record now, but I can't think of any reason why that would be
difficult... except that the official format predates XML so these records
are all in a weird binary format today -- the XML examples below are
totally made up. It's true that the names database mainly covers authors of
monographs rather than journal articles, but there's a lot of overlap and
the practice is quite extensible to anyone.

Examples:

<personalname>Frederick</personalname>
<numeration>II,</numeration>
<title>Holy Roman Emperor</title>
<dates>1194-1250</dates>
<geographicarea>Italy</geographicarea>

or

<personalname>Abelson, Harold</personalname>
<variantname>Abelson, H.<qualifier>(Harold)</qualifier></variantname>
<variantname>Abelson, Hal</variantname>
<sourcedata>His Calculus of elementary functions, 1970.</sourcedata>
<sourcedata>His Apple Logo, 1982, c1981:<copyright>CIP t.p. (Harold
Abelson) copr. statement (H. Abelson)</copyright></sourcedata>
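
A minimal sketch of how little it would take to give each record a URI, assuming the records were available as XML like the made-up example above. The base URI, the record number "n00000000" and the wrapping <authority> element are all invented for illustration; none of them come from the Library of Congress.

```python
import xml.etree.ElementTree as ET

# Hypothetical base URI; not an address the Library of Congress publishes.
BASE_URI = "http://example.org/authorities/names/"

# The made-up record from above, wrapped in an assumed <authority> root
# element with an assumed record number so it parses as one document.
RECORD = """
<authority id="n00000000">
  <personalname>Abelson, Harold</personalname>
  <variantname>Abelson, H.<qualifier>(Harold)</qualifier></variantname>
  <variantname>Abelson, Hal</variantname>
</authority>
"""


def mint_uri(record_id):
    """Build a URI for an authority record from its record number."""
    return BASE_URI + record_id


def describe(xml_text):
    """Return the URI, established name and variant forms of one record."""
    root = ET.fromstring(xml_text.strip())
    variants = []
    for v in root.findall("variantname"):
        # itertext() also picks up text inside nested <qualifier> elements.
        variants.append(" ".join("".join(v.itertext()).split()))
    return {
        "uri": mint_uri(root.get("id")),
        "name": root.findtext("personalname"),
        "variants": variants,
    }


info = describe(RECORD)
print(info["uri"], info["name"], info["variants"])
```

The only real design decision here is that the URI is derived from the record number rather than from the name string, so the established form of the name can change without breaking the identifier.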

OCLC also has these and many more name authority records (e.g. from other
countries like the UK and Germany) and they offer a Web Service to do
automatic matching from a supplied name (we experimented with using that in
DSpace to clean up personal names supplied by the authors directly).
Sometimes that involves a human selecting from a list of possible matches,
but often it's a unique match. I'm sure that OCLC is interested in the
intersection of name authority records and Web-based URIs for people, so
Eric or I could ask them what they come up with in that direction.
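
The matching step can be sketched locally. This is not OCLC's Web Service (its interface isn't described here), just a toy matcher over two in-memory records. The record ids, the second record, the normalization rule and the match() function are all assumptions for illustration, and names are assumed to arrive in "Surname, Forename" order.

```python
import re

# Toy authority file: record id -> (established form, variant forms).
# The ids and the second record are invented for this sketch.
AUTHORITIES = {
    "rec-1": ("Abelson, Harold", ["Abelson, H. (Harold)", "Abelson, Hal"]),
    "rec-2": ("Abelson, R.", ["Abelson, Robin"]),
}


def normalize(name):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", " ", name.lower()).split())


def match(supplied):
    """Return ids of candidate records for a supplied name.

    An exact normalized match against the established or variant forms
    wins. Otherwise fall back to every record sharing the surname, which
    is the case where a human would pick from the list.
    """
    key = normalize(supplied)
    exact = [
        rid
        for rid, (name, variants) in AUTHORITIES.items()
        if key in {normalize(n) for n in [name, *variants]}
    ]
    if exact:
        return exact
    surname = key.split()[0] if key else ""
    return [
        rid
        for rid, (name, _) in AUTHORITIES.items()
        if normalize(name).split()[0] == surname
    ]


print(match("abelson, hal"))   # unique match
print(match("Abelson, J."))    # ambiguous: needs a human to choose
```

A one-element result corresponds to the "unique match" case; a longer list is what would be shown to a person for selection, and an empty list means no authority record was found.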

The data is public, replicated in a few places, and generated by librarians
all over the world using well-worn cataloging rules (so highly distributed,
if not exactly open to just anybody contributing, at least in the current
model). I agree with Matthew about the privacy issues... if you publish,
certain facts about you become publicly available and librarians dig up
more dirt on you so they can tell you apart from other people, and that's
not considered invasion of privacy.

MacKenzie

At 05:15 PM 7/15/2005 +0100, Matthew Cockerill wrote:
>In terms of central registries:
>
>Depends what you mean by central
>
>Coordinated registries which manage a global namespace exist and function
>well for, to give just a few examples:
>
>* product bar codes (UPC)
>* books (ISBN)
>* serials (ISSN)
>* Scholarly articles (DOI)
>* Protein structures (PDB ids)
>* internet hostnames (DNS)
>
>The same could apply for scholarly article author identifiers.
>
>I'm not convinced that privacy issues make this impossible. The
>information conveyed by the authorship URI would be simply that (I, John
>Smith, author of this article A, am one and same John Smith that is author
>of article B, but am not the same John Smith as the author of article C)
>
>Publication of scientific articles is not anonymous - it consciously and
>inherently involves a loss of privacy on the part of the author.
>An article may express the fact that the author has done research on
>animals that may make that author a target for animal rights campaigners,
>and yet that article will still identify that author and the institution
>they work at.
>
>Relative to that, it seems that the privacy issues involved in
>establishing which of the John Smiths are the same person are very minor.
>
>Matt
>
> > And if you think that a worldwide centralized URI registry would be
> > enough to solve the problem, rethink your strategy because it won't
> > happen: networked systems avoid centralization like the plague, because
> > it's inherently more vulnerable.
>
> > -----Original Message-----
> > From: Stefano Mazzocchi [mailto:stefanom_at_mit.edu]
> > Sent: 15 July 2005 17:04
> > To: general_at_simile.mit.edu
> > Subject: Re: examples of linking bibliographic RDF to articles
> >
> >
> > Matthew Cockerill wrote:
> > > Alf,
> > >
> > > Actually, long term I think scientific authors do need to
> > be identified by URIs.
> > > I think ultimately the semantic web, and the increasing
> > sophistication of bibliographic tools built on it, will drive this.
> > >
> > > The need for author specific IDs has been raised several
> > times in the scientific literature.
> > > e.g.
> > http://www.nature.com/nature/journal/v411/n6835/full/411237b0.html
> > > Nature 411, 237 (17 May 2001) | doi: 10.1038/35077304
> > > Sorting out the Smiths
> > >
> > > [But there was a more recent one too.]
> > >
> > > The ambiguity in author names has pretty serious
> > consequences - e.g. it is disproportionately difficult for
> > editors of our journals to identify if someone is a suitable
> > reviewer for a paper, if they have a common name (Chinese
> > names being a particular challenge), since when searching
> > bibliographic databases it can be very difficult to identify
> > which people are genuinely the same person.
> > >
> > > It might be felt that it's difficult to see how we can get
> > from the current situation, to a situation where author names
> > on papers always include an ID (ideally a URI), but as
> > the major indexing services (PubMed, ISI, and Google
> > Scholar), say, increasingly base themselves on electronic
> > data supplied by the publisher, it's easily conceivable that
> > as well as (or perhaps better, instead of) requiring an email
> > address from authors, journals could require authors to
> > supply their author URI, obtained from an international open
> > registry analogous to Crossref.
> > >
> > > Obviously, not all journals would do this immediately, but
> > it's conceivable that a bibliographic service like PubMed or
> > Google Scholar could generate its own best estimate of the
> > set of distinct authors, represented within the corpus of
> > data that it indexes, using statistical text analysis
> > techniques, and could have its own namespace of author URIs,
> > which would map onto the official author-registered URIs.
> > >
> > > Authors would then start to have a strong incentive to
> > register their real author URI, and to correct any
> > mismappings that exist in the attempts at author
> > disambiguation that were generated by the bibliographic databases.
> > >
> > > So although it's a pretty thorny problem, I do think that
> > the elements of an achievable solution may be starting to
> > fall into place.
> >
> > Using hashed email addresses to create URIs instead of hashed "paper +
> > author name" helps a lot because the amount of URIs we will
> > have to deal
> > with is reduced by orders of magnitude, but the problem is far from
> > being solved in that case.
> >
> > Truly unique IDs inherently exhibit privacy issues. I could
> > use my MIT
> > ID number, or my social security # or my Italian fiscal code and these
> > will clearly identify me, but I don't want you to know them! (Sure, I
> > could hash them, but do I trust SHA-1 or MD5 enough?)
> >
> > Even with email I might like you to know it, but not the spammers!
> >
> > Use and abuse have a thin line that separates them and once
> > the data is
> > out there you have no more control over where it goes.
> >
> > So, if you combine privacy concerns with the need for differentiation and
> > uniqueness, you can think that people will have *many* different URIs
> > that identify them, just like you have several different
> > email addresses
> > that all reach you, in one way or another and that you might use to
> > identify yourself differently depending on the context (this
> > allows you,
> > for example, to trace the percolation of your information thru a
> > system... just like people use hotmail or gmail accounts for
> > registering
> > to web sites they don't trust to keep their email secret)
> >
> > My point is: without the ability to draw equivalences between URIs (or
> > state their difference), no system will work.
> >
> > And if you think that a worldwide centralized URI registry would be
> > enough to solve the problem, rethink your strategy because it won't
> > happen: networked systems avoid centralization like the plague, because
> > it's inherently more vulnerable.
> >
> > --
> > Stefano Mazzocchi
> > Research Scientist Digital Libraries Research Group
> > Massachusetts Institute of Technology location: E25-131C
> > 77 Massachusetts Ave telephone: +1 (617) 253-1096
> > Cambridge, MA 02139-4307 email: stefanom at mit . edu
> > -------------------------------------------------------------------
> >
> >

MacKenzie Smith
Associate Director for Technology
MIT Libraries
Building E25-131d
77 Massachusetts Avenue
Cambridge, MA 02139
(617)253-8184
kenzie_at_mit.edu
Received on Sat Jul 16 2005 - 19:35:53 EDT
