Re: RDF 101 [was Re: introduction and questions] from Stefano Mazzocchi on 2005-04-18

From: Stefano Mazzocchi <stefanom_at_mit.edu>
Date: Mon, 18 Apr 2005 18:52:45 -0400

Erik Hatcher wrote:
>
> On Apr 15, 2005, at 10:22 AM, Stefano Mazzocchi wrote:
>
>>> Now on to my questions....
>>> First, I'm utterly clueless about RDF.
>>
>>
>> That's totally fine. We do not expect our users to know RDF inside
>> out, and we are willing to help to get them up to speed.
>
> *whew*

I'm actually glad you are asking those questions because I fear a lot of
people here are refraining from asking, either because they are shy or
because they fear that we would be mad at them for not knowing RDF.

Well, I hope they will see that this is clearly not the case: we are not
RDF fanatics, and we try not to use it as a golden hammer. We use it
because it makes sense and it solves our problems a lot better than
other formats, but that's about it.

>> the second means that you need a 'rdf:type' statement, or, using
>> RDF/XML, you need to say something like
>>
>> <blah:Blah rdf:about="http://your.host.com/uri/3809480">
>> ...
>>
>> instead of
>>
>> <rdf:Description rdf:about="http://your.host.com/uri/3809480">
>> ...
>
> I've now converted to using your recommended syntax. And I'm also
> adding a <dc:type> element that points to some different metadata that
> we have. Longwell2 is showing both together as "type" - is that
> correct? Is the implicit rdf:type somehow connected to dc:type?

Hmm, no, dc:type and rdf:type are two completely different things. What
do you mean by "joining them together"? Are you sure you map the dc:
prefix and the rdf: prefix to the proper namespaces?
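
To make the difference concrete, here is a small Python sketch that
expands both names to full URIs. The two namespace URIs are the standard
RDF and Dublin Core element ones; the `expand` helper is a hypothetical
name I'm using only for illustration, not part of Longwell.

```python
# Standard namespace URIs for the two prefixes discussed above.
NAMESPACES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def expand(qname, namespaces=NAMESPACES):
    """Expand a prefix:local name into a full URI (hypothetical helper)."""
    prefix, local = qname.split(":", 1)
    return namespaces[prefix] + local


if __name__ == "__main__":
    # The local name is "type" in both cases, but the URIs differ.
    print(expand("rdf:type"))
    print(expand("dc:type"))
```

If your file binds either prefix to a different namespace URI, the
expanded property is a different one again, which is why the prefix
mappings are the first thing I would check.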

>> Note that the RDF/XML syntax is rather weird as it has special
>> meanings, for example
>>
>> <blah:Blah
>> xmlns:blah="http://blah.com/ns/blah#"
>> rdf:about="http://your.host.com/uri/3809480">
>> ...
>>
>> is completely equivalent to
>>
>> <rdf:Description rdf:about="http://your.host.com/uri/3809480">
>> <rdf:type rdf:resource="http://blah.com/ns/blah#Blah"/>
>> ...
>>
>> [this creates all sorts of problems in RDF canonicalizations and some
>> people hate it and some love it, but hey, RDF/XML is even older than
>> the XML namespaces spec and it feels kinda prehistoric to me at
>> times, but it grows on you after a few months]
>
> *head spinning* - cool... glad to have some mentoring on this stuff, as
> I'd be stumbling in the dark for ages on things like this.

Yep, I stumbled on that as well.
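
If you want to convince yourself of the equivalence quoted above, here is
a rough Python sketch. It is not a real RDF/XML parser: it only handles
the two shapes from the example (a typed node and an rdf:Description with
an rdf:type child), and `type_triples` is a name I made up.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

TYPED = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:blah="http://blah.com/ns/blah#">
  <blah:Blah rdf:about="http://your.host.com/uri/3809480"/>
</rdf:RDF>"""

PLAIN = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="http://your.host.com/uri/3809480">
    <rdf:type rdf:resource="http://blah.com/ns/blah#Blah"/>
  </rdf:Description>
</rdf:RDF>"""


def type_triples(rdfxml):
    """Collect (subject, rdf:type, class) triples from a tiny RDF/XML subset."""
    triples = set()
    for node in ET.fromstring(rdfxml):
        subject = node.get("{%s}about" % RDF)
        if node.tag != "{%s}Description" % RDF:
            # Typed node: the element name itself is the class URI.
            namespace, local = node.tag[1:].split("}")
            triples.add((subject, RDF + "type", namespace + local))
        for child in node:
            if child.tag == "{%s}type" % RDF:
                # Explicit rdf:type statement inside the node.
                triples.add((subject, RDF + "type", child.get("{%s}resource" % RDF)))
    return triples


if __name__ == "__main__":
    print(type_triples(TYPED) == type_triples(PLAIN))
```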

>> I'm sure you've seen my "No-nonsense Guide to Semantic Web Specs for
>> XML People"
>
> Yes, and have also re-read them thanks to your pointers.
>
>> part 1 -> http://www.betaversion.org/~stefano/linotype/news/57/
>
> One thing that hasn't become clear yet is the use of the "#" at the end
> of the namespace URI's - you mention it'll become clear, but thus far it
> hasn't for me.

Ok.

First of all, there is a big discussion about whether to use # or / at
the end of RDF namespaces. In theory, there is no difference: both give
valid identifiers. In practice, there is a difference: the #... part of
the URL is *NOT* passed to the server by the browser when dereferencing
things.

For example, say you have to define a URI space for a collection of
things and you want these things to be dereferenceable (meaning that you
can use the http URI as a URL and get the RDF about that URI directly
from the web site. It's not mandated that you do that, but it's a
well-established practice and the reason why http:// URIs are normally
preferred to, say, urn: URIs).

So, if you pick #, for example

   http://your.host.com/2005/04/paintings#

you can do something like

@prefix painting: <http://your.host.com/2005/04/paintings#> .

painting:1
   dc:title "Sunflowers"@en ;
   dc:creator "Gogh, Vincent van" ;
.

painting:2
   dc:title "Self Portrait"@en ;
   dc:creator "Gogh, Vincent van" ;
.

but if you have 300k paintings (as the Artstor collection does!) and
you want to dereference painting #1, which has a complete URI of

  <http://your.host.com/2005/04/paintings#1>

and you "wget" that URI, you get a huge RDF file and then you have to
scan, client side, for the #1 item.

Instead, if you used

  http://your.host.com/2005/04/paintings/1

you could just get the "1" page that will probably contain just a few
statements about that single painting.
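
Here is a tiny Python sketch of what the server gets to see in each case.
It only uses the standard library's `urlsplit`; the `request_target`
helper is a hypothetical name of mine.

```python
from urllib.parse import urlsplit


def request_target(uri):
    """Return (host, path sent in the HTTP request, fragment kept client side)."""
    parts = urlsplit(uri)
    return parts.netloc, parts.path, parts.fragment


if __name__ == "__main__":
    for uri in (
        "http://your.host.com/2005/04/paintings#1",
        "http://your.host.com/2005/04/paintings/1",
    ):
        host, path, fragment = request_target(uri)
        print("GET %s from %s (fragment kept by the client: %r)" % (path, host, fragment))
```

With the # form the request path is the same for every painting, so the
server cannot tell which one you wanted; with the / form the item
identifier is part of the path.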

Note: the big advantage of # over / is that you can clearly distinguish
between the model and the item. When you use / it's hard to tell which
is the model and which are the items... in Welkin, we say that the token
after the last / is the item and the token before that is the model, but
that's only a heuristic. With # you are certain of what the modeler
wanted to do.
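
A rough Python sketch of that kind of split, just to show the idea. This
is not Welkin's actual code: `split_uri` is a made-up name, and treating
everything up to the last separator as the model namespace is my
simplification.

```python
def split_uri(uri):
    """Split a URI into (model, item, certain).

    With '#' the split is unambiguous, so certain is True. With '/' we
    guess that the last path token is the item, so certain is False.
    """
    if "#" in uri:
        model, item = uri.split("#", 1)
        return model + "#", item, True
    model, _, item = uri.rpartition("/")
    return model + "/", item, False


if __name__ == "__main__":
    print(split_uri("http://your.host.com/2005/04/paintings#1"))
    print(split_uri("http://your.host.com/2005/04/paintings/1"))
```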

As a rule of thumb, I use # for ontologies and / for collections, but
that's my personal preference and not something that is mandated or
required or even considered a best practice. I find it easier to work with.

I also know that SPARQL is probably going to remove the problem because
dereferencing won't be done through direct HTTP GET requests but through
a SPARQL-enabled web service.

But that is not yet in common use, so for now we are stuck with the
existing limitations.

>> If you want to get a little deeper, it's probably easier to keep
>> asking specific questions here as soon as you encounter a roadblock.
>
> With that said, I've gotten a bit further. I've RDF'd the Rossetti
> Archive files. I've done a 1-for-1 transformation from our XML files
> into RDF, though I'm sure this is not granular enough. I'm tossing these
> details out in case someone is interested in pointing me in the right
> directions - in other words, help if you want, and it'll be appreciated,
> but worries if not.

I hope you mean "no worries" here ;-) I really hope you don't come all
the way to boston to beat me up if I don't help you out ;-)

> I've zipped all the RDF files here:
>
> http://www.rossettiarchive.org/docs/rossetti_rdf.zip (2MB currently)
>
> and a specific example here:
>
> http://www.rossettiarchive.org/docs/1-1847.s244.raw.rdf

Suggestion: I would split the comma-separated subjects into separate
statements, one per subject.
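
Something along these lines, as a Python sketch. I'm assuming the
subjects sit in a single dc:subject literal; the item URI and the subject
values below are made up for illustration, and `split_subjects` is a
hypothetical helper.

```python
def split_subjects(subject_uri, subjects, predicate="dc:subject"):
    """Turn one comma-separated literal into one N3 statement per value.

    Note: values are not escaped, so this assumes they contain no quotes.
    """
    values = [s.strip() for s in subjects.split(",") if s.strip()]
    return ['<%s> %s "%s" .' % (subject_uri, predicate, value) for value in values]


if __name__ == "__main__":
    # Hypothetical input; the real values live in your RDF files.
    for line in split_subjects("http://example.org/item/1", "poetry, sonnet, love"):
        print(line)
```

That way each subject becomes its own statement and can be browsed or
faceted on individually.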

> This is just the beginnings, and there is much more metadata (rhyme,
> meter, genre, etc, etc, etc) available once I find the best buckets to
> put it in within RDF. And there are quite a number of connections
> between the objects in our archive as well.

Don't worry about creating your own things. The other nice thing about
RDF is that you don't have to create ontologies before you model the
data. You can do that incrementally: just start adding new types and new
properties and write them down as you go. You'll annotate them later and
create your own RDF schemas, and then you might map them to existing
schemas later on using OWL equivalences.

Think of meta-metadata.

> How do I represent these connections? For example, we have "workcodes" on
> various objects (down to the <div> level within actual manuscripts) that
> connect things back to a formal work. You can see this aggregation in our
> collection view like this:
>
> http://www.rossettiarchive.org/docs/1-1847.s244.rawcollection.html
>
> Hyperlinks with #anchors are down to the <div> level.
>
> I've distilled lots of our gory XML metadata out into fields I indexed
> with Lucene. Queries connecting workcodes can be made like this:
>
> http://www.rossettiarchive.org/rose/?query=workcode%3A1-1847.s244
>
> Putting these connections into RDF, of course, is the next goal.

Holy cow, that's going to be an interesting modelling and presentation
problem :-)

Encoding relationships in RDF is easy: just give unique names to all the
items and write relationships between them.

So, if you have

  <div id="1">
   <div id="1.1">
    ...
   </div>
  </div>

in XML, in RDF/N3 it's something like

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix text: <http://www.rossettiarchive.org/2005/04/text#> .
@prefix ros: <http://www.rossettiarchive.org/docs/1-1847.s244/> .

ros:1
   rdf:type text:Fragment ;
   text:contains ros:1.1 .

ros:1.1
   rdf:type text:Fragment .
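
If you want to generate statements like these from your XML, here is a
minimal Python sketch. It assumes plain nested <div> elements with id
attributes and no XML namespace; the text: and ros: prefixes are the ones
from the example above, and `fragments_to_n3` is a made-up name.

```python
import xml.etree.ElementTree as ET

XML = """<div id="1">
  <div id="1.1"/>
</div>"""


def fragments_to_n3(xml_text, prefix="ros"):
    """Emit one type statement per <div> and one text:contains per nesting."""
    lines = []

    def walk(div):
        name = "%s:%s" % (prefix, div.get("id"))
        lines.append("%s rdf:type text:Fragment ." % name)
        for child in div:
            # Only follow nested divs that carry an id we can name.
            if child.tag == "div" and child.get("id"):
                lines.append("%s text:contains %s:%s ." % (name, prefix, child.get("id")))
                walk(child)

    walk(ET.fromstring(xml_text))
    return lines


if __name__ == "__main__":
    print("\n".join(fragments_to_n3(XML)))
```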

>> Welkin and Longwell2 do not need that finetuning, you can throw
>> whatever RDF at them and they will adjust to the data.
>
> I have now started using Longwell2, and it has been working nicely to
> see how I progress with the RDFization of the Rossetti Archive.

Awesome, glad to hear that.

>>> - Longwell2 - How do I get it to work with a sample dataset? I
>>> tried pointing longwell.properties the data directory of my Longwell
>>> TRUNK area, but it did not work.
>>
>>
>> you have to run it like
>>
>> ./longwell.sh longwell.properties datadir
>
>
> Just a minor correction... _run_ is needed:
>
> ./longwell.sh run longwell.properties datadir

yep, forgot that, thanks.

>> and it will load all the *.rdf, *.n3, *.rdfs, *.owl files found
>> recursively in the datadir.
>
> Sure enough it did!

:-D
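
For the curious, the recursive pickup by file extension can be sketched
in a few lines of Python. This is not Longwell's actual code (Longwell is
not written in Python); `find_rdf_files` is a hypothetical helper that
only mirrors the behaviour described above.

```python
from pathlib import Path

# The extensions mentioned above.
RDF_SUFFIXES = {".rdf", ".n3", ".rdfs", ".owl"}


def find_rdf_files(datadir):
    """Recursively list files under datadir whose extension looks like RDF."""
    return sorted(
        path
        for path in Path(datadir).rglob("*")
        if path.is_file() and path.suffix.lower() in RDF_SUFFIXES
    )


if __name__ == "__main__":
    for path in find_rdf_files("."):
        print(path)
```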

>>> - Welkin - well done! It'll make more sense to me when I
>>> understand RDF a bit more, but it's a nice visualization.
>>
>>
>> Did you try it from the trunk or from the webstart release?
>
>
> At the time of writing, I was running it from WebStart. I've now been
> using trunk. The sliders (which I don't understand yet) get cropped a
> bit on Mac OS X.

Yeah, the UI needs a little fine-tuning, which is hard given how Swing is
native on a Mac and not really that controllable for those things.

> It'll be even cooler when I get the connections between
> documents in there :)

Definitely.

> How can I open up a directory full of RDF files in Welkin? Or will it
> accept a .zip file of .rdf files?

No, but you are welcome to submit a patch :-)

>>> - Charon - this looks like something we could really leverage
>>> with Collex - allowing folks that have legacy low-tech archives to
>>> be "collectable" somehow. This may be a place of collaboration for us.
>>
>> Awesome! Charon is based on cocoon, so 99.9% of the complexity is
>> already dealt for you by it. All you need to do is to write a few XSLT
>> stylesheets, take a look at
>>
>> http://simile.mit.edu/repository/charon/trunk/stylesheets/rdfize.xslt
>>
>> to see the 'core' action. This is the rdfizer targetted for a dspace
>> site. Charon was built with dspace in mind, but it's relatively easy
>> to modify it to be able to support multiple sites at the same time,
>> would that need emerge. I'd be happy to help out directly with that,
>> also because I would love Charon and Piggy-Bank to share XSLT RDFizers
>> (a-la GRDDL)
>>
>> http://www.w3.org/2004/01/rdxh/spec
>
>
> Excellent. The proxy idea will be one of the last things we do with
> Collex since we're aiming for the rich archives that have the technical
> wherewithal to supply their information as RDF. But I suspect it will
> come to having to do some proxying like this sometime down the road.

Yeah, proxying is a nice way to RDFize stuff with small impact on the
existing systems.

> Piggy-Bank - this is right up the "collection" alley we're looking for,
> but we'll want to be able to collect objects in non-Firefox browsers
> also somehow. One idea I have is to use a bookmarklet to "Collex It!"
> which will somehow send information to our system. We could have our
> archives marked with embedded RDF which is what Piggy-Bank leverages.
> And we could, perhaps in addition, have a <meta> tag that pointed to the
> RDF. When our system receives a URL to collect, it'd parse out the RDF
> and allow the user to pick the specific objects desired. Is this a
> reasonable approach? What other suggestions do folks have in this regard?

I admit I have not given much thought to an environment where we
don't have full control of the browser... I'll think about it.

-- 
Stefano Mazzocchi
Research Scientist                 Digital Libraries Research Group
Massachusetts Institute of Technology            location: E25-131C
77 Massachusetts Ave                   telephone: +1 (617) 253-1096
Cambridge, MA  02139-4307              email: stefanom at mit . edu
-------------------------------------------------------------------

Received on Mon Apr 18 2005 - 22:52:00 EDT
