Data discussion

From: Ryan Lee <ryanlee_at_w3.org>
Date: Tue, 18 Jan 2005 11:22:11 -0500

We recently had a discussion with Steve Hughes concerning how best to
put live data in Longwell. With his permission, here is the
conversation in full, minus the actual data (which is available, but
Steve is reworking it already). In chronological order:

-------
From: Steve Hughes
To: Simile developers

Hi all,

I am trying to set up a simple facet based search as a proof-of-concept
for Solar System Exploration science data search. I have loaded a very
small part of our existing ontology into the Protege tool and exported
the .rdf and .rdfs files. I have successfully downloaded and installed
Longwell and have worked through the examples. However, after several
hours of work, I am still not able to get the search working with my data.

The directions seem clear enough, so I expect that there is a
fundamental flaw (hopefully simple) with my configuration. All the files
are attached. Could someone give it a quick look?

The first two attached files are in data\bundles, the second two files
are in data\datasets, and rdf_.rdfs is in data\ontologies\unofficial.

Thanks,
Steve Hughes
Project Engineer
Planetary Data System
Jet Propulsion Lab
NASA


-------
From: Simile developers
To: Steve Hughes

One problem here is that the Longwell facet browser expects all facet
values to be URIs. In your data they are literal values. Despite this, it
should be possible to use the other browser, Knowle, that comes as part
of Longwell (sorry the naming is confusing) with this data. Haystack and
the cut-down version of Haystack (Hayloft?) should also be able to
browse the data, but it is best to ask the Haystack team for help with
this.

Here are some guidelines on how to modify your data so it is browsable
in Longwell. You have a lot of properties that you want to browse as
facet values, so for now let's say you want to browse ARCHIVE_STATUS and
DATA_OBJECT_TYPE as facets. To do this you need to modify them as
follows in your data set:

<rdf_:DATA_SET rdf:about="&rdf_;ODY-M-ACCEL-5-ALTITUDE-V1.0">
      <rdf_:ARCHIVE_STATUS>
         <rdf:Description rdf:about="&terms;pre_peer_review">
             <rdfs:label>PRE PEER REVIEW</rdfs:label>
         </rdf:Description>
      </rdf_:ARCHIVE_STATUS>
      <rdf_:DATA_OBJECT_TYPE>
         <rdf:Description rdf:about="&terms;table">
             <rdfs:label>TABLE</rdfs:label>
         </rdf:Description>
      </rdf_:DATA_OBJECT_TYPE>
</rdf_:DATA_SET>

As you can see, I've given "pre peer review" and "table" URIs. Why have
I done this? Well, if they are facet values, then there is an implicit
assumption that they are going to occur more than once. So one way to
look at it is that I am performing a data modelling task very similar to
normalization in a relational database: I'm removing duplicated labels.
Normalizing them has a number of advantages. If I want to provide labels
in a different language then it's easier, or if NASA later decides to
replace the term "PRE PEER REVIEW" with the term "PRE REVIEW", it can do
so in one place. It also allows you to associate a longer human-readable
comment with the term, to explain exactly what it means. With scientific
terms, this may be very useful.
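
As a rough illustration of the normalization described above, the
following Python sketch mints one URI per distinct literal and stores
each label once. The helper names (mint_term, normalise), the slug rule
and the sample records are assumptions made for illustration; the
namespace is the example value of the "terms" entity used in this
message.

```python
import re

# Assumed vocabulary namespace (the example 'terms' entity from this message).
TERMS = "http://www.nasa.org/jpl/planetaryData/2005/01/11/vocabulary#"


def mint_term(label):
    """Turn a literal such as 'PRE PEER REVIEW' into a term URI (hypothetical slug rule)."""
    slug = re.sub(r"[^a-z0-9]+", "_", label.strip().lower()).strip("_")
    return TERMS + slug


def normalise(records, facet_props):
    """Replace literal facet values with URIs; return (records, labels)."""
    labels = {}  # term URI -> label, stored once however often it is used
    out = []
    for rec in records:
        new = dict(rec)
        for prop in facet_props:
            if prop in new:
                uri = mint_term(new[prop])
                labels.setdefault(uri, new[prop])
                new[prop] = uri
        out.append(new)
    return out, labels


# Two made-up records that share the same literal facet values.
records = [
    {"id": "A", "ARCHIVE_STATUS": "PRE PEER REVIEW", "DATA_OBJECT_TYPE": "TABLE"},
    {"id": "B", "ARCHIVE_STATUS": "PRE PEER REVIEW", "DATA_OBJECT_TYPE": "TABLE"},
]
normalised, labels = normalise(records, ["ARCHIVE_STATUS", "DATA_OBJECT_TYPE"])
```

Both records end up pointing at the same two term URIs, and renaming a
label means changing a single entry in labels.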

In some domains, it may be necessary to treat the vocabularies of facet
values even more formally, like thesauri. In the SemWeb, if you have
given your terms URIs, then you can use a proposal like SKOS to do this
- see http://www.w3.org/2004/02/skos/.

Another way to look at it is that associating these labels with a URI is
a key idea in SemWeb modelling. For example, the string literal "table"
by itself is ambiguous. It could refer to a dataset, or to something I
eat my dinner off. Non-SemWeb applications get around this by making
assumptions, e.g. that all instances of "table" refer to a dataset.
However, in the SemWeb we can't make that assumption, so we need to
associate URIs to distinguish between the different meanings. Even if
"table" means dataset, we might want to distinguish between what NASA
thinks a table dataset should look like and what the WHO thinks it
should look like.

Note you don't need to give labels to every instance of "pre peer
review", but if you do it doesn't matter, because the RDF processor will
turn these all into a single instance. If you look at one of the sample
datasets with Longwell you will see this, and in some of the samples
every instance has a label because many of the samples are created using
XSLT. You only seem to have small pieces of data, but if they were much
bigger, you might want to explore using XSLT to convert your RDF data
into a different form. For some discussion about this, see
http://www.hpl.hp.com/techreports/2004/HPL-2004-147.html
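
The remark above about repeated labels collapsing into one follows from
an RDF graph being a set of triples. The Python sketch below models that
with a plain set of tuples; it is only an illustration of the idea and
does not use Jena or any RDF library.

```python
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
TERM = "http://www.nasa.org/jpl/planetaryData/2005/01/11/vocabulary#pre_peer_review"

# The same statement written out three times, e.g. once per data set entry.
statements = [
    (TERM, RDFS_LABEL, "PRE PEER REVIEW"),
    (TERM, RDFS_LABEL, "PRE PEER REVIEW"),
    (TERM, RDFS_LABEL, "PRE PEER REVIEW"),
]

# Modelling the graph as a set: identical triples are stored only once.
graph = set(statements)
```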

If you are writing the data by hand, once you've associated a label with
"pre peer review" you can subsequently use a simpler way to reference it
e.g.

<rdf_:DATA_SET rdf:about="&rdf_;ODY-M-ACCEL-5-ALTITUDE-V1.0">
      <rdf_:ARCHIVE_STATUS rdf:resource="&terms;pre_peer_review"/>
      <rdf_:DATA_OBJECT_TYPE rdf:resource="&terms;table"/>
</rdf_:DATA_SET>

Note you will have to assign a value to terms at the top in your DOCTYPE
definition e.g.

<!ENTITY terms
  'http://www.nasa.org/jpl/planetaryData/2005/01/11/vocabulary#'>

It is best to pick a URI here that you have access to so you have the
ability to publish a schema later.
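
To see what the entity declarations do, the Python sketch below parses a
cut-down document with the standard library XML parser and prints
nothing; it just collects the expanded URIs. The document is an assumed
minimal example built from the snippets in this message (with the rdf_
entity ending in a hash), not your actual file.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Entity values copied from this message; the document is a cut-down example.
DOC = """<!DOCTYPE rdf:RDF [
  <!ENTITY rdf 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'>
  <!ENTITY rdf_ 'http://protege.stanford.edu/rdf#'>
  <!ENTITY terms 'http://www.nasa.org/jpl/planetaryData/2005/01/11/vocabulary#'>
]>
<rdf:RDF xmlns:rdf="&rdf;" xmlns:rdf_="&rdf_;">
  <rdf_:DATA_SET rdf:about="&rdf_;ODY-M-ACCEL-5-ALTITUDE-V1.0">
    <rdf_:ARCHIVE_STATUS rdf:resource="&terms;pre_peer_review"/>
    <rdf_:DATA_OBJECT_TYPE rdf:resource="&terms;table"/>
  </rdf_:DATA_SET>
</rdf:RDF>
"""

root = ET.fromstring(DOC)
data_set = root[0]

# The XML parser expands the entities, so these are full URIs.
subject = data_set.get("{%s}about" % RDF_NS)
objects = [child.get("{%s}resource" % RDF_NS) for child in data_set]
```

Here subject and objects hold absolute URIs built from the entity values,
which is what an RDF parser would receive.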

You may also wish to change your rdf_ entity to one you have access to
for similar reasons, and I would suggest ending it with a hash e.g.

  <!ENTITY rdf_ 'http://protege.stanford.edu/rdf#'>

This is important, because otherwise tools like Jena, and probably
consequently Longwell, will have difficulty separating fragments in the
URI, as without the hash you will end up with URIs like

http://protege.stanford.edu/rdfMISSIONNAME
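
The effect of the missing hash can be seen with a few lines of Python.
This sketch uses the standard library's urldefrag to split a URI at the
fragment; it illustrates the problem only and is not how Jena or
Longwell split URIs internally. The function name split_uri is made up
for the example.

```python
from urllib.parse import urldefrag


def split_uri(namespace, local_name):
    """Concatenate namespace and local name, then split at the fragment."""
    base, fragment = urldefrag(namespace + local_name)
    return base, fragment


# Without the trailing hash the local name cannot be recovered as a fragment.
without_hash = split_uri("http://protege.stanford.edu/rdf", "MISSIONNAME")
with_hash = split_uri("http://protege.stanford.edu/rdf#", "MISSIONNAME")
```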

Some other suggestions: one common convention in SemWeb applications is
to start properties with lower case letters, and classes with upper case
letters. This can be important because in some vocabularies / ontologies
you have properties and classes with the same name, so the case
difference is necessary to distinguish between them. It's also useful to
be able to tell which are properties and which are classes by eye. I
would recommend adopting a similar convention.
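
If you are renaming many terms, a small helper can apply the convention
above mechanically. The Python sketch below assumes camel case (the
convention itself only asks for the case of the first letter), and the
function names are made up for the example.

```python
def to_property_name(name):
    """DATA_OBJECT_TYPE -> dataObjectType (properties start lower case)."""
    head, *rest = name.lower().split("_")
    return head + "".join(part.capitalize() for part in rest)


def to_class_name(name):
    """DATA_SET -> DataSet (classes start upper case)."""
    return "".join(part.capitalize() for part in name.lower().split("_"))
```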

Also, in your browsing configuration, when you define your facets, you
seem to be making everything a facet. This is probably a bad idea - you
only want to make things facets if facet values will be associated with
more than one data instance. If a facet value is unique to a data
instance, then you just end up with a long list, one entry for each data
instance, which isn't ideal. I'm not familiar with your data, but I'm
guessing that you are unlikely to want DATA_SET_ID, DATA_SET_NAME and
FULL_NAME to be facets.
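
One way to apply the rule of thumb above is to count how often each
value occurs and keep only properties with at least one shared value.
The Python sketch below does that over made-up records; the function
name and the sample values ("ID-1", "IMAGE", "ARCHIVED" and so on) are
assumptions, not taken from your data.

```python
from collections import Counter


def facet_candidates(records, id_key="id"):
    """Return properties where at least one value is shared by two or more records."""
    counts = {}
    for rec in records:
        for prop, value in rec.items():
            if prop != id_key:
                counts.setdefault(prop, Counter())[value] += 1
    return sorted(p for p, c in counts.items() if max(c.values()) > 1)


# Made-up sample records: DATA_SET_ID is unique per record, the others repeat.
records = [
    {"id": "ds1", "DATA_SET_ID": "ID-1", "ARCHIVE_STATUS": "PRE PEER REVIEW",
     "DATA_OBJECT_TYPE": "TABLE"},
    {"id": "ds2", "DATA_SET_ID": "ID-2", "ARCHIVE_STATUS": "PRE PEER REVIEW",
     "DATA_OBJECT_TYPE": "IMAGE"},
    {"id": "ds3", "DATA_SET_ID": "ID-3", "ARCHIVE_STATUS": "ARCHIVED",
     "DATA_OBJECT_TYPE": "TABLE"},
]
candidates = facet_candidates(records)
```

With these records DATA_SET_ID is dropped because every value is unique,
while ARCHIVE_STATUS and DATA_OBJECT_TYPE remain.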

hope this helps, best regards

Mark H. Butler, PhD
mark-h.butler_at_hp.com
HP Labs Bristol http://www.hpl.hp.com/people/marbut


-----
From: Simile developers
To: Steve Hughes

Hi Steve,

This was my work process so you can see how to get your data as-is to
show up, but Mark's message is the more important one. I kept a log of
what I did to get your data to show up, at the end of my message. One
other general note first:

LONGWELL

With only one proper facet in your data (rdf:type) and one type of class
(rdf_:DATA_SET), no facets (and no results) will show up, since the
analysis would show that selecting 'rdf:type rdf_:DATA_SET' wouldn't
change the size of the result set. To make it show up, I just added
another set of data that had 'rdf:type foaf:Person' in it and configured
the disp:displayClasses sequence to include foaf:Person.

When you've got more properties that can act as facets set up, you can
drop this hack.
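
The behaviour described above can be sketched in a few lines of Python:
a facet value is only worth offering if selecting it returns fewer items
than are already shown. This is an illustration of the idea as I
described it, not Longwell's actual analysis code, and the function name
is made up.

```python
def narrowing_values(items, prop):
    """Values of prop whose selection returns fewer items than the current result set."""
    total = len(items)
    counts = {}
    for item in items:
        for value in item.get(prop, ()):
            counts[value] = counts.get(value, 0) + 1
    return {v: n for v, n in counts.items() if n < total}


# Three items that are all rdf_:DATA_SET, then the same plus one foaf:Person.
only_datasets = [{"rdf:type": {"rdf_:DATA_SET"}} for _ in range(3)]
with_people = only_datasets + [{"rdf:type": {"foaf:Person"}}]

before = narrowing_values(only_datasets, "rdf:type")
after = narrowing_values(with_people, "rdf:type")
```

With only data sets loaded, before is empty, so there is nothing to show
as a facet; adding one item of another type makes both values useful.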

LOG

I unzipped a fresh Longwell 1.0.1 and put your files here:

longwell-1.0.1/data/bundles/nasa/data.properties
longwell-1.0.1/data/bundles/nasa/conf/config.n3
longwell-1.0.1/data/datasets/rdf_/build.xml
longwell-1.0.1/data/datasets/rdf_/data/rdf_.rdf
longwell-1.0.1/data/ontologies/unofficial/rdf_.rdfs

I edited longwell-1.0.1/build.properties to add 'nasa' to list.bundles
and 'rdf_' to list.datasets. Running './build.sh webapp -Dbundle=nasa'
succeeded.

Once built, I did any editing in
longwell-1.0.1/webapp/WEB-INF/

config.n3 was missing a colon in the prefix 'rdf_:'

The &rdfs; entity was using an out of date namespace; replaced it with
'http://www.w3.org/2000/01/rdf-schema#' across all RDF-ish files.

Parts of config.n3 were unused; removed disp:objectPropertyDisplays
sequence and all facets, as per Mark's message.

Added rdfs:label to disp:displayProperties

Changed all instances of http://protege.stanford.edu/rdf to
http://protege.stanford.edu/rdf# (rdf_.rdf, rdf_.rdfs, config.n3), as
per Mark's note on URI separators.
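
If you need to repeat that namespace change, a short Python sketch like
the one below does it on a string and is safe to run twice. The function
name is made up, and the pattern assumes the old namespace is never
followed directly by other characters that belong to a different URI.

```python
import re

# Match the namespace only when it is not already followed by '#'.
OLD_NS = re.compile(r"http://protege\.stanford\.edu/rdf(?!#)")


def add_hash(text):
    """Append '#' to every occurrence of the Protege namespace that lacks one."""
    return OLD_NS.sub("http://protege.stanford.edu/rdf#", text)


before = "<!ENTITY rdf_ 'http://protege.stanford.edu/rdf'>"
after = add_hash(before)
```

You would read each of rdf_.rdf, rdf_.rdfs and config.n3, pass the
contents through add_hash and write the result back, keeping a backup.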

Threw in another data set and added one class used in it to
disp:displayClasses, as per above. You should be able to find extra
data in data/datasets/foaf/data if you've built the 'people' bundle.

And now Longwell displays. See attached files.

-- 
Ryan Lee                 ryanlee_at_w3.org
W3C Research Engineer    +1.617.253.5327
http://simile.mit.edu/
Received on Tue Jan 18 2005 - 16:22:26 EST
