This is the page where we collect scenarios for data rewiring or general transformations that are considered needed during RDF-based data integration. Feel free to add more to this list if you can come up with any. Extra points for example rewrites, double points if the example is taken from the real world and uses a real vocabulary.
Example 1
A real-world example: scraping a site, I might get these statements:
[
a :Month ;
:title "Feb 2007" ;
:events [
:dates "1-2" ;
],
..., [
:dates "30-3/4" ;
];
].
and I want to rewrite them into:
[
a :Event ;
:start [
a :Date ;
:day 1 ;
:month 2 ;
:year 2007 ;
] ;
:end [
a :Date ;
:day 2 ;
:month 2 ;
:year 2007 ;
] ;
].
[
a :Event ;
:start [
a :Date ;
:day 30 ;
:month 2 ;
:year 2007 ;
] ;
:end [
a :Date ;
:day 4 ;
:month 3 ;
:year 2007 ;
]
].
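A minimal Python sketch of this rewrite. It assumes the month and year come from the `:title` literal and that a range such as "30-3/4" means start day, then end month/day, which is what the rewritten statements above imply. The function names are hypothetical, and the results are plain dicts rather than `datetime.date` objects because the scraped values need not be valid calendar dates (30 February, for instance).

```python
import re

MONTHS = {name: i for i, name in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}


def parse_title(title):
    # Split a title such as "Feb 2007" into (month number, year).
    name, year = title.split()
    return MONTHS[name[:3].title()], int(year)


def parse_dates(dates, month, year):
    # Expand "D-D" or "D-M/D" into (start, end) dicts.
    m = re.fullmatch(r"(\d+)-(?:(\d+)/)?(\d+)", dates.strip())
    if m is None:
        raise ValueError(f"unrecognised date range: {dates!r}")
    start_day, end_month, end_day = m.groups()
    start = {"day": int(start_day), "month": month, "year": year}
    end = {"day": int(end_day),
           "month": int(end_month) if end_month else month,
           "year": year}
    return start, end


month, year = parse_title("Feb 2007")
for raw in ["1-2", "30-3/4"]:
    print(raw, "->", parse_dates(raw, month, year))
```

Ranges that cross a year boundary are not handled; the end year is simply assumed to equal the start year.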
Example 2
I have logs of HTTP traffic that I wish to make more meta.
- Metadata about the referring site/URL, such as
  - live/dead
  - geolocation
  - page rank
  - first/last traffic dates
  - counts of total traffic
- Same story for clients, as identified by IP address or cookie
- Analogous story for browser clients.
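The traffic dates and counts in the list above can be derived from the logs alone. Here is a rough Python sketch, assuming the log lines have already been parsed into `(date, referrer, client)` tuples; the sample records and the function name are made up. Live/dead status, geolocation and page rank would need outside services and are not covered.

```python
from datetime import date


def summarise(records, key_index=1):
    # Return {key: {"first": date, "last": date, "hits": int}}.
    # key_index=1 groups by referrer, key_index=2 groups by client.
    summary = {}
    for record in records:
        day, key = record[0], record[key_index]
        entry = summary.setdefault(
            key, {"first": day, "last": day, "hits": 0})
        entry["first"] = min(entry["first"], day)
        entry["last"] = max(entry["last"], day)
        entry["hits"] += 1
    return summary


# Made-up sample records: (date, referrer, client IP).
records = [
    (date(2007, 2, 3), "http://example.org/a", "192.0.2.1"),
    (date(2007, 2, 1), "http://example.org/a", "192.0.2.2"),
    (date(2007, 2, 2), "http://example.net/b", "192.0.2.1"),
]
for referrer, stats in summarise(records).items():
    print(referrer, stats)
```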
Regular Expression Rewrites
"Beatles, the" -> "the Beatles"
"The Beatles." -> "The Beatles"
"April 9, 2006" -> "2006-04-09"
Might call this Literal Harmonization.
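The three rewrites above could be sketched with Python's `re` module roughly as follows. The function names are hypothetical, and the patterns are only as general as these examples require.

```python
import re

MONTHS = ["january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"]


def move_article(text):
    # "Beatles, the" -> "the Beatles"
    return re.sub(r"^(.+),\s*(the|a|an)$", r"\2 \1", text,
                  flags=re.IGNORECASE)


def strip_trailing_period(text):
    # "The Beatles." -> "The Beatles"
    return re.sub(r"\.\s*$", "", text)


def iso_date(text):
    # "April 9, 2006" -> "2006-04-09"; anything else is returned unchanged.
    m = re.fullmatch(r"([A-Za-z]+) (\d{1,2}), (\d{4})", text)
    if m is None or m.group(1).lower() not in MONTHS:
        return text
    month = MONTHS.index(m.group(1).lower()) + 1
    return f"{m.group(3)}-{month:02d}-{int(m.group(2)):02d}"


print(move_article("Beatles, the"))
print(strip_trailing_period("The Beatles."))
print(iso_date("April 9, 2006"))
```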
Case Conversions
"the Beatles" -> "The Beatles"
e.g. force to title case.
Extract simple metadata from related object
I.e. type, format, word count, size, etc.
Introduce Synopsis, etc.
There is plenty of software out there to extract unusual phrases, words, and synopses from text; i.e. text analysis tools (page through those). So this operator would apply those tools to the large attribute values or the resources referenced in the metadata. This can be as simple as introducing word counts and as fun as plucking out computer-generated summaries or keywords à la Amazon's statistically interesting phrases, etc.
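At the simple end of that range, a word count plus a crude keyword list needs nothing beyond the standard library. The stopword list and the function name below are placeholders chosen for illustration; real text analysis tools do far better.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that"}


def text_stats(text, top=3):
    # Count all words, then rank the non-stopwords by frequency.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return {"word_count": len(words),
            "keywords": [w for w, _ in counts.most_common(top)]}


print(text_stats("The cat and the hat. The cat sat.", top=1))
```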
Add Blobs for particular viewers
I don't see any reason why you can't introduce complex information in the metadata that can only be usefully viewed by a particular viewer. For example tag clouds, or charts. The tiny charts displayed in Gadget are a good example.
Building URI
Aka Literal Reifications.
:foo a :album ;
:group "The Beatles" .
becomes
:foo a :album ;
:group <http://example.com/my_albums/groups/the_beatles_01> .
<http://example.com/my_albums/groups/the_beatles_01> a :Group ;
dc:title "The Beatles" ;
:provenance [
a :Provenance ;
dc:creator :magic_term_maker ;
dc:creationDate "..." ;
] .
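A sketch of how such URIs might be minted in Python. It also covers the `mailto:` and `tel:` cases listed in this section. The helper names, the two-digit serial suffix and the US country code are assumptions made to reproduce the examples, not a recommended URI policy.

```python
import re
from urllib.parse import quote_plus


def slugify(label):
    # "The Beatles" -> "the_beatles"
    return re.sub(r"[^a-z0-9]+", "_", label.lower()).strip("_")


def mint_uri(base, label, serial=1):
    # The serial number keeps two different things with the same label apart.
    return f"{base}{slugify(label)}_{serial:02d}"


def to_mailto(text):
    # Everything after the first "?" is treated as the query part.
    address, sep, query = text.partition("?")
    return "mailto:" + address + ("?" + quote_plus(query) if sep else "")


def to_tel(text, country="1"):
    # Assumes a ten-digit North American number.
    digits = re.sub(r"\D", "", text)
    if len(digits) != 10:
        raise ValueError(f"expected ten digits in {text!r}")
    return f"tel:{country}-{digits[:3]}-{digits[3:6]}-{digits[6:]}"


print(mint_uri("http://example.com/my_albums/groups/", "The Beatles"))
print(to_mailto("foo@example.net?Information Request"))
print(to_tel("(555) 555-1212"))
```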
Other examples:
"foo@example.net?Information Request" --> <mailto:foo@example.net?Information+Request>
"(555) 555-1212" --> <tel:1-555-555-1212>
Parsing Literals
Aka Literal Internal Expansion.
:foo :authors "Jim, Mary, and Fred" ;
:location "42.358037, -71.060257" .
becomes
:foo :author "Jim", "Mary", "Fred" ;
:location [
a geo:Location ;
geo:lat "42.358037" ;
geo:long "-71.060257" ;
] .
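Both expansions above come down to small parsers. A possible Python sketch follows, with hypothetical function names; coordinates are kept as strings, as in the example, and are only checked for being numeric.

```python
import re


def split_authors(text):
    # Split on commas (optionally followed by "and") or on a bare "and".
    parts = re.split(r"\s*,\s*(?:and\s+)?|\s+and\s+", text)
    return [p for p in parts if p]


def parse_lat_long(text):
    # Expect exactly two comma-separated numbers.
    lat, long_ = (part.strip() for part in text.split(","))
    float(lat), float(long_)  # raises ValueError if either is not numeric
    return lat, long_


print(split_authors("Jim, Mary, and Fred"))
print(parse_lat_long("42.358037, -71.060257"))
```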
Service Guided Literal Rewrites
Aka Literal External Expansion; e.g. using the value of the literal to query an external web service and provide more statements
:foo :location "Boston" .
becomes
:foo :location "Boston, MA";
geo:lat "42.358037" ;
geo:long "-71.060257" .
Note that this shows two steps, both presumably guided by a geolocation service of some kind: first making the general term Boston specific to a particular Boston, and then looking up a precise geolocation for that term.
Similar examples come to mind for any domain with a vocabulary; authors, animals, vegetables, minerals, etc.
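The shape of such an operator can be sketched in Python with the service call passed in as a function. The dictionary below is a hypothetical stand-in for a real geolocation service, and statements are modelled as plain `(subject, predicate, object)` tuples.

```python
# Hypothetical stand-in for a geolocation service; a real pipeline would
# make a network request here instead of a dictionary lookup.
FAKE_GAZETTEER = {
    "Boston": {"name": "Boston, MA",
               "lat": "42.358037", "long": "-71.060257"},
}


def lookup_place(name):
    return FAKE_GAZETTEER.get(name)


def expand_location(subject, literal, lookup=lookup_place):
    # Return the statements for this location, expanded if the service
    # recognises the literal and left untouched otherwise.
    hit = lookup(literal)
    if hit is None:
        return [(subject, ":location", literal)]
    return [(subject, ":location", hit["name"]),
            (subject, "geo:lat", hit["lat"]),
            (subject, "geo:long", hit["long"])]


for triple in expand_location(":foo", "Boston"):
    print(triple)
```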
Redlining
:foo :location "Boston" ;
:location "Cambarge" .
becomes
:foo :location "Boston" ;
:location "Cambarge";
:warning [
a :Error ;
:service :WhateverService ;
:time "..." ;
:message "unexpected location" ;
:predicate :location ;
:value "Cambarge" ;
] .
Note: this is useful for unsupervised expansion operations. Basically, we have to pass down the processing pipeline enough information for supervision and cleanup to be performed later on.
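A rough Python sketch of a redlining step. It assumes statements are `(subject, predicate, object)` tuples and that "unexpected" means "not in a known vocabulary"; the vocabulary, the blank node naming and the function name are all made up for illustration.

```python
from datetime import datetime, timezone

KNOWN_LOCATIONS = {"Boston", "Cambridge"}  # assumed vocabulary


def redline(triples, known=KNOWN_LOCATIONS,
            service=":WhateverService", now=None):
    # Keep every input statement and append one warning record per
    # :location value that is missing from the vocabulary.
    stamp = (now or datetime.now(timezone.utc)).isoformat()
    out = list(triples)
    for n, (s, p, o) in enumerate(triples):
        if p == ":location" and o not in known:
            w = f"_:warning{n}"
            out += [(s, ":warning", w),
                    (w, "a", ":Error"),
                    (w, ":service", service),
                    (w, ":time", stamp),
                    (w, ":message", "unexpected location"),
                    (w, ":predicate", p),
                    (w, ":value", o)]
    return out


data = [(":foo", ":location", "Boston"),
        (":foo", ":location", "Cambarge")]
for triple in redline(data):
    print(triple)
```

Nothing is deleted or corrected here; the suspicious value stays in place so that a later, supervised step can decide what to do with it.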
Merging Equal Subgraphs
:foo :location [
a geo:Location;
geo:lat "1" ;
geo:long "2" ;
].
:bar :location [
a geo:Location ;
geo:lat "1" ;
geo:long "2" ;
].
becomes
:foo :location :_1 .
:bar :location :_1 .
:_1 a geo:Location;
geo:lat "1" ;
geo:long "2" .
What exactly we mean by equal certainly needs to be configurable.
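One simple notion of equality, exact agreement on all predicate/object pairs, can be sketched in Python as below. Statements are modelled as `(subject, predicate, object)` tuples and blank nodes as strings starting with `_:`, both assumptions of this sketch. Nested blank nodes are not compared recursively.

```python
from collections import defaultdict


def merge_equal_bnodes(triples):
    # Collect the (predicate, object) pairs of every blank node.
    props = defaultdict(set)
    for s, p, o in triples:
        if s.startswith("_:"):
            props[s].add((p, o))
    canonical = {}  # property set -> first blank node seen with it
    rename = {}     # blank node -> the node that replaces it
    for node, pairs in props.items():
        rename[node] = canonical.setdefault(frozenset(pairs), node)
    # Rewrite subjects and objects, dropping duplicate statements.
    merged = []
    for s, p, o in triples:
        t = (rename.get(s, s), p, rename.get(o, o))
        if t not in merged:
            merged.append(t)
    return merged


data = [
    (":foo", ":location", "_:a"),
    ("_:a", "a", "geo:Location"), ("_:a", "geo:lat", "1"), ("_:a", "geo:long", "2"),
    (":bar", ":location", "_:b"),
    ("_:b", "a", "geo:Location"), ("_:b", "geo:lat", "1"), ("_:b", "geo:long", "2"),
]
for triple in merge_equal_bnodes(data):
    print(triple)
```

Swapping in a looser comparison (say, coordinates rounded to a few decimals) only requires changing how the property set is built.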
BNode Identification
Create a globally unique identifier for a node that is anonymous in this model, so that it can be shared between datasets.
:_1 a geo:Location;
geo:lat "1" ;
geo:long "2" .
becomes
places:P_1_2_02 a geo:Location;
geo:lat "1" ;
geo:long "2" .
Of course places:P_1_2_02 is hardly the only choice; other obvious examples include places:1072 (i.e. a serial number), or places:8234A9DF382675D86E (a hash).
It is worth mentioning that if the created identifier is only a function of the BNode content, certain types of subgraph merging (see above) are a no-cost byproduct of this operation.
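The hash variant can be sketched in Python by hashing a canonical rendering of the node's content. The `places:` prefix comes from the example above; the choice of SHA-1, the 16-character truncation and the function name are arbitrary assumptions.

```python
import hashlib


def mint_id(properties, prefix="places:"):
    # Sort the (predicate, object) pairs so that the identifier does not
    # depend on the order in which the statements were read.
    canonical = "\n".join(f"{p} {o}" for p, o in sorted(properties))
    digest = hashlib.sha1(canonical.encode("utf-8")).hexdigest()
    return prefix + digest[:16].upper()


a = mint_id([("a", "geo:Location"), ("geo:lat", "1"), ("geo:long", "2")])
b = mint_id([("geo:long", "2"), ("geo:lat", "1"), ("a", "geo:Location")])
print(a, a == b)
```

Because the identifier depends only on content, two anonymous nodes with the same properties receive the same name, which is the merging by-product mentioned above.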
Type Emergence
Emerge the type of a node based on the existence of predicates or other substructures attached to it.
:NiceConference
geo:lat "1" ;
geo:long "2" ;
:abstracts_due "12/1/2009" .
becomes
:NiceConference
a :Conference ;
a geo:Location ;
geo:lat "1" ;
geo:long "2" ;
:abstracts_due "12/1/2009" .
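A small rule table is enough to sketch this in Python. The two rules below are guesses that reproduce the example; statements are again modelled as `(subject, predicate, object)` tuples.

```python
# Each rule: if a subject carries all of these predicates, assert this type.
TYPE_RULES = [
    ({"geo:lat", "geo:long"}, "geo:Location"),
    ({":abstracts_due"}, ":Conference"),
]


def emerge_types(triples, rules=TYPE_RULES):
    # Gather the predicates used on each subject.
    predicates = {}
    for s, p, _ in triples:
        predicates.setdefault(s, set()).add(p)
    out = list(triples)
    for s, preds in predicates.items():
        for required, rdf_type in rules:
            if required <= preds and (s, "a", rdf_type) not in out:
                out.append((s, "a", rdf_type))
    return out


data = [(":NiceConference", "geo:lat", "1"),
        (":NiceConference", "geo:long", "2"),
        (":NiceConference", ":abstracts_due", "12/1/2009")]
for triple in emerge_types(data):
    print(triple)
```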
Spanning Tree Rewrites
If you have a spanning tree over a portion of the statements then there are a number of rewrites that involve moving information up and down that tree. For example, you have items that represent months, and associated with those are items representing events starting in that month. You might wish to move information on the month items onto the events, or vice versa. A variant of this is when you have a collection of items taken from a particular source and you wish to annotate the items to indicate that.
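Moving information down one level of the tree might look like this in Python. It assumes statements are `(subject, predicate, object)` tuples; the function name and the `:month_title` predicate are invented for the illustration.

```python
def push_down(triples, edge, prop, new_prop=None):
    # Copy each parent's `prop` values onto the children reached via `edge`.
    new_prop = new_prop or prop
    values = {}
    for s, p, o in triples:
        if p == prop:
            values.setdefault(s, []).append(o)
    out = list(triples)
    for s, p, o in triples:
        if p == edge:
            out += [(o, new_prop, v) for v in values.get(s, [])]
    return out


data = [("_:month", ":title", "Feb 2007"),
        ("_:month", ":events", "_:e1"),
        ("_:month", ":events", "_:e2")]
for triple in push_down(data, ":events", ":title", ":month_title"):
    print(triple)
```

The source-annotation variant is the same operation with the collection as parent and a provenance predicate as the copied property.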
Collection Gathering
Introducing new items that collect all items matching some pattern.
Garbage Collecting
Some use cases where this arises:
- Given a pile-o-data containing public and private information glean out only the public portion prior to revealing it.
- Conversely glean out just the material of interest to the peeping tom.
- Given a pile-o-data containing redlining or other work-in-progress markup create a clean, unmarked-up copy.
- Conversely glean out only the markup of a particular kind, i.e. the todo list.
- Pluck out only that data which conforms to a given ontology.
- Or the inverse.
Deleting statements that are unreachable via patterns declared to be useful.
This is a fun, but very specialized, lump-o-code. Presumably there is a mark on each statement denoting that it's garbage. An iterative process then toggles the bit until it settles out. Or is the mark on the subjects? Hm...
Well, simple examples aren't that hard and are quite useful.
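One such simple case, sketched in Python as a mark-and-sweep over `(subject, predicate, object)` tuples. This version puts the mark on statements and walks outward from a set of root subjects; the roots, the list of useful predicates and the function name are assumptions of the sketch, and it uses a worklist rather than repeated passes.

```python
def collect_garbage(triples, roots, useful_predicates):
    # Index statements by subject for the traversal.
    by_subject = {}
    for t in triples:
        by_subject.setdefault(t[0], []).append(t)
    live_nodes, live_triples = set(), set()
    stack = list(roots)
    while stack:  # mark phase
        node = stack.pop()
        if node in live_nodes:
            continue
        live_nodes.add(node)
        for s, p, o in by_subject.get(node, []):
            if p in useful_predicates:
                live_triples.add((s, p, o))
                stack.append(o)  # literals are followed too, harmlessly
    # Sweep phase: keep the original order, drop unmarked statements.
    return [t for t in triples if t in live_triples]


data = [
    (":foo", ":location", "_:loc"),
    ("_:loc", "geo:lat", "1"),
    ("_:loc", ":warning", "_:w"),
    ("_:w", ":message", "unexpected location"),
    (":private", ":salary", "100"),
]
useful = {":location", "geo:lat", "geo:long"}
for triple in collect_garbage(data, [":foo"], useful):
    print(triple)
```

Here the redlining markup and the unrelated private statement are both swept away; inverting the predicate list would keep only the markup instead.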
Others
- URI Rewriting - fix namespace prefix misspelling
- Subjects Smooshing - merge the properties of two (or more) nodes that are considered equivalent
- Predicate Mapping - merge the objects-sets of two predicates that are considered equivalent
- Datatype conversions/emergence, e.g. [geo:lat "12.3"] --> [geo:lat "12.3"^^xsd:float]
- Introduce hash values based on various property patterns, which in turn may be used to guide smooshing.
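The first three items in this list are all forms of term substitution, so one Python sketch can cover them. It assumes statements are `(subject, predicate, object)` tuples of strings; the function names, the sample data and the misspelled `goe:` prefix are invented. Note that the mapping is applied to literals as well, which a real implementation would want to avoid.

```python
def rename_terms(triples, mapping):
    # Smooshing and predicate mapping: replace every term found in
    # `mapping` and drop statements that become duplicates.
    out = []
    for t in triples:
        new = tuple(mapping.get(term, term) for term in t)
        if new not in out:
            out.append(new)
    return out


def fix_prefix(triples, wrong, right):
    # URI rewriting: swap a misspelled namespace prefix for the right one.
    def fix(term):
        return right + term[len(wrong):] if term.startswith(wrong) else term
    return [tuple(fix(term) for term in t) for t in triples]


data = [(":beatles", ":name", "The Beatles"),
        (":the_beatles", ":label", "The Beatles"),
        (":the_beatles", ":origin", "Liverpool")]
print(rename_terms(data, {":the_beatles": ":beatles", ":label": ":name"}))
print(fix_prefix([("goe:x", "goe:lat", "1")], "goe:", "geo:"))
```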

