[RT] rethinking the need for inferencing

From: Stefano Mazzocchi <stefanom_at_mit.edu>
Date: Mon, 06 Mar 2006 19:36:17 -0800

NOTE: RT stands for "random thoughts". RT's represent 'brainstorming'
threads that I introduced in the Apache Cocoon community as a way to
foster innovation and induce some knowledge transfer between the people
in the core of the community to and from its periphery.

RTs have no rules and this means that you can say whatever you want,
doesn't have to be particularly smart, or effective... the amazing thing
is that the result if often nobody's idea, but a collective thought process.


                               - o -


One of the big problems on the table for RDF is scalability, both in
terms of fast querying of massive amounts of triples, especially when
this querying involves inferencing results our of a huge pile of
semi-structured data following implicit (OWL) or explicit (RuleML and
friends) codified rules.

Personally, I'm not a great fan of description logics. It would work in
a world that resembles a math book, where thousands of people spend
lifetimes making sure that everything has coherent symbolic
representation and codified relationships among them. I don't believe
such a world to be socio-economically feasible (it's already hard enough
for math itself, especially now that numeric analysis and computers blur
  the picture of logic proving).

The only things we needed off of RDF/S and OWL are what has been named
"OWL tiny": basically sub-typing and equivalences. And no, sets that are
members of themselves are not the kind of problem I'm likely to encounter.

All I want is something that says "vra:creator" is just like
"dc:creator", so that I can write software that depends on dc:creator
and tollerates variety.

Another thing I need is to say is that "urn:isbn:39409848" is really the
same thing as "urn:lccn:309840984" (warning: numbers are made up!).

Semantic web experts will tell you that you need an 'inferencer' for
that: mix the data with some OWL equivalence statements, blend in an
inferencer, and voila', you get the results out.

But what kind of results?

here is where it gets tricky: if vra:creator is equivalent to
dc:creator, do you remove the original property and attach a new one, or
add a new one and leave the existing one there? There is no way to tell
an OWL inferencer what to do with that.

Even more, if urn:a is equivalent to urn:b, do we copy all properties of
urn:a to urn:b and viceversa, or do we take urn:a as the master, move
all the properties of urn:b to urn:a and leave urn:b attached to urn:a
with the equivalence? Again, two strategies and now way to differentiate
them.

The OWL will tell you that this is an implementation detail, but if I
want to scale and I can't afford to do all the inferencing at runtime, I
need to predict what's going to happen to my graph after an inferencing
operation, because the graph resulting in the different strategies have
very different properties.

                                - o -

So, I'm thinking: is inferencing what we really need?

In my previous XML life (around Apache Cocoon), everything revolved
around pipelines which had three stages:

  1) generation (anything to XML)
  2) transformation (XML to XML)
  3) serialization (XML to anything)

I'm so used to that model that I've tried to replicate some of this here
and in fact RDFizers are the first step of that pipeline, but the other
two stages are missing: one that transforms an RDF graph into another
and one that transforms an RDF graph into something else.

Fresnel is, in fact, the first of our attempts into #3. You can, in
fact, see Fresnel as a "spanning tree" extractor (which is the simplest
thing that could allow us to hook an RDF pipeline into an XML pipeline,
therefore maximizing reusability of existing software). In theory, if
you have an XML pipeline already available, a 'spanning tree' extractor
is all you need to perform all sort of serialization operations.

This leaves us with the problem of performing #2, possibly pipelined one
after the other.

All transformation phases require access to the input data stream for
the transformation to happen, some of these retain state and some do
not. It is always preferable to have stages that do not retain state (or
that retain, reliably, a small state in a windowed scenario)

The beauty of pipelines is that it creates a sort of Lego(tm)-like set
of relatively-simple-yet-not-so-trivial-to-write operators that can be
used to create very specific operators without requiring complicated
programming, but just high level assembly. Moreover, it allows
development of these components to be parallelized and, probably more
important, to achieve similar results with different algorithms and
strategies.

An OWL reasoner or a rule engine for RDF are similar to an XSLT
transformer for XML and both are general enough to be impossible to
define, a priori, the amount of state the need to have access to to
perform their operation. In the "OWL Full" sense, we cannot even tell
that the operation will terminate at all! (what's funny is that this is
also true for XSLT, as it's true for any turing-complete languages,
being possible to write a turing machine in XSLT)

CWM is also an RDF transformer, where transformation rules are defined
in N3.

                                 - o -

What I personally don't like about rule engines is that they tend to
start simple and focused and later they grow into monstrous
turing-complete languages or they do a magnificent job at that 80/20
rule, too bad that what you need is *always* 81% and adding that 1% to
the 80% is enough work to make you hate it. (XSLT feels like that to me
almost all the times).

So, before we go down the path of defining which architecure we want for
this RDF trasnformation pipeline, I think it's important to define what
operations we want to be able to achieve.

  smooshing
  =========

smooshing is a graph operation that takes, for example, this graph as input

  A -(a)-> [foo]
  B -(b)-> [bar]
  B -(sameAs)-> A

and returns you

  A -(a)-> [foo]
  A -(b)-> [bar]
  B -(sameAs)-> A
  B -(b)-> [bar]**

where the ** statement can be present or not depending on whether not we
want the smoosher to be preserving (no statements are removed) or
pruning (only one node ends up having the properties attached to it)

The smoosher can also act on property equivalences:

  A -(a)-> [foo]
  B -(b)-> [bar]
  b -(sameAs)-> a

returns

  A -(a)-> [foo]
  B -(a)-> [bar]
  b -(sameAs)-> a
  B -(b)-> [bar]**


  subclassing
  ===========

subclassing is a graph operation that takes, for example, this graph as
input

  A -(a)-> [foo]
  B -(b)-> [bar]
  B -(subClassOf)-> A

and returns

  A -(a)-> [foo]
  B -(b)-> [bar]
  B -(a)-> [foo]
  B -(subClassOf)-> A

this also works for properties (even if the result is different):

  A -(a)-> [foo]
  B -(b)-> [bar]
  b -(subClassOf)-> a

returns

  A -(a)-> [foo]
  B -(b)-> [bar]
  B -(a)-> [bar]
  b -(subClassOf)-> a

NOTE: the real operations are more complicated than this if the
equivalences involve classes and not just instances.


  extracting
  ==========


given a graph, a property, a regular expression and a graph pattern,
returns a graph that has the literal associated with the property
"extracted" into subgraphs that are appended to the existing graph.

This is useful to turn something like

  A -(name)-> "Smith, Joe 1932-1978"

into

  A -(lastname)-> "Smith"
  A -(firstname)-> "Joe"
  A -(dates)-> "1932-1978"


  distancing
  ==========

given a graph, a property, a string distance function, a threshold and a
property, generate a graph that contains equivalences between nodes if
they contain a property that is closer than the threshold. For example

  A -(name)-> "Stefano Mazzocchi"
  B -(name)-> "Stefano Mazzochi"

  f is 'edit distance'
  t is 2
  p is 'sameAs'

returns

  A -(name)-> "Stefano Mazzocchi"
  B -(name)-> "Stefano Mazzochi"
  A -(sameAs)-> B

Another version of a distancing operator is to use a node distance
function and a node type instead of a string distance function and a
property.


  projecting
  ==========

projecting is a graph operation that takes a graph, a node type and a
property as input and returns a graph as a result. For example,
projecting this graph

  A1 -(type)-> Z
  A2 -(type)-> Z
  A1 -(a)-> B
  A2 -(a)-> B

along on type Z with property -(1)-> returns

  A1 -(type)-> Z
  A2 -(type)-> Z
  A1 -(1)-> A2
  A2 -(1)-> A1

The utility of this operation is to perform the same type of graph
operation that Amazon does with books "the people that bought this CD
also bought this other".

NOTE: there are many variations of this projection operator and I'm
currently not sure I understand which ones makes most sense.


Boy, this was longer than I expected. Well, hope it makes some sense.

Please, pile up your operators.

-- 
Stefano Mazzocchi
Research Scientist                 Digital Libraries Research Group
Massachusetts Institute of Technology            location: E25-131C
77 Massachusetts Ave                   telephone: +1 (617) 253-1096
Cambridge, MA  02139-4307              email: stefanom at mit . edu
-------------------------------------------------------------------
Received on Tue Mar 07 2006 - 03:34:40 EST

This archive was generated by hypermail 2.3.0 : Thu Aug 09 2012 - 16:39:18 EDT