Notes from Daniel, David, and Adam's meeting on 10/04/2006

In Longwell, one starts off at a screen with a list of all rdf:type objects and a search textbox.

To begin faceted browsing, you either click on an object to see a list of all objects of that type, or search for text. The text is indexed by Lucene, and matches are mapped back into the triple store to find the triples containing that text. To get the list of all types (that is, all objects of rdf:type triples), one issues the query ?* rdf:type ?t.

Let's assume the user has chosen the object Book. Longwell issues the query ?x rdf:type :Book to get a list of all subjects that have type :Book. So far, we have only seen filter queries (SQL WHERE clauses).
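As a rough illustration of these two filter queries, here is a minimal sketch using Python's built-in sqlite3 module. The three-column table named triples and the sample rows are assumptions made for the example; they are not Longwell's actual storage.

```python
import sqlite3

# Hypothetical naive triple store: one table, three text columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        (":bookA", "rdf:type", ":Book"),
        (":bookB", "rdf:type", ":Book"),
        (":msg1", "rdf:type", ":MailMessage"),
        (":bookA", ":author", "Hal Abelson"),
    ],
)

# ?* rdf:type ?t  -> the distinct types listed on the start screen.
types = [row[0] for row in conn.execute(
    "SELECT DISTINCT object FROM triples "
    "WHERE predicate = 'rdf:type' ORDER BY object"
)]

# ?x rdf:type :Book  -> every subject whose type is :Book.
books = [row[0] for row in conn.execute(
    "SELECT subject FROM triples "
    "WHERE predicate = 'rdf:type' AND object = ':Book' ORDER BY subject"
)]
```

Both statements only restrict rows with WHERE; neither needs a join.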

In addition to finding all matching subjects and displaying them, Longwell also displays a box on the right-hand side with the predicates (other than rdf:type) of subjects in the subject list. It also lists the objects that are involved in those triples.

To get a list of other predicates, we search for (? rdf:type :Book) ?p ?*. This is our first join query. Assuming we have a naive implementation of a triple store (three columns), we must self join to get the set of subjects whose rdf:type is :Book, and then find all predicates of those subjects.

Since we want to list objects of the predicates, we now do another self join for the set of all objects. This query would be ?* ((?s rdf:type :Book) ?p ?*) ?o. An example of this would be a :Book whose :Author is "Hal Abelson".
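The facet-box queries above might look roughly like this against a naive three-column table. This is a sketch under assumptions: the table name, column names, and sample data are invented for the example. Note that in this particular layout the object sits on the same row as the predicate, so a single self join returns both; a different schema could need the second join described above.

```python
import sqlite3

# Hypothetical naive triple store with invented sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        (":bookA", "rdf:type", ":Book"),
        (":bookA", ":author", "Hal Abelson"),
        (":bookB", "rdf:type", ":Book"),
        (":bookB", ":author", "David Karger"),
        (":bookB", ":year", "2006"),
        (":msg1", "rdf:type", ":MailMessage"),
        (":msg1", ":subject", "hello"),
    ],
)

# t1 selects the :Book subjects; t2 is the same table joined on subject,
# giving the other predicates (and their objects) of those subjects.
FACET_SQL = """
SELECT DISTINCT t2.predicate, t2.object
FROM triples AS t1
JOIN triples AS t2 ON t2.subject = t1.subject
WHERE t1.predicate = 'rdf:type' AND t1.object = ':Book'
  AND t2.predicate <> 'rdf:type'
ORDER BY t2.predicate, t2.object
"""
facet_values = conn.execute(FACET_SQL).fetchall()

# The predicate list alone is a projection of the same result.
predicates = sorted({predicate for predicate, _ in facet_values})
```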

There are also aggregate queries: We want to know how many subjects there are that are books, how many triples exist with predicates for those subjects, and how many times objects appear (how many "Hal Abelson" entries are there).
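The three counts could be computed along these lines. Again this is only a sketch: the triples table, the BOOKS subquery string, and the sample rows are assumptions for illustration.

```python
import sqlite3

# Hypothetical naive triple store with invented sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        (":bookA", "rdf:type", ":Book"),
        (":bookA", ":author", "Hal Abelson"),
        (":bookB", "rdf:type", ":Book"),
        (":bookB", ":author", "Hal Abelson"),
        (":bookC", "rdf:type", ":Book"),
        (":bookC", ":author", "David Karger"),
        (":msg1", "rdf:type", ":MailMessage"),
        (":msg1", ":author", "Hal Abelson"),  # not a book, must not be counted
    ],
)

# Subquery reused by all three aggregates: the current collection of books.
BOOKS = "SELECT subject FROM triples WHERE predicate = 'rdf:type' AND object = ':Book'"

# How many subjects are books?
book_count = conn.execute(f"SELECT COUNT(*) FROM ({BOOKS})").fetchone()[0]

# How many triples exist per predicate for those subjects?
predicate_counts = conn.execute(f"""
    SELECT predicate, COUNT(*) FROM triples
    WHERE subject IN ({BOOKS}) AND predicate <> 'rdf:type'
    GROUP BY predicate ORDER BY predicate
""").fetchall()

# How many times does each object appear under each predicate?
value_counts = conn.execute(f"""
    SELECT predicate, object, COUNT(*) AS n FROM triples
    WHERE subject IN ({BOOKS}) AND predicate <> 'rdf:type'
    GROUP BY predicate, object ORDER BY n DESC, object
""").fetchall()
```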

David indicated that he'd like to be able to look for queries with sets of subjects or objects. In a relational model, such a query is no different in structure: we simply join on a different part of the table.

He also brought up the concept of a sliding operation. We might have the first-generation subjects, and we will want to search for the objects that are their children: ((?x :generation "1st") :parent_of ?p).
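In relational terms, one way to read the sliding operation is as one more self join that moves from a set of subjects to the objects they point at. The sketch below assumes a naive three-column triples table and made-up people; none of these names come from Longwell.

```python
import sqlite3

# Hypothetical naive triple store with invented sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        (":alice", ":generation", "1st"),
        (":alice", ":parent_of", ":bob"),
        (":alice", ":parent_of", ":erin"),
        (":carol", ":generation", "2nd"),
        (":carol", ":parent_of", ":dave"),
    ],
)

# g picks the first-generation subjects; c "slides" along :parent_of
# from those subjects to their children.
SLIDE_SQL = """
SELECT c.object
FROM triples AS g
JOIN triples AS c ON c.subject = g.subject
WHERE g.predicate = ':generation' AND g.object = '1st'
  AND c.predicate = ':parent_of'
ORDER BY c.object
"""
children = [row[0] for row in conn.execute(SLIDE_SQL)]
```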

Finally, David would like to be able to search for multiple predicates at once (this would unify ontologies). An example might be mail-message {dc:title, rdf:label, :subject} ?x, which would get the subject of a mail message from sources that refer to the subject of a mail message as a dc:title, rdf:label, or :subject. This could be accomplished easily with a disjunction, such as a SQL UNION query.
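A minimal sketch of the multi-predicate search follows. It expresses the disjunction with an IN list rather than a literal UNION; over a single table the two should return the same rows. The table layout and the sample messages are assumptions for the example.

```python
import sqlite3

# Hypothetical naive triple store with invented sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        (":m1", "rdf:type", ":MailMessage"),
        (":m1", "dc:title", "Lunch?"),
        (":m2", "rdf:type", ":MailMessage"),
        (":m2", ":subject", "Meeting notes"),
        (":m3", "rdf:type", ":MailMessage"),
        (":m3", "rdf:label", "Re: Longwell"),
        (":b1", "rdf:type", ":Book"),
        (":b1", "dc:title", "SICP"),  # a book title, must not be returned
    ],
)

# Predicates that different sources use for "subject of a mail message".
subject_predicates = ["dc:title", "rdf:label", ":subject"]
placeholders = ", ".join("?" for _ in subject_predicates)

# Equivalent to a UNION of three single-predicate queries.
sql = f"""
SELECT t.subject, v.object
FROM triples AS t
JOIN triples AS v ON v.subject = t.subject
WHERE t.predicate = 'rdf:type' AND t.object = ':MailMessage'
  AND v.predicate IN ({placeholders})
ORDER BY t.subject, v.object
"""
message_subjects = conn.execute(sql, subject_predicates).fetchall()
```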

E-mail from David after the meeting

Hey guys,

I missed a few cases yesterday. They might be important.

1. In each facet box (on the right side of Longwell), we also want to list a (missing) choice:

  author:
  Hal Abelson (39)
  David Karger (26)
  missing (3)

That means that there are 3 books with no author data. This is quite crucial for people to clean up their data by finding out what data is missing.
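One possible way to get the missing count in the same query as the other facet values is a LEFT JOIN, which keeps books that have no :author triple. This is a sketch only; the triples table and the sample books are assumptions for the example.

```python
import sqlite3

# Hypothetical naive triple store with invented sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        (":bookA", "rdf:type", ":Book"),
        (":bookA", ":author", "Hal Abelson"),
        (":bookB", "rdf:type", ":Book"),
        (":bookB", ":author", "Hal Abelson"),
        (":bookC", "rdf:type", ":Book"),
        (":bookC", ":author", "David Karger"),
        (":bookD", "rdf:type", ":Book"),  # no author triple at all
    ],
)

# Books without an :author row survive the LEFT JOIN with a NULL object,
# which is then reported under the label "missing" and sorted last.
FACET_SQL = """
SELECT COALESCE(a.object, 'missing') AS label, COUNT(*) AS n
FROM triples AS t
LEFT JOIN triples AS a
       ON a.subject = t.subject AND a.predicate = ':author'
WHERE t.predicate = 'rdf:type' AND t.object = ':Book'
GROUP BY a.object
ORDER BY a.object IS NULL, n DESC, a.object
"""
author_facet = conn.execute(FACET_SQL).fetchall()
```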

2. RDF data is usually very dirty. So, even if the ontology says that the objects of the author predicate must always be RDF resources (URIs or blank nodes), sometimes you'll find a literal value in there.

  :bookA :author :DavidKarger .
  :bookB :author "I need to hunt down the author" .

People are pretty creative when they enter data :-). Another variation of this dirty nature is

  :bookA :publishingDate "2006-09-12"^^xsd:date .
  :bookB :publishingDate "the night before Thanksgiving 2004" .

In the :author case above, we need the facet to look like this

  author:
  David Karger (26)
  I need to hunt down the author (2)

However, while "I need to hunt down the author" can be returned immediately as a literal, :DavidKarger needs another hop over :name or rdfs:value or dc:title or whatever. I'm hoping to be able to get this result set back from the db in one shot:

  <label, isLiteral, uri, count>
  <"David Karger", false, :DavidKarger, 26>
  <"I need to hunt down the author", true, null, 2>

If I only get { <:DavidKarger, 26>, <"I need to hunt down the author", 2>, ... } in one shot, then I need to send { <:DavidKarger>, ... } to the db in another query. And that can be a very long list of resources--extremely inefficient.

This same problem comes up in sorting the results by author.

The dichotomy of resources and literals is quite painful to deal with. I'm not sure if you have any better abstraction over them.
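A sketch of how the one-shot <label, isLiteral, uri, count> result set from item 2 might be produced. It rests on two assumptions that are not in the e-mail: the table carries an extra object_is_literal flag column, and resources are labeled through :name only. A real store would have to decide both.

```python
import sqlite3

# Hypothetical triple table with an extra flag marking literal objects.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT, "
    "object_is_literal INTEGER)"
)
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?, ?)",
    [
        (":bookA", "rdf:type", ":Book", 0),
        (":bookA", ":author", ":DavidKarger", 0),
        (":bookB", "rdf:type", ":Book", 0),
        (":bookB", ":author", ":DavidKarger", 0),
        (":bookC", "rdf:type", ":Book", 0),
        (":bookC", ":author", "I need to hunt down the author", 1),
        (":DavidKarger", ":name", "David Karger", 1),
    ],
)

# a: the author triples of books. nm: the extra hop over :name, taken only
# when the author object is a resource. Literals are their own label.
FACET_SQL = """
SELECT COALESCE(nm.object, a.object) AS label,
       a.object_is_literal AS is_literal,
       CASE WHEN a.object_is_literal = 1 THEN NULL ELSE a.object END AS uri,
       COUNT(*) AS cnt
FROM triples AS t
JOIN triples AS a
  ON a.subject = t.subject AND a.predicate = ':author'
LEFT JOIN triples AS nm
  ON a.object_is_literal = 0
 AND nm.subject = a.object AND nm.predicate = ':name'
WHERE t.predicate = 'rdf:type' AND t.object = ':Book'
GROUP BY a.object, a.object_is_literal, nm.object
ORDER BY cnt DESC, label
"""
author_facet = [
    (label, bool(is_literal), uri, cnt)
    for label, is_literal, uri, cnt in conn.execute(FACET_SQL)
]
```

Because the label is a column of the result, the same query could also be ordered by label to sort results by author without a second round trip.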

3. For continuous valued predicates (date/time, numeric, etc.), we need MIN, MAX in addition to COUNT.
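For item 3, a rough sketch of MIN, MAX, and COUNT over a date-valued predicate. It assumes dates are stored as ISO 8601 strings (which compare correctly as text) and uses a GLOB pattern to skip dirty values; the table and sample rows are invented for the example.

```python
import sqlite3

# Hypothetical naive triple store with one clean and one dirty date value.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        (":bookA", ":publishingDate", "2006-09-12"),
        (":bookB", ":publishingDate", "the night before Thanksgiving 2004"),
        (":bookC", ":publishingDate", "2004-11-24"),
    ],
)

# Only values shaped like YYYY-MM-DD take part in the range computation.
ISO_DATE = "[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]"
earliest, latest, clean_count = conn.execute(
    "SELECT MIN(object), MAX(object), COUNT(*) FROM triples "
    "WHERE predicate = ':publishingDate' AND object GLOB ?",
    (ISO_DATE,),
).fetchone()
```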

4. Sometimes literals store structured data within them, e.g., "-39.09,-7.29" for latitude/longitude. Even "http://foo.com/bar.html" can be considered a structure because you might want to filter by domain name.
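For item 4, a small sketch of pulling structure out of such literals on the application side. The helper names parse_lat_long and domain_of are hypothetical; urlsplit is from Python's standard library.

```python
from urllib.parse import urlsplit


def parse_lat_long(literal):
    """Split a "lat,long" literal into a pair of floats."""
    lat_text, long_text = literal.split(",")
    return float(lat_text), float(long_text)


def domain_of(literal):
    """Return the host name of a URL literal, for filtering by domain."""
    return urlsplit(literal).hostname


position = parse_lat_long("-39.09,-7.29")
domain = domain_of("http://foo.com/bar.html")
```

Doing this in the client means the store cannot filter on the extracted parts; pushing it into the database would need either derived triples or indexed expressions, which this sketch does not attempt.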

One more thing: after you have filtered down to a collection of things in Longwell, then doing further filtering or simply computing the UI requires issuing a lot of queries that embed the query by which you got to that collection of things. Remember that those facets are computed by only slightly different queries. Some smart caching specific to faceted browsing would be nice.
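One very simple form such caching could take is to memoize the collection of matching subjects, keyed by the set of restrictions that produced it, so every facet reuses it. The sketch below assumes a naive triples table; matching_subjects, facet_counts, and query_log are hypothetical names, and a real cache would also need invalidation when data changes.

```python
import sqlite3
from functools import lru_cache

# Hypothetical naive triple store with invented sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        (":bookA", "rdf:type", ":Book"),
        (":bookA", ":author", "Hal Abelson"),
        (":bookA", ":year", "1996"),
        (":bookB", "rdf:type", ":Book"),
        (":bookB", ":author", "Hal Abelson"),
        (":bookB", ":year", "2006"),
        (":msg1", "rdf:type", ":MailMessage"),
        (":msg1", ":author", "Hal Abelson"),
    ],
)

query_log = []  # one entry each time the base collection is really computed


@lru_cache(maxsize=128)
def matching_subjects(restrictions):
    """Subjects satisfying every (predicate, object) pair in a frozenset.

    An empty restriction set yields an empty collection in this sketch.
    """
    query_log.append(restrictions)
    subjects = None
    for predicate, obj in restrictions:
        rows = conn.execute(
            "SELECT subject FROM triples WHERE predicate = ? AND object = ?",
            (predicate, obj),
        )
        found = {row[0] for row in rows}
        subjects = found if subjects is None else subjects & found
    return frozenset(subjects or ())


def facet_counts(restrictions, predicate):
    """Count the objects of one predicate within the cached collection."""
    subjects = matching_subjects(restrictions)
    counts = {}
    rows = conn.execute(
        "SELECT subject, object FROM triples WHERE predicate = ?", (predicate,)
    )
    for subject, obj in rows:
        if subject in subjects:
            counts[obj] = counts.get(obj, 0) + 1
    return counts


base = frozenset({("rdf:type", ":Book")})
authors = facet_counts(base, ":author")
years = facet_counts(base, ":year")
```

Both facets share one evaluation of the base query; only the per-facet step is repeated.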

Thanks,

David