Scaling facetted browsing to a very large corpus

From: Stefano Mazzocchi <>
Date: Sun, 24 Oct 2004 19:31:31 -0400

David R. Karger wrote:

> Here are some preliminary thoughts.


Sorry for not having written back to you sooner, but I was preparing the
longwell distribution and I didn't want to do this in a hurry.

> A set of 1000 facets is a corpus.

First of all, I would like to introduce a terminological distinction so
that we know what we are talking about:

  facet := metadata field that is considered important enough in a
particular search&browse (s&b) context.

  facet value := literal content of a facet
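
As a minimal sketch of the distinction above: facets are the metadata
fields chosen for a given s&b context, and facet values are the literals
found in those fields. The item layout, the field names and the
`facet_values` helper are assumptions made for illustration, not
longwell's actual data model.

```python
from collections import Counter

# Hypothetical items: each maps a metadata field to a list of literal values.
ITEMS = [
    {"creator": ["Monet"], "medium": ["oil"], "title": ["Water Lilies"]},
    {"creator": ["Monet"], "medium": ["pastel"], "title": ["Waterloo Bridge"]},
    {"creator": ["Degas"], "medium": ["pastel"], "title": ["Dancers"]},
]

# The facets are the fields judged important enough for this s&b context.
# "title" is still metadata, but it is not treated as a facet here.
FACETS = ["creator", "medium"]


def facet_values(items, facets):
    """Count how many items carry each literal value of each facet."""
    counts = {facet: Counter() for facet in facets}
    for item in items:
        for facet in facets:
            for value in item.get(facet, []):
                counts[facet][value] += 1
    return counts


counts = facet_values(ITEMS, FACETS)
```

The number of facets stays fixed by the choice of `FACETS`, while the
number of facet values grows with the data.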

I personally strongly doubt that any particular s&b context will ever
contain 1000 facets. It will, for sure, contain more than 1000 facet
values, but the number of facets that would be useful to present to a
user in an s&b interface would hardly exceed 25 to 50, IMO.

This, of course, includes the cases where facets are considered facet
values of other facets and a hierarchical distribution can be imagined.

The problem with this approach (which was suggested to me at the CIDOC
CRM meeting in Crete) is that those 'meta-facets' tend to be
increasingly abstract [for example, a hierarchy such as "entity ->
object -> thing -> stuff" is present in the CRM] and hardly of any use
for the diversity of end users that we are targeting, since any kind of
hierarchical distribution tends to feel natural to some and completely
arbitrary to others.

> How do you search a corpus? You
> need useful descriptions of each element of the corpus. Textual
> descriptions and metadata. Textual description is obvious---ideally,
> when we create an ontology, someone should write a document describing
> each predicate in the ontology. Of course, since we are in the
> metadata business, it's nice to think about metadata for the
> predicates. There's obvious stuff---eg, the RDFS info about the
> predicates, like the domain and range of the predicate. But it's
> interesting to think about what other kinds of predicates can be
> descriptive for an end user. To think about this I'd like to take a
> look at the list of predicates. Is it available somewhere?

Before I can answer, I think I need a more precise definition of what
you mean by "predicates" in this context.

What I can tell you is that we have a bundle in longwell that I put in
place to s&b our own collection of ontologies, and that we are using
RDFS/OWL in conjunction with Dublin Core to add descriptive metadata to
the ontologies we use, so that we can s&b them as if they were any other
dataset

[the difference is in the statistical distribution of the graph 'shape'
as ontologies tend to have a very different distribution of density of
linkage between them, but this is another topic]
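
To illustrate what "s&b them as if they were any other dataset" can mean,
here is a rough sketch that treats predicate descriptions as plain
triples and filters them by an RDFS property. The RDFS and Dublin Core
namespace URIs are the standard ones; the `http://example.org/onto#`
namespace, the two predicates and the `select` helper are hypothetical
and are not taken from the longwell bundle.

```python
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
DC = "http://purl.org/dc/elements/1.1/"
EX = "http://example.org/onto#"  # hypothetical ontology namespace

# Descriptive metadata about two made-up predicates, as (s, p, o) triples.
TRIPLES = [
    (EX + "paintedBy", RDFS + "label", "painted by"),
    (EX + "paintedBy", RDFS + "domain", EX + "Painting"),
    (EX + "paintedBy", RDFS + "range", EX + "Person"),
    (EX + "paintedBy", DC + "description", "Relates a painting to its painter."),
    (EX + "depicts", RDFS + "domain", EX + "Painting"),
    (EX + "depicts", DC + "description", "Relates a painting to what it shows."),
]


def select(triples, prop, value):
    """Return the sorted subjects that have `prop` set to `value`."""
    return sorted({s for s, p, o in triples if p == prop and o == value})


# Use rdfs:domain as a facet and "Painting" as the selected facet value.
painting_predicates = select(TRIPLES, RDFS + "domain", EX + "Painting")
```

Here rdfs:domain plays the role of a facet over the ontology itself, in
the same way that a field like "creator" would over ordinary items.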

> Given the metadata, there are various tools for searching it. I'd like
> to see Vineet's metadata-based fuzzy browser applied to this problem,
> for example.

Can you give us more information on this? Will this be part of the
minimal Haystack distribution that Steve is working on?

