Re: Scaling facetted browsing to a very large corpus

From: Stefano Mazzocchi
Date: Fri, 29 Oct 2004 14:35:21 -0400

[boy, I hate your mail client ;-)]

David R. Karger wrote:

> > A set of 1000 facets is a corpus.
> First of all, I would like to introduce a terminological distinction so
> that we know what we are talking about:
> facet := metadata field that is considered important enough in a
> particular search&browse (s&b) context.
> facet value := literal content of a facet
> Good clarification. I had assumed you meant 1000 facets. This
> affected the rest of my message (ie, by predicate I meant facet).


> well, one good way to explore the facet values is to look at the
> reverse arrows. ie, for a given facet value, what are the items that
> have that facet value. This can give the user some feel for how that
> value's facet is working.

oh, interesting. Are you suggesting we use the literal as an identifier
and group all the nodes that have the same literal as if they were
pointing to the same URI? hmmmm
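
To make sure I read the "reverse arrows" idea correctly, here is a rough
sketch of it in Python. It assumes the metadata is available as plain
(item, facet, value) triples; the sample data and the names TRIPLES and
reverse_index are made up for illustration and are not taken from
Haystack or any existing tool.

```python
from collections import defaultdict

# Hypothetical sample data: (item, facet, literal value) triples.
TRIPLES = [
    ("doc1", "author", "Karger"),
    ("doc2", "author", "Karger"),
    ("doc3", "author", "Mazzocchi"),
    ("doc1", "year", "2004"),
    ("doc3", "year", "2004"),
]


def reverse_index(triples):
    """Map each (facet, value) pair to the set of items carrying it."""
    index = defaultdict(set)
    for item, facet, value in triples:
        # The literal acts as a grouping key, as if it were a shared node.
        index[(facet, value)].add(item)
    return dict(index)


index = reverse_index(TRIPLES)
for (facet, value), items in sorted(index.items()):
    print(facet, value, sorted(items))
```

Keying on the (facet, value) pair rather than on the bare literal is a
choice I made for the sketch: keying on the literal alone would also
merge equal strings that come from unrelated facets.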

> > Given the metadata, there are various tools for searching it. I'd like
> > to see Vineet's metadata-based fuzzy browser applied to this problem,
> > for example.
> Can you give us more information on this? Will this be part of the
> minimal Haystack distribution that Steve is working on?
> Best source of info is vineet. He'd be happy to demo it for
> you.


> Basically, we apply techniques from textual information
> retrieval. Given an initial text search, various "query refinement"
> methods suggest new terms that the user might want to add to the
> query, or other documents similar to the current result set. Now, for
> "term" substitute "facet and value" and you get something that works
> for semistructured data.

Right, I'll contact him for that and report back to the list.
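
In the meantime, here is how I picture the "term" -> "facet and value"
substitution, as a rough Python sketch. This is plain co-occurrence
counting over the current result set, not Vineet's method (which I have
not seen yet). The data and the names ITEMS, suggest_refinements and
refine are hypothetical.

```python
from collections import Counter

# Hypothetical items, each described by a set of (facet, value) pairs.
ITEMS = {
    "doc1": {("author", "Karger"), ("year", "2004"), ("topic", "browsing")},
    "doc2": {("author", "Karger"), ("year", "2003"), ("topic", "browsing")},
    "doc3": {("author", "Mazzocchi"), ("year", "2004"), ("topic", "browsing")},
    "doc4": {("author", "Mazzocchi"), ("year", "2004"), ("topic", "rdf")},
}


def suggest_refinements(items, result_ids, limit=3):
    """Rank (facet, value) pairs by how many current results carry them."""
    counts = Counter()
    for item_id in result_ids:
        counts.update(items[item_id])
    total = len(result_ids)
    # A pair shared by every result would not narrow the set, so skip it.
    useful = [(pair, n) for pair, n in counts.items() if n < total]
    # Most frequent first; ties broken by the pair itself for stable output.
    useful.sort(key=lambda entry: (-entry[1], entry[0]))
    return useful[:limit]


def refine(items, result_ids, pair):
    """Keep only the results that carry the chosen (facet, value) pair."""
    return [item_id for item_id in result_ids if pair in items[item_id]]


results = ["doc1", "doc2", "doc3"]  # e.g. the hits of an initial text search
suggestions = suggest_refinements(ITEMS, results)
narrowed = refine(ITEMS, results, ("year", "2004"))
```

The user never writes a query from scratch here: they pick one of the
suggested pairs and the result set shrinks accordingly.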

> This highlights, I think, one of the best ways people will use the
> data---they'll use it to refine current result sets, rather than to
> formulate queries from scratch.

I can hardly agree more.

> It shouldn't be in the minimal distribution, but it is proving
> difficult to untangle so it might be. Anything not in the minimal
> distribution will be available as an add-on.


Stefano Mazzocchi
Research Scientist                 Digital Libraries Research Group
Massachusetts Institute of Technology            location: E25-131C
77 Massachusetts Ave                   telephone: +1 (617) 253-1096
Cambridge, MA  02139-4307              email: stefanom at mit . edu
Received on Fri Oct 29 2004 - 18:35:17 EDT
