Re: Scaling facetted browsing to a very large corpus

From: David R. Karger <karger_at_mit.edu>
Date: Thu, 28 Oct 2004 00:41:36 -0400

   Date: Sun, 24 Oct 2004 19:31:31 -0400
   From: Stefano Mazzocchi <stefanom_at_mit.edu>

   David R. Karger wrote:


> A set of 1000 facets is a corpus.

   First of all, I would like to introduce a terminological distinction so
   that we know what we are talking about:

     facet := metadata field that is considered important enough in a
   particular search&browse (s&b) context.

     facet value := literal content of a facet

Good clarification. I had assumed you meant 1000 facets. This
affected the rest of my message (i.e., by predicate I meant facet).

> How do you search a corpus? You
> need useful descriptions of each element of the corpus. Textual
> descriptions and metadata. Textual description is obvious---ideally,
> when we create an ontology, someone should write a document describing
> each predicate in the ontology. Of course, since we are in the
> metadata business, it's nice to think about metadata for the
> predicates. There's obvious stuff---eg, the RDFS info about the
> predicates, like the domain and range of the predicate. But it's
> interesting to think about what other kinds of predicates can be
> descriptive for an end user. To think about this I'd like to take a
> look at the list of predicates. Is it available somewhere?

   Before I can answer, I think I need a more precise definition of what
   you mean by "predicates" in this context.

I meant facet.

   What I can tell you is that we have a bundle in Longwell that I put in
   place to s&b our own collection of ontologies, and that we are using
   RDFS/OWL in conjunction with Dublin Core to add descriptive metadata to
   the ontologies we use so that we can s&b them as if they were any other
   dataset.

   [the difference is in the statistical distribution of the graph 'shape'
   as ontologies tend to have a very different distribution of density of
   linkage between them, but this is another topic]

Well, one good way to explore the facet values is to look at the
reverse arrows, i.e., for a given facet value, which items have that
facet value? This can give the user some feel for how that value's
facet is working.
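
To make the "reverse arrows" idea concrete, here is a minimal sketch in
Python. It is not Longwell or Haystack code; the item ids, facet names,
and the build_reverse_index helper are all hypothetical, and it assumes
each item is just a mapping from facet to a list of facet values.

```python
from collections import defaultdict


def build_reverse_index(items):
    """Map each (facet, value) pair to the set of item ids that carry it."""
    index = defaultdict(set)
    for item_id, facets in items.items():
        for facet, values in facets.items():
            for value in values:
                index[(facet, value)].add(item_id)
    return index


# Hypothetical toy corpus: item id -> facet -> list of facet values.
items = {
    "doc1": {"creator": ["Alice"], "subject": ["maps", "history"]},
    "doc2": {"creator": ["Bob"], "subject": ["maps"]},
    "doc3": {"creator": ["Alice"], "subject": ["music"]},
}

index = build_reverse_index(items)

# Follow the reverse arrow from a facet value back to its items.
print(sorted(index[("subject", "maps")]))
```

Looking at the items behind ("subject", "maps") is the kind of
browsing I mean: the user sees what the value actually selects.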

> Given the metadata, there are various tools for searching it. I'd like
> to see Vineet's metadata-based fuzzy browser applied to this problem,
> for example.

   Can you give us more information on this? Will this be part of the
   minimal Haystack distribution that Steve is working on?

Best source of info is Vineet. He'd be happy to demo it for
you. Basically, we apply techniques from textual information
retrieval. Given an initial text search, various "query refinement"
methods suggest new terms that the user might want to add to the
query, or other documents similar to the current result set. Now, for
"term" substitute "facet and value" and you get something that works
for semistructured data.
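
As a rough illustration of substituting "facet and value" for "term",
here is a small Python sketch. It is not Vineet's browser: it swaps in
the simplest refinement heuristic I can think of, plain frequency
counting over the current result set. The data, the suggest_refinements
name, and the top_n parameter are all hypothetical.

```python
from collections import Counter


def suggest_refinements(items, result_ids, top_n=5):
    """Suggest (facet, value) pairs that would narrow the result set.

    Pairs are ranked by how many current results carry them. Pairs
    carried by every result are skipped, since adding them to the
    query would not narrow anything.
    """
    result_ids = set(result_ids)
    counts = Counter()
    for item_id in result_ids:
        for facet, values in items[item_id].items():
            for value in set(values):
                counts[(facet, value)] += 1
    total = len(result_ids)
    candidates = [(pair, n) for pair, n in counts.items() if n < total]
    # Most frequent first; ties broken alphabetically for stable output.
    candidates.sort(key=lambda c: (-c[1], c[0]))
    return candidates[:top_n]


# Hypothetical toy corpus: item id -> facet -> list of facet values.
items = {
    "doc1": {"creator": ["Alice"], "subject": ["maps", "history"]},
    "doc2": {"creator": ["Bob"], "subject": ["maps"]},
    "doc3": {"creator": ["Alice"], "subject": ["music"]},
}

# Pretend an initial text search returned all three items.
suggestions = suggest_refinements(items, ["doc1", "doc2", "doc3"], top_n=2)
print(suggestions)
```

A real system would presumably weight pairs against their frequency in
the whole corpus, as text retrieval does with terms; raw counts are
only the easiest place to start.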

This highlights, I think, one of the best ways people will use the
data---they'll use it to refine current result sets, rather than to
formulate queries from scratch.

It shouldn't be in the minimal distribution, but it is proving
difficult to untangle so it might be. Anything not in the minimal
distribution will be available as an add-on.

   --
   Stefano Mazzocchi
   Research Scientist Digital Libraries Research Group
   Massachusetts Institute of Technology location: E25-131C
   77 Massachusetts Ave telephone: +1 (617) 253-1096
   Cambridge, MA 02139-4307 email: stefanom at mit . edu
   -------------------------------------------------------------------
Received on Thu Oct 28 2004 - 04:41:36 EDT
