Re: Explanation of facet frequency bug

From: David R. Karger <karger_at_mit.edu>
Date: Tue, 19 Oct 2004 15:53:05 -0400

It seems to me that if you have 1000 facets to deal with, then the
problem of looking at or finding the right facets is an information
retrieval problem in its own right. Rather than hardwiring special
purpose tools for working with the facets, we should be representing
the facets in a way that lets them be treated as a corpus to which we
apply all the traditional information retrieval tools.

   Mailing-List: contact general-help_at_simile.mit.edu; run by ezmlm
   Reply-To: <general_at_simile.mit.edu>
   Date: Tue, 19 Oct 2004 15:08:32 +0100
   Thread-Topic: Explanation of facet frequency bug
   From: "Butler, Mark H (Labs Bristol)" <mark-h.butler_at_hp.com>

   Hi Ryan

   Some context here:

   Early on in the design of the browser, I was faced with the problem that
   in the Artstor data certain facets had very sparse facet values, e.g. on
   average 2 items of instance data per facet value. This means that for a
   collection of 2000 images you would have 1000 facet values. This isn't a
   particularly big collection, but browsing a list of 1000 strings isn't
   particularly practical.

   Most solutions to improving this involve grouping facet values. This
   could just be a simple alphabetic grouping, it could be some kind of
   distance-based machine learning / statistical clustering, or it could be
   a hierarchical grouping based on SKOS broader / narrower information or
   RDFS subclass information. Here the hierarchical approach seemed
   preferable, as it demonstrated the value add of using a description
   format like RDF, whereas the other two approaches would have been
   equally applicable if we were using a relational database as our
   underlying representation.
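   As an illustration of the simplest of these strategies, here is a small
   sketch of alphabetic grouping of facet values (function and value names
   are hypothetical, not from the browser's code):

```python
from collections import defaultdict

def group_alphabetically(facet_values):
    """Bucket a flat list of facet value strings by initial letter."""
    groups = defaultdict(list)
    for value in sorted(facet_values):
        key = value[0].upper() if value else "#"
        groups[key].append(value)
    return dict(groups)

# A flat list of 1000 strings becomes a couple of dozen letter buckets
print(group_alphabetically(["bronze", "basalt", "marble"]))
# {'B': ['basalt', 'bronze'], 'M': ['marble']}
```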

   Therefore in the prototype I decided to explore hierarchical grouping,
   although the Artstor data was still rather flat even with it. One of the
   potential applications of integrating thesauri like the LOC TGM was that
   there seemed to be some potential to transform this rather flat space of
   facet values into a more hierarchical organisation. However, I never got
   that to work successfully, and it turned out to be problematic, not
   least because, if we adopt the term / concept separation that SKOS does,
   then given two different classification schemes we have no guarantee
   that when they use the same term they are referring to the same concept.

   So why the problems with counting hierarchically structured facet
   values?

   Two reasons: first, the facet data is not guaranteed to be hierarchical
   - it's a lattice - so there is a danger that a single entry may occur in
   multiple places in the tree, leading to double counting. Second, when we
   use inference, an entry at a certain point in the tree will also appear
   at all of its parent nodes once the inferencer processes the relevant
   hierarchical relationship. So again we run the risk of double counting.
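   A minimal sketch of the lattice hazard, with invented item and facet
   names: the same item filed under two parents is counted twice by a
   naive per-node tally, while deduplicating first gives the right answer.

```python
# Facet value -> items carrying that value; "bust-123" sits under both
tree = {
    "Sculpture": ["bust-123"],
    "Portraiture": ["bust-123"],
}

naive_total = sum(len(items) for items in tree.values())  # double counts
correct_total = len(set().union(*tree.values()))          # deduplicated

print(naive_total, correct_total)  # 2 1
```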

   In the demo, I came up with some hacks to avoid this problem that
   relied on the fact that the dataset would constrain the maximum depth of
   hierarchical relations to 2. In fact, the actual Artstor data has a
   potential depth of 3, but the transform used in the April demo
   constrained this to 2 in order to make things work. So of course when
   datasets are introduced that break this rule, the problems become
   apparent.

   However, rather than removing the code altogether, I would suggest it is
   retained but a configuration option is added so it can be turned on and
   off for different datasets. This retains the existing functionality in
   the demos. I would also suggest adding comments about why the counting
   may be unreliable in the relevant source file - perhaps copied from this
   email. The reason I am advocating retaining this is that I think the
   demonstration of this UI metaphor is interesting and helps demonstrate
   some of the value add of using RDF, i.e. the fact that it can represent
   hierarchical relations, even if the logic of the way facet values are
   counted needs a fundamental rethink.

   I did have some ideas for a better solution to this problem. If it were
   possible to distinguish between inferred and non-inferred triples -
   which is possible with the Jena reasoner, which I had hoped you would be
   using by now rather than the hacked reasoner I wrote for the April demo
   - then it would be possible to avoid double counting due to inference.
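   A hedged sketch of that idea, using plain Python sets in place of a
   real triple store: if the reasoner keeps base and inferred triples
   apart, counting only base assertions avoids inflating parent-class
   counts. The triples mirror the TR dataset example later in this thread.

```python
base_triples = {("bust-123", "rdf:type", "rec:REC")}        # asserted
inferred_triples = {("bust-123", "rdf:type", "rec:TRPub"),  # added by
                    ("bust-123", "rdf:type", "doc:Work")}   # inference
all_triples = base_triples | inferred_triples

# Count rdf:type values for the item using only non-inferred triples
base_count = sum(1 for t in all_triples if t in base_triples)
print(base_count)  # 1, not 3
```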

   The problem of double counting due to shoehorning a lattice into a
   hierarchy is potentially more complex. One way to handle it would be to
   use two hashes to represent each parent node. One would contain all the
   visible entries at the parent node, i.e. the facet values that belonged
   to that parent node in the base data, without the inferencer. The other
   would contain an invisible set of entries, i.e. the facet values that
   belong to that node from the base and inferred data. The second hash
   structure is created purely for the purpose of doing accurate counting,
   and because it is a hash it automatically removes duplicates due to a
   facet value occurring at multiple points in the tree. This will give the
   correct counts but will be computationally expensive, so the problem is
   to come up with a more efficient way of doing this. In fact, even the
   old algorithm that just used a single structure was reasonably
   computationally intensive, so I did spend some time optimizing the code
   you removed.
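   A sketch of the two-structure idea, with hypothetical names: each
   parent node keeps the base-data facet values for display next to a
   second set holding base plus inferred values, used only for counting.
   Because sets deduplicate, a value reachable through two branches of the
   lattice is counted once.

```python
class ParentNode:
    def __init__(self, uri):
        self.uri = uri
        self.visible = set()   # facet values from the base data (displayed)
        self.counting = set()  # base + inferred values (counting only)

    def add(self, value, inferred=False):
        if not inferred:
            self.visible.add(value)
        self.counting.add(value)

    def count(self):
        return len(self.counting)

node = ParentNode("doc:Work")
node.add("rec:REC")                 # asserted in the base data
node.add("rec:REC", inferred=True)  # reached again via inference: no double count
node.add("rec:TRPub", inferred=True)
print(node.count())  # 2
```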

   Stefano requested the functionality that any facet values that return
   the same result set should be hidden. Getting correct counts (outlined
   above) is part of the solution to making this work reliably. The other
   part is to only apply this rule to leaf nodes. This means we aren't
   quite getting the requested functionality, as we might still show a
   parent node that returns the same result set, as long as it contains
   nodes corresponding to facet values that do subset the result set. But
   I think that behavior is probably OK, or at least preferable to not
   offering hierarchical browsing at all.
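   That rule can be sketched with invented structures: hide a facet value
   only when it is a leaf and restricting by it would return the whole
   result set, so hierarchical parents survive even when they don't
   narrow anything themselves.

```python
def visible_values(values, result_set):
    """values maps a facet value URI to (matching_items, children_uris)."""
    shown = []
    for uri, (matches, children) in values.items():
        is_leaf = not children
        if is_leaf and matches == result_set:
            continue  # leaf that would not narrow anything: hide it
        shown.append(uri)
    return shown

results = {"a", "b"}
vals = {
    "rec:TRPub": ({"a", "b"}, ["rec:REC"]),  # non-narrowing parent: kept
    "rec:REC": ({"a"}, []),                  # narrowing leaf: kept
    "doc:Work": ({"a", "b"}, []),            # non-narrowing leaf: hidden
}
print(visible_values(vals, results))  # ['rec:TRPub', 'rec:REC']
```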

   Dr Mark H. Butler
   mark-h.butler_at_hp.com
   HP Labs Bristol http://www.hpl.hp.com/people/marbut

   -----Original Message-----
   From: Ryan Lee [mailto:ryanlee_at_w3.org]
   Sent: 19 October 2004 11:39
   To: dev_at_simile.mit.edu
   Subject: Explanation of facet frequency bug

   Sometimes the facet browser has inaccurate counts for the rdf:type
   facet.

   The problem is with multiple levels of type subclassing.

   As in the TR dataset,

   rec:FirstEdition rdfs:subClassOf rec:REC .
   rec:REC rdfs:subClassOf rec:TRPub .
   rec:TRPub rdfs:subClassOf doc:Work .

   When a class rec:REC is first counted as a facet value for rdf:type, it
   creates a FacetValue object in an overall facet hash with a key of its
   superclass URI rec:TRPub containing a 'narrower term' hash of facets
   containing a FacetValue keyed on the class's URI, rec:REC.

   overall{'rec:TRPub' : (rec:TRPub, narrower{'rec:REC' : (rec:REC)})}

   But when rec:TRPub is counted as a facet value for rdf:type (for the
   same object), it creates a FacetValue object in the 'narrower term' hash
   of its superclass, doc:Work.

   overall{'doc:Work' : (doc:Work, narrower{'rec:TRPub' : (rec:TRPub)})}

   The next time rec:TRPub shows up, in a subsequent object, the hash
   already contains a key for rec:TRPub on its own AND as a narrower term
   in the doc:Work narrower terms hash. But the way things were written, it
   would always go into the overall hash under its own term, never again as
   a narrower term.

   So any hierarchy greater than two levels will always introduce an
   off-by-one count.

   The solution is to increment all instances of the URI in use. But this
   highlights other bugs. With a correct frequency count, facets with
   restriction frequencies equal to the result set size are eliminated,
   including their narrower terms, even if those terms will produce useful
   restrictions. If those narrower terms are terminal leaves in the
   subclass hierarchy, they won't be in the overall hash anywhere else and
   won't be displayed. If they aren't (or are placed in the hash), they'll
   show up twice - once as subclasses, again as their own restrictions -
   when the result size has nothing to do with that subtree of the class
   hierarchy.
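   The "increment all instances" approach could be sketched like this,
   mirroring the overall/narrower hashes above (structures and counts are
   hypothetical simplifications, not the actual browser code):

```python
def increment_all(overall, uri):
    """Increment the count for uri wherever it occurs: as a top-level key
    and as a narrower term under any parent node."""
    for key, (count, narrower) in overall.items():
        if key == uri:
            overall[key] = (count + 1, narrower)
        if uri in narrower:
            narrower[uri] += 1

overall = {"rec:TRPub": (1, {"rec:REC": 1}),
           "doc:Work": (0, {"rec:TRPub": 1})}
increment_all(overall, "rec:TRPub")
print(overall)
# {'rec:TRPub': (2, {'rec:REC': 1}), 'doc:Work': (0, {'rec:TRPub': 2})}
```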

   The easiest solution is to drop this broader/narrower technique, which
   is what I committed. My apologies if you were getting attached to the
   dynamic tree open/close feature.

   --
   Ryan Lee ryanlee_at_w3.org
   W3C Research Engineer +1.617.253.5327
   http://simile.mit.edu/
Received on Tue Oct 19 2004 - 19:53:05 EDT