RE: Explanation of facet frequency bug

From: Butler, Mark H (Labs Bristol) <"Butler,>
Date: Tue, 19 Oct 2004 15:08:32 +0100

Hi Ryan

Some context here:

Early on in the design of the browser, I was faced with the problem that
in the Artstor data certain facets had very sparse facet values e.g. on
average 2 of items of instance data per facet value. This means for a
collection of 2000 images you would have 1000 facet values. This isn't a
particular big collection, but browsing a list of a 1000 strings isn't
particularly practical.

Most solutions to improving this involve grouping facet values. This
could just be a simple alphabetic grouping, it could be some kind of
distance based machine learning / statistical clustering, or it could be
a hierarchical grouping based on SKOS broader / narrower information or
RDFS subclass information. Here the latter approach seemed preferable,
as it demonstrated the value add of using a description format like RDF,
whereas the other two approaches would have been equally applicable if
we using a relational database as our underlying representation.

Therefore in the prototype I decided to explore the latter, although the
Artstor data was still rather flat even with hierarchical grouping. One
of the potential applications of integrating thesauri like LOC TGM was
it seemed there might be some potential to transform this rather flat
space of facet values to a more hierarchical organisation, although I
never got that to work successfully, and it turned out to be
problematic, not least because if we have two different classification
schemes, we have no guarantee that if they use the same term that they
are referring to the same concept, if we adopt the term / concept
separation that SKOS does.

So why the problems with counting hierarchically structured facet
values?

Two reasons: first the facet data is not guaranteed to be hierarchical -
it's a lattice - so there is a danger that a single entry may occur
multiple places in the tree leading to double counting. Second when we
use inference, then if an entry is at a certain point in the tree, if
the inferencer processes the relevant hierarchical relationship, it will
also be at all the parent nodes. So again we run the risk of double
couting.

In the demo, I came up with some hacks to avoid this problem, that
relied on the fact that the dataset would constrain the maximum depth of
hierarchical relations to 2. In fact, in the Artstor data, the actual
data has a potential depth of 3 but the transform used in the April demo
constrained this to 2 in order to make things work. So of course when
datasets are introduced that break this rule, the problems become
apparent.

However rather than remove the code together, I would suggest it is
retained but a configuration option is added so it can be turned on and
off for different datasets. This retains the existing functionality in
the demos. I would also suggest adding comments about why the counting
may be unreliable in the relevant source file - perhaps copied from this
email. The reason I am advocating retaining this is I think retaining
the demonstration of this UI metaphor is interesting and helps
demonstrates some of the value add of using RDF i.e. the fact it can
represent hierarchical relations, even if the logic of the way facet
values are counted needs a fundamental rethink.

I did have some ideas of a better solution to this problem. If it was
possible to distinguish between inferred and non-inferred triples -
which is possible with the Jena reasoner, which I hoped you would be
using by now rather than the hacked reasoner I wrote for the April demo
- then it would be possible to avoid double counting due to inference.

The problem of double counting due to shoehorning a lattice into a
hierarchy is potentially more complex. One way to do it would be to use
two hashes to represent each parent node. One would contain all the
visible entries at the parent node i.e. the facet values that belonged
to that parent node in the base data i.e. without inferencer. The other
would contain an invisible set of entries i.e. the facet values that
belong to that node from the base and inferred data. The second hash
structure is created purely for the purpose of doing accurrate counting,
and because it is a hash it automatically removes duplicates due to a
facet value occurring at multiple points in the tree. This will give the
correct counts but will be computationally expensive, so the problem is
to come up with a more efficient way of doing this. In fact, even the
old algorithm that just used a single structure was reasonably
computationally intensive, so I did spend some time optimizing the code
you removed.

Stefano requested the functionality that any facet values that return
the same results set should be hidden. Getting correct counts (outlined
above) is part of the solution to making this work reliably. The other
part is to only apply this rule to leaf nodes. This means we aren't
quite getting the requested functionality, as we might have a parent
node that returns the same result set, as long as it contains nodes
corresponding to facet values that do subset the results set but I think
that behavior is probably OK, or at least preferable to not authoring
hierarchical browsing at all.

Dr Mark H. Butler
mark-h.butler_at_hp.com
HP Labs Bristol http://www.hpl.hp.com/people/marbut

-----Original Message-----
From: Ryan Lee [mailto:ryanlee_at_w3.org]
Sent: 19 October 2004 11:39
To: dev_at_simile.mit.edu
Subject: Explanation of facet frequency bug

Sometimes the facet browser has inaccurate counts for the rdf:type
facet.

The problem is with multiple levels of type subclassing.

As in the TR dataset,

rec:FirstEdition rdfs:subClassOf rec:REC .
rec:REC rdfs:subClassOf rec:TRPub .
rec:TRPub rdfs:subClassOf doc:Work .

When a class rec:REC is first counted as a facet value for rdf:type, it
creates a FacetValue object in an overall facet hash with a key of its
superclass URI rec:TRPub containing a 'narrower term' hash of facets
containing a FacetValue keyed on the class's URI, rec:REC.

overall{'rec:TRPub' : (rec:TRPub, narrower{'rec:REC' : (rec:REC)})}

But when rec:TRPub is counted as a facet value for rdf:type (for the
same object), it creates a FacetValue object in the 'narrower term' hash
of its superclass, doc:Work.

overall{'doc:Work' : (doc:Work, narrower{'rec:TRPub' : (rec:TRPub)})}

The next time rec:TRPub shows up, in a subsequent object, the hash
already contains a key for rec:TRPub on its own AND as a narrower term
in the doc:Work narrower terms hash. But the way things were written it
would always go into the overall hash under its own term, never again as
a narrower term.

So any hierarchy greater than two levels will always introduce an off by
one count.

The solution is to increment all instances of the URI in use. But this
highlights other bugs. With a correct frequency count, facets with
restriction frequencies equal to the result set size are eliminated,
including their narrower terms, even if those terms will produce useful
restrictions. If those narrower terms are terminal leaves in the
subclass hierarchy, they won't be in the overall hash anywhere else and
won't be displayed. If they aren't (or are placed in the hash), they'll
show up twice - once as subclasses, again as their own restrictions -
when the result size has nothing to do with that subtree of the class
hierarchy.

The easiest solution is to drop this broader/narrower technique, which
is what I committed. My apologies if you were getting attached to the
dynamic tree open/close feature.

-- 
Ryan Lee                 ryanlee_at_w3.org
W3C Research Engineer    +1.617.253.5327
http://simile.mit.edu/
Received on Tue Oct 19 2004 - 14:08:34 EDT

This archive was generated by hypermail 2.3.0 : Thu Aug 09 2012 - 16:39:17 EDT