Re: Explanation of facet frequency bug from Ryan Lee on 2004-10-19 (stdin)

From: Ryan Lee <ryanlee_at_w3.org>
Date: Tue, 19 Oct 2004 17:26:13 -0400

Hi Mark,

Thanks for the background and explanation. Offhand, is there anything
else you'd classify in the code as needing reconsideration? I wasn't
aware you'd already been thinking of a solution for this problem (I take
back the phrasing of 'bug') and would like to not duplicate your efforts.

I've added a configuration option (in config.n3) for turning on the
hierarchical display. It defaults to the flat display as most datasets
aren't going to conform to the required structure for hierarchical
display to function as intended.

Certainly that's just a temporary solution for the upcoming release.
I'll take some time after the release to consider how to implement a
better solution.

Butler, Mark H (Labs Bristol) wrote:
> Hi Ryan
>
> Some context here:
>
> Early on in the design of the browser, I was faced with the problem that
> in the Artstor data certain facets had very sparse facet values e.g. on
> average 2 of items of instance data per facet value. This means for a
> collection of 2000 images you would have 1000 facet values. This isn't a
> particular big collection, but browsing a list of a 1000 strings isn't
> particularly practical.
>
> Most solutions to improving this involve grouping facet values. This
> could just be a simple alphabetic grouping, it could be some kind of
> distance based machine learning / statistical clustering, or it could be
> a hierarchical grouping based on SKOS broader / narrower information or
> RDFS subclass information. Here the latter approach seemed preferable,
> as it demonstrated the value add of using a description format like RDF,
> whereas the other two approaches would have been equally applicable if
> we using a relational database as our underlying representation.
>
> Therefore in the prototype I decided to explore the latter, although the
> Artstor data was still rather flat even with hierarchical grouping. One
> of the potential applications of integrating thesauri like LOC TGM was
> it seemed there might be some potential to transform this rather flat
> space of facet values to a more hierarchical organisation, although I
> never got that to work successfully, and it turned out to be
> problematic, not least because if we have two different classification
> schemes, we have no guarantee that if they use the same term that they
> are referring to the same concept, if we adopt the term / concept
> separation that SKOS does.
>
> So why the problems with counting hierarchically structured facet
> values?
>
> Two reasons: first the facet data is not guaranteed to be hierarchical -
> it's a lattice - so there is a danger that a single entry may occur
> multiple places in the tree leading to double counting. Second when we
> use inference, then if an entry is at a certain point in the tree, if
> the inferencer processes the relevant hierarchical relationship, it will
> also be at all the parent nodes. So again we run the risk of double
> couting.
>
> In the demo, I came up with some hacks to avoid this problem, that
> relied on the fact that the dataset would constrain the maximum depth of
> hierarchical relations to 2. In fact, in the Artstor data, the actual
> data has a potential depth of 3 but the transform used in the April demo
> constrained this to 2 in order to make things work. So of course when
> datasets are introduced that break this rule, the problems become
> apparent.
>
> However rather than remove the code together, I would suggest it is
> retained but a configuration option is added so it can be turned on and
> off for different datasets. This retains the existing functionality in
> the demos. I would also suggest adding comments about why the counting
> may be unreliable in the relevant source file - perhaps copied from this
> email. The reason I am advocating retaining this is I think retaining
> the demonstration of this UI metaphor is interesting and helps
> demonstrates some of the value add of using RDF i.e. the fact it can
> represent hierarchical relations, even if the logic of the way facet
> values are counted needs a fundamental rethink.
>
> I did have some ideas of a better solution to this problem. If it was
> possible to distinguish between inferred and non-inferred triples -
> which is possible with the Jena reasoner, which I hoped you would be
> using by now rather than the hacked reasoner I wrote for the April demo
> - then it would be possible to avoid double counting due to inference.
>
> The problem of double counting due to shoehorning a lattice into a
> hierarchy is potentially more complex. One way to do it would be to use
> two hashes to represent each parent node. One would contain all the
> visible entries at the parent node i.e. the facet values that belonged
> to that parent node in the base data i.e. without inferencer. The other
> would contain an invisible set of entries i.e. the facet values that
> belong to that node from the base and inferred data. The second hash
> structure is created purely for the purpose of doing accurrate counting,
> and because it is a hash it automatically removes duplicates due to a
> facet value occurring at multiple points in the tree. This will give the
> correct counts but will be computationally expensive, so the problem is
> to come up with a more efficient way of doing this. In fact, even the
> old algorithm that just used a single structure was reasonably
> computationally intensive, so I did spend some time optimizing the code
> you removed.
>
> Stefano requested the functionality that any facet values that return
> the same results set should be hidden. Getting correct counts (outlined
> above) is part of the solution to making this work reliably. The other
> part is to only apply this rule to leaf nodes. This means we aren't
> quite getting the requested functionality, as we might have a parent
> node that returns the same result set, as long as it contains nodes
> corresponding to facet values that do subset the results set but I think
> that behavior is probably OK, or at least preferable to not authoring
> hierarchical browsing at all.
>
> Dr Mark H. Butler
> mark-h.butler_at_hp.com
> HP Labs Bristol http://www.hpl.hp.com/people/marbut
>
> -----Original Message-----
> From: Ryan Lee [mailto:ryanlee_at_w3.org]
> Sent: 19 October 2004 11:39
> To: dev_at_simile.mit.edu
> Subject: Explanation of facet frequency bug
>
> Sometimes the facet browser has inaccurate counts for the rdf:type
> facet.
>
> The problem is with multiple levels of type subclassing.
>
> As in the TR dataset,
>
> rec:FirstEdition rdfs:subClassOf rec:REC .
> rec:REC rdfs:subClassOf rec:TRPub .
> rec:TRPub rdfs:subClassOf doc:Work .
>
> When a class rec:REC is first counted as a facet value for rdf:type, it
> creates a FacetValue object in an overall facet hash with a key of its
> superclass URI rec:TRPub containing a 'narrower term' hash of facets
> containing a FacetValue keyed on the class's URI, rec:REC.
>
> overall{'rec:TRPub' : (rec:TRPub, narrower{'rec:REC' : (rec:REC)})}
>
> But when rec:TRPub is counted as a facet value for rdf:type (for the
> same object), it creates a FacetValue object in the 'narrower term' hash
> of its superclass, doc:Work.
>
> overall{'doc:Work' : (doc:Work, narrower{'rec:TRPub' : (rec:TRPub)})}
>
> The next time rec:TRPub shows up, in a subsequent object, the hash
> already contains a key for rec:TRPub on its own AND as a narrower term
> in the doc:Work narrower terms hash. But the way things were written it
> would always go into the overall hash under its own term, never again as
> a narrower term.
>
> So any hierarchy greater than two levels will always introduce an off by
> one count.
>
> The solution is to increment all instances of the URI in use. But this
> highlights other bugs. With a correct frequency count, facets with
> restriction frequencies equal to the result set size are eliminated,
> including their narrower terms, even if those terms will produce useful
> restrictions. If those narrower terms are terminal leaves in the
> subclass hierarchy, they won't be in the overall hash anywhere else and
> won't be displayed. If they aren't (or are placed in the hash), they'll
> show up twice - once as subclasses, again as their own restrictions -
> when the result size has nothing to do with that subtree of the class
> hierarchy.
>
> The easiest solution is to drop this broader/narrower technique, which
> is what I committed. My apologies if you were getting attached to the
> dynamic tree open/close feature.
>

-- 
Ryan Lee                 ryanlee_at_w3.org
W3C Research Engineer    +1.617.253.5327
http://simile.mit.edu/

Received on Tue Oct 19 2004 - 21:26:16 EDT

This archive was generated by hypermail 2.3.0 : Thu Aug 09 2012 - 16:39:17 EDT