RE: Explanation of facet frequency bug from Butler, Mark H (Labs Bristol) on 2004-10-20 (stdin)

From: Butler, Mark H (Labs Bristol) <"Butler,>
Date: Wed, 20 Oct 2004 12:04:26 +0100

Hi Ryan

I think the other big issue is the current approach to inference in
Longwell is a hack, but as I understand that the approach Jena currently
takes to equivalen is very expensive, and although this is a known
problem it has not been resolved, so the Jena inferencer probably still
won't cope with the demo datasets. Other groups have encountered the
same problem, so Dave Reynolds is aware of this and has some thoughts
about a better way to implement equivalence, but work has not yet begun
on it.

Switching to a Jena reasoner would have all sorts of advantages,
especially as it would be possible to distinguish between base and
inferred triples. The browser code tries to distinguish between
preferred and alternative versions of equivalent resources (I'm not
quite sure of the names, but I think you'll know what I mean) to prevent
things appearing twice. This is really ugly / hacky, but switching to
the Jena inferencer would make it easy to solve this as it does this
automatically for equivalent resources, which would be a lot more
elegant.

The other big hack are the database models, as there are two types, one
that queries the persistant model directly, the other that loads the
model from the database into memory and then queries that. The latter is
clearly cheating, as this approach won't scale - you might as well use
an in memory model. I think the way to avoid this is to have some kind
of caching implementation, and if you remember when you came to Bristol
this is one thing we spoke about. Specifically Ian Dickinson had a
simple caching algorithm that implemented the Jena API, so that seemed a
good starting point. So I think getting rid of the in memory models and
replacing them with a proper cached model would be a lot better,
although this might have implications for performance statistics - some
databases like Cholestrol already use such a caching mechanism.

Limiting the browser to two levels of hierarchical terms also really
helped performance, so there is a general issue here about how scalable
the navigation approaches used in Longwell are, and whether they needed
rethinking. Longwell is not alone in this - Haystack had similar
problems when we first tried to use some of the demo datasets.

hope that helps, best regards

Mark

-----Original Message-----
From: Ryan Lee [mailto:ryanlee_at_w3.org]
Sent: 19 October 2004 22:26
To: general_at_simile.mit.edu
Subject: Re: Explanation of facet frequency bug

Hi Mark,

Thanks for the background and explanation. Offhand, is there anything
else you'd classify in the code as needing reconsideration? I wasn't
aware you'd already been thinking of a solution for this problem (I take
back the phrasing of 'bug') and would like to not duplicate your
efforts.

I've added a configuration option (in config.n3) for turning on the
hierarchical display. It defaults to the flat display as most datasets
aren't going to conform to the required structure for hierarchical
display to function as intended.

Certainly that's just a temporary solution for the upcoming release.
I'll take some time after the release to consider how to implement a
better solution.

Butler, Mark H (Labs Bristol) wrote:
> Hi Ryan
>
> Some context here:
>
> Early on in the design of the browser, I was faced with the problem
> that in the Artstor data certain facets had very sparse facet values
> e.g. on average 2 of items of instance data per facet value. This
> means for a collection of 2000 images you would have 1000 facet
> values. This isn't a particular big collection, but browsing a list of

> a 1000 strings isn't particularly practical.
>
> Most solutions to improving this involve grouping facet values. This
> could just be a simple alphabetic grouping, it could be some kind of
> distance based machine learning / statistical clustering, or it could
> be a hierarchical grouping based on SKOS broader / narrower
> information or RDFS subclass information. Here the latter approach
> seemed preferable, as it demonstrated the value add of using a
> description format like RDF, whereas the other two approaches would
> have been equally applicable if we using a relational database as our
underlying representation.
>
> Therefore in the prototype I decided to explore the latter, although
> the Artstor data was still rather flat even with hierarchical
> grouping. One of the potential applications of integrating thesauri
> like LOC TGM was it seemed there might be some potential to transform
> this rather flat space of facet values to a more hierarchical
> organisation, although I never got that to work successfully, and it
> turned out to be problematic, not least because if we have two
> different classification schemes, we have no guarantee that if they
> use the same term that they are referring to the same concept, if we
> adopt the term / concept separation that SKOS does.
>
> So why the problems with counting hierarchically structured facet
> values?
>
> Two reasons: first the facet data is not guaranteed to be hierarchical

> - it's a lattice - so there is a danger that a single entry may occur
> multiple places in the tree leading to double counting. Second when we

> use inference, then if an entry is at a certain point in the tree, if
> the inferencer processes the relevant hierarchical relationship, it
> will also be at all the parent nodes. So again we run the risk of
> double couting.
>
> In the demo, I came up with some hacks to avoid this problem, that
> relied on the fact that the dataset would constrain the maximum depth
> of hierarchical relations to 2. In fact, in the Artstor data, the
> actual data has a potential depth of 3 but the transform used in the
> April demo constrained this to 2 in order to make things work. So of
> course when datasets are introduced that break this rule, the problems

> become apparent.
>
> However rather than remove the code together, I would suggest it is
> retained but a configuration option is added so it can be turned on
> and off for different datasets. This retains the existing
> functionality in the demos. I would also suggest adding comments about

> why the counting may be unreliable in the relevant source file -
> perhaps copied from this email. The reason I am advocating retaining
> this is I think retaining the demonstration of this UI metaphor is
> interesting and helps demonstrates some of the value add of using RDF
> i.e. the fact it can represent hierarchical relations, even if the
> logic of the way facet values are counted needs a fundamental rethink.
>
> I did have some ideas of a better solution to this problem. If it was
> possible to distinguish between inferred and non-inferred triples -
> which is possible with the Jena reasoner, which I hoped you would be
> using by now rather than the hacked reasoner I wrote for the April
> demo
> - then it would be possible to avoid double counting due to inference.

>
> The problem of double counting due to shoehorning a lattice into a
> hierarchy is potentially more complex. One way to do it would be to
> use two hashes to represent each parent node. One would contain all
> the visible entries at the parent node i.e. the facet values that
> belonged to that parent node in the base data i.e. without inferencer.

> The other would contain an invisible set of entries i.e. the facet
> values that belong to that node from the base and inferred data. The
> second hash structure is created purely for the purpose of doing
> accurrate counting, and because it is a hash it automatically removes
> duplicates due to a facet value occurring at multiple points in the
> tree. This will give the correct counts but will be computationally
> expensive, so the problem is to come up with a more efficient way of
> doing this. In fact, even the old algorithm that just used a single
> structure was reasonably computationally intensive, so I did spend
> some time optimizing the code you removed.
>
> Stefano requested the functionality that any facet values that return
> the same results set should be hidden. Getting correct counts
> (outlined
> above) is part of the solution to making this work reliably. The other

> part is to only apply this rule to leaf nodes. This means we aren't
> quite getting the requested functionality, as we might have a parent
> node that returns the same result set, as long as it contains nodes
> corresponding to facet values that do subset the results set but I
> think that behavior is probably OK, or at least preferable to not
> authoring hierarchical browsing at all.
>
> Dr Mark H. Butler
> mark-h.butler_at_hp.com
> HP Labs Bristol http://www.hpl.hp.com/people/marbut
>
> -----Original Message-----
> From: Ryan Lee [mailto:ryanlee_at_w3.org]
> Sent: 19 October 2004 11:39
> To: dev_at_simile.mit.edu
> Subject: Explanation of facet frequency bug
>
> Sometimes the facet browser has inaccurate counts for the rdf:type
> facet.
>
> The problem is with multiple levels of type subclassing.
>
> As in the TR dataset,
>
> rec:FirstEdition rdfs:subClassOf rec:REC .
> rec:REC rdfs:subClassOf rec:TRPub .
> rec:TRPub rdfs:subClassOf doc:Work .
>
> When a class rec:REC is first counted as a facet value for rdf:type,
> it creates a FacetValue object in an overall facet hash with a key of
> its superclass URI rec:TRPub containing a 'narrower term' hash of
> facets containing a FacetValue keyed on the class's URI, rec:REC.
>
> overall{'rec:TRPub' : (rec:TRPub, narrower{'rec:REC' : (rec:REC)})}
>
> But when rec:TRPub is counted as a facet value for rdf:type (for the
> same object), it creates a FacetValue object in the 'narrower term'
> hash of its superclass, doc:Work.
>
> overall{'doc:Work' : (doc:Work, narrower{'rec:TRPub' : (rec:TRPub)})}
>
> The next time rec:TRPub shows up, in a subsequent object, the hash
> already contains a key for rec:TRPub on its own AND as a narrower term

> in the doc:Work narrower terms hash. But the way things were written
> it would always go into the overall hash under its own term, never
> again as a narrower term.
>
> So any hierarchy greater than two levels will always introduce an off
> by one count.
>
> The solution is to increment all instances of the URI in use. But
> this highlights other bugs. With a correct frequency count, facets
> with restriction frequencies equal to the result set size are
> eliminated, including their narrower terms, even if those terms will
> produce useful restrictions. If those narrower terms are terminal
> leaves in the subclass hierarchy, they won't be in the overall hash
> anywhere else and won't be displayed. If they aren't (or are placed
> in the hash), they'll show up twice - once as subclasses, again as
> their own restrictions - when the result size has nothing to do with
> that subtree of the class hierarchy.
>
> The easiest solution is to drop this broader/narrower technique, which

> is what I committed. My apologies if you were getting attached to the

> dynamic tree open/close feature.
>

-- 
Ryan Lee                 ryanlee_at_w3.org
W3C Research Engineer    +1.617.253.5327
http://simile.mit.edu/

Received on Wed Oct 20 2004 - 11:05:33 EDT

This archive was generated by hypermail 2.3.0 : Thu Aug 09 2012 - 16:39:17 EDT