Re: Sneak peek from Steve Dunham on 2006-03-03 (stdin)

From: Steve Dunham <dunhamsteve_at_gmail.com>
Date: Fri, 3 Mar 2006 14:30:23 -0800

On 3/3/06, David Huynh <dfhuynh_at_csail.mit.edu> wrote:
> Steve Dunham wrote:
>
> >On 3/3/06, David Huynh <dfhuynh_at_csail.mit.edu> wrote:
> >
> >
> >>of Piggy Bank + Solvent 3.x... work in early progress...
> >>
> >> http://people.csail.mit.edu/dfhuynh/research/media/UIST%202006/uist2006.swf
> >>
> >>(Look ma, no angle bracket and no URI! No bnode. No xpath. No custom javascript scraper. It's just the ol' Web. With semantics.)
> >>
> >>
> >
> >It's looking _very_ slick. I particularly like how the results are
> >filtered in-place.
> >
> >
> Thanks, Steve :-) Yup, the "in-place" is the whole message to the user:
> "it is the same Web that you've come to know and love, only better."
>
> >A few ideas for enhancements (not that you don't have enough to keep
> >you busy forever):
> >
> > * Guess the property names from the values. (Maybe using some kind
> >of text classification.)
> >
> >
> There's a possibility of using the fixed text nearby a field value as
> the field name. But that is not always reliable. I'm also thinking about
> retaining units (like $ from "$19.99" and % from "35% off") and formats
> (like the strikethru in "was $19.99, now only $15.99"). Such are syntax
> details that hint at semantics and help keep context.
>
> > * Guess type of data (from the properties and text classification of
> >the items/page)

> Not sure how to do this... Could you say some more on this?

Say we know that the words hardcover and softcover both frequently occur
as objects of the book:format property. (From our Lucene index or some
other statistics on our database.) Then it is likely that the field
we're looking at is "book:format". (And, perhaps, that the object is a
book:Book.) If we wished, we could also calculate the probability that
the property is "book:format" given the values that appear in it.
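
A rough sketch of that probability calculation, assuming a naive Bayes model with a uniform prior and add-one smoothing. The training data, the "book:genre" property, and the names buildCounts and propertyProbabilities are all made up for illustration; real counts would come from the index or database.

```javascript
// Hypothetical training data: property -> values seen as its objects.
var training = {
  'book:format': ['hardcover', 'softcover', 'hardcover', 'paperback'],
  'book:genre':  ['mystery', 'romance', 'mystery', 'biography']
};

// Count how often each value occurs under each property.
function buildCounts(data) {
  var counts = {}, vocab = {};
  Object.keys(data).forEach(function (prop) {
    counts[prop] = { total: 0, values: {} };
    data[prop].forEach(function (v) {
      counts[prop].values[v] = (counts[prop].values[v] || 0) + 1;
      counts[prop].total++;
      vocab[v] = true;
    });
  });
  return { counts: counts, vocabSize: Object.keys(vocab).length };
}

// Estimate P(property | values), treating the values as independent
// and every property as equally likely beforehand (both are assumptions).
function propertyProbabilities(model, values) {
  var props = Object.keys(model.counts);
  var logScores = {};
  props.forEach(function (prop) {
    var c = model.counts[prop], score = 0;
    values.forEach(function (v) {
      var n = c.values[v] || 0;
      // Add-one smoothing, with one extra slot for unseen values.
      score += Math.log((n + 1) / (c.total + model.vocabSize + 1));
    });
    logScores[prop] = score;
  });
  // Normalize in log space so the results sum to 1.
  var max = Math.max.apply(null, props.map(function (p) { return logScores[p]; }));
  var sum = 0, probs = {};
  props.forEach(function (p) {
    probs[p] = Math.exp(logScores[p] - max);
    sum += probs[p];
  });
  props.forEach(function (p) { probs[p] /= sum; });
  return probs;
}

var model = buildCounts(training);
var probs = propertyProbabilities(model, ['hardcover', 'softcover']);
```

With this toy data, probs['book:format'] comes out well above probs['book:genre'], which is all the sketch is meant to show.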

I believe this problem is generally referred to as text
classification. With a decent set of training data, it'd learn that
St / Street / ... frequently occur in addresses, etc.

We could augment this with some hand-coded checks (regular expressions
for phone numbers, email addresses, dates, etc.)
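
For instance, a few hand-coded checks might look like the following. These patterns are loose, US-centric guesses of mine, and the names checks and guessType are only for illustration:

```javascript
// Illustrative patterns only; each one is deliberately loose.
var checks = [
  { type: 'email', re: /^[^\s@]+@[^\s@]+\.[^\s@]+$/ },
  { type: 'phone', re: /^\(?[0-9]{3}\)?[-. ]?[0-9]{3}[-. ]?[0-9]{4}$/ },
  { type: 'date',  re: /^[0-9]{1,2}\/[0-9]{1,2}\/[0-9]{2,4}$/ }
];

// Return the first type whose pattern matches the trimmed text, or null.
function guessType(text) {
  var s = text.replace(/^\s+|\s+$/g, '');
  for (var i = 0; i < checks.length; i++) {
    if (checks[i].re.test(s)) {
      return checks[i].type;
    }
  }
  return null;
}
```

These would run before (or alongside) the statistical guess, since a field full of things that look like phone numbers hardly needs a classifier.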

> > * Save to piggy bank (of course).

> Definitely. Now that I have total control of the Web page, I can even
> add tagging UI right on top of each item... :-) Add context menu command
> like "Call this phone number", "Add to my calendar using this date
> field", etc.

Not total control - you seem to have some dueling-AJAX issues with Netflix. :)

(I think their javascript is preventing piggy bank from getting some
hover events, so I can't select titles.)

> > * Short term - I think you could make a good guess at what elements
> >are addresses and put up a google map. That'd have good demo-value.
> >
> >
> Yes, I can see a little Google Map in the sidebar... Do you know of any
> address detection Javascript code out there?

A very naive one, which I'm making up on the spot, is:

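  var states = { 'AZ':1 /* , ... */ };  // two-letter state codes
  // house number, rest of the street/city, state, optional 5-digit ZIP
  // ([A-z] would also match [ \ ] ^ _ and `, so spell out both ranges)
  var addr_RE = /[0-9][0-9A-Za-z]* .*, ?([A-Z]{2}) *([0-9]{5})?/;

  var m = text.match(addr_RE);
  if (m && states[m[1]]) {
     // ... we have a likely address ...
  }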

I could do better in a little more space, if I thought about it a bit.
(The USPS has a list of street designations and abbreviations, which
might help.) A variation on the above might suffice for a demo. The
.* could be made a little more specific. (You'd want dashes, spaces,
and commas, at least.)
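
Here is a sketch of that variation, with the .* narrowed to letters, digits, spaces, periods, commas and dashes, plus a check for a street designation. The states and designations tables are tiny illustrative subsets I picked, not the real lists, and findAddress is a made-up name:

```javascript
// Small illustrative subsets; the real lists are much longer.
var states = { 'AZ': 1, 'CA': 1, 'MA': 1, 'NY': 1, 'WA': 1 };
var designations = {
  'ST': 1, 'STREET': 1, 'AVE': 1, 'AVENUE': 1,
  'RD': 1, 'ROAD': 1, 'BLVD': 1, 'DR': 1
};

// House number, then street/city text limited to letters, digits,
// spaces, periods, commas and dashes, then state and optional ZIP.
var addrRE = /\b([0-9][0-9A-Za-z]*) ([0-9A-Za-z .,-]+), ?([A-Z]{2})\b *([0-9]{5})?/;

function findAddress(text) {
  var m = text.match(addrRE);
  if (!m || !states[m[3]]) {
    return null;
  }
  // Require at least one word of the street/city part to be a designation.
  var words = m[2].toUpperCase().split(/[ .,]+/);
  var hasDesignation = words.some(function (w) {
    return designations[w] === 1;
  });
  if (!hasDesignation) {
    return null;
  }
  return { text: m[0], state: m[3], zip: m[4] || null };
}
```

Something like findAddress(node.textContent) on each candidate element would probably be enough to decide which ones to hand to the map.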
Received on Fri Mar 03 2006 - 22:28:49 EST
