Re: Sneak peek

From: David Huynh <dfhuynh_at_csail.mit.edu>
Date: Sat, 04 Mar 2006 07:51:47 -0500

Steve Dunham wrote:

>[snip]
>
>
>Say we know that the words "hardcover" and "softcover" frequently occur
>as objects of the book:format property. (From our lucene index or some
>other statistics on our database.) Then it is likely that the field
>we're looking at is "book:format". (And, perhaps, that the object is a
>book:Book.) If we wished we could also calculate the probability that
>the property is "book:format" given the values that appear in it.
>
>I believe this problem is generally referred to as text
>classification. With a decent set of training data, it'd learn that
>St / Street / ... frequently occur in addresses, etc.
>
>We could augment this with some hand-coded checks (regular expressions
>for phone numbers, email addresses, dates, etc.)
>
>
Interesting... Let me think more about this.
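
To make the idea concrete, here is a rough sketch of guessing a property from the words seen in a field, using naive Bayes with add-one smoothing. The counts in valueCounts and the name guessProperty are made up for illustration; real counts would have to come from the Lucene index or other statistics on the database, and a uniform prior over properties is assumed.

```javascript
// Hypothetical counts: how often each word occurs as an object of each
// property. These numbers are invented for illustration only.
var valueCounts = {
  'book:format': { hardcover: 40, softcover: 35, paperback: 25 },
  'book:title': { the: 50, history: 49, hardcover: 1 }
};

function has(obj, key) {
  return Object.prototype.hasOwnProperty.call(obj, key);
}

// Returns the property whose known values best explain the given words.
// Each property is scored by the sum of log P(word | property), with
// add-one smoothing so that unseen words do not zero out a candidate.
function guessProperty(words, counts) {
  var vocab = {};
  var prop, w;
  for (prop in counts) {
    for (w in counts[prop]) {
      vocab[w] = true;
    }
  }
  var vocabSize = Object.keys(vocab).length;

  var best = null;
  var bestScore = -Infinity;
  for (prop in counts) {
    var total = 0;
    for (w in counts[prop]) {
      total += counts[prop][w];
    }
    var score = 0;
    for (var i = 0; i < words.length; i++) {
      var word = words[i].toLowerCase();
      var c = has(counts[prop], word) ? counts[prop][word] : 0;
      score += Math.log((c + 1) / (total + vocabSize));
    }
    if (score > bestScore) {
      bestScore = score;
      best = prop;
    }
  }
  return best;
}
```

With the made-up counts above, guessProperty(['hardcover', 'softcover'], valueCounts) picks 'book:format', since both words are far more common there than under 'book:title'.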

>>> * Save to piggy bank (of course).
>>>
>>>
>>Definitely. Now that I have total control of the Web page, I can even
>>add tagging UI right on top of each item... :-) Add context menu command
>>like "Call this phone number", "Add to my calendar using this date
>>field", etc.
>>
>>
>Not total control - you seem to have some dueling-AJAX issues with Netflix. :)
>
>(I think their javascript is preventing piggy bank from getting some
>hover events, so I can't select titles.)
>
>
Argh... Those Web 2.0 people always get in our way. :-)


>A very naive one, which I'm making up on the spot is:
>
> var states = { 'AZ':1, ...};
> var addr_RE = /[0-9][0-9A-Za-z]* .*, ?([A-Z]{2}) *([0-9]{5})?/;
>
>
> var m = text.match(addr_RE);
> if (m && states[m[1]]) {
> ... we have a likely address ...
> }
>
>I could do better in a little more space, if I thought about it a bit.
>(The USPS has a list of street designations and abbreviations, which
>might help.) A variation on the above might suffice for a demo. The
>.* could be made a little more specific. (you'd want dashes, spaces,
>and commas, at least.)
>
>
Thanks! I'll test that out when I get a chance!
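
A possible variation for a demo, following the suggestion to make the .* more specific: the street part is limited to letters, digits, spaces, periods, commas, and dashes, and the state code must end at a word boundary. The states table here is a small hypothetical subset and findLikelyAddress is an invented name; this is a sketch, not a tested address parser.

```javascript
// Assumption: a tiny subset of state codes, for illustration only.
var states = { AZ: 1, CA: 1, MA: 1, NY: 1 };

// House number, then a street/city part restricted to letters, digits,
// spaces, periods, commas and dashes, then ", ST" and an optional 5-digit ZIP.
var addrRE = /\b[0-9][0-9A-Za-z]* [0-9A-Za-z .,\-]*?, ?([A-Z]{2})\b *([0-9]{5})?/;

// Returns the matched text, state and ZIP (or null) for a likely address,
// or null when nothing address-like with a known state code is found.
function findLikelyAddress(text) {
  var m = text.match(addrRE);
  if (m && Object.prototype.hasOwnProperty.call(states, m[1])) {
    return { match: m[0], state: m[1], zip: m[2] || null };
  }
  return null;
}
```

For example, findLikelyAddress('77 Massachusetts Ave, Cambridge, MA 02139') yields state 'MA' and zip '02139'. Only the first regex match is checked against the states table, so a stray two-letter uppercase token earlier in the text can hide a real address later on.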

David
Received on Sat Mar 04 2006 - 12:52:10 EST
