Haystack meeting notes

From: Stephen J. Garland <garland_at_csail.mit.edu>
Date: Tue, 28 Sep 2004 16:56:00 -0400

Following are some notes from last week's meeting. Notes from today's
meeting will be in a subsequent message (probably tomorrow morning).


Haystack Group Meeting, September 21, 2004

Informal talk by Nick Matsakis on record linkage

The object of Nick's work is to develop algorithms for lifting strings
into first-class resources. A sample application of such algorithms
would be to determine the identity of music albums in iTunes, which are
currently coded as strings. A given album may appear under more than
one title (for example, "The White Album" and "TWA [The Beatles]"). One
approach to deciding whether two titles represent the same album is to
match other properties associated with the titles (for example, the list
of tracks on the album).
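
As a rough illustration of matching on track lists (a sketch only, not
Nick's algorithm), the Python below compares two albums by the Jaccard
overlap of their normalized track titles. The function names, the
normalization rule, and the 0.8 threshold are all assumptions made for
this example.

```python
def normalize(title: str) -> str:
    """Lowercase a title and keep only letters, digits, and single spaces."""
    kept = "".join(ch if ch.isalnum() else " " for ch in title.lower())
    return " ".join(kept.split())


def track_overlap(tracks_a, tracks_b) -> float:
    """Jaccard similarity (0.0 to 1.0) between two lists of track titles."""
    a = {normalize(t) for t in tracks_a}
    b = {normalize(t) for t in tracks_b}
    if not a and not b:
        return 0.0  # no evidence either way
    return len(a & b) / len(a | b)


def same_album(tracks_a, tracks_b, threshold: float = 0.8) -> bool:
    """Guess that two album strings denote the same album.

    The threshold is a hypothetical cutoff chosen for illustration.
    """
    return track_overlap(tracks_a, tracks_b) >= threshold
```

Two entries titled "The White Album" and "TWA [The Beatles]" would be
judged the same album here if their track lists mostly coincide, even
though the title strings themselves share little.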

Nick will be developing algorithms that apply to multiple data sets,
that employ user feedback, and that are based on Conditional Random
Fields (which can accommodate features that are dependent on one
another). The basic probability distribution is
   P(y_ij | x_i, x_j) = (1/Z(x)) exp(sum_{i,j,k} lambda_k f_k(y_ij, x_i, x_j))
where Z(x) is a normalizing constant, the lambda_k are weights, and the
feature functions f_k measure properties such as string edit
distance or property matches.
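
To make the shape of this distribution concrete, here is a minimal
Python sketch for a single pair of records, with y = 1 meaning "same
entity" and y = 0 meaning "different". It is a simplification: it
normalizes over the two labels of one pair only, whereas a full
Conditional Random Field normalizes over joint labelings of all pairs.
The record fields ("title", "artist"), the two feature functions, and
the weights are hypothetical and not taken from Nick's work.

```python
import math


def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (cs != ct)))  # substitution
        prev = cur
    return prev[-1]


def f_title(y, xi, xj) -> float:
    """Feature: title similarity supports y=1, dissimilarity supports y=0."""
    a, b = xi["title"].lower(), xj["title"].lower()
    sim = 1.0 - edit_distance(a, b) / max(len(a), len(b), 1)
    return sim if y == 1 else 1.0 - sim


def f_artist(y, xi, xj) -> float:
    """Feature: fires when the label agrees with an exact artist match."""
    same = xi["artist"] == xj["artist"]
    return 1.0 if (y == 1) == same else 0.0


def pair_probability(y, xi, xj, features, weights) -> float:
    """P(y | xi, xj) = exp(sum_k lambda_k * f_k(y, xi, xj)) / Z."""
    def score(label):
        return math.exp(sum(w * f(label, xi, xj)
                            for f, w in zip(features, weights)))
    z = score(0) + score(1)  # normalizer over both labels of this pair
    return score(y) / z
```

In practice the weights would be learned from labeled pairs (for
example, from user feedback) rather than set by hand.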

Nick is working with three sample data sets and several annotations for
each data set (* indicates annotations that are first-class objects):
  bibliographies (BibTeX)
    publications*, authors, venues
  music
    tracks*, albums, artists
  email
    messages*, people, addresses


Informal talk by Steve Garland on Minimal Haystack

Steve Garland has been restructuring the code base for Haystack into a
core subset and a set of add-on packages. The current version of the
core subset is available from the svn repository at
     http://simile.mit.edu/haystack/branches/garland/smallHay.
This core version can be run either as a stand-alone application (using
ant) or as an Eclipse plug-in.

Steve is still pruning material out of the core. Currently, the core
subset and the full version of Haystack contain the following number of
files:
   core   full
    602   1105   java source files
     82    232   adenine source files
     10    102   jar files
      2     22   dlls
The compilation time for the core is 1/4 that of the full version.

Group members are requested to check out and run the core subset. Let
Steve know what's broken, what else can be removed from the core, and
what your priorities are for seeing features restored as add-ons.

Candidates for further thinning of the core (and people who have
indicated an interest in working on the candidates) are:
     the legacy adenine parser/compiler (Punya to replace remaining uses)
     navigation (Vineet to repackage as an add-on)
     Eclipse starting points (Steve)
     other features labeled as extensions in Vineet's code
       reorganization

Candidates for add-ons (and people responsible for adding them on) are:
     Simile (Steve)
     bioinformatics (Sumudu)
     calendar (Marios)
     wrapper induction (Ryan)
     instant messaging
     mail
     music
     rss
     weather

Open question: what should we call the core? It's still a little too
big to be called "minimal" or "small".


Administrivia

Instead of presenting progress reports at the weekly meetings, group
members should submit weekly progress reports by email on Mondays to
Nick Matsakis, who will then email a digest to the group.

Someone should check if db3 needs to be part of the Haystack
distribution and, if so, what should be done about its (GPL) license.

Next week's agenda: Informal talks by Artem on Java 1.5 and Punya on the
design of a new language as a possible replacement for adenine.

Received on Tue Sep 28 2004 - 20:55:55 EDT