Scraping Scenario

This page describes how we have used our tools Solvent, Crowbar, Juggler, Exhibit, and Timegrid (the last not yet released) together to create Picker, an application that helps MIT students choose their classes.

Rationale

The official MIT Course Catalog was designed in the early 1990s and is now showing its age compared to recent innovations on the Web. Visit any catalog page and you will see some of the problems:

  • Classes for Fall, Spring, and Summer are all shown, even though in most cases students just want to see classes for the coming term. Worse, there is no quick way to narrow the list down to that term.
  • In fact, there are hardly any browsing features at all. In recent years, online retailers have adopted "faceted browsing" to let users quickly narrow down products along several dimensions (e.g., zoom capability, resolution, price, and brand for cameras). Such browsing features would help students pick their classes.
  • A very important criterion in picking classes is arriving at a weekly schedule with no conflicts. The course catalog does little to support that: most students resort to Excel to build their own weekly schedules, a tedious and error-prone task repeated by thousands of students every single term.
  • Another common task for undergraduates is choosing elective classes after the core classes have been picked, which involves seeing which electives fit into the schedule already made up of core classes.

To fix some of these problems, all we really needed was the data. Getting that data was not trivial. If you inspect the HTML code of those pages, you can see how fragile it is. This is the HTML for a single class:

   <a name="6.01"></a>
   <p><b>6.01</b> 
       <b>Introduction to EECS I<br>(New)</b><br>
       <img alt="______" src="/icns/hr.gif"><br>
       <img width=16 height=16 align="bottom" alt="Undergrad" title="Undergrad" src="/icns/under.gif">  
       (<img width=16 height=16 align="bottom" alt="Fall" title="Fall" src="/icns/fall.gif">, 
        <img width=16 height=16 align="bottom" alt="Spring" title="Spring" src="/icns/spring.gif">) 
       <img src="/icns/lab.gif" alt="1/2 Institute Lab" 
           title="1/2 Institute Lab" width="16" height="16" align="bottom">
       
       <br>Prereq: <I><a href="m8a.html#8.02">8.02</a></I>
       <br>Units: 2-4-6
       <br><a href="editcookie.cgi?add=6.01"><img align="bottom" border=1 width=16 height=16 
           alt="Add to schedule" title="Add to schedule" src="/icns/button1.gif"></a> 
       <b>Compulsory:</b> 
       <i>Diagnostic Test Required</i> 
       <b>Lecture:</b> 
       <i>T10-11.30</i> (<a href="http://whereis.mit.edu/map-jpg?mapterms=26">26-100</a>) 
       <b>Lab:</b> 
       <i>T11.30-1,R10-1</i> (<a href="http://whereis.mit.edu/map-jpg?mapterms=34">34-501</a>) or 
       <i>T2-3.30,R2-5</i> (<a href="http://whereis.mit.edu/map-jpg?mapterms=34">34-501</a>) or 
       <i>W11.30-1,F10-1</i> (<a href="http://whereis.mit.edu/map-jpg?mapterms=34">34-501</a>)
       <!--s-->
       
       <br><img alt="______" src="/icns/hr.gif">
       <br>
       An integrated introduction to electrical engineering and computer science, 
       taught using substantial laboratory experiments with mobile robots. Key ...
       <br><I>H. Abelson, L. P. Kaelbling, J. K. White</I>
   </p><!--end-->

There is little hierarchy in the markup, and a lot of the information is carried only by images; it has to be recovered from their alt and title attributes.
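For example, whether a class is offered in the Fall or the Spring can only be read off the term icons. Here is a minimal sketch of that kind of extraction in plain DOM JavaScript; it is an illustration only, not our actual scraper, and it assumes it is handed the <p> element wrapping a single class, as in the sample above:

    // Recover term and level information for one class from its icon images,
    // since the markup encodes these facts only in <img> alt/title attributes.
    function extractIconData(classParagraph) {
        var terms = [];
        var level = null;
        var images = classParagraph.getElementsByTagName("img");
        for (var i = 0; i < images.length; i++) {
            var alt = images[i].getAttribute("alt");
            if (alt == "Fall" || alt == "Spring" || alt == "Summer") {
                terms.push(alt);              // offered terms
            } else if (alt == "Undergrad" || alt == "Grad") {
                level = alt;                  // class level
            }
        }
        return { terms: terms, level: level };
    }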

Workflow

Step 1. Scraping

We start by using Solvent to inspect the DOM of each of those catalog web pages. Several iterations got us to a scraper that can recover most of the data with sufficient accuracy. Since Solvent currently only outputs RDF, we changed the scraper a little so that it outputs JSON, and we use Crowbar to automate the scraping process for all pages. Here is the latest scraper:

 http://simile.mit.edu/repository/course-picker/trunk/src/workflow/spring-fall/scraper.js

Here is one of the files that the scraper generates. It is already in the JSON format that Exhibit accepts.
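For reference, the JSON format Exhibit accepts is essentially a top-level "items" array of flat property/value objects. Here is a simplified sketch of what the class above might look like as one such item; the property names are illustrative, not necessarily the ones our scraper actually emits:

    {
        "items": [
            {
                "label":       "6.01",
                "name":        "Introduction to EECS I",
                "level":       "Undergrad",
                "terms":       [ "Fall", "Spring" ],
                "prereq":      "8.02",
                "units":       "2-4-6",
                "instructors": "H. Abelson, L. P. Kaelbling, J. K. White"
            }
        ]
    }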

Step 2. Post-processing

Note that for each course (known as a "department" at other schools), there can be more than one catalog web page. For example, course 6 (Electrical Engineering and Computer Science) has 3 pages altogether (1, 2, and 3). After scraping is complete, we still need to merge all the JSON files for each course into a single JSON file. We use Juggler to automate this step.
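Conceptually, the merge just concatenates the "items" arrays of the per-page files into one file per course. Below is a rough JavaScript sketch of what that step amounts to; loadJson and saveJson are hypothetical I/O helpers and the file names are made up, since Juggler is what actually performs this step for us:

    // Merge the per-page JSON files of one course into a single Exhibit data file.
    // loadJson and saveJson are hypothetical helpers used only for illustration.
    function mergeCoursePages(pageFiles, outputFile) {
        var merged = { items: [] };
        for (var i = 0; i < pageFiles.length; i++) {
            var data = loadJson(pageFiles[i]);
            merged.items = merged.items.concat(data.items);
        }
        saveJson(outputFile, merged);
    }

    // Hypothetical example: course 6 spans three catalog pages.
    mergeCoursePages(["m6a.json", "m6b.json", "m6c.json"], "course-6.json");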

Step 3. Building the Web Application

We use Exhibit and Timegrid to build the web application, which consists only of HTML, CSS, JavaScript, and JSON files; that is, only static files. There is absolutely no server-side infrastructure whatsoever. You can see the whole web application live from our source code repository:

 http://simile.mit.edu/repository/course-picker/trunk/src/webapp/

And if you click on index.html, you run the web application! No database, no web server setup, no server-side scripting.
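If you are wondering how that can work, an Exhibit page declares its data with plain <link> elements and renders everything in the browser. A stripped-down sketch of such a page follows; this is not our actual index.html, and the exact Exhibit API URL and version may differ from what is shown here:

    <html>
      <head>
        <script src="http://static.simile.mit.edu/exhibit/api-2.0/exhibit-api.js"></script>
        <link rel="exhibit/data" type="application/json" href="course-6.json" />
      </head>
      <body>
        <!-- a facet for narrowing classes down, and a panel to display them -->
        <div ex:role="facet" ex:expression=".terms"></div>
        <div ex:role="viewPanel"></div>
      </body>
    </html>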


Automation

As long as the official course catalog pages remain roughly the same, we can automate the scraping process completely and keep the data up to date for every coming term. You can check out the Picker code base from

 http://simile.mit.edu/repository/course-picker/trunk/

and follow these steps to see how that automation works.
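At a high level, that automation is just the three steps above strung together once per term. As a purely hypothetical sketch (every helper name below is made up; the real scripts live in the repository):

    // Refresh the data for a new term. All helpers here are hypothetical
    // placeholders for the actual scripts in the repository.
    function refreshTerm(catalogPageUrls) {
        for (var i = 0; i < catalogPageUrls.length; i++) {
            scrapeWithCrowbar(catalogPageUrls[i]);  // Step 1: scrape each page
        }
        mergeCourseFiles();                         // Step 2: merge per-course JSON
        copyDataIntoWebapp();                       // Step 3: refresh the static webapp
    }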