Web App Makeover - A Complete, Automated Scraping Scenario
Dec 17th, 2007 by dfhuynh
Our project offers quite a diverse toolkit of more than a dozen tools. And these tools are at different levels of maturity. Consequently, sometimes it can be hard for people other than our team to understand how all of these pieces fit together into a coherent, compelling story. Once in a while, we need to step back from the code editor and turn to the blog to put down in words where we are really heading with all that code…
Here is my attempt at doing that: I have written up this wiki page to document how we ourselves have used several of our tools to automate the scraping of the official MIT course catalog web site and provide better browsing features on the same data that it contains:
[Screenshots: the MIT course catalog site before and after the makeover]
The tools used include Solvent, Crowbar, Juggler, Exhibit, and Timegrid. (Juggler and Timegrid have not yet been officially released; use them at your own risk.) See this wiki page for more details.
First, note that our scraping tools (Solvent and Crowbar) let you deal with web pages at the level of the DOM (e.g., evaluating XPaths, retrieving HTML attributes) rather than at the level of streaming characters. This higher level of abstraction is easier to work at. Furthermore, Solvent and Crowbar can wait for all the dynamic JavaScript code in a web page to finish running, which means that you can scrape even those new Web 2.0 sites rather than just static web pages.
Second, with the tools used in this particular scenario, you code only in JavaScript rather than in a hodgepodge of languages (Perl, Python, etc.). Perhaps this uniformity helps lower the barrier to web app makeovers.
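To give a feel for what working at the level of the DOM looks like in JavaScript, here is a minimal sketch. It uses only the standard DOM XPath interface (`evaluate`, `snapshotItem`, `getAttribute`), not any Solvent- or Crowbar-specific API; the function name `extractLinks` and the XPath shown in the note after it are made up for illustration.

```javascript
// Hypothetical helper (not part of Solvent or Crowbar): collect the
// elements matched by an XPath expression and read data off each one.
// `doc` is any DOM Document, for example `document` in a browser.
function extractLinks(doc, xpath) {
  // 7 is the value of XPathResult.ORDERED_NODE_SNAPSHOT_TYPE in the DOM standard.
  var ORDERED_NODE_SNAPSHOT_TYPE = 7;
  var result = doc.evaluate(xpath, doc, null, ORDERED_NODE_SNAPSHOT_TYPE, null);

  var links = [];
  for (var i = 0; i < result.snapshotLength; i++) {
    var node = result.snapshotItem(i);
    links.push({
      href: node.getAttribute("href"), // retrieve an HTML attribute
      text: node.textContent.trim()    // retrieve the element's text
    });
  }
  return links;
}
```

In a browser, calling `extractLinks(document, "//a[@href]")` would return one object per link on the current page. No character-level parsing of the HTML source is involved: the browser has already built the DOM, including whatever the page's own scripts added to it.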
We are continually making our tools easier to use. But hopefully they are already useful and usable to many of our target users right now. If you have similar scenarios using our tools, please share with us! Thanks!


You and your team have developed some interesting stuff. We have already used the time plotting several times, but it would be nice to have all of this translated into Dutch or another language.
Are language files used, or was the project not designed to be bilingual?
Dear Guys at the SMILE Lab (Guys Smiley),
I realize that my timing is poor — it’s been almost a year since I surfed back to the SMILE lab and found this post. My purpose is to ask whether you would be so kind as to post links to resources that would constitute a layperson’s overview of where the Semantic Web is headed. Do not mistake the following comment for cynicism — it’s purely motivated by confusion. I’m befuddled about all of the talk of the inherent productivity gains that will come with Semantic Web schemas and apps. I do understand that there is an immense amount of back-end work still going on to standardize and disseminate these schemas and leverage them. Nevertheless, I’ve seen very few apps that are ‘live’ on the web and deliver immediate gains. (What’s so spectacular about Twine?)
For instance, in the wiki page you’ve linked to here, it is not clear to me whether the code written in Solvent (and Crowbar) can immediately be applied to other websites or databases, or whether, in effect, a programmer will have to write an entirely new scraper just to mine the data on any other site.
Perhaps what would be most helpful would be a “state of the field” post. What, in particular, are the major *barriers* still to be overcome in developing and disseminating Semantic Web applications widely around the web to lay users?
Thank you for your attention.
Sincerely,
Martin, a.k.a. a.k.a.