Screencast
This entire tutorial has been recorded as a screencast, under Scraping Apartments.
Introduction
Say you want to rent an apartment in Toronto, Canada. You were told about this web site, [1], where you can search for apartments. Browse to that site in another window so that you can easily switch back to this tutorial.
On that site, specify Region to be Toronto Downtown and Results per page to be 25. Then click Search.
You will get back a list of apartments, spanning several search result pages. Great! Now you want to know where the apartments are located because you want to live close to downtown - but not that close. To find out where an apartment is located, you need to click on the corresponding More Information button and then, once the linked Web page has loaded, click on the Map It link.
Really, though, what you want is to see all apartments on a single map rather than clicking twice for each of several dozen apartments. Unfortunately, there's no feature for placing all apartments on one map. The site's owner made that decision and you, the customer and user, are out of luck. Sure, you could ask for a new feature and wait and see what comes of it...
...or you can just do it yourself using Piggy Bank. What we need to do is to selectively pick out bits and pieces of data from the results (particularly, addresses) and use them to plot a map. You need some code to do that picking and that piece of code is called a screen scraper (or just a scraper).
We have written that screen scraper for you. All you need to do is install it into your Piggy Bank and activate it.
Loading and Activating a Screen Scraper
Go to the VacancyGuide screen scraper and click the 'data coin'. Piggy Bank will load the information about this screen scraper and present you with a page like the following:
- Click on the Save button corresponding to the VacancyGuide.com Screen Scraper (circled in red above). The scraper then turns from grayed to colored and a small gear icon appears in its upper right corner.
- Click on the gear icon to activate this particular screen scraper. You should get a confirmation dialog box. Click OK. You should then see the gear icon change from gray to blue. That's it for installing a scraper.
- DISCLAIMER: Note that screen scrapers contain executable code and they can harm your computer. Only install screen scrapers from sources that you trust. Each scraper has a link to its JavaScript code, which you can read to verify that it is safe. We, the Simile team, do not accept any responsibility for damage to your computer or your data caused by screen scrapers not written by us.
- Now that you have installed a screen scraper for VacancyGuide.com, switch back to the Firefox window containing the apartment search results. You should now see a data coin icon in the bottom right corner of the window. Make sure that Firefox's status bar is showing (menu View → Status Bar is checked).
- Click on the data coin icon. Piggy Bank will then proceed to download the screen scraper's code. Then it runs the code against the apartment search result page. The code simulates clicking on those More Information buttons and you can see its progress in the following dialog box.
- Then the scraper looks up the longitudes and latitudes of the apartments' addresses.
- And finally, after all the data has been scraped, Piggy Bank shows it to you through Piggy Bank's own interface:
If you don't get anything similar, perhaps the VacancyGuide.com site has changed and we need to update the scraper's code. Please let us know.
- Now find the Map View as shown below and click on it.
- And get this:
- And one last step: find and click on heat in the features list. This will narrow the search results down to only apartments that claim to provide heating — a good feature in Canada.
- Read on to understand what you've just done using Piggy Bank.
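For the curious, the steps the scraper carried out above can be sketched roughly as follows. This is not the actual VacancyGuide scraper: the page contents, the coordinates, and the helpers `loadDetailPage`, `geocode`, and `scrape` are all hypothetical stand-ins, and everything is kept synchronous and in memory for brevity, whereas a real scraper would load pages and call a geocoding service over the network.

```javascript
// Made-up stand-ins for the "More Information" pages behind each search result.
const detailPages = {
  "/listing/1": "Address: 10 Example St., Toronto",
  "/listing/2": "Address: 20 Sample Ave., Toronto"
};

// Made-up coordinates standing in for a geocoding service.
const knownCoordinates = {
  "10 Example St., Toronto": { lat: 43.65, lng: -79.38 },
  "20 Sample Ave., Toronto": { lat: 43.66, lng: -79.39 }
};

// "Click" a More Information button: fetch the page behind the link.
function loadDetailPage(link) {
  return detailPages[link];
}

// Look up the latitude and longitude of an address, or null if unknown.
function geocode(address) {
  return knownCoordinates[address] || null;
}

// Follow every link, pick out the address, and attach its coordinates.
function scrape(links) {
  return links.map((link) => {
    const page = loadDetailPage(link);
    const address = page.replace(/^Address:\s*/, "");
    return { link, address, location: geocode(address) };
  });
}

const items = scrape(Object.keys(detailPages));
console.log(items);
```

Each resulting item carries both an address and a location, which is what a map view needs in order to place a pin.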
What Just Happened?
Like many other Web sites, VacancyGuide.com keeps its information—apartments' addresses, features, rents, etc.— in a database. However, VacancyGuide.com does not let you access its database directly. Rather, it creates a Web site through which you can search, browse, and view its data. This Web site stands between you and the database. You are limited by the functionalities provided by the Web site regardless of how rich and powerful the database is. For example,
- if the Web site does not let you view all apartments on a map, there is not much you can do (before you get Piggy Bank);
- if the Web site does not provide summary information (e.g., there are 30 apartments with heat among 78 apartments that your search returns), you are left to do the analysis yourself.
You can certainly copy-and-paste the bits and pieces of the search results over to other applications (e.g., a spreadsheet program) or other Web sites (e.g., Google Maps) and perform the analysis you need, but it's a LOT of copy-and-paste for even just a few dozen search results.
Take an extreme scenario: Suppose when you perform a search on a Web site, the Web site faxes or mails you the search results on paper. So, in order to do any analysis of the results, you must type them into your computer. Luckily, Web sites don't do that. However, in the current state of the Web, you still need to copy-and-paste the data. That's a little better than re-typing, but it's still inefficient.
But why do you need to copy-and-paste the data to other applications and other Web sites when it's all right there in your Web browser?! Why can't you just point Google Maps to the search result pages from VacancyGuide.com and be able to get a map of all apartments?
That is because VacancyGuide.com (as well as many other Web sites) does not describe its search results at a level of detail that your computer and other Web sites can understand. So, an address like “32 Vassar St., Cambridge, MA” looks no different from a song title like “99 Bottles of Beer on The Wall”. Without extensive analysis of these two strings, your computer and Google Maps cannot tell that one is an address and the other is a song title.
Even though inside VacancyGuide.com's database, “32 Vassar St., Cambridge, MA” is marked as an address, that marking is lost when that address is incorporated into the search results sent back to your Web browser. What the screen scraper you used did was to recover that marking—to decide that “32 Vassar St., Cambridge, MA” is indeed an address.
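To make "recovering that marking" a little more concrete, here is a minimal sketch in JavaScript, the language scrapers are written in. It is an illustration only, not the code of the VacancyGuide scraper: the HTML fragment, the class names, and the helpers `cellText` and `extractListing` are all assumptions invented for this example.

```javascript
// Hypothetical fragment of a search result page. To the browser, the title,
// the address, and the rent are all just text inside table cells.
const html =
  '<tr><td class="title">Bright one-bedroom</td>' +
  '<td class="addr">32 Vassar St., Cambridge, MA</td>' +
  '<td class="rent">$1,200</td></tr>';

// Return the text of the first cell with the given (assumed) class name.
function cellText(fragment, className) {
  const pattern = new RegExp('<td class="' + className + '">([^<]*)</td>');
  const match = pattern.exec(fragment);
  return match ? match[1].trim() : null;
}

// Turn the untyped text back into a record whose field names say what each
// value is: this string is an address, this number is a rent.
function extractListing(fragment) {
  const rentText = cellText(fragment, "rent");
  return {
    type: "Apartment",
    title: cellText(fragment, "title"),
    address: cellText(fragment, "addr"),
    rent: rentText ? Number(rentText.replace(/[^0-9.]/g, "")) : null
  };
}

const listing = extractListing(html);
console.log(listing.address);
```

Once a value sits in a field named `address`, other software no longer has to guess what it is, and can, for example, hand it to a mapping service.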
What you did through this tutorial using Piggy Bank and the screen scraper was to reconstruct a tiny part of VacancyGuide.com's database on your own computer. With that tiny database on your computer, Piggy Bank can then offer its own functionalities to search, browse, and view the data within. You are no longer limited by the original Web site's functionalities.
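As a rough illustration of why having the data locally helps, the sketch below treats a few scraped records as a tiny in-memory database and computes the kind of summary the original site does not offer. The records, the field names, and the helpers `countFeatures` and `withFeature` are hypothetical. Piggy Bank's own storage and browsing code is not shown here and may work quite differently.

```javascript
// A hypothetical handful of scraped records; the field names are assumptions.
const apartments = [
  { address: "1 Example St.", rent: 950, features: ["heat", "laundry"] },
  { address: "2 Example Ave.", rent: 1100, features: ["parking"] },
  { address: "3 Example Rd.", rent: 1250, features: ["heat", "parking"] }
];

// Count how many records carry each feature, like the numbers in a features list.
function countFeatures(records) {
  const counts = {};
  for (const record of records) {
    for (const feature of record.features) {
      counts[feature] = (counts[feature] || 0) + 1;
    }
  }
  return counts;
}

// Narrow the records down to those that have a given feature.
function withFeature(records, feature) {
  return records.filter((record) => record.features.includes(feature));
}

const counts = countFeatures(apartments);
const heated = withFeature(apartments, "heat");
console.log(counts.heat + " of " + apartments.length + " apartments have heat");
```

Clicking on heat in the features list corresponds, in spirit, to the `withFeature` call: the same records, narrowed by a property that the scraper made explicit.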
Summary
When data is inside Web sites, you have little control over it. When data is on your own computer, you can use software on your computer (including Piggy Bank) to search, browse, and view that data.


