A screen scraper in Piggy Bank is a piece of code that extracts "pure" information from within a web page's content, and possibly from related pages. Screen scrapers are implemented in Javascript; a basic understanding of Javascript and programming is necessary to write one, but don't worry if you aren't an expert - even a superficial grasp of Javascript can get you a long way.

Is there a tool to help me write a scraper?

Yes, there is! It's called Solvent, and it's another Firefox extension that works with Piggy Bank to help you write screen scrapers. This tutorial expects you to have Solvent installed, so if you haven't done so already, go to the Solvent page, install it, restart your browser, and come back here.

What will we scrape?

This tutorial will lead you through an example of scraping apartment listings from http://www.craigslist.org/. First, browse to that site and check it out yourself. Click on a city on the right, then click on "apts / housing" under "housing" (for some cities, you might need to choose among several brokerage options). You should arrive at a page that looks like Boston's apartment listings: http://boston.craigslist.org/aap/. The various apartment listing entries on that page are what we wish to extract.

You may safely switch between viewing this tutorial and the Craigslist pages, using tabs or otherwise; so long as you keep Solvent open (see below), its state won't change from page to page. But do keep in mind that a Solvent script always runs against the page you're currently viewing; if your test results don't look right or don't show up at all, make sure you're viewing the correct page!

Firing up Solvent

Solvent operates by using your understanding of a page's contents to generate code. Much of the work can be done by using your mouse to point at one piece of interesting information and then choosing items from lists for further annotation; Solvent will then extrapolate how to handle the rest of the information in the page. It will bring you as far as generating files ready for publication on the web.

In order to start Solvent after you have it installed, simply click the "spray can" icon on the bottom right of your browser's window (next to the Piggy Bank "data coin").

The Solvent "spray can" icon that activates it

After clicking on it, Solvent becomes visible in the bottom half of your browser window:

The Solvent user interface

To start, choose the Capture tab, then click the Capture button with the butterfly net:

Choose capture

Capture

The capturing process involves moving your mouse to a place in the page where one piece of interesting information is located; if you're fortunate, Solvent will be able to extract where all the rest of the information is located based on that one. Click to select; don't worry about links, they won't be followed when you click. As you move the mouse, a yellow background and red border will follow you to show which part of the page would be selected if you clicked:

Select a page element, which will be surrounded by red border and yellow background

This information is distilled into a language called XPath, a W3C standard for navigating an XML document. You don't need to know XPath either, but familiarity can be very helpful in refining your selection.
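
For a concrete sense of what these expressions look like, the XPath Solvent ends up with for the Craigslist listing page (you'll see it again in the generated code below) is roughly:

/html/body[@class="toc"]/blockquote/p

Read left to right, it starts at the document root, takes the body element whose class attribute is "toc", descends into its blockquote, and selects every p (paragraph) inside it; each additional segment narrows the selection further.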

Take note of the yellow margin on the right side of the window. The red rectangles show every piece of information Solvent can extract based on the one you've already selected. If there are fewer or more of them than you expect, you will need to refine your selection.

If your selected portion is too big, click the Capture button once more and try to narrow the area down to a more specific part of the page. If instead you want a larger portion (say you change your mind and also wish to include the neighborhood area), there are two ways to do so. The first is to delete the last part of the XPath, effectively going up the hierarchy from a more specific selection to a less specific one, by selecting the blue left-up-arrow:

This will delete the last segment of the XPath

For some, this may prove the easiest way to work: choose the most specific element possible, then broaden the selection from there.
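
For example, if you had selected just an apartment's link, the XPath might end in something like

/html/body[@class="toc"]/blockquote/p/a[1]

and clicking the arrow once would drop the trailing /a[1], leaving

/html/body[@class="toc"]/blockquote/p

which broadens the selection from the link alone to the whole paragraph that contains it.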

The second way to select a larger area is simply to click the Capture button again and select a page element that covers a bigger portion; in this example, you could do that by pointing at the white space at the end of the line:

Selecting a higher level element by pointing at the white space at the end of the line

Breaking down an item

Once you're satisfied with your selection, you're ready to dissect it further into per-item properties. Since the selected area represents an apartment item, this step identifies which parts of that area hold which pieces of information, such as its price, location, etc. Now look at the list in the Capture tab, with the Item and Variable headings. Start by clicking the arrow on Item 1 (P):

Expand an item by choosing the arrow in front of it

Each of these lines is the text or the hyperlink of a sub-element within the selected area. Notice how some sites use text to separate information, as Craigslist does by splitting the price from other information with a space character. We'll address that in a bit. For now, we can operate simply on these sub-elements.

The Name drop-down menu at the top right of the Solvent workspace provides the tools for marking item properties.

First off, every item needs a URI. To assign this apartment item a URI, click the first expanded line, which should start with http://. Now click the Name drop-down menu:

Select item line and Name drop-down menu

For this line we choose Item's URI from the Name menu:

The variable column for this line now contains the text URI

Note the Variable column for that line now reads "URI." Select the next line and use this as the item's title by selecting Item's title from the Name menu.

The next line contains information about the general neighborhood in which an apartment is located. Every item needs a URI and a title, but not every item has a neighborhood; from now on, the properties are going to be specific to the information you're extracting from this page, i.e., apartments.

Since there is no predefined property available, define your own by choosing Custom property from the Name menu. A dialog asking you to enter the property URI appears; you may already have a property in mind, but if you don't, just invent one. For the purposes of this tutorial, we'll use http://simile.mit.edu/2007/03/ontologies/solvent-tutorial#area (copy and paste is appropriate).

Likewise, the final line contains information about how the apartment is being sold or let; we can use http://simile.mit.edu/2007/03/ontologies/solvent-tutorial#seller for this custom property.
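
For a preview of where these choices end up: when the scraper is generated (next section), each property you have marked becomes one data.addStatement call. The two custom properties will come out roughly like this, where areaText and sellerText stand in for the text Solvent extracts from the corresponding sub-elements (the generated code computes those values inline rather than through named variables):

data.addStatement(uri, 'http://simile.mit.edu/2007/03/ontologies/solvent-tutorial#area', areaText, true);
data.addStatement(uri, 'http://simile.mit.edu/2007/03/ontologies/solvent-tutorial#seller', sellerText, true);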

Generate the scraper

At this point, you could further refine your item's properties, but that requires editing code; see the Advanced section for more on that. Instead, let's see where this scraper stands so far. Click the Generate button and notice that the left side of the Solvent workspace is now filled with code. Don't touch it yet: click Run, above the code, and look at the Results tab. Congratulations, you've assembled your first working scraper!

Unfortunately, it's not entirely done. You'll notice a line like:

<http://[some city].craigslist.org/gbs/nfb/[some number].html> a <http://simile.mit.edu/ns#Unknown> ;

right near the beginning of the results. You should really have a type for your items; in this case, you'll probably want to give them an Apartment type. Scroll through your generated scraper code and find this line (about two pages down):

data.addStatement(uri, rdf + 'type', 'http://simile.mit.edu/ns#Unknown', false); // Use your own type here

Change 'http://simile.mit.edu/ns#Unknown' to 'http://simile.mit.edu/2007/03/ontologies/solvent-tutorial#Apartment'. Now click Run again - if it worked, you should see the type you just changed reflected in your Results tab.

Final script

And here's the script you should end up with:

var rdf = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#';
var dc = 'http://purl.org/dc/elements/1.1/';

var namespace = document.documentElement.namespaceURI;
var nsResolver = namespace ? function(prefix) {
  return (prefix == 'x') ? namespace : null;
} : null;

var getNode = function(document, contextNode, xpath, nsResolver) {
  return document.evaluate(xpath, contextNode, nsResolver, XPathResult.ANY_TYPE,null).iterateNext();
}

var cleanString = function(s) {
  return utilities.trimString(s);
}

var xpath = '/html/body[@class="toc"]/blockquote/p';
var elements = utilities.gatherElementsOnXPath(document, document, xpath, nsResolver);
for each (var element in elements) {
  // element.style.backgroundColor = 'red';
  
  try {
    var uri = cleanString(getNode(document, element, './A[1]', nsResolver).href);
  } catch (e) { log(e); }
  
  data.addStatement(uri, rdf + 'type', 'http://simile.mit.edu/2007/03/ontologies/solvent-tutorial#Apartment', false); // Use your own type here
  // log('Scraping URI ' + uri);
  
  try {
    data.addStatement(uri, dc + 'title', cleanString(getNode(document, element, './A[1]/text()[1]', nsResolver).nodeValue), true);
  } catch (e) { log(e); }
  
  try {
    data.addStatement(uri, 'http://simile.mit.edu/2007/03/ontologies/solvent-tutorial#area', cleanString(getNode(document, element, './FONT[1]/text()[1]', nsResolver).nodeValue), true);
  } catch (e) { log(e); }
  
  try {
    data.addStatement(uri, 'http://simile.mit.edu/2007/03/ontologies/solvent-tutorial#seller', cleanString(getNode(document, element, './I[1]/A[1]/text()[1]', nsResolver).nodeValue), true);
  } catch (e) { log(e); }
  
}

Website matching

Now that you've got a working scraper, you'll want to be able to install it in Piggy Bank and share it with the world. The next step is to identify the URLs to which your scraper applies. Click the URLs tab in the upper left. For this tutorial, we have been using Craigslist Boston's apartment listings. If you only want that first listings page, then, while viewing the listing page, click the Grab button in the upper right. This will take the current page's URL and escape it for use in Piggy Bank's matching algorithm, like so:

http://boston\.craigslist\.org/aap/

The matching is done with regular expressions. You may skip the next paragraph if you're already familiar enough with them to do your own URL matching; note that periods must be escaped if they're not to be wild card characters.

If you want to match the first page and all subsequent pages, you'll have to expand the pattern a bit. The 'wild card character' that matches any character is the period, '.' To match any series of characters, add an asterisk, which stands for '0 or more.' Both of these can be found in the Insert drop-down menu in the upper right. Add '.*' to the end of your current pattern and notice how the Match / Not Match indicator now also shows a match for the 'next 100' page (link found at the bottom of the first page). To capture any city's apartment listings, replace 'boston' in the pattern with the same '.*' string, and note that any city's listings you visit will also be matched.

http://.*\.craigslist\.org/aap/.*
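
Since the patterns are ordinary regular expressions, you can sanity-check them with a few lines of Javascript before relying on them; the sketch below is only an approximation of Piggy Bank's matching, and the URLs are made-up examples:

var pattern = new RegExp('http://.*\\.craigslist\\.org/aap/.*');
pattern.test('http://boston.craigslist.org/aap/');              // true: the first Boston listings page
pattern.test('http://sfbay.craigslist.org/aap/index100.html');  // true: another city, a "next 100" style page
pattern.test('http://boston.craigslist.org/cas/');              // false: a different section of the site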

You can add more than one URL pattern if your scraper works across sites.

Save and publish

You have a scraper and you've marked out which sites it can be used for. All that remains is to save the scraper and its metadata, and publish them both. Decide now where you want to publish both files, especially the Javascript file, as that URL is included in the scraper metadata; Piggy Bank can't find the code without it.

Click the disk icon in the upper left; you'll be presented with a dialog box with three tabs: one for the scraper, one for you, and one for saving files.

For the Scraper's Info tab, enter a name for your scraper and a URI. We've been inventing URIs for properties the scraper generates; it's time to invent another one. Make sure it's unique and, preferably, at a domain name you control. You will probably also want to add a description so future users will understand what your scraper can be expected to generate.

In the Author's Info tab, enter your name so you can get credit for your work.

In the last tab, Files and URLs, choose the code buffer your scraper is contained in, which should be 1 if you haven't added more buffers, and then choose where to save the Javascript and metadata files. Lastly, toggle the 'Publish code on the web at this URL' checkbox on, and enter the URL where you plan to publish the Javascript. Click OK and your files should be saved to your local disk.

Close Solvent. Transfer both files to their published destinations on the web. Visit your metadata file with your browser. Chances are you already know how to install a screen scraper; if not, read our brief explanation of the subject.

Advanced

There are some other tricks you can do with the Solvent interface to assist in code generation, though you will have to know Javascript to really use them; this section is for those who are confident in their Javascript skills.

Certain properties of the apartment are hidden away in plain text. To get at them, you'll have to do some extra processing of the text. For instance, the price of an apartment is contained within what we called the title of an apartment item. To extract it, first assign that line a custom property (dealing with price) - you should see both Title and your new custom property URI under the Variable heading now.

Now choose Process text further... from the Name menu. Under the Data tab, there are two text areas reflecting the input and output of the code in the third text area. The input field is the raw text of the line you selected. The output will change depending on the code you enter into the code section. This may help you keep track of your scraping operations by segmenting the text manipulation per-element instead of all in one large code buffer.

There are, of course, several ways to manipulate strings; here's one solution:

var words = input.split(" ");   // e.g. a title like "$1500 sunny two bedroom" becomes ["$1500", "sunny", ...]
output = words[0].substr(1);    // take the first word and drop the leading "$", leaving "1500"

Unfortunately, you can't use the raw text for one property and your code's result for another; both properties will be assigned the value your code produces. To fix this, find those properties in the generated code and adjust them by hand. For the 'title' property, you can remove the anonymous function entirely, and the title will revert to just the text once more.
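
As a rough sketch of what you are aiming for once those edits are done, the relevant part of the loop might end up like the following; the titleText variable and the #price property URI are invented for this example, and the code Solvent actually generates (with its anonymous function wrapper) will look somewhat different:

  try {
    var titleText = cleanString(getNode(document, element, './A[1]/text()[1]', nsResolver).nodeValue);
    // keep the plain text for the title...
    data.addStatement(uri, dc + 'title', titleText, true);
    // ...and derive the price from that same text for the custom property
    var words = titleText.split(" ");
    data.addStatement(uri, 'http://simile.mit.edu/2007/03/ontologies/solvent-tutorial#price', words[0].substr(1), true);
  } catch (e) { log(e); }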

When you're done, head back to Generate the scraper and run it again.