Referee is a command line application that processes web server log files and finds pages on the web that talk about your pages. It does this by mining the referrer entries in the logs, fetching the referring pages, and extracting useful content from them (dealing with all the encoding and HTML problems!) and from their associated news feeds (to capture more metadata). This metadata is then exported as RDF/N3 files on disk.

What do I need to use it?

First, make sure that your web server uses the NCSA "combined" log format. Referee will not work with a different format (the "common" format, for instance, does not record referrers at all). For Apache HTTPD, this means that your httpd.conf (or equivalent) file needs to contain something like:

CustomLog logs/access_log combined

and not something like

CustomLog logs/access_log common 

For more info, refer to the Apache HTTPD documentation pages at http://httpd.apache.org/docs/ or to the documentation of your web server to find out whether it supports this log file format.
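To check quickly whether a log line is in the "combined" format, a small script along these lines may help. This is only a sketch, not Referee's own parser: the regular expression, the function name parse_combined and the sample lines are made up for illustration, and the expression ignores corner cases such as escaped quotes inside a field.

```python
import re

# NCSA "combined" layout:
# host ident user [time] "request" status bytes "referrer" "user-agent"
COMBINED = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_combined(line):
    """Return a dict of fields, or None if the line is not in combined format."""
    match = COMBINED.match(line)
    return match.groupdict() if match else None

# Made-up sample lines: the first is "combined", the second is "common".
combined_line = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
                 '"GET /index.html HTTP/1.0" 200 2326 '
                 '"http://www.example.com/start.html" '
                 '"Mozilla/4.08 [en] (Win98; I ;Nav)"')
common_line = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
               '"GET /index.html HTTP/1.0" 200 2326')

print(parse_combined(combined_line)["referrer"])  # http://www.example.com/start.html
print(parse_combined(common_line))                # None
```

If the second kind of result shows up for your own log lines, the server is most likely still writing the "common" format.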

Once you have "combined" web server logs, Referee requires three things to run:

  1. a Java Virtual Machine (version 1.4 or greater). Type 'java -version' at your shell prompt to see which version you have. If you don't have one, go to http://www.java.com and download it.
  2. Apache Maven (version 2.0 or greater). Type 'mvn -version' at your shell prompt to see which version you have. If Maven is not installed, go to http://maven.apache.org/ and download it. Don't panic, the installation is really fast and simple.
  3. a network connection (Maven downloads the required libraries when you build the software).

Once you're set (and you have the maven command mvn in your path), go to your command shell and type:

 mvn package

This will download the required libraries, compile, test, and package the software, and copy the required dependencies to the ./target directory.

Now you are ready to launch it, and you can do it by typing

  • (unix) ./referee
  • (win32) referee

at the command line and following the instructions the command gives you.

How does it work?

Referee works like this:

process log files that match the regexp pattern recursively
 for each new log event
   parse event
   if it has a referrer URL
     if requested URL matches the white list
       if referrer is remote and is not blocked by the black list
         if referrer was never processed before
           fetch URL
           if URL exists
             grep for the requested URL in the referring page body
             if requested URL is found
               try to parse the content as HTML
                 if successful
                   extract metadata from the page
                   extract a textual context from around the link that referred
                   if the page contains RSS feeds
                      obtain additional metadata from those feeds
                   store the resulting data in a triple store
                   every given number of log events, dump the information to a file
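The filtering part of the steps above (everything that happens before a referrer is fetched) can be sketched as follows. This is not Referee's actual code: the function name should_fetch, its parameters, the sample patterns and the choice of full-string regexp matching are all assumptions made for illustration.

```python
import re
from urllib.parse import urlsplit

def should_fetch(requested, referrer, site, whitelist, blacklist, seen):
    """Decide whether a referrer URL is worth fetching."""
    if not referrer or referrer == "-":        # the event has no referrer
        return False
    if not whitelist.fullmatch(requested):     # requested URL is not white-listed
        return False
    if urlsplit(referrer).netloc == urlsplit(site).netloc:
        return False                           # referrer is local, not remote
    if any(rx.fullmatch(referrer) for rx in blacklist):
        return False                           # referrer is black-listed
    if referrer in seen:                       # referrer was already processed
        return False
    seen.add(referrer)
    return True

# Hypothetical configuration.
site = "http://www.blah.com/"
whitelist = re.compile(r"/my\.home/.*")
blacklist = [re.compile(r"https?://[^/]*search\.example/.*")]
seen = set()

first = should_fetch("/my.home/index.html", "http://blog.example.org/post",
                     site, whitelist, blacklist, seen)
again = should_fetch("/my.home/index.html", "http://blog.example.org/post",
                     site, whitelist, blacklist, seen)
local = should_fetch("/my.home/index.html", "http://www.blah.com/other.html",
                     site, whitelist, blacklist, seen)
blocked = should_fetch("/my.home/index.html", "http://search.example/q?x=1",
                       site, whitelist, blacklist, seen)
print(first, again, local, blocked)  # True False False False
```

Putting the cheap checks first means that the expensive step, fetching a remote page, only happens for referrers that pass every filter.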

What do I do with that RDF data it generated?

You can use RDF browsing tools such as Longwell, Welkin or Piggy Bank to browse, visualize and explore the information.

You can also load the data into your favorite triple store and run SPARQL queries on it (for example to integrate with your blogs) or create an Atom feed of the latest backlink comments on your own web site.

It was a conscious decision to write Referee as an RDFizer, focusing on the production aspects and leaving the data consumption aspect as a separate concern and using RDF as the interface between the two.

Why RDF? Can't you just use XML or Atom?

RDF is a data model, XML is both a syntax and a data model, and Atom is a specific type of XML data model. As far as syntax goes, I don't really care (you can, in fact, serialize RDF in XML), but the data model is important.

XML's data model is a tree while RDF's data model is a graph. All trees are graphs but not all graphs are trees, therefore RDF's data model is more general than XML's. This is rather academic if you don't need the extra power that a graph data model gives you (and in many, many situations you don't, which is why people think RDF is unnecessarily complicated).

But Referee does need that extra power because it divides 'Pages' from 'Comments'. A 'comment' is a piece of text that surrounds the hypertextual link that brought somebody to your pages, while a 'page' is the HTML page that contained that 'comment'.

Referee is unique in treating those separately because it allows you to know that different pages contain the same exact comment, which is more and more common since content is aggregated in many different ways.

So, Referee treats the content that refers to your page (the 'comment') and its containers (the 'pages') as different items, each with its own globally unique identifier. But this means that a page can have more than one comment and a comment can reside in more than one page. There is no way to model that data structure with a tree.
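A tiny sketch makes the many-to-many relationship concrete. The identifiers and the containsComment predicate below are made up for illustration and are not the vocabulary Referee actually emits; the point is only that a flat set of statements can hold a comment under two pages without duplicating it.

```python
# Each statement is a (subject, predicate, object) triple.
triples = {
    ("page:A", "containsComment", "comment:1"),
    ("page:A", "containsComment", "comment:2"),
    ("page:B", "containsComment", "comment:1"),  # same comment, aggregated elsewhere
}

def comments_of(page):
    """All comments found in the given page."""
    return sorted(o for s, p, o in triples
                  if s == page and p == "containsComment")

def pages_of(comment):
    """All pages that contain the given comment."""
    return sorted(s for s, p, o in triples
                  if o == comment and p == "containsComment")

print(comments_of("page:A"))  # ['comment:1', 'comment:2']
print(pages_of("comment:1"))  # ['page:A', 'page:B']
```

In a tree, comment:1 would need a single parent, so it would have to be copied under both pages and the fact that the two copies are the same comment would be lost.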

Of course, you could add extra semantics to the XML data model and achieve that, but you would be reinventing the RDF wheel in doing so (and losing all the tools out there that are already RDF aware).

This is the main reason for RDF, but there are a lot of other benefits as well. One of them is that the data is composed of very atomic statements, which means that the data Referee generates from different logs of different virtual hosts can easily be merged without fear of collisions, because every Page, Comment or Feed has a globally unique identifier.
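As a rough illustration of why merging is easy, the statements produced from two virtual hosts can be treated as two sets and simply united. The triples and predicate names below are hypothetical, and a real triple store does more than this, but the principle is the same: identical statements collapse and nothing else needs reconciling.

```python
def merge(*graphs):
    """Union any number of statement sets into one."""
    merged = set()
    for graph in graphs:
        merged |= graph
    return merged

# Made-up output for two virtual hosts that were linked from the same comment.
host_a = {
    ("page:A", "containsComment", "comment:1"),
    ("comment:1", "text", "Nice write-up"),
}
host_b = {
    ("page:B", "containsComment", "comment:1"),
    ("comment:1", "text", "Nice write-up"),
}

merged = merge(host_a, host_b)
print(len(merged))  # 3
```

The statement about the text of comment:1 appears in both inputs but only once in the result, because both hosts use the same identifier for it.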

I'm also aware that there are many more programs that know how to digest XML or Atom than programs that are capable of digesting RDF. But RDF is simply a data model: it is entirely possible (actually, desirable) to use the data that Referee produces to generate, for example, an Atom feed of the new comments on your web site, giving you your very private and very precise ego feed.

In the end, it was a carefully planned design decision to separate the content production from the content consumption and use RDF as the interface between the two concern islands. And I suspect that the more programs are capable of digesting RDF, the more common this design pattern will become, as the statement-oriented nature of RDF is a very flexible and natural data model to work with.

Can you give me examples of uses?

Sure! Suppose you are running a web site called http://www.blah.com/ and your logs are written on disk to a single file located at /var/logs/apache/www.blah.com.access.log. To process this, simply type:

./referee /var/logs/apache/www.blah.com.access.log http://www.blah.com/ results

and Referee will write its output to the results folder.

What if I want to know what's going on?

You can use the -v command line argument to change the verbosity. info is probably what you're looking for to get an idea of what's going on. debug is very verbose.

What if my logs are split on multiple files?

You can pass either a directory or a file as the first argument to Referee. If it's a directory, Referee will consider every file contained in it, recursively, to be a log and will attempt to read log events from it.

If you want to process only a subset of those files, you can pass a regular expression to Referee using the -p or --pattern command line argument, followed by the regular expression that matches the files you want Referee to process. For example, if your log files are named like "yyyymmdd-blah-access.log" and you want to skip the other *.log files, do:

./referee -p '.*-blah-access.log' /var/logs/apache/ http://www.blah.com/ results

and referee will process only those.
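To preview which files such a pattern would select before starting a long run, something like the following sketch can be used. It is not how Referee itself is implemented: the function name find_logs is made up, and it assumes that the pattern has to match the whole file name (rather than the full path), which may differ from what Referee does.

```python
import os
import re
import tempfile

def find_logs(root, pattern=".*"):
    """Walk root recursively and return the files whose name matches pattern."""
    rx = re.compile(pattern)
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if rx.fullmatch(name):  # assumption: match against the whole file name
                found.append(os.path.join(dirpath, name))
    return sorted(found)

# Build a throwaway directory tree to try the pattern on.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "2006"))
    for rel in ("20060101-blah-access.log",
                os.path.join("2006", "20060102-blah-access.log"),
                "error.log"):
        open(os.path.join(root, rel), "w").close()
    selected = [os.path.relpath(path, root)
                for path in find_logs(root, r".*-blah-access.log")]

for path in selected:
    print(path)
```

Both access logs are listed, including the one in the subdirectory, while error.log is left out.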

What if I care about only a subset of the URL space?

With the -u or -urls command line argument, you can tell Referee to consider only requested URLs that match the given regular expression. So, for example, if you care only about your own home page in a virtual host that you share with others, do

./referee -u '/my.home/.*' /var/logs/apache/ http://www.blah.com/ results
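One thing to keep in mind when writing these expressions is that an unescaped dot matches any character. The sketch below compares the pattern from the command above with an escaped variant. It assumes that the expression is matched against the whole request path, which may not be exactly how Referee applies it, and the sample paths are made up.

```python
import re

loose = re.compile(r"/my.home/.*")    # the dot matches any character
strict = re.compile(r"/my\.home/.*")  # the dot matches only a literal dot

paths = ["/my.home/index.html", "/myXhome/index.html", "/other/my.home/"]
results = {path: (bool(loose.fullmatch(path)), bool(strict.fullmatch(path)))
           for path in paths}

for path, (is_loose, is_strict) in results.items():
    print(path, is_loose, is_strict)
```

In practice the loose form is usually harmless, but escaping the dot avoids surprises when similarly named paths exist on the same host.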

What if I want to filter some referrers to speed things up?

By default, Referee reads the blacklist.txt file and avoids processing all referrer URLs that match any of the regexps found in the black list. We ship with a bunch of regexps that filter out search engines and known spammers, but feel free to change the list to fit your environment and your taste by adding or removing any regexp you want. If you want to suggest other regexps, add them here and we'll incorporate them.
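Before adding a regexp to the black list, it can be handy to test it against a few referrers. The sketch below assumes a layout of one regexp per line with blank lines ignored, and full-string matching; both are assumptions, so check the shipped blacklist.txt for the real conventions. The patterns and URLs are made up and are not the ones Referee ships with.

```python
import re

def load_blacklist(text):
    """Compile one regexp per non-empty line (an assumed file layout)."""
    return [re.compile(line.strip())
            for line in text.splitlines() if line.strip()]

def is_blocked(referrer, blacklist):
    """True if the referrer matches any of the black list regexps."""
    return any(rx.fullmatch(referrer) for rx in blacklist)

# Hypothetical black list contents.
sample = r"""
https?://([^/]+\.)?search\.example/.*
https?://[^/]*spam[^/]*/.*
"""

blacklist = load_blacklist(sample)
print(is_blocked("http://www.search.example/?q=blah", blacklist))    # True
print(is_blocked("http://cheap-spam-pills.example/buy", blacklist))  # True
print(is_blocked("http://blog.example.org/post/1", blacklist))       # False
```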

I can't find what I'm looking for, what do I do?

You might want to read the list of frequently asked questions or add your own question there so that we can answer it.

