RE: Access to PiggyBank RDF/Triple Store

From: David Burden
Date: Thu, 8 Sep 2005 22:53:19 +0100

Long follow-up (no time to write a shorter one!)

We have a product which is a web-based virtual
screen-reader. It lets blind users have web sites read out to them without
needing either special software on the PC, or on the target web site.

The way it works is that our server grabs the target web page and
screen-scrapes the text on it into an array of paragraph size chunks. It
then does the same for the links, email addresses and forms. It then stores
this data in a new page in JavaScript arrays, adds the hooks to a web-based
Text To Speech (TTS) system (we currently use Sitepal),
plus our own keyboard navigation interface, and sends the new page to the
user. The user's browser then calls the TTS (which is implemented in Flash
so needs no specific download) and speaks the first piece of text. The user
then uses simple key presses to hear their way through the text, follow
links, send emails and fill forms.
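
As a rough illustration of the scraping step described above, here is a minimal
sketch in Python using only the standard library. It is not our production
code: the class name PageScraper, the functions scrape and to_js_arrays, and
the choice of which tags count as chunk boundaries are all assumptions made
for the example, and form handling is left out.

```python
import json
from html.parser import HTMLParser


class PageScraper(HTMLParser):
    """Collects paragraph-sized text chunks, links and email addresses."""

    # Assumed chunk boundaries; a real scraper would need a longer list.
    BLOCK_TAGS = {"p", "div", "li", "td", "br", "h1", "h2", "h3", "h4", "h5", "h6"}
    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks, self.links, self.emails = [], [], []
        self._buffer = []
        self._skip_depth = 0

    def _flush(self):
        # Collapse whitespace and store the buffered text as one chunk.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.chunks.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self._flush()
        elif tag == "a":
            href = dict(attrs).get("href") or ""
            if href.lower().startswith("mailto:"):
                self.emails.append(href[len("mailto:"):])
            elif href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if not self._skip_depth:
            self._buffer.append(data)


def scrape(html):
    """Return the chunks, links and emails found in an HTML string."""
    parser = PageScraper()
    parser.feed(html)
    parser.close()
    parser._flush()  # keep any trailing text that had no closing block tag
    return {"chunks": parser.chunks, "links": parser.links, "emails": parser.emails}


def to_js_arrays(page):
    """Render each list as a JavaScript array declaration."""
    return "\n".join(f"var {name} = {json.dumps(values)};" for name, values in page.items())
```

Calling to_js_arrays(scrape(html)) gives the block of array declarations that
would be embedded in the page sent to the user; the TTS hooks and keyboard
handling would then work from those arrays.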

The net result is almost the equivalent of a PC based screen-reader like
JAWS, but available from almost any web connected PC.

So far so dumb. Even "accessible" web pages can be a real pig to use. Most
web pages will have 30+ chunks of text and 50+ links. To tackle this we have
started to look at how we can do some smart analysis of the page to identify
things like where the main text starts, where the main navigation is, where
the sidebar stories are etc.
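
To give a flavour of the kind of analysis meant here, the sketch below guesses
where the main text starts by looking for the first chunk that is long enough
to be body copy rather than a menu item. The function name and the 25-word
threshold are assumptions for illustration only; a usable heuristic would
need to weigh link density, position and markup as well.

```python
def find_main_text_start(chunks, min_words=25):
    """Return the index of the first chunk with at least min_words words.

    Returns None when no chunk is long enough, in which case the caller
    would fall back to reading the page from the top.
    """
    for index, chunk in enumerate(chunks):
        if len(chunk.split()) >= min_words:
            return index
    return None
```

With a result in hand, the navigation interface could offer a "skip to main
text" key instead of making the user step through every menu entry first.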

Then along comes PiggyBank. If a PB screenscraper is generating an RDF
version of the page - hopefully with intelligent choices for the segments,
triples etc - then this is a far better starting point for our software.
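
A minimal sketch of why this helps, assuming the scraper output has already
been parsed into plain (subject, predicate, object) tuples. The predicate
names ex:title and ex:body are hypothetical placeholders, not a vocabulary
that Piggy Bank is known to use.

```python
from collections import defaultdict


def group_by_subject(triples):
    """Group (subject, predicate, object) tuples into one dict per subject."""
    items = defaultdict(dict)
    for subject, predicate, obj in triples:
        items[subject].setdefault(predicate, []).append(obj)
    return dict(items)


def speakable(items, title_pred="ex:title", body_pred="ex:body"):
    """Turn each item into one line of text suitable for passing to a TTS."""
    lines = []
    for subject, props in items.items():
        # Fall back to the subject identifier when an item has no title.
        title = props.get(title_pred, [subject])[0]
        body = " ".join(props.get(body_pred, []))
        lines.append(f"{title}. {body}".strip())
    return lines
```

Each item arrives already segmented and labelled, so the reader can announce
"story one of five" and read the title, rather than guessing at structure
from raw page text.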

Being server based people we'd ideally like to call the scrapers ourselves,
and get the RDF pumped straight to our server based parser. However other
alternatives could be:

- since our "client application" is JavaScript, find a way of running it in
a PB browser so that it can access the RDF and parse it there and then
- write a voice-enabled PB browser

The coup de main though is that we'd ultimately like to move away from the
screen-reader model to an AI/agent based one - I'm doing papers on this for
a VI conference in the UK in November, and a European IT journal and I can
see that PB will figure. Our chatbot/AI engine currently uses AIML (an AI
XML format), but we are about to move the
"facts" portion into RDF. Now if we can get web pages in RDF then it becomes
a lot easier for our AI to start reading the page (we can currently "read"
RSS feeds or screenscrape portions of pages, but it's not the same). That
information could then be used by the AI as part of its conversations with
sighted users. In the VI scenario, rather than the blind user having to step
laboriously through the site the AI could "read it", and then the blind user
can have a natural language conversation with the chatbot/AI about the site
in the same way that they would if they had a sighted person sat next to them.

OK, some of this may not be easy, but we think it's more or less doable now
as all the building blocks are falling into place.

So now you have the context, back to the original question - how do I get at
the raw RDF - or the scrapers?


-----Original Message-----
From: Stefano Mazzocchi
Sent: 08 September 2005 17:51
Subject: Re: Access to PiggyBank RDF/Triple Store

David Burden wrote:
> Is there a way of getting access to the RDF/triple store that piggybank
> produces in the browser?

Sure. What kind of access?

> We are working on solutions for visually
> impaired users, and having the real content of a web page extracted the
> way that Piggy Bank does could open the way to a far better solution for
> these users.

Very interesting.

Stefano Mazzocchi
Research Scientist                 Digital Libraries Research Group
Massachusetts Institute of Technology            location: E25-131C
77 Massachusetts Ave                   telephone: +1 (617) 253-1096
Cambridge, MA  02139-4307              email: stefanom at mit . edu
Received on Thu Sep 08 2005 - 21:49:01 EDT
