[FYI] MouseHole and Charon from Stefano Mazzocchi on 2005-09-06 (stdin)

From: Stefano Mazzocchi <stefanom_at_mit.edu>
Date: Tue, 06 Sep 2005 13:24:16 -0400

(kudos to Erik for pointing me to MouseHole)

http://tinyurl.com/cxpc9

A while ago we thought about how to make PiggyBank available to those
who don't have Firefox. We thought about decoupling the frontend (the
part that resides in the browser and does the scraping) from the backend
(the part that is in Java and serves, browses and stores the RDF).

MouseHole is very similar to what I implemented in Charon a few months ago

   http://simile.mit.edu/charon/

and didn't advertise on the front page because we hit a few walls (see
below) and decided to invest more in PiggyBank. Charon is a scraping
proxy written on Cocoon in a few hours of hacking... The problem with
scraping proxies/gateways (and MouseHole is another one) is that they
work on HTTP only: HTTPS considers them a 'man in the middle attack'.

This was the reason why we stopped working on that approach and moved
upstream into the browser, directly after the page DOM has been loaded,
that is, after the data has exited the crypto tunnel.

The reason for this is that DSpace@MIT is behind SSL. MIT does
everything through Kerberos and digital certificates, including deciding
whether or not you have enough rights to access the data contained in
DSpace (or whether MIT has enough rights to let you see that data, which
is similar but not the same). Allowing our users to rdfize data coming
from secure connections is critical for us: if we can't do that, it's too
easy for the other end to protect against scraping by simply switching
over to SSL.
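
To make the HTTPS limitation concrete, here is a minimal Python sketch of
what a forwarding proxy gets to see. It is only an illustration, not code
from Charon or MouseHole: the function name proxy_visibility and the host
example.org are made up for this example. The only thing it relies on is
standard HTTP behaviour: plain HTTP requests reach a proxy with the full
URL in the request line, while HTTPS requests arrive as a CONNECT to
host:port followed by encrypted bytes.

```python
def proxy_visibility(request_line: str) -> str:
    """Classify what a forwarding proxy can see for one request line.

    Hypothetical helper for illustration only.
    """
    parts = request_line.split()
    if len(parts) != 3:
        raise ValueError(f"malformed request line: {request_line!r}")
    method, _target, _version = parts
    if method.upper() == "CONNECT":
        # HTTPS through a proxy: the client asks for a raw tunnel to
        # host:port and then speaks TLS end to end. The proxy only relays
        # bytes it cannot read, so there is nothing to scrape.
        return "opaque tunnel"
    # Plain HTTP through a proxy: URL, headers and body pass through in
    # clear text, so the proxy can parse the HTML and scrape it.
    return "readable"


print(proxy_visibility("GET http://simile.mit.edu/charon/ HTTP/1.1"))  # readable
print(proxy_visibility("CONNECT example.org:443 HTTP/1.1"))            # opaque tunnel
```

Reading the content of the second kind of request would require the
proxy to terminate the TLS connection itself, which is exactly the 'man
in the middle' position mentioned above.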

Transparent proxy/gateway approaches have the advantage of being fully
and transparently cross-browser and of being able to keep running on
their own (this avoids the feeling of your browser being slower), but
they have a few major drawbacks:

   1) they don't and won't work on HTTPS

   2) they can easily become a bottleneck

   3) as with #2, they can easily be blocked on the other end

#2 and #3 are preventable if you install the proxy on every machine and
avoid having a centralized instance. #1 could be solved if every HTTPS
connection became an HTTP connection to the proxy, which then encrypts
from that point on... but that creates *all sorts* of issues, both with
the fact that secure pages are perceived as not secure anymore and with
digital certificate administration.

It would also become a honeypot for hackers to attack: phishing paradise
if you can get your code to execute inside that proxy.
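
As a rough sketch of the "HTTP to the proxy, encrypted from there on"
idea, the proxy would have to map each URL the browser asks for in clear
text to the HTTPS URL it fetches on the user's behalf. The function name
upstream_url and the example.org URL below are invented for illustration;
this is not how Charon or MouseHole work, just the smallest piece of the
scheme written down in Python.

```python
from urllib.parse import urlsplit, urlunsplit


def upstream_url(browser_url: str) -> str:
    """Map the plain-HTTP URL the browser requested to the HTTPS URL
    that the (hypothetical) local proxy would fetch instead."""
    parts = urlsplit(browser_url)
    if parts.scheme != "http":
        # Assumption of this sketch: the browser-facing leg is always
        # plain HTTP, otherwise the proxy could not read the page.
        raise ValueError("the browser-facing leg is assumed to be plain HTTP")
    # Only the scheme changes; host, path, query and fragment are kept.
    return urlunsplit(
        ("https", parts.netloc, parts.path, parts.query, parts.fragment)
    )


print(upstream_url("http://example.org/item?id=1"))
# https://example.org/item?id=1
```

Note what the browser sees in this arrangement: an http:// address and no
padlock, even though the remote leg is encrypted. The server's
certificate is checked (or not) by the proxy rather than by the browser,
which is where the perception and certificate administration problems
come from.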

Don't get me wrong, there are tons of cases where it makes perfect sense
to have such a proxy/gateway, but for what we need, it's easier to
convince somebody to install Firefox than to convince them to manage
their SSL certificates in another appliance... and this, pretty much,
is what drove us to the browser and away from the proxy.

-- 
Stefano Mazzocchi
Research Scientist                 Digital Libraries Research Group
Massachusetts Institute of Technology            location: E25-131C
77 Massachusetts Ave                   telephone: +1 (617) 253-1096
Cambridge, MA  02139-4307              email: stefanom at mit . edu
-------------------------------------------------------------------

Received on Tue Sep 06 2005 - 17:19:57 EDT
