Piggy Bank runs scrapers it grabs from arbitrary third parties. We need formal technical mechanisms, as well as social rituals and supporting tools, for controlling the risk this creates.

Formal technical mechanisms certainly mean finding a way to run the scrapers in a 'sandbox' of some kind that limits the damage they can do. For example, we could mimic what GreaseMonkey does, or we could roll our own solution (see below for why that is desirable).

Social mechanisms (and supporting tools) certainly mean ways for users to be aware of changing scrapers and to assess whether they accept the risk of downloading and updating them, including tools that let them clearly see the particulars of the changes.


Possible Approaches

  • Mimic GreaseMonkey
    • Pro: This is probably well tested, widely reviewed, and the code is at hand.
    • Con: Page authors can frustrate scraping using the techniques available to [frustrate GreaseMonkey]
    • Con: Possible that arbitrary pages could mimic a scraper
  • Create our own sandbox
    • Create an empty scope, add back only what we approve as safe, and run the third-party script there.
      • Pro: Sounds great
      • Con: We don't know how.
    • Remove the dangerous things from the scope
      • Pro: We know how; we even do this already to a limited extent.
      • Con: Enumerating the long tail of danger is impossible.
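The "empty scope plus whitelist" flavor can be sketched in plain JavaScript. This is only an illustrative sketch, not a real sandbox: it fakes an empty scope by shadowing a few dangerous globals with function parameters, and the names (runInEmptyScope, add) are invented for the example. It also makes the long-tail problem concrete: anything we forget to shadow leaks through.

```javascript
// Illustrative only: shadow dangerous globals, pass in an approved API.
function runInEmptyScope(scriptSource, approved) {
  // A real sandbox must enumerate all of these -- the "long tail" problem.
  var blocked = ["window", "document", "XMLHttpRequest", "Components"];
  var approvedNames = Object.keys(approved);
  var names = blocked.concat(approvedNames);
  var values = blocked.map(function () { return undefined; })
      .concat(approvedNames.map(function (k) { return approved[k]; }));
  // Build a function whose parameters shadow the blocked names,
  // then evaluate the third-party source inside it.
  var body = "'use strict'; return (" + scriptSource + ");";
  var fn = Function.apply(null, names.concat([body]));
  return fn.apply(null, values);
}

// The scraper sees only what we hand it:
var result = runInEmptyScope("add(2, 3)", {
  add: function (a, b) { return a + b; }
});
// result === 5; inside the script, document is shadowed to undefined
```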

Inspirations for the Sandbox

  var s = new C.u.Sandbox("");
  var res = C.u.evalInSandbox("var five = 5; 2 + five", s); // res == 7
  var outerFive = s.five;  // 5
  s.seven = res;
  var thirtyFive = C.u.evalInSandbox("five * seven", s);    // 35
  void evalInSandbox(in AString source /*, obj */);

Firefox itself uses this call from JavaScript only once, but that one use is good inspiration:

onStopRequest: function(request, ctxt, status) {
    if (!ProxySandBox) {
        ProxySandBox = new Sandbox();
    }
    // add predefined functions to pac
    var mypac = pacUtils + pac;
    ProxySandBox.myIpAddress = myIpAddress;
    ProxySandBox.dnsResolve = dnsResolve;
    ProxySandBox.alert = proxyAlert;
    // evaluate loaded js file
    evalInSandbox(mypac, ProxySandBox, pacURL);
    this.done = true;
}

Browsing around in other branches yields another instance of usage, which shows how scary things can get in a browser held together by JavaScript:

_isTrustedWindow: function(obj) {
  var s = new Components.utils.Sandbox("http://localhost.localdomain.:0/");
  /* Some notes:
   * 1. Doing an instanceof check outside of the sandbox is not safe because
   *    it would call the QueryInterface method of an untrusted object.
   * 2. Inside the sandbox (which does not have chrome privileges), the
   *    QueryInterface method of an untrusted object will never get called
   *    since it has a different origin.
   * 3. We cannot check whether the object is an instance of nsIDOMWindow
   *    because XPConnect wraps the window argument as an nsIDOMWindow
   *    due to the argument type (nsIDOMWindow, surprise surprise).
   */
  s.nsIInterfaceRequestor = Ci.nsIInterfaceRequestor;
  s.obj = obj;
  const IS_TRUSTED_CODE = "obj instanceof nsIInterfaceRequestor;";
  return Components.utils.evalInSandbox(IS_TRUSTED_CODE, s);
}
  • The JavaScript function watch could be used to invalidate a script if it tries to change anything in the objects we pass into the scraper's execution scope.
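watch is a Firefox-only extension, but the same tamper-detection idea can be sketched on any engine with an accessor property. The names here (guard, tampered, fetchPage) are invented for the example; the point is the shape: a write to a guarded property is rejected and recorded, so the host could invalidate the scraper.

```javascript
// Sketch: detect attempts to replace a property we handed to a scraper.
function guard(host, prop) {
  var value = host[prop];
  host.tampered = false;
  Object.defineProperty(host, prop, {
    get: function () { return value; },
    set: function () { host.tampered = true; }  // reject the write, record the attempt
  });
  return host;
}

var api = guard({ fetchPage: function () { return "<html/>"; } }, "fetchPage");
api.fetchPage = function () { return "evil"; };  // a scraper trying to swap the helper
// api.tampered === true, and api.fetchPage() still returns "<html/>"
```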


Collaborative Development: Possible Approaches

Strict collaborative schemes slow everything down, so the stronger the sandboxing scheme, the better.

Here are some principles/points; all are tentative at this point.

  • No script should be installed without prompting the end user.
  • No change should be installed without giving the user a chance to see what changed, though that might be just a transcript of diffs and commit messages.
  • Some scripts should be changed only by members of trusted groups.
  • Versioning and release points are good.
  • Sign-off schemes are nice.
  • Voting schemes are nice.
  • Voting, sign-off, and trust groups all raise the question of which group gets the franchise.
  • We need something simple sooner, rather than something complex.
  • 'Commit then review' is much, much better than 'review then commit'.
  • The wiki model confounds commit and release.
  • Code should be fetched at the same time as metadata to avoid version skew?

See Also