Re: XSLT scraper help?

From: David Huynh <dfhuynh_at_csail.mit.edu>
Date: Wed, 26 Apr 2006 16:08:07 -0400

Hi Jon,

Piggy Bank uses whatever XML parser comes with the JDK. The Java
code that parses the XSLT looks like this:

    javax.xml.parsers.DocumentBuilderFactory factory =
        javax.xml.parsers.DocumentBuilderFactory.newInstance();
    javax.xml.parsers.DocumentBuilder builder =
        factory.newDocumentBuilder();

    xsltDoc = builder.parse(xsltIS);

I guess that ultimately resolves to the Apache code bundled with the
JDK (your stack trace shows com.sun.org.apache.xalan.internal classes).
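
To check which parser implementation the JDK actually hands back, a small
standalone sketch along the lines of the snippet above may help. This is not
Piggy Bank's code: the class name XsltParseSketch, the inline stylesheet, and
the setNamespaceAware(true) call are assumptions added for illustration (the
original snippet does not set namespace awareness).

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;

public class XsltParseSketch {
    // Hypothetical stand-in for a scraper stylesheet.
    static final String XSLT =
        "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:template match=\"/\"/>"
        + "</xsl:stylesheet>";

    // Parses a stylesheet into a DOM using the JDK's default parser.
    static Document parse(InputStream xsltIS) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Assumption: namespace awareness is switched on here because XSLT
        // elements are identified by the xsl: namespace.
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        return builder.parse(xsltIS);
    }

    public static void main(String[] args) throws Exception {
        InputStream xsltIS =
            new ByteArrayInputStream(XSLT.getBytes(StandardCharsets.UTF_8));
        Document xsltDoc = parse(xsltIS);

        // Shows which concrete factory class the JDK resolved to.
        System.out.println("Factory: "
            + DocumentBuilderFactory.newInstance().getClass().getName());
        System.out.println("Root element: "
            + xsltDoc.getDocumentElement().getLocalName());
    }
}
```

The printed factory class name depends on the JDK and on anything else on the
classpath, so it is a quick way to see what "whatever comes with the JDK"
means on a given machine.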

I have seen something similar to, if not the same as, the error you
observed, but I couldn't fix it the last time I tried.
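
For reference, the innermost trace in the console output ends in
java.io.PipedOutputStream.write, which throws this IOException whenever the
stream has not been connected to a PipedInputStream. The sketch below only
reproduces the message outside Piggy Bank; the class and method names are
made up for illustration, and it does not show what goes wrong inside
TemporaryProfile.

```java
import java.io.IOException;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;

public class PipeSketch {
    // Writes one byte to the pipe; returns the exception message on failure,
    // or null if the write succeeded.
    static String tryWrite(PipedOutputStream out) {
        try {
            out.write('x');
            return null;
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) throws IOException {
        // No PipedInputStream attached: the write fails.
        PipedOutputStream unconnected = new PipedOutputStream();
        System.out.println(tryWrite(unconnected)); // prints "Pipe not connected"

        // Attaching a PipedInputStream first lets the write go through.
        PipedOutputStream connected = new PipedOutputStream();
        PipedInputStream in = new PipedInputStream(connected);
        System.out.println(tryWrite(connected));   // prints "null"
        System.out.println((char) in.read());      // prints "x"
    }
}
```

So one thing worth checking in any code that serializes a transform result
into a pipe is whether the reading end exists, and is still open, at the
moment the serializer flushes.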

I'm sorry you're having difficulties writing Javascript scrapers. I
personally find Javascript much easier, more flexible, and more powerful
than XSLT for writing scrapers. That's precisely why, although we started
out supporting XSLT scrapers, we switched over to Javascript scrapers. If
there's anything puzzling about writing Javascript scrapers, I'd be glad
to explain and help out. As for XSLT, I'm afraid I won't be of much
help.

I suppose you must have already looked through this documentation, but
just in case:
    http://simile.mit.edu/piggy-bank/screen-scrapers-howto.html

David

Jon Crump wrote:
> Dear all,
>
> I finally got the absolute minimum javascript screen scraper to work.
> I'm finding the code for more complex scrapers pretty daunting, and in
> the absence of a more extensive tutorial for Solvent, I thought I'd
> try an xslt scraper since I'm a good deal more familiar with that
> language. To have a look at an example, I downloaded David's csail
> directory scraper and activated it. When I try to scrape the csail
> directory, I get a great string of errors, all
>
> Caused by: java.io.IOException: Pipe not connected
>
> I'm accustomed to using saxon8 at the command line, or within Oxygen.
> I gather that PB wants to use something else and I need a pipe
> connected to it. Have I surmised the problem accurately? Can anyone
> tell me what I have to fiddle with to get this to work? Is this an
> Apache thing?
>
> Jon
>
> Java console output follows:
>
> 11:30:22.033 [...orpus.Corpus] Warning: Internal error: java.io.FileNotFoundException: /Users/jjc/Library/Application Support/Firefox/Profiles/n7f6chgo.PBtesting/piggy-bank/temporary-sources/model1145987983563/database/namespaces.dat (No such file or directory) on null at [-1,-1] (936831ms)
> 11:30:22.062 [...oraryProfile] java.io.IOException: java.io.FileNotFoundException: /Users/jjc/Library/Application Support/Firefox/Profiles/n7f6chgo.PBtesting/piggy-bank/temporary-sources/model1145987983563/database/namespaces.dat (No such file or directory) (29ms)
> network: Connecting http://www.csail.mit.edu/directory/directory.php with proxy=DIRECT
> network: Connecting http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd with proxy=DIRECT
> network: Connecting http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent with proxy=DIRECT
> network: Connecting http://www.w3.org/TR/xhtml1/DTD/xhtml-symbol.ent with proxy=DIRECT
> network: Connecting http://www.w3.org/TR/xhtml1/DTD/xhtml-special.ent with proxy=DIRECT
> ERROR: 'Pipe not connected'
>
> 11:30:52.154 [...oraryProfile] javax.xml.transform.TransformerException: java.io.IOException: Pipe not connected (30092ms)
> javax.xml.transform.TransformerException: java.io.IOException: Pipe not connected
>     at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:650)
>     at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:279)
>     at edu.mit.simile.piggyBank.TemporaryProfile.scrapeWithXSLT(TemporaryProfile.java:690)
>     at edu.mit.simile.piggyBank.TemporaryProfile.scrapeWithGRDDL(TemporaryProfile.java:617)
>     at edu.mit.simile.piggyBank.TemporaryProfile.scrape(TemporaryProfile.java:519)
>     at edu.mit.simile.piggyBank.TemporaryProfile.load(TemporaryProfile.java:363)
>     at edu.mit.simile.piggyBank.TemporaryProfile$LoadingThread.run(TemporaryProfile.java:143)
> Caused by: java.io.IOException: Pipe not connected
>     at com.sun.org.apache.xml.internal.serializer.ToStream.endElement(ToStream.java:2011)
>     at com.sun.org.apache.xml.internal.serializer.ToXMLStream.endElement(ToXMLStream.java:468)
>     at com.sun.org.apache.xml.internal.serializer.ToUnknownStream.endElement(ToUnknownStream.java:331)
>     at GregorSamsa.http$colon$$slash$$slash$www$dot$w3$dot$org$slash$1999$slash$xhtml$colon$template$dot$1()
>     at GregorSamsa.applyTemplates()
>     at GregorSamsa.http$colon$$slash$$slash$www$dot$w3$dot$org$slash$1999$slash$xhtml$colon$template$dot$3()
>     at GregorSamsa.applyTemplates()
>     at GregorSamsa.http$colon$$slash$$slash$www$dot$w3$dot$org$slash$1999$slash$xhtml$colon$template$dot$3()
>     at GregorSamsa.applyTemplates()
>     at GregorSamsa.http$colon$$slash$$slash$www$dot$w3$dot$org$slash$1999$slash$xhtml$colon$template$dot$3()
>     at GregorSamsa.applyTemplates()
>     at GregorSamsa.http$colon$$slash$$slash$www$dot$w3$dot$org$slash$1999$slash$xhtml$colon$template$dot$3()
>     at GregorSamsa.applyTemplates()
>     at GregorSamsa.http$colon$$slash$$slash$www$dot$w3$dot$org$slash$1999$slash$xhtml$colon$template$dot$3()
>     at GregorSamsa.applyTemplates()
>     at GregorSamsa.http$colon$$slash$$slash$www$dot$w3$dot$org$slash$1999$slash$xhtml$colon$template$dot$3()
>     at GregorSamsa.applyTemplates()
>     at GregorSamsa.http$colon$$slash$$slash$www$dot$w3$dot$org$slash$1999$slash$xhtml$colon$template$dot$0()
>     at GregorSamsa.applyTemplates()
>     at GregorSamsa.transform()
>     at com.sun.org.apache.xalan.internal.xsltc.runtime.AbstractTranslet.transform(AbstractTranslet.java:594)
>     at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:640)
>     ... 6 more
> ---------
> java.io.IOException: Pipe not connected
>     at com.sun.org.apache.xml.internal.serializer.ToStream.endElement(ToStream.java:2011)
>     at com.sun.org.apache.xml.internal.serializer.ToXMLStream.endElement(ToXMLStream.java:468)
>     at com.sun.org.apache.xml.internal.serializer.ToUnknownStream.endElement(ToUnknownStream.java:331)
>     at GregorSamsa.http$colon$$slash$$slash$www$dot$w3$dot$org$slash$1999$slash$xhtml$colon$template$dot$1()
>     at GregorSamsa.applyTemplates()
>     at GregorSamsa.http$colon$$slash$$slash$www$dot$w3$dot$org$slash$1999$slash$xhtml$colon$template$dot$3()
>     at GregorSamsa.applyTemplates()
>     at GregorSamsa.http$colon$$slash$$slash$www$dot$w3$dot$org$slash$1999$slash$xhtml$colon$template$dot$3()
>     at GregorSamsa.applyTemplates()
>     at GregorSamsa.http$colon$$slash$$slash$www$dot$w3$dot$org$slash$1999$slash$xhtml$colon$template$dot$3()
>     at GregorSamsa.applyTemplates()
>     at GregorSamsa.http$colon$$slash$$slash$www$dot$w3$dot$org$slash$1999$slash$xhtml$colon$template$dot$3()
>     at GregorSamsa.applyTemplates()
>     at GregorSamsa.http$colon$$slash$$slash$www$dot$w3$dot$org$slash$1999$slash$xhtml$colon$template$dot$3()
>     at GregorSamsa.applyTemplates()
>     at GregorSamsa.http$colon$$slash$$slash$www$dot$w3$dot$org$slash$1999$slash$xhtml$colon$template$dot$3()
>     at GregorSamsa.applyTemplates()
>     at GregorSamsa.http$colon$$slash$$slash$www$dot$w3$dot$org$slash$1999$slash$xhtml$colon$template$dot$0()
>     at GregorSamsa.applyTemplates()
>     at GregorSamsa.transform()
>     at com.sun.org.apache.xalan.internal.xsltc.runtime.AbstractTranslet.transform(AbstractTranslet.java:594)
>     at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:640)
>     at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:279)
>     at edu.mit.simile.piggyBank.TemporaryProfile.scrapeWithXSLT(TemporaryProfile.java:690)
>     at edu.mit.simile.piggyBank.TemporaryProfile.scrapeWithGRDDL(TemporaryProfile.java:617)
>     at edu.mit.simile.piggyBank.TemporaryProfile.scrape(TemporaryProfile.java:519)
>     at edu.mit.simile.piggyBank.TemporaryProfile.load(TemporaryProfile.java:363)
>     at edu.mit.simile.piggyBank.TemporaryProfile$LoadingThread.run(TemporaryProfile.java:143)
>
> ---------
> java.io.IOException: Pipe not connected
>     at java.io.PipedOutputStream.write(PipedOutputStream.java:120)
>     at com.sun.org.apache.xml.internal.serializer.WriterToUTF8Buffered.flushBuffer(WriterToUTF8Buffered.java:382)
>     at com.sun.org.apache.xml.internal.serializer.WriterToUTF8Buffered.write(WriterToUTF8Buffered.java:309)
>     at com.sun.org.apache.xml.internal.serializer.ToStream.endElement(ToStream.java:2005)
>     at com.sun.org.apache.xml.internal.serializer.ToXMLStream.endElement(ToXMLStream.java:468)
>     at com.sun.org.apache.xml.internal.serializer.ToUnknownStream.endElement(ToUnknownStream.java:331)
>     at GregorSamsa.http$colon$$slash$$slash$www$dot$w3$dot$org$slash$1999$slash$xhtml$colon$template$dot$1()
>     at GregorSamsa.applyTemplates()
>     at GregorSamsa.http$colon$$slash$$slash$www$dot$w3$dot$org$slash$1999$slash$xhtml$colon$template$dot$3()
>     at GregorSamsa.applyTemplates()
>     at GregorSamsa.http$colon$$slash$$slash$www$dot$w3$dot$org$slash$1999$slash$xhtml$colon$template$dot$3()
>     at GregorSamsa.applyTemplates()
>     at GregorSamsa.http$colon$$slash$$slash$www$dot$w3$dot$org$slash$1999$slash$xhtml$colon$template$dot$3()
>     at GregorSamsa.applyTemplates()
>     at GregorSamsa.http$colon$$slash$$slash$www$dot$w3$dot$org$slash$1999$slash$xhtml$colon$template$dot$3()
>     at GregorSamsa.applyTemplates()
>     at GregorSamsa.http$colon$$slash$$slash$www$dot$w3$dot$org$slash$1999$slash$xhtml$colon$template$dot$3()
>     at GregorSamsa.applyTemplates()
>     at GregorSamsa.http$colon$$slash$$slash$www$dot$w3$dot$org$slash$1999$slash$xhtml$colon$template$dot$3()
>     at GregorSamsa.applyTemplates()
>     at GregorSamsa.http$colon$$slash$$slash$www$dot$w3$dot$org$slash$1999$slash$xhtml$colon$template$dot$0()
>     at GregorSamsa.applyTemplates()
>     at GregorSamsa.transform()
>     at com.sun.org.apache.xalan.internal.xsltc.runtime.AbstractTranslet.transform(AbstractTranslet.java:594)
>     at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:640)
>     at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:279)
>     at edu.mit.simile.piggyBank.TemporaryProfile.scrapeWithXSLT(TemporaryProfile.java:690)
>     at edu.mit.simile.piggyBank.TemporaryProfile.scrapeWithGRDDL(TemporaryProfile.java:617)
>     at edu.mit.simile.piggyBank.TemporaryProfile.scrape(TemporaryProfile.java:519)
>     at edu.mit.simile.piggyBank.TemporaryProfile.load(TemporaryProfile.java:363)
>     at edu.mit.simile.piggyBank.TemporaryProfile$LoadingThread.run(TemporaryProfile.java:143)
>
Received on Wed Apr 26 2006 - 20:06:42 EDT