XQuery/Overview of Page Scraping Techniques

      Motivation

      You want a toolkit for pulling information out of web pages, even when those pages are not well-formed XML.

      Method

      XQuery is well suited to manipulating well-formed HTML: you only need the doc() function, e.g. doc('http://www.example.org/index.html') or doc('/db/path/to/index.html'). If a web page is not well-formed XML, however, doc() fails with an error reporting that the source is not well-formed.
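
      As a minimal sketch of the well-formed case, the query below lists the links in a stored page. It assumes a well-formed document really exists at the path used above and that its element names are lower case; the link and href output names are made up for illustration.

```xquery
xquery version "1.0";

(: Assumption: a well-formed (X)HTML document is stored at this path. :)
let $page := doc('/db/path/to/index.html')
(: *:a matches <a> whether or not the page declares the XHTML namespace. :)
for $a in $page//*:a[@href]
return
    (: <link> is an illustrative wrapper, not part of any eXist API. :)
    <link href="{$a/@href}">{normalize-space($a)}</link>
```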

      Luckily, there are programs that transform HTML files into well-formed XML files.

      eXist provides several such tools. One is the httpclient module's get function, httpclient:get(). To use this function you need to enable the httpclient module, by modifying the conf.xml file so that the module is loaded the next time you start eXist. Uncomment the following line:

         <module class="org.exist.xquery.modules.httpclient.HTTPClientModule"
            uri="http://exist-db.org/xquery/httpclient" />
      

      For example, the following query performs an HTTP GET on the list of all feeds from the IBM web site:

      let $feeds-url := 'http://www.ibm.com/ibm/syndication/us/en/?cm_re=footer-_-ibmfeeds-_-top_level'
      let $data := httpclient:get(xs:anyURI($feeds-url), true(), <Headers/>)
      return $data
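
      Once the page has been fetched, the result can be queried like any other XML. The sketch below rests on an assumption about the response layout: that httpclient:get() returns a wrapper element carrying the HTTP status in a statusCode attribute and the cleaned-up page inside an httpclient:body element. Inspect the actual output of the query above before relying on those names. The links, link and error elements are illustrative only.

```xquery
xquery version "1.0";
declare namespace httpclient="http://exist-db.org/xquery/httpclient";

let $feeds-url := 'http://www.ibm.com/ibm/syndication/us/en/?cm_re=footer-_-ibmfeeds-_-top_level'
let $data := httpclient:get(xs:anyURI($feeds-url), true(), <Headers/>)
(: Assumption: the status code is exposed as @statusCode on the wrapper element. :)
let $status := string($data/@statusCode)
return
    if ($status eq '200') then
        <links>{
            (: Match on the lower-cased local name so the query does not depend on
               the namespace or letter case the HTML parser chose for element names. :)
            for $a in $data//httpclient:body//*[lower-case(local-name()) eq 'a'][@href]
            return <link href="{$a/@href}">{normalize-space($a)}</link>
        }</links>
    else
        <error status="{$status}">Request did not return status 200</error>
```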
      

      Sometimes the HTML is so malformed that even httpclient:get() cannot salvage it. For example, if an element has two id attributes, you will get the error "Error XQDY0025: element has more than one attribute 'id'". In this case, you may need to download the HTML source and clean it up by hand just enough that eXist can parse the rest. Then store the file in your database and use the util:parse-html() function, which passes the text through the Neko HTML parser to make it well-formed.

      The following XQuery will clean up HTML (saved as a text file, because it is still malformed):

      let $html-txt := util:binary-to-string(util:binary-doc('/db/html-file-saved-as-text.txt'))
      let $data := util:parse-html($html-txt)
      return $data
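
      The parsed result can then be queried directly. A minimal sketch, assuming the text file from the query above exists and contains a title element and some links; the page, title and link-count output elements are made up for illustration.

```xquery
xquery version "1.0";
declare namespace util="http://exist-db.org/xquery/util";

let $html-txt := util:binary-to-string(util:binary-doc('/db/html-file-saved-as-text.txt'))
let $data := util:parse-html($html-txt)
return
    <page>
        {
        (: Compare lower-cased local names so the result does not depend on the
           namespace or letter case the HTML parser gives to element names. :)
        <title>{normalize-space(($data//*[lower-case(local-name()) eq 'title'])[1])}</title>,
        <link-count>{count($data//*[lower-case(local-name()) eq 'a'][@href])}</link-count>
        }
    </page>
```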
      

      Testing your HTTP Client with a Simple Echo Script

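      To check what the HTTP client actually returns, the following script wraps the result of httpclient:get() in an echo-results element so that you can inspect it.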

      Source code for echo.xq

      xquery version "1.0";
      declare namespace httpclient="http://exist-db.org/xquery/httpclient";
      let $feeds-url := 'http://www.ibm.com/ibm/syndication/us/en/?cm_re=footer-_-ibmfeeds-_-top_level'
      let $http-get-data := httpclient:get(xs:anyURI($feeds-url), true(), <Headers/>)
      return
      <echo-results>
         {$http-get-data}
      </echo-results>
      
      Last modified on 29 November 2009, at 03:48