XQuery/Overview of Page Scraping Techniques

Motivation

You want a toolkit for pulling information out of web pages, even when those pages are not well-formed XML.

Method

XQuery is an ideal toolkit for manipulating well-formed HTML; you need only use the doc() function, e.g. doc('http://www.example.org/index.html') or doc('/db/path/to/index.html'). However, if a web page is not well-formed XML, doc() will fail with an error saying that the source is not well-formed.
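
As a minimal sketch of what querying a well-formed page can look like, the following lists the link targets found in a stored document. The path is the placeholder used above, the element names links and link are made up for illustration, and the *:a wildcard is an assumption that lets the query match whether or not the page declares the XHTML namespace.

```xquery
xquery version "1.0";
(: Sketch only: '/db/path/to/index.html' is the placeholder path from the text above :)
let $page := doc('/db/path/to/index.html')
return
    <links>{
        (: *:a matches <a> elements in any namespace, or in none :)
        for $a in $page//*:a[@href]
        return <link href="{$a/@href}">{normalize-space($a)}</link>
    }</links>
```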

Luckily, there are programs that transform HTML files into well-formed XML files.

eXist provides several such tools. One is the httpclient module's get function, httpclient:get(). To use it you must first enable the httpclient module by editing the conf.xml file, so that the module is loaded the next time you start eXist. Uncomment the following line:

   <module class="org.exist.xquery.modules.httpclient.HTTPClientModule"
      uri="http://exist-db.org/xquery/httpclient" />

For example, the following query performs an HTTP GET on the list of all feeds on the IBM web site:

let $feeds-url := 'http://www.ibm.com/ibm/syndication/us/en/?cm_re=footer-_-ibmfeeds-_-top_level'
let $data := httpclient:get(xs:anyURI($feeds-url), true(), <Headers/>)
return $data
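
Once the page has been fetched, the tidied result can be queried like any other XML. The sketch below pulls the link targets out of the response. It assumes that the response wrapper carries a statusCode attribute and that the tidied page sits somewhere beneath it; the feed-links and link element names are hypothetical, and the // step is used so the query does not depend on the exact wrapper layout.

```xquery
xquery version "1.0";
declare namespace httpclient="http://exist-db.org/xquery/httpclient";
let $feeds-url := 'http://www.ibm.com/ibm/syndication/us/en/?cm_re=footer-_-ibmfeeds-_-top_level'
let $data := httpclient:get(xs:anyURI($feeds-url), true(), <Headers/>)
return
    (: @statusCode is assumed; if it is absent the attribute is simply empty :)
    <feed-links status="{$data/@statusCode}">{
        (: search the whole response for anchors, in any namespace :)
        for $a in $data//*:a[@href]
        return <link>{string($a/@href)}</link>
    }</feed-links>
```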

Sometimes the HTML is so malformed that even httpclient:get() cannot salvage it. For example, if an element has two id attributes, you will get the error "Error XQDY0025: element has more than one attribute 'id'". In this case, you may need to download the HTML source and clean it up by hand, just enough that eXist can parse the rest. Then store the file in your database and use the util:parse-html() function, which passes the text through the Neko HTML parser to make it well-formed.

The following XQuery cleans up HTML that has been saved as a text file (because it is still malformed):

let $html-txt := util:binary-to-string(util:binary-doc('/db/html-file-saved-as-text.txt'))
let $data := util:parse-html($html-txt)
return $data
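
The parsed result can then be queried directly. As a hedged sketch, the query below collects the link targets from the cleaned-up document. Because the parser may change the case of element or attribute names, it compares lower-cased local names rather than assuming a particular case; this is a defensive assumption, not documented behaviour.

```xquery
let $html-txt := util:binary-to-string(util:binary-doc('/db/html-file-saved-as-text.txt'))
let $data := util:parse-html($html-txt)
(: match <a href="..."> regardless of the case or namespace the parser produced :)
for $href in $data//*[lower-case(local-name()) = 'a']/@*[lower-case(local-name()) = 'href']
return string($href)
```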

Testing your HTTP Client with a Simple Echo Script

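Once you have the results of an HTTP GET, you can wrap them in an element and return them, which lets you inspect exactly what the client received.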

Source code for echo.xq

xquery version "1.0";
declare namespace httpclient="http://exist-db.org/xquery/httpclient";
let $feeds-url := 'http://www.ibm.com/ibm/syndication/us/en/?cm_re=footer-_-ibmfeeds-_-top_level'
let $http-get-data := httpclient:get(xs:anyURI($feeds-url), true(), <Headers/>)
return
<echo-results>
   {$http-get-data}
</echo-results>