XQuery/Link gathering

Motivation

You want to gather the links on a blog page.

Method

We use the doc() function to perform an HTTP GET on a remote web page. If the page is a well-formed XML document, you can then select parts of it, such as an unordered list, by appending a path expression (for example //q:ul with a predicate) to the doc() call.

This script fetches the blog page and selects the URLs in the link section, which reference other blog articles. Each referenced article is then fetched and the URLs marked as external are selected. The result is returned as XML.

declare namespace q = "http://www.w3.org/1999/xhtml";

(: the navigation list on the start page holds links to the other articles :)
let $nav := doc("http://www.advocatehope.org/tech-tidbits/theory-of-the-web-as-one-big-database")//q:ul[@class="portletNavigationTree navTreeLevel0"]
for $href in $nav//@href
let $page := data($href)
let $content := doc($page)//q:div[@id="content"]
(: one link element per external link found in each article :)
for $link in $content//q:a[@title="external-link"]
return
   <link>{ data($link/@href) }</link>


Version 2

Dropping the intermediate variables allows the structure to be seen more clearly:

let $uri := "http://www.advocatehope.org/tech-tidbits/theory-of-the-web-as-one-big-database"
for $page in doc($uri)//q:ul[@class="portletNavigationTree navTreeLevel0"]//@href
for $link in doc($page)//q:div[@id="content"]//q:a[@title="external-link"]/@href
return
   <link>{ data($link) }</link>


Repository Schemas

Daniel Bennett proposes a standard to support extracting data such as this from a site. A schema of this kind would define a view of a set of documents that is sufficient for the extraction above to be driven by the schema rather than by paths written into the script.

We can go some way towards this with a view schema represented as an ER model, with added implementation-dependent paths.

<model name="blog-links">
    <type name="url" datatype="string"/>
    <entity name="page">
        <attribute name="inner" max="N" path="//q:ul[@class='portletNavigationTree navTreeLevel0']//@href" type="page"/>
        <attribute name="external" max="N" path="//q:div[@id='content']//q:a[@title='external-link']/@href" type="url"/>
    </entity>
</model>

This schema can then be used by a generic link-gathering script:

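(: the q prefix must be declared here because the schema paths use it :)
declare namespace q = "http://www.w3.org/1999/xhtml";

let $start := request:get-parameter("page",())
let $view := request:get-parameter("view",())
let $schema := doc($view)
let $inner := $schema//entity[@name='page']/attribute[@name='inner']/@path
let $external := $schema//entity[@name='page']/attribute[@name='external']/@path
(: each schema path is appended to a doc() call and evaluated dynamically :)
for $page in util:eval(concat('doc($start)',$inner))
for $link in util:eval(concat('doc($page)',$external))
return
   <link>{ data($link) }</link>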

This script now performs the task of link gathering on any site whose page structure can be defined in terms of the schema with appropriate paths.
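
To make the mechanism concrete, it may help to look at the string that util:eval is given. The following sketch uses only the standard concat() function, with the inner path hard-coded for illustration (in the script it is read from the schema). The text $start inside the string literal is not interpolated; it is resolved only when util:eval evaluates the string in the context of the calling script.

```xquery
(: $inner is hard-coded here for illustration; the script reads it from the schema :)
let $inner := "//q:ul[@class='portletNavigationTree navTreeLevel0']//@href"
return concat('doc($start)', $inner)
```

This returns the string doc($start)//q:ul[@class='portletNavigationTree navTreeLevel0']//@href, which is the same expression that was written out by hand in Version 2.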


Relative and absolute URIs

The previous version works only if the URIs are absolute. A little more work is needed if they are relative:

declare namespace q = "http://www.w3.org/1999/xhtml";

declare variable $start := request:get-parameter("page",());
declare variable $view := request:get-parameter("view",());
declare variable $schema := doc($view);
(: everything in the start URI up to and including the last "/" :)
declare variable $base := substring-before($start,local:local-uri($start));

(: returns the part of a URI after the last "/" :)
declare function local:local-uri($uri) {
  if (contains($uri,"/"))
  then local:local-uri(substring-after($uri,"/"))
  else $uri
};

(: absolute URLs are returned unchanged; anything else is treated as relative to $base :)
declare function local:absolute-uri($url) {
   if (starts-with($url,"http://") or starts-with($url,"https://"))
   then $url
   else concat($base,$url)
};

let $inner := $schema//entity[@name='page']/attribute[@name='inner']/@path
let $external := $schema//entity[@name='page']/attribute[@name='external']/@path

let $starturi := local:absolute-uri($start)

for $page in util:eval(concat('doc($starturi)',$inner))
let $pageuri := local:absolute-uri($page)
for $link in util:eval(concat('doc($pageuri)',$external))
return
   <link>{ data($link) }</link>
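
As an alternative to concatenating strings by hand, the standard function resolve-uri() resolves a relative reference against a base URI and leaves absolute URIs unchanged. The sketch below is not part of the original script, and the example.org and example.com addresses are placeholders chosen for illustration.

```xquery
(: placeholder base URI and links, for illustration only :)
let $base := "http://www.example.org/site/index.html"
for $href in ("page2.html", "../other/page3.html", "http://www.example.com/")
return string(resolve-uri($href, $base))
```

This should yield http://www.example.org/site/page2.html, http://www.example.org/other/page3.html and http://www.example.com/. Unlike local:absolute-uri, it also handles references such as ../ that climb out of the base directory.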

The same script can be used with a different schema (same model, different paths):

<model name="site-links">
    <type name="url" datatype="string"/>
    <entity name="page">
        <attribute name="inner" max="N" path="//div[@class='nav']//a/@href" type="page"/>
        <attribute name="external" max="N" path="//div[@class='content']//a/@href" type="url"/>
    </entity>
</model>

This is a view schema of the test site.


Virtual Paths

The navigation path is still hard-coded in the script. We would like to write path expressions where the steps are defined in the schema. This path would then be interpreted in the context of the schema.

View Schema

In this example, the test site has been expanded to include a separate index page and some additional components in the view:

<model name="site-links">
    <entity name="externalPage">
        <attribute name="title" path="/head/title"/>
    </entity>
    <entity name="index">
        <attribute name="link" max="N" path="//div[@class='index']//a/@href" type="page"/>
    </entity>
    <entity name="page">
        <attribute name="title" path="//head/title"/>
        <attribute name="inner" max="N" path="//div[@class='nav']//a/@href" type="page"/>
        <attribute name="external" max="N" path="//div[@class='content']//a/@href"/>
        <attribute name="author" min="0" path="//div[@class='content']/span[@class='author']"/>
    </entity>
</model>


Path language

This prototype uses a simple path language. The first step identifies the (entity) type of the initial document. The step -> dereferences a relative or absolute URL. Where a step is recognised as an attribute of the current entity, the associated path expression from the schema is used; otherwise the step is executed as XPath.

For example:

index/link/->/title

List the titles of the pages in the index.

import module namespace vp ="http://www.cems.uwe.ac.uk/xmlwiki/vp" at "../Gov/vp.xqm";

let $uri := "http://www.cems.uwe.ac.uk/xmlwiki/Gov/site/index.html"
let $schema := "/db/Wiki/Gov/site3.xml"
return vp:process-path($uri,"index/link/->/title",$schema)


index/link/->/author/string(.)

List the authors of the pages referenced in the index.

import module namespace vp ="http://www.cems.uwe.ac.uk/xmlwiki/vp" at "../Gov/vp.xqm";

let $uri := "http://www.cems.uwe.ac.uk/xmlwiki/Gov/site/index.html"
let $schema := "/db/Wiki/Gov/site3.xml"
return vp:process-path($uri,"index/link/->/author/string(.)",$schema)


index/link/->/external

List the distinct URLs of the external links on all pages referenced by the index page.

import module namespace vp ="http://www.cems.uwe.ac.uk/xmlwiki/vp" at "../Gov/vp.xqm";
declare option exist:serialize "method=xhtml media-type=text/html";

let $uri := "http://www.cems.uwe.ac.uk/xmlwiki/Gov/site/index.html"
let $schema := "/db/Wiki/Gov/site3.xml"
for $link in distinct-values(vp:process-path($uri,"index/link/->/external",$schema))
order by $link
return <a href="{$link}">{$link}</a>


page/inner/->/inner/->/title

List the titles of the pages two links away from the initial page, that is, the pages linked from the pages it links to.

import module namespace vp ="http://www.cems.uwe.ac.uk/xmlwiki/vp" at "../Gov/vp.xqm";

let $uri := "http://www.cems.uwe.ac.uk/xmlwiki/Gov/site/test1.html"
let $schema := "/db/Wiki/Gov/site3.xml"
return vp:process-path($uri,"page/inner/->/inner/->/title",$schema)


Script

The core function processes a virtual path in the context of a schema.

declare function vp:process-steps($nodes,$context,$steps,$base,$schema) {
  if (empty($steps))
  then $nodes
  else
    let $step := $steps[1]
    let $entity := $schema//entity[@name=$context]
    return
      (: "->" dereferences each node as a URL; the entity context is unchanged :)
      if ($step = "->")
      then
        let $newnodes :=
          for $node in $nodes
          return vp:get-doc($node,$base)
        return vp:process-steps($newnodes, $context, subsequence($steps,2),$base,$schema)
      (: the step names an attribute of the current entity: apply its schema path :)
      else if ($entity/attribute[@name=$step])
      then
        let $attribute := $entity/attribute[@name=$step]
        let $next := string($schema//entity[@name=$attribute/@type]/@name)
        let $path := string($attribute/@path)
        let $newnodes :=
          for $node in $nodes
          return util:eval(concat("$node",$path))
        return vp:process-steps($newnodes, $next, subsequence($steps,2),$base,$schema)
      (: otherwise evaluate the step as plain XPath relative to each node :)
      else
        let $newnodes :=
          for $node in $nodes
          return util:eval(concat("$node/",$step))
        return vp:process-steps($newnodes, $context, subsequence($steps,2),$base,$schema)
};

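
The examples call vp:process-path, and the core function calls vp:get-doc, but neither is shown here. The following is a minimal sketch of how they might look, not the module's actual code. It assumes that both functions live in the same module as vp:process-steps, that the schema argument is the path of a stored schema document, that the path is split on "/", and that every link is resolved against the URI of the initial document.

```xquery
(: Hypothetical helper: fetch the document a link points to.
   resolve-uri() leaves absolute URLs unchanged and resolves
   relative ones against $base. :)
declare function vp:get-doc($url, $base) {
  doc(string(resolve-uri(string($url), $base)))
};

(: Hypothetical entry point: split the virtual path into steps and
   use the first step as the entity type of the initial document. :)
declare function vp:process-path($uri, $path, $schema-uri) {
  let $steps := tokenize($path, "/")
  let $schema := doc($schema-uri)
  return
    vp:process-steps(doc($uri), $steps[1], subsequence($steps, 2), $uri, $schema)
};
```

Splitting on "/" keeps the sketch short, but it means an XPath step that itself contains a "/" cannot be written in a virtual path. A fuller implementation would need a proper tokenizer.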

Acknowledgments

This example is based on an article by Daniel Bennett [1].