XQuery/Link gathering

Motivation

You want to gather the links on a blog page.

Method

We use the doc() function to perform an HTTP GET on a remote web page. If the page is well-formed XML, you can then select its unordered lists by appending a path step for ul elements to the doc() call.
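
As a minimal sketch of that path step, the query below uses an inline document in place of a fetched page, so it runs without network access. The markup and the class name nav are invented for illustration; a real script would bind $page to doc() applied to the page URL instead.

```xquery
xquery version "1.0";

declare namespace q = "http://www.w3.org/1999/xhtml";

(: Inline stand-in for a fetched page (hypothetical markup, not the real blog) :)
let $page :=
  document {
    <html xmlns="http://www.w3.org/1999/xhtml">
      <body>
        <ul class="nav">
          <li><a href="a.html">A</a></li>
          <li><a href="b.html">B</a></li>
        </ul>
      </body>
    </html>
  }
(: XHTML elements are in a namespace, so the steps need the q prefix :)
return $page//q:ul[@class = "nav"]/q:li
```

The query returns the two li elements. Without the q prefix the path would match nothing, because unprefixed names in a path refer to elements in no namespace.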


This script fetches the blog page and selects the URLs in the link section, which reference other blog articles. Each referenced article is then fetched and the URLs marked as external are selected. The result is returned as XML.

declare namespace q = "http://www.w3.org/1999/xhtml";

<results>
{
let $nav := doc("http://www.advocatehope.org/tech-tidbits/theory-of-the-web-as-one-big-database")//q:ul[@class="portletNavigationTree navTreeLevel0"]
for $href in $nav//@href
let $page := data($href)
let $content := doc($page)//q:div[@id="content"]
for $links in $content//q:a[@title="external-link"]
return
   <link>{ data($links/@href) }</link>
}
</results>

Version 2

Dropping the intermediate variables allows the structure to be seen more clearly:

declare namespace q = "http://www.w3.org/1999/xhtml";

let $uri := "http://www.advocatehope.org/tech-tidbits/theory-of-the-web-as-one-big-database"
return
<results>
{
       for $page in doc($uri)//q:ul[@class="portletNavigationTree navTreeLevel0"]//@href
       for $link  in doc($page)//q:div[@id="content"]//q:a[@title="external-link"]/@href
       return 
          <link>{data($link)}</link>
}
</results>
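
Both versions assume that every referenced page can be fetched and parsed as XML. One possible way to skip pages that cannot is to test each URI with the standard doc-available() function first. The sketch below is an assumption about how that guard might look, not part of the original script; the function name local:external-links and the example.org URI are hypothetical.

```xquery
xquery version "1.0";

declare namespace q = "http://www.w3.org/1999/xhtml";

(: Return the external link URLs of one page, or nothing if the page
   cannot be retrieved and parsed as XML :)
declare function local:external-links($uri as xs:string) as xs:string* {
  if (doc-available($uri))
  then
    for $href in doc($uri)//q:div[@id = "content"]//q:a[@title = "external-link"]/@href
    return string($href)
  else ()
};

<results>
{
  (: hypothetical URI, used only to exercise the guard :)
  for $link in local:external-links("http://www.example.org/missing-or-malformed.html")
  return <link>{ $link }</link>
}
</results>
```

The trade-off is that unavailable pages are dropped silently; a script that needs to report them would have to record the failing URIs as well.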

Repository Schemas

Daniel Bennett is proposing a standard for supporting the extraction of data such as this from a site. Such a schema would define a view of a set of documents sufficient to allow the extraction above to be based on the schema.

We can go some way towards this with a view schema represented as an ER model, with added implementation-dependent paths.


<model  name="blog-links">
    <type name="url" datatype="string"/>
    <entity name="page"  >
        <attribute name="inner"  max="N"   path="//q:ul[@class='portletNavigationTree navTreeLevel0']//@href"  type="page"/>
        <attribute name="external" max="N"  path="//q:div[@id='content']//q:a[@title='external-link']/@href"  type="url"/>
     </entity>
</model>

This schema can then be used by a generic link-gathering script:

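(: the q prefix is declared here because the schema paths use it :)
declare namespace q = "http://www.w3.org/1999/xhtml";

(: request:get-parameter and util:eval are eXist-db extension functions :)
let $start := request:get-parameter("page",())
let $view := request:get-parameter("view",())
let $schema := doc($view)
let $inner := $schema//entity[@name='page']/attribute[@name='inner']/@path
let $external := $schema//entity[@name='page']/attribute[@name='external']/@path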
return
<results>
{
       for $page in util:eval(concat('doc($start)',$inner))
       for $link in util:eval(concat('doc($page)',$external))
       return      
         <link>{string($link)}</link>
}
</results>

This script now performs the task of link gathering on any site whose page structure can be defined in terms of the schema with appropriate paths.
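
The generic script depends on eXist-db extensions (request:get-parameter and util:eval), but the lookup of paths in the schema is ordinary XPath. The sketch below, which assumes the schema is inlined rather than loaded with doc($view), lists the name, target type, and path of each attribute of the page entity. The step element it produces is a hypothetical output format chosen for illustration.

```xquery
xquery version "1.0";

(: Inline copy of the blog-links view schema :)
let $schema :=
  <model name="blog-links">
    <type name="url" datatype="string"/>
    <entity name="page">
      <attribute name="inner" max="N" path="//q:ul[@class='portletNavigationTree navTreeLevel0']//@href" type="page"/>
      <attribute name="external" max="N" path="//q:div[@id='content']//q:a[@title='external-link']/@href" type="url"/>
    </entity>
  </model>
(: one result element per attribute of the page entity :)
for $attribute in $schema/entity[@name = "page"]/attribute
return
  <step name="{ $attribute/@name }" target="{ $attribute/@type }">{ string($attribute/@path) }</step>
```

Each path is held as a plain string, which is why the generic script has to concatenate it onto a doc() call and evaluate the result dynamically.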

Relative and absolute URIs

The previous version works only if the URIs found in the pages are absolute. Relative URIs need a little more work, because they must be resolved against the base of the start page:

declare namespace q = "http://www.w3.org/1999/xhtml";

declare variable $start := request:get-parameter("page",());
declare variable $view := request:get-parameter("view",());
declare variable $schema := doc($view);
declare variable $base :=  substring-before($start,local:local-uri($start));

declare function local:local-uri($uri) {
  if (contains($uri,"/"))
  then local:local-uri(substring-after($uri,"/"))
  else $uri
};

declare function local:absolute-uri($url) {
   (: treat both http and https URLs as already absolute :)
   if (matches($url,"^https?://"))
   then $url
   else concat($base,$url)
};

let $inner := $schema//entity[@name='page']/attribute[@name='inner']/@path 
let $external := $schema//entity[@name='page']/attribute[@name='external']/@path  

let $starturi := local:absolute-uri($start)
return
<results>

{
       for $page in util:eval(concat('doc($starturi)',$inner))
       let $pageuri := local:absolute-uri($page)
       for $link in util:eval(concat('doc($pageuri)',$external))
       return      
         <link>{string($link)}</link>
}
</results>
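
As an alternative to building the absolute URI by string concatenation, the standard resolve-uri() function resolves a relative reference against a base URI and also normalises segments such as "../". The sketch below is self-contained; the base URI and the links are made-up values for illustration.

```xquery
xquery version "1.0";

(: Hypothetical base URI and links, for illustration only :)
let $base := "http://example.org/site/index.html"
for $href in ("page1.html", "../other/page2.html", "http://example.com/x")
(: an absolute $href is returned unchanged; a relative one is resolved against $base :)
return string(resolve-uri($href, $base))
(: yields http://example.org/site/page1.html,
          http://example.org/other/page2.html,
          http://example.com/x :)
```

With this approach the base can be the full URI of the page that contained the link, so there is no need to strip the last path segment first.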

The same script can be used with a different schema (same model, different paths):

<model name="site-links">
    <type name="url" datatype="string"/>
    <entity name="page"  >
         <attribute name="inner" max="N" path="//div[@class='nav']//a/@href" type="page"/>
        <attribute name="external" max="N" path="//div[@class='content']//a/@href" type="url"/>
    </entity>
</model>

This is a view schema of a small test site.

Virtual Paths

The navigation path is still hard-coded in the script. We would like to write path expressions where the steps are defined in the schema. This path would then be interpreted in the context of the schema.

View Schema

In this example, the test site has been expanded to include a separate index page and some additional components in the view:

<model name="site-links">
    <entity name="externalPage">
        <attribute name="title" path="//head/title"/>
    </entity>
    <entity name="index">
        <attribute name="link" max="N" path="//div[@class='index']//a/@href" type="page"/>
     </entity>
     <entity name="page">
        <attribute name="title" path="//head/title"/>
        <attribute name="inner" max="N" path="//div[@class='nav']//a/@href" type="page"/>
        <attribute name="external" max="N" path="//div[@class='content']//a/@href"
        type="externalPage"/>
         <attribute name="author" min="0" path="//div[@class='content']/span[@class='author']"/>
    </entity>
</model>

Path language

This prototype uses a simple path language. The step -> dereferences a relative or absolute URL. Where a step is recognised as an attribute of the current entity, the associated path expression from the schema is used; otherwise the step is executed as XPath. The first step identifies the (entity) type of the initial document.

For example:

index/link/->/title

List the titles of the pages in the index.

import module namespace vp ="http://www.cems.uwe.ac.uk/xmlwiki/vp" at "../Gov/vp.xqm";

let $uri := "http://www.cems.uwe.ac.uk/xmlwiki/Gov/site/index.html"
let $schema := "/db/Wiki/Gov/site3.xml"
return
  <result>
  {vp:process-path($uri,"index/link/->/title",$schema) }
  </result>

index/link/->/author/string(.)

List the authors of the pages referenced in the index.

import module namespace vp ="http://www.cems.uwe.ac.uk/xmlwiki/vp" at "../Gov/vp.xqm";

let $uri := "http://www.cems.uwe.ac.uk/xmlwiki/Gov/site/index.html"
let $schema := "/db/Wiki/Gov/site3.xml"
return
  <result>
  {vp:process-path($uri,"index/link/->/author/string(.)",$schema) }
  </result>

index/link/->/external

List the distinct URLs of the external links on all pages referenced by the index page.

import module namespace vp ="http://www.cems.uwe.ac.uk/xmlwiki/vp" at "../Gov/vp.xqm";
declare option exist:serialize "method=xhtml media-type=text/html";

let $uri := "http://www.cems.uwe.ac.uk/xmlwiki/Gov/site/index.html"
let $schema := "/db/Wiki/Gov/site3.xml"
return
  <ul>
    {for $uri in distinct-values(vp:process-path($uri,"index/link/->/external",$schema))
    order by $uri
    return
    <li>
    <a href="{$uri}">{string($uri)}</a>
    </li>
   }
  </ul>

page/inner/->/inner/->/title

List the titles of the pages reached by following two inner links from the initial page.

import module namespace vp ="http://www.cems.uwe.ac.uk/xmlwiki/vp" at "../Gov/vp.xqm";

let $uri := "http://www.cems.uwe.ac.uk/xmlwiki/Gov/site/test1.html"
let $schema := "/db/Wiki/Gov/site3.xml"
return
  <result>
  {vp:process-path($uri,"page/inner/->/inner/->/title",$schema) }
  </result>

Script

The core function, vp:process-steps, recursively processes the steps of a virtual path in the context of a schema.


declare function vp:process-steps($nodes,$context,$steps,$base,$schema) {
  (: no steps left: the current nodes are the result :)
  if (empty($steps))
  then $nodes
  else
    let $step := $steps[1]
    let $rest := subsequence($steps,2)
    let $entity := $schema//entity[@name=$context]
    return
      if ($step = "->")
      then
        (: dereference each node as a URL and continue with the fetched documents :)
        let $newnodes :=
          for $node in $nodes
          return vp:get-doc($node,$base)
        return
          vp:process-steps($newnodes,$context,$rest,$base,$schema)
      else if ($entity/attribute[@name=$step])
      then
        (: the step names an attribute of the current entity: apply its schema path :)
        let $attribute := $entity/attribute[@name=$step]
        let $next := string($schema//entity[@name=$attribute/@type]/@name)
        let $path := string($attribute/@path)
        let $newnodes :=
          for $node in $nodes
          return util:eval(concat("$node",$path))
        return
          vp:process-steps($newnodes,$next,$rest,$base,$schema)
      else
        (: otherwise evaluate the step as plain XPath relative to each node :)
        let $newnodes :=
          for $node in $nodes
          return util:eval(concat("$node/",$step))
        return
          vp:process-steps($newnodes,$context,$rest,$base,$schema)
};
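
To make the three branches easier to follow, the sketch below reproduces only the dispatch logic. It fetches nothing and evaluates nothing; it simply reports which branch each step of a virtual path would take and how the entity context changes. The function local:explain, the result element names, and the trimmed inline schema are hypothetical and are not part of the vp module. Splitting the path on "/" is also an assumption about how the path might be tokenised.

```xquery
xquery version "1.0";

(: Trimmed inline copy of the site-links view schema :)
declare variable $schema :=
  <model name="site-links">
    <entity name="index">
      <attribute name="link" max="N" path="//div[@class='index']//a/@href" type="page"/>
    </entity>
    <entity name="page">
      <attribute name="title" path="//head/title"/>
    </entity>
  </model>;

(: Report, step by step, which branch of the interpreter would apply :)
declare function local:explain($steps as xs:string*, $context as xs:string) as element()* {
  if (empty($steps)) then ()
  else
    let $step := $steps[1]
    let $rest := subsequence($steps, 2)
    let $attribute := $schema/entity[@name = $context]/attribute[@name = $step]
    return
      if ($step = "->") then
        (: dereference step: the entity context is unchanged :)
        (<dereference entity="{ $context }"/>, local:explain($rest, $context))
      else if (exists($attribute)) then
        (: schema step: the context moves to the attribute's type, or "" if it has none :)
        let $next := string($schema/entity[@name = $attribute/@type]/@name)
        return
          (<schema-path entity="{ $context }" xpath="{ $attribute/@path }"/>,
           local:explain($rest, $next))
      else
        (: any other step would be evaluated as plain XPath :)
        (<raw-xpath xpath="{ $step }"/>, local:explain($rest, $context))
};

(: the first step names the entity type of the initial document :)
let $steps := tokenize("index/link/->/title", "/")
return <plan>{ local:explain(subsequence($steps, 2), $steps[1]) }</plan>
```

For the path index/link/->/title the plan contains a schema-path element for link in the index entity, a dereference element, and a schema-path element for title in the page entity. Because a naive split on "/" would break a raw XPath step that itself contains a slash, a real tokeniser would probably need to be more careful.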

Acknowledgments

This example is based on an article by Daniel Bennett [1].