XQuery/Link gathering
Motivation
You want to gather the links on a blog page.
Method
We use the doc() function to perform an HTTP GET on a remote web page. If the page is well-formed XML (for example XHTML), you can then select its unordered lists by appending a path step such as //q:ul to the doc() call.
This script fetches the blog page and selects the URLs in the link section, which reference other blog articles. Each referenced article is then fetched and the URLs marked as external are selected. The result is returned as XML.
declare namespace q = "http://www.w3.org/1999/xhtml";

<results>
{
  let $nav := doc("http://www.advocatehope.org/tech-tidbits/theory-of-the-web-as-one-big-database")//q:ul[@class="portletNavigationTree navTreeLevel0"]
  for $href in $nav//@href
  let $page := data($href)
  let $content := doc($page)//q:div[@id="content"]
  for $links in $content//q:a[@title="external-link"]
  return
    <link>{ data($links/@href) }</link>
}
</results>
Version 2
Dropping the intermediate variables allows the structure to be seen more clearly:
declare namespace q = "http://www.w3.org/1999/xhtml";

let $uri := "http://www.advocatehope.org/tech-tidbits/theory-of-the-web-as-one-big-database"
return
  <results>
  {
    for $page in doc($uri)//q:ul[@class="portletNavigationTree navTreeLevel0"]//@href
    for $link in doc($page)//q:div[@id="content"]//q:a[@title="external-link"]/@href
    return
      <link>{ data($link) }</link>
  }
  </results>
Repository Schemas
Daniel Bennett is proposing a standard for supporting the extraction of data such as this from a site. Such a schema would define a view of a set of documents that is sufficient for the extraction above to be driven by the schema.
We can go some way towards this with a view schema represented as an ER model, with added implementation-dependent paths.
<model name="blog-links">
  <type name="url" datatype="string"/>
  <entity name="page">
    <attribute name="inner" max="N" path="//q:ul[@class='portletNavigationTree navTreeLevel0']//@href" type="page"/>
    <attribute name="external" max="N" path="//q:div[@id='content']//q:a[@title='external-link']/@href" type="url"/>
  </entity>
</model>
This schema can then be used by a generic link-gathering script:
(: the q prefix must be declared because the schema paths are evaluated in this script's context :)
declare namespace q = "http://www.w3.org/1999/xhtml";

let $start := request:get-parameter("page", ())
let $view := request:get-parameter("view", ())
let $schema := doc($view)
let $inner := $schema//entity[@name='page']/attribute[@name='inner']/@path
let $external := $schema//entity[@name='page']/attribute[@name='external']/@path
return
  <results>
  {
    for $page in util:eval(concat('doc($start)', $inner))
    for $link in util:eval(concat('doc($page)', $external))
    return
      <link>{ string($link) }</link>
  }
  </results>
This script now performs the task of link gathering on any site whose page structure can be defined in terms of the schema with appropriate paths.
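To make the string assembly concrete, the following sketch uses only standard XQuery (no eXist extensions) to show the expression strings that the generic script hands to util:eval. The inline copy of the schema and the names $entity and expressions are for illustration only.

```xquery
xquery version "1.0";

(: Inline copy of the blog-links view schema, for illustration only :)
let $schema :=
  <model name="blog-links">
    <type name="url" datatype="string"/>
    <entity name="page">
      <attribute name="inner" max="N" path="//q:ul[@class='portletNavigationTree navTreeLevel0']//@href" type="page"/>
      <attribute name="external" max="N" path="//q:div[@id='content']//q:a[@title='external-link']/@href" type="url"/>
    </entity>
  </model>
let $entity := $schema//entity[@name = 'page']
return
  <expressions>
    <!-- the strings below are built but not evaluated here -->
    <inner>{ concat('doc($start)', $entity/attribute[@name = 'inner']/@path) }</inner>
    <external>{ concat('doc($page)', $entity/attribute[@name = 'external']/@path) }</external>
  </expressions>
```

The inner element then contains the text doc($start)//q:ul[@class='portletNavigationTree navTreeLevel0']//@href, which is the expression the first version of the script wrote out by hand. The variable references inside the string are resolved only when util:eval runs it in the calling script's context.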
Relative and absolute URIs
The previous version works only if the URIs are absolute. A little more work is needed to handle relative URIs:
declare namespace q = "http://www.w3.org/1999/xhtml";

declare variable $start := request:get-parameter("page", ());
declare variable $view := request:get-parameter("view", ());
declare variable $schema := doc($view);
(: everything in the start URI up to and including the last "/" :)
declare variable $base := substring-before($start, local:local-uri($start));

(: returns the part of the URI after the last "/" :)
declare function local:local-uri($uri) {
  if (contains($uri, "/"))
  then local:local-uri(substring-after($uri, "/"))
  else $uri
};

(: absolute URLs pass through; anything else is treated as relative to $base :)
declare function local:absolute-uri($url) {
  if (starts-with($url, "http://") or starts-with($url, "https://"))
  then $url
  else concat($base, $url)
};

let $inner := $schema//entity[@name='page']/attribute[@name='inner']/@path
let $external := $schema//entity[@name='page']/attribute[@name='external']/@path
let $starturi := local:absolute-uri($start)
return
  <results>
  {
    for $page in util:eval(concat('doc($starturi)', $inner))
    let $pageuri := local:absolute-uri($page)
    for $link in util:eval(concat('doc($pageuri)', $external))
    return
      <link>{ string($link) }</link>
  }
  </results>
The script can now be used with a different schema (same model, different paths):
<model name="site-links">
  <type name="url" datatype="string"/>
  <entity name="page">
    <attribute name="inner" max="N" path="//div[@class='nav']//a/@href" type="page"/>
    <attribute name="external" max="N" path="//div[@class='content']//a/@href" type="url"/>
  </entity>
</model>
This is a view schema of a test site.
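As an alternative to a hand-written helper such as local:absolute-uri, the standard resolve-uri function resolves a relative reference against a base URI and returns absolute URIs unchanged. The sketch below is not part of the original script, and the example.org and example.com addresses are made-up values for illustration.

```xquery
xquery version "1.0";

(: Hypothetical base URI and link values, used only for illustration :)
let $base := "http://example.org/site/index.html"
for $href in ("page2.html", "../other/page3.html", "https://example.com/x")
return
  (: resolve-uri handles ".." segments and leaves absolute URIs unchanged :)
  <link>{ resolve-uri($href, $base) }</link>
```

This yields links to http://example.org/site/page2.html, http://example.org/other/page3.html and https://example.com/x. Unlike simple concatenation with a base prefix, it also copes with parent-directory references.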
Virtual Paths
The navigation path is still hard-coded in the script. We would like to write path expressions whose steps are defined in the schema. Such a path would then be interpreted in the context of the schema.
View Schema
In this example, the test site has been expanded to include a separate index page and some additional components in the view:
<model name="site-links">
  <entity name="externalPage">
    <attribute name="title" path="//head/title"/>
  </entity>
  <entity name="index">
    <attribute name="link" max="N" path="//div[@class='index']//a/@href" type="page"/>
  </entity>
  <entity name="page">
    <attribute name="title" path="//head/title"/>
    <attribute name="inner" max="N" path="//div[@class='nav']//a/@href" type="page"/>
    <attribute name="external" max="N" path="//div[@class='content']//a/@href"
               type="externalPage"/>
    <attribute name="author" min="0" path="//div[@class='content']/span[@class='author']"/>
  </entity>
</model>
Path language
This prototype uses a simple path language. The first step identifies the (entity) type of the initial document. The step -> dereferences a relative or absolute URL. Where a step is recognised as an attribute of the current entity, the associated path expression from the schema is used; otherwise the step is executed as XPath.
For example:
index/link/->/title
List the titles of the pages in the index.
import module namespace vp = "http://www.cems.uwe.ac.uk/xmlwiki/vp" at "../Gov/vp.xqm";

let $uri := "http://www.cems.uwe.ac.uk/xmlwiki/Gov/site/index.html"
let $schema := "/db/Wiki/Gov/site3.xml"
return
  <result>
    { vp:process-path($uri, "index/link/->/title", $schema) }
  </result>
index/link/->/author/string(.)
List the authors of the pages referenced in the index.
import module namespace vp = "http://www.cems.uwe.ac.uk/xmlwiki/vp" at "../Gov/vp.xqm";

let $uri := "http://www.cems.uwe.ac.uk/xmlwiki/Gov/site/index.html"
let $schema := "/db/Wiki/Gov/site3.xml"
return
  <result>
    { vp:process-path($uri, "index/link/->/author/string(.)", $schema) }
  </result>
index/link/->/external
List the distinct URLs of the external links on all pages referenced by the index page.
import module namespace vp = "http://www.cems.uwe.ac.uk/xmlwiki/vp" at "../Gov/vp.xqm";
declare option exist:serialize "method=xhtml media-type=text/html";

let $uri := "http://www.cems.uwe.ac.uk/xmlwiki/Gov/site/index.html"
let $schema := "/db/Wiki/Gov/site3.xml"
return
  <ul>
  {
    (: $link avoids shadowing the start page's $uri :)
    for $link in distinct-values(vp:process-path($uri, "index/link/->/external", $schema))
    order by $link
    return
      <li>
        <a href="{$link}">{ string($link) }</a>
      </li>
  }
  </ul>
page/inner/->/inner/->/title
List the titles of the pages reached by following inner links twice from the initial page.
import module namespace vp = "http://www.cems.uwe.ac.uk/xmlwiki/vp" at "../Gov/vp.xqm";

let $uri := "http://www.cems.uwe.ac.uk/xmlwiki/Gov/site/test1.html"
let $schema := "/db/Wiki/Gov/site3.xml"
return
  <result>
    { vp:process-path($uri, "page/inner/->/inner/->/title", $schema) }
  </result>
Script
The core function processes a virtual path in the context of a schema.
declare function vp:process-steps($nodes, $context, $steps, $base, $schema) {
  if (empty($steps))
  then $nodes
  else
    let $step := $steps[1]
    let $entity := $schema//entity[@name = $context]
    return
      if ($step = "->")
      then
        (: dereference each URL in the current node set; the entity type is unchanged :)
        let $newnodes :=
          for $node in $nodes
          return vp:get-doc($node, $base)
        return
          vp:process-steps($newnodes, $context, subsequence($steps, 2), $base, $schema)
      else if ($entity/attribute[@name = $step])
      then
        (: schema attribute: apply its path, then switch to the attribute's entity type :)
        let $attribute := $entity/attribute[@name = $step]
        let $next := string($schema//entity[@name = $attribute/@type]/@name)
        let $path := string($attribute/@path)
        let $newnodes :=
          for $node in $nodes
          return util:eval(concat("$node", $path))
        return
          vp:process-steps($newnodes, $next, subsequence($steps, 2), $base, $schema)
      else
        (: any other step is evaluated as plain XPath relative to each node :)
        let $newnodes :=
          for $node in $nodes
          return util:eval(concat("$node/", $step))
        return
          vp:process-steps($newnodes, $context, subsequence($steps, 2), $base, $schema)
};
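The examples above call vp:process-path, and vp:process-steps relies on vp:get-doc, but neither function is shown here. The following is a minimal sketch of how they might look, not the module's actual code. The signatures are inferred from the calls above; the URL test, the way the base is derived, and the parameter name $schemaUri are assumptions. The fragment is meant to sit in the same library module as vp:process-steps.

```xquery
(: Hypothetical helper: fetch the document a URL-valued node points to.
   Absolute URLs are used as they are; others are appended to $base. :)
declare function vp:get-doc($node, $base) {
  let $url := string($node)
  return
    if (matches($url, "^https?://"))
    then doc($url)
    else doc(concat($base, $url))
};

(: Hypothetical entry point: split the virtual path into steps, take the
   first step as the entity type of the start document, and hand the
   remaining steps to vp:process-steps. :)
declare function vp:process-path($uri as xs:string, $path as xs:string, $schemaUri as xs:string) {
  let $steps := tokenize($path, "/")
  (: assumed base: the start URI without its last path segment :)
  let $base := replace($uri, "[^/]+$", "")
  return
    vp:process-steps(doc($uri), $steps[1], subsequence($steps, 2), $base, doc($schemaUri))
};
```

With this sketch, the path index/link/->/title is split into the steps index, link, -> and title: index names the entity type of the start document and the other three steps are processed in turn. Splitting on "/" means a plain XPath step cannot itself contain a slash, which is a limitation of the sketch.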
Acknowledgments
This example is based on an article by Daniel Bennett [1].