XQuery/Multiple page scraping and Voting behaviour
Often the necessary data is spread over multiple web pages.
Here is an example in which data is taken from several pages to assemble the voting record of a member of the US House of Representatives.
An index of the issues in any session of the House is provided by pages such as [1]. From here, one can see that the pages reporting on each of the sequentially numbered votes are generated by queries such as [2]
The results are returned as an XML page rendered in a browser using XSLT. The XQuery doc() function retrieves the underlying XML.
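As a minimal sketch of this step, the query below fetches a single roll-call report and copies out a few metadata elements. The URL follows the pattern used in the rest of this page and is assumed to be still reachable; the summary and missing element names are made up for illustration. The doc-available() check is an addition so that a missing page yields a marker element rather than an error.

```xquery
(: Sketch only: fetch one roll-call report and list some of its metadata.
   The URL pattern is taken from this page; its availability is an assumption. :)
let $uri := "http://clerk.house.gov/evs/2007/roll010.xml"
return
  if (doc-available($uri))
  then
    let $meta := doc($uri)//vote-metadata
    return
      <summary>
        {$meta/rollcall-num}
        {$meta/vote-question}
        {$meta/legis-num}
      </summary>
  else
    (: hypothetical marker element for a page that could not be retrieved :)
    <missing uri="{$uri}"/>
```

Guarding each doc() call in this way is useful when looping over many pages, since a single missing roll call would otherwise stop the whole query.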
The following query aggregates the voting behaviour of a specific member over six votes (roll calls 10 to 15 of 2007):
<results>
  {
    (: roll calls 10 to 15 appear in the URL as roll010 ... roll015 :)
    for $i in 10 to 15
    let $path := concat("http://clerk.house.gov/evs/2007/roll0", $i, ".xml")
    let $report := doc($path)
    let $bill := $report//vote-metadata
    (: the recorded vote of the member identified by name-id :)
    let $specificvote := $report//recorded-vote[legislator/@name-id = "E000215"]
    let $result := concat(data($specificvote//legislator), " voted ",
                          data($specificvote/vote), " ",
                          data($bill/vote-question), " of ",
                          data($bill//legis-num))
    return
      <result>{$result}</result>
  }
</results>
More generally, the following function returns one XML element per roll call containing the extracted data. The vote page URLs encode the roll number with leading zeros, to a minimum length of three digits, so the function pads the number before building the path:
declare function local:voting($repid as xs:string, $year as xs:integer, $rollnumbers as xs:integer*) as element(result)* {
  for $rollno in $rollnumbers
  (: left-pad the roll number with zeros to at least 3 digits.
     string-pad() is not part of the standard XQuery 1.0 function library;
     this code assumes the processor supplies it :)
  let $zeropaddedrollnum := concat(string-pad("0", max((0, 3 - string-length(xs:string($rollno))))), xs:string($rollno))
  let $path := concat("http://clerk.house.gov/evs/", $year, "/roll", $zeropaddedrollnum, ".xml")
  let $report := doc($path)
  let $bill := $report//vote-metadata
  let $specificvote := $report//recorded-vote[legislator/@name-id = $repid]
  return
    <result>
      <year>{$year}</year>
      {$bill/rollcall-num}
      {$bill/vote-question}
      {$bill/legis-num}
      {$specificvote/legislator}
      {$specificvote/vote}
    </result>
};

<report>
  {local:voting("E000215", 2007, 10 to 15)}
</report>
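The padding expression in the function relies on string-pad(), which is not in the standard XQuery 1.0 function library and may be missing from some processors. A possible alternative, sketched here with a hypothetical helper named local:pad-roll, uses only substring(), string-length() and concat():

```xquery
xquery version "1.0";

(: Hypothetical helper, not part of the original page:
   left-pad a roll number with zeros to a minimum of 3 digits. :)
declare function local:pad-roll($rollno as xs:integer) as xs:string {
  let $digits := string($rollno)
  return
    if (string-length($digits) ge 3)
    then $digits
    (: take as many zeros from "000" as there are digits missing :)
    else concat(substring("000", 1, 3 - string-length($digits)), $digits)
};

<padded>
  {
    for $n in (7, 42, 128, 1186)
    return <roll>{local:pad-roll($n)}</roll>
  }
</padded>
```

For the four sample numbers this yields 007, 042, 128 and 1186. Inside local:voting, the value of $zeropaddedrollnum could then be computed as local:pad-roll($rollno).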
Note: It would be preferable to use the ASP endpoint, since it avoids the complication of leading zeros, but it appears to produce malformed XML.