XQuery/Multiple page scraping and Voting behaviour

Often the necessary data is spread over multiple web pages.

Here is an example where data is taken from multiple pages to gather together the voting behaviour of a member in the US House of Representatives.

An index of the issues in any session of the House is provided by pages such as [1]. From there, one can see that the page reporting on each of the sequentially numbered votes is generated by a query such as [2].

The results are returned as an XML page rendered in a browser using XSLT. The XQuery doc() function retrieves the underlying XML.
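When many pages are fetched in one query, a single missing page would make doc() raise an error and abort the whole run. The following is a minimal sketch of guarding one fetch with the standard doc-available() function. The URL follows the pattern used on this page and is not verified to still resolve; the summary and missing elements are hypothetical names chosen only for illustration.

```xquery
xquery version "1.0";

(: Sketch only: URL pattern as used on this page, not verified here. :)
let $path := "http://clerk.house.gov/evs/2007/roll010.xml"
return
  if (doc-available($path))
  then (
    (: vote-metadata, vote-question and legis-num are the element
       names used by the queries on this page :)
    let $bill := doc($path)//vote-metadata
    return
      <summary>{concat(data($bill/vote-question), " of ", data($bill//legis-num))}</summary>
  )
  else
    (: hypothetical marker element for a page that could not be read :)
    <missing path="{$path}"/>
```

The same test can be placed inside a for loop over roll numbers so that unavailable pages are reported rather than stopping the query.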

The following query aggregates the voting behaviour of a specific member over six specific votes (roll calls 10 to 15 of 2007):

<report>
{
(: the report and vote element names are illustrative :)
for $i in 10 to 15
let $path := concat("http://clerk.house.gov/evs/2007/roll0",$i,".xml")
let $report := doc($path)
let $bill := $report//vote-metadata
let $specificvote := $report//recorded-vote[legislator/@name-id = "E000215"]
let $result := concat(data($specificvote//legislator)," voted ",data($specificvote/vote)," ",data($bill/vote-question)," of ",data($bill//legis-num))
return <vote>{$result}</vote>
}
</report>


More generally, the following function returns one XML node per roll call containing the extracted data. The vote pages encode the roll number with leading zeros, to a minimum length of three digits, so the function pads the number before building the URL:

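declare function local:voting($repid as xs:string, $year as xs:integer, $rollnumbers as xs:integer*) as element(vote)* {
(: one vote element per roll call; output element names are illustrative :)
for $rollno in $rollnumbers
let $zeropaddedrollnum := concat(string-pad("0",max((0,3 - string-length(xs:string($rollno))))),xs:string($rollno))
let $path := concat("http://clerk.house.gov/evs/",$year,"/roll",$zeropaddedrollnum,".xml")
let $report := doc($path)
let $bill := $report//vote-metadata
let $specificvote := $report//recorded-vote[legislator/@name-id = $repid]
return
  <vote roll="{$rollno}">
    <legislator>{data($specificvote/legislator)}</legislator>
    <position>{data($specificvote/vote)}</position>
    <question>{data($bill/vote-question)}</question>
    <bill>{data($bill//legis-num)}</bill>
  </vote>
};
<report>
  {local:voting("E000215",2007,10 to 15)}
</report>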
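The padding step in the function above relies on string-pad, which does not appear in the final XQuery 1.0 function library and so may be unavailable in some processors. As a sketch of one alternative, the hypothetical helper local:pad-roll below (not part of the original page) produces the same three-digit padding using only substring, string-length and max:

```xquery
xquery version "1.0";

(: Hypothetical helper: left-pads a roll number with zeros to a
   minimum of three digits using only standard XQuery 1.0 functions. :)
declare function local:pad-roll($rollno as xs:integer) as xs:string {
  let $digits := string($rollno)
  (: number of zeros still needed; never negative :)
  let $missing := max((0, 3 - string-length($digits)))
  return concat(substring("000", 1, $missing), $digits)
};

(
  local:pad-roll(7),
  local:pad-roll(10),
  local:pad-roll(1234)
)
```

This evaluates to the sequence "007", "010", "1234". Roll numbers that already have three or more digits pass through unchanged, matching the behaviour of the max((0, ...)) guard in the original expression.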


Note. It would be preferable to use the asp endpoint, since it avoids the complication of leading zeros that arises here, but that endpoint appears to produce malformed XML.