XQuery/Keyword Search

Motivation

You want to create a Google-style keyword search interface to an XML database, with relevance-ranked, full-text search of selected nodes and search results in which the keyword is highlighted in context.

 

Method

Our search engine will receive keywords from a simple HTML form, assigning them to the variable $q. Then it (1) parses the keywords, (2) constructs the scope of the query, (3) executes the query, (4) scores and sorts the hits according to the score, (5) shows the linked results with a summary containing the keyword highlighted in context, and (6) paginates the results.

Note: This tutorial was written against eXist 1.3, a development version of eXist. eXist 1.4 has since been released and changed several aspects of eXist slightly, and this article has not yet been fully updated to account for the changes. The most notable changes are that (1) the kwic.xql file referenced here is now a built-in module and (2) the previous default fulltext search index (whose search operator appears below as &=) is disabled by default in favor of the new, Lucene-based fulltext index, which speeds up both search and scoring considerably. The changes required to make the code work with 1.4 will be extensive, but the article is nonetheless instructive in its current form. Lastly, this example will not run under versions prior to 1.3.


Example Collections and Data

Let's assume that you have three collections:

  /db/test
  /db/test/articles
  /db/test/people

The articles and people collections contain XML files with different schemas: "articles" contains structured content, and "people" contains biographical information about people mentioned in the articles. We want to search both collections using a full-text keyword search, and we want to search specific nodes of each collection: the body of the articles and the biographies of the people. Because the sample documents are in a namespace, the element names in the paths carry the test prefix that search.xq binds to that namespace. Fundamentally, our search string is:

for $hit in (collection('/db/test/articles')/test:article/test:body,
             collection('/db/test/people')/test:person/test:biography)[. &= $q]

Note: "&=" is an eXist fulltext search operator, and it will return nodes that match the tokenized contents of $q. See [1] for more information.

The two data collections contain the following sample documents:

Collection A

File='/db/test/articles/1.xml'

<article id="1" xmlns="http://en.wikibooks.org/wiki/XQuery/test">
    <head>
        <author id="2"/>
        <posted when="2009-01-01"/>
    </head>
    <body>
        <title>A Day at the Races</title>
        <div>
            <head>So much for taking me out to the ballgame</head>
            <p>My dad, <person target="1">John</person>, was a great guy, but he sure was a bad
                driver...</p>
            <p>...</p>
        </div>
    </body>
</article>

Collection B

File='/db/test/people/2.xml'

<person id="2" xmlns="http://en.wikibooks.org/wiki/XQuery/test">
    <name>Joe Doe</name>
    <role type="author"/>
    <contact type="e-mail">joeschmoe@mail.net</contact>
    <biography>Joe Doe was born in Brooklyn, New York, and he now lives in Boston, Massachusetts.</biography>
</person>

Search Form

File='/db/test/search.xq'

xquery version "1.0";

declare namespace test="http://en.wikibooks.org/wiki/XQuery/test";

declare option exist:serialize "method=xhtml media-type=text/html";

<html>
<head><title>Keyword Search</title></head>
<body>
    <h1>Keyword Search</h1>
    <form method="GET">
        <p>
            <strong>Keyword Search:</strong>
            <input name="q" type="text"/>
        </p>
        <p>
            <input type="submit" value="Search"/>
        </p>
    </form>
</body>
</html>

Note that the form element can also contain an action attribute such as action="search.xq" to specify the URL of the XQuery script that should process the submission. Without it, the form is submitted to the page's own URL.

Receive Search Submission

It's helpful to show the submitted keywords in the search field, so we capture the search submission in the variable $q using the request:get-parameter() function. We change the input element so that it displays the value of $q whenever one has been submitted.

let $q := xs:string(request:get-parameter("q", ""))

...

<input name="q" type="text" value="{$q}"/>

Filter Search Parameters

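In order to prevent XQuery injection attacks, it is good practice to force the $q variable into a type of xs:string and to filter out unwanted characters from the search parameters. Hyphens inside the character class are escaped as \- so that they are read literally; an unescaped hyphen between two characters defines a range, and ranges such as +-= or ;-` would also strip digits and uppercase letters.

let $q := xs:string(request:get-parameter("q", ""))
let $filtered-q := replace($q, "[&amp;&quot;\-*;`~!@#$%^()_+=\[\]\{\}\|':/.,?]", "")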

An alternative method of filtering is to only allow characters that are in a whitelist:

let $q := xs:string(request:get-parameter("q", ""))
let $filtered-q := replace($q, "[^0-9a-zA-ZäöüßÄÖÜ\-,. ]", "")
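
As a quick check of the whitelist approach, the following minimal sketch applies the same pattern to a hostile-looking input. The input string is a made-up example, and normalize-space() is added here only to tidy the leftover spaces; neither is part of the original search script.

```xquery
xquery version "1.0";

(: Hypothetical input that tries to close the quoted keyword string
   and the predicate that the search string is later built from. :)
let $q := 'races"] | //secret (: x :)'

(: Whitelist filter from the text: keep digits, letters, umlauts,
   hyphen, comma, full stop and space, and delete everything else. :)
let $filtered-q := replace($q, "[^0-9a-zA-ZäöüßÄÖÜ\-,. ]", "")

(: Collapse the runs of spaces left behind by the deleted characters. :)
return normalize-space($filtered-q)
```

The result is the string "races secret x": the quote, bracket, pipe, slashes, parentheses and colons are gone, so the value can no longer terminate the string literal it is placed in.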

Construct Search Scope

In the context of a native XML database, the scope of a search can be very fine-grained, using the full expressive power of XPath. We can choose to target specific collections, documents, and nodes within documents. We can also target specific element namespaces, and we can use predicates to limit results to elements with a specific attribute. In our example, we will target two collections, with a specific XPath for each. We create this search scope using a sequence of XPath expressions, qualifying each element name with the test prefix declared for the documents' namespace:

let $scope := 
    ( 
        collection('/db/test/articles')/test:article/test:body,
        collection('/db/test/people')/test:person/test:biography
    )
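
The effect of the namespace on the scope paths can be tried without a database. The sketch below replaces the two collections with in-memory copies of the sample documents (an assumption made purely so the example is self-contained) and compares unprefixed and prefixed steps.

```xquery
xquery version "1.0";

declare namespace test="http://en.wikibooks.org/wiki/XQuery/test";

(: In-memory stand-ins for the two sample documents; in the tutorial
   these nodes come from collection() calls instead. :)
let $docs := (
    <article xmlns="http://en.wikibooks.org/wiki/XQuery/test">
        <body><title>A Day at the Races</title></body>
    </article>,
    <person xmlns="http://en.wikibooks.org/wiki/XQuery/test">
        <biography>Joe Doe was born in Brooklyn, New York.</biography>
    </person>
)

(: Unprefixed steps look for elements in no namespace and select nothing. :)
let $unqualified := ($docs/self::article/body, $docs/self::person/biography)

(: Prefixed steps match the namespaced elements. :)
let $scope := ($docs/self::test:article/test:body,
               $docs/self::test:person/test:biography)

return (count($unqualified), for $node in $scope return local-name($node))
```

This returns 0 followed by "body" and "biography": only the prefixed paths find the nodes to be searched.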

Construct Search String and Execute Search

Although we could execute our search directly using the example above (under "Example Collections and Data"), we'll have much more flexibility if we first construct our search as a string and then execute it using the util:eval() function.

let $search-string := concat('$scope', '[. &amp;= "', $filtered-q, '"]')
let $hits := util:eval($search-string)
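
To see what is actually handed to util:eval(), the string construction can be run on its own. This is a small sketch with an assumed keyword value; it only builds the string and does not execute a search.

```xquery
xquery version "1.0";

(: Assumed, already filtered keyword input. :)
let $filtered-q := 'day races'

(: "&amp;" is the escaped form of "&" inside an XQuery string literal,
   so the resulting string contains the plain operator. :)
let $search-string := concat('$scope', '[. &amp;= "', $filtered-q, '"]')

return $search-string
```

The result is the string $scope[. &= "day races"]. Because the keywords sit between double quotes in an expression that is evaluated as code, a double quote in the input would end the literal early, which is why the filtering step comes first.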

Score and Sort Search Results

Without sorting, the results come back in "document order", the order in which the matching nodes occur in the database. Results can be sorted by any criterion: alphabetical order, date, number of keyword matches, and so on. We will use a simple relevance algorithm to score our results: the number of keyword matches divided by the string length of the matching node. Under this algorithm, a hit with 1 match in 10 characters (score 0.1) ranks higher than a hit with 2 matches in 100 characters (score 0.02).

let $sorted-hits :=
    for $hit in $hits
    let $keyword-matches := text:match-count($hit)
    let $hit-node-length := string-length($hit)
    let $score := $keyword-matches div $hit-node-length
    order by $score descending
    return $hit
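
The ranking arithmetic can be checked in isolation. In the sketch below the hits are hypothetical elements whose match counts and lengths are written in by hand, standing in for the values that text:match-count() and string-length() supply in the real query.

```xquery
xquery version "1.0";

(: Hypothetical hits: the match counts are supplied by hand here. :)
let $hits := (
    <hit id="long" matches="2" length="100"/>,
    <hit id="short" matches="1" length="10"/>
)
for $hit in $hits
(: Score = keyword matches divided by the length of the matching node. :)
let $score := xs:integer($hit/@matches) div xs:integer($hit/@length)
order by $score descending
return concat($hit/@id, ': ', $score)
```

This returns "short: 0.1" followed by "long: 0.02", so the shorter node with a single match is ranked first.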

Show Results with Highlighted Keyword in Context

We want to show each result as an HTML div element containing three components: the title of the hit, a summary with an excerpt of the hit showing the keywords highlighted in context, and a link to display the full hit. Depending on the collection, these components will be constructed differently; we use the collection as the 'hook' to drive the display of each type of result. (Note: Other 'hooks' could be used, including namespace, node name, etc.)

We will create our highlighted keyword search summary by importing a module called kwic.xql and using a function inside called kwic:summarize(). The kwic:summarize() function highlights the first matching keyword term in a hit, and returns the surrounding text. kwic.xql was written by Wolfgang Meier and is distributed in eXist version 1.3b. We will place kwic.xql in the eXist database inside the /db/test/ collection.

xquery version "1.0";

import module namespace kwic="http://exist-db.org/xquery/kwic" at "xmldb:exist:///db/test/kwic.xql";

...

let $results := 
    for $hit in $sorted-hits[position() = ($start + 1 to $end)]
    let $collection := util:collection-name($hit)
    let $document := util:document-name($hit)
    let $base-uri := replace(request:get-url(), 'search.xq$', '')
    let $config := <config xmlns="" width="60"/>
    return 
        if ($collection = '/db/test/articles') then
            let $title := doc(concat($collection, '/', $document))//test:title/text()
            let $summary := kwic:summarize($hit, $config)
            let $url := concat('view-article.xq?article=', $document)
            return 
                <div class="result">
                    <p>
                        <span class="title"><a href="{$url}">{$title}</a></span><br/>
                        {$summary/*}<br/>
                        <span class="url">{concat($base-uri, $url)}</span>
                    </p>
                </div>
        else if ($collection = '/db/test/people') then
            let $title := doc(concat($collection, '/', $document))//test:name/text()
            let $summary := kwic:summarize($hit, $config)
            let $url := concat('view-person.xq?person=', $document)
            return 
                <div class="result">
                    <p>
                        <span class="title"><a href="{$url}">{$title}</a></span><br/>
                        {$summary/*}<br/>
                        <span class="url">{concat($base-uri, $url)}</span>
                    </p>
                </div>
        else 
            let $title := concat('Unknown result. Collection: ', $collection, '. Document: ', $document, '.')
            let $summary := kwic:summarize($hit, $config)
            let $url := concat($collection, '/', $document)
            return 
                <div class="result">
                    <p>
                        <span class="title"><a href="{$url}">{$title}</a></span><br/>
                        {$summary/*}<br/>
                        <span class="url">{concat($base-uri, $url)}</span>
                    </p>
                </div>
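
To illustrate the idea of a keyword-in-context summary, here is a simplified sketch written in plain XQuery. It is not the implementation of kwic:summarize(); the function local:kwic and its parameters are hypothetical, it works on plain strings rather than on match-marked nodes, and it assumes that lower-casing does not change the length of the text. The class name "hi" is taken from the stylesheet used in the complete script.

```xquery
xquery version "1.0";

(: Hypothetical helper: wrap the first occurrence of $keyword in a span
   and keep up to $width characters of context on either side. :)
declare function local:kwic($text as xs:string, $keyword as xs:string,
                            $width as xs:integer) as element(p)
{
    let $lower := lower-case($text)
    let $key := lower-case($keyword)
    return
        if (not(contains($lower, $key))) then
            (: No match: fall back to the start of the text. :)
            <p>{substring($text, 1, 2 * $width)}</p>
        else
            (: 1-based position of the first case-insensitive match. :)
            let $pos := string-length(substring-before($lower, $key)) + 1
            let $from := max((1, $pos - $width))
            return
                <p>{
                    substring($text, $from, $pos - $from),
                    <span class="hi">{substring($text, $pos, string-length($keyword))}</span>,
                    substring($text, $pos + string-length($keyword), $width)
                }</p>
};

local:kwic('Joe Doe was born in Brooklyn, New York, and he now lives in Boston, Massachusetts.',
           'brooklyn', 12)
```

For the sample biography this yields a p element containing "was born in ", the highlighted span "Brooklyn", and ", New York, ".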

Paginate and Summarize Results

In order to reduce the result list to a manageable number, we can use URL parameters and XPath predicates to return only 10 results at a time. To do so, we need to define two new variables: $perpage, the number of results per page, and $start, the zero-based offset of the first result on the page. As the user retrieves each page of results, the $start value will be passed to the server as a URL parameter, driving a new set of results using the XPath predicate.

let $perpage := xs:integer(request:get-parameter("perpage", "10"))
let $start := xs:integer(request:get-parameter("start", "0"))
let $end := $start + $perpage
let $results := 
    for $hit in $sorted-hits[position() = ($start + 1 to $end)]
    ...
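
The slicing step can also be written with the standard subsequence() function, which some readers may find easier to follow than a position() predicate. The sketch below is an alternative formulation, not the form used in the complete script; local:page is a hypothetical helper and the integers 1 to 25 stand in for sorted hits.

```xquery
xquery version "1.0";

(: Hypothetical helper returning one page of items. :)
declare function local:page($items as item()*, $start as xs:integer,
                            $perpage as xs:integer) as item()*
{
    (: $start is a zero-based offset, while subsequence() counts from 1. :)
    subsequence($items, $start + 1, $perpage)
};

(: 25 stand-in hits, 10 per page: offsets 0, 10 and 20 select pages 1 to 3. :)
let $hits := 1 to 25
for $start in (0, 10, 20)
let $page := local:page($hits, $start, 10)
return concat('start=', $start, ': items ', $page[1], ' to ', $page[last()])
```

This returns "start=0: items 1 to 10", "start=10: items 11 to 20" and "start=20: items 21 to 25"; the last page is simply shorter, with no separate end calculation needed.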

We also need to provide links to each page of results. To do so, we will mimic Google's pagination links, which start by displaying links to the first 10 pages of results, grow to as many as 20 page links, and include previous and next links. Our pagination links will only show if there is more than one page of results, and will be a simple HTML list that can be styled with CSS.

let $perpage := xs:integer(request:get-parameter("perpage", "10"))
let $start := xs:integer(request:get-parameter("start", "0"))
let $total-result-count := count($hits)
let $end := 
    (: clamp to the total so the last page does not overshoot :)
    min(($start + $perpage, $total-result-count))
let $number-of-pages := 
    xs:integer(ceiling($total-result-count div $perpage))
let $current-page := xs:integer(($start + $perpage) div $perpage)
let $url-params-without-start := replace(request:get-query-string(), '&amp;start=\d+', '')
let $pagination-links := 
    if ($number-of-pages le 1) then ()
    else 
        <div id="search-pagination">
            <ul>
                {
                (: Show 'Previous' for all but the 1st page of results :)
                    if ($current-page = 1) then ()
                    else
                        <li><a href="{concat('?', $url-params-without-start, '&amp;start=', $perpage * ($current-page - 2)) }">Previous</a></li>
                }
                
                {
                (: Show links to each page of results :)
                    let $max-pages-to-show := 20
                    let $padding := xs:integer(round($max-pages-to-show div 2))
                    let $start-page := 
                        if ($current-page le ($padding + 1)) then
                            1
                        else $current-page - $padding
                    let $end-page := 
                        if ($number-of-pages le ($current-page + $padding)) then
                            $number-of-pages
                        else $current-page + $padding - 1
                    for $page in ($start-page to $end-page)
                    let $newstart := $perpage * ($page - 1)
                    return
                        (
                        if ($newstart eq $start) then 
                            (<li>{$page}</li>)
                        else
                            <li><a href="{concat('?', $url-params-without-start, '&amp;start=', $newstart)}">{$page}</a></li>
                        )
                }
                
                {
                (: Shows 'Next' for all but the last page of results :)
                    if ($start + $perpage ge $total-result-count) then ()
                    else
                        <li><a href="{concat('?', $url-params-without-start, '&amp;start=', $start + $perpage)}">Next</a></li>
                }
            </ul>
        </div>
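
The page arithmetic behind these links is easiest to follow with concrete numbers. The sketch below uses assumed sample values (45 hits, 10 per page, offset 20) and the same formulas as the code above.

```xquery
xquery version "1.0";

(: Assumed sample values: 45 hits, 10 per page, third page requested. :)
let $total-result-count := 45
let $perpage := 10
let $start := 20

let $number-of-pages := xs:integer(ceiling($total-result-count div $perpage))
let $current-page := xs:integer(($start + $perpage) div $perpage)

(: The zero-based start offset carried by each page link. :)
let $offsets := for $page in (1 to $number-of-pages)
                return $perpage * ($page - 1)

return ($number-of-pages, $current-page, $offsets)
```

This returns 5 and 3, followed by the offsets 0, 10, 20, 30 and 40: five pages in total, with the request for start=20 landing on page 3.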

We should also provide a plain English summary of the search results, in the form "Displaying all 5 results" or "Displaying 1-10 of 1200 results."

let $how-many-on-this-page := 
    (: provides textual explanation about how many results are on this page, 
     : e.g. 'all n results', or '1-10 of n results' :)
    if ($total-result-count lt $perpage) then 
        concat('all ', $total-result-count, ' results')
    else
        concat($start + 1, '-', $end, ' of ', $total-result-count, ' results')
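
The summary logic can be wrapped in a function and tried with a few values. In this sketch local:summary is a hypothetical name, and the upper bound is clamped with min() so that a short last page is reported correctly, which is a small assumption beyond the snippet above.

```xquery
xquery version "1.0";

(: Hypothetical helper that builds the "how many on this page" text. :)
declare function local:summary($start as xs:integer, $perpage as xs:integer,
                               $total as xs:integer) as xs:string
{
    if ($total le $perpage) then
        concat('all ', $total, ' results')
    else
        (: Clamp the upper bound so the last page never exceeds the total. :)
        concat($start + 1, '-', min(($start + $perpage, $total)),
               ' of ', $total, ' results')
};

(
    local:summary(0, 10, 5),
    local:summary(0, 10, 1200),
    local:summary(1190, 10, 1195)
)
```

This returns "all 5 results", "1-10 of 1200 results" and "1191-1195 of 1195 results".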

Putting it All Together

Here is the complete search.xq, with some CSS to make the results look nice. This search XQuery is quite long, and lends itself well to refactoring by moving sections of code into separate functions.

File='/db/test/search.xq'

xquery version "1.0";

import module namespace kwic="http://exist-db.org/xquery/kwic" at "xmldb:exist:///db/test/kwic.xql";

declare namespace test="http://en.wikibooks.org/wiki/XQuery/test";

declare option exist:serialize "method=xhtml media-type=text/html";

let $q := xs:string(request:get-parameter("q", ""))
let $filtered-q := replace($q, "[&amp;&quot;\-*;`~!@#$%^()_+=\[\]\{\}\|':/.,?]", "")
let $scope := 
    ( 
        collection('/db/test/articles')/test:article/test:body,
        collection('/db/test/people')/test:person/test:biography
    )
let $search-string := concat('$scope', '[. &amp;= "', $filtered-q, '"]')
let $hits := util:eval($search-string)
let $sorted-hits :=
    for $hit in $hits
    let $keyword-matches := text:match-count($hit)
    let $hit-node-length := string-length($hit)
    let $score := $keyword-matches div $hit-node-length
    order by $score descending
    return $hit
let $perpage := xs:integer(request:get-parameter("perpage", "10"))
let $start := xs:integer(request:get-parameter("start", "0"))
let $total-result-count := count($hits)
let $end := 
    (: clamp to the total so the last page does not overshoot :)
    min(($start + $perpage, $total-result-count))
let $results := 
    for $hit in $sorted-hits[position() = ($start + 1 to $end)]
    let $collection := util:collection-name($hit)
    let $document := util:document-name($hit)
    let $config := <config xmlns="" width="60"/>
    let $base-uri := replace(request:get-url(), 'search.xq$', '')
    return 
        if ($collection = '/db/test/articles') then
            let $title := doc(concat($collection, '/', $document))//test:title/text()
            let $summary := kwic:summarize($hit, $config)
            let $url := concat('view-article.xq?article=', $document)
            return 
                <div class="result">
                    <p>
                        <span class="title"><a href="{$url}">{$title}</a></span><br/>
                        {$summary/*}<br/>
                        <span class="url">{concat($base-uri, $url)}</span>
                    </p>
                </div>
        else if ($collection = '/db/test/people') then
            let $title := doc(concat($collection, '/', $document))//test:name/text()
            let $summary := kwic:summarize($hit, $config)
            let $url := concat('view-person.xq?person=', $document)
            return 
                <div class="result">
                    <p>
                        <span class="title"><a href="{$url}">{$title}</a></span><br/>
                        {$summary/*}<br/>
                        <span class="url">{concat($base-uri, $url)}</span>
                    </p>
                </div>
        else 
            let $title := concat('Unknown result. Collection: ', $collection, '. Document: ', $document, '.')
            let $summary := kwic:summarize($hit, $config)
            let $url := concat($collection, '/', $document)
            return 
                <div class="result">
                    <p>
                        <span class="title"><a href="{$url}">{$title}</a></span><br/>
                        {$summary/*}<br/>
                        <span class="url">{concat($base-uri, $url)}</span>
                    </p>
                </div>
let $number-of-pages := 
    xs:integer(ceiling($total-result-count div $perpage))
let $current-page := xs:integer(($start + $perpage) div $perpage)
let $url-params-without-start := replace(request:get-query-string(), '&amp;start=\d+', '')
let $pagination-links := 
    if ($number-of-pages le 1) then ()
    else
        <ul>
            {
            (: Show 'Previous' for all but the 1st page of results :)
                if ($current-page = 1) then ()
                else
                    <li><a href="{concat('?', $url-params-without-start, '&amp;start=', $perpage * ($current-page - 2)) }">Previous</a></li>
            }
            
            {
            (: Show links to each page of results :)
                let $max-pages-to-show := 20
                let $padding := xs:integer(round($max-pages-to-show div 2))
                let $start-page := 
                    if ($current-page le ($padding + 1)) then
                        1
                    else $current-page - $padding
                let $end-page := 
                    if ($number-of-pages le ($current-page + $padding)) then
                        $number-of-pages
                    else $current-page + $padding - 1
                for $page in ($start-page to $end-page)
                let $newstart := $perpage * ($page - 1)
                return
                    (
                    if ($newstart eq $start) then 
                        (<li>{$page}</li>)
                    else
                        <li><a href="{concat('?', $url-params-without-start, '&amp;start=', $newstart)}">{$page}</a></li>
                    )
            }
            
            {
            (: Shows 'Next' for all but the last page of results :)
                if ($start + $perpage ge $total-result-count) then ()
                else
                    <li><a href="{concat('?', $url-params-without-start, '&amp;start=', $start + $perpage)}">Next</a></li>
            }
        </ul>
let $how-many-on-this-page := 
    (: provides textual explanation about how many results are on this page, 
     : e.g. 'all n results', or '1-10 of n results' :)
    if ($total-result-count lt $perpage) then 
        concat('all ', $total-result-count, ' results')
    else
        concat($start + 1, '-', $end, ' of ', $total-result-count, ' results')
return

<html>
<head>
    <title>Keyword Search</title>
    <style>
        body {{ 
            font-family: arial, helvetica, sans-serif; 
            font-size: small 
            }}
        div.result {{ 
            margin-top: 1em;
            margin-bottom: 1em;
            border-top: 1px solid #dddde8;
            border-bottom: 1px solid #dddde8;
            background-color: #f6f6f8; 
            }}
        #search-pagination {{ 
            display: block;
            float: left;
            text-align: center;
            width: 100%;
            margin: 0 5px 20px 0; 
            padding: 0;
            overflow: hidden;
            }}
        #search-pagination li {{
            display: inline-block;
            float: left;
            list-style: none;
            padding: 4px;
            text-align: center;
            background-color: #f6f6fa;
            border: 1px solid #dddde8;
            color: #181a31;
            }}
        span.hi {{ 
            font-weight: bold; 
            }}
        span.title {{ font-size: medium; }}
        span.url {{ color: green; }}
    </style>
</head>
<body>
    <h1>Keyword Search</h1>
    <div id="searchform">
        <form method="GET">
            <p>
                <strong>Keyword Search:</strong>
                <input name="q" type="text" value="{$q}"/>
            </p>
            <p>
                <input type="submit" value="Search"/>
            </p>
        </form>
    </div>

    {
    if (empty($hits)) then ()
    else
        (
        <h2>Results for keyword search &quot;{$q}&quot;.  Displaying {$how-many-on-this-page}.</h2>,
        <div id="searchresults">{$results}</div>,
        <div id="search-pagination">{$pagination-links}</div>
        )
    }
</body>
</html>