XQuery/Filtering Words

Motivation


Sometimes you have a body of text and want to filter out words that appear on a given list, often called a stoplist.

Screen Image

Screen Image

Sample Program

xquery version "1.0";

(: Test to see if a word is in a list :)

declare namespace exist = "http://exist.sourceforge.net/NS/exist";
declare option exist:serialize "method=xhtml media-type=text/html indent=yes omit-xml-declaration=yes";

(: A list of words :)
let $stopwords :=
<words>
   <word>a</word>
   <word>and</word>
   <word>in</word>
   <word>the</word>
   <word>or</word>
   <word>over</word>
</words>

let $input-text := 'a quick brown fox jumps over the lazy dog'
return
<html>
   <head>
      <title>Test whether a word is on a list</title>
   </head>
   <body>
   <h1>Test whether a word is on a list</h1>

   <h2>WordList</h2>
   <table border="1">
     <thead>
       <tr>
         <th>StopWord</th>
       </tr>
     </thead>
     <tbody>{
     for $word in $stopwords/word
     return
        <tr>
           <td align="center">{string($word)}</td>
        </tr>
     }</tbody>
   </table>

   <h2>Sample Input Text</h2>
   <p>Input Text:</p>
   <div style="border:1px solid black">{$input-text}</div>
   <table border="1">
     <thead>
       <tr>
         <th>Word</th>
         <th>On List</th>
       </tr>
     </thead>
     <tbody>{
     for $word in tokenize($input-text, '\s+')
     return
     <tr>
        <td>{$word}</td>
        <td>{
          if ($stopwords/word = $word)
            then(<span style="color:green;">true</span>)
            else(<span style="color:red;">false</span>)
        }</td>
     </tr>
     }</tbody>
   </table>
  </body>
</html>

Discussion


The input string is split into words using the tokenize function, which accepts two parameters: the string to be split and a separator expressed as a regular expression. Here the separator '\s+' matches one or more whitespace characters (spaces, tabs or newlines). The result is a sequence of words.
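One detail of tokenize is worth illustrating: when the input starts or ends with a separator, the result contains empty strings at those ends. The following is a minimal sketch, not part of the original program; the variable $text is made up for the illustration.

```xquery
xquery version "1.0";

(: Hypothetical input with leading, trailing and repeated whitespace. :)
let $text := '  a quick   brown fox '
return (
  (: Separators at the start and end produce empty strings,
     so this counts "", "a", "quick", "brown", "fox", "". :)
  count(tokenize($text, '\s+')),
  (: normalize-space trims the ends and collapses runs of whitespace,
     so only the four real words remain. :)
  count(tokenize(normalize-space($text), '\s+'))
)
```

On a conforming processor this should return 6 followed by 4. Wrapping the input in normalize-space before tokenizing is one way to avoid spurious empty "words" in the table.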

This program uses an XPath general comparison (the = operator) to compare the sequence $stopwords/word with the single-item sequence $word. The comparison is true if any item of the first sequence equals any item of the second, that is, if the stoplist contains the word.
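The "any item matches" semantics can be sketched with a plain sequence of strings. This example is illustrative only; it also shows why != is not the negation of = when one side has several items.

```xquery
xquery version "1.0";

let $stopwords := ("a", "and", "in", "the", "or", "over")
return (
  $stopwords = "the",        (: true: some item equals "the" :)
  $stopwords = "fox",        (: false: no item equals "fox" :)
  $stopwords != "the",       (: true as well: the item "a" differs from "the" :)
  not($stopwords = "the")    (: false: this is the correct "not on the list" test :)
)
```

To test that a word is absent from the list, use not($stopwords = $word) rather than $stopwords != $word.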

Alternative coding


You can also perform the stopword lookup with a quantified expression of the form some ... satisfies (see XQuery/Quantified Expressions), such as:

   some $word in $stopwords/word
   satisfies ($word = $thisword)

There are other alternatives: hold the stop words as a sequence of strings, hold them as one long string and use contains(), or store them as an element in the database.
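As a rough sketch of two of these alternatives: the long-string variant below pads both the list and the word with spaces so that a word such as "the" cannot match inside a longer word. The variable names and the database path are hypothetical and not taken from the original tests.

```xquery
xquery version "1.0";

let $word := "over"

(: Alternative: stop words packed into one space-delimited string.
   The leading and trailing spaces let every entry be matched as " word ". :)
let $stopstring := " a and in the or over "
let $in-string := contains($stopstring, concat(" ", $word, " "))

(: Alternative: stop words stored as a document in the database.
   The path is hypothetical, so the line is left commented out:
   let $in-db := doc("/db/stopwords.xml")/words/word = $word :)

return $in-string
```

Without the space padding, contains() would report a match for any substring of a stop word, for example "ov" inside "over".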

There are, however, significant differences in performance. A set of unit tests (see Unit Tests) shows the differences between a number of alternatives.

What these tests reveal is that, on the eXist-db platform, both of the implementations suggested above are far from optimal. Testing against a sequence of strings takes about a fifth of the time of testing against elements. A general comparison with = is similarly superior to a quantified expression.


It would appear that the preferable approach is:

let $stopwords := ("a","and","in","the","or","over")
let $input-string :=  'a quick brown fox jumps over the lazy dog'
let $input-words := tokenize($input-string, '\s+')
return
    for $word in $input-words
    return $stopwords = $word

If the stop words are held as elements, it is better to convert them to a sequence of strings first:

let $stopwords :=
<words>
   <word>a</word>
   <word>and</word>
   <word>in</word>
   <word>the</word>
   <word>or</word>
   <word>over</word>
</words>
let $stopwordsx := $stopwords/word/string(.)
let $input-string :=  'a quick brown fox jumps over the lazy dog'
let $input-words := tokenize($input-string, '\s+')
return
    for $word in $input-words
    return $stopwordsx = $word
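The programs on this page report whether each word is on the list. To actually filter the stop words out of the text, as described in the Motivation, the same general comparison can be used inside a predicate. This is a minimal sketch under the same sample data; the variable $kept is introduced only for the illustration.

```xquery
xquery version "1.0";

let $stopwords := ("a", "and", "in", "the", "or", "over")
let $input-string := 'a quick brown fox jumps over the lazy dog'
(: Keep only the tokens that match no stop word. :)
let $kept := tokenize($input-string, '\s+')[not(. = $stopwords)]
return string-join($kept, ' ')
```

For the sample sentence this should return "quick brown fox jumps lazy dog". The comparison is case-sensitive, so "The" would not be removed unless the input is lower-cased first.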

Note that, in these tests, referencing a stop list stored in the database slightly improved performance.