XQuery/Filtering Words

      Motivation

      Sometimes you have a body of text and want to filter out the words that appear on a given list, often called a stoplist.


      Sample Program

      xquery version "1.0";
      
      (: Test to see if a word is in a list :)
      
      declare namespace exist = "http://exist.sourceforge.net/NS/exist";
      declare option exist:serialize "method=xhtml media-type=text/html indent=yes omit-xml-declaration=yes";
      
      (: A list of words :)
      let $stopwords :=
      <words>
         <word>a</word>
         <word>and</word>
         <word>in</word>
         <word>the</word>
         <word>or</word>
         <word>over</word>
      </words>
      
      let $input-text := 'a quick brown fox jumps over the lazy dog'
      return
      <html>
         <head>
            <title>Test of is a word on a list</title>
           </head>
         <body>
         <h1> Test of is a word on a list</h1>
         
         <h2>WordList</h2>
         <table border="1">
           <thead>
             <tr>
               <th>StopWord</th>   
             </tr>
           </thead>
           <tbody>{
           for $word in $stopwords/word
           return
              <tr>       
                 <td align="center">{$word}</td>              
              </tr>
           }</tbody>
         </table>
         
         <h2>Sample Input Text</h2>
         <p>Input Text:</p>
         <div style="border:1px solid black">{$input-text}</div>
         <table border="1">
           <thead>
             <tr>
               <th>Word</th>
               <th>On List</th>       
             </tr>
           </thead>
           <tbody>{
           for $word in tokenize($input-text, '\s+')
           return
           <tr>       
              <td>{$word}</td>      
              <td>{
                if ($stopwords/word = $word)
                  then(<font color="green">true</font>)
                  else(<font color="red">false</font>)
              }</td>
           </tr>
           }</tbody>
         </table>
        </body>
      </html>
      



      Discussion

      The input string is split into words using the tokenize function, which accepts two parameters: the string to be parsed and a separator expressed as a regular expression. Here the separator is '\s+', so words are separated by one or more whitespace characters (spaces, tabs or newlines). The result is a sequence of words.
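
      One detail of tokenize is worth a small sketch: if the input begins or ends with whitespace, the result contains a zero-length string at that end. Wrapping the input in normalize-space() is one way to avoid this. The variable $raw below is only an illustration and is not part of the sample program.

```xquery
xquery version "1.0";

(: $raw is a hypothetical input with leading, trailing and repeated spaces :)
let $raw := '  a quick   brown fox '
(: normalize-space() trims both ends and collapses runs of whitespace,
   so tokenize() yields no zero-length strings :)
return tokenize(normalize-space($raw), '\s+')
```

      This returns the four words a, quick, brown and fox. Without normalize-space(), the same call would also return an empty string before "a" and another after "fox".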

      This program uses an XPath general comparison (the = operator) to compare the sequence $stopwords/word with the single-item sequence $word. A general comparison is true if any item in one sequence equals any item in the other, that is, if the stoplist contains the word.
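
      The behaviour of = on sequences can be checked in isolation. The following is a minimal sketch using a shortened, made-up stop list held as plain strings:

```xquery
xquery version "1.0";

(: A shortened stop list, for illustration only :)
let $stopwords := ("a", "and", "the")
return (
  $stopwords = "the",           (: true: one item matches :)
  $stopwords = "fox",           (: false: no item matches :)
  $stopwords = ("fox", "and"),  (: true: "and" occurs in both sequences :)
  () = "the"                    (: false: an empty sequence never matches :)
)
```

      String comparison here is case-sensitive under the usual codepoint collation, so "The" would not match "the" unless both sides are lower-cased first, for example with lower-case().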


      Alternative coding

      You can also perform the stopword lookup with a quantified expression, some...satisfies (see XQuery/Quantified Expressions), such as the following, where $thisword is the word being tested:

         some $word in $stopwords/word
         satisfies ($word = $thisword)
      

      There are other alternatives: hold the stop words as a sequence of strings, hold them as one long string and test it with contains(), or store them as an element in the database.
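
      As a rough sketch of the long-string alternative, the stop words can be held in a single space-delimited string. A plain contains() would report a false match for a word such as "an", which occurs inside "and", so this version pads both the list and the word with the delimiter. The variable $stoplist and the on-list attribute are illustrative names, not taken from the sample program.

```xquery
xquery version "1.0";

(: Stop words as one string, with a space before and after every word :)
let $stoplist := ' a and in the or over '
for $word in tokenize('a quick brown fox jumps over the lazy dog', '\s+')
return
  (: Pad the word with spaces so that only whole words match :)
  <word on-list="{contains($stoplist, concat(' ', $word, ' '))}">{$word}</word>
```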

      There are, however, significant differences in performance. A set of unit tests (Unit Tests) compares a number of these alternatives.

      What these tests reveal is that, on the eXist-db platform, both of the implementations suggested above are far from optimal. Testing against a sequence of strings takes about a fifth of the time taken to test against elements. A general comparison is faster than a quantified expression by a similar margin.


      Recommended Practice

      It would appear that the preferable approach is:

      let $stopwords := ("a","and","in","the","or","over")
      let $input-string :=  'a quick brown fox jumps over the lazy dog'
      let $input-words := tokenize($input-string, '\s+')
      return 
          for $word in $input-words 
          return $stopwords = $word
      

      If the stop words are held as an element, it is better to convert them to a sequence of atomic values (strings) first:

      let $stopwords :=
      <words>
         <word>a</word>
         <word>and</word>
         <word>in</word>
         <word>the</word>
         <word>or</word>
         <word>over</word>
      </words>
      let $stopwordsx := $stopwords/word/string(.)
      let $input-string :=  'a quick brown fox jumps over the lazy dog'
      let $input-words := tokenize($input-string, '\s+')
      return 
          for $word in $input-words 
          return $stopwordsx = $word
      

      Note that referencing the stop list in the database slightly improved performance.
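
      The programs above report, for each word, whether it is on the list. To actually remove the stop words from the text, the same comparison can be used inside a predicate. This is a minimal sketch built on the recommended sequence-of-strings form; $kept is an illustrative name.

```xquery
xquery version "1.0";

let $stopwords := ("a", "and", "in", "the", "or", "over")
let $input-string := 'a quick brown fox jumps over the lazy dog'
let $input-words := tokenize($input-string, '\s+')
(: Keep only the words that match nothing in the stop list :)
let $kept := $input-words[not(. = $stopwords)]
return string-join($kept, ' ')
```

      For the sample input this returns the string "quick brown fox jumps lazy dog".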

      Last modified on 20 November 2009, at 14:59