XQuery/Latent Semantic Indexing

Motivation

You have a collection of documents and, for any given document, you want to find the documents that are most similar to it.

Method

We will use a text-mining technique called "Latent Semantic Indexing". We first create a matrix of all concept words (terms) by all documents. Each cell holds the frequency of one term in one document. We then send this term-document matrix to a service that performs a standard Singular Value Decomposition, or SVD. SVD is a very compute-intensive algorithm that can take many hours or days of calculation if you have a large number of words and documents. The SVD service then returns a set of "Concept Vectors" that can be used to group related documents.

Sample Data

To keep the example simple, we will just use the document titles, not the full documents.

Here are some document titles:

XQuery Tutorial and Cookbook 
XForms Tutorial and Cookbook 
Auto-generation of XForms with XQuery 
Building RESTful Web Applications with XRX 
XRX Tutorial and Cookbook 
XRX Architectural Overview 
The Return on Investment of XRX 

Our first step will be to build a Word-Document Matrix. This matrix has one row for each word found in the documents and one column for each document.

We will do this in several steps.

  1. Get all the words from all the documents and put them into a single sequence
  2. Create a list of the distinct words that are not "stop words"
  3. For each word:
    1. For each document, count how often this word appears in that document
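The steps above can be sketched in a few lines of XQuery. This is a minimal illustration, not the program from this page: the inline $titles and $stopwords sequences and the matrix, row and cell element names are assumptions made so the query is self-contained, and it stores a raw count per document rather than any weighted value.

```xquery
xquery version "1.0";

(: hypothetical inline titles standing in for the stored documents :)
let $titles := ('XQuery Tutorial and Cookbook',
                'XRX Tutorial and Cookbook',
                'The Return on Investment of XRX')

(: hypothetical stop word list, kept as a plain sequence of strings :)
let $stopwords := ('a', 'and', 'in', 'the', 'of', 'or', 'on', 'over', 'with')

(: step 1: all words from all titles in a single sequence :)
let $all-words :=
   for $title in $titles
   return tokenize($title, '\s+')

(: step 2: distinct, sorted words that are not stop words :)
let $concept-words :=
   for $word in distinct-values($all-words)
   where not(lower-case($word) = $stopwords)
   order by $word
   return $word

(: step 3: for each word, count its occurrences in each title :)
return
<matrix>
{
   for $word in $concept-words
   return
      <row word="{$word}">
      {
         for $title in $titles
         return <cell>{count(tokenize($title, '\s+')[. = $word])}</cell>
      }
      </row>
}
</matrix>
```

With these three titles, the row for XRX holds the counts 0, 1 and 1. Comparing whole tokens with `. = $word` avoids the partial matches that a substring test can produce.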


Sample Word-Document Matrix

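| Word | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| Applications | | | | 0.03125 | | | |
| Architectural | | | | | | 0.03125 | |
| Auto-generation | | | 0.03125 | | | | |
| Building | | | | 0.03125 | | | |
| Cookbook | 0.03125 | 0.03125 | | | 0.03125 | | |
| Investment | | | | | | | 0.03125 |
| Overview | | | | | | 0.03125 | |
| RESTful | | | | 0.03125 | | | |
| Return | | | | | | | 0.03125 |
| Tutorial | 0.03125 | 0.03125 | | | 0.03125 | | |
| Web | | | | 0.03125 | | | |
| XForms | | 0.03125 | 0.03125 | | | | |
| XQuery | 0.03125 | | 0.03125 | | | | |
| XRX | | | | 0.03125 | 0.03125 | 0.03125 | 0.03125 |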

Sample Program Source

xquery version "1.0";

declare option exist:serialize "method=xhtml media-type=text/html indent=yes";

(: this is where we get our data :)
let $app-collection := '/db/apps/latent-semantic-analysis'
let $data-collection := concat($app-collection , '/data')

(: get all the titles where $titles is a sequence of titles :)
let $titles := collection($data-collection)/html/head/title/text()
let $doc-count := count($titles)

(: stop words: common words that are excluded from the concept words :)
let $stopwords :=
<words>
   <word>a</word>
   <word>and</word>
   <word>in</word>
   <word>the</word>
   <word>of</word>
   <word>or</word>
   <word>on</word>
   <word>over</word>
   <word>with</word>
</words>

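(: a sequence of words in all the document titles :)
(: normalize-space trims each title and collapses runs of whitespace, :)
(: so splitting on \s (the generic whitespace regular expression) never yields empty words :)
let $all-words :=
   for $title in $titles
      return
         tokenize(normalize-space($title), '\s')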

(: just get a distinct list of the sorted words that are not stop words :)
let $concept-words :=
   for $word in distinct-values($all-words)
   order by $word
      return
         if ($stopwords/word = lower-case($word))
            then ()
            else $word

let $total-word-count := count($all-words)
return
<html>
    <head>
        <title>All Document Words</title>
    </head>
    <body>
        <p>Doc count = <b>{$doc-count}</b> Word count = <b>{$total-word-count}</b></p>
        
        <h2>Documents</h2>
        <ol>
        {for $title in $titles
           return
               <li>{$title}</li>
         }
         </ol>
         
         <h2>Word-Document Matrix</h2>
         <table border="1">
            <thead>
               <tr>
               <th>Word</th>
               {for $doc at $count in $titles
                       return
                          <th>{$count}</th>
                    }
               </tr>
            </thead>
             {for $word in $concept-words
             return
                 <tr>
                    <td>{$word}</td>
                    {for $title in $titles
                       return
                          <td>{(: substring test: every matching cell gets the same weight, 1 / total words :)
                                 if (contains($title, $word))
                                 then (1 div $total-word-count)
                                 else (' ')}</td>
                    }
                 </tr>
             }
          </table>
    </body>
</html>

Creating Sigma Values

The Sigma matrix is a diagonal matrix of singular values. In the decomposition it sits between the word vectors and the document vectors and is multiplied by both:

[Word Document Matrix] = [Word Vectors] X [Sigma Values] X [Document Vectors]
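
In standard SVD notation the same factorization can be written as follows. The symbols are assumptions introduced here for illustration and do not appear elsewhere on this page: A is the m × n word-document matrix (m words, n documents), U holds the word vectors, V holds the document vectors, r is the rank of A, and k is a chosen number of concepts to keep.

```latex
\begin{align}
A &= U \, \Sigma \, V^{T},
  \qquad A \in \mathbb{R}^{m \times n},\;
         U \in \mathbb{R}^{m \times r},\;
         V^{T} \in \mathbb{R}^{r \times n} \\
\Sigma &= \operatorname{diag}(\sigma_1, \sigma_2, \dots, \sigma_r),
  \qquad \sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0 \\
A_k &= U_k \, \Sigma_k \, V_k^{T},
  \qquad k \le r
\end{align}
```

The last line keeps only the k largest sigma values together with the matching columns of U and V. This rank-k approximation is what lets documents be compared in a small concept space instead of the full word space.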