XQuery/Latent Semantic Indexing
Motivation
You have a collection of documents, and for any given document you want to find the documents that are most similar to it.
Method
We will use a text-mining technique called "Latent Semantic Indexing". We first create a matrix of all concept words (terms) by all documents. Each cell holds the frequency of one term in one document. We then send this term-document matrix to a service that performs a standard Singular Value Decomposition (SVD). SVD is compute-intensive and can take many hours or days of calculation if you have a large number of words and documents. The SVD service then returns a set of "Concept Vectors" that can be used to group related documents.
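The text does not say how concept vectors are compared once the service returns them. One common choice in Latent Semantic Indexing, given here as an assumption rather than as the method this example uses, is cosine similarity. The names d_i and d_j (the concept vectors of two documents) and k (the number of concepts kept) are introduced only for this sketch:

```latex
\[
\operatorname{sim}(d_i, d_j)
  = \frac{d_i \cdot d_j}{\lVert d_i \rVert \, \lVert d_j \rVert}
  = \frac{\sum_{c=1}^{k} d_{i,c}\, d_{j,c}}
         {\sqrt{\sum_{c=1}^{k} d_{i,c}^{2}} \; \sqrt{\sum_{c=1}^{k} d_{j,c}^{2}}}
\]
```

A value close to 1 means the two vectors point in nearly the same direction, so the documents would be treated as closely related.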
Sample Data
To keep the example simple, we will use only the document titles, not the full documents.
Here are some document titles:
- XQuery Tutorial and Cookbook
- XForms Tutorial and Cookbook
- Auto-generation of XForms with XQuery
- Building RESTful Web Applications with XRX
- XRX Tutorial and Cookbook
- XRX Architectural Overview
- The Return on Investment of XRX
Our first step will be to build a Word-Document Matrix. This matrix has one row for each word in the documents and one column for each document.
We will do this in several steps.
- Get all the words from all the documents and put them into a single sequence
- Create a list of the distinct words that are not "stop words"
- For each word:
  - For each document, count how often the word appears in the document
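The first two steps can be sketched without a database by holding a few of the sample titles inline. This is a minimal illustration, not the full program: the three inline titles and the shortened stop-word list are assumptions chosen to keep the sketch small.

```xquery
xquery version "1.0";

(: Sample titles held inline so the sketch runs without a database. :)
let $titles := (
    'XQuery Tutorial and Cookbook',
    'XForms Tutorial and Cookbook',
    'Auto-generation of XForms with XQuery'
)

(: Illustrative stop-word list; an assumption, not a complete list. :)
let $stopwords := ('a', 'and', 'of', 'the', 'with')

(: Step 1: all words from all titles in one sequence. :)
let $all-words :=
    for $title in $titles
    return tokenize(normalize-space($title), '\s+')

(: Step 2: distinct words, sorted, with stop words removed. :)
let $concept-words :=
    for $word in distinct-values($all-words)
    where not(lower-case($word) = $stopwords)
    order by $word
    return $word

return
    <words>{
        for $word in $concept-words
        return <word>{$word}</word>
    }</words>
```

The general comparison `lower-case($word) = $stopwords` is true when the word equals any item in the stop-word sequence, which keeps the filter to a single `where` clause.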
Sample Word-Document Matrix
Word | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|---|
Applications | | | | 0.03125 | | | |
Architectural | | | | | | 0.03125 | |
Auto-generation | | | 0.03125 | | | | |
Building | | | | 0.03125 | | | |
Cookbook | 0.03125 | 0.03125 | | | 0.03125 | | |
Investment | | | | | | | 0.03125 |
Overview | | | | | | 0.03125 | |
RESTful | | | | 0.03125 | | | |
Return | | | | | | | 0.03125 |
Tutorial | 0.03125 | 0.03125 | | | 0.03125 | | |
Web | | | | 0.03125 | | | |
XForms | | 0.03125 | 0.03125 | | | | |
XQuery | 0.03125 | | 0.03125 | | | | |
XRX | | | | 0.03125 | 0.03125 | 0.03125 | 0.03125 |

Columns 1 to 7 are the document titles in the order listed above. Each non-empty cell holds 1/32 = 0.03125, because the seven titles contain 32 words in total, stop words included.
Sample Program Source
xquery version "1.0";
declare option exist:serialize "method=xhtml media-type=text/html indent=yes";

(: this is where we get our data :)
let $app-collection := '/db/apps/latent-semantic-analysis'
let $data-collection := concat($app-collection, '/data')

(: get all the titles where $titles is a sequence of titles :)
let $titles := collection($data-collection)/html/head/title/text()
let $doc-count := count($titles)

(: a list of stop words: common words that are left out of the matrix :)
let $stopwords :=
    <words>
        <word>a</word>
        <word>and</word>
        <word>in</word>
        <word>the</word>
        <word>of</word>
        <word>or</word>
        <word>on</word>
        <word>over</word>
        <word>with</word>
    </words>

(: a sequence of words in all the document titles :)
(: \s+ matches one or more whitespace characters :)
let $all-words :=
    for $title in $titles
    return
        tokenize($title, '\s+')

(: just get a distinct list of the sorted words that are not stop words :)
let $concept-words :=
    for $word in distinct-values($all-words)
    order by $word
    return
        if ($stopwords/word = lower-case($word))
        then ()
        else $word

(: the total counts every word in every title, including stop words and repeats :)
let $total-word-count := count($all-words)

return
<html>
    <head>
        <title>All Document Words</title>
    </head>
    <body>
        <p>Doc count = <b>{$doc-count}</b> Word count = <b>{$total-word-count}</b></p>
        <h2>Documents</h2>
        <ol>
            {for $title in $titles
             return
                <li>{$title}</li>
            }
        </ol>
        <h2>Word-Document Matrix</h2>
        <table border="1">
            <thead>
                <tr>
                    <th>Word</th>
                    {for $doc at $count in $titles
                     return
                        <th>{$count}</th>
                    }
                </tr>
            </thead>
            {for $word in $concept-words
             return
                <tr>
                    <td>{$word}</td>
                    {for $title in $titles
                     return
                        <td>{
                            (: contains() is a substring test; every match gets the same weight, 1 / total word count :)
                            if (contains($title, $word))
                            then (1 div $total-word-count)
                            else (' ')
                        }</td>
                    }
                </tr>
            }
        </table>
    </body>
</html>
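The program above fills a cell when the title contains the word as a substring, so a word such as "Web" would also match a title containing "Website", and a word that occurs twice in a title is still counted once. A minimal sketch of counting whole-word occurrences instead is shown here. The function name `local:term-count` is hypothetical and is not part of the original program.

```xquery
xquery version "1.0";

(: Hypothetical helper, not part of the original program:
   counts how many whitespace-separated tokens in $title equal $word exactly. :)
declare function local:term-count($title as xs:string, $word as xs:string) as xs:integer {
    count(tokenize(normalize-space($title), '\s+')[. = $word])
};

(: Two of the sample titles, held inline so the sketch runs without a database. :)
let $titles := (
    'XQuery Tutorial and Cookbook',
    'Auto-generation of XForms with XQuery'
)
for $title in $titles
return
    <count title="{$title}">{local:term-count($title, 'XQuery')}</count>
```

In the table cell, `contains($title, $word)` could be replaced by a test such as `local:term-count($title, $word) gt 0`, and the count itself could be divided by the total word count to give a per-document frequency. For the sample titles the resulting matrix is unchanged, because no concept word occurs more than once in a title.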
Creating Sigma Values
The Sigma matrix is a diagonal matrix of singular values. It is multiplied by both the word vectors and the document vectors:
[Word Document Matrix] = [Word Vectors] X [Sigma Values] X [Document Vectors]
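The same factorization can be written in standard SVD notation. The symbols A, U, Σ, V, m, n and r are not used in the text and are introduced here as assumptions: A is the word-document matrix, U holds the word vectors, Σ holds the sigma values, and the transpose of V holds the document vectors.

```latex
\[
A = U \, \Sigma \, V^{T},
\qquad
A \in \mathbb{R}^{m \times n},\;
U \in \mathbb{R}^{m \times r},\;
\Sigma = \operatorname{diag}(\sigma_1, \ldots, \sigma_r),\;
V^{T} \in \mathbb{R}^{r \times n},
\qquad
\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r \ge 0,
\quad r \le \min(m, n)
\]
```

For the sample matrix, m = 14 words and n = 7 documents, so there are at most 7 sigma values. Latent Semantic Indexing typically keeps only the few largest sigma values and the matching columns of U and V, which gives a lower-dimensional approximation of A.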