XQuery/Auto-generation of Index Config Files
Motivation
You want to automatically generate an index configuration file based on instance data or XML Schema data.
Creating an index configuration file is difficult for new users. To help them get started, it is frequently beneficial to generate a sample collection.xconf file from a simple analysis of the sample instance data or XML Schemas that they provide.
Index Types
There are several types of indexes you may want to create. Range indexes are very useful when you have identifiers or you want to sort results based on element content. Full-text indexes are most frequently used for natural-language text that contains full sentences with punctuation.
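As a rough illustration of the two index types, a collection.xconf file for eXist-db might look like the sketch below. The element names `id` and `description` are hypothetical placeholders, and the `<range>` form assumes an eXist-db release that ships the newer range index module.

```xml
<collection xmlns="http://exist-db.org/collection-config/1.0">
    <index>
        <!-- Full-text index on an element that holds sentences (hypothetical name) -->
        <lucene>
            <text qname="description"/>
        </lucene>
        <!-- Range index on an identifier element (hypothetical name) -->
        <range>
            <create qname="id" type="xs:string"/>
        </range>
    </index>
</collection>
```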
FullText Indexes
The following describes one way to do this.
Lucene full-text indexes are most useful on elements that contain full sentences. One approach is to scan an instance document for full sentences by looking for longer strings with punctuation. Although a full implementation would involve a natural language processing library such as Apache UIMA, we can begin with some very simple rules.
Here are some sample steps in the process for non-mixed-text content. Mixed text can also be done but the steps are more complex:
- get a list of all elements in a sample instance file
- classify the elements according to whether they have simple or complex content
- if they have simple content, look for sentences (spaces and punctuation)
- for each element that contains full text, create a Lucene index
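The steps above can be sketched in XQuery as follows. This is a minimal illustration, not a definitive implementation: the sample document, the function name `local:looks-like-sentence`, and the threshold of five words are all assumptions made for the example, and the sketch only handles elements without a namespace.

```xquery
xquery version "3.1";

(: Hypothetical sample document; on real data use doc() or collection(). :)
declare variable $sample :=
    <book>
        <id>b-1001</id>
        <title>XQuery Indexing</title>
        <abstract>This book explains indexes. It also shows how to configure them.</abstract>
        <chapter>
            <para>Range indexes speed up lookups, and sorting is faster too.</para>
        </chapter>
    </book>;

(: Assumed heuristic: at least five words plus some sentence punctuation. :)
declare function local:looks-like-sentence($text as xs:string) as xs:boolean {
    count(tokenize(normalize-space($text), ' ')) ge 5
    and matches($text, '[.!?,;]')
};

(: Step 1: list all elements in the sample. :)
let $elements := $sample/descendant-or-self::*
(: Step 2: keep elements with simple content (no child elements, some text). :)
let $simple := $elements[empty(*)][normalize-space(.) ne '']
(: Step 3: keep the names of elements whose text looks like sentences. :)
let $names :=
    distinct-values(
        for $e in $simple
        where local:looks-like-sentence(string($e))
        return name($e)
    )
(: Step 4: emit one Lucene <text> entry per element name. :)
return
    <collection xmlns="http://exist-db.org/collection-config/1.0">
        <index>
            <lucene>{
                for $n in $names
                return <text qname="{$n}"/>
            }</lucene>
        </index>
    </collection>
```

With this sample, the short `id` and `title` values fail the heuristic, while `abstract` and `para` pass it and each get a `<text>` entry. Real data will likely need a tuned threshold and a check across many instances of each element rather than a single occurrence.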
Sample Code for Namespace Generation
This creates a Lucene index on <foo>, both without a namespace and in every namespace that is used by a root element in the collection.
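(: Assumed example values; the original snippet leaves these variables undefined. :)
let $dataLocation := '/db/data'
(: eXist-db looks up index configuration under /db/system/config plus the
   collection path; this target collection must already exist. :)
let $indexLocation := concat('/db/system/config', $dataLocation)
let $qt := '"'
let $eq := '='
(: Distinct, non-empty namespace URIs of the root elements in the collection :)
let $defaultNamespaces :=
    for $value in distinct-values(
        for $doc in collection($dataLocation)
        let $ns := namespace-uri($doc/*)
        return
            if ($ns)
            then $ns
            else ()
    )
    return element ns { $value }
(: Opening tags; the <index> start tag is left open for the namespace declarations :)
let $index1 :=
    "<collection xmlns='http://exist-db.org/collection-config/1.0'><index"
(: One declaration per namespace, with generated prefixes ns1, ns2, ... :)
let $index2 :=
    for $ns in $defaultNamespaces
    return concat(' xmlns:ns',index-of($defaultNamespaces,$ns),$eq,$qt,$ns,$qt)
(: Close the <index> start tag, configure the analyzers, index <foo> in no namespace :)
let $index3 :=
    "><fulltext default='none' attribute='no'/><lucene><analyzer
    class='org.apache.lucene.analysis.standard.StandardAnalyzer'/><analyzer
    id='ws' class='org.apache.lucene.analysis.WhitespaceAnalyzer'/><text
    qname='foo'/>"
(: Index <foo> in each namespace that was found :)
let $index4 :=
    for $ns in $defaultNamespaces
    let $prefix := concat('ns',index-of($defaultNamespaces,$ns))
    return concat('<text qname=',$qt,$prefix,':foo',$qt,'/>')
let $index5 :=
    "</lucene></index></collection>"
(: util:parse() is an eXist-db extension function; the standard fn:parse-xml()
   is an alternative on releases where util:parse() is not available. :)
let $index := util:parse(string-join(($index1,$index2,$index3,$index4,$index5),""))
let $status := xmldb:store($indexLocation,"collection.xconf",$index)
let $result := xmldb:reindex($dataLocation)
(: Return the stored path and the reindex status :)
return ($status, $result)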