XQuery/Auto-generation of Index Config Files

Motivation

You want to automatically generate an index configuration file based on instance data or an XML Schema.

Creating an index configuration file is difficult for new users. To help them get started, it is frequently beneficial to generate a sample collection.xconf file from a simple analysis of the sample instance data or XML Schemas that they provide.

Index Types

There are several types of indexes you may want to create. Range indexes are very useful when you have identifiers or when you want to sort results based on element content. Fulltext indexes are most frequently used for natural-language text that contains full sentences with punctuation.
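
As a rough illustration of the two index types, the following sketch shows a small configuration, written as an XQuery element constructor, with one Lucene fulltext index and one range index. The element names description and id are hypothetical, and the create element assumes eXist-db's older range index syntax; check the syntax your version expects before using it.

```xquery
(: Hypothetical element names: "description" holds prose, "id" holds identifiers. :)
<collection xmlns="http://exist-db.org/collection-config/1.0">
    <index xmlns:xs="http://www.w3.org/2001/XMLSchema">
        <!-- Fulltext index for sentence-like content -->
        <lucene>
            <text qname="description"/>
        </lucene>
        <!-- Range index for identifiers and sortable values (assumed legacy syntax) -->
        <create qname="id" type="xs:string"/>
    </index>
</collection>
```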

FullText Indexes

The following outlines one way to decide which elements should receive a fulltext index.

Lucene fulltext indexes are most useful when they index full sentences. One approach is to scan an instance document for full sentences by looking for longer strings that contain punctuation. Although a full implementation would involve a "Natural Language Processor" library such as Apache UIMA, we can begin with some very simple rules.

Mixed content can also be handled, but the steps are more complex. Here are some sample steps for content that is not mixed:

  1. Get a list of all elements in a sample instance file.
  2. Classify the elements according to whether they have simple or complex content.
  3. For elements with simple content, look for sentences (spaces and punctuation).
  4. For each element that contains sentence-like text, create a Lucene index.
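
The steps above can be sketched in plain XQuery as follows. This is only a minimal sketch under stated assumptions: the sample document, the function names in the local namespace, and the sentence rule (at least four words plus sentence-ending punctuation) are all made up for illustration and would need tuning for real data.

```xquery
(: Hypothetical sample document; in practice this would come from doc() or collection(). :)
declare variable $local:sample :=
  <record>
    <id>A-1001</id>
    <title>Index tuning</title>
    <abstract>Range indexes speed up lookups. Fulltext indexes support keyword search.</abstract>
    <author><name>Ann Lee</name></author>
  </record>;

(: Step 2: treat an element as simple content when it has text but no child elements. :)
declare function local:has-simple-content($e as element()) as xs:boolean {
  empty($e/*) and normalize-space($e) ne ''
};

(: Step 3: crude sentence test (an assumption, not a real language processor):
   at least four words, and a '.', '!' or '?' followed by whitespace or the end. :)
declare function local:looks-like-sentence($text as xs:string) as xs:boolean {
  count(tokenize(normalize-space($text), ' ')) ge 4
  and matches($text, '[.!?](\s|$)')
};

(: Steps 1 to 3: walk all elements and keep the distinct names that pass both tests. :)
declare function local:fulltext-names($root as element()) as xs:string* {
  distinct-values(
    for $e in $root/descendant-or-self::*
    where local:has-simple-content($e) and local:looks-like-sentence(string($e))
    return local-name($e)
  )
};

(: Step 4: emit one Lucene text entry per candidate element name. :)
<lucene>{
  for $name in local:fulltext-names($local:sample)
  return <text qname="{$name}"/>
}</lucene>
```

With the sample above, only abstract passes both tests: id, title and name are too short, and record and author have child elements. Names are compared by local name only, so elements in different namespaces would need the prefix handling shown in the next section.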

Sample Code for Namespace Generation

The following query creates a Lucene index on the element foo, once without a namespace and once for each root-element namespace that is used in the collection.

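(: Example locations (assumptions; adjust to your database). eXist-db looks for
   collection.xconf under /db/system/config followed by the data collection path. :)
declare variable $dataLocation := "/db/data";
declare variable $indexLocation := "/db/system/config/db/data";

(: Helper strings used when assembling attribute text. :)
declare variable $eq := "=";
declare variable $qt := "'";

(: Collect the distinct, non-empty namespace URIs of all root elements. :)
let $defaultNamespaces :=
  distinct-values(
    for $doc in collection($dataLocation)
    let $ns := string(namespace-uri($doc/*))
    where $ns ne ""
    return $ns
  )

(: Open the configuration and declare one prefix (ns1, ns2, ...) per namespace. :)
let $index1 :=
  "<collection xmlns='http://exist-db.org/collection-config/1.0'><index"
let $index2 :=
  for $ns at $pos in $defaultNamespaces
  return concat(" xmlns:ns", $pos, $eq, $qt, $ns, $qt)

(: Lucene section: analyzers plus an index on foo in no namespace. :)
let $index3 := concat(
  "><fulltext default='none' attribute='no'/><lucene>",
  "<analyzer class='org.apache.lucene.analysis.standard.StandardAnalyzer'/>",
  "<analyzer id='ws' class='org.apache.lucene.analysis.WhitespaceAnalyzer'/>",
  "<text qname='foo'/>")

(: One index on foo per namespace prefix. :)
let $index4 :=
  for $ns at $pos in $defaultNamespaces
  return concat("<text qname=", $qt, "ns", $pos, ":foo", $qt, "/>")
let $index5 :=
  "</lucene></index></collection>"

(: Parse the assembled string, store it as collection.xconf, then reindex.
   Where available, the standard parse-xml() function is an alternative to util:parse(). :)
let $index := util:parse(string-join(($index1, $index2, $index3, $index4, $index5), ""))
let $status := xmldb:store($indexLocation, "collection.xconf", $index)
let $result := xmldb:reindex($dataLocation)
return ($status, $result)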