XQuery/Splitting Files

Motivation

You have a single large XML document with many consistent records in it. You want to split it into many smaller documents so that each can be edited by a separate user. There are many good reasons to split large files up. Some have to do with how much data you want to load into an editor at a time or how you want to publish individual files to a remote site.

eXist and many other systems do versioning and keep date/time stamps for each file. Using smaller files these functions may be easier to do.

Method

We will create an XQuery that will iterate through all the records in the document. For each record we will use the XQuery function to store a document in a collection. The format of this function is:

xmldb:store($collection, $filename, $data)

Where:

$collection is a string that holds the path to the collection we will be storing the data for each record. For example '/db/test/data'
$filename is the name of the file. The name can either be derived from the data or it can be generated by a sequence counter in the split query. For example 'Hello.xml" or "1.xml".
$data is the data we will be storing into the file

Sample Input XML Document

One way to get started is to put your data such as business terms into a spreadsheet and convert it to XML. The oXygen XML editor has some very good tools for converting spreadsheets to XML format. Make sure you put "Term" and "Definition" in the first row and use this row to define the element names.

<root>
   <row>
      <Term>Hi</Term>
      <Definition>An informal short greeting.</Definition>
   </row>
   <row>
      <Term>Hello</Term>
      <Definition>A more formal greeting.</Definition>
   </row>
</root>

Sample XQuery

xquery version "1.0";

let $input-document := '/db/test/input.xml'
let $collection := '/db/test/terms'

(: the login used must have write access to the collection :)
let $output-collection := xmldb:login($collection, 'my-login', 'my-password')

return
<SplitResults>{
     for $term-data in doc($input-document)/root/row
        (: For brevity we will create a file name with the term name.  Change this to be an ID function if you want :)
        let $term-name := $term-data/Term/text()
        let $documentname := concat($term-name, '.xml')
        let $store-return := xmldb:store($collection, $documentname, $term-data)
     return
        <store-result>
           <store>{$term-name}</store>
           <documentname>{$documentname}</documentname>
        </store-result>
}</SplitResults>

Using A Sequence Counter for Artificial Keys

Sometimes there are not any elements in the importing record that can be used as a unique key or are not appropriate to use as an artificial key. In this case you will want to use a counter to create an XML document with a unique number in it. The sequence number generated is called an "artificial key" since it is not really related directly to any data elements in the record.

You can achieve this by adding an "at counter" to your for loop. To do this just add the string at $count after the for variable like the following

for $term-data at $count in $input-file/row

The store function can then use the $count variable to create a file name with this number:

let $filename := concat($count, '.xml')

Adding a ID to each item using the XQuery update Operator

Once you have inserted the data into a collection you will then want to assign each item a unique ID. This is called an artificial key since it is created by an artifical import process and is not related to data inside of the item. Artificial keys are usually assigned by the computer system that stores the data but not derived from the data.

<item>
   <person-name>John Doe</person-name>
   ...
</item>

You can also automatically add an ID to each item by doing the following:

  for $item at $count in $items
  return
     update insert <id>{$count}</id> preceding $item/person-name

After this update the new ID element will be inserted before the person-name element:

<item>
   <id>47</id>
   <person-name>John Doe</person-name>
   ...
</item>

It is a best practice to make sure that items do not already have an ID element.

  for $item at $count in $items[not(id)]
  return
     update insert <id>{$count}</id> preceding $item/person-name

This prevents duplicate ids from being added if the script gets run twice. You can also modify this to start the count one higher then the largest id in a collection.

  (: get the largest ID in the collection :)
  let $largest-id := max(  collection($my-collection)/*/id/text() )
  let $offset := $largest-id + 1
  for $item at $count in $items[not(id)]
  return
     update insert <id>{$count + $offset}</id> preceding $item/person-name

References

The split pattern is documented in the Enterprise Pattern Integration Web site. Note that pattern is called "Splitter" dispite the fact that the name in the URL is "Sequencer".

Also note that the size of the file you select to load into the client has a large impact on the way that concurrent edits are performed. This has a large impact on what data needs to be locked for editing. See XRX Locking Grain Design for more information.