XQuery/Parsing CSV

Motivation

You want to parse a file of comma-separated values (CSV) text into an xml structure or you have a flat file with very similar line/field structure that you want to convert into XML format.

Method

We will use the tokenize($input, '\n') function to break the input file into separate lines. We will then use the tokenize($input, '\s*,\s*') function to parse each line into separate fields. The regexp term \s* will remove whitespace.

Basic Example

xquery version "1.0";

let $csv-input-sample :=
'John, Smith, x123
Peg, Jones , x456
Sue, Adams , x789
Dan, McCoy , x321'

let $lines := tokenize($csv-input-sample, '\n')

return
<results>{
  for $line in $lines
  let $fields := tokenize($line, '\s*,\s*')
  return
  <row>{
    for $field in $fields
    return
      <field>{$field}</field>
  }</row>
}</results>

Execute

Sample Output

<results>
   <row>
      <field>John</field>
      <field>Smith</field>
      <field>x123</field>
   </row>
   <row>
      <field>Peg</field>
      <field>Jones</field>
      <field>x456</field>
   </row>
   <row>
      <field>Sue</field>
      <field>Adams</field>
      <field>x789</field>
   </row>
   <row>
      <field>Dan</field>
      <field>McCoy</field>
      <field>x321</field>
   </row>
</results>

Example With Data Dictionary in Row 1

This second example will use the first row of the CSV file as a data dictionary of the element names for each column of each row.

xquery version "1.0";

let $csv :=
'name,faculty
alice,anthropology
bob,biology'

let $lines := tokenize($csv, '\n')
let $head := tokenize($lines[1], ',')
let $body := remove($lines, 1)
return
    <people>
        {
            for $line in $body
            let $fields := tokenize($line, ',')
            return
                <person>
                    {
                        for $key at $pos in $head
                        let $value := $fields[$pos]
                        return
                            element { $key } { $value }
                    }
                </person>
        }
    </people>

Execute

Sample Output

<people>
  <person>
    <name>alice</name>
    <faculty>anthropology</faculty>
  </person>
  <person>
    <name>bob</name>
    <faculty>biology</faculty>
  </person>
</people>

Adding Configuration File Options

Many times you have a family of CSV files that all may have very similar options for import. In this case it is useful to be able to pass a series of configuration parameters to a single XQuery function. These configuration parameters include:

What the field delimiter is (comma is the default)
The element name of the root node
The element name of each line or row

<file-import-config>
   <field-separator>,<field-separator>
   <root-element-name>people</root-element-name>
   <line-element-name>person</line-element-name>
</file-import-config>

You can then use this configuration file in the CSV parser:

xquery version "1.0";
let $config := 
<file-import-config>
   <field-separator>:</field-separator>
   <root-element-name>People</root-element-name>
   <line-element-name>Person</line-element-name>
</file-import-config>

let $csv :=
'name:faculty
alice:anthropology
bob:biology'

let $lines := tokenize($csv, '\n')
let $head := tokenize($lines[1], $config/field-separator)
let $body := remove($lines, 1)
return
    element {$config/root-element-name}
        {
            for $line in $body
            let $fields := tokenize($line, $config/field-separator)
            return
                element {$config/line-element-name}
                    {
                        for $key at $pos in $head
                        let $value := $fields[$pos]
                        return
                            element { $key } { $value }
                    }
        }

Execute

CSV complications

The code above assumes a simple form of CSV. In practice CSV must handle more complex cases for which as simple use of tokeniser() is insufficient to parse a line. Strings containing one or more separators will be double-quoted. Quotes within quoted strings also need to be handled.

The following implementations handle some or all of these complications, although it is often not clear which complications are considered.

XSLT 2.0 Stylesheet for transforming CSV files to XML
Matthew Royal's parser Sadly this Marklogic only since it uses mutable maps.
Zorba CSV importer
BaseX CSV module
David Cassel another Marklogic specific implementation

Related work

Adam Retter's work on CSV schemas and Validation