Apache Ant/Converting Excel to XML

Motivation

You want to automatically extract a well-formed XML file from a binary Excel document.

Method

We will use the Ant java task within a build target.

Input File

We will create a sample Microsoft Excel file that has two columns like the following:

Screen image for spreadsheet input

Save this into a file 'sample.xls'.

Next, download the Apache Tika jar file and put it on your local hard drive.

You can download it from http://tika.apache.org/download.html. The main Tika jar file is about 27MB.

I put the tika jar file in D:\Apps\tika but you can change this.

Create a file called "build.xml".

Sources

<project name="extract-xml-from-xsl" default="extract-xml-from-xsl">
    <description>Sample: extract XML from an Excel xls file with Apache Tika</description>
    <!-- Directory that holds the Tika jar; change this to match your system -->
    <property name="lib.dir" value="D:\Apps\tika"/>
    <property name="input-file" value="sample.xls"/>
    <target name="extract-xml-from-xsl">
        <echo message="Extracting XML from Excel file: ${input-file}"/>
        <!-- Feed the spreadsheet to Tika on standard input; write its output to sample.xml -->
        <java jar="${lib.dir}/tika-app-1.3.jar" fork="true" failonerror="true"
            maxmemory="128m" input="${input-file}" output="sample.xml">
            <arg value="-x" />
        </java>
    </target>
</project>

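The <java> task runs Tika in a separate JVM. The input attribute feeds the Excel file to Tika on standard input, and the output attribute writes whatever Tika prints to sample.xml. The argument "-x" (for XML) tells Tika to extract the content as XML.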
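If the jar path is wrong, the java task fails without a very helpful message, so it can be worth checking the path first. The following is only a sketch, assuming the same lib.dir location and tika-app-1.3.jar file name used above; the target name check-tika-jar and the property tika.jar.present are made up for this illustration.

```xml
<project name="check-tika-jar" default="check-tika-jar">
    <!-- Assumed location of the Tika jar; same value as in the build file above -->
    <property name="lib.dir" value="D:\Apps\tika"/>
    <target name="check-tika-jar">
        <!-- available sets tika.jar.present only when the file exists -->
        <available file="${lib.dir}/tika-app-1.3.jar" property="tika.jar.present"/>
        <!-- Stop the build with a readable message if the property was not set -->
        <fail unless="tika.jar.present"
              message="Tika jar not found in ${lib.dir}"/>
    </target>
</project>
```

In a real build you would probably make the extraction target depend on a check like this instead of keeping it in a separate project.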

Other command line options are listed here: http://tika.apache.org/1.3/gettingstarted.html
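As an illustration of one of those options, the sketch below swaps "-x" for "-t", which asks Tika for plain text instead of XML. It assumes the same jar location and input file as above; the project and target name extract-text-from-xls and the output file sample.txt are hypothetical names chosen for this example.

```xml
<project name="extract-text-from-xls" default="extract-text-from-xls">
    <!-- Assumed values; same jar directory and input file as the XML example -->
    <property name="lib.dir" value="D:\Apps\tika"/>
    <property name="input-file" value="sample.xls"/>
    <target name="extract-text-from-xls">
        <echo message="Extracting text from Excel file: ${input-file}"/>
        <java jar="${lib.dir}/tika-app-1.3.jar" fork="true" failonerror="true"
            maxmemory="128m" input="${input-file}" output="sample.txt">
            <!-- -t requests plain text output rather than XML -->
            <arg value="-t"/>
        </java>
    </target>
</project>
```

Replacing "-t" with "-m" should give only the document metadata, which may be enough if you just need the author and dates.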

Now open a DOS or UNIX shell and cd into the directory that contains your build file. Type "ant" at the prompt.

Run

$ ant
Buildfile: D:\ws\doc-gen\trunk\build\tika\build.xml

     [echo] Extracting XML from Excel file: sample.xls

Total time: 1 second

Sample Output

Note that the output is a well-formed XHTML file with a table in it:

<html xmlns="http://www.w3.org/1999/xhtml">
        <meta name="meta:last-author" content="Dan" />
        <meta name="meta:creation-date" content="2013-03-04T17:20:19Z" />
        <meta name="dcterms:modified" content="2013-03-04T17:22:01Z" />
        <meta name="meta:save-date" content="2013-03-04T17:22:01Z" />
        <meta name="Last-Author" content="Dan" />
        <meta name="Application-Name" content="Microsoft Excel" />
        <meta name="dc:creator" content="Dan" />
        <meta name="Last-Modified" content="2013-03-04T17:22:01Z" />
        <meta name="Author" content="Dan" />
        <meta name="dcterms:created" content="2013-03-04T17:20:19Z" />
        <meta name="date" content="2013-03-04T17:22:01Z" />
        <meta name="modified" content="2013-03-04T17:22:01Z" />
        <meta name="creator" content="Dan" />
        <meta name="Creation-Date" content="2013-03-04T17:20:19Z" />
        <meta name="meta:author" content="Dan" />
        <meta name="extended-properties:Application" content="Microsoft Excel" />
        <meta name="Content-Type" content="application/vnd.ms-excel" />
        <meta name="Last-Save-Date" content="2013-03-04T17:22:01Z" />
        <div class="page"><h1>Sheet1</h1>