Last modified on 18 December 2011, at 15:48

XQuery/XML Differences

Motivation

You want to find the differences between two XML files and output a "colored diff" file of the differences.

Background on XML Differences

Unlike plain text files, XML files have structural properties that must be considered when comparing two of them.

For example, when comparing the attributes of an element, the order in which the attributes appear in the file is not significant. The following two lines are equivalent even though the order of the attributes is different:

<myelement attr1="abc" attr2="def"/>
<myelement attr2="def" attr1="abc"/>
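A quick way to see this in XQuery is the standard fn:deep-equal function, which compares the attributes of an element as an unordered set. The variable names below are only for illustration.

```xquery
xquery version "1.0";

(: Two elements that differ only in attribute order. :)
let $a := <myelement attr1="abc" attr2="def"/>
let $b := <myelement attr2="def" attr1="abc"/>

(: deep-equal ignores attribute order, so this returns true. :)
return deep-equal($a, $b)
```

A line-based text diff of the same two lines would report them as different.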

XML differencing also tends to ignore the spaces and tabs used to indent an XML file to make it more readable.

So the traditional Longest Common Subsequence (LCS) algorithms used by tools such as UNIX diff, GNU diff, or Subversion diff will not usually give us the results that we want. [1]

XML Differencing Algorithms

There are many different algorithms for comparing tree-structured data. Because hierarchical data can be so complex, each algorithm makes different trade-offs between precision and performance. There are also many options to consider. For example:

  • Do you want to ignore XML comments?
  • Do you want to ignore processing instructions (PIs)?
  • Do you want to ignore case (uppercase/lowercase) differences?
  • Do you want to ignore whitespace between elements?
  • Can you assume that the structure of the XML documents being compared is identical and only the text is different?
  • Do you care whether the order of attributes changes?
  • Do you want your differences algorithm to output a list of changes to be made on the first or second file?
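Some of these options can be handled by normalizing both documents before comparing them. The following is a minimal sketch, not a complete solution: local:normalize is a hypothetical helper that drops comments, processing instructions, and whitespace-only text nodes. The boundary-space declaration is there only so that the indentation in the sample data is kept as text nodes, as it typically would be in a parsed file.

```xquery
xquery version "1.0";

(: Keep indentation whitespace in the sample elements below. :)
declare boundary-space preserve;

(: Hypothetical helper: returns a copy of $n without comments,
   processing instructions, or whitespace-only text nodes. :)
declare function local:normalize($n as node()) as node()? {
  typeswitch ($n)
    case comment() return ()
    case processing-instruction() return ()
    case text() return
      if (normalize-space($n) eq '') then () else $n
    case element() return
      element { node-name($n) } {
        $n/@*,
        for $c in $n/node() return local:normalize($c)
      }
    default return $n
};

let $a := <doc><p>text</p></doc>
let $b := <doc>
  <p>text</p>
</doc>
return (
  (: comparison of the raw elements, indentation included :)
  deep-equal($a, $b),
  (: comparison after normalization :)
  deep-equal(local:normalize($a), local:normalize($b))
)
```

Ignoring case differences would need an additional step, such as applying lower-case to text values, and is left out of this sketch.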

For our first version we will just do a simple scan of the elements and text within the elements.

Method

We will create a recursive XQuery function that compares all the nodes of an XML file.
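As a starting point, here is one possible shape for such a function. It is a sketch under simplifying assumptions: child elements are matched by position, attributes and mixed content are not examined, and the old and new wrapper elements inside change are hypothetical names chosen for this example. The results are wrapped in the xml-diffs format used on this page.

```xquery
xquery version "1.0";

(: Hypothetical helper: compares two elements position by position
   and returns zero or more diff elements. :)
declare function local:compare($a as element(), $b as element())
    as element(diff)* {
  if (node-name($a) ne node-name($b)) then
    (: different element names: report the whole subtree as changed :)
    <diff><change><old>{$a}</old><new>{$b}</new></change></diff>
  else if (empty($a/*) and empty($b/*)) then
    (: leaf elements: compare text, ignoring extra whitespace :)
    (if (normalize-space($a) eq normalize-space($b)) then ()
     else <diff><change><old>{$a}</old><new>{$b}</new></change></diff>)
  else
    (: walk the child elements of both sides in parallel :)
    let $ca := $a/*
    let $cb := $b/*
    for $i in 1 to max((count($ca), count($cb)))
    return
      if ($i gt count($cb)) then <diff><deletion>{$ca[$i]}</deletion></diff>
      else if ($i gt count($ca)) then <diff><addition>{$cb[$i]}</addition></diff>
      else local:compare($ca[$i], $cb[$i])
};

let $old := <book><title>XQuery</title><year>2010</year><isbn>123</isbn></book>
let $new := <book><title>XQuery</title><year>2011</year></book>
return <xml-diffs>{ local:compare($old, $new) }</xml-diffs>
```

For the sample data, the function reports one change (the year element) and one deletion (the isbn element). Because matching is purely positional, an element inserted in the middle of a list would make every later sibling appear changed.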

XML Difference Output Format

We want to create an XML output format that allows the user to easily display the output using a side-by-side file comparison method.

For example the output might look like:

<xml-diffs>
  <parameters>
    <output-format-code>xml</output-format-code>
    <show-original-indicator>false</show-original-indicator>
  </parameters>
  <diff>
    <change>...</change>
  </diff>
  <diff>
    <addition>...</addition>
  </diff>
  <diff>
    <deletion>...</deletion>
  </diff>
</xml-diffs>

Formatting the output for HTML and CSS

The above output can be considered raw semantic markup, with no concern for how a web site wants to display it using standard HTML div blocks and CSS. As a second step, we can place the output in two HTML div blocks, one for the initial file, usually on the left, and one for the second file, usually on the right, with the changes marked using inline tags. Each div will have a class attribute that allows the CSS file to place the output anywhere on an HTML page. For example, the div for the first file may be placed on the left, and a marked change may be styled in green.
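The transformation from the raw diff format to HTML could be sketched as follows. The function name local:render and all class names (xml-diff, diff-left, diff-right, and the per-change classes taken from the element names) are assumptions made for this example, as is the choice of span for the inline markup.

```xquery
xquery version "1.0";

(: Hypothetical helper: renders an xml-diffs element as two div blocks.
   Changes and deletions go in the left block, changes and additions
   in the right block. All class names are illustrative. :)
declare function local:render($diffs as element(xml-diffs)) as element(div) {
  <div class="xml-diff">
    <div class="diff-left">{
      for $d in $diffs/diff/(change | deletion)
      return <span class="{local-name($d)}">{string($d)}</span>
    }</div>
    <div class="diff-right">{
      for $d in $diffs/diff/(change | addition)
      return <span class="{local-name($d)}">{string($d)}</span>
    }</div>
  </div>
};

local:render(
  <xml-diffs>
    <diff><change>2011</change></diff>
    <diff><addition>new chapter</addition></diff>
    <diff><deletion>old chapter</deletion></diff>
  </xml-diffs>
)
```

A stylesheet can then position the two div blocks side by side and color each span according to its class.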

Algorithm

The O(ND) difference algorithm was originally designed to compare text files, using lines as the fundamental unit of comparison. We will need to modify it to recursively compare XML elements and attributes. The XML comparison also should not report differences in the order of attributes.
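To illustrate the idea of using elements rather than lines as the unit of comparison, here is a naive longest common subsequence over two sequences of elements. This is not the O(ND) algorithm itself: it is a simple recursive version with exponential running time, suitable only for very short sequences. The function name local:lcs is hypothetical. Elements are matched with deep-equal, so attribute order does not affect the result.

```xquery
xquery version "1.0";

(: Hypothetical helper: naive recursive LCS over two element sequences.
   Returns the elements of $a that belong to a longest common subsequence. :)
declare function local:lcs($a as element()*, $b as element()*) as element()* {
  if (empty($a) or empty($b)) then ()
  else if (deep-equal($a[1], $b[1])) then
    (: heads match: keep the head and continue with both tails :)
    ($a[1], local:lcs(subsequence($a, 2), subsequence($b, 2)))
  else
    (: heads differ: try skipping one head on either side, keep the longer :)
    let $skip-a := local:lcs(subsequence($a, 2), $b)
    let $skip-b := local:lcs($a, subsequence($b, 2))
    return if (count($skip-a) ge count($skip-b)) then $skip-a else $skip-b
};

let $old := <list><item id="1" n="a"/><item id="2"/><item id="3"/></list>
let $new := <list><item n="a" id="1"/><item id="3"/><item id="4"/></list>
(: the common subsequence here is the items with id 1 and id 3 :)
return local:lcs($old/*, $new/*)
```

Elements of the first sequence that are not in the result can be reported as deletions, and elements of the second sequence without a match as additions.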

To be continued...

References

  1. S. Chawathe, A. Rajaraman, H. Garcia-Molina and J. Widom (June 1996). "Change Detection in Hierarchically Structured Information". Proceedings of the ACM SIGMOD International Conference on Management of Data, Montreal.