XQuery/XML Differences

Motivation

You want to find the differences between two XML files and output them as a "colored diff" file.

Background on XML Differences

Unlike a comparison of plain text files, a comparison of two XML files must take the structure of the XML into account.

For example, when comparing the attributes of an element, the order in which the attributes appear in the file is not significant. The following two lines are equivalent even though the order of the attributes is different:

<myelement attr1="abc" attr2="def"/>
<myelement attr2="def" attr1="abc"/>
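
In XQuery, the standard fn:deep-equal function compares attributes without regard to their order, so it can serve as a quick check of this claim. A minimal example, using the two elements above:

```xquery
xquery version "1.0";

(: Attribute order is not part of the comparison, so this evaluates to true. :)
deep-equal(
  <myelement attr1="abc" attr2="def"/>,
  <myelement attr2="def" attr1="abc"/>
)
```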

XML comparisons also tend to ignore the spaces and tabs used to indent an XML file for readability.

So the traditional line-based Longest Common Subsequence (LCS) algorithms used by tools such as UNIX diff, GNU diff, or the Subversion diff will usually not give us the results we want. [1]

XML Differencing Algorithms

There are many different algorithms for comparing tree-structured data. Because hierarchical data can be so complex, each algorithm makes its own trade-off between precision and performance. There are also many options to consider. For example:

Do you want to ignore XML comments?
Do you want to ignore processing instructions (PIs)?
Do you want to ignore case (uppercase/lowercase) differences?
Do you want to ignore whitespace between elements?
Can you assume that the structure of the XML documents being compared is identical and only the text is different?
Do you care whether the order of attributes changes?
Should your difference algorithm output the list of changes as edits to be made to the first file or to the second file?
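
One way to handle several of these options is to normalize both documents before comparing them. The following is only a sketch under stated assumptions: the function local:normalize and the options element with its attribute names are hypothetical and are not part of any library. It copies a node and drops or rewrites whatever the options say to ignore.

```xquery
xquery version "1.0";

(: Keep the whitespace in the literal XML below, so there is something to strip. :)
declare boundary-space preserve;

(: Hypothetical helper: copy a node, leaving out what the options say to ignore. :)
declare function local:normalize($node as node(), $opts as element(options)) as node()? {
  typeswitch ($node)
    case element() return
      element { node-name($node) } {
        $node/@*,
        for $child in $node/node()
        return local:normalize($child, $opts)
      }
    case comment() return
      if ($opts/@ignore-comments = 'true') then () else $node
    case processing-instruction() return
      if ($opts/@ignore-pis = 'true') then () else $node
    case text() return
      if ($opts/@ignore-whitespace = 'true' and normalize-space($node) = '')
      then ()
      else if ($opts/@ignore-case = 'true') then text { lower-case($node) }
      else $node
    default return $node
};

(: Hypothetical option names, chosen for this example only. :)
let $opts := <options ignore-comments="true" ignore-pis="true"
                      ignore-whitespace="true" ignore-case="false"/>
let $first := <a>
  <!-- a comment -->
  <b>Text</b>
</a>
let $second := <a><b>Text</b><?target data?></a>
return deep-equal(local:normalize($first, $opts), local:normalize($second, $opts))
```

With these options both inputs normalize to the same element, so the query should evaluate to true. Normalizing first keeps the comparison function itself free of option handling, at the cost of copying both documents.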

For our first version we will just do a simple scan of the elements and text within the elements.

Method

We will create a recursive XQuery function that compares the corresponding nodes of two XML files.
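
A minimal sketch of such a function is shown below. It is not the finished algorithm: the name local:diff-paths is hypothetical, child elements are matched purely by position, and attributes are not compared. It walks two elements in parallel and returns a string for each place where they differ.

```xquery
xquery version "1.0";

(: Hypothetical sketch: compare two elements position by position and
   return one string per difference found. :)
declare function local:diff-paths(
  $a as element()?, $b as element()?, $path as xs:string
) as xs:string* {
  if (empty($a) and empty($b)) then ()
  else if (empty($a)) then concat('added: ', $path, '/', name($b))
  else if (empty($b)) then concat('deleted: ', $path, '/', name($a))
  else
    let $here := concat($path, '/', name($a))
    return
      if (name($a) ne name($b)) then concat('changed: ', $here)
      else (
        (: compare the text directly inside this element :)
        if (string-join($a/text(), '') ne string-join($b/text(), ''))
        then concat('changed text: ', $here)
        else (),
        (: recurse into the child elements, pairing them by position :)
        for $i in 1 to max((count($a/*), count($b/*)))
        return local:diff-paths($a/*[$i], $b/*[$i], $here)
      )
};

local:diff-paths(
  <doc><title>One</title><p>same</p></doc>,
  <doc><title>Two</title><p>same</p><p>new</p></doc>,
  ''
)
```

For the two sample documents this should return "changed text: /doc/title" and "added: /doc/p". Matching by position is the simplest choice, but a single inserted element shifts every later sibling, which is one reason a sequence-alignment algorithm is needed for good results.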

XML Difference Output Format

We want to create an XML output format that allows the user to easily display the output using a side-by-side file comparison method.

For example, the output might look like:

<xml-diffs>
  <parameters>
    <output-format-code>xml</output-format-code>
    <show-original-indicator>false</show-original-indicator>
  </parameters>
  <diff>
    <change>...</change>
  </diff>
  <diff>
    <addition>...</addition>
  </diff>
  <diff>
    <deletion>...</deletion>
  </diff>
</xml-diffs>

Formatting the output for HTML and CSS

The output above is raw semantic markup; it says nothing about how a web site should display it using standard HTML div blocks and CSS. As a second step, we can place the output in two HTML <div>...</div> blocks: one for the initial file, usually on the left, and one for the second file, usually on the right, with each change wrapped in its own <div>...</div>. Each div has a class attribute that lets the CSS file position and style it anywhere on the HTML page. For example, the <div class="original"> may be placed on the left and the <div class="addition"> may be styled in green.
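
The class-tagging step can be sketched as follows. This is an illustration only: the function local:render is hypothetical, it uses the element name inside each diff as the CSS class, and it does not attempt the two-column layout described above.

```xquery
xquery version "1.0";

(: Hypothetical renderer: turn the content of each <diff> into a <div>
   whose class names the kind of difference, so CSS can style it. :)
declare function local:render($diffs as element(xml-diffs)) as element(div) {
  <div class="xml-diffs">{
    for $d in $diffs/diff/*
    return <div class="{ local-name($d) }">{ string($d) }</div>
  }</div>
};

local:render(
  <xml-diffs>
    <diff><change>new title</change></diff>
    <diff><addition>new paragraph</addition></diff>
    <diff><deletion>old paragraph</deletion></diff>
  </xml-diffs>
)
```

The result is one outer div containing three inner divs with the classes change, addition, and deletion, which a style sheet can then color or position.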

Algorithm

The O(ND) difference algorithm (Myers, 1986) was originally designed to compare text files, using the line as the fundamental unit of comparison. We will need to modify it to compare XML elements and attributes recursively. The XML comparison should also not report differences in the order of attributes.
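
The attribute part of that requirement can be handled separately from the sequence alignment. The sketch below is not the O(ND) algorithm itself; it only shows one way to compare the attributes of two elements by name, so that their order never produces a difference. The function local:attr-diffs is hypothetical, and putting the attribute name inside each result element is an assumption made for this example.

```xquery
xquery version "1.0";

(: Hypothetical helper: report attribute differences between two elements,
   matching attributes by name so that their order never matters. :)
declare function local:attr-diffs($a as element(), $b as element()) as element()* {
  (
    (: attributes of $a that are missing from $b or have another value :)
    for $x in $a/@*
    let $y := $b/@*[node-name(.) eq node-name($x)]
    order by name($x)
    return
      if (empty($y)) then <deletion>{ name($x) }</deletion>
      else if (string($x) ne string($y)) then <change>{ name($x) }</change>
      else ()
  ),
  (
    (: attributes that appear only in $b :)
    for $y in $b/@*
    where empty($a/@*[node-name(.) eq node-name($y)])
    order by name($y)
    return <addition>{ name($y) }</addition>
  )
};

local:attr-diffs(<e a="1" b="2"/>, <e b="3" c="4"/>)
```

For the two sample elements this should yield a deletion for a, a change for b, and an addition for c.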

To be continued...

References

[1] S. Chawathe, A. Rajaraman, H. Garcia-Molina and J. Widom (June 1996). "Change Detection in Hierarchically Structured Information". Proceedings of the ACM SIGMOD International Conference on Management of Data, Montreal.

Eugene Myers (1986). "An O(ND) Difference Algorithm and its Variations". Algorithmica, Vol. 1, No. 2, p. 251.
Yuan Wang, David J. DeWitt and Jin-Yi Cai. "X-Diff: An Effective Change Detection Algorithm for XML Documents". University of Wisconsin – Madison. http://www.cs.wisc.edu/niagara/papers/xdiff.pdf
