XQuery/Filtering Nodes

Motivation

You want to create filters that remove or replace specific nodes in an XML stream. The stream may consist of in-memory XML documents rather than documents stored on disk.

Method

To process all nodes in a tree, we start with a recursive function known as the identity transform. This function copies the source tree into the output tree without change. We begin with this function and then add exception processing for each filter.

(: return a deep copy of  the element and all sub elements :)
declare function local:copy($element as element()) as element() {
   element {node-name($element)}
      {$element/@*,
          for $child in $element/node()
              return
               if ($child instance of element())
                 then local:copy($child)
                 else $child
      }
};

This function uses an XQuery construct called a computed element constructor to construct an element. The format of the element constructor is the following:

  element {ELEMENT-NAME} {ELEMENT-VALUE}

In the above case ELEMENT-VALUE is another query that copies the attributes and then processes all the child nodes of the current element. The for loop iterates over the child nodes of the current element and applies the following pseudo-code to each:

  if the child is another element (this uses the "instance of" operator)
      then copy the child (recursively)
      else return the child (we have a leaf node of the tree, such as a text node)

If you understand the basic structure of this algorithm, you can modify it to keep only the nodes you want. Start with this template and modify the relevant sections.
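As a quick check of the identity transform, the following sketch copies a small in-memory document and compares the copy with the original. The sample data in $input is invented for illustration; the function is repeated so that the query is self-contained.

```xquery
xquery version "1.0";

(: identity transform, repeated from above so the query is self-contained :)
declare function local:copy($element as element()) as element() {
   element {node-name($element)}
      {$element/@*,
       for $child in $element/node()
       return
          if ($child instance of element())
          then local:copy($child)
          else $child
      }
};

(: hypothetical sample data :)
let $input :=
   <data>
      <a q="joe">a</a>
      <b p="5">bb</b>
   </data>
(: compare the copy with the original, node by node :)
return deep-equal($input, local:copy($input))
```

The query should evaluate to true(), since the copy has the same names, attributes and content as the original.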

Note that you can also write this function with a typeswitch expression. The children are selected with node() so that comments and processing instructions are copied as well:

declare function local:copy($n as node()) as node() {
   typeswitch($n)
      case $e as element()
         return
            element {node-name($e)}
                    {$e/@*,
                     for $c in $e/node()
                         return local:copy($c) }
      default return $n
};

Removing all attributes

The following function removes all attributes from elements, because the attributes are simply not copied.

declare function local:copy-no-attributes($element as element()) as element() {
   element {node-name($element)}
      {
      for $child in $element/node()
         return
            if ($child instance of element())
               then local:copy-no-attributes($child)
               else $child
      }
};

This function can also be written with a typeswitch expression. It is an alternative definition, so declare only one of the two in a module:

declare function local:copy-no-attributes($n as node()) as node() {
   typeswitch($n)
      case $e as element()
         return
            element {name($e)}
                    {for $c in $e/(* | text())
                         return local:copy-no-attributes($c) }
      default return $n
};

The function can be parameterized by adding a second function argument that indicates which attributes should be removed, as shown under Removing named attributes below.

Change all the attribute names for a given element

declare function local:change-attribute-name-for-element(
   $node as node(),
   $element as xs:string,
   $old-attribute as xs:string,
   $new-attribute as xs:string
   ) as element() {
       element
         {node-name($node)}
         {if (string(node-name($node))=$element)
           then
              for $att in $node/@*
              return
                if (name($att)=$old-attribute)
                  then
                     attribute {$new-attribute} {$att}
                   else
                      attribute {name($att)} {$att}
           else
              $node/@*
           ,
               for $child in $node/node()
                 return if ($child instance of element())
                    then local:change-attribute-name-for-element($child, $element, $old-attribute, $new-attribute)
                    else $child 
         }
};
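A short usage sketch may help here. It renames the attribute q to nickname, but only on b elements. The sample data and the attribute name nickname are invented for illustration; the function is repeated so that the query is self-contained.

```xquery
xquery version "1.0";

declare function local:change-attribute-name-for-element(
   $node as node(),
   $element as xs:string,
   $old-attribute as xs:string,
   $new-attribute as xs:string
   ) as element() {
   element {node-name($node)}
      {(: rename attributes only on elements with the target name :)
       if (string(node-name($node)) = $element)
       then
          for $att in $node/@*
          return
             if (name($att) = $old-attribute)
             then attribute {$new-attribute} {$att}
             else attribute {name($att)} {$att}
       else $node/@*
       ,
       (: recurse into child elements, copy other nodes unchanged :)
       for $child in $node/node()
       return
          if ($child instance of element())
          then local:change-attribute-name-for-element($child, $element, $old-attribute, $new-attribute)
          else $child
      }
};

(: hypothetical sample data :)
let $input := <data><a q="joe">a</a><b p="5" q="fred">bb</b></data>
return local:change-attribute-name-for-element($input, 'b', 'q', 'nickname')
```

The a element keeps its q attribute, while the b element should come back with p="5" and nickname="fred". The order in which attributes are serialized may differ between processors.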

Replacing all attribute values

For elements with the given names, replace the old value of the named attributes with a new value.

declare function local:change-attribute-values
    (
        $node as node(),
        $element-name as xs:string*,
        $attribute-name as xs:string*,
        $old-attribute-value as xs:string*,
        $new-attribute-value as xs:string
    )
        as element() 
    {

        element{node-name($node)}
        {
        if (string(node-name($node))=$element-name)
        then
           for $attribute in $node/@*
               let $found-attribute-name := name($attribute)
               let $found-attribute-value := string($attribute)
                   return
                       if ($found-attribute-name = $attribute-name and $found-attribute-value = $old-attribute-value)
                       then attribute {$found-attribute-name} {$new-attribute-value}
                       else attribute {$found-attribute-name} {$found-attribute-value}
        else $node/@*
        ,
        for $node in $node/node()
           return 
               if ($node instance of element())
               then local:change-attribute-values($node, $element-name, $attribute-name, $old-attribute-value, $new-attribute-value)
               else $node 
    }
};

Removing named attributes

Attributes are filtered in the predicate expression not(name()=$attribute-name) so that named attributes are omitted.

declare function local:copy-filter-attributes(
       $element as element(),
       $attribute-name as xs:string*) as element() {
    element {node-name($element)}
            {$element/@*[not(name()=$attribute-name)],
                for $child in $element/node()
                   return if ($child instance of element())
                      then local:copy-filter-attributes($child, $attribute-name)
                      else $child
            }
  };
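Because = is a general comparison, the second argument can be a sequence of names, and an attribute is dropped if its name matches any of them. The following sketch illustrates this with invented sample data; the function is repeated so that the query is self-contained.

```xquery
xquery version "1.0";

declare function local:copy-filter-attributes(
       $element as element(),
       $attribute-name as xs:string*) as element() {
    element {node-name($element)}
            {(: keep only attributes whose name is not in the list :)
             $element/@*[not(name() = $attribute-name)],
             for $child in $element/node()
             return
                if ($child instance of element())
                then local:copy-filter-attributes($child, $attribute-name)
                else $child
            }
};

(: hypothetical sample data :)
let $input := <data><a q="joe">a</a><b p="5" q="fred" r="x">bb</b></data>
return local:copy-filter-attributes($input, ('p', 'q'))
```

Only the r attribute on the b element should survive in the result.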

Removing named elements

Likewise, elements can be filtered in a predicate:

declare function local:remove-elements($input as element(), $remove-names as xs:string*) as element() {
   element {node-name($input) }
      {$input/@*,
       for $child in $input/node()[not(name(.)=$remove-names)]
          return
             if ($child instance of element())
                then local:remove-elements($child, $remove-names)
                else $child
      }
};

This applies a predicate to the node() step that tests the name of each child node:

/node()[not(name(.)=$remove-names)]

To use this function, pass the input element as the first parameter and a sequence of element names, as strings, as the second parameter. Because the function expects an element, select the root element of a document with /*. For example:

  let $input := doc('my-input.xml')/*
  let $remove-list := ('xxx', 'yyy', 'zzz')
  return local:remove-elements($input, $remove-list)

Renaming Elements Using a Map

Suppose we have a file of elements that we want to rename using a filter. We want to store the rename rules in a file like this:

let $rename-map :=
<rename-map>
   <map>
      <from>b</from>
      <to>x</to>
   </map>
   <map>
      <from>d</from>
      <to>y</to>
   </map>
   <map>
      <from>f</from>
      <to>z</to>
   </map>
</rename-map>

The function that renames the elements is the following:

declare function local:rename-elements($input as node(), $map as node()) as node() {
let $current-element-name := name($input)
return
   (: we create a new element with a name and a content :)
   element
        { (: the new name is created here :)
        if (local:element-in-map($current-element-name, $map)  ) 
           then local:new-name($current-element-name, $map)
           else node-name($input)
        }
        { (: the element content is created here :)
        $input/@*, (: copy all attributes :)
        for $child in $input/node()
         return
            if ($child instance of element())
               then local:rename-elements($child, $map)
               else $child
        }
};

(: return true() if an element name has an entry in the rename map :)
declare function local:element-in-map($element-name as xs:string, $map as node()) as xs:boolean {
exists($map/map[./from = $element-name])
};

(: return the new element name of an element in a rename map :)
declare function local:new-name($element-name as xs:string, $map as node()) as xs:string {
$map/map[./from = $element-name]/to
};

The following is the input:

<data>
   <a q="joe">a</a>
   <b p="5" q="fred" >bb</b>
   <c>
        <d>dd</d>
         <a q="dave">aa</a>
         <e>EE</e>
         <f>FF</f>
   </c>
</data>

The output is:

<data>
   <a q="joe">a</a>
   <x q="fred" p="5">bb</x>
   <c>
            <y>dd</y>
            <a q="dave">aa</a>
            <e>EE</e>
            <z>FF</z>
   </c>
</data>

Removing Empty Elements

Many relational database systems export rows of data that are converted into XML. It is often considered good practice to remove any XML elements that have no text content.

The XQuery function takes a single element and returns an optional element.

The first test checks whether a child element or text node is present. If one is, an element is constructed and the attributes are added. The function then calls itself for each child element.

declare function local:remove-empty-elements($element as element()) as element()? {
if ($element/* or $element/text())
  then 
   element {node-name($element)}
      {$element/@*,
          for $child in $element/node()
              return
               if ($child instance of element())
                 then local:remove-empty-elements($child)
                 else $child
      }
    else ()
};

Example Input

let $input :=
<root>
   <a>A</a>
   <!-- remove these -->
   <b></b>
   <c> </c>
   <d>
      <e>E</e>
      <!-- and this -->
      <f>   </f>
   </d>
</root>

Example Output

<root>
   <a>A</a>
   <!-- remove these -->
   <d>
      <e>E</e>
      <!-- and this -->
   </d>
</root>

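Note that elements containing only spaces, carriage returns or tabs are also removed in this example. This works because boundary whitespace in a direct element constructor is stripped by default. If the input comes from a source that preserves whitespace-only text nodes, the test $element/text() succeeds and such an element is kept; the function under Removing elements with no string value tests normalize-space() instead and does not depend on this.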

Example illustrating the above filters

The following script demonstrates these functions:

let $x :=
<data>
   <a q="joe">a</a>
   <b p="5" q="fred" >bb</b>
   <c>
        <d>dd</d>
         <a q="dave">aa</a>
   </c>
</data>
return
 <output>
    <original>{$x}</original>
    <fullcopy> {local:copy($x)}</fullcopy>
    <noattributes>{local:copy-no-attributes($x)}  </noattributes>
    <filterattributes>{local:copy-filter-attributes($x,"q")}</filterattributes>
    <filterelements>{local:remove-elements($x,"a")}</filterelements>
    <filterelements2>{local:remove-elements($x,("a","d"))}  </filterelements2>
 </output>

Converting to XHTML Namespace

declare function local:xhtml-namespace($nodes as node()*) as node()* {
for $node in $nodes
   return
    if ($node instance of element())
      then
         element {QName('http://www.w3.org/1999/xhtml', local-name($node))}
            {$node/@*, local:xhtml-namespace($node/node())}
      else $node
 };
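One way to confirm the effect is to inspect the namespace of the rebuilt elements. The following sketch uses invented sample data and repeats the function so that the query is self-contained.

```xquery
xquery version "1.0";

declare function local:xhtml-namespace($nodes as node()*) as node()* {
   for $node in $nodes
   return
      if ($node instance of element())
      then
         (: rebuild the element under the same local name in the XHTML namespace :)
         element {QName('http://www.w3.org/1999/xhtml', local-name($node))}
            {$node/@*, local:xhtml-namespace($node/node())}
      else $node
};

(: hypothetical sample data in no namespace :)
let $input := <div class="note"><p>Hello</p></div>
let $result := local:xhtml-namespace($input)
return (
   namespace-uri($result),
   namespace-uri($result/*[1]),
   local-name($result/*[1])
)
```

Both namespace URIs should be http://www.w3.org/1999/xhtml, and the local name of the child should still be p. Attributes are copied unchanged, so they stay in no namespace.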

Adding a Namespace

Here is a function that will add a namespace to the root element in an XML document. Note that it uses an element constructor with two parameters. The first parameter is the element name and the second is the element content. The element name is created using the QName() function and the element content is created using {$element/@*, $element/node()}, which copies the attributes and all child nodes.

declare function local:change-root-namespace($in as element()*, $new-namespace as xs:string, $prefix as xs:string) as element()* {
for $element in $in
   return
     element {QName($new-namespace,
         concat($prefix,
                if ($prefix = '')
                   then '' else ':',
                local-name($element)))}
           {$element/@*, $element/node()}
 };

If we use the following input:

 let $input :=
<a>
  <b>
     <c a1="A1" a2="A2">
       <d a1="A1" a2="A2">DDD</d>
     </c>
  </b>
  <e>EEE</e>
</a>
return local:change-root-namespace($input, 'http://example.com', 'e')

To change the namespace of every element in the tree, not only the root, use a recursive function:

(: $in is a sequence of nodes! :)
declare function local:change-namespace-deep($in as node()*, $new-namespace as xs:string, $prefix as xs:string )  as node()* {
  for $node in $in
  return if ($node instance of element())
         then element
               {QName ($new-namespace,
                          concat($prefix,
                                if ($prefix = '')
                                then '' else ':',
                                local-name($node)))
               }
               {$node/@*, local:change-namespace-deep($node/node(), $new-namespace, $prefix)}
         else
            (: step through document nodes :)
            if ($node instance of document-node())
               then local:change-namespace-deep($node/node(), $new-namespace, $prefix)
               (: for comments and PIs :)
               else $node
 };
 
let $input :=
<a>
  <b>
     <c a1="A1" a2="A2">
       <!-- comment -->
       <d a1="A1" a2="A2">DDD</d>
     </c>
  </b>
  <e>EEE</e>
</a>

return local:change-namespace-deep($input, 'http://example.com', '')

The first query, which changes only the root element and uses the prefix e, returns the following output:

<e:a xmlns:e="http://example.com">
   <b>
      <c a1="A1" a2="A2">
         <d a1="A1" a2="A2">DDD</d>
      </c>
   </b>
   <e>EEE</e>
</e:a>

The second query, which changes every element and uses an empty prefix, returns:

<a xmlns="http://example.com">
   <b>
      <c a1="A1" a2="A2"><!-- comment -->
         <d a1="A1" a2="A2">DDD</d>
      </c>
   </b>
   <e>EEE</e>
</a>

Note that if you pass the empty string as the prefix, the element names are not prefixed. The namespace is still applied, as a default namespace.

Removing unwanted namespaces

Some systems do not allow you to have precise control of the namespaces used after doing an update despite the use of copy-namespaces declarations.

Remove TEI Namespace

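The following XQuery function is an example that rebuilds every element that is in the TEI namespace. As written, each rebuilt element is given a name in the TEI namespace again, so in effect the function drops other namespace declarations carried by those elements rather than the TEI namespace itself. To move elements out of the TEI namespace, construct them with local-name($node) alone, as in the functions under Remove ALL Namespaces.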

declare function local:clean-namespaces($node as node()) {
    typeswitch ($node)
        case element() return
            if (namespace-uri($node) eq "http://www.tei-c.org/ns/1.0") then
                element { QName("http://www.tei-c.org/ns/1.0", local-name($node)) } {
                    $node/@*, for $child in $node/node() return local:clean-namespaces($child)
                }
            else
                $node
        default return
            $node
};

The two functions below remove all namespaces from a node; nnsc stands for no-namespace-copy. The first one is reported to perform much faster, possibly because it handles attributes more directly. The second one is kept here because it may behave differently in cases that are not obvious.

Remove ALL Namespaces

The following recursive function removes all namespaces from elements and attributes. Note that the local-name() function is used to generate the namespace-free element and attribute names.

(: return a deep copy of the elements and attributes without ANY namespaces :)
declare function local:remove-namespaces($element as element()) as element() {
     element { local-name($element) } {
         for $att in $element/@*
         return
             attribute {local-name($att)} {$att},
         for $child in $element/node()
         return
             if ($child instance of element())
             then local:remove-namespaces($child)
             else $child
         }
};
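The following sketch applies the function to a small namespaced fragment and inspects the resulting names. The namespace URI http://example.com/ns and the sample data are invented for illustration; the function is repeated so that the query is self-contained.

```xquery
xquery version "1.0";

(: hypothetical namespace used only for the sample data :)
declare namespace ex = "http://example.com/ns";

declare function local:remove-namespaces($element as element()) as element() {
     element { local-name($element) } {
         (: rebuild each attribute under its local name :)
         for $att in $element/@*
         return attribute {local-name($att)} {$att},
         (: recurse into child elements, copy other nodes unchanged :)
         for $child in $element/node()
         return
             if ($child instance of element())
             then local:remove-namespaces($child)
             else $child
     }
};

let $input := <ex:doc ex:id="1"><ex:title>T</ex:title></ex:doc>
let $result := local:remove-namespaces($input)
return (
   namespace-uri($result),
   name($result/*[1]),
   name($result/@*[1])
)
```

The namespace URI of the result should be the empty string, and the child element and the attribute should be named title and id with no prefix. This assumes that no default element namespace is declared in the query.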

Version 2

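This version uses @* and * to generate a single sequence of attributes and elements. Elements are passed to the recursive function, but attributes are returned directly as $child. Note that because only @* and * are selected, text nodes, comments and processing instructions are not copied, and attributes keep whatever namespace they have.

(: return a deep copy of the element without namespaces :)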
declare function local:nnsc2($element as element()) as element() {
     element { QName((), local-name($element)) } {
         for $child in $element/(@*,*)
         return
             if ($child instance of element())
             then local:nnsc2($child)
             else $child
     }
};

Conversely, if you want to add a namespace to an element, a starting point is this blog post: http://fgeorges.blogspot.com/2006/08/add-namespace-node-to-element-in.html

Remove extra whitespace

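(: assumes the forxml prefix is bound to a namespace in the prolog
   and that the FunctX library is available under the functx prefix :)
declare function forxml:sanitize($forxml-result)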
{
   let $children := $forxml-result/*
   return
       if(empty($children)) then ()
       else
           for $c in $children
           return
           (
               element { name($c) }
               {
                    $c/@*,
                    if(functx:is-a-number($c/text()))
                    then number($c/text())
                    else normalize-space($c/text()),
                    forxml:sanitize($c)
               }
            )
};

Contributed by Chris Misztur.

Removing elements with no string value

Elements which contain no string value or which contain whitespace only can be removed:

declare function local:remove-empty-elements($nodes as node()*)  as node()* {
   for $node in $nodes
   return
     if ($node instance of element())
     then if (normalize-space($node) = '')
          then ()
          else element { node-name($node)}
                { $node/@*,
                  local:remove-empty-elements($node/node())}
     else if ($node instance of document-node())
     then local:remove-empty-elements($node/node())
     else $node
 } ;
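The following sketch shows the function on invented sample data. The declaration boundary-space preserve is an assumption made here so that the whitespace inside the b element is kept as a text node in the constructed input; the function is repeated so that the query is self-contained.

```xquery
xquery version "1.0";

(: keep whitespace-only text in the direct constructors below :)
declare boundary-space preserve;

declare function local:remove-empty-elements($nodes as node()*) as node()* {
   for $node in $nodes
   return
      if ($node instance of element())
      then
         (: drop elements whose string value is empty or whitespace only :)
         if (normalize-space($node) = '')
         then ()
         else element {node-name($node)}
                 {$node/@*, local:remove-empty-elements($node/node())}
      else if ($node instance of document-node())
      then local:remove-empty-elements($node/node())
      else $node
};

(: hypothetical sample data :)
let $input := <root><a>A</a><b>   </b><c><d/></c></root>
return local:remove-empty-elements($input)
```

Only the a element should remain inside root: b contains whitespace only, and c has no string value because its only descendant is empty.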

Removing empty attributes

Attributes which contain no text can be stripped:

declare function local:remove-empty-attributes($element as element()) as element() {
   element {node-name($element)}
      {$element/@*[string-length(.) ne 0],
       for $child in $element/node()
       return
          if ($child instance of element())
          then local:remove-empty-attributes($child)
          else $child
      }
};
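A short usage sketch, with invented sample data and the function repeated so that the query is self-contained:

```xquery
xquery version "1.0";

declare function local:remove-empty-attributes($element as element()) as element() {
   element {node-name($element)}
      {(: keep only attributes with a non-empty value :)
       $element/@*[string-length(.) ne 0],
       for $child in $element/node()
       return
          if ($child instance of element())
          then local:remove-empty-attributes($child)
          else $child
      }
};

(: hypothetical sample data :)
let $input := <item id="" class="x"><sub note="">t</sub></item>
return local:remove-empty-attributes($input)
```

The empty id and note attributes should be dropped, leaving class="x" on item and no attributes on sub.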

One Function for Several In-Memory Operations

You can integrate several functions for altering the node tree into one function. The following function supports a number of common operations on elements.

The parameters passed are 1) the node tree to be operated on, 2) any new item(s) to be inserted, 3) the action to be performed, 4) the name(s) of the element(s) targeted by the action.

The function can insert one or more elements supplied as a parameter in a certain position relative to (before or after or as the first or last child of) target elements in the node tree.

One or more elements can be inserted in the same position as the target element(s), i.e. they can substitute for them.

If the action is 'remove', the target element(s) are removed. If the action is 'remove-if-empty', the target element(s) are removed if they have no (normalized) string value. If the action is 'substitute-children-for-parent', the target element(s) are substituted by their child element(s). (In the last three cases the new content parameter is not consulted and should, for clarity, be the empty sequence).

If the action to be taken is 'change-name', the name of the element is changed to the first item of the new content.

If the action to be taken is 'substitute-content', any children of the target element(s) are substituted with the new content.

Note that context-free functions, for instance current-date(), can be passed as new content.

declare function local:change-elements($node as node(), $new-content as item()*, $action as xs:string, $target-element-names as xs:string+) as node()* {
        
        if ($node instance of element() and local-name($node) = $target-element-names)
        then
            if ($action eq 'insert-before')
            then ($new-content, $node) 
            else
            
            if ($action eq 'insert-after')
            then ($node, $new-content)
            else
            
            if ($action eq 'insert-as-first-child')
            then element {node-name($node)}
                {
                $node/@*
                ,
                $new-content
                ,
                for $child in $node/node()
                    return $child
                }
                else
            
            if ($action eq 'insert-as-last-child')
            then element {node-name($node)}
                {
                $node/@*
                ,
                for $child in $node/node()
                    return $child 
                ,
                $new-content
                }
                else
                
            if ($action eq 'substitute')
            then $new-content
            else 
                
            if ($action eq 'remove')
            then ()
            else 
                
            if ($action eq 'remove-if-empty')
            then
                if (normalize-space($node) eq '')
                then ()
                else $node
            else

            if ($action eq 'substitute-children-for-parent')
            then $node/*
            else
            
            if ($action eq 'substitute-content')
            then
                element {name($node)}
                    {$node/@*,
                $new-content}
            else
                
            if ($action eq 'change-name')
            then
                element {$new-content[1]}
                    {$node/@*,
                for $child in $node/node()
                    return $child}
                
            else ()
        
        else
        
            if ($node instance of element()) 
            then
                element {node-name($node)} 
                {
                    $node/@*
                    ,
                    for $child in $node/node()
                        return 
                            local:change-elements($child, $new-content, $action, $target-element-names) 
                }
            else $node
};

A typeswitch is not used because it can only branch on the type of a node, whereas here the action and the element names are supplied at run time as strings.

Having the following main,

let $input := 
<html>
    <head n="1">1</head>
    <body>
        <p n="2">2</p>
        <p n="3">3</p>
    </body>
</html>

let $new-content := <p n="4">0</p>

return 
    local:change-elements($input, $new-content, 'insert-as-last-child', ('body'))

the result will be

<html>
    <head n="1">1</head>
    <body>
        <p n="2">2</p>
        <p n="3">3</p>
        <p n="4">0</p>
    </body>
</html>

Note that if the target element is 'p', the $new-content will be inserted in relation to every element named 'p'.

You can fill in any functions you need and delete those that you do not need.
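As an illustration of trimming the function down, the following sketch keeps only the 'remove' and 'insert-after' actions. The name local:change-elements-min and the sample data are invented for this example, and the new content is assumed to consist of nodes.

```xquery
xquery version "1.0";

(: hypothetical trimmed variant that keeps only 'remove' and 'insert-after' :)
declare function local:change-elements-min(
    $node as node(),
    $new-content as node()*,
    $action as xs:string,
    $target-element-names as xs:string+
) as node()* {
    if ($node instance of element() and local-name($node) = $target-element-names)
    then
        if ($action eq 'insert-after') then ($node, $new-content)
        else if ($action eq 'remove') then ()
        else $node  (: unknown action: keep the element unchanged :)
    else if ($node instance of element())
    then
        (: not a target: rebuild the element and recurse into its children :)
        element {node-name($node)} {
            $node/@*,
            for $child in $node/node()
            return local:change-elements-min($child, $new-content, $action, $target-element-names)
        }
    else $node
};

(: hypothetical sample data :)
let $input := <html><head n="1">1</head><body><p n="2">2</p></body></html>
return local:change-elements-min($input, (), 'remove', 'head')
```

The result should be the html element with only the body element left inside it. The return type is node()* rather than node()+ because the 'remove' action returns the empty sequence.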

The following function facilitates several operations on attributes. This is more complicated than working with elements, because element names have to be considered as well.

The parameters passed are 1) the node tree to be operated on, 2) a new attribute name, 3) the new attribute contents, 4) the action to be performed, 5) the name(s) of the element(s) targeted by the action, 6) the name(s) of the attribute(s) targeted by the action.

To remove all empty attributes, only the action parameter matters; the remaining parameters are not consulted.

If you wish to remove all named attributes, you need to supply the name of the attribute to be removed.

If you wish to change all values of named attributes, you need to supply the new value as well.

If you wish to attach an attribute, with name and value, to a specific element, you need to supply parameters for the element the attribute is to be attached to, the name of the attribute, and the value of the attribute, as well as the action.

If you wish to remove an attribute from a specific element, you need to supply parameters for the element the attribute is to be removed from, the name of the attribute, as well as the action.

If you wish to change the name of an attribute attached to a specific element, you need to supply parameters for the element the attribute is attached to, the name the attribute has, the new name the attribute is to have, as well as the action.

declare function local:change-attributes($node as node(), $new-name as xs:string?, $new-content as item()?, $action as xs:string, $target-element-names as xs:string*, $target-attribute-names as xs:string*) as node()+ {
    
            if ($node instance of element()) 
            then
                element {node-name($node)} 
                {
                    if ($action = 'remove-all-empty-attributes')
                    then $node/@*[string-length(.) ne 0]
                    else 
                        
                    if ($action = 'remove-all-named-attributes')
                    then $node/@*[not(name(.) = $target-attribute-names)]
                    else 
                    
                    if ($action = 'change-all-values-of-named-attributes')
                    then
                        for $att in $node/@*
                        return 
                            if (name($att) = $target-attribute-names)
                            then attribute {name($att)} {$new-content}
                            else attribute {name($att)} {$att}
                    else
                        
                    if ($action = 'attach-attribute-to-element' and name($node) = $target-element-names)
                    then ($node/@*, attribute {$new-name} {$new-content})
                    else 

                    if ($action = 'remove-attribute-from-element' and name($node) = $target-element-names)
                    then $node/@*[not(name(.) = $target-attribute-names)]
                    else 

                    if ($action = 'change-attribute-name-on-element' and name($node) = $target-element-names)
                    then 
                        for $att in $node/@*
                            return
                                if (name($att) = $target-attribute-names)
                                then attribute {$new-name} {$att}
                                else attribute {name($att)} {$att}
                    else
                    
                    if ($action = 'change-attribute-value-on-element' and name($node) = $target-element-names)
                    then
                        for $att in $node/@*
                            return 
                                if (name($att) = $target-attribute-names)
                                then attribute {name($att)} {$new-content}
                                else attribute {name($att)} {$att}
                    else 

                    $node/@*
                    ,
                    for $child in $node/node()
                        return 
                            local:change-attributes($child, $new-name, $new-content, $action, $target-element-names, $target-attribute-names) 
                }
            else $node
};

Having the following main,

let $input := 
<xml>
    <head n="1">1</head>
    <body>
        <p n="2">2</p>
        <p n="3">3</p>
    </body>
</xml>

return 
    local:change-attributes($input, 'y', current-date(), 'change-attribute-value-on-element', 'p', 'n')

the result will be similar to the following, where the date is the one on which the query is run:

<xml>
    <head n="1">1</head>
    <body>
        <p n="2013-11-30+01:00">2</p>
        <p n="2013-11-30+01:00">3</p>
    </body>
</xml>

References

W3C page on computed element constructors