XQuery/Regular Expressions

Motivation

You want to test whether a text matches a specific pattern of characters. You want to replace patterns of text with other text. You have text with repeating patterns and you would like to break it up into discrete items.

Method

To deal with the above three problems, XQuery has the following functions:

  • matches($input, $regex) - returns true if any part of the input matches the regular expression
  • replace($input, $regex, $string) - replaces every part of the input that matches the regular expression with a new string
  • tokenize($input, $regex) - splits the input at each match of the regular expression and returns the pieces between the matches as a sequence of strings

Through these functions we have access to the powerful syntax of regular expressions.
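
As a quick illustration, the three functions can be applied to the same string. This is a minimal sketch; the date value is a made-up sample, not taken from any particular data set.

```xquery
(: Hypothetical sample value, used only for illustration :)
let $date := '2024-01-15'
return (
  (: true: the whole string has the shape dddd-dd-dd :)
  matches($date, '^\d{4}-\d{2}-\d{2}$'),
  (: '2024/01/15': every hyphen is replaced :)
  replace($date, '-', '/'),
  (: ('2024', '01', '15'): the hyphens act as separators :)
  tokenize($date, '-')
)
```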

Summary of Regular Expressions

Regular expressions ("regex") are a field unto themselves. To derive full benefit from this way of describing strings with patterns, you should consult a separate introduction. Priscilla Walmsley's XQuery (Chapter 18) has a clear summary of the functionality offered.

  • fn:matches($input, $regex, $flags) takes a string and a regular expression as input. If the regular expression matches any part of the string, the function returns true; otherwise it returns false. Enclose the pattern in anchors (^ at the beginning and $ at the end) if you want the function to return true only when the pattern matches the entire string. Note that this differs from XML Schema patterns, where ^ and $ are implied.
  • fn:replace($input, $regex, $string, $flags) takes a string, a regular expression, and a replacement string as input. It returns a copy of the input string in which every match of the pattern is replaced with the replacement string. You can use $1 to $99 in the replacement string to re-insert groups of characters captured with parentheses.
  • fn:tokenize($input, $regex, $flags) returns a sequence of strings consisting of the substrings of the input string that lie between the matches of the pattern. The sequence does not contain the matches themselves.
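
The points in the list above (partial versus whole-string matching, captured groups, and separators being dropped by tokenize) can be sketched as follows. The input string is only a sample.

```xquery
let $input := 'Hello World'
return (
  (: true: the pattern matches part of the string :)
  matches($input, 'World'),
  (: false: with anchors the pattern must cover the whole string :)
  matches($input, '^World$'),
  (: 'World Hello': $1 and $2 re-insert the two captured words :)
  replace($input, '(\w+) (\w+)', '$2 $1'),
  (: ('Hell', ' W', 'rld'): the matched "o" characters are not returned :)
  tokenize($input, 'o')
)
```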

In regular expressions, most characters represent themselves, so you are not obliged to use the special regex syntax in order to use these three functions. A dot (.) represents any single character except a newline. Immediately following a character or an expression such as a dot, one can add a quantifier that tells how many times it may be repeated: "*" for "zero or more times", "?" for "zero or one time", and "+" for "one or more times". Quantifiers normally match the longest possible substring; adding "?" after a quantifier, as in "*?", makes it match the shortest substring that still satisfies the pattern. NB: this only scratches the surface of the subject of regular expressions!
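
The quantifiers described above can be tried out with a few small tests. This is an illustrative sketch; the sample strings are invented for the purpose.

```xquery
let $markup := '<b>one</b> and <b>two</b>'
return (
  (: "?" makes the preceding "u" optional: both spellings match :)
  matches('color', '^colou?r$'),        (: true :)
  matches('colour', '^colou?r$'),       (: true :)
  (: "*" allows zero occurrences of "b", "+" requires at least one :)
  matches('ac', '^ab*c$'),              (: true :)
  matches('ac', '^ab+c$'),              (: false :)
  (: greedy ".*" runs from the first <b> to the last </b> :)
  replace($markup, '<b>.*</b>', 'X'),   (: 'X' :)
  (: lazy ".*?" stops at the nearest </b> :)
  replace($markup, '<b>.*?</b>', 'X')   (: 'X and X' :)
)
```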

All three functions accept an optional flags argument that sets the matching mode. The following four flags are available:

  • i makes the regex match case insensitive.
  • s enables "single-line mode" or "dot-all" mode. In this mode, the dot matches every character, including newlines, so the string is treated as a single line.
  • m enables "multi-line mode". In this mode, the anchors "^" and "$" also match at the start and end of each line within the string, in addition to the start and end of the string as a whole.
  • x enables "free-spacing mode". In this mode, whitespace in the regex pattern is ignored. This is mainly used when one has divided a complicated regex over several lines but does not intend the newlines to be matched.

If one does not need a flag, one can simply omit the flags argument or pass the empty string "". Several flags can be combined in one string, as in "ix".
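
The effect of each flag can be sketched with a two-line string. The sample text is invented; the character reference &#10; is used to put an explicit newline into the string literal.

```xquery
(: Hypothetical two-line sample; &#10; is a newline character :)
let $text := 'first line&#10;second line'
return (
  (: without "s" the dot stops at the newline :)
  matches($text, 'first.*second'),         (: false :)
  matches($text, 'first.*second', 's'),    (: true :)
  (: without "m" the anchor "^" only matches at the start of the string :)
  matches($text, '^second'),               (: false :)
  matches($text, '^second', 'm'),          (: true :)
  (: "i" ignores case :)
  matches($text, 'SECOND', 'i'),           (: true :)
  (: "x" drops the spaces from the pattern, leaving 'line\ssecond' :)
  matches($text, 'line \s second', 'x'),   (: true :)
  (: flags can be combined; an empty flags string changes nothing :)
  matches($text, 'FIRST.*SECOND', 'is'),   (: true :)
  matches($text, 'first', '')              (: true :)
)
```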

Examples of matches()

let $input := 'Hello World'
return
<result>{
  (matches($input, 'Hello World') =  true(),
   matches($input, 'Hi') =  false(),
   matches($input, 'H.*') = true(),
   matches($input, 'H.*o W.*d') =  true(),
   matches($input, 'Hel+o? W.+d') = true(),
   matches($input, 'Hel?o+') = false(),
   matches($input, 'hello', "i") = true(), 
   matches($input, 'he l lo', "ix") = true() ,
   matches($input, '^Hello$') = false(), 
   matches($input, '^Hello') = true()
    )}
</result>

Examples of tokenize()

<result>{
(let $input := 'red,orange,yellow,green,blue'
return deep-equal( tokenize($input, ',') , ('red','orange','yellow','green','blue'))
 ,
let $input := 'red,
orange,	     yellow,  green,blue'
return  deep-equal(tokenize($input, ',\s*') , ('red','orange','yellow','green','blue'))
,
let $input := 'red   ,
orange  ,	     yellow    ,  green ,  blue'
return  not(deep-equal(tokenize($input, ',\s*') , ('red','orange','yellow','green','blue')))
,
let $input := 'red   ,
orange  ,	     yellow    ,  green ,  blue'
return  deep-equal(tokenize($input, '\s*,\s*') , ('red','orange','yellow','green','blue'))
)
}</result>

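In the second example, "\s" represents one whitespace character and thus matches the newline before "orange" and the tab character before "yellow". Because it is quantified with "*" and placed after the comma, the pattern removes whitespace after the comma, but not before it. The third example shows the consequence: when the input also has whitespace before the commas, the tokens keep their trailing spaces and no longer equal the plain colour names. To remove whitespace on both sides of the comma, use the pattern '\s*,\s*', as in the fourth example.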
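
One further behaviour is worth knowing when tokenizing: a separator at the start or end of the input, or two adjacent separators, produces zero-length strings in the result. The following sketch, using an invented sample string, shows this and one possible way to filter the empty tokens out.

```xquery
(: Hypothetical input with leading, doubled and trailing commas :)
let $input := ',red,,blue,'
return (
  (: 5 tokens: '', 'red', '', 'blue', '' :)
  count(tokenize($input, ',')),
  (: keep only the non-empty tokens: ('red', 'blue') :)
  tokenize($input, ',')[. ne ''],
  (: a zero-length input yields the empty sequence, so the count is 0 :)
  count(tokenize('', ','))
)
```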

Examples of replace()

<result>{
( 
let $input := 'red,orange,yellow,green,blue'
return ( replace($input, ',', '-') = 'red-orange-yellow-green-blue' )
,
let $input := 'Hello World'
return (
    replace($input, 'o', 'O') = "HellO WOrld" ,
    replace($input, '.', 'X') = "XXXXXXXXXXX" ,
    replace($input, 'H.*?o', 'Bye') = "Bye World" 
    )
,
let $input := 'HellO WOrld'
return ( replace($input, 'o', 'O', "i") = "HellO WOrld" )
,
let $input := 'Chapter 1 … Chapter 2 …'
  return ( replace($input, "Chapter (\d)", "Section $1.0")  =  "Section 1.0 … Section 2.0 …")
)
}</result>

In the last example, "\d" represents any digit. The parentheses around "\d" form a capturing group, and "$1" in the replacement string refers to whatever that first group matched, so each occurrence of "$1" is replaced by the matched digit.
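
A pattern may contain several groups, numbered from left to right by their opening parenthesis. The sketch below, with invented sample values, reorders three groups and also shows that a literal dollar sign in the replacement string has to be written as "\$".

```xquery
(
  (: three groups re-inserted in reverse order: '15/01/2024' :)
  replace('2024-01-15', '(\d{4})-(\d{2})-(\d{2})', '$3/$2/$1'),
  (: "\$" is a literal dollar sign, "$1" is the captured number: 'cost: $5' :)
  replace('cost: 5', '(\d+)', '\$$1')
)
```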

Larger examples
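
As one possible larger example, the three functions can be combined to turn a string of "name = value" pairs into elements. This is only a sketch under stated assumptions: the function local:parse-pairs, the pair element, and the input format (pairs separated by semicolons) are all hypothetical and chosen for illustration.

```xquery
(: Hypothetical helper: splits "name = value; name = value" into <pair> elements :)
declare function local:parse-pairs($s as xs:string) as element(pair)* {
  (: tokenize on semicolons, dropping the whitespace around them,
     and keep only the items that look like "name =" :)
  for $item in tokenize($s, '\s*;\s*')[matches(., '^\w+\s*=')]
  return
    (: the first replace keeps the name, the second keeps the value :)
    <pair name="{replace($item, '^(\w+)\s*=.*$', '$1')}">{
      replace($item, '^\w+\s*=\s*(.*)$', '$1')
    }</pair>
};

local:parse-pairs('color = red; size=large ;shape= round')
```

For the sample input this should produce three pair elements whose name attributes are color, size and shape and whose contents are red, large and round.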

References

The Regular Expression Library has more than 2,600 sample regular expressions: Regular Expression Library

This page has a very useful summary of the regular expression patterns: Regular Expression Cheat Sheet

This page describes how to use Regular Expressions within XQuery and XPath: XQuery and XPath Regular Expressions