Tcl Programming/regexp

Overview

Another language (I wouldn't want to call it "little") embedded inside Tcl is regular expressions. They may not look like it at first, but they do follow rules, and many of them. Their purpose is to describe a string pattern to match against, for searching, extracting, or replacing substrings.

Regular expressions are used by the regexp and regsub commands, and optionally by lsearch and switch (via their -regexp option). Note that this language is very different from Tcl itself, so in most cases it is best to brace a regular expression (RE) to keep the Tcl parser from misinterpreting it.

Before the gory details begin, let's start with some examples:

regexp {[0-9]+[a-z]} $input 

returns 1 if $input contains one or more digits, followed by a lowercase letter.

set result [regsub -all {[A-Z]} $input ""]

deletes all uppercase letters from $input, and saves that to the result variable.

lsearch -all -inline -regexp $input {^-}

returns all elements in the list $input which start with a "-".
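To try the three commands above on concrete data, here is a minimal sketch. The sample strings and the variable names (text, stripped, words, options) are made up for illustration only:

```tcl
# Hypothetical sample data, chosen only for illustration
set text "Order 66b shipped"
set found [regexp {[0-9]+[a-z]} $text]
puts $found        ;# 1, because "66b" is digits followed by a lowercase letter

set stripped [regsub -all {[A-Z]} "Hello World" ""]
puts $stripped     ;# ello orld

set words {-v file.txt -o out.txt}
set options [lsearch -all -inline -regexp $words {^-}]
puts $options      ;# -v -o
```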

Character classes

Many characters just stand for themselves. E.g.

a

matches the character "a". Any Unicode character can be specified in the \uXXXX format. Brackets (not the same as Tcl's) define a set of alternatives (a "class"):

[abc]

matches "a", "b", or "c". A dash (-) between two characters spans the range between them, e.g.

[0-9]

matches one decimal digit. To have literal "-" in a set of alternatives, put it first or last:

[0-9-]

matches a digit or a minus sign. A bracketed class can be negated by starting it with ^, e.g.

[^0-9]

matches any character that is not a decimal digit. The period "." represents one instance of any character. If a literal "." is intended (or, in general, to escape any character that has a special meaning in regexp syntax), put a backslash "\" before it, and make sure the regular expression is braced, so the Tcl parser doesn't eat up the backslash early.
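The effect of bracing is easiest to see side by side. The following sketch uses made-up inputs and result variables r1 to r6. The quoted pattern in r3 is deliberately "wrong", to show what happens when Tcl removes the backslash before the regexp engine sees it:

```tcl
set r1 [regexp {a\.c} abc]             ;# braced: \. is a literal dot, so abc fails
set r2 [regexp {a\.c} a.c]             ;# ... and a.c matches
set r3 [regexp "a\.c" abc]             ;# quoted: Tcl strips the backslash, leaving a.c
set r4 [regexp {^[0-9-]+$} 555-1234]   ;# only digits and minus signs
set r5 [regexp {[^0-9]} 12345]         ;# no non-digit anywhere in the input
set r6 [regexp {\u0041} A]             ;# \u0041 is the letter A
puts [list $r1 $r2 $r3 $r4 $r5 $r6]    ;# 0 1 1 1 0 1
```

Putting a bracketed class in double quotes would be worse still, as Tcl would try to run its contents as a command.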

Quantifiers

To match a character (or set) some number of times other than exactly once, put a quantifier after it, e.g.

a+     matches one or more "a"s,
a?     matches zero or one "a",
a*     matches zero or more "a"s.

There is also a way of numeric quantification, using braces (again, not the same as Tcl's):

a{2}   matches two "a"s - same as "aa"
a{1,5} matches one to five "a"s
a{1,}  one or more - like a+
a{0,1} zero or one - like a?
a{0,}  zero or more - like a*
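Numeric quantifiers are handy for fixed-width formats. As a sketch, the hypothetical helper looksLikeDate below checks only the shape NNNN-NN-NN, not whether the date exists. It uses the ^ and $ anchors (described in the next section) so that the counts apply to the whole string:

```tcl
# Hypothetical helper: checks only the shape NNNN-NN-NN, not calendar validity
proc looksLikeDate {s} {
    return [regexp {^[0-9]{4}-[0-9]{2}-[0-9]{2}$} $s]
}
puts [looksLikeDate 2024-01-31]   ;# 1
puts [looksLikeDate 24-1-31]      ;# 0

# With both anchors, a{2,5} accepts two to five letters a and nothing else
set counts {}
foreach s {a aa aaaaa aaaaaa} {
    lappend counts [regexp {^a{2,5}$} $s]
}
puts $counts                      ;# 0 1 1 0
```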

The + and * quantifiers (like the others) act "greedy", i.e. they consume the longest possible substring. For non-greedy behavior, which yields the shortest possible match, append a "?" to the quantifier. Examples:

% regexp -inline {<(.+)>} <foo><bar><grill>
<foo><bar><grill> foo><bar><grill

This matches until the last close-bracket.

% regexp -inline {<(.+?)>} <foo><bar><grill>
<foo> foo

This matches until the first close-bracket.

Anchoring

By default, a regular expression can match anywhere in a string. You can limit that to the beginning (^) and/or end ($):

regexp {^a.+z$} $input

succeeds if input begins with "a" and ends with "z", and has one or more characters between them.
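A quick way to see what the anchors contribute is to run the same pattern over a few inputs. This is a sketch with made-up test strings; re and results are illustrative names:

```tcl
set re {^a.+z$}
set results {}
foreach s {abcz xabcz abczy az} {
    lappend results [regexp $re $s]
}
puts $results                    ;# 1 0 0 0

# Without anchors, the abcz inside the longer string is enough for a match
puts [regexp {a.+z} xabczy]      ;# 1
```

The last input of the loop, "az", fails because .+ needs at least one character between the "a" and the "z".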

Grouping

A part of a regular expression can be grouped by putting parentheses () around it. This can have several purposes:

  • regexp and regsub can extract or refer to such substrings
  • the operator precedence can be controlled

The "|" (or) operator has low precedence: it binds less tightly than the concatenation of adjacent characters. So

foo|bar grill

matches the strings "foo" or "bar grill", while

(foo|bar) grill

matches "foo grill" or "bar grill".

(a|b|c) ;# is another way to write [abc]
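The difference shows up directly in what regexp -inline reports. In this sketch (inputs and the names m1 to m4 are made up), note also that (a|b|c) and [abc] match the same characters, but only the parenthesized form produces a submatch:

```tcl
set m1 [regexp -inline {foo|bar grill} "foo grill"]
puts $m1   ;# foo

set m2 [regexp -inline {(foo|bar) grill} "foo grill"]
puts $m2   ;# {foo grill} foo

set m3 [regexp -inline {(a|b|c)} xbx]
puts $m3   ;# b b

set m4 [regexp -inline {[abc]} xbx]
puts $m4   ;# b
```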

For extracting substrings into separate variables, regexp allows additional arguments:

regexp ?options? re input fullmatch part1 part2...

Here, the variable fullmatch will receive the substring of input that matched the regular expression re, while part1 etc. receive the parenthesized submatches. As fullmatch is often not needed, it has become a common eye-candy idiom to use the variable name "->" in that position, e.g.

regexp {(..)(...)} $input -> first second

places the first two characters of input in the variable first, and the next three in the variable second. If $input was "ab123", first will hold "ab", and second will contain "123".
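Since regexp returns 1 or 0, the extraction is usually wrapped in an if, so the variables are only used when the match succeeded. A minimal sketch, assuming a date-like input string (the names year, month and day are illustrative):

```tcl
set input "2024-01-31"
if {[regexp {^(\d{4})-(\d{2})-(\d{2})$} $input -> year month day]} {
    puts "year=$year month=$month day=$day"   ;# year=2024 month=01 day=31
} else {
    puts "no match"
}
```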

In regsub replacement strings, and inside regular expressions themselves, you can refer to parenthesized submatches with \1 for the first, \2 for the second, etc. In a regsub replacement string, \0 is the full match, like fullmatch with regexp above. Example:

% regsub {(..)(...)} ab123 {\2\1=\0}
123ab=ab123

Here \1 contains "ab", \2 contains "123", and \0 is the full match "ab123". Another example shows how to find the same lowercase letter four times in a row (the first occurrence, plus three repetitions):

regexp {([a-z])\1{3}} $input
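Both uses of back references can be tried out as follows. The inputs and the names q1, q2 and swapped are made up for this sketch:

```tcl
# Back reference inside the pattern: four identical lowercase letters in a row
set q1 [regexp {([a-z])\1{3}} "abcd"]
puts $q1        ;# 0

set q2 [regexp -inline {([a-z])\1{3}} "good aaaa"]
puts $q2        ;# aaaa a

# Back references in the replacement string: swap both sides of each key=value pair
set swapped [regsub -all {(\w+)=(\w+)} "a=1 b=2" {\2=\1}]
puts $swapped   ;# 1=a 2=b
```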

More examples

Parse the contents in angle brackets (note that the result contains the full match and the submatch in paired sequence, so use foreach to extract only the submatches):

% regexp -all -inline {<([^>]+)>} x<a>y<b>z<c>d
<a> a <b> b <c> c
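The foreach step mentioned above might look like this. It is a sketch in which full, sub and tags are illustrative names; full is simply ignored:

```tcl
set tags {}
# Walk the result two elements at a time: full match, then submatch
foreach {full sub} [regexp -all -inline {<([^>]+)>} "x<a>y<b>z<c>d"] {
    lappend tags $sub
}
puts $tags   ;# a b c
```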

Insert commas between groups of three digits into the integer part of a number:

% regsub -all {\d(?=(\d{3})+($|\.))} 1234567.89 {\0,}
1,234,567.89

In other countries, you might use an apostrophe (') as separator, or make groups of four digits (used in Japan).
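Such variations can be covered by building the regular expression from parameters. The proc groupDigits below is a hypothetical helper, not a built-in command. It assumes the separator is a plain character, since "&" or a backslash would have a special meaning in the regsub replacement string:

```tcl
# Hypothetical helper: insert sep between groups of size digits
# in the integer part of number
proc groupDigits {number {sep ,} {size 3}} {
    # Same pattern as above, with the group size filled in by format
    set re [format {\d(?=(\d{%d})+($|\.))} $size]
    return [regsub -all $re $number "\\0$sep"]
}
puts [groupDigits 1234567.89]       ;# 1,234,567.89
puts [groupDigits 1234567.89 ']     ;# 1'234'567.89
puts [groupDigits 12345678 , 4]     ;# 1234,5678
```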

In the opposite direction, converting such formatted numbers back to plain form for use in calculations simply means removing all commas. This can be done with regsub:

% regsub -all , 1,234,567.89 ""
1234567.89

but as the task involves only constant strings (comma and empty string), it is more efficient not to use regular expressions here, but use a string map command:

% string map {, ""} 1,234,567.89
1234567.89
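To confirm that the cleaned-up string is usable in calculations again, one can test it with string is double. A small sketch, with formatted and plain as illustrative variable names:

```tcl
set formatted 1,234,567.89
set plain [string map {, ""} $formatted]
puts [string is double -strict $formatted]   ;# 0: the commas make it non-numeric
puts [string is double -strict $plain]       ;# 1
puts [expr {$plain * 2}]
```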