Regular Expressions/Print version


Introduction

What is a regular expression?

A regular expression is a method of representing a string matching pattern. Regular expressions enable strings that match a particular pattern within textual data records to be located and modified and they are often used within utility programs and programming languages that manipulate textual data. Regular expressions are extremely powerful.

Example applications

Various software applications use regular expressions to locate, select or modify particular sections text. For example, a regular expression could be used to:

  • replace the word "snake" with the word "serpent" throughout an entire piece of text
  • locate pieces of text containing the words "fox" and "sheep" on the same line

Regular expression components

Regular expressions are made up of three types of components:

  • anchors used to specify the position of the pattern in relation to a line of text.
  • character sets used to match one or more characters in a single position.
  • modifiers used to specify how many times a character set is repeated.

Syntax varies across application programs

The syntax of regular expressions varies across application programs. For example the shell uses a limited form of regular expression called shell regular expressions for filename substitution, whereas AWK uses a superset of extended regular expressions syntax.

Supporting Software

Regular expressions are supported by various software tools, including command line tools, plain text editors and programming languages. Most of these tools are available for various computing platforms, including Linux, Windows and Mac OS X. The tools use slightly different syntax styles. Let's look at some notable ones.

The tools that use regular expressions are enlisted below:

Command line
tools
Plain text
editors
Programming
languages
grep ed .NET
egrep vi Awk
sed Emacs Java
Notepad++ JavaScript
Perl
PHP
Python
Ruby
Tcl

A regular expression can be considered to be a little computer program that finds or isolates a subset of a larger set of text. In the same way that an ordinary computer program needs a computer to execute it, a regular expression needs a software application to interpret it — to give it meaning.

For example, a regular expression can be used to tell an editor to find the next occurrence of the word "Chapter" followed by several spaces and digits. Or you can use a regular expression to tell the UNIX grep command to show only those lines of a file that contain the word "Wiki" followed by either the word "Books" or the word-fragment "pedia". We will discuss the exact syntax of such regular expressions in the next chapter.


Syntaxes

There are several variants of regular expressions. These variants differ not only in their concrete syntax but also in their capabilities. Individual tools that support regular expressions also have their own peculiarities.

Greedy expressions

Quantifiers such as * and + match as much as they can: they are greedy. For some uses, their greediness does not fit. For example, let as assume you want to find the first string enclosed in quotation marks, in the following text:

These words include "cat", "mat", and "pat".

The pattern ".*" matches the italicized part of the text below, that is, "cat", "mat", and "pat" instead of the desired "cat":

These words include "cat", "mat", and "pat".

To fix this, some flavours of regular expressions provide non-greedy operators such as *?, +?, and }?. In PHP, adding a "U" at the end of the regexp makes the quantifier non-greedy, as in /".*"/U. In flavours that support neither of the two options, you can specify what is not to be matched, as in ("[^"]*") to fix the discussed example. However, when dealing with bracketed expressions, (\[\[[^\]]*\]\]) fails to match on A B C [[D E] F G]].

Comparison table

A comparison table or matrix that shows which features or flavors of regular expressions are available in which tool or programming language is available from regular-expressions.info.


Simple Regular Expressions

The Simple Regular Expression syntax is widely used on Unix based systems for the purposes of backwards compatibility. Most regular-expression-aware Unix utilities, such as grep and sed, use it by default while providing support for extended regular expressions with command line arguments (see below). This syntax is deprecated on POSIX compliant systems and should not be used by new utilities.

When simple regular expression syntax is being used, most characters, except metacharacters are treated as literal characters and match only themselves (for example, "a" matches "a", "(bc" matches "(bc", etc).

Operators
Operator Effect
. The dot operator matches any single character.
[ ] boxes enable a single character to be matched against character lists or character ranges.
[^ ] A complement box enables a single character not within a character list or character range to be matched.
^ A caret anchor matches the start of the line (or any line, when applied in multiline mode)
$ A dollar anchor matches the end of the line (or any line, when applied in multiline mode)
( ) parentheses are used to define a marked subexpression. The matched text section can be recalled at a later time.
\n Where n is a digit from 1 to 9; matches what the nth marked subexpression matched. This irregular construct has not been adopted in the extended regular expression syntax.
* A single character expression followed by "*" matches zero or more copies of the expression. For example, "ab*c" matches "ac", "abc", "abbbc" etc. "[xyz]*" matches "", "x", "y", "zx", "zyx", and so on.
  • \n*, where n is a digit from 1 to 9, matches zero or more iterations of what the nth marked subexpression matched. For example, "(a.)c\1*" matches "abcab" and "abcabab" but not "abcac".
  • An expression enclosed in "\(" and "\)" followed by "*" is deemed to be invalid. In some cases (e.g. /usr/bin/xpg4/grep of SunOS 5.8), it matches zero or more iterations of the string that the enclosed expression matches. In other cases (e.g. /usr/bin/grep of SunOS 5.8), it matches what the enclosed expression matches, followed by a literal "*".

Examples

Examples:

  • "^[hc]at"
    • Matches hat and cat but only at the beginning of a line.
  • "[hc]at$"
    • Matches hat and cat but only at the end of a line.

Use in Tools

Tools and languages that utilize this regular expression syntax include:


Basic Regular Expressions

Basic Regular Expressions: Note that particular implementations of regular expressions interpret the backslash symbol differently in front of some of the metacharacters. For example, egrep and perl interpret unbackslashed parentheses and vertical bars as metacharacters, reserving the backslashed versions to mean the literal characters themselves. Old versions of grep did not support the pipe alternation operator.

Operators
Operator Effect
. The dot operator matches any single character.
[ ] A box enables a single character to be matched against a character list or character range.
[^ ] A compliment box enables a single character not within a character list or character range to be matched.
* An asterisk specifies zero or more characters to match.
^ The caret anchor matches the beginning of the line.
$ The dollar anchor matches the end of the line.
Examples:
Example Match
".at" any three-character string like hat, cat or bat
"[hc]at" hat and cat
"[^b]at" all the matched strings from the regex ".at" except bat
"^[hc]at" hat and cat, but only at the beginning of a line
"[hc]at$" hat and cat, but only at the end of a line

Many ranges of characters depend on the chosen locale setting. For example, in some settings letters are organized as abc..yzABC..YZ, while in some they are organized as aAbBcC..yYzZ.

The Posix Basic Regular Expressions syntax provided extensions for consistency between utility programs such as grep, sed and awk. These extensions are not supported by some traditional implementations of Unix tools.

Use in Tools

Tools and languages that utilize this regular expression syntax include: TBD


Perl-Compatible Regular Expressions

Perl has a richer and more predictable syntax than even the POSIX Extended Regular Expressions syntax. An example of its predictability is that \ always quotes a non-alphanumeric character. An example of something that is possible to specify with Perl but not POSIX is whether part of the match wanted to be greedy or not. For instance in the pattern /a.*b/, the .* will match as much as it can, while in the pattern /a.*?b/, .*? will match as little. So given the string "a bad dab", the first pattern will match the whole string, and the second will only match "a b".

For these reasons, many other utilities and applications have adopted syntaxes that look a lot like Perl's. For example, Java, Ruby, Python, PHP, exim, BBEdit, and even Microsoft's .NET Framework all use regular expression syntax similar to that used in perl. Not all "Perl-compatible" regular expression implementations are identical, and many implement only a subset of Perl's features.

Examples

Conventions used in the examples: The character 'm' is not always required to specify a perl match operation. For example, m/[^abc]/ could also be rendered as /[^abc]/. The 'm' is only necessary if the user wishes to specify a match operation without using a forward-slash as the regex delimiter. Sometimes it is useful to specify an alternate regex delimiter in order to avoid "delimiter collision". See 'perldoc perlre' for more details.

   metacharacter(s) ;; the metacharacters column specifies the regex syntax being demonstrated
   =~ m//           ;; indicates a regex match operation in perl    
   =~ s///          ;; indicates a regex substitution operation in perl

In the table heading below, "M-c" stands for "Metacharacters".

M-c Description Example
All the if statements return a TRUE value.
. Normally matches any character except a newline. Within square brackets the dot is literal.
if ("Hello World\n" =~ m/...../) {
  print "Yep"; # Has length >= 5\n";
}
( ) Groups a series of pattern elements to a single element. When you match a pattern within parentheses, you can use any of $1, $2, ... later to refer to the previously matched pattern.
if ("Hello World\n" =~ m/(H..).(o..)/) {
  print "We matched '$1' and '$2'\n";
}

Output:

We matched 'Hel' and 'o W';

+ Matches the preceding pattern element one or more times.
if ("Hello World\n" =~ m/l+/) {
  print "One or more \"l\"'s in the string\n";
}
? Matches the preceding pattern element zero or one times.
if ("Hello World\n" =~ m/H.?e/) {
  print "There is an 'H' and a 'e' separated by ";
  print "0-1 characters (Ex: He Hoe)\n";
}
? Modifies the *, +, or {M,N}'d regexp that comes before to match as few times as possible.
if ("Hello World\n" =~ m/(l.+?o)/) {
  print "Yep"; # The non-greedy match with 'l' followed
  # by one or more characters is 'llo' rather than 'llo wo'.
}
* Matches the preceding pattern element zero or more times.
if ("Hello World\n" =~ m/el*o/) {
  print "There is an 'e' followed by zero to many ";
  print "'l' followed by 'o' (eo, elo, ello, elllo)\n";
}
{M,N} Denotes the minimum M and the maximum N match count.
if ("Hello World\n" =~ m/l{1,2}/) {
 print "There is a substring with at least 1 ";
 print "and at most 2 l's in the string\n";
}
[...] Denotes a set of possible character matches.
if ("Hello World\n" =~ m/[aeiou]+/) {
  print "Yep"; # Contains one or more vowels
}
| Separates alternate possibilities.
if ("Hello World\n" =~ m/(Hello|Hi|Pogo)/) {
  print "At least one of Hello, Hi, or Pogo is ";
  print "contained in the string.\n";
}
\b Matches a word boundary.
if ("Hello World\n" =~ m/llo\b/) {
  print "There is a word that ends with 'llo'\n";
}
\w Matches an alphanumeric character, including "_".
if ("Hello World\n" =~ m/\w/) {
  print "There is at least one alphanumeric ";
  print "character in the string (A-Z, a-z, 0-9, _)\n";
}
\W Matches a non-alphanumeric character, excluding "_".
if ("Hello World\n" =~ m/\W/) {
  print "The space between Hello and ";
  print "World is not alphanumeric\n";
}
\s Matches a whitespace character (space, tab, newline, form feed)
if ("Hello World\n" =~ m/\s.*\s/) {
  print "There are TWO whitespace characters, which may";
  print " be separated by other characters, in the string.";
}
\S Matches anything BUT a whitespace.
if ("Hello World\n" =~ m/\S.*\S/) {
  print "Contains two non-whitespace characters " .
        "separated by zero or more characters.";
}
\d Matches a digit, same as [0-9].
if ("99 bottles of beer on the wall." =~ m/(\d+)/) {
  print "$1 is the first number in the string'\n";
}
\D Matches a non-digit.
if ("Hello World\n" =~ m/\D/) {
  print "There is at least one character in the string";
  print " that is not a digit.\n";
}
^ Matches the beginning of a line or string.
if ("Hello World\n" =~ m/^He/) {
  print "Starts with the characters 'He'\n";
}
$ Matches the end of a line or string.
if ("Hello World\n" =~ m/rld$/) {
  print "Is a line or string ";
  print "that ends with 'rld'\n";
}
\A Matches the beginning of a string (but not an internal line).
if ("Hello\nWorld\n" =~ m/\AH/) {
  print "Yep"; # The string starts with 'H'.
}
\Z Matches the end of a string (but not an internal line).
if ("Hello\nWorld\n"; =~ m/d\n\Z/) {
  print "Yep"; # Ends with 'd\\n'\n";
}
[^...] Matches every character except the ones inside brackets.
if ("Hello World\n" =~ m/[^abc]/) {
  print "Yep"; # Contains a character other than a, b, and c.
}

Use in Tools

Tools and languages that utilize Perl regular expression syntax include:

See also

Links


POSIX Basic Regular Expressions

The POSIX Basic Regular Expression (BRE) syntax provided extensions to achieve consistency between utility programs such as grep, sed and awk. These extensions are not supported by some traditional implementations of Unix tools.

History

Traditional Unix regular expression syntax followed common conventions that often differed from tool to tool. The POSIX Basic Regular Expressions syntax was developed by the IEEE, together with an extended variant called Extended Regular Expression syntax. These standards were designed mostly to provide backward compatibility with the traditional Simple Regular Expressions syntax, providing a common standard which has since been adopted as the default syntax of many Unix regular expression tools.

Syntax

In POSIX Basic Regular Expression syntax, most characters are treated as literals — they match only themselves (e.g., a matches "a"). The exceptions, listed below, are called metacharacters or metasequences.

Metacharacter Description
. Matches any single character (many applications exclude newlines, and exactly which characters are considered newlines is flavor, character encoding, and platform specific, but it is safe to assume that the line feed character is included). Within POSIX bracket expressions, the dot character matches a literal dot. For example, a.c matches "abc", etc., but [a.c] matches only "a", ".", or "c".
[ ] A bracket expression. Matches a single character that is contained within the brackets. For example, [abc] matches "a", "b", or "c", and [a-z] specifies a range which matches any lowercase letter from "a" to "z". These forms can be mixed: [abcx-z] matches "a", "b", "c", "x", "y", or "z", as does [a-cx-z].

The - character is treated as a literal character if it is the last or the first character within the brackets: [abc-], [-abc]. The ] character can be included in a bracket expression if it is the first character: []abc]. The bracket expression may also contain character classes, equivalence classes, and collating characters.

[^ ] Matches a single character that is not contained within the brackets. For example, [^abc] matches any character other than "a", "b", or "c", and [^a-z] matches any single character that is not a lowercase letter from "a" to "z". These forms can be mixed: [^abcx-z] matches any character other than "a", "b", "c", "x", "y", or "z".

The - character is treated as a literal character if it is the last character or the first characted after ^: [^abc-], [^-abc]. The ] character is treated as a literal character if it is the first character after ^: [^]abc]. The expression may also contain character classes, equivalence classes, and collating characters.

^ Matches the starting position within the string, if it is the first character of the regular expression.
$ Matches the ending position of the string, if it is the last character of the regular expression.
* Matches the preceding element zero or more times. For example, ab*c matches "ac", "abc", "abbbc", etc. [xyz]* matches "", "x", "y", "z", "zx", "zyx", "xyzzy", and so on.
BRE: \{m\}
ERE: {m}
Matches the preceding element exactly m times. For example, a\{3\} matches only "aaa".
BRE: \{m,\}
ERE: {m,}
Matches the preceding element at least m times. For example, a\{3,\} matches "aaa", "aaaa", "aaaaa", "aaaaaa", "aaaaaaa", and so on.
BRE: \{m,n\}
ERE: {m,n}
Matches the preceding element at least m and not more than n times. For example, a\{3,5\} matches only "aaa", "aaaa", and "aaaaa". This is not found in a few older instances of regular expressions.
BRE: \( \)
ERE: ( )
Defines a subexpression. It is treated as a single element. For example, ab* matches "a", "ab", "abb" and so on, while \(ab\)* matches "", "ab", "abab", "ababab", and so on. The string matched within the parentheses can be recalled later (see the next entry, \n). A subexpression is also called a marked subexpression, a block or a capturing group.
BRE only: \n Matches what the nth marked subexpression matched, where n is a digit from 1 to 9. This construct is theoretically irregular (an expression with this construct does not obey the mathematical definition of regular expression), and was not adopted in the POSIX ERE syntax.

Examples:

  • .at matches any three-character string ending with "at", including "hat", "cat", and "bat".
  • [hc]at matches "hat" and "cat".
  • [^b]at matches all strings matched by .at except "bat".
  • ^[hc]at matches "hat" and "cat", but only at the beginning of the string or line.
  • [hc]at$ matches "hat" and "cat", but only at the end of the string or line.
  • \[.\] matches any single character surrounded by "[" and "]" since the brackets are escaped, for example: "[a]" and "[b]".

Character classes

The POSIX standard defines some classes or categories of characters as shown below. These classes are used within brackets.

POSIX class similar to meaning
[:upper:] [A-Z] uppercase letters
[:lower:] [a-z] lowercase letters
[:alpha:] [A-Za-z] upper- and lowercase letters
[:digit:] [0-9] digits
[:xdigit:] [0-9A-Fa-f] hexadecimal digits
[:alnum:] [A-Za-z0-9] digits, upper- and lowercase letters
[:punct:] punctuation (all graphic characters except letters and digits)
[:blank:] [ \t] space and TAB characters only
[:space:] [ \t\n\r\f\v] blank (whitespace) characters
[:cntrl:] control characters
[:graph:] [^ [:cntrl:]] graphic characters (all characters which have graphic representation)
[:print:] [[:graph:] ] graphic characters and space
[:word:] [[:alnum:]_] Alphanumeric characters with underscore character _, meaning alnum + _. It is a perl specific character class.

For example,

  • a[[:digit:]]b matches "a0b", "a1b", ..., "a9b".
  • a[:digit:]b is an error: character classes must be in brackets
  • [[:digit:]abc] matches any digit, "a", "b", and "c".
  • [abc[:digit:]] is the same as above
  • [^ABZ[:lower:]] matches any character except lowercase letters, A, B, and Z.

Collating symbols

Collating symbols, like character classes, are used in brackets and have the form [.ch.]. Here ch is a digraph. Collating systems are defined by the locale.

Equivalence classes

Equivalence classes, like character classes and collating symbols, are used in brackets and have the form [=a=]. They stand for any character which is equivalent to the given. According to the standard[1],

For example, if 'a', 'à', and 'â' belong to the same equivalence class, then "[[=a=]b]", "[[=à=]b]", and "[[=â=]b]" are each equivalent to "[aàâb]".

Equivalence classes, like collating symbols, are defined by the locale.

External links

Use in Tools

Tools and languages that utilize this regular expression syntax include:


POSIX-Extended Regular Expressions

The more advanced "extended" regular expressions can sometimes be used with Unix utilities by including the command line flag "-E". Other Unix utilities, like awk, use it by default.

The main difference is that some backslashes are removed: \{…\} becomes {…} and \(…\) becomes (…). Examples:

  • "[hc]+at" matches with "hat", "cat", "hhat", "chat", "hcat", "ccchat" etc.
  • "[hc]?at" matches "hat", "cat" and "at"
  • "([cC]at)|([dD]og)" matches "cat", "Cat", "dog" and "Dog"

The characters (,),[,],.,*,?,+,|,^ and $ are special symbols and have to be escaped with a backslash symbol in order to be treated as literal characters. For example:

"a\.(\(|\))" matches with the string "a.)" or "a.("

Modern regular expression tools allow a quantifier to be specified as non-greedy, by putting a question mark after the quantifier: (\[\[.*?\]\]).


Table of metacharacters

The following metacharacters are used:

Metacharacter Description
. Matches any single character (many applications exclude newlines, and exactly which characters are considered newlines is flavor, character encoding, and platform specific, but it is safe to assume that the line feed character is included). Within POSIX bracket expressions, the dot character matches a literal dot. For example, a.c matches "abc", etc., but [a.c] matches only "a", ".", or "c".
[ ] A bracket expression. Matches a single character that is contained within the brackets. For example, [abc] matches "a", "b", or "c". [a-z] specifies a range which matches any lowercase letter from "a" to "z". These forms can be mixed: [abcx-z] matches "a", "b", "c", "x", "y", or "z", as does [a-cx-z].

The - character is treated as a literal character if it is the last or the first (after the ^) character within the brackets: [abc-], [-abc]. Note that backslash escapes are not allowed. The ] character can be included in a bracket expression if it is the first (after the ^) character: []abc].

[^ ] Matches a single character that is not contained within the brackets. For example, [^abc] matches any character other than "a", "b", or "c". [^a-z] matches any single character that is not a lowercase letter from "a" to "z". As above, literal characters and ranges can be mixed.
^ Matches the starting position within the string. In line-based tools, it matches the starting position of any line.
$ Matches the ending position of the string or the position just before a string-ending newline. In line-based tools, it matches the ending position of any line.
BRE: \( \)
ERE: ( )
Defines a marked subexpression. The string matched within the parentheses can be recalled later (see the next entry, \n). A marked subexpression is also called a block or capturing group.
\n Matches what the nth marked subexpression matched, where n is a digit from 1 to 9. This construct is theoretically irregular and was not adopted in the POSIX ERE syntax. Some tools allow referencing more than nine capturing groups.
* Matches the preceding element zero or more times. For example, ab*c matches "ac", "abc", "abbbc", etc. [xyz]* matches "", "x", "y", "z", "zx", "zyx", "xyzzy", and so on. \(ab\)* (in BRE) or (ab)* (in ERE) matches "", "ab", "abab", "ababab", and so on.
BRE: \+
ERE: +
Matches the preceding element one or more times. For example, ab\+c (in BRE) or ab+c (in ERE) matches "abc", "abbbc", etc., but not "ac", [xyz]\+ (in BRE) or [xyz]+ (in ERE) matches "x", "y", "z", "zx", "zyx", "xyzzy", and so on. \(ab\)\+ (in BRE) or (ab)+ (in ERE) matches "ab", "abab", "ababab", and so on.
BRE: \?
ERE: ?
Matches the preceding element one or zero times. For example, ab\?c (in BRE) or ab?c (in ERE) matches either "ac" or "abc", while \(ab\)\? (in BRE) or (ab)? (in ERE) matches "" or "ab".
BRE: \|
ERE: |
Matches the preceding element or the following element. For example, abc\|def (in BRE) or abc|def (in ERE) matches either "abc" or "def".
BRE: \{m,n\}
ERE: {m,n}
Matches the preceding element at least m and not more than n times. For example, a\{3,5\} (in BRE) or a{3,5} (in ERE) matches only "aaa", "aaaa", and "aaaaa".
BRE: \{m\}
ERE: {m}
Matches the preceding element exactly m times.
BRE: \{m,\}
ERE: {m,}
Matches the preceding element at least m times.
BRE: \{,n\}
ERE: {,n}
Matches the preceding element not more than n times. For example, ba\{,2\}b (in BRE) or ba{,2}b (in ERE) matches only "bb", "bab", and "baab".


Character classes

The POSIX standard defines some classes or categories of characters as shown in the following table:

POSIX class similar to meaning
[:upper:] [A-Z] uppercase letters
[:lower:] [a-z] lowercase letters
[:alpha:] [[:upper:][:lower:]] upper- and lowercase letters
[:alnum:] [[:alpha:][:digit:]] digits, upper- and lowercase letters
[:digit:] [0-9] digits
[:xdigit:] [0-9A-Fa-f] hexadecimal digits
[:punct:] [.,!?:…] punctuation
[:blank:] [ \t] space and TAB characters only
[:space:] [ \t\n\r\f\v] blank (whitespace) characters
[:cntrl:] control characters
[:graph:] [^\t\n\r\f\v] printed characters
[:print:] [^ \t\n\r\f\v] printed characters and space

Links:

Use in Tools

Tools and languages that utilize this regular expression syntax include:

  • AWK - uses a superset of the extended regular expression syntax

Links


Non-POSIX Basic Regular Expressions

Non POSIX Basic Regular Expression Syntax: An additional non-POSIX class understood by some tools is [:word:], which is usually defined as [:alnum:] plus underscore. This form of regular expression is used to reflect the fact that in many programming languages these characters may be used in identifiers.

Operators
Operator Effect
. The dot operator matches any single character.
[ ] boxes enable a single character to be matched against a character lists or character range.
[^ ] A complement box enables a single character not within in a character list or character range to be matched.
* An asterisk specifies zero or more characters to match.
^ The caret anchor matches the beginning of the line
$ The dollar anchor matches the end of the line

The editor vim further distinguishes word and word-head classes (using the notation \w and \h) since in many programming languages the characters that can begin an identifier are not the same as those that can occur in other positions.

(For an ASCII chart color-coded to show the POSIX classes, see ASCII.)

Use in Tools

Tools and languages that utilize this regular expression syntax include:


Emacs Regular Expressions

Notes on regular expressions used in text editor Emacs:

  • For backslash escaping (magic vs literal), Emacs uses a mixture of BRE and ERE. Like in ERE, Emacs supports unescaped +, ?. Like in BRE, Emacs supports escaped \(, \), \|, \{, \}.
  • GNU extensions to regular expressions supported by Emacs include \w, \W, \b, \B, \<, \>, \` , \' (start and end of buffer)
  • No "\s" like in PCRE; whitespace is matched by "\s-".
  • No "\d" like in PCRE; use [0-9] or [[:digit:]]
  • No lookahead and no lookbehind like in PCRE
  • Emacs regexp can match characters by syntax using mode-specific syntax tables ("\sc", "\s-", "\s ") or by categories ("\cc", "\cg").

Use in Tools

Tools and languages that utilize this regular expression syntax include:

Links


Shell Regular Expressions

The Unix shell recognises a glob syntax for use with filename substitution. With the extglob extension of ksh and bash, it is equivalent to a basic form of regular expression in expressiveness.

Operators
Operator Effect
? The hook operator specifies any single character.
[ ] boxes enable a single character to be matched against a character lists or character range.
[! ] A compliment box enables a single character not within in a character list or character range to be matched. A non-standard equivalent, supported by GNU bash, is the caret syntax [^ ].
* An asterisk specifies zero or more characters to match.
?(pattern-list) Matches zero or one occurrence of the given patterns. extglob extension.
*(pattern-list) Matches zero or more occurrences of the given patterns. extglob extension.
+(pattern-list) Matches one or more occurrences of the given patterns. extglob extension.
@(pattern-list) Matches exactly one of the given patterns. extglob extension.
!(pattern-list) Matches anything except one of the given patterns. extglob extension.

Differences from common regular expressions are:

  • The asterisk and hook operators mean different things.
  • The compliment bow is formed using the exclamation mark (!) in the POSIX standard, not a caret (^). Some shells, such as GNU bash, allow the caret to be used as an extension.
  • The extglob extensions are not standard. They are only available in Korn shell (enabled by default) and GNU bash (enabled via shopt -s extglob).

Use in Tools

Tools and languages that utilize this regular expression syntax include:

  • For the general glob syntax, all POSIX "bourne-compatible" shells.
  • For the extglob part, Korn shell and bash.

Links


Implementation

Implementations and running times

There are at least 3 different algorithms that decide if (and how) a given string matches a regular expression. They are based on different representations of the regular expression as a Finite Automaton and on the amount of functionality present in the matcher.

  1. An NFA based matcher without back-references and look ahead/behind. An input of size O(n) can be tested against a regular expression of size O(m) in time O(nm), and additional O(m) extra space by simulating an NFA using Thompson's algorithm. If c sub-match capture groups are to be recorded, then the running time increases to O(nm log c), but the space requirement remains O(m).
  2. An NFA based matcher with back-references and look ahead/behind. Such a matcher needs to be implemented using backtracking. An input of size O(n) can be tested against a regular expression of size O(m) in time O(2mn) using backtracking. Some effort is needed to ensure that the backtracking based matcher doesn't enter an infinite loop, testing the same path over and over again.
  3. A DFA based matcher. DFA based matchers can't support back-references, sub-match captures, or look ahead/behind. This is the oldest and fastest kind of matcher and relies on a result in formal language theory that allows every nondeterministic Finite State Machine (NFA) to be transformed into a deterministic finite state machine (DFA). The algorithm performs or simulates this transformation and then runs the resulting DFA on the input string, one symbol at a time. The latter process (DFA matching) takes time that is proportional to the length of the input string. More precisely, a regular expression of size m on an input alphabet of size S can be converted into a DFA in time O(2mS), and subsequently an input string of size n can be tested against a DFA of any size in time O(n).

The DFA based algorithm is fast to match input against a regular expression, but can be used only for matching and not for recalling grouped subexpressions. There is a variant that can recall grouped subexpressions, but its running time slows down to O(n2m) [citation needed].

The running time of the backtracking based algorithm can be exponential, which simple implementations exhibit when matching against expressions like "(a|aa)*b" that contain both alternation and unbounded quantification and force the algorithm to consider an exponential number of subcases. More complex implementations identify and speed up various common cases where they would otherwise run slowly.

Even though backtracking implementations only give an exponential guarantee in the worst case, they allow much greater flexibility and provide more expressive power. For instance any implementation that allows the use of backreferences, or implements the various improvements that Perl introduced, must use a backtracking implementation.

Some implementations try to provide the best of both algorithms by first running a fast DFA match to see if the string matches the regular expression at all, and only in that case perform a potentially slower backtracking match.


Glossary

This is a glossary of the book.

\A
In some flavors, the beginning of a string but not of a line in the string
\b
In some flavors, a word boundary
\B
In some flavors, a complement to \b
BRE
Basic regular expressions
\d
In some flavors, a digit
\D
In some flavors, a complement to \d
Emacs
A scriptable text editor with support for regular expressions
ERE
Extended regular expressions
GNU
A project to create a free-as-in-freedom operating system, which provides extensions to regular expressions used in tools such as Grep or Sed
Greedy
Of an operator, matching as much as it can
Grep
A command-line tool for finding lines in a text files that match a regular expression
Java
A byte-compiled programming language with support for regular expressions in its standard library since version 1.4
JavaScript
A scripting languages for the web supported by web browsers, with built-in support for regular expressions
Metacharacter
A character or sequence of characters with a special meaning, such as "." or "\+".
PCRE
Perl compatible regular expressions
Perl
An interpreted scripting language noted for its regular expressions
PHP
An interpreted scripting language with support for regular expressions
Regex
A regular expression
Regular expression
A string containing special characters indicating patterns, intended to match literal strings
\s
In some flavors, a whitespace character: space, tab, newline, form feed
\s-
In Emacs, a whitespace character
\S
In some flavors, a complement to \s
Sed
A non-interactive editor or command-line tool noted for its "s" command substituting strings that match a regular expression with other strings
\u13F
In some flavors, the character with the hexadecimal Unicode value of 13F.
Vim
A scriptable text editor with support for regular expressions
\w
In some flavors, an alphanumeric character, including "_"
\W
In some flavors, a complement to \w
\xF7
In some flavors, the character with the hexadecimal ASCII value of F7.
\x{13F}
In some flavors, the character with the hexadecimal Unicode value of 13F.
\Z
In some flavors, the end of a string but not of a line in the string
\<
In some flavors, an empty string before the beginning of a word
\>
In some flavors, an empty string after the end of a word
^
The beginning of a line
$
The end of a line
.
Any single character, but possibly not a newline
[
The opening of a character class
]
The closing of a character class
(
In some flavors, the opening of a group
)
In some flavors, the closing of a group
\(
In some flavors, the opening of a group
\)
In some flavors, the closing of a group
{
In some flavors, the opening of a match-count iterator
}
In some flavors, the closing of a match-count iterator
\{
In some flavors, the opening of a match-count iterator
\}
In some flavors, the closing of a match-count iterator
|
In some flavors, a marking of an alternative
\|
In some flavors, a marking of an alternative
\1
In some flavors, a backreference to the 1st group
\2
In some flavors, a backreference to the 2nd group
*
Any number of the previous
+
In some flavors, one or more of the previous
\+
In some flavors, one or more of the previous
?
In some flavors, one or none of the previous
\?
In some flavors, one or none of the previous
*?
In some flavors, a non-greedy version of *
+?
In some flavors, a non-greedy version of +
}?
In some flavors, a non-greedy version of }


Links

See also

External Links