An Awk Primer/Search Patterns (1)

As you already know, Awk goes line-by-line through a file, and for each line that matches the search pattern, it executes a block of statements. So far, we have only used very simple search patterns, like /gold/, but you will now learn advanced search patterns. The search patterns on this page are called regular expressions. A regular expression is a set of rules that can match a particular string of characters.

Simple Patterns

The simplest kind of search pattern that can be specified is a simple string, enclosed in forward-slashes (/). For example:

/The/

This searches for any line that contains the string "The". This will not match "the" as Awk is case-sensitive, but it will match words like "There" or "Them".

This is the crudest sort of search pattern. Awk defines special characters, or meta-characters, that can be used to make the search more specific. Preceding the string with a ^ ("caret") tells Awk to search for the string at the beginning of the input line. For example:

/^The/

This matches any line that begins with the string "The". Lines that contain "The" but don't start with it will not be matched.

Similarly, following the string with a $ matches any line that ends with the search pattern. For example:

/The$/

Lines that do not end with "The" will not be matched in this example.
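
The difference between the three forms is easiest to see by running each against the same input. The following is a minimal sketch, assuming a POSIX shell with some awk on the PATH; the sample() helper and its three lines are made up for illustration. A pattern with no action block prints every matching line.

```shell
# Hypothetical sample input: "The" at the start, middle, and end of a line.
sample() {
    printf '%s\n' 'The cat sat' 'Feed The cat' 'I said The'
}

sample | awk '/The/'    # contains "The": prints all three lines
sample | awk '/^The/'   # begins with "The": prints "The cat sat"
sample | awk '/The$/'   # ends with "The": prints "I said The"
```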

But what if we actually want to search the text for a character like ^ or $? Simple, we just precede the character with a backslash (\). For example:

/\$/

This matches any line with a "$" in it.
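
To see what the backslash changes, compare the escaped and unescaped forms on the same input. This is only a sketch: the has_dollar() helper name and the two-line price list are invented for the example.

```shell
# Hypothetical helper: prints only lines containing a literal "$".
has_dollar() {
    awk '/\$/'
}

printf '%s\n' 'coffee $3' 'tea 2 euros' | has_dollar   # prints "coffee $3"

# Without the backslash, $ is just the end-of-line anchor,
# so /$/ matches every line and both lines are printed.
printf '%s\n' 'coffee $3' 'tea 2 euros' | awk '/$/'
```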

Alternatives

There are many different meta-characters that can be used to customize the search pattern.

Character Sets

It is possible to specify a set of alternative characters using square brackets ([]):

/[Tt]he/ 

This example matches the strings "The" and "the". A range of characters can also be specified. For example:

/[a-z]/

This matches any character from "a" to "z", and:

/[a-zA-Z0-9]/

This matches any letter or digit.

A range of characters can also be excluded by preceding the range with a ^. This is different from the caret that matches the beginning of a string because it is found inside the square brackets. For example:

/^[^a-zA-Z0-9]/

This matches any line whose first character is not a letter or digit. An empty line does not match, because it has no first character at all. To include a literal caret in a character set, make sure that it is not the first character listed.
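
The character-set patterns above can be tried out as follows. This is a sketch with made-up input; sample_lines() is a hypothetical helper, and its third line is deliberately empty to show how the negated set treats it.

```shell
# Hypothetical input; the third line is empty.
sample_lines() {
    printf '%s\n' 'the end' 'The end' '' '#comment' '42 items'
}

sample_lines | awk '/[Tt]he/'          # prints "the end" and "The end"
sample_lines | awk '/^[^a-zA-Z0-9]/'   # prints only "#comment"
```

The empty line is not printed by the second command: the negated set still has to match one character, and an empty line has none.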

Alternation

A vertical bar (|) allows regular expressions to be logically OR'ed. For example:

/(^Germany)|(^Netherlands)/

This matches lines that start with the word "Germany" or the word "Netherlands". Notice how parentheses are used to group the two expressions.
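
As a quick check of the alternation above, the sketch below runs it on a few invented lines; countries() is a hypothetical helper. It also shows an equivalent spelling in which the caret is written once, outside the group.

```shell
# Hypothetical input: two lines start with a listed name, two do not.
countries() {
    printf '%s\n' 'Germany 83' 'West Germany' 'Netherlands 17' 'Belgium 11'
}

countries | awk '/(^Germany)|(^Netherlands)/'   # prints "Germany 83" and "Netherlands 17"
countries | awk '/^(Germany|Netherlands)/'      # same lines, anchor factored out
```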

Wildcards

The dot (.) allows "wildcard" matching, meaning it can be used to specify any arbitrary character. For example:

/f.n/

This matches "fan", "fun", "fin", but also "fxn", "f4n", and any other string that has an f, one character, then an n.
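
A small experiment makes the "exactly one character" rule concrete. The input words below are chosen only for illustration.

```shell
# "fn" has nothing between f and n, and "feign" has three characters
# between them, so neither is printed. "define" is printed because
# it contains "fin".
printf '%s\n' 'fan' 'f4n' 'fn' 'define' 'feign' | awk '/f.n/'
# prints "fan", "f4n" and "define"
```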

Repetition

The dot plays the role that the ? wildcard plays in the UNIX shell, standing for any single character, but Awk interprets the * wildcard in a subtly different way from the shell. In the UNIX shell, the * substitutes for a string of arbitrary characters of any length, including zero, while in Awk the * simply matches zero or more repetitions of the previous character or expression. For example, "a*" would match "a", "aa", "aaa", and so on, as well as the empty string. That means that ".*" will match any string of characters. As a more complicated example:

/(ab|c)*/

This matches "ab", "abab", "ababab", "c", "cc", "ccc", and even "abc", "ababc", "cabcababc", or any other similar combination. Because zero repetitions are allowed, it also matches the empty string.
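
One consequence of "zero or more" is worth trying out: used on its own, a starred expression is satisfied by zero repetitions somewhere in every line, so it only becomes selective once it is anchored. The sketch below uses invented input (the combos() helper is hypothetical, and its last line is empty).

```shell
# Hypothetical input; the fifth line is empty.
combos() {
    printf '%s\n' 'abab' 'cabc' 'abd' 'xyz' ''
}

# Unanchored: every line matches, so all five lines are counted.
combos | awk '/(ab|c)*/' | wc -l

# Anchored at both ends: the whole line must be built from "ab" and "c".
# Prints "abab", "cabc", and the empty line (zero repetitions).
combos | awk '/^(ab|c)*$/'
```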

There are other characters that allow matches with repeated characters. A ? matches exactly zero or one occurrences of the previous regular expression, while a + matches one or more occurrences of the previous regular expression. For example:

/^[+-]?[0-9]+$/

This matches any line that consists only of a (possibly signed) integer number. This is a somewhat confusing example and it is helpful to break it down by parts:

  • ^ Find string at beginning of line.
  • [+-]? Specify possible "+" or "-" sign for number.
  • [0-9]+ Specify at least one digit "0" through "9".
  • $ Specify that the line ends with the number.
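
The breakdown above can be checked against a handful of inputs. This is a sketch; is_int() is a hypothetical helper name and the test values are made up.

```shell
# Hypothetical helper: prints only lines that are a (possibly signed) integer.
is_int() {
    awk '/^[+-]?[0-9]+$/'
}

printf '%s\n' '42' '-7' '+15' '3.14' '12a' '--5' '' | is_int
# prints "42", "-7" and "+15"
```

"3.14" and "12a" fail because of the $ anchor, "--5" fails because ? allows at most one sign, and the empty line fails because + demands at least one digit.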

Specific Repetition

Although this syntax is defined by the Single UNIX Specification, many Awk implementations do not support it. GNU Awk versions before 4.0 need the --posix or --re-interval option to enable it.
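
Since support varies, it can help to probe the awk you actually have before relying on this syntax. The following is one possible check, not an official test: awk_has_intervals() is a hypothetical helper that relies on the fact that an awk without this feature treats the curly brackets as ordinary characters, so the line "aa" fails to match.

```shell
# Hypothetical helper: succeeds if the awk on the PATH treats {2}
# as a repetition count rather than as literal characters.
awk_has_intervals() {
    printf 'aa\n' | awk '/^a{2}$/ { found = 1 } END { exit !found }'
}

if awk_has_intervals; then
    echo "curly-bracket repetition is supported"
else
    echo "curly-bracket repetition is not supported by this awk"
fi
```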

If a regular expression is to be matched a particular number of times, curly brackets ({}) can be used. For example:

/f[eo]{2}t/

This matches "foot" or "feet", but also "foet" and "feot", since each of the two repeated characters may be either "e" or "o".

To specify that the number is a minimum, follow it with a comma.

/[0-9]{3,}/

This matches a number with at least three digits. It is a much easier way of writing:

/[0-9][0-9][0-9]+/

Two comma-separated numbers can be placed within the curly brackets to specify a range.

/^[0-9]{4,7}$/

This matches a line consisting of a number with 4, 5, 6, or 7 digits. Too many or not enough digits, and it won't match.
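
The curly-bracket forms can be tried as shown below. This sketch assumes an awk that supports the syntax (for example GNU Awk 4.0 or later); on other implementations the commands may print nothing. The digits_4_to_7() helper name and the inputs are invented for the example.

```shell
# Hypothetical helper: prints lines made of 4 to 7 digits and nothing else.
digits_4_to_7() {
    awk '/^[0-9]{4,7}$/'
}

printf '%s\n' '123' '1234' '1234567' '12345678' | digits_4_to_7
# with curly-bracket support: prints "1234" and "1234567"

# Exactly two characters from the set, in any combination.
printf '%s\n' 'feet' 'foot' 'foet' 'fat' | awk '/f[eo]{2}t/'
# with curly-bracket support: prints "feet", "foot" and "foet"
```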

Practice

Unless you are already familiar with regular expressions, you should try writing a few of your own. To see if you get it right, make a quick Awk program that prints any line matching the expression, and test it out.

  1. Write a regular expression that matches postal codes. A U.S. ZIP code consists of 5 digits and an optional hyphen with 4 more digits.
  2. Write a regular expression that matches any number, including an optional decimal point followed by more digits.
  3. Write a regular expression that finds e-mail addresses. Look for valid characters (letters, numbers, dots, and underscores) followed by an "@", then more valid characters with at least one dot.
  4. Write a regular expression that matches phone numbers. It should handle area codes, extension numbers, and optional hyphens and parentheses to group the digits. Make sure it doesn't match phone numbers that are too long or too short.
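
One way to set up such a test program is sketched below. It deliberately uses a pattern that is not one of the exercises (a loose two-digit "HH:MM" time, which would also accept values such as 29:59); try_regex() is a hypothetical helper name. Replace the pattern with your own and put lines that should and should not match in the input.

```shell
# Hypothetical helper: reads candidate lines on standard input and
# prints only the ones the pattern accepts, prefixed with "match: ".
try_regex() {
    awk '/^[0-2][0-9]:[0-5][0-9]$/ { print "match: " $0 }'
}

try_regex <<'EOF'
09:30
9:30
23:59
12:75
EOF
# prints "match: 09:30" and "match: 23:59"
```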

On the next page, you will learn other kinds of search patterns.