Theory of Computation: Regular expressions

PAPER 1 - ⇑ Theory of computation ⇑

← Compact set notation Regular expressions Backus-naur form →


You will often want to check the format of a string being input and if it is incorrect you will want it to be submitted again. For example you might want someone to input the name of their best friend, meaning that they shouldn't be inputting any letters or spaces, and it should start with a capital letter:

   Code Output

Name of best friend: Beanie //(CORRECT)
Name of best friend: jonny5 //(STOP THIS)

To do this we can match the input string against some rules, regular expressions or regex, in this case we only want characters from the alphabet:

[A-Z][a-z]+

Breaking apart the rule:

  • [A-Z] - start exactly one instance of a capital letter
  • [a-z]+ - followed by as many lower case letters as you like (that's what the + means)

Another example might be checking for the correct spelling of a famous composer:

"Handel", "Händel", and "Haendel"

We can check this using the pattern H(ä|ae?)ndel. Let's take a look at what this means:

  • H - start with an H
  • (ä|ae?) - includes an ä or (the | symbol) an a followed by an optional e (e? means the e is optional)

Most regular expression tools provide the following operations to construct expressions.

Boolean "or"

A vertical bar separates alternatives. For example, gray|grey can match "gray" or "grey".

Grouping

Parentheses are used to define the scope and precedence of the operators (among other uses). For example, gray|grey and gr(a|e)y are equivalent patterns which both describe the set of "gray" and "grey".

Quantification

A quantifier after a token (such as a character) or group specifies how often that preceding element is allowed to occur.

  • ? The question mark indicates there is zero or one of the preceding element. For example, colou?r matches both "color" and "colour".
  • * The asterisk indicates there is zero or more of the preceding element. For example, ab*c matches "ac", "abc", "abbc", "abbbc", and so on.
  • + The plus sign indicates there is one or more of the preceding element. For example, ab+c matches "abc", "abbc", "abbbc", and so on, but not "ac".

Most programming languages have regular expression functions. In VB.NET we can use regular expressions by using the Regex routine:

' this code enforces the name rule from earlier
Dim name As String
Console.Write("Name of best friend: ")
name = Console.Readline()

' match the string against a regular expression
Dim m As Match = Regex.Match(name, "[A-Z][a-z]+")

If (m.Success) Then
    Console.WriteLine("You have input the name correctly")
Else
    Console.WriteLine("Incorrect format!")
End If

A common use for regular expressions is in checking that you have a correctly typed email address. A rule for that is this: ^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$.

You can find out more about Regular expression on wikipedia and you will cover regular expressions in more detail in A2.