Visual Basic/Regular Expressions

Sometimes the built-in string functions are not the most convenient or elegant solution to the problem at hand. If the task involves manipulating complicated patterns of characters, regular expressions can be a more effective tool than sequences of simple string functions.

Visual Basic has no built-in support for regular expressions, but it can use them through the Microsoft VBScript Regular Expressions library. If you have Internet Explorer installed, you almost certainly have the library. To use it, add a reference to the project: on the Project menu choose References and scroll down to Microsoft VBScript Regular Expressions. There might be more than one version; if so, choose the one with the highest version number unless you have a particular reason to choose an older one, such as compatibility with the version on another machine.

Class outline

Outline of the VBScript.RegExp class:

  • Properties
    • RegExp.Pattern
    • RegExp.Global
    • RegExp.IgnoreCase
    • RegExp.MultiLine
  • Methods
    • RegExp.Test
    • RegExp.Replace
    • RegExp.Execute
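
The properties in the outline can be combined on one object. The following is a minimal sketch, not taken from the original text; the variable name Re, the pattern and the sample string are chosen only for illustration.

```vb
Dim Re
Set Re = CreateObject("VBScript.RegExp")
Re.Pattern = "^total"
Re.IgnoreCase = True 'Match "Total" as well as "total"
Re.MultiLine = True  '^ also matches at the start of each line
Re.Global = True     'Find every match, not only the first
'Two lines, each beginning with the word in a different case
MsgBox Re.Execute("Total: 3" & vbLf & "total: 4").Count 'Shows 2
```

With MultiLine left at its default of False, ^ would match only at the very start of the string, so only the first line would be found.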

Constructing a regexp

One way of constructing a regular expression object uses late binding, which needs no project reference:

  Set Regexp = CreateObject("VBScript.RegExp")
  Regexp.Pattern = "[0-9][0-9]*"

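Another way uses early binding. It requires a reference to Microsoft VBScript Regular Expressions (in Excel VBA, set it under Tools, References). Once the reference is set, RegExp is the name of the class, so the variable is given a different name here:

  Dim Regex As RegExp
  Set Regex = New RegExp
  Regex.Pattern = "[0-9][0-9]*"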

Testing for match

An example of testing for a match of a regular expression:

  Set RegExp = CreateObject("VBScript.RegExp")
  RegExp.Pattern = "[0-9][0-9]*"
  If RegExp.Test("354647") Then
   MsgBox "Test 1 passed."
  End If
  If RegExp.Test("a354647") Then
   MsgBox "Test 2 passed." 'This one passes, as the matching is not a whole-string one
  End If
  If RegExp.Test("abc") Then
   MsgBox "Test 3 passed." 'This one does not pass
  End If

An example of testing for match in which the whole string has to match:

  Set RegExp = CreateObject("VBScript.RegExp")
  RegExp.Pattern = "^[0-9][0-9]*$"
  If RegExp.Test("354647") Then
   MsgBox "Test 1 passed."
  End If
  If RegExp.Test("a354647") Then
   MsgBox "Test 2 passed." 'This one does not pass
  End If

Finding matches

An example of iterating through the collection of all the matches of a regular expression in a string:

  Set Regexp = CreateObject("VBScript.RegExp")
  Regexp.Pattern = "a.*?z"
  Regexp.Global = True 'Without global, only the first match is found
  Set Matches = Regexp.Execute("aaz abz acz ad1z")
  For Each Match In Matches
   MsgBox "A match: " & Match
  Next

Finding groups

An example of accessing matched groups:

  Set Regexp = CreateObject("VBScript.RegExp")
  Regexp.Pattern = "(a+) *(b+)" 'a* and b* could also match an empty string at the end, overwriting the groups
  Regexp.Global = True
  Set Matches = Regexp.Execute("aaa bbb")
  For Each Match In Matches
   FirstGroup = Match.SubMatches(0) '=aaa
   SecondGroup = Match.SubMatches(1) '=bbb
  Next

Replacing

An example of replacing all sequences of dashes with a single dash:

  Set Regexp = CreateObject("VBScript.RegExp")
  Regexp.Pattern = "--*"
  Regexp.Global = True
  Result = Regexp.Replace("A-B--C----D", "-") '="A-B-C-D"

An example of replacing doubled strings with their single version, using two sorts of backreference: \1 inside the pattern and $1 inside the replacement string:

  Set Regexp = CreateObject("VBScript.RegExp")
  Regexp.Pattern = "(.*)\1"
  Regexp.Global = True
  Result = Regexp.Replace("hellohello", "$1") '="hello"
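
Backreferences in the replacement string can also reorder what was captured. The following short sketch is an illustration only; the pattern and the sample name are made up, and it assumes a version of the library that supports $1 and $2 in replacement strings, as the example above does.

```vb
Dim Re, Result
Set Re = CreateObject("VBScript.RegExp")
'Group 1 is the word before the comma, group 2 the word after it
Re.Pattern = "(\w+), (\w+)"
Result = Re.Replace("Doe, John", "$2 $1") '="John Doe"
MsgBox Result
```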

Splitting

There is no direct support for splitting by a regular expression, but there is a workaround. If you can assume that the string to be split does not contain Chr(1), you can first replace every match of the separator regular expression with Chr(1), and then use the non-regexp Split function on Chr(1).

An example of splitting by a non-zero number of spaces:

 SplitString = "a b c  d"
 Set Regexp = CreateObject("VBScript.RegExp")
 Regexp.Pattern = " +" 'One or more spaces; " *" would also match the empty string between characters
 Regexp.Global = True
 Result = Regexp.Replace(SplitString , Chr(1))
 SplitArray = Split(Result, Chr(1))
 For Each Element In SplitArray
  MsgBox Element 
 Next
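
The workaround can be wrapped in a reusable function. This is a sketch under the same assumption as above, namely that the text never contains Chr(1); the function name RegexSplit and its parameter names are hypothetical and not part of any library.

```vb
'Splits Text wherever SeparatorPattern matches.
'Assumes Text does not contain Chr(1) and that the
'pattern cannot match an empty string.
Function RegexSplit(Text, SeparatorPattern)
    Dim Re
    Set Re = CreateObject("VBScript.RegExp")
    Re.Pattern = SeparatorPattern
    Re.Global = True
    RegexSplit = Split(Re.Replace(Text, Chr(1)), Chr(1))
End Function

Sub RegexSplitDemo()
    Dim Part
    'Separator: a comma or semicolon with optional surrounding white space
    For Each Part In RegexSplit("a, b;c", "\s*[,;]\s*")
        MsgBox Part 'Shows a, then b, then c
    Next
End Sub
```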

Example application

For many beginning programmers, the ideas behind regular expressions are so foreign that it is worth presenting a simple example before discussing the theory. The example is in fact the beginning of an application for scraping web pages to retrieve source code, so it is relevant too.

Imagine that you need to parse a web page to pick up the major headings and the content to which the headings refer. Such a web page might look like this:

 <html>
  <head>
   <title>RegEx Example</title>
  </head>
  <body>
   <h1>RegEx Example</h1>
    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    <h2>Level Two in RegEx Example</h2>
     bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
   <h1>Level One</h1>
    cccccccccccccccccccccccccccccccccccccc
    <h2>Level Two in Level One</h2>
     dddddddddddddddddddddddddddddddddddd
  </body>
 </html>

What we want to do is extract the text in the two h1 elements, all the text between the first h1 and the second h1, and all the text between the second h1 element and the closing body tag.

We could store the results in an array that looks like this:

"RegEx Example" " aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\n<h2>Level Two in RegEx Example</h2>\nbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb"
"Level One" " cccccccccccccccccccccccccccccccccccccc\n<h2>Level Two in Level One</h2>\n dddddddddddddddddddddddddddddddddddd"

The \n character sequences represent end of line marks. These could be any of carriage return, line feed or carriage return followed by line feed.

A regular expression specifies patterns of characters to be matched and the result of the matching process is a list of sub-strings that match either the whole expression or some parts of the expression. An expression that does what we want might look like this:

 "<h1>\s*([\s\S]*?)\s*</h1>"
 

Actually it doesn't quite do it but it is close. The result is a collection of matches in an object of type MatchCollection:

 Item 0
  .FirstIndex:89
  .Length:24
  .Value:"<h1>RegEx Example</h1>"
  .SubMatches:
   .Count:1
   Item 0
    "RegEx Example"
 Item 1
  .FirstIndex:265
  .Length:20
  .Value:"<h1>Level One</h1>"
  .SubMatches:
   .Count:1
   Item 0
    "Level One"
 

The name of the item is in the SubMatches of each item, but where is the text? To get it we can use Mid$ together with the FirstIndex and Length properties of each match to find the start and finish of the text between the end of one h1 and the start of the next. However, as usual, there is a problem. The text after the last match is not terminated by another h1 element but by the closing body tag, so it would include that tag and everything that follows the body. The solution is to use another expression to get just the body first:

"<body>([\s\S]*)</body>"

This returns just one match with one sub-match, and the sub-match is everything between the opening and closing body tags. Now we can use our original expression on this new string and it should work.
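
The two-step approach described above can be sketched as follows. This is an illustration rather than the application's actual code: the procedure name ExtractHeadings, the variable names and the shortened sample page are all assumptions made for the example.

```vb
Sub ExtractHeadings()
    Dim Html, Body, BodyRe, HeadRe, BodyMatches, Heads
    Dim i, StartPos, EndPos

    'Hypothetical sample page; a real application would load the page source
    Html = "<html><body>" & vbCrLf & _
           "<h1>RegEx Example</h1>" & vbCrLf & "aaaa" & vbCrLf & _
           "<h1>Level One</h1>" & vbCrLf & "cccc" & vbCrLf & _
           "</body></html>"

    'Step 1: keep only what lies between <body> and </body>
    Set BodyRe = CreateObject("VBScript.RegExp")
    BodyRe.Pattern = "<body>([\s\S]*)</body>"
    Set BodyMatches = BodyRe.Execute(Html)
    If BodyMatches.Count = 0 Then Exit Sub
    Body = BodyMatches(0).SubMatches(0)

    'Step 2: find every h1 element in the body
    Set HeadRe = CreateObject("VBScript.RegExp")
    HeadRe.Pattern = "<h1>\s*([\s\S]*?)\s*</h1>"
    HeadRe.Global = True
    Set Heads = HeadRe.Execute(Body)

    'Step 3: the text for a heading runs from the end of its h1
    'to the start of the next h1, or to the end of the body
    For i = 0 To Heads.Count - 1
        'FirstIndex is zero-based, Mid$ positions are one-based
        StartPos = Heads(i).FirstIndex + Heads(i).Length + 1
        If i < Heads.Count - 1 Then
            EndPos = Heads(i + 1).FirstIndex
        Else
            EndPos = Len(Body)
        End If
        MsgBox Heads(i).SubMatches(0) & ":" & _
               Mid$(Body, StartPos, EndPos - StartPos + 1)
    Next
End Sub
```

Because FirstIndex counts from zero while Mid$ counts from one, the zero-based index of the next h1 is also the one-based position of the last character before it, which is why EndPos needs no adjustment.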

Now that you have seen an example, here is a detailed description of the expressions and of the property settings of the regular expression object used.

A regular expression is simply a string of characters but some characters have special meanings. In this expression:

"<body>([\s\S]*)</body>"

there are three principal parts:

"<body>"
"([\s\S]*)"
"</body>"

Each of these parts is also a regular expression. The first and last are simple strings with no meaning beyond the identity of the characters; they will match any string that includes them as a substring.

The middle expression is rather more obscure. It matches absolutely any sequence of characters and also captures what it matches. Capturing is indicated by surrounding the expression with round brackets. The text that is captured is returned as one of the SubMatches of a match.

<body> matches just <body>
( begins a capture expression
[ begins a character class
\s specifies the character class that includes all white space characters
\S specifies the character class that includes all non-white space characters
] ends the character class
* means that zero or more instances of the preceding expression are to be matched, as many as possible
) ends the capture expression
</body> matches </body>
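
The character class [\s\S] is used instead of the more familiar dot because the dot does not match a newline character. The following sketch illustrates the difference; the variable names and the sample string are made up for the example.

```vb
Dim Re, Html
Set Re = CreateObject("VBScript.RegExp")
Html = "<body>line one" & vbCrLf & "line two</body>"

'The dot stops at the line break, so the closing tag is never reached
Re.Pattern = "<body>(.*)</body>"
MsgBox Re.Test(Html) 'False

'[\s\S] matches every character, including line breaks
Re.Pattern = "<body>([\s\S]*)</body>"
MsgBox Re.Test(Html) 'True
```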

In the case studies section of this book there is a simple application that you can use to test regular expressions: Regular Expression Tester.
