Regular Expressions/Implementation

      Implementations and running times

      There are at least two different algorithms that decide if (and how) a given string matches a regular expression.

      The oldest and fastest relies on a result in formal language theory that allows every nondeterministic Finite State Machine (NFA) to be transformed into a deterministic finite state machine (DFA). The algorithm performs or simulates this transformation and then runs the resulting DFA on the input string, one symbol at a time. The latter process takes time linear to the length of the input string. More precisely, an input string of size n can be tested against a regular expression of size m in time O(n+2m) or O(nm), depending on the details of the implementation. This algorithm is often referred to as DFA. It is fast, but can be used only for matching and not for recalling grouped subexpressions. There is a variant that can recall grouped subexpressions, but its running time slows down to O(n2m)[citation needed].

      The other algorithm is to match the pattern against the input string by backtracking. (This algorithm is sometimes called NFA, but this terminology is highly confusing.) Its running time can be exponential, which simple implementations exhibit when matching against expressions like "(a|aa)*b" that contain both alternation and unbounded quantification and force the algorithm to consider an exponential number of subcases. More complex implementations identify and speed up various common cases where they would otherwise run slowly.

      Even though backtracking implementations only give an exponential guarantee in the worst case, they allow much greater flexibility and provide more expressive power. For instance any implementation that allows the use of backreferences, or implements the various improvements that Perl introduced, must use a backtracking implementation.

      Some implementations try to provide the best of both algorithms by first running a fast DFA match to see if the string matches the regular expression at all, and only in that case perform a potentially slower backtracking match.

      syntax · Examples

      Last modified on 29 October 2011, at 17:14