Raku Programming/Grammars

Grammars

Regular expressions by themselves are useful but limited. It can be difficult to reuse regexes, difficult to group them into logical bunches, and very difficult to inherit regexes from one bunch to another. This is where grammars come in. Grammars are to regexes what classes are to data and code routines: they let regexes act as first-class components of the programming language and take advantage of the class system. Like classes, grammars can be inherited from, and their rules can be overridden. In fact, the Raku grammar itself can be modified to add new features to the language on the fly. We will see examples of that later.

Rules, Tokens and Protos

Grammars are broken into components called rules, tokens and protos. Tokens are like the regexes we've already seen. Rules are like subroutines because they can call other rules or tokens. Protos are like the proto of a multisub: they declare a rule prototype that several named variants then implement.

Tokens

Tokens are regexes that don't backtrack: once a portion of the expression has matched, that portion is not altered, even if this prevents a larger portion of the expression from matching. While this sacrifices some of the flexibility of regexes, it allows more complex parsers to be created efficiently.

token number {
  \d+ ['.' \d+]?
}
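
To make the "no backtracking" point concrete, here is a minimal sketch that compares a token with an otherwise identical regex. The names ratchet-demo and backtrack-demo are hypothetical and exist only for this illustration; number is the token shown above, declared lexically with my so that it can be used outside a grammar.

```raku
# Hypothetical names, used only for this illustration.
my token ratchet-demo   { \w+ 'end' }   # token: \w+ never gives characters back
my regex backtrack-demo { \w+ 'end' }   # regex: \w+ may give characters back

# \w+ first swallows all of "weekend", leaving nothing for 'end'.
say so 'weekend' ~~ &ratchet-demo;     # False
say so 'weekend' ~~ &backtrack-demo;   # True

# The number token from above, anchored to the whole string.
my token number { \d+ ['.' \d+]? }
say so '3.14' ~~ /^ <number> $/;       # True
say so '3.'   ~~ /^ <number> $/;       # False
```

The token fails quickly instead of retrying shorter matches, which is the trade-off described above.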

Rules

Rules are ways to combine tokens and other rules. Every rule has a name and can refer to other rules or tokens in the same grammar using < > angle brackets. Like tokens, rules do not backtrack, but whitespace in the pattern is significant instead of being ignored: whitespace that follows an atom is matched against the input through the built-in <ws> token. A simple rule looks like this:

rule URL {
  <protocol>'://'<address>
}

This rule matches a URL string where a protocol name such as "ftp" or "https" is followed by the literal symbol "://" and then a string representing an address. This rule depends on two sub-rules, <protocol> and <address>. These could be defined as either tokens or rules, so long as they are in the same grammar:

grammar URL {
  rule TOP {
    <protocol>'://'<address>
  }
  token protocol {
    'http'|'https'|'ftp'|'file'
  }
  rule address {
    <subdomain>'.'<domain>'.'<tld>
  }
  ...
}
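
The grammar above leaves <subdomain>, <domain> and <tld> elided. As a rough sketch of how a complete version might look, the block below fills them in with simple \w+ tokens. Those three definitions are assumptions made for illustration and are far looser than real host names require.

```raku
grammar URL {
    rule TOP {
        <protocol>'://'<address>
    }
    token protocol {
        'http' | 'https' | 'ftp' | 'file'
    }
    rule address {
        <subdomain>'.'<domain>'.'<tld>
    }
    # Assumed definitions: the text above does not say how these are defined.
    token subdomain { \w+ }
    token domain    { \w+ }
    token tld       { \w+ }
}

say so URL.parse('http://www.wikibooks.org');     # True
say so URL.parse('https://www.wikibooks.org');    # True
say so URL.parse('gopher://www.wikibooks.org');   # False
```

The | alternation prefers the longest alternative, so "https" is matched in full even though 'http' is listed first.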

Protos

Protos define a category of rules or tokens that share one name. For example, we could define a proto-token <protocol> and then define several tokens representing different protocols. Inside each of these tokens, <sym> matches the text given in that token's :sym<...> name:

grammar URL {
  rule TOP {
    <protocol>'://'<address>
  }
  proto token protocol {*}

  token protocol:sym<http> {
    <sym>
  }
  token protocol:sym<https> {
    <sym>
  }
  token protocol:sym<ftp> {
    <sym>
  }
  token protocol:sym<ftps> {
    <sym>
  }
  ...
}

This would be roughly equivalent to saying:

token protocol {
  <http> | <https> | <ftp> | <ftps>
}
token http {
  http
}
...

but is more extensible, allowing further types of protocol to be specified later. For example, if we wanted to define a new type of URL that also supported the "spdy" protocol, we could use:

grammar URL::WithSPDY is URL {
  token protocol:sym<spdy> {
    <sym>
  }
}
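
The following sketch puts the proto-token and the derived grammar together so the effect of the extension can be observed. The <address> token here is an assumed stand-in (any run of non-whitespace characters) for the part the examples above leave elided.

```raku
grammar URL {
    rule TOP {
        <protocol>'://'<address>
    }
    proto token protocol {*}
    token protocol:sym<http>  { <sym> }
    token protocol:sym<https> { <sym> }
    token protocol:sym<ftp>   { <sym> }
    token protocol:sym<ftps>  { <sym> }

    # Assumed stand-in for the elided address rule.
    token address { \S+ }
}

grammar URL::WithSPDY is URL {
    token protocol:sym<spdy> { <sym> }
}

say so URL.parse('spdy://www.wikibooks.org');              # False
say so URL::WithSPDY.parse('spdy://www.wikibooks.org');    # True
say so URL::WithSPDY.parse('https://www.wikibooks.org');   # True
```

The derived grammar inherits every existing protocol and only adds the new candidate, so the base grammar does not have to be edited.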

Matching Grammars

Once we have a grammar like the one defined above, we can match a string against it with the .parse method, which succeeds only if the whole string matches:

my Str $mystring = "http://www.wikibooks.org";

if URL.parse($mystring) {
  #if it matches a URL, do something
}
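
As a small illustration of what .parse returns, the sketch below uses a cut-down URL grammar whose <address> token is an assumed simplification. A successful parse yields a defined result, and a string with extra text before or after the URL yields Nil.

```raku
grammar URL {
    rule TOP       { <protocol>'://'<address> }
    token protocol { 'http' | 'https' | 'ftp' | 'file' }
    token address  { \S+ }   # assumed simplification of the address rule
}

my Str $mystring = "http://www.wikibooks.org";

say URL.parse($mystring).defined;                 # True
say URL.parse("visit $mystring").defined;         # False: text before the URL
say URL.parse("$mystring is a site").defined;     # False: text after the URL
```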

Match Objects

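A match object is a special data type that represents the result of a match: the matched text, its position in the input, and the captures made by any sub-rules. A successful .parse returns a match object, and the current match object is also stored in the special variable $/.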
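
A brief sketch of inspecting a match object follows. The grammar is a cut-down URL grammar with an assumed <address> token, and the sample URL is made up for the example.

```raku
grammar URL {
    rule TOP       { <protocol>'://'<address> }
    token protocol { 'http' | 'https' | 'ftp' | 'file' }
    token address  { \S+ }   # assumed simplification of the address rule
}

my $match = URL.parse('ftp://ftp.example.org');

say $match ~~ Match;          # True
say ~$match<protocol>;        # ftp
say ~$match<address>;         # ftp.example.org
say $match<address>.from;     # 6
say $match<address>.to;       # 21

# .parse also stores its result in $/, so $<name> works as a shortcut.
say ~$<protocol>;             # ftp
```

Each named sub-rule becomes a key of the match object, and the value is itself a match object with its own text and positions.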

Parser Actions

A grammar can be turned into a parser that produces results by combining it with a class of parser actions, passed to .parse as the actions argument. Whenever a rule in the grammar matches, the action method with the same name, if one exists, is called with the current match object.
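
A minimal sketch of an actions class is shown below. The class name URL::Actions, the <address> token and the shape of the resulting hash are all assumptions chosen for illustration. Each method uses make to attach a value to its match, and .made reads that value back.

```raku
grammar URL {
    rule TOP       { <protocol>'://'<address> }
    token protocol { 'http' | 'https' | 'ftp' | 'file' }
    token address  { \S+ }   # assumed simplification of the address rule
}

# Hypothetical actions class: method names mirror the rule names.
class URL::Actions {
    method protocol($/) { make $/.Str.uc }
    method address($/)  { make $/.Str }
    method TOP($/) {
        # Combine the values produced by the sub-rules into one hash.
        make %( scheme => $<protocol>.made, host => $<address>.made );
    }
}

my %url = URL.parse('https://www.wikibooks.org', actions => URL::Actions.new).made;
say %url<scheme>;   # HTTPS
say %url<host>;     # www.wikibooks.org
```

Keeping the actions in a separate class means the same grammar can be reused with different actions, for example one class that validates and another that builds a data structure.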