Parrot Virtual Machine/Parrot Grammar Engine

Parrot Grammar Engine

The Parrot Grammar Engine (PGE) is an implementation of Perl 6's regular expression syntax on Parrot. The Perl 6 regular expression syntax is much more powerful and expressive than the regular expression syntax in Perl 5, and since the regular expression implementations in many other languages are based on Perl 5's, it is more powerful than those as well. The Perl 6 regular expression and grammar syntax is similar to that of parser generators such as the Perl 5 module "Parse::RecDescent" or the parser-generator programs Yacc and Bison. Unlike Yacc and Bison, however, the Perl 6 grammar engine uses a recursive descent algorithm instead of an LALR one. The differences are not important to the casual reader, but we will go into more detail about them later.

Perl 6 grammars are combinations of lexical analyzers (which break the source code into tokens) and parsers (which combine tokens into patterns). The lexical analysis portion of the grammar is the regular expression engine. The parser portion of the grammar engine is the rules system that we will describe here.

Parsers and Terminology

Before we talk about PGE any further, we need to discuss some more terminology, and we also need to take a quick look at the algorithms that parsers use. By understanding these algorithms and their limitations, you, the compiler designer, should be able to employ these tools more effectively.

The input to a compiler is a source code file written in a particular high-level language, and it consists of various components. Consider the following statement in the C programming language:

int x = add_both(5, 4);

This statement needs to be broken up into smaller chunks which represent the individual components. A chunk, as we have already seen, is called a "token". The regular expression engine, the lexical analyzer, is used to break the input into tokens by iteratively matching the input against known regular expressions. Using this method, the statement above can be broken down into the following tokens:

KEYWORD IDENTIFIER OPERATOR IDENTIFIER PAREN INTEGER OPERATOR INTEGER PAREN 
 "int"     "x"        "="   "add_both"  "("    "5"     ","      "4"    ")"

Notice that the whitespace has been removed from the list, so only the "important" stuff is turned into a token. Each token has two important pieces of information: The type ("KEYWORD", "IDENTIFIER", etc) and a value ("int", "x", etc). The parser can arrange the token types into a pattern, and the actions can use the token values to construct the PAST parse tree.
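
In PGE, the patterns for these tokens are themselves written in Perl 6 regex syntax. A minimal sketch of what such token definitions might look like (the names, keywords, and character classes here are purely illustrative, not part of any real grammar) is:

token keyword    { 'int' | 'float' | 'return' }
token identifier { <[A..Za..z_]> <[A..Za..z0..9_]>* }
token integer    { \d+ }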

An ordered group of tokens is called a string if it matches a given input pattern for the parser. The set of all strings that are valid for a compiler is called the language. The set of rules that specifies this language is called a grammar.

The third part of a compiler is the back end. The back end, also called the target code generator, takes the intermediate representation of the input source (the parse tree) and converts it into the target language. In our case, the target language is PIR or PBC. All compilers are formed from these three key components: the lexical analyzer, the parser, and the back end. Parrot provides regular expressions for the lexical analyzer, and it handles all the target language generation. The only thing you, as a compiler designer, need to design is the parser. However, there is a lot of information about parsers that we need to cover first.

There are two methods to parse a string of input tokens. First, we can do a top-down parse, where we start with our highest-level token and try to find a match. Second, we can do a bottom-up parse, where we start with the small tokens and try to combine them into bigger and bigger tokens until we have the highest-level token. The ultimate goal of both parsing methods is to reduce the string of input tokens into a single match token.

Bottom-Up Parse

Note
Parser-generators such as yacc or Bison produce bottom-up parsers.

As a simple bottom-up parse example, we will start at the left side of our string of tokens and try to combine small tokens into bigger ones. What we want, in the end, is a token that represents the entire input. We do this by utilizing two fundamental operations: shift and reduce. A shift operation means we read a new input token into the parser. A reduce operation means we turn a matched pattern of small tokens into a single larger token. Let's put this method into practice below.

KEYWORD IDENTIFIER OPERATOR IDENTIFIER PAREN INTEGER OPERATOR INTEGER PAREN 
 "int"     "x"        "="   "add_both"  "("    "5"     ","      "4"    ")"

The above string turns into the string below once we recognize that the tokens "int" and "x" form a variable declaration. We reduce these first two tokens into a single variable declaration token:

DECLARATION OPERATOR IDENTIFIER PAREN INTEGER OPERATOR INTEGER PAREN 
               "="   "add_both"  "("    "5"     ","      "4"    ")"

Now, we move through the line, shifting tokens into the parser from left to right. We can't see anything to reduce until we reach the closing parenthesis. At that point we can reduce the open and close parentheses, and all the tokens between them, into an argument list as shown below:

DECLARATION OPERATOR IDENTIFIER ARGUMENT_LIST 
               "="   "add_both" 

Now, we know that an identifier followed by an argument list is a function call. We reduce these two tokens to a single function call token:

DECLARATION OPERATOR FUNCTION_CALL
               "="   

And finally, by skipping a few steps, we can convert this into an assignment statement:

ASSIGNMENT_STATEMENT

Every time we reduce a set of small tokens into a bigger token, we add them to the tree: the small tokens become children of the bigger token. The type of parser we are describing here is called a "shift-reduce" parser, because those are the actions it performs. A subset of shift-reduce parsers that is useful for arithmetic expressions is the operator precedence parser, which we will talk about more below.
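
For reference, the reductions we just performed correspond roughly to grammar rules like the following (written in the informal ":=" notation that the top-down section below explains; this is a sketch of this one example, not a complete grammar):

DECLARATION          := KEYWORD IDENTIFIER
ARGUMENT_LIST        := PAREN INTEGER OPERATOR INTEGER PAREN
FUNCTION_CALL        := IDENTIFIER ARGUMENT_LIST
ASSIGNMENT_STATEMENT := DECLARATION OPERATOR FUNCTION_CALL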

Note
Shift-reduce parsers are prone to a certain type of error called a shift-reduce conflict. A shift-reduce conflict occurs when a new input token could cause the parser to either shift or reduce; in other words, the parser has more than one option for the input token. A grammar that contains possible shift-reduce conflicts is called an ambiguous grammar. While there are ways to correct such a conflict, it is often better to redesign the language grammar to avoid them entirely. For more information on this, see the Resources section at the bottom of the page.
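
A classic illustration (not specific to Parrot) is the "dangling else" problem. Written in the grammar notation introduced in the next section, it might look like this:

STATEMENT := 'if' EXPRESSION STATEMENT
           | 'if' EXPRESSION STATEMENT 'else' STATEMENT
           | OTHER_STATEMENT

When the parser has just matched 'if' EXPRESSION STATEMENT and the next input token is 'else', it can either shift the 'else' (attaching it to the innermost if) or reduce what it already has to a STATEMENT (leaving the 'else' for an enclosing if). That choice is a shift-reduce conflict.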

Top-Down Parse

A top-down parser is a little different from a bottom-up parser. We start with the highest-level token, and we try to make a match by testing for smaller patterns. This process can be inherently inefficient, because we must often test many different patterns before we find one that matches. However, we gain the ability to avoid shift-reduce conflicts, and we also gain a certain robustness because the parser attempts multiple options before giving up. Let's say that we have a string of tokens, and a top-level definition for a STATEMENT token. Here is the definition in a format called a context-free grammar:

STATEMENT := ASSIGNMENT | FUNCTION_CALL | DECLARATION

This is a simple example, of course, and certainly not enough to satisfy a language like C. The ":=" symbol is analogous to the words "is made of". The vertical bar "|" is the same as saying "or". So, the statement above says "A statement is made of an assignment or a function call or a declaration".

We have this grammar rule that tells us what a statement is, and we have our string of input tokens:

KEYWORD IDENTIFIER OPERATOR IDENTIFIER PAREN INTEGER OPERATOR INTEGER PAREN 
 "int"     "x"        "="   "add_both"  "("    "5"     ","      "4"    ")"

The top-down parser will try each alternative in the STATEMENT definition to see if it matches. If one doesn't match, it moves to the next one and tries again. So, we try the ASSIGNMENT rule, and ASSIGNMENT is defined like this:

ASSIGNMENT := (VARIABLE_NAME | DECLARATION) '=' (EXPRESSION | FUNCTION_CALL)

Parentheses are used to group things together. In English, this statement says "An assignment is a variable name or a declaration, followed by a '=', followed by an expression or a function call". So, to satisfy ASSIGNMENT, we try VARIABLE_NAME first, and then we try DECLARATION. The string "int x" is a declaration and not a simple variable name, so VARIABLE_NAME fails and DECLARATION succeeds. Next, the '=' matches. To satisfy the last group, we try EXPRESSION first, which fails, and then we try FUNCTION_CALL, which succeeds. We proceed this way, trying alternatives and slowly moving down to smaller and smaller tokens, until we've matched the entire input string. Once we have matched the last input token, if we have also satisfied the top-level match, the parser succeeds.
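
To sketch how the remaining pieces might be defined (these rules are illustrative only, not a full C grammar), we could continue with something like:

DECLARATION   := 'int' VARIABLE_NAME
FUNCTION_CALL := FUNCTION_NAME '(' ARGUMENT_LIST ')'
ARGUMENT_LIST := EXPRESSION | EXPRESSION ',' ARGUMENT_LIST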

This type of parser is called a "recursive descent" parser, because we recurse into each subrule, and slowly descend from the top of the parse tree to the bottom. Once the last subrule succeeds, the whole match succeeds and the parser returns.

In this process, when a rule matches, we create a node in our PAST tree. Because we test all subrules before a rule succeeds, all the child nodes are created before the higher-level nodes. This builds the parse tree from the bottom up, even though we started at the top of the tree and moved down.

Top-down parsers can be inefficient because the parser will attempt to match patterns that obviously cannot succeed. However, there are techniques that we can use to "prune" the tree of possibilities by directing the parser towards certain paths or stopping it from going down branches that will not match. We can also prevent the parser from backtracking from subrules back up to larger rules, which helps to reduce unnecessary repetition.
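
Perl 6 rules give us some of these pruning tools directly. For instance, the Perl 6 regex design says that a pattern declared with token (or any rule using the :ratchet modifier) will not be backtracked into once it has matched. The example below is a hedged sketch based on that design, not on anything defined earlier on this page:

# Because 'token' implies :ratchet, once this pattern has matched the
# engine will not back up into it looking for a shorter or different
# match -- the kind of pruning described above.
token integer_constant { \d+ }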

Rules And Actions

A recursive descent parser, like the one used in PGE, is a top-down parser. This means it attempts to start at the highest-level definition and work its way down to match the given input. The top-level rule is always called TOP. After that, we can create as many other rules as we need to in order to specify our entire grammar. Let's start by creating a very simple recursive descent parser to try and parse the input that we gave earlier:

int x = add_both(5, 4);

Here is part of our basic parser:

rule TOP {
  <assignment> | <function_call> | <declaration>
}

rule assignment {
  <lvalue> '=' <rvalue>
}

rule lvalue {
  <variable_name> | <variable_declaration>
}

rule rvalue {
  <expression> | <function_call>
}

rule function_call {
  <identifier> '(' <arguments> ')'
}

rule variable_declaration {
   'int' <identifier>
}

This is only part of the parser, but the idea should be clear. We define each rule in terms of constants or other rules. The angle brackets "<" and ">" indicate that we need to match a subrule. Single quotes are used to indicate something that is a literal string constant.
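
To round out the sketch, we would also need the lower-level rules and tokens that the grammar above refers to. Something like the following could work, using the repetition and grouping operators described in the next section; all names and character classes here are illustrative guesses, not part of a real compiler:

rule arguments {
   <expression> [ ',' <expression> ]*
}

rule expression {
   <integer_constant> | <variable_name>
}

rule variable_name {
   <identifier>
}

token identifier {
   <[A..Za..z_]> <[A..Za..z0..9_]>*
}

token integer_constant {
   \d+
}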

Basic Rules

Rules have a couple of basic operators that we can use, some of which have already been discussed. People who are familiar with regular expressions will recognize most of them.

*     "zero or more of"
      Example: <number>* '.' <number>
      Matches zero or more numbers, followed by a period, followed by one number. Example: 1234.5

+     "one or more of"
      Example: <letter>+ <number>
      Matches one or more letters, followed by a number. Examples: abcde5 or ghij6

?     "one or zero of"
      Example: <number>? '.' <number>+
      Matches an optional number followed by a period and one or more numbers. Examples: 1.234 or .234

[ ]   grouping
      Example: <letter> [ <letter> | <number> ]*
      Matches a letter followed by any number of letters or numbers. Examples: a123, ident, wiki2txt

We have already discussed the alternation ("or") operator "|". Here are some examples:

Decimal Numbers
rule decimal_numbers {
   <digit>* '.' <digit>+
  |<digit>+ '.' <digit>*
}
Function Parameters
rule function_parameters {
   '(' [ <identifier> [ ',' <identifier> ]* ]? ')'
}
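
As a quick check of these rules: decimal_numbers accepts strings such as "1.25", ".5", and "42." (at least one side of the period must contain digits, so a lone "." is rejected), while function_parameters accepts "()", "(a)", and "(a, b)".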

Actions

As it successfully matches rules, PGE creates a special "match object", which contains information about the match. This match object can be sent to a function written in PIR or NQP for processing. The processing function that receives the match object is called the action. Each rule in the grammar has an associated action function with the same name.

When we have completed a successful match, we need to send the match object to the action method. We do this with the special symbol {*}. {*} sends the current match object, not necessarily the complete one. You can call {*} multiple times in a rule to invoke the action method multiple times, if needed. Here is an example:

rule k_r {
   {*}       # Calls the action method before anything has matched
   'hello'
   {*}       # Calls the action method after matching 'hello'
   'world'
   {*}       # Calls the action method after matching 'hello' 'world'
}
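
The action method itself is written outside the grammar, usually in NQP (the subject of the next chapter). As a rough, hedged preview, assuming the usual PCT convention that the action method shares the rule's name and receives the match object, it might look something like this:

method k_r($/) {
    # $/ is the match object passed along by {*}; a real action would
    # typically build a PAST node from it here.
}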

There are two important points to remember about action methods:

  1. The parser moves from left-to-right, top-to-bottom. In the k_r example above, if the input was "hello johnny", the action method would get called the first two times, but the rule would fail to match the word "world" and the action method would not be called the third time.
  2. The parser returns after a successful match, even if there are more possibilities to try. Consider this example below, where only one of the action methods can be called depending on the result of the alternation. In this example, if the input is "hello franky", the action method only gets called for the {*} after 'franky'. After that matches, the parser returns and does not try to match 'johnny'.
rule say_hello {
  'hello'
   [
     | 'tommy' {*}
     | 'franky' {*}
     | 'johnny' {*}
   ]
}

When a rule contains several {*} calls or alternatives, it can be very helpful for the action method to know which one triggered it, because we may want to treat certain matches differently from others. To make that distinction, we use a special comment syntax:

rule say_hello {
  'hello'
   [
     | 'tommy' {*}  #= tommy
     | 'franky' {*} #= franky
     | 'johnny' {*} #= johnny
   ]
}

The special "#=" marker is not a regular comment. Instead, it supplies a second parameter to the action method, called a "key". If you use "#=", the action method will receive two arguments: the match object and the key. You can then test the value of the key to determine how to treat the match object. We will discuss this more in the next chapter about NQP.
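
As a rough preview of what the next chapter covers (treat the exact syntax here as illustrative), an action method that uses the key might look like this:

method say_hello($/, $key) {
    if $key eq 'tommy' {
        # handle the 'tommy' alternative
    }
    elsif $key eq 'franky' {
        # handle the 'franky' alternative
    }
    else {
        # $key eq 'johnny'
    }
}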

Resources


Previous: Parrot Compiler Tools | Parrot Virtual Machine | Next: Not Quite Perl