Parrot Virtual Machine/Optables and Expressions
Optables and Expressions
editTop-down parsing has some problems. First, it's slow because it may need to try many alternatives before it finds a match. Each time it fails on a match, it needs to return to the previous rule and try again. Another problem is that parse trees can become very large for statements that have lots of little tokens, operator precedence, and many subrules. Consider a simple mathematical expression "1 * 2 + 5" and the following grammar:
EXPRESSION := TERM ADD_OP TERM TERM := (NUMBER MUL_OP NUMBER) | NUMBER ADD_OP := '+' | '-' MUL_OP := '*' | '/'
Our equation above creates a very complex parse tree:
EXPRESSION TERM NUMBER 2 MUL_OP "*" NUMBER 5 ADD_OP "+" TERM NUMBER 5
For things like a mathematical expression, we would like to use a bottom-up parser instead of a top-down one. A special type of bottom-up parser, called an operator precidence table parser is particularly efficient and well-suited for these tasks. To create an operator precedence table, PGE has a special tool called an optable
.
Creating Optables
editAn optable is created by creating a rule that is optable
. Here is an example:
rule expression is optable { ... }
The ...
is not an omission, it's part of the rule. If you don't have the ...
three dots, your rule will not work. This is called the yada yada yada operator, and is used to define a function prototype. PGE will instantiate the function for you, using the options you specify. The keyword is optable
tells the parser to convert to bottom-up mode to parse expressions
Operator Prototypes
editIn an optable, tokens are specified using the proto
keyword instead of the rule
or token
keywords. A proto
definition must define several things: The name and type of the operator, the precidence of the operator, and the PIR code that is used to perform the particular operation. Here are some examples:
proto 'infix:+' is precidence('1') is pirop('add') { ... } proto 'prefix:-' is tighter('infix:+') is pirop('neg') { ... } proto 'postfix:++' is equal('prefix:-') { ... }
These examples show how to parse three common operators: A "+" between two values, a negative value for a negative number (such as "-123") and a post-increment operator ("x++"). There are lots of important keywords here, and this isn't even all the keywords you will need to produce a full expression parser.
Operator Types
editThere are four types of operator: prefix, postfix, infix, and circumfix. Here are some examples:
Type | Example | Use |
---|---|---|
Prefix | prefix:-
|
-123 |
postfix | postfix:*
|
x* |
infix | infix:+
|
x+y |
circumfix | circumfix:()
|
(x) |
postcircumfix | postcircumfix:[]
|
x[y] |
ternary | ternary:? :
|
x?y:z |
Operator Precedence
editBottom-up parsers, as we mentioned before, suffer from a particular problem called shift-reduce conflicts. A shift-reduce conflict occurs when a parser cannot determine whether to shift in a new input token, or reduce the tokens it already has. Here is an example from the C programming language:
-x + 2 * y
We know, because we've been taught about operator precedence since we were children, that the expression is grouped like this:
(-x) + (2 * y)
However, a computer doesn't know this unless we tell it what precedence the operators follow. For instance, without knowing the order in which operators act, a parser might group the expression on a first-come, first-served basis like this:
((-x) + 2) * y
Or maybe it would shift in all tokens and start reducing from the right:
-(x + (2 * y)
The point is, we need to create rules to tell the operator precedence parser how to parse these conflicts. Here are the precedence rules that we can specify:
Code | Effect |
---|---|
is precidence('1')
|
Sets a basic precidence level. The number in the quotes is arbitrary. The important part is that this is a constant, and that other operators can be defined in relation to this. |
is equiv('OPERATOR')
|
Here, the word OPERATOR will be replaced by an existing operator, such as is operator('infix:+') This shows that one operator has the same precedence as another operator. All operators with the same precedence are evaluated from left-to-right.
|
is tighter('OPERATOR')
|
The operator binds tighter than OPERATOR. In the expression -x + y , we expect the prefix:- to bind tighter than the infix:+ .
|
is looser('OPERATOR')
|
Similar to is tighter() , but the opposite. Shows that the operator binds less tightly than OPERATOR
|
We didn't show this in the example, but it's also possible to determine whether operators act to the left or to the right. We define this with the is assoc
keyword. The following table summarizes these rules:
Associativity | Explanation |
---|---|
is assoc('right')
|
|
is assoc('left')
|
|
is assoc('list')
|
|
is assoc('chain')
|
|
is assoc('none')
|
Operator Actions
editThere are three ways to specify how an operator behaves. The first is to associate the operator with a PIR opcode. The second is to associate the operator with a PAST node type. The third and final way to specify the actions of an operator is to create a function. The first two methods can be performed with the is pirop()
and is pasttype()
specifiers, respectively.
is pirop
Operators
edit
For simple operations, it doesn't make any sense to create and call a function. This is especially true for simple arithmetic operations which can be performed very quickly using PASM opcodes that Parrot already defines. Use the code is pirop('opname')
to implement an operator as a single opcode.
is pasttype
Operators
edit
Some operators, such as logical operators, can be best implemented using types of nodes that PAST already provides. Use the is pasttype('TYPE')
form to specify the type of PAST node to use.
Operators as Functions
editBy default, unless you specify an operator as is pirop
or is pasttype
, the operator will default to being a function. Most operators are therefore defaulted to be functions, which is very good for languages which intend to allow operator overloading.
Proto Regexes
editWe will use the term "Proto", "Proto Regex" or "Protoregex" on occasion in this book, and it is used often in the documentation for Perl 6 as well. Protos are like rules except they offer something similar to multiple dispatch: You can have multiple rules of the same name that depend on the operands passed to it. By declaring an operator as a proto, you are giving it a name and you are also allowing future additions to be made to the language. This is an imporant feature for languages which utilize operator overloading.
Protos are defined with the proto
keyword. They can be used in either the top-down or bottom-up parser, although because of their most common use as a support mechanism for operator overloading, they are most often used in the bottom-up operator precidence parser. To define a new proto, use the form:
proto NAME OPTIONS { ... }
In this declaration, the { ... }
is a literal piece of code: you actually must type the three dots (ellipsis) between the curly brackets. What this does is alerts the parser that this is not an implementation of the rule itself, it is actually just a signature or a forward-declaration for other functions. For instance, we can define a proto MyProto:
proto MyProto { ... }
And in another file we can define functions to implement this:
.sub MyProto # Insert the parameter list and function logic here .sub
.sub MyProto # Another MyProto! # Insert a different parameter list and different function logic here .end
Parsing Terms
editTerms are the values between operators. In the following expression:
1 + 2 * 3
The numbers 1, 2, and 3 are all terms. Terms are parsed using the ordinary rules, and are loaded into the operator precedence table using the following syntax:
proto 'term:' is precedence('1') is parsed(&term) { ... }
Of course, you can set the precedence however you see fit, but it's typically best to make terms parse as tightly as possible. The is parsed()
attribute specifies that terms are to be parsed using a rule named "term". You can define this term using ordinary rule syntax. The &
symbol is the sigil that NQP uses to refer to a subroutine. In this case, the subroutine in question is the grammar method that is intended to parse terms (but it can conceivably be anything else). We will learn more about NQP and sigils in the following chapter.
Terms are the way the parser switches back from bottom-up to top-down parsing.
Expression and Term Methods
editOnce we've created our parser, we need to create methods associated with them to construct the necessary PAST nodes for the parse tree. Lucky for us, these methods are easy to create, and typically the best option is to use a basic template from another language implementation.
Terms are defined like any other rule, and don't require any special syntax. Expressions, however, do require a little bit of special attention.