Rebol Programming/Language Features/Parse/Parse expressions

Sometimes you want to parse a series to see if it matches a specific format. This can be used for simple things like determining and validating the format of a phone number or an email address.

You may need this even when splitting a string but not wanting to treat the quotation marks specifically as the NONE or string rules do. (See the simply split example.)

Instead of regular expressions PARSE matches parse expressions. Parse expressions:

Are a dialect of Rebol (the parse dialect).
Are a (enhanced) member of the family of Top-down parsing languages (TDPL family) including the top-down parsing language (TDPL), the generalized top-down parsing language (GTDPL) and the parsing expression grammar (PEG).
Use the same "ordered choice" parsing method as the other members of the TDPL family.
Have an unlimited lookahead capability, brought by the ordered choice operator.
Are compatible with other members of the TDPL family.
Make a good replacement for regular expressions being strictly more powerful. For example, a regular expression inherently cannot find matched pairs of parentheses because it is not recursive, but the PARSE dialect can.
Are strictly more powerful than context-free grammars.
Context-free grammars usually require a separate tokenization step because of the way they use lookahead. PARSE of string types does not require a separate tokenization step and the tokenization rules can be written in the same way as any other grammar rule.
Many context free grammars contain inherent ambiguities, even when they're intended to describe unambiguous languages. The "dangling else" problem in C, C++, and Java is one example. These problems are often resolved by applying a rule outside of the grammar. In the PARSE dialect these ambiguities never arise because of prioritization.
Make a good replacement for an LL parser because they are strictly more powerful.

PARSE handles powerful dialects that are used throughout the language. An example of this is VID or the Visual Interface Dialect, used for constructing graphical user interfaces.

Parse expression matching

What PARSE does during parse expression matching is traversing a series (e.g. a block, a string or a binary) and while it does that, you can perform actions or collect information from the series to be used elsewhere or perform actions on the series itself.

There are roughly two ways parse expression matching is used, either on strings matching character patterns, or on blocks matching patterns of Rebol values. This means that:

when parsing strings, terminal symbols are characters
when parsing blocks, terminal symbols are Rebol values

Block parsing is generally used to handle dialects, and this is one of the main characteristics of the language.

For parse expression matching the given RULE argument has to be a block. The contents of the block are interpreted as a starting parse expression corresponding to the starting expression of a parsing expression grammar.

When matching parse expressions, PARSE maintains the input position.

Parse expression matching may have two possible results:

success, in which case PARSE may optionally move the input position forward, or
failure, in which case the input position remains unchanged

PARSE returns TRUE, if both:

the starting parse expression was found to successfully match the given INPUT
the final input position is at the tail of the given INPUT

Otherwise PARSE returns FALSE.

Atomic parse expressions

Atomic parse expressions are parse expressions that are represented by just one Rebol value.

NONE

NONE is treated as a nonterminal that successfully matches any input and generally doesn't move the input position forward, with an exception mentioned at the Character set section.

 parse "" [#[none]]
 ; == true

This returns TRUE because:

The NONE value successfully matches any input.
PARSE is already at the tail of the string.

parse [] [#[none]]
; == true

Notes:

Since the NONE value is used as a nonterminal, it cannot be used to match the NONE terminal in the input block. If you want to match the NONE terminal in the input block, see the Datatype section.

Character

Characters are treated as terminal symbols when parsing strings as well as when parsing blocks.

parse "a" [#"a"]
; == true

This returns TRUE because:

The #"a" character successfully matches the character at the current input position.
PARSE moves forward after the successful match, reaching the tail of the input in this case.

parse [#"a"] [#"a"]
; == true

Any-string

Any-strings are treated as sequences of terminal symbols when parsing strings:

parse "aaa" ["aaa"]
; == true

This returns TRUE because:

The string successfully matches a part of the input at the current input position.
PARSE moves forward after the successful match, reaching the tail of the input in this case.

parse "<html>" [<html>]
; == true

Any-strings are treated as terminal symbols when parsing blocks:

parse ["aaa"] ["aaa"]
; == true

parse [<html>] [<html>]
; == true

Block

Blocks are treated as nonterminals; their contents are interpreted as parse expressions and used for matching.

parse "a" [["a"]]
; == true

parse ["a"] [["a"]]
; == true

Notes:

Since Rebol blocks are treated as nonterminals, they cannot be used as terminal symbols to literally match specific blocks contained in the input. To match a specific block you can either use the into operator or the quote idiom defined in the Parse idioms section.

Paren

Parens successfully match any input not advancing the PARSE position; they are treated as actions to be evaluated, which results in printing the string "OK" to the console in the examples below:

parse "" [(print "OK")]
; == true

parse [] [(print "OK")]
; == true

Notes:

Since parens are treated as actions, they cannot be used as terminals to match parens in the input block. If you want to match a specific paren, you can either use the into operator or the quote idiom in the Parse idioms section.

Argument-less PARSE keyword

PARSE keywords not needing any arguments are treated as nonterminals.

The end keyword

parse "" [end]
; == true

This returns TRUE because:

The input is already at its tail.
The end keyword successfully matches the input, when it is at its tail.

The skip keyword

parse "a" [skip]
; == true

This returns TRUE because:

The skip keyword successfully matches any terminal.
PARSE moves forward after the successful match, reaching the tail of the input in this case.

parse ["aa"] [skip]
; == true

Notes:

PARSE keywords cannot be used as terminals to match specific words in the input block. For matching such words see the Lit-word section.

Word

Rebol words that are not PARSE keywords are treated as nonterminals. The value of such a word is looked up and used for matching.

parse "" [none]
; == true

This returns TRUE because:

The 'none variable refers to the NONE value, which successfully matches any input.
PARSE was already at the tail of the input.

Notes:

For matching specific words in the input see the Lit-word section.
Words referring to PARSE keywords are handled as the keywords.
Words referring to other words are not supported. When PARSE encounters such a word, it causes an "invalid argument" error.

Lit-word

Lit-words can be used only during block parsing. Every lit-word is treated as a nonterminal, which successfully matches the corresponding word at the current input position.

parse [Hi] ['Hi]
; == true

A different word match will fail:

parse [Bye] ['Hi]
; == false

Notes:

Since lit-words are nonterminals, they cannot be used as terminals to match lit-words in the input block. For matching specific lit-words see the quote idiom in the Parse idioms section.

Path

Paths work analogically as words, i.e. the value of a path is looked up and used for matching.

Lit-path

Lit-paths work analogically as lit-words, i.e. they match the corresponding paths in the input block.

Notes:

Since lit-paths are nonterminals, they cannot be used as terminals to match lit-paths in the input block. For matching specific lit-paths see the quote idiom in the Parse idioms section.

Character set

Bitsets work as character set nonterminals when parsing strings. They successfully match any character they contain.

whitespace: charset [#"^A" - #" " #"^(7F)" #"^(A0)"]
parse/all " " [whitespace]
; == true

Notice that we "turned off" the special handling of whitespace characters by using the /ALL refinement.

It may be interesting to find out what happens, if we don't "turn off" the special handling of whitespace characters:

whitespace: charset [#"^A" - #" " #"^(7F)" #"^(A0)"]
parse " " [whitespace]
; == false

The result is FALSE, since PARSE "ignores" the whitespace characters in this case, so they cannot be successfully matched.

The next trial doesn't succeed either:

parse " " []
; == false

, since the input position isn't at the tail yet.

To succeed, we need to do:

parse " " [none]
; == true

, where the NONE value is used to match the space and move the PARSE input position forward.

Notes:

The above NONE behavior is a special case, when even the NONE value can move the current input position forward.
In case of block parsing bitsets behave as terminal symbols.

Datatype

We can use Rebol datatypes as nonterminals when parsing blocks. They successfully match any value of corresponding datatype.

parse [5] [integer!]
== true

This returns TRUE because:

The INTEGER! datatype referenced by the 'integer! word successfully matched the element in the block.
PARSE reached the end of the block after moving forward.

Same thing, only with a date and a string:

parse [25-Dec-2005] [date!]
; == true

parse ["Hello"] [string!]
; == true

The NONE! datatype can be used to match the NONE value in the input:

parse [#[none]] [none!]
; == true

Notes:

Since datatypes are treated as nonterminals, they cannot be used as terminals to match datatypes in the input block. To match specific datatypes in the input see the quote idiom in the Parse idioms section.
The INTEGER! datatype matches integer representations when used during string parsing. Other datatypes "work" as NONE.

Set-word

Set-words are treated as nonterminals and used to obtain the current input position. Set-words always succeed not moving the input position.

parse "123" [position:]
; == false
position
; == "123"

Explanation:

PARSE sets the 'position variable to refer to the current input position
The match succeeds not moving the input position forward
PARSE returns FALSE, since the final input position didn't reach the tail of the input

Notes:

Since set-words are treated as nonterminals, they cannot be used as terminals to match specific set-words in the input block. To match specific set-words see the quote idiom in the Parse idioms section.

Get-word

Get-words are treated as nonterminals and used to set the current input position. An attempt to set the input position to a completely distinct series (a series not having the same head as the current input position) causes an error. Otherwise the match succeeds. Example:

string: tail "123"
parse head string [:string]
; == true

Explanation:

Match succeeds
PARSE returns TRUE since the position is set to the tail of the input in this case.

Notes:

Since get-words are treated as nonterminals, they cannot be used as terminals to match specific get-words in the input block. For matching specific get-words see the quote idiom in the Parse idioms section.

Native

What do natives match during block parsing?

Other datatype

Values of other datatypes than mentioned above can be used during block parsing as terminal symbols.

Parse operations

Let's see how PARSE traverses a block:

parse [Hi Bye] [word!]
; == false

What goes wrong here? What happens is that PARSE successfully matches the first word in the INPUT block and advances the input to the second word.

The block hasn't been parsed to the end, which means that PARSE returns FALSE.

To match the input in more complicated cases we need parse operations in addition to the atomic expressions mentioned above.

Sequence

A sequence operation is a sequence of parse expressions. It does not use a keyword. The general form of a sequence is:

subexpression_1 subexpression_2 ... subexpression_n

When matching a sequence PARSE matches the subexpression_1 and if it succeeds, it tries to match the rest of the sequence. For the sequence match to succeed, all subexpression matches are required to succeed.

Example:

parse [Hi Bye] ['Hi word!]
; == true

In this case the sequence match succeeds and the input is advanced to its tail, which causes PARSE to return TRUE.

Ordered choice

The ordered choice (a.k.a. "alternative", but the "ordered choice" name is more appropriate, since the order of subexpressions matters) operation uses the | keyword. The general form of this operation is:

subexpression_1 | subexpression_2 | ... | subexpression_n

When matching an ordered choice PARSE tries to match the subexpression_1. If it succeeds, the ordered choice match is successful. If the first subexpression match doesn't succeed, PARSE tries to match the rest of the choice.

The ordered choice operation has lower priority than the sequence operation, which means that:

e1 e2 | e3

is equivalent to

[e1 e2] | e3

Let's say you want to check that the block element is either an integer or a decimal:

parse [36] [integer! | decimal!]
; == true

parse [37.2] [integer! | decimal!]
; == true

Note: Prioritization of the ordered choice operation causes that the second subexpression in the:

["a" | "ab"]

choice never succeeds, since the first one takes precedence.

Repetition operators

Repetition operators specify how many times a given subexpression should match. The general syntax of repetition is

 repetition_operator subexpression

Repetition operators have higher priority than the sequence operator, which means that

repetition_operator subexpression_1 subexpression_2

means the same as

[repetition_operator subexpression_1] subexpression_2

All repetition operators are greedy, which means that they always match as many times as possible.

Repetition operators are left-associative, which means that

any 2 skip

is equivalent to

[any 2] skip

The same holds for the to, thru and into operators.

Zero or one

This operator uses the opt keyword. It is also known as optional match. Since a zero count is allowed, this operator always succeeds.

Examples:

parse "," [opt #","]
; == true

parse "" [opt #","]
; == true

One or more

This operator uses the some keyword.

Examples:

parse "," [some #","]
; == true

parse ",," [some #","]
; == true

Zero or more

This operator uses the any keyword. Since a zero count is allowed, this operator always succeeds.

Examples:

parse ",," [any #","]
; == true

parse "" [any #","]
; == true

parse [Hi Bye] [any word!]
; == true

It returns TRUE, because:

The any operator always succeeds
The any operator is greedy successfully matching all words in the input, leaving the final input position at the tail

If we add a different datatype to the block:

parse [Hi 36 Bye] [any word!]
; == false

PARSE returns FALSE, since:

The any operator successfully matches just the first element of the input and stops not reaching the tail

Repetition count

The general form of this operation is:

n subexpression

Where N is an integer value. This operation specifies the repetition count for the given subexpression.

Examples:

parse "12" [2 skip]
; == true

The expression checks for exactly two words, no more, no less:

parse [Hi Bye] [2 word!]
; == true

Range of times

The general form of this operation is:

n m subexpression

Where N and M are integer values. This operation specifies the repetition range for the subexpression.

Examples:

parse "12" [1 2 skip]
; == true

parse [Hi Bye] [1 2 word!]
; == true

The expression checks for not less than 1 and not more than 2 words.

parse [Hi how are you? Bye] [0 5 word!]
; == true

This expression will successfully match between 0 and 5 words.

Note that when parsing for integer values, we have to specify a range since integers are used to specify ranges of matches.

This specifies exactly one match:

parse [-1] [1 1 -1]
; == true

This is an error:

parse [-1] [-1]
; == false

Skipping data in the input

There are two PARSE operators progressing the input position in accordance with a given parse subexpression:

the to operator
the thru operator

The general syntax of the operations is:

to subexpression

or

thru subexpression

The purpose of the to operator is to progress the input position up until the position where a successful subexpression match has occurred.

The purpose of the thru operator is to progress the input position to the position after the successful subexpression match.

Both operations fail, if a successful subexpression match isn't found.

The subexpression can be:

the end keyword. The

to end

operation always succeeds advancing the input to its tail, while the

thru end

operation always fails in R2; in the newer R3 version it has been improved and became compatible with the recursive idiom mentioned in the Parse idioms section

a word - in that case its value is looked up and used for matching

The following subexpressions are supported when parsing strings:

characters
strings

The following subexpressions are supported when parsing blocks:

datatypes, they are matched as described in the Datatype section
lit-words, they are matched as described in the Lit-word section
other values are matched literally, i.e. as terminal symbols

This is nice if you want to parse large amounts of data and don't care about certain contents in the block.

Let's say we don't care about anything until we reach a word. That can be done with to:

parse [37.2 38 Bye] [to word!]
; == false

This makes PARSE return FALSE because:

We have successfully reached a word, and the to operation succeeded, but
we haven't reached the tail of the input.

In order to reach the tail of the input, we can proceed past the word using thru instead of to.

parse [37.2 38 Bye] [thru word!]
; == true

Subblock parsing

When parsing a block you may need to check its subblocks (or parens, paths, lit-paths or get-paths) for specific patterns. The into operator is suitable for this. The general form of the operation is:

into subexpression

Only blocks or words referring to blocks are accepted as subexpressions.

The subblock parsing operation fails, if the element at the current input position does not have the ANY-BLOCK! type. Otherwise the element (subblock) is used as input and matched against the given subexpression. For the subblock parsing to succeed, the subblock must successfully match the given subexpression and the final subblock input position has to be the tail of the subblock.

Examples:

parse [[]] [into [none]]
; == true
parse [[1]] [into [none]]
; == false
parse [(1)] [into [skip]]
; == true
parse [a/b] [into [2 skip]]
; == true
parse ['a/b] [into ['a 'b]]
; == true

Using data from the input series

You can also use data from the series to be used in your trigger code. In addition to getting the input position using a set-word there are the following PARSE operations:

set variable subexpression

copy variable subexpression

The set operation is available only during block parsing, while the copy operation is available in both parsing modes. If the subexpression match succeeds, the set operation sets the given variable to the first matched value, while the copy operation copies the whole part of the input matched by the given subexpression. For a more detailed description see the Parse idioms section.

parse [Hi 36 37.2 38 Bye] [
  word!
  any [set int integer! (print ["Integer" int "Found"]) | decimal! (print "Decimal Found")]
  word!
]
; Integer 36 Found
; Decimal Found
; Integer 38 Found
; == true

We can apply normal function of Series to extracting data from the series we are parsing.

parse [Hi 36 37.2 38 Bye] [
  any [
    set int integer! (print ["Integer" int "Found"])
    | dec: decimal! (print ["Decimal Found at position" index? dec])
    | wrd: thru word! (print ["Word" first wrd "is near tail:" tail? wrd])
  ]
]
; Word Hi is near tail: false
; Integer 36 Found
; Decimal Found at Position 3
; Integer 38 Found
; Word Bye is near tail: true
; == true

This copies a part of a string:

parse "123456" [copy part 5 skip to end]
; == true
part
; == "12345"

Differences between R2 and R3 parsing

fail - new keyword for R3 parse, a nonterminal that does not match any input

quote value - new keyword for R3 parse, match the value "as is"

if (expression) - new keyword for R3 parse, evaluate the expression in the paren and if the result is false or none consider it a failed match

into subrule - like in R2, but in R3 the item the subrule matches can have a different datatype than the input

return value - new keyword for R3 parse, immediately return the given value from parse

and subrule - new keyword for R3 parse; look-ahead rule; match the subrule but don't advance input

not subrule - new keyword for R3 parse, invert the result of matching the subrule

?? - new keyword for R3 parse, prints debugging output

then subrule - new keyword for R3 parse, regardless of the success or failure of the subrule matching skip the next alternative

any subrule - to prevent unwanted infinite loops in R3 this rule stops also when the subrule matches the input but does not advance it

some subrule - to prevent unwanted infinite loops in R3 this rule stops also when the subrule matches the input but does not advance it

while subrule - new keyword for R3 parse, iterate subrule matching while the subrule matches the input; this rule is similar to any but with a simpler stopping condition (at the cost that the loop may become infinite)

accept - new keyword for R3 parse, works exactly like break

reject - new keyword for R3 parse, stops the matching loop and indicates failure of loop matching

to - allows multiple targets now, still not general enough to allow any target rule, though

thru - allows multiple targets now, still not general enough to allow any target rule, though

change subrule only value - new keyword for R3 parse, change the part of input matching the subrule (Warning! This is very slow! Also, input changes impair understandability of the code! Due to these reasons it isn't a good programming practice to use it.)

insert - new keyword for R3 parse, (Warning! This is very slow! Also, input changes impair understandability of the code! Due to these reasons it isn't a good programming practice to use it.)

remove subrule - new keyword for R3 parse, (Warning! This is very slow! Also, input changes impair understandability of the code! Due to these reasons it isn't a good programming practice to use it.)

do subrule - new keyword for R3 parse proposed by Gabriele

Complex parse expressions

It's possible to build complex parse expressions. When you do that, it can be nice to split them in smaller pieces and give them meaningful names.

Recursion

Recursion is frequently the most elegant way how to describe a grammar. Let's take a grammar consisting of strings of the aⁿbⁿ type, where n >= 1 as an example. Such a grammar can be described using the following parse expression:

anbn: ["a" anbn "b" | "ab"]

Usage:

parse "ab" anbn
; == true
parse "aabb" anbn
; == true

Parse idioms

Description	Operation	Idiom
Any-string, string parsing	a: ["abc"]	a: [#"a" #"b" #"c"]^[1]
Bitset, string parsing	a: charset ",;"	a: [#"," \| #";"]^[2]
skip nonterminal, string parsing	a: [skip]	b: complement charset ""^[3] a: [b]^[4]
skip nonterminal, block parsing	a: [skip]	a: [any-type!]^[5]
opt operator (zero or one)	a: [opt b]	a: [b \|]^[6]^[7]
any operator (zero or more)^[8]	a: [any b]	a: [b a \|]^[9]^[10]
some operator (one or more)^[8]	a: [some b]	a: [b [a \|]]^[11]^[12]
fail (a nonterminal that always fails)	a: [fail]	a: [some "a" "a"]^[13] a: [end skip]^[14]
Range of times operator	a: [m n b]	a: [(l: min m n k: n - m) l b [k [b \| c: fail] \| :c]]^[15]^[16]
then operator^[17] (match B and if it succeeds, match C; otherwise match D)	a: [b then c \| d]	a: [[b (e: c) \| (e: d)] e]^[18]
not predicate^[19] (inverse the success)	a: [not b]	a: [b then fail \|]^[20] a: [[b (c: [fail]) \| (c: none)] c]
and predicate (match and don't advance)	a: [and b]	a: [c: b :c]^[21] a: [not not b]^[22] a: [[b (c: none) \| (c: [fail])] fail \| c]^[23]^[24]
end nonterminal (match the tail of the input)	a: [end]	a: [not skip]^[25]^[26]
start nonterminal (match the head of the input)^[27]	a: [start]	a: [b: (c: unless head? b [[fail]]) c]^[28]^[29]
to operator^[8] (advance to the first successful match)	a: [to b]	a: [and b \| skip a]^[30]^[31]^[32] a: [any [not [b (c: none)] (c: [fail]) skip] c]^[33] a: [any [[b (c: none d: [fail]) \| (c: [fail] d: [skip])] d] c]^[34] a: [thru [and b]]^[35]
correspondence between to and any	a: [to [b \| end]]^[36]	a: [any [not b skip]]
thru operator^[8] (advance thru the first successful match)	a: [thru b]	a: [b \| skip a]^[37]^[38]^[39] a: [any [not [b c: (d: [:c])] (d: [fail]) skip] d]^[33] a: [any [[b c: (d: [:c] e: [fail]) \| (d: [fail] e: [skip])] e] d]^[34] a: [to [b c:] :c]^[35]
set operator (set the variable to the first matched value)	a: [set b c]	f: [(set/any [b] if lesser? index? e index? d [e])]^[40]^[41]^[42] a: [and [c d:] e: f :d]^[43]
copy operator (set the variable to the matched sequence)	a: [copy b c]	f: [(b: if lesser? index? e index? d [copy/part e d])]^[44]^[42] a: [and [c d:] e: f :d]^[43]
quote operator, block parsing (match a terminal)	a: [quote b]	a: [copy c skip (d: unless equal? c [b] [[fail]]) d]^[45]^[46]

The table illustrates that:

↑ When parsing strings, strings work as sequences of characters
↑ When parsing strings, bitsets work as choices of characters
↑ The bitset containing every character can be defined as a complement of the bitset containing no character
↑ When parsing strings, the skip nonterminal works as the bitset containing every character
↑ When parsing blocks, the skip nonterminal works as the ANY-TYPE! datatype
↑ The opt operator can be defined using a common choice
↑ opt is greedy, since the first choice subexpression takes priority
↑ ^a ^b ^c ^d
An iterative operator replacing a common recursive nonterminal:
- enhances expressivity (spares nonterminal definitions)
- optimizes memory usage (spares stack space)
- optimizes speed
↑ The any operator can be defined using a common recursive expression
↑ any is greedy, since the first choice subexpression takes priority
↑ The some operator can be defined using a common recursive expression
↑ some is greedy, since the first choice subexpression takes priority
↑ The fail nonterminal can be defined (Even without the end keyword!) using the greediness of some
↑ The fail version using the end and skip keywords is more succinct, though
↑ The range of times operator can be defined using a sequence of repetitions
↑ The range of times operator is greedy, since the first choice subexpression takes priority
↑ The then operator, enhancing the apparent expressivity, is used in the Generalized TDPL
↑ The then operator can be defined using choice and a computed nonterminal
↑ "Predicate" means that it does not advance the input position
↑ The not predicate can be defined using the then operator and the fail nonterminal
↑ The and predicate can be defined using the position manipulations (Warning: This is not recursion-safe; the nonterminal B must not change the value of the variable 'c!)
↑ The and predicate can be defined using the not predicate
↑ The and predicate can be defined using choice, sequence and computed nonterminals
↑ Explanation: the first subexpression of the main choice is a sequence that is designed to always fail computing a non-advancing nonterminal C so that C succeeds, if B succeeded
↑ The end nonterminal can be defined using the not predicate (Notice that this is not a cyclic definition!)
↑ This idiom explains well, why the [end skip] idiom always fails
↑ Some users suggest that it might be useful to detect when the input series is at its head
↑ The start nonterminal can be defined using a sequence and a computed nonterminal
↑ Explanation: C is computed to fail unless the input position is at the head of the input series
↑ The to operator can be defined using the and operator and a common recursive expression
↑ The recursive definition is more general than the current to operator, supporting any nonterminal B
↑
The recursive definition is identical with the behavior of the to operator except for:
- the [to ""] expression always fails when parsing strings, while the recursive expression always succeeds
↑ ^a ^b This is an equivalent iterative definition of the operator using any, not, fail, skip and computed nonterminals
↑ ^a ^b This is an equivalent iterative definition expanding the not idiom
↑ ^a ^b This shows the relation between the to and thru operators
↑ The | END choice causes that the to operator always succeeds
↑ The thru operator can be defined using a common recursive expression
↑ The recursive definition is more general than the current thru operator, supporting any nonterminal B
↑
The recursive definition is identical with the behavior of the thru operator except for:
- the [thru end] expression always fails, while the recursive expression always succeeds
- the [thru ""] expression always fails when parsing strings, while the recursive expression always succeeds
↑ The set operator can be defined using the and operator, position manipulations, actions and sequence
↑ Notice that this set definition isn't limited to block parsing
↑ ^a ^b Explanation: Uses D and E to set the 'b variable as needed. Notice that if the input position didn't move forward, 'b has to be set to NONE
↑ ^a ^b Explanation: The first subexpression in the sequence is defined so that we know both the position after (used to set 'd) and before (used to set 'e) B was matched
↑ The copy operator can be defined using the and operator, position manipulations, actions and sequence
↑ quote idiom usage example: a: [copy c skip (d: unless equal? c ['hi] [[fail]]) d] matches the lit-word 'hi
↑ Explanation: C is computed to fail unless the first element of the input equals to the given terminal

In case you want to use some of the above idioms you do not have to remember them. Instead, you can use the parseen script, where you can find functions generating the corresponding rules for you.

Local variables in parse rules

Sometimes it is desirable to use local variables in a parse rule. Such variables need to be at least recursion-safe, since parse rules frequently are used recursively, but even threading-safe local variables may be needed in the future. The PARSE function does not have a built-in support for such construct, nevertheless, thanks to the malleability of Rebol, it is possible to define a USE-RULE function facilitating the usage of (recursion and threading safe) local variables in PARSE rules working as follows:

rule: use-rule [a b] [
    "b" (print 1) |
    a: "a" rule "a" b: (print subtract index? b index? a)
]
parse "aba" rule
parse "aabaa" rule

A more complex example of the usage of local variables in PARSE rules is the evaluate.r script showing how it is possible to handle different precedence and associativity rule sets in PARSE.

Modifying the input series

It is possible to manipulate the PARSE input series during expression matching, since series manipulation functions such as CHANGE, INSERT or REMOVE can be used during a parse operation.

Here are some reasons why it is not advisable to manipulate the input series during expression matching:

Instead of changing the input series it is possible to use a new series and collect its contents as desired.
Some manipulations change the length of the series that is currently parsed. Length-changing operations are inefficient - a length-changing operation takes O(N) time, where N is the length of the series being changed. Contrast this to the APPEND operation used for the collecting approach, which is roughly N times faster.
The length-changing operations mess up the input position bookkeeping. It is easy to produce hard to understand and debug code this way.

Example:

Let's define a test-string:

n: 100
random/seed 0
test-string: copy ""
repeat i n [insert tail test-string pick "abc" random 3]

Let's implement a remove-chars function using PARSE. The first attempt changes the series "in place", but the changes don't influence the length of the input series:

remove-chars1: func [
    {Removes the given chars from a string.}
    string [string!]
    chars [char! string!] "All characters specified will be removed."
    /local chars-to-keep group change-position
] [
    ; if a char, use a string instead
    if char? chars [chars: head insert copy "" chars]
    ; define the characters we want to keep:
    chars: charset chars
    chars-to-keep: complement chars
    ; the position where the change needs to occur
    change-position: string
    ; turn off the default whitespace handling
    parse/all string [
        any [
            ; ignore chars
            any chars
            ; get a group of chars-to-keep
            copy group some chars-to-keep
            (change-position: change change-position group)
        ]
    ]
    clear change-position
    string
]

The second attempt uses the REMOVE function changing the input series length:

remove-chars2: func [
    {Removes the given chars from a string.}
    string [string!]
    chars [char! string!] "All characters specified will be removed."
    /local chars-to-keep group-start group-end
] [
    ; if a char, use a string instead
    if char? chars [chars: head insert copy "" chars]
    ; define the characters we want to keep:
    chars: charset chars
    chars-to-keep: complement chars
    ; turn off the default whitespace handling
    parse/all string [
        any [
            ; ignore chars-to-keep
            any chars-to-keep
            ; remove group of chars
            group-start: some chars group-end:
            (remove/part group-start group-end)
        ]
    ]
    string
]

The result is:

r1: remove-chars1 copy test-string "a"
; == {cbccbbcccbbbbbcccccccbcbbcbcbcbbccbcbbccccbcbbcbbcbbcbccccbcbb}
r2: remove-chars2 copy test-string "a"
; == {cbccbbcccbababbbcccaccccbcbbaacbcbcbbccbcbbccccbcbbcbbcbbcbccccbacbb}

Surprise! The approach using input manipulations didn't remove all #"a"s as we expected! The problem is caused by improper input position handling. So, let's be stubborn and handle the input position properly:

remove-chars3: func [
    {Removes the given chars from a string.}
    string [string!]
    chars [char! string!] "All characters specified will be removed."
    /local chars-to-keep group-start group-end
] [
    ; if a char, use a string instead
    if char? chars [chars: head insert copy "" chars]
    ; define the characters we want to keep:
    chars: charset chars
    chars-to-keep: complement chars
    ; turn off the default whitespace handling
    parse/all string [
        any [
            ; ignore chars-to-keep
            any chars-to-keep
            ; remove chars
            group-start: some chars group-end:
            (remove/part group-start group-end)
            ; set the input position properly
            :group-start
        ]
    ]
    string
]

The result is:

r3: remove-chars3 copy test-string "a"
; == {cbccbbcccbbbbbcccccccbcbbcbcbcbbccbcbbccccbcbbcbbcbbcbccccbcbb}

Speed discussion

Since the CHANGE function (or any other Rebol native) does not allow to move a part of the series without making a copy of it, the REMOVE-CHARS1 function has to do a lot of excessive work, allocating and collecting a lot of group-strings, while the REMOVE-CHARS3 function uses the full speed of the REMOVE native, not needing to either allocate or deallocate any additional "helper" string.

A testimony to the quality of the REMOVE-CHARS1 function's algorithm is that even in such unfavorable conditions the speed of it remains competitive even for the shortest strings, and that for strings as short as 600 characters the REMOVE-CHARS1 function becomes faster than the REMOVE-CHARS3 function. For longer strings the speed difference will be even greater in favor of the REMOVE-CHARS1 function.

Troubleshooting

PARSE is a very powerful function, but it can also be troublesome, if you don't keep a good eye on what you are doing. If you are unlucky, PARSE can get stuck in an infinite loop, requiring you to restart Rebol.

But when exactly does it happen?

PARSE normally traverses through the block, but it's really the expressions that make PARSE progress. If you specify an expression that won't make it progress, it will be stuck at the same point forever.

Such an expression can be an expression repetitively matching an empty sequence, empty choice subexpression or a NONE expression.

Examples:

>> parse "abc" [any []]
*** HANGS Rebol ***

>> parse "abc" [some ["a" |]]
*** HANGS Rebol ***

>> parse "abc" [some [none]]
*** HANGS Rebol ***

Note: to be able to escape from infinite loops use () somewhere in the iterated expressions as for example:

 >> parse "abc" [any [()]]
 *** YOU CAN PRESS [ESC] NOW TO STOP THE LOOP ***

[1] When parsing strings, strings work as sequences of characters

[2] When parsing strings, bitsets work as choices of characters

[3] The bitset containing every character can be defined as a complement of the bitset containing no character

[4] When parsing strings, the skip nonterminal works as the bitset containing every character

[5] When parsing blocks, the skip nonterminal works as the ANY-TYPE! datatype

[6] The opt operator can be defined using a common choice

[7] opt is greedy, since the first choice subexpression takes priority

[iter-8] An iterative operator replacing a common recursive nonterminal:
enhances expressivity (spares nonterminal definitions)

optimizes memory usage (spares stack space)

optimizes speed

[9] s expressivity (spares nonterminal definitions)

[10] timizes memory usage (spares stack space)

[11] timizes speed

[9] The any operator can be defined using a common recursive expression

[10] any is greedy, since the first choice subexpression takes priority

[11] The some operator can be defined using a common recursive expression

[12] some is greedy, since the first choice subexpression takes priority

[13] The fail nonterminal can be defined (Even without the end keyword!) using the greediness of some

[14] The fail version using the end and skip keywords is more succinct, though

[15] The range of times operator can be defined using a sequence of repetitions

[16] The range of times operator is greedy, since the first choice subexpression takes priority

[17] The then operator, enhancing the apparent expressivity, is used in the Generalized TDPL

[18] The then operator can be defined using choice and a computed nonterminal

[19] "Predicate" means that it does not advance the input position

[20] The not predicate can be defined using the then operator and the fail nonterminal

[21] The and predicate can be defined using the position manipulations (Warning: This is not recursion-safe; the nonterminal B must not change the value of the variable 'c!)

[22] The and predicate can be defined using the not predicate

[23] The and predicate can be defined using choice, sequence and computed nonterminals

[24] Explanation: the first subexpression of the main choice is a sequence that is designed to always fail computing a non-advancing nonterminal C so that C succeeds, if B succeeded

[25] The end nonterminal can be defined using the not predicate (Notice that this is not a cyclic definition!)

[26] This idiom explains well, why the [end skip] idiom always fails

[27] Some users suggest that it might be useful to detect when the input series is at its head

[28] The start nonterminal can be defined using a sequence and a computed nonterminal

[29] Explanation: C is computed to fail unless the input position is at the head of the input series

[30] The to operator can be defined using the and operator and a common recursive expression

[31] The recursive definition is more general than the current to operator, supporting any nonterminal B

[32] The recursive definition is identical with the behavior of the to operator except for:
the [to ""] expression always fails when parsing strings, while the recursive expression always succeeds

[36] the [to ""] expression always fails when parsing strings, while the recursive expression always succeeds

[itereq-33] This is an equivalent iterative definition of the operator using any, not, fail, skip and computed nonterminals

[iteris-34] This is an equivalent iterative definition expanding the not idiom

[tothr-35] This shows the relation between the to and thru operators

[36] The | END choice causes that the to operator always succeeds

[37] The thru operator can be defined using a common recursive expression

[38] The recursive definition is more general than the current thru operator, supporting any nonterminal B

[39] The recursive definition is identical with the behavior of the thru operator except for:
the [thru end] expression always fails, while the recursive expression always succeeds

the [thru ""] expression always fails when parsing strings, while the recursive expression always succeeds

[44] the [thru end] expression always fails, while the recursive expression always succeeds

[45] the [thru ""] expression always fails when parsing strings, while the recursive expression always succeeds

[40] The set operator can be defined using the and operator, position manipulations, actions and sequence

[41] Notice that this set definition isn't limited to block parsing

[setg-42] Explanation: Uses D and E to set the 'b variable as needed. Notice that if the input position didn't move forward, 'b has to be set to NONE

[sex-43] Explanation: The first subexpression in the sequence is defined so that we know both the position after (used to set 'd) and before (used to set 'e) B was matched

[44] The copy operator can be defined using the and operator, position manipulations, actions and sequence

[45] quote idiom usage example: a: [copy c skip (d: unless equal? c ['hi] [[fail]]) d] matches the lit-word 'hi

[46] Explanation: C is computed to fail unless the first element of the input equals to the given terminal

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]

[11]

[12]

[13]

[14]

[15]

[16]

[17]

[18]

[19]

[20]

[21]

[22]

[23]

[24]

[25]

[26]

[27]

[28]

[29]

[30]

[31]

[32]

[33]

[34]

[35]

[36]

[37]

[38]

[39]

[40]

[41]

[42]

[43]

[44]

[45]

[46]