Common Lisp/External libraries/CL-YACC

CL-YACC is a LALR(1) parser generator written in portable ANSI Common Lisp.

LALR Parsing Primer

Before we get into the usage of CL-YACC, let's first take a two cent review of parsing, and in particular the usage of a parser generator.

Parser-generators are one of the oldest forms of meta-programming to be used by the general (programming) public. YACC, or Yet Another Compiler Compiler, was a C program that took an input of a grammar description (in the form of production rules) and output a C program that can parse input of that grammar. While YACC wasn't the first to do this, it became very popular.

The intention of YACC was for it to be used as a development tool, namely the developer is saved a hassle by having this parser generator do his work for her. This was a neat idea for many developers since writing a parser (something that was typically done by hand) isn't as hard as it is tedious, at least once you get the hang of it. However, in many cases the word tedious doesn't quite describe the writing of a parser, impossible comes to mind more readily. For instance, the CL-YACC input of the C programming language is around 280 lines long, but the macro expansion of the CL-YACC grammar is around 13,000 lines of heavily redundant spaghetti code. That lets you know something about how much it is doing for you.

So YACC is a program that writes a program in a lower language from an input from language explicitly designed for specifying grammars. It sounds a lot like YACC is a macro expansion function. YACC is exactly the sort of thing one would do via a Common Lisp macro.

The Parser and the Lexer

Typically, parsing is broken into two parts, lexing and parsing. Lexing is a very rudimentary step of examining input where the input is broken into tokens, or pieces that the parser knows how to deal with. Lexing typically deals with identifying what pieces of the input are what, such as symbols or identifiers, constants like numbers, string literals, or characters, and operators like addition, subtraction, or multiplication.

Parsing on the other hand has much more knowledge and smarts behind it. For example, the actual parser knows that parentheses have to be matched in your program. It also needs to know that parentheses that are escaped (as is possible in CL) and parentheses in strings do not need to match. In short, the parser knows something about what syntax in your language actually means.

Since a big benefit of having a parser generator at your disposal is that you don't have to know how it works, we will not deal with the actual mechanism of parsing or generating a parser. Instead we will note that the LALR(1) parsers that CL-YACC can generate are able to parse most context free grammars, and that includes the syntax of most (all?) programming languages and configuration/data files. In fact, most programming languages define their syntax such that it can be parsed by a LALR(1) parser due to the features it offers (a highly efficient yet small parser which can be automatically produced by existing and common parser generators).

Production Rules

Production rules are a way of representing a grammar. They declare how particular language elements are mapped onto other language elements. To illustrate this, let's look at some production rules you might expect in a Lisp parser.

 form : atom
      | list</lisp>

This states that in Lisp, a form is either an atom or a list.  We go on to declare what an atom and a list is.

<syntaxhighlight lang="lisp">
 atom : constant
      | symbol
 
 list : ( symbol forms )
 
 forms : form
       | form forms</lisp>

Here we stated that an atom is either a constant (number, string, etc.), or a symbol.  We also specify that a list is a function followed by some number of forms, and that forms is one or more form.  We left out quoted lists for now.  Notice how one uses recursive definitions like in forms.

What about definitions for <code>constant</code> and <code>symbol</code>?  These are what is known as terminals which relate to actual tokens in the input.  These are not composite objects like <code>list</code>.

We can translate this into CL-YACC syntax like so.

<syntaxhighlight lang="lisp">
 (yacc:define-parser lisp-parser
   (:start-symbol form)
   (:terminals (\( \) symbol constant))
   (form atom
         list )
   (atom constant
         symbol )
   (list (\( symbol forms \)))
   (forms form
          (form forms) ))

Most of this new form is self explanatory. One new part is the specification of a start symbol. This is the highest level or entry point of the parser. Now let's run this parser on some input.

 (yacc:parse-with-lexer
  (let ((list '((\( \()
                (symbol +)
                (constant 1)
                (constant 2)
                (\) \)) )))
    (lambda () (values-list (pop list))) )
  lisp-parser )
 
 ==> (|(| + (1 2) |)|)

The output is a little unimpressive, but let's recap what happened. The first argument to parse-with-lexer is a lexer function which on each call returns two values, the type of token and the value. We are using a simple list lexer where I have specified the lexical tokens ahead of time and it merely pops them off the list. Our lexical tokens are an open parenthesis, and plus symbol, two constants (1 and 2), and a closing parenthesis. The parentheses are given a token type of them self and are terminals in the grammar. If this is confusing, don't get caught up on it. A lexer's purpose is to represent a stream of lexical tokens which represent terminals in the grammar. This lexer is returning what might come from an input of "(+ 1 2)".

Now let's look at the parser. It starts in the form state. It looks at the tokens coming in and realizes that it is looking at a list (it certainly isn't looking at an atom which is defined as a constant or a symbol). It then looks at the first element in the list to ensure it is of type symbol, then it looks that the forms that make up the arguments. By default, CL-YACC returns basically how it appears in the production rules. So we see a list of tokens that are grouped in a list whenever the production rule required a list to define it. It should be noted that the parsing process described is not at all how the parser works under the hood, but it serves well as a way to think about the process.

If you have been playing around with Lisp for a while, you might ask why bother with this. If you had just typed '(+ 1 2) into the REPL, it would have given you something very similar back. Well, it is important to note that the Lisp reader is using something like this at some level. Something needs to make sense of what you typed in. But more importantly, the returned value is a default that we got because we didn't tell the parser to do anything as it parsed. In any Lisp, that default parse is going to be quite close to the input since Lisp is programmed by directly specifying the parse tree. In any other language, the first step in parsing code is often converting the language syntax to something closer to a Lisp.

That being said, let's take a look at what we actually accomplished. One thing to point out: this parser is pretty flexible. With the proper lexer, it can handle expressions like "hello", (my-function 1 2 3), and (/ 1 (+ 3 7) (- x y (a-function z))). Further, the parser knows something about the language, it knows that parentheses need to be matched, and that something like ((+ 1 2) 3) is not legal since the first element needs to be a symbol. All of that in a few lines of code.

The Lexer

Lexers are kind of a dime a dozen, but a good one is hard to find. Here we will be using DSO-LEX, a lexer that that uses CL-PPCRE to define lexical tokens.

Example: Creating a Calculator

Here we will make a very simple calculator function. People say that calculators are a parsers hello world program.

Example: An infix reader macro

There are several libraries set up to provide the Common Lisp user with infix syntax, for mathematical expressions and the like. We will make one for ourselves.

Other Parser Generators

Due to the metaprogramming nature of parser generation, Lisp has a great number of parser generators. A few others of note are FUCC, a generator that can parse more than just LALR grammars, and META, an (older) extremely lightweight generator with an eye for speed.