Programming Languages/Syntax Specification

There are two ways to describe a language: by its syntax or by its semantics. The syntax of a language is a set of rules that defines what strings of characters (sentence or statements) belong to this language; the semantics of a language describe the meaning of a given statement.

Specifying the Syntax of Programming Languages

Lexemes and Tokens

For a language, one can decompose a statement into a string of lexemes, the indecomposable building blocks of a language. For example, the following w:Perl statement:

   do { sleep 1; } while (++$i < 10);

contains lexemes: do, {, sleep, 1, ;, }, while, (, ++, $i, <, 10, ) and ;.

Lexemes can be grouped into categories. Such a category is called a token.

Context-Free Grammars

Backus-Naur Form(BNF)
Arithmetic grammar
Unary negation (~)

Parse Trees

Parse tree of complex formula using exponents, et cetera

Extended Backus-Naur Form

Extended Backus-Naur Form is intended to be a precise method for specifying the syntax of a language. It is a metalanguage, in that it is a language used to describe languages. It is also commonly seen employed in command usage or help documents.

The complete documentation on EBNF is to be found here: [1].

The basic elements of a language would be described using EBNF terminal symbols, which are then grouped together into logical higher-level expressions of the overall language using nonterminal symbols. Terminal symbols specify actual valid character patterns used to represent things like the proper syntax for variable names in the language, or for representing numbers. Nonterminal symbols group the different possible representations of a given thing (say, numerical constants in decimal, hexadecimal or scientific notation format) into a single logical thing ([Numerical_Constant]). Ultimately, by building up the various syntactical elements and expressing the ways in which they may be combined, one ends up with a symbol representing a program in the language in question.

A couple of examples are to be found in the man pages for the awk utility.

(excerpts are from [2])

First, its usage:

awk [ -F fs ] [ -v var=value ] [ 'prog' | -f progfile ] [ file ... ]

The items in square brackets are an example of EBNF and denote an optional parameter to use on the awk command line.

Secondly, the usage detail section is partially specified in EBNF.

In the example below, words such as 'pattern', 'action', 'expression' and 'statement' are EBNF nonterminal symbols.

A pattern-action statement has the form

   pattern { action }

A missing { action } means print the line; a missing pattern always matches. Pattern-action statements are separated by newlines or semicolons.

An action is a sequence of statements. A statement can be one of the following:

             if( expression ) statement [ else statement ]
             while( expression ) statement
             for( expression ; expression ; expression ) statement
             for( var in array ) statement
             do statement while( expression )
             break
             continue
             { [ statement ... ] }
             expression              # commonly var = expression
             print [ expression-list ] [ > expression ]
             printf format [ , expression-list ] [ > expression ]
             return [ expression ]
             next                    # skip remaining patterns on this input line
             nextfile                # skip rest of this file, open next, start at top
             delete array[ expression ]# delete an array element
             delete array            # delete all elements of array
             exit [ expression ]     # exit immediately; status is expression

Statements are terminated by semicolons, newlines or right braces.

Entire languages can also be documented in EBNF. The link below documents the ANSI definition of C++, and many parts of this document are written in EBNF.

[3]