Theory of Formal Languages, Automata, and Computation/Grammars and the Chomsky Hierarchy

Grammars specify how the strings in a language can be generated. Grammars are finite representations of formal languages. In this chapter we describe four broad categories of grammars and the corresponding categories of languages that they represent. The grammar and language classes are related by inclusion, each class nested within the next broader one, and they were proposed by Chomsky as potential models for natural language. Thus, the four language (and grammar) classes are known as the Chomsky hierarchy, which is summarized in Figure ChomskyOverview. The broadest class of languages, those with the least restrictive grammars, is the class of unrestricted or Type 0 languages (grammars).

Figure ChomskyOverview: The Chomsky hierarchy consists of four classes of languages (i.e., Unrestricted, Context Sensitive, Context Free, and Regular), each defined by a class of grammars. Chomsky introduced and considered these language classes as possible models of natural language.

Unrestricted (Type 0) Grammars and Languages

Figure GrammarDefinition: This is a succinct definition of a grammar of the most general type -- Type 0, or unrestricted.

A grammar is specified by a 4-tuple G = (V, Σ, P, S), where V is a finite set of variables, also called non-terminal symbols; Σ is the finite alphabet of the language being represented, also called the terminal symbols of the grammar; P is a finite set of productions; and S is the start symbol and is a member of V. V and Σ are disjoint. Each production in P takes the form α → β, where α is a string of symbols from V ∪ Σ with |α| > 0; and where β is also a string of symbols from V ∪ Σ, but with |β| ≥ 0 (i.e., β can be the empty string, denoted by ε). These and no other restrictions define the unrestricted or type 0 grammars (Figure GrammarDefinition). A grammar G specifies the strings of a language L(G), also written LG. This definition of a type 0 grammar seems to allow α to be a string of only alphabet (terminal) symbols, but I have never seen an example of this in textbooks. An exercise below asks you to reflect on this issue further.

Figure SimpleGrammar: The grammar for some number of 0s followed by an equal number of 1s, with one example derivation shown.

Consider this simple grammar, GSimple = (V, Σ, P, S), where V = {S}, Σ = {0, 1}, P = {S → ε, S → 0S1}, and the start symbol is S. There are two productions in this grammar, and we'll often use the 'or' symbol, '|', to refer to multiple productions with the same left-hand side. So, S → ε and S → 0S1 are abbreviated S → ε | 0S1. ε signifies the empty string, and is typically not part of the alphabet of a grammar (else it would be in Σ, which is not the case in this grammar). We just need some way of representing 'nothing'. In this example, there is only one variable, and the left-hand side of each production, α, is a single variable.

Figure SimpleDerivationTree: This illustrates a breadth-first enumeration of strings in the language 0^n1^n. Nodes are sentential forms. The root is the start symbol. Leaves are strings in the language. Arcs correspond to productions. A path from the root to a leaf is a derivation.

Beginning with only the start symbol, we can rewrite it using any of the productions whose left-hand side is the start symbol alone. This rewrite creates a string that may contain variables and/or terminals, called a sentential form. We can then look at the sentential form created and ask whether any productions can be applied to it. A production can be applied to a sentential form if the production's left-hand side (α) is a substring of the sentential form. In case of such a match, a new sentential form is created by rewriting the matched substring with the matching production's right-hand side (β). This can continue indefinitely. A sequence of sentential forms that results from a sequence of production applications is called a derivation, and when the last string in a derivation contains only terminal symbols, it is a string of the language generated by the grammar (Figures SimpleGrammar and SimpleDerivationTree).

Because there may be multiple productions with left-hand sides that match a given sentential form, we can write an enumeration procedure, like depth-first or breadth-first enumeration, that tries all the relevant productions and traces out the possible derivation sequences. Enumeration as it is often taught in introductory courses is probably a procedure for enumerating or systematically visiting the nodes of an explicit graph, but the enumeration procedure for generating a language is better thought of as enumerating an implicit graph (i.e., the "vertices", that is, sentential forms, are generated on demand). Enumerating implicit graphs is also characteristic of many AI problems.
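
To make this concrete, here is a minimal sketch in Python of such a breadth-first enumerator over sentential forms. The (alpha, beta) encoding of productions, the uppercase-means-variable convention, and the cutoff parameters are illustrative choices, not part of the formalism; for unrestricted grammars a length cutoff can, in principle, miss strings whose derivations pass through longer intermediate sentential forms.

```python
from collections import deque

def enumerate_language(productions, start, max_strings=5, max_len=12):
    """Breadth-first enumeration of strings generated by a grammar.

    productions: list of (alpha, beta) string pairs; by convention here,
    uppercase characters are variables and all others are terminals.
    max_len bounds the implicit graph so that the demonstration halts.
    """
    def all_terminals(form):
        return not any(c.isupper() for c in form)

    found, seen, queue = [], {start}, deque([start])
    while queue and len(found) < max_strings:
        form = queue.popleft()
        for alpha, beta in productions:
            # apply the production at every position where alpha occurs
            i = form.find(alpha)
            while i != -1:
                new_form = form[:i] + beta + form[i + len(alpha):]
                if new_form not in seen and len(new_form) <= max_len:
                    seen.add(new_form)
                    if all_terminals(new_form):
                        found.append(new_form)
                    else:
                        queue.append(new_form)
                i = form.find(alpha, i + 1)
    return found

# GSimple: S -> ε | 0S1, with "" encoding ε
g_simple = [("S", ""), ("S", "0S1")]
print(enumerate_language(g_simple, "S"))  # expected: ['', '01', '0011', ...]
```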

A more complicated grammar is GComplex = (V = {S, A, B, C, D, E}, Σ = {a}, P, S), due to Hopcroft and Ullman (1979)[1], where P is the following set of productions.

1) S → ACaB

2) Ca → aaC

3) CB → DB

4) CB → E // FYI: we could summarize productions (3) and (4) as CB → DB | E

5) aD → Da

6) AD → AC

7) aE → Ea

8) AE → ε // ε is the empty string

What language could GComplex represent? Since 'a' is the only terminal symbol, it must be a language of strings that only contain a's. It's a fair guess that the number of a's in each string is what determines membership. Let's start seeing what strings of a's we are able to derive. Can we derive the string of zero a's, i.e., the empty string? No, since the start symbol is on the left-hand side of only production (1), an 'a' is introduced right off the bat, and importantly there is no production that removes a's once one is introduced.

Can we derive a string with exactly one 'a'? S ⇒ ACaB using production (1) ⇒ AaaCB using production (2). Production (2) is the only production that is applicable; you should confirm that its left-hand side, 'Ca', is the only left-hand side that matches a substring of 'ACaB'. Again, there is no way to remove an 'a' from 'AaaCB' in order to arrive at a string with a single 'a'.

Can we derive a string with exactly two a's? S ⇒ ACaB using production (1) ⇒ AaaCB using production (2) ⇒ AaaDB using production (3) ⇒ AaDaB using production (5) ⇒ ADaaB using production (5) ⇒ ACaaB using production (6) ⇒ AaaCaB using production (2), but again too many a's if two a's is our target. Failing to find ways of combining productions to remove a's, this path of rewrites is a dead end. But notice that when we applied production (3) during this sequence of rewrites we had another option available to us -- we could have applied production (4) instead of (3). Let's look at that option instead. S ⇒ ACaB using production (1) ⇒ AaaCB using production (2) ⇒ AaaE using production (4). Continuing from AaaE we have AaaE ⇒ AaEa using production (7) ⇒ AEaa using production (7) ⇒ aa using production (8).

So we can derive 'aa' from GComplex. Can we derive 'aaa'? If we continue on from our previous dead end of 'AaaCaB' above, we should quickly conclude that only production (2) will apply, which will result in 'AaaaaCB'. We can't derive 'aaa', but you should confirm that we can derive 'aaaa'. Up until this point we have been hand simulating an enumeration procedure, which is probably best left to automation. As an exercise, work with a large language model such as ChatGPT to code a procedure for performing an enumeration of strings generated by GComplex (a sketch of such a procedure follows below). But let's also step back and reason about the generation process. We have a variable 'C' that works its way from left to right through successive sentential forms, doubling a's as it goes. When the 'C' reaches the right end, it either changes to an 'E' using production (4) and goes left repeatedly using production (7) until it reaches the left end and 'exits' using production (8); or the 'C' at the right end becomes a 'D' using production (3), again going left repeatedly using production (5), but this time leading to a sentential form that will again result in doubling the a's. Rather than the language of an even number of a's, confirm your understanding that GComplex generates the strings of a's whose length is a positive power of 2, i.e., {a^(2^k) | k ≥ 1}. Followup questions in our analysis include (i) why must E go left at all? (ii) Could the grammar be altered, creating an equally valid equivalent grammar, so that D doubled a's as it was going left? (iii) And if so, what other revisions would have to be made?
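
As a preview of the coding exercise suggested above, the productions of GComplex translate directly into the encoding used by the enumerate_language sketch from earlier, with "" again encoding ε:

```python
# GComplex, encoded for the enumerate_language sketch defined earlier
g_complex = [("S", "ACaB"), ("Ca", "aaC"), ("CB", "DB"), ("CB", "E"),
             ("aD", "Da"), ("AD", "AC"), ("aE", "Ea"), ("AE", "")]

# expected: ['aa', 'aaaa', 'aaaaaaaa'] -- lengths 2, 4, and 8
print(enumerate_language(g_complex, "S", max_strings=3, max_len=16))
```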

What is made particularly clear in this latter example of a grammar is that there is a procedure -- an iterative, looping procedure -- that underlies the design of the grammar, and though each production can be interrogated in isolation, each rule only makes sense in its relation to the entire grammar.

Markov Algorithms

Markov algorithms make the procedure explicit in the formalism: the productions are given in a fixed order, and at each step the first applicable production is applied at the leftmost position where its left-hand side matches, so the rewriting process is deterministic rather than a nondeterministic search.

The Markov of 'Markov algorithms' is the son of the Markov of 'Markov chain' fame; both are named Andrey Andreyevich Markov.

Non-Contracting aka Context Sensitive (Type 1) Grammars and Languages

If for every production α → β we have |β| ≥ |α|, then the grammar is said to be a non-contracting grammar, since sentential forms never shrink along a derivation path. The class of non-contracting grammars defines a class of languages identical to the class defined by the context sensitive grammars. A context sensitive grammar has productions of the form α1Aα2 → α1βα2, where A is a variable, and α1, α2, and β are arbitrary strings of variables and alphabet symbols, but β cannot be the empty string. The name context sensitive is used because α1 and α2 give the context in which A can be rewritten as β.

It should be clear that every context sensitive grammar is a non-contracting grammar, but it's also the case that every non-contracting grammar can be converted to a context sensitive grammar that defines the same language.[2][3] Thus, the non-contracting grammars and the context sensitive grammars define the same class of languages. You can think of context-sensitive as a normal form for non-contracting, but 'context sensitive' is so widely used that even when talking about non-contracting grammars, we'll often use the equivalent label of context sensitive grammars (CSGs) and languages (CSLs). In any case, these grammars and languages are also referenced as type 1.

Consider this CSG, Gabc = ({S, B, C}, {a, b, c}, P, S), which is due to Hopcroft and Ullman (1969)[4], where P is the set containing the following productions.

1) S → aSBC

2) S → aBC

3) CB → BC

4) aB → ab

5) bB → bb

6) bC → bc

7) cC → cc

Let's understand the procedure that is specified by this grammar. Production (1) can be used repeatedly to add a positive number of a's, as well as 'placeholders' for an equal number of b's (Bs) and c's (Cs). After applying production (1) a desired number of times, we can 'exit' with production (2), then move C's to the right with production (3), and convert the placeholders of Bs and Cs to their lower case equivalents. Confirm your understanding that the language generated by Gabc is {a^nb^nc^n | n ≥ 1} (i.e., a positive number of a's followed by an equal number of b's and finally an equal number of c's).
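
As a quick sanity check, the bounded enumerator sketched in the section on unrestricted grammars can generate the first few strings of Gabc (the bounds are again only illustrative):

```python
# Gabc from above, encoded for the enumerate_language sketch
g_abc = [("S", "aSBC"), ("S", "aBC"), ("CB", "BC"),
         ("aB", "ab"), ("bB", "bb"), ("bC", "bc"), ("cC", "cc")]

# expected: ['abc', 'aabbcc', 'aaabbbccc']
print(enumerate_language(g_abc, "S", max_strings=3, max_len=9))
```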

Consider the grammar GComplex presented earlier. This is clearly not a CSG, since productions (4) and (8) are contracting. Does this mean that L(GComplex) is not a CSL? Not necessarily, since it's possible that there is a CSG that is equivalent to GComplex. Finding an alternative and equivalent CSG for GComplex is left as an exercise.

A small point is that a grammar that is non-contracting cannot define a language that includes the empty string, since there can be no production of the form α → ε, because |ε| = 0 < |α|. So technically the context-sensitive languages do not include the empty string. But it's a technicality that is easily overcome, and we won't let it prevent us from talking about languages that include ε but that are "otherwise" context-sensitive. Suppose we have a grammar G (with start symbol S) that is context-sensitive and thereby excludes ε from L(G). If we want to include ε then we can create a new grammar G' with new start symbol S' that is identical to G, except with an additional variable S' and two new productions, S' → S and S' → ε. S' is never used otherwise. If there is something theoretical that relies on the exclusion of ε then we can focus on L(G), but for purposes of allowing ε we can use L(G'). In fact, we will relax this a bit more and allow a production of the form S → ε, so long as S is the start symbol. Note that if S appears on the right-hand side of any production then this allowance will permit a rewrite to ε in intermediate sentential forms, but again that's easily remedied if there is a theoretical point in doing so.

We will talk more about the computational implications of context sensitive languages later, but we will preview an important point now. Suppose we ask whether w is in a language L(G) -- yes or no? We can construct an algorithm to answer this question for a CSL. We can enumerate the strings in the language using the CSG, and whenever we derive a string of terminals only, we check to see if it's equal to w and answer 'yes' if so. In contrast, when a sentential form is created that is longer than w, we know that w cannot lie along that path, since the sentential forms along that path will never shrink. If all paths in the breadth-first enumeration have sentential forms that are longer than w, then we know that w cannot be in the language and the algorithm can answer 'no'.
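
Here is a minimal sketch of that membership algorithm in Python, using the same production encoding as the earlier enumerator; the length-based pruning is exactly the non-contracting property at work:

```python
from collections import deque

def csl_member(productions, start, w):
    """Decide whether w is derivable in a non-contracting grammar.

    Sentential forms never shrink, so any form longer than w can be
    pruned; the remaining search space is finite, and the procedure
    always halts with a yes or no answer.
    """
    seen, queue = {start}, deque([start])
    while queue:
        form = queue.popleft()
        if form == w:
            return True
        for alpha, beta in productions:
            i = form.find(alpha)
            while i != -1:
                new_form = form[:i] + beta + form[i + len(alpha):]
                if len(new_form) <= len(w) and new_form not in seen:
                    seen.add(new_form)
                    queue.append(new_form)
                i = form.find(alpha, i + 1)
    return False

print(csl_member(g_abc, "S", "aabbcc"))  # True
print(csl_member(g_abc, "S", "aabc"))    # False
```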

Context Free (Type 2) Grammars and Languages

If, in each of G's productions α → β, the left-hand side α is a single variable, then G is a context-free grammar (CFG) and L(G) is a context-free language (CFL). That is, regardless of the symbols in a sentential form that surround the variable on a production's left-hand side, the variable can be rewritten to β. Context-free grammars and languages are subsets of context-sensitive (or non-contracting) grammars and languages. Every context free language is also a context sensitive language, but not vice versa. Similarly, every context sensitive language is an unrestricted language, but not vice versa. Thus, because inclusion is transitive, every CFG is an unrestricted grammar and every CFL is an unrestricted language.

Consider the grammar GSimple from a previous section on unrestricted languages. GSimple is a CFG and L(GSimple) is a CFL. A similarly structured CFG is GPalindromes, over Σ = {a, b}.

S → aSa | bSb | a | b | ε

Note that in this example and others, we will make an allowance for the empty string ε to be in the language. Technically, the empty string is not in a CFL because that would mean that there was at least one contracting production, which is disallowed in a CSL, and thus a CFL too. Nonetheless, the exception is widely used and we'll allow the start symbol of a CSG (and this CFG) to produce the empty string.

A derivation for 'bbaabaabb' using GPalindromes, S ⇒* bbaabaabb, is S ⇒ bSb ⇒ bbSbb ⇒ bbaSabb ⇒ bbaaSaabb ⇒ bbaabaabb.
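
Reusing the enumerate_language sketch from earlier, we can generate the shortest strings of GPalindromes as a sanity check:

```python
# GPalindromes: S -> aSa | bSb | a | b | ε
g_pal = [("S", "aSa"), ("S", "bSb"), ("S", "a"), ("S", "b"), ("S", "")]

# expected: ['a', 'b', '', 'aaa', 'aba', 'aa', ...] -- each string reads
# the same forwards and backwards
print(enumerate_language(g_pal, "S", max_strings=10, max_len=5))
```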

The CFGs and CFLs are probably the most important class of grammars and languages to us, at least for practical purposes. After Chomsky's treatment of grammars was published, Algol became the first programming language with a syntax that was defined by a context free grammar. These grammars have been used to define other programming languages, as well as markup languages, and the grammars are adapted to define parsers for these languages. You have probably seen what amounts to context free grammars in syntax diagrams for various languages.

Leftmost and Rightmost Derivations

In a derivation of a string using a CFG, every sentential form before the final string will have one or more variables in it. If at every step we rewrite the leftmost variable in the sentential form, then the derivation is a leftmost derivation. Similarly, if we rewrite the rightmost variable at each step then the derivation is a rightmost derivation. We will typically be most concerned with leftmost derivations because of the use of CFGs in programming languages. Though we won't interrogate the relationship deeply, suffice it to say that because a programming language's syntax is largely based on CFGs and because a language translator parses a program left-to-right, leftmost derivations are more practically relevant.

In all derivations of strings in L(GPalindromes) there will always be exactly one variable (i.e., S) in every sentential form of the derivation, except the last, so that there is no difference between leftmost and rightmost derivations in that example. Let's consider a grammar for balanced parentheses instead, where the productions of GBalanced are as follows.

B → (B) | BB | ( ) | ε

A leftmost derivation for '(( ))( )' is B ⇒ BB ⇒ (B)B ⇒ (( ))B ⇒ (( ))( ).

In contrast, a rightmost derivation for the same string is B ⇒ BB ⇒ B( ) ⇒ (B)( ) ⇒ (( ))( ).

Note that these derivations appear different, but the difference is entirely due to the order in which the variables are rewritten, and not due to something more fundamental.

Parse Trees

This is the single parse tree for the derivation of 'bbaabaabb' using GPalindromes. The dashed red enclosure highlights the actual string in L(GPalindromes).

A parse tree is another representation of a derivation of a string using a CFG. The root of a parse tree is the start variable of the grammar. Each internal node corresponds to a variable that is rewritten, and its children represent the symbols, terminal and non-terminal, on the right-hand side of the production used to rewrite that variable. The leaves throughout the parse tree are the characters in a string of the language. The derived string can be written by doing a depth-first enumeration of the tree, writing leaf/terminal symbols only.

Three different ways of representing the derivation of '(( ))( )'. Note that the parse tree does not vary with the order, leftmost or rightmost or mixed for that matter, in which variables are rewritten.

Parse trees are invariant relative to the order of variable rewrites, so that the leftmost and rightmost derivation for '(( ))( )' using GBalanced correspond to the same parse tree.
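
This invariance is easy to see computationally. The following sketch, with data types of our own choosing, builds the single parse tree for '(( ))( )' (written without the internal spaces) under GBalanced and reads both the leftmost and the rightmost derivation off of that one tree:

```python
class Node:
    """A parse-tree node: a grammar symbol plus children (leaves have none)."""
    def __init__(self, symbol, children=()):
        self.symbol = symbol
        self.children = list(children)

def derivation(tree, leftmost=True):
    """Read a derivation off a parse tree by repeatedly expanding the
    leftmost (or rightmost) not-yet-expanded variable node."""
    forms, current = [tree.symbol], [tree]
    while True:
        order = range(len(current)) if leftmost else reversed(range(len(current)))
        idx = next((i for i in order if current[i].children), None)
        if idx is None:
            break
        current = current[:idx] + current[idx].children + current[idx + 1:]
        forms.append("".join(n.symbol for n in current if n.symbol != "ε"))
    return " => ".join(forms)

# the single parse tree for '(())()': B -> BB, left B -> (B) with B -> (),
# right B -> ()
pair = lambda: Node("B", [Node("("), Node(")")])
tree = Node("B", [Node("B", [Node("("), pair(), Node(")")]), pair()])

print(derivation(tree, leftmost=True))   # B => BB => (B)B => (())B => (())()
print(derivation(tree, leftmost=False))  # B => BB => B() => (B)() => (())()
```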

Ambiguity of a CFG

A grammar is ambiguous if there are two or more leftmost derivations for at least one string in the language generated by the grammar. Alternatively, we could say that if there are two or more rightmost derivations, or two or more distinct parse trees, for a string in the language then the generating grammar is ambiguous. These three definitions of ambiguity are equivalent. Consider a CFG, GExp = (V, Σ, P, S), for simplified arithmetic expressions, where V = {Exp}, Σ = {id, +, *, (, )}, S = Exp, and P equals the set containing these productions:

Exp → id

Exp → Exp + Exp

Exp → Exp * Exp

Exp → (Exp)

Two leftmost derivations for the same string demonstrate the ambiguity of grammar GExp for simple arithmetic expressions.

Two parse trees for the same string are an alternate and equivalent demonstration of the ambiguity of grammar GExp for simple arithmetic expressions.

This grammar is ambiguous since there are two distinct leftmost derivations for the string id + id * id. There are two distinct parse trees for this string as well. You can confirm that there are two distinct rightmost derivations as well. Can you find other strings in L(GExp) that have more than one leftmost (rightmost) derivation and more than one parse tree?
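
For short strings, ambiguity can be confirmed by brute force. The sketch below is our own construction, not a standard algorithm: it counts the leftmost derivations of a target token string, relying on the fact that GExp has no contracting productions, so sentential forms longer than the target can be pruned:

```python
def count_leftmost_derivations(grammar, start, target):
    """Count the leftmost derivations of target (a list of tokens).

    grammar maps each variable to a list of right-hand sides (token
    lists). Assumes no right-hand side is empty, so sentential forms
    never shrink and forms longer than the target can be pruned.
    """
    def count(form):
        if form == target:
            return 1
        if len(form) > len(target):
            return 0
        # locate the leftmost variable; if none, this is a dead end
        idx = next((i for i, s in enumerate(form) if s in grammar), None)
        if idx is None:
            return 0
        return sum(count(form[:idx] + rhs + form[idx + 1:])
                   for rhs in grammar[form[idx]])
    return count([start])

g_exp = {"Exp": [["id"], ["Exp", "+", "Exp"], ["Exp", "*", "Exp"],
                 ["(", "Exp", ")"]]}
print(count_leftmost_derivations(g_exp, "Exp",
                                 ["id", "+", "id", "*", "id"]))  # 2
```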

Ambiguity of context-free grammars is particularly important because the syntax of so many programming and markup languages is specified by context-free grammars. As an interpreter or compiler processes a program or document, it should know precisely what to do in response, and an ambiguous grammar can be problematic in guiding the interpreter on the right course of action (e.g., on what micro-instructions should be executed or written as the higher level program is being processed). It's possible to add decision rules on what to do that lie outside the grammar, but removing the ambiguity in the grammar itself is an elegant option. The grammar for arithmetic expressions is the classic educational example of revising an ambiguous grammar to remove ambiguity.

It would be nice if there were a general algorithm that accepted an ambiguous grammar and created an unambiguous grammar that generated the same language as the input grammar. Alas, no such algorithm exists. There is not even an algorithm that identifies whether an arbitrary input grammar is ambiguous or not! The picture is not that dismal, however, since in practice there are rules of thumb for disambiguating a grammar based on 'domain' knowledge. Programming language constructs are one kind of domain knowledge that can guide and constrain disambiguation, as in the case of the arithmetic expression example, where the domain knowledge is the constraints provided by precedence conventions, with identifiers and parenthesized expressions having the highest precedence, addition having the lowest, and multiplication in between. In the case of GExp, a variation we'll call GOpPrec = (V, Σ, P, S) generates the same language (i.e., L(GExp) = L(GOpPrec)) but is not ambiguous. For GOpPrec, V = {Factor, Term, Exp}, Σ = {id, +, *, (, )}, S = Exp, and P equals the set containing these productions:

Factor → id | (Exp)

Term → Factor | Term * Factor

Exp → Term | Exp + Term
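
Reusing count_leftmost_derivations from above, we can check that the string with two leftmost derivations under GExp has exactly one under GOpPrec:

```python
g_opprec = {"Exp": [["Term"], ["Exp", "+", "Term"]],
            "Term": [["Factor"], ["Term", "*", "Factor"]],
            "Factor": [["id"], ["(", "Exp", ")"]]}

print(count_leftmost_derivations(g_opprec, "Exp",
                                 ["id", "+", "id", "*", "id"]))  # 1
```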

The grammar GBalanced is ambiguous because there is more than one way to derive '(( ))( )' and other strings, by alternatively using production B → ( ) or B → ε. Notice that in the parse tree, ε is shown as a leaf, but is not considered an explicit symbol in the final string.

An expression grammar based on conventional binary operator precedence nests the highest precedence operator (i.e., *) more deeply, as on the left; in a postorder traversal of the tree, deeper operators are evaluated before their ancestors. In contrast, operators of the same precedence, as on the right, are evaluated left to right, and so those on the left will be deeper in the parse tree.

We will return to this grammar when we talk about the relevance of CFLs to programming languages. Consider another example of an ambiguous grammar, GBalanced, from above. Can you see why this grammar is ambiguous? Consider the string that we have already addressed, '(( ))( )'. There are at least two leftmost derivations for this string. We previously looked at only one, which is repeated here: B ⇒ BB ⇒ (B)B ⇒ (( ))B ⇒ (( ))( ). This derivation uses the production B → ( ) to transition from the third to the fourth sentential form. But consider this leftmost derivation, which uses the production B → ε: B ⇒ BB ⇒ (B)B ⇒ ((B))B ⇒ (( ))B ⇒ (( ))( ). Are there any other leftmost derivations for the same string? Naturally, if there are distinct leftmost derivations, then there are distinct rightmost derivations and distinct parse trees too.

In the case of this last example of ambiguity, there is a domain independent strategy for removing this source of ambiguity while still retaining ε as a member of the language. We create a new grammar, GBalanced', by introducing a new start variable, B', and two productions, B' → B | ε, and removing the production B → ε. Note that ε is still in the language and that L(GBalanced') = L(GBalanced), and there is now only one leftmost derivation of '(( ))( )'. (The production B → BB still permits distinct parses of strings such as '( )( )( )', so GBalanced' is not yet fully unambiguous, but the ε-based ambiguity is gone.) This domain independent strategy can be applied in any similar circumstance to remove the same form of ambiguity.

Inherent ambiguity of a CFL

If all grammars that generate a language are ambiguous then the language is inherently ambiguous. There is no algorithm in the general case that can identify whether a language is inherently ambiguous or not, but clearly if an unambiguous grammar that generates the language is identified then the language is not inherently ambiguous.

Remember that ambiguity is a property of grammars, and inherent ambiguity is a property of languages.

Normal Forms

There are two normal forms for context free grammars that have been used for practical and theoretical purposes. Any CFL that excludes ε can be expressed by a grammar in both normal forms.

Chomsky normal form (or CNF, not to be confused with conjunctive normal form in logic) has productions of the form X → YZ or X → x, where X, Y, and Z are variables and x is a terminal. Greibach normal form (or GNF) has productions of the form X → xβ, where X is a variable, x is a terminal, and β is a (possibly empty) string of variables only.

There are algorithms for translating an arbitrary CFG into an equivalent CFG in CNF and into an equivalent CFG in GNF. The conversion to CNF is the easier of the two. For example, if we want to find a CNF grammar that is equivalent to Gsimplelogic, with productions S → ~S | [S ⊃ S] | p | q, then first (1) replace all terminals with terminal-specific variables when the right-hand side (rhs) is longer than one symbol:

S → NS | LSISR | p | q

N → ~

L → [

R → ]

I → ⊃

and then (2) accumulate variables to break up right-hand sides of length greater than 2:

S → NS | p | q | XLSIS R

XLSIS → XLSI S

XLSI → XLS I

XLS → L S
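
These two steps are mechanical enough to sketch in code. The helper below performs both; the fresh-variable naming scheme, and the fact that it accumulates variables from the right rather than from the left as the worked example above does (both are valid), are our own choices:

```python
def to_cnf(productions, variables):
    """Two-step CNF conversion from the text: (1) give each terminal in a
    long right-hand side its own variable; (2) break right-hand sides
    longer than two symbols into a chain of two-symbol productions.
    Productions are (lhs, rhs) pairs with rhs a tuple of symbols."""
    term_var, step1 = {}, []
    for lhs, rhs in productions:
        if len(rhs) > 1:  # step 1 applies only to long right-hand sides
            rhs = tuple(s if s in variables else term_var.setdefault(s, "T" + s)
                        for s in rhs)
        step1.append((lhs, rhs))
    step1 += [(v, (t,)) for t, v in term_var.items()]

    cnf, fresh_count = [], 0
    for lhs, rhs in step1:
        while len(rhs) > 2:  # step 2: one fresh variable per split
            fresh_count += 1
            fresh = "X" + str(fresh_count)
            cnf.append((lhs, (rhs[0], fresh)))
            lhs, rhs = fresh, rhs[1:]
        cnf.append((lhs, rhs))
    return cnf

# Gsimplelogic: S -> ~S | [S ⊃ S] | p | q
g_logic = [("S", ("~", "S")), ("S", ("[", "S", "⊃", "S", "]")),
           ("S", ("p",)), ("S", ("q",))]
for lhs, rhs in to_cnf(g_logic, variables={"S"}):
    print(lhs, "->", " ".join(rhs))
```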

The Pumping Lemma for CFLs

If there is a string w that is generated by a CFG in Chomsky Normal Form with m variables, and if the string is long enough (|w| > 2^(m-1)), then there is a variable that must have been used twice in the longest path of the parse tree for w, and w can be written as w = uvxyz.

If w = uvxyz is in the CFL L and |w| > 2^(m-1) with |vxy| ≤ 2^m, and v and y are not both ε, then uv^ixy^iz for i ≥ 0 is in L too.

There are identifiable patterns in CFLs that are useful to know, for purposes of theory at least. An important such pattern is given by the Pumping Lemma (PL) for CFLs. The PL says that for any CFL there is a length threshold, call it n. If a string w in the language is of length n or more, then w can be written as uvxyz, and substrings v and y can be repeated ('pumped') indefinitely, with each resulting string in the language too. Substrings v and y can be omitted as well, signified by v^0 and y^0, and the resulting string will still be in the language. The lemma can be expressed with quantifiers as follows. If L is a CFL then

∃n (∀w∈L, |w| ≥ n (∃uvxyz=w, |vxy|≤n, |vy|≥1 (∀i≥0 uv^ixy^iz ∈ L)))

“∃n (∀w∈L, |w| ≥ n” says that there is a threshold n such that for all sufficiently long strings in the language; “(∃uvxyz=w, |vxy|≤n, |vy|≥1” says that w can be rewritten in terms of 5 substrings, where the central three (vxy) together are no longer than the threshold (i.e., v and y are sufficiently close together), and v and y consist of at least one character between them.

“(∀i≥0 uv^ixy^iz ∈ L)))” says that we can repeat v and y zero or more times and the resulting string will still be in the language.

The reasoning behind this construction stems from considering the parse tree for a sufficiently long string generated by a Chomsky Normal Form grammar, which can represent any CFL. If the CNF grammar has k variables, then let the threshold n be 2^k. As the accompanying figures illustrate for the case of a string of length at least n, at least one variable must appear at least twice along the longest path of the parse tree. It is the identification of this necessarily repeating variable that is the basis of being able to repeat or 'pump' substrings with the expectation that the result will be in the language. If the pumping lemma seems somewhat convoluted, give it time and it may seem elegant, and maybe it does already.

Proofs using the Pumping Lemma

The Pumping Lemma can be used to show by contradiction that certain languages are not CFLs. Our job is to find a sufficiently long string w that is undeniably in the language of interest, and then to show that for every way of partitioning w into uvxyz under the PL's constraints, pumping v and y some number of times yields a string outside the language. The outline of a proof that L is not a CFL follows the lemma as expressed with quantifiers above.

  1. “∃n (∀w∈L, |w| ≥ n” Pick an n. It must exist if L is a CFL. We don't have to tie n down to a particular value because we will next select a sufficiently long w that is defined in terms of n.
  2. “(∃uvxyz=w, |vxy|≤n, |vy|≥1” w was selected in step 1 so as to constrain the partitioning into uvxyz in a manner that satisfies the constraints |vxy|≤n and |vy|≥1.
  3. “(∀i≥0 uv^ixy^iz ∈ L)))” Find a value i (we only need to find one value of i, but again it can be a function of n) such that uv^ixy^iz ∉ L, which contradicts ∀i≥0 uv^ixy^iz ∈ L.

Example: Show that L = {a^i | i is a prime number} is not a CFL.

Step 1: Let n be the pumping lemma constant. Select w = a^m, where m is the first prime greater than or equal to n.

Step 2: w = a^m = uvxyz, where |vxy| ≤ n and |vy| ≥ 1, though the former constraint is not used in this example.

Step 3: Without loss of generality assume v has j a's and y has k a's (j+k ≥ 1), so v = a^j, y = a^k, and uxz = a^(m-(j+k)).

Now consider successive values of i:

|w| = |uvxyz| = j + k + m - (j + k) = m

|uv^2xy^2z| = 2j + 2k + m - (j + k) = m + j + k

|uv^3xy^3z| = 3j + 3k + m - (j + k) = m + 2j + 2k

…

|uv^(m+1)xy^(m+1)z| = (m+1)j + (m+1)k + m - (j + k) = m + mj + mk = m(1 + j + k)

Since m ≥ 2 and 1 + j + k ≥ 2, m(1 + j + k) can't be prime, so a^(m(1+j+k)) is not in the language, which violates the PL condition that (∀i≥0 uv^ixy^iz ∈ L) if L is a CFL.

In step 3 it can be the case that we have to consider a number of different cases in order to show a contradiction across all i.

Example: Show that {a^ib^jc^k | i+1 < j, j+1 < k} is not a CFL.

Step 1: Let n be the pumping lemma constant. Let w = a^nb^(n+2)c^(n+4). w is certainly long enough whatever the particular value of n.

Step 2: Let w = a^nb^(n+2)c^(n+4) = uvxyz, with |vy| ≥ 1 and |vxy| ≤ n.

Step 3, case where v and y consist of a's and/or b's only: pumping up enough times yields either too many a's relative to the b's (violating i+1 < j) or too many b's relative to the c's (violating j+1 < k).

Step 3, case where v contains a's and y contains c's: can't happen, since vxy would have to span the n+2 b's, violating the constraint that |vxy| ≤ n.

Step 3, case where v and/or y contains c's: consider uv^0xy^0z. If a b is removed as well, then the b's no longer outnumber the a's by at least two, violating i+1 < j; otherwise at least one c is removed and the c's no longer outnumber the b's by at least two, violating j+1 < k.

Regular (Type 3) Grammars and Languages

Finally, regular languages are a subset of the context free languages, and regular grammars put a further restriction on context free grammars. A regular grammar is a context free grammar where every production is of the form A → aB or A → a, where A and B are variables and a is a terminal. At first glance a regular grammar resembles a grammar in GNF. But where GNF allows any number of variables to follow the first terminal in the right-hand side of a production, a regular grammar allows at most one. We will get to applications of regular grammars and languages later when discussing the broader applications of context free languages for programming and markup language syntax.

Consider this grammar Gregular = ({S, A, B}, {0, 1}, P, S), where P is the set of productions:

S → 0A

S → 1B

A → 0A

A → 0S

A → 1B

B → 1B

B → 1

B → 0

S → 0

Figure ParseRegular: A parse tree for '1110' using Gregular.

You are asked to characterize the language generated by Gregular as an exercise, but clearly it is regular. Figure ParseRegular gives a parse tree for the string 1110, which is a right linear tree. Every parse tree for a string generated by a regular grammar will be right linear. For this reason regular grammars are sometimes called right linear grammars, and the languages they generate are sometimes called right linear languages in other sources. This book will use 'regular' consistently when referring to grammars and languages, however.

Every regular grammar is a context free grammar. Note that regular grammars, by definition, are in GNF. A regular grammar can be easily changed to a grammar in CNF that accepts the same language. In particular, replace each production of the form A → aB in a regular grammar with A → A'B, where A' → a is also added as a production.
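
A sketch of this transformation, with a representation and fresh-name scheme of our own choosing:

```python
def regular_to_cnf(productions):
    """Convert a regular grammar (productions A -> aB or A -> a) to CNF:
    each A -> aB becomes A -> A'B with a new production A' -> a, while
    A -> a is already in CNF. Productions are (lhs, rhs-tuple) pairs."""
    cnf, fresh = [], {}
    for lhs, rhs in productions:
        if len(rhs) == 1:                        # A -> a: already CNF
            cnf.append((lhs, rhs))
        else:                                    # A -> aB
            a, b = rhs
            var = fresh.setdefault(a, a + "'")   # dedicated variable for a
            cnf.append((lhs, (var, b)))
    cnf += [(v, (a,)) for a, v in fresh.items()]
    return cnf

# Gregular from above, e.g. S -> 0A becomes S -> 0'A with 0' -> 0
g_regular = [("S", ("0", "A")), ("S", ("1", "B")), ("A", ("0", "A")),
             ("A", ("0", "S")), ("A", ("1", "B")), ("B", ("1", "B")),
             ("B", ("1",)), ("B", ("0",)), ("S", ("0",))]
for lhs, rhs in regular_to_cnf(g_regular):
    print(lhs, "->", " ".join(rhs))
```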

The Pumping Lemma for Regular Languages

The Pumping Lemma for regular languages is a special case of the Pumping Lemma for context free languages.

Since every regular language is also a context free language, the pumping lemma for CFLs also applies to RLs, but in this latter case of a regular language, y and z are both empty. Thus, the Pumping Lemma for regular languages can be simplified to ∃n (∀w∈L, |w| ≥ n (∃uvx=w, |uv|≤n, |v|≥1 (∀i≥0 uv^ix ∈ L))).

Let's revisit L = {0^m | m is prime}. We already know that this is not a CFL, so it's obviously not an RL either; this is just to illustrate the RL form of the Pumping Lemma. Assume L is regular, so L is accepted by a DFA with n states (n also serving as the pumping lemma constant). Consider a string of m 0s where m is the first prime number greater than n.

0^m = 0^j0^k0^(m-(j+k)), where u = 0^j, v = 0^k, and x = 0^(m-(j+k)).

|0^m| = j + k + (m - (j+k)) = m

Pump v once: j + 2k + (m - (j+k)) = k + m

Pump v twice: j + 3k + (m - (j+k)) = 2k + m

Pump v m times: j + (m+1)k + (m - (j+k)) = mk + m = m(k+1), which is not prime

0^(m(k+1)) is not in L by definition -- a contradiction.

Exercises, Projects, and Discussions

Type 0 Discussion 1: We noted in the first paragraph of this section that the definition of a grammar did not seem to overtly preclude the possibility of a production with a left-hand side of only terminal symbols (i.e., with no variables). List the implications of allowing productions with left-hand sides of only terminal symbols. As you reflect on the matter, consider discussing the issue with a large language model -- you are welcome to do so.

Type 0 Project 1: Research and discuss the relationship between Type 0 grammars and Markov Algorithms.

CSL Exercise 1: Give a derivation for 'aabbcc' using Gabc.

CSL Exercise 2: Give an attempted derivation for 'aabc' using Gabc. Clearly indicate deadend paths in a breadth-first enumeration of paths.

CSL Exercise 3: Find a context sensitive (non-contracting) grammar that is equivalent to GComplex -- that is, your answer should generate the same language as GComplex, but have no contracting productions. Hint: such a grammar can be found. You are welcome to use a large language model to gain insights into the process of converting GComplex to an equivalent CSG, GComplex'.

CSG Exercise 4: Give a context-sensitive grammar, Gww, for the language L = {ww | w is a non-empty binary string of 0s and 1s, and ww is such a string concatenated with itself}.

PL Exercise 1: Show {a^(j^2)b^j | j ≥ 1} is not a CFL. Note that the number of a's is j^2, not j*2; that is, the number of a's is the square of the number of b's.

PL Exercise 2: Show that {b_i#b_(i+1) | b_i is i in binary} is not a CFL.

PL Exercise 3: Show that {a^ib^ic^i | i ≥ 1} is not a CFL.

PL Exercise 4: Can v^0 and y^0 be used in an alternative proof that L = {a^i | i is a prime number} is not a CFL? If so, how?

CFG Exercise 1: Give a context-free grammar for the language of strings over {a,b} with exactly twice as many a's as b's. Due to Hopcroft and Ullman (1979)[5] and Hopcroft, Motwani, and Ullman (2007)[6]. Hint: you may have to consider different special cases, allowing for the a's to be positioned differently relative to the b's.

CFG Exercise 2: Give a context-free grammar for the language {a^ib^jc^k | i ≠ j or j ≠ k}. Due to Hopcroft and Ullman (1979) and Hopcroft, Motwani, and Ullman (2007).

CFG Exercise 3: Give a Chomsky Normal Form grammar (a) for L(GExp), (b) for balanced parens, (c) for palindromes over {a, b}.

CFG Exercise 4: Give a Greibach Normal Form grammar (a) for L(GExp), (b) for balanced parens, (c) for palindromes over {a, b}.

Type 3 Exercise 1: Give an English characterization of the language generated by Gregular. If you use a large language model, then give the prompt that you used, and the response that you received, along with any edits or corrections that you made.

Type 3 Exercise 2: Give a regular grammar that generates allowable identifiers in a fictional programming language, where the identifiers must begin with a letter, followed by any finite number of letters and digits. Special characters _ (underscore) and - (hyphen) can be used but no two of these special symbols can appear consecutively, nor can an identifier end in a special symbol.

Type 3 Exercise 3: (Due to ?) Give a regular grammar generating L = {w | w is a string of one or more 0s and 1s and w does not contain two consecutive 1s}

PL Exercise 5: Show that the set of strings of balanced parentheses is not regular using the Pumping Lemma.

PL Exercise 6: Show that L = {0^(i*i) | i is an integer and i ≥ 1} is not regular. I used i*i instead of i^2 for reasons of formatting.

References

  1. Hopcroft, John E; Ullman, Jeffrey D (1979). Introduction to Automata Theory, Languages, and Computation. Addison Wesley. p. 220. ISBN 0-201-02988-X.
  2. Mateescu & Salomaa (1997), Theorem 2.1, p. 187
  3. John E. Hopcroft, Jeffrey D. Ullman (1979). Introduction to Automata Theory, Languages, and Computation. Addison-Wesley. ISBN 0-201-02988-X. Exercise 9.9, p.230.
  4. Hopcroft, John E; Ullman, Jeffrey D (1969). Formal Languages and Their Relation to Automata. Addison Wesley. p. 12. ISBN 0-201-02983-9.
  5. Hopcroft, John E; Ullman, Jeffrey D (1979). Introduction to Automata Theory, Languages, and Computation. Addison Wesley. p. 103. ISBN 0-201-02988-X.
  6. Hopcroft, John E; Motwani, Rajeev; Ullman, Jeffrey D (2007). Introduction to Automata Theory, Languages, and Computation. Pearson Education, Inc. pp. 181–182. ISBN 0-201-02983-9.