Theory of Formal Languages, Automata, and Computation/Applications of Language Classes

Context Free Languages, Parsing, Lexical Analysis, and Translation

Recursive Descent Parsing and Translation

 
Three grammars representing the same language of arithmetic expressions. (a) is an ambiguous grammar. (b) is an unambiguous, operator precedence grammar, but it is also left recursive, a problem for recursive descent parsing. (c) is an equivalent right recursive grammar and suits the parsing application well.

CFGs have been used for some time in defining the syntax of programming languages, starting with Algol. Ideally, we can directly translate a grammar into mutually recursive procedures for parsing a computer program, and ultimately for compiling it. Consider the grammars of Figure ExpressionGrammars. The first of these grammars for arithmetic expressions is simple, yet ambiguous, since id + id * id (and other strings) can be generated by two (or more) distinct leftmost derivations, or distinct parse trees. So it is unsuitable as the basis for automated parsing. The second grammar is not ambiguous, having enforced operator precedence rules to ensure desirable, single parse trees for every string. This is as far as we got when introducing this CFL of arithmetic expressions. As it turns out, grammar (b) presents another problem for the recursive procedures we have in mind. This grammar is left recursive. When this grammar is translated into mutually recursive procedures, we have a function such as Expression (i.e., E) potentially calling itself, as indicated in E → E + T, with no change in the argument that is passed. This can lead to infinite recursion.
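
Figure ExpressionGrammars is not reproduced in this text; a plausible rendering of the three grammars, consistent with the description above (the figure's exact productions may differ), is:

  (a)  E → E + E | E * E | ( E ) | id

  (b)  E → E + T | T
       T → T * F | F
       F → ( E ) | id

  (c)  E → T + E | T
       T → F * T | F
       F → ( E ) | id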

 
Figure AlgolGrammar: Simplified Algol 60 grammar as given by chatGPT with some minor revisions by the author. The bold red shows the arithmetic expression portion of the grammar in right recursive form, and the bold blue shows those parts of the grammar that correspond to the lexical analyzer, in close-to-regular-grammar form. The notation varies slightly and straightforwardly from that used in the remainder of the textbook.
 
Figure AlgolTranslator: A recursive descent parser and translator for a small part of Algol-60.

Figure AlgolGrammar shows a simplified CFG for the Algol-60 programming language. Algol-60, as in the year 1960, was the first programming language with syntax defined by a context free grammar. The variables in the grammar translate rather directly into mutually recursive functions in a recursive descent parser. Figure AlgolTranslator shows the parsing routines corresponding to the grammar variables of expression, term, and factor. From parse_expression(), a term is what is first expected, and so parse_term() is called to identify a portion of code that was generated by the term part of the grammar. parse_term() then calls parse_factor() to identify the Algol code that was derived from the factor variable of the grammar.
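
Since the code of Figure AlgolTranslator is not reproduced here, the following is a minimal sketch, in Python, of the kind of mutually recursive routines the figure describes for expressions, terms, and factors. The function names (parse_expression, parse_term, parse_factor, match) follow the discussion; the emit() helper and the token representation are assumptions made for this sketch, and the actual code in the figure may differ.

  # Mutually recursive parsing routines for the right recursive expression
  # grammar: E -> T | T + E,  T -> F | F * T,  F -> ( E ) | constant | id.
  tokens = []      # (kind, value) pairs produced by the lexical analyzer
  pos = 0          # index of the current token
  code = []        # generated stack-machine instructions

  def current():
      return tokens[pos] if pos < len(tokens) else ('EOF', None)

  def match(expected_kind):
      """Consume the current token if it has the expected kind."""
      global pos
      kind, value = current()
      if kind != expected_kind:
          raise SyntaxError(f"expected {expected_kind}, found {kind}")
      pos += 1
      return value

  def emit(instruction):
      code.append(instruction)

  def parse_expression():              # E -> T | T + E
      parse_term()
      if current()[0] == 'PLUS':
          match('PLUS')
          parse_expression()
          emit('ADD')

  def parse_term():                    # T -> F | F * T
      parse_factor()
      if current()[0] == 'TIMES':
          match('TIMES')
          parse_term()
          emit('MULTIPLY')

  def parse_factor():                  # F -> ( E ) | constant | id
      kind, value = current()
      if kind == 'LPAREN':
          match('LPAREN'); parse_expression(); match('RPAREN')
      elif kind == 'CONST':
          match('CONST'); emit(f'LOAD_CONST {value}')
      elif kind == 'ID':
          match('ID'); emit(f'LOAD_VAR {value}')
      else:
          raise SyntaxError(f"unexpected token {kind}")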

In parse_factor(), actual symbols in the program are scanned by the part of the parser called the lexical analyzer, which is concerned with identifying basic tokens, such as constants, variables, operators, and punctuation in the Algol code. In the sample translator code, the match() function is highlighted in bold blue, and this function contains a call to the next_token() function, which is essentially the lexical analyzer and performs the recognition procedures associated with the parts of the grammar in Figure AlgolGrammar that are highlighted in bold blue. Those productions, in bold blue, correspond to a regular grammar (almost), and you should confirm that productions not conforming exactly to the regular grammar constraints of A → aB or A → a can be easily translated to productions that do adhere to the constraints (though it's a bit unwieldy to do so because of the space it would require). Though the next_token() code is not shown in this illustration, it is an important part of a parser implementation, since the lexical analyzer is what actually scans the input string/program.
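
Since next_token() is not shown, the tokenizer below is a hypothetical sketch of what such a lexical analyzer might look like for the token kinds assumed in the parser sketch above; it produces the token list that the parsing routines consume.

  # A regular-expression-based lexical analyzer: each token kind corresponds
  # to a (close to) regular-grammar fragment of Figure AlgolGrammar.
  import re

  TOKEN_SPEC = [
      ('CONST',  r'\d+(?:\.\d+)?'),          # numeric constants such as 1.2 or 3
      ('ID',     r'[A-Za-z][A-Za-z0-9]*'),   # identifiers such as Var21
      ('PLUS',   r'\+'),
      ('TIMES',  r'\*'),
      ('LPAREN', r'\('),
      ('RPAREN', r'\)'),
      ('SKIP',   r'\s+'),                    # whitespace is skipped
  ]
  MASTER = re.compile('|'.join(f'(?P<{name}>{pattern})' for name, pattern in TOKEN_SPEC))

  def tokenize(text):
      """Scan the input string and return the list of (kind, value) tokens."""
      result = []
      for m in MASTER.finditer(text):
          if m.lastgroup != 'SKIP':
              result.append((m.lastgroup, m.group()))
      return result

  # tokenize('1.2 + Var21 * 3') yields
  # [('CONST', '1.2'), ('PLUS', '+'), ('ID', 'Var21'), ('TIMES', '*'), ('CONST', '3')]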

In Figure AlgolTranslator, the generation of machine code is shown in bold purple. The machine code language assumes that arguments are pushed (loaded) onto a stack, and binary operations such as addition and multiplication pop the top two items on the stack, apply the operation, and push the result back on the stack. So in the illustration of Figure AlgolTranslator, the machine code translation of the Algol expression '1.2 + Var21 * 3' reads as follows and would execute as indicated in the comments when run.

  1. LOAD_CONST 1.2 // push the constant 1.2 onto the runtime stack
  2. LOAD_VAR Var21 // push the value of Var21
  3. LOAD_CONST 3 // push the constant 3
  4. MULTIPLY // pop 3 and the value of Var21, compute 3 * value of Var21, and push the result, P, onto the runtime stack
  5. ADD // pop P and 1.2, compute P + 1.2, and push the result, R, onto the runtime stack
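
To make the execution narrative concrete, here is a small sketch of an interpreter for this assumed stack-machine code; the environment dictionary stands in for the symbol table and memory, and is purely illustrative.

  def run(program, env):
      stack = []
      for line in program:
          op, *arg = line.split(maxsplit=1)
          if op == 'LOAD_CONST':
              stack.append(float(arg[0]))     # push a constant
          elif op == 'LOAD_VAR':
              stack.append(env[arg[0]])       # push the variable's value
          elif op == 'MULTIPLY':
              b, a = stack.pop(), stack.pop()
              stack.append(a * b)             # push the product
          elif op == 'ADD':
              b, a = stack.pop(), stack.pop()
              stack.append(a + b)             # push the sum
      return stack[-1]

  # With Var21 bound to 10.0, the five instructions above evaluate
  # 1.2 + Var21 * 3 and return 31.2:
  # run(['LOAD_CONST 1.2', 'LOAD_VAR Var21', 'LOAD_CONST 3', 'MULTIPLY', 'ADD'],
  #     {'Var21': 10.0})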

If I had asked chatGPT to show more code for the parser, then it would have included parser code for other grammar variables, such as 'assignment statement', and additional code generation commands, such as STORE_VAR for storing the top of the runtime stack in a relative memory address associated with an Algol program variable name. There are other important parts of a translator, such as the symbol table for associating Algol program identifiers with computer memory (relative) addresses, that are not tied to any underlying grammar.

In sum, a recursive descent parser and translator follows elegantly from the specification of a programming language grammar. The recursive routines, as with any function calls, will push records onto the parser's runtime stack (not to be confused with the runtime stack that is assumed by the assembly code above), and this reliance on a stack by the parser makes it comparable to a PDA.

Relationships to Artificial Intelligence

There are many informal and formal connections between AI and formal languages, automata, and computation. This chapter delves into these connections, as well as the requisite and optional AI material. Notes embedded in the main text indicate how I often present the material over a semester. While this material is at the end of the textbook, it is typically scattered in my lectures throughout the semester.

Probabilistic Grammars and Automata

 
Figure ProbabilisticMechanisms: The probability of a derivation (and string) can be computed as the product of the transition probabilities used in the derivation. For example, S → A with probability 0.5 and A → a with probability 0.8. Thus, S derives a with probability 0.5 * 0.8 = 0.4.

In AI, non-determinism is prevalent, but cannot always be removed entirely while guaranteeing the same results because (a) operations/productions/transitions are uncertain, and (b) the world/IDs are not typically complete or correct (i.e., the AI doesn’t know some things, or it is wrong about them).

Measures of uncertainty are used in AI, most notably probabilities or fuzzy membership.

Probabilistic grammars are a type of formal grammar where each production rule has a probability associated with it.

These probabilities represent the likelihood that a particular rule will be used to generate a string.
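
As a concrete illustration of the product rule described in Figure ProbabilisticMechanisms, the sketch below computes the probability of a derivation in a small, made-up probabilistic grammar.

  # Each variable maps to a list of (right-hand side, probability) pairs;
  # the probabilities for each variable sum to 1.
  pcfg = {
      'S': [(('A',), 0.5), (('B',), 0.5)],
      'A': [(('a',), 0.8), (('a', 'A'), 0.2)],
      'B': [(('b',), 1.0)],
  }

  def derivation_probability(productions_used):
      """Multiply the probabilities of the productions used in a derivation."""
      p = 1.0
      for head, body in productions_used:
          for rhs, prob in pcfg[head]:
              if rhs == body:
                  p *= prob
                  break
      return p

  # The derivation S => A => a uses S -> A (0.5) and A -> a (0.8),
  # so its probability is 0.5 * 0.8 = 0.4, as in the figure.
  print(derivation_probability([('S', ('A',)), ('A', ('a',))]))   # 0.4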

Natural Language Processing: Probabilistic grammars can be used to model the probability distribution of sentences in a natural language, making them useful for tasks such as speech recognition and machine translation.

Information Retrieval: Probabilistic grammars can be used to generate queries that match documents in a database, allowing for more accurate search results.

DNA Sequencing: Probabilistic grammars can be used to model the probability distribution of DNA sequences, making them useful for tasks such as genome assembly and alignment.

NFA-to-DFA Translation and Speedup Learning

The first point of inspiration is in the elimination of nondeterminism when translating an NFA to a DFA (or a nondeterministic PDA to a deterministic PDA, if the latter exists). I’ve said that NFAs and nondeterminism generally can lead to simpler and more elegant solutions, and be much easier to specify by a human, but these automata will require enumeration (depth first or breadth first or some heuristic approach) when they are actually used (to see whether a string reaches an accepting configuration along at least one path of the machine), and that is computationally costly. So ideally, we can imagine that in the case of a complex problem, someone can specify an NFA solution, but then use automated translation to acquire a more efficient-to-execute deterministic recognizer. The same can be said about AI and machine learning generally. I regard it as definitional of AI that an AI system explore alternatives, requiring enumeration or search – that is nondeterminism. Machine learning can then be used to eliminate nondeterminism, and thus “speed up” problem solving, planning, and other processes by the AI.
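
The following is a minimal sketch of the subset construction that performs this NFA-to-DFA translation; the NFA is a toy example (strings over {0,1} that end in 01), and epsilon-moves are omitted for brevity.

  from itertools import chain

  nfa = {                      # state -> {symbol: set of next states}
      'q0': {'0': {'q0', 'q1'}, '1': {'q0'}},
      'q1': {'1': {'q2'}},
      'q2': {},
  }
  start, accepting, alphabet = 'q0', {'q2'}, ['0', '1']

  def subset_construction():
      dfa = {}                             # frozenset of NFA states -> {symbol: frozenset}
      worklist = [frozenset({start})]
      while worklist:
          current = worklist.pop()
          if current in dfa:
              continue
          dfa[current] = {}
          for a in alphabet:
              # the DFA successor on symbol a is the union of the NFA moves on a
              nxt = frozenset(chain.from_iterable(nfa[q].get(a, set()) for q in current))
              dfa[current][a] = nxt
              if nxt not in dfa:
                  worklist.append(nxt)
      dfa_accepting = {S for S in dfa if S & accepting}
      return dfa, dfa_accepting

  # Each DFA state is a set of NFA states; the NFA's choice points are folded
  # away, at the cost of potentially exponentially many DFA states.
  dfa, dfa_accepting = subset_construction()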

It’s rarely, if ever, the case that all sources of nondeterminism can be eliminated in an AI program, however. The reasons are several. One, operators or “transitions” in AI often do not have perfectly predictable outcomes. A robot may try to pick up a cup, for example, and the cup slips from the robot’s grasp – there may always be at least two outcomes, I’ve got the cup or I don’t, and probably more outcomes if there can be liquid in the cup or not, hot versus cold liquid, etc. While some outcomes may be rare, we still want the robot to be able to respond to them – this possibility of multiple outcomes is the clear manifestation of nondeterminism. Another source of nondeterminism that is related to the first is that the AI may not know with certainty about all relevant aspects of the environment. For example, an intelligent vehicle may not know that a pedestrian is skateboarding behind a building but is about to abruptly enter traffic, and the AI had better anticipate that possibility, as well as mundane circumstances. Nondeterminism again.

Even though it’s unlikely that machine learning can get rid of all nondeterminism, it can “reduce” the nondeterminism. In a real AI application we might measure the extent to which the nondeterminism is reduced by the expected number of states that an AI system has to enumerate, both before learning and after learning, and while learning. The expected number of states generated during an enumeration or search is the sum of the states generated across all possible problems, weighted by the probability of each problem. So an expected number is a weighted average.
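
For example, with a made-up distribution over three problems, the expected number of enumerated states before and after learning is computed as the probability-weighted sum:

  problems = [          # (probability, states enumerated before learning, states after)
      (0.7, 50, 12),    # a common, easy problem
      (0.2, 400, 60),   # a less common, harder problem
      (0.1, 2000, 250), # a rare, very hard problem
  ]
  expected_before = sum(p * before for p, before, _ in problems)   # 35 + 80 + 200 = 315.0
  expected_after  = sum(p * after for p, _, after in problems)     # 8.4 + 12 + 25 = 45.4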

DFA-to-RE Translation and Macro Learning

The second learning task that I want to relate is called macro learning, which is used in AI to learn to solve problems more effectively. In particular, if an AI observes that transitions are repeatedly sequenced together, or it discovers that certain transitions that are sequenced together lead to solutions, then the AI might form a macro operator (or macro transition) that concatenates the transitions together so that they can be applied as a single packet, or macro. This process of macro formation reduces the nondeterminism found in enumeration by taking large steps along individual paths or derivations, but it also adds to nondeterminism by creating additional choice points in an enumeration or search for a string’s derivation, since the macro is added to, rather than replacing, the operators (or transitions) that are used in its composition. The management of these tradeoffs in macro learning led to a good deal of research on learning choice preferences or biases as well, involving probabilities that are placed on choices. I don’t get into probabilities here, but the slides do touch on their use in probabilistic grammars, which are related to AI approaches to solving problems or, analogously, to deriving strings.

Machine Learning, No Free Lunch, Undecidability, Intractability

AI theorem proving

We discussed

AI Planning

While the implication operator in logic theorem proving looks the same as the production symbol, →, theorem proving is additive. When making an inference, nothing in the current world model is overwritten.

In contrast, in AI planning a “mental” model of the current world state is maintained and changed through the application of operators (or productions). A common representation of operators shows PREconditions, which have to be true before the operator can be applied, and EFFects, which are the conditions that are true after the operator is applied. For our purposes we can think of an operator as a production, as in a grammar, PRE → EFF, where effects rewrite preconditions. There is more nuance to it than that, but the details are left to an AI course.
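
As a concrete sketch of the PRE → EFF view, the following applies a standard blocks-world pickup operator to a state by deleting and adding conditions; the exact operator representation used in the slides may differ.

  # A STRIPS-style operator: applicable when its preconditions are in the
  # current state; applying it deletes some conditions and adds others.
  pickup_A = {
      'pre':    {'clear(A)', 'ontable(A)', 'handempty'},
      'delete': {'clear(A)', 'ontable(A)', 'handempty'},
      'add':    {'holding(A)'},
  }

  def apply_operator(state, op):
      """Rewrite the state: check preconditions, then delete and add effects."""
      if not op['pre'] <= state:
          return None                       # operator not applicable
      return (state - op['delete']) | op['add']

  state = {'clear(A)', 'ontable(A)', 'handempty', 'ontable(B)', 'clear(B)'}
  print(apply_operator(state, pickup_A))
  # {'holding(A)', 'ontable(B)', 'clear(B)'} (set order may vary)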

The slides show two examples of the derivations of single plans, analogous to a single derivation of a string, but as with languages, there is a search for the derivation (plan), which requires an enumeration of many paths, most of which are unsuccessful derivations (and that’s ok – we are looking for just one plan/derivation).

Generating a plan can be modeled as generating a string with a grammar. The operators can be regarded as skeletal productions, that is, productions with variables, as illustrated in the blocks-world operators of the slides – Pickup(?X), Putdown(?X), Unstack(?X, ?Y), Stack(?X, ?Y). The last slides of Friday’s lecture give a brief glimpse at this equivalence, though AI planning will generally operate with additional pattern matching capabilities beyond those we have seen with respect to strings. The equivalence also makes reference to ridiculous computational (storage and runtime) requirements in the case where we are interpreting AI states as strings and AI operators as productions, but computational cost is not an issue we are concerned with at this point, and similar equivalence arguments that are not concerned with costs are made by Hopcroft, Motwani, and Ullman, 3rd Edition (2007), when comparing Turing Machines and computers (e.g., breakout boxes on pp. 322, 346, 364).

The grammars that are equivalent to AI planning problems would typically be context sensitive, but a general or universal AI planning system is unrestricted. I don’t show this, but have a chat with chatGPT to learn more. Here are some handy references too, not just to AI planning but to other material at the intersection of CS 3252 and AI, notably forms of machine learning that have fallen out of favor but are still very interesting – these references are optional material.

Solomonoff, R. J. (1964). “A Formal Theory of Inductive Inference, Part 1”, Information and Control, Vol. 7.

Biermann, A.W. (1972). “On the Inference of Turing machines from sample computations”, Artificial Intelligence, Vol 3.

Angluin, D. (1980). “Inductive Inference of Formal Languages from Positive Data”, Information and Control, Vol. 45.

Moore, R.C. (1984). “A Formal Theory of Knowledge and Action” Technical Report 320, SRI.

Tate, Austin. AI Planning MOOC.

Also, plenty of other references online (search for ‘AI Planning’) or take the AI course.

Turing Equivalency of Selected Computational Platforms

Author: Kyle Moore

One of the foundational theorems underlying Automata theory and Computer Science as a whole is the Church-Turing thesis. This thesis, in part, posits that every function that is capable of being computed can be computed on some Turing Machine. A universal TM, being capable of simulating any arbitrary TM, is thus capable of computing any and all computable functions. Note that this does not mean that a universal TM is capable of computing all conceivable functions, as there are functions that are not computable (we’ve already discussed one previously, in fact: the halting problem). An important corollary that falls out of this is that any system that can simulate a universal TM is transitively also capable of computing any computable function. We call such a system Turing Complete (TC). It is worth noting that a TC system may be more powerful than a TM. If we can show that a TC system can also be simulated by a TM, we can say that it is Turing Equivalent (TE), meaning that it has exactly the same computational power as a TM. In this section, we will survey a small portion of known TC and TE systems. Each system is discussed only briefly, but links to more information are provided for interested readers.

Important TC systems

The rest of this section will introduce a variety of TC and TE systems, many of them surprisingly so, both to outside observers and to the creators themselves. To help with this, though, we will first look briefly at some of the most fundamental TE systems.

The first is the Von Neumann architecture, designed in 1945 by John von Neumann. This architecture is the underlying design for most modern computer systems. In short, this architecture is composed of an expandable memory unit connected to a general-purpose processing unit that can handle arithmetic, logic, and control processes. A similar architecture that has also heavily influenced the design of the modern computer is the register machine, which relies on addressable memory rather than a sequential tape. By some formulations, the Von Neumann architecture is a specific case of the register machine model.

Lambda Calculus is a system of mathematical logic built around the abstraction and application of functions. A computationally universal formulation was developed in the 1930s by Alonzo Church. Church and Turing famously argued that their computational models were equivalent and computationally universal as a foundational aspect of the Church-Turing Thesis.
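
As a small taste of computing with nothing but abstraction and application, the sketch below encodes Church numerals using Python lambdas; this is an illustration in Python syntax, not Church's original notation.

  # A Church numeral encodes the number n as "apply f to x, n times".
  zero = lambda f: lambda x: x
  succ = lambda n: lambda f: lambda x: f(n(f)(x))
  plus = lambda m: lambda n: lambda f: lambda x: m(f)(n(f)(x))

  def to_int(church):
      """Decode a Church numeral by counting applications of f."""
      return church(lambda k: k + 1)(0)

  two = succ(succ(zero))
  three = succ(two)
  print(to_int(plus(two)(three)))   # 5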

Programming Languages

The vast majority of programming languages are modeled around one or more of the above-mentioned systems. Most notably, the class of imperative programming languages is designed to utilize the Von Neumann and Register Machine architectures. Functional programming languages, on the other hand, are largely a direct implementation of some variety of lambda calculus.

A specific subset of programming languages of note here are the esoteric programming languages. These languages are very often designed to be difficult to use or use unusual encoding methods. While these languages are, usually intentionally, nonviable for general use, they provide a valuable look into how simple a system can be while maintaining Turing Completeness. A few noteworthy examples are provided here.

Probably the most well-known esoteric language is BrainFuck (often shortened to BF). The language consists of exactly 8 commands, each designated by a single character. These commands control the position of a pointer into a 1-dimensional memory and allow altering the value of the cell that is being pointed at, effectively encoding a TM.
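
The sketch below is a minimal BF interpreter illustrating how the eight commands move a pointer over a tape and modify the cell under it, much like a TM head; the sample program is a toy that prints the character '0'.

  def run_bf(program, input_bytes=b''):
      tape, ptr, pc, out, inp = [0] * 30000, 0, 0, [], list(input_bytes)
      # precompute matching bracket positions for [ and ]
      stack, match = [], {}
      for i, c in enumerate(program):
          if c == '[':
              stack.append(i)
          elif c == ']':
              j = stack.pop()
              match[i], match[j] = j, i
      while pc < len(program):
          c = program[pc]
          if   c == '>': ptr += 1                              # move pointer right
          elif c == '<': ptr -= 1                              # move pointer left
          elif c == '+': tape[ptr] = (tape[ptr] + 1) % 256     # increment cell
          elif c == '-': tape[ptr] = (tape[ptr] - 1) % 256     # decrement cell
          elif c == '.': out.append(chr(tape[ptr]))            # output cell
          elif c == ',': tape[ptr] = inp.pop(0) if inp else 0  # read one input byte
          elif c == '[' and tape[ptr] == 0: pc = match[pc]     # jump forward past ]
          elif c == ']' and tape[ptr] != 0: pc = match[pc]     # jump back to [
          pc += 1
      return ''.join(out)

  # run_bf('++++++[>++++++++<-]>.') returns '0' (6 * 8 = 48, the code for '0').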

In addition to minimizing available commands, some esoteric languages experiment with unusual encoding modalities. Two notable examples are Piet[1] and Choon[2]. Both languages are similarly direct implementations of a TM architecture, but instead of being written in text, Piet is “written” using a bitmap image, where pointer position and data manipulation are defined by changes in hue and luminance. Choon instead uses MIDI sound files as source code, encoding commands via sonic frequency.

Interesting TC systems

In this section we will briefly introduce a variety of systems that were created as interesting procedures that generate complex output from a simple set of rules. These systems were generally not designed to be Turing Complete, but were found later to be so.

First is a variety of systems known as Tag systems. These are machines that are defined by a starting string and a set of production rules. At each iteration, the first character in the string determines, via the production rules, what substring is to be added to the end of the current string, and some number, m, of characters are removed from the beginning of the current string. It has been shown that there exist Tag systems that are capable of simulating a TM for all m greater than or equal to 2.
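
A minimal sketch of an m-tag system simulator appears below. The example rules (a → bc, b → a, c → aaa, with m = 2) form a well-known 2-tag system whose runs, started from a string of n a's, mirror the Collatz iteration on n.

  def run_tag_system(word, rules, m=2, max_steps=50):
      history = [word]
      for _ in range(max_steps):
          if len(word) < m or word[0] not in rules:
              break                                  # halt
          # append the production for the first symbol, then delete m symbols
          word = word[m:] + rules[word[0]]
          history.append(word)
      return history

  rules = {'a': 'bc', 'b': 'a', 'c': 'aaa'}
  print(run_tag_system('aaa', rules, max_steps=4))
  # ['aaa', 'abc', 'cbc', 'caaa', 'aaaaa']  -- from 3 a's to 5 a's, as 3 -> (3*3+1)/2 = 5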

 
Animation showing the transition procedure from one state to the next in Rule 110.

A particularly noteworthy class of systems are the cellular automata. These are a broad class of machines defined by an infinite grid (of varying dimensions) where every cell has a Boolean value. At each iteration of the automaton, a simple ruleset determines which cells will be activated in the next iteration. While not all cellular automata are Turing Complete, there are at least 3 noteworthy examples that are. Probably the most influential is Rule 110, the most well-known elementary cellular automaton. In elementary cellular automata the current state is a 1-dimensional Boolean string. In the subsequent state, the next value of cell x is determined by the values of cells x, x-1, and x+1. This effectively means that the machine is defined using only eight transition rules. The simplicity of Rule 110 makes it a common target for proving Turing Completeness. Rule 110 itself was shown to be TC by showing that it can simulate a TC Tag system.
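
The sketch below implements one step of an elementary cellular automaton; the eight transition rules are read off the bits of the rule number, here 110 (binary 01101110).

  RULE = 110

  def step(cells):
      """Compute the next row from the current one (boundary cells stay 0)."""
      nxt = [0] * len(cells)
      for x in range(1, len(cells) - 1):
          neighborhood = (cells[x - 1] << 2) | (cells[x] << 1) | cells[x + 1]
          nxt[x] = (RULE >> neighborhood) & 1     # look up the rule's bit
      return nxt

  row = [0] * 20 + [1] + [0] * 20                 # start from a single live cell
  for _ in range(10):
      print(''.join('#' if c else '.' for c in row))
      row = step(row)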

Other noteworthy cellular automata are Conway’s Game of Life and Langton’s Ant. The Game of Life is a 2-dimensional cellular automaton where the next state of a given cell is determined by the number of active cells surrounding it. Traditionally, a cell goes from inactive to active only when it is surrounded by exactly three active cells, and becomes inactive whenever it is surrounded by any number of active cells that is less than two or greater than three. This machine has been shown to be able to simulate both a register machine and a universal TM. Langton’s Ant is a lesser-known cellular automaton in which an “ant” travels around a 2-dimensional grid. At each iteration, the ant turns either left or right, depending on whether the cell it is in is active, flips the value of that cell, and moves one cell forward. This system has exactly two rules, one for each cell value, but has been shown to be able to simulate any Boolean circuit, making it Turing Complete.
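
A comparable sketch of one Game of Life step on a finite grid (cells beyond the edge are treated as inactive) is shown below.

  def life_step(grid):
      rows, cols = len(grid), len(grid[0])
      def live_neighbors(r, c):
          return sum(grid[r + dr][c + dc]
                     for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                     if (dr or dc) and 0 <= r + dr < rows and 0 <= c + dc < cols)
      nxt = [[0] * cols for _ in range(rows)]
      for r in range(rows):
          for c in range(cols):
              n = live_neighbors(r, c)
              # birth on exactly three neighbors; survival on two or three
              nxt[r][c] = 1 if n == 3 or (grid[r][c] and n == 2) else 0
      return nxt

  blinker = [[0, 0, 0], [1, 1, 1], [0, 0, 0]]     # oscillates with period 2
  print(life_step(blinker))                       # [[0, 1, 0], [0, 1, 0], [0, 1, 0]]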

This brings us to Boolean circuits, which are probably the most familiar simple TC system to most readers. Boolean circuits are systems of AND, OR, and NOT logic gates. These are the physical implementation of modern computers and are thus TC by virtue of implementing the Von Neumann and Register Machine architectures. A noteworthy feature of Boolean circuits is that two gates, NOR (NOT applied to the output of OR) and NAND (NOT AND), can each be used exclusively to simulate all other types of logic gates. This means that showing a system can simulate either NOR or NAND gates, and that those gates can be linked together, is enough to show that the system is, in principle, Turing Complete.
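
The sketch below illustrates the point for NAND: the remaining gates can each be written as compositions of NAND alone.

  def NAND(a, b):
      return int(not (a and b))

  def NOT(a):     return NAND(a, a)
  def AND(a, b):  return NOT(NAND(a, b))
  def OR(a, b):   return NAND(NOT(a), NOT(b))

  # Truth-table check for OR:
  # [(a, b, OR(a, b)) for a in (0, 1) for b in (0, 1)]
  # -> [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 1)]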

The Games (and Other Software) Begin

Computer scientists and engineers have a history of challenging themselves to show that the software and hardware created by others can be adapted to other applications for the simple fun of it (giving rise to the common internet joke "Can it run DOOM?"). This has led to the discovery of numerous games and software that are themselves Turing Complete. In some cases this is intentional on the part of the developer. A prime example of this is Minecraft, which allows players to use an item called Redstone to create Boolean circuits. In most cases, however, Turing Completeness was unintentional and discovered by dedicated users. I’ll include some noteworthy examples below.

A popular city builder game, Cities: Skylines[3], was shown to be TC by way of linking in-game electric and sewage systems creatively to simulate connected NAND gates. DOOM[4] has similarly been shown to be able to simulate Boolean circuits when creating custom levels (simultaneously raising, and in principle affirmatively answering, the question: can DOOM run DOOM?). Outside of video games, Turing Completeness has been shown for Microsoft PowerPoint[5] (note that the reference here is a joke journal, but the work described in the linked article seems legitimate), Microsoft Excel[6], and neural networks[7]. Some amazing examples can even be found in the C printf function[8] and the Notepad++[9] find-and-replace function. Going back to games, but leaving the world of software, it has even been shown that a tournament-legal deck of Magic: The Gathering[10] cards can simulate a universal TM. Even simulations of human heart cells[11] have been shown to be configurable into a working Boolean circuit.

Does this matter?

 
XKCD 2556

While it can be interesting to investigate Turing Completeness of a system for its own sake, it is important to consider why the answer to that question matters from a practical standpoint. While there may be other reasons that are not discussed here, we identify two reasons why we would want to know that a system is TC: provability and security. Provability, the ability to prove definitively that a system performs as expected in all scenarios, is an essential consideration in safety-critical systems. Unfortunately, for a TC system it may be impossible to fully prove compliance with expected behavior, as a consequence of the halting problem. This is largely because, given specific inputs, the system can be induced to perform any computation. Notably, this can, in principle, be done purposefully to circumvent controls placed on the user. Malicious actors can use these “arbitrary execution exploits” to insert and execute malware.

Minimum Turing Completeness

The simplicity of many of the systems introduced here raises another interesting question for which this work has no sure answer: What are the minimum features needed for a system to be Turing Complete? A candidate set proposed here is unbounded memory (TM: infinite tape), conditional branching (TM: transition table), and unbounded recursion/iteration. All the systems introduced here have these three properties, and it seems clear that they are necessary features. They are clearly not sufficient, however, as many cellular automata also have these features while being known not to be TC because of the choice of transition rules (e.g., Rule 204). The question of what features the transition rules need for computational completeness is left to the reader to consider.

Author: Kyle Moore

Large Language Models

Author: Jesse Roberts

Large Language Models (LLMs) are a class of autoregressive models that generate tokens (word segments) based on a given context, with each generated token appended to that context. By continuously appending to the context and generating the next token, an LLM is able to interact via text simply by predicting the next token given the tokens so far.
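
The loop below is a minimal sketch of this autoregressive interaction; the next_token() function is a trivial placeholder standing in for a trained decoder-only transformer.

  def next_token(context):
      """Placeholder for a trained model that predicts the next token."""
      return 'world' if context[-1] == 'hello' else '<eos>'

  def generate(context, max_new_tokens=10):
      for _ in range(max_new_tokens):
          token = next_token(context)
          if token == '<eos>':
              break
          context = context + [token]      # the new token becomes part of the context
      return context

  print(generate(['hello']))               # ['hello', 'world']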

The technology that empowered early LLMs like GPT is called a decoder-only transformer model. Transformers are a special class of artificial neural network based on an encoder-decoder architecture with inner-product-based attention mechanisms that permit the network to select features based on the content of the context. The decoder-only variant lacks the encoder portion of the network.

It is an interesting finding that the vanilla transformer model is as computationally powerful as a Turing machine, and therefore Turing complete. This was proved by showing that a vanilla transformer could simulate an arbitrary recurrent neural network (RNN), which had itself been shown to be as powerful as a Turing machine. Later, it was proved that the decoder-only variant of the transformer was Turing complete as well, by showing that it could likewise simulate an arbitrary RNN.

Since LLMs are based on decoder-only transformer models, the above discussion would seem to suggest that LLMs are capable of being Turing equivalent. However, this is not the case, as the models that are trained do not learn to act as a Turing machine. Further, learning to be Turing equivalent is not possible for an LLM, given that it must also learn to predict the next token of human-consistent language. This can be proved by construction. If a language model is asked to recognize a language which requires more space than is present in the latent representation of the network, then it must output the intermediate state as a token and continue to execute over that intermediate representation via recursion through the context. And by doing this in-context recursion, it violates the language modeling directive to output human-consistent text.

This problem demonstrates that an optimization function may not be able to optimize over all objectives if they are incompatible. The language modeling task is incompatible by nature with in-context recursion, which is necessary for the model to be Turing equivalent. Therefore, for a decoder-only LLM, it is impossible to be both Turing complete and pass a Turing test.

Author: Jesse Roberts

References

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. "Attention is all you need," Advances in Neural Information Processing Systems, vol. 30, 2017.

Bhattamishra, S., Patel, A., and Goyal, N. "On the computational power of transformers and its implications in sequence modeling," arXiv preprint arXiv:2006.09286, 2020.

Siegelmann, H. T. and Sontag, E. D. "On the computational power of neural nets," in Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pp. 440–449, 1992.

Roberts, J. How powerful are decoder-only transformer neural models? arXiv preprint arXiv:2305.17026, 2023.

Exercises, Projects, and Discussions

Project 1: Build a compiler for a non-trivial subset of a programming language of choice.

  1. https://www.dangermouse.net/esoteric/piet.html
  2. https://sange.fi/esoteric/essie2/download/choon/choon.html
  3. Bali, Daniel (2019-07-15). "Cities: Skylines is Turing Complete". Medium. Archived from the original on 2019-07-15.
  4. Spencer, Danny (2022-12-20). "Can Doom Run It? An Adding Machine in Doom". Archived from the original on 2022-12-22.
  5. Wildenhain, Tom (2017). "On the Turing completeness of MS PowerPoint" (PDF). SIGBOVIK: 102–106.
  6. Branscombe, Mary (2020-12-11). "Microsoft: Turning Excel into a Turing-complete programming language". TechRepublic. Archived from the original on 2024-05-29.
  7. Siegelmann, Hava; Sontag, Eduardo (1992-07-01). "On the computational power of neural nets". Proceedings of the fifth annual workshop on Computational learning theory: 440–449. doi:10.1145/130385.130432.
  8. Carlini, Nicolas; Barresi, Antonio; Payer, Mathias; Wagner, David; Gross, Thomas R. (2016-08-12). "Control-flow bending: on the effectiveness of control-flow integrity". Proceedings of the 24th USENIX Conference on Security Symposium: 161–176.
  9. 0xdanelia (2023-03-30). "regex_turing_machine". GitHub. Retrieved 2024-09-30.
  10. Churchill, Alex; Biderman, Stella; Herrick, Austin (2019-04-23). "Magic: The Gathering is Turing Complete". arXiv. doi:10.48550/arXiv.1904.09828.
  11. Scarle, Simon. "Implications of the Turing completeness of reaction-diffusion models, informed by GPGPU simulations on an XBox 360: Cardiac arrhythmias, re-entry and the Halting problem". Computational Biology and Chemistry. 33 (4): 253–260. doi:10.1016/j.compbiolchem.2009.05.001.