Programming Language Concepts Using C and C++/Language Processors

In this chapter, we will study different kinds of language processors, using a graphical notation called tombstone diagrams along the way. Extensive use of this notation, however, should not fool you into believing that the diagrams have any meaning in themselves. Unless the notions they represent are clearly understood, throwing in random diagrams to create fancy figures has no value at all.

In our treatment of the subject, the introduction of the basic concepts, program and machine, will be followed by a review of the two language processing techniques, translation and interpretation. Building on these, we will explore more sophisticated ways of using language processors to construct more productive environments.

Basic Diagrams

Programs

From the computer scientist’s point of view, a program is a pattern of rules that is used to direct the evolution of a computational process.^[1] It is composed from symbolic expressions in a particular notation called a programming language.

A program named P written using language L

Machine

A machine is an automaton that, when given a program, can carry out tasks to ease the life of its beneficiaries. From the computer scientist’s point of view, the word machine is a synonym for computer or any equivalent mathematical model.

A machine named M

Running a Program

A program can be run on a [physical] machine only if it is expressed in the appropriate machine code.^[2]

Running a program named P on machine named M

Translators

A translator is a program that accepts any text expressed in one language (the translator’s source language), and generates a semantically equivalent text expressed in another language (its target language).

In particular, a compiler translates from a high-level language into a low-level language (not necessarily machine code); an assembler translates from an assembly language into the corresponding machine code; a decompiler translates a low-level language program into a program in a high-level language; a disassembler translates a machine code program into an assembly language program.

S-into-T translator expressed in language L

The head of the tombstone names the translator’s source language S and the target language T, separated by an arrow. The base of the tombstone names the translator’s implementation language L.

Translating a Program

Translating a source program P expressed in language S to an object program expressed in language T, using an S-into-T translator

The source and object programs are semantically equivalent.^[3] Giving the same name to the source and object programs is a widely used convention to emphasize this fact.

Cross-Compiler

A cross-compiler is a compiler that runs on one machine (the host machine) but generates code for a dissimilar machine (the target machine). The object program is generated on the host machine but must be transferred to the target machine to be run. Such a transfer is often called downloading.

Translating and running a program

Two-Stage Translator

A two-stage translator is a composition of two translators. Such a scheme comes in handy when you want to port a new language implementation to different platforms. All you need to do is translate from the new language into a widely available programming language and then translate the output of this translation into machine code. You now have an implementation of the new language on every platform that has a compiler for the widely available language.

Two-stage translation

More formally, semantic equivalence is an equivalence relation: it is reflexive, symmetric, and transitive. Transitivity lets us generalize this idea to multiple stages: an n-stage translator is a composition of n translators and involves n-1 intermediate languages.

Translating a Translator

A translator, be it a compiler, an assembler, or any other kind of translator, is just another piece of software. So, like other programs, it can itself be fed into a translator as input. (The only thing special about translators, if one may call it special, is that they take other programs as input.)

Translating a translator

Seeing the input and output diagrams as below should help you perceive translators as plain programs:

Translators as plain programs

Interpreters

In translation, the entire source program must be translated to its object (target) program equivalent before it can even start to run and produce results. This can be likened to translating a novel: it is translated in its entirety all at once.

An interpreter, on the other hand, is a program that accepts any program (the source program) expressed in a particular language (the source language), and runs that source program immediately. This approach is more like the way a simultaneous translator does her job: she translates as the speaker makes his statement; she doesn’t wait for him to finish his speech.

Similar to a microprogram’s fetch-decode-execute cycle, an interpreter works by fetching, analyzing, and executing the source program instructions, one at a time. Each source code instruction is first fetched, then analyzed (which includes mapping it to instructions of the physical machine), and finally executed by running the corresponding machine code instructions.

Interpretation makes sense when:

The instructions have simple formats and thus can be analyzed easily and efficiently. (Note the distinction between instructions and instruction formats.)
Instructions are complex; their execution takes so long that the time spent fetching and analyzing them becomes negligible.
Each instruction is executed only once (or at least not frequently).
The programmer is working in interactive mode, and wishes to see the results of each instruction before entering the next instruction.
The program is to be used once and then discarded, and therefore running speed is not very important.

The reasoning behind the first two items can be understood by analyzing the cost of executing a source code instruction on the physical machine.

$t=t_{F}+t_{A}+t_{E}$ , where $t_{E}=\sum _{i=1}^{n}\left(t_{f}^{i}+t_{d}^{i}+t_{e}^{i}\right)$ (cost of executing a source code instruction)

The total cost of running a source code instruction is the sum of the costs of its fetch ( $t_{F}$ ), analyze ( $t_{A}$ ), and execute ( $t_{E}$ ) phases. The execution cost is (roughly) the sum of the costs of executing the n corresponding machine language instructions, each of which goes through its own fetch, decode, and execute steps ( $t_{f}^{i}$ , $t_{d}^{i}$ , $t_{e}^{i}$ ). Note that there are multiple machine language instructions for each source code instruction, which is natural since the source language is a higher-level one.

When the source code instruction format is simple, the sum of $t_{F}$ and $t_{A}$ is small enough to be dwarfed by $t_{E}$ . That is, we will have

$t\approx t_{E}=\sum _{i=1}^{n}\left(t_{f}^{i}+t_{d}^{i}+t_{e}^{i}\right)$

Total cost will be almost equal to the cost of executing the corresponding machine code instructions. So, we don't lose much due to interpretation.

In the case of complex source code instructions, $t_{E}$ is large enough to dominate the running time of the instruction. So, we end up with the previous conclusion: $t\approx t_{E}$ .
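A small worked example may help to see when the approximation holds. The cycle counts are made-up numbers chosen for illustration, not measurements of any real interpreter. Suppose fetching and analyzing one source code instruction costs 20 cycles in total, and compare an instruction whose execution is expensive with one whose execution is cheap:

$t_{F}+t_{A}=20,\quad t_{E}=2000\quad \Rightarrow \quad t=2020,\quad {\frac {t_{F}+t_{A}}{t}}={\frac {20}{2020}}\approx 0.01$

$t_{F}+t_{A}=20,\quad t_{E}=20\quad \Rightarrow \quad t=40,\quad {\frac {t_{F}+t_{A}}{t}}={\frac {20}{40}}=0.5$

In the first case interpretation adds about 1% to the running time, so $t\approx t_{E}$ is a fair approximation. In the second case half of the time goes to fetching and analyzing, and the approximation no longer holds.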

When each source code instruction is executed only a few times, the cost of the fetch and analyze phases is not paid repeatedly. Therefore, interpretation remains a reasonable alternative.

The fourth item is basically a restatement of the idea behind interpretation: instructions are executed one at a time. This parallel between interpretation and interactivity is reflected in the fact that the best-known examples of interpreters come from the world of command shells, such as bash and csh.

A typical example of the last item is prototyping, where a lightweight version of an application is developed and shown to the customer at an early stage of development to check that user requirements have been captured correctly. Since such an application is used only a few times and need not be fast, interpretation turns out to be a good choice.

Conversely, interpretation is not a sensible choice when:

The program is to be run in production mode, and therefore speed is important.
The instructions are executed frequently. For instance, an algorithm with many for-loops would not be a good candidate for interpretation.
The instructions have complicated formats and are therefore time-consuming to analyze.

Question

What type of language should we choose in implementing a matrix manipulation module?

Answer
Since matrix manipulation–that is, matrix addition, multiplication, subtraction, and so on–relies heavily on (nested) loops, which means parts of code will be executed repetitively, we had better go for a compiled language.

Question

What type of language should we choose in implementing a script that automates the coordinated execution of multiple programs?

Answer
In addition to the control structures used for program flow, the script in question will be made up of instructions that execute the programs involved, which will likely take quite long to complete. So, due to the presence of these complex, time-consuming instructions, we should opt for an interpreted language.

Question

What kind of processing is involved in running a machine code program on compatible hardware?

Answer
The way the processor runs a program follows a cycle similar to that of an interpreter: the next instruction is fetched from the instruction stream, it is decoded (which may trigger fetching of data), and it is executed (which may involve writing of data).

Interpreting a Program

Tombstone diagram of an S-interpreter implemented in M

Running program P by interpreting it on an S-interpreter

Note that there is no translation of the entire program; instructions are fetched, analyzed, and executed one at a time as if you were running your program on an abstract machine for S. For more, see the section on abstract machines.

Real and Abstract Machines

Interpretation can be used to test newly designed hardware without actually building it. Such an approach saves a lot of time by shortening the design-build-debug cycle. Consider the following typical scenario.

Designers come up with a hardware design or modify a previously built one.
The design can be built in either software or hardware. However, printing a board (that is, building the design in hardware) is not something you can do in your office; you need a chip manufacturing company to do it for you, and unless you place a large order, you are bound to wait a long time. So, the hardware option is time-consuming and hurts your competitiveness. The software option, on the other hand, can be carried out in your office: you test your design by running programs written for the newly designed machine on existing computers.
In the process of testing the new hardware, design errors may be detected. If so, you need to go back to the first step. Otherwise, you can proceed with the next step.
The marketing team starts pitching the new product while you have the board printed.

This kind of interpretation (that is, running programs of the yet-to-be-built machine on another machine) is called emulation. An emulator cannot be used to measure the emulated machine’s absolute speed. But it can still be used to obtain useful quantitative information such as the number of memory cycles, the degree of parallelism, and so on.

We might write the interpreter in a high-level programming language such as C:

Emulator written in a high-level programming language

This program is then compiled into an equivalent interpreter expressed in machine code M:

Translating the emulator

We can now write programs for the new hardware:

Program emulation

Running a program on top of an interpreter is functionally equivalent to running the same program directly on a machine. The user sees the same behavior in terms of the program’s inputs and outputs. The two processes are even similar in detail: an interpreter works in a fetch-analyze-execute cycle, and a machine works in a fetch-decode-execute cycle. The only difference is that an interpreter is a software artifact, whereas a machine is a hardware artifact (and therefore much faster).

Thus a machine may be viewed as an interpreter implemented in hardware. Conversely, an interpreter may be viewed as a machine implemented in software. Hence, an interpreter is sometimes called an abstract machine as opposed to its hardware counterpart, which is referred to as the [real] machine. So, we can define a machine code to be a language for which a hardware interpreter exists (at least on paper).

An abstract machine is functionally equivalent to a real machine if they both implement the same language L.

Equivalence of abstract machines and interpreters

Interpretive Compilers

A compiler may take quite a long time to translate a source program into machine code, but then the object program will run at full speed. An interpreter allows the program to start running immediately, but the program will run much more slowly.

An interpretive compiler is a combination of compiler and interpreter, giving some of the advantages of each. The key idea is to translate the source program into an intermediate language and then interpret the result of the translation on an abstract machine for the intermediate language.

The intermediate language is designed to meet the following requirements:

It is intermediate in level between the source language and ordinary machine code.
Its instructions have simple formats, and therefore can be analyzed easily and quickly.
Translation from the source language into the intermediate language is easy and fast.

Two disadvantages of using an intermediate language as the object language are lower running speed and ease of decompilation. The former means inefficient use of resources, while the latter implies weaker protection of your product. To counter the second drawback, one can use a tool called an obfuscator, which renames variables and rearranges code to make it difficult to understand. More is said on improving the running speed in the Just-in-Time Compilers section.

The biggest advantage of using an interpretive compiler is object-code portability, which means you can compile a program only once and run it anywhere. Such a scheme shines when you need to address clients using diverse architectures. Indeed, the two examples offered below have this in common.

Pascal/P-code Interpretive Compiler

Our first example is part of a compiler kit that made Pascal the programming language of choice in academia in the late 1970s and early 1980s.

The Pascal/P-code interpretive compiler has two components, a Pascal-to-P-code translator and a P-code interpreter:

Components of the Pascal/P-code interpretive compiler

If we feed a Pascal program into the translator, we will get the corresponding P-code program.

Translating a Pascal program into a P-code program

The next thing we do is run the resulting P-code program on the P-code interpreter.

Running P-code programs

Now, P-code is a Pascal-oriented intermediate language. This means it provides powerful instructions that correspond directly to Pascal operations such as array assignment, array indexing, and procedure call. Thus translation from Pascal to P-code is easy and fast. Although powerful, P-code instructions have simple formats like machine-code instructions, with operation and operand fields, and so are easy to analyze. Thus P-code interpretation is relatively fast.

Java/Bytecode Interpretive Compiler

Our second example is the Java programming language, which made its debut as the language of the Internet–the ultimate test lab for portability.

Similar to Pascal/P-code, the Java/Bytecode interpretive compiler has two components: the Java-to-Bytecode compiler and the Java Virtual Machine (JVM). In addition to serving as the Bytecode interpreter, the JVM provides services such as security and garbage collection.

Components of the Java/Bytecode interpretive compiler toolkit

Compiling Java source into Bytecode

Running Bytecode programs

Looks like the previous example, doesn’t it? Well, except for the names, it’s exactly the same. So, Sun was not the first to discover portability!

With Sun’s picoJava initiative in mind, the equivalence can be extended further, as shown in the figure below.

Running Bytecode programs faster

So, Sun’s Bytecode runs the fastest on Sun’s picoJava. (Looks like a good marketing trick, eh?)

Question: Provide the diagrams representing the process required to run a C# program.

Just-in-Time Compilers

One disadvantage of the above scheme is the running time of programs. Although it is much faster than pure interpretation, because a lower-level representation of the program is interpreted, it is still slow compared to running the machine-code counterpart directly on the hardware. This adverse effect can be lessened by introducing just-in-time compilation to the process. When a subprogram is called for the first time, the interpreter (more accurately, the part of the virtual machine called the just-in-time compiler) compiles the intermediate code of the subprogram into its machine code equivalent on the fly and executes it. Because of the translation overhead, the first invocation of the subprogram is costly. Subsequent invocations, however, reuse the compiled machine code and run as fast as they possibly can.

As a matter of fact, code produced by just-in-time compilers, also known as jitters, may end up being faster than code produced by native-code compilers. This is thanks to the dynamic nature of code generation in jitters. Consider moving your code base to a more advanced machine. Assuming the virtual machine and its jitter have been updated for the new machine, they are aware of its improvements and can exploit them, so all of your code now runs faster. This would not be the case with native-code executables: since executables are usually built by the implementer for a particular machine, migration to a different machine would yield no such increase in running speed.^[4]^[5]

Notes

[1] More generally, any object (or person) that can be controlled is a target of the programming activity, albeit in a somewhat different sense.
[2] Later on, we will relax this statement and say that a program can be run on an abstract machine by means of an interpreter.
[3] Note that we make the simplifying assumption that the compiler (or assembler) produces an executable. However, this is usually not the case. You may have to link the compiler (or assembler) output with other object files and/or libraries.
[4] This doesn't mean you would get no increase at all. Performance improvements due to faster peripheral devices would still make your application run faster.
[5] If you have access to the source code, you can spend some time and recompile the project.
