Compiler Construction/Code Generation
A compiler is usually designed to output an executable program that can be run directly by the processor, without an intermediary interpreter as in the interpretation process. For the processor to run your program, however, the instructions of your specific programming language must be transformed into assembler code, which is then sent to an assembler tool to create object code, which is in turn linked with specific libraries to create your executable. For now, we only need to worry about transforming the instructions into assembler code; this process is what we deal with in this section. You will need to be well versed in the assembler language you wish to output: if you intend your programs to run on the x86 architecture, you need to be familiar with x86 assembler code, and so on.
Code generation occurs after semantic analysis is done, which gives us enough information to generate more primitive, concrete code. The general idea behind code generation is to decompose the tree structure of the syntax tree into a sequence of instructions in whatever instruction set we target. At this stage, since we are done with semantic analysis, we are no longer interested in the syntactic and semantic structure of programs but in the order in which instructions execute.
Some sort of intermediate code is often produced before generating actual machine code. The benefits of this are
- it is easier to generate more abstract code, without worrying too much about things like register allocation,
- optimization independent of the machine architecture can be done, and
- compiler bugs can be spotted more easily.
Alternatively, it may be simpler for your compiler to output assembler code directly, but then you lose the above advantages. See the next section for more techniques on this.
In this chapter, we shall use the three-address format to represent intermediate code. The format is useful because it is analogous to actual machine instructions in some architectures and, more importantly, allows us to easily change the execution order of instructions, which is a huge advantage over stack-based intermediate code such as Java bytecode.
Although reusing names after they are no longer needed is not a complex problem, it is actually beneficial to allocate a new name every time one is needed, because doing so allows us to form a call graph and optimize easily, as we will see later. For this reason, we only briefly mention methods to reuse names. You can find more on the allocation of names in the optimization chapter.
Three-address code, as the name suggests, consists of three addresses and an opcode, which tells what kind of operation is to be done. For example, the expression (a + b) * 3 can be transformed into:
temp1 := a + b
temp2 := temp1 * 3
In the first line, temp1, a and b are addresses and + is an opcode; the second line is similar to the first. Unlike on load-store machines, there is no need to load variables into registers and store them back. This shows why three-address code is easy to handle.
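As a concrete sketch, a three-address instruction can be represented as a simple record of an opcode, a result address, and two operand addresses. The names below (`Instr`, `temp1`, `temp2`) are illustrative, not taken from any particular compiler:

```python
from collections import namedtuple

# One three-address instruction: result := arg1 op arg2
Instr = namedtuple("Instr", ["op", "result", "arg1", "arg2"])

# (a + b) * 3 from the text, as a sequence of two instructions:
code = [
    Instr("+", "temp1", "a", "b"),
    Instr("*", "temp2", "temp1", "3"),
]

for i in code:
    print(f"{i.result} := {i.arg1} {i.op} {i.arg2}")
```

Because each instruction is a self-contained record, a later pass can reorder, delete, or rewrite instructions without touching an operand stack.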
Choosing portable, flexible and expressive instructions is critical: not having enough instructions complicates the generated code, since several instructions must be combined to achieve one operation, while having too many obviously makes maintenance a more daunting task. Probably the best way to choose is to examine existing machine code; it is more straightforward to transform code that is close to the underlying machine code than more abstract code.
Algebraic expressions can be translated into three-address code in a very straightforward manner. This can be done recursively as follows: assume an expression with subexpressions left and right combined by an operator op; then the result should be:

code for left
code for right
temp := place for left op place for right
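The recursive scheme above can be sketched in a few lines of Python. Here an expression is a nested tuple `(op, left, right)` or a plain string for a variable or constant; the helper `new_temp` is an assumption for illustration:

```python
counter = 0

def new_temp():
    """Allocate a fresh temporary name, never reusing an old one."""
    global counter
    counter += 1
    return f"temp{counter}"

def gen(expr, out):
    """Emit three-address code for expr into out and return the place
    (variable, constant or temporary) that holds its value."""
    if isinstance(expr, str):       # a variable or constant: no code needed
        return expr
    op, left, right = expr
    lplace = gen(left, out)         # code for left
    rplace = gen(right, out)        # code for right
    t = new_temp()                  # temp := place for left op place for right
    out.append(f"{t} := {lplace} {op} {rplace}")
    return t

out = []
gen(("*", ("+", "a", "b"), "3"), out)
print(out)  # ['temp1 := a + b', 'temp2 := temp1 * 3']
```

Note that the recursion emits the code for both operands before the instruction that combines them, which is exactly the ordering the template above prescribes.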
The general idea behind generating code for control structures is the same as coding in assembly programming. That is, an if statement, for instance, is converted into a chunk of code using conditional and unconditional jumps.
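A minimal sketch of that conversion, assuming a fresh-label helper and an `ifFalse … goto` instruction in our intermediate code (both invented here for illustration), might look like this:

```python
label_count = 0

def new_label():
    """Allocate a fresh jump label."""
    global label_count
    label_count += 1
    return f"L{label_count}"

def gen_if(cond_place, body_code):
    """Lower 'if cond then body': jump over the body when cond is false."""
    end = new_label()
    return [f"ifFalse {cond_place} goto {end}"] + body_code + [f"{end}:"]

code = gen_if("t1", ["x := y + 1"])
print("\n".join(code))
# ifFalse t1 goto L1
# x := y + 1
# L1:
```

An if-else statement adds one unconditional jump at the end of the then-branch, and loops are built the same way with a backward jump.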
Assembler language techniques
If the output of your compiler is assembly language code, it is necessary to understand the basic techniques of assembly language programming. Most programming languages do not map easily to most assembler languages, so some techniques or skills may need to be learned before attempting to write a compiler that outputs assembler code. These techniques are not intended to create highly optimized code - you will learn optimizing techniques later - but to make sure you have a good understanding of how data and instructions are managed in the process of compiler construction.
Many programs use hundreds of different variables (not counting arrays).
Most computer architectures give you fewer than 32 registers (the MIPS and ARM architectures give nearly 32 pointer registers; the i386 gives only about 4 pointer registers; a PIC microcontroller has only 1 pointer register).
Since we can't squeeze 100 different variables into even 32 processor registers, we must use memory for storing most variables.
We will start by storing practically all variables in memory. Later we will cover optimizing techniques that try to keep as many variables as possible in the processor registers.
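The simplest form of this strategy is to give every variable its own fixed memory slot and never keep anything in a register between statements. A sketch, with invented names and a 4-byte word size assumed for illustration:

```python
WORD = 4        # assumed word size in bytes
offsets = {}    # variable name -> byte offset in the data area

def slot(name):
    """Assign each variable a distinct offset the first time it is seen;
    every later use gets the same slot back."""
    if name not in offsets:
        offsets[name] = len(offsets) * WORD
    return offsets[name]

# Every use of a variable loads from / stores to its slot:
print(slot("x"), slot("y"), slot("x"))  # 0 4 0
```

Generated assembler then loads an operand from its slot, operates, and stores the result back - slow, but simple and always correct, which is why it makes a good starting point before register allocation.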
The assembler tool that you are using may reserve the names of the mnemonics. For example, your assembler may not allow a variable named add, since this is reserved for the instruction to add.
In this case, it may be important to use a prefix for your variable labels. Some compilers use a single underscore, but you can choose whichever prefix you wish.