Creating a Virtual Machine/Register VM in C
Design
editThe first example will be one of the simplest possible architectures, and it will consist of the following components:
- A set of registers (I will arbitrarily choose to have 4 of them, numbered 0 to 3). The registers serve as a group of read/write memory cells.
- A program, which is a read-only sequence of VM instructions
- An execution unit to run the program. The execution unit will read each instruction in order and execute it.
Addressing modes
editIn any assembly language numbers serve multiple uses. Aside from representing scalar values, they also represent register numbers and memory locations. In this example I will follow the convention of using prefix characters to denote the role a number plays.
- Register numbers begin with the letter r, like r0, r1, r2.
- Immediate (scalar) values begin with the hash mark #, like #100, #200.
- Memory addresses begin with the at sign @, like @1000, @1004.
Instruction set
editAfter designing the VM, it is necessary to design the VM's instruction set. The instruction set is simply a reflection of the kinds of things we would like to do (or allow others to do) with this VM. Here are some things it should be possible to do with this VM:
- Load an immediate number (a constant) into a register.
- Perform an arithmetic sum on two registers (i.e., add two numbers).
- Halt the machine.
Let's keep it just that simple for now. After we have a working implementation we can go back and add more instructions.
First let's write a short program using these three instructions to see what they might look like:
loadi r0 #100
loadi r1 #200
add r2 r0 r1
halt
If the VM were to run the program, this is a description of what would happen on each line:
- Load the immediate value 100 into the register r0. Note that placing a value into a register is commonly called a load operation.
- Load the immediate value 200 into register r1.
- Add the contents of registers r0 and r1 and place the sum into register r2.
- End the program and halt the VM.
Note that in keeping with common convention I have chosen to place the destination registers first in the operand list.
Instruction codes
editAfter the assembly language is created it is necessary to determine how to represent each instruction as a number. This establishes a one-to-one correspondence between each instruction in the assembly language and each instruction code in the set of instruction codes. Converting a program from assembly language to instruction codes is called assembling, and conversion from instruction codes back into assembly language is called disassembling.
Several choices we must make at this point are:
- What number is used to represent each assembly language instruction?
- How are instruction operands encoded?
- Are operands part of the instruction word (remember, by word I mean number), or are they separate words (numbers)?
First, to answer the last question, since there are only small numbers of instructions and registers in this VM it should not be very difficult to encode all operands in a single instruction word, even if (for the sake of simplicity) I were to use a 16-bit instruction word. Thus, a 16-bit number written in hexadecimal has 4 digits, giving us easy access to 4 information fields, each containing 16 variations (0-9 and A-F).
- The first digit of a machine word will be the instruction number. This gives our VM the potential for having up to 16 different instructions. This is a small amount by contemporary standards, but it is plenty for our example virtual machine.
- The next three digits will be used for the operands. These can be used as three 1-digit operands, two operands of 1 and 2 digits, or a single 3-digit operand.
Having made these decisions, let us now establish the encoding. Recall that we have 16 instruction numbers available.
The halt instruction will be instruction 0, and there is an important reason for choosing 0 for this instruction. Since empty space in the computer's memory will most likely be filled with 0s, any run-away program will eventually encounter a 0 and attempt to execute this instruction, immediately halting the program.
The remaining two instructions can be assigned arbitrarily: the loadi instruction can be instruction 1, and the add instruction can be instruction 2. This is our current instruction encoding list:
0 = halt 1 = loadi 2 = add
Examining the first program instruction, we see that we must now encode the register and the immediate value:
loadi r0 #100
There are three hexadecimal digits available for operands, so we will use the first of those three as the register number, and the second and third together as the immediate value. Now we have just determined the complete encoding of the loadi instruction:
bits 15-12 = 1 bits 11- 8 = register number bits 7- 0 = immediate value
The register number for this instruction is 0, and the immediate value is 100, which in hexadecimal is 64 (6 x 16 + 4). Thus, the complete 16-bit hexadecimal instruction code for the first instruction is 1064.
The second instruction is assembled in a similar way. The immediate value is 200, which in hexadecimal is C8. The second instruction assembles to 11C8.
The third instruction is 2201.
The last instruction is 0.
Putting the instructions together, we get these 4 16-bit hexadecimal numbers as the complete program:
1064
11C8
2201
0000
Implementation
editIt is now time to implement this design. I have chosen to write this program in C for a number of reasons:
- It's a very simple language that allows the writing of concise programs.
- It's a low level language with some high-level features. This makes it a good choice for creating a VM.
The primary function of the VM will be a run function that does the following steps in order inside a loop. The loop repeats until the halt instruction is executed.
- Fetch the next instruction from the program.
- Decode the instruction into its constituent parts.
- Execute the decoded instruction.
Let's begin writing the program.
Registers
editThe first and simplest part to implement is the set of 4 registers.
# define NUM_REGS 4
int regs[ NUM_REGS ];
The registers in this VM are signed integers that are, depending on your computer and operating system, either 16, 32, or 64 bits wide. The exact size is irrelevant for this example.
Program instructions
editThe fully assembled program can easily be stored in an array of integers.
int prog[] = { 0x1064, 0x11C8, 0x2201, 0x0000 };
Fetch
editThe fetch function retrieves the next word from the program. In order to do this we must introduce a variable called the program counter (also sometimes called the instruction pointer). Before the instruction word is returned from the function the program counter must be incremented so that it refers to the next instruction in the program array.
int pc = 0; /* program counter */
int fetch()
{
return prog[ pc++ ];
}
Decode
editThe decode function will decode each instruction completely. It will determine the three operand registers and immediate value for each instruction, even if an instruction doesn't use that part. This actually makes the function simpler. The instruction number and operands are stored in variables.
int instrNum = 0;
/* operands: */
int reg1 = 0;
int reg2 = 0;
int reg3 = 0;
int imm = 0;
/* decode a word */
void decode( int instr )
{
instrNum = (instr & 0xF000) >> 12;
reg1 = (instr & 0xF00 ) >> 8;
reg2 = (instr & 0xF0 ) >> 4;
reg3 = (instr & 0xF );
imm = (instr & 0xFF );
}
Execute
editThe execute function actually performs the instruction. First there needs to be a running flag that the halt instruction can set to 0.
/* the VM runs until this flag becomes 0 */
int running = 1;
Now the eval function. It's just a switch ladder. This function uses the instruction code and operand variables that the decode function stored values into. Notice that in each instruction case I display the instruction. This makes it easy to follow the program as it runs.
/* evaluate the last decoded instruction */
void eval()
{
switch( instrNum )
{
case 0:
/* halt */
printf( "halt\n" );
running = 0;
break;
case 1:
/* loadi */
printf( "loadi r%d #%d\n", reg1, imm );
regs[ reg1 ] = imm;
break;
case 2:
/* add */
printf( "add r%d r%d r%d\n", reg1, reg2, reg3 );
regs[ reg1 ] = regs[ reg2 ] + regs[ reg3 ];
break;
}
}
Run
editThe run function is rather simple: fetch, then decode, then execute.
void run()
{
while( running )
{
int instr = fetch();
decode( instr );
eval();
}
}
Main
editThe main function simply calls the run function:
int main( int argc, const char * argv[] )
{
run();
return 0;
}
Complete program source
editFor your convenience I have included the complete program source in a single block. Note that I have added a showRegs function in order to see the values of all the registers between instructions. This allowed me to verify that both the VM and the assembled program are working correctly.
# include <stdio.h>
# define NUM_REGS 4
unsigned regs[ NUM_REGS ];
unsigned program[] = { 0x1064, 0x11C8, 0x2201, 0x0000 };
/* program counter */
int pc = 0;
/* fetch the next word from the program */
int fetch()
{
return program[ pc++ ];
}
/* instruction fields */
int instrNum = 0;
int reg1 = 0;
int reg2 = 0;
int reg3 = 0;
int imm = 0;
/* decode a word */
void decode( int instr )
{
instrNum = (instr & 0xF000) >> 12;
reg1 = (instr & 0xF00 ) >> 8;
reg2 = (instr & 0xF0 ) >> 4;
reg3 = (instr & 0xF );
imm = (instr & 0xFF );
}
/* the VM runs until this flag becomes 0 */
int running = 1;
/* evaluate the last decoded instruction */
void eval()
{
switch( instrNum )
{
case 0:
/* halt */
printf( "halt\n" );
running = 0;
break;
case 1:
/* loadi */
printf( "loadi r%d #%d\n", reg1, imm );
regs[ reg1 ] = imm;
break;
case 2:
/* add */
printf( "add r%d r%d r%d\n", reg1, reg2, reg3 );
regs[ reg1 ] = regs[ reg2 ] + regs[ reg3 ];
break;
}
}
/* display all registers as 4-digit hexadecimal words */
void showRegs()
{
int i;
printf( "regs = " );
for( i=0; i<NUM_REGS; i++ )
printf( "%04X ", regs[ i ] );
printf( "\n" );
}
void run()
{
while( running )
{
showRegs();
int instr = fetch();
decode( instr );
eval();
}
showRegs();
}
int main( int argc, const char * argv[] )
{
run();
return 0;
}
Sample run
editHere is the output from the program:
regs = 0000 0000 0000 0000
loadi r0 #100
regs = 0064 0000 0000 0000
loadi r1 #200
regs = 0064 00C8 0000 0000
add r2 r0 r1
regs = 0064 00C8 012C 0000
halt
regs = 0064 00C8 012C 0000