x86 Assembly/Machine Language Conversion
Relationship to Machine Code
x86 assembly instructions correspond directly to the underlying machine instructions: each assembly instruction assembles to exactly one machine instruction (although some instructions have more than one valid encoding). This means that, essentially, we can convert assembly instructions into machine instructions with a look-up table. This page covers some of the conversions from assembly language to machine language.
CISC and RISC
The x86 architecture is a complex instruction set computer (CISC) architecture. Among other things, this means that x86 instructions vary in length. This can make assembly, disassembly and instruction decoding more complicated, because the length has to be calculated for each instruction.
x86 instructions can be anywhere between 1 and 15 bytes long. The length is defined separately for each instruction, depending on the available modes of operation of the instruction, the number of required operands and more.
8086 instruction format (16 bit)
This is the general form of an 8086 instruction, in the order its fields appear in main memory:
Prefixes (optional) |
Opcode (first byte) | D | W |
Opcode 2 (occasional second byte) |
MOD | Reg | R/M |
Displacement or data (occasional: 1, 2 or 4 bytes) |
- Prefixes
- Optional prefixes which change the operation of the instruction
- D
- (1 bit) Direction. 1 = the Reg field is the destination, 0 = the Reg field is the source.
- W
- (1 bit) Operation size. 1 = Word, 0 = byte.
- Opcode
- (6 bits) Determines which instruction family the instruction belongs to.
- MOD (Mod)
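- (2 bits) Addressing mode. Selects whether R/M names a register or a memory operand, and how many displacement bytes follow.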
- Reg
- (3 bits) Register. Each register has an identifier.
- R/M (r/m)
- (3 bits) Register/Memory operand
Not all instructions have W or D bits; in some cases, the width of the operation is either irrelevant or implicit, and for other operations the data direction is irrelevant.
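The field layout above can be illustrated with a few shifts and masks. The following is a minimal sketch, not a full decoder: the function names `split_opcode_byte` and `split_modrm_byte` are made up for illustration, and it assumes the first byte really has the `oooooodw` layout, which (as noted above) is not true of every instruction.

```python
def split_opcode_byte(byte):
    """Split a first byte of the form 'oooooodw' into (opcode, d, w)."""
    opcode = (byte >> 2) & 0b111111  # top 6 bits: instruction family
    d = (byte >> 1) & 1              # direction bit
    w = byte & 1                     # width bit
    return opcode, d, w


def split_modrm_byte(byte):
    """Split the second byte into its (mod, reg, rm) fields."""
    mod = (byte >> 6) & 0b11   # top 2 bits
    reg = (byte >> 3) & 0b111  # middle 3 bits
    rm = byte & 0b111          # low 3 bits
    return mod, reg, rm


# Sample input: the bytes 32H 0EH (chosen only as an example).
opcode, d, w = split_opcode_byte(0x32)
mod, reg, rm = split_modrm_byte(0x0E)
print(f"opcode={opcode:06b} d={d} w={w}")       # opcode=001100 d=1 w=0
print(f"mod={mod:02b} reg={reg:03b} rm={rm:03b}")  # mod=00 reg=001 rm=110
```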
Notice that Intel instruction format is little-endian, which means that the lowest-significance bytes are closest to absolute address 0. Thus, words are stored low-byte first; the value 1234H is stored in memory as 34H 12H. By convention, most-significant bits are always shown to the left within the byte, so 34H would be 00110100B.
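As a quick check of the byte order described above, Python's standard `struct` module can pack a 16-bit word in little-endian order. This is only an illustration of the storage order, not of anything specific to an assembler.

```python
import struct

word = 0x1234
stored = struct.pack("<H", word)  # "<H" = little-endian unsigned 16-bit

print(stored.hex(" "))        # 34 12  (low byte first)
print(f"{stored[0]:08b}")     # 00110100  (34H, most-significant bit on the left)
```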
After the initial 2 bytes, each instruction can have many additional addressing/immediate data bytes.
Mod / Reg / R/M tables
Mod | Displacement |
00 | If r/m is 110, Displacement (16 bits) is address; otherwise, no displacement |
01 | Eight-bit displacement, sign-extended to 16 bits |
10 | 16-bit displacement (example: MOV [BX + SI]+ displacement,al) |
11 | r/m is treated as a second "reg" field |
Reg | W = 0 | W = 1 | Double word |
000 | AL | AX | EAX |
001 | CL | CX | ECX |
010 | DL | DX | EDX |
011 | BL | BX | EBX |
100 | AH | SP | ESP |
101 | CH | BP | EBP |
110 | DH | SI | ESI |
111 | BH | DI | EDI |
r/m | Operand address |
000 | (BX) + (SI) + displacement (0, 1 or 2 bytes long) |
001 | (BX) + (DI) + displacement (0, 1 or 2 bytes long) |
010 | (BP) + (SI) + displacement (0, 1 or 2 bytes long) |
011 | (BP) + (DI) + displacement (0, 1 or 2 bytes long) |
100 | (SI) + displacement (0, 1 or 2 bytes long) |
101 | (DI) + displacement (0, 1 or 2 bytes long) |
110 | (BP) + displacement unless mod = 00 (see mod table) |
111 | (BX) + displacement (0, 1 or 2 bytes long) |
Note the special meaning of MOD 00, r/m 110. Normally, this would be expected to be the operand [BP]. However, instead the 16-bit displacement is treated as the absolute address. To encode the value [BP], you would use mod = 01, r/m = 110, 8-bit displacement = 0.
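The three tables, including the special case for mod = 00, r/m = 110, can be combined into one small look-up routine. The sketch below covers only the 16-bit forms; the function name `describe_rm` and the output strings (`disp8`, `disp16`) are illustrative choices, not standard notation.

```python
REG8 = ["AL", "CL", "DL", "BL", "AH", "CH", "DH", "BH"]   # W = 0 column
REG16 = ["AX", "CX", "DX", "BX", "SP", "BP", "SI", "DI"]  # W = 1 column
RM_BASES = ["BX + SI", "BX + DI", "BP + SI", "BP + DI", "SI", "DI", "BP", "BX"]


def describe_rm(mod, rm, w):
    """Describe the operand selected by the mod and r/m fields (16-bit forms)."""
    if mod == 0b11:
        # r/m is treated as a second "reg" field
        return (REG16 if w else REG8)[rm]
    if mod == 0b00:
        if rm == 0b110:
            # Special case: a bare 16-bit displacement used as the address
            return "[disp16]"
        return f"[{RM_BASES[rm]}]"
    disp = "disp8" if mod == 0b01 else "disp16"
    return f"[{RM_BASES[rm]} + {disp}]"


print(describe_rm(0b00, 0b110, 0))  # [disp16]
print(describe_rm(0b01, 0b110, 0))  # [BP + disp8]
print(describe_rm(0b11, 0b001, 0))  # CL
```

The second call shows why [BP] has to be encoded with mod = 01 and a zero displacement: the mod = 00 slot for r/m = 110 is taken by the absolute-address form.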
Example: Absolute addressing
Let's translate the following instruction into machine code:
XOR CL, [12H]
Note that this is XORing CL with the contents of address 12H – the square brackets are a common indirection indicator. The opcode for XOR is "001100dw". D is 1 because the CL register is the destination. W is 0 because we have a byte of data. Our first byte therefore is "00110010".
Now, we know that the code for CL is 001. Reg thus has the value 001. The address is specified as a simple displacement, so the MOD value is 00 and the R/M is 110. Byte 2 is thus (00 001 110b).
Bytes 3 and 4 contain the effective address, low-order byte first: 0012H is stored as 12H 00H, or (00010010b) (00000000b).
All together,
XOR CL, [12H] = 00110010 00001110 00010010 00000000 = 32H 0EH 12H 00H
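The same steps can be written out as a short routine. This is a sketch that handles only this one form (XOR of an 8-bit register with a byte at an absolute 16-bit address); the function name `xor_reg8_abs16` is hypothetical.

```python
import struct

REG8 = {"AL": 0, "CL": 1, "DL": 2, "BL": 3, "AH": 4, "CH": 5, "DH": 6, "BH": 7}


def xor_reg8_abs16(reg, address):
    """Encode XOR reg8, [address] for the 16-bit instruction format."""
    d, w = 1, 0                                     # register is destination, byte-sized
    opcode = 0b00110000 | (d << 1) | w              # 001100dw
    modrm = (0b00 << 6) | (REG8[reg] << 3) | 0b110  # mod=00, r/m=110: absolute address
    return bytes([opcode, modrm]) + struct.pack("<H", address)  # address low byte first


code = xor_reg8_abs16("CL", 0x12)
print(code.hex(" "))  # 32 0e 12 00
```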
Example: Immediate operand
Now suppose we want to use an immediate operand instead:
XOR CL, 12H
In this case, because there are no square brackets, 12H is immediate: it is the number we are going to XOR against. The opcode for an immediate XOR is 1000000w; in this case, we are using a byte, so w is 0. So our first byte is (10000000b).
The second byte, for an immediate operation, takes the form "mod 110 r/m". Since the destination is a register, mod is 11, making the r/m field a register value. We already know that the register value for CL is 001, so our second byte is (11 110 001b).
The third byte (and fourth byte, if this were a word operation) are the immediate data. As it is a byte, there is only one byte of data, 12H = (00010010b).
All together, then:
XOR CL, 12H = 10000000 11110001 00010010 = 80H F1H 12H
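As with the absolute-address case, the encoding steps can be sketched in a few lines. The function name `xor_reg8_imm8` is hypothetical, and the routine covers only an 8-bit register destination with a one-byte immediate.

```python
REG8 = {"AL": 0, "CL": 1, "DL": 2, "BL": 3, "AH": 4, "CH": 5, "DH": 6, "BH": 7}


def xor_reg8_imm8(reg, imm):
    """Encode XOR reg8, imm8 using the 1000000w opcode with w = 0."""
    w = 0
    opcode = 0b10000000 | w                          # 1000000w
    modrm = (0b11 << 6) | (0b110 << 3) | REG8[reg]   # mod=11, fixed 110, r/m=register
    return bytes([opcode, modrm, imm & 0xFF])        # one byte of immediate data


code = xor_reg8_imm8("CL", 0x12)
print(code.hex(" "))  # 80 f1 12
```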
x86 Instructions (32/64 bit)
The 32-bit instructions are encoded in a very similar way to the 16-bit instructions, except that (by default) they act upon dword quantities rather than words. They also support a much more flexible memory addressing format, made possible by the addition of a SIB ("scale-index-base") byte, which follows the ModR/M byte.
Continuing the previous absolute addressing example, we take this input:
XOR CL, [12H]
...and we arrive at the 32-bit machine code like so:
We begin with the opcode byte, which remains the same: 32H. Consulting the Intel IA-32 manual, Volume 2C, Chapter 5, "XOR", we see that this opcode defines that a) it requires 2 operands, b) the operands have a direction, and the first operand is the destination, c) the first operand is a register of 8-bit width, d) the second operand is also 8-bit but can be either a register or a memory address, and e) the destination register CL will be overwritten with the result of the operation. This fits our case, because the first operand is CL ("L" meaning the lower 8 bits of the "C" register), and the second operand is a reference to the value stored in memory at 12H (a direct/absolute address reference). No prefix bytes are needed to get the operand sizes we want.
Next we need a ModR/M byte, because the opcode requires one: a) it takes operands, b) they are not encoded within the opcode or any prefix, and c) there is no immediate operand. So again we consult the Intel manual, Volume 2A, Chapter 2, Section 2.1.5 "Addressing-Mode Encoding of ModR/M and SIB Bytes", Table 2-2 "32-Bit Addressing Forms with the ModR/M Byte". The first operand is our destination register, CL, which maps to REG=001b. Next we look for an Effective Address formula which matches our second operand, which is a displacement with no base, index, or scale. The table does list a plain disp32 form (Mod=00b, R/M=101b), but in 64-bit mode that encoding is reinterpreted as RIP-relative addressing, so it would not give an absolute address there. To get an encoding that means the same thing in both modes, we use the form that requires a SIB byte, noted as [--][--], which is selected by Mod=00b, R/M=100b. Our second byte is therefore 00001100b or 0CH.
The SIB byte, if it is used, always follows the ModR/M byte, so we continue to the next table in the Intel manual, Table 2-3 "32-Bit Addressing Forms with the SIB Byte", and look for the combination of Scale, Index, and Base values which will give us the disp32 formula we need. Notice the footnote [*]: it tells us that with Mod=00b, Base=101b means a disp32 with no base register. Combined with Index=100b (no index) and Scale=00b, this gives us disp32 with no index, no scale, and no base. So our third byte is 00100101b or 25H.
The displacement, if used, always follows the ModR/M and SIB bytes, so here we simply write our 32-bit value in little-endian order, meaning our next four bytes are 12H 00H 00H 00H.
Finally, we have our machine code:
XOR CL, [12H] = 00110010 00001100 00100101 00010010 00000000 00000000 00000000 = 32H 0CH 25H 12H 00H 00H 00H
This instruction works in both 32-bit Protected mode and 64-bit Long mode.
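The ModR/M, SIB and displacement steps above can be checked with a short sketch. The function name `xor_reg8_abs32` is hypothetical; the routine only builds this one form (opcode 32H with a SIB-encoded disp32 and no base or index) and takes the register as its 3-bit code from the Reg table.

```python
import struct


def xor_reg8_abs32(reg_code, address):
    """Encode XOR reg8, [address] using ModR/M + SIB + disp32."""
    opcode = 0x32
    modrm = (0b00 << 6) | (reg_code << 3) | 0b100  # r/m=100: a SIB byte follows
    sib = (0b00 << 6) | (0b100 << 3) | 0b101       # scale=00, no index, no base (disp32)
    disp32 = struct.pack("<I", address)            # 32-bit value, low byte first
    return bytes([opcode, modrm, sib]) + disp32


code = xor_reg8_abs32(0b001, 0x12)  # 001 = CL
print(code.hex(" "))  # 32 0c 25 12 00 00 00
```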