x86 Assembly/16, 32, and 64 Bits

x86 Assembly
quick links: registers • move • jump • calculate • logic • rearrange • misc. • FPU

When using x86 assembly, it is important to consider the differences between architectures that are 16, 32, and 64 bits. This page will talk about some of the basic differences between architectures with different bit widths.

Registers

16-bit

The registers found on the 8086 and all subsequent x86 processors are the following: AX, BX, CX, DX, SP, BP, SI, DI, CS, DS, SS, ES, IP and FLAGS. These are all 16 bits wide.

On DOS and up to 32-bit Windows, you can run a very handy program called "debug.exe" from a DOS shell, which is very useful for learning about 8086. If you are using DOSBox or FreeDOS, you can use "debug.exe" as provided by FreeDOS.

AX, BX, CX, DX: These general purpose registers can also be addressed as 8-bit registers. So AX = AH (high 8-bit) and AL (low 8-bit).

SI, DI: These registers are usually used as offsets into data space. By default, SI is offset from the DS data segment, DI is offset from the ES extra segment, but either or both of these can be overridden.

SP: This is the stack pointer, offset usually from the stack segment SS. Data is pushed onto the stack for temporary storage, and popped off the stack when it is needed again.

BP: The stack frame, usually treated as an offset from the stack segment SS. Parameters for subroutines are commonly pushed onto the stack when the subroutine is called, and BP is set to the value of SP when a subroutine starts. BP can then be used to find the parameters on the stack, no matter how much the stack is used in the meanwhile.

CS, DS, SS, ES: The segment pointers. These are the offset in memory of the current code segment, data segment, stack segment and extra segment respectively.

IP: The instruction pointer. Offset from the code segment CS, this points at the instruction currently being executed.

FLAGS (F): A number of single-bit flags that indicate (or sometimes set) the current status of the processor.

32-bit

With the chips beginning to support a 32-bit data bus, the registers were also widened to 32 bits. The names for the 32-bit registers are simply the 16-bit names with an 'E' prepended.

EAX, EBX, ECX, EDX, ESP, EBP, ESI, EDI: These are the 32-bit versions of the registers shown above.

EIP: The 32-bit version of IP. Always use this instead of IP on 32-bit systems.
EFLAGS: An expanded version of the 16-bit FLAGS register.

64-bit

The names of the 64-bit registers are the same of those of the 32-bit registers, except beginning with an 'R'.

RAX, RBX, RCX, RDX, RSP, RBP, RSI, RDI: These are the 64-bit versions of the registers shown above.
RIP: This is the full 64-bit instruction pointer and should be used instead of EIP (which will be inaccurate if the address space is larger than 4 GiB, which may happen even with 4 GiB or less of RAM).
R8–15: These are new extra registers for 64-bit. They are counted as if the registers above are registers zero through seven, inclusively, rather than one through eight.

R8–R15 can be accessed as 8-bit, 16-bit, or 32-bit registers. Using R8 as an example, the names corresponding to those widths are R8B, R8W, and R8D, respectively. 64-bit versions of x86 also allow the low byte of RSP, RBP, RSI, RDI to be accessed directly. For example, the low byte of RSP can be accessed using SPL. There is no way to directly access bits 8–15 of those registers, as AH allows for AX.

128-bit, 256-bit and 512-bit (SSE/AVX)

64-bit x86 includes SSE2 (an extension to 32-bit x86), which provides 128-bit registers for specific instructions. Most CPUs made since 2011 also have AVX, a further extension that lengthens these registers to 256 bits. Some also have AVX-512, which lengthens them to 512 bits and adds 16 more registers.

XMM0~7: SSE2 and newer.
XMM8~15: SSE3 and newer and AMD (but not Intel) SSE2.
YMM0~15: AVX. Each YMM register includes the corresponding XMM register as its lower half.
ZMM0~15: AVX-512F. Each ZMM register includes the corresponding YMM register as its lower half.
ZMM16~31: AVX-512F. 512-bit registers that are not addressable in narrower modes unless AVX-512VL is implemented.
XMM16~31: AVX-512VL. Each is the lower quarter of the corresponding ZMM register.
YMM16~31: AVX-512VL. Each is the lower half of the corresponding ZMM register.

Addressing memory

8086 and 80186

The original 8086 only had registers that were 16 bits in size, effectively allowing to store one value of the range [0 - (2¹⁶ - 1)] (or simpler: it could address up to 65536 different bytes, or 64 kibibytes) - but the address bus (the connection to the memory controller, which receives addresses, then loads the content from the given address, and returns the data back on the data bus to the CPU) was 20 bits in size, effectively allowing to address up to 1 mebibyte of memory. That means that all registers by themselves were not large enough to make use of the entire width of the address bus, leaving 4 bits unused, scaling down the number of usable addresses by 16 (1024 KiB / 64 KiB = 16).

The problem was this: how can a 20-bit address space be referred to by the 16-bit registers? To solve this problem, the engineers of Intel came up with segment registers CS (Code Segment), DS (Data Segment), ES (Extra Segment), and SS (Stack Segment). To convert from 20-bit address, one would first divide it by 16 and place the quotient in the segment register and remainder in the offset register. This was represented as CS:IP (this means, CS is the segment and IP is the offset). Likewise, when an address is written SS:SP it means SS is the segment and SP is the offset.

This works also the reversed way. If one was, instead of convert from, to create a 20 bit address, it would be done by taking the 16-bit value of a segment register and put it on the address bus, but shifted 4 times to the left (thus effectively multiplying the register by 16), and then by adding the offset from another register untouched to the value on the bus, thus creating a full 20-bit address.

Example

If CS = 258C and IP = 0012₁₆, then CS:IP will point to a 20 bit address equivalent to "CS × 16 + IP" which will be

258C × 10₁₆ + 0012₁₆ = 258C0 + 0012₁₆ = 258D2 (Remember: 16 decimal = 10₁₆).

The 20-bit address is known as an absolute (or linear) address and the Segment:Offset representation (CS:IP) is known as a segmented address. This separation was necessary, as the register itself could not hold values that required more than 16 bits encoding. When programming in protected mode on a 32-bit or 64-bit processor, the registers are big enough to fill the address bus entirely, thus eliminating segmented addresses - only linear/logical addresses are generally used in this "flat addressing" mode, although the Segment:Offset architecture is still supported for backwards compatibility.

It is important to note that there is not a one-to-one mapping of physical addresses to segmented addresses; for any physical address, there is more than one possible segmented address. For example: consider the segmented representations B000:8000 and B200:6000. Evaluated, they both map to physical address B8000.

B000:8000 = B000 × 10₁₆ + 8000₁₆ = B0000 + 8000₁₆ = B8000, and

B200:6000 = B200 × 10₁₆ + 6000₁₆ = B2000 + 6000₁₆ = B8000.

However, using an appropriate mapping scheme avoids this problem: such a map applies a linear transformation to the physical addresses to create precisely one segmented address for each. To reverse the translation, the map [f(x)] is simply inverted.

For example, if the segment portion is equal to the physical address divided by 10₁₆ and the offset is equal to the remainder, only one segmented address will be generated. (No offset will be greater than 0F₁₆.) Physical address B8000 maps to (B8000 / 10₁₆):(B8000 mod 10₁₆) or B800:0. This segmented representation is given a special name: such addresses are said to be "normalized Addresses".

CS:IP (Code Segment: Instruction Pointer) represents the 20-bit address of the physical memory from where the next instruction for execution will be picked up. Likewise, SS:SP (Stack Segment: Stack Pointer) points to a 20-bit absolute address which will be treated as stack top (8086 uses this for pushing/popping values).

Protected Mode (80286+)

As ugly as this may seem, it was in fact a step towards the protected addressing scheme used in later chips. The 80286 had a protected mode of operation, in which all 24 of its address lines were available, allowing for addressing of up to 16 MiB of memory. In protected mode, the CS, DS, ES, and SS registers were not segments but selectors, pointing into a table that provided information about the blocks of physical memory that the program was then using. In this mode, the pointer value CS:IP = 0010:2400 is used as follows:

The CS value 0010₁₆ is an offset into the selector table, pointing at a specific selector. This selector would have a 24-bit value to indicate the start of a memory block, a 16-bit value to indicate how long the block is, and flags to specify whether the block can be written, whether it is currently physically in memory, and other information. Let's say that the memory block pointed to actually starts at the 24-bit address 164400₁₆, the actual address referred to then is 164400₁₆ + 2400₁₆ = 166800₁₆. If the selector also includes information that the block is 2400₁₆ bytes long, the reference would be to the byte immediately following that block, which would cause an exception: the operating system should not allow a program to read memory that it does not own. And if the block is marked as read-only, which code segment memory should be so that programs don't overwrite themselves, an attempt to write to that address would similarly cause an exception.

With CS and IP being expanded to 32 bits in the 386, this scheme became unnecessary; with a selector pointing at physical address 00000000₁₆, a 32-bit register could address up to 4 GiB of memory. However, selectors are still used to protect memory from rogue programs. If a program in Windows tries to read or write memory that it doesn't own, for instance, it will violate the rules set by the selectors, triggering an exception, and Windows will shut it down with the "General protection fault" message.

32-Bit Addressing

32-bit addresses can cover memory up to 4 GiB in size. This means that we don't need to use offset addresses in 32-bit processors. Instead, we use what is called the "Flat addressing" scheme, where the address in the register directly points to a physical memory location. The segment registers are used to define different segments, so that programs don't try to execute the stack section, and they don't try to perform stack operations on the data section accidentally.

The A20 Gate Saga

As was said earlier, the 8086 processor had 20 address lines (from A0 to A19), so the total memory addressable by it was 1 MiB (or 2 to the power 20). But since it had only 16 bit registers, they came up with Segment:Offset scheme or else using a single 16-bit register they couldn't have possibly accessed more than 64 KiB (or 2 to the power 16) of memory. So this made it possible for a program to access the whole of 1 MiB of memory.

But with this segmentation scheme also came a side effect. Not only could your code refer to the whole of 1 MiB with this scheme, but actually a little more than that. Let's see how ....

Let's keep in mind how we convert from a Segment:Offset representation to Linear 20 bit representation.

The conversion:

Segment:Offset = Segment × 16 + Offset.

Now to see the maximum amount of memory that can be addressed, let's fill in both Segment and Offset to their maximum values and then convert that value to its 20-bit absolute physical address.

So, max value for Segment = FFFF₁₆, and max value for Offset = FFFF₁₆.

Now, let's convert FFFF:FFFF into its 20-bit linear address, bearing in mind 16₁₀ is represented as 10 in hexadecimal.

So we get, FFFF:FFFF -> FFFF × 10₁₆ + FFFF = FFFF0 (1 MiB - 16 bytes) + FFFF (64 KiB) = FFFFF + FFF0 = 1 MiB + FFF0 bytes.

Note: FFFFF is hexadecimal and is equal to 1 MiB and FFF0 is equal to 64 KiB minus 16 bytes.

Moral of the story: From Real mode a program can actually refer to (1 MiB + 64 KiB - 16) bytes of memory.

Notice the use of the word "refer" and not "access". A program can refer to this much memory but whether it can access it or not is dependent on the number of address lines actually present. So with the 8086 this was definitely not possible because when programs made references to 1 MiB plus memory, the address that was put on the address lines was actually more than 20-bits, and this resulted in wrapping around of the addresses.

For example, if a code is referring to 1 MiB, this will get wrapped around and point to location 0 in memory, likewise 1 MiB + 1 will wrap around to address 1 (or 0000:0001).

Now there were some super funky programmers around that time who manipulated this feature in their code, that the addresses get wrapped around and made their code a little faster and a few bytes shorter. Using this technique it was possible for them to access 32 KiB of top memory area (that is 32 KiB touching 1 MiB boundary) and 32 KiB memory of the bottom memory area, without actually reloading their segment registers!

Simple maths you see, if in Segment:Offset representation you make Segment constant, then since Offset is a 16-bit value therefore you can roam around in a 64 KiB (or 2 to the power 16) area of memory. Now if you make your segment register point to 32 KiB below the 1 MiB mark you can access 32 KiB upwards to touch 1 MiB boundary and then 32 KiB further which will ultimately get wrapped to the bottom most 32 KiB.

Now these super funky programmers overlooked the fact that processors with more address lines would be created. (Note: Bill Gates has been attributed with saying, "Who would need more than 640 KB memory?", and these programmers were probably thinking similarly.) In 1982, just 2 years after 8086, Intel released the 80286 processor with 24 address lines. Though it was theoretically backward compatible with legacy 8086 programs, since it also supported Real Mode, many 8086 programs did not function correctly because they depended on out-of-bounds addresses getting wrapped around to lower memory segments. So for the sake of compatibility IBM engineers routed the A20 address line (8086 had lines A0 - A19) through the Keyboard controller and provided a mechanism to enable/disable the A20 compatibility mode. Now if you are wondering why the keyboard controller, the answer is that it had an unused pin. Since the 80286 would have been marketed as having complete compatibility with the 8086 (that wasn't even yet out very long), upgraded customers would be furious if the 80286 was not bug-for-bug compatible such that code designed for the 8086 would operate just as well on the 80286, but faster.