Microprocessor Design/Register Renaming

Introduction

In a register machine, programs are composed of instructions which operate on values. The instructions must name these values in order to distinguish them from one another. A typical instruction might say, "add X and Y and put the result in Z." In this instruction, X, Y, and Z are the names of storage locations.

In order to have a compact instruction encoding, most processor instruction sets have a small set of special locations which can be directly named. For example, the x86 instruction set architecture has 8 integer registers, the x86-64 set architecture has 16, many RISC microprocessors have 32, and the IA-64 instruction set architecture has 128. In smaller processors, the names of these locations correspond directly to elements of a register file.

Different instructions may take different amounts of time; for example, a processor may be able to execute hundreds of instructions while a single load from the main memory is in progress. Shorter instructions executed while the load is outstanding will finish first, thus the instructions will be finishing out of the original program order. Out-of-order execution has been used in most recent high-performance CPU's to achieve some of their speed gains.

When possible, an assembly language compiler will try to assign distinct instructions to different registers. However, there is a finite number of register names that can be used in assembly code, as limited by the specific microprocessor architecture. Many high performance CPU's have more physical registers than may be named directly in the instruction set, so they rename registers in hardware to achieve additional parallelism.

Data hazards

When more than one instruction references a particular location for an operand, either reading it (as an input) or writing to it (as an output), executing those instructions in an order different from the original program order can lead to three kinds of data hazards:

Read-after-write (RAW) Error: A read from a register or memory location must return the value placed there by the last write in program order, not some other write. This is referred to as a true dependency or flow dependency, and requires the instructions to execute in program order.
Write-after-write (WAW) Error: Successive writes to a particular register or memory location must leave that location containing the result of the second write. This can be resolved by squashing (synonyms: cancelling, annulling, mooting) the first write if necessary. WAW dependencies are also known as output dependencies.
Write-after-read (WAR) Error: A read from a register or memory location must return the last prior value written to that location, and not one written programmatically after the read. This is a sort of false dependency that can be resolved by register renaming. WAR dependencies are also known as anti-dependencies.

Instead of delaying the write until all reads are completed, two copies of the location can be maintained: the old value and the new value. If reads precede in program order, the write of the new value can be provided with the old value, even while other reads that follow the write are provided with the new value. The false dependency is broken and additional opportunities for out-of-order execution are created. When all reads that need the old value have been satisfied, it can be discarded. This is the essential concept behind register renaming.

Anything that is read and written can be renamed. While the general-purpose and floating-point registers are discussed the most, flag and status registers or even individual status bits are commonly renamed as well. Memory locations can also be renamed, although it is not commonly done to the extent practiced in register renaming. An example is the Transmeta Crusoe processor's gated store buffer, which is a form of memory renaming.

If programs refrained from reusing registers immediately, there would be no need for register renaming. Some instruction sets (e.g., IA-64) specify very large numbers of registers for specifically this reason. However, there are limitations to this approach:

It is very difficult for the compiler to avoid reusing registers without large code size increases. In loops, for instance, successive iterations would have to use different registers, which requires replicating the code in a process called loop unrolling (but see register rotation).
Large numbers of registers require more bits for specifying a register as an operand in an instruction, resulting in increased code size.
Many instruction sets historically specified smaller numbers of registers and cannot be changed now.

Code size increases are important because, when the program code is larger, the instruction cache misses more often and the processor stalls waiting for new instructions.

Architectural vs physical registers

Machine language programs specify reads and writes to a limited set of registers specified by the instruction set architecture (ISA). For instance, the DEC Alpha ISA specifies 32 integer registers, each 64 bits wide, and 32 floating-point registers, each 64 bits wide. These are the architectural registers. Programs written for processors running the Alpha instruction set will specify operations reading and writing those 64 registers. If a programmer stops the program in a debugger, they can observe the contents of these 64 registers (and a few status registers) to determine the progress of the machine.

One particular processor which implements this ISA, the Alpha 21264, has 80 integer and 72 floating-point physical registers. There are, on an Alpha 21264 chip, 80 physically separate locations which can store the results of integer operations, and 72 locations which can store the results of floating point operations.^[1]

The following text describes two styles of register renaming, which are distinguished by the circuit which holds the data ready for an execution unit.

In all renaming schemes, the machine converts the architectural registers referenced in the instruction stream into tags. Where the architectural registers might be specified by 3 to 5 bits, the tags are usually a 6 to 8 bit number. The rename file must have a read port for every input of every instruction renamed every cycle, and a write port for every output of every instruction renamed every cycle. Because the size of a register file generally grows as the square of the number of ports, the rename file is usually physically large and consumes significant power.

In the tag-indexed register file style, there is one large register file for data values, containing one register for every tag. For example, if the machine has 80 physical registers, then it would use 7 bit tags. 48 of the possible tag values in this case are unused.

In this style, when an instruction is issued to an execution unit, the tags of the source registers are sent to the physical register file, where the values corresponding to those tags are read and sent to the execution unit.

In the reservation station style, there are many small associative register files, usually one at the inputs to each execution unit. Each operand of each instruction in an issue queue has a place for a value in one of these register files.

In this style, when an instruction is issued to an execution unit, the register file entries corresponding to the issue queue entry are read and forwarded to the execution unit.

Architectural Register File or Retirement Register File (RRF): The committed register state of the machine. RAM indexed by logical register number. Typically written into as results are retired or committed out of a reorder buffer.
Future File: The most speculative register state of the machine. RAM indexed by logical register number.
Active Register File: The Intel P6 group's term for Future File.
History Buffer: Typically used in combination with a future file. Contains the "old" values of registers that have been overwritten. If the producer is still in flight it may be RAM indexed by history buffer number. After a branch misprediction must use results from the history buffer—either they are copied, or the future file lookup is disabled and the history buffer is indexed by logical register number in "content-addressable memory" (CAM).
Reorder Buffer (ROB): A structure that is sequentially (circularly) indexed on a per-operation basis, for instructions in flight. It differs from a history buffer because the reorder buffer typically comes after the future file (if it exists) and before the architectural register file.

Reorder buffers can be data-less or data-ful.

In Willamette's ROB, the ROB entries point to registers in the physical register file (PRF), and also contain other book keeping. This was also the first Out of Order design done by Andy Glew, at Illinois with HaRRM.

In the P6's ROB, the ROB entries contain data; there is no separate PRF. Data values from the ROB are copied from the ROB to the RRF at retirement.

There is, however, one small detail: if there is temporal locality in ROB entries (i.e., if instructions close together in the von Neumann instruction sequence write back close together in time, it may be possible to perform write combining on ROB entries and so have fewer ports than a separate ROB/PRF would). It is not clear if it makes a difference, since a PRF should be banked.

ROBs usually don't have associative logic, and certainly none of the ROBs designed by Andy Glew have CAM (see above). However, designer Keith Diefendorff has, for many years, insisted that ROBs have complex associative logic. The first ROB proposal, on the other hand, may have utilized CAM.

Tag-indexed register file scheme

In the renaming stage, every architectural register referenced (for read or write) is looked up in an architecturally-indexed remap file. This file returns a tag and a ready bit. The tag is non-ready if there is a queued instruction which will write to it that has not yet executed. For read operands, this tag takes the place of the architectural register in the instruction. For every register write, a new tag is pulled from a free tag FIFO, and a new mapping is written into the remap file, so that future instructions reading the architectural register will refer to this new tag. The tag is marked as unready, because the instruction has not yet executed. The previous physical register allocated for that architectural register is saved with the instruction in the reorder buffer, which is a FIFO that holds the instructions in program order between the decode and graduation stages.

The instructions are then placed in various issue queues. As instructions are executed, the tags for their results are broadcast, and the issue queues match these tags against the tags of their non-ready source operands. A match means that the operand is ready. The remap file also matches these tags, so that it can mark the corresponding physical registers as ready. When all the operands of an instruction in an issue queue are ready, that instruction is ready to issue. The issue queues pick ready instructions to send to the various functional units each cycle. Non-ready instructions stay in the issue queues. This unordered removal of instructions from the issue queues can make them large and power-consuming.

Issued instructions are read from a tag-indexed physical register file (bypassing just-broadcast operands) and then execute. Execution results are written to tag-indexed physical register file, as well as broadcast to the bypass network preceding each functional unit. Graduation puts the previous tag for the written architectural register into the free queue so that it can be reused for a newly decoded instruction.

An exception or branch misprediction causes the remap file to back up to the remap state at last valid instruction via combination of state snapshots and cycling through the previous tags in the in-order pre-graduation queue. Since this mechanism is required, and since it can recover any remap state (not just the state before the instruction currently being graduated), branch mispredictions can be handled before the branch reaches graduation, potentially hiding the branch misprediction latency. This is the renaming style used in the MIPS R10000, the Alpha 21264, and the FP section of the AMD Athlon.

Reservation stations scheme

In the renaming stage, every architectural register referenced for reads is looked up in both the architecturally-indexed future file and the rename file. The future file read gives the value of that register, if there is no outstanding instruction yet to write to it (i.e., it's ready). When the instruction is placed in an issue queue, the values read from the future file are written into the corresponding entries in the reservation stations. Register writes in the instruction cause a new, non-ready tag to be written into the rename file. The tag number is usually serially allocated in instruction order—no free tag FIFO is necessary.

Just as with the tag-indexed scheme, the issue queues wait for non-ready operands to see matching tag broadcasts. Unlike the tag-indexed scheme, matching tags cause the corresponding broadcast value to be written into the issue queue entry's reservation station.

Issued instructions read their arguments from the reservation station, bypass just-broadcast operands, and then execute. As mentioned earlier, the reservation station register files are usually small, with perhaps eight entries.

Execution results are written to the reorder buffer, to the reservation stations (if the issue queue entry has a matching tag), and to the future file if this is the last instruction to target that architectural register (in which case register is marked ready).

Graduation copies the value from the reorder buffer into the architectural register file. The sole use of the architectural register file is to recover from exceptions and branch mispredictions.

Exceptions and branch mispredictions, recognised at graduation, cause the architectural file to be copied to the future file, and all registers marked as ready in the rename file. There is usually no way to reconstruct the state of the future file for some instruction intermediate between decode and graduation, so there is usually no way to do early recovery from branch mispredictions. This is the style used in the integer section of the AMD K7 and K8 designs.

Comparison between the schemes

In both schemes, instructions are inserted in-order into the issue queues, but are removed out-of-order. If the queues do not collapse empty slots, then they will either have many unused entries, or require some sort of variable priority encoding for when multiple instructions are simultaneously ready to go. Queues that collapse holes have simpler priority encoding, but require simple but large circuitry to advance instructions through the queue.

Reservation stations have better latency from rename to execute, because the rename stage finds the register values directly, rather than finding the physical register number, and then using that to find the value. This latency shows up as a component of the branch misprediction latency.

Reservation stations also have better latency from instruction issue to execution, because each local register file is smaller than the large central file of the tag-indexed scheme. Tag generation and exception processing are also simpler in the reservation station scheme, as discussed below.

The physical register files used by reservation stations usually collapse unused entries in parallel with the issue queue they serve, which makes these register files larger in aggregate, and consume more power, and more complicated than the simpler register files used in a tag-indexed scheme. Worse yet, every entry in each reservation station can be written by every result bus, so that a reservation-station machine with, e.g., 8 issue queue entries per functional unit will typically have 9 times as many bypass networks as an equivalent tag-indexed machine. Result forwarding thus consumes much more power and area than in a tag-indexed design.

Furthermore, the reservation station scheme has four places (Future File, Reservation Station, Reorder Buffer and Architectural File) where a result value can be stored, whereas the tag-indexed scheme has just one (the physical register file). Because the results from the functional units, broadcast to all these storage locations, must reach a much larger number of locations in the machine than in the tag-indexed scheme, this function consumes more power, area, and time. Still, in machines equipped with very accurate branch prediction schemes and if execute latencies are a major concern, reservation stations can work remarkably well.

Footnotes

↑ In fact, there are even more locations than that, but those extra locations are not germane to the register renaming operation.)

[1] In fact, there are even more locations than that, but those extra locations are not germane to the register renaming operation.)

[1]