Microprocessor Design/Computer Architecture

Von Neumann Architecture

In the early days of computing, programs were hard-wired, and memory was used only to store data. Reprogramming a computer meant changing hardware switches by hand, which took a great deal of time and was highly error-prone. To solve these problems, mathematician and computer scientist John von Neumann proposed what is now known as the von Neumann architecture, which stores programs in memory and thereby avoids the need to hard-wire them.

Microprocessor Execution

In a von Neumann architecture, a circuit called a microprocessor processes and executes program instructions. To execute a program, the microprocessor first fetches the program's instructions, and the data needed to run them, from memory. Then, the microprocessor decodes and separates the instructions and data and activates the components and pathways needed to run the program. Finally, the microprocessor executes the program, running through the instructions, manipulating the data, and storing the results.
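The fetch, decode, and execute steps can be sketched with a toy stored-program machine. The opcode names (LOAD, ADD, STORE, HALT), the (opcode, operand) tuple encoding, and the single accumulator are assumptions made purely for illustration; they do not correspond to any real processor's instruction set.

```python
# Toy stored-program machine: instructions and data share one memory list.
# The opcodes and the (opcode, operand) encoding are hypothetical.

def run(memory):
    pc = 0    # program counter: address of the next instruction
    acc = 0   # a single accumulator register
    while True:
        instruction = memory[pc]        # fetch
        pc += 1
        opcode, operand = instruction   # decode
        if opcode == "LOAD":            # execute
            acc = memory[operand]
        elif opcode == "ADD":
            acc += memory[operand]
        elif opcode == "STORE":
            memory[operand] = acc
        elif opcode == "HALT":
            return memory
        else:
            raise ValueError(f"unknown opcode {opcode!r}")

program = [
    ("LOAD", 5),    # address 0: acc = memory[5]
    ("ADD", 6),     # address 1: acc += memory[6]
    ("STORE", 7),   # address 2: memory[7] = acc
    ("HALT", 0),    # address 3: stop
    None,           # address 4: unused
    2,              # address 5: data
    3,              # address 6: data
    0,              # address 7: the result is stored here
]
result = run(program)
print(result[7])
```

Because the program and its data live in the same list, changing the program only requires writing different values into memory, which is the point of the stored-program idea.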

Control Units and Datapaths

This three-step process of fetching, decoding, and executing is typically implemented with two hardware components: a control unit and a datapath. If data is thought of as water, then the datapath acts as a canal with many branches, and the control unit acts as a series of locks. The control unit reads instructions fetched from memory and uses them to direct where data flows in the datapath. Along the way, different branches of the datapath contain different mechanisms for modifying and transforming the data flowing through them. These mechanisms are termed Arithmetic Logic Units (ALUs) (discussed in more depth later on), and they perform arithmetic operations (such as addition, subtraction, shifting, and inverting) and logic operations (such as AND and OR) on data flowing through the datapath.

Control units and datapaths are discussed in more detail in a later section, Control and Datapath.

Harvard Architecture

In a Harvard Architecture machine, the computer system's memory is separated into two discrete parts: data and instructions. In a pure Harvard system, the two different memories occupy separate memory modules, and instructions can only be executed from the instruction memory.

Many DSPs are modified Harvard architectures, designed to simultaneously access three distinct memory areas: the program instructions, the signal data samples, and the filter coefficients (often called the P, X, and Y memories).

In theory, such a three-way Harvard architecture can be three times as fast as a von Neumann architecture that is forced to read the instruction, the data sample, and the filter coefficient one at a time.
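The factor of three can be checked with a small worked calculation. This is an idealized sketch: it assumes every memory access takes exactly one cycle, that each filter tap needs exactly three accesses (instruction, data sample, coefficient), and that nothing else costs time. The function name and the 64-tap filter are hypothetical.

```python
def memory_cycles(taps, memories):
    """Memory-access cycles for one output sample of a `taps`-tap filter.

    Each tap needs 3 accesses (instruction, data sample, coefficient).
    With `memories` independent memories, up to `memories` of those
    accesses are assumed to complete in the same cycle.
    """
    accesses_per_tap = 3
    cycles_per_tap = -(-accesses_per_tap // memories)  # ceiling division
    return taps * cycles_per_tap

von_neumann = memory_cycles(taps=64, memories=1)    # one shared memory
harvard_3way = memory_cycles(taps=64, memories=3)   # P, X, and Y memories
print(von_neumann, harvard_3way, von_neumann / harvard_3way)
```

Real machines rarely reach the full ratio, since caches, pipelining, and non-memory work all change the balance.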

Modern Computers

Modern desktop computers, especially those based on the Intel x86 ISA, are not Harvard computers, although the newer variants have features that are "Harvard-like". Program instructions and data are all stored in the same RAM. However, a feature called "paging" divides memory into fixed-size blocks called "pages", and on processors that support a no-execute permission, each page can be marked as holding either executable instructions or non-executable data.

Modern embedded computers, however, are often based on a Harvard architecture. Instructions are stored in a different addressable memory block from the data, and in a pure Harvard design there is no way for the microprocessor to interchange data and instructions.

RISC and CISC and DSP

Historically, the first type of ISA (Instruction Set Architecture) was the complex instruction set computer (CISC), and the second type was the reduced instruction set computer (RISC). It is a common misunderstanding that RISC systems have a small ISA (fewer instructions) but make up for it with faster hardware. RISC systems actually have "reduced instructions", in the sense that each instruction does so little that it takes very little time to execute. It is an equally common misunderstanding that CISC systems have more instructions but pay a steep performance penalty for the added versatility. CISC systems actually have "complex instructions", in the sense that at least one instruction takes a long time to execute. For example, the "double indirect" addressing mode inherently requires two memory cycles to execute, and a few CPUs have a "string copy" instruction that may require hundreds of memory cycles to execute. MIPS and SPARC are examples of RISC computers. Intel x86 is an example of a CISC computer.

Some people group stack machines with the RISC machines; others[1] group stack machines with the CISC machines; still others[2][3] describe stack machines as neither RISC nor CISC.

Other ISA types include DSPs, stack machines, VLIW machines, MISC machines, TTA architectures, massively parallel processor arrays, etc.

We will discuss these terms and concepts in more detail later.

Microprocessor Components

Some of the common components of a microprocessor are:

Control Unit
I/O Units
Arithmetic Logic Unit (ALU)
Registers
Cache

A brief introduction to these components is placed below.

Control Unit

The control unit, as described above, reads the instructions and generates the digital signals needed to operate the other components. An instruction to add two numbers together, for instance, would cause the control unit to activate the addition module.

I/O Units

The processor needs to be able to communicate with the rest of the computer system. This communication occurs through the I/O ports, which interface with the system memory (RAM) and with the other peripherals of the computer.

ALU

The Arithmetic Logic Unit, or ALU, is the part of the microprocessor that performs arithmetic and logic operations. ALUs can typically add, subtract, multiply, and divide, and can perform logical operations on two numbers (AND, OR, NOR, NOT, etc.).
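As a rough software model, an ALU can be thought of as a function that takes an operation selector and two operands and returns one result. The sketch below assumes a 32-bit word and uses hypothetical operation names; a real ALU is combinational hardware and usually produces status flags as well, which are omitted here.

```python
MASK = 0xFFFFFFFF  # assumption: results are truncated to a 32-bit word

def alu(op, a, b=0):
    """Return the 32-bit result of applying operation `op` to a and b."""
    if op == "add":
        result = a + b
    elif op == "sub":
        result = a - b
    elif op == "and":
        result = a & b
    elif op == "or":
        result = a | b
    elif op == "nor":
        result = ~(a | b)
    elif op == "not":
        result = ~a          # unary: b is ignored
    else:
        raise ValueError(f"unsupported operation {op!r}")
    return result & MASK     # wrap around like fixed-width hardware

print(hex(alu("add", 0xFFFFFFFF, 1)))  # overflow wraps around
print(hex(alu("nor", 0x0F, 0xF0)))
```

The control unit's job, in this picture, is to supply the `op` argument based on the decoded instruction.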

The ALU is discussed in far more detail in a later chapter, ALU.

Registers

There are different kinds of registers. Hopefully it will be obvious which kind of register we are talking about from the context.

The most general meaning is a "hardware register": anything that can be used to store bits of information, in a way that all the bits of the register can be written to or read out simultaneously. Since registers outside of a CPU are also outside the scope of the book, this book will only discuss processor registers, which are hardware registers that happen to be inside a CPU. But usually we will refer to a more specific kind of register.

Registers are discussed in far more detail in a later chapter, Register File.

Programmer-Visible Registers

The programmer-visible registers, also called the user-accessible registers, also called the architectural registers, often simply called "the registers", are the registers that are directly encoded as part of at least one instruction in the instruction set.

The registers are the fastest accessible memory locations, and because they are so fast, there are typically very few of them. In most processors, there are 32 or fewer general-purpose registers. The size of the registers defines the size of the computer. For instance, a "32-bit computer" has registers that are 32 bits long. The length of a register is known as the word length of the computer.

There are several factors limiting the number of registers, including:

It is very convenient for a new CPU to be software-compatible with an old CPU. This requires the new chip to have exactly the same number of programmer-visible registers as the old chip.
Doubling the number of general-purpose registers requires adding another bit to each instruction field that selects a particular register. Each 3-operand instruction (one that specifies 2 source operands and a destination operand) would expand by 3 bits. Modern chip manufacturing processes could put a million registers on a chip; that would make each and every 3-operand instruction require 60 bits just to select the registers, not counting the bits required to specify what to do with those operands.
Adding more registers adds more wires to the critical path, adding capacitance, which reduces the maximum clock speed of the CPU.
Historically, CPUs were designed with few registers, because each additional register increased the cost of the CPU significantly. But now that modern chip manufacturing can put tens of millions of bits of storage on a single commodity CPU chip, this is less of an issue.
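The instruction-encoding cost in the list above can be worked out directly. The helper below is a hypothetical illustration that assumes every operand field is wide enough to name any one of the registers.

```python
def register_select_bits(num_registers, operands=3):
    """Bits needed in one instruction just to name its register operands."""
    # Smallest field width that can hold register numbers 0 .. num_registers - 1.
    bits_per_operand = (num_registers - 1).bit_length()
    return operands * bits_per_operand

for count in (16, 32, 64, 1_000_000):
    print(count, register_select_bits(count))
```

Going from 32 to 64 registers raises the total from 15 to 18 bits, which is the 3-bit growth described above, and a million registers needs 20 bits per operand, or 60 bits per 3-operand instruction.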

Microprocessors typically contain a large number of registers, but only a small number of them are accessible by the programmer. The registers that can be used by the programmer to store arbitrary data, as needed, are called general-purpose registers. Registers that cannot be accessed by the programmer directly are sometimes called reserved registers.

Some computers have highly specialized registers: memory addresses always come from the program counter or "the" index register or "the" stack pointer; one ALU input is always hooked to data coming from memory, and the other ALU input is always hooked to "the" accumulator; and so on.

Other computers have more general-purpose registers: any instruction that accesses memory can use any address register as an index register or as a stack pointer, and any instruction that uses the ALU can use any data register.

Still other computers have completely general-purpose registers: any register can be used as data or as an address in any instruction, without restriction.

Microarchitectural Registers

Besides the programmer-visible registers, all CPUs have other registers that are not programmer-visible, called "microarchitectural registers" or "physical registers".

These registers include:

memory address register
memory data register
instruction register
microinstruction register
microprogram counter
pipeline registers
extra physical registers to support register renaming
the prefetch input queue
writable control stores (we will discuss the control store in Microprocessor Design/Control Unit and Microprocessor Design/Microcode)
Some people consider on-chip cache to be part of the microarchitectural registers; others consider it "outside" the CPU.

There is a wide variety of ways to implement any one instruction set. The vast majority of these microarchitectural registers are technically not "necessary". A designer could choose to design a CPU that had almost no physical registers other than the programmer-visible registers. However, many designers choose to design a CPU with lots of physical registers, using them in ways that make the CPU execute the same instruction set much faster than a CPU that lacks those registers.

Cache

Most CPUs manufactured do not have any cache.

Cache is memory that is located on the chip, but that is not considered registers. The cache is used because reading external memory is very slow (compared to the speed of the processor), and reading a local cache is much faster. In modern processors, the cache can take up as much as 50% or more of the total area of the chip. The following table shows the relationship between different types of memory:

smallest                      largest
Registers       cache         RAM
fastest                       slowest

Cache typically comes in 2 or 3 "levels", depending on the chip. Level 1 (L1) cache is smaller and faster than Level 2 (L2) cache, which is larger and slower. Some chips have Level 3 (L3) cache as well, which is larger still than the L2 cache (although L3 cache is still much faster than external RAM).

We discuss cache in far more detail in a later chapter, Cache.

Endian

Different computers order their multi-byte data words (i.e., 16-, 32-, or 64-bit words) in different ways in RAM. Each individual byte in a multi-byte word is still separately addressable. Some computers order their data with the most significant byte of a word in the lowest address, while others order their data with the most significant byte of a word in the highest address. There is logic behind both approaches, and this was formerly a topic of heated debate.

This distinction is known as endianness. Computers that order data with the least significant byte in the lowest address are known as "Little Endian", and computers that order data with the most significant byte in the lowest address are known as "Big Endian". It is easier for a human (typically a programmer) to read multi-byte data dumped to a screen one byte at a time if it is ordered as Big Endian. To others, however, it makes more sense to store the least significant byte at the lowest address.

When using a computer, this distinction is typically transparent; that is, the user cannot tell the difference between computers that use the different formats. However, difficulty arises when different types of computers attempt to communicate with one another over a network.

On a big-endian machine such as the 68K,

       address increases > ------ >
       data   : 74 65 73 74 00 00 00 05

is the string "test" followed by the 32-bit integer 5. A little-endian machine such as the x86 would interpret the last four bytes as the integer 0x0500_0000.
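The two interpretations of those eight bytes can be reproduced in a few lines. This sketch uses Python's built-in `int.from_bytes`; the variable names are illustrative only.

```python
# The same eight bytes as in the dump above, in increasing address order.
data = bytes([0x74, 0x65, 0x73, 0x74, 0x00, 0x00, 0x00, 0x05])

text = data[:4].decode("ascii")                          # text is read byte by byte
as_big = int.from_bytes(data[4:], byteorder="big")       # 68K-style reading
as_little = int.from_bytes(data[4:], byteorder="little") # x86-style reading

print(text, as_big, hex(as_little))
```

The text comes out the same either way because each character is a single byte; only the multi-byte integer depends on byte order.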

When communicating over a network composed of both big-endian and little-endian machines, the network hardware should apply the Address Invariance principle to avoid scrambling text (the NUXI problem). High-level software should format packets of data to be transmitted over the network in Network Byte Order. High-level software should also be written to be "endian clean", always reading and writing 16-bit integers as whole 16-bit integers, 32-bit integers as whole 32-bit integers, and so on, so that no changes are needed to re-compile it for big-endian or little-endian machines. Software that is not "endian clean", such as software that writes integers but then reads them back as 8-bit octets or as integers of some other length, usually fails when re-compiled for another computer.
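A minimal sketch of the "endian clean" idea, using Python's standard `struct` module: the `!` prefix in the format string selects network (big-endian) byte order regardless of the host machine. The function names and the choice of a 32-bit length field are hypothetical.

```python
import struct

def encode_length(n):
    """Pack a 32-bit unsigned integer in network byte order."""
    return struct.pack("!I", n)

def decode_length(packet):
    """Unpack a 32-bit unsigned integer that was sent in network byte order."""
    (n,) = struct.unpack("!I", packet)
    return n

wire = encode_length(5)
print(wire.hex(), decode_length(wire))
```

Because the integer is always written and read as a whole 32-bit value in one declared byte order, the same code gives the same bytes on the wire on both big-endian and little-endian hosts.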

A few computers, including nearly all DSPs, are "neither-endian". They always read and write complete aligned words, and don't have any hardware for dealing with individual bytes. Systems built on top of such computers often do have a particular endianness, but that endianness is written into the software and can be switched by re-compiling for the opposite endianness.

Stack

A stack is a block of memory that is used as a scratchpad area. The stack is a sequential set of memory locations that is set to act like a LIFO (last in, first out) buffer. Data is added to the top of the stack in a "push" operation, and the top data item is removed from the stack during a "pop" operation. Most computer architectures include at least a register that is usually reserved for the stack pointer.
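The push and pop operations can be modeled with a block of memory and a stack pointer. The sketch below assumes a stack that grows toward higher addresses and a stack pointer that indexes the next free slot; real architectures differ on both choices, and the class name is hypothetical.

```python
class Stack:
    """LIFO buffer built from a fixed block of memory and a stack pointer."""

    def __init__(self, size):
        self.memory = [0] * size
        self.sp = 0  # assumption: sp is the index of the next free slot

    def push(self, value):
        if self.sp >= len(self.memory):
            raise OverflowError("stack overflow")
        self.memory[self.sp] = value
        self.sp += 1

    def pop(self):
        if self.sp == 0:
            raise IndexError("stack underflow")
        self.sp -= 1
        return self.memory[self.sp]

stack = Stack(size=8)
for value in (1, 2, 3):
    stack.push(value)
popped = [stack.pop(), stack.pop()]
print(popped, stack.sp)
```

The last value pushed is the first one popped, which is what makes the structure suitable for nested subroutine calls.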

Some microprocessors include a small hardware stack built into the CPU, independent from the rest of the RAM.

Some people claim that a processor must have a hardware stack in order to run C programs.[1]

Most computer architectures have hardware support for a recursive "call" instruction in their assembly language. Some architectures (such as the ARM, the Freescale RS08, etc.) implement "call" like this:

The "call" instruction places a return address in a link register and jumps to the subroutine. A separate instruction near the beginning of the subroutine pushes the contents of the link register onto a stack in main memory, freeing up the link register so that the subroutine can then recursively call other subroutines.

Some architectures (such as the 6502, the x86, etc.) implement "call" like this:

The "call" instruction pushes a return address onto the stack in main memory and jumps to the subroutine.

A few architectures (such as the PIC16, the RISC I processor, the Novix NC4016, many LISP machines, etc.) implement "call" like this:

The "call" instruction pushes a return address into a dedicated return stack, separate from main memory, and jumps to the subroutine.
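The difference between the link-register style and the stack style of "call" can be sketched with two toy CPU models. Both classes, their method names, and the assumption that every instruction occupies exactly one address are hypothetical simplifications, not a description of any real instruction set.

```python
class LinkRegisterCPU:
    """'call' places the return address in a link register."""

    def __init__(self):
        self.pc = 0
        self.lr = 0
        self.memory_stack = []   # stands in for a stack in main memory

    def call(self, target):
        self.lr = self.pc + 1    # assumption: one address per instruction
        self.pc = target

    def save_link(self):         # a separate instruction in the subroutine
        self.memory_stack.append(self.lr)

    def restore_link(self):
        self.lr = self.memory_stack.pop()

    def ret(self):
        self.pc = self.lr


class StackCallCPU:
    """'call' pushes the return address straight onto a stack."""

    def __init__(self):
        self.pc = 0
        self.stack = []

    def call(self, target):
        self.stack.append(self.pc + 1)
        self.pc = target

    def ret(self):
        self.pc = self.stack.pop()


def nested_call_with_link_register():
    cpu = LinkRegisterCPU()
    cpu.pc = 10
    cpu.call(100)        # outer call: lr holds 11
    cpu.save_link()      # free the link register before calling again
    cpu.pc = 105         # pretend the subroutine ran up to address 105
    cpu.call(200)        # inner call: lr holds 106
    cpu.ret()            # back in the outer subroutine at 106
    cpu.restore_link()   # lr holds 11 again
    cpu.ret()
    return cpu.pc


def nested_call_with_stack():
    cpu = StackCallCPU()
    cpu.pc = 10
    cpu.call(100)        # stack holds [11]
    cpu.pc = 105
    cpu.call(200)        # stack holds [11, 106]
    cpu.ret()            # back at 106
    cpu.ret()
    return cpu.pc


print(nested_call_with_link_register(), nested_call_with_stack())
```

The dedicated-return-stack style behaves like `StackCallCPU` in this model; the difference is that its stack lives in storage separate from main memory rather than inside it.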