Parallel Computing and Computer Clusters/Micro Processor Units

The marketplace has for years been comfortable with the term CPU (Central Processing Unit). The term stems from most machines having only a single MPU at work inside the machine at any one time: an MPU at its centre, and hence a Central (Micro) Processor Unit. Parallel computers are concerned with processors that differ from these MPUs, while computer clusters are concerned with using many such MPUs together. Either way, the whole becomes far more than a simple CPU.

The Old Fashioned Micro Processor Unit

To understand and appreciate what separates the different types of processor, it is best first to get a good idea of the processor in its most basic form.

A typical stand-alone computing unit (digital clock, scientific calculator, washing machine and many more) has at its heart a single central processing unit (CPU) - an MPU controlling almost everything. This CPU will control and perform the vast majority of tasks the computing unit performs. It performs each of its many tasks, one-by-one in turn and in sequence: it serialises everything. Take a washing machine for example: it may begin its cycle by ensuring the door is shut; locking the door; filling the washing machine; heating the water; inserting the powder; cycling the drum; inserting the fabric conditioner; cycling the drum; emptying the water; filling with water; cycling the drum; emptying the water and unlocking the door. Suppose it didn't do things in that order and instead began by filling the washing machine with water: would the washing still get cleaned or would the water pour immediately out of it? Things happen in sequence for a reason: to get the correct results.

The Modern Day Micro Processor Unit

For some years a single MPU has itself been performing multiple simultaneous activities in order to achieve better throughput. Early on, processor makers realised that making MPUs faster was not just a matter of increasing the clock speed of the MPU.

A single instruction in an MPU has a lot of work to do when broken down into its component parts. Even in its most over-simplified form, the MPU must 1) retrieve the instruction from the program code; 2) execute the instruction and, for many instructions, 3) place the result of the instruction back into a location somewhere. Each of these actions would take the MPU a single cycle. For an MPU capable of performing 2 cycles per second (more commonly written as 2 Hz), the simplest of instructions needs two cycles (fetch and execute) and so takes one second, while a simple instruction which also puts a result into some other location needs three cycles and so takes 1.5 seconds.
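The arithmetic above can be checked with a few lines of Python. This is only a sketch of the over-simplified model in the text, where every action costs exactly one cycle; the function name `instruction_time` is made up for illustration.

```python
def instruction_time(cycles_needed, clock_hz):
    """Seconds taken by one instruction when each action costs one cycle."""
    return cycles_needed / clock_hz


CLOCK_HZ = 2  # the 2 Hz MPU used as the example in the text

# Fetch + execute: two cycles.
print(instruction_time(2, CLOCK_HZ))  # 1.0

# Fetch + execute + store result: three cycles.
print(instruction_time(3, CLOCK_HZ))  # 1.5
```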

Engineers soon worked out that if the MPU design were to have each of these component parts separated and working relatively independently of each other, then it would be possible to increase the number of instructions performed per second without altering the speed of the MPU. For example, separate the instruction-fetching phase from the rest of the MPU, have it run simultaneously, and you have potentially doubled the number of instructions the MPU is capable of: while the instruction-processing portion of the MPU is processing the first instruction, the instruction-fetching portion is, in the same cycle, fetching the second instruction. The process of an instruction being fetched, worked upon and eventually reaching its conclusion is referred to as a pipeline.

Example instruction fetching

Cycle count	Single-pipeline MPU	Dual-pipeline MPU
1	Fetch instruction 1	Fetch instruction 1
2	Execute instruction 1	Execute instruction 1; fetch instruction 2
3	Fetch instruction 2	Execute instruction 2; fetch instruction 3
4	Execute instruction 2	Execute instruction 3; fetch instruction 4
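The table can be reproduced with a small simulation. The following is a minimal sketch, assuming just two stages (fetch and execute) that each take one cycle; the function names are hypothetical and the model ignores everything a real MPU has to deal with, such as stalls or instructions of differing length.

```python
def serial_schedule(n_instructions):
    """Fetch and execute share one unit, so they alternate cycle by cycle."""
    schedule = []
    for i in range(1, n_instructions + 1):
        schedule.append([f"Fetch instruction {i}"])
        schedule.append([f"Execute instruction {i}"])
    return schedule


def overlapped_schedule(n_instructions):
    """Fetch and execute are separate units working in the same cycle."""
    schedule = []
    for cycle in range(1, n_instructions + 2):
        work = []
        executing = cycle - 1  # the instruction fetched in the previous cycle
        if 1 <= executing <= n_instructions:
            work.append(f"Execute instruction {executing}")
        if cycle <= n_instructions:
            work.append(f"Fetch instruction {cycle}")
        schedule.append(work)
    return schedule


for cycle, work in enumerate(overlapped_schedule(4), start=1):
    print(cycle, "; ".join(work))
```

Under these assumptions four instructions need eight cycles when fetch and execute alternate, but only five when they overlap, which is where the "potentially doubled" figure in the text comes from as the instruction count grows.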

Symmetric Multi-Processor

The Symmetric Multi-Processor (SMP) machine consists of a series of MPUs in symmetry (hence the name). The individual MPUs are mirrors of each other (same speed, design and on-board capabilities such as cache and memory management) and are housed in a way which gives each MPU peer access to the peripherals of the machine as a whole (RAM, hard disks, video, etc). Due to the binary nature of computers in general, the number of MPUs in use is ordinarily a power of two (2, 4, 8, 16, 32, 64) in order to achieve binary symmetry for simple number manipulation and arithmetic.

The SMP design introduces a number of issues into the design of both the hardware and software to be utilised: booting the machine and access to peripheral resources, particularly RAM (discussed under RAM and multiple MPUs, below).

How to boot up an SMP

In a single-CPU machine the BIOS will boot up the first, and only, CPU with a section of its ROM code. The CPU will read and process that code, begin to initialise hardware peripherals and determine how it should continue to boot (which boot device to use, for example). In an SMP machine the process is much the same, with the exception that there are multiple CPUs to choose from. The BIOS does no picking and choosing but instead, typically, signals only the CPU which sits in slot 0. Another change from the single-CPU system is that the BIOS contains code for checking whether other CPUs exist and code for booting those other CPUs. In order for the other CPUs to be booted, the CPU in slot 0 must have been designed to understand how to perform the boot up of another CPU. Additionally, as this is a symmetric multi-processor, the CPUs in each of the other slots will be mirrors of the first and hence also contain that ability.
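The boot order described above can be sketched as a toy model. Everything here is an assumption made for illustration: the `Cpu` class, its methods and the `bios_boot` function are hypothetical and do not correspond to any real BIOS or firmware interface.

```python
class Cpu:
    """Toy model of one CPU in an SMP machine (hypothetical)."""

    def __init__(self, slot):
        self.slot = slot
        self.running = False

    def start(self):
        self.running = True

    def boot_others(self, cpus):
        """Run by the slot 0 CPU: start every other CPU that is present."""
        started = []
        for cpu in cpus:
            if cpu is not self and not cpu.running:
                cpu.start()
                started.append(cpu.slot)
        return started


def bios_boot(cpus):
    """The BIOS signals only slot 0; that CPU then boots the rest."""
    boot_cpu = cpus[0]
    boot_cpu.start()
    return [boot_cpu.slot] + boot_cpu.boot_others(cpus)


cpus = [Cpu(slot) for slot in range(4)]
print(bios_boot(cpus))  # [0, 1, 2, 3]
```

Because every `Cpu` object carries the same `boot_others` method, any of them could have taken the slot 0 role, which mirrors the point that symmetric CPUs all contain the ability to boot their peers.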

Asymmetric Multi-Processor

An Asymmetric Multi-Processor (unlikely to be referred to as AMP) machine is unlikely to present any form of symmetry and, by its nature, is the most difficult to describe accurately. Often there will be a central processor unit (or even units), but its role will be to ensure that the correct program code is given to the correct other processors in the system. Sometimes an asymmetric multi-processor machine will not be described as such; indeed, many arcade video games and modern home console machines fit into this category.

Parallel Vector Processor

These dedicated, specialised processors are designed to carry out specific tasks in parallel. Often cited in multiples of up to 64 and sometimes more, the Parallel Vector Processor typically has x-number of separate cores in each processor, and each core contains x-number of pipelines. Instruction sets are often simple, sometimes smaller than that of a common RISC processor, but incredibly powerful in their aim: an instruction to load data for work will load a whole data set as opposed to a single byte, word or double-word. Another single instruction will perform the same action on each element of the data set in parallel, thereby achieving x-times the performance of a similar RISC processor which works on a single data element at a time. Thus, with two instructions spanning two processor cycles, the processor could load and perform an instruction on x-many different data elements. The organisation of the data and the manner in which programs are compiled are likely to be as unique as the processors themselves.
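The difference in instruction count can be illustrated with a short sketch. It only counts instructions under the simplified model in the text (one load and one operation per element, versus one load and one operation per data set); Python itself still processes the elements one at a time, and `VECTOR_WIDTH` and both function names are hypothetical.

```python
VECTOR_WIDTH = 8  # hypothetical number of elements handled per instruction


def scalar_add_one(memory):
    """One load and one add instruction for every single element."""
    instructions = 0
    result = []
    for value in memory:
        register = value              # load a single element
        instructions += 1
        result.append(register + 1)   # operate on that one element
        instructions += 1
    return result, instructions


def vector_add_one(memory):
    """One load and one add instruction for every VECTOR_WIDTH elements."""
    instructions = 0
    result = []
    for start in range(0, len(memory), VECTOR_WIDTH):
        vector_register = memory[start:start + VECTOR_WIDTH]  # load a data set
        instructions += 1
        # Stands in for a single instruction applied to all elements at once.
        result.extend(v + 1 for v in vector_register)
        instructions += 1
    return result, instructions


data = list(range(8))
print(scalar_add_one(data)[1])  # 16
print(vector_add_one(data)[1])  # 2
```

Both functions return the same values; only the number of instructions issued differs, which is the x-times gain the text describes.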

Massively Parallel Processor

Strictly speaking not a processor type in its own right, the Massively Parallel Processor (MPP) is an architecture type in which multiple Parallel Vector Processors are packed onto individual daughter-boards, each with its own dedicated RAM and dedicated board-wide bus. Multiple daughter-boards are then slotted into the system on a dedicated processor-super-bus. The system as a whole has additional resources (storage capacity and often system RAM), which are joined to the processor-super-bus to form a complete system. The largest of supercomputers (the Earth Simulator and Big Blue, for example) almost always fall into this category.