Parallel Computing and Computer Clusters/Memory

In the early stages of single CPU machines the CPU would typically sit on a dedicated system bus between itself and the memory. Each memory access would pass along the bus and be returned from RAM directly. As the speed of CPUs increased the speed of the RAM and the bus speed also increased. Due to the electronics involved the cycle of memory access and RAM response became a bottleneck forcing the CPU to lose precious computing cycles waiting for the response to be returned (termed latency).

To overcome this latency some designs involved placing a memory controller on the system bus which took the requests from the CPU and returned the results - the memory controller would keep a copy (a cache) of recently accessed memory portions locally to itself and therefore being able to more rapidly respond to many requests involving sequential (such as program code) or locally scattered requests (such as variables regularly accessed by the program code).

As CPUs became even faster, the latency involved in retrieving RAM even from memory controller-cached areas became significantly costly. The next stage of development would see the memory controller being positioned inside the die cast of the CPU itself (in many cases leaving memory controller on the bus as a duplicate) and thus the CPU cache was born. As the CPU cache was on the CPU die cast, the bus between the CPU core and the CPU memory controller was far less in length and the memory controller could respond more rapidly, leaving less latency for the CPU until a non-cached element was required.

Still later and further engineering improvements meant the speed of the CPU cache began to mirror the speed of the CPU itself allowing each instruction cycle to go without wait states from it's cache. Unfortunately the cost of such designs were significant and so a second layer or level of cache was placed onto the die cast and made from the cheaper (older) designs: CPUs now had Level 1 and Level 2 caches. As the level 2 cache was cheaper it could be die cast in larger quantities than the level 1 cache and retain both reasonable production costs and savings over ordinary RAM access speeds as well as significant savings in populating the limited level 1 cache from the level 2 cache by comparison to RAM access speeds.

RAM and multiple MPUs

With the simplest of multiple MPU designs each MPU has no cache and sits directly on the bus which allows for RAM accesses (direct RAM access or through a memory controller). Extending from the single-CPU design, a multi-MPU design which only ever reads information can be simply designed: each MPU sends it's request on the bus but needs a way to identify from where the request came. To this end the request must be extended in some fashion to label the request & the most obvious identification is the slot in which the MPU sits (as indicated by the BIOS during the CPU's boot up process). Responses coming back to the MPU without it's ID in the response can be safely ignored. The identification process is often performed by a controller which is not a part of the multi-MPU make-up but instead a separate, specialist, MPU on the bus referred to as the Programmable Interrupt Controller (PIC).

Unfortunately, a system unable to write any data will be very limited in it's flexibility and must therefore be capable of writing the data back to the RAM. The mechanism for writing back to the RAM becomes, again, a simple extension on that of the single-CPU system: write back the RAM indicating the MPU slot the request came from and any acknowledgement responses can be correctly identified in the same way allowing for any individual MPU to know their write request succeeded.

Alas, there's one significant flaw: what happens if two (or more) MPUs work on the same piece of data at the same time? Each MPU reads the data from RAM, processes the data based on the program code the individual MPU is working on and writes back the result to RAM. The issue resides in which MPU performed the write request last as it is that MPU's result which will reside in RAM and not the first. To illustrate the issue more clearly, consider the following where the activity references the same memory location (a number).

Activity	MPU 1	MPU 2
Read memory from program code	Is instructed to increment RAM location 0001	Is instructed to decrement RAM location 0001
Read data from memory referenced by code	Receives a value of 1	Receives a value of 1
Process instruction	Calculates a result of 2	Calculates a result of 0
Read memory from program code	Is instructed to put result to RAM location 0002	Is instructed to put result to RAM location 0002
Process instruction	Puts the value 2 to RAM location 0002	Puts the value 0 to RAM location 0002

If this were an SMP system, both MPUs would be processing at the same speed and both would reach each phase of the pipeline at the same time. As a result, each would attempt to write their interpretation of the data back to RAM location 0002 at the same time. What would be the value of RAM location 0002 afterwards? Unfortunately, there's no guarantee unless some other method is put in place first to force this race condition not to occur.

Non-caching MPUs

In non-caching MPUs the MPU would request on the bus that a set range of memory be locked to prevent other processors from updating the memory region until the locking MPU has completed its own task. This approach is inefficient as the locking prevents a second (or subsequent) MPU from accessing the memory region.

Introducing caching

With the introduction of caches into the processors, the problem at first becomes more prevalent. Going back to a single-processor architecture with a cache to better illustrate the issue, the cache presents fast-access to a region of memory that has already been accessed. The MPU begins with a cache which is entirely marked as invalid and from there any request for a byte of RAM will automate a page of RAM to be loaded into the cache (filling only a portion of the cache). When the processor accesses a different section of non-cached RAM, that page of RAM is also loaded along-side the already cached page(s). The process continues until the cache is full of valid pages.

Once the cache is full and the processor requires access to a page of RAM not yet cached, it must make a decision as to where in it's cache it will write this new page: wherever the page is written to it will overwrite existing cached RAM. So, how does the processor decide? Two basic methods are used: most recent kept and random. In the first method the MPU's memory controller keeps track of the order in which the cached pages have been most recently accessed and chooses the eldest of the cached pages as the one to overwrite.

In particularly complex algorithms the eldest-accessed cached page may not be the most convenient to be removed: the code may jump around the RAM pages accessing very little from numerous different pages while rarely accessing the data set (by comparison to all program code) or vice versa. As a result of the jumping around in RAM the cache looses the data set (the most regularly-accessed single page of RAM) from it's cache. In this scenario, the processor would be better loosing any one of the infrequently accessed program code pages than the data set itself (note that there is a very big argument to state any program behaving in this manner is either poorly programmed or badly compiled).

In addition to the caching of data being read from RAM, the writing of data back to RAM can be similarly cached by the processor and then written as an entire page when the page is flushed. With a single-MPU system the caching of writes has no effect on other components in the system and can therefore take its time in writing back any modified cached pages (note that the RAM remains unchanged until the CPU writes the modified data back to RAM). This method is termed lazy write.

Simple cached writing

Back to a multiple-MPU system and the issue of using caches becomes far more obvious: when any processor writes to a portion of RAM the data is placed into its cache and not back into RAM. If another MPU attempts to read that same section of RAM the data it receives will be out of date. It is necessary to have the content of each of the MPU caches to be in tune with each other (or coherent, hence the term cache coherency). Various methods have been devised to ensure the caches remain as coherent as they can. The simplest of these methods involves ensuring any write operations push the data immediately back to the RAM via the bus with all MPUs in the architecture monitoring the bus for write operations: when a write operation is seen on the bus the MPU checks it's own cache for that same page of RAM and if it is cached the page is immediately re-loaded from RAM. This is a very expensive method for large parallel systems using common sub-sets of data: when a single MPU writes the data section each other MPU with the data cached must re-request the data. The method is vastly improved by the other MPUs reading the data off the bus as it is written back to RAM (termed snooping) thereby negating the need for a potential n - 1 requests for reading the same RAM location. In more advanced improvements, snooping is applied to all bus packets such that any single read operation can potentially become cached into all MPUs in the architecture.

Directory-based writing

Whether a cache is in use or not, a directory-based writing technique involves all RAM being controlled by a centrally located memory manager. The MPUs involved respect only the information given to them by the memory manager and requests for writes can come from the memory manager to an individual MPU requesting for the data to be written back immediately.

MOESI & predecesors

More advanced techniques exist for cache coherency and these begin with a similar set of abilities. Almost universally, each of these methods mark all pages of RAM as one of Modified, Shared or Invalid. More complex methods introduce the markings of Exclusive and/or Owner. The names of the methods are most commonly referred to by the initials of the methods implemented: MSI, MESI, MOSI and MOESI.

Status	Description
Modified	The page is marked as altered and resides in an MPU's cache but there is no valid copy in RAM. In a method where Owner status is also used, a Modified page is not required by any other MPU.
Owner	A complement to the Modified status where an otherwise Modified page is required and in use by other MPUs. In massively parallel systems the architecture's memory controller(s) hold information on which other MPUs require the page and are thereby able to distribute the necessary page across bus boundaries.
Exclusive	The MPU having the page is the only MPU with the page cached. No other MPU has requested or retained the page.
Shared	A page marked as shared indicates that one or more MPUs have the page in their cache. If the method utilises Exclusive, then Shared indicates that more than one MPU has the page.
Invalid	An Invalid cached page is a page which cannot be relied upon for valid data. An MPU typically starts out with only invalid pages (a few systems pre-load an MPU's cache with simple boot-up code prior to booting). Some MOESI implementations will mark a page as Invalid when the page is written to RAM.

Avoiding the race condition

No matter which of the memory management techniques are used (and many more variations and techniques exist than described above), there remain possibilities to generate invalid data in the caching. However, these possibilities are due to the nature of poor or misunderstood programming techniques and/or poor compilation. This is covered in depth under the software chapter.

Unified Memory Access

As the name suggests, Unified Memory Access (UMA) describes a method by which all accesses to memory are performed in a unified, equal, manner. In a UMA design the RAM is typically pooled together and often a memory controller is available to assist the CPU(s) in accessing the RAM. UMA is by far the most common method of implementing RAM access in a SMP.

Non-Unified Memory Access

Very popular in massively parallel processing architectures and clusters alike. The Non-Unified Memory Access (NUMA) architecture is a system (single machine or cluster) which houses a single RAM block which is distributed throughout the system. For example, a four-processor architecture could be made up of two pairs of processors, each pair having direct access to 2GB of RAM. In this scenario each pair is attached to a local bus and it's shared 2GB. Access to the other pair's 2GB is performed through an inter-bus link (typically a memory controller).

A more extreme example of a NUMA system is often employed in clusters which attempt to present a single-system image: each node in the cluster houses a portion of the system RAM with the cluster software being a localised memory controller. A process existing on one node of the cluster must access a memory portion housed on another node. By comparison to the first node accessing its own real RAM (through the localised cluster memory controller thereof), attempting to access the RAM portion on the alternate node involves a much more complex scenario: the first node must make the request of the second node to send the RAM across the network; the second node must check whether the portion is already being modified to another (third) node; if being modified, the second node must request the third node relinquishes the RAM access; the third node writes the modifications back to the second node and finally the second node releases the RAM portion to the requesting node. Should two nodes require frequent modification access to the same RAM portion, the results can impact severely on the cluster's ability to process data in a worthy fashion. Thankfully, software techniques exist (much the same as those described here) to lessen the impact of such a scenario.