C++ Programming/Threading

Multi-tasking

Multi-tasking is a method by which multiple tasks (also known as processes) share common processing resources such as a CPU.

A computer with a single CPU will run only one process at a time. Running means that, at a specific point in time, the CPU is actively executing instructions for that process. On a single CPU, systems that use scheduling can still achieve multi-tasking: the processor's time is shared among several processes, permitting each to advance its computation, seemingly in parallel. A process runs for some time, and then another waiting process gets a turn.

The act of reassigning a CPU from one task to another one is called a context switch. When context switches occur frequently enough, the illusion of parallelism is achieved.

Note:
Context switching has a cost; when deciding to use multi-tasking, a programmer must be aware of the performance trade-offs.

Even on computers with more than one CPU (multiprocessor machines), multi-tasking allows many more tasks to be run than there are CPUs.

Operating systems may adopt one of many different scheduling strategies, which generally fall into the following categories:

In multiprogramming systems, the running task keeps running until it performs an operation that requires waiting for an external event (e.g. reading from a tape) or until the computer's scheduler forcibly swaps the running task out of the CPU. Multiprogramming systems are designed to maximize CPU usage.
In time-sharing systems, the running task is required to relinquish the CPU, either voluntarily or because of an external event such as a hardware interrupt. Time-sharing systems are designed to allow several programs to execute apparently simultaneously. The term time-sharing, once used to describe this behavior, is no longer in common use, having been replaced by the term multi-tasking.
In real-time systems, some waiting tasks are guaranteed to be given the CPU when an external event occurs. Real-time systems are designed to control mechanical devices such as industrial robots, which require timely processing.

Multi-tasking has already been successfully integrated into current operating systems. Most computers in use today support running several processes at a time. This is required for systems using symmetric multiprocessing (SMP), for distributed computing, and for multi-core or chip multiprocessor (CMP) computing, where processors have gone from dual-core to quad-core and the number of cores will continue to increase. Each technology has its specific limitations and applicability, but all of them share the common objective of performing concurrent processing.

Note:
Due to the general adoption of this paradigm, it has become extremely important to prepare your code for it (plan for scalability), understand the guarantees regarding parallelization, and select external libraries that provide the required support.

Processes

Processes are independent execution units that contain their own state information, use their own address spaces, and only interact with each other via inter-process communication (IPC) mechanisms. A process can be said to contain at least one thread of execution (not to be confused with a complete thread construct). Processes are managed by the hosting OS in a process data structure. The maximum number of processes that can run concurrently depends on the OS and on the available resources of that system.

Child Process

A child process (also called a spawned process) is a process that was created by another process (the parent process), inheriting most of the parent's attributes, such as open files. Each process may create many child processes but will have at most one parent process; if a process does not have a parent, this usually indicates that it was created directly by the kernel.

In UNIX, a child process is in fact created (using fork) as a copy of the parent. The child process can then overlay itself with a different program (using exec) as required. The very first process, called init, is started by the kernel at boot time and never terminates; other parentless processes may be launched to carry out various daemon tasks in userspace. A process can also end up without a parent if its parent dies, leaving an orphan process; in this case it will shortly be adopted by init.
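
The fork-then-wait pattern described above can be sketched as follows. This is a minimal illustration that assumes a POSIX system; the function name run_child_and_wait is hypothetical, and a real program would usually call one of the exec functions in the child instead of exiting immediately.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper: forks a child that exits with `code`.
// The parent waits for it and returns the child's exit status, or -1 on error.
int run_child_and_wait(int code) {
    pid_t pid = fork();
    if (pid < 0) {
        return -1;  // fork failed, no child was created
    }
    if (pid == 0) {
        // Child: starts as a copy of the parent.
        // A call to an exec function would normally go here.
        _exit(code);
    }
    // Parent: block until the child terminates.
    int status = 0;
    if (waitpid(pid, &status, 0) < 0) {
        return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling run_child_and_wait(7) from main should return 7 in the parent once the child has exited.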

Inter-Process Communication (IPC)

IPC is generally managed by the operating system.

Shared Memory

Most recent OSs provide some sort of memory protection. In a Unix system, each process is given its own virtual address space, and the system, in turn, guarantees that no process can access the memory area of another. If an error occurs in a process, only that process's memory contents can be corrupted.

Shared memory addresses the need for random access to shared data from different processes. But declaring a given section of memory as simultaneously accessible by several processes raises the need for control and synchronization, since several processes might try to alter this memory area at the same time.
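
As a rough sketch of the idea, the following maps an anonymous shared region and lets a child process write into it. It assumes a POSIX-like system that supports MAP_ANONYMOUS (Linux, the BSDs, macOS); the function name read_value_written_by_child is hypothetical. Waiting for the child to exit is the only synchronization used here, which is enough for this simple case but not for processes that run side by side.

```cpp
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper: the child stores `value` in shared memory,
// the parent reads it back after the child has exited.
int read_value_written_by_child(int value) {
    void* mem = mmap(nullptr, sizeof(int), PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) {
        return -1;
    }
    int* shared = static_cast<int*>(mem);
    *shared = 0;

    pid_t pid = fork();
    if (pid < 0) {
        munmap(mem, sizeof(int));
        return -1;
    }
    if (pid == 0) {
        *shared = value;  // visible to the parent because the mapping is shared
        _exit(0);
    }
    waitpid(pid, nullptr, 0);  // the child has finished writing after this point
    int result = *shared;
    munmap(mem, sizeof(int));
    return result;
}
```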

Multi-threading

Before C++11, the C++ standard did not include any specification or built-in support for multi-threading. Threading therefore had to be implemented using special threading libraries, which are often platform dependent, as an extension to the C++ standard.

Note:
The C++11 standard (known as C++0x during its development) supports multi-threading, reducing the need to know multiple APIs and increasing the portability of code.
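
A minimal sketch of the standard threading facilities, assuming a C++11 or later compiler (and, on many platforms, linking with the thread library, for example -pthread). The function name squares_in_parallel is hypothetical. Each thread writes only to its own element, so no locking is needed in this particular case.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper: computes i * i for i in [0, n), one thread per element.
std::vector<std::size_t> squares_in_parallel(std::size_t n) {
    std::vector<std::size_t> out(n);
    std::vector<std::thread> workers;
    workers.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        // Each thread touches a distinct element of `out`.
        workers.emplace_back([&out, i] { out[i] = i * i; });
    }
    for (std::thread& t : workers) {
        t.join();  // wait for every thread before reading the results
    }
    return out;
}
```

One thread per element is far too fine-grained for real work; it is used here only to keep the example short.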

Some popular C++ threads libraries include:
(This list is not intended to be complete.)

Boost - This package includes several libraries, one of which is threads (concurrent programming). The Boost threads library is not very full featured, but it is complete, portable, robust and in the flavor of the C++ standard. It uses the Boost license, which is similar to the BSD license.
Intel® Threading Building Blocks (TBB) offers a rich approach to expressing parallelism in a C++ program. The library helps you take advantage of multi-core processor performance without having to be a threading expert. Threading Building Blocks is not just a threads-replacement library. It represents a higher-level, task-based parallelism that abstracts platform details and threading mechanisms for performance and scalability. It is an open source project under the GNU General Public License version two (GPLv2) with the runtime exception.
Intel® Cilk™ Plus adds simple language extensions to the C and C++ languages to express task and data parallelism. These language extensions are powerful, yet easy to apply and use in a wide range of applications.
Adaptive Communication Environment (often referred to as ACE) - Another toolkit which includes a portable threads abstraction along with many other facilities, all rolled into one library. Open source released under a nonstandard but nonrestrictive license.
Zthreads - A portable thread abstraction library. This library is feature rich, deals only with concurrency and is open source licensed under the MIT license.

Of course, you can access the full POSIX threads interface and the C language threads interface from C++, and on Windows the native threading API. So why bother with a library on top of that?

The reason is that things like locks are resources that are allocated, and C++ provides abstractions to make managing these things easier. For instance, boost::scoped_lock<> uses object construction/destruction to ensure that a mutex is unlocked when leaving the lexical scope of the object. Classes like this can be very helpful in preventing deadlock, race conditions, and other problems unique to threaded programs. These libraries also enable you to write cross-platform multi-threading code, which platform-specific functions cannot offer.
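
The scope-based unlocking described above can be sketched with std::lock_guard from the C++11 standard library, used here in place of the Boost class named in the text. The class SafeLog and its members are hypothetical names chosen for illustration.

```cpp
#include <cstddef>
#include <mutex>
#include <stdexcept>
#include <vector>

// Hypothetical container whose operations are protected by one mutex.
class SafeLog {
public:
    void add(int value) {
        std::lock_guard<std::mutex> guard(mutex_);  // locks here
        if (value < 0) {
            throw std::invalid_argument("negative value");
        }
        values_.push_back(value);
    }  // guard's destructor unlocks mutex_, on normal return and on throw

    std::size_t size() const {
        std::lock_guard<std::mutex> guard(mutex_);
        return values_.size();
    }

private:
    mutable std::mutex mutex_;
    std::vector<int> values_;
};
```

Because the unlock happens in a destructor, an exception thrown inside add cannot leave the mutex locked, which is the kind of mistake that is easy to make with explicit lock and unlock calls.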

In any case, threading methodology dictates that you must first identify hotspots, the segments of code that take the most execution time. To have the best chance of achieving the maximum possible performance, the task can be approached both bottom-up and top-down to determine which code segments can run in parallel.

In the bottom-up approach, one focuses solely on the hotspots in the code. This requires a deep analysis of the application's call stack to determine which sections of code can be run in parallel to reduce the hotspots. Even in hotspot sections that already employ concurrency, it may still be necessary to move that concurrency to a point higher up in the call stack, so as to increase the amount of work done by each thread.

In the top-down approach, the focus is on all parts of the application, determining which computations can be coded to run in parallel at a higher level of abstraction. The level of abstraction is then reduced until the overall performance gains are sufficient to reach the necessary goals; the benefits are speed of implementation and code re-usability. This is also the best method for achieving an optimal level of granularity for all computations.

Threads vs. Processes

Both threads and processes are methods of parallelizing an application; their implementation may differ from one operating system to another. A process always has at least one thread of execution, also known as the primary thread. In general, a thread is contained inside a process (in the address space of the process), and different threads of the same process share some resources while different processes do not.

Atomicity

Atomicity refers to operations that are indivisible and/or uninterruptible. Even on a single core, you cannot assume that an operation will be atomic, because a single statement may compile to several machine instructions. Strictly speaking, only at the assembler level can one see directly whether an operation is atomic. For that reason the C++ standard provides some guarantees of its own, as do operating systems and external libraries.

An atomic operation can also be seen as any given set of operations that can be combined so that they appear to the rest of the system to be a single operation with only two possible outcomes: success or failure. This all depends on the level of abstraction and underlying guarantees.

All modern processors provide basic atomic primitives which are then used to build more complex atomic objects. In addition to atomic read and write operations, most platforms provide an atomic read-and-update operation like test-and-set or compare-and-swap, or a pair of operations like load-link/store-conditional that only have an effect if they occur atomically (that is, with no intervening, conflicting update). These can be used to implement locks, a vital mechanism for multi-threaded programming, allowing invariants and atomicity to be enforced across groups of operations.
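
A compare-and-swap loop of the kind mentioned above can be sketched with std::atomic from C++11. The function name atomic_store_max is hypothetical; it raises a shared value to a candidate only if the candidate is larger, without using a lock.

```cpp
#include <atomic>

// Hypothetical helper: atomically replaces `target` with
// max(target, candidate) using a compare-and-swap loop.
void atomic_store_max(std::atomic<int>& target, int candidate) {
    int current = target.load();
    // compare_exchange_weak stores `candidate` only if `target` still equals
    // `current`. If another thread changed it in the meantime, the call fails
    // and reloads `current` with the value actually observed, so we retry.
    while (current < candidate &&
           !target.compare_exchange_weak(current, candidate)) {
        // retry with the refreshed `current`
    }
}
```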

Many processors, especially 32-bit ones with 64-bit floating point support, provide some read and write operations that are not atomic: one thread reading a 64-bit register while another thread is writing to it may see a combination of both "before" and "after" values, a combination that may never actually have been written to the register. Further, only single operations are guaranteed to be atomic; threads arbitrarily performing groups of reads and writes will also observe a mixture of "before" and "after" values. Clearly, invariants cannot be relied on when such effects are possible.

If not dealing with known guaranteed atomic operations, one should rely on the synchronization primitives at the level of abstraction that one is coding to.

Example - One process

For example, imagine a single process is running on a computer incrementing a value in a given memory location. To increment the value in that memory location:

the process reads the value in the memory location;
the process adds one to the value;
the process writes the new value back into the memory location.

Example - Two processes

Now, imagine two processes are running incrementing a single, shared memory location:

the first process reads the value in the memory location;
the first process adds one to the value;

but before it can write the new value back to the memory location it is suspended, and the second process is allowed to run:

the second process reads the value in the memory location, the same value that the first process read;
the second process adds one to the value;
the second process writes the new value into the memory location.

The second process is suspended and the first process allowed to run again:

the first process writes a now-wrong value into the memory location, unaware that the other process has already updated the value in the memory location.

This is a trivial example. In a real system, the operations can be more complex and the errors introduced extremely subtle. For example, reading a 64-bit value from memory may actually be implemented as two sequential reads of two 32-bit memory locations. If a process has read only the first 32 bits, and the value in memory is changed before it reads the second 32 bits, it will have neither the original value nor the new value but a mixed-up garbage value.

Furthermore, the specific order in which the processes run can change the results, making such an error difficult to detect and debug.
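
The lost update in the two-process example can be avoided by making the read-add-write sequence a single indivisible step. The sketch below uses threads rather than processes and std::atomic from C++11; the function name count_with_atomic is hypothetical.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical helper: several threads increment one shared counter.
// Returns the final counter value.
int count_with_atomic(int threads, int increments_per_thread) {
    std::atomic<int> counter{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&counter, increments_per_thread] {
            for (int i = 0; i < increments_per_thread; ++i) {
                // Read, add and write happen as one atomic operation,
                // so no increment can be overwritten by another thread.
                counter.fetch_add(1);
            }
        });
    }
    for (std::thread& w : workers) {
        w.join();
    }
    return counter.load();
}
```

With a plain int and ++counter instead, the program would contain a data race and the result could be lower than threads * increments_per_thread.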

OS and portability

Considerations are necessary not only with regard to the underlying hardware but also in dealing with the different OS APIs. When porting code across different OSs, one should consider what guarantees are provided. Similar considerations are necessary when dealing with external libraries.

Note:
For instance, on the Macintosh the set file position call is atomic, whereas on Windows it is a pair of calls.

Race condition

A race condition (data race, or simply race) occurs when data is accessed concurrently from multiple execution paths. It happens, for instance, when multiple threads have shared access to the same resource, such as a file or a block of memory, and at least one of the accesses is a write. The accesses can then interfere with one another.

Threaded programming is built around predicates and shared data. It is necessary to identify all possible execution paths and identify truly independent computations. To avoid problems it is best to implement concurrency at the highest level possible.

Most race conditions occur due to an erroneous assumption about the order in which threads will run. When dealing with shared variables, never assume that a threaded write operation will precede a threaded read operation. If you need guarantees you should see if synchronization primitives are available, and if not, you should implement your own.

Locking

Locking temporarily prevents un-shareable resources from being used simultaneously. Locking can be achieved by using a synchronization object.

One of the biggest problems with threading is that locking requires analysis and understanding of the data and code relationships. This complicates software development, especially when targeting multiple operating systems, and makes multi-threaded programming more like art than science.

The number of locks (depending on the synchronization object) may be limited by the OS. A single lock can protect more than one resource, provided they are always accessed in the same critical region.

Critical section

A critical section is a region defined as critical to the parallelization of code execution. The term is used to define code sections that need to be executed in isolation with respect to other code in the program.

This is a common fundamental concept. These sections of code need to be protected by a synchronization technique as they can create race conditions.

Deadlock

A deadlock is said to happen whenever there is a lock operation that results in a never-ending waiting cycle among concurrent threads.
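
A classic way to produce such a cycle is two threads that take the same two locks in opposite order. The sketch below shows one common way to avoid it, using std::lock from C++11, which acquires several mutexes with a deadlock-avoidance algorithm. The names Account, transfer and run_opposing_transfers are hypothetical.

```cpp
#include <mutex>
#include <thread>

// Hypothetical account type; each account carries its own mutex.
struct Account {
    std::mutex m;
    int balance;
    explicit Account(int b) : balance(b) {}
};

void transfer(Account& from, Account& to, int amount) {
    if (&from == &to) {
        return;  // locking the same mutex twice would be an error
    }
    // Acquire both mutexes together, whatever order the callers use.
    std::lock(from.m, to.m);
    // adopt_lock: the guards take over the already-held locks and release them.
    std::lock_guard<std::mutex> g1(from.m, std::adopt_lock);
    std::lock_guard<std::mutex> g2(to.m, std::adopt_lock);
    from.balance -= amount;
    to.balance += amount;
}

// Two threads transfer in opposite directions; returns the combined balance.
int run_opposing_transfers() {
    Account a(1000), b(1000);
    std::thread t1([&] { for (int i = 0; i < 1000; ++i) transfer(a, b, 1); });
    std::thread t2([&] { for (int i = 0; i < 1000; ++i) transfer(b, a, 1); });
    t1.join();
    t2.join();
    return a.balance + b.balance;
}
```

If transfer instead locked from.m and then to.m with two separate lock calls, t1 and t2 could each hold one mutex while waiting forever for the other.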

Synchronization

Except when used to guarantee the correct execution of a parallel computation, synchronization is an overhead. Attempt to keep it to a minimum by taking advantage of the thread's local storage or by using exclusive memory locations.

Computation granularity

Computation granularity is loosely defined as the amount of computation performed before any synchronization is needed. The longer the time between synchronizations, the coarser the granularity of the computation. For parallelism, coarse granularity means lower overhead costs and easier scaling to an increased number of threads. Granularity that is too fine can mean that any benefit from using threads is lost to the cost of synchronization and general thread overhead.
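
To illustrate, the following sketch sums a vector in parallel while synchronizing only once per thread: each thread accumulates into a local variable and takes the lock a single time to publish its partial result. The function name parallel_sum is hypothetical, and splitting the data into equal chunks is just one possible design.

```cpp
#include <algorithm>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper: sums `data` using `thread_count` threads.
long long parallel_sum(const std::vector<int>& data, unsigned thread_count) {
    if (thread_count == 0) {
        thread_count = 1;
    }
    long long total = 0;
    std::mutex total_mutex;
    std::vector<std::thread> workers;
    // Chunk size rounded up so that every element is covered.
    const std::size_t chunk = (data.size() + thread_count - 1) / thread_count;
    for (unsigned t = 0; t < thread_count; ++t) {
        const std::size_t begin = std::min(data.size(), t * chunk);
        const std::size_t end = std::min(data.size(), begin + chunk);
        workers.emplace_back([&, begin, end] {
            long long local = 0;  // private to this thread, no lock needed
            for (std::size_t i = begin; i < end; ++i) {
                local += data[i];
            }
            // The only synchronization point: once per thread.
            std::lock_guard<std::mutex> guard(total_mutex);
            total += local;
        });
    }
    for (std::thread& w : workers) {
        w.join();
    }
    return total;
}
```

Locking total_mutex inside the inner loop would give the same result but with one synchronization per element, which is the too-fine granularity the text warns about.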

Mutex

Mutex is an abbreviation for mutual exclusion. It relies on a synchronization facility supplied by the operating system (not the CPU). Since a basic mutex can be owned by only a single thread at any given time, it protects against data races and allows thread-safe synchronization of data between threads. A thread obtains ownership of a mutex object by calling one of the lock functions and relinquishes ownership by calling the corresponding unlock function. Mutexes can be either recursive or non-recursive, and some variants (such as shared, or reader-writer, mutexes) can grant simultaneous ownership to several threads.

Semaphore

A semaphore is a yielding synchronization object that maintains a count and can be used to synchronize several threads: a thread that cannot proceed gives up the CPU instead of spinning. It is one of the most commonly used methods of synchronization.
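
A counting semaphore can be sketched on top of a mutex and a condition variable from C++11. This is an illustrative implementation only, not how any particular operating system or library provides semaphores; the class name CountingSemaphore and its member functions are hypothetical.

```cpp
#include <condition_variable>
#include <mutex>

// Hypothetical counting semaphore built from C++11 primitives.
class CountingSemaphore {
public:
    explicit CountingSemaphore(int initial) : count_(initial) {}

    // Blocks (the thread sleeps rather than spins) until a permit is available.
    void acquire() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return count_ > 0; });
        --count_;
    }

    // Takes a permit only if one is available right now.
    bool try_acquire() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (count_ == 0) {
            return false;
        }
        --count_;
        return true;
    }

    // Returns a permit and wakes one waiting thread, if any.
    void release() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            ++count_;
        }
        cv_.notify_one();
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    int count_;
};
```

A semaphore created with a count of N lets at most N threads hold a permit at the same time; with a count of 1 it behaves much like a mutex without ownership.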

Spinlock

Spinlocks are busy-wait synchronization objects, used as a substitute for mutexes. They implement inter-thread locking using machine-dependent assembly instructions (such as test-and-set): a thread simply waits (spins) in a loop that repeatedly checks whether the lock has become available (busy wait). Because a waiting thread keeps consuming CPU time, spinlocks perform better when the lock is held only for a short period of time.^[1] They are generally of little use on single-CPU machines, where the spinning thread merely uses up the time slice that the lock holder needs in order to release the lock.
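
A minimal spinlock can be sketched with std::atomic_flag from C++11, whose test_and_set member maps onto the kind of instruction described above. The names SpinLock and count_with_spinlock are hypothetical, and production code would normally add back-off or fall back to a mutex.

```cpp
#include <atomic>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical spinlock; lock()/unlock() make it usable with std::lock_guard.
class SpinLock {
public:
    void lock() {
        // test_and_set returns the previous value: true means the lock is
        // already held by another thread, so keep spinning.
        while (flag_.test_and_set(std::memory_order_acquire)) {
            // busy wait
        }
    }
    void unlock() { flag_.clear(std::memory_order_release); }

private:
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
};

// Several threads increment a plain int protected by the spinlock.
int count_with_spinlock(int threads, int iterations) {
    SpinLock lock;
    int counter = 0;
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&lock, &counter, iterations] {
            for (int i = 0; i < iterations; ++i) {
                std::lock_guard<SpinLock> guard(lock);  // held very briefly
                ++counter;
            }
        });
    }
    for (std::thread& w : workers) {
        w.join();
    }
    return counter;
}
```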

Threads

Threads are by definition a coding construct: a part of a program that enables it to fork (or split) itself into two or more simultaneously (or pseudo-simultaneously) running tasks. Threads generally use pre-emptive multi-tasking.

The thread is the basic unit (the smallest piece of code) to which the operating system can allocate distinct processor time (schedule) for execution. This means that on a single-core system threads do not really run concurrently but in sequence. Threads often depend on the OS thread scheduler to preempt a busy thread and resume another thread.

Today the thread is not only a key concurrency model supported by most, if not all, modern computers, programming languages, and operating systems; it is also at the core of hardware evolution, such as symmetric multiprocessors. Understanding threads is now a necessity for all programmers.

The order of execution of threads is controlled by the OS scheduler and is non-deterministic. The only control available to the programmer is assigning a priority to a thread; never assume a particular order of execution.

User Interface Thread

This distinction indicates that the particular thread implements a message map to respond to events and messages generated by user input as the user interacts with the application. This is especially common when working with the Windows platform (Win32 API) because of the way it implements message pumps.

Worker Thread

This distinction specifies threads that neither depend directly on nor are part of the graphical user interface of the application, and that run concurrently with the main execution thread.

Thread local storage (TLS)

Thread-local storage is where thread-local variables reside: a section of memory dedicated to a single thread. In addition, each thread (or fiber) receives its own stack space, residing in a different memory location, which consists of both reserved and initially committed memory. This memory is freed when the thread exits normally but may not be freed if the thread is terminated by other means.

Since all threads in a process share the same address space, data in a static or global variable is normally located at the same memory location when referred to by threads of the same process. Software should also take hardware cache coherence into consideration. For instance, in multiprocessor environments each processor has a local cache. If threads on different processors modify variables residing on the same cache line, each modification invalidates that cache line in the other caches, forcing a cache update and hurting performance. This is referred to as false sharing.

This type of storage is suited to variables that hold temporary or partial results, since combining the partial results in as few and as infrequent synchronization steps as possible reduces the synchronization overhead.
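
In C++11 and later, thread-local variables can be declared with the thread_local keyword. The sketch below shows that each thread sees its own copy; the names tls_counter and bump_in_new_thread are hypothetical.

```cpp
#include <thread>

// Every thread gets its own independent copy, initialized to 0.
thread_local int tls_counter = 0;

// Hypothetical helper: increments tls_counter `times` times in a new thread
// and returns the value that thread ended up with.
int bump_in_new_thread(int times) {
    int seen = -1;
    std::thread t([&seen, times] {
        for (int i = 0; i < times; ++i) {
            ++tls_counter;  // modifies only the new thread's copy
        }
        seen = tls_counter;
    });
    t.join();
    return seen;
}
```

If the calling thread sets tls_counter to 5 and then calls bump_in_new_thread(3), the function returns 3 and the caller's tls_counter is still 5, because the two threads never touched the same variable.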

Thread Synchronization

Synchronization can be described in several steps. The first is the lock itself, where a thread (or process) is made to halt execution because it finds a protected resource locked. Locking has a cost, especially if the lock is held for too long.

Obviously there is a performance hit if any synchronization mechanism is heavily used. Because synchronization is expensive, in certain cases increasing the use of TLS instead of relying only on shared data structures will reduce the need for it.

Critical Section

Suspend and Resume

Synchronizing on Objects

Cooperative vs. Preemptive Threading

Thread pool

Figure: a simple thread pool. The task queue holds many waiting tasks (blue circles). When a thread in the pool becomes free (green box with dotted circle), a task comes off the queue and the free thread executes it (red circles in green boxes). The completed task then leaves the thread pool and joins the list of completed tasks (yellow circles).
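
The queue-and-workers arrangement in the figure can be sketched as follows, assuming C++11. This is a deliberately small illustration without futures, exception handling or a completed-tasks list; the class name ThreadPool and its members are hypothetical.

```cpp
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical fixed-size thread pool with a shared task queue.
class ThreadPool {
public:
    explicit ThreadPool(std::size_t thread_count) {
        for (std::size_t i = 0; i < thread_count; ++i) {
            workers_.emplace_back([this] { work(); });
        }
    }

    // Finishes all queued tasks, then stops and joins the workers.
    ~ThreadPool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        cv_.notify_all();
        for (std::thread& w : workers_) {
            w.join();
        }
    }

    // Adds a task to the queue and wakes one idle worker.
    void submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(std::move(task));
        }
        cv_.notify_one();
    }

private:
    void work() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
                if (tasks_.empty()) {
                    return;  // stopping and nothing left to run
                }
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();  // run outside the lock so other workers can proceed
        }
    }

    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> tasks_;
    bool stopping_ = false;
    std::vector<std::thread> workers_;
};
```

Tasks are submitted as callables taking no arguments, for example pool.submit([] { /* work */ });. Destroying the pool waits for every queued task to finish.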

Fibers

A fiber is a particularly lightweight thread of execution. Like threads, fibers share address space. However, fibers use co-operative multi-tasking: a running fiber must explicitly yield to let another fiber run.

Operating system support

Less support from the operating system is needed for fibers than for threads. They can be implemented in modern Unix systems using the library functions getcontext, setcontext and swapcontext in ucontext.h, as in GNU Portable Threads.
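
The cooperative hand-over can be sketched with the ucontext.h functions named above. This assumes a Unix system that still provides them (they work with glibc on Linux but are marked obsolescent in POSIX and deprecated on some platforms). The names fiber_body and run_fiber_demo, the global variables and the 64 KiB stack size are all hypothetical choices.

```cpp
#include <ucontext.h>

#include <string>

namespace {
ucontext_t main_ctx;   // saved context of the caller
ucontext_t fiber_ctx;  // saved context of the fiber
std::string trace;     // records the order in which code runs

void fiber_body() {
    trace += "B";
    swapcontext(&fiber_ctx, &main_ctx);  // yield back to the caller
    trace += "D";
}  // returning from the fiber resumes uc_link, that is, main_ctx
}  // namespace

// Runs one fiber that yields once; returns the recorded execution order.
std::string run_fiber_demo() {
    static char stack[64 * 1024];  // the fiber's own stack
    trace.clear();

    getcontext(&fiber_ctx);
    fiber_ctx.uc_stack.ss_sp = stack;
    fiber_ctx.uc_stack.ss_size = sizeof stack;
    fiber_ctx.uc_link = &main_ctx;  // where to continue when the fiber ends
    makecontext(&fiber_ctx, fiber_body, 0);

    trace += "A";
    swapcontext(&main_ctx, &fiber_ctx);  // run the fiber until it yields
    trace += "C";
    swapcontext(&main_ctx, &fiber_ctx);  // resume the fiber until it finishes
    trace += "E";
    return trace;
}
```

The returned string is "ABCDE": control alternates between the caller and the fiber only at the explicit swapcontext calls, with no scheduler involved.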

On Microsoft Windows, fibers are created using the ConvertThreadToFiber and CreateFiber calls; a fiber that is currently suspended may be resumed in any thread. Fiber-local storage, analogous to thread-local storage, may be used to create unique copies of variables.

Symbian OS uses a concept similar to fibers in its Active Scheduler. An Active object contains one fiber to be executed by the Active Scheduler when one of several outstanding asynchronous calls completes. Several Active objects can be waiting to execute (based on priority), and each one must restrict its own execution time.

Exploiting parallelism

Most of the parallel architecture research was done in the 1960s and 1970s, providing solutions for problems that are only today reaching general awareness. As the need for concurrent programming increases, mostly due to today's hardware evolution, we as programmers are pressed to adopt programming models that ease the complicated process of dealing with the old thread model, saving development time by abstracting the problem.

OpenMP

Chart of OpenMP constructs.
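
As a small taste of the directive-based style, the sketch below uses an OpenMP parallel for with a reduction clause. It assumes a compiler with OpenMP support enabled (for example the -fopenmp flag on GCC or Clang); without it the pragma is ignored and the loop simply runs sequentially with the same result. The function name sum_of_squares is hypothetical.

```cpp
#include <vector>

// Hypothetical helper: sums the squares of all elements.
long long sum_of_squares(const std::vector<int>& values) {
    long long total = 0;
    const long n = static_cast<long>(values.size());
    // reduction(+ : total) gives each thread a private copy of `total`
    // and adds the copies together when the loop ends.
    #pragma omp parallel for reduction(+ : total)
    for (long i = 0; i < n; ++i) {
        total += static_cast<long long>(values[i]) * values[i];
    }
    return total;
}
```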

[1] Malte Skarupke. "Measuring Mutexes, Spinlocks and how Bad the Linux Scheduler Really is".