Multitasking functionality

The Linux kernel is a preemptive multitasking operating system. As a multitasking OS, it allows multiple processes to share processors (CPUs) and other system resources. Each CPU executes a single task at a time. However, multitasking allows each processor to switch between tasks without having to wait for each task to finish. To do so, the kernel can, at any time, temporarily interrupt the task being carried out by the processor and replace it with another task, which can be new or previously suspended. The operation of swapping the running task is called a context switch.

Threads or tasks

In the Linux kernel, "thread" and "task" are almost synonyms.

💾 History: Until version 2.6.39, parts of the kernel were serialized by the big kernel lock (BKL), which allowed only one thread at a time to run the code it protected.

⚲ API

linux/sched.h_inc - the main scheduler API

task_struct_id

arch/x86/include/asm/current.h_src

current_id and get_current_id() return the current task_struct_id

uapi/linux/taskstats.h_inc per-task statistics

linux/thread_info.h_inc

function current_thread_info_id() returns thread_info_id

linux/sched/task.h_inc - interface between the scheduler and various task lifetime (fork()/exit()) functionality

linux/kthread.h_inc - simple interface for creating and stopping kernel threads without mess.

kthread_run_id creates and wakes a thread

kthread_create_id

⚙️ Internals

kthread_run_id ↯ hierarchy:

kernel_thread_id

kernel_clone_id

kernel/kthread.c_src

Scheduler

The scheduler is the part of the operating system that decides which process runs at a certain point in time. It usually has the ability to pause a running process, move it to the back of the run queue and start a new process.

Active processes are placed in an array called a run queue, or runqueue - rq_id. The run queue may contain priority values for each process, which will be used by the scheduler to determine which process to run next. To ensure each program has a fair share of resources, each one is run for some time period (quantum) before it is paused and placed back into the run queue. When a program is stopped to let another run, the program with the highest priority in the run queue is then allowed to execute. Processes are also removed from the run queue when they ask to sleep, are waiting on a resource to become available, or have been terminated.

Linux uses the Completely Fair Scheduler (CFS), the first implementation of a fair queuing process scheduler widely used in a general-purpose operating system. CFS uses a well-studied, classic scheduling algorithm called "fair queuing" originally invented for packet networks. The CFS scheduler has a scheduling complexity of O(log N), where N is the number of tasks in the runqueue. Choosing a task can be done in constant time, but reinserting a task after it has run requires O(log N) operations, because the run queue is implemented as a red–black tree.

In contrast to the previous O(1) scheduler, the CFS scheduler implementation is not based on run queues. Instead, a red-black tree implements a "timeline" of future task execution. Additionally, the scheduler uses nanosecond granularity accounting, the atomic units by which an individual process' share of the CPU was allocated (thus making redundant the previous notion of timeslices). This precise knowledge also means that no specific heuristics are required to determine the interactivity of a process, for example.

Like the old O(1) scheduler, CFS uses a concept called "sleeper fairness", which considers sleeping or waiting tasks equivalent to those on the runqueue. This means that interactive tasks which spend most of their time waiting for user input or other events get a comparable share of CPU time when they need it.

The data structure used for the scheduling algorithm is a red-black tree in which the nodes are scheduler specific structures, entitled sched_entity_id. These are derived from the general task_struct process descriptor, with added scheduler elements. These nodes are indexed by processor execution time in nanoseconds. A maximum execution time is also calculated for each process. This time is based upon the idea that an "ideal processor" would equally share processing power amongst all processes. Thus, the maximum execution time is the time the process has been waiting to run, divided by the total number of processes, or in other words, the maximum execution time is the time the process would have expected to run on an "ideal processor".

When the scheduler is invoked to run a new process, it operates as follows:

The leftmost node of the scheduling tree is chosen (as it will have the lowest spent execution time), and sent for execution.
If the process simply completes execution, it is removed from the system and the scheduling tree.
If the process reaches its maximum execution time or is otherwise stopped (voluntarily or via interrupt), it is reinserted into the scheduling tree based on its new spent execution time.
The new leftmost node is then selected from the tree, repeating the iteration.

If a process spends a lot of its time sleeping, then its spent time value is low and it automatically gets a priority boost when it finally needs it. Hence such tasks do not get less processor time than the tasks that are constantly running.
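The selection loop described above can be sketched in user space. This is a toy model under stated assumptions, not kernel code: `struct toy_task` and both functions are hypothetical names, and a linear scan stands in for the leftmost-node lookup in the red-black tree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a scheduling entity. */
struct toy_task {
    const char *name;
    uint64_t spent_ns;   /* execution time spent so far, in nanoseconds */
};

/* Return the index of the task with the lowest spent execution time.
 * The kernel reads this from the leftmost node of a red-black tree;
 * a linear scan is used here only to keep the sketch short. */
size_t toy_pick_next(const struct toy_task *tasks, size_t n)
{
    size_t best = 0;

    assert(n > 0);
    for (size_t i = 1; i < n; i++)
        if (tasks[i].spent_ns < tasks[best].spent_ns)
            best = i;
    return best;
}

/* One scheduling iteration: pick a task, charge it slice_ns of run time
 * (which corresponds to reinserting it with a new key), return its index. */
size_t toy_schedule_once(struct toy_task *tasks, size_t n, uint64_t slice_ns)
{
    size_t i = toy_pick_next(tasks, n);

    tasks[i].spent_ns += slice_ns;
    return i;
}
```

A task that has slept keeps a low `spent_ns` and is therefore picked first once it becomes runnable, which is the boost the paragraph above describes.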

An alternative to CFS is the Brain Fuck Scheduler (BFS) created by Con Kolivas. The objective of BFS, compared to other schedulers, is to provide a scheduler with a simpler algorithm, that does not require adjustment of heuristics or tuning parameters to tailor performance to a specific type of computation workload.

Con Kolivas also maintains another alternative to CFS, the MuQSS scheduler.^[1]

The Linux kernel contains different scheduler classes (or policies). The Completely Fair Scheduler, used by default nowadays, implements the SCHED_NORMAL_id scheduler class, also known as SCHED_OTHER. The kernel also contains two additional classes, SCHED_BATCH_id and SCHED_IDLE_id, and two real-time scheduling classes named SCHED_FIFO_id (realtime first-in-first-out) and SCHED_RR_id (realtime round-robin). A third realtime scheduling policy, SCHED_DEADLINE_id, which implements the earliest deadline first (EDF) algorithm, was added later. Any realtime scheduler class takes precedence over any of the "normal" (i.e. non-realtime) classes. The scheduler class is selected and configured through the man 2 sched_setscheduler ↪ do_sched_setscheduler_id system call.
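From user space, the policies above are reachable through the calls declared in `<sched.h>`. The following is a minimal sketch; the helper names `policy_name`, `priority_is_valid` and `try_set_fifo` are made up for illustration, and switching to a realtime policy normally needs CAP_SYS_NICE or a suitable RLIMIT_RTPRIO.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/* Map a policy constant from <sched.h> to a printable name. */
const char *policy_name(int policy)
{
    switch (policy) {
    case SCHED_OTHER: return "SCHED_OTHER"; /* SCHED_NORMAL inside the kernel */
    case SCHED_FIFO:  return "SCHED_FIFO";
    case SCHED_RR:    return "SCHED_RR";
    default:          return "other";
    }
}

/* Return nonzero if prio lies in the static priority range of the policy. */
int priority_is_valid(int policy, int prio)
{
    int lo = sched_get_priority_min(policy);
    int hi = sched_get_priority_max(policy);

    return lo != -1 && hi != -1 && prio >= lo && prio <= hi;
}

/* Try to move the calling thread to SCHED_FIFO.
 * Returns 0 on success, -1 with errno set (often EPERM) otherwise. */
int try_set_fifo(int prio)
{
    struct sched_param param = { .sched_priority = prio };

    return sched_setscheduler(0, SCHED_FIFO, &param);
}
```

A caller would typically print `policy_name(sched_getscheduler(0))` before and after `try_set_fifo()` to confirm that the change took effect.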

Properly balancing latency, throughput, and fairness in schedulers is an open problem.^[1]

⚲ API

man 1 renice – priority of running processes

man 1 nice – run a program with modified scheduling priority

man 1 chrt – manipulate the real-time attributes of a process

man 2 sched_getattr ↪ sys_sched_getattr_id – get scheduling policy and attributes

linux/sched.h_inc – the main scheduler API

schedule_id

man 2 getpriority, man 2 setpriority

man 2 sched_setscheduler, man 2 sched_getscheduler

⚙️ Internals

sched_init_id is called from start_kernel_id

__schedule_id is the main scheduler function.

runqueues_id, this_rq_id

kernel/sched_src

kernel/sched/core.c_src

kernel/sched/fair.c_src implements SCHED_NORMAL_id, SCHED_BATCH_id, SCHED_IDLE_id

sched_setscheduler_id, sched_getscheduler_id

task_struct_id::rt_priority_id and other members with less unique identifiers

🛠️ Utilities

man 1 pidstat

man 1 pcp-pidstat

man 1 perf-sched

Understanding Scheduling Behavior with SchedViz

📖 References

man 7 sched

Scheduling_doc

Delaying and scheduling routines_doc

CFS

Completely Fair Scheduler_doc

CFS Bandwidth Control_doc

Tuning the task scheduler

stop using CPU limits on Kubernetes

Completely fair scheduler LWN

Deadline Task Scheduler_doc

sched_ltp

sched_setparam_ltp

sched_getscheduler_ltp

sched_setscheduler_ltp

📚 Further reading about the scheduler

Scheduler tracing

bcc/ebpf CPU and scheduler tools

Preemption

Preemption refers to the ability of the system to interrupt a running task to switch to another task. This is essential for ensuring that high-priority tasks receive the necessary CPU time and for improving the system's responsiveness. In Linux, preemption models define how and when the kernel can preempt tasks. Different models offer varying trade-offs between system responsiveness and throughput.

📖 References

kernel/Kconfig.preempt_src

CONFIG_PREEMPT_NONE_id – no forced preemption for servers

CONFIG_PREEMPT_VOLUNTARY_id – voluntary preemption for desktops

CONFIG_PREEMPT_id – preemptible except for critical sections for low-latency desktops

CONFIG_PREEMPT_RT_id – real-time preemption for highly responsive applications

CONFIG_PREEMPT_DYNAMIC_id, see /sys/kernel/debug/sched/preempt

Wait queues

A wait queue in the kernel is a data structure that allows one or more processes to wait (sleep) until something of interest happens. They are used throughout the kernel to wait for available memory, I/O completion, message arrival, and many other things. In the early days of Linux, a wait queue was a simple list of waiting processes, but various scalability problems (including the thundering herd problem) have led to the addition of a fair amount of complexity since then.

⚲ API

linux/wait.h_inc

wait_queue_head_id consists of a doubly linked list of wait_queue_entry_id and a spinlock.

Waiting for simple events:

Use one of two methods for wait_queue_head_id initialization:

init_waitqueue_head_id initializes wait_queue_head_id in function context

DECLARE_WAIT_QUEUE_HEAD_id - actually defines wait_queue_head_id in global context

Wait alternatives:

wait_event_interruptible_id - preferable wait

wait_event_interruptible_timeout_id

wait_event_id - uninterruptible wait. Can cause deadlock ⚠

wake_up_id etc

👁 For example usage see references to unique suspend_queue_id.

Explicit use of add_wait_queue instead of simple wait_event for complex cases:

DECLARE_WAITQUEUE_id actually defines wait_queue_entry with default_wake_function_id

add_wait_queue_id inserts a process in the first position of a wait queue

remove_wait_queue_id

⚙️ Internals

___wait_event_id

__add_wait_queue_id

__wake_up_common_id, try_to_wake_up_id

kernel/sched/wait.c_src

📖 References

Wait queues and Wake events_doc

Handling wait queues

Real-time

RT preemption

The Linux Foundation's Real-Time Linux (RTL) collaborative project is focused on improving the real-time capabilities of Linux and advancing the adoption of real-time Linux in various industries, including aerospace, automotive, robotics, and telecommunications.

Parameter CONFIG_PREEMPT_RT_id enables real-time preemption.

RT scheduling policies

Scheduling policies for RT:

SCHED_FIFO_id, SCHED_RR_id

implemented in kernel/sched/rt.c_src

SCHED_DEADLINE

implemented in kernel/sched/deadline.c_src

API:

man 1 chrt – manipulate the real-time attributes of a process

man 2 sched_rr_get_interval – get the SCHED_RR interval for the named process

man 2 sched_setscheduler, sched_getscheduler – set and get scheduling policy/parameters

man 2 sched_get_priority_min, sched_get_priority_max – get static priority range

Testing RT capabilities

The testing process for Real-Time Linux typically involves several key aspects. First and foremost, it is crucial to verify the accuracy and stability of the system's timekeeping mechanisms. Precise time management is fundamental to real-time applications, and any inaccuracies can lead to timing errors and compromise the system's real-time capabilities.

Another essential aspect of testing is evaluating the system's scheduling algorithms. Real-Time Linux employs advanced scheduling policies to prioritize critical tasks and ensure their timely execution. Testing the scheduler involves assessing its ability to allocate resources efficiently, handle task prioritization correctly, and prevent resource contention or priority inversion scenarios.

Furthermore, latency measurement is a critical part of Real-Time Linux testing. Latency refers to the time delay between the occurrence of an event and the system's response to it. In real-time applications, minimizing latency is crucial to achieving timely and predictable behavior. Testing latency involves measuring the time it takes for the system to respond to various stimuli and identifying any sources of delay or unpredictability.

Additionally, stress testing plays a significant role in assessing the system's robustness under heavy workloads. It involves subjecting the Real-Time Linux system to high levels of concurrent activities, intense computational loads, and input/output operations to evaluate its performance, responsiveness, and stability. Stress testing helps identify potential bottlenecks, resource limitations, or issues that might degrade the real-time behavior of the system.
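The latency measurement described above can be approximated in user space by sleeping until an absolute deadline and checking how late the wakeup arrives. This is a rough sketch of that idea only, not the cyclictest implementation; all function names are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000LL

/* Convert a timespec to a scalar nanosecond count. */
int64_t ts_to_ns(struct timespec ts)
{
    return (int64_t)ts.tv_sec * NSEC_PER_SEC + ts.tv_nsec;
}

/* Sleep until an absolute deadline interval_ns from now and return how many
 * nanoseconds late the wakeup was, or -1 on error. */
int64_t wakeup_latency_ns(int64_t interval_ns)
{
    struct timespec now, deadline;
    int64_t target;

    if (clock_gettime(CLOCK_MONOTONIC, &now) != 0)
        return -1;
    target = ts_to_ns(now) + interval_ns;
    deadline.tv_sec = target / NSEC_PER_SEC;
    deadline.tv_nsec = target % NSEC_PER_SEC;

    /* clock_nanosleep reports failure through its return value, not errno. */
    if (clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &deadline, NULL) != 0)
        return -1;
    if (clock_gettime(CLOCK_MONOTONIC, &now) != 0)
        return -1;
    return ts_to_ns(now) - target;
}

/* Return the worst wakeup latency seen over `loops` iterations, or -1. */
int64_t max_wakeup_latency_ns(int loops, int64_t interval_ns)
{
    int64_t worst = 0;

    assert(loops > 0 && interval_ns > 0);
    for (int i = 0; i < loops; i++) {
        int64_t lat = wakeup_latency_ns(interval_ns);

        if (lat < 0)
            return -1;
        if (lat > worst)
            worst = lat;
    }
    return worst;
}
```

The worst case matters more than the average for real-time work, so the loop keeps only the maximum. Running it under a realtime policy and under load gives more meaningful numbers than an idle run.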

RTLA – The realtime Linux analysis tool:

rtla timerlat_doc – CLI for the kernel's timerlat tracer_doc

rtla osnoise_doc – CLI for the kernel's osnoise tracer_doc. Kernel function run_osnoise_id measures time with function trace_clock_local_id in a loop.

rtla hwnoise_doc – CLI for the osnoise tracer_doc with interrupts disabled

Implementation: tools/tracing/rtla_src and kernel/trace/trace_osnoise.c_src

Linux scheduling latency debug and analysis

RT-Tests, source

cyclictest

some RT-Tests man pages:

cyclictest – measures man 2 clock_nanosleep or man 2 nanosleep delay

hackbench – scheduler benchmark/stress test

hwlatdetect – CLI for /sys/kernel/tracing/hwlat_detector_doc / kernel/trace/trace_hwlat.c_src. Kernel function kthread_fn_id measures time delays with function trace_clock_local_id in a loop.

oslat – measures delay with RDTSC in busy loop

RT Tracing Tools with eBPF

realtime_ltp

...

Synchronization

Thread synchronization is a mechanism which ensures that two or more concurrent processes or threads do not simultaneously execute a particular program segment known as a critical section. This guarantee is called mutual exclusion (mutex). When one thread starts executing the critical section (a serialized segment of the program), the other threads must wait until the first thread finishes. If proper synchronization techniques are not applied, a race condition may occur, in which the values of variables are unpredictable and vary depending on the timing of context switches between the processes or threads.

User space synchronization

Futex

A man 2 futex ↪ do_futex_id (short for "fast userspace mutex") is a kernel system call that programmers can use to implement basic locking, or as a building block for higher-level locking abstractions such as semaphores and POSIX mutexes or condition variables.

A futex consists of a kernelspace wait queue that is attached to an aligned integer in userspace. Multiple processes or threads operate on the integer entirely in userspace (using atomic operations to avoid interfering with one another), and only resort to relatively expensive system calls to request operations on the wait queue (for example to wake up waiting processes, or to put the current process on the wait queue). A properly programmed futex-based lock will not use system calls except when the lock is contended; since most operations do not require arbitration between processes, this will not happen in most cases.

Futexes are built on only two central operations, futex_wait_id and futex_wake_id, though the implementation has more operations for more specialized cases.

WAIT (addr, val) checks if the value stored at the address addr is val and, if it is, puts the current thread to sleep.

WAKE (addr, val) wakes up val number of threads waiting on the address addr.
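The two operations above map onto the FUTEX_WAIT and FUTEX_WAKE commands of the futex system call. As glibc offers no futex() wrapper, the sketch below goes through syscall(); the helper names are chosen for illustration and are not the kernel functions of the same name.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/futex.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* WAIT: sleep only if *addr still equals val.
 * Returns 0 when woken, or -1 with errno set; errno is EAGAIN when *addr
 * no longer holds val, in which case the caller does not sleep at all. */
long futex_wait(uint32_t *addr, uint32_t val)
{
    return syscall(SYS_futex, addr, FUTEX_WAIT, val, NULL, NULL, 0);
}

/* WAKE: wake up to `count` threads waiting on addr.
 * Returns the number of threads actually woken. */
long futex_wake(uint32_t *addr, int count)
{
    assert(count >= 0);
    return syscall(SYS_futex, addr, FUTEX_WAKE, count, NULL, NULL, 0);
}
```

The value check inside WAIT is what closes the race between a thread deciding to sleep and another thread changing the word and calling WAKE.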

⚲ API

uapi/linux/futex.h_inc

linux/futex.h_inc

⚙️ Internals: kernel/futex.c_src

📖 References

Futex

man 7 futex

Futex API reference_doc

futex_ltp

File locking

⚲ API: man 2 flock

Semaphore

💾 History: Semaphore is part of System V IPC man 7 sysvipc

⚲ API

man 2 semget

man 2 semctl

man 2 semop

⚙️ Internals: ipc/sem.c_src

Kernel space synchronization

For kernel mode synchronization Linux provides three categories of locking primitives: sleeping locks, per-CPU local locks and spinning locks.

Read-Copy-Update

A common mechanism to solve the readers–writers problem is the read-copy-update (RCU) algorithm. Read-copy-update implements a kind of mutual exclusion that is wait-free (non-blocking) for readers, allowing extremely low overhead. However, RCU updates can be expensive, as they must leave the old versions of the data structure in place to accommodate pre-existing readers.

💾 History: RCU was added to Linux in October 2002. Since then, thousands of uses of the RCU API have appeared within the kernel, including in the networking protocol stacks and the memory-management system. The implementation of RCU in version 2.6 of the Linux kernel is among the better-known RCU implementations.

⚲ The core API in linux/rcupdate.h_inc is quite small:

rcu_read_lock_id marks an RCU-protected data structure so that it won't be reclaimed for the full duration of that critical section.

rcu_read_unlock_id is used by a reader to inform the reclaimer that the reader is exiting an RCU read-side critical section. Note that RCU read-side critical sections may be nested and/or overlapping.

synchronize_rcu_id blocks until all pre-existing RCU read-side critical sections on all CPUs have completed. Note that synchronize_rcu will not necessarily wait for any subsequent RCU read-side critical sections to complete.

👁 For example, consider the following sequence of events:

	         CPU 0                  CPU 1                 CPU 2
	     ----------------- ------------------------- ---------------
	 1.  rcu_read_lock()
	 2.                    enters synchronize_rcu()
	 3.                                               rcu_read_lock()
	 4.  rcu_read_unlock()
	 5.                     exits synchronize_rcu()
	 6.                                              rcu_read_unlock()

RCU API communications between the reader, updater, and reclaimer

Since synchronize_rcu is the API that must figure out when readers are done, its implementation is key to RCU. For RCU to be useful in all but the most read-intensive situations, synchronize_rcu's overhead must also be quite small.

Alternatively, instead of blocking in synchronize_rcu, the updater may register a callback to be invoked after all ongoing RCU read-side critical sections have completed. This callback variant is called call_rcu_id in the Linux kernel.

rcu_assign_pointer_id - The updater uses this function to assign a new value to an RCU-protected pointer, in order to safely communicate the change in value from the updater to the reader. This function returns the new value, and also executes any memory barrier instructions required for a given CPU architecture. Perhaps more importantly, it serves to document which pointers are protected by RCU.

rcu_dereference_id - The reader uses this function to fetch an RCU-protected pointer, which returns a value that may then be safely dereferenced. It also executes any directives required by the compiler or the CPU, for example, a volatile cast for gcc, a memory_order_consume load for C/C++11 or the memory-barrier instruction required by the old DEC Alpha CPU. The value returned by rcu_dereference is valid only within the enclosing RCU read-side critical section. As with rcu_assign_pointer, an important function of rcu_dereference is to document which pointers are protected by RCU.

The RCU infrastructure observes the time sequence of rcu_read_lock, rcu_read_unlock, synchronize_rcu, and call_rcu invocations in order to determine when (1) synchronize_rcu invocations may return to their callers and (2) call_rcu callbacks may be invoked. Efficient implementations of the RCU infrastructure make heavy use of batching in order to amortize their overhead over many uses of the corresponding APIs.
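The publish/subscribe half of this API can be illustrated in user space with C11 atomics. The sketch below is only an analogy: a release store plays the role of rcu_assign_pointer and an acquire load the role of rcu_dereference. It does not track readers or grace periods, so it says nothing about when an old object may be freed. All names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct config {
    int timeout_ms;
    int retries;
};

/* The shared pointer that readers follow. */
static _Atomic(struct config *) current_config;

/* Updater: publish a fully initialized object. The release store orders the
 * initialization of *new_cfg before the pointer becomes visible. */
void publish_config(struct config *new_cfg)
{
    assert(new_cfg != NULL);
    atomic_store_explicit(&current_config, new_cfg, memory_order_release);
}

/* Reader: fetch the pointer. The acquire load guarantees that the fields
 * written before publication are visible through the returned pointer. */
const struct config *read_config(void)
{
    return atomic_load_explicit(&current_config, memory_order_acquire);
}
```

In the kernel, the updater would follow the publication with synchronize_rcu or call_rcu before reclaiming the previous object. This sketch leaves that step out entirely.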

⚙️ Internals

kernel/rcu_src

📖 References

Avoiding Locks: Read Copy Update_doc

RCU concepts_doc

RCU initialization

Sleeping locks

Mutexes

⚲ API

linux/mutex.h_inc

linux/completion.h_inc

mutex_id has an owner and usage constraints, and is easier to debug than a semaphore

rt_mutex_id blocking mutual exclusion locks with priority inheritance (PI) support

ww_mutex_id Wound/Wait mutexes: blocking mutual exclusion locks with deadlock avoidance

rw_semaphore_id readers–writer semaphores

percpu_rw_semaphore_id

completion_id - use a completion to synchronize a task with an ISR, or two tasks with each other.

wait_for_completion_id

complete_id

💾 Historical

semaphore_id - use a mutex instead of a semaphore if possible

linux/semaphore.h_inc

linux/rwsem.h_inc

📖 References

Completions - “wait for completion” barrier APIs_doc

Mutex API reference_doc

LWN: completion events

per CPU local lock

local_lock_id, preempt_disable_id

local_lock_irqsave_id, local_irq_save_id

etc

On a normal preemptible kernel, local_lock calls preempt_disable_id. On an RT preemptible kernel, local_lock calls migrate_disable_id and spin_lock_id. Any changes applied to spinlock_t also apply to local_lock.

⚲ API

linux/local_lock.h_inc

📖 References

local_lock_doc

PREEMPT_RT caveats: spinlock_t, rwlock_t, migrate_disable and local_lock_doc

Proper locking under a preemptive kernel_doc

Local locks in the kernel

💾 History: Prior to kernel version 2.6, Linux disabled interrupts to implement short critical sections. Since version 2.6, Linux is fully preemptive.

Spinning locks

Spinlocks

A spinlock is a lock which causes a thread trying to acquire it to simply wait in a loop ("spin") while repeatedly checking if the lock is available. Since the thread remains active but is not performing a useful task, the use of such a lock is a kind of busy waiting. Once acquired, spinlocks will usually be held until they are explicitly released, although in some implementations they may be automatically released if the thread being waited on (that which holds the lock) blocks, or "goes to sleep".

Spinlocks are commonly used inside kernels because they are efficient if threads are likely to be blocked for only short periods. However, spinlocks become wasteful if held for longer durations, as they may prevent other threads from running and require rescheduling. 👁 For example, kobj_kset_join_id uses a spinlock to protect access to the linked list.

On uniprocessor systems (CONFIG_SMP_id disabled), spinlocks are replaced by disabling and enabling kernel preemption. Most spinning locks become sleeping locks in CONFIG_PREEMPT_RT_id kernels.
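The busy-waiting behaviour described above can be shown with a minimal test-and-set lock built on C11 `atomic_flag`. This is a user-space sketch with hypothetical names. The kernel's spinlock_t is a queued lock and also deals with preemption and interrupts, none of which is modelled here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct toy_spinlock {
    atomic_flag locked;
};

#define TOY_SPINLOCK_INIT { ATOMIC_FLAG_INIT }

/* Try once to take the lock. test_and_set returns the previous value,
 * so "false" means the flag was clear and the caller now owns the lock. */
bool toy_spin_trylock(struct toy_spinlock *l)
{
    return !atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire);
}

/* Spin until the lock is acquired. */
void toy_spin_lock(struct toy_spinlock *l)
{
    while (!toy_spin_trylock(l))
        ; /* busy-wait until the holder releases the lock */
}

/* Release the lock; the release ordering publishes the critical section. */
void toy_spin_unlock(struct toy_spinlock *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```

Because a waiter burns CPU time for as long as the lock is held, the critical section between `toy_spin_lock()` and `toy_spin_unlock()` should stay short and must not sleep.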

📖 References

spinlock_t_id

raw_spinlock_t_id

bit_spin_lock_id

Introduction to spinlocks

Queued spinlocks

Seqlocks

A seqlock (short for "sequential lock") is a special locking mechanism used in Linux for supporting fast writes of shared variables between two parallel operating system routines. It is a special solution to the readers–writers problem when the number of writers is small.

It is a reader-writer consistent mechanism which avoids the problem of writer starvation. A seqlock_t_id consists of storage for saving a sequence counter seqcount_t_id/seqcount_spinlock_t in addition to a lock. The lock is to support synchronization between two writers and the counter is for indicating consistency in readers. In addition to updating the shared data, the writer increments the sequence counter, both after acquiring the lock and before releasing the lock. Readers read the sequence counter before and after reading the shared data. If the sequence counter is odd on either occasion, a writer had taken the lock while the data was being read and it may have changed. If the sequence counters are different, a writer has changed the data while it was being read. In either case readers simply retry (using a loop) until they read the same even sequence counter before and after.
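The counter protocol just described can be sketched with C11 atomics. This is an illustration with hypothetical names rather than the kernel's seqlock_t: it assumes writers are already serialized (the kernel uses the embedded lock for that) and accesses the protected data through relaxed atomics to stay free of data races.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

struct toy_seq_pair {
    atomic_uint seq;        /* even: data stable, odd: write in progress */
    _Atomic uint64_t a, b;  /* protected data */
};

/* Writer: make the counter odd, update the data, make it even again.
 * Assumes no other writer runs concurrently. */
void toy_seq_write(struct toy_seq_pair *p, uint64_t a, uint64_t b)
{
    unsigned s = atomic_load_explicit(&p->seq, memory_order_relaxed);

    atomic_store_explicit(&p->seq, s + 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&p->a, a, memory_order_relaxed);
    atomic_store_explicit(&p->b, b, memory_order_relaxed);
    atomic_store_explicit(&p->seq, s + 2, memory_order_release);
}

/* Reader: retry until the counter is even and unchanged across the read. */
void toy_seq_read(struct toy_seq_pair *p, uint64_t *a, uint64_t *b)
{
    unsigned before, after;

    do {
        before = atomic_load_explicit(&p->seq, memory_order_acquire);
        *a = atomic_load_explicit(&p->a, memory_order_relaxed);
        *b = atomic_load_explicit(&p->b, memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);
        after = atomic_load_explicit(&p->seq, memory_order_relaxed);
    } while ((before & 1u) || before != after);
}
```

Readers never write to shared memory, so they cannot delay a writer. The cost is that a reader may have to repeat its work when a write overlaps.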

💾 History: The semantics stabilized as of version 2.5.59, and they are present in the 2.6.x stable kernel series. The seqlocks were developed by Stephen Hemminger and originally called frlocks, based on earlier work by Andrea Arcangeli. The first implementation was in the x86-64 time code where it was needed to synchronize with user space where it was not possible to use a real lock.

⚲ API

seqlock_t_id

DEFINE_SEQLOCK_id, seqlock_init_id, read_seqlock_excl_id, write_seqlock_id

seqcount_t_id

seqcount_init_id, read_seqcount_begin_id, read_seqcount_retry_id, write_seqcount_begin_id, write_seqcount_end_id

linux/seqlock.h_inc

👁 Example: mount_lock_id, defined in fs/namespace.c_src

📖 References

Sequence counters and sequential locks_doc

SeqLock

Spinning or sleeping locks

Lock         normal            on preempt RT
spinlock_t   raw_spinlock_t    rt_mutex_base, rt_spin_lock, sleeping
rwlock_t     spinning          sleeping
local_lock   preempt_disable   migrate_disable, rt_spin_lock, sleeping

Low level

The compiler might optimize away or reorder writes to variables leading to unexpected behavior when variables are accessed concurrently by multiple threads.

⚲ API

asm-generic/rwonce.h_inc – prevent the compiler from merging or refetching reads or writes.

linux/compiler.h_inc

barrier_id – prevents the compiler from reordering instructions around the barrier

asm-generic/barrier.h_inc – generic barrier definitions

arch/x86/include/asm/barrier.h_src – force strict CPU ordering

mb_id – ensures that all memory operations before the barrier are completed before any memory operations after the barrier are started
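The effect of the read/write-once helpers and of a compiler barrier can be approximated in user space. The macros below are simplified stand-ins with hypothetical `TOY_`/`toy_` names, relying on the GNU C extensions `__typeof__` and inline assembly (GCC or Clang assumed). The kernel's own definitions are more elaborate.

```c
#include <assert.h>

/* Force exactly one load or store through a volatile-qualified access. */
#define TOY_READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define TOY_WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

/* Compiler-only barrier: the compiler may not move memory accesses across
 * it. It emits no CPU instruction, so it does not order the hardware. */
#define toy_barrier()          __asm__ __volatile__("" ::: "memory")

/* Flag that another thread or a signal handler might set. */
int stop_requested;

/* Poll the flag up to max_polls times and return how many polls saw it
 * clear. Without TOY_READ_ONCE the compiler could read the flag once and
 * reuse that value for every iteration. */
int poll_stop_flag(int max_polls)
{
    int polls = 0;

    assert(max_polls >= 0);
    while (polls < max_polls && !TOY_READ_ONCE(stop_requested))
        polls++;
    return polls;
}
```

These helpers constrain only the compiler. Ordering between CPUs still needs a real memory barrier such as the mb_id family listed above.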

⚙️ Internals

Atomics_doc

asm-generic/atomic.h_inc

linux/atomic/atomic-instrumented.h_inc

atomic_dec_and_test_id ...

📚 Further reading

volatile – prevents the compiler from optimizations

Memory barrier – enforces an ordering constraint on memory operations

...

⚙️ Locking internals

linux/lockdep.h_inc – runtime locking correctness validator

linux/debug_locks.h_inc

lib/locking-selftest.c_src

kernel/locking_src

timer_list_id wait_queue_head_t_id

kernel/locking/locktorture.c_src – module-based torture test facility for locking

📖 Locking references

locking_doc

Lock types and their rules_doc

😴 sleeping locks_doc

mutex_id, rt_mutex_id, semaphore_id, rw_semaphore_id, ww_mutex_id, percpu_rw_semaphore_id

on preempt RT: local_lock, spinlock_t, rwlock_t

😵‍💫 spinning locks_doc:

raw_spinlock_t, bit spinlocks

on non preempt RT: spinlock_t, rwlock_t

Unreliable Guide To Locking_doc

Synchronization primitives

Time

⚲ UAPI

uapi/linux/time.h_inc

timespec_id – nanosecond resolution

timeval_id – microsecond resolution

timezone_id

...

uapi/linux/time_types.h_inc

__kernel_timespec_id – nanosecond resolution, used in syscalls

...

⚲ API

...

ktime_t_id – nanosecond scalar representation for kernel time values

ktime_sub_id

...

linux/timekeeping.h_inc

ktime_get_id, ktime_get_ns_id

...

ktime_to_timespec64_id

...
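The difference between the two representations above, a seconds/nanoseconds pair versus a single nanosecond scalar, can be seen in a small user-space sketch. The helper names are hypothetical; the scalar form is similar in spirit to ktime_t, where subtraction is a plain integer operation.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Return end - start as a normalized timespec (0 <= tv_nsec < 1e9).
 * Assumes end is not earlier than start. */
struct timespec toy_timespec_sub(struct timespec end, struct timespec start)
{
    struct timespec d;

    d.tv_sec = end.tv_sec - start.tv_sec;
    d.tv_nsec = end.tv_nsec - start.tv_nsec;
    if (d.tv_nsec < 0) {        /* borrow one second */
        d.tv_sec -= 1;
        d.tv_nsec += 1000000000L;
    }
    return d;
}

/* Elapsed time as one signed 64-bit nanosecond count. */
int64_t toy_elapsed_ns(struct timespec end, struct timespec start)
{
    struct timespec d = toy_timespec_sub(end, start);

    return (int64_t)d.tv_sec * 1000000000LL + d.tv_nsec;
}
```

A typical use is to take two `clock_gettime(CLOCK_MONOTONIC, ...)` samples around a piece of work and pass them to `toy_elapsed_ns()`.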

⚙️ Internals

📖 References

Clock sources, Clock events, sched_clock() and delay timers_doc

Time and timer routines_doc

Year 2038 problem

Interrupts

An interrupt is a signal to the processor emitted by hardware or software indicating an event that needs immediate attention. An interrupt alerts the processor to a high-priority condition requiring the interruption of the current code the processor is executing. The processor responds by suspending its current activities, saving its state, and executing a function called an interrupt handler (or an interrupt service routine, ISR) to deal with the event. This interruption is temporary, and, after the interrupt handler finishes, the processor resumes normal activities.

There are two types of interrupts: hardware interrupts and software interrupts. Hardware interrupts are used by devices to communicate that they require attention from the operating system. For example, pressing a key on the keyboard or moving the mouse triggers hardware interrupts that cause the processor to read the keystroke or mouse position. Unlike the software type, hardware interrupts are asynchronous and can occur in the middle of instruction execution, requiring additional care in programming. The act of initiating a hardware interrupt is referred to as an interrupt request - IRQ (⚙️ do_IRQ_id).

A software interrupt is caused either by an exceptional condition in the processor itself, or a special instruction in the instruction set which causes an interrupt when it is executed. The former is often called a trap (⚙️ do_trap_id) or exception and is used for errors or events occurring during program execution that are exceptional enough that they cannot be handled within the program itself. For example, if the processor's arithmetic logic unit is commanded to divide a number by zero, this impossible demand will cause a divide-by-zero exception (⚙️ X86_TRAP_DE_id), perhaps causing the computer to abandon the calculation or display an error message. Software interrupt instructions function similarly to subroutine calls and are used for a variety of purposes, such as to request services from low-level system software such as device drivers. For example, computers often use software interrupt instructions to communicate with the disk controller to request data be read or written to the disk.

Each interrupt has its own interrupt handler. The number of hardware interrupts is limited by the number of interrupt request (IRQ) lines to the processor, but there may be hundreds of different software interrupts.

⚲ API

/proc/interrupts

man 1 irqtop – utility to display kernel interrupt information

irqbalance – distribute hardware interrupts across processors on a multiprocessor system

There are many ways to request an ISR; two of them are:

devm_request_threaded_irq_id – preferable function to allocate an interrupt line for a managed device with a threaded ISR

request_irq_id, free_irq_id – old and common functions to add and remove a handler for an interrupt line

linux/interrupt.h_inc – main interrupt support header

irqaction_id – contains handler functions

linux/irq.h_inc

irq_data_id

linux/irqflags.h_inc

irqs_disabled_id

local_irq_save_id ...

local_irq_disable_id ...

linux/irqdesc.h_inc

irq_desc_id

linux/irqdomain.h_inc

irq_domain_id – hardware interrupt number translation object

irq_domain_get_irq_data_id

linux/msi.h_inc – Message Signaled Interrupts

msi_desc_id

Structure of structures:

irq_desc_id is a container of

irq_data_id

irq_common_data_id

list of irqaction_id

⚙️ Internals

kernel/irq/settings.h_src

kernel/irq_src

kernel/irq/internals.h_src

ls /sys/kernel/debug/irq/domains/

x86_vector_domain_id, x86_vector_domain_ops_id

irq_chip_id

📖 References

IRQs_doc

The irq_domain interrupt number mapping library_doc

Linux generic IRQ handling_doc

Message Signaled Interrupts: The MSI Driver Guide_doc

Lock types and their rules_doc

Hard IRQ Context_doc

Interrupts

👁 Examples

dummy_irq_chip_id – dummy interrupt chip implementation

lib/locking-selftest.c_src

IRQ affinity

⚲ API

/proc/irq/default_smp_affinity

/proc/irq/*/smp_affinity and /proc/irq/*/smp_affinity_list

Common types and functions:

struct irq_affinity_id – description for automatic irq affinity assignments, see devm_platform_get_irqs_affinity_id

struct irq_affinity_desc_id – interrupt affinity descriptor, see irq_update_affinity_desc_id, irq_create_affinity_masks_id

irq_set_affinity_id

irq_get_affinity_mask_id

irq_can_set_affinity_id

irq_set_affinity_hint_id

irqd_affinity_is_managed_id

irq_data_get_affinity_mask_id

irq_data_get_effective_affinity_mask_id

irq_data_update_effective_affinity_id

irq_set_affinity_notifier_id

irq_affinity_notify_id

irq_chip_set_affinity_parent_id

irq_set_vcpu_affinity_id

🛠️ Utilities

irqbalance – distributes hardware interrupts across CPUs

📖 References

SMP IRQ affinity_doc

IRQ affinity, LF

managed_irq kernel parameter, @LKML

irqaffinity kernel parameter, @LKML

Non-maskable interrupts

⚲ API

linux/nmi.h_inc

in_nmi_id

touch_nmi_watchdog_id

...

trace/events/nmi.h_inc

arch/x86/include/asm/nmi.h_src

register_nmi_handler_id

unregister_nmi_handler_id

⚙️ Internals

arch/x86/kernel/nmi.c_src

arch/x86/kernel/nmi_selftest.c_src

📖 References

NMI Trace Events_doc

📚 Further reading

Non-maskable interrupt handler (NMI)

...

📚 Further reading about interrupts

IDT – Interrupt descriptor table

Tickless (Full dynticks) reduces timer interrupts overhead, CONFIG_NO_HZ_FULL_id

Deferred works

Scheduler context

Threaded IRQ

⚲ API

devm_request_threaded_irq_id, request_threaded_irq_id

The ISR should return IRQ_WAKE_THREAD to run the thread function.

⚙️ Internals

setup_irq_thread_id, irq_thread_id

kernel/irq/manage.c_src

📖 References

request_threaded_irq_doc

Work

Work is a thin wrapper around the workqueue API: schedule_work_id queues a work item on the shared system workqueue.

⚲ API

linux/workqueue.h_inc

work_struct_id, INIT_WORK_id, schedule_work_id,

delayed_work_id, INIT_DELAYED_WORK_id, schedule_delayed_work_id, cancel_delayed_work_sync_id

👁 Example usage samples/ftrace/sample-trace-array.c_src

⚙️ Internals: system_wq_id

Workqueue

⚲ API

linux/workqueue.h_inc

workqueue_struct_id, alloc_workqueue_id, queue_work_id

⚙️ Internals

workqueue_init_id, create_worker_id, pool_workqueue_id

kernel/workqueue.c_src

📖 References

Concurrency Managed Workqueue_doc

Interrupt context

linux/irq_work.h_inc – framework for enqueueing and running callbacks from hardirq context

samples/trace_printk/trace-printk.c_src

Timers

softirq timer

This timer runs in softirq context and is intended for periodic tasks; its resolution is one jiffy.

⚲ API

linux/timer.h_inc

timer_list_id, DEFINE_TIMER_id, timer_setup_id

mod_timer_id — sets expiration time in jiffies.

del_timer_id
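Because mod_timer_id takes its expiry in jiffies, callers usually convert from wall-clock units first; the kernel provides msecs_to_jiffies for this. The arithmetic behind that conversion can be sketched in user space as follows. This is only an illustration: ms_to_jiffies and its hz parameter are hypothetical names, hz stands in for the kernel's compile-time HZ, and the kernel's own conversion is more elaborate.

```c
#include <stdint.h>

/*
 * Convert milliseconds to timer ticks (jiffies) for a given tick rate.
 * hz is the tick frequency, standing in for the kernel's HZ (commonly
 * 100, 250 or 1000). It is a plain parameter here because user space
 * cannot see the kernel's compile-time value.
 * The result is rounded up so that a timer never fires early.
 */
uint64_t ms_to_jiffies(uint64_t ms, unsigned int hz)
{
    return (ms * hz + 999) / 1000;
}
```

At 250 ticks per second one jiffy lasts 4 ms, so a 5 ms delay needs 2 jiffies while a 4 ms delay fits in 1. This coarse granularity is the reason the softirq timer is described as having jiffies resolution.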

⚙️ Internals

kernel/time/timer.c_src

timer_bases_id

👁 Examples

input_enable_softrepeat_id and input_start_autorepeat_id

📚 References

Time and timer routines_doc

mod_timer_pending ... _doc

High-resolution timer

⚲ API

/proc/timer_list

/proc/sys/kernel/timer_migration

linux/hrtimer_defs.h_inc

linux/hrtimer.h_inc

hrtimer_id, hrtimer.function — callback

hrtimer_init_id

hrtimer_setup_id

hrtimer_start_id — starts a timer with nanosecond resolution

hrtimer_cancel_id
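The hrtimer API itself is kernel-only, but its effect is visible from user space through the POSIX clock interfaces. As a hedged illustration, the sketch below asks for the resolution of CLOCK_MONOTONIC. The function name monotonic_resolution_ns is hypothetical. On kernels where high-resolution timers are active the reported value is typically 1 ns; otherwise it usually reflects the tick period.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/*
 * Return the resolution of CLOCK_MONOTONIC in nanoseconds,
 * or -1 if the clock cannot be queried.
 */
long long monotonic_resolution_ns(void)
{
    struct timespec res;

    if (clock_getres(CLOCK_MONOTONIC, &res) != 0)
        return -1;
    return (long long)res.tv_sec * 1000000000LL + res.tv_nsec;
}
```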

👁 Examples alarm_init_id, watchdog_enable_id

⚙️ Internals

CONFIG_HIGH_RES_TIMERS_id

kernel/time/tick-internal.h_src

hrtimer_bases_id

kernel/time/hrtimer.c_src

kernel/time/itimer.c_src

kernel/time/timer_list.c_src

📚 HR timers references

High-resolution timers_doc

hrtimers - subsystem for high-resolution kernel timers_doc

high resolution timers and dynamic ticks design notes_doc

...

📚 Timers references

Timers_doc

Better CPU selection for timer expiration

Tasklet

A tasklet runs in softirq context and is intended for time-critical operations.

⚲ API is deprecated in favor of threaded IRQs: devm_request_threaded_irq_id

tasklet_struct_id, tasklet_init_id, tasklet_schedule_id

⚙️ Internals: tasklet_action_common_id HI_SOFTIRQ, TASKLET_SOFTIRQ

Softirq

Softirq is an internal kernel facility and should not be used directly. Use tasklets or threaded IRQs instead.

⚲ API

linux/interrupt.h_inc

cat /proc/softirqs

open_softirq_id registers softirq_action_id
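/proc/softirqs prints one row per softirq type and one column per CPU. A small parser makes the layout concrete. The sketch below handles a single line of that file; softirq_line_total is a hypothetical helper name, and the line format is assumed to be "NAME: count count ...", as shown by cat /proc/softirqs.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Parse one data line of /proc/softirqs, e.g. "   TIMER:   120   80",
 * copying the softirq name into name[] and returning the sum of the
 * per-CPU counters. Returns -1 if the line has no "NAME:" prefix
 * (such as the CPU header line) or if name[] is too small.
 */
long long softirq_line_total(const char *line, char *name, size_t name_len)
{
    const char *colon = strchr(line, ':');

    if (!colon)
        return -1;
    while (*line == ' ' || *line == '\t')   /* skip left padding */
        line++;

    size_t n = (size_t)(colon - line);
    if (n == 0 || n >= name_len)
        return -1;
    memcpy(name, line, n);
    name[n] = '\0';

    long long total = 0;
    const char *p = colon + 1;
    for (;;) {
        char *end;
        unsigned long long v = strtoull(p, &end, 10);
        if (end == p)                       /* no more numbers on the line */
            break;
        total += (long long)v;
        p = end;
    }
    return total;
}
```

A caller would typically open /proc/softirqs with fopen, read it line by line with fgets, and skip the lines for which the function returns -1.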

⚙️ Internals

kernel/softirq.c_src

📖 References

Introduction to deferred interrupts (Softirq, Tasklets and Workqueues)

Timers and time management

Deferred work, linux-kernel-labs

Chapter 7. Time, Delays, and Deferred Work

CPU specific

🖱️ GUI

tuna – program for tuning running processes

⚲ API

cat /proc/cpuinfo

/sys/devices/system/cpu/

/sys/cpu/

/sys/fs/cgroup/cpu/

grep -i cpu /proc/self/status

rdmsr – tool for reading CPU machine specific registers (MSR)

man 1 lscpu – display information about the CPU architecture

linux/arch_topology.h_inc – arch specific cpu topology information

linux/cpu.h_inc – generic cpu definition

linux/cpu_cooling.h_inc

linux/cpu_pm.h_inc

linux/cpufeature.h_inc

linux/peci-cpu.h_inc

linux/sched/cputime.h_inc – cputime accounting APIs
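From user space, the simplest programmatic view of the CPU set is sysconf, whose values the C library on Linux typically derives from the same /sys and /proc files listed above. The following is a minimal sketch; the two wrapper names are hypothetical.

```c
#include <unistd.h>

/* Number of CPUs currently online, or -1 if sysconf() cannot tell. */
long online_cpu_count(void)
{
    return sysconf(_SC_NPROCESSORS_ONLN);
}

/* Number of CPUs configured in the system, online or not, or -1. */
long configured_cpu_count(void)
{
    return sysconf(_SC_NPROCESSORS_CONF);
}
```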

⚙️ Internals

drivers/base/cpu.c_src

cpu_dev_init_id

Cache

linux/cacheflush.h_inc

arch/x86/include/asm/cacheflush.h_src: clflush_cache_range_id

linux/cache.h_inc

arch/x86/include/asm/cache.h_src

⚙️ Internals

arch/x86/mm/pat/set_memory.c_src

arch/x86/kernel/cpu/mtrr/_src

📚 Further reading

MTRR – Memory type range register

CPU cache

SMP

This chapter covers the multiprocessing and multi-core aspects of the Linux kernel.

Key concepts and features of Linux SMP include:

Symmetry: In an SMP system, all processors are treated as equals, with no hardware hierarchy among them, in contrast to designs built around coprocessors.
Load balancing: The Linux kernel employs load balancing mechanisms to distribute tasks evenly among available CPU cores. This prevents any one core from becoming overwhelmed while others remain underutilized.
Parallelism: SMP enables parallel processing, where multiple threads or processes can execute simultaneously on different CPU cores. This can significantly improve the execution speed of applications that are designed to take advantage of multiple threads.
Thread scheduling: The Linux kernel scheduler is responsible for determining which threads or processes run on which CPU cores and for how long. It aims to optimize performance by minimizing contention and maximizing CPU utilization.
Shared memory: In an SMP system, all CPU cores typically share the same physical memory space. This allows processes and threads running on different cores to communicate and share data more efficiently.
NUMA – Non-Uniform Memory Access: In larger SMP systems, memory access times might not be uniform due to the physical arrangement of memory banks and processors. Linux has mechanisms to handle NUMA architectures efficiently, allowing processes to be scheduled on CPUs closer to their associated memory.
Cache coherency: SMP systems require mechanisms to ensure that all CPU cores have consistent views of memory. Cache coherency protocols ensure that changes made to shared memory locations are correctly propagated to all cores.
Scalability: SMP systems can be scaled up to include more CPU cores, enhancing the overall computing power of the system. However, as the number of cores increases, challenges related to memory access, contention, and communication between cores may arise.
Kernel and user space: Linux applications running in user space can take advantage of SMP without needing to be aware of the underlying hardware details. The kernel handles the management of CPU cores and resource allocation.

⚲ API

ps -PLe – lists threads together with the processor each thread last executed on (third column, PSR).

man 2 getcpu – determine CPU and NUMA node on which the calling thread is running

man 8 chcpu – configure CPUs

man 3 CPU_SET – macros for manipulating CPU sets

linux/smp.h_inc

linux/cpu.h_inc

linux/group_cpus.h_inc: group_cpus_evenly_id – groups all CPUs evenly per NUMA/CPU locality

asm-generic/percpu.h_inc

linux/percpu-defs.h_inc – basic definitions for percpu areas

this_cpu_ptr_id

linux/percpu.h_inc

linux/percpu-refcount.h_inc

linux/percpu-rwsem.h_inc

linux/preempt.h_inc

migrate_disable_id, migrate_enable_id

/sys/bus/cpu

per CPU local_lock

arch/x86/include/asm/topology.h_src
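The getcpu system call listed above can be tried with a few lines of C. The sketch below uses the raw system call through syscall() so that it does not depend on a particular C library wrapper; current_cpu_and_node is a hypothetical helper name.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Query the CPU and NUMA node the calling thread is running on.
 * Returns 0 on success, -1 on error.
 * The answer may already be stale when the function returns, because
 * the scheduler is free to migrate the thread at any time.
 */
int current_cpu_and_node(unsigned int *cpu, unsigned int *node)
{
    return (int)syscall(SYS_getcpu, cpu, node, NULL);
}
```

Because of possible migration, the result is best treated as a hint, for example for choosing a per-CPU slot in a statistics array, unless the thread has been pinned to a single CPU.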

⚙️ Internals

boot_cpu_init_id activates the first CPU

smp_prepare_cpus_id initializes the remaining CPUs during boot

cpu_number_id

CONFIG_SMP_id

CONFIG_NUMA_id

trace/events/percpu.h_inc

IPI – Inter-processor interrupt

trace/events/ipi.h_inc

kernel/irq/ipi.c_src

ipi_send_single_id, ipi_send_mask_id ...

drivers/base/cpu.c_src – CPU driver model subsystem support

kernel/cpu.c_src

smpboot

linux/smpboot.h_inc

kernel/smpboot.c_src

arch/x86/kernel/smpboot.c_src

lib/group_cpus.c_src

🛠️ Utilities

irqbalance – distributes hardware interrupts across CPUs

man 8 numactl – controls NUMA policy for processes or shared memory

📖 References

Per-CPU Data_doc

How CPU topology info is exported via sysfs_doc

📚 Further reading

Functionalities of the scheduler TuneD plugin

tuned-adm – command line tool for switching between different tuning profiles

CPU affinity

Affinity is the assignment of a process or thread to specific CPU cores. It controls which CPUs may execute a task, which can improve performance by reducing data movement between cores. Affinity can be managed with system calls or command-line tools. It is represented either as a CPU bitmask (cpumask_t_id) or as a CPU list such as 0-3,8 (see cpulist_parse_id).
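The relationship between the list form and the bitmask form can be shown with a small user-space parser. This is a simplified sketch, not the kernel's cpulist_parse_id: the name cpulist_to_mask is hypothetical, only CPUs 0..63 are supported, and only the "a", "a-b" and comma-separated forms are accepted.

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Parse a CPU list such as "0-3,8" into a bitmask (bit n set = CPU n).
 * Simplified sketch: CPUs 0..63 only; a trailing newline is tolerated
 * so that lines read from sysfs files can be passed in directly.
 * Returns 0 on success, -1 on malformed or out-of-range input.
 */
int cpulist_to_mask(const char *list, uint64_t *mask)
{
    uint64_t m = 0;
    const char *p = list;

    for (;;) {
        char *end;

        if (*p < '0' || *p > '9')           /* each item starts with a digit */
            return -1;
        unsigned long first = strtoul(p, &end, 10);
        unsigned long last = first;
        p = end;

        if (*p == '-') {                    /* range "first-last" */
            p++;
            if (*p < '0' || *p > '9')
                return -1;
            last = strtoul(p, &end, 10);
            p = end;
        }
        if (last < first || last > 63)
            return -1;
        for (unsigned long cpu = first; cpu <= last; cpu++)
            m |= UINT64_C(1) << cpu;

        if (*p == '\0' || *p == '\n')
            break;
        if (*p != ',')
            return -1;
        p++;
    }
    *mask = m;
    return 0;
}
```

For the input 0-3,8 the resulting mask is 0x10f: bits 0 to 3 plus bit 8.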

⚲ API

man 1 taskset – set or retrieve a process's CPU affinity

grep Cpus_allowed /proc/self/status

man 2 sched_setaffinity, man 2 sched_getaffinity – set and get a thread's CPU affinity mask

↪ sched_setaffinity_id

set_cpus_allowed_ptr_id – common kernel function to change a task's affinity mask

linux/cpu_rmap.h_inc – CPU affinity reverse-map support

linux/cpumask_types.h_inc

struct cpumask, cpumask_t_id – CPUs bitmap, can be very big

cpumask_var_t_id – type for local cpumask variable, see alloc_cpumask_var_id, free_cpumask_var_id.

linux/cpumask.h_inc – cpumasks provide a bitmap suitable for representing the set of CPUs in a system, one bit position per CPU number

for_each_possible_cpu_id
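On the user-space side, the sched_setaffinity and sched_getaffinity calls listed above operate on a cpu_set_t manipulated with the CPU_SET macros. The following sketch pins the calling thread to the first CPU it is currently allowed to use; pin_to_first_allowed_cpu is a hypothetical helper name, and the fixed-size cpu_set_t assumes a system with no more CPUs than CPU_SETSIZE.

```c
#define _GNU_SOURCE
#include <sched.h>

/*
 * Pin the calling thread to the first CPU of its current affinity mask.
 * Returns the chosen CPU number, or -1 on failure.
 */
int pin_to_first_allowed_cpu(void)
{
    cpu_set_t allowed, one;

    /* A pid of 0 means "the calling thread". */
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
        return -1;

    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (!CPU_ISSET(cpu, &allowed))
            continue;
        CPU_ZERO(&one);
        CPU_SET(cpu, &one);
        if (sched_setaffinity(0, sizeof(one), &one) != 0)
            return -1;
        return cpu;
    }
    return -1;
}
```

After a successful call, grep Cpus_allowed /proc/self/status run from the same process would be expected to show a mask with a single bit set.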

⚙️ Internals

cpus_mask_id – affinity of task_struct_id

cpus_allowed_id – affinity of cpuset_id

📚 Further reading

Processor affinity

Affinity mask

CPU hotplug

CPU hotplugging in Linux refers to the ability to dynamically add or remove CPUs from the system without needing a reboot. This feature is crucial in environments requiring high availability and resource flexibility, such as data centers, virtualized systems, and systems that use power management aggressively.

⚲ API

/sys/devices/system/cpu/cpu*/online

/sys/devices/system/cpu/cpu*/hotplug/

include/linux/cpu.h_inc

add_cpu_id ...

linux/cpuhotplug.h_inc

cpuhp_state_id – CPU hotplug states

cpuhp_setup_state_id ... – sets up hotplug state callbacks

cpuhp_setup_state_multi_id

cpuhp_setup_state_nocalls_id

linux/cpuhplock.h_inc – CPU hotplug locking

cpus_read_lock_id ...

remove_cpu_id ...
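The sysfs online file listed above is the usual user-space entry point to CPU hotplug. The sketch below only reads the state; cpu_online_state is a hypothetical helper name. Some CPUs, often CPU 0, may not expose the file at all, which the sketch reports as -1.

```c
#include <stdio.h>

/*
 * Read /sys/devices/system/cpu/cpu<N>/online.
 * Returns 1 if the CPU is online, 0 if it is offline, and -1 if the
 * file is missing, unreadable, or holds an unexpected value.
 */
int cpu_online_state(unsigned int cpu)
{
    char path[64];
    int state = -1;

    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%u/online", cpu);

    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    if (fscanf(f, "%d", &state) != 1 || (state != 0 && state != 1))
        state = -1;
    fclose(f);
    return state;
}
```

Writing 0 or 1 to the same file as root requests the corresponding offline or online transition.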

⚙️ Internals

kernel/cpu.c_src

cpuhp_hp_states_id

boot_cpu_hotplug_init_id

cpuhp_threads_init_id

... cpuhp_invoke_callback_range_id ...

kernel/irq/cpuhotplug.c_src

drivers/base/cpu.c_src – CPU subsystem support

cpu_dev_init_id

... cpu_subsys_online_id

👁️ Examples

torture_onoff_id

tools/testing/selftests/sched_ext/hotplug.c_src

📖 References

CPU hotplug in the Kernel_doc

📚 Further reading

CONFIG_CPU_HOTPLUG_STATE_CONTROL_id – enables the ability to write incremental steps between "offline" and "online" states to the CPU's sysfs target file, allowing for more granular control of state transitions.

target_store_id: cpu_up_id/cpu_down_id

cpuhotplug @LKML

CPU isolation

CPU isolation ensures that specific tasks run on dedicated CPUs, reducing contention and latency.

Housekeeping CPUs refer to the CPUs that are reserved for various system tasks. See hk_type_id.

Isolated CPUs are dedicated to latency-sensitive or real-time workloads, such as applications built on DPDK.

⚲ API

/sys/devices/system/cpu/isolated

/sys/devices/system/cpu/nohz_full

/sys/fs/cgroup/cpuset.cpus.isolated

/sys/fs/cgroup/.../cpuset.cpus.partition

linux/sched/isolation.h_inc

hk_type_id – housekeeping type

housekeeping_cpumask_id

cpu_is_isolated_id

linux/cpuset.h_inc – cpuset interface

cpuset_cpu_is_isolated_id

man 7 cpuset – confine processes to processor and memory node subsets

⚙️ Internals

CONFIG_CPU_ISOLATION_id

kernel/sched/isolation.c_src

isolated_cpus_id

CONFIG_CPUSETS_id

kernel/cgroup/cpuset.c_src

cpuset_init_id

cpuset_init_smp_id

partition_xcpus_newstate_id

📖 References

CPU lists in command-line parameters_doc

nohz_full clears housekeeping.cpumasks_id for tick, wq, timer, rcu, misc, and kthread in housekeeping_nohz_full_setup_id

isolcpus clears housekeeping.cpumasks_id for domain (by default), tick, and managed_irq in housekeeping_isolcpus_setup_id

CPUSETS of cgroup v2_doc

CPUSETS of cgroup v1_doc

📚 Further reading

CPU Isolation state of the art, LPC'23

CPU Isolation

isolcpus @LKML

housekeeping @LKML

Explicitly Reserved CPU List, Kubernetes Documentation

CPU Partitioning

Scheduler Domains_doc – the Scheduler balances CPUs (scheduling groups) within a sched domain

nohz_full @LKML

Memory barriers

Memory barriers (MB) are synchronization mechanisms used to ensure proper ordering of memory operations in an SMP environment. They play a crucial role in maintaining the consistency and correctness of data shared among different CPU cores or processors. MBs prevent unexpected and potentially harmful reordering of memory access instructions by the compiler or CPU, which can lead to data corruption and race conditions in a concurrent software system.

⚲ API

man 2 membarrier

asm-generic/barrier.h_inc

mb_id, rmb_id, wmb_id

smp_mb_id, smp_rmb_id, smp_wmb_id
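The typical use of paired barriers is message passing: a writer fills in data and then sets a flag, and a reader checks the flag before touching the data. The kernel macros cannot be used outside the kernel, so the sketch below is a user-space analogy built on C11 fences. The release fence plays a role roughly comparable to smp_wmb_id and the acquire fence to smp_rmb_id, but the C11 fences are not identical to the kernel primitives. The mailbox type and both function names are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct mailbox {
    int payload;        /* plain data, written before the flag */
    atomic_int ready;   /* 0 = empty, 1 = payload is valid */
};

/* Writer side: store the data, then publish it by setting the flag. */
void mailbox_publish(struct mailbox *mb, int value)
{
    mb->payload = value;
    /* Keep the payload store ordered before the flag store. */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&mb->ready, 1, memory_order_relaxed);
}

/* Reader side: returns true and fills *out only once the flag is seen. */
bool mailbox_try_read(struct mailbox *mb, int *out)
{
    if (atomic_load_explicit(&mb->ready, memory_order_relaxed) == 0)
        return false;
    /* Keep the payload load ordered after the flag load. */
    atomic_thread_fence(memory_order_acquire);
    *out = mb->payload;
    return true;
}
```

Without the two fences, the compiler or the CPU would be allowed to reorder the accesses, so a reader on another core could see the flag set while still reading a stale payload.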

⚙️ Internals

arch/x86/include/asm/barrier.h_src

kernel/sched/membarrier.c_src

📖 References

Memory barriers_doc

States

C-states and P-states are features in modern CPUs designed to improve energy efficiency.

C-states, aka cpuidle, Processor states:

C0 – the operating state.

C1 (aka Halt) – the processor is not executing instructions, but can return to an executing state essentially instantaneously.

C2 (aka Stop-Clock) – the processor maintains all software-visible state, but may take longer to wake up.

C3 (aka Sleep) – takes longer to wake up.

...

P-states, aka cpufreq, Performance states:

P0 – maximum power and frequency

Pn – less power and frequency

...

⚲ API

Working-State Power Management_doc

intel_pstate=

cpufreq.default_governor=

/dev/cpu_dma_latency – see set_cpu_dma_latency_id

C-states interfaces:

/sys/devices/system/cpu/cpuidle/

/sys/devices/system/cpu/cpu*/cpuidle/

linux/cpuidle.h_inc – a generic framework for CPU idle power management

intel_idle CPU Idle Time Management Driver_doc

P-states interfaces:

/sys/devices/system/cpu/cpufreq/

/sys/devices/system/cpu/cpu*/cpufreq/

/sys/devices/system/cpu/intel_pstate/

linux/cpufreq.h_inc

Kernel Command Line Options for intel_pstate_doc

linux/sched/cpufreq.h_inc – interface between cpufreq drivers and the scheduler

linux/pm_qos.h_inc
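The /dev/cpu_dma_latency interface listed above lets a process limit how deep the idle states may go. According to the PM QoS documentation, a process writes a signed 32-bit latency bound in microseconds, and the request stays registered only while the file remains open. The following sketch assumes sufficient privileges to open the device; cpu_latency_request is a hypothetical helper name.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/*
 * Ask the kernel to keep CPU wake-up latency at or below max_latency_us.
 * Returns an open file descriptor on success, -1 on failure (typically
 * a missing device node or lack of permission).
 * The request is active only while the descriptor stays open;
 * calling close() on it drops the request.
 */
int cpu_latency_request(int32_t max_latency_us)
{
    int fd = open("/dev/cpu_dma_latency", O_WRONLY);

    if (fd < 0)
        return -1;
    if (write(fd, &max_latency_us, sizeof(max_latency_us))
            != (ssize_t)sizeof(max_latency_us)) {
        close(fd);
        return -1;
    }
    return fd;
}
```

A latency-sensitive program would call this once at start-up, keep the descriptor for its whole lifetime, and close it on exit. A bound of 0 asks for the shallowest idle behaviour, at the cost of higher power consumption.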

⚙️ Internals

drivers/cpuidle_src – C-states implementation

drivers/cpufreq_src – P-states implementation

intel_pstate_id

acpi_cpufreq_driver_id

kernel/sched/cpufreq_schedutil.c_src – implementation of cpufreq.default_governor=schedutil

kernel/power/qos.c_src

cpu_latency_qos_miscdev_id – implementation of /dev/cpu_dma_latency

📖 References

Working-State Power Management_doc

CPU Idle Time Management_doc

CPU Performance Scaling_doc

PM Quality Of Service Interface_doc

Device Frequency Scaling_doc

CPUFreq Governor

CPUFreq - CPU frequency and voltage scaling_doc

📚 Further reading

cpufreq_ltp

cpupower

How to use cpufrequtils

cpufreq-info

cpufreq-set

Architectures

Linux CPU architectures refer to the different types of central processing units (CPUs) that are compatible with the Linux operating system. Linux is designed to run on a wide range of CPU architectures, which allows it to be utilized on various devices, from smartphones to servers and supercomputers. Each architecture has its own unique features, advantages, and design considerations.

Architectures are classified by family (e.g. x86, ARM) and by word or long int size (e.g. CONFIG_32BIT_id, CONFIG_64BIT_id).
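Both classification axes can be observed from a running program. The sketch below reports the width of long, which on Linux matches the pointer width and corresponds to what the kernel calls BITS_PER_LONG, together with the machine name returned by uname. The helper names long_bits and machine_name are hypothetical.

```c
#include <limits.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/utsname.h>

/* Width of 'long' in bits for the ABI this program was compiled for. */
unsigned int long_bits(void)
{
    return (unsigned int)(sizeof(long) * CHAR_BIT);
}

/*
 * Copy the machine (architecture) name reported by the running kernel,
 * for example "x86_64" or "aarch64", into buf.
 * Returns 0 on success, -1 on failure or if buf is too small.
 */
int machine_name(char *buf, size_t len)
{
    struct utsname u;

    if (uname(&u) != 0)
        return -1;
    int n = snprintf(buf, len, "%s", u.machine);
    return (n >= 0 && (size_t)n < len) ? 0 : -1;
}
```

Note that long_bits describes the program's ABI rather than the kernel: a 32-bit binary running on a 64-bit kernel still reports 32, while machine_name reflects the kernel.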

Some functions with different implementations for different CPU architectures:

do_boot_cpu_id > start_secondary_id > cpu_init_id