memory
Processes
virtual memory
memory mapping
demand paging and swap
logical memory
Page Allocator
pages

The kernel has full access to the system's memory and allows processes to safely access this memory as they require it. Often the first step in doing this is virtual addressing, usually achieved by paging and/or segmentation. Virtual addressing allows the kernel to make a given physical address appear to be another address, the virtual address. Virtual address spaces may be different for different processes; the memory that one process accesses at a particular (virtual) address may be different memory from what another process accesses at the same address. This allows every program to behave as if it is the only one (apart from the kernel) running and thus prevents applications from crashing each other.

On many systems, a program's virtual address may refer to data which is not currently in memory. The layer of indirection provided by virtual addressing allows the operating system to use other data stores, like a hard drive, to store what would otherwise have to remain in main random-access memory (RAM). As a result, operating systems can allow programs to use more memory than the system has physically available. When a program needs data which is not currently in RAM, the MMU signals to the kernel that this has happened, and the kernel responds by writing the contents of an inactive memory block to disk (if necessary) and replacing it with the data requested by the program. The program can then be resumed from the point where it was stopped. This scheme is generally known as demand paging.

Virtual addressing also allows creation of virtual partitions of memory in two disjointed areas, one being reserved for the kernel (kernel space) and the other for the applications (user space). The applications are not permitted by the processor to address kernel memory, thus preventing an application from damaging the running kernel. This fundamental partition of memory space has contributed much to the current designs of actual general-purpose kernels and is almost universal in such systems, Linux being one of them.


⚲ Shell interface

cat /proc/meminfo
man 1 free
man 8 vmstat



Processes

edit

Process is a running user space program. Kernel starts the first process /sbin/init in function run_init_process idusing kernel_execve id. Processes occupy system resources, like memory, CPU time. System calls sys_fork id and sys_execve id are used to create new processes from user space. The process exit with an sys_exit id system call.

Linux inherits from Unix its basic process management system calls (⚲ API ↪ ⚙️ implementations):

man 2 forkkernel_clone id creates a new process by duplicating the process invoking it.

man 2 _exitdo_exit id terminates the calling process "immediately". Any open file descriptors belonging to the process are closed.

man 2 waitkernel_waitid id suspends the execution of the calling process until one of its children processes terminates.

man 2 execvedo_execve id runs an executable file in the context of current process, replacing the previous executable. This system call is used by family of functions of libc man 3 exec

Linux enhances the traditional Unix process API with its own system calls man 2 clone. Clone creates a child process that may share parts of its execution context with the parent. It is often used to implement threads (though programmers will typically use a higher-level interface such as man 7 pthreads, implemented on top of clone).


PID - Process identifier defined as pid_t id is unique sequential number. man 1 ps -A lists current processes. Syscall man 2 getpidtask_tgid_vnr id return PID of the current process which internally is called TGID - thread group id. A process can contain many threads. man 2 gettidtask_pid_vnr id returns thread id. Which internal historically is called PID. ⚠️ Warning: confusion. User space PID ≠ kernel space PID. man 1 ps -AF lists current processes and thread as LWP. For a single thread process all these IDs are equal.


⚲ API

unistd.h
sys/types.h
sys/wait.h


⚙️ Internals

task_struct id
pid_type id
kernel/fork.c src
syscalls:
man 2 set_tid_address – set pointer to thread ID
man 2 fork – create a child process
man 2 vfork – create a child process and block parent
man 2 clone – create a child process
man 2 unshare – disassociate parts of the process execution context
kernel/sys.c src
syscalls:
man 2 prctl – operations on a process or thread
kernel/pid.c src
syscalls:
man 2 pidfd_open – obtain a file descriptor that refers to a process
man 2 pidfd_getfd – obtain a duplicate of another process's file descriptor
syscalls:
man 2 pidfd_open – obtain a file descriptor that refers to a process
man 2 pidfd_getfd – obtain a duplicate of another process's file descriptor
kernel/exit.c src
syscalls:
man 2 exit – terminate the calling process
man 2 exit_group – exit all threads in a process
man 2 waitid – wait for process to change state
man 2 waitpid – wait for process to change state


fs/exec.c src


📖 References

fork (system call)
exit (system call)
wait (system call)
exec (system call)

Inter-process communication

edit

Inter-process communication (IPC) refers specifically to the mechanisms an operating system provides to allow processes it manages to share data. Methods for achieving IPC are divided into categories which vary based on software requirements, such as performance and modularity requirements, and system circumstances. Linux inherited from Unix the following IPC mechanisms:

Signals (⚲ API ↪ ⚙️ implementations):

man 2 kill sends signal to a process
man 2 tgkilldo_tkill id sends a signal to a thread
man 2 process_vm_readvprocess_vm_rw id - zero-copy data transfer between process address spaces

🔧 TODO: man 2 sigaction man 2 signal man 2 sigaltstack man 2 sigpending man 2 sigprocmask man 2 sigsuspend man 2 sigwaitinfo man 2 sigtimedwait

kernel/signal.c src


Anonymous pipes and named pipes (FIFOs) man 2 mknoddo_mknodat id S_IFIFO id
Express Data Path PF_XDP id
Unix domain socket PF_UNIX id
Memory-mapped files man 2 mmapksys_mmap_pgoff id
Sys V IPC:
Message queues
Semaphores
Shared memory: man 2 shmget, man 2 shmctl, man 2 shmat, man 2 shmdt


📖 References

Inter-process communication
man 7 sysvipc

Memory management Linux API

edit
man 2 brksys_brk id, do_brk_flags id dynamically changes data segment size of the calling process.

The change is made by resetting the program break of the process, which determines the maximum space that can be allocated. The program break is the address of the first location beyond the current end of the data region, and determines the maximum space that can be allocated by the process. The amount of available space increases as the break value increases. The added available space is initialized to a value of zero.

man 2 mmapksys_mmap_pgoff id maps files or devices into memory.

It is a method of memory-mapped file I/O. It naturally implements demand paging, because file contents are not read from disk initially and do not use physical RAM at all. The actual reads from disk are performed in a "lazy" manner, after a specific location is accessed. After the memory is no longer needed it is important to man 2 unmmap the pointers to it. Protection information can be managed using man 2 mprotectdo_mprotect_pkey id and special treatment can be enforced using man 2 madvisedo_madvise id. In Linux, man 2 mmap can create several types of mappings, such as anonymous mappings, shared mappings and private mappings. Using the MAP_ANONYMOUS flag mmap() can map a specific area of the process's virtual memory not backed by any file, whose contents are initialized to zero.

These functions are typically called from a higher-level memory management library function such as C standard library man 3 malloc or C++ new operator.


💾 History: Two basic related to memory management system calls brk and mmap Linux inherits from Unix.

BTW: On Linux, man 2 sbrk is not a separate system call, but a C library function that also calls to sys_brk id and keeps some internal state to return the previous break value.


📚 References

Memory Management APIs doc
x86_64 Memory Management doc
sbrk
mmap
Memory management


⚙️ Internals:

sys_brk id ↯ call hierarchy:
do_brk_flags id
vm_area_alloc id
kmem_cache_alloc id
__kmem_cache_alloc_lru id

Virtual memory

edit

🔧 TODO:

Virtually contiguous memory on top of physical and swapped memory pages.


🗝️ Acronyms:

VPFN - Virtual Page Frame Number
PFN - Physical Page Frame Number
pgd - Page Directory
pmd - Page Middle Directory
pud - Page Upper Directory
pte - Page table Entry
TLB - Translation Lookaside Buffer
MMU - Memory Management Unit


⚲ API:

linux/vmalloc.h inc
vmalloc id vfree id
linux/cleanup.h inc – scope-based cleanup helpers
scoped_guard id


⚙️ Internals:

vm_struct id
virt_to_page id
vmalloc_init id
find_vma id
mm/vmalloc.c src


📚 References

Virtually Contiguous Mappings doc
Page Tables doc
Scope-based Cleanup Helpers doc

Data structures

edit

⚲ API:

linux/types.h inc
linux/kref.h inc
list_head id – common double linked list
linux/list.h inc - basic list_head id operations
list_head id list_add id ...
linux/klist.h inc – some klist_node id->kref id helpers
klist_add_tail id ...
linux/kobject.h inc
linux/circ_buf.h inc
linux/kfifo.h inc – generic kernel FIFO
kfifo_in id ...
linux/rbtree.h inc – Red-black trees
rb_node id
linux/scatterlist.h inc
scatterlist id
👁 Example: samples/kfifo/dma-example.c src
linux/idr.h inc – ID allocation
linux/bitmap.h inc – bit arrays that consume one or more unsigned longs
linux/sbitmap.h inc – fast and scalable bitmaps


📚 References:

List Management Functions doc
FIFO Buffer doc
Data structures and low-level utilities doc
Everything you never wanted to know about kobjects, ksets, and ktypes doc
Adding reference counters (krefs) to kernel objects doc
Generic Associative Array Implementation doc
XArray doc
ID Allocation doc
Circular Buffers doc
Red-black Trees (rbtree) in Linux doc
Generic radix trees/sparse arrays doc
Generic bitfield packing and unpacking functions doc
How to access I/O mapped memory from within device drivers doc
this_cpu operations doc
The errseq_t datatype doc
Data Structures in the Linux Kernel

Memory mapping

edit

🔧 TODO

Key items:

man 2 mmap man 2 mprotect man 2 mmap2 man 2 mincore man 2 ksys_mmap_pgoff

do_mmap id mm_struct id vm_area_struct id vm_struct id remap_pfn_range id SetPageReserved id ClearPageReserved id free_mmap_pages alloc_mmap_pages free_mmap_pages id


⚲ API:

linux/mm_types.h inc
linux/mm.h inc


⚙️ Internals:

mm/mmap.c src


📚 References:

mmap
Maple Tree doc
Memory mapping, linux-kernel-labs

Swap

edit

🔧 TODO


⚲ API:

cat /proc/sys/vm/swappiness ↪ vm_swappiness id
linux/swap.h inc
man 2 swaponenable_swap_slots_cache id
man 2 swapoff
man 2 mlockdo_mlock id
man 2 shmctlshmctl_do_lock id


⚙️ Internals:

mm/swapfile.c src
mm/vmscan.c src
mm/mlock.c src

VM_LOCKED id swap_info_struct id si_swapinfo id swap_info id handle_pte_fault id do_swap_page id wakeup_kswapd id kswapd id


📚 References:

Memory paging
https://wiki.archlinux.org/title/swap

Logical memory

edit

kmalloc id is the normal method of allocating memory in the kernel for objects smaller than the page size. It is defined in linux/slab.h inc. The first argument size is the size (in bytes) of the block of memory to be allocated. The second argument flags are the allocation flags or GFP flags, a set of macros that the caller provides to control the type of requested memory. The most commonly used values for flags are GFP_KERNEL and GFP_ATOMIC, but there is more to be considered.

Memory-allocation requests in the kernel are always qualified by a set of GFP flags ("GFP" initially came from "get free page") describing what can and cannot be done to satisfy the request. The most commonly used flags are GFP_ATOMIC and GFP_KERNEL, though they are actually built up from lower-level flags. The full set of flags is huge; they can be found in the linux/gfp.h inc header file.


⚲ API:

RAII allocation functions hierarchy from linux/device.h inc:
devm_kcalloc id - zeroed array
devm_kmalloc_array id
devm_kmalloc id - common allocation
devm_kzalloc id - zeroed allocation
devm_kmalloc id - common allocation
Classic direct API:
linux/slab.h inc
kmalloc id, kfree id


Slab allocation

edit

Slab allocation is a memory management algorithm intended for the efficient memory allocation of kernel objects. It eliminates fragmentation caused by allocations and deallocations. The technique is used to retain allocated memory that contains a data object of a certain type for reuse upon subsequent allocations of objects of the same type.


Basics

This section is about SLAB and SLUB allocator implementations

A slab can be thought of as an array of objects of certain type or with the same size, spanning through one or more contiguous pages of memory; for example, the slab named "task_struct" holds objects of struct task_struct type, used by the scheduling subsystem. Other slabs store objects used by other subsystems, and there is also slabs for dynamic allocations inside the kernel, such as the "kmalloc-64" slab that holds up to 64-byte chunks requested via kmalloc() calls. In a slab, each object can be allocated and freed separately.

The primary motivation for slab allocation is that the initialization and destruction of kernel data objects can actually outweigh the cost of allocating memory for them. As object creation and deletion are widely employed by the kernel, overhead costs of initialization can result in significant performance drops. The notion of object caching was therefore introduced in order to avoid the invocation of functions used to initialize object state.

With slab allocation, memory chunks suitable to fit data objects of certain type or size are preallocated. The slab allocator keeps track of these chunks, known as caches kmalloc_caches id, so that when a request to allocate memory for a data object of a certain type is received, it can instantly satisfy the request with an already allocated slot slab_alloc id.

Deallocatoion of the object with kfree id does not free up the memory, but only opens a slot which is put in the list of free slots kmem_cache_cpu id by the slab allocator. The next call to allocate memory of the same size will return the now unused memory slot. See slab_alloc id//___slab_alloc id/get_freelist id. This process eliminates the need to search for suitable memory space and greatly alleviates memory fragmentation. In this context, a slab is one or more contiguous pages in the memory containing pre-allocated memory chunks.

Slab allocation provides a kind of front-end to the zoned buddy allocator for those sections of the kernel that require more flexible memory allocation than the standard 4KB page size.


⚲ Interface:

sudo cat /proc/slabinfo
linux/slab.h inc
kmem_cache_alloc id, kmem_cache_free id
man 1 slabtop


⚙️ Internals:

mm/slab_common.c src
mm_init id is called from start_kernel id
kmem_cache_init id
create_kmalloc_caches id
____kasan_kmalloc id


SLUB allocator – default Unqueued allocator

SLUB is the iteration of the original SLAB allocator that replaced it and became the Linux default allocator since 2.6.23.

⚙️ Internals: mm/slub.c src


📚 References:

CONFIG_SLUB id


SLOB allocator – Simple List Of Blocks for 🤖 embedded devices

Unfortunately, SLAB and SLUB allocators consume a big amount of memory allocating their slabs, which is a serious drawback in small systems with memory constraints, such as embedded systems. To overcome it, the SLOB (Simple List Of Blocks) allocator was designed in January 2006 by Matt Mackall as a simpler method to allocate kernel objects.

SLOB allocator uses a first-fit algorithm, which chooses the first available space for memory. This algorithm reduces memory consumption, but a major limitation of this method is that it suffers greatly from internal fragmentation.

The SLOB allocator is also used as a fall back by the kernel build system when no slab allocator is defined (when the CONFIG_SLAB id flag is disabled).


⚙️ Internals: mm/slob.c src, slob_alloc id

SLOB


SLAB allocator

💾 History: SLAB allocator is the original implementation Slab allocation. It was the default allocator since kernel 2.2 until 2.6.23, when SLUB allocator became the default, but it is still available as an option.

SLAB is the name given to the first slab allocation implementation in the kernel to distinguish it from later allocators that use the same interface. It's heavily based on Jeff Bonwick's paper "The Slab Allocator: An Object-Caching Kernel Memory Allocator" (1994) describing the first slab allocator implemented in the Solaris 5.4 kernel.

⚙️ Internals: mm/slab.c src




📚 References for Slab allocation:

KASAN - KernelAddressSANitizer doc - dynamic memory safety error detector designed to find out-of-bound and use-after-free bugs
Video "SL[AUO]B: Kernel memory allocator design and philosophy" Christopher Lameter (Linux.conf.au 2015 conference) Slides

Page Allocator

edit

The page allocator (or "zoned buddy allocator") is a low-level allocator that deals with physical memory. It delivers physical pages (usually with a size of 4096 bytes) of free memory to high-level memory consumers such as the slab allocators and kmalloc(). As the ultimate source of memory in the system, the page allocator must ensure that memory is always available, since a failure providing memory to a critical kernel subsystem can lead to a general system failure or a kernel panic.

The page allocator divides physical memory into "zones", each of which corresponds to zone_type id with specific characteristics. ZONE_DMA contains memory at the bottom of the address range for use by severely challenged devices, for example, while ZONE_NORMAL id may contain most memory on the system. 32-bit systems have a ZONE_HIGHMEM for memory that is not directly mapped into the kernel's address space. Depending on the characteristics of any given allocation request, the page allocator will search the available zones in a specific priority order. For the curious, /proc/zoneinfo gives a lot of information about the zones in use on any given system.

Within a zone, memory is grouped into page blocks, each of which can be marked with a migration type - migratetype id describing how the block should be allocated.


⚲ API:

cat /proc/buddyinfo
linux/gfp.h inc
linux/mmzone.h inc
alloc_page id
devm_get_free_pages id - RAII function, ↯ hierarchy of it:
__get_free_pages id
alloc_pages id
alloc_pages_node id
__alloc_pages id - the 'heart' of the zoned buddy allocator

⚙️ Internals:

build_all_zonelists id is called from start_kernel id, ↯ call hierarchy:
build_all_zonelists_init id
__build_all_zonelists id
build_zonelists id
__alloc_pages id - the 'heart' of the zoned buddy allocator
struct zone id
free_area id
mm/mmzone.c src
mm/page_alloc.c src


📚 References:

Get Free Page flags doc
Buddy memory allocation
Page replacement algorithm



📚 References for logical memory:

Memory Allocation Guide doc
Selecting memory allocator doc

Physical memory

edit

Memory Layout

edit

A 32-bit processor can address a maximum of 4GB of memory. Linux kernels split the 4GB address space between user processes and the kernel; under the most common configuration, the first 3GB of the 32-bit range are given over to user space, and the kernel gets the final 1GB starting at 0xc0000000. Sharing the address space gives a number of performance benefits; in particular, the hardware's address translation buffer can be shared between the kernel and user space.

In x86-64 - CONFIG_X86_64 id with 4-level page tables (CONFIG_X86_5LEVEL id=n), only the least significant 48 bits of a virtual memory address would actually be used in address translation (page table lookup). The remainder bits 48 through 63 of any virtual address must be copies of bit 47, or the processor will raise an exception. Addresses complying with this rule are referred to as "canonical form." Canonical form addresses run from 0 through 00007FFF'FFFFFFFF, and from FFFF8000'00000000 through FFFFFFFF'FFFFFFFF, for a total of 256 TB of usable virtual address space. This is still approximately 64,000 times the virtual address space on 32-bit machines.

Linux takes the higher-addressed half of the address space for itself (kernel space) and leaves the lower-addressed half for user space. The "canonical address" design has, in effect, two memory halves: the lower half starts at 00000000'00000000 and "grows upwards" as more virtual address bits become available, while the higher half is "docked" to the top of the address space and grows downwards.

Start addr Offset End addr Size VM area description
0000'8000'0000'0000 +128 TB ffff'7fff'ffff'ffff ... huge, almost 64 bits wide hole of non-canonical virtual memory addresses up to the -128 TB starting offset of kernel mappings.
0000'0000'0000'0000 0 0000'7fff'ffff'ffff 128 TB=247 user-space virtual memory, different per mm
ffff'ffff'ffe0'0000 -2 MB ffff'ffff'ffff'ffff 2 MB=221 ... unused hole
ffff'ffff'ff60'0000 -10 MB ffff'ffff'ff60'0fff 4 kB=212 VSYSCALL_ADDR id - legacy vsyscall ABI
ffff'ffff'8000'0000 -2 GB ffff'ffff'9fff'ffff 512 MB=219 kernel text mapping, mapped to physical address 0
ffff'8880'0000'0000 -119.5 TB ffff'c87f'ffff'ffff 64 TB page_offset_base id = __PAGE_OFFSET_BASE_L4 id - direct mapping of all physical memory
ffff'8000'0000'0000 -128 TB ffff'87ff'ffff'ffff 8 TB ... guard hole, also reserved for hypervisor


 
x86-64 memory layout


⚲ API:

man 8 setarch --addr-no-randomize cat /proc/self/maps


⚙️ Internals:

arch/x86/include/asm/page_64_types.h src
arch/x86/mm/init_64.c src


📚 References:

X86_64 memory nap doc
Address space layout randomization

Pages

edit

In Linux, different architectures have different page sizes. The original —for x86 architecture— and still most commonly used page size is 4096 bytes (4 KB). The page size (in bytes) of the current architecture is defined by the PAGE_SIZE macro included in arch/x86/include/asm/page_types.h src header file. User space programs can get this value using the man 2 getpagesize library function. Another related macro is PAGE_SHIFT, that contains the number of bits to shift an address to get its page number —12 bits for 4K pages.

One of the most fundamental kernel data structures relating memory-management is struct page. The kernel keeps track of the status of every page of physical memory present in the system using variables of this type. There are millions of pages in a modern system, and therefore there are millions of these structures in memory.

The full definition of struct page can be found in linux/mm_types.h inc.


Pages

⚲ API:

dma_addr_t id - bus address
linux/dma-mapping.h inc
dma_alloc_coherent id
dma_alloc_pages id pin_user_pages id
dma_map_single id dma_data_direction id
dma_map_sg id scatterlist id
dma_set_mask id dma_set_coherent_mask id dma_set_mask_and_coherent id
dma_sync_single_for_cpu id dma_sync_single_for_device id
linux/gfp.h inc
linux/dmapool.h inc
dma_pool_create id
DMA-able memory: __get_free_page id kmalloc id kmem_cache_alloc id
get_user_pages id pins user pages in memory,


👁 Examples:

samples/kfifo/dma-example.c src
e1000_alloc_rx_buffers id, e1000_alloc_ring_dma id


⚙️ Internals:

kernel/dma src
mm/dmapool.c src
mm/gup.c src
kernel/dma/mapping.c src


📚 References:

Dynamic DMA mapping Guide doc
Dynamic DMA mapping using the generic device doc
LWM: get_user_pages, pinned pages, and DAX
pin_user_pages() and related calls doc

💾 Historical:

LDD3:Memory Mapping and DMA
http://www.xml.com/ldd/chapter/book/ch13.html mmap and DMA

SAC Single Address Cycle


DMAEngine

edit
linux/dmaengine.h inc
drivers/dma src
driver-api/dmaengine doc
https://bootlin.com/pub/conferences/2015/elc/ripard-dmaengine/


📚 References for the article:

Linux Memory Management Documentation doc
Analysis of Linux Memory Management Mechanism
Memory management
http://linux-mm.org/LinuxMM
http://www.tldp.org/HOWTO/KernelAnalysis-HOWTO-7.html
Memory management, lwn