System functionality

This article describes the infrastructure used to support and manage other kernel functionality. The layer is named after system calls and sysfs.

User space communication

User space communication refers to the exchange of data and messages between user space applications and the kernel. User space applications are programs that run in the user space of the operating system, which is a protected area of memory that provides a safe and isolated environment for applications to run in.

There are several mechanisms available in Linux for user space communication with the kernel. One of the most common mechanisms is through system calls, which are functions that allow user space applications to request services from the kernel, such as opening files, creating processes, and accessing system resources.

Another mechanism for user space communication is through device files, which are special files that represent physical or virtual devices, such as storage devices, terminals, and various peripheral devices. User space applications can communicate with these devices by reading from and writing to their corresponding device files.
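As a minimal sketch of this mechanism, the following Python snippet reads from and writes to two standard device files. It assumes a Linux system where /dev/urandom and /dev/null exist; the helper name describe_device is hypothetical and only illustrates that a device node is identified by its type and its major and minor numbers.

```python
import os
import stat

def describe_device(path):
    """Return (kind, major, minor) for a device node (hypothetical helper)."""
    st = os.stat(path)
    if stat.S_ISCHR(st.st_mode):
        kind = "char"
    elif stat.S_ISBLK(st.st_mode):
        kind = "block"
    else:
        kind = "other"
    return kind, os.major(st.st_rdev), os.minor(st.st_rdev)

# Reading from a device file: /dev/urandom returns random bytes.
with open("/dev/urandom", "rb") as dev:
    sample = dev.read(8)

# Writing to a device file: /dev/null accepts the data and discards it.
with open("/dev/null", "wb") as dev:
    written = dev.write(b"discarded")

print(describe_device("/dev/null"), sample.hex(), written)
```

The open, read and write calls used here are ordinary file operations; the kernel routes them to the driver behind the device node instead of to a regular file.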

In summary, the Linux kernel provides several mechanisms for user space communication, including system calls, device files, procfs, sysfs, and devtmpfs. These mechanisms enable user space applications to communicate with the kernel and access system resources in a safe and controlled manner.

⚲ APIs:

kernel space API for user space

include/uapi

arch/x86/include/uapi

man 2 ioctl

System calls

Device files

user space API for kernel space

linux/uaccess.h:

copy_to_user

copy_from_user

📖 References

User-space API guides

User space

Linux kernel interfaces

ULK3 Chapter 11. Signals

System calls

System calls are the fundamental interface between user space applications and the Linux kernel. They provide a way for programs to request services from the operating system, such as opening a file, allocating memory, or creating a new process. In the Linux kernel, system calls are implemented as functions that user space programs invoke through a controlled transition into kernel mode: a dedicated CPU instruction such as syscall on x86-64, or a software interrupt on older 32-bit x86 systems.

The Linux kernel provides hundreds of system calls, each with its own unique functionality. These system calls are organized into categories such as process management, file management, network communication, and memory management. User space applications can use these system calls to interact with the kernel and access the underlying system resources.
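To make the idea concrete, here is a hedged sketch that issues the getpid system call by number through the generic syscall(2) wrapper of the C library and compares the result with Python's own os.getpid(). It assumes Linux with a C library that exports syscall; the helper name raw_getpid is hypothetical, and the table covers only the two architectures whose getpid numbers are listed.

```python
import ctypes
import os
import platform

# System call numbers are architecture specific. These are the getpid
# numbers for x86-64 and arm64; other architectures are left out on purpose.
SYS_GETPID = {"x86_64": 39, "aarch64": 172}

def raw_getpid():
    """Call getpid through syscall(2); return None on an unlisted architecture."""
    nr = SYS_GETPID.get(platform.machine())
    if nr is None:
        return None
    libc = ctypes.CDLL(None, use_errno=True)  # symbols of the running process
    libc.syscall.restype = ctypes.c_long
    return libc.syscall(ctypes.c_long(nr))

pid = raw_getpid()
print("raw syscall:", pid, "os.getpid():", os.getpid())
```

Normally applications never use raw numbers; they call the wrapper functions documented in man section 2, which hide the architecture-specific details.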

⚲ API

Table of syscalls

man 2 syscalls

⚙️ Internals

linux/syscalls.h

syscall_init installs entry_SYSCALL_64

man 2 syscall ↪

entry_SYSCALL_64 ↯ call hierarchy:

do_syscall_64

sys_call_table

📖 References

System call

Directory of system calls, man section 2

Anatomy of a system call, part 1 and part 2

syscalls tests, Linux Test Project (LTP)

💾 Historical

ULK3 Chapter 10. System Calls

Device files

Classic UNIX devices are character (char) devices, which are used as byte streams and controlled with man 2 ioctl.

⚲ API

ls /dev
cat /proc/devices 
cat /proc/misc

Examples: misc_fops, usb_fops, memory_fops

Allocated devices

drivers/char - actually byte stream devices

Chapter 13. I/O Architecture and Device Drivers
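The registered device numbers shown by cat /proc/devices can also be processed programmatically. The following sketch parses that file into character and block tables; the function name parse_proc_devices is hypothetical, and the layout (two sections headed "Character devices:" and "Block devices:") is assumed to match what current kernels print.

```python
def parse_proc_devices(text):
    """Split /proc/devices text into {'char': {major: [names]}, 'block': {...}}."""
    tables = {"char": {}, "block": {}}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line == "Character devices:":
            current = "char"
        elif line == "Block devices:":
            current = "block"
        elif line and current:
            major, name = line.split(None, 1)
            # One major number may be shared by several drivers.
            tables[current].setdefault(int(major), []).append(name)
    return tables

try:
    with open("/proc/devices") as f:
        devices = parse_proc_devices(f.read())
except OSError:
    devices = {"char": {}, "block": {}}  # /proc is not available here

print(devices["char"].get(10))  # major 10 is the misc device class
```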

hiddev

⚠️ Warning: the name is confusing. hiddev is not a real human interface device; it only reuses the USBHID infrastructure. hiddev is used, for example, for monitor controls and uninterruptible power supplies. The module supports these devices through a separate event interface on /dev/usb/hiddevX (char 180:96 to 180:111) (⚙️ HIDDEV_MINOR_BASE)

⚲ API

uapi/linux/hiddev.h

HID_CONNECT_HIDDEV

⚙️ Internals

CONFIG_USB_HIDDEV

linux/hiddev.h

hiddev_event

drivers/hid/usbhid/hiddev.c, hiddev_fops

📖 References

HIDDEV - Care and feeding of your Human Interface Devices

📖 References

Device file

Administration

🔧 TODO

📖 References

man 7 netlink

The Linux kernel user’s and administrator’s guide

procfs

The proc filesystem (procfs) is a special filesystem that presents information about processes and other system information in a hierarchical file-like structure, providing a more convenient and standardized method for dynamically accessing process data held in the kernel than traditional tracing methods or direct access to kernel memory. Typically, it is mapped to a mount point named /proc at boot time. The proc file system acts as an interface to internal data structures in the kernel. It can be used to obtain information about the system and to change certain kernel parameters at runtime.

/proc includes a directory for each running process, including kernel threads, named /proc/PID, where PID is the process number. Each directory contains information about one process, including:

/proc/PID/cmdline – the command that originally started the process

/proc/PID/environ – the names and values of its environment variables

/proc/PID/cwd – a symlink to its working directory

/proc/PID/exe – a symlink to the original executable file, if it still exists

/proc/PID/fd – a directory with a symlink for each open file descriptor

/proc/PID/fdinfo – a directory with the status (position, flags, ...) of each open file descriptor

/proc/PID/maps – information about mapped files and blocks such as heap and stack

/proc/PID/mem – a binary image representing the process's virtual memory

/proc/PID/root – a symlink to the root path as seen by the process

/proc/PID/task – a directory with one subdirectory for each thread of the process

/proc/PID/status – basic information about the process, including its run state and memory usage

and much more.
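Several of these per-process entries can be read with ordinary file operations. The sketch below collects a few of them for the current process through /proc/self; the helper names parse_status and describe are hypothetical, and the code assumes procfs is mounted at /proc.

```python
import os

def parse_status(text):
    """Turn the 'Key:<tab>value' lines of /proc/PID/status into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields

def describe(pid="self"):
    """Gather the command line, working directory, state and open fds of a process."""
    base = f"/proc/{pid}"
    with open(f"{base}/cmdline", "rb") as f:
        # Arguments are separated by NUL bytes.
        argv = [a.decode(errors="replace") for a in f.read().split(b"\0") if a]
    with open(f"{base}/status") as f:
        status = parse_status(f.read())
    return {
        "argv": argv,
        "cwd": os.readlink(f"{base}/cwd"),
        "state": status.get("State"),
        "open_fds": sorted(int(n) for n in os.listdir(f"{base}/fd")),
    }

try:
    info = describe()
except OSError:
    info = {}  # /proc is not mounted or not readable here
print(info)
```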

📖 References

sysfs

sysfs is a pseudo-file system that exports information about various kernel subsystems, hardware devices, and associated device drivers from the kernel's device model to user space through virtual files. In addition to providing information about devices and kernel subsystems, the exported virtual files are also used to configure them. sysfs was designed to export the information present in the kernel's device hierarchy, so that this information no longer clutters procfs.

Sysfs is mounted under the /sys mount point.
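A short sketch of how user space typically consumes sysfs: each device appears as a directory (often reached through a symlink under /sys/class), and each attribute is a small text file. The helper names read_attr and list_class are hypothetical, and the "net" class and its "mtu" attribute are used only as an example that is assumed to be present on most systems.

```python
import os

def read_attr(path):
    """Return the stripped text of one sysfs attribute file, or None if unreadable."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None

def list_class(name, root="/sys/class"):
    """Map each device in <root>/<name> to the real path its entry points to."""
    base = os.path.join(root, name)
    if not os.path.isdir(base):
        return {}
    return {
        dev: os.path.realpath(os.path.join(base, dev))
        for dev in sorted(os.listdir(base))
    }

for dev, target in list_class("net").items():
    print(dev, "mtu =", read_attr(os.path.join(target, "mtu")), "->", target)
```

Writable attributes are changed the same way, by writing text to the file, which usually requires root privileges.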

⚲ API

linux/sysfs.h

📖 References

sysfs

man 5 sysfs

sysfs - filesystem for exporting kernel objects

fs/sysfs

devtmpfs

devtmpfs is a hybrid kernel/user space approach to a device filesystem: the kernel itself creates the device nodes, so they are available before udev runs for the first time.

📖 References

Device file

drivers/base/devtmpfs.c

Containerization

Containerization is a powerful technology that has revolutionized the way software applications are developed, deployed, and run. At its core, containerization provides an isolated environment for running applications, where the application has all the necessary dependencies and can be easily moved from one environment to another without worrying about any compatibility issues.

Containerization technology has its roots in the chroot command, which was introduced in the Unix operating system in 1979. Chroot provided a way to change the root directory of a process, effectively creating a new isolated environment with its own file system hierarchy. However, this early implementation of containerization had limited functionality, and it was difficult to manage and control the various processes running within the container.

During the 2000s, the Linux kernel introduced namespaces and control groups to provide a more robust and scalable containerization solution. Namespaces allow processes to have their own isolated view of the system, including the file system, network, and process ID space, while control groups provide fine-grained control over the resources allocated to each container, such as CPU, memory, and I/O.

Using these kernel features, containerization platforms such as Docker and Kubernetes have emerged as popular solutions for building and deploying containerized applications at scale. Containerization has become an essential tool for modern software development, allowing developers to easily package applications and deploy them in a consistent and predictable manner across different environments.

Resource usage and limits

⚲ API

man 2 chroot – change root directory

man 2 sysinfo – return system information

man 2 getrusage – get resource usage

get/set resource limits:
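Resource usage and limits can be inspected from a script as well. The sketch below uses Python's standard resource module, which wraps getrusage, getrlimit and setrlimit. It lowers only the soft core-dump limit of the current process, an operation that is assumed to be harmless here and needs no privileges.

```python
import resource

# Resource usage of the current process (man 2 getrusage).
usage = resource.getrusage(resource.RUSAGE_SELF)
print("user CPU s:", usage.ru_utime,
      "system CPU s:", usage.ru_stime,
      "max RSS (KiB on Linux):", usage.ru_maxrss)

# Current soft and hard limits on open file descriptors.
nofile_soft, nofile_hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("RLIMIT_NOFILE soft/hard:", nofile_soft, nofile_hard)

# A process may always lower its own soft limit; here core dumps are disabled.
_, core_hard = resource.getrlimit(resource.RLIMIT_CORE)
resource.setrlimit(resource.RLIMIT_CORE, (0, core_hard))
print("RLIMIT_CORE now:", resource.getrlimit(resource.RLIMIT_CORE))
```

Limits set this way are inherited by child processes, which is how shells implement their ulimit builtin.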

📖 References

Namespaces

Linux namespaces provide a way to isolate and virtualize different aspects of the operating system. Namespaces allow multiple instances of an application to run in isolation from each other, without interfering with the host system or other instances.
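The namespaces a process belongs to are visible as symlinks under /proc/PID/ns; two processes share a namespace exactly when the corresponding links have the same target. The following sketch lists them for the current process. The helper name namespaces is hypothetical, and procfs is assumed to be mounted at /proc.

```python
import os

def namespaces(pid="self"):
    """Return {name: link target} for /proc/<pid>/ns, e.g. 'net' -> 'net:[<inode>]'."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    if not os.path.isdir(ns_dir):
        return result
    for name in sorted(os.listdir(ns_dir)):
        try:
            result[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            pass  # entry not readable for this process
    return result

for name, ident in namespaces().items():
    print(f"{name:20} {ident}")
```

Comparing the output for two PIDs is a quick way to check whether a containerized process really runs in namespaces different from those of the host shell.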

🔧 TODO

⚲ API

/proc/self/ns

man 8 lsns, man 2 ioctl_ns ↪ ns_ioctl

man 1 unshare, man 2 unshare

man 1 nsenter, man 2 setns

man 2 clone3 ↪ clone_args

linux/ns_common.h

linux/proc_ns.h

namespaces definition

net/net_namespace.h - struct net

user_namespace

time_namespace

cgroup_namespace

⚙️ Internals

init_nsproxy - struct of namespaces

kernel/nsproxy.c

fs/namespace.c

fs/proc/namespaces.c

net/core/net_namespace.c

kernel/time/namespace.c

kernel/user_namespace.c

kernel/pid_namespace.c

kernel/utsname.c

kernel/cgroup/namespace.c

📖 References

man 7 mount_namespace

man 7 pid_namespaces

man 7 network_namespaces

man 7 user_namespaces

man 7 time_namespaces

man 7 cgroup_namespaces

Control groups

cgroups are used to limit and control the resource usage of groups of processes. They allow administrators to set limits on CPU usage, memory usage, disk I/O, network bandwidth, and other resources, which can be useful for managing system performance and preventing resource contention.

There are two versions of cgroups. Unlike v1, cgroup v2 has only a single process hierarchy and discriminates between processes, not threads.

Here are some of the key differences between cgroups v1 and v2:

Aspect	cgroups v1	cgroups v2
Hierarchy	each controller can be mounted on its own hierarchy, which can lead to complexity and confusion	a single unified hierarchy, which simplifies management and enables better resource allocation
Controllers	each controller (subsystem) has its own set of configuration files and parameters, with conventions that differ between controllers	all controllers attach to the unified hierarchy, are enabled per subtree through cgroup.subtree_control, and follow common interface conventions
Resource distribution	each controller defines its own distribution semantics, which can lead to inconsistent behavior	controllers share common distribution models (weights, limits, protections, and allocations), which gives more predictable behavior

Cgroups v2 is not backward compatible with cgroups v1, which means that migrating from v1 to v2 can be challenging and requires careful planning.
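Which cgroup version a process is managed by can be read from /proc/PID/cgroup. Each line has the form hierarchy-ID:controller-list:path, and the cgroup v2 entry is the one with hierarchy ID 0 and an empty controller list. The sketch below parses that format; the function name parse_proc_cgroup is hypothetical.

```python
def parse_proc_cgroup(text):
    """Parse /proc/PID/cgroup lines of the form 'ID:controllers:path'."""
    entries = []
    for line in text.splitlines():
        if line.count(":") < 2:
            continue  # skip anything that is not a well-formed entry
        hid, controllers, path = line.split(":", 2)
        entries.append({
            "hierarchy": int(hid),
            "controllers": controllers.split(",") if controllers else [],
            "path": path,
            # v2 is identified by hierarchy ID 0 with no controller list.
            "version": 2 if hid == "0" and not controllers else 1,
        })
    return entries

try:
    with open("/proc/self/cgroup") as f:
        membership = parse_proc_cgroup(f.read())
except OSError:
    membership = []  # /proc is not available here

for entry in membership:
    print(entry)
```

On a system that uses only the unified hierarchy, the output is a single version 2 entry; a mix of version 1 and version 2 entries indicates a hybrid setup.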

🔧 TODO

⚲ API

linux/cgroup.h

linux/cgroup-defs.h

css_set – holds a set of reference-counted pointers to cgroup_subsys_state objects

cgroup_subsys

linux/cgroup_subsys.h – list of cgroup subsystems

⚙️ Internals

cg_list – list of css_set in task_struct

kernel/cgroup

cgroup_init

cgroup2_fs_type

tools/testing/selftests/cgroup

📖 References

Control Groups v1

man 1 systemd-cgtop

man 5 systemd.slice – slice unit configuration

man 7 cgroups

man 7 cgroup_namespaces

CFS Bandwidth Control for cgroups

Real-Time group scheduling

📚 Further reading

https://github.com/containers

cgrc tool

💾 Historical

https://github.com/mk-fg/cgroup-tools for cgroup v1

Driver Model

The Linux driver model (or Device Model, or just DM) is a framework that provides a consistent and standardized way for device drivers to interface with the kernel. It defines a set of rules, interfaces, and data structures that enable device drivers to communicate with the kernel and perform various operations, such as managing resources and the device lifecycle.

The DM core structure consists of DM classes, DM buses, DM drivers and DM devices.

kobject

In the Linux kernel, a kobject is a fundamental data structure used to represent kernel objects and provide a standardized interface for interacting with them. A kobject is a generic object that can represent any type of kernel object, including devices, files, modules, and more.

The kobject data structure contains several fields that describe the object, such as its name, type, parent, and operations. Each kobject has a unique name within its parent object, and the parent-child relationships form a hierarchy of kobjects.

Kobjects are exposed to user space through the sysfs file system, which presents kernel objects as files and directories. Each kobject registered with sysfs is associated with a sysfs directory, which contains attribute files that can be read or written to interact with the kernel object.

⚲ Infrastructure API

linux/kobject.h

kobject

Kernel objects manipulation

🔧 TODO

Classes

A class is a higher-level view of a device that abstracts out low-level implementation details. Drivers may see NVMe storage or SATA storage, but at the class level they are all simply block_class devices. Classes allow user space to work with devices based on what they do, rather than how they are connected or how they work. The general structure of DM classes matches the composite pattern.

⚲ API

ls /sys/class/

class_register registers struct class

linux/device/class.h

👁 Examples: input_class, block_class, net_class

Buses

A peripheral bus is a channel between the processor and one or more peripheral devices. A DM bus is a proxy for a peripheral bus. The general structure of DM buses matches the composite pattern. For the purposes of the device model, all devices are connected via a bus, even if it is an internal, virtual platform_bus_type. Buses can plug into each other: a USB controller is usually a PCI device, for example. The device model represents the actual connections between buses and the devices they control. A bus is represented by the bus_type structure. It contains the name, the default attributes, the bus's methods, PM operations, and the driver core's private data.
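From user space, this bus structure is visible under /sys/bus, where each registered bus has a devices and a drivers subdirectory. The following sketch summarizes that tree; the helper name bus_summary is hypothetical, and sysfs is assumed to be mounted at /sys.

```python
import os

def bus_summary(root="/sys/bus"):
    """Return {bus: {'devices': [...], 'drivers': [...]}} for every bus under root."""
    summary = {}
    if not os.path.isdir(root):
        return summary
    for bus in sorted(os.listdir(root)):
        entry = {}
        for kind in ("devices", "drivers"):
            path = os.path.join(root, bus, kind)
            try:
                entry[kind] = sorted(os.listdir(path))
            except OSError:
                entry[kind] = []  # missing or unreadable subdirectory
        summary[bus] = entry
    return summary

for bus, entry in bus_summary().items():
    print(f"{bus:12} devices={len(entry['devices'])} drivers={len(entry['drivers'])}")
```

The entries under devices are symlinks into /sys/devices, which reflects the point made above that the model records the actual connection of each device to its bus.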

⚲ API

ls /sys/bus/

bus_register registers bus_type

linux/device/bus.h

👁 Examples: usb_bus_type, hid_bus_type, pci_bus_type, scsi_bus_type, platform_bus_type

Peripheral buses

Drivers

⚲ API

ls /sys/bus/*/drivers/

module_driver - simple common driver initializer, 👁 for example used in module_pci_driver

driver_register registers device_driver - the basic device driver structure, one shared by all device instances.

linux/device/driver.h

👁 Examples: hid_generic, usb_register_device_driver

Platform drivers

module_platform_driver registers platform_driver (platform wrapper of device_driver) with platform_bus_type

linux/platform_device.h

👁 Examples: gpio_mouse_device_driver

Devices

⚲ API

ls /sys/devices/

device_register registers struct device - the basic device structure, one per device instance

linux/device.h – Device drivers infrastructure

linux/dev_printk.h

Device Resource Management, devres, devm ...

👁 Examples: platform_bus, mousedev_create

Platform devices

platform_device - platform wrapper of struct device (the basic device structure); contains the resources associated with the device

It can be created dynamically by platform_device_register_simple or platform_device_alloc, or registered with platform_device_register.

platform_device_unregister - releases the device and its associated resources

👁 Examples: add_pcspkr

⚲ API 🔧 TODO

platform_device_info platform_device_id platform_device_register_full platform_device_add

platform_device_add_data platform_device_register_data platform_device_add_resources

attribute_group dev_pm_ops

⚙️ Internals

linux/dev_printk.h

lib/kobject.c

drivers/base/platform.c

drivers/base/core.c

📖 References

Device drivers infrastructure

Everything you never wanted to know about kobjects, ksets, and ktypes

Driver Model

The Linux Kernel Device Model

Platform Devices and Drivers

Linux Device Model, by linux-kernel-labs

Modules

Article about modules

⚲ API

lsmod

cat /proc/modules
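The same information that lsmod prints can be parsed from /proc/modules. The sketch below assumes the usual field order of that file (name, size, reference count, comma-terminated list of dependent modules, state, load address); the function name parse_proc_modules is hypothetical.

```python
def parse_proc_modules(text):
    """Parse /proc/modules lines: name size refcount used-by state address."""
    modules = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue
        name, size, refs, deps, state = fields[:5]
        modules.append({
            "name": name,
            "size": int(size),
            # The count is printed as '-' when module unloading is not supported.
            "refcount": int(refs) if refs.isdigit() else None,
            "used_by": [d for d in deps.split(",") if d and d != "-"],
            "state": state,
        })
    return modules

try:
    with open("/proc/modules") as f:
        loaded = parse_proc_modules(f.read())
except OSError:
    loaded = []  # /proc is not available here

print(len(loaded), "modules loaded")
```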

⚙️ Internals

kernel/kmod.c

📖 References

LDD3: Building and Running Modules

http://www.xml.com/ldd/chapter/book/ch02.html

http://www.tldp.org/LDP/tlk/modules/modules.html

http://www.tldp.org/LDP/lkmpg/2.6/html/ The Linux Kernel Module Programming Guide

Peripheral buses

Peripheral buses are the communication channels used to connect various peripheral devices to a computer system. These buses are used to transfer data between the peripheral devices and the system's processor or memory. In the Linux kernel, peripheral buses are implemented as drivers that enable communication between the operating system and the hardware.

Peripheral buses in the Linux kernel include USB, PCI, SPI, I2C, and more. Each of these buses has its own unique characteristics, and the Linux kernel provides support for a wide range of peripheral devices.

The PCI (Peripheral Component Interconnect) bus is used to connect internal hardware devices in a computer system. It is commonly used to connect graphics cards, network cards, and other expansion cards. The Linux kernel provides a PCI bus driver that enables communication between the operating system and the devices connected to the bus.

The USB (Universal Serial Bus) is one of the most commonly used peripheral buses in modern computer systems. It allows devices to be hot-swapped and supports high-speed data transfer rates.

🔧 TODO: device enumeration

⚲ API

Shell interface: ls /proc/bus/ /sys/bus/

Hardware interfaces

Hardware interfaces are a basic part of any operating system, enabling communication between the processor and the other hardware components of a computer system: memory, peripheral devices and buses, and various controllers.

Interrupts

I/O ports and registers

I/O ports and registers are electronic components in computer systems that enable communication between the CPU and other electronic controllers and devices.

⚲ API

linux/regmap.h – register map access API

asm-generic/io.h – generic I/O port emulation.

ioport_map

ioread32 / iowrite32 ...

readl / writel ...

The {in,out}[bwl] macros are for emulating x86-style PCI/ISA IO space:

inl / outl ...

linux/ioport.h – definitions of routines for detecting, reserving and allocating system resources.

request_mem_region

arch/x86/include/asm/io.h

Functions for memory mapped registers:

ioremap ...
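Regions that drivers reserve through the linux/ioport.h routines are listed in /proc/ioports (I/O ports) and /proc/iomem (memory-mapped regions), with nesting shown by indentation. The sketch below parses that listing; the function name parse_resources is hypothetical, and it assumes two spaces of indentation per nesting level. On many systems unprivileged readers see all addresses as zero.

```python
def parse_resources(text):
    """Parse /proc/ioports or /proc/iomem text into (depth, start, end, name) tuples."""
    resources = []
    for line in text.splitlines():
        if " : " not in line:
            continue
        span, name = line.split(" : ", 1)
        depth = (len(span) - len(span.lstrip(" "))) // 2  # two spaces per level
        start, end = (int(part, 16) for part in span.strip().split("-"))
        resources.append((depth, start, end, name.strip()))
    return resources

try:
    with open("/proc/ioports") as f:
        ports = parse_resources(f.read())
except (OSError, ValueError):
    ports = []  # file missing, unreadable, or in an unexpected format

for depth, start, end, name in ports[:10]:
    print("  " * depth + f"{start:04x}-{end:04x} {name}")
```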

Hardware Device Drivers

Keywords: firmware, hotplug, clock, mux, pin

⚙️ Internals

drivers/acpi

drivers/base

drivers/sdio - Secure Digital Input Output

📖 References

Pin control subsystem

Linux Hardware Monitoring

Firmware guide

Devicetree

https://hwmon.wiki.kernel.org/

LDD3: The Linux Device Model

http://www.tldp.org/LDP/tlk/dd/drivers.html

http://www.xml.com/ldd/chapter/book/

http://examples.oreilly.com/linuxdrive2/

Booting and halting

Kernel booting

The kernel is loaded in two stages. In the first stage, the kernel (as a compressed image file) is loaded into memory and decompressed, and a few fundamental functions such as essential hardware setup and basic memory management (memory paging) are set up. Control is then switched one final time to the main kernel start process, which calls start_kernel. This function performs the majority of system setup (interrupts, the rest of memory management, device and driver initialization, etc.) before spawning, separately, the idle process and scheduler, and the init process (which is executed in user space).

Kernel loading stage

The kernel as loaded is typically an image file in zImage or bzImage format, compressed with zlib or another supported compression algorithm. A routine at the head of it does a minimal amount of hardware setup, decompresses the image fully into high memory, and takes note of any RAM disk if configured. It then executes kernel startup via startup_64 (for the x86_64 architecture).

arch/x86/boot/compressed/vmlinux.lds.S - linker script, defines the entry point startup_64 in

arch/x86/boot/compressed/head_64.S - assembly part of the extractor

extract_kernel - the extractor, written in C

prints

Decompressing Linux... done.
Booting the kernel.

Kernel startup stage

The startup function for the kernel (also called the swapper or process 0) establishes memory management (page tables and memory paging), detects the type of CPU and any additional functionality such as floating point capabilities, and then switches to non-architecture specific Linux kernel functionality via a call to start_kernel.

↯ Startup call hierarchy:

arch/x86/kernel/vmlinux.lds.S – linker script

arch/x86/kernel/head_64.S – assembly of uncompressed startup code

arch/x86/kernel/head64.c – platform-dependent startup:

x86_64_start_kernel

x86_64_start_reservations

init/main.c – main initialization code

start_kernel 200 SLOC

rcu_init – Read-copy-update

rest_init

kernel_init - deferred kernel thread #1

kernel_init_freeable – this and the following functions are defined with the attribute __init

prepare_namespace

initrd_load

mount_root

run_init_process runs the first process, man 1 init

kthreadd – deferred kernel thread #2

cpu_startup_entry

do_idle

start_kernel executes a wide range of initialization functions. It sets up interrupt handling (IRQs), further configures memory, starts the man 1 init process (the first user-space process), and then starts the idle task via cpu_startup_entry. Notably, the kernel startup process also mounts the initial ramdisk (initrd) that was loaded previously as the temporary root file system during the boot phase. The initrd allows driver modules to be loaded directly from memory, without reliance upon other devices (e.g. a hard disk) and the drivers that are needed to access them (e.g. a SATA driver). This split of some drivers statically compiled into the kernel and other drivers loaded from initrd allows for a smaller kernel. The root file system is later switched via a call to man 8 pivot_root / man 2 pivot_root, which unmounts the temporary root file system and replaces it with the real one, once the latter is accessible. The memory used by the temporary root file system is then reclaimed.
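The result of this sequence can be observed on a running system: the init process has PID 1 and the kthreadd kernel thread has PID 2. The sketch below reads their names from procfs. The helper name task_name is hypothetical, and the PIDs are assumed to be seen from the initial PID namespace; inside a container PID 1 is the container's own first process and PID 2 may not be kthreadd or may not exist.

```python
def task_name(pid):
    """Return the short command name from /proc/<pid>/comm, or None if unavailable."""
    try:
        with open(f"/proc/{pid}/comm") as f:
            return f.read().strip()
    except OSError:
        return None

print("PID 1:", task_name(1))  # the init process started by run_init_process
print("PID 2:", task_name(2))  # kthreadd on a host system
```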