storage
files & directories access
Virtual File System
page cache
logical file systems
block devices
storage drivers

Storage functionality provides access to various storage devices via files and directories. Most storage is persistent, such as flash memory, SSDs and legacy hard disks; other storage is temporary. The file system provides an abstraction to organize the information into separate pieces of data (called files), each identified by a unique name. Each file system type defines its own structures and logic rules used to manage these groups of information and their names. Linux supports a plethora of different file system types, local and remote, native and from other operating systems. To accommodate such disparity, the kernel defines a common top layer, the virtual file system (VFS) layer.


Summary of the Linux kernel's storage stack

Files and directories


Four basic file access system calls:

man 2 open ↪ do_sys_open id - opens a file by name and returns a file descriptor (fd). The functions below operate on an fd.
man 2 close ↪ close_fd id
man 2 read ↪ ksys_read id
man 2 write ↪ ksys_write id

A file in Linux and UNIX is not only a physical file on persistent storage. The file interface is also used to access pipes, sockets and other pseudo-files.
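The four calls above can be tried from user space through Python's thin os wrappers. The following is a minimal sketch, not kernel code: the helper names roundtrip_regular_file and roundtrip_pipe are made up for illustration, and the payload is assumed to be small enough that a single write and read transfer all of it. It shows that the same read and write calls work on a regular file and on a pipe.

```python
import os
import tempfile


def roundtrip_regular_file(payload: bytes) -> bytes:
    """Write payload to a temporary file and read it back via raw fds."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.txt")      # hypothetical file name
        fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o600)  # open(2)
        try:
            os.write(fd, payload)                 # write(2)
        finally:
            os.close(fd)                          # close(2)
        fd = os.open(path, os.O_RDONLY)
        try:
            return os.read(fd, len(payload))      # read(2)
        finally:
            os.close(fd)


def roundtrip_pipe(payload: bytes) -> bytes:
    """Same read/write calls, but the fds refer to a pipe, not a disk file."""
    read_end, write_end = os.pipe()
    try:
        os.write(write_end, payload)
        return os.read(read_end, len(payload))
    finally:
        os.close(read_end)
        os.close(write_end)


if __name__ == "__main__":
    print(roundtrip_regular_file(b"hello"))
    print(roundtrip_pipe(b"hello"))
```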

🔧 TODO

man 2 readlink, man 2 symlink, man 2 link
man 3 readdir ⇾ man 2 getdents
man 7 path_resolution
man 2 fcntl – manipulate file descriptor


βš™οΈ Files and directories internals

linux/fs.h inc
fs/open.c src
fs/namei.c src
fs/read_write.c src


📚 Files and directories references

Input/Output, The GNU C Library
VFS in Linux Kernel 2.4 Internals
Unix file types


File locks


File locks are mechanisms that allow processes to coordinate access to shared files. These locks help prevent conflicts when multiple processes or threads attempt to access the same file simultaneously.

⚲ API

man 8 lslocks – list local system locks
man 3 lockf – apply, test or remove a POSIX lock on an open file
man 2 flock – apply or remove an advisory BSD lock on an open file
man 2 fcntl – manipulate file descriptor
F_SETLK id – advisory record lock
F_OFD_SETLK id – Open File Description Lock
flock id – lock parameters
⚠️ Avoid mixing flock and fcntl locks on the same file as they don't interact with each other.
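As a rough illustration of advisory BSD locks, the sketch below opens the same file twice and shows that an exclusive flock held through one descriptor makes a non-blocking lock attempt through the other fail. It uses Python's fcntl.flock wrapper; the helper name second_lock_is_refused is invented for this example, and a local file system under the default temporary directory is assumed.

```python
import fcntl
import os
import tempfile


def second_lock_is_refused() -> bool:
    """Return True if a held exclusive flock blocks a second open of the file."""
    with tempfile.NamedTemporaryFile() as tmp:
        # Two independent open file descriptions for the same file.
        fd1 = os.open(tmp.name, os.O_RDWR)
        fd2 = os.open(tmp.name, os.O_RDWR)
        try:
            fcntl.flock(fd1, fcntl.LOCK_EX)          # take the exclusive lock
            try:
                # LOCK_NB makes the call fail instead of waiting.
                fcntl.flock(fd2, fcntl.LOCK_EX | fcntl.LOCK_NB)
            except BlockingIOError:
                refused = True
            else:
                refused = False
            fcntl.flock(fd1, fcntl.LOCK_UN)          # release the lock
            # Once released, the second descriptor can take the lock.
            fcntl.flock(fd2, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return refused
        finally:
            os.close(fd1)
            os.close(fd2)


if __name__ == "__main__":
    print("second lock refused:", second_lock_is_refused())
```

The lock is advisory: a process that never calls flock can still read and write the file.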


βš™οΈ Internals

linux/filelock.h inc
fs/locks.c src
trace/events/filelock.h inc

💾 Historical: the mandatory locking feature is no longer supported at all in Linux 5.15 and above because the implementation was unreliable.

Asynchronous I/O


🚀 advanced features

AIO

https://lwn.net/Kernel/Index/#Asynchronous_IO
man 2 io_submit man 2 io_setup man 2 io_cancel man 2 io_destroy man 2 io_getevents
uapi/linux/aio_abi.h inc
fs/aio.c src
io/aio ltp


io_uring

🌱 New since release 5.1 in May 2019


https://blogs.oracle.com/linux/an-introduction-to-the-io_uring-asynchronous-io-framework
https://thenewstack.io/how-io_uring-and-ebpf-will-revolutionize-programming-in-linux/
io_uring_enter id io_uring_setup id io_uring_register id
linux/io_uring.h inc
uapi/linux/io_uring.h inc
fs/.c src
https://lwn.net/Kernel/Index/#io_uring
io_uring, SCM_RIGHTS, and reference-count cycles
The rapid growth of io_uring
Automatic buffer selection for io_uring
Operations restrictions for io_uring
Redesigned workqueues for io_uring
io_uring ltp

The interfaces below allow non-blocking access to multiple file descriptors.

Efficient event polling epoll


⚲ API:

uapi/linux/eventpoll.h inc
man 7 epoll
man 2 epoll_create ↪ do_epoll_create id
man 2 epoll_ctl ↪ do_epoll_ctl id
man 2 epoll_wait ↪ do_epoll_wait id


⚙️ Internals:

fs/eventpoll.c src
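The epoll calls listed above can be exercised through Python's select.epoll object, which wraps epoll_create, epoll_ctl and epoll_wait. This is a small sketch under the assumption that a pipe is a good enough stand-in for any pollable descriptor; the function name wait_for_pipe_data is hypothetical.

```python
import os
import select


def wait_for_pipe_data(payload: bytes, timeout: float = 1.0):
    """Return (events seen before writing, data read after epoll reports it)."""
    read_end, write_end = os.pipe()
    ep = select.epoll()                              # epoll_create
    try:
        ep.register(read_end, select.EPOLLIN)        # epoll_ctl(EPOLL_CTL_ADD)
        before = ep.poll(0)                          # nothing written yet
        os.write(write_end, payload)
        data = b""
        for fd, mask in ep.poll(timeout):            # epoll_wait
            if fd == read_end and mask & select.EPOLLIN:
                data = os.read(read_end, len(payload))
        return before, data
    finally:
        ep.close()
        os.close(read_end)
        os.close(write_end)


if __name__ == "__main__":
    print(wait_for_pipe_data(b"ping"))
```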


select and poll

💾 Historical: the select and poll system calls are derived from UNIX


⚲ API:

man 2 poll ↪ do_sys_poll id
man 2 select ↪ kern_select id


⚙️ Internals:

fs/select.c src
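For comparison with epoll, the older poll interface can be sketched the same way with Python's select.poll wrapper. The example watches several pipes and reports which of them became readable; the function name readable_pipes and the choice of pipes are assumptions made only for this illustration.

```python
import os
import select


def readable_pipes(count: int, write_to: int) -> list:
    """Create `count` pipes, write to one, return indices poll reports readable."""
    pipes = [os.pipe() for _ in range(count)]
    try:
        poller = select.poll()
        for read_end, _ in pipes:
            poller.register(read_end, select.POLLIN)
        os.write(pipes[write_to][1], b"x")
        # poll() takes its timeout in milliseconds.
        ready = {fd for fd, mask in poller.poll(1000) if mask & select.POLLIN}
        return [i for i, (read_end, _) in enumerate(pipes) if read_end in ready]
    finally:
        for read_end, write_end in pipes:
            os.close(read_end)
            os.close(write_end)


if __name__ == "__main__":
    print(readable_pipes(3, 1))
```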

Vectored I/O


🚀 advanced feature

Vectored I/O, also known as scatter/gather I/O, is a method of input and output by which a single procedure call sequentially reads data from multiple buffers and writes it to a single data stream, or reads data from a data stream and writes it to multiple buffers, as defined in a vector of buffers. Scatter/gather refers to the process of gathering data from, or scattering data into, the given set of buffers. Vectored I/O can operate synchronously or asynchronously. The main reasons for using vectored I/O are efficiency and convenience.


⚲ API:

uapi/linux/uio.h inc
linux/uio.h inc
iovec id
man 2 readv ↪ do_readv id
man 2 writev ↪ do_writev id
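A short sketch of scatter/gather I/O from user space, using Python's os.writev and os.readv wrappers over a pipe. The function name scatter_gather and the buffer sizes are chosen only for illustration.

```python
import os


def scatter_gather():
    """Gather three buffers into one write, then scatter the bytes into two."""
    read_end, write_end = os.pipe()
    try:
        # Gather: one writev(2) call sends all buffers in order.
        written = os.writev(write_end, [b"head", b"-", b"body"])
        # Scatter: one readv(2) call fills the buffers one after another.
        first, second = bytearray(4), bytearray(5)
        read = os.readv(read_end, [first, second])
        return written, read, bytes(first), bytes(second)
    finally:
        os.close(read_end)
        os.close(write_end)


if __name__ == "__main__":
    print(scatter_gather())
```

The buffers passed to readv need not match the layout used by writev: the data stream carries no record of the original buffer boundaries.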


βš™οΈ Internals:

iov_iter id
do_readv id β†― call hierarchy:
vfs_readv id
import_iovec id
ext4_file_read_iter id
lib/iov_iter.c src


📚 References

Fast Scatter-Gather I/O, The GNU C Library
https://lwn.net/Kernel/Index/#Vectored_IO
https://lwn.net/Kernel/Index/#Scattergather_chaining

Virtual File System


The virtual file system (VFS) is an abstract layer on top of a concrete logical file system. The purpose of a VFS is to allow client applications to access different types of logical file systems in a uniform way. A VFS can, for example, be used to access local and network storage devices transparently without the client application noticing the difference. It can be used to bridge the differences in Windows, classic Mac OS/macOS and Unix filesystems, so that applications can access files on local file systems of those types without having to know what type of file system they are accessing. A VFS specifies an interface (or a "contract") between the kernel and a logical file system. Therefore, it is easy to add support for new file system types to the kernel simply by fulfilling the contract.

🔧 TODO: vfsmount id, vfs_create id, vfs_read id, vfs_write id

📚 VFS References

VFS doc
VFS in Linux Kernel 2.4 Internals


Logical file systems


A file system (or filesystem) is used to control how data is stored and retrieved. Without a file system, information placed in a storage area would be one large body of data with no way to tell where one piece of information stops and the next begins. By separating the data into individual pieces, and giving each piece a name, the information is easily separated and identified. Each group of data is called a "file". The structure and logic rules used to manage the groups of information and their names is called a "file system".

There are many different kinds of file systems. Each one has different structure and logic, properties of speed, flexibility, security, size and more. Some file systems have been designed to be used for specific applications. For example, the ISO 9660 file system is designed specifically for optical discs.

File systems can be used on many different kinds of storage devices. Each storage device uses a different kind of media. The most common storage device in use today is an SSD. Other media in use include hard disk, magnetic tape and optical disc. In some cases, the computer's main memory (RAM) is used to create a temporary file system for short-term use. Raw storage is called a block device.

Linux supports many different file systems, but common choices for the system disk on a block device include the ext* family (such as ext2, ext3 and ext4), XFS, ReiserFS and btrfs. For raw Flash without a flash translation layer (FTL) or Memory Technology Device (MTD), there are UBIFS, JFFS2 and YAFFS, among others. SquashFS is a common compressed read-only file system. NFS and other network file systems are described further in the Network storage section.


⚲ Shell interfaces:

cat /proc/filesystems
ls /sys/fs/
man 8 mount
man 8 umount
man 8 findmnt
man 1 mountpoint
man 1 df
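The first shell interface above, /proc/filesystems, lists one file system type per line, with the word nodev in front of types that are not backed by a block device. A small sketch of parsing it follows; the function name parse_proc_filesystems is made up for this example, and the sample text in the usage section is illustrative rather than taken from a real system.

```python
import os


def parse_proc_filesystems(text: str) -> list:
    """Return (name, needs_block_device) pairs from /proc/filesystems text."""
    result = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Each line is "<flag>\t<name>", where flag is "nodev" or empty.
        flag, _, name = line.partition("\t")
        result.append((name.strip(), flag.strip() != "nodev"))
    return result


if __name__ == "__main__":
    if os.path.exists("/proc/filesystems"):
        with open("/proc/filesystems") as f:
            for name, on_block_device in parse_proc_filesystems(f.read()):
                print(name, "(block device)" if on_block_device else "(nodev)")
    else:
        # Illustrative sample input, not real system output.
        print(parse_proc_filesystems("nodev\tsysfs\n\text4\n"))
```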


Infrastructure: the ⚲ API function register_filesystem id registers structures file_system_type id and stores them in the linked list ⚙️ file_systems id. The function ext4_init_fs id registers ext4_fs_type id. The operation of opening a file system is called mounting: ext4_mount id


βš™οΈ Internals:

fs/namespace.c src
man 2 mount
do_mount id
linux/buffer_head.h inc
super_block id
sb_bread id
fs src
fs/ext4/ext4.h src
ext4_sb_bread id


📚 References:

filesystems doc
Kernel wikis: EXT4, btrfs, Reiser4, RAID, XFS

Page cache


A page cache or disk cache is a transparent cache for the memory pages originating from a secondary storage device such as a hard disk drive. The operating system keeps a page cache in otherwise unused portions of the main memory, resulting in quicker access to the contents of cached pages and overall performance improvements. The page cache is implemented by the kernel, and is mostly transparent to applications.

Usually, all physical memory not directly allocated to applications is used by the operating system for the page cache. Since the memory would otherwise be idle and is easily reclaimed when applications request it, there is generally no associated performance penalty and the operating system might even report such memory as "free" or "available". The page cache also aids in writing to a disk. Pages in the main memory that have been modified while writing data are marked as "dirty" and have to be flushed to disk before they can be freed. When a file write occurs, the page backing the particular block is looked up. If it is already in the page cache, the write is done to that page in the main memory. If it is not cached and the write falls exactly on page boundaries, the page is not even read from disk, but allocated and immediately marked dirty. In all other cases, the page(s) are fetched from disk and the requested modifications are made.

Not all cached pages can be written to as program code is often mapped as read-only or copy-on-write; in the latter case, modifications to code will only be visible to the process itself and will not be written to disk.


⚲ API:

man 2 fsync ↪ do_fsync id
man 2 sync_file_range ↪ ksys_sync_file_range id
man 2 syncfs ↪ sync_filesystem id
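Because writes normally land in the page cache first, an application that needs the data on stable storage asks for a flush explicitly. The sketch below uses Python's os.fsync wrapper around fsync(2); the helper name durable_write is hypothetical.

```python
import os
import tempfile


def durable_write(path: str, data: bytes) -> None:
    """Write data to path and ask the kernel to flush it to the device."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        view = memoryview(data)
        while view:                      # write(2) may accept fewer bytes
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)                     # flush this file's dirty pages
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "journal.bin")   # hypothetical file name
        durable_write(target, b"committed record\n")
        print(os.path.getsize(target))
```

Note that fsync on the file does not by itself guarantee that the directory entry for a newly created file has reached the disk; that requires an fsync on the directory as well.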

📚 References

wb_workfn id
address_space id
do_writepages id
linux/writeback.h inc
mm/page-writeback.c src
Page cache


More

The future of DAX - direct access bypassing the cache
Linux Page Cache in Linux Kernel 2.4 Internals

Zero-copy


🚀 advanced features

Writing data to storage and reading it back are very resource-consuming operations. Copying memory also costs time and CPU. The set of methods that avoid copy operations is called zero-copy. The goal of zero-copy methods is fast and efficient data transfer within the system.

The first and simplest method is the pipeline, invoked by the operator "|" in shells. Instead of writing data into a temporary file and reading it back, the data is passed efficiently via a pipe, bypassing storage. The second method is tee.


⚲ Syscalls:

man 2 pipe2
man 2 tee, man 1 tee
man 2 sendfile
man 2 copy_file_range
man 2 splice
man 2 vmsplice


⚲ API and ⚙️ Internals:

man 2 pipe2 ↪ do_pipe2 id - creates a pipe
uses pipe_fs_type id, pipefifo_fops id
man 2 tee ↪ do_tee id - duplicates pipe content
calls link_pipe id
man 2 sendfile ↪ do_sendfile id - transfers data between file descriptors; the output can be a socket. Used in network storage and servers.
Calls: do_splice_direct id, splice_direct_to_actor id
man 2 copy_file_range ↪ vfs_copy_file_range id - transfers data between files
calls custom remap_file_range id like nfs42_remap_file_range id
or custom copy_file_range id like fuse_copy_file_range id
or do_splice_direct id
man 2 splice ↪ do_splice id - splices data to/from a pipe.
There are three cases, depending on which end is a pipe:
  1. do_splice_from id - only input is a pipe
    Calls iter_file_splice_write id or custom splice_write id
    or default_file_splice_write id: write_pipe_buf id, splice_from_pipe id, __splice_from_pipe id
  2. do_splice_to id - only output is a pipe.
    Calls generic_file_splice_read id or custom splice_read id
    or default_file_splice_read id: kernel_readv id
  3. splice_pipe_to_pipe id - both are pipes
man 2 vmsplice ↪
vmsplice_to_pipe id – splices user pages to a pipe
vmsplice_to_user id – splices a pipe to user pages
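As a user-space illustration of one of the calls above, the sketch below copies a file with sendfile through Python's os.sendfile wrapper, so the data never passes through a buffer in the calling process. The function name copy_with_sendfile and the file names are assumptions for this example; a Linux kernel that accepts a regular file as the sendfile output is assumed.

```python
import os
import tempfile


def copy_with_sendfile(src_path: str, dst_path: str) -> int:
    """Copy src to dst inside the kernel; return the number of bytes copied."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        size = os.fstat(src.fileno()).st_size
        offset = 0
        while offset < size:
            # sendfile(2) may transfer fewer bytes than requested, so loop.
            sent = os.sendfile(dst.fileno(), src.fileno(), offset, size - offset)
            if sent == 0:
                break
            offset += sent
        return offset


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "src.bin")   # hypothetical file names
        dst = os.path.join(tmp, "dst.bin")
        with open(src, "wb") as f:
            f.write(b"zero-copy " * 1000)
        print(copy_with_sendfile(src, dst), "bytes copied")
```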


⚲ API

linux/splice.h inc


βš™οΈ Internals:

fs/pipe.c src
fs/splice.c src


πŸ”§ TODO: zerocopy_sg_from_iter id builds a zerocopy skb datagram from an iov_iter. Used in tap_get_user id and tun_get_user id.

skb_zerocopy id

skb_zerocopy_iter_dgram id


πŸ“š References

man 7 pipe
man 7 fifo
splice and pipes doc
Pipes API doc
splice (system call)
LTP: pipe ltp, pipe2 ltp, tee ltp, sendfile ltp, copy_file_range ltp, splice ltp, vmsplice ltp

Block device layer


The block device layer in Linux provides an abstraction for accessing storage devices, such as USB drives, by presenting them as a series of fixed-size blocks. It sits between the hardware and the file system, allowing applications and file systems to perform read and write operations efficiently without needing to know the specifics of the underlying hardware. Key components include block drivers, the I/O scheduler, and buffer management, which work together to handle requests, optimize access patterns, and ensure data integrity. This layer supports essential features like caching, partition management, and queueing mechanisms to balance performance and reliability.


⚲ Interfaces:

linux/genhd.h inc
linux/blk_types.h inc
bio id – main unit of I/O for the block layer and lower layers
req_op id – operations common to the bio and request structures
linux/bio.h inc
block_device id
block_size id
alloc_disk_node id allocates gendisk id
add_disk id
device_add_disk id
block_device_operations id
linux/blkdev.h inc
gendisk id
dev_to_disk id, disk_to_dev id
block_class id – block devices Driver Model class
register_blkdev id
request id
request_queue id


βš™οΈ Internals.

block src
block_class id


πŸ‘ Examples:

drivers/block/brd.c src - small RAM backed block device driver
drivers/block/null_blk src


Device mapper


The device mapper is a framework provided by the kernel for mapping physical block devices onto higher-level "virtual block devices". It forms the foundation of LVM2, software RAIDs and dm-crypt disk encryption, and offers additional features such as file system snapshots.

Device mapper works by passing data from a virtual block device, which is provided by the device mapper itself, to another block device. Data can also be modified in transit, which is what happens, for example, when the device mapper provides disk encryption.

User space applications that need to create new mapped devices talk to the device mapper via the libdevmapper.so shared library, which in turn issues ioctls to the /dev/mapper/control device node.

Functions provided by the device mapper include linear, striped and error mappings, as well as crypt and multipath targets. For example, two disks may be concatenated into one logical volume with a pair of linear mappings, one for each disk. As another example, the crypt target encrypts the data passing through the specified device by using the Linux kernel's Crypto API.

The following mapping targets are available:

cache - allows the creation of hybrid volumes, by using solid-state drives (SSDs) as caches for hard disk drives (HDDs)
crypt - provides data encryption, by using the Linux kernel's Crypto API
delay - delays reads and/or writes to different devices (used for testing)
era - behaves in a way similar to the linear target, while it keeps track of blocks that were written to within a user-defined period of time
error - simulates I/O errors for all mapped blocks (used for testing)
flakey - simulates periodic unreliable behaviour (used for testing)
linear - maps a continuous range of blocks onto another block device
mirror - maps a mirrored logical device, while providing data redundancy
multipath - supports the mapping of multipathed devices, through usage of their path groups
raid - offers an interface to the Linux kernel's software RAID driver (md)
snapshot and snapshot-origin - used for creation of LVM snapshots, as part of the underlying copy-on-write scheme
striped - stripes the data across physical devices, with the number of stripes and the striping chunk size as parameters
zero - an equivalent of /dev/zero, all reads return blocks of zeros, and writes are discarded
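The address arithmetic behind the linear and striped targets in the list above can be modelled in a few lines. This is a conceptual sketch only, not the kernel implementation: the function names, the device names and the sector counts are all invented for illustration.

```python
def build_linear_table(devices):
    """devices: list of (name, length_in_sectors). Returns (start, length, name) rows."""
    table = []
    start = 0
    for name, length in devices:
        table.append((start, length, name))
        start += length
    return table


def map_linear(table, sector):
    """Map a logical sector to (device, sector on that device) by concatenation."""
    for start, length, name in table:
        if start <= sector < start + length:
            return name, sector - start
    raise ValueError("sector beyond end of mapped device")


def map_striped(devices, chunk_sectors, sector):
    """Map a logical sector to (device, sector on that device) by round-robin chunks."""
    chunk_index, within_chunk = divmod(sector, chunk_sectors)
    device = devices[chunk_index % len(devices)]
    device_sector = (chunk_index // len(devices)) * chunk_sectors + within_chunk
    return device, device_sector


if __name__ == "__main__":
    # Two hypothetical disks concatenated with a pair of linear mappings.
    table = build_linear_table([("sda", 100), ("sdb", 50)])
    print(map_linear(table, 99), map_linear(table, 100))
    # The same two disks striped with a chunk size of 8 sectors.
    print(map_striped(["sda", "sdb"], 8, 25))
```

With concatenation, consecutive logical sectors stay on one disk until it is full; with striping, consecutive chunks alternate between disks, which spreads sequential I/O over all of them.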

📚 References

Device mapper
Device mapper doc
linux/device-mapper.h inc
drivers/md src
https://lwn.net/Kernel/Index/#Device_mapper

Multi-Queue Block IO Queueing


The blk-mq API enhances IO performance by leveraging multiple queues for parallel processing, addressing bottlenecks from traditional single-queue designs. It uses software queues for scheduling, merging, and reordering requests, and hardware queues to interface directly with devices. If hardware resources are limited, requests are temporarily queued for later dispatch.

⚲ Interfaces:

/sys/devices/.../mq/
linux/blk-mq.h inc
Structures:
blk_mq_hw_ctx id – hardware dispatch queue context
blk_mq_tag_set id – shared between request queues
blk_mq_ops id
blk_mq_tags id
blk_mq_queue_map id – map software queues to hardware queues
request id


πŸ‘οΈ Example

drivers/block/null_blk src – multi-queue aware block test driver


βš™οΈ Internals

/sys/kernel/debug/block/*/hctx*
block/blk-mq.h src
blk_mq_ctx id – software staging queue context
block/blk-mq.c src – block multi-queue core code
block/blk-mq-tag.c src – tag allocation using scalable bitmaps
...


📖 References

Multi-Queue Block IO Queueing Mechanism (blk-mq) doc

I/O scheduler


I/O scheduling (or disk scheduling) is the method chosen by the kernel to decide in which order the block I/O operations will be submitted to the storage volumes. I/O scheduling usually has to work with hard disk drives that have long access times for requests placed far away from the current position of the disk head (this operation is called a seek). To minimize the effect this has on system performance, most I/O schedulers implement a variant of the elevator algorithm that reorders the incoming randomly ordered requests so the associated data would be accessed with minimal arm/head movement.
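The reordering idea described above can be sketched with a simplified, one-pass variant of the elevator algorithm: serve every request at or beyond the current head position in ascending order, then sweep back through the rest. This is a toy model and not the code of any kernel scheduler; the function names and the sector numbers are made up for illustration.

```python
def elevator_order(head, requests):
    """Order requests for one upward sweep followed by one downward sweep."""
    ahead = sorted(sector for sector in requests if sector >= head)
    behind = sorted((sector for sector in requests if sector < head), reverse=True)
    return ahead + behind


def total_seek_distance(head, order):
    """Sum of head movements needed to serve the requests in the given order."""
    distance = 0
    for sector in order:
        distance += abs(sector - head)
        head = sector
    return distance


if __name__ == "__main__":
    head = 50
    arrivals = [95, 10, 60, 20, 80]      # hypothetical request positions
    print("arrival order:", total_seek_distance(head, arrivals))
    reordered = elevator_order(head, arrivals)
    print("elevator order:", reordered, total_seek_distance(head, reordered))
```

For this input the head travels 280 units when requests are served in arrival order and 130 units after reordering.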

The I/O scheduler used with a particular block device can be switched at run time by modifying the corresponding /sys/block/<block_device>/queue/scheduler file in the sysfs filesystem. Some I/O schedulers also have tunable parameters that can be set through files in /sys/block/<block_device>/queue/iosched/.
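Reading the scheduler file returns the available schedulers on one line, with the active one in square brackets. A minimal sketch of reading and parsing it follows; the function name parse_scheduler is hypothetical, and the sample string in the fallback branch is illustrative rather than output from a real device.

```python
import glob


def parse_scheduler(text: str):
    """Return (active scheduler or None, list of all schedulers named in text)."""
    active = None
    available = []
    for name in text.split():
        if name.startswith("[") and name.endswith("]"):
            name = name[1:-1]            # brackets mark the active scheduler
            active = name
        available.append(name)
    return active, available


if __name__ == "__main__":
    paths = sorted(glob.glob("/sys/block/*/queue/scheduler"))
    for path in paths:
        with open(path) as f:
            print(path, parse_scheduler(f.read()))
    if not paths:
        # Illustrative sample input, not real system output.
        print(parse_scheduler("mq-deadline kyber [bfq] none\n"))
```

Switching the scheduler is done by writing one of the listed names to the same file, which normally requires root privileges.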


⚲ Interfaces:

linux/elevator.h inc
Function elv_register id registers struct elevator_type id.
elevator_queue id


βš™οΈ Internals:

block/elevator.c src
block/Kconfig.iosched src
block/bfq-iosched.c src
block/kyber-iosched.c src
block/mq-deadline.c src
include/trace/events/block.h inc


πŸ“– References:

I/O scheduling
Elevator algorithm
Switching Scheduler doc
BFQ - Budget Fair Queueing doc
Deadline IO scheduler tunables doc
https://www.cloudbees.com/blog/linux-io-scheduler-tuning/
https://wiki.ubuntu.com/Kernel/Reference/IOSchedulers

📖 References

Block devices doc
Switching Scheduler doc
BFQ - Budget Fair Queueing doc
Deadline IO scheduler tunables doc
Kyber I/O scheduler tunables doc
Multi-Queue Block IO Queueing Mechanism (blk-mq) doc


📚 Further reading

/sys/kernel/debug/block/*/
https://lwn.net/Kernel/Index/#Block_layer
block devices ML
LDD3:Block Drivers
LDD1:Loading Block Drivers
ULK3 Chapter 14. Block Device Drivers
Linux SCSI Generic (sg) Driver
Scsi_debug adapter driver for Linux
https://github.com/doug-gilbert/sg3_utils

🔧 TODO


⚙️ Internals

drivers/nvmem src – Non-volatile memory
drivers/sdio src – Secure Digital Input Output
drivers/scsi src – Small Computer System Interface
drivers/virtio src
drivers/mtd src – Memory Technology Device for 🤖 embedded devices

NVMe


NVM Express (NVMe) drivers provide access to a computer's non-volatile storage. Local storage is attached via the PCI Express bus. The PCI NVMe device driver entry point is nvme_init id. The remote storage driver is called the target and the local proxy driver is called the host. Fabrics connect remote targets with the local host. A fabric can be based on the RDMA, TCP or Fibre Channel protocols.


⚲ API:

nvme-cli
uapi/linux/nvme_ioctl.h inc
linux/nvme.h inc


βš™οΈ Internals:

drivers/nvme src

Host drivers/nvme/host src:

⚲ Interfaces:

drivers/nvme/host/nvme.h src
nvme_init_ctrl id initializes the NVMe controller structure nvme_ctrl id with operations nvme_ctrl_ops id
a subroutine of nvme_scan_work id adds a new disk with device_add_disk id


nvme_init id - local PCI nvme module init
nvme_probe id
nvme_init_ctrl id ...
nvme_pci_ctrl_ops id
nvme_core_init id - module init


Fabrics

⚲ interfaces:

drivers/nvme/host/fabrics.h src
nvmf_register_transport id registers nvmf_transport_ops id
nvmf_init id - fabrics module init

⚙️ internals:

nvmf_init id - fabrics module init
nvmf_misc id
nvmf_dev_fops id
nvmf_dev_write id
nvmf_create_ctrl id binds nvmf_transport_ops id


Target drivers/nvme/target src:

⚲ Interfaces: drivers/nvme/target/nvmet.h src

nvmet_register_transport id registers nvmet_fabrics_ops id
nvmet_init id - module init
fcloop_init id - loopback test module init which can be useful to test NVMe-FC transport interfaces.


NVMe over Fabrics

Layers             TCP                       RDMA                       Fibre Channel
Host modules       nvme_tcp_init_module id   nvme_rdma_init_module id   nvme_fc_init_module id
Fabrics protocols  linux/nvme-tcp.h inc      linux/nvme-rdma.h inc      linux/nvme-fc.h inc, linux/nvme-fc-driver.h inc
Target modules     nvmet_tcp_init id         nvmet_rdma_init id         nvmet_fc_init_module id


πŸ‘ Example: nvme_loop_init_module id nvme loopback

nvme_loop_transport id - fabrics operations
nvme_loop_create_ctrl id
nvme_loop_create_io_queues id
nvme_loop_ops id - target operation
nvme_loop_add_port id
nvme_loop_queue_response id

Appendices


🚀 Advanced

man 1 pidstat – reports task statistics
/proc/self/io – I/O statistics for the process (see man 5 proc)


💾 Historical storage drivers

drivers/ata src - Parallel ATA


📖 Further reading about storage

bcc/ebpf storage and filesystems tools