^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1) .. SPDX-License-Identifier: GPL-2.0
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 3) ================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 4) Multi-Queue Block IO Queueing Mechanism (blk-mq)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 5) ================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 6)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 7) The Multi-Queue Block IO Queueing Mechanism is an API to enable fast storage
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 8) devices to achieve a huge number of input/output operations per second (IOPS)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 9) through queueing and submitting IO requests to block devices simultaneously,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 10) benefiting from the parallelism offered by modern storage devices.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 11)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 12) Introduction
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 13) ============
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 14)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 15) Background
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 16) ----------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 17)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 18) Magnetic hard disks have been the de facto standard from the beginning of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 19) development of the kernel. The Block IO subsystem aimed to achieve the best
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 20) performance possible for those devices with a high penalty when doing random
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 21) access, and the bottleneck was the mechanical moving parts, a lot slower than
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 22) any layer on the storage stack. One example of such optimization technique
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 23) involves ordering read/write requests according to the current position of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 24) hard disk head.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 25)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 26) However, with the development of Solid State Drives and Non-Volatile Memories
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 27) without mechanical parts nor random access penalty and capable of performing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 28) high parallel access, the bottleneck of the stack had moved from the storage
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 29) device to the operating system. In order to take advantage of the parallelism
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 30) in those devices' design, the multi-queue mechanism was introduced.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 31)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 32) The former design had a single queue to store block IO requests with a single
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 33) lock. That did not scale well in SMP systems due to dirty data in cache and the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 34) bottleneck of having a single lock for multiple processors. This setup also
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 35) suffered with congestion when different processes (or the same process, moving
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 36) to different CPUs) wanted to perform block IO. Instead of this, the blk-mq API
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 37) spawns multiple queues with individual entry points local to the CPU, removing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 38) the need for a lock. A deeper explanation on how this works is covered in the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 39) following section (`Operation`_).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 40)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 41) Operation
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 42) ---------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 43)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 44) When the userspace performs IO to a block device (reading or writing a file,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 45) for instance), blk-mq takes action: it will store and manage IO requests to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 46) the block device, acting as middleware between the userspace (and a file
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 47) system, if present) and the block device driver.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 48)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 49) blk-mq has two group of queues: software staging queues and hardware dispatch
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 50) queues. When the request arrives at the block layer, it will try the shortest
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 51) path possible: send it directly to the hardware queue. However, there are two
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 52) cases that it might not do that: if there's an IO scheduler attached at the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 53) layer or if we want to try to merge requests. In both cases, requests will be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 54) sent to the software queue.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 55)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 56) Then, after the requests are processed by software queues, they will be placed
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 57) at the hardware queue, a second stage queue were the hardware has direct access
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 58) to process those requests. However, if the hardware does not have enough
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 59) resources to accept more requests, blk-mq will places requests on a temporary
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 60) queue, to be sent in the future, when the hardware is able.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 61)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 62) Software staging queues
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 63) ~~~~~~~~~~~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 64)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 65) The block IO subsystem adds requests in the software staging queues
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 66) (represented by struct blk_mq_ctx) in case that they weren't sent
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 67) directly to the driver. A request is one or more BIOs. They arrived at the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 68) block layer through the data structure struct bio. The block layer
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 69) will then build a new structure from it, the struct request that will
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 70) be used to communicate with the device driver. Each queue has its own lock and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 71) the number of queues is defined by a per-CPU or per-node basis.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 72)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 73) The staging queue can be used to merge requests for adjacent sectors. For
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 74) instance, requests for sector 3-6, 6-7, 7-9 can become one request for 3-9.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 75) Even if random access to SSDs and NVMs have the same time of response compared
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 76) to sequential access, grouped requests for sequential access decreases the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 77) number of individual requests. This technique of merging requests is called
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 78) plugging.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 79)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 80) Along with that, the requests can be reordered to ensure fairness of system
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 81) resources (e.g. to ensure that no application suffers from starvation) and/or to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 82) improve IO performance, by an IO scheduler.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 83)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 84) IO Schedulers
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 85) ^^^^^^^^^^^^^
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 86)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 87) There are several schedulers implemented by the block layer, each one following
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 88) a heuristic to improve the IO performance. They are "pluggable" (as in plug
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 89) and play), in the sense of they can be selected at run time using sysfs. You
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 90) can read more about Linux's IO schedulers `here
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 91) <https://www.kernel.org/doc/html/latest/block/index.html>`_. The scheduling
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 92) happens only between requests in the same queue, so it is not possible to merge
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 93) requests from different queues, otherwise there would be cache trashing and a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 94) need to have a lock for each queue. After the scheduling, the requests are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 95) eligible to be sent to the hardware. One of the possible schedulers to be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 96) selected is the NONE scheduler, the most straightforward one. It will just
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 97) place requests on whatever software queue the process is running on, without
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 98) any reordering. When the device starts processing requests in the hardware
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 99) queue (a.k.a. run the hardware queue), the software queues mapped to that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 100) hardware queue will be drained in sequence according to their mapping.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 101)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 102) Hardware dispatch queues
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 103) ~~~~~~~~~~~~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 104)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 105) The hardware queue (represented by struct blk_mq_hw_ctx) is a struct
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 106) used by device drivers to map the device submission queues (or device DMA ring
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 107) buffer), and are the last step of the block layer submission code before the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 108) low level device driver taking ownership of the request. To run this queue, the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 109) block layer removes requests from the associated software queues and tries to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 110) dispatch to the hardware.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 111)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 112) If it's not possible to send the requests directly to hardware, they will be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 113) added to a linked list (``hctx->dispatch``) of requests. Then,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 114) next time the block layer runs a queue, it will send the requests laying at the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 115) ``dispatch`` list first, to ensure a fairness dispatch with those
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 116) requests that were ready to be sent first. The number of hardware queues
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 117) depends on the number of hardware contexts supported by the hardware and its
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 118) device driver, but it will not be more than the number of cores of the system.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 119) There is no reordering at this stage, and each software queue has a set of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 120) hardware queues to send requests for.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 121)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 122) .. note::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 123)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 124) Neither the block layer nor the device protocols guarantee
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 125) the order of completion of requests. This must be handled by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 126) higher layers, like the filesystem.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 127)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 128) Tag-based completion
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 129) ~~~~~~~~~~~~~~~~~~~~
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 130)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 131) In order to indicate which request has been completed, every request is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 132) identified by an integer, ranging from 0 to the dispatch queue size. This tag
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 133) is generated by the block layer and later reused by the device driver, removing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 134) the need to create a redundant identifier. When a request is completed in the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 135) drive, the tag is sent back to the block layer to notify it of the finalization.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 136) This removes the need to do a linear search to find out which IO has been
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 137) completed.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 138)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 139) Further reading
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 140) ---------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 141)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 142) - `Linux Block IO: Introducing Multi-queue SSD Access on Multi-core Systems <http://kernel.dk/blk-mq.pdf>`_
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 143)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 144) - `NOOP scheduler <https://en.wikipedia.org/wiki/Noop_scheduler>`_
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 145)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 146) - `Null block device driver <https://www.kernel.org/doc/html/latest/block/null_blk.html>`_
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 147)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 148) Source code documentation
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 149) =========================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 150)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 151) .. kernel-doc:: include/linux/blk-mq.h
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 152)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 153) .. kernel-doc:: block/blk-mq.c