Orange Pi5 kernel

Deprecated Linux kernel 5.10.110 for OrangePi 5/5B/5+ boards

^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300    1) =====================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300    2) Notes on the Generic Block Layer Rewrite in Linux 2.5
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300    3) =====================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300    4) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300    5) .. note::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300    6) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300    7) 	It seems that there is a lot of outdated material here. This seems
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300    8) 	to be written somewhat as a task list. Yet, eventually, something
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300    9) 	here might still be useful.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   10) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   11) Notes Written on Jan 15, 2002:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   12) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   13) 	- Jens Axboe <jens.axboe@oracle.com>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   14) 	- Suparna Bhattacharya <suparna@in.ibm.com>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   15) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   16) Last Updated May 2, 2002
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   17) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   18) September 2003: Updated I/O Scheduler portions
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   19) 	- Nick Piggin <npiggin@kernel.dk>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   20) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   21) Introduction
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   22) ============
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   23) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   24) These are some notes describing some aspects of the 2.5 block layer in the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   25) context of the bio rewrite. The idea is to bring out some of the key
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   26) changes and a glimpse of the rationale behind those changes.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   27) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   28) Please mail corrections & suggestions to suparna@in.ibm.com.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   29) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   30) Credits
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   31) =======
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   32) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   33) 2.5 bio rewrite:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   34) 	- Jens Axboe <jens.axboe@oracle.com>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   35) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   36) Many aspects of the generic block layer redesign were driven by and evolved
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   37) over discussions, prior patches and the collective experience of several
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   38) people. See sections 8 and 9 for a list of some related references.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   39) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   40) The following people helped with review comments and inputs for this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   41) document:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   42) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   43) 	- Christoph Hellwig <hch@infradead.org>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   44) 	- Arjan van de Ven <arjanv@redhat.com>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   45) 	- Randy Dunlap <rdunlap@xenotime.net>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   46) 	- Andre Hedrick <andre@linux-ide.org>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   47) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   48) The following people helped with fixes/contributions to the bio patches
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   49) while they were still work-in-progress:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   50) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   51) 	- David S. Miller <davem@redhat.com>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   52) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   53) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   54) .. Description of Contents:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   55) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   56)    1. Scope for tuning of logic to various needs
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   57)      1.1 Tuning based on device or low level driver capabilities
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   58) 	- Per-queue parameters
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   59) 	- Highmem I/O support
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   60) 	- I/O scheduler modularization
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   61)      1.2 Tuning based on high level requirements/capabilities
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   62) 	1.2.1 Request Priority/Latency
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   63)      1.3 Direct access/bypass to lower layers for diagnostics and special
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   64) 	 device operations
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   65) 	1.3.1 Pre-built commands
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   66)    2. New flexible and generic but minimalist i/o structure or descriptor
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   67)       (instead of using buffer heads at the i/o layer)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   68)      2.1 Requirements/Goals addressed
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   69)      2.2 The bio struct in detail (multi-page io unit)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   70)      2.3 Changes in the request structure
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   71)    3. Using bios
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   72)      3.1 Setup/teardown (allocation, splitting)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   73)      3.2 Generic bio helper routines
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   74)        3.2.1 Traversing segments and completion units in a request
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   75)        3.2.2 Setting up DMA scatterlists
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   76)        3.2.3 I/O completion
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   77)        3.2.4 Implications for drivers that do not interpret bios (don't handle
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   78) 	  multiple segments)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   79)      3.3 I/O submission
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   80)    4. The I/O scheduler
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   81)    5. Scalability related changes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   82)      5.1 Granular locking: Removal of io_request_lock
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   83)      5.2 Prepare for transition to 64 bit sector_t
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   84)    6. Other Changes/Implications
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   85)      6.1 Partition re-mapping handled by the generic block layer
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   86)    7. A few tips on migration of older drivers
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   87)    8. A list of prior/related/impacted patches/ideas
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   88)    9. Other References/Discussion Threads
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   89) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   90) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   91) Bio Notes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   92) =========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   93) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   94) Let us discuss the changes in the context of how some overall goals for the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   95) block layer are addressed.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   96) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   97) 1. Scope for tuning the generic logic to satisfy various requirements
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   98) =====================================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   99) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  100) The block layer design supports adaptable abstractions to handle common
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  101) processing with the ability to tune the logic to an appropriate extent
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  102) depending on the nature of the device and the requirements of the caller.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  103) One of the objectives of the rewrite was to increase the degree of tunability
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  104) and to enable higher level code to utilize underlying device/driver
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  105) capabilities to the maximum extent for better i/o performance. This is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  106) important especially in the light of ever improving hardware capabilities
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  107) and application/middleware software designed to take advantage of these
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  108) capabilities.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  109) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  110) 1.1 Tuning based on low level device / driver capabilities
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  111) ----------------------------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  112) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  113) Sophisticated devices with large built-in caches, intelligent i/o scheduling
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  114) optimizations, high memory DMA support, etc. may find some of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  115) generic processing an overhead, while for less capable devices the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  116) generic functionality is essential for performance or correctness reasons.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  117) Knowledge of some of the capabilities or parameters of the device should be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  118) used at the generic block layer to make the right decisions on
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  119) behalf of the driver.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  120) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  121) How is this achieved?
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  122) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  123) Tuning at a per-queue level:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  124) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  125) i. Per-queue limits/values exported to the generic layer by the driver
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  126) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  127) Various parameters that the generic i/o scheduler logic uses are set at
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  128) a per-queue level (e.g. maximum request size, maximum number of segments in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  129) a scatter-gather list, logical block size).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  130) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  131) Some parameters that were earlier available as global arrays indexed by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  132) major/minor are now directly associated with the queue. Some of these may
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  133) move into the block device structure in the future. Some characteristics
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  134) have been incorporated into a queue flags field rather than separate fields
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  135) in themselves.  There are blk_queue_xxx functions to set the parameters,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  136) rather than updating the fields directly.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  137) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  138) Some new queue property settings:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  139) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  140) 	blk_queue_bounce_limit(q, u64 dma_address)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  141) 		Enable I/O to highmem pages, dma_address being the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  142) 		limit. The default is no highmem I/O.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  143) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  144) 	blk_queue_max_sectors(q, max_sectors)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  145) 		Sets two variables that limit the size of the request.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  146) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  147) 		- The request queue's max_sectors, which is a soft size in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  148) 		  units of 512 byte sectors, and could be dynamically varied
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  149) 		  by the core kernel.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  150) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  151) 		- The request queue's max_hw_sectors, which is a hard limit
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  152) 		  and reflects the maximum size request a driver can handle
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  153) 		  in units of 512 byte sectors.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  154) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  155) 		The default for both max_sectors and max_hw_sectors is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  156) 		255. The upper limit of max_sectors is 1024.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  157) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  158) 	blk_queue_max_phys_segments(q, max_segments)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  159) 		Maximum physical segments you can handle in a request. 128
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  160) 		default (driver limit). (See 3.2.2)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  161) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  162) 	blk_queue_max_hw_segments(q, max_segments)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  163) 		Maximum dma segments the hardware can handle in a request. 128
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  164) 		default (host adapter limit, after dma remapping).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  165) 		(See 3.2.2)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  166) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  167) 	blk_queue_max_segment_size(q, max_seg_size)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  168) 		Maximum size of a clustered segment, 64kB default.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  169) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  170) 	blk_queue_logical_block_size(q, logical_block_size)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  171) 		Lowest possible sector size that the hardware can operate
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  172) 		on, 512 bytes default.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  173) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  174) New queue flags:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  175) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  176) 	- QUEUE_FLAG_CLUSTER (see 3.2.2)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  177) 	- QUEUE_FLAG_QUEUED (see 3.2.4)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  178) 
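As a rough illustration of the per-queue settings listed above, the following
is a simplified userspace model, not kernel code. The struct and the model_*
names are hypothetical; only the defaults (255 sectors, 128 segments, 64kB,
512 bytes) and the 1024-sector cap on max_sectors are taken from the text. It
assumes that the setter records the driver's value as the hard limit and
clamps the soft limit to the cap.

.. code-block:: c

	#include <assert.h>

	/* Upper limit of the soft max_sectors value, as stated in the text. */
	#define MODEL_MAX_SECTORS_CAP 1024u

	/* Hypothetical stand-in for the per-queue limits; not struct request_queue. */
	struct queue_limits_model {
		unsigned int max_sectors;        /* soft limit, 512-byte sectors */
		unsigned int max_hw_sectors;     /* hard limit, 512-byte sectors */
		unsigned int max_phys_segments;  /* driver limit */
		unsigned int max_hw_segments;    /* host adapter limit */
		unsigned int max_segment_size;   /* bytes */
		unsigned int logical_block_size; /* bytes */
	};

	/* Fill in the defaults listed in the text. */
	void model_queue_init(struct queue_limits_model *q)
	{
		q->max_sectors = 255;
		q->max_hw_sectors = 255;
		q->max_phys_segments = 128;
		q->max_hw_segments = 128;
		q->max_segment_size = 64 * 1024;
		q->logical_block_size = 512;
	}

	/* Record the hard limit and derive the soft limit, capped at 1024. */
	void model_set_max_sectors(struct queue_limits_model *q,
				   unsigned int max_sectors)
	{
		assert(max_sectors > 0);
		q->max_hw_sectors = max_sectors;
		q->max_sectors = max_sectors < MODEL_MAX_SECTORS_CAP ?
				 max_sectors : MODEL_MAX_SECTORS_CAP;
	}

	/* Largest request, in bytes, that the soft limit allows. */
	unsigned int model_max_request_bytes(const struct queue_limits_model *q)
	{
		return q->max_sectors * 512u;
	}

In this model, a driver that reports 2048 sectors ends up with a
max_hw_sectors of 2048 and a max_sectors of 1024, while one that reports 128
gets 128 for both.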
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  179) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  180) ii. High-mem i/o capabilities are now considered the default
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  181) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  182) The generic bounce buffer logic, present in 2.4, where the block layer would
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  183) by default copyin/out i/o requests on high-memory buffers to low-memory buffers
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  184) assuming that the driver wouldn't be able to handle it directly, has been
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  185) changed in 2.5. The bounce logic is now applied only for memory ranges
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  186) for which the device cannot handle i/o. A driver can specify this by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  187) setting the queue bounce limit for the request queue for the device
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  188) (blk_queue_bounce_limit()). This avoids the inefficiencies of the copyin/out
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  189) where a device is capable of handling high memory i/o.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  190) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  191) In order to enable high-memory i/o where the device is capable of supporting
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  192) it, the pci dma mapping routines and associated data structures have now been
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  193) modified to accomplish a direct page -> bus translation, without requiring
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  194) a virtual address mapping (unlike the earlier scheme of virtual address
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  195) -> bus translation). So this works uniformly for high-memory pages (which
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  196) do not have a corresponding kernel virtual address space mapping) and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  197) low-memory pages.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  198) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  199) Note: Please refer to :doc:`/core-api/dma-api-howto` for a discussion
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  200) on PCI high mem DMA aspects and mapping of scatter gather lists, and support
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  201) for 64 bit PCI.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  202) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  203) Special handling is required only for cases where i/o needs to happen on
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  204) pages at physical memory addresses beyond what the device can support. In these
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  205) cases, a bounce bio representing a buffer from the supported memory range
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  206) is used for performing the i/o with copyin/copyout as needed depending on
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  207) the type of the operation.  For example, in case of a read operation, the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  208) data read has to be copied to the original buffer on i/o completion, so a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  209) callback routine is set up to do this, while for write, the data is copied
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  210) from the original buffer to the bounce buffer prior to issuing the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  211) operation. Since an original buffer may be in a high memory area that's not
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  212) mapped in the kernel virtual address space, a kmap operation may be required for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  213) performing the copy, and special care may be needed in the completion path
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  214) as it may not be in irq context. Special care is also required (by way of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  215) GFP flags) when allocating bounce buffers, to avoid certain highmem
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  216) deadlock possibilities.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  217) 
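The bounce decision and the copy directions described above can be sketched
in plain userspace C. This is only a model: the model_* names are
hypothetical, addresses are plain integers, the bounce limit is assumed to be
the highest address the device can reach (inclusive), and no kmap or GFP
handling is shown.

.. code-block:: c

	#include <assert.h>
	#include <stdint.h>
	#include <string.h>

	enum model_dir { MODEL_READ, MODEL_WRITE };

	/* Nonzero when the buffer's last byte lies above the device's limit. */
	int model_needs_bounce(uint64_t phys_addr, uint64_t len,
			       uint64_t bounce_limit)
	{
		assert(len > 0);
		return phys_addr + len - 1 > bounce_limit;
	}

	/* Before submission: a write carries the caller's data into the bounce buffer. */
	void model_bounce_prepare(enum model_dir dir, void *bounce,
				  const void *orig, size_t len)
	{
		if (dir == MODEL_WRITE)
			memcpy(bounce, orig, len);
	}

	/* Completion callback: a read hands the device's data back to the caller. */
	void model_bounce_complete(enum model_dir dir, void *orig,
				   const void *bounce, size_t len)
	{
		if (dir == MODEL_READ)
			memcpy(orig, bounce, len);
	}

A caller in this model would first check model_needs_bounce() and, only when
it returns nonzero, wrap the transfer between model_bounce_prepare() and
model_bounce_complete().
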
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  218) It is also possible that a bounce buffer may be allocated from high-memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  219) area that's not mapped in the kernel virtual address space, but within the range that the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  220) device can use directly; so the bounce page may need to be kmapped during
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  221) copy operations. [Note: This does not hold in the current implementation,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  222) though]
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  223) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  224) There are some situations when pages from high memory may need to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  225) be kmapped, even if bounce buffers are not necessary. For example a device
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  226) may need to abort DMA operations and revert to PIO for the transfer, in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  227) which case a virtual mapping of the page is required. For SCSI it is also
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  228) done in some scenarios where the low level driver cannot be trusted to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  229) handle a single sg entry correctly. The driver is expected to perform the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  230) kmaps as needed on such occasions. A driver could also use
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  231) the blk_queue_bounce() routine on its own to bounce highmem i/o to low
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  232) memory for specific requests if so desired.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  233) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  234) iii. The i/o scheduler algorithm itself can be replaced/set as appropriate
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  235) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  236) As in 2.4, it is possible to plug in a brand new i/o scheduler for a particular
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  237) queue or pick from (copy) existing generic schedulers and replace/override
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  238) certain portions of it. The 2.5 rewrite provides improved modularization
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  239) of the i/o scheduler. There are more pluggable callbacks, e.g. for init,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  240) add request, extract request, which makes it possible to abstract specific
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  241) i/o scheduling algorithm aspects and details outside of the generic loop.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  242) It also makes it possible to completely hide the implementation details of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  243) the i/o scheduler from block drivers.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  244) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  245) I/O scheduler wrappers are to be used instead of accessing the queue directly.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  246) See section 4 (The I/O scheduler) for details.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  247) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  248) 1.2 Tuning Based on High level code capabilities
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  249) ------------------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  250) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  251) i. Application capabilities for raw i/o
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  252) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  253) This comes from some of the high-performance database/middleware
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  254) requirements where an application prefers to make its own i/o scheduling
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  255) decisions based on an understanding of the access patterns and i/o
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  256) characteristics.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  257) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  258) ii. High performance filesystems or other higher level kernel code's
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  259) capabilities
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  260) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  261) Kernel components like filesystems could also make their own i/o scheduling
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  262) decisions for optimizing performance. Journalling filesystems may need
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  263) some control over i/o ordering.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  264) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  265) What kind of support exists at the generic block layer for this?
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  266) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  267) The flags and rw fields in the bio structure can be used for some tuning
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  268) from above, e.g. indicating that an i/o is just a readahead request, or priority
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  269) settings (currently unused). As far as user applications are concerned, they
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  270) would need an additional mechanism either via open flags or ioctls, or some
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  271) other upper level mechanism to communicate such settings to the block layer.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  272) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  273) 1.2.1 Request Priority/Latency
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  274) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  275) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  276) Todo/Under discussion::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  277) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  278)   Arjan's proposed request priority scheme allows higher levels some broad
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  279)   control (high/med/low) over the priority of an i/o request vs other pending
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  280)   requests in the queue. For example it allows reads for bringing in an
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  281)   executable page on demand to be given a higher priority over pending write
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  282)   requests which haven't aged too much on the queue. Potentially this priority
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  283)   could even be exposed to applications in some manner, providing higher level
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  284)   tunability. Time based aging avoids starvation of lower priority
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  285)   requests. Some bits in the bi_opf flags field in the bio structure are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  286)   intended to be used for this priority information.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  287) 
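One way to picture priority selection with time-based aging is the following
userspace sketch. It is not the proposed scheme itself: the names, the
three-level enum, the tick counter, and the single max_age threshold are all
assumptions made for illustration.

.. code-block:: c

	#include <assert.h>
	#include <stddef.h>

	enum model_prio { MODEL_PRIO_LOW, MODEL_PRIO_MED, MODEL_PRIO_HIGH };

	struct model_pending {
		enum model_prio prio;
		unsigned long queued_at; /* tick at which the request was queued */
	};

	/*
	 * Return the index of the request to service next: the oldest request
	 * if it has waited max_age ticks or more; otherwise the highest
	 * priority request, with ties going to the one queued first.
	 * Assumes now is not earlier than any queued_at value.
	 */
	size_t model_pick_next(const struct model_pending *reqs, size_t n,
			       unsigned long now, unsigned long max_age)
	{
		size_t best = 0, oldest = 0, i;

		assert(n > 0);
		for (i = 1; i < n; i++) {
			if (reqs[i].queued_at < reqs[oldest].queued_at)
				oldest = i;
			if (reqs[i].prio > reqs[best].prio ||
			    (reqs[i].prio == reqs[best].prio &&
			     reqs[i].queued_at < reqs[best].queued_at))
				best = i;
		}
		if (now - reqs[oldest].queued_at >= max_age)
			return oldest;
		return best;
	}

In this sketch a high priority read is picked ahead of an older low priority
write until the write has waited max_age ticks, at which point the write is
serviced first so that it cannot starve.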
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  288) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  289) 1.3 Direct Access to Low level Device/Driver Capabilities (Bypass mode)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  290) -----------------------------------------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  291) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  292) (e.g. Diagnostics, Systems Management)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  293) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  294) There are situations where high-level code needs to have direct access to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  295) the low level device capabilities or requires the ability to issue commands
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  296) to the device bypassing some of the intermediate i/o layers.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  297) These could, for example, be special control commands issued through ioctl
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  298) interfaces, or could be raw read/write commands that stress the drive's
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  299) capabilities for certain kinds of fitness tests. Having direct interfaces at
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  300) multiple levels without having to pass through upper layers makes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  301) it possible to perform bottom up validation of the i/o path, layer by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  302) layer, starting from the media.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  303) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  304) The normal i/o submission interfaces, e.g. submit_bio, could be bypassed
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  305) for specially crafted requests which such ioctl or diagnostics
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  306) interfaces would typically use. The elevator add_request routine
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  307) can instead be used to directly insert such requests in the queue or, preferably,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  308) the blk_do_rq routine can be used to place the request on the queue and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  309) wait for completion. Alternatively, sometimes the caller might just
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  310) invoke a lower level driver specific interface with the request as a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  311) parameter.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  312) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  313) If the request is a means for passing on special information associated with
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  314) the command, then such information is associated with the request->special
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  315) field (rather than misuse the request->buffer field which is meant for the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  316) request data buffer's virtual mapping).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  317) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  318) For passing request data, the caller must build up a bio descriptor
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  319) representing the concerned memory buffer if the underlying driver interprets
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  320) bio segments or uses the block layer end*request* functions for i/o
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  321) completion. Alternatively one could directly use the request->buffer field to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  322) specify the virtual address of the buffer, if the driver expects buffer
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  323) addresses passed in this way and ignores bio entries for the request type
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  324) involved. In the latter case, the driver would modify and manage the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  325) request->buffer, request->sector and request->nr_sectors or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  326) request->current_nr_sectors fields itself rather than using the block layer
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  327) end_request or end_that_request_first completion interfaces.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  328) (See 2.3 or Documentation/block/request.rst for a brief explanation of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  329) the request structure fields)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  330) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  331) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  332) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  333)   [TBD: end_that_request_last should be usable even in this case;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  334)   Perhaps an end_that_direct_request_first routine could be implemented to make
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  335)   handling direct requests easier for such drivers; Also for drivers that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  336)   expect bios, a helper function could be provided for setting up a bio
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  337)   corresponding to a data buffer]
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  338) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  339)   <JENS: I dont understand the above, why is end_that_request_first() not
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  340)   usable? Or _last for that matter. I must be missing something>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  341) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  342)   <SUP: What I meant here was that if the request doesn't have a bio, then
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  343)    end_that_request_first doesn't modify nr_sectors or current_nr_sectors,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  344)    and hence can't be used for advancing request state settings on the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  345)    completion of partial transfers. The driver has to modify these fields
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  346)    directly by hand.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  347)    This is because end_that_request_first only iterates over the bio list,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  348)    and always returns 0 if there are none associated with the request.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  349)    _last works OK in this case, and is not a problem, as I mentioned earlier
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  350)   >
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  351) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  352) 1.3.1 Pre-built Commands
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  353) ^^^^^^^^^^^^^^^^^^^^^^^^
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  354) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  355) A request can be created with a pre-built custom command to be sent directly
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  356) to the device. The cmd block in the request structure has room for filling
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  357) in the command bytes (i.e. rq->cmd is now 16 bytes in size and meant for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  358) command pre-building, and the type of the request is now indicated
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  359) through rq->flags instead of via rq->cmd).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  360) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  361) The request structure flags can be set up to indicate the type of request
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  362) in such cases (REQ_PC: direct packet command passed to driver, REQ_BLOCK_PC:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  363) packet command issued via blk_do_rq, REQ_SPECIAL: special request).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  364) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  365) It can help to pre-build device commands for requests in advance.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  366) Drivers can now specify a request prepare function (q->prep_rq_fn) that the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  367) block layer would invoke to pre-build device commands for a given request,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  368) or perform other preparatory processing for the request. This routine is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  369) called by elv_next_request(), i.e. typically just before servicing a request.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  370) (The prepare function would not be called for requests that have RQF_DONTPREP
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  371) enabled.)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  372) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  373) Aside:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  374)   Pre-building could possibly even be done early, i.e. before placing the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  375)   request on the queue, rather than construct the command on the fly in the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  376)   driver while servicing the request queue when it may affect latencies in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  377)   interrupt context or responsiveness in general. One way to add early
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  378)   pre-building would be to do it whenever we fail to merge on a request.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  379)   Now REQ_NOMERGE is set in the request flags to skip this one in the future,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  380)   which means that it will not change before we feed it to the device. So
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  381)   the pre-builder hook can be invoked there.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  382) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  383) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  384) 2. Flexible and generic but minimalist i/o structure/descriptor
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  385) ===============================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  386) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  387) 2.1 Reason for a new structure and requirements addressed
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  388) ---------------------------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  389) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  390) Prior to 2.5, buffer heads were used as the unit of i/o at the generic block
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  391) layer, and the low level request structure was associated with a chain of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  392) buffer heads for a contiguous i/o request. This led to certain inefficiencies
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  393) when it came to large i/o requests and readv/writev style operations, as it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  394) forced such requests to be broken up into small chunks before being passed
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  395) on to the generic block layer, only to be merged by the i/o scheduler
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  396) when the underlying device was capable of handling the i/o in one shot.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  397) Also, using the buffer head as an i/o structure for i/os that didn't originate
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  398) from the buffer cache unnecessarily added to the weight of the descriptors
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  399) which were generated for each such chunk.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  400) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  401) The following were some of the goals and expectations considered in the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  402) redesign of the block i/o data structure in 2.5.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  403) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  404) 1.  Should be appropriate as a descriptor for both raw and buffered i/o -
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  405)     avoid cache related fields which are irrelevant in the direct/page i/o path,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  406)     or filesystem block size alignment restrictions which may not be relevant
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  407)     for raw i/o.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  408) 2.  Ability to represent high-memory buffers (which do not have a virtual
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  409)     address mapping in kernel address space).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  410) 3.  Ability to represent large i/os w/o unnecessarily breaking them up (i.e.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  411)     greater than PAGE_SIZE chunks in one shot).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  412) 4.  At the same time, ability to retain independent identity of i/os from
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  413)     different sources or i/o units requiring individual completion (e.g. for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  414)     latency reasons).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  415) 5.  Ability to represent an i/o involving multiple physical memory segments
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  416)     (including non-page aligned page fragments, as specified via readv/writev)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  417)     without unnecessarily breaking it up, if the underlying device is capable of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  418)     handling it.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  419) 6.  Preferably should be based on a memory descriptor structure that can be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  420)     passed around different types of subsystems or layers, maybe even
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  421)     networking, without duplication or extra copies of data/descriptor fields
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  422)     themselves in the process.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  423) 7.  Ability to handle the possibility of splits/merges as the structure passes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  424)     through layered drivers (lvm, md, evms), with minimal overhead.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  425) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  426) The solution was to define a new structure (bio) for the block layer,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  427) instead of using the buffer head structure (bh) directly, the idea being
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  428) avoidance of some associated baggage and limitations. The bio structure
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  429) is uniformly used for all i/o at the block layer; it forms a part of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  430) bh structure for buffered i/o, and in the case of raw/direct i/o kiobufs are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  431) mapped to bio structures.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  432) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  433) 2.2 The bio struct
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  434) ------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  435) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  436) The bio structure uses a vector representation pointing to an array of tuples
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  437) of <page, offset, len> to describe the i/o buffer, and has various other
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  438) fields describing i/o parameters and state that needs to be maintained for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  439) performing the i/o.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  440) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  441) Notice that this representation means that a bio has no virtual address
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  442) mapping at all (unlike buffer heads).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  443) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  444) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  445) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  446)   struct bio_vec {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  447)        struct page     *bv_page;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  448)        unsigned short  bv_len;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  449)        unsigned short  bv_offset;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  450)   };
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  451) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  452)   /*
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  453)    * main unit of I/O for the block layer and lower layers (ie drivers)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  454)    */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  455)   struct bio {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  456)        struct bio          *bi_next;    /* request queue link */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  457)        struct block_device *bi_bdev;	/* target device */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  458)        unsigned long       bi_flags;    /* status, command, etc */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  459)        unsigned long       bi_opf;       /* low bits: r/w, high: priority */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  460) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  461)        unsigned int	bi_vcnt;     /* how many bio_vecs */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  462)        struct bvec_iter	bi_iter;	/* current index into bio_vec array */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  463) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  464)        unsigned int	bi_size;     /* total size in bytes */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  465)        unsigned short	bi_hw_segments; /* segments after DMA remapping */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  466)        unsigned int	bi_max;	     /* max bio_vecs we can hold
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  467)                                         used as index into pool */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  468)        struct bio_vec   *bi_io_vec;  /* the actual vec list */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  469)        bio_end_io_t	*bi_end_io;  /* bi_end_io (bio) */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  470)        atomic_t		bi_cnt;	     /* pin count: free when it hits zero */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  471)        void             *bi_private;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  472)   };
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  473) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  474) With this multipage bio design:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  475) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  476) - Large i/os can be sent down in one go using a bio_vec list consisting
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  477)   of an array of <page, offset, len> fragments (similar to the way fragments
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  478)   are represented in the zero-copy network code)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  479) - Splitting of an i/o request across multiple devices (as in the case of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  480)   lvm or raid) is achieved by cloning the bio (where the clone points to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  481)   the same bi_io_vec array, but with the index and size accordingly modified)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  482) - A linked list of bios is used as before for unrelated merges [#]_ - this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  483)   avoids reallocs and makes independent completions easier to handle.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  484) - Code that traverses the req list can find all the segments of a bio
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  485)   by using rq_for_each_segment.  This handles the fact that a request
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  486)   has multiple bios, each of which can have multiple segments.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  487) - Drivers which can't process a large bio in one shot can use the bi_iter
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  488)   field to keep track of the next bio_vec entry to process.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  489)   (e.g. a 1MB bio_vec needs to be handled in max 128kB chunks for IDE)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  490)   [TBD: Should preferably also have a bi_voffset and bi_vlen to avoid modifying
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  491)   bi_offset and len fields]
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  492) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  493) .. [#]
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  494) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  495) 	unrelated merges -- a request ends up containing two or more bios that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  496) 	didn't originate from the same place.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  497) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  498) bi_end_io() i/o callback gets called on i/o completion of the entire bio.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  499) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  500) At a lower level, drivers build a scatter gather list from the merged bios.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  501) The scatter gather list is in the form of an array of <page, offset, len>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  502) entries with their corresponding dma address mappings filled in at the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  503) appropriate time. As an optimization, contiguous physical pages can be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  504) covered by a single entry where <page> refers to the first page and <len>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  505) covers the range of pages (up to 16 contiguous pages could be covered this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  506) way). There is a helper routine (blk_rq_map_sg) which drivers can use to build
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  507) the sg list.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  508) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  509) Note: Right now the only user of bios with more than one page is ll_rw_kio,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  510) which in turn means that only raw I/O uses it (direct i/o may not work
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  511) right now). The intent however is to enable clustering of pages etc to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  512) become possible. The pagebuf abstraction layer from SGI also uses multi-page
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  513) bios, but that is currently not included in the stock development kernels.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  514) The same is true of Andrew Morton's work-in-progress multipage bio writeout
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  515) and readahead patches.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  516) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  517) 2.3 Changes in the Request Structure
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  518) ------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  519) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  520) The request structure is the structure that gets passed down to low level
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  521) drivers. The block layer make_request function builds up a request structure,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  522) places it on the queue and invokes the driver's request_fn. The driver makes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  523) use of the block layer helper routine elv_next_request to pull the next request
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  524) off the queue. Control or diagnostic functions might bypass block and directly
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  525) invoke underlying driver entry points passing in a specially constructed
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  526) request structure.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  527) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  528) Only some relevant fields (mainly those which changed or may be referred
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  529) to in some of the discussion here) are listed below, not necessarily in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  530) the order in which they occur in the structure (see include/linux/blkdev.h).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  531) Refer to Documentation/block/request.rst for details about all the request
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  532) structure fields and a quick reference about the layers which are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  533) supposed to use or modify those fields::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  534) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  535)   struct request {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  536) 	struct list_head queuelist;  /* Not meant to be directly accessed by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  537) 					the driver.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  538) 					Used by q->elv_next_request_fn
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  539) 					rq->queue is gone
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  540) 					*/
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  541) 	.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  542) 	.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  543) 	unsigned char cmd[16]; /* prebuilt command data block */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  544) 	unsigned long flags;   /* also includes earlier rq->cmd settings */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  545) 	.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  546) 	.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  547) 	sector_t sector; /* this field is now of type sector_t instead of int
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  548) 			    preparation for 64 bit sectors */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  549) 	.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  550) 	.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  551) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  552) 	/* Number of scatter-gather DMA addr+len pairs after
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  553) 	 * physical address coalescing is performed.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  554) 	 */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  555) 	unsigned short nr_phys_segments;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  556) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  557) 	/* Number of scatter-gather addr+len pairs after
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  558) 	 * physical and DMA remapping hardware coalescing is performed.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  559) 	 * This is the number of scatter-gather entries the driver
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  560) 	 * will actually have to deal with after DMA mapping is done.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  561) 	 */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  562) 	unsigned short nr_hw_segments;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  563) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  564) 	/* Various sector counts */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  565) 	unsigned long nr_sectors;  /* no. of sectors left: driver modifiable */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  566) 	unsigned long hard_nr_sectors;  /* block internal copy of above */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  567) 	unsigned int current_nr_sectors; /* no. of sectors left in the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  568) 					   current segment:driver modifiable */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  569) 	unsigned long hard_cur_sectors; /* block internal copy of the above */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  570) 	.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  571) 	.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  572) 	int tag;	/* command tag associated with request */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  573) 	void *special;  /* same as before */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  574) 	char *buffer;   /* valid only for low memory buffers up to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  575) 			 current_nr_sectors */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  576) 	.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  577) 	.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  578) 	struct bio *bio, *biotail;  /* bio list instead of bh */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  579) 	struct request_list *rl;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  580)   };
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  581) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  582) See the req_ops and req_flag_bits definitions for an explanation of the various
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  583) flags available. Some bits are used by the block layer or i/o scheduler.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  584) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  585) The behaviour of the various sector counts is almost the same as before,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  586) except that since we have multi-segment bios, current_nr_sectors refers
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  587) to the number of sectors in the current segment being processed which could
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  588) be one of the many segments in the current bio (i.e. i/o completion unit).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  589) The nr_sectors value refers to the total number of sectors in the whole
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  590) request that remain to be transferred (no change). The purpose of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  591) hard_xxx values is for block to remember these counts every time it hands
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  592) over the request to the driver. These values are updated by block on
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  593) end_that_request_first, i.e. every time the driver completes a part of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  594) transfer and invokes block end*request helpers to mark this. The
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  595) driver should not modify these values. The block layer sets up the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  596) nr_sectors and current_nr_sectors fields (based on the corresponding
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  597) hard_xxx values and the number of bytes transferred) and updates them on
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  598) every transfer that invokes end_that_request_first. It does the same for the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  599) buffer, bio, bio->bi_iter fields too.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  600) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  601) The buffer field is just a virtual address mapping of the current segment
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  602) of the i/o buffer in cases where the buffer resides in low-memory. For high
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  603) memory i/o, this field is not valid and must not be used by drivers.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  604) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  605) Code that sets up its own request structures and passes them down to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  606) a driver needs to be careful about interoperation with the block layer helper
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  607) functions which the driver uses. (Section 1.3)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  608) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  609) 3. Using bios
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  610) =============
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  611) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  612) 3.1 Setup/Teardown
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  613) ------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  614) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  615) There are routines for managing the allocation, reference counting, and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  616) freeing of bios (bio_alloc, bio_get, bio_put).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  617) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  618) This makes use of Ingo Molnar's mempool implementation, which enables
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  619) subsystems like bio to maintain their own reserve memory pools for guaranteed
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  620) deadlock-free allocations during extreme VM load. For example, the VM
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  621) subsystem makes use of the block layer to write out dirty pages in order to be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  622) able to free up memory space, a case which needs careful handling. The
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  623) allocation logic draws from the preallocated emergency reserve in situations
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  624) where it cannot allocate through normal means. If the pool is empty and it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  625) can wait, then it would trigger action that would help free up memory or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  626) replenish the pool (without deadlocking) and wait for availability in the pool.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  627) If it is in IRQ context, and hence not in a position to do this, allocation
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  628) could fail if the pool is empty. In general mempool always first tries to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  629) perform allocation without having to wait, even if it means digging into the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  630) pool as long as it is not less than 50% full.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  631) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  632) On a free, memory is released to the pool or directly freed depending on
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  633) the current availability in the pool. The mempool interface lets the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  634) subsystem specify the routines to be used for normal alloc and free. In the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  635) case of bio, these routines make use of the standard slab allocator.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  636) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  637) The caller of bio_alloc is expected to take certain steps to avoid
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  638) deadlocks, e.g. avoid trying to allocate more memory from the pool while
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  639) already holding memory obtained from the pool.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  640) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  641) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  642) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  643)   [TBD: This is a potential issue, though a rare possibility
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  644)    in the bounce bio allocation that happens in the current code, since
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  645)    it ends up allocating a second bio from the same pool while
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  646)    holding the original bio ]
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  647) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  648) Memory allocated from the pool should be released back within a limited
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  649) amount of time (in the case of bio, that would be after the i/o is completed).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  650) This ensures that if part of the pool has been used up, some work (in this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  651) case i/o) must already be in progress and memory would be available when it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  652) is over. If allocating from multiple pools in the same code path, the order
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  653) or hierarchy of allocation needs to be consistent, just the way one deals
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  654) with multiple locks.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  655) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  656) The bio_alloc routine also needs to allocate the bio_vec_list (bvec_alloc())
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  657) for a non-clone bio. There are 6 pools set up for different size biovecs,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  658) so bio_alloc(gfp_mask, nr_iovecs) will allocate a vec_list of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  659) given size from these slabs.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  660) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  661) The bio_get() routine may be used to hold an extra reference on a bio prior
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  662) to i/o submission, if the bio fields are likely to be accessed after the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  663) i/o is issued (since the bio may otherwise get freed in case i/o completion
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  664) happens in the meantime).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  665) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  666) The bio_clone_fast() routine may be used to duplicate a bio, where the clone
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  667) shares the bio_vec_list with the original bio (i.e. both point to the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  668) same bio_vec_list). This would typically be used for splitting i/o requests
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  669) in lvm or md.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  670) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  671) 3.2 Generic bio helper Routines
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  672) -------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  673) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  674) 3.2.1 Traversing segments and completion units in a request
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  675) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  676) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  677) The macro rq_for_each_segment() should be used for traversing the bios
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  678) in the request list (drivers should avoid directly trying to do it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  679) themselves). Using this helper should also make it easier to cope
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  680) with block changes in the future.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  681) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  682) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  683) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  684) 	struct req_iterator iter;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  685) 	rq_for_each_segment(bio_vec, rq, iter)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  686) 		/* bio_vec is now current segment */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  687) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  688) I/O completion callbacks are per-bio rather than per-segment, so drivers
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  689) that traverse bio chains on completion need to keep that in mind. Drivers
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  690) which don't make a distinction between segments and completion units would
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  691) need to be reorganized to support multi-segment bios.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  692) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  693) 3.2.2 Setting up DMA scatterlists
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  694) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  695) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  696) The blk_rq_map_sg() helper routine would be used for setting up scatter
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  697) gather lists from a request, so a driver need not do it on its own::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  698) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  699) 	nr_segments = blk_rq_map_sg(q, rq, scatterlist);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  700) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  701) The helper routine provides a level of abstraction which makes it easier
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  702) to modify the internals of request to scatterlist conversion down the line
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  703) without breaking drivers. The blk_rq_map_sg routine takes care of several
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  704) things like collapsing physically contiguous segments (if QUEUE_FLAG_CLUSTER
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  705) is set) and correct segment accounting to avoid exceeding the limits which
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  706) the i/o hardware can handle, based on various queue properties.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  707) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  708) - Prevents a clustered segment from crossing a 4GB mem boundary
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  709) - Avoids building segments that would exceed the number of physical
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  710)   memory segments that the driver can handle (phys_segments) and the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  711)   number that the underlying hardware can handle at once, accounting for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  712)   DMA remapping (hw_segments)  (i.e. IOMMU aware limits).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  713) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  714) Routines which the low level driver can use to set up the segment limits:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  715) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  716) blk_queue_max_hw_segments() : Sets an upper limit of the maximum number of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  717) hw data segments in a request (i.e. the maximum number of address/length
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  718) pairs the host adapter can actually hand to the device at once)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  719) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  720) blk_queue_max_phys_segments() : Sets an upper limit on the maximum number
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  721) of physical data segments in a request (i.e. the largest sized scatter list
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  722) a driver could handle)
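
The clustering and limit accounting described above can be pictured with a
small userspace model. This is only a sketch, not the kernel implementation:
struct seg, within_4g() and map_segments() are hypothetical names, each input
segment is assumed to be non-empty and to stay inside one 4GB window, and the
single max_segs parameter stands in for both the phys_segments and hw_segments
limits. The real block layer enforces the limits while requests are being
built and merged; the model simply reports failure when the list does not fit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one physical segment (not a kernel type). */
struct seg {
	uint64_t addr;	/* physical start address */
	uint32_t len;	/* length in bytes, assumed non-zero */
};

/* True if [addr, addr + len) stays inside one 4GB-aligned window. */
static int within_4g(uint64_t addr, uint64_t len)
{
	return (addr >> 32) == ((addr + len - 1) >> 32);
}

/*
 * Coalesce physically contiguous input segments into out[].
 * Returns the number of entries written, or -1 if more than
 * max_segs entries would be needed.
 */
static int map_segments(const struct seg *in, size_t n,
			struct seg *out, size_t max_segs)
{
	size_t used = 0;

	for (size_t i = 0; i < n; i++) {
		assert(in[i].len > 0);
		if (used > 0) {
			struct seg *last = &out[used - 1];
			uint64_t merged = (uint64_t)last->len + in[i].len;

			/* Cluster only if adjacent and not crossing 4GB. */
			if (last->addr + last->len == in[i].addr &&
			    merged <= UINT32_MAX &&
			    within_4g(last->addr, merged)) {
				last->len = (uint32_t)merged;
				continue;
			}
		}
		if (used == max_segs)
			return -1;	/* would exceed the segment limit */
		out[used++] = in[i];
	}
	return (int)used;
}
```

Two adjacent pages collapse into one entry, while two pages that are adjacent
but sit on either side of a 4GB boundary stay as separate entries.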
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  723) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  724) 3.2.3 I/O completion
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  725) ^^^^^^^^^^^^^^^^^^^^
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  726) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  727) The existing generic block layer helper routines end_request,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  728) end_that_request_first and end_that_request_last can be used for i/o
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  729) completion (and setting things up so the rest of the i/o or the next
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  730) request can be kicked off) as before. With the introduction of multi-page
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  731) bio support, end_that_request_first requires an additional argument indicating
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  732) the number of sectors completed.
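
To see why the sector count matters once a request can carry multi-page bios,
here is a toy userspace model of partial completion. It is a sketch under
stated assumptions: toy_bio, toy_req and toy_end_request_first() are
hypothetical names, the done flag stands in for the per-bio completion
callback, and the return convention (1 while work remains, 0 when the request
is finished) is an assumption modelled on the historical helper.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a bio: only what completion accounting needs. */
struct toy_bio {
	unsigned int sectors_left;	/* sectors not yet completed */
	int done;			/* set when the bio's completion fires */
	struct toy_bio *next;		/* next bio in the request */
};

struct toy_req {
	struct toy_bio *bio;		/* first bio with outstanding sectors */
};

/*
 * Account for nr_sectors of finished I/O at the front of the request.
 * Bios are completed as units: a bio is marked done only once all of
 * its sectors have been accounted for.
 * Returns 1 if the request still has outstanding bios, 0 otherwise.
 */
static int toy_end_request_first(struct toy_req *rq, unsigned int nr_sectors)
{
	assert(rq != NULL);

	while (rq->bio && nr_sectors) {
		struct toy_bio *bio = rq->bio;

		if (nr_sectors >= bio->sectors_left) {
			nr_sectors -= bio->sectors_left;
			bio->sectors_left = 0;
			bio->done = 1;		/* per-bio completion */
			rq->bio = bio->next;
		} else {
			bio->sectors_left -= nr_sectors;
			nr_sectors = 0;		/* bio only partly done */
		}
	}
	return rq->bio != NULL;
}
```

Completing 6 sectors of an 8-sector bio leaves that bio pending; only when the
remaining 2 sectors complete does its completion fire and the request advance
to the next bio.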
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  733) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  734) 3.2.4 Implications for drivers that do not interpret bios
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  735) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  736) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  737) (don't handle multiple segments)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  738) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  739) Drivers that do not interpret bios, e.g. those which do not handle multiple
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  740) segments and do not support i/o into high memory addresses (require bounce
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  741) buffers) and expect only virtually mapped buffers, can access the rq->buffer
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  742) field. As before the driver should use current_nr_sectors to determine the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  743) size of remaining data in the current segment (that is the maximum it can
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  744) transfer in one go unless it interprets segments), and rely on the block layer
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  745) end_request, or end_that_request_first/last to take care of all accounting
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  746) and transparent mapping of the next bio segment when a segment boundary
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  747) is crossed on completion of a transfer. (The end*request* functions should
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  748) be used if and only if the request has come down from block/bio path, not for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  749) direct access requests which only specify rq->buffer without a valid rq->bio)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  750) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  751) 3.3 I/O Submission
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  752) ------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  753) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  754) The routine submit_bio() is used to submit a single io. Higher level i/o
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  755) routines make use of this:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  756) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  757) (a) Buffered i/o:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  758) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  759) The routine submit_bh() invokes submit_bio() on a bio corresponding to the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  760) bh, allocating the bio if required. ll_rw_block() uses submit_bh() as before.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  761) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  762) (b) Kiobuf i/o (for raw/direct i/o):
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  763) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  764) The ll_rw_kio() routine breaks up the kiobuf into page sized chunks and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  765) maps the array to one or more multi-page bios, issuing submit_bio() to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  766) perform the i/o on each of these.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  767) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  768) The embedded bh array in the kiobuf structure has been removed and no
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  769) preallocation of bios is done for kiobufs. [The intent is to remove the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  770) blocks array as well, but it's currently in there to kludge around direct i/o.]
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  771) Thus kiobuf allocation has switched back to using kmalloc rather than vmalloc.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  772) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  773) Todo/Observation:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  774) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  775)  A single kiobuf structure is assumed to correspond to a contiguous range
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  776)  of data, so brw_kiovec() invokes ll_rw_kio for each kiobuf in a kiovec.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  777)  So right now it wouldn't work for direct i/o on non-contiguous blocks.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  778)  This is to be resolved.  The eventual direction is to replace kiobuf
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  779)  by kvec's.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  780) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  781)  Badari Pulavarty has a patch to implement direct i/o correctly using
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  782)  bio and kvec.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  783) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  784) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  785) (c) Page i/o:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  786) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  787) Todo/Under discussion:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  788) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  789)  Andrew Morton's multi-page bio patches attempt to issue multi-page
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  790)  writeouts (and reads) from the page cache, by directly building up
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  791)  large bios for submission completely bypassing the usage of buffer
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  792)  heads. This work is still in progress.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  793) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  794)  Christoph Hellwig had some code that uses bios for page-io (rather than
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  795)  bh). This isn't included in bio as yet. Christoph was also working on a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  796)  design for representing virtual/real extents as an entity and modifying
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  797)  some of the address space ops interfaces to utilize this abstraction rather
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  798)  than buffer_heads. (This is somewhat along the lines of the SGI XFS pagebuf
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  799)  abstraction, but intended to be as lightweight as possible).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  800) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  801) (d) Direct access i/o:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  802) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  803) Direct access requests that do not contain bios would be submitted differently
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  804) as discussed earlier in section 1.3.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  805) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  806) Aside:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  807) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  808)   Kvec i/o:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  809) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  810)   Ben LaHaise's aio code uses a slightly different structure instead
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  811)   of kiobufs, called a kvec_cb. This contains an array of <page, offset, len>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  812)   tuples (very much like the networking code), together with a callback function
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  813)   and data pointer. This is embedded into a brw_cb structure when passed
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  814)   to brw_kvec_async().
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  815) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  816)   Now it should be possible to directly map these kvecs to a bio. Just as while
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  817)   cloning, in this case rather than PRE_BUILT bio_vecs, we set the bi_io_vec
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  818)   array pointer to point to the veclet array in kvecs.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  819) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  820)   TBD: In order for this to work, some changes are needed in the way multi-page
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  821)   bios are handled today. The values of the tuples in such a vector passed in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  822)   from higher level code should not be modified by the block layer in the course
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  823)   of its request processing, since that would make it hard for the higher layer
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  824)   to continue to use the vector descriptor (kvec) after i/o completes. Instead,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  825)   all such transient state should be maintained in the request structure
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  826)   and passed on in some way to the endio completion routine.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  827) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  828) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  829) 4. The I/O scheduler
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  830) ====================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  831) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  832) The I/O scheduler, a.k.a. elevator, is implemented in two layers: a generic
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  833) dispatch queue and specific I/O schedulers.  Unless stated otherwise, elevator
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  834) is used to refer to both parts and I/O scheduler to specific I/O schedulers.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  835) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  836) The block layer implements the generic dispatch queue in `block/*.c`.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  837) The generic dispatch queue is responsible for requeueing, handling non-fs
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  838) requests and all other subtleties.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  839) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  840) Specific I/O schedulers are responsible for ordering normal filesystem
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  841) requests.  They can also choose to delay certain requests to improve
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  842) throughput or for other purposes.  As the plural form indicates, there are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  843) multiple I/O schedulers.  They can be built as modules but at least one should
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  844) be built inside the kernel.  Each queue can choose a different one and can also
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  845) change to another one dynamically.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  846) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  847) A block layer call to the i/o scheduler follows the convention elv_xxx(). This
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  848) calls elevator_xxx_fn in the elevator switch (block/elevator.c). Oh, xxx
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  849) and xxx might not match exactly, but use your imagination. If an elevator
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  850) doesn't implement a function, the switch does nothing or only some minimal
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  851) housekeeping work.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  852) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  853) 4.1. I/O scheduler API
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  854) ----------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  855) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  856) The functions an elevator may implement are: (* are mandatory)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  857) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  858) =============================== ================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  859) elevator_merge_fn		called to query requests for merge with a bio
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  860) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  861) elevator_merge_req_fn		called when two requests get merged. The one
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  862) 				which gets merged into the other one will never
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  863) 				be seen by the I/O scheduler again. IOW, after
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  864) 				being merged, the request is gone.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  865) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  866) elevator_merged_fn		called when a request in the scheduler has been
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  867) 				involved in a merge. It is used in the deadline
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  868) 				scheduler for example, to reposition the request
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  869) 				if its sorting order has changed.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  870) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  871) elevator_allow_merge_fn		called whenever the block layer determines
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  872) 				that a bio can be merged into an existing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  873) 				request safely. The io scheduler may still
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  874) 				want to stop a merge at this point if it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  875) 				results in some sort of conflict internally,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  876) 				this hook allows it to do that. Note however
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  877) 				that two *requests* can still be merged at later
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  878) 				time. Currently the io scheduler has no way to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  879) 				prevent that. It can only learn about the fact
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  880) 				from elevator_merge_req_fn callback.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  881) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  882) elevator_dispatch_fn*		fills the dispatch queue with ready requests.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  883) 				I/O schedulers are free to postpone requests by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  884) 				not filling the dispatch queue unless @force
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  885) 				is non-zero.  Once dispatched, I/O schedulers
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  886) 				are not allowed to manipulate the requests -
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  887) 				they belong to generic dispatch queue.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  888) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  889) elevator_add_req_fn*		called to add a new request into the scheduler
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  890) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  891) elevator_former_req_fn
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  892) elevator_latter_req_fn		These return the request before or after the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  893) 				one specified in disk sort order. Used by the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  894) 				block layer to find merge possibilities.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  895) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  896) elevator_completed_req_fn	called when a request is completed.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  897) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  898) elevator_set_req_fn
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  899) elevator_put_req_fn		Must be used to allocate and free any elevator
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  900) 				specific storage for a request.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  901) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  902) elevator_activate_req_fn	Called when device driver first sees a request.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  903) 				I/O schedulers can use this callback to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  904) 				determine when actual execution of a request
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  905) 				starts.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  906) elevator_deactivate_req_fn	Called when device driver decides to delay
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  907) 				a request by requeueing it.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  908) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  909) elevator_init_fn*
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  910) elevator_exit_fn		Allocate and free any elevator specific storage
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  911) 				for a queue.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  912) =============================== ================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  913) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  914) 4.2 Request flows seen by I/O schedulers
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  915) ----------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  916) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  917) All requests seen by I/O schedulers strictly follow one of the following three
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  918) flows.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  919) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  920)  set_req_fn ->
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  921) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  922)  i.   add_req_fn -> (merged_fn ->)* -> dispatch_fn -> activate_req_fn ->
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  923)       (deactivate_req_fn -> activate_req_fn ->)* -> completed_req_fn
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  924)  ii.  add_req_fn -> (merged_fn ->)* -> merge_req_fn
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  925)  iii. [none]
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  926) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  927)  -> put_req_fn
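
The three flows above amount to a small state machine over the callbacks a
request may see. The sketch below encodes them as a userspace checker; the
event tags mirror the callback names from section 4.1, while the state names,
flow_step() and flow_is_valid() are hypothetical and exist only for this
illustration (the kernel keeps no such enum).

```c
#include <assert.h>
#include <stddef.h>

/* Callback names from the table in 4.1, reduced to event tags. */
enum elv_event {
	EV_SET_REQ, EV_ADD_REQ, EV_MERGED, EV_MERGE_REQ, EV_DISPATCH,
	EV_ACTIVATE, EV_DEACTIVATE, EV_COMPLETED, EV_PUT_REQ
};

/* Hypothetical per-request states used only by this checker. */
enum rq_state {
	ST_NEW, ST_ALLOCATED, ST_QUEUED, ST_DISPATCHED, ST_ACTIVE,
	ST_REQUEUED, ST_FINISHED, ST_FREED, ST_INVALID
};

static enum rq_state flow_step(enum rq_state s, enum elv_event e)
{
	switch (s) {
	case ST_NEW:
		return e == EV_SET_REQ ? ST_ALLOCATED : ST_INVALID;
	case ST_ALLOCATED:	/* flow iii goes straight to put_req_fn */
		if (e == EV_ADD_REQ)
			return ST_QUEUED;
		return e == EV_PUT_REQ ? ST_FREED : ST_INVALID;
	case ST_QUEUED:
		if (e == EV_MERGED)
			return ST_QUEUED;
		if (e == EV_DISPATCH)		/* flow i */
			return ST_DISPATCHED;
		return e == EV_MERGE_REQ ? ST_FINISHED : ST_INVALID; /* flow ii */
	case ST_DISPATCHED:
	case ST_REQUEUED:
		return e == EV_ACTIVATE ? ST_ACTIVE : ST_INVALID;
	case ST_ACTIVE:
		if (e == EV_DEACTIVATE)
			return ST_REQUEUED;
		return e == EV_COMPLETED ? ST_FINISHED : ST_INVALID;
	case ST_FINISHED:
		return e == EV_PUT_REQ ? ST_FREED : ST_INVALID;
	default:		/* nothing may follow put_req_fn */
		return ST_INVALID;
	}
}

/* Returns 1 if the event sequence is one complete, legal flow. */
static int flow_is_valid(const enum elv_event *ev, size_t n)
{
	enum rq_state s = ST_NEW;

	assert(ev != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		s = flow_step(s, ev[i]);
	return s == ST_FREED;
}
```

A sequence that reaches completed_req_fn without passing through dispatch_fn
and activate_req_fn is rejected, which is the sense in which the flows are
strict.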
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  928) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  929) 4.3 I/O scheduler implementation
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  930) --------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  931) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  932) The generic i/o scheduler algorithm attempts to sort/merge/batch requests for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  933) optimal disk scan and request servicing performance (based on generic
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  934) principles and device capabilities), optimized for:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  935) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  936) i.   improved throughput
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  937) ii.  improved latency
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  938) iii. better utilization of h/w & CPU time
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  939) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  940) Characteristics:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  941) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  942) i. Binary tree
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  943) AS and deadline i/o schedulers use red black binary trees for disk position
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  944) sorting and searching, and a fifo linked list for time-based searching. This
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  945) gives good scalability and good availability of information. Requests are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  946) almost always dispatched in disk sort order, so a cache is kept of the next
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  947) request in sort order to prevent binary tree lookups.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  948) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  949) This arrangement is not a generic block layer characteristic however, so
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  950) elevators may implement queues as they please.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  951) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  952) ii. Merge hash
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  953) AS and deadline use a hash table indexed by the last sector of a request. This
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  954) enables merging code to quickly look up "back merge" candidates, even when
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  955) multiple I/O streams are being performed at once on one disk.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  956) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  957) "Front merges", a new request being merged at the front of an existing request,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  958) are far less common than "back merges" due to the nature of most I/O patterns.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  959) Front merges are handled by the binary trees in AS and deadline schedulers.
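
The back-merge lookup can be sketched in userspace as a chained hash table
keyed by where each request ends. Everything here is illustrative: hashed_rq,
merge_hash and the helper names are hypothetical, the bucket count is
arbitrary, and the key is assumed to be the first sector past the end of the
request, so that a new bio starting at that sector is a back-merge candidate.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HASH_BUCKETS 64		/* arbitrary size for the sketch */

/* Hypothetical request: a run of nr sectors beginning at start. */
struct hashed_rq {
	uint64_t start;
	uint64_t nr;
	struct hashed_rq *hash_next;
};

struct merge_hash {
	struct hashed_rq *bucket[HASH_BUCKETS];
};

/* Hash key: the first sector after the request. */
static uint64_t rq_end(const struct hashed_rq *rq)
{
	return rq->start + rq->nr;
}

static void hash_add(struct merge_hash *h, struct hashed_rq *rq)
{
	size_t i = rq_end(rq) % HASH_BUCKETS;

	assert(rq->nr > 0);
	rq->hash_next = h->bucket[i];
	h->bucket[i] = rq;
}

static void hash_del(struct merge_hash *h, struct hashed_rq *rq)
{
	struct hashed_rq **pp = &h->bucket[rq_end(rq) % HASH_BUCKETS];

	while (*pp && *pp != rq)
		pp = &(*pp)->hash_next;
	if (*pp)
		*pp = rq->hash_next;
}

/* Find a request that ends exactly where the new bio starts. */
static struct hashed_rq *find_back_merge(struct merge_hash *h,
					 uint64_t bio_start)
{
	struct hashed_rq *rq = h->bucket[bio_start % HASH_BUCKETS];

	for (; rq; rq = rq->hash_next)
		if (rq_end(rq) == bio_start)
			return rq;
	return NULL;
}

/* Append the bio to a matching request; returns 1 if merged, else 0. */
static int try_back_merge(struct merge_hash *h, uint64_t bio_start,
			  uint64_t bio_nr)
{
	struct hashed_rq *rq = find_back_merge(h, bio_start);

	if (!rq)
		return 0;
	hash_del(h, rq);	/* the end sector changes, so rehash */
	rq->nr += bio_nr;
	hash_add(h, rq);
	return 1;
}
```

Because the key changes whenever a request grows at the back, the request has
to be removed and re-added after each merge. Front merges need a lookup by
start sector instead, which is why they are left to the sorted tree.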
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  960) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  961) iii. Plugging the queue to batch requests in anticipation of opportunities for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  962)      merge/sort optimizations
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  963) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  964) Plugging is an approach that the current i/o scheduling algorithm resorts to so
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  965) that it collects up enough requests in the queue to be able to take
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  966) advantage of the sorting/merging logic in the elevator. If the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  967) queue is empty when a request comes in, then it plugs the request queue
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  968) (sort of like plugging the bath tub of a vessel to get fluid to build up)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  969) till it fills up with a few more requests, before starting to service
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  970) the requests. This provides an opportunity to merge/sort the requests before
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  971) passing them down to the device. There are various conditions when the queue is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  972) unplugged (to open up the flow again), either through a scheduled task or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  973) on demand. For example, wait_on_buffer sets the unplugging going
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  974) through sync_buffer() running blk_run_address_space(mapping). Or the caller
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  975) can do it explicitly through blk_unplug(bdev). So in the read case,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  976) the queue gets explicitly unplugged as part of waiting for completion on that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  977) buffer.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  978) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  979) Aside:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  980)   This is kind of controversial territory, as it's not clear if plugging is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  981)   always the right thing to do. Devices typically have their own queues,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  982)   and allowing a big queue to build up in software, while letting the device be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  983)   idle for a while may not always make sense. The trick is to handle the fine
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  984)   balance between when to plug and when to open up. Also now that we have
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  985)   multi-page bios being queued in one shot, we may not need to wait to merge
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  986)   a big request from the broken up pieces coming by.
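
The plug/unplug cycle described in this section can be modelled in a few lines
of userspace C. This is a deliberately simplified sketch: plug_queue,
plug_submit() and plug_unplug() are hypothetical names, the unplug threshold
is an assumed tunable, the timer-driven unplug is left out, and unplugging is
modelled as handing every pending request to the device at once.

```c
#include <assert.h>

/* Hypothetical queue model: counters only, no real requests. */
struct plug_queue {
	int plugged;			/* 1 while requests are held back */
	unsigned int pending;		/* queued, not yet sent to the device */
	unsigned int dispatched;	/* handed to the device so far */
	unsigned int unplug_thresh;	/* assumed tunable, must be > 0 */
};

/* Open the flow again: everything pending goes to the device. */
static void plug_unplug(struct plug_queue *q)
{
	q->plugged = 0;
	q->dispatched += q->pending;
	q->pending = 0;
}

/* Queue one request, plugging an empty queue so that more can gather. */
static void plug_submit(struct plug_queue *q)
{
	assert(q->unplug_thresh > 0);

	if (q->pending == 0)
		q->plugged = 1;
	q->pending++;

	/* Enough requests have built up to make sorting/merging worthwhile. */
	if (q->pending >= q->unplug_thresh)
		plug_unplug(q);
}
```

A caller that needs its data immediately, such as a reader waiting on a
buffer, corresponds to calling plug_unplug() directly rather than waiting for
the threshold to be reached.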
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  987) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  988) 4.4 I/O contexts
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  989) ----------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  990) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  991) I/O contexts provide a dynamically allocated per process data area. They may
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  992) be used in I/O schedulers, and in the block layer (could be used for IO stats or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  993) priorities, for example). See `*io_context` in block/ll_rw_blk.c, and as-iosched.c
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  994) for an example of usage in an i/o scheduler.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  995) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  996) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  997) 5. Scalability related changes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  998) ==============================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  999) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1000) 5.1 Granular Locking: io_request_lock replaced by a per-queue lock
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1001) ------------------------------------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1002) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1003) The global io_request_lock has been removed as of 2.5, to avoid
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1004) the scalability bottleneck it was causing, and has been replaced by more
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1005) granular locking. The request queue structure has a pointer to the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1006) lock to be used for that queue. As a result, locking can now be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1007) per-queue, with a provision for sharing a lock across queues if
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1008) necessary (e.g. the scsi layer sets the queue lock pointers to the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1009) corresponding adapter lock, which results in a per host locking
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1010) granularity). The locking semantics are the same, i.e. locking is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1011) still imposed by the block layer, grabbing the lock before
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1012) request_fn execution, which means that lots of older drivers
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1013) should still be SMP safe. Drivers are free to drop the queue
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1014) lock themselves, if required. Drivers that explicitly used the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1015) io_request_lock for serialization need to be modified accordingly.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1016) Usually it's as easy as adding a global lock::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1017) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1018) 	static DEFINE_SPINLOCK(my_driver_lock);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1019) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1020) and passing the address to that lock to blk_init_queue().
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1021) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1022) 5.2 64 bit sector numbers (sector_t prepares for 64 bit support)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1023) ----------------------------------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1024) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1025) The sector number used in the bio structure has been changed to sector_t,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1026) which could be defined as 64 bit in preparation for 64 bit sector support.
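
The arithmetic behind the move to a wider type is short enough to spell out.
The sketch assumes the conventional 512-byte sector unit; max_device_bytes()
is a hypothetical helper written for this note, not a kernel function.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_BYTES 512ULL	/* conventional block layer sector unit */

/*
 * Largest device size, in bytes, addressable with sector numbers that
 * are sector_bits wide. Limited to 54 bits so that the result still
 * fits in a uint64_t.
 */
static uint64_t max_device_bytes(unsigned int sector_bits)
{
	assert(sector_bits >= 1 && sector_bits <= 54);
	return (1ULL << sector_bits) * SECTOR_BYTES;
}
```

With 32-bit sector numbers the result is 2^41 bytes, i.e. 2 TiB, which is the
ceiling a 64-bit sector_t removes.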
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1027) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1028) 6. Other Changes/Implications
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1029) =============================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1030) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1031) 6.1 Partition re-mapping handled by the generic block layer
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1032) -----------------------------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1033) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1034) In 2.5 some of the gendisk/partition related code has been reorganized.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1035) Now the generic block layer performs partition-remapping early and thus
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1036) provides drivers with a sector number relative to whole device, rather than
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1037) having to take partition number into account in order to arrive at the true
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1038) sector number. The routine blk_partition_remap() is invoked by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1039) submit_bio_noacct even before invoking the queue specific ->submit_bio,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1040) so the i/o scheduler also gets to operate on whole disk sector numbers. This
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1041) should typically not require changes to block drivers; a driver simply never
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1042) gets to invoke its own partition sector offset calculations, since all bios
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1043) sent are offset from the beginning of the device.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1044) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1045) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1046) 7. A Few Tips on Migration of older drivers
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1047) ===========================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1048) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1049) Old-style drivers that just use CURRENT and ignore clustered requests
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1050) may not need much change.  The generic layer will automatically handle
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1051) clustered requests, multi-page bios, etc. for the driver.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1052) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1053) For a low performance driver or hardware that is PIO driven or just doesn't
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1054) support scatter-gather, changes should be minimal too.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1055) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1056) The following are some points to keep in mind when converting old drivers
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1057) to bio.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1058) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1059) Drivers should use elv_next_request to pick up requests and are no longer
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1060) supposed to handle looping directly over the request list.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1061) (struct request->queue has been removed)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1062) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1063) Now end_that_request_first takes an additional number_of_sectors argument.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1064) It used to always handle just the first buffer_head in a request; now
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1065) it will loop and handle as many sectors (on a bio-segment granularity)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1066) as specified.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1067) 
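The looping behaviour can be sketched in plain userspace C as follows. This is
only a model, not the kernel routine: the demo_* structures and the function
are hypothetical, a request is reduced to an array of segments with a pending
sector count each, and trimming a partially completed segment is an assumption
made for the sketch.

```c
#include <stddef.h>

/* Hypothetical model of one bio segment: only its pending sector count. */
struct demo_segment {
	unsigned int nr_sectors;	/* sectors still pending in this segment */
};

/* Hypothetical model of a request as a flat array of segments. */
struct demo_request {
	struct demo_segment *segs;
	size_t nr_segs;
	size_t cur;			/* first segment that is not yet complete */
};

/*
 * Complete up to nr_sectors sectors, walking segment by segment.
 * Returns 1 while segments are still pending, 0 once the request is done
 * (a convention chosen for this sketch).
 */
static int demo_end_request_first(struct demo_request *rq,
				  unsigned int nr_sectors)
{
	while (nr_sectors > 0 && rq->cur < rq->nr_segs) {
		struct demo_segment *seg = &rq->segs[rq->cur];

		if (nr_sectors < seg->nr_sectors) {
			/* Partial completion: trim the current segment. */
			seg->nr_sectors -= nr_sectors;
			nr_sectors = 0;
		} else {
			/* Whole segment done: advance to the next one. */
			nr_sectors -= seg->nr_sectors;
			seg->nr_sectors = 0;
			rq->cur++;
		}
	}
	return rq->cur < rq->nr_segs;
}
```

A caller in this model keeps invoking the function with the number of sectors
just transferred until it returns 0, and only then finishes the request.
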
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1068) Now bh->b_end_io is replaced by bio->bi_end_io, but most of the time the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1069) right thing to use is bio_endio(bio) instead.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1070) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1071) If the driver is dropping the io_request_lock from its request_fn strategy,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1072) then it just needs to replace that with q->queue_lock instead.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1073) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1074) As described in Sec 1.1, drivers can set max sector size, max segment size
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1075) etc. per queue now. Drivers that used to define their own merge functions
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1076) to handle things like this can now just use the blk_queue_* functions at
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1077) blk_init_queue time.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1078) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1079) Drivers no longer have to map a {partition, sector offset} into the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1080) correct absolute location; this is done by the block layer. So
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1081) where a driver previously received a request like this::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1082) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1083) 	rq->rq_dev = mk_kdev(3, 5);	/* /dev/hda5 */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1084) 	rq->sector = 0;			/* first sector on hda5 */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1085) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1086) it will now see::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1087) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1088) 	rq->rq_dev = mk_kdev(3, 0);	/* /dev/hda */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1089) 	rq->sector = 123128;		/* offset from start of disk */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1090) 
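The translation between the two forms above can be sketched in userspace C as
below. This is not the kernel's blk_partition_remap(); the demo_* types, the
partition table layout and the bounds check are assumptions made purely for
illustration.

```c
#include <stdint.h>

typedef uint64_t demo_sector_t;	/* hypothetical stand-in for sector_t */

/* Hypothetical partition table entry. */
struct demo_partition {
	demo_sector_t start_sect;	/* first sector of the partition on disk */
	demo_sector_t nr_sects;		/* partition length in sectors */
};

/* Hypothetical, stripped-down I/O descriptor. */
struct demo_bio {
	int partno;			/* 0 means the whole device */
	demo_sector_t sector;		/* partition relative until remapped */
};

/*
 * Turn a {partition, sector offset} pair into a whole-device sector.
 * Returns 0 on success, -1 if the partition or the offset is out of range.
 */
static int demo_partition_remap(struct demo_bio *bio,
				const struct demo_partition *parts,
				int nr_parts)
{
	const struct demo_partition *p;

	if (bio->partno == 0)
		return 0;		/* already relative to the whole device */
	if (bio->partno < 0 || bio->partno > nr_parts)
		return -1;

	p = &parts[bio->partno - 1];
	if (bio->sector >= p->nr_sects)
		return -1;

	bio->sector += p->start_sect;
	bio->partno = 0;
	return 0;
}
```

With a table whose fifth entry starts at sector 123128, a descriptor for
partition 5, sector 0 comes out as whole-device sector 123128, which matches
the /dev/hda5 example above.
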
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1091) As mentioned, there is no virtual mapping of a bio. For DMA, this is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1092) not a problem as the driver probably never will need a virtual mapping.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1093) Instead it needs a bus mapping (dma_map_page for a single segment, or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1094) dma_map_sg for scatter-gather) to be able to ship it to the device. For
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1095) PIO drivers (or drivers that need to revert to PIO transfer once in a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1096) while, IDE for example), where the CPU is doing the actual data
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1097) transfer, a virtual mapping is needed. If the driver supports highmem I/O
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1098) (Sec 1.1, (ii)), it needs to use kmap_atomic or similar to temporarily map
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1099) a bio into the virtual address space.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1100) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1101) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1102) 8. Prior/Related/Impacted patches
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1103) =================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1104) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1105) 8.1. Earlier kiobuf patches (sct/axboe/chait/hch/mkp)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1106) -----------------------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1107) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1108) - orig kiobuf & raw i/o patches (now in 2.4 tree)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1109) - direct kiobuf based i/o to devices (no intermediate bh's)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1110) - page i/o using kiobuf
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1111) - kiobuf splitting for lvm (mkp)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1112) - elevator support for kiobuf request merging (axboe)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1113) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1114) 8.2. Zero-copy networking (Dave Miller)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1115) ---------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1116) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1117) 8.3. SGI XFS - pagebuf patches - use of kiobufs
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1118) -----------------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1119) 8.4. Multi-page pioent patch for bio (Christoph Hellwig)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1120) --------------------------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1121) 8.5. Direct i/o implementation (Andrea Arcangeli) since 2.4.10-pre11
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1122) --------------------------------------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1123) 8.6. Async i/o implementation patch (Ben LaHaise)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1124) -------------------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1125) 8.7. EVMS layering design (IBM EVMS team)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1126) -----------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1127) 8.8. Larger page cache size patch (Ben LaHaise) and Large page size (Daniel Phillips)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1128) -------------------------------------------------------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1129) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1130)     => larger contiguous physical memory buffers
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1131) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1132) 8.9. VM reservations patch (Ben LaHaise)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1133) ----------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1134) 8.10. Write clustering patches ? (Marcelo/Quintela/Riel ?)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1135) ----------------------------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1136) 8.11. Block device in page cache patch (Andrea Arcangeli) - now in 2.4.10+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1137) ---------------------------------------------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1138) 8.12. Multiple block-size transfers for faster raw i/o (Shailabh Nagar, Badari)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1139) -------------------------------------------------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1140) 8.13. Priority based i/o scheduler - prepatches (Arjan van de Ven)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1141) ------------------------------------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1142) 8.14. IDE Taskfile i/o patch (Andre Hedrick)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1143) --------------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1144) 8.15. Multi-page writeout and readahead patches (Andrew Morton)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1145) ---------------------------------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1146) 8.16. Direct i/o patches for 2.5 using kvec and bio (Badari Pulavarty)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1147) ----------------------------------------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1148) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1149) 9. Other References
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1150) ===================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1151) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1152) 9.1 The Splice I/O Model
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1153) ------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1154) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1155) Larry McVoy (and subsequent discussions on lkml, and Linus' comments - Jan 2001)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1156) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1157) 9.2 Discussions about kiobuf and bh design
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1158) ------------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1159) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1160) On lkml between sct, linus, alan et al - Feb-March 2001 (many of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1161) initial thoughts that led to bio were brought up in this discussion thread)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1162) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1163) 9.3 Discussions on mempool on lkml - Dec 2001.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1164) ----------------------------------------------