.. SPDX-License-Identifier: GPL-2.0

============================
PCI Peer-to-Peer DMA Support
============================

The PCI bus has pretty decent support for performing DMA transfers
between two devices on the bus. This type of transaction is henceforth
called Peer-to-Peer (or P2P). However, there are a number of issues that
make P2P transactions tricky to do in a perfectly safe way.

One of the biggest issues is that PCI doesn't require forwarding
transactions between hierarchy domains, and in PCIe, each Root Port
defines a separate hierarchy domain. To make things worse, there is no
simple way to determine whether a given Root Complex supports such
forwarding (see PCIe r4.0, sec 1.3.1). Therefore, as of this writing, the
kernel only supports doing P2P when the endpoints involved are all behind
the same PCI bridge. Such devices are all in the same PCI hierarchy
domain, and the spec guarantees that all transactions within a hierarchy
are routable, but it does not require routing between hierarchies.

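As a rough illustration of this restriction, the following userspace sketch
models a hierarchy as nodes that each point at their upstream bridge, and
reports whether two endpoints share a bridge below the Root Complex. The
``toy_node`` type and both ``toy_*`` functions are hypothetical names invented
for this example. They are not kernel interfaces, and the hop counts are not
the values the kernel computes.

```c
#include <stddef.h>

/*
 * Toy model (assumption): each node points at its upstream bridge. The node
 * whose upstream pointer is NULL stands for the Root Complex.
 */
struct toy_node {
    const char *name;
    const struct toy_node *upstream;
};

/* Return 1 if 'bridge' is 'n' itself or lies on the upstream path of 'n'. */
int toy_is_on_upstream_path(const struct toy_node *bridge,
                            const struct toy_node *n)
{
    for (; n; n = n->upstream)
        if (n == bridge)
            return 1;
    return 0;
}

/*
 * Return the number of hops between 'a' and 'b' through their closest shared
 * bridge, or -1 if the only node they share is the Root Complex.
 */
int toy_p2p_distance(const struct toy_node *a, const struct toy_node *b)
{
    const struct toy_node *bridge, *n;
    int up_a = 0, up_b = 0;

    for (bridge = a->upstream; bridge; bridge = bridge->upstream) {
        up_a++;
        if (!bridge->upstream)
            return -1; /* reached the Root Complex: no shared bridge */
        if (toy_is_on_upstream_path(bridge, b)) {
            for (n = b; n != bridge; n = n->upstream)
                up_b++;
            return up_a + up_b;
        }
    }
    return -1;
}
```

In this model, two endpoints below the same switch are one hop each from the
shared bridge, so the function returns 2. Two endpoints below different Root
Ports meet only at the Root Complex, so it returns -1.
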
The second issue is that to make use of existing interfaces in Linux,
memory that is used for P2P transactions needs to be backed by struct
pages. However, PCI BARs are not typically cache coherent, so these pages
come with a few corner-case gotchas and developers need to be careful
about what they do with them.


Driver Writer's Guide
=====================

In a given P2P implementation there may be three or more different
types of kernel drivers in play:

* Provider - A driver which provides or publishes P2P resources like
  memory or doorbell registers to other drivers.
* Client - A driver which makes use of a resource by setting up a
  DMA transaction to or from it.
* Orchestrator - A driver which orchestrates the flow of data between
  clients and providers.

In many cases there could be overlap between these three types (e.g.,
it may be typical for a driver to be both a provider and a client).

For example, in the NVMe Target Copy Offload implementation:

* The NVMe PCI driver is a client, a provider and an orchestrator all at
  once: it exposes any CMB (Controller Memory Buffer) as a P2P memory
  resource (provider), it accepts P2P memory pages as buffers in requests
  to be used directly (client) and it can also make use of the CMB to
  hold submission queue entries (orchestrator).
* The RDMA driver is a client in this arrangement so that an RNIC
  can DMA directly to the memory exposed by the NVMe device.
* The NVMe Target driver (nvmet) can orchestrate the data from the RNIC
  to the P2P memory (CMB) and then to the NVMe device (and vice versa).

This is currently the only arrangement supported by the kernel, but
one could imagine slight tweaks to it that would allow for the same
functionality. For example, if a specific RNIC added a BAR with some
memory behind it, its driver could add support as a P2P provider and
then the NVMe Target could use the RNIC's memory instead of the CMB
in cases where the NVMe cards in use do not have CMB support.


Provider Drivers
----------------

A provider simply needs to register a BAR (or a portion of a BAR)
as a P2P DMA resource using :c:func:`pci_p2pdma_add_resource()`.
This will register struct pages for all the specified memory.

After that it may optionally publish all of its resources as
P2P memory using :c:func:`pci_p2pmem_publish()`. This will allow
any orchestrator drivers to find and use the memory. When marked in
this way, the resource must be regular memory with no side effects.

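The two provider steps can be sketched as follows. This is a minimal
illustration, not code from an in-tree driver: ``example_provider_setup()`` is
a hypothetical helper, and the ``struct pci_dev`` definition and the two stub
functions are userspace stand-ins added only so the block builds on its own.
The stubs copy the prototypes from ``<linux/pci-p2pdma.h>`` as an assumption
about the kernel version in use; their bodies do not reflect kernel behaviour.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Userspace stand-ins (assumptions) so the sketch builds outside the kernel.
 * A real driver gets these from <linux/pci.h> and <linux/pci-p2pdma.h>; the
 * stub bodies only record calls.
 */
typedef uint64_t u64;

struct pci_dev {
    int resources_added;
    bool published;
};

int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
                            u64 offset)
{
    (void)size;
    (void)offset;
    if (bar < 0 || bar > 5)
        return -EINVAL; /* stand-in failure: not a standard BAR index */
    pdev->resources_added++;
    return 0;
}

void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
{
    pdev->published = publish;
}

/* Hypothetical provider setup helper: the part a driver would write. */
int example_provider_setup(struct pci_dev *pdev, int bar, size_t size,
                           u64 offset)
{
    int rc;

    /* Register the BAR (or a portion of it) as a P2P DMA resource. */
    rc = pci_p2pdma_add_resource(pdev, bar, size, offset);
    if (rc)
        return rc;

    /* Optional: publish it so that orchestrators can find the memory. */
    pci_p2pmem_publish(pdev, true);
    return 0;
}
```

Publishing comes after a successful registration so that orchestrators never
discover a provider that has no memory registered.
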
For the time being this is fairly rudimentary in that all resources
are typically going to be P2P memory. Future work will likely expand
this to include other types of resources like doorbells.


Client Drivers
--------------

A client driver typically only has to conditionally change its DMA map
routine to use the mapping function :c:func:`pci_p2pdma_map_sg()` instead
of the usual :c:func:`dma_map_sg()` function. Memory mapped in this
way does not need to be unmapped.

The client may also optionally make use of
:c:func:`is_pci_p2pdma_page()` to determine when to use the P2P mapping
functions and when to use the regular mapping functions. In some
situations, it may be more appropriate to use a flag to indicate that a
given request is P2P memory and map appropriately. It is important to
ensure that struct pages that back P2P memory stay out of code that
does not have support for them, as other code may treat the pages as
regular memory, which may not be appropriate.


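A client's map and unmap paths might then look like the sketch below. It
assumes that every entry in a scatter-gather list is of the same kind, so
checking the first page is enough. ``example_client_map()`` and
``example_client_unmap()`` are hypothetical helpers. The structure
definitions and the stubbed ``is_pci_p2pdma_page()``, ``sg_page()``,
``pci_p2pdma_map_sg()``, ``dma_map_sg()`` and ``dma_unmap_sg()`` are
userspace stand-ins that only count calls, so that the block builds outside
the kernel.

```c
#include <stdbool.h>

/* Userspace stand-ins (assumptions); a real driver uses the kernel headers. */
struct page { bool is_p2p; };
struct scatterlist { struct page *page; };
struct device { int p2p_maps; int regular_maps; int unmaps; };
enum dma_data_direction { DMA_BIDIRECTIONAL, DMA_TO_DEVICE, DMA_FROM_DEVICE };

bool is_pci_p2pdma_page(const struct page *page) { return page->is_p2p; }
struct page *sg_page(struct scatterlist *sg) { return sg->page; }

int pci_p2pdma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
                      enum dma_data_direction dir)
{
    (void)sg;
    (void)dir;
    dev->p2p_maps++;
    return nents;
}

int dma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
               enum dma_data_direction dir)
{
    (void)sg;
    (void)dir;
    dev->regular_maps++;
    return nents;
}

void dma_unmap_sg(struct device *dev, struct scatterlist *sg, int nents,
                  enum dma_data_direction dir)
{
    (void)sg;
    (void)nents;
    (void)dir;
    dev->unmaps++;
}

/* Hypothetical client helper: pick the mapping function per request. */
int example_client_map(struct device *dev, struct scatterlist *sgl, int nents,
                       enum dma_data_direction dir)
{
    if (is_pci_p2pdma_page(sg_page(sgl)))
        return pci_p2pdma_map_sg(dev, sgl, nents, dir);
    return dma_map_sg(dev, sgl, nents, dir);
}

/* P2P mappings need no unmap; regular mappings are unmapped as usual. */
void example_client_unmap(struct device *dev, struct scatterlist *sgl,
                          int nents, enum dma_data_direction dir)
{
    if (!is_pci_p2pdma_page(sg_page(sgl)))
        dma_unmap_sg(dev, sgl, nents, dir);
}
```

Keeping the check in one pair of helpers makes it harder for a P2P page to
reach the regular mapping path by accident.
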
Orchestrator Drivers
--------------------

The first task an orchestrator driver must perform is to compile a list of
all client devices that will be involved in a given transaction. For
example, the NVMe Target driver creates a list including the namespace
block device and the RNIC in use. If the orchestrator has access to
a specific P2P provider to use, it may check compatibility using
:c:func:`pci_p2pdma_distance()`; otherwise, it may find a memory provider
that's compatible with all clients using :c:func:`pci_p2pmem_find()`.
If more than one provider is compatible, the one nearest to all the
clients will be chosen first. If more than one provider is an equal
distance away, the one returned will be chosen at random (the choice is
truly random, not merely arbitrary). This function returns the PCI device
to use for the provider with a reference taken; therefore, when the
device is no longer needed, the reference should be released with
:c:func:`pci_dev_put()`.

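The selection rule described above (nearest provider wins, ties are broken at
random) can be modelled in userspace as below. This is only a sketch of the
rule and is not how :c:func:`pci_p2pmem_find()` is implemented.
``toy_pick_provider()`` is a hypothetical name, the distances are assumed to
have been computed already, and ``rand()`` stands in for the kernel's random
number source.

```c
#include <stdlib.h>

/*
 * distances[i] is the combined distance from provider i to all clients, or a
 * negative value if provider i cannot reach every client (assumption made
 * for this model). Returns the index of the chosen provider, or -1 if no
 * provider is usable.
 */
int toy_pick_provider(const int *distances, int n)
{
    int best = -1;   /* smallest usable distance seen so far */
    int ties = 0;    /* number of providers sharing that distance */
    int chosen = -1;
    int i;

    for (i = 0; i < n; i++) {
        if (distances[i] < 0)
            continue; /* incompatible with at least one client */

        if (best < 0 || distances[i] < best) {
            /* Strictly nearer: this provider replaces any earlier choice. */
            best = distances[i];
            ties = 1;
            chosen = i;
        } else if (distances[i] == best) {
            /*
             * Equal distance: keep each tied provider with probability
             * 1/ties, so that every tied provider is about equally likely.
             */
            ties++;
            if (rand() % ties == 0)
                chosen = i;
        }
    }
    return chosen;
}
```

A single pass with a running tie count avoids building a separate list of
candidates. Collecting the nearest providers first and indexing that list at
random would express the same rule.
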
Once a provider is selected, the orchestrator can then use
:c:func:`pci_alloc_p2pmem()` and :c:func:`pci_free_p2pmem()` to
allocate P2P memory from the provider. :c:func:`pci_p2pmem_alloc_sgl()`
and :c:func:`pci_p2pmem_free_sgl()` are convenience functions for
allocating scatter-gather lists with P2P memory.

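The allocate, use and free pairing might be sketched as follows.
``example_stage_through_p2pmem()`` is a hypothetical helper. The
``struct pci_dev`` definition and the ``malloc()``-backed stubs for
``pci_alloc_p2pmem()`` and ``pci_free_p2pmem()`` are userspace stand-ins, and
the two ``memcpy()`` calls merely mark where one client would DMA into the
buffer and another would DMA out of it.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Userspace stand-ins (assumptions); a real driver uses the kernel headers. */
struct pci_dev { size_t bytes_in_use; };

void *pci_alloc_p2pmem(struct pci_dev *pdev, size_t size)
{
    void *addr = malloc(size);

    if (addr)
        pdev->bytes_in_use += size;
    return addr;
}

void pci_free_p2pmem(struct pci_dev *pdev, void *addr, size_t size)
{
    free(addr);
    pdev->bytes_in_use -= size;
}

/* Hypothetical orchestrator helper: stage 'len' bytes through P2P memory. */
int example_stage_through_p2pmem(struct pci_dev *provider, const void *src,
                                 void *dst, size_t len)
{
    void *buf = pci_alloc_p2pmem(provider, len);

    if (!buf)
        return -ENOMEM;

    memcpy(buf, src, len); /* stands in for: client A DMAs into buf */
    memcpy(dst, buf, len); /* stands in for: client B DMAs out of buf */

    /* Free with the same size that was allocated. */
    pci_free_p2pmem(provider, buf, len);
    return 0;
}
```

The free call takes the provider, the address and the size, so the caller has
to keep all three until the transfer is complete.
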
Struct Page Caveats
-------------------

Driver writers should be very careful about not passing these special
struct pages to code that isn't prepared for them. At this time, the kernel
interfaces do not have any checks for ensuring this. This obviously
precludes passing these pages to userspace.

P2P memory is also technically IO memory but should never have any side
effects behind it. Thus, the order of loads and stores should not be important
and ioreadX(), iowriteX() and friends should not be necessary.


P2P DMA Support Library
=======================

.. kernel-doc:: drivers/pci/p2pdma.c
   :export: