==================================
VFIO - "Virtual Function I/O" [1]_
==================================

Many modern systems now provide DMA and interrupt remapping facilities
to help ensure I/O devices behave within the boundaries they've been
allotted.  This includes x86 hardware with AMD-Vi and Intel VT-d,
POWER systems with Partitionable Endpoints (PEs) and embedded PowerPC
systems such as Freescale PAMU.  The VFIO driver is an IOMMU/device
agnostic framework for exposing direct device access to userspace, in
a secure, IOMMU protected environment.  In other words, this allows
safe [2]_, non-privileged, userspace drivers.

Why do we want that?  Virtual machines often make use of direct device
access ("device assignment") when configured for the highest possible
I/O performance.  From a device and host perspective, this simply
turns the VM into a userspace driver, with the benefits of
significantly reduced latency, higher bandwidth, and direct use of
bare-metal device drivers [3]_.

Some applications, particularly in the high performance computing
field, also benefit from low-overhead, direct device access from
userspace.  Examples include network adapters (often non-TCP/IP based)
and compute accelerators.  Prior to VFIO, these drivers had to either
go through the full development cycle to become proper upstream
drivers, be maintained out of tree, or make use of the UIO framework,
which has no notion of IOMMU protection, limited interrupt support,
and requires root privileges to access things like PCI configuration
space.

The VFIO driver framework intends to unify these, replacing the
KVM PCI-specific device assignment code and providing a more
secure, more featureful userspace driver environment than UIO.

Groups, Devices, and IOMMUs
---------------------------

Devices are the main target of any I/O driver.  Devices typically
create a programming interface made up of I/O access, interrupts,
and DMA.  Without going into the details of each of these, DMA is
by far the most critical aspect for maintaining a secure environment
as allowing a device read-write access to system memory imposes the
greatest risk to the overall system integrity.

To help mitigate this risk, many modern IOMMUs now incorporate
isolation properties into what was, in many cases, an interface only
meant for translation (i.e. solving the addressing problems of devices
with limited address spaces).  With this, devices can now be isolated
from each other and from arbitrary memory access, thus allowing
things like secure direct assignment of devices into virtual machines.

This isolation is not always at the granularity of a single device
though.  Even when an IOMMU is capable of this, properties of devices,
interconnects, and IOMMU topologies can each reduce this isolation.
For instance, an individual device may be part of a larger
multi-function enclosure.  While the IOMMU may be able to distinguish
between devices within the enclosure, the enclosure may not require
transactions between devices to reach the IOMMU.  Examples of this
could be anything from a multi-function PCI device with backdoors
between functions to a non-PCI-ACS (Access Control Services) capable
bridge allowing redirection without reaching the IOMMU.  Topology
can also play a factor in terms of hiding devices.  A PCIe-to-PCI
bridge masks the devices behind it, making transactions appear as
if from the bridge itself.  Obviously IOMMU design plays a major
role as well.

Therefore, while for the most part an IOMMU may have device level
granularity, any system is susceptible to reduced granularity.  The
IOMMU API therefore supports a notion of IOMMU groups.  A group is
a set of devices which is isolatable from all other devices in the
system.  Groups are therefore the unit of ownership used by VFIO.

While the group is the minimum granularity that must be used to
ensure secure user access, it's not necessarily the preferred
granularity.  In IOMMUs which make use of page tables, it may be
possible to share a set of page tables between different groups,
reducing the overhead both to the platform (reduced TLB thrashing,
reduced duplicate page tables), and to the user (programming only
a single set of translations).  For this reason, VFIO makes use of
a container class, which may hold one or more groups.  A container
is created by simply opening the /dev/vfio/vfio character device.

On its own, the container provides little functionality, with all
but a couple of version and extension query interfaces locked away.
The user needs to add a group into the container for the next level
of functionality.  To do this, the user first needs to identify the
group associated with the desired device.  This can be done using
the sysfs links described in the example below.  By unbinding the
device from the host driver and binding it to a VFIO driver, a new
VFIO group will appear for the group as /dev/vfio/$GROUP, where
$GROUP is the IOMMU group number of which the device is a member.
If the IOMMU group contains multiple devices, each will need to
be bound to a VFIO driver before operations on the VFIO group
are allowed (it's also sufficient to only unbind the device from
host drivers if a VFIO driver is unavailable; this will make the
group available, but not that particular device).  TBD - interface
for disabling driver probing/locking a device.

Once the group is ready, it may be added to the container by opening
the VFIO group character device (/dev/vfio/$GROUP) and using the
VFIO_GROUP_SET_CONTAINER ioctl, passing the file descriptor of the
previously opened container file.  If desired, and if the IOMMU driver
supports sharing the IOMMU context between groups, multiple groups may
be set to the same container.  If a group cannot be attached to a
container that already holds other groups, a new empty container
will need to be used instead.

With a group (or groups) attached to a container, the remaining
ioctls become available, enabling access to the VFIO IOMMU interfaces.
Additionally, it now becomes possible to get file descriptors for each
device within a group using an ioctl on the VFIO group file descriptor.

The VFIO device API includes ioctls for describing the device, the I/O
regions and their read/write/mmap offsets on the device descriptor, as
well as mechanisms for describing and registering interrupt
notifications.

VFIO Usage Example
------------------

Assume the user wants to access PCI device 0000:06:0d.0::

    $ readlink /sys/bus/pci/devices/0000:06:0d.0/iommu_group
    ../../../../kernel/iommu_groups/26

This device is therefore in IOMMU group 26.  This device is on the
PCI bus, therefore the user will make use of vfio-pci to manage the
group::

    # modprobe vfio-pci

Binding this device to the vfio-pci driver creates the VFIO group
character devices for this group::

    $ lspci -n -s 0000:06:0d.0
    06:0d.0 0401: 1102:0002 (rev 08)
    # echo 0000:06:0d.0 > /sys/bus/pci/devices/0000:06:0d.0/driver/unbind
    # echo 1102 0002 > /sys/bus/pci/drivers/vfio-pci/new_id

Now we need to look at what other devices are in the group to free
it for use by VFIO::

    $ ls -l /sys/bus/pci/devices/0000:06:0d.0/iommu_group/devices
    total 0
    lrwxrwxrwx. 1 root root 0 Apr 23 16:13 0000:00:1e.0 ->
            ../../../../devices/pci0000:00/0000:00:1e.0
    lrwxrwxrwx. 1 root root 0 Apr 23 16:13 0000:06:0d.0 ->
            ../../../../devices/pci0000:00/0000:00:1e.0/0000:06:0d.0
    lrwxrwxrwx. 1 root root 0 Apr 23 16:13 0000:06:0d.1 ->
            ../../../../devices/pci0000:00/0000:00:1e.0/0000:06:0d.1

This device is behind a PCIe-to-PCI bridge [4]_, therefore we also
need to add device 0000:06:0d.1 to the group following the same
procedure as above.  Device 0000:00:1e.0 is a bridge that does
not currently have a host driver, therefore it's not required to
bind this device to the vfio-pci driver (vfio-pci does not currently
support PCI bridges).

The final step is to provide the user with access to the group if
unprivileged operation is desired (note that /dev/vfio/vfio provides
no capabilities on its own and is therefore expected to be set to
mode 0666 by the system)::

    # chown user:user /dev/vfio/26

The user now has full access to all the devices and the iommu for this
group and can access them as follows::

    int container, group, device, i;
    struct vfio_group_status group_status =
                                    { .argsz = sizeof(group_status) };
    struct vfio_iommu_type1_info iommu_info = { .argsz = sizeof(iommu_info) };
    struct vfio_iommu_type1_dma_map dma_map = { .argsz = sizeof(dma_map) };
    struct vfio_device_info device_info = { .argsz = sizeof(device_info) };

    /* Create a new container */
    container = open("/dev/vfio/vfio", O_RDWR);

    if (ioctl(container, VFIO_GET_API_VERSION) != VFIO_API_VERSION)
            /* Unknown API version */

    if (!ioctl(container, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU))
            /* Doesn't support the IOMMU driver we want. */

    /* Open the group */
    group = open("/dev/vfio/26", O_RDWR);

    /* Test the group is viable and available */
    ioctl(group, VFIO_GROUP_GET_STATUS, &group_status);

    if (!(group_status.flags & VFIO_GROUP_FLAGS_VIABLE))
            /* Group is not viable (i.e., not all devices bound for vfio) */

    /* Add the group to the container */
    ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);

    /* Enable the IOMMU model we want */
    ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);

    /* Get additional IOMMU info */
    ioctl(container, VFIO_IOMMU_GET_INFO, &iommu_info);

    /* Allocate some space and setup a DMA mapping */
    dma_map.vaddr = mmap(0, 1024 * 1024, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
    dma_map.size = 1024 * 1024;
    dma_map.iova = 0; /* 1MB starting at 0x0 from device view */
    dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;

    ioctl(container, VFIO_IOMMU_MAP_DMA, &dma_map);

    /* Get a file descriptor for the device */
    device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");

    /* Test and setup the device */
    ioctl(device, VFIO_DEVICE_GET_INFO, &device_info);

    for (i = 0; i < device_info.num_regions; i++) {
            struct vfio_region_info reg = { .argsz = sizeof(reg) };

            reg.index = i;

            ioctl(device, VFIO_DEVICE_GET_REGION_INFO, &reg);

            /* Setup mappings... read/write offsets, mmaps
             * For PCI devices, config space is a region */
    }

    for (i = 0; i < device_info.num_irqs; i++) {
            struct vfio_irq_info irq = { .argsz = sizeof(irq) };

            irq.index = i;

            ioctl(device, VFIO_DEVICE_GET_IRQ_INFO, &irq);

            /* Setup IRQs... eventfds, VFIO_DEVICE_SET_IRQS */
    }

    /* Gratuitous device reset and go... */
    ioctl(device, VFIO_DEVICE_RESET);

VFIO User API
-------------

Please see include/linux/vfio.h for complete API documentation.

VFIO bus driver API
-------------------

VFIO bus drivers, such as vfio-pci, make use of only a few interfaces
into VFIO core.  When devices are bound to and unbound from the driver,
the driver should call vfio_add_group_dev() and vfio_del_group_dev()
respectively::

    extern int vfio_add_group_dev(struct device *dev,
                                  const struct vfio_device_ops *ops,
                                  void *device_data);

    extern void *vfio_del_group_dev(struct device *dev);

vfio_add_group_dev() indicates to the core to begin tracking the
iommu_group of the specified dev and register the dev as owned by
a VFIO bus driver.  The driver provides an ops structure for callbacks
similar to a file operations structure::

    struct vfio_device_ops {
            int     (*open)(void *device_data);
            void    (*release)(void *device_data);
            ssize_t (*read)(void *device_data, char __user *buf,
                            size_t count, loff_t *ppos);
            ssize_t (*write)(void *device_data, const char __user *buf,
                             size_t size, loff_t *ppos);
            long    (*ioctl)(void *device_data, unsigned int cmd,
                             unsigned long arg);
            int     (*mmap)(void *device_data, struct vm_area_struct *vma);
    };

Each function is passed the device_data that was originally registered
in the vfio_add_group_dev() call above.  This allows the bus driver
an easy place to store its opaque, private data.  The open/release
callbacks are issued when a new file descriptor is created for a
device (via VFIO_GROUP_GET_DEVICE_FD).  The ioctl interface provides
a direct pass through for VFIO_DEVICE_* ioctls.  The read/write/mmap
interfaces implement the device region access defined by the device's
own VFIO_DEVICE_GET_REGION_INFO ioctl.


PPC64 sPAPR implementation note
-------------------------------

This implementation has some specifics:

1) On older systems (POWER7 with P5IOC2/IODA1) only one IOMMU group per
   container is supported, as an IOMMU table is allocated at boot time,
   one table per IOMMU group, which is a Partitionable Endpoint (PE)
   (a PE is often a PCI domain but not always).

   Newer systems (POWER8 with IODA2) have an improved hardware design
   which allows this limitation to be lifted, supporting multiple IOMMU
   groups per VFIO container.

2) The hardware supports so-called DMA windows - the PCI address range
   within which DMA transfer is allowed; any attempt to access address
   space outside of the window leads to the whole PE being isolated.

3) PPC64 guests are paravirtualized but not fully emulated.  There is
   an API to map/unmap pages for DMA, and it normally maps 1..32 pages
   per call and currently there is no way to reduce the number of
   calls.  In order to make things faster, the map/unmap handling has
   been implemented in real mode, which provides excellent performance
   but has limitations such as the inability to do locked pages
   accounting in real time.

4) According to the sPAPR specification, a Partitionable Endpoint (PE)
   is an I/O subtree that can be treated as a unit for the purposes of
   partitioning and error recovery.  A PE may be a single or
   multi-function IOA (IO Adapter), a function of a multi-function IOA,
   or multiple IOAs (possibly including switch and bridge structures
   above the multiple IOAs).  PPC64 guests detect PCI errors and
   recover from them via EEH RTAS services, which work on the basis of
   additional ioctl commands.

So 4 additional ioctls have been added:

VFIO_IOMMU_SPAPR_TCE_GET_INFO
    returns the size and the start of the DMA window on the PCI bus.

VFIO_IOMMU_ENABLE
    enables the container.  The locked pages accounting
    is done at this point.  This lets the user first find out what
    the DMA window is and adjust the rlimit before doing any real work.

VFIO_IOMMU_DISABLE
    disables the container.

VFIO_EEH_PE_OP
    provides an API for EEH setup, error detection and recovery.

The code flow from the example above should be slightly changed::

    struct vfio_eeh_pe_op pe_op = { .argsz = sizeof(pe_op), .flags = 0 };

    .....
    /* Add the group to the container */
    ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);

    /* Enable the IOMMU model we want */
    ioctl(container, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU);

    /* Get additional sPAPR IOMMU info */
    struct vfio_iommu_spapr_tce_info spapr_iommu_info;
    ioctl(container, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &spapr_iommu_info);

    if (ioctl(container, VFIO_IOMMU_ENABLE))
            /* Cannot enable container, perhaps the rlimit is too low */

    /* Allocate some space and setup a DMA mapping */
    dma_map.vaddr = mmap(0, 1024 * 1024, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);

    dma_map.size = 1024 * 1024;
    dma_map.iova = 0; /* 1MB starting at 0x0 from device view */
    dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;

    /* Check here that .iova/.size are within the DMA window from
     * spapr_iommu_info */
    ioctl(container, VFIO_IOMMU_MAP_DMA, &dma_map);

    /* Get a file descriptor for the device */
    device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");

    ....

    /* Gratuitous device reset and go... */
    ioctl(device, VFIO_DEVICE_RESET);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 373)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 374) /* Make sure EEH is supported */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 375) ioctl(container, VFIO_CHECK_EXTENSION, VFIO_EEH);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 376)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 377) /* Enable the EEH functionality on the device */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 378) pe_op.op = VFIO_EEH_PE_ENABLE;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 379) ioctl(container, VFIO_EEH_PE_OP, &pe_op);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 380)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 381) /* You're suggested to create additional data struct to represent
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 382) * PE, and put child devices belonging to same IOMMU group to the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 383) * PE instance for later reference.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 384) */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 385)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 386) /* Check the PE's state and make sure it's in functional state */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 387) pe_op.op = VFIO_EEH_PE_GET_STATE;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 388) ioctl(container, VFIO_EEH_PE_OP, &pe_op);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 389)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 390) /* Save device state using pci_save_state().
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 391) * EEH should be enabled on the specified device.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 392) */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 393)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 394) ....
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 395)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 396) /* Inject EEH error, which is expected to be caused by 32-bits
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 397) * config load.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 398) */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 399) pe_op.op = VFIO_EEH_PE_INJECT_ERR;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 400) pe_op.err.type = EEH_ERR_TYPE_32;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 401) pe_op.err.func = EEH_ERR_FUNC_LD_CFG_ADDR;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 402) pe_op.err.addr = 0ul;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 403) pe_op.err.mask = 0ul;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 404) ioctl(container, VFIO_EEH_PE_OP, &pe_op);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 405)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 406) ....
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 407)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 408) /* When 0xFF's returned from reading PCI config space or IO BARs
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 409) * of the PCI device. Check the PE's state to see if that has been
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 410) * frozen.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 411) */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 412) ioctl(container, VFIO_EEH_PE_OP, &pe_op);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 413)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 414) /* Waiting for pending PCI transactions to be completed and don't
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 415) * produce any more PCI traffic from/to the affected PE until
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 416) * recovery is finished.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 417) */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 418)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 419) /* Enable IO for the affected PE and collect logs. Usually, the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 420) * standard part of PCI config space, AER registers are dumped
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 421) * as logs for further analysis.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 422) */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 423) pe_op.op = VFIO_EEH_PE_UNFREEZE_IO;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 424) ioctl(container, VFIO_EEH_PE_OP, &pe_op);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 425)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 426) /*
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 427) * Issue PE reset: hot or fundamental reset. Usually, hot reset
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 428) * is enough. However, the firmware of some PCI adapters would
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 429) * require fundamental reset.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 430) */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 431) pe_op.op = VFIO_EEH_PE_RESET_HOT;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 432) ioctl(container, VFIO_EEH_PE_OP, &pe_op);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 433) pe_op.op = VFIO_EEH_PE_RESET_DEACTIVATE;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 434) ioctl(container, VFIO_EEH_PE_OP, &pe_op);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 435)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 436) /* Configure the PCI bridges for the affected PE */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 437) pe_op.op = VFIO_EEH_PE_CONFIGURE;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 438) ioctl(container, VFIO_EEH_PE_OP, &pe_op);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 439)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 440) /* Restored state we saved at initialization time. pci_restore_state()
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 441) * is good enough as an example.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 442) */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 443)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 444) /* Hopefully, error is recovered successfully. Now, you can resume to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 445) * start PCI traffic to/from the affected PE.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 446) */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 447)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 448) ....
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 449)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 450) 5) There is v2 of SPAPR TCE IOMMU. It deprecates VFIO_IOMMU_ENABLE/
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 451) VFIO_IOMMU_DISABLE and implements 2 new ioctls:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 452) VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 453) (which are unsupported in v1 IOMMU).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 454)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 455) PPC64 paravirtualized guests generate a lot of map/unmap requests,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 456) and the handling of those includes pinning/unpinning pages and updating
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 457) mm::locked_vm counter to make sure we do not exceed the rlimit.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 458) The v2 IOMMU splits accounting and pinning into separate operations:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 459)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 460) - VFIO_IOMMU_SPAPR_REGISTER_MEMORY/VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY ioctls
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 461) receive a user space address and size of the block to be pinned.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 462) Bisecting is not supported and VFIO_IOMMU_UNREGISTER_MEMORY is expected to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 463) be called with the exact address and size used for registering
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 464) the memory block. The userspace is not expected to call these often.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 465) The ranges are stored in a linked list in a VFIO container.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 466)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 467) - VFIO_IOMMU_MAP_DMA/VFIO_IOMMU_UNMAP_DMA ioctls only update the actual
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 468) IOMMU table and do not do pinning; instead these check that the userspace
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 469) address is from pre-registered range.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 470)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 471) This separation helps in optimizing DMA for guests.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 472)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 473) 6) sPAPR specification allows guests to have an additional DMA window(s) on
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 474) a PCI bus with a variable page size. Two ioctls have been added to support
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 475) this: VFIO_IOMMU_SPAPR_TCE_CREATE and VFIO_IOMMU_SPAPR_TCE_REMOVE.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 476) The platform has to support the functionality or error will be returned to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 477) the userspace. The existing hardware supports up to 2 DMA windows, one is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 478) 2GB long, uses 4K pages and called "default 32bit window"; the other can
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 479) be as big as entire RAM, use different page size, it is optional - guests
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 480) create those in run-time if the guest driver supports 64bit DMA.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 481)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 482) VFIO_IOMMU_SPAPR_TCE_CREATE receives a page shift, a DMA window size and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 483) a number of TCE table levels (if a TCE table is going to be big enough and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 484) the kernel may not be able to allocate enough of physically contiguous
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 485) memory). It creates a new window in the available slot and returns the bus
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 486) address where the new window starts. Due to hardware limitation, the user
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 487) space cannot choose the location of DMA windows.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 488)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 489) VFIO_IOMMU_SPAPR_TCE_REMOVE receives the bus start address of the window
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 490) and removes it.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 491)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 492) -------------------------------------------------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 493)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 494) .. [1] VFIO was originally an acronym for "Virtual Function I/O" in its
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 495) initial implementation by Tom Lyon while as Cisco. We've since
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 496) outgrown the acronym, but it's catchy.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 497)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 498) .. [2] "safe" also depends upon a device being "well behaved". It's
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 499) possible for multi-function devices to have backdoors between
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 500) functions and even for single function devices to have alternative
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 501) access to things like PCI config space through MMIO registers. To
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 502) guard against the former we can include additional precautions in the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 503) IOMMU driver to group multi-function PCI devices together
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 504) (iommu=group_mf). The latter we can't prevent, but the IOMMU should
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 505) still provide isolation. For PCI, SR-IOV Virtual Functions are the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 506) best indicator of "well behaved", as these are designed for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 507) virtualization usage models.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 508)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 509) .. [3] As always there are trade-offs to virtual machine device
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 510) assignment that are beyond the scope of VFIO. It's expected that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 511) future IOMMU technologies will reduce some, but maybe not all, of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 512) these trade-offs.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 513)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 514) .. [4] In this case the device is below a PCI bridge, so transactions
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 515) from either function of the device are indistinguishable to the iommu::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 516)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 517) -[0000:00]-+-1e.0-[06]--+-0d.0
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 518) \-0d.1
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 519)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 520) 00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 90)