==================================
VFIO - "Virtual Function I/O" [1]_
==================================

Many modern systems now provide DMA and interrupt remapping facilities
to help ensure I/O devices behave within the boundaries they've been
allotted.  This includes x86 hardware with AMD-Vi and Intel VT-d,
POWER systems with Partitionable Endpoints (PEs) and embedded PowerPC
systems such as Freescale PAMU.  The VFIO driver is an IOMMU/device
agnostic framework for exposing direct device access to userspace, in
a secure, IOMMU protected environment.  In other words, this allows
safe [2]_, non-privileged, userspace drivers.

Why do we want that?  Virtual machines often make use of direct device
access ("device assignment") when configured for the highest possible
I/O performance.  From a device and host perspective, this simply
turns the VM into a userspace driver, with the benefits of
significantly reduced latency, higher bandwidth, and direct use of
bare-metal device drivers [3]_.

Some applications, particularly in the high performance computing
field, also benefit from low-overhead, direct device access from
userspace.  Examples include network adapters (often non-TCP/IP based)
and compute accelerators.  Prior to VFIO, these drivers had to either
go through the full development cycle to become proper upstream
drivers, be maintained out of tree, or make use of the UIO framework,
which has no notion of IOMMU protection, limited interrupt support,
and requires root privileges to access things like PCI configuration
space.

The VFIO driver framework intends to unify these, replacing the
KVM PCI-specific device assignment code and providing a more
secure, more featureful userspace driver environment than UIO.

Groups, Devices, and IOMMUs
---------------------------

Devices are the main target of any I/O driver.  Devices typically
create a programming interface made up of I/O access, interrupts,
and DMA.  Without going into the details of each of these, DMA is
by far the most critical aspect for maintaining a secure environment
as allowing a device read-write access to system memory imposes the
greatest risk to the overall system integrity.

To help mitigate this risk, many modern IOMMUs now incorporate
isolation properties into what was, in many cases, an interface only
meant for translation (ie. solving the addressing problems of devices
with limited address spaces).  With this, devices can now be isolated
from each other and from arbitrary memory access, thus allowing
things like secure direct assignment of devices into virtual machines.

This isolation is not always at the granularity of a single device
though.  Even when an IOMMU is capable of this, properties of devices,
interconnects, and IOMMU topologies can each reduce this isolation.
For instance, an individual device may be part of a larger multi-
function enclosure.  While the IOMMU may be able to distinguish
between devices within the enclosure, the enclosure may not require
transactions between devices to reach the IOMMU.  Examples of this
could be anything from a multi-function PCI device with backdoors
between functions to a non-PCI-ACS (Access Control Services) capable
bridge allowing redirection without reaching the IOMMU.  Topology
can also play a factor in terms of hiding devices.  A PCIe-to-PCI
bridge masks the devices behind it, making transactions appear as if
they come from the bridge itself.  Obviously, IOMMU design plays a
major role as well.

Therefore, while for the most part an IOMMU may have device level
granularity, any system is susceptible to reduced granularity.  The
IOMMU API therefore supports a notion of IOMMU groups.  A group is
a set of devices which is isolatable from all other devices in the
system.  Groups are therefore the unit of ownership used by VFIO.

While the group is the minimum granularity that must be used to
ensure secure user access, it's not necessarily the preferred
granularity.  In IOMMUs which make use of page tables, it may be
possible to share a set of page tables between different groups,
reducing the overhead both to the platform (reduced TLB thrashing,
reduced duplicate page tables), and to the user (programming only
a single set of translations).  For this reason, VFIO makes use of
a container class, which may hold one or more groups.  A container
is created by simply opening the /dev/vfio/vfio character device.

On its own, the container provides little functionality, with all
but a couple of version and extension query interfaces locked away.
The user needs to add a group into the container for the next level
of functionality.  To do this, the user first needs to identify the
group associated with the desired device.  This can be done using
the sysfs links described in the example below.  By unbinding the
device from the host driver and binding it to a VFIO driver, a new
VFIO group will appear for the group as /dev/vfio/$GROUP, where
$GROUP is the IOMMU group number of which the device is a member.
If the IOMMU group contains multiple devices, each will need to
be bound to a VFIO driver before operations on the VFIO group
are allowed (it's also sufficient to only unbind the device from
host drivers if a VFIO driver is unavailable; this will make the
group available, but not that particular device).  TBD - interface
for disabling driver probing/locking a device.

Once the group is ready, it may be added to the container by opening
the VFIO group character device (/dev/vfio/$GROUP) and using the
VFIO_GROUP_SET_CONTAINER ioctl, passing the file descriptor of the
previously opened container file.  If desired and if the IOMMU driver
supports sharing the IOMMU context between groups, multiple groups may
be set to the same container.  If a group cannot be set to a container
with existing groups, a new empty container will need to be used
instead.

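As an illustration only (the group numbers are hypothetical, and the
sharing depends on the IOMMU driver supporting a shared context),
attaching a second group to an already populated container is just
another call with the same container file descriptor::

	int container = open("/dev/vfio/vfio", O_RDWR);
	int group_a = open("/dev/vfio/26", O_RDWR);
	int group_b = open("/dev/vfio/27", O_RDWR);

	/* The first group joins the container */
	ioctl(group_a, VFIO_GROUP_SET_CONTAINER, &container);

	/* The second group tries to share the same container; if the
	 * IOMMU context cannot be shared, this fails and group_b needs
	 * a new, empty container instead. */
	if (ioctl(group_b, VFIO_GROUP_SET_CONTAINER, &container))
		/* fall back to a separate container for group_b */;
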
With a group (or groups) attached to a container, the remaining
ioctls become available, enabling access to the VFIO IOMMU interfaces.
Additionally, it now becomes possible to get file descriptors for each
device within a group using an ioctl on the VFIO group file descriptor.

The VFIO device API includes ioctls for describing the device, the I/O
regions and their read/write/mmap offsets on the device descriptor, as
well as mechanisms for describing and registering interrupt
notifications.

VFIO Usage Example
------------------

Assume the user wants to access PCI device 0000:06:0d.0::

	$ readlink /sys/bus/pci/devices/0000:06:0d.0/iommu_group
	../../../../kernel/iommu_groups/26

This device is therefore in IOMMU group 26.  This device is on the
PCI bus, therefore the user will make use of vfio-pci to manage the
group::

	# modprobe vfio-pci

Binding this device to the vfio-pci driver creates the VFIO group
character devices for this group::

	$ lspci -n -s 0000:06:0d.0
	06:0d.0 0401: 1102:0002 (rev 08)
	# echo 0000:06:0d.0 > /sys/bus/pci/devices/0000:06:0d.0/driver/unbind
	# echo 1102 0002 > /sys/bus/pci/drivers/vfio-pci/new_id

Now we need to look at what other devices are in the group to free
it for use by VFIO::

	$ ls -l /sys/bus/pci/devices/0000:06:0d.0/iommu_group/devices
	total 0
	lrwxrwxrwx. 1 root root 0 Apr 23 16:13 0000:00:1e.0 ->
		../../../../devices/pci0000:00/0000:00:1e.0
	lrwxrwxrwx. 1 root root 0 Apr 23 16:13 0000:06:0d.0 ->
		../../../../devices/pci0000:00/0000:00:1e.0/0000:06:0d.0
	lrwxrwxrwx. 1 root root 0 Apr 23 16:13 0000:06:0d.1 ->
		../../../../devices/pci0000:00/0000:00:1e.0/0000:06:0d.1

This device is behind a PCIe-to-PCI bridge [4]_, therefore we also
need to add device 0000:06:0d.1 to the group following the same
procedure as above.  Device 0000:00:1e.0 is a bridge that does
not currently have a host driver, therefore it's not required to
bind this device to the vfio-pci driver (vfio-pci does not currently
support PCI bridges).

The final step is to provide the user with access to the group if
unprivileged operation is desired (note that /dev/vfio/vfio provides
no capabilities on its own and is therefore expected to be set to
mode 0666 by the system)::

	# chown user:user /dev/vfio/26

The user now has full access to all the devices and the iommu for this
group and can access them as follows::

	int container, group, device, i;
	struct vfio_group_status group_status =
					{ .argsz = sizeof(group_status) };
	struct vfio_iommu_type1_info iommu_info = { .argsz = sizeof(iommu_info) };
	struct vfio_iommu_type1_dma_map dma_map = { .argsz = sizeof(dma_map) };
	struct vfio_device_info device_info = { .argsz = sizeof(device_info) };

	/* Create a new container */
	container = open("/dev/vfio/vfio", O_RDWR);

	if (ioctl(container, VFIO_GET_API_VERSION) != VFIO_API_VERSION)
		/* Unknown API version */

	if (!ioctl(container, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU))
		/* Doesn't support the IOMMU driver we want. */

	/* Open the group */
	group = open("/dev/vfio/26", O_RDWR);

	/* Test the group is viable and available */
	ioctl(group, VFIO_GROUP_GET_STATUS, &group_status);

	if (!(group_status.flags & VFIO_GROUP_FLAGS_VIABLE))
		/* Group is not viable (ie, not all devices bound for vfio) */

	/* Add the group to the container */
	ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);

	/* Enable the IOMMU model we want */
	ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);

	/* Get additional IOMMU info */
	ioctl(container, VFIO_IOMMU_GET_INFO, &iommu_info);

	/* Allocate some space and setup a DMA mapping */
	dma_map.vaddr = mmap(0, 1024 * 1024, PROT_READ | PROT_WRITE,
			     MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
	dma_map.size = 1024 * 1024;
	dma_map.iova = 0; /* 1MB starting at 0x0 from device view */
	dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;

	ioctl(container, VFIO_IOMMU_MAP_DMA, &dma_map);

	/* Get a file descriptor for the device */
	device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");

	/* Test and setup the device */
	ioctl(device, VFIO_DEVICE_GET_INFO, &device_info);

	for (i = 0; i < device_info.num_regions; i++) {
		struct vfio_region_info reg = { .argsz = sizeof(reg) };

		reg.index = i;

		ioctl(device, VFIO_DEVICE_GET_REGION_INFO, &reg);

		/* Setup mappings... read/write offsets, mmaps
		 * For PCI devices, config space is a region */
	}

	for (i = 0; i < device_info.num_irqs; i++) {
		struct vfio_irq_info irq = { .argsz = sizeof(irq) };

		irq.index = i;

		ioctl(device, VFIO_DEVICE_GET_IRQ_INFO, &irq);

		/* Setup IRQs... eventfds, VFIO_DEVICE_SET_IRQS */
	}

	/* Gratuitous device reset and go... */
	ioctl(device, VFIO_DEVICE_RESET);

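The region info returned above also tells us how to reach each region
through the device file descriptor.  A brief sketch for a PCI BAR
follows; the region index and the flag checks are illustrative, not
exhaustive::

	struct vfio_region_info bar0 = { .argsz = sizeof(bar0) };
	uint32_t val;
	void *map;

	bar0.index = VFIO_PCI_BAR0_REGION_INDEX;
	ioctl(device, VFIO_DEVICE_GET_REGION_INFO, &bar0);

	/* Read 4 bytes from the start of BAR0 via the read interface */
	if (bar0.flags & VFIO_REGION_INFO_FLAG_READ)
		pread(device, &val, sizeof(val), bar0.offset);

	/* Map the whole BAR if the region supports mmap */
	if (bar0.flags & VFIO_REGION_INFO_FLAG_MMAP)
		map = mmap(NULL, bar0.size, PROT_READ | PROT_WRITE,
			   MAP_SHARED, device, bar0.offset);
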
VFIO User API
-------------------------------------------------------------------------------

Please see include/linux/vfio.h for complete API documentation.

VFIO bus driver API
-------------------------------------------------------------------------------

VFIO bus drivers, such as vfio-pci, make use of only a few interfaces
into VFIO core.  When devices are bound to and unbound from the driver,
the driver should call vfio_add_group_dev() and vfio_del_group_dev()
respectively::

	extern int vfio_add_group_dev(struct device *dev,
				      const struct vfio_device_ops *ops,
				      void *device_data);

	extern void *vfio_del_group_dev(struct device *dev);

vfio_add_group_dev() tells the core to begin tracking the
iommu_group of the specified dev and to register the dev as owned by
a VFIO bus driver.  The driver provides an ops structure for callbacks
similar to a file operations structure::

	struct vfio_device_ops {
		int	(*open)(void *device_data);
		void	(*release)(void *device_data);
		ssize_t	(*read)(void *device_data, char __user *buf,
				size_t count, loff_t *ppos);
		ssize_t	(*write)(void *device_data, const char __user *buf,
				 size_t size, loff_t *ppos);
		long	(*ioctl)(void *device_data, unsigned int cmd,
				 unsigned long arg);
		int	(*mmap)(void *device_data, struct vm_area_struct *vma);
	};

Each function is passed the device_data that was originally registered
in the vfio_add_group_dev() call above.  This allows the bus driver
an easy place to store its opaque, private data.  The open/release
callbacks are issued when a new file descriptor is created for a
device (via VFIO_GROUP_GET_DEVICE_FD).  The ioctl interface provides
a direct pass through for VFIO_DEVICE_* ioctls.  The read/write/mmap
interfaces implement the device region access defined by the device's
own VFIO_DEVICE_GET_REGION_INFO ioctl.


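As a rough illustration (not taken from an existing driver), a bus
driver's bind/unbind paths might wrap these calls as follows, where
my_vfio_ops is a struct vfio_device_ops filled with the callbacks above
and struct my_vfio_device is a made-up private structure::

	struct my_vfio_device {
		struct device	*dev;
		/* ... device specific state ... */
	};

	static int my_vfio_probe(struct device *dev)
	{
		struct my_vfio_device *vdev;
		int ret;

		vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
		if (!vdev)
			return -ENOMEM;
		vdev->dev = dev;

		/* Hand the device (and its iommu_group) over to VFIO core;
		 * vdev comes back to us as device_data in the callbacks. */
		ret = vfio_add_group_dev(dev, &my_vfio_ops, vdev);
		if (ret)
			kfree(vdev);
		return ret;
	}

	static void my_vfio_remove(struct device *dev)
	{
		/* vfio_del_group_dev() returns the device_data we registered */
		struct my_vfio_device *vdev = vfio_del_group_dev(dev);

		kfree(vdev);
	}
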
PPC64 sPAPR implementation note
-------------------------------

This implementation has some specifics:

1) On older systems (POWER7 with P5IOC2/IODA1) only one IOMMU group per
   container is supported as an IOMMU table is allocated at boot time,
   one table per IOMMU group which is a Partitionable Endpoint (PE)
   (a PE is often a PCI domain but not always).

   Newer systems (POWER8 with IODA2) have an improved hardware design which
   removes this limitation and allows multiple IOMMU groups per VFIO
   container.

2) The hardware supports so-called DMA windows - the PCI address range
   within which DMA transfer is allowed; any attempt to access address space
   outside the window leads to isolation of the whole PE.

3) PPC64 guests are paravirtualized but not fully emulated. There is an API
   to map/unmap pages for DMA, and it normally maps 1..32 pages per call and
   currently there is no way to reduce the number of calls. In order to make
   things faster, the map/unmap handling has been implemented in real mode,
   which provides excellent performance but has limitations such as the
   inability to do locked pages accounting in real time.

4) According to the sPAPR specification, a Partitionable Endpoint (PE) is an
   I/O subtree that can be treated as a unit for the purposes of partitioning
   and error recovery. A PE may be a single or multi-function IOA (IO Adapter),
   a function of a multi-function IOA, or multiple IOAs (possibly including
   switch and bridge structures above the multiple IOAs). PPC64 guests detect
   PCI errors and recover from them via EEH RTAS services, which works on the
   basis of additional ioctl commands.

   So 4 additional ioctls have been added:

	VFIO_IOMMU_SPAPR_TCE_GET_INFO
		returns the size and the start of the DMA window on the PCI bus.

	VFIO_IOMMU_ENABLE
		enables the container. The locked pages accounting
		is done at this point. This lets the user first know what
		the DMA window is and adjust the rlimit before doing any real work.

	VFIO_IOMMU_DISABLE
		disables the container.

	VFIO_EEH_PE_OP
		provides an API for EEH setup, error detection and recovery.

   The code flow from the example above should be slightly changed::

	struct vfio_eeh_pe_op pe_op = { .argsz = sizeof(pe_op), .flags = 0 };

	.....
	/* Add the group to the container */
	ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);

	/* Enable the IOMMU model we want */
	ioctl(container, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU);

	/* Get additional sPAPR IOMMU info */
	struct vfio_iommu_spapr_tce_info spapr_iommu_info = {
					.argsz = sizeof(spapr_iommu_info) };
	ioctl(container, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &spapr_iommu_info);

	if (ioctl(container, VFIO_IOMMU_ENABLE))
		/* Cannot enable container, may be low rlimit */

	/* Allocate some space and setup a DMA mapping */
	dma_map.vaddr = mmap(0, 1024 * 1024, PROT_READ | PROT_WRITE,
			     MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);

	dma_map.size = 1024 * 1024;
	dma_map.iova = 0; /* 1MB starting at 0x0 from device view */
	dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;

	/* Check here that .iova/.size are within the DMA window from
	 * spapr_iommu_info */
	ioctl(container, VFIO_IOMMU_MAP_DMA, &dma_map);

	/* Get a file descriptor for the device */
	device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");

	....

	/* Gratuitous device reset and go... */
	ioctl(device, VFIO_DEVICE_RESET);

	/* Make sure EEH is supported */
	ioctl(container, VFIO_CHECK_EXTENSION, VFIO_EEH);

	/* Enable the EEH functionality on the device */
	pe_op.op = VFIO_EEH_PE_ENABLE;
	ioctl(container, VFIO_EEH_PE_OP, &pe_op);

	/* It is suggested to create an additional data structure to
	 * represent the PE, and put child devices belonging to the same
	 * IOMMU group into the PE instance for later reference.
	 */

	/* Check the PE's state and make sure it's in a functional state */
	pe_op.op = VFIO_EEH_PE_GET_STATE;
	ioctl(container, VFIO_EEH_PE_OP, &pe_op);

	/* Save device state using pci_save_state().
	 * EEH should be enabled on the specified device.
	 */

	....

	/* Inject EEH error, which is expected to be caused by a 32-bit
	 * config load.
	 */
	pe_op.op = VFIO_EEH_PE_INJECT_ERR;
	pe_op.err.type = EEH_ERR_TYPE_32;
	pe_op.err.func = EEH_ERR_FUNC_LD_CFG_ADDR;
	pe_op.err.addr = 0ul;
	pe_op.err.mask = 0ul;
	ioctl(container, VFIO_EEH_PE_OP, &pe_op);

	....

	/* When 0xFF's are returned from reading the PCI config space or IO
	 * BARs of the PCI device, check the PE's state to see if it has
	 * been frozen.
	 */
	pe_op.op = VFIO_EEH_PE_GET_STATE;
	ioctl(container, VFIO_EEH_PE_OP, &pe_op);

	/* Wait for pending PCI transactions to complete and don't
	 * produce any more PCI traffic from/to the affected PE until
	 * recovery is finished.
	 */

	/* Enable IO for the affected PE and collect logs. Usually, the
	 * standard part of PCI config space and AER registers are dumped
	 * as logs for further analysis.
	 */
	pe_op.op = VFIO_EEH_PE_UNFREEZE_IO;
	ioctl(container, VFIO_EEH_PE_OP, &pe_op);

	/*
	 * Issue PE reset: hot or fundamental reset. Usually, hot reset
	 * is enough. However, the firmware of some PCI adapters would
	 * require fundamental reset.
	 */
	pe_op.op = VFIO_EEH_PE_RESET_HOT;
	ioctl(container, VFIO_EEH_PE_OP, &pe_op);
	pe_op.op = VFIO_EEH_PE_RESET_DEACTIVATE;
	ioctl(container, VFIO_EEH_PE_OP, &pe_op);

	/* Configure the PCI bridges for the affected PE */
	pe_op.op = VFIO_EEH_PE_CONFIGURE;
	ioctl(container, VFIO_EEH_PE_OP, &pe_op);

	/* Restore the state we saved at initialization time.
	 * pci_restore_state() is good enough as an example.
	 */

	/* Hopefully, the error has been recovered successfully. Now you can
	 * resume PCI traffic to/from the affected PE.
	 */

	....

5) There is a v2 of the sPAPR TCE IOMMU. It deprecates VFIO_IOMMU_ENABLE/
   VFIO_IOMMU_DISABLE and implements 2 new ioctls:
   VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY
   (which are unsupported in the v1 IOMMU).

   PPC64 paravirtualized guests generate a lot of map/unmap requests,
   and the handling of those includes pinning/unpinning pages and updating
   the mm::locked_vm counter to make sure we do not exceed the rlimit.
   The v2 IOMMU splits accounting and pinning into separate operations:

   - VFIO_IOMMU_SPAPR_REGISTER_MEMORY/VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY ioctls
     receive a user space address and size of the block to be pinned.
     Bisecting is not supported and VFIO_IOMMU_UNREGISTER_MEMORY is expected to
     be called with the exact address and size used for registering
     the memory block. Userspace is not expected to call these often.
     The ranges are stored in a linked list in a VFIO container.

   - VFIO_IOMMU_MAP_DMA/VFIO_IOMMU_UNMAP_DMA ioctls only update the actual
     IOMMU table and do not do pinning; instead these check that the userspace
     address is from a pre-registered range.

   This separation helps in optimizing DMA for guests.

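   A minimal sketch of this flow, assuming the container already uses the
   v2 backend (VFIO_SPAPR_TCE_v2_IOMMU); the buffer size and the lack of
   error handling are purely illustrative::

	void *mem = mmap(0, 1024 * 1024, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
	struct vfio_iommu_spapr_register_memory reg = {
		.argsz = sizeof(reg),
		.vaddr = (__u64)(unsigned long)mem,
		.size = 1024 * 1024,
	};

	/* Pin the block and account it against the locked memory limit once */
	ioctl(container, VFIO_IOMMU_SPAPR_REGISTER_MEMORY, &reg);

	/* MAP_DMA/UNMAP_DMA now only update the TCE table; the vaddr must
	 * fall within a registered block */
	dma_map.vaddr = (__u64)(unsigned long)mem;
	dma_map.size = 1024 * 1024;
	dma_map.iova = 0;
	dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
	ioctl(container, VFIO_IOMMU_MAP_DMA, &dma_map);

	/* ... on teardown, VFIO_IOMMU_UNMAP_DMA the range, then drop the
	 * registration ... */
	ioctl(container, VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY, &reg);
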
6) The sPAPR specification allows guests to have additional DMA windows on
   a PCI bus with a variable page size. Two ioctls have been added to support
   this: VFIO_IOMMU_SPAPR_TCE_CREATE and VFIO_IOMMU_SPAPR_TCE_REMOVE.
   The platform has to support the functionality or an error will be returned
   to the userspace. The existing hardware supports up to 2 DMA windows: one
   is 2GB long, uses 4K pages and is called the "default 32bit window"; the
   other can be as big as the entire RAM, uses a different page size, and is
   optional - guests create it at run time if the guest driver supports
   64bit DMA.

   VFIO_IOMMU_SPAPR_TCE_CREATE receives a page shift, a DMA window size and
   a number of TCE table levels (needed if a TCE table is going to be big
   enough that the kernel may not be able to allocate enough physically
   contiguous memory). It creates a new window in the available slot and
   returns the bus address where the new window starts. Due to hardware
   limitations, the user space cannot choose the location of DMA windows.

   VFIO_IOMMU_SPAPR_TCE_REMOVE receives the bus start address of the window
   and removes it.

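   As a rough sketch, creating and removing such a window could look like
   this (the page shift, window size and number of levels are illustrative)::

	struct vfio_iommu_spapr_tce_create create = {
		.argsz = sizeof(create),
		.page_shift = 16,		/* 64K IOMMU pages */
		.window_size = 1ULL << 32,	/* 4GB window */
		.levels = 1,
	};
	struct vfio_iommu_spapr_tce_remove remove = {
		.argsz = sizeof(remove),
	};

	/* The kernel chooses the slot; the bus address where the new
	 * window starts is returned in create.start_addr */
	ioctl(container, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);

	/* ... VFIO_IOMMU_MAP_DMA with .iova inside the new window ... */

	/* Remove the window using the bus address it was created at */
	remove.start_addr = create.start_addr;
	ioctl(container, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
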
-------------------------------------------------------------------------------

.. [1] VFIO was originally an acronym for "Virtual Function I/O" in its
   initial implementation by Tom Lyon while at Cisco.  We've since
   outgrown the acronym, but it's catchy.

.. [2] "safe" also depends upon a device being "well behaved".  It's
   possible for multi-function devices to have backdoors between
   functions and even for single function devices to have alternative
   access to things like PCI config space through MMIO registers.  To
   guard against the former we can include additional precautions in the
   IOMMU driver to group multi-function PCI devices together
   (iommu=group_mf).  The latter we can't prevent, but the IOMMU should
   still provide isolation.  For PCI, SR-IOV Virtual Functions are the
   best indicator of "well behaved", as these are designed for
   virtualization usage models.

.. [3] As always there are trade-offs to virtual machine device
   assignment that are beyond the scope of VFIO.  It's expected that
   future IOMMU technologies will reduce some, but maybe not all, of
   these trade-offs.

.. [4] In this case the device is below a PCI bridge, so transactions
   from either function of the device are indistinguishable to the iommu::

	-[0000:00]-+-1e.0-[06]--+-0d.0
				\-0d.1

	00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 90)