^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1) ======================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2) Userspace verbs access
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 3) ======================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 4)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 5) The ib_uverbs module, built by enabling CONFIG_INFINIBAND_USER_VERBS,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 6) enables direct userspace access to IB hardware via "verbs," as
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 7) described in chapter 11 of the InfiniBand Architecture Specification.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 8)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 9) To use the verbs, the libibverbs library, available from
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 10) https://github.com/linux-rdma/rdma-core, is required. libibverbs contains a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 11) device-independent API for using the ib_uverbs interface.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 12) libibverbs also requires the appropriate device-dependent kernel and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 13) userspace drivers for your InfiniBand hardware. For example, to use
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 14) a Mellanox HCA, you will need the ib_mthca kernel module and the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 15) libmthca userspace driver to be installed.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 16)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 17) User-kernel communication
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 18) =========================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 19)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 20) Userspace communicates with the kernel for slow path, resource
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 21) management operations via the /dev/infiniband/uverbsN character
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 22) devices. Fast path operations are typically performed by writing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 23) directly to hardware registers mmap()ed into userspace, with no
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 24) system call or context switch into the kernel.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 25)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 26) Commands are sent to the kernel via write()s on these device files.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 27) The ABI is defined in include/uapi/rdma/ib_user_verbs.h.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 28) The structs for commands that require a response from the kernel
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 29) contain a 64-bit field used to pass a pointer to an output buffer.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 30) Status is returned to userspace as the return value of the write()
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 31) system call.
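
As a rough illustration of the command layout described above, the
following sketch packs a header and a command body whose 64-bit field
carries the address of a response buffer. All demo_* names are
hypothetical mirrors written for this example, not the real ABI
structs, and the convention that sizes are counted in 32-bit words
(header included) is an assumption here. Consult ib_user_verbs.h for
the authoritative definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical mirror of a command header (not the real ABI struct). */
struct demo_cmd_hdr {
	uint32_t command;   /* command opcode */
	uint16_t in_words;  /* input size in 32-bit words (assumed to include the header) */
	uint16_t out_words; /* response buffer size in 32-bit words */
};

/* Hypothetical command body: a 64-bit field holding the response pointer. */
struct demo_cmd {
	uint64_t response;
};

/* Hypothetical response the kernel would fill in. */
struct demo_resp {
	uint32_t handle;
	uint32_t reserved;
};

/*
 * Pack header and body into buf. Returns the number of bytes that would
 * be passed to write(), or 0 if buf is too small.
 */
static size_t demo_pack_cmd(void *buf, size_t buflen, uint32_t opcode,
			    struct demo_resp *resp)
{
	struct demo_cmd_hdr hdr;
	struct demo_cmd cmd;
	size_t total = sizeof(hdr) + sizeof(cmd);

	if (buflen < total)
		return 0;

	hdr.command = opcode;
	hdr.in_words = (uint16_t)(total / 4);
	hdr.out_words = (uint16_t)(sizeof(*resp) / 4);

	/* The pointer is widened to 64 bits so the layout is the same
	 * for 32-bit and 64-bit userspace. */
	cmd.response = (uint64_t)(uintptr_t)resp;

	memcpy(buf, &hdr, sizeof(hdr));
	memcpy((char *)buf + sizeof(hdr), &cmd, sizeof(cmd));
	return total;
}
```

A caller would then pass the packed buffer and the returned length to
write() on the uverbs file descriptor and inspect the return value of
write() for status.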
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 32)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 33) Resource management
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 34) ===================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 35)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 36) Since creation and destruction of all IB resources are done by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 37) commands passed through a file descriptor, the kernel can keep track
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 38) of which resources are attached to a given userspace context. The
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 39) ib_uverbs module maintains idr tables that are used to translate
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 40) between kernel pointers and opaque userspace handles, so that kernel
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 41) pointers are never exposed to userspace and userspace cannot trick
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 42) the kernel into following a bogus pointer.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 43)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 44) This also allows the kernel to clean up when a process exits and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 45) prevent one process from touching another process's resources.
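
The handle translation described in this section can be sketched in
plain C. The example below is a simplified stand-in that uses a fixed
array rather than the kernel's idr tables, and every demo_* name is
hypothetical. It shows only the idea: userspace sees small integer
handles, lookups are bounds-checked, and a per-context table makes
cleanup on exit a simple walk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_HANDLES 16

/* One table per userspace context; a NULL slot is free. */
struct demo_handle_table {
	void *objs[DEMO_MAX_HANDLES];
};

/* Store obj and return its opaque handle, or -1 if the table is full. */
static int demo_handle_alloc(struct demo_handle_table *t, void *obj)
{
	if (obj == NULL)
		return -1;
	for (int i = 0; i < DEMO_MAX_HANDLES; i++) {
		if (t->objs[i] == NULL) {
			t->objs[i] = obj;
			return i;
		}
	}
	return -1;
}

/*
 * Translate a handle supplied by userspace back to a pointer.
 * Out-of-range or unused handles yield NULL, so a bogus value is
 * never dereferenced.
 */
static void *demo_handle_lookup(const struct demo_handle_table *t,
				int64_t handle)
{
	if (handle < 0 || handle >= DEMO_MAX_HANDLES)
		return NULL;
	return t->objs[handle];
}

/* Release everything the context still owns; returns the count released. */
static int demo_handle_cleanup(struct demo_handle_table *t)
{
	int released = 0;

	for (int i = 0; i < DEMO_MAX_HANDLES; i++) {
		if (t->objs[i] != NULL) {
			t->objs[i] = NULL;
			released++;
		}
	}
	return released;
}
```

Because each context has its own table, a handle that is valid in one
context means nothing in another, which is how this design keeps
processes from touching each other's resources.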
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 46)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 47) Memory pinning
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 48) ==============
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 49)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 50) Direct userspace I/O requires that memory regions that are potential
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 51) I/O targets be kept resident at the same physical address. The
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 52) ib_uverbs module manages pinning and unpinning memory regions via
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 53) get_user_pages() and put_page() calls. It also accounts for the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 54) amount of memory pinned in the process's pinned_vm, and checks that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 55) unprivileged processes do not exceed their RLIMIT_MEMLOCK limit.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 56)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 57) Pages that are pinned multiple times are counted each time they are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 58) pinned, so the value of pinned_vm may be an overestimate of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 59) number of pages pinned by a process.
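
The accounting described above can be modelled with a small sketch.
This is not the kernel's code: struct demo_mm and the demo_* functions
are hypothetical, and the limit is assumed to be compared in whole
pages. It illustrates why pinning the same pages twice counts them
twice, and how the RLIMIT_MEMLOCK check applies only to unprivileged
callers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-process counter, in pages. */
struct demo_mm {
	uint64_t pinned_vm;
};

/*
 * Account for npages about to be pinned. Returns false and leaves
 * the counter unchanged if an unprivileged caller would exceed the
 * memlock limit.
 */
static bool demo_account_pin(struct demo_mm *mm, uint64_t npages,
			     uint64_t memlock_bytes, uint64_t page_size,
			     bool privileged)
{
	uint64_t limit_pages, new_pinned;

	if (page_size == 0)
		return false;

	limit_pages = memlock_bytes / page_size;
	new_pinned = mm->pinned_vm + npages;

	if (new_pinned > limit_pages && !privileged)
		return false;

	/* No deduplication: a page pinned twice is counted twice. */
	mm->pinned_vm = new_pinned;
	return true;
}

/* Undo the accounting when a region is unpinned. */
static void demo_account_unpin(struct demo_mm *mm, uint64_t npages)
{
	mm->pinned_vm -= npages;
}
```

With a 64 KiB limit and 4 KiB pages, registering the same 4-page
buffer twice leaves pinned_vm at 8 even though only 4 distinct pages
are resident, which is the overestimate mentioned above.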
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 60)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 61) /dev files
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 62) ==========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 63)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 64) To create the appropriate character device files automatically with
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 65) udev, a rule like::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 66)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 67)   KERNEL=="uverbs*", NAME="infiniband/%k"
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 68)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 69) can be used. This will create device nodes named::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 70)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 71)   /dev/infiniband/uverbs0
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 72)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 73) and so on. Since the InfiniBand userspace verbs should be safe for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 74) use by non-privileged processes, it may be useful to add an
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 75) appropriate MODE or GROUP to the udev rule.
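
To check whether the permissions set by such a rule actually let an
unprivileged process use a device node, something like the following
sketch may help. The demo_* helpers are hypothetical names introduced
for this example. They build the node path for a given index and use
the standard access() call to test for read and write permission.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Build the path of the Nth uverbs device node.
 * Returns 0 on success, -1 if buf is too small.
 */
static int demo_uverbs_path(char *buf, size_t len, unsigned int n)
{
	int ret = snprintf(buf, len, "/dev/infiniband/uverbs%u", n);

	return (ret < 0 || (size_t)ret >= len) ? -1 : 0;
}

/*
 * Return 1 if the calling process may open path for reading and
 * writing, 0 otherwise (including when the node does not exist).
 */
static int demo_can_open_rw(const char *path)
{
	return access(path, R_OK | W_OK) == 0;
}
```

Running the check as the intended non-privileged user, rather than as
root, gives the meaningful answer.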