Orange Pi5 kernel

Deprecated Linux kernel 5.10.110 for OrangePi 5/5B/5+ boards

^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   1) .. _userfaultfd:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   2) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   3) ===========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   4) Userfaultfd
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   5) ===========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   6) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   7) Objective
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   8) =========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   9) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  10) Userfaults allow the implementation of on-demand paging from userland
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  11) and more generally they allow userland to take control of various
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  12) memory page faults, something otherwise only the kernel code could do.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  13) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  14) For example, userfaults allow a proper and more efficient implementation
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  15) of the ``PROT_NONE+SIGSEGV`` trick.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  16) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  17) Design
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  18) ======
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  19) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  20) Userfaults are delivered and resolved through the ``userfaultfd`` syscall.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  21) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  22) The ``userfaultfd`` (aside from registering and unregistering virtual
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  23) memory ranges) provides two primary functionalities:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  24) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  25) 1) ``read/POLLIN`` protocol to notify a userland thread of the faults
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  26)    happening
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  27) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  28) 2) various ``UFFDIO_*`` ioctls that manage the virtual memory regions
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  29)    registered in the ``userfaultfd`` and allow userland to efficiently
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  30)    resolve the userfaults it receives via 1) or to manage the virtual
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  31)    memory in the background
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  32) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  33) The real advantage of userfaults compared to regular virtual memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  34) management with mremap/mprotect is that userfaults, in all their
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  35) operations, never involve heavyweight structures like vmas (in fact the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  36) ``userfaultfd`` runtime load never takes the mmap_lock for writing).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  37) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  38) Vmas are not suitable for page- (or hugepage-) granular fault tracking
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  39) when dealing with virtual address spaces that could span
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  40) terabytes. Too many vmas would be needed for that.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  41) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  42) The ``userfaultfd``, once opened by invoking the syscall, can also be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  43) passed using unix domain sockets to a manager process, so the same
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  44) manager process could handle the userfaults of a multitude of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  45) different processes without them being aware of what is going on
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  46) (unless, of course, they later try to use the ``userfaultfd``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  47) themselves on the same region the manager is already tracking, which
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  48) is a corner case that would currently return ``-EBUSY``).
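
Handing the descriptor to a manager works like passing any other file
descriptor over a unix domain socket, i.e. with an ``SCM_RIGHTS`` control
message. The following is a minimal sketch of that mechanism only; the helper
names ``send_fd`` and ``recv_fd`` are hypothetical and not part of any
userfaultfd API.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send one file descriptor over a connected unix domain socket. */
int send_fd(int sock, int fd)
{
    char dummy = 'F';   /* at least one byte of ordinary data must be sent */
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;   /* forces correct alignment of buf */
    } ctl;
    struct msghdr msg = { 0 };
    struct cmsghdr *cmsg;

    memset(&ctl, 0, sizeof(ctl));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one file descriptor; returns the new descriptor or -1. */
int recv_fd(int sock)
{
    char dummy;
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } ctl;
    struct msghdr msg = { 0 };
    struct cmsghdr *cmsg;
    int fd;

    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The monitored process would call ``send_fd(sock, uffd)`` once after creating
the ``userfaultfd``; the manager then polls the descriptor returned by
``recv_fd()`` as if it had opened it itself.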
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  49) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  50) API
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  51) ===
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  52) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  53) When first opened, the ``userfaultfd`` must be enabled by invoking the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  54) ``UFFDIO_API`` ioctl with ``uffdio_api.api`` set to ``UFFD_API`` (or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  55) a later API version), which specifies the ``read/POLLIN`` protocol
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  56) userland intends to speak on the ``UFFD``, and the ``uffdio_api.features``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  57) userland requires. The ``UFFDIO_API`` ioctl, if successful (i.e. if the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  58) requested ``uffdio_api.api`` is also spoken by the running kernel and the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  59) requested features are going to be enabled), will return in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  60) ``uffdio_api.features`` and ``uffdio_api.ioctls`` two 64-bit bitmasks of,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  61) respectively, all the available features of the read(2) protocol and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  62) the generic ioctls available.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  63) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  64) The ``uffdio_api.features`` bitmask returned by the ``UFFDIO_API`` ioctl
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  65) defines what memory types are supported by the ``userfaultfd`` and what
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  66) events, except page fault notifications, may be generated:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  67) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  68) - The ``UFFD_FEATURE_EVENT_*`` flags indicate that various events
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  69)   other than page faults are supported. These events are described in more
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  70)   detail below in the `Non-cooperative userfaultfd`_ section.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  71) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  72) - ``UFFD_FEATURE_MISSING_HUGETLBFS`` and ``UFFD_FEATURE_MISSING_SHMEM``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  73)   indicate that the kernel supports ``UFFDIO_REGISTER_MODE_MISSING``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  74)   registrations for hugetlbfs and shared memory (covering all shmem APIs,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  75)   i.e. tmpfs, ``IPCSHM``, ``/dev/zero``, ``MAP_SHARED``, ``memfd_create``,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  76)   etc) virtual memory areas, respectively.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  77) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  78) - ``UFFD_FEATURE_MINOR_HUGETLBFS`` indicates that the kernel supports
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  79)   ``UFFDIO_REGISTER_MODE_MINOR`` registration for hugetlbfs virtual memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  80)   areas. ``UFFD_FEATURE_MINOR_SHMEM`` is the analogous feature indicating
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  81)   support for shmem virtual memory areas.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  82) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  83) The userland application should set the feature flags it intends to use
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  84) when invoking the ``UFFDIO_API`` ioctl, to request that those features be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  85) enabled if supported.
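
The open-then-enable handshake described above can be sketched as follows.
This is only an illustration: ``uffd_open`` is a hypothetical helper name, the
raw ``syscall()`` is used because a libc wrapper is not assumed to exist, and
the call may fail on systems where ``userfaultfd`` is unavailable to the
caller.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Open a userfaultfd and perform the UFFDIO_API handshake.
 * want_features holds the UFFD_FEATURE_* bits to request (0 for none).
 * On success the descriptor is returned and, if api_out is not NULL,
 * the kernel's reply (features and ioctls bitmasks) is stored there.
 * On failure -1 is returned with errno set.
 */
int uffd_open(uint64_t want_features, struct uffdio_api *api_out)
{
    struct uffdio_api api;
    int fd = (int)syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);

    if (fd < 0)
        return -1;

    memset(&api, 0, sizeof(api));
    api.api = UFFD_API;
    api.features = want_features;
    if (ioctl(fd, UFFDIO_API, &api) < 0) {
        int saved = errno;      /* keep the ioctl error across close() */

        close(fd);
        errno = saved;
        return -1;
    }

    if (api_out)
        *api_out = api;
    return fd;
}
```

After a successful call, testing ``api_out->ioctls`` against
``(uint64_t)1 << _UFFDIO_REGISTER`` tells whether ``UFFDIO_REGISTER`` is
available.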
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  86) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  87) Once the ``userfaultfd`` API has been enabled the ``UFFDIO_REGISTER``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  88) ioctl should be invoked (if present in the returned ``uffdio_api.ioctls``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  89) bitmask) to register a memory range in the ``userfaultfd`` by setting the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  90) uffdio_register structure accordingly. The ``uffdio_register.mode``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  91) bitmask will specify to the kernel which kind of faults to track for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  92) the range. The ``UFFDIO_REGISTER`` ioctl will return the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  93) ``uffdio_register.ioctls`` bitmask of ioctls that are suitable to resolve
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  94) userfaults on the range registered. Not all ioctls will necessarily be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  95) supported for all memory types (e.g. anonymous memory vs. shmem vs.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  96) hugetlbfs), or all types of intercepted faults.
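
A registration for missing-page tracking might look like the sketch below.
The helper names ``uffd_register_missing`` and ``uffd_ioctl_supported`` are
hypothetical; the range is assumed to be page aligned and already mapped.

```c
#include <linux/userfaultfd.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/*
 * Register [addr, addr + len) for missing-page tracking.
 * Returns 0 on success and stores the bitmask of ioctls usable on the
 * range in *ioctls_out (if not NULL); returns -1 with errno set on failure.
 */
int uffd_register_missing(int uffd, void *addr, size_t len,
                          uint64_t *ioctls_out)
{
    struct uffdio_register reg;

    memset(&reg, 0, sizeof(reg));
    reg.range.start = (uintptr_t)addr;
    reg.range.len = len;
    reg.mode = UFFDIO_REGISTER_MODE_MISSING;
    if (ioctl(uffd, UFFDIO_REGISTER, &reg) < 0)
        return -1;

    if (ioctls_out)
        *ioctls_out = reg.ioctls;
    return 0;
}

/* Test whether ioctl number nr (one of the _UFFDIO_* values) is in the mask. */
int uffd_ioctl_supported(uint64_t ioctls, unsigned int nr)
{
    return (int)((ioctls >> nr) & 1);
}
```

For instance, ``uffd_ioctl_supported(ioctls, _UFFDIO_ZEROPAGE)`` can be checked
before relying on ``UFFDIO_ZEROPAGE`` for a given memory type.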
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  97) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  98) Userland can use the ``uffdio_register.ioctls`` to manage the virtual
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  99) address space in the background (to add or potentially also remove
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 100) memory from the ``userfaultfd`` registered range). This means a userfault
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 101) could trigger just before userland maps the user-faulted page in the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 102) background.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 103) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 104) Resolving Userfaults
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 105) --------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 106) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 107) There are three basic ways to resolve userfaults:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 108) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 109) - ``UFFDIO_COPY`` atomically copies some existing page contents from
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 110)   userspace.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 111) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 112) - ``UFFDIO_ZEROPAGE`` atomically zeros the new page.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 113) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 114) - ``UFFDIO_CONTINUE`` maps an existing, previously-populated page.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 115) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 116) These operations are atomic in the sense that they guarantee nothing can
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 117) see a half-populated page, since readers will keep userfaulting until the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 118) operation has finished.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 119) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 120) By default, these wake up userfaults blocked on the range in question.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 121) They support a ``UFFDIO_*_MODE_DONTWAKE`` ``mode`` flag, which indicates
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 122) that waking will be done separately at some later time.
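
As a sketch of deferred waking, the two hypothetical helpers below install
pages with ``UFFDIO_COPY_MODE_DONTWAKE`` and later wake the waiters with a
separate ``UFFDIO_WAKE`` call. The addresses and lengths are assumed to be
page aligned and inside a registered range.

```c
#include <linux/userfaultfd.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/* Copy len bytes from src to dst without waking the faulting threads. */
int uffd_copy_nowake(int uffd, uint64_t dst, uint64_t src, uint64_t len)
{
    struct uffdio_copy copy;

    memset(&copy, 0, sizeof(copy));
    copy.dst = dst;
    copy.src = src;
    copy.len = len;
    copy.mode = UFFDIO_COPY_MODE_DONTWAKE;
    return ioctl(uffd, UFFDIO_COPY, &copy);
}

/* Wake the userfaults blocked on [start, start + len). */
int uffd_wake(int uffd, uint64_t start, uint64_t len)
{
    struct uffdio_range range;

    range.start = start;
    range.len = len;
    return ioctl(uffd, UFFDIO_WAKE, &range);
}
```

A handler could call ``uffd_copy_nowake()`` for several pages, finish its own
bookkeeping, and only then call ``uffd_wake()`` once for the whole range.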
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 123) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 124) Which ioctl to choose depends on the kind of page fault, and what we'd
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 125) like to do to resolve it:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 126) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 127) - For ``UFFDIO_REGISTER_MODE_MISSING`` faults, the fault needs to be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 128)   resolved by either providing a new page (``UFFDIO_COPY``), or mapping
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 129)   the zero page (``UFFDIO_ZEROPAGE``). By default, the kernel would map
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 130)   the zero page for a missing fault. With userfaultfd, userspace can
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 131)   decide what content to provide before the faulting thread continues.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 132) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 133) - For ``UFFDIO_REGISTER_MODE_MINOR`` faults, there is an existing page (in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 134)   the page cache). Userspace has the option of modifying the page's
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 135)   contents before resolving the fault. Once the contents are correct
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 136)   (modified or not), userspace asks the kernel to map the page and let the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 137)   faulting thread continue with ``UFFDIO_CONTINUE``.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 138) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 139) Notes:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 140) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 141) - You can tell which kind of fault occurred by examining
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 142)   ``pagefault.flags`` within the ``uffd_msg``, checking for the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 143)   ``UFFD_PAGEFAULT_FLAG_*`` flags.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 144) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 145) - None of the page-delivering ioctls default to the range that you
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 146)   registered with.  You must fill in all fields for the appropriate
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 147)   ioctl struct including the range.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 148) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 149) - You get the address of the access that triggered the missing page
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 150)   event out of a struct uffd_msg that you read in the thread from the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 151)   uffd.  You can supply as many pages as you want with these ioctls.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 152)   Keep in mind that unless you used ``DONTWAKE``, the first of any of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 153)   those ioctls wakes up the faulting thread.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 154) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 155) - Be sure to test for all errors including
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 156)   (``pollfd[0].revents & POLLERR``).  This can happen, e.g. when ranges
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 157)   supplied were incorrect.
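
The notes above can be put together in a minimal handler sketch. The names
``uffd_page_base`` and ``uffd_handle_one`` are hypothetical. The sketch assumes
the uffd was opened with ``O_NONBLOCK``, the range was registered with
``UFFDIO_REGISTER_MODE_MISSING``, and ``src_page`` points at one page of
content to install.

```c
#include <errno.h>
#include <linux/userfaultfd.h>
#include <poll.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Round addr down to its page start; page_size must be a power of two. */
uint64_t uffd_page_base(uint64_t addr, uint64_t page_size)
{
    return addr & ~(page_size - 1);
}

/*
 * Wait up to timeout_ms for one message and resolve a page fault by
 * copying src_page over the faulting page.
 * Returns 1 if a fault was resolved, 0 if there was nothing to do,
 * -1 on error (including POLLERR).
 */
int uffd_handle_one(int uffd, const void *src_page, uint64_t page_size,
                    int timeout_ms)
{
    struct pollfd pfd = { .fd = uffd, .events = POLLIN };
    struct uffd_msg msg;
    struct uffdio_copy copy;
    ssize_t n;
    int ready = poll(&pfd, 1, timeout_ms);

    if (ready < 0 || (ready > 0 && (pfd.revents & POLLERR)))
        return -1;
    if (ready == 0)
        return 0;

    n = read(uffd, &msg, sizeof(msg));
    if (n < 0)
        return errno == EAGAIN ? 0 : -1;    /* EAGAIN: already resolved */
    if (n != (ssize_t)sizeof(msg))
        return -1;
    if (msg.event != UFFD_EVENT_PAGEFAULT)
        return 0;                           /* other events: not handled here */

    /* The range is never implied; every field must be filled in. */
    memset(&copy, 0, sizeof(copy));
    copy.dst = uffd_page_base(msg.arg.pagefault.address, page_size);
    copy.src = (uintptr_t)src_page;
    copy.len = page_size;
    copy.mode = 0;                          /* wake the faulting thread */
    if (ioctl(uffd, UFFDIO_COPY, &copy) < 0 && errno != EEXIST)
        return -1;
    return 1;
}
```

Treating ``EEXIST`` as success is a design choice of this sketch: it covers the
case where another thread has already mapped the page.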
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 158) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 159) Write Protect Notifications
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 160) ---------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 161) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 162) This is equivalent to (but faster than) using mprotect and a SIGSEGV
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 163) signal handler.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 164) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 165) First, you need to register a range with ``UFFDIO_REGISTER_MODE_WP``.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 166) Instead of using mprotect(2) you use
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 167) ``ioctl(uffd, UFFDIO_WRITEPROTECT, struct uffdio_writeprotect *)``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 168) with ``mode = UFFDIO_WRITEPROTECT_MODE_WP``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 169) in the struct passed in.  The range does not default to and does not
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 170) have to be identical to the range you registered with.  You can write
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 171) protect as many ranges as you like (inside the registered range).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 172) Then, in the thread reading from uffd the struct will have
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 173) ``msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP`` set. Now you send
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 174) ``ioctl(uffd, UFFDIO_WRITEPROTECT, struct uffdio_writeprotect *)``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 175) again, this time with ``UFFDIO_WRITEPROTECT_MODE_WP`` cleared in the struct's
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 176) ``mode``. This wakes up the thread, which then continues with its write. This
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 177) allows you to do the bookkeeping about the write in the uffd reading
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 178) thread before the ioctl.
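
The protect, detect and unprotect steps can be sketched with two small
helpers. ``uffd_set_wp`` and ``uffd_is_wp_fault`` are hypothetical names, and
the range is assumed to lie inside an area registered with
``UFFDIO_REGISTER_MODE_WP``.

```c
#include <linux/userfaultfd.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/*
 * Set (protect != 0) or clear (protect == 0) write protection on
 * [start, start + len). Clearing also wakes the blocked writer.
 */
int uffd_set_wp(int uffd, uint64_t start, uint64_t len, int protect)
{
    struct uffdio_writeprotect wp;

    memset(&wp, 0, sizeof(wp));
    wp.range.start = start;
    wp.range.len = len;
    wp.mode = protect ? UFFDIO_WRITEPROTECT_MODE_WP : 0;
    return ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
}

/* Nonzero if msg reports a write to a write-protected page. */
int uffd_is_wp_fault(const struct uffd_msg *msg)
{
    return msg->event == UFFD_EVENT_PAGEFAULT &&
           (msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP) != 0;
}
```

A reader thread would typically check ``uffd_is_wp_fault()``, record the
written page, and then call ``uffd_set_wp(uffd, page, page_size, 0)`` to let
the writer proceed.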
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 179) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 180) If you registered with both ``UFFDIO_REGISTER_MODE_MISSING`` and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 181) ``UFFDIO_REGISTER_MODE_WP`` then you need to think about the sequence in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 182) which you supply a page and undo write protect.  Note that there is a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 183) difference between writes into a WP area and into a !WP area.  The
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 184) former will have ``UFFD_PAGEFAULT_FLAG_WP`` set, the latter
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 185) ``UFFD_PAGEFAULT_FLAG_WRITE``.  The latter did not fail on protection but
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 186) you still need to supply a page when ``UFFDIO_REGISTER_MODE_MISSING`` was
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 187) used.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 188) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 189) QEMU/KVM
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 190) ========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 191) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 192) QEMU/KVM uses the ``userfaultfd`` syscall to implement postcopy live
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 193) migration. Postcopy live migration is one form of memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 194) externalization consisting of a virtual machine running with part or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 195) all of its memory residing on a different node in the cloud. The
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 196) ``userfaultfd`` abstraction is generic enough that not a single line of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 197) KVM kernel code had to be modified in order to add postcopy live
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 198) migration to QEMU.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 199) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 200) Guest async page faults, ``FOLL_NOWAIT`` and all other ``GUP*`` features work
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 201) just fine in combination with userfaults. Userfaults trigger async
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 202) page faults in the guest scheduler so those guest processes that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 203) aren't waiting for userfaults (i.e. network bound) can keep running in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 204) the guest vcpus.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 205) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 206) It is generally beneficial to run one pass of precopy live migration
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 207) just before starting postcopy live migration, in order to avoid
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 208) generating userfaults for readonly guest regions.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 209) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 210) The implementation of postcopy live migration currently uses one
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 211) single bidirectional socket but in the future two different sockets
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 212) will be used (to reduce the latency of the userfaults to the minimum
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 213) possible without having to decrease ``/proc/sys/net/ipv4/tcp_wmem``).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 214) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 215) The QEMU in the source node writes all pages that it knows are missing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 216) in the destination node, into the socket, and the migration thread of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 217) the QEMU running in the destination node runs ``UFFDIO_COPY|ZEROPAGE``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 218) ioctls on the ``userfaultfd`` in order to map the received pages into the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 219) guest (``UFFDIO_ZEROPAGE`` is used if the source page was a zero page).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 220) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 221) A different postcopy thread in the destination node listens with
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 222) poll() to the ``userfaultfd`` in parallel. When a ``POLLIN`` event is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 223) generated after a userfault triggers, the postcopy thread read()s from
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 224) the ``userfaultfd`` and receives the fault address (or ``-EAGAIN`` in case the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 225) userfault was already resolved and woken by a ``UFFDIO_COPY|ZEROPAGE`` run
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 226) by the parallel QEMU migration thread).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 227) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 228) After the QEMU postcopy thread (running in the destination node) gets
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 229) the userfault address it writes the information about the missing page
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 230) into the socket. The QEMU source node receives the information and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 231) roughly "seeks" to that page address and continues sending all
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 232) remaining missing pages from that new page offset. Soon after that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 233) (just the time to flush the tcp_wmem queue through the network) the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 234) migration thread in the QEMU running in the destination node will
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 235) receive the page that triggered the userfault and it'll map it as
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 236) usual with the ``UFFDIO_COPY|ZEROPAGE`` (without actually knowing if it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 237) was spontaneously sent by the source or if it was an urgent page
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 238) requested through a userfault).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 239) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 240) By the time the userfaults start, the QEMU in the destination node
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 241) doesn't need to keep any per-page state bitmap relative to the live
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 242) migration around and a single per-page bitmap has to be maintained in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 243) the QEMU running in the source node to know which pages are still
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 244) missing in the destination node. The bitmap in the source node is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 245) checked to find which missing pages to send in round robin and we seek
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 246) over it when receiving incoming userfaults. After sending each page of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 247) course the bitmap is updated accordingly. It's also useful to avoid
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 248) sending the same page twice (in case the userfault is read by the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 249) postcopy thread just before ``UFFDIO_COPY|ZEROPAGE`` runs in the migration
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 250) thread).
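
The source-side bookkeeping described above (a per-page bitmap scanned in
round robin, with a "seek" when a userfault request arrives) can be
illustrated with the sketch below. It is not QEMU's actual code; the type
``struct missing_map`` and the functions ``missing_seek`` and
``missing_next`` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

struct missing_map {
    uint64_t *bits;     /* bit i set: page i is still missing remotely */
    size_t npages;      /* number of pages tracked */
    size_t cursor;      /* next page index the scan will look at */
};

static int mm_test(const struct missing_map *m, size_t i)
{
    return (int)((m->bits[i / 64] >> (i % 64)) & 1);
}

static void mm_clear(struct missing_map *m, size_t i)
{
    m->bits[i / 64] &= ~((uint64_t)1 << (i % 64));
}

/* A userfault request arrived for this page: continue scanning from it. */
void missing_seek(struct missing_map *m, size_t page)
{
    if (page < m->npages)
        m->cursor = page;
}

/*
 * Pick the next missing page at or after the cursor, wrapping around,
 * and mark it as sent. Returns the page index, or -1 if none is missing.
 */
long missing_next(struct missing_map *m)
{
    size_t n;

    for (n = 0; n < m->npages; n++) {
        size_t i = (m->cursor + n) % m->npages;

        if (mm_test(m, i)) {
            mm_clear(m, i);
            m->cursor = (i + 1) % m->npages;
            return (long)i;
        }
    }
    return -1;
}
```

Because a page's bit is cleared as soon as it is picked, a later userfault
request for an already-sent page simply moves the cursor and does not cause
the page to be sent twice.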
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 251) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 252) Non-cooperative userfaultfd
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 253) ===========================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 254) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 255) When the ``userfaultfd`` is monitored by an external manager, the manager
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 256) must be able to track changes in the process virtual memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 257) layout. Userfaultfd can notify the manager about such changes using
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 258) the same read(2) protocol as for the page fault notifications. The
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 259) manager has to explicitly enable these events by setting appropriate
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 260) bits in ``uffdio_api.features`` passed to ``UFFDIO_API`` ioctl:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 261) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 262) ``UFFD_FEATURE_EVENT_FORK``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 263) 	enable ``userfaultfd`` hooks for fork(). When this feature is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 264) 	enabled, the ``userfaultfd`` context of the parent process is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 265) 	duplicated into the newly created process. The manager
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 266) 	receives ``UFFD_EVENT_FORK`` with the file descriptor of the new
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 267) 	``userfaultfd`` context in the ``uffd_msg.fork``.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 268) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 269) ``UFFD_FEATURE_EVENT_REMAP``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 270) 	enable notifications about mremap() calls. When the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 271) 	non-cooperative process moves a virtual memory area to a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 272) 	different location, the manager will receive
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 273) 	``UFFD_EVENT_REMAP``. The ``uffd_msg.remap`` will contain the old and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 274) 	new addresses of the area and its original length.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 275) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 276) ``UFFD_FEATURE_EVENT_REMOVE``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 277) 	enable notifications about madvise(MADV_REMOVE) and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 278) 	madvise(MADV_DONTNEED) calls. The event ``UFFD_EVENT_REMOVE`` will
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 279) 	be generated upon these calls to madvise(). The ``uffd_msg.remove``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 280) 	will contain start and end addresses of the removed area.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 281) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 282) ``UFFD_FEATURE_EVENT_UNMAP``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 283) 	enable notifications about memory unmapping. The manager will
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 284) 	get ``UFFD_EVENT_UNMAP`` with ``uffd_msg.remove`` containing start and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 285) 	end addresses of the unmapped area.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 286) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 287) Although ``UFFD_FEATURE_EVENT_REMOVE`` and ``UFFD_FEATURE_EVENT_UNMAP``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 288) are quite similar, they differ in the action expected from the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 289) ``userfaultfd`` manager. In the former case, the virtual memory is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 290) removed, but the area is not: the area remains monitored by the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 291) ``userfaultfd``, and if a page fault occurs in that area it will be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 292) delivered to the manager. The proper resolution for such a page fault is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 293) to zeromap the faulting address. However, in the latter case, when an
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 294) area is unmapped, either explicitly (with the munmap() system call), or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 295) implicitly (e.g. during mremap()), the area is removed and in turn the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 296) ``userfaultfd`` context for such area disappears too and the manager will
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 297) not get further userland page faults from the removed area. Still, the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 298) notification is required in order to prevent the manager from using
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 299) ``UFFDIO_COPY`` on the unmapped area.
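
A manager's message dispatch could be organized along these lines. This is a
sketch only: ``enum mgr_action`` and ``mgr_classify`` are hypothetical, and the
actions merely name what the text above expects the manager to do for each
event.

```c
#include <linux/userfaultfd.h>

enum mgr_action {
    MGR_RESOLVE_FAULT,  /* page fault: resolve it with a UFFDIO_* ioctl */
    MGR_TRACK_CHILD,    /* fork: start monitoring msg->arg.fork.ufd */
    MGR_MOVE_RANGE,     /* mremap: update from/to/len bookkeeping */
    MGR_ZEROMAP_LATER,  /* madvise removal: area stays monitored */
    MGR_FORGET_RANGE,   /* unmap: stop issuing UFFDIO_COPY on the area */
    MGR_IGNORE          /* unknown event */
};

/* Map a message read from the userfaultfd to the expected manager action. */
enum mgr_action mgr_classify(const struct uffd_msg *msg)
{
    switch (msg->event) {
    case UFFD_EVENT_PAGEFAULT:
        return MGR_RESOLVE_FAULT;
    case UFFD_EVENT_FORK:
        return MGR_TRACK_CHILD;
    case UFFD_EVENT_REMAP:
        return MGR_MOVE_RANGE;
    case UFFD_EVENT_REMOVE:
        return MGR_ZEROMAP_LATER;
    case UFFD_EVENT_UNMAP:
        return MGR_FORGET_RANGE;
    default:
        return MGR_IGNORE;
    }
}
```

For ``UFFD_EVENT_REMOVE`` and ``UFFD_EVENT_UNMAP`` the affected range is found
in ``msg->arg.remove.start`` and ``msg->arg.remove.end``.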
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 300) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 301) Unlike userland page faults, which have to be synchronous and require
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 302) explicit or implicit wakeup, all the events are delivered
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 303) asynchronously and the non-cooperative process resumes execution as
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 304) soon as the manager executes read(). The ``userfaultfd`` manager should
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 305) carefully synchronize calls to ``UFFDIO_COPY`` with the events
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 306) processing. To aid the synchronization, the ``UFFDIO_COPY`` ioctl will
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 307) return ``-ENOSPC`` when the monitored process exits at the time of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 308) ``UFFDIO_COPY``, and ``-ENOENT`` when the non-cooperative process has changed
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 309) its virtual memory layout simultaneously with an outstanding ``UFFDIO_COPY``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 310) operation.
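
One way a manager might act on these return codes is sketched below.
``enum copy_outcome`` and ``uffd_copy_outcome`` are hypothetical names. From
userspace the ioctl returns -1 and the code is found in ``errno``. Treating
``ESRCH`` like ``ENOSPC`` is an assumption of this sketch, based on the
ioctl_userfaultfd(2) man page listing it for the exited-process case on later
kernels.

```c
#include <errno.h>

enum copy_outcome {
    COPY_DONE,                  /* page installed */
    COPY_RETRY_AFTER_EVENTS,    /* layout changed: process pending events */
    COPY_TARGET_GONE,           /* monitored process has exited */
    COPY_FAILED                 /* any other error */
};

/* Classify the result of ioctl(uffd, UFFDIO_COPY, ...) given its errno. */
enum copy_outcome uffd_copy_outcome(int ret, int err)
{
    if (ret == 0)
        return COPY_DONE;

    switch (err) {
    case ENOENT:
        return COPY_RETRY_AFTER_EVENTS;
    case ENOSPC:
    case ESRCH:                 /* assumption: newer kernels, see lead-in */
        return COPY_TARGET_GONE;
    default:
        return COPY_FAILED;
    }
}
```

A typical call site would be
``uffd_copy_outcome(ioctl(uffd, UFFDIO_COPY, &copy), errno)``, followed by
draining the event queue with read() when ``COPY_RETRY_AFTER_EVENTS`` is
returned.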
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 311) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 312) The current asynchronous model of the event delivery is optimal for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 313) single-threaded non-cooperative ``userfaultfd`` manager implementations. A
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 314) synchronous event delivery model can be added later as a new
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 315) ``userfaultfd`` feature to facilitate multithreading enhancements of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 316) non-cooperative manager, for example to allow ``UFFDIO_COPY`` ioctls to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 317) run in parallel to the event reception. Single-threaded
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 318) implementations should continue to use the current async event
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 319) delivery model instead.