.. _hmm:

=====================================
Heterogeneous Memory Management (HMM)
=====================================

Provide infrastructure and helpers to integrate non-conventional memory (device
memory like GPU on-board memory) into regular kernel paths, with the cornerstone
of this being a specialized struct page for such memory (see sections 5 to 7 of
this document).

HMM also provides optional helpers for SVM (Shared Virtual Memory), i.e.,
allowing a device to transparently access program addresses coherently with
the CPU, meaning that any valid pointer on the CPU is also a valid pointer
for the device. This is becoming mandatory to simplify the use of advanced
heterogeneous computing where GPUs, DSPs, or FPGAs are used to perform various
computations on behalf of a process.

This document is divided as follows: in the first section I expose the problems
related to using device specific memory allocators. In the second section, I
expose the hardware limitations that are inherent to many platforms. The third
section gives an overview of the HMM design. The fourth section explains how
CPU page-table mirroring works and the purpose of HMM in this context. The
fifth section deals with how device memory is represented inside the kernel.
Finally, the last section presents a new migration helper that allows
leveraging the device DMA engine.

.. contents:: :local:

Problems of using a device specific memory allocator
====================================================

Devices with a large amount of on-board memory (several gigabytes) like GPUs
have historically managed their memory through dedicated driver specific APIs.
This creates a disconnect between memory allocated and managed by a device
driver and regular application memory (private anonymous, shared memory, or
regular file backed memory). From here on I will refer to this aspect as split
address space. I use shared address space to refer to the opposite situation:
i.e., one in which any application memory region can be used by a device
transparently.

Split address space happens because devices can only access memory allocated
through a device specific API. This implies that all memory objects in a program
are not equal from the device point of view, which complicates large programs
that rely on a wide set of libraries.

Concretely, this means that code that wants to leverage devices like GPUs needs
to copy objects between generically allocated memory (malloc, mmap private,
mmap shared) and memory allocated through the device driver API (this still
ends up with an mmap, but of the device file).

For flat data sets (array, grid, image, ...) this isn't too hard to achieve, but
for complex data sets (list, tree, ...) it's hard to get right. Duplicating a
complex data set requires re-mapping all the pointer relations between each of
its elements. This is error prone, and programs get harder to debug because of
the duplicated data set and addresses.

Split address space also means that libraries cannot transparently use data
they are getting from the core program or another library and thus each library
might have to duplicate its input data set using the device specific memory
allocator. Large projects suffer from this and waste resources because of the
various memory copies.

Duplicating each library API to accept as input or output memory allocated by
each device specific allocator is not a viable option. It would lead to a
combinatorial explosion in the library entry points.

Finally, with the advance of high level language constructs (in C++ but in
other languages too) it is now possible for the compiler to leverage GPUs and
other devices without programmer knowledge. Some compiler-identified patterns
are only doable with a shared address space. It is also more reasonable to use
a shared address space for all other patterns.


I/O bus, device memory characteristics
======================================

I/O buses cripple shared address spaces due to a few limitations. Most I/O
buses only allow basic memory access from device to main memory; even cache
coherency is often optional. Access to device memory from a CPU is even more
limited. More often than not, it is not cache coherent.

If we only consider the PCIE bus, then a device can access main memory (often
through an IOMMU) and be cache coherent with the CPUs. However, it only allows
a limited set of atomic operations from the device on main memory. This is worse
in the other direction: the CPU can only access a limited range of the device
memory and cannot perform atomic operations on it. Thus device memory cannot
be considered the same as regular memory from the kernel point of view.

Another crippling factor is the limited bandwidth (~32GBytes/s with PCIE 4.0
and 16 lanes). This is 33 times less than the fastest GPU memory (1 TBytes/s).
The final limitation is latency. Access to main memory from the device has an
order of magnitude higher latency than when the device accesses its own memory.

Some platforms are developing new I/O buses or additions/modifications to PCIE
to address some of these limitations (OpenCAPI, CCIX). They mainly allow
two-way cache coherency between CPU and device and allow all atomic operations the
architecture supports. Sadly, not all platforms are following this trend and
some major architectures are left without hardware solutions to these problems.

So for shared address space to make sense, not only must we allow devices to
access any memory but we must also permit any memory to be migrated to device
memory while the device is using it (blocking CPU access while it happens).


Shared address space and migration
==================================

HMM intends to provide two main features. The first one is to share the address
space by duplicating the CPU page table in the device page table so the same
address points to the same physical memory for any valid main memory address in
the process address space.

To achieve this, HMM offers a set of helpers to populate the device page table
while keeping track of CPU page table updates. Device page table updates are
not as easy as CPU page table updates. To update the device page table, you must
allocate a buffer (or use a pool of pre-allocated buffers) and write GPU
specific commands in it to perform the update (unmap, cache invalidations, and
flush, ...). This cannot be done through common code for all devices. Hence,
HMM provides helpers that factor out everything that can be factored out, while
leaving the hardware specific details to the device driver.

The second mechanism HMM provides is a new kind of ZONE_DEVICE memory that
allows allocating a struct page for each page of device memory. Those pages
are special because the CPU cannot map them. However, they allow migrating
main memory to device memory using existing migration mechanisms and everything
looks like a page that is swapped out to disk from the CPU point of view. Using a
struct page gives the easiest and cleanest integration with existing mm
mechanisms. Here again, HMM only provides helpers, first to hotplug new ZONE_DEVICE
memory for the device memory and second to perform migration. Policy decisions
of what and when to migrate are left to the device driver.

Note that any CPU access to a device page triggers a page fault and a migration
back to main memory. For example, when a page backing a given CPU address A is
migrated from a main memory page to a device page, then any CPU access to
address A triggers a page fault and initiates a migration back to main memory.

With these two features, HMM not only allows a device to mirror a process
address space, keeping both CPU and device page tables synchronized, but also
leverages device memory by migrating the part of the data set that is actively
being used by the device.


Address space mirroring implementation and API
==============================================

Address space mirroring's main objective is to allow duplication of a range of
the CPU page table into a device page table; HMM helps keep both synchronized. A
device driver that wants to mirror a process address space must start with the
registration of a mmu_interval_notifier::

 int mmu_interval_notifier_insert(struct mmu_interval_notifier *interval_sub,
                                  struct mm_struct *mm, unsigned long start,
                                  unsigned long length,
                                  const struct mmu_interval_notifier_ops *ops);

During the ops->invalidate() callback the device driver must perform the
update action to the range (mark range read only, or fully unmap, etc.). The
device must complete the update before the driver callback returns.

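As a concrete illustration, below is a minimal sketch of such an invalidate()
callback (the ``struct driver_svm`` container, its ``driver->update`` lock, and
the ``driver_unmap_range()`` helper are hypothetical driver-side names, not part
of the HMM API)::

 static bool driver_invalidate(struct mmu_interval_notifier *mni,
                               const struct mmu_notifier_range *range,
                               unsigned long cur_seq)
 {
      struct driver_svm *svm = container_of(mni, struct driver_svm, notifier);

      /* If we are not allowed to sleep here, ask the caller to retry in a
       * blockable context instead of contending on the driver lock. */
      if (!mmu_notifier_range_blockable(range))
          return false;

      take_lock(svm->driver->update);
      /* Record the new sequence number under the same lock that is used
       * with mmu_interval_read_retry() (see the pattern below). */
      mmu_interval_set_seq(mni, cur_seq);
      /* Hypothetical helper: drop the device page table entries covering
       * [range->start, range->end). */
      driver_unmap_range(svm, range->start, range->end);
      release_lock(svm->driver->update);
      return true;
 }
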
When the device driver wants to populate a range of virtual addresses, it can
use::

  int hmm_range_fault(struct hmm_range *range);

It will trigger a page fault on missing or read-only entries if write access is
requested (see below). Page faults use the generic mm page fault code path just
like a CPU page fault.

hmm_range_fault() copies CPU page table entries into its ``hmm_pfns`` array
argument. Each entry in that array corresponds to an address in the virtual
range. HMM provides a set of flags to help the driver identify special CPU
page table entries.

Locking within the ops->invalidate() callback is the most important aspect the
driver must respect in order to keep things properly synchronized. The usage
pattern is::

 int driver_populate_range(...)
 {
      struct hmm_range range;
      ...

      range.notifier = &interval_sub;
      range.start = ...;
      range.end = ...;
      range.hmm_pfns = ...;

      if (!mmget_not_zero(interval_sub.mm))
          return -EFAULT;

 again:
      range.notifier_seq = mmu_interval_read_begin(&interval_sub);
      mmap_read_lock(mm);
      ret = hmm_range_fault(&range);
      if (ret) {
          mmap_read_unlock(mm);
          if (ret == -EBUSY)
                 goto again;
          return ret;
      }
      mmap_read_unlock(mm);

      take_lock(driver->update);
      if (mmu_interval_read_retry(&interval_sub, range.notifier_seq)) {
          release_lock(driver->update);
          goto again;
      }

      /* Use the hmm_pfns array content to update the device page table,
       * under the update lock. */

      release_lock(driver->update);
      return 0;
 }

The driver->update lock is the same lock that the driver takes inside its
invalidate() callback. That lock must be held before calling
mmu_interval_read_retry() to avoid any race with a concurrent CPU page table
update.

Leverage default_flags and pfn_flags_mask
=========================================

The hmm_range struct has two fields, default_flags and pfn_flags_mask, that
specify fault or snapshot policy for the whole range instead of having to set
them for each entry in the hmm_pfns array.

For instance, if the device driver wants pages for a range with at least read
permission, it sets::

    range->default_flags = HMM_PFN_REQ_FAULT;
    range->pfn_flags_mask = 0;

and calls hmm_range_fault() as described above. This will fault in all pages
in the range with at least read permission.

Now let's say the driver wants to do the same except for one page in the range
for which it wants to have write permission. Now the driver sets::

    range->default_flags = HMM_PFN_REQ_FAULT;
    range->pfn_flags_mask = HMM_PFN_REQ_WRITE;
    range->hmm_pfns[index_of_write] = HMM_PFN_REQ_WRITE;

With this, HMM will fault in all pages with at least read permission (i.e.,
valid), and for the address == range->start + (index_of_write << PAGE_SHIFT)
it will fault with write permission, i.e., if the CPU pte does not have write
permission set then HMM will call handle_mm_fault().

After hmm_range_fault() completes, the flag bits are set to the current state
of the page tables, i.e., HMM_PFN_VALID | HMM_PFN_WRITE will be set if the
page is writable.

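As a hedged sketch of how a driver might consume that result, the loop below
walks the ``hmm_pfns`` array and converts each valid entry back to a
``struct page`` with hmm_pfn_to_page(); the ``driver_map_page()`` helper and
the ``dev`` handle are hypothetical driver-side names::

 unsigned long npages = (range.end - range.start) >> PAGE_SHIFT;
 unsigned long i;

 for (i = 0; i < npages; i++) {
      unsigned long entry = range.hmm_pfns[i];
      struct page *page;

      /* Skip holes: nothing is mapped at this address. */
      if (!(entry & HMM_PFN_VALID))
          continue;

      page = hmm_pfn_to_page(entry);
      /* Hypothetical helper: map the page into the device page table,
       * writable only if the CPU page table entry is writable. */
      driver_map_page(dev, range.start + (i << PAGE_SHIFT), page,
                      entry & HMM_PFN_WRITE);
 }
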

Represent and manage device memory from core kernel point of view
=================================================================

Several different designs were tried to support device memory. The first one
used a device specific data structure to keep information about migrated memory
and HMM hooked itself in various places of mm code to handle any access to
addresses that were backed by device memory. It turns out that this ended up
replicating most of the fields of struct page and also needed many kernel code
paths to be updated to understand this new kind of memory.

Most kernel code paths never try to access the memory behind a page
but only care about struct page contents. Because of this, HMM switched to
directly using struct page for device memory which left most kernel code paths
unaware of the difference. We only need to make sure that no one ever tries to
map those pages from the CPU side.

Migration to and from device memory
===================================

Because the CPU cannot access device memory directly, the device driver must
use hardware DMA or device specific load/store instructions to migrate data.
The migrate_vma_setup(), migrate_vma_pages(), and migrate_vma_finalize()
functions are designed to make drivers easier to write and to centralize common
code across drivers.

Before migrating pages to device private memory, special device private
``struct page`` entries need to be created. These will be used as special "swap"
page table entries so that a CPU process will fault if it tries to access
a page that has been migrated to device private memory.

These can be allocated and freed with::

    struct resource *res;
    struct dev_pagemap pagemap;

    res = request_free_mem_region(&iomem_resource, /* number of bytes */,
                                  "name of driver resource");
    pagemap.type = MEMORY_DEVICE_PRIVATE;
    pagemap.range.start = res->start;
    pagemap.range.end = res->end;
    pagemap.nr_range = 1;
    pagemap.ops = &device_devmem_ops;
    memremap_pages(&pagemap, numa_node_id());

    memunmap_pages(&pagemap);
    release_mem_region(pagemap.range.start, range_len(&pagemap.range));

There are also devm_request_free_mem_region(), devm_memremap_pages(),
devm_memunmap_pages(), and devm_release_mem_region() when the resources can
be tied to a ``struct device``.

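The ``device_devmem_ops`` referenced above is an ordinary
``struct dev_pagemap_ops``. A minimal sketch for MEMORY_DEVICE_PRIVATE memory
could look like the following (the ``driver_*`` helpers are hypothetical; the
real work of ``migrate_to_ram()`` is typically done with the migrate_vma_*
helpers described below)::

    static vm_fault_t driver_devmem_migrate_to_ram(struct vm_fault *vmf)
    {
        /* A CPU thread faulted on a device private page: migrate the page
         * back to system memory before the process can continue. */
        return driver_migrate_to_ram(vmf);
    }

    static void driver_devmem_page_free(struct page *page)
    {
        /* The last reference was dropped: hand the backing device memory
         * back to the driver's allocator. */
        driver_free_device_page(page);
    }

    static const struct dev_pagemap_ops device_devmem_ops = {
        .page_free      = driver_devmem_page_free,
        .migrate_to_ram = driver_devmem_migrate_to_ram,
    };
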
The overall migration steps are similar to migrating NUMA pages within system
memory (see :ref:`Page migration <page_migration>`) but the steps are split
between device driver specific code and shared common code (a condensed
driver-side sketch follows the numbered steps):

1. ``mmap_read_lock()``

   The device driver has to pass a ``struct vm_area_struct`` to
   migrate_vma_setup() so the mmap_read_lock() or mmap_write_lock() needs to
   be held for the duration of the migration.

2. ``migrate_vma_setup(struct migrate_vma *args)``

   The device driver initializes the ``struct migrate_vma`` fields and passes
   the pointer to migrate_vma_setup(). The ``args->flags`` field is used to
   filter which source pages should be migrated. For example, setting
   ``MIGRATE_VMA_SELECT_SYSTEM`` will only migrate system memory and
   ``MIGRATE_VMA_SELECT_DEVICE_PRIVATE`` will only migrate pages residing in
   device private memory. If the latter flag is set, the ``args->pgmap_owner``
   field is used to identify device private pages owned by the driver. This
   avoids trying to migrate device private pages residing in other devices.
   Currently only anonymous private VMA ranges can be migrated to or from
   system memory and device private memory.

   One of the first steps migrate_vma_setup() does is to invalidate other
   devices' MMUs with the ``mmu_notifier_invalidate_range_start()`` and
   ``mmu_notifier_invalidate_range_end()`` calls around the page table
   walks to fill in the ``args->src`` array with PFNs to be migrated.
   The ``invalidate_range_start()`` callback is passed a
   ``struct mmu_notifier_range`` with the ``event`` field set to
   ``MMU_NOTIFY_MIGRATE`` and the ``migrate_pgmap_owner`` field set to
   the ``args->pgmap_owner`` field passed to migrate_vma_setup(). This
   allows the device driver to skip the invalidation callback and only
   invalidate device private MMU mappings that are actually migrating.
   This is explained more in the next section.

   While walking the page tables, a ``pte_none()`` or ``is_zero_pfn()``
   entry results in a valid "zero" PFN stored in the ``args->src`` array.
   This lets the driver allocate device private memory and clear it instead
   of copying a page of zeros. Valid PTE entries to system memory or
   device private struct pages will be locked with ``lock_page()``, isolated
   from the LRU (if system memory, since device private pages are not on
   the LRU), unmapped from the process, and a special migration PTE is
   inserted in place of the original PTE.
   migrate_vma_setup() also clears the ``args->dst`` array.

3. The device driver allocates destination pages and copies source pages to
   destination pages.

   The driver checks each ``src`` entry to see if the ``MIGRATE_PFN_MIGRATE``
   bit is set and skips entries that are not migrating. The device driver
   can also choose to skip migrating a page by not filling in the ``dst``
   array for that page.

   The driver then allocates either a device private struct page or a
   system memory page, locks the page with ``lock_page()``, and fills in the
   ``dst`` array entry with::

     dst[i] = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;

   Now that the driver knows that this page is being migrated, it can
   invalidate device private MMU mappings and copy device private memory
   to system memory or another device private page. The core Linux kernel
   handles CPU page table invalidations so the device driver only has to
   invalidate its own MMU mappings.

   The driver can use ``migrate_pfn_to_page(src[i])`` to get the
   ``struct page`` of the source and either copy the source page to the
   destination or clear the destination device private memory if the pointer
   is ``NULL`` meaning the source page was not populated in system memory.

4. ``migrate_vma_pages()``

   This step is where the migration is actually "committed".

   If the source page was a ``pte_none()`` or ``is_zero_pfn()`` page, this
   is where the newly allocated page is inserted into the CPU's page table.
   This can fail if a CPU thread faults on the same page. However, the page
   table is locked and only one of the new pages will be inserted.
   The device driver will see that the ``MIGRATE_PFN_MIGRATE`` bit is cleared
   if it loses the race.

   If the source page was locked, isolated, etc., the source ``struct page``
   information is now copied to the destination ``struct page``, finalizing
   the migration on the CPU side.

5. Device driver updates device MMU page tables for pages still migrating,
   rolling back pages not migrating.

   If the ``src`` entry still has ``MIGRATE_PFN_MIGRATE`` bit set, the device
   driver can update the device MMU and set the write enable bit if the
   ``MIGRATE_PFN_WRITE`` bit is set.

6. ``migrate_vma_finalize()``

   This step replaces the special migration page table entry with the new
   page's page table entry and releases the reference to the source and
   destination ``struct page``.

7. ``mmap_read_unlock()``

   The lock can now be released.

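Putting the driver-side pieces of steps 2 to 6 together, a condensed sketch of
a migration to device memory might look like the following (the
``driver_alloc_device_page()`` and ``driver_copy_to_device()`` helpers and the
fixed 64 page batch are hypothetical simplifications; a real driver also has
to handle allocation and copy failures)::

    static int driver_migrate_to_device(struct vm_area_struct *vma,
                                        unsigned long start, unsigned long end,
                                        void *pgmap_owner)
    {
        /* Hypothetical fixed-size batch; real drivers size this to the range. */
        unsigned long src_pfns[64] = { 0 };
        unsigned long dst_pfns[64] = { 0 };
        struct migrate_vma args = {
            .vma         = vma,
            .start       = start,
            .end         = end,
            .src         = src_pfns,
            .dst         = dst_pfns,
            .pgmap_owner = pgmap_owner,
            .flags       = MIGRATE_VMA_SELECT_SYSTEM,
        };
        unsigned long i, npages = (end - start) >> PAGE_SHIFT;
        int ret;

        if (npages > ARRAY_SIZE(src_pfns))
            return -EINVAL;

        /* Step 1 is done by the caller: mmap_read_lock(vma->vm_mm) is held. */
        ret = migrate_vma_setup(&args);                  /* step 2 */
        if (ret)
            return ret;

        for (i = 0; i < npages; i++) {                   /* step 3 */
            struct page *dpage;

            if (!(args.src[i] & MIGRATE_PFN_MIGRATE))
                continue;                                /* not migrating */

            dpage = driver_alloc_device_page();          /* hypothetical */
            lock_page(dpage);
            /* Hypothetical DMA copy; the source page may be NULL for
             * pte_none()/zero pages, in which case the device memory is
             * simply cleared. */
            driver_copy_to_device(dpage, migrate_pfn_to_page(args.src[i]));
            args.dst[i] = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
        }

        migrate_vma_pages(&args);                        /* step 4 */
        /* Step 5: update the device page table for entries that still have
         * MIGRATE_PFN_MIGRATE set in args.src[i]. */
        migrate_vma_finalize(&args);                     /* step 6 */
        return 0;
    }
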
Memory cgroup (memcg) and rss accounting
========================================

For now, device memory is accounted as any regular page in rss counters (either
anonymous if the device page is used for anonymous memory, file if the device
page is used for file backed memory, or shmem if the device page is used for
shared memory). This is a deliberate choice to keep existing applications
running unimpacted if they start using device memory without knowing about it.

A drawback is that the OOM killer might kill an application using a lot of
device memory and not a lot of regular system memory, and thus killing it would
not free much system memory. We want to gather more real world experience on
how applications and the system react under memory pressure in the presence of
device memory before deciding to account device memory differently.


The same decision was made for the memory cgroup. Device memory pages are
accounted against the same memory cgroup that a regular page would be accounted
to. This does simplify migration to and from device memory. This also means
that migration back from device memory to regular memory cannot fail because
it would go above the memory cgroup limit. We might revisit this choice later
on once we get more experience in how device memory is used and its impact on
memory resource control.


Note that device memory can never be pinned by a device driver nor through GUP
and thus such memory is always freed upon process exit, or when the last
reference is dropped in the case of shared memory or file backed memory.