.. _hmm:

=====================================
Heterogeneous Memory Management (HMM)
=====================================

Provide infrastructure and helpers to integrate non-conventional memory (device
memory like GPU on board memory) into regular kernel path, with the cornerstone
of this being specialized struct page for such memory (see sections 5 to 7 of
this document).

HMM also provides optional helpers for SVM (Shared Virtual Memory), i.e.,
allowing a device to transparently access program addresses coherently with
the CPU, meaning that any valid pointer on the CPU is also a valid pointer
for the device. This is becoming mandatory to simplify the use of advanced
heterogeneous computing where GPU, DSP, or FPGA are used to perform various
computations on behalf of a process.

This document is divided as follows: in the first section I expose the problems
related to using device specific memory allocators. In the second section, I
expose the hardware limitations that are inherent to many platforms. The third
section gives an overview of the HMM design. The fourth section explains how
CPU page-table mirroring works and the purpose of HMM in this context. The
fifth section deals with how device memory is represented inside the kernel.
Finally, the last section presents a new migration helper that allows
leveraging the device DMA engine.

.. contents:: :local:

Problems of using a device specific memory allocator
====================================================

Devices with a large amount of on board memory (several gigabytes) like GPUs
have historically managed their memory through dedicated driver specific APIs.
This creates a disconnect between memory allocated and managed by a device
driver and regular application memory (private anonymous, shared memory, or
regular file backed memory). From here on I will refer to this aspect as split
address space. I use shared address space to refer to the opposite situation:
i.e., one in which any application memory region can be used by a device
transparently.

Split address space happens because devices can only access memory allocated
through a device specific API. This implies that memory objects in a program
are not all equal from the device point of view, which complicates large
programs that rely on a wide set of libraries.

Concretely, this means that code that wants to leverage devices like GPUs needs
to copy objects between generically allocated memory (malloc, mmap private, mmap
share) and memory allocated through the device driver API (this still ends up
with an mmap but of the device file).

For flat data sets (array, grid, image, ...) this isn't too hard to achieve but
for complex data sets (list, tree, ...) it's hard to get right. Duplicating a
complex data set requires re-mapping all the pointer relations between each of
its elements. This is error prone and programs get harder to debug because of
the duplicate data set and addresses.

Split address space also means that libraries cannot transparently use data
they are getting from the core program or another library and thus each library
might have to duplicate its input data set using the device specific memory
allocator. Large projects suffer from this and waste resources because of the
various memory copies.

Duplicating each library API to accept as input or output memory allocated by
each device specific allocator is not a viable option. It would lead to a
combinatorial explosion in the library entry points.

Finally, with the advance of high level language constructs (in C++ but in
other languages too) it is now possible for the compiler to leverage GPUs and
other devices without programmer knowledge. Some compiler identified patterns
are only feasible with a shared address space. It is also more reasonable to
use a shared address space for all other patterns.


I/O bus, device memory characteristics
======================================

I/O buses cripple shared address spaces due to a few limitations. Most I/O
buses only allow basic memory access from device to main memory; even cache
coherency is often optional. Access to device memory from a CPU is even more
limited. More often than not, it is not cache coherent.

If we only consider the PCIE bus, then a device can access main memory (often
through an IOMMU) and be cache coherent with the CPUs. However, it only allows
a limited set of atomic operations from the device on main memory. This is
worse in the other direction: the CPU can only access a limited range of the
device memory and cannot perform atomic operations on it. Thus device memory
cannot be considered the same as regular memory from the kernel point of view.

Another crippling factor is the limited bandwidth (~32GBytes/s with PCIE 4.0
and 16 lanes). This is 33 times less than the fastest GPU memory (1 TBytes/s).
The final limitation is latency. Access to main memory from the device has an
order of magnitude higher latency than when the device accesses its own memory.

Some platforms are developing new I/O buses or additions/modifications to PCIE
to address some of these limitations (OpenCAPI, CCIX). They mainly allow
two-way cache coherency between CPU and device and allow all atomic operations
the architecture supports. Sadly, not all platforms are following this trend
and some major architectures are left without hardware solutions to these
problems.

So for shared address space to make sense, not only must we allow devices to
access any memory but we must also permit any memory to be migrated to device
memory while the device is using it (blocking CPU access while it happens).


Shared address space and migration
==================================

HMM intends to provide two main features. The first one is to share the address
space by duplicating the CPU page table in the device page table so the same
address points to the same physical memory for any valid main memory address in
the process address space.

To achieve this, HMM offers a set of helpers to populate the device page table
while keeping track of CPU page table updates. Device page table updates are
not as easy as CPU page table updates. To update the device page table, you
must allocate a buffer (or use a pool of pre-allocated buffers) and write GPU
specific commands in it to perform the update (unmap, cache invalidations, and
flush, ...). This cannot be done through common code for all devices. This is
why HMM provides helpers to factor out everything that can be factored out,
while leaving the hardware specific details to the device driver.

The second mechanism HMM provides is a new kind of ZONE_DEVICE memory that
allows allocating a struct page for each page of device memory. Those pages
are special because the CPU cannot map them. However, they allow migrating
main memory to device memory using existing migration mechanisms, and from the
CPU point of view everything looks like a page that is swapped out to disk.
Using a struct page gives the easiest and cleanest integration with existing
mm mechanisms. Here again, HMM only provides helpers, first to hotplug new
ZONE_DEVICE memory for the device memory and second to perform migration.
Policy decisions of what and when to migrate are left to the device driver.

Note that any CPU access to a device page triggers a page fault and a migration
back to main memory. For example, when a page backing a given CPU address A is
migrated from a main memory page to a device page, then any CPU access to
address A triggers a page fault and initiates a migration back to main memory.

With these two features, HMM not only allows a device to mirror process
address space, keeping both CPU and device page tables synchronized, but also
leverages device memory by migrating the part of the data set that is actively
being used by the device.


Address space mirroring implementation and API
==============================================

Address space mirroring's main objective is to allow duplication of a range of
the CPU page table into a device page table; HMM helps keep both synchronized.
A device driver that wants to mirror a process address space must start with
the registration of a mmu_interval_notifier::

  int mmu_interval_notifier_insert(struct mmu_interval_notifier *interval_sub,
                                   struct mm_struct *mm, unsigned long start,
                                   unsigned long length,
                                   const struct mmu_interval_notifier_ops *ops);

During the ops->invalidate() callback the device driver must perform the
update action to the range (mark range read only, or fully unmap, etc.). The
device must complete the update before the driver callback returns.

When the device driver wants to populate a range of virtual addresses, it can
use::

  int hmm_range_fault(struct hmm_range *range);

It will trigger a page fault on missing or read-only entries if write access is
requested (see below). Page faults use the generic mm page fault code path just
like a CPU page fault.

hmm_range_fault() copies CPU page table entries into its ``hmm_pfns`` array
argument. Each entry in that array corresponds to an address in the virtual
range. HMM provides a set of flags to help the driver identify special CPU
page table entries.

Locking within the invalidate() callback is the most important aspect the
driver must respect in order to keep things properly synchronized. The usage
pattern is::

  int driver_populate_range(...)
  {
       struct hmm_range range;
       ...

       range.notifier = &interval_sub;
       range.start = ...;
       range.end = ...;
       range.hmm_pfns = ...;

       if (!mmget_not_zero(interval_sub->notifier.mm))
            return -EFAULT;

  again:
       range.notifier_seq = mmu_interval_read_begin(&interval_sub);
       mmap_read_lock(mm);
       ret = hmm_range_fault(&range);
       if (ret) {
            mmap_read_unlock(mm);
            if (ret == -EBUSY)
                 goto again;
            return ret;
       }
       mmap_read_unlock(mm);

       take_lock(driver->update);
       if (mmu_interval_read_retry(&interval_sub, range.notifier_seq)) {
            release_lock(driver->update);
            goto again;
       }

       /* Use pfns array content to update device page table,
        * under the update lock */

       release_lock(driver->update);
       return 0;
  }

The driver->update lock is the same lock that the driver takes inside its
invalidate() callback. That lock must be held before calling
mmu_interval_read_retry() to avoid any race with a concurrent CPU page table
update.

Leverage default_flags and pfn_flags_mask
=========================================

The hmm_range struct has two fields, default_flags and pfn_flags_mask, that
specify fault or snapshot policy for the whole range instead of having to set
them for each entry in the pfns array.

For instance if the device driver wants pages for a range with at least read
permission, it sets::

  range->default_flags = HMM_PFN_REQ_FAULT;
  range->pfn_flags_mask = 0;

and calls hmm_range_fault() as described above. This will fault in all pages
in the range with at least read permission.

Now let's say the driver wants to do the same except for one page in the
range, for which it wants write permission. The driver then sets::

  range->default_flags = HMM_PFN_REQ_FAULT;
  range->pfn_flags_mask = HMM_PFN_REQ_WRITE;
  range->hmm_pfns[index_of_write] = HMM_PFN_REQ_WRITE;

With this, HMM will fault in all pages with at least read permission (i.e.,
valid), and for the address == range->start + (index_of_write << PAGE_SHIFT)
it will fault with write permission, i.e., if the CPU pte does not have write
permission set then HMM will call handle_mm_fault().

After hmm_range_fault completes, the flag bits are set to the current state of
the page tables, i.e., HMM_PFN_VALID | HMM_PFN_WRITE will be set if the page
is writable.


Represent and manage device memory from core kernel point of view
=================================================================

Several different designs were tried to support device memory. The first one
used a device specific data structure to keep information about migrated memory
and HMM hooked itself in various places of mm code to handle any access to
addresses that were backed by device memory. It turns out that this ended up
replicating most of the fields of struct page and also needed many kernel code
paths to be updated to understand this new kind of memory.

Most kernel code paths never try to access the memory behind a page
but only care about struct page contents. Because of this, HMM switched to
directly using struct page for device memory which left most kernel code paths
unaware of the difference. We only need to make sure that no one ever tries to
map those pages from the CPU side.

Migration to and from device memory
===================================

Because the CPU cannot access device memory directly, the device driver must
use hardware DMA or device specific load/store instructions to migrate data.
The migrate_vma_setup(), migrate_vma_pages(), and migrate_vma_finalize()
functions are designed to make drivers easier to write and to centralize common
code across drivers.

Before migrating pages to device private memory, special device private
``struct page`` entries need to be created. These will be used as special
"swap" page table entries so that a CPU process will fault if it tries to
access a page that has been migrated to device private memory.

These can be allocated and freed with::

  struct resource *res;
  struct dev_pagemap pagemap;

  res = request_free_mem_region(&iomem_resource, /* number of bytes */,
                                "name of driver resource");
  pagemap.type = MEMORY_DEVICE_PRIVATE;
  pagemap.range.start = res->start;
  pagemap.range.end = res->end;
  pagemap.nr_range = 1;
  pagemap.ops = &device_devmem_ops;
  memremap_pages(&pagemap, numa_node_id());

  memunmap_pages(&pagemap);
  release_mem_region(pagemap.range.start, range_len(&pagemap.range));

There are also devm_request_free_mem_region(), devm_memremap_pages(),
devm_memunmap_pages(), and devm_release_mem_region() when the resources can
be tied to a ``struct device``.

The overall migration steps are similar to migrating NUMA pages within system
memory (see :ref:`Page migration <page_migration>`) but the steps are split
between device driver specific code and shared common code:

1. ``mmap_read_lock()``

   The device driver has to pass a ``struct vm_area_struct`` to
   migrate_vma_setup() so the mmap_read_lock() or mmap_write_lock() needs to
   be held for the duration of the migration.

2. ``migrate_vma_setup(struct migrate_vma *args)``

   The device driver initializes the ``struct migrate_vma`` fields and passes
   the pointer to migrate_vma_setup(). The ``args->flags`` field is used to
   filter which source pages should be migrated. For example, setting
   ``MIGRATE_VMA_SELECT_SYSTEM`` will only migrate system memory and
   ``MIGRATE_VMA_SELECT_DEVICE_PRIVATE`` will only migrate pages residing in
   device private memory. If the latter flag is set, the ``args->pgmap_owner``
   field is used to identify device private pages owned by the driver. This
   avoids trying to migrate device private pages residing in other devices.
   Currently only anonymous private VMA ranges can be migrated to or from
   system memory and device private memory.

   One of the first steps migrate_vma_setup() does is to invalidate other
   devices' MMUs with the ``mmu_notifier_invalidate_range_start()`` and
   ``mmu_notifier_invalidate_range_end()`` calls around the page table
   walks to fill in the ``args->src`` array with PFNs to be migrated.
   The ``invalidate_range_start()`` callback is passed a
   ``struct mmu_notifier_range`` with the ``event`` field set to
   ``MMU_NOTIFY_MIGRATE`` and the ``migrate_pgmap_owner`` field set to
   the ``args->pgmap_owner`` field passed to migrate_vma_setup(). This
   allows the device driver to skip the invalidation callback and only
   invalidate device private MMU mappings that are actually migrating.
   This is explained more in the next section.

^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 341) While walking the page tables, a ``pte_none()`` or ``is_zero_pfn()``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 342) entry results in a valid "zero" PFN stored in the ``args->src`` array.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 343) This lets the driver allocate device private memory and clear it instead
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 344) of copying a page of zeros. Valid PTE entries to system memory or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 345) device private struct pages will be locked with ``lock_page()``, isolated
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 346) from the LRU (if system memory since device private pages are not on
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 347) the LRU), unmapped from the process, and a special migration PTE is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 348) inserted in place of the original PTE.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 349) migrate_vma_setup() also clears the ``args->dst`` array.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 350)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 351) 3. The device driver allocates destination pages and copies source pages to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 352) destination pages.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 353)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 354) The driver checks each ``src`` entry to see if the ``MIGRATE_PFN_MIGRATE``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 355) bit is set and skips entries that are not migrating. The device driver
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 356) can also choose to skip migrating a page by not filling in the ``dst``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 357) array for that page.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 358)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 359) The driver then allocates either a device private struct page or a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 360) system memory page, locks the page with ``lock_page()``, and fills in the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 361) ``dst`` array entry with::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 362)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 363) dst[i] = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 364)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 365) Now that the driver knows that this page is being migrated, it can
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 366) invalidate device private MMU mappings and copy device private memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 367) to system memory or another device private page. The core Linux kernel
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 368) handles CPU page table invalidations so the device driver only has to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 369) invalidate its own MMU mappings.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 370)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 371) The driver can use ``migrate_pfn_to_page(src[i])`` to get the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 372) ``struct page`` of the source and either copy the source page to the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 373) destination or clear the destination device private memory if the pointer
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 374) is ``NULL`` meaning the source page was not populated in system memory.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 375)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 376) 4. ``migrate_vma_pages()``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 377)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 378) This step is where the migration is actually "committed".
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 379)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 380) If the source page was a ``pte_none()`` or ``is_zero_pfn()`` page, this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 381) is where the newly allocated page is inserted into the CPU's page table.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 382) This can fail if a CPU thread faults on the same page. However, the page
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 383) table is locked and only one of the new pages will be inserted.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 384) The device driver will see that the ``MIGRATE_PFN_MIGRATE`` bit is cleared
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 385) if it loses the race.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 386)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 387) If the source page was locked, isolated, etc. the source ``struct page``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 388) information is now copied to destination ``struct page`` finalizing the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 389) migration on the CPU side.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 390)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 391) 5. Device driver updates device MMU page tables for pages still migrating,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 392) rolling back pages not migrating.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 393)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 394) If the ``src`` entry still has ``MIGRATE_PFN_MIGRATE`` bit set, the device
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 395) driver can update the device MMU and set the write enable bit if the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 396) ``MIGRATE_PFN_WRITE`` bit is set.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 397)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 398) 6. ``migrate_vma_finalize()``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 399)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 400) This step replaces the special migration page table entry with the new
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 401) page's page table entry and releases the reference to the source and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 402) destination ``struct page``.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 403)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 404) 7. ``mmap_read_unlock()``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 405)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 406) The lock can now be released.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 407)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 408) Memory cgroup (memcg) and rss accounting
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 409) ========================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 410)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 411) For now, device memory is accounted as any regular page in rss counters (either
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 412) anonymous if device page is used for anonymous, file if device page is used for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 413) file backed page, or shmem if device page is used for shared memory). This is a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 414) deliberate choice to keep existing applications, that might start using device
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 415) memory without knowing about it, running unimpacted.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 416)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 417) A drawback is that the OOM killer might kill an application using a lot of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 418) device memory and not a lot of regular system memory and thus not freeing much
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 419) system memory. We want to gather more real world experience on how applications
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 420) and system react under memory pressure in the presence of device memory before
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 421) deciding to account device memory differently.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 422)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 423)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 424) Same decision was made for memory cgroup. Device memory pages are accounted
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 425) against same memory cgroup a regular page would be accounted to. This does
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 426) simplify migration to and from device memory. This also means that migration
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 427) back from device memory to regular memory cannot fail because it would
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 428) go above memory cgroup limit. We might revisit this choice latter on once we
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 429) get more experience in how device memory is used and its impact on memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 430) resource control.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 431)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 432)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 433) Note that device memory can never be pinned by a device driver nor through GUP
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 434) and thus such memory is always free upon process exit. Or when last reference
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 435) is dropped in case of shared memory or file backed memory.