^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1) .. hwpoison:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 3) ========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 4) hwpoison
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 5) ========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 6)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 7) What is hwpoison?
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 8) =================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 9)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 10) Upcoming Intel CPUs have support for recovering from some memory errors
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 11) (``MCA recovery``). This requires the OS to declare a page "poisoned",
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 12) kill the processes associated with it and avoid using it in the future.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 13)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 14) This patchkit implements the necessary infrastructure in the VM.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 15)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 16) To quote the overview comment::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 17)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 18) High level machine check handler. Handles pages reported by the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 19) hardware as being corrupted usually due to a 2bit ECC memory or cache
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 20) failure.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 21)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 22) This focusses on pages detected as corrupted in the background.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 23) When the current CPU tries to consume corruption the currently
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 24) running process can just be killed directly instead. This implies
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 25) that if the error cannot be handled for some reason it's safe to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 26) just ignore it because no corruption has been consumed yet. Instead
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 27) when that happens another machine check will happen.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 28)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 29) Handles page cache pages in various states. The tricky part
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 30) here is that we can access any page asynchronous to other VM
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 31) users, because memory failures could happen anytime and anywhere,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 32) possibly violating some of their assumptions. This is why this code
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 33) has to be extremely careful. Generally it tries to use normal locking
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 34) rules, as in get the standard locks, even if that means the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 35) error handling takes potentially a long time.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 36)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 37) Some of the operations here are somewhat inefficient and have non
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 38) linear algorithmic complexity, because the data structures have not
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 39) been optimized for this case. This is in particular the case
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 40) for the mapping from a vma to a process. Since this case is expected
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 41) to be rare we hope we can get away with this.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 42)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 43) The code consists of the high level handler in mm/memory-failure.c,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 44) a new page poison bit and various checks in the VM to handle poisoned
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 45) pages.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 46)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 47) The main target right now is KVM guests, but it works for all kinds
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 48) of applications. KVM support requires a recent qemu-kvm release.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 49)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 50) For the KVM use there was need for a new signal type so that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 51) KVM can inject the machine check into the guest with the proper
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 52) address. This in theory allows other applications to handle
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 53) memory failures too. The expectation is that nearly all applications
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 54) won't do that, but some very specialized ones might.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 55)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 56) Failure recovery modes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 57) ======================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 58)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 59) There are two (actually three) modes memory failure recovery can be in:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 60)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 61) vm.memory_failure_recovery sysctl set to zero:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 62) All memory failures cause a panic. Do not attempt recovery.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 63) (On x86 this can also be affected by the tolerant level of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 64) MCE subsystem)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 65)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 66) early kill
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 67) (can be controlled globally and per process)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 68) Send SIGBUS to the application as soon as the error is detected.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 69) This allows applications that can process memory errors in a gentle
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 70) way (e.g. drop the affected object) to do so.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 71) This is the mode used by KVM qemu.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 72)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 73) late kill
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 74) Send SIGBUS when the application runs into the corrupted page.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 75) This is best for applications unaware of memory errors and is the default.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 76) Note some pages are always handled as late kill.
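
An application can tell the two kill modes apart in its SIGBUS handler by looking at ``si_code``. The following is a minimal sketch, not a reference implementation. It assumes a Linux C library whose ``<signal.h>`` defines ``BUS_MCEERR_AO`` (action optional, delivered in early kill mode) and ``BUS_MCEERR_AR`` (action required, delivered when the process actually consumes the corrupted page). The names ``mce_kind``, ``mce_sigbus_handler`` and ``install_mce_handler`` are illustrative and not part of any kernel interface.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Describe the si_code of a SIGBUS (hypothetical helper). */
static const char *mce_kind(int si_code)
{
    switch (si_code) {
    case BUS_MCEERR_AO:
        return "action optional";   /* early kill notification */
    case BUS_MCEERR_AR:
        return "action required";   /* corruption was consumed */
    default:
        return "not a memory failure";
    }
}

/* Last reported error, recorded by the handler for the main program. */
static volatile sig_atomic_t last_mce_code;
static void *volatile last_mce_addr;

static void mce_sigbus_handler(int sig, siginfo_t *info, void *ucontext)
{
    (void)sig;
    (void)ucontext;

    last_mce_code = info->si_code;
    last_mce_addr = info->si_addr;   /* address inside the affected page */

    /*
     * For an "action required" error the faulting access cannot simply
     * be retried, so this sketch gives up. A real application might
     * instead unwind to a safe point and drop the affected object.
     */
    if (info->si_code == BUS_MCEERR_AR)
        _exit(1);
}

/* Install the handler with SA_SIGINFO so that siginfo_t is delivered. */
static int install_mce_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = mce_sigbus_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGBUS, &sa, NULL);
}
```

After ``install_mce_handler()`` returns 0, the main program can poll ``last_mce_code`` and ``last_mce_addr`` to decide which object to discard.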
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 77)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 78) User control
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 79) ============
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 80)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 81) vm.memory_failure_recovery
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 82) See Documentation/admin-guide/sysctl/vm.rst
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 83)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 84) vm.memory_failure_early_kill
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 85) Enable early kill mode globally
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 86)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 87) PR_MCE_KILL
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 88) Set early/late kill mode/revert to system default
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 89)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 90) arg1: PR_MCE_KILL_CLEAR:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 91) Revert to system default
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 92) arg1: PR_MCE_KILL_SET:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 93) arg2 defines thread specific mode
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 94)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 95) PR_MCE_KILL_EARLY:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 96) Early kill
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 97) PR_MCE_KILL_LATE:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 98) Late kill
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 99) PR_MCE_KILL_DEFAULT:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 100) Use system global default
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 101)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 102) Note that if you want to have a dedicated thread which handles
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 103) the SIGBUS(BUS_MCEERR_AO) on behalf of the process, you should call
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 104) prctl(PR_MCE_KILL, PR_MCE_KILL_SET, PR_MCE_KILL_EARLY, 0, 0) on the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 105) designated thread. Otherwise, the SIGBUS is sent to the main thread.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 106)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 107) PR_MCE_KILL_GET
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 108) Return the current mode.
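
The ``PR_MCE_KILL`` and ``PR_MCE_KILL_GET`` controls above can be exercised from C roughly as follows. This is a sketch under the assumption that ``<sys/prctl.h>`` provides the ``PR_MCE_KILL*`` constants. The wrapper names ``set_mce_kill_policy``, ``clear_mce_kill_policy`` and ``get_mce_kill_policy`` are made up for illustration. Unused ``prctl()`` arguments are passed as zero.

```c
#include <sys/prctl.h>

/*
 * Select a thread-specific policy for the calling thread.
 * policy is PR_MCE_KILL_EARLY, PR_MCE_KILL_LATE or PR_MCE_KILL_DEFAULT.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int set_mce_kill_policy(unsigned long policy)
{
    return prctl(PR_MCE_KILL, PR_MCE_KILL_SET, policy, 0UL, 0UL);
}

/* Drop the thread-specific policy and revert to the system default. */
static int clear_mce_kill_policy(void)
{
    return prctl(PR_MCE_KILL, PR_MCE_KILL_CLEAR, 0UL, 0UL, 0UL);
}

/*
 * Query the current policy of the calling thread. Returns one of the
 * PR_MCE_KILL_EARLY/LATE/DEFAULT values, or -1 with errno set.
 */
static int get_mce_kill_policy(void)
{
    return prctl(PR_MCE_KILL_GET, 0UL, 0UL, 0UL, 0UL);
}
```

A thread that wants to receive ``SIGBUS(BUS_MCEERR_AO)`` on behalf of the process would call ``set_mce_kill_policy(PR_MCE_KILL_EARLY)`` on itself before the rest of the application starts using memory it wants to protect.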
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 109)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 110) Testing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 111) =======
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 112)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 113) * madvise(MADV_HWPOISON, ....) (as root) - Poison a page in the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 114) process for testing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 115)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 116) * hwpoison-inject module through debugfs ``/sys/kernel/debug/hwpoison/``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 117)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 118) corrupt-pfn
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 119) Inject hwpoison fault at PFN echoed into this file. This does
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 120) some early filtering to avoid corrupting unintended pages in test suites.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 121)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 122) unpoison-pfn
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 123) Software-unpoison page at PFN echoed into this file. This way
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 124) a page can be reused again. This only works for Linux
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 125) injected failures, not for real memory failures.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 126)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 127) Note that these injection interfaces are not stable and might change
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 128) between kernel versions.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 129)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 130) corrupt-filter-dev-major, corrupt-filter-dev-minor
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 131) Only handle memory failures to pages associated with the file
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 132) system defined by block device major/minor. -1U is the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 133) wildcard value. This should be only used for testing with
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 134) artificial injection.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 135)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 136) corrupt-filter-memcg
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 137) Limit injection to pages owned by a memory cgroup (memcg), which is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 138) specified by the inode number of the memcg.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 139)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 140) Example::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 141)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 142) mkdir /sys/fs/cgroup/mem/hwpoison
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 143)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 144) usemem -m 100 -s 1000 &
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 145) echo `jobs -p` > /sys/fs/cgroup/mem/hwpoison/tasks
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 146)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 147) memcg_ino=$(ls -id /sys/fs/cgroup/mem/hwpoison | cut -f1 -d' ')
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 148) echo $memcg_ino > /sys/kernel/debug/hwpoison/corrupt-filter-memcg
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 149)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 150) page-types -p `pidof init` --hwpoison # shall do nothing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 151) page-types -p `pidof usemem` --hwpoison # poison its pages
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 152)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 153) corrupt-filter-flags-mask, corrupt-filter-flags-value
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 154) When specified, only poison pages if ((page_flags & mask) ==
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 155) value). This allows stress testing of many kinds of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 156) pages. The page_flags are the same as in /proc/kpageflags. The
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 157) flag bits are defined in include/linux/kernel-page-flags.h and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 158) documented in Documentation/admin-guide/mm/pagemap.rst
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 159)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 160) * Architecture specific MCE injector
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 161)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 162) x86 has mce-inject, mce-test
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 163)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 164) Some portable hwpoison test programs are in mce-test; see below.
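
The ``corrupt-filter-flags-mask`` / ``corrupt-filter-flags-value`` check described above is a plain masked comparison, which can be sketched in C as follows. The bit numbers are copied here as an assumption about the ``/proc/kpageflags`` layout and should be checked against include/linux/kernel-page-flags.h. ``KPF_BIT`` and ``hwpoison_filter_matches`` are illustrative names, and the sketch does not model the "when specified" part, i.e. what happens when no mask is set.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit numbers as exposed by /proc/kpageflags (assumed; only those used here). */
enum {
    KPF_DIRTY = 4,
    KPF_LRU   = 5,
    KPF_ANON  = 12,
};

#define KPF_BIT(n) (UINT64_C(1) << (n))

/* The condition from the text: ((page_flags & mask) == value). */
static bool hwpoison_filter_matches(uint64_t page_flags,
                                    uint64_t mask, uint64_t value)
{
    return (page_flags & mask) == value;
}
```

For example, a mask of ``KPF_BIT(KPF_LRU) | KPF_BIT(KPF_DIRTY)`` with a value of ``KPF_BIT(KPF_LRU)`` selects pages that are on the LRU and not dirty, regardless of any other flag such as ``KPF_ANON``.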
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 165)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 166) References
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 167) ==========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 168)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 169) http://halobates.de/mce-lc09-2.pdf
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 170) Overview presentation from LinuxCon 09
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 171)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 172) git://git.kernel.org/pub/scm/utils/cpu/mce/mce-test.git
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 173) Test suite (hwpoison specific portable tests in tsrc)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 174)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 175) git://git.kernel.org/pub/scm/utils/cpu/mce/mce-inject.git
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 176) x86 specific injector
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 177)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 178)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 179) Limitations
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 180) ===========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 181) - Not all page types are supported and never will be. Most kernel internal
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 182) objects cannot be recovered, only LRU pages for now.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 183) - Right now hugepage support is missing.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 184)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 185) ---
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 186) Andi Kleen, Oct 2009