Orange Pi5 kernel

Deprecated Linux kernel 5.10.110 for OrangePi 5/5B/5+ boards

^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   1) .. hwpoison:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   2) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   3) ========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   4) hwpoison
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   5) ========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   6) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   7) What is hwpoison?
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   8) =================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   9) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  10) Upcoming Intel CPUs have support for recovering from some memory errors
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  11) (``MCA recovery``). This requires the OS to declare a page "poisoned",
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  12) kill the processes associated with it and avoid using it in the future.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  13) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  14) This patchkit implements the necessary infrastructure in the VM.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  15) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  16) To quote the overview comment::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  17) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  18) 	High level machine check handler. Handles pages reported by the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  19) 	hardware as being corrupted usually due to a 2bit ECC memory or cache
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  20) 	failure.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  21) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  22) 	This focusses on pages detected as corrupted in the background.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  23) 	When the current CPU tries to consume corruption the currently
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  24) 	running process can just be killed directly instead. This implies
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  25) 	that if the error cannot be handled for some reason it's safe to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  26) 	just ignore it because no corruption has been consumed yet. Instead
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  27) 	when that happens another machine check will happen.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  28) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  29) 	Handles page cache pages in various states. The tricky part
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  30) 	here is that we can access any page asynchronous to other VM
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  31) 	users, because memory failures could happen anytime and anywhere,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  32) 	possibly violating some of their assumptions. This is why this code
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  33) 	has to be extremely careful. Generally it tries to use normal locking
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  34) 	rules, as in get the standard locks, even if that means the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  35) 	error handling takes potentially a long time.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  36) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  37) 	Some of the operations here are somewhat inefficient and have non
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  38) 	linear algorithmic complexity, because the data structures have not
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  39) 	been optimized for this case. This is in particular the case
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  40) 	for the mapping from a vma to a process. Since this case is expected
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  41) 	to be rare we hope we can get away with this.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  42) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  43) The code consists of the high level handler in mm/memory-failure.c,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  44) a new page poison bit and various checks in the VM to handle poisoned
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  45) pages.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  46) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  47) The main target right now is KVM guests, but it works for all kinds
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  48) of applications. KVM support requires a recent qemu-kvm release.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  49) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  50) For the KVM use there was need for a new signal type so that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  51) KVM can inject the machine check into the guest with the proper
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  52) address. This in theory allows other applications to handle
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  53) memory failures too. The expectation is that nearly all applications
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  54) won't do that, but some very specialized ones might.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  55) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  56) Failure recovery modes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  57) ======================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  58) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  59) There are two (actually three) modes memory failure recovery can be in:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  60) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  61) vm.memory_failure_recovery sysctl set to zero:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  62) 	All memory failures cause a panic. Do not attempt recovery.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  63) 	(on x86 this can be also affected by the tolerant level of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  64) 	MCE subsystem)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  65) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  66) early kill
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  67) 	(can be controlled globally and per process)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  68) 	Send SIGBUS to the application as soon as the error is detected.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  69) 	This allows applications that can process memory errors gracefully
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  70) 	(e.g. by dropping the affected object) to do so.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  71) 	This is the mode used by KVM qemu.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  72) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  73) late kill
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  74) 	Send SIGBUS when the application runs into the corrupted page.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  75) 	This is best for memory error unaware applications and is the default.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  76) 	Note some pages are always handled as late kill.
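
An application that wants to tell the two kill modes apart can inspect
``si_code`` in a SIGBUS handler. The following is a minimal sketch, not code
from the kernel tree: the function and variable names are hypothetical, and it
assumes a libc that exposes ``BUS_MCEERR_AO`` (sent on early kill) and
``BUS_MCEERR_AR`` (sent when the poisoned page is actually touched).

```c
#define _GNU_SOURCE
#include <signal.h>
#include <string.h>

/* Last memory-failure report seen by the handler (illustrative globals). */
volatile sig_atomic_t mce_last_code;	/* BUS_MCEERR_AO or BUS_MCEERR_AR */
void *volatile mce_last_addr;		/* address reported in si_addr */

/* Map a SIGBUS si_code onto the kill modes described above. */
const char *mce_sigbus_kind(int si_code)
{
	switch (si_code) {
	case BUS_MCEERR_AO:
		return "early kill (action optional)";
	case BUS_MCEERR_AR:
		return "late kill (action required)";
	default:
		return "not a memory failure";
	}
}

/* Handler does no I/O; it only records what was reported. */
static void mce_sigbus_handler(int sig, siginfo_t *info, void *ucontext)
{
	(void)sig;
	(void)ucontext;
	mce_last_code = info->si_code;
	mce_last_addr = info->si_addr;
}

/* Install the handler; returns 0 on success, -1 on error. */
int mce_install_sigbus_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = mce_sigbus_handler;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	return sigaction(SIGBUS, &sa, NULL);
}
```

A real program would still have to drop or rebuild the affected object before
touching the reported address again; simply returning from an action-required
SIGBUS re-executes the faulting access.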
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  77) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  78) User control
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  79) ============
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  80) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  81) vm.memory_failure_recovery
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  82) 	See Documentation/admin-guide/sysctl/vm.rst
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  83) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  84) vm.memory_failure_early_kill
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  85) 	Enable early kill mode globally
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  86) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  87) PR_MCE_KILL
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  88) 	Set early/late kill mode/revert to system default
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  89) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  90) 	arg1: PR_MCE_KILL_CLEAR:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  91) 		Revert to system default
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  92) 	arg1: PR_MCE_KILL_SET:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  93) 		arg2 defines thread specific mode
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  94) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  95) 		PR_MCE_KILL_EARLY:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  96) 			Early kill
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  97) 		PR_MCE_KILL_LATE:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  98) 			Late kill
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  99) 		PR_MCE_KILL_DEFAULT
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 100) 			Use system global default
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 101) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 102) 	Note that if you want to have a dedicated thread which handles
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 103) 	the SIGBUS(BUS_MCEERR_AO) on behalf of the process, you should
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 104) 	call prctl(PR_MCE_KILL, PR_MCE_KILL_SET, PR_MCE_KILL_EARLY, 0, 0) on the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 105) 	designated thread. Otherwise, the SIGBUS is sent to the main thread.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 106) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 107) PR_MCE_KILL_GET
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 108) 	Return current mode
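
The prctl options above might be wrapped as follows. This is a sketch under
the assumption that ``<sys/prctl.h>`` provides the ``PR_MCE_KILL*`` constants;
the helper names are hypothetical and not part of any kernel or libc API.

```c
#include <sys/prctl.h>

/* Set the calling thread's mode: PR_MCE_KILL_EARLY, _LATE or _DEFAULT. */
int mce_kill_set(int mode)
{
	/* Unused arguments must be zero or the kernel returns EINVAL. */
	return prctl(PR_MCE_KILL, PR_MCE_KILL_SET, (unsigned long)mode, 0UL, 0UL);
}

/* Revert the calling thread to the system default. */
int mce_kill_clear(void)
{
	return prctl(PR_MCE_KILL, PR_MCE_KILL_CLEAR, 0UL, 0UL, 0UL);
}

/* Return the current mode, or -1 with errno set on error. */
int mce_kill_get(void)
{
	return prctl(PR_MCE_KILL_GET, 0UL, 0UL, 0UL, 0UL);
}

/* Human-readable name for a value returned by mce_kill_get(). */
const char *mce_kill_name(int mode)
{
	switch (mode) {
	case PR_MCE_KILL_EARLY:
		return "early";
	case PR_MCE_KILL_LATE:
		return "late";
	case PR_MCE_KILL_DEFAULT:
		return "default";
	default:
		return "unknown";
	}
}
```

A dedicated SIGBUS thread would call ``mce_kill_set(PR_MCE_KILL_EARLY)`` from
within that thread, since the setting applies to the caller.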
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 109) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 110) Testing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 111) =======
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 112) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 113) * madvise(MADV_HWPOISON, ....) (as root) - Poison a page in the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 114)   process for testing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 115) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 116) * hwpoison-inject module through debugfs ``/sys/kernel/debug/hwpoison/``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 117) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 118)   corrupt-pfn
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 119) 	Inject hwpoison fault at PFN echoed into this file. This does
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 120) 	some early filtering to avoid corrupting unintended pages in test suites.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 121) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 122)   unpoison-pfn
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 123) 	Software-unpoison page at PFN echoed into this file. This way
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 124) 	a page can be reused again.  This only works for Linux
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 125) 	injected failures, not for real memory failures.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 126) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 127)   Note these injection interfaces are not stable and might change between
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 128)   kernel versions.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 129) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 130)   corrupt-filter-dev-major, corrupt-filter-dev-minor
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 131) 	Only handle memory failures to pages associated with the file
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 132) 	system defined by block device major/minor.  -1U is the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 133) 	wildcard value.  This should be only used for testing with
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 134) 	artificial injection.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 135) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 136)   corrupt-filter-memcg
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 137) 	Limit injection to pages owned by a memory cgroup. Specified by inode
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 138) 	number of the memcg.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 139) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 140) 	Example::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 141) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 142) 		mkdir /sys/fs/cgroup/mem/hwpoison
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 143) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 144) 		usemem -m 100 -s 1000 &
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 145) 		echo `jobs -p` > /sys/fs/cgroup/mem/hwpoison/tasks
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 146) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 147) 		memcg_ino=$(ls -id /sys/fs/cgroup/mem/hwpoison | cut -f1 -d' ')
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 148) 		echo $memcg_ino > /sys/kernel/debug/hwpoison/corrupt-filter-memcg
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 149) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 150) 		page-types -p `pidof init`   --hwpoison  # shall do nothing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 151) 		page-types -p `pidof usemem` --hwpoison  # poison its pages
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 152) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 153)   corrupt-filter-flags-mask, corrupt-filter-flags-value
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 154) 	When specified, only poison pages if ((page_flags & mask) ==
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 155) 	value).  This allows stress testing of many kinds of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 156) 	pages. The page_flags are the same as in /proc/kpageflags. The
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 157) 	flag bits are defined in include/linux/kernel-page-flags.h and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 158) 	documented in Documentation/admin-guide/mm/pagemap.rst
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 159) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 160) * Architecture specific MCE injector
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 161) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 162)   x86 has mce-inject, mce-test
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 163) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 164)   Some portable hwpoison test programs in mce-test, see below.
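
The ``madvise(MADV_HWPOISON, ....)`` method listed first can be sketched as
below. The helper names are hypothetical. The call is assumed to need
CAP_SYS_ADMIN and a kernel built with CONFIG_MEMORY_FAILURE, and it fails with
an error otherwise. It destroys the contents of the page, so it is for test
programs only.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Round addr down to the start of its page; page_size must be a power of two. */
uintptr_t page_base(uintptr_t addr, uintptr_t page_size)
{
	return addr & ~(page_size - 1);
}

/*
 * Poison the single page containing addr in the calling process.
 * Returns 0 on success, -1 with errno set otherwise.
 */
int poison_page_at(void *addr)
{
	uintptr_t page_size = (uintptr_t)sysconf(_SC_PAGESIZE);
	void *start = (void *)page_base((uintptr_t)addr, page_size);

	return madvise(start, page_size, MADV_HWPOISON);
}
```

A test would typically mmap() an anonymous page, write to it so that it is
backed by memory, call ``poison_page_at()`` on it, and then expect a SIGBUS
when the page is next accessed.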
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 165) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 166) References
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 167) ==========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 168) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 169) http://halobates.de/mce-lc09-2.pdf
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 170) 	Overview presentation from LinuxCon 09
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 171) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 172) git://git.kernel.org/pub/scm/utils/cpu/mce/mce-test.git
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 173) 	Test suite (hwpoison specific portable tests in tsrc)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 174) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 175) git://git.kernel.org/pub/scm/utils/cpu/mce/mce-inject.git
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 176) 	x86 specific injector
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 177) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 178) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 179) Limitations
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 180) ===========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 181) - Not all page types are supported and never will be. Most kernel internal
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 182)   objects cannot be recovered, only LRU pages for now.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 183) - Right now hugepage support is missing.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 184) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 185) ---
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 186) Andi Kleen, Oct 2009