^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1) ======================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2) Firmware-Assisted Dump
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 3) ======================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 4)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 5) July 2011
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 6)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 7) The goal of firmware-assisted dump is to enable the dump of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 8) a crashed system, and to do so from a fully-reset system, and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 9) to minimize the total elapsed time until the system is back
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 10) in production use.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 11)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 12) - Firmware-Assisted Dump (FADump) infrastructure is intended to replace
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 13) the existing phyp assisted dump.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 14) - FADump uses the same firmware interfaces and memory reservation model
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 15) as phyp assisted dump.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 16) - Unlike phyp dump, FADump exports the memory dump through /proc/vmcore
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 17) in the ELF format in the same way as kdump. This helps us reuse the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 18) kdump infrastructure for dump capture and filtering.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 19) - Unlike phyp dump, userspace tools do not need to refer to any sysfs
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 20) interface while reading /proc/vmcore.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 21) - Unlike phyp dump, FADump allows the user to release all the memory reserved
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 22) for dump, with a single operation of echo 1 > /sys/kernel/fadump_release_mem.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 23) - Once enabled through kernel boot parameter, FADump can be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 24) started/stopped through /sys/kernel/fadump_registered interface (see
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 25) sysfs files section below) and can be easily integrated with kdump
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 26) service start/stop init scripts.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 27)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 28) Compared with kdump or other strategies, firmware-assisted
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 29) dump offers several strong, practical advantages:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 30)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 31) - Unlike kdump, the system has been reset, and loaded
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 32) with a fresh copy of the kernel. In particular,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 33) PCI and I/O devices have been reinitialized and are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 34) in a clean, consistent state.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 35) - Once the dump is copied out, the memory that held the dump
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 36) is immediately available to the running kernel. Therefore,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 37) unlike kdump, FADump doesn't need a second reboot to get the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 38) system back to the production configuration.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 39)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 40) The above can only be accomplished by coordination with,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 41) and assistance from the Power firmware. The procedure is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 42) as follows:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 43)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 44) - The first kernel registers the sections of memory with the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 45) Power firmware for dump preservation during OS initialization.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 46) These registered sections of memory are reserved by the first
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 47) kernel during early boot.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 48)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 49) - When the system crashes, the Power firmware will copy the registered
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 50) low memory regions (boot memory) from source to destination area.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 51) It will also save hardware PTEs.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 52)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 53) NOTE:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 54) The term 'boot memory' means the size of the low memory chunk
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 55) that is required for a kernel to boot successfully when
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 56) booted with restricted memory. By default, the boot memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 57) size will be the larger of 5% of system RAM or 256MB.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 58) Alternatively, the user can specify the boot memory size
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 59) through the boot parameter 'crashkernel=', which overrides
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 60) the default calculated size. Use this option if the default
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 61) boot memory size is not sufficient for the second kernel to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 62) boot successfully. For the syntax of the crashkernel= parameter,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 63) refer to Documentation/admin-guide/kdump/kdump.rst. If an
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 64) offset is provided in the crashkernel= parameter, it will be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 65) ignored, as FADump uses a predefined offset to reserve memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 66) for boot memory dump preservation in case of a crash.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 67)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 68) - After the low memory (boot memory) area has been saved, the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 69) firmware will reset PCI and other hardware state. It will
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 70) *not* clear the RAM. It will then launch the bootloader, as
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 71) normal.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 72)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 73) - The freshly booted kernel will notice that there is a new node
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 74) (rtas/ibm,kernel-dump on pSeries or ibm,opal/dump/mpipl-boot
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 75) on OPAL platform) in the device tree, indicating that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 76) there is crash data available from a previous boot. During
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 77) early boot, the OS will reserve the rest of the memory above
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 78) boot memory size, effectively booting with restricted memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 79) size. This ensures that this kernel (also referred
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 80) to as the second kernel or capture kernel) will not touch any
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 81) of the dump memory area.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 82)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 83) - User-space tools will read /proc/vmcore to obtain the contents
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 84) of memory, which holds the previously crashed kernel's dump in ELF
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 85) format. The userspace tools may copy this data to disk, or over the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 86) network to NAS, SAN, iSCSI, etc. as desired.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 87)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 88) - Once the userspace tool is done saving the dump, it will echo
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 89) '1' to /sys/kernel/fadump_release_mem to release the reserved
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 90) memory back to general use, except the memory required for the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 91) next firmware-assisted dump registration.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 92)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 93) e.g.::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 94)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 95) # echo 1 > /sys/kernel/fadump_release_mem
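The last two steps of the procedure above (save the dump, then release the
reserved memory) can be sketched as a small shell helper for the capture
kernel. This is only an illustration and not the logic of any distribution's
kdump scripts: the function name fadump_save_and_release, its arguments and
the use of plain cp are assumptions made for this example.

```shell
# Hypothetical helper: save the dump, then release the reserved memory.
#   $1 = destination file for the dump
#   $2 = dump source (defaults to /proc/vmcore)
#   $3 = release control file (defaults to /sys/kernel/fadump_release_mem)
fadump_save_and_release() {
    dest=$1
    vmcore=${2:-/proc/vmcore}
    release=${3:-/sys/kernel/fadump_release_mem}

    # Without a readable vmcore there is no dump to save.
    [ -r "$vmcore" ] || { echo "no dump at $vmcore" >&2; return 1; }

    # Keep the reserved memory if the copy fails, so it can be retried.
    cp "$vmcore" "$dest" || return 1

    # Release the reserved memory back to general use.
    echo 1 > "$release"
}
```

The release step is deliberately skipped when the copy fails, because the
dump can no longer be read once the reserved memory has been released.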
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 96)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 97) Please note that the firmware-assisted dump feature
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 98) is only available on POWER6 and above systems on the pSeries
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 99) (PowerVM) platform, and on POWER9 and above systems with OP940
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 100) or later firmware versions on the PowerNV (OPAL) platform.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 101) Note that OPAL firmware exports the ibm,opal/dump node when
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 102) FADump is supported on the PowerNV platform.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 103)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 104) On OPAL based machines, the system first boots into an intermediate
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 105) kernel (referred to as the petitboot kernel) before booting into the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 106) capture kernel. This kernel may have only minimal kernel and/or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 107) userspace support to process crash data. Such a kernel needs to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 108) preserve the previously crashed kernel's memory for the subsequent
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 109) capture kernel boot to process this crash data. The kernel config
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 110) option CONFIG_PRESERVE_FA_DUMP has to be enabled on such a kernel
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 111) to ensure that crash data is preserved for later processing.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 112)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 113) -- On OPAL based machines (PowerNV), if the kernel is built with
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 114) CONFIG_OPAL_CORE=y, OPAL memory at the time of crash is also
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 115) exported as /sys/firmware/opal/mpipl/core file. This sysfs file is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 116) helpful in debugging OPAL crashes with GDB. The kernel memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 117) used for exporting this sysfs file can be released by echoing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 118) '1' to the /sys/firmware/opal/mpipl/release_core node.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 119)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 120) e.g.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 121) # echo 1 > /sys/firmware/opal/mpipl/release_core
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 122)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 123) Implementation details:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 124) -----------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 125)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 126) During boot, a check is made to see if firmware supports
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 127) this feature on that particular machine. If it does, then
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 128) we check to see if an active dump is waiting for us. If so,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 129) everything but boot memory size of RAM is reserved during
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 130) early boot (see Fig. 2). This area is released once the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 131) userland scripts (e.g. kdump scripts) have finished collecting
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 132) the dump. If there is dump data, then the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 133) /sys/kernel/fadump_release_mem file is created, and the reserved
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 134) memory is held.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 135)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 136) If there is no waiting dump data, then only the memory required to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 137) hold CPU state, HPTE region, boot memory dump, FADump header and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 138) elfcore header, is usually reserved at an offset greater than boot
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 139) memory size (see Fig. 1). This area is *not* released: this region
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 140) will be kept permanently reserved, so that it can act as a receptacle
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 141) for a copy of the boot memory content in addition to CPU state and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 142) HPTE region, in the case a crash does occur.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 143)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 144) Since this reserved memory area is used only after a system crash,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 145) there is no point in blocking this significant chunk of memory from
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 146) the production kernel. Hence, the implementation uses the Linux kernel's
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 147) Contiguous Memory Allocator (CMA) for memory reservation if CMA is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 148) configured for the kernel. With CMA reservation, this memory is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 149) available for applications to use, while the kernel is prevented from
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 150) using it. With this, FADump will still be able to capture all of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 151) kernel memory and most of the user space memory, except the user pages
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 152) that were present in the CMA region::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 153)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 154)   o Memory Reservation during first kernel
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 155)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 156)   Low memory                                       Top of memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 157)   0    boot memory size   |<--- Reserved dump area --->|       |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 158)   |           |           |    Permanent Reservation   |       |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 159)   V           V           |                            |       V
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 160)   +-----------+-----/ /---+---+----+-------+-----+-----+----+--+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 161)   |           |           |///|////|  DUMP | HDR | ELF |////|  |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 162)   +-----------+-----/ /---+---+----+-------+-----+-----+----+--+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 163)         |                   ^    ^     ^      ^           ^
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 164)         |                   |    |     |      |           |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 165)         \                  CPU  HPTE   /      |           |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 166)          ------------------------------       |           |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 167)       Boot memory content gets transferred    |           |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 168)       to reserved area by firmware at the     |           |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 169)       time of crash.                          |           |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 170)                                         FADump Header     |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 171)                                          (meta area)      |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 172)                                                           |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 173)                                                           |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 174)                     Metadata: This area holds a metadata structure whose
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 175)                     address is registered with f/w and retrieved in the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 176)                     second kernel after crash, on platforms that support
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 177)                     tags (OPAL). Having such structure with info needed
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 178)                     to process the crashdump eases dump capture process.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 179)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 180)                    Fig. 1
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 181)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 182)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 183)   o Memory Reservation during second kernel after crash
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 184)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 185)   Low memory                                       Top of memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 186)   0    boot memory size                                        |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 187)   |           |<------------ Crash preserved area ------------>|
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 188)   V           V           |<--- Reserved dump area --->|       |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 189)   +-----------+-----/ /---+---+----+-------+-----+-----+----+--+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 190)   |           |           |///|////|  DUMP | HDR | ELF |////|  |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 191)   +-----------+-----/ /---+---+----+-------+-----+-----+----+--+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 192)         |                                     |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 193)         V                                     V
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 194)   Used by second                        /proc/vmcore
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 195)   kernel to boot
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 196)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 197)         +---+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 198)         |///| -> Regions (CPU, HPTE & Metadata) marked like this in the above
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 199)         +---+    figures are not always present. For example, OPAL platform
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 200)                  does not have CPU & HPTE regions while Metadata region is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 201)                  not supported on pSeries currently.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 202)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 203)                    Fig. 2
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 204)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 205)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 206) Currently the dump will be copied from /proc/vmcore to a new file upon
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 207) user intervention. The dump data available through /proc/vmcore will be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 208) in ELF format. Hence the existing kdump infrastructure (kdump scripts)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 209) to save the dump works fine with minor modifications. Kdump scripts on
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 210) major distro releases have already been modified to work seamlessly (no
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 211) user intervention in saving the dump) when FADump is used, instead of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 212) kdump, as the dump mechanism.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 213)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 214) The tools to examine the dump will be the same as the ones
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 215) used for kdump.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 216)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 217) How to enable firmware-assisted dump (FADump):
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 218) ----------------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 219)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 220) 1. Set config option CONFIG_FA_DUMP=y and build the kernel.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 221) 2. Boot into the Linux kernel with the 'fadump=on' kernel cmdline option.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 222) By default, FADump reserved memory will be initialized as a CMA area.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 223) Alternatively, the user can boot the Linux kernel with 'fadump=nocma' to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 224) prevent FADump from using CMA.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 225) 3. Optionally, the user can also set the 'crashkernel=' kernel cmdline option
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 226) to specify the size of the memory to reserve for boot memory dump
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 227) preservation.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 228)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 229) NOTE:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 230) 1. The 'fadump_reserve_mem=' parameter has been deprecated. Instead
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 231) use 'crashkernel=' to specify the size of the memory to reserve
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 232) for boot memory dump preservation.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 233) 2. If firmware-assisted dump fails to reserve memory, it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 234) will fall back to the existing kdump mechanism if the 'crashkernel='
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 235) option is set on the kernel cmdline.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 236) 3. If the user wants to capture all of user space memory and is ok with
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 237) the reserved memory not being available to the production system, the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 238) 'fadump=nocma' kernel parameter can be used to fall back to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 239) the old behaviour.
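
The default that 'crashkernel=' overrides is stated earlier in this document
as the larger of 5% of system RAM or 256MB. A minimal sketch of that rule in
shell arithmetic follows; the function name fadump_default_boot_mem_mb is an
assumption made for this example, and any rounding or alignment the kernel
may apply on top of the rule is not modelled.

```shell
# Default boot memory size as described in this document:
# the larger of 5% of system RAM or 256MB.
#   $1 = total system RAM in MB
# Prints the result in MB (integer division, no alignment applied).
fadump_default_boot_mem_mb() {
    ram_mb=$1
    five_percent=$(( ram_mb * 5 / 100 ))
    if [ "$five_percent" -gt 256 ]; then
        echo "$five_percent"
    else
        echo 256
    fi
}
```

Under this rule the 256MB floor applies to any system with 5GB of RAM or
less; larger systems get the 5% figure.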
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 240)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 241) Sysfs/debugfs files:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 242) --------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 243)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 244) The firmware-assisted dump feature uses the sysfs file system to hold
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 245) the control files and a debugfs file to display the reserved memory regions.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 246)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 247) Here is the list of files under kernel sysfs:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 248)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 249) /sys/kernel/fadump_enabled
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 250) This is used to display the FADump status.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 251)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 252) - 0 = FADump is disabled
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 253) - 1 = FADump is enabled
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 254)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 255) This interface can be used by kdump init scripts to identify if
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 256) FADump is enabled in the kernel and act accordingly.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 257)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 258) /sys/kernel/fadump_registered
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 259) This is used to display the FADump registration status as well
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 260) as to control (start/stop) the FADump registration.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 261)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 262) - 0 = FADump is not registered.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 263) - 1 = FADump is registered and ready to handle system crash.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 264)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 265) To register FADump, echo 1 > /sys/kernel/fadump_registered. To
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 266) un-register and stop FADump, echo 0 > /sys/kernel/fadump_registered.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 267) Once FADump is un-registered, a system crash will not
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 268) be handled and the vmcore will not be captured. This interface can be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 269) easily integrated with kdump service start/stop.
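
As an illustration of how a service start/stop script might drive this
interface, here is a minimal sketch. The wrapper name fadump_ctl and its
start/stop arguments are assumptions made for this example; only the control
file and the values 1 and 0 come from the description above.

```shell
# Hypothetical wrapper around the registration control file.
#   $1 = "start" (register) or "stop" (un-register)
#   $2 = control file (defaults to /sys/kernel/fadump_registered)
fadump_ctl() {
    action=$1
    reg=${2:-/sys/kernel/fadump_registered}

    # Bail out if the control file is missing or not writable.
    [ -w "$reg" ] || { echo "$reg not writable" >&2; return 1; }

    case $action in
        start) echo 1 > "$reg" ;;
        stop)  echo 0 > "$reg" ;;
        *)     echo "usage: fadump_ctl start|stop [file]" >&2; return 2 ;;
    esac
}
```

The second argument exists only so the helper can be exercised against an
ordinary file; on a real system it would be left at its default.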
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 270)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 271) /sys/kernel/fadump/mem_reserved
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 272)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 273) This is used to display the memory reserved by FADump for saving the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 274) crash dump.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 275)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 276) /sys/kernel/fadump_release_mem
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 277) This file is available only when FADump is active during the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 278) second kernel. It is used to release the reserved memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 279) regions that are held for saving the crash dump. To release the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 280) reserved memory, echo 1 to it::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 281)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 282) echo 1 > /sys/kernel/fadump_release_mem
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 283)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 284) After echo 1, the content of the /sys/kernel/debug/powerpc/fadump_region
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 285) file will change to reflect the new memory reservations.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 286)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 287) The existing userspace tools (kdump infrastructure) can be easily
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 288) enhanced to use this interface to release the memory reserved for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 289) dump and continue without a second reboot.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 290)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 291) Note: /sys/kernel/fadump_release_opalcore sysfs has moved to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 292) /sys/firmware/opal/mpipl/release_core
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 293)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 294) /sys/firmware/opal/mpipl/release_core
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 295)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 296) This file is available only on OPAL based machines when FADump is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 297) active during the capture kernel. It is used to release the memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 298) used by the kernel to export the /sys/firmware/opal/mpipl/core file. To
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 299) release this memory, echo '1' to it::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 300)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 301) echo 1 > /sys/firmware/opal/mpipl/release_core
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 302)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 303) Note: The following FADump sysfs files are deprecated.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 304)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 305) +----------------------------------+--------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 306) | Deprecated | Alternative |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 307) +----------------------------------+--------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 308) | /sys/kernel/fadump_enabled | /sys/kernel/fadump/enabled |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 309) +----------------------------------+--------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 310) | /sys/kernel/fadump_registered | /sys/kernel/fadump/registered |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 311) +----------------------------------+--------------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 312) | /sys/kernel/fadump_release_mem | /sys/kernel/fadump/release_mem |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 313) +----------------------------------+--------------------------------+
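
A script that has to run on kernels with either layout could prefer the newer
path and fall back to the deprecated one. The following is a minimal sketch
under that assumption; the function name fadump_sysfs_path and its optional
root argument are hypothetical and exist only for this example.

```shell
# Pick the newer sysfs path when it exists, else the deprecated one.
#   $1 = file name in the new directory: enabled, registered or release_mem
#   $2 = sysfs root (defaults to /sys/kernel); overridable for testing
fadump_sysfs_path() {
    name=$1
    root=${2:-/sys/kernel}
    if [ -e "$root/fadump/$name" ]; then
        echo "$root/fadump/$name"
    else
        echo "$root/fadump_$name"
    fi
}
```

A caller could then write, for instance,
cat "$(fadump_sysfs_path enabled)" instead of hard-coding either name.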
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 314)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 315) Here is the list of files under powerpc debugfs:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 316) (Assuming debugfs is mounted on /sys/kernel/debug directory.)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 317)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 318) /sys/kernel/debug/powerpc/fadump_region
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 319) This file shows the reserved memory regions if FADump is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 320) enabled; otherwise this file is empty. The output format
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 321) is::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 322)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 323) <region>: [<start>-<end>] <reserved-size> bytes, Dumped: <dump-size>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 324)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 325) and for the kernel DUMP region it is::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 326)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 327) DUMP: Src: <src-addr>, Dest: <dest-addr>, Size: <size>, Dumped: # bytes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 328)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 329) e.g.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 330) Contents when FADump is registered during first kernel::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 331)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 332) # cat /sys/kernel/debug/powerpc/fadump_region
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 333) CPU : [0x0000006ffb0000-0x0000006fff001f] 0x40020 bytes, Dumped: 0x0
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 334) HPTE: [0x0000006fff0020-0x0000006fff101f] 0x1000 bytes, Dumped: 0x0
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 335) DUMP: [0x0000006fff1020-0x0000007fff101f] 0x10000000 bytes, Dumped: 0x0
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 336)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 337) Contents when FADump is active during second kernel::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 338)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 339) # cat /sys/kernel/debug/powerpc/fadump_region
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 340) CPU : [0x0000006ffb0000-0x0000006fff001f] 0x40020 bytes, Dumped: 0x40020
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 341) HPTE: [0x0000006fff0020-0x0000006fff101f] 0x1000 bytes, Dumped: 0x1000
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 342) DUMP: [0x0000006fff1020-0x0000007fff101f] 0x10000000 bytes, Dumped: 0x10000000
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 343) : [0x00000010000000-0x0000006ffaffff] 0x5ffb0000 bytes, Dumped: 0x5ffb0000
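
The region lines shown above are easy to post-process. As a minimal sketch,
the helper below sums the reserved-size field of every line in that
bracketed format and prints the total in bytes. The function name
fadump_region_total is an assumption made for this example, and lines in any
other format are silently skipped.

```shell
# Sum the <reserved-size> fields of fadump_region-style input read
# from stdin and print the total in bytes (decimal).
fadump_region_total() {
    total=0
    # Extract the hex size that sits between "] " and " bytes".
    for size in $(sed -n 's/.*] \(0x[0-9a-fA-F]*\) bytes.*/\1/p'); do
        # Shell arithmetic accepts 0x-prefixed hexadecimal constants.
        total=$(( total + $size ))
    done
    echo "$total"
}
```

With debugfs mounted as assumed above, it could be invoked as
fadump_region_total < /sys/kernel/debug/powerpc/fadump_region.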
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 344)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 345)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 346) NOTE:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 347) Please refer to Documentation/filesystems/debugfs.rst on
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 348) how to mount the debugfs filesystem.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 349)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 350)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 351) TODO:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 352) -----
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 353) - Need to come up with a better approach to find out a more
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 354) accurate boot memory size that is required for a kernel to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 355) boot successfully when booted with restricted memory.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 356) - The FADump implementation introduces a FADump crash info structure
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 357) in the scratch area before the ELF core header. The idea of introducing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 358) this structure is to pass some important crash info data to the second
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 359) kernel, which will help the second kernel populate the ELF core header with
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 360) correct data before it gets exported through /proc/vmcore. The current
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 361) design does not address the possibility of introducing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 362) additional fields (in future) to this structure without affecting
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 363) compatibility. Need to come up with a better approach to address this.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 364)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 365) The possible approaches are:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 366)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 367) 1. Introduce a version field for version tracking, and bump up the version
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 368) whenever a new field is added to the structure in future. The version
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 369) field can be used to find out what fields are valid for the current
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 370) version of the structure.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 371) 2. Reserve an area of predefined size (say PAGE_SIZE) for this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 372) structure and keep the unused area reserved (initialized to zero)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 373) for future field additions.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 374)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 375) The advantage of approach 1 over 2 is we don't need to reserve extra space.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 376)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 377) Author: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 378)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 379) This document is based on the original documentation written for phyp
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 380) assisted dump by Linas Vepstas and Manish Ahuja.