^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1) =============================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2) An ad-hoc collection of notes on IA64 MCA and INIT processing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 3) =============================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 4)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 5) Feel free to update it with notes about any area that is not clear.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 6)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 7) ---
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 8)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 9) MCA/INIT are completely asynchronous. They can occur at any time, when
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 10) the OS is in any state, including when one of the cpus is already
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 11) holding a spinlock. Trying to get any lock from MCA/INIT state is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 12) asking for deadlock. Also the state of structures that are protected
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 13) by locks, including linked lists, is indeterminate.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 14)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 15) ---
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 16)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 17) The complicated ia64 MCA process. All of this is mandated by Intel's
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 18) specification for ia64 SAL, error recovery and unwind; it is not as
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 19) if we have a choice here.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 20)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 21) * MCA occurs on one cpu, usually due to a double bit memory error.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 22) This is the monarch cpu.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 23)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 24) * SAL sends an MCA rendezvous interrupt (which is a normal interrupt)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 25) to all the other cpus, the slaves.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 26)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 27) * Slave cpus that receive the MCA interrupt call down into SAL, where
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 28) they end up spinning disabled while the MCA is being serviced.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 29)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 30) * If any slave cpu was already spinning disabled when the MCA occurred
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 31) then it cannot service the MCA interrupt. SAL waits ~20 seconds then
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 32) sends an unmaskable INIT event to the slave cpus that have not
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 33) already rendezvoused.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 34)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 35) * Because MCA/INIT can be delivered at any time, including when the cpu
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 36) is down in PAL in physical mode, the registers at the time of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 37) event are _completely_ undefined. In particular the MCA/INIT
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 38) handlers cannot rely on the thread pointer, because PAL physical mode can
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 39) (and does) modify TP. It is allowed to do that as long as it resets
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 40) TP on return. However MCA/INIT events expose us to these PAL
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 41) internal TP changes. Hence curr_task().
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 42)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 43) * If an MCA/INIT event occurs while the kernel was running (not user
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 44) space) and the kernel has called PAL then the MCA/INIT handler cannot
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 45) assume that the kernel stack is in a fit state to be used. Mainly
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 46) because PAL may or may not maintain the stack pointer internally.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 47) Because the MCA/INIT handlers cannot trust the kernel stack, they
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 48) have to use their own, per-cpu stacks. The MCA/INIT stacks are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 49) preformatted with just enough task state to let the relevant handlers
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 50) do their job.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 51)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 52) * Unlike most other architectures, the ia64 struct task is embedded in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 53) the kernel stack[1]. So switching to a new kernel stack means that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 54) we switch to a new task as well. Because various bits of the kernel
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 55) assume that current points into the struct task, switching to a new
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 56) stack also means a new value for current.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 57)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 58) * Once all slaves have rendezvoused and are spinning disabled, the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 59) monarch is entered. The monarch now tries to diagnose the problem
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 60) and decide if it can recover or not.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 61)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 62) * Part of the monarch's job is to look at the state of all the other
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 63) tasks. The only way to do that on ia64 is to call the unwinder,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 64) as mandated by Intel.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 65)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 66) * The starting point for the unwind depends on whether a task is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 67) running or not. That is, whether it is on a cpu or is blocked. The
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 68) monarch has to determine whether or not a task is on a cpu before it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 69) knows how to start unwinding it. The tasks that received an MCA or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 70) INIT event are no longer running; they have been converted to blocked
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 71) tasks. But (and it's a big but), the cpus that received the MCA
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 72) rendezvous interrupt are still running on their normal kernel stacks!
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 73)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 74) * To distinguish between these two cases, the monarch must know which
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 75) tasks are on a cpu and which are not. Hence each slave cpu that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 76) switches to an MCA/INIT stack registers its new stack using
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 77) set_curr_task(), so the monarch can tell that the _original_ task is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 78) no longer running on that cpu. That gives us a decent chance of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 79) getting a valid backtrace of the _original_ task.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 80)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 81) * MCA/INIT can be nested, to a depth of 2 on any cpu. In the case of a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 82) nested error, we want diagnostics on the MCA/INIT handler that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 83) failed, not on the task that was originally running. Again this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 84) requires set_curr_task() so the MCA/INIT handlers can register their
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 85) own stack as running on that cpu. Then a recursive error gets a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 86) trace of the failing handler's "task".
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 87)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 88) [1]
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 89) My (Keith Owens) original design called for ia64 to separate its
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 90) struct task and the kernel stacks. Then the MCA/INIT data would be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 91) chained stacks like i386 interrupt stacks. But that required
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 92) radical surgery on the rest of ia64, plus extra hard wired TLB
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 93) entries with their associated performance degradation. David
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 94) Mosberger vetoed that approach. Which meant that separate kernel
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 95) stacks meant separate "tasks" for the MCA/INIT handlers.
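
The bookkeeping described in the set_curr_task() bullets above can be
pictured with a small user-space model. This is only a sketch under stated
assumptions, not the kernel's implementation: every toy_* name, the struct
layout and the cpu count are hypothetical. It shows one "current task" slot
per cpu, a handler registering its own "task" in that slot, and the monarch
asking whether a given task is still on a cpu.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NR_CPUS 4   /* hypothetical cpu count for this sketch */

/* Stand-in for the task structure (hypothetical, heavily simplified). */
struct toy_task {
    int pid;
    int is_mca_init_handler;   /* plays the role of _TIF_MCA_INIT */
};

/* One "currently running" slot per cpu. */
static struct toy_task *toy_cpu_curr[TOY_NR_CPUS];

/* Toy analogue of curr_task(): look the task up by cpu, not via TP. */
static struct toy_task *toy_curr_task(int cpu)
{
    assert(cpu >= 0 && cpu < TOY_NR_CPUS);
    return toy_cpu_curr[cpu];
}

/*
 * Toy analogue of set_curr_task(): register a new task on the cpu and
 * return the one it replaced, so the caller can restore it later.
 */
static struct toy_task *toy_set_curr_task(int cpu, struct toy_task *t)
{
    struct toy_task *prev = toy_curr_task(cpu);

    toy_cpu_curr[cpu] = t;
    return prev;
}

/* The monarch's question: is this task on any cpu right now? */
static int toy_task_on_cpu(const struct toy_task *t)
{
    for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++)
        if (toy_cpu_curr[cpu] == t)
            return 1;
    return 0;
}
```

In this model, once a handler calls toy_set_curr_task(cpu, &handler), the
original task no longer appears in any slot, so it would be unwound as a
blocked task, while the handler's own "task" is the one reported as running.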
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 96)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 97) ---
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 98)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 99) INIT is less complicated than MCA. Pressing the NMI button or using
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 100) the equivalent command on the management console sends INIT to all
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 101) cpus. SAL picks one of the cpus as the monarch and the rest are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 102) slaves. All the OS INIT handlers are entered at approximately the same
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 103) time. The OS monarch prints the state of all tasks and returns, after
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 104) which the slaves return and the system resumes.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 105)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 106) At least that is what is supposed to happen. Alas there are broken
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 107) versions of SAL out there. Some drive all the cpus as monarchs. Some
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 108) drive them all as slaves. Some drive one cpu as monarch, wait for that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 109) cpu to return from the OS then drive the rest as slaves. Some versions
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 110) of SAL cannot even cope with returning from the OS; they spin inside
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 111) SAL on resume. The OS INIT code has workarounds for some of these
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 112) broken SAL symptoms, but some simply cannot be fixed from the OS side.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 113)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 114) ---
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 115)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 116) The scheduler hooks used by ia64 (curr_task, set_curr_task) are layer
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 117) violations. Unfortunately MCA/INIT start off as massive layer
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 118) violations (can occur at _any_ time) and they build from there.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 119)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 120) At least ia64 makes an attempt at recovering from hardware errors, but
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 121) it is a difficult problem because of the asynchronous nature of these
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 122) errors. When processing an unmaskable interrupt we sometimes need
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 123) special code to cope with our inability to take any locks.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 124)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 125) ---
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 126)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 127) How is ia64 MCA/INIT different from x86 NMI?
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 128)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 129) * x86 NMI typically gets delivered to one cpu. MCA/INIT gets sent to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 130) all cpus.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 131)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 132) * x86 NMI cannot be nested. MCA/INIT can be nested, to a depth of 2
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 133) per cpu.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 134)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 135) * x86 has a separate struct task which points to one of multiple kernel
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 136) stacks. ia64 has the struct task embedded in the single kernel
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 137) stack, so switching stack means switching task.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 138)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 139) * x86 does not call the BIOS so the NMI handler does not have to worry
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 140) about any registers having changed. MCA/INIT can occur while the cpu
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 141) is in PAL in physical mode, with undefined registers and an undefined
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 142) kernel stack.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 143)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 144) * i386 backtrace is not very sensitive to whether a process is running
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 145) or not. ia64 unwind is very, very sensitive to whether a process is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 146) running or not.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 147)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 148) ---
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 149)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 150) What happens when MCA/INIT is delivered while a cpu is running user
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 151) space code?
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 152)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 153) The user mode registers are stored in the RSE area of the MCA/INIT stack on
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 154) entry to the OS and are restored from there on return to SAL, so user
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 155) mode registers are preserved across a recoverable MCA/INIT. Since the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 156) OS has no idea what unwind data is available for the user space stack,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 157) MCA/INIT never tries to backtrace user space. Which means that the OS
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 158) does not bother making the user space process look like a blocked task,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 159) i.e. the OS does not copy pt_regs and switch_stack to the user space
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 160) stack. Also the OS has no idea how big the user space RSE and memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 161) stacks are, which makes it too risky to copy the saved state to a user
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 162) mode stack.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 163)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 164) ---
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 165)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 166) How do we get a backtrace on the tasks that were running when MCA/INIT
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 167) was delivered?
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 168)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 169) mca.c::ia64_mca_modify_original_stack(). That identifies and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 170) verifies the original kernel stack, copies the dirty registers from
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 171) the MCA/INIT stack's RSE to the original stack's RSE, copies the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 172) skeleton struct pt_regs and switch_stack to the original stack, fills
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 173) in the skeleton structures from the PAL minstate area and updates the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 174) original stack's thread.ksp. That makes the original stack look
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 175) exactly like any other blocked task, i.e. it now appears to be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 176) sleeping. To get a backtrace, just start with thread.ksp for the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 177) original task and unwind like any other sleeping task.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 178)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 179) ---
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 180)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 181) How do we identify the tasks that were running when MCA/INIT was
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 182) delivered?
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 183)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 184) If the previous task has been verified and converted to a blocked
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 185) state, then sos->prev_task on the MCA/INIT stack is updated to point to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 186) the previous task. You can look at that field in dumps or debuggers.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 187) To help distinguish between the handler and the original tasks,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 188) handlers have _TIF_MCA_INIT set in thread_info.flags.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 189)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 190) The sos data is always in the MCA/INIT handler stack, at offset
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 191) MCA_SOS_OFFSET. You can get that value from mca_asm.h or calculate it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 192) as KERNEL_STACK_SIZE - sizeof(struct pt_regs) - sizeof(struct
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 193) ia64_sal_os_state), with 16 byte alignment for all structures.
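
The offset arithmetic in the paragraph above can be sketched as follows.
This is a minimal illustration, assuming that each structure is carved
downwards from the top of the stack and that every intermediate offset is
rounded down to a 16 byte boundary. The toy_* names are hypothetical and
the sizes are passed in as parameters, because the real values depend on
the kernel configuration; use mca_asm.h when you need the actual number.

```c
#include <assert.h>
#include <stddef.h>

/* Round x down to a multiple of 16. */
static size_t toy_align16_down(size_t x)
{
    return x & ~(size_t)15;
}

/*
 * Offset of the sos data from the base of the handler stack, following
 * the formula in the text: stack size, minus pt_regs, minus the SAL to
 * OS state, with 16 byte alignment applied after each subtraction.
 * All three sizes are caller-supplied placeholders in this sketch.
 */
static size_t toy_mca_sos_offset(size_t stack_size,
                                 size_t pt_regs_size,
                                 size_t sos_size)
{
    /* Leave slack for the two alignment steps. */
    assert(stack_size >= pt_regs_size + sos_size + 32);

    size_t pt_regs_off = toy_align16_down(stack_size - pt_regs_size);

    return toy_align16_down(pt_regs_off - sos_size);
}
```

Given the base address of an MCA/INIT stack in a dump, adding the value
returned for the real structure sizes would give the address at which to
look for the sos data.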
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 194)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 195) Also the comm field of the MCA/INIT task is modified to include the pid
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 196) of the original task, for humans to use. For example, a comm field of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 197) 'MCA 12159' means that pid 12159 was running when the MCA was
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 198) delivered.
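
As a rough illustration of the naming convention above, the following
sketch builds such a comm string. It is not the code in mca.c: the
function name, its parameters and TOY_COMM_LEN are assumptions made for
this example only.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define TOY_COMM_LEN 16   /* hypothetical buffer size for this sketch */

/*
 * Write "<event> <pid>" into comm, e.g. "MCA 12159". snprintf() always
 * NUL-terminates and truncates if the result does not fit in len bytes.
 */
static void toy_format_handler_comm(char *comm, size_t len,
                                    const char *event, int orig_pid)
{
    assert(comm != NULL && len > 0);
    snprintf(comm, len, "%s %d", event, orig_pid);
}
```

Calling toy_format_handler_comm(comm, sizeof(comm), "MCA", 12159) with a
TOY_COMM_LEN byte buffer leaves the string 'MCA 12159' in comm.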