Orange Pi5 kernel

Deprecated Linux kernel 5.10.110 for OrangePi 5/5B/5+ boards

^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   1) =============================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   2) An ad-hoc collection of notes on IA64 MCA and INIT processing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   3) =============================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   4) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   5) Feel free to update it with notes about any area that is not clear.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   6) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   7) ---
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   8) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300   9) MCA/INIT are completely asynchronous.  They can occur at any time, when
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  10) the OS is in any state, including when one of the cpus is already
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  11) holding a spinlock.  Trying to get any lock from MCA/INIT state is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  12) asking for deadlock.  Also the state of structures that are protected
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  13) by locks is indeterminate, including linked lists.
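
One common way to cope with this constraint is to never block on a lock
from the handler: try once, and skip the optional work if the lock is
already held.  The sketch below illustrates the idea in user space C.
It is not the code in mca.c; demo_lock, demo_trylock(), demo_unlock()
and handler_log_event() are hypothetical names invented for this
example.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical spinlock for illustration; not the kernel's spinlock_t. */
struct demo_lock {
	atomic_flag held;
};

/* Try once to take the lock.  Never spins. */
static bool demo_trylock(struct demo_lock *l)
{
	/* test_and_set returns the previous value: false means we own it now. */
	return !atomic_flag_test_and_set_explicit(&l->held,
						  memory_order_acquire);
}

static void demo_unlock(struct demo_lock *l)
{
	atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/*
 * Handler-side update of lock-protected data.  Returns true if the
 * update ran, false if it was skipped because the lock was held,
 * possibly by the very code that the event interrupted.
 */
static bool handler_log_event(struct demo_lock *l, int *counter)
{
	if (!demo_trylock(l))
		return false;	/* spinning here could deadlock */
	(*counter)++;
	demo_unlock(l);
	return true;
}
```

If the trylock fails, the protected data may be half updated, so the
handler in this sketch leaves it alone rather than reading it.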
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  14) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  15) ---
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  16) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  17) The complicated ia64 MCA process.  All of this is mandated by Intel's
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  18) specification for ia64 SAL, error recovery and unwind; it is not as
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  19) if we have a choice here.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  20) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  21) * MCA occurs on one cpu, usually due to a double bit memory error.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  22)   This is the monarch cpu.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  23) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  24) * SAL sends an MCA rendezvous interrupt (which is a normal interrupt)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  25)   to all the other cpus, the slaves.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  26) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  27) * Slave cpus that receive the MCA interrupt call down into SAL, they
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  28)   end up spinning disabled while the MCA is being serviced.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  29) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  30) * If any slave cpu was already spinning disabled when the MCA occurred
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  31)   then it cannot service the MCA interrupt.  SAL waits ~20 seconds then
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  32)   sends an unmaskable INIT event to the slave cpus that have not
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  33)   already rendezvoused.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  34) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  35) * Because MCA/INIT can be delivered at any time, including when the cpu
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  36)   is down in PAL in physical mode, the registers at the time of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  37)   event are _completely_ undefined.  In particular the MCA/INIT
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  38)   handlers cannot rely on the thread pointer (TP), because PAL physical mode can
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  39)   (and does) modify TP.  It is allowed to do that as long as it resets
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  40)   TP on return.  However MCA/INIT events expose us to these PAL
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  41)   internal TP changes.  Hence curr_task(), which does not rely on TP.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  42) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  43) * If an MCA/INIT event occurs while the kernel was running (not user
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  44)   space) and the kernel has called PAL then the MCA/INIT handler cannot
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  45)   assume that the kernel stack is in a fit state to be used.  Mainly
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  46)   because PAL may or may not maintain the stack pointer internally.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  47)   Because the MCA/INIT handlers cannot trust the kernel stack, they
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  48)   have to use their own, per-cpu stacks.  The MCA/INIT stacks are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  49)   preformatted with just enough task state to let the relevant handlers
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  50)   do their job.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  51) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  52) * Unlike most other architectures, the ia64 struct task is embedded in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  53)   the kernel stack[1].  So switching to a new kernel stack means that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  54)   we switch to a new task as well.  Because various bits of the kernel
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  55)   assume that current points into the struct task, switching to a new
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  56)   stack also means a new value for current.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  57) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  58) * Once all slaves have rendezvoused and are spinning disabled, the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  59)   monarch is entered.  The monarch now tries to diagnose the problem
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  60)   and decide if it can recover or not.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  61) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  62) * Part of the monarch's job is to look at the state of all the other
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  63)   tasks.  The only way to do that on ia64 is to call the unwinder,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  64)   as mandated by Intel.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  65) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  66) * The starting point for the unwind depends on whether a task is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  67)   running or not.  That is, whether it is on a cpu or is blocked.  The
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  68)   monarch has to determine whether or not a task is on a cpu before it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  69)   knows how to start unwinding it.  The tasks that received an MCA or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  70)   INIT event are no longer running; they have been converted to blocked
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  71)   tasks.  But (and it's a big but), the cpus that received the MCA
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  72)   rendezvous interrupt are still running on their normal kernel stacks!
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  73) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  74) * To distinguish between these two cases, the monarch must know which
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  75)   tasks are on a cpu and which are not.  Hence each slave cpu that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  76)   switches to an MCA/INIT stack, registers its new stack using
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  77)   set_curr_task(), so the monarch can tell that the _original_ task is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  78)   no longer running on that cpu.  That gives us a decent chance of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  79)   getting a valid backtrace of the _original_ task.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  80) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  81) * MCA/INIT can be nested, to a depth of 2 on any cpu.  In the case of a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  82)   nested error, we want diagnostics on the MCA/INIT handler that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  83)   failed, not on the task that was originally running.  Again this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  84)   requires set_curr_task() so the MCA/INIT handlers can register their
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  85)   own stack as running on that cpu.  Then a recursive error gets a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  86)   trace of the failing handler's "task".
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  87) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  88) [1]
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  89)     My (Keith Owens) original design called for ia64 to separate its
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  90)     struct task and the kernel stacks.  Then the MCA/INIT data would be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  91)     chained stacks like i386 interrupt stacks.  But that required
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  92)     radical surgery on the rest of ia64, plus extra hard wired TLB
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  93)     entries with their associated performance degradation.  David
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  94)     Mosberger vetoed that approach, which meant that separate kernel
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  95)     stacks meant separate "tasks" for the MCA/INIT handlers.
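
The role that curr_task() and set_curr_task() play in the steps above
can be sketched with a per-cpu table of task pointers.  This is a user
space model only.  All demo_ names, the table layout and the return
value of demo_set_curr_task() are assumptions made for illustration;
they are not the kernel's scheduler interfaces.

```c
#include <stdbool.h>
#include <stddef.h>

#define DEMO_NR_CPUS 4			/* assumed cpu count */

/* Hypothetical stand-in for struct task_struct. */
struct demo_task {
	int pid;
	bool is_mca_init_handler;
};

/* One "currently running" slot per cpu. */
static struct demo_task *demo_cpu_curr[DEMO_NR_CPUS];

/* Look up the running task without consulting any cpu register. */
static struct demo_task *demo_curr_task(int cpu)
{
	return demo_cpu_curr[cpu];
}

/* Register a new running task on a cpu and return the previous one. */
static struct demo_task *demo_set_curr_task(int cpu, struct demo_task *t)
{
	struct demo_task *prev = demo_cpu_curr[cpu];

	demo_cpu_curr[cpu] = t;
	return prev;
}

/* Monarch side: is this task on some cpu, or can it be treated as blocked? */
static bool demo_task_on_cpu(const struct demo_task *t)
{
	for (int cpu = 0; cpu < DEMO_NR_CPUS; cpu++)
		if (demo_cpu_curr[cpu] == t)
			return true;
	return false;
}
```

When a handler registers its own "task" on entry, the original task
stops appearing as running, so the monarch in this model would unwind
it as a blocked task.  A nested event on the same cpu would find the
handler's task in the slot instead.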
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  96) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  97) ---
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  98) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300  99) INIT is less complicated than MCA.  Pressing the NMI button or using
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 100) the equivalent command on the management console sends INIT to all
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 101) cpus.  SAL picks one of the cpus as the monarch and the rest are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 102) slaves.  All the OS INIT handlers are entered at approximately the same
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 103) time.  The OS monarch prints the state of all tasks and returns, after
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 104) which the slaves return and the system resumes.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 105) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 106) At least that is what is supposed to happen.  Alas there are broken
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 107) versions of SAL out there.  Some drive all the cpus as monarchs.  Some
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 108) drive them all as slaves.  Some drive one cpu as monarch, wait for that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 109) cpu to return from the OS then drive the rest as slaves.  Some versions
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 110) of SAL cannot even cope with returning from the OS; they spin inside
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 111) SAL on resume.  The OS INIT code has workarounds for some of these
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 112) broken SAL symptoms, but some simply cannot be fixed from the OS side.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 113) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 114) ---
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 115) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 116) The scheduler hooks used by ia64 (curr_task, set_curr_task) are layer
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 117) violations.  Unfortunately MCA/INIT start off as massive layer
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 118) violations (can occur at _any_ time) and they build from there.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 119) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 120) At least ia64 makes an attempt at recovering from hardware errors, but
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 121) it is a difficult problem because of the asynchronous nature of these
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 122) errors.  When processing an unmaskable interrupt we sometimes need
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 123) special code to cope with our inability to take any locks.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 124) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 125) ---
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 126) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 127) How is ia64 MCA/INIT different from x86 NMI?
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 128) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 129) * x86 NMI typically gets delivered to one cpu.  MCA/INIT gets sent to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 130)   all cpus.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 131) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 132) * x86 NMI cannot be nested.  MCA/INIT can be nested, to a depth of 2
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 133)   per cpu.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 134) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 135) * x86 has a separate struct task which points to one of multiple kernel
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 136)   stacks.  ia64 has the struct task embedded in the single kernel
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 137)   stack, so switching stack means switching task.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 138) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 139) * x86 does not call the BIOS so the NMI handler does not have to worry
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 140)   about any registers having changed.  MCA/INIT can occur while the cpu
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 141)   is in PAL in physical mode, with undefined registers and an undefined
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 142)   kernel stack.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 143) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 144) * i386 backtrace is not very sensitive to whether a process is running
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 145)   or not.  ia64 unwind is very, very sensitive to whether a process is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 146)   running or not.
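
The "switching stack means switching task" point can be illustrated
with a single allocation that holds both the task structure and the
stack.  This is a simplified model: the 4096 byte size, the field
layout and every demo_ name are assumptions, and it does not show how
ia64 actually locates current.

```c
#include <stddef.h>

#define DEMO_STACK_SIZE 4096		/* assumed size, for illustration */

/* Hypothetical stand-in for struct task_struct. */
struct demo_task {
	int pid;
};

/* One allocation: the task lives at the low end of its own stack. */
union demo_task_stack {
	struct demo_task task;
	unsigned char stack[DEMO_STACK_SIZE];
};

/* The two pieces of cpu state this model cares about. */
struct demo_cpu {
	unsigned char *sp;
	struct demo_task *current;
};

/*
 * Switch a cpu to another stack.  Because the task is embedded in the
 * stack allocation, current necessarily changes with it.
 */
static void demo_switch_stack(struct demo_cpu *cpu, union demo_task_stack *ts)
{
	cpu->sp = ts->stack + DEMO_STACK_SIZE;	/* stack grows down from the top */
	cpu->current = &ts->task;
}
```

In this model there is no way to move a cpu onto a handler stack while
keeping the original task as current, which is why the MCA/INIT
handlers end up with "tasks" of their own.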
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 147) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 148) ---
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 149) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 150) What happens when MCA/INIT is delivered while a cpu is running user
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 151) space code?
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 152) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 153) The user mode registers are stored in the RSE area of the MCA/INIT stack on
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 154) entry to the OS and are restored from there on return to SAL, so user
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 155) mode registers are preserved across a recoverable MCA/INIT.  Since the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 156) OS has no idea what unwind data is available for the user space stack,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 157) MCA/INIT never tries to backtrace user space, which means that the OS
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 158) does not bother making the user space process look like a blocked task,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 159) i.e. the OS does not copy pt_regs and switch_stack to the user space
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 160) stack.  Also the OS has no idea how big the user space RSE and memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 161) stacks are, which makes it too risky to copy the saved state to a user
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 162) mode stack.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 163) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 164) ---
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 165) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 166) How do we get a backtrace on the tasks that were running when MCA/INIT
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 167) was delivered?
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 168) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 169) See mca.c::ia64_mca_modify_original_stack().  That identifies and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 170) verifies the original kernel stack, copies the dirty registers from
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 171) the MCA/INIT stack's RSE to the original stack's RSE, copies the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 172) skeleton struct pt_regs and switch_stack to the original stack, fills
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 173) in the skeleton structures from the PAL minstate area and updates the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 174) original stack's thread.ksp.  That makes the original stack look
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 175) exactly like any other blocked task, i.e. it now appears to be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 176) sleeping.  To get a backtrace, just start with thread.ksp for the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 177) original task and unwind like any other sleeping task.
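
A much simplified model of the "make it look blocked" step is sketched
below.  It only covers verifying the interrupted stack pointer, pushing
skeleton register frames onto the original stack and recording the new
ksp.  The RSE copy and the PAL minstate fill-in are left out, and the
structures, sizes and demo_ names are assumptions, not the layouts used
by mca.c.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy stand-ins for struct pt_regs and struct switch_stack. */
struct demo_pt_regs {
	uint64_t ip;
	uint64_t sp;
};

struct demo_switch_stack {
	uint64_t callee_saved[4];
};

/* Toy stand-in for the thread state that holds ksp. */
struct demo_thread {
	uintptr_t ksp;
};

/*
 * Verify that old_sp lies inside the original stack, then push the
 * skeleton frames below it and record the new ksp.  Returns 0 on
 * success, -1 if the stack cannot be trusted or has no room.
 */
static int demo_make_blocked(unsigned char *stack, size_t size,
			     uintptr_t old_sp,
			     const struct demo_pt_regs *regs,
			     const struct demo_switch_stack *sw,
			     struct demo_thread *thread)
{
	uintptr_t lo = (uintptr_t)stack;
	uintptr_t hi = lo + size;
	uintptr_t p;

	if (old_sp < lo || old_sp > hi)
		return -1;			/* sp is not in this stack */
	p = old_sp & ~(uintptr_t)15;		/* keep 16 byte alignment */
	if (p < lo || p - lo < sizeof(*regs) + sizeof(*sw))
		return -1;			/* not enough room left */

	p -= sizeof(*regs);
	memcpy((void *)p, regs, sizeof(*regs));
	p -= sizeof(*sw);
	memcpy((void *)p, sw, sizeof(*sw));
	thread->ksp = p;			/* unwind starts from here */
	return 0;
}
```

If verification fails, this sketch leaves the original stack untouched.
In that case no backtrace of the original task could be started from
ksp.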
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 178) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 179) ---
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 180) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 181) How do we identify the tasks that were running when MCA/INIT was
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 182) delivered?
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 183) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 184) If the previous task has been verified and converted to a blocked
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 185) state, then sos->prev_task on the MCA/INIT stack is updated to point to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 186) the previous task.  You can look at that field in dumps or debuggers.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 187) To help distinguish between the handler and the original tasks,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 188) handlers have _TIF_MCA_INIT set in thread_info.flags.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 189) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 190) The sos data is always in the MCA/INIT handler stack, at offset
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 191) MCA_SOS_OFFSET.  You can get that value from mca_asm.h or calculate it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 192) as KERNEL_STACK_SIZE - sizeof(struct pt_regs) - sizeof(struct
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 193) ia64_sal_os_state), with 16 byte alignment for all structures.
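
The calculation described above can be worked through as follows.  The
helper follows the formula as stated in this note, rounding down to a
16 byte boundary after each subtraction.  DEMO_ALIGN16 and
demo_sos_offset() are hypothetical names, and the sizes used in the
example are made up; the authoritative value is the one in mca_asm.h.

```c
#include <stddef.h>

/* Round down to a 16 byte boundary. */
#define DEMO_ALIGN16(x) ((x) & ~(size_t)15)

/*
 * Offset of the sos data from the base of an MCA/INIT stack:
 * pt_regs sits at the top of the stack, the sos data just below it.
 */
static size_t demo_sos_offset(size_t kernel_stack_size,
			      size_t pt_regs_size,
			      size_t sos_size)
{
	size_t pt_regs_off = DEMO_ALIGN16(kernel_stack_size - pt_regs_size);

	return DEMO_ALIGN16(pt_regs_off - sos_size);
}
```

With an assumed 32768 byte stack, a 400 byte pt_regs and a 200 byte
ia64_sal_os_state, pt_regs starts at 32368 and the sos data at
32168 rounded down to 32160.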
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 194) 
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 195) Also the comm field of the MCA/INIT task is modified to include the pid
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 196) of the original task, for humans to use.  For example, a comm field of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 197) 'MCA 12159' means that pid 12159 was running when the MCA was
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 198) delivered.
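
A small sketch of how such a comm field could be built, and how a
script reading a dump might recover the pid from it.  The 16 byte
buffer size and the demo_ helper names are assumptions made for this
example; only the 'MCA 12159' format comes from the text above.

```c
#include <stdio.h>

#define DEMO_COMM_LEN 16		/* assumed size of the comm buffer */

/* Build "<handler> <pid>", truncating if it does not fit. */
static void demo_format_comm(char comm[DEMO_COMM_LEN],
			     const char *handler, int pid)
{
	snprintf(comm, DEMO_COMM_LEN, "%s %d", handler, pid);
}

/* Recover the original pid, or -1 if comm is not in that form. */
static int demo_parse_comm_pid(const char *comm)
{
	char name[DEMO_COMM_LEN];
	int pid;

	if (sscanf(comm, "%15s %d", name, &pid) == 2)
		return pid;
	return -1;
}
```

For example, demo_format_comm(comm, "MCA", 12159) yields 'MCA 12159',
and demo_parse_comm_pid() on that string gives back 12159.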