^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1) .. SPDX-License-Identifier: GPL-2.0
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 3) ==============
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 4) Kernel Entries
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 5) ==============
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 6)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 7) This file documents some of the kernel entries in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 8) arch/x86/entry/entry_64.S. A lot of this explanation is adapted from
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 9) an email from Ingo Molnar:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 10)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 11) http://lkml.kernel.org/r/<20110529191055.GC9835%40elte.hu>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 12)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 13) The x86 architecture has quite a few different ways to jump into
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 14) kernel code. Most of these entry points are registered in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 15) arch/x86/kernel/traps.c and implemented in arch/x86/entry/entry_64.S
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 16) for 64-bit and arch/x86/entry/entry_32.S for 32-bit. In addition,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 17) arch/x86/entry/entry_64_compat.S implements the 32-bit compatibility
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 18) syscall entry points, which let 32-bit processes execute syscalls
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 19) when running on 64-bit kernels.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 20)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 21) The IDT vector assignments are listed in arch/x86/include/asm/irq_vectors.h.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 22)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 23) Some of these entries are:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 24)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 25) - system_call: syscall instruction from 64-bit code.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 26)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 27) - entry_INT80_compat: int 0x80 from 32-bit or 64-bit code; compat syscall
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 28)   either way.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 29)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 30) - entry_SYSCALL_compat, ia32_sysenter: syscall and sysenter from 32-bit
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 31)   code.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 32)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 33) - interrupt: An array of entries. Every IDT vector that doesn't
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 34)   explicitly point somewhere else gets set to the corresponding
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 35)   value in interrupts. These point to a whole array of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 36)   magically-generated functions that make their way to do_IRQ with
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 37)   the interrupt number as a parameter.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 38)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 39) - APIC interrupts: Various special-purpose interrupts for things
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 40)   like TLB shootdown.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 41)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 42) - Architecturally-defined exceptions like divide_error.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 43)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 44) There are a few complexities here. The different x86-64 entries
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 45) have different calling conventions. The syscall and sysenter
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 46) instructions have their own peculiar calling conventions. Some of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 47) the IDT entries push an error code onto the stack; others don't.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 48) IDT entries using the IST alternative stack mechanism need their own
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 49) magic to get the stack frames right. (You can find some
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 50) documentation in the AMD APM, Volume 2, Chapter 8 and the Intel SDM,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 51) Volume 3, Chapter 6.)
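
The error-code difference mentioned above can be sketched in C. This is
only an illustration based on the exception tables in the vendor manuals:
``has_error_code()`` and ``error_code_example()`` are hypothetical names,
not kernel functions, and the sketch covers only the architectural
exception vectors 0 through 21.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Illustrative only. has_error_code() is a hypothetical helper, not a
 * kernel function. It covers the architectural exception vectors 0-21;
 * higher vectors are outside the scope of this sketch.
 */
static bool has_error_code(unsigned int vector)
{
    switch (vector) {
    case 8:   /* #DF double fault */
    case 10:  /* #TS invalid TSS */
    case 11:  /* #NP segment not present */
    case 12:  /* #SS stack-segment fault */
    case 13:  /* #GP general protection */
    case 14:  /* #PF page fault */
    case 17:  /* #AC alignment check */
    case 21:  /* #CP control protection */
        return true;
    default:
        return false;
    }
}

/* Usage: #PF (14) carries an error code, #DE (0, divide_error) does not. */
static void error_code_example(void)
{
    assert(has_error_code(14));
    assert(!has_error_code(0));
}
```

An entry stub for a vector without an error code typically pushes a dummy
value so that both kinds of entry end up with the same stack frame layout.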
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 52)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 53) Dealing with the swapgs instruction is especially tricky. SWAPGS
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 54) exchanges the GS base with MSR_KERNEL_GS_BASE, which toggles between
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 55) the kernel gs and the user gs. The instruction is rather fragile: it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 56) must nest perfectly and only to a single depth. It should be executed
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 57) exactly once on entry from user mode and exactly once on the return to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 58) user space. If we mess that up even slightly, we crash.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 59)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 60) So when we have a secondary entry, already in kernel mode, we *must
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 61) not* use SWAPGS blindly, nor may we forget to do a SWAPGS when the GS
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 62) base has not been switched yet.
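
Why a blind SWAPGS is fatal can be shown with a toy model. The sketch
below is not kernel code: ``struct gs_model``, ``model_swapgs()`` and the
two base addresses are hypothetical, chosen only to show that SWAPGS is a
plain exchange with no memory of how often it has run.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical example values: the user base is a lower-half address,
 * the kernel base is an upper-half ("negative") address.
 */
#define EXAMPLE_USER_GS   0x00007f0000001000ull
#define EXAMPLE_KERNEL_GS 0xffff888000010000ull

/* Toy model of the two values that SWAPGS exchanges. */
struct gs_model {
    uint64_t gs_base;         /* active base (MSR_GS_BASE) */
    uint64_t kernel_gs_base;  /* inactive base (MSR_KERNEL_GS_BASE) */
};

/* SWAPGS is a pure exchange: it does not know which side is "kernel". */
static void model_swapgs(struct gs_model *m)
{
    uint64_t tmp = m->gs_base;

    m->gs_base = m->kernel_gs_base;
    m->kernel_gs_base = tmp;
}

static bool model_gs_is_kernel(const struct gs_model *m)
{
    return m->gs_base == EXAMPLE_KERNEL_GS;
}

static void swapgs_example(void)
{
    struct gs_model m = { EXAMPLE_USER_GS, EXAMPLE_KERNEL_GS };

    model_swapgs(&m);                 /* entry from user mode */
    assert(model_gs_is_kernel(&m));

    model_swapgs(&m);                 /* blind SWAPGS on a nested entry */
    assert(!model_gs_is_kernel(&m));  /* kernel now runs on the user GS */
}
```

The second call models a secondary entry that swaps without checking: the
kernel ends up using the user GS base, which is the crash described above.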
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 63)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 64) Now, there's a secondary complication: there's a cheap way to test
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 65) which mode the CPU is in and an expensive way.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 66)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 67) The cheap way is to pick this info off the entry frame on the kernel
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 68) stack, from the CS of the ptregs area of the kernel stack::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 69)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 70)         xorl %ebx,%ebx
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 71)         testl $3,CS+8(%rsp)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 72)         je error_kernelspace
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 73)         SWAPGS
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 74)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 75) The expensive (paranoid) way is to read back the MSR_GS_BASE value
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 76) (which is what SWAPGS modifies)::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 77)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 78)         movl $1,%ebx
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 79)         movl $MSR_GS_BASE,%ecx
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 80)         rdmsr
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 81)         testl %edx,%edx
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 82)         js 1f /* negative -> in kernel */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 83)         SWAPGS
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 84)         xorl %ebx,%ebx
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 85)   1:    ret
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 86)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 87) If we are at an interrupt or user-trap/gate-alike boundary then we can
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 88) use the faster check: the stack is a reliable indicator of whether
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 89) SWAPGS was already done. If we see that we are a secondary entry
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 90) interrupting kernel mode execution, then we know that the GS base has
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 91) already been switched. If it says that we interrupted user-space
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 92) execution then we must do the SWAPGS.
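
The same decision as the ``testl $3,CS+8(%rsp)`` check can be written as
a C sketch. ``entered_from_user()`` is a hypothetical helper, and the two
selector values are only examples of a ring 0 and a ring 3 code segment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Example selector values (assumptions for this sketch). */
#define EXAMPLE_KERNEL_CS 0x10u  /* low two bits 0: ring 0 */
#define EXAMPLE_USER_CS   0x33u  /* low two bits 3: ring 3 */

/*
 * The low two bits of the saved CS hold the privilege level of the
 * interrupted code. Nonzero means the entry came from user mode, so
 * SWAPGS is still needed.
 */
static bool entered_from_user(uint64_t saved_cs)
{
    return (saved_cs & 3) != 0;
}

static void cheap_check_example(void)
{
    assert(entered_from_user(EXAMPLE_USER_CS));     /* do SWAPGS */
    assert(!entered_from_user(EXAMPLE_KERNEL_CS));  /* already switched */
}
```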
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 93)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 94) But if we are in an NMI/MCE/DEBUG/whatever super-atomic entry context,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 95) which might have triggered right after a normal entry wrote CS to the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 96) stack but before we executed SWAPGS, then the only safe way to check
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 97) for GS is the slower method: the RDMSR.
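
The race window can be made concrete with a small C sketch that runs both
checks on the same machine state. All names and values here are
hypothetical. The sketch relies on the same assumption as the assembly
above: a kernel GS base is an upper-half ("negative") address and a user
GS base is not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Example values (assumptions for this sketch). */
#define EXAMPLE_KERNEL_CS      0x10u
#define EXAMPLE_USER_GS_BASE   0x00007f0000001000ull
#define EXAMPLE_KERNEL_GS_BASE 0xffff888000010000ull

/* Cheap check: trust the privilege bits of the saved CS. */
static bool cheap_says_swapgs(uint64_t saved_cs)
{
    return (saved_cs & 3) != 0;
}

/*
 * Paranoid check: look at the GS base itself. Bit 63 is the sign bit
 * that "testl %edx,%edx; js" tests in the high half of the MSR value.
 */
static bool paranoid_says_swapgs(uint64_t gs_base)
{
    return (gs_base >> 63) == 0;
}

static void race_window_example(void)
{
    /*
     * An NMI hits the normal entry path before its SWAPGS: the
     * interrupted code already runs in ring 0, but the GS base is
     * still the user one.
     */
    uint64_t saved_cs = EXAMPLE_KERNEL_CS;
    uint64_t gs_base = EXAMPLE_USER_GS_BASE;

    assert(!cheap_says_swapgs(saved_cs));   /* wrong: skips SWAPGS */
    assert(paranoid_says_swapgs(gs_base));  /* right: SWAPGS needed */
}
```

In this state the saved CS reports kernel mode, so only the check on the
GS base itself gives the right answer.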
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 98)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 99) Therefore, super-atomic entries (except NMI, which is handled separately)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 100) must use idtentry with paranoid=1 to handle gsbase correctly. This
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 101) triggers three main behavior changes:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 102)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 103) - Interrupt entry will use the slower gsbase check.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 104) - Interrupt entry from user mode will switch away from the IST stack.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 105) - Interrupt exit to kernel mode will not attempt to reschedule.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 106)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 107) We try to only use IST entries and the paranoid entry code for vectors
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 108) that absolutely need the more expensive check for the GS base - and we
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 109) generate all 'normal' entry points with the regular (faster) paranoid=0
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 110) variant.