Orange Pi5 kernel

Deprecated Linux kernel 5.10.110 for OrangePi 5/5B/5+ boards

.. SPDX-License-Identifier: GPL-2.0

===============================
Kernel level exception handling
===============================

Commentary by Joerg Pommnitz <joerg@raleigh.ibm.com>

When a process runs in kernel mode, it often has to access user
mode memory whose address has been passed by an untrusted program.
To protect itself, the kernel has to verify this address.

In older versions of Linux this was done with the
int verify_area(int type, const void * addr, unsigned long size)
function (which has since been replaced by access_ok()).

This function verified that the memory area starting at address
'addr' and of size 'size' was accessible for the operation specified
in type (read or write). To do this, verify_area had to look up the
virtual memory area (vma) that contained the address addr. In the
normal case (correctly working program), this test was successful.
It only failed for a few buggy programs. In some kernel profiling
tests, this normally unneeded verification used up a considerable
amount of time.
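
To make the cost concrete, here is a rough user-space model of such an
up-front check. It is only a sketch: the struct layout, the flag names and
the function name model_verify_area are hypothetical, and a range that spans
two adjacent areas is simply rejected. The real verify_area() worked on the
kernel's own vma structures.

```c
#include <stddef.h>

/* Hypothetical access types and permission bits, for illustration only. */
enum { MODEL_READ = 1, MODEL_WRITE = 2 };

/* Hypothetical stand-in for a virtual memory area: [start, end) plus flags. */
struct model_vma {
	unsigned long start;
	unsigned long end;
	int prot;		/* bitmask of MODEL_READ / MODEL_WRITE */
};

/*
 * Return 0 if [addr, addr + size) lies inside one area that permits
 * 'type', and -14 (-EFAULT) otherwise. The walk over the areas is the
 * work that is wasted whenever the caller passed a valid pointer.
 */
int model_verify_area(const struct model_vma *vmas, size_t n, int type,
		      unsigned long addr, unsigned long size)
{
	size_t i;

	if (addr + size < addr)		/* range wraps around the address space */
		return -14;

	for (i = 0; i < n; i++) {
		const struct model_vma *v = &vmas[i];

		if (addr >= v->start && addr + size <= v->end)
			return (v->prot & type) == type ? 0 : -14;
	}
	return -14;			/* no area contains the range */
}
```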

To overcome this situation, Linus decided to let the virtual memory
hardware present in every Linux-capable CPU handle this test.

How does this work?

Whenever the kernel tries to access an address that is currently not
accessible, the CPU generates a page fault exception and calls the
page fault handler::

  void do_page_fault(struct pt_regs *regs, unsigned long error_code)

in arch/x86/mm/fault.c. The parameters on the stack are set up by
the low level assembly glue in arch/x86/entry/entry_32.S. The parameter
regs is a pointer to the saved registers on the stack, error_code
contains a reason code for the exception.

do_page_fault first obtains the inaccessible address from the CPU
control register CR2. If the address is within the virtual address
space of the process, the fault probably occurred because the page
was not swapped in, was write protected, or something similar. However,
we are interested in the other case: the address is not valid, there
is no vma that contains this address. In this case, the kernel jumps
to the bad_area label.

There it uses the address of the instruction that caused the exception
(i.e. regs->eip) to find an address where the execution can continue
(fixup). If this search is successful, the fault handler modifies the
return address (again regs->eip) and returns. The execution will
continue at the address in fixup.
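
The bad_area path described above can be sketched in plain C. This is a
simplified user-space model, not the kernel's code: struct model_regs,
struct model_entry and model_fixup_exception are hypothetical names, and the
table is searched linearly for brevity.

```c
#include <stddef.h>

/* Hypothetical stand-in for the saved registers; only the field we need. */
struct model_regs {
	unsigned long eip;	/* address of the faulting instruction */
};

/* One exception table entry: faulting instruction and its fixup code. */
struct model_entry {
	unsigned long insn;
	unsigned long fixup;
};

/*
 * Model of the bad_area path: if the faulting instruction is listed in
 * the table, redirect execution to its fixup and report success (1).
 * A return of 0 means no fixup exists, so the fault cannot be recovered
 * this way.
 */
int model_fixup_exception(struct model_regs *regs,
			  const struct model_entry *table, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (table[i].insn == regs->eip) {
			regs->eip = table[i].fixup;	/* resume here on return */
			return 1;
		}
	}
	return 0;
}
```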

Where does fixup point to?

Since we jump to the contents of fixup, fixup obviously points
to executable code. This code is hidden inside the user access macros.
I have picked the get_user macro defined in arch/x86/include/asm/uaccess.h
as an example. The definition is somewhat hard to follow, so let's peek at
the code generated by the preprocessor and the compiler. I selected
the get_user call in drivers/char/sysrq.c for a detailed examination.

The original code in sysrq.c line 587::

        get_user(c, buf);

The preprocessor output (edited to become somewhat readable)::

  (
    {
      long __gu_err = - 14 , __gu_val = 0;
      const __typeof__(*( (  buf ) )) *__gu_addr = ((buf));
      if (((((0 + current_set[0])->tss.segment) == 0x18 )  ||
        (((sizeof(*(buf))) <= 0xC0000000UL) &&
        ((unsigned long)(__gu_addr ) <= 0xC0000000UL - (sizeof(*(buf)))))))
        do {
          __gu_err  = 0;
          switch ((sizeof(*(buf)))) {
            case 1:
              __asm__ __volatile__(
                "1:      mov" "b" " %2,%" "b" "1\n"
                "2:\n"
                ".section .fixup,\"ax\"\n"
                "3:      movl %3,%0\n"
                "        xor" "b" " %" "b" "1,%" "b" "1\n"
                "        jmp 2b\n"
                ".section __ex_table,\"a\"\n"
                "        .align 4\n"
                "        .long 1b,3b\n"
                ".text"        : "=r"(__gu_err), "=q" (__gu_val): "m"((*(struct __large_struct *)
                              (   __gu_addr   )) ), "i"(- 14 ), "0"(  __gu_err  )) ;
                break;
            case 2:
              __asm__ __volatile__(
                "1:      mov" "w" " %2,%" "w" "1\n"
                "2:\n"
                ".section .fixup,\"ax\"\n"
                "3:      movl %3,%0\n"
                "        xor" "w" " %" "w" "1,%" "w" "1\n"
                "        jmp 2b\n"
                ".section __ex_table,\"a\"\n"
                "        .align 4\n"
                "        .long 1b,3b\n"
                ".text"        : "=r"(__gu_err), "=r" (__gu_val) : "m"((*(struct __large_struct *)
                              (   __gu_addr   )) ), "i"(- 14 ), "0"(  __gu_err  ));
                break;
            case 4:
              __asm__ __volatile__(
                "1:      mov" "l" " %2,%" "" "1\n"
                "2:\n"
                ".section .fixup,\"ax\"\n"
                "3:      movl %3,%0\n"
                "        xor" "l" " %" "" "1,%" "" "1\n"
                "        jmp 2b\n"
                ".section __ex_table,\"a\"\n"
                "        .align 4\n"
                "        .long 1b,3b\n"
                ".text"        : "=r"(__gu_err), "=r" (__gu_val) : "m"((*(struct __large_struct *)
                              (   __gu_addr   )) ), "i"(- 14 ), "0"(__gu_err));
                break;
            default:
              (__gu_val) = __get_user_bad();
          }
        } while (0) ;
      ((c)) = (__typeof__(*((buf))))__gu_val;
      __gu_err;
    }
  );

WOW! Black GCC/assembly magic. This is impossible to follow, so let's
see what code gcc generates::

 >         xorl %edx,%edx
 >         movl current_set,%eax
 >         cmpl $24,788(%eax)
 >         je .L1424
 >         cmpl $-1073741825,64(%esp)
 >         ja .L1423
 > .L1424:
 >         movl %edx,%eax
 >         movl 64(%esp),%ebx
 > #APP
 > 1:      movb (%ebx),%dl                /* this is the actual user access */
 > 2:
 > .section .fixup,"ax"
 > 3:      movl $-14,%eax
 >         xorb %dl,%dl
 >         jmp 2b
 > .section __ex_table,"a"
 >         .align 4
 >         .long 1b,3b
 > .text
 > #NO_APP
 > .L1423:
 >         movzbl %dl,%esi

The optimizer does a good job and gives us something we can actually
understand. Can we? The actual user access is quite obvious. Thanks
to the unified address space we can just access the address in user
memory. But what does the .section stuff do?

To understand this we have to look at the final kernel::

 > objdump --section-headers vmlinux
 >
 > vmlinux:     file format elf32-i386
 >
 > Sections:
 > Idx Name          Size      VMA       LMA       File off  Algn
 >   0 .text         00098f40  c0100000  c0100000  00001000  2**4
 >                   CONTENTS, ALLOC, LOAD, READONLY, CODE
 >   1 .fixup        000016bc  c0198f40  c0198f40  00099f40  2**0
 >                   CONTENTS, ALLOC, LOAD, READONLY, CODE
 >   2 .rodata       0000f127  c019a5fc  c019a5fc  0009b5fc  2**2
 >                   CONTENTS, ALLOC, LOAD, READONLY, DATA
 >   3 __ex_table    000015c0  c01a9724  c01a9724  000aa724  2**2
 >                   CONTENTS, ALLOC, LOAD, READONLY, DATA
 >   4 .data         0000ea58  c01abcf0  c01abcf0  000abcf0  2**4
 >                   CONTENTS, ALLOC, LOAD, DATA
 >   5 .bss          00018e21  c01ba748  c01ba748  000ba748  2**2
 >                   ALLOC
 >   6 .comment      00000ec4  00000000  00000000  000ba748  2**0
 >                   CONTENTS, READONLY
 >   7 .note         00001068  00000ec4  00000ec4  000bb60c  2**0
 >                   CONTENTS, READONLY

There are obviously two non-standard ELF sections in the generated object
file. But first we want to find out what happened to our code in the
final kernel executable::

 > objdump --disassemble --section=.text vmlinux
 >
 > c017e785 <do_con_write+c1> xorl   %edx,%edx
 > c017e787 <do_con_write+c3> movl   0xc01c7bec,%eax
 > c017e78c <do_con_write+c8> cmpl   $0x18,0x314(%eax)
 > c017e793 <do_con_write+cf> je     c017e79f <do_con_write+db>
 > c017e795 <do_con_write+d1> cmpl   $0xbfffffff,0x40(%esp,1)
 > c017e79d <do_con_write+d9> ja     c017e7a7 <do_con_write+e3>
 > c017e79f <do_con_write+db> movl   %edx,%eax
 > c017e7a1 <do_con_write+dd> movl   0x40(%esp,1),%ebx
 > c017e7a5 <do_con_write+e1> movb   (%ebx),%dl
 > c017e7a7 <do_con_write+e3> movzbl %dl,%esi

The whole user memory access is reduced to 10 x86 machine instructions.
The instructions bracketed in the .section directives are no longer
in the normal execution path. They are located in a different section
of the executable file::

 > objdump --disassemble --section=.fixup vmlinux
 >
 > c0199ff5 <.fixup+10b5> movl   $0xfffffff2,%eax
 > c0199ffa <.fixup+10ba> xorb   %dl,%dl
 > c0199ffc <.fixup+10bc> jmp    c017e7a7 <do_con_write+e3>

And finally::

 > objdump --full-contents --section=__ex_table vmlinux
 >
 >  c01aa7c4 93c017c0 e09f19c0 97c017c0 99c017c0  ................
 >  c01aa7d4 f6c217c0 e99f19c0 a5e717c0 f59f19c0  ................
 >  c01aa7e4 080a18c0 01a019c0 0a0a18c0 04a019c0  ................

or in human readable byte order::

 >  c01aa7c4 c017c093 c0199fe0 c017c097 c017c099  ................
 >  c01aa7d4 c017c2f6 c0199fe9 c017e7a5 c0199ff5  ................
                               ^^^^^^^^^^^^^^^^^
                               this is the interesting part!
 >  c01aa7e4 c0180a08 c019a001 c0180a0a c019a004  ................
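
The reordering is plain little-endian decoding: each 32-bit word of the
dump is stored least significant byte first. A minimal sketch in C, where
le32_decode is a hypothetical helper name. Fed the dump bytes a5 e7 17 c0 it
yields 0xc017e7a5, and f5 9f 19 c0 yields 0xc0199ff5.

```c
#include <stdint.h>

/* Assemble a 32-bit value from four bytes stored least significant first. */
uint32_t le32_decode(const unsigned char b[4])
{
	return (uint32_t)b[0] |
	       ((uint32_t)b[1] << 8) |
	       ((uint32_t)b[2] << 16) |
	       ((uint32_t)b[3] << 24);
}
```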

What happened? The assembly directives::

  .section .fixup,"ax"
  .section __ex_table,"a"

told the assembler to move the following code to the specified
sections in the ELF object file. So the instructions::

  3:      movl $-14,%eax
          xorb %dl,%dl
          jmp 2b

ended up in the .fixup section of the object file and the addresses::

        .long 1b,3b

ended up in the __ex_table section of the object file. 1b and 3b
are local labels. The local label 1b (1b stands for next label 1
backward) is the address of the instruction that might fault, i.e.
in our case the address of the label 1 is c017e7a5::

  the original assembly code: > 1:      movb (%ebx),%dl
  and linked in vmlinux     : > c017e7a5 <do_con_write+e1> movb   (%ebx),%dl

The local label 3 (backwards again) is the address of the code to handle
the fault, in our case the actual value is c0199ff5::

  the original assembly code: > 3:      movl $-14,%eax
  and linked in vmlinux     : > c0199ff5 <.fixup+10b5> movl   $0xfffffff2,%eax

If the fixup was able to handle the exception, control flow may be returned
to the instruction after the one that triggered the fault, i.e. local label 2b.

The assembly code::

 > .section __ex_table,"a"
 >         .align 4
 >         .long 1b,3b

becomes the value pair::

 >  c01aa7d4 c017c2f6 c0199fe9 c017e7a5 c0199ff5  ................
                               ^this is ^this is
                               1b       3b

c017e7a5,c0199ff5 in the exception table of the kernel.

So, what actually happens if a fault from kernel mode with no suitable
vma occurs?

#. access to invalid address::

    > c017e7a5 <do_con_write+e1> movb   (%ebx),%dl

#. MMU generates exception
#. CPU calls do_page_fault
#. do_page_fault calls search_exception_table (regs->eip == c017e7a5);
#. search_exception_table looks up the address c017e7a5 in the
   exception table (i.e. the contents of the ELF section __ex_table)
   and returns the address of the associated fault handling code c0199ff5.
#. do_page_fault modifies its own return address to point to the fault
   handling code and returns.
#. execution continues in the fault handling code.
#. a) EAX becomes -EFAULT (== -14)
   b) DL  becomes zero (the value we "read" from user space)
   c) execution continues at local label 2 (address of the
      instruction immediately after the faulting user access).

The steps 8a to 8c in a certain way emulate the faulting instruction.

That's it, mostly. If you look at our example, you might ask why
we set EAX to -EFAULT in the exception handler code. Well, the
get_user macro actually returns a value: 0 if the user access was
successful, -EFAULT on failure. Our original code did not test this
return value; however, the inline assembly code in get_user tries to
return -EFAULT. GCC selected EAX to return this value.
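
From the caller's point of view, the net effect can be modelled in a few
lines of C. This is a user-space sketch with hypothetical names
(model_get_user, user_mem, USER_MEM_SIZE): an out-of-range index stands in
for the page fault, and the failure branch plays the role of the fixup code.
A caller that cares about bad pointers tests the result, for instance with
"if (model_get_user(&c, addr)) return -MODEL_EFAULT;".

```c
#include <stddef.h>

#define MODEL_EFAULT 14
#define USER_MEM_SIZE 16

/* Hypothetical stand-in for the mapped part of user memory. */
static unsigned char user_mem[USER_MEM_SIZE] = "hello";

/*
 * Model of get_user() for one byte: on success store the byte and
 * return 0; on a "fault" store zero and return -EFAULT, which is what
 * the fixup code does with DL and EAX.
 */
int model_get_user(unsigned char *val, size_t user_addr)
{
	if (user_addr < USER_MEM_SIZE) {
		*val = user_mem[user_addr];	/* the actual user access */
		return 0;
	}
	*val = 0;				/* xorb %dl,%dl */
	return -MODEL_EFAULT;			/* movl $-14,%eax */
}
```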

NOTE:
Due to the way that the exception table is built and needs to be ordered,
only use exceptions for code in the .text section.  Any other section
will cause the exception table to not be sorted correctly, and the
exceptions will fail.

Things changed when 64-bit support was added to x86 Linux. Rather than
double the size of the exception table by expanding the two entries
from 32 bits to 64 bits, a clever trick was used to store addresses
as relative offsets from the table itself. The assembly code changed
from::

    .long 1b,3b
  to:
          .long (from) - .
          .long (to) - .

and the C-code that uses these values converts back to absolute addresses
like this::

	unsigned long ex_insn_addr(const struct exception_table_entry *x)
	{
		return (unsigned long)&x->insn + x->insn;
	}
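
A minimal user-space sketch of the encoding and decoding, assuming the table
and the code it refers to lie within 2 GiB of each other (arranged here by
placing both in one static object). struct rel_entry, rel_encode, rel_decode
and the model_image layout are hypothetical; only the decoding expression
mirrors ex_insn_addr() above.

```c
#include <stdint.h>

/* Two 32-bit signed offsets instead of two 64-bit absolute addresses. */
struct rel_entry {
	int32_t insn;
	int32_t fixup;
};

/* Hypothetical "image": code bytes and the table live in one object. */
static struct {
	unsigned char text[64];
	struct rel_entry table[1];
} model_image;

/* Store 'target' as an offset from the field that holds it ("(x) - ."). */
void rel_encode(int32_t *field, uintptr_t target)
{
	*field = (int32_t)((intptr_t)target - (intptr_t)field);
}

/* Inverse of rel_encode(): field address plus stored offset. */
uintptr_t rel_decode(const int32_t *field)
{
	return (uintptr_t)field + *field;
}

/* Round-trip check: returns 1 if decoding recovers the encoded address. */
int rel_demo(void)
{
	uintptr_t insn = (uintptr_t)&model_image.text[10];

	rel_encode(&model_image.table[0].insn, insn);
	return rel_decode(&model_image.table[0].insn) == insn;
}
```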

In v4.6 the exception table entry was expanded with a new field "handler".
This is also 32 bits wide and contains a third relative function
pointer which points to one of:

1) ``int ex_handler_default(const struct exception_table_entry *fixup)``
     This is the legacy case that just jumps to the fixup code

2) ``int ex_handler_fault(const struct exception_table_entry *fixup)``
     This case provides the fault number of the trap that occurred at
     entry->insn. It is used to distinguish page faults from machine
     checks.

More functions can easily be added.
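
The dispatch can be sketched as follows. This is a simplified model, not the
kernel's definitions: the entry holds ordinary addresses and a plain function
pointer rather than 32-bit relative offsets, the handlers take extra regs and
trapnr arguments that the prototypes above do not show, and all demo_ names
are hypothetical. Passing the trap number on through the AX register slot is
an assumption made for this illustration.

```c
/* Hypothetical saved-register state: resume address and one result slot. */
struct demo_regs {
	unsigned long ip;
	unsigned long ax;
};

struct demo_entry;
typedef int (*demo_handler_t)(const struct demo_entry *e,
			      struct demo_regs *regs, int trapnr);

/* Three fields per entry: faulting insn, fixup code, and handler. */
struct demo_entry {
	unsigned long insn;
	unsigned long fixup;
	demo_handler_t handler;
};

/* Legacy behaviour: just continue at the fixup code. */
int demo_handler_default(const struct demo_entry *e,
			 struct demo_regs *regs, int trapnr)
{
	(void)trapnr;
	regs->ip = e->fixup;
	return 1;
}

/* Same jump, but also tell the fixup code which trap occurred. */
int demo_handler_fault(const struct demo_entry *e,
		       struct demo_regs *regs, int trapnr)
{
	regs->ip = e->fixup;
	regs->ax = (unsigned long)trapnr;
	return 1;
}

/* Dispatch through the per-entry handler once the entry has been found. */
int demo_fixup(const struct demo_entry *e, struct demo_regs *regs,
	       int trapnr)
{
	return e->handler(e, regs, trapnr);
}
```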

CONFIG_BUILDTIME_TABLE_SORT allows the __ex_table section to be sorted after
the kernel image is linked, by the host utility scripts/sorttable. It sets the
symbol main_extable_sort_needed to 0, which avoids sorting the __ex_table
section at boot time. With the exception table sorted, the __ex_table entry
for a faulting instruction can be found quickly at runtime by binary search.
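
The lookup can be sketched with the C library's qsort() and bsearch(). This
is a user-space model with hypothetical names (struct sorted_entry,
sort_table, find_entry) and absolute addresses; the sort call merely stands
in for the work scripts/sorttable does at build time.

```c
#include <stdlib.h>

struct sorted_entry {
	unsigned long insn;
	unsigned long fixup;
};

/* Order entries by the address of the faulting instruction. */
static int cmp_entry(const void *a, const void *b)
{
	const struct sorted_entry *x = a, *y = b;

	if (x->insn < y->insn)
		return -1;
	return x->insn > y->insn;
}

/* Done once, ahead of time, so that every later lookup is O(log n). */
void sort_table(struct sorted_entry *table, size_t n)
{
	qsort(table, n, sizeof(*table), cmp_entry);
}

/* Binary search for the entry whose insn matches; NULL if there is none. */
const struct sorted_entry *find_entry(const struct sorted_entry *table,
				      size_t n, unsigned long insn)
{
	struct sorted_entry key = { insn, 0 };

	return bsearch(&key, table, n, sizeof(*table), cmp_entry);
}
```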

This is not just a boot time optimization: some architectures require this
table to be sorted in order to handle exceptions relatively early in the boot
process. For example, i386 makes use of this form of exception handling before
paging support is even enabled!