Orange Pi5 kernel

Deprecated Linux kernel 5.10.110 for OrangePi 5/5B/5+ boards

.. _mmu_notifier:

When do you need to notify inside the page table lock?
======================================================

When clearing a pte/pmd, we have the choice of notifying the event under the
page table lock (the notify versions of the \*_clear_flush helpers call
mmu_notifier_invalidate_range()). That notification is not necessary in all
cases.

For a secondary TLB (a non-CPU TLB) such as an IOMMU TLB or a device TLB (when
the device uses something like ATS/PASID to get the IOMMU to walk the CPU page
table to access a process virtual address space), there are only two cases in
which you need to notify the secondary TLB while holding the page table lock
when clearing a pte/pmd:

  A) the page backing the address is freed before
     mmu_notifier_invalidate_range_end()
  B) a page table entry is updated to point to a new page (COW, write fault
     on zero page, __replace_page(), ...)

Case A is obvious: you do not want to take the risk of the device writing to
a page that might now be used by a completely different task.

Case B is more subtle. For correctness, it requires the following sequence to
happen:

  - take page table lock
  - clear page table entry and notify ([pmd/pte]p_huge_clear_flush_notify())
  - set page table entry to point to new page
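
The sequence above can be sketched in a small userspace model. This is only an
illustration under stated assumptions, not kernel code: ``struct toy_mm``,
``toy_dev_read()`` and ``toy_update_pte()`` are hypothetical names, the page
table is reduced to a single entry, the lock is a plain flag, and the secondary
TLB is a one-entry cache that the notify step invalidates.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model (hypothetical, userspace only): one PTE, one device TLB entry. */
struct toy_mm {
    bool ptl_held;       /* stands in for the page table lock */
    const int *pte;      /* CPU page table entry: points at the backing page */
    const int *dev_tlb;  /* secondary TLB entry, NULL when invalid */
};

/* Device access: use the cached translation, or walk the page table on a miss. */
static int toy_dev_read(struct toy_mm *mm)
{
    if (mm->dev_tlb == NULL)
        mm->dev_tlb = mm->pte;
    return *mm->dev_tlb;
}

/* Case B sequence: lock, clear (and optionally notify), set, unlock. */
static void toy_update_pte(struct toy_mm *mm, const int *new_page, bool notify)
{
    assert(!mm->ptl_held);
    mm->ptl_held = true;       /* take page table lock */

    mm->pte = NULL;            /* clear page table entry ... */
    if (notify)
        mm->dev_tlb = NULL;    /* ... and notify: invalidate the secondary TLB */

    mm->pte = new_page;        /* set page table entry to point to new page */

    mm->ptl_held = false;      /* release page table lock */
}
```

In this model, a device that read through the old entry keeps returning the old
page's value after ``toy_update_pte(..., false)``, and returns the new page's
value after ``toy_update_pte(..., true)``.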

If clearing the page table entry is not followed by a notify before setting
the new pte/pmd value, then you can break the memory model (such as C11 or
C++11) for the device.

Consider the following scenario (the device uses a feature similar to
ATS/PASID):

Two addresses, addrA and addrB, such that \|addrA - addrB\| >= PAGE_SIZE. We
assume they are write protected for COW (the other variants of case B apply
too).

::

 [Time N] --------------------------------------------------------------------
 CPU-thread-0  {try to write to addrA}
 CPU-thread-1  {try to write to addrB}
 CPU-thread-2  {}
 CPU-thread-3  {}
 DEV-thread-0  {read addrA and populate device TLB}
 DEV-thread-2  {read addrB and populate device TLB}
 [Time N+1] ------------------------------------------------------------------
 CPU-thread-0  {COW_step0: {mmu_notifier_invalidate_range_start(addrA)}}
 CPU-thread-1  {COW_step0: {mmu_notifier_invalidate_range_start(addrB)}}
 CPU-thread-2  {}
 CPU-thread-3  {}
 DEV-thread-0  {}
 DEV-thread-2  {}
 [Time N+2] ------------------------------------------------------------------
 CPU-thread-0  {COW_step1: {update page table to point to new page for addrA}}
 CPU-thread-1  {COW_step1: {update page table to point to new page for addrB}}
 CPU-thread-2  {}
 CPU-thread-3  {}
 DEV-thread-0  {}
 DEV-thread-2  {}
 [Time N+3] ------------------------------------------------------------------
 CPU-thread-0  {preempted}
 CPU-thread-1  {preempted}
 CPU-thread-2  {write to addrA which is a write to new page}
 CPU-thread-3  {}
 DEV-thread-0  {}
 DEV-thread-2  {}
 [Time N+4] ------------------------------------------------------------------
 CPU-thread-0  {preempted}
 CPU-thread-1  {preempted}
 CPU-thread-2  {}
 CPU-thread-3  {write to addrB which is a write to new page}
 DEV-thread-0  {}
 DEV-thread-2  {}
 [Time N+5] ------------------------------------------------------------------
 CPU-thread-0  {preempted}
 CPU-thread-1  {COW_step3: {mmu_notifier_invalidate_range_end(addrB)}}
 CPU-thread-2  {}
 CPU-thread-3  {}
 DEV-thread-0  {}
 DEV-thread-2  {}
 [Time N+6] ------------------------------------------------------------------
 CPU-thread-0  {preempted}
 CPU-thread-1  {}
 CPU-thread-2  {}
 CPU-thread-3  {}
 DEV-thread-0  {read addrA from old page}
 DEV-thread-2  {read addrB from new page}

So here, because at time N+2 the clearing of the page table entry was not
paired with a notification to invalidate the secondary TLB, the device sees
the new value for addrB before seeing the new value for addrA. This breaks
total memory ordering for the device.
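
The scenario can be replayed sequentially in a userspace model to make the
ordering problem concrete. This is a sketch under stated assumptions rather
than kernel code: ``struct scenario``, ``replay()`` and the other helpers are
hypothetical names, each address has one PTE and one device TLB entry, the old
pages hold 0, the CPU writes store 1, and the range_end step is modeled as the
point where the delayed secondary TLB invalidation finally happens.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { ADDR_A, ADDR_B, NR_ADDRS };

/* Hypothetical model: per-address pages, PTE and device TLB entry. */
struct scenario {
    int old_page[NR_ADDRS];
    int new_page[NR_ADDRS];
    int *pte[NR_ADDRS];      /* CPU page table entries */
    int *dev_tlb[NR_ADDRS];  /* secondary TLB entries, NULL when invalid */
};

/* Device read: hit in the device TLB, or walk the page table on a miss. */
static int dev_read(struct scenario *s, int addr)
{
    if (s->dev_tlb[addr] == NULL)
        s->dev_tlb[addr] = s->pte[addr];
    return *s->dev_tlb[addr];
}

/* COW step 1: copy the page, clear the PTE (maybe notify), set the new PTE. */
static void cow_update(struct scenario *s, int addr, bool notify)
{
    s->new_page[addr] = s->old_page[addr];
    s->pte[addr] = NULL;
    if (notify)
        s->dev_tlb[addr] = NULL;
    s->pte[addr] = &s->new_page[addr];
}

/* CPU write through the current page table entry. */
static void cpu_write(struct scenario *s, int addr, int value)
{
    *s->pte[addr] = value;
}

/* Modeled range_end: the delayed secondary TLB invalidation. */
static void range_end(struct scenario *s, int addr)
{
    s->dev_tlb[addr] = NULL;
}

/* Replay the timeline and report what the device reads at the end. */
static void replay(bool notify_under_lock, int *seen_a, int *seen_b)
{
    struct scenario s = {0};

    s.pte[ADDR_A] = &s.old_page[ADDR_A];
    s.pte[ADDR_B] = &s.old_page[ADDR_B];

    dev_read(&s, ADDR_A);              /* device populates its TLB */
    dev_read(&s, ADDR_B);
    cow_update(&s, ADDR_A, notify_under_lock);
    cow_update(&s, ADDR_B, notify_under_lock);
    cpu_write(&s, ADDR_A, 1);          /* addrA is written first ... */
    cpu_write(&s, ADDR_B, 1);          /* ... then addrB */
    range_end(&s, ADDR_B);             /* only the addrB thread gets this far */

    *seen_a = dev_read(&s, ADDR_A);
    *seen_b = dev_read(&s, ADDR_B);
}
```

With ``notify_under_lock`` false, the model's device ends up reading 0 for addrA
and 1 for addrB, which is the out-of-order observation described above. With it
true, the device reads 1 for both.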

When changing a pte to write protect it, or to point to a new write-protected
page with the same content (KSM), it is fine to delay the
mmu_notifier_invalidate_range() call to mmu_notifier_invalidate_range_end(),
outside the page table lock. This is true even if the thread doing the page
table update is preempted right after releasing the page table lock but before
calling mmu_notifier_invalidate_range_end().
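
One way to see why the delay is harmless in the same-content case is the
following userspace sketch. It rests on assumptions that are not spelled out
above: ``struct ksm_model`` and its helpers are hypothetical names, the page is
a 16-byte toy buffer, and the old page is assumed to stay allocated until the
modeled range_end (otherwise case A would apply). A device holding a stale
translation still reads the same bytes, because both pages are identical and
neither can be written without a fault.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 16

/* Hypothetical model: one write-protected mapping and one device TLB entry. */
struct ksm_model {
    unsigned char *pte;      /* current (write-protected) backing page */
    unsigned char *dev_tlb;  /* secondary TLB entry, NULL when invalid */
};

/* Device read: hit in the device TLB, or walk the page table on a miss. */
static unsigned char ksm_dev_read(struct ksm_model *m, size_t off)
{
    assert(off < TOY_PAGE_SIZE);
    if (m->dev_tlb == NULL)
        m->dev_tlb = m->pte;
    return m->dev_tlb[off];
}

/*
 * Repoint the mapping to a shared page, but only if the content is identical.
 * The device TLB is deliberately left stale: no notify under the lock.
 * Returns 1 if the mapping was merged, 0 otherwise.
 */
static int ksm_merge(struct ksm_model *m, unsigned char *shared)
{
    if (memcmp(m->pte, shared, TOY_PAGE_SIZE) != 0)
        return 0;
    m->pte = shared;
    return 1;
}

/* Modeled range_end: the delayed secondary TLB invalidation. */
static void ksm_range_end(struct ksm_model *m)
{
    m->dev_tlb = NULL;
}
```

In this model the device reads the same value before the merge, between the
merge and ``ksm_range_end()``, and after it. Only the page it reads from
changes.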