Orange Pi5 kernel

Deprecated Linux kernel 5.10.110 for OrangePi 5/5B/5+ boards

=====================
Scheduler Nice Design
=====================

This document explains the thinking about the revamped and streamlined
nice-levels implementation in the new Linux scheduler.

Nice levels were always pretty weak under Linux and people continuously
pestered us to make nice +19 tasks use up much less CPU time.

Unfortunately that was not that easy to implement under the old
scheduler (otherwise we'd have done it long ago), because nice level
support was historically coupled to timeslice length, and timeslice
units were driven by the HZ tick, so the smallest timeslice was 1/HZ.

In the O(1) scheduler (in 2003) we changed negative nice levels to be
much stronger than they were before in 2.4 (and people were happy about
that change), and we also intentionally calibrated the linear timeslice
rule so that nice +19 level would be _exactly_ 1 jiffy. To better
understand it, the timeslice graph went like this (cheesy ASCII art
alert!)::


                   A
             \     | [timeslice length]
              \    |
               \   |
                \  |
                 \ |
                  \|___100msecs
                   |^ . _
                   |      ^ . _
                   |            ^ . _
 -*----------------------------------*-----> [nice level]
 -20               |                +19
                   |
                   |

That way, if someone really wanted to renice tasks, +19 would give a
much bigger hit than the normal linear rule would. (The solution of
changing the ABI to extend priorities was discarded early on.)

This approach worked to some degree for some time, but later on with
HZ=1000 it caused 1 jiffy to be 1 msec, which meant 0.1% CPU usage which
we felt to be a bit excessive. Excessive _not_ because it's too small of
a CPU utilization, but because it causes too frequent (once per
millisec) rescheduling, which would thrash the cache, etc. (Remember,
this was long ago when hardware was weaker and caches were smaller, and
people were running number crunching apps at nice +19.)

So for HZ=1000 we changed nice +19 to 5msecs, because that felt like the
right minimal granularity - and this translates to 5% CPU utilization.
But the fundamental HZ-sensitive property for nice +19 still remained,
and we never got a single complaint about nice +19 being too _weak_ in
terms of CPU utilization; we only got complaints about it (still) being
too _strong_ :-)

To sum it up: we always wanted to make nice levels more consistent, but
within the constraints of HZ and jiffies and their nasty design-level
coupling to timeslices and granularity it was not really viable.

The second (less frequent but still periodically occurring) complaint
about Linux's nice level support was its asymmetry around the origin
(which you can see demonstrated in the picture above), or more
accurately: the fact that nice level behavior depended on the _absolute_
nice level as well, while the nice API itself is fundamentally
"relative"::

   int nice(int inc);

   asmlinkage long sys_nice(int increment)

(The first one is the glibc API, the second one is the syscall API.)
Note that the 'inc' is relative to the current nice level. Command-line
tools like "nice" mirror this relative API.
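As a minimal sketch of that relative behavior, the helper below (the
name bump_nice() is made up for this illustration) adds an increment to
the calling process's current nice level through the glibc nice()
wrapper. It assumes a libc where nice() returns the new nice level, as
current glibc does. Because -1 can then be a legitimate return value,
errno is cleared first and checked afterwards.

```c
#include <errno.h>
#include <unistd.h>

/*
 * Hypothetical helper, for illustration only: add 'inc' to the current
 * nice level of the calling process and store the resulting level in
 * '*new_level'. Returns 0 on success, -1 on failure (errno is set).
 */
int bump_nice(int inc, int *new_level)
{
	int ret;

	errno = 0;		/* -1 is a valid nice level, so clear errno first */
	ret = nice(inc);	/* 'inc' is relative to the current level */
	if (ret == -1 && errno != 0)
		return -1;

	*new_level = ret;
	return 0;
}
```

Calling bump_nice(1, &level) from a process at nice 0 yields level 1,
and calling it again yields 2: the argument is always an offset, never
an absolute level.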

With the old scheduler, if you for example started a niced task with +1
and another task with +2, the CPU split between the two tasks would
depend on the nice level of the parent shell - if it was at nice -10 the
CPU split was different than if it was at +5 or +10.

A third complaint against Linux's nice level support was that negative
nice levels were not 'punchy enough', so lots of people had to resort to
running audio (and other multimedia) apps under RT priorities such as
SCHED_FIFO. But this caused other problems: SCHED_FIFO is not starvation
proof, and a buggy SCHED_FIFO app can also lock up the system for good.

The new scheduler (CFS) in v2.6.23 addresses all three types of
complaints:

To address the first complaint (of nice levels not being "punchy"
enough), the scheduler was decoupled from 'time slice' and HZ concepts
(and granularity was made a separate concept from nice levels) and thus
it was possible to implement better and more consistent nice +19
support: with the new scheduler nice +19 tasks get a HZ-independent
1.5%, instead of the variable 3%-5%-9% range they got in the old
scheduler.

To address the second complaint (of nice levels not being consistent),
the new scheduler makes nice(1) have the same CPU utilization effect on
tasks, regardless of their absolute nice levels. So on the new
scheduler, running a nice +10 and a nice +11 task has the same CPU
utilization "split" between them as running a nice -5 and a nice -4
task. (One will get 55% of the CPU, the other 45%.) That is why nice
levels were changed to be "multiplicative" (or exponential) - that way
it does not matter which nice level you start out from, the 'relative
result' will always be the same.
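The multiplicative rule can be illustrated with a small sketch. It
assumes that each nice step scales a task's weight by a constant factor
of about 1.25, which is the value that reproduces the 55%/45% split
quoted above. nice_weight() and cpu_share() are hypothetical helpers
written for this example, not kernel functions.

```c
#include <assert.h>

/*
 * Relative weight of a task at the given nice level, with nice 0
 * defined as 1.0. Assumption for this sketch: every nice step changes
 * the weight by a constant factor of 1.25.
 */
double nice_weight(int nice)
{
	double w = 1.0;
	int i;

	assert(nice >= -20 && nice <= 19);

	for (i = 0; i < nice; i++)
		w /= 1.25;	/* positive nice: lighter task */
	for (i = 0; i > nice; i--)
		w *= 1.25;	/* negative nice: heavier task */

	return w;
}

/*
 * Fraction of one CPU that task A receives when it competes only with
 * task B and both are always runnable.
 */
double cpu_share(int nice_a, int nice_b)
{
	double wa = nice_weight(nice_a);
	double wb = nice_weight(nice_b);

	return wa / (wa + wb);
}
```

Under these assumptions cpu_share(10, 11) and cpu_share(-5, -4) both
come out at roughly 0.556, because only the difference between the two
nice levels enters the ratio. cpu_share(19, 0) is about 0.014, in line
with the roughly 1.5% figure given for nice +19 tasks.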

The third complaint (of negative nice levels not being "punchy" enough
and forcing audio apps to run under the more dangerous SCHED_FIFO
scheduling policy) is addressed by the new scheduler almost
automatically: stronger negative nice levels are an automatic
side-effect of the recalibrated dynamic range of nice levels.