Orange Pi5 kernel

Deprecated Linux kernel 5.10.110 for OrangePi 5/5B/5+ boards

.. _psi:

================================
PSI - Pressure Stall Information
================================

:Date: April, 2018
:Author: Johannes Weiner <hannes@cmpxchg.org>

When CPU, memory or IO devices are contended, workloads suffer
latency spikes and throughput losses, and run the risk of OOM kills.

Without an accurate measure of such contention, users are forced to
either play it safe and under-utilize their hardware resources, or
roll the dice and frequently suffer the disruptions resulting from
excessive overcommit.

The psi feature identifies and quantifies the disruptions caused by
such resource crunches and the time impact they have on complex
workloads or even entire systems.

Having an accurate measure of productivity losses caused by resource
scarcity aids users in sizing workloads to hardware, or in provisioning
hardware according to workload demand.

As psi aggregates this information in real time, systems can be managed
dynamically using techniques such as load shedding, migrating jobs to
other systems or data centers, or strategically pausing or killing
low-priority or restartable batch jobs.

This allows maximizing hardware utilization without sacrificing
workload health or risking major disruptions such as OOM kills.

Pressure interface
==================

Pressure information for each resource is exported through the
respective file in /proc/pressure/ -- cpu, memory, and io.

The format for CPU is as follows::

	some avg10=0.00 avg60=0.00 avg300=0.00 total=0

and for memory and IO::

	some avg10=0.00 avg60=0.00 avg300=0.00 total=0
	full avg10=0.00 avg60=0.00 avg300=0.00 total=0

The "some" line indicates the share of time in which at least some
tasks are stalled on a given resource.

The "full" line indicates the share of time in which all non-idle
tasks are stalled on a given resource simultaneously. In this state
actual CPU cycles are going to waste, and a workload that spends
extended time in this state is considered to be thrashing. This has
a severe impact on performance, and it's useful to distinguish this
situation from a state where some tasks are stalled but the CPU is
still doing productive work. As such, time spent in this subset of the
stall state is tracked separately and exported in the "full" averages.
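
As an illustration, one line of a pressure file can be parsed with
sscanf(). This is a minimal sketch, not kernel-provided code: struct
psi_line and psi_parse_line() are hypothetical names, and the sketch
assumes the line layout shown above.

```c
#include <stdio.h>
#include <string.h>

/* One parsed line of a pressure file (hypothetical helper type). */
struct psi_line {
	char kind[8];                 /* "some" or "full" */
	double avg10, avg60, avg300;  /* share of stalled time, in percent */
	unsigned long long total;     /* cumulative stall time, in microseconds */
};

/* Parse one line; returns 0 on success, -1 if the line does not match. */
static int psi_parse_line(const char *s, struct psi_line *out)
{
	int n = sscanf(s, "%7s avg10=%lf avg60=%lf avg300=%lf total=%llu",
		       out->kind, &out->avg10, &out->avg60, &out->avg300,
		       &out->total);

	if (n != 5)
		return -1;
	if (strcmp(out->kind, "some") != 0 && strcmp(out->kind, "full") != 0)
		return -1;
	return 0;
}
```

A caller would typically read the file line by line with fgets() and
pass each line to the helper.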

The ratios (in %) are tracked as recent trends over ten-, sixty-, and
three-hundred-second windows, which gives insight into short-term events
as well as medium- and long-term trends. The total absolute stall time
(in microseconds) is tracked and exported as well, to allow detection of
latency spikes which wouldn't necessarily make a dent in the time
averages, or to average trends over custom time frames.
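
Averaging over a custom time frame can be sketched as follows: sample
the total= counter twice and divide its growth by the elapsed
wall-clock time. The function name psi_stall_percent() is hypothetical,
and the caller is assumed to measure the elapsed time itself, for
example with clock_gettime().

```c
/*
 * Percentage of wall-clock time spent stalled between two samples.
 * total_prev_us, total_now_us: "total=" values, in microseconds.
 * elapsed_us: wall-clock time between the two reads, in microseconds.
 * Returns 0.0 if no time elapsed or the counter went backwards.
 */
static double psi_stall_percent(unsigned long long total_prev_us,
				unsigned long long total_now_us,
				unsigned long long elapsed_us)
{
	if (elapsed_us == 0 || total_now_us < total_prev_us)
		return 0.0;
	return 100.0 * (double)(total_now_us - total_prev_us) /
	       (double)elapsed_us;
}
```

For instance, 250ms of additional stall time observed over a 2s
interval corresponds to 12.5%.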

Monitoring for pressure thresholds
==================================

Users can register triggers and use poll() to be woken up when resource
pressure exceeds certain thresholds.

A trigger describes the maximum cumulative stall time over a specific
time window, e.g. 100ms of total stall time within any 500ms window to
generate a wakeup event.

To register a trigger, the user has to open the psi interface file under
/proc/pressure/ representing the resource to be monitored and write the
desired threshold and time window to it. The open file descriptor should
then be used to wait for trigger events using select(), poll() or epoll().
The following format is used::

	<some|full> <stall amount in us> <time window in us>

For example, writing "some 150000 1000000" into /proc/pressure/memory
would add a 150ms threshold for partial memory stall measured within
a 1sec time window. Writing "full 50000 1000000" into /proc/pressure/io
would add a 50ms threshold for full io stall measured within a 1sec
time window.
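
Since the trigger format takes microseconds, a small helper can build
the string from millisecond values. The sketch below is illustrative
only; psi_format_trigger() is a hypothetical name.

```c
#include <stdio.h>

/*
 * Build "<some|full> <stall us> <window us>" from millisecond values.
 * Returns 0 on success, -1 if the buffer is too small.
 */
static int psi_format_trigger(char *buf, size_t len, int full,
			      unsigned int stall_ms, unsigned int window_ms)
{
	int n = snprintf(buf, len, "%s %llu %llu",
			 full ? "full" : "some",
			 (unsigned long long)stall_ms * 1000ULL,
			 (unsigned long long)window_ms * 1000ULL);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Called with full=0, stall_ms=150 and window_ms=1000, the helper
produces the first example string from the text.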

Triggers can be set on more than one psi metric, and more than one
trigger can be specified for the same psi metric. However, each trigger
requires its own file descriptor so that it can be polled separately
from the others; a separate open() syscall should therefore be made for
each trigger, even when opening the same psi interface file. Write
operations to a file descriptor with an already existing psi trigger
will fail with EBUSY.
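
The one-descriptor-per-trigger rule can be wrapped in a helper that
opens the interface file anew for every trigger. This is a sketch under
the assumptions stated in the text; psi_open_trigger() is a hypothetical
name, not an existing API.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Open `path` and register exactly one trigger on the new descriptor.
 * Returns the descriptor on success, or -1 with errno set.
 */
static int psi_open_trigger(const char *path, const char *trig)
{
	int fd = open(path, O_RDWR | O_NONBLOCK);

	if (fd < 0)
		return -1;
	if (write(fd, trig, strlen(trig) + 1) < 0) {
		int saved = errno;	/* close() may overwrite errno */

		close(fd);
		errno = saved;
		return -1;
	}
	return fd;
}
```

Calling the helper twice on /proc/pressure/memory with different
trigger strings yields two independent descriptors, each of which can
be placed in its own struct pollfd entry.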

Monitors activate only when the system enters the stall state for the
monitored psi metric and deactivate upon exit from the stall state.
While the system is in the stall state, psi signal growth is monitored
at a rate of 10 times per tracking window.

The kernel accepts window sizes ranging from 500ms to 10s; the minimum
monitoring update interval is therefore 50ms and the maximum is 1s. The
minimum limit is set to prevent overly frequent polling. The maximum
limit is chosen as a high enough number after which monitors are most
likely not needed and psi averages can be used instead.
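
The relationship between window size and update interval is simple
arithmetic, sketched below. The macro and function names are
hypothetical; the limits and the factor of 10 are taken from the text
above, and a userspace tool might use such a check before writing a
trigger.

```c
#define PSI_WINDOW_MIN_US        500000ULL	/* 500ms */
#define PSI_WINDOW_MAX_US      10000000ULL	/* 10s */
#define PSI_UPDATES_PER_WINDOW       10ULL

/*
 * Monitoring update interval for a given tracking window, in
 * microseconds. Returns 0 if the window is outside the accepted range.
 */
static unsigned long long psi_update_interval_us(unsigned long long window_us)
{
	if (window_us < PSI_WINDOW_MIN_US || window_us > PSI_WINDOW_MAX_US)
		return 0;
	return window_us / PSI_UPDATES_PER_WINDOW;
}
```

A 1s window, as used in the examples in this document, gives a 100ms
update interval.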

When activated, a psi monitor stays active for at least the duration of
one tracking window to avoid repeated activations/deactivations when the
system is bouncing in and out of the stall state.

Notifications to userspace are rate-limited to one per tracking window.

The trigger will de-register when the file descriptor used to define the
trigger is closed.

Userspace monitor usage example
===============================

::

  #include <errno.h>
  #include <fcntl.h>
  #include <stdio.h>
  #include <poll.h>
  #include <string.h>
  #include <unistd.h>

  /*
   * Monitor memory partial stall with 1s tracking window size
   * and 150ms threshold.
   */
  int main(void) {
	const char trig[] = "some 150000 1000000";
	struct pollfd fds;
	int n;

	/* Each trigger needs its own file descriptor. */
	fds.fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK);
	if (fds.fd < 0) {
		printf("/proc/pressure/memory open error: %s\n",
			strerror(errno));
		return 1;
	}
	fds.events = POLLPRI;

	/* Register the trigger; the terminating NUL is written as well. */
	if (write(fds.fd, trig, strlen(trig) + 1) < 0) {
		printf("/proc/pressure/memory write error: %s\n",
			strerror(errno));
		return 1;
	}

	printf("waiting for events...\n");
	while (1) {
		/* Block until the trigger fires or the source goes away. */
		n = poll(&fds, 1, -1);
		if (n < 0) {
			printf("poll error: %s\n", strerror(errno));
			return 1;
		}
		if (fds.revents & POLLERR) {
			printf("got POLLERR, event source is gone\n");
			return 0;
		}
		if (fds.revents & POLLPRI) {
			printf("event triggered!\n");
		} else {
			printf("unknown event received: 0x%x\n", fds.revents);
			return 1;
		}
	}

	return 0;
  }

Cgroup2 interface
=================

In a system with a CONFIG_CGROUPS=y kernel and the cgroup2 filesystem
mounted, pressure stall information is also tracked for tasks grouped
into cgroups. Each subdirectory in the cgroupfs mountpoint contains
cpu.pressure, memory.pressure, and io.pressure files; the format is
the same as the /proc/pressure/ files.

Per-cgroup psi monitors can be specified and used the same way as
system-wide ones.
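
Because the per-cgroup files follow a fixed naming scheme, the path to
a cgroup's pressure file can be derived from the cgroup directory and
the resource name. The helper below is a sketch; psi_cgroup_path() is a
hypothetical name, and the directory used in the usage note is only an
assumed example of a cgroup2 mountpoint layout.

```c
#include <stdio.h>

/*
 * Build "<cgroup_dir>/<resource>.pressure", where resource is one of
 * "cpu", "memory" or "io". Returns 0 on success, -1 if the buffer is
 * too small.
 */
static int psi_cgroup_path(char *buf, size_t len, const char *cgroup_dir,
			   const char *resource)
{
	int n = snprintf(buf, len, "%s/%s.pressure", cgroup_dir, resource);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

With a cgroup directory of /sys/fs/cgroup/workload and the resource
"memory", the result is /sys/fs/cgroup/workload/memory.pressure, which
can then be opened and polled like /proc/pressure/memory in the example
above.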