^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1) .. _psi:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 3) ================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 4) PSI - Pressure Stall Information
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 5) ================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 6)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 7) :Date: April, 2018
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 8) :Author: Johannes Weiner <hannes@cmpxchg.org>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 9)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 10) When CPU, memory or IO devices are contended, workloads experience
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 11) latency spikes, throughput losses, and run the risk of OOM kills.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 12)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 13) Without an accurate measure of such contention, users are forced to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 14) either play it safe and under-utilize their hardware resources, or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 15) roll the dice and frequently suffer the disruptions resulting from
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 16) excessive overcommit.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 17)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 18) The psi feature identifies and quantifies the disruptions caused by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 19) such resource crunches and the time impact they have on complex workloads
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 20) or even entire systems.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 21)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 22) Having an accurate measure of productivity losses caused by resource
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 23) scarcity aids users in sizing workloads to hardware--or provisioning
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 24) hardware according to workload demand.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 25)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 26) As psi aggregates this information in realtime, systems can be managed
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 27) dynamically using techniques such as load shedding, migrating jobs to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 28) other systems or data centers, or strategically pausing or killing low
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 29) priority or restartable batch jobs.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 30)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 31) This allows maximizing hardware utilization without sacrificing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 32) workload health or risking major disruptions such as OOM kills.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 33)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 34) Pressure interface
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 35) ==================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 36)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 37) Pressure information for each resource is exported through the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 38) respective file in /proc/pressure/ -- cpu, memory, and io.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 39)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 40) The format for CPU is as such::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 41)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 42) 	some avg10=0.00 avg60=0.00 avg300=0.00 total=0
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 43)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 44) and for memory and IO::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 45)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 46) 	some avg10=0.00 avg60=0.00 avg300=0.00 total=0
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 47) 	full avg10=0.00 avg60=0.00 avg300=0.00 total=0
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 48)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 49) The "some" line indicates the share of time in which at least some
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 50) tasks are stalled on a given resource.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 51)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 52) The "full" line indicates the share of time in which all non-idle
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 53) tasks are stalled on a given resource simultaneously. In this state
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 54) actual CPU cycles are going to waste, and a workload that spends
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 55) extended time in this state is considered to be thrashing. This has
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 56) severe impact on performance, and it's useful to distinguish this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 57) situation from a state where some tasks are stalled but the CPU is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 58) still doing productive work. As such, time spent in this subset of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 59) stall state is tracked separately and exported in the "full" averages.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 60)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 61) The ratios (in %) are tracked as recent trends over ten, sixty, and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 62) three hundred second windows, which gives insight into short term events
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 63) as well as medium and long term trends. The total absolute stall time
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 64) (in us) is tracked and exported as well, to allow detection of latency
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 65) spikes which wouldn't necessarily make a dent in the time averages,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 66) or to average trends over custom time frames.
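
As an illustration of the format described above, the following is a minimal
sketch of how userspace might parse one line of a pressure file and derive a
stall percentage over a custom time frame from two samples of the "total"
field. The names psi_line, psi_parse_line() and psi_window_pct() are
hypothetical helpers introduced for this example; they are not part of any
kernel or library API.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One parsed line of a pressure file (hypothetical helper type). */
struct psi_line {
	char kind[8];			/* "some" or "full" */
	double avg10, avg60, avg300;	/* ratios in percent */
	unsigned long long total;	/* cumulative stall time in us */
};

/*
 * Parse a line such as
 *   "some avg10=0.00 avg60=0.00 avg300=0.00 total=0".
 * Returns 0 on success, -1 if the line does not match the format.
 */
static int psi_parse_line(const char *line, struct psi_line *out)
{
	int n;

	assert(line != NULL && out != NULL);
	n = sscanf(line, "%7s avg10=%lf avg60=%lf avg300=%lf total=%llu",
		   out->kind, &out->avg10, &out->avg60, &out->avg300,
		   &out->total);
	if (n != 5)
		return -1;
	if (strcmp(out->kind, "some") != 0 && strcmp(out->kind, "full") != 0)
		return -1;
	return 0;
}

/*
 * Stall percentage over a custom window, computed from two "total"
 * samples taken elapsed_us microseconds apart. Returns 0.0 if the
 * inputs make no sense (zero window or a counter going backwards).
 */
static double psi_window_pct(unsigned long long total_prev,
			     unsigned long long total_now,
			     unsigned long long elapsed_us)
{
	if (elapsed_us == 0 || total_now < total_prev)
		return 0.0;
	return 100.0 * (double)(total_now - total_prev) / (double)elapsed_us;
}
```

For instance, if "total" grows by 250000us between two reads taken one second
apart, psi_window_pct() reports a stall share of 25% for that second.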
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 67)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 68) Monitoring for pressure thresholds
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 69) ==================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 70)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 71) Users can register triggers and use poll() to be woken up when resource
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 72) pressure exceeds certain thresholds.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 73)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 74) A trigger describes the maximum cumulative stall time over a specific
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 75) time window, e.g. 100ms of total stall time within any 500ms window to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 76) generate a wakeup event.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 77)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 78) To register a trigger, the user has to open the psi interface file under
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 79) /proc/pressure/ representing the resource to be monitored and write the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 80) desired threshold and time window. The open file descriptor should be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 81) used to wait for trigger events using select(), poll() or epoll().
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 82) The following format is used::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 83)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 84) 	<some|full> <stall amount in us> <time window in us>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 85)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 86) For example, writing "some 150000 1000000" into /proc/pressure/memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 87) would add a 150ms threshold for partial memory stall measured within
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 88) a 1sec time window. Writing "full 50000 1000000" into /proc/pressure/io
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 89) would add a 50ms threshold for full io stall measured within a 1sec time window.
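
Since the trigger format takes microseconds while thresholds are often
reasoned about in milliseconds, a small formatting helper can reduce unit
mistakes. The following is a sketch only; psi_format_trigger() is a
hypothetical name chosen for this example and is not provided by the kernel.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build a trigger specification "<some|full> <stall us> <window us>"
 * from millisecond inputs. Returns 0 on success, -1 if kind is not
 * "some" or "full" or if the buffer is too small.
 */
static int psi_format_trigger(char *buf, size_t len, const char *kind,
			      unsigned int stall_ms, unsigned int window_ms)
{
	int n;

	assert(buf != NULL && kind != NULL);
	if (strcmp(kind, "some") != 0 && strcmp(kind, "full") != 0)
		return -1;

	/* Widen before multiplying so large ms values cannot overflow. */
	n = snprintf(buf, len, "%s %llu %llu", kind,
		     (unsigned long long)stall_ms * 1000ULL,
		     (unsigned long long)window_ms * 1000ULL);
	if (n < 0 || (size_t)n >= len)
		return -1;
	return 0;
}
```

Called with ("some", 150, 1000), the helper produces the string
"some 150000 1000000" used in the memory example above.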
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 90)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 91) Triggers can be set on more than one psi metric, and more than one trigger
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 92) for the same psi metric can be specified. However, each trigger requires a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 93) separate file descriptor so that it can be polled separately from the others.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 94) Therefore, a separate open() syscall should be made for each trigger, even
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 95) when opening the same psi interface file. Write operations to a file descriptor
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 96) with an already existing psi trigger will fail with EBUSY.
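
The one-descriptor-per-trigger rule can be wrapped in a helper that performs
a fresh open() for every trigger it registers. The sketch below assumes the
open-then-write sequence described above; psi_register_trigger() is a
hypothetical name used only for this illustration.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Open the psi file at path and register one trigger on the new
 * descriptor. Returns the descriptor on success, or -1 on failure
 * with errno describing the failed open() or write().
 */
static int psi_register_trigger(const char *path, const char *trig)
{
	int fd, saved;

	assert(path != NULL && trig != NULL);
	fd = open(path, O_RDWR | O_NONBLOCK);
	if (fd < 0)
		return -1;

	if (write(fd, trig, strlen(trig) + 1) < 0) {
		saved = errno;	/* close() may overwrite errno */
		close(fd);
		errno = saved;
		return -1;
	}
	return fd;
}
```

To watch two thresholds on the same file, a caller would invoke the helper
twice with the same path and place both returned descriptors in one pollfd
array. Closing a descriptor removes only the trigger registered on it.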
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 97)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 98) Monitors activate only when the system enters the stall state for the monitored
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 99) psi metric and deactivate upon exit from the stall state. While the system is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 100) in the stall state, psi signal growth is monitored at a rate of 10 times per
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 101) tracking window.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 102)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 103) The kernel accepts window sizes ranging from 500ms to 10s; therefore, the minimum
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 104) monitoring update interval is 50ms and the maximum is 1s. The minimum limit is set to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 105) prevent overly frequent polling. The maximum limit is chosen as a high enough number
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 106) after which monitors are most likely not needed and psi averages can be used
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 107) instead.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 108)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 109) When activated, a psi monitor stays active for at least the duration of one
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 110) tracking window to avoid repeated activations/deactivations when the system is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 111) bouncing in and out of the stall state.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 112)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 113) Notifications to userspace are rate-limited to one per tracking window.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 114)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 115) The trigger will de-register when the file descriptor used to define the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 116) trigger is closed.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 117)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 118) Userspace monitor usage example
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 119) ===============================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 120)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 121) ::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 122)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 123) 	#include <errno.h>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 124) 	#include <fcntl.h>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 125) 	#include <stdio.h>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 126) 	#include <poll.h>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 127) 	#include <string.h>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 128) 	#include <unistd.h>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 129)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 130) 	/*
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 131) 	 * Monitor memory partial stall with 1s tracking window size
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 132) 	 * and 150ms threshold.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 133) 	 */
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 134) 	int main() {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 135) 		const char trig[] = "some 150000 1000000";
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 136) 		struct pollfd fds;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 137) 		int n;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 138)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 139) 		fds.fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 140) 		if (fds.fd < 0) {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 141) 			printf("/proc/pressure/memory open error: %s\n",
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 142) 				strerror(errno));
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 143) 			return 1;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 144) 		}
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 145) 		fds.events = POLLPRI;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 146)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 147) 		if (write(fds.fd, trig, strlen(trig) + 1) < 0) {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 148) 			printf("/proc/pressure/memory write error: %s\n",
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 149) 				strerror(errno));
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 150) 			return 1;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 151) 		}
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 152)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 153) 		printf("waiting for events...\n");
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 154) 		while (1) {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 155) 			n = poll(&fds, 1, -1);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 156) 			if (n < 0) {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 157) 				printf("poll error: %s\n", strerror(errno));
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 158) 				return 1;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 159) 			}
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 160) 			if (fds.revents & POLLERR) {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 161) 				printf("got POLLERR, event source is gone\n");
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 162) 				return 0;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 163) 			}
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 164) 			if (fds.revents & POLLPRI) {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 165) 				printf("event triggered!\n");
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 166) 			} else {
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 167) 				printf("unknown event received: 0x%x\n", fds.revents);
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 168) 				return 1;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 169) 			}
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 170) 		}
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 171)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 172) 		return 0;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 173) 	}
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 174)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 175) Cgroup2 interface
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 176) =================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 177)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 178) In a system with a CONFIG_CGROUPS=y kernel and the cgroup2 filesystem
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 179) mounted, pressure stall information is also tracked for tasks grouped
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 180) into cgroups. Each subdirectory in the cgroupfs mountpoint contains
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 181) cpu.pressure, memory.pressure, and io.pressure files; the format is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 182) the same as the /proc/pressure/ files.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 183)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 184) Per-cgroup psi monitors can be specified and used the same way as
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 185) system-wide ones.
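
Because the per-cgroup files follow the naming pattern <resource>.pressure,
the path for a given cgroup can be assembled mechanically and then opened,
read, or given a trigger exactly like the /proc/pressure/ files. The sketch
below is an illustration under stated assumptions: psi_cgroup_path() is a
hypothetical helper, it accepts only the three resources named in this
document, and the cgroup directory is supplied by the caller because the
cgroup2 mountpoint is system-specific.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build "<cgroup_dir>/<resource>.pressure" for resource "cpu",
 * "memory" or "io". Returns 0 on success, -1 on an unknown resource
 * or if the buffer is too small.
 */
static int psi_cgroup_path(char *buf, size_t len, const char *cgroup_dir,
			   const char *resource)
{
	int n;

	assert(buf != NULL && cgroup_dir != NULL && resource != NULL);
	if (strcmp(resource, "cpu") != 0 &&
	    strcmp(resource, "memory") != 0 &&
	    strcmp(resource, "io") != 0)
		return -1;

	n = snprintf(buf, len, "%s/%s.pressure", cgroup_dir, resource);
	if (n < 0 || (size_t)n >= len)
		return -1;
	return 0;
}
```

With a cgroup directory such as /sys/fs/cgroup/workload (an example path, not
one mandated by the kernel), the helper yields
/sys/fs/cgroup/workload/memory.pressure for the "memory" resource.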