Orange Pi5 kernel

Deprecated Linux kernel 5.10.110 for OrangePi 5/5B/5+ boards

=============================
Per-task statistics interface
=============================


Taskstats is a netlink-based interface for sending per-task and
per-process statistics from the kernel to userspace.

Taskstats was designed to provide the following benefits:

- efficient delivery of statistics during the lifetime of a task and on its
  exit
- a unified interface for multiple accounting subsystems
- extensibility for use by future accounting patches

Terminology
-----------

"pid", "tid" and "task" are used interchangeably and refer to the standard
Linux task defined by struct task_struct. Per-pid stats are the same as
per-task stats.

"tgid", "process" and "thread group" are used interchangeably and refer to the
tasks that share an mm_struct, i.e. the traditional Unix process. Despite the
use of tgid, there is no special treatment for the task that is the thread
group leader: a process is deemed alive as long as it has any task belonging
to it.

Usage
-----

To get statistics during a task's lifetime, userspace opens a unicast netlink
socket (NETLINK_GENERIC family) and sends commands specifying a pid or a tgid.
The response contains statistics for a task (if pid is specified) or the sum of
statistics for all tasks of the process (if tgid is specified).

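As a rough illustration of the request described above, the following Python
sketch builds the bytes of a get command for one pid or tgid. The helper name
build_get_request() is hypothetical, the numeric constants are copied from the
uapi headers and should be checked against taskstats.h, and the run-time
lookup of the generic netlink family id for "TASKSTATS" is assumed to have
been done already. The resulting bytes would be sent on a socket opened with
AF_NETLINK, SOCK_RAW and protocol NETLINK_GENERIC (16):

```python
import struct

# Values as defined in linux/netlink.h, linux/genetlink.h and linux/taskstats.h.
NLM_F_REQUEST = 0x1
TASKSTATS_CMD_GET = 1
TASKSTATS_GENL_VERSION = 1
TASKSTATS_CMD_ATTR_PID = 1
TASKSTATS_CMD_ATTR_TGID = 2


def align4(n):
    """Round n up to the 4-byte netlink alignment."""
    return (n + 3) & ~3


def build_get_request(family_id, target, per_tgid=False, seq=1):
    """Return the bytes of a TASKSTATS_CMD_GET request for one pid or tgid.

    family_id is the generic netlink family id of "TASKSTATS"; it is assigned
    at run time and must be looked up first (not shown here).
    """
    attr_type = TASKSTATS_CMD_ATTR_TGID if per_tgid else TASKSTATS_CMD_ATTR_PID
    payload = struct.pack("=I", target)                  # u32 pid or tgid
    attr = struct.pack("=HH", 4 + len(payload), attr_type) + payload
    attr = attr.ljust(align4(len(attr)), b"\0")
    genl = struct.pack("=BBH", TASKSTATS_CMD_GET, TASKSTATS_GENL_VERSION, 0)
    body = genl + attr
    # nlmsghdr: total length, message type, flags, sequence number, port id
    header = struct.pack("=IHHII", 16 + len(body), family_id,
                         NLM_F_REQUEST, seq, 0)
    return header + body
```
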
To obtain statistics for tasks which are exiting, the userspace listener
sends a register command and specifies a cpumask. Whenever a task exits on
one of the cpus in the cpumask, its per-pid statistics are sent to the
registered listener. Using cpumasks limits the data received by any one
listener and assists in flow control over the netlink interface; this is
explained in more detail below.

If the exiting task is the last thread exiting its thread group,
an additional record containing the per-tgid stats is also sent to userspace.
The latter contains the sum of per-pid stats for all threads in the thread
group, both past and present.

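The cpumask carried by a register command is an ASCII string of
comma-separated cpu ranges. The sketch below, in Python, renders a set of cpu
numbers in that form and wraps it in a register attribute. Both function names
are hypothetical, the attribute number is copied from taskstats.h, and the
trailing NUL byte follows what getdelays.c is understood to send; treat these
as assumptions to verify:

```python
import struct

TASKSTATS_CMD_ATTR_REGISTER_CPUMASK = 3   # from linux/taskstats.h


def format_cpumask(cpus):
    """Render cpu numbers as comma-separated ranges, e.g. "1-3,5,7-8"."""
    ranges = []
    for cpu in sorted(set(cpus)):
        if ranges and cpu == ranges[-1][1] + 1:
            ranges[-1][1] = cpu            # extend the current run
        else:
            ranges.append([cpu, cpu])      # start a new run
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


def register_cpumask_attr(cpus):
    """Return the netlink attribute bytes for a register command."""
    data = format_cpumask(cpus).encode("ascii") + b"\0"
    attr = struct.pack("=HH", 4 + len(data),
                       TASKSTATS_CMD_ATTR_REGISTER_CPUMASK) + data
    return attr + b"\0" * (-len(attr) % 4)   # pad to a 4-byte boundary
```
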
getdelays.c is a simple utility demonstrating usage of the taskstats interface
for reporting delay accounting statistics. Users can register cpumasks,
send commands and process responses, listen for per-tid/tgid exit data,
write the data received to a file and do basic flow control by increasing
receive buffer sizes.

Interface
---------

The user-kernel interface is encapsulated in include/uapi/linux/taskstats.h.

To avoid this documentation becoming obsolete as the interface evolves, only
an outline of the current version is given. taskstats.h always overrides the
description here.

struct taskstats is the common accounting structure for both per-pid and
per-tgid data. It is versioned and can be extended by each accounting subsystem
that is added to the kernel. The fields and their semantics are defined in the
taskstats.h file.

The data exchanged between user and kernel space is a netlink message belonging
to the NETLINK_GENERIC family and using the netlink attributes interface.
The messages are in the format::

    +----------+- - -+-------------+-------------------+
    | nlmsghdr | Pad |  genlmsghdr | taskstats payload |
    +----------+- - -+-------------+-------------------+


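The layout above can be taken apart along these lines. This is a minimal
Python sketch, not the kernel's or getdelays.c's code: split_message() is a
hypothetical helper, the field layouts are those of struct nlmsghdr and struct
genlmsghdr, and error messages (NLMSG_ERROR) are not handled. Since struct
nlmsghdr is 16 bytes, the pad is empty in practice:

```python
import struct

NLMSG_HDR = struct.Struct("=IHHII")   # struct nlmsghdr: len, type, flags, seq, pid
GENL_HDR = struct.Struct("=BBH")      # struct genlmsghdr: cmd, version, reserved


def split_message(buf):
    """Split one netlink message into (nlmsghdr, genlmsghdr, payload)."""
    nl_len, nl_type, flags, seq, port = NLMSG_HDR.unpack_from(buf, 0)
    offset = (NLMSG_HDR.size + 3) & ~3             # header plus pad, 4-byte aligned
    cmd, version, _reserved = GENL_HDR.unpack_from(buf, offset)
    payload = buf[offset + GENL_HDR.size:nl_len]   # the netlink attributes
    return (nl_len, nl_type, flags, seq, port), (cmd, version), payload
```
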
The taskstats payload is one of the following three kinds:

1. Commands: Sent from user to kernel. Commands to get data on
   a pid/tgid consist of one attribute, of type TASKSTATS_CMD_ATTR_PID/TGID,
   containing a u32 pid or tgid in the attribute payload. The pid/tgid denotes
   the task/process for which userspace wants statistics.

   Commands to register/deregister interest in exit data from a set of cpus
   consist of one attribute, of type
   TASKSTATS_CMD_ATTR_REGISTER/DEREGISTER_CPUMASK, and contain a cpumask in
   the attribute payload. The cpumask is specified as an ASCII string of
   comma-separated cpu ranges, e.g. to listen to exit data from cpus
   1,2,3,5,7,8 the cpumask would be "1-3,5,7-8". If userspace forgets to
   deregister interest in cpus before closing the listening socket, the
   kernel cleans up its interest set over time. However, for the sake of
   efficiency, an explicit deregistration is advisable.

2. Response for a command: Sent from the kernel in response to a userspace
   command. The payload is a series of three attributes of type:

   a) TASKSTATS_TYPE_AGGR_PID/TGID: attribute that carries no data of its own
      but indicates that a pid/tgid, followed by some stats, comes next.

   b) TASKSTATS_TYPE_PID/TGID: attribute whose payload is the pid/tgid whose
      stats are being returned.

   c) TASKSTATS_TYPE_STATS: attribute with a struct taskstats as payload. The
      same structure is used for both per-pid and per-tgid stats.

3. New message sent by the kernel whenever a task exits. The payload consists
   of a series of attributes of the following types:

   a) TASKSTATS_TYPE_AGGR_PID: indicates next two attributes will be pid+stats
   b) TASKSTATS_TYPE_PID: contains exiting task's pid
   c) TASKSTATS_TYPE_STATS: contains the exiting task's per-pid stats
   d) TASKSTATS_TYPE_AGGR_TGID: indicates next two attributes will be tgid+stats
   e) TASKSTATS_TYPE_TGID: contains tgid of process to which task belongs
   f) TASKSTATS_TYPE_STATS: contains the per-tgid stats for exiting task's process


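A reader of replies and exit messages could walk the attributes roughly as
follows. This Python sketch assumes that each AGGR attribute encloses its
pid/tgid and stats attributes as nested attributes, which should be confirmed
against getdelays.c and the kernel source. The function names are
hypothetical, the type numbers are copied from taskstats.h, and the stats are
returned as raw bytes rather than decoded:

```python
import struct

# Attribute types from linux/taskstats.h.
TASKSTATS_TYPE_PID = 1
TASKSTATS_TYPE_TGID = 2
TASKSTATS_TYPE_STATS = 3
TASKSTATS_TYPE_AGGR_PID = 4
TASKSTATS_TYPE_AGGR_TGID = 5


def iter_attrs(buf):
    """Yield (type, payload) for each netlink attribute packed in buf."""
    offset = 0
    while offset + 4 <= len(buf):
        nla_len, nla_type = struct.unpack_from("=HH", buf, offset)
        if nla_len < 4 or offset + nla_len > len(buf):
            break                                  # malformed attribute
        yield nla_type, buf[offset + 4:offset + nla_len]
        offset += (nla_len + 3) & ~3               # attributes are 4-byte aligned


def parse_records(payload):
    """Return a list of (kind, id, raw_stats); kind is "pid" or "tgid"."""
    records = []
    for aggr_type, nested in iter_attrs(payload):
        if aggr_type not in (TASKSTATS_TYPE_AGGR_PID, TASKSTATS_TYPE_AGGR_TGID):
            continue                               # skip types not understood
        ident, stats = None, None
        for nla_type, data in iter_attrs(nested):
            if nla_type in (TASKSTATS_TYPE_PID, TASKSTATS_TYPE_TGID):
                (ident,) = struct.unpack("=I", data[:4])
            elif nla_type == TASKSTATS_TYPE_STATS:
                stats = data
        kind = "pid" if aggr_type == TASKSTATS_TYPE_AGGR_PID else "tgid"
        records.append((kind, ident, stats))
    return records
```
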
per-tgid stats
--------------

Taskstats provides per-process stats, in addition to per-task stats, since
resource management is often done at a process granularity and aggregating task
stats in userspace alone is inefficient and potentially inaccurate (due to lack
of atomicity).

However, maintaining per-process stats in addition to per-task stats within
the kernel has space and time overheads. To address this, the taskstats code
accumulates each exiting task's statistics into a process-wide data structure.
When the last task of a process exits, the accumulated process-level data is
also sent to userspace (along with the per-task data).

When a user queries per-tgid data, the stats of all live threads in the group
are summed and added to the accumulated total for previously exited threads
of the same thread group.

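The bookkeeping described in this section can be modelled in a few lines of
Python. This is a toy model for illustration only, not the kernel's data
structures: the class and method names and the cpu_ns counter are
hypothetical, and statistics are reduced to simple additive counters:

```python
from collections import Counter


class ThreadGroup:
    """Toy model of per-tgid accounting; stats are name -> value counters."""

    def __init__(self):
        self.live = {}             # pid -> Counter for threads still running
        self.exited = Counter()    # running total for threads that have exited

    def charge(self, pid, **stats):
        """Add to the per-pid counters of a live thread."""
        self.live.setdefault(pid, Counter()).update(stats)

    def exit_task(self, pid):
        """Fold the task's stats into the group total; return what is reported."""
        per_pid = self.live.pop(pid)
        self.exited.update(per_pid)
        # The per-tgid record is reported only once the last thread has gone.
        per_tgid = Counter(self.exited) if not self.live else None
        return per_pid, per_tgid

    def query_tgid(self):
        """Sum over live threads plus the total for exited threads."""
        total = Counter(self.exited)
        for stats in self.live.values():
            total.update(stats)
        return total
```
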
Extending taskstats
-------------------

There are two ways to extend the taskstats interface to export more
per-task/process stats as patches to collect them get added to the kernel
in the future:

1. Adding more fields to the end of the existing struct taskstats. Backward
   compatibility is ensured by the version number within the
   structure. Userspace will use only the fields of the struct that correspond
   to the version it is using.

2. Defining separate statistic structs and using the netlink attributes
   interface to return them. Since userspace processes each netlink attribute
   independently, it can always ignore attributes whose type it does not
   understand (because it is using an older version of the interface).


Choosing between 1. and 2. is a matter of trading off flexibility and
overhead. If only a few fields need to be added, then 1. is the preferable
path since the kernel and userspace don't need to incur the overhead of
processing new netlink attributes. But if the new fields expand the existing
struct too much, requiring disparate userspace accounting utilities to
unnecessarily receive large structures whose fields are of no interest, then
extending the attributes structure would be worthwhile.

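Option 1 relies on readers using only the fields that belong to the version
they understand. The Python sketch below shows one way a reader might do that.
The field table is entirely hypothetical (counter_a and counter_b do not
exist); the only detail taken from the real structure is that it begins with a
16-bit version field. The actual fields and offsets are defined in
taskstats.h:

```python
import struct

# Hypothetical layout for illustration only; the real fields are in taskstats.h.
# Each entry: (name, struct format, byte offset, first version with the field).
FIELDS = [
    ("version", "=H", 0, 1),
    ("counter_a", "=Q", 8, 1),
    ("counter_b", "=Q", 16, 2),    # imagined as appended in version 2
]


def decode(buf, understood_version):
    """Decode only the fields that both the sender and this reader know."""
    (sent_version,) = struct.unpack_from("=H", buf, 0)
    usable = min(sent_version, understood_version)
    out = {}
    for name, fmt, offset, since in FIELDS:
        if since <= usable and offset + struct.calcsize(fmt) <= len(buf):
            (out[name],) = struct.unpack_from(fmt, buf, offset)
    return out
```
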
Flow control for taskstats
--------------------------

When the rate of task exits becomes large, a listener may not be able to keep
up with the kernel's rate of sending per-tid/tgid exit data, leading to data
loss. This possibility gets compounded when the taskstats structure gets
extended and the number of cpus grows large.

To avoid losing statistics, userspace should do one or more of the following:

- increase the receive buffer sizes for the netlink sockets opened by
  listeners to receive exit data.

- create more listeners and reduce the number of cpus being listened to by
  each listener. In the extreme case, there could be one listener for each cpu.
  Users may also consider setting the cpu affinity of the listener to the subset
  of cpus to which it listens, especially if they are listening to just one cpu.

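The two measures above might look as follows in a Python listener. This is a
sketch under assumptions: split_cpus() and tune_listener() are hypothetical
helpers, the round-robin split is just one possible way to divide the cpus,
and the kernel may cap the requested buffer size, which is why the effective
size is read back and returned:

```python
import os
import socket


def split_cpus(cpus, listeners):
    """Deal the cpus out round-robin so each listener gets a disjoint subset."""
    ordered = sorted(cpus)
    return [ordered[i::listeners] for i in range(listeners)]


def tune_listener(sock, rcvbuf_bytes, cpus=None):
    """Enlarge the receive buffer and optionally pin this process to cpus."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, rcvbuf_bytes)
    if cpus and hasattr(os, "sched_setaffinity"):    # Linux only
        os.sched_setaffinity(0, set(cpus))
    # Report the size actually in effect, which may differ from the request.
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
```
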
Despite these measures, if userspace receives ENOBUFS error messages
indicating overflow of receive buffers, it should take measures to handle the
loss of data.
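
One way a listener could notice such losses is to catch ENOBUFS on receive and
keep count, as in this minimal Python sketch. The drain() helper and its
parameters are hypothetical, and what to do about the lost records (for
example, logging the gap or re-querying live tasks) is left to the
application:

```python
import errno


def drain(sock, handle, max_messages, bufsize=65536):
    """Receive up to max_messages datagrams; return how many overruns occurred."""
    overruns = 0
    for _ in range(max_messages):
        try:
            data = sock.recv(bufsize)
        except OSError as exc:
            if exc.errno != errno.ENOBUFS:
                raise
            overruns += 1          # the receive buffer overflowed, data was lost
            continue
        handle(data)
    return overruns
```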