Orange Pi5 kernel

Deprecated Linux kernel 5.10.110 for OrangePi 5/5B/5+ boards

.. _perf_security:

Perf events and tool security
=============================

Overview
--------

Usage of Performance Counters for Linux (perf_events) [1]_ , [2]_ , [3]_
can pose a considerable risk of leaking sensitive data accessed by
monitored processes. Data leakage is possible both through direct use of
the perf_events system call API [2]_ and through data files generated by
the Perf tool user mode utility (Perf) [3]_ , [4]_ . The risk depends on
the nature of the data that perf_events performance monitoring units
(PMU) [2]_ and Perf collect and expose for performance analysis.
Collected system and performance data may be split into several
categories:

1. System hardware and software configuration data, for example: the CPU
   model and its cache configuration, the amount of available memory and
   its topology, the kernel and Perf versions in use, and the performance
   monitoring setup, including experiment time, events configuration, and
   Perf command line parameters.

2. User and kernel module paths and their load addresses with sizes,
   process and thread names with their PIDs and TIDs, and timestamps for
   captured hardware and software events.

3. Content of kernel software counters (e.g., for context switches, page
   faults, CPU migrations), architectural hardware performance counters
   (PMC) [8]_ and model-specific registers (MSR) [9]_ that provide
   execution metrics for various monitored parts of the system (e.g.,
   memory controller (IMC), interconnect (QPI/UPI) or peripheral (PCIe)
   uncore counters) without direct attribution to any execution context
   state.

4. Content of architectural execution context registers (e.g., RIP, RSP,
   RBP on x86_64), process user and kernel space memory addresses and
   data, and content of various architectural MSRs that capture data from
   this category.

Data that belong to the fourth category can potentially contain
sensitive process data. If PMUs in some monitoring modes capture values
of execution context registers or data from process memory, then access
to such monitoring modes must be regulated and secured properly.
Therefore, perf_events performance monitoring and observability
operations are subject to security access control management [5]_ .

perf_events access control
-------------------------------

To perform security checks, the Linux implementation splits processes
into two categories [6]_ : a) privileged processes (whose effective user
ID is 0, referred to as superuser or root), and b) unprivileged
processes (whose effective UID is nonzero). Privileged processes bypass
all kernel security permission checks, so perf_events performance
monitoring is fully available to privileged processes without access,
scope and resource restrictions.

Unprivileged processes are subject to a full security permission check
based on the process's credentials [5]_ (usually: effective UID,
effective GID, and supplementary group list).

Linux divides the privileges traditionally associated with superuser
into distinct units, known as capabilities [6]_ , which can be
independently enabled and disabled on a per-thread basis for processes
and files of unprivileged users.

Unprivileged processes with the CAP_PERFMON capability enabled are
treated as privileged processes with respect to perf_events performance
monitoring and observability operations and thus bypass *scope*
permission checks in the kernel. CAP_PERFMON implements the principle of
least privilege [13]_ (POSIX 1003.1e: 2.2.2.39) for performance
monitoring and observability operations in the kernel and provides a
secure approach to performance monitoring and observability in the
system.

For backward compatibility reasons, access to perf_events monitoring and
observability operations is also open to CAP_SYS_ADMIN privileged
processes, but using CAP_SYS_ADMIN for secure monitoring and
observability use cases is discouraged in favor of the CAP_PERFMON
capability. If system audit records [14]_ for a process using the
perf_events system call API contain denial records for acquiring both
the CAP_PERFMON and CAP_SYS_ADMIN capabilities, then providing the
process with the CAP_PERFMON capability alone is the recommended secure
way to resolve the double access denial logging related to performance
monitoring and observability.

Unprivileged processes using the perf_events system call are also
subject to the PTRACE_MODE_READ_REALCREDS ptrace access mode check [7]_ ,
whose outcome determines whether monitoring is permitted. Unprivileged
processes provided with the CAP_SYS_PTRACE capability are therefore
effectively permitted to pass the check.

Other capabilities granted to unprivileged processes can effectively
enable capturing of additional data required for later performance
analysis of monitored processes or a system. For example, the CAP_SYSLOG
capability permits reading kernel space memory addresses from the
/proc/kallsyms file.
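
As a rough illustration of the capability checks described in this
section, the following shell sketch decodes the hexadecimal CapEff mask
that /proc/<pid>/status reports and shows which of the perf-related
capabilities are effective for the current shell. The helper name
cap_is_set is hypothetical and not part of any tool; the capability
numbers are the ones defined in <linux/capability.h>.

```shell
# Succeed if capability number $2 is set in the hexadecimal mask $1
# (the format of the Cap* fields in /proc/<pid>/status).
cap_is_set() (
    mask=$1
    bit=$2
    [ $(( (0x$mask >> $bit) & 1 )) -eq 1 ]
)

# Report the perf-related capabilities in this shell's effective set.
# The block is skipped on systems without a Linux-style /proc.
if [ -r "/proc/$$/status" ]; then
    eff=$(awk '/^CapEff:/ { print $2 }' "/proc/$$/status")
    if [ -n "$eff" ]; then
        for entry in CAP_PERFMON:38 CAP_SYS_ADMIN:21 CAP_SYS_PTRACE:19 \
                     CAP_SYSLOG:34 CAP_IPC_LOCK:14; do
            name=${entry%%:*}
            bit=${entry##*:}
            if cap_is_set "$eff" "$bit"; then state=set; else state=unset; fi
            echo "$name ($bit): $state"
        done
    fi
fi
```

For an ordinary unprivileged login shell all five entries would normally
be reported as unset, since CapEff is typically zero in that case.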

Privileged Perf users groups
---------------------------------

Mechanisms of capabilities, privileged capability-dumb files [6]_ and
file system ACLs [10]_ can be used to create dedicated groups of
privileged Perf users who are permitted to execute performance monitoring
and observability without scope limits. The following steps can be
taken to create such groups of privileged Perf users.

1. Create the perf_users group of privileged Perf users, assign the
   perf_users group to the Perf tool executable, and limit access to the
   executable for other users in the system who are not in the perf_users
   group:

::

   # groupadd perf_users
   # ls -alhF
   -rwxr-xr-x  2 root root  11M Oct 19 15:12 perf
   # chgrp perf_users perf
   # ls -alhF
   -rwxr-xr-x  2 root perf_users  11M Oct 19 15:12 perf
   # chmod o-rwx perf
   # ls -alhF
   -rwxr-x---  2 root perf_users  11M Oct 19 15:12 perf

2. Assign the required capabilities to the Perf tool executable file and
   enable members of the perf_users group with monitoring and
   observability privileges [6]_ :

::

   # setcap "cap_perfmon,cap_sys_ptrace,cap_syslog=ep" perf
   # setcap -v "cap_perfmon,cap_sys_ptrace,cap_syslog=ep" perf
   perf: OK
   # getcap perf
   perf = cap_sys_ptrace,cap_syslog,cap_perfmon+ep

If the installed libcap doesn't yet support "cap_perfmon", use "38" instead,
i.e.:

::

   # setcap "38,cap_ipc_lock,cap_sys_ptrace,cap_syslog=ep" perf

Note that you may need to add 'cap_ipc_lock' to the mix for tools such as
'perf top'. Alternatively, use 'perf top -m N' to reduce the memory that
it uses for the perf ring buffer; see the memory allocation section below.

Using a libcap without support for CAP_PERFMON makes cap_get_flag(caps, 38,
CAP_EFFECTIVE, &val) fail, which causes the default event to be 'cycles:u'.
As a workaround, explicitly ask for the 'cycles' event to get kernel and
user samples with a perf binary that has just CAP_PERFMON, i.e.:

::

   # perf top -e cycles

As a result, members of the perf_users group are capable of conducting
performance monitoring and observability by using functionality of the
configured Perf tool executable that, when executed, passes perf_events
subsystem scope checks.

This specific access control management is only available to superuser
or root running processes with CAP_SETPCAP, CAP_SETFCAP [6]_
capabilities.
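
To double-check the result of step 2 from a script, the getcap output can
be inspected for each required capability. The sketch below is only an
illustration: file_has_cap is a hypothetical helper, and it assumes the
getcap output is a single line whose last field is a comma-separated
capability list followed by a flag suffix such as "+ep" or "=ep", as in
the transcript above.

```shell
# Succeed if capability name $2 appears in the getcap output line $1.
file_has_cap() (
    line=$1
    cap=$2
    list=${line##* }       # last field: capability list with flag suffix
    list=${list%%[+=]*}    # drop the suffix, e.g. "+ep" or "=ep"
    case ",$list," in
        *",$cap,"*) exit 0 ;;
        *)          exit 1 ;;
    esac
)

# Sample line copied from the getcap transcript in step 2.
sample="perf = cap_sys_ptrace,cap_syslog,cap_perfmon+ep"
for cap in cap_perfmon cap_sys_ptrace cap_syslog cap_ipc_lock; do
    if file_has_cap "$sample" "$cap"; then
        echo "$cap: present"
    else
        echo "$cap: missing"
    fi
done
```

With the sample line, only cap_ipc_lock is reported as missing, which
matches the first setcap command in step 2.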

Unprivileged users
-----------------------------------

perf_events *scope* and *access* control for unprivileged processes
is governed by the perf_event_paranoid [2]_ setting:

-1:
     Impose no *scope* and *access* restrictions on using perf_events
     performance monitoring. The per-user per-cpu perf_event_mlock_kb [2]_
     locking limit is ignored when allocating memory buffers for storing
     performance data. This is the least secure mode since the allowed
     monitored *scope* is maximized and no perf_events specific limits
     are imposed on *resources* allocated for performance monitoring.

>=0:
     *scope* includes per-process and system wide performance monitoring
     but excludes raw tracepoints and ftrace function tracepoints
     monitoring. CPU and system events that happen while executing in
     either user or kernel space can be monitored and captured for later
     analysis. The per-user per-cpu perf_event_mlock_kb locking limit is
     imposed but ignored for unprivileged processes with the CAP_IPC_LOCK
     [6]_ capability.

>=1:
     *scope* includes per-process performance monitoring only and
     excludes system wide performance monitoring. CPU and system events
     that happen while executing in either user or kernel space can be
     monitored and captured for later analysis. The per-user per-cpu
     perf_event_mlock_kb locking limit is imposed but ignored for
     unprivileged processes with the CAP_IPC_LOCK capability.

>=2:
     *scope* includes per-process performance monitoring only. Only CPU
     and system events that happen while executing in user space can be
     monitored and captured for later analysis. The per-user per-cpu
     perf_event_mlock_kb locking limit is imposed but ignored for
     unprivileged processes with the CAP_IPC_LOCK capability.
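
The levels above can be summarized with a small shell sketch that maps a
perf_event_paranoid value to the scope an unprivileged process is left
with. The function name paranoid_scope and its wording are illustrative
only; the current value is read from /proc/sys/kernel/perf_event_paranoid
when that file is readable.

```shell
# Print the monitoring scope left to unprivileged processes for the
# perf_event_paranoid value given in $1. Restrictions are cumulative,
# so values of 2 and above fall through to the last branch.
paranoid_scope() (
    case $1 in
        -*) echo "unrestricted scope; perf_event_mlock_kb limit ignored" ;;
        0)  echo "per-process and system-wide; user and kernel space; no raw or ftrace tracepoints" ;;
        1)  echo "per-process only; user and kernel space; no raw or ftrace tracepoints" ;;
        *)  echo "per-process only; user space only; no raw or ftrace tracepoints" ;;
    esac
)

# Describe the setting of the running system, if it is exposed.
if [ -r /proc/sys/kernel/perf_event_paranoid ]; then
    level=$(cat /proc/sys/kernel/perf_event_paranoid)
    echo "perf_event_paranoid=$level: $(paranoid_scope "$level")"
fi
```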

Resource control
---------------------------------

Open file descriptors
+++++++++++++++++++++

The perf_events system call API [2]_ allocates file descriptors for
every configured PMU event. Open file descriptors are a per-process
accountable resource governed by the RLIMIT_NOFILE [11]_ limit
(ulimit -n), which is usually derived from the login shell process. When
configuring Perf collection for a long list of events on a large server
system, this limit can easily be hit, preventing the required monitoring
configuration. The RLIMIT_NOFILE limit can be increased on a per-user
basis by modifying the content of the limits.conf file [12]_ . Ordinarily,
a Perf sampling session (perf record) requires a number of open
perf_event file descriptors that is not less than the number of
monitored events multiplied by the number of monitored CPUs.
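
The events-times-CPUs estimate can be compared against the current soft
limit before starting a session. The following is a minimal sketch:
perf_fds_needed is a hypothetical helper, the event count of 10 is an
arbitrary example, and the result is only a lower bound because the
process needs descriptors for other files as well.

```shell
# Lower bound on perf_event descriptors: events ($1) times CPUs ($2).
perf_fds_needed() (
    echo $(( $1 * $2 ))
)

events=10    # hypothetical number of events passed to perf record
cpus=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
needed=$(perf_fds_needed "$events" "$cpus")
limit=$(ulimit -n)    # soft RLIMIT_NOFILE of this shell

if [ "$limit" != "unlimited" ] && [ "$needed" -gt "$limit" ]; then
    echo "need at least $needed descriptors, but the soft limit is $limit"
else
    echo "soft limit $limit covers the $needed descriptors needed for events"
fi
```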

Memory allocation
+++++++++++++++++

The amount of memory available to user processes for capturing
performance monitoring data is governed by the perf_event_mlock_kb [2]_
setting. This perf_event specific resource setting defines the overall
per-cpu limit of memory that user processes are allowed to map for
performance monitoring. The setting essentially extends the
RLIMIT_MEMLOCK [11]_ limit, but only for memory regions mapped
specifically for capturing monitored performance events and related data.

For example, if a machine has eight cores and the perf_event_mlock_kb
limit is set to 516 KiB, then a user process is provided with 516 KiB * 8 =
4128 KiB of memory above the RLIMIT_MEMLOCK limit (ulimit -l) for
perf_event mmap buffers. In particular, this means that if the user
wants to start two or more performance monitoring processes, the user is
required to manually distribute the available 4128 KiB between the
monitoring processes, for example, using the --mmap-pages Perf record
mode option. Otherwise, the first started performance monitoring process
allocates all available 4128 KiB and the other processes will fail to
proceed due to the lack of memory.
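
The arithmetic of this example, and one possible way of splitting the
budget, can be sketched in shell. The helper names are hypothetical. The
split assumes a 4 KiB page size, that every monitoring process maps one
ring buffer per CPU, and that each buffer consists of a power-of-two
number of data pages plus one control page, as perf_event_open(2)
describes for the mmap layout [2]_ . Actual accounting in a given kernel
may differ.

```shell
# Total budget in KiB: per-cpu perf_event_mlock_kb ($1) times CPUs ($2).
perf_mlock_budget_kb() (
    echo $(( $1 * $2 ))
)

# Largest power-of-two page count such that $2 processes, each mapping
# (pages + 1) pages per CPU, fit into $1 KiB per CPU.
# $3 is the page size in KiB (assumed to be 4 when omitted).
mmap_pages_per_proc() (
    mlock_kb=$1
    procs=$2
    page_kb=${3:-4}
    # Pages available to one buffer, control page included.
    avail=$(( $mlock_kb / $page_kb / $procs ))
    if [ "$avail" -lt 2 ]; then
        echo 0    # not even one data page plus the control page fits
        exit 0
    fi
    pages=1
    while [ $(( $pages * 2 + 1 )) -le "$avail" ]; do
        pages=$(( $pages * 2 ))
    done
    echo "$pages"
)

perf_mlock_budget_kb 516 8    # prints 4128, as in the example above
mmap_pages_per_proc 516 2     # prints 32 for two concurrent processes
```

Under these assumptions, two concurrent sessions could each be started
with --mmap-pages 32, while a single session could use 128 pages per
buffer.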

RLIMIT_MEMLOCK and perf_event_mlock_kb resource constraints are ignored
for processes with the CAP_IPC_LOCK capability. Thus, perf_events/Perf
privileged users can be provided with memory above the constraints for
perf_events/Perf performance monitoring purposes by providing the Perf
executable with the CAP_IPC_LOCK capability.

Bibliography
------------

.. [1] `<https://lwn.net/Articles/337493/>`_
.. [2] `<http://man7.org/linux/man-pages/man2/perf_event_open.2.html>`_
.. [3] `<http://web.eece.maine.edu/~vweaver/projects/perf_events/>`_
.. [4] `<https://perf.wiki.kernel.org/index.php/Main_Page>`_
.. [5] `<https://www.kernel.org/doc/html/latest/security/credentials.html>`_
.. [6] `<http://man7.org/linux/man-pages/man7/capabilities.7.html>`_
.. [7] `<http://man7.org/linux/man-pages/man2/ptrace.2.html>`_
.. [8] `<https://en.wikipedia.org/wiki/Hardware_performance_counter>`_
.. [9] `<https://en.wikipedia.org/wiki/Model-specific_register>`_
.. [10] `<http://man7.org/linux/man-pages/man5/acl.5.html>`_
.. [11] `<http://man7.org/linux/man-pages/man2/getrlimit.2.html>`_
.. [12] `<http://man7.org/linux/man-pages/man5/limits.conf.5.html>`_
.. [13] `<https://sites.google.com/site/fullycapable>`_
.. [14] `<http://man7.org/linux/man-pages/man8/auditd.8.html>`_