^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1) .. SPDX-License-Identifier: GPL-2.0
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 3) ==============
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 4) Devlink Health
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 5) ==============
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 6)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 7) Background
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 8) ==========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 9)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 10) The ``devlink`` health mechanism targets real-time alerting, so that users
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 11) know when something bad has happened to a PCI device. Its goals are:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 12)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 13) * Provide alert debug information.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 14) * Self healing.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 15) * If a problem needs vendor support, provide a way to gather all needed
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 16)   debugging information.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 17)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 18) Overview
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 19) ========
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 20)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 21) The main idea is to unify and centralize driver health reports in the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 22) generic ``devlink`` instance and allow the user to set different
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 23) attributes of the health reporting and recovery procedures.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 24)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 25) The ``devlink`` health reporter:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 26) A device driver creates a "health reporter" for each error/health type.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 27) An error/health type can be known/generic (e.g. PCI error, FW error, rx/tx error)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 28) or unknown (driver specific).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 29) For each registered health reporter, a driver can issue error/health reports
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 30) asynchronously. All health report handling is done by ``devlink``.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 31) A device driver can provide specific callbacks for each "health reporter", e.g.:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 32)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 33) * Recovery procedures
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 34) * Diagnostics procedures
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 35) * Object dump procedures
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 36) * Out-of-box (OOB) initial parameters
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 37)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 38) Different parts of the driver can register different types of health reporters
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 39) with different handlers.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 40)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 41) Actions
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 42) =======
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 43)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 44) Once an error is reported, devlink health will perform the following actions:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 45)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 46) * A log is sent to the kernel trace events buffer.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 47) * Health status and statistics are updated for the reporter instance.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 48) * An object dump is taken and saved at the reporter instance (as long as
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 49)   no other dump is already stored).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 50) * An auto-recovery attempt is made, depending on the auto-recovery
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 51)   configuration and on the grace period versus the time elapsed since the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 52)   last recovery.
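
The sequence of actions above can be sketched as a small user-space model.
This is not kernel code: the ``Reporter`` class, its field names and the
``report()`` method are hypothetical and only mirror the four steps listed
above. In particular, the grace-period check assumes that auto recovery is
skipped while less than the grace period has elapsed since the last
successful recovery.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Reporter:
    """Toy model of one health reporter (all names are hypothetical)."""
    name: str
    recover: Optional[Callable[[], bool]] = None  # driver recovery callback
    dump: Optional[Callable[[], dict]] = None     # driver object-dump callback
    auto_recover: bool = True                     # auto-recovery configuration
    grace_period_s: float = 0.0                   # grace period, in seconds
    healthy: bool = True
    error_count: int = 0
    recover_count: int = 0
    stored_dump: Optional[dict] = None
    last_recovery: Optional[float] = None
    trace: list = field(default_factory=list)     # stands in for trace events

    def report(self, msg: str, now: Optional[float] = None) -> bool:
        """Handle one error report; return True if auto recovery succeeded."""
        now = time.monotonic() if now is None else now
        # 1. Log the event.
        self.trace.append((self.name, msg))
        # 2. Update health status and statistics.
        self.healthy = False
        self.error_count += 1
        # 3. Take a dump only if none is already stored.
        if self.stored_dump is None and self.dump is not None:
            self.stored_dump = self.dump()
        # 4. Attempt auto recovery, subject to configuration and grace period.
        if not self.auto_recover or self.recover is None:
            return False
        if (self.last_recovery is not None
                and now - self.last_recovery < self.grace_period_s):
            return False
        if self.recover():
            self.healthy = True
            self.recover_count += 1
            self.last_recovery = now
            return True
        return False


tx = Reporter("tx", recover=lambda: True, dump=lambda: {"queue": 3},
              grace_period_s=10.0)
print(tx.report("tx timeout", now=100.0))  # True: recovery is attempted
print(tx.report("tx timeout", now=105.0))  # False: inside the grace period
```

In this model the second report is still logged and counted, but the stored
dump is left untouched and no recovery is attempted.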
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 53)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 54) User Interface
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 55) ==============
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 56)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 57) The user can access and change each reporter's parameters and invoke its
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 58) driver-specific callbacks via ``devlink``, e.g. per error type (per health reporter):
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 59)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 60) * Configure reporter's generic parameters (e.g. disable/enable auto recovery)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 61) * Invoke recovery procedure
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 62) * Run diagnostics
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 63) * Object dump
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 64)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 65) .. list-table:: List of devlink health interfaces
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 66)    :widths: 10 90
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 67)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 68)    * - Name
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 69)      - Description
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 70)    * - ``DEVLINK_CMD_HEALTH_REPORTER_GET``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 71)      - Retrieves status and configuration info per device and reporter.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 72)    * - ``DEVLINK_CMD_HEALTH_REPORTER_SET``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 73)      - Allows reporter-related configuration setting.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 74)    * - ``DEVLINK_CMD_HEALTH_REPORTER_RECOVER``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 75)      - Triggers a reporter's recovery procedure.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 76)    * - ``DEVLINK_CMD_HEALTH_REPORTER_DIAGNOSE``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 77)      - Retrieves diagnostics data from a reporter on a device.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 78)    * - ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 79)      - Retrieves the last stored dump. Devlink health
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 80)        saves a single dump. If a dump is not already stored by devlink
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 81)        for this reporter, devlink generates a new dump.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 82)        Dump output is defined by the reporter.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 83)    * - ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_CLEAR``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 84)      - Clears the last saved dump file for the specified reporter.
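
The single-dump behaviour of ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET`` and
``DEVLINK_CMD_HEALTH_REPORTER_DUMP_CLEAR`` can be illustrated with a minimal
sketch. The ``DumpStore`` class and its method names are hypothetical and do
not correspond to kernel symbols; the sketch also assumes that a dump
generated on demand is kept as the stored dump until it is cleared.

```python
from typing import Callable, Optional


class DumpStore:
    """Hypothetical model of a reporter that keeps at most one dump."""

    def __init__(self, generate: Callable[[], dict]):
        self._generate = generate            # reporter-defined dump producer
        self._stored: Optional[dict] = None  # the single saved dump, if any

    def dump_get(self) -> dict:
        # Return the stored dump; generate one only if none is stored.
        if self._stored is None:
            self._stored = self._generate()
        return self._stored

    def dump_clear(self) -> None:
        # Drop the saved dump so that the next dump_get() makes a fresh one.
        self._stored = None
```

Under these assumptions, repeated ``dump_get()`` calls return the same dump
until ``dump_clear()`` is called.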
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 85)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 86) The following diagram provides a general overview of ``devlink-health``::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 87)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 88)                                                      netlink
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 89)                                            +--------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 90)                                            |                          |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 91)                                            |            +             |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 92)                                            |            |             |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 93)                                            +--------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 94)                                                         |request for ops
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 95)                                                         |(diagnose,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 96)       mlx5_core                              devlink    |recover,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 97)                                                         |dump)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 98)      +--------+                            +--------------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 99)      |        |                            |    reporter|             |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 100)      |        |                            |  +---------v----------+  |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 101)      |        |   ops execution            |  |                    |  |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 102)      |     <----------------------------------+                    |  |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 103)      |        |                            |  |                    |  |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 104)      |        |                            |  + ^------------------+  |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 105)      |        |                            |    | request for ops     |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 106)      |        |                            |    | (recover, dump)     |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 107)      |        |                            |    |                     |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 108)      |        |                            |  +-+------------------+  |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 109)      |        |     health report          |  | health handler     |  |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 110)      |        +------------------------------->                    |  |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 111)      |        |                            |  +--------------------+  |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 112)      |        |     health reporter create |                          |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 113)      |        +---------------------------->                          |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 114)      +--------+                            +--------------------------+