.. SPDX-License-Identifier: GPL-2.0

==============
Devlink Health
==============

Background
==========

The ``devlink`` health mechanism is targeted at real-time alerting, so that
the user knows when something bad has happened to a PCI device. Its goals are:

  * Provide debug information along with the alert.
  * Self healing.
  * If a problem needs vendor support, provide a way to gather all needed
    debugging information.

Overview
========

The main idea is to unify and centralize driver health reports in the
generic ``devlink`` instance and allow the user to set different
attributes of the health reporting and recovery procedures.

The ``devlink`` health reporter:
A device driver creates a "health reporter" for each error/health type.
An error/health type can be known/generic (e.g. PCI error, FW error, RX/TX
error) or unknown (driver specific).
For each registered health reporter, a driver can issue error/health reports
asynchronously. All handling of health reports is done by ``devlink``.
A device driver can provide specific callbacks for each "health reporter",
e.g.:

  * Recovery procedures
  * Diagnostics procedures
  * Object dump procedures
  * OOB initial parameters

Different parts of the driver can register different types of health reporters
with different handlers.
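
As an illustration of per-reporter callbacks, the following is a minimal
userspace sketch, not kernel code. All names in it (``struct reporter_ops``,
``reporter_recover()``, ``tx_ops``, ``fw_ops``) are hypothetical and only model
the idea that each reporter carries its own optional handlers. Returning
``-EOPNOTSUPP`` when a handler is missing is an assumption of this sketch.

.. code-block:: c

    #include <assert.h>
    #include <errno.h>
    #include <stddef.h>

    /* Hypothetical model types; these are NOT the kernel's structures. */
    struct reporter_ops {
        const char *name;             /* error/health type, e.g. "tx" */
        int (*recover)(void *priv);   /* optional recovery procedure */
        int (*diagnose)(void *priv);  /* optional diagnostics procedure */
        int (*dump)(void *priv);      /* optional object dump procedure */
    };

    struct reporter {
        const struct reporter_ops *ops;
        void *priv;                   /* driver context handed to callbacks */
    };

    /* Central side: run the recovery callback if the driver supplied one. */
    static int reporter_recover(const struct reporter *r)
    {
        assert(r != NULL && r->ops != NULL);
        if (!r->ops->recover)
            return -EOPNOTSUPP;       /* assumed behaviour for a missing handler */
        return r->ops->recover(r->priv);
    }

    /* Driver side: a recovery handler that counts how often it ran. */
    static int tx_recover(void *priv)
    {
        int *resets = priv;

        (*resets)++;
        return 0;
    }

    /* Two reporters registered by different parts of the same driver. */
    static const struct reporter_ops tx_ops = { .name = "tx", .recover = tx_recover };
    static const struct reporter_ops fw_ops = { .name = "fw" };  /* no recover handler */

Here the "tx" reporter supplies a recovery handler while the "fw" reporter does
not, so a recovery request succeeds for the former and is rejected for the
latter.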

Actions
=======

Once an error is reported, devlink health performs the following actions:

  * A log is sent to the kernel trace events buffer
  * Health status and statistics are updated for the reporter instance
  * An object dump is taken and saved at the reporter instance (as long as
    no other dump is already stored)
  * An auto-recovery attempt is made, depending on:

    - Auto-recovery configuration
    - Grace period vs. time passed since the last recovery
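
The sequence above can be modelled with a small userspace sketch. It is not the
kernel implementation: ``struct reporter_state`` and ``handle_report()`` are
hypothetical names, the steps simply follow the order of the list, and two
details are assumptions of the sketch, namely that the first report is always
eligible for recovery and that time is given in milliseconds by the caller.
The trace event is omitted.

.. code-block:: c

    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical per-reporter state; not the kernel's structure. */
    struct reporter_state {
        bool     auto_recover;      /* auto-recovery configuration */
        uint64_t grace_period_ms;   /* minimum spacing between auto recoveries */
        uint64_t last_recovery_ms;  /* time of the most recent recovery */
        unsigned recovery_count;
        unsigned error_count;
        bool     dump_stored;       /* a dump is already saved */
    };

    /* Handle one error report; returns true if auto recovery was attempted. */
    static bool handle_report(struct reporter_state *r, uint64_t now_ms)
    {
        assert(r != NULL);

        r->error_count++;           /* update statistics */

        if (!r->dump_stored)        /* keep an existing dump untouched */
            r->dump_stored = true;

        if (!r->auto_recover)
            return false;

        /* Skip recovery while still inside the grace period. */
        if (r->recovery_count > 0 &&
            now_ms - r->last_recovery_ms < r->grace_period_ms)
            return false;

        r->recovery_count++;
        r->last_recovery_ms = now_ms;
        return true;
    }

With a 500 ms grace period, a report arriving 200 ms after a recovery only
updates the statistics, while one arriving 600 ms after it triggers another
recovery attempt.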

User Interface
==============

The user can access and change each reporter's parameters and driver-specific
callbacks via ``devlink``, e.g. per error type (per health reporter):

  * Configure the reporter's generic parameters (e.g. disable/enable auto
    recovery)
  * Invoke the recovery procedure
  * Run diagnostics
  * Object dump

.. list-table:: List of devlink health interfaces
   :widths: 10 90

   * - Name
     - Description
   * - ``DEVLINK_CMD_HEALTH_REPORTER_GET``
     - Retrieves status and configuration info per DEV and reporter.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_SET``
     - Allows reporter-related configuration setting.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_RECOVER``
     - Triggers a reporter's recovery procedure.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_DIAGNOSE``
     - Retrieves diagnostics data from a reporter on a device.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET``
     - Retrieves the last stored dump. Devlink health
       saves a single dump. If a dump is not already stored by devlink
       for this reporter, devlink generates a new dump.
       The dump output is defined by the reporter.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_CLEAR``
     - Clears the last saved dump file for the specified reporter.
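
The single-dump behaviour described for ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET``
and ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_CLEAR`` can be pictured with the
following userspace sketch. ``struct dump_slot``, ``dump_get()`` and
``dump_clear()`` are hypothetical names, an integer stands in for the dump
contents, and keeping a freshly generated dump as the stored one is an
assumption of the sketch rather than something stated above.

.. code-block:: c

    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>

    /* Hypothetical one-entry dump store; not the kernel's structure. */
    struct dump_slot {
        bool stored;           /* a dump is currently saved */
        int  generation;       /* stands in for the saved dump contents */
        int  next_generation;  /* counts how many dumps were generated */
    };

    /* Return the stored dump, generating a new one only if none is stored. */
    static int dump_get(struct dump_slot *s)
    {
        assert(s != NULL);
        if (!s->stored) {
            s->generation = ++s->next_generation;  /* "generate" a dump */
            s->stored = true;                      /* assumed: keep it */
        }
        return s->generation;
    }

    /* Drop the stored dump so that the next get generates a fresh one. */
    static void dump_clear(struct dump_slot *s)
    {
        assert(s != NULL);
        s->stored = false;
    }

Two consecutive gets return the same dump. Only after a clear does a get
produce a new one.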

The following diagram provides a general overview of ``devlink-health``::

                                                   netlink
                                          +--------------------------+
                                          |                          |
                                          |            +             |
                                          |            |             |
                                          +--------------------------+
                                                       |request for ops
                                                       |(diagnose,
     mlx5_core                             devlink     |recover,
                                                       |dump)
    +--------+                            +--------------------------+
    |        |                            |    reporter|             |
    |        |                            |  +---------v----------+  |
    |        |   ops execution            |  |                    |  |
    |     <----------------------------------+                    |  |
    |        |                            |  |                    |  |
    |        |                            |  + ^------------------+  |
    |        |                            |    | request for ops     |
    |        |                            |    | (recover, dump)     |
    |        |                            |    |                     |
    |        |                            |  +-+------------------+  |
    |        |     health report          |  | health handler     |  |
    |        +------------------------------->                    |  |
    |        |                            |  +--------------------+  |
    |        |     health reporter create |                          |
    |        +---------------------------->                          |
    +--------+                            +--------------------------+