^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1) .. include:: <isonum.txt>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 3) ============================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 4) Reliability, Availability and Serviceability
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 5) ============================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 6)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 7) RAS concepts
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 8) ************
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 9)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 10) Reliability, Availability and Serviceability (RAS) is a concept used on
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 11) servers meant to measure their robustness.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 12)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 13) Reliability
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 14) is the probability that a system will produce correct outputs.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 15)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 16) * Generally measured as Mean Time Between Failures (MTBF)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 17) * Enhanced by features that help to avoid, detect and repair hardware faults
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 18)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 19) Availability
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 20) is the probability that a system is operational at a given time
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 21)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 22) * Generally measured as a percentage of downtime over a period of time
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 23) * Often uses mechanisms to detect and correct hardware faults at
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 24) runtime
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 25)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 26) Serviceability (or maintainability)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 27) is the simplicity and speed with which a system can be repaired or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 28) maintained
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 29)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 30) * Generally measured as Mean Time Between Repair (MTBR)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 31)
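As a rough illustration of how these metrics relate, the sketch below derives
steady-state availability from a failure interval and a repair time. It
assumes the common textbook relation availability = MTBF / (MTBF + MTTR),
where MTTR (mean time to repair) is an assumption introduced here and not a
term used above; the function names are hypothetical.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: share of time the system is operational.

    Assumes the textbook relation MTBF / (MTBF + MTTR).
    """
    if mtbf_hours < 0 or mttr_hours < 0 or mtbf_hours + mttr_hours == 0:
        raise ValueError("MTBF and MTTR must be non-negative and not both zero")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_hours_per_year(avail: float, hours_per_year: float = 8760.0) -> float:
    """Expected downtime over one (non-leap) year for a given availability."""
    return (1.0 - avail) * hours_per_year


if __name__ == "__main__":
    a = availability(mtbf_hours=10_000.0, mttr_hours=4.0)
    print(f"availability: {a:.5%}")
    print(f"expected downtime per year: {downtime_hours_per_year(a):.2f} h")
```

The second helper expresses the same figure the way availability is usually
quoted: as downtime over a period of time.
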
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 32) Improving RAS
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 33) -------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 34)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 35) To reduce system downtime, a system should be capable of detecting
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 36) hardware errors and, when possible, correcting them at runtime. It should
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 37) also provide mechanisms to detect hardware degradation, in order to warn
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 38) the system administrator to replace a component before
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 39) it causes data loss or system downtime.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 40)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 41) The most common monitoring measures include:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 42)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 43) * CPU – detect errors at instruction execution and at L1/L2/L3 caches;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 44) * Memory – add error correction logic (ECC) to detect and correct errors;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 45) * I/O – add CRC checksums for transferred data;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 46) * Storage – RAID, journaling file systems, checksums,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 47) Self-Monitoring, Analysis and Reporting Technology (SMART).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 48)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 49) By monitoring the number of detected errors, it is possible
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 50) to identify whether the probability of hardware errors is increasing and, in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 51) that case, perform preventive maintenance to replace a degraded component while
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 52) those errors are still correctable.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 53)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 54) Types of errors
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 55) ---------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 56)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 57) Most mechanisms used on modern systems use technologies like Hamming
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 58) codes that allow error correction when the number of errors in a bit packet
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 59) is below a threshold. If the number of errors is above that threshold, those
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 60) mechanisms can indicate with a high degree of confidence that an error
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 61) happened, but they cannot correct it.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 62)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 63) Also, sometimes an error occurs on a component that is not in use, for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 64) example, a part of the memory that is not currently allocated.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 65)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 66) That defines some categories of errors:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 67)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 68) * **Correctable Error (CE)** - the error detection mechanism detected and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 69) corrected the error. Such errors are usually not fatal, although some
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 70) Kernel mechanisms allow the system administrator to consider them as fatal.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 71)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 72) * **Uncorrected Error (UE)** - the number of errors exceeded the error
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 73) correction threshold, and the system was unable to auto-correct.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 74)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 75) * **Fatal Error** - when a UE happens on a critical component of the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 76) system (for example, a piece of the Kernel got corrupted by a UE), the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 77) only reliable way to avoid data corruption is to hang or reboot the machine.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 78)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 79) * **Non-fatal Error** - when a UE happens on an unused component,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 80) like a CPU in power down state or an unused memory bank, the system may
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 81) still run, possibly replacing the affected hardware with a hot spare,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 82) if available.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 83)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 84) When an error happens on a userspace process, it is also possible to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 85) kill that process and let userspace restart it.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 86)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 87) The mechanism for handling non-fatal errors is usually complex and may
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 88) require the help of some userspace application, in order to apply the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 89) policy desired by the system administrator.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 90)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 91) Identifying a bad hardware component
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 92) ------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 93)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 94) Just detecting a hardware flaw is usually not enough, as the system needs
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 95) to pinpoint the minimal replaceable unit (MRU) that should be exchanged
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 96) to make the hardware reliable again.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 97)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 98) So, it requires not only error logging facilities, but also mechanisms that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 99) will translate the error message to the silkscreen or component label for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 100) the MRU.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 101)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 102) Typically, this is very complex for memory, as modern CPUs interleave memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 103) from different memory modules in order to provide better performance. The
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 104) DMI BIOS usually has a list of memory module labels, which can be obtained
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 105) using the ``dmidecode`` tool. For example, on a desktop machine, it shows::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 106)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 107) Memory Device
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 108) Total Width: 64 bits
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 109) Data Width: 64 bits
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 110) Size: 16384 MB
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 111) Form Factor: SODIMM
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 112) Set: None
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 113) Locator: ChannelA-DIMM0
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 114) Bank Locator: BANK 0
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 115) Type: DDR4
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 116) Type Detail: Synchronous
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 117) Speed: 2133 MHz
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 118) Rank: 2
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 119) Configured Clock Speed: 2133 MHz
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 120)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 121) In the above example, a DDR4 SO-DIMM memory module is located at the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 122) system's memory slot labeled "BANK 0", as given by the *bank locator* field.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 123) Note that, on this system, the *total width* is equal to the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 124) *data width*. It means that this memory module doesn't have error
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 125) detection/correction mechanisms.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 126)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 127) Unfortunately, not all systems use the same field to specify the memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 128) bank. In this example, from an older server, ``dmidecode`` shows::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 129)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 130) Memory Device
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 131) Array Handle: 0x1000
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 132) Error Information Handle: Not Provided
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 133) Total Width: 72 bits
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 134) Data Width: 64 bits
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 135) Size: 8192 MB
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 136) Form Factor: DIMM
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 137) Set: 1
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 138) Locator: DIMM_A1
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 139) Bank Locator: Not Specified
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 140) Type: DDR3
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 141) Type Detail: Synchronous Registered (Buffered)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 142) Speed: 1600 MHz
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 143) Rank: 2
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 144) Configured Clock Speed: 1600 MHz
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 145)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 146) There, the DDR3 RDIMM memory module is located at the system's memory slot
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 147) labeled "DIMM_A1", as given by the *locator* field. Note that this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 148) memory module has 64 bits of *data width* and 72 bits of *total width*. So,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 149) it has 8 extra bits to be used by error detection and correction mechanisms.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 150) This kind of memory is called error-correcting code memory (ECC memory).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 151)
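The width comparison described above can be automated. The following is a
minimal sketch, assuming ``dmidecode`` text has already been captured into a
string and that records are separated by blank lines; the function names
``parse_memory_devices`` and ``ecc_bits`` are hypothetical helpers, not part
of any existing tool.

```python
import re


def parse_memory_devices(dmidecode_text):
    """Return one dict of "Field: value" pairs per "Memory Device" record."""
    devices = []
    current = None
    for raw in dmidecode_text.splitlines():
        line = raw.strip()
        if line == "Memory Device":
            current = {}
            devices.append(current)
        elif not line:
            current = None  # a blank line ends the current record
        elif current is not None and ":" in line:
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    return devices


def ecc_bits(device):
    """Total width minus data width in bits, or None if either is unknown."""
    def bits(field):
        match = re.match(r"(\d+) bits$", device.get(field, ""))
        return int(match.group(1)) if match else None

    total, data = bits("Total Width"), bits("Data Width")
    if total is None or data is None:
        return None
    return total - data


SAMPLE = """\
Memory Device
    Total Width: 64 bits
    Data Width: 64 bits
    Locator: ChannelA-DIMM0
    Bank Locator: BANK 0

Memory Device
    Total Width: 72 bits
    Data Width: 64 bits
    Locator: DIMM_A1
    Bank Locator: Not Specified
"""

if __name__ == "__main__":
    for dev in parse_memory_devices(SAMPLE):
        print(dev.get("Locator"), "extra bits:", ecc_bits(dev))
```

A result of 0 extra bits corresponds to the desktop module from the first
example, while 8 extra bits corresponds to the ECC module from the server.
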
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 152) To make things even worse, it is not uncommon for systems with different
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 153) labels on their system boards to use exactly the same BIOS, meaning that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 154) the labels provided by the BIOS won't match the real ones.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 155)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 156) ECC memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 157) ----------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 158)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 159) As mentioned in the previous section, ECC memory has extra bits to be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 160) used for error correction. In the above example, a memory module has
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 161) 64 bits of *data width*, and 72 bits of *total width*. The extra 8
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 162) bits which are used for the error detection and correction mechanisms
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 163) are referred to as the *syndrome*\ [#f1]_\ [#f2]_.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 164)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 165) So, when the CPU requests the memory controller to write a word with
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 166) *data width*, the memory controller calculates the *syndrome* in real time,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 167) using Hamming code, or some other error correction code, like SECDED+,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 168) producing a code with *total width* size. Such code is then written
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 169) to the memory modules.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 170)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 171) At read, the *total width* bits code is converted back, using the same
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 172) ECC code used on write, producing a word with *data width* and a *syndrome*.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 173) The word with *data width* is sent to the CPU, even when errors happen.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 174)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 175) The memory controller also looks at the *syndrome* in order to check if
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 176) there was an error, and if the ECC code was able to fix such error.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 177) If the error was corrected, a Corrected Error (CE) happened. If not, an
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 178) Uncorrected Error (UE) happened.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 179)
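The write/read cycle above can be illustrated with a toy code. The sketch
below is not what any real memory controller implements: it assumes an
extended Hamming (8,4) code, that is, 4 data bits protected by 3 Hamming
parity bits plus one overall parity bit, instead of the 64 data bits and
8 extra bits of the module discussed above. The names ``encode`` and
``decode`` are hypothetical.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming codeword.

    Layout (index = bit position): 0 = overall parity, 1/2/4 = Hamming
    parity bits, 3/5/6/7 = data bits 0..3.
    """
    code = [0] * 8
    code[3] = (nibble >> 0) & 1
    code[5] = (nibble >> 1) & 1
    code[6] = (nibble >> 2) & 1
    code[7] = (nibble >> 3) & 1
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    for pos in range(1, 8):
        code[0] ^= code[pos]                # overall parity over bits 1..7
    return code


def decode(codeword):
    """Return (data, status) with status "OK", "CE" or "UE".

    The data word is always returned, even when the error is uncorrectable.
    """
    code = list(codeword)                   # do not modify the caller's list
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                 # XOR of the positions of set bits
    overall = 0
    for bit in code:
        overall ^= bit
    if syndrome == 0 and overall == 0:
        status = "OK"
    elif overall == 1:
        # Odd number of flipped bits: treat as a single-bit error. A zero
        # syndrome means the overall parity bit (position 0) was hit.
        code[syndrome] ^= 1
        status = "CE"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        status = "UE"
    data = code[3] | code[5] << 1 | code[6] << 2 | code[7] << 3
    return data, status


if __name__ == "__main__":
    word = encode(0b1011)
    one_flip = list(word)
    one_flip[6] ^= 1
    two_flips = list(one_flip)
    two_flips[2] ^= 1
    for label, cw in (("clean", word), ("1 flip", one_flip), ("2 flips", two_flips)):
        print(label, decode(cw))
```

With this toy code the correction threshold is one bit per codeword: a single
flipped bit yields a CE, and two flipped bits yield a UE.
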
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 180) The information about the CE/UE errors is stored in special registers
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 181) at the memory controller and can be accessed by reading such registers,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 182) either by the BIOS, by some special CPUs or by the Linux EDAC driver. On
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 183) 64-bit x86 CPUs, such errors can also be retrieved via the Machine Check
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 184) Architecture (MCA)\ [#f3]_.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 185)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 186) .. [#f1] Please notice that several memory controllers allow operation in a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 187) mode called "Lock-Step", where it groups two memory modules together,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 188) doing 128-bit reads/writes. That gives 16 bits for error correction, which
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 189) significantly improves the error correction mechanism, at the expense
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 190) that, when an error happens, there's no way to know which memory module is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 191) to blame. So, it has to blame both memory modules.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 192)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 193) .. [#f2] Some memory controllers also allow using memory in mirror mode.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 194) In this mode, the same data is written to two memory modules. At read,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 195) the system checks both memory modules, in order to check if both provide
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 196) identical data. In this configuration, when an error happens, there's no
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 197) way to know which memory module is to blame. So, it has to blame both
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 198) memory modules (or 4 memory modules, if the system is also in Lock-step
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 199) mode).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 200)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 201) .. [#f3] For more details about the Machine Check Architecture (MCA),
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 202) please read Documentation/x86/x86_64/machinecheck.rst in the kernel tree.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 203)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 204) EDAC - Error Detection And Correction
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 205) *************************************
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 206)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 207) .. note::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 208)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 209) "bluesmoke" was the name for this device driver subsystem when it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 210) was "out-of-tree" and maintained at http://bluesmoke.sourceforge.net.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 211) That site is mostly archaic now and can be used only for historical
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 212) purposes.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 213)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 214) When the subsystem was pushed upstream for the first time, on
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 215) Kernel 2.6.16, it was renamed to ``EDAC``.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 216)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 217) Purpose
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 218) -------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 219)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 220) The ``edac`` kernel module's goal is to detect and report hardware errors
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 221) that occur within the computer system running under Linux.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 222)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 223) Memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 224) ------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 225)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 226) Memory Correctable Errors (CE) and Uncorrectable Errors (UE) are the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 227) primary errors being harvested. These types of errors are harvested by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 228) the ``edac_mc`` device.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 229)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 230) Detecting CE events, then harvesting those events and reporting them,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 231) **can**, but need not, be a predictor of future UE events. With
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 232) CE events only, the system can and will continue to operate as no data
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 233) has been damaged yet.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 234)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 235) However, preventive maintenance and proactive part replacement of memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 236) modules exhibiting CEs can reduce the likelihood of the dreaded UE events
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 237) and system panics.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 238)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 239) Other hardware elements
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 240) -----------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 241)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 242) A new feature for EDAC, the ``edac_device`` class of device, was added in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 243) the 2.6.23 version of the kernel.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 244)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 245) This new device type allows for non-memory type of ECC hardware detectors
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 246) to have their states harvested and presented to userspace via the sysfs
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 247) interface.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 248)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 249) Some architectures have ECC detectors for L1, L2 and L3 caches,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 250) along with DMA engines, fabric switches, main data path switches,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 251) interconnections, and various other hardware data paths. If the hardware
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 252) reports it, then an ``edac_device`` device probably can be constructed to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 253) harvest and present that to userspace.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 254)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 255)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 256) PCI bus scanning
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 257) ----------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 258)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 259) In addition, PCI devices are scanned for PCI Bus Parity and SERR Errors
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 260) in order to determine if errors are occurring during data transfers.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 261)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 262) The presence of PCI Parity errors must be examined with a grain of salt.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 263) There are several add-in adapters that do **not** follow the PCI specification
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 264) with regard to Parity generation and reporting. The specification says
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 265) the vendor should tie the parity status bits to 0 if they do not intend
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 266) to generate parity. Some vendors do not do this, and thus the parity bit
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 267) can "float" giving false positives.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 268)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 269) There is a PCI device attribute located in sysfs that is checked by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 270) the EDAC PCI scanning code. If that attribute is set, PCI parity/error
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 271) scanning is skipped for that device. The attribute is::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 272)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 273) broken_parity_status
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 274)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 275) and is located in ``/sys/devices/pci<XXX>/0000:XX:YY.Z`` directories for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 276) PCI devices.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 277)
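To see which devices would be skipped by the scanning code, the attribute can
be collected from userspace. This is a minimal sketch, assuming the attribute
file contains ``0`` when unset and a non-zero value when set; the function
name and the ``sysfs_root`` parameter are hypothetical, and the parameter
exists mainly so the code can be exercised against a test directory.

```python
import os


def devices_with_broken_parity(sysfs_root="/sys/devices"):
    """Return device directories whose broken_parity_status is set."""
    flagged = []
    for dirpath, _dirnames, filenames in os.walk(sysfs_root):
        if "broken_parity_status" not in filenames:
            continue
        path = os.path.join(dirpath, "broken_parity_status")
        try:
            with open(path) as attr:
                value = attr.read().strip()
        except OSError:
            continue  # unreadable attribute: skip this device
        # Assumption: "0" (or empty) means the flag is not set.
        if value not in ("", "0"):
            flagged.append(dirpath)
    return sorted(flagged)


if __name__ == "__main__":
    for device in devices_with_broken_parity():
        print(device)
```
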
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 278)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 279) Versioning
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 280) ----------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 281)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 282) EDAC is composed of a "core" module (``edac_core.ko``) and several Memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 283) Controller (MC) driver modules. On a given system, the CORE is loaded
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 284) and one MC driver will be loaded. Both the CORE and the MC driver (or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 285) ``edac_device`` driver) have individual versions that reflect the current
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 286) release level of their respective modules.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 287)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 288) Thus, to "report" on what version a system is running, one must report
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 289) both the CORE's and the MC driver's versions.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 290)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 291)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 292) Loading
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 293) -------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 294)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 295) If ``edac`` was statically linked with the kernel then no loading
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 296) is necessary. If ``edac`` was built as modules then simply modprobe
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 297) the ``edac`` pieces that you need. You should be able to modprobe
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 298) hardware-specific modules and have the dependencies load the necessary
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 299) core modules.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 300)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 301) Example::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 302)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 303) $ modprobe amd76x_edac
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 304)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 305) loads both the ``amd76x_edac.ko`` memory controller module and the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 306) ``edac_core.ko`` core module.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 307)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 308)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 309) Sysfs interface
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 310) ---------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 311)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 312) EDAC presents a ``sysfs`` interface for control and reporting purposes. It
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 313) lives in the ``/sys/devices/system/edac`` directory.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 314)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 315) Within this directory there currently reside 2 components:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 316)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 317) ======= ==============================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 318) mc      memory controller(s) system
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 319) pci     PCI control and status system
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 320) ======= ==============================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 321)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 322)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 323)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 324) Memory Controller (mc) Model
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 325) ----------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 326)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 327) Each ``mc`` device controls a set of memory modules [#f4]_. These modules
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 328) are laid out in a Chip-Select Row (``csrowX``) and Channel table (``chX``).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 329) There can be multiple csrows and multiple channels.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 330)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 331) .. [#f4] Nowadays, the term DIMM (Dual In-line Memory Module) is widely
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 332) used to refer to a memory module, although there are other memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 333) packaging alternatives, like SO-DIMM, SIMM, etc. The UEFI
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 334) specification (Version 2.7) defines a memory module in the Common
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 335) Platform Error Record (CPER) section to be an SMBIOS Memory Device
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 336) (Type 17). Throughout this document, and inside the EDAC subsystem, the term
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 337) "dimm" is used for all memory modules, even when they use a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 338) different kind of packaging.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 339)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 340) Memory controllers allow for several csrows, with 8 csrows being a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 341) typical value. Yet, the actual number of csrows depends on the layout of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 342) a given motherboard, memory controller and memory module characteristics.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 343)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 344) Dual channels allow for dual data length (e.g. 128 bits, on 64-bit systems)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 345) data transfers to/from the CPU from/to memory. Some newer chipsets allow
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 346) for more than 2 channels, like Fully Buffered DIMMs (FB-DIMMs) memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 347) controllers. The following example will assume 2 channels:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 348)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 349) +------------+-----------------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 350) | CS Rows    | Channels              |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 351) +------------+-----------+-----------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 352) |            |  ``ch0``  |  ``ch1``  |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 353) +============+===========+===========+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 354) |            |**DIMM_A0**|**DIMM_B0**|
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 355) +------------+-----------+-----------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 356) | ``csrow0`` |   rank0   |   rank0   |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 357) +------------+-----------+-----------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 358) | ``csrow1`` |   rank1   |   rank1   |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 359) +------------+-----------+-----------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 360) |            |**DIMM_A1**|**DIMM_B1**|
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 361) +------------+-----------+-----------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 362) | ``csrow2`` |   rank0   |   rank0   |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 363) +------------+-----------+-----------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 364) | ``csrow3`` |   rank1   |   rank1   |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 365) +------------+-----------+-----------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 366)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 367) In the above example, there are 4 physical slots on the motherboard
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 368) for memory DIMMs:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 369)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 370) +---------+---------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 371) | DIMM_A0 | DIMM_B0 |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 372) +---------+---------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 373) | DIMM_A1 | DIMM_B1 |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 374) +---------+---------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 375)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 376) Labels for these slots are usually silk-screened on the motherboard.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 377) Slots labeled ``A`` are channel 0 in this example. Slots labeled ``B`` are
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 378) channel 1. Notice that there are two csrows possible on a physical DIMM.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 379) These csrows are allocated their csrow assignment based on the slot into
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 380) which the memory DIMM is placed. Thus, when 1 DIMM is placed in each
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 381) Channel, the csrows cross both DIMMs.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 382)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 383) Memory DIMMs come single or dual "ranked". A rank is a populated csrow.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 384) In the example above 2 dual ranked DIMMs are similarly placed. Thus,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 385) both csrow0 and csrow1 are populated. On the other hand, when 2 single
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 386) ranked DIMMs are placed in slots DIMM_A0 and DIMM_B0, then they will
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 387) have just one csrow (csrow0) and csrow1 will be empty. The pattern
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 388) repeats itself for csrow2 and csrow3. Also note that some memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 389) controllers don't have any logic to identify the memory module, see
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 390) ``rankX`` directories below.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 391)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 392) The representation of the above is reflected in the directory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 393) tree in EDAC's sysfs interface. Starting in directory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 394) ``/sys/devices/system/edac/mc``, each memory controller will be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 395) represented by its own ``mcX`` directory, where ``X`` is the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 396) index of the MC::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 397)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 398) ..../edac/mc/
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 399) |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 400) |->mc0
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 401) |->mc1
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 402) |->mc2
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 403) ....
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 404)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 405) Under each ``mcX`` directory, each csrow is represented by its own
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 406) ``csrowX`` directory, where ``X`` is the csrow index::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 407)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 408) .../mc/mc0/
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 409) |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 410) |->csrow0
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 411) |->csrow2
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 412) |->csrow3
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 413) ....
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 414)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 415) Notice that there is no csrow1, which indicates that csrow0 is composed
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 416) of single-ranked DIMMs. This should also apply in both channels, in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 417) order for dual-channel mode to be operational. Since both csrow2 and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 418) csrow3 are populated, this indicates a dual-ranked set of DIMMs for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 419) channels 0 and 1.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 420)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 421) Within each of the ``mcX`` and ``csrowX`` directories are several EDAC
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 422) control and attribute files.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 423)
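The directory layout described above can be inspected with a short script.
The following is a minimal Python sketch, not part of EDAC itself: the
function name ``list_edac_layout`` is hypothetical, and the ``root``
parameter exists only so the code can be pointed at a copy of the tree.
It assumes nothing beyond the ``mcX`` and ``csrowX``/``dimmX``/``rankX``
directory names used in this document.

```python
import re
from pathlib import Path

# Location taken from the text above; pass a different root to inspect a copy.
EDAC_MC_ROOT = Path("/sys/devices/system/edac/mc")


def _natural_key(name):
    """Sort 'csrow10' after 'csrow2' by splitting the name from its index."""
    m = re.fullmatch(r"([a-z]+)(\d+)", name)
    return (m.group(1), int(m.group(2)))


def list_edac_layout(root=EDAC_MC_ROOT):
    """Map each mcX directory to its csrowX, dimmX and rankX subdirectories."""
    root = Path(root)
    layout = {}
    if not root.is_dir():
        return layout  # no EDAC memory controller directory present
    for mc in root.iterdir():
        if not (mc.is_dir() and re.fullmatch(r"mc\d+", mc.name)):
            continue
        children = [
            p.name
            for p in mc.iterdir()
            if p.is_dir() and re.fullmatch(r"(csrow|dimm|rank)\d+", p.name)
        ]
        layout[mc.name] = sorted(children, key=_natural_key)
    return dict(sorted(layout.items(), key=lambda kv: _natural_key(kv[0])))


if __name__ == "__main__":
    for mc, children in list_edac_layout().items():
        print(mc, " ".join(children) or "(no csrow/dimm/rank directories)")
```

For the ``mc0`` example above, the function would return ``csrow0``,
``csrow2`` and ``csrow3`` for ``mc0``; the missing ``csrow1`` is what
indicates single-ranked DIMMs in the first pair of slots.
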
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 424) ``mcX`` directories
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 425) -------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 426)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 427) In ``mcX`` directories are EDAC control and attribute files for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 428) this ``X`` instance of the memory controllers.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 429)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 430) For a description of the sysfs API, please see:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 431)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 432) Documentation/ABI/testing/sysfs-devices-edac
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 433)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 434)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 435) ``dimmX`` or ``rankX`` directories
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 436) ----------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 437)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 438) The recommended way to use the EDAC subsystem is to look at the information
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 439) provided by the ``dimmX`` or ``rankX`` directories [#f5]_.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 440)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 441) A typical EDAC system has the following structure under
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 442) ``/sys/devices/system/edac/``\ [#f6]_::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 443)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 444) /sys/devices/system/edac/
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 445) ├── mc
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 446) │ ├── mc0
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 447) │ │ ├── ce_count
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 448) │ │ ├── ce_noinfo_count
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 449) │ │ ├── dimm0
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 450) │ │ │ ├── dimm_ce_count
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 451) │ │ │ ├── dimm_dev_type
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 452) │ │ │ ├── dimm_edac_mode
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 453) │ │ │ ├── dimm_label
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 454) │ │ │ ├── dimm_location
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 455) │ │ │ ├── dimm_mem_type
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 456) │ │ │ ├── dimm_ue_count
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 457) │ │ │ ├── size
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 458) │ │ │ └── uevent
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 459) │ │ ├── max_location
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 460) │ │ ├── mc_name
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 461) │ │ ├── reset_counters
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 462) │ │ ├── seconds_since_reset
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 463) │ │ ├── size_mb
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 464) │ │ ├── ue_count
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 465) │ │ ├── ue_noinfo_count
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 466) │ │ └── uevent
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 467) │ ├── mc1
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 468) │ │ ├── ce_count
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 469) │ │ ├── ce_noinfo_count
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 470) │ │ ├── dimm0
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 471) │ │ │ ├── dimm_ce_count
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 472) │ │ │ ├── dimm_dev_type
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 473) │ │ │ ├── dimm_edac_mode
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 474) │ │ │ ├── dimm_label
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 475) │ │ │ ├── dimm_location
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 476) │ │ │ ├── dimm_mem_type
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 477) │ │ │ ├── dimm_ue_count
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 478) │ │ │ ├── size
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 479) │ │ │ └── uevent
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 480) │ │ ├── max_location
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 481) │ │ ├── mc_name
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 482) │ │ ├── reset_counters
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 483) │ │ ├── seconds_since_reset
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 484) │ │ ├── size_mb
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 485) │ │ ├── ue_count
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 486) │ │ ├── ue_noinfo_count
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 487) │ │ └── uevent
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 488) │ └── uevent
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 489) └── uevent
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 490)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 491) In the ``dimmX`` directories are EDAC control and attribute files for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 492) this ``X`` memory module:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 493)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 494) - ``size`` - Total memory managed by this memory module attribute file
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 495)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 496) This attribute file displays, in count of megabytes, the memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 497) that this memory module contains.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 498)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 499) - ``dimm_ue_count`` - Uncorrectable Errors count attribute file
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 500)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 501) This attribute file displays the total count of uncorrectable
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 502) errors that have occurred on this DIMM. If panic_on_ue is set
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 503) this counter will not have a chance to increment, since EDAC
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 504) will panic the system.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 505)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 506) - ``dimm_ce_count`` - Correctable Errors count attribute file
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 507)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 508) This attribute file displays the total count of correctable
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 509) errors that have occurred on this DIMM. This count is very
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 510) important to examine. CEs provide early indications that a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 511) DIMM is beginning to fail. This count field should be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 512) monitored for non-zero values, and any such values reported
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 513) to the system administrator.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 514)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 515) - ``dimm_dev_type`` - Device type attribute file
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 516)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 517) This attribute file will display what type of DRAM device is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 518) being utilized on this DIMM.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 519) Examples:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 520)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 521) - x1
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 522) - x2
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 523) - x4
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 524) - x8
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 525)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 526) - ``dimm_edac_mode`` - EDAC Mode of operation attribute file
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 527)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 528) This attribute file will display what type of Error detection
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 529) and correction is being utilized.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 530)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 531) - ``dimm_label`` - memory module label control file
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 532)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 533) This control file allows this DIMM to have a label assigned
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 534) to it. With this label in the module, when errors occur
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 535) the output can provide the DIMM label in the system log.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 536) This becomes vital for panic events to isolate the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 537) cause of the UE event.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 538)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 539) DIMM Labels must be assigned after booting, with information
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 540) that correctly identifies the physical slot with its
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 541) silk screen label. This information is currently very
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 542) motherboard specific and determination of this information
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 543) must occur in userland at this time.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 544)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 545) - ``dimm_location`` - location of the memory module
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 546)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 547) The location can have up to 3 levels, and describe how the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 548) memory controller identifies the location of a memory module.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 549) Depending on the type of memory and memory controller, it
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 550) can be:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 551)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 552) - *csrow* and *channel* - used when the memory controller
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 553) doesn't identify a single DIMM, e.g. in a ``rankX`` directory;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 554) - *branch*, *channel*, *slot* - typically used on FB-DIMM memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 555) controllers;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 556) - *channel*, *slot* - used on Nehalem and newer Intel drivers.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 557)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 558) - ``dimm_mem_type`` - Memory Type attribute file
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 559)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 560) This attribute file will display what type of memory is currently
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 561) on this DIMM. Normally, either buffered or unbuffered memory.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 562) Examples:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 563)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 564) - Registered-DDR
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 565) - Unbuffered-DDR
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 566)
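The monitoring advice for ``dimm_ce_count`` and ``dimm_ue_count`` can be
sketched in Python as follows. This is only an illustration: the helper
names (``read_attr``, ``dimm_error_counts``, ``dimms_with_errors``) are
hypothetical, the ``root`` parameter is an assumption added so the code can
run against a copy of the tree, and only ``dimmX`` directories with the
attribute files listed above are handled.

```python
from pathlib import Path

# Location taken from the directory tree shown above.
EDAC_MC_ROOT = Path("/sys/devices/system/edac/mc")


def read_attr(path):
    """Return the stripped contents of an attribute file, or None if unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def _to_int(value):
    """Convert a counter string to int; None when missing or not a number."""
    return int(value) if value is not None and value.isdigit() else None


def dimm_error_counts(root=EDAC_MC_ROOT):
    """Collect label, CE count and UE count for every mcX/dimmX directory."""
    root = Path(root)
    rows = []
    if not root.is_dir():
        return rows
    for dimm in sorted(root.glob("mc[0-9]*/dimm[0-9]*")):
        rows.append({
            "mc": dimm.parent.name,
            "dimm": dimm.name,
            "label": read_attr(dimm / "dimm_label") or "",
            "ce": _to_int(read_attr(dimm / "dimm_ce_count")),
            "ue": _to_int(read_attr(dimm / "dimm_ue_count")),
        })
    return rows


def dimms_with_errors(rows):
    """Keep only the modules whose CE or UE counter is non-zero."""
    return [r for r in rows if (r["ce"] or 0) > 0 or (r["ue"] or 0) > 0]


if __name__ == "__main__":
    for r in dimms_with_errors(dimm_error_counts()):
        print(f'{r["mc"]}/{r["dimm"]} "{r["label"]}": CE={r["ce"]} UE={r["ue"]}')
```

Running such a check periodically and forwarding any output to the system
administrator is one way to act on non-zero CE counts before a module
fails outright.
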
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 567) .. [#f5] On some systems, the memory controller doesn't have any logic
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 568) to identify the memory module. On such systems, the directory is called ``rankX`` and works in a similar way to the ``csrowX`` directories.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 569) On modern Intel memory controllers, the memory controller identifies the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 570) memory modules directly. On such systems, the directory is called ``dimmX``.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 571)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 572) .. [#f6] There are also some ``power`` directories and ``subsystem``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 573) symlinks inside the sysfs mapping that are automatically created by
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 574) the sysfs subsystem. Currently, they serve no purpose.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 575)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 576) ``csrowX`` directories
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 577) ----------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 578)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 579) When CONFIG_EDAC_LEGACY_SYSFS is enabled, sysfs will contain the ``csrowX``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 580) directories. As this API doesn't work properly for Rambus, FB-DIMMs and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 581) modern Intel Memory Controllers, this is being deprecated in favor of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 582) ``dimmX`` directories.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 583)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 584) In the ``csrowX`` directories are EDAC control and attribute files for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 585) this ``X`` instance of csrow:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 586)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 587)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 588) - ``ue_count`` - Total Uncorrectable Errors count attribute file
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 589)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 590) This attribute file displays the total count of uncorrectable
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 591) errors that have occurred on this csrow. If panic_on_ue is set
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 592) this counter will not have a chance to increment, since EDAC
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 593) will panic the system.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 594)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 595)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 596) - ``ce_count`` - Total Correctable Errors count attribute file
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 597)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 598) This attribute file displays the total count of correctable
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 599) errors that have occurred on this csrow. This count is very
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 600) important to examine. CEs provide early indications that a
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 601) DIMM is beginning to fail. This count field should be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 602) monitored for non-zero values, and any such values reported
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 603) to the system administrator.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 604)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 605)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 606) - ``size_mb`` - Total memory managed by this csrow attribute file
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 607)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 608) This attribute file displays, in count of megabytes, the memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 609) that this csrow contains.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 610)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 611)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 612) - ``mem_type`` - Memory Type attribute file
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 613)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 614) This attribute file will display what type of memory is currently
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 615) on this csrow. Normally, either buffered or unbuffered memory.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 616) Examples:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 617)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 618) - Registered-DDR
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 619) - Unbuffered-DDR
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 620)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 621)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 622) - ``edac_mode`` - EDAC Mode of operation attribute file
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 623)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 624) This attribute file will display what type of Error detection
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 625) and correction is being utilized.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 626)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 627)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 628) - ``dev_type`` - Device type attribute file
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 629)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 630) This attribute file will display what type of DRAM device is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 631) being utilized on this DIMM.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 632) Examples:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 633)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 634) - x1
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 635) - x2
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 636) - x4
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 637) - x8
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 638)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 639)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 640) - ``ch0_ce_count`` - Channel 0 CE Count attribute file
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 641)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 642) This attribute file will display the count of CEs on this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 643) DIMM located in channel 0.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 644)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 645)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 646) - ``ch0_ue_count`` - Channel 0 UE Count attribute file
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 647)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 648) This attribute file will display the count of UEs on this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 649) DIMM located in channel 0.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 650)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 651)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 652) - ``ch0_dimm_label`` - Channel 0 DIMM Label control file
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 653)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 654)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 655) This control file allows this DIMM to have a label assigned
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 656) to it. With this label in the module, when errors occur
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 657) the output can provide the DIMM label in the system log.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 658) This becomes vital for panic events to isolate the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 659) cause of the UE event.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 660)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 661) DIMM Labels must be assigned after booting, with information
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 662) that correctly identifies the physical slot with its
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 663) silk screen label. This information is currently very
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 664) motherboard specific and determination of this information
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 665) must occur in userland at this time.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 666)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 667)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 668) - ``ch1_ce_count`` - Channel 1 CE Count attribute file
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 669)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 670)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 671) This attribute file will display the count of CEs on this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 672) DIMM located in channel 1.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 673)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 674)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 675) - ``ch1_ue_count`` - Channel 1 UE Count attribute file
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 676)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 677)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 678) This attribute file will display the count of UEs on this
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 679) DIMM located in channel 1.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 680)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 681)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 682) - ``ch1_dimm_label`` - Channel 1 DIMM Label control file
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 683)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 684) This control file allows this DIMM to have a label assigned
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 685) to it. With this label in the module, when errors occur
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 686) the output can provide the DIMM label in the system log.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 687) This becomes vital for panic events to isolate the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 688) cause of the UE event.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 689)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 690) DIMM Labels must be assigned after booting, with information
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 691) that correctly identifies the physical slot with its
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 692) silk screen label. This information is currently very
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 693) motherboard specific and determination of this information
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 694) must occur in userland at this time.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 695)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 696)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 697) System Logging
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 698) --------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 699)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 700) If logging for UEs and CEs is enabled, then system logs will contain
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 701) information indicating that errors have been detected::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 702)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 703) EDAC MC0: CE page 0x283, offset 0xce0, grain 8, syndrome 0x6ec3, row 0, channel 1 "DIMM_B1": amd76x_edac
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 704) EDAC MC0: CE page 0x1e5, offset 0xfb0, grain 8, syndrome 0xb741, row 0, channel 1 "DIMM_B1": amd76x_edac
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 705)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 706)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 707) The structure of the message is:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 708)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 709) +---------------------------------------+-------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 710) | Content | Example |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 711) +=======================================+=============+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 712) | The memory controller | MC0 |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 713) +---------------------------------------+-------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 714) | Error type | CE |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 715) +---------------------------------------+-------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 716) | Memory page | 0x283 |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 717) +---------------------------------------+-------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 718) | Offset in the page | 0xce0 |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 719) +---------------------------------------+-------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 720) | The byte granularity | grain 8 |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 721) | or resolution of the error | |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 722) +---------------------------------------+-------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 723) | The error syndrome | 0x6ec3 |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 724) +---------------------------------------+-------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 725) | Memory row | row 0 |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 726) +---------------------------------------+-------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 727) | Memory channel | channel 1 |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 728) +---------------------------------------+-------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 729) | DIMM label, if set prior | DIMM_B1 |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 730) +---------------------------------------+-------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 731) | And then an optional, driver-specific | |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 732) | message that may have additional | |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 733) | information. | |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 734) +---------------------------------------+-------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 735)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 736) UEs and CEs reported with no info contain only the memory controller,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 737) the error type, a notice of "no info" and then an optional,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 738) driver-specific error message.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 739)
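The message structure above can be picked apart with a regular expression.
The sketch below is written against the two sample lines shown in this
section only; it is an assumption that other drivers or kernel versions
format their messages the same way, and the names ``EDAC_MC_LINE`` and
``parse_edac_line`` are hypothetical. Messages that do not follow this
layout, including the "no info" variant, are simply not matched.

```python
import re

# Pattern derived from the sample log lines in this section (an assumption,
# not a guaranteed kernel format).
EDAC_MC_LINE = re.compile(
    r'EDAC (?P<mc>MC\d+): (?P<type>CE|UE) '
    r'page (?P<page>0x[0-9a-fA-F]+), offset (?P<offset>0x[0-9a-fA-F]+), '
    r'grain (?P<grain>\d+), syndrome (?P<syndrome>0x[0-9a-fA-F]+), '
    r'row (?P<row>\d+), channel (?P<channel>\d+) '
    r'"(?P<label>[^"]*)": (?P<driver_msg>.*)'
)


def parse_edac_line(line):
    """Return the message fields as a dict, or None if the line does not match."""
    m = EDAC_MC_LINE.search(line)
    if m is None:
        return None
    fields = m.groupdict()
    for key in ("page", "offset", "syndrome"):
        fields[key] = int(fields[key], 16)  # hexadecimal fields
    for key in ("grain", "row", "channel"):
        fields[key] = int(fields[key])      # decimal fields
    return fields


if __name__ == "__main__":
    sample = ('EDAC MC0: CE page 0x283, offset 0xce0, grain 8, '
              'syndrome 0x6ec3, row 0, channel 1 "DIMM_B1": amd76x_edac')
    print(parse_edac_line(sample))
```
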
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 740)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 741) PCI Bus Parity Detection
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 742) ------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 743)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 744) On Header Type 00 devices, the primary status register is checked for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 745) any parity error, regardless of whether parity is enabled on the device
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 746) or not. (The spec indicates parity is generated in some cases.) On Header
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 747) Type 01 bridges, the secondary status register is also checked to see
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 748) if a parity error occurred on the bus on the other side of the bridge.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 749)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 750)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 751) Sysfs configuration
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 752) -------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 753)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 754) Under ``/sys/devices/system/edac/pci`` are control and attribute files as
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 755) follows:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 756)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 757)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 758) - ``check_pci_parity`` - Enable/Disable PCI Parity checking control file
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 759)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 760) This control file enables or disables the PCI Bus Parity scanning
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 761) operation. Writing a 1 to this file enables the scanning. Writing
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 762) a 0 to this file disables the scanning.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 763)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 764) Enable::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 765)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 766) echo "1" >/sys/devices/system/edac/pci/check_pci_parity
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 767)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 768) Disable::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 769)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 770) echo "0" >/sys/devices/system/edac/pci/check_pci_parity
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 771)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 772)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 773) - ``pci_parity_count`` - Parity Count
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 774)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 775) This attribute file will display the number of parity errors that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 776) have been detected.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 777)
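The two PCI files above can also be driven from a script. The following
Python sketch is illustrative only: the function names are hypothetical and
the ``root`` parameter is an assumption added so the code can be exercised
against a scratch directory. On a real system, writing to
``check_pci_parity`` normally requires root privileges.

```python
from pathlib import Path

# Location taken from the text above.
EDAC_PCI_ROOT = Path("/sys/devices/system/edac/pci")


def pci_parity_count(root=EDAC_PCI_ROOT):
    """Return the number of PCI parity errors detected so far."""
    return int((Path(root) / "pci_parity_count").read_text().strip())


def set_pci_parity_check(enabled, root=EDAC_PCI_ROOT):
    """Write 1 to enable or 0 to disable PCI bus parity scanning."""
    (Path(root) / "check_pci_parity").write_text("1\n" if enabled else "0\n")
```

Calling ``set_pci_parity_check(True)`` corresponds to the ``echo "1"``
command shown above, and ``pci_parity_count()`` returns the value of the
counter file as an integer.
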
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 778)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 779) Module parameters
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 780) -----------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 781)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 782) - ``edac_mc_panic_on_ue`` - Panic on UE control file
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 783)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 784) When set, an uncorrectable error will cause a machine panic. This is usually
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 785) desirable. It is a bad idea to continue when an uncorrectable error
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 786) occurs - it is indeterminate what was uncorrected and the operating
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 787) system context might be so mangled that continuing will lead to further
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 788) corruption. If the kernel has MCE configured, then EDAC will never
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 789) notice the UE.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 790)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 791) LOAD TIME::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 792)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 793) module/kernel parameter: edac_mc_panic_on_ue=[0|1]
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 794)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 795) RUN TIME::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 796)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 797) echo "1" > /sys/module/edac_core/parameters/edac_mc_panic_on_ue
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 798)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 799)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 800) - ``edac_mc_log_ue`` - Log UE control file
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 801)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 802)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 803) Generate kernel messages describing uncorrectable errors. These errors
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 804) are reported through the system message log. UE statistics
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 805) will be accumulated even when UE logging is disabled.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 806)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 807) LOAD TIME::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 808)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 809) module/kernel parameter: edac_mc_log_ue=[0|1]
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 810)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 811) RUN TIME::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 812)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 813) echo "1" > /sys/module/edac_core/parameters/edac_mc_log_ue
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 814)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 815)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 816) - ``edac_mc_log_ce`` - Log CE control file
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 817)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 818)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 819) Generate kernel messages describing correctable errors. These
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 820) errors are reported through the system message log.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 821) CE statistics will be accumulated even when CE logging is disabled.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 822)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 823) LOAD TIME::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 824)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 825) module/kernel parameter: edac_mc_log_ce=[0|1]
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 826)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 827) RUN TIME::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 828)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 829) echo "1" > /sys/module/edac_core/parameters/edac_mc_log_ce
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 830)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 831)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 832) - ``edac_mc_poll_msec`` - Polling period control file
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 833)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 834)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 835) The time period, in milliseconds, for polling for error information.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 836) Too small a value wastes resources. Too large a value might delay
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 837) necessary handling of errors and might lose valuable information for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 838) locating the error. 1000 milliseconds (once each second) is the current
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 839) default. Systems that require all the bandwidth they can get may
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 840) increase this.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 841)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 842) LOAD TIME::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 843)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 844) module/kernel parameter: edac_mc_poll_msec=<milliseconds>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 845)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 846) RUN TIME::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 847)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 848) echo "1000" > /sys/module/edac_core/parameters/edac_mc_poll_msec
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 849)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 850)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 851) - ``panic_on_pci_parity`` - Panic on PCI PARITY Error
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 852)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 853)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 854) This control file enables or disables panicking when a parity
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 855) error has been detected.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 856)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 857)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 858) module/kernel parameter::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 859)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 860) edac_panic_on_pci_pe=[0|1]
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 861)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 862) Enable::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 863)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 864) echo "1" > /sys/module/edac_core/parameters/edac_panic_on_pci_pe
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 865)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 866) Disable::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 867)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 868) echo "0" > /sys/module/edac_core/parameters/edac_panic_on_pci_pe
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 869)
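The current value of each parameter can be checked by reading the same
files that the ``echo`` commands above write to. The following is a minimal
sketch; the function name ``show_edac_params`` and its optional directory
argument are illustrative assumptions and are not part of EDAC:

```shell
#!/bin/sh
# Hypothetical helper: print "name=value" for each parameter file in a
# directory. The directory defaults to the edac_core parameter path
# used in the examples above.
show_edac_params() {
    dir="${1:-/sys/module/edac_core/parameters}"
    for f in "$dir"/*; do
        # Skip anything that is not a readable regular file
        # (this also covers a glob that matched nothing).
        [ -f "$f" ] && [ -r "$f" ] || continue
        printf '%s=%s\n' "$(basename "$f")" "$(cat "$f")"
    done
}
```

Called without an argument, the function reads the live sysfs directory and
prints nothing if that directory does not exist.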
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 870)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 871)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 872) EDAC device type
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 873) ----------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 874)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 875) In the header file, edac_pci.h, there is a series of edac_device structures
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 876) and APIs for the EDAC_DEVICE.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 877)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 878) User space access to an edac_device is through the sysfs interface.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 879)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 880) At the location ``/sys/devices/system/edac`` (sysfs) new edac_device devices
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 881) will appear.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 882)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 883) There is a three-level tree beneath the above ``edac`` directory. For example,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 884) the ``test_device_edac`` device (found at the http://bluesmoke.sourceforge.net
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 885) website) installs itself as::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 886)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 887) /sys/devices/system/edac/test-instance
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 888)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 889) In this directory are various controls, a symlink and one or more ``instance``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 890) directories.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 891)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 892) The standard default controls are:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 893)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 894) ============== =======================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 895) log_ce boolean to log CE events
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 896) log_ue boolean to log UE events
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 897) panic_on_ue boolean to ``panic`` the system if a UE is encountered
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 898) (default off, can be set true via startup script)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 899) poll_msec time period between POLL cycles for events
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 900) ============== =======================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 901)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 902) The test_device_edac device adds at least one custom control of its own:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 903)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 904) ============== ==================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 905) test_bits which in the current test driver does nothing but
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 906) show how it is installed. A ported driver can
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 907) add one or more such controls and/or attributes
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 908) for specific uses.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 909) One out-of-tree driver uses controls here to allow
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 910) for ERROR INJECTION operations to hardware
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 911) injection registers
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 912) ============== ==================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 913)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 914) The symlink points to the 'struct dev' that is registered for this edac_device.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 915)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 916) Instances
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 917) ---------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 918)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 919) One or more instance directories are present. For the ``test_device_edac``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 920) case:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 921)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 922) +----------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 923) | test-instance0 |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 924) +----------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 925)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 926)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 927) In this directory there are two default counter attributes, which are totals of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 928) the counters in deeper subdirectories.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 929)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 930) ============== ====================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 931) ce_count total of CE events of subdirectories
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 932) ue_count total of UE events of subdirectories
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 933) ============== ====================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 934)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 935) Blocks
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 936) ------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 937)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 938) At the lowest directory level is the ``block`` directory. There can be 0, 1
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 939) or more blocks specified in each instance:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 940)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 941) +-------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 942) | test-block0 |
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 943) +-------------+
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 944)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 945) In this directory the default attributes are:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 946)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 947) ============== ================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 948) ce_count which is counter of CE events for this ``block``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 949) of hardware being monitored
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 950) ue_count which is counter of UE events for this ``block``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 951) of hardware being monitored
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 952) ============== ================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 953)
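Since the instance-level counters are described as totals of the per-block
counters, the two levels can be compared from user space. The following is a
minimal sketch, assuming block directories sit directly beneath the instance
directory as described above; the function name ``sum_block_counts`` is
hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: sum one counter (ce_count or ue_count) over every
# block directory found directly beneath a single instance directory.
sum_block_counts() {
    instance_dir="$1"
    counter="$2"
    total=0
    for f in "$instance_dir"/*/"$counter"; do
        # Skip unreadable files and a glob that matched nothing.
        [ -r "$f" ] || continue
        total=$((total + $(cat "$f")))
    done
    echo "$total"
}
```

For the ``test_device_edac`` layout, the instance directory argument would be
``/sys/devices/system/edac/test-instance/test-instance0``, and the result can
be compared with the ``ce_count`` file in that same directory.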
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 954)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 955) The ``test_device_edac`` device adds 4 attributes and 1 control:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 956)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 957) ================== ====================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 958) test-block-bits-0 for every POLL cycle this counter
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 959) is incremented
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 960) test-block-bits-1 every 10 cycles, this counter is bumped once,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 961) and test-block-bits-0 is set to 0
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 962) test-block-bits-2 every 100 cycles, this counter is bumped once,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 963) and test-block-bits-1 is set to 0
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 964) test-block-bits-3 every 1000 cycles, this counter is bumped once,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 965) and test-block-bits-2 is set to 0
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 966) ================== ====================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 967)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 968)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 969) ================== ====================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 970) reset-counters writing anything to this control will
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 971) reset all the above counters.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 972) ================== ====================================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 973)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 974)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 975) Use of the ``test_device_edac`` driver should enable others to create their own
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 976) unique drivers for their hardware systems.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 977)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 978) The ``test_device_edac`` sample driver is located at the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 979) http://bluesmoke.sourceforge.net project site for EDAC.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 980)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 981)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 982) Usage of EDAC APIs on Nehalem and newer Intel CPUs
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 983) --------------------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 984)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 985) On older Intel architectures, the memory controller was part of the North
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 986) Bridge chipset. Nehalem, Sandy Bridge, Ivy Bridge, Haswell, Skylake and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 987) newer Intel architectures integrated an enhanced version of the memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 988) controller (MC) inside the CPUs.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 989)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 990) This chapter covers the differences of the enhanced memory controllers
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 991) found on newer Intel CPUs, as handled by drivers such as ``i7core_edac``,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 992) ``sb_edac`` and ``sbx_edac``.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 993)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 994) .. note::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 995)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 996) The Xeon E7 processor families use a separate chip for the memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 997) controller, called Intel Scalable Memory Buffer. This section doesn't
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 998) apply to such families.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 999)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1000) 1) There is one Memory Controller per QuickPath Interconnect
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1001) (QPI). In the driver, the term "socket" means one QPI. This is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1002) associated with a physical CPU socket.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1003)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1004) Each MC has 3 physical read channels, 3 physical write channels and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1005) 3 logical channels. The driver currently sees it as just 3 channels.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1006) Each channel can have up to 3 DIMMs.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1007)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1008) The smallest known unit is the DIMM. There is no information about csrows.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1009) As the smallest unit in the EDAC API is the csrow, the driver sequentially
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1010) maps each channel/DIMM pair into a different csrow.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1011)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1012) For example, supposing the following layout::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1013)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1014) Ch0 phy rd0, wr0 (0x063f4031): 2 ranks, UDIMMs
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1015) dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1016) dimm 1 1024 Mb offset: 4, bank: 8, rank: 1, row: 0x4000, col: 0x400
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1017) Ch1 phy rd1, wr1 (0x063f4031): 2 ranks, UDIMMs
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1018) dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1019) Ch2 phy rd3, wr3 (0x063f4031): 2 ranks, UDIMMs
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1020) dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1021)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1022) The driver will map it as::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1023)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1024) csrow0: channel 0, dimm0
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1025) csrow1: channel 0, dimm1
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1026) csrow2: channel 1, dimm0
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1027) csrow3: channel 2, dimm0
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1028)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1029) The driver exports one DIMM per csrow.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1030)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1031) Each QPI is exported as a different memory controller.
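
The sequential numbering described above can be reproduced from a list of
populated channel/DIMM pairs. This is only an illustration of the numbering
scheme, not the driver's own code; the function name
``map_dimms_to_csrows`` and its input format are assumptions:

```shell
#!/bin/sh
# Hypothetical helper: read "channel dimm" pairs, one per line, in
# ascending order and give each populated DIMM the next csrow number.
map_dimms_to_csrows() {
    csrow=0
    while read -r channel dimm; do
        # Ignore empty input lines.
        [ -n "$channel" ] || continue
        echo "csrow${csrow}: channel ${channel}, dimm${dimm}"
        csrow=$((csrow + 1))
    done
}
```

Feeding it the layout from the example, for instance with
``printf '0 0\n0 1\n1 0\n2 0\n' | map_dimms_to_csrows``, yields the same four
csrow lines shown above.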
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1032)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1033) 2) The MC has the ability to inject errors to test drivers. The drivers
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1034) implement this functionality via some error injection nodes:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1035)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1036) For injecting a memory error, there are some sysfs nodes, under
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1037) ``/sys/devices/system/edac/mc/mc?/``:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1038)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1039) - ``inject_addrmatch/*``:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1040) Controls the error injection mask register. It is possible to specify
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1041) several characteristics of the address to match an error code::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1042)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1043) dimm = the affected dimm. Numbers are relative to a channel;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1044) rank = the memory rank;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1045) channel = the channel that will generate an error;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1046) bank = the affected bank;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1047) page = the page address;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1048) column (or col) = the address column.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1049)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1050) Each of the above values can be set to "any" to match any valid value.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1051)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1052) At driver init, all values are set to any.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1053)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1054) For example, to generate an error at rank 1 of dimm 2, for any channel,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1055) any bank, any page, any column::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1056)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1057) echo 2 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/dimm
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1058) echo 1 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/rank
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1059)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1060) To return to the default behaviour of matching any, you can do::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1061)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1062) echo any >/sys/devices/system/edac/mc/mc0/inject_addrmatch/dimm
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1063) echo any >/sys/devices/system/edac/mc/mc0/inject_addrmatch/rank
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1064)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1065) - ``inject_eccmask``:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1066) specifies which bits will be affected.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1067)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1068) - ``inject_section``:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1069) specifies what ECC cache section will get the error::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1070)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1071) 3 for both
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1072) 2 for the highest
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1073) 1 for the lowest
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1074)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1075) - ``inject_type``:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1076) specifies the type of error, being a combination of the following bits::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1077)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1078) bit 0 - repeat
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1079) bit 1 - ecc
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1080) bit 2 - parity
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1081)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1082) - ``inject_enable``:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1083) starts the error generation when a non-zero value is written.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1084)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1085) All inject vars can be read. Root permission is needed to write them.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1086)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1087) The datasheet states that the error will only be generated after a write to an
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1088) address that matches inject_addrmatch. It seems, however, that reading will
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1089) also produce an error.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1090)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1091) For example, the following code will generate an error for any write access
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1092) at socket 0, on any DIMM/address on channel 2::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1093)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1094) echo 2 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/channel
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1095) echo 2 >/sys/devices/system/edac/mc/mc0/inject_type
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1096) echo 64 >/sys/devices/system/edac/mc/mc0/inject_eccmask
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1097) echo 3 >/sys/devices/system/edac/mc/mc0/inject_section
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1098) echo 1 >/sys/devices/system/edac/mc/mc0/inject_enable
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1099) dd if=/dev/mem of=/dev/null seek=16k bs=4k count=1 >& /dev/null
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1100)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1101) For socket 1, replace "mc0" with "mc1" in the above
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1102) commands.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1103)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1104) The generated error message will look like::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1105)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1106) EDAC MC0: UE row 0, channel-a= 0 channel-b= 0 labels "-": NON_FATAL (addr = 0x0075b980, socket=0, Dimm=0, Channel=2, syndrome=0x00000040, count=1, Err=8c0000400001009f:4000080482 (read error: read ECC error))
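
Because ``inject_type`` is a combination of bits, the value to write can be
built up from the individual bit values. The following is a minimal sketch;
the variable names and the function ``inject_type_value`` are hypothetical,
and only the bit positions come from the list above:

```shell
#!/bin/sh
# Bit values taken from the inject_type description above.
INJECT_REPEAT=1   # bit 0
INJECT_ECC=2      # bit 1
INJECT_PARITY=4   # bit 2

# Hypothetical helper: OR any number of bit values together into a
# single value suitable for writing to inject_type.
inject_type_value() {
    value=0
    for bit in "$@"; do
        value=$((value | bit))
    done
    echo "$value"
}
```

With these definitions, ``inject_type_value "$INJECT_ECC"`` gives 2, the
value used in the example above, and combining repeat with ECC gives 3.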
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1107)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1108) 3) Corrected Error memory register counters
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1109)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1110) These newer MCs have some registers to count memory errors. The driver
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1111) uses those registers to report Corrected Errors on devices with Registered
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1112) DIMMs.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1113)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1114) However, those counters don't work with Unregistered DIMMs. As the chipset
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1115) offers some counters that also work with UDIMMs (but with a worse level of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1116) granularity than the default ones), the driver exposes those registers for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1117) UDIMM memories.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1118)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1119) They can be read by looking at the contents of ``all_channel_counts/``::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1120)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1121) $ for i in /sys/devices/system/edac/mc/mc0/all_channel_counts/*; do echo $i; cat $i; done
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1122) /sys/devices/system/edac/mc/mc0/all_channel_counts/udimm0
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1123) 0
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1124) /sys/devices/system/edac/mc/mc0/all_channel_counts/udimm1
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1125) 0
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1126) /sys/devices/system/edac/mc/mc0/all_channel_counts/udimm2
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1127) 0
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1128)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1129) What happens here is that errors on different csrows, but at the same
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1130) dimm number will increment the same counter.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1131) So, in this memory mapping::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1132)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1133) csrow0: channel 0, dimm0
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1134) csrow1: channel 0, dimm1
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1135) csrow2: channel 1, dimm0
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1136) csrow3: channel 2, dimm0
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1137)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1138) The hardware will increment udimm0 for an error at the first dimm at either
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1139) csrow0, csrow2 or csrow3;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1140)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1141) The hardware will increment udimm1 for an error at the second dimm at either
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1142) csrow0, csrow2 or csrow3;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1143)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1144) The hardware will increment udimm2 for an error at the third dimm at either
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1145) csrow0, csrow2 or csrow3;
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1146)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1147) 4) Standard error counters
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1148)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1149) The standard error counters are generated when an mcelog error is received
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1150) by the driver. Since, with UDIMM, this is counted by software, it is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1151) possible that some errors could be lost. With RDIMMs, they display the
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1152) contents of the registers.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1153)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1154) Reference documents used on ``amd64_edac``
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1155) ------------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1156)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1157) The ``amd64_edac`` module is based on the following documents
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1158) (available from http://support.amd.com/en-us/search/tech-docs):
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1159)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1160) 1. :Title: BIOS and Kernel Developer's Guide for AMD Athlon 64 and AMD
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1161) Opteron Processors
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1162) :AMD publication #: 26094
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1163) :Revision: 3.26
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1164) :Link: http://support.amd.com/TechDocs/26094.PDF
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1165)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1166) 2. :Title: BIOS and Kernel Developer's Guide for AMD NPT Family 0Fh
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1167) Processors
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1168) :AMD publication #: 32559
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1169) :Revision: 3.00
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1170) :Issue Date: May 2006
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1171) :Link: http://support.amd.com/TechDocs/32559.pdf
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1172)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1173) 3. :Title: BIOS and Kernel Developer's Guide (BKDG) For AMD Family 10h
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1174) Processors
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1175) :AMD publication #: 31116
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1176) :Revision: 3.00
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1177) :Issue Date: September 07, 2007
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1178) :Link: http://support.amd.com/TechDocs/31116.pdf
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1179)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1180) 4. :Title: BIOS and Kernel Developer's Guide (BKDG) for AMD Family 15h
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1181) Models 30h-3Fh Processors
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1182) :AMD publication #: 49125
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1183) :Revision: 3.06
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1184) :Issue Date: 2/12/2015 (latest release)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1185) :Link: http://support.amd.com/TechDocs/49125_15h_Models_30h-3Fh_BKDG.pdf
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1186)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1187) 5. :Title: BIOS and Kernel Developer's Guide (BKDG) for AMD Family 15h
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1188) Models 60h-6Fh Processors
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1189) :AMD publication #: 50742
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1190) :Revision: 3.01
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1191) :Issue Date: 7/23/2015 (latest release)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1192) :Link: http://support.amd.com/TechDocs/50742_15h_Models_60h-6Fh_BKDG.pdf
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1193)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1194) 6. :Title: BIOS and Kernel Developer's Guide (BKDG) for AMD Family 16h
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1195) Models 00h-0Fh Processors
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1196) :AMD publication #: 48751
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1197) :Revision: 3.03
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1198) :Issue Date: 2/23/2015 (latest release)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1199) :Link: http://support.amd.com/TechDocs/48751_16h_bkdg.pdf
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1200)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1201) Credits
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1202) =======
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1203)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1204) * Written by Doug Thompson <dougthompson@xmission.com>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1205)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1206) - 7 Dec 2005
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1207) - 17 Jul 2007 Updated
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1208)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1209) * |copy| Mauro Carvalho Chehab
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1210)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1211) - 05 Aug 2009 Nehalem interface
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1212) - 26 Oct 2016 Converted to ReST and cleanups at the Nehalem section
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1213)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1214) * EDAC authors/maintainers:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1215)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1216) - Doug Thompson, Dave Jiang, Dave Peterson et al,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1217) - Mauro Carvalho Chehab
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1218) - Borislav Petkov
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1219) - original author: Thayne Harbaugh