^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 1) Error Detection And Correction (EDAC) Devices
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 2) =============================================
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 3)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 4) Main Concepts used in the EDAC subsystem
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 5) ----------------------------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 6)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 7) There are several things to be aware of that aren't at all obvious, like
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 8) *sockets*, *socket sets*, *banks*, *rows*, *chip-select rows*, *channels*,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 9) etc.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 10)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 11) These are some of the many terms that are thrown about that don't always
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 12) mean what people think they mean (Inconceivable!). In the interest of
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 13) creating a common ground for discussion, terms and their definitions
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 14) will be established.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 15)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 16) * Memory devices
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 17)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 18) The individual DRAM chips on a memory stick. These devices commonly
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 19) output 4 or 8 bits each (x4, x8). Grouping several of these in parallel
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 20) provides the number of bits that the memory controller expects:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 21) typically 72 bits, that is, 64 bits of data plus 8 bits of ECC.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 22)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 23) * Memory Stick
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 24)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 25) A printed circuit board that aggregates multiple memory devices in
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 26) parallel. In general, this is the Field Replaceable Unit (FRU) that
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 27) gets replaced in the case of excessive errors. Most often it is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 28) called a DIMM (Dual Inline Memory Module).
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 29)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 30) * Memory Socket
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 31)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 32) A physical connector on the motherboard that accepts a single memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 33) stick. Also called a "slot" in several datasheets.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 34)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 35) * Channel
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 36)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 37) A memory controller channel, responsible for communicating with a group
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 38) of DIMMs. Each channel has its own independent control (command) and data
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 39) bus, and can be used independently or grouped with other channels.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 40)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 41) * Branch
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 42)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 43) It is typically the highest level of the hierarchy on a Fully-Buffered
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 44) DIMM memory controller. Typically, it contains two channels. Two channels
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 45) on the same branch can be used in single mode or in lockstep mode. When
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 46) lockstep is enabled, the cacheline is doubled, but it generally brings
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 47) some performance penalty. Also, it is generally not possible to point to
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 48) just one memory stick when an error occurs, as the error correction code
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 49) is calculated using two DIMMs instead of one. For the same reason, it is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 50) capable of correcting more errors than single mode.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 51)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 52) * Single-channel
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 53)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 54) The data accessed by the memory controller is contained in one DIMM
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 55) only. E.g. if the data is 64 bits wide, the data flows to the CPU using
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 56) one 64-bit parallel access. Typically used with SDR, DDR, DDR2 and DDR3
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 57) memories. FB-DIMM and RAMBUS use a different concept for channel, so
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 58) this concept doesn't apply there.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 59)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 60) * Double-channel
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 61)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 62) The data accessed by the memory controller is interleaved across two
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 63) DIMMs, accessed at the same time. E.g. if the DIMM is 64 bits wide (72
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 64) bits with ECC), the data flows to the CPU using a 128-bit parallel
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 65) access.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 66)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 67) * Chip-select row
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 68)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 69) This is the name of the DRAM signal used to select the DRAM ranks to be
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 70) accessed. Common chip-select rows are 64 bits wide for single channel and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 71) 128 bits wide for dual channel. It may not be visible to the memory
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 72) controller, as some DIMM types have a memory buffer that can hide direct
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 73) access to it from the memory controller.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 74)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 75) * Single-Ranked stick
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 76)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 77) A single-ranked stick has one chip-select row of memory. Motherboards
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 78) commonly drive two chip-select pins to a memory stick. A single-ranked
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 79) stick will occupy only one of those rows. The other will be unused.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 80)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 81) .. _doubleranked:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 82)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 83) * Double-Ranked stick
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 84)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 85) A double-ranked stick has two chip-select rows which access different
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 86) sets of memory devices. The two rows cannot be accessed concurrently.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 87)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 88) * Double-sided stick
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 89)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 90) **DEPRECATED TERM**, see :ref:`Double-Ranked stick <doubleranked>`.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 91)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 92) A double-sided stick has two chip-select rows which access different sets
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 93) of memory devices. The two rows cannot be accessed concurrently.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 94) "Double-sided" is irrespective of the memory devices being mounted on
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 95) both sides of the memory stick.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 96)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 97) * Socket set
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 98)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 99) All of the memory sticks that are required for a single memory access or
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 100) all of the memory sticks spanned by a chip-select row. A single socket
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 101) set has two chip-select rows, and if double-sided sticks are used, these
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 102) will occupy those chip-select rows.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 103)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 104) * Bank
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 105)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 106) This term is avoided because it is ambiguous: it does not distinguish
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 107) between chip-select rows and socket sets.
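
The arithmetic behind several of the terms above can be sketched in a few
lines of Python. This is only an illustration under stated assumptions: the
function names are hypothetical, and the 72-bit width (64 data bits plus 8
ECC bits) is the typical figure quoted above, not something the EDAC core
defines.

```python
def devices_per_rank(device_width_bits, rank_width_bits=72):
    """Return how many DRAM devices of the given width fill one rank."""
    if device_width_bits <= 0 or rank_width_bits % device_width_bits:
        raise ValueError("device width must evenly divide the rank width")
    return rank_width_bits // device_width_bits


def access_width_bits(channels, data_bits=64, ecc_bits=8):
    """Return (data, total) bits moved by one parallel access."""
    return channels * data_bits, channels * (data_bits + ecc_bits)


if __name__ == "__main__":
    # x4 and x8 devices grouped in parallel to reach the assumed 72 bits.
    for width in (4, 8):
        print(f"x{width}: {devices_per_rank(width)} devices per 72-bit rank")
    # One channel versus two channels accessed at the same time.
    print("single-channel:", access_width_bits(1))
    print("double-channel:", access_width_bits(2))
```

With these assumptions, a rank needs 18 x4 devices or 9 x8 devices, and a
double-channel access moves 128 data bits (144 bits including ECC).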
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 108)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 109)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 110) Memory Controllers
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 111) ------------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 112)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 113) Most of the EDAC core is focused on doing Memory Controller error detection.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 114) Controllers are allocated with :c:func:`edac_mc_alloc`, which internally
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 115) uses the struct ``mem_ctl_info`` to describe the memory controllers. It is
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 116) an opaque struct for the EDAC drivers; only the EDAC core may touch it.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 117)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 118) .. kernel-doc:: include/linux/edac.h
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 119)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 120) .. kernel-doc:: drivers/edac/edac_mc.h
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 121)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 122) PCI Controllers
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 123) ---------------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 124)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 125) The EDAC subsystem provides a mechanism to handle PCI controllers through
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 126) :c:func:`edac_pci_alloc_ctl_info`. It uses the struct
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 127) :c:type:`edac_pci_ctl_info` to describe the PCI controllers.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 128)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 129) .. kernel-doc:: drivers/edac/edac_pci.h
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 130)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 131) EDAC Blocks
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 132) -----------
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 133)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 134) The EDAC subsystem also provides a generic mechanism to report errors on
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 135) other parts of the hardware via :c:func:`edac_device_alloc_ctl_info`.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 136)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 137) The structures :c:type:`edac_dev_sysfs_block_attribute`,
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 138) :c:type:`edac_device_block`, :c:type:`edac_device_instance` and
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 139) :c:type:`edac_device_ctl_info` provide a generic or abstract 'edac_device'
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 140) representation in sysfs.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 141)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 142) This set of structures, and the code that implements their APIs, allows
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 143) registering EDAC-type devices that are NOT standard memory or PCI, like:
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 144)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 145) - CPU caches (L1 and L2)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 146) - DMA engines
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 147) - Core CPU switches
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 148) - Fabric switch units
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 149) - PCIe interface controllers
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 150) - other EDAC/ECC type devices that can be monitored for
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 151) errors, etc.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 152)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 153) It allows for a two-level hierarchy.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 154)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 155) For example, a cache could be composed of L1, L2 and L3 levels of cache.
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 156) Each CPU core would have its own L1 cache, while sharing L2 and maybe L3
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 157) caches. In such a case, those can be represented via the following sysfs
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 158) nodes::
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 159)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 160)     /sys/devices/system/edac/..
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 161)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 162)     pci/            <existing pci directory (if available)>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 163)     mc/             <existing memory device directory>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 164)     cpu/cpu0/..     <L1 and L2 block directory>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 165)             /L1-cache/ce_count
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 166)                      /ue_count
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 167)             /L2-cache/ce_count
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 168)                      /ue_count
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 169)     cpu/cpu1/..     <L1 and L2 block directory>
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 170)             /L1-cache/ce_count
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 171)                      /ue_count
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 172)             /L2-cache/ce_count
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 173)                      /ue_count
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 174)     ...
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 175)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 176) The L1-cache and L2-cache directories would each be an :c:type:`edac_device_block`.
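
From user space, the two-level layout in the example tree could be read
roughly as follows. This is a sketch only: it assumes the illustrative
``cpu/cpuN/<block>/ce_count`` and ``ue_count`` nodes shown above, and the
``read_block_counts`` helper and its ``root`` parameter are hypothetical
names, not part of any EDAC interface. Real drivers may expose different
directory names.

```python
from pathlib import Path


def read_block_counts(root):
    """Map 'instance/block' -> {'ce_count': int, 'ue_count': int}.

    `root` is an edac_device directory such as
    /sys/devices/system/edac/cpu (assumed layout: instance/block/counter).
    """
    counts = {}
    root = Path(root)
    if not root.is_dir():
        return counts
    # First level: instances (e.g. cpu0, cpu1).
    for instance in sorted(p for p in root.iterdir() if p.is_dir()):
        # Second level: blocks (e.g. L1-cache, L2-cache).
        for block in sorted(p for p in instance.iterdir() if p.is_dir()):
            entry = {}
            for name in ("ce_count", "ue_count"):
                counter = block / name
                if counter.is_file():
                    entry[name] = int(counter.read_text().strip())
            # Skip directories that carry no error counters.
            if entry:
                counts[f"{instance.name}/{block.name}"] = entry
    return counts


if __name__ == "__main__":
    for key, entry in read_block_counts("/sys/devices/system/edac/cpu").items():
        print(key, entry)
```

The helper returns an empty mapping when the directory does not exist, so it
can be run on machines without such an EDAC device.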
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 177)
^8f3ce5b39 (kx 2023-10-28 12:00:06 +0300 178) .. kernel-doc:: drivers/edac/edac_device.h