Reliability, availability, serviceability (RAS)#

RAS aims to increase the robustness of a system by detecting hardware errors, recording them, and correcting them where possible. See Reliability, availability, serviceability (Linux kernel) for more general information.

ECC#

ECC (Error-Correcting Code) memory automatically detects memory errors and corrects them where possible. Correctable 1-bit errors are fixed by the ECC logic and logged by the hardware. Uncorrectable 2-bit errors can be detected but not reliably corrected; this is a more serious event that must be reported. See RAS Error Count sysfs Interface to learn how AMD SMI accesses error counts.
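The difference between a correctable 1-bit error and a detectable-but-uncorrectable 2-bit error can be illustrated with a toy SECDED (single-error-correct, double-error-detect) code. The sketch below uses an extended Hamming(8,4) code over a 4-bit value; it is only an illustration of the principle, not the code used by any AMD hardware, and the function names `encode` and `decode` are hypothetical.

```python
def encode(nibble):
    """Encode a 4-bit value into an 8-bit extended Hamming codeword (list of bits)."""
    data = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    # Data bits live at positions 3, 5, 6, 7; parity bits at 1, 2, 4.
    code[3], code[5], code[6], code[7] = data
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    # Position 0 holds an overall parity bit covering the other seven bits.
    code[0] = sum(code[1:]) % 2
    return code


def decode(code):
    """Return (value, status) where status is 'ok', 'corrected', or 'uncorrectable'."""
    code = list(code)
    # The syndrome is the XOR of the positions of all set bits in 1..7.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    overall = sum(code) % 2
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd overall parity means one flipped bit; the syndrome gives its
        # position (0 means the overall parity bit itself was flipped).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        return None, "uncorrectable"
    value = code[3] | code[5] << 1 | code[6] << 2 | code[7] << 3
    return value, status


word = encode(0b1011)
one_bit = list(word)
one_bit[5] ^= 1            # single bit flip: recoverable
two_bit = list(word)
two_bit[2] ^= 1
two_bit[6] ^= 1            # double bit flip: detected only
print(decode(word), decode(one_bit), decode(two_bit))
```

In this toy model, the "corrected" outcome corresponds to an event that would be counted as a correctable error, and the "uncorrectable" outcome to one that must be reported.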

ECC is the mechanism that detects and corrects memory errors; CPER is the standard format used to report that an error event occurred.

CPER#

At its core, CPER (Common Platform Error Record) is a standard format included in the UEFI specification to report errors to the operating system. It works as a standard error report template that different hardware components can fill out when something goes wrong. A record consists of a header, one or more section descriptors, and, for each descriptor, an associated section containing error or informational data. See CPER (UEFI Specification) for more information.

A CPER record contains information that is vital for diagnostics, such as:

  • Error source

  • Error type

  • Error severity

    • 0 - Recoverable (also called non-fatal uncorrected)

    • 1 - Fatal

    • 2 - Corrected

    • 3 - Informational

  • Timestamp

  • Other data
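As a minimal sketch of how the severity codes above appear in practice, the following Python reads the first few fields of a CPER record header and maps the severity value to a name. The field layout (signature, revision, signature end, section count, error severity, validation bits, record length, all little-endian and packed) reflects one reading of the UEFI specification and should be checked against it; `CperSeverity` and `parse_header_prefix` are hypothetical names, and this is not the decoder AMD SMI uses.

```python
import struct
from enum import IntEnum


class CperSeverity(IntEnum):
    """Severity codes as listed above."""
    RECOVERABLE = 0
    FATAL = 1
    CORRECTED = 2
    INFORMATIONAL = 3


# Assumed layout of the first 24 bytes of the record header:
# 4-byte signature, u16 revision, u32 signature end, u16 section count,
# u32 error severity, u32 validation bits, u32 record length.
_HEADER_PREFIX = struct.Struct("<4sHIHIII")


def parse_header_prefix(record: bytes) -> dict:
    """Decode the leading header fields of a raw CPER record."""
    if len(record) < _HEADER_PREFIX.size:
        raise ValueError("record too short for a CPER header")
    (signature, revision, signature_end, section_count,
     severity, validation_bits, record_length) = _HEADER_PREFIX.unpack_from(record)
    if signature != b"CPER" or signature_end != 0xFFFFFFFF:
        raise ValueError("not a CPER record")
    return {
        "revision": revision,
        "section_count": section_count,
        "severity": CperSeverity(severity),
        "validation_bits": validation_bits,
        "record_length": record_length,
    }


# Synthetic bytes for illustration only; not captured from real hardware.
sample = _HEADER_PREFIX.pack(
    b"CPER", 0x0100, 0xFFFFFFFF, 1, int(CperSeverity.CORRECTED), 0, 128
).ljust(128, b"\x00")
header = parse_header_prefix(sample)
print(header["severity"].name, header["section_count"])
```

The remaining header fields (timestamp, platform and creator IDs, and so on) and the section descriptors are left out to keep the sketch short.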

A CPER record might contain an AFID in its data to help map a complex error to a more actionable service task.

AFID#

An AFID (AMD Field ID) is a unique numerical ID associated with a specific event or error produced by AMD Instinct accelerators. It provides a specific identifier for a known condition, which helps facilitate root cause analysis. Each AFID is associated with category, type, and severity fields. See AFID Event List for more information.

From concept to action#

AMD SMI provides tools to programmatically monitor and manage these RAS features.

The AMD SMI library provides APIs to query ECC error counts and manage CPER records (list, decode, and clear).

See ECC information and RAS information for available APIs.
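A rough sketch of querying ECC error counts from Python follows. It assumes the `amdsmi` Python package is installed, that `amdsmi_get_gpu_total_ecc_count` returns a dictionary with `correctable_count` and `uncorrectable_count` keys, and that a supported GPU is present; confirm the exact names and return shape against the ECC information reference. The helpers `summarize_ecc` and `report_all_gpus` are hypothetical.

```python
def summarize_ecc(counts: dict) -> str:
    """Turn a dictionary of ECC counters into a short health summary."""
    # Key names are assumptions about the amdsmi return value.
    uncorrectable = counts.get("uncorrectable_count", 0)
    correctable = counts.get("correctable_count", 0)
    if uncorrectable > 0:
        return "uncorrectable errors present"
    if correctable > 0:
        return "correctable errors only"
    return "no ECC errors reported"


def report_all_gpus() -> None:
    """Print an ECC summary per device; requires amdsmi and AMD hardware."""
    import amdsmi  # imported here so the helper above works without it

    amdsmi.amdsmi_init()
    try:
        for index, handle in enumerate(amdsmi.amdsmi_get_processor_handles()):
            counts = amdsmi.amdsmi_get_gpu_total_ecc_count(handle)
            print(f"GPU {index}: {summarize_ecc(counts)}")
    finally:
        # Always release the library, even if a query fails.
        amdsmi.amdsmi_shut_down()


if __name__ == "__main__":
    report_all_gpus()
```

Keeping the classification in a separate function from the hardware query makes the policy (what counts as serious) easy to adjust and to exercise without a GPU.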

See amd-smi ras --help for details and available options.

amd-smi ras --help

Further reading#