ECC Information

AMD SMI: ECC Information

Functions

amdsmi_status_t amdsmi_get_gpu_ecc_count (amdsmi_processor_handle processor_handle, amdsmi_gpu_block_t block, amdsmi_error_count_t *ec)
 Retrieve the error counts for a GPU block. Not supported on virtual machine guests.
 
amdsmi_status_t amdsmi_get_gpu_ecc_enabled (amdsmi_processor_handle processor_handle, uint64_t *enabled_blocks)
 Retrieve the enabled ECC bit-mask. Not supported on virtual machine guests.
 
amdsmi_status_t amdsmi_get_gpu_total_ecc_count (amdsmi_processor_handle processor_handle, amdsmi_error_count_t *ec)
 Returns the total number of ECC errors (correctable, uncorrectable, and deferred) in the given GPU. Not supported on virtual machine guests.
 

Function Documentation

◆ amdsmi_get_gpu_ecc_count()

amdsmi_status_t amdsmi_get_gpu_ecc_count ( amdsmi_processor_handle  processor_handle,
amdsmi_gpu_block_t  block,
amdsmi_error_count_t *  ec 
)

Retrieve the error counts for a GPU block. Not supported on virtual machine guests.

See RAS Error Count sysfs Interface (AMDGPU RAS Support - Linux Kernel documentation) to learn how these error counts are accessed.

Platform:

gpu_bm_linux

host

Given a processor handle processor_handle, an amdsmi_gpu_block_t block and a pointer to an amdsmi_error_count_t ec, this function will write the error count values for the GPU block indicated by block to memory pointed to by ec.

Parameters
[in]      processor_handle  A processor handle.
[in]      block             The block for which error counts should be retrieved.
[in,out]  ec                A pointer to an amdsmi_error_count_t to which the error counts should be written. If this parameter is nullptr, this function will return AMDSMI_STATUS_INVAL if the function is supported with the provided arguments, and AMDSMI_STATUS_NOT_SUPPORTED if it is not supported with the provided arguments.
Returns
amdsmi_status_t | AMDSMI_STATUS_SUCCESS on success, non-zero on failure
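
The calling convention described above (a caller-owned structure, a null-pointer check, and a status code that is checked before the counts are read) can be sketched without linking against the library. Everything prefixed demo_ or DEMO_ below is a hypothetical stand-in and not part of AMD SMI. The field names of demo_error_count_t are assumed to mirror amdsmi_error_count_t, and the stub returns fixed values instead of querying hardware. It also models only the AMDSMI_STATUS_INVAL branch of the nullptr behaviour.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for amdsmi_status_t and amdsmi_error_count_t,
 * defined here only so the sketch compiles without the AMD SMI headers. */
typedef enum {
    DEMO_STATUS_SUCCESS = 0,
    DEMO_STATUS_INVAL = 1
} demo_status_t;

typedef struct {
    uint64_t correctable_count;   /* assumed field name */
    uint64_t uncorrectable_count; /* assumed field name */
    uint64_t deferred_count;      /* assumed field name */
} demo_error_count_t;

/* Stub with the same parameter shape as amdsmi_get_gpu_ecc_count().
 * It rejects a null output pointer and otherwise writes fixed values. */
static demo_status_t demo_get_ecc_count(void *processor_handle,
                                        uint64_t block,
                                        demo_error_count_t *ec)
{
    (void)processor_handle;
    (void)block;
    if (ec == NULL) {
        return DEMO_STATUS_INVAL;
    }
    ec->correctable_count = 3;
    ec->uncorrectable_count = 1;
    ec->deferred_count = 0;
    return DEMO_STATUS_SUCCESS;
}

/* Caller-side pattern: the caller owns the structure, passes its address,
 * and reads the counts only after the call reports success.
 * Returns 1 if the block reported uncorrectable errors, 0 if it did not,
 * and -1 if the query failed. */
static int demo_block_has_uncorrectable(void *processor_handle, uint64_t block)
{
    demo_error_count_t ec = {0, 0, 0};
    if (demo_get_ecc_count(processor_handle, block, &ec) != DEMO_STATUS_SUCCESS) {
        return -1;
    }
    return ec.uncorrectable_count > 0 ? 1 : 0;
}
```

In real code the stub would be replaced by amdsmi_get_gpu_ecc_count() with a processor handle obtained from the library, and the stand-in types by the ones declared in the AMD SMI header.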

◆ amdsmi_get_gpu_ecc_enabled()

amdsmi_status_t amdsmi_get_gpu_ecc_enabled ( amdsmi_processor_handle  processor_handle,
uint64_t *  enabled_blocks 
)

Retrieve the enabled ECC bit-mask. Not supported on virtual machine guests.

See RAS Error Count sysfs Interface (AMDGPU RAS Support - Linux Kernel documentation) to learn how these error counts are accessed.

Platform:

gpu_bm_linux

host

Given a processor handle processor_handle and a pointer to a uint64_t enabled_blocks, this function writes a bit-mask to the memory pointed to by enabled_blocks. After a successful call, the value can be ANDed with elements of the amdsmi_gpu_block_t enumeration to determine whether the corresponding block has ECC enabled. Note that whether a block has ECC enabled in the device is independent of whether the kernel supports error counting for that block: a block may have ECC enabled even though the kernel cannot read its error counters.

Parameters
[in]      processor_handle  A processor handle.
[in,out]  enabled_blocks    A pointer to a uint64_t to which the enabled-blocks bits will be written. If this parameter is nullptr, this function will return AMDSMI_STATUS_INVAL if the function is supported with the provided arguments, and AMDSMI_STATUS_NOT_SUPPORTED if it is not supported with the provided arguments.
Returns
amdsmi_status_t | AMDSMI_STATUS_SUCCESS on success, non-zero on failure
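
The mask test described above amounts to a bitwise AND followed by a non-zero check. The sketch below shows that check in isolation. The DEMO_BLOCK_* constants are hypothetical single-bit values standing in for members of amdsmi_gpu_block_t, and the helper names are illustrative only. Real code would pass the enumerators from the AMD SMI header and a mask filled in by amdsmi_get_gpu_ecc_enabled().

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical single-bit block identifiers (not the real enum values). */
#define DEMO_BLOCK_A (UINT64_C(1) << 0)
#define DEMO_BLOCK_B (UINT64_C(1) << 1)
#define DEMO_BLOCK_C (UINT64_C(1) << 2)

/* True when the bit for `block` is set in the enabled-blocks mask. */
static bool demo_ecc_block_enabled(uint64_t enabled_blocks, uint64_t block)
{
    return (enabled_blocks & block) != 0;
}

/* Count how many of the n listed blocks have their bit set in the mask. */
static size_t demo_count_enabled(uint64_t enabled_blocks,
                                 const uint64_t *blocks, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; ++i) {
        if (demo_ecc_block_enabled(enabled_blocks, blocks[i])) {
            ++count;
        }
    }
    return count;
}
```

A set bit only says that ECC is enabled for that block. As noted above, it does not guarantee that the kernel can report error counters for it, so a later count query for the same block may still fail.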

◆ amdsmi_get_gpu_total_ecc_count()

amdsmi_status_t amdsmi_get_gpu_total_ecc_count ( amdsmi_processor_handle  processor_handle,
amdsmi_error_count_t *  ec 
)

Returns the total number of ECC errors (correctable, uncorrectable, and deferred) in the given GPU. Not supported on virtual machine guests.

See RAS Error Count sysfs Interface (AMDGPU RAS Support - Linux Kernel documentation) to learn how these error counts are accessed.

Platform:

gpu_bm_linux

host

guest_windows

Parameters
[in]   processor_handle  Device to query.
[out]  ec                Pointer to the ECC error count structure. Must be allocated by the user.
Returns
amdsmi_status_t | AMDSMI_STATUS_SUCCESS on success, non-zero on failure