Error Queries

Functions

rsmi_status_t rsmi_dev_ecc_count_get (uint32_t dv_ind, rsmi_gpu_block_t block, rsmi_error_count_t *ec)
 Retrieve the error counts for a GPU block.

rsmi_status_t rsmi_dev_ecc_enabled_get (uint32_t dv_ind, uint64_t *enabled_blocks)
 Retrieve the enabled ECC bit-mask.

rsmi_status_t rsmi_dev_ecc_status_get (uint32_t dv_ind, rsmi_gpu_block_t block, rsmi_ras_err_state_t *state)
 Retrieve the ECC status for a GPU block.

rsmi_status_t rsmi_status_string (rsmi_status_t status, const char **status_string)
 Get a description of a provided RSMI error status.

Detailed Description

These functions provide error information about RSMI calls as well as device errors.

Function Documentation

◆ rsmi_dev_ecc_count_get()

rsmi_status_t rsmi_dev_ecc_count_get ( uint32_t  dv_ind,
rsmi_gpu_block_t  block,
rsmi_error_count_t *  ec 
)

Retrieve the error counts for a GPU block.

Given a device index dv_ind, an rsmi_gpu_block_t block and a pointer to an rsmi_error_count_t ec, this function will write the error count values for the GPU block indicated by block to memory pointed to by ec.

Parameters
[in] dv_ind  A device index.
[in] block  The block for which error counts should be retrieved.
[in,out] ec  A pointer to an rsmi_error_count_t to which the error counts should be written. If this parameter is nullptr, this function will return RSMI_STATUS_INVALID_ARGS if the function is supported with the provided arguments, and RSMI_STATUS_NOT_SUPPORTED if it is not supported with the provided arguments.
Return values
RSMI_STATUS_SUCCESS  The call was successful.
RSMI_STATUS_NOT_SUPPORTED  The installed software or hardware does not support this function with the given arguments.
RSMI_STATUS_INVALID_ARGS  The provided arguments are not valid.
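
The null-pointer convention for ec lends itself to a support probe before the real read. The following is a minimal sketch of that pattern, not code from the library: the count_demo_* types and the fake query are hypothetical stand-ins so that the example compiles without ROCm SMI installed, their numeric values are placeholders, and the two counter field names are assumed rather than taken from this page. In real code the function pointer would be replaced by a direct call to rsmi_dev_ecc_count_get with the library's own types.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for rsmi_status_t and rsmi_error_count_t.
 * The numeric values and field names are assumptions for this sketch. */
typedef enum {
    COUNT_DEMO_SUCCESS = 0,
    COUNT_DEMO_INVALID_ARGS,
    COUNT_DEMO_NOT_SUPPORTED
} count_demo_status_t;

typedef struct {
    uint64_t correctable_err;
    uint64_t uncorrectable_err;
} count_demo_error_count_t;

/* Same parameter order as rsmi_dev_ecc_count_get; the block is passed
 * as a plain integer here instead of rsmi_gpu_block_t. */
typedef count_demo_status_t (*count_demo_query_fn)(uint32_t dv_ind,
                                                   uint64_t block,
                                                   count_demo_error_count_t *ec);

/* Probe with a null pointer first, then read the counters only if the
 * probe reports that the query is supported for these arguments. */
count_demo_status_t count_demo_read(count_demo_query_fn query, uint32_t dv_ind,
                                    uint64_t block,
                                    count_demo_error_count_t *out) {
    assert(query != NULL && out != NULL);
    count_demo_status_t probe = query(dv_ind, block, NULL);
    if (probe != COUNT_DEMO_INVALID_ARGS) {
        return probe;  /* typically "not supported"; nothing is written */
    }
    return query(dv_ind, block, out);
}

/* Fake query used only to exercise the sketch: device 0, block bit 0x1. */
count_demo_status_t count_demo_fake_query(uint32_t dv_ind, uint64_t block,
                                          count_demo_error_count_t *ec) {
    if (dv_ind != 0 || block != UINT64_C(0x1)) {
        return COUNT_DEMO_NOT_SUPPORTED;
    }
    if (ec == NULL) {
        return COUNT_DEMO_INVALID_ARGS;
    }
    ec->correctable_err = 3;
    ec->uncorrectable_err = 0;
    return COUNT_DEMO_SUCCESS;
}
```

Probing first keeps the "unsupported" case separate from genuine argument errors, at the cost of one extra call per block.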

◆ rsmi_dev_ecc_enabled_get()

rsmi_status_t rsmi_dev_ecc_enabled_get ( uint32_t  dv_ind,
uint64_t *  enabled_blocks 
)

Retrieve the enabled ECC bit-mask.

Given a device index dv_ind and a pointer to a uint64_t enabled_blocks, this function will write a bit-mask to the memory pointed to by enabled_blocks. Upon a successful call, the value written can be ANDed with elements of the rsmi_gpu_block_t enumeration to determine whether the corresponding block has ECC enabled. Note that whether a block has ECC enabled in the device is independent of whether there is kernel support for error counting for that block: a block may have ECC enabled even though there is no kernel support for reading its error counters.

Parameters
[in] dv_ind  A device index.
[in,out] enabled_blocks  A pointer to a uint64_t to which the enabled blocks bits will be written. If this parameter is nullptr, this function will return RSMI_STATUS_INVALID_ARGS if the function is supported with the provided arguments, and RSMI_STATUS_NOT_SUPPORTED if it is not supported with the provided arguments.
Return values
RSMI_STATUS_SUCCESS  The call was successful.
RSMI_STATUS_NOT_SUPPORTED  The installed software or hardware does not support this function with the given arguments.
RSMI_STATUS_INVALID_ARGS  The provided arguments are not valid.
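
As a small illustration of the AND test described for enabled_blocks, the helpers below check one block bit against the mask and count how many bits are set. This is only a sketch: ecc_mask_has_block and ecc_mask_count are hypothetical names, and the block bits are passed as plain uint64_t values so that the example stands alone. In real code the second argument would be an rsmi_gpu_block_t enumerator from the library header.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if the block bit block_bit is set in enabled_blocks, the mask
 * written by rsmi_dev_ecc_enabled_get(). */
bool ecc_mask_has_block(uint64_t enabled_blocks, uint64_t block_bit) {
    assert(block_bit != 0);  /* a zero bit would never match */
    return (enabled_blocks & block_bit) != 0;
}

/* Number of blocks flagged as ECC-enabled in the mask. */
unsigned ecc_mask_count(uint64_t enabled_blocks) {
    unsigned count = 0;
    while (enabled_blocks != 0) {
        enabled_blocks &= enabled_blocks - 1;  /* clear the lowest set bit */
        ++count;
    }
    return count;
}
```

A set bit only says that ECC is enabled for that block; as noted in the description, reading its error counters may still be unsupported.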

◆ rsmi_dev_ecc_status_get()

rsmi_status_t rsmi_dev_ecc_status_get ( uint32_t  dv_ind,
rsmi_gpu_block_t  block,
rsmi_ras_err_state_t *  state 
)

Retrieve the ECC status for a GPU block.

Given a device index dv_ind, an rsmi_gpu_block_t block and a pointer to an rsmi_ras_err_state_t state, this function will write the current state for the GPU block indicated by block to memory pointed to by state.

Parameters
[in] dv_ind  A device index.
[in] block  The block for which the ECC status should be retrieved.
[in,out] state  A pointer to an rsmi_ras_err_state_t to which the ECC state should be written. If this parameter is nullptr, this function will return RSMI_STATUS_INVALID_ARGS if the function is supported with the provided arguments, and RSMI_STATUS_NOT_SUPPORTED if it is not supported with the provided arguments.
Return values
RSMI_STATUS_SUCCESS  The call was successful.
RSMI_STATUS_NOT_SUPPORTED  The installed software or hardware does not support this function with the given arguments.
RSMI_STATUS_INVALID_ARGS  The provided arguments are not valid.
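
One way to combine this call with the enabled ECC bit-mask is to walk the set bits and ask for the state of each block, keeping track of which queries succeed. The sketch below assumes that each block corresponds to a single bit, as the description of rsmi_dev_ecc_enabled_get suggests. The state_demo_* names, the placeholder status values, and the fake query are hypothetical stand-ins that let the example compile without ROCm SMI; real code would call rsmi_dev_ecc_status_get directly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int state_demo_status_t;  /* stand-in for rsmi_status_t */
typedef int state_demo_ras_t;     /* stand-in for rsmi_ras_err_state_t */

/* Placeholder status values for this sketch only. */
enum { STATE_DEMO_SUCCESS = 0, STATE_DEMO_NOT_SUPPORTED = 2 };

/* Same parameter order as rsmi_dev_ecc_status_get. */
typedef state_demo_status_t (*state_demo_query_fn)(uint32_t dv_ind,
                                                   uint64_t block,
                                                   state_demo_ras_t *state);

/* Return the subset of enabled_blocks whose state query succeeded. */
uint64_t state_demo_answered_blocks(state_demo_query_fn query, uint32_t dv_ind,
                                    uint64_t enabled_blocks) {
    uint64_t answered = 0;
    assert(query != NULL);
    for (unsigned bit = 0; bit < 64; ++bit) {
        uint64_t block = UINT64_C(1) << bit;
        state_demo_ras_t state = 0;
        if ((enabled_blocks & block) == 0) {
            continue;  /* ECC not enabled for this block */
        }
        if (query(dv_ind, block, &state) == STATE_DEMO_SUCCESS) {
            answered |= block;  /* a real caller would also record state */
        }
    }
    return answered;
}

/* Fake query: device 0 reports a state for block bits 0x1 and 0x4 only. */
state_demo_status_t state_demo_fake_query(uint32_t dv_ind, uint64_t block,
                                          state_demo_ras_t *state) {
    if (dv_ind != 0 || state == NULL) {
        return STATE_DEMO_NOT_SUPPORTED;
    }
    if (block != UINT64_C(0x1) && block != UINT64_C(0x4)) {
        return STATE_DEMO_NOT_SUPPORTED;
    }
    *state = 0;
    return STATE_DEMO_SUCCESS;
}
```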

◆ rsmi_status_string()

rsmi_status_t rsmi_status_string ( rsmi_status_t  status,
const char **  status_string 
)

Get a description of a provided RSMI error status.

Set the const char * pointed to by status_string to a string containing a description of the provided error code status.

Parameters
[in] status  The error status for which a description is desired.
[in,out] status_string  A pointer to a const char * which will be made to point to a description of the provided error code.
Return values
RSMI_STATUS_SUCCESS  Returned upon a successful call.
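
A typical use of this function is turning the status of a failed call into a log message. The sketch below wraps the lookup so that callers always get a printable string, even if the lookup itself fails. The text_demo_* names, the fallback text, and the fake lookup are hypothetical and exist only so the example compiles without ROCm SMI; real code would call rsmi_status_string with an rsmi_status_t.

```c
#include <assert.h>
#include <stddef.h>

typedef int text_demo_status_t;  /* stand-in for rsmi_status_t */

enum { TEXT_DEMO_SUCCESS = 0, TEXT_DEMO_FAILURE = 1 };  /* placeholders */

/* Same parameter order as rsmi_status_string. */
typedef text_demo_status_t (*text_demo_lookup_fn)(text_demo_status_t status,
                                                  const char **status_string);

/* Return a description for status, or a fixed fallback if the lookup
 * fails or leaves the pointer unset. */
const char *text_demo_describe(text_demo_lookup_fn lookup,
                               text_demo_status_t status) {
    const char *text = NULL;
    assert(lookup != NULL);
    if (lookup(status, &text) != TEXT_DEMO_SUCCESS || text == NULL) {
        return "unknown status";
    }
    return text;
}

/* Fake lookup: only knows the success code. */
text_demo_status_t text_demo_fake_lookup(text_demo_status_t status,
                                         const char **status_string) {
    if (status_string == NULL || status != TEXT_DEMO_SUCCESS) {
        return TEXT_DEMO_FAILURE;
    }
    *status_string = "fake: success";
    return TEXT_DEMO_SUCCESS;
}
```

The wrapper never frees or copies the returned string, since the signature hands back a const char * rather than filling a caller-owned buffer.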