System Debugging Guide#

Applies to Linux

2023-05-30

5 min read time

ROCm Language and System Level Debug, Flags, and Environment Variables#

Kernel options to avoid: the Ethernet port getting renamed every time you change graphics cards, net.ifnames=0 biosdevname=0

ROCr Error Code#

  • 2 Invalid Dimension

  • 4 Invalid Group Memory

  • 8 Invalid (or Null) Code

  • 32 Invalid Format

  • 64 Group is too large

  • 128 Out of VGPRs

  • 0x80000000 Debug Options

Command to Dump Firmware Version and Get Linux Kernel Version#

sudo cat /sys/kernel/debug/dri/1/amdgpu_firmware_info

uname -a

Debug Flags#

Debug messages when developing/debugging base ROCm driver. You could enable the printing from libhsakmt.so by setting an environment variable, HSAKMT_DEBUG_LEVEL. Available debug levels are 3-7. The higher level you set, the more messages will print.

  • export HSAKMT_DEBUG_LEVEL=3 : Only pr_err() prints.

  • export HSAKMT_DEBUG_LEVEL=4 : pr_err() and pr_warn() print.

  • export HSAKMT_DEBUG_LEVEL=5 : We currently do not implement “notice”. Setting to 5 is same as setting to 4.

  • export HSAKMT_DEBUG_LEVEL=6 : pr_err(), pr_warn(), and pr_info print.

  • export HSAKMT_DEBUG_LEVEL=7 : Everything including pr_debug prints.

ROCr Level Environment Variables for Debug#

HSA_ENABLE_SDMA=0

HSA_ENABLE_INTERRUPT=0

HSA_SVM_GUARD_PAGES=0

HSA_DISABLE_CACHE=1

Turn Off Page Retry on GFX9/Vega Devices#

sudo -s

echo 1 > /sys/module/amdkfd/parameters/noretry

HIP Environment Variables 3.x#

OpenCL Debug Flags#

AMD_OCL_WAIT_COMMAND=1 (0 = OFF, 1 = On)

PCIe-Debug#

Refer to ROCm PCIe Debug, https://rocmdocs.amd.com/en/latest/Other_Solutions/PCIe-Debug.html#pcie-debug. For information on how to debug and profile HIP applications, see HIP Debugging