System debugging#
2025-01-14
5 min read time
ROCm language and system-level debug, flags, and environment variables#
Kernel options to avoid: the Ethernet port getting renamed every time you change graphics cards, net.ifnames=0 biosdevname=0
ROCr error code#
- 2 Invalid Dimension 
- 4 Invalid Group Memory 
- 8 Invalid (or Null) Code 
- 32 Invalid Format 
- 64 Group is too large 
- 128 Out of VGPRs 
- 0x80000000 Debug Options 
Command to dump firmware version and get Linux kernel version#
sudo cat /sys/kernel/debug/dri/1/amdgpu_firmware_info
uname -a
Debug flags#
Debug messages when developing/debugging base ROCm driver. You could enable the printing from libhsakmt.so by setting an environment variable, HSAKMT_DEBUG_LEVEL. Available debug levels are 3-7. The higher level you set, the more messages will print.
- export HSAKMT_DEBUG_LEVEL=3: Only pr_err() prints.
- export HSAKMT_DEBUG_LEVEL=4: pr_err() and pr_warn() print.
- export HSAKMT_DEBUG_LEVEL=5: We currently do not implement “notice”. Setting to 5 is same as setting to 4.
- export HSAKMT_DEBUG_LEVEL=6: pr_err(), pr_warn(), and pr_info print.
- export HSAKMT_DEBUG_LEVEL=7: Everything including pr_debug prints.
ROCr level environment variables for debug#
HSA_ENABLE_SDMA=0
HSA_ENABLE_INTERRUPT=0
HSA_SVM_GUARD_PAGES=0
HSA_DISABLE_CACHE=1
Turn off page retry on GFX9/Vega devices#
sudo -s
echo 1 > /sys/module/amdkfd/parameters/noretry
HIP environment variables 3.x#
OpenCL debug flags#
AMD_OCL_WAIT_COMMAND=1 (0 = OFF, 1 = On)
PCIe-debug#
For information on how to debug and profile HIP applications, see Debugging with HIP