Run ROCm XIO tests#
ROCm XIO uses CTest with CMake presets, label-based filtering, hardware fixture setup, and runtime skip detection. This topic explains how to run tests, what labels and presets exist, and how hardware-gated tests behave when the required NIC or GPU is absent.
Prerequisites#
Build with testing enabled (the default preset does this automatically):
cmake --preset default
cmake --build --preset default
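To verify that the configure step registered the test suite, list the tests without running them (standard CTest `-N` flag):
ctest --test-dir build -N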
CMake test presets#
The project provides six test presets in CMakePresets.json:
| Preset | Description | Hardware |
|---|---|---|
| `unit` | CPU-only unit tests | CPU-only |
| | System tests (emulation) | GPU-only |
| `hardware` | Hardware integration tests | GPU + RDMA NIC |
| `sweep` | Multi-seed loopback sweep | GPU + RDMA NIC |
| | Install-integration examples | CPU-only |
| | All tests | Varies |
Run a preset:
ctest --preset unit
ctest --preset hardware
ctest --preset sweep
Or equivalently, without presets:
ctest --test-dir build -V -L "unit" \
--parallel --output-on-failure
Test labels#
Every test carries one or more CTest labels for filtering with
ctest -L <label>:
| Label | Definition |
|---|---|
| `unit` | CPU-only, no GPU or NIC (runs in CI) |
| | Needs a HIP-capable GPU |
| `hardware` | Needs a GPU and a specific RDMA NIC |
| `sweep` | Parameterized multi-seed loopback runs |
| `stress` | Long-running (timeout: 600 seconds) |
| `rdma` | RDMA-related test |
| | Common library utilities |
| | CTest fixture (setup/teardown) |
Combine labels to narrow the scope:
ctest --test-dir build -L "unit" -L "rdma"
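To run everything that does not need hardware, exclude by label instead with CTest's `-LE` flag:
# Run all tests whose labels do NOT match "hardware"
ctest --test-dir build -LE "hardware" --output-on-failure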
Test inventory#
Unit tests (CPU-only)#
These run in CI without hardware:
- `test-data-pattern` – LFSR data pattern generation and verification
- `test-rdma-config` – `RdmaEpConfig` validation, `Provider` enum, `provider_name()`, `provider_from_string()` (all vendors including ROCM_ERNIC)
- `test-rdma-vendors` – Vendor ID constants, `RmaDescriptor`, `AmoDescriptor` struct layout
- `test-rdma-endian` – Endian byte-swap helpers (host and optional device)
- `test-bnxt-sizing` – BNXT DV queue sizing math: `roundup_pow2`, `align_up`, `calc_wqe_sz`, `compute_sq`, `compute_rq`, `cqe_size`
- `test-rdma-topology` – PCIe address parsing: `ExtractBusNumber`, `GetBusIdDistance`, `GetLcaDepth`
- `test-extract-endpoint` – CLI argument parser: `extractEndpointName()`
- `test-ep-config` – test-ep configuration defaults
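To run one of these by name rather than by label, use CTest's `-R` name regex:
ctest --test-dir build -R "^test-rdma-config$" -V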
System tests#
- `test-ep-emulate` – Full SQE/CQE round-trip in emulation mode (GPU required)
Hardware tests#
These require a GPU and the corresponding RDMA NIC. When hardware
is absent, tests report Skipped rather than Failed (see
below).
- `test-rdma-loopback` – GPU-initiated RDMA WRITE loopback with LFSR verification (BNXT, MLX5, or Ionic)
- `test-rdma-loopback-seed1` through `test-rdma-loopback-seed5` – Parameterized seed sweep (label: `sweep`)
- `test-rdma-2node` – Two-node RDMA test (BNXT, MLX5, Ionic, or ERNIC)
- `test-rdma-ernic-loopback` – ERNIC loopback
Hardware skip detection#
Hardware tests use a three-layer gating strategy:
1. **Compile-time gating** – Tests are only registered with CTest when the corresponding `GDA_BNXT`, `GDA_MLX5`, `GDA_IONIC`, or `GDA_ERNIC` CMake variable is enabled at configure time.
2. **Runtime detection** – Each hardware test probes for the required NIC and GPU at startup. If the hardware is absent, the test prints `SKIP: ...` and exits with code 77. CTest recognises this via the `SKIP_RETURN_CODE 77` and `SKIP_REGULAR_EXPRESSION "SKIP:"` properties set by `xio_add_test()` (you can check this behaviour directly, as shown below).
3. **GPU resource allocation** – Tests with the `GPU` flag declare `RESOURCE_GROUPS "gpus:1"` so CTest can schedule parallel tests without oversubscribing GPUs. The resource specification is auto-generated at configure time by `cmake/XIODetectGPUs.cmake` using `rocm_agent_enumerator`.
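You can observe the runtime-detection layer by invoking a hardware test binary on a machine without the required NIC. The path below assumes test executables land in the build root; adjust for your layout:
# Without the NIC, the test prints "SKIP: ..." and exits with code 77
./build/test-rdma-loopback
echo $?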
CTest fixtures#
Hardware tests depend on an `RDMA_HW` fixture that runs
scripts/test/setup-rdma-loopback.sh via sudo before any
hardware test executes. This fixture handles:
- Kernel module reload (`modprobe bnxt_re` / `ionic_rdma`)
- Ionic sysfs loopback mode configuration
- RDMA device renaming (udev fallback)
- IP address and static ARP neighbor setup
- GID table readiness polling
When you run ctest -L hardware, CTest automatically runs the fixture
first in dependency order.
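The fixture also runs when you select a subset of hardware tests by name; CTest pulls in the setup test even though it does not match the regex:
ctest --test-dir build -R "^test-rdma-loopback$" --output-on-failure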
GPU resource spec#
At configure time, cmake/XIODetectGPUs.cmake runs rocm_agent_enumerator
and writes build/ctest-resources.json with the detected GPU count. When
rocm_agent_enumerator is unavailable the module defaults to a
single GPU. Use the generated file for parallel GPU-aware test
scheduling:
ctest --test-dir build \
--resource-spec-file build/ctest-resources.json \
--parallel 4
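The generated file follows the standard CTest resource-specification schema. A sketch of what it might contain on a two-GPU machine (exact contents depend on what rocm_agent_enumerator reports):
{
  "version": {"major": 1, "minor": 0},
  "local": [
    {
      "gpus": [
        {"id": "0", "slots": 1},
        {"id": "1", "slots": 1}
      ]
    }
  ]
}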
Environment#
Hardware tests automatically set LD_LIBRARY_PATH to include the
rdma-core build tree via the CTest ENVIRONMENT property. No
manual export is needed when running through ctest.
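When running a test binary directly instead, export the path yourself. The location below mirrors the `LIB` variable used in the `xio-tester` examples later in this topic:
export LD_LIBRARY_PATH=build/_deps/rdma-core/install/lib:/opt/rocm/lib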
Shell script runner#
The convenience script scripts/test/test-rdma-ep-xio-loopback.sh
wraps the compiled test-rdma-loopback binary with additional
features:
- Provider selection (`PROVIDER=bnxt|mlx5|ionic|auto`)
- Transfer size configuration (`TRANSFER_SIZE=256`)
- LFSR data-pattern seed (`LFSR_SEED=1`)
- Iteration count (`ITERATIONS=1`)
- RDMA device override (`ROCXIO_RDMA_DEVICE`)
- Auto-detection of build directory, test binary (`TEST_BIN`), and rdma-core library (`RDMA_CORE_LIB`)
# Loopback with BNXT provider, 128 iterations
PROVIDER=bnxt ITERATIONS=128 \
scripts/test/test-rdma-ep-xio-loopback.sh
# Ionic provider, 4 KiB transfers
PROVIDER=ionic TRANSFER_SIZE=4096 \
scripts/test/test-rdma-ep-xio-loopback.sh
# Quick CTest-only run
ctest --preset sweep
xio-tester rdma-ep#
xio-tester rdma-ep runs GPU-initiated RDMA WRITEs with per-iteration
timing statistics and histogram support. It honours --memory-mode
for queue and data buffer placement (see Memory modes, allocation, and coherence in ROCm XIO).
LIB=build/_deps/rdma-core/install/lib:/opt/rocm/lib
# BNXT loopback (128 iterations, 4 KiB)
sudo LD_LIBRARY_PATH="${LIB}" \
HSA_FORCE_FINE_GRAIN_PCIE=1 \
./build/xio-tester rdma-ep \
--provider bnxt \
--device rocm-rdma-bnxt0 \
--loopback --iterations 128 \
--transfer-size 4096
# Ionic loopback
sudo LD_LIBRARY_PATH="${LIB}" \
HSA_FORCE_FINE_GRAIN_PCIE=1 \
./build/xio-tester rdma-ep \
--provider ionic \
--device rocm-rdma-ionic0 \
--loopback --iterations 128 \
--transfer-size 4096
# With data buffer in VRAM (memory-mode bit 3)
sudo LD_LIBRARY_PATH="${LIB}" \
HSA_FORCE_FINE_GRAIN_PCIE=1 \
./build/xio-tester rdma-ep \
--provider bnxt \
--device rocm-rdma-bnxt0 \
--loopback --iterations 128 \
--transfer-size 4096 \
--memory-mode 8
The --device flag selects the RDMA device by name (as shown by
rdma link show). When omitted, topology-based selection picks
the NIC closest to the GPU.
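To see which device names are available for `--device`:
# List RDMA link names and state
rdma link show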
Infinite mode and SIGINT#
Pass --iterations 0 to run indefinitely. Press Ctrl-C to stop
gracefully; the GPU kernel polls a host-mapped stopRequested
flag after each RDMA WRITE completion and exits cleanly.
# Infinite loopback with --less-timing stats
sudo LD_LIBRARY_PATH="${LIB}" \
HSA_FORCE_FINE_GRAIN_PCIE=1 \
./build/xio-tester rdma-ep \
--provider bnxt \
--device rocm-rdma-bnxt0 \
--loopback --iterations 0 \
--less-timing
SIGINT handling is supported by all endpoints: nvme-ep, rdma-ep,
test-ep, and sdma-ep.
Per-iteration data verification#
--verify checks the LFSR data pattern after each RDMA WRITE
completion, not just at the end. Verification runs outside the
timing window so it doesn’t inflate latency measurements. On
mismatch the kernel prints the iteration number and byte offset.
sudo LD_LIBRARY_PATH="${LIB}" \
HSA_FORCE_FINE_GRAIN_PCIE=1 \
./build/xio-tester rdma-ep \
--provider bnxt \
--device rocm-rdma-bnxt0 \
--loopback --iterations 128 \
--transfer-size 256 --verify
GPU configuration for multi-wavefront kernels#
Any endpoint kernel that spans multiple wavefronts (i.e., the thread
block contains more threads than the hardware wavefront size) uses
__syncthreads() barriers to coordinate work across wavefronts.
These barriers prevent the GPU scheduler from preempting the workgroup
mid-execution. Two amdgpu driver behaviours interact badly with
non-preemptible workgroups and must be configured before running long
or infinite multi-wavefront kernels.
For nvme-ep this applies when --batch-size exceeds the
wavefront size (typically 32 on RDNA or 64 on CDNA). Other
endpoints are similarly affected whenever their GPU kernels launch
thread blocks larger than one wavefront.
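To check which wavefront size your GPU reports (the field appears once per agent in the rocminfo output):
# CDNA GPUs typically report 64; RDNA GPUs report 32
rocminfo | grep -i "wavefront size"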
Background on GPU preemption and reset is documented in the
amdgpu module parameters section of the Linux kernel
documentation. The cwsr_enable parameter (Compute Wave Store
and Resume) controls mid-wave preemption support. When a workgroup
holds a __syncthreads() barrier, CWSR can’t save and restore
individual waves, so the entire workgroup becomes non-preemptible.
See also the ROCm system debugging guide for related environment
variables.
Disable GPU power management#
On headless systems the amdgpu driver periodically suspends and resumes
the GPU (every ~20 seconds) via two independent mechanisms: DPM (Dynamic
Power Management) level switching and PCI runtime power management
(runpm). Both are described in the amdgpu module parameters
documentation.
Single-wavefront kernels survive these suspend/resume cycles because
CWSR can preempt and restore them. Multi-wavefront kernels
that hold __syncthreads() barriers can’t be preempted, so a
power-gate cycle terminates the kernel and resets the GPU. A GPU
reset can cause system-wide instability including crashes in
unrelated processes.
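Before changing anything, inspect the current state of both knobs (substituting your card number):
# DPM performance level (default: auto)
cat /sys/class/drm/card1/device/power_dpm_force_performance_level
# PCI runtime PM state ("on" disables runtime suspend)
cat /sys/class/drm/card1/device/power/control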
Both DPM and PCI runtime PM must be disabled. Set them at runtime before launching kernels:
# Set DPM to high performance
echo high | sudo tee \
/sys/class/drm/card1/device/\
power_dpm_force_performance_level
# Disable PCI runtime power management
echo on | sudo tee \
/sys/class/drm/card1/device/power/control
To make runpm persist across reboots, add it to the modprobe configuration
alongside lockup_timeout:
echo "options amdgpu lockup_timeout=-1 runpm=0" \
| sudo tee /etc/modprobe.d/amdgpu-lockup.conf
sudo update-initramfs -u
The `power_dpm_force_performance_level` setting can't be persisted via
modprobe and must be set each session, for example via a systemd
unit or rc.local script.
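A minimal sketch of such a systemd unit, assuming the GPU is card1 and using a hypothetical unit name:
sudo tee /etc/systemd/system/amdgpu-dpm-high.service <<'EOF'
[Unit]
Description=Force amdgpu DPM to high performance

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo high > /sys/class/drm/card1/device/power_dpm_force_performance_level'

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl enable amdgpu-dpm-high.service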
To restore automatic power management afterwards:
echo auto | sudo tee \
/sys/class/drm/card1/device/\
power_dpm_force_performance_level
echo auto | sudo tee \
/sys/class/drm/card1/device/power/control
Note
The card1 path assumes the GPU is the second DRM device.
Check ls /sys/class/drm/ to find the correct card number
for your system.
Set compute lockup timeout to infinity#
The amdgpu driver’s lockup_timeout parameter (default 2000 ms)
resets the GPU if a compute dispatch does not signal its completion
fence within the timeout window. Infinite-mode kernels never
complete by design, and long-running finite kernels may also exceed
the default. The amdgpu module parameters documentation
describes the timeout format and default values.
This parameter is read-only at runtime and must be set at module load time. Create a modprobe configuration file and rebuild the initramfs so the setting takes effect when the amdgpu module loads during boot:
echo "options amdgpu lockup_timeout=-1" \
| sudo tee /etc/modprobe.d/amdgpu-lockup.conf
sudo update-initramfs -u
sudo reboot
Verify after reboot:
cat /sys/module/amdgpu/parameters/lockup_timeout
# Should show: -1
Both settings are required for any endpoint kernel that uses multi-wavefront thread blocks in infinite or long-running mode. Single-wavefront kernels and short-duration tests don’t need them.
Add new tests#
Use the xio_add_test() CMake function defined in
cmake/XIOTestHelpers.cmake:
xio_add_test(
NAME test-my-feature
SOURCE test-my-feature.hip
LABELS unit rdma
TIMEOUT 30
INCLUDE_DIRS ${CMAKE_SOURCE_DIR}/src/my-dir
)
Parameters:
- `NAME` – Test target and CTest name (required)
- `SOURCE` – HIP source file (required)
- `LABELS` – CTest labels for filtering
- `TIMEOUT` – Seconds (defaults by label: unit=60, hardware=300, stress=600, other=120)
- `INCLUDE_DIRS` – Extra include directories
- `EXTRA_ARGS` – Arguments passed to the test binary
- `GPU` – If set, adds resource groups, skip detection, and `LD_LIBRARY_PATH`
For hardware tests that need the RDMA fixture, add after registration:
set_tests_properties(test-my-feature PROPERTIES
FIXTURES_REQUIRED RDMA_HW)
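After re-running the configure step, verify that the new test registered and runs:
# List the test without executing it, then run it
ctest --test-dir build -R "^test-my-feature$" -N
ctest --test-dir build -R "^test-my-feature$" --output-on-failure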
CI integration#
The GitHub Actions workflows run tests as follows:
- **build-check**: `ctest -L "unit"` – CPU-only tests in a `rocm/dev-ubuntu-24.04:7.2` container (no GPU)
- **test-emulate**: `ctest -L "unit"` plus `xio-tester test-ep --emulate` (no GPU, emulation mode)
Hardware and sweep tests are not run in CI – they require physical NIC and GPU hardware.
VM-isolated testing#
Hardware and RDMA tests can trigger kernel panics on bare metal. For a safer alternative that isolates failures inside a QEMU VM, see Run ROCm XIO VM-isolated testing.