ROCm XIO endpoints#
2026-04-27
10 min read time
Endpoints define hardware interfaces and protocols for different IO devices. Each endpoint provides its own queue-entry formats and IO semantics.
List available endpoints#
./build/xio-tester --list-endpoints
test-ep – test endpoint#
A software-only endpoint for validating the XIO framework. No hardware required.
sudo ./build/xio-tester test-ep --verbose
nvme-ep – NVMe endpoint#
Implements NVMe command submission (SQE) and completion (CQE) handling.
Supports Read and Write commands with configurable IO patterns and
doorbell batching. The --access-pattern flag controls LBA ordering
and defaults to random.
export HSA_FORCE_FINE_GRAIN_PCIE=1
sudo ./build/xio-tester nvme-ep \
--controller /dev/nvme0 \
--read-io 8 --verbose
Key features:
Direct GPU-to-NVMe submission via memory-mapped queues
Sequential and pseudo-random LBA access patterns
Configurable queue depth, IO size, and batch size
Multi-queue parallelism with
--num-queues
Doorbell Batching (--batch-size)#
--batch-size controls SQEs per doorbell ring: 1 (default)
submits one SQE at a time, 0 submits all at once, any other
value N batches N SQEs per doorbell.
When N > 1, each SQE is prepared by a separate GPU thread.
Thread 0 acts as a coordinator: it rings the SQ doorbell, polls
CQEs, and rings the CQ doorbell. The kernel launches N + 1
threads, which may span multiple wavefronts within a single thread
block. The upper bound on N is
min(queue_length - 1, maxThreadsPerBlock - 1) (typically 1023
when using the default queue length).
Dynamic shared memory is used for per-thread batch timing, so there is no fixed array ceiling.
# 16 reads, 4 SQEs per doorbell (4 rounds)
sudo ./build/xio-tester nvme-ep \
--controller /dev/nvme0 \
--read-io 16 --batch-size 4
# Multi-wavefront: 128 SQEs per doorbell
sudo ./build/xio-tester nvme-ep \
--controller /dev/nvme0 \
--read-io 256 --batch-size 128 \
--queue-length 256
# Infinite reads, 8 SQEs per doorbell
sudo ./build/xio-tester nvme-ep \
--controller /dev/nvme0 \
--read-io 1 --batch-size 8 \
--infinite --less-timing
Multi-queue parallelism (--num-queues)#
--num-queues N (default 1) creates N independent NVMe I/O
queue pairs. Each queue gets its own GPU kernel on a separate HIP
stream, its own data buffers, and its own doorbell offset. Queue
IDs are allocated as a contiguous range ending at the auto-detected
(or explicit --queue-id) value.
# 2 independent queues, each doing 32 reads
sudo ./build/xio-tester nvme-ep \
--controller /dev/nvme0 \
--read-io 32 --num-queues 2
# 4 queues with batched doorbell writes
sudo ./build/xio-tester nvme-ep \
--controller /dev/nvme0 \
--read-io 64 --num-queues 4 \
--batch-size 16
Timing semantics#
When --batch-size > 1 or --num-queues > 1, per-operation
timing arrays are not meaningful. The endpoint automatically forces
lightweight XioTimingStats mode (min/max/sum/count). For
multi-queue runs, per-queue stats are allocated independently and
aggregated after all kernels finish.
Doorbell modes#
The NVMe endpoint supports two doorbell delivery modes. Choosing the right one depends on whether the NVMe device is a real PCIe endpoint or an emulated device inside a virtual machine.
- Direct BAR0 (default – use with real hardware)
The GPU writes directly to the NVMe controller’s BAR0 doorbell registers. This is the correct mode when the NVMe SSD is passed through to the VM (or bare-metal host) via
vfio-pcior is otherwise a real PCIe endpoint. Direct BAR0 avoids the extra latency of the MMIO bridge and is the lowest-overhead path.The direct path uses
__threadfence_system()for cross-device ordering, which is sufficient on CDNA (MI-series) GPUs. On RDNA (consumer Radeon) GPUs, direct BAR0 doorbell writes have been observed to cause coherence issues in independent testing. If you experience hangs on RDNA hardware, try rebuilding with aggressive ISA-level fencing:cmake -DXIO_DOORBELL_FENCE_AGGRESSIVE=ON ..
- PCI MMIO bridge (emulated devices in VMs only)
Routes doorbell writes through a QEMU virtual PCI device that forwards them to the emulated NVMe controller’s BAR0. This mode is essential when the NVMe device is emulated by QEMU (for example, the built-in
nvmedevice model) because the emulated BAR0 lives in QEMU’s address space and cannot be reached by a direct GPU store. Enable with--pci-mmio-bridge:sudo ./build/xio-tester nvme-ep \ --controller /dev/nvme0 \ --read-io 8 --pci-mmio-bridge
Do not use the PCI MMIO bridge with real NVMe endpoints passed through via vfio-pci. The bridge adds an unnecessary PCIe hop and QEMU processing overhead that hurts latency and throughput with no benefit – real hardware already exposes its BAR0 directly to the GPU.
rdma-ep – RDMA endpoint#
GPU-Direct RDMA endpoint supporting four major vendors:
Vendor |
Hardware |
|---|---|
MLX5 |
Mellanox/NVIDIA ConnectX (IB/RoCE) |
BNXT_RE |
Broadcom NetXtreme RDMA Engine |
IONIC |
Pensando Ionic RDMA (SmartNIC) |
rocm-ernic |
AMD Emulated RDMA NIC |
By default, the endpoint runs in loopback mode, which exercises the
full RDMA path on a single node without requiring a second machine.
Pass --no-loopback to run in two-node mode where a real remote
peer is expected.
Loopback mode (default, single node):
sudo ./build/xio-tester rdma-ep
Two-node mode (requires a remote peer):
sudo ./build/xio-tester rdma-ep --no-loopback
All vendors support GPU-direct doorbell ringing using system-scope atomics. No HSA memory locking is needed (unlike NVMe, RDMA NICs support this pattern natively).
Architecture#
The RDMA endpoint is derived from the GDA (GPU-Direct Access) backend in ROCm/rocSHMEM. Key adaptations from the original rocSHMEM code:
Decoupled from rocSHMEM internals (
HIPAllocator,FreeList, MPI,constants.hpp)Simplified from a full PE mesh to a single-endpoint model (1 QP + 1 CQ per connection instead of
(max_contexts + 1) * num_pesQPs)Wrapped in
rdma_epnamespace with vendor sub-namespacesConsolidated duplicated vendor control flow into shared abstractions
Each vendor provides the same function signatures as static methods
on an Ops class, dispatched at compile time:
rdma_ep::bnxt::Ops::post_wqe_rma(qp, ...)
rdma_ep::mlx5::Ops::post_wqe_rma(qp, ...)
rdma_ep::ionic::Ops::post_wqe_rma(qp, ...)
The active vendor is selected by the CMake options GDA_BNXT,
GDA_MLX5, GDA_IONIC, or GDA_ERNIC.
Two-node RDMA test#
For cross-node testing with two Thor 2 NICs:
bash tests/unit/rdma-ep/run-2node-test.sh \
<server-node> <client-node>
Or manually on each node:
# Node A (server):
./build/tests/unit/rdma-ep/test-rdma-2node --server
# Node B (client):
./build/tests/unit/rdma-ep/test-rdma-2node \
--client --server-host <server-hostname>
The test uses TCP over the management network for QP info exchange and RDMA over the Thor 2 fabric for data transfer.
BNXT DV kernel module#
The BNXT vendor backend requires a patched bnxt_re kernel
module built via DKMS:
sudo kernel/bnxt/setup-bnxt-re-dkms.sh
This downloads stock bnxt_re source, applies patches
0001–0008, and builds/installs via DKMS. Patch 0007 extends the
udata ABI so the DV userspace can pass SQ/RQ buffer VAs through
the write-based verbs path. After installation:
sudo modprobe -r bnxt_re && sudo modprobe bnxt_re
Troubleshooting#
ibv_cmd_create_qp_ex2() failed: 14(EFAULT)The DKMS module was built without the DV QP handling code. The kernel falls back to its own buffer sizing, causing
ib_umem_getto fail. Re-runsetup-bnxt-re-dkms.shand reload the module.ibv_cmd_create_qp_ex2() failed: 22(EINVAL)The DV QP udata patch is missing. Verify with
grep -c DV_QP_ENABLE /usr/src/rocm-xio-bnxt-re-0.1.0-g<rev>/ib_verbs.c(expect >= 1). Re-runsetup-bnxt-re-dkms.shif missing.Could not open libbnxt_re.soLD_LIBRARY_PATHis missing the rdma-core install directory. When running through CTest this is set automatically; for manual runs addbuild/_deps/rdma-core/install/lib.DV Modify QP error: 110(ETIMEDOUT)The IPv4-mapped GID is not yet populated. Verify the IP address is assigned (
ip addr show), the static neighbor entry exists (ip neigh show), and the GID::ffff:c612:0001appears in/sys/class/infiniband/*/ports/1/gids/. Wait a few seconds after module reload for the GID table to populate.
sdma-ep – SDMA endpoint#
GPU-initiated DMA transfers using AMD hardware SDMA engines. Based on the anvil library from AMD’s RAD team.
Hardware requirements#
AMD Instinct GPUs
ROCm 6.0 and later with hsakmt library
P2P mode: multi-GPU system with XGMI/Infinity Fabric
Single-GPU mode: use
--to-host
Usage examples#
Single-GPU (SDMA to pinned host memory):
sudo ./build/xio-tester sdma-ep p2p \
--to-host -n 10 -v
P2P (two GPUs, default GPU 0 to GPU 1):
sudo ./build/xio-tester sdma-ep p2p -n 100 -v
With data verification (LFSR pattern):
sudo ./build/xio-tester sdma-ep p2p \
--to-host --verify -n 10 -v
Larger transfer size:
sudo ./build/xio-tester sdma-ep p2p \
--to-host -n 8 -s 1048576 --verify
Transfer size (-s) accepts bytes or suffixes: 4k, 1M,
2G. Suffixes are power-of-2 (KiB, MiB, GiB). Value must be a
multiple of 4.
Host-side setup (Library API)#
Applications that want to use shader-initiated SDMA from their own GPU kernels (outside the xio-tester) use the three-step host-side setup API:
#include "sdma-ep.h"
// 1. Initialize the SDMA subsystem (HSA + KFD)
sdma_ep::initEndpoint();
// 2. Create a connection (peer access + engine)
sdma_ep::SdmaConnectionInfo conn;
sdma_ep::createConnection(0, 1, &conn);
// 3. Create an SDMA queue
sdma_ep::SdmaQueueInfo qInfo;
sdma_ep::createQueue(0, 1, &qInfo);
// Pass qInfo.deviceHandle to your GPU kernel
myKernel<<<1,1>>>(
static_cast<sdma_ep::SdmaQueueHandle*>(
qInfo.deviceHandle),
dst, src, size);
// Cleanup (nullifies handle; hsakmt/HSA resources
// are released at process exit by AnvilLib destructor)
sdma_ep::destroyQueue(&qInfo);
sdma_ep::shutdownEndpoint();
See the ROCm XIO API reference page for full function signatures and cleanup semantics.
Device-side operations (Kernel API)#
GPU kernels use the sdma_ep:: device-side functions to issue
SDMA transfers, signal completion, and wait for results. All
functions are __device__ __forceinline__ and operate on a
SdmaQueueHandle reference.
#include "sdma-ep.h"
__global__ void myKernel(
sdma_ep::SdmaQueueHandle* handle,
void* dst, void* src, size_t size,
uint64_t* signal) {
// DMA copy (non-blocking)
sdma_ep::put(*handle, dst, src, size);
// Copy with completion signal
sdma_ep::putSignal(
*handle, dst, src, size, signal);
// Wait for remote signal
sdma_ep::waitSignal(signal, 1);
// Wait for all submitted ops
sdma_ep::quiet(*handle);
}
Available operations:
Function |
Description |
|---|---|
|
Linear DMA copy |
|
2D sub-window DMA copy |
|
Atomic increment via SDMA |
|
Copy + signal (batched) |
|
Copy + signal + counter |
|
Spin-poll signal >= expected |
|
Wait for specific op |
|
Wait for all submitted ops |
Limitations#
Hardware-only (no emulation mode)
P2P requires at least two GPUs
Requires hsakmt (KFD kernel driver interface)
Environment variables#
On Radeon GPUs (RX series), set the following before running any tests:
export HSA_FORCE_FINE_GRAIN_PCIE=1
This enables fine-grained memory coherence required for GPU-to-CPU memory visibility. Without it the GPU will encounter page faults when accessing host memory.
On AMD instinct GPUs (MI300X, etc.) this is typically not required.