ROCm XIO performance measurements#
2026-04-27
20 min read time
This page documents RDMA-EP loopback performance measurements collected on a single-node system. All results are from GPU-initiated RDMA WRITE operations with LFSR data verification, measured end-to-end from the GPU kernel (from WQE post to CQE completion).
Test environment#
| Component | Value |
|---|---|
| GPU | AMD Radeon RX 9070 XT |
| CPU | AMD Ryzen Threadripper PRO 7955WX 16-Cores |
| Broadcom NIC | BCM57608 NetXtreme 25G/50G/100G/200G/400G |
| Pensando NIC | AMD Pensando DSC Ethernet Controller |
| OS | Ubuntu 24.04.4 LTS |
| Kernel | 6.17.0-19-generic |
| ROCm | 7.2.0 |
Test methodology#
Each measurement runs test-rdma-loopback in a single-QP loopback
configuration: the NIC sends an RDMA WRITE to itself over a QP connected to its
own GID. The GPU kernel posts a single WQE, rings the doorbell from device code,
then spin-polls the CQ for completion. The GPU wall clock measures the elapsed
time from WQE post to CQE arrival.
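The timing structure of that kernel, reduced to its essentials, looks like the sketch below. This is a minimal illustration using HIP's wall_clock64() device counter; the actual WQE-post and CQ-poll helpers in test-rdma-loopback are only indicated by comments.

```cpp
#include <hip/hip_runtime.h>

// Minimal sketch of the GPU-side measurement pattern: read the wall clock,
// post the WQE and spin-poll the CQ (elided here), read the clock again.
// The post/poll helpers are placeholders for the real test-rdma-loopback code.
__global__ void loopback_probe(long long* elapsed_ticks)
{
    if (threadIdx.x != 0 || blockIdx.x != 0) {
        return;
    }
    long long start = wall_clock64();   // constant-rate wall clock counter

    // post_wqe_and_ring_doorbell();    // build WQE, ring SQ doorbell (device code)
    // while (!cqe_arrived()) { }       // spin-poll the CQ for the completion

    long long end = wall_clock64();
    *elapsed_ticks = end - start;       // convert to time with the wall-clock rate
                                        // (hipDeviceAttributeWallClockRate)
}
```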
All test programs allocate memory through the xio::allocHostMemory /
xio::allocDeviceMemory abstraction rather than calling posix_memalign,
hipHostMalloc, or hipMalloc directly. This ensures the same allocation
flags and pinning semantics used by the production endpoint code paths.
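As a point of reference, a pinned-host allocation of the kind the abstraction wraps can be sketched as below. The function name is hypothetical and the exact flags used by xio::allocHostMemory are an assumption; only the requirement that queues come from pinned, device-visible (and, for IONIC CQs, coherent) host memory comes from this page.

```cpp
#include <hip/hip_runtime.h>
#include <cstddef>
#include <stdexcept>

// Hypothetical stand-in for an xio::allocHostMemory-style helper: pinned,
// device-visible host memory allocated through HIP instead of posix_memalign.
// The hipHostMallocCoherent flag mirrors the coherent-queue requirement
// discussed later on this page; the real xio flags may differ.
inline void* alloc_pinned_host(std::size_t bytes)
{
    void* ptr = nullptr;
    hipError_t err = hipHostMalloc(&ptr, bytes, hipHostMallocCoherent);
    if (err != hipSuccess) {
        throw std::runtime_error(hipGetErrorString(err));
    }
    return ptr;  // caller releases with hipHostFree(ptr)
}
```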
Ten iterations are run per (vendor, transfer size) pair, each with a distinct LFSR seed for data verification. Statistics are computed over the ten successful iterations:
- Min – fastest observed operation
- Mean – arithmetic mean of all iterations
- Std – population standard deviation
- Max – slowest observed operation
Throughput and IOPS are derived from the per-operation latency:
- Throughput = transfer size / latency (MB/s, where 1 MB = 10⁶ bytes)
- IOPS = 10⁶ / latency (ops/s, with latency in microseconds)
Because these are single-operation measurements (not pipelined), the IOPS figures represent the serial round-trip rate. Pipelined or multi-queue workloads will achieve higher aggregate IOPS.
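As a worked example, the 4 KiB mean latency in the BNXT tables below (22.9 us) maps to the reported throughput and IOPS as follows:

```cpp
#include <cstdio>

// Convert a single-operation latency into throughput and IOPS using the
// definitions above (1 MB = 10^6 bytes, latency in microseconds).
int main()
{
    const double size_bytes = 4096.0;  // 4 KiB transfer
    const double latency_us = 22.9;    // BNXT 4 KiB mean latency from the table below

    const double throughput_mb_s = size_bytes / latency_us;  // bytes per us == MB/s
    const double iops            = 1.0e6 / latency_us;       // serial ops per second

    std::printf("%.1f MB/s, %.0f IOPS\n", throughput_mb_s, iops);
    // Roughly 178.9 MB/s and 43,668 IOPS, matching the 4 KiB rows below.
    return 0;
}
```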
Note
Transfer sizes less than 32 bytes cause the GPU kernel to hang on both BNXT and IONIC hardware. The minimum working transfer size for loopback RDMA WRITE is 32 bytes.
Queue memory placement#
The CQ and SQ buffers can reside in either host memory or GPU VRAM. The
--queue-mem host|vram flag on test-rdma-loopback (and the queueMem
field in RdmaEpConfig) selects the placement.
| Mode | Description |
|---|---|
| host | CQ and SQ buffers allocated in host memory |
| vram | CQ and SQ buffers allocated in GPU VRAM |
BNXT always uses VRAM for its CQ (allocated via DMA-BUF UMEM in the DV backend)
regardless of this setting. IONIC uses host coherent memory by default. The
IONIC kernel driver doesn’t currently support VRAM-backed queues through
ib_umem_get.
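For programmatic use, the selection could look like the sketch below. This is a hypothetical illustration: RdmaEpConfig and its queueMem field are named on this page, but the enum type, value names, and any other fields are assumptions, not the real API.

```cpp
// Hypothetical sketch only: the real RdmaEpConfig layout and the type behind
// queueMem are not documented on this page, so these names are placeholders.
enum class QueueMemPlacement { Host, Vram };   // mirrors --queue-mem host|vram

struct RdmaEpConfig {
    QueueMemPlacement queueMem = QueueMemPlacement::Host;
    // ... other endpoint configuration fields ...
};

int main()
{
    RdmaEpConfig cfg;
    cfg.queueMem = QueueMemPlacement::Vram;    // place CQ/SQ buffers in GPU VRAM
    (void)cfg;                                 // hand cfg to the endpoint setup
    return 0;
}
```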
RDMA-EP loopback results#
Broadcom (BNXT) – CQ in VRAM#
Latency#
| Size | Min (us) | Mean (us) | Std (us) | Max (us) |
|---|---|---|---|---|
| 256 B | 21.8 | 23.8 | 4.6 | 37.4 |
| 1 KiB | 21.6 | 22.5 | 0.5 | 23.2 |
| 4 KiB | 22.1 | 22.9 | 0.7 | 24.4 |
| 64 KiB | 37.6 | 40.2 | 1.8 | 43.2 |
| 1 MiB | 150.3 | 153.3 | 1.8 | 156.6 |
Throughput#
| Size | Min (MB/s) | Mean (MB/s) | Std (MB/s) | Max (MB/s) |
|---|---|---|---|---|
| 256 B | 6.8 | 10.8 | 2.1 | 11.7 |
| 1 KiB | 44.1 | 45.5 | 1.0 | 47.4 |
| 4 KiB | 167.9 | 178.9 | 5.5 | 185.3 |
| 64 KiB | 1,517.0 | 1,630.2 | 73.0 | 1,743.0 |
| 1 MiB | 6,695.9 | 6,840.0 | 80.3 | 6,976.6 |
IOPS#
| Size | Min | Mean | Std | Max |
|---|---|---|---|---|
| 256 B | 26,738 | 42,017 | 8,121 | 45,872 |
| 1 KiB | 43,103 | 44,444 | 988 | 46,296 |
| 4 KiB | 40,984 | 43,668 | 1,335 | 45,249 |
| 64 KiB | 23,148 | 24,876 | 1,114 | 26,596 |
| 1 MiB | 6,386 | 6,523 | 77 | 6,653 |
Pensando (IONIC) – CQ in host coherent memory#
Latency#
| Size | Min (us) | Mean (us) | Std (us) | Max (us) |
|---|---|---|---|---|
| 256 B | 14.6 | 15.5 | 0.7 | 17.1 |
| 1 KiB | 14.0 | 15.0 | 0.5 | 15.9 |
| 4 KiB | 16.3 | 16.7 | 0.3 | 17.2 |
| 64 KiB | 22.2 | 22.7 | 0.4 | 23.2 |
| 1 MiB | 96.7 | 97.9 | 0.5 | 98.6 |
Throughput#
| Size | Min (MB/s) | Mean (MB/s) | Std (MB/s) | Max (MB/s) |
|---|---|---|---|---|
| 256 B | 15.0 | 16.5 | 0.7 | 17.5 |
| 1 KiB | 64.4 | 68.3 | 2.3 | 73.1 |
| 4 KiB | 238.1 | 245.3 | 4.4 | 251.3 |
| 64 KiB | 2,824.8 | 2,887.0 | 50.9 | 2,952.1 |
| 1 MiB | 10,634.6 | 10,710.7 | 54.7 | 10,843.6 |
IOPS#
| Size | Min | Mean | Std | Max |
|---|---|---|---|---|
| 256 B | 58,480 | 64,516 | 2,914 | 68,493 |
| 1 KiB | 62,893 | 66,667 | 2,222 | 71,429 |
| 4 KiB | 58,140 | 59,880 | 1,076 | 61,350 |
| 64 KiB | 43,103 | 44,053 | 776 | 45,045 |
| 1 MiB | 10,142 | 10,215 | 52 | 10,341 |
RDMA WRITE with immediate#
The QueuePair API now includes put_nbi_imm() and
put_nbi_imm_single() for RDMA WRITE with Immediate Data
(IBV_WR_RDMA_WRITE_WITH_IMM). Opcode support is wired for all four vendors
(BNXT, IONIC, MLX5, ERNIC) and the --write-imm flag is available in
test-rdma-loopback. Only IONIC currently runs end-to-end with
test-rdma-loopback --write-imm; BNXT and other vendors exit with skip code
77 because their DV-created QPs do not expose ibv_post_recv, which the
WRITE_IMM responder path requires.
Per the InfiniBand specification (section 9.3.3.3), WRITE_IMM is commonly used
as a zero-length notification: the 32-bit immediate value is the entire
payload, posted with num_sge = 0. The NIC delivers the immediate value by
consuming a receive WQE from the responder’s RQ and generating a completion with
the immediate data.
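In plain host-side verbs, the zero-length notification form looks roughly like the sketch below. This is not the GPU-initiated put_nbi_imm() path; QP setup, the remote address, and the rkey are assumed to exist already.

```cpp
#include <infiniband/verbs.h>
#include <arpa/inet.h>
#include <cstdint>
#include <cstring>

// Post a zero-length RDMA WRITE with Immediate: the 32-bit immediate value is
// the whole payload (num_sge = 0). The responder must have a receive WQE
// posted; the NIC consumes it and delivers the value in the completion's
// imm_data field.
int post_zero_length_write_imm(ibv_qp* qp, uint64_t remote_addr,
                               uint32_t rkey, uint32_t imm_value)
{
    ibv_send_wr wr;
    std::memset(&wr, 0, sizeof(wr));
    wr.wr_id               = 1;
    wr.opcode              = IBV_WR_RDMA_WRITE_WITH_IMM;
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.imm_data            = htonl(imm_value);  // immediate travels big-endian
    wr.sg_list             = nullptr;
    wr.num_sge             = 0;                 // zero-length: notification only
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    ibv_send_wr* bad_wr = nullptr;
    return ibv_post_send(qp, &wr, &bad_wr);     // 0 on success
}
```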
Current status:
- IONIC: Functional with hipHostMallocCoherent queue buffers. Zero-length WRITE_IMM completes in ~14–16 us (loopback). Occasional failures (~2/10) on rapid QP number reuse are a firmware timing issue, not a coherence problem.
- BNXT: The DV-created QP does not expose ibv_post_recv. Posting receive WQEs (required for WRITE_IMM) is not supported through the current DV path. WRITE_IMM is skipped on BNXT (exit 77).
Note
The hipHostMallocCoherent fix for the IONIC parent domain allocator was
critical for reliable CQ polling. Without the coherent flag, the GPU L2 cache
served stale CQE data from previous QP allocations, causing ~60% of WRITE_IMM
operations to time out. This matches the nvme-ep queue allocation path which
also uses coherent memory.
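The dependence on coherent CQ memory comes from the device-side poll loop, sketched below with a placeholder CQE layout (the real IONIC CQE format and the xio poll helpers are not shown). The system-scope load is what makes the NIC's DMA-written entry visible instead of a stale cached copy.

```cpp
#include <hip/hip_runtime.h>
#include <cstdint>

// Placeholder CQE layout for illustration only; the real IONIC format differs.
struct FakeCqe {
    uint32_t status;
    uint32_t phase;   // flips when the NIC writes a new completion entry
};

// The CQ buffer is expected to come from hipHostMalloc(.., hipHostMallocCoherent),
// so a system-scope acquire load observes the NIC's write rather than a stale
// line held in the GPU L2 from a previous allocation.
__global__ void wait_for_completion(FakeCqe* cqe, uint32_t expected_phase)
{
    uint32_t phase;
    do {
        phase = __hip_atomic_load(&cqe->phase, __ATOMIC_ACQUIRE,
                                  __HIP_MEMORY_SCOPE_SYSTEM);
    } while (phase != expected_phase);   // spin until the completion lands
}
```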
NVMe-EP smoke test#
The xio-tester nvme-ep smoke test validates the GPU-initiated NVMe read path
end to end: admin queue setup, I/O queue creation, SQE construction from GPU
device code, doorbell ring, CQE polling, and LFSR data verification.
| Property | Value |
|---|---|
| NVMe Device | MTR_SLC_16GB |
| LBA Size | 512 bytes |
| Namespace Capacity | 28,191,632 LBAs (~13.4 GiB) |
| Max Queue ID | 32 |
| PCI BDF | |
Results (4 sequential 512-byte reads, batch size 1):
| Min (us) | Mean (us) | Std (us) | Max (us) |
|---|---|---|---|
| 29.6 | 30.0 | 0.5 | 30.6 |
The unit tests (test-nvme-config, test-nvme-helpers,
test-nvme-hardware) validate struct layout, helper functions, and hardware
queries (LBA size, namespace capacity, SMART log, queue ID enumeration) without
issuing I/O.
NVMe-EP performance#
This section compares GPU-initiated NVMe I/O (via xio-tester nvme-ep)
against CPU-initiated I/O (via fio) on two NVMe devices. The GPU drives the
NVMe submission and completion queues directly from device code, bypassing the
kernel block layer entirely. The fio baseline uses the kernel NVMe driver
with io_uring (QD=32 for throughput) and sync (QD=1 for latency).
NVMe devices under test#
| Property | MTR_SLC (/dev/nvme2n1) | WD_BLACK (/dev/nvme1n1) |
|---|---|---|
| Model | MTR_SLC_16GB | WD_BLACK SN850X 2000GB |
| Firmware | 2.0.1.06 | 620361WD |
| LBA Size | 512 bytes | 512 bytes |
| Capacity | 28,191,632 LBAs (13.4 GiB) | 3,907,029,168 LBAs (1.8 TiB) |
| PCI BDF | | |
| VID:DID | 0x11f8:0xf117 (Microchip) | 0x15b7:0x5030 (Sandisk/WD) |
CTest results#
All NVMe CTests were run on both devices using ROCXIO_NVME_DEVICE.
| Device | Passed | Failed |
|---|---|---|
| MTR_SLC | 23 / 25 | nvme-verify-seq-device-mem-multi-lba, nvme-verify-rand-device-mem (timeout) |
| WD_BLACK | 24 / 25 | nvme-verify-seq-device-mem-multi-lba |
The nvme-verify-seq-device-mem-multi-lba failure occurs on both devices and
indicates a data verification issue with multi-LBA writes (--lbas-per-io 32)
using device memory (memory mode 8). The nvme-verify-rand-device-mem timeout
on MTR_SLC is related to the same device-memory verification path.
fio CPU baseline#
All fio tests used --direct=1 (bypass page cache), 30-second runtime, and
--time_based. Bandwidth / IOPS tests used io_uring with
--iodepth=32; latency tests used sync with --iodepth=1.
MTR_SLC (/dev/nvme2n1) – io_uring QD=32#
| BS | Pattern | IOPS | BW (MB/s) | Lat (us) | p99 (us) | usr% | sys% |
|---|---|---|---|---|---|---|---|
| 512 | randread | 93,251 | 46 | 343 | 106 | 4.2 | 23.4 |
| 512 | seqread | 351,836 | 172 | 91 | 102 | 15.3 | 84.6 |
| 4K | randread | 338,608 | 1,323 | 94 | 105 | 14.4 | 85.4 |
| 4K | seqread | 344,886 | 1,347 | 93 | 103 | 13.3 | 86.5 |
| 64K | randread | 80,864 | 5,054 | 396 | 709 | 3.6 | 47.3 |
| 64K | seqread | 80,864 | 5,054 | 396 | 553 | 3.2 | 47.3 |
| 1M | randread | 5,127 | 5,127 | 6,240 | 11,469 | 0.2 | 25.2 |
MTR_SLC (/dev/nvme2n1) – sync QD=1#
| BS | Pattern | IOPS | BW (MB/s) | Lat (us) | p99 (us) | usr% | sys% |
|---|---|---|---|---|---|---|---|
| 512 | randread | 77,456 | 38 | 12.5 | 17.3 | 4.5 | 28.9 |
| 4K | randread | 73,616 | 288 | 13.2 | 18.3 | 4.0 | 27.4 |
| 64K | randread | 22,594 | 1,412 | 43.8 | 52.0 | 1.2 | 15.7 |
| 1M | randread | 1,611 | 1,611 | 620 | 643 | 0.1 | 7.6 |
WD_BLACK (/dev/nvme1n1) – io_uring QD=32#
| BS | Pattern | IOPS | BW (MB/s) | Lat (us) | p99 (us) | usr% | sys% |
|---|---|---|---|---|---|---|---|
| 512 | randread | 8,204 | 4 | 3,900 | 485 | 0.4 | 1.7 |
| 512 | seqread | 234,273 | 114 | 136 | 247 | 10.3 | 66.1 |
| 4K | randread | 264,744 | 1,034 | 121 | 514 | 13.2 | 56.5 |
| 4K | seqread | 316,102 | 1,235 | 101 | 142 | 12.4 | 76.8 |
| 64K | randread | 74,766 | 4,673 | 428 | 2,605 | 4.0 | 43.3 |
| 64K | seqread | 97,875 | 6,117 | 327 | 717 | 3.7 | 52.2 |
| 1M | randread | 6,730 | 6,730 | 4,754 | 6,849 | 0.3 | 24.4 |
WD_BLACK (/dev/nvme1n1) – sync QD=1#
| BS | Pattern | IOPS | BW (MB/s) | Lat (us) | p99 (us) | usr% | sys% |
|---|---|---|---|---|---|---|---|
| 512 | randread | 17,449 | 9 | 56.7 | 86.5 | 1.4 | 7.3 |
| 4K | randread | 33,931 | 133 | 28.9 | 74.2 | 2.3 | 14.2 |
| 64K | randread | 9,559 | 597 | 104 | 173 | 0.6 | 6.8 |
| 1M | randread | 2,117 | 2,117 | 472 | 823 | 0.2 | 10.2 |
GPU-initiated NVMe (xio-tester)#
All runs used HSA_FORCE_FINE_GRAIN_PCIE=1 and --less-timing. Latency is
measured end-to-end on the GPU wall clock: from SQE post to CQE arrival.
Transfer size is 512 bytes (1 LBA) unless noted.
Memory mode comparison – MTR_SLC#
Single-op latency (128 reads, batch size 1):
| Mode | Min (us) | Avg (us) | Max (us) | Std (us) | Placement |
|---|---|---|---|---|---|
| 0 | 33.8 | 38.4 | 200.0 | 14.3 | SQ host, CQ host, data host |
| 3 | 27.6 | 35.5 | 196.5 | 14.3 | SQ VRAM, CQ VRAM, data host |
| 8 | 27.8 | 41.8 | 214.8 | 15.4 | SQ host, CQ host, data VRAM |
| 11 | 28.0 | 35.1 | 215.1 | 16.0 | SQ VRAM, CQ VRAM, data VRAM |
Batched reads (128 reads, batch size 16):
| Mode | Min (us) | Avg (us) | Max (us) | Std (us) |
|---|---|---|---|---|
| 0 | 30.2 | 188.9 | 524.4 | 32.8 |
| 3 | 41.7 | 193.5 | 521.8 | 32.0 |
| 8 | 41.2 | 209.9 | 477.0 | 27.9 |
| 11 | 31.6 | 203.6 | 523.6 | 32.1 |
Multi-queue (128 reads, batch 16, 4 queues):
| Mode | Min (us) | Avg (us) | Max (us) | Std (us) |
|---|---|---|---|---|
| 0 | 27.5 | 151.5 | 387.0 | 11.8 |
| 3 | 27.2 | 152.8 | 383.8 | 11.6 |
| 8 | 28.4 | 170.5 | 432.0 | 13.2 |
| 11 | 28.0 | 164.2 | 401.7 | 12.1 |
Multi-LBA (128 reads, 8 LBAs/IO = 4 KiB, batch 16):
| Mode | Min (us) | Avg (us) | Max (us) | Std (us) |
|---|---|---|---|---|
| 0 | 40.0 | 198.5 | 453.4 | 26.5 |
| 3 | 41.5 | 212.6 | 471.0 | 27.4 |
| 8 | 40.2 | 207.8 | 525.0 | 31.7 |
| 11 | 41.1 | 193.0 | 483.4 | 29.0 |
Memory mode comparison – WD_BLACK#
Single-op latency (128 reads, batch size 1):
| Mode | Min (us) | Avg (us) | Max (us) | Std (us) | Placement |
|---|---|---|---|---|---|
| 0 | 42.8 | 112.0 | 470.7 | 32.4 | SQ host, CQ host, data host |
| 3 | 50.2 | 315.7 | 9,610 | 824.9 | SQ VRAM, CQ VRAM, data host |
| 8 | 50.1 | 315.6 | 9,568 | 821.2 | SQ host, CQ host, data VRAM |
| 11 | 50.8 | 312.4 | 9,653 | 829.0 | SQ VRAM, CQ VRAM, data VRAM |
Batched reads (128 reads, batch size 16):
| Mode | Min (us) | Avg (us) | Max (us) | Std (us) |
|---|---|---|---|---|
| 0 | 190 | 2,020 | 10,690 | 780 |
| 3 | 210 | 1,920 | 9,950 | 730 |
| 8 | 210 | 1,920 | 10,010 | 730 |
| 11 | 210 | 1,930 | 9,970 | 730 |
Sustained throughput (infinite mode)#
Both devices were run in infinite mode for ~15 seconds with 16 queues, batch size 16, memory mode 3 (SQ/CQ in VRAM), and 512-byte reads:
| Device | Iterations | Avg (us) | Min (us) | Max (us) |
|---|---|---|---|---|
| MTR_SLC (nvme2) | 1,028,434 | 126.8 | 24.8 | 1,838.8 |
| WD_BLACK (nvme1) | 653,969 | 211.1 | 24.5 | 22,085.4 |
Derived sustained performance (16 queues x 512 B):
| Device | Duration | IOPS | BW (MB/s) |
|---|---|---|---|
| MTR_SLC (nvme2) | ~15 s | ~68,562 | ~33.5 |
| WD_BLACK (nvme1) | ~15 s | ~43,598 | ~21.3 |
Write + read verification#
The --write-io N --read-io N mode writes LFSR patterns and then reads them
back. The host-side LFSR verifier runs after the GPU kernel finishes and reports
pass/fail counts. On both devices, 64 writes followed by 64 reads (batch size 1)
completed successfully across all four memory modes.
Note
Pure read tests show Verify Failed counts equal to the number of reads.
This is expected: the LFSR verifier checks read data against a pattern that
was never written by xio-tester, so the comparison always fails for
arbitrary on-disk content.
NVMe-EP hot-path sub-step breakdown#
The --substep-timing flag profiles GPU clock cycles spent in each phase of an NVMe I/O. All runs below used the WD_BLACK device (/dev/nvme1), 128 reads, batch size 1, queue length 1024, a single queue, --less-timing, and 1 LBA per IO (512 bytes). Times are per-IO averages in nanoseconds (GPU wall clock at 100 MHz = 10 ns/tick).
Memory mode comparison#
| Sub-step | Mode 0 FG | Mode 0 | Mode 3 FG | Mode 11 FG | Unit |
|---|---|---|---|---|---|
| SQE build | 334 | 555 | 560 | 560 | ns |
| SQE enqueue | 2,353 | 2,706 | 2,735 | 2,731 | ns |
| SQ doorbell | 507 | 736 | 730 | 732 | ns |
| CQ poll | 10,144 | 10,471 | 10,429 | 10,428 | ns |
| CQ doorbell | 501 | 724 | 738 | 738 | ns |
| Total | 13,838 | 15,192 | 15,190 | 15,188 | ns |
FG = HSA_FORCE_FINE_GRAIN_PCIE=1. Mode 0 =
SQ/CQ/data in host memory. Mode 3 = SQ/CQ in VRAM,
data in host. Mode 11 = SQ/CQ/data all in VRAM.
Observations:
- CQ poll dominates at 68–70% of total per-IO time. This is pure NVMe device latency (controller processes SQE, performs DMA, writes CQE) and cannot be reduced in software.
- SQE enqueue (2.3–2.7 us) is 8 uint64_t wide stores via XioComEnqueue. In mode 0 with fine-grain PCIe enabled, each store to host-coherent memory completes faster due to the always-active coherence protocol (no lazy-to-eager transition penalty).
- Doorbells (~500 ns each with fine-grain) use XioComDoorbell: a single __threadfence_system() followed by __hip_atomic_store with RELEASE/SYSTEM scope (sketched below). This matches the RDMA vendor doorbell pattern and eliminates the post-store fence.
- Fine-grain PCIe helps mode 0 by ~1.4 us because system-scope fences and atomic stores complete faster when the GPU's coherence domain already includes PCIe. Without it, each fence incurs a coherence transition.
- VRAM modes (3, 11) are ~1.4 us slower than mode 0 with fine-grain, even though VRAM stores should be local. With HSA_FORCE_FINE_GRAIN_PCIE=1, VRAM allocations also participate in system coherence, negating the local-store advantage.
- SQE build is only 334–560 ns thanks to PRP pre-computation at kernel start. The LBA hash, pre-computed PRP lookup, and sqeSetup field writes are all register/local operations.
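A condensed sketch of the doorbell sequence referenced above is shown here; the doorbell pointer type and tail value are placeholders rather than the real XioComDoorbell signature.

```cpp
#include <hip/hip_runtime.h>
#include <cstdint>

// One system-wide fence to make the preceding SQE stores visible, then a single
// release store with system scope to the doorbell register. Matches the pattern
// described in the observations above; types are placeholders.
__device__ void ring_doorbell(uint32_t* db_reg, uint32_t new_tail)
{
    __threadfence_system();                        // order prior SQE writes
    __hip_atomic_store(db_reg, new_tail,
                       __ATOMIC_RELEASE,
                       __HIP_MEMORY_SCOPE_SYSTEM); // no trailing fence needed
}
```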
Optimization impact summary#
The following table summarises the cumulative effect of
each optimization on GPU-side per-IO overhead (excluding
CQ poll device latency), measured in mode 0 with
HSA_FORCE_FINE_GRAIN_PCIE=1:
| Optimization | Overhead (ns) | Savings (ns) |
|---|---|---|
| Original (byte-by-byte + 6 fences) | ~20,000 (est.) | – |
| Wide-store SQE enqueue (XioComEnqueue) | ~4,600 | ~15,400 |
| | ~4,100 | ~500 |
| | ~3,700 | ~400 |
The wide-store refactoring (XioComEnqueue) delivered
the largest single improvement: an 8x reduction in SQE
write memory transactions, cutting enqueue time from an
estimated ~17.6 us (64 byte-stores across PCIe) to
~2.4 us (8 wide stores).
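The shape of that refactoring can be sketched as below; the helper name, types, and 64-byte SQE layout here are placeholders for illustration, not the real XioComEnqueue implementation.

```cpp
#include <hip/hip_runtime.h>
#include <cstdint>

// Write a 64-byte NVMe SQE into its submission-queue slot as 8 x uint64_t
// stores instead of 64 single-byte stores, reducing the number of write
// transactions that cross PCIe by 8x.
__device__ void enqueue_sqe_wide(uint64_t* __restrict__ sq_slot,
                                 const uint64_t* __restrict__ sqe)
{
    #pragma unroll
    for (int i = 0; i < 8; ++i) {
        sq_slot[i] = sqe[i];   // one 8-byte store, 64 bytes total
    }
}
```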
Reproducing sub-step measurements#
# Mode 0 with fine-grain PCIe (fastest)
sudo env LD_LIBRARY_PATH=/opt/rocm/lib \
HSA_FORCE_FINE_GRAIN_PCIE=1 \
build/xio-tester nvme-ep \
--controller /dev/nvme1 \
--memory-mode 0 \
--read-io 128 --batch-size 1 \
--queue-length 1024 --num-queues 1 \
--less-timing --lbas-per-io 1 \
--substep-timing
# Mode 0 without fine-grain
sudo env LD_LIBRARY_PATH=/opt/rocm/lib \
build/xio-tester nvme-ep \
--controller /dev/nvme1 \
--memory-mode 0 \
--read-io 128 --batch-size 1 \
--queue-length 1024 --num-queues 1 \
--less-timing --lbas-per-io 1 \
--substep-timing
# Mode 3 (SQ/CQ in VRAM) with fine-grain
sudo env LD_LIBRARY_PATH=/opt/rocm/lib \
HSA_FORCE_FINE_GRAIN_PCIE=1 \
build/xio-tester nvme-ep \
--controller /dev/nvme1 \
--memory-mode 3 \
--read-io 128 --batch-size 1 \
--queue-length 1024 --num-queues 1 \
--less-timing --lbas-per-io 1 \
--substep-timing
CPU utilization: GPU vs CPU#
A key advantage of GPU-initiated NVMe I/O is CPU offload. The GPU kernel constructs SQEs, rings doorbells, and polls CQEs entirely from device code. The CPU is only involved in queue setup and teardown.
fio (CPU-driven, 4K randread, io_uring QD=32):
| Device | IOPS | usr% | sys% | Total% | CPU us/op |
|---|---|---|---|---|---|
| MTR_SLC | 338,608 | 14.4 | 85.4 | 99.8 | 2.95 |
| WD_BLACK | 264,744 | 13.2 | 56.5 | 69.7 | 2.63 |

CPU us/op is derived as (usr% + sys%) / 100 × 10⁶ / IOPS, i.e. the CPU time consumed per completed I/O.
xio-tester (GPU-driven, 512B reads, mode 3):
| Device | IOPS | CPU |
|---|---|---|
| MTR_SLC | ~68,562 | ~0% (GPU-driven, CPU idle) |
| WD_BLACK | ~43,598 | ~0% (GPU-driven, CPU idle) |
The CPU-driven path consumes one full core (99.8% usr+sys on MTR_SLC at 338K IOPS). The GPU-driven path achieves its IOPS with effectively zero CPU overhead, freeing the CPU for other work. At 512 B transfer sizes, the GPU single-op latency (~35 us on MTR_SLC) is higher than the kernel NVMe driver (~12.5 us via sync QD=1), but the GPU path trades latency for CPU offload and can scale across many queues without consuming CPU cores.
Access pattern comparison (MTR_SLC, mode 3, batch 16)#
| Pattern | Min (us) | Avg (us) | Max (us) | Std (us) |
|---|---|---|---|---|
| Sequential | 32.8 | 188.0 | 423.2 | 24.9 |
| Random | 38.9 | 210.8 | 527.6 | 31.9 |
Sequential access is ~10% faster on average than random on the MTR_SLC device, consistent with the device’s internal read-ahead and sequential prefetch logic.
Reproduce these results#
Build with both BNXT and IONIC providers enabled:
cmake -S . -B build \
-DCMAKE_BUILD_TYPE=Release \
-DGDA_BNXT=ON -DGDA_IONIC=ON \
-DBUILD_TESTING=ON
cmake --build build \
--target test-rdma-loopback \
--target install-rdma-core \
--parallel
Run the hardware setup fixture, then execute the sweep for each vendor and transfer size:
# Setup loopback interfaces
sudo VENDOR=all \
scripts/test/setup-rdma-loopback.sh
# Run BNXT sweep (example: 4 KiB, 10 iters)
LIB="build/_deps/rdma-core/install/lib"
LIB="${LIB}:/opt/rocm/lib"
sudo env LD_LIBRARY_PATH="${LIB}" \
build/tests/unit/rdma-ep/test-rdma-loopback \
--provider bnxt \
--device rocm-rdma-bnxt0 \
--size 4096 --seed 1 -n 10
# Run IONIC sweep (example: 4 KiB, 10 iters)
sudo env LD_LIBRARY_PATH="${LIB}" \
build/tests/unit/rdma-ep/test-rdma-loopback \
--provider ionic \
--device rocm-rdma-ionic0 \
--size 4096 --seed 1 -n 10
Queue memory placement can be selected per-run:
# Host coherent (default, used by IONIC)
sudo env LD_LIBRARY_PATH="${LIB}" \
build/tests/unit/rdma-ep/test-rdma-loopback \
--provider ionic \
--device rocm-rdma-ionic0 \
--size 4096 --seed 1 -n 10 \
--queue-mem host
Or use the convenience script which iterates over multiple seeds and computes statistics automatically:
PROVIDER=bnxt TRANSFER_SIZE=4096 LFSR_SEED=1 \
ITERATIONS=10 \
scripts/test/test-rdma-ep-xio-loopback.sh
NVMe-EP smoke test (requires root and an NVMe device that is not the rootfs):
# Unit tests (no hardware required for config/helpers)
build/tests/unit/nvme-ep/test-nvme-config
build/tests/unit/nvme-ep/test-nvme-helpers
# Hardware query test
sudo build/tests/unit/nvme-ep/test-nvme-hardware \
--controller /dev/nvme2
# Full data-path smoke test (4 reads)
sudo build/xio-tester nvme-ep \
--controller /dev/nvme2 \
--read-io 4 --batch-size 1
NVMe-EP performance tests:
# Run all NVMe CTests on a device
sudo ROCXIO_NVME_DEVICE=/dev/nvme2 \
XIO_TESTER=$(pwd)/build/xio-tester \
ctest --test-dir build -L nvme \
--output-on-failure
# GPU-initiated: single-op latency, mode 3
sudo env LD_LIBRARY_PATH=/opt/rocm/lib \
HSA_FORCE_FINE_GRAIN_PCIE=1 \
build/xio-tester nvme-ep \
--controller /dev/nvme2 \
--memory-mode 3 \
--read-io 128 --batch-size 1 --less-timing
# GPU-initiated: sustained throughput
sudo env LD_LIBRARY_PATH=/opt/rocm/lib \
HSA_FORCE_FINE_GRAIN_PCIE=1 \
build/xio-tester nvme-ep \
--controller /dev/nvme2 \
--memory-mode 3 \
--read-io 128 --batch-size 16 \
--num-queues 16 --less-timing --infinite
# fio CPU baseline (io_uring, QD=32)
sudo fio --name=bw \
--filename=/dev/nvme2n1 \
--ioengine=io_uring --direct=1 \
--bs=4k --iodepth=32 --rw=randread \
--runtime=30 --time_based \
--group_reporting --output-format=json
# fio CPU baseline (sync, QD=1 latency)
sudo fio --name=lat \
--filename=/dev/nvme2n1 \
--ioengine=sync --direct=1 \
--bs=512 --iodepth=1 --rw=randread \
--runtime=30 --time_based \
--group_reporting --output-format=json