Memory modes, allocation, and coherence in ROCm XIO#
2026-04-27
8 min read time
This page documents the unified memory allocation API, memory coherence considerations, and the DMA-BUF export architecture in ROCm XIO.
Fine-grained vs. coarse-grained memory#
Coarse-grained memory means that storing to a memory location might only become visible to the CPU once the GPU kernel finishes. Fine-grained memory means stores are visible with system-level coherence (not just at kernel completion).
On consumer Radeon GPUs, fine-grained PCIe memory is only available when the following environment variable is set:
export HSA_FORCE_FINE_GRAIN_PCIE=1
On MI-series GPUs (MI300X, etc.) fine-grained memory is available by default.
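Which mode a given allocation gets is visible in the HSA memory pool flags. As a rough illustration (not ROCm XIO code; HSA init and agent discovery are assumed to have happened already), the global pools of an agent can be classified like this:

#include <hsa/hsa.h>
#include <hsa/hsa_ext_amd.h>
#include <cstdio>

// Callback for hsa_amd_agent_iterate_memory_pools(): report whether each
// global pool on the agent is fine- or coarse-grained.
static hsa_status_t reportPool(hsa_amd_memory_pool_t pool, void*) {
  hsa_amd_segment_t segment;
  hsa_amd_memory_pool_get_info(pool, HSA_AMD_MEMORY_POOL_INFO_SEGMENT, &segment);
  if (segment != HSA_AMD_SEGMENT_GLOBAL)
    return HSA_STATUS_SUCCESS;

  uint32_t flags = 0;
  hsa_amd_memory_pool_get_info(pool, HSA_AMD_MEMORY_POOL_INFO_GLOBAL_FLAGS, &flags);
  if (flags & HSA_AMD_MEMORY_POOL_GLOBAL_FLAG_FINE_GRAINED)
    std::printf("fine-grained global pool\n");
  if (flags & HSA_AMD_MEMORY_POOL_GLOBAL_FLAG_COARSE_GRAINED)
    std::printf("coarse-grained global pool\n");
  return HSA_STATUS_SUCCESS;
}

// Usage: hsa_amd_agent_iterate_memory_pools(gpuAgent, reportPool, nullptr);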
Host memory allocation#
All host allocation in ROCm XIO goes through allocHostMemory() with
XIO_HOST_MEM_* flags:
- `XIO_HOST_MEM_MAPPED` (0x0) – `hipHostMalloc(Mapped)` with `malloc` fallback. Default for queues, timing buffers, and GPU-accessible host memory.
- `XIO_HOST_MEM_COHERENT` (0x1) – adds `hipHostMallocCoherent` for cross-device visibility (for example, 2-node RDMA buffers on MI250).
- `XIO_HOST_MEM_PINNED` (0x2) – `posix_memalign` + `hipHostRegister`. Required for RDMA verbs paths where `ib_umem_get` needs real CPU pages.
- `XIO_HOST_MEM_PLAIN` (0x4) – raw `malloc`. Used for NVMe admin buffers and control structures that don’t need GPU access.
- `XIO_HOST_MEM_DEFAULT` (0x8) – `hipHostMalloc(Default)` without the `Mapped` flag. Used for SDMA host destination buffers.
On MI300X, the host allocation should be UNCACHED.
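As a rough sketch (not the `allocHostMemory()` implementation), the flags map to the following HIP/libc calls per the descriptions above; `size` stands in for the caller's requested byte count, and the calls are shown back to back for illustration only:

#include <hip/hip_runtime.h>
#include <cstdlib>

// Illustrative mapping of each flag to its underlying allocation call.
void hostAllocExamples(size_t size) {
  void* p = nullptr;

  // XIO_HOST_MEM_MAPPED: GPU-accessible, zero-copy host memory.
  hipHostMalloc(&p, size, hipHostMallocMapped);

  // XIO_HOST_MEM_COHERENT: adds coherent cross-device visibility.
  hipHostMalloc(&p, size, hipHostMallocMapped | hipHostMallocCoherent);

  // XIO_HOST_MEM_PINNED: real CPU pages, registered with the HIP runtime.
  posix_memalign(&p, 4096, size);
  hipHostRegister(p, size, hipHostRegisterDefault);

  // XIO_HOST_MEM_PLAIN: raw malloc, never touched by the GPU.
  p = std::malloc(size);

  // XIO_HOST_MEM_DEFAULT: pinned host memory without the Mapped flag.
  hipHostMalloc(&p, size, hipHostMallocDefault);
}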
| Purpose | Flag | Callers |
|---|---|---|
| Queues (host) | `XIO_HOST_MEM_MAPPED` | |
| RDMA data buffer | `XIO_HOST_MEM_PINNED` | |
| BNXT/ERNIC SQ/RQ | `XIO_HOST_MEM_PINNED` | |
| SDMA host dst | `XIO_HOST_MEM_DEFAULT` | |
| NVMe admin | `XIO_HOST_MEM_PLAIN` | |
| Timing buffers | `XIO_HOST_MEM_MAPPED` | |
| 2-node RDMA buffer | `XIO_HOST_MEM_COHERENT` | |
The only remaining raw host allocations are the
ibv_alloc_parent_domain callbacks (constrained by the libibverbs
ABI).
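For reference, the callback shape is fixed by libibverbs, which is why these allocations cannot take `XIO_HOST_MEM_*` flags. A minimal sketch of wiring a custom allocator into a parent domain (simplified; error handling omitted, names illustrative):

#include <infiniband/verbs.h>
#include <cstdlib>

// Allocator callbacks with the signature libibverbs dictates: there is no
// room to thread XIO allocation flags through, hence raw allocations here.
static void* pd_alloc(struct ibv_pd* pd, void* pd_context, size_t size,
                      size_t alignment, uint64_t resource_type) {
  void* p = nullptr;
  posix_memalign(&p, alignment, size);
  return p;
}

static void pd_free(struct ibv_pd* pd, void* pd_context, void* ptr,
                    uint64_t resource_type) {
  free(ptr);
}

// Usage sketch:
//   struct ibv_parent_domain_init_attr attr = {};
//   attr.pd = pd;
//   attr.comp_mask = IBV_PARENT_DOMAIN_INIT_ATTR_ALLOCATORS;
//   attr.alloc = pd_alloc;
//   attr.free = pd_free;
//   struct ibv_pd* parent_pd = ibv_alloc_parent_domain(ctx, &attr);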
Device memory allocation#
All device (VRAM) allocation goes through allocDeviceMemory() with
XIO_DEVICE_MEM_* flags:
- `XIO_DEVICE_MEM_FINE_GRAINED` (0x0) – HSA region alloc with fine-grained coherence. Default for data buffers that need system-level visibility.
- `XIO_DEVICE_MEM_COARSE_GRAINED` (0x1) – HSA region alloc, coarse-grained. Stores only become visible at kernel completion.
- `XIO_DEVICE_MEM_UNCACHED` (0x2) – `hipExtMallocWithFlags(Uncached)`. Required for RDMA CQ buffers and SDMA P2P signal/counter memory.
- `XIO_DEVICE_MEM_VMEM` (0x4) – HIP Virtual Memory Management (reserve + map + access). No device sync on free, per-buffer P2P access.
- `XIO_DEVICE_MEM_HIP` (0x8) – plain `hipMalloc`. DMA-BUF exportable. Used for queues, atomic scratch buffers, and GPU-side data structures.
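As with the host flags, a rough sketch of the underlying calls for the two HIP-backed flags (the fine- and coarse-grained variants go through HSA region allocation instead, as described above):

#include <hip/hip_runtime.h>

// Illustrative only: what the HIP-backed device flags resolve to.
void deviceAllocExamples(size_t size) {
  void* d = nullptr;

  // XIO_DEVICE_MEM_UNCACHED: uncached VRAM for CQ buffers and signal/counter memory.
  hipExtMallocWithFlags(&d, size, hipDeviceMallocUncached);

  // XIO_DEVICE_MEM_HIP: plain hipMalloc; the result is DMA-BUF exportable.
  hipMalloc(&d, size);
}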
| Purpose | Flag | Callers |
|---|---|---|
| Data buffers | `XIO_DEVICE_MEM_FINE_GRAINED` | |
| Queues (device) | `XIO_DEVICE_MEM_HIP` | |
| BNXT/ERNIC CQ | `XIO_DEVICE_MEM_UNCACHED` | |
| SDMA buffers | | |
| Atomic scratch | `XIO_DEVICE_MEM_HIP` | |
| MLX5 BF/UAR | | |
| SDMA Anvil | | |
The only remaining raw HIP/HSA device allocations are the MLX5 BF/UAR
mapping (`hsa_amd_memory_lock_to_pool`) and the
`ibv_alloc_parent_domain` callbacks, which are constrained by the
libibverbs ABI.
Memory mode CLI flags#
The --memory-mode CLI option (0–15) controls placement:
- Bit 0 (`XIO_MEM_MODE_SQ_DEVICE`): submission queue in VRAM
- Bit 1 (`XIO_MEM_MODE_CQ_DEVICE`): completion queue in VRAM
- Bit 2 (`XIO_MEM_MODE_DOORBELL_DEVICE`): reserved. Defined in `xio.h` but not read by any endpoint implementation. The NVMe endpoint routes doorbells via `--pci-mmio-bridge` instead.
- Bit 3 (`XIO_MEM_MODE_DATA_DEVICE`): data buffer in VRAM
When a bit is set, the corresponding buffer is allocated with
allocDeviceMemory() (VRAM); when clear, with
allocHostMemory() (system RAM).
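For illustration, assuming the usual bit N == (1 << N) encoding (the authoritative values live in `xio.h`), a value such as `--memory-mode 9` sets bits 0 and 3, placing the SQ and the data buffer in VRAM while the CQ stays in host RAM:

#include <cstdint>

// Assumed bit values; the real definitions are in xio.h.
constexpr uint32_t XIO_MEM_MODE_SQ_DEVICE       = 1u << 0;
constexpr uint32_t XIO_MEM_MODE_CQ_DEVICE       = 1u << 1;
constexpr uint32_t XIO_MEM_MODE_DOORBELL_DEVICE = 1u << 2;  // reserved, not read
constexpr uint32_t XIO_MEM_MODE_DATA_DEVICE     = 1u << 3;

// --memory-mode 9 == SQ_DEVICE | DATA_DEVICE
bool sqInVram   = 9 & XIO_MEM_MODE_SQ_DEVICE;    // true  -> allocDeviceMemory()
bool cqInVram   = 9 & XIO_MEM_MODE_CQ_DEVICE;    // false -> allocHostMemory()
bool dataInVram = 9 & XIO_MEM_MODE_DATA_DEVICE;  // true  -> allocDeviceMemory()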
NVMe-EP memory mode#
Bits 0, 1, and 3 are honored independently. The SQ and CQ are
allocated by createQueue() based on bits 0 and 1. Bit 2
(XIO_MEM_MODE_DOORBELL_DEVICE) is defined but not read by the
NVMe implementation; doorbell routing is controlled by the
--pci-mmio-bridge CLI flag (mapped to
XioEndpointConfig::pciMmioBridge) instead. Data buffers use bit 3
for host vs. device P2PDMA allocation.
RDMA-EP memory mode#
| Bit | Effect | Details |
|---|---|---|
| 0 | SQ and CQ placement | When set, both the SQ and the CQ are placed in VRAM via `allocDeviceMemory()`. |
| 1 | Reserved for future per-queue CQ control | Currently ignored. A future change could split bit 0 (SQ only) and bit 1 (CQ only) by routing the CQ allocation through bit 1. |
| 2 | Not applicable | RDMA doorbells are always MMIO writes to NIC-mapped BAR regions; memory mode does not apply. |
| 3 | Data buffer placement | When set, the loopback data buffer (src + dst) is allocated with `allocDeviceMemory()` (VRAM). |
Mirrored host+device pairs#
allocDeviceMemoryPair() allocates a host+device pair
(XIO_HOST_MEM_PLAIN + XIO_DEVICE_MEM_HIP) for objects that
are constructed on the host then copied to the GPU with hipMemcpy.
Used for GPU QueuePair staging in the RDMA endpoint.
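A minimal sketch of the pattern this enables (plain `malloc` and `hipMalloc` stand in for the `XIO_HOST_MEM_PLAIN` / `XIO_DEVICE_MEM_HIP` halves of the pair; the `QueuePair` fields here are placeholders):

#include <hip/hip_runtime.h>
#include <cstdint>
#include <cstdlib>

struct QueuePair { uint64_t sqAddr; uint64_t cqAddr; };  // placeholder fields

void stageQueuePair() {
  // Host half of the pair: constructed with normal CPU stores.
  QueuePair* hostQp = static_cast<QueuePair*>(std::malloc(sizeof(QueuePair)));
  hostQp->sqAddr = 0;
  hostQp->cqAddr = 0;

  // Device half of the pair: the GPU-resident copy.
  QueuePair* devQp = nullptr;
  hipMalloc(&devQp, sizeof(QueuePair));

  // Stage the host-built object into VRAM.
  hipMemcpy(devQp, hostQp, sizeof(QueuePair), hipMemcpyHostToDevice);
}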
DMA-BUF export#
ROCm XIO calls hsa_amd_portable_export_dmabuf directly (v1, no
flags) at three call sites, all routed through the centralized
exportDmabuf() wrapper:
- `src/common/ibv-wrapper.cpp` – `IBVWrapper::reg_mr()` exports GPU memory as dmabuf, then calls `ibv_reg_dmabuf_mr` for RDMA MR registration (sketched below). Used for GPU atomic buffers only, not the main data path.
- `src/common/xio-common.hip` – `exportRegVramBuf()` exports VRAM for NVMe physical address resolution via the kernel module ioctl.
- `src/endpoints/rdma-ep/bnxt/bnxt-backend.cpp` – Exports the CQ buffer for BNXT Direct Verbs UMEM registration with `BNXT_RE_DV_UMEM_FLAGS_DMABUF`.
ERNIC doesn’t use dmabuf for CQ allocation; it uses host-pinned
memory (XIO_HOST_MEM_PINNED) with dmabuf_fd = -1.
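A simplified sketch of the export-and-register flow from the first call site above (error handling omitted; the exact arguments ROCm XIO passes are not shown here):

#include <hsa/hsa_ext_amd.h>
#include <infiniband/verbs.h>
#include <cstdint>

ibv_mr* registerVram(ibv_pd* pd, void* gpuPtr, size_t length) {
  // Export the VRAM range as a dmabuf file descriptor (v1 API, no flags).
  int fd = -1;
  uint64_t offset = 0;
  hsa_amd_portable_export_dmabuf(gpuPtr, length, &fd, &offset);

  // Register the exported buffer with the NIC; the GPU VA doubles as the iova.
  ibv_mr* mr = ibv_reg_dmabuf_mr(pd, offset, length,
                                 reinterpret_cast<uint64_t>(gpuPtr), fd,
                                 IBV_ACCESS_LOCAL_WRITE |
                                 IBV_ACCESS_REMOTE_READ |
                                 IBV_ACCESS_REMOTE_WRITE);

  // The fd can typically be closed once the MR holds its own reference;
  // hsa_amd_portable_close_dmabuf() is preferred over a raw close(fd).
  hsa_amd_portable_close_dmabuf(fd);
  return mr;
}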
HSA API v2#
ROCm 7.1.0 ships hsa_amd_portable_export_dmabuf_v2
(/opt/rocm/include/hsa/hsa_ext_amd.h) which adds a flags
argument of type hsa_amd_dma_buf_mapping_type_t:
- `HSA_AMD_DMABUF_MAPPING_TYPE_NONE` (0) – default, identical to v1.
- `HSA_AMD_DMABUF_MAPPING_TYPE_PCIE` (1) – PCIe mapping type, potentially relevant for P2P paths.
The v1 API is documented as equivalent to v2 with
HSA_AMD_DMABUF_MAPPING_TYPE_NONE.
Additionally, hsa_amd_portable_close_dmabuf() provides proper
cleanup of dmabuf file descriptors, replacing raw close(fd) calls.
HIP virtual memory management#
ROCm 7 introduced HIP Virtual Memory Management built on HSA
hsa_amd_vmem_* APIs. Key benefits for ROCm XIO:
- No device sync on free – `hipMemUnmap`/`hipMemRelease`/`hipMemAddressFree` do not synchronize the device, unlike `hipFree`.
- Per-buffer P2P access – `hipMemSetAccess` replaces device-wide `hipDeviceEnablePeerAccess` with per-buffer, per-device access control.
- Dynamic growth without copy – Buffers can be extended by reserving additional VA and mapping new physical pages.
- dmabuf compatibility – `hsa_amd_vmem_export_shareable_handle()` exports vmem allocations as dmabuf file descriptors.
The requestedHandleTypes field must be set to
hipMemHandleTypePosixFileDescriptor at allocation time to enable
later dmabuf export:
// size and gpuId are supplied by the caller.
hipMemAllocationProp prop = {};
prop.type = hipMemAllocationTypePinned;
prop.location.type = hipMemLocationTypeDevice;
prop.location.id = gpuId;
// Request a POSIX FD handle so the allocation can be exported as dmabuf later.
prop.requestedHandleTypes = hipMemHandleTypePosixFileDescriptor;

// Round the requested size up to the allocation granularity.
size_t granularity;
hipMemGetAllocationGranularity(&granularity, &prop,
                               hipMemAllocationGranularityMinimum);
size_t allocSize = ((size + granularity - 1) / granularity) * granularity;

// Create physical memory, reserve a VA range, and map the two together.
hipMemGenericAllocationHandle_t handle;
void* ptr = nullptr;
hipMemCreate(&handle, allocSize, &prop, 0);
hipMemAddressReserve(&ptr, allocSize, 0, nullptr, 0);
hipMemMap(ptr, allocSize, 0, handle, 0);

// Grant this GPU read/write access to the mapped range.
hipMemAccessDesc access = {};
access.location.type = hipMemLocationTypeDevice;
access.location.id = gpuId;
access.flags = hipMemAccessFlagsProtReadWrite;
hipMemSetAccess(ptr, allocSize, &access, 1);
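The matching teardown unmaps the range, releases the physical handle, and frees the reserved VA; unlike hipFree, none of these calls synchronize the device:

hipMemUnmap(ptr, allocSize);
hipMemRelease(handle);
hipMemAddressFree(ptr, allocSize);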
Device-to-host visibility#
If __device__ code stores to host memory, the following are
needed so a host CPU core can see the store:
- `hipHostMalloc()` for the allocation
- `volatile` qualification on the host-side pointer
- `__threadfence_system()` on the device side
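Putting the three together, a minimal sketch (the pointer is assumed to come from `hipHostMalloc` with the `Mapped` flag):

#include <hip/hip_runtime.h>

// Publish a payload to host memory, then set a flag the CPU polls.
__global__ void publish(uint64_t* hostData, volatile uint64_t* hostFlag) {
  *hostData = 42;           // payload store to hipHostMalloc'd memory
  __threadfence_system();   // order the payload before the flag, system-wide
  *hostFlag = 1;            // CPU sees this through its volatile pointer
}

// Host side (sketch):
//   uint64_t* buf;  // buf[0] = data, buf[1] = flag
//   hipHostMalloc(&buf, 2 * sizeof(uint64_t), hipHostMallocMapped);
//   buf[1] = 0;
//   publish<<<1, 1>>>(buf, buf + 1);
//   while (*(volatile uint64_t*)(buf + 1) == 0) { /* spin */ }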
Host-to-device visibility#
If __host__ code stores to host memory that a GPU core reads,
hipHostMalloc() alone is typically sufficient because it marks the
memory as uncached, making CPU writes visible at the system level.
Thread fences#
- `__threadfence()` – sufficient on MI300X for intra-device ordering
- `__threadfence_system()` – needed on Radeon GPUs for cross-device (GPU-to-CPU) ordering
Comparison with rocSHMEM#
rocSHMEM has a full allocator hierarchy (HIPAllocator,
FreeList, Pow2Bins) managing a symmetric heap shared across
PEs. Each allocator knows how to allocate memory and export it as
dmabuf for RDMA registration.
ROCm XIO deliberately removed this abstraction (see
src/endpoints/rdma-ep/README.md: “Decoupled from rocshmem
internals”). ROCm XIO uses a single-endpoint model with no symmetric
heap, making a full allocator hierarchy unnecessary.
Upstream tracking#
Monitor ROCm/rocm-systems#3762 for any breaking changes to the
HSA allocator interface that would require ROCm XIO changes.