Memory modes, allocation, and coherence in ROCm XIO#

2026-04-27

8 min read time

Applies to Linux

This page documents the unified memory allocation API, memory coherence considerations, and the DMA-BUF export architecture in ROCm XIO.

Fine-grained vs. coarse-grained memory#

With coarse-grained memory, a store from a GPU kernel might become visible to the CPU only once the kernel finishes. With fine-grained memory, stores become visible with system-level coherence while the kernel is still running, not just at kernel completion.

On consumer Radeon GPUs, fine-grained PCIe memory is available only when the HSA_FORCE_FINE_GRAIN_PCIE environment variable is set:

export HSA_FORCE_FINE_GRAIN_PCIE=1

On Instinct MI-series GPUs (MI300X and similar), fine-grained memory is available by default.

Host memory allocation#

All host allocation in ROCm XIO goes through allocHostMemory() with XIO_HOST_MEM_* flags:

  • XIO_HOST_MEM_MAPPED (0x0) – hipHostMalloc(Mapped) with malloc fallback. Default for queues, timing buffers, and GPU-accessible host memory.

  • XIO_HOST_MEM_COHERENT (0x1) – adds hipHostMallocCoherent for cross-device visibility (for example, 2-node RDMA buffers on MI250).

  • XIO_HOST_MEM_PINNED (0x2) – posix_memalign + hipHostRegister. Required for RDMA verbs paths where ib_umem_get needs real CPU pages.

  • XIO_HOST_MEM_PLAIN (0x4) – raw malloc. Used for NVMe admin buffers and control structures that don’t need GPU access.

  • XIO_HOST_MEM_DEFAULT (0x8) – hipHostMalloc(Default) without Mapped flag. Used for SDMA host destination buffers.

On MI300X, GPU-accessible host allocations should be uncached (see Host-to-device visibility below).
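As a sketch, a typical call site might look like the following. The exact allocHostMemory() signature is assumed here for illustration; only the function and flag names above come from the actual API.

// Hypothetical call site; signature assumed as (ptr, bytes, flags).
void *queueBuf = nullptr;
allocHostMemory(&queueBuf, queueBytes, XIO_HOST_MEM_MAPPED);

// RDMA verbs path: real CPU pages so ib_umem_get can pin them.
void *rdmaBuf = nullptr;
allocHostMemory(&rdmaBuf, dataBytes, XIO_HOST_MEM_PINNED);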

Host allocation call sites#

| Purpose | Flag | Callers |
|---------|------|---------|
| Queues (host) | XIO_HOST_MEM_MAPPED | xio-common.hip |
| RDMA data buffer | XIO_HOST_MEM_PINNED or XIO_DEVICE_MEM_FINE_GRAINED | rdma-ep.hip (per --memory-mode bit 3) |
| BNXT/ERNIC SQ/RQ | XIO_HOST_MEM_PINNED | bnxt/ernic-backend.cpp |
| SDMA host dst | XIO_HOST_MEM_DEFAULT | sdma-ep.hip |
| NVMe admin | XIO_HOST_MEM_PLAIN | nvme-ep.hip |
| Timing buffers | XIO_HOST_MEM_MAPPED | xio-tester.cpp |
| 2-node RDMA buffer | XIO_HOST_MEM_MAPPED \| COHERENT | rdma-ep.hip |

The only remaining raw host allocations are the ibv_alloc_parent_domain callbacks (constrained by the libibverbs ABI).

Device memory allocation#

All device (VRAM) allocation goes through allocDeviceMemory() with XIO_DEVICE_MEM_* flags:

  • XIO_DEVICE_MEM_FINE_GRAINED (0x0) – HSA region alloc with fine-grained coherence. Default for data buffers that need system-level visibility.

  • XIO_DEVICE_MEM_COARSE_GRAINED (0x1) – HSA region alloc, coarse-grained. Stores only become visible at kernel completion.

  • XIO_DEVICE_MEM_UNCACHED (0x2) – hipExtMallocWithFlags(Uncached). Required for RDMA CQ buffers and SDMA P2P signal/counter memory.

  • XIO_DEVICE_MEM_VMEM (0x4) – HIP Virtual Memory Management (reserve + map + access). No device sync on free, per-buffer P2P access.

  • XIO_DEVICE_MEM_HIP (0x8) – plain hipMalloc. DMA-BUF exportable. Used for queues, atomic scratch buffers, and GPU-side data structures.
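For reference, a sketch of what the XIO_DEVICE_MEM_UNCACHED path reduces to in plain HIP. The wrapper itself belongs to ROCm XIO; hipExtMallocWithFlags and hipDeviceMallocUncached are standard HIP, and cqBytes is illustrative.

#include <hip/hip_runtime.h>

// Uncached VRAM: suitable for buffers the GPU polls while another
// agent (NIC, SDMA engine) writes them, e.g. CQ entries or counters.
void *cqBuf = nullptr;
hipError_t err =
    hipExtMallocWithFlags(&cqBuf, cqBytes, hipDeviceMallocUncached);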

Device allocation call sites#

| Purpose | Flag | Callers |
|---------|------|---------|
| Data buffers | XIO_DEVICE_MEM_FINE_GRAINED | xio-common.hip |
| Queues (device) | XIO_DEVICE_MEM_HIP | xio-common.hip |
| BNXT/ERNIC CQ | XIO_DEVICE_MEM_UNCACHED | bnxt/ernic-backend.cpp |
| SDMA buffers | XIO_DEVICE_MEM_HIP / UNCACHED | sdma-ep.hip |
| Atomic scratch | XIO_DEVICE_MEM_HIP | queue-pair.hip |
| MLX5 BF/UAR | hsa_amd_memory_lock_to_pool (HSA-only) | mlx5-backend.cpp |
| SDMA Anvil | XIO_DEVICE_MEM_HIP / UNCACHED | anvil.hip |

The only remaining raw HIP/HSA device allocations are the MLX5 BF/UAR mapping (hsa_amd_memory_lock_to_pool) and the ibv_alloc_parent_domain callbacks, which are constrained by the libibverbs ABI.

Memory mode CLI flags#

The --memory-mode CLI option (0–15) controls placement:

  • Bit 0 (XIO_MEM_MODE_SQ_DEVICE): submission queue in VRAM

  • Bit 1 (XIO_MEM_MODE_CQ_DEVICE): completion queue in VRAM

  • Bit 2 (XIO_MEM_MODE_DOORBELL_DEVICE): reserved. Defined in xio.h but not read by any endpoint implementation. The NVMe endpoint routes doorbells via --pci-mmio-bridge instead.

  • Bit 3 (XIO_MEM_MODE_DATA_DEVICE): data buffer in VRAM

When a bit is set, the corresponding buffer is allocated with allocDeviceMemory() (VRAM); when clear, with allocHostMemory() (system RAM).
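In code, the decoding described above is a simple bit test. This assumes the XIO_MEM_MODE_* constants from xio.h are bit masks; the surrounding variable names are illustrative.

// --memory-mode 9 == 0b1001: SQ in VRAM (bit 0) and data in VRAM (bit 3).
const bool sqInVram   = memoryMode & XIO_MEM_MODE_SQ_DEVICE;    // bit 0
const bool cqInVram   = memoryMode & XIO_MEM_MODE_CQ_DEVICE;    // bit 1
const bool dataInVram = memoryMode & XIO_MEM_MODE_DATA_DEVICE;  // bit 3
// XIO_MEM_MODE_DOORBELL_DEVICE (bit 2) is defined but not read.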

NVMe-EP memory mode#

Bits 0, 1, and 3 are honored independently. The SQ and CQ are allocated by createQueue() based on bits 0 and 1. Bit 2 (XIO_MEM_MODE_DOORBELL_DEVICE) is defined but not read by the NVMe implementation; doorbell routing is controlled by the --pci-mmio-bridge CLI flag (mapped to XioEndpointConfig::pciMmioBridge) instead. Data buffers use bit 3 for host vs. device P2PDMA allocation.

RDMA-EP memory mode#

| Bit | Effect | Details |
|-----|--------|---------|
| 0 | SQ and CQ placement | Maps to QueueMemMode::DEVICE_VRAM (bit set) vs. HOST_COHERENT (bit clear). Controls the ibv_alloc_parent_domain allocator used by ibv_create_qp_ex for SQ/RQ buffers and is also used by ionic for CQ allocation. BNXT and ERNIC allocate the CQ via their DV API (VRAM, uncached). |
| 1 | Reserved for future per-queue CQ control | Currently ignored. A future change could split bit 0 (SQ only) and bit 1 (CQ only) by routing the resource_type callback in the parent domain allocator, or by switching vendors to explicit per-queue allocation. |
| 2 | Not applicable | RDMA doorbells are always MMIO writes to NIC-mapped BAR regions; memory mode does not apply. |
| 3 | Data buffer placement | When set, the loopback data buffer (src + dst) is allocated with allocDeviceMemory() (fine-grained VRAM). When clear, it is allocated with allocHostMemory() (XIO_HOST_MEM_PINNED). The LFSR fill and verify patterns are staged through a temporary host buffer when data is on device. |

Mirrored host+device pairs#

allocDeviceMemoryPair() allocates a host+device pair (XIO_HOST_MEM_PLAIN + XIO_DEVICE_MEM_HIP) for objects that are constructed on the host then copied to the GPU with hipMemcpy. Used for GPU QueuePair staging in the RDMA endpoint.
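A sketch of the pattern (hostQp/devQp denote the two halves returned by the pair allocation; the field names are illustrative):

// Construct the object on the host half with ordinary CPU stores...
hostQp->sqBase = sqDevPtr;
hostQp->cqBase = cqDevPtr;
// ...then publish the finished object to its VRAM twin in one copy.
hipMemcpy(devQp, hostQp, sizeof(*hostQp), hipMemcpyHostToDevice);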

DMA-BUF export#

ROCm XIO calls hsa_amd_portable_export_dmabuf directly (v1, no flags) at three call sites, all routed through the centralized exportDmabuf() wrapper:

  • src/common/ibv-wrapper.cpp – IBVWrapper::reg_mr() exports GPU memory as dmabuf, then calls ibv_reg_dmabuf_mr for RDMA MR registration (sketched below). Used for GPU atomic buffers only, not the main data path.

  • src/common/xio-common.hip – exportRegVramBuf() exports VRAM for NVMe physical address resolution via the kernel module ioctl.

  • src/endpoints/rdma-ep/bnxt/bnxt-backend.cpp – Exports the CQ buffer for BNXT Direct Verbs UMEM registration with BNXT_RE_DV_UMEM_FLAGS_DMABUF.

ERNIC doesn’t use dmabuf for CQ allocation; it uses host-pinned memory (XIO_HOST_MEM_PINNED) with dmabuf_fd = -1.
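As a sketch of that reg_mr() flow (v1 export with no flags; the variable names and access flags here are illustrative, and error handling is omitted):

#include <hsa/hsa_ext_amd.h>
#include <infiniband/verbs.h>

int fd = -1;
uint64_t offset = 0;
// Export the GPU buffer backing [gpuPtr, gpuPtr + bytes) as a dmabuf fd.
hsa_amd_portable_export_dmabuf(gpuPtr, bytes, &fd, &offset);

// Register the exported buffer as an RDMA memory region.
struct ibv_mr *mr = ibv_reg_dmabuf_mr(
    pd, offset, bytes, (uint64_t)gpuPtr, fd,
    IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ |
        IBV_ACCESS_REMOTE_WRITE);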

HSA API v2#

ROCm 7.1.0 ships hsa_amd_portable_export_dmabuf_v2 (/opt/rocm/include/hsa/hsa_ext_amd.h), which adds a flags argument of type hsa_amd_dma_buf_mapping_type_t:

  • HSA_AMD_DMABUF_MAPPING_TYPE_NONE (0) – default, identical to v1.

  • HSA_AMD_DMABUF_MAPPING_TYPE_PCIE (1) – PCIe mapping type, potentially relevant for P2P paths.

The v1 API is documented as equivalent to v2 with HSA_AMD_DMABUF_MAPPING_TYPE_NONE.

Additionally, hsa_amd_portable_close_dmabuf() provides proper cleanup of dmabuf file descriptors, replacing raw close(fd) calls.
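Assuming the v2 entry point mirrors v1 with the mapping-type flag appended, migrating a call site would look like:

int fd = -1;
uint64_t offset = 0;
// HSA_AMD_DMABUF_MAPPING_TYPE_NONE keeps v1 behavior exactly.
hsa_amd_portable_export_dmabuf_v2(
    gpuPtr, bytes, &fd, &offset,
    HSA_AMD_DMABUF_MAPPING_TYPE_NONE);
// ... use fd ...
hsa_amd_portable_close_dmabuf(fd);  // instead of a raw close(fd)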

HIP virtual memory management#

ROCm 7 introduced HIP Virtual Memory Management built on HSA hsa_amd_vmem_* APIs. Key benefits for ROCm XIO:

  • No device sync on free – hipMemUnmap/hipMemRelease/hipMemAddressFree do not synchronize the device, unlike hipFree.

  • Per-buffer P2P access – hipMemSetAccess replaces device-wide hipDeviceEnablePeerAccess with per-buffer, per-device access control.

  • Dynamic growth without copy – Buffers can be extended by reserving additional VA and mapping new physical pages.

  • dmabuf compatibility – hsa_amd_vmem_export_shareable_handle() exports vmem allocations as dmabuf file descriptors.

The requestedHandleTypes field must be set to hipMemHandleTypePosixFileDescriptor at allocation time to enable later dmabuf export:

#include <hip/hip_runtime.h>

// Handle to the physical allocation and the VA it will be mapped into.
hipMemGenericAllocationHandle_t handle;
void *ptr = nullptr;

hipMemAllocationProp prop = {};
prop.type = hipMemAllocationTypePinned;
prop.location.type = hipMemLocationTypeDevice;
prop.location.id = gpuId;
// Required at allocation time to allow later dmabuf (POSIX fd) export.
prop.requestedHandleTypes =
    hipMemHandleTypePosixFileDescriptor;

// Round the requested size up to the allocation granularity.
size_t granularity;
hipMemGetAllocationGranularity(
    &granularity, &prop,
    hipMemAllocationGranularityMinimum);
size_t allocSize =
    ((size + granularity - 1) / granularity)
    * granularity;

// Create physical memory, reserve a VA range, and map them together.
hipMemCreate(&handle, allocSize, &prop, 0);
hipMemAddressReserve(&ptr, allocSize, 0, 0, 0);
hipMemMap(ptr, allocSize, 0, handle, 0);

// Mapping grants no access by itself; enable read/write for this GPU.
hipMemAccessDesc access = {};
access.location.type = hipMemLocationTypeDevice;
access.location.id = gpuId;
access.flags = hipMemAccessFlagsProtReadWrite;
hipMemSetAccess(ptr, allocSize, &access, 1);
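With requestedHandleTypes set as above, the allocation can later be exported as a POSIX file descriptor. In standard HIP that export is hipMemExportToShareableHandle; a sketch:

int fd = -1;
hipMemExportToShareableHandle(
    &fd, handle, hipMemHandleTypePosixFileDescriptor, 0);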

Device-to-host visibility#

If __device__ code stores to host memory, the following are needed so a host CPU core can see the store:

  • hipHostMalloc() for the allocation

  • volatile qualification on the host-side pointer

  • __threadfence_system() on the device side
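A minimal sketch combining all three requirements (variable and kernel names are illustrative):

#include <hip/hip_runtime.h>

// Device side: fence before the flag store so the payload is ordered first.
__global__ void publish(uint64_t *flag) {
    // ... write payload to host memory ...
    __threadfence_system();  // order prior stores at system scope
    *flag = 1;
}

// Host side: mapped allocation, polled through a volatile pointer.
uint64_t *flagHost = nullptr;
hipHostMalloc(reinterpret_cast<void **>(&flagHost),
              sizeof(uint64_t), hipHostMallocMapped);
*flagHost = 0;
uint64_t *flagDev = nullptr;
hipHostGetDevicePointer(reinterpret_cast<void **>(&flagDev), flagHost, 0);
hipLaunchKernelGGL(publish, dim3(1), dim3(1), 0, 0, flagDev);
volatile uint64_t *flag = flagHost;
while (*flag == 0) { /* wait for the GPU's system-visible store */ }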

Host-to-device visibility#

If __host__ code stores to host memory that a GPU core reads, hipHostMalloc() alone is typically sufficient because it marks the memory as uncached, making CPU writes visible at the system level.

Thread fences#

  • __threadfence() – sufficient on MI300X for intra-device ordering

  • __threadfence_system() – needed on Radeon GPUs for cross-device (GPU-to-CPU) ordering

Comparison with rocSHMEM#

rocSHMEM has a full allocator hierarchy (HIPAllocator, FreeList, Pow2Bins) managing a symmetric heap shared across PEs. Each allocator knows how to allocate memory and export it as dmabuf for RDMA registration.

ROCm XIO deliberately removed this abstraction (see src/endpoints/rdma-ep/README.md: “Decoupled from rocshmem internals”). ROCm XIO uses a single-endpoint model with no symmetric heap, making a full allocator hierarchy unnecessary.

Upstream tracking#

Monitor ROCm/rocm-systems#3762 for any breaking changes to the HSA allocator interface that would require ROCm XIO changes.