HIP Device Properties and Topology on CDNA Architectures

HIP Device Properties and Topology on CDNA Architectures#

Understanding GPU device properties is essential for writing performant and scalable HIP applications. Modern AMD GPUs such as MI300 (CDNA3) expose a a hierarchical, chiplet-based topology with multiple layers of compute, cache, and memory resources.

Key architectural components include:

XCCs (Accelerated Compute Cores)
XCDs (Compute Dies)
Partitioned L2 cache regions
High-bandwidth memory (HBM) channels and subsystems.
NUMA-like HBM memory domains

This document combines conceptual explanations with HIP and HSA examples to help developers understand and leverage these hardware characteristics.

Basic HIP Device Properties#

Use HIP runtime APIs to query fundamental device attributes:

hipDeviceProp_t props;
hipGetDeviceProperties(&props, 0);

printf("Device: %s\n", props.name);
printf("Global Memory: %zu\n", props.totalGlobalMem);
printf("Compute Units: %d\n", props.multiProcessorCount);
printf("L2 Cache Size: %zu\n", props.l2CacheSize);

Note

HIP exposes aggregate properties at the device level. Detailed topology (XCC/XCD/NUMA) is not directly exposed. Additional insight requires HSA APIs or profiling tools.

CDNA Topology Overview#

CDNA GPUs employ a chiplet-based architecture:

Multiple XCDs (compute dies) per package
Each XCD contains one or more XCCs
L2 cache is distributed and partitioned across XCD/XCC units
Multiple HBM stacks form NUMA-like memory domains

These characteristics influence memory locality, scheduling behavior, and overall performance.

XCC (Accelerated Compute Core)#

An XCC is the fundamental execution unit in CDNA architectures.

Each XCC includes:

Compute Units (CUs)
Command Processor (CP)
Run List Controller (RLC)
Cache subsystems (TCC)

Related architectural elements:

XCD: Physical die containing one or more XCCs
XCP: Logical compute partition consisting of one or more XCCs
AID: Interconnect die linking XCDs and enabling high-bandwidth communication

XCC Partitioning Modes#

XCC partitioning is configured at system initialization:

SPX (Single Partition): All XCCs form a single GPU
TPX (Tile Partition): XCDs grouped into multiple logical GPUs
CPX (Core Partition): Each XCC exposed as an independent GPU

Example:

In CPX mode on MI300X: 8 XCCs ⇒ 8 HIP devices

Partitioning directly affects how applications perceive and utilize hardware resources.

XCC in HIP Runtime#

Device visibility:

Each compute partition (XCP) appears as an independent HIP device

Driver-level register access example:

// Access a specific XCC
RegUtility* xcc0 = proc->Get(REG_ACCESS_GC_0);

// Broadcast to all XCCs in the partition
RegUtility* all = proc->Get(REG_ACCESS_GC_BROADCAST);

Key runtime identifiers:

Virtual_XCC_ID: identifies an XCC instance
NUM_XCC_IN_XCP: number of XCCs in a partition

Compute Dies (XCDs)#

XCDs aggregate XCCs into higher-level compute domains.

Approximate XCC/XCD count using HSA:

#include <hsa/hsa.h>

int gpu_agent_count = 0;

hsa_status_t callback(hsa_agent_t agent, void* data)
{
    hsa_device_type_t type;
    hsa_agent_get_info(agent, HSA_AGENT_INFO_DEVICE, &type);

    if (type == HSA_DEVICE_TYPE_GPU)
        (*(int*)data)++;

    return HSA_STATUS_SUCCESS;
}

hsa_init();
hsa_iterate_agents(callback, &gpu_agent_count);

printf("Approx compute partitions: %d\n", gpu_agent_count);

Take MI300X as an example, the out put will be:

Approx compute partitions: 8

Interpretation:

CPX mode: one agent ≈ one XCC
SPX mode: one agent ≈ full device

L2 Cache Regions#

HIP provides only total L2 cache size:

size_t total_l2 = props.l2CacheSize;

Notes:

L2 cache is partitioned across XCC/XCD domains
Per-region size must be inferred
Cross-domain accesses may increase latency

NUMA Memory Domains#

HBM memory behaves similarly to NUMA systems.

Common configurations:

NPS1: Fully interleaved memory (maximum bandwidth)
NPS4 / NPS8: Localized memory (improved latency locality)

Implications:

Memory placement affects performance
Remote memory accesses incur higher latency

Memory Channels#

The number of memory channels is not directly exposed.

Estimate using memory bus width:

int channel_width_bits = 128;  // architectural assumption
int channels = props.memoryBusWidth / channel_width_bits;

printf("Estimated memory channels: %d\n", channels);

Atomic Throughput#

Atomic throughput limits must be measured empirically.

__global__ void atomicKernel(int* data)
{
    int idx = threadIdx.x;
    atomicAdd(&data[idx % 1024], 1);
}

Guidelines:

Avoid contention on a single memory address
Distribute atomics across multiple locations
Use hierarchical reductions to reduce pressure on global memory

Multi-XCC Profiling#

In multi-XCC configurations:

Each XCC may execute workloads at different times
Profiling tools often report only master XCC timing

Optimization Strategy#

To optimize performance on CDNA GPUs such as MI300:

Partition workloads across XCCs/XCDs where possible
Apply NUMA-aware memory allocation strategies
Align data structures to cache line boundaries
Minimize cross-XCC communication and synchronization
Balance memory traffic across channels
Benchmark atomic scalability under load

Summary#

Topology-aware programming using XCC, XCD, memory hierarchy, and HIP APIs is critical for achieving peak performance on CDNA GPUs such as MI300.

To fully exploit the architecture, developers should combine:

HIP runtime APIs
HSA topology queries
Empirical benchmarking

to understand and optimize hardware utilization.