System Speed-of-Light#

This page documents System Speed-of-Light metrics for RDNA3.5 (gfx115x) with the shipped gfx115x analysis configuration. System Speed-of-Light is a high-level summary: it highlights the most important metrics for how your workload is performing on the target GPU, so you can spot bottlenecks before diving into block-specific panels.

The VALU FLOPs rows use aggregate VALU instruction counters, which include FP16, FP32, and FP64 contributions together. WMMA instructions follow a separate ISA path and are not broken out in this panel.

Wavefronts and FLOPs accounting#

RDNA 3.5 supports both Wave32 (the typical primary mode) and Wave64 wavefronts. The wavefront size is fixed per kernel at compile time. The profiler reads $wave_size from the hardware specs reported by rocminfo, but a given kernel may have been compiled for the other size. When that happens, treat the peak rates here as approximations. The VALU FLOPs row in this panel scales as wave_size * SQ_INSTS_VALU_sum / time.
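As a rough illustration of the scaling above, the sketch below converts an aggregate VALU instruction count into GFLOP/s. The function name, the sample counter value, and the elapsed time are hypothetical; the shipped analysis configuration may apply further factors that are not shown here.

```python
def valu_gflops(sq_insts_valu_sum: int, wave_size: int, elapsed_s: float) -> float:
    """Estimate VALU GFLOP/s as wave_size * SQ_INSTS_VALU_sum / time.

    Sketch only: one operation per lane per VALU instruction is assumed.
    """
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return wave_size * sq_insts_valu_sum / elapsed_s / 1e9


# Hypothetical counter value and kernel duration, for illustration only.
insts = 1_000_000_000
as_wave32 = valu_gflops(insts, wave_size=32, elapsed_s=0.5)
as_wave64 = valu_gflops(insts, wave_size=64, elapsed_s=0.5)
print(as_wave32, as_wave64)
```

The same counter value yields twice the FLOPs estimate under Wave64, which is why a mismatch between $wave_size and the size a kernel was compiled for makes these rows approximate.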

Dual-issue VALU (VOPD)#

RDNA 3 and RDNA 3.5 add a dual-issue path to the VALU. A pair of independent VALU operations can be encoded into a single instruction and issued together in the same cycle. The RDNA ISA refers to this encoding as VOPD. Hardware accepts the pairing only when register, opcode, and operand constraints are satisfied, and the compiler emits VOPD when those constraints hold.

When a VOPD pair issues successfully, peak VALU throughput per CU per cycle doubles relative to single-issue: FP32 goes from 128 to 256 FLOPs/CU/cycle, and packed FP16 from 256 to 512 FLOPs/CU/cycle.

The aggregate VALU FLOPs row in this panel uses the FP32 single-issue FMA ceiling (128 FLOPs/CU/cycle) as the peak. A workload that issues VOPD heavily, or that runs packed FP16, can therefore report a percentage of peak above 100%. To estimate how much throughput comes from dual-issue paths, compare VALU FLOPs against the ceilings below and check VOPD pairing in the ISA.
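A small sketch of why the percentage can exceed 100%: the denominator stays at the FP32 single-issue ceiling even when the workload uses faster paths. The helper name and the achieved rates below are hypothetical inputs, not profiler output.

```python
FP32_SINGLE_ISSUE_PEAK = 128  # FLOPs/CU/cycle, the ceiling used by this panel


def percent_of_peak(achieved_flops_per_cu_cycle: float,
                    peak: float = FP32_SINGLE_ISSUE_PEAK) -> float:
    """Return achieved throughput as a percentage of the given ceiling."""
    return 100.0 * achieved_flops_per_cu_cycle / peak


# Hypothetical achieved rates, in FLOPs/CU/cycle.
single_issue_case = percent_of_peak(96)    # below the single-issue ceiling
dual_issue_case = percent_of_peak(192)     # only reachable with VOPD or packed FP16
print(single_issue_case, dual_issue_case)
```

Re-running the same calculation with peak=256 (the FP32 VOPD ceiling) gives a sense of how close a dual-issue-heavy kernel is to its realistic limit.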

Peak theoretical VALU rates#

The values below are per CU, before multiplying by the CU count and the shader clock. They anchor the percentage of peak reported for the FLOPs-related rows.

  • FP32 FMA, single-issue: 128 FLOPs/CU/cycle. Two SIMD32 units per CU ($simd_per_cu), 32 lanes per wavefront ($wave_size = 32), multiplied by two for FMA (2 * 32 * 2 = 128).

  • FP32 VOPD dual-issue: 256 FLOPs/CU/cycle. Paired dual-VALU instructions such as V_DUAL_ADD_F32 and V_DUAL_MUL_F32.

  • FP16 packed FMA, single-issue: 256 FLOPs/CU/cycle. Packed FP16 runs at twice the FP32 rate.

  • FP16 packed VOPD dual-issue: 512 FLOPs/CU/cycle. Dual-issue packed FP16 pairings.

  • FP64 FMA: 4 FLOPs/CU/cycle. RDNA 3.5 runs FP64 FMA at 1/32 of the FP32 single-issue FMA rate.

  • Mixed FP32 single-issue with FP16 packed single-issue (illustrative ceiling): about 384 FLOPs/CU/cycle combined (128 + 256).
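The per-CU ceilings in the list above can be rebuilt from the base parameters it quotes. This is a sketch of that arithmetic only; the constant and key names are illustrative and do not come from the profiler.

```python
# Per-CU parameters quoted in the list above (RDNA 3.5, Wave32).
SIMD_PER_CU = 2      # $simd_per_cu
WAVE_SIZE = 32       # $wave_size
FLOPS_PER_FMA = 2    # one multiply plus one add

fp32_single = SIMD_PER_CU * WAVE_SIZE * FLOPS_PER_FMA

ceilings = {
    "fp32_fma_single_issue": fp32_single,
    "fp32_vopd_dual_issue": 2 * fp32_single,       # dual-issue doubles the rate
    "fp16_packed_single_issue": 2 * fp32_single,   # packed FP16 is twice FP32
    "fp16_packed_vopd_dual_issue": 4 * fp32_single,
    "fp64_fma": fp32_single // 32,                 # 1/32 of the FP32 rate
}

for name, flops_per_cu_cycle in ceilings.items():
    print(f"{name}: {flops_per_cu_cycle} FLOPs/CU/cycle")
```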

Scaling and clocks#

$cu_per_gpu is the total CU count from system info, not the WGP count. On RDNA 3.5, each WGP pairs two CUs, so the CU count is roughly twice the WGP count. The relationship is only approximate because $cu_per_gpu reflects the active CUs discovered at runtime rather than a fixed WGP count doubled. Peak FLOPs scale with the CU count. $max_sclk is the shader/engine clock in MHz from the profiler system specs.
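To make the scaling concrete, the sketch below turns a per-CU ceiling into a device-wide peak in GFLOP/s. The CU count and clock are hypothetical placeholders, not the specification of any particular gfx115x part; substitute the values reported for your system.

```python
def peak_gflops(flops_per_cu_cycle: float, cu_per_gpu: int, max_sclk_mhz: float) -> float:
    """Device-wide peak: per-CU ceiling * CU count * shader clock.

    max_sclk_mhz is in MHz, so the product is in MFLOP/s; divide by 1000 for GFLOP/s.
    """
    return flops_per_cu_cycle * cu_per_gpu * max_sclk_mhz / 1000.0


# Hypothetical system values, for illustration only.
cu_per_gpu = 16
max_sclk_mhz = 2900.0

fp32_peak = peak_gflops(128, cu_per_gpu, max_sclk_mhz)
fp64_peak = peak_gflops(4, cu_per_gpu, max_sclk_mhz)
print(f"FP32 single-issue peak: {fp32_peak:.1f} GFLOP/s")
print(f"FP64 peak: {fp64_peak:.1f} GFLOP/s")
```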

Bandwidth and cache rows#

The throughput rows for GL0 (TCP Cache), GL1, GL2, and SQC use heuristic ceilings that are not anchored to a single public RDNA 3.5 table, so the percentage of peak reported for these rows is indicative rather than exact. The memory hierarchy runs GL0 (TCP Cache) -> GL1 -> GL2 -> system memory via GCEA. Each level’s ceiling is one peak transfer per cycle, scaled by instance count and $max_sclk:

  • GL0 (TCP Cache): one 128-byte cacheline per cycle per CU – $cu_per_gpu * 128 B/cycle * $max_sclk

  • GL1: one 128-byte request per cycle per GL1C instance – ($cu_per_gpu / 8) * 128 B/cycle * $max_sclk

  • GL2: one 128-byte request per cycle per L2 bank – $total_l2_chan * 128 B/cycle * $max_sclk

  • SQC (scalar data cache and instruction cache): one 64-byte TC request per cycle per SQC instance – $sqc_per_gpu * 64 B/cycle * $max_sclk
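The four ceilings above share one shape: instance count times bytes per cycle times clock. The sketch below evaluates them in bytes per second. All input values are hypothetical, and the function name is not part of the profiler.

```python
def cache_bw_ceilings(cu_per_gpu: int, total_l2_chan: int,
                      sqc_per_gpu: int, max_sclk_mhz: float) -> dict:
    """Heuristic peak bandwidth per cache level, in bytes per second."""
    cycles_per_s = max_sclk_mhz * 1e6  # MHz -> cycles per second
    return {
        "GL0": cu_per_gpu * 128 * cycles_per_s,        # one 128 B line per CU
        "GL1": (cu_per_gpu / 8) * 128 * cycles_per_s,  # one request per GL1C instance
        "GL2": total_l2_chan * 128 * cycles_per_s,     # one request per L2 bank
        "SQC": sqc_per_gpu * 64 * cycles_per_s,        # one 64 B request per SQC
    }


# Hypothetical system values, for illustration only.
bw = cache_bw_ceilings(cu_per_gpu=16, total_l2_chan=4, sqc_per_gpu=8, max_sclk_mhz=2000.0)
for level, bytes_per_s in bw.items():
    print(f"{level}: {bytes_per_s / 1e9:.1f} GB/s")
```

Because these ceilings are heuristic, a measured bandwidth divided by one of them is best read as a relative indicator rather than an exact utilization figure.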

Note

For AMD Instinct accelerators (CDNA through CDNA4), see the System Speed-of-Light page for those architectures.

Other gfx115x metric tables follow the RDNA3 hierarchy in RDNA3 (shader engine: Workgroup Manager (SPI), Workgroup processor (WGP), GL0 (TCP Vector Cache), GL1; then GL2, GCEA, Command processor (CP), Graphics Register Bus Manager (GRBM)).

Warning

Theoretical peaks use the maximum clock frequency reported for the GPU (for example via rocminfo). That may not match sustained clocks under your workload.

| Metric | Description | Unit |
|---|---|---|
| VALU FLOPs | Floating-point operations per second on the VALU. Peak is based on FP32 FMA single-issue (128 FLOPs/CU/cycle). VOPD dual-issue doubles throughput: FP32 to 256, FP16 packed to 512 FLOPs/CU/cycle. Uses aggregate instruction counter. | GFLOP/s |
| VALU FLOPs (F64) | 64-bit floating-point operations per second on the VALU. Peak is 4 FLOPs/CU/cycle (FP32 FMA rate / 32). Uses dedicated FP64 instruction counter (SQ_INSTS_VALU_DP_sum). FP64 is 1/32 rate of FP32 on RDNA. | GFLOP/s |
| IPC | Instructions executed per cycle, measuring overall instruction throughput efficiency. Higher IPC indicates better utilization of the shader pipeline. Low IPC may indicate stalls due to memory latency, register dependencies, or insufficient wavefront occupancy to hide latencies. | Instr/cycle |
| Wavefront Occupancy | Average number of wavefronts resident on the GPU during kernel execution. Higher occupancy provides more opportunities to hide memory latency through wavefront switching. However, maximum occupancy is not always optimal. Excessive register or LDS usage per wavefront may limit occupancy. | Wavefronts |
| GL2 Cache Hit Rate | Percentage of GL2 cache requests that are serviced from cache without accessing system memory. Higher hit rates reduce memory bandwidth pressure and improve performance. Low hit rates may indicate poor data locality or working sets that exceed cache capacity. | Percent |
| GL2-Fabric Read BW | Read bandwidth between the GL2 cache and the memory fabric/system memory. High bandwidth utilization indicates memory-intensive read patterns. This metric helps identify when memory bandwidth is a performance bottleneck. Requires $max_mclk populated in sysinfo (max MEM clock from amd-smi). | Bytes/s |
| GL2-Fabric Write BW | Write bandwidth between the GL2 cache and the memory fabric/system memory. High values indicate write-intensive workloads. Combined with read bandwidth, this shows total memory traffic leaving the GPU caches. | Bytes/s |
| LDS Bank Conflicts | Number of Local Data Share bank conflicts during execution. Bank conflicts occur when multiple work-items in a wavefront access the same LDS bank simultaneously, causing serialized accesses. Reducing conflicts improves shared memory throughput. | Conflicts |
| Wave Dependency Wait | Percentage of wavefront execution time spent waiting for data dependencies to resolve, such as pending memory operations or prior instruction results. High values indicate latency-bound execution that could benefit from increased occupancy or reduced dependency chains. | Percent |
| Wave Issue Wait | Percentage of wavefront execution time spent waiting for instructions to become ready for issue. High values may indicate instruction cache misses, complex control flow, or pipeline stalls preventing instruction dispatch. | Percent |
| GL0 Cache Hit Rate (TCP Cache) | Percentage of GL0 vector cache (TCP Cache) requests that hit in cache. The TCP cache is the first-level cache for vector memory operations. High hit rates reduce traffic to the GL1 cache and improve memory access latency. | Percent |
| GL0 Cache BW | Bandwidth of requests to the GL0 vector cache (TCP Cache). This represents the rate at which the shader cores are requesting data from the cache hierarchy. High bandwidth indicates memory-intensive workloads. | Bytes/s |
| GL1 Cache Hit Rate | Percentage of L1 cache (GL1C) requests that hit in cache. GL1C is shared across multiple workgroup processors within a shader engine. Higher hit rates reduce traffic to the L2 cache and improve performance. | Percent |
| GL1 Cache Read BW | Read bandwidth between L1 (GL1C) and L2 (GL2C) caches. This represents cache miss traffic from L1 that must be serviced by L2. Lower values indicate better L1 cache efficiency. | Bytes/s |
| GL2 Cache BW | Bandwidth of read operations at the L2 cache level. This metric shows overall L2 cache utilization and helps identify memory subsystem bottlenecks. | Bytes/s |
| Scalar Data Cache Hit Rate | Percentage of scalar data cache accesses that hit in cache. The scalar cache stores uniform data accessed by scalar instructions. High hit rates indicate efficient reuse of constant and uniform data across wavefronts. | Percent |
| Scalar Data Cache BW | Bandwidth of scalar data cache requests to the GL1 cache. Scalar operations load uniform data shared across all threads in a wavefront, making efficient caching critical for performance. | Bytes/s |
| Instruction Cache Hit Rate | Percentage of instruction fetch requests that hit in the instruction cache. High hit rates are essential for sustained instruction throughput. Low hit rates may indicate large kernel code size or divergent control flow. | Percent |
| Instruction Cache BW | Bandwidth of instruction cache requests to the GL1 cache. High values may indicate instruction cache pressure from large kernels or frequent instruction cache misses. | Bytes/s |