System Speed-of-Light

This page lists System Speed-of-Light metrics for RDNA3.5 (gfx1151) when you profile with the shipped gfx1151 analysis configuration. The same metric keys appear in the RDNA3.5 (gfx1151) tab of the analysis report.

Note

For AMD Instinct accelerators (CDNA through CDNA4), see the System Speed-of-Light page for those architectures.

Other gfx1151 metric tables grouped by hardware block live under Workgroup processor (WGP), GL0, GL1, GL2, Shader engine, and Command processor (CP).

Warning

Theoretical peaks use the maximum clock frequency reported for the GPU (for example via rocminfo). That may not match sustained clocks under your workload.
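To see why the clock choice matters, the sketch below evaluates the VALU peak formulas from the table below (max_sclk (MHz) × $cu_per_gpu × 128 / 1000 for FP32, and × 256 for FP16) at two different clocks. The function names, the CU count, both clock values, and the achieved throughput are hypothetical inputs chosen for illustration; they are not values reported by the profiler.

```python
# Per-CU, per-cycle factors taken from the peak formulas in the metric table.
FP32_FACTOR = 128
FP16_FACTOR = 256  # packed FP16 at 2x the FP32 per-CU rate


def valu_peak_gflops(sclk_mhz, cu_per_gpu, factor):
    """Theoretical VALU peak in GFLOP/s: sclk (MHz) x CUs x factor / 1000."""
    return sclk_mhz * cu_per_gpu * factor / 1000.0


def pct_of_peak(achieved_gflops, peak_gflops):
    """Achieved throughput as a percentage of a theoretical peak."""
    return 100.0 * achieved_gflops / peak_gflops


# Hypothetical inputs, for illustration only.
CU_PER_GPU = 40               # total CUs (two per WGP), not the WGP count
MAX_SCLK_MHZ = 2900.0         # assumed maximum clock reported for the GPU
SUSTAINED_SCLK_MHZ = 2300.0   # assumed clock actually held under load
ACHIEVED_FP32_GFLOPS = 5000.0 # assumed measured FP32 throughput

peak_at_max = valu_peak_gflops(MAX_SCLK_MHZ, CU_PER_GPU, FP32_FACTOR)
peak_at_sustained = valu_peak_gflops(SUSTAINED_SCLK_MHZ, CU_PER_GPU, FP32_FACTOR)

print(f"FP32 peak at max clock:       {peak_at_max:.0f} GFLOP/s")
print(f"FP32 peak at sustained clock: {peak_at_sustained:.0f} GFLOP/s")
print(f"% of peak (max clock):        {pct_of_peak(ACHIEVED_FP32_GFLOPS, peak_at_max):.1f}")
print(f"% of peak (sustained clock):  {pct_of_peak(ACHIEVED_FP32_GFLOPS, peak_at_sustained):.1f}")
```

Because the report uses the maximum clock, the percent-of-peak figure it shows is the lower of the two numbers whenever the GPU sustains less than that clock.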

Metric

Description

Unit

VALU FLOPs (FP16)

16-bit VALU throughput in GFLOP/s, computed as SQ_INSTS_VALU × 64 / kernel time. The theoretical peak assumes packed FP16 at 2× the FP32 per-CU rate: max_sclk (MHz) × $cu_per_gpu × 256 / 1000 GFLOP/s. $cu_per_gpu is the total number of Compute Units (CUs) from system info, not the WGP count (on RDNA3.5, one WGP pairs two CUs).

GFLOP/s

VALU FLOPs (FP32)

32-bit VALU throughput in GFLOP/s, computed as SQ_INSTS_VALU × 64 / kernel time. The theoretical peak is max_sclk (MHz) × $cu_per_gpu × 128 / 1000 GFLOP/s (wave64 / dual-issue VALU class used for Ryzen APUs). $cu_per_gpu is the total number of Compute Units (CUs) from system info, not the WGP count (on RDNA3.5, one WGP pairs two CUs).

GFLOP/s

IPC

Instructions per cycle (IPC): the ratio of executed instructions to busy cycles, calculated as (Total Instructions - Internal Instructions) / Busy Cycles. A higher IPC indicates better instruction throughput.

Instr/cycle

Wavefront Occupancy

Average wavefront occupancy proxy: SQ_WAVE_CYCLES_sum / SQ_BUSY_CYCLES_avr per sample, where SQ_BUSY_CYCLES_avr is the mean of the per-shader-engine (SE) SQ_BUSY_CYCLES values (active-wave cycles per SE). The peak is $max_waves_per_cu × $cu_per_gpu, the theoretical cap on in-flight wavefronts. Percent of peak scales the average against that ceiling.

Wavefronts

L2 Cache Hit Rate

GL2C last-level cache hit rate (same as Memory Chart / GL2C panel). Formula: 100 * GL2C_HIT_sum / (GL2C_HIT_sum + GL2C_MISS_sum) when the denominator is non-zero.

Percent

L2-Fabric Read BW

GL2C/EA read bandwidth from the sized request counters (gl2c_perf_sel_ea_rdreq_32b/64B/96B/128B): (32·GL2C_EA_RDREQ_32B_sum + 64·GL2C_EA_RDREQ_64B_sum + 96·GL2C_EA_RDREQ_96B_sum + 128·GL2C_EA_RDREQ_128B_sum) / kernel wall time, summed over GL2C instances. Peak and % of peak are N/A because system info provides no memory bandwidth figure.

Bytes/s

L2-Fabric Write BW

GL2C/EA write bandwidth. GL2C_MC_WRREQ_sum is the EA write transaction count, where each transaction is 32 B or 64 B (the spec names this path MC_WRREQ). Bandwidth = (32·(GL2C_MC_WRREQ_sum − GL2C_EA_WRREQ_64B_sum) + 64·GL2C_EA_WRREQ_64B_sum) / kernel wall time. Peak and % of peak are N/A because system info provides no memory bandwidth figure.

Bytes/s

LDS Bank Conflicts

Number of LDS bank conflicts. Peak / % of Peak are N/A for raw counts.

Conflicts

Wave Dependency Wait

Percent of wave lifetime spent waiting for data dependencies (e.g. VMEM/LDS results).

Percent

Wave Issue Wait

Percent of wave lifetime spent waiting for any instruction to be ready to issue.

Percent

TCP Cache Hit Rate

Texture Cache Per-pipe (TCP) hit rate. TCP is the L0 vector cache in RDNA3.5. Same as Memory Chart TCP Hit Rate: 100 * (TCP_REQ_sum - TCP_REQ_MISS_sum) / TCP_REQ_sum.

Percent

TCP Cache BW

Same as Memory Chart “TCP Request Bandwidth” (table 303): TCP_REQ_sum × 64 B / kernel wall time. Peak and % of peak are N/A; as in the Memory Chart, there is no theoretical ceiling row.

Bytes/s

GL1C Hit Rate

GL1 Cache hit rate (same as Memory Chart / GL1C panel). Formula: 100 * (GL1C_REQ_sum - GL1C_REQ_MISS_sum) / GL1C_REQ_sum when GL1C_REQ_sum != 0.

Percent

GL1C Read BW

Same as Memory Chart table 307 “GL1C-GL2 Read Bandwidth”: (32·GL1C_GL2_REQ_READ_32B_sum + 64·GL1C_GL2_REQ_READ_64B_sum + 128·GL1C_GL2_REQ_READ_128B_sum) / kernel wall time. Peak and % of peak are N/A.

Bytes/s

GL2C Cache BW

Same as Memory Chart table 308 “GL2C Read Bandwidth”: the sized client read bins plus 96 B compressed reads, divided by kernel wall time. Peak and % of peak are N/A.

Bytes/s

Scalar Data Cache Hit Rate

Same as Memory Chart Dcache Hit Rate: 100 * SQC_DCACHE_HITS_sum / (SQC_DCACHE_HITS_sum + SQC_DCACHE_MISSES_sum) when the denominator is non-zero.

Percent

Scalar Data Cache BW

Same as Memory Chart table 302 “Dcache-GL1 Read Bandwidth”: SQC_TC_DATA_READ_REQ_sum × 128 B / kernel time. Peak / % of peak are N/A.

Bytes/s

Instruction Cache Hit Rate

Same as Memory Chart Icache Hit Rate: 100 * SQC_ICACHE_HITS_sum / (SQC_ICACHE_HITS_sum + SQC_ICACHE_MISSES_sum) when the denominator is non-zero.

Percent

Instruction Cache BW

Same as Memory Chart table 301 “ICache-GL1 Read Bandwidth”: SQC_TC_INST_REQ_sum × 128 B / kernel time. Peak / % of peak are N/A.

Bytes/s
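Most rows above reduce to a handful of formula shapes: a guarded hit rate, a sum of sized request counters divided by kernel time, and an occupancy ratio scaled against a wavefront ceiling. The following is a minimal sketch of those shapes, not the profiler's implementation. All function names and sample counter values are hypothetical, and returning None for a zero denominator is an assumption (the table only says the formulas apply when the denominator is non-zero).

```python
def hit_rate_from_hits_misses(hits, misses):
    """100 * hits / (hits + misses), the shape used for the GL2C, scalar data
    cache, and instruction cache rows. None for a zero denominator (assumed)."""
    total = hits + misses
    return 100.0 * hits / total if total else None


def hit_rate_from_requests(requests, request_misses):
    """100 * (req - miss) / req, the shape used for the TCP and GL1C rows.
    None when there were no requests (assumed)."""
    return 100.0 * (requests - request_misses) / requests if requests else None


def sized_request_bandwidth(counts_by_size, kernel_time_s):
    """Bytes/s from counters binned by request size in bytes, for example
    {32: n32, 64: n64, 96: n96, 128: n128} for L2-Fabric reads."""
    total_bytes = sum(size * count for size, count in counts_by_size.items())
    return total_bytes / kernel_time_s


def l2_fabric_write_bandwidth(wrreq_total, wrreq_64b, kernel_time_s):
    """Bytes/s for L2-Fabric writes: transactions not counted as 64 B are 32 B."""
    total_bytes = 32 * (wrreq_total - wrreq_64b) + 64 * wrreq_64b
    return total_bytes / kernel_time_s


def occupancy_pct_of_peak(wave_cycles_sum, busy_cycles_per_se,
                          max_waves_per_cu, cu_per_gpu):
    """Average wavefronts (wave cycles / mean per-SE busy cycles) and the
    percentage of the max_waves_per_cu * cu_per_gpu ceiling."""
    busy_avr = sum(busy_cycles_per_se) / len(busy_cycles_per_se)
    average_waves = wave_cycles_sum / busy_avr
    peak_waves = max_waves_per_cu * cu_per_gpu
    return average_waves, 100.0 * average_waves / peak_waves


# Hypothetical counter values, for illustration only.
print(hit_rate_from_hits_misses(hits=900, misses=100))
print(hit_rate_from_requests(requests=1000, request_misses=250))
print(hit_rate_from_hits_misses(hits=0, misses=0))  # idle cache: no rate
print(sized_request_bandwidth({32: 10, 64: 20, 96: 0, 128: 5}, kernel_time_s=0.5))
print(l2_fabric_write_bandwidth(wrreq_total=100, wrreq_64b=40, kernel_time_s=0.5))
print(occupancy_pct_of_peak(8000, [1000, 1000], max_waves_per_cu=16, cu_per_gpu=40))
```

Keeping the zero-denominator check inside each helper means an idle cache yields no value rather than a division error or a misleading 0%.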