System Speed-of-Light
This page lists System Speed-of-Light metrics for RDNA3.5 (gfx1151) when you profile with the shipped gfx1151 analysis configuration. The same metric keys appear in the RDNA3.5 (gfx1151) tab of the analysis report.
Note
For AMD Instinct accelerators (CDNA through CDNA4), see System Speed-of-Light.
Other gfx1151 metric tables grouped by hardware block live under Workgroup processor (WGP), GL0, GL1, GL2, Shader engine, and Command processor (CP).
Warning
Theoretical peaks use the maximum clock frequency reported for the GPU (for
example via rocminfo). That may not match sustained clocks under your
workload.
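To make the effect of the clock assumption concrete, here is a minimal sketch of the FP32 and FP16 peak formulas from the metric table on this page. The clock and CU values are hypothetical placeholders, not the specification of any particular part, and `peak_valu_gflops` is an illustrative helper, not a profiler API.

```python
# Per-CU FLOPs per clock, taken from the peak formulas in the metric table:
# FP32 uses 128, packed FP16 uses 256 (2x the FP32 per-CU rate).
FP32_FLOPS_PER_CU_PER_CLK = 128
FP16_FLOPS_PER_CU_PER_CLK = 256


def peak_valu_gflops(sclk_mhz: float, cu_per_gpu: int, flops_per_cu_per_clk: int) -> float:
    """Peak in GFLOP/s: sclk (MHz) x CU count x FLOPs per CU per clock / 1000."""
    return sclk_mhz * cu_per_gpu * flops_per_cu_per_clk / 1000.0


# Hypothetical inputs (assumptions for illustration only).
max_sclk_mhz = 3000.0        # maximum clock as reported for the GPU
sustained_sclk_mhz = 2400.0  # clock actually held under a workload
cu_per_gpu = 32              # total CU count, not WGP count

fp32_peak = peak_valu_gflops(max_sclk_mhz, cu_per_gpu, FP32_FLOPS_PER_CU_PER_CLK)
fp16_peak = peak_valu_gflops(max_sclk_mhz, cu_per_gpu, FP16_FLOPS_PER_CU_PER_CLK)
fp32_sustained = peak_valu_gflops(sustained_sclk_mhz, cu_per_gpu, FP32_FLOPS_PER_CU_PER_CLK)

print(f"FP32 peak at max clock:       {fp32_peak:.1f} GFLOP/s")
print(f"FP16 peak at max clock:       {fp16_peak:.1f} GFLOP/s")
print(f"FP32 ceiling at sustained clk: {fp32_sustained:.1f} GFLOP/s")
print(f"Sustained / reported peak:     {fp32_sustained / fp32_peak:.2f}")
```

Because the peak is linear in clock frequency, a kernel running at a lower sustained clock cannot reach 100% of the reported peak even with perfect VALU utilization.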
| Metric | Description | Unit |
|---|---|---|
| VALU FLOPs (FP16) | Sixteen-bit VALU throughput (GFLOP/s) from SQ_INSTS_VALU × 64 / kernel time. Theoretical peak uses packed FP16 at 2× the FP32 per-CU rate: max_sclk (MHz) × $cu_per_gpu × 256 / 1000 GFLOP/s. $cu_per_gpu is total CU count, not WGP (1 WGP = 2 CUs on RDNA 3.5). | GFLOP/s |
| VALU FLOPs (FP32) | Thirty-two-bit VALU throughput (GFLOP/s) from SQ_INSTS_VALU × 64 / kernel time. Theoretical peak: max_sclk (MHz) × $cu_per_gpu × 128 / 1000 GFLOP/s (wave64 / dual-issue VALU class used for Ryzen APUs). $cu_per_gpu is the total number of Compute Units from system info, not WGP count (on RDNA 3.5, one WGP pairs two CUs). | GFLOP/s |
| IPC | Instructions per cycle: the ratio of instructions executed to busy cycles. Calculated as (Total Instructions - Internal Instructions) / Busy Cycles. Higher IPC indicates better instruction throughput. | Instr/cycle |
| Wavefront Occupancy | Average wavefront occupancy proxy: SQ_WAVE_CYCLES_sum / SQ_BUSY_CYCLES_avr per sample, where SQ_BUSY_CYCLES_avr is the mean of per-shader-engine SQ_BUSY_CYCLES (active-wave cycles per SE). Peak is $max_waves_per_cu × $cu_per_gpu (theoretical in-flight cap). % of peak scales the average against that ceiling. | Wavefronts |
| L2 Cache Hit Rate | GL2C last-level cache hit rate (same as Memory Chart / GL2C panel). Formula: 100 * GL2C_HIT_sum / (GL2C_HIT_sum + GL2C_MISS_sum) when the denominator is non-zero. | Percent |
| L2-Fabric Read BW | GL2C/EA read bandwidth from sized request counters (gl2c_perf_sel_ea_rdreq_32b/64B/96B/128B): 32·GL2C_EA_RDREQ_32B_sum + 64·GL2C_EA_RDREQ_64B_sum + 96·GL2C_EA_RDREQ_96B_sum + 128·GL2C_EA_RDREQ_128B_sum, divided by kernel wall time. Sums over GL2C instances. Peak / % of peak are N/A (no sysinfo memory BW). | Bytes/s |
| L2-Fabric Write BW | GL2C/EA write bandwidth: GL2C_MC_WRREQ_sum is the EA write transaction count (32 B or 64 B per txn; the spec names this path MC_WRREQ). Bandwidth = 32·(GL2C_MC_WRREQ_sum − GL2C_EA_WRREQ_64B_sum) + 64·GL2C_EA_WRREQ_64B_sum, divided by kernel wall time. Peak / % of peak are N/A (no sysinfo memory BW). | Bytes/s |
| LDS Bank Conflicts | Number of LDS bank conflicts. Peak / % of peak are N/A for raw counts. | Conflicts |
| Wave Dependency Wait | Percent of wave lifetime spent waiting for data dependencies (e.g. VMEM/LDS results). | Percent |
| Wave Issue Wait | Percent of wave lifetime spent waiting for any instruction to be ready to issue. | Percent |
| TCP Cache Hit Rate | Texture Cache Per-pipe (TCP) hit rate. TCP is the L0 vector cache in RDNA3.5. Same as Memory Chart TCP Hit Rate: 100 * (TCP_REQ_sum - TCP_REQ_MISS_sum) / TCP_REQ_sum. | Percent |
| TCP Cache BW | Same as Memory Chart “TCP Request Bandwidth” (table 303): TCP_REQ_sum × 64 B / kernel wall time. Peak / % of peak are N/A (as in Memory Chart, which has no theoretical ceiling row). | Bytes/s |
| GL1C Hit Rate | GL1 Cache hit rate (same as Memory Chart / GL1C panel). Formula: 100 * (GL1C_REQ_sum - GL1C_REQ_MISS_sum) / GL1C_REQ_sum when GL1C_REQ_sum != 0. | Percent |
| GL1C Read BW | Same as Memory Chart table 307 “GL1C-GL2 Read Bandwidth”: 32·GL1C_GL2_REQ_READ_32B_sum + 64·GL1C_GL2_REQ_READ_64B_sum + 128·GL1C_GL2_REQ_READ_128B_sum, divided by kernel wall time. Peak / % of peak are N/A. | Bytes/s |
| GL2C Cache BW | Same as Memory Chart table 308 “GL2C Read Bandwidth”: sized client read bins plus 96 B compressed reads, divided by kernel wall time. Peak / % of peak are N/A. | Bytes/s |
| Scalar Data Cache Hit Rate | Same as Memory Chart Dcache Hit Rate: 100 * SQC_DCACHE_HITS_sum / (SQC_DCACHE_HITS_sum + SQC_DCACHE_MISSES_sum) when the denominator is non-zero. | Percent |
| Scalar Data Cache BW | Same as Memory Chart table 302 “Dcache-GL1 Read Bandwidth”: SQC_TC_DATA_READ_REQ_sum × 128 B / kernel time. Peak / % of peak are N/A. | Bytes/s |
| Instruction Cache Hit Rate | Same as Memory Chart Icache Hit Rate: 100 * SQC_ICACHE_HITS_sum / (SQC_ICACHE_HITS_sum + SQC_ICACHE_MISSES_sum) when the denominator is non-zero. | Percent |
| Instruction Cache BW | Same as Memory Chart table 301 “ICache-GL1 Read Bandwidth”: SQC_TC_INST_REQ_sum × 128 B / kernel time. Peak / % of peak are N/A. | Bytes/s |
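The hit-rate and sized-request bandwidth formulas in the table above can be sketched as follows. This is an illustrative reading of those formulas, not the profiler's implementation: the helper names are hypothetical, the counter values are made up, and kernel wall time is assumed to be in seconds so that results come out in Bytes/s.

```python
from typing import Mapping, Optional


def hit_rate_from_hits_misses(hits: int, misses: int) -> Optional[float]:
    """100 * hits / (hits + misses); None when the denominator is zero."""
    total = hits + misses
    if total == 0:
        return None
    return 100.0 * hits / total


def hit_rate_from_requests(requests: int, misses: int) -> Optional[float]:
    """100 * (requests - misses) / requests; None when there were no requests."""
    if requests == 0:
        return None
    return 100.0 * (requests - misses) / requests


def l2_fabric_read_bw(counters: Mapping[str, int], kernel_time_s: float) -> float:
    """Weight each sized read-request counter by its size in bytes, then divide by time."""
    total_bytes = (
        32 * counters["GL2C_EA_RDREQ_32B_sum"]
        + 64 * counters["GL2C_EA_RDREQ_64B_sum"]
        + 96 * counters["GL2C_EA_RDREQ_96B_sum"]
        + 128 * counters["GL2C_EA_RDREQ_128B_sum"]
    )
    return total_bytes / kernel_time_s


def l2_fabric_write_bw(counters: Mapping[str, int], kernel_time_s: float) -> float:
    """Writes are 32 B or 64 B: the total minus the 64 B count gives the 32 B count."""
    total_txn = counters["GL2C_MC_WRREQ_sum"]
    txn_64b = counters["GL2C_EA_WRREQ_64B_sum"]
    total_bytes = 32 * (total_txn - txn_64b) + 64 * txn_64b
    return total_bytes / kernel_time_s


# Made-up counter values for illustration only.
sample = {
    "GL2C_EA_RDREQ_32B_sum": 10,
    "GL2C_EA_RDREQ_64B_sum": 20,
    "GL2C_EA_RDREQ_96B_sum": 0,
    "GL2C_EA_RDREQ_128B_sum": 30,
    "GL2C_MC_WRREQ_sum": 50,
    "GL2C_EA_WRREQ_64B_sum": 20,
}
kernel_time_s = 1e-3  # assumed unit: seconds

l2_hit = hit_rate_from_hits_misses(hits=900, misses=100)
tcp_hit = hit_rate_from_requests(requests=1000, misses=250)
read_bw = l2_fabric_read_bw(sample, kernel_time_s)
write_bw = l2_fabric_write_bw(sample, kernel_time_s)

print(l2_hit, tcp_hit, read_bw, write_bw)
```

Returning `None` for a zero denominator mirrors the "when the denominator is non-zero" condition in the table; how the report displays that case is not specified here.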