Workgroup processor (WGP)#
On RDNA3-class GPUs (including Ryzen APUs with RDNA3.5 / gfx1151 integrated graphics), shader work is organized into Workgroup Processors (WGPs). A WGP pairs two Compute Units (CUs) that share resources; in the gfx1151 panel set, compute kernels are typically tracked as wave32 waves. The profiler’s WGP panel is the analogue of the Instinct compute unit chapter, covering occupancy, dispatch, instruction mix, and the local caches at that granularity.
The sections below list RDNA3.5 (gfx1151) metric descriptions.
Note
For VALU / VMEM / MFMA-style pipeline tables and MI-series diagrams, see Compute unit (CU).
Roofline#
Roofline performance rates#
| Metric | Description | Unit |
|---|---|---|
| VALU FLOPs (F16) | The total 16-bit floating-point operations executed per second on the VALU. Theoretical FP16 throughput is 2× the FP32 FMA peak (packed half on Navi3 / RDNA3-class VALUs). Shown alongside the peak empirical F16 FLOPs achievable on the specific accelerator. | GFLOP/s |
| VALU FLOPs (F32) | The total 32-bit floating-point operations executed per second on the VALU. Navi3 / RDNA3-class CUs sustain up to 128 V_FMA_F32 ops per cycle per CU in wave64 mode; the roofline peak column uses empirically measured F32 throughput. Shown alongside the peak empirical F32 FLOPs achievable on the specific accelerator. | GFLOP/s |
| VALU FLOPs (F64) | The total 64-bit floating-point operations executed per second on the VALU. FP64 throughput is typically 1/16 or 1/32 of FP32 on RDNA3.5 consumer parts. Shown alongside the peak empirical F64 FLOPs achievable on the specific accelerator. | GFLOP/s |
| GL2 Cache Bandwidth | The number of bytes transferred between GL1C and GL2C per unit time. GL2C is the last-level cache in RDNA3.5 before system memory. The peak empirically measured bandwidth achievable on the specific accelerator is displayed alongside for comparison. | Bytes/s |
| GL1 Cache Bandwidth | The number of bytes transferred between TCP and GL1C per unit time. GL1C is the shared L1 cache within each Shader Engine. The peak empirically measured bandwidth achievable on the specific accelerator is displayed alongside for comparison. | Bytes/s |
| TCP Cache Bandwidth | The number of bytes looked up in the TCP (Texture Cache Per-pipe) per unit time. TCP is the L0 vector cache in RDNA3.5, one per CU within a WGP. The peak empirically measured bandwidth achievable on the specific accelerator is displayed alongside for comparison. | Bytes/s |
| LDS Bandwidth | The maximum number of bytes that could have been loaded from, stored to, or atomically updated in the LDS per unit time. RDNA3.5 has 32 LDS banks per CU. The peak empirically measured LDS bandwidth achievable on the specific accelerator is displayed alongside for comparison. | Bytes/s |
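Every rate in this table reduces to a counter total divided by the kernel’s wall-clock time. The following Python sketch shows that arithmetic; all input values are placeholders, not profiler output.

```python
# Minimal sketch: every rate in the table above is a counter total divided by
# the kernel's wall-clock time. All input values are placeholders, not
# profiler output.

def roofline_rates(total_flops, tcp_bytes, gl1_bytes, gl2_bytes, lds_bytes,
                   duration_s):
    """Return achieved GFLOP/s and per-level bandwidths in GB/s."""
    to_giga = 1e-9
    return {
        "VALU GFLOP/s": total_flops * to_giga / duration_s,
        "TCP GB/s":     tcp_bytes   * to_giga / duration_s,
        "GL1 GB/s":     gl1_bytes   * to_giga / duration_s,
        "GL2 GB/s":     gl2_bytes   * to_giga / duration_s,
        "LDS GB/s":     lds_bytes   * to_giga / duration_s,
    }

# Example for a hypothetical 1.2 ms kernel.
print(roofline_rates(total_flops=3.5e9, tcp_bytes=2.0e9, gl1_bytes=1.1e9,
                     gl2_bytes=6.0e8, lds_bytes=9.0e8, duration_s=1.2e-3))
```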
Roofline plot points#
| Metric | Description | Unit |
|---|---|---|
| AI TCP | The Arithmetic Intensity (AI) relative to the TCP cache. It is the ratio of total floating-point operations (FLOPs) to total bytes transferred from TCP to the shader cores. This value is used as the x-coordinate for the TCP roofline. | FLOPs/Byte |
| AI GL1 | The Arithmetic Intensity (AI) relative to the GL1C cache. It is the ratio of total FLOPs to total bytes transferred between GL1C and TCP. This value is used as the x-coordinate for the GL1 roofline. | FLOPs/Byte |
| AI GL2 | The Arithmetic Intensity (AI) relative to GL2C (L2 cache). It is the ratio of total FLOPs to total bytes transferred between GL2C and GL1C. This value is used as the x-coordinate for the GL2 roofline. | FLOPs/Byte |
| Performance (GFLOPs) | The overall achieved performance, measured in GFLOP/s. Calculated as the sum of all VALU floating-point operations divided by the total execution time. This value is used as the y-coordinate for the kernel’s point on the roofline plot. | GFLOP/s |
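A small sketch of how these coordinates combine into a roofline point. The peak compute and bandwidth numbers and the FLOP/byte totals are illustrative placeholders, not measured gfx1151 values.

```python
# Minimal sketch of how the roofline coordinates above are formed. The peak
# compute and bandwidth numbers and the FLOP/byte totals are illustrative
# placeholders, not measured gfx1151 values.

def roofline_point(total_flops, bytes_moved, duration_s):
    """Return (arithmetic intensity, achieved GFLOP/s) for one cache level."""
    ai = total_flops / bytes_moved            # FLOPs/Byte (x-coordinate)
    gflops = total_flops / duration_s / 1e9   # GFLOP/s    (y-coordinate)
    return ai, gflops

def ceiling(ai, peak_gflops, peak_gbps):
    """Attainable performance at a given intensity: min(compute peak, BW * AI)."""
    return min(peak_gflops, peak_gbps * ai)

flops, secs = 3.5e9, 1.2e-3
for level, nbytes, peak_bw in [("TCP", 2.0e9, 3000.0),
                               ("GL1", 1.1e9, 1500.0),
                               ("GL2", 6.0e8, 800.0)]:
    ai, perf = roofline_point(flops, nbytes, secs)
    print(f"{level}: AI={ai:.2f} FLOP/B, achieved={perf:.1f} GFLOP/s, "
          f"ceiling={ceiling(ai, peak_gflops=20000.0, peak_gbps=peak_bw):.1f} GFLOP/s")
```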
WGP block metrics#
WGP utilization#
| Metric | Description | Unit |
|---|---|---|
| WGP Utilization | WGP busy as a percent of GRBM GUI-active cycles. RDNA3.5 WGPs pair 2 CUs, each with 2 SIMD32 units. | Percent |
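A one-line sketch of the ratio, assuming a WGP busy-cycle total and the GRBM GUI-active cycle total are available; the argument names are illustrative, not a verified gfx1151 counter list.

```python
# Illustrative only: the argument names stand in for whatever busy-cycle and
# GUI-active counters the profiler exports; they are not a verified gfx1151
# counter list.
def wgp_utilization(wgp_busy_cycles, grbm_gui_active_cycles):
    """Percent of GUI-active time during which the WGPs were busy."""
    return 100.0 * wgp_busy_cycles / grbm_gui_active_cycles

print(wgp_utilization(wgp_busy_cycles=8.2e6, grbm_gui_active_cycles=1.0e7))  # 82.0
```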
Wavefront launch stats#
| Metric | Description | Unit |
|---|---|---|
| Grid Size | Number of work-items in the dispatch grid for the kernel (Grid_Size), averaged/min/maxed across samples. Defines the total parallel work launched. | Work-items |
| Workgroup Size | Number of work-items per workgroup (Workgroup_Size). Together with grid size and resource limits, it determines the wave/wavefront count. | Work-items |
| VGPRs | Vector general-purpose registers allocated per wave (Arch_VGPR). Limits occupancy when the per-CU VGPR capacity is exhausted. | Registers |
| SGPRs | Scalar general-purpose registers allocated per wave (SGPR). Uniform data and execution state live in the scalar path. | Registers |
| LDS Allocation | Local Data Share bytes reserved per workgroup (LDS_Per_Workgroup). High LDS use per group reduces how many workgroups can run concurrently on a CU/WGP. | Bytes |
| Scratch Allocation | Scratch (private) memory bytes per work-item (Scratch_Per_Workitem). Backed by off-chip memory when spill/stack usage exceeds limits. | Bytes per Work-item |
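These launch statistics are what bound occupancy. The sketch below estimates the per-SIMD wave limit from workgroup size, VGPRs per wave, and LDS per workgroup; the wave-slot, VGPR, and LDS capacities are example parameters, not asserted gfx1151 limits.

```python
# Back-of-the-envelope occupancy sketch driven by the launch statistics above.
# The wave-slot, VGPR, and LDS capacities are example parameters, NOT asserted
# gfx1151 limits -- substitute the documented values for your part.
import math

def occupancy_limits(workgroup_size, vgprs_per_wave, lds_per_workgroup,
                     wave_size=32, max_waves_per_simd=16,
                     vgprs_per_simd=1536, lds_per_wgp=65536,
                     simds_per_wgp=4):
    """Estimate the waves-per-SIMD limit imposed by each resource."""
    waves_per_wg = math.ceil(workgroup_size / wave_size)
    by_slots = max_waves_per_simd
    by_vgpr = vgprs_per_simd // max(vgprs_per_wave, 1)
    if lds_per_workgroup > 0:
        wgs_by_lds = lds_per_wgp // lds_per_workgroup       # workgroups per WGP
        by_lds = wgs_by_lds * waves_per_wg / simds_per_wgp  # waves per SIMD
    else:
        by_lds = float("inf")
    return {"wave slots": by_slots, "VGPRs": by_vgpr, "LDS": by_lds,
            "limit": min(by_slots, by_vgpr, by_lds)}

# Example: 256-item workgroups, 96 VGPRs per wave, 16 KiB LDS per workgroup.
print(occupancy_limits(workgroup_size=256, vgprs_per_wave=96,
                       lds_per_workgroup=16384))
```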
Wave dispatch#
| Metric | Description | Unit |
|---|---|---|
| Dispatched Waves | Number of shader waves launched (SQ_WAVES_sum), averaged/min/maxed. Each wave is a SIMD execution entity (typically wave32 on this configuration). | Waves |
| Dispatched Threads | Number of work-items (threads) launched, as reported by SQ_ITEMS_sum, averaged/min/maxed. | Threads |
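A quick sanity check: the dispatched wave count should match what grid size, workgroup size, and wave32 execution imply. The numbers below are placeholders.

```python
# Placeholder numbers; wave_size is 32 per the configuration described above.
import math

def expected_waves(grid_size, workgroup_size, wave_size=32):
    """Waves implied by the launch geometry, for comparison with SQ_WAVES_sum."""
    workgroups = math.ceil(grid_size / workgroup_size)
    return workgroups * math.ceil(workgroup_size / wave_size)

print(expected_waves(grid_size=1 << 20, workgroup_size=256))  # 32768
```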
Wave life#
| Metric | Description | Unit |
|---|---|---|
| Wave Life | Average number of cycles a wave exists from creation to completion. Calculated as SQ_WAVE_CYCLES / SQ_WAVES. | Cycles |
Wave instruction mix#
| Metric | Description | Unit |
|---|---|---|
| Total Instructions | Total number of instructions executed across all waves. | Count per Normalization Unit |
| Instructions - VALU | Number of Vector ALU instructions executed (SQ_INSTS_VALU). The VALU handles all vector arithmetic and logic operations. Transcendental VALU ops are reported separately as “Instructions - Transcendental” in this same table. | Count per Normalization Unit |
| Instructions - SALU | Number of Scalar ALU instructions executed. The SALU handles scalar operations that are uniform across the wave. | Count per Normalization Unit |
| Instructions - SMEM | Number of Scalar Memory instructions executed. SMEM loads uniform data through the scalar cache. | Count per Normalization Unit |
| Instructions - VMEM | Number of Vector Memory instructions executed. Includes global, buffer, and texture memory operations. | Count per Normalization Unit |
| Instructions - LDS | Total Local Data Share instructions (SQ_INSTS_LDS). LDS provides fast shared memory within a workgroup. For LDS sub-categories (direct / parameter / wave32 parameter loads), see the “LDS instruction mix” table. | Count per Normalization Unit |
| Instructions - Internal | Internal SQ instructions (SQ_INSTS_INTERNAL) per normalization unit: bookkeeping or non-user-visible work attributed to the scheduler. | Count per Normalization Unit |
| Instructions - Transcendental | Transcendental VALU instructions (SQ_INSTS_VALU_TRANS) per normalization unit, e.g. sin/cos/log approximations on the vector ALU. | Count per Normalization Unit |
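A short sketch of turning these per-category counts into a percentage mix. The counter values are placeholders, and the counter names in the comments follow the row names above rather than a verified gfx1151 list.

```python
# The counter values below are placeholders; counter names in the comments
# follow the row names above and are assumptions, not a verified list.

def instruction_mix(counts):
    """Convert per-category instruction counts into percentages of the total."""
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}

counts = {
    "VALU": 5_200_000,  # SQ_INSTS_VALU
    "SALU": 1_100_000,  # scalar ALU instruction count
    "SMEM":   200_000,  # scalar memory instruction count
    "VMEM":   900_000,  # vector memory instruction count
    "LDS":    400_000,  # SQ_INSTS_LDS
}
for name, pct in instruction_mix(counts).items():
    print(f"{name:>4}: {pct:5.1f} %")
```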
VMEM instruction mix#
| Metric | Description | Unit |
|---|---|---|
| Instructions - VMEM Flat | Flat-addressing vector memory instructions (SQ_INSTS_FLAT) per normalization unit. Flat can resolve to global, scratch, or LDS depending on the address. | Count per Normalization Unit |
| Instructions - TEX Load | Texture / image load instructions (SQ_INSTS_TEX_LOAD) per normalization unit. | Count per Normalization Unit |
| Instructions - TEX Store | Texture / image store instructions (SQ_INSTS_TEX_STORE) per normalization unit. | Count per Normalization Unit |
| Instructions - Flat Load | Flat load instructions (SQ_INSTS_FLAT_LOAD) per normalization unit. | Count per Normalization Unit |
| Instructions - Flat Store | Flat store instructions (SQ_INSTS_FLAT_STORE) per normalization unit. | Count per Normalization Unit |
LDS instruction mix#
| Metric | Description | Unit |
|---|---|---|
| LDS Direct Load Instructions | LDS direct load instructions issued (SQ_INSTS_LDS_DIRECT_LOAD). A subset of total LDS instructions; the LDS mix rows are not required to sum to the total. | Count per Normalization Unit |
| LDS Parameter Load Instructions | LDS parameter load instructions issued (SQ_INSTS_LDS_PARAM_LOAD). | Count per Normalization Unit |
| WAVE32 LDS Parameter Load Instructions | LDS parameter load instructions issued under wave32 execution (SQ_INSTS_WAVE32_LDS_PARAM_LOAD). | Count per Normalization Unit |
Wait state analysis#
| Metric | Description | Unit |
|---|---|---|
| Wait for Any | Percentage of wave lifetime spent waiting for any reason, including memory, barriers, or instruction fetch. Computed as 100 * SQ_WAIT_ANY / SQ_WAVE_CYCLES. | Percent |
| Wait for Instruction Fetch | Percentage of wave lifetime spent waiting for instructions to arrive from the instruction cache. Computed as 100 * SQ_WAIT_IFETCH / SQ_WAVE_CYCLES. | Percent |
| Wait for Barrier | Percentage of wave lifetime spent waiting at barrier synchronization points. Computed as 100 * SQ_WAIT_BARRIER / SQ_WAVE_CYCLES. | Percent |
| Wait for Counter | Percentage of wave lifetime spent waiting for waitcnt counters (VMCNT, LGKMCNT, etc.) to reach target values. Computed as 100 * SQ_WAIT_CNT_ANY / SQ_WAVE_CYCLES. | Percent |
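The four rows above share one formula; the sketch below transcribes it directly. The counter totals are placeholders.

```python
# Direct transcription of the formulas above; the counter totals are placeholders.

def wait_breakdown(wave_cycles, wait_any, wait_ifetch, wait_barrier, wait_cnt):
    """Express each SQ_WAIT_* total as a percentage of SQ_WAVE_CYCLES."""
    pct = lambda cycles: 100.0 * cycles / wave_cycles
    return {
        "Wait for Any":               pct(wait_any),
        "Wait for Instruction Fetch": pct(wait_ifetch),
        "Wait for Barrier":           pct(wait_barrier),
        "Wait for Counter":           pct(wait_cnt),
    }

print(wait_breakdown(wave_cycles=1.0e7, wait_any=4.0e6, wait_ifetch=2.0e5,
                     wait_barrier=6.0e5, wait_cnt=3.2e6))
```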
WGP instruction cache#
| Metric | Description | Unit |
|---|---|---|
| Icache Hit Rate | Instruction cache hit percentage: 100 × SQC_ICACHE_HITS / (SQC_ICACHE_HITS + SQC_ICACHE_MISSES). Do not divide by SQC_ICACHE_REQ alone; hits and requests are both counted per SQ per bank, and summed hits can exceed summed requests. | Percent |
WGP scalar data cache#
| Metric | Description | Unit |
|---|---|---|
| Dcache Hit Rate | Scalar data cache hit percentage: 100 × SQC_DCACHE_HITS / (SQC_DCACHE_HITS + SQC_DCACHE_MISSES). Same rationale as the Icache hit rate: dividing HITS by REQ can yield values above 100%. | Percent |
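Both SQC hit rates use the same hits-over-(hits + misses) form; a tiny sketch with placeholder counts follows.

```python
# Hit-rate arithmetic for both SQC tables above: divide hits by hits + misses,
# never by the raw request counter alone. The counts are placeholders.

def hit_rate(hits, misses):
    return 100.0 * hits / (hits + misses)

print(f"Icache hit rate: {hit_rate(hits=9.6e5, misses=4.0e4):.1f} %")  # 96.0 %
print(f"Dcache hit rate: {hit_rate(hits=7.5e5, misses=2.5e5):.1f} %")  # 75.0 %
```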