Workgroup processor (WGP)
Within each shader engine, a Workgroup Processor (WGP) pairs two Compute Units (CUs) that share resources and execute dispatched waves once the SPI workgroup manager hands off work. On RDNA3-class GPUs (including discrete Ryzen APU 3x and RDNA3.5 / gfx115x integrations), compute kernels typically run as wave32 waves; the gfx115x WGP panels cover occupancy, dispatch, instruction mix, and local caches at WGP (CU-pair) granularity.
The sections below list RDNA3.5 (gfx115x) metric descriptions.
Note
AMD Instinct (CDNA) GPUs use a different execution hierarchy and panel grouping. For Instinct-only pipeline metrics (for example, VALU / VMEM / MFMA-style tables), see AMD CDNA architecture (CDNA-CDNA4). Do not assume that RDNA WGPs or CUs map directly to those layouts.
Roofline
Roofline performance rates
| Metric | Description | Unit |
|---|---|---|
| VALU FLOPs | Floating-point operations per second on the VALU. Peak is based on single-issue FP32 FMA (128 FLOPs/CU/cycle). VOPD dual-issue doubles throughput: FP32 to 256 and packed FP16 to 512 FLOPs/CU/cycle. Uses the aggregate instruction counter. | GFLOP/s |
| VALU FLOPs (F64) | 64-bit floating-point operations per second on the VALU. Peak is 4 FLOPs/CU/cycle, because FP64 runs at 1/32 of the FP32 FMA rate on RDNA (128 / 32). Uses the dedicated FP64 instruction counter (SQ_INSTS_VALU_DP_sum). | GFLOP/s |
| GL2 Cache BW | Achieved bandwidth between the GL1 and GL2 caches. This represents GL1 miss traffic that must be serviced by GL2. High bandwidth relative to peak may indicate GL1 cache pressure or poor data locality at the GL1 level. | Bytes/s |
| GL1 Cache BW | Achieved bandwidth between the GL0 (TCP Cache) and GL1 caches. This represents GL0 miss traffic. High bandwidth relative to peak may indicate cache pressure at the GL0 level or memory access patterns with poor locality. | Bytes/s |
| GL0 Cache BW | Achieved bandwidth at the GL0 vector cache (TCP Cache). This represents total data requested by shader cores from the memory hierarchy. Compare against peak to assess memory subsystem utilization. | Bytes/s |
| LDS BW | Achieved bandwidth for Local Data Share memory operations. LDS provides high-bandwidth, low-latency shared memory within a workgroup. Compare against peak to identify whether LDS bandwidth is limiting performance. | Bytes/s |
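
The per-CU peak rates quoted in the table (128, 256, 512, and 4 FLOPs/CU/cycle) can be scaled to a device-level peak in GFLOP/s by multiplying by the CU count and the clock. The following is a minimal sketch of that arithmetic. The function name, the CU count, and the clock frequency are hypothetical illustration values, not figures from any datasheet or from the profiler itself.

```python
def peak_valu_gflops(num_cus, clock_mhz, flops_per_cu_per_cycle=128):
    """Peak VALU rate in GFLOP/s: CUs x FLOPs/CU/cycle x clock in GHz."""
    return num_cus * flops_per_cu_per_cycle * clock_mhz / 1000.0


# Hypothetical device (assumption for illustration only).
NUM_CUS = 16
CLOCK_MHZ = 2000

fp32_single_issue = peak_valu_gflops(NUM_CUS, CLOCK_MHZ)       # 128 FLOPs/CU/cycle
fp32_dual_issue = peak_valu_gflops(NUM_CUS, CLOCK_MHZ, 256)    # VOPD dual-issue
fp64 = peak_valu_gflops(NUM_CUS, CLOCK_MHZ, 4)                 # 1/32 of FP32 FMA

print(fp32_single_issue, fp32_dual_issue, fp64)
```

Dividing an achieved VALU FLOPs value by the matching peak gives a rough fraction-of-peak figure; which peak is appropriate depends on whether the kernel can actually dual-issue.
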
Roofline plot points
| Metric | Description | Unit |
|---|---|---|
| AI L2 | Arithmetic Intensity relative to the L2 cache. This metric shows compute density relative to L2 cache traffic. Higher values indicate better data reuse upstream of L2, while lower values suggest heavy traffic into L2. | FLOPs/Byte |
| AI L1 | Arithmetic Intensity relative to the L1 cache (TCP). Higher values indicate more compute operations per byte of data loaded, suggesting compute-bound behavior. Lower values suggest memory-bound behavior limited by L1 bandwidth. | FLOPs/Byte |
| AI LDS | Arithmetic Intensity relative to the Local Data Share (LDS). Higher values indicate more compute per byte of LDS traffic, suggesting effective reuse of shared data within a workgroup. Lower values suggest memory-bound behavior limited by LDS bandwidth or bank-conflict pressure. | FLOPs/Byte |
| Performance (GFLOPs) | Achieved compute throughput measured in billions of floating-point operations per second. This value, combined with arithmetic intensity, determines the kernel’s position on the roofline chart for performance analysis. | GFLOP/s |
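
Arithmetic intensity is FLOPs divided by bytes moved at a given memory level, and the standard roofline model bounds attainable performance by the lower of the compute peak and intensity times bandwidth. The sketch below illustrates that relationship. The function names and all numeric inputs are hypothetical, and the profiler's own derivation of these metrics may differ in detail.

```python
def arithmetic_intensity(flops, bytes_moved):
    """FLOPs per byte of traffic at one memory level (L2, L1, or LDS)."""
    if bytes_moved <= 0:
        raise ValueError("bytes_moved must be positive")
    return flops / bytes_moved


def attainable_gflops(ai, peak_gflops, peak_bw_gbytes_per_s):
    """Roofline bound: min(compute peak, intensity x bandwidth peak)."""
    return min(peak_gflops, ai * peak_bw_gbytes_per_s)


# Hypothetical kernel and device numbers (assumptions for illustration).
ai_l2 = arithmetic_intensity(flops=2.0e9, bytes_moved=1.0e9)
bound = attainable_gflops(ai_l2, peak_gflops=4096.0, peak_bw_gbytes_per_s=256.0)
memory_bound = bound < 4096.0

print(ai_l2, bound, memory_bound)
```

A kernel whose bound is set by the bandwidth term sits under the sloped part of the roofline for that memory level; one whose bound equals the compute peak sits under the flat part.
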
WGP block metrics
WGP utilization
| Metric | Description | Unit |
|---|---|---|
| WGP Utilization | Percentage of GPU active cycles where the Workgroup Processor is busy executing wavefronts. RDNA 3.5 WGPs contain two compute units, each with dual SIMD32 units. Low utilization may indicate insufficient parallelism or resource bottlenecks. | Percent |
Wavefront launch stats
| Metric | Description | Unit |
|---|---|---|
| Grid Size | Number of work-items in the dispatch grid for the kernel (Grid_Size), reported as average, minimum, and maximum across samples. Defines the total parallel work launched. | Work-items |
| Workgroup Size | Number of work-items per workgroup (Workgroup_Size). Together with grid size and resource limits, it determines the number of waves launched. | Work-items |
| VGPRs | Vector general-purpose registers allocated per wave (Arch_VGPR). Limits occupancy when the per-CU VGPR capacity is exhausted. | Registers |
| SGPRs | Scalar general-purpose registers allocated per wave (SGPR). Uniform data and execution state live in the scalar path. | Registers |
| LDS Allocation | Local Data Share bytes reserved per workgroup (LDS_Per_Workgroup). High LDS usage per workgroup reduces how many workgroups can run concurrently on a CU/WGP. | Bytes |
| Scratch Allocation | Scratch (private) memory bytes per work-item (Scratch_Per_Workitem). Scratch is backed by off-chip memory and is used when register spills or stack usage exceed on-chip limits. | Bytes per Work-item |
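
Grid size and workgroup size together determine how many workgroups and waves a dispatch produces. The sketch below shows that arithmetic under stated assumptions: a flattened one-dimensional launch, wave32 execution, and every workgroup (including a partial trailing one) counted at the full workgroup size, which makes the result an upper-bound estimate. The function name is hypothetical.

```python
def launch_geometry(grid_size, workgroup_size, wave_size=32):
    """Return (workgroups, waves_per_workgroup, total_waves) for a 1D launch."""
    if grid_size <= 0 or workgroup_size <= 0 or wave_size <= 0:
        raise ValueError("all sizes must be positive")
    # Integer ceiling division avoids floating-point rounding.
    workgroups = -(-grid_size // workgroup_size)
    waves_per_workgroup = -(-workgroup_size // wave_size)
    return workgroups, waves_per_workgroup, workgroups * waves_per_workgroup


# Hypothetical launch: 1,048,576 work-items in workgroups of 256.
print(launch_geometry(1_048_576, 256))
# A workgroup size that is not a multiple of 32 still rounds up to whole waves.
print(launch_geometry(1000, 100))
```

Comparing the estimated total with the Dispatched Waves metric is one way to sanity-check which wave size a kernel was compiled for.
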
Wave dispatch
| Metric | Description | Unit |
|---|---|---|
| Dispatched Waves | Number of shader waves launched (SQ_WAVES_sum), reported as average, minimum, and maximum. Each wave is a SIMD execution entity (typically wave32 on this configuration). | Waves |
| Dispatched Threads | Number of work-items (threads) launched, as counted by SQ_ITEMS_sum, reported as average, minimum, and maximum. | Threads |
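
Dividing dispatched threads by dispatched waves gives the average number of active work-items per wave, which hints at how fully each wave is populated. This ratio is a derived quantity suggested here for illustration, not a metric the panel reports. The function names and sample counter values are hypothetical, and wave32 is assumed.

```python
def avg_threads_per_wave(sq_items_sum, sq_waves_sum):
    """Average work-items per wave; 0.0 when no waves were dispatched."""
    if sq_waves_sum == 0:
        return 0.0
    return sq_items_sum / sq_waves_sum


def lane_fill_percent(sq_items_sum, sq_waves_sum, wave_size=32):
    """Average wave population as a percentage of the wave size."""
    return 100.0 * avg_threads_per_wave(sq_items_sum, sq_waves_sum) / wave_size


# Hypothetical counter values (assumptions for illustration).
print(avg_threads_per_wave(800, 32))   # 25 work-items per wave on average
print(lane_fill_percent(800, 32))      # share of a wave32 that is populated
```

A fill well below 100 percent usually points at workgroup sizes that are not a multiple of the wave size.
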
Wave life
| Metric | Description | Unit |
|---|---|---|
| Wave Life | Average number of cycles from wavefront creation to completion. Longer wave life indicates more execution time per wavefront, which may be due to complex kernels, memory latency, or synchronization waits. | Cycles |
Wave instruction mix
| Metric | Description | Unit |
|---|---|---|
| Total Instructions | Total number of instructions executed across all wavefronts. This provides an overall measure of computational work. Compare with specific instruction types to understand the workload characteristics. | Count per Normalization Unit |
| Instructions - Internal | Internal SQ instructions (SQ_INSTS_INTERNAL) per normalization unit. These represent bookkeeping or non-user-visible work issued by the SQ. | Count per Normalization Unit |
| Instructions - VALU | Number of Vector ALU instructions executed. VALU handles arithmetic and logic operations across all work-items in a wavefront. High VALU counts indicate compute-intensive kernels. | Count per Normalization Unit |
| Instructions - SALU | Number of Scalar ALU instructions executed. SALU handles operations with uniform values across the wavefront. Efficient kernels often have a balance of VALU and SALU operations. | Count per Normalization Unit |
| Instructions - SMEM | Number of scalar memory instructions executed. SMEM loads constant and uniform data through the scalar cache. High counts may indicate heavy use of constant buffers or kernel arguments. | Count per Normalization Unit |
| Instructions - Transcendental | Transcendental VALU instructions (SQ_INSTS_VALU_TRANS) per normalization unit. These cover hardware approximations such as sin, cos, and log on the vector ALU. | Count per Normalization Unit |
| Instructions - VMEM | Number of vector memory instructions executed, including global, buffer, and texture operations. High VMEM counts relative to VALU may indicate memory-bound workloads. | Count per Normalization Unit |
| Instructions - LDS | Number of Local Data Share instructions executed. LDS provides fast shared memory within a workgroup. High counts indicate active use of shared memory for inter-work-item communication. | Count per Normalization Unit |
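
Comparing each instruction category against the total, and VMEM against VALU, is easier with the counts turned into shares. The sketch below does that for a set of made-up counts. The function names and values are hypothetical, and no assumption is made that the listed categories sum to Total Instructions, so the total is passed in separately.

```python
def instruction_mix(counts, total_instructions):
    """Return each category's share of the total, in percent."""
    if total_instructions <= 0:
        raise ValueError("total_instructions must be positive")
    return {name: 100.0 * n / total_instructions for name, n in counts.items()}


def vmem_to_valu_ratio(counts):
    """VMEM instructions per VALU instruction; None if there is no VALU work."""
    if counts.get("VALU", 0) == 0:
        return None
    return counts.get("VMEM", 0) / counts["VALU"]


# Hypothetical per-normalization-unit counts (assumptions for illustration).
sample = {"VALU": 600, "SALU": 200, "SMEM": 50, "VMEM": 100, "LDS": 50}
mix = instruction_mix(sample, total_instructions=1000)

print(mix)
print(vmem_to_valu_ratio(sample))
```

A rising VMEM-to-VALU ratio across kernel revisions is a quick signal that a change made the kernel more memory-heavy.
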
VMEM instruction mix
| Metric | Description | Unit |
|---|---|---|
| Instructions - VMEM Flat | Flat-addressing vector memory instructions (SQ_INSTS_FLAT) per normalization unit. Flat can resolve to global, scratch, or LDS depending on the address. | Count per Normalization Unit |
| Instructions - TEX Load | Texture / image load instructions (SQ_INSTS_TEX_LOAD) per normalization unit. | Count per Normalization Unit |
| Instructions - TEX Store | Texture / image store instructions (SQ_INSTS_TEX_STORE) per normalization unit. | Count per Normalization Unit |
| Instructions - Flat Load | Flat load instructions (SQ_INSTS_FLAT_LOAD) per normalization unit. | Count per Normalization Unit |
| Instructions - Flat Store | Flat store instructions (SQ_INSTS_FLAT_STORE) per normalization unit. | Count per Normalization Unit |
LDS instruction mix
| Metric | Description | Unit |
|---|---|---|
| LDS Direct Load Instructions | Number of direct LDS load instructions. Direct loads fetch data from LDS without address indirection. These are typically faster than indexed LDS operations. | Count per Normalization Unit |
| LDS Parameter Load Instructions | Number of LDS parameter load instructions. These load shader input parameters that have been staged in LDS for efficient access by the wavefront. | Count per Normalization Unit |
| WAVE32 LDS Parameter Load Instructions | Number of LDS parameter loads executed in wave32 mode. RDNA 3.5 supports both wave32 and wave64 execution modes, with wave32 providing lower latency for some operations. | Count per Normalization Unit |
Wait state analysis
| Metric | Description | Unit |
|---|---|---|
| Wait for Any | Percentage of wavefront lifetime spent waiting for any reason. High wait percentages indicate opportunities to improve performance through better memory access patterns, reduced synchronization, or increased occupancy. | Percent |
| Wait for Instruction Fetch | Percentage of wavefront lifetime spent waiting for instruction fetch. High values may indicate instruction cache misses, large kernel code, or divergent control flow impacting instruction locality. | Percent |
| Wait for Barrier | Percentage of wavefront lifetime spent at barrier synchronization points. High barrier wait times may indicate workload imbalance within workgroups. Consider restructuring algorithms to reduce synchronization frequency. | Percent |
| Wait for Counter | Percentage of wavefront lifetime spent waiting for memory operation counters. High values indicate memory latency is impacting performance. Consider improving memory access patterns or increasing occupancy to hide latency. | Percent |
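
Each wait metric is a share of wavefront lifetime, so the underlying arithmetic is wait cycles divided by lifetime cycles. The sketch below shows that calculation and picks the largest specific wait category. The function names, category labels, and cycle counts are hypothetical, and the raw counters the profiler actually uses are not named in this section.

```python
def wait_percent(wait_cycles, wave_lifetime_cycles):
    """Share of wavefront lifetime spent in a given wait state, in percent."""
    if wave_lifetime_cycles <= 0:
        raise ValueError("wave_lifetime_cycles must be positive")
    return 100.0 * wait_cycles / wave_lifetime_cycles


def dominant_wait(wait_cycles_by_kind):
    """Name of the wait category with the most cycles."""
    return max(wait_cycles_by_kind, key=wait_cycles_by_kind.get)


# Hypothetical cycle counts (assumptions for illustration).
lifetime = 10_000
waits = {"instruction_fetch": 300, "barrier": 500, "counter": 4_500}

for kind, cycles in waits.items():
    print(kind, wait_percent(cycles, lifetime))
print(dominant_wait(waits))
```

In this made-up sample the counter wait dominates, which per the table would point toward memory latency rather than synchronization.
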
WGP instruction cache
| Metric | Description | Unit |
|---|---|---|
| Icache Hit Rate | Percentage of instruction cache accesses that hit. High hit rates are essential for sustained instruction throughput. Low rates may indicate large kernels or divergent branches causing instruction cache thrashing. | Percent |
WGP scalar data cache
| Metric | Description | Unit |
|---|---|---|
| Dcache Hit Rate | Percentage of scalar data cache accesses that hit. High hit rates indicate efficient reuse of constant and uniform data. Low rates may suggest excessive unique constant data or poor temporal locality. | Percent |
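
Both the instruction cache and scalar data cache hit rates are the fraction of accesses that hit, expressed as a percentage. A minimal sketch of that calculation follows. The function name and the hit and miss counts are hypothetical, and the profiler may derive the rate from different raw counters (for example, accesses and misses rather than hits and misses).

```python
def hit_rate_percent(hits, misses):
    """Cache hit rate in percent; 0.0 when there were no accesses."""
    accesses = hits + misses
    if accesses == 0:
        return 0.0
    return 100.0 * hits / accesses


# Hypothetical counts (assumptions for illustration).
icache = hit_rate_percent(hits=950, misses=50)
dcache = hit_rate_percent(hits=600, misses=400)

print(icache, dcache)
```
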