Workgroup processor (WGP)#

Within each shader engine, a Workgroup Processor (WGP) pairs two Compute Units (CUs) that share resources and execute waves once the SPI workgroup manager hands off work. On RDNA3-class GPUs (including discrete Ryzen APU 3x and RDNA3.5 / gfx115x integrations), compute kernels are typically tracked as wave32 waves; the gfx115x WGP panels cover occupancy, dispatch, instruction mix, and local caches at WGP (CU-pair) granularity.

The sections below list RDNA3.5 (gfx115x) metric descriptions.

Note

AMD Instinct (CDNA) GPUs use a different execution hierarchy and panel grouping. For Instinct-only pipeline metrics (for example, VALU / VMEM / MFMA-style tables), see AMD CDNA architecture (CDNA to CDNA4), and do not assume that RDNA WGPs or CUs map directly to those layouts.

Roofline#

Roofline performance rates#

Metric

Description

Unit

VALU FLOPs

Floating-point operations per second on the VALU. Peak is based on single-issue FP32 FMA (128 FLOPs/CU/cycle). VOPD dual-issue doubles throughput: FP32 to 256 and packed FP16 to 512 FLOPs/CU/cycle. Uses the aggregate instruction counter.

GFLOP/s

VALU FLOPs (F64)

64-bit floating-point operations per second on the VALU. Peak is 4 FLOPs/CU/cycle, because FP64 runs at 1/32 of the single-issue FP32 FMA rate on RDNA (128 / 32). Uses the dedicated FP64 instruction counter (SQ_INSTS_VALU_DP_sum).

GFLOP/s

GL2 Cache BW

Achieved bandwidth between the GL1 and GL2 caches. This represents GL1 miss traffic that must be serviced by GL2. High bandwidth relative to peak may indicate GL1 cache pressure or poor data locality at the GL1 level.

Bytes/s

GL1 Cache BW

Achieved bandwidth between the GL0 (TCP Cache) and GL1 caches. This represents GL0 miss traffic. High bandwidth relative to peak may indicate cache pressure at the GL0 level or memory access patterns with poor locality.

Bytes/s

GL0 Cache BW

Achieved bandwidth at the GL0 vector cache (TCP Cache). This represents total data requested by shader cores from the memory hierarchy. Compare against peak to assess memory subsystem utilization.

Bytes/s

LDS BW

Achieved bandwidth for Local Data Share memory operations. LDS provides high-bandwidth, low-latency shared memory within a workgroup. Compare against peak to identify if LDS bandwidth is limiting performance.

Bytes/s
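The per-CU, per-cycle peaks quoted for VALU FLOPs become a device-level peak once multiplied by the CU count and the shader clock. The following is a minimal sketch of that arithmetic. The function name, the CU count, and the clock are hypothetical placeholders and do not describe any specific gfx115x part.

```python
# Per-CU, per-cycle peak rates as quoted in the VALU FLOPs descriptions.
FP32_FMA_FLOPS_PER_CU_CYCLE = 128          # single-issue FP32 FMA
FP32_VOPD_FLOPS_PER_CU_CYCLE = 256         # VOPD dual-issue FP32
FP16_PACKED_VOPD_FLOPS_PER_CU_CYCLE = 512  # VOPD dual-issue packed FP16
FP64_FLOPS_PER_CU_CYCLE = FP32_FMA_FLOPS_PER_CU_CYCLE // 32  # 1/32 rate


def peak_gflops(flops_per_cu_cycle: int, num_cus: int, clock_ghz: float) -> float:
    """Device peak in GFLOP/s = FLOPs/CU/cycle * number of CUs * clock in GHz."""
    return flops_per_cu_cycle * num_cus * clock_ghz


if __name__ == "__main__":
    # Hypothetical device: these two values are assumptions for illustration.
    num_cus, clock_ghz = 16, 2.0
    for label, rate in [
        ("FP32 single-issue", FP32_FMA_FLOPS_PER_CU_CYCLE),
        ("FP32 VOPD", FP32_VOPD_FLOPS_PER_CU_CYCLE),
        ("FP64", FP64_FLOPS_PER_CU_CYCLE),
    ]:
        print(f"{label}: {peak_gflops(rate, num_cus, clock_ghz):.1f} GFLOP/s")
```

Which peak is the right ceiling depends on whether the kernel can actually dual-issue, so comparing achieved VALU FLOPs against the single-issue figure is the more conservative choice.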

Roofline plot points#

Metric

Description

Unit

AI L2

Arithmetic Intensity relative to the L2 cache. This metric shows compute density relative to L2 cache traffic. Higher values indicate better data reuse upstream of L2, while lower values suggest heavy traffic into L2.

FLOPs/Byte

AI L1

Arithmetic Intensity relative to the L1 cache (TCP). Higher values indicate more compute operations per byte of data loaded, suggesting compute-bound behavior. Lower values suggest memory-bound behavior limited by L1 bandwidth.

FLOPs/Byte

AI LDS

Arithmetic Intensity relative to the Local Data Share (LDS). Higher values indicate more compute per byte of LDS traffic, suggesting effective reuse of shared data within a workgroup. Lower values suggest memory-bound behavior limited by LDS bandwidth or bank-conflict pressure.

FLOPs/Byte

Performance (GFLOPs)

Achieved compute throughput measured in billions of floating-point operations per second. This value, combined with arithmetic intensity, determines the kernel’s position on the roofline chart for performance analysis.

GFLOP/s
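The plot points above follow the usual roofline model: arithmetic intensity is FLOPs divided by bytes moved at a given memory level, and attainable performance is capped by the lower of the compute peak and intensity times bandwidth. The sketch below illustrates that relationship only. All function names are hypothetical, and bandwidth is taken in GB/s, so values reported in Bytes/s would need dividing by 1e9 first.

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs per byte of traffic at one memory level (L2, L1, or LDS)."""
    if bytes_moved <= 0:
        raise ValueError("bytes_moved must be positive")
    return flops / bytes_moved


def attainable_gflops(ai: float, peak_gflops: float, peak_bw_gbs: float) -> float:
    """Roofline ceiling: min(compute peak, AI * bandwidth peak)."""
    return min(peak_gflops, ai * peak_bw_gbs)


def is_memory_bound(ai: float, peak_gflops: float, peak_bw_gbs: float) -> bool:
    """True when the kernel sits left of the ridge point for this memory level."""
    ridge_point = peak_gflops / peak_bw_gbs
    return ai < ridge_point


if __name__ == "__main__":
    # Hypothetical numbers chosen only to show both regimes.
    peak, bw = 1000.0, 100.0
    for ai in (2.0, 50.0):
        bound = "memory" if is_memory_bound(ai, peak, bw) else "compute"
        print(ai, attainable_gflops(ai, peak, bw), bound)
```

Evaluating the same kernel against each level (L2, L1, LDS) gives one point per level, and the level with the lowest ceiling is the likeliest limiter.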

WGP block metrics#

WGP utilization#

Metric

Description

Unit

WGP Utilization

Percentage of GPU active cycles where the Workgroup Processor is busy executing wavefronts. RDNA 3.5 WGPs contain two compute units, each with dual SIMD32 units. Low utilization may indicate insufficient parallelism or resource bottlenecks.

Percent

Wavefront launch stats#

Metric

Description

Unit

Grid Size

Number of work-items in the dispatch grid for the kernel (Grid_Size), reported as average, minimum, and maximum across samples. Defines the total parallel work launched.

Work-items

Workgroup Size

Number of work-items per workgroup (Workgroup_Size). Together with grid size and resource limits, it determines the number of waves launched.

Work-items

VGPRs

Vector general-purpose registers allocated per wave (Arch_VGPR). Limits occupancy when the per-CU VGPR capacity is exhausted.

Registers

SGPRs

Scalar general-purpose registers allocated for the wave (SGPR). Uniform data and execution state live in the scalar path.

Registers

LDS Allocation

Local Data Share bytes reserved per workgroup (LDS_Per_Workgroup). High LDS per group reduces how many workgroups can run concurrently on a CU/WGP.

Bytes

Scratch Allocation

Scratch (private) memory bytes per work-item (Scratch_Per_Workitem). Backed by off-chip memory when spill/stack exceeds limits.

Bytes per Work-item
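The launch statistics above combine into a wave count and into a rough bound on concurrent workgroups. The following is a minimal sketch under stated assumptions: wave32 execution, a grid size expressed in work-items, and an LDS capacity passed in by the caller rather than taken from any hardware specification. The function names are hypothetical.

```python
import math
from typing import Optional


def waves_per_workgroup(workgroup_size: int, wave_size: int = 32) -> int:
    """A partially filled last wave still occupies a full wave slot."""
    return math.ceil(workgroup_size / wave_size)


def total_waves(grid_size: int, workgroup_size: int, wave_size: int = 32) -> int:
    """Workgroups in the grid times waves per workgroup."""
    num_workgroups = math.ceil(grid_size / workgroup_size)
    return num_workgroups * waves_per_workgroup(workgroup_size, wave_size)


def lds_limited_workgroups(lds_per_workgroup: int, lds_capacity_bytes: int) -> Optional[int]:
    """Workgroups that fit in the given LDS capacity, or None if LDS is unused."""
    if lds_per_workgroup == 0:
        return None  # LDS places no limit on concurrency
    return lds_capacity_bytes // lds_per_workgroup


if __name__ == "__main__":
    print(total_waves(grid_size=1024, workgroup_size=256))
    # 65536 is a placeholder capacity, not a documented gfx115x value.
    print(lds_limited_workgroups(16384, 65536))
```

VGPR and SGPR allocation impose similar limits, and the tightest of them sets the achievable occupancy.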

Wave dispatch#

Metric

Description

Unit

Dispatched Waves

Number of shader waves launched (SQ_WAVES_sum), reported as average, minimum, and maximum. Each wave is a SIMD execution entity (typically wave32 on this configuration).

Waves

Dispatched Threads

Number of work-items (threads) launched, as counted by SQ_ITEMS_sum, reported as average, minimum, and maximum.

Threads
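Dividing dispatched threads by dispatched waves gives the average number of work-items per wave, which hints at how well waves are filled. The sketch below assumes wave32 and assumes that SQ_ITEMS_sum counts only launched work-items (not padded lanes). Both are assumptions, and the function names are hypothetical.

```python
def avg_threads_per_wave(sq_items_sum: float, sq_waves_sum: float) -> float:
    """Average work-items per dispatched wave; 0.0 when no waves were launched."""
    if sq_waves_sum == 0:
        return 0.0
    return sq_items_sum / sq_waves_sum


def lane_fill_percent(sq_items_sum: float, sq_waves_sum: float, wave_size: int = 32) -> float:
    """Average wave fill as a percentage of the wave size."""
    return 100.0 * avg_threads_per_wave(sq_items_sum, sq_waves_sum) / wave_size


if __name__ == "__main__":
    # Hypothetical counter values.
    print(avg_threads_per_wave(3200, 100))
    print(lane_fill_percent(1600, 100))
```

A fill well below 100% usually points at workgroup sizes that are not a multiple of the wave size.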

Wave life#

Metric

Description

Unit

Wave Life

Average number of cycles from wavefront creation to completion. Longer wave life indicates more execution time per wavefront, which may be due to complex kernels, memory latency, or synchronization waits.

Cycles

Wave instruction mix#

Metric

Description

Unit

Total Instructions

Total number of instructions executed across all wavefronts. This provides an overall measure of computational work. Compare with specific instruction types to understand the workload characteristics.

Count per Normalization Unit

Instructions - Internal

Internal SQ instructions (SQ_INSTS_INTERNAL) per normalization unit. These represent bookkeeping or non-user-visible work issued by the SQ.

Count per Normalization Unit

Instructions - VALU

Number of Vector ALU instructions executed. VALU handles arithmetic and logic operations across all work-items in a wavefront. High VALU counts indicate compute-intensive kernels.

Count per Normalization Unit

Instructions - SALU

Number of Scalar ALU instructions executed. SALU handles operations with uniform values across the wavefront. Efficient kernels often have a balance of VALU and SALU operations.

Count per Normalization Unit

Instructions - SMEM

Number of scalar memory instructions executed. SMEM loads constant and uniform data through the scalar cache. High counts may indicate heavy use of constant buffers or kernel arguments.

Count per Normalization Unit

Instructions - Transcendental

Transcendental VALU instructions (SQ_INSTS_VALU_TRANS) per normalization unit. These cover hardware approximations such as sin, cos, and log on the vector ALU.

Count per Normalization Unit

Instructions - VMEM

Number of vector memory instructions executed, including global, buffer, and texture operations. High VMEM counts relative to VALU may indicate memory-bound workloads.

Count per Normalization Unit

Instructions - LDS

Number of Local Data Share instructions executed. LDS provides fast shared memory within a workgroup. High counts indicate active use of shared memory for inter-work-item communication.

Count per Normalization Unit
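Comparing each instruction type against Total Instructions, as the table suggests, amounts to computing shares and a few ratios. The following is a minimal sketch with hypothetical function names and made-up counts. It takes the total as an explicit input because the categories are not guaranteed to be disjoint (for example, transcendental instructions may already be counted as VALU).

```python
from typing import Dict


def share_of_total(counts: Dict[str, float], total_instructions: float) -> Dict[str, float]:
    """Percentage of Total Instructions contributed by each category."""
    if total_instructions == 0:
        return {name: 0.0 for name in counts}
    return {name: 100.0 * value / total_instructions for name, value in counts.items()}


def vmem_to_valu_ratio(counts: Dict[str, float]) -> float:
    """VMEM instructions per VALU instruction; inf when there is no VALU work."""
    valu = counts.get("VALU", 0)
    if valu == 0:
        return float("inf")
    return counts.get("VMEM", 0) / valu


if __name__ == "__main__":
    # Hypothetical per-normalization-unit counts.
    counts = {"VALU": 60, "SALU": 20, "VMEM": 20}
    print(share_of_total(counts, 100))
    print(vmem_to_valu_ratio(counts))
```

A rising VMEM-to-VALU ratio across kernel versions is one quick signal that a change shifted the workload toward memory traffic.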

VMEM instruction mix#

Metric

Description

Unit

Instructions - VMEM Flat

Flat-addressing vector memory instructions (SQ_INSTS_FLAT) per normalization unit. Flat can resolve to global, scratch or LDS depending on address.

Count per Normalization Unit

Instructions - TEX Load

Texture / image load instructions (SQ_INSTS_TEX_LOAD) per normalization unit.

Count per Normalization Unit

Instructions - TEX Store

Texture / image store instructions (SQ_INSTS_TEX_STORE) per normalization unit.

Count per Normalization Unit

Instructions - Flat Load

Flat load instructions (SQ_INSTS_FLAT_LOAD) per normalization unit.

Count per Normalization Unit

Instructions - Flat Store

Flat store instructions (SQ_INSTS_FLAT_STORE) per normalization unit.

Count per Normalization Unit

LDS instruction mix#

Metric

Description

Unit

LDS Direct Load Instructions

Number of direct LDS load instructions. Direct loads fetch data from LDS without address indirection. These are typically faster than indexed LDS operations.

Count per Normalization Unit

LDS Parameter Load Instructions

Number of LDS parameter load instructions. These load shader input parameters that have been staged in LDS for efficient access by the wavefront.

Count per Normalization Unit

WAVE32 LDS Parameter Load Instructions

Number of LDS parameter loads executed in wave32 mode. RDNA 3.5 supports both wave32 and wave64 execution modes, with wave32 providing lower latency for some operations.

Count per Normalization Unit

Wait state analysis#

Metric

Description

Unit

Wait for Any

Percentage of wavefront lifetime spent waiting for any reason. High wait percentages indicate opportunities to improve performance through better memory access patterns, reduced synchronization, or increased occupancy.

Percent

Wait for Instruction Fetch

Percentage of wavefront lifetime spent waiting for instruction fetch. High values may indicate instruction cache misses, large kernel code, or divergent control flow impacting instruction locality.

Percent

Wait for Barrier

Percentage of wavefront lifetime spent at barrier synchronization points. High barrier wait times may indicate workload imbalance within workgroups. Consider restructuring algorithms to reduce synchronization frequency.

Percent

Wait for Counter

Percentage of wavefront lifetime spent waiting for memory operation counters. High values indicate memory latency is impacting performance. Consider improving memory access patterns or increasing occupancy to hide latency.

Percent
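One way to read the wait-state table is to rank the specific wait reasons and look at the largest first. The sketch below does that for a dictionary of percentages keyed by the metric names above. The function name and the sample values are hypothetical, and the code assumes "Wait for Any" is an aggregate that should be excluded from the ranking.

```python
from typing import Dict, List, Tuple


def rank_wait_reasons(waits: Dict[str, float]) -> List[Tuple[str, float]]:
    """Specific wait reasons sorted from largest to smallest percentage."""
    specific = {name: pct for name, pct in waits.items() if name != "Wait for Any"}
    return sorted(specific.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical percentages of wavefront lifetime.
    sample = {
        "Wait for Any": 70.0,
        "Wait for Instruction Fetch": 5.0,
        "Wait for Barrier": 15.0,
        "Wait for Counter": 45.0,
    }
    for name, pct in rank_wait_reasons(sample):
        print(f"{name}: {pct:.1f}%")
```

The specific reasons can overlap in time, so their sum need not equal the "Wait for Any" value.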

WGP instruction cache#

Metric

Description

Unit

Icache Hit Rate

Percentage of instruction cache accesses that hit. High hit rates are essential for sustained instruction throughput. Low rates may indicate large kernels or divergent branches causing instruction cache thrashing.

Percent

WGP scalar data cache#

Metric

Description

Unit

Dcache Hit Rate

Percentage of scalar data cache accesses that hit. High hit rates indicate efficient reuse of constant and uniform data. Low rates may suggest excessive unique constant data or poor temporal locality.

Percent
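Both the instruction cache and scalar data cache hit rates reduce to hits over total accesses. The following is a minimal sketch of that calculation. The function name and inputs are hypothetical, and the underlying hardware counters may expose requests and misses rather than hits and misses, in which case hits would be derived as requests minus misses.

```python
def hit_rate_percent(hits: float, misses: float) -> float:
    """Cache hit rate as a percentage of all accesses; 0.0 when there were none."""
    total = hits + misses
    if total == 0:
        return 0.0
    return 100.0 * hits / total


if __name__ == "__main__":
    # Hypothetical access counts.
    print(hit_rate_percent(90, 10))
```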