Workgroup processor (WGP)#
On RDNA3-class GPUs (including Ryzen APUs with RDNA3.5 / gfx1151 integrated graphics), shader work is organized into Workgroup Processors (WGPs). A WGP pairs two Compute Units (CUs) that share resources; in the gfx1151 panel set, compute kernels are typically tracked as wave32 waves. The profiler’s WGP panel is the analogue of the Instinct compute unit chapter, covering occupancy, dispatch, instruction mix, and the local caches at that granularity.
The sections below list RDNA3.5 (gfx1151) metric descriptions.
Note
For VALU / VMEM / MFMA-style pipeline tables and MI-series diagrams, see Compute unit (CU).
Roofline#
Roofline performance rates#
| Metric | Description | Unit |
|---|---|---|
| VALU FLOPs (F16) | The total 16-bit floating-point operations executed per second on the VALU. Theoretical FP16 throughput is 2× the FP32 FMA peak (packed half on Navi3 / RDNA3-class VALUs). Shown alongside the peak empirical F16 FLOPs achievable on the specific accelerator. | GFLOP/s |
| VALU FLOPs (F32) | The total 32-bit floating-point operations executed per second on the VALU. Navi3 / RDNA3-class CUs sustain up to 128 V_FMA_F32 ops per cycle per CU in wave64 mode; the roofline peak column uses empirically measured F32 throughput. Shown alongside the peak empirical F32 FLOPs achievable on the specific accelerator. | GFLOP/s |
| VALU FLOPs (F64) | The total 64-bit floating-point operations executed per second on the VALU. FP64 throughput is typically 1/16 or 1/32 of FP32 on RDNA3.5 consumer parts. Shown alongside the peak empirical F64 FLOPs achievable on the specific accelerator. | GFLOP/s |
| GL2 Cache Bandwidth | The number of bytes transferred between GL1C and GL2C per unit time. GL2C is the last-level cache in RDNA3.5 before system memory. The peak empirically measured bandwidth achievable on the specific accelerator is displayed alongside for comparison. | Bytes/s |
| GL1 Cache Bandwidth | The number of bytes transferred between TCP and GL1C per unit time. GL1C is the shared L1 cache within each Shader Engine. The peak empirically measured bandwidth achievable on the specific accelerator is displayed alongside for comparison. | Bytes/s |
| TCP Cache Bandwidth | The number of bytes looked up in the TCP (Texture Cache Per-pipe) per unit time. TCP is the L0 vector cache in RDNA3.5, one per CU within a WGP. The peak empirically measured bandwidth achievable on the specific accelerator is displayed alongside for comparison. | Bytes/s |
| LDS Bandwidth | The maximum number of bytes that could have been loaded from, stored to, or atomically updated in the LDS per unit time. RDNA3.5 has 32 LDS banks per CU. The peak empirically measured LDS bandwidth achievable on the specific accelerator is displayed alongside for comparison. | Bytes/s |
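Every rate in this table reduces to a counter total divided by the kernel’s wall-clock time. The following Python sketch shows that arithmetic; all input values are placeholders, not profiler output.

```python
# Minimal sketch: every rate in the table above is a counter total divided by
# the kernel's wall-clock time. All input values are placeholders, not
# profiler output.

def roofline_rates(total_flops, tcp_bytes, gl1_bytes, gl2_bytes, lds_bytes,
                   duration_s):
    """Return achieved GFLOP/s and per-level bandwidths in GB/s."""
    to_giga = 1e-9
    return {
        "VALU GFLOP/s": total_flops * to_giga / duration_s,
        "TCP GB/s":     tcp_bytes   * to_giga / duration_s,
        "GL1 GB/s":     gl1_bytes   * to_giga / duration_s,
        "GL2 GB/s":     gl2_bytes   * to_giga / duration_s,
        "LDS GB/s":     lds_bytes   * to_giga / duration_s,
    }

# Example for a hypothetical 1.2 ms kernel.
print(roofline_rates(total_flops=3.5e9, tcp_bytes=2.0e9, gl1_bytes=1.1e9,
                     gl2_bytes=6.0e8, lds_bytes=9.0e8, duration_s=1.2e-3))
```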
Roofline plot points#
| Metric | Description | Unit |
|---|---|---|
| AI TCP | The Arithmetic Intensity (AI) relative to the TCP cache. It is the ratio of total floating-point operations (FLOPs) to total bytes transferred from TCP to the shader cores. This value is used as the x-coordinate for the TCP roofline. | FLOPs/Byte |
| AI GL1 | The Arithmetic Intensity (AI) relative to the GL1C cache. It is the ratio of total FLOPs to total bytes transferred between GL1C and TCP. This value is used as the x-coordinate for the GL1 roofline. | FLOPs/Byte |
| AI GL2 | The Arithmetic Intensity (AI) relative to GL2C (L2 cache). It is the ratio of total FLOPs to total bytes transferred between GL2C and GL1C. This value is used as the x-coordinate for the GL2 roofline. | FLOPs/Byte |
| Performance (GFLOPs) | The overall achieved performance, measured in GFLOP/s. Calculated as the sum of all VALU floating-point operations divided by the total execution time. This value is used as the y-coordinate for the kernel’s point on the roofline plot. | GFLOP/s |
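A small sketch of how these coordinates combine into a roofline point. The peak compute and bandwidth numbers and the FLOP/byte totals are illustrative placeholders, not measured gfx1151 values.

```python
# Minimal sketch of how the roofline coordinates above are formed. The peak
# compute and bandwidth numbers and the FLOP/byte totals are illustrative
# placeholders, not measured gfx1151 values.

def roofline_point(total_flops, bytes_moved, duration_s):
    """Return (arithmetic intensity, achieved GFLOP/s) for one cache level."""
    ai = total_flops / bytes_moved            # FLOPs/Byte (x-coordinate)
    gflops = total_flops / duration_s / 1e9   # GFLOP/s    (y-coordinate)
    return ai, gflops

def ceiling(ai, peak_gflops, peak_gbps):
    """Attainable performance at a given intensity: min(compute peak, BW * AI)."""
    return min(peak_gflops, peak_gbps * ai)

flops, secs = 3.5e9, 1.2e-3
for level, nbytes, peak_bw in [("TCP", 2.0e9, 3000.0),
                               ("GL1", 1.1e9, 1500.0),
                               ("GL2", 6.0e8, 800.0)]:
    ai, perf = roofline_point(flops, nbytes, secs)
    print(f"{level}: AI={ai:.2f} FLOP/B, achieved={perf:.1f} GFLOP/s, "
          f"ceiling={ceiling(ai, peak_gflops=20000.0, peak_gbps=peak_bw):.1f} GFLOP/s")
```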
WGP block metrics#
WGP utilization#
| Metric | Description | Unit |
|---|---|---|
| WGP Utilization | WGP busy as a percent of GRBM GUI-active cycles. RDNA3.5 WGPs pair 2 CUs, each with 2 SIMD32 units. | Percent |
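A one-line sketch of the ratio, assuming a WGP busy-cycle total and the GRBM GUI-active cycle total are available; the argument names are illustrative, not a verified gfx1151 counter list.

```python
# Illustrative only: the argument names stand in for whatever busy-cycle and
# GUI-active counters the profiler exports; they are not a verified gfx1151
# counter list.
def wgp_utilization(wgp_busy_cycles, grbm_gui_active_cycles):
    """Percent of GUI-active time during which the WGPs were busy."""
    return 100.0 * wgp_busy_cycles / grbm_gui_active_cycles

print(wgp_utilization(wgp_busy_cycles=8.2e6, grbm_gui_active_cycles=1.0e7))  # 82.0
```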
Wavefront launch stats#
| Metric | Description | Unit |
|---|---|---|
| Grid Size | Number of work-items in the dispatch grid for the kernel (Grid_Size), averaged/min/maxed across samples. Defines the total parallel work launched. | Work-items |
| Workgroup Size | Number of work-items per workgroup (Workgroup_Size). Together with grid size and resource limits, it determines the wave/wavefront count. | Work-items |
| VGPRs | Vector general-purpose registers allocated per wave (Arch_VGPR). Limits occupancy when the per-CU VGPR capacity is exhausted. | Registers |
| SGPRs | Scalar general-purpose registers allocated per wave (SGPR). Uniform data and execution state live in the scalar path. | Registers |
| LDS Allocation | Local Data Share bytes reserved per workgroup (LDS_Per_Workgroup). High LDS use per group reduces how many workgroups can run concurrently on a CU/WGP. | Bytes |
| Scratch Allocation | Scratch (private) memory bytes per work-item (Scratch_Per_Workitem). Backed by off-chip memory when spill/stack usage exceeds limits. | Bytes per Work-item |
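These launch statistics are what bound occupancy. The sketch below estimates the per-SIMD wave limit from workgroup size, VGPRs per wave, and LDS per workgroup; the wave-slot, VGPR, and LDS capacities are example parameters, not asserted gfx1151 limits.

```python
# Back-of-the-envelope occupancy sketch driven by the launch statistics above.
# The wave-slot, VGPR, and LDS capacities are example parameters, NOT asserted
# gfx1151 limits -- substitute the documented values for your part.
import math

def occupancy_limits(workgroup_size, vgprs_per_wave, lds_per_workgroup,
                     wave_size=32, max_waves_per_simd=16,
                     vgprs_per_simd=1536, lds_per_wgp=65536,
                     simds_per_wgp=4):
    """Estimate the waves-per-SIMD limit imposed by each resource."""
    waves_per_wg = math.ceil(workgroup_size / wave_size)
    by_slots = max_waves_per_simd
    by_vgpr = vgprs_per_simd // max(vgprs_per_wave, 1)
    if lds_per_workgroup > 0:
        wgs_by_lds = lds_per_wgp // lds_per_workgroup       # workgroups per WGP
        by_lds = wgs_by_lds * waves_per_wg / simds_per_wgp  # waves per SIMD
    else:
        by_lds = float("inf")
    return {"wave slots": by_slots, "VGPRs": by_vgpr, "LDS": by_lds,
            "limit": min(by_slots, by_vgpr, by_lds)}

# Example: 256-item workgroups, 96 VGPRs per wave, 16 KiB LDS per workgroup.
print(occupancy_limits(workgroup_size=256, vgprs_per_wave=96,
                       lds_per_workgroup=16384))
```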
Wave dispatch#
| Metric | Description | Unit |
|---|---|---|
| Dispatched Waves | Number of shader waves launched (SQ_WAVES_sum), averaged/min/maxed. Each wave is a SIMD execution entity (typically wave32 on this configuration). | Waves |
| Dispatched Threads | Number of work-items (threads) launched, as reported by SQ_ITEMS_sum, averaged/min/maxed. | Threads |
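A quick sanity check: the dispatched wave count should match what grid size, workgroup size, and wave32 execution imply. The numbers below are placeholders.

```python
# Placeholder numbers; wave_size is 32 per the configuration described above.
import math

def expected_waves(grid_size, workgroup_size, wave_size=32):
    """Waves implied by the launch geometry, for comparison with SQ_WAVES_sum."""
    workgroups = math.ceil(grid_size / workgroup_size)
    return workgroups * math.ceil(workgroup_size / wave_size)

print(expected_waves(grid_size=1 << 20, workgroup_size=256))  # 32768
```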
Wave life#
| Metric | Description | Unit |
|---|---|---|
| Wave Life | Average number of cycles a wave exists from creation to completion. Calculated as SQ_WAVE_CYCLES / SQ_WAVES. | Cycles |
Wave instruction mix#
| Metric | Description | Unit |
|---|---|---|
| Total Instructions | Total number of instructions executed across all waves. | Count per Normalization Unit |
| Instructions - VALU | Number of Vector ALU instructions executed (SQ_INSTS_VALU). The VALU handles all vector arithmetic and logic operations. Transcendental VALU ops are reported separately as “Instructions - Transcendental” in this same table. | Count per Normalization Unit |
| Instructions - SALU | Number of Scalar ALU instructions executed. The SALU handles scalar operations that are uniform across the wave. | Count per Normalization Unit |
| Instructions - SMEM | Number of Scalar Memory instructions executed. SMEM loads uniform data through the scalar cache. | Count per Normalization Unit |
| Instructions - VMEM | Number of Vector Memory instructions executed. Includes global, buffer, and texture memory operations. | Count per Normalization Unit |
| Instructions - LDS | Total Local Data Share instructions (SQ_INSTS_LDS). LDS provides fast shared memory within a workgroup. For LDS sub-categories (direct / parameter / wave32 parameter loads), see the “LDS instruction mix” table. | Count per Normalization Unit |
| Instructions - Internal | Internal SQ instructions (SQ_INSTS_INTERNAL) per normalization unit: bookkeeping or non-user-visible work attributed to the scheduler. | Count per Normalization Unit |
| Instructions - Transcendental | Transcendental VALU instructions (SQ_INSTS_VALU_TRANS) per normalization unit, e.g. sin/cos/log approximations on the vector ALU. | Count per Normalization Unit |
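A short sketch of turning these per-category counts into a percentage mix. The counter values are placeholders, and the counter names in the comments follow the row names above rather than a verified gfx1151 list.

```python
# The counter values below are placeholders; counter names in the comments
# follow the row names above and are assumptions, not a verified list.

def instruction_mix(counts):
    """Convert per-category instruction counts into percentages of the total."""
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}

counts = {
    "VALU": 5_200_000,  # SQ_INSTS_VALU
    "SALU": 1_100_000,  # scalar ALU instruction count
    "SMEM":   200_000,  # scalar memory instruction count
    "VMEM":   900_000,  # vector memory instruction count
    "LDS":    400_000,  # SQ_INSTS_LDS
}
for name, pct in instruction_mix(counts).items():
    print(f"{name:>4}: {pct:5.1f} %")
```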
VMEM instruction mix#
| Metric | Description | Unit |
|---|---|---|
| Instructions - VMEM Flat | Flat-addressing vector memory instructions (SQ_INSTS_FLAT) per normalization unit. Flat can resolve to global, scratch, or LDS depending on the address. | Count per Normalization Unit |
| Instructions - TEX Load | Texture / image load instructions (SQ_INSTS_TEX_LOAD) per normalization unit. | Count per Normalization Unit |
| Instructions - TEX Store | Texture / image store instructions (SQ_INSTS_TEX_STORE) per normalization unit. | Count per Normalization Unit |
| Instructions - Flat Load | Flat load instructions (SQ_INSTS_FLAT_LOAD) per normalization unit. | Count per Normalization Unit |
| Instructions - Flat Store | Flat store instructions (SQ_INSTS_FLAT_STORE) per normalization unit. | Count per Normalization Unit |
LDS instruction mix#
| Metric | Description | Unit |
|---|---|---|
| LDS Direct Load Instructions | LDS direct load instructions issued (SQ_INSTS_LDS_DIRECT_LOAD). A subset of total LDS instructions; the LDS mix rows are not required to sum to the total. | Count per Normalization Unit |
| LDS Parameter Load Instructions | LDS parameter load instructions issued (SQ_INSTS_LDS_PARAM_LOAD). | Count per Normalization Unit |
| WAVE32 LDS Parameter Load Instructions | LDS parameter load instructions issued under wave32 execution (SQ_INSTS_WAVE32_LDS_PARAM_LOAD). | Count per Normalization Unit |
Wait state analysis#
| Metric | Description | Unit |
|---|---|---|
| Wait for Any | Percentage of wave lifetime spent waiting for any reason, including memory, barriers, or instruction fetch. Computed as 100 * SQ_WAIT_ANY / SQ_WAVE_CYCLES. | Percent |
| Wait for Instruction Fetch | Percentage of wave lifetime spent waiting for instructions to arrive from the instruction cache. Computed as 100 * SQ_WAIT_IFETCH / SQ_WAVE_CYCLES. | Percent |
| Wait for Barrier | Percentage of wave lifetime spent waiting at barrier synchronization points. Computed as 100 * SQ_WAIT_BARRIER / SQ_WAVE_CYCLES. | Percent |
| Wait for Counter | Percentage of wave lifetime spent waiting for waitcnt counters (VMCNT, LGKMCNT, etc.) to reach target values. Computed as 100 * SQ_WAIT_CNT_ANY / SQ_WAVE_CYCLES. | Percent |
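The four rows above share one formula; the sketch below transcribes it directly. The counter totals are placeholders.

```python
# Direct transcription of the formulas above; the counter totals are placeholders.

def wait_breakdown(wave_cycles, wait_any, wait_ifetch, wait_barrier, wait_cnt):
    """Express each SQ_WAIT_* total as a percentage of SQ_WAVE_CYCLES."""
    pct = lambda cycles: 100.0 * cycles / wave_cycles
    return {
        "Wait for Any":               pct(wait_any),
        "Wait for Instruction Fetch": pct(wait_ifetch),
        "Wait for Barrier":           pct(wait_barrier),
        "Wait for Counter":           pct(wait_cnt),
    }

print(wait_breakdown(wave_cycles=1.0e7, wait_any=4.0e6, wait_ifetch=2.0e5,
                     wait_barrier=6.0e5, wait_cnt=3.2e6))
```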
WGP instruction cache#
| Metric | Description | Unit |
|---|---|---|
| Icache Hit Rate | Instruction cache hit percentage: 100 × SQC_ICACHE_HITS / (SQC_ICACHE_HITS + SQC_ICACHE_MISSES). Do not divide by SQC_ICACHE_REQ alone; hits and requests are both counted per SQ per bank, and summed hits can exceed summed requests. | Percent |
WGP scalar data cache#
| Metric | Description | Unit |
|---|---|---|
| Dcache Hit Rate | Scalar data cache hit percentage: 100 × SQC_DCACHE_HITS / (SQC_DCACHE_HITS + SQC_DCACHE_MISSES). Same rationale as the Icache hit rate: dividing HITS by REQ can yield values above 100%. | Percent |
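Both SQC hit rates use the same hits-over-(hits + misses) form; a tiny sketch with placeholder counts follows.

```python
# Hit-rate arithmetic for both SQC tables above: divide hits by hits + misses,
# never by the raw request counter alone. The counts are placeholders.

def hit_rate(hits, misses):
    return 100.0 * hits / (hits + misses)

print(f"Icache hit rate: {hit_rate(hits=9.6e5, misses=4.0e4):.1f} %")  # 96.0 %
print(f"Dcache hit rate: {hit_rate(hits=7.5e5, misses=2.5e5):.1f} %")  # 75.0 %
```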