# Workgroup processor (WGP)

On RDNA3-class GPUs (including discrete Radeon parts and the RDNA3.5 / gfx1151 APU integration), shader work is organized into Workgroup Processors (WGPs). A WGP pairs two Compute Units (CUs) that share resources such as the LDS and the instruction and scalar caches; compute kernels are typically tracked as wave32 wavefronts in the gfx1151 panel set. The profiler’s WGP panel is the analogue of the Instinct compute unit chapter: occupancy, dispatch, instruction mix, and local caches at WGP granularity.

The sections below describe the RDNA3.5 (gfx1151) metrics.

> **Note**
>
> For VALU / VMEM / MFMA-style pipeline tables and MI-series diagrams, see Compute unit (CU).

## Roofline

### Roofline performance rates

| Metric | Description | Unit |
| --- | --- | --- |
| VALU FLOPs (F16) | The total 16-bit floating-point operations executed per second on the VALU. Theoretical FP16 throughput is 2× the FP32 FMA peak (packed-half operations on RDNA3-class VALUs). Presented alongside the peak empirical F16 FLOP rate achievable on the specific accelerator. | GFLOP/s |
| VALU FLOPs (F32) | The total 32-bit floating-point operations executed per second on the VALU. RDNA3-class CUs can dual-issue FP32 FMA operations, so achievable peak depends on how well dual issue is exploited; the roofline peak column therefore uses the peak empirically measured F32 throughput on the specific accelerator. | GFLOP/s |
| VALU FLOPs (F64) | The total 64-bit floating-point operations executed per second on the VALU. FP64 throughput is a small fraction of FP32 (commonly 1/16 to 1/32) on RDNA3.5 consumer parts. Presented alongside the peak empirical F64 FLOP rate achievable on the specific accelerator. | GFLOP/s |
| GL2 Cache Bandwidth | The number of bytes transferred between GL1C and GL2C per unit time. GL2C is the last-level cache before system memory on RDNA3.5. The peak empirically measured bandwidth achievable on the specific accelerator is displayed alongside for comparison. | Bytes/s |
| GL1 Cache Bandwidth | The number of bytes transferred between TCP and GL1C per unit time. GL1C is the L1 cache shared by the WGPs of a shader array. The peak empirically measured bandwidth achievable on the specific accelerator is displayed alongside for comparison. | Bytes/s |
| TCP Cache Bandwidth | The number of bytes looked up in the TCP (Texture Cache per Pipe) per unit time. TCP is the L0 vector cache on RDNA3.5, one per CU (two per WGP). The peak empirically measured bandwidth achievable on the specific accelerator is displayed alongside for comparison. | Bytes/s |
| LDS Bandwidth | The maximum number of bytes that could have been loaded from, stored to, or atomically updated in the LDS per unit time. The LDS is shared within a WGP and organized into banks. The peak empirically measured LDS bandwidth achievable on the specific accelerator is displayed alongside for comparison. | Bytes/s |

### Roofline plot points

| Metric | Description | Unit |
| --- | --- | --- |
| AI TCP | The arithmetic intensity (AI) relative to the TCP cache: the ratio of total floating-point operations (FLOPs) to total bytes transferred from TCP to the shader cores. Used as the x-coordinate for the TCP roofline. | FLOPs/Byte |
| AI GL1 | The arithmetic intensity (AI) relative to the GL1C cache: the ratio of total FLOPs to total bytes transferred between GL1C and TCP. Used as the x-coordinate for the GL1 roofline. | FLOPs/Byte |
| AI GL2 | The arithmetic intensity (AI) relative to GL2C (L2 cache): the ratio of total FLOPs to total bytes transferred between GL2C and GL1C. Used as the x-coordinate for the GL2 roofline. | FLOPs/Byte |
| Performance (GFLOPs) | The overall achieved performance in gigaFLOPs per second (GFLOP/s), calculated as the sum of all VALU floating-point operations divided by the total execution time. Used as the y-coordinate for the kernel’s point on the roofline plot. | GFLOP/s |
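As a sketch of how these plot points fit together, the snippet below derives the three AI x-coordinates and the shared performance y-coordinate from kernel totals. All input values are made-up examples, not real counter reads:

```python
# Sketch: derive roofline coordinates from kernel totals.
# All inputs below are hypothetical example values.

total_flops = 4.0e12   # total VALU FLOPs executed by the kernel
tcp_bytes   = 2.0e11   # bytes moved between shader cores and TCP (L0)
gl1_bytes   = 8.0e10   # bytes moved between TCP and GL1C
gl2_bytes   = 5.0e10   # bytes moved between GL1C and GL2C
runtime_s   = 1.0      # kernel execution time in seconds

ai_tcp = total_flops / tcp_bytes   # x-coordinate on the TCP roofline
ai_gl1 = total_flops / gl1_bytes   # x-coordinate on the GL1 roofline
ai_gl2 = total_flops / gl2_bytes   # x-coordinate on the GL2 roofline

# One shared y-coordinate: achieved GFLOP/s for the kernel.
perf_gflops = total_flops / runtime_s / 1e9

print(ai_tcp, ai_gl1, ai_gl2, perf_gflops)  # 20.0 50.0 80.0 4000.0
```

Note that AI normally rises from TCP toward GL2 because each successive cache level filters traffic, so fewer bytes cross it for the same FLOP count.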

## WGP block metrics

### WGP utilization

| Metric | Description | Unit |
| --- | --- | --- |
| WGP Utilization | WGP busy cycles as a percentage of GRBM GUI-active cycles. RDNA3.5 WGPs pair two CUs, each with two SIMD32 units. | Percent |

### Wavefront launch stats

| Metric | Description | Unit |
| --- | --- | --- |
| Grid Size | Number of work-items in the dispatch grid for the kernel (Grid_Size), reported as average/min/max across samples. Defines the total parallel work launched. | Work-items |
| Workgroup Size | Number of work-items per workgroup (Workgroup_Size). Together with grid size and resource limits, it determines the wavefront count. | Work-items |
| VGPRs | Vector general-purpose registers allocated per wave (Arch_VGPR). Limits occupancy when the per-CU VGPR capacity is exhausted. | Registers |
| SGPRs | Scalar general-purpose registers allocated per wave (SGPR). Uniform data and execution state live in the scalar path. | Registers |
| LDS Allocation | Local Data Share bytes reserved per workgroup (LDS_Per_Workgroup). High LDS usage per workgroup reduces how many workgroups can run concurrently on a CU/WGP. | Bytes |
| Scratch Allocation | Scratch (private) memory bytes per work-item (Scratch_Per_Workitem). Backed by off-chip memory when spill/stack usage exceeds on-chip limits. | Bytes per work-item |
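To make the occupancy interplay concrete, here is a simplified sketch of how VGPR and LDS allocations can each cap the number of concurrent waves; the capacity constants and the function are illustrative assumptions for this example, not gfx1151 specifications:

```python
# Simplified occupancy estimate: the resource that supports the fewest
# concurrent waves is the limiter. Capacities are illustrative only.

SIMD_VGPR_CAPACITY = 1024   # assumed addressable VGPRs per SIMD
LDS_PER_WGP = 64 * 1024     # assumed LDS bytes available per WGP
MAX_WAVES_PER_SIMD = 16     # assumed hardware wave-slot limit

def occupancy_limits(vgprs_per_wave, lds_per_workgroup, waves_per_workgroup):
    # Waves allowed by the VGPR budget on one SIMD.
    vgpr_limit = SIMD_VGPR_CAPACITY // max(vgprs_per_wave, 1)
    # Workgroups allowed by the LDS budget, converted to waves.
    if lds_per_workgroup > 0:
        lds_limit = (LDS_PER_WGP // lds_per_workgroup) * waves_per_workgroup
    else:
        lds_limit = MAX_WAVES_PER_SIMD
    return min(vgpr_limit, lds_limit, MAX_WAVES_PER_SIMD)

# Example: 128 VGPRs/wave, 16 KiB LDS/workgroup, 4 waves per workgroup.
print(occupancy_limits(128, 16 * 1024, 4))  # 8 (VGPR-limited)
```

The same shape of calculation explains why trimming a few VGPRs per wave, or a few KiB of LDS per workgroup, can step occupancy up by a whole wave slot.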

### Wave dispatch

| Metric | Description | Unit |
| --- | --- | --- |
| Dispatched Waves | Number of shader waves launched (SQ_WAVES_sum), reported as average/min/max. Each wave is a SIMD execution entity (typically wave32 on this configuration). | Waves |
| Dispatched Threads | Number of work-items (threads) launched, as attributed by SQ_ITEMS_sum, reported as average/min/max. | Threads |

### Wave life

| Metric | Description | Unit |
| --- | --- | --- |
| Wave Life | Average number of cycles a wave exists from creation to completion, calculated as SQ_WAVE_CYCLES / SQ_WAVES. | Cycles |

### Wave instruction mix

| Metric | Description | Unit |
| --- | --- | --- |
| Total Instructions | Total number of instructions executed across all waves. | Count per Normalization Unit |
| Instructions - VALU | Number of Vector ALU instructions executed (SQ_INSTS_VALU). The VALU handles all vector arithmetic and logic operations. Transcendental VALU ops are reported separately as "Instructions - Transcendental" in this table. | Count per Normalization Unit |
| Instructions - SALU | Number of Scalar ALU instructions executed. The SALU handles scalar operations that are uniform across the wave. | Count per Normalization Unit |
| Instructions - SMEM | Number of Scalar Memory instructions executed. SMEM loads uniform data through the scalar cache. | Count per Normalization Unit |
| Instructions - VMEM | Number of Vector Memory instructions executed. Includes global, buffer, and texture memory operations. | Count per Normalization Unit |
| Instructions - LDS | Total Local Data Share instructions (SQ_INSTS_LDS). The LDS provides fast shared memory within a workgroup. For LDS sub-categories (direct, parameter, and wave32 parameter loads), see the "LDS instruction mix" table. | Count per Normalization Unit |
| Instructions - Internal | Internal SQ instructions (SQ_INSTS_INTERNAL) per normalization unit: bookkeeping or non-user-visible work attributed to the scheduler. | Count per Normalization Unit |
| Instructions - Transcendental | Transcendental VALU instructions (SQ_INSTS_VALU_TRANS) per normalization unit, e.g. sin/cos/log approximations on the vector ALU. | Count per Normalization Unit |
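A minimal sketch of what "per normalization unit" means when the unit is per-wave: each raw counter total is divided by the wave count. The counter totals below are illustrative, not real reads:

```python
# Sketch: normalize raw instruction counters per dispatched wave.
# Counter totals are made-up example values.

insts = {
    "SQ_INSTS_VALU": 6_400_000,
    "SQ_INSTS_SALU": 1_600_000,
    "SQ_INSTS_LDS":    800_000,
}
sq_waves = 32_000  # SQ_WAVES total for the dispatch

# Instructions per wave for each category.
per_wave = {name: count / sq_waves for name, count in insts.items()}
print(per_wave)
```

The same division by a different denominator (cycles, work-items) yields the other normalization choices the profiler offers.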

### VMEM instruction mix

| Metric | Description | Unit |
| --- | --- | --- |
| Instructions - VMEM Flat | Flat-addressing vector memory instructions (SQ_INSTS_FLAT) per normalization unit. Flat addresses can resolve to global, scratch, or LDS memory depending on the address. | Count per Normalization Unit |
| Instructions - TEX Load | Texture / image load instructions (SQ_INSTS_TEX_LOAD) per normalization unit. | Count per Normalization Unit |
| Instructions - TEX Store | Texture / image store instructions (SQ_INSTS_TEX_STORE) per normalization unit. | Count per Normalization Unit |
| Instructions - Flat Load | Flat load instructions (SQ_INSTS_FLAT_LOAD) per normalization unit. | Count per Normalization Unit |
| Instructions - Flat Store | Flat store instructions (SQ_INSTS_FLAT_STORE) per normalization unit. | Count per Normalization Unit |

### LDS instruction mix

| Metric | Description | Unit |
| --- | --- | --- |
| LDS Direct Load Instructions | LDS direct load instructions issued (SQ_INSTS_LDS_DIRECT_LOAD). A subset of total LDS instructions; the LDS mix rows are not required to sum to the total. | Count per Normalization Unit |
| LDS Parameter Load Instructions | LDS parameter load instructions issued (SQ_INSTS_LDS_PARAM_LOAD). | Count per Normalization Unit |
| WAVE32 LDS Parameter Load Instructions | LDS parameter load instructions issued under wave32 execution (SQ_INSTS_WAVE32_LDS_PARAM_LOAD). | Count per Normalization Unit |

### Wait state analysis

| Metric | Description | Unit |
| --- | --- | --- |
| Wait for Any | Percentage of wave lifetime spent waiting for any reason, including memory, barriers, or instruction fetch. Computed as 100 × SQ_WAIT_ANY / SQ_WAVE_CYCLES. | Percent |
| Wait for Instruction Fetch | Percentage of wave lifetime spent waiting for instructions to arrive from the instruction cache. Computed as 100 × SQ_WAIT_IFETCH / SQ_WAVE_CYCLES. | Percent |
| Wait for Barrier | Percentage of wave lifetime spent waiting at barrier synchronization points. Computed as 100 × SQ_WAIT_BARRIER / SQ_WAVE_CYCLES. | Percent |
| Wait for Counter | Percentage of wave lifetime spent waiting for waitcnt counters (VMCNT, LGKMCNT, and so on) to reach target values. Computed as 100 × SQ_WAIT_CNT_ANY / SQ_WAVE_CYCLES. | Percent |
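These four rows all apply the same formula to different counters. A sketch with made-up counter values (not real reads) shows the shared calculation:

```python
# Sketch: wait-state percentages from raw cycle counters.
# All counter values below are made-up examples.

counters = {
    "SQ_WAVE_CYCLES":  1_000_000,  # total wave-lifetime cycles
    "SQ_WAIT_ANY":       400_000,  # cycles waiting for any reason
    "SQ_WAIT_IFETCH":     50_000,  # cycles waiting on instruction fetch
    "SQ_WAIT_BARRIER":   100_000,  # cycles waiting at barriers
    "SQ_WAIT_CNT_ANY":   250_000,  # cycles waiting on waitcnt counters
}

def wait_percent(wait_cycles, wave_cycles):
    # Fraction of wave lifetime spent in the given wait state, as a percent.
    return 100.0 * wait_cycles / wave_cycles

total = counters["SQ_WAVE_CYCLES"]
for name in ("SQ_WAIT_ANY", "SQ_WAIT_IFETCH",
             "SQ_WAIT_BARRIER", "SQ_WAIT_CNT_ANY"):
    print(name, wait_percent(counters[name], total))
```

The specific wait rows are components of Wait for Any, so each of them is bounded above by it.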

### WGP instruction cache

| Metric | Description | Unit |
| --- | --- | --- |
| Icache Hit Rate | Instruction cache hit percentage, computed as 100 × SQC_ICACHE_HITS / (SQC_ICACHE_HITS + SQC_ICACHE_MISSES). Do not divide only by SQC_ICACHE_REQ: hits and requests are both counted per-SQ per-bank, and summed hits can exceed summed requests. | Percent |
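A minimal sketch of why the hits-plus-misses denominator is the safe one. The counter sums are illustrative values chosen to show the failure mode:

```python
# Sketch: icache hit rate from summed per-bank counters.
# Values are illustrative, chosen to demonstrate the REQ pitfall.

icache_hits = 90_000
icache_misses = 10_000
icache_req = 80_000  # summed requests can end up below summed hits

# Correct: denominator built from the same per-bank hit/miss counters.
hit_rate = 100.0 * icache_hits / (icache_hits + icache_misses)
print(hit_rate)  # 90.0

# Incorrect: dividing by REQ can exceed 100% when bank sums diverge.
bad_rate = 100.0 * icache_hits / icache_req
print(bad_rate)  # 112.5
```

The same reasoning applies to the scalar data cache hit rate below.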

### WGP scalar data cache

| Metric | Description | Unit |
| --- | --- | --- |
| Dcache Hit Rate | Scalar data cache hit percentage, computed as 100 × SQC_DCACHE_HITS / (SQC_DCACHE_HITS + SQC_DCACHE_MISSES). Same rationale as the Icache hit rate: dividing summed hits by summed requests can exceed 100%. | Percent |