Pipeline metrics#

This section describes the metrics available in ROCm Compute Profiler for analyzing the pipelines discussed in the Pipeline descriptions section.

Wavefront#

Wavefront launch stats#

The wavefront launch stats panel gives general information about the kernel launch:

Metric

Description

Unit

Grid Size

The total number of work-items (or, threads) launched as a part of the kernel dispatch. In HIP, this is equivalent to the total grid size multiplied by the total workgroup (or, block) size.

Work-items

Workgroup Size

The total number of work-items (or, threads) in each workgroup (or, block) launched as part of the kernel dispatch. In HIP, this is equivalent to the total block size.

Work-items

Total Wavefronts

The total number of wavefronts launched as part of the kernel dispatch. On AMD Instinct™ CDNA™ accelerators and GCN™ GPUs, the wavefront size is always 64 work-items. Thus, the total number of wavefronts should be equivalent to the ceiling of grid size divided by 64.

Wavefronts

Saved Wavefronts

The total number of wavefronts saved at a context-save. See cwsr_enable.

Wavefronts

Restored Wavefronts

The total number of wavefronts restored from a context-save. See cwsr_enable.

Wavefronts

VGPRs

The number of architected vector general-purpose registers allocated for the kernel, see VALU. Note: this may not exactly match the number of VGPRs requested by the compiler due to allocation granularity.

VGPRs

AGPRs

The number of accumulation vector general-purpose registers allocated for the kernel, see AGPRs. Note: this may not exactly match the number of AGPRs requested by the compiler due to allocation granularity.

AGPRs

SGPRs

The number of scalar general-purpose registers allocated for the kernel, see SALU. Note: this may not exactly match the number of SGPRs requested by the compiler due to allocation granularity.

SGPRs

LDS Allocation

The number of bytes of LDS memory (or, shared memory) allocated for this kernel. Note: This may also be larger than what was requested at compile time due to both allocation granularity and dynamic per-dispatch LDS allocations.

Bytes per workgroup

Scratch Allocation

The number of bytes of scratch memory requested per work-item for this kernel. Scratch memory is used for stack memory on the accelerator, as well as for register spills and restores.

Bytes per work-item
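The relationship between Grid Size, Workgroup Size, and Total Wavefronts can be sketched in a few lines of Python. This is an illustrative calculation only, not code taken from ROCm Compute Profiler. The function name `launch_stats` and its parameters are hypothetical, and the sketch assumes a one-dimensional launch whose workgroup size is a multiple of 64, so that the ceiling formula from the table applies directly.

```python
WAVEFRONT_SIZE = 64  # work-items per wavefront on CDNA and GCN, per the table above


def launch_stats(num_workgroups: int, workgroup_size: int) -> dict:
    """Derive launch stats for a 1-D dispatch (hypothetical helper)."""
    # Grid Size: total work-items = number of workgroups x work-items per workgroup.
    grid_size = num_workgroups * workgroup_size
    # Total Wavefronts: ceiling of grid size divided by the wavefront size,
    # written as integer arithmetic to avoid floating-point rounding.
    total_wavefronts = -(-grid_size // WAVEFRONT_SIZE)
    return {
        "grid_size": grid_size,
        "workgroup_size": workgroup_size,
        "total_wavefronts": total_wavefronts,
    }


stats = launch_stats(num_workgroups=1024, workgroup_size=256)
print(stats)
```

For a launch of 1024 workgroups of 256 work-items each, this gives a grid size of 262144 work-items and 4096 wavefronts.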

Wavefront runtime stats#

The wavefront runtime statistics give a high-level overview of the execution of wavefronts in a kernel:

Metric

Description

Unit

Kernel time

The total duration of the executed kernel. Note: this should not be directly compared to the wavefront cycles / timings below.

Nanoseconds

Kernel cycles

The total duration of the executed kernel in cycles. Note: this should not be directly compared to the wavefront cycles / timings below.

Cycles

Instructions per wavefront

The average number of instructions (of all types) executed per wavefront. This is averaged over all wavefronts in a kernel dispatch.

Instructions / wavefront

Wave cycles

The number of cycles a wavefront in the kernel dispatch spent resident on a compute unit per normalization unit. This is averaged over all wavefronts in a kernel dispatch. Note: this should not be directly compared to the kernel cycles above.

Cycles per normalization unit

Dependency wait cycles

The number of cycles a wavefront in the kernel dispatch stalled waiting on memory of any kind (e.g., instruction fetch, vector or scalar memory, etc.) per normalization unit. This counter is incremented at every cycle by all wavefronts on a CU stalled at a memory operation. As such, it is most useful to get a sense of how waves were spending their time, rather than identification of a precise limiter because another wave could be actively executing while a wave is stalled. The sum of this metric, Issue Wait Cycles and Active Cycles should be equal to the total Wave Cycles metric.

Cycles per normalization unit

Issue Wait Cycles

The number of cycles a wavefront in the kernel dispatch was unable to issue an instruction for any reason (e.g., execution pipe back-pressure, arbitration loss, etc.) per normalization unit. This counter is incremented at every cycle by all wavefronts on a CU unable to issue an instruction. As such, it is most useful to get a sense of how waves were spending their time, rather than identification of a precise limiter because another wave could be actively executing while a wave is issue stalled. The sum of this metric, Dependency Wait Cycles and Active Cycles should be equal to the total Wave Cycles metric.

Cycles per normalization unit

Active Cycles

The average number of cycles a wavefront in the kernel dispatch was actively executing instructions per normalization unit. This measurement is made on a per-wavefront basis, and may include cycles that another wavefront spent actively executing (on another execution unit, for example) or was stalled. As such, it is most useful to get a sense of how waves were spending their time, rather than identification of a precise limiter. The sum of this metric, Issue Wait Cycles and Dependency Wait Cycles should be equal to the total Wave Cycles metric.

Cycles per normalization unit

Wavefront Occupancy

The time-averaged number of wavefronts resident on the accelerator over the lifetime of the kernel. Note: this metric may be inaccurate for short-running kernels (less than 1ms).

Wavefronts

Note

As mentioned earlier, the measurement of kernel cycles and time typically cannot be directly compared to, for example, wave cycles. This is due to two factors. First, the kernel cycles/timings are measured using a counter that is impacted by scheduling overhead. This is particularly noticeable for “short-running” kernels (less than 1ms), where scheduling overhead forms a significant portion of the overall kernel runtime. Second, the wave cycles metric is incremented per-wavefront scheduled to a SIMD every cycle, whereas the kernel cycles counter is incremented only once per-cycle when any wavefront is scheduled.
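Because Dependency Wait Cycles, Issue Wait Cycles, and Active Cycles should sum to Wave Cycles, the three values can be turned into a percentage breakdown of how waves spent their time. The following is a minimal sketch under that assumption. The function name and the sample values are hypothetical and do not come from a real profile.

```python
def wave_cycle_breakdown(dependency_wait: float, issue_wait: float, active: float) -> dict:
    """Express each wave-cycle component as a percentage of Wave Cycles.

    All three inputs must use the same normalization unit.
    """
    # Per the table above, the three components should add up to Wave Cycles.
    wave_cycles = dependency_wait + issue_wait + active
    if wave_cycles <= 0:
        raise ValueError("wave cycles must be positive")
    return {
        "dependency_wait_pct": 100 * dependency_wait / wave_cycles,
        "issue_wait_pct": 100 * issue_wait / wave_cycles,
        "active_pct": 100 * active / wave_cycles,
    }


# Hypothetical sample values, chosen only for illustration.
breakdown = wave_cycle_breakdown(dependency_wait=600, issue_wait=150, active=250)
print(breakdown)
```

As the metric descriptions caution, such a breakdown indicates where waves spent their time, not which resource was the precise limiter.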

Instruction mix#

The instruction mix panel shows a breakdown of the various types of instructions executed by the user’s kernel, and which pipelines on the CU they were executed on. In addition, ROCm Compute Profiler reports further information about the breakdown of operation types for the VALU, vector-memory, and MFMA instructions.

Note

All metrics in this section count instructions issued, and not the total number of operations executed. The values reported by these metrics will not change regardless of the execution mask of the wavefront. Note that even if the execution mask is identically zero (meaning that no lanes are active) the instruction will still be counted, as CDNA accelerators still consider these instructions issued. See EXECute Mask, section 3.3 of the CDNA2 ISA guide for examples and further details.

Overall instruction mix#

This panel shows the total number of each type of instruction issued to the various compute pipelines on the CU. These are:

Metric

Description

Unit

VALU instructions

The total number of vector arithmetic logic unit (VALU) operations issued. These are the workhorses of the compute unit, and are used to execute a wide range of instruction types including floating point operations, non-uniform address calculations, transcendental operations, integer operations, shifts, conditional evaluation, etc.

Instructions

VMEM instructions

The total number of vector memory operations issued. These include most loads, stores and atomic operations and all accesses to generic, global, private and texture memory.

Instructions

LDS instructions

The total number of LDS (also known as shared memory) operations issued. These include loads, stores, atomics, and HIP’s __shfl operations.

Instructions

MFMA instructions

The total number of matrix fused multiply-add instructions issued.

Instructions

SALU instructions

The total number of scalar arithmetic logic unit (SALU) operations issued. Typically these are used for address calculations, literal constants, and other operations that are provably uniform across a wavefront. Although scalar memory (SMEM) operations are issued by the SALU, they are counted separately in this section.

Instructions

SMEM instructions

The total number of scalar memory (SMEM) operations issued. These are typically used for loading kernel arguments, base-pointers and loads from HIP’s __constant__ memory.

Instructions

Branch instructions

The total number of branch operations issued. These typically consist of jump or branch operations and are used to implement control flow.

Instructions

Note

As mentioned in the Branch section, branch operations are not used for execution mask updates, but only for “whole wavefront” control flow changes.

VALU arithmetic instruction mix#

Warning

Not all metrics in this section (for instance, the floating-point instruction breakdowns) are available on CDNA accelerators older than the MI2XX series.

This panel details the various types of vector instructions that were issued to the VALU. The metrics in this section do not include MFMA instructions using the same precision; for instance, the “F16-ADD” metric does not include any 16-bit floating point additions executed as part of an MFMA instruction using the same precision.

Metric

Description

Unit

INT32

The total number of instructions operating on 32-bit integer operands issued to the VALU per normalization unit.

Instructions per normalization unit

INT64

The total number of instructions operating on 64-bit integer operands issued to the VALU per normalization unit.

Instructions per normalization unit

F16-ADD

The total number of addition instructions operating on 16-bit floating-point operands issued to the VALU per normalization unit.

Instructions per normalization unit

F16-MUL

The total number of multiplication instructions operating on 16-bit floating-point operands issued to the VALU per normalization unit.

Instructions per normalization unit

F16-FMA

The total number of fused multiply-add instructions operating on 16-bit floating-point operands issued to the VALU per normalization unit.

Instructions per normalization unit

F16-TRANS

The total number of transcendental instructions (such as sqrt) operating on 16-bit floating-point operands issued to the VALU per normalization unit.

Instructions per normalization unit

F32-ADD

The total number of addition instructions operating on 32-bit floating-point operands issued to the VALU per normalization unit.

Instructions per normalization unit

F32-MUL

The total number of multiplication instructions operating on 32-bit floating-point operands issued to the VALU per normalization unit.

Instructions per normalization unit

F32-FMA

The total number of fused multiply-add instructions operating on 32-bit floating-point operands issued to the VALU per normalization unit.

Instructions per normalization unit

F32-TRANS

The total number of transcendental instructions (such as sqrt) operating on 32-bit floating-point operands issued to the VALU per normalization unit.

Instructions per normalization unit

F64-ADD

The total number of addition instructions operating on 64-bit floating-point operands issued to the VALU per normalization unit.

Instructions per normalization unit

F64-MUL

The total number of multiplication instructions operating on 64-bit floating-point operands issued to the VALU per normalization unit.

Instructions per normalization unit

F64-FMA

The total number of fused multiply-add instructions operating on 64-bit floating-point operands issued to the VALU per normalization unit.

Instructions per normalization unit

F64-TRANS

The total number of transcendental instructions (such as sqrt) operating on 64-bit floating-point operands issued to the VALU per normalization unit.

Instructions per normalization unit

Conversion

The total number of type conversion instructions (such as converting data between F32 and F64) issued to the VALU per normalization unit.

Instructions per normalization unit

For an example of these counters in action, refer to VALU arithmetic instruction mix.

VMEM instruction mix#

This section breaks down the types of vector memory (VMEM) instructions that were issued. Refer to the Instruction Counts metrics section under address processor front end of the vL1D cache for descriptions of these VMEM instructions.

MFMA instruction mix#

Warning

The metrics in this section are only available on CDNA2 (MI2XX) accelerators and newer.

This section details the types of Matrix Fused Multiply-Add (MFMA) instructions that were issued. Note that MFMA instructions are classified by the type of input data they operate on, and not the data type the result is accumulated to.

Metric

Description

Unit

MFMA-I8 Instructions

The total number of 8-bit integer MFMA instructions issued per normalization unit.

Instructions per normalization unit

MFMA-F16 Instructions

The total number of 16-bit floating point MFMA instructions issued per normalization unit.

Instructions per normalization unit

MFMA-BF16 Instructions

The total number of 16-bit brain floating point MFMA instructions issued per normalization unit.

Instructions per normalization unit

MFMA-F32 Instructions

The total number of 32-bit floating-point MFMA instructions issued per normalization unit.

Instructions per normalization unit

MFMA-F64 Instructions

The total number of 64-bit floating-point MFMA instructions issued per normalization unit.

Instructions per normalization unit

Compute pipeline#

FLOP counting conventions#

ROCm Compute Profiler’s conventions for VALU FLOP counting are as follows:

  • Addition or multiplication: 1 operation

  • Transcendentals: 1 operation

  • Fused multiply-add (FMA): 2 operations

Integer operations (IOPs) do not use this convention. They are counted as a single operation regardless of the instruction type.

Note

Packed operations, which operate on multiple operands in the same instruction, are counted identically to the underlying instruction type. For example, the v_pk_add_f32 instruction on MI2XX, which performs an add operation on two pairs of aligned 32-bit floating-point operands, is counted only as a single addition, that is, 1 operation.

As discussed in the Instruction mix section, the FLOP/IOP metrics in this section do not take into account the execution mask of the operation, and will report the same value even if the execution mask is identically zero.

For example, an FMA instruction operating on 32-bit floating-point operands (such as v_fma_f32 on an MI2XX accelerator) would be counted as 128 total FLOPs: 2 operations (due to the instruction type) multiplied by 64 work-items (because the wavefront is composed of 64 work-items).
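The counting convention above can be expressed as a short calculation. This is a sketch for illustration, not the profiler's implementation. The dictionary keys and the function name `valu_flops` are hypothetical labels chosen for this example.

```python
WAVEFRONT_SIZE = 64  # work-items per wavefront

# Operations counted per instruction type, following the convention above.
OPS_PER_INSTRUCTION = {"add": 1, "mul": 1, "trans": 1, "fma": 2}


def valu_flops(instruction_counts: dict) -> int:
    """Total VALU FLOPs for the given per-type instruction counts.

    The execution mask is ignored: every instruction is counted for
    all 64 work-items, as described in the text above.
    """
    return sum(
        OPS_PER_INSTRUCTION[kind] * count * WAVEFRONT_SIZE
        for kind, count in instruction_counts.items()
    )


single_fma = valu_flops({"fma": 1})
mixed = valu_flops({"add": 10, "mul": 5, "fma": 20, "trans": 1})
print(single_fma, mixed)
```

A single FMA instruction yields 128 FLOPs, matching the worked example above. A packed instruction would be entered under its underlying type (for instance, one v_pk_add_f32 as one "add").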

Compute Speed-of-Light#

Warning

The theoretical maximum throughput for some metrics in this section is currently computed with the maximum achievable clock frequency, as reported by rocminfo, for an accelerator. This may not be realistic for all workloads.

This section reports the number of floating-point and integer operations executed on the VALU and MFMA units in various precisions. We note that unlike the VALU instruction mix and MFMA instruction mix sections, the metrics here are reported as FLOPs and IOPs, that is, the total number of operations executed.

Metric

Description

Unit

VALU FLOPs

The total floating-point operations executed per second on the VALU. This is also presented as a percent of the peak theoretical FLOPs achievable on the specific accelerator. Note: this does not include any floating-point operations from MFMA instructions.

GFLOPs

VALU IOPs

The total integer operations executed per second on the VALU. This is also presented as a percent of the peak theoretical IOPs achievable on the specific accelerator. Note: this does not include any integer operations from MFMA instructions.

GIOPs

MFMA FLOPs (BF16)

The total number of 16-bit brain floating point MFMA operations executed per second. Note: this does not include any 16-bit brain floating point operations from VALU instructions. This is also presented as a percent of the peak theoretical BF16 MFMA operations achievable on the specific accelerator.

GFLOPs

MFMA FLOPs (F16)

The total number of 16-bit floating point MFMA operations executed per second. Note: this does not include any 16-bit floating point operations from VALU instructions. This is also presented as a percent of the peak theoretical F16 MFMA operations achievable on the specific accelerator.

GFLOPs

MFMA FLOPs (F32)

The total number of 32-bit floating point MFMA operations executed per second. Note: this does not include any 32-bit floating point operations from VALU instructions. This is also presented as a percent of the peak theoretical F32 MFMA operations achievable on the specific accelerator.

GFLOPs

MFMA FLOPs (F64)

The total number of 64-bit floating point MFMA operations executed per second. Note: this does not include any 64-bit floating point operations from VALU instructions. This is also presented as a percent of the peak theoretical F64 MFMA operations achievable on the specific accelerator.

GFLOPs

MFMA IOPs (INT8)

The total number of 8-bit integer MFMA operations executed per second. Note: this does not include any 8-bit integer operations from VALU instructions. This is also presented as a percent of the peak theoretical INT8 MFMA operations achievable on the specific accelerator.

GIOPs
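To make the "operations per second" and "percent of peak" figures concrete, the sketch below converts an operation count and a kernel duration into GFLOPs and compares the result against a peak value. It is an illustration under stated assumptions, not the profiler's own formula. The function names are hypothetical, and the peak value must be supplied by the reader for their specific accelerator. The number used here is a placeholder, not a specification of any device.

```python
def achieved_gflops(total_flops: float, kernel_time_ns: float) -> float:
    """Operations per second in units of 10^9.

    One operation per nanosecond equals 10^9 operations per second,
    so dividing by the duration in nanoseconds yields GFLOPs directly.
    """
    if kernel_time_ns <= 0:
        raise ValueError("kernel time must be positive")
    return total_flops / kernel_time_ns


def percent_of_peak(achieved: float, peak: float) -> float:
    """Achieved throughput as a percentage of a theoretical peak."""
    if peak <= 0:
        raise ValueError("peak must be positive")
    return 100 * achieved / peak


# Hypothetical inputs: 2e9 FLOPs executed in 1 ms, against a placeholder peak.
gflops = achieved_gflops(total_flops=2_000_000_000, kernel_time_ns=1_000_000)
pct = percent_of_peak(gflops, peak=40_000.0)
print(gflops, pct)
```

The same arithmetic applies to the IOP metrics, with GIOPs in place of GFLOPs.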

Pipeline statistics#

This section reports a number of key performance characteristics of various execution units on the CU. Refer to the Instructions-per-cycle and utilizations example for a detailed dive into these metrics, and to the scheduler section for a high-level overview of execution units and instruction issue.

Metric

Description

Unit

IPC

The ratio of the total number of instructions executed on the CU over the total active CU cycles.

Instructions per-cycle

IPC (Issued)

The ratio of the total number of (non-internal) instructions issued over the number of cycles where the scheduler was actively working on issuing instructions. Refer to the Issued IPC example for further detail.

Instructions per-cycle

SALU utilization

Indicates what percent of the kernel’s duration the SALU was busy executing instructions. Computed as the ratio of the total number of cycles spent by the scheduler issuing SALU / SMEM instructions over the total CU cycles.

Percent

VALU utilization

Indicates what percent of the kernel’s duration the VALU was busy executing instructions. Does not include VMEM operations. Computed as the ratio of the total number of cycles spent by the scheduler issuing VALU instructions over the total CU cycles.

Percent

VMEM utilization

Indicates what percent of the kernel’s duration the VMEM unit was busy executing instructions, including both global/generic and spill/scratch operations (see the VMEM instruction count metrics for more detail). Does not include VALU operations. Computed as the ratio of the total number of cycles spent by the scheduler issuing VMEM instructions over the total CU cycles.

Percent

Branch utilization

Indicates what percent of the kernel’s duration the branch unit was busy executing instructions. Computed as the ratio of the total number of cycles spent by the scheduler issuing branch instructions over the total CU cycles.

Percent

VALU active threads

Indicates the average level of divergence within a wavefront over the lifetime of the kernel. The number of work-items that were active in a wavefront during execution of each VALU instruction, time-averaged over all VALU instructions run on all wavefronts in the kernel.

Work-items

MFMA utilization

Indicates what percent of the kernel’s duration the MFMA unit was busy executing instructions. Computed as the ratio of the total number of cycles the MFMA unit was busy over the total CU cycles.

Percent

MFMA instruction cycles

The average duration of MFMA instructions in this kernel in cycles. Computed as the ratio of the total number of cycles the MFMA unit was busy over the total number of MFMA instructions. Compare to, for example, the AMD Matrix Instruction Calculator.

Cycles per instruction

VMEM latency

The average number of round-trip cycles (that is, from issue to data return / acknowledgment) required for a VMEM instruction to complete.

Cycles

SMEM latency

The average number of round-trip cycles (that is, from issue to data return / acknowledgment) required for a SMEM instruction to complete.

Cycles

Note

The branch utilization reported in this section also includes time spent in other instruction types (namely: s_endpgm) that are typically a very small percentage of the overall kernel execution. This complication is omitted for simplicity, but may result in small amounts of branch utilization (typically less than 1%) for otherwise branch-less kernels.
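Most metrics in the table above are simple ratios of counters. The sketch below writes out three of them as described: IPC, a utilization percentage, and MFMA instruction cycles. The function names and the sample counter values are hypothetical. They are not actual hardware counter names or measured data.

```python
def ipc(instructions: int, active_cu_cycles: int) -> float:
    """Instructions executed on the CU over the total active CU cycles."""
    return instructions / active_cu_cycles


def utilization_pct(busy_cycles: int, total_cu_cycles: int) -> float:
    """Percent of total CU cycles in which a unit was busy or being issued to."""
    return 100 * busy_cycles / total_cu_cycles


def mfma_instruction_cycles(mfma_busy_cycles: int, mfma_instructions: int) -> float:
    """Average duration of an MFMA instruction, in cycles."""
    return mfma_busy_cycles / mfma_instructions


# Hypothetical counter values, chosen only for illustration.
kernel_ipc = ipc(instructions=50_000, active_cu_cycles=100_000)
valu_util = utilization_pct(busy_cycles=40_000, total_cu_cycles=200_000)
mfma_cycles = mfma_instruction_cycles(mfma_busy_cycles=64_000, mfma_instructions=1_000)
print(kernel_ipc, valu_util, mfma_cycles)
```

Note that IPC divides by active CU cycles while the utilization metrics divide by total CU cycles, so the two denominators are kept as separate parameters.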

Arithmetic operations#

This section reports the total number of floating-point and integer operations executed in various precisions. Unlike the Compute Speed-of-Light panel, this section reports both VALU and MFMA operations of the same precision (e.g., F32) in the same metric. Additionally, this panel lets the user control how the data is normalized (i.e., control the normalization unit), while the speed-of-light panel does not. For more detail on how operations are counted see the FLOP counting convention section.

Warning

As discussed in Instruction mix, the metrics in this section do not take into account the execution mask of the operation, and will report the same value even if EXEC is identically zero.

Metric

Description

Unit

FLOPs (Total)

The total number of floating-point operations executed on either the VALU or MFMA units, per normalization unit.

FLOP per normalization unit

IOPs (Total)

The total number of integer operations executed on either the VALU or MFMA units, per normalization unit.

IOP per normalization unit

F16 OPs

The total number of 16-bit floating-point operations executed on either the VALU or MFMA units, per normalization unit.

FLOP per normalization unit

BF16 OPs

The total number of 16-bit brain floating-point operations executed on either the VALU or MFMA units, per normalization unit. Note: on current CDNA accelerators, the VALU has no native BF16 instructions.

FLOP per normalization unit

F32 OPs

The total number of 32-bit floating-point operations executed on either the VALU or MFMA units, per normalization unit.

FLOP per normalization unit

F64 OPs

The total number of 64-bit floating-point operations executed on either the VALU or MFMA units, per normalization unit.

FLOP per normalization unit

INT8 OPs

The total number of 8-bit integer operations executed on either the VALU or MFMA units, per normalization unit. Note: on current CDNA accelerators, the VALU has no native INT8 instructions.

IOP per normalization unit
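Each per-precision metric in this panel combines VALU and MFMA operations and then divides by the chosen normalization unit. A minimal sketch of that arithmetic follows. The function name is hypothetical, and the denominator (here, a wavefront count) is an assumed example of a normalization unit rather than a statement of which units the tool offers.

```python
def ops_per_normalization_unit(valu_ops: int, mfma_ops: int, denominator: int) -> float:
    """Combine VALU and MFMA operations of one precision, then normalize."""
    if denominator <= 0:
        raise ValueError("normalization denominator must be positive")
    return (valu_ops + mfma_ops) / denominator


# Hypothetical F32 counts, normalized by an assumed count of 64 wavefronts.
f32_ops = ops_per_normalization_unit(valu_ops=3584, mfma_ops=8192, denominator=64)
print(f32_ops)
```

For precisions with no native VALU instructions, such as BF16 and INT8 on current CDNA accelerators, the VALU term would simply be zero.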