Pipeline metrics#
In this section, we describe the metrics available in Omniperf to analyze the pipelines discussed in the Pipeline descriptions.
Wavefront#
Wavefront launch stats#
The wavefront launch stats panel gives general information about the kernel launch:
Metric 
Description 
Unit 

Grid Size 
The total number of work-items (or, threads) launched as a part of the kernel dispatch. In HIP, this is equivalent to the total grid size multiplied by the total workgroup (or, block) size. 

Workgroup Size 
The total number of work-items (or, threads) in each workgroup (or, block) launched as part of the kernel dispatch. In HIP, this is equivalent to the total block size. 

Total Wavefronts 
The total number of wavefronts launched as part of the kernel dispatch. On AMD Instinct™ CDNA™ accelerators and GCN™ GPUs, the wavefront size is always 64 work-items. Thus, the total number of wavefronts should be equivalent to the ceiling of grid size divided by 64. 

Saved Wavefronts 
The total number of wavefronts saved at a context-save. See cwsr_enable. 

Restored Wavefronts 
The total number of wavefronts restored from a context-save. See cwsr_enable. 

VGPRs 
The number of architected vector general-purpose registers allocated for the kernel, see VALU. Note: this may not exactly match the number of VGPRs requested by the compiler due to allocation granularity. 

AGPRs 
The number of accumulation vector general-purpose registers allocated for the kernel, see AGPRs. Note: this may not exactly match the number of AGPRs requested by the compiler due to allocation granularity. 

SGPRs 
The number of scalar general-purpose registers allocated for the kernel, see SALU. Note: this may not exactly match the number of SGPRs requested by the compiler due to allocation granularity. 

LDS Allocation 
The number of bytes of LDS memory (or, shared memory) allocated for this kernel. Note: this may also be larger than what was requested at compile time due to both allocation granularity and dynamic per-dispatch LDS allocations. 
Bytes per workgroup 
Scratch Allocation 
The number of bytes of scratch memory requested per work-item for this kernel. Scratch memory is used for stack memory on the accelerator, as well as for register spills and restores. 
Bytes per work-item 
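The arithmetic relating these launch statistics can be sketched numerically. The following is an illustrative Python sketch with hypothetical launch dimensions, not Omniperf code; it assumes only the fixed 64-wide wavefront described above.

```python
from math import ceil

WAVEFRONT_SIZE = 64  # fixed on CDNA accelerators and GCN GPUs

# Hypothetical HIP-style launch configuration
num_workgroups = 1024     # blocks in the dispatch
workgroup_size = 256      # work-items (threads) per block

grid_size = num_workgroups * workgroup_size            # "Grid Size" metric
# Wavefronts are formed per workgroup, so round up per block:
total_wavefronts = num_workgroups * ceil(workgroup_size / WAVEFRONT_SIZE)

print(grid_size)         # 262144
print(total_wavefronts)  # 4096 (== ceil(grid_size / 64), since 256 % 64 == 0)
```

When the workgroup size is not a multiple of 64, the per-workgroup rounding makes the total slightly larger than the ceiling of grid size divided by 64.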
Wavefront runtime stats#
The wavefront runtime statistics panel gives a high-level overview of the execution of wavefronts in a kernel:
Metric 
Description 
Unit 

Kernel time 
The total duration of the executed kernel. Note: this should not be directly compared to the wavefront cycles / timings below. 
Nanoseconds 

Kernel cycles 
The total duration of the executed kernel in cycles. Note: this should not be directly compared to the wavefront cycles / timings below. 
Cycles 

Instructions per wavefront 
The average number of instructions (of all types) executed per wavefront. This is averaged over all wavefronts in a kernel dispatch. 
Instructions / wavefront 
Wave cycles 
The number of cycles a wavefront in the kernel dispatch spent resident on a compute unit per normalization unit. This is averaged over all wavefronts in a kernel dispatch. Note: this should not be directly compared to the kernel cycles above. 
Cycles per normalization unit 
Dependency wait cycles 
The number of cycles a wavefront in the kernel dispatch stalled waiting on memory of any kind (e.g., instruction fetch, vector or scalar memory, etc.) per normalization unit. This counter is incremented at every cycle by all wavefronts on a CU stalled at a memory operation. As such, it is most useful to get a sense of how waves were spending their time, rather than identification of a precise limiter because another wave could be actively executing while a wave is stalled. The sum of this metric, Issue Wait Cycles and Active Cycles should be equal to the total Wave Cycles metric. 
Cycles per normalization unit 
Issue Wait Cycles 
The number of cycles a wavefront in the kernel dispatch was unable to issue an instruction for any reason (e.g., execution pipe backpressure, arbitration loss, etc.) per normalization unit. This counter is incremented at every cycle by all wavefronts on a CU unable to issue an instruction. As such, it is most useful to get a sense of how waves were spending their time, rather than identification of a precise limiter because another wave could be actively executing while a wave is issue stalled. The sum of this metric, Dependency Wait Cycles and Active Cycles should be equal to the total Wave Cycles metric. 
Cycles per normalization unit 
Active Cycles 
The average number of cycles a wavefront in the kernel dispatch was actively executing instructions per normalization unit. This measurement is made on a per-wavefront basis, and may include cycles that another wavefront spent actively executing (on another execution unit, for example) or was stalled. As such, it is most useful to get a sense of how waves were spending their time, rather than identification of a precise limiter. The sum of this metric, Issue Wait Cycles and Dependency Wait Cycles should be equal to the total Wave Cycles metric. 
Cycles per normalization unit 
Wavefront Occupancy 
The time-averaged number of wavefronts resident on the accelerator over the lifetime of the kernel. Note: this metric may be inaccurate for short-running kernels (less than 1ms). 
Wavefronts 
Note
As mentioned earlier, the measurement of kernel cycles and time typically cannot be directly compared to, for example, wave cycles. This is due to two factors: first, the kernel cycles/timings are measured using a counter that is impacted by scheduling overhead; this is particularly noticeable for "short-running" kernels (less than 1ms), where scheduling overhead forms a significant portion of the overall kernel runtime. Second, the wave cycles metric is incremented per wavefront scheduled to a SIMD every cycle, whereas the kernel cycles counter is incremented only once per cycle when any wavefront is scheduled.
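The cycle decomposition described in the table above can be sketched as follows. This is a hypothetical illustration of the stated identity (dependency wait + issue wait + active cycles = wave cycles), using made-up counter values rather than real profiler output.

```python
# Hypothetical per-wavefront counter readings for a two-wave dispatch
waves = [
    {"dependency_wait": 1200, "issue_wait": 300, "active": 2500},
    {"dependency_wait": 800,  "issue_wait": 450, "active": 2750},
]

# Per the table, the three components account for all resident cycles:
wave_cycles = [w["dependency_wait"] + w["issue_wait"] + w["active"] for w in waves]

total_wave_cycles = sum(wave_cycles)
print(wave_cycles)        # [4000, 4000]
print(total_wave_cycles)  # 8000
```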
Instruction mix#
The instruction mix panel shows a breakdown of the various types of instructions executed by the user’s kernel, and which pipelines on the CU they were executed on. In addition, Omniperf reports further information about the breakdown of operation types for the VALU, vector-memory, and MFMA instructions.
Note
All metrics in this section count instructions issued, and not the total number of operations executed. The values reported by these metrics will not change regardless of the execution mask of the wavefront. Note that even if the execution mask is identically zero (meaning that no lanes are active) the instruction will still be counted, as CDNA accelerators still consider these instructions issued. See EXECute Mask, section 3.3 of the CDNA2 ISA guide for examples and further details.
Overall instruction mix#
This panel shows the total number of each type of instruction issued to the various compute pipelines on the CU. These are:
Metric 
Description 
Unit 

VALU instructions 
The total number of vector arithmetic logic unit (VALU) operations issued. These are the workhorses of the compute unit, and are used to execute a wide range of instruction types including floating-point operations, non-uniform address calculations, transcendental operations, integer operations, shifts, conditional evaluation, etc. 
Instructions 
VMEM instructions 
The total number of vector memory operations issued. These include most loads, stores and atomic operations and all accesses to generic, global, private and texture memory. 
Instructions 
LDS instructions 
The total number of LDS (also known as shared memory) operations issued. These include loads, stores, atomics, and HIP’s __shfl operations. 
Instructions 
MFMA instructions 
The total number of matrix fused multiply-add instructions issued. 
Instructions 
SALU instructions 
The total number of scalar arithmetic logic unit (SALU) operations issued. Typically these are used for address calculations, literal constants, and other operations that are provably uniform across a wavefront. Although scalar memory (SMEM) operations are issued by the SALU, they are counted separately in this section. 
Instructions 
SMEM instructions 
The total number of scalar memory (SMEM) operations issued. These are typically used for loading kernel arguments, base-pointers, and loads from HIP’s __constant__ memory. 
Instructions 
Branch instructions 
The total number of branch operations issued. These typically consist of jump or branch operations and are used to implement control flow. 
Instructions 
Note
As mentioned in the Branch section, branch operations are not used for execution mask updates, but only for “whole wavefront” control flow changes.
VALU arithmetic instruction mix#
Warning
Not all metrics in this section (for instance, the floating-point instruction breakdowns) are available on CDNA accelerators older than the MI2XX series.
This panel details the various types of vector instructions that were issued to the VALU. The metrics in this section do not include MFMA instructions using the same precision; for instance, the “F16-ADD” metric does not include any 16-bit floating-point additions executed as part of an MFMA instruction using the same precision.
Metric 
Description 
Unit 

INT32 
The total number of instructions operating on 32-bit integer operands issued to the VALU per normalization unit. 
Instructions per normalization unit 
INT64 
The total number of instructions operating on 64-bit integer operands issued to the VALU per normalization unit. 
Instructions per normalization unit 
F16-ADD 
The total number of addition instructions operating on 16-bit floating-point operands issued to the VALU per normalization unit. 
Instructions per normalization unit 
F16-MUL 
The total number of multiplication instructions operating on 16-bit floating-point operands issued to the VALU per normalization unit. 
Instructions per normalization unit 
F16-FMA 
The total number of fused multiply-add instructions operating on 16-bit floating-point operands issued to the VALU per normalization unit. 
Instructions per normalization unit 
F16-TRANS 
The total number of transcendental instructions (e.g., sqrt) operating on 16-bit floating-point operands issued to the VALU per normalization unit. 
Instructions per normalization unit 
F32-ADD 
The total number of addition instructions operating on 32-bit floating-point operands issued to the VALU per normalization unit. 
Instructions per normalization unit 
F32-MUL 
The total number of multiplication instructions operating on 32-bit floating-point operands issued to the VALU per normalization unit. 
Instructions per normalization unit 
F32-FMA 
The total number of fused multiply-add instructions operating on 32-bit floating-point operands issued to the VALU per normalization unit. 
Instructions per normalization unit 
F32-TRANS 
The total number of transcendental instructions (such as sqrt) operating on 32-bit floating-point operands issued to the VALU per normalization unit. 
Instructions per normalization unit 
F64-ADD 
The total number of addition instructions operating on 64-bit floating-point operands issued to the VALU per normalization unit. 
Instructions per normalization unit 
F64-MUL 
The total number of multiplication instructions operating on 64-bit floating-point operands issued to the VALU per normalization unit. 
Instructions per normalization unit 
F64-FMA 
The total number of fused multiply-add instructions operating on 64-bit floating-point operands issued to the VALU per normalization unit. 
Instructions per normalization unit 
F64-TRANS 
The total number of transcendental instructions (such as sqrt) operating on 64-bit floating-point operands issued to the VALU per normalization unit. 
Instructions per normalization unit 
Conversion 
The total number of type conversion instructions (such as converting between F32 and F64) issued to the VALU per normalization unit. 
Instructions per normalization unit 
For an example of these counters in action, refer to VALU arithmetic instruction mix.
VMEM instruction mix#
This section breaks down the types of vector memory (VMEM) instructions that were issued. Refer to the Instruction Counts metrics section under address processor front end of the vL1D cache for descriptions of these VMEM instructions.
MFMA instruction mix#
Warning
The metrics in this section are only available on CDNA2 (MI2XX) accelerators and newer.
This section details the types of matrix fused multiply-add (MFMA) instructions that were issued. Note that MFMA instructions are classified by the type of input data they operate on, and not the data type the result is accumulated to.
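Since classification keys on the input data type, a small helper that sorts CDNA-style MFMA mnemonics into the classes below might look like the following. The helper and the exact mnemonic strings are illustrative assumptions, not Omniperf internals.

```python
def mfma_class(mnemonic: str) -> str:
    """Classify an MFMA mnemonic by its *input* precision suffix.

    For example, a hypothetical "v_mfma_f32_16x16x16f16" accumulates into
    F32 but reads F16 inputs, so it counts as an MFMA-F16 instruction.
    """
    # Check "bf16" before "f16" so brain-float suffixes are not misread.
    for input_type in ("bf16", "f16", "f64", "f32", "i8"):
        if mnemonic.endswith(input_type):
            return "MFMA-" + input_type.upper()
    return "unknown"

print(mfma_class("v_mfma_f32_16x16x16f16"))  # MFMA-F16
print(mfma_class("v_mfma_f32_32x32x2bf16"))  # MFMA-BF16
```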
Metric 
Description 
Unit 

MFMA-I8 Instructions 
The total number of 8-bit integer MFMA instructions issued per normalization unit. 
Instructions per normalization unit 
MFMA-F16 Instructions 
The total number of 16-bit floating-point MFMA instructions issued per normalization unit. 
Instructions per normalization unit 
MFMA-BF16 Instructions 
The total number of 16-bit brain floating-point MFMA instructions issued per normalization unit. 
Instructions per normalization unit 
MFMA-F32 Instructions 
The total number of 32-bit floating-point MFMA instructions issued per normalization unit. 
Instructions per normalization unit 
MFMA-F64 Instructions 
The total number of 64-bit floating-point MFMA instructions issued per normalization unit. 
Instructions per normalization unit 
Compute pipeline#
FLOP counting conventions#
Omniperf’s conventions for VALU FLOP counting are as follows:
Addition or multiplication: 1 operation
Transcendentals: 1 operation
Fused multiply-add (FMA): 2 operations
Integer operations (IOPs) do not use this convention. They are counted as a single operation regardless of the instruction type.
Note
Packed operations which operate on multiple operands in the same instruction are counted identically to the underlying instruction type. For example, the v_pk_add_f32 instruction on MI2XX, which performs an add operation on two pairs of aligned 32-bit floating-point operands, is counted only as a single addition, that is, 1 operation.
As discussed in the Instruction mix section, the FLOP/IOP metrics in this section do not take into account the execution mask of the operation, and will report the same value even if the execution mask is identically zero.
For example, an FMA instruction operating on 32-bit floating-point operands (such as v_fma_f32 on an MI2XX accelerator) would be counted as 128 total FLOPs: 2 operations (due to the instruction type) multiplied by 64 (because the wavefront is composed of 64 work-items).
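A minimal sketch of this counting convention follows; the helper function and instruction-kind labels are hypothetical, and the EXEC mask is deliberately ignored as described above.

```python
# FLOP weight per VALU instruction kind, per the convention above
FLOP_WEIGHT = {"add": 1, "mul": 1, "trans": 1, "fma": 2}
WAVEFRONT_SIZE = 64  # fixed on CDNA accelerators

def valu_flops(kind: str, num_instructions: int = 1) -> int:
    """FLOPs credited to VALU instructions, independent of the EXEC mask."""
    return FLOP_WEIGHT[kind] * WAVEFRONT_SIZE * num_instructions

print(valu_flops("fma"))  # 128, matching the v_fma_f32 example above
print(valu_flops("add"))  # 64
```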
Compute Speed-of-Light#
Warning
The theoretical maximum throughput for some metrics in this section is currently computed with the maximum achievable clock frequency, as reported by rocminfo, for an accelerator. This may not be realistic for all workloads.
This section reports the number of floatingpoint and integer operations executed on the VALU and MFMA units in various precisions. We note that unlike the VALU instruction mix and MFMA instruction mix sections, the metrics here are reported as FLOPs and IOPs, that is, the total number of operations executed.
Metric 
Description 
Unit 

VALU FLOPs 
The total floating-point operations executed per second on the VALU. This is also presented as a percent of the peak theoretical FLOPs achievable on the specific accelerator. Note: this does not include any floating-point operations from MFMA instructions. 
GFLOPs 
VALU IOPs 
The total integer operations executed per second on the VALU. This is also presented as a percent of the peak theoretical IOPs achievable on the specific accelerator. Note: this does not include any integer operations from MFMA instructions. 
GIOPs 
MFMA FLOPs (BF16) 
The total number of 16-bit brain floating-point MFMA operations executed per second. Note: this does not include any 16-bit brain floating-point operations from VALU instructions. This is also presented as a percent of the peak theoretical BF16 MFMA operations achievable on the specific accelerator. 
GFLOPs 
MFMA FLOPs (F16) 
The total number of 16-bit floating-point MFMA operations executed per second. Note: this does not include any 16-bit floating-point operations from VALU instructions. This is also presented as a percent of the peak theoretical F16 MFMA operations achievable on the specific accelerator. 
GFLOPs 
MFMA FLOPs (F32) 
The total number of 32-bit floating-point MFMA operations executed per second. Note: this does not include any 32-bit floating-point operations from VALU instructions. This is also presented as a percent of the peak theoretical F32 MFMA operations achievable on the specific accelerator. 
GFLOPs 
MFMA FLOPs (F64) 
The total number of 64-bit floating-point MFMA operations executed per second. Note: this does not include any 64-bit floating-point operations from VALU instructions. This is also presented as a percent of the peak theoretical F64 MFMA operations achievable on the specific accelerator. 
GFLOPs 
MFMA IOPs (INT8) 
The total number of 8-bit integer MFMA operations executed per second. Note: this does not include any 8-bit integer operations from VALU instructions. This is also presented as a percent of the peak theoretical INT8 MFMA operations achievable on the specific accelerator. 
GIOPs 
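As a sketch, the rate-plus-percent-of-peak presentation used by these metrics reduces to the following arithmetic. All values here are hypothetical; real peaks derive from the accelerator's clock frequency and unit counts, as noted in the warning above.

```python
flops_executed = 2.0e11   # total VALU floating-point operations in the kernel
kernel_time_s = 10.0e-3   # kernel duration: 10 ms
peak_gflops = 45_300.0    # hypothetical theoretical peak for the accelerator

achieved_gflops = flops_executed / kernel_time_s / 1e9   # rate in GFLOPs
pct_of_peak = 100.0 * achieved_gflops / peak_gflops      # percent of peak

print(round(achieved_gflops))  # 20000
print(round(pct_of_peak, 1))   # 44.2
```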
Pipeline statistics#
This section reports a number of key performance characteristics of various execution units on the CU. Refer to the Instructions-per-cycle and utilizations example for a detailed dive into these metrics, and the scheduler for a high-level overview of execution units and instruction issue.
Metric 
Description 
Unit 

IPC 
The ratio of the total number of instructions executed on the CU over the total active CU cycles. 
Instructions per-cycle 
IPC (Issued) 
The ratio of the total number of (noninternal) instructions issued over the number of cycles where the scheduler was actively working on issuing instructions. Refer to the Issued IPC example for further detail. 
Instructions per-cycle 
SALU utilization 
Indicates what percent of the kernel’s duration the SALU was busy executing instructions. Computed as the ratio of the total number of cycles spent by the scheduler issuing SALU / SMEM instructions over the total CU cycles. 
Percent 
VALU utilization 
Indicates what percent of the kernel’s duration the VALU was busy executing instructions. Does not include VMEM operations. Computed as the ratio of the total number of cycles spent by the scheduler issuing VALU instructions over the total CU cycles. 
Percent 
VMEM utilization 
Indicates what percent of the kernel’s duration the VMEM unit was busy executing instructions, including both global/generic and spill/scratch operations (see the VMEM instruction count metrics for more detail). Does not include VALU operations. Computed as the ratio of the total number of cycles spent by the scheduler issuing VMEM instructions over the total CU cycles. 
Percent 
Branch utilization 
Indicates what percent of the kernel’s duration the branch unit was busy executing instructions. Computed as the ratio of the total number of cycles spent by the scheduler issuing branch instructions over the total CU cycles. 
Percent 
VALU active threads 
Indicates the average level of divergence within a wavefront over the lifetime of the kernel. The number of work-items that were active in a wavefront during execution of each VALU instruction, time-averaged over all VALU instructions run on all wavefronts in the kernel. 
Work-items 
MFMA utilization 
Indicates what percent of the kernel’s duration the MFMA unit was busy executing instructions. Computed as the ratio of the total number of cycles the MFMA unit was busy over the total CU cycles. 
Percent 
MFMA instruction cycles 
The average duration of MFMA instructions in this kernel in cycles. Computed as the ratio of the total number of cycles the MFMA unit was busy over the total number of MFMA instructions. Compare to, for example, the AMD Matrix Instruction Calculator. 
Cycles per instruction 
VMEM latency 
The average number of roundtrip cycles (that is, from issue to data return / acknowledgment) required for a VMEM instruction to complete. 
Cycles 
SMEM latency 
The average number of roundtrip cycles (that is, from issue to data return / acknowledgment) required for a SMEM instruction to complete. 
Cycles 
Note
The branch utilization reported in this section also includes time spent in other instruction types (namely: s_endpgm) that are typically a very small percentage of the overall kernel execution. This complication is omitted for simplicity, but may result in small amounts of branch utilization (typically less than 1%) for otherwise branchless kernels.
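The utilization metrics in this panel all share one shape: busy cycles divided by total active CU cycles. A sketch with hypothetical counter values:

```python
total_cu_cycles = 1_000_000   # total active CU cycles (hypothetical)
busy_cycles = {               # cycles each unit spent issuing/executing
    "SALU": 120_000,
    "VALU": 640_000,
    "MFMA": 250_000,
}

# Utilization, as a percent of the kernel's duration, per the table above
utilization = {unit: 100.0 * busy / total_cu_cycles
               for unit, busy in busy_cycles.items()}

print(round(utilization["VALU"], 1))  # 64.0
print(round(utilization["SALU"], 1))  # 12.0
```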
Arithmetic operations#
This section reports the total number of floating-point and integer operations executed in various precisions. Unlike the Compute Speed-of-Light panel, this section reports both VALU and MFMA operations of the same precision (e.g., F32) in the same metric. Additionally, this panel lets the user control how the data is normalized (i.e., control the normalization unit), while the speed-of-light panel does not. For more detail on how operations are counted, see the FLOP counting conventions section.
Warning
As discussed in Instruction mix, the metrics in this section do not take into account the execution mask of the operation, and will report the same value even if EXEC is identically zero.
Metric 
Description 
Unit 

FLOPs (Total) 
The total number of floating-point operations executed on either the VALU or MFMA units, per normalization unit. 
FLOP per normalization unit 
IOPs (Total) 
The total number of integer operations executed on either the VALU or MFMA units, per normalization unit. 
IOP per normalization unit 
F16 OPs 
The total number of 16-bit floating-point operations executed on either the VALU or MFMA units, per normalization unit. 
FLOP per normalization unit 
BF16 OPs 
The total number of 16-bit brain floating-point operations executed on either the VALU or MFMA units, per normalization unit. Note: on current CDNA accelerators, the VALU has no native BF16 instructions. 
FLOP per normalization unit 
F32 OPs 
The total number of 32-bit floating-point operations executed on either the VALU or MFMA units, per normalization unit. 
FLOP per normalization unit 
F64 OPs 
The total number of 64-bit floating-point operations executed on either the VALU or MFMA units, per normalization unit. 
FLOP per normalization unit 
INT8 OPs 
The total number of 8-bit integer operations executed on either the VALU or MFMA units, per normalization unit. Note: on current CDNA accelerators, the VALU has no native INT8 instructions. 
IOP per normalization unit 
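To illustrate how this panel differs from the speed-of-light panel, the sketch below merges VALU and MFMA contributions of one precision and then applies a per-wavefront normalization unit. All counts are hypothetical.

```python
# Hypothetical raw counts for one kernel dispatch
valu_f32_flops = 4_000_000
mfma_f32_flops = 12_000_000
total_wavefronts = 4096

# "F32 OPs": VALU and MFMA contributions merged, then normalized per wave
f32_ops_per_wave = (valu_f32_flops + mfma_f32_flops) / total_wavefronts

print(f32_ops_per_wave)  # 3906.25
```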