## Definitions

The following table briefly defines some terminology used in Omniperf interfaces and in this documentation.

| Name | Description | Unit |
|------|-------------|------|
| Kernel time | The number of seconds the accelerator was executing a kernel, from the command processor's (CP) start-of-kernel timestamp (a number of cycles after the CP begins processing the packet) to the CP's end-of-kernel timestamp (a number of cycles before the CP stops processing the packet). | Seconds |
| Kernel cycles | The number of cycles the accelerator was active doing any work, as measured by the command processor (CP). | Cycles |
| Total CU cycles | The number of cycles the accelerator was active doing any work (that is, kernel cycles), multiplied by the number of compute units on the accelerator. A measure of the total possible active cycles the compute units could be doing work, useful for the normalization of metrics inside the CU. | Cycles |
| Total active CU cycles | The number of cycles a CU on the accelerator was active doing any work, summed over all compute units on the accelerator. | Cycles |
| Total SIMD cycles | The number of cycles the accelerator was active doing any work (that is, kernel cycles), multiplied by the number of SIMDs on the accelerator. A measure of the total possible active cycles the SIMDs could be doing work, useful for the normalization of metrics inside the CU. | Cycles |
| Total L2 cycles | The number of cycles the accelerator was active doing any work (that is, kernel cycles), multiplied by the number of L2 channels on the accelerator. A measure of the total possible active cycles the L2 channels could be doing work, useful for the normalization of metrics inside the L2. | Cycles |
| Total active L2 cycles | The number of cycles a channel of the L2 cache was active doing any work, summed over all L2 channels on the accelerator. | Cycles |
| Total sL1D cycles | The number of cycles the accelerator was active doing any work (that is, kernel cycles), multiplied by the number of scalar L1 data caches on the accelerator. A measure of the total possible active cycles the sL1Ds could be doing work, useful for the normalization of metrics inside the sL1D. | Cycles |
| Total L1I cycles | The number of cycles the accelerator was active doing any work (that is, kernel cycles), multiplied by the number of L1 instruction caches (L1I) on the accelerator. A measure of the total possible active cycles the L1Is could be doing work, useful for the normalization of metrics inside the L1I. | Cycles |
| Total scheduler-pipe cycles | The number of cycles the accelerator was active doing any work (that is, kernel cycles), multiplied by the number of scheduler pipes on the accelerator. A measure of the total possible active cycles the scheduler pipes could be doing work, useful for the normalization of metrics inside the workgroup manager and command processor. | Cycles |
| Total shader-engine cycles | The total number of cycles the accelerator was active doing any work, multiplied by the number of shader engines on the accelerator. A measure of the total possible active cycles the shader engines could be doing work, useful for the normalization of metrics inside the workgroup manager. | Cycles |
| Thread-requests | The number of unique memory addresses accessed by a single memory instruction. On AMD Instinct accelerators, this has a maximum of 64 (that is, the size of the wavefront). | Addresses |
| Work-item | A single thread, or lane, of execution that executes in lockstep with the rest of the work-items comprising a wavefront of execution. | N/A |
| Wavefront | A group of work-items, or threads, that execute in lockstep on the compute unit. On AMD Instinct accelerators, the wavefront size is always 64 work-items. | N/A |
| Workgroup | A group of wavefronts that execute on the same compute unit and can cooperatively execute and share data via the use of synchronization primitives, LDS, atomics, and others. | N/A |
| Divergence | Divergence within a wavefront occurs when not all work-items are active when executing an instruction, typically due to non-uniform control flow within the wavefront. It can reduce execution efficiency by, for instance, forcing the VALU to execute both branches of a conditional with different sets of work-items active. | N/A |
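For a concrete illustration of divergence, consider the following HIP kernel (a sketch, not taken from Omniperf): its control flow is non-uniform within each 64-wide wavefront, so the VALU executes both branches serially, each with a partial set of work-items active.

```cpp
#include <hip/hip_runtime.h>

// Illustrative sketch only: a kernel whose control flow diverges within a
// wavefront. Even and odd lanes take different branches, so the hardware
// executes both paths back to back with complementary execution masks.
__global__ void divergent_scale(float* out, const float* in, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    if ((i & 1) == 0) {
      out[i] = in[i] * 2.0f;  // even lanes active here
    } else {
      out[i] = in[i] + 1.0f;  // odd lanes active here
    }
  }
}
```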

### Normalization units

The normalization unit is a user-configurable unit by which you can choose to normalize data. Options include:

| Name | Description |
|------|-------------|
| `per_wave` | The total value of the measured counter or metric that occurred per kernel invocation, divided by the total number of wavefronts launched in the kernel. |
| `per_cycle` | The total value of the measured counter or metric that occurred per kernel invocation, divided by the kernel cycles, that is, the total number of cycles the kernel executed as measured by the command processor. |
| `per_kernel` | The total value of the measured counter or metric that occurred per kernel invocation. |
| `per_second` | The total value of the measured counter or metric that occurred per kernel invocation, divided by the kernel time, that is, the total runtime of the kernel in seconds, as measured by the command processor. |

By default, Omniperf uses the `per_wave` normalization.

> **Tip**
>
> The best normalization may vary depending on your use case. For instance, a `per_second` normalization might be useful for FLOP or bandwidth comparisons, while a `per_wave` normalization could be useful to see how many (and what types of) instructions are used per wavefront. A `per_kernel` normalization can be useful to get the total aggregate values of metrics for comparison between different configurations.
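To make the arithmetic behind these options concrete, the following minimal sketch applies each normalization to made-up totals for a single kernel invocation. All values and variable names are hypothetical, not Omniperf output.

```cpp
#include <cstdio>

int main() {
  // Hypothetical raw totals for one kernel invocation (illustrative only).
  const double total_value   = 4.2e9;    // e.g., total FLOPs counted in the kernel
  const double wavefronts    = 16384.0;  // wavefronts launched by the kernel
  const double kernel_cycles = 1.1e6;    // kernel cycles, as measured by the CP
  const double kernel_time   = 8.5e-4;   // kernel time in seconds, per the CP

  printf("per_wave:   %g\n", total_value / wavefronts);
  printf("per_cycle:  %g\n", total_value / kernel_cycles);
  printf("per_kernel: %g\n", total_value);
  printf("per_second: %g\n", total_value / kernel_time);
  return 0;
}
```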

### Memory spaces

AMD Instinct™ MI-series accelerators can access memory through multiple address spaces which may map to different physical memory locations on the system. The following table provides a view into how various types of memory used in HIP map onto these constructs:

| LLVM Address Space | Hardware Memory Space | HIP Terminology |
|--------------------|-----------------------|-----------------|
| Generic | Flat | N/A |
| Global | Global | Global |
| Local | LDS | LDS/Shared |
| Private | Scratch | Private |
| Constant | Same as global | Constant |

The following is a high-level description of the address spaces in the AMDGPU backend of LLVM:

| Address space | Description |
|---------------|-------------|
| Global | Memory that can be seen by all threads in a process, and may be backed by the local accelerator's HBM, a remote accelerator's HBM, or the CPU's DRAM. |
| Local | Memory that is only visible to a particular workgroup. On AMD's Instinct accelerator hardware, this is stored in LDS memory. |
| Private | Memory that is only visible to a particular [work-item](workitem) (thread), stored in the scratch space on AMD's Instinct accelerators. |
| Constant | Read-only memory that is in the global address space and stored on the local accelerator's HBM. |
| Generic | Used when the compiler cannot statically prove that a pointer is addressing memory in a single (non-generic) address space. Mapped to Flat on AMD's Instinct accelerators; the pointer could dynamically address global, local, private, or constant memory. |

LLVM’s documentation for AMDGPU Backend has the most up-to-date information. Refer to this source for a more complete explanation.
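These address spaces all surface in ordinary HIP source. The following sketch (illustrative names; it assumes a workgroup size of 256) annotates where each one appears:

```cpp
#include <hip/hip_runtime.h>

__constant__ float scale;  // Constant: read-only, in the global address space

// Generic/Flat: the compiler cannot statically prove which memory space
// p addresses here, so the load is lowered to flat addressing.
__device__ float load_any(const float* p) { return *p; }

// x and y point to Global memory (typically the local accelerator's HBM).
__global__ void scaled_copy(float* y, const float* x, int n) {
  __shared__ float tile[256];  // Local: backed by LDS, shared by the workgroup
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) tile[threadIdx.x] = x[i];  // assumes blockDim.x == 256
  __syncthreads();
  if (i < n) {
    float tmp = scale * load_any(&tile[threadIdx.x]);  // tmp is Private:
    y[i] = tmp;  // held in registers, spilled to scratch if needed
  }
}
```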

### Memory type

AMD Instinct accelerators contain a number of different memory allocation types to enable the HIP language’s memory coherency model. These memory types are broadly similar between AMD Instinct accelerator generations, but may differ in exact implementation.

In addition, these memory types might differ between accelerators on the same system, even when accessing the same memory allocation.

For example, an MI2XX accelerator accessing fine-grained memory allocated local to that device may see the allocation as coherently cacheable, while a remote accelerator might see the same allocation as uncached.

These memory types include:

| Memory type | Description |
|-------------|-------------|
| Uncached Memory (UC) | Memory that will not be cached in this accelerator. On MI2XX accelerators, this corresponds to "fine-grained" (or "coherent") memory allocated on a remote accelerator or the host, for example, using `hipHostMalloc` or `hipMallocManaged` with default allocation flags. |
| Non-hardware-Coherent Memory (NC) | Memory that will be cached by the accelerator, and is only guaranteed to be consistent at kernel boundaries / after software-driven synchronization events. On MI2XX accelerators, this type of memory maps to, for example, "coarse-grained" `hipHostMalloc`'d memory (that is, allocated with the `hipHostMallocNonCoherent` flag) or `hipMalloc`'d memory allocated on a remote accelerator. |
| Coherently Cacheable (CC) | Memory for which only reads from the accelerator where the memory was allocated will be cached. Writes to CC memory are uncached, and trigger invalidations of any line within this accelerator. On MI2XX accelerators, this type of memory maps to "fine-grained" memory allocated on the local accelerator using, for example, the `hipExtMallocWithFlags` API with the `hipDeviceMallocFinegrained` flag. |
| Read/Write Coherent Memory (RW) | Memory that will be cached by the accelerator, but may be invalidated by writes from remote devices at kernel boundaries / after software-driven synchronization events. On MI2XX accelerators, this corresponds to "coarse-grained" memory allocated locally to the accelerator using, for example, the default `hipMalloc` allocator. |

Find a good discussion of coarse- and fine-grained memory allocations, and of what type of memory is returned by various combinations of memory allocators, flags, and arguments, in the Crusher quick-start guide.
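As a rough sketch (error handling omitted; behavior of each flag as described in the table above for MI2XX accelerators), the following shows allocator calls that typically produce each of the four memory types:

```cpp
#include <hip/hip_runtime.h>

int main() {
  const size_t bytes = 1 << 20;
  void *rw = nullptr, *cc = nullptr, *nc = nullptr, *uc = nullptr;

  // RW: coarse-grained memory local to this accelerator (default hipMalloc).
  (void)hipMalloc(&rw, bytes);

  // CC: fine-grained memory on the local accelerator.
  (void)hipExtMallocWithFlags(&cc, bytes, hipDeviceMallocFinegrained);

  // NC: coarse-grained host memory, cached by the accelerator.
  (void)hipHostMalloc(&nc, bytes, hipHostMallocNonCoherent);

  // UC (as seen from this accelerator): fine-grained ("coherent") host
  // memory, the default for hipHostMalloc.
  (void)hipHostMalloc(&uc, bytes);

  (void)hipHostFree(uc);
  (void)hipHostFree(nc);
  (void)hipFree(cc);
  (void)hipFree(rw);
  return 0;
}
```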