Definitions#
The following table briefly defines some terminology used in Omniperf interfaces and in this documentation.
Name | Description | Unit
---|---|---
Kernel time | The number of seconds the accelerator was executing a kernel, from the command processor’s (CP) start-of-kernel timestamp (a number of cycles after the CP begins processing the packet) to the CP’s end-of-kernel timestamp (a number of cycles before the CP stops processing the packet). | Seconds
Kernel cycles | The number of cycles the accelerator was active doing any work, as measured by the command processor (CP). | Cycles
Total CU cycles | The number of cycles the accelerator was active doing any work (that is, kernel cycles), multiplied by the number of compute units on the accelerator. A measure of the total possible active cycles the compute units could be doing work, useful for the normalization of metrics inside the CU. | Cycles
Total active CU cycles | The number of cycles a CU on the accelerator was active doing any work, summed over all compute units on the accelerator. | Cycles
Total SIMD cycles | The number of cycles the accelerator was active doing any work (that is, kernel cycles), multiplied by the number of SIMDs on the accelerator. A measure of the total possible active cycles the SIMDs could be doing work, useful for the normalization of metrics inside the CU. | Cycles
Total L2 cycles | The number of cycles the accelerator was active doing any work (that is, kernel cycles), multiplied by the number of L2 channels on the accelerator. A measure of the total possible active cycles the L2 channels could be doing work, useful for the normalization of metrics inside the L2. | Cycles
Total active L2 cycles | The number of cycles a channel of the L2 cache was active doing any work, summed over all L2 channels on the accelerator. | Cycles
Total sL1D cycles | The number of cycles the accelerator was active doing any work (that is, kernel cycles), multiplied by the number of scalar L1 data caches on the accelerator. A measure of the total possible active cycles the sL1Ds could be doing work, useful for the normalization of metrics inside the sL1D. | Cycles
Total L1I cycles | The number of cycles the accelerator was active doing any work (that is, kernel cycles), multiplied by the number of L1 instruction caches (L1I) on the accelerator. A measure of the total possible active cycles the L1Is could be doing work, useful for the normalization of metrics inside the L1I. | Cycles
Total scheduler-pipe cycles | The number of cycles the accelerator was active doing any work (that is, kernel cycles), multiplied by the number of scheduler pipes on the accelerator. A measure of the total possible active cycles the scheduler pipes could be doing work, useful for the normalization of metrics inside the workgroup manager and command processor. | Cycles
Total shader-engine cycles | The total number of cycles the accelerator was active doing any work, multiplied by the number of shader engines on the accelerator. A measure of the total possible active cycles the shader engines could be doing work, useful for the normalization of metrics inside the workgroup manager. | Cycles
Thread-requests | The number of unique memory addresses accessed by a single memory instruction. On AMD Instinct accelerators, this has a maximum of 64 (that is, the size of the wavefront). | Addresses
Work-item | A single thread, or lane, of execution that executes in lockstep with the rest of the work-items comprising a wavefront of execution. | N/A
Wavefront | A group of work-items, or threads, that execute in lockstep on the compute unit. On AMD Instinct accelerators, the wavefront size is always 64 work-items. | N/A
Workgroup | A group of wavefronts that execute on the same compute unit and can cooperatively execute and share data through synchronization primitives, LDS, atomics, and other mechanisms. | N/A
Divergence | Divergence within a wavefront occurs when not all work-items are active when executing an instruction, for example due to non-uniform control flow within the wavefront. It can reduce execution efficiency by forcing the VALU to execute both branches of a conditional, each with a different subset of work-items active. | N/A
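To make the wavefront and divergence definitions concrete, the following minimal HIP kernel (a hypothetical example, not part of Omniperf) contains a branch whose condition differs between the lanes of each 64-wide wavefront:

```cpp
#include <hip/hip_runtime.h>

// Hypothetical kernel illustrating wavefront-level divergence.
// On AMD Instinct accelerators each wavefront contains 64 work-items, so the
// branch below splits every wavefront into two sets of active lanes: the VALU
// executes the "if" path with half of the lanes masked off, then the "else"
// path with the other half masked off.
__global__ void divergent_kernel(const float* in, float* out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;

    if ((threadIdx.x % 64) < 32) {
        out[idx] = in[idx] * 2.0f;   // taken by lanes 0-31 of each wavefront
    } else {
        out[idx] = in[idx] + 1.0f;   // taken by lanes 32-63 of each wavefront
    }
}
```

Both sides of the branch occupy the VALU with only a subset of the wavefront’s work-items active, which is the efficiency cost described in the Divergence entry above.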
Normalization units#
A user-configurable unit by which you can choose to normalize data. Options include:
Name | Description
---|---
per_wave | The total value of the measured counter or metric that occurred per kernel invocation divided by the total number of wavefronts launched in the kernel.
per_cycle | The total value of the measured counter or metric that occurred per kernel invocation divided by the kernel cycles, that is, the total number of cycles the kernel executed as measured by the command processor.
per_kernel | The total value of the measured counter or metric that occurred per kernel invocation.
per_second | The total value of the measured counter or metric that occurred per kernel invocation divided by the kernel time, that is, the total runtime of the kernel in seconds, as measured by the command processor.
By default, Omniperf uses the per_wave normalization.

Tip

The best normalization may vary depending on your use case. For instance, a per_second normalization might be useful for FLOP or bandwidth comparisons, while a per_wave normalization could be useful to see how many (and what types of) instructions are used per wavefront. A per_kernel normalization can be useful to get the total aggregate values of metrics for comparison between different configurations.
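As a sketch of how these normalizations relate to one another, the snippet below derives each unit from a raw counter value using the definitions above; the variable names and numbers are illustrative, not Omniperf output fields:

```cpp
#include <cstdio>

// Illustrative only: relates a raw counter value for one kernel invocation to
// the normalization units described above. All values are made up.
int main()
{
    double raw_value     = 1.0e9;   // total counter value for the kernel invocation
    double wavefronts    = 4096.0;  // wavefronts launched by the kernel
    double kernel_cycles = 2.0e6;   // kernel cycles, as measured by the CP
    double kernel_time   = 1.5e-3;  // kernel time in seconds, as measured by the CP

    std::printf("per_kernel: %g\n", raw_value);                  // total per invocation
    std::printf("per_wave:   %g\n", raw_value / wavefronts);     // divided by wavefronts launched
    std::printf("per_cycle:  %g\n", raw_value / kernel_cycles);  // divided by kernel cycles
    std::printf("per_second: %g\n", raw_value / kernel_time);    // divided by kernel time
    return 0;
}
```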
Memory spaces#
AMD Instinct™ MI-series accelerators can access memory through multiple address spaces, which may map to different physical memory locations on the system. The following table shows how various types of memory used in HIP map onto these constructs:
LLVM Address Space | Hardware Memory Space | HIP Terminology
---|---|---
Generic | Flat | N/A
Global | Global | Global
Local | LDS | LDS/Shared
Private | Scratch | Private
Constant | Same as global | Constant
The following is a high-level description of the address spaces in the AMDGPU backend of LLVM:
Address space | Description
---|---
Global | Memory that can be seen by all threads in a process, and may be backed by the local accelerator’s HBM, a remote accelerator’s HBM, or the CPU’s DRAM.
Local | Memory that is only visible to a particular workgroup. On AMD’s Instinct accelerator hardware, this is stored in LDS memory.
Private | Memory that is only visible to a particular work-item (thread), stored in the scratch space on AMD’s Instinct accelerators.
Constant | Read-only memory that is in the global address space and stored on the local accelerator’s HBM.
Generic | Used when the compiler cannot statically prove that a pointer is addressing memory in a single (non-generic) address space. Mapped to Flat on AMD’s Instinct accelerators, the pointer could dynamically address global, local, private, or constant memory.
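As a rough sketch of how common HIP source constructs map onto these address spaces (the kernel and variable names are hypothetical):

```cpp
#include <hip/hip_runtime.h>

// Read-only table in the constant address space (backed by global memory on
// the local accelerator's HBM).
__constant__ float coefficients[16];

// Assumes a launch with 256 work-items per workgroup.
__global__ void address_space_demo(const float* __restrict__ in,  // global: device (HBM) allocation
                                   float* __restrict__ out,       // global
                                   int n)
{
    // Local (LDS / "shared") memory: visible only to this workgroup.
    __shared__ float tile[256];

    // Private (scratch) memory: visible only to this work-item.
    float accum = 0.0f;

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;

    tile[threadIdx.x] = in[idx];   // global -> local (LDS)
    __syncthreads();               // workgroup-level synchronization

    accum = tile[threadIdx.x] * coefficients[threadIdx.x % 16];  // local + constant -> private
    out[idx] = accum;              // private -> global
}
```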
LLVM’s documentation for AMDGPU Backend has the most up-to-date information. Refer to this source for a more complete explanation.
Memory type#
AMD Instinct accelerators contain a number of different memory allocation types to enable the HIP language’s memory coherency model. These memory types are broadly similar between AMD Instinct accelerator generations, but may differ in exact implementation.
In addition, these memory types might differ between accelerators on the same system, even when accessing the same memory allocation.
For example, an MI2XX accelerator accessing fine-grained memory allocated locally to that device may see the allocation as coherently cacheable, while a remote accelerator might see the same allocation as uncached.
These memory types include:
Memory type | Description
---|---
Uncached Memory (UC) | Memory that will not be cached in this accelerator. On MI2XX accelerators, this corresponds to “fine-grained” (or “coherent”) memory allocated on a remote accelerator or the host.
Non-hardware-Coherent Memory (NC) | Memory that will be cached by the accelerator, and is only guaranteed to be consistent at kernel boundaries / after software-driven synchronization events. On MI2XX accelerators, this type of memory maps to, for example, “coarse-grained” allocations.
Coherently Cacheable (CC) | Memory for which only reads from the accelerator where the memory was allocated will be cached. Writes to CC memory are uncached, and trigger invalidations of any line within this accelerator. On MI2XX accelerators, this type of memory maps to “fine-grained” memory allocated on the local accelerator.
Read/Write Coherent Memory (RW) | Memory that will be cached by the accelerator, but may be invalidated by writes from remote devices at kernel boundaries / after software-driven synchronization events. On MI2XX accelerators, this corresponds to “coarse-grained” memory allocated locally to the accelerator using, for example, the default allocator.
Find a good discussion of coarse- and fine-grained memory allocations, and of what type of memory is returned by various combinations of memory allocators, flags, and arguments, in the Crusher quick-start guide.
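As a hedged sketch of how these memory types are commonly obtained from HIP allocators on MI2XX-class systems (the exact mapping depends on the accelerator, driver, and ROCm release, and the flags shown should be verified against the HIP documentation for your installation):

```cpp
#include <hip/hip_runtime.h>

// Sketch only: typical HIP allocations and the memory type they commonly map
// to on MI2XX-class accelerators. Error checking omitted for brevity.
int main()
{
    constexpr size_t bytes = 1 << 20;

    // Default device allocation: coarse-grained memory local to the
    // accelerator (commonly the RW type described above).
    float* d_rw = nullptr;
    (void)hipMalloc(&d_rw, bytes);

    // Default pinned host allocation: fine-grained ("coherent") host memory,
    // which a device typically treats as uncached (UC).
    float* h_uc = nullptr;
    (void)hipHostMalloc(&h_uc, bytes);

    // Non-coherent pinned host allocation: coarse-grained host memory that the
    // device may cache, consistent only at synchronization points (NC-like).
    float* h_nc = nullptr;
    (void)hipHostMalloc(&h_nc, bytes, hipHostMallocNonCoherent);

    // Fine-grained device allocation: reads cached on the owning accelerator,
    // writes uncached (CC-like). Flag availability varies by ROCm version.
    float* d_cc = nullptr;
    (void)hipExtMallocWithFlags(reinterpret_cast<void**>(&d_cc), bytes,
                                hipDeviceMallocFinegrained);

    hipFree(d_rw);
    hipHostFree(h_uc);
    hipHostFree(h_nc);
    hipFree(d_cc);
    return 0;
}
```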