Definitions#
The following table briefly defines some terminology used in ROCm Compute Profiler interfaces and in this documentation.
| Name | Description | Unit |
|---|---|---|
| Kernel time | The number of seconds the accelerator was executing a kernel, from the command processor's (CP) start-of-kernel timestamp (a number of cycles after the CP begins processing the packet) to the CP's end-of-kernel timestamp (a number of cycles before the CP stops processing the packet). | Seconds |
| Kernel cycles | The number of cycles the accelerator was active doing any work, as measured by the command processor (CP). | Cycles |
| Total CU cycles | The number of cycles the accelerator was active doing any work (that is, kernel cycles), multiplied by the number of compute units on the accelerator. A measure of the total possible active cycles the compute units could be doing work, useful for the normalization of metrics inside the CU. | Cycles |
| Total active CU cycles | The number of cycles a CU on the accelerator was active doing any work, summed over all compute units on the accelerator. | Cycles |
| Total SIMD cycles | The number of cycles the accelerator was active doing any work (that is, kernel cycles), multiplied by the number of SIMDs on the accelerator. A measure of the total possible active cycles the SIMDs could be doing work, useful for the normalization of metrics inside the CU. | Cycles |
| Total L2 cycles | The number of cycles the accelerator was active doing any work (that is, kernel cycles), multiplied by the number of L2 channels on the accelerator. A measure of the total possible active cycles the L2 channels could be doing work, useful for the normalization of metrics inside the L2. | Cycles |
| Total active L2 cycles | The number of cycles a channel of the L2 cache was active doing any work, summed over all L2 channels on the accelerator. | Cycles |
| Total sL1D cycles | The number of cycles the accelerator was active doing any work (that is, kernel cycles), multiplied by the number of scalar L1 data caches on the accelerator. A measure of the total possible active cycles the sL1Ds could be doing work, useful for the normalization of metrics inside the sL1D. | Cycles |
| Total L1I cycles | The number of cycles the accelerator was active doing any work (that is, kernel cycles), multiplied by the number of L1 instruction caches (L1I) on the accelerator. A measure of the total possible active cycles the L1Is could be doing work, useful for the normalization of metrics inside the L1I. | Cycles |
| Total scheduler-pipe cycles | The number of cycles the accelerator was active doing any work (that is, kernel cycles), multiplied by the number of scheduler pipes on the accelerator. A measure of the total possible active cycles the scheduler pipes could be doing work, useful for the normalization of metrics inside the workgroup manager and command processor. | Cycles |
| Total shader-engine cycles | The total number of cycles the accelerator was active doing any work, multiplied by the number of shader engines on the accelerator. A measure of the total possible active cycles the shader engines could be doing work, useful for the normalization of metrics inside the workgroup manager. | Cycles |
| Thread-requests | The number of unique memory addresses accessed by a single memory instruction. On AMD Instinct accelerators, this has a maximum of 64 (that is, the size of the wavefront). | Addresses |
| Work-item | A single thread, or lane, of execution that executes in lockstep with the rest of the work-items comprising a wavefront of execution. | N/A |
| Wavefront | A group of work-items, or threads, that execute in lockstep on the compute unit. On AMD Instinct accelerators, the wavefront size is always 64 work-items. | N/A |
| Workgroup | A group of wavefronts that execute on the same compute unit, and can cooperatively execute and share data via the use of synchronization primitives, LDS, atomics, and others. | N/A |
| Divergence | Divergence within a wavefront occurs when not all work-items are active when executing an instruction, that is, due to non-uniform control flow within a wavefront. This can reduce execution efficiency by causing, for instance, the VALU to execute both branches of a conditional with different sets of work-items active (illustrated in the sketch after this table). | N/A |
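To make the work-item, wavefront, workgroup, and divergence terms concrete, here is a minimal HIP sketch; the kernel, its name, and the launch parameters are hypothetical and not part of ROCm Compute Profiler. With workgroups of 256 work-items and a 64-wide wavefront, each workgroup comprises four wavefronts:

```cpp
#include <hip/hip_runtime.h>

// Hypothetical kernel used only to illustrate the terminology above.
// Each GPU thread is one work-item; on AMD Instinct accelerators, every 64
// consecutive work-items of a workgroup form one wavefront.
__global__ void scale(float* data, float factor, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // this work-item's global index
    if (idx < n) {        // non-uniform branch: the last wavefront may diverge here
        data[idx] *= factor;
    }
}

int main() {
    const int n = 1 << 20;
    float* d_data = nullptr;
    (void)hipMalloc(&d_data, n * sizeof(float));

    // 256 work-items per workgroup => 256 / 64 = 4 wavefronts per workgroup.
    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);
    scale<<<grid, block>>>(d_data, 2.0f, n);
    (void)hipDeviceSynchronize();
    (void)hipFree(d_data);
    return 0;
}
```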
Normalization units#
A user-configurable unit by which you can choose to normalize data. Options include:
| Name | Description |
|---|---|
| per_wave | The total value of the measured counter or metric that occurred per kernel invocation divided by the total number of wavefronts launched in the kernel. |
| per_cycle | The total value of the measured counter or metric that occurred per kernel invocation divided by the kernel cycles, that is, the total number of cycles the kernel executed as measured by the command processor. |
| per_kernel | The total value of the measured counter or metric that occurred per kernel invocation. |
| per_second | The total value of the measured counter or metric that occurred per kernel invocation divided by the kernel time, that is, the total runtime of the kernel in seconds, as measured by the command processor. |
By default, ROCm Compute Profiler uses the per_wave normalization.
Tip
The best normalization may vary depending on your use case. For instance, a per_second normalization might be useful for FLOP or bandwidth comparisons, while a per_wave normalization could be useful to see how many (and what types of) instructions are used per wavefront. A per_kernel normalization can be useful to get the total aggregate values of metrics for comparison between different configurations.
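As a worked example, the following sketch uses made-up counter values (not taken from any real profile) to show how a single raw total maps to each normalization unit:

```cpp
#include <cstdio>

int main() {
    // Hypothetical raw values for a single kernel invocation, used only to
    // relate the normalization units to one another.
    const double total_flop     = 4.0e9;   // counter total for the invocation
    const double wavefronts     = 65536.0; // wavefronts launched by the kernel
    const double kernel_cycles  = 2.0e6;   // kernel cycles, as measured by the CP
    const double kernel_seconds = 1.5e-3;  // kernel time, as measured by the CP

    std::printf("per_kernel : %g FLOP\n",           total_flop);
    std::printf("per_wave   : %g FLOP/wavefront\n", total_flop / wavefronts);
    std::printf("per_cycle  : %g FLOP/cycle\n",     total_flop / kernel_cycles);
    std::printf("per_second : %g FLOP/s\n",         total_flop / kernel_seconds);
    return 0;
}
```

Only the denominator changes between the four units, as described in the table above.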
Memory spaces#
AMD Instinct™ MI-series accelerators can access memory through multiple address spaces which may map to different physical memory locations on the system. The following table provides a view into how various types of memory used in HIP map onto these constructs:
| LLVM Address Space | Hardware Memory Space | HIP Terminology |
|---|---|---|
| Generic | Flat | N/A |
| Global | Global | Global |
| Local | LDS | LDS/Shared |
| Private | Scratch | Private |
| Constant | Same as global | Constant |
The following is a high-level description of the address spaces in the AMDGPU backend of LLVM:
| Address space | Description |
|---|---|
| Global | Memory that can be seen by all threads in a process, and may be backed by the local accelerator's HBM, a remote accelerator's HBM, or the CPU's DRAM. |
| Local | Memory that is only visible to a particular workgroup. On AMD's Instinct accelerator hardware, this is stored in LDS memory. |
| Private | Memory that is only visible to a particular work-item (thread), stored in the scratch space on AMD's Instinct accelerators. |
| Constant | Read-only memory that is in the global address space and stored on the local accelerator's HBM. |
| Generic | Used when the compiler cannot statically prove that a pointer is addressing memory in a single (non-generic) address space. Mapped to Flat on AMD's Instinct accelerators, the pointer could dynamically address global, local, private, or constant memory. |
LLVM’s documentation for AMDGPU Backend has the most up-to-date information. Refer to this source for a more complete explanation.
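To illustrate how these address spaces typically appear in HIP source, here is a minimal, hypothetical kernel (not part of ROCm Compute Profiler); the variable and function names are invented for this sketch:

```cpp
#include <hip/hip_runtime.h>

__constant__ float scale_factor;           // Constant: read-only, global address space

// Hypothetical kernel; assumes it is launched with blockDim.x == 256.
__global__ void scale_via_lds(const float* in,  // Global: local HBM, remote HBM, or CPU DRAM
                              float* out) {
    __shared__ float tile[256];            // LDS/Shared: the Local address space (LDS memory)
    float value;                           // Private: scratch space (often held in registers)

    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    value = in[gid];                       // load from global memory into a private value
    tile[threadIdx.x] = value;             // stage the value in LDS
    __syncthreads();

    // A pointer whose address space the compiler cannot prove statically would
    // instead be lowered to the Generic (Flat) address space.
    out[gid] = scale_factor * tile[threadIdx.x];
}
```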
Memory type#
AMD Instinct accelerators contain a number of different memory allocation types to enable the HIP language’s memory coherency model. These memory types are broadly similar between AMD Instinct accelerator generations, but may differ in exact implementation.
In addition, these memory types might differ between accelerators on the same system, even when accessing the same memory allocation.
For example, an MI2XX accelerator accessing fine-grained memory allocated local to that device may see the allocation as coherently cacheable, while a remote accelerator might see the same allocation as uncached.
These memory types include:
| Memory type | Description |
|---|---|
| Uncached Memory (UC) | Memory that will not be cached in this accelerator. On MI2XX accelerators, this corresponds to “fine-grained” (or “coherent”) memory allocated on a remote accelerator or the host, for example, using hipHostMalloc or hipMallocManaged with default allocation flags. |
| Non-hardware-Coherent Memory (NC) | Memory that will be cached by the accelerator, and is only guaranteed to be consistent at kernel boundaries / after software-driven synchronization events. On MI2XX accelerators, this type of memory maps to, for example, “coarse-grained” host memory allocated with the hipHostMallocNonCoherent flag, or managed memory given the hipMemAdviseSetCoarseGrain hint. |
| Coherently Cacheable (CC) | Memory for which only reads from the accelerator where the memory was allocated will be cached. Writes to CC memory are uncached, and trigger invalidations of any line within this accelerator. On MI2XX accelerators, this type of memory maps to “fine-grained” memory allocated on the local accelerator using, for example, the hipExtMallocWithFlags API with the fine-grained allocation flag. |
| Read/Write Coherent Memory (RW) | Memory that will be cached by the accelerator, but may be invalidated by writes from remote devices at kernel boundaries / after software-driven synchronization events. On MI2XX accelerators, this corresponds to “coarse-grained” memory allocated locally to the accelerator using, for example, the default hipMalloc allocator. |
A good discussion of coarse- and fine-grained memory allocations, and of what type of memory is returned by various combinations of memory allocators, flags, and arguments, can be found in the Crusher quick-start guide.
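As an illustration only, the following sketch shows HIP allocations that typically map to these memory types on an MI2XX accelerator; the exact memory type observed depends on the accelerator, driver, and allocation flags, so treat this as a rough guide rather than a definitive mapping:

```cpp
#include <hip/hip_runtime.h>

int main() {
    const size_t bytes = 1 << 20;
    void* rw = nullptr; void* uc = nullptr; void* nc = nullptr; void* cc = nullptr;

    // RW: coarse-grained memory local to the accelerator, via the default device allocator.
    (void)hipMalloc(&rw, bytes);

    // UC (as seen from a remote accelerator): fine-grained ("coherent") host memory.
    (void)hipHostMalloc(&uc, bytes, hipHostMallocCoherent);

    // NC: coarse-grained host memory, only guaranteed consistent at kernel
    // boundaries or after software-driven synchronization.
    (void)hipHostMalloc(&nc, bytes, hipHostMallocNonCoherent);

    // CC: fine-grained memory allocated on the local accelerator.
    (void)hipExtMallocWithFlags(&cc, bytes, hipDeviceMallocFinegrained);

    (void)hipFree(cc);
    (void)hipHostFree(nc);
    (void)hipHostFree(uc);
    (void)hipFree(rw);
    return 0;
}
```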