Profiling by example

Profiling by example#

The following examples refer to sample HIP code located in ROCm/omniperf/blob/dev/sample and distributed as part of Omniperf.

VALU arithmetic instruction mix#

For this example, consider the instruction mix sample distributed as a part of Omniperf.

Note

The examples in the section are expected to work on all CDNA™ accelerators. However, the actual experiment results in this section were collected on an MI2XX accelerator.

Design note#

This code uses a number of inline assembly instructions to cleanly identify the types of instructions being issued, as well as to avoid optimization / dead-code elimination by the compiler. While inline assembly is inherently not portable, this example is expected to work on all GCN™ GPUs and CDNA accelerators.

We reproduce a sample of the kernel as follows:

// fp32: add, mul, transcendental and fma
float f1, f2;
asm volatile(
    "v_add_f32_e32 %0, %1, %0\n"
    "v_mul_f32_e32 %0, %1, %0\n"
    "v_sqrt_f32 %0, %1\n"
    "v_fma_f32 %0, %1, %0, %1\n"
    : "=v"(f1)
    : "v"(f2));

These instructions correspond to:

A 32-bit floating point addition,
a 32-bit floating point multiplication,
a 32-bit floating point square-root transcendental operation, and
a 32-bit floating point fused multiply-add operation.

For more detail, refer to the CDNA2 ISA Guide.

Instruction mix#

This example was compiled and run on a MI250 accelerator using ROCm v5.6.0, and Omniperf v2.0.0.

$ hipcc -O3 instmix.hip -o instmix

Generate the profile for this example using the following command.

$ omniperf profile -n instmix --no-roof -- ./instmix

Analyze the instruction mix section.

$ omniperf analyze -p workloads/instmix/mi200/ -b 10.2
<...>
10. Compute Units - Instruction Mix
10.2 VALU Arithmetic Instr Mix
╒═════════╤════════════╤═════════╤════════════════╕
│ Index   │ Metric     │   Count │ Unit           │
╞═════════╪════════════╪═════════╪════════════════╡
│ 10.2.0  │ INT32      │    1.00 │ Instr per wave │
├─────────┼────────────┼─────────┼────────────────┤
│ 10.2.1  │ INT64      │    1.00 │ Instr per wave │
├─────────┼────────────┼─────────┼────────────────┤
│ 10.2.2  │ F16-ADD    │    1.00 │ Instr per wave │
├─────────┼────────────┼─────────┼────────────────┤
│ 10.2.3  │ F16-MUL    │    1.00 │ Instr per wave │
├─────────┼────────────┼─────────┼────────────────┤
│ 10.2.4  │ F16-FMA    │    1.00 │ Instr per wave │
├─────────┼────────────┼─────────┼────────────────┤
│ 10.2.5  │ F16-Trans  │    1.00 │ Instr per wave │
├─────────┼────────────┼─────────┼────────────────┤
│ 10.2.6  │ F32-ADD    │    1.00 │ Instr per wave │
├─────────┼────────────┼─────────┼────────────────┤
│ 10.2.7  │ F32-MUL    │    1.00 │ Instr per wave │
├─────────┼────────────┼─────────┼────────────────┤
│ 10.2.8  │ F32-FMA    │    1.00 │ Instr per wave │
├─────────┼────────────┼─────────┼────────────────┤
│ 10.2.9  │ F32-Trans  │    1.00 │ Instr per wave │
├─────────┼────────────┼─────────┼────────────────┤
│ 10.2.10 │ F64-ADD    │    1.00 │ Instr per wave │
├─────────┼────────────┼─────────┼────────────────┤
│ 10.2.11 │ F64-MUL    │    1.00 │ Instr per wave │
├─────────┼────────────┼─────────┼────────────────┤
│ 10.2.12 │ F64-FMA    │    1.00 │ Instr per wave │
├─────────┼────────────┼─────────┼────────────────┤
│ 10.2.13 │ F64-Trans  │    1.00 │ Instr per wave │
├─────────┼────────────┼─────────┼────────────────┤
│ 10.2.14 │ Conversion │    1.00 │ Instr per wave │
╘═════════╧════════════╧═════════╧════════════════╛

This shows that we have exactly one of each type of VALU arithmetic instruction by construction.

Infinity Fabric transactions#

For this example, consider the Infinity Fabric™ sample distributed as a part of Omniperf.

This following code snippet launches a simple read-only kernel.

// the main streaming kernel
__global__ void kernel(int* x, size_t N, int zero) {
  int sum = 0;
  const size_t offset_start = threadIdx.x + blockIdx.x * blockDim.x;
  for (int i = 0; i < 10; ++i) {
    for (size_t offset = offset_start; offset < N; offset += blockDim.x * gridDim.x) {
      sum += x[offset];
    }
  }
  if (sum != 0) {
    x[offset_start] = sum;
  }
}

This happens twice – once as a warm-up and once for analysis. Note that the buffer x is initialized to all zeros via a call to hipMemcpy on the host before the kernel is ever launched. Therefore, the following conditional is identically false – and thus we expect no writes.

if (sum != 0) { ...

Note

The actual sample included with Omniperf also includes the ability to select different operation types (such as atomics, writes). This abbreviated version is presented here for reference only.

Finally, this sample code lets the user control the granularity of an allocation, the owner of an allocation (local HBM, CPU DRAM or remote HBM), and the size of an allocation (the default is $\sim4$GiB) via command line arguments. In doing so, we can explore the impact of these parameters on the L2-Fabric metrics reported by Omniperf to further understand their meaning.

Note

All results in this section were generated an a node of Infinity Fabric connected MI250 accelerators using ROCm version 5.6.0, and Omniperf version 2.0.0. Although results may vary with ROCm versions and accelerator connectivity, we expect the lessons learned here to be broadly applicable.

Experiment 1: Coarse-grained, accelerator-local HBM reads#

In our first experiment, we consider the simplest possible case, a hipMalloc’d buffer that is local to our current accelerator:

$ omniperf profile -n coarse_grained_local --no-roof -- ./fabric -t 1 -o 0
Using:
  mtype:CoarseGrained
  mowner:Device
  mspace:Global
  mop:Read
  mdata:Unsigned
  remoteId:-1
<...>
$ omniperf analyze -p workloads/coarse_grained_local/mi200 -b 17.2.0 17.2.1 17.2.2 17.4.0 17.4.1 17.4.2 17.5.0 17.5.1 17.5.2 17.5.3 17.5.4 -n per_kernel --dispatch 2
<...>
17. L2 Cache
17.2 L2 - Fabric Transactions
╒═════════╤═════════════════════╤════════════════╤════════════════╤════════════════╤══════════════════╕
│ Index   │ Metric              │            Avg │            Min │            Max │ Unit             │
╞═════════╪═════════════════════╪════════════════╪════════════════╪════════════════╪══════════════════╡
│ 17.2.0  │ L2-Fabric Read BW   │ 42947428672.00 │ 42947428672.00 │ 42947428672.00 │ Bytes per kernel │
├─────────┼─────────────────────┼────────────────┼────────────────┼────────────────┼──────────────────┤
│ 17.2.1  │ HBM Read Traffic    │         100.00 │         100.00 │         100.00 │ Pct              │
├─────────┼─────────────────────┼────────────────┼────────────────┼────────────────┼──────────────────┤
│ 17.2.2  │ Remote Read Traffic │           0.00 │           0.00 │           0.00 │ Pct              │
╘═════════╧═════════════════════╧════════════════╧════════════════╧════════════════╧══════════════════╛
17.4 L2 - Fabric Interface Stalls
╒═════════╤═══════════════════════════════╤════════════════════════╤═══════════════╤═══════╤═══════╤═══════╤════════╕
│ Index   │ Metric                        │ Type                   │ Transaction   │   Avg │   Min │   Max │ Unit   │
╞═════════╪═══════════════════════════════╪════════════════════════╪═══════════════╪═══════╪═══════╪═══════╪════════╡
│ 17.4.0  │ Read - PCIe Stall             │ PCIe Stall             │ Read          │  0.00 │  0.00 │  0.00 │ Pct    │
├─────────┼───────────────────────────────┼────────────────────────┼───────────────┼───────┼───────┼───────┼────────┤
│ 17.4.1  │ Read - Infinity Fabric™ Stall │ Infinity Fabric™ Stall │ Read          │  0.00 │  0.00 │  0.00 │ Pct    │
├─────────┼───────────────────────────────┼────────────────────────┼───────────────┼───────┼───────┼───────┼────────┤
│ 17.4.2  │ Read - HBM Stall              │ HBM Stall              │ Read          │  0.07 │  0.07 │  0.07 │ Pct    │
╘═════════╧═══════════════════════════════╧════════════════════════╧═══════════════╧═══════╧═══════╧═══════╧════════╛
17.5 L2 - Fabric Detailed Transaction Breakdown
╒═════════╤═════════════════╤══════════════╤══════════════╤══════════════╤════════════════╕
│ Index   │ Metric          │          Avg │          Min │          Max │ Unit           │
╞═════════╪═════════════════╪══════════════╪══════════════╪══════════════╪════════════════╡
│ 17.5.0  │ Read (32B)      │         0.00 │         0.00 │         0.00 │ Req per kernel │
├─────────┼─────────────────┼──────────────┼──────────────┼──────────────┼────────────────┤
│ 17.5.1  │ Read (Uncached) │      1450.00 │      1450.00 │      1450.00 │ Req per kernel │
├─────────┼─────────────────┼──────────────┼──────────────┼──────────────┼────────────────┤
│ 17.5.2  │ Read (64B)      │ 671053573.00 │ 671053573.00 │ 671053573.00 │ Req per kernel │
├─────────┼─────────────────┼──────────────┼──────────────┼──────────────┼────────────────┤
│ 17.5.3  │ HBM Read        │ 671053565.00 │ 671053565.00 │ 671053565.00 │ Req per kernel │
├─────────┼─────────────────┼──────────────┼──────────────┼──────────────┼────────────────┤
│ 17.5.4  │ Remote Read     │         8.00 │         8.00 │         8.00 │ Req per kernel │
╘═════════╧═════════════════╧══════════════╧══════════════╧══════════════╧════════════════╛

Here, you can make the following observations.

The vast majority of L2-Fabric requests (>99%) are 64B read requests (17.5.2).
Nearly 100% of the read requests (17.2.1) are homed in on the accelerator-local HBM (17.5.3), while some small fraction of these reads are routed to a “remote” device (17.5.4).
These drive a $\sim40$GiB per kernel read-bandwidth (17.2.0).

In addition, we see a small amount of uncached reads (17.5.1), these correspond to things like:

The assembly code to execute the kernel
Kernel arguments
Coordinate parameters (such as blockDim.z) that were not initialized by the hardware, etc. and may account for some of our “remote” read requests (17.5.4), for example, reading from CPU DRAM

The above list is not exhaustive, nor are all of these guaranteed to be “uncached” – the exact implementation depends on the accelerator and ROCm versions used. These read requests could be interrogated further in the Scalar L1 Data Cache and Instruction Cache metric sections.

Note

The Traffic metrics in Sec 17.2 are presented as a percentage of the total number of requests. For example, “HBM Read Traffic” is the percent of read requests (17.5.0 - 17.5.2) that were directed to the accelerators’ local HBM (17.5.3).

Experiment 2: Fine-grained, accelerator-local HBM reads#

In this experiment, we change the granularity of our device-allocation to be fine-grained device memory, local to the current accelerator. Our code uses the hipExtMallocWithFlag API with the hipDeviceMallocFinegrained flag to accomplish this.

Note

On some systems (such as those with only PCIe® connected accelerators), you need to set the environment variable HSA_FORCE_FINE_GRAIN_PCIE=1 to enable this memory type.

$ omniperf profile -n fine_grained_local --no-roof -- ./fabric -t 0 -o 0
Using:
  mtype:FineGrained
  mowner:Device
  mspace:Global
  mop:Read
  mdata:Unsigned
  remoteId:-1
<...>
$ omniperf analyze -p workloads/fine_grained_local/mi200 -b 17.2.0 17.2.1 17.2.2 17.2.3 17.4.0 17.4.1 17.4.2 17.5.0 17.5.1 17.5.2 17.5.3 17.5.4  -n per_kernel --dispatch 2
<...>
17. L2 Cache
17.2 L2 - Fabric Transactions
╒═════════╤═══════════════════════╤════════════════╤════════════════╤════════════════╤══════════════════╕
│ Index   │ Metric                │            Avg │            Min │            Max │ Unit             │
╞═════════╪═══════════════════════╪════════════════╪════════════════╪════════════════╪══════════════════╡
│ 17.2.0  │ L2-Fabric Read BW     │ 42948661824.00 │ 42948661824.00 │ 42948661824.00 │ Bytes per kernel │
├─────────┼───────────────────────┼────────────────┼────────────────┼────────────────┼──────────────────┤
│ 17.2.1  │ HBM Read Traffic      │         100.00 │         100.00 │         100.00 │ Pct              │
├─────────┼───────────────────────┼────────────────┼────────────────┼────────────────┼──────────────────┤
│ 17.2.2  │ Remote Read Traffic   │           0.00 │           0.00 │           0.00 │ Pct              │
├─────────┼───────────────────────┼────────────────┼────────────────┼────────────────┼──────────────────┤
│ 17.2.3  │ Uncached Read Traffic │           0.00 │           0.00 │           0.00 │ Pct              │
╘═════════╧═══════════════════════╧════════════════╧════════════════╧════════════════╧══════════════════╛
17.4 L2 - Fabric Interface Stalls
╒═════════╤═══════════════════════════════╤════════════════════════╤═══════════════╤═══════╤═══════╤═══════╤════════╕
│ Index   │ Metric                        │ Type                   │ Transaction   │   Avg │   Min │   Max │ Unit   │
╞═════════╪═══════════════════════════════╪════════════════════════╪═══════════════╪═══════╪═══════╪═══════╪════════╡
│ 17.4.0  │ Read - PCIe Stall             │ PCIe Stall             │ Read          │  0.00 │  0.00 │  0.00 │ Pct    │
├─────────┼───────────────────────────────┼────────────────────────┼───────────────┼───────┼───────┼───────┼────────┤
│ 17.4.1  │ Read - Infinity Fabric™ Stall │ Infinity Fabric™ Stall │ Read          │  0.00 │  0.00 │  0.00 │ Pct    │
├─────────┼───────────────────────────────┼────────────────────────┼───────────────┼───────┼───────┼───────┼────────┤
│ 17.4.2  │ Read - HBM Stall              │ HBM Stall              │ Read          │  0.07 │  0.07 │  0.07 │ Pct    │
╘═════════╧═══════════════════════════════╧════════════════════════╧═══════════════╧═══════╧═══════╧═══════╧════════╛
17.5 L2 - Fabric Detailed Transaction Breakdown
╒═════════╤═════════════════╤══════════════╤══════════════╤══════════════╤════════════════╕
│ Index   │ Metric          │          Avg │          Min │          Max │ Unit           │
╞═════════╪═════════════════╪══════════════╪══════════════╪══════════════╪════════════════╡
│ 17.5.0  │ Read (32B)      │         0.00 │         0.00 │         0.00 │ Req per kernel │
├─────────┼─────────────────┼──────────────┼──────────────┼──────────────┼────────────────┤
│ 17.5.1  │ Read (Uncached) │      1334.00 │      1334.00 │      1334.00 │ Req per kernel │
├─────────┼─────────────────┼──────────────┼──────────────┼──────────────┼────────────────┤
│ 17.5.2  │ Read (64B)      │ 671072841.00 │ 671072841.00 │ 671072841.00 │ Req per kernel │
├─────────┼─────────────────┼──────────────┼──────────────┼──────────────┼────────────────┤
│ 17.5.3  │ HBM Read        │ 671072835.00 │ 671072835.00 │ 671072835.00 │ Req per kernel │
├─────────┼─────────────────┼──────────────┼──────────────┼──────────────┼────────────────┤
│ 17.5.4  │ Remote Read     │         6.00 │         6.00 │         6.00 │ Req per kernel │
╘═════════╧═════════════════╧══════════════╧══════════════╧══════════════╧════════════════╛

Comparing with our previous example, we see a relatively similar result, namely:

The vast majority of L2-Fabric requests are 64B read requests (17.5.2)
Nearly all these read requests are directed to the accelerator-local HBM (17.2.1)

In addition, we now see a small percentage of HBM Read Stalls (17.4.2), as streaming fine-grained memory is putting more stress on Infinity Fabric.

Note

The stalls in Sec 17.4 are presented as a percentage of the total number active L2 cycles, summed over all L2 channels.

Experiment 3: Fine-grained, remote-accelerator HBM reads#

In this experiment, we move our fine-grained allocation to be owned by a remote accelerator. We accomplish this by first changing the HIP device using, for instance, the hipSetDevice(1) API, then allocating fine-grained memory (as described previously), and finally resetting the device back to the default, for instance, hipSetDevice(0).

Although we have not changed our code significantly, we do see a substantial change in the L2-Fabric metrics:

$ omniperf profile -n fine_grained_remote --no-roof -- ./fabric -t 0 -o 2
Using:
  mtype:FineGrained
  mowner:Remote
  mspace:Global
  mop:Read
  mdata:Unsigned
  remoteId:-1
<...>
$ omniperf analyze -p workloads/fine_grained_remote/mi200 -b 17.2.0 17.2.1 17.2.2 17.2.3 17.4.0 17.4.1 17.4.2 17.5.0 17.5.1 17.5.2 17.5.3 17.5.4  -n per_kernel --dispatch 2
<...>
17. L2 Cache
17.2 L2 - Fabric Transactions
╒═════════╤═══════════════════════╤════════════════╤════════════════╤════════════════╤══════════════════╕
│ Index   │ Metric                │            Avg │            Min │            Max │ Unit             │
╞═════════╪═══════════════════════╪════════════════╪════════════════╪════════════════╪══════════════════╡
│ 17.2.0  │ L2-Fabric Read BW     │ 42949692736.00 │ 42949692736.00 │ 42949692736.00 │ Bytes per kernel │
├─────────┼───────────────────────┼────────────────┼────────────────┼────────────────┼──────────────────┤
│ 17.2.1  │ HBM Read Traffic      │           0.00 │           0.00 │           0.00 │ Pct              │
├─────────┼───────────────────────┼────────────────┼────────────────┼────────────────┼──────────────────┤
│ 17.2.2  │ Remote Read Traffic   │         100.00 │         100.00 │         100.00 │ Pct              │
├─────────┼───────────────────────┼────────────────┼────────────────┼────────────────┼──────────────────┤
│ 17.2.3  │ Uncached Read Traffic │         200.00 │         200.00 │         200.00 │ Pct              │
╘═════════╧═══════════════════════╧════════════════╧════════════════╧════════════════╧══════════════════╛
17.4 L2 - Fabric Interface Stalls
╒═════════╤═══════════════════════════════╤════════════════════════╤═══════════════╤═══════╤═══════╤═══════╤════════╕
│ Index   │ Metric                        │ Type                   │ Transaction   │   Avg │   Min │   Max │ Unit   │
╞═════════╪═══════════════════════════════╪════════════════════════╪═══════════════╪═══════╪═══════╪═══════╪════════╡
│ 17.4.0  │ Read - PCIe Stall             │ PCIe Stall             │ Read          │  0.00 │  0.00 │  0.00 │ Pct    │
├─────────┼───────────────────────────────┼────────────────────────┼───────────────┼───────┼───────┼───────┼────────┤
│ 17.4.1  │ Read - Infinity Fabric™ Stall │ Infinity Fabric™ Stall │ Read          │ 17.85 │ 17.85 │ 17.85 │ Pct    │
├─────────┼───────────────────────────────┼────────────────────────┼───────────────┼───────┼───────┼───────┼────────┤
│ 17.4.2  │ Read - HBM Stall              │ HBM Stall              │ Read          │  0.00 │  0.00 │  0.00 │ Pct    │
╘═════════╧═══════════════════════════════╧════════════════════════╧═══════════════╧═══════╧═══════╧═══════╧════════╛
17.5 L2 - Fabric Detailed Transaction Breakdown
╒═════════╤═════════════════╤═══════════════╤═══════════════╤═══════════════╤════════════════╕
│ Index   │ Metric          │           Avg │           Min │           Max │ Unit           │
╞═════════╪═════════════════╪═══════════════╪═══════════════╪═══════════════╪════════════════╡
│ 17.5.0  │ Read (32B)      │          0.00 │          0.00 │          0.00 │ Req per kernel │
├─────────┼─────────────────┼───────────────┼───────────────┼───────────────┼────────────────┤
│ 17.5.1  │ Read (Uncached) │ 1342177894.00 │ 1342177894.00 │ 1342177894.00 │ Req per kernel │
├─────────┼─────────────────┼───────────────┼───────────────┼───────────────┼────────────────┤
│ 17.5.2  │ Read (64B)      │  671088949.00 │  671088949.00 │  671088949.00 │ Req per kernel │
├─────────┼─────────────────┼───────────────┼───────────────┼───────────────┼────────────────┤
│ 17.5.3  │ HBM Read        │        307.00 │        307.00 │        307.00 │ Req per kernel │
├─────────┼─────────────────┼───────────────┼───────────────┼───────────────┼────────────────┤
│ 17.5.4  │ Remote Read     │  671088642.00 │  671088642.00 │  671088642.00 │ Req per kernel │
╘═════════╧═════════════════╧═══════════════╧═══════════════╧═══════════════╧════════════════╛

First, we see that while we still observe approximately the same number of 64B Read Requests (17.5.2), we now see an even larger number of Uncached Read Requests (17.5.3). Some simple division reveals:

\[342177894.00 / 671088949.00 ≈ 2\]

That is, each 64B Read Request is also counted as two Uncached Read Requests, as reflected in the request-flow diagram. This is also why the Uncached Read Traffic metric (17.2.3) is at the counter-intuitive value of 200%!

In addition, observe that:

We no longer see any significant number of HBM Read Requests (17.2.1, 17.5.3), nor HBM Read Stalls (17.4.2), but instead,
we see that almost all of these requests are considered “remote” (17.2.2, 17.5.4) are being routed to another accelerator, or the CPU — in this case HIP Device 1 — and,
we see a significantly larger percentage of AMD Infinity Fabric Read Stalls (17.4.1) as compared to the HBM Read Stalls in the previous example.

These stalls correspond to reads that are going out over the AMD Infinity Fabric connection to another MI250 accelerator. In addition, because these are crossing between accelerators, we expect significantly lower achievable bandwidths as compared to the local accelerator’s HBM – this is reflected (indirectly) in the magnitude of the stall metric (17.4.1). Finally, we note that if our system contained only PCIe connected accelerators, these observations will differ.

Experiment 4: Fine-grained, CPU-DRAM reads#

In this experiment, we move our fine-grained allocation to be owned by the CPU’s DRAM. We accomplish this by allocating host-pinned fine-grained memory using the hipHostMalloc API:

$ omniperf profile -n fine_grained_host --no-roof -- ./fabric -t 0 -o 1
Using:
  mtype:FineGrained
  mowner:Host
  mspace:Global
  mop:Read
  mdata:Unsigned
  remoteId:-1
<...>
$ omniperf analyze -p workloads/fine_grained_host/mi200 -b 17.2.0 17.2.1 17.2.2 17.2.3 17.4.0 17.4.1 17.4.2 17.5.0 17.5.1 17.5.2 17.5.3 17.5.4  -n per_kernel --dispatch 2
<...>
17. L2 Cache
17.2 L2 - Fabric Transactions
╒═════════╤═══════════════════════╤════════════════╤════════════════╤════════════════╤══════════════════╕
│ Index   │ Metric                │            Avg │            Min │            Max │ Unit             │
╞═════════╪═══════════════════════╪════════════════╪════════════════╪════════════════╪══════════════════╡
│ 17.2.0  │ L2-Fabric Read BW     │ 42949691264.00 │ 42949691264.00 │ 42949691264.00 │ Bytes per kernel │
├─────────┼───────────────────────┼────────────────┼────────────────┼────────────────┼──────────────────┤
│ 17.2.1  │ HBM Read Traffic      │           0.00 │           0.00 │           0.00 │ Pct              │
├─────────┼───────────────────────┼────────────────┼────────────────┼────────────────┼──────────────────┤
│ 17.2.2  │ Remote Read Traffic   │         100.00 │         100.00 │         100.00 │ Pct              │
├─────────┼───────────────────────┼────────────────┼────────────────┼────────────────┼──────────────────┤
│ 17.2.3  │ Uncached Read Traffic │         200.00 │         200.00 │         200.00 │ Pct              │
╘═════════╧═══════════════════════╧════════════════╧════════════════╧════════════════╧══════════════════╛
17.4 L2 - Fabric Interface Stalls
╒═════════╤═══════════════════════════════╤════════════════════════╤═══════════════╤═══════╤═══════╤═══════╤════════╕
│ Index   │ Metric                        │ Type                   │ Transaction   │   Avg │   Min │   Max │ Unit   │
╞═════════╪═══════════════════════════════╪════════════════════════╪═══════════════╪═══════╪═══════╪═══════╪════════╡
│ 17.4.0  │ Read - PCIe Stall             │ PCIe Stall             │ Read          │ 91.29 │ 91.29 │ 91.29 │ Pct    │
├─────────┼───────────────────────────────┼────────────────────────┼───────────────┼───────┼───────┼───────┼────────┤
│ 17.4.1  │ Read - Infinity Fabric™ Stall │ Infinity Fabric™ Stall │ Read          │  0.00 │  0.00 │  0.00 │ Pct    │
├─────────┼───────────────────────────────┼────────────────────────┼───────────────┼───────┼───────┼───────┼────────┤
│ 17.4.2  │ Read - HBM Stall              │ HBM Stall              │ Read          │  0.00 │  0.00 │  0.00 │ Pct    │
╘═════════╧═══════════════════════════════╧════════════════════════╧═══════════════╧═══════╧═══════╧═══════╧════════╛
17.5 L2 - Fabric Detailed Transaction Breakdown
╒═════════╤═════════════════╤═══════════════╤═══════════════╤═══════════════╤════════════════╕
│ Index   │ Metric          │           Avg │           Min │           Max │ Unit           │
╞═════════╪═════════════════╪═══════════════╪═══════════════╪═══════════════╪════════════════╡
│ 17.5.0  │ Read (32B)      │          0.00 │          0.00 │          0.00 │ Req per kernel │
├─────────┼─────────────────┼───────────────┼───────────────┼───────────────┼────────────────┤
│ 17.5.1  │ Read (Uncached) │ 1342177848.00 │ 1342177848.00 │ 1342177848.00 │ Req per kernel │
├─────────┼─────────────────┼───────────────┼───────────────┼───────────────┼────────────────┤
│ 17.5.2  │ Read (64B)      │  671088926.00 │  671088926.00 │  671088926.00 │ Req per kernel │
├─────────┼─────────────────┼───────────────┼───────────────┼───────────────┼────────────────┤
│ 17.5.3  │ HBM Read        │        284.00 │        284.00 │        284.00 │ Req per kernel │
├─────────┼─────────────────┼───────────────┼───────────────┼───────────────┼────────────────┤
│ 17.5.4  │ Remote Read     │  671088642.00 │  671088642.00 │  671088642.00 │ Req per kernel │
╘═════════╧═════════════════╧═══════════════╧═══════════════╧═══════════════╧════════════════╛

Here we see almost the same results as in the previous experiment, however now as we are crossing a PCIe bus to the CPU, we see that the Infinity Fabric Read stalls (17.4.1) have shifted to be a PCIe stall (17.4.2). In addition, as (on this system) the PCIe bus has a lower peak bandwidth than the AMD Infinity Fabric connection between two accelerators, we once again observe an increase in the percentage of stalls on this interface.

Note

Had we performed this same experiment on an MI250X system, these transactions would again have been marked as Infinity Fabric Read stalls (17.4.1), as the CPU is connected to the accelerator via AMD Infinity Fabric.

Experiment 5: Coarse-grained, CPU-DRAM reads#

In our next fabric experiment, we change our CPU memory allocation to be coarse-grained. We accomplish this by passing the hipHostMalloc API the hipHostMallocNonCoherent flag, to mark the allocation as coarse-grained:

$ omniperf profile -n coarse_grained_host --no-roof -- ./fabric -t 1 -o 1
Using:
  mtype:CoarseGrained
  mowner:Host
  mspace:Global
  mop:Read
  mdata:Unsigned
  remoteId:-1
<...>
$ omniperf analyze -p workloads/coarse_grained_host/mi200 -b 17.2.0 17.2.1 17.2.2 17.2.3 17.4.0 17.4.1 17.4.2 17.5.0 17.5.1 17.5.2 17.5.3 17.5.4  -n per_kernel --dispatch 2
<...>
17. L2 Cache
17.2 L2 - Fabric Transactions
╒═════════╤═══════════════════════╤════════════════╤════════════════╤════════════════╤══════════════════╕
│ Index   │ Metric                │            Avg │            Min │            Max │ Unit             │
╞═════════╪═══════════════════════╪════════════════╪════════════════╪════════════════╪══════════════════╡
│ 17.2.0  │ L2-Fabric Read BW     │ 42949691264.00 │ 42949691264.00 │ 42949691264.00 │ Bytes per kernel │
├─────────┼───────────────────────┼────────────────┼────────────────┼────────────────┼──────────────────┤
│ 17.2.1  │ HBM Read Traffic      │           0.00 │           0.00 │           0.00 │ Pct              │
├─────────┼───────────────────────┼────────────────┼────────────────┼────────────────┼──────────────────┤
│ 17.2.2  │ Remote Read Traffic   │         100.00 │         100.00 │         100.00 │ Pct              │
├─────────┼───────────────────────┼────────────────┼────────────────┼────────────────┼──────────────────┤
│ 17.2.3  │ Uncached Read Traffic │           0.00 │           0.00 │           0.00 │ Pct              │
╘═════════╧═══════════════════════╧════════════════╧════════════════╧════════════════╧══════════════════╛
17.4 L2 - Fabric Interface Stalls
╒═════════╤═══════════════════════════════╤════════════════════════╤═══════════════╤═══════╤═══════╤═══════╤════════╕
│ Index   │ Metric                        │ Type                   │ Transaction   │   Avg │   Min │   Max │ Unit   │
╞═════════╪═══════════════════════════════╪════════════════════════╪═══════════════╪═══════╪═══════╪═══════╪════════╡
│ 17.4.0  │ Read - PCIe Stall             │ PCIe Stall             │ Read          │ 91.27 │ 91.27 │ 91.27 │ Pct    │
├─────────┼───────────────────────────────┼────────────────────────┼───────────────┼───────┼───────┼───────┼────────┤
│ 17.4.1  │ Read - Infinity Fabric™ Stall │ Infinity Fabric™ Stall │ Read          │  0.00 │  0.00 │  0.00 │ Pct    │
├─────────┼───────────────────────────────┼────────────────────────┼───────────────┼───────┼───────┼───────┼────────┤
│ 17.4.2  │ Read - HBM Stall              │ HBM Stall              │ Read          │  0.00 │  0.00 │  0.00 │ Pct    │
╘═════════╧═══════════════════════════════╧════════════════════════╧═══════════════╧═══════╧═══════╧═══════╧════════╛
17.5 L2 - Fabric Detailed Transaction Breakdown
╒═════════╤═════════════════╤══════════════╤══════════════╤══════════════╤════════════════╕
│ Index   │ Metric          │          Avg │          Min │          Max │ Unit           │
╞═════════╪═════════════════╪══════════════╪══════════════╪══════════════╪════════════════╡
│ 17.5.0  │ Read (32B)      │         0.00 │         0.00 │         0.00 │ Req per kernel │
├─────────┼─────────────────┼──────────────┼──────────────┼──────────────┼────────────────┤
│ 17.5.1  │ Read (Uncached) │       562.00 │       562.00 │       562.00 │ Req per kernel │
├─────────┼─────────────────┼──────────────┼──────────────┼──────────────┼────────────────┤
│ 17.5.2  │ Read (64B)      │ 671088926.00 │ 671088926.00 │ 671088926.00 │ Req per kernel │
├─────────┼─────────────────┼──────────────┼──────────────┼──────────────┼────────────────┤
│ 17.5.3  │ HBM Read        │       281.00 │       281.00 │       281.00 │ Req per kernel │
├─────────┼─────────────────┼──────────────┼──────────────┼──────────────┼────────────────┤
│ 17.5.4  │ Remote Read     │ 671088645.00 │ 671088645.00 │ 671088645.00 │ Req per kernel │
╘═════════╧═════════════════╧══════════════╧══════════════╧══════════════╧════════════════╛

Here we see a similar result to our previous experiment, with one key difference: our accesses are no longer marked as Uncached Read requests (17.2.3, 17.5.1), but instead are 64B read requests (17.5.2), as observed in our Coarse-grained, accelerator-local HBM experiment.

Experiment 6: Fine-grained, CPU-DRAM writes#

Thus far in our exploration of the L2-Fabric interface, we have primarily focused on read operations. However, in our request flow diagram, we note that writes are counted separately. To observe this, we use the -p flag to trigger write operations to fine-grained memory allocated on the host:

$ omniperf profile -n fine_grained_host_write --no-roof -- ./fabric -t 0 -o 1 -p 1
Using:
  mtype:FineGrained
  mowner:Host
  mspace:Global
  mop:Write
  mdata:Unsigned
  remoteId:-1
<...>
$ omniperf analyze -p workloads/fine_grained_host_writes/mi200 -b 17.2.4 17.2.5 17.2.6 17.2.7 17.2.8 17.4.3 17.4.4 17.4.5 17.4.6 17.5.5 17.5.6 17.5.7 17.5.8 17.5.9 17.5.10 -n per_kernel --dispatch 2
<...>
17. L2 Cache
17.2 L2 - Fabric Transactions
╒═════════╤═══════════════════════════════════╤════════════════╤════════════════╤════════════════╤══════════════════╕
│ Index   │ Metric                            │            Avg │            Min │            Max │ Unit             │
╞═════════╪═══════════════════════════════════╪════════════════╪════════════════╪════════════════╪══════════════════╡
│ 17.2.4  │ L2-Fabric Write and Atomic BW     │ 42949672960.00 │ 42949672960.00 │ 42949672960.00 │ Bytes per kernel │
├─────────┼───────────────────────────────────┼────────────────┼────────────────┼────────────────┼──────────────────┤
│ 17.2.5  │ HBM Write and Atomic Traffic      │           0.00 │           0.00 │           0.00 │ Pct              │
├─────────┼───────────────────────────────────┼────────────────┼────────────────┼────────────────┼──────────────────┤
│ 17.2.6  │ Remote Write and Atomic Traffic   │         100.00 │         100.00 │         100.00 │ Pct              │
├─────────┼───────────────────────────────────┼────────────────┼────────────────┼────────────────┼──────────────────┤
│ 17.2.7  │ Atomic Traffic                    │           0.00 │           0.00 │           0.00 │ Pct              │
├─────────┼───────────────────────────────────┼────────────────┼────────────────┼────────────────┼──────────────────┤
│ 17.2.8  │ Uncached Write and Atomic Traffic │         100.00 │         100.00 │         100.00 │ Pct              │
╘═════════╧═══════════════════════════════════╧════════════════╧════════════════╧════════════════╧══════════════════╛
17.4 L2 - Fabric Interface Stalls
╒═════════╤════════════════════════════════╤════════════════════════╤═══════════════╤═══════╤═══════╤═══════╤════════╕
│ Index   │ Metric                         │ Type                   │ Transaction   │   Avg │   Min │   Max │ Unit   │
╞═════════╪════════════════════════════════╪════════════════════════╪═══════════════╪═══════╪═══════╪═══════╪════════╡
│ 17.4.3  │ Write - PCIe Stall             │ PCIe Stall             │ Write         │  0.00 │  0.00 │  0.00 │ Pct    │
├─────────┼────────────────────────────────┼────────────────────────┼───────────────┼───────┼───────┼───────┼────────┤
│ 17.4.4  │ Write - Infinity Fabric™ Stall │ Infinity Fabric™ Stall │ Write         │  0.00 │  0.00 │  0.00 │ Pct    │
├─────────┼────────────────────────────────┼────────────────────────┼───────────────┼───────┼───────┼───────┼────────┤
│ 17.4.5  │ Write - HBM Stall              │ HBM Stall              │ Write         │  0.00 │  0.00 │  0.00 │ Pct    │
├─────────┼────────────────────────────────┼────────────────────────┼───────────────┼───────┼───────┼───────┼────────┤
│ 17.4.6  │ Write - Credit Starvation      │ Credit Starvation      │ Write         │  0.00 │  0.00 │  0.00 │ Pct    │
╘═════════╧════════════════════════════════╧════════════════════════╧═══════════════╧═══════╧═══════╧═══════╧════════╛
17.5 L2 - Fabric Detailed Transaction Breakdown
╒═════════╤═════════════════════════╤══════════════╤══════════════╤══════════════╤════════════════╕
│ Index   │ Metric                  │          Avg │          Min │          Max │ Unit           │
╞═════════╪═════════════════════════╪══════════════╪══════════════╪══════════════╪════════════════╡
│ 17.5.5  │ Write (32B)             │         0.00 │         0.00 │         0.00 │ Req per kernel │
├─────────┼─────────────────────────┼──────────────┼──────────────┼──────────────┼────────────────┤
│ 17.5.6  │ Write (Uncached)        │ 671088640.00 │ 671088640.00 │ 671088640.00 │ Req per kernel │
├─────────┼─────────────────────────┼──────────────┼──────────────┼──────────────┼────────────────┤
│ 17.5.7  │ Write (64B)             │ 671088640.00 │ 671088640.00 │ 671088640.00 │ Req per kernel │
├─────────┼─────────────────────────┼──────────────┼──────────────┼──────────────┼────────────────┤
│ 17.5.8  │ HBM Write and Atomic    │         0.00 │         0.00 │         0.00 │ Req per kernel │
├─────────┼─────────────────────────┼──────────────┼──────────────┼──────────────┼────────────────┤
│ 17.5.9  │ Remote Write and Atomic │ 671088640.00 │ 671088640.00 │ 671088640.00 │ Req per kernel │
├─────────┼─────────────────────────┼──────────────┼──────────────┼──────────────┼────────────────┤
│ 17.5.10 │ Atomic                  │         0.00 │         0.00 │         0.00 │ Req per kernel │
╘═════════╧═════════════════════════╧══════════════╧══════════════╧══════════════╧════════════════╛

Here we notice a few changes in our request pattern:

As expected, the requests have changed from 64B Reads to 64B Write requests (17.5.7),
these requests are homed in on a “remote” destination (17.2.6, 17.5.9), as expected, and
these are also counted as a single Uncached Write request (17.5.6).

In addition, there are rather significant changes in the bandwidth values reported:

The “L2-Fabric Write and Atomic” bandwidth metric (17.2.4) reports about 40GiB of data written across Infinity Fabric while
The “Remote Write and Traffic” metric (17.2.5) indicates that nearly 100% of these request are being directed to a remote source.

The precise meaning of these metrics are explored in the subsequent experiment.

Finally, we note that we see no write stalls on the PCIe bus (17.4.3). This is because writes over a PCIe bus are non-posted, that is, they do not require acknowledgement.

Experiment 7: Fine-grained, CPU-DRAM atomicAdd#

Next, we change our experiment to instead target atomicAdd operations to the CPU’s DRAM.

$ omniperf profile -n fine_grained_host_add --no-roof -- ./fabric -t 0 -o 1 -p 2
Using:
  mtype:FineGrained
  mowner:Host
  mspace:Global
  mop:Add
  mdata:Unsigned
  remoteId:-1
<...>
$ omniperf analyze -p workloads/fine_grained_host_add/mi200 -b 17.2.4 17.2.5 17.2.6 17.2.7 17.2.8 17.4.3 17.4.4 17.4.5 17.4.6 17.5.5 17.5.6 17.5.7 17.5.8 17.5.9 17.5.10 -n per_kernel --dispatch 2
<...>
17. L2 Cache
17.2 L2 - Fabric Transactions
╒═════════╤═══════════════════════════════════╤══════════════╤══════════════╤══════════════╤══════════════════╕
│ Index   │ Metric                            │          Avg │          Min │          Max │ Unit             │
╞═════════╪═══════════════════════════════════╪══════════════╪══════════════╪══════════════╪══════════════════╡
│ 17.2.4  │ L2-Fabric Write and Atomic BW     │ 429496736.00 │ 429496736.00 │ 429496736.00 │ Bytes per kernel │
├─────────┼───────────────────────────────────┼──────────────┼──────────────┼──────────────┼──────────────────┤
│ 17.2.5  │ HBM Write and Atomic Traffic      │         0.00 │         0.00 │         0.00 │ Pct              │
├─────────┼───────────────────────────────────┼──────────────┼──────────────┼──────────────┼──────────────────┤
│ 17.2.6  │ Remote Write and Atomic Traffic   │       100.00 │       100.00 │       100.00 │ Pct              │
├─────────┼───────────────────────────────────┼──────────────┼──────────────┼──────────────┼──────────────────┤
│ 17.2.7  │ Atomic Traffic                    │       100.00 │       100.00 │       100.00 │ Pct              │
├─────────┼───────────────────────────────────┼──────────────┼──────────────┼──────────────┼──────────────────┤
│ 17.2.8  │ Uncached Write and Atomic Traffic │       100.00 │       100.00 │       100.00 │ Pct              │
╘═════════╧═══════════════════════════════════╧══════════════╧══════════════╧══════════════╧══════════════════╛
17.4 L2 - Fabric Interface Stalls
╒═════════╤════════════════════════════════╤════════════════════════╤═══════════════╤═══════╤═══════╤═══════╤════════╕
│ Index   │ Metric                         │ Type                   │ Transaction   │   Avg │   Min │   Max │ Unit   │
╞═════════╪════════════════════════════════╪════════════════════════╪═══════════════╪═══════╪═══════╪═══════╪════════╡
│ 17.4.3  │ Write - PCIe Stall             │ PCIe Stall             │ Write         │  0.00 │  0.00 │  0.00 │ Pct    │
├─────────┼────────────────────────────────┼────────────────────────┼───────────────┼───────┼───────┼───────┼────────┤
│ 17.4.4  │ Write - Infinity Fabric™ Stall │ Infinity Fabric™ Stall │ Write         │  0.00 │  0.00 │  0.00 │ Pct    │
├─────────┼────────────────────────────────┼────────────────────────┼───────────────┼───────┼───────┼───────┼────────┤
│ 17.4.5  │ Write - HBM Stall              │ HBM Stall              │ Write         │  0.00 │  0.00 │  0.00 │ Pct    │
├─────────┼────────────────────────────────┼────────────────────────┼───────────────┼───────┼───────┼───────┼────────┤
│ 17.4.6  │ Write - Credit Starvation      │ Credit Starvation      │ Write         │  0.00 │  0.00 │  0.00 │ Pct    │
╘═════════╧════════════════════════════════╧════════════════════════╧═══════════════╧═══════╧═══════╧═══════╧════════╛
17.5 L2 - Fabric Detailed Transaction Breakdown
╒═════════╤═════════════════════════╤═════════════╤═════════════╤═════════════╤════════════════╕
│ Index   │ Metric                  │         Avg │         Min │         Max │ Unit           │
╞═════════╪═════════════════════════╪═════════════╪═════════════╪═════════════╪════════════════╡
│ 17.5.5  │ Write (32B)             │ 13421773.00 │ 13421773.00 │ 13421773.00 │ Req per kernel │
├─────────┼─────────────────────────┼─────────────┼─────────────┼─────────────┼────────────────┤
│ 17.5.6  │ Write (Uncached)        │ 13421773.00 │ 13421773.00 │ 13421773.00 │ Req per kernel │
├─────────┼─────────────────────────┼─────────────┼─────────────┼─────────────┼────────────────┤
│ 17.5.7  │ Write (64B)             │        0.00 │        0.00 │        0.00 │ Req per kernel │
├─────────┼─────────────────────────┼─────────────┼─────────────┼─────────────┼────────────────┤
│ 17.5.8  │ HBM Write and Atomic    │        0.00 │        0.00 │        0.00 │ Req per kernel │
├─────────┼─────────────────────────┼─────────────┼─────────────┼─────────────┼────────────────┤
│ 17.5.9  │ Remote Write and Atomic │ 13421773.00 │ 13421773.00 │ 13421773.00 │ Req per kernel │
├─────────┼─────────────────────────┼─────────────┼─────────────┼─────────────┼────────────────┤
│ 17.5.10 │ Atomic                  │ 13421773.00 │ 13421773.00 │ 13421773.00 │ Req per kernel │
╘═════════╧═════════════════════════╧═════════════╧═════════════╧═════════════╧════════════════╛

In this case, there is quite a lot to unpack:

For the first time, the 32B Write requests (17.5.5) are heavily used.
These correspond to Atomic requests (17.2.7, 17.5.10), and are counted as Uncached Writes (17.5.6).
The L2-Fabric Write and Atomic bandwidth metric (17.2.4) shows about 0.4 GiB of traffic. For convenience, the sample reduces the default problem size for this case due to the speed of atomics across a PCIe bus, and finally,
The traffic is directed to a remote device (17.2.6, 17.5.9).

Let’s consider what an “atomic” request means in this context. Recall that we are discussing memory traffic flowing from the L2 cache, the device-wide coherence point on current CDNA accelerators such as the MI250, to for example, the CPU’s DRAM. In this light, we see that these requests correspond to system scope atomics, and specifically in the case of the MI250, to fine-grained memory.

Disclaimer

PCIe® is a registered trademark of PCI-SIG Corporation.

Vector memory operation counting#

Global / Generic (FLAT)#

For this example, consider the vector memory sample distributed as a part of Omniperf. This code launches many different versions of a simple read/write/atomic-only kernels targeting various address spaces. For example, below is our simple global_write kernel:

// write to a global pointer
__global__ void global_write(int* ptr, int zero) {
  ptr[threadIdx.x] = zero;
}

Note

This example was compiled and run on an MI250 accelerator using ROCm v5.6.0, and Omniperf v2.0.0.

$ hipcc -O3 --save-temps vmem.hip -o vmem

We have also chosen to include the --save-temps flag to save the compiler temporary files, such as the generated CDNA assembly code, for inspection.

Finally, we generate our omniperf profile as follows.

$ omniperf profile -n vmem --no-roof -- ./vmem

Design note#

This section explains some of the more peculiar lines of code in the example, for example, the use of compiler built-ins and explicit address space casting, and so forth.

// write to a generic pointer
typedef int __attribute__((address_space(0)))* generic_ptr;

__attribute__((noinline)) __device__ void generic_store(generic_ptr ptr, int zero) { *ptr = zero; }

__global__ void generic_write(int* ptr, int zero, int filter) {
  __shared__ int lds[1024];
  int* generic = (threadIdx.x < filter) ? &ptr[threadIdx.x] : &lds[threadIdx.x];
  generic_store((generic_ptr)generic, zero);
}

One of the aims of this example is to demonstrate the use of the “generic” FLAT address space. This address space is typically used when the compiler cannot statically prove where the backing memory is located.

To try to force the compiler to use this address space, we applied __attribute__((noinline)) to the generic_store function to have the compiler treat it as a function call (that is, on the other side of which, the address space may not be known). However, in a trivial example such as this, the compiler may choose to specialize the generic_store function to the two address spaces that might provably be used from our translation unit, that is, “local” (or, LDS) and “global”. Hence, we forcibly cast the address space to “generic” (or, FLAT) to avoid this compiler optimization.

Warning

While convenient for this example, this sort of explicit address space casting can lead to strange compilation errors, and in the worst case, incorrect results. As a result, use is discouraged in production code.

For more details on address spaces, refer to Memory spaces.

Global write#

First, we demonstrate our simple global_write kernel:

$ omniperf analyze -p workloads/vmem/mi200/ --dispatch 1 -b 10.3 15.1.4 15.1.5 15.1.6 15.1.7 15.1.8 15.1.9 15.1.10 15.1.11  -n per_kernel
<...>
--------------------------------------------------------------------------------
0. Top Stat
╒════╤═════════════════════════════════════╤═════════╤═══════════╤════════════╤══════════════╤════════╕
│    │ KernelName                          │   Count │   Sum(ns) │   Mean(ns) │   Median(ns) │    Pct │
╞════╪═════════════════════════════════════╪═════════╪═══════════╪════════════╪══════════════╪════════╡
│  0 │ global_write(int*, int) [clone .kd] │    1.00 │   2400.00 │    2400.00 │      2400.00 │ 100.00 │
╘════╧═════════════════════════════════════╧═════════╧═══════════╧════════════╧══════════════╧════════╛


--------------------------------------------------------------------------------
10. Compute Units - Instruction Mix
10.3 VMEM Instr Mix
╒═════════╤═══════════════════════╤═══════╤═══════╤═══════╤══════════════════╕
│ Index   │ Metric                │   Avg │   Min │   Max │ Unit             │
╞═════════╪═══════════════════════╪═══════╪═══════╪═══════╪══════════════════╡
│ 10.3.0  │ Global/Generic Instr  │  1.00 │  1.00 │  1.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.1  │ Global/Generic Read   │  0.00 │  0.00 │  0.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.2  │ Global/Generic Write  │  1.00 │  1.00 │  1.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.3  │ Global/Generic Atomic │  0.00 │  0.00 │  0.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.4  │ Spill/Stack Instr     │  0.00 │  0.00 │  0.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.5  │ Spill/Stack Read      │  0.00 │  0.00 │  0.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.6  │ Spill/Stack Write     │  0.00 │  0.00 │  0.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.7  │ Spill/Stack Atomic    │  0.00 │  0.00 │  0.00 │ Instr per kernel │
╘═════════╧═══════════════════════╧═══════╧═══════╧═══════╧══════════════════╛


--------------------------------------------------------------------------------
15. Address Processing Unit and Data Return Path (TA/TD)
15.1 Address Processing Unit
╒═════════╤═════════════════════════════╤═══════╤═══════╤═══════╤══════════════════╕
│ Index   │ Metric                      │   Avg │   Min │   Max │ Unit             │
╞═════════╪═════════════════════════════╪═══════╪═══════╪═══════╪══════════════════╡
│ 15.1.4  │ Total Instructions          │  1.00 │  1.00 │  1.00 │ Instr per kernel │
├─────────┼─────────────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 15.1.5  │ Global/Generic Instr        │  1.00 │  1.00 │  1.00 │ Instr per kernel │
├─────────┼─────────────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 15.1.6  │ Global/Generic Read Instr   │  0.00 │  0.00 │  0.00 │ Instr per kernel │
├─────────┼─────────────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 15.1.7  │ Global/Generic Write Instr  │  1.00 │  1.00 │  1.00 │ Instr per kernel │
├─────────┼─────────────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 15.1.8  │ Global/Generic Atomic Instr │  0.00 │  0.00 │  0.00 │ Instr per kernel │
├─────────┼─────────────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 15.1.9  │ Spill/Stack Instr           │  0.00 │  0.00 │  0.00 │ Instr per kernel │
├─────────┼─────────────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 15.1.10 │ Spill/Stack Read Instr      │  0.00 │  0.00 │  0.00 │ Instr per kernel │
├─────────┼─────────────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 15.1.11 │ Spill/Stack Write Instr     │  0.00 │  0.00 │  0.00 │ Instr per kernel │
╘═════════╧═════════════════════════════╧═══════╧═══════╧═══════╧══════════════════╛

Here, we have presented both the information in the VMEM Instruction Mix table (10.3) and the Address Processing Unit (15.1). We note that this data is expected to be identical, and hence we omit table 15.1 in our subsequent examples.

In addition, as expected, we see a single Global/Generic Write instruction (10.3.2, 15.1.7). Inspecting the generated assembly, we get:

        .protected      _Z12global_writePii     ; -- Begin function _Z12global_writePii
        .globl  _Z12global_writePii
        .p2align        8
        .type   _Z12global_writePii,@function
_Z12global_writePii:                    ; @_Z12global_writePii
; %bb.0:
        s_load_dword s2, s[4:5], 0x8
        s_load_dwordx2 s[0:1], s[4:5], 0x0
        v_lshlrev_b32_e32 v0, 2, v0
        s_waitcnt lgkmcnt(0)
        v_mov_b32_e32 v1, s2
        global_store_dword v0, v1, s[0:1]
        s_endpgm
        .section        .rodata,#alloc
        .p2align        6, 0x0
        .amdhsa_kernel _Z12global_writePii

Notice that this corresponds to an instance of a global_store_dword operation.

Note

The assembly in these experiments were generated for an MI2XX accelerator using ROCm 5.6.0, and may change depending on ROCm versions and the targeted hardware architecture.

Generic write to LDS#

Next, we examine a generic write. As discussed previously, our generic_write kernel uses an address space cast to force the compiler to choose our desired address space, regardless of other optimizations that may be possible.

Also note that the filter parameter passed in as a kernel argument (see example and design note) is set to zero on the host, such that we always write to the local (LDS) memory allocation lds.

Examining this kernel in the VMEM Instruction Mix table yields:

$ omniperf analyze -p workloads/vmem/mi200/ --dispatch 2 -b 10.3 -n per_kernel
<...>
0. Top Stat
╒════╤══════════════════════════════════════════╤═════════╤═══════════╤════════════╤══════════════╤════════╕
│    │ KernelName                               │   Count │   Sum(ns) │   Mean(ns) │   Median(ns) │    Pct │
╞════╪══════════════════════════════════════════╪═════════╪═══════════╪════════════╪══════════════╪════════╡
│  0 │ generic_write(int*, int, int) [clone .kd │    1.00 │   2880.00 │    2880.00 │      2880.00 │ 100.00 │
│    │ ]                                        │         │           │            │              │        │
╘════╧══════════════════════════════════════════╧═════════╧═══════════╧════════════╧══════════════╧════════╛


--------------------------------------------------------------------------------
10. Compute Units - Instruction Mix
10.3 VMEM Instr Mix
╒═════════╤═══════════════════════╤═══════╤═══════╤═══════╤══════════════════╕
│ Index   │ Metric                │   Avg │   Min │   Max │ Unit             │
╞═════════╪═══════════════════════╪═══════╪═══════╪═══════╪══════════════════╡
│ 10.3.0  │ Global/Generic Instr  │  1.00 │  1.00 │  1.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.1  │ Global/Generic Read   │  0.00 │  0.00 │  0.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.2  │ Global/Generic Write  │  1.00 │  1.00 │  1.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.3  │ Global/Generic Atomic │  0.00 │  0.00 │  0.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.4  │ Spill/Stack Instr     │  0.00 │  0.00 │  0.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.5  │ Spill/Stack Read      │  0.00 │  0.00 │  0.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.6  │ Spill/Stack Write     │  0.00 │  0.00 │  0.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.7  │ Spill/Stack Atomic    │  0.00 │  0.00 │  0.00 │ Instr per kernel │
╘═════════╧═══════════════════════╧═══════╧═══════╧═══════╧══════════════════╛

As expected we see a single generic write (10.3.2). In the assembly generated for this kernel (in particular, we care about the generic_store function), we see that this corresponds to a flat_store_dword instruction:

        .type   _Z13generic_storePii,@function
_Z13generic_storePii:                   ; @_Z13generic_storePii
; %bb.0:
        s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
        flat_store_dword v[0:1], v2
        s_waitcnt vmcnt(0) lgkmcnt(0)
        s_setpc_b64 s[30:31]
.Lfunc_end0:

In addition, we note that we can observe the destination of this request by looking at the LDS Instructions metric (12.2.0) – which indicates one LDS access.

$ omniperf analyze -p workloads/vmem/mi200/ --dispatch 2 -b 12.2.0 -n per_kernel
<...>
12. Local Data Share (LDS)
12.2 LDS Stats
╒═════════╤════════════╤═══════╤═══════╤═══════╤══════════════════╕
│ Index   │ Metric     │   Avg │   Min │   Max │ Unit             │
╞═════════╪════════════╪═══════╪═══════╪═══════╪══════════════════╡
│ 12.2.0  │ LDS Instrs │  1.00 │  1.00 │  1.00 │ Instr per kernel │
╘═════════╧════════════╧═══════╧═══════╧═══════╧══════════════════╛

Note

Exercise for the reader: if this access had been targeted at global memory (for instance, by changing value of filter), where should we look for the memory traffic? Hint: see the generic read example.

Global read#

Next, we examine a simple global read operation:

__global__ void global_read(int* ptr, int zero) {
  int x = ptr[threadIdx.x];
  if (x != zero) {
    ptr[threadIdx.x] = x + 1;
  }
}

Here we observe a now familiar pattern:

Read a value in from global memory.
Have a write hidden behind a conditional that is impossible for the compiler to statically eliminate, but is identically false. In this case, our main() function initializes the data in ptr to zero.

Running Omniperf on this kernel yields:

$ omniperf analyze -p workloads/vmem/mi200/ --dispatch 3 -b 10.3 -n per_kernel
<...>
0. Top Stat
╒════╤════════════════════════════════════╤═════════╤═══════════╤════════════╤══════════════╤════════╕
│    │ KernelName                         │   Count │   Sum(ns) │   Mean(ns) │   Median(ns) │    Pct │
╞════╪════════════════════════════════════╪═════════╪═══════════╪════════════╪══════════════╪════════╡
│  0 │ global_read(int*, int) [clone .kd] │    1.00 │   4480.00 │    4480.00 │      4480.00 │ 100.00 │
╘════╧════════════════════════════════════╧═════════╧═══════════╧════════════╧══════════════╧════════╛


--------------------------------------------------------------------------------
10. Compute Units - Instruction Mix
10.3 VMEM Instr Mix
╒═════════╤═══════════════════════╤═══════╤═══════╤═══════╤══════════════════╕
│ Index   │ Metric                │   Avg │   Min │   Max │ Unit             │
╞═════════╪═══════════════════════╪═══════╪═══════╪═══════╪══════════════════╡
│ 10.3.0  │ Global/Generic Instr  │  1.00 │  1.00 │  1.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.1  │ Global/Generic Read   │  1.00 │  1.00 │  1.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.2  │ Global/Generic Write  │  0.00 │  0.00 │  0.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.3  │ Global/Generic Atomic │  0.00 │  0.00 │  0.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.4  │ Spill/Stack Instr     │  0.00 │  0.00 │  0.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.5  │ Spill/Stack Read      │  0.00 │  0.00 │  0.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.6  │ Spill/Stack Write     │  0.00 │  0.00 │  0.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.7  │ Spill/Stack Atomic    │  0.00 │  0.00 │  0.00 │ Instr per kernel │
╘═════════╧═══════════════════════╧═══════╧═══════╧═══════╧══════════════════╛

Here we see a single global/generic instruction (10.3.0) which, as expected, is a read (10.3.1).

Generic read from global memory#

For our generic read example, we choose to change our target for the generic read to be global memory:

__global__ void generic_read(int* ptr, int zero, int filter) {
  __shared__ int lds[1024];
  if (static_cast<int>(filter - 1) == zero) {
    lds[threadIdx.x] = 0; // initialize to zero to avoid conditional, but hide behind _another_ conditional
  }
  int* generic;
  if (static_cast<int>(threadIdx.x) > filter - 1) {
    generic = &ptr[threadIdx.x];
  } else {
    generic = &lds[threadIdx.x];
    abort();
  }
  int x = generic_load((generic_ptr)generic);
  if (x != zero) {
    ptr[threadIdx.x] = x + 1;
  }
}

In addition to our usual if (condition_that_wont_happen) guard around the write operation, there is an additional conditional around the initialization of the lds buffer. We note that it’s typically required to write to this buffer to prevent the compiler from eliminating the local memory branch entirely due to undefined behavior (use of an uninitialized value). However, to report only our global memory read, we again hide this initialization behind an identically false conditional (both zero and filter are set to zero in the kernel launch). Note that this is a different conditional from our pointer assignment (to avoid combination of the two).

Running Omniperf on this kernel reports:

$ omniperf analyze -p workloads/vmem/mi200/ --dispatch 4 -b 10.3 12.2.0 16.3.10 -n per_kernel
<...>
0. Top Stat
╒════╤══════════════════════════════════════════╤═════════╤═══════════╤════════════╤══════════════╤════════╕
│    │ KernelName                               │   Count │   Sum(ns) │   Mean(ns) │   Median(ns) │    Pct │
╞════╪══════════════════════════════════════════╪═════════╪═══════════╪════════════╪══════════════╪════════╡
│  0 │ generic_read(int*, int, int) [clone .kd] │    1.00 │   2240.00 │    2240.00 │      2240.00 │ 100.00 │
╘════╧══════════════════════════════════════════╧═════════╧═══════════╧════════════╧══════════════╧════════╛


--------------------------------------------------------------------------------
10. Compute Units - Instruction Mix
10.3 VMEM Instr Mix
╒═════════╤═══════════════════════╤═══════╤═══════╤═══════╤══════════════════╕
│ Index   │ Metric                │   Avg │   Min │   Max │ Unit             │
╞═════════╪═══════════════════════╪═══════╪═══════╪═══════╪══════════════════╡
│ 10.3.0  │ Global/Generic Instr  │  1.00 │  1.00 │  1.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.1  │ Global/Generic Read   │  1.00 │  1.00 │  1.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.2  │ Global/Generic Write  │  0.00 │  0.00 │  0.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.3  │ Global/Generic Atomic │  0.00 │  0.00 │  0.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.4  │ Spill/Stack Instr     │  0.00 │  0.00 │  0.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.5  │ Spill/Stack Read      │  0.00 │  0.00 │  0.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.6  │ Spill/Stack Write     │  0.00 │  0.00 │  0.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.7  │ Spill/Stack Atomic    │  0.00 │  0.00 │  0.00 │ Instr per kernel │
╘═════════╧═══════════════════════╧═══════╧═══════╧═══════╧══════════════════╛


--------------------------------------------------------------------------------
12. Local Data Share (LDS)
12.2 LDS Stats
╒═════════╤════════════╤═══════╤═══════╤═══════╤══════════════════╕
│ Index   │ Metric     │   Avg │   Min │   Max │ Unit             │
╞═════════╪════════════╪═══════╪═══════╪═══════╪══════════════════╡
│ 12.2.0  │ LDS Instrs │  0.00 │  0.00 │  0.00 │ Instr per kernel │
╘═════════╧════════════╧═══════╧═══════╧═══════╧══════════════════╛


--------------------------------------------------------------------------------
16. Vector L1 Data Cache
16.3 L1D Cache Accesses
╒═════════╤════════════╤═══════╤═══════╤═══════╤════════════════╕
│ Index   │ Metric     │   Avg │   Min │   Max │ Unit           │
╞═════════╪════════════╪═══════╪═══════╪═══════╪════════════════╡
│ 16.3.10 │ L1-L2 Read │  1.00 │  1.00 │  1.00 │ Req per kernel │
╘═════════╧════════════╧═══════╧═══════╧═══════╧════════════════╛

Here we observe:

A single global/generic read operation (10.3.1), which
Is not an LDS instruction (12.2), as seen in the generic write example, but is instead
An L1-L2 read operation (16.3.10)

That is, we have successfully targeted our generic read at global memory. Inspecting the assembly shows this corresponds to a flat_load_dword instruction.

Global atomic#

Our global atomic kernel simply atomically adds a (non-compile-time) zero value to a pointer.

__global__ void global_atomic(int* ptr, int zero) {
  atomicAdd(ptr, zero);
}

Running Omniperf on this kernel yields:

$ omniperf analyze -p workloads/vmem/mi200/ --dispatch 5 -b 10.3 16.3.12 -n per_kernel
<...>
0. Top Stat
╒════╤══════════════════════════════════════╤═════════╤═══════════╤════════════╤══════════════╤════════╕
│    │ KernelName                           │   Count │   Sum(ns) │   Mean(ns) │   Median(ns) │    Pct │
╞════╪══════════════════════════════════════╪═════════╪═══════════╪════════════╪══════════════╪════════╡
│  0 │ global_atomic(int*, int) [clone .kd] │    1.00 │   4640.00 │    4640.00 │      4640.00 │ 100.00 │
╘════╧══════════════════════════════════════╧═════════╧═══════════╧════════════╧══════════════╧════════╛


--------------------------------------------------------------------------------
10. Compute Units - Instruction Mix
10.3 VMEM Instr Mix
╒═════════╤═══════════════════════╤═══════╤═══════╤═══════╤══════════════════╕
│ Index   │ Metric                │   Avg │   Min │   Max │ Unit             │
╞═════════╪═══════════════════════╪═══════╪═══════╪═══════╪══════════════════╡
│ 10.3.0  │ Global/Generic Instr  │  1.00 │  1.00 │  1.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.1  │ Global/Generic Read   │  0.00 │  0.00 │  0.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.2  │ Global/Generic Write  │  0.00 │  0.00 │  0.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.3  │ Global/Generic Atomic │  1.00 │  1.00 │  1.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.4  │ Spill/Stack Instr     │  0.00 │  0.00 │  0.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.5  │ Spill/Stack Read      │  0.00 │  0.00 │  0.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.6  │ Spill/Stack Write     │  0.00 │  0.00 │  0.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.7  │ Spill/Stack Atomic    │  0.00 │  0.00 │  0.00 │ Instr per kernel │
╘═════════╧═══════════════════════╧═══════╧═══════╧═══════╧══════════════════╛


--------------------------------------------------------------------------------
16. Vector L1 Data Cache
16.3 L1D Cache Accesses
╒═════════╤══════════════╤═══════╤═══════╤═══════╤════════════════╕
│ Index   │ Metric       │   Avg │   Min │   Max │ Unit           │
╞═════════╪══════════════╪═══════╪═══════╪═══════╪════════════════╡
│ 16.3.12 │ L1-L2 Atomic │  1.00 │  1.00 │  1.00 │ Req per kernel │
╘═════════╧══════════════╧═══════╧═══════╧═══════╧════════════════╛

Here we see a single global/generic atomic instruction (10.3.3), which corresponds to an L1-L2 atomic request (16.3.12).

Generic, mixed atomic#

In our final global/generic example, we look at a case where our generic operation targets both LDS and global memory:

__global__ void generic_atomic(int* ptr, int filter, int zero) {
  __shared__ int lds[1024];
  int* generic = (threadIdx.x % 2 == filter) ? &ptr[threadIdx.x] : &lds[threadIdx.x];
  generic_atomic((generic_ptr)generic, zero);
}

This assigns every other work-item to atomically update global memory or local memory.

Running this kernel through Omniperf shows:

$ omniperf analyze -p workloads/vmem/mi200/ --dispatch 6 -b 10.3 12.2.0 16.3.12 -n per_kernel
<...>
0. Top Stat
╒════╤══════════════════════════════════════════╤═════════╤═══════════╤════════════╤══════════════╤════════╕
│    │ KernelName                               │   Count │   Sum(ns) │   Mean(ns) │   Median(ns) │    Pct │
╞════╪══════════════════════════════════════════╪═════════╪═══════════╪════════════╪══════════════╪════════╡
│  0 │ generic_atomic(int*, int, int) [clone .k │    1.00 │   3360.00 │    3360.00 │      3360.00 │ 100.00 │
│    │ d]                                       │         │           │            │              │        │
╘════╧══════════════════════════════════════════╧═════════╧═══════════╧════════════╧══════════════╧════════╛


10. Compute Units - Instruction Mix
10.3 VMEM Instr Mix
╒═════════╤═══════════════════════╤═══════╤═══════╤═══════╤══════════════════╕
│ Index   │ Metric                │   Avg │   Min │   Max │ Unit             │
╞═════════╪═══════════════════════╪═══════╪═══════╪═══════╪══════════════════╡
│ 10.3.0  │ Global/Generic Instr  │  1.00 │  1.00 │  1.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.1  │ Global/Generic Read   │  0.00 │  0.00 │  0.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.2  │ Global/Generic Write  │  0.00 │  0.00 │  0.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.3  │ Global/Generic Atomic │  1.00 │  1.00 │  1.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.4  │ Spill/Stack Instr     │  0.00 │  0.00 │  0.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.5  │ Spill/Stack Read      │  0.00 │  0.00 │  0.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.6  │ Spill/Stack Write     │  0.00 │  0.00 │  0.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.7  │ Spill/Stack Atomic    │  0.00 │  0.00 │  0.00 │ Instr per kernel │
╘═════════╧═══════════════════════╧═══════╧═══════╧═══════╧══════════════════╛


--------------------------------------------------------------------------------
12. Local Data Share (LDS)
12.2 LDS Stats
╒═════════╤════════════╤═══════╤═══════╤═══════╤══════════════════╕
│ Index   │ Metric     │   Avg │   Min │   Max │ Unit             │
╞═════════╪════════════╪═══════╪═══════╪═══════╪══════════════════╡
│ 12.2.0  │ LDS Instrs │  1.00 │  1.00 │  1.00 │ Instr per kernel │
╘═════════╧════════════╧═══════╧═══════╧═══════╧══════════════════╛


--------------------------------------------------------------------------------
16. Vector L1 Data Cache
16.3 L1D Cache Accesses
╒═════════╤══════════════╤═══════╤═══════╤═══════╤════════════════╕
│ Index   │ Metric       │   Avg │   Min │   Max │ Unit           │
╞═════════╪══════════════╪═══════╪═══════╪═══════╪════════════════╡
│ 16.3.12 │ L1-L2 Atomic │  1.00 │  1.00 │  1.00 │ Req per kernel │
╘═════════╧══════════════╧═══════╧═══════╧═══════╧════════════════╛

That is, we see:

A single generic atomic instruction (10.3.3) that maps to both
An LDS instruction (12.2.0), and
An L1-L2 atomic request (16.3)

We have demonstrated the ability of the generic address space to dynamically target different backing memory.

Spill/Scratch (BUFFER)#

Next we examine the use of “Spill/Scratch” memory. On current CDNA accelerators such as the MI2XX, this is implemented using the private memory space, which maps to “scratch” memory in AMDGPU hardware terminology. This type of memory can be accessed via different instructions depending on the specific architecture targeted. However, current CDNA accelerators such as the MI2XX use so called buffer instructions to access private memory in a simple (and typically) coalesced manner. See Sec. 9.1, “Vector Memory Buffer Instructions” of the CDNA2 ISA guide for further reading on this instruction type.

We develop a simple kernel that uses stack memory:

#include <hip/hip_runtime.h>
__global__ void knl(int* out, int filter) {
  int x[1024];
  x[filter] = 0;
  if (threadIdx.x < filter)
    out[threadIdx.x] = x[threadIdx.x];
}

Our strategy here is to:

Create a large stack buffer (that cannot reasonably fit into registers) - Write to a compile-time unknown location on the stack, and then
Behind the typical compile-time unknown if(condition_that_wont_happen)
Read from a different, compile-time unknown, location on the stack and write to global memory to prevent the compiler from optimizing it out.

This example was compiled and run on an MI250 accelerator using ROCm v5.6.0, and Omniperf v2.0.0.

$ hipcc -O3 stack.hip -o stack.hip

And profiled using Omniperf:

$ omniperf profile -n stack --no-roof -- ./stack
<...>
$ omniperf analyze -p workloads/stack/mi200/  -b 10.3 16.3.11 -n per_kernel
<...>
10. Compute Units - Instruction Mix
10.3 VMEM Instr Mix
╒═════════╤═══════════════════════╤═══════╤═══════╤═══════╤══════════════════╕
│ Index   │ Metric                │   Avg │   Min │   Max │ Unit             │
╞═════════╪═══════════════════════╪═══════╪═══════╪═══════╪══════════════════╡
│ 10.3.0  │ Global/Generic Instr  │  0.00 │  0.00 │  0.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.1  │ Global/Generic Read   │  0.00 │  0.00 │  0.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.2  │ Global/Generic Write  │  0.00 │  0.00 │  0.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.3  │ Global/Generic Atomic │  0.00 │  0.00 │  0.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.4  │ Spill/Stack Instr     │  1.00 │  1.00 │  1.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.5  │ Spill/Stack Read      │  0.00 │  0.00 │  0.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.6  │ Spill/Stack Write     │  1.00 │  1.00 │  1.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.7  │ Spill/Stack Atomic    │  0.00 │  0.00 │  0.00 │ Instr per kernel │
╘═════════╧═══════════════════════╧═══════╧═══════╧═══════╧══════════════════╛


--------------------------------------------------------------------------------
16. Vector L1 Data Cache
16.3 L1D Cache Accesses
╒═════════╤═════════════╤═══════╤═══════╤═══════╤════════════════╕
│ Index   │ Metric      │   Avg │   Min │   Max │ Unit           │
╞═════════╪═════════════╪═══════╪═══════╪═══════╪════════════════╡
│ 16.3.11 │ L1-L2 Write │  1.00 │  1.00 │  1.00 │ Req per kernel │
╘═════════╧═════════════╧═══════╧═══════╧═══════╧════════════════╛

Here we see a single write to the stack (10.3.6), which corresponds to an L1-L2 write request (16.3.11), that is, the stack is backed by global memory and travels through the same memory hierarchy.

Instructions-per-cycle and utilizations example#

For this example, consider the instructions-per-cycle (IPC) example included with Omniperf.

This example is compiled using c++17 support:

$ hipcc -O3 ipc.hip -o ipc -std=c++17

and was run on an MI250 CDNA2 accelerator:

$ omniperf profile -n ipc --no-roof -- ./ipc

The results shown in this section are generally applicable to CDNA accelerators, but may vary between generations and specific products.

Design note#

The kernels in this example all execute a specific assembly operation N times (1000, by default), for instance the vmov kernel:

template<int N=1000>
__device__ void vmov_op() {
    int dummy;
    if constexpr (N >= 1) {
        asm volatile("v_mov_b32 v0, v1\n" : : "{v31}"(dummy));
        vmov_op<N - 1>();
    }
}

template<int N=1000>
__global__ void vmov() {
    vmov_op<N>();
}

The kernels are then launched twice, once for a warm-up run, and once for measurement.

VALU utilization and IPC#

Now we can use our test to measure the achieved instructions-per-cycle of various types of instructions. We start with a simple VALU operation, i.e., a v_mov_b32 instruction, e.g.:

v_mov_b32 v0, v1

This instruction simply copies the contents from the source register (v1) to the destination register (v0). Investigating this kernel with Omniperf, we see:

$ omniperf analyze -p workloads/ipc/mi200/ --dispatch 7 -b 11.2
<...>
--------------------------------------------------------------------------------
0. Top Stat
╒════╤═══════════════════════════════╤═════════╤═════════════╤═════════════╤══════════════╤════════╕
│    │ KernelName                    │   Count │     Sum(ns) │    Mean(ns) │   Median(ns) │    Pct │
╞════╪═══════════════════════════════╪═════════╪═════════════╪═════════════╪══════════════╪════════╡
│  0 │ void vmov<1000>() [clone .kd] │    1.00 │ 99317423.00 │ 99317423.00 │  99317423.00 │ 100.00 │
╘════╧═══════════════════════════════╧═════════╧═════════════╧═════════════╧══════════════╧════════╛


--------------------------------------------------------------------------------
11. Compute Units - Compute Pipeline
11.2 Pipeline Stats
╒═════════╤═════════════════════╤═══════╤═══════╤═══════╤══════════════╕
│ Index   │ Metric              │ Avg   │ Min   │ Max   │ Unit         │
╞═════════╪═════════════════════╪═══════╪═══════╪═══════╪══════════════╡
│ 11.2.0  │ IPC                 │ 1.0   │ 1.0   │ 1.0   │ Instr/cycle  │
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
│ 11.2.1  │ IPC (Issued)        │ 1.0   │ 1.0   │ 1.0   │ Instr/cycle  │
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
│ 11.2.2  │ SALU Util           │ 0.0   │ 0.0   │ 0.0   │ Pct          │
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
│ 11.2.3  │ VALU Util           │ 99.98 │ 99.98 │ 99.98 │ Pct          │
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
│ 11.2.4  │ VMEM Util           │ 0.0   │ 0.0   │ 0.0   │ Pct          │
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
│ 11.2.5  │ Branch Util         │ 0.1   │ 0.1   │ 0.1   │ Pct          │
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
│ 11.2.6  │ VALU Active Threads │ 64.0  │ 64.0  │ 64.0  │ Threads      │
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
│ 11.2.7  │ MFMA Util           │ 0.0   │ 0.0   │ 0.0   │ Pct          │
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
│ 11.2.8  │ MFMA Instr Cycles   │       │       │       │ Cycles/instr │
╘═════════╧═════════════════════╧═══════╧═══════╧═══════╧══════════════╛

Here we see that:

Both the IPC (11.2.0) and “Issued” IPC (11.2.1) metrics are $\sim 1$
The VALU Utilization metric (11.2.3) is also $\sim100\%$, and finally
The VALU Active Threads metric (11.2.4) is 64, i.e., the wavefront size on CDNA accelerators, as all threads in the wavefront are active.

We will explore the difference between the IPC (11.2.0) and “Issued” IPC (11.2.1) metrics in the next section.

Additionally, we notice a small (0.1%) Branch utilization (11.2.5). Inspecting the assembly of this kernel shows there are no branch operations, however recalling the note in the Pipeline statistics section:

The branch utilization <…> includes time spent in other instruction types (namely: s_endpgm) that are typically a very small percentage of the overall kernel execution.

We see that this is coming from execution of the s_endpgm instruction at the end of every wavefront.

Note

Technically, the cycle counts used in the denominators of our IPC metrics are actually in units of quad-cycles, a group of 4 consecutive cycles. However, a typical VALU instruction on CDNA accelerators runs for a single quad-cycle (see The AMD GCN Architecture - A Crash Course (slide 30)). Therefore, for simplicity, we simply report these metrics as “instructions per cycle”.

Exploring “issued” IPC via MFMA operations#

Warning

The MFMA assembly operations used in this example are inherently not portable to older CDNA architectures.

Unlike the simple quad-cycle v_mov_b32 operation discussed in our previous example, some operations take many quad-cycles to execute. For example, using the AMD Matrix Instruction Calculator we can see that some MFMA operations take 64 cycles, e.g.:

$ ./matrix_calculator.py --arch CDNA2 --detail-instruction --instruction v_mfma_f32_32x32x8bf16_1k
Architecture: CDNA2
Instruction: V_MFMA_F32_32X32X8BF16_1K
<...>
    Execution statistics:
        FLOPs: 16384
        Execution cycles: 64
        FLOPs/CU/cycle: 1024
        Can co-execute with VALU: True
        VALU co-execution cycles possible: 60

What happens to our IPC when we utilize this v_mfma_f32_32x32x8bf16_1k instruction on a CDNA2 accelerator? To find out, we turn to our mfma kernel in the IPC example:

$ omniperf analyze -p workloads/ipc/mi200/ --dispatch 8 -b 11.2 --decimal 4
<...>
--------------------------------------------------------------------------------
0. Top Stat
╒════╤═══════════════════════════════╤═════════╤═════════════════╤═════════════════╤═════════════════╤══════════╕
│    │ KernelName                    │   Count │         Sum(ns) │        Mean(ns) │      Median(ns) │      Pct │
╞════╪═══════════════════════════════╪═════════╪═════════════════╪═════════════════╪═════════════════╪══════════╡
│  0 │ void mfma<1000>() [clone .kd] │  1.0000 │ 1623167595.0000 │ 1623167595.0000 │ 1623167595.0000 │ 100.0000 │
╘════╧═══════════════════════════════╧═════════╧═════════════════╧═════════════════╧═════════════════╧══════════╛


--------------------------------------------------------------------------------
11. Compute Units - Compute Pipeline
11.2 Pipeline Stats
╒═════════╤═════════════════════╤═════════╤═════════╤═════════╤══════════════╕
│ Index   │ Metric              │     Avg │     Min │     Max │ Unit         │
╞═════════╪═════════════════════╪═════════╪═════════╪═════════╪══════════════╡
│ 11.2.0  │ IPC                 │  0.0626 │  0.0626 │  0.0626 │ Instr/cycle  │
├─────────┼─────────────────────┼─────────┼─────────┼─────────┼──────────────┤
│ 11.2.1  │ IPC (Issued)        │  1.0000 │  1.0000 │  1.0000 │ Instr/cycle  │
├─────────┼─────────────────────┼─────────┼─────────┼─────────┼──────────────┤
│ 11.2.2  │ SALU Util           │  0.0000 │  0.0000 │  0.0000 │ Pct          │
├─────────┼─────────────────────┼─────────┼─────────┼─────────┼──────────────┤
│ 11.2.3  │ VALU Util           │  6.2496 │  6.2496 │  6.2496 │ Pct          │
├─────────┼─────────────────────┼─────────┼─────────┼─────────┼──────────────┤
│ 11.2.4  │ VMEM Util           │  0.0000 │  0.0000 │  0.0000 │ Pct          │
├─────────┼─────────────────────┼─────────┼─────────┼─────────┼──────────────┤
│ 11.2.5  │ Branch Util         │  0.0062 │  0.0062 │  0.0062 │ Pct          │
├─────────┼─────────────────────┼─────────┼─────────┼─────────┼──────────────┤
│ 11.2.6  │ VALU Active Threads │ 64.0000 │ 64.0000 │ 64.0000 │ Threads      │
├─────────┼─────────────────────┼─────────┼─────────┼─────────┼──────────────┤
│ 11.2.7  │ MFMA Util           │ 99.9939 │ 99.9939 │ 99.9939 │ Pct          │
├─────────┼─────────────────────┼─────────┼─────────┼─────────┼──────────────┤
│ 11.2.8  │ MFMA Instr Cycles   │ 64.0000 │ 64.0000 │ 64.0000 │ Cycles/instr │
╘═════════╧═════════════════════╧═════════╧═════════╧═════════╧══════════════╛

In contrast to our VALU IPC example, we now see that the IPC metric (11.2.0) and Issued IPC (11.2.1) metric differ substantially. First, we see the VALU utilization (11.2.3) has decreased substantially, from nearly 100% to $\sim6.25\%$. We note that this matches the ratio of: $((Execution\ cycles) - (VALU\ coexecution\ cycles)) / (Execution\ cycles)$ reported by the matrix calculator, while the MFMA utilization (11.2.7) has increased to nearly 100%.

Recall that our v_mfma_f32_32x32x8bf16_1k instruction takes 64 cycles to execute, or 16 quad-cycles, matching our observed MFMA Instruction Cycles (11.2.8). That is, we have a single instruction executed every 16 quad-cycles, or $1/16 = 0.0625$, which is almost identical to our IPC metric (11.2.0). Why then is the Issued IPC metric (11.2.1) equal to 1.0?

Instead of simply counting the number of instructions issued and dividing by the number of cycles the CUs on the accelerator were active (as is done for 11.2.0), this metric is formulated differently, and instead counts the number of (non-internal) instructions issued divided by the number of (quad-) cycles where the scheduler was actively working on issuing instructions. Thus the Issued IPC metric (11.2.1) gives more of a sense of “what percent of the total number of scheduler cycles did a wave schedule an instruction?” while the IPC metric (11.2.0) indicates the ratio of the number of instructions executed over the total active CU cycles.

Warning

There are further complications of the Issued IPC metric (11.2.1) that make its use more complicated. We will be explore that in the following section. For these reasons, Omniperf typically promotes use of the regular IPC metric (11.2.0), e.g., in the top-level Speed-of-Light chart.

Internal instructions and IPC#

Next, we explore the concept of an “internal” instruction. From The AMD GCN Architecture - A Crash Course (slide 29), we see a few candidates for internal instructions, and we choose a s_nop instruction, which according to the CDNA2 ISA guide:

Does nothing; it can be repeated in hardware up to eight times.

Here we choose to use the following no-op to make our point:

s_nop 0x0

Running this kernel through Omniperf yields:

$ omniperf analyze -p workloads/ipc/mi200/ --dispatch 9 -b 11.2
<...>
--------------------------------------------------------------------------------
0. Top Stat
╒════╤═══════════════════════════════╤═════════╤═════════════╤═════════════╤══════════════╤════════╕
│    │ KernelName                    │   Count │     Sum(ns) │    Mean(ns) │   Median(ns) │    Pct │
╞════╪═══════════════════════════════╪═════════╪═════════════╪═════════════╪══════════════╪════════╡
│  0 │ void snop<1000>() [clone .kd] │    1.00 │ 14221851.50 │ 14221851.50 │  14221851.50 │ 100.00 │
╘════╧═══════════════════════════════╧═════════╧═════════════╧═════════════╧══════════════╧════════╛


--------------------------------------------------------------------------------
11. Compute Units - Compute Pipeline
11.2 Pipeline Stats
╒═════════╤═════════════════════╤═══════╤═══════╤═══════╤══════════════╕
│ Index   │ Metric              │ Avg   │ Min   │ Max   │ Unit         │
╞═════════╪═════════════════════╪═══════╪═══════╪═══════╪══════════════╡
│ 11.2.0  │ IPC                 │ 6.79  │ 6.79  │ 6.79  │ Instr/cycle  │
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
│ 11.2.1  │ IPC (Issued)        │ 1.0   │ 1.0   │ 1.0   │ Instr/cycle  │
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
│ 11.2.2  │ SALU Util           │ 0.0   │ 0.0   │ 0.0   │ Pct          │
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
│ 11.2.3  │ VALU Util           │ 0.0   │ 0.0   │ 0.0   │ Pct          │
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
│ 11.2.4  │ VMEM Util           │ 0.0   │ 0.0   │ 0.0   │ Pct          │
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
│ 11.2.5  │ Branch Util         │ 0.68  │ 0.68  │ 0.68  │ Pct          │
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
│ 11.2.6  │ VALU Active Threads │       │       │       │ Threads      │
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
│ 11.2.7  │ MFMA Util           │ 0.0   │ 0.0   │ 0.0   │ Pct          │
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
│ 11.2.8  │ MFMA Instr Cycles   │       │       │       │ Cycles/instr │
╘═════════╧═════════════════════╧═══════╧═══════╧═══════╧══════════════╛

First, we see that the IPC metric (11.2.0) tops our theoretical maximum of 5 instructions per cycle (discussed in the scheduler section). How can this be?

Recall that The AMD GCN Architecture - A Crash Course (slide 27) say “no functional unit” for the internal instructions. This removes the limitation on the IPC. If we are only issuing internal instructions, we are not issuing to any execution units! However, workloads such as these are almost entirely artificial (that is, repeatedly issuing internal instructions almost exclusively). In practice, a maximum of IPC of 5 is expected in almost all cases.

Secondly, note that our “Issued” IPC (11.2.1) is still identical to the one here. Again, this has to do with the details of “internal” instructions. Recall in our previous example we defined this metric as explicitly excluding internal instruction counts. The logical question then is, “what is this metric counting in our s_nop kernel?”

The generated assembly looks something like:

;;#ASMSTART
s_nop 0x0
;;#ASMEND
;;#ASMSTART
s_nop 0x0
;;#ASMEND
;;<... omitting many more ...>
s_endpgm
.section        .rodata,#alloc
.p2align        6, 0x0
.amdhsa_kernel _Z4snopILi1000EEvv

Of particular interest here is the s_endpgm instruction, of which the CDNA2 ISA guide states:

End of program; terminate wavefront.

This is not on our list of internal instructions from The AMD GCN Architecture, and is therefore counted as part of our Issued IPC (11.2.1). Thus, the issued IPC being equal to one here indicates that we issued an s_endpgm instruction every cycle the scheduler was active for non-internal instructions, which is expected as this was our only non-internal instruction.

SALU Utilization#

Next, we explore a simple SALU kernel in our on-going IPC and utilization example. For this case, we select a simple scalar move operation, for instance:

s_mov_b32 s0, s1

which, in analogue to our v_mov example, copies the contents of the source scalar register (s1) to the destination scalar register (s0). Running this kernel through Omniperf yields:

$ omniperf analyze -p workloads/ipc/mi200/ --dispatch 10 -b 11.2
<...>
--------------------------------------------------------------------------------
0. Top Stat
╒════╤═══════════════════════════════╤═════════╤═════════════╤═════════════╤══════════════╤════════╕
│    │ KernelName                    │   Count │     Sum(ns) │    Mean(ns) │   Median(ns) │    Pct │
╞════╪═══════════════════════════════╪═════════╪═════════════╪═════════════╪══════════════╪════════╡
│  0 │ void smov<1000>() [clone .kd] │    1.00 │ 96246554.00 │ 96246554.00 │  96246554.00 │ 100.00 │
╘════╧═══════════════════════════════╧═════════╧═════════════╧═════════════╧══════════════╧════════╛


--------------------------------------------------------------------------------
11. Compute Units - Compute Pipeline
11.2 Pipeline Stats
╒═════════╤═════════════════════╤═══════╤═══════╤═══════╤══════════════╕
│ Index   │ Metric              │ Avg   │ Min   │ Max   │ Unit         │
╞═════════╪═════════════════════╪═══════╪═══════╪═══════╪══════════════╡
│ 11.2.0  │ IPC                 │ 1.0   │ 1.0   │ 1.0   │ Instr/cycle  │
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
│ 11.2.1  │ IPC (Issued)        │ 1.0   │ 1.0   │ 1.0   │ Instr/cycle  │
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
│ 11.2.2  │ SALU Util           │ 99.98 │ 99.98 │ 99.98 │ Pct          │
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
│ 11.2.3  │ VALU Util           │ 0.0   │ 0.0   │ 0.0   │ Pct          │
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
│ 11.2.4  │ VMEM Util           │ 0.0   │ 0.0   │ 0.0   │ Pct          │
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
│ 11.2.5  │ Branch Util         │ 0.1   │ 0.1   │ 0.1   │ Pct          │
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
│ 11.2.6  │ VALU Active Threads │       │       │       │ Threads      │
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
│ 11.2.7  │ MFMA Util           │ 0.0   │ 0.0   │ 0.0   │ Pct          │
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
│ 11.2.8  │ MFMA Instr Cycles   │       │       │       │ Cycles/instr │
╘═════════╧═════════════════════╧═══════╧═══════╧═══════╧══════════════╛

Here we see that:

Both our IPC (11.2.0) and Issued IPC (11.2.1) are $\sim1.0$ as expected, and
The SALU Utilization (11.2.2) was nearly 100% as it was active for almost the entire kernel.

VALU Active Threads#

For our final IPC/Utilization example, we consider a slight modification of our v_mov example:

template<int N=1000>
__global__ void vmov_with_divergence() {
    if (threadIdx.x % 64 == 0)
        vmov_op<N>();
}

That is, we wrap our VALU operation inside a conditional where only one lane in our wavefront is active. Running this kernel through Omniperf yields:

$ omniperf analyze -p workloads/ipc/mi200/ --dispatch 11 -b 11.2
<...>
--------------------------------------------------------------------------------
0. Top Stat
╒════╤══════════════════════════════════════════╤═════════╤═════════════╤═════════════╤══════════════╤════════╕
│    │ KernelName                               │   Count │     Sum(ns) │    Mean(ns) │   Median(ns) │    Pct │
╞════╪══════════════════════════════════════════╪═════════╪═════════════╪═════════════╪══════════════╪════════╡
│  0 │ void vmov_with_divergence<1000>() [clone │    1.00 │ 97125097.00 │ 97125097.00 │  97125097.00 │ 100.00 │
│    │  .kd]                                    │         │             │             │              │        │
╘════╧══════════════════════════════════════════╧═════════╧═════════════╧═════════════╧══════════════╧════════╛


--------------------------------------------------------------------------------
11. Compute Units - Compute Pipeline
11.2 Pipeline Stats
╒═════════╤═════════════════════╤═══════╤═══════╤═══════╤══════════════╕
│ Index   │ Metric              │ Avg   │ Min   │ Max   │ Unit         │
╞═════════╪═════════════════════╪═══════╪═══════╪═══════╪══════════════╡
│ 11.2.0  │ IPC                 │ 1.0   │ 1.0   │ 1.0   │ Instr/cycle  │
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
│ 11.2.1  │ IPC (Issued)        │ 1.0   │ 1.0   │ 1.0   │ Instr/cycle  │
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
│ 11.2.2  │ SALU Util           │ 0.1   │ 0.1   │ 0.1   │ Pct          │
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
│ 11.2.3  │ VALU Util           │ 99.98 │ 99.98 │ 99.98 │ Pct          │
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
│ 11.2.4  │ VMEM Util           │ 0.0   │ 0.0   │ 0.0   │ Pct          │
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
│ 11.2.5  │ Branch Util         │ 0.2   │ 0.2   │ 0.2   │ Pct          │
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
│ 11.2.6  │ VALU Active Threads │ 1.13  │ 1.13  │ 1.13  │ Threads      │
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
│ 11.2.7  │ MFMA Util           │ 0.0   │ 0.0   │ 0.0   │ Pct          │
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
│ 11.2.8  │ MFMA Instr Cycles   │       │       │       │ Cycles/instr │
╘═════════╧═════════════════════╧═══════╧═══════╧═══════╧══════════════╛

Here we see that once again, our VALU Utilization (11.2.3) is nearly 100%. However, we note that the VALU Active Threads metric (11.2.6) is $\sim 1$, which matches our conditional in the source code. So VALU Active Threads reports the average number of lanes of our wavefront that are active over all VALU instructions, or thread “convergence” (i.e., 1 - divergence).

Note

The act of evaluating a vector conditional in this example typically triggers VALU operations, contributing to why the VALU Active Threads metric is not identically one.
This metric is a time (cycle) averaged value, and thus contains an implicit dependence on the duration of various VALU instructions.

Nonetheless, this metric serves as a useful measure of thread-convergence.

Finally, we note that our branch utilization (11.2.5) has increased slightly from our baseline, as we now have a branch (checking the value of threadIdx.x).

LDS examples#

For this example, consider the LDS sample distributed as a part of Omniperf. This code contains two kernels to explore how both LDS bandwidth and bank conflicts are calculated in Omniperf.

This example was compiled and run on an MI250 accelerator using ROCm v5.6.0, and Omniperf v2.0.0.

$ hipcc -O3 lds.hip -o lds

Finally, we generate our omniperf profile as:

$ omniperf profile -n lds --no-roof -- ./lds

LDS bandwidth#

To explore our theoretical LDS bandwidth metric, we use a simple kernel:

constexpr unsigned max_threads = 256;
__global__ void load(int* out, int flag) {
  __shared__ int array[max_threads];
  int index = threadIdx.x;
  // fake a store to the LDS array to avoid unwanted behavior
  if (flag)
    array[max_threads - index] = index;
  __syncthreads();
  int x = array[index];
  if (x == int(-1234567))
    out[threadIdx.x] = x;
}

Here we:

Create an array of 256 integers in LDS
Fake a write to the LDS using the flag variable (always set to zero on the host) to avoid dead-code elimination
Read a single integer per work-item from threadIdx.x of the LDS array
If the integer is equal to a magic number (always false), write the value out to global memory to again, avoid dead-code elimination

Finally, we launch this kernel repeatedly, varying the number of threads in our workgroup:

void bandwidth_demo(int N) {
  for (int i = 1; i <= N; ++i)
    load<<<1,i>>>(nullptr, 0);
  hipDeviceSynchronize();
}

Next, let’s analyze the first of our bandwidth kernel dispatches:

$ omniperf analyze -p workloads/lds/mi200/ -b 12.2.1 --dispatch 0 -n per_kernel
<...>
12. Local Data Share (LDS)
12.2 LDS Stats
╒═════════╤═══════════════════════╤════════╤════════╤════════╤══════════════════╕
│ Index   │ Metric                │    Avg │    Min │    Max │ Unit             │
╞═════════╪═══════════════════════╪════════╪════════╪════════╪══════════════════╡
│ 12.2.1  │ Theoretical Bandwidth │ 256.00 │ 256.00 │ 256.00 │ Bytes per kernel │
╘═════════╧═══════════════════════╧════════╧════════╧════════╧══════════════════╛

Here we see that our Theoretical Bandwidth metric (12.2.1) is reporting 256 Bytes were loaded even though we launched a single work-item workgroup, and thus only loaded a single integer from LDS. Why is this?

Recall our definition of this metric:

Indicates the maximum amount of bytes that could have been loaded from/stored to/atomically updated in the LDS per normalization unit.

Here we see that this instruction could have loaded up to 256 bytes of data (4 bytes for each work-item in the wavefront), and therefore this is the expected value for this metric in Omniperf, hence why this metric is named the “theoretical” bandwidth.

To further illustrate this point we plot the relationship of the theoretical bandwidth metric (12.2.1) as compared to the effective (or achieved) bandwidth of this kernel, varying the number of work-items launched from 1 to 256:

Fig. 52 Comparison of effective bandwidth versus the theoretical bandwidth metric in Omniperf for our simple example.#

Here we see that the theoretical bandwidth metric follows a step-function. It increases only when another wavefront issues an LDS instruction for up to 256 bytes of data. Such increases are marked in the plot using dashed lines. In contrast, the effective bandwidth increases linearly, by 4 bytes, with the number of work-items in the kernel, N.

Bank conflicts#

Next we explore bank conflicts using a slight modification of our bandwidth kernel:

constexpr unsigned nbanks = 32;
__global__ void conflicts(int* out, int flag) {
  constexpr unsigned nelements = nbanks * max_threads;
  __shared__ int array[nelements];
  // each thread reads from the same bank
  int index = threadIdx.x * nbanks;
  // fake a store to the LDS array to avoid unwanted behavior
  if (flag)
    array[max_threads - index] = index;
  __syncthreads();
  int x = array[index];
  if (x == int(-1234567))
    out[threadIdx.x] = x;
}

Here we:

Allocate an LDS array of size $32*256*4{B}=32{KiB}$
Fake a write to the LDS using the flag variable (always set to zero on the host) to avoid dead-code elimination
Read a single integer per work-item from index threadIdx.x * nbanks of the LDS array
If the integer is equal to a magic number (always false), write the value out to global memory to, again, avoid dead-code elimination.

On the host, we again repeatedly launch this kernel, varying the number of work-items:

void conflicts_demo(int N) {
  for (int i = 1; i <= N; ++i)
    conflicts<<<1,i>>>(nullptr, 0);
  hipDeviceSynchronize();
}

Analyzing our first conflicts kernel (i.e., a single work-item), we see:

$ omniperf analyze -p workloads/lds/mi200/ -b 12.2.4 12.2.6 --dispatch 256 -n per_kernel
<...>
--------------------------------------------------------------------------------
12. Local Data Share (LDS)
12.2 LDS Stats
╒═════════╤════════════════╤═══════╤═══════╤═══════╤═══════════════════╕
│ Index   │ Metric         │   Avg │   Min │   Max │ Unit              │
╞═════════╪════════════════╪═══════╪═══════╪═══════╪═══════════════════╡
│ 12.2.4  │ Index Accesses │  2.00 │  2.00 │  2.00 │ Cycles per kernel │
├─────────┼────────────────┼───────┼───────┼───────┼───────────────────┤
│ 12.2.6  │ Bank Conflict  │  0.00 │  0.00 │  0.00 │ Cycles per kernel │
╘═════════╧════════════════╧═══════╧═══════╧═══════╧═══════════════════╛

In our previous example, we showed how a load from a single work-item is considered to have a theoretical bandwidth of 256B. Recall, the LDS can load up to $128B$ per cycle (i.e, 32 banks x 4B / bank / cycle). Hence, we see that loading an 4B integer spends two cycles accessing the LDS ($2\ {cycle} = (256B) / (128\ B/{cycle})$).

Looking at the next conflicts dispatch (i.e., two work-items) yields:

$ omniperf analyze -p workloads/lds/mi200/ -b 12.2.4 12.2.6 --dispatch 257 -n per_kernel
<...>
--------------------------------------------------------------------------------
12. Local Data Share (LDS)
12.2 LDS Stats
╒═════════╤════════════════╤═══════╤═══════╤═══════╤═══════════════════╕
│ Index   │ Metric         │   Avg │   Min │   Max │ Unit              │
╞═════════╪════════════════╪═══════╪═══════╪═══════╪═══════════════════╡
│ 12.2.4  │ Index Accesses │  3.00 │  3.00 │  3.00 │ Cycles per kernel │
├─────────┼────────────────┼───────┼───────┼───────┼───────────────────┤
│ 12.2.6  │ Bank Conflict  │  1.00 │  1.00 │  1.00 │ Cycles per kernel │
╘═════════╧════════════════╧═══════╧═══════╧═══════╧═══════════════════╛

Here we see a bank conflict! What happened?

Recall that the index for each thread was calculated as:

int index = threadIdx.x * nbanks;

Or, precisely 32 elements, and each element is 4B wide (for a standard integer). That is, each thread strides back to the same bank in the LDS, such that each work-item we add to the dispatch results in another bank conflict!

Recalling our discussion of bank conflicts in our LDS description:

A bank conflict occurs when two (or more) work-items in a wavefront want to read, write, or atomically update different addresses that map to the same bank in the same cycle. In this case, the conflict detection hardware will determined a new schedule such that the access is split into multiple cycles with no conflicts in any single cycle.

Here we see the conflict resolution hardware in action! Because we have engineered our kernel to generate conflicts, we expect our bank conflict metric to scale linearly with the number of work-items:

Fig. 53 Comparison of LDS conflict cycles versus access cycles for our simple example.#

Here we show the comparison of the Index Accesses (12.2.4), to the Bank Conflicts (12.2.6) for the first 20 kernel invocations. We see that each grows linearly, and there is a constant gap of 2 cycles between them (i.e., the first access is never considered a conflict).

Finally, we can use these two metrics to derive the Bank Conflict Rate (12.1.4). Since within an Index Access we have 32 banks that may need to be updated, we use:

$$ BankConflictRate = 100 * ((BankConflicts / 32) / (IndexAccesses - BankConflicts)) $$

Plotting this, we see:

LDS bank conflict rate example — Fig. 54 LDS Bank Conflict rate for our simple example.#

The bank conflict rate linearly increases with the number of work-items within a wavefront that are active, approaching 100%, but never quite reaching it.

Occupancy limiters example#

For this example, consider the occupancy included with Omniperf. We will investigate the use of the resource allocation panel in the Workgroup Manager’s metrics section to determine occupancy limiters. This code contains several kernels to explore how both various kernel resources impact achieved occupancy, and how this is reported in Omniperf.

This example was compiled and run on a MI250 accelerator using ROCm v5.6.0, and Omniperf v2.0.0:

$ hipcc -O3 occupancy.hip -o occupancy --save-temps

We have again included the --save-temps flag to get the corresponding assembly.

Finally, we generate our Omniperf profile as:

$ omniperf profile -n occupancy --no-roof -- ./occupancy

Design note#

For our occupancy test, we need to create a kernel that is resource heavy, in various ways. For this purpose, we use the following (somewhat funny-looking) kernel:

constexpr int bound = 16;
__launch_bounds__(256)
__global__ void vgprbound(int N, double* ptr) {
    double intermediates[bound];
    for (int i = 0 ; i < bound; ++i) intermediates[i] = N * threadIdx.x;
    double x = ptr[threadIdx.x];
    for (int i = 0; i < 100; ++i) {
        x += sin(pow(__shfl(x, i % warpSize) * intermediates[(i - 1) % bound], intermediates[i % bound]));
        intermediates[i % bound] = x;
    }
    if (x == N) ptr[threadIdx.x] = x;
}

Here we try to use as many VGPRs as possible, to this end:

We create a small array of double precision floats, that we size to try to fit into registers (i.e., bound, this may need to be tuned depending on the ROCm version).
We specify __launch_bounds___(256) to increase the number of VPGRs available to the kernel (by limiting the number of wavefronts that can be resident on a CU).
Write a unique non-compile time constant to each element of the array.
Repeatedly permute and call relatively expensive math functions on our array elements.
Keep the compiler from optimizing out any operations by faking a write to the ptr based on a run-time conditional.

This yields a total of 122 VGPRs, but it is expected this number will depend on the exact ROCm/compiler version.

        .size   _Z9vgprboundiPd, .Lfunc_end1-_Z9vgprboundiPd
                                        ; -- End function
        .section        .AMDGPU.csdata
; Kernel info:
; codeLenInByte = 4732
; NumSgprs: 68
; NumVgprs: 122
; NumAgprs: 0
; <...>
; AccumOffset: 124

We will use various permutations of this kernel to limit occupancy, and more importantly for the purposes of this example, demonstrate how this is reported in Omniperf.

VGPR limited#

For our first test, we use the vgprbound kernel discussed in the design note. After profiling, we run the analyze step on this kernel:

$ omniperf analyze -p workloads/occupancy/mi200/ -b 2.1.15 6.2 7.1.5 7.1.6 7.1.7 --dispatch 1
<...>
--------------------------------------------------------------------------------
0. Top Stat
╒════╤═════════════════════════╤═════════╤══════════════╤══════════════╤══════════════╤════════╕
│    │ KernelName              │   Count │      Sum(ns) │     Mean(ns) │   Median(ns) │    Pct │
╞════╪═════════════════════════╪═════════╪══════════════╪══════════════╪══════════════╪════════╡
│  0 │ vgprbound(int, double*) │    1.00 │ 923093822.50 │ 923093822.50 │ 923093822.50 │ 100.00 │
╘════╧═════════════════════════╧═════════╧══════════════╧══════════════╧══════════════╧════════╛


--------------------------------------------------------------------------------
2. System Speed-of-Light
2.1 Speed-of-Light
╒═════════╤═════════════════════╤═════════╤════════════╤═════════╤═══════════════╕
│ Index   │ Metric              │     Avg │ Unit       │    Peak │   Pct of Peak │
╞═════════╪═════════════════════╪═════════╪════════════╪═════════╪═══════════════╡
│ 2.1.15  │ Wavefront Occupancy │ 1661.24 │ Wavefronts │ 3328.00 │         49.92 │
╘═════════╧═════════════════════╧═════════╧════════════╧═════════╧═══════════════╛


--------------------------------------------------------------------------------
6. Workgroup Manager (SPI)
6.2 Workgroup Manager - Resource Allocation
╒═════════╤════════════════════════════════════════╤═══════╤═══════╤═══════╤════════╕
│ Index   │ Metric                                 │   Avg │   Min │   Max │ Unit   │
╞═════════╪════════════════════════════════════════╪═══════╪═══════╪═══════╪════════╡
│ 6.2.0   │ Not-scheduled Rate (Workgroup Manager) │  0.64 │  0.64 │  0.64 │ Pct    │
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
│ 6.2.1   │ Not-scheduled Rate (Scheduler-Pipe)    │ 24.94 │ 24.94 │ 24.94 │ Pct    │
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
│ 6.2.2   │ Scheduler-Pipe Stall Rate              │ 24.49 │ 24.49 │ 24.49 │ Pct    │
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
│ 6.2.3   │ Scratch Stall Rate                     │  0.00 │  0.00 │  0.00 │ Pct    │
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
│ 6.2.4   │ Insufficient SIMD Waveslots            │  0.00 │  0.00 │  0.00 │ Pct    │
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
│ 6.2.5   │ Insufficient SIMD VGPRs                │ 94.90 │ 94.90 │ 94.90 │ Pct    │
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
│ 6.2.6   │ Insufficient SIMD SGPRs                │  0.00 │  0.00 │  0.00 │ Pct    │
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
│ 6.2.7   │ Insufficient CU LDS                    │  0.00 │  0.00 │  0.00 │ Pct    │
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
│ 6.2.8   │ Insufficient CU Barriers               │  0.00 │  0.00 │  0.00 │ Pct    │
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
│ 6.2.9   │ Reached CU Workgroup Limit             │  0.00 │  0.00 │  0.00 │ Pct    │
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
│ 6.2.10  │ Reached CU Wavefront Limit             │  0.00 │  0.00 │  0.00 │ Pct    │
╘═════════╧════════════════════════════════════════╧═══════╧═══════╧═══════╧════════╛


--------------------------------------------------------------------------------
7. Wavefront
7.1 Wavefront Launch Stats
╒═════════╤══════════╤════════╤════════╤════════╤═══════════╕
│ Index   │ Metric   │    Avg │    Min │    Max │ Unit      │
╞═════════╪══════════╪════════╪════════╪════════╪═══════════╡
│ 7.1.5   │ VGPRs    │ 124.00 │ 124.00 │ 124.00 │ Registers │
├─────────┼──────────┼────────┼────────┼────────┼───────────┤
│ 7.1.6   │ AGPRs    │   4.00 │   4.00 │   4.00 │ Registers │
├─────────┼──────────┼────────┼────────┼────────┼───────────┤
│ 7.1.7   │ SGPRs    │  80.00 │  80.00 │  80.00 │ Registers │
╘═════════╧══════════╧════════╧════════╧════════╧═══════════╛

Here we see that the kernel indeed does use around (but not exactly) 122 VGPRs, with the difference due to granularity of VGPR allocations. In addition, we see that we have allocated 4 “AGPRs”. We note that on current CDNA2 accelerators, the AccumOffset field of the assembly metadata:

; AccumOffset: 124

denotes the divide between VGPRs and AGPRs.

Next, we examine our wavefront occupancy (2.1.15), and see that we are reaching only $\sim50\%$ of peak occupancy. As a result, we see that:

We are not scheduling workgroups $\sim25\%$ of total scheduler-pipe cycles (6.2.1); recall from the discussion of the workgroup manager <desc-spi>, 25% is the maximum.
The scheduler-pipe is stalled (6.2.2) from scheduling workgroups due to resource constraints for the same $\sim25\%$ of the time.
And finally, $\sim91\%$ of those stalls are due to a lack of SIMDs with the appropriate number of VGPRs available (6.2.5).

That is, the reason we can’t reach full occupancy is due to our VGPR usage, as expected!

LDS limited#

To examine an LDS limited example, we must change our kernel slightly:

constexpr size_t fully_allocate_lds = 64ul * 1024ul / sizeof(double);
__launch_bounds__(256)
__global__ void ldsbound(int N, double* ptr) {
    __shared__ double intermediates[fully_allocate_lds];
    for (int i = threadIdx.x ; i < fully_allocate_lds; i += blockDim.x) intermediates[i] = N * threadIdx.x;
    __syncthreads();
    double x = ptr[threadIdx.x];
    for (int i = threadIdx.x; i < fully_allocate_lds; i += blockDim.x) {
        x += sin(pow(__shfl(x, i % warpSize) * intermediates[(i - 1) % fully_allocate_lds], intermediates[i % fully_allocate_lds]));
        __syncthreads();
        intermediates[i % fully_allocate_lds] = x;
    }
    if (x == N) ptr[threadIdx.x] = x;
}

Where we now:

Allocate an 64 KiB LDS array per workgroup, and
Use our allocated LDS array instead of a register array

Analyzing this:

$ omniperf analyze -p workloads/occupancy/mi200/ -b 2.1.15 6.2 7.1.5 7.1.6 7.1.7 7.1.8 --dispatch 3
<...>
--------------------------------------------------------------------------------
2. System Speed-of-Light
2.1 Speed-of-Light
╒═════════╤═════════════════════╤════════╤════════════╤═════════╤═══════════════╕
│ Index   │ Metric              │    Avg │ Unit       │    Peak │   Pct of Peak │
╞═════════╪═════════════════════╪════════╪════════════╪═════════╪═══════════════╡
│ 2.1.15  │ Wavefront Occupancy │ 415.52 │ Wavefronts │ 3328.00 │         12.49 │
╘═════════╧═════════════════════╧════════╧════════════╧═════════╧═══════════════╛


--------------------------------------------------------------------------------
6. Workgroup Manager (SPI)
6.2 Workgroup Manager - Resource Allocation
╒═════════╤════════════════════════════════════════╤═══════╤═══════╤═══════╤════════╕
│ Index   │ Metric                                 │   Avg │   Min │   Max │ Unit   │
╞═════════╪════════════════════════════════════════╪═══════╪═══════╪═══════╪════════╡
│ 6.2.0   │ Not-scheduled Rate (Workgroup Manager) │  0.13 │  0.13 │  0.13 │ Pct    │
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
│ 6.2.1   │ Not-scheduled Rate (Scheduler-Pipe)    │ 24.87 │ 24.87 │ 24.87 │ Pct    │
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
│ 6.2.2   │ Scheduler-Pipe Stall Rate              │ 24.84 │ 24.84 │ 24.84 │ Pct    │
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
│ 6.2.3   │ Scratch Stall Rate                     │  0.00 │  0.00 │  0.00 │ Pct    │
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
│ 6.2.4   │ Insufficient SIMD Waveslots            │  0.00 │  0.00 │  0.00 │ Pct    │
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
│ 6.2.5   │ Insufficient SIMD VGPRs                │  0.00 │  0.00 │  0.00 │ Pct    │
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
│ 6.2.6   │ Insufficient SIMD SGPRs                │  0.00 │  0.00 │  0.00 │ Pct    │
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
│ 6.2.7   │ Insufficient CU LDS                    │ 96.47 │ 96.47 │ 96.47 │ Pct    │
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
│ 6.2.8   │ Insufficient CU Barriers               │  0.00 │  0.00 │  0.00 │ Pct    │
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
│ 6.2.9   │ Reached CU Workgroup Limit             │  0.00 │  0.00 │  0.00 │ Pct    │
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
│ 6.2.10  │ Reached CU Wavefront Limit             │  0.00 │  0.00 │  0.00 │ Pct    │
╘═════════╧════════════════════════════════════════╧═══════╧═══════╧═══════╧════════╛


--------------------------------------------------------------------------------
7. Wavefront
7.1 Wavefront Launch Stats
╒═════════╤════════════════╤══════════╤══════════╤══════════╤═══════════╕
│ Index   │ Metric         │      Avg │      Min │      Max │ Unit      │
╞═════════╪════════════════╪══════════╪══════════╪══════════╪═══════════╡
│ 7.1.5   │ VGPRs          │    96.00 │    96.00 │    96.00 │ Registers │
├─────────┼────────────────┼──────────┼──────────┼──────────┼───────────┤
│ 7.1.6   │ AGPRs          │     0.00 │     0.00 │     0.00 │ Registers │
├─────────┼────────────────┼──────────┼──────────┼──────────┼───────────┤
│ 7.1.7   │ SGPRs          │    80.00 │    80.00 │    80.00 │ Registers │
├─────────┼────────────────┼──────────┼──────────┼──────────┼───────────┤
│ 7.1.8   │ LDS Allocation │ 65536.00 │ 65536.00 │ 65536.00 │ Bytes     │
╘═════════╧════════════════╧══════════╧══════════╧══════════╧═══════════╛

We see that our VGPR allocation has gone down to 96 registers, but now we see our 64KiB LDS allocation (7.1.8). In addition, we see a similar non-schedule rate (6.2.1) and stall rate (6.2.2) as in our VGPR example. However, our occupancy limiter has now shifted from VGPRs (6.2.5) to LDS (6.2.7).

We note that although we see the around the same scheduler/stall rates (with our LDS limiter), our wave occupancy (2.1.15) is significantly lower ($\sim12\%$)! This is important to remember: the occupancy limiter metrics in the resource allocation section tell you what the limiter was, but not how much the occupancy was limited. These metrics should always be analyzed in concert with the wavefront occupancy metric!

SGPR limited#

Finally, we modify our kernel once more to make it limited by SGPRs:

constexpr int sgprlim = 1;
__launch_bounds__(1024, 8)
__global__ void sgprbound(int N, double* ptr) {
    double intermediates[sgprlim];
    for (int i = 0 ; i < sgprlim; ++i) intermediates[i] = i;
    double x = ptr[0];
    #pragma unroll 1
    for (int i = 0; i < 100; ++i) {
        x += sin(pow(intermediates[(i - 1) % sgprlim], intermediates[i % sgprlim]));
        intermediates[i % sgprlim] = x;
    }
    if (x == N) ptr[0] = x;
}

The major changes here are to: - make as much as possible provably uniform across the wave (notice the lack of threadIdx.x in the intermediates initialization and elsewhere), - addition of __launch_bounds__(1024, 8), which reduces our maximum VGPRs to 64 (such that 8 waves can fit per SIMD), but causes some register spills (i.e., scratch usage), and - lower the bound (here we use sgprlim) of the array to reduce VGPR/Scratch usage.

This results in the following assembly metadata for this kernel:

        .size   _Z9sgprboundiPd, .Lfunc_end3-_Z9sgprboundiPd
                                        ; -- End function
        .section        .AMDGPU.csdata
; Kernel info:
; codeLenInByte = 4872
; NumSgprs: 76
; NumVgprs: 64
; NumAgprs: 0
; TotalNumVgprs: 64
; ScratchSize: 60
; <...>
; AccumOffset: 64
; Occupancy: 8

Analyzing this workload yields:

$ omniperf analyze -p workloads/occupancy/mi200/ -b 2.1.15 6.2 7.1.5 7.1.6 7.1.7 7.1.8 7.1.9 --dispatch 5
<...>
--------------------------------------------------------------------------------
0. Top Stat
╒════╤═════════════════════════╤═════════╤══════════════╤══════════════╤══════════════╤════════╕
│    │ KernelName              │   Count │      Sum(ns) │     Mean(ns) │   Median(ns) │    Pct │
╞════╪═════════════════════════╪═════════╪══════════════╪══════════════╪══════════════╪════════╡
│  0 │ sgprbound(int, double*) │    1.00 │ 782069812.00 │ 782069812.00 │ 782069812.00 │ 100.00 │
╘════╧═════════════════════════╧═════════╧══════════════╧══════════════╧══════════════╧════════╛


--------------------------------------------------------------------------------
2. System Speed-of-Light
2.1 Speed-of-Light
╒═════════╤═════════════════════╤═════════╤════════════╤═════════╤═══════════════╕
│ Index   │ Metric              │     Avg │ Unit       │    Peak │   Pct of Peak │
╞═════════╪═════════════════════╪═════════╪════════════╪═════════╪═══════════════╡
│ 2.1.15  │ Wavefront Occupancy │ 3291.76 │ Wavefronts │ 3328.00 │         98.91 │
╘═════════╧═════════════════════╧═════════╧════════════╧═════════╧═══════════════╛


--------------------------------------------------------------------------------
6. Workgroup Manager (SPI)
6.2 Workgroup Manager - Resource Allocation
╒═════════╤════════════════════════════════════════╤═══════╤═══════╤═══════╤════════╕
│ Index   │ Metric                                 │   Avg │   Min │   Max │ Unit   │
╞═════════╪════════════════════════════════════════╪═══════╪═══════╪═══════╪════════╡
│ 6.2.0   │ Not-scheduled Rate (Workgroup Manager) │  7.72 │  7.72 │  7.72 │ Pct    │
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
│ 6.2.1   │ Not-scheduled Rate (Scheduler-Pipe)    │ 15.17 │ 15.17 │ 15.17 │ Pct    │
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
│ 6.2.2   │ Scheduler-Pipe Stall Rate              │  7.38 │  7.38 │  7.38 │ Pct    │
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
│ 6.2.3   │ Scratch Stall Rate                     │ 39.76 │ 39.76 │ 39.76 │ Pct    │
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
│ 6.2.4   │ Insufficient SIMD Waveslots            │ 26.32 │ 26.32 │ 26.32 │ Pct    │
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
│ 6.2.5   │ Insufficient SIMD VGPRs                │ 26.32 │ 26.32 │ 26.32 │ Pct    │
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
│ 6.2.6   │ Insufficient SIMD SGPRs                │ 25.52 │ 25.52 │ 25.52 │ Pct    │
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
│ 6.2.7   │ Insufficient CU LDS                    │  0.00 │  0.00 │  0.00 │ Pct    │
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
│ 6.2.8   │ Insufficient CU Barriers               │  0.00 │  0.00 │  0.00 │ Pct    │
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
│ 6.2.9   │ Reached CU Workgroup Limit             │  0.00 │  0.00 │  0.00 │ Pct    │
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
│ 6.2.10  │ Reached CU Wavefront Limit             │  0.00 │  0.00 │  0.00 │ Pct    │
╘═════════╧════════════════════════════════════════╧═══════╧═══════╧═══════╧════════╛


--------------------------------------------------------------------------------
7. Wavefront
7.1 Wavefront Launch Stats
╒═════════╤════════════════════╤═══════╤═══════╤═══════╤════════════════╕
│ Index   │ Metric             │   Avg │   Min │   Max │ Unit           │
╞═════════╪════════════════════╪═══════╪═══════╪═══════╪════════════════╡
│ 7.1.5   │ VGPRs              │ 64.00 │ 64.00 │ 64.00 │ Registers      │
├─────────┼────────────────────┼───────┼───────┼───────┼────────────────┤
│ 7.1.6   │ AGPRs              │  0.00 │  0.00 │  0.00 │ Registers      │
├─────────┼────────────────────┼───────┼───────┼───────┼────────────────┤
│ 7.1.7   │ SGPRs              │ 80.00 │ 80.00 │ 80.00 │ Registers      │
├─────────┼────────────────────┼───────┼───────┼───────┼────────────────┤
│ 7.1.8   │ LDS Allocation     │  0.00 │  0.00 │  0.00 │ Bytes          │
├─────────┼────────────────────┼───────┼───────┼───────┼────────────────┤
│ 7.1.9   │ Scratch Allocation │ 60.00 │ 60.00 │ 60.00 │ Bytes/workitem │
╘═════════╧════════════════════╧═══════╧═══════╧═══════╧════════════════╛

Here we see that our wavefront launch stats (7.1) have changed to reflect the metadata seen in the --save-temps output. Of particular interest, we see:

The SGPR allocation (7.1.7) is 80 registers, slightly more than the 76 requested by the compiler due to allocation granularity, and
We have a “scratch”, that is, private memory, allocation of 60 bytes per work-item.

Analyzing the resource allocation block (6.2) we now see that for the first time, the “Not-scheduled Rate (Workgroup Manager)” metric (6.2.0) has become non-zero. This is because the workgroup manager is responsible for management of scratch, which we see also contributes to our occupancy limiters in the “Scratch Stall Rate” (6.2.3). Note that the sum of the workgroup manager not-scheduled rate and the scheduler-pipe non-scheduled rate is still $\sim25\%$, as in our previous examples.

Next, we see that the scheduler-pipe stall rate (6.2.2), that is, how often we could not schedule a workgroup to a CU, was only about $\sim8\%$. This hints that perhaps, our kernel is not particularly occupancy limited by resources. Indeed, checking the wave occupancy metric (2.1.15) shows that this kernel is reaching nearly 99% occupancy.

Finally, we inspect the occupancy limiter metrics and see a roughly even split between waveslots (6.2.4), VGPRs (6.2.5), and SGPRs (6.2.6) along with the scratch stalls (6.2.3) previously mentioned.

This is yet another reminder to view occupancy holistically. While these metrics tell you why a workgroup cannot be scheduled, they do not tell you what your occupancy was (consult wavefront occupancy) nor whether increasing occupancy will be beneficial to performance.

Profiling by example

Contents

Profiling by example#

VALU arithmetic instruction mix#

Design note#

Instruction mix#

Infinity Fabric transactions#

Experiment 1: Coarse-grained, accelerator-local HBM reads#

Experiment 2: Fine-grained, accelerator-local HBM reads#

Experiment 3: Fine-grained, remote-accelerator HBM reads#

Experiment 4: Fine-grained, CPU-DRAM reads#

Experiment 5: Coarse-grained, CPU-DRAM reads#

Experiment 6: Fine-grained, CPU-DRAM writes#

Experiment 7: Fine-grained, CPU-DRAM atomicAdd#

Vector memory operation counting#

Global / Generic (FLAT)#

Design note#

Global write#

Generic write to LDS#

Global read#

Generic read from global memory#

Global atomic#

Generic, mixed atomic#

Spill/Scratch (BUFFER)#

Instructions-per-cycle and utilizations example#

Design note#

VALU utilization and IPC#

Exploring “issued” IPC via MFMA operations#

Internal instructions and IPC#

SALU Utilization#

VALU Active Threads#

LDS examples#

LDS bandwidth#

Bank conflicts#

Occupancy limiters example#

Design note#

VGPR limited#

LDS limited#

SGPR limited#