System Speed-of-Light

System Speed-of-Light summarizes some of the key metrics from various sections of ROCm Compute Profiler’s profiling report.

Warning

The theoretical maximum throughput for some metrics in this section is currently computed using the maximum achievable clock frequency, as reported by rocminfo, for an accelerator. This may not be realistic for all workloads.

Also, not all metrics – such as FLOP counters – are available on all AMD Instinct™ MI-series accelerators. For more detail on how operations are counted, see the FLOP counting conventions section.
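To illustrate why the clock assumption in the warning matters, the following is a minimal sketch of how a percent-of-peak figure shifts with the clock frequency used for the peak. The per-CU throughput, CU count, clock values, and function names are all hypothetical and chosen only for illustration; they are not taken from rocminfo or from ROCm Compute Profiler itself.

```python
def peak_gflops(flops_per_cycle_per_cu: float, num_cus: int, clock_mhz: float) -> float:
    """Peak rate in GFLOPs: FLOPs/cycle/CU * CUs * clock (MHz) / 1000."""
    return flops_per_cycle_per_cu * num_cus * clock_mhz / 1e3


def percent_of_peak(achieved: float, peak: float) -> float:
    """Express an achieved rate as a percent of a peak rate."""
    if peak <= 0:
        raise ValueError("peak must be positive")
    return 100.0 * achieved / peak


# Hypothetical accelerator parameters (assumptions, not real hardware values).
FLOPS_PER_CYCLE_PER_CU = 128
NUM_CUS = 100
MAX_CLOCK_MHZ = 2000.0        # stands in for the maximum clock reported by rocminfo
SUSTAINED_CLOCK_MHZ = 1500.0  # stands in for a clock the workload actually sustains

achieved_gflops = 9600.0      # hypothetical measured VALU FLOPs

peak_at_max = peak_gflops(FLOPS_PER_CYCLE_PER_CU, NUM_CUS, MAX_CLOCK_MHZ)
peak_at_sustained = peak_gflops(FLOPS_PER_CYCLE_PER_CU, NUM_CUS, SUSTAINED_CLOCK_MHZ)

print(f"Percent of peak at max clock:       {percent_of_peak(achieved_gflops, peak_at_max):.1f}%")
print(f"Percent of peak at sustained clock: {percent_of_peak(achieved_gflops, peak_at_sustained):.1f}%")
```

With these illustrative numbers, the same achieved rate reads as a lower percent of peak when the peak is based on the maximum clock than when it is based on a lower sustained clock. This is why a reported percentage can understate how close a workload is to its practical limit.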

Metric

Description

Unit

VALU FLOPs

The total floating-point operations executed per second on the VALU. This is also presented as a percent of the peak theoretical FLOPs achievable on the specific accelerator. Note: this does not include any floating-point operations from MFMA instructions.

GFLOPs

VALU IOPs

The total integer operations executed per second on the VALU. This is also presented as a percent of the peak theoretical IOPs achievable on the specific accelerator. Note: this does not include any integer operations from MFMA instructions.

GIOPs

MFMA FLOPs (BF16)

The total number of 16-bit brain floating point MFMA operations executed per second. Note: this does not include any 16-bit brain floating point operations from VALU instructions. This is also presented as a percent of the peak theoretical BF16 MFMA operations achievable on the specific accelerator.

GFLOPs

MFMA FLOPs (F16)

The total number of 16-bit floating point MFMA operations executed per second. Note: this does not include any 16-bit floating point operations from VALU instructions. This is also presented as a percent of the peak theoretical F16 MFMA operations achievable on the specific accelerator.

GFLOPs

MFMA FLOPs (F32)

The total number of 32-bit floating point MFMA operations executed per second. Note: this does not include any 32-bit floating point operations from VALU instructions. This is also presented as a percent of the peak theoretical F32 MFMA operations achievable on the specific accelerator.

GFLOPs

MFMA FLOPs (F64)

The total number of 64-bit floating point MFMA operations executed per second. Note: this does not include any 64-bit floating point operations from VALU instructions. This is also presented as a percent of the peak theoretical F64 MFMA operations achievable on the specific accelerator.

GFLOPs

MFMA IOPs (INT8)

The total number of 8-bit integer MFMA operations executed per second. Note: this does not include any 8-bit integer operations from VALU instructions. This is also presented as a percent of the peak theoretical INT8 MFMA operations achievable on the specific accelerator.

GIOPs

SALU utilization

Indicates what percent of the kernel’s duration the SALU was busy executing instructions. Computed as the ratio of the total number of cycles spent by the scheduler issuing SALU or SMEM instructions over the total CU cycles.

Percent

VALU utilization

Indicates what percent of the kernel’s duration the VALU was busy executing instructions. Does not include VMEM operations. Computed as the ratio of the total number of cycles spent by the scheduler issuing VALU instructions over the total CU cycles.

Percent

MFMA utilization

Indicates what percent of the kernel’s duration the MFMA unit was busy executing instructions. Computed as the ratio of the total number of cycles the MFMA was busy over the total CU cycles.

Percent

VMEM utilization

Indicates what percent of the kernel’s duration the VMEM unit was busy executing instructions, including both global/generic and spill/scratch operations (see the VMEM instruction count metrics for more detail). Does not include VALU operations. Computed as the ratio of the total number of cycles spent by the scheduler issuing VMEM instructions over the total CU cycles.

Percent

Branch utilization

Indicates what percent of the kernel’s duration the branch unit was busy executing instructions. Computed as the ratio of the total number of cycles spent by the scheduler issuing branch instructions over the total CU cycles.

Percent

VALU active threads

Indicates the average level of divergence within a wavefront over the lifetime of the kernel. Computed as the number of work-items that were active in a wavefront during execution of each VALU instruction, time-averaged over all VALU instructions run on all wavefronts in the kernel.

Work-items

IPC

The ratio of the total number of instructions executed on the CU over the total active CU cycles. This is also presented as a percent of the peak theoretical IPC achievable on the specific accelerator.

Instructions per cycle

Wavefront occupancy

The time-averaged number of wavefronts resident on the accelerator over the lifetime of the kernel. Note: this metric may be inaccurate for short-running kernels (less than 1 ms). This is also presented as a percent of the peak theoretical occupancy achievable on the specific accelerator.

Wavefronts

LDS theoretical bandwidth

Indicates the maximum number of bytes that could have been loaded from, stored to, or atomically updated in the LDS per unit time (see LDS Bandwidth example for more detail). This is also presented as a percent of the peak theoretical bandwidth achievable on the specific accelerator.

GB/s

LDS bank conflicts/access

The ratio of the number of cycles spent in the LDS scheduler due to bank conflicts (as determined by the conflict resolution hardware) to the base number of cycles that would be spent in the LDS scheduler in a completely uncontended case. This is also presented in normalized form (i.e., the Bank Conflict Rate).

Conflicts/Access

vL1D cache hit rate

The ratio of the number of vL1D cache line requests that hit in vL1D cache over the total number of cache line requests to the vL1D cache RAM.

Percent

vL1D cache bandwidth

The number of bytes looked up in the vL1D cache as a result of VMEM instructions per unit time. The number of bytes is calculated as the number of cache lines requested multiplied by the cache line size. This value does not consider partial requests, so, for example, if only a single value is requested in a cache line, the data movement will still be counted as a full cache line. This is also presented as a percent of the peak theoretical bandwidth achievable on the specific accelerator.

GB/s

L2 cache hit rate

The ratio of the number of L2 cache line requests that hit in the L2 cache over the total number of incoming cache line requests to the L2 cache.

Percent

L2 cache bandwidth

The number of bytes looked up in the L2 cache per unit time. The number of bytes is calculated as the number of cache lines requested multiplied by the cache line size. This value does not consider partial requests, so, for example, if only a single value is requested in a cache line, the data movement will still be counted as a full cache line. This is also presented as a percent of the peak theoretical bandwidth achievable on the specific accelerator.

GB/s

L2-fabric read BW

The number of bytes read by the L2 over the Infinity Fabric™ interface per unit time. This is also presented as a percent of the peak theoretical bandwidth achievable on the specific accelerator.

GB/s

L2-fabric write and atomic BW

The number of bytes sent by the L2 over the Infinity Fabric interface by write and atomic operations per unit time. This is also presented as a percent of the peak theoretical bandwidth achievable on the specific accelerator.

GB/s

L2-fabric read latency

The time-averaged number of cycles read requests spent in Infinity Fabric before data was returned to the L2.

Cycles

L2-fabric write latency

The time-averaged number of cycles write requests spent in Infinity Fabric before a completion acknowledgement was returned to the L2.

Cycles

sL1D cache hit rate

The percent of sL1D requests that hit on a previously loaded line in the cache. Calculated as the ratio of the number of sL1D requests that hit over the number of all sL1D requests.

Percent

sL1D bandwidth

The number of bytes looked up in the sL1D cache per unit time. This is also presented as a percent of the peak theoretical bandwidth achievable on the specific accelerator.

GB/s

L1I bandwidth

The number of bytes looked up in the L1I cache per unit time. This is also presented as a percent of the peak theoretical bandwidth achievable on the specific accelerator.

GB/s

L1I cache hit rate

The percent of L1I requests that hit on a previously loaded line in the cache. Calculated as the ratio of the number of L1I requests that hit over the number of all L1I requests.

Percent

L1I fetch latency

The average number of cycles spent fetching instructions for a CU.

Cycles
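Several metrics in the table above are simple ratios (utilization, hit rates) or are derived from cache line counts (cache bandwidth). The following is a minimal sketch of those calculations as described in the table. The function names, counter values, and the 64-byte cache line size are hypothetical and used only for illustration; the actual hardware counters, cache line sizes, and formulas used by ROCm Compute Profiler may differ by accelerator.

```python
def ratio_percent(numerator: float, denominator: float) -> float:
    """Ratio expressed as a percent, as used for utilization and hit rates.

    Returns 0.0 when the denominator is zero (a design choice for this sketch).
    """
    if denominator == 0:
        return 0.0
    return 100.0 * numerator / denominator


def cache_bandwidth_gbps(cache_lines_requested: int,
                         cache_line_bytes: int,
                         duration_s: float) -> float:
    """Bandwidth in GB/s (1 GB = 1e9 bytes assumed here).

    Every request is counted as a full cache line, so partial requests
    are not distinguished from full-line requests.
    """
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return cache_lines_requested * cache_line_bytes / duration_s / 1e9


# Hypothetical counter values for a single kernel.
valu_issue_cycles = 250_000   # cycles the scheduler spent issuing VALU instructions
total_cu_cycles = 1_000_000   # total CU cycles
cache_hits = 900_000          # cache line requests that hit
cache_requests = 1_000_000    # total cache line requests
kernel_duration_s = 1e-3      # kernel duration in seconds
CACHE_LINE_BYTES = 64         # assumed line size for illustration only

print(f"VALU utilization: {ratio_percent(valu_issue_cycles, total_cu_cycles):.1f}%")
print(f"Cache hit rate:   {ratio_percent(cache_hits, cache_requests):.1f}%")
print(f"Cache bandwidth:  "
      f"{cache_bandwidth_gbps(cache_requests, CACHE_LINE_BYTES, kernel_duration_s):.1f} GB/s")
```

Because the bandwidth calculation multiplies the request count by the full line size, a kernel that reads a single 4-byte value per cache line would still be charged 64 bytes per request in this sketch, which matches the partial-request caveat in the bandwidth descriptions.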