System Speed-of-Light
System Speed-of-Light summarizes some of the key metrics from various sections of ROCm Compute Profiler’s profiling report.
Warning
The theoretical maximum throughput for some metrics in this section is currently computed using the maximum achievable clock frequency of the accelerator, as reported by rocminfo. This may not be realistic for all workloads.
Also, not all metrics – such as FLOP counters – are available on all AMD Instinct™ MI-series accelerators. For more detail on how operations are counted, see the FLOP counting conventions section.
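To illustrate why the clock-frequency caveat in the warning matters, the following is a minimal sketch of a percent-of-peak calculation. It is not the profiler's implementation: the function names, the sample numbers, and the assumption that peak throughput scales linearly with clock frequency are all hypothetical and chosen only for illustration.

```python
def percent_of_peak(achieved: float, peak: float) -> float:
    """Return achieved throughput as a percentage of a peak value."""
    if peak <= 0:
        raise ValueError("peak must be positive")
    return 100.0 * achieved / peak


def rescale_peak(peak_at_max_clock: float, max_clock_mhz: float,
                 sustained_clock_mhz: float) -> float:
    """Scale a peak computed at the maximum clock down to a sustained clock.

    Assumption (illustration only): peak throughput is proportional
    to clock frequency.
    """
    if max_clock_mhz <= 0:
        raise ValueError("max_clock_mhz must be positive")
    return peak_at_max_clock * sustained_clock_mhz / max_clock_mhz


# Hypothetical numbers, not taken from any real accelerator.
achieved_gflops = 5000.0
peak_gflops_at_max = 20000.0

# Percent of peak when the peak assumes the maximum clock frequency.
reported = percent_of_peak(achieved_gflops, peak_gflops_at_max)

# Percent of peak when the peak is rescaled to a lower sustained clock.
adjusted = percent_of_peak(
    achieved_gflops,
    rescale_peak(peak_gflops_at_max, max_clock_mhz=2000.0,
                 sustained_clock_mhz=1600.0),
)

print(f"reported: {reported}%, adjusted: {adjusted}%")
```

Under these assumptions, the same achieved throughput looks like a smaller fraction of peak when the peak is computed at the maximum clock than when it is computed at the clock the workload actually sustains.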
Metric | Description | Unit |
---|---|---|
VALU FLOPs | The total number of floating-point operations executed per second on the VALU. This is also presented as a percent of the peak theoretical FLOPs achievable on the specific accelerator. Note: this does not include any floating-point operations from MFMA instructions. | GFLOPs |
VALU IOPs | The total number of integer operations executed per second on the VALU. This is also presented as a percent of the peak theoretical IOPs achievable on the specific accelerator. Note: this does not include any integer operations from MFMA instructions. | GIOPs |
MFMA FLOPs (BF16) | The total number of 16-bit brain floating-point MFMA operations executed per second. Note: this does not include any 16-bit brain floating-point operations from VALU instructions. This is also presented as a percent of the peak theoretical BF16 MFMA operations achievable on the specific accelerator. | GFLOPs |
MFMA FLOPs (F16) | The total number of 16-bit floating-point MFMA operations executed per second. Note: this does not include any 16-bit floating-point operations from VALU instructions. This is also presented as a percent of the peak theoretical F16 MFMA operations achievable on the specific accelerator. | GFLOPs |
MFMA FLOPs (F32) | The total number of 32-bit floating-point MFMA operations executed per second. Note: this does not include any 32-bit floating-point operations from VALU instructions. This is also presented as a percent of the peak theoretical F32 MFMA operations achievable on the specific accelerator. | GFLOPs |
MFMA FLOPs (F64) | The total number of 64-bit floating-point MFMA operations executed per second. Note: this does not include any 64-bit floating-point operations from VALU instructions. This is also presented as a percent of the peak theoretical F64 MFMA operations achievable on the specific accelerator. | GFLOPs |
MFMA IOPs (INT8) | The total number of 8-bit integer MFMA operations executed per second. Note: this does not include any 8-bit integer operations from VALU instructions. This is also presented as a percent of the peak theoretical INT8 MFMA operations achievable on the specific accelerator. | GIOPs |
SALU utilization | Indicates what percent of the kernel’s duration the SALU was busy executing instructions. Computed as the ratio of the total number of cycles spent by the scheduler issuing SALU or SMEM instructions over the total CU cycles. | Percent |
VALU utilization | Indicates what percent of the kernel’s duration the VALU was busy executing instructions. Does not include VMEM operations. Computed as the ratio of the total number of cycles spent by the scheduler issuing VALU instructions over the total CU cycles. | Percent |
MFMA utilization | Indicates what percent of the kernel’s duration the MFMA unit was busy executing instructions. Computed as the ratio of the total number of cycles the MFMA was busy over the total CU cycles. | Percent |
VMEM utilization | Indicates what percent of the kernel’s duration the VMEM unit was busy executing instructions, including both global/generic and spill/scratch operations (see the VMEM instruction count metrics for more detail). Does not include VALU operations. Computed as the ratio of the total number of cycles spent by the scheduler issuing VMEM instructions over the total CU cycles. | Percent |
Branch utilization | Indicates what percent of the kernel’s duration the branch unit was busy executing instructions. Computed as the ratio of the total number of cycles spent by the scheduler issuing branch instructions over the total CU cycles. | Percent |
VALU active threads | Indicates the average level of divergence within a wavefront over the lifetime of the kernel. Computed as the number of work-items that were active in a wavefront during execution of each VALU instruction, time-averaged over all VALU instructions run on all wavefronts in the kernel. | Work-items |
IPC | The ratio of the total number of instructions executed on the CU over the total active CU cycles. This is also presented as a percent of the peak theoretical IPC achievable on the specific accelerator. | Instructions per cycle |
Wavefront occupancy | The time-averaged number of wavefronts resident on the accelerator over the lifetime of the kernel. Note: this metric may be inaccurate for short-running kernels (less than 1 ms). This is also presented as a percent of the peak theoretical occupancy achievable on the specific accelerator. | Wavefronts |
LDS theoretical bandwidth | Indicates the maximum number of bytes that could have been loaded from, stored to, or atomically updated in the LDS per unit time (see LDS Bandwidth example for more detail). This is also presented as a percent of the peak theoretical bandwidth achievable on the specific accelerator. | GB/s |
LDS bank conflicts/access | The ratio of the number of cycles spent in the LDS scheduler due to bank conflicts (as determined by the conflict resolution hardware) to the base number of cycles that would be spent in the LDS scheduler in a completely uncontended case. This is also presented in normalized form (i.e., the Bank Conflict Rate). | Conflicts/Access |
vL1D cache hit rate | The ratio of the number of vL1D cache line requests that hit in vL1D cache over the total number of cache line requests to the vL1D cache RAM. | Percent |
vL1D cache bandwidth | The number of bytes looked up in the vL1D cache as a result of VMEM instructions per unit time. The number of bytes is calculated as the number of cache lines requested multiplied by the cache line size. This value does not consider partial requests, so, for example, if only a single value is requested in a cache line, the data movement will still be counted as a full cache line. This is also presented as a percent of the peak theoretical bandwidth achievable on the specific accelerator. | GB/s |
L2 cache hit rate | The ratio of the number of L2 cache line requests that hit in the L2 cache over the total number of incoming cache line requests to the L2 cache. | Percent |
L2 cache bandwidth | The number of bytes looked up in the L2 cache per unit time. The number of bytes is calculated as the number of cache lines requested multiplied by the cache line size. This value does not consider partial requests, so, for example, if only a single value is requested in a cache line, the data movement will still be counted as a full cache line. This is also presented as a percent of the peak theoretical bandwidth achievable on the specific accelerator. | GB/s |
L2-fabric read BW | The number of bytes read by the L2 over the Infinity Fabric™ interface per unit time. This is also presented as a percent of the peak theoretical bandwidth achievable on the specific accelerator. | GB/s |
L2-fabric write and atomic BW | The number of bytes sent by the L2 over the Infinity Fabric interface by write and atomic operations per unit time. This is also presented as a percent of the peak theoretical bandwidth achievable on the specific accelerator. | GB/s |
L2-fabric read latency | The time-averaged number of cycles read requests spent in Infinity Fabric before data was returned to the L2. | Cycles |
L2-fabric write latency | The time-averaged number of cycles write requests spent in Infinity Fabric before a completion acknowledgement was returned to the L2. | Cycles |
sL1D cache hit rate | The percent of sL1D requests that hit on a previously loaded line in the cache. Calculated as the ratio of the number of sL1D requests that hit over the number of all sL1D requests. | Percent |
sL1D bandwidth | The number of bytes looked up in the sL1D cache per unit time. This is also presented as a percent of the peak theoretical bandwidth achievable on the specific accelerator. | GB/s |
L1I bandwidth | The number of bytes looked up in the L1I cache per unit time. This is also presented as a percent of the peak theoretical bandwidth achievable on the specific accelerator. | GB/s |
L1I cache hit rate | The percent of L1I requests that hit on a previously loaded line in the cache. Calculated as the ratio of the number of L1I requests that hit over the number of all L1I requests. | Percent |
L1I fetch latency | The average number of cycles spent to fetch instructions to a CU. | Cycles |
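
Several of the metrics in the table are simple ratios or products of raw counts. The sketch below shows the arithmetic for a utilization percentage, a cache hit rate, and a cache bandwidth that counts whole cache lines. It is illustrative only: the function names, argument names, and sample values (including the 64-byte line size) are hypothetical and do not correspond to actual hardware counter names or to the profiler's own code.

```python
def ratio_percent(numerator: float, denominator: float) -> float:
    """Return numerator / denominator as a percentage (0.0 if denominator is 0)."""
    if denominator == 0:
        return 0.0
    return 100.0 * numerator / denominator


def cache_bandwidth_gbs(lines_requested: int, line_size_bytes: int,
                        duration_ns: float) -> float:
    """Bandwidth from whole cache lines requested over a duration.

    Every request is counted as a full cache line, even if only part of
    the line was needed. Bytes per nanosecond equals GB/s (with 1 GB
    taken as 1e9 bytes).
    """
    if duration_ns <= 0:
        raise ValueError("duration_ns must be positive")
    return lines_requested * line_size_bytes / duration_ns


# Hypothetical sample values for a single kernel.
valu_issue_cycles = 300_000      # cycles the scheduler spent issuing VALU instructions
total_cu_cycles = 1_000_000      # total CU cycles
cache_hits = 900                 # cache line requests that hit
cache_requests = 1_200           # all cache line requests
lines_requested = 1_000_000      # cache lines looked up
line_size_bytes = 64             # assumed line size, for illustration only
kernel_duration_ns = 1_000_000   # 1 ms

valu_utilization = ratio_percent(valu_issue_cycles, total_cu_cycles)
hit_rate = ratio_percent(cache_hits, cache_requests)
bandwidth = cache_bandwidth_gbs(lines_requested, line_size_bytes,
                                kernel_duration_ns)

print(f"VALU utilization: {valu_utilization}%")
print(f"Cache hit rate:   {hit_rate}%")
print(f"Cache bandwidth:  {bandwidth} GB/s")
```

Because the bandwidth calculation multiplies by the full line size, a kernel that reads one 4-byte value per cache line is still charged the whole line, which is the partial-request behavior described for the vL1D and L2 cache bandwidth metrics.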