System Speed-of-Light

System Speed-of-Light summarizes some of the key metrics from various sections of Omniperf’s profiling report.

Warning

The theoretical maximum throughput for some metrics in this section is currently computed using the accelerator's maximum achievable clock frequency, as reported by rocminfo. This may not be realistic for all workloads.

Also, not all metrics – such as FLOP counters – are available on all AMD Instinct™ MI-series accelerators. For more detail on how operations are counted, see the FLOP counting conventions section.
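To illustrate why the clock frequency used for the peak matters, here is a minimal sketch of a percent-of-peak calculation. The function names, the compute unit count, the per-cycle operation rate, and the clock values are all hypothetical and chosen only for illustration; they are not Omniperf's internal implementation or the specifications of any particular accelerator.

```python
def peak_gflops(num_cus: int, flops_per_cu_per_cycle: int, clock_mhz: float) -> float:
    """Peak theoretical throughput in GFLOP/s at a given clock frequency."""
    # operations per cycle across all CUs, times cycles per second, scaled to giga
    return num_cus * flops_per_cu_per_cycle * clock_mhz * 1e6 / 1e9


def percent_of_peak(achieved: float, peak: float) -> float:
    """Achieved throughput as a percent of the theoretical peak."""
    return 100.0 * achieved / peak if peak > 0 else 0.0


# Hypothetical accelerator: 100 CUs, 128 FLOPs per CU per cycle (assumed values).
achieved_gflops = 4800.0

# Peak computed with the maximum clock (assumed 1500 MHz).
peak_at_max_clock = peak_gflops(100, 128, 1500.0)

# Peak computed with a lower clock the workload might actually sustain (assumed 1200 MHz).
peak_at_sustained_clock = peak_gflops(100, 128, 1200.0)

pct_vs_max = percent_of_peak(achieved_gflops, peak_at_max_clock)
pct_vs_sustained = percent_of_peak(achieved_gflops, peak_at_sustained_clock)

print(f"percent of peak at max clock:       {pct_vs_max:.2f}")
print(f"percent of peak at sustained clock: {pct_vs_sustained:.2f}")
```

With these assumed numbers, the same achieved throughput is a smaller percent of a peak computed at the maximum clock than of a peak computed at the lower sustained clock, which is why the percent-of-peak values in this section can look pessimistic for workloads that do not run at the maximum frequency.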

Metric	Description	Unit
VALU FLOPs	The total floating-point operations executed per second on the VALU. This is also presented as a percent of the peak theoretical FLOPs achievable on the specific accelerator. Note: this does not include any floating-point operations from MFMA instructions.	GFLOPs
VALU IOPs	The total integer operations executed per second on the VALU. This is also presented as a percent of the peak theoretical IOPs achievable on the specific accelerator. Note: this does not include any integer operations from MFMA instructions.	GIOPs
MFMA FLOPs (BF16)	The total number of 16-bit brain floating point MFMA operations executed per second. Note: this does not include any 16-bit brain floating point operations from VALU instructions. This is also presented as a percent of the peak theoretical BF16 MFMA operations achievable on the specific accelerator.	GFLOPs
MFMA FLOPs (F16)	The total number of 16-bit floating point MFMA operations executed per second. Note: this does not include any 16-bit floating point operations from VALU instructions. This is also presented as a percent of the peak theoretical F16 MFMA operations achievable on the specific accelerator.	GFLOPs
MFMA FLOPs (F32)	The total number of 32-bit floating point MFMA operations executed per second. Note: this does not include any 32-bit floating point operations from VALU instructions. This is also presented as a percent of the peak theoretical F32 MFMA operations achievable on the specific accelerator.	GFLOPs
MFMA FLOPs (F64)	The total number of 64-bit floating point MFMA operations executed per second. Note: this does not include any 64-bit floating point operations from VALU instructions. This is also presented as a percent of the peak theoretical F64 MFMA operations achievable on the specific accelerator.	GFLOPs
MFMA IOPs (INT8)	The total number of 8-bit integer MFMA operations executed per second. Note: this does not include any 8-bit integer operations from VALU instructions. This is also presented as a percent of the peak theoretical INT8 MFMA operations achievable on the specific accelerator.	GIOPs
SALU utilization	Indicates what percent of the kernel’s duration the SALU was busy executing instructions. Computed as the ratio of the total number of cycles spent by the scheduler issuing SALU or SMEM instructions over the total CU cycles.	Percent
VALU utilization	Indicates what percent of the kernel’s duration the VALU was busy executing instructions. Does not include VMEM operations. Computed as the ratio of the total number of cycles spent by the scheduler issuing VALU instructions over the total CU cycles.	Percent
MFMA utilization	Indicates what percent of the kernel’s duration the MFMA unit was busy executing instructions. Computed as the ratio of the total number of cycles the MFMA was busy over the total CU cycles.	Percent
VMEM utilization	Indicates what percent of the kernel’s duration the VMEM unit was busy executing instructions, including both global/generic and spill/scratch operations (see the VMEM instruction count metrics for more detail). Does not include VALU operations. Computed as the ratio of the total number of cycles spent by the scheduler issuing VMEM instructions over the total CU cycles.	Percent
Branch utilization	Indicates what percent of the kernel’s duration the branch unit was busy executing instructions. Computed as the ratio of the total number of cycles spent by the scheduler issuing branch instructions over the total CU cycles.	Percent
VALU active threads	Indicates the average level of divergence within a wavefront over the lifetime of the kernel. The number of work-items that were active in a wavefront during execution of each VALU instruction, time-averaged over all VALU instructions run on all wavefronts in the kernel.	Work-items
IPC	The ratio of the total number of instructions executed on the CU over the total active CU cycles. This is also presented as a percent of the peak theoretical IPC achievable on the specific accelerator.	Instructions per cycle
Wavefront occupancy	The time-averaged number of wavefronts resident on the accelerator over the lifetime of the kernel. Note: this metric may be inaccurate for short-running kernels (less than 1 ms). This is also presented as a percent of the peak theoretical occupancy achievable on the specific accelerator.	Wavefronts
LDS theoretical bandwidth	Indicates the maximum number of bytes that could have been loaded from, stored to, or atomically updated in the LDS per unit time (see LDS Bandwidth example for more detail). This is also presented as a percent of the peak theoretical bandwidth achievable on the specific accelerator.	GB/s
LDS bank conflicts/access	The ratio of the number of cycles spent in the LDS scheduler due to bank conflicts (as determined by the conflict resolution hardware) to the base number of cycles that would be spent in the LDS scheduler in a completely uncontended case. This is also presented in normalized form (i.e., the Bank Conflict Rate).	Conflicts/Access
vL1D cache hit rate	The ratio of the number of vL1D cache line requests that hit in vL1D cache over the total number of cache line requests to the vL1D cache RAM.	Percent
vL1D cache bandwidth	The number of bytes looked up in the vL1D cache as a result of VMEM instructions per unit time. The number of bytes is calculated as the number of cache lines requested multiplied by the cache line size. This value does not consider partial requests; for example, if only a single value is requested in a cache line, the data movement is still counted as a full cache line. This is also presented as a percent of the peak theoretical bandwidth achievable on the specific accelerator.	GB/s
L2 cache hit rate	The ratio of the number of L2 cache line requests that hit in the L2 cache over the total number of incoming cache line requests to the L2 cache.	Percent
L2 cache bandwidth	The number of bytes looked up in the L2 cache per unit time. The number of bytes is calculated as the number of cache lines requested multiplied by the cache line size. This value does not consider partial requests; for example, if only a single value is requested in a cache line, the data movement is still counted as a full cache line. This is also presented as a percent of the peak theoretical bandwidth achievable on the specific accelerator.	GB/s
L2-fabric read BW	The number of bytes read by the L2 over the Infinity Fabric™ interface per unit time. This is also presented as a percent of the peak theoretical bandwidth achievable on the specific accelerator.	GB/s
L2-fabric write and atomic BW	The number of bytes sent by the L2 over the Infinity Fabric interface by write and atomic operations per unit time. This is also presented as a percent of the peak theoretical bandwidth achievable on the specific accelerator.	GB/s
L2-fabric read latency	The time-averaged number of cycles read requests spent in Infinity Fabric before data was returned to the L2.	Cycles
L2-fabric write latency	The time-averaged number of cycles write requests spent in Infinity Fabric before a completion acknowledgement was returned to the L2.	Cycles
sL1D cache hit rate	The percent of sL1D requests that hit on a previously loaded line in the cache. Calculated as the ratio of the number of sL1D requests that hit over the number of all sL1D requests.	Percent
sL1D bandwidth	The number of bytes looked up in the sL1D cache per unit time. This is also presented as a percent of the peak theoretical bandwidth achievable on the specific accelerator.	GB/s
L1I bandwidth	The number of bytes looked up in the L1I cache per unit time. This is also presented as a percent of the peak theoretical bandwidth achievable on the specific accelerator.	GB/s
L1I cache hit rate	The percent of L1I requests that hit on a previously loaded line in the cache. Calculated as the ratio of the number of L1I requests that hit over the number of all L1I requests.	Percent
L1I fetch latency	The average number of cycles spent to fetch instructions to a CU.	Cycles
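
Several metrics in the table above are simple ratios of counted events. The following is a minimal sketch of three of them: a utilization percentage, a cache hit rate, and a cache bandwidth derived from full cache lines. The function and parameter names are hypothetical, the input values are made up, and the 64-byte cache line size is an assumption used only for illustration; none of this reflects Omniperf's actual counter names or implementation.

```python
def utilization_pct(busy_or_issue_cycles: int, total_cu_cycles: int) -> float:
    """Percent of total CU cycles in which a unit was busy or issuing instructions."""
    if total_cu_cycles == 0:
        return 0.0
    return 100.0 * busy_or_issue_cycles / total_cu_cycles


def hit_rate_pct(hits: int, total_requests: int) -> float:
    """Percent of cache line requests that hit in the cache."""
    if total_requests == 0:
        return 0.0
    return 100.0 * hits / total_requests


def cache_bandwidth_gbps(cache_lines_requested: int,
                         cache_line_bytes: int,
                         duration_ns: float) -> float:
    """Bytes looked up per unit time, counting every request as a full cache line.

    One byte per nanosecond equals one GB/s, so no further scaling is needed.
    """
    if duration_ns <= 0:
        return 0.0
    return cache_lines_requested * cache_line_bytes / duration_ns


# Made-up counter values for a single kernel.
valu_util = utilization_pct(busy_or_issue_cycles=250, total_cu_cycles=1000)
l2_hit_rate = hit_rate_pct(hits=900, total_requests=1000)
# 1,000,000 requests of an assumed 64-byte line over 2 ms (2,000,000 ns).
l1_bandwidth = cache_bandwidth_gbps(1_000_000, 64, 2_000_000)

print(valu_util, l2_hit_rate, l1_bandwidth)
```

Because the bandwidth helper multiplies the request count by the full line size, a kernel that reads a single 4-byte value from each line is still charged the whole line, matching the "does not consider partial requests" note in the cache bandwidth rows.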