Grafana GUI analysis#
Find setup instructions in Setting up a Grafana server for Omniperf.
The Omniperf Grafana analysis dashboard GUI supports the following features to facilitate MI accelerator performance profiling and analysis:
System and hardware component (hardware block)
Speed-of-Light (SOL)
Multiple normalization options
Baseline comparisons
Regex-based dispatch ID filtering
Roofline analysis
Detailed performance counters and metrics per hardware component, such as:
Command Processor - Fetch (CPF) / Command Processor - Controller (CPC)
Workgroup Manager (SPI)
Shader Sequencer (SQ)
Shader Sequencer Controller (SQC)
L1 Address Processing Unit, a.k.a. Texture Addresser (TA) / L1 Backend Data Processing Unit, a.k.a. Texture Data (TD)
L1 Cache (TCP)
L2 Cache (TCC) (both aggregated and per-channel perf info)
See the full list of Omniperf’s analysis panels.
Speed-of-Light#
Speed-of-Light panels are provided at both the system level and the per-hardware-component level to help diagnose performance bottlenecks. The performance numbers of the workload under test are compared to the theoretical maximums, such as peak floating point throughput, bandwidth, and cache hit rate, to indicate the remaining room to further utilize the hardware's capability. For example, achieved FP32 FLOPS are reported as a percentage of the accelerator's peak FP32 throughput.
Normalizations#
Multiple performance number normalizations are provided to allow performance inspection in both hardware and software contexts. The following normalizations are available:
per_wave
per_cycle
per_kernel
per_second
See Normalization units to learn more about Omniperf normalizations.
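In the Grafana GUI the normalization is selected interactively; for reference, the same units can also be applied in Omniperf's CLI analyze mode. A minimal sketch, assuming a previously profiled workload at workloads/vcopy/mi200/ and that the --normal-unit option accepts the units listed above:

# Hedged sketch: re-analyze an existing profile with a chosen
# normalization unit (per_wave, per_cycle, per_kernel, or per_second).
$ omniperf analyze -p workloads/vcopy/mi200/ --normal-unit per_kernel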
Baseline comparison#
Omniperf enables baseline comparison for checking A/B effects. Currently, baseline comparison is limited to workloads profiled on the same SoC; cross-SoC comparison is in development.
For both the Current Workload and the Baseline Workload, you can independently set up the following filters to allow fine-grained comparisons:
Workload Name
GPU ID filtering (multi-selection)
Kernel Name filtering (multi-selection)
Dispatch ID filtering (regex filtering)
Omniperf Panels (multi-selection)
Regex-based dispatch ID filtering#
Omniperf allows dispatch ID filtering based on regular expressions (regex), the standard Linux string-matching syntax, to flexibly choose the kernel invocations to analyze.
For example, to inspect the dispatch range from 17 to 48, inclusive, the corresponding regex is (1[7-9]|[23]\d|4[0-8]).
Tip
Try Regex Numeric Range Generator for help generating typical number ranges.
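Such a pattern can also be sanity-checked outside the GUI. A minimal sketch, assuming a POSIX shell with seq and grep available (grep -E lacks the \d shorthand, so [0-9] stands in for it, and the anchors force whole-ID matches):

# Hypothetical check (not part of Omniperf): prints exactly the IDs
# 17 through 48 out of the candidate range 0-60.
$ seq 0 60 | grep -E '^(1[7-9]|[23][0-9]|4[0-8])$'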
Incremental profiling#
Omniperf supports incremental profiling to speed up performance analysis.
Refer to the Hardware component filtering section for this command.
By default, the entire application is profiled to collect performance counters for all hardware blocks, giving a complete view of where the workload stands in terms of performance optimization opportunities and bottlenecks.
You can choose to focus on only a few hardware components, for example the L1 cache or LDS, to closely check the effect of software optimizations without performing application replay for all other hardware components. This can save substantial compute time. In addition, prior profiling results for other hardware components are not overwritten; instead, they can be merged during import to piece together an overall profile of the system.
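As an illustration, a hedged sketch of such a focused run, assuming the -b/--block filter described in the Hardware component filtering section (the workload name and vcopy arguments are placeholders):

# Hypothetical sketch: collect counters only for the vector L1 cache (TCP)
# and the shader sequencer (SQ), skipping application replays for all
# other hardware blocks.
$ omniperf profile --name vcopy_l1_lds -b TCP SQ -- ./vcopy 1048576 256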
Color coding#
Uniform color coding applies to most visualizations, including bar graphs, tables, and diagrams, for easy inspection. As a rule of thumb, yellow means over 50%, while red means over 90%.
Global variables and configurations#
Grafana GUI import#
The Omniperf database --import option imports the raw profiling data to Grafana's backend MongoDB database. This step is only required for Grafana GUI-based performance analysis.
The default username and password for MongoDB (to be used in database mode) are as follows:
Username: temp
Password: temp123
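Connectivity with these defaults can be verified directly with a MongoDB client. A hypothetical check (not an Omniperf command), assuming the mongosh shell is installed on the client machine:

# Hypothetical connectivity check against Grafana's backend MongoDB
# instance; replace <host> with your server host, default port is 27018.
$ mongosh "mongodb://temp:temp123@<host>:27018"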
Each workload is imported to a separate database with the following naming convention:
omniperf_<team>_<database>_<soc>
For example:
omniperf_asw_vcopy_mi200
When using database mode, be sure to tailor the connection options to the machine hosting your server-side instance. Below is a sample command to import the vcopy profiling data, assuming the host machine is called dummybox.
$ omniperf database --help
usage:
omniperf database <interaction type> [connection options]
-------------------------------------------------------------------------------
Examples:
omniperf database --import -H pavii1 -u temp -t asw -w workloads/vcopy/mi200/
omniperf database --remove -H pavii1 -u temp -w omniperf_asw_sample_mi200
-------------------------------------------------------------------------------
Help:
-h, --help show this help message and exit
General Options:
-v, --version show program's version number and exit
-V, --verbose Increase output verbosity (use multiple times for higher levels)
-s, --specs Print system specs.
Interaction Type:
-i, --import Import workload to Omniperf DB
-r, --remove Remove a workload from Omniperf DB
Connection Options:
-H , --host Name or IP address of the server host.
-P , --port TCP/IP Port. (DEFAULT: 27018)
-u , --username Username for authentication.
-p , --password The user's password. (will be requested later if it's not set)
-t , --team Specify Team prefix.
-w , --workload Specify name of workload (to remove) or path to workload (to import)
--kernel-verbose Specify Kernel Name verbose level 1-5. Lower the level, shorter the kernel name. (DEFAULT: 5) (DISABLE: 5)
Omniperf import for vcopy:#
$ omniperf database --import -H dummybox -u temp -t asw -w workloads/vcopy/mi200/
___ _ __
/ _ \ _ __ ___ _ __ (_)_ __ ___ _ __ / _|
| | | | '_ ` _ \| '_ \| | '_ \ / _ \ '__| |_
| |_| | | | | | | | | | | |_) | __/ | | _|
\___/|_| |_| |_|_| |_|_| .__/ \___|_| |_|
|_|
Pulling data from /home/auser/repos/omniperf/sample/workloads/vcopy/MI200
The directory exists
Found sysinfo file
KernelName shortening enabled
Kernel name verbose level: 2
Password:
Password received
-- Conversion & Upload in Progress --
0%| | 0/11 [00:00<?, ?it/s]/home/auser/repos/omniperf/sample/workloads/vcopy/MI200/SQ_IFETCH_LEVEL.csv
9%|█████████████████▉ | 1/11 [00:00<00:01, 8.53it/s]/home/auser/repos/omniperf/sample/workloads/vcopy/MI200/pmc_perf.csv
18%|███████████████████████████████████▊ | 2/11 [00:00<00:01, 6.99it/s]/home/auser/repos/omniperf/sample/workloads/vcopy/MI200/SQ_INST_LEVEL_SMEM.csv
27%|█████████████████████████████████████████████████████▋ | 3/11 [00:00<00:01, 7.90it/s]/home/auser/repos/omniperf/sample/workloads/vcopy/MI200/SQ_LEVEL_WAVES.csv
36%|███████████████████████████████████████████████████████████████████████▋ | 4/11 [00:00<00:00, 8.56it/s]/home/auser/repos/omniperf/sample/workloads/vcopy/MI200/SQ_INST_LEVEL_LDS.csv
45%|█████████████████████████████████████████████████████████████████████████████████████████▌ | 5/11 [00:00<00:00, 9.00it/s]/home/auser/repos/omniperf/sample/workloads/vcopy/MI200/SQ_INST_LEVEL_VMEM.csv
55%|███████████████████████████████████████████████████████████████████████████████████████████████████████████▍ | 6/11 [00:00<00:00, 9.24it/s]/home/auser/repos/omniperf/sample/workloads/vcopy/MI200/sysinfo.csv
64%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎ | 7/11 [00:00<00:00, 9.37it/s]/home/auser/repos/omniperf/sample/workloads/vcopy/MI200/roofline.csv
82%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ | 9/11 [00:00<00:00, 12.60it/s]/home/auser/repos/omniperf/sample/workloads/vcopy/MI200/timestamps.csv
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:00<00:00, 11.05it/s]
9 collections added.
Workload name uploaded
-- Complete! --
Omniperf panels#
There are currently 18 main panel categories available for analyzing the compute workload performance. Each category contains several panels for close inspection of the system performance.
- Kernel Statistics
  - Kernel time histogram
  - Top ten bottleneck kernels
- System Speed-of-Light
  - Speed-of-Light
  - System Info table
- Memory Chart Analysis
- Roofline Analysis
  - FP32/FP64
  - FP16/INT8
- Command Processor
  - Command Processor - Fetch (CPF)
  - Command Processor - Controller (CPC)
- Workgroup Manager or Shader Processor Input (SPI)
  - SPI Stats
  - SPI Resource Allocations
- Wavefront Launch
  - Wavefront Launch Stats
  - Wavefront runtime stats
  - per-SE Wavefront Scheduling performance
- Wavefront Lifetime
  - Wavefront lifetime breakdown
  - per-SE wavefront life (average)
  - per-SE wavefront life (histogram)
- Wavefront Occupancy
  - per-SE wavefront occupancy
  - per-CU wavefront occupancy
- Compute Unit - Instruction Mix
  - per-wave Instruction mix
  - per-wave VALU Arithmetic instruction mix
  - per-wave MFMA Arithmetic instruction mix
- Compute Unit - Compute Pipeline
  - Speed-of-Light: Compute Pipeline
  - Arithmetic OPs count
  - Compute pipeline stats
  - Memory latencies
- LDS
  - Speed-of-Light: LDS
  - LDS stats
- Instruction Cache
  - Speed-of-Light: Instruction Cache
  - Instruction Cache Accesses
- Constant Cache
  - Speed-of-Light: Constant Cache
  - Constant Cache Accesses
  - Constant Cache - L2 Interface stats
- Texture Addresser and Texture Data
  - Texture Addresser (TA)
  - Texture Data (TD)
- L1 Cache
  - Speed-of-Light: L1 Cache
  - L1 Cache Accesses
  - L1 Cache Stalls
  - L1 - L2 Transactions
  - L1 - UTCL1 Interface stats
- L2 Cache
  - Speed-of-Light: L2 Cache
  - L2 Cache Accesses
  - L2 - EA Transactions
  - L2 - EA Stalls
- L2 Cache Per Channel Performance
  - Per-channel L2 Hit rate
  - Per-channel L1-L2 Read requests
  - Per-channel L1-L2 Write requests
  - Per-channel L1-L2 Atomic requests
  - Per-channel L2-EA Read requests
  - Per-channel L2-EA Write requests
  - Per-channel L2-EA Atomic requests
  - Per-channel L2-EA Read latency
  - Per-channel L2-EA Write latency
  - Per-channel L2-EA Atomic latency
  - Per-channel L2-EA Read stall (I/O, GMI, HBM)
  - Per-channel L2-EA Write stall (I/O, GMI, HBM, Starve)
Most panels are designed around a specific hardware component block to help you thoroughly understand its behavior. Additional panels, including custom panels, can also be added to aid performance analysis.
System Info#
Kernel Statistics#
Kernel Time Histogram#
Top Bottleneck Kernels#
Top Bottleneck Dispatches#
Current and Baseline Dispatch IDs (Filtered)#
System Speed-of-Light#
Tip
See System Speed-of-Light to learn about reported metrics.
Memory Chart Analysis#
Note
The Memory Chart Analysis supports multiple normalizations. Due to limited space, all transactions, when normalized to per_sec, default to a unit of billion transactions per second.
Empirical Roofline Analysis#
Command Processor#
Tip
See Command processor (CP) to learn about reported metrics.
Command Processor Fetcher#
Command Processor Compute#
Shader Processor Input (SPI)#
Tip
See Workgroup manager (SPI) to learn about reported metrics.
SPI Stats#
SPI Resource Allocation#
Wavefront#
Wavefront Launch Stats#
Tip
See Wavefront launch stats to learn about reported metrics.
Wavefront Runtime Stats#
Tip
See Wavefront runtime stats to learn about reported metrics.
Compute Unit - Instruction Mix#
Instruction Mix#
Tip
See Instruction mix to learn about reported metrics.
VALU Arithmetic Instruction Mix#
Tip
See VALU arithmetic instruction mix to learn about reported metrics.
MFMA Arithmetic Instruction Mix#
Tip
See MFMA instruction mix to learn about reported metrics.
VMEM Arithmetic Instruction Mix#
Tip
See VMEM instruction mix to learn about reported metrics.
Compute Unit - Compute Pipeline#
Speed-of-Light#
Tip
See Compute Speed-of-Light to learn about reported metrics.
Pipeline Stats#
Tip
See Pipeline statistics to learn about reported metrics.
Arithmetic Operations#
Tip
See Arithmetic operations to learn about reported metrics.
Instruction Cache#
Speed-of-Light#
Tip
See L1I Speed-of-Light to learn about reported metrics.
Instruction Cache Stats#
Tip
See L1I cache accesses to learn about reported metrics.
Scalar L1D Cache#
Tip
See Scalar L1 data cache (sL1D) to learn about reported metrics.
Speed-of-Light#
Tip
See Scalar L1D Speed-of-Light to learn about reported metrics.
Scalar L1D Cache Accesses#
Tip
See Scalar L1D cache accesses to learn about reported metrics.
Scalar L1D Cache - L2 Interface#
Tip
See sL1D ↔ L2 Interface to learn about reported metrics.
Texture Addresser and Texture Data#
Texture Addresser#
Tip
See Address processing unit or Texture Addresser (TA) to learn about reported metrics.
Texture Data#
Tip
See Vector L1 data-return path or Texture Data (TD) to learn about reported metrics.
Vector L1 Data Cache#
Speed-of-Light#
Tip
See vL1D Speed-of-Light to learn about reported metrics.
L1D Cache Stalls#
Tip
See vL1D cache stall metrics to learn about reported metrics.
L1D Cache Accesses#
Tip
See vL1D cache access metrics to learn about reported metrics.
L1D - L2 Transactions#
Tip
See vL1D - L2 Transaction Detail to learn more.
L1D Addr Translation#
Tip
See L1 Unified Translation Cache (UTCL1) to learn about reported metrics.
L2 Cache#
Tip
See L2 cache (TCC) to learn about reported metrics.
Speed-of-Light#
Tip
See L2 Speed-of-Light to learn about reported metrics.
L2 Cache Accesses#
Tip
See L2 cache accesses to learn about reported metrics.
L2 - Fabric Transactions#
Tip
See L2-Fabric transactions to learn about reported metrics.
L2 - Fabric Interface Stalls#
Tip
See L2-Fabric interface stalls to learn about reported metrics.
L2 Cache Per Channel#
Tip
See L2 Speed-of-Light for more information.