Grafana GUI analysis#
Find setup instructions in Setting up Grafana server for ROCm Compute Profiler.
The ROCm Compute Profiler Grafana analysis dashboard GUI supports the following features to facilitate MI accelerator performance profiling and analysis:
- System and hardware component (hardware block) 
- Speed-of-Light (SOL) 
- Multiple normalization options 
- Baseline comparisons 
- Regex-based dispatch ID filtering 
- Roofline analysis 
- Detailed performance counters and metrics per hardware component, such as:
  - Command Processor - Fetch (CPF) / Command Processor - Controller (CPC)
  - Workgroup Manager (SPI)
  - Shader Sequencer (SQ)
  - Shader Sequencer Controller (SQC)
  - L1 Address Processing Unit, a.k.a. Texture Addresser (TA) / L1 Backend Data Processing Unit, a.k.a. Texture Data (TD)
  - L1 Cache (TCP)
  - L2 Cache (TCC) (both aggregated and per-channel perf info)
 
See the full list of ROCm Compute Profiler’s analysis panels.
Speed-of-Light#
Speed-of-Light panels are provided at both the system and hardware-component levels to help diagnose performance bottlenecks. Performance numbers for the workload under test are compared to their theoretical maximums – floating point operations, bandwidth, cache hit rate, and so on – to indicate how much room remains to further utilize the hardware capability.
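For instance, with purely illustrative numbers, a kernel sustaining 1.2 TB/s of HBM bandwidth on an accelerator whose theoretical peak is 1.6 TB/s would report an HBM Speed-of-Light of 1.2 / 1.6 = 75%, leaving 25% of the bandwidth headroom unused.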
Normalizations#
Multiple performance-number normalizations are provided to allow performance inspection within both hardware and software contexts. The following normalizations are available:
- per_wave
- per_cycle
- per_kernel
- per_second
See Normalization units to learn more about ROCm Compute Profiler normalizations.
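These units are also selectable outside the GUI. As a minimal sketch – assuming the analyze mode's --normal-unit option, which accepts the values listed above, and the sample vcopy workload path used later in this page – metrics can be renormalized from the command line:
$ rocprof-compute analyze -p workloads/vcopy/mi200/ --normal-unit per_kernel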
Baseline comparison#
ROCm Compute Profiler enables baseline comparison so you can check the effect of A/B changes. Baseline comparison is currently limited to workloads profiled on the same SoC; cross-SoC comparison is in development.
For both the Current Workload and the Baseline Workload, you can independently set up the following filters to allow fine-grained comparisons (a command-line analogue is sketched after this list):
- Workload Name 
- GPU ID filtering (multi-selection) 
- Kernel Name filtering (multi-selection) 
- Dispatch ID filtering (regex filtering) 
- ROCm Compute Profiler Panels (multi-selection) 
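For reference, a similar two-workload comparison can also be run outside the GUI. A minimal sketch, assuming the analyze mode accepts a second --path argument as the baseline; the second directory here (a hypothetical optimized variant of vcopy) is an illustration only:
$ rocprof-compute analyze -p workloads/vcopy/mi200/ -p workloads/vcopy_opt/mi200/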
Regex-based dispatch ID filtering#
ROCm Compute Profiler supports dispatch ID filtering based on regular expressions (regex), a standard string-matching syntax, to flexibly choose which kernel invocations to inspect.
For example, to inspect the dispatch range from 17 to 48, inclusive, the corresponding regex is (1[7-9]|[23]\d|4[0-8]).
Tip
Try Regex Numeric Range Generator for help generating typical number ranges.
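Before pasting a range regex into the filter, you can sanity-check it against candidate dispatch IDs with standard shell tools. A quick sketch (note that grep -E has no \d shorthand, so [0-9] is spelled out):
$ seq 0 60 | grep -E '^(1[7-9]|[23][0-9]|4[0-8])$'
This prints exactly the IDs 17 through 48.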
Incremental profiling#
ROCm Compute Profiler supports incremental profiling to speed up performance analysis.
Refer to the Hardware component filtering section for this command.
By default, the entire application is profiled to collect performance counters for all hardware blocks, giving a complete view of where the workload stands in terms of performance optimization opportunities and bottlenecks.
You can choose to focus on only a few hardware components – for example, the L1 cache or LDS – to closely check the effect of software optimizations without replaying the application for all other hardware components. This saves substantial profiling time. In addition, prior profiling results for other hardware components are not overwritten; instead, they can be merged during import to piece together an overall profile of the system.
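As an illustration – a sketch assuming the profile mode's hardware-block filter described in the Hardware component filtering section, plus a hypothetical vcopy binary and arguments – a follow-up run restricted to the SQ block might look like:
$ rocprof-compute profile -n vcopy -b SQ -- ./vcopy 1048576 256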
Color coding#
Uniform color coding applies to most visualizations – including bar graphs, tables, and diagrams – for easy inspection. As a rule of thumb, yellow means over 50%, while red means over 90%.
Global variables and configurations#
 
Grafana GUI import#
The ROCm Compute Profiler database --import option imports the raw profiling data to
Grafana’s backend MongoDB database. This step is only required for Grafana
GUI-based performance analysis.
Default username and password for MongoDB (to be used in database mode) are as follows:
- Username: temp
- Password: temp123
Each workload is imported to a separate database with the following naming convention:
rocprofiler-compute_<team>_<database>_<soc>
For example:
rocprofiler-compute_asw_vcopy_mi200
When using database mode, be sure to tailor the connection options to the machine hosting your server-side instance. Below is a sample command to import the vcopy profiling data, assuming the host machine is called dummybox.
$ rocprof-compute database --help
usage:
rocprof-compute database <interaction type> [connection options]
-------------------------------------------------------------------------------
Examples:
        rocprof-compute database --import -H pavii1 -u temp -t asw -w workloads/vcopy/mi200/
        rocprof-compute database --remove -H pavii1 -u temp -w rocprofiler-compute_asw_sample_mi200
-------------------------------------------------------------------------------
Help:
  -h, --help         show this help message and exit
General Options:
  -v, --version      show program's version number and exit
  -V, --verbose      Increase output verbosity (use multiple times for higher levels)
  -s, --specs        Print system specs.
Interaction Type:
  -i, --import                                  Import workload to ROCm Compute Profiler DB
  -r, --remove                                  Remove a workload from ROCm Compute Profiler DB
Connection Options:
  -H , --host                                   Name or IP address of the server host.
  -P , --port                                   TCP/IP Port. (DEFAULT: 27018)
  -u , --username                               Username for authentication.
  -p , --password                               The user's password. (will be requested later if it's not set)
  -t , --team                                   Specify Team prefix.
  -w , --workload                               Specify name of workload (to remove) or path to workload (to import)
  --kernel-verbose              Specify Kernel Name verbose level 1-5. Lower the level, shorter the kernel name. (DEFAULT: 5) (DISABLE: 5)
ROCm Compute Profiler import for vcopy:#
$ rocprof-compute database --import -H dummybox -u temp -t asw -w workloads/vcopy/mi200/
                                 __                                       _
 _ __ ___   ___ _ __  _ __ ___  / _|       ___ ___  _ __ ___  _ __  _   _| |_ ___
| '__/ _ \ / __| '_ \| '__/ _ \| |_ _____ / __/ _ \| '_ ` _ \| '_ \| | | | __/ _ \
| | | (_) | (__| |_) | | | (_) |  _|_____| (_| (_) | | | | | | |_) | |_| | ||  __/
|_|  \___/ \___| .__/|_|  \___/|_|        \___\___/|_| |_| |_| .__/ \__,_|\__\___|
               |_|                                           |_|
Pulling data from  /home/auser/repos/rocprofiler-compute/sample/workloads/vcopy/MI200
The directory exists
Found sysinfo file
KernelName shortening enabled
Kernel name verbose level: 2
Password:
Password received
-- Conversion & Upload in Progress --
  0%|                                                                                                                                                                                                             | 0/11 [00:00<?, ?it/s]/home/auser/repos/rocprofiler-compute/sample/workloads/vcopy/MI200/SQ_IFETCH_LEVEL.csv
  9%|█████████████████▉                                                                                                                                                                                   | 1/11 [00:00<00:01,  8.53it/s]/home/auser/repos/rocprofiler-compute/sample/workloads/vcopy/MI200/pmc_perf.csv
 18%|███████████████████████████████████▊                                                                                                                                                                 | 2/11 [00:00<00:01,  6.99it/s]/home/auser/repos/rocprofiler-compute/sample/workloads/vcopy/MI200/SQ_INST_LEVEL_SMEM.csv
 27%|█████████████████████████████████████████████████████▋                                                                                                                                               | 3/11 [00:00<00:01,  7.90it/s]/home/auser/repos/rocprofiler-compute/sample/workloads/vcopy/MI200/SQ_LEVEL_WAVES.csv
 36%|███████████████████████████████████████████████████████████████████████▋                                                                                                                             | 4/11 [00:00<00:00,  8.56it/s]/home/auser/repos/rocprofiler-compute/sample/workloads/vcopy/MI200/SQ_INST_LEVEL_LDS.csv
 45%|█████████████████████████████████████████████████████████████████████████████████████████▌                                                                                                           | 5/11 [00:00<00:00,  9.00it/s]/home/auser/repos/rocprofiler-compute/sample/workloads/vcopy/MI200/SQ_INST_LEVEL_VMEM.csv
 55%|███████████████████████████████████████████████████████████████████████████████████████████████████████████▍                                                                                         | 6/11 [00:00<00:00,  9.24it/s]/home/auser/repos/rocprofiler-compute/sample/workloads/vcopy/MI200/sysinfo.csv
 64%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎                                                                       | 7/11 [00:00<00:00,  9.37it/s]/home/auser/repos/rocprofiler-compute/sample/workloads/vcopy/MI200/roofline.csv
 82%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏                                   | 9/11 [00:00<00:00, 12.60it/s]/home/auser/repos/rocprofiler-compute/sample/workloads/vcopy/MI200/timestamps.csv
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:00<00:00, 11.05it/s]
9 collections added.
Workload name uploaded
-- Complete! --
ROCm Compute Profiler panels#
There are currently 18 main panel categories available for analyzing compute workload performance. Each category contains several panels for close inspection of system performance.
- Kernel Statistics
  - Kernel time histogram
  - Top ten bottleneck kernels
- System Speed-of-Light
  - Speed-of-Light
  - System Info table
- Memory Chart Analysis
- Roofline Analysis
  - FP32/FP64
  - FP16/INT8
- Command Processor
  - Command Processor - Fetch (CPF)
  - Command Processor - Controller (CPC)
- Workgroup Manager or Shader Processor Input (SPI)
  - SPI Stats
  - SPI Resource Allocations
- Wavefront Launch
  - Wavefront Launch Stats
  - Wavefront runtime stats
  - per-SE Wavefront Scheduling performance
- Wavefront Lifetime
  - Wavefront lifetime breakdown
  - per-SE wavefront life (average)
  - per-SE wavefront life (histogram)
- Wavefront Occupancy
  - per-SE wavefront occupancy
  - per-CU wavefront occupancy
- Compute Unit - Instruction Mix
  - per-wave Instruction mix
  - per-wave VALU Arithmetic instruction mix
  - per-wave MFMA Arithmetic instruction mix
- Compute Unit - Compute Pipeline
  - Speed-of-Light: Compute Pipeline
  - Arithmetic OPs count
  - Compute pipeline stats
  - Memory latencies
- Local Data Share (LDS)
  - Speed-of-Light: LDS
  - LDS stats
- Instruction Cache
  - Speed-of-Light: Instruction Cache
  - Instruction Cache Accesses
- Constant Cache
  - Speed-of-Light: Constant Cache
  - Constant Cache Accesses
  - Constant Cache - L2 Interface stats
- Texture Addresser and Texture Data
  - Texture Addresser (TA)
  - Texture Data (TD)
- L1 Cache
  - Speed-of-Light: L1 Cache
  - L1 Cache Accesses
  - L1 Cache Stalls
  - L1 - L2 Transactions
  - L1 - UTCL1 Interface stats
- L2 Cache
  - Speed-of-Light: L2 Cache
  - L2 Cache Accesses
  - L2 - EA Transactions
  - L2 - EA Stalls
- L2 Cache Per Channel Performance
  - Per-channel L2 Hit rate
  - Per-channel L1-L2 Read requests
  - Per-channel L1-L2 Write Requests
  - Per-channel L1-L2 Atomic Requests
  - Per-channel L2-EA Read requests
  - Per-channel L2-EA Write requests
  - Per-channel L2-EA Atomic requests
  - Per-channel L2-EA Read latency
  - Per-channel L2-EA Write latency
  - Per-channel L2-EA Atomic latency
  - Per-channel L2-EA Read stall (I/O, GMI, HBM)
  - Per-channel L2-EA Write stall (I/O, GMI, HBM, Starve)
Most panels are designed around a specific hardware component block to thoroughly understand its behavior. Additional panels, including custom panels, can also be added to aid performance analysis.
System Info#
 
Fig. 7 System details logged from the host machine.#
Kernel Statistics#
Kernel Time Histogram#
 
Fig. 8 Mapping application kernel launches to execution duration.#
Top Bottleneck Kernels#
 
Fig. 9 Top N kernels and relevant statistics. Sorted by total duration.#
Top Bottleneck Dispatches#
 
Fig. 10 Top N kernel dispatches and relevant statistics. Sorted by total duration.#
Current and Baseline Dispatch IDs (Filtered)#
 
Fig. 11 List of all kernel dispatches.#
System Speed-of-Light#
 
Fig. 12 Key metrics from various sections of ROCm Compute Profiler’s profiling report.#
Tip
See System Speed-of-Light to learn about reported metrics.
Memory Chart Analysis#
Note
The Memory Chart Analysis supports multiple normalizations. Due to limited
space, all transactions, when normalized to per_sec, default to a unit of
billions of transactions per second.
 
Fig. 13 A graphical representation of performance data for memory blocks on the GPU.#
Empirical Roofline Analysis#
 
Fig. 14 Visualize achieved performance relative to a benchmarked peak performance.#
Command Processor#
Tip
See Command processor (CP) to learn about reported metrics.
Command Processor Fetcher#
 
Fig. 15 Fetches commands out of memory to hand them over to the Command Processor Compute (CPC) for processing.#
Command Processor Compute#
 
Fig. 16 The micro-controller running the command processing firmware that decodes the fetched commands, and (for kernels) passes them to the Workgroup Managers (SPIs) for scheduling.#
Shader Processor Input (SPI)#
Tip
See Workgroup manager (SPI) to learn about reported metrics.
SPI Stats#
 
SPI Resource Allocation#
 
Wavefront#
Wavefront Launch Stats#
 
Fig. 17 General information about the kernel launch.#
Tip
See Wavefront launch stats to learn about reported metrics.
Wavefront Runtime Stats#
 
Fig. 18 High-level overview of the execution of wavefronts in a kernel.#
Tip
See Wavefront runtime stats to learn about reported metrics.
Compute Unit - Instruction Mix#
Instruction Mix#
 
Fig. 19 Breakdown of the various types of instructions executed by the user’s kernel, and which pipelines on the Compute Unit (CU) they were executed on.#
Tip
See Instruction mix to learn about reported metrics.
VALU Arithmetic Instruction Mix#
 
Fig. 20 The various types of vector instructions that were issued to the vector arithmetic logic unit (VALU).#
Tip
See VALU arithmetic instruction mix to learn about reported metrics.
MFMA Arithmetic Instruction Mix#
 
Fig. 21 The types of Matrix Fused Multiply-Add (MFMA) instructions that were issued.#
Tip
See MFMA instruction mix to learn about reported metrics.
VMEM Arithmetic Instruction Mix#
 
Fig. 22 The types of vector memory (VMEM) instructions that were issued.#
Tip
See VMEM instruction mix to learn about reported metrics.
Compute Unit - Compute Pipeline#
Speed-of-Light#
 
Fig. 23 The number of floating-point and integer operations executed on the vector arithmetic logic unit (VALU) and Matrix Fused Multiply-Add (MFMA) units in various precisions.#
Tip
See Compute Speed-of-Light to learn about reported metrics.
Pipeline Stats#
 
Fig. 24 More detailed metrics to analyze the several independent pipelines found in the Compute Unit (CU).#
Tip
See Pipeline statistics to learn about reported metrics.
Arithmetic Operations#
 
Fig. 25 The total number of floating-point and integer operations executed in various precisions.#
Tip
See Arithmetic operations to learn about reported metrics.
Instruction Cache#
Speed-of-Light#
 
Fig. 28 Key metrics of the L1 Instruction (L1I) cache as a comparison with the peak achievable values of those metrics.#
Tip
See L1I Speed-of-Light to learn about reported metrics.
Instruction Cache Stats#
 
Fig. 29 More detail on the hit/miss statistics of the L1 Instruction (L1I) cache.#
Tip
See L1I cache accesses to learn about reported metrics.
Scalar L1D Cache#
Tip
See Scalar L1 data cache (sL1D) to learn about reported metrics.
Speed-of-Light#
 
Fig. 30 Key metrics of the Scalar L1 Data (sL1D) cache as a comparison with the peak achievable values of those metrics.#
Tip
See Scalar L1D Speed-of-Light to learn about reported metrics.
Scalar L1D Cache Accesses#
 
Fig. 31 More detail on the types of accesses made to the Scalar L1 Data (sL1D) cache, and the hit/miss statistics.#
Tip
See Scalar L1D cache accesses to learn about reported metrics.
Scalar L1D Cache - L2 Interface#
 
Fig. 32 More detail on the data requested across the Scalar L1 Data (sL1D) cache <-> L2 interface.#
Tip
See sL1D ↔ L2 Interface to learn about reported metrics.
Texture Address and Texture Data#
Texture Addresser#
 
Fig. 33 Metrics specific to the texture addresser (TA), which receives commands (e.g., instructions) and write/atomic data from the Compute Unit (CU) and coalesces them into fewer requests for the cache to process.#
Tip
See Address processing unit or Texture Addresser (TA) to learn about reported metrics.
Texture Data#
 
Fig. 34 Metrics specific to texture data (TD) which routes data back to the requesting Compute Unit (CU).#
Tip
See Vector L1 data-return path or Texture Data (TD) to learn about reported metrics.
Vector L1 Data Cache#
Speed-of-Light#
 
Fig. 35 Key metrics of the vector L1 data (vL1D) cache as a comparison with the peak achievable values of those metrics.#
Tip
See vL1D Speed-of-Light to learn about reported metrics.
L1D Cache Stalls#
 
Fig. 36 More detail on where vector L1 data (vL1D) cache is stalled in the pipeline, which may indicate performance limiters of the cache.#
Tip
See vL1D cache stall metrics to learn about reported metrics.
L1D Cache Accesses#
 
Fig. 37 The type of requests incoming from the cache front-end, the number of requests that were serviced by the vector L1 data (vL1D) cache, and the number & type of outgoing requests to the L2 cache.#
Tip
See vL1D cache access metrics to learn about reported metrics.
L1D - L2 Transactions#
 
Fig. 38 A more granular look at the types of requests made to the L2 cache.#
Tip
See vL1D - L2 Transaction Detail to learn more.
L1D Addr Translation#
 
Fig. 39 After a vector memory instruction has been processed/coalesced by the address processing unit of the vector L1 data (vL1D) cache, it must be translated from a virtual to physical address. These metrics provide more details on the L1 Translation Lookaside Buffer (TLB) which handles this process.#
Tip
See L1 Unified Translation Cache (UTCL1) to learn about reported metrics.
L2 Cache#
Tip
See L2 cache (TCC) to learn about reported metrics.
Speed-of-Light#
 
Fig. 40 Key metrics about the performance of the L2 cache, aggregated over all the L2 channels, as a comparison with the peak achievable values of those metrics.#
Tip
See L2 Speed-of-Light to learn about reported metrics.
L2 Cache Accesses#
 
Fig. 41 Incoming requests to the L2 cache from the vector L1 data (vL1D) cache and other clients (e.g., the sL1D and L1I caches).#
Tip
See L2 cache accesses to learn about reported metrics.
L2 - Fabric Transactions#
 
Fig. 42 More detail on the flow of requests through Infinity Fabric™.#
Tip
See L2-Fabric transactions to learn about reported metrics.
L2 - Fabric Interface Stalls#
 
Fig. 43 A breakdown of what types of requests in a kernel caused a stall (e.g., read vs write), and to which locations (e.g., to the accelerator’s local memory, or to remote accelerators/CPUs).#
Tip
See L2-Fabric interface stalls to learn about reported metrics.
L2 Cache Per Channel#
Tip
See L2 Speed-of-Light for more information.
Aggregate Stats#
 
Fig. 44 L2 Cache per channel performance at a glance. Metrics are aggregated over all available channels.#
 
