Using rocprofv2#

Note

rocprofv2 is considered beta software.

rocprofv2 is a command-line interface tool (CLI) that lets you profile AMD GPU applications without any requirement of source code modification. The usage of rocprofv2 along with various command-line arguments is described in the following sections.

To see all the rocprofv2 options, refer to rocprofv2 command reference, or run the following from the command line:

rocprofv2 --help

Application tracing#

Tracing of application and hardware events, is a primary feature of the rocprofv2 command. The various options for tracing HIP/HSA API, asynchronous activity, and kernel dispatches are described in the following table:

Tracing mode

Option

Usage

HIP API tracing

--hip-api

rocprofv2 --hip-api <app_relative_path>

Combined HIP API and asynchronous activity tracing

--hip-activity
or --hip-trace

rocprofv2 --hip-activity <app_relative_path>

HSA API tracing

--hip-api

rocprofv2 --hsa-api <app_relative_path>

Combined HSA API and asynchronous activity tracing

--hip-activity
or --hsa-trace

rocprofv2 --hsa-api <app_relative_path>

ROCTx API tracing

--roctx-trace

rocprofv2 --roctx-trace <app_relative_path>

Kernel dispatches tracing

--kernel-trace

rocprofv2 --kernel-trace <app_relative_path>

All tracing modes combined

--sys-trace

rocprofv2 --sys-trace <app_relative_path>

Note

By default, the output of these options is directed to stdout unless the -o option is also specified.

To generate output from these trace options, use one of the supported plugins that generate output in a specific format, as explained in Formatting output using plugins. The default plugin is the file plugin that generates a CSV file returned to stdout, or returned to a file when used with -o option.

rocprofv2 supports API tracing at both HIP and HSA level. In general, HIP APIs directly interact with the user program. It is easier to analyze HIP traces as you can directly map the traces to the program. HSA API tracing is more suited for advanced users who want to understand the application behavior at the lower level.

Both HIP and HSA APIs support asynchronous behavior (e.g., asynchronous memory copy). If trace collection is triggered using either --hip-api or --hsa-api, the trace records only the start, stop, and duration of API events, but not the execution time of associated actions like memory copy. To record the duration of asynchronous activities, use --hip-activity and --hsa-activity options, which record both the API events and asynchronous events.

Visualize tracing results#

You can view the traces generated by rocprofv2 using the Perfetto UI that enables you to view and analyze traces in a web browser. To begin go to Perfetto UI, select Open trace file from the left-side menu, and select the ROCProfiler trace file to view.

The following is a screenshot from the Perfetto interface. The tasks are organized in a Gantt chart style with the x-axis representing time and each rectangle representing the start and the end time of a task. The tasks are organized in rows. In the figure is the HIP API, HSA API, a queue, and a stream.

Viewing HIP Trace

Fig. 9 Visualizing Traces Generated Using sys-trace#

Tip

To enlarge the image, right click on the image and use the Open image in new tab option.

Kernel profiling#

As explained in rocprof-counters application tracing lets you evaluate the timeline of application events, but is little help in providing insight into kernel execution details. The kernel profiling functionality lets you select kernels for profiling and choose the basic counters or derived metrics to be collected for each kernel execution, thus providing a greater insight into hardware performance.

To check the supported performance counters and metrics, use:

rocprofv2 --list-counters

The following is a sample output from the --list-counters option. The output has been truncated for explanation:

gfx1030:0 : SQ_WAVES
: Count number of waves sent to SQs. {emulated, global, C1}
block SQ can only handle 8 counters at a time

The fields in the output are:

  • gfx1030:0 - The GPU architecture and GPU ID (separated by colon). The GPU ID needs to be specified as there might be multiple GPUs in the system.

  • SQ_WAVES - The counter name. Typically, the first token before the first underscore is the GPU block name. Here, SQ is the block that is responsible for managing wavefronts and issuing instructions.

Note

For more information on the performance counters available on AMD GPUs, refer to the GPU architecture documentation.

Input file#

To collect basic counters and derived metrics, define the profiling scope in an input file, and specify the file on the command line:

rocprofv2 -i input.txt <app_relative_path>

An input file is a text file that can be supplied to rocprofv2 for basic counter and derived metric collection. It contains the list of basic counters or derived metrics to be collected.

Sample Input File:

pmc: SQ_WAVES TA_UTIL

The fields in the input file are detailed in Input File.

PMC: The rows in the text file beginning with pmc: are the group of basic counters or derived metrics the user is interested in collecting. The basic counters or derived metrics can be selected from the output generated by --list-counters option.

The number of basic counters or derived metrics that can be collected in one run of profiling is limited by the GPU hardware resources. If too many counters/metrics are selected, the kernels need to be executed multiple times to collect the counters/metrics. For multi-pass execution, include multiple rows of pmc: in the input file. Counters or metrics in each pmc: row can be collected in each run of the kernel.

GPU: The row beginning with the keyword gpu: specifies the GPU(s) on which the hardware counters are to be collected. This enables the support for profiling multiple GPUs. You can specify multiple GPUs separated by comma such as gpu: 1,3.

Kernel: The row beginning with the kernel: keyword specifies the names of kernels to be profiled.

Range: The row beginning with the keyword range: specifies the range of kernel dispatches. Specifying range is helpful in cases where the application causes multiple kernel dispatches and users want to filter some kernel dispatches. In the above example, the range: 0:1 depicts that one kernel is profiled.

Kernel profiling output#

This section discusses the kernel profiling output generated using the Input File. rocprofv2 reports one value per metric per kernel in the output. You can generate the output in desired format as described in Formatting output using plugins. If no plugin is specified while generating the output, the result is dumped on the command-line.

The following sample output is generated using the file plugin. Each row of the file is an instance of kernel execution.

For each kernel, basic information (e.g., GPU_ID, SGPR, PID, etc.) and performance counters (specified in the input file) values are listed. The information is generated in the format of field name and value.

$ rocprofv2 -i input.txt --plugin file -o result MatrixTranspose

$ cat results_result.csv

Dispatch_ID,GPU_ID,Queue_ID,Queue_Index,PID,TID,GRD,WGR,LDS,SCR,Arch_VGPR,ACCUM_VGPR,
SGPR,Wave_Size,SIG,OBJ,Kernel_Name,Start_Timestamp,End_Timestamp,Correlation_ID,
SQ_WAVES,GRBM_COUNT,GRBM_GUI_ACTIVE,SQ_INSTS_VALU,FETCH_SIZE

1,64700,1,0,353,353,1048576,16,0,0,8,0,16,64,140356026185088,1,"matrixTranspose(float*, float*, int)
(.kd)",7,30064771072,0,65536.000000,398333.000000,398333.000000,917504.000000,4136.000000

2,64700,1,2,353,353,1048576,16,0,0,8,0,16,64,140356026184832,2,"matrixTranspose(float*,
float*, int)
(.kd)",7,30064771072,0,65536.000000,586424.000000,586424.000000,917504.000000,4130.437500

3,64700,1,4,353,353,1048576,16,0,0,8,0,16,64,140356026184576,3,"matrixTranspose(float*,
float*, int)
(.kd)",7,30064771072,0,65536.000000,392460.000000,392460.000000,917504.000000,4129.937500

The fields in the output file are:

Output fields

Description

Dispatch_ID

Kernel’s dispatch Id

GPU_ID

GPU identifier to which the kernel was submitted

Queue_ID

ROCm queue unique identifier to which the kernel was submitted

Queue_Index

ROCm queue write index for the submitted AQL packet

PID

System application process id that submitted the kernel

TID

System application thread id that submitted the kernel

GRD

Kernel’s grid size

WGR

Kernel’s work group size

LDS

Kernel’s Local Data Share (LDS) memory size

SCR

Kernel’s scratch memory size

Arch_VGPR

Number of Vector General Purpose Registers (VGPR) used in kernel dispatch

ACCUM_VGPR

Total Count of VGPRs

SGPR

Kernel’s Scalar General-Purpose Register (SGPR) size

Wave_Size

Number of wavefronts

SIG

Kernel’s completion signal

OBJ

Code object

Kernel_Name

Name of the dispatched kernel

Start_Timestamp

Begin time in nanoseconds (ns) when the kernel begins execution

End_Timestamp

End time in ns when the kernel finishes execution

Correlation_ID

Unique identifier for correlation between HIP and HSA async calls during activity tracing

You can view the generated output using the Perfetto UI as previously described in Visualize Tracing Results. The following is a screenshot of the Perfetto UI when viewing the kernel profiling output.

Viewing Kernel Profile

Fig. 10 Viewing kernel profiling output#

The first four rows represent the performance counters as specified in the input file. The last row is the kernel execution timeline, which is the same as the --kernel-trace option used in the Application tracing mode.

Viewing the profile results provides a good overview of kernel execution times and how performance metrics values change across the kernels. Additionally, you can also see the exact value of a counter/metric by hovering over or clicking the bar.

Formatting output using plugins#

rocprofv2 uses a modular plugin system which allows you to generate profiling output in the desired format. Because these plugins are modular in nature, they can easily be decoupled from the code based on need. By default, rocprofv2 generates the profiling output using the file and CLI plugins.

You can install other plugins (as listed in the table below) using the plugins package as shown:

rocprofiler-plugins_2.0.0-local_amd64.deb
-or-
rocprofiler-plugins-2.0.0-local.x86_64.rpm

You can also create your own plugins if you are using rocprofv2 with source code and not just as a CLI tool. To write new plugins import the include/rocprofiler/v2/rocprofiler_plugins.h header file.

To generate the profiling output using a plugin, use:

rocprofv2 --plugin plugin_name -i input.txt <app_relative_path>

# where plugin_name is file, perfetto, att, or ctf

To specify the plugin version to be used in case of multiple versions, use:

rocprofv2 --plugin <plugin_name> --plugin-version <plugin_version_required> <rocprofv2_options> <app_relative_path>

The following table lists the available plugins:

Plugin

Output format

File

Text files (.csv or .txt)

Perfetto

Protobuf in the format of the Chromium Project’s trace-event format

Advanced Thread Tracer (ATT)

Binary and .csv formats for Analysis View tool

Common Trace Format (CTF)

Binary, formatted in the ctf format that can be consumed by public tools such as Babeltrace and TraceCompass

Note

To generate output, the plugins require you to set the OUTPUT_PATH variable to the desired directory. File plugin is the only plugin that still generates output in the absence of OUTPUT_PATH by dumping the output to standard output.

File plugin#

To output the data in .txt files using file plugin, use:

rocprofv2 --plugin file -i samples/input.txt -d output_dir <app_relative_path>

Note that specifying the directory for output files using -d is optional.

File plugin has two versions with version 2 being the default. The headers in the output files generated using file plugin version 1 and 2 differ as shown below.

Version 1 header:

Index,KernelName,gpu-id,queue-id,queue-index,pid,tid,grd,wgr,lds,scr,arch_vgpr,accum_vgpr,sgpr,wave_size,sig,obj,DispatchNs,BeginNs,EndNs,CompleteNs,Counters

Note that the version 1 header is same as the legacy rocprof output.

Version 2 header:

Dispatch_ID,GPU_ID,Queue_ID,PID,TID,Grid_Size,Workgroup_Size,LDS_Per_Workgroup,Scratch_Per_Workitem,Arch_VGPR,Accum_VGPR,SGPR,Wave_Size,Kernel_Name,Start_Timestamp,End_Timestamp,Correlation_ID,Counters

Perfetto plugin#

To output the data in Protobuf format using the Perfetto plugin, use:

rocprofv2 --plugin perfetto --hsa-trace <app_relative_path>

You can view the Protobuf files using Perfetto or Trace processor.

Common Trace Format plugin#

To output the data in Common Trace Format (CTF), which is a binary trace format, use:

rocprofv2 --plugin ctf --hip-trace <app_relative_path>

You can view the CTF binary output using TraceCompass or Babeltrace.

For information on the ATT plugin, refer to the Analysis view.

Analysis View#

Analysis View is a web-based tool that allows you to visualize detailed information within a running kernel. It allows you to dive deeper into the kernels and examine how metrics change over time. This is more helpful for analyzing kernel execution than the standard kernel profiling which reports only one value for a counter for a kernel.

Analysis View supports collection of the following data:

  • Fast counter (supported only in GFX9 product family)

  • Wave states

  • Instruction delays

  • Memory barrier dependencies

For each run, data can be collected from one Compute Unit (CU). A user can choose any CU from 0 to 15 for data collection in Analysis View. The selected CU is 1 by default.

Running the Analysis View#

This section provides the information required to run the Analysis View tool.

Prerequisites#

Install the Python packages numpy, matplotlib, and websockets.

python3 -m pip install numpy matplotlib websockets

Compiling the application#

Analysis View requires assembly files for the kernel being profiled. You can generate these files using either of these two options:

  • Individually for the application

hipcc -g --savetemps
  • Set environment variable HIPCC_COMPILE_FLAGS_APPEND to configure the assembly files to be generated for all applications compiled by hipcc.

export HIPCC_COMPILE_FLAGS_APPEND="--save-temps -g"

Supplying input file#

To specify options for information collection using Analysis View, configure the input file using the parameters given below. You can specify the values in the input file as decimal (for example, 15) or hex (for example, 0xF).

Table 1 input file#

Parameter

Need for specification

Description

att: TARGET_CU

Required

Analysis View data collection retrieves information related to instruction timing on a single CU. This parameter must be the first string in the configuration file to mark the CU from which the data needs to be retrieved. Allowed values are in the range (0, 15). You can select any CU from 1 to 15, however we recommend selecting CU 1 for better results.

SIMD_MASK

Optional

This is a binary mask representing the SIMDs that generate instruction data on the given TARGET_CU. Allowed values are in the range (0x0, 0xF). Recommended value is 0x3. Note that Analysis View generates a lot of data and if you encounter packet loss, it is better to reduce the value of SIMD_MASK.

KERNEL

Optional

This parameter adds a kernel filter. Only kernels containing the given string are profiled in Analysis View. Used for large applications, where profiling every kernel generates a lot of Analysis View data. Adding multiple filters is supported.

PERFCOUNTERS_COL_PERIOD

Required only for performance counter collection

This parameter defines the frequency at which counters need to be collected. Allowed values are in the range (0, 31), with 0 being the fastest and 31 being the slowest. Recommended value is 3.

PERFCOUNTER

Optional

This parameter enables basic counter collection for the given name. Only SQ block counters can be used. For example, SQ_BUSY_CU_CYCLES and maximum 16 counters can be specified.

PERFCOUNTER_ID

Optional

This parameter performs the same function as PERFCOUNTER with the counter ID being specified instead of the counter name.

Example: A simple input.txt file sample

att: TARGET_CU=1
SIMD_MASK=0x1

Example: input.txt file sample with kernel filters

att: TARGET_CU=1
SIMD_MASK=0x3
KERNEL=vectoradd
KERNEL=histogram

Specifying the kernel in the above input file ensures that only kernels with vectoradd or histogram in their name are collected.

Example: input.txt file sample for performance counter collection

att: TARGET_CU=1
SIMD_MASK=0x3
PERFCOUNTERS_COL_PERIOD=0x2
PERFCOUNTER=SQ_BUSY_CU_CYCLES
PERFCOUNTER=SQ_WAVE_CYCLES
PERFCOUNTER=SQ_WAIT_ANY

Starting Analysis View#

To start Analysis View, run rocprofv2 command-line interface along with ATT plugin and arguments as shown:

rocprofv2 <rocprofv2_args> --plugin att <assembly_file> <att_args> <application>

To learn about the values that can be passed to the command-line arguments given in the command above, see the table below:

Table 2 analysis view arguments#

Command-line arguments

Values

Required

Description

<rocprof_args>

-i <input_file>

-d <output_location>

Yes

Optional

Specifies the input file

Specifies the location of the output folder for Analysis View data collection. Default value is “./”.

<assembly_file>

Path for the assembly file generated by hipcc, which contains the relevant kernel

Yes

The assembly files are generally in the form *amdgcn*.s. For example, vectoradd_hip-hip-amdgcn-amd-amdhsa-gfx908.s for the vectoradd_hip.cpp sample on a MI100 GPU.

<att_args>

--ports


--target_cu


--att_kernel



--trace_file


--genasm

All att arguments are optional

Specifies a pair of ports to open the viewer on the browser. Default value is 8000,18000

Overrides TARGET_CU in the input file. Default value is None.

Specifies the kernel to be loaded on the viewer. Default value is output_path/*kernel.txt. Loads all the data present in the output folder.

Specifies the trace file (.att) to be used. Uses the files from att_kernel by default.

Specifies the filename (*.s) to be generated with looped waitcnt barriers. This option is rather unique. Typically, waitcnt instruction waits for a group of instructions in one call and the number of instructions a waitcnt instruction waits for is supplied as its operand. Specifying --genasm generates waitcnt for one instruction per waitcnt. This allows you to view individual instruction delays. Default value is None.

<application>

Name of the application to be profiled

Optional

Specify this option for collecting new Analysis View data. For viewing previously collected data, omit this option.

Running sample application#

Follow the steps given below to run an application for Analysis View collection. Note that the use of vectorAdd HIP application is for demonstration purpose only.

  1. Go to the directory that contains the application.

cd HIP-Examples-master/vectorAdd/
  1. Create input.txt file with the following content to specify a target CU and SIMD mask.

att: TARGET_CU=1
SIMD_MASK=0x3
  1. Compile the HIP application while generating the assembly files (*.s).

hipcc -g --save-temps vectoradd_hip.cpp -o vectoradd_hip.exe
  1. Execute rocprofv2 command by specifying the input.txt and ISA file.

/opt/rocm/bin/rocprofv2 -i input.txt --plugin att *amdgcn*.s  ./vectoradd_hip.exe

Generated Output#

On successful Analysis View data collection and analysis, the following message is displayed:

“Serving at ports: 8000,18000”

These files are generated:

  • *kernel.txt: It contains the mangled and demangled name of the kernel that was profiled.

  • *.att: It contains the data per shader engine.

Each profiled kernel generates its own *.att and kernel.txt files. If data from multiple kernels with the same name is generated from different runs or due to the kernel running in a loop, the file name gets incremented (v0, v1, v2, and so on) and the viewer allows you to choose the run to be displayed.

You can view the collected data in the browser at localhost:8000. Make sure to use port forwarding when running on a remote machine.

Types of views#

Analysis View supports three views, including the Hotspot view, Instruction view, and SIMD view. It also generates a graph for performance counter collection. The views are described in detail below.

Hotspot view#

The hotspot view is used to quickly find the instructions with most cycles used. It allows you to see a histogram that shows cycles used by grouped sections of code. Clicking on a line puts the given instruction into view.

Instruction view#

The instruction view shows the number of cycles taken by each instruction to complete. It is used to link trace data to the underlying assembly code. Hovering the mouse over an instruction displays the number of cycles taken by the instruction to be issued and complete. If an instruction runs multiple times (in a loop), the cycles are shown for the first run. Clicking on an instruction puts the corresponding instruction into view in the wave bar.

SIMD view#

The Single Instruction Multiple Data (SIMD) view shows the timeline for waves belonging to a given SIMD and CU. Each wave slot contains two bars, the instruction and timeline bar. Each rectangle is an instruction color coded as per its type, which allows you to quickly see the general SIMD/CU usage. Hovering over the rectangle displays the instruction type and the current clock cycle. Along with the instruction type, each wave has a timeline bar, where red indicates stall (stalled), green indicates exec (ready to execute), yellow indicates wait (waiting usually for barriers), and white indicates empty (no wave allocated).

Counters graph#

A graph is generated for performance counters if they are collected. Performance counters are helpful in providing GPU-wide information. This information is averaged over all shader engines and over the number of quad cycles (polling rate). Note that some counters display values multiplied by 4, such as BUSY CUs, as they increment differently than other counters. Also, normalization causes the counter values to be divided by their maximum value and be shown as percentage of maximum (checked by default). You have the option to select the counters to be visualized and normalize the graph.

If no counters are present, the graph shows waves states instead. Wave states are low pass filtered for improved visualization.

Running on a remote machine#

Analysis View allows you to profile an application running on a remote server over ssh. You can achieve this in two simple steps:

  1. Generate the trace data and assembly files on the remote server.

  2. Download all the data (*kernel.txt, *.att, *.s) on the local machine. You can view all the traces locally using a locally installed rocprofv2.

Alternatively, since the Analysis View is a web-based tool, you can use ssh port forwarding:

ssh –L 8000:localhost:8000 <user@IP>