Using rocprof#

rocprof is a powerful tool for profiling HIP applications on AMD ROCm platforms. It can be used to identify performance bottlenecks in applications and to optimize their performance. rocprof provides a variety of profiling data, including performance counters, hardware traces, and runtime API/activity traces. This document describes the features and usage of the rocprof command-line tool, with a focus on these two primary features:
- Application Tracing: This basic feature of rocprof traces the execution of an application, recording start/end timestamps for each API call and kernel execution. [Read more here](#application-tracing).
- Performance Counter Collection: This powerful feature collects performance counters for each API call and kernel execution. [Read more here](#performance-counter-collection).
To see all the rocprof options, refer to the rocprof command help, or run the following from the command line:
rocprof --help
Application tracing#
Application tracing provides the big picture of a program’s execution by collecting data on the execution times of API calls and GPU commands, such as kernel execution, async memory copy, and barrier packets. This information can be used as the first step in the profiling process to answer important questions, such as which kernel took the longest time to execute, and what percentage of time was spent on memory copy.
There are two ways to use application tracing: rocprof and the ROCTracer API.

- rocprof is a command-line interface (CLI) profiler that can be used on applications running on ROCm-supported GPUs, without requiring any code modification in the application.
- The ROCTracer API is a library that requires minor code modification in the application to be traced, but provides greater flexibility, such as adjusting execution based on profiling results.
Command-line options for application tracing#

The rocprof CLI allows you to trace the entire execution of HIP applications. It supports tracing at different levels, such as HIP-level, HSA-level, and system-level, selected by supplying the respective command-line options to rocprof.

The command-line options used with rocprof for tracing:

| Option | Description |
|--------|-------------|
| -d \<directory\> | Specifies the directory where the profiler stores traces and profiling data. By default, the profiler stores the profiling data in a temporary directory (/tmp), which is removed automatically after a specific period. Specifying a directory explicitly prevents loss of data. |
| --hip-trace | Traces API execution stats, HIP API calls, and copy operation calls. For more information, refer to HIP Trace. |
| --hsa-trace | Traces API execution stats, HSA API calls, and copy operation calls. For more information, refer to HSA Trace. |
| --roctx-trace | Enables rocTX application code annotation trace. |
| --stats | Traces API execution stats and kernel execution stats. |
| --sys-trace | Traces API execution stats, HIP and HSA API calls, and copy operation calls. |
HIP Trace#

Use the --hip-trace option with rocprof to collect execution trace data for the entire application, including HIP API functions and their asynchronous activities at the runtime level.
Usage:
rocprof -d outputFolder --hip-trace ./MatrixTranspose
The above command generates three groups of files: viewable trace data, statistics files, and intermediate tracing data.
Note
This document does not discuss the intermediate tracing data as it is not designed to be read by users.
Viewable Trace Data#
The viewable trace data is available in results.json, a JSON file that follows the Chromium Project's trace-event format. You can view the trace using tools such as Chrome Tracing (chrome://tracing/) or Perfetto UI. The following figure shows a short segment of the trace data:
In the time axis at the top of the figure, there is a small highlighted region between 0.22s and 0.24s, indicating the currently selected time range.

Below the time axis are Gantt chart-style boxes that show the duration of each task. There are three rows of tasks, divided by the black rows naming their categories. The first row, CPU HIP API, lists the execution time of each API trace.

The second row, titled COPY, shows the tasks completed by the copy engine. Note that the CopyHostToDevice and CopyDeviceToHost tasks are completed at the beginning and the end of the time range, respectively.

The GPU tasks are listed at the bottom. Note that the matrixTranspose kernel executes from about 0.185s to 0.20s.
Statistics Files#

The statistics files include:

- results.stats.csv for kernel statistics
- results.hip_stats.csv for HIP API statistics
- results.copy_stats.csv for memory copy activity statistics
The files are organized in comma-separated values (CSV) format, with the columns:
Name: Name of the action
Calls: Number of invocations
TotalDurationNS: Total duration in nanosecond
AverageNS: Average time in nanoseconds required to execute the action
Percentage: Percentage of the action with respect to the complete execution of the application
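As a sketch of how these statistics can be consumed programmatically, the following reads a stats CSV with the columns above and finds the longest-running action. The rows are made-up sample data, not real rocprof output:

```python
import csv
import io

# Illustrative stats CSV in the format described above (sample values).
sample = """\
"Name","Calls","TotalDurationNs","AverageNs","Percentage"
"hipMemcpy",10,800000,80000,80.0
"hipMalloc",2,200000,100000,20.0
"""

rows = list(csv.DictReader(io.StringIO(sample)))

# Find the action with the largest total duration.
longest = max(rows, key=lambda r: int(r["TotalDurationNs"]))
print(longest["Name"])  # hipMemcpy
```

In practice you would open results.hip_stats.csv (or results.stats.csv for kernels) instead of the inline sample.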
HSA Trace#

The HIP runtime library is implemented on top of the low-level HSA runtime. To trace the application at a lower level, you can use rocprof to collect application traces at the HSA runtime level. In general, tracing at the HIP level is recommended for most users; use HSA trace only if you are familiar with the HSA runtime.

An HSA trace contains the start/end time of HSA runtime API calls and their asynchronous activities. Use the --hsa-trace option with rocprof to collect an HSA-level trace:
rocprof --hsa-trace ./MatrixTranspose
As with HIP trace, the generated HSA trace also includes three groups of files: viewable traces, statistics files, and intermediate tracing data. You can visualize the generated results.json using third-party tools such as Perfetto, as shown in the following figure.

In the preceding figure, the HSA trace shows HSA API rows, just as the HIP trace shows the HIP API row. Note, however, that there are more HSA API calls than HIP API calls, because HSA works closer to the hardware.
Note
GPU ID in the HSA visualization is always 0.
The statistics files are:

- results.stats.csv for kernel statistics
- results.hsa_stats.csv for HSA API statistics
- results.copy_stats.csv for memory copy statistics

Each file is a CSV table with the columns described in Statistics Files.
System Trace#

The rocprof tool can also generate both the HIP and HSA traces together with the --sys-trace option:
rocprof --sys-trace ./MatrixTranspose
The following figure shows the generated results.json visualized using Perfetto. It contains sections from both the HIP and HSA traces.
ROCTx Trace#
In certain situations, such as debugging performance issues in large-scale GPU programs, API-level tracing may be too fine-grained to provide a big picture of the program execution. In such cases, you might find it helpful to define specific tasks to be traced.
To specify the tasks for tracing, enclose the respective source code with the API calls provided by roctx. This process is also known as instrumentation. Because the scope of code for instrumentation is defined by the enclosing API calls, it is called a range. A range is a programmer-defined task with a well-defined start and end code scope. You can further refine the scope specified within a range using nested ranges. The rocprof tool also reports the timelines for these nested ranges.
Here is a list of useful APIs for code instrumentation.
- roctxMark: Inserts a marker in the code with a message. Creating marks can help you see when a line of code is executed.
- roctxRangeStart: Starts a range. Ranges can be started by different threads.
- roctxRangePush: Starts a new nested range.
- roctxRangePop: Stops the current nested range.
- roctxRangeStop: Stops the given range.
See the roctx code annotations in the MatrixTranspose application below:
roctxMark("before hipLaunchKernel");
int rangeId = roctxRangeStart("hipLaunchKernel range");
roctxRangePush("hipLaunchKernel");
// Launching kernel from host
hipLaunchKernelGGL(matrixTranspose, dim3(WIDTH/THREADS_PER_BLOCK_X, WIDTH/THREADS_PER_BLOCK_Y), dim3(THREADS_PER_BLOCK_X, THREADS_PER_BLOCK_Y), 0,0,gpuTransposeMatrix,gpuMatrix, WIDTH);
roctxMark("after hipLaunchKernel");
// Memory transfer from device to host
roctxRangePush("hipMemcpy");
hipMemcpy(TransposeMatrix, gpuTransposeMatrix, NUM * sizeof(float), hipMemcpyDeviceToHost);
roctxRangePop(); // for "hipMemcpy"
roctxRangePop(); // for "hipLaunchKernel"
roctxRangeStop(rangeId);
Tip
A version of the MatrixTranspose application instrumented using the ROCTx API is available in the rocprofiler/tests-v2 folder on GitHub.
Using rocprof with the --roctx-trace option:

rocprof -d outputFolder --roctx-trace ./
You can visualize the generated output file results.json using Perfetto, as shown in the following figure. The Markers and Ranges sections show the marked events and ranges.
Tracing Control#
The rocprof tool provides these customization options:
Filter Tasks#
To filter tasks, specify the trace category and the tasks to be traced in an input file.
cat input.txt
hsa : hsa_queue_create hsa_amd_memory_pool_allocate
Then supply this input file to rocprof:
rocprof -i input.txt --hsa-trace ./MatrixTranspose
The above sample input file generates HSA tracing information for only the two events specified in the file.
Adjust Trace Flush Rate#
Use the --flush-rate option to specify the flush rate in seconds, milliseconds, or microseconds. This determines how often the trace data is flushed to files. In the following example, the flush rate is set to 10 microseconds:
rocprof --flush-rate 10us --hsa-trace ./MatrixTranspose
Function Name Truncation#
To truncate the full kernel function name in the trace files to the base name of the function, use --basename on as shown in the following example:
$ rocprof --basename on --hip-trace ./MatrixTranspose
"Name","Calls","TotalDurationNs","AverageNs","Percentage"
"vectoradd_float",1,46373,46373,100.0
When --basename is not explicitly enabled, the full kernel function name is displayed in the trace:
$ rocprof --basename off --hip-trace ./MatrixTranspose
"Name","Calls","TotalDurationNs","AverageNs","Percentage"
"vectoradd_float(float*, float const*, float const*, int, int)",1,45633,45633,100.0
Tracing Control for API or Code Block#
To enable selective tracing for a HIP API or code block instead of the entire application, follow these steps:

Enclose the API or code block within roctracer_start() and roctracer_stop(). This ensures that tracing starts only when roctracer_start() is encountered and stops once roctracer_stop() is encountered. See the following usage of these API calls, where only the hipMemcpy() API is traced:
#include <roctracer/include/roctracer_ext.h>
// allocate the memory on the device side
hipMalloc((void**)&gpuMatrix, NUM * sizeof(float));
hipMalloc((void**)&gpuTransposeMatrix, NUM * sizeof(float));
roctracer_start();
// Memory transfer from host to device
hipMemcpy(gpuMatrix, Matrix, NUM * sizeof(float), hipMemcpyHostToDevice);
roctracer_stop();
Use --trace-start off to disable application tracing from the beginning, so that tracing starts only when roctracer_start() is encountered.
$ rocprof --trace-start off --hip-trace MatrixTranspose
cat results.hip_stats.csv
"Name","Calls","TotalDurationNs","AverageNs","Percentage"
"hipMemcpy",10,255048886,25504888,100.0
Set Initial Delay, Periodic Sample Length and Rate#

Use the option --trace-period <delay:length:rate> to fine-tune tracing by setting the following attributes:

- Initial Delay: The time interval between the start of the profiler and the start of tracing. For example, an initial delay of 10 ms causes tracing to commence 10 ms after the profiler starts.
- Periodic Sample Length: The duration for which tracing runs.
- Rate: The rate at which the tracing results are flushed to the user.
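The <delay:length:rate> spec can be illustrated with a small parser sketch. This is an illustration of the format, not rocprof's actual implementation; the accepted units (s, ms, us) follow the same convention as the --flush-rate option:

```python
import re

# Nanoseconds per accepted unit (s/ms/us, as used by --flush-rate).
_UNITS = {"s": 1_000_000_000, "ms": 1_000_000, "us": 1_000}

def parse_duration(text):
    """Parse a duration such as '10ms' into nanoseconds."""
    m = re.fullmatch(r"(\d+)(s|ms|us)", text)
    if not m:
        raise ValueError(f"bad duration: {text}")
    return int(m.group(1)) * _UNITS[m.group(2)]

def parse_trace_period(spec):
    """Parse a '<delay>:<length>:<rate>' spec into nanosecond values."""
    delay, length, rate = (parse_duration(p) for p in spec.split(":"))
    return {"delay_ns": delay, "length_ns": length, "rate_ns": rate}

print(parse_trace_period("10ms:1ms:10ms"))
```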
Example: Tracing with a delay of 10ms, length of 1ms and rate of 10ms
$ rocprof --hip-trace --trace-period 10ms:1ms:10ms MatrixTranspose
cat results.hip_stats.csv
"Name","Calls","TotalDurationNs","AverageNs","Percentage"
"hipMemcpy",6,6358906,1059817,100.0
Example: Tracing with a delay of 10ms, length of 1ms and rate of 1ms
$ rocprof --hip-trace --trace-period 10ms:1ms:1ms MatrixTranspose
cat results.hip_stats.csv
"Name","Calls","TotalDurationNs","AverageNs","Percentage"
"hipMemcpy",11,272473856,24770350,99.9468484848238
"hipMalloc",2,102871,51435,0.037734380837192355
"hipFree",2,39940,19970,0.014650495967157534
"hipGetDeviceProperties",1,2090,2090,0.0007666383718417439
ROCTracer API#
The ROCTracer APIs are runtime-independent APIs for tracing runtime calls and asynchronous activity, like GPU kernel dispatches and memory moves. The tracing includes callback API for runtime API tracing and activity API for asynchronous activity records logging. You can utilize these APIs to develop a tracing tool or to implement tracing within an application. Refer to the ROCTracer API Specification for more information.
To use the ROCTracer API, link the application with ROCTracer using the API header and dynamic library as shown below:
API header: /opt/rocm-{version}/include/roctracer/roctracer.h
Dynamic library (.so): /opt/rocm-{version}/lib/libroctracer64.so.<version major>
Performance Counter Collection#

As discussed in the preceding sections, the application trace mode is limited to providing an overview of program execution and does not provide insight into kernel execution. To address performance issues, the counter and metric collection functionality of rocprof can be used to report hardware component performance metrics during kernel execution.
Counter and metric collection is supported on the following GPUs:
AMD Radeon Instinct MI25, MI50, MI100, MI2XX
AMD Radeon VII, Radeon Pro VII
Command-line options for counter collection#

The command-line options used with rocprof for profiling:

| Option | Description |
|--------|-------------|
| --list-basic | Prints the list of basic hardware counters. |
| --list-derived | Prints the list of derived metrics with formulas. |
| -i \<input file\> | Retrieves the values of the desired list of basic counters and/or derived metrics. |
| -m \<metric file\> | Defines new derived metrics or modifies the existing metrics defined in metrics.xml by default. |
| -o \<output file\> | Specifies a name for the output file generated when used with -i. |
| --timestamp \<on/off\> | Enables or disables the kernel dispatch timestamps in nanoseconds for events such as dispatch, begin, end, and complete, as described in the following. Default: on. |
Listing Performance Counters#
AMD GPUs are equipped with hardware performance counters that can be used to measure specific values during kernel execution, which are then exported from the GPU and written into the output files at the end of the kernel execution. These performance counters vary according to the GPU. Therefore, it is recommended to examine the hardware counters that can be collected before running the profile.
There are two types of data available for profiling: hardware basic counters and derived metrics.
To obtain the list of supported basic counters, use:
rocprof --list-basic
The derived metrics are calculated from the basic counters using mathematical expressions. To list the supported derived metrics along with their mathematical expressions, use:
rocprof --list-derived
You can also customize the derived metrics as explained in Metric File.
Note
The output generated by the --list-basic and --list-derived commands can be significant, and is sometimes worth capturing by redirecting it to a file:
rocprof --list-derived ./MatrixTranspose > output.txt
Using rocprof for Application Profiling#
To profile kernels in GPU applications, define the profiling scope in an input file and use:
rocprof -i input.txt ./MatrixTranspose
Note
Refer to the MatrixTranspose application tutorial for the example application.
Input File#

As mentioned above, an input file is a text file that can be supplied to rocprof for basic counter and derived metric collection. It typically consists of four parts: the counters and derived metrics to collect, the GPUs to profile, and the names and dispatch range of the kernels to profile.

The collected data is written to an output CSV file with the same base name as the input file. For example, -i input.txt results in input.csv being generated.
Sample Input File
# Perf counters group 1
pmc: MemUnitStalled,TCC_MISS[0]
# Filter by dispatches range, GPU index and kernel names
# supported range formats: "3:9", "3:", "3"
range: 0:1
gpu: 0
kernel: matrixTranspose
The fields in the input file are as follows:

PMC: The rows beginning with pmc: list the group of basic counters or derived metrics you are interested in collecting. The performance counters can be selected from the output generated by the --list-basic or --list-derived command.

The number of basic counters or derived metrics that can be collected in one profiling run is limited by the GPU hardware resources. If too many counters/metrics are selected, the kernels need to be executed multiple times to collect them all. For multi-pass execution, include multiple pmc: rows in the input file; the counters or metrics in each pmc: row are collected in a separate run of the kernel.
Note
rocprof suggests how to group the counters/metrics if you exceed the hardware limits. You can use this suggestion to split the counters/metrics into group sets and successfully perform counter/metric collection.
Example: to see the suggestion to group the counters/metrics:
$ cat input.txt
pmc : Wavefronts, VALUInsts, SALUInsts, SFetchInsts,
FlatVMemInsts,
LDSInsts, FlatLDSInsts, GDSInsts, VALUUtilization, FetchSize,
WriteSize, L2CacheHit, VWriteInsts, GPUBusy, VALUBusy, SALUBusy,
MemUnitStalled, WriteUnitStalled, LDSBankConflict, MemUnitBusy
range: 0 : 1
gpu: 0
kernel:matrixTranspose
$ rocprof -i input.txt ./MatrixTranspose
Input metrics out of hardware limit. Proposed metrics group set:
group1: FetchSize WriteSize VWriteInsts MemUnitStalled MemUnitBusy
FlatVMemInsts LDSInsts VALUInsts SALUInsts SFetchInsts
FlatLDSInsts GPUBusy Wavefronts
group2: WriteUnitStalled L2CacheHit GDSInsts VALUUtilization
VALUBusy SALUBusy LDSBankConflict
Note
The results reported vary depending on the GPU device being profiled.
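The multi-pass idea above can be sketched as follows. The per-pass limit of 8 is a made-up illustration; the real limit depends on the GPU and on which hardware blocks each counter occupies, which is why rocprof computes the grouping itself:

```python
# Sketch: when the requested counters exceed what the hardware can collect in
# one pass, split them into groups, one group (pmc: row) per kernel run.
# The per_pass_limit value is an assumption for illustration only.
def split_into_groups(counters, per_pass_limit=8):
    return [counters[i:i + per_pass_limit]
            for i in range(0, len(counters), per_pass_limit)]

requested = ["Wavefronts", "VALUInsts", "SALUInsts", "SFetchInsts",
             "FlatVMemInsts", "LDSInsts", "FlatLDSInsts", "GDSInsts",
             "VALUUtilization", "FetchSize"]

groups = split_into_groups(requested)

# Each group becomes one pmc: row in the input file.
for g in groups:
    print("pmc: " + ",".join(g))
```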
GPU: The row beginning with the keyword gpu: specifies the GPU(s) on which the hardware counters are to be collected. This enables profiling multiple GPUs. You can specify multiple GPUs separated by commas, such as gpu: 1,3.

Kernel: Specifies the names of the kernels to profile.

Range: Specifies the range of kernel dispatches. Specifying a range is helpful when the application makes multiple kernel dispatches and you want to filter some of them. In the above example, the range 0:1 indicates that one kernel dispatch is profiled.
Metric File#
The derived metrics are defined in /opt/rocm/lib/rocprofiler/metrics.xml by default.

Here is an entry from metrics.xml:
<global>
# Wavefronts
<metric
name="Wavefronts"
descr="Total wavefronts."
expr=SQ_WAVES
></metric>
</global>
To override the properties (description/expression) of the derived metrics predefined in metrics.xml, redefine them in a custom derived-metrics file as shown here:
#include "gfx_metrics.xml"
<global>
<metric
name="Wavefronts"
descr="Total wavefronts. Description redefined by user."
expr=SQ_WAVES
></metric>
</global>
Note that when specifying your custom metrics file, you must include the rocprofiler/test/tool/gfx_metrics.xml file, as all the basic counters are defined there. The basic counters are used in calculating the derived metrics.

Here is an entry from gfx_metrics.xml:
<gfx9>
<metric name="SQ_WAVES" block=SQ event=4 descr="Count number of waves sent to SQs. (per-simd, emulated, global)"></metric>
</gfx9>
You can also define new derived metrics in the custom metrics file as shown here:
#include "gfx_metrics.xml"
<gfx9_expr>
<metric name="TotalWorkItems" expr=SQ_WAVES*4 descr="Total number of waves sent to SQs(For all simd's). Defined by user." ></metric>
</gfx9_expr>
To see the list of customized derived metrics, use:
$ rocprof -m custom_metrics.xml --list-derived
gpu-agent1 : TotalWorkItems : Total number of waves sent to SQs(For all simd's). Defined by user.
TotalWorkItems = SQ_WAVES*4
To collect the values of custom derived metrics, use:
cat input.txt
pmc: TotalWorkItems
$ rocprof -i input.txt -m custom_metrics.xml ./MatrixTranspose
Index,KernelName,gpu-id,queue-id,queue-index,pid,tid,grd,wgr,lds,scr,arch_vgpr,accum_vgpr,sgpr,wave_size,sig,obj,TotalWorkItems
0,"matrixTranspose(float*, float*, int) [clone .kd]",1,0,0,2746,2746,1048576,16,0,0,8,0,16,64,0x0,0x7fead980e800,262144.0000000000
Note
You must use the input file to supply the list of derived metrics to collect, while the custom metrics file provides the expressions for calculating those derived metrics. When collecting custom derived metrics, make sure to list in the input file only those derived metrics that you have defined in the custom derived metrics file. When the -m option is not used, rocprof refers to the default metrics.xml for the expressions of all derived metrics specified in the input file.
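Conceptually, rocprof evaluates each derived metric's expression against the collected basic counter values. The following is a minimal, hedged sketch of that idea, not rocprof's actual implementation; only the four basic arithmetic operators are handled:

```python
import ast
import operator

# Map AST operator nodes to arithmetic functions; a safe AST walk is used
# instead of eval().
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def eval_metric(expr, counters):
    """Evaluate a derived-metric expression (e.g. 'SQ_WAVES*4') against a
    dict of collected basic counter values."""
    def walk(node):
        if isinstance(node, ast.BinOp):
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Name):
            return counters[node.id]
        if isinstance(node, ast.Constant):
            return node.value
        raise ValueError("unsupported expression node")
    return walk(ast.parse(expr, mode="eval").body)

# With SQ_WAVES = 65536, TotalWorkItems = SQ_WAVES*4 evaluates to 262144,
# matching the sample output above. The counter value is illustrative.
print(eval_metric("SQ_WAVES*4", {"SQ_WAVES": 65536}))  # 262144
```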
The basic counters and derived metrics specific to AMD GPUs are listed in the XML files under the respective GPU family. The LLVM target gfx9 corresponds to MI50 and MI100, while gfx90a corresponds to the MI200 family. The counters and metrics applicable to all GPUs are listed under global. See the supported GPUs and their respective LLVM targets in Linux Supported GPUs.
Generated Output#
Executing rocprof with an input file input.txt produces an output CSV file (the file name can be specified with the -o option) with the counter information, as shown below.

Example: Executing rocprof on the MatrixTranspose application with the sample input file:
$ rocprof -i input.txt ./MatrixTranspose
$ cat input.csv
Index,KernelName,gpu-id,queue-id,queue-index,pid,tid,grd,wgr,lds,
scr,vgpr,sgpr,fbar,sig,obj,MemUnitStalled,TCC_MISS[0]
0,"matrixTranspose(float*, float*, int) [clone .kd]",0,0,0,2614,2614,
1048576,16,0,0,8,24,0,0x0,0x7fbfcb37c580,6.6756117852,4096.
Each row of the CSV file is an instance of kernel execution. The columns in the output file are:
Index - kernels dispatch order index
KernelName - kernel name
gpu-id - GPU ID the kernel was submitted to
queue-id - ROCm queue unique ID the kernel was submitted to
queue-index - ROCm queue write index for the submitted AQL packet
pid - system application process ID
tid - system application thread ID that submitted the kernel
grd - kernel’s grid size
wgr - kernel’s work group size
lds - kernel’s LDS memory size
scr - kernel’s scratch memory size
vgpr - kernel’s VGPR size
sgpr - kernel’s SGPR size
fbar - kernel’s barriers limitation
sig - kernel’s completion signal
Example: Enabling timestamps

To enable timestamps in the output, use the option --timestamp on as shown in the following example:
$ rocprof -i input.txt --timestamp on ./MatrixTranspose
$ cat input.csv
Index,KernelName,gpu-id,queue-id,queue-index,pid,tid,grd,
wgr,lds,scr,vgpr,sgpr,fbar,sig,obj,
MemUnitStalled,TCC_MISS[0],DispatchNs,BeginNs,EndNs,CompleteNs
0,"matrixTranspose(float*, float*, int) [clone .kd]",0,0,0,2837,2837,
1048576,16,0,0,8,24,0,0x0,
0x7fcd75984580,5.9792124305,4096,87851328156768,
87851334047658,87851334141098,87851334732528
Note
The --timestamp option is on by default.
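As a quick illustration of how the timestamp columns can be used, the following computes queue latency and kernel execution time from the example row above (all values in nanoseconds):

```python
# Timestamp values taken from the example output above.
dispatch_ns = 87851328156768
begin_ns    = 87851334047658
end_ns      = 87851334141098
complete_ns = 87851334732528

queue_latency_ns = begin_ns - dispatch_ns  # time spent waiting in the queue
execution_ns     = end_ns - begin_ns       # time the kernel actually ran
signal_delay_ns  = complete_ns - end_ns    # time until completion signal

print(f"kernel execution: {execution_ns} ns")
```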
Example: Specifying the output file name

The default output CSV file has the same base name as the input file. To specify a name for the output file, use the option -o <file name> as shown in the following example:
$ rocprof -i input.txt --timestamp on -o output.csv ./MatrixTranspose
$ cat output.csv
Index,KernelName,gpu-id,queue-id,queue-index,pid,tid,
grd,wgr,lds,scr,vgpr,sgpr,fbar,
sig,obj,MemUnitStalled,
TCC_MISS[0],DispatchNs,BeginNs,EndNs,CompleteNs
0,"matrixTranspose(float*, float*, int) [clone .kd]",0,0,0,
215,215,1048576,16,0,0,8,24,0,0x0,0x7f961080a580,7.0675214726,4096,
91063585321414,91063591158627,91063591252551,91063592018031
Profiling Multiple MPI Ranks#
To profile multiple MPI ranks use the following command:
mpirun ... <mpi args> ... rocprof ... <rocprof args> ... application ... <application args>
Important
When using OpenMPI, you must run rocprof inside the mpirun command, as shown above, to enable ROCProfiler to handle process forking and launching via mpirun and related executables. If you run the rocprof command before mpirun, the tool fails with the error: roctracer: Loading 'libamdhip64.so' failed, (null).
This execution mode requires the following:
Generation of trace data per MPI (or process) rank
ROCm provides a simple bash wrapper that demonstrates how to generate a unique output directory per process:

$ cat wrapper.sh
#!/usr/bin/env bash
if [[ -n ${OMPI_COMM_WORLD_RANK+z} ]]; then
    # OpenMPI
    export MPI_RANK=${OMPI_COMM_WORLD_RANK}
elif [[ -n ${MV2_COMM_WORLD_RANK+z} ]]; then
    # MVAPICH2
    export MPI_RANK=${MV2_COMM_WORLD_RANK}
fi
args="$*"
pid="$$"
outdir="rank_${pid}_${MPI_RANK}"
outfile="results_${pid}_${MPI_RANK}.csv"
eval "rocprof -d ${outdir} -o ${outdir}/${outfile} $*"
This script:

- Determines the global MPI rank (implemented here for OpenMPI and MVAPICH2 only)
- Determines the process ID of the MPI rank
- Generates a unique output directory name from the two
To invoke this wrapper, use the following command:
mpirun <mpi args> ./wrapper.sh --hip-trace <application> <args>
This generates an output directory for each MPI rank used.
Example:
$ ls -ld rank_* | awk '{print $5" "$9}'
4096 rank_513555_0
4096 rank_513556_1
Combining traces from multiple processes

The multiple traces generated above can be combined into a unified trace for profiling using the merge_traces.sh utility script. The full path of the script is /opt/rocm/bin/merge_traces.sh.
Usage:
merge_traces.sh -o <outputdir> [<inputdir>...]
For example:
$ ./merge_traces.sh -h
Use the following input arguments to the merge_traces.sh script to control which traces are merged and where the resulting merged trace is saved:

- <inputdir>... - space-separated list of ROCProfiler directories to merge. If not specified, the current working directory is used.
- -o <outputdir> - output directory where the results are aggregated. The file unified/results.json is generated, and it contains trace data from the specified input directories.