ROCProfilerV1 User Manual#
Introduction#
rocprof
is a powerful tool for profiling HIP
applications on AMD ROCm platforms. It can be used to identify performance bottlenecks in applications and to optimize their performance. rocprof provides a variety of profiling data, including performance counters, hardware traces, and runtime API/activity traces.
This document provides a detailed description of rocprof's
features and usage with a focus on these important features of rocprof
:
Application Tracing: This basic feature of
rocprof
is used to trace the execution of an application, with start/end timestamps for each API call and kernel execution. Read more here.Performance Counter Collection: This powerful feature collects performance counters for each API call and kernel execution. Read more here.
To demonstrate the usage of rocprof
with various options, this document refers to the MatrixTranspose application as an example.
Installation#
To install rocprof
, the simplest method is to perform a full installation of ROCm, as rocprof
is included as a standard component of the ROCm distribution. This approach is straightforward and dependable, making it suitable for most users. For detailed instructions on ROCm installation, refer to the Installation Guide..
Alternatively, installing rocprof
from the sources offers greater flexibility. This option is however recommended for experienced users due to the complexity involved.
Supported Devices#
rocprof
is supported on the following AMDGPU architectures:
gfx900
(AMD Vega 10)gfx906
(AMD Vega 7nm also known as AMD Vega 20)gfx908
(AMD Instinct™ MI100 accelerator)gfx90a
(Aldebaran)gfx940
(AMD Instinct MI300)gfx1010
(Navi10)gfx1011
(Navi12)gfx1012
(Navi14)gfx1030
(Sienna Cichlid)gfx1031
(Navy Flounder)gfx1032
(Dimgrey Cavefish)gfx1100
(Navi31)
The following section discusses how to install rocprof
from the source.
Install from Source#
Prerequisites#
Installing rocprof
from the source requires:
Linux system that is supported by ROCm.
AMD GPU driver and ROCm installed in the system.
A few other packages that are necessary for
rocprof
compilation and execution. Here are the dependency installation commands for supported Linux distros:Ubuntu 20.04 and 22.04:
sudo apt install libudev-dev libnuma1 libnuma-dev Python3 python3-pip gcc g++ make cmake doxygen
RHEL 8.6 and 9.2:
yum install -y systemd-devel numactl numactl-devel Python3 python3-pip gcc gcc-g++ make cmake libatomic doxygen
CentOS 7:
yum install -y systemd-devel numactl numactl-devel Python3 python3-pip gcc gcc-g++ make cmake libatomic doxygen
SLES 15.3:
zypper in libudev-devel libnuma1 libnuma-devel Python3 python3-pip gcc gcc-g++ make cmake libatomic doxygen
SLES Tumbleweed:
zypper install systemd-devel libnuma1 libnuma-devel Python3 python3-pip gcc gcc-g++ make cmake libatomic doxygen
Python libraries such as
CppHeaderParser
,argparse
, andsqlite3
. To install the libraries based on the default Python installation on the system, use:
pip3 install cppheaderparser argparse sqlite3
Build#
With the dependencies installed, you can now build rocprof
from the source code using the steps given below:
After cloning the repository, you can use
CMake
to compile and installrocprof
. To useCMake
, set the environment variable:export CMAKE_PREFIX_PATH=/opt/rocm/include/hsa:/opt/rocm
Then, build
rocprof
with a premade bash script../build.sh
You now have two options for utilizing the newly built
rocprof
and its associated libraries.The first option involves setting an environment variable to indicate the location of the executable and libraries. This enables the system to use the new
rocprof
and related libraries instead of the default version.export LD_LIBRARY_PATH=<BUILD_LOCATION>:$LD_LIBRARY_PATH export PATH=<BUILD_LOCATION>/bin:$PATH
Another option is to specify the desired location for installation of the executable and libraries if you do not want the installation process to move the executable and libraries to the default location. To specify the installation location, set environment variable:
export CMAKE_INSTALL_PREFIX=<LOCAL INSTALL FOLDER>
Then, to install
rocprof
in the build directory, run:cd build make install
The
Makefile
is generated byCMake
.
Application Tracing#
Application tracing provides the big picture of a program’s execution by collecting data on the execution times of API calls and GPU commands, such as kernel execution, async memory copy, and barrier packets. This information can be used as the first step in the profiling process to answer important questions, such as how much percentage of time was spent on memory copy and which kernel took the longest time to execute.
There are two ways to use application tracing: rocprof
and ROCTracer
API. rocprof
is a command line interface (CLI) profiler that can be used on the applications running on ROCm-supported GPUs, without the requirement of any code modification in the application. On the other hand, ROCTracer
API is a library that requires minor code modification in the application to be traced but provides greater flexibility, such as adjusting execution based on profiling results.
Using rocprof
for Application Tracing#
The rocprof
CLI allows you to trace the entire execution of HIP
applications. It allows tracing at different levels such as HIP
-level, HSA
-level, and system-level. These levels can be selected by supplying the respective command-line options to rocprof
.
The command-line options used with rocprof
for tracing:
Options |
Description |
---|---|
|
To specify the directory where the profiler stores traces and the profiling data. The profiler stores the profiling data in a temporary directory [/tmp] by default, which is removed automatically after a specific period. Specifying a directory explicitly allows you to prevent loss of data. |
|
To trace API execution stats, |
|
To trace API execution stats, |
|
To trace API execution stats, |
|
To enable |
|
To trace API execution stats and kernel execution stats |
|
To enable/disable truncation of the kernel full function names in the trace, till the base ones. Default value: [off]. Read more… |
|
To enable trace flush rate |
|
To rename long kernel names. Kernel renaming is recommended only in special cases where kernel names are auto generated in an incomprehensible format. |
|
To enable/disable tracing for a |
|
To enable tracing with an initial delay, periodic sample length and rate. Supported time formats: <number(m|s|ms|us)> Read more… |
HIP Trace#
Collecting execution traces for the entire application with rocprof
at the HIP
level generates HIP
traces. HIP
trace includes HIP
API functions and their asynchronous activities at the runtime level.
Usage:
rocprof -d outputFolder --hip-trace ./Matrixtranspose
The above command generates three groups of files, namely visualizable traces, statistics files, and intermediatory tracing data. This document does not discuss the intermediatory tracing data as it is not designed to be read by users directly.
Visualizable Traces
The visualizable traces are available in results.json
, which is a JSON
file that follows the Chromium Project’s trace-event format. You can visualize the trace using visualization tools such as Perfetto. See a short segment of the trace visualization in the figure below:
In the figure below, see the time axis on the top. There is a small, highlighted region, between 0.22s to 0.24s, that indicates the currently selected time range.
In the figure below, below the time axis, see the Gantt chart-style boxes that show the duration of each task. There are three rows of tasks divided by the black rows mentioning their categories. The first row is the “CPU HIP
API”, which lists the execution time of each API trace.
In the figure below, the second row titled COPY shows the tasks completed by the copy engine. See that the CopyHostToDevice
and CopyDeviceToHost
tasks are being completed at the beginning and the end of the time range, respectively.
In the figure below, see that the GPU tasks are listed at the bottom. Note that the matrixTranspose
kernel is executed from about 0.185s to 0.20s.
Statistics Files#
The statistics files include:
results.stats.csv
for kernel statisticsresults.hip_stats.csv
forHIP
APIresults.copy_stats.csv
for activity statistics
They are organized in comma-separated values (CSV) format, with the columns:
Name: Name of the action
Calls: Number of invocations
TotalDurationNS: Total duration in nanosecond
AverageNS: Average time in nanoseconds required to execute the action
Percentage: Percentage of the action with respect to the complete execution of the application
HSA Trace#
The HIP
runtime library is implemented with the low-level HSA
runtime. To trace the application at a lower level, you can use rocprof
to collect application traces at the HSA
runtime level. In general, tracing at the HIP
-level is recommended for most users. You are advised to use HSA
trace only if you are familiar with HSA
runtime.
HSA
trace contains the start/end time of HSA
runtime API calls and their asynchronous activities. Use the hsa-trace
option with rocprof to collect HSA
-level trace:
rocprof --hsa-trace ./MatrixTranspose
As in HIP
trace, the generated HSA
trace also includes three groups of files, namely visualizable traces, statistics files, and intermediatory tracing data.
You can visualize the generated results.json
using third-party tools such as Perfetto, as shown below.
In figure above, you can see that HSA
trace shows HSA
trace rows, just like HIP
trace shows the HIP
API row in Fig. 1. However, note that there are more HSA
API calls as compared to HIP
API calls because HSA
works closer to the hardware.
The statistics files are:
results.stats.csv
for kernel statisticsresults.hsa_stats.csv
forHSA
APIresults.copy_stats.csv
for memory copy statistics
Each file is a CSV table with columns as described in the Statistic Files.
System Trace#
The rocprof
tool can also generate both the HIP
and HSA
traces together with the sys-trace
option.
rocprof --sys-trace ./MatrixTranspose
The following figure shows the generated results.json
visualized using Perfetto. It contains sections from both HIP
and HSA
trace.
Roctx Trace#
In certain situations, such as debugging performance issues in large-scale GPU programs, API-level tracing may be too fine-grained to provide a big picture of the program execution. In such cases, you might find it helpful to define specific tasks to be traced.
To specify the tasks for tracing, enclose the respective source code with the API calls provided by Roctx
. This process is also known as instrumentation. As the scope of code for instrumentation is defined using the enclosing API calls, it is called a range. A range is a programmer-defined task that has a well-defined start and end code scope. You can also fine grain the scope specified within a range using further nested ranges. The rocprof
tool also reports the timelines for these nested ranges.
Here is a list of useful APIs for code instrumentation.
roctxMark
: Inserts a marker in the code with a message. Creating marks can help you see when a line of code is executed.roctxRangeStart
: Starts a range. Ranges can be started by different threads.roctxRangePush
: Starts a new nested range.roctxRangePop
: Stops the current nested range.roctxRangeStop
: Stops the given range.
See roctx
code annotations in the MatrixTranspose
application below:
roctxMark("before hipLaunchKernel");
int rangeId = roctxRangeStart("hipLaunchKernel range");
roctxRangePush("hipLaunchKernel");
// Lauching kernel from host
hipLaunchKernelGGL(matrixTranspose, dim3(WIDTH/THREADS_PER_BLOCK_X, WIDTH/THREADS_PER_BLOCK_Y), dim3(THREADS_PER_BLOCK_X, THREADS_PER_BLOCK_Y), 0,0,gpuTransposeMatrix,gpuMatrix, WIDTH);
roctxMark("after hipLaunchKernel");
// Memory transfer from device to host
roctxRangePush("hipMemcpy");
hipMemcpy(TransposeMatrix, gpuTransposeMatrix, NUM * sizeof(float), hipMemcpyDeviceToHost);
roctxRangePop(); // for "hipMemcpy"
roctxRangePop(); // for "hipLaunchKernel"
roctxRangeStop(rangeId);
Using rocprof
with roctx-trace
option:
rocprof -d outputFolder --roctx-trace ./
You can visualize the generated output file results.json
using Perfetto as shown in the figure below. The sections Markers and Ranges show the marked events and ranges.
Tracing Control#
The rocprof
tool provides these customization options:
Filter Tasks#
To filter tasks, specify the trace category and the tasks to be traced in an input file.
cat input.txt
hsa : hsa_queue_create hsa_amd_memory_pool_allocate
Then supply this input file to rocprof
.
rocprof -i input.txt --hsa-trace ./MatrixTranspose
The above sample input file generates HSA
tracing information for only the two events specified in the file.
Adjust Trace Flush Rate#
To specify the flush rate (in seconds, microseconds, or milliseconds) which determines how often the traces are dumped to the files, use:
rocprof --flush-rate 10us --hsa-trace ./MatrixTranspose
Function Name Truncation#
To truncate the kernel full function name in the trace till the base one, use:
$ rocprof --basename on -–hip-trace ./MatrixTranspose
"Name","Calls","TotalDurationNs","AverageNs","Percentage"
"vectoradd_float",1,46373,46373,100.0
When basename
is not explicitly enabled, the full kernel function name is displayed in the trace:
$ rocprof --basename off -–hip-trace ./MatrixTranspose
"Name","Calls","TotalDurationNs","AverageNs","Percentage"
"vectoradd_float(float*, float const*, float const*, int, int)",1,45633,45633,100.0
Tracing Control for API or Code Block#
To enable selective tracing for a HIP
API(s) or code block(s) instead of the entire application, follow these steps:
Enclose the API or code block within
roctracer_start()
androctracer_stop()
. This ensures that tracing starts only when it encountersroctracer_start()
and stops once it encountersroctracer_stop()
. See the usage of these API calls here, where the user wants to trace onlyhipMemcpy()
API:
#include <roctracer/include/roctracer_ext.h>
// allocate the memory on the device side
hipMalloc((void**)&gpuMatrix, NUM * sizeof(float));
hipMalloc((void**)&gpuTransposeMatrix, NUM * sizeof(float));
roctracer_start();
// Memory transfer from host to device
hipMemcpy(gpuMatrix, Matrix, NUM * sizeof(float), hipMemcpyHostToDevice);
roctracer_stop();
Use
--trace-start
off to disable application tracing from the beginning and start tracing only whenroctracer_start()
is encountered.
$ rocprof --trace-start off --hip-trace MatrixTranspose
cat results.hip_stats.csv
"Name","Calls","TotalDurationNs","AverageNs","Percentage"
"hipMemcpy",10,255048886,25504888,100.0
Set Initial Delay, Periodic Sample Length and Rate#
Use option –trace-period <delay:length:rate>
to finetune tracing by setting the following attributes:
Initial Delay: The time interval between the start of the profiler and start of tracing. So an initial delay of 10 ms causes the tracing to commence after 10 ms since the start of the profiler.
Periodic Sample Length: The duration for which tracing runs
Rate: Rate at which the tracing results are flushed to the user
Example: Tracing with a delay of 10ms, length of 1ms and rate of 10ms
$ rocprof --hip-trace --trace-period 10ms:1ms:10ms MatrixTranspose
cat results.hip_stats.csv
"Name","Calls","TotalDurationNs","AverageNs","Percentage"
"hipMemcpy",6,6358906,1059817,100.0
Example: Tracing with a delay of 10ms, length of 1ms and rate of 1ms
$ rocprof --hip-trace --trace-period 10ms:1ms:1ms MatrixTranspose
cat results.hip_stats.csv
"Name","Calls","TotalDurationNs","AverageNs","Percentage"
"hipMemcpy",11,272473856,24770350,99.9468484848238
"hipMalloc",2,102871,51435,0.037734380837192355
"hipFree",2,39940,19970,0.014650495967157534
"hipGetDeviceProperties",1,2090,2090,0.0007666383718417439
ROCTracer API#
The ROCTracer
APIs are runtime-independent APIs for tracing runtime calls and asynchronous activity, like GPU kernel dispatches and memory moves. The tracing includes callback API for
runtime API tracing and activity API for asynchronous activity records logging.
You can utilize these APIs to develop a tracing tool or to implement tracing in application.
To use the ROCTracer
API, link the application with ROCTracer
using the API header and dynamic library as shown below:
API header:
/opt/rocm-{version}/include/roctracer/roctracer.h
Dynamic library (.so):
/opt/rocm-{version}/lib/libroctracer64.so.<version major>
For all the API-related information, refer to the ROCTracer API Specification.
Performance Counter Collection#
As discussed in the sections above, the application trace mode is limited to providing an overview of program execution and does not provide an insight into kernel execution. To address performance issues, the counter and metric collection functionality of rocprof can be used to report hardware component performance metrics during kernel execution.
Counter and metric collection is supported on the following GPUs:
AMD Radeon Instinct MI25
,MI50
,MI100
,MI2XX
AMD Radeon VII
,Radeon Pro VII
The command-line options used with rocprof
for profiling:
Options |
Description |
---|---|
|
To print the list of basic hardware counters. Read more… |
|
To print the list of derived metrics with formulas. Read more… |
|
To retrieve the values of the desired list of basic counters and/or derived metrics. Read more… |
|
To define new derived metrics or modify the existing derived metrics that are defined in the |
|
To provide a preferred name to the generated output file. Refer to the example in Generated Output section below. |
|
To enable/disable the kernel dispatch timestamps in nanoseconds such as DispatchNs/BeginNs/EndNs/CompleteNs. Default value: [on].
Refer to the example in the Generated Output section below. |
Listing Performance Counters#
AMDGPUs are equipped with hardware performance counters that can be used to measure specific values during kernel execution, which are then exported from the GPU and written into the output files at the end of the kernel execution. These performance counters vary according to the GPU. Therefore, it is recommended to examine the hardware counters that can be collected before running the profile.
There are two types of data available for profiling: hardware basic counters and derived metrics.
To obtain the list of supported basic counters, use:
rocprof --list-basic
The derived metrics are calculated from the basic counters using mathematical expressions. To list the supported derived metrics along with their mathematical expressions, use:
rocprof --list-derived
You can also customize the derived metrics as explained in the Metric File below.
Using rocprof for Application Profiling#
To profile kernels in GPU applications, define the profiling scope in an input file and use:
rocprof -i input.txt ./MatrixTranspose
Input File#
As mentioned above, an input file is a text file that can be supplied to rocprof
for basic counter and derived metric collection. It typically consists of four parts, namely the basic counter(s)/derived metrics(s) to use, the GPUs to profile, name of the kernels to be profiled, and the range of kernels to profile.
Sample Input File
# Perf counters group 1
pmc: MemUnitStalled,TCC_MISS[0]
# Filter by dispatches range, GPU index and kernel names
# supported range formats: "3:9", "3:", "3"
range: 0:1
gpu: 0
kernel: matrixTranspose
The fields in the input file are described here:
PMC: The rows in the text file beginning with pmc
are the group of basic counters or derived metrics the user is interested in collecting. The performance counters can be selected from the output generated by --list-basic
or --list-derived
command.
The number of basic counters or derived metrics that can be collected in one run of profiling is limited by the GPU hardware resources. If too many counters/metrics are selected, the kernels need to be executed multiple times to collect the counters/metrics. For multi-pass execution, include multiple rows of pmc
in the input file. Counters or metrics in each pmc
row can be collected in each run of the kernel.
rocprof
provides suggestions to group the counters/metrics if the user exceeds the hardware limits. You can use this suggestion to split the counters/metrics into group sets and successfully perform counter/metric collection.
Example: To see the suggestion to group the counters/metrics on exceeding hardware limits
$ cat input.txt
pmc : Wavefronts, VALUInsts, SALUInsts, SFetchInsts,
FlatVMemInsts,
LDSInsts, FlatLDSInsts, GDSInsts, VALUUtilization, FetchSize,
WriteSize, L2CacheHit, VWriteInsts, GPUBusy, VALUBusy, SALUBusy,
MemUnitStalled, WriteUnitStalled, LDSBankConflict, MemUnitBusy
range: 0 : 1
gpu: 0
kernel:matrixTranspose
$ ropcprof -i input.txt ./MatrixTranspose
Input metrics out of hardware limit. Proposed metrics group set:
group1: FetchSize WriteSize VWriteInsts MemUnitStalled MemUnitBusy
FlatVMemInsts LDSInsts VALUInsts SALUInsts SFetchInsts
FlatLDSInsts GPUBusy Wavefronts
group2: WriteUnitStalled L2CacheHit GDSInsts VALUUtilization
VALUBusy SALUBusy LDSBankConflict
GPU: The row beginning with the keyword gpu specifies the GPU(s) on which the hardware counters are to be collected. This enables the support for profiling multiple GPUs. You can specify multiple GPUs separated by comma such as gpu: 1,3.
Kernel: The row beginning with the keyword kernel specifies the name(s) of the kernel(s) that need to be profiled.
Range: The row beginning with the keyword range specifies the range of kernel dispatches. Specifying range is helpful in cases where the application causes multiple kernel dispatches and users want to filter some kernel dispatches. In the above example, the range 0:1 depicts that one kernel is profiled.
Metric File#
The derived metrics are defined in the ./lib/rocprofiler/metrics.xml
by default.
Here is an entry from metrics.xml
:
<global>
# Wavefronts
<metric
name="Wavefronts"
descr="Total wavefronts."
expr=SQ_WAVES
></metric>
</global>
To override the properties (description/expression) of the derived metrics that are predefined in the metrics.xml
, redefine the derived metrics in the custom derived metrics file as shown here:
#include "gfx_metrics.xml"
<global>
<metric
name="Wavefronts"
descr="Total wavefronts. Description redefined by user."
expr=SQ_WAVES
></metric>
</global>
Note that while specifying your custom metrics file, you must include rocprofiler/test/tool/gfx_metrics.xml
file as all the basic counters are defined here. The basic counters are used in calculating the derived metrics.
Here is an entry from gfx_metrics.xml
:
<gfx9>
<metric name="SQ_WAVES" block=SQ event=4 descr="Count number of waves sent to SQs. (per-simd, emulated, global)"></metric>
<gfx9>
You can also define new derived metrics in the custom metrics file as shown here:
#include "gfx_metrics.xml"
<gfx9_expr>
<metric name="TotalWorkItems" expr=SQ_WAVES*4 descr="Total number of waves sent to SQs(For all simd's). Defined by user." ></metric>
</gfx9_expr>
To see the list of customized derived metrics, use:
$ rocprof -m custom_metrics.xml --list-derived
gpu-agent1 : TotalWorkItems : Total number of waves sent to SQs(For all simd's). Defined by user.
TotalWorkItems = SQ_WAVES*4
To collect the values of custom derived metrics, use:
cat input.txt
pmc: TotalWorkItems
$ rocprof -i input.txt -m custom_metrics.xml ./MatrixTranspose
Index,KernelName,gpu-id,queue-id,queue-index,pid,tid,grd,wgr,lds,scr,arch_vgpr,accum_vgpr,sgpr,wave_size,sig,obj,TotalWorkItems
0,"matrixTranspose(float*, float*, int) [clone .kd]",1,0,0,2746,2746,1048576,16,0,0,8,0,16,64,0x0,0x7fead980e800,262144.0000000000
Note that you must use the input file to supply the list of derived metrics to be collected while the custom metrics file provides the expressions for calculating those derived metrics. When collecting the custom derived metrics, make sure to mention only those derived metrics in the input file that you have defined in the custom derived metrics file. When not using the option -m
, rocprof
refers to the default metrics.xml
for expressions of all the derived metrics specified in the input file.
The basic counters and derived metrics specific to the AMDGPUs are listed in the xml files under the respective GPU family. The LLVM target gfx9
corresponds to MI50
and MI100
while gfx90a
corresponds to the MI200
family. The counters and metrics applicable to all GPUs are listed under global. See the supported GPUs and their respective LLVM targets in the Linux Supported GPUs.
Generated Output#
Executing rocprof
with the above-mentioned input file input.txt
on the MatrixTranspose
application produces an output CSV (the file name can be specified with the -o option) with the counter information as shown below:
Example: Executing rocprof
on MatrixTranspose
application with sample input file
$ rocprof -i input.txt ./MatrixTranspose
$ cat input.csv
Index,KernelName,gpu-id,queue-id,queue-index,pid,tid,grd,wgr,lds,
scr,vgpr,sgpr,fbar,sig,obj,MemUnitStalled,TCC_MISS[0]
0,"matrixTranspose(float*, float*, int) [clone .kd]",0,0,0,2614,2614,
1048576,16,0,0,8,24,0,0x0,0x7fbfcb37c580,6.6756117852,4096.
Each row of the CSV file is an instance of kernel execution. The columns in the output file are:
Index - kernels dispatch order index
KernelName - kernel name
gpu-id - GPU ID the kernel was submitted to
queue-id - ROCm queue unique id the kernel was submitted to
queue-index - ROCm queue write index for the submitted AQL packet
tid - system application thread id that submitted the kernel
grd - kernel’s grid size
wgr - kernel’s work group size
lds - kernel’s LDS memory size
scr - kernel’s scratch memory size
vgpr - kernel’s VGPR size
sgpr - kernel’s SGPR size
fbar - kernel’s barriers limitation
sig - kernel’s completion signal
To enable timestamp in the output, use option --timestamp
on, as shown below:
Example: To enable the timestamp
$ rocprof -i input.txt --timestamp on ./MatrixTranspose
$ cat input.csv
Index,KernelName,gpu-id,queue-id,queue-index,pid,tid,grd,
wgr,lds,scr,vgpr,sgpr,fbar,sig,obj,
MemUnitStalled,TCC_MISS[0],DispatchNs,BeginNs,EndNs,CompleteNs
0,"matrixTranspose(float*, float*, int) [clone .kd]",0,0,0,2837,2837,
1048576,16,0,0,8,24,0,0x0,
0x7fcd75984580,5.9792124305,4096,87851328156768,
87851334047658,87851334141098,87851334732528
To provide a preferred name to the output file, use option -o <file name>
as shown below:
Example: To specify the output file name
$ rocprof -i input.txt --timestamp on -o output.csv ./MatrixTranspose
$ cat output.csv
Index,KernelName,gpu-id,queue-id,queue-index,pid,tid,
grd,wgr,lds,scr,vgpr,sgpr,fbar,
sig,obj,MemUnitStalled,
TCC_MISS[0],DispatchNs,BeginNs,EndNs,CompleteNs
0,"matrixTranspose(float*, float*, int) [clone .kd]",0,0,0,
215,215,1048576,16,0,0,8,24,0,0x0,0x7f961080a580,7.0675214726,4096,
91063585321414,91063591158627,91063591252551,91063592018031
Profiling Multiple MPI Ranks#
To profile multiple MPI ranks in ROCm v4.3
and higher, use the following command:
mpirun ... <mpi args> ... rocprof ... <rocprof args> ... application ... <application args>
NOTE for ROCm v4.3 and later versions: Do not use rocprof command followed by mpirun
as previously used in ROCm v4.2
and lower. This feature differs from ROCm v4.2
and lower, which used rocprof ... mpirun ... application
.
This change was made to enable ROCProfiler
to handle process forking better and launching via mpirun
(and related) executables.
From a user perspective, this new execution mode requires the following:
Generation of trace data per
MPI
(or process) rank
ROCm provides a simple bash wrapper that demonstrates how to generate a unique output directory per process as given below:
$ cat wrapper.sh
#! /usr/bin/env bash
if [[ -n ${OMPI_COMM_WORLD_RANK+z} ]]; then
# mpich
export MPI_RANK=${OMPI_COMM_WORLD_RANK}
elif [[ -n ${MV2_COMM_WORLD_RANK+z} ]]; then
# ompi
export MPI_RANK=${MV2_COMM_WORLD_RANK}
fi
args="$*"
pid="$$"
outdir="rank_${pid}_${MPI_RANK}"
outfile="results_${pid}_${MPI_RANK}.csv"
eval "rocprof -d ${outdir} -o ${outdir}/${outfile} $*"
This script:
Determines the global
MPI
rank (implemented here forOpenMPI
andMPICH
only)Determines the process id of the
MPI
rankGenerates a unique output directory using the two
To invoke this wrapper, use the following command:
mpirun <mpi args> ./wrapper.sh --hip-trace <application> <args>
This generates an output directory for each used MPI rank.
Example:
$ ls -ld rank_* | awk {'print $5" "$9'}
4096 rank_513555_0
4096 rank_513556_1
Combining traces from multiple processes
The multiple traces as generated above can be combined using merge_traces.sh
utility script into a unified trace for profiling as shown below:
$ ./merge_traces.sh -h
Script for aggregating results from multiple ROCProfiler directories.
Full path: /opt/rocm/bin/merge_traces.sh
Usage:
merge_traces.sh -o <outputdir> [<inputdir>...]
Use the following input arguments to the merge_traces.sh
script to control which traces are merged and where the resulting merged trace is saved.
-o <outputdir>
- output directory where the results are aggregated.<inputdir>...
- space-separated list ofROCProfiler
directories. If not specified, the current working directory is used.
The file unified/results.json
is generated, and it contains trace data from both MPI ranks.
ROCProfiler API#
The ROCProfiler
library provides a hardware-specific low-level performance-analysis
interface for profiling of GPU compute applications. The profiling includes hardware performance
counters with complex performance metrics and hardware traces. The library can be loaded by HSA
runtime as a tool plugin or by a higher-level hardware-independent performance-analysis API like PAPI
.
For all the API-related information, refer to the ROCProfiler API specification.
Profiling Tools: Appendix#
This chapter provides meta-information and material for your reference.
Sample Application Code: MatrixTranspose#
#include <iostream>
// hip header file
#include <hip/hip_runtime.h>
#include "roctracer_ext.h"
// roctx header file
#include <roctx.h>
#define WIDTH 1024
#define NUM (WIDTH * WIDTH)
#define THREADS_PER_BLOCK_X 4
#define THREADS_PER_BLOCK_Y 4
#define THREADS_PER_BLOCK_Z 1
// Mark API
extern "C"
// Device (Kernel) function, it must be void
__global__ void
matrixTranspose(float* out, float* in, const int width) {
int x = hipBlockDim_x * hipBlockIdx_x + hipThreadIdx_x;
int y = hipBlockDim_y * hipBlockIdx_y + hipThreadIdx_y;
out[y * width + x] = in[x * width + y];
}
int main() {
float* Matrix;
float* TransposeMatrix;
float* cpuTransposeMatrix;
float* gpuMatrix;
float* gpuTransposeMatrix;
hipDeviceProp_t devProp;
hipGetDeviceProperties(&devProp, 0);
std::cout << "Device name " << devProp.name << std::endl;
int i;
int errors;
Matrix = (float*)malloc(NUM * sizeof(float));
TransposeMatrix = (float*)malloc(NUM * sizeof(float));
// initialize the input data
for (i = 0; i < NUM; i++) {
Matrix[i] = (float)i * 10.0f;
}
// allocate the memory on the device side
hipMalloc((void**)&gpuMatrix, NUM * sizeof(float));
hipMalloc((void**)&gpuTransposeMatrix, NUM * sizeof(float));
uint32_t iterations = 100;
while (iterations-- > 0) {
std::cout << "## Iteration (" << iterations
<< ")
################ #" << std::endl;
// Memory transfer from host to device
hipMemcpy(gpuMatrix, Matrix, NUM * sizeof(float), hipMemcpyHostToDevice);
roctxMark("before hipLaunchKernel");
int rangeId = roctxRangeStart("hipLaunchKernel range");
roctxRangePush("hipLaunchKernel");
// Lauching kernel from host
2 hipLaunchKernelGGL(
matrixTranspose, dim3(WIDTH / THREADS_PER_BLOCK_X, WIDTH / THREADS_PER_BLOCK_Y),
dim3(THREADS_PER_BLOCK_X, THREADS_PER_BLOCK_Y), 0, 0, gpuTransposeMatrix, gpuMatrix, WIDTH);
roctxMark("after hipLaunchKernel");
// Memory transfer from device to host
roctxRangePush("hipMemcpy");
hipMemcpy(TransposeMatrix, gpuTransposeMatrix, NUM * sizeof(float), hipMemcpyDeviceToHost);
roctxRangePop(); // for "hipMemcpy"
roctxRangePop(); // for "hipLaunchKernel"
roctxRangeStop(rangeId);
}
// free the resources on device side
hipFree(gpuMatrix);
hipFree(gpuTransposeMatrix);
// free the resources on host side
free(Matrix);
free(TransposeMatrix);
return errors;
}
Limitations#
When using
OpenMPI
, you must runrocprof
inside thempirun
command as shown in Profiling Multiple MPI Ranks. If you runrocprof
command beforempirun
, then the tool fails with the errorroctracer: Loading 'libamdhip64.so' failed, (null)
.GPU id in the
HSA
visualization is always 0.
Logging#
Set the following environment variables to enable logging:
Environment Variable |
Purpose |
Log Files |
---|---|---|
|
Enables error message logging |
|
|
Enables verbose tracing |
|