ROCProfilerV1 User Manual#

Introduction#

rocprof is a powerful tool for profiling HIP applications on AMD ROCm platforms. It can be used to identify performance bottlenecks in applications and to optimize their performance. rocprof provides a variety of profiling data, including performance counters, hardware traces, and runtime API/activity traces. This document provides a detailed description of rocprof's features and usage with a focus on these important features of rocprof:

  • Application Tracing: This basic feature of rocprof is used to trace the execution of an application, with start/end timestamps for each API call and kernel execution. Read more here.

  • Performance Counter Collection: This powerful feature collects performance counters for each API call and kernel execution. Read more here.

To demonstrate the usage of rocprof with various options, this document refers to the MatrixTranspose application as an example.

Installation#

To install rocprof, the simplest method is to perform a full installation of ROCm, as rocprof is included as a standard component of the ROCm distribution. This approach is straightforward and dependable, making it suitable for most users. For detailed instructions on ROCm installation, refer to the Installation Guide..

Alternatively, installing rocprof from the sources offers greater flexibility. This option is however recommended for experienced users due to the complexity involved.

Supported Devices#

rocprof is supported on the following AMDGPU architectures:

  • gfx900 (AMD Vega 10)

  • gfx906 (AMD Vega 7nm also known as AMD Vega 20)

  • gfx908 (AMD Instinct™ MI100 accelerator)

  • gfx90a (Aldebaran)

  • gfx940 (AMD Instinct MI300)

  • gfx1010 (Navi10)

  • gfx1011 (Navi12)

  • gfx1012 (Navi14)

  • gfx1030 (Sienna Cichlid)

  • gfx1031 (Navy Flounder)

  • gfx1032 (Dimgrey Cavefish)

  • gfx1100 (Navi31)

The following section discusses how to install rocprof from the source.

Install from Source#

Prerequisites#

Installing rocprof from the source requires:

  • Linux system that is supported by ROCm.

  • AMD GPU driver and ROCm installed in the system.

  • A few other packages that are necessary for rocprof compilation and execution. Here are the dependency installation commands for supported Linux distros:

    • Ubuntu 20.04 and 22.04:

      sudo apt install libudev-dev libnuma1 libnuma-dev Python3 python3-pip gcc g++ make cmake doxygen
      
    • RHEL 8.6 and 9.2:

      yum install -y systemd-devel numactl numactl-devel Python3 python3-pip gcc gcc-g++ make cmake libatomic doxygen
      
    • CentOS 7:

      yum install -y systemd-devel numactl numactl-devel Python3 python3-pip gcc gcc-g++ make cmake libatomic doxygen
      
    • SLES 15.3:

      zypper in libudev-devel libnuma1 libnuma-devel Python3 python3-pip gcc gcc-g++ make cmake libatomic doxygen
      
    • SLES Tumbleweed:

      zypper install systemd-devel libnuma1 libnuma-devel Python3 python3-pip gcc gcc-g++ make cmake libatomic doxygen
      
  • Python libraries such as CppHeaderParser, argparse, and sqlite3. To install the libraries based on the default Python installation on the system, use:

pip3 install cppheaderparser argparse sqlite3

Build#

With the dependencies installed, you can now build rocprof from the source code using the steps given below:

  1. After cloning the repository, you can use CMake to compile and install rocprof. To use CMake, set the environment variable:

    export CMAKE_PREFIX_PATH=/opt/rocm/include/hsa:/opt/rocm
    
  2. Then, build rocprof with a premade bash script.

    ./build.sh
    
  3. You now have two options for utilizing the newly built rocprof and its associated libraries.

  4. The first option involves setting an environment variable to indicate the location of the executable and libraries. This enables the system to use the new rocprof and related libraries instead of the default version.

    export LD_LIBRARY_PATH=<BUILD_LOCATION>:$LD_LIBRARY_PATH
    export PATH=<BUILD_LOCATION>/bin:$PATH 
    
  5. Another option is to specify the desired location for installation of the executable and libraries if you do not want the installation process to move the executable and libraries to the default location. To specify the installation location, set environment variable:

    export CMAKE_INSTALL_PREFIX=<LOCAL INSTALL FOLDER>
    
  6. Then, to install rocprof in the build directory, run:

    cd build
    make install 
    

    The Makefile is generated by CMake.

Application Tracing#

Application tracing provides the big picture of a program’s execution by collecting data on the execution times of API calls and GPU commands, such as kernel execution, async memory copy, and barrier packets. This information can be used as the first step in the profiling process to answer important questions, such as how much percentage of time was spent on memory copy and which kernel took the longest time to execute.

There are two ways to use application tracing: rocprof and ROCTracer API. rocprof is a command line interface (CLI) profiler that can be used on the applications running on ROCm-supported GPUs, without the requirement of any code modification in the application. On the other hand, ROCTracer API is a library that requires minor code modification in the application to be traced but provides greater flexibility, such as adjusting execution based on profiling results.

Using rocprof for Application Tracing#

The rocprof CLI allows you to trace the entire execution of HIP applications. It allows tracing at different levels such as HIP-level, HSA-level, and system-level. These levels can be selected by supplying the respective command-line options to rocprof.

The command-line options used with rocprof for tracing:

Options

Description

-d <data directory>

To specify the directory where the profiler stores traces and the profiling data. The profiler stores the profiling data in a temporary directory [/tmp] by default, which is removed automatically after a specific period. Specifying a directory explicitly allows you to prevent loss of data.

--hip-trace

To trace API execution stats, HIP API calls, and copy operation calls. Read more…

--hsa-trace

To trace API execution stats, HSA API calls, kernel execution stats, and copy operation calls. Read more…

--sys-trace

To trace API execution stats, HIP and HSA API calls, and copy operation calls. Read more…

--roctx-trace

To enable roctx application code annotation trace. Allows you to trace a particular block of code when Markers and Ranges are specified in the application code. Read more…

--stats

To trace API execution stats and kernel execution stats

--basenames <on|off>

To enable/disable truncation of the kernel full function names in the trace, till the base ones. Default value: [off]. Read more…

--flush-rate

To enable trace flush rate

--roctx-rename

To rename long kernel names. Kernel renaming is recommended only in special cases where kernel names are auto generated in an incomprehensible format.

--trace-start <on|off>

To enable/disable tracing for a HIP API or code block. Default value: [on]. Read more…

--trace-period <delay:length:rate>

To enable tracing with an initial delay, periodic sample length and rate. Supported time formats: <number(m|s|ms|us)> Read more…

HIP Trace#

Collecting execution traces for the entire application with rocprof at the HIP level generates HIP traces. HIP trace includes HIP API functions and their asynchronous activities at the runtime level.

Usage:

rocprof -d outputFolder --hip-trace ./Matrixtranspose

The above command generates three groups of files, namely visualizable traces, statistics files, and intermediatory tracing data. This document does not discuss the intermediatory tracing data as it is not designed to be read by users directly.

Visualizable Traces

The visualizable traces are available in results.json, which is a JSON file that follows the Chromium Project’s trace-event format. You can visualize the trace using visualization tools such as Perfetto. See a short segment of the trace visualization in the figure below:

Fig. 1 HIP Trace Visualization#

In the figure below, see the time axis on the top. There is a small, highlighted region, between 0.22s to 0.24s, that indicates the currently selected time range.

Fig. 2 Time Range#

In the figure below, below the time axis, see the Gantt chart-style boxes that show the duration of each task. There are three rows of tasks divided by the black rows mentioning their categories. The first row is the “CPU HIP API”, which lists the execution time of each API trace.

Fig. 3 Duration of API Trace Execution#

In the figure below, the second row titled COPY shows the tasks completed by the copy engine. See that the CopyHostToDevice and CopyDeviceToHost tasks are being completed at the beginning and the end of the time range, respectively.

Fig. 4 Copy Tasks#

In the figure below, see that the GPU tasks are listed at the bottom. Note that the matrixTranspose kernel is executed from about 0.185s to 0.20s.

Fig. 5 GPU Tasks#

Statistics Files#

The statistics files include:

  • results.stats.csv for kernel statistics

  • results.hip_stats.csv for HIP API

  • results.copy_stats.csv for activity statistics

They are organized in comma-separated values (CSV) format, with the columns:

  • Name: Name of the action

  • Calls: Number of invocations

  • TotalDurationNS: Total duration in nanosecond

  • AverageNS: Average time in nanoseconds required to execute the action

  • Percentage: Percentage of the action with respect to the complete execution of the application

HSA Trace#

The HIP runtime library is implemented with the low-level HSA runtime. To trace the application at a lower level, you can use rocprof to collect application traces at the HSA runtime level. In general, tracing at the HIP-level is recommended for most users. You are advised to use HSA trace only if you are familiar with HSA runtime.

HSA trace contains the start/end time of HSA runtime API calls and their asynchronous activities. Use the hsa-trace option with rocprof to collect HSA-level trace:

rocprof --hsa-trace ./MatrixTranspose

As in HIP trace, the generated HSA trace also includes three groups of files, namely visualizable traces, statistics files, and intermediatory tracing data.

You can visualize the generated results.json using third-party tools such as Perfetto, as shown below.

Fig. 6 HSA Trace Visualization#

In figure above, you can see that HSA trace shows HSA trace rows, just like HIP trace shows the HIP API row in Fig. 1. However, note that there are more HSA API calls as compared to HIP API calls because HSA works closer to the hardware.

The statistics files are:

  • results.stats.csv for kernel statistics

  • results.hsa_stats.csv for HSA API

  • results.copy_stats.csv for memory copy statistics

Each file is a CSV table with columns as described in the Statistic Files.

System Trace#

The rocprof tool can also generate both the HIP and HSA traces together with the sys-trace option.

rocprof --sys-trace ./MatrixTranspose

The following figure shows the generated results.json visualized using Perfetto. It contains sections from both HIP and HSA trace.

Fig. 7 Sys Trace Visualization#

Roctx Trace#

In certain situations, such as debugging performance issues in large-scale GPU programs, API-level tracing may be too fine-grained to provide a big picture of the program execution. In such cases, you might find it helpful to define specific tasks to be traced.

To specify the tasks for tracing, enclose the respective source code with the API calls provided by Roctx. This process is also known as instrumentation. As the scope of code for instrumentation is defined using the enclosing API calls, it is called a range. A range is a programmer-defined task that has a well-defined start and end code scope. You can also fine grain the scope specified within a range using further nested ranges. The rocprof tool also reports the timelines for these nested ranges.

Here is a list of useful APIs for code instrumentation.

  • roctxMark: Inserts a marker in the code with a message. Creating marks can help you see when a line of code is executed.

  • roctxRangeStart: Starts a range. Ranges can be started by different threads.

  • roctxRangePush: Starts a new nested range.

  • roctxRangePop: Stops the current nested range.

  • roctxRangeStop: Stops the given range.

See roctx code annotations in the MatrixTranspose application below:

roctxMark("before hipLaunchKernel");
int rangeId = roctxRangeStart("hipLaunchKernel range");
roctxRangePush("hipLaunchKernel");

// Lauching kernel from host
hipLaunchKernelGGL(matrixTranspose, dim3(WIDTH/THREADS_PER_BLOCK_X, WIDTH/THREADS_PER_BLOCK_Y), dim3(THREADS_PER_BLOCK_X, THREADS_PER_BLOCK_Y), 0,0,gpuTransposeMatrix,gpuMatrix, WIDTH);

roctxMark("after hipLaunchKernel");

// Memory transfer from device to host
roctxRangePush("hipMemcpy");

hipMemcpy(TransposeMatrix, gpuTransposeMatrix, NUM * sizeof(float), hipMemcpyDeviceToHost);

roctxRangePop();  // for "hipMemcpy"
roctxRangePop();  // for "hipLaunchKernel"
roctxRangeStop(rangeId);

Using rocprof with roctx-trace option:

rocprof -d outputFolder --roctx-trace ./

You can visualize the generated output file results.json using Perfetto as shown in the figure below. The sections Markers and Ranges show the marked events and ranges.

Fig. 8 Roctx Trace Visualization#

Tracing Control#

The rocprof tool provides these customization options:

Filter Tasks#

To filter tasks, specify the trace category and the tasks to be traced in an input file.

cat input.txt
hsa : hsa_queue_create hsa_amd_memory_pool_allocate

Then supply this input file to rocprof.

rocprof -i input.txt --hsa-trace ./MatrixTranspose

The above sample input file generates HSA tracing information for only the two events specified in the file.

Adjust Trace Flush Rate#

To specify the flush rate (in seconds, microseconds, or milliseconds) which determines how often the traces are dumped to the files, use:

rocprof --flush-rate 10us --hsa-trace ./MatrixTranspose
Function Name Truncation#

To truncate the kernel full function name in the trace till the base one, use:

$ rocprof --basename on -–hip-trace ./MatrixTranspose

"Name","Calls","TotalDurationNs","AverageNs","Percentage"
"vectoradd_float",1,46373,46373,100.0

When basename is not explicitly enabled, the full kernel function name is displayed in the trace:

$ rocprof --basename off -–hip-trace ./MatrixTranspose

"Name","Calls","TotalDurationNs","AverageNs","Percentage"
"vectoradd_float(float*, float const*, float const*, int, int)",1,45633,45633,100.0
Tracing Control for API or Code Block#

To enable selective tracing for a HIP API(s) or code block(s) instead of the entire application, follow these steps:

  1. Enclose the API or code block within roctracer_start() and roctracer_stop(). This ensures that tracing starts only when it encounters roctracer_start() and stops once it encounters roctracer_stop(). See the usage of these API calls here, where the user wants to trace only hipMemcpy() API:

#include <roctracer/include/roctracer_ext.h>
    
// allocate the memory on the device side
hipMalloc((void**)&gpuMatrix, NUM * sizeof(float));
hipMalloc((void**)&gpuTransposeMatrix, NUM * sizeof(float));

roctracer_start();
    
// Memory transfer from host to device
hipMemcpy(gpuMatrix, Matrix, NUM * sizeof(float), hipMemcpyHostToDevice);
    
roctracer_stop();
  1. Use --trace-start off to disable application tracing from the beginning and start tracing only when roctracer_start() is encountered.

$ rocprof --trace-start off --hip-trace MatrixTranspose
    
cat results.hip_stats.csv
"Name","Calls","TotalDurationNs","AverageNs","Percentage"
"hipMemcpy",10,255048886,25504888,100.0
Set Initial Delay, Periodic Sample Length and Rate#

Use option –trace-period <delay:length:rate> to finetune tracing by setting the following attributes:

  • Initial Delay: The time interval between the start of the profiler and start of tracing. So an initial delay of 10 ms causes the tracing to commence after 10 ms since the start of the profiler.

  • Periodic Sample Length: The duration for which tracing runs

  • Rate: Rate at which the tracing results are flushed to the user

Example: Tracing with a delay of 10ms, length of 1ms and rate of 10ms

$ rocprof --hip-trace --trace-period 10ms:1ms:10ms MatrixTranspose

cat results.hip_stats.csv
"Name","Calls","TotalDurationNs","AverageNs","Percentage"
"hipMemcpy",6,6358906,1059817,100.0

Example: Tracing with a delay of 10ms, length of 1ms and rate of 1ms

$ rocprof --hip-trace --trace-period 10ms:1ms:1ms MatrixTranspose 

cat results.hip_stats.csv
"Name","Calls","TotalDurationNs","AverageNs","Percentage"
"hipMemcpy",11,272473856,24770350,99.9468484848238
"hipMalloc",2,102871,51435,0.037734380837192355
"hipFree",2,39940,19970,0.014650495967157534
"hipGetDeviceProperties",1,2090,2090,0.0007666383718417439

ROCTracer API#

The ROCTracer APIs are runtime-independent APIs for tracing runtime calls and asynchronous activity, like GPU kernel dispatches and memory moves. The tracing includes callback API for runtime API tracing and activity API for asynchronous activity records logging. You can utilize these APIs to develop a tracing tool or to implement tracing in application. To use the ROCTracer API, link the application with ROCTracer using the API header and dynamic library as shown below:

  • API header: /opt/rocm-{version}/include/roctracer/roctracer.h

  • Dynamic library (.so): /opt/rocm-{version}/lib/libroctracer64.so.<version major>

For all the API-related information, refer to the ROCTracer API Specification.

Performance Counter Collection#

As discussed in the sections above, the application trace mode is limited to providing an overview of program execution and does not provide an insight into kernel execution. To address performance issues, the counter and metric collection functionality of rocprof can be used to report hardware component performance metrics during kernel execution.

Counter and metric collection is supported on the following GPUs:

  • AMD Radeon Instinct MI25, MI50, MI100, MI2XX

  • AMD Radeon VII, Radeon Pro VII

The command-line options used with rocprof for profiling:

Options

Description

--list-basic

To print the list of basic hardware counters. Read more…

--list-derived

To print the list of derived metrics with formulas. Read more…

-i <.txt|.xml file>

To retrieve the values of the desired list of basic counters and/or derived metrics. Read more…

-m <xml file>

To define new derived metrics or modify the existing derived metrics that are defined in the metrics.xml by default. Read more…

-o <output file.csv>

To provide a preferred name to the generated output file. Refer to the example in Generated Output section below.

--timestamp <on|off>

To enable/disable the kernel dispatch timestamps in nanoseconds such as DispatchNs/BeginNs/EndNs/CompleteNs. Default value: [on].

  • DispatchNs: Dispatch time when the GPU receives notification to work on a specific kernel as the kernel Architected Queuing Language (AQL) dispatch packet is submitted to the queue.
  • BeginNs: Begin time when the kernel begins execution
  • EndNs: End time when the kernel finishes execution
  • CompleteNs: Complete time when the system receives notification from the GPU about completion of work by the kernel through the completion signal of the AQL dispatch packet.

Refer to the example in the Generated Output section below.

Listing Performance Counters#

AMDGPUs are equipped with hardware performance counters that can be used to measure specific values during kernel execution, which are then exported from the GPU and written into the output files at the end of the kernel execution. These performance counters vary according to the GPU. Therefore, it is recommended to examine the hardware counters that can be collected before running the profile.

There are two types of data available for profiling: hardware basic counters and derived metrics.

To obtain the list of supported basic counters, use:

rocprof --list-basic

The derived metrics are calculated from the basic counters using mathematical expressions. To list the supported derived metrics along with their mathematical expressions, use:

rocprof --list-derived

You can also customize the derived metrics as explained in the Metric File below.

Using rocprof for Application Profiling#

To profile kernels in GPU applications, define the profiling scope in an input file and use:

rocprof -i input.txt ./MatrixTranspose

Input File#

As mentioned above, an input file is a text file that can be supplied to rocprof for basic counter and derived metric collection. It typically consists of four parts, namely the basic counter(s)/derived metrics(s) to use, the GPUs to profile, name of the kernels to be profiled, and the range of kernels to profile.

Sample Input File

# Perf counters group 1
pmc: MemUnitStalled,TCC_MISS[0]
# Filter by dispatches range, GPU index and kernel names
# supported range formats: "3:9", "3:", "3"
range: 0:1
gpu: 0
kernel: matrixTranspose

The fields in the input file are described here:

PMC: The rows in the text file beginning with pmc are the group of basic counters or derived metrics the user is interested in collecting. The performance counters can be selected from the output generated by --list-basic or --list-derived command.

The number of basic counters or derived metrics that can be collected in one run of profiling is limited by the GPU hardware resources. If too many counters/metrics are selected, the kernels need to be executed multiple times to collect the counters/metrics. For multi-pass execution, include multiple rows of pmc in the input file. Counters or metrics in each pmc row can be collected in each run of the kernel.

rocprof provides suggestions to group the counters/metrics if the user exceeds the hardware limits. You can use this suggestion to split the counters/metrics into group sets and successfully perform counter/metric collection.

Example: To see the suggestion to group the counters/metrics on exceeding hardware limits

$ cat input.txt
pmc : Wavefronts, VALUInsts, SALUInsts, SFetchInsts,
FlatVMemInsts,
LDSInsts, FlatLDSInsts, GDSInsts, VALUUtilization, FetchSize,
WriteSize, L2CacheHit, VWriteInsts, GPUBusy, VALUBusy, SALUBusy,
MemUnitStalled, WriteUnitStalled, LDSBankConflict, MemUnitBusy
range: 0 : 1
gpu: 0
kernel:matrixTranspose

$ ropcprof -i input.txt ./MatrixTranspose
Input metrics out of hardware limit. Proposed metrics group set:
group1: FetchSize WriteSize VWriteInsts MemUnitStalled MemUnitBusy
FlatVMemInsts LDSInsts VALUInsts SALUInsts SFetchInsts
FlatLDSInsts GPUBusy Wavefronts
group2: WriteUnitStalled L2CacheHit GDSInsts VALUUtilization
VALUBusy SALUBusy LDSBankConflict

GPU: The row beginning with the keyword gpu specifies the GPU(s) on which the hardware counters are to be collected. This enables the support for profiling multiple GPUs. You can specify multiple GPUs separated by comma such as gpu: 1,3.

Kernel: The row beginning with the keyword kernel specifies the name(s) of the kernel(s) that need to be profiled.

Range: The row beginning with the keyword range specifies the range of kernel dispatches. Specifying range is helpful in cases where the application causes multiple kernel dispatches and users want to filter some kernel dispatches. In the above example, the range 0:1 depicts that one kernel is profiled.

Metric File#

The derived metrics are defined in the ./lib/rocprofiler/metrics.xml by default.

Here is an entry from metrics.xml:

<global>
# Wavefronts 
  <metric
    name="Wavefronts"
    descr="Total wavefronts."
    expr=SQ_WAVES
  ></metric>
</global>

To override the properties (description/expression) of the derived metrics that are predefined in the metrics.xml, redefine the derived metrics in the custom derived metrics file as shown here:

#include "gfx_metrics.xml"

<global>
  <metric
    name="Wavefronts"
    descr="Total wavefronts. Description redefined by user."
    expr=SQ_WAVES
  ></metric>
</global>

Note that while specifying your custom metrics file, you must include rocprofiler/test/tool/gfx_metrics.xml file as all the basic counters are defined here. The basic counters are used in calculating the derived metrics.

Here is an entry from gfx_metrics.xml:

<gfx9>
<metric name="SQ_WAVES" block=SQ event=4 descr="Count number of waves sent to SQs. (per-simd, emulated, global)"></metric>
<gfx9>

You can also define new derived metrics in the custom metrics file as shown here:

#include "gfx_metrics.xml"

<gfx9_expr>
<metric name="TotalWorkItems" expr=SQ_WAVES*4  descr="Total number of waves sent to SQs(For all simd's). Defined by user." ></metric>
</gfx9_expr>

To see the list of customized derived metrics, use:

$ rocprof -m custom_metrics.xml --list-derived
gpu-agent1 : TotalWorkItems : Total number of waves sent to SQs(For all simd's). Defined by user.
TotalWorkItems = SQ_WAVES*4

To collect the values of custom derived metrics, use:

cat input.txt
pmc: TotalWorkItems

$ rocprof -i input.txt -m custom_metrics.xml ./MatrixTranspose
Index,KernelName,gpu-id,queue-id,queue-index,pid,tid,grd,wgr,lds,scr,arch_vgpr,accum_vgpr,sgpr,wave_size,sig,obj,TotalWorkItems
0,"matrixTranspose(float*, float*, int) [clone .kd]",1,0,0,2746,2746,1048576,16,0,0,8,0,16,64,0x0,0x7fead980e800,262144.0000000000

Note that you must use the input file to supply the list of derived metrics to be collected while the custom metrics file provides the expressions for calculating those derived metrics. When collecting the custom derived metrics, make sure to mention only those derived metrics in the input file that you have defined in the custom derived metrics file. When not using the option -m, rocprof refers to the default metrics.xml for expressions of all the derived metrics specified in the input file.

The basic counters and derived metrics specific to the AMDGPUs are listed in the xml files under the respective GPU family. The LLVM target gfx9 corresponds to MI50 and MI100 while gfx90a corresponds to the MI200 family. The counters and metrics applicable to all GPUs are listed under global. See the supported GPUs and their respective LLVM targets in the Linux Supported GPUs.

Generated Output#

Executing rocprof with the above-mentioned input file input.txt on the MatrixTranspose application produces an output CSV (the file name can be specified with the -o option) with the counter information as shown below:

Example: Executing rocprof on MatrixTranspose application with sample input file

$ rocprof -i input.txt ./MatrixTranspose
$ cat input.csv
Index,KernelName,gpu-id,queue-id,queue-index,pid,tid,grd,wgr,lds,
scr,vgpr,sgpr,fbar,sig,obj,MemUnitStalled,TCC_MISS[0]
0,"matrixTranspose(float*, float*, int) [clone .kd]",0,0,0,2614,2614,
1048576,16,0,0,8,24,0,0x0,0x7fbfcb37c580,6.6756117852,4096.

Each row of the CSV file is an instance of kernel execution. The columns in the output file are:

  • Index - kernels dispatch order index

  • KernelName - kernel name

  • gpu-id - GPU ID the kernel was submitted to

  • queue-id - ROCm queue unique id the kernel was submitted to

  • queue-index - ROCm queue write index for the submitted AQL packet

  • tid - system application thread id that submitted the kernel

  • grd - kernel’s grid size

  • wgr - kernel’s work group size

  • lds - kernel’s LDS memory size

  • scr - kernel’s scratch memory size

  • vgpr - kernel’s VGPR size

  • sgpr - kernel’s SGPR size

  • fbar - kernel’s barriers limitation

  • sig - kernel’s completion signal

To enable timestamp in the output, use option --timestamp on, as shown below:

Example: To enable the timestamp

$ rocprof -i input.txt --timestamp on ./MatrixTranspose
$ cat input.csv
Index,KernelName,gpu-id,queue-id,queue-index,pid,tid,grd,
wgr,lds,scr,vgpr,sgpr,fbar,sig,obj,
MemUnitStalled,TCC_MISS[0],DispatchNs,BeginNs,EndNs,CompleteNs
0,"matrixTranspose(float*, float*, int) [clone .kd]",0,0,0,2837,2837,
1048576,16,0,0,8,24,0,0x0,
0x7fcd75984580,5.9792124305,4096,87851328156768,
87851334047658,87851334141098,87851334732528

To provide a preferred name to the output file, use option -o <file name> as shown below:

Example: To specify the output file name

$ rocprof -i input.txt --timestamp on -o output.csv ./MatrixTranspose
$ cat output.csv
Index,KernelName,gpu-id,queue-id,queue-index,pid,tid,
grd,wgr,lds,scr,vgpr,sgpr,fbar,
sig,obj,MemUnitStalled,
TCC_MISS[0],DispatchNs,BeginNs,EndNs,CompleteNs
0,"matrixTranspose(float*, float*, int) [clone .kd]",0,0,0,
215,215,1048576,16,0,0,8,24,0,0x0,0x7f961080a580,7.0675214726,4096,
91063585321414,91063591158627,91063591252551,91063592018031

Profiling Multiple MPI Ranks#

To profile multiple MPI ranks in ROCm v4.3 and higher, use the following command:

mpirun ... <mpi args> ... rocprof ... <rocprof args> ... application ... <application args>

NOTE for ROCm v4.3 and later versions: Do not use rocprof command followed by mpirun as previously used in ROCm v4.2 and lower. This feature differs from ROCm v4.2 and lower, which used rocprof ... mpirun ... application.

This change was made to enable ROCProfiler to handle process forking better and launching via mpirun (and related) executables.

From a user perspective, this new execution mode requires the following:

  1. Generation of trace data per MPI (or process) rank

ROCm provides a simple bash wrapper that demonstrates how to generate a unique output directory per process as given below:

$ cat wrapper.sh
#! /usr/bin/env bash
if [[ -n ${OMPI_COMM_WORLD_RANK+z} ]]; then
	# mpich
	export MPI_RANK=${OMPI_COMM_WORLD_RANK}

elif [[ -n ${MV2_COMM_WORLD_RANK+z} ]]; then
	# ompi
	export MPI_RANK=${MV2_COMM_WORLD_RANK}
fi

args="$*"
pid="$$"
outdir="rank_${pid}_${MPI_RANK}"
outfile="results_${pid}_${MPI_RANK}.csv"
eval "rocprof -d ${outdir} -o ${outdir}/${outfile} $*"

This script:

  • Determines the global MPI rank (implemented here for OpenMPI and MPICH only)

  • Determines the process id of the MPI rank

  • Generates a unique output directory using the two

To invoke this wrapper, use the following command:

mpirun <mpi args> ./wrapper.sh --hip-trace <application> <args>

This generates an output directory for each used MPI rank.

Example:

$ ls -ld rank_* | awk {'print $5" "$9'}
4096 rank_513555_0
4096 rank_513556_1
  1. Combining traces from multiple processes

The multiple traces as generated above can be combined using merge_traces.sh utility script into a unified trace for profiling as shown below:

$  ./merge_traces.sh -h
Script for aggregating results from multiple ROCProfiler directories.

Full path: /opt/rocm/bin/merge_traces.sh

Usage:
merge_traces.sh -o <outputdir> [<inputdir>...]

Use the following input arguments to the merge_traces.sh script to control which traces are merged and where the resulting merged trace is saved.

  • -o <outputdir> - output directory where the results are aggregated.

  • <inputdir>... - space-separated list of ROCProfiler directories. If not specified, the current working directory is used.

The file unified/results.json is generated, and it contains trace data from both MPI ranks.

ROCProfiler API#

The ROCProfiler library provides a hardware-specific low-level performance-analysis interface for profiling of GPU compute applications. The profiling includes hardware performance counters with complex performance metrics and hardware traces. The library can be loaded by HSA runtime as a tool plugin or by a higher-level hardware-independent performance-analysis API like PAPI.

For all the API-related information, refer to the ROCProfiler API specification.

Profiling Tools: Appendix#

This chapter provides meta-information and material for your reference.

Sample Application Code: MatrixTranspose#

#include <iostream>
// hip header file
#include <hip/hip_runtime.h>
#include "roctracer_ext.h"
// roctx header file
#include <roctx.h>
#define WIDTH 1024
#define NUM (WIDTH * WIDTH)
#define THREADS_PER_BLOCK_X 4
#define THREADS_PER_BLOCK_Y 4
#define THREADS_PER_BLOCK_Z 1
// Mark API
extern "C"
    // Device (Kernel) function, it must be void
    __global__ void
    matrixTranspose(float* out, float* in, const int width) {
  int x = hipBlockDim_x * hipBlockIdx_x + hipThreadIdx_x;
  int y = hipBlockDim_y * hipBlockIdx_y + hipThreadIdx_y;
  out[y * width + x] = in[x * width + y];
}
int main() {
  float* Matrix;
  float* TransposeMatrix;
  float* cpuTransposeMatrix;
  float* gpuMatrix;
  float* gpuTransposeMatrix;
  hipDeviceProp_t devProp;
  hipGetDeviceProperties(&devProp, 0);
  std::cout << "Device name " << devProp.name << std::endl;
  int i;
  int errors;
  Matrix = (float*)malloc(NUM * sizeof(float));
  TransposeMatrix = (float*)malloc(NUM * sizeof(float));
  // initialize the input data
  for (i = 0; i < NUM; i++) {
    Matrix[i] = (float)i * 10.0f;
  }
  // allocate the memory on the device side
  hipMalloc((void**)&gpuMatrix, NUM * sizeof(float));
  hipMalloc((void**)&gpuTransposeMatrix, NUM * sizeof(float));
  uint32_t iterations = 100;
  while (iterations-- > 0) {
    std::cout << "## Iteration (" << iterations
              << ")
        ################ #" << std::endl;
        // Memory transfer from host to device
        hipMemcpy(gpuMatrix, Matrix, NUM * sizeof(float), hipMemcpyHostToDevice);
    roctxMark("before hipLaunchKernel");
    int rangeId = roctxRangeStart("hipLaunchKernel range");
    roctxRangePush("hipLaunchKernel");
    // Lauching kernel from host
    2 hipLaunchKernelGGL(
        matrixTranspose, dim3(WIDTH / THREADS_PER_BLOCK_X, WIDTH / THREADS_PER_BLOCK_Y),
        dim3(THREADS_PER_BLOCK_X, THREADS_PER_BLOCK_Y), 0, 0, gpuTransposeMatrix, gpuMatrix, WIDTH);
    roctxMark("after hipLaunchKernel");
    // Memory transfer from device to host
    roctxRangePush("hipMemcpy");
    hipMemcpy(TransposeMatrix, gpuTransposeMatrix, NUM * sizeof(float), hipMemcpyDeviceToHost);
    roctxRangePop();  // for "hipMemcpy"
    roctxRangePop();  // for "hipLaunchKernel"
    roctxRangeStop(rangeId);
  }
  // free the resources on device side
  hipFree(gpuMatrix);
  hipFree(gpuTransposeMatrix);
  // free the resources on host side
  free(Matrix);
  free(TransposeMatrix);
  return errors;
}

Limitations#

  • When using OpenMPI, you must run rocprof inside the mpirun command as shown in Profiling Multiple MPI Ranks. If you run rocprof command before mpirun, then the tool fails with the error roctracer: Loading 'libamdhip64.so' failed, (null).

  • GPU id in the HSA visualization is always 0.

Logging#

Set the following environment variables to enable logging:

Environment Variable

Purpose

Log Files

$ export ROCPROFILER_LOG=1

Enables error message logging

/tmp/rocprofiler_log.txt

$ export ROCPROFILER_TRACE=1

Enables verbose tracing

/tmp/roctracer_log.txt

References#