Using rocprof#

rocprof is a powerful tool for profiling HIP applications on AMD ROCm platforms. It can be used to identify performance bottlenecks in applications and to optimize their performance. rocprof provides a variety of profiling data, including performance counters, hardware traces, and runtime API/activity traces. This document provides a detailed description of the features and usage of the rocprof command-line tool with a focus on these two primary features:

  • Application Tracing: This basic feature of rocprof is used to trace the execution of an application, with start/end timestamps for each API call and kernel execution. [Read more here](#application-tracing).

  • Performance Counter Collection: This powerful feature collects performance counters for each API call and kernel execution. [Read more here](#performance-counter-collection).

To see all the rocprof options, refer to rocprof command reference, or run the following from the command line:

rocprof --help

Application tracing#

Application tracing provides the big picture of a program’s execution by collecting data on the execution times of API calls and GPU commands, such as kernel execution, async memory copy, and barrier packets. This information can be used as the first step in the profiling process to answer important questions, such as which kernel took the longest time to execute, and what percentage of time was spent on memory copy.

There are two ways to use application tracing: rocprof and ROCTracer API.

  • rocprof is a command line interface (CLI) profiler that can be used on applications running on ROCm-supported GPUs, without requiring any code modification in the application.

  • ROCTracer API is a library that requires minor code modification in the application to be traced but provides greater flexibility, such as adjusting execution based on profiling results.

Command-line options for application tracing#

The rocprof CLI allows you to trace the entire execution of HIP applications. It allows tracing at different levels such as HIP-level, HSA-level, and system-level. These levels can be selected by supplying the respective command-line options to rocprof.

The following command-line options are used with rocprof for tracing:

| Option | Description |
| --- | --- |
| -d <output directory> | Specifies the directory where the profiler stores traces and profiling data. By default, the profiler stores the data in a temporary directory (/tmp), which is removed automatically after a specific period. Specifying a directory explicitly prevents loss of data. |
| --hip-trace | Traces API execution stats, HIP API calls, and copy operation calls. For more information, refer to HIP Trace. |
| --hsa-trace | Traces API execution stats, HSA API calls, and copy operation calls. For more information, refer to HSA Trace. |
| --roctx-trace | Enables ROCTx application code annotation trace, which allows you to trace a particular block of code when markers and ranges are specified in the application code. For more information, refer to ROCTx Trace. |
| --stats | Traces API execution stats and kernel execution stats. For more information, refer to Statistics Files. |
| --sys-trace | Traces API execution stats, HIP and HSA API calls, and copy operation calls. For more information, refer to System Trace. |

HIP Trace#

Use the --hip-trace option to collect execution trace data for the entire application with rocprof, including HIP API functions and their asynchronous activities at the runtime level.

Usage:

rocprof -d outputFolder --hip-trace ./MatrixTranspose

The above command generates three groups of files: viewable trace data, statistics files, and intermediate tracing data.

Note

This document does not discuss the intermediate tracing data as it is not designed to be read by users.

Viewable Trace Data

The viewable trace data is available in results.json, a JSON file that follows the Chromium Project’s trace-event format. You can view the trace using tools such as Chrome Tracing (chrome://tracing/) or Perfetto UI. The following figure shows a short segment of the trace data:

Viewing HIP Trace

Fig. 1 Viewing HIP Trace#

In the time axis at the top of the following figure, there is a small highlighted region between 0.22s and 0.24s, which indicates the currently selected time range.

Time Range

Fig. 2 Time Range#

Below the time axis in the following figure are Gantt chart-style boxes that show the duration of each task. There are three rows of tasks, separated by black rows naming their categories. The first row is the CPU HIP API row, which lists the execution time of each API trace.

Duration of API Trace Execution

Fig. 3 Duration of API Trace Execution#

In the following figure, the second row, titled COPY, shows the tasks completed by the copy engine. Note that the CopyHostToDevice and CopyDeviceToHost tasks are completed at the beginning and the end of the time range, respectively.

Trace of Copy Tasks

Fig. 4 Copy Tasks#

In the following figure the GPU tasks are listed at the bottom. Note that the matrixTranspose kernel is executed from about 0.185s to 0.20s.

Trace of GPU Tasks

Fig. 5 GPU Tasks#
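Because results.json follows the trace-event format, the data shown in these figures can also be post-processed programmatically. The following sketch is an illustration, not part of rocprof: it sums the duration of each named event, assuming complete events ("ph": "X") whose "ts"/"dur" fields are in microseconds, as the trace-event format defines.

```python
import json
from collections import defaultdict

def summarize_trace(path):
    """Sum the duration of each named event in a Chrome trace-event file.

    Assumes complete events ("ph": "X") carrying "ts"/"dur" fields
    in microseconds, as defined by the trace-event format.
    """
    with open(path) as f:
        data = json.load(f)
    # The file may be a bare event list or an object with a "traceEvents" key.
    events = data["traceEvents"] if isinstance(data, dict) else data
    totals = defaultdict(float)
    for ev in events:
        if ev.get("ph") == "X":
            totals[ev["name"]] += ev.get("dur", 0)
    return dict(totals)
```

A summary like this can answer the question raised earlier, such as which kernel or API call accounts for the most time, without opening a trace viewer.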

Statistics Files#

The statistics files include:

  • results.stats.csv for kernel statistics

  • results.hip_stats.csv for HIP API

  • results.copy_stats.csv for activity statistics

The files are organized in comma-separated values (CSV) format, with the columns:

  • Name: Name of the action

  • Calls: Number of invocations

  • TotalDurationNs: Total duration in nanoseconds

  • AverageNS: Average time in nanoseconds required to execute the action

  • Percentage: Percentage of the action with respect to the complete execution of the application
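Since the statistics files are plain CSV, they are easy to post-process. The following sketch is a hypothetical helper, not part of rocprof: it ranks the actions in a stats file by total duration, assuming the column names listed above with the header spelling shown in the sample outputs later in this document (TotalDurationNs).

```python
import csv

def top_actions(path, n=5):
    """Rank the actions in a rocprof stats CSV by total duration.

    Assumes the documented columns: Name, Calls, TotalDurationNs,
    AverageNs, Percentage.
    """
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    rows.sort(key=lambda r: int(r["TotalDurationNs"]), reverse=True)
    return [(r["Name"], int(r["TotalDurationNs"])) for r in rows[:n]]
```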

HSA Trace#

The HIP runtime library is implemented on top of the low-level HSA runtime. To trace the application at a lower level, you can use rocprof to collect application traces at the HSA runtime level. In general, tracing at the HIP level is recommended for most users; use HSA trace only if you are familiar with the HSA runtime.

HSA trace contains the start/end time of HSA runtime API calls and their asynchronous activities. Use the --hsa-trace option with rocprof to collect HSA-level trace:

rocprof --hsa-trace ./MatrixTranspose

As in HIP trace, the generated HSA trace also includes three groups of files, namely viewable traces, statistics files, and intermediate tracing data. You can visualize the generated results.json using third-party tools such as Perfetto, as shown in the following figure.

Viewing HSA Trace

Fig. 6 Viewing HSA Trace#

The preceding figure shows HSA API rows in the HSA trace, just like the HIP API row in the HIP trace shown in Fig. 1. However, note that there are more HSA API calls than HIP API calls because HSA works closer to the hardware.

Note

GPU ID in the HSA visualization is always 0.

The statistics files are:

  • results.stats.csv for kernel statistics

  • results.hsa_stats.csv for HSA API

  • results.copy_stats.csv for memory copy statistics

Each file is a CSV table with the columns described in Statistics Files.

System Trace#

The rocprof tool can also generate both the HIP and HSA traces together with the --sys-trace option.

rocprof --sys-trace ./MatrixTranspose

The following figure shows the generated results.json visualized using Perfetto. It contains sections from both HIP and HSA trace.

Viewing Sys Trace

Fig. 7 Viewing Sys Trace#

ROCTx Trace#

In certain situations, such as debugging performance issues in large-scale GPU programs, API-level tracing may be too fine-grained to provide a big picture of the program execution. In such cases, you might find it helpful to define specific tasks to be traced.

To specify the tasks for tracing, enclose the respective source code with the API calls provided by ROCTx. This process is also known as instrumentation. As the scope of code for instrumentation is defined using the enclosing API calls, it is called a range. A range is a programmer-defined task that has a well-defined start and end code scope. You can further refine the scope specified within a range using nested ranges, and rocprof also reports the timelines for these nested ranges.

Here is a list of useful APIs for code instrumentation.

  • roctxMark: Inserts a marker in the code with a message. Creating marks can help you see when a line of code is executed.

  • roctxRangeStart: Starts a range. Ranges can be started by different threads.

  • roctxRangePush: Starts a new nested range.

  • roctxRangePop: Stops the current nested range.

  • roctxRangeStop: Stops the given range.

See roctx code annotations in the MatrixTranspose application below:

roctxMark("before hipLaunchKernel");
int rangeId = roctxRangeStart("hipLaunchKernel range");
roctxRangePush("hipLaunchKernel");

// Launching kernel from host
hipLaunchKernelGGL(matrixTranspose, dim3(WIDTH/THREADS_PER_BLOCK_X, WIDTH/THREADS_PER_BLOCK_Y), dim3(THREADS_PER_BLOCK_X, THREADS_PER_BLOCK_Y), 0,0,gpuTransposeMatrix,gpuMatrix, WIDTH);

roctxMark("after hipLaunchKernel");

// Memory transfer from device to host
roctxRangePush("hipMemcpy");

hipMemcpy(TransposeMatrix, gpuTransposeMatrix, NUM * sizeof(float), hipMemcpyDeviceToHost);

roctxRangePop();  // for "hipMemcpy"
roctxRangePop();  // for "hipLaunchKernel"
roctxRangeStop(rangeId);

Tip

A version of the MatrixTranspose application instrumented using the ROCTx API is available in the rocprofiler/tests-v2 folder on GitHub.

Use rocprof with the --roctx-trace option:

rocprof -d outputFolder --roctx-trace ./

You can visualize the generated output file results.json using Perfetto as shown in the following figure. The sections Markers and Ranges show the marked events and ranges.

Viewing Roctx Trace

Fig. 8 Viewing Roctx Trace#

Tracing Control#

The rocprof tool provides these customization options:

Filter Tasks#

To filter tasks, specify the trace category and the tasks to be traced in an input file.

cat input.txt
hsa : hsa_queue_create hsa_amd_memory_pool_allocate

Then supply this input file to rocprof.

rocprof -i input.txt --hsa-trace ./MatrixTranspose

The above sample input file generates HSA tracing information for only the two events specified in the file.

Adjust Trace Flush Rate#

Use the --flush-rate option to specify the flush rate in seconds, milliseconds, or microseconds. This determines how often the trace data is dumped to files. In the following example the flush rate is set to 10 microseconds:

rocprof --flush-rate 10us --hsa-trace ./MatrixTranspose

Function Name Truncation#

To truncate the full kernel function name in the trace files to its base name, use --basename on as shown in the following example:

$ rocprof --basename on --hip-trace ./MatrixTranspose

"Name","Calls","TotalDurationNs","AverageNs","Percentage"
"vectoradd_float",1,46373,46373,100.0

When --basename is not explicitly enabled, the full kernel function name is displayed in the trace:

$ rocprof --basename off --hip-trace ./MatrixTranspose

"Name","Calls","TotalDurationNs","AverageNs","Percentage"
"vectoradd_float(float*, float const*, float const*, int, int)",1,45633,45633,100.0

Tracing Control for API or Code Block#

To enable selective tracing for a HIP API or code block instead of the entire application, follow these steps:

  1. Enclose the API or code block within roctracer_start() and roctracer_stop(). This ensures that tracing starts only when roctracer_start() is encountered and stops at roctracer_stop(). The following example shows the usage of these API calls, where the user wants to trace only the hipMemcpy() API:

#include <roctracer/include/roctracer_ext.h>

// allocate the memory on the device side
hipMalloc((void**)&gpuMatrix, NUM * sizeof(float));
hipMalloc((void**)&gpuTransposeMatrix, NUM * sizeof(float));

roctracer_start();

// Memory transfer from host to device
hipMemcpy(gpuMatrix, Matrix, NUM * sizeof(float), hipMemcpyHostToDevice);

roctracer_stop();
  2. Use --trace-start off to disable application tracing from the beginning and start tracing only when roctracer_start() is encountered.

$ rocprof --trace-start off --hip-trace MatrixTranspose

cat results.hip_stats.csv
"Name","Calls","TotalDurationNs","AverageNs","Percentage"
"hipMemcpy",10,255048886,25504888,100.0

Set Initial Delay, Periodic Sample Length and Rate#

Use the option --trace-period <delay:length:rate> to fine-tune tracing by setting the following attributes:

  • Initial Delay: The time interval between the start of the profiler and the start of tracing. For example, an initial delay of 10 ms causes tracing to begin 10 ms after the profiler starts.

  • Periodic Sample Length: The duration for which tracing runs.

  • Rate: The rate at which the tracing results are flushed to the user.
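One way to read these attributes is as a periodic sampling schedule: after the initial delay, tracing is active for length out of every rate period. The following sketch computes the resulting active windows under that illustrative interpretation; it is not a specification of rocprof's internal scheduler.

```python
def sample_windows(delay, length, rate, until):
    """Compute the windows during which tracing is active, assuming tracing
    runs for `length` out of every `rate` period, starting after `delay`
    (all values in the same time unit, e.g. milliseconds).

    Illustrative reading of --trace-period <delay:length:rate>, not an
    official description of rocprof's scheduler.
    """
    windows = []
    t = delay
    while t < until:
        windows.append((t, min(t + length, until)))
        t += rate
    return windows
```

With delay 10 ms, length 1 ms, and rate 10 ms, tracing samples a 1 ms slice every 10 ms; with rate equal to length, the windows become back-to-back, which is consistent with the second example below capturing many more hipMemcpy calls.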

Example: Tracing with a delay of 10ms, length of 1ms and rate of 10ms

$ rocprof --hip-trace --trace-period 10ms:1ms:10ms MatrixTranspose

cat results.hip_stats.csv
"Name","Calls","TotalDurationNs","AverageNs","Percentage"
"hipMemcpy",6,6358906,1059817,100.0

Example: Tracing with a delay of 10ms, length of 1ms and rate of 1ms

$ rocprof --hip-trace --trace-period 10ms:1ms:1ms MatrixTranspose

cat results.hip_stats.csv
"Name","Calls","TotalDurationNs","AverageNs","Percentage"
"hipMemcpy",11,272473856,24770350,99.9468484848238
"hipMalloc",2,102871,51435,0.037734380837192355
"hipFree",2,39940,19970,0.014650495967157534
"hipGetDeviceProperties",1,2090,2090,0.0007666383718417439

ROCTracer API#

The ROCTracer APIs are runtime-independent APIs for tracing runtime calls and asynchronous activity, like GPU kernel dispatches and memory moves. The tracing includes callback API for runtime API tracing and activity API for asynchronous activity records logging. You can utilize these APIs to develop a tracing tool or to implement tracing within an application. Refer to the ROCTracer API Specification for more information.

To use the ROCTracer API, link the application with ROCTracer using the API header and dynamic library as shown below:

  • API header: /opt/rocm-{version}/include/roctracer/roctracer.h

  • Dynamic library (.so): /opt/rocm-{version}/lib/libroctracer64.so.<version major>

Performance Counter Collection#

As discussed in the sections above, the application trace mode is limited to providing an overview of program execution and does not provide insight into kernel execution. To address performance issues, the counter and metric collection functionality of rocprof can be used to report hardware component performance metrics during kernel execution.

Counter and metric collection is supported on the following GPUs:

  • AMD Radeon Instinct MI25, MI50, MI100, MI2XX

  • AMD Radeon VII, Radeon Pro VII

Command-line options for counter collection#

The following command-line options are used with rocprof for profiling:

| Option | Description |
| --- | --- |
| --list-basic | Prints the list of basic hardware counters. For more information, refer to Listing Performance Counters. |
| --list-derived | Prints the list of derived metrics with formulas. |
| -i <.txt/.xml file> | Retrieves the values of the desired list of basic counters and/or derived metrics. For more information, refer to Input File. |
| -m <.xml file> | Defines new derived metrics or modifies the existing metrics defined in metrics.xml by default. For more information, refer to Metric File. |
| -o <output file.csv> | Specifies a name for the output file generated when used with -i. For more information, refer to Generated Output. |
| --timestamp <on/off> | Enables or disables the kernel dispatch timestamps in nanoseconds for the events described below. Default value: on. For more information, refer to Generated Output. |

The kernel dispatch timestamps are:

  • DispatchNs: Time when the GPU receives notification to work on a specific kernel, as the kernel’s Architected Queuing Language (AQL) dispatch packet is submitted to the queue.

  • BeginNs: Time when the kernel begins execution.

  • EndNs: Time when the kernel finishes execution.

  • CompleteNs: Time when the system receives notification from the GPU about completion of the kernel’s work, through the completion signal of the AQL dispatch packet.

Listing Performance Counters#

AMD GPUs are equipped with hardware performance counters that can be used to measure specific values during kernel execution, which are then exported from the GPU and written into the output files at the end of the kernel execution. These performance counters vary according to the GPU. Therefore, it is recommended to examine the hardware counters that can be collected before running the profile.

There are two types of data available for profiling: hardware basic counters and derived metrics.

To obtain the list of supported basic counters, use:

rocprof --list-basic

The derived metrics are calculated from the basic counters using mathematical expressions. To list the supported derived metrics along with their mathematical expressions, use:

rocprof --list-derived

You can also customize the derived metrics as explained in Metric File.

Note

The output generated from the --list-basic and --list-derived commands can be lengthy and is sometimes worth capturing by redirecting the output to a file.

rocprof --list-derived ./MatrixTranspose > output.txt

Using rocprof for Application Profiling#

To profile kernels in GPU applications, define the profiling scope in an input file and use:

rocprof -i input.txt ./MatrixTranspose

Note

Refer to the MatrixTranspose application tutorial for the example application.

Input File#

As mentioned above, an input file is a text file that can be supplied to rocprof for basic counter and derived metric collection. It typically consists of four parts: the counters and derived metrics to collect, the GPUs to profile, the names of kernels to profile, and the range of kernel dispatches.

The collected data is written to an output CSV file that has the same name as the input file specified. For example, -i input.txt results in input.csv being generated.

Sample Input File

# Perf counters group 1
pmc: MemUnitStalled,TCC_MISS[0]
# Filter by dispatches range, GPU index and kernel names
# supported range formats: "3:9", "3:", "3"
range: 0:1
gpu: 0
kernel: matrixTranspose

The fields in the input file are as follows:

PMC: The rows in the input file beginning with pmc: specify the group of basic counters or derived metrics you are interested in collecting. The performance counters can be selected from the output generated by the --list-basic or --list-derived commands.

The number of basic counters or derived metrics that can be collected in one run of profiling is limited by the GPU hardware resources. If too many counters/metrics are selected, the kernels need to be executed multiple times to collect them all. For multi-pass execution, include multiple pmc: rows in the input file; the counters or metrics in each pmc: row are collected in one pass of kernel execution.
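The multi-pass layout can be produced mechanically. The following sketch is a hypothetical helper, not part of rocprof: it splits a counter list into multiple pmc: rows of at most max_per_row entries each. The real per-pass limit depends on the GPU's hardware resources, and, as noted below, rocprof itself proposes a grouping when the limit is exceeded.

```python
def split_pmc_rows(counters, max_per_row):
    """Split a list of counters/metrics into multiple 'pmc:' input-file rows.

    max_per_row is an illustrative placeholder; the real per-pass limit
    depends on the GPU hardware resources.
    """
    rows = []
    for i in range(0, len(counters), max_per_row):
        group = counters[i:i + max_per_row]
        rows.append("pmc: " + ",".join(group))
    return "\n".join(rows)
```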

Note

rocprof will provide suggestions to group the counters/metrics if you exceed the hardware limits. You can use this suggestion to split the counters/metrics into group sets and successfully perform counter/metric collection.

Example: To see the suggested grouping of counters/metrics:

$ cat input.txt
pmc : Wavefronts, VALUInsts, SALUInsts, SFetchInsts,
FlatVMemInsts,
LDSInsts, FlatLDSInsts, GDSInsts, VALUUtilization, FetchSize,
WriteSize, L2CacheHit, VWriteInsts, GPUBusy, VALUBusy, SALUBusy,
MemUnitStalled, WriteUnitStalled, LDSBankConflict, MemUnitBusy
range: 0 : 1
gpu: 0
kernel:matrixTranspose

$ rocprof -i input.txt ./MatrixTranspose
Input metrics out of hardware limit. Proposed metrics group set:
group1: FetchSize WriteSize VWriteInsts MemUnitStalled MemUnitBusy
FlatVMemInsts LDSInsts VALUInsts SALUInsts SFetchInsts
FlatLDSInsts GPUBusy Wavefronts
group2: WriteUnitStalled L2CacheHit GDSInsts VALUUtilization
VALUBusy SALUBusy LDSBankConflict

Note

The results reported vary depending on the GPU device being profiled.

GPU: The row beginning with the keyword gpu: specifies the GPU(s) on which the hardware counters are to be collected, which enables support for profiling multiple GPUs. You can specify multiple GPUs separated by commas, such as gpu: 1,3.

Kernel: Specifies the names of kernels to profile.

Range: Specifies the range of kernel dispatches. Specifying a range is helpful when the application launches many kernel dispatches and you want to filter some of them. In the above example, the range 0:1 indicates that one kernel dispatch is profiled.
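The four fields described above can also be assembled programmatically. The following sketch is a hypothetical helper, not part of rocprof, that composes an input file like the sample shown earlier; the space separator used for multiple kernel names is an assumption, as the sample shows only one kernel.

```python
def make_input_file(pmc, gpus, kernels, dispatch_range=None):
    """Compose a rocprof counter-collection input file from the documented
    fields: pmc (counters/metrics), gpu, kernel, and an optional dispatch
    range. Illustrative helper, not part of rocprof.
    """
    lines = ["pmc: " + ",".join(pmc)]
    if dispatch_range:
        lines.append("range: " + dispatch_range)
    lines.append("gpu: " + ",".join(str(g) for g in gpus))
    # Separator for multiple kernel names is an assumption.
    lines.append("kernel: " + " ".join(kernels))
    return "\n".join(lines) + "\n"
```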

Metric File#

The derived metrics are defined in the /opt/rocm/lib/rocprofiler/metrics.xml by default.

Here is an entry from metrics.xml:

<global>
# Wavefronts
<metric
    name="Wavefronts"
    descr="Total wavefronts."
    expr=SQ_WAVES
></metric>
</global>

To override the properties (description/expression) of the derived metrics that are predefined in the metrics.xml, redefine the derived metrics in the custom derived metrics file as shown here:

#include "gfx_metrics.xml"

<global>
<metric
    name="Wavefronts"
    descr="Total wavefronts. Description redefined by user."
    expr=SQ_WAVES
></metric>
</global>

Note that when specifying your custom metrics file, you must include the rocprofiler/test/tool/gfx_metrics.xml file, as all the basic counters are defined there. The basic counters are used to calculate the derived metrics.

Here is an entry from gfx_metrics.xml:

<gfx9>
<metric name="SQ_WAVES" block=SQ event=4 descr="Count number of waves sent to SQs. (per-simd, emulated, global)"></metric>
</gfx9>

You can also define new derived metrics in the custom metrics file as shown here:

#include "gfx_metrics.xml"

<gfx9_expr>
<metric name="TotalWorkItems" expr=SQ_WAVES*4  descr="Total number of waves sent to SQs(For all simd's). Defined by user." ></metric>
</gfx9_expr>

To see the list of customized derived metrics, use:

$ rocprof -m custom_metrics.xml --list-derived
gpu-agent1 : TotalWorkItems : Total number of waves sent to SQs(For all simd's). Defined by user.
TotalWorkItems = SQ_WAVES*4

To collect the values of custom derived metrics, use:

cat input.txt
pmc: TotalWorkItems

$ rocprof -i input.txt -m custom_metrics.xml ./MatrixTranspose
Index,KernelName,gpu-id,queue-id,queue-index,pid,tid,grd,wgr,lds,scr,arch_vgpr,accum_vgpr,sgpr,wave_size,sig,obj,TotalWorkItems
0,"matrixTranspose(float*, float*, int) [clone .kd]",1,0,0,2746,2746,1048576,16,0,0,8,0,16,64,0x0,0x7fead980e800,262144.0000000000

Note

You must use the input file to supply the list of derived metrics to be collected while the custom metrics file provides the expressions for calculating those derived metrics. When collecting the custom derived metrics, make sure to mention only those derived metrics in the input file that you have defined in the custom derived metrics file. When not using the option -m, rocprof refers to the default metrics.xml for expressions of all the derived metrics specified in the input file.

The basic counters and derived metrics specific to AMD GPUs are listed in the XML files under the respective GPU family. The LLVM target gfx9 corresponds to MI50 and MI100, while gfx90a corresponds to the MI200 family. The counters and metrics applicable to all GPUs are listed under global. See the supported GPUs and their respective LLVM targets in Linux Supported GPUs.

Generated Output#

Executing rocprof with an input file input.txt produces an output CSV (the file name can be specified with the -o option) with the counter information as shown below:

Example: Executing rocprof on MatrixTranspose application with sample input file

$ rocprof -i input.txt ./MatrixTranspose
$ cat input.csv
Index,KernelName,gpu-id,queue-id,queue-index,pid,tid,grd,wgr,lds,
scr,vgpr,sgpr,fbar,sig,obj,MemUnitStalled,TCC_MISS[0]
0,"matrixTranspose(float*, float*, int) [clone .kd]",0,0,0,2614,2614,
1048576,16,0,0,8,24,0,0x0,0x7fbfcb37c580,6.6756117852,4096.

Each row of the CSV file is an instance of kernel execution. The columns in the output file are:

  • Index - kernel dispatch order index

  • KernelName - kernel name

  • gpu-id - GPU ID the kernel was submitted to

  • queue-id - ROCm queue unique ID the kernel was submitted to

  • queue-index - ROCm queue write index for the submitted AQL packet

  • pid - system application process ID

  • tid - system application thread ID that submitted the kernel

  • grd - kernel’s grid size

  • wgr - kernel’s work group size

  • lds - kernel’s LDS memory size

  • scr - kernel’s scratch memory size

  • vgpr - kernel’s VGPR size

  • sgpr - kernel’s SGPR size

  • fbar - kernel’s barriers limitation

  • sig - kernel’s completion signal

Example: To enable the timestamp

To enable timestamp in the output, use option --timestamp on as shown in the following example.

$ rocprof -i input.txt --timestamp on ./MatrixTranspose
$ cat input.csv
Index,KernelName,gpu-id,queue-id,queue-index,pid,tid,grd,
wgr,lds,scr,vgpr,sgpr,fbar,sig,obj,
MemUnitStalled,TCC_MISS[0],DispatchNs,BeginNs,EndNs,CompleteNs
0,"matrixTranspose(float*, float*, int) [clone .kd]",0,0,0,2837,2837,
1048576,16,0,0,8,24,0,0x0,
0x7fcd75984580,5.9792124305,4096,87851328156768,
87851334047658,87851334141098,87851334732528

Note

The --timestamp option is on by default.
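The four timestamp columns can be combined into useful latency figures: for example, EndNs - BeginNs gives the kernel execution time, and BeginNs - DispatchNs gives the time the dispatch waited in the queue. The following sketch is an illustrative helper, not part of rocprof, and assumes a timestamped output CSV with the columns shown above.

```python
import csv

def kernel_timing(path):
    """Derive per-kernel timing from the rocprof timestamp columns:
    queue latency (BeginNs - DispatchNs), execution time (EndNs - BeginNs),
    and completion delay (CompleteNs - EndNs).
    """
    out = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            out.append({
                "kernel": row["KernelName"],
                "queue_latency_ns": int(row["BeginNs"]) - int(row["DispatchNs"]),
                "exec_ns": int(row["EndNs"]) - int(row["BeginNs"]),
                "complete_ns": int(row["CompleteNs"]) - int(row["EndNs"]),
            })
    return out
```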

Example: To specify the output file name.

The default output CSV file has the same name as the input file specified. To specify a name for the output file, use option -o <file name> as shown in the following example.

$ rocprof -i input.txt --timestamp on -o output.csv ./MatrixTranspose
$ cat output.csv
Index,KernelName,gpu-id,queue-id,queue-index,pid,tid,
grd,wgr,lds,scr,vgpr,sgpr,fbar,
sig,obj,MemUnitStalled,
TCC_MISS[0],DispatchNs,BeginNs,EndNs,CompleteNs
0,"matrixTranspose(float*, float*, int) [clone .kd]",0,0,0,
215,215,1048576,16,0,0,8,24,0,0x0,0x7f961080a580,7.0675214726,4096,
91063585321414,91063591158627,91063591252551,91063592018031

Profiling Multiple MPI Ranks#

To profile multiple MPI ranks use the following command:

mpirun ... <mpi args> ... rocprof ... <rocprof args> ... application ... <application args>

Important

When using OpenMPI, you must run rocprof inside the mpirun command, as shown above, to enable ROCProfiler to handle process forking and launching via mpirun and related executables. If you run the rocprof command before mpirun, then the tool fails with the error roctracer: Loading 'libamdhip64.so' failed, (null).

This execution mode requires the following:

  1. Generation of trace data per MPI (or process) rank

    ROCm provides a simple bash wrapper that demonstrates how to generate a unique output directory per process as given below:

    $ cat wrapper.sh
    #! /usr/bin/env bash
    if [[ -n ${OMPI_COMM_WORLD_RANK+z} ]]; then
        # OpenMPI
        export MPI_RANK=${OMPI_COMM_WORLD_RANK}
    elif [[ -n ${MV2_COMM_WORLD_RANK+z} ]]; then
        # MVAPICH2
        export MPI_RANK=${MV2_COMM_WORLD_RANK}
    fi
    
    args="$*"
    pid="$$"
    outdir="rank_${pid}_${MPI_RANK}"
    outfile="results_${pid}_${MPI_RANK}.csv"
    eval "rocprof -d ${outdir} -o ${outdir}/${outfile} $*"
    

    This script:

    • Determines the global MPI rank (implemented here for OpenMPI and MVAPICH2 only)

    • Determines the process id of the MPI rank

    • Generates a unique output directory and output file name from the two

    To invoke this wrapper, use the following command:

    mpirun <mpi args> ./wrapper.sh --hip-trace <application> <args>
    

    This generates an output directory for each MPI rank used.

    Example:

    $ ls -ld rank_* | awk {'print $5" "$9'}
    4096 rank_513555_0
    4096 rank_513556_1
    
  2. Combining traces from multiple processes

    The multiple traces generated above can be combined into a unified trace for profiling using the merge_traces.sh utility script, located at /opt/rocm/bin/merge_traces.sh.

    Usage:

    merge_traces.sh -o <outputdir> [<inputdir>...]
    

    For example:

    $ ./merge_traces.sh -h
    

    Use the following input arguments to the merge_traces.sh script to control which traces are merged and where the resulting merged trace is saved.

    • <inputdir>... - space-separated list of ROCProfiler directories to merge. If not specified, the current working directory is used.

    • -o <outputdir> - output directory where the results are aggregated. The file unified/results.json is generated, and it contains trace data from the specified input directories.
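For illustration only, the effect of merging can be sketched as concatenating the traceEvents arrays of the per-rank results.json files into one trace-event file; merge_traces.sh remains the supported way to do this.

```python
import json

def merge_trace_events(paths):
    """Concatenate the traceEvents arrays of several Chrome trace-event
    files into one unified event list. Conceptual sketch of what a merged
    trace contains; use the provided merge_traces.sh for real workflows.
    """
    merged = []
    for p in paths:
        with open(p) as f:
            data = json.load(f)
        merged.extend(data["traceEvents"] if isinstance(data, dict) else data)
    return {"traceEvents": merged}
```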

ROCProfiler API#

The ROCProfiler library provides a hardware-specific, low-level performance-analysis interface for profiling GPU compute applications. Profiling includes hardware performance counters with complex performance metrics and hardware traces. The library can be loaded by the HSA runtime as a tool plugin or by a higher-level, hardware-independent performance-analysis API such as PAPI.

For all the API-related information, refer to the ROCProfiler API specification.

Logging#

Set the following environment variables to enable logging:

| Environment Variable | Purpose | Log File |
| --- | --- | --- |
| ROCPROFILER_LOG=1 | Enables error message logging | /tmp/rocprofiler_log.txt |
| ROCPROFILER_TRACE=1 | Enables verbose tracing | /tmp/roctracer_log.txt |