Using rocprofv2#

Note

rocprofv2 is considered beta software.

rocprofv2 is a command-line interface tool (CLI) that lets you profile AMD GPU applications without any requirement of source code modification. The usage of rocprofv2 along with various command-line arguments is described in the following sections.

To see all the rocprofv2 options, refer to rocprofv2 Command reference, or run the following from the command line:

rocprofv2 --help

Application tracing#

Tracing of application and hardware events, is a primary feature of the rocprofv2 command. The various options for tracing HIP/HSA API, asynchronous activity, and kernel dispatches are described in the following table:

Tracing Mode

Option

Usage

HIP API tracing

--hip-api

rocprofv2 --hip-api <app_relative_path>

Combined HIP API and asynchronous activity tracing

--hip-activity
or --hip-trace

rocprofv2 --hip-activity <app_relative_path>

HSA API tracing

--hip-api

rocprofv2 --hsa-api <app_relative_path>

Combined HSA API and asynchronous activity tracing

--hip-activity
or --hsa-trace

rocprofv2 --hsa-api <app_relative_path>

ROCTx API tracing

--roctx-trace

rocprofv2 --roctx-trace <app_relative_path>

Kernel dispatches tracing

--kernel-trace

rocprofv2 --kernel-trace <app_relative_path>

All tracing modes combined

--sys-trace

rocprofv2 --sys-trace <app_relative_path>

Note

By default, the output of these options is directed to stdout unless the -o option is also specified.

To generate output from these trace options, use one of the supported plugins that generate output in a specific format, as explained in Formatting output using plugins. The default plugin is the file plugin that generates a CSV file returned to stdout, or returned to a file when used with -o option.

rocprofv2 supports API tracing at both HIP and HSA level. In general, HIP APIs directly interact with the user program. It is easier to analyze HIP traces as you can directly map the traces to the program. HSA API tracing is more suited for advanced users who want to understand the application behavior at the lower level.

Both HIP and HSA APIs support asynchronous behavior (e.g., asynchronous memory copy). If trace collection is triggered using either --hip-api or --hsa-api, the trace records only the start, stop, and duration of API events, but not the execution time of associated actions like memory copy. To record the duration of asynchronous activities, use --hip-activity and --hsa-activity options, which record both the API events and asynchronous events.

2.1.1 Visualize tracing results#

You can view the traces generated by rocprofv2 using the Perfetto UI that enables you to view and analyze traces in a web browser. To begin go to Perfetto UI, select Open trace file from the left-side menu, and select the ROCProfiler trace file to view.

The following is a screenshot from the Perfetto interface. The tasks are organized in a Gantt chart style with the x-axis representing time and each rectangle representing the start and the end time of a task. The tasks are organized in rows. In the figure is the HIP API, HSA API, a queue, and a stream.

Viewing HIP Trace

Fig. 9 Visualizing Traces Generated Using sys-trace#

Tip

You can place your mouse over the image, and use the Open image in new tab command from the pop-up menu to open an enlarged view of the image.

Kernel profiling#

As explained in rocprof-counters application tracing lets you evaluate the timeline of application events, but is little help in providing insight into kernel execution details. The kernel profiling functionality lets you select kernels for profiling and choose the basic counters or derived metrics to be collected for each kernel execution, thus providing a greater insight into hardware performance.

To check the supported performance counters and metrics, use:

rocprofv2 --list-counters

The following is a sample output from the --list-counters option. The output has been truncated for explanation:

gfx1030:0 : SQ_WAVES
: Count number of waves sent to SQs. {emulated, global, C1}
block SQ can only handle 8 counters at a time

The fields in the output are:

  • gfx1030:0 - The GPU architecture and GPU ID (separated by colon). The GPU ID needs to be specified as there might be multiple GPUs in the system.

  • SQ_WAVES - The counter name. Typically, the first token before the first underscore is the GPU block name. Here, SQ is the block that is responsible for managing wavefronts and issuing instructions.

Note

For more information on the performance counters available on AMD GPUs, refer to the GPU architecture documentation.

Input file#

To collect basic counters and derived metrics, define the profiling scope in an input file, and specify the file on the command line:

rocprofv2 -i input.txt <app_relative_path>

An input file is a text file that can be supplied to rocprofv2 for basic counter and derived metric collection. It typically consists of four parts, namely the basic counter(s)/derived metric(s) to use, the GPUs to profile, name of the kernels to be profiled, and the range of kernels to profile. All fields other than pmc: are optional.

Sample Input File:

pmc: SQ_WAVES TA_UTIL
range: 0:1
gpu: 0
kernel: matrixTranspose

The fields in the input file are detailed in Input File.

PMC: The rows in the text file beginning with pmc: are the group of basic counters or derived metrics the user is interested in collecting. The basic counters or derived metrics can be selected from the output generated by --list-counters option.

The number of basic counters or derived metrics that can be collected in one run of profiling is limited by the GPU hardware resources. If too many counters/metrics are selected, the kernels need to be executed multiple times to collect the counters/metrics. For multi-pass execution, include multiple rows of pmc: in the input file. Counters or metrics in each pmc: row can be collected in each run of the kernel.

GPU: The row beginning with the keyword gpu: specifies the GPU(s) on which the hardware counters are to be collected. This enables the support for profiling multiple GPUs. You can specify multiple GPUs separated by comma such as gpu: 1,3.

Kernel: The row beginning with the kernel: keyword specifies the names of kernels to be profiled.

Range: The row beginning with the keyword range: specifies the range of kernel dispatches. Specifying range is helpful in cases where the application causes multiple kernel dispatches and users want to filter some kernel dispatches. In the above example, the range: 0:1 depicts that one kernel is profiled.

Kernel profiling output#

This section discusses the kernel profiling output generated using the Input File. rocprofv2 reports one value per metric per kernel in the output. You can generate the output in desired format as described in Formatting output using plugins. If no plugin is specified while generating the output, the result is dumped on the command-line.

The following sample output is generated using the file plugin. Each row of the file is an instance of kernel execution.

For each kernel, basic information (e.g., GPU_ID, SGPR, PID, etc.) and performance counters (specified in the input file) values are listed. The information is generated in the format of field name and value.

$ rocprofv2 -i input.txt --plugin file -o result MatrixTranspose

$ cat results_result.csv

Dispatch_ID,GPU_ID,Queue_ID,Queue_Index,PID,TID,GRD,WGR,LDS,SCR,Arch_VGPR,ACCUM_VGPR,
SGPR,Wave_Size,SIG,OBJ,Kernel_Name,Start_Timestamp,End_Timestamp,Correlation_ID,
SQ_WAVES,GRBM_COUNT,GRBM_GUI_ACTIVE,SQ_INSTS_VALU,FETCH_SIZE

1,64700,1,0,353,353,1048576,16,0,0,8,0,16,64,140356026185088,1,"matrixTranspose(float*, float*, int)
(.kd)",7,30064771072,0,65536.000000,398333.000000,398333.000000,917504.000000,4136.000000

2,64700,1,2,353,353,1048576,16,0,0,8,0,16,64,140356026184832,2,"matrixTranspose(float*,
float*, int)
(.kd)",7,30064771072,0,65536.000000,586424.000000,586424.000000,917504.000000,4130.437500

3,64700,1,4,353,353,1048576,16,0,0,8,0,16,64,140356026184576,3,"matrixTranspose(float*,
float*, int)
(.kd)",7,30064771072,0,65536.000000,392460.000000,392460.000000,917504.000000,4129.937500

The fields in the output file are:

Output Fields

Description

Dispatch_ID

Kernel’s dispatch Id

GPU_ID

GPU identifier to which the kernel was submitted

Queue_ID

ROCm queue unique identifier to which the kernel was submitted

Queue_Index

ROCm queue write index for the submitted AQL packet

PID

System application process id that submitted the kernel

TID

System application thread id that submitted the kernel

GRD

Kernel’s grid size

WGR

Kernel’s work group size

LDS

Kernel’s Local Data Share (LDS) memory size

SCR

Kernel’s scratch memory size

Arch_VGPR

Number of Vector General Purpose Registers (VGPR) used in kernel dispatch

ACCUM_VGPR

Total Count of VGPRs

SGPR

Kernel’s Scalar General-Purpose Register (SGPR) size

Wave_Size

Number of wavefronts

SIG

Kernel’s completion signal

OBJ

Code object

Kernel_Name

Name of the dispatched kernel

Start_Timestamp

Begin time in nanoseconds (ns) when the kernel begins execution

End_Timestamp

End time in ns when the kernel finishes execution

Correlation_ID

Unique identifier for correlation between HIP and HSA async calls during activity tracing

You can view the generated output using the Perfetto UI as previously described in Visualize Tracing Results. The following is a screenshot of the Perfetto UI when viewing the kernel profiling output.

Viewing Kernel Profile

Fig. 10 Viewing kernel profiling output#

The first four rows represent the performance counters as specified in the input file. The last row is the kernel execution timeline, which is the same as the --kernel-trace option used in the Application tracing mode.

Viewing the profile results provides a good overview of kernel execution times and how performance metrics values change across the kernels. Additionally, you can also see the exact value of a counter/metric by hovering over or clicking the bar.

Formatting output using plugins#

rocprofv2 uses a modular plugin system which allows you to generate profiling output in the desired format. Because these plugins are modular in nature, they can easily be decoupled from the code based on need. By default, rocprofv2 generates the profiling output using the file and CLI plugins.

You can install other plugins (as listed in the table below) using the plugins package as shown:

rocprofiler-plugins_2.0.0-local_amd64.deb
-or-
rocprofiler-plugins-2.0.0-local.x86_64.rpm

Note

You can also create your own plugins if you are using rocprofv2 with source code and not just as a CLI tool.

To generate the profiling output using a plugin, use:

rocprofv2 --plugin plugin_name -i input.txt <app_relative_path>

# where plugin_name is file, perfetto, att, or ctf

The following table lists the available plugins:

Plugin

Output Format

File

Text files (.csv or .txt)

Perfetto

Protobuf in the format of the Chromium Project’s trace-event format

Advanced Thread Tracer (ATT)

Binary and .csv formats

Common Trace Format (CTF)

Binary, formatted in the ctf format that can be consumed by public tools such as Babeltrace and TraceCompass

Note

To generate output, the plugins require you to set the OUTPUT_PATH variable to the desired directory. File plugin is the only plugin that still generates output in the absence of OUTPUT_PATH by dumping the output to standard output.

Using rocsys#

rocsys is a command-line utility tool used to invoke and control a profiling session (launch/start/stop/exit) on an application being traced or profiled. rocsys is especially useful for selective profiling of applications with long-running workloads (such as DNN training) as it allows you to profile and control the application while it is running. You can also launch the session from one terminal and control the application using rocsys from another terminal.

To see all the rocsys options, run:

rocsys -help

rocsys: launch must be preceded by --session <name>
e.g. rocsys --session <SESSION_NAME> launch <MPI_COMMAND> <MPI_ARGUMENTS> rocprofv2
<ROCPROFV2_OPTIONS> <APP_EXEC>

where all mpiexec options must come before rocsys
rocsys: start must be preceded by --session <name>
   rocsys --session <name> start
rocsys: stop must be preceded by --session <name>
   rocsys --session <name> stop
rocsys: exit must be preceded by --session <name>
   rocsys --session <name> exit

The following are the session management options used with rocsys in the given order to achieve selective profiling on the rocprofv2 run:

  1. Launch - Creates a session. After launching the application stops until the session is started as shown in step 2.

/opt/rocm/bin/rocsys --session session1 launch rocprofv2 -i ../samples/input.txt <long_running_app>
ROCSYS:: Session ID: 2109
ROCSYS Session Created!
ROCProfilerV2: Collecting the following counters:
- SQ_WAVES
- GRBM_COUNT
- GRBM_GUI_ACTIVE
- SQ_INSTS_VALU
- FETCH_SIZE
Enabling Counter Collection
  1. Start - Starts the halted after launching session on the same or another terminal, and begins dumping kernel profiling information. The start command triggers the halted application to run.

/opt/rocm/bin/rocsys --session session1 start
ROCSYS:: Starting Tools Session...
Dispatch_ID(1), GPU_ID(1), ... // All the metrics of a kernel
Dispatch_ID(2), GPU_ID(1), ... // All the metrics of a kernel
Dispatch_ID(3), GPU_ID(1), ... // All the metrics of a kernel
  1. Stop - Stops the session. The information displayed on the terminal is a result of kernel profiling between the current and the previous rocsys command. Note that this command stops only the profiling session without affecting the application on the run.

/opt/rocm/bin/rocsys --session session1 stop
ROCSYS:: Stopping Tools Session...
Dispatch_ID(22397), GPU_ID(1), ... // All the metrics of a kernel
Dispatch_ID(22398), GPU_ID(1), ... // All the metrics of a kernel
Dispatch_ID(22399), GPU_ID(1), ... // All the metrics of a kernel
  1. Start (to restart) - rocsys allows you to start and stop the session innumerable times once the session is created. This helps in analyzing batches of kernel profiling information.

/opt/rocm/bin/rocsys --session session1 start
ROCSYS:: Starting Tools Session...
Dispatch_ID(22400), GPU_ID(1), ... // All the metrics of a kernel
Dispatch_ID(22401), GPU_ID(1), ... // All the metrics of a kernel
  1. Exit - Exits the profiling session. Once the session is exited, it cannot be restarted.

/opt/rocm/bin/rocsys --session session1 exit
Dispatch_ID(16828), GPU_ID(1), ... // All the metrics of a kernel
Dispatch_ID(16829), GPU_ID(1), ... // All the metrics of a kernel
ROCSYS:: Exiting Tools Session...Application might still be finishing up..

Note

Exiting the session only stops profiling. The application could continue running to completion in the background. If you don’t want to wait for the application to finish, use CTRL+C to stop the application after exit.

ROCProfilerV2 API#

ROCProfilerV2 also provides an API that allows fine-grained control over the profiling process. Like the CLI tool rocprofv2, the API supports both application tracing and kernel profiling. ROCProfilerV2 API allows you to create a session and invoke application tracing or kernel profiling within the session. The following sections describe session management, application tracing and kernel profiling using ROCProfilerV2 API.

Profiling Sessions#

A ROCProfilerV2 session maintains the global profiling state for an application. It is a unique identifier for a profiling or tracing task that is specified within the session. A session contains sufficient information about what needs to be collected or traced and it allows you to start/stop profiling/tracing as and when required.

The following demonstrates the use of session management APIs:

// Initialize the tools
   rocprofiler_initialize();

// Create the session with no replay mode
   rocprofiler_session_id_t session_id;
   rocprofiler_create_session(ROCPROFILER_NONE_REPLAY_MODE, &session_id);

// Start Session
   rocprofiler_start_session(session_id);

// profile a kernel -kernelA
   hipLaunchKernelGGL(kernelA, dim3(1), dim3(1), 0, 0);

// Deactivating session
   rocprofiler_terminate_session(session_id);

// Destroy sessions
   rocprofiler_destroy_session(session_id);

// Destroy all profiling related objects
   rocprofiler_finalize();

The following is a typical session management workflow:

  1. Initialize ROCProfilerV2 using rocprofiler_initialize.

  2. Create a session using rocprofiler_create_session. The created session keeps track of the global status of the application profiling.

    Note

    You can only create a session in no-replay mode (ROCPROFILER_NONE_REPLAY_MODE) which allows kernels to be run only once.

  3. Create a buffer to hold the results using rocprofiler_create_buffer.

  4. Create filters using rocprofiler_create_filter to specify the profiling task such as application tracing or metrics/counters collection.

    Note

    If the same filter is applied twice with different values, the latter application of the filter is considered the recent one, which overwrites the former application of the filter. To learn about the types of filters and their utility, see Filters.

  5. Start the session with rocprofiler_start_session.

  6. Run the specified kernels to collect traces or counters/metrics (as specified in the filter)

  7. Terminate the session with rocprofiler_terminate_session and flush the profiling results using rocprofiler_flush_data.

    Note

    The session must be terminated after the kernel completes (synchronization required). If a session is stopped before the completion of kernel execution within that session, the instrumentation data is undefined. Additionally, a session can be restarted after terminating.

  8. Destroy the session with rocprofiler_destroy_session and finalize profiling with rocprofiler_finalize.

See working examples demonstrating the use of the ROCProfilerV2 API in Application Tracing and Kernel Profiling.

Filters#

As explained in Profiling Sessions, filters allow you to specify a profiling task within a session. For different profiling tasks, different filters are specified as a parameter to rocprofiler_create_filter.

See the list of filters in the table below:

Filter

Purpose

ROCPROFILER_API_TRACE

To trace API calls. You must specify the API calls to be traced, in a vector.

ROCPROFILER_DISPATCH _TIMESTAMPS_COLLECTION

To track all the kernel execution’s start and end times on the GPUs

ROCPROFILER_COUNTERS_COLLECTION

To collect counters. You must specify the counters to be collected, in a vector.

Application Tracing#

The following code demonstrates the usage of ROCProfilerV2 APIs to trace an application. This example traces HIP APIs, HIP asynchronous activities, HSA APIs, HSA asynchronous activities, and ROCTX ranges. Note the use of ROCPROFILER_API_TRACE filter to trace API calls, and ROCPROFILER_DISPATCH_TIMESTAMPS_COLLECTION filter to trace the kernel.

int main(int argc, char*\* argv) {
   int\* gpuMem;
   prepare();
   // Initialize the tools
   CHECK_ROCPROFILER(rocprofiler_initialize());

   // Creating the session with given replay mode
   rocprofiler_session_id_t session_id;
   CHECK_ROCPROFILER(rocprofiler_create_session(ROCPROFILER_NONE_REPLAY_MODE, &session_id));

   // Creating Output Buffer for the data
   rocprofiler_buffer_id_t buffer_id;
   CHECK_ROCPROFILER(rocprofiler_create_buffer(session_id,
      [](const rocprofiler_record_header_t\* record, const
      rocprofiler_record_header_t\* end_record,
      rocprofiler_session_id_t session_id, rocprofiler_buffer_id_t
      buffer_id) {
         WriteBufferRecords(record, end_record, session_id, buffer_id);
      },
      0x9999, &buffer_id));

   // Specifying the APIs to be traced in a vector
   std::vector<rocprofiler_tracer_activity_domain_t> apis_requested;
   apis_requested.emplace_back(ACTIVITY_DOMAIN_HIP_API);
   apis_requested.emplace_back(ACTIVITY_DOMAIN_HIP_OPS);
   apis_requested.emplace_back(ACTIVITY_DOMAIN_HSA_API);
   apis_requested.emplace_back(ACTIVITY_DOMAIN_HSA_OPS);
   apis_requested.emplace_back(ACTIVITY_DOMAIN_ROCTX);
   rocprofiler_filter_id_t api_tracing_filter_id;

   // Creating filter for tracing APIs
   CHECK_ROCPROFILER(rocprofiler_create_filter(
      session_id, ROCPROFILER_API_TRACE,
      rocprofiler_filter_data_t{&apis_requested[0]}, apis_requested.size(),
      &api_tracing_filter_id, rocprofiler_filter_property_t{}));
   CHECK_ROCPROFILER(rocprofiler_set_filter_buffer(session_id,
      api_tracing_filter_id, buffer_id));

   // Kernel Tracing
   rocprofiler_filter_id_t kernel_tracing_filter_id;
   CHECK_ROCPROFILER(rocprofiler_create_filter(session_id,
      ROCPROFILER_DISPATCH_TIMESTAMPS_COLLECTION, rocprofiler_filter_data_t{},
      0, &kernel_tracing_filter_id, rocprofiler_filter_property_t{}));
   CHECK_ROCPROFILER(rocprofiler_set_filter_buffer(session_id,
      kernel_tracing_filter_id, buffer_id));

   // Normal HIP Calls won't be traced
   hipDeviceProp_t devProp;
   HIP_CALL(hipGetDeviceProperties(&devProp, 0));
   HIP_CALL(hipMalloc((void**)&gpuMem, 1 \* sizeof(int)));
   // KernelA and KernelB won't be traced
   kernelCalls('A');
   kernelCalls('B');

   // Activating Profiling Session to profile whatever kernel launches occur
   // up to the next terminate session
   CHECK_ROCPROFILER(rocprofiler_start_session(session_id));

   // KernelC, KernelD, KernelE and KernelF to be traced as part of the session
   kernelCalls('C');
   kernelCalls('D');
   kernelCalls('E');
   kernelCalls('F');
   // Normal HIP Calls that will be traced
   HIP_CALL(hipFree(gpuMem));

   // Deactivating session
   CHECK_ROCPROFILER(rocprofiler_terminate_session(session_id));

   // Manual Flush user buffer request
   CHECK_ROCPROFILER(rocprofiler_flush_data(session_id, buffer_id));

   // Destroy sessions
   CHECK_ROCPROFILER(rocprofiler_destroy_session(session_id));

   // Destroy all profiling related objects (User buffer, sessions, filters, etc..)
   CHECK_ROCPROFILER(rocprofiler_finalize());

   return 0;
}

Kernel Profiling#

The following is a full-application example that utilizes the ROCProfilerV2 API to profile the kernels. The ROCPROFILER_COUNTERS_COLLECTION filter for counter collection distinguishes this example from the one in Application Tracing. The GRBM_COUNT counter to be collected is specified in a vector of strings as shown.

#include <hip/hip_runtime.h>
#include <rocprofiler/v2/rocprofiler.h>

int main(int argc, char*\* argv) {
   int\* gpuMem;

   // Initialize the tools
   CHECK_ROCPROFILER(rocprofiler_initialize());

   // Creating the session with given replay mode
   rocprofiler_session_id_t session_id;
   CHECK_ROCPROFILER(rocprofiler_create_session(ROCPROFILER_NONE_REPLAY_MODE,
   &session_id));

   // Creating Output Buffer for the data
   rocprofiler_buffer_id_t buffer_id;
   CHECK_ROCPROFILER(rocprofiler_create_buffer(session_id,
      [](const rocprofiler_record_header_t\* record, const
      rocprofiler_record_header_t\* end_record,
      rocprofiler_session_id_t session_id, rocprofiler_buffer_id_t
      buffer_id) {
         WriteBufferRecords(record, end_record, session_id, buffer_id);
      },
      0x9999, &buffer_id));

   // Counter Collection Filter
   std::vector<const char*> counters;
   counters.emplace_back("GRBM_COUNT");
   rocprofiler_filter_id_t filter_id;
   [[maybe_unused]] rocprofiler_filter_property_t property = {};
   CHECK_ROCPROFILER(rocprofiler_create_filter(session_id,
      ROCPROFILER_COUNTERS_COLLECTION,
      rocprofiler_filter_data_t{.counters_names = &counters[0]},
      counters.size(), &filter_id, property));
   CHECK_ROCPROFILER(rocprofiler_set_filter_buffer(session_id, filter_id,
      buffer_id));

   // Normal HIP Calls
   hipDeviceProp_t devProp;
   HIP_CALL(hipGetDeviceProperties(&devProp, 0));
   HIP_CALL(hipMalloc((void**)&gpuMem, 1 \* sizeof(int)));

   // KernelA and KernelB won't be profiled
   kernelCalls('A');
   kernelCalls('B');

   // Activating Profiling Session to profile whatever kernel launches occur
   // up to the next terminate session
   CHECK_ROCPROFILER(rocprofiler_start_session(session_id));

   // KernelC, KernelD, KernelE and KernelF to be profiled as part of the session
   kernelCalls('C');
   kernelCalls('D');
   kernelCalls('E');
   kernelCalls('F');
   // Normal HIP Calls
   HIP_CALL(hipFree(gpuMem));

   // Deactivating session
   CHECK_ROCPROFILER(rocprofiler_terminate_session(session_id));

   // Manual Flush user buffer request
   CHECK_ROCPROFILER(rocprofiler_flush_data(session_id, buffer_id));

   // Destroy sessions
   CHECK_ROCPROFILER(rocprofiler_destroy_session(session_id));

   // Destroy all profiling related objects (User buffer, sessions, filters, etc..)
   CHECK_ROCPROFILER(rocprofiler_finalize());

   return 0;
}