Using rocprofv3#
rocprofv3 is a CLI tool that helps you quickly optimize applications and understand the low-level kernel details without requiring any modification in the source code.
It’s backward compatible with its predecessor, rocprof, and provides more features for application profiling with better accuracy.
The following sections demonstrate the use of rocprofv3 for application tracing and kernel profiling using various command-line options.
rocprofv3 is installed with ROCm under /opt/rocm/bin. To use the tool from anywhere in the system, export the PATH variable:
export PATH=$PATH:/opt/rocm/bin
Before you start tracing or profiling your HIP application using rocprofv3, build the application using:
cmake -B <build-directory> <source-directory> -DCMAKE_PREFIX_PATH=/opt/rocm
cmake --build <build-directory> --target all --parallel <N>
Command-line options#
Here is a sample of commonly used rocprofv3 command-line options. Some options are used for application tracing and some for kernel profiling, while the output control options control the presentation and redirection of the generated output.
Option |
Description |
Use |
---|---|---|
|
Specifies the input file. JSON and YAML formats support configuration of all command-line options whereas the text format only supports specifying HW counters. |
Run Configuration |
|
Specifies the path for the output files. Supports special keys: |
Output control |
|
Specifies the name of the output file. Note that this name is appended to the default names (_api_trace or counter_collection.csv) of the generated files. Supports special keys: |
Output control |
|
For adding output format (supported formats: csv, json, pftrace) |
Output control |
|
Collects HIP (runtime), memory copy, memory allocation, marker, scratch memory, rocDecode, and kernel dispatch traces. |
Application Tracing |
|
Collects HIP, HSA, memory copy, memory allocation, marker, scratch memory, rocDecode, and kernel dispatch traces. |
Application Tracing |
|
Collects HIP runtime and compiler traces. |
Application tracing |
|
Collects kernel dispatch traces. |
Application tracing |
|
Collects marker (ROC-TX) traces. |
Application tracing |
|
Collects memory copy traces. |
Application tracing |
|
Collects memory allocation traces. |
Application tracing |
|
Collects scratch memory operations traces. |
Application tracing |
|
Collects rocDecode API traces. |
Application tracing |
|
Collects HSA API traces. |
Application tracing |
|
Collects HIP runtime API traces. |
Application tracing |
|
Collects HSA API traces (core API). |
Application tracing |
|
Collects HSA API traces (AMD-extension API). |
Application tracing |
|
For Collecting statistics of enabled tracing types |
Application tracing |
|
Display summary of collected data |
Application tracing |
|
Include the kernels matching this filter. |
Kernel Dispatch Counter Collection |
|
Exclude the kernels matching this filter. |
Kernel Dispatch Counter Collection |
|
Iteration range for each kernel that matches the filter [start-stop]. |
Kernel Dispatch Counter Collection |
|
List metrics for counter collection |
List supported PC sampling configurations. |
|
Specifies the path to a YAML file containing extra counter definitions. |
Kernel Dispatch Counter Collection |
|
Overrides the default demangling of kernel names. |
Output control |
|
Truncates the demangled kernel names for improved readability. |
Output control |
|
For adding output format (supported formats: csv, json, pftrace, otf2) |
Output control |
|
Libraries to prepend to LD_PRELOAD (usually for sanitizers) |
Extension |
|
Perfetto data collection backend. ‘system’ mode requires starting traced and perfetto daemons |
Extension |
|
Size of buffer for perfetto output in KB. default: 1 GB |
Extension |
|
Policy for handling new records when perfetto has reached the buffer limit |
Extension |
|
Perfetto shared memory size hint in KB. default: 64 KB |
Extension |
|
pc sampling support is in beta version |
This flag sets the ROCPROFILER_PC_SAMPLING_BETA_ENABLED environment variable |
|
Type of PC Sampling, currently only host trap method is supported |
PC Sampling Configurations |
|
The unit appropriate to the PC sampling type/method, currently only time unit is supported |
PC Sampling Configurations |
|
Frequency at which PC samples are generated |
PC Sampling Configurations |
|
The times are specified in seconds by default, but the unit can be changed using the --collection-period-unit or -pu option. Start Delay Time is the time in seconds before the collection begins, Collection Time is the duration in seconds for which data is collected, and Repeat is the number of times the cycle is repeated. A repeat of 0 indicates that the cycle will repeat indefinitely. Users can specify multiple configurations, each defined by a triplet in the format start_delay:collection_time:repeat. For example, the command -p 10:10:1 5:3:0 specifies two configurations: the first with a start delay of 10 seconds, a collection time of 10 seconds, and a repeat of 1 (the cycle will repeat once); the second with a start delay of 5 seconds, a collection time of 3 seconds, and a repeat of 0 (the cycle will repeat indefinitely). |
Filtering Options |
|
To change the unit used in --collection-period or -p, you can specify the desired unit using the --collection-period-unit option. The available units are hour for hours, min for minutes, sec for seconds, msec for milliseconds, usec for microseconds, and nsec for nanoseconds. |
Filtering Options |
To see the exhaustive list of rocprofv3 options, run:
rocprofv3 --help
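To make the collection-period format from the table above concrete, here is a minimal Python sketch that splits start_delay:collection_time:repeat triplets into named fields. The helper name parse_collection_periods and the restriction to integer values are assumptions made for illustration; this is not how rocprofv3 itself parses the option.

```python
from typing import List, NamedTuple


class CollectionPeriod(NamedTuple):
    start_delay: int      # time to wait before collection begins
    collection_time: int  # how long data is collected
    repeat: int           # number of repetitions; 0 means repeat indefinitely


def parse_collection_periods(args: List[str]) -> List[CollectionPeriod]:
    """Parse 'start_delay:collection_time:repeat' strings (hypothetical helper)."""
    periods = []
    for arg in args:
        parts = arg.split(":")
        if len(parts) != 3:
            raise ValueError(
                f"expected start_delay:collection_time:repeat, got {arg!r}"
            )
        start_delay, collection_time, repeat = (int(p) for p in parts)
        periods.append(CollectionPeriod(start_delay, collection_time, repeat))
    return periods


# The two configurations from the example "-p 10:10:1 5:3:0"
periods = parse_collection_periods(["10:10:1", "5:3:0"])
for p in periods:
    cycles = "indefinitely" if p.repeat == 0 else f"{p.repeat} time(s)"
    print(f"wait {p.start_delay}, collect for {p.collection_time}, repeat {cycles}")
```

The values are interpreted in seconds unless a different unit is selected with the collection-period-unit option.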
Application tracing#
Application tracing provides the big picture of a program’s execution by collecting data on the execution times of API calls and GPU commands, such as kernel execution, async memory copy, and barrier packets. This information can be used as the first step in the profiling process to answer important questions, such as what percentage of time was spent on memory copy and which kernel took the longest time to execute.
To use rocprofv3
for application tracing, run:
rocprofv3 <tracing_option> -- <application_path>
HIP trace#
HIP trace comprises execution traces for the entire application at the HIP level. This includes HIP API functions and their asynchronous activities at the runtime level. In general, HIP APIs directly interact with the user program. It is easier to analyze HIP traces as you can directly map them to the program.
To trace HIP runtime APIs, use:
rocprofv3 --hip-trace -- <application_path>
The preceding command generates a hip_api_trace.csv
file prefixed with the process ID.
$ cat 238_hip_api_trace.csv
Here are the contents of hip_api_trace.csv
file:
Domain | Function | Process_Id | Thread_Id | Correlation_Id | Start_Timestamp | End_Timestamp |
---|---|---|---|---|---|---|
HIP_COMPILER_API | __hipRegisterFatBinary | 208 | 208 | 1 | 1508780270085955 | 1508780270096795 |
HIP_COMPILER_API | __hipRegisterFunction | 208 | 208 | 2 | 1508780270104242 | 1508780270115355 |
HIP_COMPILER_API | __hipPushCallConfiguration | 208 | 208 | 3 | 1508780613897816 | 1508780613898701 |
HIP_COMPILER_API | __hipPopCallConfiguration | 208 | 208 | 4 | 1508780613901714 | 1508780613902200 |
To trace HIP compile time APIs, use:
rocprofv3 --hip-compiler-trace -- <application_path>
The preceding command generates a hip_api_trace.csv
file prefixed with the process ID.
$ cat 208_hip_api_trace.csv
Here are the contents of hip_api_trace.csv
file:
Domain | Function | Process_Id | Thread_Id | Correlation_Id | Start_Timestamp | End_Timestamp |
---|---|---|---|---|---|---|
HIP_COMPILER_API | __hipRegisterFatBinary | 208 | 208 | 1 | 1508780270085955 | 1508780270096795 |
HIP_COMPILER_API | __hipRegisterFunction | 208 | 208 | 2 | 1508780270104242 | 1508780270115355 |
HIP_COMPILER_API | __hipPushCallConfiguration | 208 | 208 | 3 | 1508780613897816 | 1508780613898701 |
HIP_COMPILER_API | __hipPopCallConfiguration | 208 | 208 | 4 | 1508780613901714 | 1508780613902200 |
For the description of the fields in the output file, see Output file fields.
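As a sketch of how such an API trace can be post-processed, the following Python snippet computes the duration of each call as End_Timestamp minus Start_Timestamp. It assumes the file is comma-separated with a header row carrying the column names shown above; the rows are embedded as a string so the example is self-contained, and the function name api_durations is hypothetical.

```python
import csv
import io

# Rows copied from the sample hip_api_trace.csv contents shown above.
SAMPLE = '''"Domain","Function","Process_Id","Thread_Id","Correlation_Id","Start_Timestamp","End_Timestamp"
"HIP_COMPILER_API","__hipRegisterFatBinary",208,208,1,1508780270085955,1508780270096795
"HIP_COMPILER_API","__hipRegisterFunction",208,208,2,1508780270104242,1508780270115355
"HIP_COMPILER_API","__hipPushCallConfiguration",208,208,3,1508780613897816,1508780613898701
"HIP_COMPILER_API","__hipPopCallConfiguration",208,208,4,1508780613901714,1508780613902200
'''


def api_durations(csv_text):
    """Return (function name, end - start) for every row of an API trace."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        (row["Function"], int(row["End_Timestamp"]) - int(row["Start_Timestamp"]))
        for row in reader
    ]


durations = api_durations(SAMPLE)
# Print the calls from longest to shortest.
for name, duration in sorted(durations, key=lambda item: item[1], reverse=True):
    print(f"{duration:>8}  {name}")
```

To process a real trace, read the generated file with open(path, newline="") and pass its contents to the same function.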
HSA trace#
The HIP runtime library is implemented with the low-level HSA runtime. HSA API tracing is more suited for advanced users who want to understand the application behavior at the lower level. In general, tracing at the HIP level is recommended for most users. You should use HSA trace only if you are familiar with HSA runtime.
HSA trace contains the start and end time of HSA runtime API calls and their asynchronous activities.
rocprofv3 --hsa-trace -- <application_path>
The preceding command generates an hsa_api_trace.csv
file prefixed with the process ID. Note that the contents of this file have been truncated for demonstration purposes.
$ cat 197_hsa_api_trace.csv
Here are the contents of hsa_api_trace.csv
file:
Domain | Function | Process_Id | Thread_Id | Correlation_Id | Start_Timestamp | End_Timestamp |
---|---|---|---|---|---|---|
HSA_CORE_API | hsa_system_get_major_extension_table | 197 | 197 | 1 | 1507843974724237 | 1507843974724947 |
HSA_CORE_API | hsa_agent_get_info | 197 | 197 | 3 | 1507843974754471 | 1507843974755014 |
HSA_AMD_EXT_API | hsa_amd_memory_pool_get_info | 197 | 197 | 5 | 1507843974761705 | 1507843974762398 |
HSA_AMD_EXT_API | hsa_amd_memory_pool_get_info | 197 | 197 | 6 | 1507843974763901 | 1507843974764030 |
HSA_AMD_EXT_API | hsa_amd_memory_pool_get_info | 197 | 197 | 7 | 1507843974765121 | 1507843974765224 |
HSA_AMD_EXT_API | hsa_amd_memory_pool_get_info | 197 | 197 | 8 | 1507843974766196 | 1507843974766328 |
HSA_AMD_EXT_API | hsa_amd_memory_pool_get_info | 197 | 197 | 9 | 1507843974767534 | 1507843974767641 |
HSA_AMD_EXT_API | hsa_amd_memory_pool_get_info | 197 | 197 | 10 | 1507843974768639 | 1507843974768779 |
HSA_AMD_EXT_API | hsa_amd_agent_iterate_memory_pools | 197 | 197 | 4 | 1507843974758768 | 1507843974769238 |
HSA_CORE_API | hsa_agent_get_info | 197 | 197 | 11 | 1507843974771091 | 1507843974771537 |
For the description of the fields in the output file, see Output file fields.
Marker trace#
Note
To use rocprofv3 for marker tracing, including and linking to the old ROCTx still works, but switching to the new ROCTx is recommended because it has been extended with new APIs.
To use the new ROCTx, include the header "rocprofiler-sdk-roctx/roctx.h" and link your application with librocprofiler-sdk-roctx.so.
For the full list of ROCTx APIs, see the public header file "rocprofiler-sdk-roctx/roctx.h".
To learn how to use the ROCTx/marker library, see Using ROCTx (AMD Tools Extension Library).
Kernel Rename#
The --kernel-rename option renames kernels with the message of their enclosing roctxRangePush/roctxRangePop range. This option was known as --roctx-rename in earlier rocprof versions.
The following code snippet shows how to use the --kernel-rename option:
#include <hip/hip_runtime.h>
#include <rocprofiler-sdk-roctx/roctx.h>

// Outer range: kernels launched inside it take the name "HIP_Kernel-1"
roctxRangePush("HIP_Kernel-1");
// Launching kernel from host
hipLaunchKernelGGL(matrixTranspose,
                   dim3(WIDTH / THREADS_PER_BLOCK_X, WIDTH / THREADS_PER_BLOCK_Y),
                   dim3(THREADS_PER_BLOCK_X, THREADS_PER_BLOCK_Y),
                   0, 0,
                   gpuTransposeMatrix, gpuMatrix, WIDTH);
// Memory transfer from device to host, wrapped in a nested range
roctxRangePush("hipMemCpy-DeviceToHost");
hipMemcpy(TransposeMatrix, gpuTransposeMatrix, NUM * sizeof(float), hipMemcpyDeviceToHost);
roctxRangePop(); // closes "hipMemCpy-DeviceToHost"
roctxRangePop(); // closes "HIP_Kernel-1"
To rename the kernel, use:
rocprofv3 --marker-trace --kernel-rename -- <application_path>
The preceding command generates a marker_api_trace.csv
file prefixed with the process ID.
$ cat 210_marker_api_trace.csv
"Domain","Function","Process_Id","Thread_Id","Correlation_Id","Start_Timestamp","End_Timestamp"
"MARKER_CORE_API","roctxGetThreadId",315155,315155,2,58378843928406,58378843930247
"MARKER_CONTROL_API","roctxProfilerPause",315155,315155,3,58378844627184,58378844627502
"MARKER_CONTROL_API","roctxProfilerResume",315155,315155,4,58378844638601,58378844639267
"MARKER_CORE_API","pre-kernel-launch",315155,315155,5,58378844641787,58378844641787
"MARKER_CORE_API","post-kernel-launch",315155,315155,6,58378844936586,58378844936586
"MARKER_CORE_API","memCopyDth",315155,315155,7,58378844938371,58378851383270
"MARKER_CORE_API","HIP_Kernel-1",315155,315155,1,58378526575735,58378851384485
Kokkos trace#
Kokkos is a C++ library for writing performance portable applications. Kokkos is used in many scientific applications for writing performance portable code that can run on CPUs, GPUs, and other accelerators.
rocprofv3 loads an inbuilt Kokkos Tools library, which emits ROCTx ranges with the labels passed using Kokkos APIs. For example, Kokkos::parallel_for("MyParallelForLabel", ...) calls roctxRangePush internally and enables the kernel renaming option to replace the highly templated kernel names with the Kokkos labels.
To enable the inbuilt marker support, use the --kokkos-trace option. Internally, this option enables --marker-trace and --kernel-rename:
rocprofv3 --kokkos-trace -- <application_path>
The preceding command generates a marker_api_trace.csv
file prefixed with the process ID.
$ cat 210_marker_api_trace.csv
"Domain","Function","Process_Id","Thread_Id","Correlation_Id","Start_Timestamp","End_Timestamp"
"MARKER_CORE_API","Kokkos::Initialization Complete",4069256,4069256,1,56728499773965,56728499773965
"MARKER_CORE_API","Kokkos::Impl::CombinedFunctorReducer<CountFunctor, Kokkos::Impl::FunctorAnalysis<Kokkos::Impl::FunctorPatternInterface::REDUCE, Kokkos::RangePolicy<Kokkos::Serial>, CountFunctor, long int>::Reducer, void>",4069256,4069256,2,56728501756088,56728501764241
"MARKER_CORE_API","Kokkos::parallel_reduce: fence due to result being value, not view",4069256,4069256,4,56728501767957,56728501769600
"MARKER_CORE_API","Kokkos::Finalization Complete",4069256,4069256,6,56728502054554,56728502054554
Kernel trace#
To collect kernel dispatch traces, use:
rocprofv3 --kernel-trace -- <application_path>
The preceding command generates a kernel_trace.csv
file prefixed with the process ID.
$ cat 199_kernel_trace.csv
Here are the contents of kernel_trace.csv
file:
Kind | Agent_Id | Queue_Id | Thread_Id | Dispatch_Id | Kernel_Id | Kernel_Name | Correlation_Id | Start_Timestamp | End_Timestamp | Private_Segment_Size | Group_Segment_Size | Workgroup_Size_X | Workgroup_Size_Y | Workgroup_Size_Z | Grid_Size_X | Grid_Size_Y | Grid_Size_Z |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
KERNEL_DISPATCH | 1 | 1 | 69 | 1 | 16 | void addition_kernel<float>(float*, float const*, float const*, int, int) | 1451 | 8819330200067564 | 8819330200116308 | 0 | 0 | 64 | 1 | 1 | 1024 | 1024 | 1 |
KERNEL_DISPATCH | 1 | 2 | 69 | 5 | 16 | void addition_kernel<float>(float*, float const*, float const*, int, int) | 1484 | 8819330200118678 | 8819330200219573 | 0 | 0 | 64 | 1 | 1 | 1024 | 1024 | 1 |
KERNEL_DISPATCH | 1 | 1 | 69 | 2 | 19 | subtract_kernel(float*, float const*, float const*, int, int) | 1459 | 8819330200120456 | 8819330200223721 | 0 | 0 | 64 | 1 | 1 | 1024 | 1024 | 1 |
KERNEL_DISPATCH | 1 | 3 | 69 | 9 | 16 | void addition_kernel<float>(float*, float const*, float const*, int, int) | 1517 | 8819330200152902 | 8819330200283428 | 0 | 0 | 64 | 1 | 1 | 1024 | 1024 | 1 |
KERNEL_DISPATCH | 1 | 4 | 69 | 13 | 16 | void addition_kernel<float>(float*, float const*, float const*, int, int) | 1550 | 8819330200187127 | 8819330200320468 | 0 | 0 | 64 | 1 | 1 | 1024 | 1024 | 1 |
KERNEL_DISPATCH | 1 | 2 | 69 | 6 | 19 | subtract_kernel(float*, float const*, float const*, int, int) | 1492 | 8819330200225499 | 8819330200364618 | 0 | 0 | 64 | 1 | 1 | 1024 | 1024 | 1 |
KERNEL_DISPATCH | 1 | 1 | 69 | 3 | 18 | multiply_kernel(float*, float const*, float const*, int, int) | 1467 | 8819330200229796 | 8819330200369359 | 0 | 0 | 64 | 1 | 1 | 1024 | 1024 | 1 |
For the description of the fields in the output file, see Output file fields.
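A kernel trace can answer which kernel accumulated the most execution time. The following Python sketch sums End_Timestamp minus Start_Timestamp per Kernel_Name for the first three dispatches of the sample above. It assumes a comma-separated file with a header row using the column names shown; the function name total_time_per_kernel is hypothetical.

```python
import csv
import io
from collections import defaultdict

# First three dispatches copied from the sample kernel_trace.csv contents above.
SAMPLE = '''"Kind","Agent_Id","Queue_Id","Thread_Id","Dispatch_Id","Kernel_Id","Kernel_Name","Correlation_Id","Start_Timestamp","End_Timestamp","Private_Segment_Size","Group_Segment_Size","Workgroup_Size_X","Workgroup_Size_Y","Workgroup_Size_Z","Grid_Size_X","Grid_Size_Y","Grid_Size_Z"
"KERNEL_DISPATCH",1,1,69,1,16,"void addition_kernel<float>(float*, float const*, float const*, int, int)",1451,8819330200067564,8819330200116308,0,0,64,1,1,1024,1024,1
"KERNEL_DISPATCH",1,2,69,5,16,"void addition_kernel<float>(float*, float const*, float const*, int, int)",1484,8819330200118678,8819330200219573,0,0,64,1,1,1024,1024,1
"KERNEL_DISPATCH",1,1,69,2,19,"subtract_kernel(float*, float const*, float const*, int, int)",1459,8819330200120456,8819330200223721,0,0,64,1,1,1024,1024,1
'''


def total_time_per_kernel(csv_text):
    """Sum the dispatch durations for each kernel name."""
    totals = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        duration = int(row["End_Timestamp"]) - int(row["Start_Timestamp"])
        totals[row["Kernel_Name"]] += duration
    return dict(totals)


totals = total_time_per_kernel(SAMPLE)
grand_total = sum(totals.values())
for name, total in sorted(totals.items(), key=lambda item: item[1], reverse=True):
    print(f"{total:>8}  {100 * total / grand_total:5.1f}%  {name}")
```

Because dispatches on different queues can overlap, these sums measure accumulated kernel time rather than elapsed wall-clock time.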
Memory copy trace#
To trace memory moves across the application, use:
rocprofv3 --memory-copy-trace -- <application_path>
The preceding command generates a memory_copy_trace.csv
file prefixed with the process ID.
$ cat 197_memory_copy_trace.csv
Here are the contents of memory_copy_trace.csv
file:
Kind | Direction | Source_Agent_Id | Destination_Agent_Id | Correlation_Id | Start_Timestamp | End_Timestamp |
---|---|---|---|---|---|---|
MEMORY_COPY | MEMORY_COPY_HOST_TO_DEVICE | 0 | 1 | 0 | 14955949675563 | 14955950239443 |
MEMORY_COPY | MEMORY_COPY_DEVICE_TO_HOST | 1 | 0 | 0 | 14955952733485 | 14955953315285 |
For the description of the fields in the output file, see Output file fields.
Memory allocation trace#
Memory allocation traces track the HSA functions hsa_memory_allocate
,
hsa_amd_memory_pool_allocate
, and hsa_amd_vmem_handle_create
. The function
hipMalloc
calls these underlying HSA functions allowing memory allocations to be
tracked.
In addition to the HSA memory allocation functions listed above, the corresponding HSA
free functions hsa_memory_free
, hsa_amd_memory_pool_free
, and hsa_amd_vmem_handle_release
are also tracked. Unlike the allocation functions, however, only the address of the freed memory
is recorded. As such, the agent id and size of the freed memory are recorded as 0 in the CSV and
JSON outputs. Note that some free functions can record a null
pointer address of 0x0. This situation can occur when some HIP functions such as hipStreamDestroy
call underlying HSA free functions with null pointers, even if the user never explicitly calls
free memory functions with null pointer addresses.
To trace memory allocations during the application run, use:
rocprofv3 --memory-allocation-trace -- <application_path>
The preceding command generates a memory_allocation_trace.csv
file prefixed with the process ID.
$ cat 6489_memory_allocation_trace.csv
Here are the contents of memory_allocation_trace.csv
file:
Kind | Operation | Agent_Id | Allocation_Size | Address | Correlation_Id | Start_Timestamp | End_Timestamp |
---|---|---|---|---|---|---|---|
MEMORY_ALLOCATION | MEMORY_ALLOCATION_ALLOCATE | 0 | 1024 | 0x7fb2d0005000 | 11 | 3721742710532634 | 3721742710584854 |
MEMORY_ALLOCATION | MEMORY_ALLOCATION_FREE | 0 | 0 | 0x7fb2d0005000 | 12 | 3721742710596404 | 3721742710933366 |
MEMORY_ALLOCATION | MEMORY_ALLOCATION_ALLOCATE | 0 | 1024 | 0x7fb2d0005000 | 13 | 3721742710941416 | 3721742710960916 |
MEMORY_ALLOCATION | MEMORY_ALLOCATION_FREE | 0 | 0 | 0x7fb2d0005000 | 14 | 3721742710967236 | 3721742711197647 |
MEMORY_ALLOCATION | MEMORY_ALLOCATION_ALLOCATE | 0 | 1024 | 0x7fb2d0005000 | 15 | 3721742711204077 | 3721742711219717 |
MEMORY_ALLOCATION | MEMORY_ALLOCATION_FREE | 0 | 0 | 0x7fb2d0005000 | 16 | 3721742711225857 | 3721742711466018 |
For the description of the fields in the output file, see Output file fields.
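Because free records carry only the address (size and agent ID are 0), the size of a freed block has to be looked up from the matching allocation. The following Python sketch replays the sample records in timestamp order, pairs each free with its allocation by address, skips frees of the null address, and reports the blocks still live at the end along with the peak number of live bytes. It assumes a comma-separated file with the column names shown above; the function name replay_allocations is hypothetical.

```python
import csv
import io

# Rows copied from the sample memory_allocation_trace.csv contents above.
SAMPLE = '''"Kind","Operation","Agent_Id","Allocation_Size","Address","Correlation_Id","Start_Timestamp","End_Timestamp"
"MEMORY_ALLOCATION","MEMORY_ALLOCATION_ALLOCATE",0,1024,0x7fb2d0005000,11,3721742710532634,3721742710584854
"MEMORY_ALLOCATION","MEMORY_ALLOCATION_FREE",0,0,0x7fb2d0005000,12,3721742710596404,3721742710933366
"MEMORY_ALLOCATION","MEMORY_ALLOCATION_ALLOCATE",0,1024,0x7fb2d0005000,13,3721742710941416,3721742710960916
"MEMORY_ALLOCATION","MEMORY_ALLOCATION_FREE",0,0,0x7fb2d0005000,14,3721742710967236,3721742711197647
"MEMORY_ALLOCATION","MEMORY_ALLOCATION_ALLOCATE",0,1024,0x7fb2d0005000,15,3721742711204077,3721742711219717
"MEMORY_ALLOCATION","MEMORY_ALLOCATION_FREE",0,0,0x7fb2d0005000,16,3721742711225857,3721742711466018
'''


def replay_allocations(csv_text):
    """Return (live blocks as {address: size}, peak live bytes)."""
    rows = sorted(
        csv.DictReader(io.StringIO(csv_text)),
        key=lambda row: int(row["Start_Timestamp"]),
    )
    live = {}
    peak_bytes = 0
    for row in rows:
        address = int(row["Address"], 16)
        if row["Operation"] == "MEMORY_ALLOCATION_ALLOCATE":
            live[address] = int(row["Allocation_Size"])
            peak_bytes = max(peak_bytes, sum(live.values()))
        elif row["Operation"] == "MEMORY_ALLOCATION_FREE":
            if address == 0:
                continue  # free of a null pointer, nothing to release
            live.pop(address, None)
    return live, peak_bytes


live, peak_bytes = replay_allocations(SAMPLE)
print(f"blocks still allocated: {len(live)}, peak live bytes: {peak_bytes}")
```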
Runtime trace#
This is a short-hand option that targets the most relevant tracing options for a standard user by excluding traces for HSA runtime API and HIP compiler API.
The HSA runtime API is excluded because it is a lower-level API upon which HIP and OpenMP target are built and thus, tends to be an implementation detail irrelevant to most users. Similarly, the HIP compiler API is also excluded for being an implementation detail as these functions are automatically inserted during HIP compilation.
--runtime-trace traces the HIP runtime API, marker API, kernel dispatches, and memory operations (copies, allocations, and scratch).
rocprofv3 --runtime-trace -- <application_path>
Running the preceding command generates hip_api_trace.csv
, kernel_trace.csv
, memory_copy_trace.csv
, scratch_memory_trace.csv
, memory_allocation_trace.csv
, and marker_api_trace.csv
(if ROCTx
APIs are specified in the application) files prefixed with the process ID.
System trace#
This is an all-inclusive option to collect HIP, HSA, kernel, memory copy, memory allocation, and marker trace (if ROCTx
APIs are specified in the application).
rocprofv3 --sys-trace -- <application_path>
Running the preceding command generates hip_api_trace.csv, hsa_api_trace.csv, kernel_trace.csv, memory_copy_trace.csv, memory_allocation_trace.csv, and marker_api_trace.csv (if ROCTx APIs are specified in the application) files prefixed with the process ID.
Scratch memory trace#
This option collects scratch memory operation traces. Scratch is an address space on AMD GPUs roughly equivalent to the local memory in NVIDIA CUDA. The local memory in CUDA is a thread-local global memory with interleaved addressing, which is used for register spills or stack space. This option helps to trace when the rocr
runtime allocates, frees, and tries to reclaim scratch memory.
rocprofv3 --scratch-memory-trace -- <application_path>
RCCL trace#
RCCL (pronounced “Rickle”) is a stand-alone library of standard collective communication routines for GPUs. This option traces those communication routines.
rocprofv3 --rccl-trace -- <application_path>
The preceding command generates a rccl_api_trace.csv
file prefixed with the process ID.
$ cat 197_rccl_api_trace.csv
Here are the contents of rccl_api_trace.csv
file:
Domain | Function | Process_Id | Thread_Id | Correlation_Id | Start_Timestamp | End_Timestamp |
---|---|---|---|---|---|---|
RCCL_API | ncclGetVersion | 1834151 | 1834151 | 416 | 18413845573432 | 18413845577374 |
RCCL_API | ncclGetUniqueId | 1834151 | 1834151 | 1116 | 18413961300878 | 18413963267869 |
RCCL_API | ncclGetUniqueId | 1834151 | 1834151 | 1481 | 18414166449182 | 18414166720831 |
RCCL_API | ncclGroupStart | 1834151 | 1834151 | 1482 | 18414166723772 | 18414166726834 |
RCCL_API | ncclGroupEnd | 1834151 | 1834151 | 1490 | 18414166823575 | 18414380520973 |
RCCL_API | ncclCommInitAll | 1834151 | 1834151 | 1477 | 18414166402665 | 18414380522536 |
RCCL_API | ncclCommGetAsyncError | 1834151 | 1834151 | 89098 | 18414380660695 | 18414380661652 |
RCCL_API | ncclAllReduce | 1834151 | 1834151 | 89097 | 18414380653860 | 18414380693574 |
RCCL_API | ncclCommGetAsyncError | 1834151 | 1834151 | 89108 | 18414380694631 | 18414380694659 |
RCCL_API | ncclAllReduce | 1834151 | 1834151 | 89107 | 18414380694212 | 18414380704722 |
RCCL_API | ncclCommGetAsyncError | 1834151 | 1834151 | 89117 | 18414380706650 | 18414380706677 |
RCCL_API | ncclAllReduce | 1834151 | 1834151 | 89116 | 18414380705574 | 18414380715055 |
RCCL_API | ncclCommGetAsyncError | 1834151 | 1834151 | 89126 | 18414380715749 | 18414380715774 |
RCCL_API | ncclAllReduce | 1834151 | 1834151 | 89125 | 18414380715463 | 18414380723944 |
RCCL_API | ncclCommGetAsyncError | 1834151 | 1834151 | 89135 | 18414380724688 | 18414380724715 |
RCCL_API | ncclAllReduce | 1834151 | 1834151 | 89134 | 18414380724395 | 18414380732209 |
RCCL_API | ncclCommGetAsyncError | 1834151 | 1834151 | 89154 | 18414380746383 | 18414380746411 |
RCCL_API | ncclCommGetAsyncError | 1834151 | 1834151 | 89157 | 18414380749863 | 18414380749889 |
RCCL_API | ncclCommGetAsyncError | 1834151 | 1834151 | 89160 | 18414380751671 | 18414380751696 |
RCCL_API | ncclCommGetAsyncError | 1834151 | 1834151 | 89163 | 18414380753326 | 18414380753353 |
RCCL_API | ncclCommGetAsyncError | 1834151 | 1834151 | 89166 | 18414380755128 | 18414380755154 |
rocDecode trace#
rocDecode is a high-performance video decode SDK for AMD GPUs. This option traces the rocDecode API.
rocprofv3 --rocdecode-trace -- <application_path>
The preceding command generates a rocdecode_api_trace.csv
file prefixed with the process ID.
$ cat 41688_rocdecode_api_trace.csv
Here are the contents of rocdecode_api_trace.csv
file:
Domain | Function | Process_Id | Thread_Id | Correlation_Id | Start_Timestamp | End_Timestamp |
---|---|---|---|---|---|---|
ROCDECODE_API | rocDecCreateVideoParser | 41688 | 41688 | 583 | 615449881677279 | 615449882001583 |
ROCDECODE_API | rocDecGetDecoderCaps | 41688 | 41688 | 584 | 615449882016054 | 615449882163756 |
ROCDECODE_API | rocDecGetDecoderCaps | 41688 | 41688 | 588 | 615449886038750 | 615449886050880 |
ROCDECODE_API | rocDecCreateDecoder | 41688 | 41688 | 591 | 615449886084210 | 615450756910310 |
ROCDECODE_API | rocDecDecodeFrame | 41688 | 41688 | 595 | 615450757036042 | 615450767147413 |
ROCDECODE_API | rocDecGetDecodeStatus | 41688 | 41688 | 812 | 615450836779385 | 615450836779575 |
Post-processing tracing options#
rocprofv3 provides options to collect a tracing summary or statistics after the conclusion of a tracing session. These options are described here.
Stats#
This option collects statistics for the enabled tracing types. For example, it collects statistics of HIP APIs when HIP trace is enabled. The statistics help to determine the API or function that took the most time.
rocprofv3 --stats --hip-trace -- <application_path>
The preceding command generates hip_api_stats.csv, domain_stats.csv, and hip_api_trace.csv files prefixed with the process ID.
$ cat hip_api_stats.csv
Here are the contents of hip_api_stats.csv
file:
Name | Calls | TotalDurationNs | AverageNs | Percentage | MinNs | MaxNs | StdDev |
---|---|---|---|---|---|---|---|
hipStreamCreateWithFlags | 4 | 262497406 | 65624351.500000 | 85.15 | 3991286 | 249121840 | 122332531.343496 |
hipGetDeviceCount | 1 | 32505687 | 32505687.000000 | 10.54 | 32505687 | 32505687 | 0.00000000e+00 |
hipHostMalloc | 12 | 6096409 | 508034.083333 | 1.98 | 443793 | 548024 | 39236.753678 |
hipFree | 12 | 1994421 | 166201.750000 | 0.6470 | 7790 | 1036046 | 299086.860470 |
hipMemcpyAsync | 12 | 1368378 | 114031.500000 | 0.4439 | 2490 | 764044 | 249308.051619 |
hipMallocAsync | 12 | 927255 | 77271.250000 | 0.3008 | 51540 | 107671 | 20487.475966 |
hipStreamSynchronize | 12 | 870486 | 72540.500000 | 0.2824 | 140 | 866606 | 250065.900069 |
hipLaunchKernel | 16 | 692734 | 43295.875000 | 0.2247 | 1000 | 670044 | 167133.656647 |
hipStreamDestroy | 4 | 619905 | 154976.250000 | 0.2011 | 92901 | 339252 | 122852.320356 |
hipDeviceSynchronize | 4 | 404252 | 101063.000000 | 0.1311 | 570 | 385212 | 189518.505401 |
hipHostFree | 12 | 271202 | 22600.166667 | 0.0880 | 11950 | 34950 | 7480.268600 |
__hipRegisterFatBinary | 1 | 9000 | 9000.000000 | 2.920e-03 | 9000 | 9000 | 0.00000000e+00 |
__hipRegisterFunction | 4 | 6150 | 1537.500000 | 1.995e-03 | 230 | 5370 | 2555.091323 |
__hipPushCallConfiguration | 16 | 2460 | 153.750000 | 7.980e-04 | 70 | 1140 | 267.503894 |
__hipPopCallConfiguration | 16 | 2000 | 125.000000 | 6.488e-04 | 70 | 680 | 151.613544 |
hipGetLastError | 16 | 1270 | 79.375000 | 4.120e-04 | 50 | 440 | 96.295985 |
hipSetDevice | 1 | 660 | 660.000000 | 2.141e-04 | 660 | 660 | 0.00000000e+00 |
Here are the contents of domain_stats.csv
file:
Name | Calls | TotalDurationNs | AverageNs | Percentage | MinNs | MaxNs | StdDev |
---|---|---|---|---|---|---|---|
HIP_API | 13 | 458514859 | 35270373.769231 | 100.00 | 2300 | 352276613 | 99315857.546240 |
For the description of the fields in the output file, see Output file fields.
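Two of the columns in the hip_api_stats.csv sample above follow directly from Calls and TotalDurationNs: AverageNs is the total divided by the number of calls, and Percentage is the share of a function's total in the sum of all totals. The Python sketch below recomputes both from the sample values as a consistency check. The function name derive_columns is hypothetical, and StdDev is left out because it cannot be recovered from the aggregated rows.

```python
# (name, calls, total_duration_ns) copied from the hip_api_stats.csv sample above.
ROWS = [
    ("hipStreamCreateWithFlags", 4, 262497406),
    ("hipGetDeviceCount", 1, 32505687),
    ("hipHostMalloc", 12, 6096409),
    ("hipFree", 12, 1994421),
    ("hipMemcpyAsync", 12, 1368378),
    ("hipMallocAsync", 12, 927255),
    ("hipStreamSynchronize", 12, 870486),
    ("hipLaunchKernel", 16, 692734),
    ("hipStreamDestroy", 4, 619905),
    ("hipDeviceSynchronize", 4, 404252),
    ("hipHostFree", 12, 271202),
    ("__hipRegisterFatBinary", 1, 9000),
    ("__hipRegisterFunction", 4, 6150),
    ("__hipPushCallConfiguration", 16, 2460),
    ("__hipPopCallConfiguration", 16, 2000),
    ("hipGetLastError", 16, 1270),
    ("hipSetDevice", 1, 660),
]


def derive_columns(rows):
    """Recompute the average duration and percentage share per function."""
    grand_total = sum(total for _, _, total in rows)
    derived = {}
    for name, calls, total in rows:
        derived[name] = {
            "average_ns": total / calls,
            "percentage": 100.0 * total / grand_total,
        }
    return derived


derived = derive_columns(ROWS)
top = max(derived, key=lambda name: derived[name]["percentage"])
print(f"{top}: {derived[top]['percentage']:.2f}% of the total HIP API time")
```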
Summary#
This option displays a summary of tracing data for the enabled tracing type after the conclusion of the profiling session.
rocprofv3 -S --hip-trace -- <application_path>

Summary per domain#
This option displays a summary of each tracing domain for the enabled tracing type after the conclusion of the profiling session.
rocprofv3 -D --hsa-trace --hip-trace -- <application_path>
The preceding command generates hip_api_trace.csv and hsa_api_trace.csv files prefixed with the process ID and displays the summary of each domain.
Summary groups#
This option displays a summary of multiple domains for the domain names specified on the command line. The summary groups can be separated using a pipe ( | ) symbol.
To see a summary for the MEMORY_COPY domain, use:
rocprofv3 --summary-groups MEMORY_COPY --sys-trace -- <application_path>

To see a summary for MEMORY_COPY
and HIP_API
domains, use:
rocprofv3 --summary-groups 'MEMORY_COPY|HIP_API' --sys-trace -- <application_path>
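The group argument is quoted in the preceding command because an unquoted pipe symbol would be interpreted by the shell. When launching rocprofv3 from a script, passing the arguments as a list avoids the shell entirely. The Python sketch below builds such an argument list and prints its shell-quoted form; the function name summary_groups_command and the application path ./my_app are placeholders introduced for illustration.

```python
import shlex


def summary_groups_command(groups, application):
    """Build the rocprofv3 argument list for a summary over several domains."""
    group_arg = "|".join(groups)  # domains of one group are joined with a pipe
    return ["rocprofv3", "--summary-groups", group_arg, "--sys-trace", "--", application]


argv = summary_groups_command(["MEMORY_COPY", "HIP_API"], "./my_app")
# shlex.join adds the quotes a shell needs around the pipe symbol.
print(shlex.join(argv))
```

The list in argv can be handed to subprocess.run as is, in which case no quoting is needed.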

Collecting traces using input file#
The preceding sections describe how to collect traces by specifying the desired tracing type on the command line. You can also specify the desired tracing types in an input file in YAML (.yaml/.yml), or JSON (.json) format. You can supply any command-line option for tracing in the input file.
Here is a sample input.yaml file for collecting tracing summary:
jobs:
  - output_directory: "@CMAKE_CURRENT_BINARY_DIR@/%env{ARBITRARY_ENV_VARIABLE}%"
    output_file: out
    output_format: [pftrace, json, otf2]
    log_level: env
    runtime_trace: true
    kernel_rename: true
    summary: true
    summary_per_domain: true
    summary_groups: ["KERNEL_DISPATCH|MEMORY_COPY"]
    summary_output_file: "summary"
Here is a sample input.json file for collecting tracing summary:
{
  "jobs": [
    {
      "output_directory": "out-directory",
      "output_file": "out",
      "output_format": ["pftrace", "json", "otf2"],
      "log_level": "env",
      "runtime_trace": true,
      "kernel_rename": true,
      "summary": true,
      "summary_per_domain": true,
      "summary_groups": ["KERNEL_DISPATCH|MEMORY_COPY"],
      "summary_output_file": "summary"
    }
  ]
}
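An input file like the one above can also be generated from a script, which helps when the same job description is reused across runs. The Python sketch below writes a single job to a temporary input.json, reads it back, and prints the command that would consume it with the -i option. The helper name write_job_file and the application path ./my_app are placeholders; only job keys that appear in the sample above are used.

```python
import json
import shlex
import tempfile
from pathlib import Path


def write_job_file(path, **options):
    """Write a rocprofv3 input file with a single job and return its contents."""
    config = {"jobs": [options]}
    Path(path).write_text(json.dumps(config, indent=2))
    return config


with tempfile.TemporaryDirectory() as tmp:
    input_path = Path(tmp) / "input.json"
    config = write_job_file(
        input_path,
        output_directory="out-directory",
        output_file="out",
        output_format=["pftrace", "json"],
        runtime_trace=True,
        summary=True,
        summary_groups=["KERNEL_DISPATCH|MEMORY_COPY"],
    )
    # Read the file back to confirm that it is valid JSON.
    reloaded = json.loads(input_path.read_text())
    command = ["rocprofv3", "-i", str(input_path), "--", "./my_app"]
    print(shlex.join(command))
```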
Here is the input schema (properties) of JSON or YAML input files:
``jobs`` (array): rocprofv3 input data per application run.
Items (object): data for rocprofv3.
``pmc`` (array): list of counters to collect.
``kernel_include_regex`` (string): Include the kernels matching this filter.
``kernel_exclude_regex`` (string): Exclude the kernels matching this filter.
``kernel_iteration_range`` (string): Iteration range for each kernel that match the filter [start-stop].
``hip_trace`` (boolean): For Collecting HIP Traces (runtime + compiler).
``hip_runtime_trace`` (boolean): For Collecting HIP Runtime API Traces.
``hip_compiler_trace`` (boolean): For Collecting HIP Compiler generated code Traces.
``marker_trace`` (boolean): For Collecting Marker (ROCTx) Traces.
``kernel_trace`` (boolean): For Collecting Kernel Dispatch Traces.
``memory_copy_trace`` (boolean): For Collecting Memory Copy Traces.
``memory_allocation_trace`` (boolean): For Collecting Memory Allocation Traces.
``scratch_memory_trace`` (boolean): For Collecting Scratch Memory operations Traces.
``stats`` (boolean): For Collecting statistics of enabled tracing types.
``hsa_trace`` (boolean): For Collecting HSA Traces (core + amd + image + finalizer).
``hsa_core_trace`` (boolean): For Collecting HSA API Traces (core API).
``hsa_amd_trace`` (boolean): For Collecting HSA API Traces (AMD-extension API).
``hsa_finalize_trace`` (boolean): For Collecting HSA API Traces (Finalizer-extension API).
``hsa_image_trace`` (boolean): For Collecting HSA API Traces (Image-extension API).
``sys_trace`` (boolean): For Collecting HIP, HSA, Marker (ROCTx), Memory copy, Memory allocation, Scratch memory, and Kernel dispatch traces.
``mangled_kernels`` (boolean): Do not demangle the kernel names.
``truncate_kernels`` (boolean): Truncate the demangled kernel names.
``output_file`` (string): For the output file name.
``output_directory`` (string): For adding output path where the output files will be saved.
``output_format`` (array): For adding output format (supported formats: csv, json, pftrace, otf2).
``list_metrics`` (boolean): List the metrics.
``log_level`` (string): fatal, error, warning, info, trace.
``preload`` (array): Libraries to prepend to LD_PRELOAD (usually for sanitizers).
``pc_sampling_unit`` (string): pc sampling unit.
``pc_sampling_method`` (string): pc sampling method.
``pc_sampling_interval`` (integer): pc sampling interval.
``pc_sampling_beta_enabled`` (boolean): enable pc sampling support; beta version.
- ``att_filenames`` (object)
``key`` (integer): Dispatch id.
``value`` (array): An array of ATT filenames.
- ``code_object_snapshot_filenames`` (array): Code
object snapshot filename.
$ cat input.txt
pmc: GPUBusy SQ_WAVES
pmc: GRBM_GUI_ACTIVE
While the input file in text format can only be used for counter collection, JSON and YAML formats support all the command-line options for profiling. The input file in YAML or JSON format has an array of profiling configurations called jobs. Each job is used to configure profiling for an application execution.
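To relate the two formats, the following Python sketch converts the pmc lines of a text input file into the jobs structure used by the JSON and YAML formats. It assumes that each pmc line corresponds to one job, which is an interpretation for illustration rather than a documented guarantee; the function name pmc_text_to_jobs is hypothetical.

```python
import json


def pmc_text_to_jobs(text):
    """Turn 'pmc: A B' lines into {"jobs": [{"pmc": ["A", "B"]}, ...]}."""
    jobs = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("pmc:"):
            continue  # ignore anything that is not a counter line
        # Counters may be separated by spaces or commas.
        counters = line[len("pmc:"):].replace(",", " ").split()
        if counters:
            jobs.append({"pmc": counters})
    return {"jobs": jobs}


# Contents of the sample input.txt shown above.
text = "pmc: GPUBusy SQ_WAVES\npmc: GRBM_GUI_ACTIVE\n"
config = pmc_text_to_jobs(text)
print(json.dumps(config, indent=2))
```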
Here is the input schema (properties) of JSON or YAML input files:
``jobs`` (array): rocprofv3 input data per application run.
Items (object): Data for rocprofv3.
``pmc`` (array): list of counters for collection
``kernel_include_regex`` (string)
``kernel_exclude_regex`` (string)
``kernel_iteration_range`` (string)
``mangled_kernels`` (boolean)
``truncate_kernels`` (boolean)
``output_file`` (string)
``output_directory`` (string)
``output_format`` (array)
``list_avail`` (boolean)
``log_level`` (string)
``preload`` (array)
``pc_sampling_unit`` (string)
``pc_sampling_method`` (string)
``pc_sampling_interval`` (integer)
``pc_sampling_beta_enabled`` (boolean)
For a description of the options specified under job items, see Command-line options.
Here is a sample input.json file for specifying counters for collection along with the options to filter and control the output:
$ cat input.json
{
"jobs": [
{
"pmc": ["SQ_WAVES", "GRBM_COUNT", "GRBM_GUI_ACTIVE"]
},
{
"pmc": ["FETCH_SIZE", "WRITE_SIZE"],
"kernel_include_regex": ".*_kernel",
"kernel_exclude_regex": "multiply",
"kernel_iteration_range": "[1-2],[3-4]",
"output_file": "out",
"output_format": [
"csv",
"json"
],
"truncate_kernels": true
}
]
}
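An input file like the one above can also be generated programmatically, which helps when the counter list or the filters change between runs. The following is a minimal Python sketch and not part of rocprofv3: the helper names `make_job` and `build_config` are illustrative assumptions, and only keys listed in the input schema above are used.

```python
import json


def make_job(pmc, **options):
    """Return one entry of the "jobs" array (hypothetical helper).

    `pmc` is the list of counters; `options` holds any other schema keys,
    such as kernel_include_regex or output_format.
    """
    job = {"pmc": list(pmc)}
    job.update(options)
    return job


def build_config(jobs):
    """Wrap the job entries in the top-level "jobs" array."""
    return {"jobs": list(jobs)}


config = build_config([
    make_job(["SQ_WAVES", "GRBM_COUNT", "GRBM_GUI_ACTIVE"]),
    make_job(
        ["FETCH_SIZE", "WRITE_SIZE"],
        kernel_include_regex=".*_kernel",
        kernel_exclude_regex="multiply",
        output_file="out",
        output_format=["csv", "json"],
        truncate_kernels=True,
    ),
])

# Serialize the configuration; save this text as input.json and pass it with -i.
text = json.dumps(config, indent=2)
print(text)
```

Each call to `make_job` corresponds to one job, that is, one profiling configuration for an application execution.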
Here is a sample input.yaml file for counter collection:
jobs:
- pmc: ["SQ_WAVES", "GRBM_COUNT", "GRBM_GUI_ACTIVE"]
- pmc: ["FETCH_SIZE", "WRITE_SIZE"]
kernel_include_regex: ".*_kernel"
kernel_exclude_regex: "multiply"
kernel_iteration_range: "[1-2],[3-4]"
output_file: "out"
output_format:
- "csv"
- "json"
truncate_kernels: true
To supply the input file for kernel profiling, use:
rocprofv3 -i input.yaml -- <application_path>
Counter collection using command line#
You can also collect the desired counters by directly specifying them in the command line instead of using an input file.
To supply the counters in the command line, use:
rocprofv3 --pmc SQ_WAVES GRBM_COUNT GRBM_GUI_ACTIVE -- <application_path>
Note
When specifying more than one counter, separate them using a space or a comma.
The job fails if the entire set of counters can’t be collected in a single pass.
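When the counter list is assembled by a script, the command line can be built as an argument list rather than a hand-written string. The sketch below is only an illustration: `pmc_command` is a hypothetical helper, `./my_app` stands in for your application path, and only the `--pmc` option and the `--` separator shown above are used.

```python
import shlex


def pmc_command(counters, application, app_args=()):
    """Build the argv list for a counter-collection run (hypothetical helper)."""
    if not counters:
        raise ValueError("at least one counter is required")
    return ["rocprofv3", "--pmc", *counters, "--", application, *app_args]


argv = pmc_command(["SQ_WAVES", "GRBM_COUNT", "GRBM_GUI_ACTIVE"], "./my_app")

# shlex.join quotes arguments where needed so the line can be pasted into a shell.
print(shlex.join(argv))

# To launch the run from Python instead: subprocess.run(argv, check=True)
```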
Extra counters#
While the basic counters and derived metrics are available for collection by default, you can also define your own counters as required. These user-defined counters with custom definitions are named extra counters.
You can define the extra counters in a YAML file as shown:
$ cat extra_counters.yaml
GRBM_GUI_ACTIVE_SUM:
architectures:
gfx942/gfx10/gfx1010/gfx1030/gfx1031/gfx11/gfx1032/gfx1102/gfx906/gfx1100/gfx1101/gfx908/gfx90a/gfx9:
expression: reduce(GRBM_GUI_ACTIVE,max)*CU_NUM
description: 'Unit: cycles'
To collect the extra counters defined in the extra_counters.yaml file, pass the file using the -E option and use the --pmc option to specify the extra counters to be collected:
rocprofv3 -E <path-to-extra_counters.yaml> --pmc GRBM_GUI_ACTIVE_SUM -- <app_relative_path>
Kernel profiling output#
Using rocprofv3 for counter collection, with either an input file or the command line, generates a ./pmc_n/counter_collection.csv file prefixed with the process ID. For each pmc row, a directory pmc_n containing a counter_collection.csv file is generated, where n = 1 for the first row and so on.
When using an input file in JSON or YAML format, a directory pass_n containing a counter_collection.csv file is generated for each job, where n = 1 for the first job and so on.
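As a rough illustration of how rows map to output directories, the sketch below splits a text input file into its pmc rows and pairs each row with the pmc_n directory name described above. It assumes the counters on a row are separated by whitespace; `pmc_rows` is a hypothetical helper, not part of rocprofv3.

```python
def pmc_rows(text):
    """Return the counter list of every "pmc:" row in a text input file."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("pmc:"):
            # Drop the "pmc:" prefix and split the rest on whitespace.
            rows.append(line[len("pmc:"):].split())
    return rows


text = "pmc: GPUBusy SQ_WAVES\npmc: GRBM_GUI_ACTIVE\n"
rows = pmc_rows(text)

# Row n of the input file corresponds to directory pmc_n (n starts at 1).
for n, counters in enumerate(rows, start=1):
    print(f"pmc_{n}: {counters}")
```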
Each row of the CSV file is an instance of kernel execution. Here is a truncated version of the output file from pmc_1:
$ cat pmc_1/218_counter_collection.csv
Here are the contents of the counter_collection.csv file:
Correlation_Id | Dispatch_Id | Agent_Id | Queue_Id | Process_Id | Thread_Id | Grid_Size | Kernel_Id | Kernel_Name | Workgroup_Size | LDS_Block_Size | Scratch_Size | VGPR_Count | SGPR_Count | Counter_Name | Counter_Value | Start_Timestamp | End_Timestamp |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 1 | 1 | 19396 | 19396 | 1048576 | 16 | void addition_kernel<float>(float*, float const*, float const*, int, int) | 64 | 0 | 0 | 8 | 16 | SQ_WAVES | 16384 | 2228955885095594 | 2228955885119754 |
2 | 2 | 1 | 1 | 19396 | 19396 | 1048576 | 19 | subtract_kernel(float*, float const*, float const*, int, int) | 64 | 0 | 0 | 8 | 16 | SQ_WAVES | 16384 | 2228955885095594 | 2228955885119754 |
5 | 5 | 1 | 2 | 19396 | 19396 | 1048576 | 16 | void addition_kernel<float>(float*, float const*, float const*, int, int) | 64 | 0 | 0 | 8 | 16 | SQ_WAVES | 16384 | 2228955885095594 | 2228955885119754 |
9 | 9 | 1 | 3 | 19396 | 19396 | 1048576 | 16 | void addition_kernel<float>(float*, float const*, float const*, int, int) | 64 | 0 | 0 | 8 | 16 | SQ_WAVES | 16384 | 2228955885095594 | 2228955885119754 |
13 | 13 | 1 | 4 | 19396 | 19396 | 1048576 | 16 | void addition_kernel<float>(float*, float const*, float const*, int, int) | 64 | 0 | 0 | 8 | 16 | SQ_WAVES | 16384 | 2228955885095594 | 2228955885119754 |
3 | 3 | 1 | 1 | 19396 | 19396 | 1048576 | 17 | multiply_kernel(float*, float const*, float const*, int, int) | 64 | 0 | 0 | 8 | 16 | SQ_WAVES | 16384 | 2228955885095594 | 2228955885119754 |
6 | 6 | 1 | 2 | 19396 | 19396 | 1048576 | 19 | subtract_kernel(float*, float const*, float const*, int, int) | 64 | 0 | 0 | 8 | 16 | SQ_WAVES | 16384 | 2228955885095594 | 2228955885119754 |
For the description of the fields in the output file, see Output file fields.
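Because each row is one kernel execution, a common follow-up is to total a counter per kernel. Here is a minimal Python sketch using only the standard library. The in-memory sample is a made-up stand-in reduced to the three columns the function reads, and `sum_counter_by_kernel` is an illustrative helper name.

```python
import csv
import io
from collections import defaultdict


def sum_counter_by_kernel(csv_file, counter_name):
    """Sum Counter_Value per Kernel_Name for one counter."""
    totals = defaultdict(float)
    for row in csv.DictReader(csv_file):
        if row["Counter_Name"] == counter_name:
            totals[row["Kernel_Name"]] += float(row["Counter_Value"])
    return dict(totals)


# Stand-in for a counter_collection.csv file (kernel names shortened).
sample = io.StringIO(
    '"Kernel_Name","Counter_Name","Counter_Value"\n'
    '"addition_kernel","SQ_WAVES",16384\n'
    '"addition_kernel","SQ_WAVES",16384\n'
    '"subtract_kernel","SQ_WAVES",16384\n'
)
totals = sum_counter_by_kernel(sample, "SQ_WAVES")
for kernel, value in sorted(totals.items()):
    print(f"{kernel}: {value:.0f}")
```

For a real run, replace the sample with an opened file, for example `open("pmc_1/218_counter_collection.csv", newline="")`.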
Agent info#
Note
All tracing and counter collection options generate an additional agent_info.csv file prefixed with the process ID.
The agent_info.csv file contains information about the CPU or GPU the kernel runs on.
$ cat 238_agent_info.csv
"Node_Id","Logical_Node_Id","Agent_Type","Cpu_Cores_Count","Simd_Count","Cpu_Core_Id_Base","Simd_Id_Base","Max_Waves_Per_Simd","Lds_Size_In_Kb","Gds_Size_In_Kb","Num_Gws","Wave_Front_Size","Num_Xcc","Cu_Count","Array_Count","Num_Shader_Banks","Simd_Arrays_Per_Engine","Cu_Per_Simd_Array","Simd_Per_Cu","Max_Slots_Scratch_Cu","Gfx_Target_Version","Vendor_Id","Device_Id","Location_Id","Domain","Drm_Render_Minor","Num_Sdma_Engines","Num_Sdma_Xgmi_Engines","Num_Sdma_Queues_Per_Engine","Num_Cp_Queues","Max_Engine_Clk_Ccompute","Max_Engine_Clk_Fcompute","Sdma_Fw_Version","Fw_Version","Capability","Cu_Per_Engine","Max_Waves_Per_Cu","Family_Id","Workgroup_Max_Size","Grid_Max_Size","Local_Mem_Size","Hive_Id","Gpu_Id","Workgroup_Max_Dim_X","Workgroup_Max_Dim_Y","Workgroup_Max_Dim_Z","Grid_Max_Dim_X","Grid_Max_Dim_Y","Grid_Max_Dim_Z","Name","Vendor_Name","Product_Name","Model_Name"
0,0,"CPU",24,0,0,0,0,0,0,0,0,1,24,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3800,0,0,0,0,0,0,23,0,0,0,0,0,0,0,0,0,0,0,"AMD Ryzen 9 3900X 12-Core Processor","CPU","AMD Ryzen 9 3900X 12-Core Processor",""
1,1,"GPU",0,256,0,2147487744,10,64,0,64,64,1,64,4,4,1,16,4,32,90000,4098,26751,12032,0,128,2,0,2,24,3800,1630,432,440,138420864,16,40,141,1024,4294967295,0,0,64700,1024,1024,1024,4294967295,4294967295,4294967295,"gfx900","AMD","Radeon RX Vega","vega10"
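The agent file is plain CSV, so it can be read with any CSV parser. As a small sketch, the following Python snippet lists the GPU agents. The in-memory sample keeps only four of the columns, with values taken from the listing above, and `gpu_agents` is an illustrative helper name.

```python
import csv
import io


def gpu_agents(csv_file):
    """Return (Node_Id, Name, Cu_Count) for every row whose Agent_Type is GPU."""
    return [
        (int(row["Node_Id"]), row["Name"], int(row["Cu_Count"]))
        for row in csv.DictReader(csv_file)
        if row["Agent_Type"] == "GPU"
    ]


# Stand-in for an agent_info.csv file, reduced to the columns used here.
sample = io.StringIO(
    '"Node_Id","Agent_Type","Cu_Count","Name"\n'
    '0,"CPU",24,"AMD Ryzen 9 3900X 12-Core Processor"\n'
    '1,"GPU",64,"gfx900"\n'
)
gpus = gpu_agents(sample)
print(gpus)
```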
Kernel filtering#
Kernel filtering allows you to include or exclude the kernels for profiling by specifying a filter using a regex string. You can also specify an iteration range for profiling the included kernels. If the iteration range is not provided, then all iterations of the included kernels are profiled.
Here is an input file with kernel filters:
$ cat input.yml
jobs:
- pmc: [SQ_WAVES]
kernel_include_regex: "divide"
kernel_exclude_regex: ""
kernel_iteration_range: "[1, 2, [5-8]]"
To collect counters for the kernels matching the filters specified in the preceding input file, run:
rocprofv3 -i input.yml -- <application_path>
$ cat pass_1/312_counter_collection.csv
"Correlation_Id","Dispatch_Id","Agent_Id","Queue_Id","Process_Id","Thread_Id","Grid_Size","Kernel_Id","Kernel_Name","Workgroup_Size","LDS_Block_Size","Scratch_Size","VGPR_Count","Accum_VGPR_Count","SGPR_Count","Counter_Name","Counter_Value","Start_Timestamp","End_Timestamp"
1,1,4,1,225049,225049,1048576,10,"void addition_kernel<float>(float*, float const*, float const*, int, int)",64,0,0,8,0,16,"SQ_WAVES",16384.000000,317095766765717,317095766775957
2,2,4,1,225049,225049,1048576,13,"subtract_kernel(float*, float const*, float const*, int, int)",64,0,0,8,0,16,"SQ_WAVES",16384.000000,317095767013157,317095767022957
3,3,4,1,225049,225049,1048576,11,"multiply_kernel(float*, float const*, float const*, int, int)",64,0,0,8,0,16,"SQ_WAVES",16384.000000,317095767176998,317095767186678
4,4,4,1,225049,225049,1048576,12,"divide_kernel(float*, float const*, float const*, int, int)",64,0,0,12,4,16,"SQ_WAVES",16384.000000,317095767380718,317095767390878
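To reason about which kernels a pair of include and exclude patterns would select, the filter logic can be approximated in a few lines of Python. This is only a sketch under stated assumptions: it treats an empty pattern as "no filter" and applies patterns with `re.search`, which may differ from how rocprofv3 matches kernel names. `select_kernels` is a hypothetical helper.

```python
import re


def select_kernels(names, include_regex="", exclude_regex=""):
    """Keep names matching include_regex and not matching exclude_regex.

    Assumption: an empty pattern disables that filter, and patterns are
    applied with re.search; rocprofv3's own matching rules may differ.
    """
    selected = []
    for name in names:
        if include_regex and not re.search(include_regex, name):
            continue
        if exclude_regex and re.search(exclude_regex, name):
            continue
        selected.append(name)
    return selected


kernels = ["addition_kernel", "subtract_kernel", "multiply_kernel", "divide_kernel"]
only_divide = select_kernels(kernels, include_regex="divide")
no_multiply = select_kernels(kernels, include_regex=".*_kernel", exclude_regex="multiply")
print(only_divide)
print(no_multiply)
```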
I/O control options#
Output file#
The output file name can be specified using the --output-file or -o option. If not specified, the output file name is prefixed with the process ID by default.
rocprofv3 --hip-trace --output-file output -- <application_path>
The above command generates an output_hip_api_trace.csv file.
Output directory#
The output directory can be specified using the --output-directory or -d option. If not specified, the default path is %hostname%/%pid%.
rocprofv3 --hip-trace --output-directory output_dir -- <application_path>
The above command generates an output_dir/%hostname%/%pid%_hip_api_trace.csv file.
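The output directory option supports many placeholders. To name a few:
%hostname%: Hostname of the machine
%pid%: Process ID
%env{NAME}%: Value of the environment variable NAME; consistent with other output key formats (starts and ends with %)
$ENV{NAME}: Value of the environment variable NAME; similar to CMake syntax
%q{NAME}%: Value of the environment variable NAME; provided for compatibility with NVIDIA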
To see a full list, refer to Output prefix keys.
The following example shows how to use the output directory option with placeholders:
mpirun -n 2 rocprofv3 --hip-trace -d %h.%p.%env{OMPI_COMM_WORLD_RANK}% -- <application_path>
The above command runs the application with rocprofv3 and generates the trace file for each rank. The trace files are prefixed with the hostname, process ID, and the MPI rank.
Assuming the hostname is ubuntu-latest and the process IDs are 3000020 and 3000019, the output file names are:
ubuntu-latest.3000020.1/ubuntu-latest/3000020_agent_info.csv
ubuntu-latest.3000019.0/ubuntu-latest/3000019_agent_info.csv
ubuntu-latest.3000020.1/ubuntu-latest/3000020_hip_api_trace.csv
ubuntu-latest.3000019.0/ubuntu-latest/3000019_hip_api_trace.csv
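To make the substitution concrete, here is a small Python sketch that expands three of the placeholders in a directory template. It is an illustration only, not rocprofv3's implementation: `expand_keys` is a hypothetical helper, it handles just %hostname%, %pid%, and %env{NAME}%, and it assumes an unset environment variable expands to an empty string.

```python
import os
import re
import socket


def expand_keys(template, pid=None, hostname=None, env=None):
    """Expand %hostname%, %pid%, and %env{NAME}% in an output path template."""
    env = os.environ if env is None else env
    values = {
        "hostname": socket.gethostname() if hostname is None else hostname,
        "pid": str(os.getpid() if pid is None else pid),
    }

    def replace(match):
        if match.group("env") is not None:
            # Assumption: an unset variable expands to an empty string.
            return env.get(match.group("env"), "")
        return values[match.group("key")]

    pattern = r"%env\{(?P<env>\w+)\}%|%(?P<key>hostname|pid)%"
    return re.sub(pattern, replace, template)


path = expand_keys(
    "%hostname%.%pid%.%env{OMPI_COMM_WORLD_RANK}%",
    pid=3000019,
    hostname="ubuntu-latest",
    env={"OMPI_COMM_WORLD_RANK": "0"},
)
print(path)  # ubuntu-latest.3000019.0
```

With the values from the example above, the result is the directory name used for MPI rank 0.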
Output prefix keys#
Output prefix keys have many uses but are most helpful when dealing with multiple profiling runs or large MPI jobs. Here is a list of the available keys:
String |
Encoding |
---|---|
|
Entire command-line condensed into a single string |
|
Similar to |
|
All command line arguments condensed into a single string |
|
Basename of first command line argument |
|
Hostname of the machine (i.e. gethostname()) |
|
Process identifier (i.e. getpid()) |
|
Parent process identifier (i.e. getppid()) |
|
Process group identifier (i.e. getpgid(getpid())) |
|
Process session identifier (i.e. getsid(getpid())) |
|
Number of sibling processes (from reading /proc/<PPID>/task/<PPID>/children) |
|
Value of SLURM_JOB_ID environment variable if exists, else 0 |
|
Value of SLURM_PROCID environment variable if exists, else MPI_Comm_rank (or 0 non-mpi) |
|
MPI_Comm_size or 1 if non-mpi |
|
%rank% if possible, otherwise |
|
Launch date and time (Date and/or time according to ROCPROF_TIME_FORMAT) |
|
Value of environment variable NAME (i.e. getenv(NAME)) |
|
Alternative syntax to |
|
Shorthand for |
|
Shorthand for |
|
Shorthand for |
|
Shorthand for |
Output file fields#
The following table lists the fields (columns) in the output CSV files generated for application tracing and kernel profiling:
Field | Description |
---|---|
Agent_Id | GPU identifier to which the kernel was submitted. |
Correlation_Id | Unique identifier for correlation between HIP and HSA async calls during activity tracing. |
Start_Timestamp | Time in nanoseconds (ns) at which the kernel begins execution. |
End_Timestamp | Time in ns at which the kernel finishes execution. |
Queue_Id | ROCm queue unique identifier to which the kernel was submitted. |
Private_Segment_Size | The amount of memory required in bytes for the combined private, spill, and arg segments for a work item. |
Group_Segment_Size | The group segment memory required by a workgroup in bytes. This does not include any dynamically allocated group segment memory that may be added when the kernel is dispatched. |
Workgroup_Size | Size of the workgroup as declared by the compute shader. |
Workgroup_Size_n | Size of the workgroup in the nth dimension as declared by the compute shader, where n = X, Y, or Z. |
Grid_Size | Number of thread blocks required to launch the kernel. |
Grid_Size_n | Number of thread blocks in the nth dimension required to launch the kernel, where n = X, Y, or Z. |
LDS_Block_Size | Thread block size for the kernel’s Local Data Share (LDS) memory. |
Scratch_Size | Kernel’s scratch memory size. |
SGPR_Count | Kernel’s Scalar General Purpose Register (SGPR) count. |
VGPR_Count | Kernel’s Architected Vector General Purpose Register (VGPR) count. |
Accum_VGPR_Count | Kernel’s Accumulation Vector General Purpose Register (Accum_VGPR/AGPR) count. |
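Since Start_Timestamp and End_Timestamp are both in nanoseconds, the kernel duration is their difference. A minimal Python sketch follows. The in-memory sample reuses two timestamp pairs from the kernel filtering output shown earlier, with kernel names shortened, and `kernel_durations` is an illustrative helper name.

```python
import csv
import io


def kernel_durations(csv_file):
    """Return (Kernel_Name, duration in ns) for every row of the CSV file."""
    return [
        (row["Kernel_Name"], int(row["End_Timestamp"]) - int(row["Start_Timestamp"]))
        for row in csv.DictReader(csv_file)
    ]


# Stand-in for an output CSV file, reduced to the columns used here.
sample = io.StringIO(
    '"Kernel_Name","Start_Timestamp","End_Timestamp"\n'
    '"addition_kernel",317095766765717,317095766775957\n'
    '"divide_kernel",317095767380718,317095767390878\n'
)
durations = kernel_durations(sample)
for name, ns in durations:
    print(f"{name}: {ns} ns")
```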
Output formats#
rocprofv3 supports the following output formats:
CSV (Default)
JSON (Custom format for programmatic analysis only)
PFTrace (Perfetto trace for visualization with Perfetto)
OTF2 (Open Trace Format for visualization with compatible third-party tools)
To specify the output format, use:
rocprofv3 -i input.txt --output-format json -- <application_path>
Format selection is case-insensitive and multiple output formats are supported. While --output-format json exclusively enables JSON output, --output-format csv json pftrace otf2 enables all four output formats for the run.
For PFTrace trace visualization, use the PFTrace format and open the trace in ui.perfetto.dev.
For OTF2 trace visualization, open the trace in vampir.eu or any supported visualizer.
Note
For large trace files (> 10GB), it’s recommended to use OTF2 format.
JSON output schema#
rocprofv3 supports a custom JSON output format designed for programmatic analysis and NOT for visualization.
The schema is optimized for size while factoring in usability.
Note
Perfetto UI doesn’t accept this JSON output format.
To generate the JSON output, use the --output-format json command-line option.
Properties#
Here are the properties of the JSON output schema:
- `rocprofiler-sdk-tool` (array): rocprofv3 data per process (each element represents a process).
- Items (object): Data for rocprofv3.
- `metadata` (object, required): Metadata related to the profiler session.
`pid` (integer, required): Process ID.
`init_time` (integer, required): Initialization time in nanoseconds.
`fini_time` (integer, required): Finalization time in nanoseconds.
- `agents` (array, required): List of agents.
- Items (object): Data for an agent.
`size` (integer, required): Size of the agent data.
- `id` (object, required): Identifier for the agent.
`handle` (integer, required): Handle for the agent.
`type` (integer, required): Type of the agent.
`cpu_cores_count` (integer): Number of CPU cores.
`simd_count` (integer): Number of SIMD units.
`mem_banks_count` (integer): Number of memory banks.
`caches_count` (integer): Number of caches.
`io_links_count` (integer): Number of I/O links.
`cpu_core_id_base` (integer): Base ID for CPU cores.
`simd_id_base` (integer): Base ID for SIMD units.
`max_waves_per_simd` (integer): Maximum waves per SIMD.
`lds_size_in_kb` (integer): Size of LDS in KB.
`gds_size_in_kb` (integer): Size of GDS in KB.
`num_gws` (integer): Number of GWS (global wave sync) resources.
`wave_front_size` (integer): Size of the wave front.
`num_xcc` (integer): Number of XCC (execution compute units).
`cu_count` (integer): Number of compute units (CUs).
`array_count` (integer): Number of arrays.
`num_shader_banks` (integer): Number of shader banks.
`simd_arrays_per_engine` (integer): SIMD arrays per engine.
`cu_per_simd_array` (integer): CUs per SIMD array.
`simd_per_cu` (integer): SIMDs per CU.
`max_slots_scratch_cu` (integer): Maximum slots for scratch CU.
`gfx_target_version` (integer): GFX target version.
`vendor_id` (integer): Vendor ID.
`device_id` (integer): Device ID.
`location_id` (integer): Location ID.
`domain` (integer): Domain identifier.
`drm_render_minor` (integer): DRM render minor version.
`num_sdma_engines` (integer): Number of SDMA engines.
`num_sdma_xgmi_engines` (integer): Number of SDMA XGMI engines.
`num_sdma_queues_per_engine` (integer): Number of SDMA queues per engine.
`num_cp_queues` (integer): Number of CP queues.
`max_engine_clk_ccompute` (integer): Maximum engine clock for compute.
`max_engine_clk_fcompute` (integer): Maximum engine clock for F compute.
- `sdma_fw_version` (object): SDMA firmware version.
`uCodeSDMA` (integer, required): SDMA microcode version.
`uCodeRes` (integer, required): Reserved microcode version.
- `fw_version` (object): Firmware version.
`uCode` (integer, required): Microcode version.
`Major` (integer, required): Major version.
`Minor` (integer, required): Minor version.
`Stepping` (integer, required): Stepping version.
- `capability` (object, required): Agent capability flags.
`HotPluggable` (integer, required): Hot pluggable capability.
`HSAMMUPresent` (integer, required): HSAMMU present capability.
`SharedWithGraphics` (integer, required): Shared with graphics capability.
`QueueSizePowerOfTwo` (integer, required): Queue size is power of two.
`QueueSize32bit` (integer, required): Queue size is 32-bit.
`QueueIdleEvent` (integer, required): Queue idle event.
`VALimit` (integer, required): VA limit.
`WatchPointsSupported` (integer, required): Watch points supported.
`WatchPointsTotalBits` (integer, required): Total bits for watch points.
`DoorbellType` (integer, required): Doorbell type.
`AQLQueueDoubleMap` (integer, required): AQL queue double map.
`DebugTrapSupported` (integer, required): Debug trap supported.
`WaveLaunchTrapOverrideSupported` (integer, required): Wave launch trap override supported.
`WaveLaunchModeSupported` (integer, required): Wave launch mode supported.
`PreciseMemoryOperationsSupported` (integer, required): Precise memory operations supported.
`DEPRECATED_SRAM_EDCSupport` (integer, required): Deprecated SRAM EDC support.
`Mem_EDCSupport` (integer, required): Memory EDC support.
`RASEventNotify` (integer, required): RAS event notify.
`ASICRevision` (integer, required): ASIC revision.
`SRAM_EDCSupport` (integer, required): SRAM EDC support.
`SVMAPISupported` (integer, required): SVM API supported.
`CoherentHostAccess` (integer, required): Coherent host access.
`DebugSupportedFirmware` (integer, required): Debug supported firmware.
`Reserved` (integer, required): Reserved field.
- `counters` (array, required): Array of counter objects.
- Items (object)
- `agent_id` (object, required): Agent ID information.
`handle` (integer, required): Handle of the agent.
- `id` (object, required): Counter ID information.
`handle` (integer, required): Handle of the counter.
`is_constant` (integer, required): Indicator if the counter value is constant.
`is_derived` (integer, required): Indicator if the counter value is derived.
`name` (string, required): Name of the counter.
`description` (string, required): Description of the counter.
`block` (string, required): Block information of the counter.
`expression` (string, required): Expression of the counter.
- `dimension_ids` (array, required): Array of dimension IDs.
Items (integer): Dimension ID.
- `strings` (object, required): String records.
- `callback_records` (array): Callback records.
- Items (object)
`kind` (string, required): Kind of the record.
- `operations` (array, required): Array of operations.
Items (string): Operation.
- `buffer_records` (array): Buffer records.
- Items (object)
`kind` (string, required): Kind of the record.
- `operations` (array, required): Array of operations.
Items (string): Operation.
- `marker_api` (array): Marker API records.
- Items (object)
`key` (integer, required): Key of the record.
`value` (string, required): Value of the record.
- `counters` (object): Counter records.
- `dimension_ids` (array, required): Array of dimension IDs.
- Items (object)
`id` (integer, required): Dimension ID.
`instance_size` (integer, required): Size of the instance.
`name` (string, required): Name of the dimension.
- `pc_sample_instructions` (array): Array of decoded instructions matching sampled PCs from the pc_sample_host_trap section.
- `pc_sample_comments` (array): Comments matching assembly instructions from the pc_sample_instructions array. If debug symbols are available, comments provide instruction to source-line mapping. Otherwise, a comment is an empty string.
- `code_objects` (array, required): Code object records.
- Items (object)
`size` (integer, required): Size of the code object.
`code_object_id` (integer, required): ID of the code object.
- `rocp_agent` (object, required): ROCP agent information.
`handle` (integer, required): Handle of the ROCP agent.
- `hsa_agent` (object, required): HSA agent information.
`handle` (integer, required): Handle of the HSA agent.
`uri` (string, required): URI of the code object.
`load_base` (integer, required): Base address for loading.
`load_size` (integer, required): Size for loading.
`load_delta` (integer, required): Delta for loading.
`storage_type` (integer, required): Type of storage.
`memory_base` (integer, required): Base address for memory.
`memory_size` (integer, required): Size of memory.
- `kernel_symbols` (array, required): Kernel symbol records.
- Items (object)
`size` (integer, required): Size of the kernel symbol.
`kernel_id` (integer, required): ID of the kernel.
`code_object_id` (integer, required): ID of the code object.
`kernel_name` (string, required): Name of the kernel.
`kernel_object` (integer, required): Object of the kernel.
`kernarg_segment_size` (integer, required): Size of the kernarg segment.
`kernarg_segment_alignment` (integer, required): Alignment of the kernarg segment.
`group_segment_size` (integer, required): Size of the group segment.
`private_segment_size` (integer, required): Size of the private segment.
`formatted_kernel_name` (string, required): Formatted name of the kernel.
`demangled_kernel_name` (string, required): Demangled name of the kernel.
`truncated_kernel_name` (string, required): Truncated name of the kernel.
- `callback_records` (object, required): Callback record details.
- `counter_collection` (array): Counter collection records.
- Items (object)
- `dispatch_data` (object, required): Dispatch data details.
`size` (integer, required): Size of the dispatch data.
- `correlation_id` (object, required): Correlation ID information.
`internal` (integer, required): Internal correlation ID.
`external` (integer, required): External correlation ID.
- `dispatch_info` (object, required): Dispatch information details.
`size` (integer, required): Size of the dispatch information.
- `agent_id` (object, required): Agent ID information.
`handle` (integer, required): Handle of the agent.
- `queue_id` (object, required): Queue ID information.
`handle` (integer, required): Handle of the queue.
`kernel_id` (integer, required): ID of the kernel.
`dispatch_id` (integer, required): ID of the dispatch.
`private_segment_size` (integer, required): Size of the private segment.
`group_segment_size` (integer, required): Size of the group segment.
- `workgroup_size` (object, required): Workgroup size information.
`x` (integer, required): X dimension.
`y` (integer, required): Y dimension.
`z` (integer, required): Z dimension.
- `grid_size` (object, required): Grid size information.
`x` (integer, required): X dimension.
`y` (integer, required): Y dimension.
`z` (integer, required): Z dimension.
- `records` (array, required): Records.
- Items (object)
- `counter_id` (object, required): Counter ID information.
`handle` (integer, required): Handle of the counter.
`value` (number, required): Value of the counter.
`thread_id` (integer, required): Thread ID.
`arch_vgpr_count` (integer, required): Count of Architected VGPRs.
`accum_vgpr_count` (integer, required): Count of Accumulation VGPRs.
`sgpr_count` (integer, required): Count of SGPRs.
`lds_block_size_v` (integer, required): Size of LDS block.
- `pc_sample_host_trap` (array): Host Trap PC Sampling records.
- Items (object)
- `hw_id` (object): Describes the hardware part on which the sampled wave was running.
`chiplet` (integer): Chiplet index.
`wave_id` (integer): Wave slot index.
`simd_id` (integer): SIMD index.
`pipe_id` (integer): Pipe index.
`cu_or_wgp_id` (integer): Index of the compute unit or workgroup processor.
`shader_array_id` (integer): Shader array index.
`shader_engine_id` (integer): Shader engine index.
`workgroup_id` (integer): Workgroup position in the 3D grid.
`vm_id` (integer): Virtual memory ID.
`queue_id` (integer): Queue ID.
`microengine_id` (integer): ACE (microengine) index.
- `pc` (object): Encapsulates information about the sampled PC.
`code_object_id` (integer): Code object ID.
`code_object_offset` (integer): Offset within the code object if the latter is known. Otherwise, virtual address of the PC.
`exec_mask` (integer): Execution mask indicating active SIMD lanes of the sampled wave.
`timestamp` (integer): Timestamp.
`dispatch_id` (integer): Dispatch ID.
- `correlation_id` (object): Correlation ID information.
`internal` (integer): Internal correlation ID.
`external` (integer): External correlation ID.
- `rocprofiler_dim3_t` (object): Position of the workgroup in the 3D grid.
`x` (integer): Dimension x.
`y` (integer): Dimension y.
`z` (integer): Dimension z.
`wave_in_group` (integer): Wave position within the workgroup (0-31).
- `buffer_records` (object, required): Buffer record details.
- `kernel_dispatch` (array): Kernel dispatch records.
- Items (object)
`size` (integer, required): Size of the dispatch.
`kind` (integer, required): Kind of the dispatch.
`operation` (integer, required): Operation of the dispatch.
`thread_id` (integer, required): Thread ID.
- `correlation_id` (object, required): Correlation ID information.
`internal` (integer, required): Internal correlation ID.
`external` (integer, required): External correlation ID.
`start_timestamp` (integer, required): Start timestamp.
`end_timestamp` (integer, required): End timestamp.
- `dispatch_info` (object, required): Dispatch information details.
`size` (integer, required): Size of the dispatch information.
- `agent_id` (object, required): Agent ID information.
`handle` (integer, required): Handle of the agent.
- `queue_id` (object, required): Queue ID information.
`handle` (integer, required): Handle of the queue.
`kernel_id` (integer, required): ID of the kernel.
`dispatch_id` (integer, required): ID of the dispatch.
`private_segment_size` (integer, required): Size of the private segment.
`group_segment_size` (integer, required): Size of the group segment.
- `workgroup_size` (object, required): Workgroup size information.
`x` (integer, required): X dimension.
`y` (integer, required): Y dimension.
`z` (integer, required): Z dimension.
- `grid_size` (object, required): Grid size information.
`x` (integer, required): X dimension.
`y` (integer, required): Y dimension.
`z` (integer, required): Z dimension.
- `hip_api` (array): HIP API records.
- Items (object)
`size` (integer, required): Size of the HIP API record.
`kind` (integer, required): Kind of the HIP API.
`operation` (integer, required): Operation of the HIP API.
- `correlation_id` (object, required): Correlation ID information.
`internal` (integer, required): Internal correlation ID.
`external` (integer, required): External correlation ID.
`start_timestamp` (integer, required): Start timestamp.
`end_timestamp` (integer, required): End timestamp.
`thread_id` (integer, required): Thread ID.
- `hsa_api` (array): HSA API records.
- Items (object)
`size` (integer, required): Size of the HSA API record.
`kind` (integer, required): Kind of the HSA API.
`operation` (integer, required): Operation of the HSA API.
- `correlation_id` (object, required): Correlation ID information.
`internal` (integer, required): Internal correlation ID.
`external` (integer, required): External correlation ID.
`start_timestamp` (integer, required): Start timestamp.
`end_timestamp` (integer, required): End timestamp.
`thread_id` (integer, required): Thread ID.
- `marker_api` (array): Marker (ROCTx) API records.
- Items (object)
`size` (integer, required): Size of the Marker API record.
`kind` (integer, required): Kind of the Marker API.
`operation` (integer, required): Operation of the Marker API.
- `correlation_id` (object, required): Correlation ID information.
`internal` (integer, required): Internal correlation ID.
`external` (integer, required): External correlation ID.
`start_timestamp` (integer, required): Start timestamp.
`end_timestamp` (integer, required): End timestamp.
`thread_id` (integer, required): Thread ID.
- `memory_copy` (array): Async memory copy records.
- Items (object)
`size` (integer, required): Size of the memory copy record.
`kind` (integer, required): Kind of the memory copy.
`operation` (integer, required): Operation of the memory copy.
- `correlation_id` (object, required): Correlation ID information.
`internal` (integer, required): Internal correlation ID.
`external` (integer, required): External correlation ID.
`start_timestamp` (integer, required): Start timestamp.
`end_timestamp` (integer, required): End timestamp.
`thread_id` (integer, required): Thread ID.
- `dst_agent_id` (object, required): Destination Agent ID.
`handle` (integer, required): Handle of the agent.
- `src_agent_id` (object, required): Source Agent ID.
`handle` (integer, required): Handle of the agent.
`bytes` (integer, required): Bytes copied.
- `memory_allocation` (array): Memory allocation records.
- Items (object)
`size` (integer, required): Size of the memory allocation record.
`kind` (integer, required): Kind of the memory allocation.
`operation` (integer, required): Operation of the memory allocation.
- `correlation_id` (object, required): Correlation ID information.
`internal` (integer, required): Internal correlation ID.
`external` (integer, required): External correlation ID.
`start_timestamp` (integer, required): Start timestamp.
`end_timestamp` (integer, required): End timestamp.
`thread_id` (integer, required): Thread ID.
- `agent_id` (object, required): Agent ID.
`handle` (integer, required): Handle of the agent.
`address` (string, required): Starting address of allocation.
`allocation_size` (integer, required): Size of allocation.
- `rocDecode_api` (array): rocDecode API records.
- Items (object)
`size` (integer, required): Size of the rocDecode API record.
`kind` (integer, required): Kind of the rocDecode API.
`operation` (integer, required): Operation of the rocDecode API.
- `correlation_id` (object, required): Correlation ID information.
`internal` (integer, required): Internal correlation ID.
`external` (integer, required): External correlation ID.
`start_timestamp` (integer, required): Start timestamp.
`end_timestamp` (integer, required): End timestamp.
`thread_id` (integer, required): Thread ID.
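As an example of the programmatic analysis this schema is meant for, the Python sketch below pairs each kernel dispatch record with its kernel name and computes the duration. It relies on one assumption about nesting, since the listing above does not show indentation: that `kernel_symbols` and `buffer_records` sit directly under each per-process item of the `rocprofiler-sdk-tool` array. The sample document is hand-written with made-up values and contains only the fields that are read.

```python
import json


def kernel_dispatch_durations(output):
    """Return (kernel_name, duration_ns) for each kernel dispatch record.

    Assumption: kernel_symbols and buffer_records are direct children of
    each per-process item in the "rocprofiler-sdk-tool" array.
    """
    results = []
    for process in output["rocprofiler-sdk-tool"]:
        # Map kernel_id to a readable name using the kernel symbol records.
        names = {
            sym["kernel_id"]: sym["formatted_kernel_name"]
            for sym in process["kernel_symbols"]
        }
        for rec in process["buffer_records"].get("kernel_dispatch", []):
            kernel_id = rec["dispatch_info"]["kernel_id"]
            duration = rec["end_timestamp"] - rec["start_timestamp"]
            results.append((names.get(kernel_id, f"<kernel {kernel_id}>"), duration))
    return results


# Hand-written stand-in for a rocprofv3 JSON output file (made-up values).
sample_text = """
{
  "rocprofiler-sdk-tool": [
    {
      "kernel_symbols": [
        {"kernel_id": 12, "formatted_kernel_name": "divide_kernel"}
      ],
      "buffer_records": {
        "kernel_dispatch": [
          {
            "start_timestamp": 1000,
            "end_timestamp": 1750,
            "dispatch_info": {"kernel_id": 12}
          }
        ]
      }
    }
  ]
}
"""
results = kernel_dispatch_durations(json.loads(sample_text))
print(results)
```

For a real run, load the generated JSON file with `json.load` instead of parsing the sample string.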