Using thread trace#
Thread trace is a shader execution tracing technique that profiles wavefronts at the instruction timing level. It is a low-level tracing and profiling feature that targets a single kernel execution or a small number of them.
Thread trace features include:
Near cycle-accurate instruction tracing
Exact thread or wave execution path
Wave scheduling and stall timing analysis
Instruction and source level hotspots
Extremely fast and granular counter collection (AMD Instinct)
Supported devices:
AMD Instinct: MI200 and MI300 series
AMD Radeon: gfx10, gfx11 and gfx12
Thread trace profiling is performed in the following steps:
Tracing (data collection) - Uses ROCprofiler-SDK thread trace service API
Decoding (analysis) - Uses ROCprof Trace Decoder API
Visualization - Requires ROCprof Compute Viewer
Tracing and decoding are handled by rocprofv3, while visualization is handled by the ROCprof Compute Viewer.
Prerequisites#
aqlprofile:
ROCm 7.x build, or
An early release, which can be built from source
Otherwise, rocprofv3 throws the error “INVALID_SHADER_DATA” or “Agent not supported”.
Installation of ROCprof Trace Decoder component:
For binary files, see ROCprof trace decoder release page.
Default install location is /opt/rocm/lib
For custom location, use:
Parameter --att-library-path, or
Environment variable ROCPROF_ATT_LIBRARY_PATH
rocprofv3 parameters for thread tracing#
To collect thread trace with default parameters, use:
rocprofv3 --att -d <output_dir> -- <application_path>
The following table lists the parameters relevant to thread tracing:
Parameter | Type | Range | Typical | Description |
---|---|---|---|---|
att-target-cu | Integer | 0 - 15 | 1 | Defines the CU used to gather detail tokens (WGP on Navi) |
att-shader-engine-mask | Bitmask | 1 - ~0u | 0x1 | Defines the Shader Engines (SE) to be traced. Max 2^32 - 1 |
att-simd-select | Integer | 0 - 0xF | gfx9: 0xF, Navi: 0x0 | Defines one or more SIMDs to be traced, out of four. Bitmask on GFX9 and SIMD_ID[0,3] on Navi. |
kernel-iteration-range | List | | | Defines dispatch iteration of the kernel to be profiled |
kernel-include-regex | String | Any | | Profiles kernel names matching the regex |
kernel-exclude-regex | String | Any | | Doesn’t profile kernel names matching the regex |
att-buffer-size | Bytes | 1MB-2GB | 96MB | Specifies the trace buffer size. This is shared for all SEs. Increase this value if the buffer tends to get full. |
att-serialize-all | Bool | | False | If set to “True”, turns on serialization for untraced kernels |
att-perfcounter-ctrl | Integer | 1 - 32 | 2~8 | Available only in gfx9. Streams SQ performance counters to the thread trace buffer in the given relative period. As this uses high bandwidth, a value too low can cause or worsen “Data Lost” events and warnings. |
att-perfcounters | String | SQ-only | | Available only in gfx9. Specifies the list of SQ counters. To list all counters, use “rocprofv3 --list-avail”. |
att-activity | Integer | 1 - 16 | 5~10 | Available only in gfx9. Shorthand for att-perfcounter-ctrl and the att-perfcounters related to compute unit activity such as VALU, SALU, etc. |
For AMD Instinct accelerators, enable perfmon streaming using:
rocprofv3 --att --att-activity 8 -- <application_path>
For AMD Radeon, the att-simd-select parameter is a SIMD ID that defaults to 3. For some applications, it’s best to select SIMD 0 instead:
rocprofv3 --att --att-simd-select 0x0 -- <application_path>
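The bitmask-valued parameters in the table above (att-shader-engine-mask, and att-simd-select on gfx9) can be computed from a list of indices. The following is a minimal Python sketch, not part of rocprofv3: the helper name `bitmask` is hypothetical, and it assumes that bit i of a mask selects shader engine i (or SIMD i), which this page does not state explicitly.

```python
def bitmask(indices):
    """Return an integer with one bit set for each index in `indices`."""
    mask = 0
    for index in indices:
        if index < 0:
            raise ValueError("bit indices must be non-negative")
        mask |= 1 << index
    return mask


# Shader engine 0 only (assumed to be bit 0 of att-shader-engine-mask).
se0_only = bitmask([0])

# Every fourth shader engine: bits 0, 4, 8, ..., 28.
every_fourth_se = bitmask(range(0, 32, 4))

# All four SIMDs, as a gfx9-style att-simd-select bitmask.
all_simds = bitmask(range(4))

print(hex(se0_only), hex(every_fourth_se), hex(all_simds))
```

Under these assumptions the three values are 0x1, 0x11111111 and 0xf, which match the typical shader engine mask, one of the wider masks suggested under Troubleshooting, and the gfx9 SIMD selection listed in the table.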
Using input file#
As explained in the preceding section, you can specify parameters on the command line or use a JSON input file:
{
    "jobs": [
        {
            "advanced_thread_trace": true,
            "att_target_cu": 1,
            "att_shader_engine_mask": "0x1",
            "att_simd_select": "0xF",
            "att_buffer_size": "0x6000000"
        }
    ]
}
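An input file like the one above can also be generated from a script, which helps when sweeping over target CUs or buffer sizes. The following is a minimal Python sketch that only reuses the keys shown in the JSON above; the function name `make_att_job`, its defaults, and the file name `att_input.json` are hypothetical.

```python
import json


def make_att_job(target_cu=1, se_mask=0x1, simd_select=0xF, buffer_mib=96):
    """Build one thread trace job using the keys from the JSON example."""
    buffer_bytes = buffer_mib * 1024 * 1024
    return {
        "advanced_thread_trace": True,
        "att_target_cu": target_cu,
        # Masks and sizes are written as hex strings, as in the example.
        "att_shader_engine_mask": f"0x{se_mask:X}",
        "att_simd_select": f"0x{simd_select:X}",
        "att_buffer_size": f"0x{buffer_bytes:X}",
    }


def write_input_file(path, jobs):
    """Write the jobs list to `path` in the {"jobs": [...]} layout."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump({"jobs": jobs}, handle, indent=4)


config_text = json.dumps({"jobs": [make_att_job()]}, indent=4)
print(config_text)
```

With the defaults, the buffer size comes out as 0x6000000 bytes, that is 96 MiB, matching the typical att-buffer-size value in the parameter table. Calling `write_input_file("att_input.json", [make_att_job()])` would save the same content to disk.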
Thread tracing for multiple kernel instances#
By default, rocprofv3 traces each kernel only once: if an application launches the same kernel multiple times, only the first instance is traced.
To enable thread trace for multiple kernel instances, use the kernel-iteration-range parameter.
It’s recommended to use the kernel-include-regex parameter to filter the desired kernel names instead of tracing everything.
rocprofv3 output files#
After the application finishes executing, ROCprof Trace Decoder runs automatically and the following output files are generated:
stats_*.csv files:
Contain a summary of instruction latency per kernel.
ui_output_agent_{agent_id}_dispatch_{dispatch_id} directory:
Contains detailed tracing information in the form of .json files.
This directory can be opened using the ROCprof Compute Viewer.
Raw files:
.att - Raw SQTT data. Can be used with the ROCprof Trace Decoder for further analysis.
.out - Code object binaries (executable). Can be used with ISA analysis tools.
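After a run, it can be convenient to gather the generated files by type. The following is a minimal Python sketch based only on the file name patterns listed above; the function name `collect_outputs` is hypothetical, and the recursive search is an assumption made because the exact directory nesting under the output directory is not described here.

```python
from pathlib import Path


def collect_outputs(output_dir):
    """Group rocprofv3 thread trace outputs found under `output_dir`."""
    root = Path(output_dir)
    return {
        # Per-kernel instruction latency summaries.
        "stats": sorted(root.rglob("stats_*.csv")),
        # Directories that can be opened in the ROCprof Compute Viewer.
        "ui_dirs": sorted(
            path
            for path in root.rglob("ui_output_agent_*_dispatch_*")
            if path.is_dir()
        ),
        # Raw SQTT data.
        "raw_att": sorted(root.rglob("*.att")),
        # Code object binaries.
        "code_objects": sorted(root.rglob("*.out")),
    }
```

For example, `collect_outputs("<output_dir>")["ui_dirs"]` would list the directories to open in the viewer, where `<output_dir>` is the path passed to rocprofv3 with `-d`.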
Stats CSV#
Here is a sample stats_*.csv file that is generated by the rocprofv3 tool.
Codeobj | Vaddr | Instruction | Hitcount | Latency | Stall | Idle | Source |
---|---|---|---|---|---|---|---|
11 | 5888 | s_load_dwordx4 s[40:43], s[0:1], 0x18 | 48 | 276 | 96 | 48 | kernel.py:391 |
11 | 5896 | s_load_dwordx2 s[38:39], s[0:1], 0x28 | 48 | 192 | 0 | 0 | kernel.py:391 |
11 | 5904 | s_ashr_i32 s3, s2, 31 | 48 | 260 | 0 | 0 | kernel.py:395 |
11 | 5908 | s_add_i32 s7, s2, s3 | 48 | 196 | 0 | 0 | kernel.py:395 |
The columns of the stats_*.csv file are described here:
Codeobj: The code object load ID assigned by ROCprofiler-SDK.
Vaddr: The ELF virtual address (vaddr) of the instruction.
Hitcount: The number of times the instruction was executed, summed over all traced waves.
Latency: Total latency in cycles, defined as “Stall time + Issue time” for gfx9 or “Stall time + Execute time” for gfx10+.
Stall: The total number of cycles the hardware pipe couldn’t issue an instruction.
This usually happens when a hardware unit is busy, for example under TCP or LDS backpressure.
Idle: The total time gap between the completion of the previous instruction and the beginning of the current instruction. The idle time can be caused by:
Arbiter loss
Source or destination register dependency
Instruction cache miss
Source: The original source line of code assigned by the compiler.
Requires compiling with debug symbols.
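The columns described above can be aggregated to find source-level hotspots. The following is a minimal Python sketch that sums Latency and Stall per Source line, using the sample rows from this page. It assumes the CSV has a header row with the column names shown above and that the Instruction field is quoted because it contains commas; neither detail is confirmed here. The function name `hotspots_by_source` is hypothetical.

```python
import csv
import io
from collections import defaultdict

# Sample rows from this page, written in an assumed CSV layout.
SAMPLE = """Codeobj,Vaddr,Instruction,Hitcount,Latency,Stall,Idle,Source
11,5888,"s_load_dwordx4 s[40:43], s[0:1], 0x18",48,276,96,48,kernel.py:391
11,5896,"s_load_dwordx2 s[38:39], s[0:1], 0x28",48,192,0,0,kernel.py:391
11,5904,"s_ashr_i32 s3, s2, 31",48,260,0,0,kernel.py:395
11,5908,"s_add_i32 s7, s2, s3",48,196,0,0,kernel.py:395
"""


def hotspots_by_source(csv_text):
    """Sum Latency and Stall cycles per source line, highest latency first."""
    totals = defaultdict(lambda: {"latency": 0, "stall": 0})
    for row in csv.DictReader(io.StringIO(csv_text)):
        entry = totals[row["Source"]]
        entry["latency"] += int(row["Latency"])
        entry["stall"] += int(row["Stall"])
    return sorted(
        totals.items(), key=lambda item: item[1]["latency"], reverse=True
    )


for source, cycles in hotspots_by_source(SAMPLE):
    print(source, cycles["latency"], cycles["stall"])
```

For the sample rows, kernel.py:391 accumulates 468 latency cycles (96 of them stalled) and kernel.py:395 accumulates 456 with no stalls. To analyze a real file, the text of a stats_*.csv file could be read and passed in place of `SAMPLE`.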
Troubleshooting#
For some applications, the stats_*.csv file could be empty even for a valid kernel dispatch.
Thread trace is limited to a single CU per SE (att-target-cu). If a kernel dispatch doesn’t launch enough waves to populate the whole GPU, it’s possible that no wave gets assigned to the target_cu. In such cases, there’s nothing to be traced.
Here are some options to handle this:
Launch more waves.
Swap the target_cu.
Set the --att-shader-engine-mask to 0x11111111, or possibly to 0xFFFFFFFF. A mask that enables too many SEs can cause packet losses and/or lead to a full buffer.
Set the HSA_CU_MASK to mask out all CUs but the target. For more details, see setting CUs. If only the target_cu (or a few CUs) is left unmasked, then all or most waves will be assigned to the target_cu. This can potentially cause low performance in highly demanding kernels.