Using pc-sampling
#
PC (Program Counter) sampling is a GPU profiling technique that periodically samples the program counter during GPU kernel execution to reveal code execution patterns and hotspots. This helps in:
- Identifying performance bottlenecks
- Understanding kernel execution behavior
- Analyzing code coverage
- Finding heavily executed code paths
To try out the PC sampling feature, you can use the rocprofv3 command-line tool or the rocprofiler SDK library on ROCm 6.4 or later.
Note
PC sampling is supported on AMD GPUs with gfx90a and later architectures. Before using the PC sampling feature, ensure that the GPU supports it.
PC Sampling availability and Configuration#
To check if the GPU supports PC sampling, use the following command:
rocprofv3 -L
OR
rocprofv3 --list-avail
The output lists whether rocprofv3 supports PC sampling on the GPU and which configurations are supported.
List available PC Sample Configurations for node_id 11
Method: ROCPROFILER_PC_SAMPLING_METHOD_HOST_TRAP
Unit: ROCPROFILER_PC_SAMPLING_UNIT_TIME
Minimum_Interval: 1
Maximum_Interval: 18446744073709551615
The above output shows that the GPU supports PC sampling with the ROCPROFILER_PC_SAMPLING_METHOD_HOST_TRAP method and the ROCPROFILER_PC_SAMPLING_UNIT_TIME unit. The minimum and maximum intervals are also displayed.
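When the check has to be scripted, the listing can be turned into a small data structure. The following is a minimal sketch, assuming the text layout matches the excerpt above exactly (one "Key: value" pair per line, with each configuration starting at a "Method:" line); the function name parse_pc_sampling_configs is hypothetical and not part of rocprofv3.

```python
def parse_pc_sampling_configs(text):
    """Collect PC sampling configurations from `rocprofv3 -L` style text.

    Assumption: each configuration starts with a "Method:" line and is
    followed by "Unit:", "Minimum_Interval:" and "Maximum_Interval:" lines,
    as in the excerpt shown in this section.
    """
    configs = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.startswith("Method:"):
            current = {"method": line.split(":", 1)[1].strip()}
            configs.append(current)
        elif current is not None and ":" in line:
            key, value = (part.strip() for part in line.split(":", 1))
            if key in ("Unit", "Minimum_Interval", "Maximum_Interval"):
                # Interval bounds are plain integers; keep other values as text.
                current[key.lower()] = int(value) if value.isdigit() else value
    return configs


# Sample text copied from the listing in this section.
sample = """\
List available PC Sample Configurations for node_id 11
Method: ROCPROFILER_PC_SAMPLING_METHOD_HOST_TRAP
Unit: ROCPROFILER_PC_SAMPLING_UNIT_TIME
Minimum_Interval: 1
Maximum_Interval: 18446744073709551615
"""

configs = parse_pc_sampling_configs(sample)
for config in configs:
    print(config)
```

An empty result would suggest that no PC sampling configuration was listed for the GPU, although the real output may contain additional lines that this sketch ignores.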
Based on the above configuration, you can use the following command to profile the application using PC sampling:
rocprofv3 --pc-sampling-beta-enabled --pc-sampling-method host_trap --pc-sampling-unit time --pc-sampling-interval 1 -- <application_path>
The above command enables PC sampling with the host_trap method, the time unit, and an interval of 1 µs (microsecond). Replace <application_path> with the path to the application you want to profile.
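If the profiler is launched from a script, the same command line can be assembled programmatically. This is a minimal sketch that only builds the argument list from the flags shown above and checks the interval against the bounds reported by the listing; it does not run rocprofv3. The helper name build_pc_sampling_command and the application path ./MatrixTranspose are hypothetical.

```python
import shlex


def build_pc_sampling_command(app_args, method="host_trap", unit="time",
                              interval=1, min_interval=1,
                              max_interval=18446744073709551615):
    """Return the rocprofv3 argument list for a PC sampling run.

    The default bounds are the Minimum_Interval and Maximum_Interval values
    from the example listing; pass the values reported for your GPU.
    """
    if not min_interval <= interval <= max_interval:
        raise ValueError(f"interval {interval} is outside the supported range")
    return [
        "rocprofv3",
        "--pc-sampling-beta-enabled",
        "--pc-sampling-method", method,
        "--pc-sampling-unit", unit,
        "--pc-sampling-interval", str(interval),
        "--",
        *app_args,
    ]


# "./MatrixTranspose" is a placeholder for <application_path>.
cmd = build_pc_sampling_command(["./MatrixTranspose"])
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd) on a machine where rocprofv3 is installed.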
This generates two files, agent_info.csv and pc_sampling_host_trap.csv. Both file names are prefixed with the process ID.
Here are the contents of the pc_sampling_host_trap.csv file generated for the MatrixTranspose sample application:
| Sample_Timestamp | Exec_Mask | Dispatch_Id | Instruction | Instruction_Comment | Correlation_Id |
|---|---|---|---|---|---|
| 3464444413017201 | 65535 | 1 | s_endpgm |  | 1 |
| 3464444413017201 | 65535 | 1 | s_waitcnt vmcnt(0) |  | 1 |
| 3464444413018481 | 65535 | 1 | s_waitcnt vmcnt(0) |  | 1 |
| 3464444413018481 | 65535 | 1 | s_endpgm |  | 1 |
| 3464444413018481 | 65535 | 1 | s_waitcnt vmcnt(0) |  | 1 |
| 3464444413018481 | 65535 | 1 | s_waitcnt vmcnt(0) |  | 1 |
| 3464444413018481 | 65535 | 1 | s_endpgm |  | 1 |
| 3464444413018481 | 65535 | 1 | s_endpgm |  | 1 |
| 3464444413019601 | 65535 | 1 | s_waitcnt vmcnt(0) |  | 1 |
| 3464444413019761 | 65535 | 1 | s_load_dword s8, s[4:5], 0x24 |  | 1 |
| 3464444413019761 | 65535 | 1 | s_waitcnt vmcnt(0) |  | 1 |
| 3464444413019761 | 65535 | 1 | s_endpgm |  | 1 |
| 3464444413019761 | 65535 | 1 | s_load_dword s8, s[4:5], 0x24 |  | 1 |
| 3464444413019761 | 65535 | 1 | s_endpgm |  | 1 |
| 3464444413019761 | 65535 | 1 | s_endpgm |  | 1 |
| 3464444413020881 | 65535 | 1 | s_endpgm |  | 1 |
| 3464444413020881 | 65535 | 1 | s_endpgm |  | 1 |
| 3464444413020881 | 65535 | 1 | s_endpgm |  | 1 |
| 3464444413020881 | 65535 | 1 | s_waitcnt lgkmcnt(0) |  | 1 |
| 3464444413020881 | 65535 | 1 | v_addc_co_u32_e32 v5, vcc, v1, v5, vcc |  | 1 |
| 3464444413020881 | 65535 | 1 | s_endpgm |  | 1 |
| 3464444413020881 | 65535 | 1 | s_waitcnt vmcnt(0) |  | 1 |
| 3464444413020881 | 65535 | 1 | s_endpgm |  | 1 |
| 3464444413020881 | 65535 | 1 | s_waitcnt vmcnt(0) |  | 1 |
| 3464444413021041 | 65535 | 1 | s_endpgm |  | 1 |
| 3464444413020881 | 65535 | 1 | v_bfe_u32 v0, v0, 10, 10 |  | 1 |
| 3464444413021041 | 65535 | 1 | s_endpgm |  | 1 |
| 3464444413021041 | 65535 | 1 | s_endpgm |  | 1 |
| 3464444413021041 | 65535 | 1 | s_waitcnt vmcnt(0) |  | 1 |
| 3464444413021041 | 65535 | 1 | s_endpgm |  | 1 |
| 3464444413021041 | 65535 | 1 | s_waitcnt vmcnt(0) |  | 1 |
| 3464444413021041 | 65535 | 1 | s_endpgm |  | 1 |
| 3464444413022001 | 65535 | 1 | s_waitcnt vmcnt(0) |  | 1 |
| 3464444413022001 | 65535 | 1 | s_endpgm |  | 1 |
| 3464444413022001 | 65535 | 1 | s_endpgm |  | 1 |
| 3464444413022001 | 65535 | 1 | s_endpgm |  | 1 |
| 3464444413022001 | 65535 | 1 | s_endpgm |  | 1 |
| 3464444413022001 | 65535 | 1 | s_waitcnt vmcnt(0) |  | 1 |
| 3464444413022001 | 65535 | 1 | s_endpgm |  | 1 |
| 3464444413022001 | 65535 | 1 | s_waitcnt vmcnt(0) |  | 1 |
| 3464444413022001 | 65535 | 1 | s_waitcnt lgkmcnt(0) |  | 1 |
| 3464444413022161 | 65535 | 1 | s_endpgm |  | 1 |
| 3464444413022161 | 65535 | 1 | s_waitcnt vmcnt(0) |  | 1 |
| 3464444413022161 | 65535 | 1 | s_endpgm |  | 1 |
| 3464444413022161 | 65535 | 1 | s_load_dword s8, s[4:5], 0x24 |  | 1 |
| 3464444413022161 | 65535 | 1 | global_store_dword v[0:1], v3, off |  | 1 |
| 3464444413022161 | 65535 | 1 | s_endpgm |  | 1 |
| 3464444413022161 | 65535 | 1 | s_endpgm |  | 1 |
| 3464444413022161 | 65535 | 1 | s_waitcnt vmcnt(0) |  | 1 |
| 3464444413022161 | 65535 | 1 | s_endpgm |  | 1 |
| 3464444413022161 | 65535 | 1 | s_endpgm |  | 1 |
| 3464444413022161 | 65535 | 1 | s_waitcnt vmcnt(0) |  | 1 |
| 3464444413022161 | 65535 | 1 | s_endpgm |  | 1 |
| 3464444413022321 | 65535 | 1 | s_load_dwordx4 s[0:3], s[4:5], 0x0 |  | 1 |
| 3464444413022161 | 65535 | 1 | s_waitcnt vmcnt(0) |  | 1 |
| 3464444413022321 | 65535 | 1 | s_endpgm |  | 1 |
| 3464444413022161 | 65535 | 1 | s_waitcnt vmcnt(0) |  | 1 |
| 3464444413023281 | 65535 | 1 | s_endpgm |  | 1 |
| 3464444413023281 | 65535 | 1 | s_endpgm |  | 1 |
| 3464444413023281 | 65535 | 1 | v_ashrrev_i32_e32 v1, 31, v0 |  | 1 |
| 3464444413024561 | 65535 | 1 | s_waitcnt vmcnt(0) |  | 1 |
| 3464444413023281 | 65535 | 1 | s_endpgm |  | 1 |
| 3464444413024561 | 65535 | 1 | s_endpgm |  | 1 |
| 3464444413023761 | 65535 | 1 | s_waitcnt vmcnt(0) |  | 1 |
| 3464444413026321 | 65535 | 1 | s_waitcnt vmcnt(0) |  | 1 |
| 3464444413024401 | 65535 | 1 | global_store_dword v[0:1], v3, off |  | 1 |
| 3464444413027121 | 65535 | 1 | s_waitcnt vmcnt(0) |  | 1 |
| 3464444413025041 | 65535 | 1 | v_add_co_u32_e32 v0, vcc, s0, v0 |  | 1 |
| 3464444413027761 | 65535 | 1 | s_waitcnt vmcnt(0) |  | 1 |
| 3464444413025361 | 65535 | 1 | s_endpgm |  | 1 |
| 3464444413027601 | 65535 | 1 | s_waitcnt vmcnt(0) |  | 1 |
| 3464444413026321 | 65535 | 1 | s_waitcnt vmcnt(0) |  | 1 |
| 3464444413028401 | 65535 | 1 | s_waitcnt vmcnt(0) |  | 1 |
| 3464444413026481 | 65535 | 1 | s_waitcnt vmcnt(0) |  | 1 |
| 3464444413028881 | 65535 | 1 | s_waitcnt vmcnt(0) |  | 1 |
| 3464444413026641 | 65535 | 1 | s_waitcnt vmcnt(0) |  | 1 |
| 3464444413028401 | 65535 | 1 | s_load_dword s8, s[4:5], 0x24 |  | 1 |
| 3464444413027281 | 65535 | 1 | s_waitcnt vmcnt(0) |  | 1 |
| 3464444413029681 | 65535 | 1 | s_endpgm |  | 1 |
For a description of the fields in the output file, see PC Sampling Fields.
Note that the Instruction_Comment field in the output above is empty. To populate this field, compile your application with debug symbols. The field then maps each sampled instruction back to its source line, which helps in understanding the code execution pattern and hotspots.
| Sample_Timestamp | Exec_Mask | Dispatch_Id | Instruction | Instruction_Comment | Correlation_Id |
|---|---|---|---|---|---|
| 54155306462675 | 65535 | 1 | s_waitcnt lgkmcnt(0) | /opt/rocm/include/hip/amd_detail/amd_hip_runtime.h:275 | 1 |
| 54155306462715 | 65535 | 1 | s_waitcnt vmcnt(0) | /opt/rocm-6.4.0/share/hip/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp:44 | 1 |
| 54155306462755 | 65535 | 1 | s_endpgm | /opt/rocm-6.4.0/share/hip/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp:45 | 1 |
| 54155306462755 | 65535 | 1 | s_endpgm | /opt/rocm-6.4.0/share/hip/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp:45 | 1 |
| 54155306462955 | 65535 | 1 | s_endpgm | /opt/rocm-6.4.0/share/hip/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp:45 | 1 |
| 54155306463035 | 65535 | 1 | s_waitcnt vmcnt(0) | /opt/rocm-6.4.0/share/hip/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp:44 | 1 |
| 54155306463235 | 65535 | 1 | s_waitcnt vmcnt(0) | /opt/rocm-6.4.0/share/hip/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp:44 | 1 |
| 54155306463315 | 65535 | 1 | s_waitcnt vmcnt(0) | /opt/rocm-6.4.0/share/hip/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp:44 | 1 |
| 54155306463515 | 65535 | 1 | s_endpgm | /opt/rocm-6.4.0/share/hip/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp:45 | 1 |
| 54155306463755 | 65535 | 1 | s_waitcnt vmcnt(0) | /opt/rocm-6.4.0/share/hip/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp:44 | 1 |
| 54155306463875 | 65535 | 1 | s_waitcnt vmcnt(0) | /opt/rocm-6.4.0/share/hip/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp:44 | 1 |
| 54155306464075 | 65535 | 1 | v_mov_b32_e32 v2, s4 | /opt/rocm/include/hip/amd_detail/amd_hip_runtime.h:275 | 1 |
| 54155306464155 | 65535 | 1 | s_waitcnt vmcnt(0) | /opt/rocm-6.4.0/share/hip/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp:44 | 1 |
| 54155306464155 | 65535 | 1 | s_waitcnt vmcnt(0) | /opt/rocm-6.4.0/share/hip/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp:44 | 1 |
| 54155306464275 | 65535 | 1 | s_endpgm | /opt/rocm-6.4.0/share/hip/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp:45 | 1 |
| 54155306464395 | 65535 | 1 | s_waitcnt vmcnt(0) | /opt/rocm-6.4.0/share/hip/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp:44 | 1 |
| 54155306464515 | 65535 | 1 | s_waitcnt lgkmcnt(0) | /opt/rocm/include/hip/amd_detail/amd_hip_runtime.h:275 | 1 |
| 54155306464555 | 65535 | 1 | s_waitcnt vmcnt(0) | /opt/rocm-6.4.0/share/hip/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp:44 | 1 |
| 54155306464595 | 65535 | 1 | s_waitcnt vmcnt(0) | /opt/rocm-6.4.0/share/hip/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp:44 | 1 |
| 54155306464595 | 65535 | 1 | v_mov_b32_e32 v2, s6 | /opt/rocm/include/hip/amd_detail/amd_hip_runtime.h:275 | 1 |
| 54155306464595 | 65535 | 1 | s_waitcnt lgkmcnt(0) | /opt/rocm/include/hip/amd_detail/amd_hip_runtime.h:275 | 1 |
The above output shows the Instruction_Comment field populated with the source line information.
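With source lines available, the samples can be aggregated to see where the kernel spends its time. The following is a minimal sketch that counts samples per instruction or per source line. It assumes the CSV file starts with a header row using the column names shown above and that fields containing commas are quoted; the helper name count_samples is hypothetical, and the embedded rows are the first five rows of the table above.

```python
import csv
import io
from collections import Counter


def count_samples(csv_text, column):
    """Count PC samples per distinct value of `column`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[column] for row in reader)


src = "/opt/rocm-6.4.0/share/hip/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp"
hdr = "/opt/rocm/include/hip/amd_detail/amd_hip_runtime.h"

# A few rows from the table above, written as CSV with quoted text fields.
csv_text = (
    '"Sample_Timestamp","Exec_Mask","Dispatch_Id","Instruction",'
    '"Instruction_Comment","Correlation_Id"\n'
    f'54155306462675,65535,1,"s_waitcnt lgkmcnt(0)","{hdr}:275",1\n'
    f'54155306462715,65535,1,"s_waitcnt vmcnt(0)","{src}:44",1\n'
    f'54155306462755,65535,1,"s_endpgm","{src}:45",1\n'
    f'54155306462755,65535,1,"s_endpgm","{src}:45",1\n'
    f'54155306462955,65535,1,"s_endpgm","{src}:45",1\n'
)

by_instruction = count_samples(csv_text, "Instruction")
by_source_line = count_samples(csv_text, "Instruction_Comment")

# Print the most frequently sampled instructions and source lines first.
for name, hits in by_instruction.most_common():
    print(f"{hits:4d}  {name}")
for location, hits in by_source_line.most_common():
    print(f"{hits:4d}  {location}")
```

For a real run, read the text from the generated pc_sampling_host_trap.csv file instead of the embedded string. A higher sample count suggests that more wavefronts were observed at that instruction or source line.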
PC Sampling Fields:#
The output file generated by PC sampling contains the following fields:
- Sample_Timestamp: Timestamp when the sample is generated
- Exec_Mask: Active SIMD lanes when sampled
- Dispatch_Id: Originating kernel dispatch ID
- Instruction: Assembly instruction, for example s_load_dword s8, s[1:2], 0x10
- Instruction_Comment: Instruction comment (maps back to the source line if debug symbols were enabled when the application was compiled)
- Correlation_Id: API launch call ID that matches the dispatch ID
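Because Exec_Mask is a bitmask with one bit per lane, the number of active lanes in a sample is the number of set bits. The sketch below illustrates this with the two mask values that appear in this section (65535 in the CSV example and 18446744073709551615 in the JSON example); the helper name active_lanes is hypothetical.

```python
def active_lanes(exec_mask):
    """Return the number of set bits in an execution mask."""
    return bin(int(exec_mask)).count("1")


csv_mask = 65535                    # value from the CSV example
json_mask = 18446744073709551615    # value from the JSON example

print(active_lanes(csv_mask))
print(active_lanes(json_mask))
```

A sample whose lane count is well below the wavefront size may point at divergent control flow at that instruction.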
By default, the output file is in CSV format. To dump samples in a more comprehensive format, use JSON by specifying --output-format json.
rocprofv3 --pc-sampling-beta-enabled --pc-sampling-method host_trap --pc-sampling-unit time --pc-sampling-interval 1 --output-format json -- <application_path>
This will generate a JSON file with the comprehensive output. Here is a trimmed down output with multiple records:
{
    "pc_sample_host_trap": [
        {
            "record": {
                "hw_id": {
                    "chiplet": 0,
                    "wave_id": 0,
                    "simd_id": 2,
                    "pipe_id": 0,
                    "cu_or_wgp_id": 1,
                    "shader_array_id": 0,
                    "shader_engine_id": 2,
                    "workgroup_id": 0,
                    "vm_id": 3,
                    "queue_id": 2,
                    "microengine_id": 1
                },
                "pc": {
                    "code_object_id": 1,
                    "code_object_offset": 20228
                },
                "exec_mask": 18446744073709551615,
                "timestamp": 51040126667689,
                "dispatch_id": 1,
                "corr_id": {
                    "internal": 1,
                    "external": 0
                },
                "wrkgrp_id": {
                    "x": 182,
                    "y": 0,
                    "z": 0
                },
                "wave_in_grp": 1
            },
            "inst_index": 0
        },
        {
            "record": {
                "hw_id": {
                    "chiplet": 0,
                    "wave_id": 0,
                    "simd_id": 2,
                    "pipe_id": 0,
                    "cu_or_wgp_id": 0,
                    "shader_array_id": 0,
                    "shader_engine_id": 2,
                    "workgroup_id": 0,
                    "vm_id": 3,
                    "queue_id": 2,
                    "microengine_id": 1
                },
                "pc": {
                    "code_object_id": 1,
                    "code_object_offset": 20236
                },
                "exec_mask": 18446744073709551615,
                "timestamp": 51040126667689,
                "dispatch_id": 1,
                "corr_id": {
                    "internal": 1,
                    "external": 0
                },
                "wrkgrp_id": {
                    "x": 158,
                    "y": 0,
                    "z": 0
                },
                "wave_in_grp": 2
            },
            "inst_index": 1
        }
    ]
}
The description of the fields in the JSON output is available in the Output file fields.
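The JSON records can be post-processed in the same way as the CSV rows. The following is a minimal sketch that counts samples per program counter location, keyed by code_object_id and code_object_offset. It assumes the record layout of the trimmed excerpt above; in a complete output file the pc_sample_host_trap list may be nested under additional keys, so adjust the lookup accordingly. The helper name samples_per_pc is hypothetical.

```python
import json
from collections import Counter


def samples_per_pc(records):
    """Count samples per (code_object_id, code_object_offset) pair."""
    counts = Counter()
    for entry in records:
        pc = entry["record"]["pc"]
        counts[(pc["code_object_id"], pc["code_object_offset"])] += 1
    return counts


# Two records reduced to the fields used here, taken from the excerpt above.
trimmed = json.loads("""
{
    "pc_sample_host_trap": [
        {"record": {"pc": {"code_object_id": 1, "code_object_offset": 20228},
                    "dispatch_id": 1},
         "inst_index": 0},
        {"record": {"pc": {"code_object_id": 1, "code_object_offset": 20236},
                    "dispatch_id": 1},
         "inst_index": 1}
    ]
}
""")

counts = samples_per_pc(trimmed["pc_sample_host_trap"])
for (code_object_id, offset), hits in counts.most_common():
    print(f"code object {code_object_id}, offset {offset}: {hits} sample(s)")
```

For a real run, load the generated file with json.load instead of parsing the embedded string.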