Using pc-sampling#

PC (program counter) sampling is a GPU profiling technique that periodically samples the program counter during kernel execution to reveal code execution patterns and hotspots. It helps with:

  • Identifying performance bottlenecks

  • Understanding kernel execution behavior

  • Analyzing code coverage

  • Finding heavily executed code paths

To try out the PC sampling feature, you can use the rocprofv3 command-line tool or the rocprofiler SDK library on ROCm 6.4 or later.

Note

PC sampling is supported on AMD GPUs with gfx90a and later architectures. Before using the PC sampling feature, ensure that the GPU supports it.

PC Sampling availability and Configuration#

To check if the GPU supports PC sampling, use the following command:

rocprofv3 -L

OR

rocprofv3 --list-avail

The output shows whether rocprofv3 supports PC sampling on the GPU and which configurations are supported:

List available PC Sample Configurations for node_id     11
Method: ROCPROFILER_PC_SAMPLING_METHOD_HOST_TRAP
Unit:   ROCPROFILER_PC_SAMPLING_UNIT_TIME
Minimum_Interval:       1
Maximum_Interval:       18446744073709551615

The above output shows that the GPU supports PC sampling with the ROCPROFILER_PC_SAMPLING_METHOD_HOST_TRAP method and the ROCPROFILER_PC_SAMPLING_UNIT_TIME unit. The minimum and maximum intervals are also displayed.
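The listing above is plain text, so it can be turned into structured data before choosing command-line values. The following is a minimal sketch, assuming each configuration block has exactly the layout shown above (one Method, Unit, Minimum_Interval, and Maximum_Interval per block). The helper names `parse_pc_sampling_listing` and `to_cli_value` are hypothetical, and the enum-to-option mapping is only inferred from the single HOST_TRAP/TIME example in this section.

```python
import re

# Sample text copied from the listing above.
SAMPLE_LISTING = """\
List available PC Sample Configurations for node_id     11
Method: ROCPROFILER_PC_SAMPLING_METHOD_HOST_TRAP
Unit:   ROCPROFILER_PC_SAMPLING_UNIT_TIME
Minimum_Interval:       1
Maximum_Interval:       18446744073709551615
"""

HEADER = re.compile(r"List available PC Sample Configurations for node_id\s+(\d+)")


def parse_pc_sampling_listing(text):
    """Return one dict per configuration block found in the listing text."""
    configs = []
    current = None
    for line in text.splitlines():
        header = HEADER.match(line)
        if header:
            current = {"node_id": int(header.group(1))}
            configs.append(current)
            continue
        if current is None or ":" not in line:
            continue  # skip lines outside a block or without "key: value"
        key, value = (part.strip() for part in line.split(":", 1))
        current[key] = int(value) if value.isdigit() else value
    return configs


def to_cli_value(enum_name, prefix):
    """Assumed mapping: strip the enum prefix and lowercase the remainder."""
    if enum_name.startswith(prefix):
        return enum_name[len(prefix):].lower()
    return enum_name


configs = parse_pc_sampling_listing(SAMPLE_LISTING)
cfg = configs[0]
method = to_cli_value(cfg["Method"], "ROCPROFILER_PC_SAMPLING_METHOD_")
unit = to_cli_value(cfg["Unit"], "ROCPROFILER_PC_SAMPLING_UNIT_")
print(cfg["node_id"], method, unit, cfg["Minimum_Interval"], cfg["Maximum_Interval"])
```

If the real listing contains several blocks (for example, one per GPU), the function returns one dictionary for each of them.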

Based on the above configuration, you can use the following command to profile the application using PC sampling:

rocprofv3 --pc-sampling-beta-enabled --pc-sampling-method host_trap --pc-sampling-unit time --pc-sampling-interval 1 -- <application_path>

The above command enables PC sampling with the host_trap method, the time unit, and an interval of 1 µs (microsecond). Replace <application_path> with the path to the application you want to profile.
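When profiling runs are scripted, it can help to assemble the command from the values reported by the availability listing and to reject intervals outside the supported range before launching. The sketch below only builds and prints the argument list; it does not run rocprofv3. The function name `build_pc_sampling_command` and the application path `./MatrixTranspose` are hypothetical, and the default bounds are the ones from the sample listing in this section.

```python
import shlex


def build_pc_sampling_command(app_args, method="host_trap", unit="time",
                              interval=1, min_interval=1,
                              max_interval=18446744073709551615):
    """Build the rocprofv3 argument list for a PC sampling run.

    Only flags that appear in this section are used. The interval is checked
    against the minimum and maximum reported by `rocprofv3 --list-avail`.
    """
    if not min_interval <= interval <= max_interval:
        raise ValueError(
            f"interval {interval} is outside [{min_interval}, {max_interval}]"
        )
    return [
        "rocprofv3",
        "--pc-sampling-beta-enabled",
        "--pc-sampling-method", method,
        "--pc-sampling-unit", unit,
        "--pc-sampling-interval", str(interval),
        "--",
        *app_args,
    ]


# "./MatrixTranspose" is a placeholder path for the application to profile.
cmd = build_pc_sampling_command(["./MatrixTranspose"])
command_line = shlex.join(cmd)
print(command_line)
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where rocprofv3 is installed.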

This generates two files, agent_info.csv and pc_sampling_host_trap.csv, each prefixed with the process ID. The following output is from PC sampling of the MatrixTranspose sample application.

Here are the contents of the pc_sampling_host_trap.csv file:

Table 15 PC sampling host trap#

| Sample_Timestamp | Exec_Mask | Dispatch_Id | Instruction | Instruction_Comment | Correlation_Id |
|---|---|---|---|---|---|
| 3464444413017201 | 65535 | 1 | s_endpgm | | 1 |
| 3464444413017201 | 65535 | 1 | s_waitcnt vmcnt(0) | | 1 |
| 3464444413018481 | 65535 | 1 | s_waitcnt vmcnt(0) | | 1 |
| 3464444413018481 | 65535 | 1 | s_endpgm | | 1 |
| 3464444413018481 | 65535 | 1 | s_waitcnt vmcnt(0) | | 1 |
| 3464444413018481 | 65535 | 1 | s_waitcnt vmcnt(0) | | 1 |
| 3464444413018481 | 65535 | 1 | s_endpgm | | 1 |
| 3464444413018481 | 65535 | 1 | s_endpgm | | 1 |
| 3464444413019601 | 65535 | 1 | s_waitcnt vmcnt(0) | | 1 |
| 3464444413019761 | 65535 | 1 | s_load_dword s8, s[4:5], 0x24 | | 1 |
| 3464444413019761 | 65535 | 1 | s_waitcnt vmcnt(0) | | 1 |
| 3464444413019761 | 65535 | 1 | s_endpgm | | 1 |
| 3464444413019761 | 65535 | 1 | s_load_dword s8, s[4:5], 0x24 | | 1 |
| 3464444413019761 | 65535 | 1 | s_endpgm | | 1 |
| 3464444413019761 | 65535 | 1 | s_endpgm | | 1 |
| 3464444413020881 | 65535 | 1 | s_endpgm | | 1 |
| 3464444413020881 | 65535 | 1 | s_endpgm | | 1 |
| 3464444413020881 | 65535 | 1 | s_endpgm | | 1 |
| 3464444413020881 | 65535 | 1 | s_waitcnt lgkmcnt(0) | | 1 |
| 3464444413020881 | 65535 | 1 | v_addc_co_u32_e32 v5, vcc, v1, v5, vcc | | 1 |
| 3464444413020881 | 65535 | 1 | s_endpgm | | 1 |
| 3464444413020881 | 65535 | 1 | s_waitcnt vmcnt(0) | | 1 |
| 3464444413020881 | 65535 | 1 | s_endpgm | | 1 |
| 3464444413020881 | 65535 | 1 | s_waitcnt vmcnt(0) | | 1 |
| 3464444413021041 | 65535 | 1 | s_endpgm | | 1 |
| 3464444413020881 | 65535 | 1 | v_bfe_u32 v0, v0, 10, 10 | | 1 |
| 3464444413021041 | 65535 | 1 | s_endpgm | | 1 |
| 3464444413021041 | 65535 | 1 | s_endpgm | | 1 |
| 3464444413021041 | 65535 | 1 | s_waitcnt vmcnt(0) | | 1 |
| 3464444413021041 | 65535 | 1 | s_endpgm | | 1 |
| 3464444413021041 | 65535 | 1 | s_waitcnt vmcnt(0) | | 1 |
| 3464444413021041 | 65535 | 1 | s_endpgm | | 1 |
| 3464444413022001 | 65535 | 1 | s_waitcnt vmcnt(0) | | 1 |
| 3464444413022001 | 65535 | 1 | s_endpgm | | 1 |
| 3464444413022001 | 65535 | 1 | s_endpgm | | 1 |
| 3464444413022001 | 65535 | 1 | s_endpgm | | 1 |
| 3464444413022001 | 65535 | 1 | s_endpgm | | 1 |
| 3464444413022001 | 65535 | 1 | s_waitcnt vmcnt(0) | | 1 |
| 3464444413022001 | 65535 | 1 | s_endpgm | | 1 |
| 3464444413022001 | 65535 | 1 | s_waitcnt vmcnt(0) | | 1 |
| 3464444413022001 | 65535 | 1 | s_waitcnt lgkmcnt(0) | | 1 |
| 3464444413022161 | 65535 | 1 | s_endpgm | | 1 |
| 3464444413022161 | 65535 | 1 | s_waitcnt vmcnt(0) | | 1 |
| 3464444413022161 | 65535 | 1 | s_endpgm | | 1 |
| 3464444413022161 | 65535 | 1 | s_load_dword s8, s[4:5], 0x24 | | 1 |
| 3464444413022161 | 65535 | 1 | global_store_dword v[0:1], v3, off | | 1 |
| 3464444413022161 | 65535 | 1 | s_endpgm | | 1 |
| 3464444413022161 | 65535 | 1 | s_endpgm | | 1 |
| 3464444413022161 | 65535 | 1 | s_waitcnt vmcnt(0) | | 1 |
| 3464444413022161 | 65535 | 1 | s_endpgm | | 1 |
| 3464444413022161 | 65535 | 1 | s_endpgm | | 1 |
| 3464444413022161 | 65535 | 1 | s_waitcnt vmcnt(0) | | 1 |
| 3464444413022161 | 65535 | 1 | s_endpgm | | 1 |
| 3464444413022321 | 65535 | 1 | s_load_dwordx4 s[0:3], s[4:5], 0x0 | | 1 |
| 3464444413022161 | 65535 | 1 | s_waitcnt vmcnt(0) | | 1 |
| 3464444413022321 | 65535 | 1 | s_endpgm | | 1 |
| 3464444413022161 | 65535 | 1 | s_waitcnt vmcnt(0) | | 1 |
| 3464444413023281 | 65535 | 1 | s_endpgm | | 1 |
| 3464444413023281 | 65535 | 1 | s_endpgm | | 1 |
| 3464444413023281 | 65535 | 1 | v_ashrrev_i32_e32 v1, 31, v0 | | 1 |
| 3464444413024561 | 65535 | 1 | s_waitcnt vmcnt(0) | | 1 |
| 3464444413023281 | 65535 | 1 | s_endpgm | | 1 |
| 3464444413024561 | 65535 | 1 | s_endpgm | | 1 |
| 3464444413023761 | 65535 | 1 | s_waitcnt vmcnt(0) | | 1 |
| 3464444413026321 | 65535 | 1 | s_waitcnt vmcnt(0) | | 1 |
| 3464444413024401 | 65535 | 1 | global_store_dword v[0:1], v3, off | | 1 |
| 3464444413027121 | 65535 | 1 | s_waitcnt vmcnt(0) | | 1 |
| 3464444413025041 | 65535 | 1 | v_add_co_u32_e32 v0, vcc, s0, v0 | | 1 |
| 3464444413027761 | 65535 | 1 | s_waitcnt vmcnt(0) | | 1 |
| 3464444413025361 | 65535 | 1 | s_endpgm | | 1 |
| 3464444413027601 | 65535 | 1 | s_waitcnt vmcnt(0) | | 1 |
| 3464444413026321 | 65535 | 1 | s_waitcnt vmcnt(0) | | 1 |
| 3464444413028401 | 65535 | 1 | s_waitcnt vmcnt(0) | | 1 |
| 3464444413026481 | 65535 | 1 | s_waitcnt vmcnt(0) | | 1 |
| 3464444413028881 | 65535 | 1 | s_waitcnt vmcnt(0) | | 1 |
| 3464444413026641 | 65535 | 1 | s_waitcnt vmcnt(0) | | 1 |
| 3464444413028401 | 65535 | 1 | s_load_dword s8, s[4:5], 0x24 | | 1 |
| 3464444413027281 | 65535 | 1 | s_waitcnt vmcnt(0) | | 1 |
| 3464444413029681 | 65535 | 1 | s_endpgm | | 1 |

For a description of the fields in the output file, see PC Sampling Fields.

Note that the Instruction_Comment field in the output above is empty. To populate this field, compile your application with debug symbols. The field then maps each sampled instruction back to its source line, which helps in understanding code execution patterns and hotspots.

Table 16 PC sampling host trap with debug symbols#

| Sample_Timestamp | Exec_Mask | Dispatch_Id | Instruction | Instruction_Comment | Correlation_Id |
|---|---|---|---|---|---|
| 54155306462675 | 65535 | 1 | s_waitcnt lgkmcnt(0) | /opt/rocm/include/hip/amd_detail/amd_hip_runtime.h:275 | 1 |
| 54155306462715 | 65535 | 1 | s_waitcnt vmcnt(0) | /opt/rocm-6.4.0/share/hip/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp:44 | 1 |
| 54155306462755 | 65535 | 1 | s_endpgm | /opt/rocm-6.4.0/share/hip/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp:45 | 1 |
| 54155306462755 | 65535 | 1 | s_endpgm | /opt/rocm-6.4.0/share/hip/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp:45 | 1 |
| 54155306462955 | 65535 | 1 | s_endpgm | /opt/rocm-6.4.0/share/hip/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp:45 | 1 |
| 54155306463035 | 65535 | 1 | s_waitcnt vmcnt(0) | /opt/rocm-6.4.0/share/hip/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp:44 | 1 |
| 54155306463235 | 65535 | 1 | s_waitcnt vmcnt(0) | /opt/rocm-6.4.0/share/hip/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp:44 | 1 |
| 54155306463315 | 65535 | 1 | s_waitcnt vmcnt(0) | /opt/rocm-6.4.0/share/hip/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp:44 | 1 |
| 54155306463515 | 65535 | 1 | s_endpgm | /opt/rocm-6.4.0/share/hip/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp:45 | 1 |
| 54155306463755 | 65535 | 1 | s_waitcnt vmcnt(0) | /opt/rocm-6.4.0/share/hip/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp:44 | 1 |
| 54155306463875 | 65535 | 1 | s_waitcnt vmcnt(0) | /opt/rocm-6.4.0/share/hip/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp:44 | 1 |
| 54155306464075 | 65535 | 1 | v_mov_b32_e32 v2, s4 | /opt/rocm/include/hip/amd_detail/amd_hip_runtime.h:275 | 1 |
| 54155306464155 | 65535 | 1 | s_waitcnt vmcnt(0) | /opt/rocm-6.4.0/share/hip/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp:44 | 1 |
| 54155306464155 | 65535 | 1 | s_waitcnt vmcnt(0) | /opt/rocm-6.4.0/share/hip/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp:44 | 1 |
| 54155306464275 | 65535 | 1 | s_endpgm | /opt/rocm-6.4.0/share/hip/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp:45 | 1 |
| 54155306464395 | 65535 | 1 | s_waitcnt vmcnt(0) | /opt/rocm-6.4.0/share/hip/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp:44 | 1 |
| 54155306464515 | 65535 | 1 | s_waitcnt lgkmcnt(0) | /opt/rocm/include/hip/amd_detail/amd_hip_runtime.h:275 | 1 |
| 54155306464555 | 65535 | 1 | s_waitcnt vmcnt(0) | /opt/rocm-6.4.0/share/hip/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp:44 | 1 |
| 54155306464595 | 65535 | 1 | s_waitcnt vmcnt(0) | /opt/rocm-6.4.0/share/hip/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp:44 | 1 |
| 54155306464595 | 65535 | 1 | v_mov_b32_e32 v2, s6 | /opt/rocm/include/hip/amd_detail/amd_hip_runtime.h:275 | 1 |
| 54155306464595 | 65535 | 1 | s_waitcnt lgkmcnt(0) | /opt/rocm/include/hip/amd_detail/amd_hip_runtime.h:275 | 1 |

The above output shows the Instruction_Comment field populated with the source line information.
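Once Instruction_Comment is populated, counting samples per source location gives a rough hotspot ranking. The following is a minimal sketch, assuming the CSV file has a header row whose column names match the field names in the table above. The header row and the quoting style of the embedded sample are assumptions, and `samples_per_source_line` is a hypothetical helper name. The five data rows are copied from the debug-symbol output above.

```python
import csv
import io
from collections import Counter

# Header row and quoting are assumptions about the file layout.
SAMPLE_CSV = """\
"Sample_Timestamp","Exec_Mask","Dispatch_Id","Instruction","Instruction_Comment","Correlation_Id"
54155306462675,65535,1,"s_waitcnt lgkmcnt(0)","/opt/rocm/include/hip/amd_detail/amd_hip_runtime.h:275",1
54155306462715,65535,1,"s_waitcnt vmcnt(0)","/opt/rocm-6.4.0/share/hip/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp:44",1
54155306462755,65535,1,"s_endpgm","/opt/rocm-6.4.0/share/hip/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp:45",1
54155306462755,65535,1,"s_endpgm","/opt/rocm-6.4.0/share/hip/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp:45",1
54155306464075,65535,1,"v_mov_b32_e32 v2, s4","/opt/rocm/include/hip/amd_detail/amd_hip_runtime.h:275",1
"""


def samples_per_source_line(csv_file):
    """Count PC samples per Instruction_Comment value (source file:line)."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        comment = row["Instruction_Comment"].strip()
        # Rows without debug information are grouped under one label.
        counts[comment or "<no debug info>"] += 1
    return counts


hotspots = samples_per_source_line(io.StringIO(SAMPLE_CSV))
for location, count in hotspots.most_common():
    # Print only the file name and line number to keep the report short.
    print(f"{count:3d}  {location.rsplit('/', 1)[-1]}")
```

To analyze a real run, pass an open file handle for the generated pc_sampling_host_trap.csv file instead of the `io.StringIO` wrapper.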

PC Sampling Fields:#

The output file generated by PC sampling contains the following fields:

  • Sample_Timestamp: Timestamp at which the sample was generated

  • Exec_Mask: Mask of the SIMD lanes that were active when the sample was taken

  • Dispatch_Id: ID of the kernel dispatch from which the sample originated

  • Instruction: Sampled assembly instruction, for example: s_load_dword s8, s[1:2], 0x10

  • Instruction_Comment: Instruction comment (maps back to the source line if debug symbols were enabled when the application was compiled)

  • Correlation_Id: ID of the API launch call that matches the dispatch ID
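The Exec_Mask value is reported as a decimal integer. A short sketch of decoding it follows, assuming that bit i of the mask corresponds to lane i of the wavefront; that bit layout is an assumption, not something stated in this section. The function names are hypothetical. The two masks used are the values that appear in the CSV and JSON samples on this page.

```python
def active_lane_count(exec_mask):
    """Return the number of set bits, i.e. active lanes, in the mask."""
    return bin(exec_mask).count("1")


def active_lanes(exec_mask):
    """Return the indices of the set bits (assumed: bit i is lane i)."""
    return [lane for lane in range(exec_mask.bit_length())
            if (exec_mask >> lane) & 1]


csv_mask = 65535                   # Exec_Mask value from the CSV tables
json_mask = 18446744073709551615   # exec_mask value from the JSON sample

print(active_lane_count(csv_mask), hex(csv_mask))
print(active_lane_count(json_mask), hex(json_mask))
print(active_lanes(0b0101))
```

A mask with fewer set bits than the others is a hint that the sampled wavefront was executing with some lanes disabled, for example inside a divergent branch.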

By default, the output file is in CSV format. To dump samples in a more comprehensive format, use JSON by specifying --output-format json:

rocprofv3 --pc-sampling-beta-enabled --pc-sampling-method host_trap --pc-sampling-unit time --pc-sampling-interval 1 --output-format json -- <application_path>

This generates a JSON file with the comprehensive output. Here is a trimmed-down output with multiple records:

{
  "pc_sample_host_trap": [
    {
      "record": {
        "hw_id": {
          "chiplet": 0,
          "wave_id": 0,
          "simd_id": 2,
          "pipe_id": 0,
          "cu_or_wgp_id": 1,
          "shader_array_id": 0,
          "shader_engine_id": 2,
          "workgroup_id": 0,
          "vm_id": 3,
          "queue_id": 2,
          "microengine_id": 1
        },
        "pc": {
          "code_object_id": 1,
          "code_object_offset": 20228
        },
        "exec_mask": 18446744073709551615,
        "timestamp": 51040126667689,
        "dispatch_id": 1,
        "corr_id": {
          "internal": 1,
          "external": 0
        },
        "wrkgrp_id": {
          "x": 182,
          "y": 0,
          "z": 0
        },
        "wave_in_grp": 1
      },
      "inst_index": 0
    },
    {
      "record": {
        "hw_id": {
          "chiplet": 0,
          "wave_id": 0,
          "simd_id": 2,
          "pipe_id": 0,
          "cu_or_wgp_id": 0,
          "shader_array_id": 0,
          "shader_engine_id": 2,
          "workgroup_id": 0,
          "vm_id": 3,
          "queue_id": 2,
          "microengine_id": 1
        },
        "pc": {
          "code_object_id": 1,
          "code_object_offset": 20236
        },
        "exec_mask": 18446744073709551615,
        "timestamp": 51040126667689,
        "dispatch_id": 1,
        "corr_id": {
          "internal": 1,
          "external": 0
        },
        "wrkgrp_id": {
          "x": 158,
          "y": 0,
          "z": 0
        },
        "wave_in_grp": 2
      },
      "inst_index": 1
    }
  ]
}

For a description of the fields in the JSON output, see Output file fields.
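The JSON records can be aggregated by program counter to see which code offsets were sampled most often. The sketch below assumes the trimmed layout shown above, with a top-level "pc_sample_host_trap" array; a full output file may nest this array inside other objects, in which case the lookup needs adjusting. `samples_per_pc` is a hypothetical helper name, and the embedded sample keeps only a subset of the fields from the two records above.

```python
import json
from collections import Counter

# Reduced copy of the two records shown above (hw_id and corr_id omitted).
SAMPLE_JSON = """
{
  "pc_sample_host_trap": [
    {
      "record": {
        "pc": {"code_object_id": 1, "code_object_offset": 20228},
        "exec_mask": 18446744073709551615,
        "timestamp": 51040126667689,
        "dispatch_id": 1,
        "wrkgrp_id": {"x": 182, "y": 0, "z": 0},
        "wave_in_grp": 1
      },
      "inst_index": 0
    },
    {
      "record": {
        "pc": {"code_object_id": 1, "code_object_offset": 20236},
        "exec_mask": 18446744073709551615,
        "timestamp": 51040126667689,
        "dispatch_id": 1,
        "wrkgrp_id": {"x": 158, "y": 0, "z": 0},
        "wave_in_grp": 2
      },
      "inst_index": 1
    }
  ]
}
"""


def samples_per_pc(document):
    """Count samples per (code_object_id, code_object_offset) pair."""
    counts = Counter()
    for sample in document.get("pc_sample_host_trap", []):
        pc = sample["record"]["pc"]
        counts[(pc["code_object_id"], pc["code_object_offset"])] += 1
    return counts


document = json.loads(SAMPLE_JSON)
pc_counts = samples_per_pc(document)
for (code_object_id, offset), count in sorted(pc_counts.items()):
    print(f"code object {code_object_id}, offset {offset:#x}: {count} sample(s)")
```

Python integers are arbitrary precision, so the 64-bit exec_mask value survives `json.loads` without loss; parsers that store numbers as doubles would round it.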