Quickstart#
This guide explains how to get started with ROCm Compute Profiler. It covers the steps required to profile GPU workloads and analyze the resulting performance data to identify bottlenecks and optimize applications.
Using the tool involves two main phases: profiling and analysis.
Prerequisites#
Ensure that ROCm is installed, then complete the following steps:
Check the GPU and driver.
amd-smi   # Monitor GPU health, temperature, utilization
rocminfo  # Display ROCm platform and GPU properties
If these commands fail:
Verify that the GPU driver is loaded:
lsmod | grep amdgpu
Load the driver if needed:
sudo modprobe amdgpu
Verify that the device nodes exist:
ls /dev/kfd /dev/dri
Ensure that the user is a member of the render and video groups:
sudo usermod -aG render,video $USER
# Log out and back in for changes to take effect
If the rocminfo or amd-smi commands are not found, set the ROCm environment variables:
export PATH=/opt/rocm/bin:$PATH
export LD_LIBRARY_PATH=/opt/rocm/lib:${LD_LIBRARY_PATH}
Check the Python environment.
python3 --version # Requires Python 3.8+
Install the required Python dependencies.
pip install -r <ROCM_PATH>/libexec/rocprofiler-compute/requirements.txt
**Note:** Replace ``<ROCM_PATH>`` with the ROCm installation path (e.g., ``/opt/rocm`` or ``/opt/rocm-7.3.0``).
For detailed installation instructions, refer to Installing and deploying ROCm Compute Profiler.
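The checks above can be bundled into one preflight function. The following is a minimal sketch, not part of ROCm: the function names (`have_cmd`, `in_group`, `version_at_least`, `preflight`) are hypothetical, and it only reports problems without attempting to fix them.

```shell
#!/bin/sh
# Preflight sketch for the prerequisite checks above.
# All function names are illustrative and not part of ROCm.

# Succeeds if the named command is on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Succeeds if the current user belongs to the named group.
in_group() {
    id -nG | tr ' ' '\n' | grep -qxF -- "$1"
}

# Succeeds if version $1 is at least version $2 (both given as MAJOR.MINOR).
version_at_least() {
    have_major=${1%%.*}; have_minor=${1#*.}; have_minor=${have_minor%%.*}
    need_major=${2%%.*}; need_minor=${2#*.}; need_minor=${need_minor%%.*}
    [ "$have_major" -gt "$need_major" ] ||
        { [ "$have_major" -eq "$need_major" ] && [ "$have_minor" -ge "$need_minor" ]; }
}

# Runs every check, prints one line per problem, returns non-zero if any failed.
preflight() {
    status=0
    for cmd in amd-smi rocminfo python3; do
        have_cmd "$cmd" || { echo "missing command: $cmd"; status=1; }
    done
    for node in /dev/kfd /dev/dri; do
        [ -e "$node" ] || { echo "missing device node: $node"; status=1; }
    done
    for grp in render video; do
        in_group "$grp" || { echo "user not in group: $grp"; status=1; }
    done
    if have_cmd python3; then
        pyver=$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])')
        version_at_least "$pyver" 3.8 || { echo "Python $pyver is older than 3.8"; status=1; }
    fi
    return "$status"
}
```

To try it, save the block to a file (for example `preflight.sh`, a hypothetical name), source it with `. ./preflight.sh`, and run `preflight`. No output and a zero exit status mean that all checks passed.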
Profiling#
Profiling is the process of collecting performance counters from a GPU application during execution. ROCm Compute Profiler captures detailed metrics regarding kernel execution, memory usage, roofline analysis, and hardware utilization to facilitate performance understanding and optimization.
The following examples reference sample applications located in the samples folder of the GitHub repository: ROCm/rocm-systems
Compile the HIP sample: build vcopy.cpp into an executable named vcopy.
hipcc vcopy.cpp -o vcopy
Profile Command:
rocprof-compute profile --name <workload_name> [profile options] [roofline options] -- <workload_cmd>
Example:
rocprof-compute profile --name vcopy -- ./vcopy -n 1048576 -b 256
Explanation:
- ``rocprof-compute profile``: Starts a profiling session for a compute workload.
- ``--name vcopy``: Labels this run as ‘vcopy’ for easy identification and comparison.
- ``--``: Separates rocprof-compute options from the application arguments.
- ``./vcopy -n 1048576 -b 256``: Executes the application with the following parameters:
  - ``-n 1048576``: Number of elements.
  - ``-b 256``: Block size (threads per block).
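Because each run is labeled with `--name`, the same command can be repeated for several problem sizes under distinct names. The sketch below assumes a naming scheme of `vcopy_n<size>` and a `DRY_RUN` switch, both of which are illustrative; it uses only the options shown in the example above.

```shell
#!/bin/sh
# Sketch: profile the vcopy sample at several problem sizes.
# DRY_RUN, run, and profile_sweep are hypothetical helpers, not ROCm features.

# Print the command when DRY_RUN=1 (the default here), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: profile_sweep <application> <size> [<size> ...]
profile_sweep() {
    app=$1
    shift
    for n in "$@"; do
        # One workload per size, e.g. vcopy_n1024 (naming is an assumption).
        run rocprof-compute profile --name "vcopy_n${n}" -- "$app" -n "$n" -b 256
    done
}
```

Calling `profile_sweep ./vcopy 1024 1048576` prints the two commands that would run. Setting `DRY_RUN=0` executes them instead.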
What happens during profiling?#
The application runs multiple times during profiling so that all required performance counters can be collected. Roofline analysis runs automatically unless you disable it using --no-roof.
After profiling, the generated files can be found inside:
workloads/vcopy/MI200/
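The last path component depends on the GPU that was profiled, so scripts may prefer to discover it instead of hard-coding MI200. A minimal sketch, assuming the `workloads/<name>/<device>/` layout shown above; `workload_dirs` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: list the per-device output directories of a workload.
# Assumes the workloads/<name>/<device>/ layout shown above.

# Usage: workload_dirs <workload_name> [<root_directory>]
workload_dirs() {
    name=$1
    root=${2:-workloads}
    for d in "$root/$name"/*/; do
        # An unmatched glob stays literal, so keep only real directories.
        [ -d "$d" ] && echo "$d"
    done
    return 0
}
```

After the example run, `workload_dirs vcopy` would print one line per device directory found under `workloads/vcopy/`.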
For detailed information on all profiling options, refer to Profile mode.
During the profiling phase, roofline analysis also executes multiple iterations to collect the necessary performance data. For detailed information on roofline analysis, refer to Standalone roofline.
For more details and options, run:
rocprof-compute profile --help
Profiling examples#
Common use cases when profiling a workload are:
Collect only roofline data for performance analysis#
$ rocprof-compute profile --name vcopy --roof-only -- ./vcopy -n 1048576 -b 256
Collect the counters to compute the metric for compute throughput utilization, skipping roofline#
$ rocprof-compute profile --name vcopy --set compute_thruput_util --no-roof -- ./vcopy -n 1048576 -b 256
List the available blocks/metrics for profiling#
The list of blocks and metrics is long, so the output is piped through more to page it. Note the index of each section.
$ rocprof-compute profile --list-available-metrics | more
Use block 2 for System Speed-of-Light profiling#
$ rocprof-compute profile --name vcopy -b 2 -- ./vcopy -n 1048576 -b 256
Attach to a running process for live profiling#
Dynamic process attachment can be performed with specific block IDs, verbose output, and no roofline data.
$ rocprof-compute profile -n try_live_attach_detach -b 3.1.1 4.1.1 5.1.1 --no-roof -VVV --attach-pid <process id>
Use multiple blocks (5 and 7) for detailed metric collection#
$ rocprof-compute profile --name vcopy -b 5 7 -- ./vcopy -n 1048576 -b 256
Analysis#
The analysis phase is the process of examining profiling data to understand GPU kernel performance, identify bottlenecks, and determine optimization opportunities. ROCm Compute Profiler provides multiple analysis modes to accommodate different workflows.
| Mode | When to use | Links to docs |
|---|---|---|
| CLI | Fast, scriptable insights; great for automation and quick checks. | |
| GUI | Interactive exploration, visual drill-down, and detailed charts. | |
| TUI | Lightweight, keyboard-driven experience for terminals. | |
Analysis Command:
rocprof-compute analyze -p <workloads_directory>
Example:
rocprof-compute analyze -p workloads/vcopy/MI200/
Explanation:
- ``rocprof-compute analyze``: Starts analysis mode to process profiling results.
- ``-p workloads/vcopy/MI200``: Points to the workload directory:
  - ``workloads/``: Root folder for profiling runs.
  - ``vcopy/``: The name the user provided when launching the profiling run.
  - ``MI200``: The device name.
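When a script handles several workload directories, the name and device components described above can be recovered from the path with standard tools. This is a small sketch using `basename` and `dirname`; the helper names are hypothetical.

```shell
#!/bin/sh
# Sketch: split a workload path of the form <root>/<name>/<device>/.
# workload_name and workload_device are illustrative helper names.

# Prints the workload name, e.g. "vcopy".
workload_name() {
    basename "$(dirname "$1")"
}

# Prints the device name, e.g. "MI200".
workload_device() {
    basename "$1"
}
```

For instance, `workload_name workloads/vcopy/MI200/` yields the name given to `--name` at profiling time, which can be useful for labeling saved analysis output.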
For more details on analysis options, refer to Analyze.
Analysis examples#
Common use cases when analyzing a workload are:
Show a list of metrics supported for analysis#
rocprof-compute analyze -p workloads/vcopy/MI200/ --list-available-metrics | more
Show System Speed-of-Light (block 2) and roofline (block 4) analysis#
rocprof-compute analyze -p workloads/vcopy/MI200/ -b 2 4
Analyze dispatches 12 and 34 from the mixbench workload with three-decimal precision#
rocprof-compute analyze -p workloads/mixbench/MI200/ --dispatch 12 34 --decimal 3
Compare two workloads to evaluate the impact of code optimizations#
rocprof-compute profile -n vcopy_optimized -- ./vcopy_optimized -n 1048576 -b 256
rocprof-compute analyze -p workloads/vcopy/MI200/ -p workloads/vcopy_optimized/MI200/
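The comparison workflow can be wrapped in a single function that profiles both builds and then analyzes them together. This is a sketch under assumptions: `DRY_RUN`, `run`, and `compare_workloads` are hypothetical, the device directory name is passed in by the caller, and both executables are expected in the current directory.

```shell
#!/bin/sh
# Sketch: profile a baseline and an optimized build, then compare them.
# DRY_RUN, run, and compare_workloads are hypothetical helpers, not ROCm features.

# Print the command when DRY_RUN=1 (the default here), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: compare_workloads <device_name>   (for example: MI200)
compare_workloads() {
    device=$1
    run rocprof-compute profile -n vcopy -- ./vcopy -n 1048576 -b 256
    run rocprof-compute profile -n vcopy_optimized -- ./vcopy_optimized -n 1048576 -b 256
    run rocprof-compute analyze \
        -p "workloads/vcopy/$device/" \
        -p "workloads/vcopy_optimized/$device/"
}
```

Running `compare_workloads MI200` with the default `DRY_RUN=1` prints the three commands for review. Setting `DRY_RUN=0` executes them in order.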
For more details and options, run:
rocprof-compute analyze --help