Quickstart#
This guide provides instructions for using rocprof-compute, AMD’s ROCm Compute Profiler. It covers the steps required to profile GPU workloads and analyze performance data to identify bottlenecks and optimize applications.
The following sections provide brief steps to get started with rocprof-compute. Using the tool involves two main phases:
Profiling
Analysis
Prerequisites#
Ensure ROCm is installed, then complete the following checks:
1. Check GPU and Driver#
amd-smi # Monitor GPU health, temperature, utilization
rocminfo # Display ROCm platform and GPU properties
If these commands fail:
Verify that the GPU driver is loaded:
lsmod | grep amdgpu
Load the driver if needed:
sudo modprobe amdgpu
Verify that the device nodes exist:
ls /dev/kfd /dev/dri
Ensure that the user name is added to the render and video groups:
sudo usermod -aG render,video $USER # Log out and back in for changes to take effect
If the rocminfo or amd-smi commands are not found, set the ROCm environment:
export PATH=/opt/rocm/bin:$PATH
export LD_LIBRARY_PATH=/opt/rocm/lib:${LD_LIBRARY_PATH}
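The group-membership check above can be scripted. The following is a minimal sketch and not part of ROCm: missing_groups is a hypothetical helper name. It takes a space-separated group list (for example, the output of id -nG) and prints whichever of render and video is absent.

```shell
# Hypothetical helper (an assumption for illustration, not a ROCm command).
# Prints the groups among "render video" that are absent from the
# space-separated group list passed as $1.
missing_groups() {
    mg_have=" $1 "      # pad with spaces so whole names can be matched
    mg_missing=""
    for mg_group in render video; do
        case "$mg_have" in
            *" $mg_group "*) ;;   # group is present, nothing to record
            *) mg_missing="${mg_missing:+$mg_missing }$mg_group" ;;
        esac
    done
    printf '%s\n' "$mg_missing"
}
```

Calling missing_groups "$(id -nG)" prints an empty line when the current user already belongs to both groups.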
2. Check Python Environment#
python3 --version # Requires Python 3.8+
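To turn the version requirement into a scripted check, a sketch along the following lines could be used. The function name version_at_least is hypothetical and the comparison only looks at the major and minor components.

```shell
# Hypothetical helper (an assumption for illustration).
# Succeeds when version $1 (e.g. 3.10.12) is at least major.minor $2 (e.g. 3.8).
version_at_least() {
    va_have_major=${1%%.*}          # text before the first dot
    va_rest=${1#*.}                 # text after the first dot
    va_have_minor=${va_rest%%.*}
    va_want_major=${2%%.*}
    va_rest=${2#*.}
    va_want_minor=${va_rest%%.*}
    [ "$va_have_major" -gt "$va_want_major" ] ||
        { [ "$va_have_major" -eq "$va_want_major" ] &&
          [ "$va_have_minor" -ge "$va_want_minor" ]; }
}

# Example call against the installed interpreter:
# version_at_least "$(python3 -c 'import platform; print(platform.python_version())')" 3.8
```

Comparing the components numerically avoids the pitfall of string comparison, where 3.10 would sort before 3.8.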
3. Install Dependencies#
pip install -r <ROCM_PATH>/libexec/rocprofiler-compute/requirements.txt
**Note:** Replace ``<ROCM_PATH>`` with the ROCm installation path (e.g., ``/opt/rocm`` or ``/opt/rocm-7.3.0``).
For detailed installation instructions, refer to Installing and deploying ROCm Compute Profiler.
Profiling#
Profiling is the process of collecting performance counters from a GPU application during execution. ROCm Compute Profiler captures detailed metrics regarding kernel execution, memory usage, roofline analysis, and hardware utilization to facilitate performance understanding and optimization.
The following examples reference sample applications available in the samples folder of the GitHub repository: ROCm/rocm-systems
Compile HIP sample: Build the HIP sample into an executable named ‘vcopy’.
hipcc vcopy.cpp -o vcopy
Profile Command:
rocprof-compute profile --name <workload_name> [profile options] [roofline options] -- <workload_cmd>
Example:
rocprof-compute profile --name vcopy -- ./vcopy -n 1048576 -b 256
Explanation:
rocprof-compute profile: Starts a profiling session for a compute workload.
--name vcopy: Labels this run as ‘vcopy’ for easy identification and comparison.
--: Separates rocprof-compute options from the application arguments.
./vcopy -n 1048576 -b 256: Executes the application with the following parameters:
    -n 1048576: Number of elements.
    -b 256: Block size (threads per block).
What happens during profiling?#
The application runs multiple times during profiling so that all required performance counters can be collected. Roofline analysis runs automatically unless disabled with --no-roof.
After profiling, the generated files can be found inside:
workloads/vcopy/MI200/
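A script that runs after profiling may want to confirm that this directory exists before continuing. The sketch below assumes the workloads/<name>/<device>/ layout shown above, relative to the current directory. The names workload_dir and workload_exists are hypothetical, and the device name (MI200 in this guide) must be supplied by the caller.

```shell
# Hypothetical helpers (assumptions for illustration, not part of rocprof-compute).

# Print the output directory for workload name $1 and device name $2,
# following the workloads/<name>/<device>/ layout.
workload_dir() {
    printf 'workloads/%s/%s/\n' "$1" "$2"
}

# Succeed only if that directory exists relative to the current directory.
workload_exists() {
    [ -d "$(workload_dir "$1" "$2")" ]
}
```

For example, workload_exists vcopy MI200 || echo "no profiling output found" reports a missing run.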
For detailed information on all profiling options, refer to the full documentation: Profiling
During the profiling phase, roofline analysis also executes multiple iterations to collect the necessary performance data. For detailed information on roofline analysis, refer to the full documentation: Roofline Mode
For more details and options, run:
rocprof-compute profile --help
Other Profiling Examples#
Profiles the workload and collects only roofline data for performance analysis:
$ rocprof-compute profile --name vcopy --roof-only -- ./vcopy -n 1048576 -b 256
Profiles the workload and collects the counters to compute the metric for compute throughput utilization, skipping roofline:
$ rocprof-compute profile --name vcopy --set compute_thruput_util --no-roof -- ./vcopy -n 1048576 -b 256
Lists the blocks and metrics available for profiling, paged through more because the list is long. Note the index of each section:
$ rocprof-compute profile --list-available-metrics | more
Profiles the workload using block 2 for system speed of light profiling:
$ rocprof-compute profile --name vcopy -b 2 -- ./vcopy -n 1048576 -b 256
Attaches to a running process for live profiling with specific block IDs, verbose output, and no roofline data:
$ rocprof-compute profile -n try_live_attach_detach -b 3.1.1 4.1.1 5.1.1 --no-roof -VVV --attach-pid <process id>
Profiles the workload using multiple blocks (5 and 7) for detailed metric collection:
$ rocprof-compute profile --name vcopy -b 5 7 -- ./vcopy -n 1048576 -b 256
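When trying several block selections, it can help to print each command for review before running it. The sketch below only assembles the command text from the --name, -b, and -- pieces used in the examples above and does not invoke the profiler. The function name profile_cmd is hypothetical.

```shell
# Hypothetical helper (an assumption for illustration).
# Prints, without running, a profile command for workload name $1 limited to
# the space-separated block IDs in $2. Remaining arguments are the
# application command line.
profile_cmd() {
    pc_name=$1
    pc_blocks=$2
    shift 2
    printf 'rocprof-compute profile --name %s -b %s -- %s\n' \
        "$pc_name" "$pc_blocks" "$*"
}
```

Running profile_cmd vcopy "5 7" ./vcopy -n 1048576 -b 256 prints the same command as the multiple-block example above.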
Analyzing#
Analysis refers to the process of examining profiling data to understand GPU kernel performance, identify bottlenecks, and determine optimization opportunities. ROCm Compute Profiler provides multiple analysis modes to accommodate different workflows.
| Mode | When to Use | Links to docs |
|---|---|---|
| | Fast, scriptable insights; great for automation and quick checks. | |
| | Interactive exploration, visual drill-down, and detailed charts. | |
| | Lightweight, keyboard-driven experience for terminals. | |
Analysis Command:
rocprof-compute analyze -p <workloads_directory>
Example:
rocprof-compute analyze -p workloads/vcopy/MI200/
Explanation:
rocprof-compute analyze: Starts analysis mode to process profiling results.
-p workloads/vcopy/MI200: Path points to the workload directory:
    workloads/: Root folder for profiling runs.
    vcopy/: The name the user provided while launching the profiling run.
    MI200: Device name.
For more details on analysis options, refer to the full documentation: Analyze
Other Analysis Examples#
Show a list of metrics supported for analysis:
rocprof-compute analyze -p workloads/vcopy/MI200/ --list-available-metrics | more
Show the System Speed-of-Light (block 2) and roofline (block 4) analysis:
rocprof-compute analyze -p workloads/vcopy/MI200/ -b 2 4
Analyzes dispatches 12 and 34 from the mixbench workload, reporting values to 3 decimal places:
rocprof-compute analyze -p workloads/mixbench/MI200/ --dispatch 12 34 --decimal 3
Compares two workloads to evaluate the impact of code optimizations:
rocprof-compute profile -n vcopy_optimized -- ./vcopy_optimized -n 1048576 -b 256
rocprof-compute analyze -p workloads/vcopy/MI200/ -p workloads/vcopy_optimized/MI200/
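The comparison above repeats -p once per workload directory. As a sketch, a small helper could assemble that command from workload names. The name compare_cmd is hypothetical, and the workloads/<name>/<device>/ layout is assumed from the examples in this guide.

```shell
# Hypothetical helper (an assumption for illustration).
# Prints, without running, an analyze command with one -p per workload name.
# $1 is the device directory name (MI200 in this guide); the remaining
# arguments are workload names.
compare_cmd() {
    cc_device=$1
    shift
    cc_cmd='rocprof-compute analyze'
    for cc_name in "$@"; do
        cc_cmd="$cc_cmd -p workloads/$cc_name/$cc_device/"
    done
    printf '%s\n' "$cc_cmd"
}
```

Running compare_cmd MI200 vcopy vcopy_optimized prints the two-workload analyze command shown above.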
For more details and options, run:
rocprof-compute analyze --help