The ROCm Systems Profiler feature set and use cases
ROCm Systems Profiler is designed to be highly extensible. Internally, it leverages the Timemory performance analysis toolkit to manage extensions, resources, data, and other items. It supports the following features, modes, metrics, and APIs.
Data collection modes
- Dynamic instrumentation
  - Runtime instrumentation: Instrument executables and shared libraries at runtime
  - Binary rewriting: Generate a new executable and/or library with instrumentation built-in
- Statistical sampling: Periodic software interrupts per-thread 
- Process-level sampling: A background thread records process-, system- and device-level metrics while the application runs 
- Causal profiling: Quantifies the potential impact of optimizations in parallel code 
Data analysis
- High-level summary profiles with mean, min, max, and standard deviation statistics
  - Low overhead and memory efficient
  - Ideal for running at scale
- Comprehensive traces for every individual event and measurement 
- Application speed-up predictions resulting from potential optimizations in functions and lines of code based on causal profiling 
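To illustrate why a summary profile can stay memory efficient while a comprehensive trace cannot, here is a minimal sketch of a constant-memory accumulator for mean, min, max, and standard deviation. It uses Welford's online algorithm, which is an assumption chosen for illustration and not a description of how ROCm Systems Profiler or Timemory computes its statistics. The class name `RunningStats` and the sample durations are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Hypothetical constant-memory summary of a stream of measurements.

    Uses Welford's online algorithm; this is an illustrative assumption,
    not the profiler's actual implementation.
    """
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0                  # running sum of squared deviations from the mean
    minimum: float = math.inf
    maximum: float = -math.inf

    def add(self, value: float) -> None:
        """Fold one measurement into the summary without storing it."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    @property
    def stddev(self) -> float:
        """Population standard deviation of the values seen so far."""
        return math.sqrt(self.m2 / self.count) if self.count else 0.0


# Hypothetical per-call durations (in milliseconds) for one function.
durations_ms = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

stats = RunningStats()
for d in durations_ms:
    stats.add(d)

print(stats.count, stats.mean, stats.minimum, stats.maximum, stats.stddev)
```

The accumulator holds five numbers per measured region regardless of how many events occur, whereas a trace must keep a record for every event. That trade-off is why summary profiles suit runs at scale and traces suit detailed investigation.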
Parallelism API support
- HIP 
- HSA 
- Pthreads 
- MPI 
- Kokkos-Tools (KokkosP) 
- OpenMP-Tools (OMPT) 
GPU metrics
- GPU hardware counters 
- HIP API tracing 
- HIP kernel tracing 
- HSA API tracing 
- HSA operation tracing 
- System-level sampling (via rocm-smi)
  - Memory usage
  - Power usage
  - Temperature
  - Utilization
CPU metrics
- CPU hardware counters sampling and profiles 
- CPU frequency sampling 
- Various timing metrics
  - Wall time
  - CPU time (process and thread)
  - CPU utilization (process and thread)
  - User CPU time
  - Kernel CPU time
- Various memory metrics
  - High-water mark (sampling and profiles)
  - Memory page allocation
  - Virtual memory usage
- Network statistics 
- I/O metrics 
- Many others 
Third-party API support
- TAU 
- LIKWID 
- Caliper 
- CrayPAT 
- VTune 
- NVTX 
- ROCTX 
ROCm Systems Profiler use cases
When analyzing the performance of an application, do NOT assume you know where the performance bottlenecks are and why they are happening. ROCm Systems Profiler is a tool for analyzing the entire application and its performance. It is ideal for characterizing where optimization would have the greatest impact on an end-to-end run of the application and for viewing what else is happening on the system during a performance bottleneck.
When GPUs are involved, there is a tendency to assume that the quickest path to performance improvement is minimizing the runtime of the GPU kernels. This is a highly flawed assumption. If you reduce the runtime of a kernel from 1 millisecond to 1 microsecond (a 1000x speed-up) but the original application never spent time waiting for kernels to complete, the end-to-end runtime of your application will not measurably improve. In other words, it does not matter how fast or slow the code on the GPU is if the application is not bottlenecked waiting on the GPU.
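The argument above can be made concrete with a toy model. The sketch below assumes a deliberately simplified application: the host launches one kernel asynchronously, performs some independent CPU work, and then synchronizes, so the end-to-end time is the longer of the two. The function name `end_to_end_ms` and all timings are hypothetical and chosen only to illustrate the reasoning.

```python
def end_to_end_ms(host_work_ms: float, kernel_ms: float) -> float:
    """Toy model (an assumption, not a measurement): the kernel runs
    concurrently with independent host work, and the host synchronizes
    afterwards, so the wall time is whichever finishes last."""
    return max(host_work_ms, kernel_ms)


# Case 1: the host is the bottleneck (10 ms of CPU work, 1 ms kernel).
before = end_to_end_ms(host_work_ms=10.0, kernel_ms=1.0)
after = end_to_end_ms(host_work_ms=10.0, kernel_ms=0.001)   # 1000x faster kernel
print(f"host-bound: {before} ms -> {after} ms")             # no change

# Case 2: the host really does wait on the GPU (0.2 ms of CPU work).
before_gpu_bound = end_to_end_ms(host_work_ms=0.2, kernel_ms=1.0)
after_gpu_bound = end_to_end_ms(host_work_ms=0.2, kernel_ms=0.001)
print(f"GPU-bound:  {before_gpu_bound} ms -> {after_gpu_bound} ms")
```

In the first case the 1000x kernel speed-up leaves the end-to-end time at 10 ms, because the host never waited on the kernel. Only in the second case, where the application is actually blocked on the GPU, does the kernel optimization shorten the run. A system-level view is what tells you which case applies.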
Use ROCm Systems Profiler to obtain a high-level view of the entire application. Use it to determine where the performance bottlenecks are and obtain clues to why these bottlenecks are happening. Rather than worrying about kernel performance, start your investigation with ROCm Systems Profiler, which characterizes the broad picture.
Note
For insight into the execution of individual kernels on the GPU, use ROCm Compute Profiler.
In terms of CPU analysis, ROCm Systems Profiler does not target any specific vendor. It works just as well on AMD and non-AMD CPUs. With regard to the GPU, ROCm Systems Profiler is currently restricted to HIP and HSA APIs and kernels running on AMD GPUs.