PyTorch inference performance testing#
2025-04-24
5 min read time
The ROCm PyTorch Docker image offers a prebuilt, optimized environment for testing model inference performance on AMD Instinct™ MI300X series accelerators. This guide demonstrates how to use the AMD Model Automation and Dashboarding (MAD) tool with the ROCm PyTorch container to test inference performance on various models efficiently.
Supported models#
Note
See the CLIP model card on Hugging Face to learn more about the model. Some models require access authorization prior to use, granted through an external license agreement with a third party.
Getting started#
Use the following procedures to reproduce the benchmark results on an MI300X series accelerator with the prebuilt PyTorch Docker image.
Disable NUMA auto-balancing.
To optimize performance, disable automatic NUMA balancing. Otherwise, the GPU might hang until periodic balancing completes. For more information, see AMD Instinct MI300X system optimization.
# disable automatic NUMA balancing
sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'

# check whether NUMA balancing is disabled (returns 0 if disabled)
cat /proc/sys/kernel/numa_balancing
0
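The echo command above only lasts until the next reboot. If you want the setting to persist, one option is a sysctl drop-in file; this is a minimal sketch that assumes your distribution reads configuration from /etc/sysctl.d, and the file name is only an example.
# optional: keep NUMA balancing disabled across reboots (assumes an /etc/sysctl.d-based distribution)
echo 'kernel.numa_balancing=0' | sudo tee /etc/sysctl.d/99-disable-numa-balancing.conf
sudo sysctl --system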
Use the following command to pull the ROCm PyTorch Docker image from Docker Hub.
docker pull rocm/pytorch:latest
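MAD starts its own container during benchmarking, but if you want to inspect the image interactively first, you can launch it manually. The device and security flags below are the ones commonly documented for ROCm containers; adjust them for your environment.
docker run -it --rm \
    --device=/dev/kfd --device=/dev/dri \
    --group-add video \
    --ipc=host \
    --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined \
    rocm/pytorch:latest /bin/bash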
Benchmarking#
To simplify performance testing, the ROCm Model Automation and Dashboarding (ROCm/MAD) project provides ready-to-use scripts and configuration. To start, clone the MAD repository to a local directory and install the required packages on the host machine.
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
Use the following command to run the performance benchmark test on the CLIP model using one GPU with the float16 data type on the host machine.
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
python3 tools/run_models.py --tags pyt_clip_inference --keep-model-dir --live-output --timeout 28800
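If you have already authenticated with huggingface-cli login, you can reuse the cached token instead of pasting it inline; the path below is the default cache location used by the Hugging Face CLI, so adjust it if your setup differs.
# reuse a previously cached Hugging Face token (default huggingface-cli location)
export MAD_SECRETS_HFTOKEN="$(cat ~/.cache/huggingface/token)"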
MAD launches a Docker container with the name container_ci-pyt_clip_inference. The latency and throughput reports of the model are collected in perf.csv.
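As a quick check, you can pretty-print the collected metrics from the shell. This assumes perf.csv is written to the MAD working directory; adjust the path if your setup places it elsewhere.
# view the latency and throughput columns as an aligned table
column -s, -t < perf.csv | less -S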
Note
For improved performance, consider enabling TunableOp. By default, pyt_clip_inference runs with TunableOp disabled (see ROCm/MAD). To enable it, edit the default run behavior in tools/run_models.py and update the model's run args, changing --tunableop off to --tunableop on (see the sketch after this note).
Enabling TunableOp triggers a two-pass run: a warm-up pass followed by the performance-collection pass. Although this might increase the initial run time, it can result in a performance gain.
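As a rough sketch of the edit described in the note above, the following command flips the flag in place. It assumes the string --tunableop off appears only where you intend to change it, so review the file (for example, with git diff) after running it.
# switch the model's run args from TunableOp disabled to enabled
sed -i 's/--tunableop off/--tunableop on/' tools/run_models.py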
Further reading#
To learn more about system settings and management practices to configure your system for MI300X accelerators, see AMD Instinct MI300X system optimization.
To learn how to run LLMs from Hugging Face or your own model, see Running models from Hugging Face.
To learn how to optimize inference on LLMs, see Inference optimization.
To learn how to fine-tune LLMs, see Fine-tuning LLMs.