PyTorch inference performance testing#

2025-04-24

5 min read time

Applies to Linux

The ROCm PyTorch Docker image offers a prebuilt, optimized environment for testing model inference performance on AMD Instinct™ MI300X series accelerators. This guide demonstrates how to use the AMD Model Automation and Dashboarding (MAD) tool with the ROCm PyTorch container to test inference performance on various models efficiently.

Supported models#

Model
-----
CLIP
Chai-1

Note

See the model card on Hugging Face to learn more about your selected model (for example, the CLIP model card). Some models require access authorization prior to use, granted via an external license agreement through a third party.
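
If your selected model is gated, request access on its Hugging Face page and keep a personal access token ready; the benchmark in the following sections passes it to MAD through the MAD_SECRETS_HFTOKEN environment variable. As a sketch, you can also authenticate the host directly with the Hugging Face CLI (shipped with the huggingface_hub package):

pip install -U "huggingface_hub[cli]"   # installs the huggingface-cli tool
huggingface-cli login                   # prompts for your access token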

Getting started#

Use the following procedures to reproduce the benchmark results on an MI300X series accelerator with the prebuilt PyTorch Docker image.

  1. Disable NUMA auto-balancing.

    To optimize performance, disable automatic NUMA balancing. Otherwise, the GPU might hang until the periodic balancing is finalized. For more information, see AMD Instinct MI300X system optimization.

    # disable automatic NUMA balancing (requires root; not persistent across reboots)
    sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'
    # check that NUMA balancing is disabled (prints 0 if disabled)
    cat /proc/sys/kernel/numa_balancing
    0
    
  2. Use the following command to pull the ROCm PyTorch Docker image from Docker Hub. To confirm that the image runs on your system, see the interactive sketch after this list.

    docker pull rocm/pytorch:latest

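Before running the benchmark, you can optionally start the container interactively to confirm that the image and accelerators are visible. A minimal sketch using the device flags commonly documented for ROCm containers; adjust them to match your system:

docker run -it --rm \
    --device=/dev/kfd --device=/dev/dri \
    --group-add video --ipc=host \
    --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
    rocm/pytorch:latest \
    rocm-smi

If rocm-smi lists your accelerators, the image is working. The MAD tooling described in the next section launches its own container, so this step is only a sanity check.
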
Benchmarking#

To simplify performance testing, the ROCm Model Automation and Dashboarding (ROCm/MAD) project provides ready-to-use scripts and configuration. To start, clone the MAD repository to a local directory and install the required packages on the host machine.

git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
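
To discover which model tags and options your checkout supports, you can query the runner script itself. This assumes tools/run_models.py exposes a standard --help listing (typical of argparse-based scripts; verify against your clone):

python3 tools/run_models.py --help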

Use this command to run the performance benchmark test on the CLIP model using one GPU with the float16 data type on the host machine.

export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
python3 tools/run_models.py --tags pyt_clip_inference --keep-model-dir --live-output --timeout 28800

MAD launches a Docker container named container_ci-pyt_clip_inference. The model's latency and throughput reports are collected in perf.csv.
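
For a quick look at the results, the CSV can be pretty-printed in the shell. A minimal sketch; the exact column names depend on the model and the MAD version:

# align the comma-separated columns of the report for readability
column -s, -t perf.csv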

Note

For improved performance, consider enabling TunableOp. By default, pyt_clip_inference runs with TunableOp disabled (see ROCm/MAD). To enable it, edit the default run behavior in tools/run_models.py: in the model's run arguments, change --tunableop off to --tunableop on.

Enabling TunableOp triggers a two-pass run: a warm-up pass followed by the performance-collection pass. Although this might increase the initial run time, it can result in a performance gain on the measured pass.
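
TunableOp can also be controlled through PyTorch's own environment variables, independently of MAD's --tunableop flag. A sketch using the switches documented for upstream PyTorch; whether MAD forwards host environment variables into its container depends on its launch configuration, so verify before relying on this:

# upstream PyTorch TunableOp switches
export PYTORCH_TUNABLEOP_ENABLED=1                       # turn TunableOp on
export PYTORCH_TUNABLEOP_FILENAME=tunableop_results.csv  # where tuning results are written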

Further reading#