LLM inference performance testing on AMD Instinct MI300X#

2025-03-13

Applies to Linux

The ROCm vLLM Docker image offers a prebuilt, optimized environment for validating large language model (LLM) inference performance on AMD Instinct™ MI300X series accelerators. The image integrates vLLM and PyTorch builds tailored specifically for MI300X series accelerators.

With this Docker image, you can quickly test the expected inference performance numbers for MI300X series accelerators.
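
As a quick sanity check after pulling the image (see Getting started below), you can confirm that the container detects your accelerators and reports the bundled PyTorch and vLLM versions. This is an optional sketch; it assumes that rocm-smi is available inside the image.

# Optional sanity check: print the bundled PyTorch and vLLM versions, then list the visible GPUs
docker run --rm --device=/dev/kfd --device=/dev/dri --group-add video \
    rocm/vllm:instinct_main \
    bash -c 'python3 -c "import torch, vllm; print(torch.__version__, vllm.__version__)" && rocm-smi'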

Available models#

Model: Llama, Mistral, Qwen, JAIS, DBRX, Gemma, Cohere, DeepSeek

Model variant (Llama family): Llama 3.1 8B, Llama 3.1 70B, Llama 3.1 405B, Llama 3.2 11B Vision, Llama 2 7B, Llama 2 70B, Llama 3.1 8B FP8, Llama 3.1 70B FP8, Llama 3.1 405B FP8

Note

See the model card on Hugging Face (for example, the Llama 3.1 8B model card) to learn more about your selected model. Some models require access authorization through an external, third-party license agreement before use.
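
For gated models such as the Llama family, you need a Hugging Face account that has been granted access. One way to authenticate is sketched below, assuming the huggingface_hub CLI is installed on the machine where you download the weights; the HF_TOKEN environment-variable approach used later in this guide works as well.

# Log in once so gated model weights can be downloaded (create a token at https://huggingface.co/settings/tokens)
huggingface-cli login --token $your_personal_hf_token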

Note

vLLM is a toolkit and library for LLM inference and serving. AMD implements high-performance custom kernels and modules in vLLM to enhance performance. See vLLM inference and vLLM performance optimization for more information.

Performance measurements#

To evaluate performance, the Performance results with AMD ROCm software page provides reference throughput and latency measurements for inference on popular AI models.

Note

The performance data presented in Performance results with AMD ROCm software should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X accelerators or ROCm software.

Advanced features and known issues#

For information on experimental features and known issues related to ROCm optimization efforts on vLLM, see the developer’s guide at ROCm/vllm.

Getting started#

Use the following procedures to reproduce the benchmark results on an MI300X accelerator with the prebuilt vLLM Docker image.

  1. Disable NUMA auto-balancing.

    To optimize performance, disable automatic NUMA balancing. Otherwise, the GPU might hang until the periodic balancing completes. For more information, see AMD Instinct MI300X system optimization. A sketch for making this setting persistent across reboots follows these steps.

    # disable automatic NUMA balancing
    sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'
    # check if NUMA balancing is disabled (returns 0 if disabled)
    cat /proc/sys/kernel/numa_balancing
    0
    
  2. Download the ROCm vLLM Docker image.

    Use the following command to pull the Docker image from Docker Hub.

    docker pull rocm/vllm:instinct_main
    
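To keep NUMA auto-balancing disabled across reboots, you can set the corresponding sysctl persistently. This is an optional sketch; the file name under /etc/sysctl.d is illustrative, and kernel.numa_balancing is the sysctl that backs /proc/sys/kernel/numa_balancing.

# Optional: persist the setting across reboots (the file name is illustrative)
echo "kernel.numa_balancing=0" | sudo tee /etc/sysctl.d/99-disable-numa-balancing.conf
sudo sysctl --system   # reload sysctl configuration and apply the value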

Benchmarking#

Once the setup is complete, choose between two options to reproduce the benchmark results: MAD-integrated benchmarking or standalone benchmarking.

MAD-integrated benchmarking: clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.

git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt

Use this command to run the performance benchmark test on the Llama 3.1 8B model using one GPU with the float16 data type on the host machine.

export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
python3 tools/run_models.py --tags pyt_vllm_llama-3.1-8b --keep-model-dir --live-output --timeout 28800

MAD launches a Docker container with the name container_ci-pyt_vllm_llama-3.1-8b. The latency and throughput reports of the model are collected in the following path: ~/MAD/reports_float16/.

Although the available models are preconfigured to collect latency and throughput performance data, you can also change the benchmarking parameters. See the standalone benchmarking workflow below for more information.
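
Once a run finishes, you can inspect the collected CSV reports directly on the host. This is a minimal sketch; it only assumes that the reports land under ~/MAD/reports_float16/ as noted above and that they are comma-separated.

# List the collected reports and print each CSV as an aligned table
ls ~/MAD/reports_float16/
for f in ~/MAD/reports_float16/*.csv; do
    echo "== $f =="
    column -s, -t < "$f"
done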

Standalone benchmarking: run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.

docker pull rocm/vllm:instinct_main
docker run -it \
    --device=/dev/kfd \
    --device=/dev/dri \
    --group-add video \
    --shm-size 16G \
    --security-opt seccomp=unconfined \
    --security-opt apparmor=unconfined \
    --cap-add=SYS_PTRACE \
    -v $(pwd):/workspace \
    --env HUGGINGFACE_HUB_CACHE=/workspace \
    --name test \
    rocm/vllm:instinct_main

In the Docker container, clone the ROCm MAD repository and navigate to its benchmark scripts directory, MAD/scripts/vllm.

git clone https://github.com/ROCm/MAD
cd MAD/scripts/vllm
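
Before launching a benchmark, it can help to confirm that the container sees the expected accelerators. A minimal check, assuming rocm-smi is available inside the image:

# Confirm the GPUs are visible inside the container
rocm-smi
# Confirm PyTorch detects them as well (the ROCm build of PyTorch exposes GPUs through the torch.cuda API)
python3 -c "import torch; print(torch.cuda.device_count(), 'GPUs visible to PyTorch')"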

To start the benchmark, use the following command with the appropriate options.

./vllm_benchmark_report.sh -s $test_option -m meta-llama/Llama-3.1-8B-Instruct -g $num_gpu -d $datatype

Name          Options            Description
$test_option  latency            Measure decoding token latency
              throughput         Measure token generation throughput
              all                Measure both throughput and latency
$num_gpu      1 or 8             Number of GPUs
$datatype     float16 or float8  Data type

Note

The input sequence length, output sequence length, and tensor parallelism (TP) size are already configured. You don't need to specify them with this script.

Note

If you encounter the following error, export a Hugging Face token that has access authorization for the gated models.

OSError: You are trying to access a gated repo.

# pass your HF_TOKEN
export HF_TOKEN=$your_personal_hf_token

Here are some examples of running the benchmark with various options.

  • Latency benchmark

    Use this command to benchmark the latency of the Llama 3.1 8B model on eight GPUs with the float16 data type.

    ./vllm_benchmark_report.sh -s latency -m meta-llama/Llama-3.1-8B-Instruct -g 8 -d float16
    

    Find the latency report at ./reports_float16_vllm_rocm6.3.1/summary/Llama-3.1-8B-Instruct_latency_report.csv.

  • Throughput benchmark

    Use this command to benchmark the throughput of the Llama 3.1 8B model on eight GPUs with the float16 data type.

    ./vllm_benchmark_report.sh -s throughput -m meta-llama/Llama-3.1-8B-Instruct -g 8 -d float16
    

    Find the throughput report at ./reports_float16_vllm_rocm6.3.1/summary/Llama-3.1-8B-Instruct_throughput_report.csv.

Note

Throughput is calculated as:

  • throughput_tot = requests × (input length + output length) / elapsed_time
  • throughput_gen = requests × output length / elapsed_time
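
As a purely illustrative example, assume 100 requests, each with 128 input tokens and 128 output tokens, completed in 60 seconds:

# Hypothetical numbers for illustration only: 100 requests, 128 input and 128 output tokens each, 60 s elapsed
python3 -c "
requests, input_len, output_len, elapsed = 100, 128, 128, 60.0
print('throughput_tot =', requests * (input_len + output_len) / elapsed, 'tokens/s')  # ~426.7 tokens/s
print('throughput_gen =', requests * output_len / elapsed, 'tokens/s')                # ~213.3 tokens/s
"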

Further reading#

Previous versions#

This table lists previous versions of the ROCm vLLM Docker image used for inference performance testing. For detailed information about the models available for benchmarking in each version, see the version-specific documentation.

ROCm version    vLLM version    PyTorch version
6.3.1           0.6.6           2.7.0
6.2.1           0.6.4           2.5.0
6.2.0           0.4.3           2.4.0