LLM inference performance validation on AMD Instinct MI300X#

2025-10-16

Applies to Linux

Caution

This page does not reflect the latest version of the ROCm vLLM inference performance documentation. See vLLM inference performance testing for the latest version.

The ROCm vLLM Docker image offers a prebuilt, optimized environment for validating large language model (LLM) inference performance on the AMD Instinct™ MI300X GPU. The image integrates vLLM and PyTorch builds tailored specifically for the MI300X GPU. As its tag indicates, the image used in this topic (rocm/vllm:rocm6.3.1_mi300_ubuntu22.04_py3.12_vllm_0.6.6) is based on ROCm 6.3.1, Ubuntu 22.04, Python 3.12, and vLLM 0.6.6.

With this Docker image, you can quickly validate the expected inference performance numbers for the MI300X GPU. This topic also provides tips on optimizing performance with popular AI models. For more information, see the lists of available models for MAD-integrated benchmarking and standalone benchmarking.

Note

vLLM is a toolkit and library for LLM inference and serving. AMD implements high-performance custom kernels and modules in vLLM to enhance performance. See vLLM inference and vLLM V1 performance optimization for more information.

Getting started#

Use the following procedures to reproduce the benchmark results on an MI300X GPU with the prebuilt vLLM Docker image.

  1. Disable NUMA auto-balancing.

    To optimize performance, disable automatic NUMA balancing. Otherwise, the GPU might hang until the periodic balancing is finalized. For more information, see the system validation steps.

    # disable automatic NUMA balancing (writing to /proc/sys requires root privileges)
    sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'
    # check if NUMA balancing is disabled (returns 0 if disabled)
    cat /proc/sys/kernel/numa_balancing
    0
    
  2. Download the ROCm vLLM Docker image.

    Use the following command to pull the Docker image from Docker Hub.

    docker pull rocm/vllm:rocm6.3.1_mi300_ubuntu22.04_py3.12_vllm_0.6.6
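
The check in step 1 can also be wrapped in a small helper so that it is easy to repeat before each benchmark run. The following is a minimal sketch, not part of the ROCm tooling: the function name check_numa_balancing and its optional path argument are assumptions introduced here for illustration.

```shell
# Report whether automatic NUMA balancing is disabled.
# The optional first argument (a file path) is an assumption added for this
# sketch; it defaults to the kernel file used in step 1.
check_numa_balancing() {
    path="${1:-/proc/sys/kernel/numa_balancing}"
    if [ ! -r "$path" ]; then
        echo "unknown: cannot read $path"
        return 2
    fi
    value="$(cat "$path")"
    if [ "$value" = "0" ]; then
        echo "disabled"
        return 0
    fi
    echo "enabled (value: $value)"
    return 1
}
```

Calling check_numa_balancing with no arguments reads /proc/sys/kernel/numa_balancing and prints "disabled" when the value is 0.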
    

Once the setup is complete, choose one of the following two options to reproduce the benchmark results: MAD-integrated benchmarking or standalone benchmarking.

MAD-integrated benchmarking#

Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.

git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt

Use this command to run a performance benchmark test of the Llama 3.1 8B model on one GPU with the float16 data type on the host machine.

export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
python3 tools/run_models.py --tags pyt_vllm_llama-3.1-8b --keep-model-dir --live-output --timeout 28800

ROCm MAD launches a Docker container with the name container_ci-pyt_vllm_llama-3.1-8b. The latency and throughput reports of the model are collected in the following path: ~/MAD/reports_float16/.

Although the following models are preconfigured to collect latency and throughput performance data, you can also change the benchmarking parameters. Refer to the Standalone benchmarking section.

Available models#

| Model name | Tag |
|---|---|
| Llama 3.1 8B | pyt_vllm_llama-3.1-8b |
| Llama 3.1 70B | pyt_vllm_llama-3.1-70b |
| Llama 3.1 405B | pyt_vllm_llama-3.1-405b |
| Llama 3.2 11B Vision | pyt_vllm_llama-3.2-11b-vision-instruct |
| Llama 2 7B | pyt_vllm_llama-2-7b |
| Llama 2 70B | pyt_vllm_llama-2-70b |
| Mixtral MoE 8x7B | pyt_vllm_mixtral-8x7b |
| Mixtral MoE 8x22B | pyt_vllm_mixtral-8x22b |
| Mistral 7B | pyt_vllm_mistral-7b |
| Qwen2 7B | pyt_vllm_qwen2-7b |
| Qwen2 72B | pyt_vllm_qwen2-72b |
| JAIS 13B | pyt_vllm_jais-13b |
| JAIS 30B | pyt_vllm_jais-30b |
| DBRX Instruct | pyt_vllm_dbrx-instruct |
| Gemma 2 27B | pyt_vllm_gemma-2-27b |
| C4AI Command R+ 08-2024 | pyt_vllm_c4ai-command-r-plus-08-2024 |
| DeepSeek MoE 16B | pyt_vllm_deepseek-moe-16b-chat |
| Llama 3.1 70B FP8 | pyt_vllm_llama-3.1-70b_fp8 |
| Llama 3.1 405B FP8 | pyt_vllm_llama-3.1-405b_fp8 |
| Mixtral MoE 8x7B FP8 | pyt_vllm_mixtral-8x7b_fp8 |
| Mixtral MoE 8x22B FP8 | pyt_vllm_mixtral-8x22b_fp8 |
| Mistral 7B FP8 | pyt_vllm_mistral-7b_fp8 |
| DBRX Instruct FP8 | pyt_vllm_dbrx_fp8 |
| C4AI Command R+ 08-2024 FP8 | pyt_vllm_command-r-plus_fp8 |
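
To benchmark several of the listed models in sequence, the run_models.py command shown earlier can be repeated once per tag. The following is a minimal sketch under stated assumptions: run_mad_tags and the DRY_RUN variable are hypothetical names introduced here, and the function is expected to be called from the MAD directory on the host with MAD_SECRETS_HFTOKEN already exported.

```shell
# Print (or run) the MAD benchmark command for each tag passed as an argument.
# DRY_RUN is a hypothetical variable for this sketch: when it is 1 (the
# default), the commands are only printed so that they can be reviewed first.
run_mad_tags() {
    for tag in "$@"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "python3 tools/run_models.py --tags $tag --keep-model-dir --live-output --timeout 28800"
        else
            python3 tools/run_models.py --tags "$tag" --keep-model-dir --live-output --timeout 28800
        fi
    done
}
```

For example, run_mad_tags pyt_vllm_llama-3.1-8b pyt_vllm_mistral-7b prints the two commands. Setting DRY_RUN=0 runs them one after another.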

Standalone benchmarking#

You can run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.

docker pull rocm/vllm:rocm6.3.1_mi300_ubuntu22.04_py3.12_vllm_0.6.6
docker run -it --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 16G --security-opt seccomp=unconfined --security-opt apparmor=unconfined --cap-add=SYS_PTRACE -v $(pwd):/workspace --env HUGGINGFACE_HUB_CACHE=/workspace --name vllm_v0.6.6 rocm/vllm:rocm6.3.1_mi300_ubuntu22.04_py3.12_vllm_0.6.6
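
The docker run command passes /dev/kfd and /dev/dri into the container, so it can help to confirm that both device nodes exist on the host before starting it. The following is a minimal sketch, and check_gpu_devices is a hypothetical helper name introduced here.

```shell
# Verify that every device path passed as an argument exists.
# Prints the first missing path and returns a non-zero status if one is absent.
check_gpu_devices() {
    for dev in "$@"; do
        if [ ! -e "$dev" ]; then
            echo "missing: $dev"
            return 1
        fi
    done
    echo "all devices present"
}
```

On the host, check_gpu_devices /dev/kfd /dev/dri is expected to print "all devices present" when the GPU driver is loaded.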

In the Docker container, clone the ROCm MAD repository and navigate to the benchmark scripts directory at ~/MAD/scripts/vllm.

git clone https://github.com/ROCm/MAD
cd MAD/scripts/vllm

Command#

To start the benchmark, use the following command with the appropriate options. See Options and available models for the list of options and their descriptions.

./vllm_benchmark_report.sh -s $test_option -m $model_repo -g $num_gpu -d $datatype

See the examples for more information.

Note

The input sequence length, output sequence length, and tensor parallel (TP) size are already configured in the script, so you don't need to specify them.

Note

If you encounter the following error, set the HF_TOKEN environment variable to a Hugging Face token that has been granted access to the gated model.

OSError: You are trying to access a gated repo.

# pass your HF_TOKEN
export HF_TOKEN=$your_personal_hf_token
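
Because the gated-repo error only appears once the model download starts, a quick check before launching the benchmark can save time. The following is a minimal sketch, and require_hf_token is a hypothetical helper name introduced here. It only checks that the variable is non-empty, not that the token is valid or authorized.

```shell
# Fail early when HF_TOKEN is unset or empty, before starting a long benchmark run.
require_hf_token() {
    if [ -z "${HF_TOKEN:-}" ]; then
        echo "HF_TOKEN is not set; gated models will fail to download" >&2
        return 1
    fi
    echo "HF_TOKEN is set"
}
```

A typical use is require_hf_token && ./vllm_benchmark_report.sh followed by the usual options, so that the benchmark only starts when the token is present.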

Options and available models#

| Name | Options | Description |
|---|---|---|
| $test_option | latency | Measure decoding token latency |
| | throughput | Measure token generation throughput |
| | all | Measure both throughput and latency |
| $model_repo (float16) | meta-llama/Llama-3.1-8B-Instruct | Llama 3.1 8B |
| | meta-llama/Llama-3.1-70B-Instruct | Llama 3.1 70B |
| | meta-llama/Llama-3.1-405B-Instruct | Llama 3.1 405B |
| | meta-llama/Llama-3.2-11B-Vision-Instruct | Llama 3.2 11B Vision |
| | meta-llama/Llama-2-7b-chat-hf | Llama 2 7B |
| | meta-llama/Llama-2-70b-chat-hf | Llama 2 70B |
| | mistralai/Mixtral-8x7B-Instruct-v0.1 | Mixtral MoE 8x7B |
| | mistralai/Mixtral-8x22B-Instruct-v0.1 | Mixtral MoE 8x22B |
| | mistralai/Mistral-7B-Instruct-v0.3 | Mistral 7B |
| | Qwen/Qwen2-7B-Instruct | Qwen2 7B |
| | Qwen/Qwen2-72B-Instruct | Qwen2 72B |
| | core42/jais-13b-chat | JAIS 13B |
| | core42/jais-30b-chat-v3 | JAIS 30B |
| | databricks/dbrx-instruct | DBRX Instruct |
| | google/gemma-2-27b | Gemma 2 27B |
| | CohereForAI/c4ai-command-r-plus-08-2024 | C4AI Command R+ 08-2024 |
| | deepseek-ai/deepseek-moe-16b-chat | DeepSeek MoE 16B |
| $model_repo (float8) | amd/Llama-3.1-70B-Instruct-FP8-KV | Llama 3.1 70B FP8 |
| | amd/Llama-3.1-405B-Instruct-FP8-KV | Llama 3.1 405B FP8 |
| | amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV | Mixtral MoE 8x7B FP8 |
| | amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV | Mixtral MoE 8x22B FP8 |
| | amd/Mistral-7B-v0.1-FP8-KV | Mistral 7B FP8 |
| | amd/dbrx-instruct-FP8-KV | DBRX Instruct FP8 |
| | amd/c4ai-command-r-plus-FP8-KV | C4AI Command R+ 08-2024 FP8 |
| $num_gpu | 1 or 8 | Number of GPUs |
| $datatype | float16 or float8 | Data type |
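
The allowed values for $test_option, $num_gpu, and $datatype can be checked before the benchmark script is invoked. The following is a minimal sketch: validate_benchmark_args is a hypothetical helper, not part of the MAD scripts, and it does not validate $model_repo.

```shell
# Validate benchmark arguments against the values listed in the options table.
# Arguments: test option, number of GPUs, data type.
validate_benchmark_args() {
    test_option="$1"
    num_gpu="$2"
    datatype="$3"
    case "$test_option" in
        latency|throughput|all) ;;
        *) echo "invalid test option: $test_option"; return 1 ;;
    esac
    case "$num_gpu" in
        1|8) ;;
        *) echo "invalid GPU count: $num_gpu"; return 1 ;;
    esac
    case "$datatype" in
        float16|float8) ;;
        *) echo "invalid data type: $datatype"; return 1 ;;
    esac
    echo "ok"
}
```

For example, validate_benchmark_args latency 8 float16 prints "ok", while a GPU count of 4 is rejected because the table lists only 1 or 8.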

Running the benchmark on the MI300X GPU#

Here are some examples of running the benchmark with various options. See Options and available models for the list of options and their descriptions.

Example 1: latency benchmark#

Use these commands to benchmark the latency of the Llama 3.1 70B model on eight GPUs with the float16 and float8 data types.

./vllm_benchmark_report.sh -s latency -m meta-llama/Llama-3.1-70B-Instruct -g 8 -d float16
./vllm_benchmark_report.sh -s latency -m amd/Llama-3.1-70B-Instruct-FP8-KV -g 8 -d float8

Find the latency reports at:

  • ./reports_float16/summary/Llama-3.1-70B-Instruct_latency_report.csv

  • ./reports_float8/summary/Llama-3.1-70B-Instruct-FP8-KV_latency_report.csv

Example 2: throughput benchmark#

Use these commands to benchmark the throughput of the Llama 3.1 70B model on eight GPUs with the float16 and float8 data types.

./vllm_benchmark_report.sh -s throughput -m meta-llama/Llama-3.1-70B-Instruct -g 8 -d float16
./vllm_benchmark_report.sh -s throughput -m amd/Llama-3.1-70B-Instruct-FP8-KV -g 8 -d float8

Find the throughput reports at:

  • ./reports_float16/summary/Llama-3.1-70B-Instruct_throughput_report.csv

  • ./reports_float8/summary/Llama-3.1-70B-Instruct-FP8-KV_throughput_report.csv
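
The report paths in both examples follow the same pattern: the data type selects the reports directory, and the file name starts with the part of the model repository name after the slash. The following is a minimal sketch that rebuilds the path from those inputs. The pattern is inferred from the four paths listed in the examples, and report_path is a hypothetical helper name introduced here.

```shell
# Build the expected report path from the model repository, test option, and data type.
# Assumption: the naming pattern is inferred from the example paths in this topic.
report_path() {
    model_repo="$1"
    test_option="$2"
    datatype="$3"
    model_name="${model_repo##*/}"   # strip the organization prefix, such as "meta-llama/"
    echo "./reports_${datatype}/summary/${model_name}_${test_option}_report.csv"
}
```

For example, report_path meta-llama/Llama-3.1-70B-Instruct latency float16 prints ./reports_float16/summary/Llama-3.1-70B-Instruct_latency_report.csv.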

Note

Throughput is calculated as:

  • \[\mathit{throughput\_tot} = \frac{\mathit{requests} \times (\text{input lengths} + \text{output lengths})}{\mathit{elapsed\_time}}\]
  • \[\mathit{throughput\_gen} = \frac{\mathit{requests} \times \text{output lengths}}{\mathit{elapsed\_time}}\]
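
As a worked example of the two formulas, the following minimal sketch computes both values with awk. The function name compute_throughput and the sample numbers are illustrative assumptions, not measured results. The sketch also assumes that every request uses the same input and output lengths, measured in tokens, and that the elapsed time is in seconds.

```shell
# Compute total and generation throughput from the formulas above.
# Arguments: requests, input length, output length, elapsed time in seconds.
compute_throughput() {
    awk -v r="$1" -v i="$2" -v o="$3" -v t="$4" 'BEGIN {
        printf "throughput_tot=%.2f\n", r * (i + o) / t
        printf "throughput_gen=%.2f\n", r * o / t
    }'
}
```

With hypothetical inputs of 1000 requests, 128 input tokens, 128 output tokens, and 100 seconds elapsed, compute_throughput 1000 128 128 100 gives throughput_tot=2560.00 and throughput_gen=1280.00.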

Further reading#

Previous versions#

See vLLM inference performance testing version history to find documentation for previous releases of the ROCm/vllm Docker image.