LLM inference performance validation on AMD Instinct MI300X

LLM inference performance validation on AMD Instinct MI300X#

Applies to Linux

2025-04-15

10 min read time

The ROCm vLLM Docker image offers a prebuilt, optimized environment designed for validating large language model (LLM) inference performance on the AMD Instinct™ MI300X accelerator. This ROCm vLLM Docker image integrates vLLM and PyTorch tailored specifically for the MI300X accelerator and includes the following components:

ROCm 6.2.0
vLLM 0.4.3
PyTorch 2.4.0
Tuning files (in CSV format)

With this Docker image, you can quickly validate the expected inference performance numbers on the MI300X accelerator. This topic also provides tips on optimizing performance with popular AI models.

Note

vLLM is a toolkit and library for LLM inference and serving. It deploys the PagedAttention algorithm, which reduces memory consumption and increases throughput by leveraging dynamic key and value allocation in GPU memory. vLLM also incorporates many LLM acceleration and quantization algorithms. In addition, AMD implements high-performance custom kernels and modules in vLLM to enhance performance further. See vLLM inference and vLLM performance optimization for more information.

Getting started#

Use the following procedures to reproduce the benchmark results on an MI300X accelerator with the prebuilt vLLM Docker image.

Disable NUMA auto-balancing.

To optimize performance, disable automatic NUMA balancing. Otherwise, the GPU might hang until the periodic balancing is finalized. For more information, see AMD Instinct MI300X system optimization.
```
# disable automatic NUMA balancing
sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'
# check if NUMA balancing is disabled (returns 0 if disabled)
cat /proc/sys/kernel/numa_balancing
0
```
Download the ROCm vLLM Docker image.

Use the following command to pull the Docker image from Docker Hub.
```
docker pull rocm/vllm:rocm6.2_mi300_ubuntu22.04_py3.9_vllm_7c5fd50
```

Once setup is complete, you can choose between two options to reproduce the benchmark results:

MAD-integrated benchmarking
Standalone benchmarking

MAD-integrated benchmarking#

Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.

git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt

Use this command to run a performance benchmark test of the Llama 3.1 8B model on one GPU with float16 data type in the host machine.

export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
python3 tools/run_models.py --tags pyt_vllm_llama-3.1-8b --keep-model-dir --live-output --timeout 28800

ROCm MAD launches a Docker container with the name container_ci-pyt_vllm_llama-3.1-8b. The latency and throughput reports of the model are collected in the following path: ~/MAD/reports_float16/

Although the following eight models are pre-configured to collect latency and throughput performance data, users can also change the benchmarking parameters. Refer to the Standalone benchmarking section.

Available models#

pyt_vllm_llama-3.1-8b
pyt_vllm_llama-3.1-70b
pyt_vllm_llama-3.1-405b

pyt_vllm_llama-2-7b
pyt_vllm_mistral-7b
pyt_vllm_qwen2-7b

pyt_vllm_jais-13b
pyt_vllm_jais-30b

Standalone benchmarking#

You can run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.

docker pull rocm/vllm:rocm6.2_mi300_ubuntu22.04_py3.9_vllm_7c5fd50
docker run -it --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 128G --security-opt seccomp=unconfined --security-opt apparmor=unconfined --cap-add=SYS_PTRACE -v $(pwd):/workspace --env HUGGINGFACE_HUB_CACHE=/workspace --name unified_docker_vllm rocm/vllm:rocm6.2_mi300_ubuntu22.04_py3.9_vllm_7c5fd50

In the Docker container, clone the ROCm MAD repository and navigate to the benchmark scripts directory at ~/MAD/scripts/vllm.

git clone https://github.com/ROCm/MAD
cd MAD/scripts/vllm

Multiprocessing distributed executor#

To optimize vLLM performance, add the multiprocessing API server argument --distributed-executor-backend mp.

Command#

To start the benchmark, use the following command with the appropriate options. See Options for the list of options and their descriptions.

./vllm_benchmark_report.sh -s $test_option -m $model_repo -g $num_gpu -d $datatype

See the examples for more information.

Note

The input sequence length, output sequence length, and tensor parallel (TP) are already configured. You don’t need to specify them with this script.

Note

If you encounter the following error, pass your access-authorized Hugging Face token to the gated models.

OSError: You are trying to access a gated repo.

# pass your HF_TOKEN
export HF_TOKEN=$your_personal_hf_token

Options#

Name	Options	Description
`$test_option`	latency	Measure decoding token latency
	throughput	Measure token generation throughput
	all	Measure both throughput and latency
`$model_repo`	`meta-llama/Meta-Llama-3.1-8B-Instruct`	Llama 3.1 8B
(`float16`)	`meta-llama/Meta-Llama-3.1-70B-Instruct`	Llama 3.1 70B
	`meta-llama/Meta-Llama-3.1-405B-Instruct`	Llama 3.1 405B
	`meta-llama/Llama-2-7b-chat-hf`	Llama 2 7B
	`mistralai/Mixtral-8x7B-Instruct-v0.1`	Mixtral 8x7B
	`mistralai/Mixtral-8x22B-Instruct-v0.1`	Mixtral 8x22B
	`mistralai/Mistral-7B-Instruct-v0.3`	Mixtral 7B
	`Qwen/Qwen2-7B-Instruct`	Qwen2 7B
	`core42/jais-13b-chat`	JAIS 13B
	`core42/jais-30b-chat-v3`	JAIS 30B
`$num_gpu`	1 or 8	Number of GPUs
`$datatype`	`float16`	Data type

Running the benchmark on the MI300X accelerator#

Here are some examples of running the benchmark with various options. See Options for the list of options and their descriptions.

Latency benchmark example#

Use this command to benchmark the latency of the Llama 3.1 8B model on one GPU with the float16 data type.

./vllm_benchmark_report.sh -s latency -m meta-llama/Meta-Llama-3.1-8B-Instruct -g 1 -d float16

Find the latency report at:

./reports_float16/summary/Meta-Llama-3.1-8B-Instruct_latency_report.csv

Throughput benchmark example#

Use this command to benchmark the throughput of the Llama 3.1 8B model on one GPU with the float16 and float8 data types.

./vllm_benchmark_report.sh -s throughput -m meta-llama/Meta-Llama-3.1-8B-Instruct -g 1 -d float16

Find the throughput reports at:

./reports_float16/summary/Meta-Llama-3.1-8B-Instruct_throughput_report.csv

Note

Throughput is calculated as:

$t h r o u g h p u t_t o t = r e q u e s t s \times (input lengths + output lengths) / e l a p s e d_t i m e$
$t h r o u g h p u t_g e n = r e q u e s t s \times output lengths / e l a p s e d_t i m e$

LLM inference performance validation on AMD Instinct MI300X

Contents

LLM inference performance validation on AMD Instinct MI300X#

Getting started#

MAD-integrated benchmarking#

Available models#

Standalone benchmarking#

Multiprocessing distributed executor#

Command#

Options#

Running the benchmark on the MI300X accelerator#

Latency benchmark example#

Throughput benchmark example#

Further reading#