vLLM inference performance testing
2025-10-16
Caution
This documentation does not reflect the latest version of ROCm vLLM inference performance documentation. See vLLM inference performance testing for the latest version.
The ROCm vLLM Docker image offers a prebuilt, optimized environment for validating large language model (LLM) inference performance on AMD Instinct™ MI300X Series GPUs. The image integrates vLLM and PyTorch builds tailored for MI300X Series GPUs and includes the following components:
With this Docker image, you can quickly test the expected inference performance numbers for MI300X Series GPUs.
Available models
Note
See the Llama 3.1 8B model card on Hugging Face to learn more about your selected model. Some models require access authorization prior to use via an external license agreement through a third party.
Note
See the Llama 3.1 70B model card on Hugging Face to learn more about your selected model. Some models require access authorization prior to use via an external license agreement through a third party.
Note
See the Llama 3.1 405B model card on Hugging Face to learn more about your selected model. Some models require access authorization prior to use via an external license agreement through a third party.
Note
See the Llama 3.2 11B Vision model card on Hugging Face to learn more about your selected model. Some models require access authorization prior to use via an external license agreement through a third party.
Note
See the Llama 2 7B model card on Hugging Face to learn more about your selected model. Some models require access authorization prior to use via an external license agreement through a third party.
Note
See the Llama 2 70B model card on Hugging Face to learn more about your selected model. Some models require access authorization prior to use via an external license agreement through a third party.
Note
See the Llama 3.1 8B FP8 model card on Hugging Face to learn more about your selected model. Some models require access authorization prior to use via an external license agreement through a third party.
Note
See the Llama 3.1 70B FP8 model card on Hugging Face to learn more about your selected model. Some models require access authorization prior to use via an external license agreement through a third party.
Note
See the Llama 3.1 405B FP8 model card on Hugging Face to learn more about your selected model. Some models require access authorization prior to use via an external license agreement through a third party.
Note
See the Mixtral MoE 8x7B model card on Hugging Face to learn more about your selected model. Some models require access authorization prior to use via an external license agreement through a third party.
Note
See the Mixtral MoE 8x22B model card on Hugging Face to learn more about your selected model. Some models require access authorization prior to use via an external license agreement through a third party.
Note
See the Mistral 7B model card on Hugging Face to learn more about your selected model. Some models require access authorization prior to use via an external license agreement through a third party.
Note
See the Mixtral MoE 8x7B FP8 model card on Hugging Face to learn more about your selected model. Some models require access authorization prior to use via an external license agreement through a third party.
Note
See the Mixtral MoE 8x22B FP8 model card on Hugging Face to learn more about your selected model. Some models require access authorization prior to use via an external license agreement through a third party.
Note
See the Mistral 7B FP8 model card on Hugging Face to learn more about your selected model. Some models require access authorization prior to use via an external license agreement through a third party.
Note
See the Qwen2 7B model card on Hugging Face to learn more about your selected model. Some models require access authorization prior to use via an external license agreement through a third party.
Note
See the Qwen2 72B model card on Hugging Face to learn more about your selected model. Some models require access authorization prior to use via an external license agreement through a third party.
Note
See the JAIS 13B model card on Hugging Face to learn more about your selected model. Some models require access authorization prior to use via an external license agreement through a third party.
Note
See the JAIS 30B model card on Hugging Face to learn more about your selected model. Some models require access authorization prior to use via an external license agreement through a third party.
Note
See the DBRX Instruct model card on Hugging Face to learn more about your selected model. Some models require access authorization prior to use via an external license agreement through a third party.
Note
See the DBRX Instruct FP8 model card on Hugging Face to learn more about your selected model. Some models require access authorization prior to use via an external license agreement through a third party.
Note
See the Gemma 2 27B model card on Hugging Face to learn more about your selected model. Some models require access authorization prior to use via an external license agreement through a third party.
Note
See the C4AI Command R+ 08-2024 model card on Hugging Face to learn more about your selected model. Some models require access authorization prior to use via an external license agreement through a third party.
Note
See the C4AI Command R+ 08-2024 FP8 model card on Hugging Face to learn more about your selected model. Some models require access authorization prior to use via an external license agreement through a third party.
Note
See the DeepSeek MoE 16B model card on Hugging Face to learn more about your selected model. Some models require access authorization prior to use via an external license agreement through a third party.
Note
vLLM is a toolkit and library for LLM inference and serving. AMD implements high-performance custom kernels and modules in vLLM to enhance performance. See vLLM inference and vLLM V1 performance optimization for more information.
Performance measurements
For reference throughput and latency measurements of popular AI models, see the Performance results with AMD ROCm software page.
Important
The performance data presented in Performance results with AMD ROCm software only reflects the latest version of this inference benchmarking environment. The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X GPUs or ROCm software.
Advanced features and known issues
For information on experimental features and known issues related to ROCm optimization efforts on vLLM, see the developer’s guide at ROCm/vllm.
Getting started
Use the following procedures to reproduce the benchmark results on an MI300X GPU with the prebuilt vLLM Docker image.
- Disable NUMA auto-balancing.
  To optimize performance, disable automatic NUMA balancing. Otherwise, the GPU might hang until the periodic balancing is finalized. For more information, see the system validation steps.
  # disable automatic NUMA balancing
  sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'
  # check if NUMA balancing is disabled (returns 0 if disabled)
  cat /proc/sys/kernel/numa_balancing
  0
- Download the ROCm vLLM Docker image.
  Use the following command to pull the Docker image from Docker Hub.
  docker pull rocm/vllm:rocm6.3.1_instinct_vllm0.7.3_20250325
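Benchmarking
Once the setup is complete, choose between two options to reproduce the benchmark results: run the benchmark through MAD on the host machine, or run the vLLM benchmark tool standalone inside the Docker container.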
Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
Use this command to run the performance benchmark test on the Llama 3.1 8B model
using one GPU with the float16 data type on the host machine.
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
python3 tools/run_models.py --tags pyt_vllm_llama-3.1-8b --keep-model-dir --live-output --timeout 28800
MAD launches a Docker container with the name
container_ci-pyt_vllm_llama-3.1-8b. The latency and throughput reports of the
model are collected in the following path: ~/MAD/reports_float16/.
Although the available models are preconfigured to collect latency and throughput performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.
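To queue several models in one session, the MAD invocation above can be wrapped in a short script. This is a sketch under stated assumptions: the helper names `mad_command` and `run_tags` are hypothetical, the flags are copied from the command above, and the default working directory `MAD` assumes the repository was cloned into the current directory. It defaults to a dry run that only prints the commands.

```python
import shlex
import subprocess


def mad_command(tag, timeout=28800):
    """Build the argv for one MAD run, using the flags from the command above."""
    return [
        "python3", "tools/run_models.py",
        "--tags", tag,
        "--keep-model-dir",
        "--live-output",
        "--timeout", str(timeout),
    ]


def run_tags(tags, dry_run=True, cwd="MAD"):
    """Print each command (dry run) or execute it from the MAD checkout."""
    commands = [mad_command(tag) for tag in tags]
    for cmd in commands:
        if dry_run:
            print(shlex.join(cmd))
        else:
            # MAD_SECRETS_HFTOKEN must already be exported for gated models.
            subprocess.run(cmd, cwd=cwd, check=True)
    return commands


if __name__ == "__main__":
    run_tags(["pyt_vllm_llama-3.1-8b", "pyt_vllm_llama-3.1-70b"])
```

Call `run_tags(..., dry_run=False)` only after confirming the printed commands match what you intend to run.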
Run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.
docker pull rocm/vllm:rocm6.3.1_instinct_vllm0.7.3_20250325
docker run -it --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 16G --security-opt seccomp=unconfined --security-opt apparmor=unconfined --cap-add=SYS_PTRACE -v $(pwd):/workspace --env HUGGINGFACE_HUB_CACHE=/workspace --name test rocm/vllm:rocm6.3.1_instinct_vllm0.7.3_20250325
In the Docker container, clone the ROCm MAD repository and navigate to the
benchmark scripts directory at ~/MAD/scripts/vllm.
git clone https://github.com/ROCm/MAD
cd MAD/scripts/vllm
To start the benchmark, use the following command with the appropriate options.
./vllm_benchmark_report.sh -s $test_option -m meta-llama/Llama-3.1-8B-Instruct -g $num_gpu -d float16
| Name | Options | Description |
|---|---|---|
| $test_option | latency | Measure decoding token latency |
| | throughput | Measure token generation throughput |
| | all | Measure both throughput and latency |
| $num_gpu | 1 or 8 | Number of GPUs |
| -d value | float16 or float8 | Data type |
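The option table can be turned into a small validation step before the script is launched. The sketch below is illustrative only: `build_benchmark_command` is a hypothetical helper, and the accepted values are the ones that appear on this page (`latency`, `throughput`, `all`; 1 or 8 GPUs; `float16` or `float8`). It builds the command line but does not run it.

```python
import shlex

# Values taken from the option table and the commands on this page.
VALID_TESTS = {"latency", "throughput", "all"}
VALID_GPU_COUNTS = {1, 8}
VALID_DTYPES = {"float16", "float8"}


def build_benchmark_command(test_option, model, num_gpu, dtype):
    """Return the argv for vllm_benchmark_report.sh after checking each option."""
    if test_option not in VALID_TESTS:
        raise ValueError(f"unknown test option: {test_option!r}")
    if num_gpu not in VALID_GPU_COUNTS:
        raise ValueError(f"unsupported GPU count: {num_gpu!r}")
    if dtype not in VALID_DTYPES:
        raise ValueError(f"unsupported data type: {dtype!r}")
    return [
        "./vllm_benchmark_report.sh",
        "-s", test_option,
        "-m", model,
        "-g", str(num_gpu),
        "-d", dtype,
    ]


if __name__ == "__main__":
    cmd = build_benchmark_command(
        "latency", "meta-llama/Llama-3.1-8B-Instruct", 8, "float16"
    )
    print(shlex.join(cmd))
```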
Note
The input sequence length, output sequence length, and tensor parallel (TP) are already configured. You don’t need to specify them with this script.
Note
If you encounter the following error, pass your access-authorized Hugging Face token to the gated models.
OSError: You are trying to access a gated repo.
# pass your HF_TOKEN
export HF_TOKEN=$your_personal_hf_token
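A quick pre-flight check can catch a missing token before a long download fails. This is a minimal sketch, assuming the two environment variable names used on this page (`HF_TOKEN` for the standalone script, `MAD_SECRETS_HFTOKEN` for MAD); the function name is hypothetical. It reports which variable is set without printing the token itself.

```python
import os

# Variable names used on this page for the standalone and MAD workflows.
TOKEN_VARIABLES = ("HF_TOKEN", "MAD_SECRETS_HFTOKEN")


def find_hf_token_variable(env=None):
    """Return the name of the first non-empty token variable, or None."""
    env = os.environ if env is None else env
    for name in TOKEN_VARIABLES:
        if env.get(name, "").strip():
            return name
    return None


if __name__ == "__main__":
    name = find_hf_token_variable()
    if name is None:
        print("No Hugging Face token found; gated models will fail to download")
    else:
        print(f"Using the token from {name}")
```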
Here are some examples of running the benchmark with various options.
- Latency benchmark
  Use this command to benchmark the latency of the Llama 3.1 8B model on eight GPUs with the float16 data type.
  ./vllm_benchmark_report.sh -s latency -m meta-llama/Llama-3.1-8B-Instruct -g 8 -d float16
  Find the latency report at ./reports_float16_vllm_rocm6.3.1/summary/Llama-3.1-8B-Instruct_latency_report.csv.
- Throughput benchmark
  Use this command to benchmark the throughput of the Llama 3.1 8B model on eight GPUs with the float16 data type.
  ./vllm_benchmark_report.sh -s throughput -m meta-llama/Llama-3.1-8B-Instruct -g 8 -d float16
  Find the throughput report at ./reports_float16_vllm_rocm6.3.1/summary/Llama-3.1-8B-Instruct_throughput_report.csv.
Note
Throughput is calculated as:
- \[throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time\]
- \[throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time\]
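As a worked example of the two formulas, the sketch below computes both values for one hypothetical run. The numbers (1000 requests, 128 input tokens, 128 output tokens, 64 seconds) are made up for illustration and are not measured results.

```python
def throughput_tot(requests, input_len, output_len, elapsed_time):
    """Total tokens (input plus output) processed per second."""
    return requests * (input_len + output_len) / elapsed_time


def throughput_gen(requests, output_len, elapsed_time):
    """Generated (output) tokens per second."""
    return requests * output_len / elapsed_time


if __name__ == "__main__":
    # Hypothetical run: 1000 requests, 128 input and 128 output tokens each, 64 s.
    print(throughput_tot(1000, 128, 128, 64))  # 4000.0 tokens/s
    print(throughput_gen(1000, 128, 64))       # 2000.0 tokens/s
```

The total figure counts prompt tokens as well, so it is always at least as large as the generation figure for the same run.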
Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
Use this command to run the performance benchmark test on the Llama 3.1 70B model
using one GPU with the float16 data type on the host machine.
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
python3 tools/run_models.py --tags pyt_vllm_llama-3.1-70b --keep-model-dir --live-output --timeout 28800
MAD launches a Docker container with the name
container_ci-pyt_vllm_llama-3.1-70b. The latency and throughput reports of the
model are collected in the following path: ~/MAD/reports_float16/.
Although the available models are preconfigured to collect latency and throughput performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.
Run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.
docker pull rocm/vllm:rocm6.3.1_instinct_vllm0.7.3_20250325
docker run -it --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 16G --security-opt seccomp=unconfined --security-opt apparmor=unconfined --cap-add=SYS_PTRACE -v $(pwd):/workspace --env HUGGINGFACE_HUB_CACHE=/workspace --name test rocm/vllm:rocm6.3.1_instinct_vllm0.7.3_20250325
In the Docker container, clone the ROCm MAD repository and navigate to the
benchmark scripts directory at ~/MAD/scripts/vllm.
git clone https://github.com/ROCm/MAD
cd MAD/scripts/vllm
To start the benchmark, use the following command with the appropriate options.
./vllm_benchmark_report.sh -s $test_option -m meta-llama/Llama-3.1-70B-Instruct -g $num_gpu -d float16
| Name | Options | Description |
|---|---|---|
| $test_option | latency | Measure decoding token latency |
| | throughput | Measure token generation throughput |
| | all | Measure both throughput and latency |
| $num_gpu | 1 or 8 | Number of GPUs |
| -d value | float16 or float8 | Data type |
Note
The input sequence length, output sequence length, and tensor parallel (TP) are already configured. You don’t need to specify them with this script.
Note
If you encounter the following error, pass your access-authorized Hugging Face token to the gated models.
OSError: You are trying to access a gated repo.
# pass your HF_TOKEN
export HF_TOKEN=$your_personal_hf_token
Here are some examples of running the benchmark with various options.
- Latency benchmark
  Use this command to benchmark the latency of the Llama 3.1 70B model on eight GPUs with the float16 data type.
  ./vllm_benchmark_report.sh -s latency -m meta-llama/Llama-3.1-70B-Instruct -g 8 -d float16
  Find the latency report at ./reports_float16_vllm_rocm6.3.1/summary/Llama-3.1-70B-Instruct_latency_report.csv.
- Throughput benchmark
  Use this command to benchmark the throughput of the Llama 3.1 70B model on eight GPUs with the float16 data type.
  ./vllm_benchmark_report.sh -s throughput -m meta-llama/Llama-3.1-70B-Instruct -g 8 -d float16
  Find the throughput report at ./reports_float16_vllm_rocm6.3.1/summary/Llama-3.1-70B-Instruct_throughput_report.csv.
Note
Throughput is calculated as:
- \[throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time\]
- \[throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time\]
Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
Use this command to run the performance benchmark test on the Llama 3.1 405B model
using one GPU with the float16 data type on the host machine.
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
python3 tools/run_models.py --tags pyt_vllm_llama-3.1-405b --keep-model-dir --live-output --timeout 28800
MAD launches a Docker container with the name
container_ci-pyt_vllm_llama-3.1-405b. The latency and throughput reports of the
model are collected in the following path: ~/MAD/reports_float16/.
Although the available models are preconfigured to collect latency and throughput performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.
Run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.
docker pull rocm/vllm:rocm6.3.1_instinct_vllm0.7.3_20250325
docker run -it --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 16G --security-opt seccomp=unconfined --security-opt apparmor=unconfined --cap-add=SYS_PTRACE -v $(pwd):/workspace --env HUGGINGFACE_HUB_CACHE=/workspace --name test rocm/vllm:rocm6.3.1_instinct_vllm0.7.3_20250325
In the Docker container, clone the ROCm MAD repository and navigate to the
benchmark scripts directory at ~/MAD/scripts/vllm.
git clone https://github.com/ROCm/MAD
cd MAD/scripts/vllm
To start the benchmark, use the following command with the appropriate options.
./vllm_benchmark_report.sh -s $test_option -m meta-llama/Llama-3.1-405B-Instruct -g $num_gpu -d float16
| Name | Options | Description |
|---|---|---|
| $test_option | latency | Measure decoding token latency |
| | throughput | Measure token generation throughput |
| | all | Measure both throughput and latency |
| $num_gpu | 1 or 8 | Number of GPUs |
| -d value | float16 or float8 | Data type |
Note
The input sequence length, output sequence length, and tensor parallel (TP) are already configured. You don’t need to specify them with this script.
Note
If you encounter the following error, pass your access-authorized Hugging Face token to the gated models.
OSError: You are trying to access a gated repo.
# pass your HF_TOKEN
export HF_TOKEN=$your_personal_hf_token
Here are some examples of running the benchmark with various options.
- Latency benchmark
  Use this command to benchmark the latency of the Llama 3.1 405B model on eight GPUs with the float16 data type.
  ./vllm_benchmark_report.sh -s latency -m meta-llama/Llama-3.1-405B-Instruct -g 8 -d float16
  Find the latency report at ./reports_float16_vllm_rocm6.3.1/summary/Llama-3.1-405B-Instruct_latency_report.csv.
- Throughput benchmark
  Use this command to benchmark the throughput of the Llama 3.1 405B model on eight GPUs with the float16 data type.
  ./vllm_benchmark_report.sh -s throughput -m meta-llama/Llama-3.1-405B-Instruct -g 8 -d float16
  Find the throughput report at ./reports_float16_vllm_rocm6.3.1/summary/Llama-3.1-405B-Instruct_throughput_report.csv.
Note
Throughput is calculated as:
- \[throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time\]
- \[throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time\]
Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
Use this command to run the performance benchmark test on the Llama 3.2 11B Vision model
using one GPU with the float16 data type on the host machine.
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
python3 tools/run_models.py --tags pyt_vllm_llama-3.2-11b-vision-instruct --keep-model-dir --live-output --timeout 28800
MAD launches a Docker container with the name
container_ci-pyt_vllm_llama-3.2-11b-vision-instruct. The latency and throughput reports of the
model are collected in the following path: ~/MAD/reports_float16/.
Although the available models are preconfigured to collect latency and throughput performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.
Run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.
docker pull rocm/vllm:rocm6.3.1_instinct_vllm0.7.3_20250325
docker run -it --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 16G --security-opt seccomp=unconfined --security-opt apparmor=unconfined --cap-add=SYS_PTRACE -v $(pwd):/workspace --env HUGGINGFACE_HUB_CACHE=/workspace --name test rocm/vllm:rocm6.3.1_instinct_vllm0.7.3_20250325
In the Docker container, clone the ROCm MAD repository and navigate to the
benchmark scripts directory at ~/MAD/scripts/vllm.
git clone https://github.com/ROCm/MAD
cd MAD/scripts/vllm
To start the benchmark, use the following command with the appropriate options.
./vllm_benchmark_report.sh -s $test_option -m meta-llama/Llama-3.2-11B-Vision-Instruct -g $num_gpu -d float16
| Name | Options | Description |
|---|---|---|
| $test_option | latency | Measure decoding token latency |
| | throughput | Measure token generation throughput |
| | all | Measure both throughput and latency |
| $num_gpu | 1 or 8 | Number of GPUs |
| -d value | float16 or float8 | Data type |
Note
The input sequence length, output sequence length, and tensor parallel (TP) are already configured. You don’t need to specify them with this script.
Note
If you encounter the following error, pass your access-authorized Hugging Face token to the gated models.
OSError: You are trying to access a gated repo.
# pass your HF_TOKEN
export HF_TOKEN=$your_personal_hf_token
Here are some examples of running the benchmark with various options.
- Latency benchmark
  Use this command to benchmark the latency of the Llama 3.2 11B Vision model on eight GPUs with the float16 data type.
  ./vllm_benchmark_report.sh -s latency -m meta-llama/Llama-3.2-11B-Vision-Instruct -g 8 -d float16
  Find the latency report at ./reports_float16_vllm_rocm6.3.1/summary/Llama-3.2-11B-Vision-Instruct_latency_report.csv.
- Throughput benchmark
  Use this command to benchmark the throughput of the Llama 3.2 11B Vision model on eight GPUs with the float16 data type.
  ./vllm_benchmark_report.sh -s throughput -m meta-llama/Llama-3.2-11B-Vision-Instruct -g 8 -d float16
  Find the throughput report at ./reports_float16_vllm_rocm6.3.1/summary/Llama-3.2-11B-Vision-Instruct_throughput_report.csv.
Note
Throughput is calculated as:
- \[throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time\]
- \[throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time\]
Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
Use this command to run the performance benchmark test on the Llama 2 7B model
using one GPU with the float16 data type on the host machine.
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
python3 tools/run_models.py --tags pyt_vllm_llama-2-7b --keep-model-dir --live-output --timeout 28800
MAD launches a Docker container with the name
container_ci-pyt_vllm_llama-2-7b. The latency and throughput reports of the
model are collected in the following path: ~/MAD/reports_float16/.
Although the available models are preconfigured to collect latency and throughput performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.
Run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.
docker pull rocm/vllm:rocm6.3.1_instinct_vllm0.7.3_20250325
docker run -it --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 16G --security-opt seccomp=unconfined --security-opt apparmor=unconfined --cap-add=SYS_PTRACE -v $(pwd):/workspace --env HUGGINGFACE_HUB_CACHE=/workspace --name test rocm/vllm:rocm6.3.1_instinct_vllm0.7.3_20250325
In the Docker container, clone the ROCm MAD repository and navigate to the
benchmark scripts directory at ~/MAD/scripts/vllm.
git clone https://github.com/ROCm/MAD
cd MAD/scripts/vllm
To start the benchmark, use the following command with the appropriate options.
./vllm_benchmark_report.sh -s $test_option -m meta-llama/Llama-2-7b-chat-hf -g $num_gpu -d float16
| Name | Options | Description |
|---|---|---|
| $test_option | latency | Measure decoding token latency |
| | throughput | Measure token generation throughput |
| | all | Measure both throughput and latency |
| $num_gpu | 1 or 8 | Number of GPUs |
| -d value | float16 or float8 | Data type |
Note
The input sequence length, output sequence length, and tensor parallel (TP) are already configured. You don’t need to specify them with this script.
Note
If you encounter the following error, pass your access-authorized Hugging Face token to the gated models.
OSError: You are trying to access a gated repo.
# pass your HF_TOKEN
export HF_TOKEN=$your_personal_hf_token
Here are some examples of running the benchmark with various options.
- Latency benchmark
  Use this command to benchmark the latency of the Llama 2 7B model on eight GPUs with the float16 data type.
  ./vllm_benchmark_report.sh -s latency -m meta-llama/Llama-2-7b-chat-hf -g 8 -d float16
  Find the latency report at ./reports_float16_vllm_rocm6.3.1/summary/Llama-2-7b-chat-hf_latency_report.csv.
- Throughput benchmark
  Use this command to benchmark the throughput of the Llama 2 7B model on eight GPUs with the float16 data type.
  ./vllm_benchmark_report.sh -s throughput -m meta-llama/Llama-2-7b-chat-hf -g 8 -d float16
  Find the throughput report at ./reports_float16_vllm_rocm6.3.1/summary/Llama-2-7b-chat-hf_throughput_report.csv.
Note
Throughput is calculated as:
- \[throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time\]
- \[throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time\]
Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
Use this command to run the performance benchmark test on the Llama 2 70B model
using one GPU with the float16 data type on the host machine.
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
python3 tools/run_models.py --tags pyt_vllm_llama-2-70b --keep-model-dir --live-output --timeout 28800
MAD launches a Docker container with the name
container_ci-pyt_vllm_llama-2-70b. The latency and throughput reports of the
model are collected in the following path: ~/MAD/reports_float16/.
Although the available models are preconfigured to collect latency and throughput performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.
Run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.
docker pull rocm/vllm:rocm6.3.1_instinct_vllm0.7.3_20250325
docker run -it --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 16G --security-opt seccomp=unconfined --security-opt apparmor=unconfined --cap-add=SYS_PTRACE -v $(pwd):/workspace --env HUGGINGFACE_HUB_CACHE=/workspace --name test rocm/vllm:rocm6.3.1_instinct_vllm0.7.3_20250325
In the Docker container, clone the ROCm MAD repository and navigate to the
benchmark scripts directory at ~/MAD/scripts/vllm.
git clone https://github.com/ROCm/MAD
cd MAD/scripts/vllm
To start the benchmark, use the following command with the appropriate options.
./vllm_benchmark_report.sh -s $test_option -m meta-llama/Llama-2-70b-chat-hf -g $num_gpu -d float16
| Name | Options | Description |
|---|---|---|
| $test_option | latency | Measure decoding token latency |
| | throughput | Measure token generation throughput |
| | all | Measure both throughput and latency |
| $num_gpu | 1 or 8 | Number of GPUs |
| -d value | float16 or float8 | Data type |
Note
The input sequence length, output sequence length, and tensor parallel (TP) are already configured. You don’t need to specify them with this script.
Note
If you encounter the following error, pass your access-authorized Hugging Face token to the gated models.
OSError: You are trying to access a gated repo.
# pass your HF_TOKEN
export HF_TOKEN=$your_personal_hf_token
Here are some examples of running the benchmark with various options.
- Latency benchmark
  Use this command to benchmark the latency of the Llama 2 70B model on eight GPUs with the float16 data type.
  ./vllm_benchmark_report.sh -s latency -m meta-llama/Llama-2-70b-chat-hf -g 8 -d float16
  Find the latency report at ./reports_float16_vllm_rocm6.3.1/summary/Llama-2-70b-chat-hf_latency_report.csv.
- Throughput benchmark
  Use this command to benchmark the throughput of the Llama 2 70B model on eight GPUs with the float16 data type.
  ./vllm_benchmark_report.sh -s throughput -m meta-llama/Llama-2-70b-chat-hf -g 8 -d float16
  Find the throughput report at ./reports_float16_vllm_rocm6.3.1/summary/Llama-2-70b-chat-hf_throughput_report.csv.
Note
Throughput is calculated as:
- \[throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time\]
- \[throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time\]
Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
Use this command to run the performance benchmark test on the Llama 3.1 8B FP8 model
using one GPU with the float8 data type on the host machine.
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
python3 tools/run_models.py --tags pyt_vllm_llama-3.1-8b_fp8 --keep-model-dir --live-output --timeout 28800
MAD launches a Docker container with the name
container_ci-pyt_vllm_llama-3.1-8b_fp8. The latency and throughput reports of the
model are collected in the following path: ~/MAD/reports_float8/.
Although the available models are preconfigured to collect latency and throughput performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.
Run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.
docker pull rocm/vllm:rocm6.3.1_instinct_vllm0.7.3_20250325
docker run -it --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 16G --security-opt seccomp=unconfined --security-opt apparmor=unconfined --cap-add=SYS_PTRACE -v $(pwd):/workspace --env HUGGINGFACE_HUB_CACHE=/workspace --name test rocm/vllm:rocm6.3.1_instinct_vllm0.7.3_20250325
In the Docker container, clone the ROCm MAD repository and navigate to the
benchmark scripts directory at ~/MAD/scripts/vllm.
git clone https://github.com/ROCm/MAD
cd MAD/scripts/vllm
To start the benchmark, use the following command with the appropriate options.
./vllm_benchmark_report.sh -s $test_option -m amd/Llama-3.1-8B-Instruct-FP8-KV -g $num_gpu -d float8
| Name | Options | Description |
|---|---|---|
| $test_option | latency | Measure decoding token latency |
| | throughput | Measure token generation throughput |
| | all | Measure both throughput and latency |
| $num_gpu | 1 or 8 | Number of GPUs |
| -d value | float16 or float8 | Data type |
Note
The input sequence length, output sequence length, and tensor parallel (TP) are already configured. You don’t need to specify them with this script.
Note
If you encounter the following error, pass your access-authorized Hugging Face token to the gated models.
OSError: You are trying to access a gated repo.
# pass your HF_TOKEN
export HF_TOKEN=$your_personal_hf_token
Here are some examples of running the benchmark with various options.
- Latency benchmark
  Use this command to benchmark the latency of the Llama 3.1 8B FP8 model on eight GPUs with the float8 data type.
  ./vllm_benchmark_report.sh -s latency -m amd/Llama-3.1-8B-Instruct-FP8-KV -g 8 -d float8
  Find the latency report at ./reports_float8_vllm_rocm6.3.1/summary/Llama-3.1-8B-Instruct-FP8-KV_latency_report.csv.
- Throughput benchmark
  Use this command to benchmark the throughput of the Llama 3.1 8B FP8 model on eight GPUs with the float8 data type.
  ./vllm_benchmark_report.sh -s throughput -m amd/Llama-3.1-8B-Instruct-FP8-KV -g 8 -d float8
  Find the throughput report at ./reports_float8_vllm_rocm6.3.1/summary/Llama-3.1-8B-Instruct-FP8-KV_throughput_report.csv.
Note
Throughput is calculated as:
- \[throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time\]
- \[throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time\]
Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
Use this command to run the performance benchmark test on the Llama 3.1 70B FP8 model
using one GPU with the float8 data type on the host machine.
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
python3 tools/run_models.py --tags pyt_vllm_llama-3.1-70b_fp8 --keep-model-dir --live-output --timeout 28800
MAD launches a Docker container with the name
container_ci-pyt_vllm_llama-3.1-70b_fp8. The latency and throughput reports of the
model are collected in the following path: ~/MAD/reports_float8/.
Although the available models are preconfigured to collect latency and throughput performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.
Run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.
docker pull rocm/vllm:rocm6.3.1_instinct_vllm0.7.3_20250325
docker run -it --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 16G --security-opt seccomp=unconfined --security-opt apparmor=unconfined --cap-add=SYS_PTRACE -v $(pwd):/workspace --env HUGGINGFACE_HUB_CACHE=/workspace --name test rocm/vllm:rocm6.3.1_instinct_vllm0.7.3_20250325
In the Docker container, clone the ROCm MAD repository and navigate to the
benchmark scripts directory at ~/MAD/scripts/vllm.
git clone https://github.com/ROCm/MAD
cd MAD/scripts/vllm
To start the benchmark, use the following command with the appropriate options.
./vllm_benchmark_report.sh -s $test_option -m amd/Llama-3.1-70B-Instruct-FP8-KV -g $num_gpu -d float8
| Name | Options | Description |
|---|---|---|
| $test_option | latency | Measure decoding token latency |
| | throughput | Measure token generation throughput |
| | all | Measure both throughput and latency |
| $num_gpu | 1 or 8 | Number of GPUs |
| | | Data type |
Note
The input sequence length, output sequence length, and tensor parallel (TP) are already configured. You don’t need to specify them with this script.
Note
If you encounter the following error, pass your access-authorized Hugging Face token to the gated models.
OSError: You are trying to access a gated repo.
# pass your HF_TOKEN
export HF_TOKEN=$your_personal_hf_token
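To fail early instead of partway through a model download, a wrapper script could check for the token before launching the benchmark. The following is a minimal sketch: the helper name find_hf_token is hypothetical, and it only inspects the HF_TOKEN variable named in the note above.

```python
import os
from typing import Mapping, Optional


def find_hf_token(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the Hugging Face token from HF_TOKEN, or raise if it is unset or empty."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError("HF_TOKEN is not set; export it before benchmarking gated models")
    return token
```

Passing a mapping explicitly makes the check easy to exercise without touching the real environment.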
Here are some examples of running the benchmark with various options.
- Latency benchmark
  Use this command to benchmark the latency of the Llama 3.1 70B FP8 model on eight GPUs with the float8 data type.
  ./vllm_benchmark_report.sh -s latency -m amd/Llama-3.1-70B-Instruct-FP8-KV -g 8 -d float8
  Find the latency report at ./reports_float8_vllm_rocm6.3.1/summary/Llama-3.1-70B-Instruct-FP8-KV_latency_report.csv.
- Throughput benchmark
  Use this command to benchmark the throughput of the Llama 3.1 70B FP8 model on eight GPUs with the float8 data type.
  ./vllm_benchmark_report.sh -s throughput -m amd/Llama-3.1-70B-Instruct-FP8-KV -g 8 -d float8
  Find the throughput report at ./reports_float8_vllm_rocm6.3.1/summary/Llama-3.1-70B-Instruct-FP8-KV_throughput_report.csv.
Note
Throughput is calculated as:
- \[throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time\]
- \[throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time\]
Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
Use this command to run the performance benchmark test on the Llama 3.1 405B FP8 model
using one GPU with the float8 data type on the host machine.
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
python3 tools/run_models.py --tags pyt_vllm_llama-3.1-405b_fp8 --keep-model-dir --live-output --timeout 28800
MAD launches a Docker container with the name
container_ci-pyt_vllm_llama-3.1-405b_fp8. The latency and throughput reports of the
model are collected in the following path: ~/MAD/reports_float8/.
Although the available models are preconfigured to collect latency and throughput performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.
Run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.
docker pull rocm/vllm:rocm6.3.1_instinct_vllm0.7.3_20250325
docker run -it --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 16G --security-opt seccomp=unconfined --security-opt apparmor=unconfined --cap-add=SYS_PTRACE -v $(pwd):/workspace --env HUGGINGFACE_HUB_CACHE=/workspace --name test rocm/vllm:rocm6.3.1_instinct_vllm0.7.3_20250325
In the Docker container, clone the ROCm MAD repository and navigate to the
benchmark scripts directory at ~/MAD/scripts/vllm.
git clone https://github.com/ROCm/MAD
cd MAD/scripts/vllm
To start the benchmark, use the following command with the appropriate options.
./vllm_benchmark_report.sh -s $test_option -m amd/Llama-3.1-405B-Instruct-FP8-KV -g $num_gpu -d float8
| Name | Options | Description |
|---|---|---|
| $test_option | latency | Measure decoding token latency |
| | throughput | Measure token generation throughput |
| | all | Measure both throughput and latency |
| $num_gpu | 1 or 8 | Number of GPUs |
| | | Data type |
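When the benchmark is driven from a script, it can help to validate the options before invoking it. The sketch below assembles the argument list using only the flags shown in the command above (-s, -m, -g, -d) and the values listed in the table. The function name build_benchmark_command is hypothetical, and the data type is passed through unchecked.

```python
from typing import List

TEST_OPTIONS = ("latency", "throughput", "all")  # values from the options table
GPU_COUNTS = (1, 8)                              # values from the options table


def build_benchmark_command(test_option: str, model: str, num_gpu: int, dtype: str) -> List[str]:
    """Return the argv for vllm_benchmark_report.sh after checking the documented options."""
    if test_option not in TEST_OPTIONS:
        raise ValueError(f"test_option must be one of {TEST_OPTIONS}")
    if num_gpu not in GPU_COUNTS:
        raise ValueError(f"num_gpu must be one of {GPU_COUNTS}")
    return [
        "./vllm_benchmark_report.sh",
        "-s", test_option,
        "-m", model,
        "-g", str(num_gpu),
        "-d", dtype,
    ]
```

Inside the container, the returned list could be passed to subprocess.run(cmd, check=True) from the MAD/scripts/vllm directory.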
Note
The input sequence length, output sequence length, and tensor parallel (TP) are already configured. You don’t need to specify them with this script.
Note
If you encounter the following error, pass your access-authorized Hugging Face token to the gated models.
OSError: You are trying to access a gated repo.
# pass your HF_TOKEN
export HF_TOKEN=$your_personal_hf_token
Here are some examples of running the benchmark with various options.
- Latency benchmark
  Use this command to benchmark the latency of the Llama 3.1 405B FP8 model on eight GPUs with the float8 data type.
  ./vllm_benchmark_report.sh -s latency -m amd/Llama-3.1-405B-Instruct-FP8-KV -g 8 -d float8
  Find the latency report at ./reports_float8_vllm_rocm6.3.1/summary/Llama-3.1-405B-Instruct-FP8-KV_latency_report.csv.
- Throughput benchmark
  Use this command to benchmark the throughput of the Llama 3.1 405B FP8 model on eight GPUs with the float8 data type.
  ./vllm_benchmark_report.sh -s throughput -m amd/Llama-3.1-405B-Instruct-FP8-KV -g 8 -d float8
  Find the throughput report at ./reports_float8_vllm_rocm6.3.1/summary/Llama-3.1-405B-Instruct-FP8-KV_throughput_report.csv.
Note
Throughput is calculated as:
- \[throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time\]
- \[throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time\]
Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
Use this command to run the performance benchmark test on the Mixtral MoE 8x7B model
using one GPU with the float16 data type on the host machine.
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
python3 tools/run_models.py --tags pyt_vllm_mixtral-8x7b --keep-model-dir --live-output --timeout 28800
MAD launches a Docker container with the name
container_ci-pyt_vllm_mixtral-8x7b. The latency and throughput reports of the
model are collected in the following path: ~/MAD/reports_float16/.
Although the available models are preconfigured to collect latency and throughput performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.
Run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.
docker pull rocm/vllm:rocm6.3.1_instinct_vllm0.7.3_20250325
docker run -it --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 16G --security-opt seccomp=unconfined --security-opt apparmor=unconfined --cap-add=SYS_PTRACE -v $(pwd):/workspace --env HUGGINGFACE_HUB_CACHE=/workspace --name test rocm/vllm:rocm6.3.1_instinct_vllm0.7.3_20250325
In the Docker container, clone the ROCm MAD repository and navigate to the
benchmark scripts directory at ~/MAD/scripts/vllm.
git clone https://github.com/ROCm/MAD
cd MAD/scripts/vllm
To start the benchmark, use the following command with the appropriate options.
./vllm_benchmark_report.sh -s $test_option -m mistralai/Mixtral-8x7B-Instruct-v0.1 -g $num_gpu -d float16
| Name | Options | Description |
|---|---|---|
| $test_option | latency | Measure decoding token latency |
| | throughput | Measure token generation throughput |
| | all | Measure both throughput and latency |
| $num_gpu | 1 or 8 | Number of GPUs |
| | | Data type |
Note
The input sequence length, output sequence length, and tensor parallel (TP) are already configured. You don’t need to specify them with this script.
Note
If you encounter the following error, pass your access-authorized Hugging Face token to the gated models.
OSError: You are trying to access a gated repo.
# pass your HF_TOKEN
export HF_TOKEN=$your_personal_hf_token
Here are some examples of running the benchmark with various options.
- Latency benchmark
  Use this command to benchmark the latency of the Mixtral MoE 8x7B model on eight GPUs with the float16 data type.
  ./vllm_benchmark_report.sh -s latency -m mistralai/Mixtral-8x7B-Instruct-v0.1 -g 8 -d float16
  Find the latency report at ./reports_float16_vllm_rocm6.3.1/summary/Mixtral-8x7B-Instruct-v0.1_latency_report.csv.
- Throughput benchmark
  Use this command to benchmark the throughput of the Mixtral MoE 8x7B model on eight GPUs with the float16 data type.
  ./vllm_benchmark_report.sh -s throughput -m mistralai/Mixtral-8x7B-Instruct-v0.1 -g 8 -d float16
  Find the throughput report at ./reports_float16_vllm_rocm6.3.1/summary/Mixtral-8x7B-Instruct-v0.1_throughput_report.csv.
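The report locations in these examples appear to follow a regular pattern: the data type, then a fixed vllm_rocm6.3.1 suffix, then the model name without its organization prefix. The sketch below encodes that pattern as inferred from the example paths only; other image versions may use a different layout, and the helper name report_path is hypothetical.

```python
from pathlib import PurePosixPath


def report_path(model: str, test_option: str, dtype: str, rocm_tag: str = "rocm6.3.1") -> PurePosixPath:
    """Expected summary report path, relative to the benchmark scripts directory."""
    model_name = model.split("/")[-1]  # drop the organization prefix, e.g. "mistralai/"
    return (
        PurePosixPath(f"reports_{dtype}_vllm_{rocm_tag}")
        / "summary"
        / f"{model_name}_{test_option}_report.csv"
    )
```

For example, report_path("mistralai/Mixtral-8x7B-Instruct-v0.1", "latency", "float16") gives the latency report path listed above, without the leading ./ prefix.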
Note
Throughput is calculated as:
- \[throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time\]
- \[throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time\]
Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
Use this command to run the performance benchmark test on the Mixtral MoE 8x22B model
using one GPU with the float16 data type on the host machine.
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
python3 tools/run_models.py --tags pyt_vllm_mixtral-8x22b --keep-model-dir --live-output --timeout 28800
MAD launches a Docker container with the name
container_ci-pyt_vllm_mixtral-8x22b. The latency and throughput reports of the
model are collected in the following path: ~/MAD/reports_float16/.
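Once a run finishes, the collected reports can be inspected programmatically. The sketch below walks a report directory and loads every CSV file it finds. It makes no assumption about column names, since the report schema is not described here, and the helper name load_reports is hypothetical.

```python
import csv
from pathlib import Path
from typing import Dict, List


def load_reports(report_dir: Path) -> Dict[str, List[dict]]:
    """Map each CSV file name under report_dir (searched recursively) to its rows."""
    reports: Dict[str, List[dict]] = {}
    for csv_path in sorted(report_dir.rglob("*.csv")):
        with csv_path.open(newline="") as handle:
            reports[csv_path.name] = list(csv.DictReader(handle))
    return reports
```

A call such as load_reports(Path.home() / "MAD" / "reports_float16") would cover the directory named above. Files with the same name in different subdirectories would overwrite each other in this simple mapping.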
Although the available models are preconfigured to collect latency and throughput performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.
Run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.
docker pull rocm/vllm:rocm6.3.1_instinct_vllm0.7.3_20250325
docker run -it --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 16G --security-opt seccomp=unconfined --security-opt apparmor=unconfined --cap-add=SYS_PTRACE -v $(pwd):/workspace --env HUGGINGFACE_HUB_CACHE=/workspace --name test rocm/vllm:rocm6.3.1_instinct_vllm0.7.3_20250325
In the Docker container, clone the ROCm MAD repository and navigate to the
benchmark scripts directory at ~/MAD/scripts/vllm.
git clone https://github.com/ROCm/MAD
cd MAD/scripts/vllm
To start the benchmark, use the following command with the appropriate options.
./vllm_benchmark_report.sh -s $test_option -m mistralai/Mixtral-8x22B-Instruct-v0.1 -g $num_gpu -d float16
| Name | Options | Description |
|---|---|---|
| $test_option | latency | Measure decoding token latency |
| | throughput | Measure token generation throughput |
| | all | Measure both throughput and latency |
| $num_gpu | 1 or 8 | Number of GPUs |
| | | Data type |
Note
The input sequence length, output sequence length, and tensor parallel (TP) are already configured. You don’t need to specify them with this script.
Note
If you encounter the following error, pass your access-authorized Hugging Face token to the gated models.
OSError: You are trying to access a gated repo.
# pass your HF_TOKEN
export HF_TOKEN=$your_personal_hf_token
Here are some examples of running the benchmark with various options.
- Latency benchmark
  Use this command to benchmark the latency of the Mixtral MoE 8x22B model on eight GPUs with the float16 data type.
  ./vllm_benchmark_report.sh -s latency -m mistralai/Mixtral-8x22B-Instruct-v0.1 -g 8 -d float16
  Find the latency report at ./reports_float16_vllm_rocm6.3.1/summary/Mixtral-8x22B-Instruct-v0.1_latency_report.csv.
- Throughput benchmark
  Use this command to benchmark the throughput of the Mixtral MoE 8x22B model on eight GPUs with the float16 data type.
  ./vllm_benchmark_report.sh -s throughput -m mistralai/Mixtral-8x22B-Instruct-v0.1 -g 8 -d float16
  Find the throughput report at ./reports_float16_vllm_rocm6.3.1/summary/Mixtral-8x22B-Instruct-v0.1_throughput_report.csv.
Note
Throughput is calculated as:
- \[throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time\]
- \[throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time\]
Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
Use this command to run the performance benchmark test on the Mistral 7B model
using one GPU with the float16 data type on the host machine.
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
python3 tools/run_models.py --tags pyt_vllm_mistral-7b --keep-model-dir --live-output --timeout 28800
MAD launches a Docker container with the name
container_ci-pyt_vllm_mistral-7b. The latency and throughput reports of the
model are collected in the following path: ~/MAD/reports_float16/.
Although the available models are preconfigured to collect latency and throughput performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.
Run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.
docker pull rocm/vllm:rocm6.3.1_instinct_vllm0.7.3_20250325
docker run -it --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 16G --security-opt seccomp=unconfined --security-opt apparmor=unconfined --cap-add=SYS_PTRACE -v $(pwd):/workspace --env HUGGINGFACE_HUB_CACHE=/workspace --name test rocm/vllm:rocm6.3.1_instinct_vllm0.7.3_20250325
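The docker run line packs many flags into one command. The sketch below spells out the same arguments as a Python list, one per line with a short comment, which can make them easier to review or adapt. The function name docker_run_args is hypothetical and the comments are a reader's gloss on standard Docker and ROCm options, not text from the benchmark documentation.

```python
from typing import List

IMAGE = "rocm/vllm:rocm6.3.1_instinct_vllm0.7.3_20250325"


def docker_run_args(workdir: str, name: str = "test", image: str = IMAGE) -> List[str]:
    """Argument list equivalent to the docker run command shown above."""
    return [
        "docker", "run", "-it",
        "--device=/dev/kfd",                      # ROCm compute device node
        "--device=/dev/dri",                      # GPU render device nodes
        "--group-add", "video",                   # group that owns the GPU device nodes
        "--shm-size", "16G",                      # shared memory available to the container
        "--security-opt", "seccomp=unconfined",
        "--security-opt", "apparmor=unconfined",
        "--cap-add=SYS_PTRACE",
        "-v", f"{workdir}:/workspace",            # mount the working directory
        "--env", "HUGGINGFACE_HUB_CACHE=/workspace",  # keep downloaded models on the mount
        "--name", name,
        image,
    ]
```

Because the Hugging Face cache points at the mounted directory, downloaded model weights should persist on the host between container runs.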
In the Docker container, clone the ROCm MAD repository and navigate to the
benchmark scripts directory at ~/MAD/scripts/vllm.
git clone https://github.com/ROCm/MAD
cd MAD/scripts/vllm
To start the benchmark, use the following command with the appropriate options.
./vllm_benchmark_report.sh -s $test_option -m mistralai/Mistral-7B-Instruct-v0.3 -g $num_gpu -d float16
| Name | Options | Description |
|---|---|---|
| $test_option | latency | Measure decoding token latency |
| | throughput | Measure token generation throughput |
| | all | Measure both throughput and latency |
| $num_gpu | 1 or 8 | Number of GPUs |
| | | Data type |
Note
The input sequence length, output sequence length, and tensor parallel (TP) are already configured. You don’t need to specify them with this script.
Note
If you encounter the following error, pass your access-authorized Hugging Face token to the gated models.
OSError: You are trying to access a gated repo.
# pass your HF_TOKEN
export HF_TOKEN=$your_personal_hf_token
Here are some examples of running the benchmark with various options.
- Latency benchmark
  Use this command to benchmark the latency of the Mistral 7B model on eight GPUs with the float16 data type.
  ./vllm_benchmark_report.sh -s latency -m mistralai/Mistral-7B-Instruct-v0.3 -g 8 -d float16
  Find the latency report at ./reports_float16_vllm_rocm6.3.1/summary/Mistral-7B-Instruct-v0.3_latency_report.csv.
- Throughput benchmark
  Use this command to benchmark the throughput of the Mistral 7B model on eight GPUs with the float16 data type.
  ./vllm_benchmark_report.sh -s throughput -m mistralai/Mistral-7B-Instruct-v0.3 -g 8 -d float16
  Find the throughput report at ./reports_float16_vllm_rocm6.3.1/summary/Mistral-7B-Instruct-v0.3_throughput_report.csv.
Note
Throughput is calculated as:
- \[throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time\]
- \[throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time\]
Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
Use this command to run the performance benchmark test on the Mixtral MoE 8x7B FP8 model
using one GPU with the float8 data type on the host machine.
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
python3 tools/run_models.py --tags pyt_vllm_mixtral-8x7b_fp8 --keep-model-dir --live-output --timeout 28800
MAD launches a Docker container with the name
container_ci-pyt_vllm_mixtral-8x7b_fp8. The latency and throughput reports of the
model are collected in the following path: ~/MAD/reports_float8/.
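When several models are benchmarked in sequence, the MAD invocation and the resulting container name can be derived from the tag alone. The sketch below mirrors the command and naming pattern shown above. The helper names are hypothetical, and the container_ci- prefix is taken from the text rather than from the MAD source.

```python
from typing import List


def mad_run_command(tag: str, timeout_s: int = 28800) -> List[str]:
    """Argument list for the MAD runner, mirroring the command shown above."""
    return [
        "python3", "tools/run_models.py",
        "--tags", tag,
        "--keep-model-dir",
        "--live-output",
        "--timeout", str(timeout_s),
    ]


def mad_container_name(tag: str) -> str:
    """Container name for a tag, following the pattern described in the text."""
    return f"container_ci-{tag}"
```

The command still expects MAD_SECRETS_HFTOKEN to be exported in the calling environment, as in the step above.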
Although the available models are preconfigured to collect latency and throughput performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.
Run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.
docker pull rocm/vllm:rocm6.3.1_instinct_vllm0.7.3_20250325
docker run -it --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 16G --security-opt seccomp=unconfined --security-opt apparmor=unconfined --cap-add=SYS_PTRACE -v $(pwd):/workspace --env HUGGINGFACE_HUB_CACHE=/workspace --name test rocm/vllm:rocm6.3.1_instinct_vllm0.7.3_20250325
In the Docker container, clone the ROCm MAD repository and navigate to the
benchmark scripts directory at ~/MAD/scripts/vllm.
git clone https://github.com/ROCm/MAD
cd MAD/scripts/vllm
To start the benchmark, use the following command with the appropriate options.
./vllm_benchmark_report.sh -s $test_option -m amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV -g $num_gpu -d float8
| Name | Options | Description |
|---|---|---|
| $test_option | latency | Measure decoding token latency |
| | throughput | Measure token generation throughput |
| | all | Measure both throughput and latency |
| $num_gpu | 1 or 8 | Number of GPUs |
| | | Data type |
Note
The input sequence length, output sequence length, and tensor parallel (TP) are already configured. You don’t need to specify them with this script.
Note
If you encounter the following error, pass your access-authorized Hugging Face token to the gated models.
OSError: You are trying to access a gated repo.
# pass your HF_TOKEN
export HF_TOKEN=$your_personal_hf_token
Here are some examples of running the benchmark with various options.
- Latency benchmark
  Use this command to benchmark the latency of the Mixtral MoE 8x7B FP8 model on eight GPUs with the float8 data type.
  ./vllm_benchmark_report.sh -s latency -m amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV -g 8 -d float8
  Find the latency report at ./reports_float8_vllm_rocm6.3.1/summary/Mixtral-8x7B-Instruct-v0.1-FP8-KV_latency_report.csv.
- Throughput benchmark
  Use this command to benchmark the throughput of the Mixtral MoE 8x7B FP8 model on eight GPUs with the float8 data type.
  ./vllm_benchmark_report.sh -s throughput -m amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV -g 8 -d float8
  Find the throughput report at ./reports_float8_vllm_rocm6.3.1/summary/Mixtral-8x7B-Instruct-v0.1-FP8-KV_throughput_report.csv.
Note
Throughput is calculated as:
- \[throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time\]
- \[throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time\]
Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
Use this command to run the performance benchmark test on the Mixtral MoE 8x22B FP8 model
using one GPU with the float8 data type on the host machine.
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
python3 tools/run_models.py --tags pyt_vllm_mixtral-8x22b_fp8 --keep-model-dir --live-output --timeout 28800
MAD launches a Docker container with the name
container_ci-pyt_vllm_mixtral-8x22b_fp8. The latency and throughput reports of the
model are collected in the following path: ~/MAD/reports_float8/.
Although the available models are preconfigured to collect latency and throughput performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.
Run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.
docker pull rocm/vllm:rocm6.3.1_instinct_vllm0.7.3_20250325
docker run -it --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 16G --security-opt seccomp=unconfined --security-opt apparmor=unconfined --cap-add=SYS_PTRACE -v $(pwd):/workspace --env HUGGINGFACE_HUB_CACHE=/workspace --name test rocm/vllm:rocm6.3.1_instinct_vllm0.7.3_20250325
In the Docker container, clone the ROCm MAD repository and navigate to the
benchmark scripts directory at ~/MAD/scripts/vllm.
git clone https://github.com/ROCm/MAD
cd MAD/scripts/vllm
To start the benchmark, use the following command with the appropriate options.
./vllm_benchmark_report.sh -s $test_option -m amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV -g $num_gpu -d float8
| Name | Options | Description |
|---|---|---|
| $test_option | latency | Measure decoding token latency |
| | throughput | Measure token generation throughput |
| | all | Measure both throughput and latency |
| $num_gpu | 1 or 8 | Number of GPUs |
| | | Data type |
Note
The input sequence length, output sequence length, and tensor parallel (TP) are already configured. You don’t need to specify them with this script.
Note
If you encounter the following error, pass your access-authorized Hugging Face token to the gated models.
OSError: You are trying to access a gated repo.
# pass your HF_TOKEN
export HF_TOKEN=$your_personal_hf_token
Here are some examples of running the benchmark with various options.
- Latency benchmark
  Use this command to benchmark the latency of the Mixtral MoE 8x22B FP8 model on eight GPUs with the float8 data type.
  ./vllm_benchmark_report.sh -s latency -m amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV -g 8 -d float8
  Find the latency report at ./reports_float8_vllm_rocm6.3.1/summary/Mixtral-8x22B-Instruct-v0.1-FP8-KV_latency_report.csv.
- Throughput benchmark
  Use this command to benchmark the throughput of the Mixtral MoE 8x22B FP8 model on eight GPUs with the float8 data type.
  ./vllm_benchmark_report.sh -s throughput -m amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV -g 8 -d float8
  Find the throughput report at ./reports_float8_vllm_rocm6.3.1/summary/Mixtral-8x22B-Instruct-v0.1-FP8-KV_throughput_report.csv.
Note
Throughput is calculated as:
- \[throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time\]
- \[throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time\]
Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
Use this command to run the performance benchmark test on the Mistral 7B FP8 model
using one GPU with the float8 data type on the host machine.
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
python3 tools/run_models.py --tags pyt_vllm_mistral-7b_fp8 --keep-model-dir --live-output --timeout 28800
MAD launches a Docker container with the name
container_ci-pyt_vllm_mistral-7b_fp8. The latency and throughput reports of the
model are collected in the following path: ~/MAD/reports_float8/.
Although the available models are preconfigured to collect latency and throughput performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.
Run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.
docker pull rocm/vllm:rocm6.3.1_instinct_vllm0.7.3_20250325
docker run -it --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 16G --security-opt seccomp=unconfined --security-opt apparmor=unconfined --cap-add=SYS_PTRACE -v $(pwd):/workspace --env HUGGINGFACE_HUB_CACHE=/workspace --name test rocm/vllm:rocm6.3.1_instinct_vllm0.7.3_20250325
In the Docker container, clone the ROCm MAD repository and navigate to the
benchmark scripts directory at ~/MAD/scripts/vllm.
git clone https://github.com/ROCm/MAD
cd MAD/scripts/vllm
To start the benchmark, use the following command with the appropriate options.
./vllm_benchmark_report.sh -s $test_option -m amd/Mistral-7B-v0.1-FP8-KV -g $num_gpu -d float8
| Name | Options | Description |
|---|---|---|
| $test_option | latency | Measure decoding token latency |
| | throughput | Measure token generation throughput |
| | all | Measure both throughput and latency |
| $num_gpu | 1 or 8 | Number of GPUs |
| | | Data type |
Note
The input sequence length, output sequence length, and tensor parallel (TP) are already configured. You don’t need to specify them with this script.
Note
If you encounter the following error, pass your access-authorized Hugging Face token to the gated models.
OSError: You are trying to access a gated repo.
# pass your HF_TOKEN
export HF_TOKEN=$your_personal_hf_token
Here are some examples of running the benchmark with various options.
- Latency benchmark
  Use this command to benchmark the latency of the Mistral 7B FP8 model on eight GPUs with the float8 data type.
  ./vllm_benchmark_report.sh -s latency -m amd/Mistral-7B-v0.1-FP8-KV -g 8 -d float8
  Find the latency report at ./reports_float8_vllm_rocm6.3.1/summary/Mistral-7B-v0.1-FP8-KV_latency_report.csv.
- Throughput benchmark
  Use this command to benchmark the throughput of the Mistral 7B FP8 model on eight GPUs with the float8 data type.
  ./vllm_benchmark_report.sh -s throughput -m amd/Mistral-7B-v0.1-FP8-KV -g 8 -d float8
  Find the throughput report at ./reports_float8_vllm_rocm6.3.1/summary/Mistral-7B-v0.1-FP8-KV_throughput_report.csv.
Note
Throughput is calculated as:
- \[throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time\]
- \[throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time\]
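As a worked illustration with made-up numbers (not measured results), assume 100 requests, each with 128 input tokens and 128 output tokens, that complete in 64 seconds. With elapsed time in seconds, the formulas give tokens per second:

```latex
\[throughput\_tot = 100 \times (128 + 128) / 64 = 400 \ \text{tokens/s}\]
\[throughput\_gen = 100 \times 128 / 64 = 200 \ \text{tokens/s}\]
```

The total figure counts prompt tokens as well as generated ones, so it is always at least as large as the generation-only figure.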
Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
Use this command to run the performance benchmark test on the Qwen2 7B model
using one GPU with the float16 data type on the host machine.
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
python3 tools/run_models.py --tags pyt_vllm_qwen2-7b --keep-model-dir --live-output --timeout 28800
MAD launches a Docker container with the name
container_ci-pyt_vllm_qwen2-7b. The latency and throughput reports of the
model are collected in the following path: ~/MAD/reports_float16/.
Although the available models are preconfigured to collect latency and throughput performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.
Run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.
docker pull rocm/vllm:rocm6.3.1_instinct_vllm0.7.3_20250325
docker run -it --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 16G --security-opt seccomp=unconfined --security-opt apparmor=unconfined --cap-add=SYS_PTRACE -v $(pwd):/workspace --env HUGGINGFACE_HUB_CACHE=/workspace --name test rocm/vllm:rocm6.3.1_instinct_vllm0.7.3_20250325
In the Docker container, clone the ROCm MAD repository and navigate to the
benchmark scripts directory at ~/MAD/scripts/vllm.
git clone https://github.com/ROCm/MAD
cd MAD/scripts/vllm
To start the benchmark, use the following command with the appropriate options.
./vllm_benchmark_report.sh -s $test_option -m Qwen/Qwen2-7B-Instruct -g $num_gpu -d float16
| Name | Options | Description |
|---|---|---|
| $test_option | latency | Measure decoding token latency |
| | throughput | Measure token generation throughput |
| | all | Measure both throughput and latency |
| $num_gpu | 1 or 8 | Number of GPUs |
| | | Data type |
Note
The input sequence length, output sequence length, and tensor parallel (TP) are already configured. You don’t need to specify them with this script.
Note
If you encounter the following error, pass your access-authorized Hugging Face token to the gated models.
OSError: You are trying to access a gated repo.
# pass your HF_TOKEN
export HF_TOKEN=$your_personal_hf_token
Here are some examples of running the benchmark with various options.
- Latency benchmark
  Use this command to benchmark the latency of the Qwen2 7B model on eight GPUs with the float16 data type.
  ./vllm_benchmark_report.sh -s latency -m Qwen/Qwen2-7B-Instruct -g 8 -d float16
  Find the latency report at ./reports_float16_vllm_rocm6.3.1/summary/Qwen2-7B-Instruct_latency_report.csv.
- Throughput benchmark
  Use this command to benchmark the throughput of the Qwen2 7B model on eight GPUs with the float16 data type.
  ./vllm_benchmark_report.sh -s throughput -m Qwen/Qwen2-7B-Instruct -g 8 -d float16
  Find the throughput report at ./reports_float16_vllm_rocm6.3.1/summary/Qwen2-7B-Instruct_throughput_report.csv.
Note
Throughput is calculated as:
- \[throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time\]
- \[throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time\]
Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
Use this command to run the performance benchmark test on the Qwen2 72B model
using one GPU with the float16 data type on the host machine.
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
python3 tools/run_models.py --tags pyt_vllm_qwen2-72b --keep-model-dir --live-output --timeout 28800
MAD launches a Docker container with the name
container_ci-pyt_vllm_qwen2-72b. The latency and throughput reports of the
model are collected in the following path: ~/MAD/reports_float16/.
Although the available models are preconfigured to collect latency and throughput performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.
Run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.
docker pull rocm/vllm:rocm6.3.1_instinct_vllm0.7.3_20250325
docker run -it --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 16G --security-opt seccomp=unconfined --security-opt apparmor=unconfined --cap-add=SYS_PTRACE -v $(pwd):/workspace --env HUGGINGFACE_HUB_CACHE=/workspace --name test rocm/vllm:rocm6.3.1_instinct_vllm0.7.3_20250325
In the Docker container, clone the ROCm MAD repository and navigate to the
benchmark scripts directory at ~/MAD/scripts/vllm.
git clone https://github.com/ROCm/MAD
cd MAD/scripts/vllm
To start the benchmark, use the following command with the appropriate options.
./vllm_benchmark_report.sh -s $test_option -m Qwen/Qwen2-72B-Instruct -g $num_gpu -d float16
| Name | Options | Description |
|---|---|---|
| $test_option | latency | Measure decoding token latency |
| | throughput | Measure token generation throughput |
| | all | Measure both throughput and latency |
| $num_gpu | 1 or 8 | Number of GPUs |
| Value passed to -d | float16 | Data type |
Note
The input sequence length, output sequence length, and tensor parallel (TP) are already configured. You don’t need to specify them with this script.
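One way to cover the option table in a single pass is to generate the command for each combination first and review it before running anything. This is only a sketch: the `build_commands` helper is hypothetical, and it prints commands rather than executing them, so it is safe to try outside the container.

```shell
model="Qwen/Qwen2-72B-Instruct"
datatype="float16"

# Print one benchmark command per combination of test option and GPU count.
build_commands() {
  local test_option num_gpu
  for test_option in latency throughput; do
    for num_gpu in 1 8; do
      echo "./vllm_benchmark_report.sh -s ${test_option} -m ${model} -g ${num_gpu} -d ${datatype}"
    done
  done
}

build_commands
# To execute instead of print, run this from MAD/scripts/vllm inside the container:
#   build_commands | bash
```

The loop skips the `all` option because it would repeat the latency and throughput runs.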
Note
If you encounter the following error, the model is gated. Export a Hugging Face token that has been granted access to it.
OSError: You are trying to access a gated repo.
# pass your HF_TOKEN
export HF_TOKEN=$your_personal_hf_token
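To fail fast instead of discovering the gated-repo error partway through a run, a small guard can check the variable before the benchmark starts. The `require_hf_token` function is a hypothetical helper, not part of the MAD scripts, and it only checks that the variable is non-empty, not that the token is valid.

```shell
# Return success only when HF_TOKEN is set and non-empty.
require_hf_token() {
  if [ -z "${HF_TOKEN:-}" ]; then
    echo "HF_TOKEN is not set; export it before benchmarking a gated model." >&2
    return 1
  fi
}

# Example use inside the container:
#   require_hf_token && ./vllm_benchmark_report.sh -s latency -m Qwen/Qwen2-72B-Instruct -g 8 -d float16
```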
Here are some examples of running the benchmark with various options.
- Latency benchmark
  Use this command to benchmark the latency of the Qwen2 72B model on eight GPUs with the float16 data type.
  ./vllm_benchmark_report.sh -s latency -m Qwen/Qwen2-72B-Instruct -g 8 -d float16
  Find the latency report at ./reports_float16_vllm_rocm6.3.1/summary/Qwen2-72B-Instruct_latency_report.csv.
- Throughput benchmark
  Use this command to benchmark the throughput of the Qwen2 72B model on eight GPUs with the float16 data type.
  ./vllm_benchmark_report.sh -s throughput -m Qwen/Qwen2-72B-Instruct -g 8 -d float16
  Find the throughput report at ./reports_float16_vllm_rocm6.3.1/summary/Qwen2-72B-Instruct_throughput_report.csv.
Note
Throughput is calculated as:
- \[throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time\]
- \[throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time\]
Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
Use this command to run the performance benchmark test on the JAIS 13B model
using one GPU with the float16 data type on the host machine.
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
python3 tools/run_models.py --tags pyt_vllm_jais-13b --keep-model-dir --live-output --timeout 28800
MAD launches a Docker container with the name
container_ci-pyt_vllm_jais-13b. The latency and throughput reports of the
model are collected in the following path: ~/MAD/reports_float16/.
Although the available models are preconfigured to collect latency and throughput performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.
Run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.
docker pull rocm/vllm:rocm6.3.1_instinct_vllm0.7.3_20250325
docker run -it --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 16G --security-opt seccomp=unconfined --security-opt apparmor=unconfined --cap-add=SYS_PTRACE -v $(pwd):/workspace --env HUGGINGFACE_HUB_CACHE=/workspace --name test rocm/vllm:rocm6.3.1_instinct_vllm0.7.3_20250325
In the Docker container, clone the ROCm MAD repository and navigate to the
benchmark scripts directory at ~/MAD/scripts/vllm.
git clone https://github.com/ROCm/MAD
cd MAD/scripts/vllm
To start the benchmark, use the following command with the appropriate options.
./vllm_benchmark_report.sh -s $test_option -m core42/jais-13b-chat -g $num_gpu -d float16
| Name | Options | Description |
|---|---|---|
| $test_option | latency | Measure decoding token latency |
| | throughput | Measure token generation throughput |
| | all | Measure both throughput and latency |
| $num_gpu | 1 or 8 | Number of GPUs |
| Value passed to -d | float16 | Data type |
Note
The input sequence length, output sequence length, and tensor parallel (TP) are already configured. You don’t need to specify them with this script.
Note
If you encounter the following error, the model is gated. Export a Hugging Face token that has been granted access to it.
OSError: You are trying to access a gated repo.
# pass your HF_TOKEN
export HF_TOKEN=$your_personal_hf_token
Here are some examples of running the benchmark with various options.
- Latency benchmark
  Use this command to benchmark the latency of the JAIS 13B model on eight GPUs with the float16 data type.
  ./vllm_benchmark_report.sh -s latency -m core42/jais-13b-chat -g 8 -d float16
  Find the latency report at ./reports_float16_vllm_rocm6.3.1/summary/jais-13b-chat_latency_report.csv.
- Throughput benchmark
  Use this command to benchmark the throughput of the JAIS 13B model on eight GPUs with the float16 data type.
  ./vllm_benchmark_report.sh -s throughput -m core42/jais-13b-chat -g 8 -d float16
  Find the throughput report at ./reports_float16_vllm_rocm6.3.1/summary/jais-13b-chat_throughput_report.csv.
Note
Throughput is calculated as:
- \[throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time\]
- \[throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time\]
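The report locations in the benchmark examples appear to follow one pattern: the data type goes into the directory name, and the file name starts with the model ID minus its organization prefix. Assuming that pattern holds, a hypothetical `report_path` helper can rebuild the path for any run:

```shell
# Arguments: test option (latency or throughput), Hugging Face model ID, data type.
report_path() {
  local test_option="$1" model="$2" datatype="$3"
  # ${model##*/} drops everything up to the last slash, e.g. "core42/".
  echo "./reports_${datatype}_vllm_rocm6.3.1/summary/${model##*/}_${test_option}_report.csv"
}

report_path latency core42/jais-13b-chat float16
# prints ./reports_float16_vllm_rocm6.3.1/summary/jais-13b-chat_latency_report.csv
```

The `rocm6.3.1` segment is copied from the paths on this page and would presumably differ for other Docker image versions.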
Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
Use this command to run the performance benchmark test on the JAIS 30B model
using one GPU with the float16 data type on the host machine.
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
python3 tools/run_models.py --tags pyt_vllm_jais-30b --keep-model-dir --live-output --timeout 28800
MAD launches a Docker container with the name
container_ci-pyt_vllm_jais-30b. The latency and throughput reports of the
model are collected in the following path: ~/MAD/reports_float16/.
Although the available models are preconfigured to collect latency and throughput performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.
Run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.
docker pull rocm/vllm:rocm6.3.1_instinct_vllm0.7.3_20250325
docker run -it --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 16G --security-opt seccomp=unconfined --security-opt apparmor=unconfined --cap-add=SYS_PTRACE -v $(pwd):/workspace --env HUGGINGFACE_HUB_CACHE=/workspace --name test rocm/vllm:rocm6.3.1_instinct_vllm0.7.3_20250325
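The `--device=/dev/kfd` and `--device=/dev/dri` flags expose host device nodes to the container, so they need to exist on the host. A pre-flight check along these lines can catch a missing driver early. `check_gpu_devices` is a hypothetical helper, not part of ROCm or Docker.

```shell
# Return success only if every path given as an argument exists.
check_gpu_devices() {
  local dev missing=0
  for dev in "$@"; do
    if [ ! -e "$dev" ]; then
      echo "missing device node: $dev" >&2
      missing=1
    fi
  done
  return "$missing"
}

check_gpu_devices /dev/kfd /dev/dri \
  || echo "GPU device nodes not found; the container would not see the GPUs."
```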
In the Docker container, clone the ROCm MAD repository and navigate to the
benchmark scripts directory at ~/MAD/scripts/vllm.
git clone https://github.com/ROCm/MAD
cd MAD/scripts/vllm
To start the benchmark, use the following command with the appropriate options.
./vllm_benchmark_report.sh -s $test_option -m core42/jais-30b-chat-v3 -g $num_gpu -d float16
| Name | Options | Description |
|---|---|---|
| $test_option | latency | Measure decoding token latency |
| | throughput | Measure token generation throughput |
| | all | Measure both throughput and latency |
| $num_gpu | 1 or 8 | Number of GPUs |
| Value passed to -d | float16 | Data type |
Note
The input sequence length, output sequence length, and tensor parallel (TP) are already configured. You don’t need to specify them with this script.
Note
If you encounter the following error, the model is gated. Export a Hugging Face token that has been granted access to it.
OSError: You are trying to access a gated repo.
# pass your HF_TOKEN
export HF_TOKEN=$your_personal_hf_token
Here are some examples of running the benchmark with various options.
- Latency benchmark
  Use this command to benchmark the latency of the JAIS 30B model on eight GPUs with the float16 data type.
  ./vllm_benchmark_report.sh -s latency -m core42/jais-30b-chat-v3 -g 8 -d float16
  Find the latency report at ./reports_float16_vllm_rocm6.3.1/summary/jais-30b-chat-v3_latency_report.csv.
- Throughput benchmark
  Use this command to benchmark the throughput of the JAIS 30B model on eight GPUs with the float16 data type.
  ./vllm_benchmark_report.sh -s throughput -m core42/jais-30b-chat-v3 -g 8 -d float16
  Find the throughput report at ./reports_float16_vllm_rocm6.3.1/summary/jais-30b-chat-v3_throughput_report.csv.
Note
Throughput is calculated as:
- \[throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time\]
- \[throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time\]
Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
Use this command to run the performance benchmark test on the DBRX Instruct model
using one GPU with the float16 data type on the host machine.
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
python3 tools/run_models.py --tags pyt_vllm_dbrx-instruct --keep-model-dir --live-output --timeout 28800
MAD launches a Docker container with the name
container_ci-pyt_vllm_dbrx-instruct. The latency and throughput reports of the
model are collected in the following path: ~/MAD/reports_float16/.
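On this page, the container name is the MAD tag prefixed with `container_ci-`. Assuming that pattern holds, a hypothetical helper can derive the name for use with standard Docker commands such as `docker ps` or `docker logs`:

```shell
# Argument: the value passed to --tags.
mad_container_name() {
  echo "container_ci-$1"
}

name=$(mad_container_name pyt_vllm_dbrx-instruct)
echo "$name"
# Example follow-up commands on the host (these require Docker):
#   docker ps -a --filter "name=${name}"
#   docker logs "${name}"
```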
Although the available models are preconfigured to collect latency and throughput performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.
Run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.
docker pull rocm/vllm:rocm6.3.1_instinct_vllm0.7.3_20250325
docker run -it --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 16G --security-opt seccomp=unconfined --security-opt apparmor=unconfined --cap-add=SYS_PTRACE -v $(pwd):/workspace --env HUGGINGFACE_HUB_CACHE=/workspace --name test rocm/vllm:rocm6.3.1_instinct_vllm0.7.3_20250325
In the Docker container, clone the ROCm MAD repository and navigate to the
benchmark scripts directory at ~/MAD/scripts/vllm.
git clone https://github.com/ROCm/MAD
cd MAD/scripts/vllm
To start the benchmark, use the following command with the appropriate options.
./vllm_benchmark_report.sh -s $test_option -m databricks/dbrx-instruct -g $num_gpu -d float16
| Name | Options | Description |
|---|---|---|
| $test_option | latency | Measure decoding token latency |
| | throughput | Measure token generation throughput |
| | all | Measure both throughput and latency |
| $num_gpu | 1 or 8 | Number of GPUs |
| Value passed to -d | float16 | Data type |
Note
The input sequence length, output sequence length, and tensor parallel (TP) are already configured. You don’t need to specify them with this script.
Note
If you encounter the following error, the model is gated. Export a Hugging Face token that has been granted access to it.
OSError: You are trying to access a gated repo.
# pass your HF_TOKEN
export HF_TOKEN=$your_personal_hf_token
Here are some examples of running the benchmark with various options.
- Latency benchmark
  Use this command to benchmark the latency of the DBRX Instruct model on eight GPUs with the float16 data type.
  ./vllm_benchmark_report.sh -s latency -m databricks/dbrx-instruct -g 8 -d float16
  Find the latency report at ./reports_float16_vllm_rocm6.3.1/summary/dbrx-instruct_latency_report.csv.
- Throughput benchmark
  Use this command to benchmark the throughput of the DBRX Instruct model on eight GPUs with the float16 data type.
  ./vllm_benchmark_report.sh -s throughput -m databricks/dbrx-instruct -g 8 -d float16
  Find the throughput report at ./reports_float16_vllm_rocm6.3.1/summary/dbrx-instruct_throughput_report.csv.
Note
Throughput is calculated as:
- \[throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time\]
- \[throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time\]
Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
Use this command to run the performance benchmark test on the DBRX Instruct FP8 model
using one GPU with the float8 data type on the host machine.
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
python3 tools/run_models.py --tags pyt_vllm_dbrx_fp8 --keep-model-dir --live-output --timeout 28800
MAD launches a Docker container with the name
container_ci-pyt_vllm_dbrx_fp8. The latency and throughput reports of the
model are collected in the following path: ~/MAD/reports_float8/.
Although the available models are preconfigured to collect latency and throughput performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.
Run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.
docker pull rocm/vllm:rocm6.3.1_instinct_vllm0.7.3_20250325
docker run -it --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 16G --security-opt seccomp=unconfined --security-opt apparmor=unconfined --cap-add=SYS_PTRACE -v $(pwd):/workspace --env HUGGINGFACE_HUB_CACHE=/workspace --name test rocm/vllm:rocm6.3.1_instinct_vllm0.7.3_20250325
In the Docker container, clone the ROCm MAD repository and navigate to the
benchmark scripts directory at ~/MAD/scripts/vllm.
git clone https://github.com/ROCm/MAD
cd MAD/scripts/vllm
To start the benchmark, use the following command with the appropriate options.
./vllm_benchmark_report.sh -s $test_option -m amd/dbrx-instruct-FP8-KV -g $num_gpu -d float8
| Name | Options | Description |
|---|---|---|
| $test_option | latency | Measure decoding token latency |
| | throughput | Measure token generation throughput |
| | all | Measure both throughput and latency |
| $num_gpu | 1 or 8 | Number of GPUs |
| Value passed to -d | float8 | Data type |
Note
The input sequence length, output sequence length, and tensor parallel (TP) are already configured. You don’t need to specify them with this script.
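The models on this page whose IDs contain `FP8` are benchmarked with `-d float8`, and the others with `-d float16`. The sketch below encodes that observation as a hypothetical `datatype_for_model` helper. Treat it as a convenience for the models listed here, not a rule the script is known to enforce.

```shell
# Pick the -d value from the model ID, based on the examples on this page.
datatype_for_model() {
  case "$1" in
    *FP8*) echo float8 ;;
    *)     echo float16 ;;
  esac
}

model="amd/dbrx-instruct-FP8-KV"
echo "./vllm_benchmark_report.sh -s latency -m ${model} -g 8 -d $(datatype_for_model "$model")"
```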
Note
If you encounter the following error, the model is gated. Export a Hugging Face token that has been granted access to it.
OSError: You are trying to access a gated repo.
# pass your HF_TOKEN
export HF_TOKEN=$your_personal_hf_token
Here are some examples of running the benchmark with various options.
- Latency benchmark
  Use this command to benchmark the latency of the DBRX Instruct FP8 model on eight GPUs with the float8 data type.
  ./vllm_benchmark_report.sh -s latency -m amd/dbrx-instruct-FP8-KV -g 8 -d float8
  Find the latency report at ./reports_float8_vllm_rocm6.3.1/summary/dbrx-instruct-FP8-KV_latency_report.csv.
- Throughput benchmark
  Use this command to benchmark the throughput of the DBRX Instruct FP8 model on eight GPUs with the float8 data type.
  ./vllm_benchmark_report.sh -s throughput -m amd/dbrx-instruct-FP8-KV -g 8 -d float8
  Find the throughput report at ./reports_float8_vllm_rocm6.3.1/summary/dbrx-instruct-FP8-KV_throughput_report.csv.
Note
Throughput is calculated as:
- \[throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time\]
- \[throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time\]
Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
Use this command to run the performance benchmark test on the Gemma 2 27B model
using one GPU with the float16 data type on the host machine.
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
python3 tools/run_models.py --tags pyt_vllm_gemma-2-27b --keep-model-dir --live-output --timeout 28800
MAD launches a Docker container with the name
container_ci-pyt_vllm_gemma-2-27b. The latency and throughput reports of the
model are collected in the following path: ~/MAD/reports_float16/.
Although the available models are preconfigured to collect latency and throughput performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.
Run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.
docker pull rocm/vllm:rocm6.3.1_instinct_vllm0.7.3_20250325
docker run -it --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 16G --security-opt seccomp=unconfined --security-opt apparmor=unconfined --cap-add=SYS_PTRACE -v $(pwd):/workspace --env HUGGINGFACE_HUB_CACHE=/workspace --name test rocm/vllm:rocm6.3.1_instinct_vllm0.7.3_20250325
In the Docker container, clone the ROCm MAD repository and navigate to the
benchmark scripts directory at ~/MAD/scripts/vllm.
git clone https://github.com/ROCm/MAD
cd MAD/scripts/vllm
To start the benchmark, use the following command with the appropriate options.
./vllm_benchmark_report.sh -s $test_option -m google/gemma-2-27b -g $num_gpu -d float16
| Name | Options | Description |
|---|---|---|
| $test_option | latency | Measure decoding token latency |
| | throughput | Measure token generation throughput |
| | all | Measure both throughput and latency |
| $num_gpu | 1 or 8 | Number of GPUs |
| Value passed to -d | float16 | Data type |
Note
The input sequence length, output sequence length, and tensor parallel (TP) are already configured. You don’t need to specify them with this script.
Note
If you encounter the following error, the model is gated. Export a Hugging Face token that has been granted access to it.
OSError: You are trying to access a gated repo.
# pass your HF_TOKEN
export HF_TOKEN=$your_personal_hf_token
Here are some examples of running the benchmark with various options.
- Latency benchmark
  Use this command to benchmark the latency of the Gemma 2 27B model on eight GPUs with the float16 data type.
  ./vllm_benchmark_report.sh -s latency -m google/gemma-2-27b -g 8 -d float16
  Find the latency report at ./reports_float16_vllm_rocm6.3.1/summary/gemma-2-27b_latency_report.csv.
- Throughput benchmark
  Use this command to benchmark the throughput of the Gemma 2 27B model on eight GPUs with the float16 data type.
  ./vllm_benchmark_report.sh -s throughput -m google/gemma-2-27b -g 8 -d float16
  Find the throughput report at ./reports_float16_vllm_rocm6.3.1/summary/gemma-2-27b_throughput_report.csv.
Note
Throughput is calculated as:
- \[throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time\]
- \[throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time\]
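After a run completes, the CSV reports named in the examples can be inspected directly in the terminal. The sketch below aligns the columns with `column` when it is installed and falls back to `cat` otherwise. `show_report` is a hypothetical helper, and it makes no assumption about which columns the report contains.

```shell
# Print a CSV report as an aligned table, or explain that it is missing.
show_report() {
  local report="$1"
  if [ ! -f "$report" ]; then
    echo "report not found: $report" >&2
    return 1
  fi
  if command -v column >/dev/null 2>&1; then
    column -s, -t < "$report"
  else
    cat "$report"
  fi
}

show_report ./reports_float16_vllm_rocm6.3.1/summary/gemma-2-27b_latency_report.csv \
  || echo "Run the latency benchmark first to generate the report."
```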
Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
Use this command to run the performance benchmark test on the C4AI Command R+ 08-2024 model
using one GPU with the float16 data type on the host machine.
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
python3 tools/run_models.py --tags pyt_vllm_c4ai-command-r-plus-08-2024 --keep-model-dir --live-output --timeout 28800
MAD launches a Docker container with the name
container_ci-pyt_vllm_c4ai-command-r-plus-08-2024. The latency and throughput reports of the
model are collected in the following path: ~/MAD/reports_float16/.
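The `--timeout 28800` value corresponds to eight hours if it is interpreted as seconds, which is an assumption here and worth confirming against the MAD documentation. Under that assumption, a hypothetical helper makes a different limit easier to express:

```shell
# Convert whole hours to seconds (assumes --timeout takes seconds).
hours_to_seconds() {
  echo $(( $1 * 3600 ))
}

timeout=$(hours_to_seconds 8)
# Print the resulting MAD command for review rather than running it.
echo "python3 tools/run_models.py --tags pyt_vllm_c4ai-command-r-plus-08-2024 --keep-model-dir --live-output --timeout ${timeout}"
```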
Although the available models are preconfigured to collect latency and throughput performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.
Run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.
docker pull rocm/vllm:rocm6.3.1_instinct_vllm0.7.3_20250325
docker run -it --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 16G --security-opt seccomp=unconfined --security-opt apparmor=unconfined --cap-add=SYS_PTRACE -v $(pwd):/workspace --env HUGGINGFACE_HUB_CACHE=/workspace --name test rocm/vllm:rocm6.3.1_instinct_vllm0.7.3_20250325
In the Docker container, clone the ROCm MAD repository and navigate to the
benchmark scripts directory at ~/MAD/scripts/vllm.
git clone https://github.com/ROCm/MAD
cd MAD/scripts/vllm
To start the benchmark, use the following command with the appropriate options.
./vllm_benchmark_report.sh -s $test_option -m CohereForAI/c4ai-command-r-plus-08-2024 -g $num_gpu -d float16
| Name | Options | Description |
|---|---|---|
| $test_option | latency | Measure decoding token latency |
| | throughput | Measure token generation throughput |
| | all | Measure both throughput and latency |
| $num_gpu | 1 or 8 | Number of GPUs |
| Value passed to -d | float16 | Data type |
Note
The input sequence length, output sequence length, and tensor parallel (TP) are already configured. You don’t need to specify them with this script.
Note
If you encounter the following error, the model is gated. Export a Hugging Face token that has been granted access to it.
OSError: You are trying to access a gated repo.
# pass your HF_TOKEN
export HF_TOKEN=$your_personal_hf_token
Here are some examples of running the benchmark with various options.
- Latency benchmark
  Use this command to benchmark the latency of the C4AI Command R+ 08-2024 model on eight GPUs with the float16 data type.
  ./vllm_benchmark_report.sh -s latency -m CohereForAI/c4ai-command-r-plus-08-2024 -g 8 -d float16
  Find the latency report at ./reports_float16_vllm_rocm6.3.1/summary/c4ai-command-r-plus-08-2024_latency_report.csv.
- Throughput benchmark
  Use this command to benchmark the throughput of the C4AI Command R+ 08-2024 model on eight GPUs with the float16 data type.
  ./vllm_benchmark_report.sh -s throughput -m CohereForAI/c4ai-command-r-plus-08-2024 -g 8 -d float16
  Find the throughput report at ./reports_float16_vllm_rocm6.3.1/summary/c4ai-command-r-plus-08-2024_throughput_report.csv.
Note
Throughput is calculated as:
- \[throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time\]
- \[throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time\]
Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
Use this command to run the performance benchmark test on the C4AI Command R+ 08-2024 FP8 model
using one GPU with the float8 data type on the host machine.
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
python3 tools/run_models.py --tags pyt_vllm_command-r-plus_fp8 --keep-model-dir --live-output --timeout 28800
MAD launches a Docker container with the name
container_ci-pyt_vllm_command-r-plus_fp8. The latency and throughput reports of the
model are collected in the following path: ~/MAD/reports_float8/.
Although the available models are preconfigured to collect latency and throughput performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.
Run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.
docker pull rocm/vllm:rocm6.3.1_instinct_vllm0.7.3_20250325
docker run -it --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 16G --security-opt seccomp=unconfined --security-opt apparmor=unconfined --cap-add=SYS_PTRACE -v $(pwd):/workspace --env HUGGINGFACE_HUB_CACHE=/workspace --name test rocm/vllm:rocm6.3.1_instinct_vllm0.7.3_20250325
In the Docker container, clone the ROCm MAD repository and navigate to the
benchmark scripts directory at ~/MAD/scripts/vllm.
git clone https://github.com/ROCm/MAD
cd MAD/scripts/vllm
To start the benchmark, use the following command with the appropriate options.
./vllm_benchmark_report.sh -s $test_option -m amd/c4ai-command-r-plus-FP8-KV -g $num_gpu -d float8
| Name | Options | Description |
|---|---|---|
| $test_option | latency | Measure decoding token latency |
| | throughput | Measure token generation throughput |
| | all | Measure both throughput and latency |
| $num_gpu | 1 or 8 | Number of GPUs |
| Value passed to -d | float8 | Data type |
Note
The input sequence length, output sequence length, and tensor parallel (TP) are already configured. You don’t need to specify them with this script.
Note
If you encounter the following error, the model is gated. Export a Hugging Face token that has been granted access to it.
OSError: You are trying to access a gated repo.
# pass your HF_TOKEN
export HF_TOKEN=$your_personal_hf_token
Here are some examples of running the benchmark with various options.
- Latency benchmark
  Use this command to benchmark the latency of the C4AI Command R+ 08-2024 FP8 model on eight GPUs with the float8 data type.
  ./vllm_benchmark_report.sh -s latency -m amd/c4ai-command-r-plus-FP8-KV -g 8 -d float8
  Find the latency report at ./reports_float8_vllm_rocm6.3.1/summary/c4ai-command-r-plus-FP8-KV_latency_report.csv.
- Throughput benchmark
  Use this command to benchmark the throughput of the C4AI Command R+ 08-2024 FP8 model on eight GPUs with the float8 data type.
  ./vllm_benchmark_report.sh -s throughput -m amd/c4ai-command-r-plus-FP8-KV -g 8 -d float8
  Find the throughput report at ./reports_float8_vllm_rocm6.3.1/summary/c4ai-command-r-plus-FP8-KV_throughput_report.csv.
Note
Throughput is calculated as:
- \[throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time\]
- \[throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time\]
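When several runs have written to the same summary directory, listing the report files is a quick way to see which benchmarks have finished. This is a minimal sketch: `list_reports` is a hypothetical helper, and the `*_report.csv` pattern is inferred from the file names shown on this page.

```shell
# List report CSV files in a summary directory, sorted by name.
list_reports() {
  local summary_dir="$1"
  if [ ! -d "$summary_dir" ]; then
    echo "no summary directory at: $summary_dir" >&2
    return 1
  fi
  find "$summary_dir" -maxdepth 1 -type f -name '*_report.csv' | sort
}

list_reports ./reports_float8_vllm_rocm6.3.1/summary \
  || echo "No float8 reports yet; run a benchmark first."
```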
Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
Use this command to run the performance benchmark test on the DeepSeek MoE 16B model
using one GPU with the float16 data type on the host machine.
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
python3 tools/run_models.py --tags pyt_vllm_deepseek-moe-16b-chat --keep-model-dir --live-output --timeout 28800
MAD launches a Docker container with the name
container_ci-pyt_vllm_deepseek-moe-16b-chat. The latency and throughput reports of the
model are collected in the following path: ~/MAD/reports_float16/.
Although the available models are preconfigured to collect latency and throughput performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.
Run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.
docker pull rocm/vllm:rocm6.3.1_instinct_vllm0.7.3_20250325
docker run -it --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 16G --security-opt seccomp=unconfined --security-opt apparmor=unconfined --cap-add=SYS_PTRACE -v $(pwd):/workspace --env HUGGINGFACE_HUB_CACHE=/workspace --name test rocm/vllm:rocm6.3.1_instinct_vllm0.7.3_20250325
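Because the command mounts the current directory at `/workspace` and sets `HUGGINGFACE_HUB_CACHE=/workspace`, model downloads should be written to the directory you launch the container from. Checking free space there first may save a failed download. The `free_gib` helper and the 100 GiB threshold below are illustrative assumptions, not documented requirements.

```shell
# Print the free space, in whole GiB, of the filesystem holding the given path.
free_gib() {
  # -P forces one line per filesystem; -k reports 1024-byte blocks.
  df -Pk "${1:-.}" | awk 'NR == 2 { print int($4 / 1048576) }'
}

required_gib=100   # hypothetical threshold; size it to the model you plan to download
available_gib=$(free_gib .)
if [ "$available_gib" -lt "$required_gib" ]; then
  echo "Only ${available_gib} GiB free here; model downloads may not fit." >&2
fi
```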
In the Docker container, clone the ROCm MAD repository and navigate to the
benchmark scripts directory at ~/MAD/scripts/vllm.
git clone https://github.com/ROCm/MAD
cd MAD/scripts/vllm
To start the benchmark, use the following command with the appropriate options.
./vllm_benchmark_report.sh -s $test_option -m deepseek-ai/deepseek-moe-16b-chat -g $num_gpu -d float16
| Name | Options | Description |
|---|---|---|
| $test_option | latency | Measure decoding token latency |
| | throughput | Measure token generation throughput |
| | all | Measure both throughput and latency |
| $num_gpu | 1 or 8 | Number of GPUs |
| Value passed to -d | float16 | Data type |
Note
The input sequence length, output sequence length, and tensor parallel (TP) are already configured. You don’t need to specify them with this script.
Note
If you encounter the following error, the model is gated. Export a Hugging Face token that has been granted access to it.
OSError: You are trying to access a gated repo.
# pass your HF_TOKEN
export HF_TOKEN=$your_personal_hf_token
Here are some examples of running the benchmark with various options.
- Latency benchmark
  Use this command to benchmark the latency of the DeepSeek MoE 16B model on eight GPUs with the float16 data type.
  ./vllm_benchmark_report.sh -s latency -m deepseek-ai/deepseek-moe-16b-chat -g 8 -d float16
  Find the latency report at ./reports_float16_vllm_rocm6.3.1/summary/deepseek-moe-16b-chat_latency_report.csv.
- Throughput benchmark
  Use this command to benchmark the throughput of the DeepSeek MoE 16B model on eight GPUs with the float16 data type.
  ./vllm_benchmark_report.sh -s throughput -m deepseek-ai/deepseek-moe-16b-chat -g 8 -d float16
  Find the throughput report at ./reports_float16_vllm_rocm6.3.1/summary/deepseek-moe-16b-chat_throughput_report.csv.
Note
Throughput is calculated as:
- \[throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time\]
- \[throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time\]
Further reading#
- For application performance optimization strategies for HPC and AI workloads, including inference with vLLM, see AMD Instinct MI300X workload optimization. 
- To learn more about the options for latency and throughput benchmark scripts, see ROCm/vllm. 
- To learn more about system settings and management practices to configure your system for MI300X Series GPUs, see AMD Instinct MI300X system optimization.
- To learn how to run community models from Hugging Face on AMD GPUs, see Running models from Hugging Face. 
- To learn how to fine-tune LLMs and optimize inference, see Fine-tuning LLMs and inference optimization. 
- For a list of other ready-made Docker images for AI with ROCm, see AMD Infinity Hub. 
Previous versions#
See vLLM inference performance testing version history to find documentation for previous releases
of the ROCm/vllm Docker image.