vLLM inference performance testing

2025-10-16

Applies to Linux

Caution

This documentation does not reflect the latest version of ROCm vLLM inference performance documentation. See vLLM inference performance testing for the latest version.

The ROCm vLLM Docker image offers a prebuilt, optimized environment for validating large language model (LLM) inference performance on AMD Instinct™ MI300X Series GPUs. This ROCm vLLM Docker image integrates vLLM and PyTorch tailored specifically for MI300X Series GPUs and includes the following components:

Software component    Version
ROCm                  6.4.1
vLLM                  0.9.1 (0.9.2.dev364+gb432b7a28.rocm641)
PyTorch               2.7.0+gitf717b2a
hipBLASLt             0.15

With this Docker image, you can quickly test the expected inference performance numbers for MI300X Series GPUs.

What’s new

The following is a summary of notable changes since the previous ROCm/vLLM Docker release.

  • The --compilation-config parameter is no longer required because its options are now enabled by default. This parameter has been removed from the benchmarking script.

  • Resolved the Llama 3.1 405B custom all-reduce issue, eliminating the need for --disable-custom-all-reduce. This parameter has been removed from the benchmarking script.

  • Fixed a +rms_norm custom kernel issue.

  • Added quick reduce functionality. Set VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=FP to enable; supported modes are FP, INT8, INT6, INT4.

  • Implemented a workaround to potentially mitigate GPU crashes experienced with the Command R+ model, pending a driver fix.
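As a minimal sketch of the quick reduce setting above, the helper below accepts only the four listed modes and exports the variable so that a vLLM process started later from the same shell inherits it. The function name `set_quick_reduce_mode` is hypothetical and is not part of vLLM or the Docker image.

```shell
# set_quick_reduce_mode MODE
# Hypothetical helper: export VLLM_ROCM_QUICK_REDUCE_QUANTIZATION only if
# MODE is one of the supported modes listed above.
set_quick_reduce_mode() {
    case "$1" in
        FP|INT8|INT6|INT4)
            export VLLM_ROCM_QUICK_REDUCE_QUANTIZATION="$1"
            ;;
        *)
            echo "unsupported quick reduce mode: $1" >&2
            return 1
            ;;
    esac
}

set_quick_reduce_mode INT8
echo "$VLLM_ROCM_QUICK_REDUCE_QUANTIZATION"   # prints: INT8
```

Rejecting unknown values up front avoids launching a long benchmark with a misspelled mode.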

Supported models

The following models are supported for inference performance benchmarking with vLLM and ROCm. Some instructions, commands, and recommendations in this documentation might vary by model – select one to get started.

Model group        Models
Meta Llama         Llama 3.1 8B, Llama 3.1 70B, Llama 3.1 405B, Llama 2 7B, Llama 2 70B, Llama 3.1 8B FP8, Llama 3.1 70B FP8, Llama 3.1 405B FP8
Mistral AI         Mixtral MoE 8x7B, Mixtral MoE 8x22B, Mistral 7B, Mixtral MoE 8x7B FP8, Mixtral MoE 8x22B FP8, Mistral 7B FP8
Qwen               Qwen2 7B, Qwen2 72B, QwQ-32B
Databricks DBRX    DBRX Instruct, DBRX Instruct FP8
Google Gemma       Gemma 2 27B
Cohere             C4AI Command R+ 08-2024, C4AI Command R+ 08-2024 FP8
DeepSeek           DeepSeek MoE 16B
Microsoft Phi      Phi-4
TII Falcon         Falcon 180B

Note

See the model card on Hugging Face for your selected model to learn more about it. Some models require access authorization before use, through an external license agreement with a third party.

Note

vLLM is a toolkit and library for LLM inference and serving. AMD implements high-performance custom kernels and modules in vLLM to enhance performance. See vLLM inference and vLLM V1 performance optimization for more information.

Performance measurements

The Performance results with AMD ROCm software page provides reference throughput and latency measurements for inference with popular AI models. Use it as a baseline when you evaluate performance.

Important

The performance data presented in Performance results with AMD ROCm software only reflects the latest version of this inference benchmarking environment. The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X GPUs or ROCm software.

System validation

Before running AI workloads, it’s important to validate that your AMD hardware is configured correctly and performing optimally.

If you have already validated your system settings, including aspects like NUMA auto-balancing, you can skip this step. Otherwise, complete the procedures in the System validation and optimization guide to properly configure your system settings before running the benchmarks.

To test for optimal performance, consult the recommended System health benchmarks. This suite of tests will help you verify and fine-tune your system’s configuration.

Pull the Docker image

Download the ROCm vLLM Docker image. Use the following command to pull the Docker image from Docker Hub.

docker pull rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715

Benchmarking

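Once the setup is complete, choose between two options to reproduce the benchmark results: run the benchmark through MAD on the host machine, or run the vLLM benchmark script standalone inside the Docker container. To benchmark through MAD: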

  1. Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.

    git clone https://github.com/ROCm/MAD
    cd MAD
    pip install -r requirements.txt
    
  2. Use this command to run the performance benchmark test on the Llama 3.1 8B model using one GPU with the float16 data type on the host machine.

    export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
    madengine run \
        --tags pyt_vllm_llama-3.1-8b \
        --keep-model-dir \
        --live-output \
        --timeout 28800
    

MAD launches a Docker container with the name container_ci-pyt_vllm_llama-3.1-8b. The latency and throughput reports of the model are collected in the following path: ~/MAD/reports_float16/.

Although the available models are preconfigured to collect latency and throughput performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.

Download the Docker image and required scripts

  1. Run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.

    docker pull rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715
    docker run -it \
        --device=/dev/kfd \
        --device=/dev/dri \
        --group-add video \
        --shm-size 16G \
        --security-opt seccomp=unconfined \
        --security-opt apparmor=unconfined \
        --cap-add=SYS_PTRACE \
        -v $(pwd):/workspace \
        --env HUGGINGFACE_HUB_CACHE=/workspace \
        --name test \
        rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715
    
  2. In the Docker container, clone the ROCm MAD repository and navigate to the benchmark scripts directory at ~/MAD/scripts/vllm.

    git clone https://github.com/ROCm/MAD
    cd MAD/scripts/vllm
    
  3. To start the benchmark, use the following command with the appropriate options.

    Benchmark options

    Name            Options              Description
    $test_option    latency              Measure decoding token latency
                    throughput           Measure token generation throughput
                    all                  Measure both throughput and latency
    $num_gpu        1 or 8               Number of GPUs
    $datatype       float16 or float8    Data type

    The input sequence length, output sequence length, and tensor parallelism (TP) are already configured. You don’t need to specify them with this script.

    Command:

    ./vllm_benchmark_report.sh \
        -s $test_option \
        -m meta-llama/Llama-3.1-8B-Instruct \
        -g $num_gpu \
        -d float16
    

    Note

    For best performance, it’s recommended to run with VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1.

    If you encounter the following error, pass your access-authorized Hugging Face token to the gated models.

    OSError: You are trying to access a gated repo.
    
    # pass your HF_TOKEN
    export HF_TOKEN=$your_personal_hf_token
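
The options table and command in step 3 can be sketched as a small argument checker. The helper name `benchmark_cmd` is hypothetical and is not part of the MAD scripts. It only validates the values against the table and prints the resulting command line, so it can be tried outside the container.

```shell
# benchmark_cmd TEST_OPTION MODEL NUM_GPU DATATYPE
# Hypothetical helper: checks the arguments against the options table,
# then prints the command line instead of running it.
benchmark_cmd() {
    case "$1" in
        latency|throughput|all) ;;
        *) echo "test option must be latency, throughput, or all" >&2; return 1 ;;
    esac
    case "$3" in
        1|8) ;;
        *) echo "number of GPUs must be 1 or 8" >&2; return 1 ;;
    esac
    case "$4" in
        float16|float8) ;;
        *) echo "data type must be float16 or float8" >&2; return 1 ;;
    esac
    printf './vllm_benchmark_report.sh -s %s -m %s -g %s -d %s\n' "$1" "$2" "$3" "$4"
}

# Recommended setting from the note above.
export VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1

benchmark_cmd all meta-llama/Llama-3.1-8B-Instruct 1 float16
# prints: ./vllm_benchmark_report.sh -s all -m meta-llama/Llama-3.1-8B-Instruct -g 1 -d float16
```

Inside the container, from the MAD/scripts/vllm directory, the printed line is the command to run.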
    

Benchmarking examples

Here are some examples of running the benchmark with various options:

  • Latency benchmark

    Use this command to benchmark the latency of the Llama 3.1 8B model on eight GPUs with float16 precision.

    ./vllm_benchmark_report.sh \
        -s latency \
        -m meta-llama/Llama-3.1-8B-Instruct \
        -g 8 \
        -d float16
    

    Find the latency report at ./reports_float16_vllm_rocm6.4.1/summary/Llama-3.1-8B-Instruct_latency_report.csv.

  • Throughput benchmark

    Use this command to benchmark the throughput of the Llama 3.1 8B model on eight GPUs with float16 precision.

    ./vllm_benchmark_report.sh \
        -s throughput \
        -m meta-llama/Llama-3.1-8B-Instruct \
        -g 8 \
        -d float16
    

    Find the throughput report at ./reports_float16_vllm_rocm6.4.1/summary/Llama-3.1-8B-Instruct_throughput_report.csv.
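
The reports are CSV files, so one way to read them in a terminal is to print the fields tab-separated. This is a sketch only: the column layout of the reports is not described here, so the hypothetical helper `show_report` makes no assumption about column names. It does assume that no field contains a quoted comma.

```shell
# show_report FILE
# Hypothetical helper: print a comma-separated report with tab-separated
# fields. Assumes fields contain no quoted commas.
show_report() {
    # Reassigning $1 makes awk rebuild each record with OFS (a tab).
    awk -F',' -v OFS='\t' '{ $1 = $1; print }' "$1"
}

# Example, using the throughput report path given above:
# show_report ./reports_float16_vllm_rocm6.4.1/summary/Llama-3.1-8B-Instruct_throughput_report.csv
```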

Note

Throughput is calculated as:

  • \[\text{throughput\_tot} = \text{requests} \times (\text{input lengths} + \text{output lengths}) / \text{elapsed\_time}\]
  • \[\text{throughput\_gen} = \text{requests} \times \text{output lengths} / \text{elapsed\_time}\]
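
As a worked example of the two formulas above, the sketch below evaluates them with awk. The helper name `throughput` and the numbers are made up for illustration and are not measured results. It assumes every request uses the same input and output length, and that elapsed time is in seconds.

```shell
# throughput REQUESTS INPUT_LEN OUTPUT_LEN ELAPSED_SECONDS
# Hypothetical helper: prints total and generation throughput in tokens
# per second, assuming all requests share the same lengths.
throughput() {
    awk -v r="$1" -v i="$2" -v o="$3" -v t="$4" 'BEGIN {
        printf "throughput_tot=%.1f throughput_gen=%.1f\n", r * (i + o) / t, r * o / t
    }'
}

# 1000 requests, 128 input and 128 output tokens each, finished in 50 s.
throughput 1000 128 128 50
# prints: throughput_tot=5120.0 throughput_gen=2560.0
```

In this example, throughput_tot counts both prompt and generated tokens, so it is twice throughput_gen when the input and output lengths are equal.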

Llama 3.1 70B

  1. Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.

    git clone https://github.com/ROCm/MAD
    cd MAD
    pip install -r requirements.txt
    
  2. Use this command to run the performance benchmark test on the Llama 3.1 70B model using one GPU with the float16 data type on the host machine.

    export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
    madengine run \
        --tags pyt_vllm_llama-3.1-70b \
        --keep-model-dir \
        --live-output \
        --timeout 28800
    

MAD launches a Docker container with the name container_ci-pyt_vllm_llama-3.1-70b. The latency and throughput reports of the model are collected in the following path: ~/MAD/reports_float16/.

Although the available models are preconfigured to collect latency and throughput performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.

Download the Docker image and required scripts

  1. Run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.

    docker pull rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715
    docker run -it \
        --device=/dev/kfd \
        --device=/dev/dri \
        --group-add video \
        --shm-size 16G \
        --security-opt seccomp=unconfined \
        --security-opt apparmor=unconfined \
        --cap-add=SYS_PTRACE \
        -v $(pwd):/workspace \
        --env HUGGINGFACE_HUB_CACHE=/workspace \
        --name test \
        rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715
    
  2. In the Docker container, clone the ROCm MAD repository and navigate to the benchmark scripts directory at ~/MAD/scripts/vllm.

    git clone https://github.com/ROCm/MAD
    cd MAD/scripts/vllm
    
  3. To start the benchmark, use the following command with the appropriate options.

    Benchmark options

    Name            Options              Description
    $test_option    latency              Measure decoding token latency
                    throughput           Measure token generation throughput
                    all                  Measure both throughput and latency
    $num_gpu        1 or 8               Number of GPUs
    $datatype       float16 or float8    Data type

    The input sequence length, output sequence length, and tensor parallelism (TP) are already configured. You don’t need to specify them with this script.

    Command:

    ./vllm_benchmark_report.sh \
        -s $test_option \
        -m meta-llama/Llama-3.1-70B-Instruct \
        -g $num_gpu \
        -d float16
    

    Note

    For best performance, it’s recommended to run with VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1.

    If you encounter the following error, pass your access-authorized Hugging Face token to the gated models.

    OSError: You are trying to access a gated repo.
    
    # pass your HF_TOKEN
    export HF_TOKEN=$your_personal_hf_token
    

Benchmarking examples

Here are some examples of running the benchmark with various options:

  • Latency benchmark

    Use this command to benchmark the latency of the Llama 3.1 70B model on eight GPUs with float16 precision.

    ./vllm_benchmark_report.sh \
        -s latency \
        -m meta-llama/Llama-3.1-70B-Instruct \
        -g 8 \
        -d float16
    

    Find the latency report at ./reports_float16_vllm_rocm6.4.1/summary/Llama-3.1-70B-Instruct_latency_report.csv.

  • Throughput benchmark

    Use this command to benchmark the throughput of the Llama 3.1 70B model on eight GPUs with float16 precision.

    ./vllm_benchmark_report.sh \
        -s throughput \
        -m meta-llama/Llama-3.1-70B-Instruct \
        -g 8 \
        -d float16
    

    Find the throughput report at ./reports_float16_vllm_rocm6.4.1/summary/Llama-3.1-70B-Instruct_throughput_report.csv.

Note

Throughput is calculated as:

  • \[\text{throughput\_tot} = \text{requests} \times (\text{input lengths} + \text{output lengths}) / \text{elapsed\_time}\]
  • \[\text{throughput\_gen} = \text{requests} \times \text{output lengths} / \text{elapsed\_time}\]

Llama 3.1 405B

  1. Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.

    git clone https://github.com/ROCm/MAD
    cd MAD
    pip install -r requirements.txt
    
  2. Use this command to run the performance benchmark test on the Llama 3.1 405B model using one GPU with the float16 data type on the host machine.

    export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
    madengine run \
        --tags pyt_vllm_llama-3.1-405b \
        --keep-model-dir \
        --live-output \
        --timeout 28800
    

MAD launches a Docker container with the name container_ci-pyt_vllm_llama-3.1-405b. The latency and throughput reports of the model are collected in the following path: ~/MAD/reports_float16/.

Although the available models are preconfigured to collect latency and throughput performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.

Download the Docker image and required scripts

  1. Run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.

    docker pull rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715
    docker run -it \
        --device=/dev/kfd \
        --device=/dev/dri \
        --group-add video \
        --shm-size 16G \
        --security-opt seccomp=unconfined \
        --security-opt apparmor=unconfined \
        --cap-add=SYS_PTRACE \
        -v $(pwd):/workspace \
        --env HUGGINGFACE_HUB_CACHE=/workspace \
        --name test \
        rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715
    
  2. In the Docker container, clone the ROCm MAD repository and navigate to the benchmark scripts directory at ~/MAD/scripts/vllm.

    git clone https://github.com/ROCm/MAD
    cd MAD/scripts/vllm
    
  3. To start the benchmark, use the following command with the appropriate options.

    Benchmark options

    Name            Options              Description
    $test_option    latency              Measure decoding token latency
                    throughput           Measure token generation throughput
                    all                  Measure both throughput and latency
    $num_gpu        1 or 8               Number of GPUs
    $datatype       float16 or float8    Data type

    The input sequence length, output sequence length, and tensor parallelism (TP) are already configured. You don’t need to specify them with this script.

    Command:

    ./vllm_benchmark_report.sh \
        -s $test_option \
        -m meta-llama/Llama-3.1-405B-Instruct \
        -g $num_gpu \
        -d float16
    

    Note

    For best performance, it’s recommended to run with VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1.

    If you encounter the following error, pass your access-authorized Hugging Face token to the gated models.

    OSError: You are trying to access a gated repo.
    
    # pass your HF_TOKEN
    export HF_TOKEN=$your_personal_hf_token
    

Benchmarking examples

Here are some examples of running the benchmark with various options:

  • Latency benchmark

    Use this command to benchmark the latency of the Llama 3.1 405B model on eight GPUs with float16 precision.

    ./vllm_benchmark_report.sh \
        -s latency \
        -m meta-llama/Llama-3.1-405B-Instruct \
        -g 8 \
        -d float16
    

    Find the latency report at ./reports_float16_vllm_rocm6.4.1/summary/Llama-3.1-405B-Instruct_latency_report.csv.

  • Throughput benchmark

    Use this command to benchmark the throughput of the Llama 3.1 405B model on eight GPUs with float16 precision.

    ./vllm_benchmark_report.sh \
        -s throughput \
        -m meta-llama/Llama-3.1-405B-Instruct \
        -g 8 \
        -d float16
    

    Find the throughput report at ./reports_float16_vllm_rocm6.4.1/summary/Llama-3.1-405B-Instruct_throughput_report.csv.

Note

Throughput is calculated as:

  • \[\text{throughput\_tot} = \text{requests} \times (\text{input lengths} + \text{output lengths}) / \text{elapsed\_time}\]
  • \[\text{throughput\_gen} = \text{requests} \times \text{output lengths} / \text{elapsed\_time}\]

Llama 2 7B

  1. Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.

    git clone https://github.com/ROCm/MAD
    cd MAD
    pip install -r requirements.txt
    
  2. Use this command to run the performance benchmark test on the Llama 2 7B model using one GPU with the float16 data type on the host machine.

    export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
    madengine run \
        --tags pyt_vllm_llama-2-7b \
        --keep-model-dir \
        --live-output \
        --timeout 28800
    

MAD launches a Docker container with the name container_ci-pyt_vllm_llama-2-7b. The latency and throughput reports of the model are collected in the following path: ~/MAD/reports_float16/.

Although the available models are preconfigured to collect latency and throughput performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.

Download the Docker image and required scripts

  1. Run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.

    docker pull rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715
    docker run -it \
        --device=/dev/kfd \
        --device=/dev/dri \
        --group-add video \
        --shm-size 16G \
        --security-opt seccomp=unconfined \
        --security-opt apparmor=unconfined \
        --cap-add=SYS_PTRACE \
        -v $(pwd):/workspace \
        --env HUGGINGFACE_HUB_CACHE=/workspace \
        --name test \
        rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715
    
  2. In the Docker container, clone the ROCm MAD repository and navigate to the benchmark scripts directory at ~/MAD/scripts/vllm.

    git clone https://github.com/ROCm/MAD
    cd MAD/scripts/vllm
    
  3. To start the benchmark, use the following command with the appropriate options.

    Benchmark options

    Name            Options              Description
    $test_option    latency              Measure decoding token latency
                    throughput           Measure token generation throughput
                    all                  Measure both throughput and latency
    $num_gpu        1 or 8               Number of GPUs
    $datatype       float16 or float8    Data type

    The input sequence length, output sequence length, and tensor parallelism (TP) are already configured. You don’t need to specify them with this script.

    Command:

    ./vllm_benchmark_report.sh \
        -s $test_option \
        -m meta-llama/Llama-2-7b-chat-hf \
        -g $num_gpu \
        -d float16
    

    Note

    For best performance, it’s recommended to run with VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1.

    If you encounter the following error, pass your access-authorized Hugging Face token to the gated models.

    OSError: You are trying to access a gated repo.
    
    # pass your HF_TOKEN
    export HF_TOKEN=$your_personal_hf_token
    

Benchmarking examples

Here are some examples of running the benchmark with various options:

  • Latency benchmark

    Use this command to benchmark the latency of the Llama 2 7B model on eight GPUs with float16 precision.

    ./vllm_benchmark_report.sh \
        -s latency \
        -m meta-llama/Llama-2-7b-chat-hf \
        -g 8 \
        -d float16
    

    Find the latency report at ./reports_float16_vllm_rocm6.4.1/summary/Llama-2-7b-chat-hf_latency_report.csv.

  • Throughput benchmark

    Use this command to benchmark the throughput of the Llama 2 7B model on eight GPUs with float16 precision.

    ./vllm_benchmark_report.sh \
        -s throughput \
        -m meta-llama/Llama-2-7b-chat-hf \
        -g 8 \
        -d float16
    

    Find the throughput report at ./reports_float16_vllm_rocm6.4.1/summary/Llama-2-7b-chat-hf_throughput_report.csv.

Note

Throughput is calculated as:

  • \[\text{throughput\_tot} = \text{requests} \times (\text{input lengths} + \text{output lengths}) / \text{elapsed\_time}\]
  • \[\text{throughput\_gen} = \text{requests} \times \text{output lengths} / \text{elapsed\_time}\]

Llama 2 70B

  1. Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.

    git clone https://github.com/ROCm/MAD
    cd MAD
    pip install -r requirements.txt
    
  2. Use this command to run the performance benchmark test on the Llama 2 70B model using one GPU with the float16 data type on the host machine.

    export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
    madengine run \
        --tags pyt_vllm_llama-2-70b \
        --keep-model-dir \
        --live-output \
        --timeout 28800
    

MAD launches a Docker container with the name container_ci-pyt_vllm_llama-2-70b. The latency and throughput reports of the model are collected in the following path: ~/MAD/reports_float16/.

Although the available models are preconfigured to collect latency and throughput performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.

Download the Docker image and required scripts

  1. Run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.

    docker pull rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715
    docker run -it \
        --device=/dev/kfd \
        --device=/dev/dri \
        --group-add video \
        --shm-size 16G \
        --security-opt seccomp=unconfined \
        --security-opt apparmor=unconfined \
        --cap-add=SYS_PTRACE \
        -v $(pwd):/workspace \
        --env HUGGINGFACE_HUB_CACHE=/workspace \
        --name test \
        rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715
    
  2. In the Docker container, clone the ROCm MAD repository and navigate to the benchmark scripts directory at ~/MAD/scripts/vllm.

    git clone https://github.com/ROCm/MAD
    cd MAD/scripts/vllm
    
  3. To start the benchmark, use the following command with the appropriate options.

    Benchmark options

    Name            Options              Description
    $test_option    latency              Measure decoding token latency
                    throughput           Measure token generation throughput
                    all                  Measure both throughput and latency
    $num_gpu        1 or 8               Number of GPUs
    $datatype       float16 or float8    Data type

    The input sequence length, output sequence length, and tensor parallelism (TP) are already configured. You don’t need to specify them with this script.

    Command:

    ./vllm_benchmark_report.sh \
        -s $test_option \
        -m meta-llama/Llama-2-70b-chat-hf \
        -g $num_gpu \
        -d float16
    

    Note

    For best performance, it’s recommended to run with VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1.

    If you encounter the following error, pass your access-authorized Hugging Face token to the gated models.

    OSError: You are trying to access a gated repo.
    
    # pass your HF_TOKEN
    export HF_TOKEN=$your_personal_hf_token
    

Benchmarking examples

Here are some examples of running the benchmark with various options:

  • Latency benchmark

    Use this command to benchmark the latency of the Llama 2 70B model on eight GPUs with float16 precision.

    ./vllm_benchmark_report.sh \
        -s latency \
        -m meta-llama/Llama-2-70b-chat-hf \
        -g 8 \
        -d float16
    

    Find the latency report at ./reports_float16_vllm_rocm6.4.1/summary/Llama-2-70b-chat-hf_latency_report.csv.

  • Throughput benchmark

    Use this command to benchmark the throughput of the Llama 2 70B model on eight GPUs with float16 precision.

    ./vllm_benchmark_report.sh \
        -s throughput \
        -m meta-llama/Llama-2-70b-chat-hf \
        -g 8 \
        -d float16
    

    Find the throughput report at ./reports_float16_vllm_rocm6.4.1/summary/Llama-2-70b-chat-hf_throughput_report.csv.

Note

Throughput is calculated as:

  • \[\text{throughput\_tot} = \text{requests} \times (\text{input lengths} + \text{output lengths}) / \text{elapsed\_time}\]
  • \[\text{throughput\_gen} = \text{requests} \times \text{output lengths} / \text{elapsed\_time}\]

Llama 3.1 8B FP8

  1. Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.

    git clone https://github.com/ROCm/MAD
    cd MAD
    pip install -r requirements.txt
    
  2. Use this command to run the performance benchmark test on the Llama 3.1 8B FP8 model using one GPU with the float8 data type on the host machine.

    export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
    madengine run \
        --tags pyt_vllm_llama-3.1-8b_fp8 \
        --keep-model-dir \
        --live-output \
        --timeout 28800
    

MAD launches a Docker container with the name container_ci-pyt_vllm_llama-3.1-8b_fp8. The latency and throughput reports of the model are collected in the following path: ~/MAD/reports_float8/.

Although the available models are preconfigured to collect latency and throughput performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.

Download the Docker image and required scripts

  1. Run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.

    docker pull rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715
    docker run -it \
        --device=/dev/kfd \
        --device=/dev/dri \
        --group-add video \
        --shm-size 16G \
        --security-opt seccomp=unconfined \
        --security-opt apparmor=unconfined \
        --cap-add=SYS_PTRACE \
        -v $(pwd):/workspace \
        --env HUGGINGFACE_HUB_CACHE=/workspace \
        --name test \
        rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715
    
  2. In the Docker container, clone the ROCm MAD repository and navigate to the benchmark scripts directory at ~/MAD/scripts/vllm.

    git clone https://github.com/ROCm/MAD
    cd MAD/scripts/vllm
    
  3. To start the benchmark, use the following command with the appropriate options.

    Benchmark options

    Name            Options              Description
    $test_option    latency              Measure decoding token latency
                    throughput           Measure token generation throughput
                    all                  Measure both throughput and latency
    $num_gpu        1 or 8               Number of GPUs
    $datatype       float16 or float8    Data type

    The input sequence length, output sequence length, and tensor parallelism (TP) are already configured. You don’t need to specify them with this script.

    Command:

    ./vllm_benchmark_report.sh \
        -s $test_option \
        -m amd/Llama-3.1-8B-Instruct-FP8-KV \
        -g $num_gpu \
        -d float8
    

    Note

    For best performance, run with the environment variable VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1 set.

    If you encounter the following error, the model is gated. Set HF_TOKEN to a Hugging Face token that has been granted access to the model.

    OSError: You are trying to access a gated repo.
    
    # pass your HF_TOKEN
    export HF_TOKEN=$your_personal_hf_token
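
As a minimal sketch, a pre-flight check like the following can fail early when no token is exported, rather than waiting for the gated-repo error. The function name `require_hf_token` is hypothetical and not part of the MAD scripts; it only tests whether HF_TOKEN is non-empty and does not verify that the token actually grants access.

```shell
# Hypothetical helper (not part of MAD): succeed only if HF_TOKEN is non-empty.
require_hf_token() {
    if [ -z "${HF_TOKEN:-}" ]; then
        echo "HF_TOKEN is not set; export a token with access to the gated model." >&2
        return 1
    fi
    return 0
}
```

It could be chained in front of the benchmark, for example `require_hf_token && ./vllm_benchmark_report.sh -s latency -m amd/Llama-3.1-8B-Instruct-FP8-KV -g 8 -d float8`.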
    

Benchmarking examples

Here are some examples of running the benchmark with various options:

  • Latency benchmark

    Use this command to benchmark the latency of the Llama 3.1 8B FP8 model on eight GPUs with float8 precision.

    ./vllm_benchmark_report.sh \
        -s latency \
        -m amd/Llama-3.1-8B-Instruct-FP8-KV \
        -g 8 \
        -d float8
    

    Find the latency report at ./reports_float8_vllm_rocm6.4.1/summary/Llama-3.1-8B-Instruct-FP8-KV_latency_report.csv.

  • Throughput benchmark

    Use this command to benchmark the throughput of the Llama 3.1 8B FP8 model on eight GPUs with float8 precision.

    ./vllm_benchmark_report.sh \
        -s throughput \
        -m amd/Llama-3.1-8B-Instruct-FP8-KV \
        -g 8 \
        -d float8
    

    Find the throughput report at ./reports_float8_vllm_rocm6.4.1/summary/Llama-3.1-8B-Instruct-FP8-KV_throughput_report.csv.

Note

Throughput is calculated as:

  • \[throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time\]
  • \[throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time\]
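
As a worked example of the two formulas, the sketch below evaluates them with `awk`. The function names and the sample figures (100 requests, 128 input tokens, 128 output tokens, 20 seconds) are illustrative assumptions, not measured results, and the sketch assumes every request has the same input and output length.

```shell
# throughput_tot REQUESTS INPUT_LEN OUTPUT_LEN ELAPSED_SECONDS
# Total tokens (prompt + generated) processed per second.
throughput_tot() {
    awk -v r="$1" -v i="$2" -v o="$3" -v t="$4" \
        'BEGIN { printf "%.2f\n", r * (i + o) / t }'
}

# throughput_gen REQUESTS OUTPUT_LEN ELAPSED_SECONDS
# Generated tokens per second only.
throughput_gen() {
    awk -v r="$1" -v o="$2" -v t="$3" \
        'BEGIN { printf "%.2f\n", r * o / t }'
}

throughput_tot 100 128 128 20   # illustrative inputs only
throughput_gen 100 128 20
```

With these made-up inputs, the two calls print 1280.00 and 640.00 tokens per second.
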
  1. Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.

    git clone https://github.com/ROCm/MAD
    cd MAD
    pip install -r requirements.txt
    
  2. Use this command to run the performance benchmark test on the Llama 3.1 70B FP8 model using one GPU with the float8 data type on the host machine.

    export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
    madengine run \
        --tags pyt_vllm_llama-3.1-70b_fp8 \
        --keep-model-dir \
        --live-output \
        --timeout 28800
    

MAD launches a Docker container with the name container_ci-pyt_vllm_llama-3.1-70b_fp8. The latency and throughput reports of the model are collected in the following path: ~/MAD/reports_float8/.

Although the available models are preconfigured to collect latency and throughput performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.

Download the Docker image and required scripts

  1. Run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.

    docker pull rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715
    docker run -it \
        --device=/dev/kfd \
        --device=/dev/dri \
        --group-add video \
        --shm-size 16G \
        --security-opt seccomp=unconfined \
        --security-opt apparmor=unconfined \
        --cap-add=SYS_PTRACE \
        -v $(pwd):/workspace \
        --env HUGGINGFACE_HUB_CACHE=/workspace \
        --name test \
        rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715
    
  2. In the Docker container, clone the ROCm MAD repository and navigate to the benchmark scripts directory at ~/MAD/scripts/vllm.

    git clone https://github.com/ROCm/MAD
    cd MAD/scripts/vllm
    
  3. To start the benchmark, use the following command with the appropriate options.

    Benchmark options

    Name           Options             Description
    $test_option   latency             Measure decoding token latency
                   throughput          Measure token generation throughput
                   all                 Measure both throughput and latency
    $num_gpu       1 or 8              Number of GPUs
    $datatype      float16 or float8   Data type

    The input sequence length, output sequence length, and tensor parallelism (TP) are already configured, so you don’t need to specify them with this script.

    Command:

    ./vllm_benchmark_report.sh \
        -s $test_option \
        -m amd/Llama-3.1-70B-Instruct-FP8-KV \
        -g $num_gpu \
        -d float8
    

    Note

    For best performance, run with the environment variable VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1 set.

    If you encounter the following error, the model is gated. Set HF_TOKEN to a Hugging Face token that has been granted access to the model.

    OSError: You are trying to access a gated repo.
    
    # pass your HF_TOKEN
    export HF_TOKEN=$your_personal_hf_token
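
The values accepted for the benchmark options in step 3 can be checked before a long run starts. The following is a sketch only: `validate_benchmark_options` is a hypothetical helper, not part of vllm_benchmark_report.sh, and it accepts exactly the values listed in the benchmark options table.

```shell
# Hypothetical pre-check (not part of vllm_benchmark_report.sh): accept only
# the option values listed in the benchmark options table.
validate_benchmark_options() {
    test_option=$1
    num_gpu=$2
    datatype=$3

    case "$test_option" in
        latency|throughput|all) ;;
        *) echo "invalid test option: $test_option" >&2; return 1 ;;
    esac
    case "$num_gpu" in
        1|8) ;;
        *) echo "invalid GPU count: $num_gpu" >&2; return 1 ;;
    esac
    case "$datatype" in
        float16|float8) ;;
        *) echo "invalid data type: $datatype" >&2; return 1 ;;
    esac
}
```

For example, `validate_benchmark_options latency 8 float8` succeeds, while a GPU count of 4 is rejected with a message on standard error.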
    

Benchmarking examples

Here are some examples of running the benchmark with various options:

  • Latency benchmark

    Use this command to benchmark the latency of the Llama 3.1 70B FP8 model on eight GPUs with float8 precision.

    ./vllm_benchmark_report.sh \
        -s latency \
        -m amd/Llama-3.1-70B-Instruct-FP8-KV \
        -g 8 \
        -d float8
    

    Find the latency report at ./reports_float8_vllm_rocm6.4.1/summary/Llama-3.1-70B-Instruct-FP8-KV_latency_report.csv.

  • Throughput benchmark

    Use this command to benchmark the throughput of the Llama 3.1 70B FP8 model on eight GPUs with float8 precision.

    ./vllm_benchmark_report.sh \
        -s throughput \
        -m amd/Llama-3.1-70B-Instruct-FP8-KV \
        -g 8 \
        -d float8
    

    Find the throughput report at ./reports_float8_vllm_rocm6.4.1/summary/Llama-3.1-70B-Instruct-FP8-KV_throughput_report.csv.
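
Because the reports are CSV files, they can be inspected without knowing their columns in advance. The sketch below prints each header field with its index and then the number of data rows. `summarize_report` is a hypothetical helper, and the simple comma split assumes that no field contains a quoted comma.

```shell
# summarize_report FILE
# Print the header fields (numbered) and the count of data rows in a CSV file.
summarize_report() {
    awk -F',' '
        NR == 1 { for (i = 1; i <= NF; i++) printf "%d: %s\n", i, $i; next }
        NF > 0  { rows++ }
        END     { printf "rows: %d\n", rows }
    ' "$1"
}
```

For example: `summarize_report ./reports_float8_vllm_rocm6.4.1/summary/Llama-3.1-70B-Instruct-FP8-KV_throughput_report.csv`.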

Note

Throughput is calculated as:

  • \[throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time\]
  • \[throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time\]
  1. Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.

    git clone https://github.com/ROCm/MAD
    cd MAD
    pip install -r requirements.txt
    
  2. Use this command to run the performance benchmark test on the Llama 3.1 405B FP8 model using one GPU with the float8 data type on the host machine.

    export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
    madengine run \
        --tags pyt_vllm_llama-3.1-405b_fp8 \
        --keep-model-dir \
        --live-output \
        --timeout 28800
    

MAD launches a Docker container with the name container_ci-pyt_vllm_llama-3.1-405b_fp8. The latency and throughput reports of the model are collected in the following path: ~/MAD/reports_float8/.

Although the available models are preconfigured to collect latency and throughput performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.

Download the Docker image and required scripts

  1. Run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.

    docker pull rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715
    docker run -it \
        --device=/dev/kfd \
        --device=/dev/dri \
        --group-add video \
        --shm-size 16G \
        --security-opt seccomp=unconfined \
        --security-opt apparmor=unconfined \
        --cap-add=SYS_PTRACE \
        -v $(pwd):/workspace \
        --env HUGGINGFACE_HUB_CACHE=/workspace \
        --name test \
        rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715
    
  2. In the Docker container, clone the ROCm MAD repository and navigate to the benchmark scripts directory at ~/MAD/scripts/vllm.

    git clone https://github.com/ROCm/MAD
    cd MAD/scripts/vllm
    
  3. To start the benchmark, use the following command with the appropriate options.

    Benchmark options

    Name           Options             Description
    $test_option   latency             Measure decoding token latency
                   throughput          Measure token generation throughput
                   all                 Measure both throughput and latency
    $num_gpu       1 or 8              Number of GPUs
    $datatype      float16 or float8   Data type

    The input sequence length, output sequence length, and tensor parallelism (TP) are already configured, so you don’t need to specify them with this script.

    Command:

    ./vllm_benchmark_report.sh \
        -s $test_option \
        -m amd/Llama-3.1-405B-Instruct-FP8-KV \
        -g $num_gpu \
        -d float8
    

    Note

    For best performance, run with the environment variable VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1 set.

    If you encounter the following error, the model is gated. Set HF_TOKEN to a Hugging Face token that has been granted access to the model.

    OSError: You are trying to access a gated repo.
    
    # pass your HF_TOKEN
    export HF_TOKEN=$your_personal_hf_token
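
The VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1 setting from the note can be applied to a single run, without exporting it for the whole shell session, by prefixing the command with `env`. This is a sketch: the wrapper name `with_prefill_decode_attention` is hypothetical, and it assumes the benchmark script passes its environment through to vLLM.

```shell
# Run the given command with VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1 set
# for that command only.
with_prefill_decode_attention() {
    env VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1 "$@"
}
```

For example: `with_prefill_decode_attention ./vllm_benchmark_report.sh -s latency -m amd/Llama-3.1-405B-Instruct-FP8-KV -g 8 -d float8`.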
    

Benchmarking examples

Here are some examples of running the benchmark with various options:

  • Latency benchmark

    Use this command to benchmark the latency of the Llama 3.1 405B FP8 model on eight GPUs with float8 precision.

    ./vllm_benchmark_report.sh \
        -s latency \
        -m amd/Llama-3.1-405B-Instruct-FP8-KV \
        -g 8 \
        -d float8
    

    Find the latency report at ./reports_float8_vllm_rocm6.4.1/summary/Llama-3.1-405B-Instruct-FP8-KV_latency_report.csv.

  • Throughput benchmark

    Use this command to benchmark the throughput of the Llama 3.1 405B FP8 model on eight GPUs with float8 precision.

    ./vllm_benchmark_report.sh \
        -s throughput \
        -m amd/Llama-3.1-405B-Instruct-FP8-KV \
        -g 8 \
        -d float8
    

    Find the throughput report at ./reports_float8_vllm_rocm6.4.1/summary/Llama-3.1-405B-Instruct-FP8-KV_throughput_report.csv.

Note

Throughput is calculated as:

  • \[throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time\]
  • \[throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time\]
  1. Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.

    git clone https://github.com/ROCm/MAD
    cd MAD
    pip install -r requirements.txt
    
  2. Use this command to run the performance benchmark test on the Mixtral MoE 8x7B model using one GPU with the float16 data type on the host machine.

    export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
    madengine run \
        --tags pyt_vllm_mixtral-8x7b \
        --keep-model-dir \
        --live-output \
        --timeout 28800
    

MAD launches a Docker container with the name container_ci-pyt_vllm_mixtral-8x7b. The latency and throughput reports of the model are collected in the following path: ~/MAD/reports_float16/.
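
After the run finishes, a quick way to see what MAD collected is to list the files under the report directory. A minimal sketch: `list_reports` is a hypothetical helper and makes no assumption about the report file names or formats.

```shell
# list_reports DIR
# Print every regular file under DIR, sorted; fail if DIR does not exist.
list_reports() {
    if [ ! -d "$1" ]; then
        echo "report directory not found: $1" >&2
        return 1
    fi
    find "$1" -type f | sort
}
```

For example: `list_reports ~/MAD/reports_float16`.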

Although the available models are preconfigured to collect latency and throughput performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.

Download the Docker image and required scripts

  1. Run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.

    docker pull rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715
    docker run -it \
        --device=/dev/kfd \
        --device=/dev/dri \
        --group-add video \
        --shm-size 16G \
        --security-opt seccomp=unconfined \
        --security-opt apparmor=unconfined \
        --cap-add=SYS_PTRACE \
        -v $(pwd):/workspace \
        --env HUGGINGFACE_HUB_CACHE=/workspace \
        --name test \
        rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715
    
  2. In the Docker container, clone the ROCm MAD repository and navigate to the benchmark scripts directory at ~/MAD/scripts/vllm.

    git clone https://github.com/ROCm/MAD
    cd MAD/scripts/vllm
    
  3. To start the benchmark, use the following command with the appropriate options.

    Benchmark options

    Name           Options             Description
    $test_option   latency             Measure decoding token latency
                   throughput          Measure token generation throughput
                   all                 Measure both throughput and latency
    $num_gpu       1 or 8              Number of GPUs
    $datatype      float16 or float8   Data type

    The input sequence length, output sequence length, and tensor parallelism (TP) are already configured, so you don’t need to specify them with this script.

    Command:

    ./vllm_benchmark_report.sh \
        -s $test_option \
        -m mistralai/Mixtral-8x7B-Instruct-v0.1 \
        -g $num_gpu \
        -d float16
    

    Note

    For best performance, run with the environment variable VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1 set.

    If you encounter the following error, the model is gated. Set HF_TOKEN to a Hugging Face token that has been granted access to the model.

    OSError: You are trying to access a gated repo.
    
    # pass your HF_TOKEN
    export HF_TOKEN=$your_personal_hf_token
    

Benchmarking examples

Here are some examples of running the benchmark with various options:

  • Latency benchmark

    Use this command to benchmark the latency of the Mixtral MoE 8x7B model on eight GPUs with float16 precision.

    ./vllm_benchmark_report.sh \
        -s latency \
        -m mistralai/Mixtral-8x7B-Instruct-v0.1 \
        -g 8 \
        -d float16
    

    Find the latency report at ./reports_float16_vllm_rocm6.4.1/summary/Mixtral-8x7B-Instruct-v0.1_latency_report.csv.

  • Throughput benchmark

    Use this command to benchmark the throughput of the Mixtral MoE 8x7B model on eight GPUs with float16 precision.

    ./vllm_benchmark_report.sh \
        -s throughput \
        -m mistralai/Mixtral-8x7B-Instruct-v0.1 \
        -g 8 \
        -d float16
    

    Find the throughput report at ./reports_float16_vllm_rocm6.4.1/summary/Mixtral-8x7B-Instruct-v0.1_throughput_report.csv.

Note

Throughput is calculated as:

  • \[throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time\]
  • \[throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time\]
  1. Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.

    git clone https://github.com/ROCm/MAD
    cd MAD
    pip install -r requirements.txt
    
  2. Use this command to run the performance benchmark test on the Mixtral MoE 8x22B model using one GPU with the float16 data type on the host machine.

    export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
    madengine run \
        --tags pyt_vllm_mixtral-8x22b \
        --keep-model-dir \
        --live-output \
        --timeout 28800
    

MAD launches a Docker container with the name container_ci-pyt_vllm_mixtral-8x22b. The latency and throughput reports of the model are collected in the following path: ~/MAD/reports_float16/.

Although the available models are preconfigured to collect latency and throughput performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.

Download the Docker image and required scripts

  1. Run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.

    docker pull rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715
    docker run -it \
        --device=/dev/kfd \
        --device=/dev/dri \
        --group-add video \
        --shm-size 16G \
        --security-opt seccomp=unconfined \
        --security-opt apparmor=unconfined \
        --cap-add=SYS_PTRACE \
        -v $(pwd):/workspace \
        --env HUGGINGFACE_HUB_CACHE=/workspace \
        --name test \
        rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715
    
  2. In the Docker container, clone the ROCm MAD repository and navigate to the benchmark scripts directory at ~/MAD/scripts/vllm.

    git clone https://github.com/ROCm/MAD
    cd MAD/scripts/vllm
    
  3. To start the benchmark, use the following command with the appropriate options.

    Benchmark options

    Name           Options             Description
    $test_option   latency             Measure decoding token latency
                   throughput          Measure token generation throughput
                   all                 Measure both throughput and latency
    $num_gpu       1 or 8              Number of GPUs
    $datatype      float16 or float8   Data type

    The input sequence length, output sequence length, and tensor parallelism (TP) are already configured, so you don’t need to specify them with this script.

    Command:

    ./vllm_benchmark_report.sh \
        -s $test_option \
        -m mistralai/Mixtral-8x22B-Instruct-v0.1 \
        -g $num_gpu \
        -d float16
    

    Note

    For best performance, run with the environment variable VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1 set.

    If you encounter the following error, the model is gated. Set HF_TOKEN to a Hugging Face token that has been granted access to the model.

    OSError: You are trying to access a gated repo.
    
    # pass your HF_TOKEN
    export HF_TOKEN=$your_personal_hf_token
    

Benchmarking examples

Here are some examples of running the benchmark with various options:

  • Latency benchmark

    Use this command to benchmark the latency of the Mixtral MoE 8x22B model on eight GPUs with float16 precision.

    ./vllm_benchmark_report.sh \
        -s latency \
        -m mistralai/Mixtral-8x22B-Instruct-v0.1 \
        -g 8 \
        -d float16
    

    Find the latency report at ./reports_float16_vllm_rocm6.4.1/summary/Mixtral-8x22B-Instruct-v0.1_latency_report.csv.

  • Throughput benchmark

    Use this command to benchmark the throughput of the Mixtral MoE 8x22B model on eight GPUs with float16 precision.

    ./vllm_benchmark_report.sh \
        -s throughput \
        -m mistralai/Mixtral-8x22B-Instruct-v0.1 \
        -g 8 \
        -d float16
    

    Find the throughput report at ./reports_float16_vllm_rocm6.4.1/summary/Mixtral-8x22B-Instruct-v0.1_throughput_report.csv.
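
The report paths in these examples follow a visible pattern: the data type appears in the directory name, and the model name without its organization prefix appears in the file name. The sketch below reconstructs a path from that pattern. `report_path` is a hypothetical helper, and the pattern is inferred from the examples in this guide rather than taken from the benchmark script.

```shell
# report_path MODEL DATATYPE KIND
# MODEL is a Hugging Face ID such as org/name; KIND is latency or throughput.
report_path() {
    model_name=${1##*/}   # strip the organization prefix
    printf './reports_%s_vllm_rocm6.4.1/summary/%s_%s_report.csv\n' \
        "$2" "$model_name" "$3"
}

report_path mistralai/Mixtral-8x22B-Instruct-v0.1 float16 throughput
```

The final call prints the throughput report path given in the example above.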

Note

Throughput is calculated as:

  • \[throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time\]
  • \[throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time\]
  1. Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.

    git clone https://github.com/ROCm/MAD
    cd MAD
    pip install -r requirements.txt
    
  2. Use this command to run the performance benchmark test on the Mistral 7B model using one GPU with the float16 data type on the host machine.

    export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
    madengine run \
        --tags pyt_vllm_mistral-7b \
        --keep-model-dir \
        --live-output \
        --timeout 28800
    

MAD launches a Docker container with the name container_ci-pyt_vllm_mistral-7b. The latency and throughput reports of the model are collected in the following path: ~/MAD/reports_float16/.

Although the available models are preconfigured to collect latency and throughput performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.

Download the Docker image and required scripts

  1. Run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.

    docker pull rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715
    docker run -it \
        --device=/dev/kfd \
        --device=/dev/dri \
        --group-add video \
        --shm-size 16G \
        --security-opt seccomp=unconfined \
        --security-opt apparmor=unconfined \
        --cap-add=SYS_PTRACE \
        -v $(pwd):/workspace \
        --env HUGGINGFACE_HUB_CACHE=/workspace \
        --name test \
        rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715
    
  2. In the Docker container, clone the ROCm MAD repository and navigate to the benchmark scripts directory at ~/MAD/scripts/vllm.

    git clone https://github.com/ROCm/MAD
    cd MAD/scripts/vllm
    
  3. To start the benchmark, use the following command with the appropriate options.

    Benchmark options

    Name           Options             Description
    $test_option   latency             Measure decoding token latency
                   throughput          Measure token generation throughput
                   all                 Measure both throughput and latency
    $num_gpu       1 or 8              Number of GPUs
    $datatype      float16 or float8   Data type

    The input sequence length, output sequence length, and tensor parallelism (TP) are already configured, so you don’t need to specify them with this script.

    Command:

    ./vllm_benchmark_report.sh \
        -s $test_option \
        -m mistralai/Mistral-7B-Instruct-v0.3 \
        -g $num_gpu \
        -d float16
    

    Note

    For best performance, run with the environment variable VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1 set.

    If you encounter the following error, the model is gated. Set HF_TOKEN to a Hugging Face token that has been granted access to the model.

    OSError: You are trying to access a gated repo.
    
    # pass your HF_TOKEN
    export HF_TOKEN=$your_personal_hf_token
    

Benchmarking examples

Here are some examples of running the benchmark with various options:

  • Latency benchmark

    Use this command to benchmark the latency of the Mistral 7B model on eight GPUs with float16 precision.

    ./vllm_benchmark_report.sh \
        -s latency \
        -m mistralai/Mistral-7B-Instruct-v0.3 \
        -g 8 \
        -d float16
    

    Find the latency report at ./reports_float16_vllm_rocm6.4.1/summary/Mistral-7B-Instruct-v0.3_latency_report.csv.

  • Throughput benchmark

    Use this command to benchmark the throughput of the Mistral 7B model on eight GPUs with float16 precision.

    ./vllm_benchmark_report.sh \
        -s throughput \
        -m mistralai/Mistral-7B-Instruct-v0.3 \
        -g 8 \
        -d float16
    

    Find the throughput report at ./reports_float16_vllm_rocm6.4.1/summary/Mistral-7B-Instruct-v0.3_throughput_report.csv.
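
To cover both test types and both supported GPU counts, the invocations can be generated in a loop. The sketch below only prints the commands (a dry run) so they can be reviewed before execution. `print_benchmark_matrix` is a hypothetical helper, not part of the MAD scripts.

```shell
# print_benchmark_matrix MODEL DATATYPE
# Print one vllm_benchmark_report.sh command per test type and GPU count.
print_benchmark_matrix() {
    for test_option in latency throughput; do
        for num_gpu in 1 8; do
            printf './vllm_benchmark_report.sh -s %s -m %s -g %s -d %s\n' \
                "$test_option" "$1" "$num_gpu" "$2"
        done
    done
}

print_benchmark_matrix mistralai/Mistral-7B-Instruct-v0.3 float16
```

Printing rather than executing keeps the sketch safe to run anywhere; the printed lines can then be run one at a time from the MAD/scripts/vllm directory.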

Note

Throughput is calculated as:

  • \[throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time\]
  • \[throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time\]
  1. Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.

    git clone https://github.com/ROCm/MAD
    cd MAD
    pip install -r requirements.txt
    
  2. Use this command to run the performance benchmark test on the Mixtral MoE 8x7B FP8 model using one GPU with the float8 data type on the host machine.

    export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
    madengine run \
        --tags pyt_vllm_mixtral-8x7b_fp8 \
        --keep-model-dir \
        --live-output \
        --timeout 28800
    

MAD launches a Docker container with the name container_ci-pyt_vllm_mixtral-8x7b_fp8. The latency and throughput reports of the model are collected in the following path: ~/MAD/reports_float8/.

Although the available models are preconfigured to collect latency and throughput performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.

Download the Docker image and required scripts

  1. Run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.

    docker pull rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715
    docker run -it \
        --device=/dev/kfd \
        --device=/dev/dri \
        --group-add video \
        --shm-size 16G \
        --security-opt seccomp=unconfined \
        --security-opt apparmor=unconfined \
        --cap-add=SYS_PTRACE \
        -v $(pwd):/workspace \
        --env HUGGINGFACE_HUB_CACHE=/workspace \
        --name test \
        rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715
    
  2. In the Docker container, clone the ROCm MAD repository and navigate to the benchmark scripts directory at ~/MAD/scripts/vllm.

    git clone https://github.com/ROCm/MAD
    cd MAD/scripts/vllm
    
  3. To start the benchmark, use the following command with the appropriate options.

    Benchmark options

    Name           Options             Description
    $test_option   latency             Measure decoding token latency
                   throughput          Measure token generation throughput
                   all                 Measure both throughput and latency
    $num_gpu       1 or 8              Number of GPUs
    $datatype      float16 or float8   Data type

    The input sequence length, output sequence length, and tensor parallelism (TP) are already configured, so you don’t need to specify them with this script.

    Command:

    ./vllm_benchmark_report.sh \
        -s $test_option \
        -m amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV \
        -g $num_gpu \
        -d float8
    

    Note

    For best performance, run with the environment variable VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1 set.

    If you encounter the following error, the model is gated. Set HF_TOKEN to a Hugging Face token that has been granted access to the model.

    OSError: You are trying to access a gated repo.
    
    # pass your HF_TOKEN
    export HF_TOKEN=$your_personal_hf_token
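
The docker run command in step 1 passes /dev/kfd and /dev/dri into the container, so it can help to confirm that both exist on the host first. A minimal sketch: `check_gpu_devices` is a hypothetical helper, and it checks only that the paths exist, not that the driver is healthy.

```shell
# check_gpu_devices [PATH...]
# Succeed only if every given path exists; defaults to the two device
# nodes passed to docker run.
check_gpu_devices() {
    [ "$#" -gt 0 ] || set -- /dev/kfd /dev/dri
    status=0
    for dev in "$@"; do
        if [ ! -e "$dev" ]; then
            echo "missing device node: $dev" >&2
            status=1
        fi
    done
    return "$status"
}
```

Called with no arguments on the host, it reports any of the two default device nodes that are missing.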
    

Benchmarking examples

Here are some examples of running the benchmark with various options:

  • Latency benchmark

    Use this command to benchmark the latency of the Mixtral MoE 8x7B FP8 model on eight GPUs with float8 precision.

    ./vllm_benchmark_report.sh \
        -s latency \
        -m amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV \
        -g 8 \
        -d float8
    

    Find the latency report at ./reports_float8_vllm_rocm6.4.1/summary/Mixtral-8x7B-Instruct-v0.1-FP8-KV_latency_report.csv.

  • Throughput benchmark

    Use this command to benchmark the throughput of the Mixtral MoE 8x7B FP8 model on eight GPUs with float8 precision.

    ./vllm_benchmark_report.sh \
        -s throughput \
        -m amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV \
        -g 8 \
        -d float8
    

    Find the throughput report at ./reports_float8_vllm_rocm6.4.1/summary/Mixtral-8x7B-Instruct-v0.1-FP8-KV_throughput_report.csv.

Note

Throughput is calculated as:

  • \[throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time\]
  • \[throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time\]
  1. Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.

    git clone https://github.com/ROCm/MAD
    cd MAD
    pip install -r requirements.txt
    
  2. Use this command to run the performance benchmark test on the Mixtral MoE 8x22B FP8 model using one GPU with the float8 data type on the host machine.

    export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
    madengine run \
        --tags pyt_vllm_mixtral-8x22b_fp8 \
        --keep-model-dir \
        --live-output \
        --timeout 28800
    

MAD launches a Docker container with the name container_ci-pyt_vllm_mixtral-8x22b_fp8. The latency and throughput reports of the model are collected in the following path: ~/MAD/reports_float8/.

Although the available models are preconfigured to collect latency and throughput performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.

Download the Docker image and required scripts

  1. Run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.

    docker pull rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715
    docker run -it \
        --device=/dev/kfd \
        --device=/dev/dri \
        --group-add video \
        --shm-size 16G \
        --security-opt seccomp=unconfined \
        --security-opt apparmor=unconfined \
        --cap-add=SYS_PTRACE \
        -v $(pwd):/workspace \
        --env HUGGINGFACE_HUB_CACHE=/workspace \
        --name test \
        rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715
    
  2. In the Docker container, clone the ROCm MAD repository and navigate to the benchmark scripts directory at ~/MAD/scripts/vllm.

    git clone https://github.com/ROCm/MAD
    cd MAD/scripts/vllm
    
  3. To start the benchmark, use the following command with the appropriate options.

    Benchmark options

    Name           Options             Description
    $test_option   latency             Measure decoding token latency
                   throughput          Measure token generation throughput
                   all                 Measure both throughput and latency
    $num_gpu       1 or 8              Number of GPUs
    $datatype      float16 or float8   Data type

    The input sequence length, output sequence length, and tensor parallelism (TP) are already configured, so you don’t need to specify them with this script.

    Command:

    ./vllm_benchmark_report.sh \
        -s $test_option \
        -m amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV \
        -g $num_gpu \
        -d float8
    

    Note

    For best performance, run with the environment variable VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1 set.

    If you encounter the following error, the model is gated. Set HF_TOKEN to a Hugging Face token that has been granted access to the model.

    OSError: You are trying to access a gated repo.
    
    # pass your HF_TOKEN
    export HF_TOKEN=$your_personal_hf_token
    

Benchmarking examples

Here are some examples of running the benchmark with various options:

  • Latency benchmark

    Use this command to benchmark the latency of the Mixtral MoE 8x22B FP8 model on eight GPUs with float8 precision.

    ./vllm_benchmark_report.sh \
        -s latency \
        -m amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV \
        -g 8 \
        -d float8
    

    Find the latency report at ./reports_float8_vllm_rocm6.4.1/summary/Mixtral-8x22B-Instruct-v0.1-FP8-KV_latency_report.csv.

  • Throughput benchmark

    Use this command to benchmark the throughput of the Mixtral MoE 8x22B FP8 model on eight GPUs with float8 precision.

    ./vllm_benchmark_report.sh \
        -s throughput \
        -m amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV \
        -g 8 \
        -d float8
    

    Find the throughput report at ./reports_float8_vllm_rocm6.4.1/summary/Mixtral-8x22B-Instruct-v0.1-FP8-KV_throughput_report.csv.

Note

Throughput is calculated as:

  • \[throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time\]
  • \[throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time\]
  1. Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.

    git clone https://github.com/ROCm/MAD
    cd MAD
    pip install -r requirements.txt
    
  2. Use this command to run the performance benchmark test on the Mistral 7B FP8 model using one GPU with the float8 data type on the host machine.

    export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
    madengine run \
        --tags pyt_vllm_mistral-7b_fp8 \
        --keep-model-dir \
        --live-output \
        --timeout 28800
    

MAD launches a Docker container with the name container_ci-pyt_vllm_mistral-7b_fp8. The latency and throughput reports of the model are collected in the following path: ~/MAD/reports_float8/.
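
The madengine invocations in this guide differ only in the value passed to --tags, so the command can be parameterized by tag. The sketch below prints the command instead of running it. `print_madengine_cmd` is a hypothetical helper; the flags are copied from the command shown in step 2.

```shell
# print_madengine_cmd TAG
# Print the madengine command used in this guide for the given model tag.
print_madengine_cmd() {
    printf 'madengine run --tags %s --keep-model-dir --live-output --timeout 28800\n' "$1"
}

print_madengine_cmd pyt_vllm_mistral-7b_fp8
```

MAD_SECRETS_HFTOKEN still needs to be exported before the printed command is run.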

Although the available models are preconfigured to collect latency and throughput performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.

Download the Docker image and required scripts

  1. Run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.

    docker pull rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715
    docker run -it \
        --device=/dev/kfd \
        --device=/dev/dri \
        --group-add video \
        --shm-size 16G \
        --security-opt seccomp=unconfined \
        --security-opt apparmor=unconfined \
        --cap-add=SYS_PTRACE \
        -v $(pwd):/workspace \
        --env HUGGINGFACE_HUB_CACHE=/workspace \
        --name test \
        rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715
    
  2. In the Docker container, clone the ROCm MAD repository and navigate to the benchmark scripts directory at ~/MAD/scripts/vllm.

    git clone https://github.com/ROCm/MAD
    cd MAD/scripts/vllm
    
  3. To start the benchmark, use the following command with the appropriate options.

    Benchmark options

    Name           Options             Description
    $test_option   latency             Measure decoding token latency
                   throughput          Measure token generation throughput
                   all                 Measure both throughput and latency
    $num_gpu       1 or 8              Number of GPUs
    $datatype      float16 or float8   Data type

    The input sequence length, output sequence length, and tensor parallel (TP) are already configured. You don’t need to specify them with this script.

    Command:

    ./vllm_benchmark_report.sh \
        -s $test_option \
        -m amd/Mistral-7B-v0.1-FP8-KV \
        -g $num_gpu \
        -d float8
    

    Note

    For best performance, it’s recommended to run with VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1.

    If you encounter the following error, the model is gated: set HF_TOKEN to a Hugging Face token that has been granted access to it.

    OSError: You are trying to access a gated repo.
    
    # pass your HF_TOKEN
    export HF_TOKEN=$your_personal_hf_token
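
One way to apply the VLLM_V1_USE_PREFILL_DECODE_ATTENTION recommendation is to set the variable for the benchmark process only. The following is a minimal sketch and not part of the MAD scripts: run_bench is a hypothetical wrapper, and DRY_RUN is an assumed convenience variable that makes it print the command instead of executing it.

```shell
# Hypothetical wrapper around vllm_benchmark_report.sh (not part of MAD).
# Arguments: test option, model, number of GPUs, data type.
run_bench() {
    # Rebuild the positional parameters as the full command line.
    set -- ./vllm_benchmark_report.sh -s "$1" -m "$2" -g "$3" -d "$4"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Default: only print what would be executed.
        echo "VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1 $*"
    else
        # The variable is set for this one process only.
        VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1 "$@"
    fi
}

run_bench latency amd/Mistral-7B-v0.1-FP8-KV 8 float8
```

With DRY_RUN set to 0, and when called from MAD/scripts/vllm inside the container, the same call executes the benchmark instead of printing it.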
    

Benchmarking examples

Here are some examples of running the benchmark with various options:

  • Latency benchmark

    Use this command to benchmark the latency of the Mistral 7B FP8 model on eight GPUs with float8 precision.

    ./vllm_benchmark_report.sh \
        -s latency \
        -m amd/Mistral-7B-v0.1-FP8-KV \
        -g 8 \
        -d float8
    

    Find the latency report at ./reports_float8_vllm_rocm6.4.1/summary/Mistral-7B-v0.1-FP8-KV_latency_report.csv.

  • Throughput benchmark

    Use this command to benchmark the throughput of the Mistral 7B FP8 model on eight GPUs with float8 precision.

    ./vllm_benchmark_report.sh \
        -s throughput \
        -m amd/Mistral-7B-v0.1-FP8-KV \
        -g 8 \
        -d float8
    

    Find the throughput report at ./reports_float8_vllm_rocm6.4.1/summary/Mistral-7B-v0.1-FP8-KV_throughput_report.csv.

Note

Throughput is calculated as:

  • \[\text{throughput\_tot} = \text{requests} \times (\text{input lengths} + \text{output lengths}) / \text{elapsed\_time}\]
  • \[\text{throughput\_gen} = \text{requests} \times \text{output lengths} / \text{elapsed\_time}\]

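
As a worked example of the two formulas, the following sketch computes both values with awk. The throughput function is a hypothetical helper, and the inputs (100 requests, 128 input tokens, 128 output tokens, 20 seconds) are made-up numbers, not measured results.

```shell
# Hypothetical helper: total and generation throughput in tokens per second.
# Arguments: requests, input length, output length, elapsed time in seconds.
throughput() {
    awk -v r="$1" -v i="$2" -v o="$3" -v t="$4" 'BEGIN {
        printf "throughput_tot=%.1f throughput_gen=%.1f\n", r * (i + o) / t, r * o / t
    }'
}

# Made-up inputs for illustration only.
throughput 100 128 128 20
# prints: throughput_tot=1280.0 throughput_gen=640.0
```

The total figure counts prompt and generated tokens, so it is always at least as large as the generation figure for the same run.
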
  1. Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.

    git clone https://github.com/ROCm/MAD
    cd MAD
    pip install -r requirements.txt
    
  2. Use this command to run the performance benchmark test on the Qwen2 7B model using one GPU with the float16 data type on the host machine.

    export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
    madengine run \
        --tags pyt_vllm_qwen2-7b \
        --keep-model-dir \
        --live-output \
        --timeout 28800
    

MAD launches a Docker container with the name container_ci-pyt_vllm_qwen2-7b. The latency and throughput reports of the model are collected in the following path: ~/MAD/reports_float16/.

Although the available models are preconfigured to collect latency and throughput performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.
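
After a MAD run finishes, the collected reports can be listed with a small helper. This is a sketch under assumptions: list_reports is a hypothetical function, and it assumes the reports are CSV files stored somewhere under the reports directory named above.

```shell
# Hypothetical helper: list CSV reports under a reports directory.
# Argument: directory to search (defaults to ~/MAD/reports_float16).
list_reports() {
    dir="${1:-$HOME/MAD/reports_float16}"
    if [ ! -d "$dir" ]; then
        echo "no reports directory at $dir" >&2
        return 1
    fi
    # Assumption: reports are *.csv files at any depth below the directory.
    find "$dir" -type f -name '*.csv' | sort
}

list_reports || echo "run the madengine command first"
```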

Download the Docker image and required scripts

  1. Run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.

    docker pull rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715
    docker run -it \
        --device=/dev/kfd \
        --device=/dev/dri \
        --group-add video \
        --shm-size 16G \
        --security-opt seccomp=unconfined \
        --security-opt apparmor=unconfined \
        --cap-add=SYS_PTRACE \
        -v $(pwd):/workspace \
        --env HUGGINGFACE_HUB_CACHE=/workspace \
        --name test \
        rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715
    
  2. In the Docker container, clone the ROCm MAD repository and navigate to the benchmark scripts directory at ~/MAD/scripts/vllm.

    git clone https://github.com/ROCm/MAD
    cd MAD/scripts/vllm
    
  3. To start the benchmark, use the following command with the appropriate options.

    Benchmark options

    Name           Options             Description
    $test_option   latency             Measure decoding token latency
                   throughput          Measure token generation throughput
                   all                 Measure both throughput and latency
    $num_gpu       1 or 8              Number of GPUs
    $datatype      float16 or float8   Data type

    The input sequence length, output sequence length, and tensor parallel (TP) are already configured. You don’t need to specify them with this script.

    Command:

    ./vllm_benchmark_report.sh \
        -s $test_option \
        -m Qwen/Qwen2-7B-Instruct \
        -g $num_gpu \
        -d float16
    

    Note

    For best performance, it’s recommended to run with VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1.

    If you encounter the following error, the model is gated: set HF_TOKEN to a Hugging Face token that has been granted access to it.

    OSError: You are trying to access a gated repo.
    
    # pass your HF_TOKEN
    export HF_TOKEN=$your_personal_hf_token
    

Benchmarking examples

Here are some examples of running the benchmark with various options:

  • Latency benchmark

    Use this command to benchmark the latency of the Qwen2 7B model on eight GPUs with float16 precision.

    ./vllm_benchmark_report.sh \
        -s latency \
        -m Qwen/Qwen2-7B-Instruct \
        -g 8 \
        -d float16
    

    Find the latency report at ./reports_float16_vllm_rocm6.4.1/summary/Qwen2-7B-Instruct_latency_report.csv.

  • Throughput benchmark

    Use this command to benchmark the throughput of the Qwen2 7B model on eight GPUs with float16 precision.

    ./vllm_benchmark_report.sh \
        -s throughput \
        -m Qwen/Qwen2-7B-Instruct \
        -g 8 \
        -d float16
    

    Find the throughput report at ./reports_float16_vllm_rocm6.4.1/summary/Qwen2-7B-Instruct_throughput_report.csv.
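
To check what a generated report contains before loading it elsewhere, its header row can be split into column names. This is a minimal sketch, assuming the report is a plain comma-separated file whose first line is a header without quoted commas. report_columns is a hypothetical helper, and the actual column names depend on the benchmark script.

```shell
# Hypothetical helper: print the column names of a CSV report, one per line.
# Uses a simple comma split, so quoted fields containing commas are not handled.
report_columns() {
    head -n 1 "$1" | tr ',' '\n'
}

report=./reports_float16_vllm_rocm6.4.1/summary/Qwen2-7B-Instruct_throughput_report.csv
if [ -f "$report" ]; then
    report_columns "$report"
else
    echo "report not found: $report"
fi
```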

Note

Throughput is calculated as:

  • \[\text{throughput\_tot} = \text{requests} \times (\text{input lengths} + \text{output lengths}) / \text{elapsed\_time}\]
  • \[\text{throughput\_gen} = \text{requests} \times \text{output lengths} / \text{elapsed\_time}\]

  1. Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.

    git clone https://github.com/ROCm/MAD
    cd MAD
    pip install -r requirements.txt
    
  2. Use this command to run the performance benchmark test on the Qwen2 72B model using one GPU with the float16 data type on the host machine.

    export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
    madengine run \
        --tags pyt_vllm_qwen2-72b \
        --keep-model-dir \
        --live-output \
        --timeout 28800
    

MAD launches a Docker container with the name container_ci-pyt_vllm_qwen2-72b. The latency and throughput reports of the model are collected in the following path: ~/MAD/reports_float16/.

Although the available models are preconfigured to collect latency and throughput performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.

Download the Docker image and required scripts

  1. Run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.

    docker pull rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715
    docker run -it \
        --device=/dev/kfd \
        --device=/dev/dri \
        --group-add video \
        --shm-size 16G \
        --security-opt seccomp=unconfined \
        --security-opt apparmor=unconfined \
        --cap-add=SYS_PTRACE \
        -v $(pwd):/workspace \
        --env HUGGINGFACE_HUB_CACHE=/workspace \
        --name test \
        rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715
    
  2. In the Docker container, clone the ROCm MAD repository and navigate to the benchmark scripts directory at ~/MAD/scripts/vllm.

    git clone https://github.com/ROCm/MAD
    cd MAD/scripts/vllm
    
  3. To start the benchmark, use the following command with the appropriate options.

    Benchmark options

    Name           Options             Description
    $test_option   latency             Measure decoding token latency
                   throughput          Measure token generation throughput
                   all                 Measure both throughput and latency
    $num_gpu       1 or 8              Number of GPUs
    $datatype      float16 or float8   Data type

    The input sequence length, output sequence length, and tensor parallel (TP) are already configured. You don’t need to specify them with this script.

    Command:

    ./vllm_benchmark_report.sh \
        -s $test_option \
        -m Qwen/Qwen2-72B-Instruct \
        -g $num_gpu \
        -d float16
    

    Note

    For best performance, it’s recommended to run with VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1.

    If you encounter the following error, the model is gated: set HF_TOKEN to a Hugging Face token that has been granted access to it.

    OSError: You are trying to access a gated repo.
    
    # pass your HF_TOKEN
    export HF_TOKEN=$your_personal_hf_token
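
Because step 3 takes its values from shell variables, a quick check against the benchmark options listed there can catch typos before a long run starts. The sketch below is illustrative: valid_options is a hypothetical function that only mirrors the documented options and does not reflect how vllm_benchmark_report.sh validates its arguments.

```shell
# Hypothetical check: compare values against the documented benchmark options.
# Arguments: test option, number of GPUs, data type.
valid_options() {
    case "$1" in
        latency|throughput|all) ;;
        *) echo "invalid test option: $1" >&2; return 1 ;;
    esac
    case "$2" in
        1|8) ;;
        *) echo "invalid number of GPUs: $2" >&2; return 1 ;;
    esac
    case "$3" in
        float16|float8) ;;
        *) echo "invalid data type: $3" >&2; return 1 ;;
    esac
}

if valid_options all 8 float16; then
    echo "options match the documented values"
fi
```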
    

Benchmarking examples

Here are some examples of running the benchmark with various options:

  • Latency benchmark

    Use this command to benchmark the latency of the Qwen2 72B model on eight GPUs with float16 precision.

    ./vllm_benchmark_report.sh \
        -s latency \
        -m Qwen/Qwen2-72B-Instruct \
        -g 8 \
        -d float16
    

    Find the latency report at ./reports_float16_vllm_rocm6.4.1/summary/Qwen2-72B-Instruct_latency_report.csv.

  • Throughput benchmark

    Use this command to benchmark the throughput of the Qwen2 72B model on eight GPUs with float16 precision.

    ./vllm_benchmark_report.sh \
        -s throughput \
        -m Qwen/Qwen2-72B-Instruct \
        -g 8 \
        -d float16
    

    Find the throughput report at ./reports_float16_vllm_rocm6.4.1/summary/Qwen2-72B-Instruct_throughput_report.csv.

Note

Throughput is calculated as:

  • \[\text{throughput\_tot} = \text{requests} \times (\text{input lengths} + \text{output lengths}) / \text{elapsed\_time}\]
  • \[\text{throughput\_gen} = \text{requests} \times \text{output lengths} / \text{elapsed\_time}\]

  1. Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.

    git clone https://github.com/ROCm/MAD
    cd MAD
    pip install -r requirements.txt
    
  2. Use this command to run the performance benchmark test on the QwQ-32B model using one GPU with the float16 data type on the host machine.

    export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
    madengine run \
        --tags pyt_vllm_qwq-32b \
        --keep-model-dir \
        --live-output \
        --timeout 28800
    

MAD launches a Docker container with the name container_ci-pyt_vllm_qwq-32b. The latency and throughput reports of the model are collected in the following path: ~/MAD/reports_float16/.

Although the available models are preconfigured to collect latency and throughput performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.

Note

For improved performance, consider enabling PyTorch TunableOp. TunableOp automatically explores different implementations and configurations of certain PyTorch operators to find the fastest one for your hardware.

By default, pyt_vllm_qwq-32b runs with TunableOp disabled (see ROCm/MAD). To enable it, include the --tunableop on argument in your run.

Enabling TunableOp triggers a two-pass run – a warm-up followed by the performance-collection run.
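
A TunableOp run could look like the following sketch. It assumes that --tunableop on is appended to the madengine run command from step 2, because the exact position of the argument is not spelled out here. tunableop_run is a hypothetical wrapper, and DRY_RUN is an assumed convenience variable that prints the command instead of executing it.

```shell
# Hypothetical wrapper: build the madengine command with TunableOp enabled.
# Argument: MAD model tag.
tunableop_run() {
    # Assumption: --tunableop on is appended to the usual madengine arguments.
    set -- madengine run --tags "$1" --keep-model-dir --live-output --timeout 28800 --tunableop on
    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Default: only print what would be executed.
        echo "$*"
    else
        "$@"
    fi
}

tunableop_run pyt_vllm_qwq-32b
```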

Download the Docker image and required scripts

  1. Run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.

    docker pull rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715
    docker run -it \
        --device=/dev/kfd \
        --device=/dev/dri \
        --group-add video \
        --shm-size 16G \
        --security-opt seccomp=unconfined \
        --security-opt apparmor=unconfined \
        --cap-add=SYS_PTRACE \
        -v $(pwd):/workspace \
        --env HUGGINGFACE_HUB_CACHE=/workspace \
        --name test \
        rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715
    
  2. In the Docker container, clone the ROCm MAD repository and navigate to the benchmark scripts directory at ~/MAD/scripts/vllm.

    git clone https://github.com/ROCm/MAD
    cd MAD/scripts/vllm
    
  3. To start the benchmark, use the following command with the appropriate options.

    Benchmark options

    Name           Options             Description
    $test_option   latency             Measure decoding token latency
                   throughput          Measure token generation throughput
                   all                 Measure both throughput and latency
    $num_gpu       1 or 8              Number of GPUs
    $datatype      float16 or float8   Data type

    The input sequence length, output sequence length, and tensor parallel (TP) are already configured. You don’t need to specify them with this script.

    Command:

    ./vllm_benchmark_report.sh \
        -s $test_option \
        -m Qwen/QwQ-32B \
        -g $num_gpu \
        -d float16
    

    Note

    For best performance, it’s recommended to run with VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1.

    If you encounter the following error, the model is gated: set HF_TOKEN to a Hugging Face token that has been granted access to it.

    OSError: You are trying to access a gated repo.
    
    # pass your HF_TOKEN
    export HF_TOKEN=$your_personal_hf_token
    

Benchmarking examples

Here are some examples of running the benchmark with various options:

  • Latency benchmark

    Use this command to benchmark the latency of the QwQ-32B model on eight GPUs with float16 precision.

    ./vllm_benchmark_report.sh \
        -s latency \
        -m Qwen/QwQ-32B \
        -g 8 \
        -d float16
    

    Find the latency report at ./reports_float16_vllm_rocm6.4.1/summary/QwQ-32B_latency_report.csv.

  • Throughput benchmark

    Use this command to benchmark the throughput of the QwQ-32B model on eight GPUs with float16 precision.

    ./vllm_benchmark_report.sh \
        -s throughput \
        -m Qwen/QwQ-32B \
        -g 8 \
        -d float16
    

    Find the throughput report at ./reports_float16_vllm_rocm6.4.1/summary/QwQ-32B_throughput_report.csv.

Note

Throughput is calculated as:

  • \[\text{throughput\_tot} = \text{requests} \times (\text{input lengths} + \text{output lengths}) / \text{elapsed\_time}\]
  • \[\text{throughput\_gen} = \text{requests} \times \text{output lengths} / \text{elapsed\_time}\]

  1. Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.

    git clone https://github.com/ROCm/MAD
    cd MAD
    pip install -r requirements.txt
    
  2. Use this command to run the performance benchmark test on the DBRX Instruct model using one GPU with the float16 data type on the host machine.

    export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
    madengine run \
        --tags pyt_vllm_dbrx-instruct \
        --keep-model-dir \
        --live-output \
        --timeout 28800
    

MAD launches a Docker container with the name container_ci-pyt_vllm_dbrx-instruct. The latency and throughput reports of the model are collected in the following path: ~/MAD/reports_float16/.

Although the available models are preconfigured to collect latency and throughput performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.

Download the Docker image and required scripts

  1. Run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.

    docker pull rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715
    docker run -it \
        --device=/dev/kfd \
        --device=/dev/dri \
        --group-add video \
        --shm-size 16G \
        --security-opt seccomp=unconfined \
        --security-opt apparmor=unconfined \
        --cap-add=SYS_PTRACE \
        -v $(pwd):/workspace \
        --env HUGGINGFACE_HUB_CACHE=/workspace \
        --name test \
        rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715
    
  2. In the Docker container, clone the ROCm MAD repository and navigate to the benchmark scripts directory at ~/MAD/scripts/vllm.

    git clone https://github.com/ROCm/MAD
    cd MAD/scripts/vllm
    
  3. To start the benchmark, use the following command with the appropriate options.

    Benchmark options

    Name           Options             Description
    $test_option   latency             Measure decoding token latency
                   throughput          Measure token generation throughput
                   all                 Measure both throughput and latency
    $num_gpu       1 or 8              Number of GPUs
    $datatype      float16 or float8   Data type

    The input sequence length, output sequence length, and tensor parallel (TP) are already configured. You don’t need to specify them with this script.

    Command:

    ./vllm_benchmark_report.sh \
        -s $test_option \
        -m databricks/dbrx-instruct \
        -g $num_gpu \
        -d float16
    

    Note

    For best performance, it’s recommended to run with VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1.

    If you encounter the following error, the model is gated: set HF_TOKEN to a Hugging Face token that has been granted access to it.

    OSError: You are trying to access a gated repo.
    
    # pass your HF_TOKEN
    export HF_TOKEN=$your_personal_hf_token
    

Benchmarking examples

Here are some examples of running the benchmark with various options:

  • Latency benchmark

    Use this command to benchmark the latency of the DBRX Instruct model on eight GPUs with float16 precision.

    ./vllm_benchmark_report.sh \
        -s latency \
        -m databricks/dbrx-instruct \
        -g 8 \
        -d float16
    

    Find the latency report at ./reports_float16_vllm_rocm6.4.1/summary/dbrx-instruct_latency_report.csv.

  • Throughput benchmark

    Use this command to benchmark the throughput of the DBRX Instruct model on eight GPUs with float16 precision.

    ./vllm_benchmark_report.sh \
        -s throughput \
        -m databricks/dbrx-instruct \
        -g 8 \
        -d float16
    

    Find the throughput report at ./reports_float16_vllm_rocm6.4.1/summary/dbrx-instruct_throughput_report.csv.
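
The report locations in these examples follow a visible pattern: the data type appears in the directory name, and the file name starts with the part of the model ID after the slash. The sketch below reproduces that pattern. report_path is a hypothetical helper inferred only from the paths shown here, so it may not hold for other image versions.

```shell
# Hypothetical helper: derive the summary report path from the benchmark arguments.
# Arguments: model ID, data type, report kind (latency or throughput).
report_path() {
    # ${1##*/} strips everything up to the last slash of the model ID.
    echo "./reports_${2}_vllm_rocm6.4.1/summary/${1##*/}_${3}_report.csv"
}

report_path databricks/dbrx-instruct float16 latency
# prints: ./reports_float16_vllm_rocm6.4.1/summary/dbrx-instruct_latency_report.csv
```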

Note

Throughput is calculated as:

  • \[\text{throughput\_tot} = \text{requests} \times (\text{input lengths} + \text{output lengths}) / \text{elapsed\_time}\]
  • \[\text{throughput\_gen} = \text{requests} \times \text{output lengths} / \text{elapsed\_time}\]

  1. Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.

    git clone https://github.com/ROCm/MAD
    cd MAD
    pip install -r requirements.txt
    
  2. Use this command to run the performance benchmark test on the DBRX Instruct FP8 model using one GPU with the float8 data type on the host machine.

    export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
    madengine run \
        --tags pyt_vllm_dbrx_fp8 \
        --keep-model-dir \
        --live-output \
        --timeout 28800
    

MAD launches a Docker container with the name container_ci-pyt_vllm_dbrx_fp8. The latency and throughput reports of the model are collected in the following path: ~/MAD/reports_float8/.

Although the available models are preconfigured to collect latency and throughput performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.

Download the Docker image and required scripts

  1. Run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.

    docker pull rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715
    docker run -it \
        --device=/dev/kfd \
        --device=/dev/dri \
        --group-add video \
        --shm-size 16G \
        --security-opt seccomp=unconfined \
        --security-opt apparmor=unconfined \
        --cap-add=SYS_PTRACE \
        -v $(pwd):/workspace \
        --env HUGGINGFACE_HUB_CACHE=/workspace \
        --name test \
        rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715
    
  2. In the Docker container, clone the ROCm MAD repository and navigate to the benchmark scripts directory at ~/MAD/scripts/vllm.

    git clone https://github.com/ROCm/MAD
    cd MAD/scripts/vllm
    
  3. To start the benchmark, use the following command with the appropriate options.

    Benchmark options

    Name           Options             Description
    $test_option   latency             Measure decoding token latency
                   throughput          Measure token generation throughput
                   all                 Measure both throughput and latency
    $num_gpu       1 or 8              Number of GPUs
    $datatype      float16 or float8   Data type

    The input sequence length, output sequence length, and tensor parallel (TP) are already configured. You don’t need to specify them with this script.

    Command:

    ./vllm_benchmark_report.sh \
        -s $test_option \
        -m amd/dbrx-instruct-FP8-KV \
        -g $num_gpu \
        -d float8
    

    Note

    For best performance, it’s recommended to run with VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1.

    If you encounter the following error, the model is gated: set HF_TOKEN to a Hugging Face token that has been granted access to it.

    OSError: You are trying to access a gated repo.
    
    # pass your HF_TOKEN
    export HF_TOKEN=$your_personal_hf_token
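
The docker run command in step 1 passes /dev/kfd and /dev/dri into the container, so it can help to confirm on the host that both exist before starting it. The following is a minimal sketch; check_gpu_devices is a hypothetical helper and only tests that the paths exist, not that the GPUs are usable.

```shell
# Hypothetical preflight check: report any device node that does not exist.
# Arguments: one or more device paths.
check_gpu_devices() {
    missing=0
    for dev in "$@"; do
        if [ ! -e "$dev" ]; then
            echo "missing device node: $dev" >&2
            missing=1
        fi
    done
    return "$missing"
}

if check_gpu_devices /dev/kfd /dev/dri; then
    echo "GPU device nodes found"
else
    echo "check the ROCm driver installation on the host"
fi
```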
    

Benchmarking examples

Here are some examples of running the benchmark with various options:

  • Latency benchmark

    Use this command to benchmark the latency of the DBRX Instruct FP8 model on eight GPUs with float8 precision.

    ./vllm_benchmark_report.sh \
        -s latency \
        -m amd/dbrx-instruct-FP8-KV \
        -g 8 \
        -d float8
    

    Find the latency report at ./reports_float8_vllm_rocm6.4.1/summary/dbrx-instruct-FP8-KV_latency_report.csv.

  • Throughput benchmark

    Use this command to benchmark the throughput of the DBRX Instruct FP8 model on eight GPUs with float8 precision.

    ./vllm_benchmark_report.sh \
        -s throughput \
        -m amd/dbrx-instruct-FP8-KV \
        -g 8 \
        -d float8
    

    Find the throughput report at ./reports_float8_vllm_rocm6.4.1/summary/dbrx-instruct-FP8-KV_throughput_report.csv.

Note

Throughput is calculated as:

  • \[\text{throughput\_tot} = \text{requests} \times (\text{input lengths} + \text{output lengths}) / \text{elapsed\_time}\]
  • \[\text{throughput\_gen} = \text{requests} \times \text{output lengths} / \text{elapsed\_time}\]

  1. Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.

    git clone https://github.com/ROCm/MAD
    cd MAD
    pip install -r requirements.txt
    
  2. Use this command to run the performance benchmark test on the Gemma 2 27B model using one GPU with the float16 data type on the host machine.

    export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
    madengine run \
        --tags pyt_vllm_gemma-2-27b \
        --keep-model-dir \
        --live-output \
        --timeout 28800
    

MAD launches a Docker container with the name container_ci-pyt_vllm_gemma-2-27b. The latency and throughput reports of the model are collected in the following path: ~/MAD/reports_float16/.

Although the available models are preconfigured to collect latency and throughput performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.

Download the Docker image and required scripts

  1. Run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.

    docker pull rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715
    docker run -it \
        --device=/dev/kfd \
        --device=/dev/dri \
        --group-add video \
        --shm-size 16G \
        --security-opt seccomp=unconfined \
        --security-opt apparmor=unconfined \
        --cap-add=SYS_PTRACE \
        -v $(pwd):/workspace \
        --env HUGGINGFACE_HUB_CACHE=/workspace \
        --name test \
        rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715
    
  2. In the Docker container, clone the ROCm MAD repository and navigate to the benchmark scripts directory at ~/MAD/scripts/vllm.

    git clone https://github.com/ROCm/MAD
    cd MAD/scripts/vllm
    
  3. To start the benchmark, use the following command with the appropriate options.

    Benchmark options

    Name           Options             Description
    $test_option   latency             Measure decoding token latency
                   throughput          Measure token generation throughput
                   all                 Measure both throughput and latency
    $num_gpu       1 or 8              Number of GPUs
    $datatype      float16 or float8   Data type

    The input sequence length, output sequence length, and tensor parallel (TP) are already configured. You don’t need to specify them with this script.

    Command:

    ./vllm_benchmark_report.sh \
        -s $test_option \
        -m google/gemma-2-27b \
        -g $num_gpu \
        -d float16
    

    Note

    For best performance, it’s recommended to run with VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1.

    If you encounter the following error, the model is gated: set HF_TOKEN to a Hugging Face token that has been granted access to it.

    OSError: You are trying to access a gated repo.
    
    # pass your HF_TOKEN
    export HF_TOKEN=$your_personal_hf_token
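
To avoid discovering a missing token only after the benchmark has started, the variable can be checked up front. This is a sketch: require_hf_token is a hypothetical helper that only tests whether HF_TOKEN is non-empty, not whether the token has access to the model.

```shell
# Hypothetical guard: fail early when HF_TOKEN is empty or unset.
require_hf_token() {
    if [ -z "${HF_TOKEN:-}" ]; then
        echo "HF_TOKEN is not set; export it before running the benchmark" >&2
        return 1
    fi
}

if require_hf_token; then
    echo "HF_TOKEN is set"
fi
```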
    

Benchmarking examples

Here are some examples of running the benchmark with various options:

  • Latency benchmark

    Use this command to benchmark the latency of the Gemma 2 27B model on eight GPUs with float16 precision.

    ./vllm_benchmark_report.sh \
        -s latency \
        -m google/gemma-2-27b \
        -g 8 \
        -d float16
    

    Find the latency report at ./reports_float16_vllm_rocm6.4.1/summary/gemma-2-27b_latency_report.csv.

  • Throughput benchmark

    Use this command to benchmark the throughput of the Gemma 2 27B model on eight GPUs with float16 precision.

    ./vllm_benchmark_report.sh \
        -s throughput \
        -m google/gemma-2-27b \
        -g 8 \
        -d float16
    

    Find the throughput report at ./reports_float16_vllm_rocm6.4.1/summary/gemma-2-27b_throughput_report.csv.

Note

Throughput is calculated as:

  • \[\text{throughput\_tot} = \text{requests} \times (\text{input lengths} + \text{output lengths}) / \text{elapsed\_time}\]
  • \[\text{throughput\_gen} = \text{requests} \times \text{output lengths} / \text{elapsed\_time}\]

  1. Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.

    git clone https://github.com/ROCm/MAD
    cd MAD
    pip install -r requirements.txt
    
  2. Use this command to run the performance benchmark test on the C4AI Command R+ 08-2024 model using one GPU with the float16 data type on the host machine.

    export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
    madengine run \
        --tags pyt_vllm_c4ai-command-r-plus-08-2024 \
        --keep-model-dir \
        --live-output \
        --timeout 28800
    

MAD launches a Docker container with the name container_ci-pyt_vllm_c4ai-command-r-plus-08-2024. The latency and throughput reports of the model are collected in the following path: ~/MAD/reports_float16/.

Although the available models are preconfigured to collect latency and throughput performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.

Download the Docker image and required scripts

  1. Run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.

    docker pull rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715
    docker run -it \
        --device=/dev/kfd \
        --device=/dev/dri \
        --group-add video \
        --shm-size 16G \
        --security-opt seccomp=unconfined \
        --security-opt apparmor=unconfined \
        --cap-add=SYS_PTRACE \
        -v $(pwd):/workspace \
        --env HUGGINGFACE_HUB_CACHE=/workspace \
        --name test \
        rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715
    
  2. In the Docker container, clone the ROCm MAD repository and navigate to the benchmark scripts directory at ~/MAD/scripts/vllm.

    git clone https://github.com/ROCm/MAD
    cd MAD/scripts/vllm
    
  3. To start the benchmark, use the following command with the appropriate options.

    Benchmark options

    Name           Options             Description
    $test_option   latency             Measure decoding token latency
                   throughput          Measure token generation throughput
                   all                 Measure both throughput and latency
    $num_gpu       1 or 8              Number of GPUs
    $datatype      float16 or float8   Data type

    The input sequence length, output sequence length, and tensor parallel (TP) are already configured. You don’t need to specify them with this script.

    Command:

    ./vllm_benchmark_report.sh \
        -s $test_option \
        -m CohereForAI/c4ai-command-r-plus-08-2024 \
        -g $num_gpu \
        -d float16
    

    Note

    For best performance, it’s recommended to run with VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1.

    If you encounter the following error, the model is gated: set HF_TOKEN to a Hugging Face token that has been granted access to it.

    OSError: You are trying to access a gated repo.
    
    # pass your HF_TOKEN
    export HF_TOKEN=$your_personal_hf_token
    

Benchmarking examples

Here are some examples of running the benchmark with various options:

  • Latency benchmark

    Use this command to benchmark the latency of the C4AI Command R+ 08-2024 model on eight GPUs with float16 precision.

    ./vllm_benchmark_report.sh \
        -s latency \
        -m CohereForAI/c4ai-command-r-plus-08-2024 \
        -g 8 \
        -d float16
    

    Find the latency report at ./reports_float16_vllm_rocm6.4.1/summary/c4ai-command-r-plus-08-2024_latency_report.csv.

  • Throughput benchmark

    Use this command to benchmark the throughput of the C4AI Command R+ 08-2024 model on eight GPUs with float16 precision.

    ./vllm_benchmark_report.sh \
        -s throughput \
        -m CohereForAI/c4ai-command-r-plus-08-2024 \
        -g 8 \
        -d float16
    

    Find the throughput report at ./reports_float16_vllm_rocm6.4.1/summary/c4ai-command-r-plus-08-2024_throughput_report.csv.

Note

Throughput is calculated as:

  • \[\text{throughput\_tot} = \text{requests} \times (\text{input lengths} + \text{output lengths}) / \text{elapsed\_time}\]
  • \[\text{throughput\_gen} = \text{requests} \times \text{output lengths} / \text{elapsed\_time}\]

  1. Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.

    git clone https://github.com/ROCm/MAD
    cd MAD
    pip install -r requirements.txt
    
  2. Use this command to run the performance benchmark test on the C4AI Command R+ 08-2024 FP8 model using one GPU with the float8 data type on the host machine.

    export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
    madengine run \
        --tags pyt_vllm_command-r-plus_fp8 \
        --keep-model-dir \
        --live-output \
        --timeout 28800
    

MAD launches a Docker container with the name container_ci-pyt_vllm_command-r-plus_fp8. The latency and throughput reports of the model are collected in the following path: ~/MAD/reports_float8/.

Although the available models are preconfigured to collect latency and throughput performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.

Download the Docker image and required scripts

  1. Run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.

    docker pull rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715
    docker run -it \
        --device=/dev/kfd \
        --device=/dev/dri \
        --group-add video \
        --shm-size 16G \
        --security-opt seccomp=unconfined \
        --security-opt apparmor=unconfined \
        --cap-add=SYS_PTRACE \
        -v $(pwd):/workspace \
        --env HUGGINGFACE_HUB_CACHE=/workspace \
        --name test \
        rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715
    
  2. In the Docker container, clone the ROCm MAD repository and navigate to the benchmark scripts directory at ~/MAD/scripts/vllm.

    git clone https://github.com/ROCm/MAD
    cd MAD/scripts/vllm
    
  3. To start the benchmark, use the following command with the appropriate options.

    Benchmark options

    Name           Options             Description
    $test_option   latency             Measure decoding token latency
                   throughput          Measure token generation throughput
                   all                 Measure both throughput and latency
    $num_gpu       1 or 8              Number of GPUs
    $datatype      float16 or float8   Data type

    The input sequence length, output sequence length, and tensor parallel (TP) are already configured. You don’t need to specify them with this script.

    Command:

    ./vllm_benchmark_report.sh \
        -s $test_option \
        -m amd/c4ai-command-r-plus-FP8-KV \
        -g $num_gpu \
        -d float8
    

    Note

    For best performance, it’s recommended to run with VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1.

    If you encounter the following error, provide a Hugging Face token that has been granted access to the gated model.

    OSError: You are trying to access a gated repo.
    
    # pass your HF_TOKEN
    export HF_TOKEN=$your_personal_hf_token
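
    To catch a missing token before a long benchmark starts, a small guard can run first. This is a minimal sketch: the function name require_hf_token is hypothetical and is not part of the MAD scripts. It only checks that HF_TOKEN is non-empty, not that the token is valid or has been granted access.

```shell
# Hypothetical helper: fail early if HF_TOKEN is unset or empty.
# It does not verify that the token is valid or authorized.
require_hf_token() {
    if [ -z "${HF_TOKEN:-}" ]; then
        echo "HF_TOKEN is not set; export it before benchmarking a gated model." >&2
        return 1
    fi
}

# Example (not executed here):
# require_hf_token && ./vllm_benchmark_report.sh -s latency -m amd/c4ai-command-r-plus-FP8-KV -g 8 -d float8
```
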
    

Benchmarking examples

Here are some examples of running the benchmark with various options:

  • Latency benchmark

    Use this command to benchmark the latency of the C4AI Command R+ 08-2024 FP8 model on eight GPUs with float8 precision.

    ./vllm_benchmark_report.sh \
        -s latency \
        -m amd/c4ai-command-r-plus-FP8-KV \
        -g 8 \
        -d float8
    

    Find the latency report at ./reports_float8_vllm_rocm6.4.1/summary/c4ai-command-r-plus-FP8-KV_latency_report.csv.

  • Throughput benchmark

    Use this command to benchmark the throughput of the C4AI Command R+ 08-2024 FP8 model on eight GPUs with float8 precision.

    ./vllm_benchmark_report.sh \
        -s throughput \
        -m amd/c4ai-command-r-plus-FP8-KV \
        -g 8 \
        -d float8
    

    Find the throughput report at ./reports_float8_vllm_rocm6.4.1/summary/c4ai-command-r-plus-FP8-KV_throughput_report.csv.
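
The report paths in the examples above follow a visible pattern: the data type appears in the directory name, and the part of the model ID after the slash appears in the file name. The sketch below reproduces that pattern. The helper name report_path is hypothetical, and the pattern is inferred from the examples on this page rather than taken from the benchmark script, so treat the output as a guess to verify.

```shell
# Hypothetical helper that mirrors the report paths shown in the examples.
report_path() {
    model=$1       # Hugging Face model ID, for example amd/c4ai-command-r-plus-FP8-KV
    datatype=$2    # float16 or float8
    test_name=$3   # latency or throughput
    # ${model##*/} strips everything up to and including the last slash.
    printf './reports_%s_vllm_rocm6.4.1/summary/%s_%s_report.csv\n' \
        "$datatype" "${model##*/}" "$test_name"
}

report_path amd/c4ai-command-r-plus-FP8-KV float8 throughput
```
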

Note

Throughput is calculated as:

  • \[throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time\]
  • \[throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time\]
  1. Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.

    git clone https://github.com/ROCm/MAD
    cd MAD
    pip install -r requirements.txt
    
  2. Use this command to run the performance benchmark test on the DeepSeek MoE 16B model using one GPU with the float16 data type on the host machine.

    export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
    madengine run \
        --tags pyt_vllm_deepseek-moe-16b-chat \
        --keep-model-dir \
        --live-output \
        --timeout 28800
    

MAD launches a Docker container with the name container_ci-pyt_vllm_deepseek-moe-16b-chat. The latency and throughput reports of the model are collected in the following path: ~/MAD/reports_float16/.

Although the available models are preconfigured to collect latency and throughput performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.

Download the Docker image and required scripts

  1. Run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.

    docker pull rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715
    docker run -it \
        --device=/dev/kfd \
        --device=/dev/dri \
        --group-add video \
        --shm-size 16G \
        --security-opt seccomp=unconfined \
        --security-opt apparmor=unconfined \
        --cap-add=SYS_PTRACE \
        -v $(pwd):/workspace \
        --env HUGGINGFACE_HUB_CACHE=/workspace \
        --name test \
        rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715
    
  2. In the Docker container, clone the ROCm MAD repository and navigate to the benchmark scripts directory at ~/MAD/scripts/vllm.

    git clone https://github.com/ROCm/MAD
    cd MAD/scripts/vllm
    
  3. To start the benchmark, use the following command with the appropriate options.

    Benchmark options

    Name           Options             Description
    $test_option   latency             Measure decoding token latency
                   throughput          Measure token generation throughput
                   all                 Measure both throughput and latency
    $num_gpu       1 or 8              Number of GPUs
    $datatype      float16 or float8   Data type

    The input sequence length, output sequence length, and tensor parallel (TP) are already configured. You don’t need to specify them with this script.

    Command:

    ./vllm_benchmark_report.sh \
        -s $test_option \
        -m deepseek-ai/deepseek-moe-16b-chat \
        -g $num_gpu \
        -d float16
    

    Note

    For best performance, it’s recommended to run with VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1.

    If you encounter the following error, provide a Hugging Face token that has been granted access to the gated model.

    OSError: You are trying to access a gated repo.
    
    # pass your HF_TOKEN
    export HF_TOKEN=$your_personal_hf_token
    

Benchmarking examples

Here are some examples of running the benchmark with various options:

  • Latency benchmark

    Use this command to benchmark the latency of the DeepSeek MoE 16B model on eight GPUs with float16 precision.

    ./vllm_benchmark_report.sh \
        -s latency \
        -m deepseek-ai/deepseek-moe-16b-chat \
        -g 8 \
        -d float16
    

    Find the latency report at ./reports_float16_vllm_rocm6.4.1/summary/deepseek-moe-16b-chat_latency_report.csv.

  • Throughput benchmark

    Use this command to benchmark the throughput of the DeepSeek MoE 16B model on eight GPUs with float16 precision.

    ./vllm_benchmark_report.sh \
        -s throughput \
        -m deepseek-ai/deepseek-moe-16b-chat \
        -g 8 \
        -d float16
    

    Find the throughput report at ./reports_float16_vllm_rocm6.4.1/summary/deepseek-moe-16b-chat_throughput_report.csv.
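
The examples above can be combined into a small sweep that runs both tests (-s all) for each GPU count listed in the benchmark options table. This is a sketch only: the RUNNER variable and the sweep_gpu_counts function are hypothetical conveniences, not part of the MAD scripts. It assumes the current directory is MAD/scripts/vllm.

```shell
# RUNNER defaults to the benchmark script; set RUNNER=echo for a dry run
# that prints the arguments instead of launching a benchmark.
RUNNER=${RUNNER:-./vllm_benchmark_report.sh}
MODEL=deepseek-ai/deepseek-moe-16b-chat
DATATYPE=float16

# Hypothetical helper: run latency and throughput tests on 1 and 8 GPUs.
sweep_gpu_counts() {
    for num_gpu in 1 8; do
        "$RUNNER" -s all -m "$MODEL" -g "$num_gpu" -d "$DATATYPE" || return 1
    done
}

# Example (not executed here):
# sweep_gpu_counts
```

Stopping at the first failure keeps a broken configuration from consuming the remaining runs.
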

Note

Throughput is calculated as:

  • \[throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time\]
  • \[throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time\]
  1. Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.

    git clone https://github.com/ROCm/MAD
    cd MAD
    pip install -r requirements.txt
    
  2. Use this command to run the performance benchmark test on the Phi-4 model using one GPU on the host machine.

    export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
    madengine run \
        --tags pyt_vllm_phi-4 \
        --keep-model-dir \
        --live-output \
        --timeout 28800
    

MAD launches a Docker container with the name container_ci-pyt_vllm_phi-4. The latency and throughput reports of the model are collected in the following path: ~/MAD/reports_/.

Although the available models are preconfigured to collect latency and throughput performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.

Download the Docker image and required scripts

  1. Run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.

    docker pull rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715
    docker run -it \
        --device=/dev/kfd \
        --device=/dev/dri \
        --group-add video \
        --shm-size 16G \
        --security-opt seccomp=unconfined \
        --security-opt apparmor=unconfined \
        --cap-add=SYS_PTRACE \
        -v $(pwd):/workspace \
        --env HUGGINGFACE_HUB_CACHE=/workspace \
        --name test \
        rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715
    
  2. In the Docker container, clone the ROCm MAD repository and navigate to the benchmark scripts directory at ~/MAD/scripts/vllm.

    git clone https://github.com/ROCm/MAD
    cd MAD/scripts/vllm
    
  3. To start the benchmark, use the following command with the appropriate options.

    Benchmark options

    Name           Options             Description
    $test_option   latency             Measure decoding token latency
                   throughput          Measure token generation throughput
                   all                 Measure both throughput and latency
    $num_gpu       1 or 8              Number of GPUs
    $datatype      float16 or float8   Data type

    The input sequence length, output sequence length, and tensor parallel (TP) are already configured. You don’t need to specify them with this script.

    Command:

    ./vllm_benchmark_report.sh \
        -s $test_option \
        -m microsoft/phi-4 \
        -g $num_gpu \
        -d 
    

    Note

    For best performance, it’s recommended to run with VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1.

    If you encounter the following error, provide a Hugging Face token that has been granted access to the gated model.

    OSError: You are trying to access a gated repo.
    
    # pass your HF_TOKEN
    export HF_TOKEN=$your_personal_hf_token
    

Benchmarking examples

Here are some examples of running the benchmark with various options:

  • Latency benchmark

    Use this command to benchmark the latency of the Phi-4 model on eight GPUs.

    ./vllm_benchmark_report.sh \
        -s latency \
        -m microsoft/phi-4 \
        -g 8 \
        -d 
    

    Find the latency report at ./reports__vllm_rocm6.4.1/summary/phi-4_latency_report.csv.

  • Throughput benchmark

    Use this command to benchmark the throughput of the Phi-4 model on eight GPUs.

    ./vllm_benchmark_report.sh \
        -s throughput \
        -m microsoft/phi-4 \
        -g 8 \
        -d 
    

    Find the throughput report at ./reports__vllm_rocm6.4.1/summary/phi-4_throughput_report.csv.

Note

Throughput is calculated as:

  • \[throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time\]
  • \[throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time\]
  1. Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.

    git clone https://github.com/ROCm/MAD
    cd MAD
    pip install -r requirements.txt
    
  2. Use this command to run the performance benchmark test on the Falcon 180B model using one GPU with the float16 data type on the host machine.

    export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
    madengine run \
        --tags pyt_vllm_falcon-180b \
        --keep-model-dir \
        --live-output \
        --timeout 28800
    

MAD launches a Docker container with the name container_ci-pyt_vllm_falcon-180b. The latency and throughput reports of the model are collected in the following path: ~/MAD/reports_float16/.

Although the available models are preconfigured to collect latency and throughput performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.

Download the Docker image and required scripts

  1. Run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.

    docker pull rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715
    docker run -it \
        --device=/dev/kfd \
        --device=/dev/dri \
        --group-add video \
        --shm-size 16G \
        --security-opt seccomp=unconfined \
        --security-opt apparmor=unconfined \
        --cap-add=SYS_PTRACE \
        -v $(pwd):/workspace \
        --env HUGGINGFACE_HUB_CACHE=/workspace \
        --name test \
        rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715
    
  2. In the Docker container, clone the ROCm MAD repository and navigate to the benchmark scripts directory at ~/MAD/scripts/vllm.

    git clone https://github.com/ROCm/MAD
    cd MAD/scripts/vllm
    
  3. To start the benchmark, use the following command with the appropriate options.

    Benchmark options

    Name           Options             Description
    $test_option   latency             Measure decoding token latency
                   throughput          Measure token generation throughput
                   all                 Measure both throughput and latency
    $num_gpu       1 or 8              Number of GPUs
    $datatype      float16 or float8   Data type

    The input sequence length, output sequence length, and tensor parallel (TP) are already configured. You don’t need to specify them with this script.

    Command:

    ./vllm_benchmark_report.sh \
        -s $test_option \
        -m tiiuae/falcon-180B \
        -g $num_gpu \
        -d float16
    

    Note

    For best performance, it’s recommended to run with VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1.

    If you encounter the following error, provide a Hugging Face token that has been granted access to the gated model.

    OSError: You are trying to access a gated repo.
    
    # pass your HF_TOKEN
    export HF_TOKEN=$your_personal_hf_token
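
    One way to apply the VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1 recommendation from the note above is to set the variable for a single command instead of exporting it for the whole shell session. The wrapper name below is hypothetical. The sketch uses the standard env utility and assumes the benchmark script passes its environment through to vLLM.

```shell
# Hypothetical wrapper: run any command with the recommended variable set
# for that command only, leaving the calling shell's environment unchanged.
with_prefill_decode_attention() {
    env VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1 "$@"
}

# Example (not executed here):
# with_prefill_decode_attention ./vllm_benchmark_report.sh -s latency -m tiiuae/falcon-180B -g 8 -d float16
```
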
    

Benchmarking examples

Here are some examples of running the benchmark with various options:

  • Latency benchmark

    Use this command to benchmark the latency of the Falcon 180B model on eight GPUs with float16 precision.

    ./vllm_benchmark_report.sh \
        -s latency \
        -m tiiuae/falcon-180B \
        -g 8 \
        -d float16
    

    Find the latency report at ./reports_float16_vllm_rocm6.4.1/summary/falcon-180B_latency_report.csv.

  • Throughput benchmark

    Use this command to benchmark the throughput of the Falcon 180B model on eight GPUs with float16 precision.

    ./vllm_benchmark_report.sh \
        -s throughput \
        -m tiiuae/falcon-180B \
        -g 8 \
        -d float16
    

    Find the throughput report at ./reports_float16_vllm_rocm6.4.1/summary/falcon-180B_throughput_report.csv.

Note

Throughput is calculated as:

  • \[throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time\]
  • \[throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time\]

Advanced usage#

For information on experimental features and known issues related to ROCm optimization efforts on vLLM, see the developer’s guide at ROCm/vllm.

Reproducing the Docker image#

To reproduce this ROCm/vLLM Docker image release, follow these steps:

  1. Clone the vLLM repository.

    git clone https://github.com/ROCm/vllm.git
    
  2. Check out the specific release commit.

    cd vllm
    git checkout b432b7a285aa0dcb9677380936ffa74931bb6d6f
    
  3. Build the Docker image. Replace vllm-rocm with your desired image tag.

    docker build -f docker/Dockerfile.rocm -t vllm-rocm .
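
Before building, it can help to confirm that the checkout is at the release commit from step 2. The following is a minimal sketch: the helper names commit_matches and verify_checkout are hypothetical, and verify_checkout assumes it runs inside the cloned vllm directory. The comparison is a prefix match, so it also accepts the abbreviated hash b432b7a28 that appears in the vLLM version string for this release.

```shell
EXPECTED_COMMIT=b432b7a285aa0dcb9677380936ffa74931bb6d6f

# Succeeds when the full hash in $1 starts with the (possibly abbreviated)
# hash in $2. An empty $2 matches anything, so always pass a real hash.
commit_matches() {
    case "$1" in
        "$2"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical helper: compare the current HEAD with the release commit.
verify_checkout() {
    head_commit=$(git rev-parse HEAD) || return 1
    commit_matches "$head_commit" "$EXPECTED_COMMIT"
}

# Example (not executed here):
# verify_checkout && docker build -f docker/Dockerfile.rocm -t vllm-rocm .
```
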
    

Known issues and workarounds#

AITER does not support FP8 KV cache yet.

Further reading#

Previous versions#

See vLLM inference performance testing version history to find documentation for previous releases of the ROCm/vllm Docker image.