vLLM inference performance testing

2025-10-16

Applies to Linux

The ROCm vLLM Docker image offers a prebuilt, optimized environment for validating large language model (LLM) inference performance on AMD Instinct™ MI355X, MI350X, MI325X and MI300X GPUs. This ROCm vLLM Docker image integrates vLLM and PyTorch tailored specifically for AMD data center GPUs and includes the following components:

Software component    Version
ROCm                  7.0.0
vLLM                  0.10.2 (0.11.0rc2.dev160+g790d22168.rocm700)
PyTorch               2.9.0a0+git1c57644
hipBLASLt             1.0.0

With this Docker image, you can quickly test the expected inference performance numbers for AMD Instinct GPUs.

What’s new

The following is a summary of notable changes since the previous ROCm/vLLM Docker release.

  • Added support for AMD Instinct MI355X and MI350X GPUs.

  • Added support and benchmarking instructions for the following models. See Supported models.

    • Llama 4 Scout and Maverick

    • DeepSeek R1 0528 FP8

    • MXFP4 models (MI355X and MI350X only): Llama 3.3 70B MXFP4 and Llama 3.1 405B MXFP4

    • GPT OSS 20B and 120B

    • Qwen 3 32B, 30B-A3B, and 235B-A22B

  • Removed the deprecated --max-seq-len-to-capture flag.

  • --gpu-memory-utilization is now configurable via the configuration files in the MAD repository.

Supported models

The following models are supported for inference performance benchmarking with vLLM and ROCm. Some instructions, commands, and recommendations in this documentation might vary by model – select one to get started. MXFP4 models are only supported on MI355X and MI350X GPUs.

  • Meta Llama: Llama 2 70B, Llama 3.1 8B, Llama 3.1 8B FP8, Llama 3.1 405B, Llama 3.1 405B FP8, Llama 3.1 405B MXFP4, Llama 3.3 70B, Llama 3.3 70B FP8, Llama 3.3 70B MXFP4, Llama 4 Scout 17Bx16E, Llama 4 Maverick 17Bx128E, Llama 4 Maverick 17Bx128E FP8

  • DeepSeek: DeepSeek R1 0528 FP8

  • OpenAI GPT OSS: GPT OSS 20B, GPT OSS 120B

  • Mistral AI: Mixtral MoE 8x7B, Mixtral MoE 8x7B FP8, Mixtral MoE 8x22B, Mixtral MoE 8x22B FP8

  • Qwen: Qwen3 8B, Qwen3 32B, Qwen3 30B A3B, Qwen3 30B A3B FP8, Qwen3 235B A22B, Qwen3 235B A22B FP8

  • Microsoft Phi: Phi-4

Note

See the Llama 2 70B model card on Hugging Face to learn more about your selected model. Some models require access authorization prior to use via an external license agreement through a third party.

Note

See the Llama 3.1 8B model card on Hugging Face to learn more about your selected model. Some models require access authorization prior to use via an external license agreement through a third party.

Note

See the Llama 3.1 8B FP8 model card on Hugging Face to learn more about your selected model. Some models require access authorization prior to use via an external license agreement through a third party.

This model uses FP8 quantization via AMD Quark for efficient inference on AMD GPUs.

Note

See the Llama 3.1 405B model card on Hugging Face to learn more about your selected model. Some models require access authorization prior to use via an external license agreement through a third party.

Note

See the Llama 3.1 405B FP8 model card on Hugging Face to learn more about your selected model. Some models require access authorization prior to use via an external license agreement through a third party.

This model uses FP8 quantization via AMD Quark for efficient inference on AMD GPUs.

Important

MXFP4 is supported only on MI355X and MI350X GPUs.

Note

See the Llama 3.1 405B MXFP4 model card on Hugging Face to learn more about your selected model. Some models require access authorization prior to use via an external license agreement through a third party.

This model uses FP4 quantization via AMD Quark for efficient inference on AMD GPUs.

Note

See the Llama 3.3 70B model card on Hugging Face to learn more about your selected model. Some models require access authorization prior to use via an external license agreement through a third party.

Note

See the Llama 3.3 70B FP8 model card on Hugging Face to learn more about your selected model. Some models require access authorization prior to use via an external license agreement through a third party.

This model uses FP8 quantization via AMD Quark for efficient inference on AMD GPUs.

Important

MXFP4 is supported only on MI355X and MI350X GPUs.

Note

See the Llama 3.3 70B MXFP4 model card on Hugging Face to learn more about your selected model. Some models require access authorization prior to use via an external license agreement through a third party.

This model uses FP4 quantization via AMD Quark for efficient inference on AMD GPUs.

Note

See the Llama 4 Scout 17Bx16E model card on Hugging Face to learn more about your selected model. Some models require access authorization prior to use via an external license agreement through a third party.

Note

See the Llama 4 Maverick 17Bx128E model card on Hugging Face to learn more about your selected model. Some models require access authorization prior to use via an external license agreement through a third party.

Note

See the Llama 4 Maverick 17Bx128E FP8 model card on Hugging Face to learn more about your selected model. Some models require access authorization prior to use via an external license agreement through a third party.

Note

See the DeepSeek R1 0528 FP8 model card on Hugging Face to learn more about your selected model. Some models require access authorization prior to use via an external license agreement through a third party.

Note

See the GPT OSS 20B model card on Hugging Face to learn more about your selected model. Some models require access authorization prior to use via an external license agreement through a third party.

Note

See the GPT OSS 120B model card on Hugging Face to learn more about your selected model. Some models require access authorization prior to use via an external license agreement through a third party.

Note

See the Mixtral MoE 8x7B model card on Hugging Face to learn more about your selected model. Some models require access authorization prior to use via an external license agreement through a third party.

Note

See the Mixtral MoE 8x7B FP8 model card on Hugging Face to learn more about your selected model. Some models require access authorization prior to use via an external license agreement through a third party.

This model uses FP8 quantization via AMD Quark for efficient inference on AMD GPUs.

Note

See the Mixtral MoE 8x22B model card on Hugging Face to learn more about your selected model. Some models require access authorization prior to use via an external license agreement through a third party.

Note

See the Mixtral MoE 8x22B FP8 model card on Hugging Face to learn more about your selected model. Some models require access authorization prior to use via an external license agreement through a third party.

This model uses FP8 quantization via AMD Quark for efficient inference on AMD GPUs.

Note

See the Qwen3 8B model card on Hugging Face to learn more about your selected model. Some models require access authorization prior to use via an external license agreement through a third party.

Note

See the Qwen3 32B model card on Hugging Face to learn more about your selected model. Some models require access authorization prior to use via an external license agreement through a third party.

Note

See the Qwen3 30B A3B model card on Hugging Face to learn more about your selected model. Some models require access authorization prior to use via an external license agreement through a third party.

Note

See the Qwen3 30B A3B FP8 model card on Hugging Face to learn more about your selected model. Some models require access authorization prior to use via an external license agreement through a third party.

Note

See the Qwen3 235B A22B model card on Hugging Face to learn more about your selected model. Some models require access authorization prior to use via an external license agreement through a third party.

Note

See the Qwen3 235B A22B FP8 model card on Hugging Face to learn more about your selected model. Some models require access authorization prior to use via an external license agreement through a third party.

Note

See the Phi-4 model card on Hugging Face to learn more about your selected model. Some models require access authorization prior to use via an external license agreement through a third party.

Performance measurements

To evaluate performance, the Performance results with AMD ROCm software page provides reference throughput and serving measurements for inferencing popular AI models.

Important

The performance data presented in Performance results with AMD ROCm software only reflects the latest version of this inference benchmarking environment. The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct GPUs or ROCm software.

System validation

Before running AI workloads, it’s important to validate that your AMD hardware is configured correctly and performing optimally.

If you have already validated your system settings, including aspects like NUMA auto-balancing, you can skip this step. Otherwise, complete the procedures in the System validation and optimization guide to properly configure your system settings before running inference benchmarks.

To test for optimal performance, consult the recommended System health benchmarks. This suite of tests will help you verify and fine-tune your system’s configuration.
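As a quick pre-check before working through the full guide, the NUMA auto-balancing state can be read from procfs. The following is a minimal sketch, assuming a Linux host that exposes `/proc/sys/kernel/numa_balancing`, where a value of 0 means auto-balancing is disabled. The helper name `numa_balancing_state` is made up for this example and is not part of ROCm or MAD.

```shell
# Hypothetical helper: report the NUMA auto-balancing state.
# The file path is an optional argument so the function can be tried on any machine.
numa_balancing_state() {
    local path="${1:-/proc/sys/kernel/numa_balancing}"
    if [ ! -r "$path" ]; then
        # The kernel does not expose the setting (or the file is unreadable).
        echo "unknown"
        return 0
    fi
    case "$(cat "$path")" in
        0) echo "disabled" ;;
        *) echo "enabled" ;;
    esac
}

echo "NUMA auto-balancing: $(numa_balancing_state)"
```

If the state is reported as enabled, follow the System validation and optimization guide to change it rather than relying on this check alone.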

Pull the Docker image

Download the ROCm vLLM Docker image. Use the following command to pull the Docker image from Docker Hub.

docker pull rocm/vllm:rocm7.0.0_vllm_0.10.2_20251006

Benchmarking

Once the setup is complete, choose between two options to reproduce the benchmark results: run the benchmark through MAD, or run the vLLM benchmark tool standalone inside the container.

The following run command is tailored to Llama 2 70B. See Supported models to switch to another available model.

  1. Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.

    git clone https://github.com/ROCm/MAD
    cd MAD
    pip install -r requirements.txt
    
  2. On the host machine, use this command to run the performance benchmark test on the Llama 2 70B model using one node with the float16 data type.

    export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
    madengine run \
        --tags pyt_vllm_llama-2-70b \
        --keep-model-dir \
        --live-output
    

MAD launches a Docker container with the name container_ci-pyt_vllm_llama-2-70b. The throughput and serving reports of the model are collected in the following paths: pyt_vllm_llama-2-70b_throughput.csv and pyt_vllm_llama-2-70b_serving.csv.
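The container and report names above follow the MAD tag. The sketch below derives them from a tag so that scripts can locate the reports; the naming pattern is inferred from the examples on this page and is an assumption, not a documented MAD guarantee.

```shell
# Tag passed to "madengine run --tags".
tag="pyt_vllm_llama-2-70b"

# Names inferred from the pattern shown in the paragraph above (assumption).
container_name="container_ci-${tag}"
throughput_report="${tag}_throughput.csv"
serving_report="${tag}_serving.csv"

printf '%s\n' "$container_name" "$throughput_report" "$serving_report"
```

Changing `tag` to another supported model's tag yields the corresponding names under the same assumption.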

Although the available models are preconfigured to collect offline throughput and online serving performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.

The following commands are optimized for Llama 2 70B. See Supported models to switch to another available model.

See also

For more information on configuration, see the config files in the MAD repository. Refer to the vLLM engine for descriptions of available configuration options and Benchmarking vLLM for additional benchmarking information.

Launch the container

You can run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.

docker pull rocm/vllm:rocm7.0.0_vllm_0.10.2_20251006
docker run -it \
    --device=/dev/kfd \
    --device=/dev/dri \
    --group-add video \
    --shm-size 16G \
    --security-opt seccomp=unconfined \
    --security-opt apparmor=unconfined \
    --cap-add=SYS_PTRACE \
    -v $(pwd):/workspace \
    --env HUGGINGFACE_HUB_CACHE=/workspace \
    --name test \
    rocm/vllm:rocm7.0.0_vllm_0.10.2_20251006

Throughput command

Use the following command to start the throughput benchmark.

model=meta-llama/Llama-2-70b-chat-hf
tp=8
num_prompts=1024
in=128
out=128
dtype=auto
kv_cache_dtype=auto
max_num_seqs=1024
max_num_batched_tokens=4096
max_model_len=4096

vllm bench throughput --model $model \
    -tp $tp \
    --num-prompts $num_prompts \
    --input-len $in \
    --output-len $out \
    --dtype $dtype \
    --kv-cache-dtype $kv_cache_dtype \
    --max-num-seqs $max_num_seqs \
    --max-num-batched-tokens $max_num_batched_tokens \
    --max-model-len $max_model_len \
    --trust-remote-code \
    --output-json ${model}_throughput.json \
    --gpu-memory-utilization 0.9
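Because the model ID contains a slash, `${model}_throughput.json` resolves to a file inside a `meta-llama/` subdirectory of the working directory. Depending on the vLLM version, writing the report might fail if that directory does not exist. The following sketch shows two possible workarounds, assuming you are free to choose where the report is written; the variable names `flat_name` and `report_dir` are illustrative only.

```shell
model=meta-llama/Llama-2-70b-chat-hf

# Option 1: replace every "/" with "_" to get a flat file name.
flat_name="$(printf '%s' "$model" | tr '/' '_')_throughput.json"
echo "$flat_name"

# Option 2: keep the nested path and compute its parent directory,
# which you would create with: mkdir -p "$report_dir"
report_dir="$(dirname "${model}_throughput.json")"
echo "$report_dir"
```

With option 1, pass `"$flat_name"` to `--output-json`; with option 2, create the directory before starting the benchmark.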

Serving command

  1. Start the server using the following command:

    model=meta-llama/Llama-2-70b-chat-hf
    tp=8
    dtype=auto
    kv_cache_dtype=auto
    max_num_seqs=256
    max_num_batched_tokens=4096
    max_model_len=4096
    
    vllm serve $model \
        -tp $tp \
        --dtype $dtype \
        --kv-cache-dtype $kv_cache_dtype \
        --max-num-seqs $max_num_seqs \
        --max-num-batched-tokens $max_num_batched_tokens \
        --max-model-len $max_model_len \
        --no-enable-prefix-caching \
        --swap-space 16 \
        --disable-log-requests \
        --trust-remote-code \
        --gpu-memory-utilization 0.9
    

    Wait until the model has loaded and the server is ready to accept requests.

  2. On another terminal on the same machine, run the benchmark:

    # Connect to the container
    docker exec -it test bash
    
    # Wait for the server to start
    until curl -s http://localhost:8000/v1/models; do sleep 30; done
    
    # Run the benchmark
    model=meta-llama/Llama-2-70b-chat-hf
    max_concurrency=1
    num_prompts=10
    in=128
    out=128
    vllm bench serve --model $model \
        --percentile-metrics "ttft,tpot,itl,e2el" \
        --dataset-name random \
        --ignore-eos \
        --max-concurrency $max_concurrency \
        --num-prompts $num_prompts \
        --random-input-len $in \
        --random-output-len $out \
        --trust-remote-code \
        --save-result \
        --result-filename ${model}_serving.json
    

Note

For improved performance with certain Mixture of Experts models, such as Mixtral 8x22B, try adding export VLLM_ROCM_USE_AITER=1 to your commands.

If you encounter the following error, export a Hugging Face token that has been granted access to the gated model.

OSError: You are trying to access a gated repo.

# pass your HF_TOKEN
export HF_TOKEN=$your_personal_hf_token

Note

Throughput is calculated as:

  • \[\text{throughput\_tot} = \frac{\text{requests} \times (\text{input length} + \text{output length})}{\text{elapsed\_time}}\]
  • \[\text{throughput\_gen} = \frac{\text{requests} \times \text{output length}}{\text{elapsed\_time}}\]
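As a worked example of the two formulas, the sketch below plugs in the request count and sequence lengths from the throughput command on this page. The elapsed time of 20 seconds is a made-up value for illustration, not a measured result.

```shell
# Request count and lengths match the throughput command; elapsed_time is hypothetical.
requests=1024
input_len=128
output_len=128
elapsed_time=20   # seconds (illustrative only)

# Total throughput counts input and output tokens.
throughput_tot=$(awk -v r="$requests" -v i="$input_len" -v o="$output_len" -v t="$elapsed_time" \
    'BEGIN { printf "%.1f", r * (i + o) / t }')

# Generation throughput counts output tokens only.
throughput_gen=$(awk -v r="$requests" -v o="$output_len" -v t="$elapsed_time" \
    'BEGIN { printf "%.1f", r * o / t }')

echo "throughput_tot = ${throughput_tot} tokens/s"   # 13107.2 for these inputs
echo "throughput_gen = ${throughput_gen} tokens/s"   # 6553.6 for these inputs
```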

The following run command is tailored to Llama 3.1 8B. See Supported models to switch to another available model.

  1. Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.

    git clone https://github.com/ROCm/MAD
    cd MAD
    pip install -r requirements.txt
    
  2. On the host machine, use this command to run the performance benchmark test on the Llama 3.1 8B model using one node with the float16 data type.

    export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
    madengine run \
        --tags pyt_vllm_llama-3.1-8b \
        --keep-model-dir \
        --live-output
    

MAD launches a Docker container with the name container_ci-pyt_vllm_llama-3.1-8b. The throughput and serving reports of the model are collected in the following paths: pyt_vllm_llama-3.1-8b_throughput.csv and pyt_vllm_llama-3.1-8b_serving.csv.

Although the available models are preconfigured to collect offline throughput and online serving performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.

The following commands are optimized for Llama 3.1 8B. See Supported models to switch to another available model.

See also

For more information on configuration, see the config files in the MAD repository. Refer to the vLLM engine for descriptions of available configuration options and Benchmarking vLLM for additional benchmarking information.

Launch the container

You can run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.

docker pull rocm/vllm:rocm7.0.0_vllm_0.10.2_20251006
docker run -it \
    --device=/dev/kfd \
    --device=/dev/dri \
    --group-add video \
    --shm-size 16G \
    --security-opt seccomp=unconfined \
    --security-opt apparmor=unconfined \
    --cap-add=SYS_PTRACE \
    -v $(pwd):/workspace \
    --env HUGGINGFACE_HUB_CACHE=/workspace \
    --name test \
    rocm/vllm:rocm7.0.0_vllm_0.10.2_20251006

Throughput command

Use the following command to start the throughput benchmark.

model=meta-llama/Llama-3.1-8B-Instruct
tp=1
num_prompts=1024
in=128
out=128
dtype=auto
kv_cache_dtype=auto
max_num_seqs=1024
max_num_batched_tokens=131072
max_model_len=8192

vllm bench throughput --model $model \
    -tp $tp \
    --num-prompts $num_prompts \
    --input-len $in \
    --output-len $out \
    --dtype $dtype \
    --kv-cache-dtype $kv_cache_dtype \
    --max-num-seqs $max_num_seqs \
    --max-num-batched-tokens $max_num_batched_tokens \
    --max-model-len $max_model_len \
    --trust-remote-code \
    --output-json ${model}_throughput.json \
    --gpu-memory-utilization 0.9

Serving command

  1. Start the server using the following command:

    model=meta-llama/Llama-3.1-8B-Instruct
    tp=1
    dtype=auto
    kv_cache_dtype=auto
    max_num_seqs=256
    max_num_batched_tokens=131072
    max_model_len=8192
    
    vllm serve $model \
        -tp $tp \
        --dtype $dtype \
        --kv-cache-dtype $kv_cache_dtype \
        --max-num-seqs $max_num_seqs \
        --max-num-batched-tokens $max_num_batched_tokens \
        --max-model-len $max_model_len \
        --no-enable-prefix-caching \
        --swap-space 16 \
        --disable-log-requests \
        --trust-remote-code \
        --gpu-memory-utilization 0.9
    

    Wait until the model has loaded and the server is ready to accept requests.

  2. On another terminal on the same machine, run the benchmark:

    # Connect to the container
    docker exec -it test bash
    
    # Wait for the server to start
    until curl -s http://localhost:8000/v1/models; do sleep 30; done
    
    # Run the benchmark
    model=meta-llama/Llama-3.1-8B-Instruct
    max_concurrency=1
    num_prompts=10
    in=128
    out=128
    vllm bench serve --model $model \
        --percentile-metrics "ttft,tpot,itl,e2el" \
        --dataset-name random \
        --ignore-eos \
        --max-concurrency $max_concurrency \
        --num-prompts $num_prompts \
        --random-input-len $in \
        --random-output-len $out \
        --trust-remote-code \
        --save-result \
        --result-filename ${model}_serving.json
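The `until curl ...` loop in step 2 retries forever if the server never comes up. A bounded variant can be sketched as follows; `wait_until` is a hypothetical helper written for this example, and the attempt count and delay are arbitrary choices.

```shell
# Hypothetical helper: run a command until it succeeds, at most max_attempts times,
# sleeping delay seconds between attempts. Returns 1 if every attempt fails.
wait_until() {
    local max_attempts="$1" delay="$2"
    shift 2
    local attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        if "$@" > /dev/null 2>&1; then
            return 0
        fi
        # Only sleep when another attempt will follow.
        if [ "$attempt" -lt "$max_attempts" ]; then
            sleep "$delay"
        fi
        attempt=$((attempt + 1))
    done
    return 1
}

# Example (not executed here): allow up to 40 attempts, 30 seconds apart.
# wait_until 40 30 curl -sf http://localhost:8000/v1/models || echo "server did not start" >&2
```

Large models can take a long time to load, so choose the limit generously.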
    

Note

For improved performance with certain Mixture of Experts models, such as Mixtral 8x22B, try adding export VLLM_ROCM_USE_AITER=1 to your commands.

If you encounter the following error, export a Hugging Face token that has been granted access to the gated model.

OSError: You are trying to access a gated repo.

# pass your HF_TOKEN
export HF_TOKEN=$your_personal_hf_token

Note

Throughput is calculated as:

  • \[\text{throughput\_tot} = \frac{\text{requests} \times (\text{input length} + \text{output length})}{\text{elapsed\_time}}\]
  • \[\text{throughput\_gen} = \frac{\text{requests} \times \text{output length}}{\text{elapsed\_time}}\]

The following run command is tailored to Llama 3.1 8B FP8. See Supported models to switch to another available model.

  1. Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.

    git clone https://github.com/ROCm/MAD
    cd MAD
    pip install -r requirements.txt
    
  2. On the host machine, use this command to run the performance benchmark test on the Llama 3.1 8B FP8 model using one node with the float8 data type.

    export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
    madengine run \
        --tags pyt_vllm_llama-3.1-8b_fp8 \
        --keep-model-dir \
        --live-output
    

MAD launches a Docker container with the name container_ci-pyt_vllm_llama-3.1-8b_fp8. The throughput and serving reports of the model are collected in the following paths: pyt_vllm_llama-3.1-8b_fp8_throughput.csv and pyt_vllm_llama-3.1-8b_fp8_serving.csv.

Although the available models are preconfigured to collect offline throughput and online serving performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.

The following commands are optimized for Llama 3.1 8B FP8. See Supported models to switch to another available model.

See also

For more information on configuration, see the config files in the MAD repository. Refer to the vLLM engine for descriptions of available configuration options and Benchmarking vLLM for additional benchmarking information.

Launch the container

You can run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.

docker pull rocm/vllm:rocm7.0.0_vllm_0.10.2_20251006
docker run -it \
    --device=/dev/kfd \
    --device=/dev/dri \
    --group-add video \
    --shm-size 16G \
    --security-opt seccomp=unconfined \
    --security-opt apparmor=unconfined \
    --cap-add=SYS_PTRACE \
    -v $(pwd):/workspace \
    --env HUGGINGFACE_HUB_CACHE=/workspace \
    --name test \
    rocm/vllm:rocm7.0.0_vllm_0.10.2_20251006

Throughput command

Use the following command to start the throughput benchmark.

model=amd/Llama-3.1-8B-Instruct-FP8-KV
tp=1
num_prompts=1024
in=128
out=128
dtype=auto
kv_cache_dtype=fp8
max_num_seqs=1024
max_num_batched_tokens=131072
max_model_len=8192

vllm bench throughput --model $model \
    -tp $tp \
    --num-prompts $num_prompts \
    --input-len $in \
    --output-len $out \
    --dtype $dtype \
    --kv-cache-dtype $kv_cache_dtype \
    --max-num-seqs $max_num_seqs \
    --max-num-batched-tokens $max_num_batched_tokens \
    --max-model-len $max_model_len \
    --trust-remote-code \
    --output-json ${model}_throughput.json \
    --gpu-memory-utilization 0.9

Serving command

  1. Start the server using the following command:

    model=amd/Llama-3.1-8B-Instruct-FP8-KV
    tp=1
    dtype=auto
    kv_cache_dtype=fp8
    max_num_seqs=256
    max_num_batched_tokens=131072
    max_model_len=8192
    
    vllm serve $model \
        -tp $tp \
        --dtype $dtype \
        --kv-cache-dtype $kv_cache_dtype \
        --max-num-seqs $max_num_seqs \
        --max-num-batched-tokens $max_num_batched_tokens \
        --max-model-len $max_model_len \
        --no-enable-prefix-caching \
        --swap-space 16 \
        --disable-log-requests \
        --trust-remote-code \
        --gpu-memory-utilization 0.9
    

    Wait until the model has loaded and the server is ready to accept requests.

  2. On another terminal on the same machine, run the benchmark:

    # Connect to the container
    docker exec -it test bash
    
    # Wait for the server to start
    until curl -s http://localhost:8000/v1/models; do sleep 30; done
    
    # Run the benchmark
    model=amd/Llama-3.1-8B-Instruct-FP8-KV
    max_concurrency=1
    num_prompts=10
    in=128
    out=128
    vllm bench serve --model $model \
        --percentile-metrics "ttft,tpot,itl,e2el" \
        --dataset-name random \
        --ignore-eos \
        --max-concurrency $max_concurrency \
        --num-prompts $num_prompts \
        --random-input-len $in \
        --random-output-len $out \
        --trust-remote-code \
        --save-result \
        --result-filename ${model}_serving.json
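The serving benchmark above uses a single concurrency level. To compare several levels against the same running server, the command can be generated in a loop. The sketch below is a dry run that only prints the commands; the concurrency values, the prompt-count scaling, and the result file names are illustrative assumptions, while the flags are the ones used in step 2.

```shell
model=amd/Llama-3.1-8B-Instruct-FP8-KV
in=128
out=128

# Hypothetical helper: print one benchmark command per concurrency level (dry run).
sweep_commands() {
    local max_concurrency num_prompts
    for max_concurrency in 1 4 16; do
        # Illustrative scaling of the prompt count; not taken from the source.
        num_prompts=$((max_concurrency * 10))
        echo vllm bench serve --model "$model" \
            --dataset-name random \
            --ignore-eos \
            --max-concurrency "$max_concurrency" \
            --num-prompts "$num_prompts" \
            --random-input-len "$in" \
            --random-output-len "$out" \
            --save-result \
            --result-filename "serving_c${max_concurrency}.json"
    done
}

sweep_commands
```

Remove the `echo` to run the commands instead of printing them.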
    

Note

For improved performance with certain Mixture of Experts models, such as Mixtral 8x22B, try adding export VLLM_ROCM_USE_AITER=1 to your commands.

If you encounter the following error, export a Hugging Face token that has been granted access to the gated model.

OSError: You are trying to access a gated repo.

# pass your HF_TOKEN
export HF_TOKEN=$your_personal_hf_token

Note

Throughput is calculated as:

  • \[\text{throughput\_tot} = \frac{\text{requests} \times (\text{input length} + \text{output length})}{\text{elapsed\_time}}\]
  • \[\text{throughput\_gen} = \frac{\text{requests} \times \text{output length}}{\text{elapsed\_time}}\]

The following run command is tailored to Llama 3.1 405B. See Supported models to switch to another available model.

  1. Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.

    git clone https://github.com/ROCm/MAD
    cd MAD
    pip install -r requirements.txt
    
  2. On the host machine, use this command to run the performance benchmark test on the Llama 3.1 405B model using one node with the float16 data type.

    export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
    madengine run \
        --tags pyt_vllm_llama-3.1-405b \
        --keep-model-dir \
        --live-output
    

MAD launches a Docker container with the name container_ci-pyt_vllm_llama-3.1-405b. The throughput and serving reports of the model are collected in the following paths: pyt_vllm_llama-3.1-405b_throughput.csv and pyt_vllm_llama-3.1-405b_serving.csv.

Although the available models are preconfigured to collect offline throughput and online serving performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.

The following commands are optimized for Llama 3.1 405B. See Supported models to switch to another available model.

See also

For more information on configuration, see the config files in the MAD repository. Refer to the vLLM engine for descriptions of available configuration options and Benchmarking vLLM for additional benchmarking information.

Launch the container

You can run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.

docker pull rocm/vllm:rocm7.0.0_vllm_0.10.2_20251006
docker run -it \
    --device=/dev/kfd \
    --device=/dev/dri \
    --group-add video \
    --shm-size 16G \
    --security-opt seccomp=unconfined \
    --security-opt apparmor=unconfined \
    --cap-add=SYS_PTRACE \
    -v $(pwd):/workspace \
    --env HUGGINGFACE_HUB_CACHE=/workspace \
    --name test \
    rocm/vllm:rocm7.0.0_vllm_0.10.2_20251006

Throughput command

Use the following command to start the throughput benchmark.

model=meta-llama/Llama-3.1-405B-Instruct
tp=8
num_prompts=1024
in=128
out=128
dtype=auto
kv_cache_dtype=auto
max_num_seqs=1024
max_num_batched_tokens=131072
max_model_len=8192

vllm bench throughput --model $model \
    -tp $tp \
    --num-prompts $num_prompts \
    --input-len $in \
    --output-len $out \
    --dtype $dtype \
    --kv-cache-dtype $kv_cache_dtype \
    --max-num-seqs $max_num_seqs \
    --max-num-batched-tokens $max_num_batched_tokens \
    --max-model-len $max_model_len \
    --trust-remote-code \
    --output-json ${model}_throughput.json \
    --gpu-memory-utilization 0.9

Serving command

  1. Start the server using the following command:

    model=meta-llama/Llama-3.1-405B-Instruct
    tp=8
    dtype=auto
    kv_cache_dtype=auto
    max_num_seqs=256
    max_num_batched_tokens=131072
    max_model_len=8192
    
    vllm serve $model \
        -tp $tp \
        --dtype $dtype \
        --kv-cache-dtype $kv_cache_dtype \
        --max-num-seqs $max_num_seqs \
        --max-num-batched-tokens $max_num_batched_tokens \
        --max-model-len $max_model_len \
        --no-enable-prefix-caching \
        --swap-space 16 \
        --disable-log-requests \
        --trust-remote-code \
        --gpu-memory-utilization 0.9
    

    Wait until the model has loaded and the server is ready to accept requests.

  2. On another terminal on the same machine, run the benchmark:

    # Connect to the container
    docker exec -it test bash
    
    # Wait for the server to start
    until curl -s http://localhost:8000/v1/models; do sleep 30; done
    
    # Run the benchmark
    model=meta-llama/Llama-3.1-405B-Instruct
    max_concurrency=1
    num_prompts=10
    in=128
    out=128
    vllm bench serve --model $model \
        --percentile-metrics "ttft,tpot,itl,e2el" \
        --dataset-name random \
        --ignore-eos \
        --max-concurrency $max_concurrency \
        --num-prompts $num_prompts \
        --random-input-len $in \
        --random-output-len $out \
        --trust-remote-code \
        --save-result \
        --result-filename ${model}_serving.json
    

Note

For improved performance with certain Mixture of Experts models, such as Mixtral 8x22B, try adding export VLLM_ROCM_USE_AITER=1 to your commands.

If you encounter the following error, export a Hugging Face token that has been granted access to the gated model.

OSError: You are trying to access a gated repo.

# pass your HF_TOKEN
export HF_TOKEN=$your_personal_hf_token
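To catch a missing token before a long model download starts, a small guard can be run first. This is a sketch only; `require_hf_token` is a hypothetical helper, and it checks that the variable is set, not that the token is valid or authorized for the model.

```shell
# Hypothetical helper: fail early when no Hugging Face token is exported.
require_hf_token() {
    if [ -z "${HF_TOKEN:-}" ]; then
        echo "HF_TOKEN is not set; gated models will fail to download" >&2
        return 1
    fi
    return 0
}

# Example (not executed here): run the guard before starting a benchmark.
# require_hf_token && vllm bench throughput --model "$model"
#
# When the token is exported on the host, adding "--env HF_TOKEN" (without a value)
# to the docker run command forwards it into the container.
```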

Note

Throughput is calculated as:

  • \[\text{throughput\_tot} = \frac{\text{requests} \times (\text{input length} + \text{output length})}{\text{elapsed\_time}}\]
  • \[\text{throughput\_gen} = \frac{\text{requests} \times \text{output length}}{\text{elapsed\_time}}\]

The following run command is tailored to Llama 3.1 405B FP8. See Supported models to switch to another available model.

  1. Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.

    git clone https://github.com/ROCm/MAD
    cd MAD
    pip install -r requirements.txt
    
  2. On the host machine, use this command to run the performance benchmark test on the Llama 3.1 405B FP8 model using one node with the float8 data type.

    export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
    madengine run \
        --tags pyt_vllm_llama-3.1-405b_fp8 \
        --keep-model-dir \
        --live-output
    

MAD launches a Docker container with the name container_ci-pyt_vllm_llama-3.1-405b_fp8. The throughput and serving reports of the model are collected in the following paths: pyt_vllm_llama-3.1-405b_fp8_throughput.csv and pyt_vllm_llama-3.1-405b_fp8_serving.csv.

Although the available models are preconfigured to collect offline throughput and online serving performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.

The following commands are optimized for Llama 3.1 405B FP8. See Supported models to switch to another available model.

See also

For more information on configuration, see the config files in the MAD repository. Refer to the vLLM engine for descriptions of available configuration options and Benchmarking vLLM for additional benchmarking information.

Launch the container

You can run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.

docker pull rocm/vllm:rocm7.0.0_vllm_0.10.2_20251006
docker run -it \
    --device=/dev/kfd \
    --device=/dev/dri \
    --group-add video \
    --shm-size 16G \
    --security-opt seccomp=unconfined \
    --security-opt apparmor=unconfined \
    --cap-add=SYS_PTRACE \
    -v $(pwd):/workspace \
    --env HUGGINGFACE_HUB_CACHE=/workspace \
    --name test \
    rocm/vllm:rocm7.0.0_vllm_0.10.2_20251006

Throughput command

Use the following command to start the throughput benchmark.

model=amd/Llama-3.1-405B-Instruct-FP8-KV
tp=8
num_prompts=1024
in=128
out=128
dtype=auto
kv_cache_dtype=fp8
max_num_seqs=1024
max_num_batched_tokens=131072
max_model_len=8192

vllm bench throughput --model $model \
    -tp $tp \
    --num-prompts $num_prompts \
    --input-len $in \
    --output-len $out \
    --dtype $dtype \
    --kv-cache-dtype $kv_cache_dtype \
    --max-num-seqs $max_num_seqs \
    --max-num-batched-tokens $max_num_batched_tokens \
    --max-model-len $max_model_len \
    --trust-remote-code \
    --output-json ${model}_throughput.json \
    --gpu-memory-utilization 0.9

Serving command

  1. Start the server using the following command:

    model=amd/Llama-3.1-405B-Instruct-FP8-KV
    tp=8
    dtype=auto
    kv_cache_dtype=fp8
    max_num_seqs=256
    max_num_batched_tokens=131072
    max_model_len=8192
    
    vllm serve $model \
        -tp $tp \
        --dtype $dtype \
        --kv-cache-dtype $kv_cache_dtype \
        --max-num-seqs $max_num_seqs \
        --max-num-batched-tokens $max_num_batched_tokens \
        --max-model-len $max_model_len \
        --no-enable-prefix-caching \
        --swap-space 16 \
        --disable-log-requests \
        --trust-remote-code \
        --gpu-memory-utilization 0.9
    

    Wait until the model has loaded and the server is ready to accept requests.

  2. In another terminal on the same machine, run the benchmark:

    # Connect to the container
    docker exec -it test bash
    
    # Wait for the server to start
    until curl -s http://localhost:8000/v1/models; do sleep 30; done
    
    # Run the benchmark
    model=amd/Llama-3.1-405B-Instruct-FP8-KV
    max_concurrency=1
    num_prompts=10
    in=128
    out=128
    vllm bench serve --model $model \
        --percentile-metrics "ttft,tpot,itl,e2el" \
        --dataset-name random \
        --ignore-eos \
        --max-concurrency $max_concurrency \
        --num-prompts $num_prompts \
        --random-input-len $in \
        --random-output-len $out \
        --trust-remote-code \
        --save-result \
        --result-filename ${model}_serving.json
    

Note

For improved performance with certain Mixture of Experts models, such as Mixtral 8x22B, try running export VLLM_ROCM_USE_AITER=1 before your commands.

If you encounter the following error, the model is gated. Set HF_TOKEN to a Hugging Face token that has been granted access to the model.

OSError: You are trying to access a gated repo.

# pass your HF_TOKEN
export HF_TOKEN=$your_personal_hf_token

Note

Throughput is calculated as:

  • \[throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time\]
  • \[throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time\]
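
As a worked instance of the two formulas, the sketch below plugs in the request count and token lengths used by the throughput command in this section (1024 prompts, 128 input tokens, 128 output tokens). The 64-second elapsed time is a made-up placeholder, not a measured result, and the function names are illustrative.

```shell
#!/bin/sh
# Worked example of the throughput formulas.
# elapsed_time=64 below is a hypothetical placeholder, not a benchmark result.

# throughput_tot = requests * (input_len + output_len) / elapsed_time
throughput_tot() {
    awk -v r="$1" -v i="$2" -v o="$3" -v t="$4" \
        'BEGIN { printf "%.2f\n", r * (i + o) / t }'
}

# throughput_gen = requests * output_len / elapsed_time
throughput_gen() {
    awk -v r="$1" -v o="$2" -v t="$3" \
        'BEGIN { printf "%.2f\n", r * o / t }'
}

requests=1024
input_len=128
output_len=128
elapsed_time=64   # seconds (hypothetical)

echo "throughput_tot: $(throughput_tot "$requests" "$input_len" "$output_len" "$elapsed_time") tokens/s"
echo "throughput_gen: $(throughput_gen "$requests" "$output_len" "$elapsed_time") tokens/s"
```

When the input and output lengths are equal, as they are here, throughput_gen is exactly half of throughput_tot.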

The following run command is tailored to Llama 3.1 405B MXFP4. See Supported models to switch to another available model.

  1. Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.

    git clone https://github.com/ROCm/MAD
    cd MAD
    pip install -r requirements.txt
    
  2. On the host machine, use this command to run the performance benchmark test on the Llama 3.1 405B MXFP4 model using one node with the float4 data type.

    export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
    madengine run \
        --tags pyt_vllm_llama-3.1-405b_fp4 \
        --keep-model-dir \
        --live-output
    

MAD launches a Docker container named container_ci-pyt_vllm_llama-3.1-405b_fp4. The model's throughput and serving reports are written to pyt_vllm_llama-3.1-405b_fp4_throughput.csv and pyt_vllm_llama-3.1-405b_fp4_serving.csv, respectively.

Although the available models are preconfigured to collect offline throughput and online serving performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.

The following commands are optimized for Llama 3.1 405B MXFP4. See Supported models to switch to another available model.

See also

For more information on configuration, see the config files in the MAD repository. Refer to the vLLM engine for descriptions of available configuration options and Benchmarking vLLM for additional benchmarking information.

Launch the container

You can run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.

docker pull rocm/vllm:rocm7.0.0_vllm_0.10.2_20251006
docker run -it \
    --device=/dev/kfd \
    --device=/dev/dri \
    --group-add video \
    --shm-size 16G \
    --security-opt seccomp=unconfined \
    --security-opt apparmor=unconfined \
    --cap-add=SYS_PTRACE \
    -v $(pwd):/workspace \
    --env HUGGINGFACE_HUB_CACHE=/workspace \
    --name test \
    rocm/vllm:rocm7.0.0_vllm_0.10.2_20251006
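
Before launching the container, it can help to confirm that the device nodes passed with --device exist on the host. The following is a minimal sketch; the helper name check_paths is hypothetical and is not part of Docker or ROCm.

```shell
#!/bin/sh
# Hypothetical pre-flight helper: report which of the given paths are missing.
# Returns 0 when every path exists, 1 otherwise.
check_paths() {
    missing=0
    for device_path in "$@"; do
        if [ ! -e "$device_path" ]; then
            echo "missing: $device_path" >&2
            missing=1
        fi
    done
    return "$missing"
}

# /dev/kfd and /dev/dri are the device nodes the docker run command maps in.
if check_paths /dev/kfd /dev/dri; then
    echo "GPU device nodes found"
else
    echo "GPU device nodes not found; the container would not see the GPUs"
fi
```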

Throughput command

Use the following command to start the throughput benchmark.

model=amd/Llama-3.1-405B-Instruct-MXFP4-Preview
tp=8
num_prompts=1024
in=128
out=128
dtype=auto
kv_cache_dtype=fp8
max_num_seqs=1024
max_num_batched_tokens=131072
max_model_len=8192

vllm bench throughput --model $model \
    -tp $tp \
    --num-prompts $num_prompts \
    --input-len $in \
    --output-len $out \
    --dtype $dtype \
    --kv-cache-dtype $kv_cache_dtype \
    --max-num-seqs $max_num_seqs \
    --max-num-batched-tokens $max_num_batched_tokens \
    --max-model-len $max_model_len \
    --trust-remote-code \
    --output-json ${model}_throughput.json \
    --gpu-memory-utilization 0.9

Serving command

  1. Start the server using the following command:

    model=amd/Llama-3.1-405B-Instruct-MXFP4-Preview
    tp=8
    dtype=auto
    kv_cache_dtype=fp8
    max_num_seqs=256
    max_num_batched_tokens=131072
    max_model_len=8192
    
    vllm serve $model \
        -tp $tp \
        --dtype $dtype \
        --kv-cache-dtype $kv_cache_dtype \
        --max-num-seqs $max_num_seqs \
        --max-num-batched-tokens $max_num_batched_tokens \
        --max-model-len $max_model_len \
        --no-enable-prefix-caching \
        --swap-space 16 \
        --disable-log-requests \
        --trust-remote-code \
        --gpu-memory-utilization 0.9
    

    Wait until the model has loaded and the server is ready to accept requests.

  2. In another terminal on the same machine, run the benchmark:

    # Connect to the container
    docker exec -it test bash
    
    # Wait for the server to start
    until curl -s http://localhost:8000/v1/models; do sleep 30; done
    
    # Run the benchmark
    model=amd/Llama-3.1-405B-Instruct-MXFP4-Preview
    max_concurrency=1
    num_prompts=10
    in=128
    out=128
    vllm bench serve --model $model \
        --percentile-metrics "ttft,tpot,itl,e2el" \
        --dataset-name random \
        --ignore-eos \
        --max-concurrency $max_concurrency \
        --num-prompts $num_prompts \
        --random-input-len $in \
        --random-output-len $out \
        --trust-remote-code \
        --save-result \
        --result-filename ${model}_serving.json
    

Note

For improved performance with certain Mixture of Experts models, such as Mixtral 8x22B, try running export VLLM_ROCM_USE_AITER=1 before your commands.

If you encounter the following error, the model is gated. Set HF_TOKEN to a Hugging Face token that has been granted access to the model.

OSError: You are trying to access a gated repo.

# pass your HF_TOKEN
export HF_TOKEN=$your_personal_hf_token

Note

Throughput is calculated as:

  • \[throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time\]
  • \[throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time\]

The following run command is tailored to Llama 3.3 70B. See Supported models to switch to another available model.

  1. Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.

    git clone https://github.com/ROCm/MAD
    cd MAD
    pip install -r requirements.txt
    
  2. On the host machine, use this command to run the performance benchmark test on the Llama 3.3 70B model using one node with the float16 data type.

    export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
    madengine run \
        --tags pyt_vllm_llama-3.3-70b \
        --keep-model-dir \
        --live-output
    

MAD launches a Docker container named container_ci-pyt_vllm_llama-3.3-70b. The model's throughput and serving reports are written to pyt_vllm_llama-3.3-70b_throughput.csv and pyt_vllm_llama-3.3-70b_serving.csv, respectively.

Although the available models are preconfigured to collect offline throughput and online serving performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.

The following commands are optimized for Llama 3.3 70B. See Supported models to switch to another available model.

See also

For more information on configuration, see the config files in the MAD repository. Refer to the vLLM engine for descriptions of available configuration options and Benchmarking vLLM for additional benchmarking information.

Launch the container

You can run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.

docker pull rocm/vllm:rocm7.0.0_vllm_0.10.2_20251006
docker run -it \
    --device=/dev/kfd \
    --device=/dev/dri \
    --group-add video \
    --shm-size 16G \
    --security-opt seccomp=unconfined \
    --security-opt apparmor=unconfined \
    --cap-add=SYS_PTRACE \
    -v $(pwd):/workspace \
    --env HUGGINGFACE_HUB_CACHE=/workspace \
    --name test \
    rocm/vllm:rocm7.0.0_vllm_0.10.2_20251006

Throughput command

Use the following command to start the throughput benchmark.

model=meta-llama/Llama-3.3-70B-Instruct
tp=8
num_prompts=1024
in=128
out=128
dtype=auto
kv_cache_dtype=auto
max_num_seqs=1024
max_num_batched_tokens=131072
max_model_len=8192

vllm bench throughput --model $model \
    -tp $tp \
    --num-prompts $num_prompts \
    --input-len $in \
    --output-len $out \
    --dtype $dtype \
    --kv-cache-dtype $kv_cache_dtype \
    --max-num-seqs $max_num_seqs \
    --max-num-batched-tokens $max_num_batched_tokens \
    --max-model-len $max_model_len \
    --trust-remote-code \
    --output-json ${model}_throughput.json \
    --gpu-memory-utilization 0.9
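
Because $model contains a slash, ${model}_throughput.json expands to a path inside a subdirectory named after the model's organization (here, meta-llama/). The sketch below shows that expansion and creates the directory up front. Whether the benchmark tool creates the directory on its own is not stated in this guide, so treating it as missing is an assumption, and the helper name result_path is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: build the result path passed to --output-json and make
# sure its parent directory exists (assumption: the tool does not create it).
result_path() {
    out_path="${1}_${2}.json"
    mkdir -p "$(dirname "$out_path")"
    echo "$out_path"
}

model=meta-llama/Llama-3.3-70B-Instruct

# Demonstrate inside a scratch directory so the current directory is untouched.
demo_dir=$(mktemp -d)
out_file=$(cd "$demo_dir" && result_path "$model" throughput)
echo "$out_file"   # meta-llama/Llama-3.3-70B-Instruct_throughput.json
```

In a real run, the helper would be called from the directory where the benchmark is started, and its output passed to --output-json.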

Serving command

  1. Start the server using the following command:

    model=meta-llama/Llama-3.3-70B-Instruct
    tp=8
    dtype=auto
    kv_cache_dtype=auto
    max_num_seqs=256
    max_num_batched_tokens=131072
    max_model_len=8192
    
    vllm serve $model \
        -tp $tp \
        --dtype $dtype \
        --kv-cache-dtype $kv_cache_dtype \
        --max-num-seqs $max_num_seqs \
        --max-num-batched-tokens $max_num_batched_tokens \
        --max-model-len $max_model_len \
        --no-enable-prefix-caching \
        --swap-space 16 \
        --disable-log-requests \
        --trust-remote-code \
        --gpu-memory-utilization 0.9
    

    Wait until the model has loaded and the server is ready to accept requests.

  2. In another terminal on the same machine, run the benchmark:

    # Connect to the container
    docker exec -it test bash
    
    # Wait for the server to start
    until curl -s http://localhost:8000/v1/models; do sleep 30; done
    
    # Run the benchmark
    model=meta-llama/Llama-3.3-70B-Instruct
    max_concurrency=1
    num_prompts=10
    in=128
    out=128
    vllm bench serve --model $model \
        --percentile-metrics "ttft,tpot,itl,e2el" \
        --dataset-name random \
        --ignore-eos \
        --max-concurrency $max_concurrency \
        --num-prompts $num_prompts \
        --random-input-len $in \
        --random-output-len $out \
        --trust-remote-code \
        --save-result \
        --result-filename ${model}_serving.json
    

Note

For improved performance with certain Mixture of Experts models, such as Mixtral 8x22B, try running export VLLM_ROCM_USE_AITER=1 before your commands.

If you encounter the following error, the model is gated. Set HF_TOKEN to a Hugging Face token that has been granted access to the model.

OSError: You are trying to access a gated repo.

# pass your HF_TOKEN
export HF_TOKEN=$your_personal_hf_token

Note

Throughput is calculated as:

  • \[throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time\]
  • \[throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time\]

The following run command is tailored to Llama 3.3 70B FP8. See Supported models to switch to another available model.

  1. Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.

    git clone https://github.com/ROCm/MAD
    cd MAD
    pip install -r requirements.txt
    
  2. On the host machine, use this command to run the performance benchmark test on the Llama 3.3 70B FP8 model using one node with the float8 data type.

    export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
    madengine run \
        --tags pyt_vllm_llama-3.3-70b_fp8 \
        --keep-model-dir \
        --live-output
    

MAD launches a Docker container named container_ci-pyt_vllm_llama-3.3-70b_fp8. The model's throughput and serving reports are written to pyt_vllm_llama-3.3-70b_fp8_throughput.csv and pyt_vllm_llama-3.3-70b_fp8_serving.csv, respectively.

Although the available models are preconfigured to collect offline throughput and online serving performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.

The following commands are optimized for Llama 3.3 70B FP8. See Supported models to switch to another available model.

See also

For more information on configuration, see the config files in the MAD repository. Refer to the vLLM engine for descriptions of available configuration options and Benchmarking vLLM for additional benchmarking information.

Launch the container

You can run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.

docker pull rocm/vllm:rocm7.0.0_vllm_0.10.2_20251006
docker run -it \
    --device=/dev/kfd \
    --device=/dev/dri \
    --group-add video \
    --shm-size 16G \
    --security-opt seccomp=unconfined \
    --security-opt apparmor=unconfined \
    --cap-add=SYS_PTRACE \
    -v $(pwd):/workspace \
    --env HUGGINGFACE_HUB_CACHE=/workspace \
    --name test \
    rocm/vllm:rocm7.0.0_vllm_0.10.2_20251006

Throughput command

Use the following command to start the throughput benchmark.

model=amd/Llama-3.3-70B-Instruct-FP8-KV
tp=8
num_prompts=1024
in=128
out=128
dtype=auto
kv_cache_dtype=fp8
max_num_seqs=1024
max_num_batched_tokens=131072
max_model_len=8192

vllm bench throughput --model $model \
    -tp $tp \
    --num-prompts $num_prompts \
    --input-len $in \
    --output-len $out \
    --dtype $dtype \
    --kv-cache-dtype $kv_cache_dtype \
    --max-num-seqs $max_num_seqs \
    --max-num-batched-tokens $max_num_batched_tokens \
    --max-model-len $max_model_len \
    --trust-remote-code \
    --output-json ${model}_throughput.json \
    --gpu-memory-utilization 0.9

Serving command

  1. Start the server using the following command:

    model=amd/Llama-3.3-70B-Instruct-FP8-KV
    tp=8
    dtype=auto
    kv_cache_dtype=fp8
    max_num_seqs=256
    max_num_batched_tokens=131072
    max_model_len=8192
    
    vllm serve $model \
        -tp $tp \
        --dtype $dtype \
        --kv-cache-dtype $kv_cache_dtype \
        --max-num-seqs $max_num_seqs \
        --max-num-batched-tokens $max_num_batched_tokens \
        --max-model-len $max_model_len \
        --no-enable-prefix-caching \
        --swap-space 16 \
        --disable-log-requests \
        --trust-remote-code \
        --gpu-memory-utilization 0.9
    

    Wait until the model has loaded and the server is ready to accept requests.

  2. In another terminal on the same machine, run the benchmark:

    # Connect to the container
    docker exec -it test bash
    
    # Wait for the server to start
    until curl -s http://localhost:8000/v1/models; do sleep 30; done
    
    # Run the benchmark
    model=amd/Llama-3.3-70B-Instruct-FP8-KV
    max_concurrency=1
    num_prompts=10
    in=128
    out=128
    vllm bench serve --model $model \
        --percentile-metrics "ttft,tpot,itl,e2el" \
        --dataset-name random \
        --ignore-eos \
        --max-concurrency $max_concurrency \
        --num-prompts $num_prompts \
        --random-input-len $in \
        --random-output-len $out \
        --trust-remote-code \
        --save-result \
        --result-filename ${model}_serving.json
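
The until loop in step 2 retries indefinitely, which can stall a script if the server never comes up. A bounded variant might look like the following sketch. The function name, timeout, and polling interval are illustrative choices; the URL in the usage comment is the same endpoint polled in step 2.

```shell
#!/bin/sh
# Hypothetical helper: poll a URL until it responds or a deadline passes.
# Usage: wait_for_server URL TIMEOUT_SECONDS INTERVAL_SECONDS
wait_for_server() {
    url=$1
    deadline=$(( $(date +%s) + $2 ))
    # -s: silent, -f: treat HTTP errors as failures, -o: discard the body
    until curl -sf -o /dev/null "$url"; do
        if [ "$(date +%s)" -ge "$deadline" ]; then
            echo "server at $url not ready after $2 s" >&2
            return 1
        fi
        sleep "$3"
    done
}

# Example (values are illustrative): wait up to 30 minutes, polling every 30 s.
# wait_for_server http://localhost:8000/v1/models 1800 30 && echo "server ready"
```

Large models can take a long time to load, so the timeout should be chosen generously.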
    

Note

For improved performance with certain Mixture of Experts models, such as Mixtral 8x22B, try running export VLLM_ROCM_USE_AITER=1 before your commands.

If you encounter the following error, the model is gated. Set HF_TOKEN to a Hugging Face token that has been granted access to the model.

OSError: You are trying to access a gated repo.

# pass your HF_TOKEN
export HF_TOKEN=$your_personal_hf_token

Note

Throughput is calculated as:

  • \[throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time\]
  • \[throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time\]

The following run command is tailored to Llama 3.3 70B MXFP4. See Supported models to switch to another available model.

  1. Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.

    git clone https://github.com/ROCm/MAD
    cd MAD
    pip install -r requirements.txt
    
  2. On the host machine, use this command to run the performance benchmark test on the Llama 3.3 70B MXFP4 model using one node with the float4 data type.

    export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
    madengine run \
        --tags pyt_vllm_llama-3.3-70b_fp4 \
        --keep-model-dir \
        --live-output
    

MAD launches a Docker container named container_ci-pyt_vllm_llama-3.3-70b_fp4. The model's throughput and serving reports are written to pyt_vllm_llama-3.3-70b_fp4_throughput.csv and pyt_vllm_llama-3.3-70b_fp4_serving.csv, respectively.

Although the available models are preconfigured to collect offline throughput and online serving performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.

The following commands are optimized for Llama 3.3 70B MXFP4. See Supported models to switch to another available model.

See also

For more information on configuration, see the config files in the MAD repository. Refer to the vLLM engine for descriptions of available configuration options and Benchmarking vLLM for additional benchmarking information.

Launch the container

You can run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.

docker pull rocm/vllm:rocm7.0.0_vllm_0.10.2_20251006
docker run -it \
    --device=/dev/kfd \
    --device=/dev/dri \
    --group-add video \
    --shm-size 16G \
    --security-opt seccomp=unconfined \
    --security-opt apparmor=unconfined \
    --cap-add=SYS_PTRACE \
    -v $(pwd):/workspace \
    --env HUGGINGFACE_HUB_CACHE=/workspace \
    --name test \
    rocm/vllm:rocm7.0.0_vllm_0.10.2_20251006

Throughput command

Use the following command to start the throughput benchmark.

model=amd/Llama-3.3-70B-Instruct-MXFP4-Preview
tp=8
num_prompts=1024
in=128
out=128
dtype=auto
kv_cache_dtype=fp8
max_num_seqs=1024
max_num_batched_tokens=131072
max_model_len=8192

vllm bench throughput --model $model \
    -tp $tp \
    --num-prompts $num_prompts \
    --input-len $in \
    --output-len $out \
    --dtype $dtype \
    --kv-cache-dtype $kv_cache_dtype \
    --max-num-seqs $max_num_seqs \
    --max-num-batched-tokens $max_num_batched_tokens \
    --max-model-len $max_model_len \
    --trust-remote-code \
    --output-json ${model}_throughput.json \
    --gpu-memory-utilization 0.9

Serving command

  1. Start the server using the following command:

    model=amd/Llama-3.3-70B-Instruct-MXFP4-Preview
    tp=8
    dtype=auto
    kv_cache_dtype=fp8
    max_num_seqs=256
    max_num_batched_tokens=131072
    max_model_len=8192
    
    vllm serve $model \
        -tp $tp \
        --dtype $dtype \
        --kv-cache-dtype $kv_cache_dtype \
        --max-num-seqs $max_num_seqs \
        --max-num-batched-tokens $max_num_batched_tokens \
        --max-model-len $max_model_len \
        --no-enable-prefix-caching \
        --swap-space 16 \
        --disable-log-requests \
        --trust-remote-code \
        --gpu-memory-utilization 0.9
    

    Wait until the model has loaded and the server is ready to accept requests.

  2. In another terminal on the same machine, run the benchmark:

    # Connect to the container
    docker exec -it test bash
    
    # Wait for the server to start
    until curl -s http://localhost:8000/v1/models; do sleep 30; done
    
    # Run the benchmark
    model=amd/Llama-3.3-70B-Instruct-MXFP4-Preview
    max_concurrency=1
    num_prompts=10
    in=128
    out=128
    vllm bench serve --model $model \
        --percentile-metrics "ttft,tpot,itl,e2el" \
        --dataset-name random \
        --ignore-eos \
        --max-concurrency $max_concurrency \
        --num-prompts $num_prompts \
        --random-input-len $in \
        --random-output-len $out \
        --trust-remote-code \
        --save-result \
        --result-filename ${model}_serving.json
    

Note

For improved performance with certain Mixture of Experts models, such as Mixtral 8x22B, try running export VLLM_ROCM_USE_AITER=1 before your commands.

If you encounter the following error, the model is gated. Set HF_TOKEN to a Hugging Face token that has been granted access to the model.

OSError: You are trying to access a gated repo.

# pass your HF_TOKEN
export HF_TOKEN=$your_personal_hf_token
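
A small guard can surface a missing token before a long model download starts. This sketch only checks that HF_TOKEN is non-empty; it does not verify that the token has been granted access to the gated repository. The function name require_hf_token is hypothetical.

```shell
#!/bin/sh
# Hypothetical guard: fail early when HF_TOKEN is unset or empty.
# It does not validate the token against Hugging Face.
require_hf_token() {
    if [ -z "${HF_TOKEN:-}" ]; then
        echo "HF_TOKEN is not set; gated models may fail to download" >&2
        return 1
    fi
}

# Example: run the guard before starting a benchmark.
if require_hf_token 2>/dev/null; then
    echo "HF_TOKEN is set"
else
    echo "HF_TOKEN is missing"
fi
```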

Note

Throughput is calculated as:

  • \[throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time\]
  • \[throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time\]

The following run command is tailored to Llama 4 Scout 17Bx16E. See Supported models to switch to another available model.

  1. Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.

    git clone https://github.com/ROCm/MAD
    cd MAD
    pip install -r requirements.txt
    
  2. On the host machine, use this command to run the performance benchmark test on the Llama 4 Scout 17Bx16E model using one node with the float16 data type.

    export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
    madengine run \
        --tags pyt_vllm_llama-4-scout-17b-16e \
        --keep-model-dir \
        --live-output
    

MAD launches a Docker container named container_ci-pyt_vllm_llama-4-scout-17b-16e. The model's throughput and serving reports are written to pyt_vllm_llama-4-scout-17b-16e_throughput.csv and pyt_vllm_llama-4-scout-17b-16e_serving.csv, respectively.

Although the available models are preconfigured to collect offline throughput and online serving performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.

The following commands are optimized for Llama 4 Scout 17Bx16E. See Supported models to switch to another available model.

See also

For more information on configuration, see the config files in the MAD repository. Refer to the vLLM engine for descriptions of available configuration options and Benchmarking vLLM for additional benchmarking information.

Launch the container

You can run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.

docker pull rocm/vllm:rocm7.0.0_vllm_0.10.2_20251006
docker run -it \
    --device=/dev/kfd \
    --device=/dev/dri \
    --group-add video \
    --shm-size 16G \
    --security-opt seccomp=unconfined \
    --security-opt apparmor=unconfined \
    --cap-add=SYS_PTRACE \
    -v $(pwd):/workspace \
    --env HUGGINGFACE_HUB_CACHE=/workspace \
    --name test \
    rocm/vllm:rocm7.0.0_vllm_0.10.2_20251006

Throughput command

Use the following command to start the throughput benchmark.

model=meta-llama/Llama-4-Scout-17B-16E-Instruct
tp=8
num_prompts=1024
in=128
out=128
dtype=auto
kv_cache_dtype=auto
max_num_seqs=1024
max_num_batched_tokens=32768
max_model_len=8192

vllm bench throughput --model $model \
    -tp $tp \
    --num-prompts $num_prompts \
    --input-len $in \
    --output-len $out \
    --dtype $dtype \
    --kv-cache-dtype $kv_cache_dtype \
    --max-num-seqs $max_num_seqs \
    --max-num-batched-tokens $max_num_batched_tokens \
    --max-model-len $max_model_len \
    --trust-remote-code \
    --output-json ${model}_throughput.json \
    --gpu-memory-utilization 0.9

Serving command

  1. Start the server using the following command:

    model=meta-llama/Llama-4-Scout-17B-16E-Instruct
    tp=8
    dtype=auto
    kv_cache_dtype=auto
    max_num_seqs=256
    max_num_batched_tokens=32768
    max_model_len=8192
    
    vllm serve $model \
        -tp $tp \
        --dtype $dtype \
        --kv-cache-dtype $kv_cache_dtype \
        --max-num-seqs $max_num_seqs \
        --max-num-batched-tokens $max_num_batched_tokens \
        --max-model-len $max_model_len \
        --no-enable-prefix-caching \
        --swap-space 16 \
        --disable-log-requests \
        --trust-remote-code \
        --gpu-memory-utilization 0.9
    

    Wait until the model has loaded and the server is ready to accept requests.

  2. In another terminal on the same machine, run the benchmark:

    # Connect to the container
    docker exec -it test bash
    
    # Wait for the server to start
    until curl -s http://localhost:8000/v1/models; do sleep 30; done
    
    # Run the benchmark
    model=meta-llama/Llama-4-Scout-17B-16E-Instruct
    max_concurrency=1
    num_prompts=10
    in=128
    out=128
    vllm bench serve --model $model \
        --percentile-metrics "ttft,tpot,itl,e2el" \
        --dataset-name random \
        --ignore-eos \
        --max-concurrency $max_concurrency \
        --num-prompts $num_prompts \
        --random-input-len $in \
        --random-output-len $out \
        --trust-remote-code \
        --save-result \
        --result-filename ${model}_serving.json
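
To see how latency changes with load, the benchmark in step 2 can be repeated at several concurrency levels. The sketch below only prints the commands it would run (a dry run) so that it can be inspected safely. The concurrency levels, the choice of 10 prompts per concurrent client, and the result file naming are illustrative assumptions, not recommended settings. Only flags already used in step 2 appear.

```shell
#!/bin/sh
# Dry-run sketch: print one `vllm bench serve` command per concurrency level.
# Remove the leading `echo` in sweep() to actually run the benchmarks.
model=meta-llama/Llama-4-Scout-17B-16E-Instruct
in_len=128
out_len=128

sweep() {
    for concurrency in "$@"; do
        # Assumption: scale the prompt count with concurrency (10 per client).
        num_prompts=$(( concurrency * 10 ))
        echo vllm bench serve --model "$model" \
            --dataset-name random \
            --ignore-eos \
            --max-concurrency "$concurrency" \
            --num-prompts "$num_prompts" \
            --random-input-len "$in_len" \
            --random-output-len "$out_len" \
            --save-result \
            --result-filename "serving_c${concurrency}.json"
    done
}

sweep 1 4 16
```

Writing each level to its own result file keeps the runs from overwriting one another.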
    

Note

For improved performance with certain Mixture of Experts models, such as Mixtral 8x22B, try running export VLLM_ROCM_USE_AITER=1 before your commands.

If you encounter the following error, the model is gated. Set HF_TOKEN to a Hugging Face token that has been granted access to the model.

OSError: You are trying to access a gated repo.

# pass your HF_TOKEN
export HF_TOKEN=$your_personal_hf_token

Note

Throughput is calculated as:

  • \[throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time\]
  • \[throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time\]

The following run command is tailored to Llama 4 Maverick 17Bx128E. See Supported models to switch to another available model.

  1. Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.

    git clone https://github.com/ROCm/MAD
    cd MAD
    pip install -r requirements.txt
    
  2. On the host machine, use this command to run the performance benchmark test on the Llama 4 Maverick 17Bx128E model using one node with the float16 data type.

    export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
    madengine run \
        --tags pyt_vllm_llama-4-maverick-17b-128e \
        --keep-model-dir \
        --live-output
    

MAD launches a Docker container named container_ci-pyt_vllm_llama-4-maverick-17b-128e. The model's throughput and serving reports are written to pyt_vllm_llama-4-maverick-17b-128e_throughput.csv and pyt_vllm_llama-4-maverick-17b-128e_serving.csv, respectively.

Although the available models are preconfigured to collect offline throughput and online serving performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.
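
The container and report names above follow a fixed pattern based on the tag passed to madengine run. As a sketch of that pattern (inferred from the names shown in this guide, not from MAD's source), a hypothetical helper could derive them as follows:

```shell
#!/bin/sh
# Hypothetical helper: derive the names this guide shows for a given MAD tag.
# The pattern is inferred from the examples in this guide.
mad_names() {
    tag=$1
    echo "container: container_ci-${tag}"
    echo "throughput report: ${tag}_throughput.csv"
    echo "serving report: ${tag}_serving.csv"
}

mad_names pyt_vllm_llama-4-maverick-17b-128e
```

This can be handy when a script needs to locate the CSV reports after a run without hard-coding each file name.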

The following commands are optimized for Llama 4 Maverick 17Bx128E. See Supported models to switch to another available model.

See also

For more information on configuration, see the config files in the MAD repository. Refer to the vLLM engine for descriptions of available configuration options and Benchmarking vLLM for additional benchmarking information.

Launch the container

You can run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.

docker pull rocm/vllm:rocm7.0.0_vllm_0.10.2_20251006
docker run -it \
    --device=/dev/kfd \
    --device=/dev/dri \
    --group-add video \
    --shm-size 16G \
    --security-opt seccomp=unconfined \
    --security-opt apparmor=unconfined \
    --cap-add=SYS_PTRACE \
    -v $(pwd):/workspace \
    --env HUGGINGFACE_HUB_CACHE=/workspace \
    --name test \
    rocm/vllm:rocm7.0.0_vllm_0.10.2_20251006

Throughput command

Use the following command to start the throughput benchmark.

model=meta-llama/Llama-4-Maverick-17B-128E-Instruct
tp=8
num_prompts=1024
in=128
out=128
dtype=auto
kv_cache_dtype=auto
max_num_seqs=1024
max_num_batched_tokens=32768
max_model_len=8192

vllm bench throughput --model $model \
    -tp $tp \
    --num-prompts $num_prompts \
    --input-len $in \
    --output-len $out \
    --dtype $dtype \
    --kv-cache-dtype $kv_cache_dtype \
    --max-num-seqs $max_num_seqs \
    --max-num-batched-tokens $max_num_batched_tokens \
    --max-model-len $max_model_len \
    --trust-remote-code \
    --output-json ${model}_throughput.json \
    --gpu-memory-utilization 0.9

Serving command

  1. Start the server using the following command:

    model=meta-llama/Llama-4-Maverick-17B-128E-Instruct
    tp=8
    dtype=auto
    kv_cache_dtype=auto
    max_num_seqs=256
    max_num_batched_tokens=32768
    max_model_len=8192
    
    vllm serve $model \
        -tp $tp \
        --dtype $dtype \
        --kv-cache-dtype $kv_cache_dtype \
        --max-num-seqs $max_num_seqs \
        --max-num-batched-tokens $max_num_batched_tokens \
        --max-model-len $max_model_len \
        --no-enable-prefix-caching \
        --swap-space 16 \
        --disable-log-requests \
        --trust-remote-code \
        --gpu-memory-utilization 0.9
    

    Wait until the model has loaded and the server is ready to accept requests.

  2. In another terminal on the same machine, run the benchmark:

    # Connect to the container
    docker exec -it test bash
    
    # Wait for the server to start
    until curl -s http://localhost:8000/v1/models; do sleep 30; done
    
    # Run the benchmark
    model=meta-llama/Llama-4-Maverick-17B-128E-Instruct
    max_concurrency=1
    num_prompts=10
    in=128
    out=128
    vllm bench serve --model $model \
        --percentile-metrics "ttft,tpot,itl,e2el" \
        --dataset-name random \
        --ignore-eos \
        --max-concurrency $max_concurrency \
        --num-prompts $num_prompts \
        --random-input-len $in \
        --random-output-len $out \
        --trust-remote-code \
        --save-result \
        --result-filename ${model}_serving.json
    

Note

For improved performance with certain Mixture of Experts models, such as Mixtral 8x22B, try running export VLLM_ROCM_USE_AITER=1 before your commands.

If you encounter the following error, the model is gated. Set HF_TOKEN to a Hugging Face token that has been granted access to the model.

OSError: You are trying to access a gated repo.

# pass your HF_TOKEN
export HF_TOKEN=$your_personal_hf_token

Note

Throughput is calculated as:

  • \[throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time\]
  • \[throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time\]

The following run command is tailored to Llama 4 Maverick 17Bx128E FP8. See Supported models to switch to another available model.

  1. Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.

    git clone https://github.com/ROCm/MAD
    cd MAD
    pip install -r requirements.txt
    
  2. On the host machine, use this command to run the performance benchmark test on the Llama 4 Maverick 17Bx128E FP8 model using one node with the float8 data type.

    export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
    madengine run \
        --tags pyt_vllm_llama-4-maverick-17b-128e_fp8 \
        --keep-model-dir \
        --live-output
    

MAD launches a Docker container with the name container_ci-pyt_vllm_llama-4-maverick-17b-128e_fp8. The throughput and serving reports of the model are collected in the following paths: pyt_vllm_llama-4-maverick-17b-128e_fp8_throughput.csv and pyt_vllm_llama-4-maverick-17b-128e_fp8_serving.csv.
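
The container and report names follow directly from the tag passed to madengine. The sketch below only restates the naming pattern described on this page; `mad_artifacts` is a hypothetical helper, not part of MAD.

```shell
# Print the container name and report file names that this page describes
# for a given MAD tag. The pattern comes from the text above only.
mad_artifacts() {
    printf 'container: container_ci-%s\n' "$1"
    printf 'throughput report: %s_throughput.csv\n' "$1"
    printf 'serving report: %s_serving.csv\n' "$1"
}

mad_artifacts pyt_vllm_llama-4-maverick-17b-128e_fp8
```

This can help when scripting several runs, for example to check that both CSV files exist once a run has finished.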

Although the available models are preconfigured to collect offline throughput and online serving performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.

The following commands are optimized for Llama 4 Maverick 17Bx128E FP8. See Supported models to switch to another available model.

See also

For more information on configuration, see the config files in the MAD repository. Refer to the vLLM engine for descriptions of available configuration options and Benchmarking vLLM for additional benchmarking information.

Launch the container

You can run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.

docker pull rocm/vllm:rocm7.0.0_vllm_0.10.2_20251006
docker run -it \
    --device=/dev/kfd \
    --device=/dev/dri \
    --group-add video \
    --shm-size 16G \
    --security-opt seccomp=unconfined \
    --security-opt apparmor=unconfined \
    --cap-add=SYS_PTRACE \
    -v $(pwd):/workspace \
    --env HUGGINGFACE_HUB_CACHE=/workspace \
    --name test \
    rocm/vllm:rocm7.0.0_vllm_0.10.2_20251006

Throughput command

Use the following command to start the throughput benchmark.

model=meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8
tp=8
num_prompts=1024
in=128
out=128
dtype=auto
kv_cache_dtype=fp8
max_num_seqs=1024
max_num_batched_tokens=131072
max_model_len=8192

vllm bench throughput --model $model \
    -tp $tp \
    --num-prompts $num_prompts \
    --input-len $in \
    --output-len $out \
    --dtype $dtype \
    --kv-cache-dtype $kv_cache_dtype \
    --max-num-seqs $max_num_seqs \
    --max-num-batched-tokens $max_num_batched_tokens \
    --max-model-len $max_model_len \
    --trust-remote-code \
    --output-json "${model##*/}_throughput.json" \
    --gpu-memory-utilization 0.9

Serving command

  1. Start the server using the following command:

    model=meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8
    tp=8
    dtype=auto
    kv_cache_dtype=fp8
    max_num_seqs=256
    max_num_batched_tokens=131072
    max_model_len=8192
    
    vllm serve $model \
        -tp $tp \
        --dtype $dtype \
        --kv-cache-dtype $kv_cache_dtype \
        --max-num-seqs $max_num_seqs \
        --max-num-batched-tokens $max_num_batched_tokens \
        --max-model-len $max_model_len \
        --no-enable-prefix-caching \
        --swap-space 16 \
        --disable-log-requests \
        --trust-remote-code \
        --gpu-memory-utilization 0.9
    

    Wait until the model has loaded and the server is ready to accept requests.

  2. On another terminal on the same machine, run the benchmark:

    # Connect to the container
    docker exec -it test bash
    
    # Wait for the server to start
    until curl -s http://localhost:8000/v1/models; do sleep 30; done
    
    # Run the benchmark
    model=meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8
    max_concurrency=1
    num_prompts=10
    in=128
    out=128
    vllm bench serve --model $model \
        --percentile-metrics "ttft,tpot,itl,e2el" \
        --dataset-name random \
        --ignore-eos \
        --max-concurrency $max_concurrency \
        --num-prompts $num_prompts \
        --random-input-len $in \
        --random-output-len $out \
        --trust-remote-code \
        --save-result \
        --result-filename "${model##*/}_serving.json"
    

Note

For improved performance with certain Mixture of Experts models, such as Mixtral 8x22B, try adding export VLLM_ROCM_USE_AITER=1 to your commands.

If you encounter the following error, set the HF_TOKEN environment variable to a Hugging Face token that has been granted access to the gated model, then rerun the command.

OSError: You are trying to access a gated repo.

# pass your HF_TOKEN
export HF_TOKEN=$your_personal_hf_token
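
To catch a missing token before a long model download starts, a small guard like the following may help. `require_hf_token` is a hypothetical helper name. It only checks that the variable is non-empty, not that the token is valid or authorized.

```shell
# Return success only if HF_TOKEN is set to a non-empty value.
require_hf_token() {
    if [ -z "${HF_TOKEN:-}" ]; then
        echo "HF_TOKEN is not set; gated models will fail to download." >&2
        return 1
    fi
}

# Example use: stop early instead of waiting for the OSError.
# require_hf_token && vllm serve "$model"
```

If the token is exported on the host rather than inside the container, Docker's `--env HF_TOKEN` option on `docker run` forwards the host value into the container.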

Note

Throughput is calculated as:

  • \[throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time\]
  • \[throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time\]

The following run command is tailored to DeepSeek R1 0528 FP8. See Supported models to switch to another available model.

  1. Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.

    git clone https://github.com/ROCm/MAD
    cd MAD
    pip install -r requirements.txt
    
  2. On the host machine, use this command to run the performance benchmark test on the DeepSeek R1 0528 FP8 model using one node with the float8 data type.

    export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
    madengine run \
        --tags pyt_vllm_deepseek-r1 \
        --keep-model-dir \
        --live-output
    

MAD launches a Docker container with the name container_ci-pyt_vllm_deepseek-r1. The throughput and serving reports of the model are collected in the following paths: pyt_vllm_deepseek-r1_throughput.csv and pyt_vllm_deepseek-r1_serving.csv.

Although the available models are preconfigured to collect offline throughput and online serving performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.

The following commands are optimized for DeepSeek R1 0528 FP8. See Supported models to switch to another available model.

See also

For more information on configuration, see the config files in the MAD repository. Refer to the vLLM engine for descriptions of available configuration options and Benchmarking vLLM for additional benchmarking information.

Launch the container

You can run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.

docker pull rocm/vllm:rocm7.0.0_vllm_0.10.2_20251006
docker run -it \
    --device=/dev/kfd \
    --device=/dev/dri \
    --group-add video \
    --shm-size 16G \
    --security-opt seccomp=unconfined \
    --security-opt apparmor=unconfined \
    --cap-add=SYS_PTRACE \
    -v $(pwd):/workspace \
    --env HUGGINGFACE_HUB_CACHE=/workspace \
    --name test \
    rocm/vllm:rocm7.0.0_vllm_0.10.2_20251006

Throughput command

Use the following command to start the throughput benchmark.

model=deepseek-ai/DeepSeek-R1-0528
tp=8
num_prompts=1024
in=128
out=128
dtype=auto
kv_cache_dtype=fp8
max_num_seqs=1024
max_num_batched_tokens=131072
max_model_len=8192

vllm bench throughput --model $model \
    -tp $tp \
    --num-prompts $num_prompts \
    --input-len $in \
    --output-len $out \
    --dtype $dtype \
    --kv-cache-dtype $kv_cache_dtype \
    --max-num-seqs $max_num_seqs \
    --max-num-batched-tokens $max_num_batched_tokens \
    --max-model-len $max_model_len \
    --trust-remote-code \
    --output-json "${model##*/}_throughput.json" \
    --gpu-memory-utilization 0.9

Serving command

  1. Start the server using the following command:

    model=deepseek-ai/DeepSeek-R1-0528
    tp=8
    dtype=auto
    kv_cache_dtype=fp8
    max_num_seqs=256
    max_num_batched_tokens=131072
    max_model_len=8192
    
    vllm serve $model \
        -tp $tp \
        --dtype $dtype \
        --kv-cache-dtype $kv_cache_dtype \
        --max-num-seqs $max_num_seqs \
        --max-num-batched-tokens $max_num_batched_tokens \
        --max-model-len $max_model_len \
        --no-enable-prefix-caching \
        --swap-space 16 \
        --disable-log-requests \
        --trust-remote-code \
        --gpu-memory-utilization 0.9
    

    Wait until the model has loaded and the server is ready to accept requests.

  2. On another terminal on the same machine, run the benchmark:

    # Connect to the container
    docker exec -it test bash
    
    # Wait for the server to start
    until curl -s http://localhost:8000/v1/models; do sleep 30; done
    
    # Run the benchmark
    model=deepseek-ai/DeepSeek-R1-0528
    max_concurrency=1
    num_prompts=10
    in=128
    out=128
    vllm bench serve --model $model \
        --percentile-metrics "ttft,tpot,itl,e2el" \
        --dataset-name random \
        --ignore-eos \
        --max-concurrency $max_concurrency \
        --num-prompts $num_prompts \
        --random-input-len $in \
        --random-output-len $out \
        --trust-remote-code \
        --save-result \
        --result-filename "${model##*/}_serving.json"
    

Note

For improved performance with certain Mixture of Experts models, such as Mixtral 8x22B, try adding export VLLM_ROCM_USE_AITER=1 to your commands.

If you encounter the following error, set the HF_TOKEN environment variable to a Hugging Face token that has been granted access to the gated model, then rerun the command.

OSError: You are trying to access a gated repo.

# pass your HF_TOKEN
export HF_TOKEN=$your_personal_hf_token

Note

Throughput is calculated as:

  • \[throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time\]
  • \[throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time\]

The following run command is tailored to GPT OSS 20B. See Supported models to switch to another available model.

  1. Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.

    git clone https://github.com/ROCm/MAD
    cd MAD
    pip install -r requirements.txt
    
  2. On the host machine, use this command to run the performance benchmark test on the GPT OSS 20B model using one node with the bfloat16 data type.

    export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
    madengine run \
        --tags pyt_vllm_gpt-oss-20b \
        --keep-model-dir \
        --live-output
    

MAD launches a Docker container with the name container_ci-pyt_vllm_gpt-oss-20b. The throughput and serving reports of the model are collected in the following paths: pyt_vllm_gpt-oss-20b_throughput.csv and pyt_vllm_gpt-oss-20b_serving.csv.

Although the available models are preconfigured to collect offline throughput and online serving performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.
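
One common parameter change is to sweep the serving benchmark over several concurrency levels. The following sketch reuses the `vllm bench serve` flags from the serving command on this page. The helper names, the result file naming scheme, and the choice of ten prompts per concurrent request are assumptions made for illustration.

```shell
# Build a per-run result file name from a model ID and a concurrency level.
# The naming scheme is illustrative; neither MAD nor vLLM prescribes it.
sweep_result_name() {
    printf '%s_c%s_serving.json' "${1##*/}" "$2"
}

# Run the serving benchmark once per concurrency level (hypothetical helper).
# Expects a vLLM server for the model in "$1" to be running already.
run_concurrency_sweep() {
    sweep_model="$1"; shift
    for c in "$@"; do
        vllm bench serve --model "$sweep_model" \
            --percentile-metrics "ttft,tpot,itl,e2el" \
            --dataset-name random \
            --ignore-eos \
            --max-concurrency "$c" \
            --num-prompts $((c * 10)) \
            --random-input-len 128 \
            --random-output-len 128 \
            --trust-remote-code \
            --save-result \
            --result-filename "$(sweep_result_name "$sweep_model" "$c")"
    done
}
```

For example, `run_concurrency_sweep openai/gpt-oss-20b 1 4 16` would write one JSON file per level, such as `gpt-oss-20b_c4_serving.json`.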

The following commands are optimized for GPT OSS 20B. See Supported models to switch to another available model.

See also

For more information on configuration, see the config files in the MAD repository. Refer to the vLLM engine for descriptions of available configuration options and Benchmarking vLLM for additional benchmarking information.

Launch the container

You can run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.

docker pull rocm/vllm:rocm7.0.0_vllm_0.10.2_20251006
docker run -it \
    --device=/dev/kfd \
    --device=/dev/dri \
    --group-add video \
    --shm-size 16G \
    --security-opt seccomp=unconfined \
    --security-opt apparmor=unconfined \
    --cap-add=SYS_PTRACE \
    -v $(pwd):/workspace \
    --env HUGGINGFACE_HUB_CACHE=/workspace \
    --name test \
    rocm/vllm:rocm7.0.0_vllm_0.10.2_20251006

Throughput command

Use the following command to start the throughput benchmark.

model=openai/gpt-oss-20b
tp=1
num_prompts=1024
in=128
out=128
dtype=auto
kv_cache_dtype=auto
max_num_seqs=1024
max_num_batched_tokens=8192
max_model_len=8192

vllm bench throughput --model $model \
    -tp $tp \
    --num-prompts $num_prompts \
    --input-len $in \
    --output-len $out \
    --dtype $dtype \
    --kv-cache-dtype $kv_cache_dtype \
    --max-num-seqs $max_num_seqs \
    --max-num-batched-tokens $max_num_batched_tokens \
    --max-model-len $max_model_len \
    --trust-remote-code \
    --output-json "${model##*/}_throughput.json" \
    --gpu-memory-utilization 0.9

Serving command

  1. Start the server using the following command:

    model=openai/gpt-oss-20b
    tp=1
    dtype=auto
    kv_cache_dtype=auto
    max_num_seqs=256
    max_num_batched_tokens=8192
    max_model_len=8192
    
    vllm serve $model \
        -tp $tp \
        --dtype $dtype \
        --kv-cache-dtype $kv_cache_dtype \
        --max-num-seqs $max_num_seqs \
        --max-num-batched-tokens $max_num_batched_tokens \
        --max-model-len $max_model_len \
        --no-enable-prefix-caching \
        --swap-space 16 \
        --disable-log-requests \
        --trust-remote-code \
        --gpu-memory-utilization 0.9
    

    Wait until the model has loaded and the server is ready to accept requests.

  2. On another terminal on the same machine, run the benchmark:

    # Connect to the container
    docker exec -it test bash
    
    # Wait for the server to start
    until curl -s http://localhost:8000/v1/models; do sleep 30; done
    
    # Run the benchmark
    model=openai/gpt-oss-20b
    max_concurrency=1
    num_prompts=10
    in=128
    out=128
    vllm bench serve --model $model \
        --percentile-metrics "ttft,tpot,itl,e2el" \
        --dataset-name random \
        --ignore-eos \
        --max-concurrency $max_concurrency \
        --num-prompts $num_prompts \
        --random-input-len $in \
        --random-output-len $out \
        --trust-remote-code \
        --save-result \
        --result-filename "${model##*/}_serving.json"
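
The wait loop in step 2 polls forever if the server never comes up, for example after a failed model load. A bounded variant might look like the sketch below. `wait_for_server` is a hypothetical helper, and the use of `curl -f` (treating HTTP error responses as "not ready") is a choice made here, not something the original loop does.

```shell
# Poll the server's /v1/models endpoint until it answers or attempts run out.
# Arguments: URL, maximum number of attempts, seconds between attempts.
wait_for_server() {
    url="$1"; tries="$2"; delay="$3"
    n=0
    while [ "$n" -lt "$tries" ]; do
        if curl -sf "$url" >/dev/null 2>&1; then
            return 0
        fi
        n=$((n + 1))
        sleep "$delay"
    done
    echo "server at $url not ready after $tries attempts" >&2
    return 1
}
```

For example, `wait_for_server http://localhost:8000/v1/models 60 30 || exit 1` gives up after roughly 30 minutes instead of waiting indefinitely.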
    

Note

For improved performance with certain Mixture of Experts models, such as Mixtral 8x22B, try adding export VLLM_ROCM_USE_AITER=1 to your commands.

If you encounter the following error, set the HF_TOKEN environment variable to a Hugging Face token that has been granted access to the gated model, then rerun the command.

OSError: You are trying to access a gated repo.

# pass your HF_TOKEN
export HF_TOKEN=$your_personal_hf_token

Note

Throughput is calculated as:

  • \[throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time\]
  • \[throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time\]

The following run command is tailored to GPT OSS 120B. See Supported models to switch to another available model.

  1. Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.

    git clone https://github.com/ROCm/MAD
    cd MAD
    pip install -r requirements.txt
    
  2. On the host machine, use this command to run the performance benchmark test on the GPT OSS 120B model using one node with the bfloat16 data type.

    export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
    madengine run \
        --tags pyt_vllm_gpt-oss-120b \
        --keep-model-dir \
        --live-output
    

MAD launches a Docker container with the name container_ci-pyt_vllm_gpt-oss-120b. The throughput and serving reports of the model are collected in the following paths: pyt_vllm_gpt-oss-120b_throughput.csv and pyt_vllm_gpt-oss-120b_serving.csv.

Although the available models are preconfigured to collect offline throughput and online serving performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.

The following commands are optimized for GPT OSS 120B. See Supported models to switch to another available model.

See also

For more information on configuration, see the config files in the MAD repository. Refer to the vLLM engine for descriptions of available configuration options and Benchmarking vLLM for additional benchmarking information.

Launch the container

You can run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.

docker pull rocm/vllm:rocm7.0.0_vllm_0.10.2_20251006
docker run -it \
    --device=/dev/kfd \
    --device=/dev/dri \
    --group-add video \
    --shm-size 16G \
    --security-opt seccomp=unconfined \
    --security-opt apparmor=unconfined \
    --cap-add=SYS_PTRACE \
    -v $(pwd):/workspace \
    --env HUGGINGFACE_HUB_CACHE=/workspace \
    --name test \
    rocm/vllm:rocm7.0.0_vllm_0.10.2_20251006

Throughput command

Use the following command to start the throughput benchmark.

model=openai/gpt-oss-120b
tp=8
num_prompts=1024
in=128
out=128
dtype=auto
kv_cache_dtype=auto
max_num_seqs=1024
max_num_batched_tokens=8192
max_model_len=8192

vllm bench throughput --model $model \
    -tp $tp \
    --num-prompts $num_prompts \
    --input-len $in \
    --output-len $out \
    --dtype $dtype \
    --kv-cache-dtype $kv_cache_dtype \
    --max-num-seqs $max_num_seqs \
    --max-num-batched-tokens $max_num_batched_tokens \
    --max-model-len $max_model_len \
    --trust-remote-code \
    --output-json "${model##*/}_throughput.json" \
    --gpu-memory-utilization 0.9
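
When changing `in` and `out`, keep in mind that each request's input and output tokens together need to fit within `max_model_len`. The following check is a minimal sketch of that constraint. `fits_context` is a hypothetical helper, and the values are the ones from the command above.

```shell
# Succeed only if a request of (input + output) tokens fits in max_model_len.
# Arguments: input length, output length, max_model_len.
fits_context() {
    [ $(( $1 + $2 )) -le "$3" ]
}

in=128
out=128
max_model_len=8192
if fits_context "$in" "$out" "$max_model_len"; then
    echo "ok: $((in + out)) tokens fit in max_model_len=$max_model_len"
else
    echo "too long: $((in + out)) tokens exceed max_model_len=$max_model_len" >&2
fi
```

Running the check before the benchmark avoids loading the model only to have the run rejected for an oversized sequence length.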

Serving command

  1. Start the server using the following command:

    model=openai/gpt-oss-120b
    tp=8
    dtype=auto
    kv_cache_dtype=auto
    max_num_seqs=256
    max_num_batched_tokens=8192
    max_model_len=8192
    
    vllm serve $model \
        -tp $tp \
        --dtype $dtype \
        --kv-cache-dtype $kv_cache_dtype \
        --max-num-seqs $max_num_seqs \
        --max-num-batched-tokens $max_num_batched_tokens \
        --max-model-len $max_model_len \
        --no-enable-prefix-caching \
        --swap-space 16 \
        --disable-log-requests \
        --trust-remote-code \
        --gpu-memory-utilization 0.9
    

    Wait until the model has loaded and the server is ready to accept requests.

  2. On another terminal on the same machine, run the benchmark:

    # Connect to the container
    docker exec -it test bash
    
    # Wait for the server to start
    until curl -s http://localhost:8000/v1/models; do sleep 30; done
    
    # Run the benchmark
    model=openai/gpt-oss-120b
    max_concurrency=1
    num_prompts=10
    in=128
    out=128
    vllm bench serve --model $model \
        --percentile-metrics "ttft,tpot,itl,e2el" \
        --dataset-name random \
        --ignore-eos \
        --max-concurrency $max_concurrency \
        --num-prompts $num_prompts \
        --random-input-len $in \
        --random-output-len $out \
        --trust-remote-code \
        --save-result \
        --result-filename "${model##*/}_serving.json"
    

Note

For improved performance with certain Mixture of Experts models, such as Mixtral 8x22B, try adding export VLLM_ROCM_USE_AITER=1 to your commands.

If you encounter the following error, set the HF_TOKEN environment variable to a Hugging Face token that has been granted access to the gated model, then rerun the command.

OSError: You are trying to access a gated repo.

# pass your HF_TOKEN
export HF_TOKEN=$your_personal_hf_token

Note

Throughput is calculated as:

  • \[throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time\]
  • \[throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time\]

The following run command is tailored to Mixtral MoE 8x7B. See Supported models to switch to another available model.

  1. Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.

    git clone https://github.com/ROCm/MAD
    cd MAD
    pip install -r requirements.txt
    
  2. On the host machine, use this command to run the performance benchmark test on the Mixtral MoE 8x7B model using one node with the float16 data type.

    export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
    madengine run \
        --tags pyt_vllm_mixtral-8x7b \
        --keep-model-dir \
        --live-output
    

MAD launches a Docker container with the name container_ci-pyt_vllm_mixtral-8x7b. The throughput and serving reports of the model are collected in the following paths: pyt_vllm_mixtral-8x7b_throughput.csv and pyt_vllm_mixtral-8x7b_serving.csv.

Although the available models are preconfigured to collect offline throughput and online serving performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.

The following commands are optimized for Mixtral MoE 8x7B. See Supported models to switch to another available model.

See also

For more information on configuration, see the config files in the MAD repository. Refer to the vLLM engine for descriptions of available configuration options and Benchmarking vLLM for additional benchmarking information.

Launch the container

You can run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.

docker pull rocm/vllm:rocm7.0.0_vllm_0.10.2_20251006
docker run -it \
    --device=/dev/kfd \
    --device=/dev/dri \
    --group-add video \
    --shm-size 16G \
    --security-opt seccomp=unconfined \
    --security-opt apparmor=unconfined \
    --cap-add=SYS_PTRACE \
    -v $(pwd):/workspace \
    --env HUGGINGFACE_HUB_CACHE=/workspace \
    --name test \
    rocm/vllm:rocm7.0.0_vllm_0.10.2_20251006

Throughput command

Use the following command to start the throughput benchmark.

model=mistralai/Mixtral-8x7B-Instruct-v0.1
tp=8
num_prompts=1024
in=128
out=128
dtype=auto
kv_cache_dtype=auto
max_num_seqs=1024
max_num_batched_tokens=32768
max_model_len=8192

vllm bench throughput --model $model \
    -tp $tp \
    --num-prompts $num_prompts \
    --input-len $in \
    --output-len $out \
    --dtype $dtype \
    --kv-cache-dtype $kv_cache_dtype \
    --max-num-seqs $max_num_seqs \
    --max-num-batched-tokens $max_num_batched_tokens \
    --max-model-len $max_model_len \
    --trust-remote-code \
    --output-json "${model##*/}_throughput.json" \
    --gpu-memory-utilization 0.9

Serving command

  1. Start the server using the following command:

    model=mistralai/Mixtral-8x7B-Instruct-v0.1
    tp=8
    dtype=auto
    kv_cache_dtype=auto
    max_num_seqs=256
    max_num_batched_tokens=32768
    max_model_len=8192
    
    vllm serve $model \
        -tp $tp \
        --dtype $dtype \
        --kv-cache-dtype $kv_cache_dtype \
        --max-num-seqs $max_num_seqs \
        --max-num-batched-tokens $max_num_batched_tokens \
        --max-model-len $max_model_len \
        --no-enable-prefix-caching \
        --swap-space 16 \
        --disable-log-requests \
        --trust-remote-code \
        --gpu-memory-utilization 0.9
    

    Wait until the model has loaded and the server is ready to accept requests.

  2. On another terminal on the same machine, run the benchmark:

    # Connect to the container
    docker exec -it test bash
    
    # Wait for the server to start
    until curl -s http://localhost:8000/v1/models; do sleep 30; done
    
    # Run the benchmark
    model=mistralai/Mixtral-8x7B-Instruct-v0.1
    max_concurrency=1
    num_prompts=10
    in=128
    out=128
    vllm bench serve --model $model \
        --percentile-metrics "ttft,tpot,itl,e2el" \
        --dataset-name random \
        --ignore-eos \
        --max-concurrency $max_concurrency \
        --num-prompts $num_prompts \
        --random-input-len $in \
        --random-output-len $out \
        --trust-remote-code \
        --save-result \
        --result-filename "${model##*/}_serving.json"
    

Note

For improved performance with certain Mixture of Experts models, such as Mixtral 8x22B, try adding export VLLM_ROCM_USE_AITER=1 to your commands.

If you encounter the following error, set the HF_TOKEN environment variable to a Hugging Face token that has been granted access to the gated model, then rerun the command.

OSError: You are trying to access a gated repo.

# pass your HF_TOKEN
export HF_TOKEN=$your_personal_hf_token
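
The AITER suggestion in the note above uses `export`, which affects every later command in the shell. To compare runs with and without it, the variable can instead be set for a single command. `with_aiter` below is a hypothetical wrapper that does this. Whether AITER helps a given model is something to measure, not assume.

```shell
# Run one command with VLLM_ROCM_USE_AITER=1 set only for that command,
# leaving the rest of the shell session unchanged.
with_aiter() {
    VLLM_ROCM_USE_AITER=1 "$@"
}
```

For example, `with_aiter vllm serve "$model" -tp "$tp"` starts the server with AITER enabled, while a plain `vllm serve` in the same shell runs without it.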

Note

Throughput is calculated as:

  • \[throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time\]
  • \[throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time\]

The following run command is tailored to Mixtral MoE 8x7B FP8. See Supported models to switch to another available model.

  1. Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.

    git clone https://github.com/ROCm/MAD
    cd MAD
    pip install -r requirements.txt
    
  2. On the host machine, use this command to run the performance benchmark test on the Mixtral MoE 8x7B FP8 model using one node with the float8 data type.

    export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
    madengine run \
        --tags pyt_vllm_mixtral-8x7b_fp8 \
        --keep-model-dir \
        --live-output
    

MAD launches a Docker container with the name container_ci-pyt_vllm_mixtral-8x7b_fp8. The throughput and serving reports of the model are collected in the following paths: pyt_vllm_mixtral-8x7b_fp8_throughput.csv and pyt_vllm_mixtral-8x7b_fp8_serving.csv.

Although the available models are preconfigured to collect offline throughput and online serving performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.

The following commands are optimized for Mixtral MoE 8x7B FP8. See Supported models to switch to another available model.

See also

For more information on configuration, see the config files in the MAD repository. Refer to the vLLM engine for descriptions of available configuration options and Benchmarking vLLM for additional benchmarking information.

Launch the container

You can run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.

docker pull rocm/vllm:rocm7.0.0_vllm_0.10.2_20251006
docker run -it \
    --device=/dev/kfd \
    --device=/dev/dri \
    --group-add video \
    --shm-size 16G \
    --security-opt seccomp=unconfined \
    --security-opt apparmor=unconfined \
    --cap-add=SYS_PTRACE \
    -v $(pwd):/workspace \
    --env HUGGINGFACE_HUB_CACHE=/workspace \
    --name test \
    rocm/vllm:rocm7.0.0_vllm_0.10.2_20251006

Throughput command

Use the following command to start the throughput benchmark.

model=amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV
tp=8
num_prompts=1024
in=128
out=128
dtype=auto
kv_cache_dtype=fp8
max_num_seqs=1024
max_num_batched_tokens=32768
max_model_len=8192

vllm bench throughput --model $model \
    -tp $tp \
    --num-prompts $num_prompts \
    --input-len $in \
    --output-len $out \
    --dtype $dtype \
    --kv-cache-dtype $kv_cache_dtype \
    --max-num-seqs $max_num_seqs \
    --max-num-batched-tokens $max_num_batched_tokens \
    --max-model-len $max_model_len \
    --trust-remote-code \
    --output-json "${model##*/}_throughput.json" \
    --gpu-memory-utilization 0.9

Serving command

  1. Start the server using the following command:

    model=amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV
    tp=8
    dtype=auto
    kv_cache_dtype=fp8
    max_num_seqs=256
    max_num_batched_tokens=32768
    max_model_len=8192
    
    vllm serve $model \
        -tp $tp \
        --dtype $dtype \
        --kv-cache-dtype $kv_cache_dtype \
        --max-num-seqs $max_num_seqs \
        --max-num-batched-tokens $max_num_batched_tokens \
        --max-model-len $max_model_len \
        --no-enable-prefix-caching \
        --swap-space 16 \
        --disable-log-requests \
        --trust-remote-code \
        --gpu-memory-utilization 0.9
    

    Wait until the model has loaded and the server is ready to accept requests.

  2. On another terminal on the same machine, run the benchmark:

    # Connect to the container
    docker exec -it test bash
    
    # Wait for the server to start
    until curl -s http://localhost:8000/v1/models; do sleep 30; done
    
    # Run the benchmark
    model=amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV
    max_concurrency=1
    num_prompts=10
    in=128
    out=128
    vllm bench serve --model $model \
        --percentile-metrics "ttft,tpot,itl,e2el" \
        --dataset-name random \
        --ignore-eos \
        --max-concurrency $max_concurrency \
        --num-prompts $num_prompts \
        --random-input-len $in \
        --random-output-len $out \
        --trust-remote-code \
        --save-result \
        --result-filename "${model##*/}_serving.json"
    

Note

For improved performance with certain Mixture of Experts models, such as Mixtral 8x22B, try adding export VLLM_ROCM_USE_AITER=1 to your commands.

If you encounter the following error, set the HF_TOKEN environment variable to a Hugging Face token that has been granted access to the gated model, then rerun the command.

OSError: You are trying to access a gated repo.

# pass your HF_TOKEN
export HF_TOKEN=$your_personal_hf_token

Note

Throughput is calculated as:

  • \[throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time\]
  • \[throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time\]

The following run command is tailored to Mixtral MoE 8x22B. See Supported models to switch to another available model.

  1. Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.

    git clone https://github.com/ROCm/MAD
    cd MAD
    pip install -r requirements.txt
    
  2. On the host machine, use this command to run the performance benchmark test on the Mixtral MoE 8x22B model using one node with the float16 data type.

    export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
    madengine run \
        --tags pyt_vllm_mixtral-8x22b \
        --keep-model-dir \
        --live-output
    

MAD launches a Docker container with the name container_ci-pyt_vllm_mixtral-8x22b. The throughput and serving reports of the model are collected in the following paths: pyt_vllm_mixtral-8x22b_throughput.csv and pyt_vllm_mixtral-8x22b_serving.csv.

Although the available models are preconfigured to collect offline throughput and online serving performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.

The following commands are optimized for Mixtral MoE 8x22B. See Supported models to switch to another available model.

See also

For more information on configuration, see the config files in the MAD repository. Refer to the vLLM engine for descriptions of available configuration options and Benchmarking vLLM for additional benchmarking information.

Launch the container

You can run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.

docker pull rocm/vllm:rocm7.0.0_vllm_0.10.2_20251006
docker run -it \
    --device=/dev/kfd \
    --device=/dev/dri \
    --group-add video \
    --shm-size 16G \
    --security-opt seccomp=unconfined \
    --security-opt apparmor=unconfined \
    --cap-add=SYS_PTRACE \
    -v $(pwd):/workspace \
    --env HUGGINGFACE_HUB_CACHE=/workspace \
    --name test \
    rocm/vllm:rocm7.0.0_vllm_0.10.2_20251006

Throughput command

Use the following command to start the throughput benchmark.

model=mistralai/Mixtral-8x22B-Instruct-v0.1
tp=8
num_prompts=1024
in=128
out=128
dtype=auto
kv_cache_dtype=auto
max_num_seqs=1024
max_num_batched_tokens=65536
max_model_len=8192

vllm bench throughput --model $model \
    -tp $tp \
    --num-prompts $num_prompts \
    --input-len $in \
    --output-len $out \
    --dtype $dtype \
    --kv-cache-dtype $kv_cache_dtype \
    --max-num-seqs $max_num_seqs \
    --max-num-batched-tokens $max_num_batched_tokens \
    --max-model-len $max_model_len \
    --trust-remote-code \
    --output-json ${model}_throughput.json \
    --gpu-memory-utilization 0.9
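
Because `$model` contains a slash, `${model}_throughput.json` expands to a path inside a `mistralai/` subdirectory of the current directory. Whether the benchmark tool creates that directory itself is not stated here, so the sketch below shows two ways to make the output path unambiguous using only standard shell features. `result_path` and `flat_result_path` are hypothetical helper names, not part of the vLLM CLI.

```shell
# Hypothetical helpers; plain POSIX shell, no vLLM-specific behavior assumed.

# Keep the organization subdirectory, creating it if it is missing.
result_path() {
    out="${1}_${2}.json"
    mkdir -p "$(dirname "$out")"
    printf '%s\n' "$out"
}

# Drop everything up to the last slash so the file lands in the
# current directory.
flat_result_path() {
    printf '%s\n' "${1##*/}_${2}.json"
}
```

Either helper can supply the argument to `--output-json`, for example `--output-json "$(result_path "$model" throughput)"`.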

Serving command

  1. Start the server using the following command:

    model=mistralai/Mixtral-8x22B-Instruct-v0.1
    tp=8
    dtype=auto
    kv_cache_dtype=auto
    max_num_seqs=256
    max_num_batched_tokens=65536
    max_model_len=8192
    
    vllm serve $model \
        -tp $tp \
        --dtype $dtype \
        --kv-cache-dtype $kv_cache_dtype \
        --max-num-seqs $max_num_seqs \
        --max-num-batched-tokens $max_num_batched_tokens \
        --max-model-len $max_model_len \
        --no-enable-prefix-caching \
        --swap-space 16 \
        --disable-log-requests \
        --trust-remote-code \
        --gpu-memory-utilization 0.9
    

    Wait until the model has loaded and the server is ready to accept requests.

  2. In another terminal on the same machine, run the benchmark:

    # Connect to the container
    docker exec -it test bash
    
    # Wait for the server to start
    until curl -s http://localhost:8000/v1/models; do sleep 30; done
    
    # Run the benchmark
    model=mistralai/Mixtral-8x22B-Instruct-v0.1
    max_concurrency=1
    num_prompts=10
    in=128
    out=128
    vllm bench serve --model $model \
        --percentile-metrics "ttft,tpot,itl,e2el" \
        --dataset-name random \
        --ignore-eos \
        --max-concurrency $max_concurrency \
        --num-prompts $num_prompts \
        --random-input-len $in \
        --random-output-len $out \
        --trust-remote-code \
        --save-result \
        --result-filename ${model}_serving.json
    

Note

For improved performance with certain Mixture of Experts models, such as Mixtral 8x22B, try adding export VLLM_ROCM_USE_AITER=1 to your commands.

If you encounter the following error, the model is gated. Export a Hugging Face token that has been granted access to it, then rerun the command.

OSError: You are trying to access a gated repo.

# Replace the placeholder with your own Hugging Face access token
export HF_TOKEN="your personal Hugging Face token"
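
An empty or unset `HF_TOKEN` may only surface as the gated-repo error once the download starts. A small guard can catch it earlier. This is a sketch that assumes the token is supplied through the environment variable rather than a cached login. `require_hf_token` is a hypothetical helper name.

```shell
# Hypothetical pre-flight check: fail early if HF_TOKEN is empty or unset.
require_hf_token() {
    if [ -z "${HF_TOKEN:-}" ]; then
        echo "HF_TOKEN is not set; gated models cannot be downloaded" >&2
        return 1
    fi
}
```

Chaining it in front of a benchmark command, as in `require_hf_token && vllm serve "$model"`, stops the run before any model loading begins.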

Note

Throughput is calculated as:

  • \[throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time\]
  • \[throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time\]
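
As a worked example of the two formulas, the sketch below plugs in the throughput benchmark's settings (1024 requests, 128 input tokens, 128 output tokens). The 64-second elapsed time is an invented value chosen to keep the arithmetic simple, not a measured result. `throughput` is a hypothetical helper name. The sketch assumes lengths are counted in tokens and elapsed time in seconds.

```shell
# Worked example of the two throughput formulas.
# Arguments: requests, input length, output length, elapsed time (seconds).
throughput() {
    awk -v n="$1" -v in_len="$2" -v out_len="$3" -v t="$4" 'BEGIN {
        printf "throughput_tot=%.1f tokens/s\n", n * (in_len + out_len) / t
        printf "throughput_gen=%.1f tokens/s\n", n * out_len / t
    }'
}
```

Calling `throughput 1024 128 128 64` prints `throughput_tot=4096.0 tokens/s` and `throughput_gen=2048.0 tokens/s`.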

The following run command is tailored to Mixtral MoE 8x22B FP8. See Supported models to switch to another available model.

  1. Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.

    git clone https://github.com/ROCm/MAD
    cd MAD
    pip install -r requirements.txt
    
  2. On the host machine, use this command to run the performance benchmark test on the Mixtral MoE 8x22B FP8 model using one node with the float8 data type.

    export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
    madengine run \
        --tags pyt_vllm_mixtral-8x22b_fp8 \
        --keep-model-dir \
        --live-output
    

MAD launches a Docker container with the name container_ci-pyt_vllm_mixtral-8x22b_fp8. The throughput and serving reports of the model are collected in the following paths: pyt_vllm_mixtral-8x22b_fp8_throughput.csv and pyt_vllm_mixtral-8x22b_fp8_serving.csv.

Although the available models are preconfigured to collect offline throughput and online serving performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.

The following commands are optimized for Mixtral MoE 8x22B FP8. See Supported models to switch to another available model.

See also

For more information on configuration, see the config files in the MAD repository. Refer to the vLLM engine for descriptions of available configuration options and Benchmarking vLLM for additional benchmarking information.

Launch the container

You can run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.

docker pull rocm/vllm:rocm7.0.0_vllm_0.10.2_20251006
docker run -it \
    --device=/dev/kfd \
    --device=/dev/dri \
    --group-add video \
    --shm-size 16G \
    --security-opt seccomp=unconfined \
    --security-opt apparmor=unconfined \
    --cap-add=SYS_PTRACE \
    -v $(pwd):/workspace \
    --env HUGGINGFACE_HUB_CACHE=/workspace \
    --name test \
    rocm/vllm:rocm7.0.0_vllm_0.10.2_20251006

Throughput command

Use the following command to start the throughput benchmark.

model=amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV
tp=8
num_prompts=1024
in=128
out=128
dtype=auto
kv_cache_dtype=fp8
max_num_seqs=1024
max_num_batched_tokens=65536
max_model_len=8192

vllm bench throughput --model $model \
    -tp $tp \
    --num-prompts $num_prompts \
    --input-len $in \
    --output-len $out \
    --dtype $dtype \
    --kv-cache-dtype $kv_cache_dtype \
    --max-num-seqs $max_num_seqs \
    --max-num-batched-tokens $max_num_batched_tokens \
    --max-model-len $max_model_len \
    --trust-remote-code \
    --output-json ${model}_throughput.json \
    --gpu-memory-utilization 0.9

Serving command

  1. Start the server using the following command:

    model=amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV
    tp=8
    dtype=auto
    kv_cache_dtype=fp8
    max_num_seqs=256
    max_num_batched_tokens=65536
    max_model_len=8192
    
    vllm serve $model \
        -tp $tp \
        --dtype $dtype \
        --kv-cache-dtype $kv_cache_dtype \
        --max-num-seqs $max_num_seqs \
        --max-num-batched-tokens $max_num_batched_tokens \
        --max-model-len $max_model_len \
        --no-enable-prefix-caching \
        --swap-space 16 \
        --disable-log-requests \
        --trust-remote-code \
        --gpu-memory-utilization 0.9
    

    Wait until the model has loaded and the server is ready to accept requests.

  2. In another terminal on the same machine, run the benchmark:

    # Connect to the container
    docker exec -it test bash
    
    # Wait for the server to start
    until curl -s http://localhost:8000/v1/models; do sleep 30; done
    
    # Run the benchmark
    model=amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV
    max_concurrency=1
    num_prompts=10
    in=128
    out=128
    vllm bench serve --model $model \
        --percentile-metrics "ttft,tpot,itl,e2el" \
        --dataset-name random \
        --ignore-eos \
        --max-concurrency $max_concurrency \
        --num-prompts $num_prompts \
        --random-input-len $in \
        --random-output-len $out \
        --trust-remote-code \
        --save-result \
        --result-filename ${model}_serving.json
    

Note

For improved performance with certain Mixture of Experts models, such as Mixtral 8x22B, try adding export VLLM_ROCM_USE_AITER=1 to your commands.

If you encounter the following error, the model is gated. Export a Hugging Face token that has been granted access to it, then rerun the command.

OSError: You are trying to access a gated repo.

# Replace the placeholder with your own Hugging Face access token
export HF_TOKEN="your personal Hugging Face token"

Note

Throughput is calculated as:

  • \[throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time\]
  • \[throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time\]

The following run command is tailored to Qwen3 8B. See Supported models to switch to another available model.

  1. Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.

    git clone https://github.com/ROCm/MAD
    cd MAD
    pip install -r requirements.txt
    
  2. On the host machine, use this command to run the performance benchmark test on the Qwen3 8B model using one node with the float16 data type.

    export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
    madengine run \
        --tags pyt_vllm_qwen3-8b \
        --keep-model-dir \
        --live-output
    

MAD launches a Docker container with the name container_ci-pyt_vllm_qwen3-8b. The throughput and serving reports of the model are collected in the following paths: pyt_vllm_qwen3-8b_throughput.csv and pyt_vllm_qwen3-8b_serving.csv.

Although the available models are preconfigured to collect offline throughput and online serving performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.

The following commands are optimized for Qwen3 8B. See Supported models to switch to another available model.

See also

For more information on configuration, see the config files in the MAD repository. Refer to the vLLM engine for descriptions of available configuration options and Benchmarking vLLM for additional benchmarking information.

Launch the container

You can run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.

docker pull rocm/vllm:rocm7.0.0_vllm_0.10.2_20251006
docker run -it \
    --device=/dev/kfd \
    --device=/dev/dri \
    --group-add video \
    --shm-size 16G \
    --security-opt seccomp=unconfined \
    --security-opt apparmor=unconfined \
    --cap-add=SYS_PTRACE \
    -v $(pwd):/workspace \
    --env HUGGINGFACE_HUB_CACHE=/workspace \
    --name test \
    rocm/vllm:rocm7.0.0_vllm_0.10.2_20251006

Throughput command

Use the following command to start the throughput benchmark.

model=Qwen/Qwen3-8B
tp=1
num_prompts=1024
in=128
out=128
dtype=auto
kv_cache_dtype=auto
max_num_seqs=1024
max_num_batched_tokens=40960
max_model_len=8192

vllm bench throughput --model $model \
    -tp $tp \
    --num-prompts $num_prompts \
    --input-len $in \
    --output-len $out \
    --dtype $dtype \
    --kv-cache-dtype $kv_cache_dtype \
    --max-num-seqs $max_num_seqs \
    --max-num-batched-tokens $max_num_batched_tokens \
    --max-model-len $max_model_len \
    --trust-remote-code \
    --output-json ${model}_throughput.json \
    --gpu-memory-utilization 0.9

Serving command

  1. Start the server using the following command:

    model=Qwen/Qwen3-8B
    tp=1
    dtype=auto
    kv_cache_dtype=auto
    max_num_seqs=256
    max_num_batched_tokens=40960
    max_model_len=8192
    
    vllm serve $model \
        -tp $tp \
        --dtype $dtype \
        --kv-cache-dtype $kv_cache_dtype \
        --max-num-seqs $max_num_seqs \
        --max-num-batched-tokens $max_num_batched_tokens \
        --max-model-len $max_model_len \
        --no-enable-prefix-caching \
        --swap-space 16 \
        --disable-log-requests \
        --trust-remote-code \
        --gpu-memory-utilization 0.9
    

    Wait until the model has loaded and the server is ready to accept requests.

  2. In another terminal on the same machine, run the benchmark:

    # Connect to the container
    docker exec -it test bash
    
    # Wait for the server to start
    until curl -s http://localhost:8000/v1/models; do sleep 30; done
    
    # Run the benchmark
    model=Qwen/Qwen3-8B
    max_concurrency=1
    num_prompts=10
    in=128
    out=128
    vllm bench serve --model $model \
        --percentile-metrics "ttft,tpot,itl,e2el" \
        --dataset-name random \
        --ignore-eos \
        --max-concurrency $max_concurrency \
        --num-prompts $num_prompts \
        --random-input-len $in \
        --random-output-len $out \
        --trust-remote-code \
        --save-result \
        --result-filename ${model}_serving.json
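
The `until curl ...` loop above retries indefinitely, so a server that failed to start leaves the terminal waiting. The following sketch adds a deadline. `wait_until` is a hypothetical helper name, and the timeout and interval values in the usage note are assumptions to adjust for the model's load time.

```shell
# Hypothetical helper: retry a command until it succeeds or a deadline passes.
# Usage: wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
wait_until() {
    timeout=$1
    interval=$2
    shift 2
    waited=0
    until "$@"; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "gave up after ${waited}s: $*" >&2
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
}
```

For example, `wait_until 1800 30 curl -s -o /dev/null http://localhost:8000/v1/models` polls the same endpoint every 30 seconds and gives up after 30 minutes.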
    

Note

For improved performance with certain Mixture of Experts models, such as Mixtral 8x22B, try adding export VLLM_ROCM_USE_AITER=1 to your commands.

If you encounter the following error, the model is gated. Export a Hugging Face token that has been granted access to it, then rerun the command.

OSError: You are trying to access a gated repo.

# Replace the placeholder with your own Hugging Face access token
export HF_TOKEN="your personal Hugging Face token"

Note

Throughput is calculated as:

  • \[throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time\]
  • \[throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time\]

The following run command is tailored to Qwen3 32B. See Supported models to switch to another available model.

  1. Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.

    git clone https://github.com/ROCm/MAD
    cd MAD
    pip install -r requirements.txt
    
  2. On the host machine, use this command to run the performance benchmark test on the Qwen3 32B model using one node with the float16 data type.

    export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
    madengine run \
        --tags pyt_vllm_qwen3-32b \
        --keep-model-dir \
        --live-output
    

MAD launches a Docker container with the name container_ci-pyt_vllm_qwen3-32b. The throughput and serving reports of the model are collected in the following paths: pyt_vllm_qwen3-32b_throughput.csv and pyt_vllm_qwen3-32b_serving.csv.

Although the available models are preconfigured to collect offline throughput and online serving performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.

The following commands are optimized for Qwen3 32B. See Supported models to switch to another available model.

See also

For more information on configuration, see the config files in the MAD repository. Refer to the vLLM engine for descriptions of available configuration options and Benchmarking vLLM for additional benchmarking information.

Launch the container

You can run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.

docker pull rocm/vllm:rocm7.0.0_vllm_0.10.2_20251006
docker run -it \
    --device=/dev/kfd \
    --device=/dev/dri \
    --group-add video \
    --shm-size 16G \
    --security-opt seccomp=unconfined \
    --security-opt apparmor=unconfined \
    --cap-add=SYS_PTRACE \
    -v $(pwd):/workspace \
    --env HUGGINGFACE_HUB_CACHE=/workspace \
    --name test \
    rocm/vllm:rocm7.0.0_vllm_0.10.2_20251006

Throughput command

Use the following command to start the throughput benchmark.

model=Qwen/Qwen3-32b
tp=1
num_prompts=1024
in=128
out=128
dtype=auto
kv_cache_dtype=auto
max_num_seqs=1024
max_num_batched_tokens=40960
max_model_len=8192

vllm bench throughput --model $model \
    -tp $tp \
    --num-prompts $num_prompts \
    --input-len $in \
    --output-len $out \
    --dtype $dtype \
    --kv-cache-dtype $kv_cache_dtype \
    --max-num-seqs $max_num_seqs \
    --max-num-batched-tokens $max_num_batched_tokens \
    --max-model-len $max_model_len \
    --trust-remote-code \
    --output-json ${model}_throughput.json \
    --gpu-memory-utilization 0.9

Serving command

  1. Start the server using the following command:

    model=Qwen/Qwen3-32b
    tp=1
    dtype=auto
    kv_cache_dtype=auto
    max_num_seqs=256
    max_num_batched_tokens=40960
    max_model_len=8192
    
    vllm serve $model \
        -tp $tp \
        --dtype $dtype \
        --kv-cache-dtype $kv_cache_dtype \
        --max-num-seqs $max_num_seqs \
        --max-num-batched-tokens $max_num_batched_tokens \
        --max-model-len $max_model_len \
        --no-enable-prefix-caching \
        --swap-space 16 \
        --disable-log-requests \
        --trust-remote-code \
        --gpu-memory-utilization 0.9
    

    Wait until the model has loaded and the server is ready to accept requests.

  2. In another terminal on the same machine, run the benchmark:

    # Connect to the container
    docker exec -it test bash
    
    # Wait for the server to start
    until curl -s http://localhost:8000/v1/models; do sleep 30; done
    
    # Run the benchmark
    model=Qwen/Qwen3-32b
    max_concurrency=1
    num_prompts=10
    in=128
    out=128
    vllm bench serve --model $model \
        --percentile-metrics "ttft,tpot,itl,e2el" \
        --dataset-name random \
        --ignore-eos \
        --max-concurrency $max_concurrency \
        --num-prompts $num_prompts \
        --random-input-len $in \
        --random-output-len $out \
        --trust-remote-code \
        --save-result \
        --result-filename ${model}_serving.json
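
The serving benchmark above measures a single concurrency level. To see how latency changes under load, the same command can be repeated with different `--max-concurrency` values. The sketch below only prints the commands so they can be reviewed first. `print_sweep` is a hypothetical helper name, and scaling the prompt count to ten times the concurrency is an assumption that extends the 10-prompt, concurrency-1 run above. The per-run result filenames are likewise illustrative.

```shell
# Hypothetical dry-run helper: print one `vllm bench serve` command per
# concurrency level. Only flags already used in this guide appear here.
print_sweep() {
    model=$1
    shift
    for c in "$@"; do
        echo "vllm bench serve --model $model" \
            "--dataset-name random --ignore-eos" \
            "--random-input-len 128 --random-output-len 128" \
            "--max-concurrency $c --num-prompts $((c * 10))" \
            "--save-result --result-filename serving_c${c}.json"
    done
}
```

For example, `print_sweep "$model" 1 4 16` prints three commands, one per concurrency level, each writing to its own result file.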
    

Note

For improved performance with certain Mixture of Experts models, such as Mixtral 8x22B, try adding export VLLM_ROCM_USE_AITER=1 to your commands.

If you encounter the following error, the model is gated. Export a Hugging Face token that has been granted access to it, then rerun the command.

OSError: You are trying to access a gated repo.

# Replace the placeholder with your own Hugging Face access token
export HF_TOKEN="your personal Hugging Face token"

Note

Throughput is calculated as:

  • \[throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time\]
  • \[throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time\]

The following run command is tailored to Qwen3 30B A3B. See Supported models to switch to another available model.

  1. Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.

    git clone https://github.com/ROCm/MAD
    cd MAD
    pip install -r requirements.txt
    
  2. On the host machine, use this command to run the performance benchmark test on the Qwen3 30B A3B model using one node with the float16 data type.

    export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
    madengine run \
        --tags pyt_vllm_qwen3-30b-a3b \
        --keep-model-dir \
        --live-output
    

MAD launches a Docker container with the name container_ci-pyt_vllm_qwen3-30b-a3b. The throughput and serving reports of the model are collected in the following paths: pyt_vllm_qwen3-30b-a3b_throughput.csv and pyt_vllm_qwen3-30b-a3b_serving.csv.

Although the available models are preconfigured to collect offline throughput and online serving performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.

The following commands are optimized for Qwen3 30B A3B. See Supported models to switch to another available model.

See also

For more information on configuration, see the config files in the MAD repository. Refer to the vLLM engine for descriptions of available configuration options and Benchmarking vLLM for additional benchmarking information.

Launch the container

You can run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.

docker pull rocm/vllm:rocm7.0.0_vllm_0.10.2_20251006
docker run -it \
    --device=/dev/kfd \
    --device=/dev/dri \
    --group-add video \
    --shm-size 16G \
    --security-opt seccomp=unconfined \
    --security-opt apparmor=unconfined \
    --cap-add=SYS_PTRACE \
    -v $(pwd):/workspace \
    --env HUGGINGFACE_HUB_CACHE=/workspace \
    --name test \
    rocm/vllm:rocm7.0.0_vllm_0.10.2_20251006

Throughput command

Use the following command to start the throughput benchmark.

model=Qwen/Qwen3-30B-A3B
tp=1
num_prompts=1024
in=128
out=128
dtype=auto
kv_cache_dtype=auto
max_num_seqs=1024
max_num_batched_tokens=40960
max_model_len=8192

vllm bench throughput --model $model \
    -tp $tp \
    --num-prompts $num_prompts \
    --input-len $in \
    --output-len $out \
    --dtype $dtype \
    --kv-cache-dtype $kv_cache_dtype \
    --max-num-seqs $max_num_seqs \
    --max-num-batched-tokens $max_num_batched_tokens \
    --max-model-len $max_model_len \
    --trust-remote-code \
    --output-json ${model}_throughput.json \
    --gpu-memory-utilization 0.9

Serving command

  1. Start the server using the following command:

    model=Qwen/Qwen3-30B-A3B
    tp=1
    dtype=auto
    kv_cache_dtype=auto
    max_num_seqs=256
    max_num_batched_tokens=40960
    max_model_len=8192
    
    vllm serve $model \
        -tp $tp \
        --dtype $dtype \
        --kv-cache-dtype $kv_cache_dtype \
        --max-num-seqs $max_num_seqs \
        --max-num-batched-tokens $max_num_batched_tokens \
        --max-model-len $max_model_len \
        --no-enable-prefix-caching \
        --swap-space 16 \
        --disable-log-requests \
        --trust-remote-code \
        --gpu-memory-utilization 0.9
    

    Wait until the model has loaded and the server is ready to accept requests.

  2. In another terminal on the same machine, run the benchmark:

    # Connect to the container
    docker exec -it test bash
    
    # Wait for the server to start
    until curl -s http://localhost:8000/v1/models; do sleep 30; done
    
    # Run the benchmark
    model=Qwen/Qwen3-30B-A3B
    max_concurrency=1
    num_prompts=10
    in=128
    out=128
    vllm bench serve --model $model \
        --percentile-metrics "ttft,tpot,itl,e2el" \
        --dataset-name random \
        --ignore-eos \
        --max-concurrency $max_concurrency \
        --num-prompts $num_prompts \
        --random-input-len $in \
        --random-output-len $out \
        --trust-remote-code \
        --save-result \
        --result-filename ${model}_serving.json
    

Note

For improved performance with certain Mixture of Experts models, such as Mixtral 8x22B, try adding export VLLM_ROCM_USE_AITER=1 to your commands.
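
To compare a run with and without this setting, it can help to enable the variable for a single command instead of exporting it for the whole shell session. The sketch below does this with the standard `env` utility. `with_aiter` is a hypothetical wrapper name.

```shell
# Hypothetical wrapper: run one command with AITER enabled, without
# changing the environment of the calling shell.
with_aiter() {
    env VLLM_ROCM_USE_AITER=1 "$@"
}
```

Prefixing a benchmark command with `with_aiter` applies the setting to that run only, so the next run without the prefix serves as the baseline.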

If you encounter the following error, the model is gated. Export a Hugging Face token that has been granted access to it, then rerun the command.

OSError: You are trying to access a gated repo.

# Replace the placeholder with your own Hugging Face access token
export HF_TOKEN="your personal Hugging Face token"

Note

Throughput is calculated as:

  • \[throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time\]
  • \[throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time\]

The following run command is tailored to Qwen3 30B A3B FP8. See Supported models to switch to another available model.

  1. Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.

    git clone https://github.com/ROCm/MAD
    cd MAD
    pip install -r requirements.txt
    
  2. On the host machine, use this command to run the performance benchmark test on the Qwen3 30B A3B FP8 model using one node with the float8 data type.

    export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
    madengine run \
        --tags pyt_vllm_qwen3-30b-a3b_fp8 \
        --keep-model-dir \
        --live-output
    

MAD launches a Docker container with the name container_ci-pyt_vllm_qwen3-30b-a3b_fp8. The throughput and serving reports of the model are collected in the following paths: pyt_vllm_qwen3-30b-a3b_fp8_throughput.csv and pyt_vllm_qwen3-30b-a3b_fp8_serving.csv.

Although the available models are preconfigured to collect offline throughput and online serving performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.

The following commands are optimized for Qwen3 30B A3B FP8. See Supported models to switch to another available model.

See also

For more information on configuration, see the config files in the MAD repository. Refer to the vLLM engine for descriptions of available configuration options and Benchmarking vLLM for additional benchmarking information.

Launch the container

You can run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.

docker pull rocm/vllm:rocm7.0.0_vllm_0.10.2_20251006
docker run -it \
    --device=/dev/kfd \
    --device=/dev/dri \
    --group-add video \
    --shm-size 16G \
    --security-opt seccomp=unconfined \
    --security-opt apparmor=unconfined \
    --cap-add=SYS_PTRACE \
    -v $(pwd):/workspace \
    --env HUGGINGFACE_HUB_CACHE=/workspace \
    --name test \
    rocm/vllm:rocm7.0.0_vllm_0.10.2_20251006

Throughput command

Use the following command to start the throughput benchmark.

model=Qwen/Qwen3-30B-A3B-FP8
tp=1
num_prompts=1024
in=128
out=128
dtype=auto
kv_cache_dtype=fp8
max_num_seqs=1024
max_num_batched_tokens=40960
max_model_len=8192

vllm bench throughput --model $model \
    -tp $tp \
    --num-prompts $num_prompts \
    --input-len $in \
    --output-len $out \
    --dtype $dtype \
    --kv-cache-dtype $kv_cache_dtype \
    --max-num-seqs $max_num_seqs \
    --max-num-batched-tokens $max_num_batched_tokens \
    --max-model-len $max_model_len \
    --trust-remote-code \
    --output-json ${model}_throughput.json \
    --gpu-memory-utilization 0.9

Serving command

  1. Start the server using the following command:

    model=Qwen/Qwen3-30B-A3B-FP8
    tp=1
    dtype=auto
    kv_cache_dtype=fp8
    max_num_seqs=256
    max_num_batched_tokens=40960
    max_model_len=8192
    
    vllm serve $model \
        -tp $tp \
        --dtype $dtype \
        --kv-cache-dtype $kv_cache_dtype \
        --max-num-seqs $max_num_seqs \
        --max-num-batched-tokens $max_num_batched_tokens \
        --max-model-len $max_model_len \
        --no-enable-prefix-caching \
        --swap-space 16 \
        --disable-log-requests \
        --trust-remote-code \
        --gpu-memory-utilization 0.9
    

    Wait until the model has loaded and the server is ready to accept requests.

  2. In another terminal on the same machine, run the benchmark:

    # Connect to the container
    docker exec -it test bash
    
    # Wait for the server to start
    until curl -s http://localhost:8000/v1/models; do sleep 30; done
    
    # Run the benchmark
    model=Qwen/Qwen3-30B-A3B-FP8
    max_concurrency=1
    num_prompts=10
    in=128
    out=128
    vllm bench serve --model $model \
        --percentile-metrics "ttft,tpot,itl,e2el" \
        --dataset-name random \
        --ignore-eos \
        --max-concurrency $max_concurrency \
        --num-prompts $num_prompts \
        --random-input-len $in \
        --random-output-len $out \
        --trust-remote-code \
        --save-result \
        --result-filename ${model}_serving.json
    

Note

For improved performance with certain Mixture of Experts models, such as Mixtral 8x22B, try adding export VLLM_ROCM_USE_AITER=1 to your commands.

If you encounter the following error, the model is gated. Export a Hugging Face token that has been granted access to it, then rerun the command.

OSError: You are trying to access a gated repo.

# Replace the placeholder with your own Hugging Face access token
export HF_TOKEN="your personal Hugging Face token"

Note

Throughput is calculated as:

  • \[throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time\]
  • \[throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time\]

The following run command is tailored to Qwen3 235B A22B. See Supported models to switch to another available model.

  1. Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.

    git clone https://github.com/ROCm/MAD
    cd MAD
    pip install -r requirements.txt
    
  2. On the host machine, use this command to run the performance benchmark test on the Qwen3 235B A22B model using one node with the float16 data type.

    export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
    madengine run \
        --tags pyt_vllm_qwen3-235b-a22b \
        --keep-model-dir \
        --live-output
    

MAD launches a Docker container with the name container_ci-pyt_vllm_qwen3-235b-a22b. The throughput and serving reports of the model are collected in the following paths: pyt_vllm_qwen3-235b-a22b_throughput.csv and pyt_vllm_qwen3-235b-a22b_serving.csv.

Although the available models are preconfigured to collect offline throughput and online serving performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.

The following commands are optimized for Qwen3 235B A22B. See Supported models to switch to another available model.

See also

For more information on configuration, see the config files in the MAD repository. Refer to the vLLM engine for descriptions of available configuration options and Benchmarking vLLM for additional benchmarking information.

Launch the container

You can run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.

docker pull rocm/vllm:rocm7.0.0_vllm_0.10.2_20251006
docker run -it \
    --device=/dev/kfd \
    --device=/dev/dri \
    --group-add video \
    --shm-size 16G \
    --security-opt seccomp=unconfined \
    --security-opt apparmor=unconfined \
    --cap-add=SYS_PTRACE \
    -v $(pwd):/workspace \
    --env HUGGINGFACE_HUB_CACHE=/workspace \
    --name test \
    rocm/vllm:rocm7.0.0_vllm_0.10.2_20251006

Throughput command

Use the following command to start the throughput benchmark.

model=Qwen/Qwen3-235B-A22B
tp=8
num_prompts=1024
in=128
out=128
dtype=auto
kv_cache_dtype=auto
max_num_seqs=1024
max_num_batched_tokens=40960
max_model_len=8192

vllm bench throughput --model $model \
    -tp $tp \
    --num-prompts $num_prompts \
    --input-len $in \
    --output-len $out \
    --dtype $dtype \
    --kv-cache-dtype $kv_cache_dtype \
    --max-num-seqs $max_num_seqs \
    --max-num-batched-tokens $max_num_batched_tokens \
    --max-model-len $max_model_len \
    --trust-remote-code \
    --output-json ${model}_throughput.json \
    --gpu-memory-utilization 0.9

Serving command

  1. Start the server using the following command:

    model=Qwen/Qwen3-235B-A22B
    tp=8
    dtype=auto
    kv_cache_dtype=auto
    max_num_seqs=256
    max_num_batched_tokens=40960
    max_model_len=8192
    
    vllm serve $model \
        -tp $tp \
        --dtype $dtype \
        --kv-cache-dtype $kv_cache_dtype \
        --max-num-seqs $max_num_seqs \
        --max-num-batched-tokens $max_num_batched_tokens \
        --max-model-len $max_model_len \
        --no-enable-prefix-caching \
        --swap-space 16 \
        --disable-log-requests \
        --trust-remote-code \
        --gpu-memory-utilization 0.9
    

    Wait until the model has loaded and the server is ready to accept requests.

  2. In another terminal on the same machine, run the benchmark:

    # Connect to the container
    docker exec -it test bash
    
    # Wait for the server to start
    until curl -s http://localhost:8000/v1/models; do sleep 30; done
    
    # Run the benchmark
    model=Qwen/Qwen3-235B-A22B
    max_concurrency=1
    num_prompts=10
    in=128
    out=128
    vllm bench serve --model $model \
        --percentile-metrics "ttft,tpot,itl,e2el" \
        --dataset-name random \
        --ignore-eos \
        --max-concurrency $max_concurrency \
        --num-prompts $num_prompts \
        --random-input-len $in \
        --random-output-len $out \
        --trust-remote-code \
        --save-result \
        --result-filename ${model}_serving.json
    

Note

For improved performance with certain Mixture of Experts models, such as Mixtral 8x22B, try adding export VLLM_ROCM_USE_AITER=1 to your commands.

If you encounter the following error, the model is gated. Export a Hugging Face token that has been granted access to it, then rerun the command.

OSError: You are trying to access a gated repo.

# Replace the placeholder with your own Hugging Face access token
export HF_TOKEN="your personal Hugging Face token"

Note

Throughput is calculated as:

  • \[throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time\]
  • \[throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time\]

The following run command is tailored to Qwen3 235B A22B FP8. See Supported models to switch to another available model.

  1. Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.

    git clone https://github.com/ROCm/MAD
    cd MAD
    pip install -r requirements.txt
    
  2. On the host machine, use this command to run the performance benchmark test on the Qwen3 235B A22B FP8 model using one node with the float8 data type.

    export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
    madengine run \
        --tags pyt_vllm_qwen3-235b-a22b_fp8 \
        --keep-model-dir \
        --live-output
    

MAD launches a Docker container named container_ci-pyt_vllm_qwen3-235b-a22b_fp8. The model's throughput and serving reports are written to pyt_vllm_qwen3-235b-a22b_fp8_throughput.csv and pyt_vllm_qwen3-235b-a22b_fp8_serving.csv, respectively.
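
After the run finishes, a quick check that both report files exist and are non-empty can save a debugging round trip. This sketch assumes the CSV files are in the directory given as the second argument (the current directory by default); the actual output location may differ in your MAD setup, and `check_reports` is a hypothetical helper.

```shell
# Sketch: confirm that both MAD report files exist and are non-empty.
check_reports() {
    # $1 = MAD tag, $2 = directory to look in (default: current directory)
    local tag=$1 dir=${2:-.} kind missing=0
    for kind in throughput serving; do
        if [ -s "$dir/${tag}_${kind}.csv" ]; then
            echo "found:   $dir/${tag}_${kind}.csv"
        else
            echo "missing: $dir/${tag}_${kind}.csv" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example:
#   check_reports pyt_vllm_qwen3-235b-a22b_fp8
```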

Although the available models are preconfigured to collect offline throughput and online serving performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.

The following commands are optimized for Qwen3 235B A22B FP8. See Supported models to switch to another available model.

See also

For more information on configuration, see the config files in the MAD repository. Refer to the vLLM engine documentation for descriptions of available configuration options and to Benchmarking vLLM for additional benchmarking information.

Launch the container

You can run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.

docker pull rocm/vllm:rocm7.0.0_vllm_0.10.2_20251006
docker run -it \
    --device=/dev/kfd \
    --device=/dev/dri \
    --group-add video \
    --shm-size 16G \
    --security-opt seccomp=unconfined \
    --security-opt apparmor=unconfined \
    --cap-add=SYS_PTRACE \
    -v $(pwd):/workspace \
    --env HUGGINGFACE_HUB_CACHE=/workspace \
    --name test \
    rocm/vllm:rocm7.0.0_vllm_0.10.2_20251006
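
The `--device=/dev/kfd` and `--device=/dev/dri` options only work if those device nodes exist on the host. A minimal pre-flight check might look like the following sketch; the helper name `check_gpu_devices` is hypothetical, and the check only tests for the presence of the paths, not for working drivers.

```shell
# Sketch: verify that the device nodes passed to docker run exist on the host.
check_gpu_devices() {
    # Arguments: device paths to check (defaults to the two used above)
    local dev rc=0
    [ "$#" -gt 0 ] || set -- /dev/kfd /dev/dri
    for dev in "$@"; do
        if [ -e "$dev" ]; then
            echo "ok:      $dev"
        else
            echo "missing: $dev" >&2
            rc=1
        fi
    done
    return "$rc"
}

# Example: only start the container when both nodes are present.
#   check_gpu_devices && docker run -it --device=/dev/kfd --device=/dev/dri <other options> <image>
```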

Throughput command

Use the following command to start the throughput benchmark.

model=Qwen/Qwen3-235B-A22B-FP8
tp=8
num_prompts=1024
in=128
out=128
dtype=auto
kv_cache_dtype=fp8
max_num_seqs=1024
max_num_batched_tokens=40960
max_model_len=8192

vllm bench throughput --model $model \
    -tp $tp \
    --num-prompts $num_prompts \
    --input-len $in \
    --output-len $out \
    --dtype $dtype \
    --kv-cache-dtype $kv_cache_dtype \
    --max-num-seqs $max_num_seqs \
    --max-num-batched-tokens $max_num_batched_tokens \
    --max-model-len $max_model_len \
    --trust-remote-code \
    --output-json ${model}_throughput.json \
    --gpu-memory-utilization 0.9
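
Because `$model` contains a slash, `${model}_throughput.json` expands to `Qwen/Qwen3-235B-A22B-FP8_throughput.json`, a path inside a `Qwen/` subdirectory of the working directory. Whether the benchmark tool creates that directory for you is not covered here, so the sketch below shows two defensive options. Both are assumptions about what you might prefer, not part of the official command, and `ensure_parent_dir` is a hypothetical helper.

```shell
model=Qwen/Qwen3-235B-A22B-FP8

# Option 1: a flat filename; strip everything up to the last slash.
flat="${model##*/}_throughput.json"
echo "$flat"   # Qwen3-235B-A22B-FP8_throughput.json

# Option 2: keep the nested path, but create its parent directory first.
ensure_parent_dir() {
    mkdir -p "$(dirname "$1")"
}
# ensure_parent_dir "${model}_throughput.json"   # creates ./Qwen if needed
```

Either value can then be passed to `--output-json`.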

Serving command

  1. Start the server using the following command:

    model=Qwen/Qwen3-235B-A22B-FP8
    tp=8
    dtype=auto
    kv_cache_dtype=fp8
    max_num_seqs=256
    max_num_batched_tokens=40960
    max_model_len=8192
    
    vllm serve $model \
        -tp $tp \
        --dtype $dtype \
        --kv-cache-dtype $kv_cache_dtype \
        --max-num-seqs $max_num_seqs \
        --max-num-batched-tokens $max_num_batched_tokens \
        --max-model-len $max_model_len \
        --no-enable-prefix-caching \
        --swap-space 16 \
        --disable-log-requests \
        --trust-remote-code \
        --gpu-memory-utilization 0.9
    

    Wait until the model has loaded and the server is ready to accept requests.

  2. In another terminal on the same machine, run the benchmark:

    # Connect to the container
    docker exec -it test bash
    
    # Wait for the server to start
    until curl -s http://localhost:8000/v1/models; do sleep 30; done
    
    # Run the benchmark
    model=Qwen/Qwen3-235B-A22B-FP8
    max_concurrency=1
    num_prompts=10
    in=128
    out=128
    vllm bench serve --model $model \
        --percentile-metrics "ttft,tpot,itl,e2el" \
        --dataset-name random \
        --ignore-eos \
        --max-concurrency $max_concurrency \
        --num-prompts $num_prompts \
        --random-input-len $in \
        --random-output-len $out \
        --trust-remote-code \
        --save-result \
        --result-filename ${model}_serving.json
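
The `until curl ...` loop above waits indefinitely if the server never comes up, for example when model loading fails. A bounded variant is sketched below. The helper name `wait_until`, the 30-minute timeout, and the added `-f` flag (which makes curl return a failure status on HTTP errors) are choices made for this example.

```shell
# Sketch: retry a command until it succeeds or a timeout is reached.
wait_until() {
    # $1 = timeout in seconds, $2 = polling interval in seconds, rest = command to retry
    local timeout=$1 interval=$2 waited=0
    shift 2
    until "$@" >/dev/null 2>&1; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "timed out after ${waited}s waiting for: $*" >&2
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
}

# Example: give the server up to 30 minutes, polling every 30 seconds.
#   wait_until 1800 30 curl -sf http://localhost:8000/v1/models
```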
    

Note

For improved performance with certain Mixture of Experts models, such as Mixtral 8x22B, try running export VLLM_ROCM_USE_AITER=1 before your commands.

If you encounter the following error, set HF_TOKEN to a Hugging Face token that has been granted access to the gated model.

OSError: You are trying to access a gated repo.

# pass your HF_TOKEN
export HF_TOKEN=$your_personal_hf_token

Note

Throughput is calculated as:

  • \[throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time\]
  • \[throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time\]

The following run command is tailored to Phi-4. See Supported models to switch to another available model.

  1. Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.

    git clone https://github.com/ROCm/MAD
    cd MAD
    pip install -r requirements.txt
    
  2. On the host machine, use this command to run the performance benchmark test on the Phi-4 model using one node with the float16 data type.

    export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
    madengine run \
        --tags pyt_vllm_phi-4 \
        --keep-model-dir \
        --live-output
    

MAD launches a Docker container named container_ci-pyt_vllm_phi-4. The model's throughput and serving reports are written to pyt_vllm_phi-4_throughput.csv and pyt_vllm_phi-4_serving.csv, respectively.
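
To benchmark more than one model in sequence, the same `madengine run` invocation can be wrapped in a loop. This is a sketch under assumptions: it uses only tags that appear on this page, the `DRY_RUN` switch and the `run_tag` helper are hypothetical, and it prints the commands by default instead of executing them.

```shell
# Sketch: run (or print) one madengine invocation per tag.
DRY_RUN=${DRY_RUN:-1}   # set DRY_RUN=0 to execute instead of print
tags="pyt_vllm_qwen3-235b-a22b_fp8 pyt_vllm_phi-4"

run_tag() {
    if [ "$DRY_RUN" = 1 ]; then
        echo madengine run --tags "$1" --keep-model-dir --live-output
    else
        madengine run --tags "$1" --keep-model-dir --live-output
    fi
}

for tag in $tags; do
    run_tag "$tag" || break   # stop at the first failing run
done
```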

Although the available models are preconfigured to collect offline throughput and online serving performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.

The following commands are optimized for Phi-4. See Supported models to switch to another available model.

See also

For more information on configuration, see the config files in the MAD repository. Refer to the vLLM engine documentation for descriptions of available configuration options and to Benchmarking vLLM for additional benchmarking information.

Launch the container

You can run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.

docker pull rocm/vllm:rocm7.0.0_vllm_0.10.2_20251006
docker run -it \
    --device=/dev/kfd \
    --device=/dev/dri \
    --group-add video \
    --shm-size 16G \
    --security-opt seccomp=unconfined \
    --security-opt apparmor=unconfined \
    --cap-add=SYS_PTRACE \
    -v $(pwd):/workspace \
    --env HUGGINGFACE_HUB_CACHE=/workspace \
    --name test \
    rocm/vllm:rocm7.0.0_vllm_0.10.2_20251006

Throughput command

Use the following command to start the throughput benchmark.

model=microsoft/phi-4
tp=1
num_prompts=1024
in=128
out=128
dtype=auto
kv_cache_dtype=auto
max_num_seqs=1024
max_num_batched_tokens=16384
max_model_len=8192

vllm bench throughput --model $model \
    -tp $tp \
    --num-prompts $num_prompts \
    --input-len $in \
    --output-len $out \
    --dtype $dtype \
    --kv-cache-dtype $kv_cache_dtype \
    --max-num-seqs $max_num_seqs \
    --max-num-batched-tokens $max_num_batched_tokens \
    --max-model-len $max_model_len \
    --trust-remote-code \
    --output-json ${model}_throughput.json \
    --gpu-memory-utilization 0.9
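
If you change `in`, `out`, or `max_model_len`, keep in mind that each request's prompt plus its generated tokens has to fit within `--max-model-len`. A small guard, sketched here with a hypothetical function name, catches an invalid combination before the model is loaded.

```shell
# Sketch: check that input + output length fits in the model context length.
check_lengths() {
    # $1 = input length, $2 = output length, $3 = max model length
    local total=$(( $1 + $2 ))
    if [ "$total" -gt "$3" ]; then
        echo "in + out = $total exceeds max_model_len = $3" >&2
        return 1
    fi
    echo "in + out = $total fits within max_model_len = $3"
}

check_lengths 128 128 8192
# in + out = 256 fits within max_model_len = 8192
```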

Serving command

  1. Start the server using the following command:

    model=microsoft/phi-4
    tp=1
    dtype=auto
    kv_cache_dtype=auto
    max_num_seqs=256
    max_num_batched_tokens=16384
    max_model_len=8192
    
    vllm serve $model \
        -tp $tp \
        --dtype $dtype \
        --kv-cache-dtype $kv_cache_dtype \
        --max-num-seqs $max_num_seqs \
        --max-num-batched-tokens $max_num_batched_tokens \
        --max-model-len $max_model_len \
        --no-enable-prefix-caching \
        --swap-space 16 \
        --disable-log-requests \
        --trust-remote-code \
        --gpu-memory-utilization 0.9
    

    Wait until the model has loaded and the server is ready to accept requests.

  2. In another terminal on the same machine, run the benchmark:

    # Connect to the container
    docker exec -it test bash
    
    # Wait for the server to start
    until curl -s http://localhost:8000/v1/models; do sleep 30; done
    
    # Run the benchmark
    model=microsoft/phi-4
    max_concurrency=1
    num_prompts=10
    in=128
    out=128
    vllm bench serve --model $model \
        --percentile-metrics "ttft,tpot,itl,e2el" \
        --dataset-name random \
        --ignore-eos \
        --max-concurrency $max_concurrency \
        --num-prompts $num_prompts \
        --random-input-len $in \
        --random-output-len $out \
        --trust-remote-code \
        --save-result \
        --result-filename ${model}_serving.json
    

Note

For improved performance with certain Mixture of Experts models, such as Mixtral 8x22B, try running export VLLM_ROCM_USE_AITER=1 before your commands.

If you encounter the following error, set HF_TOKEN to a Hugging Face token that has been granted access to the gated model.

OSError: You are trying to access a gated repo.

# pass your HF_TOKEN
export HF_TOKEN=$your_personal_hf_token

Note

Throughput is calculated as:

  • \[throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time\]
  • \[throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time\]

Advanced usage#

For information on experimental features and known issues related to ROCm optimization efforts on vLLM, see the developer’s guide at ROCm/vllm.

Reproducing the Docker image#

To reproduce this ROCm-enabled vLLM Docker image release, follow these steps:

  1. Clone the vLLM repository.

    git clone https://github.com/vllm-project/vllm.git
    cd vllm
    
  2. Use the following command to build the image directly from the specified commit.

    docker build -f docker/Dockerfile.rocm \
        --build-arg REMOTE_VLLM=1 \
        --build-arg VLLM_REPO=https://github.com/ROCm/vllm \
        --build-arg VLLM_BRANCH="790d22168820507f3105fef29596549378cfe399" \
        -t vllm-rocm .
    

    Tip

    Replace vllm-rocm with your desired image tag.
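
The vLLM version string listed for this release, `0.11.0rc2.dev160+g790d22168.rocm700`, embeds an abbreviated commit hash after `+g`. As a sketch, the helper below (a hypothetical name) checks that such a version string agrees with the commit passed as `VLLM_BRANCH`. How you obtain the version string from your built image is left open.

```shell
# Sketch: compare the "+g<hash>" part of a version string with a full commit hash.
version_matches_commit() {
    # $1 = version string, $2 = full commit hash
    local short
    # Extract the hex digits that follow "+g" in the version string.
    short=$(printf '%s\n' "$1" | sed -n 's/.*+g\([0-9a-f]\{1,\}\).*/\1/p')
    [ -n "$short" ] || return 1
    case $2 in
        "$short"*) return 0 ;;
        *) return 1 ;;
    esac
}

version_matches_commit "0.11.0rc2.dev160+g790d22168.rocm700" \
    "790d22168820507f3105fef29596549378cfe399" && echo "commit matches"
```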

Further reading#

Previous versions#

See vLLM inference performance testing version history to find documentation for previous releases of the ROCm/vllm Docker image.