Benchmark Llama 3.1 405B FP4 inference with vLLM
2025-09-08
This section provides instructions to test the inference performance of Llama 3.1 405B on the vLLM inference engine. The accompanying Docker image integrates the ROCm 7.0 preview with vLLM and is tailored for AMD Instinct MI355X, MI350X, and MI300X series accelerators. This benchmark does not support other GPUs.
Follow these steps to pull the required image, spin up the container with the appropriate options, download the model, and run the throughput test.
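Before starting, you can optionally confirm that the host has a supported accelerator. The following sketch assumes the ROCm tools are already installed on the host and uses rocm-smi to report the product name.

# Optional: confirm the host exposes MI355X, MI350X, or MI300X accelerators
rocm-smi --showproductname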
Pull the Docker image
Use the following command to pull the Docker image.
docker pull rocm/7.0-preview:rocm7.0_preview_ubuntu_22.04_vllm_0.10.1_instinct_rc1
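Optionally, verify that the image is present locally before continuing. This is a quick sanity check using the standard Docker CLI; the tag should match the one pulled above.

# Optional: list the locally available images from this repository
docker images rocm/7.0-preview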
Download the model
See the model card on Hugging Face at amd/Llama-3.1-405B-Instruct-MXFP4-Preview. This model uses microscaling 4-bit floating point (MXFP4) quantization via AMD Quark for efficient inference on AMD accelerators.
pip install "huggingface_hub[cli]" hf_transfer hf_xet
HF_HUB_ENABLE_HF_TRANSFER=1 \
HF_HOME=/data/huggingface-cache \
HF_TOKEN="<HF_TOKEN>" \
huggingface-cli download amd/Llama-3.1-405B-Instruct-MXFP4-Preview --exclude "original/*"
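Optionally, confirm that the weights landed in the expected cache. The sketch below assumes the same HF_HOME as the download command; huggingface-cli scan-cache reads the cache under that location.

# Optional: list cached repositories and their sizes
HF_HOME=/data/huggingface-cache huggingface-cli scan-cache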
Run the inference benchmark
Start the container using the following command.
docker run -it \
    --ipc=host \
    --network=host \
    --privileged \
    --cap-add=CAP_SYS_ADMIN \
    --device=/dev/kfd \
    --device=/dev/dri \
    --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined \
    -v /data:/data \
    -e HF_HOME=/data/huggingface-cache \
    -e HF_HUB_OFFLINE=1 \
    -e VLLM_USE_V1=1 \
    -e VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1 \
    -e AMDGCN_USE_BUFFER_OPS=1 \
    -e VLLM_USE_AITER_TRITON_ROPE=1 \
    -e TRITON_HIP_ASYNC_COPY_BYPASS_PERMUTE=1 \
    -e TRITON_HIP_USE_ASYNC_COPY=1 \
    -e TRITON_HIP_USE_BLOCK_PINGPONG=1 \
    -e TRITON_HIP_ASYNC_FAST_SWIZZLE=1 \
    -e VLLM_ROCM_USE_AITER=1 \
    -e VLLM_ROCM_USE_AITER_RMSNORM=1 \
    -e VLLM_TRITON_FP4_GEMM_USE_ASM=1 \
    -e VLLM_TRITON_FP4_GEMM_SPLITK_USE_BF16=1 \
    -e TRITON_HIP_PRESHUFFLE_SCALES=0 \
    -e VLLM_TRITON_FP4_GEMM_BPRESHUFFLE=0 \
    --name vllm-server \
    rocm/7.0-preview:rocm7.0_preview_ubuntu_22.04_vllm_0.10.1_instinct_rc1
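The command above drops you into an interactive shell inside the container. Once inside, you can optionally check that all accelerators are visible before starting the server; rocm-smi ships with ROCm and should be available in the image.

# Optional: confirm all GPUs are visible inside the container
rocm-smi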
Start the server from inside the container.
max_model_len=10240
max_num_seqs=1024
max_num_batched_tokens=131072
max_seq_len_to_capture=16384
tensor_parallel_size=8

vllm serve amd/Llama-3.1-405B-Instruct-MXFP4-Preview \
    --host localhost \
    --port 8000 \
    --swap-space 64 \
    --disable-log-requests \
    --dtype auto \
    --max-model-len ${max_model_len} \
    --tensor-parallel-size ${tensor_parallel_size} \
    --max-num-seqs ${max_num_seqs} \
    --distributed-executor-backend mp \
    --kv-cache-dtype fp8 \
    --gpu-memory-utilization 0.92 \
    --max-seq-len-to-capture ${max_seq_len_to_capture} \
    --max-num-batched-tokens ${max_num_batched_tokens} \
    --no-enable-prefix-caching \
    --async-scheduling

# Wait for the model to load and the server to be ready to accept requests.
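Loading the 405B weights can take several minutes. One way to wait for readiness is to poll the server's health endpoint from another shell on the same host; this sketch assumes vLLM's OpenAI-compatible /health endpoint and the host and port used in the serve command above.

# Poll the server until it reports healthy
until curl -sf http://localhost:8000/health > /dev/null; do
    echo "Waiting for the vLLM server to become ready..."
    sleep 30
done
echo "Server is ready."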
Open another terminal on the same machine and run the benchmark with the following options.
# Connect to server
docker exec -it vllm-server bash

# Run the client benchmark
input_tokens=1024
output_tokens=1024
max_concurrency=4
num_prompts=32

python3 /app/vllm/benchmarks/benchmark_serving.py --host localhost --port 8000 \
    --model amd/Llama-3.1-405B-Instruct-MXFP4-Preview \
    --dataset-name random \
    --random-input-len ${input_tokens} \
    --random-output-len ${output_tokens} \
    --max-concurrency ${max_concurrency} \
    --num-prompts ${num_prompts} \
    --percentile-metrics ttft,tpot,itl,e2el \
    --ignore-eos
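The benchmark reports percentile metrics for time to first token (TTFT), time per output token (TPOT), inter-token latency (ITL), and end-to-end latency (E2EL), as selected by the --percentile-metrics flag.

To collect results at several concurrency levels, you can wrap the same client command in a loop. The sketch below keeps the 8:1 ratio of prompts to concurrency used above; the chosen concurrency values are illustrative, so adjust them to your test plan.

# Sweep a few concurrency levels with the same random dataset settings
for max_concurrency in 1 4 16 64; do
    python3 /app/vllm/benchmarks/benchmark_serving.py --host localhost --port 8000 \
        --model amd/Llama-3.1-405B-Instruct-MXFP4-Preview \
        --dataset-name random \
        --random-input-len 1024 \
        --random-output-len 1024 \
        --max-concurrency ${max_concurrency} \
        --num-prompts $((max_concurrency * 8)) \
        --percentile-metrics ttft,tpot,itl,e2el \
        --ignore-eos
done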