vLLM Docker image#
Virtual Large Language Model (vLLM) is a fast and easy-to-use library for LLM inference and serving.
Model support is enabled via a vLLM Docker image that must be installed separately (in addition to ROCm) for the current release.
For additional information, visit the AMD vLLM GitHub page.
Note that this is a benchmarking demo/example. Installation for other vLLM models/configurations may differ.
Prerequisites#
GitHub is authenticated.
Additional information#
AMD recommends a 40 GB GPU for 70B use cases. Ensure that your GPU has enough VRAM for the chosen model.
This example highlights use of the AMD vLLM Docker image with Llama-3 70B and GPTQ quantization (as shown at Computex). However, performance is not limited to this specific Hugging Face model, and other vLLM-supported models can also be used.
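Before starting, you can check how much VRAM your GPU reports from the host (an optional check, assuming ROCm and the rocm-smi utility are already installed on the host system):
rocm-smi --showmeminfo vram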
Download and install Docker image#
Download Docker image#
Select the applicable Ubuntu version to download the compatible Docker image before starting.
docker pull rocm/vllm-dev:rocm7.2_navi_ubuntu24.04_py3.12_pytorch_2.9_vllm_0.14.0rc0
docker pull rocm/vllm-dev:rocm7.2_navi_ubuntu22.04_py3.10_pytorch_2.9_vllm_0.14.0rc0
Note
For more information, see rocm/vllm-dev.
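Once the pull completes, you can optionally confirm that the image is available locally; the command below filters local images by repository name:
docker images rocm/vllm-dev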
Installation#
Follow these steps to set up the vLLM Docker container and run benchmarks.
Start the Docker container.
Important!
Refer to WSL-specific configurations for instructions when working in a WSL environment.
docker run -it \
    --privileged \
    --device=/dev/kfd \
    --device=/dev/dri \
    --network=host \
    --group-add sudo \
    -w /app/vllm/ \
    --name <container_name> \
    <image_name> \
    /bin/bash
Note
The container_name is user defined. Ensure that you name your Docker container using this value.
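If you exit the container and want to return to it later, you can restart and re-attach it by name with standard Docker commands, using the container_name value chosen above:
docker start <container_name>
docker exec -it <container_name> /bin/bash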
WSL-specific configurations
Optional: Only applicable when using a WSL configuration
Select the applicable vLLM instructions, based on your specific WSL configuration.
docker run -it \
    --network=host \
    --group-add=video \
    --ipc=host \
    --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined \
    --device /dev/dxg \
    --entrypoint /bin/bash \
    -v /usr/lib/wsl/lib/libdxcore.so:/usr/lib/libdxcore.so \
    -v /opt/rocm/lib/libhsa-runtime64.so.1:/opt/rocm/lib/libhsa-runtime64.so.1 \
    -w /app/vllm/ \
    --name vllm_rocm_container \
    rocm/vllm-dev:rocm7.2_navi_ubuntu24.04_py3.12_pytorch_2.9_vllm_0.14.0rc0 -c "sed -i 's/is_rocm = False/is_rocm = True/g' /opt/conda/envs/py_3.12/lib/python3.12/site-packages/vllm/platforms/__init__.py && /bin/bash"
docker run -it \
    --network=host \
    --group-add=video \
    --ipc=host \
    --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined \
    --device /dev/dxg \
    --entrypoint /bin/bash \
    -v /usr/lib/wsl/lib/libdxcore.so:/usr/lib/libdxcore.so \
    -v /opt/rocm/lib/libhsa-runtime64.so.1:/opt/rocm/lib/libhsa-runtime64.so.1 \
    -w /app/vllm/ \
    --name vllm_rocm_container \
    rocm/vllm-dev:rocm7.2_navi_ubuntu22.04_py3.10_pytorch_2.9_vllm_0.14.0rc0 -c "sed -i 's/is_rocm = False/is_rocm = True/g' /opt/conda/envs/py_3.10/lib/python3.10/site-packages/vllm/platforms/__init__.py && /bin/bash"
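As an optional sanity check inside the WSL container, you can confirm that PyTorch detects the GPU through the ROCm runtime (this assumes the image's bundled Python environment is on the PATH; on ROCm builds the device is exposed through the torch.cuda API):
python3 -c "import torch; print(torch.cuda.is_available())"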
Clone the Hugging Face model repository within the Docker container.
apt update
apt install git-lfs
git lfs clone https://huggingface.co/TechxGenus/Meta-Llama-3-70B-Instruct-GPTQ
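Cloning the GPTQ repository downloads several large weight files, so it can be worth verifying that the download completed before benchmarking (an optional check on the directory created by the clone above):
du -sh Meta-Llama-3-70B-Instruct-GPTQ
ls Meta-Llama-3-70B-Instruct-GPTQ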
Run benchmarks with the Docker container.
vllm bench latency --model /app/vllm/Meta-Llama-3-70B-Instruct-GPTQ -q gptq --batch-size 1 --input-len 1024 --output-len 1024 --max-model-len 2048
This is a vllm CLI command for the latency mode. Similar parameters can be used for the bench or serve modes, but they are separate modes and use different subcommands. This can also be called using python -m vllm.entrypoints.cli.main bench latency... For additional information, refer to the vLLM CLI Guide.
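As a sketch of the serve mode mentioned above, the same model can be exposed through vLLM's OpenAI-compatible HTTP server and queried with curl (this assumes the default port 8000 and the model path used in the benchmark command; adjust the flags to your setup):
vllm serve /app/vllm/Meta-Llama-3-70B-Instruct-GPTQ -q gptq --max-model-len 2048
curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "/app/vllm/Meta-Llama-3-70B-Instruct-GPTQ", "prompt": "Hello, my name is", "max_tokens": 32}'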
Note
Ensure that the model is downloaded and the vLLM checkout is set to your current directory within the container described in Step 3.
Note
Select the preferred environment variable prior to running models using vLLM in V1 mode:

| Selection | Prefill | Decode | Flags |
|---|---|---|---|
| TRITON_ATTN (default) | kernel_unified_attention | same unified | None or --attention-config.backend TRITON_ATTN |
| ROCM_ATTN (custom paged attention) | context_attention_fwd | paged_attention_rocm | VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1 or --attention-config.backend ROCM_ATTN --attention-config.use_prefill_decode_attention=true |
| ROCM_AITER_UNIFIED_ATTN | AITER unified_attention | same unified | VLLM_ATTENTION_BACKEND=ROCM_AITER_UNIFIED_ATTN or --attention-config.backend ROCM_AITER_UNIFIED_ATTN |
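For example, the latency benchmark above can be rerun with one of these selections by prefixing the corresponding environment variable (shown here with ROCM_AITER_UNIFIED_ATTN; the flag form from the table can be used instead):
VLLM_ATTENTION_BACKEND=ROCM_AITER_UNIFIED_ATTN vllm bench latency --model /app/vllm/Meta-Llama-3-70B-Instruct-GPTQ -q gptq --batch-size 1 --input-len 1024 --output-len 1024 --max-model-len 2048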