vLLM inference
vLLM is an open-source library for fast, memory-efficient LLM inference and serving. This page describes how to set up and run vLLM on AMD GPUs and APUs using either a prebuilt Docker image or pip. It applies to the supported AMD devices listed below.
AMD device family
Instinct: MI355X, MI350X, MI325X, MI300X, MI300A
Radeon PRO: AI PRO R9700, AI PRO R9600D
Radeon: RX 9070 XT, RX 9070 GRE, RX 9070, RX 9060 XT LP, RX 9060 XT, RX 9060
Ryzen: AI Max+ PRO 395, AI Max PRO 390, AI Max PRO 385, AI Max PRO 380, AI Max+ 395, AI Max 390, AI Max 385
Get started
Use a prebuilt Docker image
Instinct MI350 series (gfx950)
Pull the ROCm vLLM Docker image.
docker pull rocm/vllm:rocm7.12.0_gfx950-dcgpu_ubuntu24.04_py3.12_pytorch_2.9.1_vllm_0.16.0
Start the Docker container.
docker run -it --rm \
--network=host \
--group-add=video \
--ipc=host \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--device /dev/kfd \
--device /dev/dri \
-v <path/to/your/models>:/app/models \
-e HF_HOME="/app/models" \
rocm/vllm:rocm7.12.0_gfx950-dcgpu_ubuntu24.04_py3.12_pytorch_2.9.1_vllm_0.16.0 \
bash
Instinct MI300 series (gfx94X)
Pull the ROCm vLLM Docker image.
docker pull rocm/vllm:rocm7.12.0_gfx94X-dcgpu_ubuntu24.04_py3.12_pytorch_2.9.1_vllm_0.16.0
Start the Docker container.
docker run -it --rm \
--network=host \
--group-add=video \
--ipc=host \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--device /dev/kfd \
--device /dev/dri \
-v <path/to/your/models>:/app/models \
-e HF_HOME="/app/models" \
rocm/vllm:rocm7.12.0_gfx94X-dcgpu_ubuntu24.04_py3.12_pytorch_2.9.1_vllm_0.16.0 \
bash
Radeon and Radeon PRO (gfx120X)
Pull the ROCm vLLM Docker image.
docker pull rocm/vllm:rocm7.12.0_gfx120X-all_ubuntu24.04_py3.12_pytorch_2.9.1_vllm_0.16.0
Start the Docker container.
docker run -it --rm \
--network=host \
--group-add=video \
--ipc=host \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--device /dev/kfd \
--device /dev/dri \
-v <path/to/your/models>:/app/models \
-e HF_HOME="/app/models" \
rocm/vllm:rocm7.12.0_gfx120X-all_ubuntu24.04_py3.12_pytorch_2.9.1_vllm_0.16.0 \
bash
Ryzen AI Max series (gfx1151)
Pull the ROCm vLLM Docker image.
docker pull rocm/vllm:rocm7.12.0_gfx1151_ubuntu24.04_py3.12_pytorch_2.9.1_vllm_0.16.0
Start the Docker container.
docker run -it --rm \
--network=host \
--group-add=video \
--ipc=host \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--device /dev/kfd \
--device /dev/dri \
-v <path/to/your/models>:/app/models \
-e HF_HOME="/app/models" \
rocm/vllm:rocm7.12.0_gfx1151_ubuntu24.04_py3.12_pytorch_2.9.1_vllm_0.16.0 \
bash
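Inside any of these containers, vLLM is preinstalled. As a quick smoke test, you can launch an OpenAI-compatible server with the vllm serve command; the model name below is only an example, so substitute any model available under your mounted HF_HOME or on the Hugging Face Hub.
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000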
Install with pip
Set up your Python virtual environment. If you already have a working ROCm 7.12.0 pip installation, skip this step.
For example, run the following command to create a virtual environment:
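python3 -m venv .venv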
Activate your Python virtual environment. For example:
source .venv/bin/activate
Install ROCm 7.12.0 and PyTorch 2.9.1 in your virtual environment using pip. Run the command that matches your GPU architecture.
Instinct MI350 series (gfx950):
python -m pip install \
--index-url https://repo.amd.com/rocm/whl/gfx950-dcgpu/ \
"torch==2.9.1+rocm7.12.0" \
"torchaudio==2.9.0+rocm7.12.0" \
"torchvision==0.24.0+rocm7.12.0"
Instinct MI300 series (gfx94X):
python -m pip install \
--index-url https://repo.amd.com/rocm/whl/gfx94X-dcgpu/ \
"torch==2.9.1+rocm7.12.0" \
"torchaudio==2.9.0+rocm7.12.0" \
"torchvision==0.24.0+rocm7.12.0"
Radeon and Radeon PRO (gfx120X):
python -m pip install \
--index-url https://repo.amd.com/rocm/whl/gfx120X-all/ \
"torch==2.9.1+rocm7.12.0" \
"torchaudio==2.9.0+rocm7.12.0" \
"torchvision==0.24.0+rocm7.12.0"
Ryzen AI Max series (gfx1151):
python -m pip install \
--index-url https://repo.amd.com/rocm/whl/gfx1151/ \
"torch==2.9.1+rocm7.12.0" \
"torchaudio==2.9.0+rocm7.12.0" \
"torchvision==0.24.0+rocm7.12.0"
Install the appropriate vLLM 0.16.0 build for your GFX architecture from the ROCm package repository.
Instinct MI350 series (gfx950):
python -m pip install \
--extra-index-url https://rocm.frameworks.amd.com/whl/gfx950-dcgpu/ \
"vllm==0.16.1.dev10+g11515110f.d20260324.rocm712"
Instinct MI300 series (gfx94X):
python -m pip install \
--extra-index-url https://rocm.frameworks.amd.com/whl/gfx94X-dcgpu/ \
"vllm==0.16.1.dev10+g11515110f.d20260324.rocm712"
Radeon and Radeon PRO (gfx120X):
python -m pip install \
--extra-index-url https://rocm.frameworks.amd.com/whl/gfx120X-all/ \
"vllm==0.16.1.dev10+g11515110f.d20260323.rocm712"
Ryzen AI Max series (gfx1151):
python -m pip install \
--extra-index-url https://rocm.frameworks.amd.com/whl/gfx1151/ \
"vllm==0.16.1.dev10+g11515110f.d20260323.rocm712"
Set the following environment variables to prevent errors related to ROCm platform detection and Flash Attention availability when running vLLM. The PYTHONPATH value assumes your virtual environment is at .venv.
export PYTHONPATH=.venv/lib/python3.12/site-packages/_rocm_sdk_core/share/amd_smi
export FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE
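To confirm the PYTHONPATH setting took effect, you can check that the AMD SMI Python bindings import cleanly; this assumes the amdsmi package is what ships in that share/amd_smi directory.
python -c "import amdsmi; print('amdsmi import OK')"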
Check your installation.
echo "=== vLLM ===" && python -c "import vllm; print('vLLM version:', vllm.__version__)"
echo "=== PyTorch ===" && python -c "import torch; print('PyTorch:', torch.__version__); print('HIP available:', torch.cuda.is_available()); print('HIP built:', torch.backends.hip.is_built() if hasattr(torch.backends, 'hip') else 'N/A')"
echo "=== flash-attn ===" && python -c "import flash_attn; print('flash-attn:', flash_attn.__version__)"
After setting up your environment, follow the vLLM 0.16.0 usage documentation to get started: Using vLLM.
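For example, once a server is running via vllm serve <model>, you can query its OpenAI-compatible endpoint; the port and payload below assume the server defaults.
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "<model>", "prompt": "Hello, my name is", "max_tokens": 32}'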