vLLM Docker image#
Virtual Large Language Model (vLLM) is a fast and easy-to-use library for LLM inference and serving.
Model support is enabled via a vLLM Docker image that must be installed separately (in addition to ROCm) for the current release.
For additional information, visit the AMD vLLM GitHub page.
Note that this is a benchmarking demo/example. Installation for other vLLM models/configurations may differ.
Prerequisites#
GitHub is authenticated.
Additional information#
AMD recommends a 40 GB GPU for 70B use cases. Ensure that your GPU has enough VRAM for the chosen model.
This example highlights use of the AMD vLLM Docker image with Llama-3 70B and GPTQ quantization (as shown at Computex). However, performance is not limited to this specific Hugging Face model, and other vLLM-supported models can also be used.
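Before starting, you can check how much VRAM your GPU reports from the host (an optional check, assuming ROCm and the rocm-smi utility are already installed on the host system):
rocm-smi --showmeminfo vram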
Download and install Docker image#
Download Docker image#
Select the applicable Ubuntu version to download the compatible Docker image before starting.
docker pull rocm/vllm-dev:rocm7.2_navi_ubuntu24.04_py3.12_pytorch_2.9_vllm_0.14.0rc0
docker pull rocm/vllm-dev:rocm7.2_navi_ubuntu22.04_py3.10_pytorch_2.9_vllm_0.14.0rc0
Note
For more information, see rocm/vllm-dev.
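Once the pull completes, you can optionally confirm that the image is available locally; the command below filters local images by repository name:
docker images rocm/vllm-dev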
Installation#
Follow these steps to set up the vLLM Docker container and run benchmarks.
Start the Docker container.
Important!
Refer to WSL-specific configurations for instructions when working in a WSL environment.
docker run -it \
    --privileged \
    --device=/dev/kfd \
    --device=/dev/dri \
    --network=host \
    --group-add sudo \
    -w /app/vllm/ \
    --name <container_name> \
    <image_name> \
    /bin/bash
Note
The container_name is user defined. Ensure that you name your Docker container using this value.
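If you exit the container and want to return to it later, you can restart and re-attach it by name with standard Docker commands, using the container_name value chosen above:
docker start <container_name>
docker exec -it <container_name> /bin/bash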
WSL-specific configurations
Optional: Only applicable when using a WSL configuration
Select the applicable vLLM instructions, based on your specific WSL configuration.
docker run -it \
    --network=host \
    --group-add=video \
    --ipc=host \
    --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined \
    --device /dev/dxg \
    --entrypoint /bin/bash \
    -v /usr/lib/wsl/lib/libdxcore.so:/usr/lib/libdxcore.so \
    -v /opt/rocm/lib/libhsa-runtime64.so.1:/opt/rocm/lib/libhsa-runtime64.so.1 \
    -w /app/vllm/ \
    --name vllm_rocm_container \
    rocm/vllm-dev:rocm7.2_navi_ubuntu24.04_py3.12_pytorch_2.9_vllm_0.14.0rc0 -c "sed -i 's/is_rocm = False/is_rocm = True/g' /opt/conda/envs/py_3.12/lib/python3.12/site-packages/vllm/platforms/__init__.py && /bin/bash"
docker run -it \
    --network=host \
    --group-add=video \
    --ipc=host \
    --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined \
    --device /dev/dxg \
    --entrypoint /bin/bash \
    -v /usr/lib/wsl/lib/libdxcore.so:/usr/lib/libdxcore.so \
    -v /opt/rocm/lib/libhsa-runtime64.so.1:/opt/rocm/lib/libhsa-runtime64.so.1 \
    -w /app/vllm/ \
    --name vllm_rocm_container \
    rocm/vllm-dev:rocm7.2_navi_ubuntu22.04_py3.10_pytorch_2.9_vllm_0.14.0rc0 -c "sed -i 's/is_rocm = False/is_rocm = True/g' /opt/conda/envs/py_3.10/lib/python3.10/site-packages/vllm/platforms/__init__.py && /bin/bash"
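As an optional sanity check inside the WSL container, you can confirm that PyTorch detects the GPU through the ROCm runtime (this assumes the image's bundled Python environment is on the PATH; on ROCm builds the device is exposed through the torch.cuda API):
python3 -c "import torch; print(torch.cuda.is_available())"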
Clone the Hugging Face model repository within the Docker container.
apt update
apt install git-lfs
git lfs clone https://huggingface.co/TechxGenus/Meta-Llama-3-70B-Instruct-GPTQ
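Cloning the GPTQ repository downloads several large weight files, so it can be worth verifying that the download completed before benchmarking (an optional check on the directory created by the clone above):
du -sh Meta-Llama-3-70B-Instruct-GPTQ
ls Meta-Llama-3-70B-Instruct-GPTQ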
Run benchmarks with the Docker container.
vllm bench latency --model /app/vllm/Meta-Llama-3-70B-Instruct-GPTQ -q gptq --batch-size 1 --input-len 1024 --output-len 1024 --max-model-len 2048
This is a vllm CLI command for the latency mode. Similar parameters can be used for the bench or serve modes, but they are separate modes and use different subcommands. This can also be called using python -m vllm.entrypoints.cli.main bench latency... For additional information, refer to the vLLM CLI Guide.
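As a sketch of the serve mode mentioned above, the same model can be exposed through vLLM's OpenAI-compatible HTTP server and queried with curl (this assumes the default port 8000 and the model path used in the benchmark command; adjust the flags to your setup):
vllm serve /app/vllm/Meta-Llama-3-70B-Instruct-GPTQ -q gptq --max-model-len 2048
curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "/app/vllm/Meta-Llama-3-70B-Instruct-GPTQ", "prompt": "Hello, my name is", "max_tokens": 32}'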
Note
Ensure that the model is downloaded and the vLLM checkout is set to your current directory within the container described in Step 3.
Note
Select the preferred environment variable prior to running models using vLLM in V1 mode:

| Selection | Prefill | Decode | Flags |
|---|---|---|---|
| TRITON_ATTN (default) | kernel_unified_attention | same unified | None or --attention-config.backend TRITON_ATTN |
| ROCM_ATTN (custom paged attention) | context_attention_fwd | paged_attention_rocm | VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1 or --attention-config.backend ROCM_ATTN --attention-config.use_prefill_decode_attention=true |
| ROCM_AITER_UNIFIED_ATTN | AITER unified_attention | same unified | VLLM_ATTENTION_BACKEND=ROCM_AITER_UNIFIED_ATTN or --attention-config.backend ROCM_AITER_UNIFIED_ATTN |
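For example, the latency benchmark above can be rerun with one of these selections by prefixing the corresponding environment variable (shown here with ROCM_AITER_UNIFIED_ATTN; the flag form from the table can be used instead):
VLLM_ATTENTION_BACKEND=ROCM_AITER_UNIFIED_ATTN vllm bench latency --model /app/vllm/Meta-Llama-3-70B-Instruct-GPTQ -q gptq --batch-size 1 --input-len 1024 --output-len 1024 --max-model-len 2048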