vLLM Linux Docker Image#
Virtual Large Language Model (vLLM) is a fast and easy-to-use library for LLM inference and serving, providing optimizations for high-throughput, memory-efficient inference.
For additional information, visit the AMD vLLM GitHub page.
Note
This is a benchmarking demo/example. Installation for other vLLM models/configurations may differ.
Additional information
Ensure Docker is installed on your system. Refer to the Docker documentation for more information.
This Docker image supports the gfx1151 and gfx1150 GPU architectures (a quick check is sketched below).
This example highlights use of the AMD vLLM Docker image with deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B. Other vLLM-supported models can be used as well.
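To quickly confirm that your GPU reports one of the supported gfx targets, you can query the ROCm runtime. This is a minimal sketch, assuming the ROCm user-space tools (rocminfo) are installed on the host; the same command can also be run inside the container once it is started.
# Query the ROCm runtime for the GPU architecture name (expect gfx1150 or gfx1151).
rocminfo | grep -i gfx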
Download and install Docker image#
Download Docker image#
Select the applicable Ubuntu version to download the compatible Docker image before starting.
docker pull rocm/vllm-dev:rocm7.2_navi_ubuntu24.04_py3.12_pytorch_2.9_vllm_0.14.0rc0
Note
For more information, see rocm/vllm-dev.
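After the pull completes, you can optionally confirm that the image is available locally using standard Docker commands; the repository name below matches the pull command above.
# List locally available rocm/vllm-dev images and their tags.
docker images rocm/vllm-dev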
Installation#
Follow these steps to run the vLLM Docker container and benchmark a model.
Start the Docker container.
docker run -it \
    --privileged \
    --device=/dev/kfd \
    --device=/dev/dri \
    --network=host \
    --group-add sudo \
    -w /app/vllm/ \
    --name <container_name> \
    <image_name> \
    /bin/bash
Note
You can find the <image_name> by running docker images. The <container_name> is user defined. Ensure you name your Docker container using this value.
Run benchmarks with the Docker container.
vllm bench latency --model /app/vllm/Meta-Llama-3-70B-Instruct-GPTQ -q gptq --batch-size 1 --input-len 1024 --output-len 1024 --max-model-len 2048
Note
This is a vllm CLI command for the latency mode. Similar parameters can be used for the bench or serve modes, but they are separate modes and use different subcommands. This can also be called using:
python -m vllm.entrypoints.cli.main bench latency ...
For additional information, refer to the vLLM CLI Guide.
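As an illustration of the note above, a throughput benchmark can be launched with a similar set of parameters. This is a sketch only, not a verified command line; flags can differ between subcommands, so check vllm bench throughput --help inside the container.
# Hypothetical throughput run mirroring the latency example above; adjust flags per --help.
vllm bench throughput --model /app/vllm/Meta-Llama-3-70B-Instruct-GPTQ -q gptq --input-len 1024 --output-len 1024 --max-model-len 2048 --num-prompts 8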
Additional Usage#
vLLM is optimized to serve LLMs faster and more efficiently, especially for applications requiring high throughput and scalability. See Quickstart - OpenAI Compatible Server for more information.
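For example, a minimal sketch of serving the model used earlier on this page from inside the container might look like the following; the server listens on port 8000 by default and exposes the OpenAI-compatible /v1/chat/completions endpoint. The prompt and max_tokens value are placeholders.
# Start the OpenAI-compatible server with the example model.
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --max-model-len 2048

# From another shell, send a test chat completion request to the default port (8000).
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B", "messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 64}'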
To run offline inference, see Quickstart - Offline Batched Inference for more information.
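As a companion to the quickstart linked above, the following is a minimal offline batched inference sketch using the vLLM Python API with the example model; the prompts and sampling values are illustrative placeholders.
# Run inside the container; minimal offline batched-inference sketch.
python3 - <<'EOF'
from vllm import LLM, SamplingParams

# Illustrative prompts and sampling settings (not from the original guide).
prompts = ["Write a haiku about GPUs.", "Explain KV caching in one sentence."]
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

llm = LLM(model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B", max_model_len=2048)
for out in llm.generate(prompts, params):
    print(out.outputs[0].text)
EOF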
Note
If you experience errors with torch.distributed, running export GLOO_SOCKET_IFNAME=lo may resolve the issue.