vLLM Linux Docker Image#
Virtual Large Language Model (vLLM) is a fast and easy-to-use library for LLM inference and serving, providing optimizations for high-throughput, memory-efficient inference.
For additional information, visit the AMD vLLM GitHub page.
Note
This is a benchmarking demo/example. Installation for other vLLM models/configurations may differ.
Additional information
Ensure Docker is installed on your system. Refer to the Docker documentation for more information.
This Docker image supports the gfx1151 and gfx1150 GPU architectures (a quick check is sketched below).
This example highlights use of the AMD vLLM Docker image with deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B. Other vLLM-supported models can be used as well.
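To quickly confirm that your GPU reports one of the supported gfx targets, you can query the ROCm runtime. This is a minimal sketch, assuming the ROCm user-space tools (rocminfo) are installed on the host; the same command can also be run inside the container once it is started.
# Query the ROCm runtime for the GPU architecture name (expect gfx1150 or gfx1151).
rocminfo | grep -i gfx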
Download and install Docker image#
Download Docker image#
Select the applicable Ubuntu version to download the compatible Docker image before starting.
docker pull rocm/vllm-dev:rocm7.2_navi_ubuntu24.04_py3.12_pytorch_2.9_vllm_0.14.0rc0
Note
For more information, see rocm/vllm-dev.
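After the pull completes, you can optionally confirm that the image is available locally using standard Docker commands; the repository name below matches the pull command above.
# List locally available rocm/vllm-dev images and their tags.
docker images rocm/vllm-dev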
Installation#
Follow these steps to run the vLLM Docker container and benchmark a model.
Start the Docker container.
docker run -it \
    --privileged \
    --device=/dev/kfd \
    --device=/dev/dri \
    --network=host \
    --group-add sudo \
    -w /app/vllm/ \
    --name <container_name> \
    <image_name> \
    /bin/bash
Note
You can find the <image_name> by running docker images. The <container_name> is user defined. Ensure you name your Docker container using this value.
Run benchmarks with the Docker container.
vllm bench latency --model /app/vllm/Meta-Llama-3-70B-Instruct-GPTQ -q gptq --batch-size 1 --input-len 1024 --output-len 1024 --max-model-len 2048
Note
This is a vllm CLI command for the latency mode. Similar parameters can be used for the bench or serve modes, but they are separate modes and use different subcommands. This can also be called using:
python -m vllm.entrypoints.cli.main bench latency ...
For additional information, refer to the vLLM CLI Guide.
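As an illustration of the note above, a throughput benchmark can be launched with a similar set of parameters. This is a sketch only, not a verified command line; flags can differ between subcommands, so check vllm bench throughput --help inside the container.
# Hypothetical throughput run mirroring the latency example above; adjust flags per --help.
vllm bench throughput --model /app/vllm/Meta-Llama-3-70B-Instruct-GPTQ -q gptq --input-len 1024 --output-len 1024 --max-model-len 2048 --num-prompts 8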
Additional Usage#
vLLM is optimized to serve LLMs faster and more efficiently, especially for applications requiring high throughput and scalability. See Quickstart - OpenAI Compatible Server for more information.
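For example, a minimal sketch of serving the model used earlier on this page from inside the container might look like the following; the server listens on port 8000 by default and exposes the OpenAI-compatible /v1/chat/completions endpoint. The prompt and max_tokens value are placeholders.
# Start the OpenAI-compatible server with the example model.
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --max-model-len 2048

# From another shell, send a test chat completion request to the default port (8000).
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B", "messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 64}'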
To run offline inference, see Quickstart - Offline Batched Inference for more information.
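As a companion to the quickstart linked above, the following is a minimal offline batched inference sketch using the vLLM Python API with the example model; the prompts and sampling values are illustrative placeholders.
# Run inside the container; minimal offline batched-inference sketch.
python3 - <<'EOF'
from vllm import LLM, SamplingParams

# Illustrative prompts and sampling settings (not from the original guide).
prompts = ["Write a haiku about GPUs.", "Explain KV caching in one sentence."]
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

llm = LLM(model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B", max_model_len=2048)
for out in llm.generate(prompts, params):
    print(out.outputs[0].text)
EOF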
Note
If you experience errors with torch.distributed, running export GLOO_SOCKET_IFNAME=lo may resolve the issue.