vLLM Linux Docker Image#

Virtual Large Language Model (vLLM) is a fast and easy-to-use library for LLM inference and serving, delivering high throughput and memory-efficient execution.

For additional information, visit the AMD vLLM GitHub page.

Note
This is a benchmarking demo/example. Installation for other vLLM models/configurations may differ.

Additional information

  • Ensure Docker is installed on your system. Refer to the Docker installation documentation for more information.

  • This Docker image supports the gfx1151 and gfx1150 GPU targets (see the check after this list to confirm your GPU's target).

  • This example highlights use of the AMD vLLM Docker image with deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B. Other vLLM-supported models can be used as well.
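
As a quick check of the supported GPU targets (this assumes the ROCm user-space tools are installed on the host; the grep pattern is only illustrative), you can list the gfx targets reported by rocminfo:

    # Print the unique gfx targets detected on this system
    rocminfo | grep -o "gfx[0-9a-f]*" | sort -u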

Download and install Docker image#

Download Docker image#

Download the Docker image compatible with your Ubuntu version before starting.

docker pull rocm/vllm-dev:rocm7.2_navi_ubuntu24.04_py3.12_pytorch_2.9_vllm_0.14.0rc0

Note
For more information, see rocm/vllm-dev.
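
To confirm the image downloaded successfully, and to note its name and tag for the docker run step below, list it with:

    # Show local images from the rocm/vllm-dev repository
    docker images rocm/vllm-dev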

Installation#

Follow these steps to start the vLLM Docker container and benchmark a model.

  1. Start the Docker container.

    docker run -it \
      --privileged \
      --device=/dev/kfd \
      --device=/dev/dri \
      --network=host \
      --group-add sudo \
      -w /app/vllm/ \
      --name <container_name> \
      <image_name> \
      /bin/bash
    

    Note
    You can find the <image_name> by running docker images. The <container_name> is user defined; Docker uses this value to name your container (see the sketch after these steps for re-entering a named container).

  2. Run benchmarks with the Docker container.

    vllm bench latency --model /app/vllm/Meta-Llama-3-70B-Instruct-GPTQ -q gptq --batch-size 1 --input-len 1024 --output-len 1024 --max-model-len 2048
    

    Note
    This is the vllm CLI command for the latency benchmark mode. Similar parameters can be used for the throughput and serve benchmark modes, but they are separate modes and use different subcommands (a serve-mode sketch follows these steps).

    This can also be called using python -m vllm.entrypoints.cli.main bench latency...

    For additional information, refer to the vLLM CLI Guide.
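
The serve mode mentioned above starts an OpenAI-compatible server. The following is a minimal sketch rather than a definitive recipe: the model name reuses the example model listed earlier, port 8000 is vLLM's default, and the request payload is illustrative.

    # Start an OpenAI-compatible server inside the container
    # (the model is downloaded from Hugging Face on first use)
    vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --max-model-len 2048

    # From another shell in the same container, send a test completion request
    curl http://localhost:8000/v1/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B", "prompt": "Hello", "max_tokens": 32}'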

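If you exit the container started in step 1, you can re-enter it later by name instead of creating a new one; <container_name> is the same placeholder used above.

    # Restart a stopped container and open a shell in it
    docker start <container_name>
    docker exec -it <container_name> /bin/bash
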
Additional usage#

Note
If you experience errors with torch.distributed, running export GLOO_SOCKET_IFNAME=lo may resolve the issue.
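
For example, the variable can be set for the current shell session inside the container before rerunning the benchmark; <model_path> is a placeholder for the model used in step 2.

    # Bind torch.distributed's Gloo backend to the loopback interface
    export GLOO_SOCKET_IFNAME=lo
    vllm bench latency --model <model_path> --batch-size 1 --input-len 1024 --output-len 1024 --max-model-len 2048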