vLLM Docker image for Llama2 and Llama3#

Virtual Large Language Model (vLLM) is a fast and easy-to-use library for LLM inference and serving.

Llama2 and Llama3 support is enabled via a vLLM Docker image that must be installed separately (in addition to ROCm) for the current release.

For additional information, visit the AMD vLLM GitHub page.

Note that this is a benchmarking demo/example. Installation for other vLLM models/configurations may differ.

Prerequisites#

Additional information#

  • AMD recommends a 40GB GPU for 70B use cases.
    Ensure that your GPU has enough VRAM for the chosen model (see the VRAM check after this list).

  • This example highlights the use of the AMD vLLM Docker image with Llama-3 70B and GPTQ quantization (as shown at Computex).
    However, performance is not limited to this specific Hugging Face model; other vLLM-supported models can also be used.
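
A quick way to confirm available VRAM before choosing a model is to query the GPU from the host. This is a minimal sketch that assumes the rocm-smi utility from your ROCm installation; output formatting varies between ROCm releases.

# Report total and used VRAM for each GPU
rocm-smi --showmeminfo vram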

Download and install Docker image#

Download Docker image#

Select the applicable Ubuntu version to download the compatible Docker image before starting.

Ubuntu 24.04:

docker pull rocm/vllm-dev:rocm6.4.2_navi_ubuntu24.04_py3.12_pytorch_2.7_vllm_0.9.2

Ubuntu 22.04:

docker pull rocm/vllm-dev:rocm6.4.2_navi_ubuntu22.04_py3.10_pytorch_2.7_vllm_0.9.2
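
If you are not sure which tag matches your host, you can check the Ubuntu release and then confirm the pull succeeded. This is a simple sketch; lsb_release ships with standard Ubuntu installations.

# Print the Ubuntu release number (for example, 24.04 or 22.04)
lsb_release -rs

# Confirm the image is available locally after the pull
docker images rocm/vllm-dev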

Note

For more information, see rocm/vllm-dev.

Installation#

Follow these steps to run the vLLM Docker container and start using Llama2 and Llama3.

  1. Start the Docker container.

    Important!
    Refer to the WSL-specific configurations below for instructions when working in a WSL environment.

    docker run -it \
      --privileged \
      --device=/dev/kfd \
      --device=/dev/dri \
      --network=host \
      --group-add sudo \
      -w /app/vllm/ \
      --name <container_name> \
      <image_name> \
      /bin/bash
    

    Note

    • The container_name is user defined. Be sure to name your Docker container using this value.
    • Replace image_name with the vLLM Docker image tag pulled in the previous step.
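
    Once the container is running, you can confirm that the GPU is visible inside it. This is a minimal check, assuming the ROCm tools bundled in the image; the exact output depends on your hardware.

    # List the GPUs detected inside the container
    rocm-smi

    # Optionally confirm that PyTorch sees the device (ROCm builds expose GPUs through the torch.cuda API)
    python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"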

    WSL-specific configurations

    Optional: Only applicable when using a WSL configuration.
    Select the applicable command based on the Ubuntu version of the Docker image you pulled.

    Ubuntu 24.04:

    docker run -it \
    --network=host \
    --group-add=video \
    --ipc=host \
    --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined \
    --device /dev/dxg \
    --entrypoint /bin/bash \
    -v /usr/lib/wsl/lib/libdxcore.so:/usr/lib/libdxcore.so \
    -v /opt/rocm/lib/libhsa-runtime64.so.1:/opt/rocm/lib/libhsa-runtime64.so.1 \
    -w /app/vllm/ \
    --name vllm_rocm_container \
    rocm/vllm-dev:rocm6.4.2_navi_ubuntu24.04_py3.12_pytorch_2.7_vllm_0.9.2 \
    -c "sed -i 's/is_rocm = False/is_rocm = True/g' /opt/conda/envs/py_3.12/lib/python3.12/site-packages/vllm/platforms/__init__.py && /bin/bash"
    
    Ubuntu 22.04:

    docker run -it \
    --network=host \
    --group-add=video \
    --ipc=host \
    --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined \
    --device /dev/dxg \
    --entrypoint /bin/bash \
    -v /usr/lib/wsl/lib/libdxcore.so:/usr/lib/libdxcore.so \
    -v /opt/rocm/lib/libhsa-runtime64.so.1:/opt/rocm/lib/libhsa-runtime64.so.1 \
    -w /app/vllm/ \
    --name vllm_rocm_container \
    rocm/vllm-dev:rocm6.4.2_navi_ubuntu22.04_py3.10_pytorch_2.7_vllm_0.9.2 \
    -c "sed -i 's/is_rocm = False/is_rocm = True/g' /opt/conda/envs/py_3.10/lib/python3.10/site-packages/vllm/platforms/__init__.py && /bin/bash"
    
  2. Clone the Hugging Face model repository within the Docker container.

    apt update
    apt install git-lfs
    git lfs clone https://huggingface.co/TechxGenus/Meta-Llama-3-70B-Instruct-GPTQ
    
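
    The 70B GPTQ checkpoint is tens of gigabytes, so it is worth confirming that the LFS weights downloaded fully before benchmarking. This is a quick sketch; file names can vary between revisions of the Hugging Face repository.

    # Check the total size of the cloned model directory
    du -sh Meta-Llama-3-70B-Instruct-GPTQ

    # List the quantized weight shards (stored as safetensors files)
    ls -lh Meta-Llama-3-70B-Instruct-GPTQ/*.safetensors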
  3. Run benchmarks with the Docker container.

    python3 /app/vllm/benchmarks/benchmark_latency.py --model /app/vllm/Meta-Llama-3-70B-Instruct-GPTQ -q gptq --batch-size 1 --input-len 1024 --output-len 1024 --max-model-len 2048
    
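
    If you also want a throughput (tokens per second) measurement in addition to per-request latency, the benchmarks directory includes a throughput script. This is a hedged sketch; flag names can change between vLLM releases, so check the script's --help output first.

    python3 /app/vllm/benchmarks/benchmark_throughput.py \
      --model /app/vllm/Meta-Llama-3-70B-Instruct-GPTQ \
      -q gptq \
      --num-prompts 16 \
      --input-len 1024 \
      --output-len 1024 \
      --max-model-len 2048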

    Note
    Ensure that the model is downloaded and the vLLM checkout is set to your current working directory within the container started in Step 1.

    Note
    Set the following environment variable prior to running models using vLLM in V1 mode:
    VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1
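
    For example, in the container shell before launching the benchmark (an illustrative snippet):

    # Enable the prefill/decode attention path when running vLLM in V1 mode
    export VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1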

Your environment is set up to use Llama2 and Llama3.
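
Beyond offline benchmarking, the same container can serve the model through vLLM's OpenAI-compatible API. The following is a minimal sketch, assuming the vllm CLI entry point available in the image; adjust the model path and flags to match your setup.

# Start an OpenAI-compatible server on port 8000 (default)
vllm serve /app/vllm/Meta-Llama-3-70B-Instruct-GPTQ --quantization gptq --max-model-len 2048

# From another shell, send a test completion request
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "/app/vllm/Meta-Llama-3-70B-Instruct-GPTQ", "prompt": "Hello, my name is", "max_tokens": 32}'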