vLLM Docker image for Llama2 and Llama3

Virtual Large Language Model (vLLM) is a fast and easy-to-use library for LLM inference and serving.

Llama2 and Llama3 support is enabled via a vLLM Docker image that must be built separately (in addition to ROCm) for the current release.

For additional information, visit the AMD vLLM GitHub page.

Note that this is a benchmarking demo/example. Installation for other vLLM models/configurations may differ.


Additional information

  • AMD recommends a GPU with 40 GB of VRAM for 70B use cases.
    Ensure that your GPU has enough VRAM for the chosen model.

  • This example demonstrates the AMD vLLM Docker image using Llama-3 70B with GPTQ quantization (as shown at Computex).
    The image is not limited to this specific Hugging Face model; other vLLM-supported models can also be used.
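
As a rough sanity check on the 40 GB recommendation, the weight footprint of a quantized model can be estimated from its parameter count and the number of bits per weight. The sketch below assumes 4-bit GPTQ weights (an assumption; check the model card for the actual bit width), counts 1 GB as 10^9 bytes, and ignores the KV cache, activations, and quantization metadata, all of which need additional VRAM.

```shell
#!/bin/sh
# Rough weight-only VRAM estimate for a quantized model.
# Assumption: 4-bit weights; verify the bit width against the model card.
params_billion=70      # Llama-3 70B
bits_per_weight=4      # assumed GPTQ bit width

# billions of parameters * bits per weight / 8 = gigabytes of weights
# (using 1 GB = 10^9 bytes)
weights_gb=$(( params_billion * bits_per_weight / 8 ))

echo "Approximate weight footprint: ${weights_gb} GB"
```

Under these assumptions the weights alone come to about 35 GB, which leaves only a small margin on a 40 GB GPU for everything else the runtime allocates.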

Installation steps

Follow these steps to build a vLLM Docker image and start using Llama2 and Llama3.

  1. Clone the ROCm/vllm repository.

    git clone -b vllm0.4.1_llama_70b_gptq https://github.com/ROCm/vllm.git

  2. Change to the vllm directory and build the Docker image.

    sudo docker build -f Dockerfile.rocm -t <image_name> .


    • The Docker image_name is user defined. Use the same value wherever <image_name> appears in later steps.
      Example: vllm0.4.1_rocm6.1.1_ubuntu20.04_py3.9_image

    • Optional: Map the vllm directory from the host into the Docker container when you start it, so that the repository does not need to be cloned again inside the container.

  3. Start the Docker container.

    sudo docker run -it --privileged --device=/dev/kfd --device=/dev/dri --network=host --group-add sudo -w /root/workspace --name <container_name> <image_name> /bin/bash

    The container_name is user defined. Use this name to refer to the container in later Docker commands.

    Example: vllm0.4.1_rocm6.1.1_ubuntu20.04_py3.9_container

  4. Clone the vLLM GitHub repository within the Docker container.

    This step is not necessary if you mapped the host vllm directory into the container (the optional mapping described in Step 2).

    git clone -b vllm0.4.1_llama_70b_gptq https://github.com/ROCm/vllm.git

  5. Clone the Hugging Face model repository within the Docker container.

    git lfs clone https://huggingface.co/TechxGenus/Meta-Llama-3-70B-Instruct-GPTQ

  6. Run benchmarks within the Docker container.

    python3 vllm/benchmarks/benchmark_latency.py --model /root/workspace/Meta-Llama-3-70B-Instruct-GPTQ --batch-size 1 --input-len 1024 --output-len 1024

    Ensure that the model has been downloaded and that your current directory within the container started in Step 3 is the one that contains the vllm checkout.
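
The two preconditions for the benchmark in Step 6 can be checked with a small pre-flight helper before starting a long run. This is a minimal sketch, not part of vLLM: the function name `check_benchmark_paths` is hypothetical, and the paths it checks are taken from the benchmark command above.

```shell
#!/bin/sh
# Pre-flight check before running the benchmark in Step 6.
# check_benchmark_paths is a hypothetical helper, not part of vLLM.
check_benchmark_paths() {
    workdir=$1      # directory that contains the vllm checkout
    model_dir=$2    # directory that contains the downloaded model

    # The benchmark is invoked as vllm/benchmarks/benchmark_latency.py,
    # so the script must exist relative to the working directory.
    if [ ! -f "$workdir/vllm/benchmarks/benchmark_latency.py" ]; then
        echo "missing benchmark script under $workdir/vllm" >&2
        return 1
    fi
    # The model directory is passed to --model and must already exist.
    if [ ! -d "$model_dir" ]; then
        echo "missing model directory $model_dir" >&2
        return 1
    fi
    echo "ok"
}

# Usage inside the container (paths taken from Step 6):
#   check_benchmark_paths "$PWD" /root/workspace/Meta-Llama-3-70B-Instruct-GPTQ
```

The helper only verifies that the paths exist; it does not confirm that the Git LFS weight files finished downloading.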

Your environment is set up to use Llama2 and Llama3.
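
Step 2 mentions optionally mapping the host vllm directory into the container but does not show how. One common way is the `-v` (volume) option of `docker run`. The dry-run sketch below prints the Step 2 and Step 3 commands with consistent image and container names and the mapping added. The host path `/home/user/vllm` and the mount point `/root/workspace/vllm` are assumptions for illustration; adjust them to your setup.

```shell
#!/bin/sh
# Dry run: print the Step 2 and Step 3 commands with consistent names.
# IMAGE_NAME and CONTAINER_NAME reuse the example values from the steps above.
IMAGE_NAME="vllm0.4.1_rocm6.1.1_ubuntu20.04_py3.9_image"
CONTAINER_NAME="vllm0.4.1_rocm6.1.1_ubuntu20.04_py3.9_container"

# Assumptions for illustration only: location of the host checkout and
# the directory where it appears inside the container.
HOST_VLLM_DIR="/home/user/vllm"
MOUNT_POINT="/root/workspace/vllm"

BUILD_CMD="sudo docker build -f Dockerfile.rocm -t $IMAGE_NAME ."

# Same options as Step 3, plus -v to map the host directory.
RUN_CMD="sudo docker run -it --privileged --device=/dev/kfd --device=/dev/dri \
--network=host --group-add sudo -w /root/workspace \
-v $HOST_VLLM_DIR:$MOUNT_POINT \
--name $CONTAINER_NAME $IMAGE_NAME /bin/bash"

# Print the commands instead of executing them.
echo "$BUILD_CMD"
echo "$RUN_CMD"
```

Because the script only prints the commands, it can be reviewed before anything is run. With the mapping in place, the clone in Step 4 can be skipped.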