vLLM Docker image for Llama2 and Llama3

Virtual Large Language Model (vLLM) is a fast and easy-to-use library for LLM inference and serving.

Llama2 and Llama3 support is enabled via a vLLM Docker image that must be built separately (in addition to ROCm) for the current release.

For additional information, visit the AMD vLLM GitHub page at https://github.com/ROCm/vllm.

Note that this is a benchmarking demo/example. Installation for other vLLM models/configurations may differ.

Prerequisites

Additional information

  • AMD recommends a 40 GB GPU for 70B use cases.
    Ensure that your GPU has enough VRAM for the chosen model (see the quick check after this list).

  • This example highlights the use of the AMD vLLM Docker image with Llama-3 70B and GPTQ quantization (as shown at Computex).
    However, performance is not limited to this specific Hugging Face model, and other vLLM-supported models can also be used.
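
To confirm how much VRAM is available, you can query it with rocm-smi. This is a minimal sketch; the exact report layout varies between ROCm versions.

    # Show total and used VRAM for each GPU
    rocm-smi --showmeminfo vram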

Installation steps

Follow these steps to build a vLLM Docker image and start using Llama2 and Llama3.

  1. Clone the ROCm/vllm repository.

    git clone -b v0.6.2.post1+rocm https://github.com/ROCm/vllm.git
    
  2. Change directory into the cloned vllm repository and build the Docker image.

    DOCKER_BUILDKIT=1 docker build \
    --build-arg "ARG_PYTORCH_ROCM_ARCH=<GPU_name>" \
    --build-arg "BUILD_FA=0" \
    -f Dockerfile.rocm \
    -t <image_name> .
    

    Note

    • The Docker image_name is user defined. Use this value when naming your Docker image.
      Example: vllm0.4.1_rocm6.1.1_ubuntu20.04_py3.9_image

    • Use rocminfo to retrieve the GPU name; an example follows this note.
      Example: GPU_name is gfx1100 for Navi31.
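
    For instance, the architecture name can be pulled from the rocminfo output and plugged into the build command. The grep pattern and the filled-in names below are illustrative assumptions, not fixed values.

    # Print the first GPU architecture name reported by rocminfo (for example, gfx1100)
    rocminfo | grep -o -m 1 'gfx[0-9a-f]*'

    # Example build for a Navi31 (gfx1100) GPU with a user-defined image name
    DOCKER_BUILDKIT=1 docker build \
    --build-arg "ARG_PYTORCH_ROCM_ARCH=gfx1100" \
    --build-arg "BUILD_FA=0" \
    -f Dockerfile.rocm \
    -t vllm0.6.2_rocm6.2_ubuntu20.04_py3.9_image .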

  3. Start the Docker container. (A quick GPU visibility check follows the note below.)

    docker run -it --privileged --device=/dev/kfd --device=/dev/dri --network=host --group-add sudo \
    -w /root/workspace \
    -v <vllm_directory>:/root/workspace/vllm \
    --name <container_name> <image_name> /bin/bash
    

    Note

    • The container_name is user defined. Use this value when naming your Docker container.
      Example: vllm0.6.2_rocm6.2_ubuntu20.04_py3.9_container

    • The vllm_directory is user defined. Mount the directory you cloned from the vllm Git repository.
      Example: /home/user/vllm
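
    Before moving on, you can check from inside the container that the GPU is visible to ROCm and to PyTorch. This is a minimal sketch; it assumes the ROCm build of PyTorch that ships in the image.

    # List the GPUs visible inside the container
    rocm-smi

    # ROCm builds of PyTorch expose GPUs through the torch.cuda API surface
    python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"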

  4. Clone the Hugging Face model repository within the Docker container. (An alternative download method is sketched after the commands below.)

    apt update
    apt install git-lfs
    git lfs clone https://huggingface.co/TechxGenus/Meta-Llama-3-70B-Instruct-GPTQ
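
    Alternatively, the same model can be fetched without Git LFS through the Hugging Face CLI. This assumes huggingface-cli is present in the image (it ships with vLLM's huggingface_hub dependency, but this guide does not install it explicitly).

    # Download the model with the Hugging Face CLI instead of git lfs clone
    huggingface-cli download TechxGenus/Meta-Llama-3-70B-Instruct-GPTQ \
    --local-dir Meta-Llama-3-70B-Instruct-GPTQ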
    
  5. Run benchmarks within the Docker container.

    python3 vllm/benchmarks/benchmark_latency.py --model /root/workspace/Meta-Llama-3-70B-Instruct-GPTQ -q gptq --batch-size 1 --input-len 1024 --output-len 1024 --max-model-len 2048
    

    Note
    Ensure that the model has been downloaded and that the vllm checkout is in your current working directory inside the container started in step 3.

Your environment is set up to use Llama2 and Llama3.
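
Beyond the latency benchmark, the same image can serve the model for inference. The following is a minimal sketch using vLLM's OpenAI-compatible server; the flags and endpoint reflect generic vLLM usage rather than anything specific to this image, and loading a 70B model takes a while before requests are accepted.

    # Start the OpenAI-compatible server inside the container (listens on port 8000 by default)
    python3 -m vllm.entrypoints.openai.api_server \
    --model /root/workspace/Meta-Llama-3-70B-Instruct-GPTQ \
    --quantization gptq --max-model-len 2048

    # From another shell (for example, docker exec into the same container), send a test request
    curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "/root/workspace/Meta-Llama-3-70B-Instruct-GPTQ", "prompt": "Hello, my name is", "max_tokens": 32}'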