vLLM Docker image for Llama2 and Llama3#

Virtual Large Language Model (vLLM) is a fast and easy-to-use library for LLM inference and serving.

Llama2 and Llama3 support is enabled via a vLLM Docker image that must be installed separately (in addition to ROCm) for the current release.

For additional information, visit the AMD vLLM GitHub page.

Note that this is a benchmarking demo/example. Installation for other vLLM models/configurations may differ.

Prerequisites#

Additional information#

  • AMD recommends a 40GB GPU for 70B use cases.
    Ensure that your GPU has enough VRAM for the chosen model (see the VRAM check after this list).

  • This example highlights the use of the AMD vLLM Docker image with Llama-3 70B and GPTQ quantization (as shown at Computex).
    However, performance is not limited to this specific Hugging Face model; other vLLM-supported models can also be used.
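
A quick way to confirm available VRAM before choosing a model is to query the GPU from the host. This is a minimal sketch that assumes the rocm-smi utility from your ROCm installation; output formatting varies between ROCm releases.

# Report total and used VRAM for each GPU
rocm-smi --showmeminfo vram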

Download and install Docker image#

Download Docker image#

Select the applicable Ubuntu version to download the compatible Docker image before starting.

Ubuntu 24.04:

docker pull rocm/vllm-dev:rocm6.4.2_navi_ubuntu24.04_py3.12_pytorch_2.7_vllm_0.9.2

Ubuntu 22.04:

docker pull rocm/vllm-dev:rocm6.4.2_navi_ubuntu22.04_py3.10_pytorch_2.7_vllm_0.9.2
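
If you are not sure which tag matches your host, you can check the Ubuntu release and then confirm the pull succeeded. This is a simple sketch; lsb_release ships with standard Ubuntu installations.

# Print the Ubuntu release number (for example, 24.04 or 22.04)
lsb_release -rs

# Confirm the image is available locally after the pull
docker images rocm/vllm-dev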

Note

For more information, see rocm/vllm-dev.

Installation#

Follow these steps to run the vLLM Docker container and start using Llama2 and Llama3.

  1. Start the Docker container.

    Important!
    Refer to the WSL-specific configurations below for instructions when working in a WSL environment.

    docker run -it \
      --privileged \
      --device=/dev/kfd \
      --device=/dev/dri \
      --network=host \
      --group-add sudo \
      -w /app/vllm/ \
      --name <container_name> \
      <image_name> \
      /bin/bash
    

    Note

    • The container_name is user defined. Be sure to name your Docker container using this value.
    • Replace image_name with the vLLM Docker image tag pulled in the previous step.
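
    Once the container is running, you can confirm that the GPU is visible inside it. This is a minimal check, assuming the ROCm tools bundled in the image; the exact output depends on your hardware.

    # List the GPUs detected inside the container
    rocm-smi

    # Optionally confirm that PyTorch sees the device (ROCm builds expose GPUs through the torch.cuda API)
    python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"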

    WSL-specific configurations

    Optional: Only applicable when using a WSL configuration.
    Select the applicable command based on the Ubuntu version of the Docker image you pulled.

    Ubuntu 24.04:

    docker run -it \
    --network=host \
    --group-add=video \
    --ipc=host \
    --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined \
    --device /dev/dxg \
    --entrypoint /bin/bash \
    -v /usr/lib/wsl/lib/libdxcore.so:/usr/lib/libdxcore.so \
    -v /opt/rocm/lib/libhsa-runtime64.so.1:/opt/rocm/lib/libhsa-runtime64.so.1 \
    -w /app/vllm/ \
    --name vllm_rocm_container \
    rocm/vllm-dev:rocm6.4.2_navi_ubuntu24.04_py3.12_pytorch_2.7_vllm_0.9.2 \
    -c "sed -i 's/is_rocm = False/is_rocm = True/g' /opt/conda/envs/py_3.12/lib/python3.12/site-packages/vllm/platforms/__init__.py && /bin/bash"
    
    Ubuntu 22.04:

    docker run -it \
    --network=host \
    --group-add=video \
    --ipc=host \
    --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined \
    --device /dev/dxg \
    --entrypoint /bin/bash \
    -v /usr/lib/wsl/lib/libdxcore.so:/usr/lib/libdxcore.so \
    -v /opt/rocm/lib/libhsa-runtime64.so.1:/opt/rocm/lib/libhsa-runtime64.so.1 \
    -w /app/vllm/ \
    --name vllm_rocm_container \
    rocm/vllm-dev:rocm6.4.2_navi_ubuntu22.04_py3.10_pytorch_2.7_vllm_0.9.2 \
    -c "sed -i 's/is_rocm = False/is_rocm = True/g' /opt/conda/envs/py_3.10/lib/python3.10/site-packages/vllm/platforms/__init__.py && /bin/bash"
    
  2. Clone the Hugging Face model repository within the Docker container.

    apt update
    apt install git-lfs
    git lfs clone https://huggingface.co/TechxGenus/Meta-Llama-3-70B-Instruct-GPTQ
    
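
    The 70B GPTQ checkpoint is tens of gigabytes, so it is worth confirming that the LFS weights downloaded fully before benchmarking. This is a quick sketch; file names can vary between revisions of the Hugging Face repository.

    # Check the total size of the cloned model directory
    du -sh Meta-Llama-3-70B-Instruct-GPTQ

    # List the quantized weight shards (stored as safetensors files)
    ls -lh Meta-Llama-3-70B-Instruct-GPTQ/*.safetensors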
  3. Run benchmarks with the Docker container.

    python3 /app/vllm/benchmarks/benchmark_latency.py --model /app/vllm/Meta-Llama-3-70B-Instruct-GPTQ -q gptq --batch-size 1 --input-len 1024 --output-len 1024 --max-model-len 2048
    
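
    If you also want a throughput (tokens per second) measurement in addition to per-request latency, the benchmarks directory includes a throughput script. This is a hedged sketch; flag names can change between vLLM releases, so check the script's --help output first.

    python3 /app/vllm/benchmarks/benchmark_throughput.py \
      --model /app/vllm/Meta-Llama-3-70B-Instruct-GPTQ \
      -q gptq \
      --num-prompts 16 \
      --input-len 1024 \
      --output-len 1024 \
      --max-model-len 2048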

    Note
    Ensure that the model is downloaded and the vLLM checkout is set to your current working directory within the container started in Step 1.

    Note
    Set the following environment variable prior to running models using vLLM in V1 mode:
    VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1
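
    For example, in the container shell before launching the benchmark (an illustrative snippet):

    # Enable the prefill/decode attention path when running vLLM in V1 mode
    export VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1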

Your environment is set up to use Llama2 and Llama3.
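
Beyond offline benchmarking, the same container can serve the model through vLLM's OpenAI-compatible API. The following is a minimal sketch, assuming the vllm CLI entry point available in the image; adjust the model path and flags to match your setup.

# Start an OpenAI-compatible server on port 8000 (default)
vllm serve /app/vllm/Meta-Llama-3-70B-Instruct-GPTQ --quantization gptq --max-model-len 2048

# From another shell, send a test completion request
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "/app/vllm/Meta-Llama-3-70B-Instruct-GPTQ", "prompt": "Hello, my name is", "max_tokens": 32}'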