vLLM Docker image for Llama2 and Llama3#
vLLM is a fast and easy-to-use library for LLM inference and serving.
Llama2 and Llama3 support is enabled via a vLLM Docker image that must be built separately (in addition to ROCm) for the current release.
For additional information, visit the vLLM GitHub page.
Note that this is a benchmarking demo/example. Installation for other vLLM models/configurations may differ.
Prerequisites#
GitHub is authenticated.
Additional information#
AMD recommends a GPU with 40 GB of VRAM for 70B use cases.
Ensure that your GPU has enough VRAM for the chosen model.
This example highlights use of the AMD vLLM Docker image with Llama-3 70B and GPTQ quantization (as shown at Computex).
However, performance is not limited to this specific Hugging Face model, and other vLLM-supported models can also be used.
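As a rough sanity check on the VRAM recommendation, the weight footprint of a model can be estimated as parameter count times bits per weight. The sketch below assumes 4-bit weights for the GPTQ model (an assumption; check the model card) and ignores the KV cache, activations, and quantization metadata, so the real requirement is higher than the number it prints. The function name `weights_gb` is illustrative only.

```shell
# Rough estimate of weight memory in GB: params (billions) * bits / 8.
# Ignores KV cache, activations, and quantization metadata.
weights_gb() {
    local params_billion=$1
    local bits_per_weight=$2
    echo $(( params_billion * bits_per_weight / 8 ))
}

weights_gb 70 4    # 4-bit weights (assumed for this GPTQ model)
weights_gb 70 16   # 16-bit weights, for comparison
```

Under these assumptions, a 70B model at 4 bits per weight needs about 35 GB for weights alone, which leaves little headroom on a 40 GB GPU.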
Installation steps#
Follow these steps to build a vLLM Docker image and start using Llama2 and Llama3.
Clone the ROCm/vllm repository.
git clone -b vllm0.4.1_llama_70b_gptq https://github.com/ROCm/vllm.git
Change directory to vllm, and build the Docker image.
sudo docker build -f Dockerfile.rocm -t <image_name> .
Note
The Docker image_name is user defined. Ensure that you name your Docker image using this value.
Example: vllm0.4.1_rocm6.1.1_ubuntu20.04_py3.9_image
Optional: Map the vllm directory from the host to the Docker container.
Start the Docker container.
sudo docker run -it --privileged --device=/dev/kfd --device=/dev/dri --network=host --group-add sudo -w /root/workspace --name <container_name> <image_name> /bin/bash
Note
The container_name is user defined. Ensure that you name your Docker container using this value.
Example: vllm0.4.1_rocm6.1.1_ubuntu20.04_py3.9_container
Clone the vLLM GitHub repository within the Docker container.
This step is not necessary if the vllm directory was mapped from the host in Step 2.
git clone -b vllm0.4.1_llama_70b_gptq https://github.com/ROCm/vllm.git
Clone the Hugging Face model repository within the Docker container.
git lfs clone https://huggingface.co/TechxGenus/Meta-Llama-3-70B-Instruct-GPTQ
Run benchmarks within the Docker container.
python3 vllm/benchmarks/benchmark_latency.py --model /root/workspace/Meta-Llama-3-70B-Instruct-GPTQ --batch-size 1 --input-len 1024 --output-len 1024
Note
Ensure that the model is downloaded and that your current directory contains the vLLM checkout, within the container started in Step 3.
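The check described in this note could be scripted roughly as follows. This is a minimal sketch, assuming the directory layout used in this guide (a `vllm` checkout and the model directory side by side in the workspace); the function name `check_benchmark_inputs` is hypothetical. It only confirms that the paths exist and does not verify that the Git LFS weight files were fully downloaded.

```shell
# Return 0 if both the vLLM benchmark script and the model directory exist
# under the given workspace; otherwise report what is missing and return 1.
check_benchmark_inputs() {
    local workspace=${1:-.}
    local status=0
    if [ ! -f "$workspace/vllm/benchmarks/benchmark_latency.py" ]; then
        echo "missing: vllm/benchmarks/benchmark_latency.py" >&2
        status=1
    fi
    if [ ! -d "$workspace/Meta-Llama-3-70B-Instruct-GPTQ" ]; then
        echo "missing: Meta-Llama-3-70B-Instruct-GPTQ" >&2
        status=1
    fi
    return $status
}

# Example: check the container workspace used in this guide.
check_benchmark_inputs /root/workspace || echo "fix the items above before benchmarking"
```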
Your environment is set up to use Llama2 and Llama3.
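The optional host directory mapping mentioned in Step 2 is not spelled out in the steps above. One way to do it is Docker's `-v` bind-mount option on the `docker run` command. The sketch below is an illustration under assumptions: the host clone is assumed to be at `$PWD/vllm`, the mount target `/root/workspace/vllm` is chosen to match the `-w /root/workspace` working directory, and the image and container names are the example values from the notes above. It prints the command instead of running it, so it can be reviewed first.

```shell
# Example names taken from the notes in this guide; substitute your own.
IMAGE_NAME="vllm0.4.1_rocm6.1.1_ubuntu20.04_py3.9_image"
CONTAINER_NAME="vllm0.4.1_rocm6.1.1_ubuntu20.04_py3.9_container"
HOST_VLLM_DIR="$PWD/vllm"   # assumed location of the host clone

# Print (rather than run) the docker command with a bind mount added.
print_run_cmd() {
    echo "sudo docker run -it --privileged --device=/dev/kfd --device=/dev/dri" \
         "--network=host --group-add sudo -w /root/workspace" \
         "-v $HOST_VLLM_DIR:/root/workspace/vllm" \
         "--name $CONTAINER_NAME $IMAGE_NAME /bin/bash"
}

print_run_cmd
```

With the directory mapped this way, the clone in Step 4 can be skipped because the host checkout is already visible inside the container.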