vLLM inference

vLLM is an open-source library for fast, memory-efficient LLM inference and serving. This page describes how to set up and run vLLM on supported AMD GPUs and APUs using either a prebuilt Docker image or pip.
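As a rough sketch, the two installation paths look like the following. The `rocm/vllm` image tag and the device flags reflect common ROCm container usage and should be checked against the current ROCm documentation for your release:

```shell
# Option 1: pull a prebuilt ROCm vLLM Docker image (the tag here is an
# assumption; check Docker Hub or the ROCm docs for the current one) and
# start a container with GPU device access.
docker pull rocm/vllm:latest
docker run -it --network=host \
  --device=/dev/kfd --device=/dev/dri \
  --group-add video \
  rocm/vllm:latest

# Option 2: install vLLM into an existing ROCm-enabled Python environment.
pip install vllm
```

The `--device=/dev/kfd --device=/dev/dri` and `--group-add video` flags are the standard way to expose AMD GPUs to a container.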

Supported AMD device families:

- Instinct GPUs
- Radeon PRO GPUs
- Radeon GPUs
- Ryzen APUs

Prerequisites

Ensure your system has Python 3.12 installed and accessible. Review the ROCm 7.12.0 compatibility matrix for detailed support information.
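To confirm that the interpreter on your `PATH` meets the requirement, a quick check such as the following can help:

```shell
# Print the Python version on PATH; expect a 3.12.x release.
python3 --version
```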

Get started

After setting up your environment, follow the vLLM 0.16.0 usage documentation to get started: Using vLLM.
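As a minimal offline-inference sketch using vLLM's Python API, assuming vLLM is installed and a supported GPU is visible (the model name is a placeholder; substitute any model you have access to):

```python
from vllm import LLM, SamplingParams

# Placeholder model identifier; substitute your own.
llm = LLM(model="facebook/opt-125m")

sampling_params = SamplingParams(temperature=0.8, max_tokens=64)
prompts = ["The capital of France is"]

# generate() returns one RequestOutput per prompt.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```

See the linked Using vLLM documentation for serving mode (`vllm serve`) and the full set of sampling options.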

Known issues