llama.cpp compatibility#

2025-09-09

5 min read time

Applies to Linux

llama.cpp is an open-source framework for Large Language Model (LLM) inference that runs on both central processing units (CPUs) and graphics processing units (GPUs). It is written in plain C/C++, providing a simple, dependency-free setup.

The framework supports multiple quantization options, from 1.5-bit to 8-bit integers, to speed up inference and reduce memory usage. Originally built as a CPU-first library, llama.cpp is easy to integrate with other programming environments and is widely adopted across diverse platforms, including consumer devices.

ROCm support for llama.cpp is upstreamed, so you can build the official source code with ROCm support; a build sketch follows the note below.

Note

llama.cpp is supported on ROCm 6.4.0.
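The following is a minimal build sketch using the HIP backend. The GGML_HIP and AMDGPU_TARGETS CMake options and the HIPCXX/HIP_PATH settings follow recent upstream llama.cpp build documentation and may differ for older releases; the gfx942 target (MI300X) is only an example.

```shell
# Clone the upstream repository (ROCm/HIP support is upstreamed).
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# Configure with the HIP backend. Replace gfx942 (MI300X) with the
# architecture of your GPU, for example gfx90a for MI210.
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
    cmake -S . -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx942 -DCMAKE_BUILD_TYPE=Release

# Build the binaries (llama-cli, llama-server, and so on).
cmake --build build --config Release -- -j$(nproc)
```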

Supported devices#

Officially supported: AMD Instinct™ MI300X and MI210

Use cases and recommendations#

llama.cpp can be applied in a variety of scenarios, particularly when you need to meet one or more of the following requirements:

  • Plain C/C++ implementation with no external dependencies

  • Support for 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization for faster inference and reduced memory usage

  • Custom HIP (Heterogeneous-compute Interface for Portability) kernels for running LLMs on AMD GPUs

  • CPU + GPU hybrid inference for partially accelerating models larger than the total available VRAM (video random-access memory); see the example after this list
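As a sketch of the hybrid CPU + GPU case, the `--n-gpu-layers` (`-ngl`) option of the llama-cli binary controls how many model layers are offloaded to the GPU; the model path and layer count below are placeholders.

```shell
# Hypothetical example: the model file and layer count are placeholders.
# Offload 24 layers to the GPU and run the remaining layers on the CPU,
# so a model larger than the available VRAM can still be partially accelerated.
./build/bin/llama-cli \
    -m ./models/llama-2-13b.Q4_K_M.gguf \
    --n-gpu-layers 24 \
    -p "Explain hybrid CPU/GPU inference in one sentence."
```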

llama.cpp is also used in a range of real-world applications, including:

  • Games such as Lucy’s Labyrinth: A simple maze game where AI-controlled agents attempt to trick the player.

  • Tools such as Styled Lines: A proprietary, asynchronous inference wrapper for Unity3D game development, including pre-built mobile and web platform wrappers and a model example.

  • Various other AI applications use llama.cpp as their inference engine; for a detailed list, see the user interfaces (UIs) section of the upstream llama.cpp documentation.

Refer to the AMD ROCm blog, where you can search for llama.cpp examples and best practices to optimize your workloads on AMD GPUs.

Docker image compatibility#

AMD validates and publishes llama.cpp Docker images with ROCm backends on Docker Hub. The following Docker image tags and their associated inventories were tested on ROCm 6.4.0; click the links in the table to view the images on Docker Hub. A pull-and-run example follows the table.

Important

Tags ending in _full, _server, and _light provide different entry points:

  • Full: This image includes the main executable and the tools to convert LLaMA models to GGML/GGUF format and quantize them to 4-bit (see the example after this list).

  • Server: This image includes only the server executable.

  • Light: This image includes only the main executable.
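As a sketch of the conversion workflow provided by the _full image, the convert_hf_to_gguf.py script and the llama-quantize tool come from upstream llama.cpp; the model paths and the tool locations inside the image are assumptions.

```shell
# Hypothetical paths: adjust to where models are mounted in the container.
# Convert a Hugging Face checkpoint to GGUF...
python3 convert_hf_to_gguf.py /models/Llama-2-7b-hf --outfile /models/llama-2-7b-f16.gguf

# ...then quantize it to 4-bit (Q4_K_M).
./llama-quantize /models/llama-2-7b-f16.gguf /models/llama-2-7b-Q4_K_M.gguf Q4_K_M
```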

| Full Docker | Server Docker | Light Docker | llama.cpp | Ubuntu |
|---|---|---|---|---|
| rocm/llama.cpp | rocm/llama.cpp | rocm/llama.cpp | b5997 | 24.04 |
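A minimal pull-and-run sketch is shown below. The tag, model path, and port are placeholders; the device, group, and security options are the usual requirements for ROCm containers, and the command assumes the _server image's entry point is the llama.cpp server binary.

```shell
# Pull a published image (replace <tag> with an actual tag from Docker Hub,
# for example one ending in _server).
docker pull rocm/llama.cpp:<tag>

# Run the server image with GPU access; /path/to/models and the port are
# placeholders.
docker run -it --rm \
    --device=/dev/kfd --device=/dev/dri \
    --group-add video --security-opt seccomp=unconfined \
    -v /path/to/models:/models -p 8080:8080 \
    rocm/llama.cpp:<tag> \
    -m /models/model.gguf --host 0.0.0.0 --port 8080
```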

Key ROCm libraries for llama.cpp#

llama.cpp functionality on ROCm is determined by its underlying library dependencies. These ROCm components affect the capabilities, performance, and feature set available to developers.

| ROCm library | Version | Purpose | Usage |
|---|---|---|---|
| hipBLAS | 2.4.0 | Provides GPU-accelerated Basic Linear Algebra Subprograms (BLAS) for matrix and vector operations. | Supports operations such as matrix multiplication, matrix-vector products, and tensor contractions. Used in both dense and batched linear algebra operations. |
| hipBLASLt | 0.12.1 | Extends hipBLAS with additional features, such as epilogues fused into the matrix multiplication kernel and the use of integer tensor cores. | Set the ROCBLAS_USE_HIPBLASLT environment variable to dispatch hipBLASLt kernels where possible (see the example after this table). |
| rocWMMA | 1.7.0 | Accelerates warp-level matrix multiply-accumulate operations to speed up matrix multiplication (GEMM) and accumulation with mixed-precision support. | Can be used to enhance flash attention performance on AMD GPUs by enabling the corresponding flag at compile time (see the example after this table). |
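The following sketch shows how these options are typically enabled. ROCBLAS_USE_HIPBLASLT is a rocBLAS environment variable, and GGML_HIP_ROCWMMA_FATTN is the CMake option recent upstream llama.cpp uses for rocWMMA-based flash attention; verify both names against the documentation for the versions you are using, and treat gfx942 as an example target.

```shell
# Dispatch hipBLASLt kernels where possible at run time.
export ROCBLAS_USE_HIPBLASLT=1

# Rebuild llama.cpp with rocWMMA-accelerated flash attention enabled.
cmake -S . -B build -DGGML_HIP=ON -DGGML_HIP_ROCWMMA_FATTN=ON \
    -DAMDGPU_TARGETS=gfx942 -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -- -j$(nproc)
```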