llama.cpp compatibility#
2025-09-09
5 min read time
llama.cpp is an open-source framework for Large Language Model (LLM) inference that runs on both central processing units (CPUs) and graphics processing units (GPUs). It is written in plain C/C++, providing a simple, dependency-free setup.
The framework supports multiple quantization options, from 1.5-bit to 8-bit integers, to speed up inference and reduce memory usage. Originally built as a CPU-first library, llama.cpp is easy to integrate with other programming environments and is widely adopted across diverse platforms, including consumer devices.
ROCm support for llama.cpp is upstreamed, so you can build the official source code with the ROCm backend enabled.
ROCm support for llama.cpp is hosted in the official ROCm/llama.cpp repository.
Due to independent compatibility considerations, this location differs from the ggml-org/llama.cpp upstream repository.
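A build-from-source sketch, following the upstream HIP build instructions, might look like the following. The CMake flags (GGML_HIP, AMDGPU_TARGETS) and the gfx942 architecture value (AMD Instinct MI300X) are taken from the upstream build guide and may vary between llama.cpp versions, so check the installation guides referenced below for the exact steps.

```bash
# Minimal build sketch, assuming a ROCm toolchain is already installed.
# Adjust AMDGPU_TARGETS to the architecture of your GPU (gfx942 = MI300X).
git clone https://github.com/ROCm/llama.cpp.git
cd llama.cpp

HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
  cmake -S . -B build \
    -DGGML_HIP=ON \
    -DAMDGPU_TARGETS=gfx942 \
    -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j
```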
To install llama.cpp, use the prebuilt Docker image, which includes ROCm, llama.cpp, and all required dependencies.
See the ROCm llama.cpp installation guide to install and get started.
See the Installation guide in the upstream llama.cpp documentation.
Note
llama.cpp is supported on ROCm 6.4.0.
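As a sketch of the Docker-based setup, the commands below pull a prebuilt image and start a container with GPU access. The <tag> placeholder stands for one of the published tags on Docker Hub (see the table in the Docker image compatibility section), and the mounted models directory is only an example; arguments for the bundled executable can be appended after the image name.

```bash
# Sketch: pull a prebuilt ROCm llama.cpp image; replace <tag> with a published
# tag from Docker Hub (for example, a *_full tag).
docker pull rocm/llama.cpp:<tag>

# Start a container with AMD GPU access; these device, security, and group
# options are the usual ones for exposing AMD GPUs to a container.
docker run -it --rm \
  --device=/dev/kfd --device=/dev/dri \
  --security-opt seccomp=unconfined \
  --group-add video \
  -v "$PWD/models:/models" \
  rocm/llama.cpp:<tag>
```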
Supported devices#
Officially Supported: AMD Instinct™ MI300X, MI210
Use cases and recommendations#
llama.cpp can be applied in a variety of scenarios, particularly when you need to meet one or more of the following requirements:
Plain C/C++ implementation with no external dependencies
Support for 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization for faster inference and reduced memory usage
Custom HIP (Heterogeneous-compute Interface for Portability) kernels for running LLMs on AMD GPUs
CPU + GPU hybrid inference to partially accelerate models that are larger than the total available VRAM (video random-access memory); a sketch of this workflow follows this list
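To illustrate the quantization and hybrid-offload workflow, the sketch below quantizes an FP16 GGUF model to 4-bit and then runs it with only part of its layers offloaded to the GPU. The file names are placeholders; the llama-quantize and llama-cli tools and the flags shown follow the standard llama.cpp command-line interface.

```bash
# Sketch, assuming a llama.cpp build (or image) with the HIP backend and a
# GGUF model on disk. File names are placeholders.

# Quantize an FP16 GGUF model to 4-bit (Q4_K_M) to reduce memory use.
./build/bin/llama-quantize models/model-f16.gguf models/model-q4_k_m.gguf Q4_K_M

# Hybrid CPU + GPU inference: offload only 20 layers to the GPU when the model
# does not fit entirely in VRAM (-ngl 99 would offload all layers).
./build/bin/llama-cli -m models/model-q4_k_m.gguf -ngl 20 -p "Hello"
```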
llama.cpp is also used in a range of real-world applications, including:
Games such as Lucy’s Labyrinth: A simple maze game where AI-controlled agents attempt to trick the player.
Tools such as Styled Lines: A proprietary, asynchronous inference wrapper for Unity3D game development, including pre-built mobile and web platform wrappers and a model example.
Various other AI applications use llama.cpp as their inference engine; for a detailed list, see the user interfaces (UIs) section of the upstream llama.cpp documentation.
Refer to the AMD ROCm blog, where you can search for llama.cpp examples and best practices to optimize your workloads on AMD GPUs.
Docker image compatibility#
AMD validates and publishes ROCm llama.cpp Docker images on Docker Hub. The following Docker image tags and their associated inventories were tested with ROCm 6.4.0.
Important
The tag suffixes _full, _server, and _light correspond to different entrypoints, as follows:
Full: This image includes the main executable file as well as the tools to convert LLaMA models into ggml format and quantize them to 4-bit.
Server: This image includes only the server executable file.
Light: This image includes only the main executable file.
| Full Docker | Server Docker | Light Docker | llama.cpp | Ubuntu |
|---|---|---|---|---|
| rocm/llama.cpp | rocm/llama.cpp | rocm/llama.cpp | | 24.04 |
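As an example of the _server variant, the following sketch starts the bundled server and maps its HTTP port. The <server-tag> and model path are placeholders, and it assumes the image's entrypoint is the server executable (as noted above), so the remaining arguments are passed straight to it.

```bash
# Sketch: serve a local GGUF model with a *_server image; <server-tag> and the
# model path are placeholders.
docker run -it --rm \
  --device=/dev/kfd --device=/dev/dri \
  --security-opt seccomp=unconfined \
  --group-add video \
  -p 8080:8080 \
  -v "$PWD/models:/models" \
  rocm/llama.cpp:<server-tag> \
  -m /models/model-q4_k_m.gguf --host 0.0.0.0 --port 8080 -ngl 99
```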
Key ROCm libraries for llama.cpp#
llama.cpp functionality on ROCm is determined by its underlying library dependencies. These ROCm components affect the capabilities, performance, and feature set available to developers.
| ROCm library | Version | Purpose | Usage |
|---|---|---|---|
| hipBLAS | 2.4.0 | Provides GPU-accelerated Basic Linear Algebra Subprograms (BLAS) for matrix and vector operations. | Supports operations such as matrix multiplication, matrix-vector products, and tensor contractions. Used in both dense and batched linear algebra operations. |
| hipBLASLt | 0.12.1 | An extension of the hipBLAS library that provides additional features such as epilogues fused into the matrix multiplication kernel and the use of integer tensor cores. | Enabled by setting the corresponding flag at runtime (see the sketch after this table). |
| rocWMMA | 1.7.0 | Accelerates warp-level matrix-multiply and matrix-accumulate (WMMA) operations to speed up matrix multiplication (GEMM) and accumulation with mixed-precision support. | Can be used to enhance flash attention performance on AMD GPUs by enabling the corresponding flag at compile time. |
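The Usage column above refers to a runtime flag for hipBLASLt and a compile-time flag for rocWMMA without naming them. As an assumption based on the upstream llama.cpp build options and the rocBLAS runtime, they may correspond to the settings sketched below; verify the exact names against the installation guide for your llama.cpp and ROCm versions.

```bash
# Sketch with assumed flag names; verify against your llama.cpp and ROCm versions.

# Compile time: enable rocWMMA-accelerated flash attention in the HIP backend.
cmake -S . -B build -DGGML_HIP=ON -DGGML_HIP_ROCWMMA_FATTN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j

# Runtime: ask rocBLAS to dispatch supported GEMMs to hipBLASLt.
ROCBLAS_USE_HIPBLASLT=1 ./build/bin/llama-cli -m models/model-q4_k_m.gguf -ngl 99 -p "Hello"
```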