llama.cpp compatibility#

2025-09-09

5 min read time

Applies to Linux

llama.cpp is an open-source framework for Large Language Model (LLM) inference that runs on both central processing units (CPUs) and graphics processing units (GPUs). It is written in plain C/C++, providing a simple, dependency-free setup.

The framework supports multiple quantization options, from 1.5-bit to 8-bit integers, to speed up inference and reduce memory usage. Originally built as a CPU-first library, llama.cpp is easy to integrate with other programming environments and is widely adopted across diverse platforms, including consumer devices.

ROCm support for llama.cpp is upstreamed, so you can build the official source code with ROCm support; a build sketch follows the note below.

Note

llama.cpp is supported on ROCm 6.4.0.
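The following is a minimal build sketch using the HIP backend. The GGML_HIP and AMDGPU_TARGETS CMake options and the HIPCXX/HIP_PATH settings follow recent upstream llama.cpp build documentation and may differ for older releases; the gfx942 target (MI300X) is only an example.

```shell
# Clone the upstream repository (ROCm/HIP support is upstreamed).
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# Configure with the HIP backend. Replace gfx942 (MI300X) with the
# architecture of your GPU, for example gfx90a for MI210.
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
    cmake -S . -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx942 -DCMAKE_BUILD_TYPE=Release

# Build the binaries (llama-cli, llama-server, and so on).
cmake --build build --config Release -- -j$(nproc)
```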

Supported devices#

Officially supported: AMD Instinct™ MI300X and MI210

Use cases and recommendations#

llama.cpp can be applied in a variety of scenarios, particularly when you need to meet one or more of the following requirements:

  • Plain C/C++ implementation with no external dependencies

  • Support for 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization for faster inference and reduced memory usage

  • Custom HIP (Heterogeneous-compute Interface for Portability) kernels for running LLMs on AMD GPUs

  • CPU + GPU hybrid inference for partially accelerating models larger than the total available VRAM (video random-access memory); see the example after this list
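As a sketch of the hybrid CPU + GPU case, the `--n-gpu-layers` (`-ngl`) option of the llama-cli binary controls how many model layers are offloaded to the GPU; the model path and layer count below are placeholders.

```shell
# Hypothetical example: the model file and layer count are placeholders.
# Offload 24 layers to the GPU and run the remaining layers on the CPU,
# so a model larger than the available VRAM can still be partially accelerated.
./build/bin/llama-cli \
    -m ./models/llama-2-13b.Q4_K_M.gguf \
    --n-gpu-layers 24 \
    -p "Explain hybrid CPU/GPU inference in one sentence."
```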

llama.cpp is also used in a range of real-world applications, including:

  • Games such as Lucy’s Labyrinth: A simple maze game where AI-controlled agents attempt to trick the player.

  • Tools such as Styled Lines: A proprietary, asynchronous inference wrapper for Unity3D game development, including pre-built mobile and web platform wrappers and a model example.

  • Various other AI applications use llama.cpp as their inference engine; for a detailed list, see the user interfaces (UIs) section of the upstream llama.cpp documentation.

Refer to the AMD ROCm blog, where you can search for llama.cpp examples and best practices to optimize your workloads on AMD GPUs.

Docker image compatibility#

AMD validates and publishes llama.cpp Docker images with ROCm backends on Docker Hub. The following Docker image tags and their associated inventories were tested on ROCm 6.4.0; click the links in the table to view the images on Docker Hub. A pull-and-run example follows the table.

Important

Tags ending in _full, _server, and _light provide different entry points:

  • Full: This image includes the main executable and the tools to convert LLaMA models to GGML/GGUF format and quantize them to 4-bit (see the example after this list).

  • Server: This image includes only the server executable.

  • Light: This image includes only the main executable.
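As a sketch of the conversion workflow provided by the _full image, the convert_hf_to_gguf.py script and the llama-quantize tool come from upstream llama.cpp; the model paths and the tool locations inside the image are assumptions.

```shell
# Hypothetical paths: adjust to where models are mounted in the container.
# Convert a Hugging Face checkpoint to GGUF...
python3 convert_hf_to_gguf.py /models/Llama-2-7b-hf --outfile /models/llama-2-7b-f16.gguf

# ...then quantize it to 4-bit (Q4_K_M).
./llama-quantize /models/llama-2-7b-f16.gguf /models/llama-2-7b-Q4_K_M.gguf Q4_K_M
```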

| Full Docker | Server Docker | Light Docker | llama.cpp | Ubuntu |
|---|---|---|---|---|
| rocm/llama.cpp | rocm/llama.cpp | rocm/llama.cpp | b5997 | 24.04 |
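A minimal pull-and-run sketch is shown below. The tag, model path, and port are placeholders; the device, group, and security options are the usual requirements for ROCm containers, and the command assumes the _server image's entry point is the llama.cpp server binary.

```shell
# Pull a published image (replace <tag> with an actual tag from Docker Hub,
# for example one ending in _server).
docker pull rocm/llama.cpp:<tag>

# Run the server image with GPU access; /path/to/models and the port are
# placeholders.
docker run -it --rm \
    --device=/dev/kfd --device=/dev/dri \
    --group-add video --security-opt seccomp=unconfined \
    -v /path/to/models:/models -p 8080:8080 \
    rocm/llama.cpp:<tag> \
    -m /models/model.gguf --host 0.0.0.0 --port 8080
```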

Key ROCm libraries for llama.cpp#

llama.cpp functionality on ROCm is determined by its underlying library dependencies. These ROCm components affect the capabilities, performance, and feature set available to developers.

| ROCm library | Version | Purpose | Usage |
|---|---|---|---|
| hipBLAS | 2.4.0 | Provides GPU-accelerated Basic Linear Algebra Subprograms (BLAS) for matrix and vector operations. | Supports operations such as matrix multiplication, matrix-vector products, and tensor contractions. Used in both dense and batched linear algebra operations. |
| hipBLASLt | 0.12.1 | Extends hipBLAS with additional features, such as epilogues fused into the matrix multiplication kernel and the use of integer tensor cores. | Set the ROCBLAS_USE_HIPBLASLT environment variable to dispatch hipBLASLt kernels where possible (see the example after this table). |
| rocWMMA | 1.7.0 | Accelerates warp-level matrix multiply-accumulate operations to speed up matrix multiplication (GEMM) and accumulation with mixed-precision support. | Can be used to enhance flash attention performance on AMD GPUs by enabling the corresponding flag at compile time (see the example after this table). |
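The following sketch shows how these options are typically enabled. ROCBLAS_USE_HIPBLASLT is a rocBLAS environment variable, and GGML_HIP_ROCWMMA_FATTN is the CMake option recent upstream llama.cpp uses for rocWMMA-based flash attention; verify both names against the documentation for the versions you are using, and treat gfx942 as an example target.

```shell
# Dispatch hipBLASLt kernels where possible at run time.
export ROCBLAS_USE_HIPBLASLT=1

# Rebuild llama.cpp with rocWMMA-accelerated flash attention enabled.
cmake -S . -B build -DGGML_HIP=ON -DGGML_HIP_ROCWMMA_FATTN=ON \
    -DAMDGPU_TARGETS=gfx942 -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -- -j$(nproc)
```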