llama.cpp on ROCm documentation#
2026-02-27
Run llama.cpp on ROCm for optimized LLM inference on AMD Instinct GPUs and CPUs, enabling low-latency, memory-efficient on-premises deployments for chat, summarization, and code assistance.
llama.cpp is an open-source inference library and framework for Large Language Models (LLMs) that runs on both central processing units (CPUs) and graphics processing units (GPUs). It is written in plain C/C++, providing a simple, dependency-free setup.
llama.cpp on ROCm supports multiple quantization options, from 1.5-bit to 8-bit integer formats, to accelerate inference and reduce memory usage. Originally built as a CPU-first library, llama.cpp is easy to integrate with other programming environments and is widely adopted across diverse platforms, including consumer devices.
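As a rough sketch of the workflow described above, the commands below build llama.cpp with its HIP (ROCm) backend enabled and run inference against a quantized GGUF model. The `GGML_HIP` CMake flag is the upstream option for the HIP backend; the model file name and prompt are illustrative assumptions, not part of this page.

```shell
# Sketch: build llama.cpp with the ROCm/HIP backend, then run a
# quantized model. Model path and prompt are hypothetical examples.
git clone https://github.com/ROCm/llama.cpp
cd llama.cpp

# Configure with the HIP backend enabled (upstream CMake flag GGML_HIP).
cmake -B build -DGGML_HIP=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j

# Run inference with a 4-bit quantized GGUF model, offloading
# layers to the GPU with -ngl.
./build/bin/llama-cli \
    -m ./models/model-Q4_K_M.gguf \
    -p "Summarize this document in one sentence." \
    -ngl 99
```

The `Q4_K_M` suffix in the example file name denotes one of llama.cpp's 4-bit quantization formats; lower-bit formats trade some accuracy for further memory savings, which is the trade-off the quantization options above expose.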
llama.cpp is part of the ROCm-LLMExt toolkit.
The llama.cpp public repository is located at ROCm/llama.cpp.
To contribute to the documentation, refer to Contributing to ROCm.
You can find licensing information on the Licensing page.