llama.cpp on ROCm documentation#
2026-02-27
Run llama.cpp on ROCm for optimized LLM inference on AMD Instinct GPUs and CPUs, enabling low-latency, memory-efficient on-premises deployments for chat, summarization, and code assistance.
llama.cpp is an open-source inference library and framework for Large Language Models (LLMs) that runs on both central processing units (CPUs) and graphics processing units (GPUs). It is written in plain C/C++, providing a simple, dependency-free setup.
llama.cpp on ROCm supports multiple quantization options, from 1.5-bit to 8-bit integer formats, to accelerate inference and reduce memory usage. Originally built as a CPU-first library, llama.cpp is easy to integrate with other programming environments and is widely adopted across diverse platforms, including consumer devices.
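As a rough sketch of the workflow described above, the commands below build llama.cpp with its HIP (ROCm) backend enabled and run inference against a quantized GGUF model. The `GGML_HIP` CMake flag is the upstream option for the HIP backend; the model file name and prompt are illustrative assumptions, not part of this page.

```shell
# Sketch: build llama.cpp with the ROCm/HIP backend, then run a
# quantized model. Model path and prompt are hypothetical examples.
git clone https://github.com/ROCm/llama.cpp
cd llama.cpp

# Configure with the HIP backend enabled (upstream CMake flag GGML_HIP).
cmake -B build -DGGML_HIP=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j

# Run inference with a 4-bit quantized GGUF model, offloading
# layers to the GPU with -ngl.
./build/bin/llama-cli \
    -m ./models/model-Q4_K_M.gguf \
    -p "Summarize this document in one sentence." \
    -ngl 99
```

The `Q4_K_M` suffix in the example file name denotes one of llama.cpp's 4-bit quantization formats; lower-bit formats trade some accuracy for further memory savings, which is the trade-off the quantization options above expose.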
llama.cpp is part of the ROCm-LLMExt toolkit.
The llama.cpp public repository is located at ROCm/llama.cpp.
To contribute to the documentation, refer to Contributing to ROCm.
You can find licensing information on the Licensing page.