# Llama.cpp pre-built binaries
llama.cpp is an open-source framework for Large Language Model (LLM) inference that runs on both central processing units (CPUs) and graphics processing units (GPUs).
This document provides installation instructions for the AMD-validated llama.cpp prebuilt binaries. These are pre-compiled, stable executables (such as `llama-server` and `llama-bench`) that are ready to run on a Linux system without requiring any compilation.
Download the AMD-validated Linux binary package and extract it.
Download the prebuilt binary package that matches your GPU architecture. For gfx1150/gfx1151 GPUs:

```shell
wget -O llama-bin-linux.zip "https://repo.radeon.com/rocm/llama.cpp/linux/rocm-rel-7.1.1/llama-b7146-ubuntu-24.04-rocm-7.1.1-gfx1150-gfx1151-x64.zip"
```
For gfx110X/gfx120X GPUs:

```shell
wget -O llama-bin-linux.zip "https://repo.radeon.com/rocm/llama.cpp/linux/rocm-rel-7.1.1/llama-b7146-ubuntu-24.04-rocm-7.1.1-gfx110X-gfx120X-x64.zip"
```
Unzip the package into a new directory.

```shell
unzip llama-bin-linux.zip -d ./llama_cpp_binaries
```
Navigate into the inner directory.

```shell
cd ./llama_cpp_binaries/<specific_folder_name>
```
Make the binaries executable. Once in the new directory, grant the `llama-server`, `llama-bench`, and `llama-cli` tools execute permissions.

```shell
chmod +x ./llama-server
chmod +x ./llama-bench
chmod +x ./llama-cli
```
Download a test model. These binaries are the "engine"; you still need a model file (in GGUF format) to run. For this tutorial, we will download the GPT-OSS-20B model for testing.

```shell
wget -O test_model.gguf "https://huggingface.co/ggml-org/gpt-oss-20b-GGUF/resolve/main/gpt-oss-20b-mxfp4.gguf"
```
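Before running anything, it can be worth sanity-checking the download. The sketch below is optional and relies only on the fact that a GGUF file begins with the 4-byte ASCII magic `GGUF`:

```shell
# Inspect the first four bytes of the downloaded file: a valid GGUF
# model starts with the ASCII magic "GGUF".
magic=$(head -c 4 test_model.gguf)
if [ "$magic" = "GGUF" ]; then
  echo "test_model.gguf looks like a valid GGUF file"
else
  echo "test_model.gguf does not start with the GGUF magic; re-download it" >&2
fi
```

A truncated or HTML-error download will fail this check immediately, which is much faster than waiting for `llama-server` to reject the file.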
Run llama-server. `llama-server` is a lightweight, OpenAI-compatible web server included with `llama.cpp` that hosts your model locally. Once running, it provides a simple web interface that allows you to chat with the model directly in your browser.

```shell
# Start the server
# -ngl 99: offload all layers to your AMD GPU (crucial for performance)
# -c: context length
# -fa: enable Flash Attention to reduce memory usage and increase speed
./llama-server -m test_model.gguf -c 2048 -ngl 99 -fa on --port 8080
```
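Besides the browser UI, the server exposes an OpenAI-compatible REST API. A minimal sketch of a chat request against the server started above (the port and prompt are the values used in this tutorial; adjust them to your setup):

```shell
# Send a chat request to the locally running llama-server instance.
# The /v1/chat/completions endpoint follows the OpenAI API shape.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64
      }'
```

Because the API is OpenAI-compatible, existing OpenAI client libraries can also be pointed at `http://localhost:8080/v1` instead of using raw `curl`.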
(Optional) Run a benchmark. Now, run the llama-bench tool against the test model. This command will load the model and run a standardized performance test, measuring your system's prompt processing (PP) and token generation (TG) speed.

```shell
# Run the benchmark with the downloaded model.
# -m: specifies the model file
./llama-bench -m ./test_model.gguf
```
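If you want to vary the workload rather than rely on the defaults, `llama-bench` accepts flags for the test sizes. The values below are illustrative, not recommendations:

```shell
# -p: prompt-processing test length, in tokens
# -n: token-generation test length, in tokens
# -ngl: number of layers offloaded to the GPU
# -r: number of repetitions to average over
./llama-bench -m ./test_model.gguf -p 512 -n 128 -ngl 99 -r 3
```

Larger `-p` values stress prompt processing (compute-bound), while `-n` stresses token generation (memory-bandwidth-bound), so varying both gives a fuller picture of the system.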