# Llama.cpp pre-built binaries
llama.cpp is an open-source framework for Large Language Model (LLM) inference that runs on both central processing units (CPUs) and graphics processing units (GPUs).
This document provides installation instructions for the AMD-validated llama.cpp prebuilt binaries. These are pre-compiled, stable executables (such as `llama-server` and `llama-bench`) that are ready to run on a Linux system without requiring any compilation.
Download the AMD-validated Linux binary package and extract it.
Download the prebuilt binary package that matches your GPU architecture. For gfx1150/gfx1151 GPUs:

```shell
wget -O llama-bin-linux.zip "https://repo.radeon.com/rocm/llama.cpp/linux/rocm-rel-7.1.1/llama-b7146-ubuntu-24.04-rocm-7.1.1-gfx1150-gfx1151-x64.zip"
```
For gfx110X/gfx120X GPUs:

```shell
wget -O llama-bin-linux.zip "https://repo.radeon.com/rocm/llama.cpp/linux/rocm-rel-7.1.1/llama-b7146-ubuntu-24.04-rocm-7.1.1-gfx110X-gfx120X-x64.zip"
```
Unzip the package into a new directory.

```shell
unzip llama-bin-linux.zip -d ./llama_cpp_binaries
```
Navigate into the inner directory.

```shell
cd ./llama_cpp_binaries/<specific_folder_name>
```
Make the binaries executable. Once in the new directory, grant the `llama-server`, `llama-bench`, and `llama-cli` tools execute permissions.

```shell
chmod +x ./llama-server
chmod +x ./llama-bench
chmod +x ./llama-cli
```
Download a test model. These binaries are the "engine"; you still need a model file (in GGUF format) to run. For this tutorial, we will download the GPT-OSS-20B model for testing.

```shell
wget -O test_model.gguf "https://huggingface.co/ggml-org/gpt-oss-20b-GGUF/resolve/main/gpt-oss-20b-mxfp4.gguf"
```
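Before running anything, it can be worth sanity-checking the download. The sketch below is optional and relies only on the fact that a GGUF file begins with the 4-byte ASCII magic `GGUF`:

```shell
# Inspect the first four bytes of the downloaded file: a valid GGUF
# model starts with the ASCII magic "GGUF".
magic=$(head -c 4 test_model.gguf)
if [ "$magic" = "GGUF" ]; then
  echo "test_model.gguf looks like a valid GGUF file"
else
  echo "test_model.gguf does not start with the GGUF magic; re-download it" >&2
fi
```

A truncated or HTML-error download will fail this check immediately, which is much faster than waiting for `llama-server` to reject the file.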
Run llama-server. `llama-server` is a lightweight, OpenAI-compatible web server included with `llama.cpp` that hosts your model locally. Once running, it provides a simple web interface that allows you to chat with the model directly in your browser.

```shell
# Start the server
# -ngl 99: offload all layers to your AMD GPU (crucial for performance)
# -c: context length
# -fa: enable Flash Attention to reduce memory usage and increase speed
./llama-server -m test_model.gguf -c 2048 -ngl 99 -fa on --port 8080
```
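Besides the browser UI, the server exposes an OpenAI-compatible REST API. A minimal sketch of a chat request against the server started above (the port and prompt are the values used in this tutorial; adjust them to your setup):

```shell
# Send a chat request to the locally running llama-server instance.
# The /v1/chat/completions endpoint follows the OpenAI API shape.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64
      }'
```

Because the API is OpenAI-compatible, existing OpenAI client libraries can also be pointed at `http://localhost:8080/v1` instead of using raw `curl`.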
(Optional) Run a benchmark. Now, run the llama-bench tool against the test model. This command will load the model and run a standardized performance test, measuring your system's prompt processing (PP) and token generation (TG) speed.

```shell
# Run the benchmark with the downloaded model.
# -m: specifies the model file
./llama-bench -m ./test_model.gguf
```
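If you want to vary the workload rather than rely on the defaults, `llama-bench` accepts flags for the test sizes. The values below are illustrative, not recommendations:

```shell
# -p: prompt-processing test length, in tokens
# -n: token-generation test length, in tokens
# -ngl: number of layers offloaded to the GPU
# -r: number of repetitions to average over
./llama-bench -m ./test_model.gguf -p 512 -n 128 -ngl 99 -r 3
```

Larger `-p` values stress prompt processing (compute-bound), while `-n` stresses token generation (memory-bandwidth-bound), so varying both gives a fuller picture of the system.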