Llama.cpp pre-built binaries

llama.cpp is an open-source framework for Large Language Model (LLM) inference that runs on both central processing units (CPUs) and graphics processing units (GPUs).

This document provides installation instructions for the AMD-validated llama.cpp pre-built binaries. These are pre-compiled, stable executables (such as llama-server and llama-bench) that are ready to run on a Windows system without requiring any compilation.

  1. Download the AMD-validated Windows binary package that matches your GPU and extract it.


    For Strix Point and Strix Halo GPUs (gfx1150/gfx1151), download this package:

    curl.exe -o llama-bin-windows.zip "https://repo.radeon.com/rocm/llama.cpp/windows/rocm-rel-7.1.1/llama-b7146-windows-rocm-7.1.1-gfx1150-gfx1151-x64.zip"
    

    For Navi 3x and Navi 4x GPUs (gfx110X/gfx120X), download this package:

    curl.exe -o llama-bin-windows.zip "https://repo.radeon.com/rocm/llama.cpp/windows/rocm-rel-7.1.1/llama-b7146-windows-rocm-7.1.1-gfx110X-gfx120X-x64.zip"
    
  2. Unzip the package into a new directory.

    Expand-Archive -Path "llama-bin-windows.zip" -DestinationPath ".\llama_cpp_binaries"
    
  3. Navigate into the extracted inner directory.

    cd ./llama_cpp_binaries/<specific_folder_name>
    
  4. Download a test model. These binaries are the “engine”; you still need a model file in GGUF format to run inference. For this tutorial, download the GPT-OSS-20B model.

    curl.exe -L -o test_model.gguf "https://huggingface.co/ggml-org/gpt-oss-20b-GGUF/resolve/main/gpt-oss-20b-mxfp4.gguf"
    
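    A GGUF file begins with the four-byte ASCII magic `GGUF`, so you can sanity-check the download before starting the server. Below is a minimal Python sketch; the `looks_like_gguf` helper is illustrative and not part of llama.cpp.

```python
# Check that a file starts with the GGUF magic bytes. GGUF model files
# begin with the ASCII bytes b"GGUF"; anything else (for example, an HTML
# error page saved by a failed download) will not.
def looks_like_gguf(path: str) -> bool:
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"

# Usage (after the download above):
# looks_like_gguf("test_model.gguf")
```
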
  5. Run llama-server. llama-server is a lightweight, OpenAI-compatible web server included with llama.cpp that hosts your model locally. Once running, it provides a simple web interface that lets you chat with the model directly in your browser.

    # Start the server.
    # -ngl 99: offload all layers to the AMD GPU (crucial for performance)
    # -c: context length in tokens
    # -fa on: enable Flash Attention to reduce memory usage and increase speed
    .\llama-server.exe -m test_model.gguf -c 2048 -ngl 99 -fa on --port 8080
    
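    Because llama-server is OpenAI-compatible, you can also query it programmatically instead of using the browser interface. The following Python sketch posts a chat request to the server started above; it assumes the default port 8080 (adjust if you changed `--port`), and should only be run while the server is up.

```python
import json
import urllib.request

# llama-server exposes an OpenAI-compatible chat endpoint.
SERVER_URL = "http://127.0.0.1:8080/v1/chat/completions"

def build_chat_request(prompt: str, max_tokens: int = 128) -> dict:
    """Build an OpenAI-style chat completion payload."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def ask(prompt: str) -> str:
    """POST a prompt to the local llama-server and return the reply text."""
    data = json.dumps(build_chat_request(prompt)).encode("utf-8")
    req = urllib.request.Request(
        SERVER_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Example (requires the server from the step above to be running):
# print(ask("Say hello in one sentence."))
```
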
  6. (Optional) Run a benchmark. Now, run the llama-bench tool against the test model. This command will load the model and run a standardized performance test, measuring your system’s prompt processing (PP) and token generation (TG) speed.

    # Run the benchmark with the downloaded model.
    # -m: specifies the model file
    .\llama-bench.exe -m .\test_model.gguf
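
    If you want to track results across runs, you can scrape the table that llama-bench prints. The sketch below is an assumption-laden helper, not part of llama.cpp: it assumes the output is a markdown-style table whose last two columns are the test name and a "mean ± stddev" tokens-per-second value (the exact columns can vary between builds), and the sample numbers are made up for illustration.

```python
# Parse the markdown-style results table printed by llama-bench into a
# {test_name: tokens_per_second} dict, keeping only the mean t/s value.
def parse_bench_table(output: str) -> dict:
    results = {}
    for line in output.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) < 2:
            continue  # not a table row
        if cells[-2] == "test" or cells[-1].startswith("-"):
            continue  # header or separator row
        try:
            # t/s values look like "1234.56 ± 1.23"; keep the mean.
            results[cells[-2]] = float(cells[-1].split()[0])
        except ValueError:
            continue
    return results

# Illustrative output with made-up numbers:
sample = """\
| model       | backend | ngl | test  |            t/s |
| ----------- | ------- | --- | ----- | -------------- |
| gpt-oss 20B | ROCm    |  99 | pp512 | 1234.56 ± 1.23 |
| gpt-oss 20B | ROCm    |  99 | tg128 |   98.76 ± 0.45 |
"""
print(parse_bench_table(sample))
```
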