ROCm-RAG installation#

2026-04-28

Applies to Linux

This topic covers setup and installation instructions to help you get started running ROCm-RAG.

System requirements#

To use ROCm-RAG 1.0.0, you need the following prerequisites:

  • ROCm version: 6.4.1

  • Operating system: Ubuntu 22.04

  • GPU platform: AMD Instinct™ MI300X

Note

  • If you’re hosting the LLM outside the Docker container, the container requires one MI300X GPU. By default, the GPU ID is 0, but this can be changed by setting the ROCM_RAG_EMBEDDER_TP and ROCM_RAG_EMBEDDER_GPU_IDS environment variables.

  • If you’re hosting the LLM inside the container (ROCM_RAG_USE_EXAMPLE_LLM is set to true), three MI300X GPUs are required. By default, the GPU IDs are 0, 1, and 2, but these can be changed by setting the ROCM_RAG_EMBEDDER_TP, ROCM_RAG_EMBEDDER_GPU_IDS, ROCM_RAG_LLM_TP, and ROCM_RAG_LLM_GPU_IDS environment variables.
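As an illustration of how these variables relate to each other, the sketch below parses the GPU ID lists and checks that each tensor-parallel degree matches the number of GPUs assigned. The parsing logic is an assumption for illustration, not ROCm-RAG's actual implementation; only the variable names come from this guide.

```python
import os

def parse_gpu_ids(value: str) -> list[int]:
    """Parse a comma-separated GPU ID string such as "0,1,2" into a list of ints."""
    return [int(v) for v in value.split(",") if v.strip() != ""]

# Hypothetical defaults mirroring the note above: embedder on GPU 0,
# example LLM on GPUs 1 and 2 with a tensor-parallel degree of 2.
os.environ.setdefault("ROCM_RAG_EMBEDDER_TP", "1")
os.environ.setdefault("ROCM_RAG_EMBEDDER_GPU_IDS", "0")
os.environ.setdefault("ROCM_RAG_LLM_TP", "2")
os.environ.setdefault("ROCM_RAG_LLM_GPU_IDS", "1,2")

embedder_ids = parse_gpu_ids(os.environ["ROCM_RAG_EMBEDDER_GPU_IDS"])
llm_ids = parse_gpu_ids(os.environ["ROCM_RAG_LLM_GPU_IDS"])

# Each tensor-parallel degree should match the number of GPUs assigned to it.
assert len(embedder_ids) == int(os.environ["ROCM_RAG_EMBEDDER_TP"])
assert len(llm_ids) == int(os.environ["ROCM_RAG_LLM_TP"])
print(f"embedder GPUs: {embedder_ids}, LLM GPUs: {llm_ids}")
```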

Install ROCm-RAG#

To install ROCm-RAG on ROCm, you have two options: pull the prebuilt Docker image, or build the image from source.

After setting up the container with either option, configure your RAG framework, inferencing framework, and environment variables before running the pipelines.

Use a prebuilt Docker image with ROCm-RAG pre-installed#

The prebuilt image contains a fully configured ROCm-RAG installation and all required dependencies.

Pull the Docker image.

docker pull rocm/rocm-rag:rocm-rag-1.0.0-rocm6.4.1-ubuntu22.04

Build from source#

ROCm-RAG can be built from source using the provided Dockerfile.

  1. Clone the ROCm/rocm-rag repository.

    git clone https://github.com/ROCm/rocm-rag.git --recursive
    cd rocm-rag
    
  2. Build the Docker image.

    docker build -t rocm-rag -f Dockerfile .
    

Configure ROCm-RAG#

Before running ROCm-RAG, you need to configure the RAG framework, inferencing framework, and environment variables.

Choose a RAG framework#

The ROCm-RAG implementation leverages two widely adopted RAG frameworks:

  • Haystack: An open source framework designed for building search systems, QA pipelines, and RAG workflows.

  • LangGraph: A modular framework tailored for developing applications powered by language models.

Choose a framework that best suits your preferences and workflow. Both frameworks are actively maintained and widely used in the field of LLM-based application development.

You can configure a framework by setting environment variables when running the Docker container:

# Options: haystack, langgraph
ROCM_RAG_EXTRACTION_FRAMEWORK=haystack
ROCM_RAG_RETRIEVAL_FRAMEWORK=haystack
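For example, with the prebuilt image, these variables can be passed at container start with Docker's `-e` flag. This is a sketch, not a complete launch command — add whatever mounts your deployment needs; the device and group flags shown are the standard ROCm Docker options also used elsewhere in this guide.

```shell
docker run -it --network=host --ipc=host --shm-size=128GB \
  --device=/dev/kfd --device=/dev/dri --group-add video \
  -e ROCM_RAG_EXTRACTION_FRAMEWORK=haystack \
  -e ROCM_RAG_RETRIEVAL_FRAMEWORK=haystack \
  rocm/rocm-rag:rocm-rag-1.0.0-rocm6.4.1-ubuntu22.04
```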

Choose an inferencing framework#

ROCm-RAG supports three inferencing frameworks:

  • SGLang: An LLM serving engine known for radix tree caching and speculative decoding for fast inference.

  • vLLM: An efficient LLM inference library built around PagedAttention for fast, memory-optimized serving.

  • llama.cpp: A lightweight C/C++ inference framework for running GGUF-quantized LLMs locally on CPUs and GPUs.

Choose the framework that best suits your preferences and workflow, then follow the setup guide to deploy your inference server.

Using the example LLM (optional)#

If you prefer to test the pipeline without deploying your own inference server, enable the example LLM by setting this environment variable:

ROCM_RAG_USE_EXAMPLE_LLM=True

By default, this launches Qwen/Qwen3-30B-A3B-Instruct-2507 using vLLM inside the provided Docker container, running on GPUs with logical IDs 1 and 2.
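Because vLLM serves an OpenAI-compatible API, you can smoke-test the example LLM with a standard chat-completions request. The sketch below only builds the request; the base URL and port are assumptions — check your container's logs for the actual endpoint before uncommenting the call.

```python
import json
import urllib.request

def build_chat_request(base_url: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-compatible /v1/chat/completions request for the example LLM."""
    payload = {
        "model": "Qwen/Qwen3-30B-A3B-Instruct-2507",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Assumed local endpoint; confirm the port your vLLM instance actually listens on.
req = build_chat_request("http://localhost:8000", "What is ROCm?")
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```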

If you’re using the example LLM, you can skip the inferencing framework setup steps below.

Deploy an external inference server#

If you set ROCM_RAG_USE_EXAMPLE_LLM=False, follow these steps to deploy an LLM inference server outside the ROCm-RAG container.

Deploy DeepSeek V3.1 using SGLang:

# On a separate node
docker run --cap-add=SYS_PTRACE --ipc=host --privileged=true \
        --shm-size=128GB --network=host --device=/dev/kfd \
        --device=/dev/dri --group-add video -it \
        lmsysorg/sglang:v0.5.3rc0-rocm630-mi30x

RCCL_MSCCL_ENABLE=0 CK_MOE=1 HSA_NO_SCRATCH_RECLAIM=1 \
python3 -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-V3.1 \
--host 0.0.0.0 --port 30000 --tp 8 --trust-remote-code
Alternatively, deploy the GGUF-quantized DeepSeek V3.1 using llama.cpp:

  1. Download the GGUF model files.

    from huggingface_hub import snapshot_download
    
    # Define the model repository and destination directory
    model_id = "unsloth/DeepSeek-V3.1-GGUF"
    local_dir = "<your huggingface cache directory>/hub/models--unsloth--DeepSeek-V3.1-GGUF"
    
    # Download only files under Q4_K_M/ matching "DeepSeek-V3.1-Q4_K_M*"
    snapshot_download(
        repo_id=model_id,
        local_dir=local_dir,
        local_dir_use_symlinks=False,
        allow_patterns=["Q4_K_M/DeepSeek-V3.1-Q4_K_M*"]
    )
    
    print(f"Downloaded GGUF file(s) matching pattern to: {local_dir}")
    
  2. Build the llama.cpp Docker image.

    git clone https://github.com/ROCm/llama.cpp
    cd llama.cpp/
    docker build -t local/llama.cpp:rocm6.4_ubuntu24.04-complete \
    --target build -f .devops/rocm.Dockerfile .
    
  3. Start the Docker container with your checkpoints directory mounted.

    docker run --cap-add=SYS_PTRACE --ipc=host --privileged=true \
      --shm-size=128GB --network=host --device=/dev/kfd \
      --device=/dev/dri --group-add video -it \
      -v <your huggingface cache directory on host>:<your huggingface cache directory inside container> \
    local/llama.cpp:rocm6.4_ubuntu24.04-complete
    
  4. Launch the llama.cpp HTTP server.

    cd /app/build/bin
    ./llama-server \
    -m <your huggingface cache directory inside the container>/hub/models--unsloth--DeepSeek-V3.1-GGUF/Q4_K_M/DeepSeek-V3.1-Q4_K_M-00001-of-00009.gguf \
    -ngl 999 -np 4 --alias unsloth/DeepSeek-V3.1-Q4_K_M \
    --host 0.0.0.0 --port 30000
    

After setting up your inference server, ensure you set the correct API endpoints for LLM server-related environment variables.
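Whichever server you deployed, both SGLang and llama.cpp expose an OpenAI-compatible API, so a quick reachability check (assuming port 30000 as in the commands above) is to list the served models and confirm you get a JSON response:

```shell
# A JSON listing of served models confirms the inference server is up.
curl http://localhost:30000/v1/models
```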

Run ROCm-RAG#

Once the ROCm-RAG framework is configured, you can run the extraction and retrieval pipelines with your chosen RAG framework (Haystack or LangGraph).

Next steps#

Now that you have ROCm-RAG configured on your AMD Instinct GPU, you can:

  • Explore different RAG frameworks (Haystack and LangGraph)

  • Experiment with different inferencing frameworks (SGLang, vLLM, llama.cpp)

  • Customize extraction and retrieval parameters for your use case

  • Build custom RAG pipelines for specialized tasks

  • Integrate ROCm-RAG into your AI applications