Llama Stack#

This tutorial provides a step-by-step guide on deploying the Llama Stack on an AMD Instinct™ MI300X accelerator.

Introduction#

Meta is a leader in AI open-source innovation, and its Llama series has democratized access to large language models, empowering developers worldwide. Llama Stack, Meta's all-in-one deployment framework, extends this vision by enabling seamless transitions from research to production through built-in tools for optimization, API integration, and scalability. This unified platform is ideal for teams requiring robust support to deploy Meta's models at scale across diverse applications.

Complementing this ecosystem, AMD reinforces its position as a leader in AI acceleration hardware by expanding the AI software frontier through the ROCm™ open-source software stack. By fostering collaboration and optimizing performance, ROCm equips developers with a robust foundation to build high-throughput AI solutions tailored for production environments.

This tutorial guides developers in deploying the Llama Stack on AMD ROCm-powered GPUs, creating a production-ready infrastructure for large language model (LLM) inference. It also demonstrates programmatic interactions using the Llama Stack CLI and Python SDK, ensuring seamless server integration. To streamline this journey, the tutorial first previews the core components involved, such as the ROCm optimization tools, the Llama Stack deployment workflows, and scalable GPU configurations, before diving into the hands-on session. By the end of this guide, you’ll have a fully functional deployment of the Llama Stack inference services using AMD ROCm.

Llama Stack and remote vLLM distribution#

Llama Stack defines and standardizes the core building blocks needed to bring generative AI applications to market. It provides a unified set of APIs with implementations from leading service providers, enabling seamless transitions between development and production environments. For more information, see the Llama Stack documentation.

(Figure: overview of the Llama Stack architecture)

The Llama Stack Inference API is interoperable with a wide range of LLM inference providers, including vLLM, TGI, Ollama, and OpenAI APIs, ensuring seamless integration and flexibility for deployment. It also provides four types of client SDKs: Python, Swift, Node, and Kotlin.

For this tutorial, you’ll use vLLM as the inference provider along with the Llama Stack Python client SDK to showcase scalable deployment workflows and illustrate hands-on, low-latency LLM integration into production-ready services.

ROCm and vLLM Docker images#

ROCm is an open-source software platform optimized to extract HPC and AI workload performance from AMD Instinct accelerators and AMD Radeon GPUs while maintaining compatibility with industry software frameworks. For more information, see What is ROCm?

AMD collaborates with vLLM to deliver a streamlined, high-performance LLM inference engine and production-ready deployment solutions for enterprise-grade AI workloads.

Available vLLM containers

AMD provides two main vLLM container options. For more information, see How to Build a vLLM Container for Inference and Benchmarking.

  • rocm/vllm: Production-ready container

    • Pinned to a specific version, for example, rocm/vllm:rocm6.3.1_mi300_ubuntu22.04_py3.12_vllm_0.6.6

    • Designed for stability

    • Optimized for deployment

  • rocm/vllm-dev: Development container with the latest vLLM features

    • nightly, main, and other specialized builds are available:

      • nightly tags are built daily from the latest code, but might contain bugs

      • main tags are more stable builds, updated after testing

    • Includes development tools

    • Best for testing new features or custom modifications
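
To fetch either image ahead of time, pull it from Docker Hub. The tags below are examples only and are updated frequently, so check the rocm/vllm and rocm/vllm-dev repositories on Docker Hub for the current tags:

# Development image that tracks the latest vLLM code
docker pull rocm/vllm-dev:main

# Production image pinned to a specific ROCm and vLLM combination (example tag)
docker pull rocm/vllm:rocm6.3.1_mi300_ubuntu22.04_py3.12_vllm_0.6.6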

Deployment of Llama Stack with ROCm#

This tutorial uses Remote vLLM Distribution running with the ROCm/vllm-dev Docker image on an Instinct MI300X GPU. In addition to supporting many LLM inference providers (for example, Fireworks, Together, AWS Bedrock, Groq, Cerebras, SambaNova, and vLLM), Llama Stack also allows you to choose safety providers as an option (for instance, Meta Llama Guard, AWS Bedrock Guardrails, and vLLM). This tutorial uses two Instinct MI300X GPUs: one for deploying LLM inference APIs, and another for deploying the Safety or Shield APIs.

Prerequisites#

This tutorial was developed and tested using the following setup.

Operating system#

  • Ubuntu 22.04: Ensure your system is running Ubuntu version 22.04.

Hardware#

  • AMD Instinct™ GPUs: This tutorial was tested on an AMD Instinct MI300X GPU. Ensure you are using an AMD Instinct GPU or compatible hardware with ROCm support and that your system meets the official requirements.

Software#

  • ROCm 6.2 or 6.3: Install and verify ROCm by following the ROCm install guide. After installation, confirm your setup using:

    rocm-smi
    

    This command lists your AMD GPUs with relevant details, similar to the image below.

    (Image: example rocm-smi output listing the detected AMD GPUs)

  • Docker: Ensure Docker is installed and configured correctly. Follow the Docker installation guide for your operating system.

    Note: Ensure the Docker permissions are correctly configured. To configure permissions to allow non-root access, run the following commands:

    sudo usermod -aG docker $USER
    newgrp docker
    

    Verify Docker is working correctly with:

    docker run hello-world
    

Hugging Face API access#

  • Obtain an API token from Hugging Face for downloading models.

  • Ensure the Hugging Face API token has the necessary permissions and approval to access the Meta Llama checkpoints.
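
As an optional alternative to the notebook login flow used later in this tutorial, you can store the token with the Hugging Face CLI. This sketch assumes your token is exported in an HF_TOKEN environment variable:

# Log in once so downloads in containers and scripts can reuse the cached token
pip install -U huggingface_hub
huggingface-cli login --token "$HF_TOKEN"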

Launch Jupyter Notebooks#

Install Jupyter using the following command:

pip install jupyter

Start the Jupyter server:

jupyter-lab --ip=0.0.0.0 --port=8888 --no-browser --allow-root

Note: Ensure port 8888 is not already in use on your system before running the above command. If it is, you can specify a different port by replacing --port=8888 with another port number, for example, --port=8890.
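
A quick way to check whether another process is already listening on the port is the ss utility, which is available on a standard Ubuntu installation:

# Prints the existing listener if there is one, otherwise reports the port as free
ss -ltn | grep 8888 || echo "Port 8888 is free"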

Launch the ROCm vLLM server#

Starting with this section, you can upload this notebook to your Jupyter server and run the rest of this tutorial in your Jupyter environment.

Note: Before launching the server, provide your Hugging Face token.

Provide your Hugging Face token#

You’ll require a Hugging Face API token to access Llama-3.2-3B-Instruct. Generate your token at Hugging Face Tokens and request access to Llama-3.2-3B-Instruct (and to Llama-Guard-3-1B if you plan to use the Safety or Shield APIs). Tokens typically start with “hf_”.

Run the following interactive block in your Jupyter notebook to set up the token:

from huggingface_hub import notebook_login, HfApi

# Prompt the user to log in
notebook_login()

Verify that your token was accepted correctly:

# Validate the token
try:
    api = HfApi()
    user_info = api.whoami()
    print(f"Token validated successfully! Logged in as: {user_info['name']}")
except Exception as e:
    print(f"Token validation failed. Error: {e}")

Launch the vLLM container and its OpenAI-compatible serving endpoint:

%%bash
export INFERENCE_PORT=8080
export INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
export CUDA_VISIBLE_DEVICES=0
export VLLM_DIMG="rocm/vllm-dev:main"
docker run -d --rm \
    --ipc=host \
    --privileged \
    --shm-size 16g \
    --device=/dev/kfd \
    --device=/dev/dri \
    --group-add video \
    --cap-add=SYS_PTRACE \
    --cap-add=CAP_SYS_ADMIN \
    --security-opt seccomp=unconfined \
    --security-opt apparmor=unconfined \
    --env "HIP_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES" \
    -p $INFERENCE_PORT:$INFERENCE_PORT \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --name rocm-vllm-provider \
    $VLLM_DIMG \
    python -m vllm.entrypoints.openai.api_server \
    --model $INFERENCE_MODEL \
    --port $INFERENCE_PORT

Note: Set --enable-auto-tool-choice and --tool-call-parser to enable tool calling in vLLM.
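
For example, for a Llama 3.x model, the two flags can be appended to the serving command roughly as follows. The parser name is model-family specific, so consult the vLLM tool-calling documentation for the value that matches your model:

# Last lines of the docker run command above, with tool calling enabled
    python -m vllm.entrypoints.openai.api_server \
    --model $INFERENCE_MODEL \
    --port $INFERENCE_PORT \
    --enable-auto-tool-choice \
    --tool-call-parser llama3_json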

If you’re using the Llama Stack Safety or Shield APIs, you’ll also need to run another vLLM instance serving a corresponding safety model, such as meta-llama/Llama-Guard-3-1B, using a script like the following:

%%bash
export SAFETY_PORT=8081
export SAFETY_MODEL=meta-llama/Llama-Guard-3-1B
export CUDA_VISIBLE_DEVICES=1
export VLLM_DIMG="rocm/vllm-dev:main"

docker run -d --rm \
    --ipc=host \
    --privileged \
    --shm-size 16g \
    --device=/dev/kfd \
    --device=/dev/dri \
    --group-add video \
    --cap-add=SYS_PTRACE \
    --cap-add=CAP_SYS_ADMIN \
    --security-opt seccomp=unconfined \
    --security-opt apparmor=unconfined \
    --env "HIP_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES" \
    -p $SAFETY_PORT:$SAFETY_PORT \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --name rocm-vllm-guard \
    $VLLM_DIMG \
    python -m vllm.entrypoints.openai.api_server \
    --model $SAFETY_MODEL \
    --port $SAFETY_PORT

The script must allow enough time for the vllm serve command to finish loading the model and start the LLM service. Larger LLMs need more loading time, so you might have to adjust the sleep time for your environment. Alternatively, start the two vLLM server containers before running the subsequent steps of this Jupyter notebook, then use a curl test to confirm that both servers are ready.

!sleep 360
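
Instead of a fixed sleep, you can poll both endpoints until they respond. The following sketch retries each server for up to about 15 minutes:

%%bash
for port in 8080 8081; do
    for attempt in $(seq 1 90); do
        # Stop retrying as soon as the models endpoint answers
        if curl -s "http://localhost:${port}/v1/models" > /dev/null; then
            echo "vLLM server on port ${port} is ready"
            break
        fi
        sleep 10
    done
done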

Now test whether the two vLLM servers in the containers are ready. If they are not, add more sleep time until the curl tests below return a valid response.

!curl http://localhost:8080/v1/models
%%bash
curl http://localhost:8080/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-3.2-3B-Instruct",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0
    }'
!curl http://localhost:8081/v1/models
%%bash
curl http://localhost:8081/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-Guard-3-1B",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0
    }'

Install Llama Stack#

Next, install Llama Stack.

%%bash
pip install llama-stack llama-stack-client
%%bash
pip list | grep llama_stack
%%bash
git clone https://github.com/meta-llama/llama-stack.git
# Copy the template YAML files for the remote-vllm distribution
cp ./llama-stack/llama_stack/templates/remote-vllm/run.yaml .
cp ./llama-stack/llama_stack/templates/remote-vllm/run-with-safety.yaml .
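
Before launching the distribution, you can take a quick look at how the template wires the vLLM endpoints into the inference providers. The grep below is only a convenience, and the exact YAML layout can differ between Llama Stack releases, so open the file directly if the output looks incomplete:

%%bash
# Show the remote vLLM provider entries and their configured URLs
grep -n -B2 -A6 "remote::vllm" run-with-safety.yaml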

Running Llama Stack#

Use the distribution-remote-vllm Docker image as the Llama Stack frontend and the two vLLM server containers as the backend. Configure the model names and port mappings for the vLLM endpoints, then launch the distribution container to bring up Llama Stack.

%%bash
export INFERENCE_PORT=8080
export INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
export LLAMA_STACK_PORT=8321
export SAFETY_PORT=8081
export SAFETY_MODEL=meta-llama/Llama-Guard-3-1B

docker run -d --rm \
  --network=host \
  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
  -v ~/.llama:/root/.llama \
  -v ./run-with-safety.yaml:/root/my-run.yaml \
  --name llama-stack-distro \
  llamastack/distribution-remote-vllm \
  --config /root/my-run.yaml \
  --port $LLAMA_STACK_PORT \
  --env INFERENCE_MODEL=$INFERENCE_MODEL \
  --env VLLM_URL=http://0.0.0.0:$INFERENCE_PORT/v1 \
  --env SAFETY_MODEL=$SAFETY_MODEL \
  --env SAFETY_VLLM_URL=http://0.0.0.0:$SAFETY_PORT/v1

Now you have three containers running.

# Wait for the container to start
!sleep 60
!docker ps
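
If the Llama Stack container is missing from the list, or the client calls below fail, the container logs usually explain why (for example, a vLLM endpoint that is not reachable yet):

# Inspect the most recent log output from the Llama Stack container
!docker logs --tail 50 llama-stack-distro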

Use the Llama Stack client CLI#

You can use the client-side CLI to access the Llama Stack service. For more details, see the Llama Stack Client CLI reference.
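
The CLI needs to know where the Llama Stack server is listening. If it is not already pointed at your deployment, configure the endpoint first. This assumes the default port 8321 used above; depending on the client version, you might also be prompted for an API key, which can be left empty for a local deployment:

!llama-stack-client configure --endpoint http://localhost:8321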

!llama-stack-client models list
Available models

┏━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━┳━━┓
┃  ┃ identifier                       ┃ provider_resource_id             ┃  ┃  ┃
┡━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━╇━━┩
│  │ all-MiniLM-L6-v2                 │ all-MiniLM-L6-v2                 │  │  │
├──┼──────────────────────────────────┼──────────────────────────────────┼──┼──┤
│  │ meta-llama/Llama-3.2-3B-Instruct │ meta-llama/Llama-3.2-3B-Instruct │  │  │
├──┼──────────────────────────────────┼──────────────────────────────────┼──┼──┤
│  │ meta-llama/Llama-Guard-3-1B      │ meta-llama/Llama-Guard-3-1B      │  │  │
└──┴──────────────────────────────────┴──────────────────────────────────┴──┴──┘

Total models: 3
!llama-stack-client providers list
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ API          ┃ Provider ID            ┃ Provider Type                  ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ inference    │ vllm-inference         │ remote::vllm                   │
│ inference    │ vllm-safety            │ remote::vllm                   │
│ inference    │ sentence-transformers  │ inline::sentence-transformers  │
│ vector_io    │ faiss                  │ inline::faiss                  │
│ safety       │ llama-guard            │ inline::llama-guard            │
│ agents       │ meta-reference         │ inline::meta-reference         │
│ eval         │ meta-reference         │ inline::meta-reference         │
│ datasetio    │ huggingface            │ remote::huggingface            │
│ datasetio    │ localfs                │ inline::localfs                │
│ scoring      │ basic                  │ inline::basic                  │
│ scoring      │ llm-as-judge           │ inline::llm-as-judge           │
│ scoring      │ braintrust             │ inline::braintrust             │
│ telemetry    │ meta-reference         │ inline::meta-reference         │
│ tool_runtime │ brave-search           │ remote::brave-search           │
│ tool_runtime │ tavily-search          │ remote::tavily-search          │
│ tool_runtime │ code-interpreter       │ inline::code-interpreter       │
│ tool_runtime │ rag-runtime            │ inline::rag-runtime            │
│ tool_runtime │ model-context-protocol │ remote::model-context-protocol │
│ tool_runtime │ wolfram-alpha          │ remote::wolfram-alpha          │
└──────────────┴────────────────────────┴────────────────────────────────┘

You can request inference from the CLI in this manner:

!llama-stack-client inference chat-completion --message "tell me a joke"
ChatCompletionResponse(
    completion_message=CompletionMessage(
        content='{"name": "print", "parameters": {"f": "Why was the math book 
sad? Because it had too many problems."}}',
        role='assistant',
        stop_reason='end_of_turn',
        tool_calls=[]
    ),
    logprobs=None,
    metrics=[
        Metric(metric='prompt_tokens', value=14.0, unit=None),
        Metric(metric='completion_tokens', value=38.0, unit=None),
        Metric(metric='total_tokens', value=52.0, unit=None)
    ]
)

Use the Python client SDK#

Llama Stack provides a Python client SDK for developing applications. Here’s an example showing how to use the API to perform inference. You can find example code for reference on the Getting started page.

%%bash
cat > inference.py << EOF
# inference.py
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# List available models
models = client.models.list()

# Select the first LLM
llm = next(m for m in models if m.model_type == "llm")
model_id = llm.identifier

print("Model:", model_id)

response = client.inference.chat_completion(
    model_id=model_id,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a haiku about coding"},
    ],
)
print(response.completion_message.content)
EOF
!python inference.py
Model: meta-llama/Llama-3.2-3B-Instruct
{"name": "haiku", "parameters": {"c": "code", "t": "typing", "s": "silent"}}

Cleanup#

When you are finished, use the following commands to clean up the system:

%%bash
rm -rf llama-stack
rm run.yaml
rm run-with-safety.yaml
rm inference.py
docker stop llama-stack-distro
docker stop rocm-vllm-guard
docker stop rocm-vllm-provider