LLM distributed inference and PD disaggregation on AMD Instinct GPUs#
With the rapid growth of LLM model sizes, single-node inference optimization is reaching its limits for scaling LLM serving. Distributed inference across multiple nodes is becoming essential for efficient LLM serving. Prefill and Decode (PD) disaggregation is a typical use case for LLM distributed inference on GPU nodes. LLM inference comprises two distinct phases: prefill and decode. The prefill phase is computationally intensive, processing the entire input sequence, while the decode phase is memory-intensive, managing the key-value (KV) cache for token generation. PD disaggregation runs these two phases independently on different GPU nodes, which provides the benefits of efficient GPU resource allocation and independent performance tuning. This tutorial demonstrates how to set up 1P1D distributed inference on either one or two nodes with AMD Instinct™ GPUs.
Prerequisites#
This tutorial was developed and tested using the following setup.
Operating system#
Ubuntu 22.04/24.04: Ensure your system is running Ubuntu 22.04 or 24.04.
Hardware#
AMD Instinct GPUs: Tested on MI300X. Works on a single MI300X node or two MI300X nodes (each node has 8 MI300X GPUs).
RDMA NIC (required for both the single-node and two-node setups to transfer the KV cache over RDMA):
Use an RDMA-capable NIC (for example, Broadcom Thor2/BCM-57608 or NVIDIA/Mellanox ConnectX).
Install the appropriate vendor driver and RDMA userspace (rdma-core/libibverbs). Verify the setup with the ibv_devices and ibv_devinfo commands and by listing the devices with ls /dev/infiniband.
If running in Docker, ensure /dev/infiniband is mapped into the container (see the launch command below).
For a two-node configuration, RDMA must be available on both nodes (RoCEv2 with PFC or InfiniBand), with the nodes properly cabled and the switch configured.
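For example, a quick check on the host might look like the following. The device name rdma0 is only an example and will vary by vendor and system.
# List the RDMA devices visible through libibverbs
ibv_devices

# Show port state and link layer for one device (replace rdma0 with a name reported by ibv_devices)
ibv_devinfo -d rdma0

# Confirm the kernel exposes the RDMA character devices
ls /dev/infiniband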
ROCm compatibility: Use AMD Instinct GPUs with ROCm support and ensure your system meets the official requirements.
Software#
ROCm 6.3 or later (host): Install and verify ROCm by following the ROCm install guide. After installation, confirm your setup using the rocm-smi command.
Docker (host): Install the Docker Engine on Linux and verify it with the docker --version command. You will run this tutorial inside a container, so ensure the host allows you to map GPU and RDMA devices. When launching the container, map the devices and use the recommended flags:
Devices: /dev/kfd, /dev/dri, /dev/infiniband, and /dev/infiniband/rdma_cm
Flags: --network=host --ipc=host --shm-size 32G --group-add=video
Prebuilt ROCm Docker images (recommended to reduce the setup effort).
Jupyter (in container): Install it with pip install jupyter to run this notebook.
Hugging Face API access#
Obtain an API token from Hugging Face for downloading models.
Ensure the Hugging Face API token has the necessary permissions and approvals to access the required checkpoints:
For the single-node setup, you must have access to the meta-llama/Llama-3.1-8B-Instruct checkpoints.
For the two-node setup, you must have access to the meta-llama/Llama-3.3-70B-Instruct checkpoints.
Set up the PD disaggregation environment#
In this tutorial, you will use the prebuilt ROCm SGLang image, which integrates SGLang with the AMD ROCm software stack. You can also use other ROCm images as the base image if you prefer.
Step 1: Prepare the tutorial environment#
Follow these steps to configure your tutorial environment:
Pull the Docker image#
Use the lmsysorg/sglang:v0.4.9-rocm630 Docker image as the base image. This is the latest image that was tested for this tutorial.
Note: The SGLang community continues to release additional ROCm SGLang Docker images. You are strongly encouraged to try the latest available image for better performance.
docker pull lmsysorg/sglang:v0.4.9-rocm630
Launch the Docker container#
To achieve good network transfer performance, an RDMA NIC is required on the nodes running PD disaggregation. When launching the Docker container, map the RDMA devices into the container, as shown in the command below.
docker run -it --rm \
--network=host \
--device=/dev/kfd \
--device=/dev/dri \
--group-add=video \
--ipc=host \
--device=/dev/infiniband \
--device=/dev/infiniband/rdma_cm \
--privileged \
--cap-add=SYS_ADMIN \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--shm-size 32G \
-v $(pwd):/workspace \
-w /workspace \
lmsysorg/sglang:v0.4.9-rocm630
Note: This command mounts the current directory to the /workspace directory in the container. Ensure the notebook file is either copied to this directory before running the Docker command or uploaded into the Jupyter Notebook environment after it starts. Save the token or URL provided in the terminal output to access the notebook from your web browser. You can download this notebook from the AI Developer Hub GitHub repository.
Install and launch Jupyter#
Inside the Docker container, install Jupyter using the following command:
pip install jupyter
Start the Jupyter server:
jupyter-lab --ip=0.0.0.0 --port=8888 --no-browser --allow-root
Note: Ensure port 8888 is not already in use on your system before running the above command. If it is, you can specify a different port by replacing --port=8888 with another port number, for example, --port=8890.
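If you're unsure whether the port is free, a quick check like the following works in most containers, assuming the ss utility from iproute2 is installed; no output means nothing is listening on the port.
ss -ltn | grep 8888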
Note: The rest of this tutorial is designed to run as a Jupyter notebook. It demonstrates Prefill and Decode (PD) disaggregation on AMD Instinct GPUs and runs on a single node (Intra-node 1P1D) by default. The two-node (Inter-node 1P1D) steps are optional and clearly marked.
Run modes
Single node (default): The etcd key-value store server is not required. Use the “Intra‑node 1P1D” section.
Two nodes (optional): Requires a setup with the etcd server, RDMA, and SSH. Use the two‑node “Inter‑node 1P1D” section.
Provide your Hugging Face token#
You need a Hugging Face API token with the appropriate permissions to access the Llama models, as described in the prerequisites. First, install the Hugging Face Hub library.
!pip install --upgrade huggingface_hub
Run the following interactive block in your Jupyter notebook to set up the token:
from huggingface_hub import notebook_login, HfApi
# Prompt the user to log in
notebook_login()
Verify that your token was accepted correctly:
# Validate the token
try:
api = HfApi()
user_info = api.whoami()
print(f"Token validated successfully! Logged in as: {user_info['name']}")
except Exception as e:
print(f"Token validation failed. Error: {e}")
Step 2: Install the necessary software components#
For the Intra-node 1P1D mode, you must install the Mooncake transfer engine. The etcd server is not required for this case. For the Inter-node 1P1D mode, you must install both etcd and the Mooncake transfer engine.
Step 2.1: (Optional) Install etcd#
You can skip this step if you are running the single-node test. etcd is a strongly consistent, distributed key-value store that provides a reliable way to store data that needs to be accessed by a distributed system or cluster of machines. In the SGLang PD disaggregation solution design, the etcd server is required on each GPU node to provide cluster metadata storage. It must be installed to run this tutorial on two nodes.
%%bash
cd /sgl-workspace
apt update && apt install -y wget
wget https://github.com/etcd-io/etcd/releases/download/v3.6.0-rc.5/etcd-v3.6.0-rc.5-linux-amd64.tar.gz -O /tmp/etcd.tar.gz
tar -xvf /tmp/etcd.tar.gz -C /usr/local/bin/ --strip-components=1 && rm /tmp/etcd.tar.gz
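You can quickly confirm that the binaries were installed correctly; the exact version strings depend on the release you downloaded.
%%bash
etcd --version
etcdctl version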
Step 2.2: Install Mooncake#
Mooncake features a KV cache-centric disaggregated architecture that separates the prefill and decoding clusters. Its core components, such as the transfer engine, have been integrated into the SGLang PD disaggregation solution to transfer the KV cache between nodes.
%%bash
apt update && apt install -y zip unzip openssh-server
apt -y install gcc make libtool autoconf librdmacm-dev rdmacm-utils infiniband-diags ibverbs-utils perftest ethtool libibverbs-dev rdma-core strace
cd /sgl-workspace
pip install mooncake-transfer-engine
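Optionally, confirm that the package is visible to pip inside the container; the output shows the installed version and location.
%%bash
pip show mooncake-transfer-engine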
Run SGLang PD disaggregation#
SGLang supports Prefill-Decode (PD) disaggregation on AMD Instinct GPUs, using Mooncake to transfer the KV cache. From a system architecture perspective, SGLang PD disaggregation includes three distinct components: a proxy server, a prefill server, and a decode server. When a request comes in, the proxy server selects a pair of prefill and decode servers based on a workload-balancing scheme. The selected prefill and decode servers then pair up through a handshake, establishing a local sender and receiver, respectively. The decode server preallocates the KV cache and signals the prefill server to begin LLM prefill inference and compute the KV cache. After the prefill work is done, the KV cache data is transferred to the decode server, which handles iterative token generation.
This tutorial tests SGLang PD disaggregation in two configurations: Intra-node 1P1D and Inter-node 1P1D. For the intra-node case, you need at least two GPUs: one GPU runs the prefill server and the other runs the decode server. For inter-node 1P1D, you need two nodes: one node runs the prefill server, and the other runs the decode server. Because the proxy server doesn't require a large amount of GPU resources, it runs on the prefill node. If you have a larger cluster, the proxy server can run on a standalone node for better performance. In the following steps demonstrating SGLang PD disaggregation, the example prefill node has the IP address 10.21.9.10 and the decode node has the IP address 10.21.9.15. Modify the relevant parameters and settings according to your cluster configuration.
Intra-node (single-node) 1P1D#
For intra-node testing, follow these steps:
Note: Run all commands in this section from a terminal, not from notebook code cells. In JupyterLab, open a terminal using Launcher → Terminal (or File → New → Terminal). Use separate terminals for the prefill, decode, and proxy servers.
Run the prefill server#
Use the sglang.launch_server command to launch the prefill server. For more information and a detailed description of the command options, see the latest version of the SGLang documentation or source code. The RDMA device names can be found by using ibv_devices (see the earlier section).
Note: For a multi-node configuration, ensure PATH and LD_LIBRARY_PATH include UCX and Open MPI (see the earlier cells, or export them in the terminal). Replace any placeholders (for example, IPs, ports, and RDMA device names) before running the commands.
HIP_VISIBLE_DEVICES=0 python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct \
--disaggregation-mode prefill --port 30000 \
--disaggregation-ib-device rdma0,rdma1,rdma2,rdma3,rdma4,rdma5,rdma6,rdma7
Run the decode server#
Use the sglang.launch_server command to launch the decode server. The RDMA device names can be found by using ibv_devices.
HIP_VISIBLE_DEVICES=1 python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct \
--disaggregation-mode decode --port 30001 \
--disaggregation-ib-device rdma0,rdma1,rdma2,rdma3,rdma4,rdma5,rdma6,rdma7
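Before launching the proxy, you can optionally check that both servers have finished loading the model. This assumes SGLang's /health endpoint and the ports used above; each command should return HTTP 200 once the corresponding server is ready.
curl -s -o /dev/null -w "%{http_code}\n" http://127.0.0.1:30000/health   # prefill server
curl -s -o /dev/null -w "%{http_code}\n" http://127.0.0.1:30001/health   # decode server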
Run the proxy server#
This step configures the prefill and decode server ports when launching the proxy server on the same node. The proxy server port is also provided so the test client program can connect.
python -m sglang.srt.disaggregation.mini_lb --prefill http://127.0.0.1:30000 --decode http://127.0.0.1:30001 --host 0.0.0.0 --port 40000
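Once the prefill, decode, and proxy servers are all running, you can send a quick smoke-test request through the proxy before running the full benchmark. This sketch assumes the proxy forwards SGLang's native /generate endpoint on port 40000; adjust the prompt and sampling parameters as needed.
curl -s http://127.0.0.1:40000/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "The capital of France is", "sampling_params": {"max_new_tokens": 32, "temperature": 0}}'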
Inter-node (multi-node) 1P1D#
For inter-node testing, follow these steps:
Note: Run all commands in this section from a terminal, not from notebook code cells. In JupyterLab, open a terminal using Launcher → Terminal (or File → New → Terminal). Use separate terminals for the prefill, decode, and proxy servers.
Run the etcd server on each node#
Run the commands below in the SGLang ROCm containers of both the prefill and decode nodes. The etcd server ports in these commands are for reference only. If they are in use by other processes, try different ports.
On the prefill node, start the etcd server using the following command.
etcd --name infra0 --data-dir /var/lib/etcd --initial-advertise-peer-urls http://10.21.9.10:2380 \
--listen-peer-urls http://10.21.9.10:2380 \
--listen-client-urls http://10.21.9.10:2379,http://127.0.0.1:2379 \
--advertise-client-urls http://10.21.9.10:2379 \
--initial-cluster-token etcd-cluster-1 \
--initial-cluster infra0=http://10.21.9.10:2380,infra1=http://10.21.9.15:2380 \
--initial-cluster-state new
On the decode node, use this command to run the etcd server.
etcd --name infra1 --data-dir /var/lib/etcd --initial-advertise-peer-urls http://10.21.9.15:2380 \
--listen-peer-urls http://10.21.9.15:2380 \
--listen-client-urls http://10.21.9.15:2379,http://127.0.0.1:2379 \
--advertise-client-urls http://10.21.9.15:2379 \
--initial-cluster-token etcd-cluster-1 \
--initial-cluster infra0=http://10.21.9.10:2380,infra1=http://10.21.9.15:2380 \
--initial-cluster-state new
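After both etcd servers are running, you can optionally verify the two-member cluster from either node using etcdctl, which ships in the same release tarball. The endpoints below match the example IP addresses used in this tutorial; substitute your own.
etcdctl --endpoints=http://10.21.9.10:2379,http://10.21.9.15:2379 endpoint health
etcdctl --endpoints=http://10.21.9.10:2379 member list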
Run the proxy server#
As previously mentioned, this server runs on the prefill node in this tutorial. You can run it on a standalone node in the same cluster for better performance.
In this step, the IP addresses and ports of the prefill and decode node pools are configured. The IP address and port of the proxy server are also provided for the test client program to connect to.
python -m sglang.srt.disaggregation.mini_lb --prefill http://10.21.9.10:30000 \
--decode http://10.21.9.15:30000 --host 0.0.0.0 --port 40000
Run the prefill server#
Use the sglang.launch_server command to launch the prefill server. For more information and a detailed description of the command options, see the latest version of the SGLang documentation or source code. The RDMA device names can be found by using ibv_devices (see the earlier section).
python3 -m sglang.launch_server --model meta-llama/Llama-3.3-70B-Instruct \
--disaggregation-mode prefill --disaggregation-ib-device rdma0,rdma1,rdma2,rdma3,rdma4,rdma5,rdma6,rdma7 \
--host 10.21.9.10 --port 30000 --trust-remote-code \
--tp 8 --disable-radix-cache --disable-cuda-graph \
--max-running-requests 1024 --stream-output \
--dist-init-addr 10.21.9.10:5757 --nnodes 1 --node-rank 0 \
--mem-fraction-static 0.8
Run the decode server#
Use the sglang.launch_server command to launch the decode server. The RDMA device names can be found by using ibv_devices.
python3 -m sglang.launch_server --model meta-llama/Llama-3.3-70B-Instruct \
--disaggregation-mode decode --disaggregation-ib-device rdma0,rdma1,rdma2,rdma3,rdma4,rdma5,rdma6,rdma7 \
--host 10.21.9.15 --port 30000 --trust-remote-code \
--tp 8 --disable-radix-cache --disable-cuda-graph \
--max-running-requests 1024 --stream-output \
--dist-init-addr 10.21.9.15:5757 --nnodes 1 --node-rank 0 \
--mem-fraction-static 0.8
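Before running the benchmark, you can optionally confirm that both remote servers report healthy from any machine that can reach them. As in the intra-node case, this assumes SGLang's /health endpoint and the ports used above.
curl -s -o /dev/null -w "%{http_code}\n" http://10.21.9.10:30000/health   # prefill node
curl -s -o /dev/null -w "%{http_code}\n" http://10.21.9.15:30000/health   # decode node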
Test PD disaggregation#
In this step, use sglang.bench_serving to test the 1P1D configuration in the same way as a normal SGLang benchmark test. This tutorial runs the command on the prefill node to simplify the demo. To run the command on another machine in the cluster, set the host IP address and port of the proxy server in this command. The other test parameters can be changed as required.
%%bash
python3 -m sglang.bench_serving --backend sglang --host 127.0.0.1 --port 40000 --dataset-name generated-shared-prefix \
--gsp-system-prompt-len 0 \
--gsp-question-len 1024 \
--gsp-output-len 1024 \
--gsp-num-groups 1 \
--gsp-prompts-per-group 16 \
--random-range-ratio 1 \
--max-concurrency 16 \
--pd-separated \
2>&1 | tee test.log
xPyD setup#
If you have a larger GPU cluster for running PD disaggregation, you can use xPyD (multiple prefill and decode instances) for better performance. The xPyD setup follows the same steps described above, with the following modifications to the multi-node configuration (see the sketch after this list):
Change the prefill and decode node configuration in the proxy server, for example, --prefill "http://YOUR_FIRST_PREFILL_NODE_IP:30000" and --decode "http://YOUR_FIRST_DECODE_NODE_IP:30000".
Change the multi-node distributed serving options, such as dist-init-addr, nnodes, and node-rank, when launching the prefill and decode servers.
Change the tp, dp, and ep-size options of the SGLang serving program, if required.
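As a rough illustration of the second point, a single prefill instance spanning two nodes (tensor parallel across 16 GPUs) might be launched as follows. This is a minimal sketch that reuses the options shown earlier with placeholder IP addresses; it is not a complete xPyD recipe.
# On the first prefill node (rank 0); YOUR_PREFILL_NODE0_IP is a placeholder
python3 -m sglang.launch_server --model meta-llama/Llama-3.3-70B-Instruct \
    --disaggregation-mode prefill --disaggregation-ib-device rdma0,rdma1,rdma2,rdma3,rdma4,rdma5,rdma6,rdma7 \
    --host YOUR_PREFILL_NODE0_IP --port 30000 --trust-remote-code \
    --tp 16 --dist-init-addr YOUR_PREFILL_NODE0_IP:5757 --nnodes 2 --node-rank 0

# On the second prefill node (rank 1), run the same command but with --node-rank 1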
Summary#
In this tutorial, you learned how to set up and run SGLang PD disaggregation on AMD Instinct MI300X GPUs. The tutorial demonstrated how to configure 1P1D on both a single Instinct MI300X node and two Instinct MI300X GPU nodes, but you can easily implement xPyD on your own GPU cluster. To learn more about PD disaggregation, see the Mooncake, LLM-d, and vLLM disagg_prefill resources. This tutorial aims to encourage you to tune, test, and contribute to LLM distributed inference on AMD GPUs, helping to shape the future of AI acceleration.