Benchmark DeepSeek R1 FP8 inference with SGLang

2025-09-08

3 min read time

Applies to Linux

This section describes how to test the inference performance of DeepSeek R1 at FP8 precision with the SGLang serving framework. The accompanying Docker images integrate the ROCm 7.0 preview with SGLang and are tailored for AMD Instinct MI355X, MI350X, and MI300X series accelerators. Other accelerators are not supported by this benchmark.

Follow these steps to pull the required image, spin up the container with the appropriate options, download the model, and run the benchmark.

Pull the Docker image

Use the command that matches your accelerator. The mi35x image targets MI355X and MI350X accelerators; the mi30x image targets the MI300X series.

# MI355X and MI350X
docker pull rocm/7.0-preview:rocm7.0_preview_ubuntu_22.04_sgl-dev-v0.5.2rc2_mi35x_rc1

# MI300X
docker pull rocm/7.0-preview:rocm7.0_preview_ubuntu_22.04_sgl-dev-v0.5.2rc2-mi30x_rc1
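If you script the image selection, the tag can be derived from the GPU architecture that rocminfo reports. The mapping below (gfx942 for MI300X, gfx950 for MI350X/MI355X) is an assumption for illustration, not taken from this page; confirm it against your hardware before relying on it.

```python
# Sketch: choose the Docker tag from the GPU architecture name.
# Assumed mapping (verify with `rocminfo | grep gfx`):
#   gfx942 -> MI300X, gfx950 -> MI350X/MI355X
REPO = "rocm/7.0-preview"
TAGS = {
    "gfx950": "rocm7.0_preview_ubuntu_22.04_sgl-dev-v0.5.2rc2_mi35x_rc1",  # MI355X/MI350X
    "gfx942": "rocm7.0_preview_ubuntu_22.04_sgl-dev-v0.5.2rc2-mi30x_rc1",  # MI300X
}

def image_for(arch: str) -> str:
    """Return the full image reference for a supported architecture."""
    try:
        return f"{REPO}:{TAGS[arch]}"
    except KeyError:
        raise SystemExit(f"unsupported accelerator architecture: {arch}")

print(image_for("gfx942"))
```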

Download the model

See the model card on Hugging Face at deepseek-ai/DeepSeek-R1-0528. Replace <HF_TOKEN> with your Hugging Face access token. Setting HF_HOME under /data keeps the download on the directory that is later bind-mounted into the container.

pip install "huggingface_hub[cli]" hf_transfer hf_xet
HF_HUB_ENABLE_HF_TRANSFER=1 \
HF_HOME=/data/huggingface-cache \
HF_TOKEN="<HF_TOKEN>" \
huggingface-cli download deepseek-ai/DeepSeek-R1-0528 --exclude "original/*"
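After the download finishes, the weights live inside the hub cache rather than at a plain path. The sketch below reproduces the standard Hugging Face cache layout (a Hub convention, not specific to this image) so you can confirm the files landed on the /data volume that the container mounts:

```python
from pathlib import PurePosixPath

# Sketch: standard Hugging Face hub cache layout, i.e.
# HF_HOME/hub/models--<org>--<name>. Snapshots for each revision
# live one level deeper, under snapshots/<commit-hash>.
def cache_dir(hf_home: str, repo_id: str) -> PurePosixPath:
    return PurePosixPath(hf_home) / "hub" / ("models--" + repo_id.replace("/", "--"))

print(cache_dir("/data/huggingface-cache", "deepseek-ai/DeepSeek-R1-0528"))
# → /data/huggingface-cache/hub/models--deepseek-ai--DeepSeek-R1-0528
```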

Run the inference benchmark

  1. Start the container using the command that matches the Docker image you pulled.

    docker run -it \
        --user root \
        --group-add video \
        --cap-add=SYS_PTRACE \
        --security-opt seccomp=unconfined \
        -w /app/ \
        --ipc=host \
        --network=host \
        --shm-size 64G \
        --mount type=bind,src=/data,dst=/data \
        --device=/dev/kfd \
        --device=/dev/dri \
        -e SGLANG_USE_AITER=1 \
        rocm/7.0-preview:rocm7.0_preview_ubuntu_22.04_sgl-dev-v0.5.2rc2_mi35x_rc1
    
    docker run -it \
        --user root \
        --group-add video \
        --cap-add=SYS_PTRACE \
        --security-opt seccomp=unconfined \
        -w /app/ \
        --ipc=host \
        --network=host \
        --shm-size 64G \
        --mount type=bind,src=/data,dst=/data \
        --device=/dev/kfd \
        --device=/dev/dri \
        -e SGLANG_USE_AITER=1 \
        rocm/7.0-preview:rocm7.0_preview_ubuntu_22.04_sgl-dev-v0.5.2rc2-mi30x_rc1
    
  2. Start the server inside the container.

    python3 -m sglang.launch_server \
        --model-path deepseek-ai/DeepSeek-R1-0528 \
        --host localhost \
        --port 8000 \
        --tensor-parallel-size 8 \
        --trust-remote-code \
        --chunked-prefill-size 196608 \
        --mem-fraction-static 0.8 \
        --disable-radix-cache \
        --num-continuous-decode-steps 4 \
        --max-prefill-tokens 196608 \
        --cuda-graph-max-bs 128 &
    
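Loading a model of this size takes a while, so the server is not ready the moment the command returns. A small poll avoids launching the benchmark too early. The /health route on port 8000 is an assumption based on the launch options above; the demo polls a stand-in local HTTP server so the snippet is runnable anywhere.

```python
import http.server
import threading
import time
import urllib.error
import urllib.request

def wait_for_server(url: str, timeout_s: float = 1800.0, interval_s: float = 1.0) -> bool:
    """Poll `url` until it returns HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; retry
        time.sleep(interval_s)
    return False

# Demo against a stand-in server. For the real run, poll
# http://localhost:8000/health instead (assumed SGLang health route).
server = http.server.HTTPServer(("localhost", 0), http.server.SimpleHTTPRequestHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
print(wait_for_server(f"http://localhost:{server.server_address[1]}/", timeout_s=10))
server.shutdown()
```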
  3. Run the benchmark with the following options.

    input_tokens=1024
    output_tokens=1024
    max_concurrency=64
    num_prompts=128
    
    python3 -m sglang.bench_serving \
        --host localhost \
        --port 8000 \
        --model deepseek-ai/DeepSeek-R1-0528 \
        --dataset-name random \
        --random-input ${input_tokens} \
        --random-output ${output_tokens} \
        --random-range-ratio 1.0 \
        --max-concurrency ${max_concurrency} \
        --num-prompt ${num_prompts}
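Because --random-range-ratio 1.0 fixes every request at exactly the configured lengths, the totals the benchmark report should reflect are easy to sanity-check from the settings above:

```python
import math

# Settings from the benchmark invocation above.
input_tokens, output_tokens = 1024, 1024
max_concurrency, num_prompts = 64, 128

total_input = num_prompts * input_tokens          # tokens prefilled across all requests
total_output = num_prompts * output_tokens        # tokens generated across all requests
waves = math.ceil(num_prompts / max_concurrency)  # back-to-back batches of requests

print(total_input, total_output, waves)  # → 131072 131072 2
```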