Benchmark DeepSeek R1 FP4 inference with SGLang

2025-09-08

3 min read time

Applies to Linux

This section describes how to benchmark the inference performance of DeepSeek R1 in FP4 precision using the SGLang serving framework. The accompanying Docker image integrates the ROCm 7.0 preview with SGLang and is tailored for AMD Instinct MI355X and MI350X accelerators. This benchmark does not support other accelerators.

Follow these steps to pull the required image, spin up the container with the appropriate options, download the model, and run the benchmark.

Pull the Docker image

Use the following command to pull the Docker image.

docker pull rocm/7.0-preview:rocm7.0_preview_ubuntu_22.04_sgl-dev-v0.5.2rc2_mi35x_rc1

Download the model

See the model card on Hugging Face at DeepSeek-R1-MXFP4-Preview. This model uses microscaling 4-bit floating point (MXFP4) quantization through AMD Quark for efficient inference on AMD accelerators.

pip install "huggingface_hub[cli]" hf_transfer hf_xet
HF_HUB_ENABLE_HF_TRANSFER=1 \
HF_HOME=/data/huggingface-cache \
HF_TOKEN="<HF_TOKEN>" \
huggingface-cli download amd/DeepSeek-R1-0528-MXFP4-Preview --exclude "original/*"
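Before moving on, you can optionally confirm that the snapshot landed in the local cache. The following is a minimal sketch that assumes the standard Hugging Face Hub cache layout (`models--<org>--<name>`) under the `HF_HOME` path used above:

```shell
# Sketch: check that the model snapshot exists in the Hugging Face cache.
# Assumes the standard hub cache layout under the HF_HOME used above.
HF_HOME="${HF_HOME:-/data/huggingface-cache}"
MODEL_DIR="$HF_HOME/hub/models--amd--DeepSeek-R1-0528-MXFP4-Preview"
if [ -d "$MODEL_DIR/snapshots" ]; then
    echo "model cached at $MODEL_DIR"
else
    echo "model not found under $MODEL_DIR; re-run the download step"
fi
```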

Run the inference benchmark

  1. Start the container using the following command.

    docker run -it \
        --user root \
        --group-add video \
        --cap-add=SYS_PTRACE \
        --security-opt seccomp=unconfined \
        -w /app/ \
        --ipc=host \
        --network=host \
        --shm-size 64G \
        --mount type=bind,src=/data,dst=/data \
        --device=/dev/kfd \
        --device=/dev/dri \
        -e SGLANG_USE_AITER=1 \
        rocm/7.0-preview:rocm7.0_preview_ubuntu_22.04_sgl-dev-v0.5.2rc2_mi35x_rc1
    
  2. Start the server.

    python3 -m sglang.launch_server \
        --model-path amd/DeepSeek-R1-0528-MXFP4-Preview \
        --host localhost \
        --port 8000 \
        --tensor-parallel-size 8 \
        --trust-remote-code \
        --chunked-prefill-size 196608 \
        --mem-fraction-static 0.8 \
        --disable-radix-cache \
        --num-continuous-decode-steps 4 \
        --max-prefill-tokens 196608 \
        --cuda-graph-max-bs 128 &
    
  3. Run the benchmark with the following options.

    input_tokens=1024
    output_tokens=1024
    max_concurrency=64
    num_prompts=128
    
    python3 -m sglang.bench_serving \
        --host localhost \
        --port 8000 \
        --model amd/DeepSeek-R1-0528-MXFP4-Preview \
        --dataset-name random \
        --random-input-len ${input_tokens} \
        --random-output-len ${output_tokens} \
        --random-range-ratio 1.0 \
        --max-concurrency ${max_concurrency} \
        --num-prompts ${num_prompts}
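The server in step 2 runs in the background and can take several minutes to load the model weights, so the benchmark in step 3 fails if it is launched too early. A small helper like the following can poll the server until it responds before you start the benchmark; it assumes the launched server answers HTTP requests on a `/health` path (adjust the URL if your SGLang version differs):

```shell
# Sketch: poll the server until it responds, before launching the benchmark.
# Assumes an HTTP /health endpoint on the launched server (adjust if needed).
wait_for_server() {
    local url="$1" retries="${2:-120}"
    local i
    for i in $(seq 1 "$retries"); do
        if curl -sf "$url" > /dev/null 2>&1; then
            echo "server ready after ${i} attempt(s)"
            return 0
        fi
        sleep 1
    done
    echo "server not reachable after ${retries} attempts" >&2
    return 1
}

# Example: wait_for_server "http://localhost:8000/health"
```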
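With `--random-range-ratio 1.0`, the random dataset fixes every request at exactly `input_tokens` in and `output_tokens` out, so the total token volume of a run can be computed up front. This is useful for sanity-checking the throughput numbers the benchmark reports, as sketched here with the values above:

```shell
# Compute the total token volume implied by the benchmark settings above.
input_tokens=1024
output_tokens=1024
num_prompts=128

total_input=$((input_tokens * num_prompts))
total_output=$((output_tokens * num_prompts))
echo "total input tokens:  $total_input"    # 131072
echo "total output tokens: $total_output"   # 131072
```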