Run a Triton Inference Server example

Run a Triton Inference Server example#

2026-04-01

4 min read time

Applies to Linux

This example shows how to deploy and run a simple ONNX model using Triton Inference Server ONNX runtime backend MIGraphX execution provider on AMD Instinct MI300X or MI355X GPUs.

Prepare a Triton model repository with a simple ONNX model.

Create a dummy two-layer MLP ONNX model and a Triton model repository layout that uses the ONNX Runtime backend with the MIGraphX execution provider.

# Create the model repository directory
mkdir -p $PWD/model_repository/dummy_migraphx_onnx/0

# Write the Triton model configuration
cat > $PWD/model_repository/dummy_migraphx_onnx/config.pbtxt << 'EOF'
name: "dummy_migraphx_onnx"
backend: "onnxruntime"
max_batch_size: 64

dynamic_batching {
    max_queue_delay_microseconds: 300
    preserve_ordering: true
}

optimization {
    cuda {
        graphs: true
    }
    input_pinned_memory {
        enable: true
    }
    execution_accelerators {
        gpu_execution_accelerator {
            name: "migraphx"
            parameters: {
                key: "migraphx_max_dynamic_batch"
                value: "64"
            }
        }
    }
}

input: [
    { name: "input", data_type: TYPE_FP32, dims: [8] }
]
output: [
    { name: "output", data_type: TYPE_FP32, dims: [4] }
]

instance_group [
    { kind: KIND_GPU, count: 2, gpus: [0] }
]
EOF

Save the following Python script to generate a dummy two-layer MLP ONNX model (input dimension 8, hidden dimension 16, output dimension 4) with dynamic batching support.

# save as create_dummy_onnx.py under your current working directory (run with: python create_dummy_onnx.py)
import os
import numpy as np
import onnx
from onnx import helper, TensorProto

INPUT_SIZE = 8
HIDDEN_SIZE = 16
OUTPUT_SIZE = 4

MODEL_DIR = os.path.join(os.getcwd(), "model_repository", "dummy_migraphx_onnx", "0")
MODEL_PATH = os.path.join(MODEL_DIR, "model.onnx")
os.makedirs(MODEL_DIR, exist_ok=True)

x = helper.make_tensor_value_info("input", TensorProto.FLOAT, ["batch", INPUT_SIZE])
y = helper.make_tensor_value_info("output", TensorProto.FLOAT, ["batch", OUTPUT_SIZE])

np.random.seed(42)
W1 = np.random.randn(INPUT_SIZE, HIDDEN_SIZE).astype(np.float32) * 0.1
b1 = np.zeros(HIDDEN_SIZE, dtype=np.float32)
W2 = np.random.randn(HIDDEN_SIZE, OUTPUT_SIZE).astype(np.float32) * 0.1
b2 = np.zeros(OUTPUT_SIZE, dtype=np.float32)

w1 = helper.make_tensor("W1", TensorProto.FLOAT, [INPUT_SIZE, HIDDEN_SIZE], W1.flatten().tolist())
b1_t = helper.make_tensor("b1", TensorProto.FLOAT, [HIDDEN_SIZE], b1.tolist())
w2 = helper.make_tensor("W2", TensorProto.FLOAT, [HIDDEN_SIZE, OUTPUT_SIZE], W2.flatten().tolist())
b2_t = helper.make_tensor("b2", TensorProto.FLOAT, [OUTPUT_SIZE], b2.tolist())

nodes = [
    helper.make_node("MatMul", ["input", "W1"], ["h1"], name="matmul1"),
    helper.make_node("Add", ["h1", "b1"], ["h1b"], name="add1"),
    helper.make_node("Relu", ["h1b"], ["h1r"], name="relu1"),
    helper.make_node("MatMul", ["h1r", "W2"], ["h2"], name="matmul2"),
    helper.make_node("Add", ["h2", "b2"], ["output"], name="add2"),
]

graph = helper.make_graph(
    nodes, "dummy_mlp", [x], [y],
    initializer=[w1, b1_t, w2, b2_t],
)
model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 11)])
model.ir_version = 9
onnx.checker.check_model(model)
onnx.save(model, MODEL_PATH)
print(f"Saved: {MODEL_PATH}")
print(f"Input shape: [batch, {INPUT_SIZE}], Output shape: [batch, {OUTPUT_SIZE}]")

# Generate the ONNX model file in the model repository
python create_dummy_onnx.py
ls -R $PWD/model_repository/

Start Triton Inference Server (ROCm-enabled) with your model repository.

Use a ROCm-enabled Triton container and pass through the AMD GPU devices. The example below assumes a ROCm 7.2 + Triton 25.12 image tag.

# Pull the ROCm-enabled Triton Server image
docker pull rocm/tritoninferenceserver:tritoninferenceserver-25.12.amd1_rocm7.2_ubuntu24.04_py3.12

# Create a host directory for MIGraphX compilation cache
mkdir -p $PWD/migraphx_cache

# Launch Triton Server, mapping your model repository and MIGraphX cache
docker run --rm -it --net=host \
  --device=/dev/kfd --device=/dev/dri \
  --group-add video \
  --ipc=host \
  -p 8000:8000 \
  -p 8001:8001 \
  -p 8002:8002 \
  -e ORT_MIGRAPHX_MODEL_CACHE_PATH=/migraphx_cache \
  -e ORT_MIGRAPHX_CACHE_PATH=/migraphx_cache \
  -v $PWD/model_repository:/models \
  -v $PWD/migraphx_cache:/migraphx_cache \
  rocm/tritoninferenceserver:tritoninferenceserver-25.12.amd1_rocm7.2_ubuntu24.04_py3.12 \
  tritonserver --model-repository=/models --exit-on-error=false

Keep this terminal running. You should see Triton load the dummy_migraphx_onnx model and report that the model is ready.

Use the python script or perf_analyzer tool from the ROCm/triton-inference-server-server repository to send client requests for end-to-end benchmark tests.