Triton Inference Server on ROCm documentation

2026-03-31

Applies to Linux

Use Triton Inference Server on ROCm to serve large language models, computer vision models, recommender systems, and custom pipelines, all from a single platform.

Triton Inference Server is a high-performance model server for machine learning inference. It supports a wide range of model types and deep learning frameworks, including PyTorch, TensorFlow, ONNX Runtime, Python, vLLM, and more. Triton Inference Server handles concurrent model execution, dynamic batching, model ensembles, and streaming inference, maximizing throughput and GPU/CPU utilization for production deployments at scale.
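Features such as dynamic batching are enabled per model through a `config.pbtxt` file placed in the model repository. The following is a minimal sketch, assuming a hypothetical ONNX model named `resnet50`; the layout and field names follow Triton's standard model configuration schema:

```
# Model repository layout (hypothetical example):
#   models/
#   └── resnet50/
#       ├── config.pbtxt
#       └── 1/
#           └── model.onnx

name: "resnet50"
platform: "onnxruntime_onnx"
max_batch_size: 8

# Queue individual requests briefly so the server can merge them
# into larger batches before execution.
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}
```

The server is then pointed at the repository when it starts, for example with `tritonserver --model-repository=/path/to/models`.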

Triton Inference Server on ROCm is built on the AMD ROCm software stack and optimized for high performance on AMD Instinct GPUs. It integrates ROCm-aware runtime libraries and optimized kernel backends, delivering efficient inference across all supported model types and frameworks on AMD hardware.

Note

The ROCm port of Triton Inference Server is under active development, and some features are not yet available. For the most up-to-date feature support, refer to the README in the ROCm/triton-inference-server-server repository.

Triton Inference Server is part of the ROCm-LLMExt toolkit.

The Triton Inference Server public repository is located at ROCm/triton-inference-server-server.

To contribute to the documentation, refer to Contributing to ROCm.

You can find licensing information on the Licensing page.