ROCm-LLMExt 26.03 release notes#

3 min read time

Applies to Linux

This is the fourth release of the AMD ROCm LLMExt toolkit (ROCm-LLMExt), an open-source software toolkit built on the ROCm platform for large language model (LLM) extensions, integrations, and performance enablement on AMD GPUs. The toolkit brings together training, post-training, inference, and orchestration components to make modern LLM stacks practical and reproducible on AMD hardware.

Release highlights#

Note

ROCm-LLMExt 26.03 introduces one new component (Triton Inference Server) and includes targeted updates to one existing component (FlashInfer); the remaining components (verl, Ray, and llama.cpp) are unchanged.

This release introduces the following component with support for ROCm 7.2.0:

  • Triton Inference Server is a high-performance serving system that lets you deploy and run trained AI models in production so applications can send requests and receive predictions efficiently in real time.
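As a sketch of the request flow this enables, the snippet below builds an inference request body in the KServe v2 format that Triton's HTTP endpoint accepts. The model name and tensor names (`INPUT0`, `OUTPUT0`) are hypothetical placeholders, not names defined by this release.

```python
import json

def build_infer_request(data: list) -> str:
    """Build a KServe v2 inference payload, as accepted by Triton at
    POST http://<host>:8000/v2/models/<model_name>/infer."""
    payload = {
        "inputs": [
            {
                "name": "INPUT0",          # hypothetical input tensor name
                "shape": [1, len(data)],   # single-item batch
                "datatype": "FP32",
                "data": data,
            }
        ],
        "outputs": [{"name": "OUTPUT0"}],  # hypothetical output tensor name
    }
    return json.dumps(payload)

# Example: a 1x4 FP32 input for a hypothetical model
body = build_infer_request([0.1, 0.2, 0.3, 0.4])
print(body)
```

An HTTP client would POST this body to the model's `infer` endpoint and read the predicted tensors from the JSON response.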

This release adds support for ROCm 7.2.0 and ROCm 7.0.2 for the following component:

  • FlashInfer is a library and kernel generator for large language models (LLMs) that provides high-performance GPU kernel implementations. It focuses on LLM serving and inference, delivering high performance across diverse deployment scenarios.

System requirements#

For the 26.03 release, ROCm-LLMExt components have differing ROCm version requirements. Follow the installation instructions for each individual component, where the exact ROCm dependency is listed, or refer to the compatibility matrix to verify supported ROCm versions.
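Because components pin different ROCm versions, a quick programmatic check can help before installing. The sketch below compares an installed ROCm version string against a component's minimum requirement; the version strings used are the ones named in this release, and the helper name is illustrative.

```python
def meets_minimum(installed: str, required: str) -> bool:
    """Compare dotted version strings numerically, e.g. '7.2.0' vs '7.0.2'."""
    to_tuple = lambda v: tuple(int(part) for part in v.split("."))
    return to_tuple(installed) >= to_tuple(required)

# Illustrative checks against the ROCm versions named in this release
print(meets_minimum("7.2.0", "7.0.2"))  # True
print(meets_minimum("7.0.2", "7.2.0"))  # False
```

A numeric tuple comparison avoids the pitfall of string comparison, where "7.10.0" would sort before "7.2.0".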

ROCm-LLMExt components#

The following table lists ROCm-LLMExt component versions for the 26.03 release. Component sources are available on GitHub.

Name                      Version
verl                      0.6.0
Ray                       2.51.1
llama.cpp                 b6652
FlashInfer                0.2.5 ⇒ 0.5.3
Triton Inference Server   25.12

Detailed component changelogs#

FlashInfer 0.5.3#

This release adds support for ROCm 7.2.0 and ROCm 7.0.2 on AMD Instinct MI325X and MI300X GPUs and introduces support for MI355X GPUs.

Triton Inference Server 25.12#

Triton Inference Server is a newly supported component in the ROCm-LLMExt toolkit. It is a high-performance serving system that lets you deploy and run trained AI models in production so applications can send requests and receive predictions efficiently in real time. This release is supported on ROCm 7.2.0 on AMD Instinct MI355X and MI300X GPUs.
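To show what deploying a model involves, the fragment below sketches the standard Triton model repository layout and a minimal model configuration. The model name, backend, and tensor shapes are hypothetical placeholders chosen for illustration, not values shipped with this release.

```
model_repository/
└── my_model/              # hypothetical model name
    ├── config.pbtxt       # model configuration (protobuf text format)
    └── 1/                 # version subdirectory
        └── model.onnx     # model artifact for the chosen backend

# config.pbtxt
name: "my_model"
platform: "onnxruntime_onnx"
max_batch_size: 8
input [
  {
    name: "INPUT0"
    data_type: TYPE_FP32
    dims: [ 4 ]
  }
]
output [
  {
    name: "OUTPUT0"
    data_type: TYPE_FP32
    dims: [ 3 ]
  }
]
```

Pointing the server at the repository root (`--model-repository=model_repository`) makes every valid model directory under it available for inference requests.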