FlashInfer on ROCm documentation#
2026-02-27
Accelerate LLM attention and decoding kernels with FlashInfer on ROCm for AMD Instinct GPUs, enabling high-throughput batch and streaming generation for real-time applications like multilingual chat and code completion.
FlashInfer is a library and kernel generator for Large Language Models (LLMs) that provides high-performance attention and decoding kernels for graphics processing units (GPUs). FlashInfer focuses on LLM serving and inference, delivering strong performance across diverse serving scenarios.
FlashInfer on ROCm includes capabilities such as load-balanced scheduling,
sparse and dense attention optimizations, and single and batch decode and
prefill kernels for high-performance execution on AMD Instinct MI300X and MI325X GPUs.
FlashInfer features highly efficient attention kernels and memory-optimized
techniques, and supports customized attention variants.
It is compatible with torch.compile, offers high-performance LLM-specific
operators, and integrates easily through its PyTorch and C++ APIs.
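To make the decode workload concrete, the following is a minimal pure-Python sketch of what a single-query decode attention step computes: one new query token attended against the cached keys and values. This is an illustration of the underlying math only, not the FlashInfer API; the function name and list-based shapes are hypothetical.

```python
import math

def single_decode_attention(q, K, V):
    """Reference single-query decode attention: softmax(q . K^T / sqrt(d)) . V.

    q: list of d floats (query vector for the newly generated token)
    K, V: n rows of d floats each (cached keys and values from prior tokens)
    Returns the attention output, a list of d floats.
    """
    d = len(q)
    # Scaled dot-product score of the query against every cached key.
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
    # Numerically stable softmax over the scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Output is the attention-weighted sum of the cached value rows.
    return [sum(w * v[j] for w, v in zip(weights, V)) for j in range(d)]
```

Optimized kernels such as FlashInfer's compute this same result, but fuse the score, softmax, and weighted-sum passes and balance the per-sequence work across GPU compute units rather than looping row by row.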
Note
The ROCm port of FlashInfer is under active development, and some features are not yet available.
For the most up-to-date feature support matrix, refer to the README in the
ROCm/flashinfer repository.
FlashInfer is part of the ROCm-LLMExt toolkit.
The FlashInfer public repository is located at ROCm/flashinfer.
To contribute to the documentation, refer to Contributing to ROCm.
You can find licensing information on the Licensing page.