FlashInfer on ROCm documentation#
2026-02-27
Accelerate LLM attention and decoding kernels with FlashInfer on ROCm for AMD Instinct GPUs, enabling high-throughput batch and streaming generation for real-time applications like multilingual chat and code completion.
FlashInfer is a library and kernel generator for Large Language Models (LLMs) that provides high-performance attention and decoding kernels for graphics processing units (GPUs). FlashInfer focuses on LLM serving and inference, delivering strong performance across diverse serving scenarios.
FlashInfer on ROCm includes capabilities such as load-balanced scheduling,
sparse and dense attention optimizations, and single and batch decode and
prefill kernels for high-performance execution on AMD Instinct MI300X and MI325X GPUs.
FlashInfer features highly efficient attention kernels and memory-optimized
techniques, and supports customized attention variants.
It is compatible with torch.compile, offers high-performance LLM-specific
operators, and integrates easily through its PyTorch and C++ APIs.
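To make the decode workload concrete, the following is a minimal pure-Python sketch of what a single-query decode attention step computes: one new query token attended against the cached keys and values. This is an illustration of the underlying math only, not the FlashInfer API; the function name and list-based shapes are hypothetical.

```python
import math

def single_decode_attention(q, K, V):
    """Reference single-query decode attention: softmax(q . K^T / sqrt(d)) . V.

    q: list of d floats (query vector for the newly generated token)
    K, V: n rows of d floats each (cached keys and values from prior tokens)
    Returns the attention output, a list of d floats.
    """
    d = len(q)
    # Scaled dot-product score of the query against every cached key.
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
    # Numerically stable softmax over the scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Output is the attention-weighted sum of the cached value rows.
    return [sum(w * v[j] for w, v in zip(weights, V)) for j in range(d)]
```

Optimized kernels such as FlashInfer's compute this same result, but fuse the score, softmax, and weighted-sum passes and balance the per-sequence work across GPU compute units rather than looping row by row.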
Note
The ROCm port of FlashInfer is under active development, and some features are not yet available.
For the most up-to-date feature support matrix, refer to the README in the
ROCm/flashinfer repository.
FlashInfer is part of the ROCm-LLMExt toolkit.
The FlashInfer public repository is located at ROCm/flashinfer.
To contribute to the documentation, refer to Contributing to ROCm.
You can find licensing information on the Licensing page.