FlashInfer compatibility
2025-10-21
FlashInfer is a library and kernel generator for Large Language Models (LLMs) that provides high-performance implementations of graphics processing unit (GPU) kernels. FlashInfer focuses on LLM serving and inference, delivering state-of-the-art performance across diverse scenarios.
FlashInfer features highly efficient attention kernels, load-balanced scheduling, and memory-optimized
techniques, while supporting customized attention variants. It's compatible with torch.compile and
offers high-performance LLM-specific operators, with easy integration through PyTorch and C++ APIs.
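For example, the upstream Python API exposes fused decode attention as a single call. The following is a minimal sketch using flashinfer.single_decode_with_kv_cache; all tensor shapes are illustrative, and feature availability on the ROCm port may differ (see the note below).

```python
import torch
import flashinfer

num_qo_heads, num_kv_heads, head_dim, kv_len = 32, 8, 128, 4096

# Query for one decode step and the accumulated KV cache for one request.
# (On ROCm builds of PyTorch, the "cuda" device string maps to HIP.)
q = torch.randn(num_qo_heads, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")
v = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")

# Fused decode attention; handles grouped-query attention when
# num_qo_heads != num_kv_heads. Returns [num_qo_heads, head_dim].
o = flashinfer.single_decode_with_kv_cache(q, k, v)
```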
Note
The ROCm port of FlashInfer is under active development, and some features are not yet available.
For the latest feature compatibility matrix, refer to the README of the
ROCm/flashinfer repository.
Support overview
The ROCm-supported version of FlashInfer is maintained in the official ROCm/flashinfer repository, which differs from the upstream flashinfer-ai/flashinfer repository.
To get started and install FlashInfer on ROCm, use the prebuilt Docker images, which include ROCm, FlashInfer, and all required dependencies.
See the ROCm FlashInfer installation guide for installation and setup instructions.
You can also consult the upstream Installation guide for additional context.
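Inside the container, a quick sanity check can confirm that PyTorch is running on the ROCm (HIP) backend and that FlashInfer imports cleanly. This is a sketch; it assumes the image's default Python environment provides both packages.

```python
import torch
import flashinfer  # fails here if FlashInfer is not installed in the image

# torch.version.hip is set on ROCm builds of PyTorch (None on CUDA builds).
print("HIP runtime:", torch.version.hip)
print("GPU:", torch.cuda.get_device_name(0))
```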
Version support
FlashInfer is supported on ROCm 6.4.1.
Supported devices
Officially Supported: AMD Instinct™ MI300X
Use cases and recommendations
This release of FlashInfer on ROCm provides decode functionality for LLM inference. In the decode phase, tokens are generated sequentially: the model predicts each new token based on the previously generated tokens and the input context.
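As a sketch of what batched decoding looks like through the upstream Python API, the following plans and runs flashinfer.BatchDecodeWithPagedKVCacheWrapper over a paged KV cache. Shapes, page-table contents, and the workspace size are illustrative assumptions, and feature coverage on the ROCm port may differ.

```python
import torch
import flashinfer

batch_size, num_qo_heads, num_kv_heads, head_dim = 4, 32, 8, 128
page_size, pages_per_req = 16, 2
max_num_pages = batch_size * pages_per_req

# Paged KV cache: [max_num_pages, 2 (K/V), page_size, num_kv_heads, head_dim].
kv_cache = torch.randn(
    max_num_pages, 2, page_size, num_kv_heads, head_dim,
    dtype=torch.float16, device="cuda",
)
# CSR-style page table: request i owns pages indices[indptr[i]:indptr[i+1]].
kv_page_indptr = torch.arange(
    0, (batch_size + 1) * pages_per_req, pages_per_req,
    dtype=torch.int32, device="cuda",
)
kv_page_indices = torch.arange(max_num_pages, dtype=torch.int32, device="cuda")
kv_last_page_len = torch.full(
    (batch_size,), page_size, dtype=torch.int32, device="cuda"
)

workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")
wrapper = flashinfer.BatchDecodeWithPagedKVCacheWrapper(workspace, "NHD")
wrapper.plan(
    kv_page_indptr, kv_page_indices, kv_last_page_len,
    num_qo_heads, num_kv_heads, head_dim, page_size,
    data_type=torch.float16,
)

# One new query token per request: [batch_size, num_qo_heads, head_dim].
q = torch.randn(batch_size, num_qo_heads, head_dim,
                dtype=torch.float16, device="cuda")
o = wrapper.run(q, kv_cache)  # [batch_size, num_qo_heads, head_dim]
```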
FlashInfer on ROCm brings over upstream features such as load balancing, sparse and dense attention optimizations, and batching support, enabling efficient execution on AMD Instinct™ MI300X GPUs.
Because large LLMs often require substantial KV caches or long context windows, FlashInfer on ROCm also implements cascade attention from upstream to reduce memory usage.
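Cascade attention computes attention over a shared prefix once, computes per-request suffix attention separately, and then merges the partial attention states (softmax-weighted values together with their log-sum-exp). The upstream building block for that merge is flashinfer.merge_state, sketched below with illustrative shapes; availability on the ROCm port may differ.

```python
import torch
import flashinfer

seq_len, num_heads, head_dim = 4, 32, 128

# Partial attention outputs and their log-sum-exp values, e.g. one from a
# shared-prefix KV cache and one from a request-specific suffix.
v_prefix = torch.randn(seq_len, num_heads, head_dim, dtype=torch.float16, device="cuda")
s_prefix = torch.randn(seq_len, num_heads, dtype=torch.float32, device="cuda")
v_suffix = torch.randn(seq_len, num_heads, head_dim, dtype=torch.float16, device="cuda")
s_suffix = torch.randn(seq_len, num_heads, dtype=torch.float32, device="cuda")

# Merge into the attention state over the union of both KV sets.
v_merged, s_merged = flashinfer.merge_state(v_prefix, s_prefix, v_suffix, s_suffix)
```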
For currently supported use cases and recommendations, refer to the AMD ROCm blog, where you can search for examples and best practices to optimize your workloads on AMD GPUs.
Docker image compatibility
AMD validates and publishes FlashInfer images with ROCm backends on Docker Hub. The following table lists the latest available FlashInfer Docker image and its software inventory, which you can view on Docker Hub.
| Docker image | ROCm | FlashInfer | PyTorch | Ubuntu | Python |
|---|---|---|---|---|---|
| rocm/flashinfer | 6.4.1 |  |  | 24.04 |  |