FlashInfer compatibility
2025-10-02
FlashInfer is a library and kernel generator for Large Language Models (LLMs) that provides high-performance implementations of GPU kernels. FlashInfer focuses on LLM serving and inference, delivering state-of-the-art performance across diverse scenarios.
FlashInfer features highly efficient attention kernels, load-balanced scheduling, and memory-optimized techniques, while supporting customized attention variants. It's compatible with torch.compile and offers high-performance LLM-specific operators, with easy integration through PyTorch and C++ APIs.
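As a small sketch of the torch.compile integration, assuming the ROCm port exposes the upstream Python entry point `single_decode_with_kv_cache`, a FlashInfer call can sit inside a compiled function (shapes here are illustrative):

```python
import torch
import flashinfer

@torch.compile
def decode_step(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # FlashInfer kernels are registered as PyTorch operators, so compiled
    # functions can call them directly (availability of this path in the
    # ROCm port is an assumption).
    return flashinfer.single_decode_with_kv_cache(q, k, v)

# On ROCm builds of PyTorch, the "cuda" device maps to AMD GPUs.
q = torch.randn(32, 128, dtype=torch.float16, device="cuda")       # [qo_heads, head_dim]
k = torch.randn(1024, 8, 128, dtype=torch.float16, device="cuda")  # [kv_len, kv_heads, head_dim]
v = torch.randn(1024, 8, 128, dtype=torch.float16, device="cuda")
o = decode_step(q, k, v)
```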
Note
The ROCm port of FlashInfer is under active development, and some features are not yet available. For the latest feature compatibility matrix, refer to the README of the ROCm/flashinfer repository.
Support for the ROCm port of FlashInfer is available as follows:

- ROCm support for FlashInfer is hosted in the ROCm/flashinfer repository. This location differs from the upstream flashinfer-ai/flashinfer repository.
- To install FlashInfer, use the prebuilt Docker image, which includes ROCm, FlashInfer, and all required dependencies.
- See the ROCm FlashInfer installation guide to install and get started.
- See the Installation guide in the upstream FlashInfer documentation.
Note
FlashInfer is supported on ROCm 6.4.1.
Supported devices
Officially Supported: AMD Instinct™ MI300X
Use cases and recommendations
This release of FlashInfer on ROCm provides decode functionality for LLM inference. In the decode phase, tokens are generated sequentially: the model predicts each new token based on the previously generated tokens and the input context.
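The following sketch illustrates such a decode loop using the upstream `single_decode_with_kv_cache` API; shapes, dtypes, and the availability of this exact entry point in the ROCm port are assumptions for illustration:

```python
import torch
import flashinfer

num_qo_heads, num_kv_heads, head_dim = 32, 8, 128
dtype, device = torch.float16, "cuda"  # "cuda" maps to AMD GPUs in ROCm PyTorch

# KV cache produced by the prefill phase ("NHD" layout: [kv_len, heads, dim]).
k_cache = torch.randn(2048, num_kv_heads, head_dim, dtype=dtype, device=device)
v_cache = torch.randn(2048, num_kv_heads, head_dim, dtype=dtype, device=device)

for _ in range(4):  # generate four tokens, one at a time
    # Query for the single new token; grouped-query attention is handled
    # when num_qo_heads != num_kv_heads.
    q = torch.randn(num_qo_heads, head_dim, dtype=dtype, device=device)
    o = flashinfer.single_decode_with_kv_cache(q, k_cache, v_cache)
    # In a real model: project `o` to logits, sample the next token, and
    # append that token's K/V to the cache. Random stand-ins are used here.
    new_k = torch.randn(1, num_kv_heads, head_dim, dtype=dtype, device=device)
    new_v = torch.randn(1, num_kv_heads, head_dim, dtype=dtype, device=device)
    k_cache = torch.cat([k_cache, new_k], dim=0)
    v_cache = torch.cat([v_cache, new_v], dim=0)
```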
FlashInfer on ROCm brings over upstream features such as load balancing, sparse and dense attention optimizations, and batching support, enabling efficient execution on AMD Instinct™ MI300X GPUs.
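As a sketch of batched decode over a paged KV cache, the upstream API provides `BatchDecodeWithPagedKVCacheWrapper`, which plans a load-balanced schedule once and reuses it across runs. The page-table layout below follows the upstream documentation; whether every option is available in the ROCm port should be confirmed against the compatibility matrix:

```python
import torch
import flashinfer

num_qo_heads, num_kv_heads, head_dim, page_size = 32, 8, 128, 16
batch_size, pages_per_req = 4, 8
total_pages = batch_size * pages_per_req

# Paged KV cache in "NHD" layout: [pages, 2 (K/V), page_size, kv_heads, head_dim].
kv_cache = torch.randn(
    total_pages, 2, page_size, num_kv_heads, head_dim,
    dtype=torch.float16, device="cuda",
)

# CSR-style page table: request i owns pages kv_indices[kv_indptr[i]:kv_indptr[i+1]].
kv_indptr = torch.arange(0, total_pages + 1, pages_per_req, dtype=torch.int32, device="cuda")
kv_indices = torch.arange(total_pages, dtype=torch.int32, device="cuda")
# Number of valid slots in each request's last page (all pages full here).
kv_last_page_len = torch.full((batch_size,), page_size, dtype=torch.int32, device="cuda")

# Plan the schedule once, then run it for each layer / decode step.
workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")
wrapper = flashinfer.BatchDecodeWithPagedKVCacheWrapper(workspace, "NHD")
wrapper.plan(
    kv_indptr, kv_indices, kv_last_page_len,
    num_qo_heads, num_kv_heads, head_dim, page_size,
    pos_encoding_mode="NONE", data_type=torch.float16,
)

q = torch.randn(batch_size, num_qo_heads, head_dim, dtype=torch.float16, device="cuda")
o = wrapper.run(q, kv_cache)  # [batch_size, num_qo_heads, head_dim]
```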
Because large LLMs often require substantial KV caches or long context windows, FlashInfer on ROCm also implements cascade attention from upstream to reduce memory usage.
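Conceptually, cascade attention computes attention over the shared prefix once and over each request's unique suffix separately, then combines the two partial results. The upstream `merge_state` operator performs that combination from each partial output and its log-sum-exp; the tensors below are placeholders standing in for real partial results:

```python
import torch
import flashinfer

# Partial attention outputs and their log-sum-exp (lse) values, e.g. one pass
# over the shared prefix and one over a request's unique suffix.
num_tokens, num_heads, head_dim = 1, 32, 128
v_prefix = torch.randn(num_tokens, num_heads, head_dim, dtype=torch.float16, device="cuda")
s_prefix = torch.randn(num_tokens, num_heads, dtype=torch.float32, device="cuda")
v_suffix = torch.randn(num_tokens, num_heads, head_dim, dtype=torch.float16, device="cuda")
s_suffix = torch.randn(num_tokens, num_heads, dtype=torch.float32, device="cuda")

# merge_state combines the two partial states into the exact attention output
# over the concatenated KV, so the shared prefix is only computed once.
v_merged, s_merged = flashinfer.merge_state(v_prefix, s_prefix, v_suffix, s_suffix)
```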
For currently supported use cases and recommendations, refer to the AMD ROCm blog, where you can search for examples and best practices to optimize your workloads on AMD GPUs.
Docker image compatibility

AMD validates and publishes ROCm FlashInfer images with ROCm and PyTorch backends on Docker Hub. The following Docker image tag and associated inventory represent the FlashInfer version from the official Docker Hub, validated for ROCm 6.4.1.
| Docker image | ROCm | FlashInfer | PyTorch | Ubuntu | Python |
|---|---|---|---|---|---|
| rocm/flashinfer | 6.4.1 | | | 24.04 | |