What is FlashInfer?#
2026-03-19
4 min read time
FlashInfer is a library and kernel generator for Large Language Models (LLMs) that provides high-performance kernel implementations for graphics processing units (GPUs). FlashInfer focuses on LLM serving and inference, delivering high performance across diverse serving scenarios.
FlashInfer features highly efficient attention kernels, load-balanced scheduling, and memory-optimized
techniques, while supporting customized attention variants. It is compatible with torch.compile and
offers high-performance LLM-specific operators that integrate easily through PyTorch and C++ APIs.
FlashInfer on ROCm supports both stages of LLM inference:
Prefill – processes the input prompt to construct KV caches and activations.
Decode – generates tokens sequentially using previously computed states.
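The two stages above can be sketched in plain Python. This is a toy illustration of the control flow, not the FlashInfer API: the "KV cache" is just a list of per-token key/value pairs, prefill fills it in one pass over the prompt, and decode appends one entry per generated token.

```python
# Toy sketch of the two LLM inference stages (illustrative, not the FlashInfer API).

def prefill(prompt_tokens):
    """Process the whole prompt at once, building the KV cache."""
    return [(f"k{t}", f"v{t}") for t in prompt_tokens]

def decode_step(kv_cache, new_token):
    """Generate one token: attend over all cached entries, then append the new one."""
    context_len = len(kv_cache)  # attention in this step spans every cached token
    kv_cache.append((f"k{new_token}", f"v{new_token}"))
    return context_len

cache = prefill([1, 2, 3])       # prefill: three prompt tokens in one pass
seen = decode_step(cache, 4)     # decode: one token at a time, reusing the cache
print(seen, len(cache))          # → 3 4
```

Prefill is compute-bound (many tokens per step), while decode is memory-bound (one token per step reading the whole cache), which is why the two stages get separate kernels.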
Why FlashInfer?#
FlashInfer is well suited for LLM inference acceleration for the following reasons:
Its specialized attention kernels target the critical path of decoding to deliver substantial latency and throughput gains.
Paged cache and fused operations reduce memory bandwidth pressure and improve efficiency at long sequence lengths.
Flexible integration allows drop-in acceleration within existing inference stacks and services.
ROCm-optimized builds leverage AMD Instinct hardware effectively in production environments.
Features and use cases#
FlashInfer provides the following key features:
High-Performance Attention Kernels: Delivers optimized prefill and decode kernels for transformer attention to minimize latency on ROCm GPUs.
Paged KV Cache Management: Implements efficient KV cache layouts and eviction strategies to sustain long-context, high-throughput generation.
Kernel Generator and Fusion: Provides a flexible kernel generation approach with fused operations to reduce memory transfers and overhead.
Streaming and Batched Inference: Supports both real-time streaming generation and large-batch workloads with dynamic sequence handling.
Model Compatibility: Works with common attention variants (for example, RoPE and multi-query/group-query attention) to accelerate modern LLMs.
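The paged KV cache feature above can be made concrete with a minimal Python sketch. This is a hypothetical structure for illustration, not FlashInfer's actual layout: token positions map through a block table to fixed-size pages, so a sequence can grow without requiring one large contiguous allocation.

```python
PAGE_SIZE = 4  # entries per page (illustrative; real implementations use e.g. 16)

class PagedKVCache:
    """Minimal paged KV cache: a pool of fixed-size pages plus a block table."""

    def __init__(self):
        self.pages = []        # pool of pages, each a list of KV entries
        self.block_table = []  # logical page index -> physical page index
        self.length = 0        # number of cached tokens

    def append(self, kv_entry):
        if self.length % PAGE_SIZE == 0:           # current page is full (or none yet)
            self.block_table.append(len(self.pages))
            self.pages.append([])                  # allocate a fresh fixed-size page
        self.pages[self.block_table[-1]].append(kv_entry)
        self.length += 1

    def get(self, pos):
        """Look up the KV entry for logical token position pos via the block table."""
        page = self.pages[self.block_table[pos // PAGE_SIZE]]
        return page[pos % PAGE_SIZE]

cache = PagedKVCache()
for t in range(10):
    cache.append((f"k{t}", f"v{t}"))
print(len(cache.pages), cache.get(9))  # 10 tokens fill 3 pages of size 4
```

Because pages are uniform, freed pages from finished sequences can be reused immediately, which is what sustains long-context, high-throughput generation.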
FlashInfer on ROCm also includes performance-enhancing features:
Load balancing across GPU compute units
Sparse and dense attention optimizations
Batch and single‑sequence decode
High‑throughput prefill optimized for AMD Instinct MI300X GPUs
FlashInfer is commonly used in the following scenarios:
Real-Time Serving: Power low-latency chat, agent, and code-completion systems with optimized attention paths.
Large-Scale Batch Generation: Accelerate batched inference for content creation, retrieval-augmented generation, and evaluation jobs.
Throughput-Critical Pipelines: Increase throughput and reduce end-to-end latency in model serving stacks and microservices.
Research on Kernel Efficiency: Prototype new attention and cache strategies to improve inference performance on AMD Instinct GPUs.
For currently supported use cases and recommendations, refer to the AMD ROCm blog, where you can search for examples and best practices to optimize your workloads on AMD GPUs.
Kernel support matrix#
Recommended attention modes available upstream:
Multi-Head Attention (MHA): Each attention head has its own keys and values, so the KV cache and memory-bandwidth requirements are largest.
Grouped-Query Attention (GQA): Groups of query heads share the same keys and values, reducing KV cache size and memory traffic.
Multi-Query Attention (MQA): All query heads share a single key/value stream, minimizing KV cache size and maximizing decode speed.
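The KV cache savings from GQA and MQA can be quantified with a back-of-the-envelope calculation. The model shape below is illustrative (a 32-layer model with 32 query heads of dimension 128, FP16, 4096-token context), not tied to any specific checkpoint:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """KV cache size: 2 (K and V) * layers * KV heads * head dim * tokens * dtype size."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

layers, q_heads, head_dim, seq = 32, 32, 128, 4096  # illustrative model shape

mha = kv_cache_bytes(layers, q_heads, head_dim, seq)  # every head keeps its own KV
gqa = kv_cache_bytes(layers, 8, head_dim, seq)        # 8 KV heads shared across groups
mqa = kv_cache_bytes(layers, 1, head_dim, seq)        # one KV head shared by all

print(mha / 2**30, gqa / 2**30, mqa / 2**30)  # → 2.0 0.5 0.0625 (GiB per sequence)
```

With these numbers, GQA with 8 KV heads cuts the per-sequence cache by 4x and MQA by 32x, which directly reduces the memory traffic that dominates decode.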
Note
The ROCm port of FlashInfer is under active development, and some features are not yet available.
For the most up-to-date feature support matrix, refer to the README in the
ROCm/flashinfer repository.
| Kernel Type | FP16 / BF16 | FP8 (E4M3, E5M2) | Has AITER backend | Notes |
|---|---|---|---|---|
| Decode Attention | ✅ | ✅ | No | Supports MHA, GQA, and MQA |
| Prefill Attention | ✅ | — | ✅ | Supports MHA, GQA, and MQA |
| Cascade Attention | — | — | No | Not yet ported |
| MLA | — | — | No | Not yet ported |
| POD | — | — | No | Not yet ported |
| Positional Encoding | — | — | No | Not yet ported |
| Sampling | ✅ | — | No | Supports Top-K/Top-P sampling, OnlineSoftmax, and SamplingFromLogits |
| Logits Processor | ✅ | — | No | |
| Normalization | ✅ | — | No | Supports RMS-Norm and Layer-Norm |
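As a reference point for the Sampling and Normalization rows, the following shows what RMS-Norm and Top-K logit filtering compute, in plain Python. This is a numerical sketch of the math, not FlashInfer's fused kernels:

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """RMS-Norm: divide x by its root-mean-square, then scale by a learned weight."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for w, v in zip(weight, x)]

def top_k_filter(logits, k):
    """Top-K: keep the k largest logits; mask the rest to -inf before sampling."""
    threshold = sorted(logits, reverse=True)[k - 1]
    return [v if v >= threshold else float("-inf") for v in logits]

y = rms_norm([1.0, 2.0, 3.0, 4.0], [1.0] * 4)
print([round(v, 3) for v in y])                  # unit-RMS output, elementwise scaled

print(top_k_filter([0.1, 2.0, -1.0, 1.5], k=2))  # keeps only 2.0 and 1.5
```

The optimized kernels fuse these steps (for example, online softmax combined with sampling) to avoid materializing intermediate tensors, but the arithmetic they implement is the same.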