Run a FlashInfer example#
2026-04-01
1 min read time
Applies to Linux
The examples folder in the ROCm/flashinfer repository has example code that you can use to run FlashInfer.
You can also use the AITER backend for the prefill attention kernels. The AITER backend is currently enabled for the single_prefill and batch_prefill kernels only. To select it, pass the backend="aiter" keyword argument when you invoke these kernels.
Save the following code snippet to a Python script named flashinfer_example.py.

import torch
import flashinfer

# Configuration
seq_len = 1024        # Prompt length
num_qo_heads = 32     # Number of query/output heads
num_kv_heads = 8      # Number of KV heads (GQA with 4:1 ratio)
head_dim = 128

# Create Q, K, V tensors (NHD layout: sequence, heads, dimension)
q = torch.randn(seq_len, num_qo_heads, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn(seq_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")
v = torch.randn(seq_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")

# Run single prefill attention with causal masking
output = flashinfer.single_prefill_with_kv_cache(q, k, v, causal=True, backend="aiter")
Run the script to use FlashInfer.
python flashinfer_example.py
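The batch_prefill kernels can also run on the AITER backend. The sketch below shows one way this might look with FlashInfer's BatchPrefillWithPagedKVCacheWrapper API; the batch size, sequence length, and page size are hypothetical values chosen for illustration, and the example assumes a GPU is available and that your FlashInfer build accepts backend="aiter" in the wrapper constructor.

```python
import torch
import flashinfer

# Hypothetical configuration for illustration
batch_size = 4
seq_len = 256          # tokens per request (uniform here for simplicity)
num_qo_heads = 32
num_kv_heads = 8
head_dim = 128
page_size = 16
pages_per_seq = seq_len // page_size

# Workspace buffer required by the wrapper API
workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")
wrapper = flashinfer.BatchPrefillWithPagedKVCacheWrapper(
    workspace, "NHD", backend="aiter"
)

# Ragged layout: qo_indptr[i] marks where request i's query tokens start;
# kv_indptr / kv_indices map each request to its KV cache pages
qo_indptr = torch.arange(0, (batch_size + 1) * seq_len, seq_len,
                         dtype=torch.int32, device="cuda")
kv_indptr = torch.arange(0, (batch_size + 1) * pages_per_seq, pages_per_seq,
                         dtype=torch.int32, device="cuda")
kv_indices = torch.arange(batch_size * pages_per_seq,
                          dtype=torch.int32, device="cuda")
kv_last_page_len = torch.full((batch_size,), page_size,
                              dtype=torch.int32, device="cuda")

# Plan once per batch shape, then run
wrapper.plan(qo_indptr, kv_indptr, kv_indices, kv_last_page_len,
             num_qo_heads, num_kv_heads, head_dim, page_size,
             causal=True, q_data_type=torch.float16)

q = torch.randn(batch_size * seq_len, num_qo_heads, head_dim,
                dtype=torch.float16, device="cuda")
# Paged KV cache, NHD layout: (num_pages, 2, page_size, num_kv_heads, head_dim)
kv_cache = torch.randn(batch_size * pages_per_seq, 2, page_size,
                       num_kv_heads, head_dim,
                       dtype=torch.float16, device="cuda")

# Output shape: (batch_size * seq_len, num_qo_heads, head_dim)
output = wrapper.run(q, kv_cache)
```

Because planning is separated from execution, the plan call can be reused across layers that share the same batch shape, which is the usual pattern when serving a multi-layer model.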