Detecting NaN in hipBLASLt GEMM Output#
hipBLASLt ships an opt-in NaN scanner (HIPBLASLT_CHECK_NUMERICS) that can
tell you which matmul call first produced NaN in your training run.
Set five environment variables, re-run your job, and read the answer from
a log file — no changes to your training code.
Requires a hipBLASLt build that includes the NaN-scanner feature (hipBLASLt PR 7423). Stock ROCm releases prior to that PR will silently ignore
HIPBLASLT_CHECK_NUMERICS*env vars and emit nothing. See §1.1 — Verify the scanner is active to confirm your build supports it before running a long job.
1. Set the environment variables#
export HIPBLASLT_CHECK_NUMERICS=warn
export HIPBLASLT_CHECK_NUMERICS_SCAN_EVERY=1000
export HIPBLASLT_CHECK_NUMERICS_STOP_ON_FIRST=1
export HIPBLASLT_LOG_MASK=160
export HIPBLASLT_LOG_FILE=/tmp/hipblaslt.log
Then run your training script normally. No code changes needed.
Variable |
What it does |
|---|---|
|
Turns the scanner on. |
|
Check 1 out of every 1000 matmul calls. +0.04% overhead (1-in-1000 sampling). |
|
Stop scanning and auto-log when the first NaN is found. |
|
Write per-matmul details to the log file. |
|
Where to write the log. For multi-GPU runs, give each rank a separate file — every rank writes to this path, so a shared filename would interleave or truncate across ranks. Use whatever per-rank scheme your launcher provides, e.g. |
Note that
STOP_ON_FIRSTis a per-process flag — each rank stops scanning on its own first NaN independently — so multiple ranks may log aCHECK_NUMERICSline. If so, work from the rank with the earliest reportedcall_id.
STOP_ON_FIRST=1 is critical: it makes the scanner auto-emit a log
line as soon as NaN is detected and then suppress further scans — no
GPU sync, no manual drain, no Python shim. Works with PyTorch out of the
box.
1.1 Verify the scanner is active#
Before kicking off a long training run, do a ~10-second sanity check. A
hipBLASLt build that’s missing PR 7423 will silently ignore the
CHECK_NUMERICS* env vars, so it’s worth confirming once per environment
(new container, new ROCm install, new LD_LIBRARY_PATH, etc.).
LOG=/tmp/hipblaslt_smoke.log
rm -f "$LOG"
HIPBLASLT_CHECK_NUMERICS=warn \
HIPBLASLT_CHECK_NUMERICS_SCAN_EVERY=1 \
HIPBLASLT_CHECK_NUMERICS_STOP_ON_FIRST=1 \
HIPBLASLT_LOG_MASK=160 \
HIPBLASLT_LOG_FILE="$LOG" \
python -c '
import torch
x = torch.full((128, 128), float("nan"), device="cuda")
# Need >=2 matmuls: the auto-drain log line fires when the NEXT host-peek
# sees a non-zero flag, so one matmul alone produces no CHECK_NUMERICS line.
for _ in range(5):
(x @ x).cpu()
'
echo "--- bench lines (env vars reaching hipBLASLt): $(grep -c '^hipblaslt-bench' "$LOG")"
echo "--- CHECK_NUMERICS lines (scanner is active): $(grep -c CHECK_NUMERICS "$LOG")"
Expected output on a working build:
--- bench lines (env vars reaching hipBLASLt): 5
--- CHECK_NUMERICS lines (scanner is active): 1
And grep CHECK_NUMERICS "$LOG" will show one auto-drain line:
[hipBLASLt CHECK_NUMERICS] auto-drain on host peek:
first NaN observed at sampled matmul call #1,
effective window [1..2], mode=2, scan_every=1.
(STOP_ON_FIRST: further scans suppressed after this call.)
Interpretation:
bench lines |
CHECK_NUMERICS lines |
Diagnosis |
|---|---|---|
|
|
|
|
|
Env vars are flowing, but your hipBLASLt build does not include PR 7423. Rebuild or use a patched image. |
|
|
Scanner is active. You’re good. |
Why the loop?
STOP_ON_FIRST=1emits theCHECK_NUMERICSlog line via a host-peek of the device flag on the NEXT matmul call. The scan kernel for call #1 sets the flag; the host-peek for call #2 sees it non-zero, CAS-elects a logger thread, and emits the line. With only one matmul there is no “next” call to trigger the emission, so the line never appears in the log even though the scan happened. In real training this is never an issue (there are thousands of matmuls).
2. Read the result#
After your job crashes (or you stop it), look at the log file:
Step 1 — Find the call_id#
grep 'CHECK_NUMERICS' /tmp/hipblaslt.log
Example output:
[2026-05-21 12:53:54.694][hipBLASLt CHECK_NUMERICS] auto-drain on host peek:
first NaN observed at sampled matmul call #132000
(true first NaN somewhere in (131000..132000] due to scan_every=1000),
effective window [1..132054], mode=2, scan_every=1000.
To bisect, re-run with
HIPBLASLT_CHECK_NUMERICS_SCAN_FROM=131001
HIPBLASLT_CHECK_NUMERICS_SCAN_UNTIL=132000
HIPBLASLT_CHECK_NUMERICS_SCAN_EVERY=1.
(STOP_ON_FIRST: further scans suppressed after this call.)
Two important features of this line:
The reported
call_id(132000) is the sampled call that observed the NaN. WithSCAN_EVERY=1000, the true first buggy matmul lies somewhere in the previous 1000 calls — the line tells you exactly that window.The line includes a copy-pasteable bisect hint (
SCAN_FROM/SCAN_UNTIL/SCAN_EVERY=1) you can use on a re-run to pinpoint the exact call_id (only useful for deterministic bugs — see §3 below).
Step 2 — Look up the matmul#
sed -n '132000p' /tmp/hipblaslt.log
This works because, with HIPBLASLT_LOG_MASK=160, each matmul call emits
exactly one “bench” line at the top of the log file with no header lines
above it — so log line N corresponds to matmul call_id N. If you change
the log mask (or prepend anything to the file) this 1-to-1 mapping breaks
and the sed lookup will silently return the wrong matmul; adjust the line
offset accordingly, or grep for the call_id directly instead.
This prints one line with the matmul’s shape, data types, transpose modes,
and solution_index. Example:
-m 128 -n 128 -k 1280 --transA T --transB N
--a_type R_32F --b_type R_32F --d_type R_32F --solution_index 482365 ...
Step 3 — Get the kernel name#
grep 'solution_index: 482365,' /tmp/hipblaslt.log | head -1
All three steps in one line#
CID=$(grep -oP 'matmul call #\K[0-9]+' /tmp/hipblaslt.log | head -1) && \
echo "Call ID: $CID" && \
sed -n "${CID}p" /tmp/hipblaslt.log
Enumerate candidate kernels in the 1000-call window#
Since SCAN_EVERY=1000, the actual buggy call is somewhere in calls
(CID-999)..CID inclusive — i.e. the half-open window (CID-1000, CID]
that the scanner reports. With CID=132000, that’s lines 131001..132000:
CID=132000
sed -n "$((CID - 999)),${CID}p" /tmp/hipblaslt.log \
| grep -oP -- '--solution_index \K[0-9]+' \
| sort | uniq -c | sort -rn
3. Narrowing down: deterministic vs non-deterministic bugs#
Some NaN bugs fire at the same matmul call on every run (deterministic); others fire at different calls each run (HBM-contention races, etc.).
Deterministic bugs: re-run with the printed SCAN_FROM/SCAN_UNTIL/
SCAN_EVERY=1 bisect hint. The next log line will report the exact
call_id of the buggy matmul.
Non-deterministic bugs: the bisect hint won’t help — the bug fires at
a different call_id each run. Just re-run from the start with
SCAN_EVERY=1 (and keep STOP_ON_FIRST=1). The scanner short-circuits
after the first NaN, so the per-call overhead only applies until then.
The next log line will report the exact buggy call_id.
4. What the scanner covers#
Covered |
Not covered |
|---|---|
NaN in output (D matrix) of |
NaN from other libraries (aiter, CK, MIOpen, rocBLAS) |
dtypes: f32, f64, f16, bf16, fp8 (e4m3, e5m2) |
NaN in matmul inputs (A, B, C) |
Standard and grouped-GEMM paths |
Device-side user-args paths |
Sub-byte packed types (fp6, fp4) |
|
Non-GEMM operations (attention, convolution, etc.) |
Note on NaN-in-inputs: the scanner only inspects the output D. If a
NaN enters a matmul via A, B, or C (e.g. produced by an earlier
non-GEMM op), the scanner will flag the first matmul that propagates it
to D, not the op that originally produced it. The reported call_id
is therefore the earliest GEMM-visible symptom, which may be one or more
ops downstream of the true source.
5. Troubleshooting#
“I set the env vars but never see a CHECK_NUMERICS line.”#
Three possibilities, in order of likelihood:
No NaN actually fired during the run. Confirm with the smoke test in §1.1 — if that produces a
CHECK_NUMERICSline, your setup is fine and the bug just didn’t trigger this run.LD_LIBRARY_PATHis resolving to an unpatched hipBLASLt that silently ignores the env vars. The §1.1 smoke test’s “bench lines ≥ 1 but CHECK_NUMERICS lines = 0” row diagnoses exactly this. Verify with:ldd $(python -c 'import torch, os; \ print(os.path.join(os.path.dirname(torch.__file__), "lib/libtorch_hip.so"))') \ | grep hipblaslt
and confirm the resolved
libhipblaslt.socame from a build that includes PR 7423.The log file is being clobbered by another process writing to the same path, or the path is unwritable. Try a per-PID path (
HIPBLASLT_LOG_FILE=/tmp/hipblaslt.$$.log) and re-run.
“My job NaNs in production, but the scanner reports no NaN in the bisect window on re-run.”#
This is a non-deterministic bug — see §3.
The narrowed-window bisect only works for bugs that fire at the same
call_id every run. For non-deterministic bugs, run the full 1000-call
window enumeration (§2 last code block) across multiple trials and look
for solution_index values that recur.
“The log file is too big.”#
HIPBLASLT_LOG_MASK=160 writes ≈2.5 KB per matmul call (bench line + a
profile/solution-index line). To roughly halve the size, drop bit 7 and
use bit 5 only:
export HIPBLASLT_LOG_MASK=32
The workflow in this doc still works with mask=32 because Step 2 (sed -n '${CID}p') only needs the bench line. Step 3 (the solution_index: ${SOL}, kernel-name grep) reads the profile line that bit 7 emits and
will return no result with mask=32; you’ll still know the
solution_index from the bench line, just not the human-readable kernel
name. If you need the kernel name, keep mask=160 and rotate /
compress logs out-of-band instead (e.g. with logrotate or
gzip --rsyncable).
Note: raising HIPBLASLT_CHECK_NUMERICS_SCAN_EVERY does not shrink
the log — every matmul still emits its LOG_MASK lines regardless of
whether the scanner sampled it. SCAN_EVERY only trades bisect-window
granularity for scanner overhead.
6. Quick-reference cheat sheet#
Setup (env vars only — no code changes):
export HIPBLASLT_CHECK_NUMERICS=warn
export HIPBLASLT_CHECK_NUMERICS_SCAN_EVERY=1000
export HIPBLASLT_CHECK_NUMERICS_STOP_ON_FIRST=1
export HIPBLASLT_LOG_MASK=160
export HIPBLASLT_LOG_FILE=/tmp/hipblaslt.log # use a per-rank path for multi-GPU
After a NaN crash (single copy-pasteable pipeline):
LOG=/tmp/hipblaslt.log # match what you set HIPBLASLT_LOG_FILE to
# 1. Find the call_id that first observed NaN
CID=$(grep -oP 'matmul call #\K[0-9]+' "$LOG" | head -1)
echo "Call ID: $CID"
# 2. Print the matmul shape + dtype + solution_index for that call
sed -n "${CID}p" "$LOG"
# 3. Extract solution_index and find the kernel-name line
SOL=$(sed -n "${CID}p" "$LOG" | grep -oP -- '--solution_index \K[0-9]+')
grep "solution_index: ${SOL}," "$LOG" | head -1