GPU violations#

GPU violations monitoring in AMD SMI tracks throttling events caused by power or thermal limits. When your GPU throttles, performance decreases to protect the hardware from damage due to overheating or excessive power draw. AMD SMI provides APIs to monitor violation percentages and identify the specific causes of throttling, enabling system administrators and developers to maintain GPU health and optimize performance.

GPU architecture support#

AMD SMI exposes two distinct throttling APIs – which one you use depends on your GPU generation:

  • On Instinct MI300 Series and newer data center GPUs (gpu_metrics v1.6+):

    Use amdsmi_get_violation_status() or the CLI (amd-smi monitor --violation, amd-smi metric --violation). This API reports throttling as time-based percentages, active status flags, and accumulated counters for each violation type — useful for historical and trend-oriented monitoring. See Interpreting violations API results for details.

    On these GPUs, throttle_status in amd-smi metric --power reports N/A.

  • On Radeon (Navi) and Instinct MI100/MI200 Series GPUs (gpu_metrics v1.3):

    Use amdsmi_get_gpu_metrics_info() or amd-smi metric --power to check throttle_status (throttled/unthrottled) and indep_throttle_status (per-reason bit flags such as PROCHOT_GFX, TDC_GFX, TEMP_MEM). These indicate whether throttling is happening right now, not how much over time. See Interpreting throttle_status results for details.

    On these GPUs, the violations API and amd-smi [metric/monitor] --violation return N/A or max_uint.

The two mechanisms aren’t interchangeable: the violations API measures how much throttling occurred (PVIOL%, TVIOL%); throttle_status measures whether it’s happening now. Both can help surface the same root causes, but the violations API gives you finer-grained trend data.

Note

MI300 Series and newer use gpu_metrics v1.8+, which introduced per-XCP and per-XCC violation fields (v1.9 adds a fully dynamic layout). Older GPUs use gpu_metrics v1.3 with a fixed layout.

Key concepts and common tasks#

Understanding how violations are measured, what they mean for GPU performance, and how to act on them helps you diagnose throttling issues effectively.

Throttling and performance degradation#

Throttling directly reduces GPU clock speeds to stay within power and thermal limits, which decreases performance. Higher violation percentages mean more time throttled, which translates to more performance loss.

For example, on Instinct MI300X systems and newer, 50% TVIOL means the GPU spent half the sampling period at reduced clocks. See Checking power and thermal violations for more information.
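To make that concrete, a violation percentage converts directly into throttled time over a sampling window. The helper below is a hypothetical sketch, not part of AMD SMI; it only illustrates the arithmetic:

```python
def throttled_seconds(violation_pct: float, window_s: float) -> float:
    """Estimate how long the GPU ran at reduced clocks during a
    sampling window, given a violation percentage (PVIOL or TVIOL)."""
    if not 0 <= violation_pct <= 100:
        raise ValueError("violation percentage must be between 0 and 100")
    return window_s * violation_pct / 100.0

# A 60 s window at 50% TVIOL means roughly 30 s at reduced clocks.
print(throttled_seconds(50, 60))  # 30.0
```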

Checking power and thermal violations#

  • On Instinct MI300 Series and newer, check the per_ppt_pwr (PVIOL) and per_socket_thrm (TVIOL) percentages via amdsmi_get_violation_status(). Using the CLI, amd-smi monitor --violation and amd-smi metric --violation display PVIOL and TVIOL percentages. See Usage for more information.

    • PVIOL, representing power violations, is the percentage of time the GPU throttled due to power (wattage) limits. This occurs when the GPU’s power consumption exceeds safe limits, triggering power throttling to stay within the power budget.

    • TVIOL, representing thermal violations, is the percentage of time the GPU throttled due to temperature limits. This happens when GPU temperatures exceed safe operating thresholds, causing the GPU to reduce performance to cool down.

    Values greater than 0% indicate time spent throttled. A GPU can experience both simultaneously, for example 30% PVIOL and 20% TVIOL at the same time. See Interpreting violations API results for details.

  • On Radeon (Navi) and Instinct MI100/MI200 Series, look at throttle_status bit flags in amd-smi metric --power to see if PPT (power) or thermal throttling is active. See Interpreting throttle_status results for more information.
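Putting the two MI3x+ metrics together, a small sketch can report which causes were active in a sampling period. The dictionary keys mirror the Python API example later on this page (per_ppt_pwr, per_socket_thrm); the helper itself is illustrative, not an AMD SMI function:

```python
def summarize_violations(v: dict) -> list[str]:
    """Report which violation types were active during the sampling
    period, given a result shaped like amdsmi_get_violation_status()
    output on MI3x+ (keys assumed; 'N/A' marks unsupported fields)."""
    labels = {
        "per_ppt_pwr": "PVIOL (power)",
        "per_socket_thrm": "TVIOL (thermal)",
    }
    findings = []
    for key, label in labels.items():
        pct = v.get(key, "N/A")
        if pct != "N/A" and pct > 0:
            findings.append(f"{label}: {pct}% of period throttled")
    return findings

# Both can be nonzero at once, e.g. 30% PVIOL and 20% TVIOL:
print(summarize_violations({"per_ppt_pwr": 30, "per_socket_thrm": 20}))
```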

Monitoring PROCHOT (processor hot) throttling#

  • On Instinct MI300 Series and newer, use amd-smi metric --violation or amd-smi monitor --violation to check PROCHOT_VIOLATION_ACTIVITY and PROCHOT_VIOLATION_STATUS.

  • On Radeon (Navi) and Instinct MI100/MI200 Series, use amdsmi_get_gpu_metrics_info() and check the indep_throttle_status field for PROCHOT_GFX bits. PROCHOT indicates emergency thermal throttling when the GPU hits critical temperature limits.

Monitoring GPU hotspot temperature violations#

  • On Instinct MI300 Series and newer, monitor hotspot temperature with amdsmi_get_temp_metric() and correlate with TVIOL%. High TVIOL% combined with high hotspot temps (>95°C) indicates thermal throttling. Use amd-smi metric --gpu all --temperature to track temperatures alongside violation status. See Usage for more information.
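The correlation rule above can be expressed as a tiny heuristic. This is a sketch; the 95°C threshold is the illustrative figure from this section, not a queried limit, so check your ASIC's real thresholds with amd-smi static --limit:

```python
def thermal_throttle_likely(hotspot_c: float, tviol_pct: float,
                            hotspot_limit_c: float = 95.0) -> bool:
    """Heuristic from the guidance above: a nonzero TVIOL% combined
    with a hotspot temperature over ~95 C points at thermal throttling.
    The threshold is illustrative; real limits vary by ASIC."""
    return tviol_pct > 0 and hotspot_c > hotspot_limit_c

print(thermal_throttle_likely(98.0, 12.0))  # True
print(thermal_throttle_likely(70.0, 12.0))  # False: hotspot within limits
```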

Detecting HBM (high bandwidth memory) thermal throttling#

HBM thermal throttling occurs when GPU memory overheats.

  • On Instinct MI300 Series and newer, this is detected via per_hbm_thrm (HBM_TVIOL%) and active_hbm_thrm in the violations API. See Violation status fields for more information.

  • On Radeon (Navi) and Instinct MI100/MI200 Series, check the TEMP_MEM bit in indep_throttle_status. Detailed HBM violation percentages are only available on MI3x+. See Interpreting throttle_status results for more information.

Adjusting clock limits#

Some Instinct MI300X GPUs and newer variants support adjusting the graphics clock (SCLK) and memory clock (MCLK) min/max limits, which can help manage power violations by capping clock speeds before the hardware throttles.

# View available clock limit options
amd-smi set -h

# View current min/max clock ranges (checks capabilities for your specific model)
amd-smi static --clock

# Set clock limits (--clk-limit / -L): adjust sclk or mclk min/max
# Notes:
#    - Not all MI3x+ models support adjusting clock limits for both SCLK and MCLK; check your model's capabilities with `amd-smi static --clock`
#    - Recommended: set the max limits first, then adjust the min limits.
sudo amd-smi set --clk-limit <CLK_TYPE> <LIM_TYPE> <VALUE>

# Confirm the changes took effect:
amd-smi metric --clock

# Reset clocks back to their default state
sudo amd-smi reset --clocks

Lowering the SCLK maximum reduces peak power draw, which can reduce PVIOL percentage at the cost of peak compute throughput. See amd-smi set -h for the full list of supported options for your hardware.

Adjusting the sample rate#

Violations are sampled every 100ms — the fastest rate the driver can update the metric cache. Set AMDSMI_GPU_METRICS_CACHE_MS=0 to disable AMD SMI’s internal cache and let the driver control when the cache updates. See AMD SMI C++ library usage for environment variable details.

Usage#

AMD SMI provides tools to programmatically monitor GPU violations and throttling events.

Tip for NVML users

The closest equivalent to nvmlDeviceGetViolationStatus() is amdsmi_get_violation_status().

| nvidia-smi command | amd-smi equivalent | Notes |
|---|---|---|
| nvidia-smi -q -d PERFORMANCE | amd-smi metric --violation | Instantaneous violation status; MI3x+ only |
| nvidia-smi dmon -s p | amd-smi monitor --violation | Continuous violation monitoring; MI3x+ only |
| nvidia-smi -q -d CLOCK | amd-smi metric --clock | Current clock frequencies |
| nvidia-smi -q -d POWER | amd-smi metric --power | Power usage and throttle status (Navi/MI1x/MI2x) |

The AMD SMI library provides APIs to query violation status and related functionalities.

See GPU monitoring APIs:

  • amdsmi_get_violation_status() - Get violation percentages

  • amdsmi_get_temp_metric() - Monitor GPU temperatures

  • amdsmi_get_gpu_activity() - Monitor GPU utilization

See Clock, power, and performance queries:

  • amdsmi_get_gpu_metrics_info() - Get throttle_status and detailed metrics

See PCIe queries:

  • amdsmi_get_gpu_asic_info() - Check ASIC capabilities

See ASIC and board static information APIs:

  • amdsmi_get_power_cap_info() - Check power limits

  • amdsmi_get_gpu_bdf_id() - Identify GPU device

See example/amd_smi_drm_example.cc for a complete working example.

amdsmi_violation_status_t status = {};
amdsmi_status_t ret = amdsmi_get_violation_status(processor_handle, &status);
if (ret == AMDSMI_STATUS_SUCCESS) {
    // MI3x+: access per_ppt_pwr (PVIOL%), per_socket_thrm (TVIOL%),
    // active_prochot_thrm, active_hbm_thrm, etc.
    // Max uint64/uint8 sentinel values indicate unsupported fields (N/A).
} else if (ret == AMDSMI_STATUS_NOT_SUPPORTED) {
    // Navi/MI1x/MI2x: violation API not supported.
    // Use amdsmi_get_gpu_metrics_info() and check throttle_status instead.
}

See example/amd_smi_violation_example.py for a complete working example.

import amdsmi

amdsmi.amdsmi_init(amdsmi.AmdSmiInitFlags.INIT_AMD_GPUS)
try:
    for processor in amdsmi.amdsmi_get_processor_handles():
        try:
            v = amdsmi.amdsmi_get_violation_status(processor)
            # MI3x+: access v['per_ppt_pwr'] (PVIOL%), v['per_socket_thrm'] (TVIOL%),
            # v['active_prochot_thrm'], v['active_hbm_thrm'], etc.
            # 'N/A' indicates unsupported fields on this ASIC.
        except amdsmi.AmdSmiLibraryException as e:
            if e.err_code == amdsmi.AmdSmiRetCode.STATUS_NOT_SUPPORTED:
                # Navi/MI1x/MI2x: violation API not supported.
                # Fall back to gpu_metrics throttle_status for a basic
                # THROTTLED / UNTHROTTLED indicator.
                m = amdsmi.amdsmi_get_gpu_metrics_info(processor)
                ts = m.get("throttle_status", "N/A")
                if ts == "N/A":
                    print("throttle_status: N/A")
                elif ts:
                    print("throttle_status: THROTTLED")
                else:
                    print("throttle_status: UNTHROTTLED")
            else:
                raise
finally:
    amdsmi.amdsmi_shut_down()

Monitor GPU violations using the CLI tool:

# MI3x+ (MI300X and newer): Check detailed violation status
amd-smi metric --violation

# MI3x+: Monitor violations in real time
amd-smi monitor --violation

# MI3x+: Monitor power, temp, GFX clock, and utilization violations every second
# AMDSMI_GPU_METRICS_CACHE_MS=0 disables the 100ms cache so the driver controls updates
AMDSMI_GPU_METRICS_CACHE_MS=0 amd-smi monitor -ptV --watch 1

# Navi/MI1x/MI2x: Check throttle status via power metrics
# (MI3x+ shows N/A here; use metric --violation or monitor --violation instead)
amd-smi metric --power

# All architectures: Monitor temperatures alongside power
amd-smi metric --gpu all --power --temperature

Interpreting violations API results#

GPU architecture support

The violations API is supported on Instinct MI300 Series and newer data center GPUs.

The violations API returns both power violations (PVIOL) and thermal violations (TVIOL) as percentages over time. You can generally interpret these percentages using the following heuristic:

| Value | Meaning |
|---|---|
| 0% | No throttling - GPU operating normally |
| 1-25% | Light throttling - minor performance impact |
| 25-50% | Moderate throttling - noticeable performance loss |
| 50-100% | Heavy throttling - significant performance degradation |
| N/A or max_uint | Feature not supported on this GPU |
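The heuristic table maps naturally onto a classification helper. This sketch is illustrative only: the boundary handling (whether exactly 25% counts as light or moderate) is a judgment call, and the sentinel handling follows the N/A / max_uint convention noted above:

```python
MAX_UINT64 = 0xFFFFFFFFFFFFFFFF

def violation_severity(pct) -> str:
    """Map a PVIOL/TVIOL percentage onto the heuristic table above.
    'N/A' or a max-uint sentinel marks an unsupported field."""
    if pct == "N/A" or pct == MAX_UINT64:
        return "not supported"
    if pct == 0:
        return "no throttling"
    if pct <= 25:
        return "light throttling"
    if pct <= 50:
        return "moderate throttling"
    return "heavy throttling"

print(violation_severity(0))   # no throttling
print(violation_severity(37))  # moderate throttling
```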

Violation status fields#

The amdsmi_violation_status_t struct (returned by amdsmi_get_violation_status()) provides three categories of data for each violation type. See amdsmi_violation_status_t for details.

| Category | Field prefix | Value type | Description |
|---|---|---|---|
| Accumulated counters | acc_* | uint64 | Raw counter incremented while violation is active |
| Violation status | active_* | uint8 (1/0) | Whether the violation is currently ACTIVE or NOT ACTIVE |
| Violation activity | per_* | uint64 (%) | Percentage of sampling period spent in violation (>0% = throttled) |
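The three categories are related: a per_* percentage describes how much the corresponding acc_* counter advanced relative to the shared acc_counter over a sampling period. The sketch below only illustrates that relationship; the driver computes the real per_* values, so treat the formula as an assumption:

```python
def violation_percentage(acc_prev: int, acc_now: int,
                         counter_prev: int, counter_now: int) -> float:
    """Illustrative only: derive a per_*-style percentage from two
    snapshots of an acc_* violation counter and the shared acc_counter
    field, as the categories above relate to one another."""
    ticks = counter_now - counter_prev
    if ticks <= 0:
        raise ValueError("need a later snapshot with counter progress")
    return 100.0 * (acc_now - acc_prev) / ticks

# acc_counter advanced 1000 ticks, the violation counter advanced 300:
print(violation_percentage(500, 800, 2000, 3000))  # 30.0
```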

Core violation types#

Each core violation type maps to a specific hardware protection mechanism. The fields below cover the most common causes of GPU throttling.

| Violation type | Accumulated | Status | Activity | Description |
|---|---|---|---|---|
| PROCHOT | acc_prochot_thrm | active_prochot_thrm | per_prochot_thrm | Processor hot - emergency thermal throttling at critical temperature |
| PPT (Power) | acc_ppt_pwr | active_ppt_pwr | per_ppt_pwr | Package Power Tracking - PVIOL; power consumption exceeds limits |
| Socket Thermal | acc_socket_thrm | active_socket_thrm | per_socket_thrm | Socket-level thermal - TVIOL; socket temperature exceeds limits |
| VR Thermal | acc_vr_thrm | active_vr_thrm | per_vr_thrm | Voltage regulator thermal throttling |
| HBM Thermal | acc_hbm_thrm | active_hbm_thrm | per_hbm_thrm | High Bandwidth Memory thermal throttling |

Per-XCP per-XCC violation types#

These fields are 2D arrays indexed by [XCP][XCC] and require gpu_metrics v1.8 or newer:

| Violation type | Accumulated | Status | Activity | Description |
|---|---|---|---|---|
| GFX Clock Below Host Limit (Power) | acc_gfx_clk_below_host_limit_pwr | active_gfx_clk_below_host_limit_pwr | per_gfx_clk_below_host_limit_pwr | GFX clock limited below host limit due to power |
| GFX Clock Below Host Limit (Thermal) | acc_gfx_clk_below_host_limit_thm | active_gfx_clk_below_host_limit_thm | per_gfx_clk_below_host_limit_thm | GFX clock limited below host limit due to thermal |
| GFX Clock Below Host Limit (Total) | acc_gfx_clk_below_host_limit_total | active_gfx_clk_below_host_limit_total | per_gfx_clk_below_host_limit_total | GFX clock limited below host limit for any reason |
| Low Utilization | acc_low_utilization | active_low_utilization | per_low_utilization | Low GPU utilization detected |

Note

How GFXCLK_* and LOW_UTIL* differ from core PVIOL and TVIOL fields:

  • Scope: These are per-XCP (Compute Partition) × per-XCC (Compute Complex) 2D arrays, not socket-level aggregates like PVIOL/TVIOL.

  • What they measure: GFXCLK_* tracks when the GFX clock is forced below a host-set clock limit due to power (_pwr) or thermal (_thm) reasons. LOW_UTIL* tracks periods of low GPU utilization — a clock reduction cause unrelated to power or thermal limits.

  • Availability: Requires gpu_metrics v1.8 or newer; these fields return max_uint on earlier drivers/ASICs.
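Because these fields are 2D arrays, finding throttled engines means scanning every [XCP][XCC] cell. A hypothetical sketch over plain Python lists (on supported ASICs the real data comes from amdsmi_get_violation_status(); the shape here is assumed from the description above):

```python
# Hypothetical data shaped like per_gfx_clk_below_host_limit_pwr:
# a 2D array indexed [XCP][XCC], values are violation percentages.
per_pwr = [
    [0, 5],    # XCP 0: XCC 0 clean, XCC 1 at 5%
    [12, 0],   # XCP 1: XCC 0 at 12%, XCC 1 clean
]

# Collect (XCP index, XCC index, percentage) for every throttled cell.
throttled = [(xcp, xcc, pct)
             for xcp, row in enumerate(per_pwr)
             for xcc, pct in enumerate(row)
             if pct > 0]
print(throttled)  # [(0, 1, 5), (1, 0, 12)]
```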

Metadata fields#

| Field | Type | Description |
|---|---|---|
| reference_timestamp | uint64 | CPU timestamp in microseconds (µs) |
| violation_timestamp | uint64 | Violation time in nanoseconds (bare metal Linux) or milliseconds (host) |
| acc_counter | uint64 | Accumulation counter used for percentage calculations |

Note

max_uint64 (for uint64 fields) or max_uint8 (for uint8 fields) indicates the feature is unsupported on the current ASIC. The original acc_gfx_clk_below_host_limit, per_gfx_clk_below_host_limit, and active_gfx_clk_below_host_limit fields are deprecated in favor of the per-XCP/XCC v1.8 variants above.
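When consuming these structs programmatically, it helps to normalize sentinel values to a proper "missing" marker before doing arithmetic. A minimal sketch, assuming the max-uint convention described in the note:

```python
MAX_UINT64 = 0xFFFFFFFFFFFFFFFF
MAX_UINT8 = 0xFF

def normalize(value: int, width: int = 64):
    """Map the max-uint sentinel described above to None (unsupported).
    Field widths follow the tables in this section: uint64 for counters
    and percentages, uint8 for active_* flags."""
    sentinel = MAX_UINT8 if width == 8 else MAX_UINT64
    return None if value == sentinel else value

print(normalize(0xFFFFFFFFFFFFFFFF))  # None
print(normalize(1, width=8))          # 1
```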

Interpreting throttle_status results#

GPU architecture support

throttle_status and indep_throttle_status apply to Radeon (Navi) and Instinct MI100/MI200 Series GPUs.

  • throttle_status (uint32_t) indicates whether the GPU is throttling (non-zero means throttled).

  • indep_throttle_status (uint64_t) encodes why via per-reason bit flags defined in the kernel driver.

The canonical bit definitions are in the SMU_THROTTLER_* enum in drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h in the ROCm AMDGPU kernel driver. AMD SMI passes the raw uint64_t value through to the caller without interpreting the bits, so refer to that header for the authoritative bit-position-to-flag mapping. The following table summarizes the key ranges:

| Category | Bit range | Key flags |
|---|---|---|
| Power throttlers | 0–7 | PPT0, PPT1, SPL, FPPT, SPPT |
| Current throttlers | 16–23 | TDC_GFX, TDC_SOC, TDC_MEM, EDC_CPU, EDC_GFX |
| Temperature throttlers | 32–47 | TEMP_GPU, TEMP_MEM (HBM_THM), TEMP_HOTSPOT, TEMP_SOC (SOCKET_THM), TEMP_VR_GFX (VR_THM), PROCHOT_GFX |
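Decoding indep_throttle_status is then a matter of testing bits. The bit positions below are placeholders chosen inside the documented ranges, not the real SMU_THROTTLER_* values; take the authoritative positions from amdgpu_smu.h for your kernel version:

```python
# Placeholder flag map: bit positions are illustrative only, chosen
# within the ranges listed above. Use the SMU_THROTTLER_* enum from
# amdgpu_smu.h for the real mapping on your kernel.
EXAMPLE_FLAGS = {
    0: "PPT0",       # power throttlers live in bits 0-7
    16: "TDC_GFX",   # current throttlers live in bits 16-23
    32: "TEMP_GPU",  # temperature throttlers live in bits 32-47
}

def decode_throttle_reasons(indep_throttle_status: int,
                            flags: dict = EXAMPLE_FLAGS) -> list:
    """Return the names of all throttle-reason bits set in the raw
    uint64 value that AMD SMI passes through uninterpreted."""
    return [name for bit, name in sorted(flags.items())
            if indep_throttle_status & (1 << bit)]

print(decode_throttle_reasons((1 << 0) | (1 << 32)))  # ['PPT0', 'TEMP_GPU']
```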

Troubleshooting#

High PVIOL (Power Violations)?#

  • Check power limit settings with amdsmi_get_power_cap_info()

  • View static power cap details (default, min, max): amd-smi static --limit

  • Monitor live power consumption: amd-smi monitor --power

  • Verify adequate PSU capacity for your system

  • Consider reducing workload intensity or power limits

  • Monitor with: amd-smi metric --gpu all --power

Note

amd-smi static --limit shows power cap thresholds and thermal shutdown/slowdown limits. If your GPU is hitting these limits, it may throttle to stay within them, causing PVIOL/TVIOL. Adjusting power limits or improving cooling can help reduce power or thermal related violations.

High TVIOL (Thermal Violations)?#

  • Check cooling system (fans, airflow)

  • Verify thermal paste application

  • Monitor ambient temperature

  • Check for dust buildup in coolers

  • Use: amd-smi metric --gpu all --temperature

Getting N/A or max_uint values?#

  • For violation fields (metric --violation) returning N/A: The violation API is only supported on MI3x+ (MI300X and newer). On older ASICs (Navi/MI1x/MI2x), use amd-smi metric --power and check THROTTLE_STATUS instead.

  • For THROTTLE_STATUS in metric --power showing N/A: This field is available on Navi/MI1x/MI2x (gpu_metrics v1.3) but not on MI3x+. On MI3x+, use amd-smi metric --violation or amd-smi monitor --violation instead.

  • Check your ASIC generation with amdsmi_get_gpu_asic_info() or amd-smi static --asic

Further reading#