
ROCm Core SDK 7.13.0 release notes#

2026-05-15


Applies to Linux and Windows

ROCm Core SDK 7.13.0 continues the technology preview release stream that began with ROCm 7.9.0, advancing the transition to the new TheRock build and release system. To learn more, see the transition guide.

Important

ROCm 7.13.0 follows the versioning discontinuity that began with the 7.9.0 preview release and remains separate from the 7.0 to 7.2 production releases. For the latest production stream release, see the ROCm documentation.

Maintaining parallel release streams – preview and production – gives users ample time to evaluate and adopt the new build system and dependency changes. The technology preview stream is planned to continue through mid-2026, after which it will replace the current production stream.

For previous preview releases, see the release history.

Release highlights#

ROCm Core SDK 7.13.0 with TheRock builds upon the 7.12.0 preview release.

This release expands support for AI inference, distributed workloads, and profiling workflows across AMD Instinct™, Radeon™, and Ryzen™ AI platforms. ROCm 7.13.0 adds inference-ready vLLM containers, expands GPU virtualization and partitioning support, introduces new profiling and tracing capabilities, and improves AI kernel, sparse math, and communication libraries.

Platform and hardware support#

This release expands GPU, operating system, virtualization, and partitioning support.

Expanded AMD GPU support#

ROCm 7.13.0 adds support for the following AMD GPUs and APUs:

  • AMD Instinct MI350P (gfx950)

  • AMD Radeon PRO W6800 (gfx1030)

  • AMD Radeon PRO V620 (gfx1030)

  • AMD Ryzen AI 7 PRO 360 (gfx1152)

  • AMD Ryzen AI 7 PRO 350 (gfx1152)

  • AMD Ryzen AI 5 PRO 340 (gfx1152)

  • AMD Ryzen AI 7 350 (gfx1152)

  • AMD Ryzen AI 7 345 (gfx1152)

  • AMD Ryzen AI 5 340 (gfx1152)

  • AMD Ryzen AI 5 330 (gfx1152)

For the complete list of supported AMD hardware, see AMD hardware support.

Expanded Ubuntu support#

ROCm 7.13.0 adds support for Ubuntu 26.04 on Instinct, Radeon, and Ryzen devices.

Ubuntu 24.04.4 replaces Ubuntu 24.04.3 as the validated Ubuntu 24.04 point release.

For the full list of supported Linux distributions, see Operating system support.

Expanded GPU virtualization support for Instinct GPUs#

ROCm 7.13.0 adds support for the following virtualization configurations on AMD Instinct GPUs.

  • On MI355X: VMware ESXi 9.1 with Ubuntu 24.04 guest OS.

  • On MI300X: KVM SR-IOV with Ubuntu 24.04 host OS and Ubuntu 24.04 guest OS.

  • On MI210:

    • KVM passthrough with RHEL 9.4 host OS and Ubuntu 22.04 guest OS.

    • KVM SR-IOV with RHEL 9.4 host OS and Ubuntu 22.04 guest OS.

    • KVM SR-IOV with RHEL 9.4 host OS and RHEL 9.4 guest OS.

Supported SR-IOV configurations require the GIM Driver 9.0.0K. For details, see GPU virtualization support.

Expanded Instinct GPU partitioning support#

ROCm 7.13.0 enables the QPX compute + NPS 2 memory partition combination in bare metal deployments.

For details, see GPU partitioning support.

AI inference and frameworks#

This release adds inference-ready container images and improves multi-node communication for distributed workloads.

vLLM 0.19.1 Docker images and pip packages#

With ROCm 7.13.0, Docker images for running vLLM inference workloads are available. Images include vLLM 0.19.1, PyTorch 2.10, and Python 3.13 on Ubuntu 24.04.

Architecture-specific images are available for:

  • AMD Instinct GPUs: gfx942 (MI325X, MI300X, MI300A) and gfx950 (MI355X, MI350X, MI350P)

  • AMD Radeon GPUs: gfx1100, gfx1101, gfx1102, gfx1200, gfx1201

  • AMD Ryzen AI APUs: gfx1150, gfx1151, gfx1152

See vLLM inference and serving on ROCm to get started.

RCCL multi-node optimization for AMD Ryzen AI Max 300 series#

RCCL improves multi-node clustering performance on systems with AMD Ryzen AI Max 300 series connected over Ethernet. Building on the initial multi-node enablement in ROCm 7.12.0, this release optimizes collective communication for distributed AI inference workloads using tensor parallelism (TP) and expert parallelism (EP) across up to 4 Ethernet-connected nodes.

RCCL GDA-based alltoall via rocSHMEM integration (experimental)#

RCCL adds experimental support for GPU Direct Async (GDA)-based alltoall and alltoallv collective operations through rocSHMEM integration. When enabled, RCCL invokes rocSHMEM operations that use GDA to reduce latency for small message alltoall patterns.

This feature requires building RCCL with the --rocshmem flag and setting RCCL_ROCSHMEM_ENABLE=1 at runtime. GDA support currently requires Broadcom NICs with GDA capability.
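A minimal build-and-run sketch using the two switches named above (the install.sh entry point, launcher invocation, and benchmark binary are illustrative, not prescribed):

```shell
# Build RCCL with rocSHMEM integration enabled (flag from this release).
./install.sh --rocshmem

# Opt in to the GDA-based alltoall/alltoallv path at runtime.
# Requires a Broadcom NIC with GDA capability.
export RCCL_ROCSHMEM_ENABLE=1
mpirun -np 8 ./alltoall_perf   # illustrative launcher and binary
```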

Developer tools and profiling#

This release adds new profiling capabilities, introduces the open-source ROCprof Trace Decoder, and extends HIP programming APIs.

ROCprof Trace Decoder open source release#

ROCprof Trace Decoder, previously delivered as a closed-source component within ROCprofiler-SDK, is now available as the open-source rocprof-trace-decoder library. The decoder converts raw SQTT data from AMD GPUs into structured execution traces for performance analysis and debugging. It supports a wide range of AMD GPUs spanning Instinct, Radeon, and Ryzen architectures, with unit and integration tests across all supported hardware. See AMD hardware support for the complete list.

HIP cooperative groups reduce operations#

HIP adds cooperative_groups::reduce() for performing reduction operations across thread_block_tile and coalesced_threads groups. The implementation is based on __reduce_*_sync operations, and the HIP_ENABLE_EXTRA_WARP_SYNC_TYPES macro might be required to enable some optimizations.

Additionally, __reduce_and_sync(), __reduce_or_sync(), and __reduce_xor_sync() now provide consistent behavior for all mask values. All masks now emit bitwise instructions, aligning behavior with NVIDIA CUDA. This is a change from previous versions, where some masks were translated to bitwise operations, and others were not.
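A minimal device-code sketch of the new reduction API, assuming the usual cooperative groups header and namespace; the interface mirrors CUDA cooperative groups, and some optimizations may additionally require defining HIP_ENABLE_EXTRA_WARP_SYNC_TYPES as noted above:

```cpp
#include <hip/hip_runtime.h>
#include <hip/hip_cooperative_groups.h>

namespace cg = cooperative_groups;

// Each 32-thread tile reduces its elements; lane 0 of every tile
// accumulates the partial sum into the global result.
__global__ void tile_sum(const int* in, int* out) {
    cg::thread_block block = cg::this_thread_block();
    cg::thread_block_tile<32> tile = cg::tiled_partition<32>(block);

    int v = in[block.group_index().x * block.size() + block.thread_rank()];
    int tile_total = cg::reduce(tile, v, cg::plus<int>());

    if (tile.thread_rank() == 0) {
        atomicAdd(out, tile_total);
    }
}
```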

ROCm Compute Profiler feature highlights#

The following are notable enhancements to the ROCm Compute Profiler (rocprofiler-compute).

  • RDNA 3.5 support: ROCm Compute Profiler now supports GPU performance profiling and analysis on AMD Ryzen AI Max 300 series processors.

  • Removed dependency requirements for profiling: Building ROCm Compute Profiler and using profile mode no longer requires installing Python dependencies from the requirements.txt file. Analysis mode still requires Python dependencies.

    This change moves several operations from profile mode to analysis mode, including roofline HTML generation, roofline-related options (--sort, --mem-level, --roofline-data-type), and creation of the combined pmc_perf.csv file. Profile mode now only runs the roofline empirical benchmark, creates a roofline.csv file, and creates per-replay CSV files without merging them.
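As a sketch of the resulting two-step workflow (the rocprof-compute command name and workload paths are assumptions for illustration, not taken from this release note):

```shell
# Profile mode: no requirements.txt dependencies; produces roofline.csv
# and per-replay CSV files, but no merged pmc_perf.csv.
rocprof-compute profile -n my_workload -- ./my_app

# Analysis mode: requires the Python dependencies; roofline HTML
# generation, the --sort/--mem-level/--roofline-data-type options, and
# CSV merging now happen here.
rocprof-compute analyze -p workloads/my_workload --mem-level HBM
```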

ROCm Systems Profiler feature highlights#

The following are notable enhancements to the ROCm Systems Profiler (rocprofiler-systems).

  • Pause and resume profiling: ROCm Systems Profiler now supports pausing and resuming profiling at runtime through the roctxProfilerPause and roctxProfilerResume APIs. This allows you to capture profiling data only during specific execution phases, reducing overhead and minimizing output size for long-running workloads.

  • Selective region tracing: You can now restrict tracing to defined regions of interest using the ROCPROFSYS_SELECTED_REGIONS environment variable, reducing noise and limiting data collection to relevant workload segments.

  • KFD event tracing: Kernel Fusion Driver (KFD) event tracing is now available for GPU memory management analysis, including page faults, page migrations, queue evictions, GPU unmap events, and dropped events. Requires an XNACK-capable GPU and ROCprofiler-SDK 1.2.1 or later.

  • MPI file-output filtering: You can now filter profiler output files based on MPI rank using the --rank-filter-output CLI option or the ROCPROFSYS_RANK_FILTER_OUTPUT configuration setting, suppressing output from all other ranks. An optional --rank-filter-id option (ROCPROFSYS_RANK_FILTER_ID) allows specifying a custom environment variable for rank identification.

  • JSON-based profiling presets and domain flags: You can now configure common profiling workflows using JSON-based presets and a single --preset=<name> flag instead of manually setting multiple ROCPROFSYS_* environment variables. Eleven built-in presets cover common profiling scenarios, including GPU tracing, HPC workloads, and API-level analysis. Composable domain flags (--gpu, --rocm, --cpu, --parallel) and a topic-based --help=<topic> system further simplify configuration and discoverability.
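The options above can be combined along these lines (the rocprof-sys-run launcher name, preset name, and region label are assumptions for illustration):

```shell
# Preset- and domain-flag-based configuration instead of setting
# individual ROCPROFSYS_* variables by hand.
rocprof-sys-run --preset=gpu-tracing --gpu -- ./my_app

# Selective region tracing plus MPI per-rank output filtering:
# only the named region is traced, and only rank 0 writes output files.
export ROCPROFSYS_SELECTED_REGIONS="solver_phase"
mpirun -np 4 rocprof-sys-run --rank-filter-output 0 -- ./my_app
```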

AMD SMI feature highlights#

  • APU metrics and memory tuning: New APU telemetry provides per-core temperature, power, clock, voltage, current, and throttle monitoring, with additional support for IPU activity and DRAM bandwidth metrics. New VRAM carveout and GTT tuning controls enable configurable memory allocation on supported APU platforms.

  • Per-component GPU temperature and clock monitoring: GPU metrics table version 1.9 adds HBM stack temperatures, per-die temperature monitoring, and per-die memory and SOC clock reporting for data center deployments.

  • CPU power APIs report in milliwatts (breaking change): CPU power APIs now return values in milliwatts (mW) instead of watts. Python bindings now return numeric integer values instead of formatted strings. Existing applications that parse previous string-based outputs must be updated.

For more information, see the AMD SMI section in the ROCm component changelogs.

Libraries#

This release adds new routines, data type support, and performance improvements across ROCm math and AI libraries.

Composable Kernel adds quantization and attention kernel capabilities#

Composable Kernel adds several capabilities for AI and large language model workloads:

  • Microscaling (MX) FP8/FP4 support: Mixed data type support for MX FP8 and FP4 in GEMM and Flash Multi-Head Attention (FMHA) forward kernels on AMD Instinct MI350 Series GPUs.

  • FP8 quantization for FMHA: FMHA forward kernels now support multiple FP8 quantization modes, including dynamic tensor-wise quantization, block scale quantization, per-tensor quantization, and FP8 KV cache support for batch prefill.

  • StreamingLLM and long-context inference: Sink token support for FMHA forward enables StreamingLLM-style long-context inference.

  • Batch prefill enhancements: FMHA batch prefill kernels now support multiple KV cache layouts, flexible page sizes, and configurable lookup table configurations.

  • RDNA 3 FMHA support: Flash Attention kernels are now available on RDNA 3 architectures.

  • SageAttention v2 forward kernel: Multi-granularity quantization for Q, K, and V tensors with FP8, INT8, and INT4 data types and per-tensor, per-block, per-warp, and per-thread scale granularities on AMD Instinct MI300 Series and MI350 Series GPUs.

General Batched GEMM support in hipBLASLt#

hipBLASLt adds native support for General Batched GEMM, where all matrices in a batch share the same problem dimensions but can have independent leading dimensions and strides. This replaces the previous implementation through the hipblaslt_ext Grouped GEMM APIs, which had known limitations.

The new implementation includes support for Global Split-U (GSU) to improve performance at large problem sizes. General Batched GEMM is important for inference workloads that dispatch batches of same-shape GEMM operations.

rocSOLVER adds new solver routines and matrix analysis functions#

rocSOLVER adds the following new routines, all with 64-bit index support:

  • GETRS_NPVT: Solution of linear systems using LU factorization without pivoting. Batched and strided-batched variants are available.

  • SYTRS: Solution of linear systems for symmetric matrices. Batched and strided-batched variants are available.

Additionally, POTF2 and downstream POTRF Cholesky factorization performance have been improved.

rocSPARSE adds sparse factorization routines#

rocSPARSE adds new generic API routines for sparse incomplete factorization and triangular solve:

  • rocsparse_spic0 and rocsparse_spilu0: Generic incomplete Cholesky (IC0) and incomplete LU (ILU0) factorization routines with strided-batched computation support.

  • rocsparse_sptrsv: Extended with strided-batched computation support and singularity detection through the new rocsparse_singularity enumeration.

Performance of tridiagonal solvers rocsparse_Xgtsv_no_pivot and rocsparse_Xgtsv_no_pivot_strided_batch has been improved.

Added rocDecode and rocJPEG libraries to the ROCm Core SDK#

rocDecode provides hardware-accelerated video decoding for H.264, H.265/HEVC, AV1, and VP9 codecs, while rocJPEG provides hardware-accelerated JPEG decoding on AMD GPUs. Together, they enable efficient GPU-based media processing pipelines for data-intensive workloads such as AI training.

Both libraries are supported on Linux on AMD Instinct, Radeon, and Ryzen AI. See the projects in ROCm/rocm-systems for more information.

Added ROCm Data Center Tool to the ROCm Core SDK#

ROCm Data Center Tool (RDC) provides telemetry collection, health monitoring, and job-level GPU statistics for data center deployments with AMD Instinct accelerators. RDC enables system administrators and cluster managers to monitor GPU health, collect telemetry data, and track per-job GPU usage across multi-node environments.

RDC is supported on Linux with AMD Instinct GPUs.

AMD hardware support#

The following table lists supported AMD Instinct GPUs, Radeon GPUs, and Ryzen APUs. Each supported device is listed with its corresponding GPU microarchitecture and LLVM target.

Note

If your GPU is not listed, it might be community-enabled through TheRock nightly builds. For more information, see TheRock supported GPUs. For installation guidance, see TheRock releases.

| Device series | Device | LLVM target | Architecture |
|---|---|---|---|
| AMD Ryzen AI Max PRO 300 Series | Ryzen AI Max+ PRO 395 (Radeon 8060S), Ryzen AI Max PRO 390 (Radeon 8050S), Ryzen AI Max PRO 385 (Radeon 8050S), Ryzen AI Max PRO 380 (Radeon 8040S) | gfx1151 | RDNA 3.5 |
| AMD Ryzen AI Max 300 Series | Ryzen AI Max+ 395 (Radeon 8060S), Ryzen AI Max+ 392 (Radeon 8060S), Ryzen AI Max+ 388 (Radeon 8060S), Ryzen AI Max 390 (Radeon 8050S), Ryzen AI Max 385 (Radeon 8050S) | gfx1151 | RDNA 3.5 |
| AMD Ryzen AI PRO 400 Series | Ryzen AI 9 HX PRO 475 (Radeon 890M), Ryzen AI 9 HX PRO 470 (Radeon 890M), Ryzen AI 9 PRO 465 (Radeon 880M) | gfx1150 | RDNA 3.5 |
| AMD Ryzen AI PRO 400 Series | Ryzen AI 7 PRO 450 (Radeon 860M), Ryzen AI 5 PRO 440 (Radeon 840M) | gfx1152 | RDNA 3.5 |
| AMD Ryzen AI 400 Series | Ryzen AI 9 HX 475 (Radeon 890M), Ryzen AI 9 HX 470 (Radeon 890M), Ryzen AI 9 465 (Radeon 880M) | gfx1150 | RDNA 3.5 |
| AMD Ryzen AI 400 Series | Ryzen AI 7 450 (Radeon 860M) | gfx1152 | RDNA 3.5 |
| AMD Ryzen AI PRO 300 Series | Ryzen AI 9 HX PRO 375 (Radeon 890M), Ryzen AI 9 HX PRO 370 (Radeon 890M) | gfx1150 | RDNA 3.5 |
| AMD Ryzen AI PRO 300 Series | Ryzen AI 7 PRO 350 (Radeon 860M), Ryzen AI 5 PRO 340 (Radeon 840M) | gfx1152 | RDNA 3.5 |
| AMD Ryzen AI 300 Series | Ryzen AI 9 HX 375 (Radeon 890M), Ryzen AI 9 HX 370 (Radeon 890M), Ryzen AI 9 365 (Radeon 880M) | gfx1150 | RDNA 3.5 |
| AMD Ryzen AI 300 Series | Ryzen AI 7 350 (Radeon 860M), Ryzen AI 7 345 (Radeon 840M), Ryzen AI 5 340 (Radeon 840M), Ryzen AI 5 330 (Radeon 820M) | gfx1152 | RDNA 3.5 |
| AMD Ryzen PRO 200 Series | Ryzen 7 PRO 250 (Radeon 780M), Ryzen 5 PRO 230 (Radeon 760M), Ryzen 5 PRO 220 (Radeon 740M), Ryzen 5 PRO 215 (Radeon 740M), Ryzen 3 PRO 210 (Radeon 740M) | gfx1103 | RDNA 3 |
| AMD Ryzen 200 Series | Ryzen 9 270 (Radeon 780M), Ryzen 7 260 (Radeon 780M), Ryzen 7 250 (Radeon 780M), Ryzen 5 240 (Radeon 760M), Ryzen 5 230 (Radeon 760M), Ryzen 5 220 (Radeon 740M), Ryzen 3 210 (Radeon 740M) | gfx1103 | RDNA 3 |

Operating system support#

ROCm supports the following Linux distributions and Microsoft Windows versions. If you’re running ROCm on Linux, ensure your system uses a supported kernel version.

Important

The following table is a general overview of supported OSes. Actual support might vary by AMD GPU or APU. Use the Compatibility matrix to verify support for your specific setup before installation.

| Linux distribution | Supported versions | Linux kernel version |
|---|---|---|
| Ubuntu | 26.04 | GA 7.0 |
| Ubuntu | 24.04.4 | GA 6.8 |
| Ubuntu | 22.04.5 | GA 5.15 |
| Debian | 13 | 6.12 |
| Debian | 12 | 6.1.0 |
| Red Hat Enterprise Linux (RHEL) | 10.1 | 6.12.0-124 |
| Red Hat Enterprise Linux (RHEL) | 10.0 | 6.12.0-55 |
| Red Hat Enterprise Linux (RHEL) | 9.7 | 5.14.0-611 |
| Red Hat Enterprise Linux (RHEL) | 9.6 | 5.14.0-570 |
| Red Hat Enterprise Linux (RHEL) | 9.4 | 5.14.0-427 |
| Red Hat Enterprise Linux (RHEL) | 8.10 | 4.18.0-553 |
| Oracle Linux | 10 | UEK 8.1 |
| Oracle Linux | 9 | UEK 8 |
| Oracle Linux | 8 | UEK 7 |
| Rocky Linux | 9 | 5.14.0-570 |
| SUSE Linux Enterprise Server (SLES) | 16.0 | 6.12 |
| SUSE Linux Enterprise Server (SLES) | 15.7 | 6.4.0-150700.51 |

| Operating system | Supported versions | Linux kernel version |
|---|---|---|
| Ubuntu | 26.04 | GA 7.0 |
| Ubuntu | 24.04.4 | GA 6.8 |
| Ubuntu | 22.04.5 | GA 5.15 |
| Red Hat Enterprise Linux (RHEL) | 10.1 | 6.12.0-124 |
| Red Hat Enterprise Linux (RHEL) | 9.7 | 5.14.0-611 |
| Windows | 11 25H2 | N/A |

| Operating system | Supported versions | Linux kernel version |
|---|---|---|
| Ubuntu | 26.04 | GA 7.0 |
| Ubuntu | 24.04.4 | HWE 6.17 |
| Windows | 11 25H2 | N/A |

Installation updates#

ROCm 7.13.0 introduces several improvements to the Runfile Installer:

  • Performance improvements for installing and uninstalling gfx architectures.

  • ROCm component tests are now included.

  • Support for prerequisite OEM kernel installation as part of the dependency install on Ryzen systems. You no longer need to install it manually.

  • Auto-detection of the GPU when using the GUI or when the gfx= argument is not provided on the command line. If the installer cannot detect the GPU, you must specify the gfx architecture using the GUI or the gfx= argument.

Kernel driver and firmware bundle support#

ROCm requires a coordinated stack of compatible firmware, driver, and user space components. Maintaining version alignment between these layers ensures correct GPU operation and performance, especially for AMD data center products. AMD publishes the AMD GPU driver and the ROCm user space components, while the firmware images (PLDM bundles) that AMD supplies are integrated and distributed by your server OEM (original equipment manufacturer) or infrastructure provider.

| AMD device | Firmware |
|---|---|
| Instinct MI355X | PLDM bundle 01.26.00.02 |
| Instinct MI350X | PLDM bundle 01.26.00.02 |
| Instinct MI350P | IFWI 00185129 |
| Instinct MI325X | PLDM bundle 01.25.04.02 |
| Instinct MI300X | PLDM bundle 01.26.00.02 |
| Instinct MI300A | BKC 26.1 |
| Instinct MI250X | IFWI 75 (or later) |
| Instinct MI250 | Maintenance update (MU) 5 with IFWI 75 (or later) |
| Instinct MI210 | Maintenance update (MU) 5 with IFWI 75 (or later) |
| Instinct MI100 | VBIOS D3430401-037 |

All listed Instinct devices use the Linux AMD GPU Driver (amdgpu), versions 31.30.0, 31.20.0, 31.10.0, 30.30.3, 30.30.2, 30.30.1, 30.30.0, 30.20.1, 30.20.0, 30.10.2, 30.10.1, and 30.10.0.

| Linux driver | Windows driver |
|---|---|
| AMD GPU Driver (amdgpu) 31.30.0, 31.20.0, 31.10.0, 30.30.3, 30.30.2, 30.30.1, 30.30.0, 30.20.1, 30.20.0, 30.10.2, 30.10.1, 30.10.0 | AMD Software: Adrenalin Edition 26.5.1 |

| Linux driver | Windows driver |
|---|---|
| Inbox kernel driver in Ubuntu 26.04 or 24.04.4 | AMD Software: Adrenalin Edition 26.5.1 |

GPU virtualization support#

AMD Instinct data center GPUs support virtualization in the following configurations. Supported SR-IOV configurations require the AMD GPU Virtualization Driver (GIM) 9.0.0K – see the AMD Instinct Virtualization Driver documentation for more information.

| AMD GPU | Hypervisor | Virtualization technology | Virtualization driver | Host OS | Guest OS |
|---|---|---|---|---|---|
| Instinct MI355X | KVM | Passthrough | | Ubuntu 24.04 | Ubuntu 24.04 |
| Instinct MI355X | KVM | SR-IOV | GIM 9.0.0K | Ubuntu 24.04 | RHEL 10.0, RHEL 9.6 |
| Instinct MI355X | ESXi | | | VMware ESXi 9.1 | Ubuntu 24.04 |
| Instinct MI350X | KVM | Passthrough | | Ubuntu 24.04 | Ubuntu 24.04 |
| Instinct MI350X | KVM | SR-IOV | GIM 9.0.0K | Ubuntu 24.04 | RHEL 9.6 |
| Instinct MI325X | KVM | SR-IOV | GIM 9.0.0K | Ubuntu 22.04 | Ubuntu 22.04 |
| Instinct MI300X | KVM | Passthrough | | Ubuntu 22.04 | Ubuntu 22.04 |
| Instinct MI300X | KVM | SR-IOV | GIM 9.0.0K | Ubuntu 24.04 | Ubuntu 24.04 |
| Instinct MI300X | KVM | SR-IOV | GIM 9.0.0K | Ubuntu 22.04 | Ubuntu 22.04 |
| Instinct MI210 | KVM | Passthrough | | RHEL 9.4 | Ubuntu 22.04 |
| Instinct MI210 | KVM | SR-IOV | GIM 9.0.0K | RHEL 9.4 | Ubuntu 22.04, RHEL 9.4 |

GPU partitioning support#

The following compute partition and NUMA-per-socket (NPS) configurations are available on AMD Instinct GPUs in bare metal deployments.

| Device | Compute partition mode | NPS mode | Deployment |
|---|---|---|---|
| Instinct MI355X, MI350X | CPX | NPS 2 | Bare metal |
| Instinct MI355X, MI350X | DPX | NPS 2 | Bare metal |
| Instinct MI355X, MI350X | QPX | NPS 2 | Bare metal |
| Instinct MI300X | CPX | NPS 4 | Bare metal |
| Instinct MI300X | DPX | NPS 2 | Bare metal |

See the AMD GPU partitioning topic in the AMD GPU Driver documentation to learn more.

AI ecosystem support#

ROCm 7.13.0 provides optimized support for popular deep learning frameworks and AI inference engines. The following table lists supported frameworks and libraries, their compatible operating systems, and validated versions.

| Framework | Supported versions | Supported OS | Supported Python versions |
|---|---|---|---|
| PyTorch | 2.11.0, 2.10.0, 2.9.1 | Linux | 3.14, 3.13, 3.12, 3.11 |
| PyTorch | 2.11.0 | Windows | |
| JAX | 0.9.1, 0.8.2 | Linux | 3.14, 3.13, 3.12, 3.11 |
| vLLM (gfx950, gfx942, gfx1200, gfx1201, gfx1100, gfx1101, gfx1102, gfx1151 GPUs only) | 0.19.1 (requires PyTorch 2.10.0) | Linux | 3.13 |

ROCm Core SDK components#

The following table lists core tools and libraries included in the ROCm 7.13.0 release.

Important

The following table is a general overview of ROCm Core SDK components. Actual support for these libraries and tools can vary by GPU and OS. Use the Compatibility matrix to verify support for your specific setup.

| Component group | Component name | Version | Supported platforms |
|---|---|---|---|
| Math and compute libraries | hipBLAS | 3.4.0 | Linux/Windows · Instinct/Radeon/Ryzen |
| Math and compute libraries | hipBLASLt | 1.3.0 | Linux/Windows · Instinct/Radeon/Ryzen |
| Math and compute libraries | hipCUB | 4.4.0 | Linux/Windows · Instinct/Radeon/Ryzen |
| Math and compute libraries | hipFFT | 1.0.23 | Linux/Windows · Instinct/Radeon/Ryzen |
| Math and compute libraries | hipRAND | 3.3.0 | Linux/Windows · Instinct/Radeon/Ryzen |
| Math and compute libraries | hipSOLVER | 3.4.0 | Linux/Windows · Instinct/Radeon/Ryzen |
| Math and compute libraries | hipSPARSE | 4.5.0 | Linux/Windows · Instinct/Radeon/Ryzen |
| Math and compute libraries | MIOpen | 3.5.1 | Linux/Windows · Instinct/Radeon/Ryzen |
| Math and compute libraries | rocBLAS | 5.4.0 | Linux/Windows · Instinct/Radeon/Ryzen |
| Math and compute libraries | rocFFT | 1.0.37 | Linux/Windows · Instinct/Radeon/Ryzen |
| Math and compute libraries | rocRAND | 4.4.0 | Linux/Windows · Instinct/Radeon/Ryzen |
| Math and compute libraries | rocSOLVER | 3.34.0 | Linux/Windows · Instinct/Radeon/Ryzen |
| Math and compute libraries | rocSPARSE | 4.6.0 | Linux/Windows · Instinct/Radeon/Ryzen |
| Math and compute libraries | rocPRIM | 4.4.0 | Linux/Windows · Instinct/Radeon/Ryzen |
| Math and compute libraries | rocThrust | 4.4.0 | Linux/Windows · Instinct/Radeon/Ryzen |
| Math and compute libraries | rocWMMA | 2.2.1 | Linux/Windows · Instinct/Radeon/Ryzen |
| Math and compute libraries | Composable Kernel | 1.3.0 | Linux/Windows · Instinct/Radeon |
| Math and compute libraries | hipSPARSELt | 0.2.8 | Linux/Windows · Instinct (gfx950/gfx942) |
| Communication libraries | RCCL | 2.28.3 | Linux · Instinct/Radeon/Ryzen |
| Communication libraries | rocSHMEM | 3.4.0 | Linux · Instinct (gfx950/gfx942/gfx90a) · Radeon (gfx1201/gfx1200/gfx1100/gfx1101/gfx1102) |
| Media libraries | rocDecode | 1.8.0 | Linux · Instinct/Radeon · Ryzen (gfx1150/gfx1151/gfx1152) |
| Media libraries | rocJPEG | 1.5.0 | Linux · Instinct/Radeon · Ryzen (gfx1150/gfx1151/gfx1152) |
| Runtimes and compilers | HIP | 7.13 | Linux/Windows · Instinct/Radeon/Ryzen |
| Runtimes and compilers | HIPIFY | 7.13 | Linux/Windows · Instinct/Radeon/Ryzen |
| Runtimes and compilers | LLVM | 23.0.0 | Linux/Windows · Instinct/Radeon/Ryzen |
| Runtimes and compilers | SPIRV-LLVM-Translator | 23.0.0 | Linux/Windows · Instinct/Radeon/Ryzen |
| Runtimes and compilers | ROCr Runtime | 1.21.0 | Linux · Instinct/Radeon/Ryzen |
| Profiling and debugging tools | ROCm Compute Profiler (rocprofiler-compute) | 3.6.0 | Linux · Instinct · Ryzen (gfx1150/gfx1151/gfx1152) |
| Profiling and debugging tools | ROCm Systems Profiler (rocprofiler-systems) | 1.6.0 | Linux · Instinct · Ryzen (gfx1150/gfx1151/gfx1152) |
| Profiling and debugging tools | ROCprofiler-SDK | 1.3.0 | Linux · Instinct/Radeon · Ryzen (gfx1150/gfx1151/gfx1152) |
| Profiling and debugging tools | ROCdbgapi | 0.80.0 | Linux · Instinct/Radeon |
| Profiling and debugging tools | ROCm Debugger (ROCgdb) | 16.3 | Linux · Instinct/Radeon |
| Profiling and debugging tools | ROCr Debug Agent | 2.1.0 | Linux · Instinct/Radeon |
| Control and monitoring tools | AMD SMI (BM) | 26.4.0 | Linux · Instinct/Radeon |
| Control and monitoring tools | rocminfo | 1.0.0 | Linux · Instinct/Radeon/Ryzen |
| Control and monitoring tools | ROCm Data Center Tool (RDC) | 1.3.0 | Linux · Instinct |

ROCm component changelogs#

The following sections describe key changes to ROCm Core SDK components.

AMD SMI (BM) (26.4.0)#

Added#
  • Added APU metrics support (table versions 2.4 and 3.0).

    • New amdsmi_apu_metrics_t struct accessible via amdsmi_gpu_metrics_t.apu_metrics pointer (non-null when APU-specific metrics are available).

    • v2.4 metrics:

      • temperature_gfx, temperature_soc, temperature_core[8], temperature_l3[2]

      • average_gfx_activity, average_mm_activity

      • average_socket_power, average_cpu_power, average_soc_power, average_gfx_power, average_core_power[8]

      • Average clocks: gfxclk, socclk, uclk, fclk, vclk, dclk

      • Current clocks: gfxclk, socclk, uclk, fclk, vclk, dclk, coreclk[8], l3clk[2]

      • average_temperature_gfx, average_temperature_soc, average_temperature_core[8], average_temperature_l3[2]

      • average_cpu_voltage, average_soc_voltage, average_gfx_voltage, average_cpu_current, average_soc_current, average_gfx_current

      • throttle_status, indep_throttle_status

      • fan_pwm

    • v3.0 metrics:

      • temperature_core[16], temperature_skin

      • average_vcn_activity, average_ipu_activity[8], average_core_c0_activity[16]

      • average_dram_reads, average_dram_writes, average_ipu_reads, average_ipu_writes

      • average_apu_power, average_dgpu_power, average_all_core_power, average_ipu_power, average_sys_power

      • stapm_power_limit, current_stapm_power_limit

      • average_core_power[16], current_coreclk[16]

      • current_core_maxfreq, current_gfx_maxfreq

      • average_vpeclk_frequency, average_ipuclk_frequency, average_mpipu_frequency

      • throttle_residency_prochot, throttle_residency_spl, throttle_residency_fppt, throttle_residency_sppt, throttle_residency_thm_core, throttle_residency_thm_gfx, throttle_residency_thm_soc

      • time_filter_alphavalue

    • Fields not applicable to the current version are set to sentinel values: 0xFFFF for uint16_t, 0xFFFFFFFF for uint32_t, and UINT64_MAX for uint64_t fields.

    • Python bindings updated with AmdSmiApuMetrics ctypes structure.

  • Added oam_id to amdsmi_enumeration_info_t.

    • amd-smi list -e now displays OAM_ID (Physical XGMI ID / OAM ID).

    • Added --enumeration as a long-form alias for -e in amd-smi list.

  • Added support for GPU metrics v1.9 new fields.

    • Added new temperature fields to amdsmi_gpu_metrics_t:

      • temperature_hbm_stacks — per-stack HBM temperatures (°C)

      • temperature_mid — per-MID temperatures (°C)

      • temperature_aid — per-AID temperatures (°C)

      • temperature_xcd — per-XCC compute die temperatures (°C)

    • Added new per-die clock fields to amdsmi_gpu_metrics_t:

      • current_uclk_aid — per-AID uclk (MHz)

      • current_socclks_mid — per-MID SOC clock (MHz)

    • Added new constants:

      • AMDSMI_MAX_NUM_HBM_STACKS (12)

      • AMDSMI_MAX_NUM_AID (2)

      • AMDSMI_MAX_NUM_MID (2)

      • AMDSMI_MAX_NUM_CLKS_PER_AID (2)

      • AMDSMI_MAX_NUM_CLKS_PER_MID (2)

  • Added VRAM and GTT tuning interface.

    • New amd-smi static --mem-carveout to view VRAM carveout options.

    • New amd-smi set --mem-carveout to change the VRAM carveout (APU).

    • New amd-smi set --gtt and amd-smi reset --gtt for system-wide GTT size tuning.

    • New APIs: amdsmi_get_gpu_uma_carveout_info(), amdsmi_set_gpu_uma_carveout(), amdsmi_get_ttm_info(), amdsmi_set_ttm_pages_limit(), amdsmi_reset_ttm_pages_limit().

  • Added UBB power and power_limit fields to amdsmi_power_info_t and amdsmi_npm_info_t.

    • amd-smi metric --power now displays ubb_power when available.

    • amd-smi node -p now displays UBB power threshold when available.

  • Added CPU support for family 1A Models 50h-57h.

    • New APIs: amdsmi_get_cpu_xgmi_pstate_range(), amdsmi_get_cpu_core_ccd_power(), amdsmi_get_cpu_tdelta(), amdsmi_get_cpu_dimm_sb_reg(), amdsmi_get_cpu_svi3_vr_controller_temp(), amdsmi_get_cpu_pc6_enable(), amdsmi_get_cpu_cc6_enable(), amdsmi_get_cpu_sdps_limit(), amdsmi_get_cpu_core_floor_freq_limit(), amdsmi_get_cpu_core_eff_floor_freq_limit(), and corresponding set APIs.

    • Note: amdsmi_get_dfc_ctrl() renamed to amdsmi_get_cpu_dfc_ctrl() and amdsmi_set_dfc_ctrl() renamed to amdsmi_set_cpu_dfc_ctrl() for naming consistency.

  • Updated memory API documentation: added a note that the sum of per-process memory usage is not expected to equal total usage.
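A small sketch of consuming the sentinel convention described above from the Python bindings (the helper below is hypothetical, not part of amdsmi; only the sentinel values themselves are from the changelog):

```python
# Sentinel values used by amdsmi_apu_metrics_t for fields that do not
# apply to the reported metrics table version.
SENTINEL_U16 = 0xFFFF
SENTINEL_U32 = 0xFFFFFFFF
SENTINEL_U64 = 2**64 - 1  # UINT64_MAX

def field_or_none(value, width):
    """Return None when `value` is the not-applicable sentinel for its width.

    `width` is the field's bit width (16, 32, or 64). Hypothetical helper
    for illustration, not part of the amdsmi Python bindings.
    """
    sentinels = {16: SENTINEL_U16, 32: SENTINEL_U32, 64: SENTINEL_U64}
    return None if value == sentinels[width] else value

# Example: a v2.4 table does not populate v3.0-only fields such as
# temperature_skin, so a sentinel shows up instead of a reading.
print(field_or_none(0xFFFF, 16))   # not applicable -> None
print(field_or_none(45, 16))       # valid reading -> 45
```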

Changed#
  • Renamed processor_type_t enum typedef to amdsmi_processor_type_t.

    • The unprefixed typedef name did not follow the amdsmi_*_t convention used throughout amdsmi.h and was easy to collide with identifiers defined by other system-management libraries. New code should use amdsmi_processor_type_t. The old name is preserved as a backward-compatibility typedef alias, so existing callers continue to compile unchanged.

  • Package install no longer modifies the system-wide logrotate timer or cron schedule.

    • Previously, installing amd-smi-lib overwrote /lib/systemd/system/logrotate.timer (or moved /etc/cron.daily/logrotate to /etc/cron.hourly/) to force hourly rotation, which affected every other package using logrotate.

    • The package now only ships /etc/logrotate.d/amd_smi.conf, which sets its own hourly + size 1M cadence. AMD-SMI logs still rotate at the same frequency; system-wide settings stay as the distribution configured them.

Optimized#
  • Optimized rsmi_dev_device_identifiers_get() in the ROCm-SMI device layer.

    • Removed unnecessary iteration by directly indexing the device list.

    • Added bounds checking for device_id, with clearer error handling/logging.

    • Improves performance for device identifier queries.

Resolved issues#
  • Fixed amd-smi metric crashing with TypeError on MI300A when no CPU flags are specified.

    • When no CPU arguments are passed, metric_cpu() sets all boolean CPU args to True to display all available data. --cpu-svi3-vr-controller-temp takes a TYPE argument (and optional RAIL_INDEX) rather than a boolean flag — setting it to True caused a TypeError crash when the code tried to subscript it with [0][0]. Added cpu_svi3_vr_controller_temp to the show-all exclusion list, following the existing pattern for cpu_lclk_dpm_level, cpu_io_bandwidth, cpu_dimm_sb_reg, and similar argument-taking flags.

  • Fixed amdsmi_get_gpu_accelerator_partition_profile() returning incorrect num_partitions when num_partition is unavailable from GPU metrics.

    • GPU metrics no longer always provides num_partition. The function now derives the partition count from the active partition type when num_partition is not available:

      • SPX → 1, DPX → 2, TPX → 3, QPX → 4

      • CPX → derived from the XCD counter via amdsmi_get_gpu_xcd_counter()

  • Fixed amdsmi_topo_get_p2p_status() returning a raw ctypes.c_uint32 object instead of an integer for the type field.

    • The 'type' key in the returned dictionary now correctly returns type_32.value (an int) rather than the unwrapped ctypes object, consistent with the pattern used in amdsmi_topo_get_link_type().

  • Adjusted KFD process caching to be more responsive.

    • Updated process caching to allow cache duration adjustment via the AMDSMI_PROCESS_INFO_CACHE_MS environment variable for workflows with rapid metric polling.

  • Fixed CLI exit codes to use absolute values.

    • Invalid GPU parameters now return positive error codes as documented.

  • Fixed CLI breakage when amdgpu driver is not present.

    • Improved init to better catch driver loading issues.

  • Aligned amdsmi_get_gpu_device_uuid() with HIP/rocminfo UUID format.

    • Modified amdsmi_asic_info_t.asic_serial to report per-socket serial using KFD’s unique_id.

  • Fixed multiple bugs in NIC/switch code and amdsmi_init() NIC handling.

    • Fixed sizeof operator precedence, hw_mon reset, NUMA=65535 handling, and several CLI function call errors.

    • Fixed amdsmi_init() to succeed when no NIC hardware is present.

  • Fixed shared mutex and self-heal.

    • Improved self-heal logic to correctly identify and recover from corrupted or uninitialized mutex state.

  • Fixed cu_occupancy displaying 0% instead of N/A when file is unavailable.

    • Process cu_occupancy is now initialized to INVALID instead of zero, so amd-smi process displays N/A rather than a misleading 0% when the sysfs file is not accessible.

  • Fixed CLI set commands silently succeeding on invalid input values.

    • amd-smi set --profile <INVALID> now returns a non-zero exit code and lists available profiles in the error message; invalid profile names are rejected at parse time.

    • amd-smi set --clk-level <CLK_TYPE> (missing performance level indices) now returns a non-zero exit code with a usage hint instead of silently succeeding.

    • amd-smi set --power-cap <OUT_OF_RANGE> now returns a non-zero exit code.

    • amd-smi set --fan <INVALID>% no longer prompts the out-of-spec warning before validating the percentage range; invalid values are rejected immediately.

  • Fixed amd-smi set --profile help text omitting BOOTUP_DEFAULT.

    • BOOTUP_DEFAULT was always accepted at runtime but was missing from the --help profile list. Auditing invalid-input handling exposed this gap. amd-smi reset --profile can also be used to return to the bootup default power profile.

  • Fixed amd-smi monitor --brcm_nic and --brcm_switch flags being registered on non-BRCM systems.

    • These flags are now only registered when BRCM hardware is present, preventing spurious failures on AMD GPU-only systems.

  • Fixed amd-smi default command alignment.

    • Updated default amd-smi output to align values to the left for improved readability. Several items were misaligned in the default output, and this change ensures a consistent left-aligned format across all fields.

    • This change is purely cosmetic and does not affect any functionality.

  • Renamed lc_perf_other_end_recovery to lc_perf_other_end_recovery_count in amd-smi metric CLI output for unification.

  • Removed references to deprecated amd-smi reset -r.

    • CLI help text and memory partition change warnings no longer reference amd-smi reset -r for driver reloading.

    • Users are now directed to use sudo modprobe -r amdgpu && sudo modprobe amdgpu to reload the driver after partition changes.

  • Changed CPU power APIs to return values in milliwatts (mW) for higher precision.

    • Removed lossy integer rounding ((mW + 500) / 1000) from 6 CPU power get APIs. Values are now returned in milliwatts directly from the ESMI library, preserving sub-watt precision.

    • C API: Output parameter type remains uint32_t*, but the unit changed from watts to milliwatts (mW).

      • amdsmi_get_cpu_socket_power

      • amdsmi_get_cpu_socket_power_cap

      • amdsmi_get_cpu_socket_power_cap_max

      • amdsmi_get_cpu_pwr_efficiency_mode (ppt_limit field)

      • amdsmi_get_cpu_core_ccd_power

      • amdsmi_get_cpu_sdps_limit

    • Python API (breaking): These functions now return int (milliwatts) instead of str (e.g., "240 Watts"). Callers that parsed the string output must update to handle the numeric return value.

    • CLI output: Power values now display with milliwatt precision (e.g., 240.500 Watts).

    • Added missing null-pointer validation for output parameters in amdsmi_get_cpu_socket_power_cap and amdsmi_get_cpu_socket_power_cap_max.

    • Updated header documentation to specify milliwatt units for all affected get and set API parameters.

  • Changed power APIs to have consistent output parameter types.

    • Modified 6 CPU power APIs so that all set and get APIs use consistent uint32_t output values.

    • Modified get and set APIs that had double output types to have uint32_t output types in milliwatts (mW).

      • amdsmi_get_cpu_socket_power(amdsmi_processor_handle processor_handle, uint32_t* ppower)

      • amdsmi_get_cpu_socket_power_cap(amdsmi_processor_handle processor_handle, uint32_t* pcap)

      • amdsmi_get_cpu_socket_power_cap_max(amdsmi_processor_handle processor_handle, uint32_t* pmax)

      • amdsmi_get_cpu_pwr_efficiency_mode(amdsmi_processor_handle processor_handle, uint32_t* power_efficiency_mode, uint32_t* utilization, uint32_t* ppt_limit)

      • amdsmi_get_cpu_core_ccd_power(amdsmi_processor_handle processor_handle, uint32_t* power)

      • amdsmi_get_cpu_sdps_limit(amdsmi_processor_handle processor_handle, uint32_t* sdps_limit)
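
    The Python API change is breaking: callers that previously parsed strings such as "240 Watts" now receive an integer in milliwatts. A minimal migration sketch; the power_watts helper is illustrative and not part of the amdsmi API:

    ```python
    def power_watts(value):
        """Normalize a CPU socket power reading to float watts.

        Handles both the old string return ("240 Watts") and the new
        integer milliwatt return of amdsmi_get_cpu_socket_power().
        """
        if isinstance(value, str):      # pre-7.13 behavior: "240 Watts"
            return float(value.split()[0])
        return value / 1000.0           # 7.13+: integer milliwatts

    assert power_watts("240 Watts") == 240.0   # old string form
    assert power_watts(240500) == 240.5        # new mW form keeps sub-watt precision
    ```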

Composable Kernel (1.3.0)#

Added#
  • Added an overload of load_tile_transpose that takes a reference to the output tensor as an output parameter.

  • Used the data type from the LDS tensor view when determining the tile distribution for transpose in the GEMM pipeline.

  • Added eightwarps support for abquant mode in blockscale GEMM.

  • Added preshuffleB support for abquant mode in blockscale GEMM.

  • Added support for explicit GEMM in CK_TILE grouped convolution forward and backward weight.

  • Added TF32 convolution support on gfx942 and gfx950 in CK. It can be enabled or disabled via the tf32 entry in DTYPES.

  • Added streamingllm sink support for FMHA FWD, including the qr_ks_vs, qr_async, and splitkv pipelines.

  • Added support for microscaling (MX) FP8/FP4 mixed data types to the Flatmm pipeline.

  • Added FP8 dynamic tensor-wise quantization support for the FP8 FMHA forward kernel.

  • Added FP8 KV cache support for FMHA batch prefill.

  • Added FMHA batch prefill kernel support for several KV cache layouts, flexible page sizes, and different lookup table configurations.

  • Added gpt-oss sink support for FMHA FWD, including the qr_ks_vs, qr_async, qr_async_trload, and splitkv pipelines.

  • Added persistent async input scheduler for CK Tile universal GEMM kernels to support asynchronous input streaming.

  • Added FP8 block scale quantization for FMHA forward kernel.

  • Added gfx11xx support for FMHA.

  • Added microscaling (MX) FP8/FP4 support on gfx950 for FMHA forward kernel (qr pipeline only).

  • Added FP8 per-tensor quantization support for FMHA forward V3 pipeline on gfx950.

HIP (7.13)#

Added#
  • New HIP APIs

    • cooperative_groups::reduce() allows calling reduce operators on thread_block_tile and coalesced_threads. The implementation is based on the __reduce_*_sync operations, so the macro HIP_ENABLE_EXTRA_WARP_SYNC_TYPES might be needed to unlock some optimizations.

  • New device attribute hipDeviceAttributeGPUDirectRDMAWithHipVMMSupported, indicating support for GPU Direct RDMA when using HIP VMM. This attribute corresponds to the CUDA CU_DEVICE_ATTRIBUTE_GPU_DIRECT_RDMA_WITH_CUDA_VMM_SUPPORTED.

Resolved issues#
  • A segmentation fault that occurred in child graphs during the graph‑launch phase. The issue originated from the entire graph being launched solely according to the parent graph’s scheduling logic. The HIP runtime now introduces a per‑graph segment‑scheduling control flag and propagates the parent graph’s scheduling mode to its child graphs, ensuring consistent scheduling behavior (classic vs. segment) and preventing failures when the parent falls back to classic scheduling.

  • A segmentation fault caused by passing a null pointer to the hipMemGetAddressRange API. The function now handles null pointers correctly, matching the behavior of the corresponding CUDA API.

Changed#
  • __reduce_and_sync(), __reduce_or_sync(), and __reduce_xor_sync() now behave consistently for all mask values and match CUDA. Previously, some masks were translated into bitwise operations, but others (such as those containing “holes”) were not. Now, all masks cause bitwise instructions to be emitted. This is a change in behavior compared to previous versions.
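
  The new rule can be modeled on the host as a plain bitwise reduction over exactly the lanes whose mask bit is set, holes included. This Python emulation is illustrative only; the real intrinsics execute on the device:

  ```python
  from functools import reduce

  def emulate_reduce_and_sync(mask, lane_values):
      """Host-side model of __reduce_and_sync: bitwise AND over the
      values of all lanes selected by mask, holes included."""
      selected = [v for lane, v in enumerate(lane_values)
                  if mask & (1 << lane)]
      return reduce(lambda a, b: a & b, selected)

  # A mask with a "hole" (lane 1 excluded) still performs a plain
  # bitwise AND over the participating lanes 0, 2, and 3.
  values = [0b1110, 0b0111, 0b1011, 0b1111]
  assert emulate_reduce_and_sync(0b1101, values) == 0b1010
  ```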

Optimized#
  • Improved HIP runtime error logging when an application’s fat binary does not include a compatible code object for the detected GPU architecture, offering clearer guidance to rebuild with the appropriate --offload-arch=gfxXXXX option.

  • Enabled in-memory and background-thread asynchronous logging in the HIP runtime by default to improve overall logging capability. This behavior can be disabled by setting the environment variable AMD_LOG_ASYNC=0.

hipBLAS (3.4.0)#

Added#
  • gfx1250 and gfx90c support to clients.

  • Version and other properties to Windows hipblas.dll.

  • Support for OpenBLAS ILP64-based API usage in clients.

Resolved issues#
  • Restored the fallback of using the deprecated rocBLAS API rocblas_set_device_memory_size if allocations are failing.

hipBLASLt (1.3.0)#

Added#
  • General Batched GEMM support.

Changed#
  • Replaced install.sh with an invoke-based task runner (tasks.py) to support cross-platform builds including Windows (ROCm 7.0+).

  • gtest and msgpack-cxx are now fetched automatically using CMake FetchContent if not found on the system.

hipCUB (4.4.0)#

Optimized#
  • Reduced build times for unit tests.

Resolved issues#
  • Fixed more memory leak issues with some unit tests.

hipFFT (1.0.23)#

Added#
  • hipFFTW plan creation functions for advanced and general plans:

    • fftw_plan_many_dft

    • fftwf_plan_many_dft

    • fftw_plan_many_dft_r2c

    • fftwf_plan_many_dft_r2c

    • fftw_plan_many_dft_c2r

    • fftwf_plan_many_dft_c2r

    • fftw_plan_guru_dft

    • fftwf_plan_guru_dft

    • fftw_plan_guru_dft_r2c

    • fftwf_plan_guru_dft_r2c

    • fftw_plan_guru_dft_c2r

    • fftwf_plan_guru_dft_c2r

    • fftw_plan_guru64_dft

    • fftwf_plan_guru64_dft

    • fftw_plan_guru64_dft_r2c

    • fftwf_plan_guru64_dft_r2c

    • fftw_plan_guru64_dft_c2r

    • fftwf_plan_guru64_dft_c2r

  • Support for gfx1150 architecture.

Changed#
  • Moved library to C++20 standard.

  • Removed Boost as a dependency for clients and samples.

  • Callback functions will be deprecated in a future release.

Resolved issues#
  • Fixed potential launch failure of data generation kernels in test and benchmark programs.

hipRAND (3.3.0)#

Added#
  • hiprand.dll now contains embedded file version metadata.

hipSOLVER (3.4.0)#

Added#
  • Compatibility-only functions:

    • geev

      • hipsolverDnXgeev_bufferSize

      • hipsolverDnXgeev

    • syevBatched

      • hipsolverDnXsyevBatched_bufferSize

      • hipsolverDnXsyevBatched

    • syevd

      • hipsolverDnXsyevd_bufferSize

      • hipsolverDnXsyevd

    • sytrs

      • hipsolverDnXsytrs_bufferSize

      • hipsolverDnXsytrs

hipSPARSELt (0.2.8)#

Added#
  • CTest and test categories support (--smoke, --pre_checkin, and --nightly).

Optimized#
  • Provided more kernels for the FP16, BF16, and Int8 datatypes.

  • Improved the performance of the HIPSPARSELT_PRUNE_SPMMA_TILE function.

Resolved issues#
  • Fixed incorrect behavior when retrieving the PCI chip ID.

  • Fixed LDS out-of-bounds read in prune_tile_kernel.

  • Fixed out-of-bounds access for compress function test cases.

  • Fixed missing null terminator in the return value of hipsparseLtGetArchName().

  • Fixed incorrect CPU result when bias_type is BF16 for spmm test cases.

  • Fixed double-free issue in the example code example_prune_strip.

  • Fixed symbol interposition in the hipSPARSELt library.

MIOpen (3.5.1)#

Added#
  • Added MIOPEN_LOG_BUFFER_SIZE option: when set to non-zero, dumps recent MIOpen logs to file on error.

  • [Conv] Added ConvDepthwiseFwd3D solver for optimizing specific 3D depthwise convolutions.

  • [Conv] Added NHWC layout support for Winograd convolution solvers.

  • [Conv] Added regular GEMM solver support for Conv3D forward and backward-data with 1x1x1 filters.

  • [Conv] Added configurable problem size threshold (MIOPEN_CONV_DIRECT_MAX_SIZE) for direct solver.

  • [Softmax] Added tuning support via Generic Search.

Changed#
  • [Conv] Improved default kernel selection for Composable Kernel (CK) convolution solvers with ranked shortlists.

  • [Conv] Split CK grouped convolution kernels into per-architecture runtime-loaded dynamic libraries.

Optimized#
  • Optimized transpose operations with tiled and vectorized variants for NCHW/NHWC conversions.

  • [BatchNorm] Optimized batchnorm reduction using warp shuffle intrinsics.

  • [Conv] Added heuristic filtering of slow GEMM solver configurations during tuning.

Deprecated#
  • [Conv] Deprecated CK non-grouped convolution forward and backward solvers.

  • Deprecated miopenConvolutionBackwardBias: the underlying OpenCL kernel (MIOpenConvBwdBias.cl) has been removed. The function now returns miopenStatusNotImplemented and will be removed in a future release.

Removed#
  • Removed GraphAPI experimental feature and related code.

Resolved issues#
  • [Conv] Fixed Winograd Fury grouped convolution correctness on gfx12xx when G > 1.

  • [Conv] Fixed bf16 WrW convolution precision loss in inter-batch accumulation.

  • [Conv] Fixed GPU memory fault in Winograd v3.0 WrW solver for large tensor shapes.

  • Fixed BF16 abs function precision error caused by unnecessary cast through FP16.

  • Fixed pooling kernel runtime compilation failure.

  • Fixed gfx1151 inline assembly compilation errors in batchnorm kernels.

  • Fixed use-after-free in HIPOCProgram binary loading.

ROCm Data Center Tool (RDC) (1.3.0)#

Resolved issues#
  • Fixed broken partition metrics.

    • Regardless of whether the GPU was partitioned, RDC only saw the GPU index and no instances due to upstream gpu_metrics changes.

rocBLAS (5.4.0)#

Added#
  • gfx1250 and gfx90c enabled.

  • Trace logging using ROCBLAS_LAYER=1 for rocblas_gemm_ex_get_solutions, rocblas_gemm_batched_ex_get_solutions, rocblas_gemm_ex_get_solutions_by_type, and rocblas_gemm_batched_ex_get_solutions_by_type.

  • Version and other properties to Windows rocblas.dll.

  • Support for OpenBLAS ILP64 API for host reference in clients.

  • Dockerfiles in the docker directory to assist in setting up development.

Optimized#
  • Improved the performance of Level 3 geam for pure transpose scale use cases.

  • Improved the performance of Level 2 tpsv.

Resolved issues#
  • Fixed querying solutions with null data pointers when using the hipBLASLt backend with rocblas_gemm_batched_ex_get_solutions.

ROCdbgapi (0.80.0)#

Added#
  • amd_dbgapi_process_get_info() adds a new query, AMD_DBGAPI_PROCESS_INFO_SIGNIFICANT_ADDRESS_BITS, which returns a mask spanning all the bits used by all address spaces.

rocDecode (1.8.0)#

Added#
  • Logging improvement: Added function entry and exit logs (at Info log level).

  • Logging improvement: Added duration to function exit logs and optimized log message formatting to reduce runtime overhead.

  • Logging improvement: Merged all logger instances into one global instance.

  • Logging improvement: Unified logging format in utility classes with core library logging format.

  • Logging improvement: Moved debug logging from a compile-time switch to the runtime logger level controlled by ROCDEC_LOG_LEVEL (debug = 4).

  • Added support for user-set output surface format.

Changed#
  • Removed CPack packaging (DEB/RPM/NSIS/TGZ/ZIP generation and all related CPACK variables).

  • Removed rocDecode-setup.py dependency installer script.

  • Removed Docker files.

  • Removed package install documentation; updated all documentation to reference TheRock for installation.

  • Simplified libva version check (single >= 1.22 requirement).

  • Cleaned up CMake error messages.

rocFFT (1.0.37)#

Optimized#
  • Allow plans to share hipModules if they use the same kernels. This reduces time spent and memory used when creating plans that exist concurrently.

  • Improved performance of unit-strided, interleaved, complex-to-complex and real-to-complex FFTs on gfx1201, gfx90a, gfx942, and gfx950.

    Single-precision lengths:

    • (160,72,72)

    • (160,80,72)

    • (160,80,80)

    • (72,72,72)

    • (80,80,80)

    • (84,84,72)

    • (96,96,96)

    • (108,108,80)

    Double-precision lengths:

    • (72,72,52)

    • (60,60,60)

    • (64,64,52)

    • (64,64,64)

Changed#
  • Moved library to C++20 standard.

  • Removed Boost as a dependency for clients and samples.

  • Split the precompiled kernel cache file (rocfft_kernel_cache.db) into per-architecture files (rocfft_kernel_cache_gfx950.db, rocfft_kernel_cache_gfx1201.db, and so on).

  • rocfft_plan_create returns rocfft_status_invalid_offset for any usage of non-zero offsets in plan descriptions. The feature is not supported yet.

  • Callback functions will be deprecated in a future release.

Resolved issues#
  • Potential issue with data generation for multi-dimensional transforms in rocfft-tests and rocfft-bench.

  • An issue that sometimes blocked complex-to-complex FFT plan creation when using noncontiguous strides in multiple dimensions.

  • An issue that sometimes blocked complex-to-real FFT plan creation when using noncontiguous strides in multiple dimensions.

  • An issue that sometimes blocked complex-to-real FFT plan creation when using noncontiguous strides with small lengths on the two fastest dimensions.

  • Potential launch failure of data generation kernels in test and benchmark programs.

  • Incorrect results on some strided real-complex FFTs on gfx90a.

  • Incorrect results on some even-length real FFTs that have odd-length strides on higher dimensions.

  • Callbacks on MPI transforms when not all ranks have the same number of data bricks.

  • Functional issues for multi-device, in-place real transforms.

  • Functional issues for multi-dimensional, multi-device transforms involving some unit length(s).

  • Functional issues for multi-device transforms involving data divisions along the slowest-varying axis (only) for some bricks but not all.

  • Functional issues for multi-device transforms setting no field on input or output.

  • Automatic allocation of work memory at plan execution time, when work memory is required on multiple devices.

rocJPEG (1.5.0)#

Changed#
  • rocJPEG is now delivered as part of TheRock. All core dependencies are provided by the TheRock build.

  • Removed CPack packaging (DEB/RPM/NSIS/TGZ/ZIP generation and all related CPACK variables).

  • Removed rocJPEG-setup.py dependency installer script.

  • Removed Docker files.

  • Removed package install documentation; updated all documentation to reference TheRock for installation.

  • Simplified libva version check (single >= 1.22 requirement).

  • Cleaned up CMake error messages.

ROCm Compute Profiler (3.6.0)#

Added#
  • Added L2 memory bandwidth derived metrics under --membw-analysis to enable L2 memory bandwidth-specific profiling and analysis in metric block 30.

  • Added AMD Ryzen AI Max 300 series (gfx1151) support.

    • New memory hierarchy visualization for RDNA 3.5 (gfx115X) in analyze CLI mode.

  • Introduced support for AMD Instinct MI350P GPU.

  • --view table option in analyze mode to force all TTY output to plain tables, ignoring cli_style from the YAML config (for example, mem_chart and Roofline charts render as tables). The --view argument is reserved for future TTY views (for example, other chart styles).

  • Added EA memory bandwidth derived metrics under --membw-analysis to enable EA memory bandwidth-specific profiling and analysis in metric block 30.

Changed#
  • Standalone roofline (--roof-only option) in profile mode now creates roofline.csv only. HTML roofline charts are generated via rocprof-compute analyze. The calc_ai_profile() function has been removed; calc_ai_analyze() is the single source of truth for arithmetic intensity calculation.

    • Roofline visualization options (--sort, --mem-level, --roofline-data-type) have moved from profile mode to analyze mode.

  • Standardized unit naming in analysis configs and Python utilities: pct/Pct → Percent, instr → Instructions.

  • Profile mode output format:

    • Profile mode now creates separate counter collection files for each application replay (pmc_perf_.csv or results_.csv).

    • Analyze mode automatically merges these files into a unified pmc_perf.csv containing information from all application replays during pre-processing.

  • ROCm Compute Profiler now builds and runs profile mode with vanilla Python without requiring any Python dependencies to be installed via pip.

    • Note that analysis mode will still require Python dependencies and will report any missing packages.
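
    Conceptually, the analyze-mode merge joins the per-replay counter tables into one unified table with a single row per dispatch. A simplified sketch; the column names and join key below are illustrative, not the actual pmc_perf schema:

    ```python
    def merge_replays(replays):
        """Join per-replay counter tables on a dispatch key so each
        dispatch ends up with the counters from every replay."""
        merged = {}
        for table in replays:
            for row in table:
                merged.setdefault(row["Dispatch_ID"], {}).update(row)
        return [merged[k] for k in sorted(merged)]

    # Two replays collect different counter groups for the same dispatch.
    pass_1 = [{"Dispatch_ID": 0, "SQ_WAVES": 1024}]
    pass_2 = [{"Dispatch_ID": 0, "GRBM_COUNT": 512}]
    unified = merge_replays([pass_1, pass_2])
    assert unified == [{"Dispatch_ID": 0, "SQ_WAVES": 1024, "GRBM_COUNT": 512}]
    ```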

Removed#
  • Removed HIP API tracing since it’s out-of-scope for ROCm Compute Profiler and the trace files were not being analyzed.

Optimized#
  • Filtering for block 21 (-b 21) in profile mode now only performs pc sampling and skips unnecessary counter collection.

    • Filtering for block 21 in analysis mode now skips metrics calculations and only shows kernel/dispatch/system statistics and pc sampling table.

Resolved issues#
  • Fixed roofline benchmark MFMA FP16/BF16/INT8 peaks for MI350.

  • Fixed an issue where pc sampling profiling failed with multi-argument commands and live process attachment.

Upcoming changes#
  • --path and --subpath options are deprecated and will be removed in a future release.

  • Intermediate CSV generation (results_*.csv) from rocpd databases during profiling is deprecated and will be removed in a future release. The analyze step will read .db files directly.

  • --retain-rocpd-output is deprecated and will be removed in a future release. .db files will be retained by default.

Known issues#
  • For AMD Ryzen AI Max 300 series, the roofline metrics table will have N/A values for the “peak” field.

    • This is planned to be addressed by adding empirical benchmark support for AMD Ryzen AI Max 300 series in a future release.

ROCm Systems Profiler (1.6.0)#

Added#
  • Kernel Fusion Driver (KFD) event tracing support to capture page faults, page migrations, queue evictions, GPU unmap events, and dropped events. Requires ROCprofiler-SDK 1.2.1 or later. Enable with ROCPROFSYS_ROCM_DOMAINS=kfd_events.

  • Support for pause and resume of profiling via roctxProfilerPause and roctxProfilerResume.

  • Support for selective region tracing via the ROCPROFSYS_SELECTED_REGIONS environment variable, limiting tracing to specified regions.

  • --selected-regions CLI argument to rocprof-sys-sample, rocprof-sys-run, and rocprof-sys-instrument for specifying selective region tracing from the command line.

  • Support for re-attaching to a previously profiled process. After detaching, rocprof-sys-attach can re-attach to the same PID for a new profiling session.

  • MPI-rank-based file output filtering feature controlled with two new CLI arguments: --rank-filter-output and --rank-filter-id.

  • JSON-based configurable preset system with --preset=<name> flag, replacing the old --<preset-name> flags. Presets are now loaded from JSON files in source/bin/common/presets/, making them extensible and exportable. Use --list-presets to see available presets and --explain=<name> for detailed preset information.

  • Domain flags for composable configuration: --gpu[=metrics], --rocm[=domains], --cpu[=hz], --parallel[=runtimes]. Domain flags can be combined with presets to customize profiling without editing configuration files.

  • Configuration export via --export-config[=file] to save resolved settings as reusable JSON configuration files. Exported configs can be loaded back with --preset=./config.json.

  • Topic-based help system: --help now shows a compact summary with essential options and a list of help topics. Use --help=<topic> (e.g., --help=sampling, --help=gpu, --help=tracing) to see only relevant options. Use --help=all for the full option listing.

  • Post-run output summary during library finalization showing result file locations.

  • JSON schema file (share/rocprofiler-systems/presets/schema.json) for preset validation.

  • Documentation (docs/how-to/instrumenting-rewriting-binary-application.rst) describing what to do when Dyninst reports a “Failed to transform trace” error during instrumentation.

Changed#
  • rocprof-sys-avail no longer queries GPU devices or hardware counters unless --hw-counters or --all is requested, reducing startup time and allowing settings/component queries in environments without GPU/ROCm.

  • rocprof-sys-instrument diagnostic file dumps (available, instrumented, excluded, coverage, overlapping) are now gated behind the --dump-info flag instead of being generated unconditionally.

  • Preset flags changed from --balanced to --preset=balanced syntax. The old --<preset-name> flags are still supported and handled within preset_registry.

  • Removed the ROCPROFSYS_USE_ROCM CMake option. ROCm is now required for building the ROCm Systems Profiler.

Resolved issues#
  • Fixed an issue where the --rocm-domains CLI option for rocprof-sys-run was not recognized.

rocminfo (1.0.0)#

Resolved issues#
  • Fixed BDF (Bus:Device.Function) ID truncation issue that caused incorrect display of PCI device identifiers. The bdf_id field was incorrectly declared as uint16_t instead of uint32_t, causing silent truncation when HSA runtime returned the full 32-bit BDF ID value. This has been corrected to properly display complete BDF information for all GPU agents.
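
  To see why the truncation loses information: an extended BDF places the PCI domain above bit 15, so a uint16_t field silently drops it. An illustrative sketch of the common domain:bus:device.function packing (the encoding shown is a generic convention, not the HSA runtime's exact layout):

  ```python
  def encode_bdf(domain, bus, device, function):
      """Pack an extended PCI BDF: domain[31:16], bus[15:8],
      device[7:3], function[2:0]."""
      return (domain << 16) | (bus << 8) | (device << 3) | function

  full = encode_bdf(domain=0x0001, bus=0xC1, device=0x00, function=0x0)
  truncated = full & 0xFFFF          # what a uint16_t field kept
  assert full == 0x0001C100
  assert truncated == 0xC100         # domain 0x0001 is silently lost
  ```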

rocPRIM (4.4.0)#

Added#
  • Added type trait definitions for __hip_bfloat16. This should resolve issues where this type did not work with radix-based algorithms.

  • Unit tests for config_types.

Optimized#
  • Reduced build times for unit tests.

  • Reduced memory usage in unit tests.

Resolved issues#
  • Fixed a silent overflow in rocprim::device_segmented_reduce where it could exceed the maximum number of HIP threads, resulting in missing output.

  • Certain large unit tests now detect insufficient system memory and skip the test case accordingly.

  • Fixed out-of-bounds memory access in block run length decode.

  • Fixed memory leak in unit tests.

ROCprofiler-SDK (1.3.0)#

Added#

API:

  • Late-start profiling support: Enables profiling when rocprofiler-sdk is loaded after HSA/HIP runtimes have already initialized.

    • rocprofiler_force_configure() now automatically detects and profiles runtimes initialized before the SDK loads.

    • Integrates with rocprofiler-register to retrieve the registered API tables.

    • Supports all runtime types (HSA, HIP, ROCTX, RCCL, rocDecode, rocJPEG, and more) automatically.

    • No explicit late-start API calls required; works transparently.

  • KFD (Kernel Fusion Driver) event tracing support:

    • Buffer service configurations for each KFD buffer tracing type.

    • New type tool_buffer_tracing_kfd_record_t using std::variant to wrap 8 different KFD buffer tracing types.

    • Each KFD event generates rocpd_info_pmc, rocpd_event, rocpd_region, and rocpd_pmc_event rows.

    • Fixed handling for special SVM location in KFD prefetch location reporting.

    • Fixed parsing for queue restore events to handle both correct format (character ‘0’) and broken driver format (NULL character ‘\0’).

rocprofv3 (CLI):

  • Multi-pass counter collection support: Support for multiple --pmc flags to define separate counter groups for different profiling passes.

    • Ability to combine command-line --pmc flags with input file counter groups.

    • Each pass generates output in a separate pass_n subdirectory.

    • Example: rocprofv3 --pmc SQ_WAVES --pmc GRBM_COUNT -- <app> creates two profiling passes.

  • KFD (Kernel Fusion Driver) event tracing support:

    • KFD record dumping to rocpd with support for 8 main KFD event types.

    • Support for rocpd to Perfetto conversion for KFD events.

    • --kfd-trace flag to enable KFD event tracing.

  • ROCTx support for ATT: Added ROCtx support to device thread trace when using --att --selected-regions.

    • Allows roctxProfilerPause and roctxProfilerResume to explicitly control when ATT data collection starts and stops.

    • Enables more precise, region-focused ATT tracing with reduced overhead and noise.

    • Supports multiple resume/pause cycles, each producing separate trace output files.

    • Incompatible with --att-consecutive-kernels.

  • PC sampling support for dynamic attach: Allows users to attach to a running application and collect PC samples without restarting the workload.

    • Enables profiling long-running or production-style jobs at the point of interest.

    • Results integrate with the existing PC sampling analysis flow.

Documentation:

  • Added marker-controlled thread tracing section to the thread trace how-to guide.

  • Added cross-reference from ROCTx documentation to ATT with selected-regions.

Changed#

Implementation:

  • Late-start architecture redesign: Removed direct runtime symbol access in favor of proper rocprofiler-register integration.

    • Replaced ~600 lines of dlopen/dlsym bypass logic with ~80 lines by using rocprofiler_register_invoke_all_registrations().

    • Late-start now works by requesting rocprofiler-register to re-propagate stored API tables.

    • Extensible design: automatically supports new runtimes without SDK code changes.

    • Provides a proper separation of concerns: rocprofiler-register manages the table storage while the SDK manages the table wrapping.

  • Counter dimension encoding changed from fixed-width to variable-width allocation per dimension type.

  • Dimension selection and reduction logic now uses explicit dimension masks and single-index selection.

  • HSA queue interception extended to handle AMD extended kernel dispatch packets.

Removed#
  • Counter collection support for plain text (.txt) input files. Only structured file formats (JSON and YAML) with schema validation are now supported.

Resolved issues#
  • Fixed rocpd OTF2 output to add ACCELERATOR_DEVICE as system tree node domain for AMD devices.

  • Fixed rocprofv3 input file parsing where comment lines containing pmc: were incorrectly processed as valid counter collection directives, causing unintended profiling passes.

rocRAND (4.4.0)#

Added#
  • gfx1150 and gfx1152 support.

  • rocrand.dll now contains embedded file version metadata.

Resolved issues#
  • Fixed memory leak in unit tests.

rocSHMEM (3.4.0)#

Added#
  • Added new APIs:

    • rocshmem_quiet_on_stream

    • rocshmem_sync_all_on_stream

    • rocshmem_TYPENAME_alltoall_wg

    • rocshmem_TYPENAME_alltoallv_wg

    • rocshmem_team_my_pe

    • rocshmem_team_n_pes

    • rocshmem_barrier

    • rocshmem_barrier_wave

    • rocshmem_barrier_wg

    • rocshmem_buffer_register

    • rocshmem_buffer_unregister

    • rocshmem_info_get_version

    • rocshmem_info_get_name

    • rocshmem_vendor_get_version_info

  • Added library constants: ROCSHMEM_MAJOR_VERSION, ROCSHMEM_MINOR_VERSION, ROCSHMEM_MAX_NAME_LEN, ROCSHMEM_VENDOR_STRING, ROCSHMEM_VERSION, ROCSHMEM_VENDOR_MAJOR_VERSION, ROCSHMEM_VENDOR_MINOR_VERSION, ROCSHMEM_VENDOR_PATCH_VERSION.

  • Added vendor string and backend metadata to the rocshmem_info output.

  • Added ROCSHMEM_TEAM_WORLD for device code.

  • Added ROCSHMEM_TEAM_SHARED predefined team for PEs sharing a common memory domain (same node).

  • Added new environment variables:

    • ROCSHMEM_GDA_OVERRIDE_NIC_FIRMWARE_CHECK

    • ROCSHMEM_GDA_NUM_QPS_PER_PE_DEFAULT_CTX

    • ROCSHMEM_GDA_NUM_QPS_PER_PE_USR_CTX

  • Added VMM POSIX memory allocator (USE_HEAP_DEVICE_VMM_POSIX):

    • Uses HIP Virtual Memory Management (VMM) APIs for fine-grained memory control.

    • Requires ROCm 7.0+ and Linux kernel 5.6+.

    • Not compatible with MPI-based initialization (use ROCSHMEM_INIT_WITH_UNIQUEID instead).

Changed#
  • Use CQ collapsing for the Mellanox MLX5 GDA conduit.

rocSOLVER (3.34.0)#

Added#
  • Computation of solution for LU factorization without pivoting:

    • GETRS_NPVT (with batched and strided_batched versions)

    • GETRS_NPVT_64 (with batched and strided_batched versions)

  • Linear solver routines for symmetric matrices:

    • SYTRS (with batched and strided_batched versions)

    • SYTRS_64 (with batched and strided_batched versions)

Optimized#
  • Improved the performance of POTF2 and downstream functions such as POTRF.

Resolved issues#
  • Fixed a memory access error in SYTRF and synchronization issues in LASYF and SYTF2.

rocSPARSE (4.6.0)#

Added#
  • rocsparse_create_const_bsr_descr routine for creating a const sparse BSR matrix descriptor.

  • rocsparse_spic0 and rocsparse_spilu0 routines for incomplete factorizations, with strided batched computations enabled.

  • rocsparse_sptrsv_descr_create and rocsparse_sptrsv_descr_destroy routines.

  • rocsparse_singularity enumeration.

  • rocsparse_sptrsv_output_singularity and rocsparse_sptrsv_output_singularity_position in rocsparse_sptrsv_output.

  • Strided batched computations for rocsparse_sptrsv.

Optimized#
  • Significant performance improvement for rocsparse_Xgtsv_no_pivot_strided_batch.

  • Significant performance improvement for rocsparse_Xgtsv_no_pivot.

Resolved issues#
  • Fixed incorrect usage of __syncthreads in bsrmm, csrmm (row_split), and csritilu0x.

  • Fixed incorrect usage of __syncthreads in csx2dense, dense2csx, prune_dense2csr, csrcolor, and csrmm (nnz_split).

  • Fixed rocsparse_[s|d|c|z]csric0 where rocsparse_status_invalid_value was being returned when the maximum number of non-zeros in any row was between 513 and 1024.

  • Fixed compilation when using --rocsparse_ILP64.

  • Fixed off-by-one heap-buffer-overflow in temporary buffer allocation for rocsparse_csrsort, rocsparse_check_matrix_csr, and rocsparse_check_matrix_gebsr (and their delegating routines rocsparse_cscsort, rocsparse_coosort, rocsparse_check_matrix_csc, and rocsparse_check_matrix_gebsc) where the shift_offsets_kernel temp buffer was sized for m elements instead of m+1.
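
  The off-by-one stems from the CSR offsets layout: a matrix with m rows carries m + 1 row offsets (the final offset is the total nnz), so a temporary buffer sized for m elements overruns by one. A minimal pure-Python illustration, independent of the rocSPARSE internals:

  ```python
  def csr_row_offsets(rows):
      """Build CSR row offsets from per-row nonzeros: m rows always
      produce m + 1 offsets (the last one is the total nnz)."""
      offsets = [0]
      for row in rows:
          offsets.append(offsets[-1] + len(row))
      return offsets

  matrix_rows = [[5, 2, 9], [7], [1, 3]]        # m = 3 rows
  offsets = csr_row_offsets(matrix_rows)
  assert len(offsets) == len(matrix_rows) + 1   # buffer must hold m + 1
  assert offsets == [0, 3, 4, 6]
  ```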

Removed#
  • The deprecated C++14 support, which is no longer supported by the rocPRIM dependency.

rocThrust (4.4.0)#

Resolved issues#
  • Fixed a memory leak in a unit test.

  • Fixed unit test compatibility with ASAN (AddressSanitizer).

rocWMMA (2.2.1)#

Added#
  • Added the following externally contributed community samples, with build support and documentation:

    • simple_gemm_silu: demonstrates a GEMM + SiLU fused operator using the rocWMMA API.

    • simple_gemm_fusion: demonstrates block-tile-level dual-GEMM fusion using the rocWMMA API.

    • simple_gemm_swiglu: demonstrates a SwiGLU fused dual-GEMM kernel (LLaMA/Mistral FFN gate layer) using the rocWMMA API.

Changed#
  • Updated the find_package search for OpenMP to prefer the openmp-config.cmake provided by ROCm, with a fallback to module search mode.

  • Updated INSTALL_RPATH and added BUILD_RPATH for OpenMP.

Resolved issues#
  • Improved HIP RTC regression test portability when deployed outside the default path.

ROCm known issues#

ROCm known issues are noted on GitHub. These issues will be fixed in a future ROCm release. For known issues related to individual components, review the ROCm component changelogs.

ROCm Compute Profiler might fail when profiling bash script or command#

Running a bash script or command as a target for ROCm Compute Profiler might fail because bash overwrites the required environment variables. As a workaround, use the --no-native-tool option in profile mode. Note that this disables iteration multiplexing.

hipFFT and rocFFT callback examples fail to build on Windows#

The hipFFT and rocFFT callback examples in rocm-examples fail to build on Windows due to a linker error. CMake configuration and HIP object compilation complete successfully, but the final link step fails with the error clang: error: invalid linker name in argument '-fuse-ld=lld-link'. This issue affects all Windows configurations that use Relocatable Device Code (RDC) mode; Linux builds are not affected. As a workaround, skip the hipFFT and rocFFT callback examples on Windows, and refer to the Linux builds or the callback functionality documentation.

QMCPACK might become unresponsive during DMC simulation on AMD Instinct MI300A GPUs#

QMCPACK might become unresponsive when running Diffusion Monte Carlo (DMC) simulations with certain inputs on AMD Instinct MI300A GPUs. The application stops making progress after initialization and must be terminated manually.

Resource-intensive workloads might result in GPU memory faults#

Applications that pass large, complex data structures between device functions through scratch memory, particularly those that rely on compiler optimizations to minimize the number of copy operations, might encounter GPU memory access faults and become unresponsive.

Increased binary size for multi-target GPU builds#

Applications targeting multiple AMD GPU architectures might produce significantly larger binaries. Multi-target builds can produce binaries up to 54 percent larger, while single-target builds add approximately 8 MB per GPU target. As a workaround, reduce the number of GPU targets in multi-target builds, or strip the resource-usage symbols from release binaries.

HIP cooperative groups might fail when compiled using the SPIR-V path#

HIP applications that use cooperative groups might fail at kernel launch when compiled with --offload-arch=amdgcnspirv. The application fails at runtime with the error message LLVM ERROR: Cannot select: intrinsic %llvm.amdgcn.s.wait.asynccnt. This affects all GPU architectures when using the SPIR-V compilation path. As a workaround, compile with a direct GPU architecture target (for example, --offload-arch=gfx942) instead of --offload-arch=amdgcnspirv.

Illegal memory address error when using placement new with device function returns#

HIP kernels that use placement new to construct objects in device memory allocated with hipMalloc might crash with a hipErrorIllegalAddress error when a __device__ function return value is passed directly as the constructor argument. This only affects non-trivially-copyable types (for example, types with user-defined or deleted copy or move constructors); trivially copyable types are not affected. As a workaround, assign the device function return value to a local variable before passing it to placement new.

LLVM-based compilers might fail when compiling half-precision vector operations#

LLVM-based compilers might fail with a Failed to find subregs! error in SIInstrInfo::copyPhysReg when compiling half-precision vector operations with optimization enabled. The issue has been observed at optimization levels -O1 through -O3.

hipBLAS test suites failure on Windows#

When using hipBLAS on Windows, the test suites might return non-zero exit codes, even when all mathematical correctness tests pass. This issue can affect CI/CD pipeline validation and block automated testing workflows on Windows systems, because the test framework might fail to detect successful test completion.

ROCm Systems Profiler overwrites ROCPD output after process re-attachment#

When you use rocprof-sys-attach to re-attach to a previously profiled process, the ROCPD output database files (.db) are written to the initial session’s output directory instead of a new timestamped directory. This makes it difficult to distinguish profiling data between sessions. Perfetto trace files are not affected. As a workaround, back up your output directory before re-attaching to a previously profiled process.

Missing dependencies when installing ROCm Core SDK#

Installing the ROCm Core SDK using amdrocm-core-sdk or amdrocm-core-dev/devel might succeed, but some dependencies from the dev/devel meta packages might not be installed. As a workaround, install the dev packages manually:

sudo apt install amdrocm-*

ROCm resolved issues#

The following notable issues have been fixed in ROCm 7.13.0.

Multi-ROCm installation failed on RPM-based distributions#

Previously, installing multiple ROCm versions side by side on RPM-based distributions (RHEL and SLES) failed due to .build-id file conflicts between versioned packages.

vLLM server failed to launch in ROCm Docker images#

Previously, the vLLM server failed to start in ROCm 7.12.0 Docker images with an ImportError for librocm_smi64.so.1 due to missing library path configuration.

vLLM server failed to launch with tensor parallelism#

Previously, the vLLM server failed to start with an invalid device pointer error when launching models with tensor parallelism set to 8 on AMD Instinct MI300 and MI355X GPUs.

PyTorch DDP Gloo backend test failed on AMD GPUs#

Previously, the PyTorch Distributed Data Parallel (DDP) test test_ddp_apply_optim_in_backward_grad_as_bucket_view_false failed when using the Gloo backend.

rocWMMA header produced unknown type errors in HIP RTC#

Previously, HIP RTC programs that included the rocwmma/rocwmma.hpp header failed to compile with unknown type name errors.

ROCm upcoming changes#

Future releases will add support for:

  • Additional ROCm Core SDK components

  • Domain-specific expansion toolkits (data science, life science, finance, simulation, and other HPC domains)

  • More AMD hardware support