ROCm Core SDK 7.13.0 release notes#
2026-05-15
59 min read time
ROCm Core SDK 7.13.0 continues the technology preview release stream that began with ROCm 7.9.0, advancing the transition to the new TheRock build and release system. To learn more, see the transition guide.
Important
ROCm 7.13.0 follows the versioning discontinuity that began with the 7.9.0 preview release and remains separate from the 7.0 to 7.2 production releases. For the latest production stream release, see the ROCm documentation.
Maintaining parallel release streams – preview and production – gives users ample time to evaluate and adopt the new build system and dependency changes. The technology preview stream is planned to continue through mid-2026, after which it will replace the current production stream.
For previous preview releases, see the release history.
Release highlights#
ROCm Core SDK 7.13.0 with TheRock builds upon the 7.12.0 preview release.
This release expands support for AI inference, distributed workloads, and profiling workflows across AMD Instinct™, Radeon™, and Ryzen™ AI platforms. ROCm 7.13.0 adds inference-ready vLLM containers, expands GPU virtualization and partitioning support, introduces new profiling and tracing capabilities, and improves AI kernel, sparse math, and communication libraries.
Platform and hardware support#
This release expands GPU, operating system, virtualization, and partitioning support.
Expanded AMD GPU support#
ROCm 7.13.0 adds support for the following AMD GPUs and APUs:
AMD Instinct MI350P (gfx950)
AMD Radeon PRO W6800 (gfx1030)
AMD Radeon PRO V620 (gfx1030)
AMD Ryzen AI 7 PRO 360 (gfx1152)
AMD Ryzen AI 7 PRO 350 (gfx1152)
AMD Ryzen AI 5 PRO 340 (gfx1152)
AMD Ryzen AI 7 350 (gfx1152)
AMD Ryzen AI 7 345 (gfx1152)
AMD Ryzen AI 5 340 (gfx1152)
AMD Ryzen AI 5 330 (gfx1152)
For the complete list of supported AMD hardware, see AMD hardware support.
Expanded Ubuntu support#
ROCm 7.13.0 adds support for Ubuntu 26.04 on Instinct, Radeon, and Ryzen devices.
Ubuntu 24.04.4 is now the validated Ubuntu 24.04 point release, replacing Ubuntu 24.04.3.
For the full list of supported Linux distributions, see Operating system support.
Expanded GPU virtualization support for Instinct GPUs#
ROCm 7.13.0 adds support for the following virtualization configurations on AMD Instinct GPUs.
On MI355X: VMware ESXi 9.1 with Ubuntu 24.04 guest OS.
On MI300X: KVM SR-IOV with Ubuntu 24.04 host OS and Ubuntu 24.04 guest OS.
On MI210:
KVM passthrough with RHEL 9.4 host OS and Ubuntu 22.04 guest OS.
KVM SR-IOV with RHEL 9.4 host OS and Ubuntu 22.04 guest OS.
KVM SR-IOV with RHEL 9.4 host OS and RHEL 9.4 guest OS.
Supported SR-IOV configurations require the GIM Driver 9.0.0K. For details, see GPU virtualization support.
Expanded Instinct GPU partitioning support#
ROCm 7.13.0 enables the QPX compute + NPS 2 memory partition combination in bare metal deployments.
For details, see GPU partitioning support.
AI inference and frameworks#
This release adds inference-ready container images and improves multi-node communication for distributed workloads.
vLLM 0.19.1 Docker images and pip packages#
With ROCm 7.13.0, Docker images for running vLLM inference workloads are available. Images include vLLM 0.19.1, PyTorch 2.10, and Python 3.13 on Ubuntu 24.04.
Architecture-specific images are available for:
AMD Instinct GPUs: gfx942 (MI325X, MI300X, MI300A) and gfx950 (MI355X, MI350X, MI350P)
AMD Radeon GPUs: gfx1100, gfx1101, gfx1102, gfx1200, gfx1201
AMD Ryzen AI APUs: gfx1150, gfx1151, gfx1152
See vLLM inference and serving on ROCm to get started.
RCCL multi-node optimization for AMD Ryzen AI Max 300 series#
RCCL improves multi-node clustering performance on systems with AMD Ryzen AI Max 300 series connected over Ethernet. Building on the initial multi-node enablement in ROCm 7.12.0, this release optimizes collective communication for distributed AI inference workloads using tensor parallelism (TP) and expert parallelism (EP) across up to 4 Ethernet-connected nodes.
RCCL GDA-based alltoall via rocSHMEM integration (experimental)#
RCCL adds experimental support for GPU Direct Async (GDA)-based alltoall and alltoallv collective operations through rocSHMEM integration. When enabled, RCCL invokes rocSHMEM operations that use GDA to reduce latency for small message alltoall patterns.
This feature requires building RCCL with the `--rocshmem` flag and setting `RCCL_ROCSHMEM_ENABLE=1` at runtime. GDA support currently requires Broadcom NICs with GDA capability.
Developer tools and profiling#
This release adds new profiling capabilities, introduces the open-source ROCprof Trace Decoder, and extends HIP programming APIs.
ROCprof Trace Decoder open source release#
ROCprof Trace Decoder, previously delivered as a closed-source component within ROCprofiler-SDK, is now available as the open-source rocprof-trace-decoder library. The decoder converts raw SQTT data from AMD GPUs into structured execution traces for performance analysis and debugging. It supports a wide range of AMD GPUs spanning Instinct, Radeon, and Ryzen architectures, with unit and integration tests across all supported hardware. See AMD hardware support for the complete list.
HIP cooperative groups reduce operations#
HIP adds cooperative_groups::reduce() for performing reduction operations
across thread_block_tile and coalesced_threads groups. The implementation
is based on __reduce_*_sync operations, and the
HIP_ENABLE_EXTRA_WARP_SYNC_TYPES macro might be required to enable some
optimizations.
Additionally, `__reduce_and_sync()`, `__reduce_or_sync()`, and `__reduce_xor_sync()` now provide consistent behavior for all mask values. All masks now emit bitwise instructions, aligning behavior with NVIDIA CUDA. This is a change from previous versions, where some masks were translated to bitwise operations, and others were not.
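As a rough behavioral model of the mask semantics described above (plain Python, not HIP code; the lane values and mask layout are illustrative), each intrinsic combines the values of every lane whose bit is set in the mask with a single bitwise operation:

```python
from functools import reduce

def reduce_and_sync_model(mask: int, lane_values: list) -> int:
    """Behavioral model of a __reduce_and_sync-style operation:
    bitwise-AND the values of every lane selected by the mask."""
    selected = [v for lane, v in enumerate(lane_values) if (mask >> lane) & 1]
    return reduce(lambda a, b: a & b, selected)

# Four active lanes of a 32-lane warp (illustrative values)
values = [0b1111, 0b1101, 0b1110, 0b0111] + [0] * 28
full = reduce_and_sync_model(0b1111, values)  # AND over lanes 0-3 -> 0b0100
pair = reduce_and_sync_model(0b0011, values)  # AND over lanes 0-1 -> 0b1101
```

The point of the consistency fix is that this same lane-selection rule now applies for every mask value, not only for the masks that previously lowered to bitwise instructions.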
ROCm Compute Profiler feature highlights#
The following are notable enhancements to the ROCm Compute Profiler (rocprofiler-compute).
RDNA 3.5 support: ROCm Compute Profiler now supports GPU performance profiling and analysis on AMD Ryzen AI Max 300 series processors.
Removed dependency requirements for profiling: Building ROCm Compute Profiler and using profile mode no longer requires installing Python dependencies from the `requirements.txt` file. Analysis mode still requires Python dependencies. This change moves several operations from profile mode to analysis mode, including roofline HTML generation, roofline-related options (`--sort`, `--mem-level`, `--roofline-data-type`), and creation of the combined `pmc_perf.csv` file. Profile mode now only runs the roofline empirical benchmark, creates a `roofline.csv` file, and creates per-replay CSV files without merging them.
ROCm Systems Profiler feature highlights#
The following are notable enhancements to the ROCm Systems Profiler (rocprofiler-systems).
Pause and resume profiling: ROCm Systems Profiler now supports pausing and resuming profiling at runtime through the `roctxProfilerPause` and `roctxProfilerResume` APIs. This allows you to capture profiling data only during specific execution phases, reducing overhead and minimizing output size for long-running workloads.
Selective region tracing: You can now restrict tracing to defined regions of interest using the `ROCPROFSYS_SELECTED_REGIONS` environment variable, reducing noise and limiting data collection to relevant workload segments.
KFD event tracing: Kernel Fusion Driver (KFD) event tracing is now available for GPU memory management analysis, including page faults, page migrations, queue evictions, GPU unmap events, and dropped events. Requires an XNACK-capable GPU and ROCprofiler-SDK 1.2.1 or later.
MPI file-output filtering: You can now filter profiler output files based on MPI rank using the `--rank-filter-output` CLI option or the `ROCPROFSYS_RANK_FILTER_OUTPUT` configuration setting, suppressing output from all other ranks. An optional `--rank-filter-id` option (`ROCPROFSYS_RANK_FILTER_ID`) allows specifying a custom environment variable for rank identification.
JSON-based profiling presets and domain flags: You can now configure common profiling workflows using JSON-based presets and a single `--preset=<name>` flag instead of manually setting multiple `ROCPROFSYS_*` environment variables. Eleven built-in presets cover common profiling scenarios, including GPU tracing, HPC workloads, and API-level analysis. Composable domain flags (`--gpu`, `--rocm`, `--cpu`, `--parallel`) and a topic-based `--help=<topic>` system further simplify configuration and discoverability.
AMD SMI feature highlights#
APU metrics and memory tuning: New APU telemetry provides per-core temperature, power, clock, voltage, current, and throttle monitoring, with additional support for IPU activity and DRAM bandwidth metrics. New VRAM carveout and GTT tuning controls enable configurable memory allocation on supported APU platforms.
Per-component GPU temperature and clock monitoring: GPU metrics table version 1.9 adds HBM stack temperatures, per-die temperature monitoring, and per-die memory and SOC clock reporting for data center deployments.
CPU power APIs report in milliwatts (breaking change): CPU power APIs now return values in milliwatts (mW) instead of watts. Python bindings now return numeric integer values instead of formatted strings. Existing applications that parse previous string-based outputs must be updated.
For more information, see the AMD SMI section in the ROCm component changelogs.
Libraries#
This release adds new routines, data type support, and performance improvements across ROCm math and AI libraries.
Composable Kernel adds quantization and attention kernel capabilities#
Composable Kernel adds several capabilities for AI and large language model workloads:
Microscaling (MX) FP8/FP4 support: Mixed data type support for MX FP8 and FP4 in GEMM and Flash Multi-Head Attention (FMHA) forward kernels on AMD Instinct MI350 Series GPUs.
FP8 quantization for FMHA: FMHA forward kernels now support multiple FP8 quantization modes, including dynamic tensor-wise quantization, block scale quantization, per-tensor quantization, and FP8 KV cache support for batch prefill.
StreamingLLM and long-context inference: Sink token support for FMHA forward enables StreamingLLM-style long-context inference.
Batch prefill enhancements: FMHA batch prefill kernels now support multiple KV cache layouts, flexible page sizes, and configurable lookup table configurations.
RDNA 3 FMHA support: Flash Attention kernels are now available on RDNA 3 architectures.
SageAttention v2 forward kernel: Multi-granularity quantization for Q, K, and V tensors with FP8, INT8, and INT4 data types and per-tensor, per-block, per-warp, and per-thread scale granularities on AMD Instinct MI300 Series and MI350 Series GPUs.
General Batched GEMM support in hipBLASLt#
hipBLASLt adds native support for General Batched GEMM, where all matrices in a batch share the same problem dimensions but can have independent leading dimensions and strides. This replaces the previous implementation through the `hipblaslt_ext` Grouped GEMM APIs, which had known limitations.
The new implementation includes support for Global Split-U (GSU) to improve performance at large problem sizes. General Batched GEMM is important for inference workloads that dispatch batches of same-shape GEMM operations.
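To illustrate the memory layout this enables (a pure-Python sketch with column-major indexing, not the hipBLASLt API), every matrix in the batch shares the same M x N x K problem size, but each is addressed through its own leading dimension, so padded or differently-packed storage can coexist in one batch:

```python
def gemm_batched(A, B, lda_list, ldb_list, M, N, K):
    """Multiply each column-major A_i (M x K, leading dim lda_i) by
    B_i (K x N, leading dim ldb_i). All problems share M, N, K but
    use independent leading dimensions, as in General Batched GEMM."""
    out = []
    for A_i, B_i, lda, ldb in zip(A, B, lda_list, ldb_list):
        C = [[0] * N for _ in range(M)]
        for i in range(M):
            for j in range(N):
                # element (i, k) of A_i lives at flat index i + k * lda
                C[i][j] = sum(A_i[i + k * lda] * B_i[k + j * ldb]
                              for k in range(K))
        out.append(C)
    return out

# Two 2x2 problems; the second pair is stored with a padded column (ld = 3)
A0 = [1, 2, 3, 4]          # lda = 2: columns [1,2] and [3,4]
B0 = [1, 0, 0, 1]          # ldb = 2: identity
A1 = [1, 1, 9, 2, 2, 9]    # lda = 3: the 9s are padding, never read
B1 = [1, 0, 9, 0, 1, 9]    # ldb = 3: identity with padding
C = gemm_batched([A0, A1], [B0, B1], [2, 3], [2, 3], 2, 2, 2)
```

A Grouped GEMM interface would treat these as unrelated problems; the batched formulation lets the library dispatch them as one uniform kernel launch.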
rocSOLVER adds new solver routines and matrix analysis functions#
rocSOLVER adds the following new routines, all with 64-bit index support:
GETRS_NPVT: Solution of linear systems using LU factorization without pivoting. Batched and strided-batched variants are available.
SYTRS: Solution of linear systems for symmetric matrices. Batched and strided-batched variants are available.
Additionally, POTF2 and downstream POTRF Cholesky factorization performance have been improved.
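The operation `GETRS_NPVT` performs can be sketched in plain Python (Doolittle LU without pivoting followed by two triangular solves; not the rocSOLVER API, and only valid for matrices that never produce a zero pivot):

```python
def lu_solve_npvt(A, b):
    """Solve A x = b via LU factorization without pivoting.
    Assumes no zero pivot is encountered (the NPVT precondition)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):        # row i of U
            U[i][j] = A[i][j] - sum(L[i][k] * U[k][j] for k in range(i))
        L[i][i] = 1.0
        for j in range(i + 1, n):    # column i of L
            L[j][i] = (A[j][i] - sum(L[j][k] * U[k][i] for k in range(i))) / U[i][i]
    y = []                           # forward substitution: L y = b
    for i in range(n):
        y.append(b[i] - sum(L[i][k] * y[k] for k in range(i)))
    x = [0.0] * n                    # back substitution: U x = y
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(U[i][k] * x[k] for k in range(i + 1, n))) / U[i][i]
    return x

# Example: solve [[4, 3], [6, 3]] x = [10, 12]; solution is [1, 2]
x = lu_solve_npvt([[4, 3], [6, 3]], [10, 12])
```

Skipping pivoting avoids row interchanges, which is why the batched and strided-batched variants map well to the GPU; it is only safe for matrices known to be well-conditioned without pivoting (for example, diagonally dominant systems).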
rocSPARSE adds sparse factorization routines#
rocSPARSE adds new generic API routines for sparse incomplete factorization and triangular solve:
`rocsparse_spic0` and `rocsparse_spilu0`: Generic incomplete Cholesky (IC0) and incomplete LU (ILU0) factorization routines with strided-batched computation support.
`rocsparse_sptrsv`: Extended with strided-batched computation support and singularity detection through the new `rocsparse_singularity` enumeration.
Performance of the tridiagonal solvers `rocsparse_Xgtsv_no_pivot` and `rocsparse_Xgtsv_no_pivot_strided_batch` has been improved.
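The no-pivot tridiagonal solve these routines perform corresponds to the classic Thomas algorithm; a minimal Python sketch (not the rocSPARSE API, which operates on device arrays and batches):

```python
def gtsv_no_pivot(dl, d, du, b):
    """Thomas algorithm: solve a tridiagonal system without pivoting.
    dl, d, du are the sub-, main-, and super-diagonals; b is the RHS.
    dl[0] and du[-1] are unused, matching the usual LAPACK-style layout."""
    n = len(d)
    c = [0.0] * n                      # modified super-diagonal
    y = [0.0] * n                      # modified right-hand side
    c[0] = du[0] / d[0]
    y[0] = b[0] / d[0]
    for i in range(1, n):              # forward elimination
        denom = d[i] - dl[i] * c[i - 1]
        c[i] = du[i] / denom if i < n - 1 else 0.0
        y[i] = (b[i] - dl[i] * y[i - 1]) / denom
    x = [0.0] * n                      # back substitution
    x[-1] = y[-1]
    for i in reversed(range(n - 1)):
        x[i] = y[i] - c[i] * x[i + 1]
    return x

# 3x3 system with diagonals (1, 2, 1); the solution is [1, 2, 3]
x = gtsv_no_pivot([0.0, 1.0, 1.0], [2.0, 2.0, 2.0], [1.0, 1.0, 0.0],
                  [4.0, 8.0, 8.0])
```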
Added rocDecode and rocJPEG libraries to the ROCm Core SDK#
rocDecode provides hardware-accelerated video decoding for H.264, H.265/HEVC, AV1, and VP9 codecs, while rocJPEG provides hardware-accelerated JPEG decoding on AMD GPUs. Together, they enable efficient GPU-based media processing pipelines for data-intensive workloads such as AI training.
Both libraries are supported on Linux on AMD Instinct, Radeon, and Ryzen AI. See the projects in ROCm/rocm-systems for more information.
Added ROCm Data Center Tool to the ROCm Core SDK#
ROCm Data Center Tool (RDC) provides telemetry collection, health monitoring, and job-level GPU statistics for data center deployments with AMD Instinct accelerators. RDC enables system administrators and cluster managers to monitor GPU health, collect telemetry data, and track per-job GPU usage across multi-node environments.
RDC is supported on Linux with AMD Instinct GPUs.
AMD hardware support#
The following table lists supported AMD Instinct GPUs, Radeon GPUs, and Ryzen APUs. Each supported device is listed with its corresponding GPU microarchitecture and LLVM target.
Note
If your GPU is not listed, it might be community-enabled through TheRock nightly builds. For more information, see TheRock supported GPUs. For installation guidance, see TheRock releases.
| Device series | Device | LLVM target | Architecture |
|---|---|---|---|
| AMD Instinct MI350 Series | Instinct MI355X, MI350X, MI350P | gfx950 | CDNA 4 |
| AMD Instinct MI300 Series | Instinct MI325X, MI300X, MI300A | gfx942 | CDNA 3 |
| AMD Instinct MI200 Series | Instinct MI250X, MI250, MI210 | gfx90a | CDNA 2 |
| AMD Instinct MI100 Series | Instinct MI100 | gfx908 | CDNA |

| Device series | Device | LLVM target | Architecture |
|---|---|---|---|
| AMD Radeon AI PRO R9000 Series | | gfx1201 | RDNA 4 |
| AMD Radeon RX 9000 Series | | gfx1201 | RDNA 4 |
| | | gfx1200 | RDNA 4 |
| AMD Radeon PRO W7000 Series | | gfx1100 | RDNA 3 |
| | | gfx1101 | RDNA 3 |
| AMD Radeon RX 7000 Series | | gfx1100 | RDNA 3 |
| | Radeon RX 7700 XE | gfx1101 | RDNA 3 |
| | | gfx1102 | RDNA 3 |
| AMD Radeon PRO V Series | | gfx1101 | RDNA 3 |
| | Radeon PRO V620 | gfx1030 | RDNA 2 |
| AMD Radeon PRO W6000 Series | Radeon PRO W6800 | gfx1030 | RDNA 2 |
Operating system support#
ROCm supports the following Linux distribution and Microsoft Windows versions. If you’re running ROCm on Linux, ensure your system is using a supported kernel version.
Important
The following table is a general overview of supported OSes. Actual support might vary by AMD GPU or APU. Use the Compatibility matrix to verify support for your specific setup before installation.
| Linux distribution | Supported versions | Linux kernel version |
|---|---|---|
| Ubuntu | 26.04 | GA 7.0 |
| | 24.04.4 | GA 6.8 |
| | 22.04.5 | GA 5.15 |
| Debian | 13 | 6.12 |
| | 12 | 6.1.0 |
| Red Hat Enterprise Linux (RHEL) | 10.1 | 6.12.0-124 |
| | 10.0 | 6.12.0-55 |
| | 9.7 | 5.14.0-611 |
| | 9.6 | 5.14.0-570 |
| | 9.4 | 5.14.0-427 |
| | 8.10 | 4.18.0-553 |
| Oracle Linux | 10 | UEK 8.1 |
| | 9 | UEK 8 |
| | 8 | UEK 7 |
| Rocky Linux | 9 | 5.14.0-570 |
| SUSE Linux Enterprise Server (SLES) | 16.0 | 6.12 |
| | 15.7 | 6.4.0-150700.51 |

| Operating system | Supported versions | Linux kernel version |
|---|---|---|
| Ubuntu | 26.04 | GA 7.0 |
| | 24.04.4 | GA 6.8 |
| | 22.04.5 | GA 5.15 |
| Red Hat Enterprise Linux (RHEL) | 10.1 | 6.12.0-124 |
| | 9.7 | 5.14.0-611 |
| Windows | 11 25H2 | — |

| Operating system | Supported versions | Linux kernel version |
|---|---|---|
| Ubuntu | 26.04 | GA 7.0 |
| | 24.04.4 | HWE 6.17 |
| Windows | 11 25H2 | — |
Installation updates#
ROCm 7.13.0 introduces several improvements to the Runfile Installer:
Performance improvements for installing and uninstalling gfx architectures.
ROCm component tests are now included.
Support for prerequisite OEM kernel installation as part of the dependency install on Ryzen systems. You no longer need to install it manually.
Auto-detection of the GPU when using the GUI or when the `gfx=` argument is not provided on the command line. If the installer cannot detect the GPU, you must specify the gfx architecture using the GUI or the `gfx=` argument.
Kernel driver and firmware bundle support#
ROCm requires a coordinated stack of compatible firmware, driver, and user space components. Maintaining version alignment between these layers ensures correct GPU operation and performance, especially for AMD data center products. While AMD publishes the AMD GPU driver and ROCm user space components, your server OEM (original equipment manufacturer) or infrastructure provider distributes the firmware packages. AMD supplies those firmware images (PLDM bundles), which the OEM integrates and distributes.
| AMD device | Firmware | Linux driver |
|---|---|---|
| Instinct MI355X | PLDM bundle 01.26.00.02 | AMD GPU Driver (amdgpu) |
| Instinct MI350X | | |
| Instinct MI350P | IFWI 00185129 | |
| Instinct MI325X | PLDM bundle 01.25.04.02 | |
| Instinct MI300X | PLDM bundle 01.26.00.02 | |
| Instinct MI300A | BKC 26.1 | |
| Instinct MI250X | IFWI 75 (or later) | |
| Instinct MI250 | Maintenance update (MU) 5 with IFWI 75 (or later) | |
| Instinct MI210 | | |
| Instinct MI100 | VBIOS D3430401-037 | |

| Linux driver | Windows driver |
|---|---|
| AMD GPU Driver (amdgpu) | AMD Software: Adrenalin Edition 26.5.1 |

| Linux driver | Windows driver |
|---|---|
| Inbox kernel driver in Ubuntu 26.04 or 24.04.4 | AMD Software: Adrenalin Edition 26.5.1 |
GPU virtualization support#
AMD Instinct data center GPUs support virtualization in the following configurations. Supported SR-IOV configurations require the AMD GPU Virtualization Driver (GIM) 9.0.0K – see the AMD Instinct Virtualization Driver documentation for more information.
| AMD GPU | Hypervisor | Virtualization technology | Virtualization driver | Host OS | Guest OS |
|---|---|---|---|---|---|
| Instinct MI355X | KVM | Passthrough | — | Ubuntu 24.04 | Ubuntu 24.04 |
| | | SR-IOV | GIM 9.0.0K | Ubuntu 24.04 | Ubuntu 24.04 |
| | | | | RHEL 10.0 | |
| | | | | RHEL 9.6 | |
| | ESXi | — | — | VMware ESXi 9.1 | Ubuntu 24.04 |
| Instinct MI350X | KVM | Passthrough | — | Ubuntu 24.04 | Ubuntu 24.04 |
| | | SR-IOV | GIM 9.0.0K | Ubuntu 24.04 | |
| | | | | RHEL 9.6 | |
| Instinct MI325X | KVM | SR-IOV | GIM 9.0.0K | Ubuntu 22.04 | Ubuntu 22.04 |
| Instinct MI300X | KVM | Passthrough | — | Ubuntu 22.04 | Ubuntu 22.04 |
| | | SR-IOV | GIM 9.0.0K | Ubuntu 24.04 | Ubuntu 24.04 |
| | | | | Ubuntu 22.04 | Ubuntu 22.04 |
| Instinct MI210 | KVM | Passthrough | — | RHEL 9.4 | Ubuntu 22.04 |
| | | SR-IOV | GIM 9.0.0K | Ubuntu 22.04 | |
| | | | | RHEL 9.4 | |
GPU partitioning support#
The following compute partition and NUMA-per-socket (NPS) configurations are available on AMD Instinct GPUs in bare metal deployments.
| Device | Compute partition mode | NPS mode | Deployment |
|---|---|---|---|
| Instinct MI355X, MI350X | CPX | NPS 2 | Bare metal |
| | DPX | NPS 2 | |
| | QPX | NPS 2 | |
| Instinct MI300X | CPX | NPS 4 | |
| | DPX | NPS 2 | |
See the AMD GPU partitioning topic in the AMD GPU Driver documentation to learn more.
AI ecosystem support#
ROCm 7.13.0 provides optimized support for popular deep learning frameworks and AI inference engines. The following table lists supported frameworks and libraries, their compatible operating systems, and validated versions.
| Framework | Supported versions | Supported OS | Supported Python versions |
|---|---|---|---|
| PyTorch | 2.11.0, 2.10.0, 2.9.1 | Linux | 3.14, 3.13, 3.12, 3.11 |
| | 2.11.0 | Windows | |
| JAX | 0.9.1, 0.8.2 | Linux | 3.14, 3.13, 3.12, 3.11 |
| vLLM | 0.19.1 | Linux | 3.13 |
ROCm Core SDK components#
The following table lists core tools and libraries included in the ROCm 7.13.0 release.
Important
The following table is a general overview of ROCm Core SDK components. Actual support for these libraries and tools can vary by GPU and OS. Use the Compatibility matrix to verify support for your specific setup.
| Component group | Component name | Version | Supported platforms |
|---|---|---|---|
| Math and compute libraries | hipBLAS | 3.4.0 | Linux/Windows · Instinct/Radeon/Ryzen |
| | hipBLASLt | 1.3.0 | |
| | hipCUB | 4.4.0 | |
| | hipFFT | 1.0.23 | |
| | hipRAND | 3.3.0 | |
| | hipSOLVER | 3.4.0 | |
| | hipSPARSE | 4.5.0 | |
| | MIOpen | 3.5.1 | |
| | rocBLAS | 5.4.0 | |
| | rocFFT | 1.0.37 | |
| | rocRAND | 4.4.0 | |
| | rocSOLVER | 3.34.0 | |
| | rocSPARSE | 4.6.0 | |
| | rocPRIM | 4.4.0 | |
| | rocThrust | 4.4.0 | |
| | rocWMMA | 2.2.1 | |
| | Composable Kernel | 1.3.0 | Linux/Windows · Instinct/Radeon |
| | hipSPARSELt | 0.2.8 | Linux/Windows · Instinct (gfx950/gfx942) |
| Communication libraries | RCCL | 2.28.3 | Linux · Instinct/Radeon/Ryzen |
| | rocSHMEM | 3.4.0 | Linux · Instinct (gfx950/gfx942/gfx90a) · Radeon (gfx1201/gfx1200/gfx1100/gfx1101/gfx1102) |
| Media libraries | rocDecode | 1.8.0 | Linux · Instinct/Radeon · Ryzen (gfx1150/gfx1151/gfx1152) |
| | rocJPEG | 1.5.0 | |
| Runtimes and compilers | HIP | 7.13 | Linux/Windows · Instinct/Radeon/Ryzen |
| | HIPIFY | 7.13 | |
| | LLVM | 23.0.0 | |
| | SPIRV-LLVM-Translator | 23.0.0 | |
| | ROCr Runtime | 1.21.0 | Linux · Instinct/Radeon/Ryzen |
| Profiling and debugging tools | ROCm Compute Profiler (rocprofiler-compute) | 3.6.0 | Linux · Instinct · Ryzen (gfx1150/gfx1151/gfx1152) |
| | ROCm Systems Profiler (rocprofiler-systems) | 1.6.0 | |
| | ROCprofiler-SDK | 1.3.0 | Linux · Instinct/Radeon · Ryzen (gfx1150/gfx1151/gfx1152) |
| | ROCdbgapi | 0.80.0 | Linux · Instinct/Radeon |
| | ROCm Debugger (ROCgdb) | 16.3 | |
| | ROCr Debug Agent | 2.1.0 | |
| Control and monitoring tools | AMD SMI (BM) | 26.4.0 | Linux · Instinct/Radeon |
| | rocminfo | 1.0.0 | Linux · Instinct/Radeon/Ryzen |
| | ROCm Data Center Tool (RDC) | 1.3.0 | Linux · Instinct |
ROCm component changelogs#
The following sections describe key changes to ROCm Core SDK components.
AMD SMI (BM) (26.4.0)#
Added#
Added APU metrics support (table versions 2.4 and 3.0).
- New `amdsmi_apu_metrics_t` struct accessible via the `amdsmi_gpu_metrics_t.apu_metrics` pointer (non-null when APU-specific metrics are available).
- v2.4 metrics:
  - `temperature_gfx`, `temperature_soc`, `temperature_core[8]`, `temperature_l3[2]`
  - `average_gfx_activity`, `average_mm_activity`
  - `average_socket_power`, `average_cpu_power`, `average_soc_power`, `average_gfx_power`, `average_core_power[8]`
  - Average clocks: `gfxclk`, `socclk`, `uclk`, `fclk`, `vclk`, `dclk`
  - Current clocks: `gfxclk`, `socclk`, `uclk`, `fclk`, `vclk`, `dclk`, `coreclk[8]`, `l3clk[2]`
  - `average_temperature_gfx`, `average_temperature_soc`, `average_temperature_core[8]`, `average_temperature_l3[2]`
  - `average_cpu_voltage`, `average_soc_voltage`, `average_gfx_voltage`, `average_cpu_current`, `average_soc_current`, `average_gfx_current`
  - `throttle_status`, `indep_throttle_status`
  - `fan_pwm`
- v3.0 metrics:
  - `temperature_core[16]`, `temperature_skin`
  - `average_vcn_activity`, `average_ipu_activity[8]`, `average_core_c0_activity[16]`
  - `average_dram_reads`, `average_dram_writes`, `average_ipu_reads`, `average_ipu_writes`
  - `average_apu_power`, `average_dgpu_power`, `average_all_core_power`, `average_ipu_power`, `average_sys_power`
  - `stapm_power_limit`, `current_stapm_power_limit`
  - `average_core_power[16]`, `current_coreclk[16]`
  - `current_core_maxfreq`, `current_gfx_maxfreq`
  - `average_vpeclk_frequency`, `average_ipuclk_frequency`, `average_mpipu_frequency`
  - `throttle_residency_prochot`, `throttle_residency_spl`, `throttle_residency_fppt`, `throttle_residency_sppt`, `throttle_residency_thm_core`, `throttle_residency_thm_gfx`, `throttle_residency_thm_soc`
  - `time_filter_alpha` value
- Fields not applicable to the current version are set to sentinel values: `0xFFFF` for `uint16_t`, `0xFFFFFFFF` for `uint32_t`, and `UINT64_MAX` for `uint64_t` fields.
- Python bindings updated with the `AmdSmiApuMetrics` ctypes structure.
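A small illustrative helper (plain Python with a hypothetical function name; the sentinel constants are the ones documented above) showing how a consumer of these structs might map not-applicable fields to `None`:

```python
# Sentinel values used for fields that do not apply to the metrics version
SENTINELS = {
    "uint16": 0xFFFF,
    "uint32": 0xFFFFFFFF,
    "uint64": 2**64 - 1,  # UINT64_MAX
}

def decode_metric(value, ctype):
    """Return None for a not-applicable field, else the raw value."""
    return None if value == SENTINELS[ctype] else value

temp = decode_metric(0xFFFF, "uint16")    # field absent in this table version
power = decode_metric(240500, "uint32")   # valid reading, passed through
```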
Added `oam_id` to `amdsmi_enumeration_info_t`.
- `amd-smi list -e` now displays `OAM_ID` (Physical XGMI ID / OAM ID).
- Added `--enumeration` as a long-form alias for `-e` in `amd-smi list`.
Added support for GPU metrics v1.9 new fields.
- New temperature fields in `amdsmi_gpu_metrics_t`:
  - `temperature_hbm_stacks` — per-stack HBM temperatures (°C)
  - `temperature_mid` — per-MID temperatures (°C)
  - `temperature_aid` — per-AID temperatures (°C)
  - `temperature_xcd` — per-XCC compute die temperatures (°C)
- New per-die clock fields in `amdsmi_gpu_metrics_t`:
  - `current_uclk_aid` — per-AID uclk (MHz)
  - `current_socclks_mid` — per-MID SOC clock (MHz)
- New constants:
  - `AMDSMI_MAX_NUM_HBM_STACKS` (12)
  - `AMDSMI_MAX_NUM_AID` (2)
  - `AMDSMI_MAX_NUM_MID` (2)
  - `AMDSMI_MAX_NUM_CLKS_PER_AID` (2)
  - `AMDSMI_MAX_NUM_CLKS_PER_MID` (2)
Added VRAM and GTT tuning interface.
- New `amd-smi static --mem-carveout` to view VRAM carveout options.
- New `amd-smi set --mem-carveout` to change the VRAM carveout (APU).
- New `amd-smi set --gtt` and `amd-smi reset --gtt` for system-wide GTT size tuning.
- New APIs: `amdsmi_get_gpu_uma_carveout_info()`, `amdsmi_set_gpu_uma_carveout()`, `amdsmi_get_ttm_info()`, `amdsmi_set_ttm_pages_limit()`, `amdsmi_reset_ttm_pages_limit()`.
Added UBB power and power_limit fields to `amdsmi_power_info_t` and `amdsmi_npm_info_t`.
- `amd-smi metric --power` now displays `ubb_power` when available.
- `amd-smi node -p` now displays the UBB power threshold when available.
Added CPU support for family 1A, models 50h–57h.
- New APIs: `amdsmi_get_cpu_xgmi_pstate_range()`, `amdsmi_get_cpu_core_ccd_power()`, `amdsmi_get_cpu_tdelta()`, `amdsmi_get_cpu_dimm_sb_reg()`, `amdsmi_get_cpu_svi3_vr_controller_temp()`, `amdsmi_get_cpu_pc6_enable()`, `amdsmi_get_cpu_cc6_enable()`, `amdsmi_get_cpu_sdps_limit()`, `amdsmi_get_cpu_core_floor_freq_limit()`, `amdsmi_get_cpu_core_eff_floor_freq_limit()`, and corresponding set APIs.
- Note: `amdsmi_get_dfc_ctrl()` is renamed to `amdsmi_get_cpu_dfc_ctrl()` and `amdsmi_set_dfc_ctrl()` is renamed to `amdsmi_set_cpu_dfc_ctrl()` for naming consistency.
Updated memory API documentation. Added a note that the sum of per-process memory usage is not expected to equal total usage.
Changed#
Renamed the `processor_type_t` enum typedef to `amdsmi_processor_type_t`.
The unprefixed typedef name did not follow the `amdsmi_*_t` convention used throughout `amdsmi.h` and could easily collide with identifiers defined by other system-management libraries. New code should use `amdsmi_processor_type_t`. The old name is preserved as a backward-compatibility typedef alias, so existing callers continue to compile unchanged.
Package install no longer modifies the system-wide `logrotate` timer or cron schedule.
Previously, installing `amd-smi-lib` overwrote `/lib/systemd/system/logrotate.timer` (or moved `/etc/cron.daily/logrotate` to `/etc/cron.hourly/`) to force hourly rotation, which affected every other package using `logrotate`. The package now only ships `/etc/logrotate.d/amd_smi.conf`, which sets its own `hourly` + `size 1M` cadence. AMD SMI logs still rotate at the same frequency; system-wide settings stay as the distribution configured them.
Optimized#
Optimized `rsmi_dev_device_identifiers_get()` in the ROCm SMI device layer.
- Removed unnecessary iteration by directly indexing the device list.
- Added bounds checking for `device_id`, with clearer error handling and logging.
- Improves performance for device identifier queries.
Resolved issues#
Fixed `amd-smi metric` crashing with a `TypeError` on MI300A when no CPU flags are specified.
When no CPU arguments are passed, `metric_cpu()` sets all boolean CPU args to `True` to display all available data. `--cpu-svi3-vr-controller-temp` takes a TYPE argument (and optional RAIL_INDEX) rather than a boolean flag, so setting it to `True` caused a `TypeError` crash when the code tried to subscript it with `[0][0]`. Added `cpu_svi3_vr_controller_temp` to the show-all exclusion list, following the existing pattern for `cpu_lclk_dpm_level`, `cpu_io_bandwidth`, `cpu_dimm_sb_reg`, and similar argument-taking flags.
Fixed `amdsmi_get_gpu_accelerator_partition_profile()` returning an incorrect `num_partitions` when `num_partition` is unavailable from GPU metrics.
GPU metrics no longer always provides `num_partition`. The function now derives the partition count from the active partition type when `num_partition` is not available:
- SPX → 1, DPX → 2, TPX → 3, QPX → 4
- CPX → derived from the XCD counter via `amdsmi_get_gpu_xcd_counter()`
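The derivation rule amounts to a small table lookup (illustrative Python; the CPX case is shown with a caller-supplied XCD count, since the real value comes from `amdsmi_get_gpu_xcd_counter()`):

```python
# Fixed partition counts per compute partition type
FIXED_PARTITIONS = {"SPX": 1, "DPX": 2, "TPX": 3, "QPX": 4}

def derive_num_partitions(partition_type, xcd_count=None):
    """Fallback used when GPU metrics does not report num_partition:
    a fixed count per partition type, or the XCD counter for CPX."""
    if partition_type == "CPX":
        if xcd_count is None:
            raise ValueError("CPX requires the XCD counter")
        return xcd_count
    return FIXED_PARTITIONS[partition_type]

qpx = derive_num_partitions("QPX")        # 4
cpx = derive_num_partitions("CPX", 8)     # 8, one partition per XCD
```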
Fixed `amdsmi_topo_get_p2p_status()` returning a raw `ctypes.c_uint32` object instead of an integer for the `type` field.
The `'type'` key in the returned dictionary now correctly returns `type_32.value` (an `int`) rather than the unwrapped ctypes object, consistent with the pattern used in `amdsmi_topo_get_link_type()`.
Adjusted KFD process caching to be more responsive.
Updated process caching to allow cache duration adjustment via the `AMDSMI_PROCESS_INFO_CACHE_MS` environment variable for workflows with rapid metric polling.
Fixed CLI exit codes to use absolute values.
Invalid GPU parameters now return positive error codes as documented.
Fixed CLI breakage when the `amdgpu` driver is not present.
Improved initialization to better catch driver loading issues.
Aligned `amdsmi_get_gpu_device_uuid()` with the HIP/rocminfo UUID format.
Modified `amdsmi_asic_info_t.asic_serial` to report the per-socket serial using KFD's `unique_id`.
Fixed multiple bugs in NIC/switch code and `amdsmi_init()` NIC handling.
- Fixed `sizeof` operator precedence, `hw_mon` reset, NUMA=65535 handling, and several CLI function call errors.
- Fixed `amdsmi_init()` to succeed when no NIC hardware is present.
Fixed shared mutex and self-heal.
Improved self-heal logic to correctly identify and recover from corrupted or uninitialized mutex state.
Fixed `cu_occupancy` displaying `0%` instead of `N/A` when the file is unavailable.
Process `cu_occupancy` is now initialized to `INVALID` instead of zero, so `amd-smi process` displays `N/A` rather than a misleading `0%` when the sysfs file is not accessible.
Fixed CLI set commands silently succeeding on invalid input values.
- `amd-smi set --profile <INVALID>` now returns a non-zero exit code and lists available profiles in the error message; invalid profile names are rejected at parse time.
- `amd-smi set --clk-level <CLK_TYPE>` (missing performance level indices) now returns a non-zero exit code with a usage hint instead of silently succeeding.
- `amd-smi set --power-cap <OUT_OF_RANGE>` now returns a non-zero exit code.
- `amd-smi set --fan <INVALID>%` no longer prompts the out-of-spec warning before validating the percentage range; invalid values are rejected immediately.
Fixed `amd-smi set --profile` help text omitting `BOOTUP_DEFAULT`.
`BOOTUP_DEFAULT` was always accepted at runtime but was missing from the `--help` profile list. Auditing invalid-input handling exposed this gap. `amd-smi reset --profile` can also be used to return to the bootup default power profile.
Fixed `amd-smi monitor --brcm_nic` and `--brcm_switch` flags being registered on non-BRCM systems.
These flags are now only registered when BRCM hardware is present, preventing spurious failures on AMD GPU-only systems.
Fixed `amd-smi` default command alignment.
Updated the default `amd-smi` output to left-align values for improved readability. Several items were misaligned in the default output; this change ensures a consistent left-aligned format across all fields. The change is purely cosmetic and does not affect any functionality.
Renamed `lc_perf_other_end_recovery` to `lc_perf_other_end_recovery_count` in `amd-smi metric` CLI output for unification.
Removed references to the deprecated `amd-smi reset -r`.
CLI help text and memory partition change warnings no longer reference `amd-smi reset -r` for driver reloading. Users are now directed to use `sudo modprobe -r amdgpu && sudo modprobe amdgpu` to reload the driver after partition changes.
Changed CPU power APIs to return values in milliwatts (mW) for higher precision.
- Removed lossy integer rounding (`(mW + 500) / 1000`) from six CPU power get APIs. Values are now returned in milliwatts directly from the ESMI library, preserving sub-watt precision.
- C API: The output parameter type remains `uint32_t*`, but the unit changed from watts to milliwatts (mW) for:
  - `amdsmi_get_cpu_socket_power`
  - `amdsmi_get_cpu_socket_power_cap`
  - `amdsmi_get_cpu_socket_power_cap_max`
  - `amdsmi_get_cpu_pwr_efficiency_mode` (`ppt_limit` field)
  - `amdsmi_get_cpu_core_ccd_power`
  - `amdsmi_get_cpu_sdps_limit`
- Python API (breaking): These functions now return `int` (milliwatts) instead of `str` (e.g., `"240 Watts"`). Callers that parsed the string output must update to handle the numeric return value.
- CLI output: Power values now display with milliwatt precision (e.g., `240.500 Watts`).
- Added missing null-pointer validation for output parameters in `amdsmi_get_cpu_socket_power_cap` and `amdsmi_get_cpu_socket_power_cap_max`.
- Updated header documentation to specify milliwatt units for all affected get and set API parameters.
Changed power APIs to have consistent output parameter types. Modified six CPU power APIs so that all set and get APIs use `uint32_t` output values. Get and set APIs that previously had `double` output types now use `uint32_t` output types in milliwatts (mW):
`amdsmi_get_cpu_socket_power(amdsmi_processor_handle processor_handle, uint32_t* ppower)`
`amdsmi_get_cpu_socket_power_cap(amdsmi_processor_handle processor_handle, uint32_t* pcap)`
`amdsmi_get_cpu_socket_power_cap_max(amdsmi_processor_handle processor_handle, uint32_t* pmax)`
`amdsmi_get_cpu_pwr_efficiency_mode(amdsmi_processor_handle processor_handle, uint32_t* power_efficiency_mode, uint32_t* utilization, uint32_t* ppt_limit)`
`amdsmi_get_cpu_core_ccd_power(amdsmi_processor_handle processor_handle, uint32_t* power)`
`amdsmi_get_cpu_sdps_limit(amdsmi_processor_handle processor_handle, uint32_t* sdps_limit)`
Composable Kernel (1.3.0)#
Added#
Added an overload of `load_tile_transpose` that takes a reference to the output tensor as an output parameter.
Use the data type from the LDS tensor view when determining tile distribution for transpose in the GEMM pipeline.
Added `eightwarps` support for abquant mode in blockscale GEMM.
Added `preshuffleB` support for abquant mode in blockscale GEMM.
Added support for explicit GEMM in `CK_TILE` grouped convolution forward and backward weight.
Added TF32 convolution support on gfx942 and gfx950 in CK. It can be enabled or disabled via `DTYPES` of `tf32`.
Added `streamingllm` sink support for FMHA FWD, including the `qr_ks_vs`, `qr_async`, and `splitkv` pipelines.
Added support for microscaling (MX) FP8/FP4 mixed data types to the Flatmm pipeline.
Added support for dynamic tensor-wise quantization in the FP8 FMHA forward kernel.
Added FP8 KV cache support for FMHA batch prefill.
Added FMHA batch prefill kernel support for several KV cache layouts, flexible page sizes, and different lookup table configurations.
Added gpt-oss sink support for FMHA FWD, including the `qr_ks_vs`, `qr_async`, `qr_async_trload`, and `splitkv` pipelines.
Added a persistent async input scheduler for CK Tile universal GEMM kernels to support asynchronous input streaming.
Added FP8 block scale quantization for FMHA forward kernel.
Added gfx11xx support for FMHA.
Added microscaling (MX) FP8/FP4 support on gfx950 for the FMHA forward kernel (`qr` pipeline only).
Added FP8 per-tensor quantization support for the FMHA forward V3 pipeline on gfx950.
HIP (7.13)#
Added#
The new HIP API `cooperative_groups::reduce()` allows calling reduce operators on `thread_block_tile` and `coalesced_threads`. The implementation is based on the `__reduce_*_sync` operations, so the macro `HIP_ENABLE_EXTRA_WARP_SYNC_TYPES` might be needed to unlock some optimizations.
New device attribute `hipDeviceAttributeGPUDirectRDMAWithHipVMMSupported`, indicating support for GPU Direct RDMA when using HIP VMM. This attribute corresponds to the CUDA attribute `CU_DEVICE_ATTRIBUTE_GPU_DIRECT_RDMA_WITH_CUDA_VMM_SUPPORTED`.
Resolved issues#
A segmentation fault that occurred in child graphs during the graph‑launch phase. The issue originated from the entire graph being launched solely according to the parent graph’s scheduling logic. The HIP runtime now introduces a per‑graph segment‑scheduling control flag and propagates the parent graph’s scheduling mode to its child graphs, ensuring consistent scheduling behavior (classic vs. segment) and preventing failures when the parent falls back to classic scheduling.
A segmentation fault caused by passing a null pointer to the `hipMemGetAddressRange` API. The function now handles null pointers correctly, matching the behavior of the corresponding CUDA API.
Changed#
`__reduce_and_sync()`, `__reduce_or_sync()`, and `__reduce_xor_sync()` now provide consistent behavior for all mask values and with CUDA. Previously, some masks were translated into bitwise operations, but others (such as those containing "holes") were not. Now, all masks cause bitwise instructions to be emitted. This is a change in behavior compared to previous versions.
Optimized#
Improved HIP runtime error logging when an application's fat binary does not include a compatible code object for the detected GPU architecture, offering clearer guidance to rebuild with the appropriate `--offload-arch=gfxXXXX` option.
Enabled in-memory and background-thread asynchronous logging in the HIP runtime by default to improve overall logging capability. This behavior can be disabled by setting the environment variable `AMD_LOG_ASYNC=0`.
hipBLAS (3.4.0)#
Added#
gfx1250 and gfx90c support to clients.
Version and other properties to the Windows `hipblas.dll`.
Support for `OpenBLAS` ILP64-based API usage in clients.
Resolved issues#
Restored the fallback of using the deprecated rocBLAS API `rocblas_set_device_memory_size` if allocations are failing.
hipBLASLt (1.3.0)#
Added#
General Batched GEMM support.
Changed#
Replaced `install.sh` with an invoke-based task runner (`tasks.py`) to support cross-platform builds including Windows (ROCm 7.0+).
`gtest` and `msgpack-cxx` are now fetched automatically using CMake FetchContent if not found on the system.
hipCUB (4.4.0)#
Optimized#
Reduced build times for unit tests.
Resolved issues#
Fixed more memory leak issues with some unit tests.
hipFFT (1.0.23)#
Added#
hipFFTW plan creation functions for advanced and general plans:
`fftw_plan_many_dft`
`fftwf_plan_many_dft`
`fftw_plan_many_dft_r2c`
`fftwf_plan_many_dft_r2c`
`fftw_plan_many_dft_c2r`
`fftwf_plan_many_dft_c2r`
`fftw_plan_guru_dft`
`fftwf_plan_guru_dft`
`fftw_plan_guru_dft_r2c`
`fftwf_plan_guru_dft_r2c`
`fftw_plan_guru_dft_c2r`
`fftwf_plan_guru_dft_c2r`
`fftw_plan_guru64_dft`
`fftwf_plan_guru64_dft`
`fftw_plan_guru64_dft_r2c`
`fftwf_plan_guru64_dft_r2c`
`fftw_plan_guru64_dft_c2r`
`fftwf_plan_guru64_dft_c2r`
Support for gfx1150 architecture.
Changed#
Moved library to C++20 standard.
Removed Boost as a dependency for clients and samples.
Callback functions will be deprecated in a future release.
Resolved issues#
Fixed potential launch failure of data generation kernels in test and benchmark programs.
hipRAND (3.3.0)#
Added#
`hiprand.dll` now contains embedded file version metadata.
hipSOLVER (3.4.0)#
Added#
Compatibility-only functions:
`geev`: `hipsolverDnXgeev_bufferSize`, `hipsolverDnXgeev`
`syevBatched`: `hipsolverDnXsyevBatched_bufferSize`, `hipsolverDnXsyevBatched`
`syevd`: `hipsolverDnXsyevd_bufferSize`, `hipsolverDnXsyevd`
`sytrs`: `hipsolverDnXsytrs_bufferSize`, `hipsolverDnXsytrs`
hipSPARSELt (0.2.8)#
Added#
CTest and test categories support (`--smoke`, `--pre_checkin`, and `--nightly`).
Optimized#
Provided more kernels for the `FP16`, `BF16`, and `Int8` datatypes.
Improved the performance of the `HIPSPARSELT_PRUNE_SPMMA_TILE` function.
Resolved issues#
Fixed incorrect behavior when retrieving the PCI chip ID.
Fixed LDS out-of-bounds read in `prune_tile_kernel`.
Fixed out-of-bounds access for compress function test cases.
Fixed missing null terminator in the return value of `hipsparseLtGetArchName()`.
Fixed incorrect CPU result when `bias_type` is `BF16` for spmm test cases.
Fixed double-free issue in the example code `example_prune_strip`.
Fixed symbol interposition in the hipSPARSELt library.
MIOpen (3.5.1)#
Added#
Added `MIOPEN_LOG_BUFFER_SIZE` option: when set to non-zero, dumps recent MIOpen logs to file on error.
[Conv] Added `ConvDepthwiseFwd3D` solver for optimizing specific 3D depthwise convolutions.
[Conv] Added NHWC layout support for Winograd convolution solvers.
[Conv] Added regular GEMM solver support for Conv3D forward and backward-data with 1x1x1 filters.
[Conv] Added configurable problem size threshold (`MIOPEN_CONV_DIRECT_MAX_SIZE`) for direct solver.
[Softmax] Added tuning support via Generic Search.
Changed#
[Conv] Improved default kernel selection for Composable Kernel (CK) convolution solvers with ranked shortlists.
[Conv] Split CK grouped convolution kernels into per-architecture runtime-loaded dynamic libraries.
Optimized#
Optimized transpose operations with tiled and vectorized variants for NCHW/NHWC conversions.
[BatchNorm] Optimized batchnorm reduction using warp shuffle intrinsics.
[Conv] Added heuristic filtering of slow GEMM solver configurations during tuning.
Deprecated#
[Conv] Deprecated CK non-grouped convolution forward and backward solvers.
Deprecated `miopenConvolutionBackwardBias`: the underlying OpenCL kernel (`MIOpenConvBwdBias.cl`) has been removed. The function now returns `miopenStatusNotImplemented` and will be removed in a future release.
Removed#
Removed GraphAPI experimental feature and related code.
Resolved issues#
[Conv] Fixed Winograd Fury grouped convolution correctness on gfx12xx when G > 1.
[Conv] Fixed bf16 WrW convolution precision loss in inter-batch accumulation.
[Conv] Fixed GPU memory fault in Winograd v3.0 WrW solver for large tensor shapes.
Fixed BF16 `abs` function precision error caused by unnecessary cast through FP16.
Fixed pooling kernel runtime compilation failure.
Fixed gfx1151 inline assembly compilation errors in batchnorm kernels.
Fixed use-after-free in HIPOCProgram binary loading.
ROCm Data Center Tool (RDC) (1.3.0)#
Resolved issues#
Fixed broken partition metrics. Due to upstream `gpu_metrics` changes, RDC only saw the GPU index and no instances, regardless of whether the GPU was partitioned.
rocBLAS (5.4.0)#
Added#
gfx1250 and gfx90c enabled.
Trace logging using `ROCBLAS_LAYER=1` for `rocblas_gemm_ex_get_solutions`, `rocblas_gemm_batched_ex_get_solutions`, `rocblas_gemm_ex_get_solutions_by_type`, and `rocblas_gemm_batched_ex_get_solutions_by_type`.
Version and other properties to the Windows `rocblas.dll`.
Support for the `OpenBLAS` ILP64 API for host reference in clients.
Dockerfiles in the `docker` directory to assist in setting up development.
Optimized#
Improved the performance of Level 3 `geam` for pure transpose scale use cases.
Improved the performance of Level 2 `tpsv`.
Resolved issues#
Fix for querying solutions when using the `hipBLASLt` backend with `rocblas_gemm_batched_ex_get_solutions` if using null data pointers.
ROCdbgapi (0.80.0)#
Added#
`amd_dbgapi_process_get_info()` adds a new query to get a mask spanning all the bits used by all the address spaces. The query is called `AMD_DBGAPI_PROCESS_INFO_SIGNIFICANT_ADDRESS_BITS`.
rocDecode (1.8.0)#
Added#
Logging improvement: Added function entry and exit logs (at Info log level).
Logging improvement: Added duration to function exit logs and optimized log message formatting to reduce runtime overhead.
Logging improvement: Merged all logger instances into one global instance.
Logging improvement: Unified logging format in utility classes with core library logging format.
Logging improvement: Moved debug logging from a compile-time switch to the runtime logger level controlled by `ROCDEC_LOG_LEVEL` (debug = 4).
Added support for user-set output surface format.
Changed#
Removed CPack packaging (DEB/RPM/NSIS/TGZ/ZIP generation and all related CPACK variables).
Removed `rocDecode-setup.py` dependency installer script.
Removed Docker files.
Removed package install documentation; updated all documentation to reference TheRock for installation.
Simplified libva version check (single `>= 1.22` requirement).
Cleaned up CMake error messages.
rocFFT (1.0.37)#
Optimized#
Allow plans to share hipModules if they use the same kernels. This reduces time spent and memory used when creating plans that exist concurrently.
Improved performance of unit-strided, interleaved, complex-to-complex and real-to-complex FFTs on gfx1201, gfx90a, gfx942, and gfx950.
Single-precision lengths:
(160,72,72)
(160,80,72)
(160,80,80)
(72,72,72)
(80,80,80)
(84,84,72)
(96,96,96)
(108,108,80)
Double-precision lengths:
(72,72,52)
(60,60,60)
(64,64,52)
(64,64,64)
Changed#
Moved library to C++20 standard.
Removed Boost as a dependency for clients and samples.
Split the precompiled kernel cache file (`rocfft_kernel_cache.db`) into per-architecture files (`rocfft_kernel_cache_gfx950.db`, `rocfft_kernel_cache_gfx1201.db`, and so on).
`rocfft_plan_create` returns `rocfft_status_invalid_offset` for any usage of non-zero offsets in plan descriptions. The feature is not supported yet.
Callback functions will be deprecated in a future release.
Resolved issues#
Potential issue with data generation for multi-dimensional transforms in rocfft-tests and rocfft-bench.
An issue that sometimes blocked complex-to-complex FFT plan creation when using noncontiguous strides in multiple dimensions.
An issue that sometimes blocked complex-to-real FFT plan creation when using noncontiguous strides in multiple dimensions.
An issue that sometimes blocked complex-to-real FFT plan creation when using noncontiguous strides with small lengths on the two fastest dimensions.
Potential launch failure of data generation kernels in test and benchmark programs.
Incorrect results on some strided real-complex FFTs on gfx90a.
Incorrect results on some even-length real FFTs that have odd-length strides on higher dimensions.
Callbacks on MPI transforms when not all ranks have the same number of data bricks.
Functional issues for multi-device, in-place real transforms.
Functional issues for multi-dimensional, multi-device transforms involving some unit length(s).
Functional issues for multi-device transforms involving data divisions along the slowest-varying axis (only) for some bricks but not all.
Functional issues for multi-device transforms setting no field on input or output.
Automatic allocation of work memory at plan execution time, when work memory is required on multiple devices.
rocJPEG (1.5.0)#
Changed#
rocJPEG is now delivered as part of TheRock. All core dependencies are provided by the TheRock build.
Removed CPack packaging (DEB/RPM/NSIS/TGZ/ZIP generation and all related CPACK variables).
Removed `rocJPEG-setup.py` dependency installer script.
Removed Docker files.
Removed package install documentation; updated all documentation to reference TheRock for installation.
Simplified libva version check (single `>= 1.22` requirement).
Cleaned up CMake error messages.
ROCm Compute Profiler (3.6.0)#
Added#
Added L2 memory bandwidth derived metrics under `--membw-analysis` to allow L2 memory bandwidth-specific profiling and analysis in metric block 30.
Added AMD Ryzen AI Max 300 series (gfx1151) support.
New memory hierarchy visualization for RDNA 3.5 (gfx115X) in analyze CLI mode.
Introduced support for the AMD Instinct MI350P GPU.
`--view table` option in analyze mode to force all TTY output to plain tables and ignore `cli_style` from the YAML config (for example, mem_chart and Roofline charts render as tables). The `--view` argument is reserved for future TTY views (for example, other chart styles).
Added EA memory bandwidth derived metrics under `--membw-analysis` to allow EA memory bandwidth-specific profiling and analysis in metric block 30.
Changed#
Standalone roofline (`--roof-only` option) in profile mode now creates `roofline.csv` only. HTML roofline charts are generated via `rocprof-compute analyze`. The `calc_ai_profile()` function has been removed; `calc_ai_analyze()` is the single source of truth for arithmetic intensity calculation.
Roofline visualization options (`--sort`, `--mem-level`, `--roofline-data-type`) have moved from profile mode to analyze mode.
Standardized unit naming in analysis configs and Python utilities: `pct`/`Pct` → `Percent`, `instr` → `Instructions`.
Profile mode output format: Profile mode now creates separate counter collection files for each application replay (`pmc_perf_.csv` or `results_.csv`). Analyze mode automatically merges these files into a unified `pmc_perf.csv` containing information from all application replays during pre-processing.
ROCm Compute Profiler now builds and runs profile mode with vanilla Python without requiring any Python dependencies to be installed via `pip`. Note that analyze mode will still require Python dependencies and will report any missing packages.
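The per-replay merge that analyze mode performs during pre-processing can be pictured with a small sketch. The merge-by-dispatch behavior and the `Dispatch_ID` column shown here are illustrative assumptions, not the profiler's actual implementation:

```python
import csv
import io

def merge_replays(replay_csvs):
    """Combine per-replay counter CSVs into one table keyed by dispatch.

    Hypothetical sketch: assumes each replay file shares a 'Dispatch_ID'
    column and contributes its own distinct counter columns.
    """
    merged = {}
    for text in replay_csvs:
        for row in csv.DictReader(io.StringIO(text)):
            merged.setdefault(row["Dispatch_ID"], {}).update(
                {k: v for k, v in row.items() if k != "Dispatch_ID"}
            )
    return {did: counters for did, counters in sorted(merged.items())}

# Two replays of the same app, each collecting a different counter:
rows = merge_replays([
    "Dispatch_ID,SQ_WAVES\n0,128\n1,64\n",
    "Dispatch_ID,GRBM_COUNT\n0,900\n1,450\n",
])
print(rows["0"])  # {'SQ_WAVES': '128', 'GRBM_COUNT': '900'}
```

The point is only that each replay supplies a slice of the counters for the same dispatches, and the unified `pmc_perf.csv` is their column-wise union.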
Removed#
Removed HIP API tracing since it’s out-of-scope for ROCm Compute Profiler and the trace files were not being analyzed.
Optimized#
Filtering for block 21 (`-b 21`) in profile mode now only performs PC sampling and skips unnecessary counter collection.
Filtering for block 21 in analyze mode now skips metrics calculations and only shows the kernel/dispatch/system statistics and PC sampling table.
Resolved issues#
Fixed roofline benchmark MFMA FP16/BF16/INT8 peaks for MI350.
Fixed an issue where pc sampling profiling failed with multi-argument commands and live process attachment.
Upcoming changes#
`--path` and `--subpath` options are deprecated and will be removed in a future release.
Intermediate CSV generation (`results_*.csv`) from rocpd databases during profiling is deprecated and will be removed in a future release. The analyze step will read `.db` files directly.
`--retain-rocpd-output` is deprecated and will be removed in a future release. `.db` files will be retained by default.
Known issues#
For the AMD Ryzen AI Max 300 series, the roofline metrics table will have N/A values for the "peak" field. This is planned to be addressed by adding empirical benchmark support for the AMD Ryzen AI Max 300 series in a future release.
ROCm Systems Profiler (1.6.0)#
Added#
Kernel Fusion Driver (KFD) event tracing support to capture page faults, page migrations, queue evictions, GPU unmap events, and dropped events. Requires ROCprofiler-SDK 1.2.1 or later. Enable with `ROCPROFSYS_ROCM_DOMAINS=kfd_events`.
Support for pause and resume of profiling via `roctxProfilerPause` and `roctxProfilerResume`.
Support for selective region tracing via the `ROCPROFSYS_SELECTED_REGIONS` environment variable, limiting tracing to specified regions.
`--selected-regions` CLI argument to `rocprof-sys-sample`, `rocprof-sys-run`, and `rocprof-sys-instrument` for specifying selective region tracing from the command line.
Support for re-attaching to a previously profiled process. After detaching, `rocprof-sys-attach` can re-attach to the same PID for a new profiling session.
MPI-rank-based file output filtering feature controlled with two new CLI arguments: `--rank-filter-output` and `--rank-filter-id`.
JSON-based configurable preset system with the `--preset=<name>` flag, replacing the old `--<preset-name>` flags. Presets are now loaded from JSON files in `source/bin/common/presets/`, making them extensible and exportable. Use `--list-presets` to see available presets and `--explain=<name>` for detailed preset information.
Domain flags for composable configuration: `--gpu[=metrics]`, `--rocm[=domains]`, `--cpu[=hz]`, `--parallel[=runtimes]`. Domain flags can be combined with presets to customize profiling without editing configuration files.
Configuration export via `--export-config[=file]` to save resolved settings as reusable JSON configuration files. Exported configs can be loaded back with `--preset=./config.json`.
Topic-based help system: `--help` now shows a compact summary with essential options and a list of help topics. Use `--help=<topic>` (for example, `--help=sampling`, `--help=gpu`, `--help=tracing`) to see only relevant options. Use `--help=all` for the full option listing.
Post-run output summary during library finalization showing result file locations.
JSON schema file (`share/rocprofiler-systems/presets/schema.json`) for preset validation.
Documentation (`docs/how-to/instrumenting-rewriting-binary-application.rst`) describing what to do when Dyninst reports a "Failed to transform trace" error during instrumentation.
Changed#
`rocprof-sys-avail` no longer queries GPU devices or hardware counters unless `--hw-counters` or `--all` is requested, reducing startup time and allowing settings/component queries in environments without GPU/ROCm.
`rocprof-sys-instrument` diagnostic file dumps (available, instrumented, excluded, coverage, overlapping) are now gated behind the `--dump-info` flag instead of being generated unconditionally.
Preset flags changed from `--balanced` to `--preset=balanced` syntax. The old `--<preset-name>` flags are still supported and handled within `preset_registry`.
Removed the `ROCPROFSYS_USE_ROCM` CMake option. ROCm is now required for building the ROCm Systems Profiler.
Resolved issues#
Fixed an issue where the `--rocm-domains` CLI option for `rocprof-sys-run` was not recognized.
rocminfo (1.0.0)#
Resolved issues#
Fixed a BDF (Bus:Device.Function) ID truncation issue that caused incorrect display of PCI device identifiers. The `bdf_id` field was incorrectly declared as `uint16_t` instead of `uint32_t`, causing silent truncation when the HSA runtime returned the full 32-bit BDF ID value. This has been corrected to properly display complete BDF information for all GPU agents.
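The truncation is easy to reproduce: a 32-bit BDF identifier packs domain, bus, device, and function, so a 16-bit field silently drops the upper half. The sketch below uses the conventional PCI bit layout (an assumption about the encoding, shown purely for illustration):

```python
def encode_bdf(domain: int, bus: int, device: int, function: int) -> int:
    # Conventional PCI layout: domain[31:16] bus[15:8] device[7:3] function[2:0]
    return (domain << 16) | (bus << 8) | (device << 3) | function

def decode_bdf(bdf: int) -> str:
    # Render as the familiar dddd:bb:dd.f form
    return (f"{bdf >> 16:04x}:{(bdf >> 8) & 0xFF:02x}:"
            f"{(bdf >> 3) & 0x1F:02x}.{bdf & 0x7}")

full = encode_bdf(0x0001, 0xC1, 0x00, 0x0)  # hypothetical GPU in PCI domain 1
truncated = full & 0xFFFF                   # what a uint16_t field would keep
print(decode_bdf(full))       # 0001:c1:00.0
print(decode_bdf(truncated))  # 0000:c1:00.0 -- the domain is silently lost
```

Widening the field to `uint32_t` preserves the domain bits that multi-domain systems rely on.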
rocPRIM (4.4.0)#
Added#
Added type trait definitions for `__hip_bfloat16`. This should resolve issues where this type did not work with radix-based algorithms.
Unit tests for config_types.
Optimized#
Reduced build times for unit tests.
Reduced memory usage in unit tests.
Resolved issues#
Fixed a silent overflow in `rocprim::device_segmented_reduce` where it could exceed the maximum number of HIP threads, resulting in missing output.
Certain large unit tests now properly detect if insufficient system memory is present and skip the test case accordingly.
Fixed out-of-bounds memory access in block run length decode.
Fixed memory leak in unit tests.
ROCprofiler-SDK (1.3.0)#
Added#
API:
Late-start profiling support: Enables profiling when `rocprofiler-sdk` is loaded after the HSA/HIP runtimes have already initialized.
`rocprofiler_force_configure()` now automatically detects and profiles runtimes initialized before the SDK loads.
Integrates with `rocprofiler-register` to retrieve the registered API tables.
Supports all runtime types (HSA, HIP, ROCTX, RCCL, rocDecode, rocJPEG, and more) automatically.
No explicit late-start API calls required; works transparently.
KFD (Kernel Fusion Driver) event tracing support:
Buffer service configurations for each KFD buffer tracing type.
New type `tool_buffer_tracing_kfd_record_t` using `std::variant` to wrap 8 different KFD buffer tracing types.
Each KFD event generates `rocpd_info_pmc`, `rocpd_event`, `rocpd_region`, and `rocpd_pmc_event` rows.
Fixed handling for special SVM location in KFD prefetch location reporting.
Fixed parsing for queue restore events to handle both the correct format (character '0') and the broken driver format (NULL character '\0').
rocprofv3 (CLI):
Multi-pass counter collection support: Support for multiple `--pmc` flags to define separate counter groups for different profiling passes.
Ability to combine command-line `--pmc` flags with input file counter groups.
Each pass generates output in a separate `pass_n` subdirectory.
Example: `rocprofv3 --pmc SQ_WAVES --pmc GRBM_COUNT -- <app>` creates two profiling passes.
KFD (Kernel Fusion Driver) event tracing support:
KFD record dumping to `rocpd` with support for 8 main KFD event types.
Support for `rocpd` to Perfetto conversion for KFD events.
`--kfd-trace` flag to enable KFD event tracing.
ROCTx support for ATT: Added ROCTx support to device thread trace when using `--att --selected-regions`.
Allows `roctxProfilerPause` and `roctxProfilerResume` to explicitly control when ATT data collection starts and stops.
Enables more precise, region-focused ATT tracing with reduced overhead and noise.
Supports multiple resume/pause cycles, each producing separate trace output files.
Incompatible with `--att-consecutive-kernels`.
PC sampling support for dynamic attach: Allows users to attach to a running application and collect PC samples without restarting the workload.
Enables profiling long-running or production-style jobs at the point of interest.
Results integrate with the existing PC sampling analysis flow.
Documentation:
Added marker-controlled thread tracing section to the thread trace how-to guide.
Added cross-reference from ROCTx documentation to ATT with `selected-regions`.
Changed#
Implementation:
Late-start architecture redesign: Removed direct runtime symbol access in favor of proper rocprofiler-register integration.
Replaced ~600 lines of `dlopen`/`dlsym` bypass logic with ~80 lines by using `rocprofiler_register_invoke_all_registrations()`.
Late-start now works by requesting `rocprofiler-register` to re-propagate stored API tables.
Extensible design: automatically supports new runtimes without SDK code changes.
Provides a proper separation of concerns: `rocprofiler-register` manages the table storage while the SDK manages the table wrapping.
Counter dimension encoding changed from fixed-width to variable-width allocation per dimension type.
Dimension selection and reduction logic now uses explicit dimension masks and single-index selection.
HSA queue interception extended to handle AMD extended kernel dispatch packets.
Removed#
Counter collection support for plain text (`.txt`) input files. Only structured file formats (JSON and YAML) with schema validation are now supported.
Resolved issues#
Fixed rocpd OTF2 output to add `ACCELERATOR_DEVICE` as the system tree node domain for AMD devices.
Fixed `rocprofv3` input file parsing where comment lines containing `pmc:` were incorrectly processed as valid counter collection directives, causing unintended profiling passes.
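The parsing pitfall behind that fix is generic: a naive substring check treats a commented-out line such as `# pmc: SQ_WAVES` as a live directive and spawns an extra pass. A hedged sketch of the corrected behavior (the input format shown is a simplification for illustration, not the exact rocprofv3 grammar):

```python
def parse_pmc_lines(text: str):
    """Collect counter groups from 'pmc:' lines, ignoring '#' comment lines."""
    groups = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):      # comment line: never a directive
            continue
        if stripped.startswith("pmc:"):
            counters = stripped[len("pmc:"):].split()
            if counters:
                groups.append(counters)
    return groups

text = """\
# pmc: SQ_WAVES GRBM_COUNT
pmc: SQ_WAVES
pmc: GRBM_COUNT
"""
print(parse_pmc_lines(text))  # [['SQ_WAVES'], ['GRBM_COUNT']]
```

Checking for the comment marker before matching the directive keyword is what prevents the unintended extra passes.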
rocRAND (4.4.0)#
Added#
gfx1150 and gfx1152 support.
rocrand.dll now contains embedded file version metadata.
Resolved issues#
Fixed memory leak in unit tests.
rocSHMEM (3.4.0)#
Added#
Added new APIs:
`rocshmem_quiet_on_stream`
`rocshmem_sync_all_on_stream`
`rocshmem_TYPENAME_alltoall_wg`
`rocshmem_TYPENAME_alltoallv_wg`
`rocshmem_team_my_pe`
`rocshmem_team_n_pes`
`rocshmem_barrier`
`rocshmem_barrier_wave`
`rocshmem_barrier_wg`
`rocshmem_buffer_register`
`rocshmem_buffer_unregister`
`rocshmem_info_get_version`
`rocshmem_info_get_name`
`rocshmem_vendor_get_version_info`
Added library constants: `ROCSHMEM_MAJOR_VERSION`, `ROCSHMEM_MINOR_VERSION`, `ROCSHMEM_MAX_NAME_LEN`, `ROCSHMEM_VENDOR_STRING`, `ROCSHMEM_VERSION`, `ROCSHMEM_VENDOR_MAJOR_VERSION`, `ROCSHMEM_VENDOR_MINOR_VERSION`, and `ROCSHMEM_VENDOR_PATCH_VERSION`.
Added vendor string and backend metadata to the `rocshmem_info` output.
Added `ROCSHMEM_TEAM_WORLD` for device code.
Added `ROCSHMEM_TEAM_SHARED` predefined team for PEs sharing a common memory domain (same node).
Added new environment variables:
`ROCSHMEM_GDA_OVERRIDE_NIC_FIRMWARE_CHECK`
`ROCSHMEM_GDA_NUM_QPS_PER_PE_DEFAULT_CTX`
`ROCSHMEM_GDA_NUM_QPS_PER_PE_USR_CTX`
Added VMM POSIX memory allocator (`USE_HEAP_DEVICE_VMM_POSIX`):
Uses HIP Virtual Memory Management (VMM) APIs for fine-grained memory control.
Requires ROCm 7.0+ and Linux kernel 5.6+.
Not compatible with MPI-based initialization (use `ROCSHMEM_INIT_WITH_UNIQUEID` instead).
Changed#
Use CQ collapsing for the Mellanox MLX5 GDA conduit.
rocSOLVER (3.34.0)#
Added#
Computation of solution for LU factorization without pivoting:
GETRS_NPVT (with batched and strided_batched versions)
GETRS_NPVT_64 (with batched and strided_batched versions)
Linear solver routines for symmetric matrices:
SYTRS (with batched and strided_batched versions)
SYTRS_64 (with batched and strided_batched versions)
Optimized#
Improved the performance of POTF2 and downstream functions such as POTRF.
Resolved issues#
Fixed a memory access error in SYTRF and synchronization issues in LASYF and SYTF2.
rocSPARSE (4.6.0)#
Added#
`rocsparse_create_const_bsr_descr` routine for creating a const sparse BSR matrix descriptor.
`rocsparse_spic0` and `rocsparse_spilu0` routines for incomplete factorizations, with strided batched computations enabled.
`rocsparse_sptrsv_descr_create` and `rocsparse_sptrsv_descr_destroy` routines.
`rocsparse_singularity` enumeration.
`rocsparse_sptrsv_output_singularity` and `rocsparse_sptrsv_output_singularity_position` in `rocsparse_sptrsv_output`.
Strided batched computations for `rocsparse_sptrsv`.
Optimized#
Significant performance improvement for `rocsparse_Xgtsv_no_pivot_strided_batch`.
Significant performance improvement for `rocsparse_Xgtsv_no_pivot`.
Resolved issues#
Fixed incorrect usage of `__syncthreads` in `bsrmm`, `csrmm` (row_split), and `csritilu0x`.
Fixed incorrect usage of `__syncthreads` in `csx2dense`, `dense2csx`, `prune_dense2csr`, `csrcolor`, and `csrmm` (nnz_split).
Fixed `rocsparse_[s|d|c|z]csric0` where `rocsparse_status_invalid_value` was being returned when the maximum number of non-zeros in any row is between 513 and 1024.
Fixed compilation when using `--rocsparse_ILP64`.
Fixed off-by-one heap-buffer-overflow in temporary buffer allocation for `rocsparse_csrsort`, `rocsparse_check_matrix_csr`, and `rocsparse_check_matrix_gebsr` (and their delegating routines `rocsparse_cscsort`, `rocsparse_coosort`, `rocsparse_check_matrix_csc`, and `rocsparse_check_matrix_gebsc`) where the `shift_offsets_kernel` temp buffer was sized for `m` elements instead of `m+1`.
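The `m` versus `m+1` sizing in that last fix matters because CSR-style offset arrays carry one entry per row plus a trailing end pointer. A small plain-Python sketch (illustrative only, not the kernel's implementation) of the shift-offsets pattern that overflows when the scratch buffer has only `m` slots:

```python
def shift_offsets(offsets, shift):
    """Shift a CSR-style offsets array for a matrix with m rows.

    offsets has m + 1 entries (m row starts plus the final end pointer),
    so a temporary buffer sized for only m entries drops offsets[m].
    """
    m = len(offsets) - 1          # number of rows
    temp = [0] * (m + 1)          # sizing this as [0] * m would overflow below
    for i in range(m + 1):
        temp[i] = offsets[i] + shift
    return temp

print(shift_offsets([0, 2, 5, 9], 1))  # [1, 3, 6, 10] -- 3 rows, 4 offsets
```

With a buffer of only `m` elements, writing the final end pointer lands one element past the allocation, which is exactly the heap-buffer-overflow the release fixes.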
Removed#
The deprecated C++14 support, which is no longer supported by the rocPRIM dependency.
rocThrust (4.4.0)#
Resolved issues#
Fixed memory leak in unit test.
Fixed unit test compatibility with ASAN.
rocWMMA (2.2.1)#
Added#
Added the following community samples for external contributions, with build support and documentation:
`simple_gemm_silu`: demonstrates a GEMM + SiLU fused operator using the rocWMMA API.
`simple_gemm_fusion`: demonstrates block-tile-level dual-GEMM fusion using the rocWMMA API.
`simple_gemm_swiglu`: demonstrates a SwiGLU fused dual-GEMM kernel (LLaMA/Mistral FFN gate layer) using the rocWMMA API.
Changed#
Updated the `find_package` search for OpenMP to prefer the `openmp-config.cmake` provided by ROCm, with a fallback to module search mode.
Updated `INSTALL_RPATH` and added `BUILD_RPATH` for OpenMP.
Resolved issues#
Improved HIP RTC regression test portability when deployed outside the default path.
ROCm known issues#
ROCm known issues are noted on GitHub. These issues will be fixed in a future ROCm release. For known issues related to individual components, review the ROCm component changelogs.
ROCm Compute Profiler might fail when profiling bash script or command#
Running a bash script or command as a target for ROCm Compute Profiler might fail because bash overwrites the required environment variables. As a workaround, use the `--no-native-tool` option in profile mode. Note that this disables iteration multiplexing.
hipFFT and rocFFT callback examples fail to build on Windows#
The hipFFT and rocFFT callback examples in rocm-examples fail to build on a Windows operating system due to a linker error. CMake configuration and HIP object compilation complete successfully, but the final link step fails with `clang: error: invalid linker name in argument '-fuse-ld=lld-link'`. This issue affects all Windows configurations using Relocatable Device Code (RDC) mode. Linux builds are not affected. As a workaround, skip the hipFFT and rocFFT callback examples on Windows, and refer to the Linux builds or callback functionality documentation.
QMCPACK might become unresponsive during DMC simulation on AMD Instinct MI300A GPUs#
QMCPACK might become unresponsive when running Diffusion Monte Carlo (DMC) simulations with certain inputs on AMD Instinct MI300A GPUs. The application stops making progress after initialization and must be terminated manually.
Resource-intensive workloads might result in GPU memory faults#
Applications that pass large, complex data structures between device functions using scratch memory, and particularly rely on compiler optimization to minimize the number of copy operations, might encounter GPU memory access faults and become unresponsive.
Increased binary size for multi-target GPU builds#
Applications targeting multiple AMD GPU architectures might observe significantly larger binary sizes. Multi-target builds can produce binaries up to 54 percent larger. Single-target builds add approximately 8 MB of additional size per GPU target. As a workaround, reduce the number of GPU targets in multi-target builds, or strip the resource-usage symbols from release binaries.
HIP cooperative groups might fail when compiled using the SPIR-V path#
HIP applications that use cooperative groups might fail at kernel launch when compiled with `--offload-arch=amdgcnspirv`. The application fails at runtime with the `LLVM ERROR: Cannot select: intrinsic %llvm.amdgcn.s.wait.asynccnt` error message. This affects all GPU architectures when using the SPIR-V compilation path. As a workaround, compile using a direct GPU architecture target (for example, `--offload-arch=gfx942`) instead of `--offload-arch=amdgcnspirv`.
Illegal memory address error when using placement new with device function returns#
HIP kernels that use placement new operators to construct objects in hipMalloc device memory might crash with a `hipErrorIllegalAddress` error when you pass a `__device__` function return value as the constructor argument. This only affects non-trivially-copyable types (for example, types with user-defined or deleted copy/move constructors). Trivially-copyable types are not affected. As a workaround, assign the device function return value to a local variable before passing it to placement new.
LLVM-based compilers might fail when compiling half-precision vector operations#
LLVM-based compilers might fail with a Failed to find subregs! error in SIInstrInfo::copyPhysReg when compiling half-precision vector operations with optimization enabled. The issue has been observed at optimization levels -O1 through -O3.
hipBLAS test suites might return non-zero exit codes on Windows#
When using hipBLAS on Windows, the test suites might return non-zero exit codes, even when all mathematical correctness tests pass. This issue can affect CI/CD pipeline validation and block automated testing workflows on Windows systems, because the test framework might fail to detect successful test completion.
ROCm Systems Profiler overwrites ROCPD output after process re-attachment#
When you use rocprof-sys-attach to re-attach to a previously profiled process, the ROCPD output database files (.db) are written to the initial session’s output directory instead of a new timestamped directory. This makes it difficult to distinguish profiling data between sessions. Perfetto trace files are not affected. As a workaround, back up your output directory before re-attaching to a previously profiled process.
Missing dependencies when installing ROCm Core SDK#
Installing the ROCm Core SDK using amdrocm-core-sdk or amdrocm-core-dev/devel might succeed, but some dependencies from the dev/devel meta packages might not be installed. As a workaround, install the dev packages manually:
sudo apt install amdrocm-*
ROCm resolved issues#
The following notable issues have been fixed in ROCm 7.13.0.
Multi-ROCm installation failed on RPM-based distributions#
Previously, installing multiple ROCm versions side by side on RPM-based distributions (RHEL and SLES) failed due to .build-id file conflicts between versioned packages.
vLLM server failed to launch in ROCm Docker images#
Previously, the vLLM server failed to start in ROCm 7.12.0 Docker images with an ImportError for librocm_smi64.so.1 due to missing library path configuration.
vLLM server failed to launch with tensor parallelism#
Previously, the vLLM server failed to start with an invalid device pointer error when launching models with tensor parallelism set to 8 on AMD Instinct MI300 and MI355X GPUs.
PyTorch DDP Gloo backend test failed on AMD GPUs#
Previously, the PyTorch Distributed Data Parallel (DDP) test test_ddp_apply_optim_in_backward_grad_as_bucket_view_false failed when using the Gloo backend.
rocWMMA header produced unknown type errors in HIP RTC#
Previously, HIP RTC programs that included the rocwmma/rocwmma.hpp header failed to compile with unknown type name errors.
ROCm upcoming changes#
Future releases will add support for:
Additional ROCm Core SDK components
Domain-specific expansion toolkits (data science, life science, finance, simulation, and other HPC domains)
More AMD hardware support