ROCm 6.2.0 release notes


Applies to Linux and Windows

2024-08-02


The release notes provide a comprehensive summary of changes since the previous ROCm release.

The Compatibility matrix provides an overview of operating system, hardware, ecosystem, and ROCm component support across ROCm releases.

Release notes for previous ROCm releases are available in earlier versions of the documentation. See the ROCm documentation release history.

Release highlights#

This section introduces notable new features and improvements in ROCm 6.2. See the Detailed component changes for individual component changes.

New components#

ROCm 6.2.0 introduces the following new components to the ROCm software stack.

  • Omniperf – A kernel-level profiling tool for machine learning and high-performance computing (HPC) workloads running on AMD Instinct accelerators. Omniperf offers comprehensive profiling and advanced analysis via command line or a GUI dashboard. For more information, see Omniperf.

  • Omnitrace – A multi-purpose analysis tool for profiling and tracing applications running on the CPU or the CPU and GPU. It supports dynamic binary instrumentation, call-stack sampling, causal profiling, and other features for determining which function and line number are executing. For more information, see Omnitrace.

  • rocPyDecode – A tool to access rocDecode APIs in Python. It connects Python and C/C++ libraries, enabling function calling and data passing between the two languages. The rocpydecode.so library, a wrapper, uses rocDecode APIs written primarily in C/C++ within Python. For more information, see rocPyDecode.

  • ROCprofiler-SDK – A profiling and tracing library for HIP and ROCm applications, used to identify application performance bottlenecks and optimize performance. The new APIs add restrictions for more efficient implementations and improved thread safety. A new window restriction specifies the services the tool can use. ROCprofiler-SDK also provides a tool library to help you write your own tool implementations. rocprofv3 uses this tool library to profile and trace applications for performance bottlenecks, for example through API tracing and kernel tracing. For more information, see ROCprofiler-SDK.

    Note

    ROCprofiler-SDK for ROCm 6.2.0 is a beta release and subject to change.

ROCm Offline Installer Creator introduced#

The new ROCm Offline Installer Creator creates an installation package for a preconfigured setup of ROCm, the AMDGPU driver, or a combination of the two on a target system without network access. This new tool customizes multiple unique configurations for use when installing ROCm on a target. Other notable features include:

  • A lightweight, easy-to-use user interface for configuring the creation of the installer

  • Support for multiple Linux distributions

  • Installer support for different ROCm releases and specific ROCm components

  • Optional driver or driver-only installer creation

  • Optional post-install preferences

  • Lightweight installer packages, which are unique to the preconfigured ROCm setup

  • Resolution and inclusion of dependency packages for offline installation

For more information, see ROCm Offline Installer Creator.

Math libraries default to Clang instead of HIPCC#

The default compiler used to build the math libraries on Linux changes from hipcc to amdclang++. Appropriate compiler flags are added to ensure these compilations build correctly. This change only applies when building the libraries. Applications using the libraries can continue to be compiled using hipcc or amdclang++ as described in ROCm compiler reference. The math libraries can also be built with hipcc using any of the previously available methods (for example, the CXX environment variable, the CMAKE_CXX_COMPILER CMake variable, and so on). This change shouldn’t affect performance or functionality.

Framework and library changes#

This section highlights updates to supported deep learning frameworks and notable third-party library optimizations.

Additional PyTorch and TensorFlow support#

ROCm 6.2.0 supports PyTorch versions 2.2 and 2.3 and TensorFlow version 2.16.

See Installing PyTorch for ROCm and Installing TensorFlow for ROCm for installation instructions.

Refer to the Third-party support matrix for a comprehensive list of third-party frameworks and libraries supported by ROCm.

Optimized framework support for OpenXLA#

PyTorch for ROCm and TensorFlow for ROCm now provide native support for OpenXLA. OpenXLA is an open-source ML compiler ecosystem that enables developers to compile and optimize models from all leading ML frameworks. For more information, see Installing PyTorch for ROCm and Installing TensorFlow for ROCm.

PyTorch support for Autocast (automatic mixed precision)#

PyTorch now supports Autocast for recurrent neural networks (RNNs) on ROCm. This can help to reduce computational workloads and improve performance. Based on the information about the magnitude of values, Autocast can substitute the original float32 linear layers and convolutions with their float16 or bfloat16 variants. For more information, see Automatic mixed precision.

Memory savings for bitsandbytes model quantization#

The ROCm-aware bitsandbytes library is a lightweight Python wrapper around HIP custom functions, in particular 8-bit optimizer, matrix multiplication, and 8-bit and 4-bit quantization functions. ROCm 6.2.0 introduces the following bitsandbytes changes:

  • Int8 matrix multiplication is enabled, and it includes the following functions:

    • extract-outliers – extracts rows and columns that have outliers in the inputs. They’re later used for matrix multiplication without quantization.

    • transform – row-to-column and column-to-row transformations are enabled, along with transpose operations. These are used before and after matmul computation.

    • igemmlt – new function for GEMM computation A*B^T. It uses hipblasLtMatMul and performs 8-bit GEMM operations.

    • dequant_mm – dequantizes output matrix to original data type using scaling factors from vector-wise quantization.

  • Blockwise quantization – input tensors are quantized for a fixed block size.

  • 4-bit quantization and dequantization functions – normalized Float4 quantization, quantile estimation, and quantile quantization functions are enabled.

  • 8-bit and 32-bit optimizers are enabled.

Note

These functions are included in bitsandbytes. They are not part of ROCm. However, ROCm 6.2.0 has enabled the fixes and features to run them.

For more information, see Model quantization techniques.
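As a rough illustration of the Int8 matrix multiplication path listed above, the following pure-Python sketch quantizes each row with an absmax scale (vector-wise quantization), multiplies in integers as A*B^T, and dequantizes the result with the two scale vectors. The function names are hypothetical, outlier extraction is omitted, and this is not the bitsandbytes or hipBLASLt implementation.

```python
def quantize_rows(matrix):
    """Vector-wise (per-row) absmax quantization into the range [-127, 127]."""
    quantized, scales = [], []
    for row in matrix:
        absmax = max(abs(v) for v in row) or 1.0  # avoid dividing by zero
        scales.append(absmax)
        quantized.append([round(v / absmax * 127) for v in row])
    return quantized, scales


def int8_matmul_abt(qa, qb):
    """Integer GEMM computing A * B^T (rows of A dotted with rows of B)."""
    return [[sum(x * y for x, y in zip(row_a, row_b)) for row_b in qb]
            for row_a in qa]


def dequantize(acc, scales_a, scales_b):
    """Map integer accumulators back to floats using both scale vectors."""
    return [[acc[i][j] * scales_a[i] * scales_b[j] / (127 * 127)
             for j in range(len(scales_b))]
            for i in range(len(scales_a))]


# Hypothetical inputs, chosen only for illustration.
A = [[1.0, -2.0], [0.5, 0.25]]
B = [[2.0, 1.0], [-1.0, 3.0]]

qa, sa = quantize_rows(A)
qb, sb = quantize_rows(B)
approx = dequantize(int8_matmul_abt(qa, qb), sa, sb)
exact = [[sum(x * y for x, y in zip(ra, rb)) for rb in B] for ra in A]
```

Comparing approx with exact shows the rounding error that the quantization introduces, which is the trade-off against the reduced memory footprint.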

Improved vLLM support#

ROCm 6.2.0 enhances vLLM support for inference on AMD Instinct accelerators, adding capabilities for FP16/BF16 precision for LLMs, and FP8 support for Llama. ROCm 6.2.0 adds support for the following vLLM features:

  • MP: Multi-GPU execution. Choose between MP and Ray using a flag. To select MP, use --distributed-executor-backend=mp. The default backend is still in flux and depends on the commit.

  • FP8 KV cache: Enhances computational efficiency and performance by significantly reducing memory usage and bandwidth requirements. The QUARK quantizer currently only supports Llama.

  • Triton Flash Attention:

    ROCm supports both Triton and Composable Kernel Flash Attention 2 in vLLM. The default is Triton, but you can change this setting using the VLLM_USE_FLASH_ATTN_TRITON=False environment variable.

  • PyTorch TunableOp:

    Improved optimization and tuning of GEMMs. It requires Docker with PyTorch 2.3 or later.

For more information about enabling these features, see vLLM inference.
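The settings named in the list above can be combined when launching vLLM. The following is a minimal sketch, not an official launch script: the entry-point module and the --model and --tensor-parallel-size options are taken from upstream vLLM and are assumptions here, upstream vLLM spells the executor flag --distributed-executor-backend, and the model name is a placeholder.

```python
import os


def build_vllm_launch(model, num_gpus, use_triton_flash_attn=True):
    """Return (argv, env) for a vLLM server using the MP executor backend."""
    argv = [
        "python", "-m", "vllm.entrypoints.openai.api_server",  # assumed entry point
        "--model", model,
        "--tensor-parallel-size", str(num_gpus),
        "--distributed-executor-backend=mp",  # MP instead of Ray
    ]
    env = dict(os.environ)
    if not use_triton_flash_attn:
        # Environment variable as named in these release notes; switches
        # from Triton to Composable Kernel Flash Attention 2.
        env["VLLM_USE_FLASH_ATTN_TRITON"] = "False"
    return argv, env


# "example-org/example-model" is a hypothetical placeholder.
argv, env = build_vllm_launch("example-org/example-model", num_gpus=4,
                              use_triton_flash_attn=False)
# To actually start the server: subprocess.run(argv, env=env, check=True)
```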

ROCm has a vLLM branch for experimental features. This includes performance improvements, accuracy, and correctness testing. These features include:

  • FP8 GEMMs: To improve the performance of FP8 quantization, work is underway on tuning the GEMMs using the shapes used in the model’s execution. This is limited to Llama because the QUARK quantizer currently only supports Llama.

  • Custom decode paged attention: Improves performance by efficiently managing memory and enabling faster attention computation in large-scale models. This benefits all workloads in FP16 configurations.

To enable these experimental new features, see vLLM inference. Use the rocm/vllm branch when cloning the GitHub repo. The vllm/ROCm_performance.md document outlines all the accessible features, and the vllm/Dockerfile.rocm file can be used.

Enhanced performance tuning on AMD Instinct accelerators#

ROCm is pre-tuned for high-performance computing workloads including large language models, generative AI, and scientific computing. The ROCm documentation provides comprehensive guidance on configuring your system for AMD Instinct accelerators. It includes detailed instructions on system settings and application tuning suggestions to help you fully leverage the capabilities of these accelerators for optimal performance. For more information, see AMD MI300X tuning guides and AMD MI300A system optimization.

Removed clang-ocl#

As of version 6.2, ROCm no longer provides the clang-ocl package. See the clang-ocl README.

ROCm documentation changes#

The documentation for the ROCm components has been reorganized and reformatted in a standard look and feel. This improves the usability and readability of the documentation. For more information about the ROCm components, see What is ROCm?.

Since the release of ROCm 6.1, the documentation has added some key topics including:

The following topics have been significantly improved, expanded, or both:

Note

All ROCm projects are open source and available on GitHub. To contribute to ROCm documentation, see the ROCm documentation contribution guidelines.

Operating system and hardware support changes#

ROCm 6.2.0 adds support for the following operating system and kernel versions.

  • Ubuntu 24.04 LTS (kernel: 6.8 [GA])

  • RHEL 8.10 (kernel: 4.18.0-544)

  • SLES 15 SP6 (kernel: 6.4)

ROCm 6.2.0 marks the end of support (EoS) for:

  • Ubuntu 22.04.3

  • RHEL 9.2

  • RHEL 8.8

  • SLES 15 SP4

  • CentOS 7.9

ROCm 6.2.0 has been tested against pre-release Ubuntu 22.04.5 (kernel: 6.5 [HWE]).

See the Compatibility matrix for an overview of supported operating systems and hardware architectures.

ROCm components#

The following table lists ROCm components and their individual versions for ROCm 6.2.0. Follow the links in the Version column to go to the detailed component changelogs.

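Category   Group                                 Name                     Version
Libraries  Machine learning and computer vision  Composable Kernel        1.1.0 ⇒ 1.1.0
Libraries  Machine learning and computer vision  MIGraphX                 2.9 ⇒ 2.10
Libraries  Machine learning and computer vision  MIOpen                   3.1.0 ⇒ 3.2.0
Libraries  Machine learning and computer vision  MIVisionX                2.5.0 ⇒ 3.0.0
Libraries  Machine learning and computer vision  rocAL                    1.0.0 ⇒ 1.0.0
Libraries  Machine learning and computer vision  rocDecode                0.6.0 ⇒ 0.6.0
Libraries  Machine learning and computer vision  rocPyDecode              0.1.0
Libraries  Machine learning and computer vision  RPP                      1.5.0 ⇒ 1.8.0
Libraries  Communication                         RCCL                     2.18.6 ⇒ 2.20.5
Libraries  Math                                  hipBLAS                  2.1.0 ⇒ 2.2.0
Libraries  Math                                  hipBLASLt                0.7.0 ⇒ 0.8.0
Libraries  Math                                  hipFFT                   1.0.14 ⇒ 1.0.15
Libraries  Math                                  hipfort                  0.4.0
Libraries  Math                                  hipRAND                  2.10.17 ⇒ 2.11.0
Libraries  Math                                  hipSOLVER                2.1.1 ⇒ 2.2.0
Libraries  Math                                  hipSPARSE                3.0.1 ⇒ 3.1.1
Libraries  Math                                  hipSPARSELt              0.2.0 ⇒ 0.2.1
Libraries  Math                                  rocALUTION               3.1.1 ⇒ 3.2.0
Libraries  Math                                  rocBLAS                  4.1.2 ⇒ 4.2.0
Libraries  Math                                  rocFFT                   1.0.27 ⇒ 1.0.28
Libraries  Math                                  rocRAND                  3.0.1 ⇒ 3.1.0
Libraries  Math                                  rocSOLVER                3.25.0 ⇒ 3.26.0
Libraries  Math                                  rocSPARSE                3.1.2 ⇒ 3.2.0
Libraries  Math                                  rocWMMA                  1.4.0 ⇒ 1.5.0
Libraries  Math                                  Tensile                  4.40.0 ⇒ 4.41.0
Libraries  Primitives                            hipCUB                   3.1.0 ⇒ 3.2.0
Libraries  Primitives                            hipTensor                1.2.0 ⇒ 1.3.0
Libraries  Primitives                            rocPRIM                  3.1.0 ⇒ 3.2.0
Libraries  Primitives                            rocThrust                3.0.0 ⇒ 3.1.0
Tools      System management                     AMD SMI                  24.5.2 ⇒ 24.6.2
Tools      System management                     rocminfo                 1.0.0
Tools      System management                     ROCm Data Center Tool    0.3.0 ⇒ 1.0.0
Tools      System management                     ROCm SMI                 7.2.0 ⇒ 7.3.0
Tools      System management                     ROCm Validation Suite    1.0.0 ⇒ 1.0.0
Tools      Performance                           Omniperf                 2.0.1
Tools      Performance                           Omnitrace                1.11.2
Tools      Performance                           ROCm Bandwidth Test      1.4.0
Tools      Performance                           ROCProfiler              2.0.0 ⇒ 2.0.0
Tools      Performance                           ROCprofiler-SDK          0.4.0
Tools      Performance                           ROCTracer                4.1.0
Tools      Development                           HIPIFY                   17.0.0 ⇒ 18.0.0
Tools      Development                           ROCdbgapi                0.71.0 ⇒ 0.76.0
Tools      Development                           ROCm CMake               0.12.0 ⇒ 0.13.0
Tools      Development                           ROCm Debugger (ROCgdb)   14.1 ⇒ 14.2
Tools      Development                           ROCr Debug Agent         2.0.3
Compilers  -                                     HIPCC                    1.0.0 ⇒ 1.1.1
Compilers  -                                     llvm-project             17.0.0 ⇒ 18.0.0
Runtimes   -                                     HIP                      6.1 ⇒ 6.2.0
Runtimes   -                                     ROCr Runtime             1.13.0 ⇒ 1.14.0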

Detailed component changes#

The following sections describe key changes to ROCm components.

AMD SMI (24.6.2)#

Changes#

  • Added the following functionality:

    • amd-smi dmon is now available as an alias to amd-smi monitor.

    • An optional process table under amd-smi monitor -q.

    • Handling to detect VMs with passthrough configurations in CLI tool.

    • Process Isolation and Clear SRAM functionality to the CLI tool for VMs.

    • Added Ring Hang event.

  • Added macros that were in amdsmi.h to the AMD SMI Python library amdsmi_interface.py.

  • Renamed amdsmi_set_gpu_clear_sram_data() to amdsmi_clean_gpu_local_data().

Removals#

  • Removed throttle-status from amd-smi monitor as it is no longer reliably supported.

  • Removed elevated permission requirements for amdsmi_get_gpu_process_list().

Optimizations#

  • Updated CLI error strings to specify invalid device type queried.

  • Multiple structure updates in amdsmi.h and amdsmi_interface.py to align with host/guest.

    • Added amdsmi.h and amdsmi_interface.py.

    • amdsmi_clk_info_t struct

    • Added AMDSMI prefix to multiple structures.

  • Updated dpm_policy references to soc_pstate.

  • Updated amdsmi_get_gpu_board_info() so that product_name falls back to the pciids file.

  • Updated amdsmi_get_gpu_board_info() to use larger structure sizes for amdsmi_board_info_t.

  • Updated CLI voltage curve command output.

Resolved issues#

  • Fixed multiple processes not being registered in amd-smi process with JSON and CSV format.

  • amdsmi_get_gpu_board_info() no longer returns junk character strings.

  • Fixed parsing of pp_od_clk_voltage within amdsmi_get_gpu_od_volt_info.

  • Fixed a leftover mutex deadlock when running multiple instances of the CLI tool. Previously, running amd-smi reset --gpureset --gpu all and then an instance of amd-smi static (or any other subcommand that accesses the GPUs) caused a mutex to lock and never return, which required either clearing the mutex in /dev/shm or rebooting the machine.

Known issues#

  • amdsmi_get_gpu_process_isolation and amdsmi_clean_gpu_local_data commands do not work. They will be supported in a future release.

See issue #3500 on GitHub.

Note

See the detailed AMD SMI changelog on GitHub for more information.

Composable Kernel (1.1.0)#

Changes#

  • Added support for:

    • Permute scale for any dimension (#1198).

    • Combined elementwise op (#1217).

    • Multi D in grouped convolution backward weight (#1280).

    • K or C equal to 1 for fp16 in grouped convolution backward weight (#1280).

    • Large batch in grouped convolution forward (#1332).

  • Added CK_TILE layernorm example (#1339).

  • CK_TILE-based Flash Attention 2 kernel is now merged into the upstream repository as ROCm backend.

Optimizations#

  • Support universal GEMM in grouped convolution forward (#1320).

  • Optimizations for low M and N in grouped convolution backward weight (#1303).

  • Added a functional enhancement and compiler bug fix for FlashAttention Forward Kernel.

  • FP8 GEMM performance optimization and tuning (#1384).

  • Added FlashAttention backward pass performance optimization (#1397).

HIP (6.2.0)#

Changes#

  • Added the _sync() versions of crosslane builtins such as __shfl_sync(), __all_sync(), and __any_sync(). These take a 64-bit integer as an explicit mask argument.

    • In HIP 6.2, these are hidden behind the preprocessor macro HIP_ENABLE_WARP_SYNC_BUILTINS, and will be enabled unconditionally in a future HIP release.

  • Added new HIP APIs:

    • hipGetProcAddress returns the pointer to driver function, corresponding to the defined driver function symbol.

    • hipGetFuncBySymbol returns the pointer to device entry function that matches entry function symbolPtr.

    • hipStreamBeginCaptureToGraph begins graph capture on a stream to an existing graph.

    • hipGraphInstantiateWithParams creates an executable graph from a graph.

  • Added a new integrated flag to the device properties.

    • The integrated flag is added to the hipDeviceProp_t struct. On an integrated APU system, the runtime driver detects the device and sets this flag to 1. In that case, hipDeviceGetAttribute returns the value 1 for the hipDeviceAttributeIntegrated attribute (from enum hipDeviceAttribute_t).

  • Added initial support for 8-bit floating-point data types in amd_hip_fp8.h. They are accessible via #include <hip/hip_fp8.h>.

  • Added UUID support for environment variable HIP_VISIBLE_DEVICES.
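The explicit mask argument of the _sync() builtins in the first item of this list can be pictured with a small Python model. This is only a sketch of the mask semantics for __all_sync() and __any_sync(), assuming one mask bit per lane of a 64-lane wavefront. It is not HIP code, and the real builtins also synchronize the participating lanes and return an int.

```python
WAVEFRONT_SIZE = 64  # the notes describe a 64-bit mask; one bit per lane is assumed


def active_lanes(mask):
    """Return the lane indices whose bit is set in the 64-bit mask."""
    if not 0 <= mask < (1 << WAVEFRONT_SIZE):
        raise ValueError("mask must fit in 64 bits")
    return [lane for lane in range(WAVEFRONT_SIZE) if (mask >> lane) & 1]


def all_sync(mask, predicates):
    """Model of __all_sync(): true if every lane in the mask has a true predicate."""
    return all(predicates[lane] for lane in active_lanes(mask))


def any_sync(mask, predicates):
    """Model of __any_sync(): true if any lane in the mask has a true predicate."""
    return any(predicates[lane] for lane in active_lanes(mask))


# Hypothetical per-lane predicate: true on even lanes only.
predicates = [lane % 2 == 0 for lane in range(WAVEFRONT_SIZE)]
even_mask = sum(1 << lane for lane in range(0, WAVEFRONT_SIZE, 2))
full_mask = (1 << WAVEFRONT_SIZE) - 1
odd_mask = full_mask ^ even_mask
```

Restricting the mask to the even lanes changes the result of the vote without changing any per-lane predicate, which is the point of passing the mask explicitly.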

Resolved issues#

  • Fixed stream capture support in HIP graphs. Prohibited and unhandled operations are fixed during stream capture in the HIP runtime.

  • Fixed undefined symbol error for hipTexRefGetArray and hipTexRefGetBorderColor.

Upcoming changes#

  • The _sync() versions of crosslane builtins such as __shfl_sync(), __all_sync(), and __any_sync() will be enabled unconditionally in a future HIP release.

hipBLAS (2.2.0)#

Changes#

  • Added a new ILP64 API for level 2 functions for both C and FORTRAN (_64 name suffix) with int64_t function arguments.

  • Added a new ILP64 API for level 1 _ex functions.

  • The install.sh script now invokes the rmake.py script. Made other various improvements to the build scripts.

  • Changed library dependencies in the install.sh script from rocblas and rocsolver to the development packages rocblas-dev and rocsolver-dev.

  • Updated Linux AOCL dependency to release 4.2 gcc build.

  • Updated Windows vcpkg dependencies to release 2024.02.14.
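The ILP64 entries at the top of this list matter when an integer argument no longer fits in a 32-bit int. The sketch below shows one way a caller might decide between the regular and the _64 entry point. The helper names are hypothetical, and hipblasSgemv_64 is formed here only by applying the _64 suffix rule stated above to a level 2 function.

```python
INT32_MIN, INT32_MAX = -(2**31), 2**31 - 1  # range of a 32-bit signed int argument


def needs_ilp64(*values):
    """Return True if any size, stride, or increment overflows a 32-bit int."""
    return any(not INT32_MIN <= v <= INT32_MAX for v in values)


def pick_entry_point(base_name, *values):
    """Append the _64 suffix when the int64_t variant is required."""
    return f"{base_name}_64" if needs_ilp64(*values) else base_name


# Hypothetical matrix shapes (rows, columns) for a matrix-vector product.
small = pick_entry_point("hipblasSgemv", 4096, 4096)
large = pick_entry_point("hipblasSgemv", 3_000_000_000, 2)
```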

hipBLASLt (0.8.0)#

Changes#

  • Added extension APIs: hipblasltExtAMaxWithScale.

    • GemmTuning extension parameter to set wgm by user.

  • Added support for:

    • HIPBLASLT_MATMUL_DESC_AMAX_D_POINTER for FP8/BF8 datatype.

    • FP8/BF8 input, FP32/FP16/BF16/FP8/BF8 output (gfx94x platform only).

    • HIPBLASLT_MATMUL_DESC_COMPUTE_INPUT_TYPE_A_EXT and HIPBLASLT_MATMUL_DESC_COMPUTE_INPUT_TYPE_B_EXT for FP16 input data type to use FP8/BF8 MFMA.

  • Added support for gfx110x.

Optimizations#

  • Improved library loading time.

HIPCC (1.1.1)#

Changes#

  • Split hipcc package into two packages for different hardware platforms.

  • Cleaned up references to environment variables.

  • Enabled hipcc and hipconfig binaries (hipcc.bin, hipconfig.bin) by default, instead of their Perl counterparts.

  • Enabled function calls.

  • Added support for generating packages for ROCm stack targeting static libraries.

Resolved issues#

  • Implemented numerous bug fixes and quality improvements.

hipCUB (3.2.0)#

Changes#

  • Added DeviceCopy function for parity with CUB.

  • Added enum WarpExchangeAlgorithm to the rocPRIM backend, which is used as the new optional template argument for WarpExchange.

    • The potential values for the enum are WARP_EXCHANGE_SMEM and WARP_EXCHANGE_SHUFFLE.

    • WARP_EXCHANGE_SMEM stands for the previous algorithm, while WARP_EXCHANGE_SHUFFLE performs the exchange via shuffle operations.

    • WARP_EXCHANGE_SHUFFLE does not require any pre-allocated shared memory, but the ItemsPerThread must be a divisor of WarpSize.

  • Added tuple.hpp which defines templates hipcub::tuple, hipcub::tuple_element, hipcub::tuple_element_t and hipcub::tuple_size.

  • Added new overloaded member functions to BlockRadixSort and DeviceRadixSort that expose a decomposer argument. Keys of a custom type (key_type) can be sorted via these overloads, if an appropriate decomposer is passed. The decomposer has to implement operator(const key_type&) which returns a hipcub::tuple of references pointing to members of key_type.

  • On AMD GPUs (using the HIP backend), you can now issue hipCUB API calls inside of HIP graphs, with several exceptions:

    • CachingDeviceAllocator

    • GridBarrierLifetime

    • DeviceSegmentedRadixSort

    • DeviceRunLengthEncode

    Currently, these classes rely on one or more synchronous calls to function correctly. Because of this, they cannot be used inside of HIP graphs.
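The WarpExchange constraint described earlier in this list (WARP_EXCHANGE_SHUFFLE requires ItemsPerThread to be a divisor of WarpSize) can be checked before choosing the template argument. The following sketch uses hypothetical helper names, and preferring the shuffle variant whenever it is allowed is an illustrative policy, not hipCUB behavior.

```python
def shuffle_exchange_allowed(items_per_thread, warp_size):
    """WARP_EXCHANGE_SHUFFLE needs ItemsPerThread to divide WarpSize evenly."""
    return items_per_thread > 0 and warp_size % items_per_thread == 0


def choose_warp_exchange(items_per_thread, warp_size):
    """Pick the shuffle variant when permitted, else the shared-memory one."""
    if shuffle_exchange_allowed(items_per_thread, warp_size):
        return "WARP_EXCHANGE_SHUFFLE"  # no pre-allocated shared memory needed
    return "WARP_EXCHANGE_SMEM"         # the previous algorithm


choice_a = choose_warp_exchange(items_per_thread=4, warp_size=64)
choice_b = choose_warp_exchange(items_per_thread=3, warp_size=64)
```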

Removals#

  • debug_synchronous was deprecated in hipCUB-2.13.2 and no longer has any effect. With this release, passing debug_synchronous to the device functions results in a deprecation warning both at runtime and at compile time.

    • The synchronization that was previously achievable by passing debug_synchronous=true can now be achieved at compile time by setting the CUB_DEBUG_SYNC (or higher debug level) or the HIPCUB_DEBUG_SYNC preprocessor definition.

    • The compile time deprecation warnings can be disabled by defining the HIPCUB_IGNORE_DEPRECATED_API preprocessor definition.

Resolved issues#

  • Fixed the derivation for the accumulator type for device scan algorithms in the rocPRIM backend being different compared to CUB. It now derives the accumulator type as the result of the binary operator.

hipFFT (1.0.15)#

Resolved issues#

  • Added hip::host as a public link library, as hipfft.h includes HIP runtime headers.

  • Prevented C++ exceptions leaking from public API functions.

  • Made output of hipfftXt match cufftXt in geometry and alignment for 2D and 3D FFTs.

HIPIFY (18.0.0)#

Changes#

  • Added support for:

    • NVIDIA CUDA 12.4.1

    • cuDNN 9.1.1

    • LLVM 18.1.6

  • Added full hipBLASLt support.

Resolved issues#

  • HIPIFY now applies reinterpret_cast for an explicit conversion between pointer-to-function and pointer-to-object; affected functions: hipFuncGetAttributes, hipFuncSetAttribute, hipFuncSetCacheConfig, hipFuncSetSharedMemConfig, hipLaunchKernel, and hipLaunchCooperativeKernel.

hipRAND (2.11.0)#

Changes#

  • Added support for setting generator output ordering in C and C++ API.

  • hiprandCreateGeneratorHost dispatches to the host generator in the rocRAND backend instead of returning HIPRAND_STATUS_NOT_IMPLEMENTED.

  • Added options to create:

    • A host generator to the Fortran wrapper.

    • A host generator to the Python wrapper.

  • For internal testing with HMM, previous versions used the environment variable ROCRAND_USE_HMM. The environment variable is now named HIPRAND_USE_HMM.

  • Static library – moved all internal symbols to namespaces to avoid potential symbol name collisions when linking.

  • Device API documentation is improved in this version.

Removals#

  • Removed the option to build hipRAND as a submodule to rocRAND.

  • Removed references to, and workarounds for, the deprecated hcc.

  • Removed support for finding rocRAND based on the environment variable ROCRAND_DIR. Use ROCRAND_PATH instead.

Resolved issues#

  • Fixed a build error when using Clang++ directly due to unsupported references to amdgpu-target.

hipSOLVER (2.2.0)#

Changes#

  • Added compatibility-only functions:

    • auxiliary

      • hipsolverDnCreateParams, hipsolverDnDestroyParams, hipsolverDnSetAdvOptions

    • getrf

      • hipsolverDnXgetrf_bufferSize

      • hipsolverDnXgetrf

    • getrs

      • hipsolverDnXgetrs

  • Added support for building on Ubuntu 24.04 and CBL-Mariner.

  • Added hip::host to roc::hipsolver usage requirements.

  • Added functions

    • syevdx/heevdx

      • hipsolverSsyevdx_bufferSize, hipsolverDsyevdx_bufferSize, hipsolverCheevdx_bufferSize, hipsolverZheevdx_bufferSize

      • hipsolverSsyevdx, hipsolverDsyevdx, hipsolverCheevdx, hipsolverZheevdx

    • sygvdx/hegvdx

      • hipsolverSsygvdx_bufferSize, hipsolverDsygvdx_bufferSize, hipsolverChegvdx_bufferSize, hipsolverZhegvdx_bufferSize

      • hipsolverSsygvdx, hipsolverDsygvdx, hipsolverChegvdx, hipsolverZhegvdx

  • Updated csrlsvchol to perform numerical factorization on the GPU. The symbolic factorization is still performed on the CPU.

  • Renamed hipsolver-compat.h to hipsolver-dense.h.

Removals#

  • Removed dependency on cblas from the hipSOLVER test and benchmark clients.

hipSPARSE (3.1.1)#

Changes#

  • Added the missing hipsparseCscGet() routine.

  • All internal hipSPARSE functions now exist inside a namespace.

  • Match deprecations found in cuSPARSE 12.x.x when using cuSPARSE backend.

  • Improved the user manual and contribution guidelines.

Resolved issues#

  • Fixed SpGEMM and SpGEMM_reuse routines that were not matching cuSPARSE behavior.

Known issues#

  • In hipsparseSpSM_solve(), the external buffer is currently passed as a parameter. This does not match the cuSPARSE API, and this extra external buffer parameter will be removed in a future release. For now, this extra parameter can be ignored and passed as nullptr because hipsparseSpSM_solve() does not use it internally.

hipSPARSELt (0.2.1)#

Optimizations#

  • Refined test cases.

hipTensor (1.3.0)#

Changes#

  • Added support for:

    • Tensor permutation of ranks of 2, 3, 4, 5, and 6

    • Tensor contraction of M6N6K6: M, N, K up to rank 6

  • Added tests for:

    • Tensor permutation of ranks of 2, 3, 4, 5, and 6

    • Tensor contraction of M6N6K6: M, N, K up to rank 6

    • YAML parsing to support sequential parameters ordering.

  • Prefer amd-llvm-devel package before system LLVM library.

  • Preferred compilers changed to CC=amdclang CXX=amdclang++.

  • Updated actor-critic selection for new contraction kernel additions.

  • Updated installation, programmer’s guide, and API reference documentation.

Resolved issues#

  • Fixed LLVM parsing crash.

  • Fixed memory consumption issue in complex kernels.

  • Workaround implemented for compiler crash during debug build.

  • Allow random modes ordering for tensor contractions.

llvm-project (18.0.0)#

Changes#

  • LLVM IR

    • The llvm.stacksave and llvm.stackrestore intrinsics now use an overloaded pointer type to support non-0 address spaces.

    • Added llvm.exp10 intrinsic.

  • LLVM infrastructure

    • The minimum Clang version to build LLVM in C++20 configuration is now clang-17.0.6.

  • TableGen

    • Added constructs for debugging TableGen files:

      • dump keyword to dump messages to standard error. See #68793.

      • !repr bang operator to inspect the content of values. See #68716.

  • AArch64 backend

    • Added support for Cortex-A520, Cortex-A720 and Cortex-X4 CPUs.

  • AMDGPU backend

    • llvm.sqrt.f32 is now lowered correctly. Use llvm.amdgcn.sqrt.f32 for raw instruction access.

    • Implemented llvm.stacksave and llvm.stackrestore intrinsics.

    • Implemented llvm.get.rounding.

  • ARM backend

    • Added support for Cortex-M52 CPUs.

    • Added execute-only support for Armv6-M.

  • RISC-V backend

    • The Zfa extension version was upgraded to 1.0 and is no longer experimental.

    • Zihintntl extension version was upgraded to 1.0 and is no longer experimental.

    • Intrinsics were added for Zk*, Zbb, and Zbc. See Scalar Bit Manipulation Extension Intrinsics in the RISC-V C API specification.

    • Default ABI with F but without D was changed to ilp32f for RV32 and to lp64f for RV64.

    • The Zvbb, Zvbc, Zvkb, Zvkg, Zvkn, Zvknc, Zvkned, Zvkng, Zvknha, Zvknhb, Zvks, Zvksc, Zvksed, Zvksg, Zvksh, and Zvkt extension version was upgraded to 1.0 and is no longer experimental. However, the C intrinsics for these extensions are still experimental. To use the C intrinsics for these extensions, -menable-experimental-extensions needs to be passed to Clang.

    • -mcpu=sifive-p450 was added.

    • CodeGen of RV32E and RV64E is supported experimentally.

    • CodeGen of ilp32e and lp64e is supported experimentally.

  • X86 backend

    • Added support for the RDMSRLIST and WRMSRLIST instructions.

    • Added support for the WRMSRNS instruction.

    • Support ISA of AMX-FP16 which contains tdpfp16ps instruction.

    • Support ISA of CMPCCXADD.

    • Support ISA of AVX-IFMA.

    • Support ISA of AVX-VNNI-INT8.

    • Support ISA of AVX-NE-CONVERT.

    • -mcpu=raptorlake, -mcpu=meteorlake and -mcpu=emeraldrapids are now supported.

    • -mcpu=sierraforest, -mcpu=graniterapids and -mcpu=grandridge are now supported.

    • __builtin_unpredictable (unpredictable metadata in LLVM IR) is now handled by the X86 backend. The X86CmovConversion pass now respects this builtin and does not convert CMOVs to branches.

    • Add support for the PBNDKB instruction.

    • Support ISA of SHA512.

    • Support ISA of SM3.

    • Support ISA of SM4.

    • Support ISA of AVX-VNNI-INT16.

    • -mcpu=graniterapids-d is now supported.

    • The i128 type now matches GCC and clang’s __int128 type. This mainly benefits external projects such as Rust which aim to be binary compatible with C, but also fixes code generation where LLVM already assumed that the type matched and called into libgcc helper functions.

    • Support ISA of USER_MSR.

    • Support ISA of AVX10.1-256 and AVX10.1-512.

    • -mcpu=pantherlake and -mcpu=clearwaterforest are now supported.

    • -mapxf is supported.

    • Marking global variables with code_model = "small"/"large" in the IR now overrides the global code model to allow 32-bit relocations or require 64-bit relocations to the global variable.

    • The medium code model’s code generation was audited to be more similar to the small code model where possible.

  • C API

    • Added LLVMGetTailCallKind and LLVMSetTailCallKind to allow getting and setting tail, musttail, and notail attributes on call instructions.

    • Added LLVMCreateTargetMachineWithOptions, along with helper functions for an opaque option structure, as an alternative to LLVMCreateTargetMachine. The option structure exposes an additional setting (that is, the target ABI) and provides default values for unspecified settings.

    • Added LLVMGetNNeg and LLVMSetNNeg for getting and setting the new nneg flag on zext instructions, and LLVMGetIsDisjoint and LLVMSetIsDisjoint for getting and setting the new disjoint flag on or instructions.

    • Added the following functions for manipulating operand bundles, as well as building call and invoke instructions that use operand bundles:

      • LLVMBuildCallWithOperandBundles

      • LLVMBuildInvokeWithOperandBundles

      • LLVMCreateOperandBundle

      • LLVMDisposeOperandBundle

      • LLVMGetNumOperandBundles

      • LLVMGetOperandBundleAtIndex

      • LLVMGetNumOperandBundleArgs

      • LLVMGetOperandBundleArgAtIndex

      • LLVMGetOperandBundleTag

    • Added LLVMGetFastMathFlags and LLVMSetFastMathFlags for getting and setting the fast-math flags of an instruction, as well as LLVMCanValueUseFastMathFlags for checking if an instruction can use such flags.

  • CodeGen infrastructure

    • A new debug type isel-dump is added to show only the SelectionDAG dumps after each ISel phase (i.e. -debug-only=isel-dump). This new debug type can be filtered by function names using -filter-print-funcs=<function names>, the same flag used to filter IR dumps after each Pass. Note that the existing -debug-only=isel will take precedence over the new behavior and print SelectionDAG dumps of every single function regardless of -filter-print-funcs’s values.

  • Metadata info

    • Added a new loop metadata !{!"llvm.loop.align", i32 64}.

  • LLVM tools

    • llvm-symbolizer now treats invalid input as an address for which source information is not found.

    • llvm-readelf now supports --extra-sym-info (-X) to display extra information (section name) when showing symbols.

    • llvm-readobj --elf-output-style=JSON no longer prefixes each JSON object with the file name. Previously, each object file’s output looked like "main.o":{"FileSummary":{"File":"main.o"},...} but is now {"FileSummary":{"File":"main.o"},...}. This allows each JSON object to be parsed in the same way, since each object no longer has a unique key. Tools that consume llvm-readobj’s JSON output should update their parsers accordingly.

    • llvm-objdump now uses --print-imm-hex by default, which brings its default behavior closer in line with objdump.

    • llvm-nm now supports the --line-numbers (-l) option to use debugging information to print symbols’ filenames and line numbers.

    • llvm-symbolizer and llvm-addr2line now support addresses specified as symbol names.

    • llvm-objcopy now supports --gap-fill and --pad-to options, for ELF input and binary output files only.

  • LLDB

    • Added the SBType::FindDirectNestedType function. It lets formatters quickly find a directly nested type when it is known where to search for it, avoiding a more expensive global search via SBTarget::FindFirstType.

    • Renamed lldb-vscode to lldb-dap and updated its installation instructions to reflect this. The underlying functionality remains unchanged.

    • The mte_ctrl register can now be read from AArch64 Linux core files.

    • LLDB on AArch64 Linux now supports debugging the Scalable Matrix Extension (SME) and Scalable Matrix Extension 2 (SME2) for both live processes and core files. For details refer to the AArch64 Linux documentation.

    • LLDB now supports automatic symbol and binary acquisition using the debuginfod protocol. The standard mechanism of specifying debuginfod servers in the DEBUGINFOD_URLS environment variable is used by default. In addition, users can specify servers to request symbols from using the LLDB setting plugin.symbol-locator.debuginfod.server_urls, overriding or adding to the environment variable.

    • When running on AArch64 Linux, lldb-server now provides register field information for the following registers: cpsr, fpcr, fpsr, svcr and mte_ctrl.

  • Sanitizers

    • HWASan now defaults to detecting use-after-scope bugs.
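The llvm-readobj --elf-output-style=JSON change listed under LLVM tools affects any tool that parses that output. The following is a minimal sketch of a consumer that accepts both the old and the new shape. It assumes the top level of the output is a JSON array with one entry per input file; the helper name normalize_readobj is hypothetical and not part of LLVM.

```python
import json


def normalize_readobj(text):
    """Return a list of per-file objects from llvm-readobj JSON output.

    Accepts the old shape ({"main.o": {"FileSummary": ...}}) and the new
    shape ({"FileSummary": {...}, ...}). Assumes the top level is a JSON
    array with one entry per input file.
    """
    objects = []
    for entry in json.loads(text):
        if "FileSummary" in entry:
            # New shape: the object is no longer wrapped in a file-name key.
            objects.append(entry)
        else:
            # Old shape: a single file-name key wraps the real object.
            objects.extend(entry.values())
    return objects


# Hand-written samples modeled on the shapes quoted in the release note.
old = '[{"main.o": {"FileSummary": {"File": "main.o"}}}]'
new = '[{"FileSummary": {"File": "main.o"}}]'
same = normalize_readobj(old) == normalize_readobj(new)
```

With this approach, the file name is always read from FileSummary rather than from the wrapping key, which is the only place it appears in the new format.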

Removals#

  • LLVM IR

    • The constant expression variants of the following instructions have been removed:

      • and

      • or

      • lshr

      • ashr

      • zext

      • sext

      • fptrunc

      • fpext

      • fptoui

      • fptosi

      • uitofp

      • sitofp

  • RISC-V backend

    • XSfcie extension and SiFive CSRs and instructions that were associated with it have been removed. None of these CSRs and instructions were part of “SiFive Custom Instruction Extension”. The LLVM project needs to work with SiFive to define and document real extension names for individual CSRs and instructions.

  • Python bindings

    • The Python bindings have been removed.

  • C API

    • The following functions for creating constant expressions have been removed, because the underlying constant expressions are no longer supported. Instead, an instruction should be created using the LLVMBuildXYZ APIs, which will constant fold the operands if possible and create an instruction otherwise:

      • LLVMConstAnd

      • LLVMConstOr

      • LLVMConstLShr

      • LLVMConstAShr

      • LLVMConstZExt

      • LLVMConstSExt

      • LLVMConstZExtOrBitCast

      • LLVMConstSExtOrBitCast

      • LLVMConstIntCast

      • LLVMConstFPTrunc

      • LLVMConstFPExt

      • LLVMConstFPToUI

      • LLVMConstFPToSI

      • LLVMConstUIToFP

      • LLVMConstSIToFP

      • LLVMConstFPCast

  • CodeGen infrastructure

    • PrologEpilogInserter no longer supports register scavenging during forwards frame index elimination. Targets should use backwards frame index elimination instead.

    • RegScavenger no longer supports forwards register scavenging. Clients should use backwards register scavenging instead, which is preferred because it does not depend on accurate kill flags.

  • LLDB

    • SBWatchpoint::GetHardwareIndex is deprecated and now returns -1 to indicate the index is unavailable.

    • Methods in SBHostOS related to threads have had their implementations removed. These methods will return a value indicating failure.

Resolved issues#

  • AArch64 backend

    • Neoverse-N2 was incorrectly marked as an Armv8.5a core. This has been changed to an Armv9.0a core. However, crypto options are not enabled by default for Armv9 cores, so -mcpu=neoverse-n2+crypto is now required to enable crypto for this core. As far as the compiler is concerned, Armv9.0a has the same features enabled as Armv8.5a, with the exception of crypto.

  • Windows target

    • The LLVM filesystem class UniqueID and function equivalent() no longer treat distinct path names for the same hard-linked file as equal. This is an intentional tradeoff in a bug fix, where the bug used to cause distinct files to be considered equivalent on some file systems. This change fixed the GitHub issues #61401 and #22079.

Known issues#

The compiler may incorrectly compile a program that uses the __shfl(var, srcLane, width) function when one of the parameters to the function is undefined along some path to the function. For most functions, uninitialized inputs cause undefined behavior.

Note

The -Wall compilation flag prompts the compiler to generate a warning if a variable is uninitialized along some path.

As a workaround, initialize the parameters to __shfl. For example:

unsigned long istring = 0; // Initialize the input to __shfl
return __shfl(istring, 0, 64);

See issue #3499 on GitHub.

MIGraphX (2.10.0)#

Changes#

  • Added support for ONNX Runtime MIGraphX EP on Windows.

  • Added FP8 Python API.

  • Added examples for SD 2.1 and SDXL.

  • Added support for BERT to Dynamic Batch.

  • Added a --test flag in migraphx-driver to validate the installation.

  • Added support for ONNX Operator: Einsum.

  • Added uint8 support in ONNX Operators.

  • Added Split-k kernel configurations for performance improvements.

  • Added fusion for group convolutions.

  • Added rocMLIR conv3d support.

  • Added rocgdb to the Dockerfile.

  • Changed default location of libraries with release specific ABI changes.

  • Reorganized documentation in GitHub.

Removals#

  • Removed the --model flag with migraphx-driver.

Optimizations#

  • Improved ONNX Model Zoo coverage.

  • Reorganized memcpys with ONNX Runtime to improve performance.

  • Replaced scalar multibroadcast + unsqueeze with just a multibroadcast.

  • Improved MLIR kernel selection for multibroadcasted GEMMs.

  • Improved details of the perf report.

  • Enabled MLIR by default for GEMMs with small K.

  • Allowed specifying dot or convolution fusion for MLIR with an environment flag.

  • Improved performance on small reductions by doing multiple reductions per wavefront.

  • Added additional algebraic simplifications for mul-add-dot sequences of operations involving constants.

  • Used MLIR attention kernels in more cases.

  • Enabled MIOpen and CK fusions for MI300 gfx arches.

  • Added support for QDQ quantization patterns from Brevitas which have explicit cast/convert nodes before and after QDQ pairs.

  • Added Fusion of “contiguous + pointwise” and “layout + pointwise” operations which may result in performance gains in certain cases.

  • Added Fusion for “pointwise + layout” and “pointwise + contiguous” operations which may result in performance gains when using NHWC layout.

  • Added Fusion for “pointwise + concat” operation which may help in performance in certain cases.

  • Fixed a bug in “concat + pointwise” fusion where output shape memory layout wasn’t maintained.

  • Simplified “slice + concat” pattern in SDXL UNet.

  • Removed ZeroPoint/Shift in QuantizeLinear or DeQuantizeLinear ops if zero points values are zeros.

  • Improved inference performance by fusing Reduce to Broadcast.

  • Added additional information when printing the perf report.

  • Improved scalar fusions when not all strides are 0.

  • Added support for multi outputs in pointwise ops.

  • Improved reduction fusion with reshape operators.

  • Used the quantized output when an operator is used again.

  • Enabled Split-k GEMM perf configs for rocMLIR based GEMM kernels for better performance on all hardware.

Resolved issues#

  • Fixed Super Resolution model verification failure with FP16.

  • Fixed confusing messages by suppressing them when compiling the model.

  • Fixed an issue with the mod operator with int8 and int32 inputs.

  • Fixed an issue by preventing the spawning of too many threads for constant propagation when parallel STL is not enabled.

  • Fixed a bug when running migraphx-driver with the --run 1 option.

  • Fixed Layernorm accuracy by performing calculations in FP32.

  • Updated the Docker generator script to ROCm 6.1 to point at Jammy.

  • Fixed a floating point exception for dim (-1) in the reshape operator.

  • Fixed issue with int8 accuracy and models which were failing due to requiring a fourth bias input.

  • Fixed missing inputs not previously handled for quantized bias for the weights, and data values of the input matrix.

  • Fixed order of operations for int8 quantization which were causing inaccuracies and slowdowns.

  • Fixed an issue during compilation caused by the incorrect constructor being used at compile time. Removed the list initializer of prefix_scan_sum, which was causing issues during compilation.

  • Fixed the MIGRAPHX_GPU_COMPILE_PARALLEL flag to enable users to control the number of threads used for parallel compilation.

MIOpen (3.2.0)#

Changes#

  • Added:

    • [Conv] bilinear (alpha beta) solvers.

    • [Conv] enable bf16 for ck-based solvers.

    • [Conv] Add split_k tuning to 2d wrw ck-based solver.

    • [MHA] graph API fp8 fwd.

    • [RNN] multi-stream as default solution.

  • Added TunaNetv2.0 for MI300.

  • Added Adam and AMP Adam optimizer.

Resolved issues#

  • Memory access fault caused by GemmBwdRest.

  • Context configuration in GetWorkSpaceSize.

  • Fixes to support huge tensors.

Optimizations#

  • Find: improved precision of benchmarking.

MIVisionX (3.0.0)#

Changes#

  • Added support for:

    • Advanced GPUs

    • PreEmphasis Filter augmentation in openVX extensions

    • Spectrogram augmentation in openVX extensions

    • Downmix and ToDecibels augmentations in openVX extensions

    • Resample augmentation and Operator overloading nodes in openVX extensions

    • NonSilentRegion and Slice augmentations in openVX extensions

    • Mel-Filter bank and Normalize augmentations in openVX extensions

Removals#

  • Deprecated the use of rocAL for processing. rocAL is available at ROCm/rocAL.

Resolved issues#

  • Fixed issues with dependencies.

Known issues#

  • MIVisionX package install requires manual prerequisites installation.

Omniperf (2.0.1)#

Known issues#

  • Omniperf reports an error when running an application with command-line arguments. As a workaround, create an intermediary script that calls the application with the necessary arguments, then call the script with Omniperf. This issue will be fixed in a future release of Omniperf. See #347.

  • Omniperf might not work with AMD Instinct MI300 accelerators out of the box, resulting in the following error: “ERROR gfx942 is not enabled rocprofv1. Available profilers include: [‘rocprofv2’]”. As a workaround, set the environment variable: export ROCPROF=rocprofv2.

  • Omniperf’s Python dependencies may not be installed with your ROCm installation, resulting in the following message:

    [ERROR] The ‘dash>=1.12.0’ package was not found in the current execution environment.

    [ERROR] The ‘dash-bootstrap-components’ package was not found in the current execution environment.

    Please verify all of the Python dependencies called out in the requirements file are installed locally prior to running omniperf.

    See: /opt/rocm-6.2.0/libexec/omniperf/requirements.txt

    As a workaround, install these Python requirements manually: pip install -r /opt/rocm-6.2.0/libexec/omniperf/requirements.txt.

See issue #3498 on GitHub.
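The intermediary-script workaround for applications with command-line arguments can be automated. The following is a minimal sketch that generates such a wrapper script with the arguments safely quoted. The helper name write_wrapper, the application ./my_app, and its arguments are all hypothetical placeholders.

```python
import os
import shlex
import stat
import tempfile


def write_wrapper(path, app, args):
    """Write an executable shell script that runs `app` with `args`."""
    # Quote every part so arguments containing spaces survive the shell.
    command = " ".join(shlex.quote(part) for part in [app, *args])
    with open(path, "w") as handle:
        handle.write("#!/bin/sh\n")
        handle.write(f"exec {command}\n")
    # Mark the script as executable for its owner.
    os.chmod(path, os.stat(path).st_mode | stat.S_IXUSR)
    return path


# Hypothetical application and arguments, for illustration only.
wrapper_dir = tempfile.mkdtemp()
wrapper = write_wrapper(
    os.path.join(wrapper_dir, "run_app.sh"),
    "./my_app",
    ["--size", "1024", "--label", "two words"],
)
with open(wrapper) as handle:
    script_text = handle.read()
```

The generated script, rather than the application and its arguments, is then what gets passed to Omniperf.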

OpenMP (17.0.0)#

Changes#

  • Added basic experimental support for libc functions on the GPU via the LLVM C Library for GPUs.

  • Added minimal support for calling host functions from the device using the libc interface.

  • Added vendor agnostic OMPT callback support for OpenMP-based device offload.

Removals#

  • Removed the “old” device plugins along with support for the remote and ve plugins.

Resolved issues#

  • Fixed the implementation of omp_get_wtime for AMDGPU targets.

RCCL (2.20.5)#

Changes#

  • Added support for fp8 and rccl_bfloat8.

  • Added support for using HIP contiguous memory.

  • Added ROC-TX for host-side profiling.

  • Added new rome model.

  • Added fp16 and fp8 cases to unit tests.

  • Added a new unit test for main kernel stack size.

  • Added the new -n option for topo_expl to override the number of nodes.

  • Improved debug messages of memory allocations.

  • Enabled static build.

  • Enabled compatibility with:

    • NCCL 2.20.5.

    • NCCL 2.19.4.

  • Performance tuning for some collective operations on MI300.

  • Enabled NVTX code in RCCL.

  • Replaced rccl_bfloat16 with hip_bfloat16.

  • NPKit updates:

    • Warm-up iteration removal is no longer enabled by default; you must now opt in.

    • Doubled the size of buffers to accommodate more channels.

  • Modified rings to be rail-optimized topology friendly.

Resolved issues#

  • Fixed a bug when configuring RCCL for only LL128 protocol.

  • Fixed scratch memory allocation after API change for MSCCL.

rocAL (1.0.0)#

Changes#

  • Added tests and samples.

Removals#

  • Removed CuPy from setup.py.

Optimizations#

  • Added setup and install updates.

Resolved issues#

  • Minor bug fixes.

rocALUTION (3.2.0)#

Changes#

  • Added new file I/O based on rocSPARSE I/O format.

  • Added GetConvergenceHistory for ItILU0 preconditioner.

Removals#

  • Deprecated the following:

    • LocalMatrix::ReadFileCSR

    • LocalMatrix::WriteFileCSR

    • GlobalMatrix::ReadFileCSR

    • GlobalMatrix::WriteFileCSR

rocBLAS (4.2.0)#

Changes#

  • Added an ILP64 API for Level 2 functions and Level 3 trsm, for both C and Fortran (_64 name suffix), with int64_t function arguments.

  • Added cache flush timing for gemm_batched_ex, gemm_strided_batched_ex, and axpy.

  • Added Benchmark class for common timing code.

  • Added an environment variable, ROCBLAS_DEFAULT_ATOMICS_MODE, to set the default atomics mode during creation of rocblas_handle.

  • Added support for single-precision (fp32_r) input and double-precision (fp64_r) output and compute types by extending dot_ex.

  • Updated Linux AOCL dependency to release 4.2 gcc build.

  • Updated Windows vcpkg dependencies to release 2024.02.14.

  • Increased default device workspace from 32 to 128 MiB for architecture gfx9xx with xx >= 40.

Optimizations#

  • Improved performance of Level 1 dot_batched and dot_strided_batched for all precisions. Performance improved by a factor of 6 for larger problem sizes, as measured on an Instinct MI210 accelerator.

Removals#

  • Deprecated rocblas_gemm_ex3, gemm_batched_ex3 and gemm_strided_batched_ex3. They will be removed in the next major release of rocBLAS. Refer to hipBLASLt for future 8-bit float usage.

ROCdbgapi (0.76.0)#

Removals#

  • Renamed (AMD_DBGAPI_EXCEPTION_WAVE,AMD_DBGAPI_WAVE_STOP_REASON)_APERTURE_VIOLATION to (AMD_DBGAPI_EXCEPTION_WAVE,AMD_DBGAPI_WAVE_STOP_REASON)_ADDRESS_ERROR. The old names are still accessible but deprecated.

rocDecode (0.6.0)#

Changes#

  • Added full H.264 support and bug fixes.

rocFFT (1.0.28)#

Changes#

  • Randomly generated accuracy tests are now disabled by default. They can be enabled using the --nrand option (which defaults to 0).

Optimizations#

  • Implemented multi-device transform for 3D pencil decomposition. Contiguous dimensions on input and output bricks are transformed locally, with global transposes to make remaining dimensions contiguous.

rocm-cmake (0.13.0)#

Changes#

  • ROCmCreatePackage now accepts a suffix parameter, automatically generating it for static or ASAN builds.

    • Package names are no longer pulled from CPACK_<GEN>_PACKAGE_NAME.

    • Runtime packages will no longer be generated for static builds.

ROCm Data Center Tool (1.0.0)#

Changes#

  • Added ROCProfiler dmon metrics.

  • Added new ECC metrics.

  • Added ROCm Validation Suite diagnostic command.

  • Fully migrated to AMD SMI.

Removals#

  • Removed RASLIB dependency and blobs.

  • Removed rocm_smi_lib dependency due to migration to AMD SMI.

ROCm Debugger (ROCgdb) (14.2)#

Changes#

  • Introduced the coremerge utility to merge a host core dump and a GPU-only AMDGPU core dump into a unified AMDGPU corefile.

  • Added support for generating and opening core files for heterogeneous processes.

ROCm SMI (7.3.0)#

Changes#

  • Added Partition ID API (rsmi_dev_partition_id_get(..)).

Resolved issues#

  • Fixed Partition ID CLI output.

Note

See the detailed ROCm SMI changelog on GitHub for more information.

ROCm Validation Suite (1.0.0)#

Changes#

  • Added stress tests:

    • IET (power) stress test for MI300A.

    • IET (power transition) test for MI300X.

  • Added support:

    • GEMM self-check and accuracy-check support for checking consistency and accuracy of GEMM output.

    • Trigonometric float and random integer matrix data initialization support.

  • Updated GST performance benchmark test for better numbers.

rocPRIM (3.2.0)#

Changes#

  • Added new overloads for warp_scan::exclusive_scan that take no initial value. These new overloads will write an unspecified result to the first value of each warp.

  • The internal accumulator type of inclusive_scan(_by_key) and exclusive_scan(_by_key) is now exposed as an optional type parameter.

    • The default accumulator type is still the value type of the input iterator (inclusive scan) or the initial value’s type (exclusive scan). This is the same behaviour as before this change.

  • Added a new overload for device_adjacent_difference_inplace that allows separate input and output iterators, but allows them to point to the same element.

  • Added new public APIs for deriving resulting type on device-only functions:

    • rocprim::invoke_result

    • rocprim::invoke_result_t

    • rocprim::invoke_result_binary_op

    • rocprim::invoke_result_binary_op_t

  • Added the new rocprim::batch_copy function. Similar to rocprim::batch_memcpy, but copies by element, not with memcpy.

  • Added more test cases, to better cover supported data types.

  • Added an optional decomposer argument for all member functions of rocprim::block_radix_sort and all functions of device_radix_sort. To sort keys of a user-defined type, a decomposer functor should be passed. The decomposer should produce a rocprim::tuple of references to arithmetic types from the key.

  • Added rocprim::predicate_iterator, which acts as a proxy for an underlying iterator based on a predicate. It iterates over proxies that hold references to the underlying values, but only allow reading and writing if the predicate is true. It can be instantiated with:

    • rocprim::make_predicate_iterator

    • rocprim::make_mask_iterator

  • Added custom radix sizes as the last parameter for block_radix_sort. The default value is 4; it can be a number between 0 and 32.

  • Added rocprim::radix_key_codec, which allows the encoding/decoding of keys for radix-based sorts. For user-defined key types, a decomposer functor should be passed.

  • Updated some tests to work with supported data types.
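The accumulator type exposed by inclusive_scan(_by_key) and exclusive_scan(_by_key) matters when the running total does not fit in the input's value type. The following is a pure-Python model of that effect, not rocPRIM code: it emulates a fixed-width unsigned accumulator by masking. The inclusive_scan helper and its bits parameter are hypothetical.

```python
def inclusive_scan(values, bits):
    """Inclusive prefix sum using an unsigned accumulator of `bits` bits."""
    mask = (1 << bits) - 1
    total = 0
    result = []
    for value in values:
        # Wrap around the way a fixed-width unsigned integer would.
        total = (total + value) & mask
        result.append(total)
    return result


data = [200, 100, 50]                   # every input fits in 8 bits
narrow = inclusive_scan(data, bits=8)   # accumulator as wide as the input
wide = inclusive_scan(data, bits=32)    # wider accumulator avoids wrap-around
```

In this model the 8-bit accumulator wraps on the second element, while the 32-bit accumulator keeps the exact totals, which is the situation the optional accumulator type parameter is meant to address.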

Optimizations#

  • Improved the performance of warp_sort_shuffle and block_sort_bitonic.

  • Created an optimized version of the warp_exchange functions blocked_to_striped_shuffle and striped_to_blocked_shuffle when the warp size is equal to the items per thread.

Resolved issues#

  • Fixed incorrect results of warp_exchange::blocked_to_striped_shuffle and warp_exchange::striped_to_blocked_shuffle when the block size is larger than the logical warp size. The test suite has been updated with such cases.

  • Fixed incorrect results returned when calling device unique_by_key with overlapping values_input and values_output.

  • Fixed incorrect output type used in device_adjacent_difference.

  • Fixed an issue causing incorrect results on the GFX10 (RDNA1, RDNA2) ISA and GFX11 ISA on device scan algorithms rocprim::inclusive_scan(_by_key) and rocprim::exclusive_scan(_by_key) with large input types.

  • Fixed an issue with device_adjacent_difference. It now considers both the input and the output type for selecting the appropriate kernel launch config. Previously only the input type was considered, which could result in compilation errors due to excessive shared memory usage.

  • Fixed incorrect data being loaded with rocprim::thread_load when compiling with -O0.

  • Fixed a compilation failure in the host compiler when instantiating various block and device algorithms with block sizes not divisible by 64.

Removals#

  • Deprecated the internal header detail/match_result_type.hpp.

  • Deprecated TwiddleIn and TwiddleOut in favor of radix_key_codec.

  • Deprecated the internal ::rocprim::detail::radix_key_codec in favor of a new public utility with the same name.

ROCProfiler (2.0.0)#

Removals#

  • Removed pcsampler sample code due to deprecation from version 2.

rocRAND (3.1.0)#

Changes#

  • Added rocrand_create_generator_host.

    • The following generators are supported:

      • ROCRAND_RNG_PSEUDO_MRG31K3P

      • ROCRAND_RNG_PSEUDO_MRG32K3A

      • ROCRAND_RNG_PSEUDO_PHILOX4_32_10

      • ROCRAND_RNG_PSEUDO_THREEFRY2_32_20

      • ROCRAND_RNG_PSEUDO_THREEFRY2_64_20

      • ROCRAND_RNG_PSEUDO_THREEFRY4_32_20

      • ROCRAND_RNG_PSEUDO_THREEFRY4_64_20

      • ROCRAND_RNG_PSEUDO_XORWOW

      • ROCRAND_RNG_QUASI_SCRAMBLED_SOBOL32

      • ROCRAND_RNG_QUASI_SCRAMBLED_SOBOL64

      • ROCRAND_RNG_QUASI_SOBOL32

      • ROCRAND_RNG_QUASI_SOBOL64

    • The host-side generators support multi-core processing. On Linux, this requires the TBB (Thread Building Blocks) development package to be installed on the system when building rocRAND (libtbb-dev on Ubuntu and derivatives).

      • If TBB is not found when configuring rocRAND, the configuration is still successful, and the host generators are executed on a single CPU thread.

  • Added the option to create a host generator to the Python wrapper.

  • Added the option to create a host generator to the Fortran wrapper.

  • Added dynamic ordering. This ordering is free to rearrange the produced numbers, which can be specific to devices and distributions. It is implemented for:

    • XORWOW, MRG32K3A, MTGP32, Philox 4x32-10, MRG31K3P, LFSR113, and ThreeFry

  • Added support for using Clang as the host compiler for alternative platform compilation.

  • C++ wrapper:

    • Added support for lfsr113_engine being constructed with a seed of type unsigned long long, not only uint4.

    • Added optional order parameter to the constructor of mt19937_engine.

  • Added the following functions for the ROCRAND_RNG_PSEUDO_MTGP32 generator:

    • rocrand_normal2

    • rocrand_normal_double2

    • rocrand_log_normal2

    • rocrand_log_normal_double2

  • Added rocrand_create_generator_host_blocking which dispatches without stream semantics.

  • Added host-side generator for ROCRAND_RNG_PSEUDO_MTGP32.

  • Added offset and skipahead functionality to LFSR113 generator.

  • Added dynamic ordering for architecture gfx1102.

  • For device-side generators, you can now wrap calls to rocrand_generate_* inside of a hipGraph. There are a few things to be aware of:

    • Generator creation (rocrand_create_generator), initialization (rocrand_initialize_generator), and destruction (rocrand_destroy_generator) must still happen outside the hipGraph.

    • After the generator is created, you may call API functions to set its seed, offset, and order.

    • After the generator is initialized (but before stream capture or manual graph creation begins), use rocrand_set_stream to set the stream the generator will use within the graph.

    • A generator’s seed, offset, and stream may not be changed from within the hipGraph. Attempting to do so may result in unpredictable behaviour.

    • API calls for the poisson distribution (for example, rocrand_generate_poisson) are not yet supported inside of hipGraphs.

    • For sample usage, see the unit tests in test/test_rocrand_hipgraphs.cpp

  • Building rocRAND now requires a C++17 capable compiler, as the internal library sources now require it. However, consuming rocRAND is still possible from C++11, because the public headers don’t make use of the new features.

  • Building rocRAND should be faster on machines with multiple CPU cores as the library has been split to multiple compilation units.

  • C++ wrapper: the min() and max() member functions of the generators and distributions are now static constexpr.

  • Renamed and unified the existing ROCRAND_DETAIL_.*_BM_NOT_IN_STATE to ROCRAND_DETAIL_BM_NOT_IN_STATE

  • Static and dynamic library: moved all internal symbols to namespaces to avoid potential symbol name collisions when linking.

Removals#

  • Deprecated the following typedefs. Please use the unified state_type alias instead.

    • rocrand_device::threefry2x32_20_engine::threefry2x32_20_state

    • rocrand_device::threefry2x64_20_engine::threefry2x64_20_state

    • rocrand_device::threefry4x32_20_engine::threefry4x32_20_state

    • rocrand_device::threefry4x64_20_engine::threefry4x64_20_state

  • Deprecated the following internal headers:

    • src/rng/distribution/distributions.hpp.

    • src/rng/device_engines.hpp.

  • Removed references to and workarounds for deprecated hcc.

  • Removed support for HIP-CPU.

Known Issues#

  • SOBOL64 and SCRAMBLED_SOBOL64 generate poisson-distributed unsigned long long int numbers instead of unsigned int. This will be fixed in a future release.

ROCr Runtime (1.14.0)#

Changes#

  • Added PC sampling feature (experimental feature).

rocSOLVER (3.26.0)#

Changes#

  • Added 64-bit APIs for existing functions:

    • GETF2_64 (with batched and strided_batched versions)

    • GETRF_64 (with batched and strided_batched versions)

    • GETRS_64 (with batched and strided_batched versions)

  • Added gfx900 to default build targets.

  • Added partial eigenvalue decomposition routines for symmetric/hermitian matrices using Divide & Conquer and Bisection:

    • SYEVDX (with batched and strided_batched versions)

    • HEEVDX (with batched and strided_batched versions)

  • Added partial generalized symmetric/hermitian-definite eigenvalue decomposition using Divide & Conquer and Bisection:

    • SYGVDX (with batched and strided_batched versions)

    • HEGVDX (with batched and strided_batched versions)

  • Renamed install script arguments of the form *_dir to *-path. Arguments of the form *_dir remain functional for backwards compatibility.

  • Functions working with arrays of size n - 1 can now accept null pointers when n = 1.

Optimizations#

  • Improved performance of Cholesky factorization.

  • Improved performance of splitlu to extract the L and U triangular matrices from the result of sparse factorization matrix M, where M = (L - eye) + U.
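The relation M = (L - eye) + U packs both triangular factors into a single matrix. As a dense, pure-Python illustration of that relation (not rocSOLVER's sparse implementation), the sketch below splits such a matrix back into L and U. The split_lu helper is hypothetical, and L is assumed to have a unit diagonal, which is what allows its diagonal to be dropped from M.

```python
def split_lu(m):
    """Split M = (L - I) + U into unit-lower L and upper U (dense lists)."""
    n = len(m)
    lower = [[0.0] * n for _ in range(n)]
    upper = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i > j:
                lower[i][j] = m[i][j]   # strictly lower part belongs to L
            else:
                upper[i][j] = m[i][j]   # diagonal and above belong to U
        lower[i][i] = 1.0               # L's unit diagonal is implicit in M
    return lower, upper


m = [[4.0, 3.0],
     [0.5, 2.0]]
lower, upper = split_lu(m)

# Rebuild M from the factors to check the relation M = (L - I) + U.
rebuilt = [
    [lower[i][j] - (1.0 if i == j else 0.0) + upper[i][j] for j in range(2)]
    for i in range(2)
]
```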

Resolved issues#

  • Fixed potential accuracy degradation in SYEVJ/HEEVJ for inputs with small eigenvalues.

rocSPARSE (3.2.0)#

Changes#

  • Added a new Merge-Path algorithm to SpMM, supporting CSR format.

  • Added support for row order to SpSM.

  • Added rocsparseio I/O functionality to the library.

  • Added rocsparse_set_identity_permutation.

  • Adjusted rocSPARSE dependencies to related HIP packages.

  • Binary size has been reduced.

  • A namespace has been wrapped around internal rocSPARSE functions and kernels.

  • rocsparse_csr_set_pointers, rocsparse_csc_set_pointers, and rocsparse_bsr_set_pointers now allow the column indices and values arrays to be nullptr if nnz is 0.

  • gfx803 target has been removed from address sanitizer builds.

Optimizations#

  • SpMV adaptive and LRB algorithms have been further optimized on CSR format

  • Improved performance of SpMV adaptive with symmetrically stored matrices on CSR format

  • Improved documentation and contribution guidelines.

Resolved issues#

  • Fixed compilation errors with BUILD_ROCSPARSE_ILP64=ON.

rocThrust (3.1.0)#

Changes#

  • Added changes from upstream CCCL/thrust 2.2.0.

    • Updated the contents of system/hip and test with the upstream changes.

  • Updated internal calls to rocprim::detail::invoke_result to use the public API rocprim::invoke_result.

  • Updated to use rocprim::device_adjacent_difference for adjacent_difference API call.

  • Updated internal use of custom iterator in thrust::detail::unique_by_key to use rocPRIM’s rocprim::unique_by_key.

  • Updated adjacent_difference to use rocprim::adjacent_difference when the iterators are comparable and not equal; otherwise, it uses rocprim::adjacent_difference_inplace.

Known Issues#

  • thrust::reduce_by_key outputs are not bit-wise reproducible, as run-to-run results for pseudo-associative reduction operators (e.g. floating-point arithmetic operators) are not deterministic on the same device.

  • Note that currently, rocThrust memory allocation is performed in such a way that most algorithmic API functions cannot be called from within hipGraphs.
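The reproducibility note for thrust::reduce_by_key follows from floating-point addition not being associative: summing the same values in a different order can change the last bits of the result. The following pure-Python model (not rocThrust code; the reduce_by_key helper and its reverse parameter are hypothetical) sums each key's values in two different orders.

```python
from itertools import groupby


def reduce_by_key(keys, values, reverse=False):
    """Sum consecutive values that share a key, in a chosen order."""
    out = []
    for key, group in groupby(zip(keys, values), key=lambda pair: pair[0]):
        group_values = [value for _, value in group]
        if reverse:
            # Model a different reduction order, such as another schedule.
            group_values.reverse()
        total = 0.0
        for value in group_values:
            total += value
        out.append((key, total))
    return out


keys = ["a", "a", "a"]
values = [0.1, 0.2, 0.3]
forward = reduce_by_key(keys, values)
backward = reduce_by_key(keys, values, reverse=True)
difference = abs(forward[0][1] - backward[0][1])
```

The two results differ only by rounding error, so comparing such outputs with a tolerance rather than bit-for-bit is the usual approach.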

rocWMMA (1.5.0)#

Changes#

  • Added internal utilities for:

    • Element-wise vector transforms.

    • Cross-lane vector transforms.

  • Added internal aos<->soa transforms for block sizes of 16, 32, 64, 128 and 256 and vector widths of 2, 4, 8 and 16.

  • Added tests for new internal transforms.

  • Improved loading layouts by increasing vector width for fragments with blockDim > 32.

  • API applyDataLayout transform now accepts WaveCount template argument for cooperative fragments.

  • API applyDataLayout transform now physically applies aos<->soa transform as necessary.

  • Refactored entry-point of std library usage to improve hipRTC support.

  • Updated installation, programmer’s guide and API reference documentation.
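The aos<->soa transforms mentioned above convert between array-of-structures and structure-of-arrays layouts. As a conceptual pure-Python model of what such a layout change means (not rocWMMA's internal implementation, which operates on register fragments), the sketch below regroups a flat list. The helper names and the width parameter are hypothetical.

```python
def aos_to_soa(flat, width):
    """Regroup [x0, y0, x1, y1, ...] into [x0, x1, ..., y0, y1, ...]."""
    return [
        flat[index]
        for field in range(width)
        for index in range(field, len(flat), width)
    ]


def soa_to_aos(flat, width):
    """Inverse of aos_to_soa for the same number of fields per item."""
    count = len(flat) // width
    return [
        flat[field * count + item]
        for item in range(count)
        for field in range(width)
    ]


aos = [0, 1, 2, 3, 4, 5]          # three items with two fields each
soa = aos_to_soa(aos, width=2)    # all first fields, then all second fields
round_trip = soa_to_aos(soa, width=2)
```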

Resolved issues#

  • Fixed the ordering of some header includes to improve portability.

RPP (1.8.0)#

Changes#

  • Prerequisites - ROCm install requires only --usecase=rocm.

  • Use pre-allocated common scratchBufferHip everywhere in Tensor code for scratch HIP memory.

  • Use CHECK_RETURN_STATUS everywhere to adhere to C++17 for HIP.

  • RPP Tensor Audio support on HOST for Spectrogram.

  • RPP Tensor Audio support on HOST/HIP for Slice, by modifying voxel slice kernels to now accept anchor and shape params for a more generic version.

  • RPP Tensor Audio support on HOST for Mel Filter Bank.

  • RPP Tensor Normalize ND support on HOST and HIP.

Tensile (4.41.0)#

Changes#

  • New tuning script to summarize rocBLAS log file

  • New environment variable to test fixed grid size with Stream-K kernels

  • New Stream-K dynamic mode to run large problems at slightly reduced CU count if it improves work division and power

  • Add reject conditions for SourceKernel + PrefetchGlobalRead/LoopDoWhile

  • Add reject condition for PreloadKernelArguments (disable PreloadKernelArguments if not supported, instead of rejecting kernel generation)

  • Support NT flag for global load and store for gfx94x

  • New Kernarg preloading feature (DelayRemainingArgument: initiate the load of the remaining (non-preloaded) arguments, updated AsmCaps, AsmRegisterPool to track registers for arguments and preload)

  • Add option for rotating buffers timing with cache eviction

  • Add predicate for arithmetic intensity

  • Add DirectToVgpr + packing for f8/f16 + TLU cases

  • Enable negative values for ExtraLatencyForLR to reduce interval of local read and wait for DTV

  • Add test cases for DirectToVgpr + packing

  • Add batch support for Stream-K kernels and new test cases

  • New tuning scripts to analyze rocblas-bench results and remove tuned sizes from liblogic

  • Enable VgprForLocalReadPacking + PrefetchLocalRead=1 (removed the reject condition for VFLRP + PLR=1, added test cases for VFLRP + PLR=1)

  • Support VectorWidthB (new parameter VectorWidthB)

  • Support VectorWidth + non SourceSwap

  • Add test cases for VectorWidthB, VectorWidth + non SourceSwap

  • Add code owners file

  • New environment variables to dynamically adjust number of CUs used in Stream-K

  • Add new parameters to specify global load width for A and B separately (GlobalLoadVectorWidthA, B (effective with GlobalReadVectorWidth=-1))

  • Add xf32 option to rocblas-bench input creator

  • Update rocBLAS-bench-input-create script (added number of iteration based on performance, rotating buffer flag)

  • Limit build threads based on CPUs/RAM available on system (for tests)

  • Update required workspace size for Stream-K, skip kernel initialization when possible

  • Use fallback libraries for archs without optimized logic

  • Use hipMemcpyAsync for validation (replace hipMemcpy with hipMemcpyAsync + hipStreamSynchronize in ReferenceValidator)

  • Remove OCL tests

  • Disable HostLibraryTests

  • Reduce extended test time by removing extra parameters in the test config files

  • Disable InitAccVgprOpt for Stream-K

  • Skip sgemm 64bit offset tests for gfx94x

  • Skip DTV, DTL, LSU+MFMA tests for gfx908

  • Increase extended test timeout to 720 min

  • Update xfail test (1sum tests only failing on gfx90a)

  • Update lib logic convertor script

  • Test limiting CI threads for only gfx11

  • WGM-related kernargs are removed if they are not needed (WGM=-1,0,1)

  • Clean up unused old code, mostly related to old client

  • Change GSUA to SingleBuffer if GlobalSplitU=1 + MultipleBuffer, instead of rejecting it

  • Update efficiency script for new architecture and xf32 datatype

  • Re-enable negative values for WorkGroupMapping (asm kernel only)

  • Disable HW monitor for aquavanjaram941

  • Pre-apply offsets for strided batch kernels

  • Update tensile build with 16 threads

Optimizations#

  • Made initialization optimizations (reordered init code for PreloadKernelArguments opt, used s_mov_b64 for 64 bit address copy, used v_mov_b64/ds_read_b64 for C register initialization, added undefine AddressC/D with PreloadKernelArguments, optimized waitcnt for prefetch global read with DirectToVgpr, refactored waitcnt code for DTV and moved all asm related code to KernelWriterAssembly.py).

  • Optimized temp vgpr allocation for ClusterLocalRead (added if condition to allocate temp vgpr only for 8bit datatype)

  • Reversed MFMA order in inner loop for odd outer iteration

  • Optimized waitcnt lgkmcnt for 1LDSBuffer + PGR>1 (removed redundant waitcnt lgkmcnt after 1LDSBuffer sync)

  • Increased maximum value of DepthU to 1024 (used globalParameters MaxDepthU to define maximum value of DepthU)

Resolved issues#

  • Fixed WorkspaceCheck implementation when used in rocBLAS.

  • Fixed Stream-K partials cache behavior.

  • Fixed MasterSolutionLibrary indexing for multiple architecture build.

  • Fixed memory allocation failure with FlushMemorySize + StridedBatched/Batched cases (multiply batch count size when calculating array size).

  • Fixed BufferLoad=False with Stream-K.

  • Fixed mismatch issue with GlobalReadCoalesceGroup.

  • Fixed rocBLAS build failure on gfx11 (used state["ISA"] for reject conditions instead of globalParameters["CurrentISA"]).

  • Fixed LdsPad auto (fixed incorrect value assignment for autoAdjusted, set LdsBlockSizePerPadA or B = 0 if stride is not power of 2).

  • Fixed inaccurate vgpr allocation for ClusterLocalRead.

  • Fixed mismatch issue with LdsBlockSizePerPad + MT1 (or MT0) not power of 2.

  • Fixed mismatch issue with InitAccOpt + InnerUnroll (use const 0 for src1 of MFMA only if index of InnerUnroll (iui) is 0).

  • Fixed HostLibraryTests on gfx942 and gfx941.

  • Fixed LLVM crash issue.

  • Fixed compatibility with newer Windows vcpkg msgpack and vcpkg version package name.

  • Fixed an error with DisableKernelPieces + 32bit ShadowLimit.

  • Ignored asm cap check for kernel arg preload for ROCm 6.0 and older.

ROCm known issues#

ROCm known issues are noted on GitHub. For known issues related to individual components, review the Detailed component changes.

Default processor affinity behavior for helper threads#

Processor affinity determines which cores the ROCm helper threads run on. By default, that is, when the environment variable HSA_OVERRIDE_CPU_AFFINITY_DEBUG is not set or is set to 1, ROCm helper threads are spawned on all available cores, ignoring the parent thread’s processor affinity. This can lead to threads competing for available cores, which may result in suboptimal performance. If HSA_OVERRIDE_CPU_AFFINITY_DEBUG is set to 0, the ROCr runtime uses the parent process’s core affinity mask when creating helper threads. In that case, make sure the parent’s affinity mask contains enough cores to accommodate the additional threads. Depending on the affinity settings of the software environment, the batch system, launch commands such as numactl or taskset, or explicit mask manipulation by the application itself, changing the setting may improve performance.

To ensure the parent’s core affinity mask is honored by the ROCm helper threads, set the HSA_OVERRIDE_CPU_AFFINITY_DEBUG environment variable as follows:

export HSA_OVERRIDE_CPU_AFFINITY_DEBUG=0

To ensure ROCm helper threads run on all available cores, set the HSA_OVERRIDE_CPU_AFFINITY_DEBUG environment variable as follows:

export HSA_OVERRIDE_CPU_AFFINITY_DEBUG=1

Or use the default by leaving the variable unset:

unset HSA_OVERRIDE_CPU_AFFINITY_DEBUG
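
The settings above can be summarized in a small shell helper. This is a minimal sketch rather than an official ROCm tool: the function name helper_thread_affinity_mode is hypothetical, and because these notes only describe the values 0 and 1 (and the unset case), any other value is reported as unknown.

```shell
# Hypothetical helper: report which helper-thread placement the current
# environment selects, based only on the behavior described in these notes.
helper_thread_affinity_mode() {
  # ${VAR-1}: treat an unset variable like the documented default (1).
  case "${HSA_OVERRIDE_CPU_AFFINITY_DEBUG-1}" in
    0) echo "parent-mask" ;;  # helper threads use the parent's affinity mask
    1) echo "all-cores" ;;    # helper threads are spawned on all available cores
    *) echo "unknown" ;;      # other values are not covered by these notes
  esac
}

helper_thread_affinity_mode
```

Calling the function before launching a job gives a quick check of which behavior the job will inherit.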

If unsure of the default processor affinity settings for your environment, run the following command from the shell:

bash -c "taskset -p \$\$"
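
When HSA_OVERRIDE_CPU_AFFINITY_DEBUG is set to 0, the parent’s mask must contain enough cores for the helper threads as well. The following is a minimal sketch of such a check, under stated assumptions: the function name mask_has_enough_cores is hypothetical, the required core count is a number you choose (these notes do not state how many helper threads ROCm creates), and GNU coreutils nproc and env are assumed to be available.

```shell
# Hypothetical check: succeed if the current affinity mask contains at
# least the requested number of cores.
mask_has_enough_cores() {
  required_cores="$1"
  # GNU nproc counts only the cores available to this process. It also
  # honors OMP_NUM_THREADS and OMP_THREAD_LIMIT, so clear them for the count.
  available_cores="$(env -u OMP_NUM_THREADS -u OMP_THREAD_LIMIT nproc)"
  [ "$available_cores" -ge "$required_cores" ]
}

# Example: 8 is an arbitrary illustrative requirement, not a ROCm figure.
if mask_has_enough_cores 8; then
  echo "affinity mask has at least 8 cores"
else
  echo "affinity mask has fewer than 8 cores; consider widening it"
fi
```

Run the check in the same shell, or under the same taskset or numactl wrapper, that launches the application, so that it sees the mask the application will inherit.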

See issue #3493 on GitHub.

Display issues on servers with Instinct MI300-series accelerators when loading AMDGPU driver#

Servers that combine AMD Instinct MI300-series accelerators with a third-party GPU such as the Matrox G200 have an issue impacting video output. The issue was reproduced on a Dell PowerEdge XE9680 server. Servers from other vendors that use Matrox G200 cards may be impacted as well. This issue was found with ROCm 6.2.0 but is also present in older ROCm versions.

The AMDGPU driver shipped with ROCm interferes with the video output of the display card. On Dell systems, this includes both the local video output and remote access via iDRAC. The display appears blank (black) after the amdgpu driver modules are loaded. The loss of video output affects both terminal access when running in runlevel 3 and GUI access when running in runlevel 5. Server functionality can still be accessed via SSH or other remote connection methods.

See issue #3494 on GitHub.

KFDTest failure on Instinct MI300X with Oracle Linux 8.9#

KFDEvictTest.QueueTest fails on the MI300X platform during KFD (Kernel Fusion Driver) tests, which prevents the full suite from executing properly. This issue is suspected to be hardware-related.

See issue #3495 on GitHub.

Bandwidth limitation in gang and non-gang modes on Instinct MI300A#

Expected target peak non-gang performance (~60 GB/s) and target peak gang performance (~90 GB/s) are not achieved. Both gang and non-gang performance are observed to be limited to 45 GB/s.

This issue will be addressed in a future ROCm release.

See issue #3496 on GitHub.

rocm-llvm-alt#

ROCm provides an optional package, rocm-llvm-alt, that contains a closed-source compiler for users interested in additional closed-source CPU optimizations. This feature is not functional in the ROCm 6.2.0 release. Users who attempt to invoke the closed-source compiler will encounter an LLVM consumer-producer mismatch, and the compilation will fail. There is no workaround that allows use of the closed-source compiler. Compile with the default open-source compiler instead, which generates high-quality AMD CPU and AMD GPU code.

See issue #3492 on GitHub.

ROCm upcoming changes#

This section notes upcoming changes to the ROCm software stack. For upcoming changes related to individual components, review the Detailed component changes.

rocm-llvm-alt#

The rocm-llvm-alt package will be removed in an upcoming release. Users relying on the functionality provided by the closed-source compiler should transition to the open-source compiler. Once the rocm-llvm-alt package is removed, any compilation requesting functionality provided by the closed-source compiler will result in a Clang warning: “[AMD] proprietary optimization compiler has been removed”.