ROCm 6.2.0 release notes#
2024-08-02
The release notes provide a comprehensive summary of changes since the previous ROCm release.
The Compatibility matrix provides an overview of operating system, hardware, ecosystem, and ROCm component support across ROCm releases.
Release notes for previous ROCm releases are available in earlier versions of the documentation. See the ROCm documentation release history.
Release highlights#
This section introduces notable new features and improvements in ROCm 6.2. See the Detailed component changes for individual component changes.
New components#
ROCm 6.2.0 introduces the following new components to the ROCm software stack.
Omniperf – A kernel-level profiling tool for machine learning and high-performance computing (HPC) workloads running on AMD Instinct accelerators. Omniperf offers comprehensive profiling and advanced analysis via command line or a GUI dashboard. For more information, see Omniperf.
Omnitrace – A multi-purpose analysis tool for profiling and tracing applications running on the CPU or the CPU and GPU. It supports dynamic binary instrumentation, call-stack sampling, causal profiling, and other features for determining which function and line number are executing. For more information, see Omnitrace.
rocPyDecode – A tool to access rocDecode APIs in Python. It connects Python and C/C++ libraries, enabling function calling and data passing between the two languages. The rocpydecode.so library, a wrapper, uses rocDecode APIs written primarily in C/C++ within Python. For more information, see rocPyDecode.
ROCprofiler-SDK – A profiling and tracing library for HIP and ROCm applications on AMD ROCm software, used to identify application performance bottlenecks and optimize their performance. The new APIs add restrictions for more efficient implementations and improved thread safety. A new window restriction specifies the services the tool can use. ROCprofiler-SDK also provides a tool library to help you write your tool implementations. rocprofv3 uses this tool library to profile and trace applications for performance bottlenecks. Examples include API tracing, kernel tracing, and so on. For more information, see ROCprofiler-SDK.
Note
ROCprofiler-SDK for ROCm 6.2.0 is a beta release and subject to change.
ROCm Offline Installer Creator introduced#
The new ROCm Offline Installer Creator creates an installation package for a preconfigured setup of ROCm, the AMDGPU driver, or a combination of the two on a target system without network access. This new tool customizes multiple unique configurations for use when installing ROCm on a target. Other notable features include:
A lightweight, easy-to-use user interface for configuring the creation of the installer
Support for multiple Linux distributions
Installer support for different ROCm releases and specific ROCm components
Optional driver or driver-only installer creation
Optional post-install preferences
Lightweight installer packages, which are unique to the preconfigured ROCm setup
Resolution and inclusion of dependency packages for offline installation
For more information, see ROCm Offline Installer Creator.
Math libraries default to Clang instead of HIPCC#
The default compiler used to build the math libraries on Linux changes from hipcc to amdclang++. Appropriate compiler flags are added to ensure these compilations build correctly. This change only applies when building the libraries. Applications using the libraries can continue to be compiled using hipcc or amdclang++ as described in ROCm compiler reference.
The math libraries can also be built with hipcc using any of the previously available methods (for example, the CXX environment variable, the CMAKE_CXX_COMPILER CMake variable, and so on). This change shouldn’t affect performance or functionality.
Framework and library changes#
This section highlights updates to supported deep learning frameworks and notable third-party library optimizations.
Additional PyTorch and TensorFlow support#
ROCm 6.2.0 supports PyTorch versions 2.2 and 2.3 and TensorFlow version 2.16.
See Installing PyTorch for ROCm and Installing TensorFlow for ROCm for installation instructions.
Refer to the Third-party support matrix for a comprehensive list of third-party frameworks and libraries supported by ROCm.
Optimized framework support for OpenXLA#
PyTorch for ROCm and TensorFlow for ROCm now provide native support for OpenXLA. OpenXLA is an open-source ML compiler ecosystem that enables developers to compile and optimize models from all leading ML frameworks. For more information, see Installing PyTorch for ROCm and Installing TensorFlow for ROCm.
PyTorch support for Autocast (automatic mixed precision)#
PyTorch now supports Autocast for recurrent neural networks (RNNs) on ROCm. This can help to reduce computational workloads and improve performance. Based on the information about the magnitude of values, Autocast can substitute the original float32 linear layers and convolutions with their float16 or bfloat16 variants. For more information, see Automatic mixed precision.
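The magnitude of values matters because float16 covers a much narrower range than float32, while bfloat16 keeps the float32 exponent range at reduced precision. The following is a minimal sketch of that constraint using only Python's standard struct module. It is not PyTorch or Autocast code, and the helper names to_float16 and fits_float16 are hypothetical, chosen for illustration.

```python
import struct


def to_float16(value: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    # The "e" format code packs a 16-bit half-precision float.
    return struct.unpack("<e", struct.pack("<e", value))[0]


def fits_float16(value: float) -> bool:
    """Return True if the value can be stored in float16 without overflow."""
    try:
        struct.pack("<e", value)
        return True
    except OverflowError:
        # Raised when the magnitude exceeds the largest finite float16.
        return False


if __name__ == "__main__":
    for x in (0.1, 1000.5, 65504.0, 70000.0):
        if fits_float16(x):
            print(x, "->", to_float16(x))
        else:
            print(x, "-> overflows float16")
```

Values near 0.1 lose some precision in float16, and values above 65504 do not fit at all, which is why a mixed-precision scheme has to take value ranges into account before lowering an operation.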
Memory savings for bitsandbytes model quantization#
The ROCm-aware bitsandbytes library is a lightweight Python wrapper around HIP custom functions, in particular 8-bit optimizer, matrix multiplication, and 8-bit and 4-bit quantization functions. ROCm 6.2.0 introduces the following bitsandbytes changes:
Int8 matrix multiplication is enabled, and it includes the following functions:
extract-outliers – extracts rows and columns that have outliers in the inputs. They’re later used for matrix multiplication without quantization.
transform – row-to-column and column-to-row transformations are enabled, along with transpose operations. These are used before and after matmul computation.
igemmlt – new function for GEMM computation A*B^T. It uses hipblasLtMatMul and performs 8-bit GEMM operations.
dequant_mm – dequantizes output matrix to original data type using scaling factors from vector-wise quantization.
Blockwise quantization – input tensors are quantized for a fixed block size.
4-bit quantization and dequantization functions – normalized Float4 quantization, quantile estimation, and quantile quantization functions are enabled.
8-bit and 32-bit optimizers are enabled.
Note
These functions are included in bitsandbytes. They are not part of ROCm. However, ROCm 6.2.0 has enabled the fixes and features to run them.
For more information, see Model quantization techniques.
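To illustrate the idea behind blockwise quantization, the following is a minimal pure-Python sketch that quantizes a list of floats to 8-bit integer codes with one scale per fixed-size block. It is not the bitsandbytes implementation: the function names, the absmax scaling, and the linear rounding are assumptions made for illustration, and the library's HIP kernels may map values differently.

```python
from typing import List, Tuple


def quantize_blockwise(values: List[float], block_size: int) -> Tuple[List[int], List[float]]:
    """Quantize values to signed 8-bit codes, one absmax scale per block."""
    codes: List[int] = []
    scales: List[float] = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        absmax = max(abs(v) for v in block)
        # Map the largest magnitude in the block to code 127.
        scale = absmax / 127.0 if absmax > 0 else 1.0
        scales.append(scale)
        codes.extend(round(v / scale) for v in block)
    return codes, scales


def dequantize_blockwise(codes: List[int], scales: List[float], block_size: int) -> List[float]:
    """Recover approximate values using the scale of each code's block."""
    return [c * scales[i // block_size] for i, c in enumerate(codes)]


if __name__ == "__main__":
    data = [0.5, -1.0, 0.25, 100.0, -50.0, 25.0]
    q, s = quantize_blockwise(data, block_size=3)
    print(q, s)
    print(dequantize_blockwise(q, s, block_size=3))
```

Because each block carries its own scale, a large value in one block (100.0 here) does not reduce the resolution available to small values in another block.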
Improved vLLM support#
ROCm 6.2.0 enhances vLLM support for inference on AMD Instinct accelerators, adding capabilities for FP16/BF16 precision for LLMs, and FP8 support for Llama.
ROCm 6.2.0 adds support for the following vLLM features:
MP: Multi-GPU execution. Choose between MP and Ray using a flag. To set it to MP, use --distributed-executor-backend=mp. The default is in flux and depends on the commit.
FP8 KV cache: Enhances computational efficiency and performance by significantly reducing memory usage and bandwidth requirements. The QUARK quantizer currently only supports Llama.
Triton Flash Attention: ROCm supports both Triton and Composable Kernel Flash Attention 2 in vLLM. The default is Triton, but you can change this setting using the VLLM_USE_FLASH_ATTN_TRITON=False environment variable.
PyTorch TunableOp: Improved optimization and tuning of GEMMs. It requires Docker with PyTorch 2.3 or later.
For more information about enabling these features, see vLLM inference.
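As a rough illustration of why an FP8 KV cache reduces memory usage, the following sketch estimates the cache size for a transformer model at 2 bytes per element (FP16) and 1 byte per element (FP8). The formula is the standard key-plus-value accounting; the model dimensions are hypothetical and not taken from any specific model, and the sketch ignores any extra storage for FP8 scaling factors.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, bytes_per_element: int) -> int:
    """Estimate KV cache size in bytes for one batch of sequences."""
    # The factor 2 accounts for the separate key and value tensors per layer.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_element)


if __name__ == "__main__":
    # Hypothetical configuration, for illustration only.
    config = dict(num_layers=32, num_kv_heads=8, head_dim=128,
                  seq_len=4096, batch_size=1)
    fp16 = kv_cache_bytes(**config, bytes_per_element=2)
    fp8 = kv_cache_bytes(**config, bytes_per_element=1)
    print(f"FP16: {fp16 / 2**20:.0f} MiB, FP8: {fp8 / 2**20:.0f} MiB")
```

Under these assumptions the FP8 cache is half the size of the FP16 cache, which also halves the bandwidth needed to read it during decoding.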
ROCm has a vLLM branch for experimental features. This includes performance improvements, accuracy, and correctness testing. These features include:
FP8 GEMMs: To improve the performance of FP8 quantization, work is underway on tuning the GEMM using the shapes used in the model’s execution. It only supports Llama because the QUARK quantizer currently only supports Llama.
Custom decode paged attention: Improves performance by efficiently managing memory and enabling faster attention computation in large-scale models. This benefits all workloads in FP16 configurations.
To enable these experimental new features, see vLLM inference. Use the rocm/vllm branch when cloning the GitHub repo. The vllm/ROCm_performance.md document outlines all the accessible features, and the vllm/Dockerfile.rocm file can be used.
Enhanced performance tuning on AMD Instinct accelerators#
ROCm is pre-tuned for high-performance computing workloads including large language models, generative AI, and scientific computing. The ROCm documentation provides comprehensive guidance on configuring your system for AMD Instinct accelerators. It includes detailed instructions on system settings and application tuning suggestions to help you fully leverage the capabilities of these accelerators for optimal performance. For more information, see AMD MI300X tuning guides and AMD MI300A system optimization.
Removed clang-ocl#
As of version 6.2, ROCm no longer provides the clang-ocl package. See the clang-ocl README.
ROCm documentation changes#
The documentation for the ROCm components has been reorganized and reformatted in a standard look and feel. This improves the usability and readability of the documentation. For more information about the ROCm components, see What is ROCm?.
Since the release of ROCm 6.1, the documentation has added some key topics including:
The following topics have been significantly improved, expanded, or both:
Note
All ROCm projects are open source and available on GitHub. To contribute to ROCm documentation, see the ROCm documentation contribution guidelines.
Operating system and hardware support changes#
ROCm 6.2.0 adds support for the following operating system and kernel versions.
Ubuntu 24.04 LTS (kernel: 6.8 [GA])
RHEL 8.10 (kernel: 4.18.0-544)
SLES 15 SP6 (kernel: 6.4)
ROCm 6.2.0 marks the end of support (EoS) for:
Ubuntu 22.04.3
RHEL 9.2
RHEL 8.8
SLES 15 SP4
CentOS 7.9
ROCm 6.2.0 has been tested against pre-release Ubuntu 22.04.5 (kernel: 6.5 [HWE]).
See the Compatibility matrix for an overview of supported operating systems and hardware architectures.
ROCm components#
The following table lists ROCm components and their individual versions for ROCm 6.2.0. Follow the links in the Version column to go to the detailed component changelogs.
Detailed component changes#
The following sections describe key changes to ROCm components.
AMD SMI (24.6.2)#
Changes#
Added the following functionality:
amd-smi dmon is now available as an alias to amd-smi monitor.
An optional process table under amd-smi monitor -q.
Handling to detect VMs with passthrough configurations in CLI tool.
Process Isolation and Clear SRAM functionality to the CLI tool for VMs.
Added Ring Hang event.
Added macros that were in amdsmi.h to the AMD SMI Python library amdsmi_interface.py.
Renamed amdsmi_set_gpu_clear_sram_data() to amdsmi_clean_gpu_local_data().
Removals#
Removed throttle-status from amd-smi monitor as it is no longer reliably supported.
Removed elevated permission requirements for amdsmi_get_gpu_process_list().
Optimizations#
Updated CLI error strings to specify invalid device type queried.
Multiple structure updates in amdsmi.h and amdsmi_interface.py to align with host/guest.
Added the amdsmi_clk_info_t struct to amdsmi.h and amdsmi_interface.py.
Added AMDSMI prefix to multiple structures.
Updated dpm_policy references to soc_pstate.
Updated amdsmi_get_gpu_board_info() product_name to fall back to the pciids file.
amdsmi_get_gpu_board_info() now has larger structure sizes for amdsmi_board_info_t.
Updated CLI voltage curve command output.
Resolved issues#
Fixed multiple processes not being registered in amd-smi process with JSON and CSV format.
amdsmi_get_gpu_board_info() no longer returns junk character strings.
Fixed parsing of pp_od_clk_voltage within amdsmi_get_gpu_od_volt_info.
Fixed leftover mutex deadlock when running multiple instances of the CLI tool. When running amd-smi reset --gpureset --gpu all and then running an instance of amd-smi static (or any other subcommand that accesses the GPUs), a mutex would lock and not return, requiring either a clear of the mutex in /dev/shm or rebooting the machine.
Known issues#
amdsmi_get_gpu_process_isolation and amdsmi_clean_gpu_local_data commands do not work. They will be supported in a future release.
See issue #3500 on GitHub.
Note
See the detailed AMD SMI changelog on GitHub for more information.
Composable Kernel (1.1.0)#
Changes#
Added support for:
Permute scale for any dimension (#1198).
Combined elementwise op (#1217).
Multi D in grouped convolution backward weight (#1280).
K or C equal to 1 for fp16 in grouped convolution backward weight (#1280).
Large batch in grouped convolution forward (#1332).
Added CK_TILE layernorm example (#1339).
CK_TILE-based Flash Attention 2 kernel is now merged into the upstream repository as ROCm backend.
Optimizations#
Support universal GEMM in grouped convolution forward (#1320).
Optimizations for low M and N in grouped convolution backward weight (#1303).
Added a functional enhancement and compiler bug fix for FlashAttention Forward Kernel.
FP8 GEMM performance optimization and tuning (#1384).
Added FlashAttention backward pass performance optimization (#1397).
HIP (6.2.0)#
Changes#
Added the _sync() version of crosslane builtins such as __shfl_sync(), __all_sync() and __any_sync(). These take a 64-bit integer as an explicit mask argument.
In HIP 6.2, these are hidden behind the preprocessor macro HIP_ENABLE_WARP_SYNC_BUILTINS, and will be enabled unconditionally in a future HIP release.
Added new HIP APIs:
hipGetProcAddress returns the pointer to driver function, corresponding to the defined driver function symbol.
hipGetFuncBySymbol returns the pointer to device entry function that matches entry function symbolPtr.
hipStreamBeginCaptureToGraph begins graph capture on a stream to an existing graph.
hipGraphInstantiateWithParams creates an executable graph from a graph.
Added a new flag integrated – supported in device property.
The integrated flag is added in the struct hipDeviceProp_t. On the integrated APU system, the runtime driver detects and sets this flag to 1, in which case the API hipDeviceGetAttribute returns enum hipDeviceAttribute_t for hipDeviceAttributeIntegrated as value 1, for integrated GPU device.
Added initial support for 8-bit floating point datatype in amd_hip_fp8.h. These are accessible via #include <hip/hip_fp8.h>.
Added UUID support for environment variable HIP_VISIBLE_DEVICES.
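The explicit 64-bit mask taken by the _sync() builtins selects which lanes of a wavefront participate in a vote. The following is a host-side Python model of that idea, assuming a 64-lane wavefront. It is not HIP code, and the function names all_sync, any_sync, and lanes_in_mask are hypothetical stand-ins used only to show how the mask bits map to lanes.

```python
from typing import List, Sequence

WAVEFRONT_SIZE = 64  # Assumed wavefront width for this model.
FULL_MASK = (1 << WAVEFRONT_SIZE) - 1  # All 64 lanes participate.


def lanes_in_mask(mask: int) -> List[int]:
    """Return the lane indices whose bit is set in the mask."""
    return [lane for lane in range(WAVEFRONT_SIZE) if (mask >> lane) & 1]


def all_sync(mask: int, predicates: Sequence[bool]) -> bool:
    """True if the predicate holds on every lane selected by the mask."""
    return all(predicates[lane] for lane in lanes_in_mask(mask))


def any_sync(mask: int, predicates: Sequence[bool]) -> bool:
    """True if the predicate holds on at least one selected lane."""
    return any(predicates[lane] for lane in lanes_in_mask(mask))


if __name__ == "__main__":
    # Predicate is true only on the lower 32 lanes.
    preds = [lane < 32 for lane in range(WAVEFRONT_SIZE)]
    lower_half = (1 << 32) - 1
    print(all_sync(lower_half, preds))  # vote restricted to lanes 0..31
    print(all_sync(FULL_MASK, preds))   # vote across all 64 lanes
    print(any_sync(FULL_MASK, preds))
```

A 32-bit mask could not address lanes 32 through 63, which is why the mask argument is a 64-bit integer.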
Resolved issues#
Fixed stream capture support in HIP graphs. Prohibited and unhandled operations are fixed during stream capture in the HIP runtime.
Fixed undefined symbol error for hipTexRefGetArray and hipTexRefGetBorderColor.
Upcoming changes#
The _sync() version of crosslane builtins such as __shfl_sync(), __all_sync(), and __any_sync() will be enabled unconditionally in a future HIP release.
hipBLAS (2.2.0)#
Changes#
Added a new ILP64 API for level 2 functions for both C and FORTRAN (_64 name suffix) with int64_t function arguments.
Added a new ILP64 API for level 1 _ex functions.
The install.sh script now invokes the rmake.py script. Made other various improvements to the build scripts.
Changed library dependencies in the install.sh script from rocblas and rocsolver to the development packages rocblas-dev and rocsolver-dev.
Updated Linux AOCL dependency to release 4.2 gcc build.
Updated Windows vcpkg dependencies to release 2024.02.14.
hipBLASLt (0.8.0)#
Changes#
Added extension APIs:
hipblasltExtAMaxWithScale.
GemmTuning extension parameter to set wgm by user.
Added support for:
HIPBLASLT_MATMUL_DESC_AMAX_D_POINTER for FP8/BF8 datatype.
FP8/BF8 input, FP32/FP16/BF16/F8/BF8 output (gfx94x platform only).
HIPBLASLT_MATMUL_DESC_COMPUTE_INPUT_TYPE_A_EXT and HIPBLASLT_MATMUL_DESC_COMPUTE_INPUT_TYPE_B_EXT for FP16 input data type to use FP8/BF8 MFMA.
Added support for gfx110x.
Optimizations#
Improved library loading time.
HIPCC (1.1.1)#
Changes#
Split hipcc package into two packages for different hardware platforms.
Cleaned up references to environment variables.
Enabled hipcc and hipconfig binaries (hipcc.bin, hipconfig.bin) by default, instead of their Perl counterparts.
Enabled function calls.
Added support for generating packages for ROCm stack targeting static libraries.
Resolved issues#
Implemented numerous bug fixes and quality improvements.
hipCUB (3.2.0)#
Changes#
Added DeviceCopy function for parity with CUB.
Added enum WarpExchangeAlgorithm to the rocPRIM backend, which is used as the new optional template argument for WarpExchange.
The potential values for the enum are WARP_EXCHANGE_SMEM and WARP_EXCHANGE_SHUFFLE.
WARP_EXCHANGE_SMEM stands for the previous algorithm, while WARP_EXCHANGE_SHUFFLE performs the exchange via shuffle operations.
WARP_EXCHANGE_SHUFFLE does not require any pre-allocated shared memory, but the ItemsPerThread must be a divisor of WarpSize.
Added tuple.hpp which defines templates hipcub::tuple, hipcub::tuple_element, hipcub::tuple_element_t and hipcub::tuple_size.
Added new overloaded member functions to BlockRadixSort and DeviceRadixSort that expose a decomposer argument. Keys of a custom type (key_type) can be sorted via these overloads, if an appropriate decomposer is passed. The decomposer has to implement operator(const key_type&) which returns a hipcub::tuple of references pointing to members of key_type.
On AMD GPUs (using the HIP backend), you can now issue hipCUB API calls inside of HIP graphs, with several exceptions:
CachingDeviceAllocator
GridBarrierLifetime
DeviceSegmentedRadixSort
DeviceRunLengthEncode
Currently, these classes rely on one or more synchronous calls to function correctly. Because of this, they cannot be used inside of HIP graphs.
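The decomposer concept described above can be sketched in plain Python as a conceptual analog. This is not hipCUB code: the Key type and its fields are hypothetical, and the sketch assumes that the leftmost tuple member is treated as the most significant part of the key.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Key:
    """Hypothetical custom key type with two sortable members."""
    major: int
    minor: int


def decomposer(key: Key) -> Tuple[int, int]:
    """Expose the members that define the sort order, most significant first."""
    return (key.major, key.minor)


def sort_keys(keys: List[Key]) -> List[Key]:
    """Sort custom keys by comparing their decomposed member tuples."""
    return sorted(keys, key=decomposer)


if __name__ == "__main__":
    keys = [Key(2, 1), Key(1, 9), Key(2, 0), Key(1, 3)]
    for k in sort_keys(keys):
        print(k)
```

In the C++ API the decomposer returns references rather than copies, so the sort can read the key members in place. The Python analog only mirrors the ordering behavior.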
Removals#
debug_synchronous was deprecated in hipCUB-2.13.2 and no longer has any effect. With this release, passing debug_synchronous to the device functions results in a deprecation warning both at runtime and at compile time.
The synchronization that was previously achievable by passing debug_synchronous=true can now be achieved at compile time by setting the CUB_DEBUG_SYNC (or higher debug level) or the HIPCUB_DEBUG_SYNC preprocessor definition.
The compile time deprecation warnings can be disabled by defining the HIPCUB_IGNORE_DEPRECATED_API preprocessor definition.
Resolved issues#
Fixed the derivation for the accumulator type for device scan algorithms in the rocPRIM backend being different compared to CUB. It now derives the accumulator type as the result of the binary operator.
hipFFT (1.0.15)#
Resolved issues#
Added hip::host as a public link library, as hipfft.h includes HIP runtime headers.
Prevented C++ exceptions leaking from public API functions.
Made output of hipfftXt match cufftXt in geometry and alignment for 2D and 3D FFTs.
HIPIFY (18.0.0)#
Changes#
Added support for:
NVIDIA CUDA 12.4.1
cuDNN 9.1.1
LLVM 18.1.6
Added full hipBLASLt support.
Resolved issues#
HIPIFY now applies reinterpret_cast for an explicit conversion between pointer-to-function and pointer-to-object; affected functions: hipFuncGetAttributes, hipFuncSetAttribute, hipFuncSetCacheConfig, hipFuncSetSharedMemConfig, hipLaunchKernel, and hipLaunchCooperativeKernel.
hipRAND (2.11.0)#
Changes#
Added support for setting generator output ordering in C and C++ API.
hiprandCreateGeneratorHost dispatches to the host generator in the rocRAND backend instead of returning with HIPRAND_STATUS_NOT_IMPLEMENTED.
Added options to create:
A host generator to the Fortran wrapper.
A host generator to the Python wrapper.
For internal testing with HMM, previous versions used the environment variable ROCRAND_USE_HMM. The environment variable is now named HIPRAND_USE_HMM.
Static library – moved all internal symbols to namespaces to avoid potential symbol name collisions when linking.
Device API documentation is improved in this version.
Removals#
Removed the option to build hipRAND as a submodule to rocRAND.
Removed references to, and workarounds for, the deprecated hcc.
Removed support for finding rocRAND based on the environment variable ROCRAND_DIR. Use ROCRAND_PATH instead.
Resolved issues#
Fixed a build error when using Clang++ directly due to unsupported references to amdgpu-target.
hipSOLVER (2.2.0)#
Changes#
Added compatibility-only functions:
auxiliary
hipsolverDnCreateParams, hipsolverDnDestroyParams, hipsolverDnSetAdvOptions
getrf
hipsolverDnXgetrf_bufferSize
hipsolverDnXgetrf
getrs
hipsolverDnXgetrs
Added support for building on Ubuntu 24.04 and CBL-Mariner.
Added hip::host to roc::hipsolver usage requirements.
Added functions:
syevdx/heevdx
hipsolverSsyevdx_bufferSize, hipsolverDsyevdx_bufferSize, hipsolverCheevdx_bufferSize, hipsolverZheevdx_bufferSize
hipsolverSsyevdx, hipsolverDsyevdx, hipsolverCheevdx, hipsolverZheevdx
sygvdx/hegvdx
hipsolverSsygvdx_bufferSize, hipsolverDsygvdx_bufferSize, hipsolverChegvdx_bufferSize, hipsolverZhegvdx_bufferSize
hipsolverSsygvdx, hipsolverDsygvdx, hipsolverChegvdx, hipsolverZhegvdx
Updated csrlsvchol to perform numerical factorization on the GPU. The symbolic factorization is still performed on the CPU.
Renamed hipsolver-compat.h to hipsolver-dense.h.
Removals#
Removed dependency on cblas from the hipSOLVER test and benchmark clients.
hipSPARSE (3.1.1)#
Changes#
Added the missing hipsparseCscGet() routine.
All internal hipSPARSE functions now exist inside a namespace.
Matched deprecations found in cuSPARSE 12.x.x when using cuSPARSE backend.
Improved the user manual and contribution guidelines.
Resolved issues#
Fixed SpGEMM and SpGEMM_reuse routines that were not matching cuSPARSE behavior.
Known issues#
In hipsparseSpSM_solve(), the external buffer is currently passed as a parameter. This does not match the cuSPARSE API and this extra external buffer parameter will be removed in a future release. For now, this extra parameter can be ignored and passed as nullptr because it is unused internally by hipsparseSpSM_solve().
hipSPARSELt (0.2.1)#
Optimizations#
Refined test cases.
hipTensor (1.3.0)#
Changes#
Added support for:
Tensor permutation of ranks of 2, 3, 4, 5, and 6
Tensor contraction of M6N6K6: M, N, K up to rank 6
Added tests for:
Tensor permutation of ranks of 2, 3, 4, 5, and 6
Tensor contraction of M6N6K6: M, N, K up to rank 6
YAML parsing to support sequential parameters ordering.
Prefer amd-llvm-devel package before system LLVM library.
Preferred compilers changed to CC=amdclang CXX=amdclang++.
Updated actor-critic selection for new contraction kernel additions.
Updated installation, programmer’s guide, and API reference documentation.
Resolved issues#
Fixed LLVM parsing crash.
Fixed memory consumption issue in complex kernels.
Workaround implemented for compiler crash during debug build.
Allow random modes ordering for tensor contractions.
llvm-project (18.0.0)#
Changes#
LLVM IR
The llvm.stacksave and llvm.stackrestore intrinsics now use an overloaded pointer type to support non-0 address spaces.
Added llvm.exp10 intrinsic.
LLVM infrastructure
The minimum Clang version to build LLVM in C++20 configuration is now clang-17.0.6.
TableGen
AArch64 backend
Added support for Cortex-A520, Cortex-A720 and Cortex-X4 CPUs.
AMDGPU backend
llvm.sqrt.f32 is now lowered correctly. Use llvm.amdgcn.sqrt.f32 for raw instruction access.
Implemented llvm.stacksave and llvm.stackrestore intrinsics.
Implemented llvm.get.rounding.
ARM backend
Added support for Cortex-M52 CPUs.
Added execute-only support for Armv6-M.
RISC-V backend
The Zfa extension version was upgraded to 1.0 and is no longer experimental.
The Zihintntl extension version was upgraded to 1.0 and is no longer experimental.
Intrinsics were added for Zk*, Zbb, and Zbc. See Scalar Bit Manipulation Extension Intrinsics in the RISC-V C API specification.
Default ABI with F but without D was changed to ilp32f for RV32 and to lp64f for RV64.
The Zvbb, Zvbc, Zvkb, Zvkg, Zvkn, Zvknc, Zvkned, Zvkng, Zvknha, Zvknhb, Zvks, Zvksc, Zvksed, Zvksg, Zvksh, and Zvkt extension versions were upgraded to 1.0 and are no longer experimental. However, the C intrinsics for these extensions are still experimental. To use the C intrinsics for these extensions, -menable-experimental-extensions needs to be passed to Clang.
-mcpu=sifive-p450 was added.
CodeGen of RV32E and RV64E is supported experimentally.
CodeGen of ilp32e and lp64e is supported experimentally.
X86 backend
Added support for the RDMSRLIST and WRMSRLIST instructions.
Added support for the WRMSRNS instruction.
Support ISA of AMX-FP16 which contains tdpfp16ps instruction.
Support ISA of CMPCCXADD.
Support ISA of AVX-IFMA.
Support ISA of AVX-VNNI-INT8.
Support ISA of AVX-NE-CONVERT.
-mcpu=raptorlake, -mcpu=meteorlake and -mcpu=emeraldrapids are now supported.
-mcpu=sierraforest, -mcpu=graniterapids and -mcpu=grandridge are now supported.
__builtin_unpredictable (unpredictable metadata in LLVM IR), is handled by X86 Backend. X86CmovConversion pass now respects this builtin and does not convert CMOVs to branches.
Add support for the PBNDKB instruction.
Support ISA of SHA512.
Support ISA of SM3.
Support ISA of SM4.
Support ISA of AVX-VNNI-INT16.
-mcpu=graniterapids-d is now supported.
The i128 type now matches GCC and clang’s __int128 type. This mainly benefits external projects such as Rust which aim to be binary compatible with C, but also fixes code generation where LLVM already assumed that the type matched and called into libgcc helper functions.
Support ISA of USER_MSR.
Support ISA of AVX10.1-256 and AVX10.1-512.
-mcpu=pantherlake and -mcpu=clearwaterforest are now supported.
-mapxf is supported.
Marking global variables with code_model = "small"/"large" in the IR now overrides the global code model to allow 32-bit relocations or require 64-bit relocations to the global variable.
The medium code model’s code generation was audited to be more similar to the small code model where possible.
C API
Added LLVMGetTailCallKind and LLVMSetTailCallKind to allow getting and setting tail, musttail, and notail attributes on call instructions.
Added LLVMCreateTargetMachineWithOptions, along with helper functions for an opaque option structure, as an alternative to LLVMCreateTargetMachine. The option structure exposes an additional setting (that is, the target ABI) and provides default values for unspecified settings.
Added LLVMGetNNeg and LLVMSetNNeg for getting and setting the new nneg flag on zext instructions, and LLVMGetIsDisjoint and LLVMSetIsDisjoint for getting and setting the new disjoint flag on or instructions.
Added the following functions for manipulating operand bundles, as well as building call and invoke instructions that use operand bundles:
LLVMBuildCallWithOperandBundles
LLVMBuildInvokeWithOperandBundles
LLVMCreateOperandBundle
LLVMDisposeOperandBundle
LLVMGetNumOperandBundles
LLVMGetOperandBundleAtIndex
LLVMGetNumOperandBundleArgs
LLVMGetOperandBundleArgAtIndex
LLVMGetOperandBundleTag
Added LLVMGetFastMathFlags and LLVMSetFastMathFlags for getting and setting the fast-math flags of an instruction, as well as LLVMCanValueUseFastMathFlags for checking if an instruction can use such flags.
CodeGen infrastructure
A new debug type isel-dump is added to show only the SelectionDAG dumps after each ISel phase (i.e. -debug-only=isel-dump). This new debug type can be filtered by function names using -filter-print-funcs=<function names>, the same flag used to filter IR dumps after each Pass. Note that the existing -debug-only=isel will take precedence over the new behavior and print SelectionDAG dumps of every single function regardless of -filter-print-funcs’s values.
Metadata info
Added a new loop metadata !{!"llvm.loop.align", i32 64}.
LLVM tools
llvm-symbolizer now treats invalid input as an address for which source information is not found.
llvm-readelf now supports --extra-sym-info (-X) to display extra information (section name) when showing symbols.
llvm-readobj --elf-output-style=JSON no longer prefixes each JSON object with the file name. Previously, each object file’s output looked like "main.o":{"FileSummary":{"File":"main.o"},...} but is now {"FileSummary":{"File":"main.o"},...}. This allows each JSON object to be parsed in the same way, since each object no longer has a unique key. Tools that consume llvm-readobj’s JSON output should update their parsers accordingly.
llvm-objdump now uses --print-imm-hex by default, which brings its default behavior closer in line with objdump.
llvm-nm now supports the --line-numbers (-l) option to use debugging information to print symbols’ filenames and line numbers.
llvm-symbolizer and llvm-addr2line now support addresses specified as symbol names.
llvm-objcopy now supports --gap-fill and --pad-to options, for ELF input and binary output files only.
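For tools that must consume llvm-readobj JSON from both older and newer LLVM versions, one option is to normalize each object before further processing. The following is a minimal sketch of such a parser. It assumes the output is a JSON array of per-file objects and uses only the FileSummary key shown in the examples above; the function names unwrap_object and load_objects are hypothetical.

```python
import json
from typing import Any, Dict, List


def unwrap_object(entry: Dict[str, Any]) -> Dict[str, Any]:
    """Return the per-file object, accepting both the old and new layout."""
    if "FileSummary" in entry:
        # New layout: the object is not prefixed with the file name.
        return entry
    if len(entry) == 1:
        # Old layout: a single key (the file name) wraps the object.
        (inner,) = entry.values()
        if isinstance(inner, dict) and "FileSummary" in inner:
            return inner
    raise ValueError("unrecognized llvm-readobj JSON object")


def load_objects(text: str) -> List[Dict[str, Any]]:
    """Parse llvm-readobj JSON text into a list of normalized objects."""
    data = json.loads(text)
    entries = data if isinstance(data, list) else [data]
    return [unwrap_object(entry) for entry in entries]


if __name__ == "__main__":
    old_style = '[{"main.o": {"FileSummary": {"File": "main.o"}}}]'
    new_style = '[{"FileSummary": {"File": "main.o"}}]'
    print(load_objects(old_style))
    print(load_objects(new_style))
```

With this normalization, downstream code can read the file name from FileSummary in either case instead of relying on the outer key.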
LLDB
SBType::FindDirectNestedType function is added. It’s useful for formatters to quickly find directly nested type when it’s known where to search for it, avoiding more expensive global search via SBTarget::FindFirstType.
Renamed lldb-vscode to lldb-dap and updated its installation instructions to reflect this. The underlying functionality remains unchanged.
The mte_ctrl register can now be read from AArch64 Linux core files.
LLDB on AArch64 Linux now supports debugging the Scalable Matrix Extension (SME) and Scalable Matrix Extension 2 (SME2) for both live processes and core files. For details refer to the AArch64 Linux documentation.
LLDB now supports symbol and binary acquisition automatically using the DEBUGINFOD protocol. The standard mechanism of specifying DEBUGINFOD servers in the DEBUGINFOD_URLS environment variable is used by default. In addition, users can specify servers to request symbols from using the LLDB setting plugin.symbol-locator.debuginfod.server_urls, overriding or adding to the environment variable.
When running on AArch64 Linux, lldb-server now provides register field information for the following registers: cpsr, fpcr, fpsr, svcr and mte_ctrl.
Sanitizers
HWASan now defaults to detecting use-after-scope bugs.
Removals#
LLVM IR
The constant expression variants of the following instructions have been removed:
and
or
lshr
ashr
zext
sext
fptrunc
fpext
fptoui
fptosi
uitofp
sitofp
RISC-V backend
XSfcie extension and SiFive CSRs and instructions that were associated with it have been removed. None of these CSRs and instructions were part of “SiFive Custom Instruction Extension”. The LLVM project needs to work with SiFive to define and document real extension names for individual CSRs and instructions.
Python bindings
The Python bindings have been removed.
C API
The following functions for creating constant expressions have been removed, because the underlying constant expressions are no longer supported. Instead, an instruction should be created using the LLVMBuildXYZ APIs, which will constant fold the operands if possible and create an instruction otherwise:
LLVMConstAnd
LLVMConstOr
LLVMConstLShr
LLVMConstAShr
LLVMConstZExt
LLVMConstSExt
LLVMConstZExtOrBitCast
LLVMConstSExtOrBitCast
LLVMConstIntCast
LLVMConstFPTrunc
LLVMConstFPExt
LLVMConstFPToUI
LLVMConstFPToSI
LLVMConstUIToFP
LLVMConstSIToFP
LLVMConstFPCast
CodeGen infrastructure
PrologEpilogInserter no longer supports register scavenging during forwards frame index elimination. Targets should use backwards frame index elimination instead.
RegScavenger no longer supports forwards register scavenging. Clients should use backwards register scavenging instead, which is preferred because it does not depend on accurate kill flags.
LLDB
SBWatchpoint::GetHardwareIndex is deprecated and now returns -1 to indicate the index is unavailable.
Methods in SBHostOS related to threads have had their implementations removed. These methods will return a value indicating failure.
Resolved issues#
AArch64 backend
Neoverse-N2 was incorrectly marked as an Armv8.5a core. This has been changed to an Armv9.0a core. However, crypto options are not enabled by default for Armv9 cores, so -mcpu=neoverse-n2+crypto is now required to enable crypto for this core. As far as the compiler is concerned, Armv9.0a has the same features enabled as Armv8.5a, with the exception of crypto.
Windows target
The LLVM filesystem class UniqueID and function equivalent() no longer determine that distinct path names for the same hard linked file are equal. This is an intentional tradeoff in a bug fix, where the bug used to cause distinct files to be considered equivalent on some file systems. This change fixed the GitHub issues #61401 and #22079.
Known issues#
The compiler may incorrectly compile a program that uses the
__shfl(var, srcLane, width)
function when one of the parameters to
the function is undefined along some path to the function. For most functions,
uninitialized inputs cause undefined behavior.
Note
The -Wall
compilation flag prompts the compiler to generate a warning if a variable is uninitialized along some path.
As a workaround, initialize the parameters to __shfl. For example:
unsigned long istring = 0; // Initialize the input to __shfl
return __shfl(istring, 0, 64);
See issue #3499 on GitHub.
MIGraphX (2.10.0)#
Changes#
Added support for ONNX Runtime MIGraphX EP on Windows.
Added FP8 Python API.
Added examples for SD 2.1 and SDXL.
Added support for BERT to Dynamic Batch.
Added a --test flag in migraphx-driver to validate the installation.
Added support for ONNX Operator: Einsum.
Added uint8 support in ONNX Operators.
Added Split-k kernel configurations for performance improvements.
Added fusion for group convolutions.
Added rocMLIR conv3d support.
Added rocgdb to the Dockerfile.
Changed default location of libraries with release specific ABI changes.
Reorganized documentation in GitHub.
Removals#
Removed the --model flag from migraphx-driver.
Optimizations#
Improved ONNX Model Zoo coverage.
Reorganized memcpys with ONNX Runtime to improve performance.
Replaced scalar multibroadcast + unsqueeze with just a multibroadcast.
Improved MLIR kernel selection for multibroadcasted GEMMs.
Improved details of the perf report.
Enabled MLIR by default for GEMMs with small K.
Allowed specifying dot or convolution fusion for MLIR with an environment flag.
Improved performance on small reductions by doing multiple reductions per wavefront.
Added additional algebraic simplifications for mul-add-dot sequences of operations involving constants.
Used MLIR attention kernels in more cases.
Enabled MIOpen and CK fusions for MI300 gfx architectures.
Added support for QDQ quantization patterns from Brevitas that have explicit cast/convert nodes before and after QDQ pairs.
Added fusion of “contiguous + pointwise” and “layout + pointwise” operations, which may result in performance gains in certain cases.
Added fusion for “pointwise + layout” and “pointwise + contiguous” operations, which may result in performance gains when using NHWC layout.
Added fusion for the “pointwise + concat” operation, which may help performance in certain cases.
Fixed a bug in “concat + pointwise” fusion where the output shape memory layout wasn’t maintained.
Simplified the “slice + concat” pattern in SDXL UNet.
Removed ZeroPoint/Shift in QuantizeLinear or DeQuantizeLinear ops if the zero-point values are zeros.
Improved inference performance by fusing Reduce to Broadcast.
Added additional information when printing the perf report.
Improved scalar fusions when not all strides are 0.
Added support for multi outputs in pointwise ops.
Improved reduction fusion with reshape operators.
Used the quantized output when an operator is used again.
Enabled Split-k GEMM perf configs for rocMLIR-based GEMM kernels for better performance on all hardware.
Resolved issues#
Fixed Super Resolution model verification failure with FP16.
Fixed confusing messages by suppressing them when compiling the model.
Fixed an issue with the mod operator with int8 and int32 inputs.
Fixed an issue by preventing the spawning of too many threads for constant propagation when parallel STL is not enabled.
Fixed a bug when running migraphx-driver with the --run 1 option.
Fixed Layernorm accuracy: calculations are now done in FP32.
Updated the Docker generator script to ROCm 6.1 to point at Jammy.
Fixed a floating point exception for dim (-1) in the reshape operator.
Fixed an issue with int8 accuracy and models which were failing due to requiring a fourth bias input.
Fixed missing inputs not previously handled for quantized bias for the weights, and data values of the input matrix.
Fixed the order of operations for int8 quantization, which was causing inaccuracies and slowdowns.
Fixed an issue during compilation caused by the incorrect constructor being used at compile time. Removed list initializer of prefix_scan_sum which was causing issues during compilation.
Fixed the MIGRAPHX_GPU_COMPILE_PARALLEL flag to enable users to control the number of threads used for parallel compilation.
MIOpen (3.2.0)#
Changes#
Added:
[Conv] bilinear (alpha beta) solvers.
[Conv] enable bf16 for ck-based solvers.
[Conv] Add split_k tuning to 2d wrw ck-based solver.
[MHA] graph API fp8 fwd.
[RNN] multi-stream as default solution.
Added TunaNetv2.0 for MI300.
Added Adam and AMP Adam optimizer.
Resolved issues#
Memory access fault caused by GemmBwdRest.
Context configuration in GetWorkSpaceSize.
Fixes to support huge tensors.
Optimizations#
Find: improved precision of benchmarking.
MIVisionX (3.0.0)#
Changes#
Added support for:
Advanced GPUs
PreEmphasis Filter augmentation in openVX extensions
Spectrogram augmentation in openVX extensions
Downmix and ToDecibels augmentations in openVX extensions
Resample augmentation and Operator overloading nodes in openVX extensions
NonSilentRegion and Slice augmentations in openVX extensions
Mel-Filter bank and Normalize augmentations in openVX extensions
Removals#
Deprecated the use of rocAL for processing. rocAL is available at ROCm/rocAL.
Resolved issues#
Fixed issues with dependencies.
Known issues#
MIVisionX package install requires manual prerequisites installation.
Omniperf (2.0.1)#
Known issues#
Error when running Omniperf with an application with command line arguments. As a workaround, create an intermediary script to call the application with the necessary arguments, then call the script with Omniperf. This issue will be fixed in a future release of Omniperf. See #347.
Omniperf might not work with AMD Instinct MI300 accelerators out of the box, resulting in the following error: “ERROR gfx942 is not enabled rocprofv1. Available profilers include: [‘rocprofv2’]”. As a workaround, set the following environment variable:
export ROCPROF=rocprofv2
Omniperf’s Python dependencies may not be installed with your ROCm installation, resulting in the following message:
“[ERROR] The ‘dash>=1.12.0’ package was not found in the current execution environment.
[ERROR] The ‘dash-bootstrap-components’ package was not found in the current execution environment.
Please verify all of the Python dependencies called out in the requirements file are installed locally prior to running omniperf.
See: /opt/rocm-6.2.0/libexec/omniperf/requirements.txt”
As a workaround, install these Python requirements manually:
pip install -r /opt/rocm-6.2.0/libexec/omniperf/requirements.txt
See issue #3498 on GitHub.
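The first two workarounds above can be sketched as a short shell session. This is only an illustration: the script name run_app.sh, the application ./my_app, and its options are hypothetical placeholders, and the omniperf profile invocation is an assumed command line that should be checked against the Omniperf documentation for your installed version.

```shell
#!/bin/sh
# Sketch of the first two workarounds; file and application names are placeholders.

# 1. Intermediary script that calls the application with its arguments.
#    ./my_app and its options are hypothetical; substitute your own command line.
cat > run_app.sh <<'EOF'
#!/bin/sh
exec ./my_app --size 1024 --iterations 10
EOF
chmod +x run_app.sh

# 2. Select rocprofv2, as suggested for MI300 accelerators.
export ROCPROF=rocprofv2

# 3. Profile the script rather than the application (assumed invocation,
#    left commented out so the sketch runs without Omniperf installed):
#    omniperf profile -n my_workload -- ./run_app.sh
```

Keeping the arguments inside the intermediary script means Omniperf only ever sees a single argument-free command, which is what the workaround relies on.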
OpenMP (17.0.0)#
Changes#
Added basic experimental support for libc functions on the GPU via the LLVM C Library for GPUs.
Added minimal support for calling host functions from the device using the libc interface.
Added vendor agnostic OMPT callback support for OpenMP-based device offload.
Removals#
Removed the “old” device plugins along with support for the remote and ve plugins.
Resolved issues#
Fixed the implementation of
omp_get_wtime
for AMDGPU targets.
RCCL (2.20.5)#
Changes#
Added support for fp8 and rccl_bfloat8.
Added support for using HIP contiguous memory.
Added ROC-TX for host-side profiling.
Added new rome model.
Added fp16 and fp8 cases to unit tests.
Added a new unit test for main kernel stack size.
Added the new -n option for topo_expl to override the number of nodes.
Improved debug messages of memory allocations.
Enabled static build.
Enabled compatibility with:
NCCL 2.20.5.
NCCL 2.19.4.
Performance tuning for some collective operations on MI300.
Enabled NVTX code in RCCL.
Replaced rccl_bfloat16 with hip_bfloat16.
NPKit updates:
Warm-up iteration removal is no longer enabled by default; you must now opt in.
Doubled the size of buffers to accommodate for more channels.
Modified rings to be rail-optimized topology friendly.
Resolved issues#
Fixed a bug when configuring RCCL for only LL128 protocol.
Fixed scratch memory allocation after API change for MSCCL.
rocAL (1.0.0)#
Changes#
Added tests and samples.
Removals#
Removed CuPy from
setup.py
.
Optimizations#
Added setup and install updates.
Resolved issues#
Minor bug fixes.
rocALUTION (3.2.0)#
Changes#
Added new file I/O based on rocSPARSE I/O format.
Added
GetConvergenceHistory
for ItILU0 preconditioner.
Removals#
Deprecated the following:
LocalMatrix::ReadFileCSR
LocalMatrix::WriteFileCSR
GlobalMatrix::ReadFileCSR
GlobalMatrix::WriteFileCSR
rocBLAS (4.2.0)#
Changes#
Level 2 functions and Level 3 trsm now have additional ILP64 APIs for both C and Fortran (_64 name suffix) with int64_t function arguments.
Added cache flush timing for gemm_batched_ex, gemm_strided_batched_ex, and axpy.
Added Benchmark class for common timing code.
Added an environment variable, ROCBLAS_DEFAULT_ATOMICS_MODE, to set default atomics mode during creation of rocblas_handle.
Added support for single-precision (fp32_r) input and double-precision (fp64_r) output and compute types by extending dot_ex.
Updated Linux AOCL dependency to release 4.2 gcc build.
Updated Windows vcpkg dependencies to release 2024.02.14.
Increased default device workspace from 32 to 128 MiB for architecture gfx9xx with xx >= 40.
Optimizations#
Improved performance of Level 1 dot_batched and dot_strided_batched for all precisions. Performance improved sixfold for larger problem sizes, as measured on an Instinct MI210 accelerator.
Removals#
Deprecated rocblas_gemm_ex3, gemm_batched_ex3, and gemm_strided_batched_ex3. They will be removed in the next major release of rocBLAS. Refer to hipBLASLt for future 8-bit float usage.
ROCdbgapi (0.76.0)#
Removals#
Renamed (AMD_DBGAPI_EXCEPTION_WAVE,AMD_DBGAPI_WAVE_STOP_REASON)_APERTURE_VIOLATION to (AMD_DBGAPI_EXCEPTION_WAVE,AMD_DBGAPI_WAVE_STOP_REASON)_ADDRESS_ERROR. The old names are still accessible but deprecated.
rocDecode (0.6.0)#
Changes#
Added full H.264 support and bug fixes.
rocFFT (1.0.28)#
Changes#
Randomly generated accuracy tests are now disabled by default. They can be enabled using the
--nrand
option (which defaults to 0).
Optimizations#
Implemented multi-device transform for 3D pencil decomposition. Contiguous dimensions on input and output bricks are transformed locally, with global transposes to make remaining dimensions contiguous.
rocm-cmake (0.13.0)#
Changes#
ROCmCreatePackage now accepts a suffix parameter, automatically generating it for static or ASAN builds.
Package names are no longer pulled from CPACK_<GEN>_PACKAGE_NAME.
Runtime packages will no longer be generated for static builds.
ROCm Data Center Tool (1.0.0)#
Changes#
Added ROCProfiler dmon metrics.
Added new ECC metrics.
Added ROCm Validation Suite diagnostic command.
Fully migrated to AMD SMI.
Removals#
Removed RASLIB dependency and blobs.
Removed
rocm_smi_lib
dependency due to migration to AMD SMI.
ROCm Debugger (ROCgdb) (14.2)#
Changes#
Introduced the coremerge utility to merge a host core dump and a GPU-only AMDGPU core dump into a unified AMDGPU corefile.
Added support for generating and opening core files for heterogeneous processes.
ROCm SMI (7.3.0)#
Changes#
Added Partition ID API (
rsmi_dev_partition_id_get(..)
).
Resolved issues#
Fixed Partition ID CLI output.
Note
See the detailed ROCm SMI changelog on GitHub for more information.
ROCm Validation Suite (1.0.0)#
Changes#
Added stress tests:
IET (power) stress test for MI300A.
IET (power transition) test for MI300X.
Added support:
GEMM self-check and accuracy-check support for checking consistency and accuracy of GEMM output.
Trigonometric float and random integer matrix data initialization support.
Updated GST performance benchmark test for better numbers.
rocPRIM (3.2.0)#
Changes#
Added new overloads for warp_scan::exclusive_scan that take no initial value. These new overloads will write an unspecified result to the first value of each warp.
The internal accumulator type of inclusive_scan(_by_key) and exclusive_scan(_by_key) is now exposed as an optional type parameter.
The default accumulator type is still the value type of the input iterator (inclusive scan) or the initial value’s type (exclusive scan). This is the same behaviour as before this change.
Added a new overload for device_adjacent_difference_inplace that allows separate input and output iterators, but allows them to point to the same element.
Added new public APIs for deriving resulting type on device-only functions:
rocprim::invoke_result
rocprim::invoke_result_t
rocprim::invoke_result_binary_op
rocprim::invoke_result_binary_op_t
Added the new rocprim::batch_copy function. Similar to rocprim::batch_memcpy, but copies by element, not with memcpy.
Added more test cases, to better cover supported data types.
Added an optional decomposer argument for all member functions of rocprim::block_radix_sort and all functions of device_radix_sort. To sort keys of a user-defined type, a decomposer functor should be passed. The decomposer should produce a rocprim::tuple of references to arithmetic types from the key.
Added rocprim::predicate_iterator, which acts as a proxy for an underlying iterator based on a predicate. It iterates over proxies that hold the references to the underlying values, but only allow reading and writing if the predicate is true. It can be instantiated with:
rocprim::make_predicate_iterator
rocprim::make_mask_iterator
Added custom radix sizes as the last parameter for block_radix_sort. The default value is 4; it can be a number between 0 and 32.
Added rocprim::radix_key_codec, which allows the encoding/decoding of keys for radix-based sorts. For user-defined key types, a decomposer functor should be passed.
Updated some tests to work with supported data types.
Optimizations#
Improved the performance of warp_sort_shuffle and block_sort_bitonic.
Created an optimized version of the warp_exchange functions blocked_to_striped_shuffle and striped_to_blocked_shuffle when the warpsize is equal to the items per thread.
Resolved issues#
Fixed incorrect results of warp_exchange::blocked_to_striped_shuffle and warp_exchange::striped_to_blocked_shuffle when the block size is larger than the logical warp size. The test suite has been updated with such cases.
Fixed incorrect results returned when calling device unique_by_key with overlapping values_input and values_output.
Fixed incorrect output type used in device_adjacent_difference.
Fixed an issue causing incorrect results on the GFX10 (RDNA1, RDNA2) ISA and GFX11 ISA on device scan algorithms rocprim::inclusive_scan(_by_key) and rocprim::exclusive_scan(_by_key) with large input types.
Fixed an issue with device_adjacent_difference. It now considers both the input and the output type for selecting the appropriate kernel launch config. Previously only the input type was considered, which could result in compilation errors due to excessive shared memory usage.
Fixed incorrect data being loaded with rocprim::thread_load when compiling with -O0.
Fixed a compilation failure in the host compiler when instantiating various block and device algorithms with block sizes not divisible by 64.
Removals#
Deprecated the internal header detail/match_result_type.hpp.
Deprecated TwiddleIn and TwiddleOut in favor of radix_key_codec.
Deprecated the internal ::rocprim::detail::radix_key_codec in favor of a new public utility with the same name.
ROCProfiler (2.0.0)#
Removals#
Removed
pcsampler
sample code due to deprecation from version 2.
rocRAND (3.1.0)#
Changes#
Added rocrand_create_generator_host.
The following generators are supported:
ROCRAND_RNG_PSEUDO_MRG31K3P
ROCRAND_RNG_PSEUDO_MRG32K3A
ROCRAND_RNG_PSEUDO_PHILOX4_32_10
ROCRAND_RNG_PSEUDO_THREEFRY2_32_20
ROCRAND_RNG_PSEUDO_THREEFRY2_64_20
ROCRAND_RNG_PSEUDO_THREEFRY4_32_20
ROCRAND_RNG_PSEUDO_THREEFRY4_64_20
ROCRAND_RNG_PSEUDO_XORWOW
ROCRAND_RNG_QUASI_SCRAMBLED_SOBOL32
ROCRAND_RNG_QUASI_SCRAMBLED_SOBOL64
ROCRAND_RNG_QUASI_SOBOL32
ROCRAND_RNG_QUASI_SOBOL64
The host-side generators support multi-core processing. On Linux, this requires the TBB (Threading Building Blocks) development package to be installed on the system when building rocRAND (libtbb-dev on Ubuntu and derivatives).
If TBB is not found when configuring rocRAND, the configuration is still successful, and the host generators are executed on a single CPU thread.
Added the option to create a host generator to the Python wrapper.
Added the option to create a host generator to the Fortran wrapper.
Added dynamic ordering. This ordering is free to rearrange the produced numbers, which can be specific to devices and distributions. It is implemented for:
XORWOW, MRG32K3A, MTGP32, Philox 4x32-10, MRG31K3P, LFSR113, and ThreeFry
Added support for using Clang as the host compiler for alternative platform compilation.
C++ wrapper:
Added support for lfsr113_engine being constructed with a seed of type unsigned long long, not only uint4.
Added optional order parameter to the constructor of mt19937_engine.
Added the following functions for the ROCRAND_RNG_PSEUDO_MTGP32 generator:
rocrand_normal2
rocrand_normal_double2
rocrand_log_normal2
rocrand_log_normal_double2
Added rocrand_create_generator_host_blocking which dispatches without stream semantics.
Added host-side generator for ROCRAND_RNG_PSEUDO_MTGP32.
Added offset and skipahead functionality to LFSR113 generator.
Added dynamic ordering for architecture gfx1102.
For device-side generators, you can now wrap calls to rocrand_generate_* inside of a hipGraph. There are a few things to be aware of:
Generator creation (rocrand_create_generator), initialization (rocrand_initialize_generator), and destruction (rocrand_destroy_generator) must still happen outside the hipGraph.
After the generator is created, you may call API functions to set its seed, offset, and order.
After the generator is initialized (but before stream capture or manual graph creation begins), use rocrand_set_stream to set the stream the generator will use within the graph.
A generator’s seed, offset, and stream may not be changed from within the hipGraph. Attempting to do so may result in unpredictable behaviour.
API calls for the poisson distribution (for example, rocrand_generate_poisson) are not yet supported inside of hipGraphs.
For sample usage, see the unit tests in test/test_rocrand_hipgraphs.cpp.
Building rocRAND now requires a C++17 capable compiler, as the internal library sources now require it. However consuming rocRAND is still possible from C++11 as public headers don’t make use of the new features.
Building rocRAND should be faster on machines with multiple CPU cores as the library has been split to multiple compilation units.
C++ wrapper: the min() and max() member functions of the generators and distributions are now static constexpr.
Renamed and unified the existing ROCRAND_DETAIL_.*_BM_NOT_IN_STATE to ROCRAND_DETAIL_BM_NOT_IN_STATE.
Static and dynamic library: moved all internal symbols to namespaces to avoid potential symbol name collisions when linking.
Removals#
Deprecated the following typedefs. Please use the unified state_type alias instead.
rocrand_device::threefry2x32_20_engine::threefry2x32_20_state
rocrand_device::threefry2x64_20_engine::threefry2x64_20_state
rocrand_device::threefry4x32_20_engine::threefry4x32_20_state
rocrand_device::threefry4x64_20_engine::threefry4x64_20_state
Deprecated the following internal headers:
src/rng/distribution/distributions.hpp
src/rng/device_engines.hpp
Removed references to and workarounds for deprecated hcc.
Removed support for HIP-CPU.
Known Issues#
SOBOL64 and SCRAMBLED_SOBOL64 generate poisson-distributed unsigned long long int numbers instead of unsigned int. This will be fixed in a future release.
ROCr Runtime (1.14.0)#
Changes#
Added PC sampling feature (experimental feature).
rocSOLVER (3.26.0)#
Changes#
Added 64-bit APIs for existing functions:
GETF2_64 (with batched and strided_batched versions)
GETRF_64 (with batched and strided_batched versions)
GETRS_64 (with batched and strided_batched versions)
Added gfx900 to default build targets.
Added partial eigenvalue decomposition routines for symmetric/hermitian matrices using Divide & Conquer and Bisection:
SYEVDX (with batched and strided_batched versions)
HEEVDX (with batched and strided_batched versions)
Added partial generalized symmetric/hermitian-definite eigenvalue decomposition using Divide & Conquer and Bisection:
SYGVDX (with batched and strided_batched versions)
HEGVDX (with batched and strided_batched versions)
Renamed install script arguments of the form *_dir to *-path. Arguments of the form *_dir remain functional for backwards compatibility.
Functions working with arrays of size n - 1 can now accept null pointers when n = 1.
Optimizations#
Improved performance of Cholesky factorization.
Improved performance of
splitlu
to extract the L and U triangular matrices from the result of sparse factorization matrix M, where M = (L - eye) + U.
Resolved issues#
Fixed potential accuracy degradation in SYEVJ/HEEVJ for inputs with small eigenvalues.
rocSPARSE (3.2.0)#
Changes#
Added a new Merge-Path algorithm to SpMM, supporting CSR format.
Added support for row order to SpSM.
Added rocsparseio I/O functionality to the library.
Added rocsparse_set_identity_permutation.
Adjusted rocSPARSE dependencies to related HIP packages.
Binary size has been reduced.
A namespace has been wrapped around internal rocSPARSE functions and kernels.
rocsparse_csr_set_pointers, rocsparse_csc_set_pointers, and rocsparse_bsr_set_pointers now allow the column indices and values arrays to be nullptr if nnz is 0.
gfx803 target has been removed from address sanitizer builds.
Optimizations#
SpMV adaptive and LRB algorithms have been further optimized on CSR format.
Improved performance of SpMV adaptive with symmetrically stored matrices on CSR format.
Improved documentation and contribution guidelines.
Resolved issues#
Fixed compilation errors with
BUILD_ROCSPARSE_ILP64=ON
.
rocThrust (3.1.0)#
Changes#
Added changes from upstream CCCL/thrust 2.2.0.
Updated the contents of system/hip and test with the upstream changes.
Updated internal calls to rocprim::detail::invoke_result to use the public API rocprim::invoke_result.
Updated to use rocprim::device_adjacent_difference for adjacent_difference API call.
Updated internal use of custom iterator in thrust::detail::unique_by_key to use rocPRIM’s rocprim::unique_by_key.
Updated adjacent_difference to make use of rocprim::adjacent_difference when iterators are comparable and not equal; otherwise, it uses rocprim::adjacent_difference_inplace.
Known Issues#
thrust::reduce_by_key outputs are not bit-wise reproducible, as run-to-run results for pseudo-associative reduction operators (e.g. floating-point arithmetic operators) are not deterministic on the same device.
Note that currently, rocThrust memory allocation is performed in such a way that most algorithmic API functions cannot be called from within hipGraphs.
rocWMMA (1.5.0)#
Changes#
Added internal utilities for:
Element-wise vector transforms.
Cross-lane vector transforms.
Added internal aos<->soa transforms for block sizes of 16, 32, 64, 128 and 256 and vector widths of 2, 4, 8 and 16.
Added tests for new internal transforms.
Improved loading layouts by increasing vector width for fragments with blockDim > 32.
API applyDataLayout transform now accepts WaveCount template argument for cooperative fragments.
API applyDataLayout transform now physically applies aos<->soa transform as necessary.
Refactored entry-point of std library usage to improve hipRTC support.
Updated installation, programmer’s guide and API reference documentation.
Resolved issues#
Fixed the ordering of some header includes to improve portability.
RPP (1.8.0)#
Changes#
Prerequisites - ROCm install requires only --usecase=rocm.
Use pre-allocated common scratchBufferHip everywhere in Tensor code for scratch HIP memory.
Use CHECK_RETURN_STATUS everywhere to adhere to C++17 for HIP.
RPP Tensor Audio support on HOST for Spectrogram.
RPP Tensor Audio support on HOST/HIP for Slice, by modifying voxel slice kernels to now accept anchor and shape params for a more generic version.
RPP Tensor Audio support on HOST for Mel Filter Bank.
RPP Tensor Normalize ND support on HOST and HIP.
Tensile (4.41.0)#
Changes#
New tuning script to summarize rocBLAS log file
New environment variable to test fixed grid size with Stream-K kernels
New Stream-K dynamic mode to run large problems at slightly reduced CU count if it improves work division and power
Add reject conditions for SourceKernel + PrefetchGlobalRead/LoopDoWhile
Add reject condition for PreloadKernelArguments (disable PreloadKernelArguments if not supported (instead of rejecting kernel generation))
Support NT flag for global load and store for gfx94x
New Kernarg preloading feature (DelayRemainingArgument: initiate the load of the remaining (non-preloaded) arguments, updated AsmCaps, AsmRegisterPool to track registers for arguments and preload)
Add option for rotating buffers timing with cache eviction
Add predicate for arithmetic intensity
Add DirectToVgpr + packing for f8/f16 + TLU cases
Enable negative values for ExtraLatencyForLR to reduce interval of local read and wait for DTV
Add test cases for DirectToVgpr + packing
Add batch support for Stream-K kernels and new test cases
New tuning scripts to analyze rocblas-bench results and remove tuned sizes from liblogic
Enable VgprForLocalReadPacking + PrefetchLocalRead=1 (removed the reject condition for VFLRP + PLR=1, added test cases for VFLRP + PLR=1)
Support VectorWidthB (new parameter VectorWidthB)
Support VectorWidth + non SourceSwap
Add test cases for VectorWidthB, VectorWidth + non SourceSwap
Add code owners file
New environment variables to dynamically adjust number of CUs used in Stream-K
Add new parameters to specify global load width for A and B separately (GlobalLoadVectorWidthA, B (effective with GlobalReadVectorWidth=-1))
Add xf32 option to rocblas-bench input creator
Update rocBLAS-bench-input-create script (added number of iteration based on performance, rotating buffer flag)
Limit build threads based on CPUs/RAM available on system (for tests)
Update required workspace size for Stream-K, skip kernel initialization when possible
Use fallback libraries for archs without optimized logic
Use hipMemcpyAsync for validation (replace hipMemcpy with hipMemcpyAsync + hipStreamSynchronize in ReferenceValidator)
Remove OCL tests
Disable HostLibraryTests
Reduce extended test time by removing extra parameters in the test config files
Disable InitAccVgprOpt for Stream-K
Skip sgemm 64bit offset tests for gfx94x
Skip DTV, DTL, LSU+MFMA tests for gfx908
Increase extended test timeout to 720 min
Update xfail test (1sum tests only failing on gfx90a)
Update lib logic convertor script
Test limiting CI threads for only gfx11
wGM related kernargs are removed if they are not needed (WGM=-1,0,1)
Cleanup on unused old code, mostly related to old client
Change GSUA to SingleBuffer if GlobalSplitU=1 + MultipleBuffer, instead of rejecting it
Update efficiency script for new architecture and xf32 datatype
Re-enable negative values for WorkGroupMapping (asm kernel only)
Disable HW monitor for aquvavanjaram941
Pre-apply offsets for strided batch kernels
Update tensile build with 16 threads
Optimizations#
Made initialization optimizations (reordered init code for PreloadKernelArguments opt, used s_mov_b64 for 64 bit address copy, used v_mov_b64/ds_read_b64 for C register initialization, added undefine AddressC/D with PreloadKernelArguments, optimized waitcnt for prefetch global read with DirectToVgpr, refactored waitcnt code for DTV and moved all asm related code to KernelWriterAssembly.py).
Optimized temp vgpr allocation for ClusterLocalRead (added if condition to allocate temp vgpr only for 8bit datatype)
Reversed MFMA order in inner loop for odd outer iteration
Optimized waitcnt lgkmcnt for 1LDSBuffer + PGR>1 (removed redundant waitcnt lgkmcnt after 1LDSBuffer sync)
Enhanced maximum value of DepthU to 1024 (used globalParameters MaxDepthU to define maximum value of DepthU)
Resolved issues#
Fixed WorkspaceCheck implementation when used in rocBLAS.
Fixed Stream-K partials cache behavior.
Fixed MasterSolutionLibrary indexing for multiple architecture build.
Fixed memory allocation fail with FlushMemorySize + StridedBatched/Batched cases (multiply batch count size when calculating array size).
Fixed BufferLoad=False with Stream-K.
Fixed mismatch issue with GlobalReadCoalesceGroup.
Fixed rocBLAS build fail on gfx11 (used state[“ISA”] for reject conditions instead of globalParameters[“CurrentISA”]).
Fixed LdsPad auto (fixed incorrect value assignment for autoAdjusted, set LdsBlockSizePerPadA or B = 0 if stride is not power of 2).
Fixed inaccurate vgpr allocation for ClusterLocalRead.
Fixed mismatch issue with LdsBlockSizePerPad + MT1(or 0) not power of 2.
Fixed mismatch issue with InitAccOpt + InnerUnroll (use const 0 for src1 of MFMA only if index of innerUnroll (iui) is 0).
Fixed HostLibraryTests on gfx942 and gfx941.
Fixed LLVM crash issue.
Fixed for newer windows vcpkg msgpack and vcpkg version package name.
Fixed an error with DisableKernelPieces + 32bit ShadowLimit.
Ignore asm cap check for kernel arg preload for rocm6.0 and older.
ROCm known issues#
ROCm known issues are noted on GitHub. For known issues related to individual components, review the Detailed component changes.
Default processor affinity behavior for helper threads#
Processor affinity is a critical setting to ensure that ROCm helper threads run on the correct cores. By default, ROCm helper threads are spawned on all available cores, ignoring the parent thread’s processor affinity. This can lead to threads competing for available cores, which may result in suboptimal performance. This behavior occurs by default if the environment variable HSA_OVERRIDE_CPU_AFFINITY_DEBUG is not set or is set to 1. If HSA_OVERRIDE_CPU_AFFINITY_DEBUG is set to 0, the ROCr runtime uses the parent process’s core affinity mask when creating helper threads. The parent’s affinity mask should then be set to account for the presence of additional threads by ensuring the affinity mask contains enough cores. Depending on the affinity settings of the software environment, batch system, launch commands like numactl/taskset, or explicit mask manipulation by the application itself, changing the setting may be advantageous to performance.
To ensure the parent’s core affinity mask is honored by the ROCm helper threads, set the
HSA_OVERRIDE_CPU_AFFINITY_DEBUG
environment variable as follows:
export HSA_OVERRIDE_CPU_AFFINITY_DEBUG=0
To ensure ROCm helper threads run on all available cores, set the HSA_OVERRIDE_CPU_AFFINITY_DEBUG
environment variable
as follows:
export HSA_OVERRIDE_CPU_AFFINITY_DEBUG=1
Or the default:
unset HSA_OVERRIDE_CPU_AFFINITY_DEBUG
If unsure of the default processor affinity settings for your environment, run the following command from the shell:
bash -c "taskset -p \$\$"
See issue #3493 on GitHub.
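The behavior described in this section can be summarized in a small shell sketch. The helper_thread_affinity function is a hypothetical helper written for illustration, not part of ROCm; it only maps the current value of HSA_OVERRIDE_CPU_AFFINITY_DEBUG to the behavior stated above, and it reports any other value as unknown because these notes do not define one.

```shell
#!/bin/sh
# Report which helper-thread placement these notes describe for the
# current value of HSA_OVERRIDE_CPU_AFFINITY_DEBUG.
# helper_thread_affinity is a hypothetical helper, not a ROCm tool.
helper_thread_affinity() {
    case "${HSA_OVERRIDE_CPU_AFFINITY_DEBUG-unset}" in
        0)       echo "parent-mask" ;;  # helper threads use the parent's affinity mask
        1|unset) echo "all-cores"   ;;  # default: helper threads on all available cores
        *)       echo "unknown"     ;;  # value not covered by these notes
    esac
}

helper_thread_affinity

# Print the affinity mask of the current shell when taskset is available.
if command -v taskset >/dev/null 2>&1; then
    taskset -p $$ || true
fi
```

When the variable is set to 0, the mask printed by taskset is the one the helper threads share with the application, so it should contain enough cores for both. A launch such as HSA_OVERRIDE_CPU_AFFINITY_DEBUG=0 taskset -c 0-15 ./my_app (application name and core range are placeholders) combines the two settings.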
Display issues on servers with Instinct MI300-series accelerators when loading AMDGPU driver#
AMD Instinct MI300-series accelerators and third-party GPUs such as the Matrox G200 have an issue impacting video output. The issue was reproduced on a Dell server model PowerEdge XE9680. Servers from other vendors utilizing Matrox G200 cards may be impacted as well. This issue was found with ROCm 6.2.0 but is present in older ROCm versions.
The AMDGPU driver shipped with ROCm interferes with the operation of the display card video output. On Dell systems, this includes both the local video output and remote access via iDRAC. The display appears blank (black) after loading the amdgpu driver modules. The loss of video output impacts both terminal access when running in runlevel 3 and GUI access when running in runlevel 5. Server functionality can still be accessed via SSH or other remote connection methods.
See issue #3494 on GitHub.
KFDTest failure on Instinct MI300X with Oracle Linux 8.9#
The KFDEvictTest.QueueTest is failing on the MI300X platform during KFD (Kernel Fusion Driver) tests, causing the full suite to not execute properly. This issue is suspected to be hardware-related.
See issue #3495 on GitHub.
Bandwidth limitation in gang and non-gang modes on Instinct MI300A#
Expected target peak non-gang performance (~60GB/s) and target peak gang performance (~90GB/s) are not achieved. Both gang and non-gang performance are observed to be limited at 45GB/s.
This issue will be addressed in a future ROCm release.
See issue #3496 on GitHub.
rocm-llvm-alt#
ROCm provides an optional package – rocm-llvm-alt
– that provides a closed-source compiler for
users interested in additional closed-source CPU optimizations. This feature is not functional in
the ROCm 6.2.0 release. Users who attempt to invoke the closed-source compiler will experience an
LLVM consumer-producer mismatch and the compilation will fail. There is no workaround that allows
use of the closed-source compiler. It is recommended to compile using the default open-source
compiler, which generates high-quality AMD CPU and AMD GPU code.
See issue #3492 on GitHub.
ROCm upcoming changes#
This section notes upcoming changes to the ROCm software stack. For upcoming changes related to individual components, review the Detailed component changes.
rocm-llvm-alt#
The rocm-llvm-alt
package will be removed in an upcoming release. Users relying on the
functionality provided by the closed-source compiler should transition to the open-source compiler.
Once the rocm-llvm-alt
package is removed, any compilation requesting functionality provided by
the closed-source compiler will result in a Clang warning: “[AMD] proprietary optimization compiler
has been removed”.