Release Notes#

Applies to Linux and Windows

2023-07-27

The release notes for the ROCm platform.

ROCm 5.6.0#

Release Highlights#

ROCm 5.6 delivers several AI software ecosystem improvements for our fast-growing user base. A few examples include:

New documentation portal at https://rocm.docs.amd.com
Ongoing software enhancements for LLMs, ensuring full compliance with the HuggingFace unit test suite
OpenAI Triton, CuPy, HIP Graph support, and many other library performance enhancements
Improved ROCm deployment and development tools, including the CPU-GPU debugger (ROCgdb), the profiler, and Docker containers
New pseudorandom generators are available in rocRAND. Added support for half-precision transforms in hipFFT/rocFFT. Added LU refactorization and linear system solver for sparse matrices in rocSOLVER.

OS and GPU Support Changes#

SLES15 SP5 support was added in this release. SLES15 SP3 support was dropped.
AMD Instinct MI50, Radeon Pro VII, and Radeon VII products (collectively referred to as gfx906 GPUs) will enter maintenance mode starting in Q3 2023, aligned with the ROCm 5.7 GA release date.
- No new features or performance optimizations will be supported for gfx906 GPUs beyond ROCm 5.7.
- Bug fixes and critical security patches will continue to be supported for gfx906 GPUs until Q2 2024 (End of Maintenance [EOM]), aligned with the closest ROCm release.
- Bug fixes during the maintenance period will be made in the next ROCm point release.
- Bug fixes will not be backported to older ROCm releases for this SKU.
- Distro and operating system updates will continue as per the ROCm release cadence for gfx906 GPUs until EOM.

AMDSMI CLI 23.0.0.4#

Added#

AMDSMI CLI tool enabled for Linux Bare Metal & Guest
Package: amd-smi-lib

Known Issues#

Not all Error Correction Code (ECC) fields are currently supported
RHEL 8 & SLES 15 have extra install steps

Kernel Modules (DKMS)#

Fixes#

Stability fix for multi-GPU systems, reproducible via ROCm_Bandwidth_Test as reported in Issue 2198.

HIP 5.6 (For ROCm 5.6)#

Optimizations#

Consolidation of the hipamd, ROCclr, and OpenCL projects into clr
Optimized lock for graph global capture mode

Added#

Added hipRTC support for amd_hip_fp16
Added hipStreamGetDevice implementation to get the device associated with the stream
Added HIP_AD_FORMAT_SIGNED_INT16 in hipArray formats
hipArrayGetInfo for getting information about the specified array
hipArrayGetDescriptor for getting 1D or 2D array descriptor
hipArray3DGetDescriptor to get 3D array descriptor

Changed#

hipMallocAsync to return success for zero size allocation to match hipMalloc
Separation of the hipcc Perl binaries from the HIP project into the hipcc project. The hip-devel package depends on the newly added hipcc package
Consolidation of hipamd, ROCclr, and OpenCL repositories into a single repository called clr. Instructions are updated to build HIP from sources in the HIP Installation guide
Removed hipBusBandwidth and hipCommander samples from hip-tests

Fixed#

Fixed regression in hipMemCpyParam3D when offset is applied

Known Issues#

Limited testing on xnack+ configuration
- Multiple HIP test failures (gpuvm faults or hangs)
hipSetDevice and hipSetDeviceFlags APIs return hipErrorInvalidDevice instead of hipErrorNoDevice on a system without a GPU
Known memory leak when code object files are loaded/unloaded via hipModuleLoad/hipModuleUnload APIs. This issue will be fixed in a future ROCm release

Upcoming changes in future release#

Removal of gcnarch from hipDeviceProp_t structure
Addition of new fields in hipDeviceProp_t structure
- maxTexture1D
- maxTexture2D
- maxTexture1DLayered
- maxTexture2DLayered
- sharedMemPerMultiprocessor
- deviceOverlap
- asyncEngineCount
- surfaceAlignment
- unifiedAddressing
- computePreemptionSupported
- uuid
Removal of deprecated code
- hip-hcc code from the HIP code tree
Correct hipArray usage in HIP APIs such as hipMemcpyAtoH and hipMemcpyHtoA
HIPMEMCPY_3D fields correction (unsigned int -> size_t)
Renaming of ‘memoryType’ in hipPointerAttribute_t structure to ‘type’

ROCgdb-13 (For ROCm 5.6.0)#

Optimized#

Improved performance when handling the end of a process with a large number of threads.

Known Issues#

On certain configurations, ROCgdb can show the following warning message:

warning: Probes-based dynamic linker interface failed. Reverting to original interface.

This does not affect ROCgdb’s functionality.

ROCprofiler (For ROCm 5.6.0)#

In ROCm 5.6 the rocprofilerv1 and rocprofilerv2 include and library files of ROCm 5.5 are split into separate files. The rocmtools files that were deprecated in ROCm 5.5 have been removed.

ROCm 5.6	rocprofilerv1	rocprofilerv2
Tool script	`bin/rocprof`	`bin/rocprofv2`
API include	`include/rocprofiler/rocprofiler.h`	`include/rocprofiler/v2/rocprofiler.h`
API library	`lib/librocprofiler64.so.1`	`lib/librocprofiler64.so.2`

The ROCm Profiler Tool that uses rocprofilerV1 can be invoked using the following command:

$ rocprof …

To write a custom tool based on the rocprofilerV1 API do the following:

main.c:
#include <rocprofiler/rocprofiler.h> // Use the rocprofilerV1 API
int main() {
  // Use the rocprofilerV1 API
  return 0;
}

This can be built in the following manner:

$ gcc main.c -I/opt/rocm-5.6.0/include -L/opt/rocm-5.6.0/lib -lrocprofiler64

The resulting a.out will depend on /opt/rocm-5.6.0/lib/librocprofiler64.so.1.

The ROCm Profiler that uses rocprofilerV2 API can be invoked using the following command:

$ rocprofv2 …

To write a custom tool based on the rocprofilerV2 API do the following:

main.c:
#include <rocprofiler/v2/rocprofiler.h> // Use the rocprofilerV2 API
int main() {
  // Use the rocprofilerV2 API
  return 0;
}

This can be built in the following manner:

$ gcc main.c -I/opt/rocm-5.6.0/include -L/opt/rocm-5.6.0/lib -lrocprofiler64-v2

The resulting a.out will depend on /opt/rocm-5.6.0/lib/librocprofiler64.so.2.

Optimized#

Improved Test Suite

Added#

‘end_time’ needs to be disabled in roctx_trace.txt

Fixed#

rocprof GPU selector broken in ROCm/5.4.0.
rocprof in ROCm/5.4.1 fails to generate kernel info.
rocprof clobbers LD_PRELOAD.

Library Changes in ROCm 5.6.0#

Library	Version
hipBLAS	⇒ 1.0.0
hipCUB	⇒ 2.13.1
hipFFT	⇒ 1.0.12
hipSOLVER	⇒ 1.8.0
hipSPARSE	⇒ 2.3.6
MIOpen	⇒ 2.19.0
rccl	⇒ 2.15.5
rocALUTION	⇒ 2.1.9
rocBLAS	⇒ 3.0.0
rocFFT	⇒ 1.0.23
rocm-cmake	⇒ 0.9.0
rocPRIM	⇒ 2.13.0
rocRAND	⇒ 2.10.17
rocSOLVER	⇒ 3.22.0
rocSPARSE	⇒ 2.5.2
rocThrust	⇒ 2.18.0
rocWMMA	⇒ 1.1.0
Tensile	⇒ 4.37.0

hipBLAS 1.0.0#

hipBLAS 1.0.0 for ROCm 5.6.0

Changed#

Added const qualifier to hipBLAS functions (swap, sbmv, spmv, symv, trsm) where missing

Removed#

Removed support for the deprecated hipblasInt8Datatype_t enum
Removed support for the deprecated hipblasSetInt8Datatype and hipblasGetInt8Datatype functions

Deprecated#

In-place trmm is deprecated. It will be replaced by a trmm that includes both in-place and out-of-place functionality

hipCUB 2.13.1#

hipCUB 2.13.1 for ROCm 5.6.0

Added#

Benchmarks for BlockShuffle, BlockLoad, and BlockStore.

Changed#

CUB backend references CUB and Thrust version 1.17.2.
Improved benchmark coverage of BlockScan by adding ExclusiveScan, benchmark coverage of BlockRadixSort by adding SortBlockedToStriped, and benchmark coverage of WarpScan by adding Broadcast.
Updated docs directory structure to match the standard of rocm-docs-core.

Known Issues#

BlockRadixRankMatch is currently broken under the rocPRIM backend.
BlockRadixRankMatch with a warp size that does not exactly divide the block size is broken under the CUB backend.

hipFFT 1.0.12#

hipFFT 1.0.12 for ROCm 5.6.0

Added#

Implemented the hipfftXtMakePlanMany, hipfftXtGetSizeMany, hipfftXtExec APIs, to allow requesting half-precision transforms.

Changed#

Added --precision argument to benchmark/test clients. --double is still accepted but is deprecated as a method to request a double-precision transform.

hipSOLVER 1.8.0#

hipSOLVER 1.8.0 for ROCm 5.6.0

Added#

Added compatibility API with hipsolverRf prefix

hipSPARSE 2.3.6#

hipSPARSE 2.3.6 for ROCm 5.6.0

Added#

Added SpGEMM algorithms

Changed#

For hipsparseXbsr2csr and hipsparseXcsr2bsr, blockDim == 0 now returns HIPSPARSE_STATUS_INVALID_SIZE

MIOpen 2.19.0#

MIOpen 2.19.0 for ROCm 5.6.0

Added#

ROCm 5.5 support for gfx1101 (Navi32)

Changed#

Tuning results for MLIR on ROCm 5.5
Bumping MLIR commit to 5.5.0 release tag

Fixed#

Fix 3d convolution Host API bug
[HOTFIX][MI200][FP16] Disabled ConvHipImplicitGemmBwdXdlops when FP16_ALT is required.

rccl 2.15.5#

RCCL 2.15.5 for ROCm 5.6.0

Changed#

Compatibility with NCCL 2.15.5
Unit test executable renamed to rccl-UnitTests

Added#

HW-topology aware binary tree implementation
Experimental support for MSCCL
New unit tests for hipGraph support
NPKit integration

Fixed#

rocm-smi ID conversion
Support for HIP_VISIBLE_DEVICES for unit tests
Support for p2p transfers to non (HIP) visible devices

Removed#

Removed TransferBench from tools. Exists in standalone repo: https://github.com/ROCmSoftwarePlatform/TransferBench

rocALUTION 2.1.9#

rocALUTION 2.1.9 for ROCm 5.6.0

Improved#

Fixed synchronization issues in level 1 routines

rocBLAS 3.0.0#

rocBLAS 3.0.0 for ROCm 5.6.0

Optimizations#

Improved performance of Level 2 rocBLAS GEMV on gfx90a GPU for non-transposed problems having small matrices and larger batch counts. Performance enhanced for problem sizes when m and n <= 32 and batch_count >= 256.
Improved performance of rocBLAS syr2k for single, double, and double-complex precision, and her2k for double-complex precision. Slightly improved performance for general sizes on gfx90a.

Added#

Added bf16 inputs and f32 compute support to Level 1 rocBLAS Extension functions axpy_ex, scal_ex and nrm2_ex.
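
To make "bf16 inputs and f32 compute" concrete, the following is a minimal host-side C sketch of an axpy that stores values as bfloat16 but does the arithmetic in single precision. It is not rocBLAS code and does not call the rocBLAS API; the names `bf16_bits`, `bf16_to_f32`, `f32_to_bf16`, and `axpy_bf16_f32` are hypothetical, and the round-to-nearest-even narrowing is an assumption made for this illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* bfloat16 keeps the upper 16 bits of an IEEE-754 binary32 value. */
typedef uint16_t bf16_bits; /* hypothetical storage type for this sketch */

/* Widen bf16 to f32 by placing the 16 stored bits in the high half. */
static float bf16_to_f32(bf16_bits h) {
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Narrow f32 to bf16 with round-to-nearest-even (NaN handling omitted). */
static bf16_bits f32_to_bf16(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    bits += 0x7FFFu + ((bits >> 16) & 1u);
    return (bf16_bits)(bits >> 16);
}

/* y[i] = alpha * x[i] + y[i], with bf16 storage and f32 arithmetic. */
static void axpy_bf16_f32(size_t n, float alpha,
                          const bf16_bits *x, bf16_bits *y) {
    for (size_t i = 0; i < n; ++i) {
        float acc = alpha * bf16_to_f32(x[i]) + bf16_to_f32(y[i]);
        y[i] = f32_to_bf16(acc);
    }
}
```

Computing in f32 and narrowing only on the final store limits rounding to one step per element, which is the usual motivation for pairing a low-precision storage type with a wider compute type.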

Deprecated#

trmm inplace is deprecated. It will be replaced by trmm that has both inplace and out-of-place functionality
rocblas_query_int8_layout_flag() is deprecated and will be removed in a future release
rocblas_gemm_flags_pack_int8x4 enum is deprecated and will be removed in a future release
rocblas_set_device_memory_size() is deprecated and will be replaced by a future function rocblas_increase_device_memory_size()
rocblas_is_user_managing_device_memory() is deprecated and will be removed in a future release

Removed#

is_complex helper was deprecated and is now removed. Use rocblas_is_complex instead.
The enum truncate_t and the value truncate were deprecated and are now removed. They were replaced by rocblas_truncate_t and rocblas_truncate, respectively.
rocblas_set_int8_type_for_hipblas was deprecated and is now removed.
rocblas_get_int8_type_for_hipblas was deprecated and is now removed.

Dependencies#

Added a build-only dependency on Python joblib, which is used by the Tensile build
Fixed cmake install on some operating systems when performed by install.sh -d --cmake_install

Fixed#

Made trsm offset calculations 64-bit safe
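
As a generic illustration of why offset arithmetic needs 64 bits, the C sketch below computes a column-major element offset twice: once in 32-bit arithmetic and once widened to 64 bits. This is not the rocBLAS trsm code; the function names and the `row + col * lda` indexing are assumptions chosen for the example.

```c
#include <stdint.h>

/* Offset computed entirely in 32-bit arithmetic: wraps modulo 2^32. */
static uint32_t offset_32bit(uint32_t row, uint32_t col, uint32_t lda) {
    return row + col * lda;
}

/* Same offset, with every operand widened to 64 bits before multiplying. */
static uint64_t offset_64bit(uint32_t row, uint32_t col, uint32_t lda) {
    return (uint64_t)row + (uint64_t)col * (uint64_t)lda;
}
```

For a 70000 x 70000 matrix, the offset of column 70000 is 4,900,000,000 elements. The 32-bit version wraps to 605,032,704, while the 64-bit version returns the correct value.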

Changed#

Refactored rotg test code

rocFFT 1.0.23#

rocFFT 1.0.23 for ROCm 5.6.0

Added#

Implemented half-precision transforms, which can be requested by passing rocfft_precision_half to rocfft_plan_create.
Implemented a hierarchical solution map which saves how to decompose a problem and the kernels to be used.
Implemented a first version of offline-tuner to support tuning kernels for C2C/Z2Z problems.

Changed#

Replaced std::complex with hipComplex data types for data generator.
FFT plan dimensions are now sorted to be row-major internally where possible, which produces better plans if the dimensions were accidentally specified in a different order (column-major, for example).
Added --precision argument to benchmark/test clients. --double is still accepted but is deprecated as a method to request a double-precision transform.
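
One plausible reading of the dimension sorting described above is "order the dimensions by stride so that the unit-stride dimension comes last". The C sketch below shows that idea only. It is not rocFFT's internal code; the `fft_dim` type, the function names, and the slowest-first index convention are assumptions for illustration and may differ from rocFFT's own conventions.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical description of one FFT dimension: its length and its stride. */
typedef struct {
    size_t length;
    size_t stride;
} fft_dim;

/* qsort comparator: larger strides first, so the fastest dimension ends up last. */
static int by_descending_stride(const void *a, const void *b) {
    const fft_dim *x = (const fft_dim *)a;
    const fft_dim *y = (const fft_dim *)b;
    if (x->stride < y->stride) return 1;
    if (x->stride > y->stride) return -1;
    return 0;
}

/* Reorder dims in place into row-major order (slowest-varying dimension first). */
static void sort_dims_row_major(fft_dim *dims, size_t rank) {
    qsort(dims, rank, sizeof dims[0], by_descending_stride);
}
```

Given dimensions listed fastest-first, for example lengths 4, 8, 16 with strides 1, 4, 32, the sort returns them as 16, 8, 4, so the same data is described in row-major order.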

Fixed#

Fixed over-allocation of LDS in some real-complex kernels, which was resulting in kernel launch failure.

rocm-cmake 0.9.0#

rocm-cmake 0.9.0 for ROCm 5.6.0

Added#

Added the option ROCM_HEADER_WRAPPER_WERROR
- Compile-time C macro in the wrapper headers causes errors to be emitted instead of warnings.
- Configure-time CMake option sets the default for the C macro.

rocPRIM 2.13.0#

rocPRIM 2.13.0 for ROCm 5.6.0

Added#

New block level radix_rank primitive.
New block level radix_rank_match primitive.
Added a stable block sorting implementation. This can be used with block_sort by using the block_sort_algorithm::stable_merge_sort algorithm.

Changed#

Improved the performance of block_radix_sort and device_radix_sort.
Improved the performance of device_merge_sort.
Updated docs directory structure to match the standard of rocm-docs-core. Contributed by: v01dXYZ.

Known Issues#

Disabled GPU error messages relating to incorrect warp operation usage with Navi GPUs on Windows, due to GPU printf performance issues on Windows.
When ROCPRIM_DISABLE_LOOKBACK_SCAN is set, device_scan fails for input sizes bigger than scan_config::size_limit, which defaults to std::numeric_limits<unsigned int>::max().

rocRAND 2.10.17#

rocRAND 2.10.17 for ROCm 5.6.0

Added#

MT19937 pseudo random number generator based on M. Matsumoto and T. Nishimura, 1998, Mersenne Twister: A 623-dimensionally equidistributed uniform pseudorandom number generator.
New benchmark for the device API using Google Benchmark, benchmark_rocrand_device_api, replacing benchmark_rocrand_kernel. benchmark_rocrand_kernel is deprecated and will be removed in a future version. Likewise, benchmark_curand_host_api is added to replace benchmark_curand_generate and benchmark_curand_device_api is added to replace benchmark_curand_kernel.
Experimental HIP-CPU feature
ThreeFry pseudorandom number generator based on Salmon et al., 2011, “Parallel random numbers: as easy as 1, 2, 3”.
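
For readers unfamiliar with MT19937, the following is a compact host-side C sketch of the 32-bit Mersenne Twister recurrence from the Matsumoto and Nishimura paper cited above. It is a CPU reference only and says nothing about how rocRAND organizes the generator on the GPU; the type and function names (`mt19937_t`, `mt19937_seed`, `mt19937_next`) are hypothetical and are not rocRAND API.

```c
#include <stdint.h>

#define MT_N 624 /* state size in 32-bit words */
#define MT_M 397 /* middle-word offset used by the recurrence */

typedef struct {
    uint32_t state[MT_N];
    int index; /* next word to temper; MT_N means "regenerate first" */
} mt19937_t;

/* Initialize the state from a 32-bit seed (standard linear initializer). */
static void mt19937_seed(mt19937_t *g, uint32_t seed) {
    g->state[0] = seed;
    for (int i = 1; i < MT_N; ++i) {
        uint32_t prev = g->state[i - 1];
        g->state[i] = 1812433253u * (prev ^ (prev >> 30)) + (uint32_t)i;
    }
    g->index = MT_N;
}

/* Regenerate all MT_N state words in place. */
static void mt19937_twist(mt19937_t *g) {
    for (int i = 0; i < MT_N; ++i) {
        uint32_t y = (g->state[i] & 0x80000000u) |
                     (g->state[(i + 1) % MT_N] & 0x7FFFFFFFu);
        uint32_t next = g->state[(i + MT_M) % MT_N] ^ (y >> 1);
        if (y & 1u) {
            next ^= 0x9908B0DFu;
        }
        g->state[i] = next;
    }
    g->index = 0;
}

/* Return the next 32-bit output after tempering. */
static uint32_t mt19937_next(mt19937_t *g) {
    if (g->index >= MT_N) {
        mt19937_twist(g);
    }
    uint32_t y = g->state[g->index++];
    y ^= y >> 11;
    y ^= (y << 7) & 0x9D2C5680u;
    y ^= (y << 15) & 0xEFC60000u;
    y ^= y >> 18;
    return y;
}
```

Seeded with 5489, the default seed of the reference implementation, the first output is 3499211612, which is a convenient check when comparing against another MT19937 implementation.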

Changed#

Python 2.7 is no longer officially supported.

rocSOLVER 3.22.0#

rocSOLVER 3.22.0 for ROCm 5.6.0

Added#

LU refactorization for sparse matrices
- CSRRF_ANALYSIS
- CSRRF_SUMLU
- CSRRF_SPLITLU
- CSRRF_REFACTLU
Linear system solver for sparse matrices
- CSRRF_SOLVE
Added type rocsolver_rfinfo for use with sparse matrix routines

Optimized#

Improved the performance of BDSQR and GESVD when singular vectors are requested

Fixed#

BDSQR and GESVD should no longer hang when the input contains NaN or Inf
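
On releases without this fix, a caller may prefer to reject non-finite input before invoking a solver. The C sketch below is one hedged way to do that on the host; it is not part of rocSOLVER, and the helper name `all_finite` is hypothetical.

```c
#include <math.h>
#include <stddef.h>

/* Returns 1 if every element is finite, 0 if any element is NaN or +/-Inf. */
static int all_finite(const double *a, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        if (!isfinite(a[i])) {
            return 0;
        }
    }
    return 1;
}
```

This check runs on host memory, so data that already lives on the device would need to be copied back or validated by an equivalent device-side pass.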

rocSPARSE 2.5.2#

rocSPARSE 2.5.2 for ROCm 5.6.0

Improved#

Fixed a memory leak in csritsv
Fixed a bug in csrsm and bsrsm

rocThrust 2.18.0#

rocThrust 2.18.0 for ROCm 5.6.0

Fixed#

lower_bound, upper_bound, and binary_search failed to compile for certain types.

Changed#

Updated docs directory structure to match the standard of rocm-docs-core.

rocWMMA 1.1.0#

rocWMMA 1.1.0 for ROCm 5.6.0

Added#

Added cross-lane operation backends (Blend, Permute, Swizzle and Dpp)
Added GPU kernels for rocWMMA unit test pre-process and post-process operations (fill, validation)
Added performance gemm samples for half, single and double precision
Added rocWMMA cmake versioning
Added vectorized support in coordinate transforms
Included ROCm smi for runtime clock rate detection
Added fragment transforms for transpose and change data layout

Changed#

Default to GPU rocBLAS validation against rocWMMA
Re-enabled int8 gemm tests on gfx9
Upgraded to C++17
Restructured unit test folder for consistency
Consolidated rocWMMA samples common code

Tensile 4.37.0#

Tensile 4.37.0 for ROCm 5.6.0

Added#

Added user driven tuning API
Added decision tree fallback feature
Added SingleBuffer + AtomicAdd option for GlobalSplitU
DirectToVgpr support for fp16 and Int8 with TN orientation
Added new test cases for various functions
Added SingleBuffer algorithm for ZGEMM/CGEMM
Added joblib for parallel map calls
Added support for MFMA + LocalSplitU + DirectToVgprA+B
Added asmcap check for MIArchVgpr
Added support for MFMA + LocalSplitU
Added frequency, power, and temperature data to the output

Optimizations#

Improved the performance of GlobalSplitU with SingleBuffer algorithm
Reduced the running time of the extended and pre_checkin tests
Optimized the Tailloop section of the assembly kernel
Optimized complex GEMM (fixed vgpr allocation, unified CGEMM and ZGEMM code in MulMIoutAlphaToArch)
Improved the performance of the second kernel of MultipleBuffer algorithm

Changed#

Updated custom kernels with 64-bit offsets
Adapted 64-bit offset arguments for assembly kernels
Improved temporary register re-use to reduce max sgpr usage
Removed some restrictions on VectorWidth and DirectToVgpr
Updated the dependency requirements for Tensile
Changed the range of AssertSummationElementMultiple
Modified the error messages for more clarity
Changed DivideAndRemainder to vectorStaticRemainder in case quotient is not used
Removed dummy vgpr for vectorStaticRemainder
Removed tmpVgpr parameter from vectorStaticRemainder/Divide/DivideAndRemainder
Removed qReg parameter from vectorStaticRemainder

Fixed#

Fixed tmp sgpr allocation to avoid over-writing values (alpha)
64-bit offset parameters for post kernels
Fixed gfx908 CI test failures
Fixed offset calculation to prevent overflow for large offsets
Fixed issues when BufferLoad and BufferStore are equal to zero
Fixed StoreCInUnroll + DirectToVgpr + no useInitAccVgprOpt mismatch
Fixed DirectToVgpr + LocalSplitU + FractionalLoad mismatch
Fixed the memory access error related to StaggerU + large stride
Fixed ZGEMM 4x4 MatrixInst mismatch
Fixed DGEMM 4x4 MatrixInst mismatch
Fixed ASEM + GSU + NoTailLoop opt mismatch
Fixed AssertSummationElementMultiple + GlobalSplitU issues
Fixed ASEM + GSU + TailLoop inner unroll
