Release Notes#
2023-07-27
The release notes for the ROCm platform.
ROCm 5.6.0#
Release Highlights#
ROCm 5.6 brings several AI software ecosystem improvements for our fast-growing user base. A few examples include:
New documentation portal at https://rocm.docs.amd.com
Ongoing software enhancements for LLMs, ensuring full compliance with the HuggingFace unit test suite
OpenAI Triton, CuPy, HIP Graph support, and many other library performance enhancements
Improved ROCm deployment and development tools, including the CPU-GPU debugger (ROCgdb), profiler, and Docker containers
New pseudorandom generators are available in rocRAND. Added support for half-precision transforms in hipFFT/rocFFT. Added LU refactorization and linear system solver for sparse matrices in rocSOLVER.
OS and GPU Support Changes#
SLES15 SP5 support was added in this release. SLES15 SP3 support was dropped.
AMD Instinct MI50, Radeon Pro VII, and Radeon VII products (collectively referred to as gfx906 GPUs) will enter maintenance mode starting Q3 2023, aligned with the ROCm 5.7 GA release date.
No new features or performance optimizations will be supported for the gfx906 GPUs beyond ROCm 5.7.
Bug fixes and critical security patches will continue to be supported for the gfx906 GPUs until Q2 2024 (End of Maintenance [EOM]), aligned with the closest ROCm release.
Bug fixes during the maintenance period will be made in the next ROCm point release.
Bug fixes will not be backported to older ROCm releases for this SKU.
Distro / operating system updates will continue as per the ROCm release cadence for gfx906 GPUs until EOM.
AMDSMI CLI 23.0.0.4#
Added#
AMDSMI CLI tool enabled for Linux Bare Metal & Guest
Package: amd-smi-lib
Known Issues#
Not all Error Correction Code (ECC) fields are currently supported
RHEL 8 & SLES 15 have extra install steps
Kernel Modules (DKMS)#
Fixes#
Stability fix for an issue on multi-GPU systems, reproducible via ROCm_Bandwidth_Test, as reported in Issue 2198.
HIP 5.6 (For ROCm 5.6)#
Optimizations#
Consolidation of hipamd, rocclr and OpenCL projects in clr
Optimized lock for graph global capture mode
Added#
Added hipRTC support for amd_hip_fp16
Added hipStreamGetDevice implementation to get the device associated with the stream
Added HIP_AD_FORMAT_SIGNED_INT16 in hipArray formats
hipArrayGetInfo for getting information about the specified array
hipArrayGetDescriptor for getting 1D or 2D array descriptor
hipArray3DGetDescriptor to get 3D array descriptor
Changed#
hipMallocAsync to return success for zero size allocation to match hipMalloc
Separation of hipcc perl binaries from HIP project to hipcc project. hip-devel package depends on newly added hipcc package
Consolidation of hipamd, ROCclr, and OpenCL repositories into a single repository called clr. Instructions are updated to build HIP from sources in the HIP Installation guide
Removed hipBusBandwidth and hipCommander samples from hip-tests
Fixed#
Fixed regression in hipMemCpyParam3D when offset is applied
Known Issues#
Limited testing on xnack+ configuration
Multiple HIP test failures (gpuvm fault or hangs)
hipSetDevice and hipSetDeviceFlags APIs return hipErrorInvalidDevice instead of hipErrorNoDevice on a system without a GPU
Known memory leak when code object files are loaded/unloaded via hipModuleLoad/hipModuleUnload APIs. This issue will be fixed in a future ROCm release
Upcoming changes in future release#
Removal of gcnArch from hipDeviceProp_t structure
Addition of new fields in hipDeviceProp_t structure
maxTexture1D
maxTexture2D
maxTexture1DLayered
maxTexture2DLayered
sharedMemPerMultiprocessor
deviceOverlap
asyncEngineCount
surfaceAlignment
unifiedAddressing
computePreemptionSupported
uuid
Removal of deprecated code
hip-hcc code from the HIP code tree
Correct hipArray usage in HIP APIs such as hipMemcpyAtoH and hipMemcpyHtoA
HIPMEMCPY_3D fields correction (unsigned int -> size_t)
Renaming of ‘memoryType’ in hipPointerAttribute_t structure to ‘type’
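The planned HIPMEMCPY_3D change from unsigned int to size_t matters once a field has to hold a value above 2^32 - 1. The following sketch illustrates the truncation, assuming a 64-bit platform. The structs `Extent32` and `Extent64` and both helper functions are hypothetical stand-ins for illustration, not the actual HIPMEMCPY_3D definition.

```cpp
#include <cstddef>
#include <cstdint>

// Assumption: a 64-bit platform with a 32-bit unsigned int.
static_assert(sizeof(std::size_t) == 8, "sketch assumes a 64-bit size_t");
static_assert(sizeof(unsigned int) == 4, "sketch assumes a 32-bit unsigned int");

// Hypothetical stand-ins, NOT the real HIPMEMCPY_3D layout.
struct Extent32 { unsigned int widthInBytes; };  // field type before the change
struct Extent64 { std::size_t  widthInBytes; };  // field type after the change

// Storing a byte count in a 32-bit field keeps only the low 32 bits.
inline Extent32 make_extent32(std::uint64_t bytes) {
    return Extent32{static_cast<unsigned int>(bytes)};
}

// A size_t field preserves the full value on a 64-bit platform.
inline Extent64 make_extent64(std::uint64_t bytes) {
    return Extent64{static_cast<std::size_t>(bytes)};
}
```

Values that fit in 32 bits behave identically in both variants, so only code that passes very large extents should observe a difference.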
ROCgdb-13 (For ROCm 5.6.0)#
Optimized#
Improved performance when handling the end of a process with a large number of threads.
Known Issues#
On certain configurations, ROCgdb can show the following warning message:
warning: Probes-based dynamic linker interface failed. Reverting to original interface.
This does not affect ROCgdb’s functionality.
ROCprofiler (For ROCm 5.6.0)#
In ROCm 5.6, the rocprofilerv1 and rocprofilerv2 include and library files of ROCm 5.5 are split into separate files. The rocmtools files that were deprecated in ROCm 5.5 have been removed.
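ROCm 5.6 | rocprofilerv1 | rocprofilerv2 |
---|---|---|
Tool script | rocprof | rocprofv2 |
API include | rocprofiler/rocprofiler.h | rocprofiler/v2/rocprofiler.h |
API library | librocprofiler64.so.1 | librocprofiler64.so.2 |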
The ROCm Profiler Tool that uses rocprofilerV1 can be invoked using the following command:
$ rocprof …
To write a custom tool based on the rocprofilerV1 API, do the following:
main.c:
#include <rocprofiler/rocprofiler.h> // Use the rocprofilerV1 API
int main() {
// Use the rocprofilerV1 API
return 0;
}
This can be built in the following manner:
$ gcc main.c -I/opt/rocm-5.6.0/include -L/opt/rocm-5.6.0/lib -lrocprofiler64
The resulting a.out will depend on /opt/rocm-5.6.0/lib/librocprofiler64.so.1.
The ROCm Profiler that uses the rocprofilerV2 API can be invoked using the following command:
$ rocprofv2 …
To write a custom tool based on the rocprofilerV2 API, do the following:
main.c:
#include <rocprofiler/v2/rocprofiler.h> // Use the rocprofilerV2 API
int main() {
// Use the rocprofilerV2 API
return 0;
}
This can be built in the following manner:
$ gcc main.c -I/opt/rocm-5.6.0/include -L/opt/rocm-5.6.0/lib -lrocprofiler64-v2
The resulting a.out will depend on /opt/rocm-5.6.0/lib/librocprofiler64.so.2.
Optimized#
Improved Test Suite
Added#
‘end_time’ needs to be disabled in roctx_trace.txt
Fixed#
rocprof in ROCm/5.4.0: GPU selector broken.
rocprof in ROCm/5.4.1 fails to generate kernel info.
rocprof clobbers LD_PRELOAD.
Library Changes in ROCm 5.6.0#
Library | Version |
---|---|
hipBLAS | ⇒ 1.0.0 |
hipCUB | ⇒ 2.13.1 |
hipFFT | ⇒ 1.0.12 |
hipSOLVER | ⇒ 1.8.0 |
hipSPARSE | ⇒ 2.3.6 |
MIOpen | ⇒ 2.19.0 |
rccl | ⇒ 2.15.5 |
rocALUTION | ⇒ 2.1.9 |
rocBLAS | ⇒ 3.0.0 |
rocFFT | ⇒ 1.0.23 |
rocm-cmake | ⇒ 0.9.0 |
rocPRIM | ⇒ 2.13.0 |
rocRAND | ⇒ 2.10.17 |
rocSOLVER | ⇒ 3.22.0 |
rocSPARSE | ⇒ 2.5.2 |
rocThrust | ⇒ 2.18.0 |
rocWMMA | ⇒ 1.1.0 |
Tensile | ⇒ 4.37.0 |
hipBLAS 1.0.0#
hipBLAS 1.0.0 for ROCm 5.6.0
Changed#
Added const qualifier to hipBLAS functions (swap, sbmv, spmv, symv, trsm) where missing
Removed#
Removed support for deprecated hipblasInt8Datatype_t enum
Removed support for deprecated hipblasSetInt8Datatype and hipblasGetInt8Datatype functions
Deprecated#
In-place trmm is deprecated. It will be replaced by a trmm that includes both in-place and out-of-place functionality
hipCUB 2.13.1#
hipCUB 2.13.1 for ROCm 5.6.0
Added#
Benchmarks for BlockShuffle, BlockLoad, and BlockStore.
Changed#
CUB backend references CUB and Thrust version 1.17.2.
Improved benchmark coverage of BlockScan by adding ExclusiveScan, benchmark coverage of BlockRadixSort by adding SortBlockedToStriped, and benchmark coverage of WarpScan by adding Broadcast.
Updated docs directory structure to match the standard of rocm-docs-core.
Known Issues#
BlockRadixRankMatch is currently broken under the rocPRIM backend.
BlockRadixRankMatch with a warp size that does not exactly divide the block size is broken under the CUB backend.
hipFFT 1.0.12#
hipFFT 1.0.12 for ROCm 5.6.0
Added#
Implemented the hipfftXtMakePlanMany, hipfftXtGetSizeMany, hipfftXtExec APIs, to allow requesting half-precision transforms.
Changed#
Added --precision argument to benchmark/test clients. --double is still accepted but is deprecated as a method to request a double-precision transform.
hipSOLVER 1.8.0#
hipSOLVER 1.8.0 for ROCm 5.6.0
Added#
Added compatibility API with hipsolverRf prefix
hipSPARSE 2.3.6#
hipSPARSE 2.3.6 for ROCm 5.6.0
Added#
Added SpGEMM algorithms
Changed#
For hipsparseXbsr2csr and hipsparseXcsr2bsr, blockDim == 0 now returns HIPSPARSE_STATUS_INVALID_SIZE
MIOpen 2.19.0#
MIOpen 2.19.0 for ROCm 5.6.0
Added#
ROCm 5.5 support for gfx1101 (Navi32)
Changed#
Tuning results for MLIR on ROCm 5.5
Bumping MLIR commit to 5.5.0 release tag
Fixed#
Fix 3d convolution Host API bug
[HOTFIX][MI200][FP16] Disabled ConvHipImplicitGemmBwdXdlops when FP16_ALT is required.
rccl 2.15.5#
RCCL 2.15.5 for ROCm 5.6.0
Changed#
Compatibility with NCCL 2.15.5
Unit test executable renamed to rccl-UnitTests
Added#
HW-topology aware binary tree implementation
Experimental support for MSCCL
New unit tests for hipGraph support
NPKit integration
Fixed#
rocm-smi ID conversion
Support for HIP_VISIBLE_DEVICES for unit tests
Support for p2p transfers to non (HIP) visible devices
Removed#
Removed TransferBench from tools. Exists in standalone repo: https://github.com/ROCmSoftwarePlatform/TransferBench
rocALUTION 2.1.9#
rocALUTION 2.1.9 for ROCm 5.6.0
Improved#
Fixed synchronization issues in level 1 routines
rocBLAS 3.0.0#
rocBLAS 3.0.0 for ROCm 5.6.0
Optimizations#
Improved performance of Level 2 rocBLAS GEMV on gfx90a GPU for non-transposed problems having small matrices and larger batch counts. Performance enhanced for problem sizes when m and n <= 32 and batch_count >= 256.
Improved performance of rocBLAS syr2k for single, double, and double-complex precision, and her2k for double-complex precision. Slightly improved performance for general sizes on gfx90a.
Added#
Added bf16 inputs and f32 compute support to Level 1 rocBLAS Extension functions axpy_ex, scal_ex and nrm2_ex.
Deprecated#
In-place trmm is deprecated. It will be replaced by a trmm that has both in-place and out-of-place functionality
rocblas_query_int8_layout_flag() is deprecated and will be removed in a future release
rocblas_gemm_flags_pack_int8x4 enum is deprecated and will be removed in a future release
rocblas_set_device_memory_size() is deprecated and will be replaced by a future function rocblas_increase_device_memory_size()
rocblas_is_user_managing_device_memory() is deprecated and will be removed in a future release
Removed#
is_complex helper was deprecated and is now removed. Use rocblas_is_complex instead.
The enum truncate_t and the value truncate were deprecated and are now removed. They were replaced by rocblas_truncate_t and rocblas_truncate, respectively.
rocblas_set_int8_type_for_hipblas was deprecated and is now removed.
rocblas_get_int8_type_for_hipblas was deprecated and is now removed.
Dependencies#
Added a build-only dependency on Python joblib, which is used by the Tensile build
Fix for cmake install on some operating systems when performed by install.sh -d --cmake_install
Fixed#
Made trsm offset calculations 64-bit safe
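A fix of this kind generally means computing element offsets in 64-bit arithmetic, so that the product of two valid 32-bit indices cannot wrap. The sketch below contrasts the two approaches. The function names and the column-major indexing formula are illustrative assumptions, not rocBLAS source code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: column-major element offset computed in 32-bit
// unsigned arithmetic. The product wraps modulo 2^32 for large matrices.
inline std::uint32_t offset_32bit(std::uint32_t row, std::uint32_t col, std::uint32_t ld) {
    return row + col * ld;
}

// Same formula with the operands widened to 64 bits before multiplying,
// so the result is exact for any non-negative 32-bit indices.
inline std::int64_t offset_64bit(std::int32_t row, std::int32_t col, std::int32_t ld) {
    assert(row >= 0 && col >= 0 && ld >= 0);
    return static_cast<std::int64_t>(row) + static_cast<std::int64_t>(col) * ld;
}
```

With col = 70000 and ld = 70000, the true offset of 4,900,000,000 elements exceeds 2^32, so the 32-bit variant silently addresses the wrong element while the 64-bit variant stays exact.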
Changed#
Refactored rotg test code
rocFFT 1.0.23#
rocFFT 1.0.23 for ROCm 5.6.0
Added#
Implemented half-precision transforms, which can be requested by passing rocfft_precision_half to rocfft_plan_create.
Implemented a hierarchical solution map which saves how to decompose a problem and the kernels to be used.
Implemented a first version of offline-tuner to support tuning kernels for C2C/Z2Z problems.
Changed#
Replaced std::complex with hipComplex data types for data generator.
FFT plan dimensions are now sorted to be row-major internally where possible, which produces better plans if the dimensions were accidentally specified in a different order (column-major, for example).
Added --precision argument to benchmark/test clients. --double is still accepted but is deprecated as a method to request a double-precision transform.
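The dimension reordering mentioned in this list can be pictured as sorting (length, stride) pairs by stride. This is a hedged, host-only sketch: the `Dim` struct and the `sort_row_major` function are hypothetical, and rocFFT's internal representation is not described in these notes. The sketch assumes that row-major means the slowest-varying dimension (largest stride) comes first.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical description of one FFT dimension.
struct Dim {
    std::size_t length;  // number of elements along this dimension
    std::size_t stride;  // distance between consecutive elements, in elements
};

// Reorder dimensions so that strides are non-increasing (slowest-varying first).
// std::stable_sort keeps the caller's order for dimensions with equal strides.
inline std::vector<Dim> sort_row_major(std::vector<Dim> dims) {
    std::stable_sort(dims.begin(), dims.end(),
                     [](const Dim& a, const Dim& b) { return a.stride > b.stride; });
    return dims;
}
```

For example, dimensions given fastest-first as (4, 1), (8, 4), (16, 32) come back as (16, 32), (8, 4), (4, 1); the data layout itself is unchanged, only the order in which the dimensions are described.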
Fixed#
Fixed over-allocation of LDS in some real-complex kernels, which was resulting in kernel launch failure.
rocm-cmake 0.9.0#
rocm-cmake 0.9.0 for ROCm 5.6.0
Added#
Added the option ROCM_HEADER_WRAPPER_WERROR
Compile-time C macro in the wrapper headers causes errors to be emitted instead of warnings.
Configure-time CMake option sets the default for the C macro.
rocPRIM 2.13.0#
rocPRIM 2.13.0 for ROCm 5.6.0
Added#
New block level radix_rank primitive.
New block level radix_rank_match primitive.
Added a stable block sorting implementation. This can be used with block_sort by using the block_sort_algorithm::stable_merge_sort algorithm.
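Stability means that elements comparing equal keep their input order. The host-only sketch below shows that property with std::stable_sort from the C++ standard library. It does not use the rocPRIM block_sort API, and the `Item` struct and `sort_by_key_stable` function are hypothetical names chosen for illustration.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical key/payload pair; only the key takes part in the comparison.
struct Item {
    int  key;
    char tag;  // records the original position among equal keys
};

// Sort by key only. A stable sort guarantees that items with equal keys
// appear in the output in the same relative order as in the input.
inline std::vector<Item> sort_by_key_stable(std::vector<Item> items) {
    std::stable_sort(items.begin(), items.end(),
                     [](const Item& a, const Item& b) { return a.key < b.key; });
    return items;
}
```

Sorting (2,a), (1,b), (2,c), (1,d) this way yields (1,b), (1,d), (2,a), (2,c). An unstable sort would be free to swap the two items within each key group.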
Changed#
Improved the performance of block_radix_sort and device_radix_sort.
Improved the performance of device_merge_sort.
Updated docs directory structure to match the standard of rocm-docs-core. Contributed by: v01dXYZ.
Known Issues#
Disabled GPU error messages relating to incorrect warp operation usage with Navi GPUs on Windows, due to GPU printf performance issues on Windows.
When ROCPRIM_DISABLE_LOOKBACK_SCAN is set, device_scan fails for input sizes bigger than scan_config::size_limit, which defaults to std::numeric_limits<unsigned int>::max().
rocRAND 2.10.17#
rocRAND 2.10.17 for ROCm 5.6.0
Added#
MT19937 pseudo random number generator based on M. Matsumoto and T. Nishimura, 1998, Mersenne Twister: A 623-dimensionally equidistributed uniform pseudorandom number generator.
New benchmark for the device API using Google Benchmark, benchmark_rocrand_device_api, replacing benchmark_rocrand_kernel. benchmark_rocrand_kernel is deprecated and will be removed in a future version. Likewise, benchmark_curand_host_api is added to replace benchmark_curand_generate and benchmark_curand_device_api is added to replace benchmark_curand_kernel.
Experimental HIP-CPU feature
ThreeFry pseudorandom number generator based on Salmon et al., 2011, “Parallel random numbers: as easy as 1, 2, 3”.
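As a host-side reference point for the MT19937 algorithm, the C++ standard library provides the Matsumoto and Nishimura 32-bit Mersenne Twister as std::mt19937. The sketch below is not the rocRAND API, and a rocRAND generator is not assumed to reproduce this exact sequence. `nth_mt19937_output` is a hypothetical helper.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>

// Return the n-th output (1-based, n must be at least 1) of std::mt19937
// seeded with `seed`.
inline std::uint32_t nth_mt19937_output(std::uint32_t seed, std::size_t n) {
    std::mt19937 gen(seed);
    gen.discard(n - 1);  // skip the first n-1 values
    return static_cast<std::uint32_t>(gen());
}
```

The C++ standard requires the 10000th value of a default-seeded std::mt19937 (seed 5489) to be 4123659995, which makes a convenient sanity check for any host-side comparison.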
Changed#
Python 2.7 is no longer officially supported.
rocSOLVER 3.22.0#
rocSOLVER 3.22.0 for ROCm 5.6.0
Added#
LU refactorization for sparse matrices
CSRRF_ANALYSIS
CSRRF_SUMLU
CSRRF_SPLITLU
CSRRF_REFACTLU
Linear system solver for sparse matrices
CSRRF_SOLVE
Added type rocsolver_rfinfo for use with sparse matrix routines
Optimized#
Improved the performance of BDSQR and GESVD when singular vectors are requested
Fixed#
BDSQR and GESVD should no longer hang when the input contains NaN or Inf
rocSPARSE 2.5.2#
rocSPARSE 2.5.2 for ROCm 5.6.0
Improved#
Fixed a memory leak in csritsv
Fixed a bug in csrsm and bsrsm
rocThrust 2.18.0#
rocThrust 2.18.0 for ROCm 5.6.0
Fixed#
lower_bound, upper_bound, and binary_search failed to compile for certain types.
Changed#
Updated docs directory structure to match the standard of rocm-docs-core.
rocWMMA 1.1.0#
rocWMMA 1.1.0 for ROCm 5.6.0
Added#
Added cross-lane operation backends (Blend, Permute, Swizzle and Dpp)
Added GPU kernels for rocWMMA unit test pre-process and post-process operations (fill, validation)
Added performance gemm samples for half, single and double precision
Added rocWMMA cmake versioning
Added vectorized support in coordinate transforms
Included ROCm SMI for runtime clock rate detection
Added fragment transforms for transpose and change data layout
Changed#
Default to GPU rocBLAS validation against rocWMMA
Re-enabled int8 gemm tests on gfx9
Upgraded to C++17
Restructured unit test folder for consistency
Consolidated rocWMMA samples common code
Tensile 4.37.0#
Tensile 4.37.0 for ROCm 5.6.0
Added#
Added user driven tuning API
Added decision tree fallback feature
Added SingleBuffer + AtomicAdd option for GlobalSplitU
DirectToVgpr support for fp16 and Int8 with TN orientation
Added new test cases for various functions
Added SingleBuffer algorithm for ZGEMM/CGEMM
Added joblib for parallel map calls
Added support for MFMA + LocalSplitU + DirectToVgprA+B
Added asmcap check for MIArchVgpr
Added support for MFMA + LocalSplitU
Added frequency, power, and temperature data to the output
Optimizations#
Improved the performance of GlobalSplitU with SingleBuffer algorithm
Reduced the running time of the extended and pre_checkin tests
Optimized the Tailloop section of the assembly kernel
Optimized complex GEMM (fixed vgpr allocation, unified CGEMM and ZGEMM code in MulMIoutAlphaToArch)
Improved the performance of the second kernel of MultipleBuffer algorithm
Changed#
Updated custom kernels with 64-bit offsets
Adapted 64-bit offset arguments for assembly kernels
Improved temporary register re-use to reduce max sgpr usage
Removed some restrictions on VectorWidth and DirectToVgpr
Updated the dependency requirements for Tensile
Changed the range of AssertSummationElementMultiple
Modified the error messages for more clarity
Changed DivideAndReminder to vectorStaticRemainder in case quotient is not used
Removed dummy vgpr for vectorStaticRemainder
Removed tmpVgpr parameter from vectorStaticRemainder/Divide/DivideAndReminder
Removed qReg parameter from vectorStaticRemainder
Fixed#
Fixed tmp sgpr allocation to avoid over-writing values (alpha)
64-bit offset parameters for post kernels
Fixed gfx908 CI test failures
Fixed offset calculation to prevent overflow for large offsets
Fixed issues when BufferLoad and BufferStore are equal to zero
Fixed StoreCInUnroll + DirectToVgpr + no useInitAccVgprOpt mismatch
Fixed DirectToVgpr + LocalSplitU + FractionalLoad mismatch
Fixed the memory access error related to StaggerU + large stride
Fixed ZGEMM 4x4 MatrixInst mismatch
Fixed DGEMM 4x4 MatrixInst mismatch
Fixed ASEM + GSU + NoTailLoop opt mismatch
Fixed AssertSummationElementMultiple + GlobalSplitU issues
Fixed ASEM + GSU + TailLoop inner unroll