Changelog#
2024-04-16
This page contains the changelog for AMD ROCm Software.
ROCm 6.1.0#
The ROCm™ 6.1 release consists of new features and fixes to improve the stability and performance of AMD Instinct™ MI300 GPU applications. Notably, we’ve added:
Full support for Ubuntu 22.04.4.
rocDecode, a new ROCm component that provides high-performance video decode support for AMD GPUs. With rocDecode, you can decode compressed video streams while keeping the resulting YUV frames in video memory. With decoded frames in video memory, you can run video post-processing using ROCm HIP, avoiding unnecessary data copies via the PCIe bus.
To learn more, refer to the rocDecode documentation.
OS and GPU support changes#
ROCm 6.1 adds the following operating system support:
MI300A: Ubuntu 22.04.4 and RHEL 9.3
MI300X: Ubuntu 22.04.4
Future releases will add additional operating systems to match our general offering. For older generations of supported AMD Instinct products, we’ve added Ubuntu 22.04.4 support.
Installation packages#
This release includes a new set of packages for every module (all libraries and binaries default to DT_RPATH). Package names have the suffix rpath; for example, the rpath variant of rocminfo is rocminfo-rpath.
Warning
The new rpath packages will conflict with the default packages; they are meant to be used only in environments where legacy DT_RPATH is the preferred form of linking (instead of DT_RUNPATH). We do not recommend trying to install both sets of packages.
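The naming rule and the conflict warning above can be sketched in Python. This is a minimal illustration, not an AMD tool: it assumes the suffix is joined with a hyphen for every package (as in rocminfo-rpath), and the helper names and the installed-package list are hypothetical.

```python
RPATH_SUFFIX = "-rpath"  # assumption: same suffix form as rocminfo-rpath


def rpath_variant(package: str) -> str:
    """Return the DT_RPATH variant name for a default package name."""
    # Names that already carry the suffix are returned unchanged.
    if package.endswith(RPATH_SUFFIX):
        return package
    return package + RPATH_SUFFIX


def find_conflicts(installed):
    """List default packages whose rpath variant is also installed."""
    names = set(installed)
    return sorted(
        name
        for name in names
        if not name.endswith(RPATH_SUFFIX) and rpath_variant(name) in names
    )


# Hypothetical list of installed packages, for illustration only.
installed_packages = ["rocminfo", "rocminfo-rpath", "hipcc"]
print(find_conflicts(installed_packages))
```

A non-empty result means both sets are present for at least one module, which is the situation the warning advises against.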
AMD SMI#
AMD SMI for ROCm 6.1.0
Additions#
Added Monitor command. This lets you customize which GPU metrics to capture, collect, and observe. Output is provided in a table view. This aligns more closely with the output of ROCm SMI's rocm-smi (no argument) and lets you select the data that are helpful for your use case.
Integrated ESMI Tool. You can get CPU metrics and telemetry through our API and CLI tools, using the amd-smi static and amd-smi metric commands. This is only available for limited target processors. As of ROCm 6.0.2, this is listed as:
AMD Zen3 based CPU Family 19h Models 0h-Fh and 30h-3Fh
AMD Zen4 based CPU Family 19h Models 10h-1Fh and A0-AFh
Added support for new metrics: VCN, JPEG engines, and PCIe errors. Using the AMD SMI tool, you can retrieve VCN, JPEG engines, and PCIe errors by calling amd-smi metric -P or amd-smi metric --usage. Depending on device support, VCN_ACTIVITY will update for MI3x ASICs (with 4 separate VCN engine activities); for older ASICs, MM_ACTIVITY will update with UVD/VCN engine activity (average of all engines). JPEG_ACTIVITY is a new field for MI3x ASICs, where a device can support up to 32 JPEG engine activities. See our documentation for a more in-depth understanding of these new fields.
Added AMDSMI Tool version. AMD SMI will report three versions: AMDSMI Tool, AMDSMI Library version, and ROCm version. The AMDSMI Tool version is the CLI/tool version number with the commit ID appended after the + sign. The AMDSMI Library version is the library package version number. The ROCm version is the system’s installed ROCm version; if ROCm is not installed, it reports N/A.
Added XGMI table. Displays XGMI information for AMD GPU devices in a table format. This is only available on supported ASICs (e.g., MI300). Here, users can view read/write data XGMI or PCIe accumulated data transfer size (in KiloBytes).
Added units of measure to JSON output. We added units of measure to the JSON/CSV output of the amd-smi metric, amd-smi static, and amd-smi monitor commands.
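As a small illustration of the AMDSMI Tool version format described above (version number, then the commit ID after the + sign), the following Python sketch splits such a string. The function name and the sample version string are hypothetical; only the "+" separator comes from the text.

```python
def split_tool_version(tool_version: str):
    """Split an AMDSMI Tool version into (version, commit_id).

    commit_id is None when the string has no '+' separator.
    """
    version, separator, commit_id = tool_version.partition("+")
    return version, (commit_id if separator else None)


# Hypothetical version string, for illustration only.
version, commit_id = split_tool_version("24.4.1+abc1234")
print(version, commit_id)
```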
Changes#
Topology is now left-aligned, with the BDF of each device listed in each table’s rows and columns. We provide each device’s BDF for every table’s rows and columns, then left-align the data. We want AMD SMI Tool output to be easy to understand and digest. Having to scroll up to find this information made it difficult to follow, especially for devices that have many devices associated with one ASIC.
Fixes#
Fix for RDNA3/RDNA2/MI100 ‘amdsmi_get_gpu_pci_bandwidth()’ in ‘frequencies_read’ tests. For devices that do not report (e.g., RDNA3/RDNA2/MI100), we have added checks to confirm that these devices return AMDSMI_STATUS_NOT_SUPPORTED. Otherwise, tests now display a return string.
Fix for devices that have an older PyYAML installed. For platforms that are identified as having an older PyYAML version or pip, we now manually update both pip and PyYAML as needed. This fix impacts the following CLI commands:
amd-smi list
amd-smi static
amd-smi firmware
amd-smi metric
amd-smi topology
Fix for crash when user is not a member of video/render groups. AMD SMI now uses the same mutex handler for devices as ROCm SMI. This helps avoid crashes when DRM/device data are inaccessible to the logged-in user.
Known issues#
There is an AttributeError while running amd-smi process --csv
GPU reset results in an “Unable to reset non-amd GPU” error
Bad pages results in a “ValueError: NULL pointer access” error
Some RDNA3 cards may enumerate to Slot type = UNKNOWN
HIP#
HIP 6.1 for ROCm 6.1
Additions#
New environment variable, HIP_LAUNCH_BLOCKING, which is used for serialization on kernel execution.
The default value is 0 (disable): kernel runs normally, as defined in the queue
When set as 1 (enable): HIP runtime serializes the kernel enqueue and behaves the same as AMD_SERIALIZE_KERNEL
Added HIPRTC support for hip headers driver_types, math_functions, library_types, hip_math_constants, channel_descriptor, device_functions, hip_complex, surface_types, texture_types
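One way to toggle HIP_LAUNCH_BLOCKING for a single run, rather than for the whole shell session, is to build a per-process environment. The Python sketch below assumes only the variable name and its 0/1 values from the text above; the helper name and the application path in the comment are hypothetical.

```python
import os


def hip_launch_env(blocking: bool, base=None):
    """Return a copy of an environment with HIP_LAUNCH_BLOCKING set."""
    env = dict(os.environ if base is None else base)
    # 1 serializes kernel enqueue; 0 keeps the default behavior.
    env["HIP_LAUNCH_BLOCKING"] = "1" if blocking else "0"
    return env


env = hip_launch_env(True)
print(env["HIP_LAUNCH_BLOCKING"])
# A launch could then look like this (hypothetical binary name):
# subprocess.run(["./my_hip_app"], env=env, check=True)
```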
Changes#
HIPRTC now assumes WGP mode for gfx10+. You can enable CU mode by passing -mcumode to the compile options from hiprtcCompileProgram.
Fixes#
HIP complex vector type multiplication and division operations. On an AMD platform, some duplicated complex operators are removed to avoid compilation failures. In HIP, hipFloatComplex and hipDoubleComplex are defined as complex datatypes:
typedef float2 hipFloatComplex
typedef double2 hipDoubleComplex
Any application that uses complex multiplication and division operations must replace the * and / operators with the following:
hipCmulf() and hipCdivf() for hipFloatComplex
hipCmul() and hipCdiv() for hipDoubleComplex
Note that these complex operations are equivalent to corresponding types/functions on an NVIDIA platform.
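For reference, the arithmetic behind these helpers is ordinary complex multiplication and division on the two components of the vector type. The Python sketch below is a stand-in for that arithmetic, not HIP code: Float2, cmul, and cdiv are hypothetical names that mirror the (x, y) layout of float2, with x as the real part and y as the imaginary part.

```python
from typing import NamedTuple


class Float2(NamedTuple):
    """Stand-in for a two-component vector: x is real, y is imaginary."""
    x: float
    y: float


def cmul(p: Float2, q: Float2) -> Float2:
    """Complex product (p.x + i*p.y) * (q.x + i*q.y)."""
    return Float2(p.x * q.x - p.y * q.y, p.y * q.x + p.x * q.y)


def cdiv(p: Float2, q: Float2) -> Float2:
    """Complex quotient p / q, using the squared magnitude of q."""
    sqabs = q.x * q.x + q.y * q.y
    return Float2((p.x * q.x + p.y * q.y) / sqabs,
                  (p.y * q.x - p.x * q.y) / sqabs)


a = Float2(1.0, 2.0)
b = Float2(3.0, 4.0)
product = cmul(a, b)
print(product, cdiv(product, b))
```

Dividing the product by b recovers a, which is a quick sanity check when porting code from the operators to the function form.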
ROCm Compiler#
ROCm Compiler for ROCm 6.1.0
Additions#
Compiler now generates .uniform_work_group_size and records it in the metadata. It indicates if the kernel requires that each dimension of global size is a multiple of the corresponding dimension of work-group size. A value of 1 is true, and 0 is false. This metadata is only provided when the value is 1.
Added the rocm-llvm-docs package.
Added ROCm Device-Libs, ROCm Compiler Support, and hipCC within the llvm-project/amd subdirectory to AMD’s fork of the LLVM project.
Added support for C++ Parallel Algorithm Offload via HIP (HIPSTDPAR), which allows parallel algorithms to run on the GPU.
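The condition that .uniform_work_group_size describes (each dimension of the global size is a multiple of the corresponding work-group dimension) can be checked on the host side with a few lines of Python. This is an illustrative sketch with a hypothetical function name and example sizes; it does not read the compiler metadata.

```python
def is_uniform_work_group_size(global_size, work_group_size) -> bool:
    """True if every global dimension is a multiple of its work-group dimension."""
    if len(global_size) != len(work_group_size):
        raise ValueError("global and work-group sizes need the same rank")
    return all(g % w == 0 for g, w in zip(global_size, work_group_size))


# Example launch shapes, chosen only for illustration.
print(is_uniform_work_group_size((1024, 64, 1), (256, 8, 1)))
print(is_uniform_work_group_size((1000, 64, 1), (256, 8, 1)))
```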
Changes#
rocm-clang-ocl is now an optional package and will require manual installation.
Deprecations#
hipCC adds -mllvm -amdgpu-early-inline-all=true and -mllvm -amdgpu-function-calls=false by default to compiler invocations. These flags will be removed from hipCC in a future ROCm release.
Fixes#
AddressSanitizer (ASan):
Added sanitized_padded_global LLVM IR attribute to identify sanitizer-instrumented globals.
For ASan-instrumented globals, emit two symbols: one with the actual size and the other with the instrumented size.
Known issues#
Due to an issue within the amd-llvm compiler shipping with ROCm 6.1, HIPSTDPAR’s interposition mode, which is enabled by --hipstdpar-interpose-alloc, is currently broken.
The temporary workaround is to use the upstream LLVM 18 (or newer) compiler. This issue will be addressed in a future ROCm release.
ROCm Data Center (RDC)#
RDC for ROCm 6.1.0
Changes#
Added --address flag to rdcd
Upgraded from C++11 to C++17
Upgraded gRPC
ROCDebugger (ROCgdb)#
ROCgdb for ROCm 6.1.0
Fixes#
Previously, ROCDebugger encountered hangs and crashes when stepping over the s_endpgm
instruction at the end of a HIP kernel entry function, which caused the stepped wave to exit. This issue
is fixed in the ROCm 6.1 release. You can now step over the last instruction of any HIP kernel without
debugger hangs or crashes.
ROCm SMI#
ROCm SMI for ROCm 6.1.0
Additions#
Added support to set max/min clock level for sclk (‘RSMI_CLK_TYPE_SYS’) or mclk (‘RSMI_CLK_TYPE_MEM’). You can now set a maximum or minimum sclk or mclk value through the rsmi_dev_clk_extremum_set() API, provided the ASIC supports it. Alternatively, you can use our Python CLI tool (rocm-smi --setextremum max sclk 1500).
Added rsmi_dev_target_graphics_version_get(). You can now query through the ROCm SMI API (rsmi_dev_target_graphics_version_get()) to retrieve the target graphics version for a GPU device. Currently, this output is not supplied through our ROCm SMI CLI.
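For scripted use, the CLI invocation quoted above can be assembled as an argument list. The Python sketch below only builds the command; it assumes that min and mclk follow the same argument order as the max sclk example from the text, and the helper name is hypothetical.

```python
def setextremum_command(bound: str, clock: str, mhz: int):
    """Build the argument list for rocm-smi --setextremum."""
    if bound not in ("min", "max"):
        raise ValueError("bound must be 'min' or 'max'")
    if clock not in ("sclk", "mclk"):
        raise ValueError("clock must be 'sclk' or 'mclk'")
    return ["rocm-smi", "--setextremum", bound, clock, str(mhz)]


command = setextremum_command("max", "sclk", 1500)
print(" ".join(command))
# Running it requires a supported ASIC and sufficient privileges, e.g.:
# subprocess.run(command, check=True)
```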
Changes#
Removed non-unified API headers: Individual GPU metric APIs are no longer supported. The individual metric APIs (rsmi_dev_metrics_*) were removed in order to keep updates easier for new GPU metric support. Device metrics are now reported through a single API (rsmi_dev_gpu_metrics_info_get()); note that using rsmi_dev_gpu_metrics_info_get() carries a risk of ABI breakage. It is vital to understand that ABI breaks are necessary (in some cases) in order to support newer ASICs and metrics for our customers. We will continue to support rsmi_dev_gpu_metrics_info_get() with these considerations and limitations in mind.
Deprecated ‘rsmi_dev_power_ave_get()’; use the newer API, ‘rsmi_dev_power_get()’. As outlined in the change for 6.0.0 (Added a generic power API: rsmi_dev_power_get), rsmi_dev_power_ave_get() is now deprecated. You must update your ROCm SMI API calls accordingly.
Fixes#
Fixed --showpids reporting [PID] [PROCESS NAME] 1 UNKNOWN UNKNOWN UNKNOWN. Output was failing because the cu_occupancy debugfs method is not provided on some graphics cards by design. get_compute_process_info_by_pid was updated to reflect this and returns with the output needed by the CLI.
Fixed rocm-smi --showpower output, which was inconsistent on some RDNA3 devices. We updated this to use rsmi_dev_power_get() within the CLI to provide a consistent device power output. This was caused by using the now-deprecated rsmi_dev_power_ave_get() API.
Fixed rocm-smi --setcomputepartition and rocm-smi --resetcomputepartition to notate if device is EBUSY
Fixed rocm-smi --setmemorypartition and rocm-smi --resetmemorypartition read only SYSFS to return RSMI_STATUS_NOT_SUPPORTED. The rsmi_dev_memory_partition_set API is updated to handle the read-only SYSFS check. Corresponding tests and CLI (rocm-smi --setmemorypartition and rocm-smi --resetmemorypartition) calls were updated accordingly.
Fixed rocm-smi --showclkvolt and rocm-smi --showvc, which were displaying 0 for overdrive and that the voltage curve is not supported.
ROCProfiler#
ROCProfiler for ROCm 6.1.0
Fixes#
Fixed ROCprofiler to match versioning changes in HIP Runtime
Fixed plugins race condition
Updated metrics to MI300
ROCm Validation Suite#
Known issue#
In a future release, the ROCm Validation Suite P2P Benchmark and Qualification Tool (PBQT) tests will be optimized to meet the target bandwidth requirements for MI300X.
MI200 SR-IOV#
Known issue#
Multimedia applications may encounter compilation errors in the MI200 Single Root Input/Output Virtualization (SR-IOV) environment. This is because MI200 SR-IOV does not currently support multimedia applications.
AMD MI300A RAS#
Fixed defect#
GFX correctable and uncorrectable error inject failures#
Previously, the AMD CPU Reliability, Availability, and Serviceability (RAS) installation encountered correctable and uncorrectable failures while injecting an error.
This issue is resolved in the ROCm 6.1 release, and users will no longer encounter the GFX correctable error (CE) and uncorrectable error (UE) failures.
Library changes in ROCm 6.1.0#
Library | Version |
---|---|
AMDMIGraphX | 2.8 ⇒ 2.9 |
composable_kernel | |
hipBLAS | 2.0.0 ⇒ 2.1.0 |
hipBLASLt | |
hipCUB | 3.0.0 ⇒ 3.1.0 |
hipFFT | 1.0.13 ⇒ 1.0.14 |
hipRAND | |
hipSOLVER | 2.0.0 ⇒ 2.1.0 |
hipSPARSE | 3.0.0 ⇒ 3.0.1 |
hipSPARSELt | |
hipTensor | 1.1.0 ⇒ 1.2.0 |
MIOpen | 2.19.0 ⇒ 3.1.0 |
MIVisionX | |
rccl | |
rocALUTION | 3.0.3 ⇒ 3.1.1 |
rocBLAS | 4.0.0 ⇒ 4.1.0 |
rocDecode | |
rocFFT | 1.0.25 ⇒ 1.0.26 |
rocm-cmake | 0.11.0 ⇒ 0.12.0 |
rocPRIM | 3.0.0 ⇒ 3.1.0 |
rocRAND | 3.0.0 ⇒ 3.0.1 |
rocSOLVER | 3.24.0 ⇒ 3.25.0 |
rocSPARSE | 3.0.2 ⇒ 3.1.2 |
rocThrust | 3.0.0 ⇒ 3.0.1 |
rocWMMA | 1.3.0 ⇒ 1.4.0 |
rpp | 1.4.0 ⇒ 1.5.0 |
Tensile | 4.39.0 ⇒ 4.40.0 |
AMDMIGraphX 2.9#
MIGraphX 2.9 for ROCm 6.1.0
Additions#
Added FP8 support
Created a dockerfile with MIGraphX+ONNX Runtime EP+Torch
Added support for the Hardmax, DynamicQuantizeLinear, Qlinearconcat, Unique, QLinearAveragePool, QLinearSigmoid, QLinearLeakyRelu, QLinearMul, IsInf operators
Created web site examples for Whisper, Llama-2, and Stable Diffusion 2.1
Created examples of using the ONNX Runtime MIGraphX Execution Provider with the InceptionV3 and Resnet50 models
Updated operators to support ONNX Opset 19
Enabled fuse_pointwise and fuse_reduce in the driver
Added support for dot-(mul)-softmax-dot offloads to MLIR
Added Blas auto-tuning for GEMMs
Added dynamic shape support for the multinomial operator
Added fp16 to accuracy checker
Added initial code for running on Windows OS
Optimizations#
Improved the output of migraphx-driver command
Documentation now shows all environment variables
Updates needed for general stride support
Enabled Asymmetric Quantization
Added ScatterND unsupported reduction modes
Rewrote softmax for better performance
General improvement to how quantization is performed to support INT8
Used problem_cache for gemm tuning
Improved performance by always using rocMLIR for quantized convolution
Improved group convolutions by using rocMLIR
Improved accuracy of fp16 models
ScatterElements unsupported reduction
Added concat fusions
Improved INT8 support to include UINT8
Allow reshape ops between dq and quant_op
Improve dpp reductions on navi
Have the accuracy checker print the whole final buffer
Added support for handling dynamic Slice and ConstantOfShape ONNX operators
Add support for the dilations attribute to Pooling ops
Add layout attribute support for LSTM operator
Improved performance by removing contiguous for reshapes
Handle all slice input variations
Add scales attribute parse in upsample for older opset versions
Added support for uneven Split operations
Improved unit testing to run in python virtual environments
Fixes#
Fixed outstanding issues in autogenerated documentation
Update model zoo paths for examples
Fixed promote_literals_test by using additional if condition
Fixed export API symbols from dynamic library
Fixed bug in pad operator from dimension reduction
Fixed using the LD to embed files and enable by default when building shared libraries on linux
fixed get_version()
Fixed Round operator inaccuracy
Fixed wrong size check when axes not present for slice
Set the .SO version correctly
Changes#
Cleanup LSTM and RNN activation functions
Placed gemm_pointwise at a higher priority than layernorm_pointwise
Updated README to mention the need to include GPU_TARGETS when building MIGraphX
Removals#
Removed unused device kernels from Gather and Pad operators
Removed int8x4 format
hipBLAS 2.1.0#
hipBLAS 2.1.0 for ROCm 6.1.0
Additions#
New build option to automatically use hipconfig --platform to determine HIP platform
Level 1 functions have additional ILP64 API for both C and Fortran (_64 name suffix) with int64_t function arguments
New functions hipblasGetMathMode and hipblasSetMathMode
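The ILP64 variants matter when a size or stride no longer fits in a 32-bit signed integer. The Python sketch below shows the range check and the _64 naming rule from the text above; the helper names are hypothetical, and hipblasSaxpy is used only as an example of a Level 1 function name.

```python
INT32_MAX = 2**31 - 1  # largest value a 32-bit signed argument can hold


def needs_ilp64(*sizes: int) -> bool:
    """True if any size or stride exceeds the 32-bit signed range."""
    return any(abs(size) > INT32_MAX for size in sizes)


def ilp64_name(function_name: str) -> str:
    """Apply the _64 name suffix used by the ILP64 API."""
    return function_name + "_64"


n = 3_000_000_000  # example vector length beyond the int32 range
name = "hipblasSaxpy"
print(ilp64_name(name) if needs_ilp64(n) else name)
```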
Deprecations#
USE_CUDA build option; use HIP_PLATFORM=amd or HIP_PLATFORM=nvidia to override hipconfig
Changes#
Some Level 2 function argument names have changed from m to n to match legacy BLAS; there was no change in implementation.
Updated client code to use YAML-based testing
Renamed .doxygen and .sphinx folders to doxygen and sphinx, respectively
Added CMake support for documentation
hipBLASLt 0.7.0#
hipBLASLt 0.7.0 for ROCm 6.1.0
Additions#
Added hipblasltExtSoftmax extension API
Added hipblasltExtLayerNorm extension API
Added hipblasltExtAMax extension API
Added GemmTuning extension parameter to set split-k by user
Support for mixed-precision datatype: fp16/fp8 in with fp16 out
hipCUB 3.1.0#
hipCUB 3.1.0 for ROCm 6.1.0
Changed#
CUB backend references CUB and Thrust version 2.1.0.
Updated HIPCUB_HOST_WARP_THREADS macro definition to match host_warp_size changes from rocPRIM 3.0.
Implemented __int128_t and __uint128_t support for radix_sort.
Fixed#
Fixed build issues with rmake.py on Windows when using VS 2017 15.8 or later due to a breaking fix with extended aligned storage.
Added#
Added interface DeviceMemcpy::Batched for batched memcpy from rocPRIM and CUB.
hipFFT 1.0.14#
hipFFT 1.0.14 for ROCm 6.1.0
Changes#
When building hipFFT from source, rocFFT code no longer needs to be initialized as a git submodule.
Fixes#
Fixed error when creating length-1 plans.
hipSOLVER 2.1.0#
hipSOLVER 2.1.0 for ROCm 6.1.0
Added#
Added compatibility API with hipsolverSp prefix
Added compatibility-only functions
csrlsvchol
hipsolverSpScsrlsvcholHost, hipsolverSpDcsrlsvcholHost
hipsolverSpScsrlsvchol, hipsolverSpDcsrlsvchol
Added rocSPARSE and SuiteSparse as optional dependencies to hipSOLVER (rocSOLVER backend only). Use the BUILD_WITH_SPARSE CMake option to enable functionality for the hipsolverSp API (on by default).
Added hipSPARSE as an optional dependency to hipsolver-test. Use the BUILD_WITH_SPARSE CMake option to enable tests of the hipsolverSp API (on by default).
Changed#
Relax array length requirements for GESVDA.
Fixed#
Fixed incorrect singular vectors returned from GESVDA.
hipSPARSE 3.0.1#
hipSPARSE 3.0.1 for ROCm 6.1.0
Fixes#
Fixes to the build chain
hipSPARSELt 0.2.0#
hipSPARSELt 0.2.0 for ROCm 6.1.0
Added#
Added support for Matrix B being a structured sparsity matrix.
hipTensor 1.2.0#
hipTensor 1.2.0 for ROCm 6.1.0
Additions#
API support for permutation of rank 4 tensors: f16 and f32
New datatype support in contractions of rank 4: f16, bf16, complex f32, complex f64
Added scale and bilinear contraction samples and tests for new supported data types
Added permutation samples and tests for f16, f32 types
Fixes#
Fixed bug in contraction calculation with data type f32
MIOpen 3.1.0#
MIOpen 3.1.0 for ROCm 6.1.0
Added#
CK-based 2d/3d convolution solvers to support nchw/ncdhw layout
Fused solver for Fwd Convolution with Residual, Bias and activation
AI Based Parameter Prediction Model for conv_hip_igemm_group_fwd_xdlops Solver
Forward, backward data and backward weight convolution solver with fp8/bfp8
check for packed tensors for convolution solvers
Integrate CK’s layer norm
Combine gtests into single binary
Fixed#
fix for backward passes bwd/wrw for CK group conv 3d
Fixed out-of-bounds memory access : ConvOclDirectFwdGen
fixed build failure due to hipRTC
Changed#
Standardize workspace abstraction
Use split CK libraries
Removed#
clamping to MAX from CastTensor used in Bwd and WrW convolution
rccl 2.18.6#
RCCL 2.18.6 for ROCm 6.1.0
Changed#
Compatibility with NCCL 2.18.6
rocALUTION 3.1.1#
rocALUTION 3.1.1 for ROCm 6.1.0
Additions#
TripleMatrixProduct functionality for GlobalMatrix
Multi-Node/GPU support for UA-AMG, SA-AMG and RS-AMG
Iterative ILU0 preconditioner ItILU0
Iterative triangular solve, selectable via SolverDecr class
Deprecations#
LocalMatrix::AMGConnect
LocalMatrix::AMGAggregate
LocalMatrix::AMGPMISAggregate
LocalMatrix::AMGSmoothedAggregation
LocalMatrix::AMGAggregation
PairwiseAMG
Known Issues#
PairwiseAMG does not currently support matrix sizes that exceed int32 range
PairwiseAMG might fail building the hierarchy on certain input matrices
rocBLAS 4.1.0#
rocBLAS 4.1.0 for ROCm 6.1.0
Additions#
Level 1 and Level 1 Extension functions have additional ILP64 API for both C and FORTRAN (_64 name suffix) with int64_t function arguments.
Cache flush timing for gemm_ex.
Changes#
Some Level 2 function argument names have changed from ‘m’ to ‘n’ to match legacy BLAS; there was no change in implementation.
Standardized the use of non-blocking streams for copying results from device to host.
Fixes#
Fixed host-pointer mode reductions for non-blocking streams.
rocDecode 0.5.0#
rocDecode 0.5.0 for ROCm 6.1.0
Changes#
Changed setup updates
Added AMDGPU package support
Optimized package dependencies
Updated README
Fixes#
Minor bug fix and updates
Tested Configurations#
Linux distribution
Ubuntu - 20.04/22.04
ROCm:
rocm-core - 6.1.0.60100-28
amdgpu-core - 1:6.1.60100-1731559
FFMPEG - 4.2.7/4.4.2-0
rocDecode Setup Script - V1.4
rocFFT 1.0.26#
rocFFT 1.0.26 for ROCm 6.1.0
Changes#
Multi-device FFTs now allow batch greater than 1
Multi-device, real-complex FFTs are now supported
rocFFT now statically links libstdc++ when only std::experimental::filesystem is available (to guard against ABI incompatibilities with newer libstdc++ libraries that include std::filesystem)
rocm-cmake 0.12.0#
rocm-cmake 0.12.0 for ROCm 6.1.0
Changed#
ROCMSphinxDoc: Allow separate source and config directories.
ROCMCreatePackage: Allow additional PROVIDES on header-only packages.
ROCMInstallTargets: Don’t install executable targets by default for ASAN builds.
ROCMTest: Add RPATH for installed tests.
Finalize rename to ROCmCMakeBuildTools
Fixed#
ROCMClangTidy: Fixed invalid list index.
Test failures when ROCM_CMAKE_GENERATOR is empty.
rocPRIM 3.1.0#
rocPRIM 3.1.0 for ROCm 6.1.0
Additions#
New primitive: block_run_length_decode
New primitive: batch_memcpy
Changes#
Renamed:
scan_config_v2 to scan_config
scan_by_key_config_v2 to scan_by_key_config
radix_sort_config_v2 to radix_sort_config
reduce_by_key_config_v2 to reduce_by_key_config
Removed support for custom config types for device algorithms
host_warp_size() was moved into rocprim/device/config_types.hpp; it now uses either device_id or a stream parameter to query the proper device and a device_id out parameter
The return type is hipError_t
Added support for __int128_t in device_radix_sort and block_radix_sort
Improved the performance of match_any, and block_histogram which uses it
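Code that still refers to the _v2 config names listed above can be updated mechanically. The Python sketch below is a hypothetical migration helper, not part of rocPRIM: it rewrites only whole identifiers, and the template arguments in the sample line are placeholders.

```python
import re

# Old name -> new name, taken from the rename list above.
RENAMES = {
    "scan_config_v2": "scan_config",
    "scan_by_key_config_v2": "scan_by_key_config",
    "radix_sort_config_v2": "radix_sort_config",
    "reduce_by_key_config_v2": "reduce_by_key_config",
}

# Match whole identifiers only, so names such as my_scan_config_v2 are left alone.
_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, RENAMES)) + r")\b")


def migrate_config_names(source: str) -> str:
    """Replace the renamed rocPRIM config identifiers in a source string."""
    return _PATTERN.sub(lambda match: RENAMES[match.group(1)], source)


# Placeholder line of user code, for illustration only.
line = "using sort_cfg = rocprim::radix_sort_config_v2<8, 4>;"
print(migrate_config_names(line))
```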
Deprecations#
Removed reduce_by_key_config, MatchAny, scan_config, scan_by_key_config, and radix_sort_config
Fixes#
Build issues with rmake.py on Windows when using VS 2017 15.8 or later (due to a breaking fix with extended aligned storage)
rocRAND 3.0.1#
rocRAND 3.0.1 for ROCm 6.1.0
Fixes#
Implemented workaround for regressions in XORWOW and LFSR on MI200
rocSOLVER 3.25.0#
rocSOLVER 3.25.0 for ROCm 6.1.0
Added#
Eigensolver routines for symmetric/hermitian matrices using Divide & Conquer and Jacobi algorithm:
SYEVDJ (with batched and strided_batched versions)
HEEVDJ (with batched and strided_batched versions)
Generalized symmetric/hermitian-definite eigensolvers using Divide & Conquer and Jacobi algorithm:
SYGVDJ (with batched and strided_batched versions)
HEGVDJ (with batched and strided_batched versions)
Changed#
Relaxed array length requirements for GESVDX with rocblas_srange_index.
Removed#
Removed gfx803 and gfx900 from default build targets.
Fixed#
Corrected singular vector normalization in BDSVDX and GESVDX
Fixed potential memory access fault in STEIN, SYEVX/HEEVX, SYGVX/HEGVX, BDSVDX and GESVDX
rocSPARSE 3.1.2#
rocSPARSE 3.1.2 for ROCm 6.1.0
Additions#
New LRB algorithm to SpMV, supporting CSR format
rocBLAS is now an optional dependency for SDDMM algorithms
Additional verbose output for csrgemm and bsrgemm
Optimizations#
Triangular solve with multiple rhs (SpSM, csrsm, …) now calls SpSV, csrsv, etcetera when nrhs equals 1
Improved user manual section Installation and Building for Linux and Windows
Improved SpMV in CSR format on MI300
rocThrust 3.0.1#
rocThrust 3.0.1 for ROCm 6.1.0
Fixes#
Ported a fix from thrust 2.2 that ensures thrust::optional is trivially copyable.
rocWMMA 1.4.0#
rocWMMA 1.4.0 for ROCm 6.1.0
Additions#
Added bf16 support for hipRTC sample
Changes#
Changed Clang C++ version to C++17
Updated rocwmma_coop API
Linked rocWMMA to hiprtc
Fixes#
Fixed compile/runtime arch checks
Built all test in large code model
Removed inefficient branching in layout loop unrolling
rpp 1.5.0#
rpp 1.5.0 for ROCm 6.1.0
Changes#
Prerequisites
Tested Configurations#
Linux distribution
Ubuntu - 20.04/22.04
CentOS - 7
RHEL - 8/9
ROCm: rocm-core - 5.5.0.50500-63
Clang - Version 5.0.1 and above
CMake - Version 3.22.3
IEEE 754-based half-precision floating-point library - Version 1.12.0
Tensile 4.40.0#
Tensile 4.40.0 for ROCm 6.1.0
Additions#
new DisableKernelPieces values to invalidate local read, local write, and global read
stream-K kernel generation, including two-tile stream-k algorithm by setting StreamK=3
feature to allow testing stream-k grid multipliers
debug output to check occupancy for Stream-K
reject condition for FractionalLoad + DepthU!=power of 2
new TENSILE_DB debugging value to dump the common kernel parameters
predicate for APU libs
new parameter (ClusterLocalRead) to turn on/off wider local read opt for TileMajorLDS
new parameter (ExtraLatencyForLR) to add extra interval between local read and wait
new logic to check LDS size with auto LdsPad(=1) and change LdsPad to 0 if LDS overflows
initialization type and general batched options to the rocblas-bench input creator script
Optimizations#
enabled MFMA + LocalSplitU=4 for MT16x16
enabled (DirectToVgpr + MI4x4) and supported skinny MacroTile
optimized postGSU kernel: separate postGSU kernels for different GSU values, loop unroll for GSU loop, wider global load depending on array size, and parallel reduction depending on array size
auto LdsPad calculation for TileMajorLds + MI16x16
auto LdsPad calculation for UnrollMajorLds + MI16x16 + VectorWidth
Changes#
cleared hipErrorNotFound error since it is an expected part of the search
modified hipcc search path for Linux
changed PCI ID from 32bit to 64bit for ROCm SMI HW monitor
changed LdsBlockSizePerPad to LdsBlockSizePerPadA, B to specify LBSPP separately
changed the default value of LdsPadA, B, LdsBlockSizePerPadA, B from 0 to -1
updated test cases according to parameter changes for LdsPad, LBSPP and ClusterLocalRead
Replaced std::regex with fnmatch()/PathMatchSpec as a workaround to std::regex stack overflow known bug
Fixes#
hipcc compile append flag parallel-jobs=4
race condition in Stream-K that appeared with large grids and small sizes
mismatch issue with LdsPad + LdsBlockSizePerPad!=0 and TailLoop
mismatch issue with LdsPad + LdsBlockSizePerPad!=0 and SplitLds
incorrect reject condition check for DirectToLds + LdsBlockSizePerPad=-1 case
small fix for LdsPad optimization (LdsElement calculation)
ROCm 6.0.2#
The ROCm 6.0.2 point release consists of minor bug fixes to improve the stability of MI300 GPU applications. This release introduces several new driver features for system qualification on our partner server offerings.
hipFFT 1.0.13#
hipFFT 1.0.13 for ROCm 6.0.2
Changes#
Removed the Git submodule for shared files between rocFFT and hipFFT; instead, just copy the files over (this should help simplify downstream builds and packaging)
Library changes in ROCm 6.0.2#
Library | Version |
---|---|
AMDMIGraphX | |
composable_kernel | |
hipBLAS | |
hipCUB | |
hipFFT | |
hipRAND | |
hipSOLVER | |
hipSPARSE | |
hipTensor | |
MIOpen | |
MIVisionX | |
rccl | |
rocALUTION | |
rocBLAS | |
rocFFT | |
rocm-cmake | |
rocPRIM | |
rocRAND | 2.10.17 ⇒ 3.0.0 |
rocSOLVER | |
rocSPARSE | |
rocThrust | |
rocWMMA | |
rpp | |
Tensile | |
rocRAND 3.0.0#
rocRAND 3.0.0 for ROCm 6.0.2
Changed#
Generator classes from rocrand.hpp are no longer copyable. In previous versions, these copies would copy internal references to the generators and would lead to double free or memory leak errors. These types should be moved instead of copied, and move constructors and operators are now defined for them.
Optimized#
Improved MT19937 initialization and generation performance.
Removed#
Removed hipRAND submodule from rocRAND. hipRAND is now only available as a separate package.
Removed references to and workarounds for deprecated hcc
Fixed#
mt19937_engine from rocrand.hpp is now move-constructible and move-assignable. Previously, the move constructor and move assignment operator were deleted for this class.
Various fixes for the C++ wrapper header rocrand.hpp
fixed the name of mrg31k3p; it is now correctly spelled (was incorrectly named mrg31k3a in previous versions).
added missing order setter method for threefry4x64
fixed the default ordering parameter for lfsr113
Build error when using clang++ directly due to unsupported references to amdgpu-target
ROCm 6.0.0#
ROCm 6.0 is a major release with new performance optimizations, expanded frameworks and library support, and improved developer experience. This includes initial enablement of the AMD Instinct™ MI300 series. Future releases will further enable and optimize this new platform. Key features include:
Improved performance in areas like lower precision math and attention layers.
New hipSPARSELt library to accelerate AI workloads via AMD’s sparse matrix core technique.
Latest upstream support for popular AI frameworks like PyTorch, TensorFlow, and JAX.
New support for libraries, such as DeepSpeed, ONNX-RT, and CuPy.
Prepackaged HPC and AI containers on AMD Infinity Hub, with improved documentation and tutorials on the AMD ROCm Docs site.
Consolidated developer resources and training on the new AMD ROCm Developer Hub.
The following sections provide a release overview for ROCm 6.0. For additional details, you can refer to the Changelog.
OS and GPU support changes#
AMD Instinct™ MI300A and MI300X Accelerator support has been enabled for limited operating systems.
Ubuntu 22.04.3 (MI300A and MI300X)
RHEL 8.9 (MI300A)
SLES 15 SP5 (MI300A)
We’ve added support for the following operating systems:
RHEL 9.3
RHEL 8.9
Note that, as of ROCm 6.2, we’ve planned for end-of-support (EoS) for the following operating systems:
Ubuntu 20.04.5
SLES 15 SP4
RHEL/CentOS 7.9
New ROCm meta package#
We’ve added a new ROCm meta package for easy installation of all ROCm core packages, tools, and libraries. For example, the following command will install the full ROCm package: apt-get install rocm (Ubuntu), or yum install rocm (RHEL).
Filesystem Hierarchy Standard#
ROCm 6.0 fully adopts the Filesystem Hierarchy Standard (FHS) reorganization goals. We’ve removed the backward compatibility support for old file locations.
Compiler location change#
The installation path of LLVM has been changed from /opt/rocm-<rel>/llvm to /opt/rocm-<rel>/lib/llvm. For backward compatibility, a symbolic link is provided to the old location and will be removed in a future release.
The installation path of the device library bitcode has changed from /opt/rocm-<rel>/amdgcn to /opt/rocm-<rel>/lib/llvm/lib/clang/<ver>/lib/amdgcn. For backward compatibility, a symbolic link is provided and will be removed in a future release.
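Build scripts that hard-code the old locations can derive the new ones from the release string. The Python sketch below follows the path templates given above; the function names are hypothetical, and the release and Clang version values in the usage lines are placeholders rather than values confirmed by this changelog.

```python
from pathlib import PurePosixPath


def rocm_root(rel: str) -> PurePosixPath:
    """Root of a versioned ROCm installation, /opt/rocm-<rel>."""
    return PurePosixPath(f"/opt/rocm-{rel}")


def legacy_llvm_dir(rel: str) -> PurePosixPath:
    """Old LLVM location (kept as a symbolic link for now)."""
    return rocm_root(rel) / "llvm"


def llvm_dir(rel: str) -> PurePosixPath:
    """New LLVM location."""
    return rocm_root(rel) / "lib" / "llvm"


def device_lib_dir(rel: str, clang_ver: str) -> PurePosixPath:
    """New device library bitcode location."""
    return llvm_dir(rel) / "lib" / "clang" / clang_ver / "lib" / "amdgcn"


# Placeholder values for <rel> and <ver>.
print(llvm_dir("6.0.0"))
print(device_lib_dir("6.0.0", "17"))
```

Preferring the new paths now avoids a break when the compatibility links are removed.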
Documentation#
CMake support has been added for documentation in the ROCm repository.
AMD Instinct™ MI50 end-of-support notice#
AMD Instinct MI50, Radeon Pro VII, and Radeon VII products (collectively gfx906 GPUs) enter maintenance mode in ROCm 6.0.
As outlined in 5.6.0, ROCm 5.7 was the final release for gfx906 GPUs in a fully supported state.
Henceforth, no new features and performance optimizations will be supported for the gfx906 GPUs.
Bug fixes and critical security patches will continue to be supported for the gfx906 GPUs until Q2 2024 (end of maintenance [EOM] will be aligned with the closest ROCm release).
Bug fixes will be made up to the next ROCm point release.
Bug fixes will not be backported to older ROCm releases for gfx906.
Distribution and operating system updates will continue per the ROCm release cadence for gfx906 GPUs until EOM.
Known issues#
Hang is observed with rocSPARSE tests: Issue 2726.
AddressSanitizer instrumentation is incorrect for device global variables: Issue 2551.
Dynamically loaded HIP runtime library references incorrect version of hipDeviceGetProperties API: Issue 2728.
Memory access violations when running rocFFT-HMM: Issue 2730.
Library changes#
Library | Version |
---|---|
AMDMIGraphX | ⇒ 2.8 |
HIP | |
hipBLAS | ⇒ 2.0.0 |
hipCUB | ⇒ 3.0.0 |
hipFFT | ⇒ 1.0.13 |
hipSOLVER | ⇒ 2.0.0 |
hipSPARSE | ⇒ 3.0.0 |
hipTensor | ⇒ 1.1.0 |
MIOpen | ⇒ 2.19.0 |
rccl | ⇒ 2.15.5 |
rocALUTION | ⇒ 3.0.3 |
rocBLAS | ⇒ 4.0.0 |
rocFFT | ⇒ 1.0.25 |
ROCgdb | |
rocm-cmake | ⇒ 0.11.0 |
rocPRIM | ⇒ 3.0.0 |
rocprofiler | |
rocRAND | ⇒ 2.10.17 |
rocSOLVER | ⇒ 3.24.0 |
rocSPARSE | ⇒ 3.0.2 |
rocThrust | ⇒ 3.0.0 |
rocWMMA | ⇒ 1.3.0 |
Tensile | ⇒ 4.39.0 |
AMDMIGraphX 2.8#
MIGraphX 2.8 for ROCm 6.0.0
Additions#
Support for TorchMIGraphX via PyTorch
Boosted overall performance by integrating rocMLIR
INT8 support for ONNX Runtime
Support for ONNX version 1.14.1
Added new operators: Qlinearadd, QlinearGlobalAveragePool, Qlinearconv, Shrink, CastLike, and RandomUniform
Added an error message for when gpu_targets is not set during MIGraphX compilation
Added parameter to set tolerances with migraphx-driver verify
Added support for MXR files > 4 GB
Added MIGRAPHX_TRACE_MLIR flag
BETA: Added capability for using ROCm Composable Kernels via the MIGRAPHX_ENABLE_CK=1 environment variable
Optimizations#
Improved performance support for INT8
Improved time precision while benchmarking candidate kernels from CK or MLIR
Removed contiguous from reshape parsing
Updated the ConstantOfShape operator to support Dynamic Batch
Simplified dynamic shapes-related operators to their static versions, where possible
Improved debugging tools for accuracy issues
Included a print warning about miopen_fusion while generating mxr
General reduction in system memory usage during model compilation
Created additional fusion opportunities during model compilation
Improved debugging for matchers
Improved general debug messages
Fixes#
Fixed scatter operator for nonstandard shapes with some models from ONNX Model Zoo
Provided a compile option to improve the accuracy of some models by disabling Fast-Math
Improved layernorm + pointwise fusion matching to ignore argument order
Fixed accuracy issue with ROIAlign operator
Fixed computation logic for the Trilu operator
Fixed support for the DETR model
Changes#
Changed MIGraphX version to 2.8
Extracted the test packages into a separate deb file when building MIGraphX from source
Removals#
Removed building Python 2.7 bindings
AMD SMI#
Integrated the E-SMI library: You can now query CPU-related information directly through AMD SMI. Metrics include power, energy, performance, and other system details.
Added support for gfx942 metrics: You can now query MI300 device metrics to get real-time information. Metrics include power, temperature, energy, and performance.
Added support for compute and memory partitions
HIP 6.0.0#
HIP 6.0.0 for ROCm 6.0.0
Additions#
New fields and structs for external resource interoperability
hipExternalMemoryHandleDesc_st
hipExternalMemoryBufferDesc_st
hipExternalSemaphoreHandleDesc_st
hipExternalSemaphoreSignalParams_st
hipExternalSemaphoreWaitParams_st
Enumerations
hipExternalMemoryHandleType_enum
hipExternalSemaphoreHandleType_enum
New environment variable HIP_LAUNCH_BLOCKING
Used to serialize kernel execution. The default value is 0 (disabled); kernels execute normally as defined in the queue. When this environment variable is set to 1 (enabled), the HIP runtime serializes kernel enqueue, which behaves the same as AMD_SERIALIZE_KERNEL.
More members are added in HIP struct hipDeviceProp_t, for new feature capabilities including:
Texture
int maxTexture1DMipmap;
int maxTexture2DMipmap[2];
int maxTexture2DLinear[3];
int maxTexture2DGather[2];
int maxTexture3DAlt[3];
int maxTextureCubemap;
int maxTexture1DLayered[2];
int maxTexture2DLayered[3];
int maxTextureCubemapLayered[2];
Surface
int maxSurface1D;
int maxSurface2D[2];
int maxSurface3D[3];
int maxSurface1DLayered[2];
int maxSurface2DLayered[3];
int maxSurfaceCubemap;
int maxSurfaceCubemapLayered[2];
Device
hipUUID uuid;
char luid[8]; (this is an 8-byte unique identifier, only valid on Windows)
unsigned int luidDeviceNodeMask;
LUID (Locally Unique Identifier) is supported for interoperability between devices. In HIP, more members are added in the struct hipDeviceProp_t, as properties to identify each device:
char luid[8];
unsigned int luidDeviceNodeMask;
Note
HIP only supports LUID on Windows OS.
Changes#
Some OpenGL Interop HIP APIs are moved from the hip_runtime_api header to a new header file hip_gl_interop.h for the AMD platform, as follows:
hipGLGetDevices
hipGraphicsGLRegisterBuffer
hipGraphicsGLRegisterImage
Changes impacting backward compatibility#
Data types for members in the HIP_MEMCPY3D structure are changed from unsigned int to size_t.
The value of the flag hipIpcMemLazyEnablePeerAccess is changed to 0x01, which was previously defined as 0.
Some device property attributes are not currently supported in HIP runtime. In order to maintain consistency, the following related enumeration names are changed in hipDeviceAttribute_t:
hipDeviceAttributeName is changed to hipDeviceAttributeUnused1
hipDeviceAttributeUuid is changed to hipDeviceAttributeUnused2
hipDeviceAttributeArch is changed to hipDeviceAttributeUnused3
hipDeviceAttributeGcnArch is changed to hipDeviceAttributeUnused4
hipDeviceAttributeGcnArchName is changed to hipDeviceAttributeUnused5
HIP struct hipArray is removed from driver type header to comply with CUDA.
hipArray_t replaces hipArray*, as the pointer to array. This allows hipMemcpyAtoH and hipMemcpyHtoA to have the correct array type which is equivalent to corresponding CUDA driver APIs.
Fixes#
Kernel launch maximum dimension validation is added specifically on gridY and gridZ in the HIP API hipModuleLaunchKernel. As a result, when hipGetDeviceAttribute is called for the value of hipDeviceAttributeMaxGridDim, the behavior on the AMD platform is equivalent to NVIDIA.
The HIP stream synchronization behavior is changed in internal stream functions, in which a flag “wait” is added and set when the current stream is a null pointer while executing stream synchronization on other explicitly created streams. This change avoids blocking of execution on the null/default stream. The change won’t affect usage of applications, and makes them behave the same on the AMD platform as NVIDIA.
Error handling behavior on unsupported GPUs is fixed: the HIP runtime now logs an error message instead of raising a signal abort that was invisible to developers while kernel execution continued. This applies when developers compile an application via hipcc, setting the option --offload-arch with a GPU ID which is different from the one on the system.
HIP complex vector type multiplication and division operations. On the AMD platform, some duplicated complex operators are removed to avoid compilation failures. In HIP, hipFloatComplex and hipDoubleComplex are defined as complex data types:
typedef float2 hipFloatComplex; typedef double2 hipDoubleComplex;
Any application that uses complex multiplication and division operations needs to replace ‘*’ and ‘/’ operators with the following:
hipCmulf() and hipCdivf() for hipFloatComplex
hipCmul() and hipCdiv() for hipDoubleComplex
Note: These complex operations are equivalent to corresponding types/functions on NVIDIA platform.
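In HIP source, an expression such as a * b on hipFloatComplex values becomes hipCmulf(a, b), and a / b becomes hipCdivf(a, b). The sketch below is a host-only stand-in that builds without HIP headers and shows the standard complex arithmetic such functions are expected to perform; the names Complexf, cmulf, and cdivf are hypothetical and chosen only for this illustration.

```cpp
// Stand-in for the float2-based hipFloatComplex, so the sketch builds without HIP.
struct Complexf {
    float x;  // real part
    float y;  // imaginary part
};

// (a + bi)(c + di) = (ac - bd) + (ad + bc)i
inline Complexf cmulf(Complexf p, Complexf q) {
    return {p.x * q.x - p.y * q.y, p.x * q.y + p.y * q.x};
}

// (a + bi) / (c + di) = ((ac + bd) + (bc - ad)i) / (c^2 + d^2)
inline Complexf cdivf(Complexf p, Complexf q) {
    const float denom = q.x * q.x + q.y * q.y;
    return {(p.x * q.x + p.y * q.y) / denom, (p.y * q.x - p.x * q.y) / denom};
}
```

For example, multiplying 1 + 2i by 3 + 4i with cmulf gives -5 + 10i, and dividing that result by 3 + 4i with cdivf recovers 1 + 2i.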
Removals#
Deprecated Heterogeneous Compute (HCC) symbols and flags are removed from the HIP source code, including:
Build options on obsolete HCC_OPTIONS were removed from cmake.
Macro definitions are removed:
HIP_INCLUDE_HIP_HCC_DETAIL_DRIVER_TYPES_H
HIP_INCLUDE_HIP_HCC_DETAIL_HOST_DEFINES_H
Compilation flags for the platform definitions
AMD platform
HIP_PLATFORM_HCC
HCC
HIP_ROCclr
NVIDIA platform
HIP_PLATFORM_NVCC
The hcc_detail and nvcc_detail directories in the clr repository are removed.
Deprecated gcnArch is removed from hip device struct hipDeviceProp_t.
Deprecated enum hipMemoryType memoryType; is removed from HIP struct hipPointerAttribute_t union.
hipBLAS 2.0.0#
hipBLAS 2.0.0 for ROCm 6.0.0
Additions#
New option to define HIPBLAS_USE_HIP_BFLOAT16 to switch API to use the hip_bfloat16 type
New hipblasGemmExWithFlags API
Deprecations#
hipblasDatatype_t; use hipDataType instead
hipblasComplex; use hipComplex instead
hipblasDoubleComplex; use hipDoubleComplex instead
Use of hipblasDatatype_t for hipblasGemmEx for compute-type; use hipblasComputeType_t instead
Removals#
hipblasXtrmm (calculates B <- alpha * op(A) * B) has been replaced with hipblasXtrmm (calculates C <- alpha * op(A) * B)
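The difference between the two forms is where the result lands: the old form overwrites B, while the new form writes to a separate output C. The following host-side reference sketch illustrates this for one case only (left side, upper triangular A, no transpose). It is an illustration under stated assumptions: the function name trmm_upper_left, the row-major layout, and the argument list are hypothetical and do not match the hipBLAS signature.

```cpp
#include <cstddef>
#include <vector>

// Reference sketch (not the hipBLAS API): C = alpha * A * B, where A is n x n
// and only its upper triangle is read, and B and C are n x m. Matrices are
// row-major here purely for readability.
void trmm_upper_left(float alpha, const std::vector<float>& A,
                     const float* B, float* C, std::size_t n, std::size_t m) {
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < m; ++j) {
            float sum = 0.0f;
            for (std::size_t k = i; k < n; ++k) {  // entries below the diagonal are skipped
                sum += A[i * n + k] * B[k * m + j];
            }
            // Row i of the result depends only on rows >= i of B, so in this
            // sketch C may point at the same storage as B (in-place behavior).
            C[i * m + j] = alpha * sum;
        }
    }
}
```

Passing distinct buffers leaves B untouched, which is the out-of-place behavior; passing the same buffer for B and C in this sketch reproduces the old in-place result. How the real API expresses the in-place case should be checked against the hipBLAS documentation.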
hipCUB 3.0.0#
hipCUB 3.0.0 for ROCm 6.0.0
Changes#
Removed DOWNLOAD_ROCPRIM: you can force rocPRIM to download using DEPENDENCIES_FORCE_DOWNLOAD
hipFFT 1.0.13#
hipFFT 1.0.13 for ROCm 6.0.0
Changes#
hipfft-rider has been renamed to hipfft-bench; it is controlled by the BUILD_CLIENTS_BENCH CMake option (note that a link for the old file name is installed, and the old BUILD_CLIENTS_RIDER CMake option is accepted for backwards compatibility, but both will be removed in a future release)
Binaries in debug builds no longer have a -d suffix
The minimum rocFFT required version has been updated to 1.0.21
Additions#
hipfftXtSetGPUs, hipfftXtMalloc, hipfftXtMemcpy, hipfftXtFree, and hipfftXtExecDescriptor APIs have been implemented to allow FFT computing on multiple devices in a single process
hipSOLVER 2.0.0#
hipSOLVER 2.0.0 for ROCm 6.0.0
Additions#
Added hipBLAS as an optional dependency to
hipsolver-test
You can use the
BUILD_HIPBLAS_TESTS
CMake option to test the compatibility between hipSOLVER and hipBLAS
Changes#
The hipsolverOperation_t type is now an alias of hipblasOperation_t
The hipsolverFillMode_t type is now an alias of hipblasFillMode_t
The hipsolverSideMode_t type is now an alias of hipblasSideMode_t
Fixes#
Fixed tests for hipSOLVER info updates in ORGBR/UNGBR, ORGQR/UNGQR, ORGTR/UNGTR, ORMQR/UNMQR, and ORMTR/UNMTR
hipSPARSE 3.0.0#
hipSPARSE 3.0.0 for ROCm 6.0.0
Additions#
Added hipsparseGetErrorName and hipsparseGetErrorString
Changes#
Changed the hipsparseSpSV_solve() API function to match the cuSPARSE API
Changed generic API functions to use const descriptors
Improved documentation
hipTensor 1.1.0#
hipTensor 1.1.0 for ROCm 6.0.0
Additions#
Architecture support for gfx942
Client tests configuration parameters now support YAML file input format
Changes#
Doxygen now treats warnings as errors
Fixes#
Client tests output redirections now behave accordingly
Removed dependency static library deployment
Security issues for documentation
Compile issues in debug mode
Corrected soft link for ROCm deployment
MIOpen 2.19.0#
MIOpen 2.19.0 for ROCm 6.0.0
Additions#
ROCm 5.5 support for gfx1101 (Navi32)
Changes#
Tuning results for MLIR on ROCm 5.5
Bumped MLIR commit to 5.5.0 release tag
Fixes#
3-D convolution host API bug
[HOTFIX][MI200][FP16] has been disabled for ConvHipImplicitGemmBwdXdlops when FP16_ALT is required
MIVisionX#
Added Comprehensive CTests to aid developers
Introduced Doxygen support for complete API documentation
Simplified dependencies for rocAL
OpenMP#
MI300:
Added support for gfx942 targets
Fixed declare target variable access in unified_shared_memory mode
Enabled OMPX_APU_MAPS environment variable for MI200 and gfx942
Handled global pointers in forced USM (
OMPX_APU_MAPS
)
Nextgen AMDGPU plugin:
Respect GPU_MAX_HW_QUEUES in the AMDGPU Nextgen plugin, which takes precedence over the standard LIBOMPTARGET_AMDGPU_NUM_HSA_QUEUES environment variable
Changed the default for LIBOMPTARGET_AMDGPU_TEAMS_PER_CU from 4 to 6
Fixed the behavior of the OMPX_FORCE_SYNC_REGIONS environment variable, which is used to force synchronous target regions (the default is to use an asynchronous implementation)
Added support for and enabled default of code object version 5
Implemented target OMPT callbacks and trace records support in the nextgen plugin
Specialized kernels:
Removed redundant copying of arrays when xteam reductions are active but not offloaded
Tuned the number of teams for BigJumpLoop
Enabled specialized kernel generation with nested OpenMP pragma, as long as there is no nested omp-parallel directive
Additions#
-fopenmp-runtimelib={lib,lib-perf,lib-debug} to select libs
Warning if mixed HIP / OpenMP offloading (i.e., if HIP language mode is active, but OpenMP target directives are encountered)
Introduced compile-time limit for the number of GPUs supported in a system: 16 GPUs in a single node is currently the maximum supported
Changes#
Correctly compute number of waves when workgroup size is less than the wave size
Implemented LIBOMPTARGET_KERNEL_TRACE=3, which prints DEVID traces and API timings
ASAN support for openmp release, debug, and perf libraries
Changed LDS lowering default to hybrid
Fixes#
Fixed RUNPATH for gdb plugin
Fixed hang in OMPT support if flush trace is called when there are no helper threads
rccl 2.15.5#
RCCL 2.15.5 for ROCm 6.0.0
Changes#
Compatibility with NCCL 2.15.5
Renamed the unit test executable to
rccl-UnitTests
Additions#
HW-topology-aware binary tree implementation
Experimental support for MSCCL
New unit tests for hipGraph support
NPKit integration
Fixes#
rocm-smi ID conversion
Support for HIP_VISIBLE_DEVICES for unit tests
Support for p2p transfers to non (HIP) visible devices
Removals#
Removed TransferBench from tools as it exists in standalone repo: ROCm/TransferBench
rocALUTION 3.0.3#
rocALUTION 3.0.3 for ROCm 6.0.0
Additions#
Support for 64bit integer vectors
Inclusive and exclusive sum functionality for vector classes
Transpose functionality for GlobalMatrix and LocalMatrix
TripleMatrixProduct functionality for LocalMatrix
Sort() function for LocalVector class
Multiple stream support to the HIP backend
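For readers unfamiliar with the terms, an inclusive sum includes the current element in each running total, while an exclusive sum covers only the elements before it. The sketch below illustrates the two operations with the C++17 standard library on host vectors; it does not use the rocALUTION vector classes, and the function names are hypothetical.

```cpp
#include <numeric>
#include <vector>

// Inclusive sum: out[i] = v[0] + ... + v[i]
std::vector<long long> inclusive_sum(const std::vector<long long>& v) {
    std::vector<long long> out(v.size());
    std::inclusive_scan(v.begin(), v.end(), out.begin());
    return out;
}

// Exclusive sum: out[i] = v[0] + ... + v[i-1], with out[0] = 0
std::vector<long long> exclusive_sum(const std::vector<long long>& v) {
    std::vector<long long> out(v.size());
    std::exclusive_scan(v.begin(), v.end(), out.begin(), 0LL);
    return out;
}
```

For the input 1, 2, 3, 4 the inclusive sum is 1, 3, 6, 10 and the exclusive sum is 0, 1, 3, 6.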
Optimizations#
GlobalMatrix::Apply()
now uses multiple streams to better hide communication
Changes#
Matrix dimensions and number of non-zeros are now stored using 64-bit integers
Improved the ILUT preconditioner
Removals#
LocalVector::GetIndexValues(ValueType*)
LocalVector::SetIndexValues(const ValueType*)
LocalMatrix::RSDirectInterpolation(const LocalVector&, const LocalVector&, LocalMatrix*, LocalMatrix*)
LocalMatrix::RSExtPIInterpolation(const LocalVector&, const LocalVector&, bool, float, LocalMatrix*, LocalMatrix*)
LocalMatrix::RugeStueben()
LocalMatrix::AMGSmoothedAggregation(ValueType, const LocalVector&, const LocalVector&, LocalMatrix*, LocalMatrix*, int)
LocalMatrix::AMGAggregation(const LocalVector&, LocalMatrix*, LocalMatrix*)
Fixes#
Unit tests no longer ignore BCSR block dimension
Fixed documentation typos
Bug in multi-coloring for non-symmetric matrix patterns
rocBLAS 4.0.0#
rocBLAS 4.0.0 for ROCm 6.0.0
Additions#
Beta API rocblas_gemm_batched_ex3 and rocblas_gemm_strided_batched_ex3
Input/output type f16_r/bf16_r and execution type f32_r support for Level 2 gemv_batched and gemv_strided_batched
Use of rocblas_status_excluded_from_build when calling functions that require Tensile (when using rocBLAS built without Tensile)
System for asynchronous kernel launches that set a rocblas_status failure based on a hipPeekAtLastError discrepancy
Optimizations#
TRSM performance for small sizes (m < 32 && n < 32)
Deprecations#
Atomic operations will be disabled by default in a future release of rocBLAS (you can enable atomic operations using the
rocblas_set_atomics_mode
function)
Removals#
rocblas_gemm_ext2 API function
In-place trmm API from Legacy BLAS is replaced by an API that supports both in-place and out-of-place trmm
int8x4 support is removed (int8 support is unchanged)
#define __STDC_WANT_IEC_60559_TYPES_EXT__ is removed from rocblas-types.h (if you want ISO/IEC TS 18661-3:2015 functionality, you must define __STDC_WANT_IEC_60559_TYPES_EXT__ before including float.h, math.h, and rocblas.h)
The default build removes device code for gfx803 architecture from the fat binary
Fixes#
Made offset calculations for 64-bit rocBLAS functions safe
Fixes for very large leading dimension or increment potentially causing overflow:
Level2: gbmv, gemv, hbmv, sbmv, spmv, tbmv, tpmv, tbsv, and tpsv
Lazy loading supports heterogeneous architecture setup and load-appropriate tensile library files, based on device architecture
Guards against no-op kernel launches that result in a potential hipGetLastError
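The overflow risk mentioned for very large leading dimensions or increments is a general one: multiplying two 32-bit indices can exceed the 32-bit range before the result is stored. A common remedy is to widen the operands before multiplying. The sketch below shows that technique only; it is not rocBLAS code, and the helper name element_offset and the column-major layout are assumptions made for the illustration.

```cpp
#include <cstdint>

// Hypothetical helper: offset of element (row, col) in a column-major matrix
// with leading dimension lda. Widening each operand to 64 bits before the
// multiplication keeps the product exact even when col * lda exceeds INT32_MAX.
inline std::int64_t element_offset(std::int32_t row, std::int32_t col, std::int32_t lda) {
    return static_cast<std::int64_t>(col) * static_cast<std::int64_t>(lda) + row;
}
```

With col and lda both equal to 70000, the product is 4,900,000,000, which does not fit in a signed 32-bit integer but is represented exactly in the 64-bit result.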
Changes#
Reduced the default verbosity of rocblas-test (you can see all tests by setting the GTEST_LISTENER=PASS_LINE_IN_LOG environment variable)
rocFFT 1.0.25#
rocFFT 1.0.25 for ROCm 6.0.0
Additions#
Implemented experimental APIs to allow computing FFTs on data distributed across multiple devices in a single process
rocfft_field is a new type that can be added to a plan description to describe the layout of FFT input or output
rocfft_field_add_brick can be called to describe the brick decomposition of an FFT field, where each brick can be assigned a different device
These interfaces are still experimental and subject to change. Your feedback is appreciated. You can raise questions and concerns by opening issues in the rocFFT issue tracker.
Note that multi-device FFTs currently have several limitations (we plan to address these in future releases):
Real-complex (forward or inverse) FFTs are not supported
Planar format fields are not supported
Batch (the number_of_transforms provided to rocfft_plan_create) must be 1
FFT input is gathered to the current device at run time, so all FFT data must fit on that device
Optimizations#
Improved the performance of several 2D/3D real FFTs supported by 2D_SINGLE kernel. Offline tuning provides more optimization for gfx90a
Removed an extra kernel launch from even-length, real-complex FFTs that use callbacks
Changes#
Built kernels in a solution map to the library kernel cache
Real forward transforms (real-to-complex) no longer overwrite input; rocFFT may still overwrite real inverse (complex-to-real) input, as this allows for faster performance
rocfft-rider and dyna-rocfft-rider have been renamed to rocfft-bench and dyna-rocfft-bench; these are controlled by the BUILD_CLIENTS_BENCH CMake option
Links for the former file names are installed, and the former BUILD_CLIENTS_RIDER CMake option is accepted for compatibility, but both will be removed in a future release
Binaries in debug builds no longer have a -d suffix
Fixes#
rocFFT now correctly handles load callbacks that convert data from a smaller data type (e.g., 16-bit integers -> 32-bit float)
ROCgdb 13.2#
ROCgdb 13.2 for ROCm 6.0.0
Additions#
Support for watchpoints on scratch memory addresses.
Added support for gfx1100, gfx1101, and gfx1102.
Added support for gfx942.
Optimizations#
Improved performance when handling the end of a process with a large number of threads.
Known issues#
On certain configurations, ROCgdb can show the following warning message:
warning: Probes-based dynamic linker interface failed. Reverting to original interface.
This does not affect ROCgdb’s functionalities.
ROCgdb cannot debug a program on an AMDGPU device past a s_sendmsg sendmsg(MSG_DEALLOC_VGPRS) instruction. If an exception is reported after this instruction has been executed (including asynchronous exceptions), the wave is killed and the exceptions are only reported by the ROCm runtime.
rocm-cmake 0.11.0#
rocm-cmake 0.11.0 for ROCm 6.0.0
Changes#
Improved validation, documentation, and rocm-docs-core integration for ROCMSphinxDoc
Fixes#
Fixed extra make flags passed for Clang-Tidy (ROCMClangTidy).
Fixed issues with ROCMTest when using a module in a subdirectory
ROCm Compiler#
On MI300, kernel arguments can be preloaded into SGPRs rather than passed in memory. This feature is enabled with a compiler option, which also controls the number of arguments to pass in SGPRs.
Improved register allocation at -O0 to avoid compiler crashes (‘ran out of registers during register allocation’)
Improved generation of debug information:
Improve compile time
Avoid compiler crashes
rocPRIM 3.0.0#
rocPRIM 3.0.0 for ROCm 6.0.0
Additions#
block_sort::sort() overload for keys and values with a dynamic size, for all block sort algorithms
All block_sort::sort() overloads with a dynamic size are now supported for block_sort_algorithm::merge_sort and block_sort_algorithm::bitonic_sort
New two-way partition primitive partition_two_way, which can write to two separate iterators
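The idea of a two-way partition can be illustrated on the host with the standard library: elements that satisfy a predicate go to one output, and all others go to a second output. The sketch below uses std::partition_copy and is only an analogy for the semantics described above; it is not the rocPRIM device API, and the function name partition_two_way_host is hypothetical.

```cpp
#include <algorithm>
#include <iterator>
#include <utility>
#include <vector>

// Host-side analogy: split `in` into (selected, rejected) according to `pred`,
// writing each group through its own output iterator.
template <typename T, typename Pred>
std::pair<std::vector<T>, std::vector<T>> partition_two_way_host(const std::vector<T>& in,
                                                                 Pred pred) {
    std::vector<T> selected;
    std::vector<T> rejected;
    std::partition_copy(in.begin(), in.end(),
                        std::back_inserter(selected),  // elements where pred is true
                        std::back_inserter(rejected),  // elements where pred is false
                        pred);
    return {std::move(selected), std::move(rejected)};
}
```

Splitting 1 through 6 with an "is even" predicate yields 2, 4, 6 in the first output and 1, 3, 5 in the second.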
Optimizations#
Improved
partition
performance
Fixes#
Fixed rocprim::MatchAny for devices with 64-bit warp size
Note that rocprim::MatchAny is deprecated; use rocprim::match_any instead
Roc Profiler 2.0.0#
Roc Profiler 2.0.0 for ROCm 6.0.0
Additions#
Updated supported GPU architectures in README with profiler versions
Automatic ISA dumping for ATT. See README.
CSV mode for ATT. See README.
Added option to control kernel name truncation.
Limit rocprof(v1) script usage to only supported architectures.
Added Tool versioning to be able to run rocprofv2 using rocprof. See README for more information.
Added Plugin Versioning way in rocprofv2. See README for more details.
Added
--version
in rocprof and rocprofv2 to be able to see the current rocprof/v2 version along with ROCm version information.
rocRAND 2.10.17#
rocRAND 2.10.17 for ROCm 6.0.0
Changes#
Generator classes from rocrand.hpp are no longer copyable (in previous versions these copies would copy internal references to the generators and would lead to double free or memory leak errors)
These types should be moved instead of copied; move constructors and operators are now defined
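The move-only pattern described above can be sketched with a small owning wrapper. This is an illustration of the general C++ technique, not the rocrand.hpp implementation: the class name OwningGenerator is hypothetical, and the destroy_count pointer exists only so the sketch can show that the underlying handle is released exactly once.

```cpp
#include <type_traits>
#include <utility>

// Hypothetical stand-in for a wrapper that owns a native generator handle.
class OwningGenerator {
public:
    explicit OwningGenerator(int* destroy_count)
        : destroy_count_(destroy_count), owns_(true) {}

    // Copying would leave two wrappers releasing the same handle, so it is disabled.
    OwningGenerator(const OwningGenerator&) = delete;
    OwningGenerator& operator=(const OwningGenerator&) = delete;

    // Moving transfers ownership and leaves the source empty.
    OwningGenerator(OwningGenerator&& other) noexcept
        : destroy_count_(other.destroy_count_), owns_(std::exchange(other.owns_, false)) {}

    OwningGenerator& operator=(OwningGenerator&& other) noexcept {
        if (this != &other) {
            release();
            destroy_count_ = other.destroy_count_;
            owns_ = std::exchange(other.owns_, false);
        }
        return *this;
    }

    ~OwningGenerator() { release(); }

    bool owns_handle() const { return owns_; }

private:
    void release() {
        if (owns_) {
            ++*destroy_count_;  // stands in for destroying the native handle
            owns_ = false;
        }
    }

    int* destroy_count_;
    bool owns_;
};
```

Code that previously copied a generator would, under this pattern, use std::move to hand it over, after which only the destination releases the handle.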
Optimizations#
Improved MT19937 initialization and generation performance
Removals#
Removed the hipRAND submodule from rocRAND; hipRAND is now only available as a separate package
Removed references to, and workarounds for, the deprecated hcc
Fixes#
mt19937_engine from rocrand.hpp is now move-constructible and move-assignable (the move constructor and move assignment operator were previously deleted for this class)
Various fixes for the C++ wrapper header rocrand.hpp
The name of mrg31k3p is now correctly spelled (it was incorrectly named mrg31k3a in previous versions)
Added the missing order setter method for threefry4x64
Fixed the default ordering parameter for lfsr113
Build error when using Clang++ directly resulting from unsupported amdgpu-target references
rocSOLVER 3.24.0#
rocSOLVER 3.24.0 for ROCm 6.0.0
Additions#
Cholesky refactorization for sparse matrices:
CSRRF_REFACTCHOL
Added rocsolver_rfinfo_mode and the ability to specify the desired refactorization routine (see rocsolver_set_rfinfo_mode)
Changes#
CSRRF_ANALYSIS and CSRRF_SOLVE now support sparse Cholesky factorization
rocSPARSE 3.0.2#
rocSPARSE 3.0.2 for ROCm 6.0.0
Changes#
Function arguments for rocsparse_spmv
Function arguments for rocsparse_xbsrmv routines
When using host pointer mode, you must now call hipStreamSynchronize following doti, dotci, spvv, and csr2ell
Improved documentation
Improved verbose output during argument checking on API function calls
Removals#
Auto stages from spmv, spmm, spgemm, spsv, spsm, and spitsv
Formerly deprecated rocsparse_spmm_ex routine
Fixes#
Bug in rocsparse-bench where the SpMV algorithm was not taken into account in CSR format
BSR and GEBSR routines (bsrmv, bsrsv, bsrmm, bsrgeam, gebsrmv, gebsrmm) didn’t always show block_dim==0 as an invalid size
Passing nnz = 0 to doti or dotci wasn’t always returning a dot product of 0
Additions#
rocsparse_inverse_permutation
Mixed-precisions for SpVV
Uniform int8 precision for gather and scatter
rocThrust 3.0.0#
rocThrust 3.0.0 for ROCm 6.0.0
Additions#
Updated to match upstream Thrust 2.0.1
NV_IF_TARGET
macro from libcu++ for NVIDIA backend and HIP implementation for HIP backend
Changes#
The CMake build system now accepts GPU_TARGETS in addition to AMDGPU_TARGETS for setting targeted GPU architectures
GPU_TARGETS=all compiles for all supported architectures
AMDGPU_TARGETS is only provided for backwards compatibility (GPU_TARGETS is preferred)
Removed CUB symlink from the root of the repository
Removed support for deprecated macros (THRUST_DEVICE_BACKEND and THRUST_HOST_BACKEND)
Known issues#
The THRUST_HAS_CUDART macro, which is no longer used in Thrust (it’s provided only for legacy support), is replaced with NV_IF_TARGET and THRUST_RDC_ENABLED in the NVIDIA backend. The HIP backend doesn’t have a THRUST_RDC_ENABLED macro, so some branches in Thrust code may be unreachable in the HIP backend.
rocWMMA 1.3.0#
rocWMMA 1.3.0 for ROCm 6.0.0
Additions#
Support for gfx942
Support for f8, bf8, and xfloat32 data types
Support for HIP_NO_HALF, __HIP_NO_HALF_CONVERSIONS__, and __HIP_NO_HALF_OPERATORS__ (e.g., PyTorch environment)
Changes#
rocWMMA with hipRTC now supports bfloat16_t data type
gfx11 WMMA now uses lane swap instead of broadcast for layout adjustment
Updated samples GEMM parameter validation on host arch
Fixes#
Disabled GoogleTest static library deployment
Extended tests now build in large code model
Tensile 4.39.0#
Tensile 4.39.0 for ROCm 6.0.0
Additions#
Added aquavanjaram support: gfx942, fp8/bf8 datatype, xf32 datatype, and stochastic rounding for various datatypes
Added and updated tuning scripts
Added DirectToLds support for larger data types with 32-bit global load (old parameter DirectToLds is replaced with DirectToLdsA and DirectToLdsB), and the corresponding test cases
Added the average of frequency, power consumption, and temperature information for the winner kernels to the CSV file
Added asmcap check for MFMA + const src
Added support for wider local read + pack with v_perm (with VgprForLocalReadPacking=True)
Added a new parameter to increase miLatencyLeft
Optimizations#
Enabled InitAccVgprOpt for MatrixInstruction cases
Implemented local read related parameter calculations with DirectToVgpr
Enabled dedicated vgpr allocation for local read + pack
Optimized code initialization
Optimized sgpr allocation
Supported DGEMM TLUB + RLVW=2 for odd N (edge shift change)
Enabled
miLatency
optimization for specific data types, and fixed instruction scheduling
Changes#
Removed old code for DTL + (bpe * GlobalReadVectorWidth > 4)
Changed/updated failed CI tests for gfx11xx, InitAccVgprOpt, and DTLds
Removed unused CustomKernels and ReplacementKernels
Added a reject condition for DTVB + TransposeLDS=False (not supported so far)
Removed unused code for DirectToLds
Updated test cases for DTV + TransposeLDS=False
Moved the MinKForGSU parameter from globalparameter to BenchmarkCommonParameter to support smaller K
Changed how to calculate latencyForLR for miLatency
Set minimum value of latencyForLRCount for 1LDSBuffer to avoid getting rejected by overflowedResources=5 (related to miLatency)
Refactored allowLRVWBforTLUandMI and renamed it as VectorWidthB
Supported multi-gpu for different architectures in lazy library loading
Enabled dtree library for batch > 1
Added problem scale feature for dtree selection
Modified non-lazy load build to skip experimental logic
Fixes#
Predicate ordering for fp16alt impl round near zero mode to unbreak distance modes
Boundary check for mirror dims and re-enable disabled mirror dims test cases
Merge error affecting i8 with WMMA
Mismatch issue with DTLds + TSGR + TailLoop
Bug with InitAccVgprOpt + GSU>1 and a mismatch issue with PGR=0
Override for unloaded solutions when lazy loading
Adding missing headers
Boost link for a clean build on Ubuntu 22
Bug in forcestoresc1 arch selection
Compiler directive for gfx942
Formatting for DecisionTree_test.cpp
Library changes in ROCm 6.0.0#
Library | Version |
---|---|
AMDMIGraphX | 2.7 ⇒ 2.8 |
composable_kernel | |
hipBLAS | 1.1.0 ⇒ 2.0.0 |
hipCUB | 2.13.1 ⇒ 3.0.0 |
hipFFT | 1.0.12 ⇒ 1.0.13 |
hipRAND | 2.10.16 ⇒ 2.10.17 |
hipSOLVER | 1.8.2 ⇒ 2.0.0 |
hipSPARSE | 2.3.8 ⇒ 3.0.0 |
hipTensor | |
MIOpen | |
MIVisionX | |
rccl | |
rocALUTION | 2.1.11 ⇒ 3.0.3 |
rocBLAS | 3.1.0 ⇒ 4.0.0 |
rocFFT | 1.0.24 ⇒ 1.0.25 |
rocm-cmake | 0.10.0 ⇒ 0.11.0 |
rocPRIM | 2.13.1 ⇒ 3.0.0 |
rocRAND | |
rocSOLVER | 3.23.0 ⇒ 3.24.0 |
rocSPARSE | 2.5.4 ⇒ 3.0.2 |
rocThrust | 2.18.0 ⇒ 3.0.0 |
rocWMMA | 1.2.0 ⇒ 1.3.0 |
rpp | 1.2.0 ⇒ 1.4.0 |
Tensile | 4.38.0 ⇒ 4.39.0 |
AMDMIGraphX 2.8#
MIGraphX 2.8 for ROCm 6.0.0
Additions#
Support for MI300 GPUs
Support for TorchMIGraphX via PyTorch
Boosted overall performance by integrating rocMLIR
INT8 support for ONNX Runtime
Support for ONNX version 1.14.1
Added new operators: Qlinearadd, QlinearGlobalAveragePool, Qlinearconv, Shrink, CastLike, and RandomUniform
Added an error message for when gpu_targets is not set during MIGraphX compilation
Added parameter to set tolerances with migraphx-driver verify
Added support for MXR files > 4 GB
Added MIGRAPHX_TRACE_MLIR flag
BETA: Added capability for using ROCm Composable Kernels via the MIGRAPHX_ENABLE_CK=1 environment variable
Optimizations#
Improved performance support for INT8
Improved time precision while benchmarking candidate kernels from CK or MLIR
Removed contiguous from reshape parsing
Updated the ConstantOfShape operator to support Dynamic Batch
Simplified dynamic shapes-related operators to their static versions, where possible
Improved debugging tools for accuracy issues
Included a print warning about miopen_fusion while generating mxr
General reduction in system memory usage during model compilation
Created additional fusion opportunities during model compilation
Improved debugging for matchers
Improved general debug messages
Fixes#
Fixed scatter operator for nonstandard shapes with some models from ONNX Model Zoo
Provided a compile option to improve the accuracy of some models by disabling Fast-Math
Improved layernorm + pointwise fusion matching to ignore argument order
Fixed accuracy issue with ROIAlign operator
Fixed computation logic for the Trilu operator
Fixed support for the DETR model
Changes#
Changed MIGraphX version to 2.8
Extracted the test packages into a separate deb file when building MIGraphX from source
Removals#
Removed building Python 2.7 bindings
hipBLAS 2.0.0#
hipBLAS 2.0.0 for ROCm 6.0.0
Added#
added option to define HIPBLAS_USE_HIP_BFLOAT16 to switch API to use hip_bfloat16 type
added hipblasGemmExWithFlags API
Deprecated#
hipblasDatatype_t is deprecated and will be removed in a future release and replaced with hipDataType
hipblasComplex and hipblasDoubleComplex are deprecated and will be removed in a future release and replaced with hipComplex and hipDoubleComplex
use of hipblasDatatype_t for hipblasGemmEx for compute-type is deprecated and will be replaced with hipblasComputeType_t in a future release
Removed#
hipblasXtrmm that calculates B <- alpha * op(A) * B is removed and replaced with hipblasXtrmm that calculates C <- alpha * op(A) * B
hipCUB 3.0.0#
hipCUB 3.0.0 for ROCm 6.0.0
Changed#
Removed DOWNLOAD_ROCPRIM; forcing rocPRIM to download can be done with DEPENDENCIES_FORCE_DOWNLOAD.
hipFFT 1.0.13#
hipFFT 1.0.13 for ROCm 6.0.0
Changed#
hipfft-rider has been renamed to hipfft-bench, controlled by the BUILD_CLIENTS_BENCH CMake option. A link for the old file name is installed, and the old BUILD_CLIENTS_RIDER CMake option is accepted for compatibility but both will be removed in a future release.
Binaries in debug builds no longer have a “-d” suffix.
The minimum rocFFT required version has been updated to 1.0.21.
Added#
Implemented hipfftXtSetGPUs, hipfftXtMalloc, hipfftXtMemcpy, hipfftXtFree, hipfftXtExecDescriptor APIs to allow computing FFTs on multiple devices in a single process.
hipRAND 2.10.17#
hipRAND 2.10.17 for ROCm 6.0.0
Fixed#
Fixed benchmark and unit test builds on Windows.
hipSOLVER 2.0.0#
hipSOLVER 2.0.0 for ROCm 6.0.0
Added#
Added hipBLAS as an optional dependency to hipsolver-test. Use the
BUILD_HIPBLAS_TESTS
CMake option to test compatibility between hipSOLVER and hipBLAS.
Changed#
Types hipsolverOperation_t, hipsolverFillMode_t, and hipsolverSideMode_t are now aliases of hipblasOperation_t, hipblasFillMode_t, and hipblasSideMode_t.
Fixed#
Fixed tests for hipsolver info updates in ORGBR/UNGBR, ORGQR/UNGQR, ORGTR/UNGTR, ORMQR/UNMQR, and ORMTR/UNMTR.
hipSPARSE 3.0.0#
hipSPARSE 3.0.0 for ROCm 6.0.0
Added#
Added hipsparseGetErrorName and hipsparseGetErrorString
Changed#
Changed hipsparseSpSV_solve() API function to match cusparse API
Changed generic API functions to use const descriptors
Documentation improved
hipTensor 1.1.0#
hipTensor 1.1.0 for ROCm 6.0.0
Additions#
Architecture support for gfx940, gfx941, and gfx942
Client tests configuration parameters now support YAML file input format
Changes#
Doxygen now treats warnings as errors
Fixes#
Client tests output redirections now behave accordingly
Removed dependency static library deployment
Security issues for documentation
Compile issues in debug mode
Corrected soft link for ROCm deployment
rocALUTION 3.0.3#
rocALUTION 3.0.3 for ROCm 6.0.0
Added#
Added support for 64bit integer vectors
Added inclusive and exclusive sum functionality for Vector classes
Added Transpose functionality for Global/LocalMatrix
Added TripleMatrixProduct functionality for LocalMatrix
Added Sort() function for LocalVector class
Added multiple stream support to the HIP backend
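For readers unfamiliar with the two prefix-sum variants mentioned above, here is a minimal sketch in plain Python. It does not use rocALUTION; the function names are hypothetical and only illustrate the usual definitions of inclusive and exclusive sums.

```python
from itertools import accumulate


def inclusive_sum(values):
    """Element i holds the sum of values[0..i]."""
    return list(accumulate(values))


def exclusive_sum(values):
    """Element i holds the sum of values[0..i-1]; the first element is 0."""
    if not values:
        return []
    return [0] + list(accumulate(values))[:-1]


data = [3, 1, 4, 1]
print(inclusive_sum(data), exclusive_sum(data))
```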
Optimized#
GlobalMatrix::Apply() now uses multiple streams to better hide communication
Changed#
Matrix dimensions and number of non-zeros are now stored using 64bit integers
Improved ILUT preconditioner
Removed#
Removed LocalVector::GetIndexValues(ValueType*)
Removed LocalVector::SetIndexValues(const ValueType*)
Removed LocalMatrix::RSDirectInterpolation(const LocalVector&, const LocalVector&, LocalMatrix*, LocalMatrix*)
Removed LocalMatrix::RSExtPIInterpolation(const LocalVector&, const LocalVector&, bool, float, LocalMatrix*, LocalMatrix*)
Removed LocalMatrix::RugeStueben()
Removed LocalMatrix::AMGSmoothedAggregation(ValueType, const LocalVector&, const LocalVector&, LocalMatrix*, LocalMatrix*, int)
Removed LocalMatrix::AMGAggregation(const LocalVector&, LocalMatrix*, LocalMatrix*)
Fixed#
Unit tests do not ignore BCSR block dimension anymore
Fixed typos in the documentation
Fixed a bug in multicoloring for non-symmetric matrix patterns
rocBLAS 4.0.0#
rocBLAS 4.0.0 for ROCm 6.0.0
Added#
Addition of beta API rocblas_gemm_batched_ex3 and rocblas_gemm_strided_batched_ex3
Added input/output type f16_r/bf16_r and execution type f32_r support for Level 2 gemv_batched and gemv_strided_batched
Added rocblas_status_excluded_from_build to be used when calling functions which require Tensile when using rocBLAS built without Tensile
Added system for async kernel launches setting a failure rocblas_status based on hipPeekAtLastError discrepancy
Optimized#
Trsm performance for small sizes m < 32 && n < 32
Deprecated#
In a future release atomic operations will be disabled by default so results will be repeatable. Atomic operations can always be enabled or disabled using the function rocblas_set_atomics_mode. Enabling atomic operations can improve performance.
Removed#
rocblas_gemm_ext2 API function is removed
in-place trmm API from Legacy BLAS is removed. It is replaced by an API that supports both in-place and out-of-place trmm
int8x4 support is removed. int8 support is unchanged
The #define __STDC_WANT_IEC_60559_TYPES_EXT__ has been removed from rocblas-types.h. Users who want ISO/IEC TS 18661-3:2015 functionality must define __STDC_WANT_IEC_60559_TYPES_EXT__ before including float.h, math.h, and rocblas.h
The default build removes device code for gfx803 architecture from the fat binary
Fixed#
Make offset calculations for rocBLAS functions 64 bit safe. Fixes for very large leading dimension or increment potentially causing overflow:
Level2: gbmv, gemv, hbmv, sbmv, spmv, tbmv, tpmv, tbsv, tpsv
Lazy loading to support heterogeneous architecture setup and load appropriate tensile library files based on the device’s architecture
Guard against no-op kernel launches resulting in potential hipGetLastError
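The overflow that the 64-bit-safe offset fix guards against can be reproduced without a GPU. The sketch below is an illustration only: it uses `ctypes` to emulate 32-bit and 64-bit signed arithmetic, and the function names and the sample leading dimension are assumptions, not rocBLAS internals.

```python
import ctypes


def offset_int32(lda, col):
    """Column offset computed in 32-bit signed arithmetic (wraps on overflow)."""
    return ctypes.c_int32(lda * col).value


def offset_int64(lda, col):
    """The same offset computed in 64-bit signed arithmetic."""
    return ctypes.c_int64(lda * col).value


lda, col = 70_000, 40_000            # product exceeds 2**31 - 1
print(offset_int32(lda, col))        # wrapped, negative offset
print(offset_int64(lda, col))        # correct offset
```

A wrapped offset points outside the matrix, which is why very large leading dimensions or increments need 64-bit index math.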
Changed#
Default verbosity of rocblas-test reduced. To see all tests set environment variable GTEST_LISTENER=PASS_LINE_IN_LOG
rocFFT 1.0.25#
rocFFT 1.0.25 for ROCm 6.0.0
Added#
Implemented experimental APIs to allow computing FFTs on data distributed across multiple devices in a single process.
rocfft_field is a new type that can be added to a plan description, to describe layout of FFT input or output.
rocfft_field_add_brick can be called one or more times to describe a brick decomposition of an FFT field, where each brick can be assigned a different device.
These interfaces are still experimental and subject to change. We are interested to hear feedback on them. Questions and concerns may be raised by opening issues on the rocFFT issue tracker.
Note that at this time, multi-device FFTs have several limitations:
Real-complex (forward or inverse) FFTs are not currently supported.
Planar format fields are not currently supported.
Batch (i.e. number_of_transforms provided to rocfft_plan_create) must be 1.
The FFT input is gathered to the current device at execute time, so all of the FFT data must fit on that device.
We expect these limitations to be removed in future releases.
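To make the idea of a brick decomposition concrete, the sketch below splits one dimension of a field into contiguous index ranges, one per device. It is a conceptual model in plain Python and does not call rocFFT; the helper `split_into_bricks` and the `(lower, upper, device)` tuple layout are hypothetical.

```python
def split_into_bricks(length, device_ids):
    """Split [0, length) into contiguous bricks, one per device.

    Returns (lower, upper, device_id) tuples with an exclusive upper bound.
    """
    base, extra = divmod(length, len(device_ids))
    bricks, lower = [], 0
    for i, dev in enumerate(device_ids):
        # The first `extra` devices take one additional element each.
        upper = lower + base + (1 if i < extra else 0)
        bricks.append((lower, upper, dev))
        lower = upper
    return bricks


for brick in split_into_bricks(10, [0, 1, 2]):
    print(brick)
```

In a real plan, each such range would presumably correspond to one brick added to the field, with the device taken from the tuple.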
Optimizations#
Improved performance of some small 2D/3D real FFTs supported by 2D_SINGLE kernel. gfx90a gets more optimization by offline tuning.
Removed an extra kernel launch from even-length real-complex FFTs that use callbacks.
Changed#
Built kernels in solution-map to library kernel cache.
Real forward transforms (real-to-complex) no longer overwrite input. rocFFT still may overwrite real inverse (complex-to-real) input, as this allows for faster performance.
rocfft-rider and dyna-rocfft-rider have been renamed to rocfft-bench and dyna-rocfft-bench, controlled by the BUILD_CLIENTS_BENCH CMake option. Links for the old file names are installed, and the old BUILD_CLIENTS_RIDER CMake option is accepted for compatibility but both will be removed in a future release.
Binaries in debug builds no longer have a “-d” suffix.
Fixed#
rocFFT now correctly handles load callbacks that convert data from a smaller data type (e.g. 16-bit integers -> 32-bit float).
rocm-cmake 0.11.0#
rocm-cmake 0.11.0 for ROCm 6.0.0
Changed#
ROCMSphinxDoc: Improved validation, documentation and rocm-docs-core integration.
Fixed#
ROCMClangTidy: Fixed extra make flags passed for clang tidy.
ROCMTest: Fixed issues when using module in a subdirectory.
rocPRIM 3.0.0#
rocPRIM 3.0.0 for ROCm 6.0.0
Added#
block_sort::sort() overload for keys and values with a dynamic size, for all block sort algorithms. Additionally, all block_sort::sort() overloads with a dynamic size are now supported for block_sort_algorithm::merge_sort and block_sort_algorithm::bitonic_sort.
New two-way partition primitive partition_two_way which can write to two separate iterators.
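The behaviour of a two-way partition can be sketched sequentially. This is a plain-Python illustration of the concept, not the rocPRIM device function; the signature shown here is an assumption made for the example.

```python
def partition_two_way(items, predicate):
    """Write items that satisfy the predicate to one output and the rest to
    another, preserving the relative order within each output."""
    selected, rejected = [], []
    for x in items:
        (selected if predicate(x) else rejected).append(x)
    return selected, rejected


evens, odds = partition_two_way([1, 2, 3, 4, 5, 6], lambda x: x % 2 == 0)
print(evens, odds)
```

Writing to two separate outputs avoids the extra pass needed when both groups share one output range.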
Optimizations#
Improved the performance of partition.
Fixed#
Fixed rocprim::MatchAny for devices with 64-bit warp size. The function rocprim::MatchAny is deprecated and rocprim::match_any is preferred instead.
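The 64-lane case matters because the result mask needs all 64 bits. The sketch below models a match-any operation sequentially in Python, assuming the usual semantics (each lane receives a bitmask of the lanes holding the same value); it is an illustration and not the rocPRIM implementation.

```python
def match_any(lane_values):
    """For each lane, return a bitmask of all lanes holding the same value."""
    masks = []
    for value in lane_values:
        mask = 0
        for lane, other in enumerate(lane_values):
            if other == value:
                mask |= 1 << lane
        masks.append(mask)
    return masks


# 64 lanes: even lanes hold 0, odd lanes hold 1.
masks = match_any([lane % 2 for lane in range(64)])
print(hex(masks[0]), hex(masks[1]))
```

The mask for the odd lanes has bit 63 set, so a 32-bit mask type would silently drop half of the warp.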
rocSOLVER 3.24.0#
rocSOLVER 3.24.0 for ROCm 6.0.0
Added#
Cholesky refactorization for sparse matrices
CSRRF_REFACTCHOL
Added rocsolver_rfinfo_mode and the ability to specify the desired refactorization routine (see rocsolver_set_rfinfo_mode).
Changed#
CSRRF_ANALYSIS and CSRRF_SOLVE now support sparse Cholesky factorization
rocSPARSE 3.0.2#
rocSPARSE 3.0.2 for ROCm 6.0.0
Added#
Added rocsparse_inverse_permutation
Added mixed precisions for SpVV
Added uniform int8 precision for Gather and Scatter
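As a reminder of what an inverse permutation is, here is a minimal sketch in plain Python. It does not call rocsparse_inverse_permutation and assumes zero-based indices; the helper name is chosen for the example.

```python
def inverse_permutation(p):
    """Return q such that q[p[i]] == i for every i (zero-based indices)."""
    q = [0] * len(p)
    for i, target in enumerate(p):
        q[target] = i
    return q


p = [2, 0, 3, 1]
q = inverse_permutation(p)
print(q, [q[p[i]] for i in range(len(p))])
```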
Optimized#
Optimization to doti routine
Optimization to spin-looping algorithms
Changed#
Changed rocsparse_spmv function arguments
Changed rocsparse_xbsrmv routines function arguments
doti, dotci, spvv, and csr2ell now require calling hipStreamSynchronize afterwards when using host pointer mode
Improved documentation
Improved verbose output during argument checking on API function calls
Deprecated#
Deprecated rocsparse_spmv_ex
Deprecated rocsparse_xbsrmv_ex routines
Removed#
Removed auto stages from spmv, spmm, spgemm, spsv, spsm, and spitsv.
Removed rocsparse_spmm_ex routine
Fixed#
Fixed a bug in rocsparse-bench, where SpMV algorithm was not taken into account in CSR format
Fixed the BSR/GEBSR routines bsrmv, bsrsv, bsrmm, bsrgeam, gebsrmv, gebsrmm so that block_dim==0 is considered an invalid size
Fixed bug where passing nnz = 0 to doti or dotci did not always return a dot product of 0
rocThrust 3.0.0#
rocThrust 3.0.0 for ROCm 6.0.0
Added#
Updated to match upstream Thrust 2.0.1
NV_IF_TARGET macro from libcu++ for NVIDIA backend and HIP implementation for HIP backend.
Changed#
The CMake build system now accepts GPU_TARGETS in addition to AMDGPU_TARGETS for setting the targeted GPU architectures. GPU_TARGETS=all will compile for all supported architectures. AMDGPU_TARGETS is only provided for backwards compatibility; GPU_TARGETS should be preferred.
Removed#
Removed cub symlink from the root of the repository.
Removed support for deprecated macros (THRUST_DEVICE_BACKEND and THRUST_HOST_BACKEND).
Fixed#
Fixed a segmentation fault when binary search / upper bound / lower bound / equal range was invoked with hip_rocprim::execute_on_stream_base policy.
Known Issues#
For the NVIDIA backend, NV_IF_TARGET and THRUST_RDC_ENABLED are intended to substitute the THRUST_HAS_CUDART macro, which is no longer used in Thrust (provided for legacy support only). However, there is no THRUST_RDC_ENABLED macro available for the HIP backend, so some branches in Thrust’s code may be unreachable in the HIP backend.
rocWMMA 1.3.0#
rocWMMA 1.3.0 for ROCm 6.0.0
Added#
Added support for gfx940, gfx941 and gfx942 targets
Added support for f8, bf8 and xfloat32 datatypes
Added support for HIP_NO_HALF, __HIP_NO_HALF_CONVERSIONS__ and __HIP_NO_HALF_OPERATORS__ (e.g. PyTorch environment)
Changed#
rocWMMA with hipRTC now supports bfloat16_t datatype
gfx11 wmma now uses lane swap instead of broadcast for layout adjustment
Updated samples GEMM parameter validation on host arch
Fixed#
Disabled gtest static library deployment
Extended tests now build in large code model
rpp 1.4.0#
rpp for ROCm 6.0.0
Added#
New Tests
Optimizations#
Readme Updates
Changed#
Backend - Default Backend set to HIP
Fixed#
Minor bugs and warnings
Tested Configurations#
Linux distribution
Ubuntu - 18.04/20.04
CentOS - 8
ROCm: rocm-core - 5.0.0.50000-49
Clang - Version 6.0
CMake - Version 3.22.3
Boost - Version 1.72
IEEE 754-based half-precision floating-point library - Version 1.12.0
Rpp 1.3.0#
Rpp 1.2.0#
Known Issues#
CPU-only backend not enabled
Rpp 1.1.0#
Rpp 1.0.0#
Rpp 0.99#
Rpp 0.98#
Rpp 0.97#
Rpp 0.96#
Rpp 0.95#
Rpp 0.93#
Tensile 4.39.0#
Tensile 4.39.0 for ROCm 6.0.0
Added#
Added aquavanjaram support: gfx940/gfx941/gfx942, fp8/bf8 datatype, xf32 datatype, and stochastic rounding for various datatypes
Added/updated tuning scripts
Added DirectToLds support for larger data types with 32bit global load (old parameter DirectToLds is replaced with DirectToLdsA and DirectToLdsB), and the corresponding test cases
Added the average of frequency, power consumption, and temperature information for the winner kernels to the CSV file
Added asmcap check for MFMA + const src
Added support for wider local read + pack with v_perm (with VgprForLocalReadPacking=True)
Added a new parameter to increase miLatencyLeft
Optimizations#
Enabled InitAccVgprOpt for MatrixInstruction cases
Implemented local read related parameter calculations with DirectToVgpr
Adjusted miIssueLatency for gfx940
Enabled dedicated vgpr allocation for local read + pack
Optimized code initialization
Optimized sgpr allocation
Supported DGEMM TLUB + RLVW=2 for odd N (edge shift change)
Enabled miLatency optimization for (gfx940/gfx941 + MFMA) for specific data types, and fixed instruction scheduling
Changed#
Removed old code for DTL + (bpe * GlobalReadVectorWidth > 4)
Changed/updated failed CI tests for gfx11xx, InitAccVgprOpt, and DTLds
Removed unused CustomKernels and ReplacementKernels
Added a reject condition for DTVB + TransposeLDS=False (not supported so far)
Removed unused code for DirectToLds
Updated test cases for DTV + TransposeLDS=False
Moved parameter MinKForGSU from globalparameter to BenchmarkCommonParameter to support smaller K
Changed how to calculate latencyForLR for miLatency
Set minimum value of latencyForLRCount for 1LDSBuffer to avoid getting rejected by overflowedResources=5 (related to miLatency)
Refactored allowLRVWBforTLUandMI and renamed it as VectorWidthB
Supported multi-gpu for different architectures in lazy library loading
Enabled dtree library for batch > 1
Added problem scale feature for dtree selection
Enabled ROCm SMI for gfx940/941.
Modified non-lazy load build to skip experimental logic
Fixed#
Fixed predicate ordering for fp16alt impl round near zero mode to unbreak distance modes
Fixed boundary check for mirror dims and re-enable disabled mirror dims test cases
Fixed merge error affecting i8 with wmma
Fixed mismatch issue with DTLds + TSGR + TailLoop
Fixed a bug with InitAccVgprOpt + GSU>1 and a mismatch issue with PGR=0
Fixed override for unloaded solutions when lazy loading
Fixed some build errors (adding missing headers)
Fixed boost link for a clean build on ubuntu22
Fixed bug in forcestoresc1 arch selection
Fixed compiler directive for gfx941 and gfx942
Fixed formatting for DecisionTree_test.cpp
ROCm 5.7.1#
What’s new in this release#
ROCm 5.7.1 is a point release with several bug fixes in the HIP runtime.
Installing all GPU AddressSanitizer packages with a single command#
ROCm 5.7.1 simplifies the installation steps for the optional AddressSanitizer (ASan) packages. This release provides the meta package rocm-ml-sdk-asan for ease of ASan installation. The following command installs all ASan packages at once, rather than installing each package separately:
sudo apt-get install rocm-ml-sdk-asan
For more detailed information about using the GPU AddressSanitizer, refer to the user guide.
ROCm libraries#
rocBLAS#
A new functionality rocblas-gemm-tune and an environment variable ROCBLAS_TENSILE_GEMM_OVERRIDE_PATH are added to rocBLAS in the ROCm 5.7.1 release.
rocblas-gemm-tune is used to find the best-performing GEMM kernel for each GEMM problem set. It has a command line interface, which mimics the --yaml input used by rocblas-bench. To generate the expected --yaml input, profile logging can be used, by setting the environment variable ROCBLAS_LAYER=4.
For more information on rocBLAS logging, see Logging in rocBLAS, in the API Reference Guide.
An example input file:
Expected output (note selected GEMM idx may differ):
Where the far right values (solution_index) are the indices of the best-performing kernels for those GEMMs in the rocBLAS kernel library. These indices can be directly used in future GEMM calls. See rocBLAS/samples/example_user_driven_tuning.cpp for sample code of directly using kernels via their indices.
If the output is stored in a file, the results can be used to override default kernel selection with the kernels found by setting the environment variable ROCBLAS_TENSILE_GEMM_OVERRIDE_PATH, which points to the stored file.
For more details, refer to the rocBLAS Programmer’s Guide.
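The two environment variables involved in this workflow can be prepared from a launcher script. The sketch below only builds environment dictionaries; it assumes the profile-logging setting is `ROCBLAS_LAYER` with value 4, and the helper names and the file path `tuned_gemms.csv` are hypothetical.

```python
import os

PROFILE_LOGGING = 4  # assumed ROCBLAS_LAYER value for profile logging


def profiling_env(base=None):
    """Environment for a run that emits profile logs for later tuning."""
    env = dict(os.environ if base is None else base)
    env["ROCBLAS_LAYER"] = str(PROFILE_LOGGING)
    return env


def tuned_env(override_file, base=None):
    """Environment that points rocBLAS at a stored tuning result file."""
    env = dict(os.environ if base is None else base)
    env["ROCBLAS_TENSILE_GEMM_OVERRIDE_PATH"] = override_file
    return env


# "tuned_gemms.csv" is a placeholder path used only for illustration.
print(tuned_env("tuned_gemms.csv", base={})["ROCBLAS_TENSILE_GEMM_OVERRIDE_PATH"])
```

Either dictionary could then be passed as the `env` argument of `subprocess.run` when launching the application.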
HIP 5.7.1 (for ROCm 5.7.1)#
ROCm 5.7.1 is a point release with several bug fixes in the HIP runtime.
Defect fixes#
The hipPointerGetAttributes API returns the correct HIP memory type as hipMemoryTypeManaged for managed memory.
Library changes in ROCm 5.7.1#
Library | Version |
---|---|
AMDMIGraphX | |
composable_kernel | |
hipBLAS | |
hipCUB | |
hipFFT | |
hipRAND | |
hipSOLVER | 1.8.1 ⇒ 1.8.2 |
hipSPARSE | |
MIOpen | |
MIVisionX | |
rocALUTION | |
rocBLAS | |
rocFFT | |
rocm-cmake | |
rocPRIM | |
rocRAND | |
rocSOLVER | |
rocSPARSE | |
rocThrust | |
rocWMMA | |
rpp | |
Tensile | |
hipSOLVER 1.8.2#
hipSOLVER 1.8.2 for ROCm 5.7.1
Fixed#
Fixed conflicts between the hipsolver-dev and -asan packages by excluding hipsolver_module.f90 from the latter
ROCm 5.7.0#
Release highlights for ROCm 5.7#
New features include:
A new library (hipTensor)
Optimizations for rocRAND and MIVisionX
AddressSanitizer for host and device code (GPU) is now available as a beta
Note that ROCm 5.7.0 is EOS for MI50. 5.7 versions of ROCm are the last major releases in the ROCm 5 series. This release is Linux-only.
Important
The next major ROCm release (ROCm 6.0) will not be backward compatible with the ROCm 5 series. Changes will include: splitting LLVM packages into more manageable sizes, changes to the HIP runtime API, splitting rocRAND and hipRAND into separate packages, and reorganizing our file structure.
AMD Instinct™ MI50 end-of-support notice#
AMD Instinct MI50, Radeon Pro VII, and Radeon VII products (collectively gfx906 GPUs) will enter maintenance mode starting Q3 2023.
As outlined in 5.6.0, ROCm 5.7 will be the final release for gfx906 GPUs to be in a fully supported state.
ROCm 6.0 release will show MI50s as “under maintenance” for Linux and Windows
No new features and performance optimizations will be supported for the gfx906 GPUs beyond this major release (ROCm 5.7).
Bug fixes and critical security patches will continue to be supported for the gfx906 GPUs until Q2 2024 (end of maintenance [EOM] will be aligned with the closest ROCm release).
Bug fixes during the maintenance will be made to the next ROCm point release.
Bug fixes will not be backported to older ROCm releases for gfx906.
Distribution and operating system updates will continue per the ROCm release cadence for gfx906 GPUs until EOM.
Feature updates#
Non-hostcall HIP printf#
Current behavior
The current version of HIP printf relies on hostcalls, which, in turn, rely on PCIe atomics. However, PCIe atomics are unavailable in some environments, and, as a result, HIP printf does not work in those environments. Users may see the following error from runtime (with AMD_LOG_LEVEL 1 and above):
Pcie atomics not enabled, hostcall not supported
Workaround
The ROCm 5.7 release introduces an alternative to the current hostcall-based implementation that leverages an older OpenCL-based printf scheme, which does not rely on hostcalls/PCIe atomics.
Note
This option is less robust than hostcall-based implementation and is intended to be a workaround when hostcalls do not work.
The printf variant is now controlled via a new compiler option -mprintf-kind=
“hostcall” – This currently available implementation relies on hostcalls, which require the system to support PCIe atomics. It is the default scheme.
“buffered” – This implementation leverages the older printf scheme used by OpenCL; it relies on a memory buffer where printf arguments are stored during the kernel execution, and then the runtime handles the actual printing once the kernel finishes execution.
NOTE: With the new workaround:
The printf buffer is fixed size and non-circular. After the buffer is filled, calls to printf will not result in additional output.
The printf call returns either 0 (on success) or -1 (on failure, due to full buffer), unlike the hostcall scheme that returns the number of characters printed.
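The buffered scheme's limits are easier to see in a small model. The class below is a host-side Python sketch of a fixed-size, non-circular buffer that is flushed after the "kernel" finishes. It is not the runtime's implementation: the class name, the byte-based capacity, and storing formatted text instead of raw arguments are simplifying assumptions.

```python
class BufferedPrintf:
    """Toy model of a fixed-size, non-circular printf buffer."""

    def __init__(self, capacity):
        self.capacity = capacity   # total bytes available
        self.used = 0
        self.records = []

    def printf(self, fmt, *args):
        """Return 0 on success, -1 if the record does not fit."""
        text = fmt % args
        size = len(text.encode())
        if self.used + size > self.capacity:
            return -1              # buffer full: this output is dropped
        self.records.append(text)
        self.used += size
        return 0

    def flush(self):
        """Called once the kernel has finished; returns the collected text."""
        return "".join(self.records)


buf = BufferedPrintf(capacity=8)
first = buf.printf("%d\n", 1234)    # 5 bytes, fits
second = buf.printf("%d\n", 5678)   # would exceed 8 bytes
print(first, second, repr(buf.flush()))
```

Unlike the hostcall scheme, the return value here carries no character count, so code that checks the result of printf may need adjusting.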
Beta release of LLVM AddressSanitizer (ASan) with the GPU#
The ROCm 5.7 release introduces the beta release of LLVM AddressSanitizer (ASan) with the GPU. The LLVM ASan provides a process that allows developers to detect runtime addressing errors in applications and libraries. The detection is achieved using a combination of compiler-added instrumentation and runtime techniques, including function interception and replacement.
Until now, the LLVM ASan process was only available for traditional purely CPU applications. However, ROCm has extended this mechanism to additionally allow the detection of some addressing errors on the GPU in heterogeneous applications. Ideally, developers should treat heterogeneous HIP and OpenMP applications like pure CPU applications. However, this simplicity has not been achieved yet.
Refer to the documentation on LLVM ASan with the GPU at LLVM AddressSanitizer User Guide.
Note
The beta release of LLVM ASan for ROCm is currently tested and validated on Ubuntu 20.04.
Defect fixes#
The following defects are fixed in ROCm v5.7:
Test hangs observed in HMM RCCL
NoGpuTst test of Catch2 fails with Docker
Failures observed with non-HMM HIP directed catch2 tests with XNACK+
Multiple test failures and test hangs observed in hip-directed catch2 tests with xnack+
HIP 5.7.0#
Optimizations#
Additions#
Added meta_group_size/rank for getting the number of tiles and rank of a tile in the partition
Added new APIs supporting Windows only, under development on Linux
hipMallocMipmappedArray for allocating a mipmapped array on the device
hipFreeMipmappedArray for freeing a mipmapped array on the device
hipGetMipmappedArrayLevel for getting a mipmap level of a HIP mipmapped array
hipMipmappedArrayCreate for creating a mipmapped array
hipMipmappedArrayDestroy for destroying a mipmapped array
hipMipmappedArrayGetLevel for getting a mipmapped array on a mipmapped level
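The tile queries listed above reduce to simple integer arithmetic. The sketch below models them in plain Python under the assumption that the parent group is split into equally sized tiles; the function name and tuple layout are hypothetical and this is not HIP code.

```python
def tile_coordinates(thread_rank, parent_size, tile_size):
    """Return (meta_group_size, meta_group_rank, thread_rank_in_tile)."""
    if parent_size % tile_size != 0:
        raise ValueError("parent size must be a multiple of the tile size")
    meta_group_size = parent_size // tile_size   # number of tiles
    meta_group_rank = thread_rank // tile_size   # which tile this thread is in
    rank_in_tile = thread_rank % tile_size       # position inside that tile
    return meta_group_size, meta_group_rank, rank_in_tile


print(tile_coordinates(thread_rank=37, parent_size=64, tile_size=16))
```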
Changes#
Fixes#
Known issues#
HIP memory type enum values currently don’t support equivalent value to cudaMemoryTypeUnregistered, due to HIP functionality backward compatibility.
HIP API hipPointerGetAttributes could return invalid value in case the input memory pointer was not allocated through any HIP API on device or host.
Upcoming changes for HIP in ROCm 6.0 release#
Removal of gcnarch from hipDeviceProp_t structure
Addition of new fields in hipDeviceProp_t structure
maxTexture1D
maxTexture2D
maxTexture1DLayered
maxTexture2DLayered
sharedMemPerMultiprocessor
deviceOverlap
asyncEngineCount
surfaceAlignment
unifiedAddressing
computePreemptionSupported
hostRegisterSupported
uuid
Removal of deprecated code -hip-hcc codes from hip code tree
Correct hipArray usage in HIP APIs such as hipMemcpyAtoH and hipMemcpyHtoA
HIPMEMCPY_3D fields correction to avoid truncation of “size_t” to “unsigned int” inside hipMemcpy3D()
Renaming of ‘memoryType’ in hipPointerAttribute_t structure to ‘type’
Correct hipGetLastError to return the last error instead of last API call’s return code
Update hipExternalSemaphoreHandleDesc to add “unsigned int reserved[16]”
Correct handling of flag values in hipIpcOpenMemHandle for hipIpcMemLazyEnablePeerAccess
Remove hiparray* and make it opaque with hipArray_t
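The truncation that the HIPMEMCPY_3D correction addresses can be shown without any HIP code. The sketch below uses `ctypes` to emulate storing a 64-bit size in a 32-bit unsigned field; the function names and the sample extent are assumptions for illustration.

```python
import ctypes


def as_unsigned_int(n):
    """Value after storing n in a 32-bit unsigned field."""
    return ctypes.c_uint32(n).value


def as_size_t(n):
    """Value after storing n in a 64-bit unsigned field."""
    return ctypes.c_uint64(n).value


extent = 2**32 + 16                 # a byte count just above 4 GiB
print(as_unsigned_int(extent))      # high bits are lost
print(as_size_t(extent))            # value preserved
```

A copy sized from the truncated value would transfer far fewer bytes than requested.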
Library changes in ROCm 5.7.0#
Library | Version |
---|---|
AMDMIGraphX | 2.5 ⇒ 2.7 |
composable_kernel | |
hipBLAS | 0.54.0 ⇒ 1.1.0 |
hipCUB | |
hipFFT | |
hipRAND | |
hipSOLVER | 1.8.0 ⇒ 1.8.1 |
hipSPARSE | 2.3.7 ⇒ 2.3.8 |
MIOpen | |
MIVisionX | 2.4.0 ⇒ 2.5.0 |
rocALUTION | 2.1.9 ⇒ 2.1.11 |
rocBLAS | 3.0.0 ⇒ 3.1.0 |
rocFFT | 1.0.23 ⇒ 1.0.24 |
rocm-cmake | 0.9.0 ⇒ 0.10.0 |
rocPRIM | 2.13.0 ⇒ 2.13.1 |
rocRAND | |
rocSOLVER | 3.22.0 ⇒ 3.23.0 |
rocSPARSE | 2.5.2 ⇒ 2.5.4 |
rocThrust | |
rocWMMA | 1.1.0 ⇒ 1.2.0 |
rpp | |
Tensile | 4.37.0 ⇒ 4.38.0 |
AMDMIGraphX 2.7#
MIGraphX 2.7 for ROCm 5.7.0
Added#
Enabled hipRTC to not require dev packages for migraphx runtime and allow the ROCm install to be in a different directory than it was during build time
Add support for multi-target execution
Added Dynamic Batch support with C++/Python APIs
Add migraphx.create_argument to python API
Added dockerfile example for Ubuntu 22.04
Add TensorFlow supported ops in driver, similar to the existing ONNX operator list
Add a MIGRAPHX_TRACE_MATCHES_FOR env variable to filter the matcher trace
Improved debugging by printing max,min,mean and stddev values for TRACE_EVAL = 2
use fast_math flag instead of ENV flag for GELU
Print message from driver if offload copy is set for compiled program
Optimizations#
Optimized for ONNX Runtime 1.14.0
Improved compile times by only building for the GPU on the system
Improve performance of pointwise/reduction kernels when using NHWC layouts
Load specific version of the migraphx_py library
Annotate functions with the block size so the compiler can do a better job of optimizing
Enable reshape on nonstandard shapes
Use half HIP APIs to compute max and min
Added support for broadcasted scalars to unsqueeze operator
Improved multiplies with dot operator
Handle broadcasts across dot and concat
Add verify namespace for better symbol resolution
Fixed#
Resolved accuracy issues with FP16 resnet50
Update cpp generator to handle inf from float
Fix assertion error during verify and make DCE work with tuples
Fix convert operation for NaNs
Fix shape typo in API test
Fix compile warnings for shadowing variable names
Add missing specialization for the nullptr for the hash function
Changed#
Bumped version of half library to 5.6.0
Bumped CI to support rocm 5.6
Make building tests optional
replace np.bool with bool as per numpy request
Removed#
Removed int8x4 rocBlas calls due to deprecation
removed std::reduce usage since not all OS’ support it
composable_kernel 0.2.0#
CK 0.2.0 for ROCm 5.7.0
Fixed#
Fixed a bug in 6-dimensional kernels (#555).
Fixed grouped ConvBwdWeight test case failure (#524).
Optimizations#
Improve performance of normalization kernel
Added#
Added support on NAVI3x.
Added user tutorial (#563).
Added more instances for irregular GEMM sizes (#560).
Added inter-wave consumer-producer programming model for GEMM kernels (#310).
Added multi-D GEMM client APIs (#534).
Added multi-embeddings support (#542).
Added Navi3x blockwise GEMM and real GEMM support (#541).
Added Navi grouped ConvBwdWeight support (#505).
Changed#
Changed …
hipBLAS 1.1.0#
hipBLAS 1.1.0 for ROCm 5.7.0
Changed#
updated documentation requirements
Dependencies#
dependency rocSOLVER now depends on rocSPARSE
hipSOLVER 1.8.1#
hipSOLVER 1.8.1 for ROCm 5.7.0
Changed#
Changed hipsolver-test sparse input data search paths to be relative to the test executable
hipSPARSE 2.3.8#
hipSPARSE 2.3.8 for ROCm 5.7.0
Improved#
Fix compilation failures when using cusparse 12.1.0 backend
Fix compilation failures when using cusparse 12.0.0 backend
Fix compilation failures when using cusparse 10.1 (non-update versions) as backend
Minor improvements
MIVisionX 2.5.0#
MIVisionX for ROCm 5.7.0
Added#
CTest - OpenVX Tests
Hardware Support
Optimizations#
CMakeList Cleanup
Changed#
rocAL - PyBind Link to prebuilt library
PyBind11
RapidJSON
Setup Updates
RPP Version - 1.2.0
Dockerfiles - Updates & bugfix
Fixed#
rocAL bug fix and updates
Tested Configurations#
Windows 10/11
Linux distribution
Ubuntu - 20.04/22.04
CentOS - 7/8
RHEL - 8/9
SLES - 15-SP4
ROCm: rocm-core - 5.4.3.50403-121
miopen-hip - 2.19.0.50403-121
miopen-opencl - 2.18.0.50300-63
migraphx - 2.4.0.50403-121
Protobuf - V3.12.4
OpenCV - 4.6.0
RPP - 1.2.0
FFMPEG - n4.4.2
Dependencies for all the above packages
MIVisionX Setup Script - V2.5.4
Known Issues#
OpenCV 4.X support for some apps missing
MIVisionX Dependency Map#
HIP Backend#
Docker Image: sudo docker build -f docker/ubuntu20/{DOCKER_LEVEL_FILE_NAME}.dockerfile -t {mivisionx-level-NUMBER} .
new component added to the level
existing component from the previous level
Build Level | MIVisionX Dependencies | Modules | Libraries and Executables | Docker Tag |
---|---|---|---|---|
| | cmake <br> gcc <br> g++ | amd_openvx <br> utilities | | |
| | ROCm HIP <br> +Level 1 | amd_openvx <br> amd_openvx_extensions <br> utilities | | |
| | OpenCV <br> FFMPEG <br> +Level 2 | amd_openvx <br> amd_openvx_extensions <br> utilities | | |
| | MIOpenGEMM <br> MIOpen <br> ProtoBuf <br> +Level 3 | amd_openvx <br> amd_openvx_extensions <br> apps <br> utilities | | |
| | AMD_RPP <br> rocAL deps <br> +Level 4 | amd_openvx <br> amd_openvx_extensions <br> apps <br> rocAL <br> utilities | | |
OpenCL Backend#
Docker Image: sudo docker build -f docker/ubuntu20/{DOCKER_LEVEL_FILE_NAME}.dockerfile -t {mivisionx-level-NUMBER} .
new component added to the level
existing component from the previous level
Build Level | MIVisionX Dependencies | Modules | Libraries and Executables | Docker Tag |
---|---|---|---|---|
| | cmake <br> gcc <br> g++ | amd_openvx <br> utilities | | |
| | ROCm OpenCL <br> +Level 1 | amd_openvx <br> amd_openvx_extensions <br> utilities | | |
| | OpenCV <br> FFMPEG <br> +Level 2 | amd_openvx <br> amd_openvx_extensions <br> utilities | | |
| | MIOpenGEMM <br> MIOpen <br> ProtoBuf <br> +Level 3 | amd_openvx <br> amd_openvx_extensions <br> apps <br> utilities | | |
| | AMD_RPP <br> rocAL deps <br> +Level 4 | amd_openvx <br> amd_openvx_extensions <br> apps <br> rocAL <br> utilities | | |
NOTE: OpenVX and the OpenVX logo are trademarks of the Khronos Group Inc.
rocALUTION 2.1.11#
rocALUTION 2.1.11 for ROCm 5.7.0
Added#
Added support for gfx940, gfx941 and gfx942
Improved#
Fixed OpenMP runtime issue with Windows toolchain
rocBLAS 3.1.0#
rocBLAS 3.1.0 for ROCm 5.7.0
Added#
yaml lock step argument scanning for rocblas-bench and rocblas-test clients. See Programmers Guide for details.
rocblas-gemm-tune is used to find the best performing GEMM kernel for each of a given set of GEMM problems.
Fixed#
make offset calculations for rocBLAS functions 64 bit safe. Fixes for very large leading dimensions or increments potentially causing overflow:
Level 1: axpy, copy, rot, rotm, scal, swap, asum, dot, iamax, iamin, nrm2
Level 2: gemv, symv, hemv, trmv, ger, syr, her, syr2, her2, trsv
Level 3: gemm, symm, hemm, trmm, syrk, herk, syr2k, her2k, syrkx, herkx, trsm, trtri, dgmm, geam
General: set_vector, get_vector, set_matrix, get_matrix
Related fixes: internal scalar loads with > 32bit offsets
fix in-place functionality for all trtri sizes
Changed#
dot when using rocblas_pointer_mode_host is now synchronous to match legacy BLAS as it stores results in host memory
enhanced reporting of installation issues caused by runtime libraries (Tensile)
standardized internal rocblas C++ interface across most functions
Deprecated#
Removal of the __STDC_WANT_IEC_60559_TYPES_EXT__ define in a future release
Dependencies#
optional use of AOCL BLIS 4.0 on Linux for clients
optional build tool only dependency on python psutil
rocFFT 1.0.24#
rocFFT 1.0.24 for ROCm 5.7.0
Optimizations#
Improved performance of complex forward/inverse 1D FFTs (2049 <= length <= 131071) that use Bluestein’s algorithm.
Added#
Implemented a solution map version converter and finished the first conversion from ver.0 to ver.1, where version 1 removes some incorrect kernels (sbrc/sbcr using half_lds)
Changed#
Moved rocfft_rtc_helper executable to lib/rocFFT directory on Linux.
Moved library kernel cache to lib/rocFFT directory.
rocm-cmake 0.10.0#
rocm-cmake 0.10.0 for ROCm 5.7.0
Added#
Added ROCMTest module
ROCMCreatePackage: Added support for ASAN packages
rocPRIM 2.13.1#
rocPRIM 2.13.1 for ROCm 5.7.0
Changed#
Deprecated configuration radix_sort_config for device-level radix sort as it no longer matches the algorithm’s parameters. New configuration radix_sort_config_v2 is preferred instead.
Removed erroneous implementation of device-level inclusive_scan and exclusive_scan. The prior default implementation using lookback-scan now is the only available implementation.
The benchmark metric indicating the bytes processed for exclusive_scan_by_key and inclusive_scan_by_key has been changed to incorporate the key type. Furthermore, the benchmark log has been changed such that these algorithms are reported as scan and scan_by_key instead of scan_exclusive and scan_inclusive.
Deprecated configurations scan_config and scan_by_key_config for device-level scans, as they no longer match the algorithm’s parameters. New configurations scan_config_v2 and scan_by_key_config_v2 are preferred instead.
Fixed#
Fixed build issue caused by missing header in thread/thread_search.hpp.
rocSOLVER 3.23.0#
rocSOLVER 3.23.0 for ROCm 5.7.0
Added#
LU factorization without pivoting for block tridiagonal matrices:
GEBLTTRF_NPVT now supports interleaved_batched format
Linear system solver without pivoting for block tridiagonal matrices:
GEBLTTRS_NPVT now supports interleaved_batched format
Fixed#
Fixed stack overflow in sparse tests on Windows
Changed#
Changed rocsolver-test sparse input data search paths to be relative to the test executable
Changed build scripts to default to compressed debug symbols in Debug builds
rocSPARSE 2.5.4#
rocSPARSE 2.5.4 for ROCm 5.7.0
Added#
Added more mixed precisions for SpMV, (matrix: float, vectors: double, calculation: double) and (matrix: rocsparse_float_complex, vectors: rocsparse_double_complex, calculation: rocsparse_double_complex)
Added support for gfx940, gfx941 and gfx942
Improved#
Fixed a bug in csrsm and bsrsm
Known Issues#
In csritlu0, the algorithm rocsparse_itilu0_alg_sync_split_fusion has some accuracy issues to investigate with XNACK enabled. The fallback is rocsparse_itilu0_alg_sync_split.
rocWMMA 1.2.0#
rocWMMA 1.2.0 for ROCm 5.7.0
Changed#
Fixed a bug with synchronization
Updated rocWMMA cmake versioning
rpp 1.2.0#
rpp for ROCm 5.7.0
Added#
New Tests
Optimizations#
Readme Updates
Changed#
Backend - Default Backend set to HIP
Fixed#
Minor bugs and warnings
Tested Configurations#
Linux distribution
Ubuntu - 18.04/20.04
CentOS - 8
ROCm: rocm-core - 5.0.0.50000-49
Clang - Version 6.0
CMake - Version 3.22.3
Boost - Version 1.72
IEEE 754-based half-precision floating-point library - Version 1.12.0
Known Issues#
CPU-only backend not enabled
Rpp 1.1.0#
Rpp 1.0.0#
Rpp 0.99#
Rpp 0.98#
Rpp 0.97#
Rpp 0.96#
Rpp 0.95#
Rpp 0.93#
Tensile 4.38.0#
Tensile 4.38.0 for ROCm 5.7.0
Added#
Added support for FP16 Alt Round Near Zero Mode (this feature allows the generation of alternate kernels with intermediate rounding instead of truncation)
Added user-driven solution selection feature
Optimizations#
Enabled LocalSplitU with MFMA for I8 data type
Optimized K mask code in mfmaIter
Enabled TailLoop code in NoLoadLoop to prefetch global/local read
Enabled DirectToVgpr in TailLoop for NN, TN, and TT matrix orientations
Optimized DirectToLds test cases to reduce the test duration
Changed#
Removed DGEMM NT custom kernels and related test cases
Changed noTailLoop logic to apply noTailLoop only for NT
Changed the range of AssertFree0ElementMultiple and Free1
Unified aStr, bStr generation code in mfmaIter
Fixed#
Fixed LocalSplitU mismatch issue for SGEMM
Fixed BufferStore=0 and Ldc != Ldd case
Fixed mismatch issue with TailLoop + MatrixInstB > 1
ROCm 5.6.1#
What’s new in this release#
ROCm 5.6.1 is a point release with several bug fixes in the HIP runtime.
HIP 5.6.1 (for ROCm 5.6.1)#
Defect fixes#
hipMemcpy device-to-device (inter-device) is now asynchronous with respect to the host
Enabled xnack+ check in HIP catch2 tests hang when executing tests
Memory leak when code object files are loaded/unloaded via hipModuleLoad/hipModuleUnload APIs
Using hipGraphAddMemFreeNode no longer results in a crash
Library changes in ROCm 5.6.1#
Library | Version |
---|---|
AMDMIGraphX | |
hipBLAS | |
hipCUB | |
hipFFT | |
hipRAND | |
hipSOLVER | |
hipSPARSE | 2.3.6 ⇒ 2.3.7 |
MIOpen | |
MIVisionX | |
rccl | |
rocALUTION | |
rocBLAS | |
rocFFT | |
rocm-cmake | |
rocPRIM | |
rocRAND | |
rocSOLVER | |
rocSPARSE | |
rocThrust | |
rocWMMA | |
Tensile | |
hipSPARSE 2.3.7#
hipSPARSE 2.3.7 for ROCm 5.6.1
Bugfix#
Reverted an undocumented API change in hipSPARSE 2.3.6 that affected hipsparseSpSV_solve function
ROCm 5.6.0#
Release highlights#
ROCm 5.6 consists of several AI software ecosystem improvements for our fast-growing user base. A few examples include:
New documentation portal at https://rocm.docs.amd.com
Ongoing software enhancements for LLMs, ensuring full compliance with the HuggingFace unit test suite
OpenAI Triton, CuPy, HIP Graph support, and many other library performance enhancements
Improved ROCm deployment and development tools, including CPU-GPU (rocGDB) debugger, profiler, and docker containers
New pseudorandom generators are available in rocRAND. Added support for half-precision transforms in hipFFT/rocFFT. Added LU refactorization and linear system solver for sparse matrices in rocSOLVER.
OS and GPU support changes#
SLES15 SP5 support was added in this release. SLES15 SP3 support was dropped.
AMD Instinct MI50, Radeon Pro VII, and Radeon VII products (collectively referred to as gfx906 GPUs) will enter maintenance mode starting Q3 2023. This will be aligned with the ROCm 5.7 GA release date.
No new features and performance optimizations will be supported for the gfx906 GPUs beyond ROCm 5.7
Bug fixes / critical security patches will continue to be supported for the gfx906 GPUs till Q2 2024 (EOM will be aligned with the closest ROCm release)
Bug fixes during the maintenance will be made to the next ROCm point release
Bug fixes will not be back ported to older ROCm releases for this SKU
Distro / Operating system updates will continue per the ROCm release cadence for gfx906 GPUs till EOM.
AMDSMI CLI 23.0.0.4#
Additions#
AMDSMI CLI tool enabled for Linux Bare Metal & Guest
Package: amd-smi-lib
Known issues#
not all Error Correction Code (ECC) fields are currently supported
RHEL 8 & SLES 15 have extra install steps
Kernel modules (DKMS)#
Fixes#
Stability fix for multi GPU system reproducible via ROCm_Bandwidth_Test as reported in Issue 2198.
HIP 5.6 (for ROCm 5.6)#
Optimizations#
Consolidation of hipamd, rocclr and OpenCL projects in clr
Optimized lock for graph global capture mode
Additions#
Added hipRTC support for amd_hip_fp16
Added hipStreamGetDevice implementation to get the device associated with the stream
Added HIP_AD_FORMAT_SIGNED_INT16 in hipArray formats
hipArrayGetInfo for getting information about the specified array
hipArrayGetDescriptor for getting 1D or 2D array descriptor
hipArray3DGetDescriptor to get 3D array descriptor
Changes#
hipMallocAsync to return success for zero size allocation to match hipMalloc
Separation of hipcc perl binaries from HIP project to hipcc project. hip-devel package depends on newly added hipcc package
Consolidation of hipamd, ROCclr, and OpenCL repositories into a single repository called clr. Instructions are updated to build HIP from sources in the HIP Installation guide
Removed hipBusBandwidth and hipCommander samples from hip-tests
Fixes#
Fixed regression in hipMemCpyParam3D when offset is applied
Known issues#
Limited testing on xnack+ configuration
Multiple HIP tests failures (gpuvm fault or hangs)
hipSetDevice and hipSetDeviceFlags APIs return hipErrorInvalidDevice instead of hipErrorNoDevice, on a system without GPU
Known memory leak when code object files are loaded/unloaded via hipModuleLoad/hipModuleUnload APIs. Issue will be fixed in a future ROCm release
Upcoming changes in future release#
Removal of gcnarch from hipDeviceProp_t structure
Addition of new fields in hipDeviceProp_t structure
maxTexture1D
maxTexture2D
maxTexture1DLayered
maxTexture2DLayered
sharedMemPerMultiprocessor
deviceOverlap
asyncEngineCount
surfaceAlignment
unifiedAddressing
computePreemptionSupported
uuid
Removal of deprecated code
hip-hcc codes from hip code tree
Correct hipArray usage in HIP APIs such as hipMemcpyAtoH and hipMemcpyHtoA
HIPMEMCPY_3D fields correction (unsigned int -> size_t)
Renaming of ‘memoryType’ in hipPointerAttribute_t structure to ‘type’
ROCgdb-13 (For ROCm 5.6.0)#
Optimizations#
Improved performance when handling the end of a process with a large number of threads.
Known issues#
On certain configurations, ROCgdb can show the following warning message:
warning: Probes-based dynamic linker interface failed. Reverting to original interface.
This does not affect ROCgdb’s functionality.
ROCprofiler (for ROCm 5.6.0)#
In ROCm 5.6 the rocprofilerv1 and rocprofilerv2 include and library files of ROCm 5.5 are split into separate files. The rocmtools files that were deprecated in ROCm 5.5 have been removed.
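ROCm 5.6 | rocprofilerv1 | rocprofilerv2 |
---|---|---|
Tool script | rocprof | rocprofv2 |
API include | rocprofiler/rocprofiler.h | rocprofiler/v2/rocprofiler.h |
API library | librocprofiler64.so.1 | librocprofiler64.so.2 |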
The ROCm Profiler Tool that uses rocprofilerV1 can be invoked using the following command:
rocprof …
To write a custom tool based on the rocprofilerV1 API do the following:
main.c:
#include <rocprofiler/rocprofiler.h> // Use the rocprofilerV1 API
int main() {
// Use the rocprofilerV1 API
return 0;
}
This can be built in the following manner:
gcc main.c -I/opt/rocm-5.6.0/include -L/opt/rocm-5.6.0/lib -lrocprofiler64
The resulting a.out will depend on /opt/rocm-5.6.0/lib/librocprofiler64.so.1.
The ROCm Profiler that uses rocprofilerV2 API can be invoked using the following command:
rocprofv2 …
To write a custom tool based on the rocprofilerV2 API do the following:
main.c:
#include <rocprofiler/v2/rocprofiler.h> // Use the rocprofilerV2 API
int main() {
// Use the rocprofilerV2 API
return 0;
}
This can be built in the following manner:
gcc main.c -I/opt/rocm-5.6.0/include -L/opt/rocm-5.6.0/lib -lrocprofiler64-v2
The resulting a.out will depend on /opt/rocm-5.6.0/lib/librocprofiler64.so.2.
Optimizations#
Improved Test Suite
Additions#
‘end_time’ need to be disabled in roctx_trace.txt
Fixes#
rocprof in ROCm/5.4.0 gpu selector broken.
rocprof in ROCm/5.4.1 fails to generate kernel info.
rocprof clobbers LD_PRELOAD.
Library changes in ROCm 5.6.0#
Library | Version |
---|---|
AMDMIGraphX | |
hipBLAS | |
hipCUB | |
hipFFT | 1.0.11 ⇒ 1.0.12 |
hipRAND | |
hipSOLVER | 1.7.0 ⇒ 1.8.0 |
hipSPARSE | 2.3.5 ⇒ 2.3.6 |
MIOpen | |
MIVisionX | 2.3.0 ⇒ 2.4.0 |
rccl | |
rocALUTION | 2.1.8 ⇒ 2.1.9 |
rocBLAS | 2.47.0 ⇒ 3.0.0 |
rocFFT | 1.0.22 ⇒ 1.0.23 |
rocm-cmake | 0.8.1 ⇒ 0.9.0 |
rocPRIM | |
rocRAND | |
rocSOLVER | 3.21.0 ⇒ 3.22.0 |
rocSPARSE | 2.5.1 ⇒ 2.5.2 |
rocThrust | 2.17.0 ⇒ 2.18.0 |
rocWMMA | 1.0 ⇒ 1.1.0 |
Tensile | 4.36.0 ⇒ 4.37.0 |
hipFFT 1.0.12#
hipFFT 1.0.12 for ROCm 5.6.0
Added#
Implemented the hipfftXtMakePlanMany, hipfftXtGetSizeMany, hipfftXtExec APIs, to allow requesting half-precision transforms.
Changed#
Added --precision argument to benchmark/test clients. --double is still accepted but is deprecated as a method to request a double-precision transform.
hipSOLVER 1.8.0#
hipSOLVER 1.8.0 for ROCm 5.6.0
Added#
Added compatibility API with hipsolverRf prefix
hipSPARSE 2.3.6#
hipSPARSE 2.3.6 for ROCm 5.6.0
Added#
Added SpGEMM algorithms
Changed#
For hipsparseXbsr2csr and hipsparseXcsr2bsr, blockDim == 0 now returns HIPSPARSE_STATUS_INVALID_SIZE
MIVisionX 2.4.0#
MIVisionX for ROCm 5.6.0
Added#
OpenVX FP16 Support
rocAL - CPU, HIP, & OCL backends
AMD RPP - CPU, HIP, & OCL backends
MIVisionX Setup Support for RHEL
Extended OS Support
Docker Support for Ubuntu 22.04
Tests
Optimizations#
CMakeList Cleanup
MIGraphX Extension Updates
rocAL - Documentation
CMakeList Updates & Cleanup
Changed#
rocAL - Changing Python Lib Path
Docker Support - Ubuntu 18 Support Dropped
RPP - Link to Version 1.0.0
rocAL - support updates
Setup Updates
Fixed#
rocAL bug fix and updates
AMD RPP - bug fixes
CMakeLists - Issues
RPATH - Link Issues
Tested Configurations#
Windows 10/11
Linux distribution
Ubuntu - 20.04/22.04
CentOS - 7/8
RHEL - 8/9
SLES - 15-SP3
ROCm: rocm-core - 5.4.3.50403-121
miopen-hip - 2.19.0.50403-121
miopen-opencl - 2.18.0.50300-63
migraphx - 2.4.0.50403-121
Protobuf - V3.12.4
OpenCV - 4.6.0
RPP - 1.0.0
FFMPEG - n4.4.2
Dependencies for all the above packages
MIVisionX Setup Script - V2.4.2
Known Issues#
OpenCV 4.X support for some apps missing
Mivisionx Dependency Map#
Docker Image: sudo docker build -f docker/ubuntu20/{DOCKER_LEVEL_FILE_NAME}.dockerfile -t {mivisionx-level-NUMBER} .
new component added to the level
existing component from the previous level
Build Level | MIVisionX Dependencies | Modules | Libraries and Executables | Docker Tag |
---|---|---|---|---|
Level 1 | cmake <br> gcc <br> g++ | amd_openvx <br> utilities | | |
Level 2 | ROCm OpenCL <br> +Level 1 | amd_openvx <br> amd_openvx_extensions <br> utilities | | |
Level 3 | OpenCV <br> FFMPEG <br> +Level 2 | amd_openvx <br> amd_openvx_extensions <br> utilities | | |
Level 4 | MIOpenGEMM <br> MIOpen <br> ProtoBuf <br> +Level 3 | amd_openvx <br> amd_openvx_extensions <br> apps <br> utilities | | |
Level 5 | AMD_RPP <br> rocAL deps <br> +Level 4 | amd_openvx <br> amd_openvx_extensions <br> apps <br> rocAL <br> utilities | | |
NOTE: OpenVX and the OpenVX logo are trademarks of the Khronos Group Inc.
rocALUTION 2.1.9#
rocALUTION 2.1.9 for ROCm 5.6.0
Improved#
Fixed synchronization issues in level 1 routines
rocBLAS 3.0.0#
rocBLAS 3.0.0 for ROCm 5.6.0
Optimizations#
Improved performance of Level 2 rocBLAS GEMV on gfx90a GPU for non-transposed problems having small matrices and larger batch counts. Performance enhanced for problem sizes when m and n <= 32 and batch_count >= 256.
Improved performance of rocBLAS syr2k for single, double, and double-complex precision, and her2k for double-complex precision. Slightly improved performance for general sizes on gfx90a.
Added#
Added bf16 inputs and f32 compute support to Level 1 rocBLAS Extension functions axpy_ex, scal_ex and nrm2_ex.
Deprecated#
trmm inplace is deprecated. It will be replaced by trmm that has both inplace and out-of-place functionality
rocblas_query_int8_layout_flag() is deprecated and will be removed in a future release
rocblas_gemm_flags_pack_int8x4 enum is deprecated and will be removed in a future release
rocblas_set_device_memory_size() is deprecated and will be replaced by a future function rocblas_increase_device_memory_size()
rocblas_is_user_managing_device_memory() is deprecated and will be removed in a future release
Removed#
is_complex helper was deprecated and is now removed. Use rocblas_is_complex instead.
The enum truncate_t and the value truncate were deprecated and are now removed. They were replaced by rocblas_truncate_t and rocblas_truncate, respectively.
rocblas_set_int8_type_for_hipblas was deprecated and is now removed.
rocblas_get_int8_type_for_hipblas was deprecated and is now removed.
Dependencies#
build only dependency on python joblib added as used by Tensile build
fix for cmake install on some OS when performed by install.sh -d --cmake_install
Fixed#
make trsm offset calculations 64 bit safe
Changed#
refactor rotg test code
rocFFT 1.0.23#
rocFFT 1.0.23 for ROCm 5.6.0
Added#
Implemented half-precision transforms, which can be requested by passing rocfft_precision_half to rocfft_plan_create.
Implemented a hierarchical solution map which saves how to decompose a problem and the kernels to be used.
Implemented a first version of offline-tuner to support tuning kernels for C2C/Z2Z problems.
Changed#
Replaced std::complex with hipComplex data types for data generator.
FFT plan dimensions are now sorted to be row-major internally where possible, which produces better plans if the dimensions were accidentally specified in a different order (column-major, for example).
Added --precision argument to benchmark/test clients. --double is still accepted but is deprecated as a method to request a double-precision transform.
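To illustrate what sorting plan dimensions into row-major order can mean, here is a minimal sketch that reorders dimensions by descending stride, so the slowest-varying dimension comes first and the contiguous one comes last. The `Dim` struct and `to_row_major` function are hypothetical names for illustration only; this is not rocFFT's internal implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical description of one FFT dimension: its length and its
// stride (distance in elements between consecutive indices).
struct Dim {
    std::size_t length;
    std::size_t stride;
};

// Reorder dimensions so that strides are descending (row-major order:
// slowest-varying dimension first, contiguous dimension last).
// stable_sort keeps the caller's order for dimensions with equal strides.
std::vector<Dim> to_row_major(std::vector<Dim> dims) {
    std::stable_sort(dims.begin(), dims.end(),
                     [](const Dim& a, const Dim& b) { return a.stride > b.stride; });
    return dims;
}
```

For example, dimensions given fastest-first as (4, stride 1), (8, stride 4), (2, stride 32) come back as (2, stride 32), (8, stride 4), (4, stride 1).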
Fixed#
Fixed over-allocation of LDS in some real-complex kernels, which was resulting in kernel launch failure.
rocm-cmake 0.9.0#
rocm-cmake 0.9.0 for ROCm 5.6.0
Added#
Added the option ROCM_HEADER_WRAPPER_WERROR
Compile-time C macro in the wrapper headers causes errors to be emitted instead of warnings.
Configure-time CMake option sets the default for the C macro.
rocSOLVER 3.22.0#
rocSOLVER 3.22.0 for ROCm 5.6.0
Added#
LU refactorization for sparse matrices
CSRRF_ANALYSIS
CSRRF_SUMLU
CSRRF_SPLITLU
CSRRF_REFACTLU
Linear system solver for sparse matrices
CSRRF_SOLVE
Added type rocsolver_rfinfo for use with sparse matrix routines
Optimized#
Improved the performance of BDSQR and GESVD when singular vectors are requested
Fixed#
BDSQR and GESVD should no longer hang when the input contains NaN or Inf
rocSPARSE 2.5.2#
rocSPARSE 2.5.2 for ROCm 5.6.0
Improved#
Fixed a memory leak in csritsv
Fixed a bug in csrsm and bsrsm
rocThrust 2.18.0#
rocThrust 2.18.0 for ROCm 5.6.0
Fixed#
lower_bound, upper_bound, and binary_search failed to compile for certain types.
Changed#
Updated docs directory structure to match the standard of rocm-docs-core.
rocWMMA 1.1.0#
rocWMMA 1.1.0 for ROCm 5.6.0
Added#
Added cross-lane operation backends (Blend, Permute, Swizzle and Dpp)
Added GPU kernels for rocWMMA unit test pre-process and post-process operations (fill, validation)
Added performance gemm samples for half, single and double precision
Added rocWMMA cmake versioning
Added vectorized support in coordinate transforms
Included ROCm smi for runtime clock rate detection
Added fragment transforms for transpose and change data layout
Changed#
Default to GPU rocBLAS validation against rocWMMA
Re-enabled int8 gemm tests on gfx9
Upgraded to C++17
Restructured unit test folder for consistency
Consolidated rocWMMA samples common code
Tensile 4.37.0#
Tensile 4.37.0 for ROCm 5.6.0
Added#
Added user driven tuning API
Added decision tree fallback feature
Added SingleBuffer + AtomicAdd option for GlobalSplitU
DirectToVgpr support for fp16 and Int8 with TN orientation
Added new test cases for various functions
Added SingleBuffer algorithm for ZGEMM/CGEMM
Added joblib for parallel map calls
Added support for MFMA + LocalSplitU + DirectToVgprA+B
Added asmcap check for MIArchVgpr
Added support for MFMA + LocalSplitU
Added frequency, power, and temperature data to the output
Optimizations#
Improved the performance of GlobalSplitU with SingleBuffer algorithm
Reduced the running time of the extended and pre_checkin tests
Optimized the Tailloop section of the assembly kernel
Optimized complex GEMM (fixed vgpr allocation, unified CGEMM and ZGEMM code in MulMIoutAlphaToArch)
Improved the performance of the second kernel of MultipleBuffer algorithm
Changed#
Updated custom kernels with 64-bit offsets
Adapted 64-bit offset arguments for assembly kernels
Improved temporary register re-use to reduce max sgpr usage
Removed some restrictions on VectorWidth and DirectToVgpr
Updated the dependency requirements for Tensile
Changed the range of AssertSummationElementMultiple
Modified the error messages for more clarity
Changed DivideAndReminder to vectorStaticRemainder in case quotient is not used
Removed dummy vgpr for vectorStaticRemainder
Removed tmpVgpr parameter from vectorStaticRemainder/Divide/DivideAndReminder
Removed qReg parameter from vectorStaticRemainder
Fixed#
Fixed tmp sgpr allocation to avoid over-writing values (alpha)
64-bit offset parameters for post kernels
Fixed gfx908 CI test failures
Fixed offset calculation to prevent overflow for large offsets
Fixed issues when BufferLoad and BufferStore are equal to zero
Fixed StoreCInUnroll + DirectToVgpr + no useInitAccVgprOpt mismatch
Fixed DirectToVgpr + LocalSplitU + FractionalLoad mismatch
Fixed the memory access error related to StaggerU + large stride
Fixed ZGEMM 4x4 MatrixInst mismatch
Fixed DGEMM 4x4 MatrixInst mismatch
Fixed ASEM + GSU + NoTailLoop opt mismatch
Fixed AssertSummationElementMultiple + GlobalSplitU issues
Fixed ASEM + GSU + TailLoop inner unroll
ROCm 5.5.1#
What’s new in this release#
HIP SDK for Windows#
AMD is pleased to announce the availability of the HIP SDK for Windows as part of ROCm software. The HIP SDK OS and GPU support page lists the versions of Windows and GPUs validated by AMD. HIP SDK features on Windows are described in detail in our What is ROCm? page and differ from the Linux feature set. Visit the Quick Start page to get started. Known issues are tracked on GitHub.
HIP API change#
The following HIP API is updated in the ROCm 5.5.1 release:
hipDeviceSetCacheConfig#
The return value for hipDeviceSetCacheConfig is updated from hipErrorNotSupported to hipSuccess
Library changes in ROCm 5.5.1#
Library | Version |
---|---|
AMDMIGraphX | |
hipBLAS | |
hipBLASLt | |
hipCUB | |
hipFFT | |
hipRAND | |
hipSOLVER | |
hipSPARSE | |
MIOpen | |
MIVisionX | |
rccl | |
rocALUTION | |
rocBLAS | |
rocFFT | |
rocm-cmake | |
rocPRIM | |
rocRAND | |
rocSOLVER | |
rocSPARSE | |
rocThrust | |
rocWMMA | |
Tensile | |
ROCm 5.5.0#
What’s new in this release#
HIP enhancements#
The ROCm v5.5 release consists of the following HIP enhancements:
Enhanced stack size limit#
In this release, the stack size limit is increased from 16k to 131056 bytes (or 128K - 16). Applications that need to change the stack size can use the hipDeviceSetLimit API.
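As a quick check of the arithmetic above, the following sketch computes the new ceiling and tests a requested stack size against it. The names `kMaxStackSizeBytes` and `is_valid_stack_size` are hypothetical helpers for illustration and are not part of HIP; in an application, a value that passes the check would typically be handed to `hipDeviceSetLimit` with the `hipLimitStackSize` limit.

```cpp
#include <cassert>
#include <cstddef>

// New stack size ceiling described above: 128 KiB minus 16 bytes.
// (Hypothetical constant name, used only in this sketch.)
constexpr std::size_t kMaxStackSizeBytes = 128 * 1024 - 16;
static_assert(kMaxStackSizeBytes == 131056, "128K - 16 should equal 131056");

// Hypothetical helper: true if the requested size does not exceed the ceiling.
bool is_valid_stack_size(std::size_t requested_bytes) {
    return requested_bytes <= kMaxStackSizeBytes;
}
```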
hipcc changes#
The following hipcc changes are implemented in this release:
hipcc will not implicitly link to libpthread and librt, as they are no longer a link time dependence for HIP programs. Applications that depend on these libraries must explicitly link to them.
-use-staticlib and -use-sharedlib options are deprecated.
Future changes#
Separation of hipcc binaries (Perl scripts) from HIP to hipcc project. Users will access a separate hipcc package for installing hipcc binaries in future ROCm releases.
In a future ROCm release, the following samples will be removed from the hip-tests project.
hipBusBandwidth at ROCm/hip-tests
hipCommander at ROCm/hip-tests
Note that the samples will continue to be available in previous release branches.
Removal of gcnarch from hipDeviceProp_t structure
Addition of new fields in hipDeviceProp_t structure
maxTexture1D
maxTexture2D
maxTexture1DLayered
maxTexture2DLayered
sharedMemPerMultiprocessor
deviceOverlap
asyncEngineCount
surfaceAlignment
unifiedAddressing
computePreemptionSupported
hostRegisterSupported
uuid
Removal of deprecated code
hip-hcc codes from hip code tree
Correct hipArray usage in HIP APIs such as hipMemcpyAtoH and hipMemcpyHtoA
HIPMEMCPY_3D fields correction to avoid truncation of “size_t” to “unsigned int” inside hipMemcpy3D()
Renaming of ‘memoryType’ in hipPointerAttribute_t structure to ‘type’
Correct hipGetLastError to return the last error instead of last API call’s return code
Update hipExternalSemaphoreHandleDesc to add “unsigned int reserved[16]”
Correct handling of flag values in hipIpcOpenMemHandle for hipIpcMemLazyEnablePeerAccess
Remove hiparray* and make it opaque with hipArray_t
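The HIPMEMCPY_3D field correction listed above addresses truncation of `size_t` values stored in `unsigned int` fields. The sketch below shows the effect in isolation. `CopyExtentNarrow`, `CopyExtentWide`, `make_narrow`, and `make_wide` are hypothetical stand-ins and do not reflect the real HIPMEMCPY_3D layout; the example also assumes a platform with a 32-bit `unsigned int` and a 64-bit `size_t`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins for a copy descriptor field before and after the change.
struct CopyExtentNarrow { unsigned int width_in_bytes; };
struct CopyExtentWide   { std::size_t  width_in_bytes; };

// Storing a 64-bit size in a 32-bit field keeps only the low 32 bits
// (assumes a 32-bit unsigned int).
CopyExtentNarrow make_narrow(std::uint64_t bytes) {
    return CopyExtentNarrow{static_cast<unsigned int>(bytes)};
}

// A size_t field preserves the full value (assumes a 64-bit size_t).
CopyExtentWide make_wide(std::uint64_t bytes) {
    return CopyExtentWide{static_cast<std::size_t>(bytes)};
}
```

With a request of 4 GiB plus 4096 bytes, the narrow field silently wraps to 4096, while the wide field keeps the full size.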
New HIP APIs in this release#
Note
This is a pre-official version (beta) release of the new APIs and may contain unresolved issues.
Memory management HIP APIs#
The new memory management HIP API is as follows:
Sets information on the specified pointer [BETA].
hipError_t hipPointerSetAttribute(const void* value, hipPointer_attribute attribute, hipDeviceptr_t ptr);
Module management HIP APIs#
The new module management HIP APIs are as follows:
Launches kernel f with launch parameters and shared memory on stream with arguments passed to kernelParams, where thread blocks can cooperate and synchronize as they run.
hipError_t hipModuleLaunchCooperativeKernel(hipFunction_t f, unsigned int gridDimX, unsigned int gridDimY, unsigned int gridDimZ, unsigned int blockDimX, unsigned int blockDimY, unsigned int blockDimZ, unsigned int sharedMemBytes, hipStream_t stream, void** kernelParams);
Launches kernels on multiple devices where thread blocks can cooperate and synchronize as they run.
hipError_t hipModuleLaunchCooperativeKernelMultiDevice(hipFunctionLaunchParams* launchParamsList, unsigned int numDevices, unsigned int flags);
HIP graph management APIs#
The new HIP graph management APIs are as follows:
Creates a memory allocation node and adds it to a graph [BETA]
hipError_t hipGraphAddMemAllocNode(hipGraphNode_t* pGraphNode, hipGraph_t graph, const hipGraphNode_t* pDependencies, size_t numDependencies, hipMemAllocNodeParams* pNodeParams);
Return parameters for memory allocation node [BETA]
hipError_t hipGraphMemAllocNodeGetParams(hipGraphNode_t node, hipMemAllocNodeParams* pNodeParams);
Creates a memory free node and adds it to a graph [BETA]
hipError_t hipGraphAddMemFreeNode(hipGraphNode_t* pGraphNode, hipGraph_t graph, const hipGraphNode_t* pDependencies, size_t numDependencies, void* dev_ptr);
Returns parameters for memory free node [BETA].
hipError_t hipGraphMemFreeNodeGetParams(hipGraphNode_t node, void* dev_ptr);
Write a DOT file describing graph structure [BETA].
hipError_t hipGraphDebugDotPrint(hipGraph_t graph, const char* path, unsigned int flags);
Copies attributes from source node to destination node [BETA].
hipError_t hipGraphKernelNodeCopyAttributes(hipGraphNode_t hSrc, hipGraphNode_t hDst);
Enables or disables the specified node in the given graphExec [BETA]
hipError_t hipGraphNodeSetEnabled(hipGraphExec_t hGraphExec, hipGraphNode_t hNode, unsigned int isEnabled);
Query whether a node in the given graphExec is enabled [BETA]
hipError_t hipGraphNodeGetEnabled(hipGraphExec_t hGraphExec, hipGraphNode_t hNode, unsigned int* isEnabled);
OpenMP enhancements#
This release consists of the following OpenMP enhancements:
Additional support for OMPT functions get_device_time and get_record_type
Added support for min/max fast fp atomics on AMD GPUs
Fixed the use of the abs function in C device regions
Deprecations and warnings#
HIP deprecation#
The hipcc and hipconfig Perl scripts are deprecated. In a future release, compiled binaries will be available as hipcc.bin and hipconfig.bin as replacements for the Perl scripts.
Note
There will be a transition period where the Perl scripts and compiled binaries are available before the scripts are removed. There will be no functional difference between the Perl scripts and their compiled binary counterpart. No user action is required. Once these are available, users can optionally switch to hipcc.bin and hipconfig.bin. The hipcc/hipconfig soft link will be assimilated to point from hipcc/hipconfig to the respective compiled binaries as the default option.
Linux file system hierarchy standard for ROCm#
ROCm packages have adopted the Linux foundation file system hierarchy standard in this release to ensure ROCm components follow open source conventions for Linux-based distributions. While moving to a new file system hierarchy, ROCm ensures backward compatibility with its 5.1 version or older file system hierarchy. See below for a detailed explanation of the new file system hierarchy and backward compatibility.
New file system hierarchy#
The following is the new file system hierarchy:
/opt/rocm-<ver>
| --bin
| --All externally exposed Binaries
| --libexec
| --<component>
| -- Component specific private non-ISA executables (architecture independent)
| --include
| -- <component>
| --<header files>
| --lib
| --lib<soname>.so -> lib<soname>.so.major -> lib<soname>.so.major.minor.patch
(public libraries linked with application)
| --<component> (component specific private library, executable data)
| --<cmake>
| --components
| --<component>.config.cmake
| --share
| --html/<component>/*.html
| --info/<component>/*.[pdf, md, txt]
| --man
| --doc
| --<component>
| --<licenses>
| --<component>
| --<misc files> (arch independent non-executable)
| --samples
Note
ROCm will not support backward compatibility with the v5.1 (old) file system hierarchy in its next major release.
For more information, refer to https://refspecs.linuxfoundation.org/fhs.shtml.
Backward compatibility with older file systems#
ROCm has moved header files and libraries to its new location as indicated in the above structure and included symbolic-link and wrapper header files in its old location for backward compatibility.
Note
ROCm will continue supporting backward compatibility until the next major release.
Wrapper header files#
Wrapper header files are placed in the old location (/opt/rocm-xxx/<component>/include) with a warning message to include files from the new location (/opt/rocm-xxx/include) as shown in the example below:
// Code snippet from hip_runtime.h
#pragma message "This file is deprecated. Use file from include path /opt/rocm-ver/include/ and prefix with hip."
#include "hip/hip_runtime.h"
The wrapper header files’ backward compatibility deprecation is as follows:
#pragma message announcing deprecation – ROCm v5.2 release
#pragma message changed to #warning – Future release
#warning changed to #error – Future release
Backward compatibility wrappers removed – Future release
Library files#
Library files are available in the /opt/rocm-xxx/lib folder. For backward compatibility, the old library location (/opt/rocm-xxx/<component>/lib) has a soft link to the library at the new location.
Example:
$ ls -l /opt/rocm/hip/lib/
total 4
drwxr-xr-x 4 root root 4096 May 12 10:45 cmake
lrwxrwxrwx 1 root root 24 May 10 23:32 libamdhip64.so -> ../../lib/libamdhip64.so
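To see where the relative soft link in the listing above ends up, the following sketch resolves a link target lexically against the directory that holds the link. `resolve_link` is a hypothetical helper written for this illustration; it only manipulates strings, assumes an absolute link directory, and does not touch the file system.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: lexically resolve a relative symlink target against
// the absolute directory containing the link ("." and ".." are collapsed).
std::string resolve_link(const std::string& link_dir, const std::string& target) {
    const std::string joined = link_dir + "/" + target;
    std::vector<std::string> parts;
    std::string segment;
    for (std::size_t i = 0; i <= joined.size(); ++i) {
        if (i == joined.size() || joined[i] == '/') {
            if (segment == "..") {
                if (!parts.empty()) parts.pop_back();  // step up one directory
            } else if (!segment.empty() && segment != ".") {
                parts.push_back(segment);              // ordinary path component
            }
            segment.clear();
        } else {
            segment += joined[i];
        }
    }
    std::string result;
    for (const std::string& p : parts) result += "/" + p;
    return result.empty() ? "/" : result;
}
```

Called with `/opt/rocm/hip/lib` and `../../lib/libamdhip64.so`, it yields `/opt/rocm/lib/libamdhip64.so`, which is the new library location.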
CMake config files#
All CMake configuration files are available in the /opt/rocm-xxx/lib/cmake/<component> folder. For backward compatibility, the old CMake locations (/opt/rocm-xxx/<component>/lib/cmake) consist of a soft link to the new CMake config.
Example:
$ ls -l /opt/rocm/hip/lib/cmake/hip/
total 0
lrwxrwxrwx 1 root root 42 May 10 23:32 hip-config.cmake -> ../../../../lib/cmake/hip/hip-config.cmake
ROCm support for Code Object V3 deprecated#
Support for Code Object v3 is deprecated and will be removed in a future release.
Comgr V3.0 changes#
The following APIs and macros have been marked as deprecated. They are expected to be removed in a future ROCm release that coincides with the release of Comgr v3.0.
API changes#
amd_comgr_action_info_set_options()
amd_comgr_action_info_get_options()
Actions and data types#
AMD_COMGR_ACTION_ADD_DEVICE_LIBRARIES
AMD_COMGR_ACTION_COMPILE_SOURCE_TO_FATBIN
For replacements, see the AMD_COMGR_ACTION_INFO_GET/SET_OPTION_LIST APIs, and the AMD_COMGR_ACTION_COMPILE_SOURCE_(WITH_DEVICE_LIBS)_TO_BC macros.
Deprecated environment variables#
The following environment variables are removed in this ROCm release:
GPU_MAX_COMMAND_QUEUES
GPU_MAX_WORKGROUP_SIZE_2D_X
GPU_MAX_WORKGROUP_SIZE_2D_Y
GPU_MAX_WORKGROUP_SIZE_3D_X
GPU_MAX_WORKGROUP_SIZE_3D_Y
GPU_MAX_WORKGROUP_SIZE_3D_Z
GPU_BLIT_ENGINE_TYPE
GPU_USE_SYNC_OBJECTS
AMD_OCL_SC_LIB
AMD_OCL_ENABLE_MESSAGE_BOX
GPU_FORCE_64BIT_PTR
GPU_FORCE_OCL20_32BIT
GPU_RAW_TIMESTAMP
GPU_SELECT_COMPUTE_RINGS_ID
GPU_USE_SINGLE_SCRATCH
GPU_ENABLE_LARGE_ALLOCATION
HSA_LOCAL_MEMORY_ENABLE
HSA_ENABLE_COARSE_GRAIN_SVM
GPU_IFH_MODE
OCL_SYSMEM_REQUIREMENT
OCL_CODE_CACHE_ENABLE
OCL_CODE_CACHE_RESET
Known issues in this release#
The following are the known issues in this release.
DISTRIBUTED/TEST_DISTRIBUTED_SPAWN fails#
When user applications call ncclCommAbort
to destruct communicators and then create new
communicators repeatedly, subsequent communicators may fail to initialize.
This issue is under investigation and will be resolved in a future release.
Library changes in ROCm 5.5.0#
Library | Version |
---|---|
AMDMIGraphX | |
hipBLAS | 0.53.0 ⇒ 0.54.0 |
hipBLASLt | |
hipCUB | 2.13.0 ⇒ 2.13.1 |
hipFFT | 1.0.10 ⇒ 1.0.11 |
hipRAND | |
hipSOLVER | 1.6.0 ⇒ 1.7.0 |
hipSPARSE | 2.3.3 ⇒ 2.3.5 |
MIOpen | |
MIVisionX | |
rccl | 2.13.4 ⇒ 2.15.5 |
rocALUTION | 2.1.3 ⇒ 2.1.8 |
rocBLAS | 2.46.0 ⇒ 2.47.0 |
rocFFT | 1.0.21 ⇒ 1.0.22 |
rocm-cmake | 0.8.0 ⇒ 0.8.1 |
rocPRIM | 2.12.0 ⇒ 2.13.0 |
rocRAND | 2.10.16 ⇒ 2.10.17 |
rocSOLVER | 3.20.0 ⇒ 3.21.0 |
rocSPARSE | 2.4.0 ⇒ 2.5.1 |
rocThrust | |
rocWMMA | 0.9 ⇒ 1.0 |
Tensile | 4.35.0 ⇒ 4.36.0 |
AMDMIGraphX 2.5#
MIGraphX 2.5 for ROCm 5.5.0
Added#
Y-Model feature to store tuning information with the optimized model
Added Python 3.10 bindings
Accuracy checker tool based on ONNX Runtime
ONNX Operators parse_split, and Trilu
Build support for ROCm MLIR
Added migraphx-driver flag to print optimizations in python (--python)
Added JIT implementation of the Gather and Pad operator which results in better handling of larger tensor sizes.
Optimizations#
Improved performance of Transformer based models
Improved performance of the Pad, Concat, Gather, and Pointwise operators
Improved onnx/pb file loading speed
Added general optimize pass which runs several passes such as simplify_reshapes/algebra and DCE in loop.
Fixed#
Improved parsing Tensorflow Protobuf files
Resolved various accuracy issues with some onnx models
Resolved a gcc-12 issue with mivisionx
Improved support for larger sized models and batches
Use --offload-arch instead of --cuda-gpu-arch for the HIP compiler
Changes inside JIT to use float accumulator for large reduce ops of half type to avoid overflow.
Changes inside JIT to temporarily use cosine to compute sine function.
Changed#
Changed version/location of 3rd party build dependencies to pick up fixes
hipBLAS 0.54.0#
hipBLAS 0.54.0 for ROCm 5.5.0
Added#
added option to opt-in to use __half for hipblasHalf type in the API for c++ users who define HIPBLAS_USE_HIP_HALF
added scripts to plot performance for multiple functions
data driven hipblas-bench and hipblas-test execution via external yaml format data files
client smoke test added for quick validation using command hipblas-test --yaml hipblas_smoke.yaml
Fixed#
fixed datatype conversion functions to support more rocBLAS/cuBLAS datatypes
fixed geqrf to return successfully when nullptrs are passed in with n == 0 || m == 0
fixed getrs to return successfully when given nullptrs with corresponding size = 0
fixed getrs to give info = -1 when transpose is not an expected type
fixed gels to return successfully when given nullptrs with corresponding size = 0
fixed gels to give info = -1 when transpose is not in (‘N’, ‘T’) for real cases or not in (‘N’, ‘C’) for complex cases
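The transpose check described in the last gels item can be modelled as below. This is an illustrative Python sketch rather than hipBLAS code: `gels_trans_info` is a hypothetical helper, and the single-letter codes stand in for the library's operation enum values.

```python
def gels_trans_info(trans: str, is_complex: bool) -> int:
    """Return 0 when `trans` is accepted for gels, or -1 when it is not.

    Real precisions accept 'N' and 'T'; complex precisions accept 'N' and 'C'.
    """
    allowed = ("N", "C") if is_complex else ("N", "T")
    return 0 if trans in allowed else -1


print(gels_trans_info("T", is_complex=False))  # 0: transpose is valid for real types
print(gels_trans_info("T", is_complex=True))   # -1: complex types expect 'N' or 'C'
```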
Changed#
changed reference code for Windows to OpenBLAS
hipblas client executables all now begin with hipblas- prefix
hipBLASLt 0.1.0#
hipBLASLt 0.1.0 for ROCm 5.5.0
Added#
Enable hipBLASLt APIs
Support gfx90a
Support problem type: fp32, fp16, bf16
Support activation: relu, gelu
Support bias vector
Support Scale D vector
Integrate with tensilelite kernel generator
Add Gtest: hipblaslt-test
Add full function tool: hipblaslt-bench
Add sample app: example_hipblaslt_preference
Optimizations#
Gridbase solution search algorithm for untuned size
Tune 10k sizes for each problem type
hipCUB 2.13.1#
hipCUB 2.13.1 for ROCm 5.5.0
Added#
Benchmarks for BlockShuffle, BlockLoad, and BlockStore.
Changed#
CUB backend references CUB and Thrust version 1.17.2.
Improved benchmark coverage of BlockScan by adding ExclusiveScan, benchmark coverage of BlockRadixSort by adding SortBlockedToStriped, and benchmark coverage of WarpScan by adding Broadcast.
Fixed#
Windows HIP SDK support
Known Issues#
BlockRadixRankMatch is currently broken under the rocPRIM backend.
BlockRadixRankMatch with a warp size that does not exactly divide the block size is broken under the CUB backend.
hipFFT 1.0.11#
hipFFT 1.0.11 for ROCm 5.5.0
Fixed#
Fixed old version rocm include/lib folders not removed on upgrade.
hipRAND 2.10.16#
hipRAND 2.10.16 for ROCm 5.5.0
Added#
rocRAND backend support for Sobol 64, Scrambled Sobol 32 and 64, and MT19937.
hiprandGenerateLongLong for generating 64-bit uniformly distributed integers with Sobol 64 and Scrambled Sobol 64.
Changed#
Python 2.7 is no longer officially supported.
hipSOLVER 1.7.0#
hipSOLVER 1.7.0 for ROCm 5.5.0
Added#
Added functions
gesvdj
hipsolverSgesvdj_bufferSize, hipsolverDgesvdj_bufferSize, hipsolverCgesvdj_bufferSize, hipsolverZgesvdj_bufferSize
hipsolverSgesvdj, hipsolverDgesvdj, hipsolverCgesvdj, hipsolverZgesvdj
gesvdjBatched
hipsolverSgesvdjBatched_bufferSize, hipsolverDgesvdjBatched_bufferSize, hipsolverCgesvdjBatched_bufferSize, hipsolverZgesvdjBatched_bufferSize
hipsolverSgesvdjBatched, hipsolverDgesvdjBatched, hipsolverCgesvdjBatched, hipsolverZgesvdjBatched
hipSPARSE 2.3.5#
hipSPARSE 2.3.5 for ROCm 5.5.0
Improved#
Fixed an issue where the rocm folder was not removed on upgrade of meta packages
Fixed a compilation issue with cusparse backend
Added more detailed messages on unit test failures due to missing input data
Improved documentation
Fixed a bug with deprecation messages when using gcc9 (Thanks @Maetveis)
MIOpen 2.19.0#
MIOpen 2.19.0 for ROCm 5.5.0
Added#
ROCm 5.5 support for gfx1101 (Navi32)
Changed#
Tuning results for MLIR on ROCm 5.5
Bumping MLIR commit to 5.5.0 release tag
Fixed#
Fix 3d convolution Host API bug
[HOTFIX][MI200][FP16] Disabled ConvHipImplicitGemmBwdXdlops when FP16_ALT is required.
rccl 2.15.5#
RCCL 2.15.5 for ROCm 5.5.0
Changed#
Compatibility with NCCL 2.15.5
Unit test executable renamed to rccl-UnitTests
Added#
HW-topology aware binary tree implementation
Experimental support for MSCCL
New unit tests for hipGraph support
NPKit integration
Fixed#
rocm-smi ID conversion
Support for HIP_VISIBLE_DEVICES for unit tests
Support for p2p transfers to non (HIP) visible devices
Removed#
Removed TransferBench from tools. Exists in standalone repo: https://github.com/ROCmSoftwarePlatform/TransferBench
rocALUTION 2.1.8#
rocALUTION 2.1.8 for ROCm 5.5.0
Added#
Added build support for Navi32
Improved#
Fixed a typo in MPI backend
Fixed a bug with the backend when HIP support is disabled
Fixed a bug in SAAMG hierarchy building on HIP backend
Improved SAAMG hierarchy build performance on HIP backend
Changed#
LocalVector::GetIndexValues(ValueType*) is deprecated, use LocalVector::GetIndexValues(const LocalVector&, LocalVector*) instead
LocalVector::SetIndexValues(const ValueType*) is deprecated, use LocalVector::SetIndexValues(const LocalVector&, const LocalVector&) instead
LocalMatrix::RSDirectInterpolation(const LocalVector&, const LocalVector&, LocalMatrix*, LocalMatrix*) is deprecated, use LocalMatrix::RSDirectInterpolation(const LocalVector&, const LocalVector&, LocalMatrix*) instead
LocalMatrix::RSExtPIInterpolation(const LocalVector&, const LocalVector&, bool, float, LocalMatrix*, LocalMatrix*) is deprecated, use LocalMatrix::RSExtPIInterpolation(const LocalVector&, const LocalVector&, bool, LocalMatrix*) instead
LocalMatrix::RugeStueben() is deprecated
LocalMatrix::AMGSmoothedAggregation(ValueType, const LocalVector&, const LocalVector&, LocalMatrix*, LocalMatrix*, int) is deprecated, use LocalMatrix::AMGAggregation(ValueType, const LocalVector&, const LocalVector&, LocalMatrix*, int) instead
LocalMatrix::AMGAggregation(const LocalVector&, LocalMatrix*, LocalMatrix*) is deprecated, use LocalMatrix::AMGAggregation(const LocalVector&, LocalMatrix*) instead
rocBLAS 2.47.0#
rocBLAS 2.47.0 for ROCm 5.5.0
Added#
added functionality rocblas_geam_ex for matrix-matrix minimum operations
added HIP Graph support as beta feature for rocBLAS Level 1, Level 2, and Level 3(pointer mode host) functions
added beta features API. Exposed using compiler define ROCBLAS_BETA_FEATURES_API
added support for vector initialization in the rocBLAS test framework with negative increments
added windows build documentation for forthcoming support using ROCm HIP SDK
added scripts to plot performance for multiple functions
Optimizations#
improved performance of Level 2 rocBLAS GEMV for float and double precision. Performance enhanced by 150-200% for certain problem sizes when (m==n) measured on a gfx90a GPU.
improved performance of Level 2 rocBLAS GER for float, double and complex float precisions. Performance enhanced by 5-7% for certain problem sizes measured on a gfx90a GPU.
improved performance of Level 2 rocBLAS SYMV for float and double precisions. Performance enhanced by 120-150% for certain problem sizes measured on both gfx908 and gfx90a GPUs.
Fixed#
fixed setting of executable mode on client script rocblas_gentest.py to avoid potential permission errors with clients rocblas-test and rocblas-bench
fixed deprecated API compatibility with Visual Studio compiler
fixed test framework memory exception handling for Level 2 functions when the host memory allocation exceeds the available memory
Changed#
install.sh internally runs rmake.py (also used on windows) and rmake.py may be used directly by developers on linux (use --help)
rocblas client executables all now begin with rocblas- prefix
Removed#
install.sh removed options -o --cov as now Tensile will use the default COV format, set by cmake define Tensile_CODE_OBJECT_VERSION=default
rocFFT 1.0.22#
rocFFT 1.0.22 for ROCm 5.5.0
Optimizations#
Improved performance of 1D lengths < 2048 that use Bluestein’s algorithm.
Reduced time for generating code during plan creation.
Optimized 3D R2C/C2R lengths 32, 84, 128.
Optimized batched small 1D R2C/C2R cases.
Added#
Added gfx1101 to default AMDGPU_TARGETS.
Changed#
Moved client programs to C++17.
Moved planar kernels and infrequently used Stockham kernels to be runtime-compiled.
Moved transpose, real-complex, Bluestein, and Stockham kernels to library kernel cache.
Fixed#
Removed zero-length twiddle table allocations, which fixes errors from hipMallocManaged.
Fixed incorrect freeing of HIP stream handles during twiddle computation when multiple devices are present.
rocm-cmake 0.8.1#
rocm-cmake 0.8.1 for ROCm 5.5.0
Fixed#
ROCMInstallTargets: Added compatibility symlinks for included cmake files in <ROCM>/lib/cmake/<PACKAGE>.
Changed#
ROCMHeaderWrapper: The wrapper header deprecation message is now a deprecation warning.
rocPRIM 2.13.0#
rocPRIM 2.13.0 for ROCm 5.5.0
Added#
New block level radix_rank primitive.
New block level radix_rank_match primitive.
Changed#
Improved the performance of block_radix_sort and device_radix_sort.
Known Issues#
Disabled GPU error messages relating to incorrect warp operation usage with Navi GPUs on Windows, due to GPU printf performance issues on Windows.
Fixed#
Fixed benchmark build on Windows
rocRAND 2.10.17#
rocRAND 2.10.17 for ROCm 5.5.0
Added#
MT19937 pseudo random number generator based on M. Matsumoto and T. Nishimura, 1998, Mersenne Twister: A 623-dimensionally equidistributed uniform pseudorandom number generator.
New benchmark for the device API using Google Benchmark, benchmark_rocrand_device_api, replacing benchmark_rocrand_kernel. benchmark_rocrand_kernel is deprecated and will be removed in a future version. Likewise, benchmark_curand_host_api is added to replace benchmark_curand_generate and benchmark_curand_device_api is added to replace benchmark_curand_kernel.
Experimental HIP-CPU feature
ThreeFry pseudorandom number generator based on Salmon et al., 2011, “Parallel random numbers: as easy as 1, 2, 3”.
Changed#
Python 2.7 is no longer officially supported.
Fixed#
Windows HIP SDK support
rocSOLVER 3.21.0#
rocSOLVER 3.21.0 for ROCm 5.5.0
Added#
SVD for general matrices using Jacobi algorithm:
GESVDJ (with batched and strided_batched versions)
LU factorization without pivoting for block tridiagonal matrices:
GEBLTTRF_NPVT (with batched and strided_batched versions)
Linear system solver without pivoting for block tridiagonal matrices:
GEBLTTRS_NPVT (with batched and strided_batched versions)
Product of triangular matrices
LAUUM
Added experimental hipGraph support for rocSOLVER functions
Optimized#
Improved the performance of SYEVJ/HEEVJ.
Changed#
STEDC, SYEVD/HEEVD and SYGVD/HEGVD now use fully implemented Divide and Conquer approach.
Fixed#
SYEVJ/HEEVJ should now be invariant under matrix scaling.
SYEVJ/HEEVJ should now properly output the eigenvalues when no sweeps are executed.
Fixed GETF2_NPVT and GETRF_NPVT input data initialization in tests and benchmarks.
Fixed rocblas missing from the dependency list of the rocsolver deb and rpm packages.
rocSPARSE 2.5.1#
rocSPARSE 2.5.1 for ROCm 5.5.0
Added#
Added bsrgemm and spgemm for BSR format
Added bsrgeam
Added build support for Navi32
Added experimental hipGraph support for some rocSPARSE routines
Added csritsv, spitsv csr iterative triangular solve
Added mixed precisions for SpMV
Added batched SpMM for transpose A in COO format with atomic algorithm
Improved#
Optimization to csr2bsr
Optimization to csr2csr_compress
Optimization to csr2coo
Optimization to gebsr2csr
Optimization to csr2gebsr
Fixes to documentation
Fixes a bug in COO SpMV gridsize
Fixes a bug in SpMM gridsize when using very large matrices
Known Issues#
In csritlu0, the algorithm rocsparse_itilu0_alg_sync_split_fusion has some accuracy issues to investigate with XNACK enabled. The fallback is rocsparse_itilu0_alg_sync_split.
rocWMMA 1.0#
rocWMMA 1.0 for ROCm 5.5.0
Added#
Added support for wave32 on gfx11+
Added infrastructure changes to support hipRTC
Added performance tracking system
Changed#
Modified the assignment of hardware information
Modified the data access for unsigned datatypes
Added library config to support multiple architectures
Tensile 4.36.0#
Tensile 4.36.0 for ROCm 5.5.0
Added#
Add functions for user-driven tuning
Add GFX11 support: HostLibraryTests yamls, rearrange FP32(C)/FP64(C) instruction order, archCaps for instruction renaming condition, adjust vgpr bank for A/B/C for optimize, separate vscnt and vmcnt, dual mac
Add binary search for Grid-Based algorithm
Add reject condition for (StoreCInUnroll + BufferStore=0) and (DirectToVgpr + ScheduleIterAlg<3 + PrefetchGlobalRead==2)
Add support for (DirectToLds + hgemm + NN/NT/TT) and (DirectToLds + hgemm + GlobalLoadVectorWidth < 4)
Add support for (DirectToLds + hgemm(TLU=True only) or sgemm + NumLoadsCoalesced > 1)
Add GSU SingleBuffer algorithm for HSS/BSS
Add gfx900:xnack-, gfx1032, gfx1034, gfx1035
Enable gfx1031 support
Optimizations#
Use AssertSizeLessThan for BufferStoreOffsetLimitCheck if it is smaller than MT1
Improve InitAccVgprOpt
Changed#
Use global_atomic for GSU instead of flat and global_store for debug code
Replace flat_load/store with global_load/store
Use global_load/store for BufferLoad/Store=0 and enable scheduling
LocalSplitU support for HGEMM+HPA when MFMA disabled
Update Code Object Version
Type cast local memory to COMPUTE_DATA_TYPE in LDS to avoid precision loss
Update asm cap cache arguments
Unify SplitGlobalRead into ThreadSeparateGlobalRead and remove SplitGlobalRead
Change checks, error messages, assembly syntax, and coverage for DirectToLds
Remove unused cmake file
Clean up the LLVM dependency code
Update ThreadSeparateGlobalRead test cases for PrefetchGlobalRead=2
Update sgemm/hgemm test cases for DirectToLds and ThreadSeparateGlobalRead
Fixed#
Add build-id to header of compiled source kernels
Fix solution index collisions
Fix h beta vectorwidth4 correctness issue for WMMA
Fix an error with BufferStore=0
Fix mismatch issue with (StoreCInUnroll + PrefetchGlobalRead=2)
Fix MoveMIoutToArch bug
Fix flat load correctness issue on I8 and flat store correctness issue
Fix mismatch issue with BufferLoad=0 + TailLoop for large array sizes
Fix code generation error with BufferStore=0 and StoreCInUnrollPostLoop
Fix issues with DirectToVgpr + ScheduleIterAlg<3
Fix mismatch issue with DGEMM TT + LocalReadVectorWidth=2
Fix mismatch issue with PrefetchGlobalRead=2
Fix mismatch issue with DirectToVgpr + PrefetchGlobalRead=2 + small tile size
Fix an error with PersistentKernel=0 + PrefetchAcrossPersistent=1 + PrefetchAcrossPersistentMode=1
Fix mismatch issue with DirectToVgpr + DirectToLds + only 1 iteration in unroll loop case
Remove duplicate GSU kernels: for GSU = 1, GSUAlgorithm SingleBuffer and MultipleBuffer kernels are identical
Fix for failing CI tests due to CpuThreads=0
Fix mismatch issue with DirectToLds + PrefetchGlobalRead=2
Remove the reject condition for ThreadSeparateGlobalRead and DirectToLds (HGEMM, SGEMM only)
Modify reject condition for minimum lanes of ThreadSeparateGlobalRead (SGEMM or larger data type only)
ROCm 5.4.3#
Deprecations and warnings#
HIP Perl scripts deprecation#
The hipcc and hipconfig Perl scripts are deprecated. In a future release, compiled binaries will be
available as hipcc.bin and hipconfig.bin as replacements for the Perl scripts.
Note
There will be a transition period where the Perl scripts and compiled binaries are available before the
scripts are removed. There will be no functional difference between the Perl scripts and their compiled
binary counterpart. No user action is required. Once these are available, users can optionally switch to
hipcc.bin and hipconfig.bin. The hipcc/hipconfig soft link will be assimilated to point from
hipcc/hipconfig to the respective compiled binaries as the default option.
Linux file system hierarchy standard for ROCm#
ROCm packages have adopted the Linux foundation file system hierarchy standard in this release to ensure ROCm components follow open source conventions for Linux-based distributions. While moving to a new file system hierarchy, ROCm ensures backward compatibility with its 5.1 version or older file system hierarchy. See below for a detailed explanation of the new file system hierarchy and backward compatibility.
New file system hierarchy#
The following is the new file system hierarchy:
/opt/rocm-<ver>
| --bin
| --All externally exposed Binaries
| --libexec
| --<component>
| -- Component specific private non-ISA executables (architecture independent)
| --include
| -- <component>
| --<header files>
| --lib
| --lib<soname>.so -> lib<soname>.so.major -> lib<soname>.so.major.minor.patch
(public libraries linked with application)
| --<component> (component specific private library, executable data)
| --<cmake>
| --components
| --<component>.config.cmake
| --share
| --html/<component>/*.html
| --info/<component>/*.[pdf, md, txt]
| --man
| --doc
| --<component>
| --<licenses>
| --<component>
| --<misc files> (arch independent non-executable)
| --samples
Note
ROCm will not support backward compatibility with the v5.1(old) file system hierarchy in its next major release.
For more information, refer to https://refspecs.linuxfoundation.org/fhs.shtml.
Backward compatibility with older file systems#
ROCm has moved header files and libraries to its new location as indicated in the above structure and included symbolic-link and wrapper header files in its old location for backward compatibility.
Note
ROCm will continue supporting backward compatibility until the next major release.
Wrapper header files#
Wrapper header files are placed in the old location (/opt/rocm-xxx/<component>/include) with a
warning message to include files from the new location (/opt/rocm-xxx/include) as shown in the
example below:
// Code snippet from hip_runtime.h
#pragma message("This file is deprecated. Use file from include path /opt/rocm-ver/include/ and prefix with hip")
#include "hip/hip_runtime.h"
The wrapper header files’ backward compatibility deprecation is as follows:
#pragma message announcing deprecation – ROCm v5.2 release
#pragma message changed to #warning – Future release
#warning changed to #error – Future release
Backward compatibility wrappers removed – Future release
Library files#
Library files are available in the /opt/rocm-xxx/lib folder. For backward compatibility, the old library
location (/opt/rocm-xxx/<component>/lib) has a soft link to the library at the new location.
Example:
$ ls -l /opt/rocm/hip/lib/
total 4
drwxr-xr-x 4 root root 4096 May 12 10:45 cmake
lrwxrwxrwx 1 root root 24 May 10 23:32 libamdhip64.so -> ../../lib/libamdhip64.so
CMake config files#
All CMake configuration files are available in the /opt/rocm-xxx/lib/cmake/<component> folder. For
backward compatibility, the old CMake locations (/opt/rocm-xxx/<component>/lib/cmake) consist of
a soft link to the new CMake config.
Example:
$ ls -l /opt/rocm/hip/lib/cmake/hip/
total 0
lrwxrwxrwx 1 root root 42 May 10 23:32 hip-config.cmake -> ../../../../lib/cmake/hip/hip-config.cmake
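To see where the compatibility links in the two listings above point, the relative targets can be resolved against the directory that holds each link. The sketch below is purely lexical (it does not touch the file system or follow real symlinks), and `resolve_link` is a hypothetical helper; the paths are the ones shown in the examples.

```python
import posixpath


def resolve_link(link_path: str, target: str) -> str:
    """Resolve a relative symlink target against the directory that holds the link."""
    return posixpath.normpath(posixpath.join(posixpath.dirname(link_path), target))


# Old library location -> new library location
print(resolve_link("/opt/rocm/hip/lib/libamdhip64.so", "../../lib/libamdhip64.so"))
# /opt/rocm/lib/libamdhip64.so

# Old CMake config location -> new CMake config location
print(resolve_link("/opt/rocm/hip/lib/cmake/hip/hip-config.cmake",
                   "../../../../lib/cmake/hip/hip-config.cmake"))
# /opt/rocm/lib/cmake/hip/hip-config.cmake
```

On an installed system, `os.path.realpath` would give the fully resolved on-disk path, including the `/opt/rocm` link itself if it is a symlink.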
Defect fixes#
Compiler improvements#
In ROCm v5.4.3, improvements to the compiler address errors with the following signatures:
“error: unhandled SGPR spill to memory”
“cannot scavenge register without an emergency spill slot!”
“error: ran out of registers during register allocation”
Known issues#
Compiler option error at runtime#
Some users may encounter a “Cannot find Symbol” error at runtime when using -save-temps. While
most -save-temps use cases work correctly, this error may appear occasionally.
This issue is under investigation, and the known workaround is not to use -save-temps when the error
appears.
Library changes in ROCm 5.4.3#
Library | Version |
---|---|
hipBLAS | |
hipCUB | |
hipFFT | |
hipSOLVER | |
hipSPARSE | |
MIVisionX | |
rccl | |
rocALUTION | |
rocBLAS | |
rocFFT | 1.0.20 ⇒ 1.0.21 |
rocm-cmake | |
rocPRIM | |
rocRAND | |
rocSOLVER | |
rocSPARSE | |
rocThrust | |
rocWMMA | |
Tensile | |
rocFFT 1.0.21#
rocFFT 1.0.21 for ROCm 5.4.3
Fixed#
Removed source directory from rocm_install_targets call to prevent installation of rocfft.h in an unintended location.
ROCm 5.4.2#
Deprecations and warnings#
HIP Perl scripts deprecation#
The hipcc and hipconfig Perl scripts are deprecated. In a future release, compiled binaries will be
available as hipcc.bin and hipconfig.bin as replacements for the Perl scripts.
Note
There will be a transition period where the Perl scripts and compiled binaries are available before the
scripts are removed. There will be no functional difference between the Perl scripts and their compiled
binary counterpart. No user action is required. Once these are available, users can optionally switch to
hipcc.bin and hipconfig.bin. The hipcc/hipconfig soft link will be assimilated to point from
hipcc/hipconfig to the respective compiled binaries as the default option.
hipcc options deprecation#
The following hipcc options are being deprecated and will be removed in a future release:
The --amdgpu-target option is being deprecated, and users must use the --offload-arch option to specify the GPU architecture.
The --amdhsa-code-object-version option is being deprecated. Users can use the Clang/LLVM option -mllvm -mcode-object-version to debug issues related to code object versions.
The --hipcc-func-supp/--hipcc-no-func-supp options are being deprecated, as the function calls are already supported in production on AMD GPUs.
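For build scripts that still pass the deprecated target option, the rename can be applied mechanically. The following is a minimal sketch, assuming the option is written in the `--amdgpu-target=<arch>` form; `migrate_hipcc_args` is a hypothetical helper, not part of hipcc, and it deliberately leaves the other deprecated options untouched because they have no one-to-one replacement.

```python
def migrate_hipcc_args(args):
    """Rewrite the deprecated --amdgpu-target=<arch> option to --offload-arch=<arch>."""
    old_prefix = "--amdgpu-target="
    migrated = []
    for arg in args:
        if arg.startswith(old_prefix):
            # Keep the architecture value, swap only the option name.
            migrated.append("--offload-arch=" + arg[len(old_prefix):])
        else:
            migrated.append(arg)
    return migrated


# Hypothetical command line; gfx90a is used only as an example architecture.
print(migrate_hipcc_args(["--amdgpu-target=gfx90a", "-O3", "main.cpp"]))
# ['--offload-arch=gfx90a', '-O3', 'main.cpp']
```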
Known issues#
Under certain circumstances typified by high register pressure, users may encounter a compiler abort with one of the following error messages:
error: unhandled SGPR spill to memory
cannot scavenge register without an emergency spill slot!
error: ran out of registers during register allocation
This is a known issue and will be fixed in a future release.
Library changes in ROCm 5.4.2#
ROCm 5.4.1#
What’s new in this release#
HIP enhancements#
The ROCm v5.4.1 release consists of the following new HIP API:
New HIP API - hipLaunchHostFunc#
The following new HIP API is introduced in the ROCm v5.4.1 release.
Note
This is a pre-official version (beta) release of the new APIs.
hipError_t hipLaunchHostFunc(hipStream_t stream, hipHostFn_t fn, void* userData);
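This enqueues a host function call in a stream. The host function runs once the work queued in the stream before it has completed.
@param [in] stream - Stream in which to enqueue the call
@param [in] fn - Host function to call
@param [in] userData - User-specified data passed to the function
This API returns #hipSuccess, #hipErrorInvalidValue.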
For more information, refer to the HIP API documentation at /bundle/HIP_API_Guide/page/modules.html.
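As the signature indicates, hipLaunchHostFunc takes a stream, a host function, and a user-data pointer. Assuming it follows the usual stream-ordered callback semantics (the callback runs after earlier work in the same stream and before later work), the ordering can be pictured with a toy model. This is plain Python, not the HIP API; `ToyStream` and its methods are hypothetical names used only for illustration.

```python
from collections import deque


class ToyStream:
    """In-order work queue standing in for a HIP stream (illustration only)."""

    def __init__(self):
        self._queue = deque()

    def enqueue(self, work):
        """Queue a callable that represents device work."""
        self._queue.append(work)

    def launch_host_func(self, fn, user_data):
        """Queue a host callback; it receives user_data when its turn comes."""
        self._queue.append(lambda: fn(user_data))

    def synchronize(self):
        """Run everything queued so far in submission order."""
        while self._queue:
            self._queue.popleft()()


log = []
stream = ToyStream()
stream.enqueue(lambda: log.append("kernel A"))
stream.launch_host_func(lambda data: log.append("host callback: " + data), "done")
stream.enqueue(lambda: log.append("kernel B"))
stream.synchronize()
print(log)  # ['kernel A', 'host callback: done', 'kernel B']
```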
Deprecations and warnings#
HIP Perl scripts deprecation#
The hipcc and hipconfig Perl scripts are deprecated. In a future release, compiled binaries will be
available as hipcc.bin and hipconfig.bin as replacements for the Perl scripts.
Note
There will be a transition period where the Perl scripts and compiled binaries are available before the
scripts are removed. There will be no functional difference between the Perl scripts and their compiled
binary counterpart. No user action is required. Once these are available, users can optionally switch to
hipcc.bin and hipconfig.bin. The hipcc/hipconfig soft link will be assimilated to point from
hipcc/hipconfig to the respective compiled binaries as the default option.
IFWI fixes#
These defects were identified and documented as known issues in previous ROCm releases and are fixed in this release.
AMD Instinct™ MI200 firmware IFWI maintenance update #3#
This IFWI release fixes the following issue in AMD Instinct™ MI210/MI250 Accelerators.
After prolonged periods of operation, certain MI200 Instinct™ Accelerators may perform in a degraded way resulting in application failures.
In this package, AMD delivers a new firmware version for MI200 GPU accelerators and a firmware installation tool – AMD FW FLASH 1.2.
GPU | Production part number | SKU | IFWI name |
---|---|---|---|
MI210 | 113-D673XX | D67302 | D6730200V.110 |
MI210 | 113-D673XX | D67301 | D6730100V.073 |
MI250 | 113-D652XX | D65209 | D6520900.073 |
MI250 | 113-D652XX | D65210 | D6521000.073 |
Instructions on how to download and apply MI200 maintenance updates are available at:
AMD Instinct™ MI200 SRIOV virtualization support#
Maintenance update #3, combined with ROCm 5.4.1, now provides SRIOV virtualization support for all AMD Instinct™ MI200 devices.
Library changes in ROCm 5.4.1#
Library | Version |
---|---|
hipBLAS | |
hipCUB | |
hipFFT | |
hipSOLVER | |
hipSPARSE | |
MIVisionX | |
rccl | |
rocALUTION | |
rocBLAS | |
rocFFT | 1.0.19 ⇒ 1.0.20 |
rocm-cmake | |
rocPRIM | |
rocRAND | |
rocSOLVER | |
rocSPARSE | |
rocThrust | |
rocWMMA | |
Tensile | |
rocFFT 1.0.20#
rocFFT 1.0.20 for ROCm 5.4.1
Fixed#
Fixed incorrect results on strided large 1D FFTs where batch size does not equal the stride.
ROCm 5.4.0#
What’s new in this release#
HIP enhancements#
The ROCm v5.4 release consists of the following HIP enhancements:
Support for wall_clock64#
A new timer function wall_clock64() is supported, which returns wall clock count at a constant frequency on the device.
long long int wall_clock64();
The count advances at a constant frequency, which the HIP application code can query through the HIP API with the hipDeviceAttributeWallClockRate attribute of the device.
Example:
int wallClkRate = 0; //in kilohertz
HIPCHECK(hipDeviceGetAttribute(&wallClkRate, hipDeviceAttributeWallClockRate, deviceId));
Where hipDeviceAttributeWallClockRate is a device attribute.
Note
The wall clock frequency is a per-device attribute.
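Because the rate is reported in kilohertz, it is numerically the number of ticks per millisecond, so two wall_clock64() readings can be converted to elapsed time by a single division. The sketch below shows only that arithmetic on the host side; `ticks_to_ms` is a hypothetical helper and the tick values and rate are made-up numbers, not measurements.

```python
def ticks_to_ms(start_ticks: int, end_ticks: int, wall_clock_rate_khz: int) -> float:
    """Convert a wall_clock64() tick delta to milliseconds.

    A rate in kHz equals ticks per millisecond, so ticks / rate_khz gives ms.
    """
    if wall_clock_rate_khz <= 0:
        raise ValueError("wall clock rate must be positive")
    return (end_ticks - start_ticks) / wall_clock_rate_khz


# Hypothetical readings: 2,500,000 ticks at an assumed 100,000 kHz (100 MHz) rate.
print(ticks_to_ms(1_000, 2_501_000, 100_000))  # 25.0
```

Since the frequency is a per-device attribute, the rate should be queried for the same device that produced the tick readings.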
New registry added for GPU_MAX_HW_QUEUES#
The GPU_MAX_HW_QUEUES registry defines the maximum number of independent hardware queues allocated per process per device.
The environment variable controls how many independent hardware queues HIP runtime can create per process, per device. If the application allocates more HIP streams than this number, then the HIP runtime reuses the same hardware queues for the new streams in a round-robin manner.
Note
This maximum number does not apply to hardware queues created for CU-masked HIP streams or cooperative queues for HIP Cooperative Groups (there is only one queue per device).
For more details, refer to the HIP Programming Guide.
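The round-robin reuse described above can be pictured with a simplified model. This is not the HIP runtime's implementation: `assign_hw_queues` is a hypothetical helper, and the exact order in which the runtime hands out queues is an assumption made for illustration.

```python
def assign_hw_queues(num_streams: int, max_hw_queues: int):
    """Map each stream index to a hardware queue index, reusing queues round-robin."""
    if max_hw_queues <= 0:
        raise ValueError("max_hw_queues must be positive")
    return [stream % max_hw_queues for stream in range(num_streams)]


# Six streams with a limit of four hardware queues (example value only):
# the fifth and sixth streams share queues 0 and 1 with the first two.
print(assign_hw_queues(6, 4))  # [0, 1, 2, 3, 0, 1]
```

Streams that land on the same hardware queue are serialized with respect to each other, so lowering the limit can reduce concurrency for applications that create many streams.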
New HIP APIs in this release#
The following new HIP APIs are available in the ROCm v5.4 release.
Note
This is a pre-official version (beta) release of the new APIs.
Error handling#
hipError_t hipDrvGetErrorName(hipError_t hipError, const char** errorString);
This returns HIP errors in the text string format.
hipError_t hipDrvGetErrorString(hipError_t hipError, const char** errorString);
This returns text string messages with more details about the error.
For more information, refer to the HIP API Guide.
HIP tests source separation#
With ROCm v5.4, a separate GitHub project is created at
This contains HIP catch2 tests and samples, and new tests will continue to be developed there.
In future ROCm releases, catch2 tests and samples will be removed from the HIP project.
OpenMP enhancements#
This release consists of the following OpenMP enhancements:
Enable new device RTL in libomptarget as default.
New flag -fopenmp-target-fast to imply -fopenmp-target-ignore-env-vars -fopenmp-assume-no-thread-state -fopenmp-assume-no-nested-parallelism.
Support for the collapse clause and non-unit stride in cases where the no-loop specialized kernel is generated.
Initial implementation of optimized cross-team sum reduction for float and double type scalars.
Pool-based optimization in the OpenMP runtime to reduce locking during data transfer.
Deprecations and warnings#
HIP Perl scripts deprecation#
The hipcc and hipconfig Perl scripts are deprecated. In a future release, compiled binaries will be
available as hipcc.bin and hipconfig.bin as replacements for the Perl scripts.
Note
There will be a transition period where the Perl scripts and compiled binaries are available before the
scripts are removed. There will be no functional difference between the Perl scripts and their compiled
binary counterpart. No user action is required. Once these are available, users can optionally switch to
hipcc.bin and hipconfig.bin. The hipcc/hipconfig soft link will be assimilated to point from
hipcc/hipconfig to the respective compiled binaries as the default option.
Linux file system hierarchy standard for ROCm#
ROCm packages have adopted the Linux foundation file system hierarchy standard in this release to ensure ROCm components follow open source conventions for Linux-based distributions. While moving to a new file system hierarchy, ROCm ensures backward compatibility with its 5.1 version or older file system hierarchy. See below for a detailed explanation of the new file system hierarchy and backward compatibility.
New file system hierarchy#
The following is the new file system hierarchy:
/opt/rocm-<ver>
| --bin
| --All externally exposed Binaries
| --libexec
| --<component>
| -- Component specific private non-ISA executables (architecture independent)
| --include
| -- <component>
| --<header files>
| --lib
| --lib<soname>.so -> lib<soname>.so.major -> lib<soname>.so.major.minor.patch
(public libraries linked with application)
| --<component> (component specific private library, executable data)
| --<cmake>
| --components
| --<component>.config.cmake
| --share
| --html/<component>/*.html
| --info/<component>/*.[pdf, md, txt]
| --man
| --doc
| --<component>
| --<licenses>
| --<component>
| --<misc files> (arch independent non-executable)
| --samples
Note
ROCm will not support backward compatibility with the v5.1(old) file system hierarchy in its next major release.
For more information, refer to https://refspecs.linuxfoundation.org/fhs.shtml.
Backward compatibility with older file systems#
ROCm has moved header files and libraries to its new location as indicated in the above structure and included symbolic-link and wrapper header files in its old location for backward compatibility.
Note
ROCm will continue supporting backward compatibility until the next major release.
Wrapper header files#
Wrapper header files are placed in the old location (/opt/rocm-xxx/<component>/include) with a
warning message to include files from the new location (/opt/rocm-xxx/include) as shown in the
example below:
// Code snippet from hip_runtime.h
#pragma message("This file is deprecated. Use file from include path /opt/rocm-ver/include/ and prefix with hip")
#include "hip/hip_runtime.h"
The wrapper header files’ backward compatibility deprecation is as follows:
#pragma message announcing deprecation – ROCm v5.2 release
#pragma message changed to #warning – Future release
#warning changed to #error – Future release
Backward compatibility wrappers removed – Future release
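Since the wrappers are scheduled for removal, projects that reference headers through the old per-component directory may want to rewrite those paths ahead of time. The sketch below maps an old-layout header path to the new layout by dropping the component directory. `new_include_path` is a hypothetical helper, and the ROCm root and header path in the example are illustrative values.

```python
def new_include_path(rocm_root: str, path: str) -> str:
    """Map <root>/<component>/include/<rest> to <root>/include/<rest>."""
    prefix = rocm_root.rstrip("/") + "/"
    if not path.startswith(prefix):
        raise ValueError("path is not under the ROCm root")
    parts = path[len(prefix):].split("/")
    if parts[0] == "include":
        return path  # already in the new layout
    if len(parts) >= 2 and parts[1] == "include":
        return prefix + "/".join(parts[1:])  # drop the component directory
    raise ValueError("path is not an include path")


print(new_include_path("/opt/rocm-5.4.0",
                       "/opt/rocm-5.4.0/hip/include/hip/hip_runtime.h"))
# /opt/rocm-5.4.0/include/hip/hip_runtime.h
```

In source code, the equivalent change is to add the new include directory to the compiler search path and include headers with their component prefix, as the wrapper's deprecation message suggests.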
Library files#
Library files are available in the /opt/rocm-xxx/lib folder. For backward compatibility, the old library
location (/opt/rocm-xxx/<component>/lib) has a soft link to the library at the new location.
Example:
$ ls -l /opt/rocm/hip/lib/
total 4
drwxr-xr-x 4 root root 4096 May 12 10:45 cmake
lrwxrwxrwx 1 root root 24 May 10 23:32 libamdhip64.so -> ../../lib/libamdhip64.so
CMake config files#
All CMake configuration files are available in the /opt/rocm-xxx/lib/cmake/<component> folder. For
backward compatibility, the old CMake locations (/opt/rocm-xxx/<component>/lib/cmake) consist of
a soft link to the new CMake config.
Example:
$ ls -l /opt/rocm/hip/lib/cmake/hip/
total 0
lrwxrwxrwx 1 root root 42 May 10 23:32 hip-config.cmake -> ../../../../lib/cmake/hip/hip-config.cmake
Defect fixes#
The following defects are fixed in this release.
These defects were identified and documented as known issues in previous ROCm releases and are fixed in this release.
Memory allocated using hipHostMalloc() with flags didn’t exhibit fine-grain behavior#
Issue#
The test was incorrectly using the hipDeviceAttributePageableMemoryAccess device attribute to
determine coherent support.
Fix#
hipHostMalloc() allocates memory with fine-grained access by default when the environment variable
HIP_HOST_COHERENT=1 is used.
For more information, refer to HIP Runtime API Reference.
SoftHang with hipStreamWithCUMask test on AMD Instinct™#
Issue#
On GFX10 GPUs, kernel execution hangs when it is launched on streams created using
hipStreamWithCUMask.
Fix#
On GFX10 GPUs, each workgroup processor encompasses two compute units, and the compute units
must be enabled as a pair. The hipStreamWithCUMask API unit test cases are updated to set compute
unit mask (cuMask) in pairs for GFX10 GPUs.
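The pairing rule can be expressed as a simple bitmask check. The sketch below assumes that the two compute units of workgroup processor i occupy adjacent bits 2i and 2i+1 of the mask; that bit layout is an assumption made for illustration, and `paired_cu_mask` and `is_pairwise` are hypothetical helpers rather than HIP functions.

```python
def paired_cu_mask(wgp_indices):
    """Build a CU bitmask that enables both compute units of each listed WGP."""
    mask = 0
    for wgp in wgp_indices:
        mask |= 0b11 << (2 * wgp)  # set bits 2*wgp and 2*wgp + 1 together
    return mask


def is_pairwise(mask: int) -> bool:
    """Check that every (even, odd) bit pair is either fully set or fully clear."""
    while mask:
        if (mask & 0b11) not in (0b00, 0b11):
            return False
        mask >>= 2
    return True


mask = paired_cu_mask([0, 2])
print(hex(mask))            # 0x33: CUs 0,1 and 4,5 enabled
print(is_pairwise(mask))    # True
print(is_pairwise(0b0101))  # False: each pair has only one CU enabled
```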
ROCm tools GPU IDs#
The HIP language device IDs are not the same as the GPU IDs reported by the tools. GPU IDs are globally unique and guaranteed to be consistent across APIs and processes.
The GPU ID reported by ROCTracer, ROCProfiler, and other ROCm tools is the HSA Driver Node ID of that GPU, as it is a unique ID for that device in that particular node.
Library changes in ROCm 5.4.0#
Library | Version |
---|---|
hipBLAS | 0.52.0 ⇒ 0.53.0 |
hipCUB | 2.12.0 ⇒ 2.13.0 |
hipFFT | 1.0.9 ⇒ 1.0.10 |
hipSOLVER | 1.5.0 ⇒ 1.6.0 |
hipSPARSE | 2.3.1 ⇒ 2.3.3 |
MIVisionX | |
rccl | 2.12.10 ⇒ 2.13.4 |
rocALUTION | 2.1.0 ⇒ 2.1.3 |
rocBLAS | 2.45.0 ⇒ 2.46.0 |
rocFFT | 1.0.18 ⇒ 1.0.19 |
rocm-cmake | |
rocPRIM | 2.11.0 ⇒ 2.12.0 |
rocRAND | 2.10.15 ⇒ 2.10.16 |
rocSOLVER | 3.19.0 ⇒ 3.20.0 |
rocSPARSE | 2.2.0 ⇒ 2.4.0 |
rocThrust | 2.16.0 ⇒ 2.17.0 |
rocWMMA | 0.8 ⇒ 0.9 |
Tensile | 4.34.0 ⇒ 4.35.0 |
hipBLAS 0.53.0#
hipBLAS 0.53.0 for ROCm 5.4.0
Added#
Allow for selection of int8 datatype
Added support for hipblasXgels and hipblasXgelsStridedBatched operations (with s,d,c,z precisions), only supported with rocBLAS backend
Added support for hipblasXgelsBatched operations (with s,d,c,z precisions)
hipCUB 2.13.0#
hipCUB 2.13.0 for ROCm 5.4.0
Added#
CMake functionality to improve build parallelism of the test suite that splits compilation units by function or by parameters.
New overload for BlockAdjacentDifference::SubtractLeftPartialTile that takes a predecessor item.
Changed#
Improved build parallelism of the test suite by splitting up large compilation units for DeviceRadixSort, DeviceSegmentedRadixSort and DeviceSegmentedSort.
CUB backend references CUB and thrust version 1.17.1.
hipFFT 1.0.10#
hipFFT 1.0.10 for ROCm 5.4.0
Added#
Added hipfftExtPlanScaleFactor API to efficiently multiply each output element of an FFT by a given scaling factor. Result scaling must be supported in the backend FFT library.
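To illustrate what per-element result scaling means, here is a minimal pure-Python sketch of a naive discrete Fourier transform with a scale factor folded into the output. It does not use hipFFT. The function name and parameters are hypothetical, and the O(n²) loop is only for clarity. A common use of such a factor is the 1/N normalization of an inverse transform.

```python
import cmath


def dft(values, inverse=False, scale_factor=1.0):
    """Naive DFT that multiplies every output element by scale_factor."""
    n = len(values)
    sign = 1 if inverse else -1
    output = []
    for k in range(n):
        acc = 0j
        for idx, value in enumerate(values):
            acc += value * cmath.exp(sign * 2j * cmath.pi * idx * k / n)
        # Scaling is applied as the result is produced, so no separate
        # pass over the output is needed.
        output.append(acc * scale_factor)
    return output


signal = [1.0, 2.0, 3.0, 4.0]
spectrum = dft(signal)
# Scaling the inverse transform by 1/N recovers the original signal.
restored = dft(spectrum, inverse=True, scale_factor=1.0 / len(signal))
print([round(x.real, 6) for x in restored])
```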
Changed#
When hipFFT is built against the rocFFT backend, rocFFT 1.0.19 or higher is now required.
hipSOLVER 1.6.0#
hipSOLVER 1.6.0 for ROCm 5.4.0
Added#
Added compatibility-only functions
gesvdaStridedBatched
hipsolverDnSgesvdaStridedBatched_bufferSize, hipsolverDnDgesvdaStridedBatched_bufferSize, hipsolverDnCgesvdaStridedBatched_bufferSize, hipsolverDnZgesvdaStridedBatched_bufferSize
hipsolverDnSgesvdaStridedBatched, hipsolverDnDgesvdaStridedBatched, hipsolverDnCgesvdaStridedBatched, hipsolverDnZgesvdaStridedBatched
hipSPARSE 2.3.3#
hipSPARSE 2.3.3 for ROCm 5.4.0
Added#
Added hipsparseCsr2cscEx2_bufferSize and hipsparseCsr2cscEx2 routines
Changed#
HIPSPARSE_ORDER_COLUMN has been renamed to HIPSPARSE_ORDER_COL to match cuSPARSE
rccl 2.13.4#
RCCL 2.13.4 for ROCm 5.4.0
Changed#
Compatibility with NCCL 2.13.4
Improvements to RCCL when running with hipGraphs
RCCL_ENABLE_HIPGRAPH environment variable is no longer necessary to enable hipGraph support
Minor latency improvements
Fixed#
Resolved potential memory access error due to asynchronous memset
rocALUTION 2.1.3#
rocALUTION 2.1.3 for ROCm 5.4.0
Added#
Added build support for Navi31 and Navi33
Added support for non-squared global matrices
Improved#
Fixed a memory leak in MatrixMult on HIP backend
Global structures can now be used with a single process
Changed#
Switched GTest death test style to ‘threadsafe’
GlobalVector::GetGhostSize() is deprecated and will be removed
ParallelManager::GetGlobalSize(), ParallelManager::GetLocalSize(), ParallelManager::SetGlobalSize() and ParallelManager::SetLocalSize() are deprecated and will be removed
Vector::GetGhostSize() is deprecated and will be removed
Multigrid::SetOperatorFormat(unsigned int) is deprecated and will be removed, use Multigrid::SetOperatorFormat(unsigned int, int) instead
RugeStuebenAMG::SetCouplingStrength(ValueType) is deprecated and will be removed, use SetStrengthThreshold(float) instead
rocBLAS 2.46.0#
rocBLAS 2.46.0 for ROCm 5.4.0
Added#
Client smoke test dataset added for quick validation using the command rocblas-test --yaml rocblas_smoke.yaml
Added stream order device memory allocation as a non-default beta option.
Optimized#
Improved trsm performance for small sizes by using a substitution method technique
Improved syr2k and her2k performance significantly by using a block-recursive algorithm
Changed#
Level 2, Level 1, and Extension functions: argument checking when the handle is set to rocblas_pointer_mode_host now returns the status of rocblas_status_invalid_pointer only for pointers that must be dereferenced based on the alpha and beta argument values. With handle mode rocblas_pointer_mode_device only pointers that are always dereferenced regardless of alpha and beta values are checked and so may lead to a return status of rocblas_status_invalid_pointer. This improves consistency with legacy BLAS behaviour.
Add variable to turn on/off ieee16/ieee32 tests for mixed precision gemm
Allow hipBLAS to select int8 datatype
Disallow B == C && ldb != ldc in rocblas_xtrmm_outofplace
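The host-pointer-mode rule in the first item of this list can be pictured with a toy model of an operation of the form y = alpha*A*x + beta*y. This is only a sketch of the underlying reasoning, not rocBLAS code. The function names are hypothetical, and the exact checks rocBLAS performs for each routine are not spelled out in this changelog.

```python
def pointers_dereferenced(alpha: float, beta: float) -> set:
    """Which operands of y = alpha*A*x + beta*y are actually touched."""
    needed = set()
    if alpha != 0:
        # A and x only contribute when alpha is non-zero.
        needed |= {"A", "x"}
    if alpha != 0 or beta != 1:
        # With alpha == 0 and beta == 1 the result equals y, so y is untouched.
        needed.add("y")
    return needed


def check_host_mode(alpha: float, beta: float, pointers: dict) -> str:
    """Report an invalid pointer only for operands that would be dereferenced."""
    for name in sorted(pointers_dereferenced(alpha, beta)):
        if pointers.get(name) is None:
            return "rocblas_status_invalid_pointer"
    return "rocblas_status_success"


# alpha == 0 and beta == 1: nothing is dereferenced, so null pointers pass.
print(check_host_mode(0.0, 1.0, {"A": None, "x": None, "y": None}))
```

In device pointer mode the values of alpha and beta are not known on the host, which is why the changelog describes a different rule for that mode. The sketch does not model it.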
Fixed#
FORTRAN interfaces generalized for FORTRAN compilers other than gfortran
fix for trsm_strided_batched rocblas-bench performance gathering
Fix for rocm-smi path in commandrunner.py script to match ROCm 5.2 and above
rocFFT 1.0.19#
rocFFT 1.0.19 for ROCm 5.4.0
Optimizations#
Optimized some strided large 1D plans.
Added#
Added rocfft_plan_description_set_scale_factor API to efficiently multiply each output element of an FFT by a given scaling factor.
Created a rocfft_kernel_cache.db file next to the installed library. SBCC kernels are moved to this file when built with the library, and are runtime-compiled for new GPU architectures.
Added gfx1100 and gfx1102 to default AMDGPU_TARGETS.
Changed#
Moved runtime compilation cache to in-memory by default. A default on-disk cache can encounter contention problems on multi-node clusters with a shared filesystem. rocFFT can still be told to use an on-disk cache by setting the ROCFFT_RTC_CACHE_PATH environment variable.
rocPRIM 2.12.0#
rocPRIM 2.12.0 for ROCm 5.4.0
Changed#
device_partition, device_unique, and device_reduce_by_key now support problem sizes larger than 2^32 items.
Removed#
block_sort::sort() overload for keys and values with a dynamic size. This overload was documented but the implementation is missing. To avoid further confusion the documentation is removed until a decision is made on implementing the function.
Fixed#
Fixed the compilation failure in device_merge if the two key iterators don’t match.
rocRAND 2.10.16#
rocRAND 2.10.16 for ROCm 5.4.0
Added#
MRG31K3P pseudorandom number generator based on L’Ecuyer and Touzin, 2000, “Fast combined multiple recursive generators with multipliers of the form a = ±2^q ± 2^r”.
LFSR113 pseudorandom number generator based on L’Ecuyer, 1999, “Tables of maximally equidistributed combined LFSR generators”.
SCRAMBLED_SOBOL32 and SCRAMBLED_SOBOL64 quasirandom number generators. The Scrambled Sobol sequences are generated by scrambling the output of a Sobol sequence.
Changed#
The mrg_<distribution>_distribution structures, which provided numbers based on MRG32K3A, are now replaced by mrg_engine_<distribution>_distribution, where <distribution> is log_normal, normal, poisson, or uniform. These structures provide numbers for MRG31K3P (with template type rocrand_state_mrg31k3p) and MRG32K3A (with template type rocrand_state_mrg32k3a).
Fixed#
Sobol64 now returns 64-bit random numbers instead of 32-bit random numbers. As a result, the performance of this generator has regressed.
Fixed a bug that prevented compiling code in C++ mode (with a host compiler) when it included the rocRAND headers on Windows.
rocSOLVER 3.20.0#
rocSOLVER 3.20.0 for ROCm 5.4.0
Added#
Partial SVD for bidiagonal matrices:
BDSVDX
Partial SVD for general matrices:
GESVDX (with batched and strided_batched versions)
Changed#
Changed ROCSOLVER_EMBED_FMT default to ON for users building directly with CMake. This matches the existing default when building with install.sh or rmake.py.
rocSPARSE 2.4.0#
rocSPARSE 2.4.0 for ROCm 5.4.0
Added#
Added rocsparse_spmv_ex routine
Added rocsparse_bsrmv_ex_analysis and rocsparse_bsrmv_ex routines
Added csritilu0 routine
Added build support for Navi31 and Navi33
Improved#
Optimization to segmented algorithm for COO SpMV by performing analysis
Improve performance when generating random matrices.
Fixed bug in ellmv
Optimized bsr2csr routine
Fixed integer overflow bugs
rocThrust 2.17.0#
rocThrust 2.17.0 for ROCm 5.4.0
Added#
Updated to match upstream Thrust 1.17.0
rocWMMA 0.9#
rocWMMA 0.9 for ROCm 5.4.0
Added#
Added gemm driver APIs for flow control builtins
Added benchmark logging systems
Restructured tests to follow naming convention. Added macros for test generation
Changed#
Changed CMake to accommodate the modified test infrastructure
Fine tuned the multi-block kernels with and without lds
Adjusted Maximum Vector Width to dWordx4 Width
Updated Efficiencies to display as whole number percentages
Updated throughput from GFlops/s to TFlops/s
Reset the ad-hoc tests to use smaller sizes
Modified the output validation to use CPU-based implementation against rocWMMA
Modified the extended vector test to return error codes for memory allocation failures
Tensile 4.35.0#
Tensile 4.35.0 for ROCm 5.4.0
Added#
Async DMA support for Transpose Data Layout (ThreadSeparateGlobalReadA/B)
Option to output library logic in dictionary format
No solution found error message for benchmarking client
Exact K check for StoreCInUnrollExact
Support for CGEMM + MIArchVgpr
client-path parameter for using prebuilt client
CleanUpBuildFiles global parameter
Debug flag for printing library logic index of winning solution
NumWarmups global parameter for benchmarking
Windows support for benchmarking client
DirectToVgpr support for CGEMM
TensileLibLogicToYaml for creating tuning configs from library logic solutions
Optimizations#
Put beta code and store separately if StoreCInUnroll = x4 store
Improved performance for StoreCInUnroll + b128 store
Changed#
Re-enable HardwareMonitor for gfx90a
Decision trees use MLFeatures instead of Properties
Fixed#
Reject DirectToVgpr + MatrixInstBM/BN > 1
Fix benchmark timings when using warmups and/or validation
Fix mismatch issue with DirectToVgprB + VectorWidth > 1
Fix mismatch issue with DirectToLds + NumLoadsCoalesced > 1 + TailLoop
Fix incorrect reject condition for DirectToVgpr
Fix reject condition for DirectToVgpr + MIWaveTile < VectorWidth
Fix incorrect instruction generation with StoreCInUnroll
ROCm 5.3.3#
Defect fixes#
Issue with rocTHRUST and rocPRIM libraries#
There was a known issue in the rocTHRUST and rocPRIM libraries affecting iterator and type support in ROCm v5.3.x releases.
thrust::merge no longer correctly supported different iterator types for keys_input1 and keys_input2.
rocprim::device_merge no longer correctly supported using different types for keys_input1 and keys_input2.
This issue is resolved with the following fixes to compilation failures:
rocPRIM: in device_merge if the two key iterators do not match.
rocTHRUST: in thrust::merge if the two key iterators do not match.
Library changes in ROCm 5.3.3#
ROCm 5.3.2#
Defect fixes#
The following known issues in ROCm v5.3.2 are fixed in this release.
Peer-to-peer DMA mapping errors with SLES and RHEL#
Peer-to-Peer Direct Memory Access (DMA) mapping errors on Dell systems (R7525 and R750XA) with SLES 15 SP3/SP4 and RHEL 9.0 are fixed in this release.
Previously, running rocminfo resulted in Peer-to-Peer DMA mapping errors.
RCCL tuning table#
The RCCL tuning table is updated for supported platforms.
SGEMM (F32 GEMM) routines in rocBLAS#
Functional correctness failures in SGEMM (F32 GEMM) routines in rocBLAS for certain problem sizes and ranges are fixed in this release.
Known issues#
This section consists of known issues in this release.
AMD Instinct™ MI200 SRIOV virtualization issue#
There is a known issue in this ROCm v5.3 release with all AMD Instinct™ MI200 devices running within a virtual function (VF) under SRIOV virtualization. This issue will likely impact the functionality of SRIOV-based workloads but does not impact Discrete Device Assignment (DDA) or bare metal.
Until a fix is provided, users should rely on ROCm v5.2.3 to support their SRIOV workloads.
AMD Instinct™ MI200 firmware updates#
Customers cannot update the Integrated Firmware Image (IFWI) for AMD Instinct™ MI200 accelerators.
An updated firmware maintenance bundle consisting of an installation tool and images specific to AMD Instinct™ MI200 accelerators is under planning and will be available soon.
Known issue with rocThrust and rocPRIM libraries#
There is a known issue in the rocThrust and rocPRIM libraries affecting iterator and type support in ROCm v5.3.x releases.
thrust::merge no longer correctly supports different iterator types for keys_input1 and keys_input2.
rocprim::device_merge no longer correctly supports using different types for keys_input1 and keys_input2.
This issue is currently under investigation and will be resolved in a future release.
Library changes in ROCm 5.3.2#
ROCm 5.3.0#
Deprecations and warnings#
HIP Perl scripts deprecation#
The hipcc and hipconfig Perl scripts are deprecated. In a future release, compiled binaries will be available as hipcc.bin and hipconfig.bin as replacements for the Perl scripts.
Note
There will be a transition period where the Perl scripts and compiled binaries are available before the scripts are removed. There will be no functional difference between the Perl scripts and their compiled binary counterparts. No user action is required. Once these are available, users can optionally switch to hipcc.bin and hipconfig.bin. The hipcc/hipconfig soft links will be updated to point to the respective compiled binaries as the default option.
Linux file system hierarchy standard for ROCm#
ROCm packages have adopted the Linux Foundation Filesystem Hierarchy Standard (FHS) in this release to ensure ROCm components follow open source conventions for Linux-based distributions. While moving to the new file system hierarchy, ROCm remains backward compatible with the file system hierarchy of version 5.1 and older. See below for a detailed explanation of the new file system hierarchy and backward compatibility.
New file system hierarchy#
The following is the new file system hierarchy:
/opt/rocm-<ver>
    | --bin
        | --All externally exposed Binaries
    | --libexec
        | --<component>
            | -- Component specific private non-ISA executables (architecture independent)
    | --include
        | -- <component>
            | --<header files>
    | --lib
        | --lib<soname>.so -> lib<soname>.so.major -> lib<soname>.so.major.minor.patch
            (public libraries linked with application)
        | --<component> (component specific private library, executable data)
        | --<cmake>
            | --components
                | --<component>.config.cmake
    | --share
        | --html/<component>/*.html
        | --info/<component>/*.[pdf, md, txt]
        | --man
        | --doc
            | --<component>
                | --<licenses>
        | --<component>
            | --<misc files> (arch independent non-executable)
            | --samples
Note
ROCm will not support backward compatibility with the v5.1 (old) file system hierarchy in its next major release.
For more information, refer to https://refspecs.linuxfoundation.org/fhs.shtml.
Backward compatibility with older file systems#
ROCm has moved header files and libraries to their new locations, as indicated in the above structure, and placed symbolic links and wrapper header files in the old locations for backward compatibility.
Note
ROCm will continue supporting backward compatibility until the next major release.
Wrapper header files#
Wrapper header files are placed in the old location (/opt/rocm-xxx/<component>/include) with a warning message to include files from the new location (/opt/rocm-xxx/include) as shown in the example below:
// Code snippet from hip_runtime.h
#pragma message("This file is deprecated. Use file from include path /opt/rocm-ver/include/ and prefix with hip")
#include "hip/hip_runtime.h"
The wrapper header files’ backward compatibility deprecation is as follows:
#pragma message announcing deprecation – ROCm v5.2 release
#pragma message changed to #warning – Future release
#warning changed to #error – Future release
Backward compatibility wrappers removed – Future release
Library files#
Library files are available in the /opt/rocm-xxx/lib folder. For backward compatibility, the old library location (/opt/rocm-xxx/<component>/lib) has a soft link to the library at the new location.
Example:
$ ls -l /opt/rocm/hip/lib/
total 4
drwxr-xr-x 4 root root 4096 May 12 10:45 cmake
lrwxrwxrwx 1 root root 24 May 10 23:32 libamdhip64.so -> ../../lib/libamdhip64.so
CMake config files#
All CMake configuration files are available in the /opt/rocm-xxx/lib/cmake/<component> folder. For backward compatibility, the old CMake locations (/opt/rocm-xxx/<component>/lib/cmake) consist of a soft link to the new CMake config.
Example:
$ ls -l /opt/rocm/hip/lib/cmake/hip/
total 0
lrwxrwxrwx 1 root root 42 May 10 23:32 hip-config.cmake -> ../../../../lib/cmake/hip/hip-config.cmake
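The relative link targets in the two listings above can be checked by hand. The following sketch resolves them lexically with the standard library. The helper name is hypothetical, and the resolution deliberately ignores any further symbolic links along the path (for example, /opt/rocm itself may be a link to a versioned directory).

```python
import posixpath


def resolve_link(link_path: str, target: str) -> str:
    """Resolve a relative symlink target against the directory holding the link."""
    link_dir = posixpath.dirname(link_path)
    return posixpath.normpath(posixpath.join(link_dir, target))


# Old library location pointing at the new shared lib directory.
print(resolve_link("/opt/rocm/hip/lib/libamdhip64.so",
                   "../../lib/libamdhip64.so"))

# Old CMake config location pointing at the new lib/cmake/<component> directory.
print(resolve_link("/opt/rocm/hip/lib/cmake/hip/hip-config.cmake",
                   "../../../../lib/cmake/hip/hip-config.cmake"))
```

Both results land under /opt/rocm/lib, which matches the new hierarchy described in this section.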
Defect fixes#
The following defects are fixed in this release.
These defects were identified and documented as known issues in previous ROCm releases and are fixed in the ROCm v5.3 release.
Kernel produces incorrect results with ROCm 5.2#
User code did not initialize certain data constructs, leading to a correctness issue. A strict reading of the C++ standard suggests that failing to initialize these data constructs is undefined behavior. However, a special case was added for a specific compiler builtin to handle the uninitialized data in a defined manner.
The compiler fix consists of the following patches:
A new noundef attribute is added. This attribute denotes when a function call argument or return value may never contain uninitialized bits. For more information, see https://reviews.llvm.org/D81678
The application of this attribute was refined such that it was not added to a specific compiler built-in where the compiler knows that inactive lanes do not impact program execution. For more information, see ROCm/llvm-project.
Known issues#
This section consists of known issues in this release.
Issue with OpenMP-extras package upgrade#
The openmp-extras package has been split into runtime (openmp-extras-runtime) and dev (openmp-extras-devel) packages. This change has broken the upgrade support for the openmp-extras package in RHEL/SLES.
An available workaround in RHEL is to use the following command for upgrades:
sudo yum upgrade rocm-language-runtime --allowerasing
An available workaround in SLES is to use the following command for upgrades:
zypper update --force-resolution <meta-package>
AMD Instinct™ MI200 SRIOV virtualization issue#
There is a known issue in this ROCm v5.3 release with all AMD Instinct™ MI200 devices running within a virtual function (VF) under SRIOV virtualization. This issue will likely impact the functionality of SRIOV-based workloads, but does not impact Discrete Device Assignment (DDA) or Bare Metal.
Until a fix is provided, users should rely on ROCm v5.2.3 to support their SRIOV workloads.
System crash when IOMMU is enabled#
If the input-output memory management unit (IOMMU) is enabled in SBIOS and ROCm is installed, the system may report the following failures or errors when running workloads such as bandwidth test, clinfo, and HelloWorld.cl, and then crash.
IO PAGE FAULT
IRQ remapping does not support X2APIC mode
NMI error
Workaround: To avoid the system crash, add amd_iommu=on iommu=pt as kernel boot parameters, as indicated in the warning message.
Library changes in ROCm 5.3.0#
Library | Version |
---|---|
hipBLAS | 0.51.0 ⇒ 0.52.0 |
hipCUB | 2.11.1 ⇒ 2.12.0 |
hipFFT | 1.0.8 ⇒ 1.0.9 |
hipSOLVER | 1.4.0 ⇒ 1.5.0 |
hipSPARSE | 2.2.0 ⇒ 2.3.1 |
MIVisionX | |
rccl | |
rocALUTION | 2.0.3 ⇒ 2.1.0 |
rocBLAS | 2.44.0 ⇒ 2.45.0 |
rocFFT | 1.0.17 ⇒ 1.0.18 |
rocm-cmake | |
rocPRIM | 2.10.14 ⇒ 2.11.0 |
rocRAND | 2.10.14 ⇒ 2.10.15 |
rocSOLVER | 3.18.0 ⇒ 3.19.0 |
rocSPARSE | |
rocThrust | 2.15.0 ⇒ 2.16.0 |
rocWMMA | 0.7 ⇒ 0.8 |
Tensile | 4.33.0 ⇒ 4.34.0 |
hipBLAS 0.52.0#
hipBLAS 0.52.0 for ROCm 5.3.0
Added#
Added --cudapath option to install.sh to allow users to specify which CUDA build they would like to use.
Added --installcuda option to install.sh to install CUDA via a package manager. Can be used with the new --installcudaversion option to specify which version of CUDA to install.
Fixed#
Fixed #includes to support a compiler version.
Fixed client dependency support in install.sh
hipCUB 2.12.0#
hipCUB 2.12.0 for ROCm 5.3.0
Added#
UniqueByKey device algorithm
SubtractLeft, SubtractLeftPartialTile, SubtractRight, SubtractRightPartialTile overloads in BlockAdjacentDifference.
The old overloads (FlagHeads, FlagTails, FlagHeadsAndTails) are deprecated.
DeviceAdjacentDifference algorithm.
Extended benchmark suite of DeviceHistogram, DeviceScan, DevicePartition, DeviceReduce, DeviceSegmentedReduce, DeviceSegmentedRadixSort, DeviceRadixSort, DeviceSpmv, DeviceMergeSort, DeviceSegmentedSort
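The semantics of the SubtractLeft and SubtractRight overloads listed above can be sketched sequentially. This is a plain-Python illustration of the element-wise result, not the block-parallel hipCUB implementation. The function names and the optional predecessor/successor parameters are simplifications chosen for this sketch.

```python
import operator


def subtract_left(items, op=operator.sub, predecessor=None):
    """out[i] = op(items[i], items[i - 1]); the first item is copied unless a
    predecessor is supplied."""
    out = []
    for i, item in enumerate(items):
        if i > 0:
            out.append(op(item, items[i - 1]))
        elif predecessor is not None:
            out.append(op(item, predecessor))
        else:
            out.append(item)
    return out


def subtract_right(items, op=operator.sub, successor=None):
    """out[i] = op(items[i], items[i + 1]); the last item is copied unless a
    successor is supplied."""
    out = []
    last = len(items) - 1
    for i, item in enumerate(items):
        if i < last:
            out.append(op(item, items[i + 1]))
        elif successor is not None:
            out.append(op(item, successor))
        else:
            out.append(item)
    return out


print(subtract_left([1, 3, 6, 10]))    # differences to the left neighbour
print(subtract_right([1, 3, 6, 10]))   # differences to the right neighbour
```

Passing a comparison such as operator.ne as op yields head or tail flags, which is how the deprecated FlagHeads and FlagTails overloads can be expressed in terms of the new ones.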
Changed#
Obsoleted type traits defined in util_type.hpp. Use the standard library equivalents instead.
CUB backend references CUB and thrust version 1.16.0.
DeviceRadixSort’s num_items parameter’s type is now templated instead of being an int.
If an integral type with a size at most 4 bytes is passed (i.e. an int), the former logic applies.
Otherwise the algorithm uses a larger indexing type that makes it possible to sort input data over 2**32 elements.
Improved build parallelism of the test suite by splitting up large compilation units
hipFFT 1.0.9#
hipFFT 1.0.9 for ROCm 5.3.0
Changed#
Clean up build warnings.
GNUInstallDirs enhancements.
Requires gtest 1.11.
hipSOLVER 1.5.0#
hipSOLVER 1.5.0 for ROCm 5.3.0
Added#
Added functions
syevj
hipsolverSsyevj_bufferSize, hipsolverDsyevj_bufferSize, hipsolverCheevj_bufferSize, hipsolverZheevj_bufferSize
hipsolverSsyevj, hipsolverDsyevj, hipsolverCheevj, hipsolverZheevj
syevjBatched
hipsolverSsyevjBatched_bufferSize, hipsolverDsyevjBatched_bufferSize, hipsolverCheevjBatched_bufferSize, hipsolverZheevjBatched_bufferSize
hipsolverSsyevjBatched, hipsolverDsyevjBatched, hipsolverCheevjBatched, hipsolverZheevjBatched
sygvj
hipsolverSsygvj_bufferSize, hipsolverDsygvj_bufferSize, hipsolverChegvj_bufferSize, hipsolverZhegvj_bufferSize
hipsolverSsygvj, hipsolverDsygvj, hipsolverChegvj, hipsolverZhegvj
Added compatibility-only functions
syevdx/heevdx
hipsolverDnSsyevdx_bufferSize, hipsolverDnDsyevdx_bufferSize, hipsolverDnCheevdx_bufferSize, hipsolverDnZheevdx_bufferSize
hipsolverDnSsyevdx, hipsolverDnDsyevdx, hipsolverDnCheevdx, hipsolverDnZheevdx
sygvdx/hegvdx
hipsolverDnSsygvdx_bufferSize, hipsolverDnDsygvdx_bufferSize, hipsolverDnChegvdx_bufferSize, hipsolverDnZhegvdx_bufferSize
hipsolverDnSsygvdx, hipsolverDnDsygvdx, hipsolverDnChegvdx, hipsolverDnZhegvdx
Added --mem_query option to hipsolver-bench, which will print the amount of device memory workspace required by the function.
Changed#
The rocSOLVER backend will now set info to zero if rocSOLVER does not reference info. (Applies to orgbr/ungbr, orgqr/ungqr, orgtr/ungtr, ormqr/unmqr, ormtr/unmtr, gebrd, geqrf, getrs, potrs, and sytrd/hetrd).
gesvdj will no longer require extra workspace to transpose V when jobz is HIPSOLVER_EIG_MODE_VECTOR and econ is 1.
Fixed#
Fixed Fortran return value declarations within hipsolver_module.f90
Fixed gesvdj_bufferSize returning HIPSOLVER_STATUS_INVALID_VALUE when jobz is HIPSOLVER_EIG_MODE_NOVECTOR and 1 <= ldv < n
Fixed gesvdj returning HIPSOLVER_STATUS_INVALID_VALUE when jobz is HIPSOLVER_EIG_MODE_VECTOR, econ is 1, and m < n
hipSPARSE 2.3.1#
hipSPARSE 2.3.1 for ROCm 5.3.0
Added#
Add SpMM and SpMM batched for CSC format
rocALUTION 2.1.0#
rocALUTION 2.1.0 for ROCm 5.3.0
Added#
Benchmarking tool
Ext+I Interpolation with sparsify strategies added for RS-AMG
Improved#
ParallelManager
rocBLAS 2.45.0#
rocBLAS 2.45.0 for ROCm 5.3.0
Added#
install.sh option --upgrade_tensile_venv_pip to upgrade Pip in Tensile Virtual Environment. The corresponding CMake option is TENSILE_VENV_UPGRADE_PIP.
install.sh option --relocatable or -r adds rpath and removes ldconf entry on rocBLAS build.
install.sh option --lazy-library-loading to enable on-demand loading of Tensile library files at runtime to speed up rocBLAS initialization.
Support for RHEL9 and CS9.
Added Numerical checking routine for symmetric, Hermitian, and triangular matrices, so that they could be checked for any numerical abnormalities such as NaN, Zero, infinity and denormal value.
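As a rough picture of what such a numerical check involves, the sketch below classifies the entries in the upper triangle of a square matrix, which is the part that defines a symmetric or upper-triangular matrix. It is a host-side illustration in plain Python, not the rocBLAS routine. The function names and the returned categories are hypothetical.

```python
import math
import sys


def classify(value: float) -> str:
    """Sort a double-precision value into one of the checked categories."""
    if math.isnan(value):
        return "nan"
    if math.isinf(value):
        return "inf"
    if value == 0.0:
        return "zero"
    if abs(value) < sys.float_info.min:   # below the smallest normal double
        return "denormal"
    return "normal"


def check_upper_triangle(matrix) -> dict:
    """Count each category over the upper triangle (including the diagonal)."""
    counts = {"nan": 0, "inf": 0, "zero": 0, "denormal": 0, "normal": 0}
    for row_idx, row in enumerate(matrix):
        for col_idx in range(row_idx, len(row)):
            counts[classify(row[col_idx])] += 1
    return counts


matrix = [[1.0, float("nan")],
          [999.0, 5e-324]]   # the entry below the diagonal is never inspected
print(check_upper_triangle(matrix))
```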
Optimizations#
trmm_outofplace performance improvements for all sizes and data types using block-recursive algorithm.
herkx performance improvements for all sizes and data types using block-recursive algorithm.
syrk/herk performance improvements by utilising optimised syrkx/herkx code.
symm/hemm performance improvements for all sizes and datatypes using block-recursive algorithm.
Changed#
Unifying library logic file names: affects HBH (->HHS_BH), BBH (->BBS_BH), 4xi8BH (->4xi8II_BH). All HPA types are using the new naming convention now.
Level 3 function argument checking when the handle is set to rocblas_pointer_mode_host now returns the status of rocblas_status_invalid_pointer only for pointers that must be dereferenced based on the alpha and beta argument values. With handle mode rocblas_pointer_mode_device only pointers that are always dereferenced regardless of alpha and beta values are checked and so may lead to a return status of rocblas_status_invalid_pointer. This improves consistency with legacy BLAS behaviour.
Level 1, 2, and 3 function argument checking for enums is now more rigorously matching legacy BLAS so returns rocblas_status_invalid_value if arguments do not match the accepted subset.
Add quick-return for internal trmm and gemm template functions.
Moved function block sizes to a shared header file.
Level 1, 2, and 3 functions use rocblas_stride datatype for offset.
Modified the matrix and vector memory allocation in our test infrastructure for all Level 1, 2, 3 and BLAS_EX functions.
Added specific initialization for symmetric, Hermitian, and triangular matrix types in our test infrastructure.
Added NaN tests to the test infrastructure for the rest of Level 3, BLAS_EX functions.
Fixed#
Improved logic to #include <filesystem> vs <experimental/filesystem>.
install.sh -s option to build rocblas as a static library.
dot function now sets the device results asynchronously for N <= 0
Deprecated#
is_complex helper is now deprecated. Use rocblas_is_complex instead.
The enum truncate_t and the value truncate are now deprecated and will be removed in ROCm release 6.0. They are replaced by rocblas_truncate_t and rocblas_truncate, respectively. The new enum rocblas_truncate_t and the value rocblas_truncate can be used starting with this ROCm release for an easy transition.
Removed#
install.sh options --hip-clang, --no-hip-clang, --merge-files, and --no-merge-files are removed.
rocFFT 1.0.18#
rocFFT 1.0.18 for ROCm 5.3.0
Changed#
Runtime compilation cache now looks for environment variables XDG_CACHE_HOME (on Linux) and LOCALAPPDATA (on Windows) before falling back to HOME.
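The lookup order described above can be sketched as follows. This models only the order in which the variables are consulted. The function name is hypothetical, treating an empty value as unset is an assumption, and the file or subdirectory rocFFT creates under the chosen directory is not stated here.

```python
import os


def cache_base_dir(env, is_windows: bool):
    """Return the first usable base directory for the runtime compilation cache.

    Order taken from the changelog: XDG_CACHE_HOME (Linux) or LOCALAPPDATA
    (Windows) first, then HOME as the fallback.
    """
    primary = "LOCALAPPDATA" if is_windows else "XDG_CACHE_HOME"
    for name in (primary, "HOME"):
        value = env.get(name)
        if value:   # assumption: an empty string counts as unset
            return value
    return None


# Inspect the current process environment.
print(cache_base_dir(os.environ, is_windows=(os.name == "nt")))
```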
Optimizations#
Optimized 2D R2C/C2R to use 2-kernel plans where possible.
Improved performance of the Bluestein algorithm.
Optimized sbcc-168 and 100 by using half-lds.
Fixed#
Fixed occasional failures to parallelize runtime compilation of kernels. Failures would be retried serially and ultimately succeed, but this would take extra time.
Fixed failures of some R2C 3D transforms that use the unsupported TILE_UNALIGNED SBRC kernels. An example is 98^3 R2C out-of-place.
Fixed bugs in SBRC_ERC type.
rocm-cmake 0.8.0#
rocm-cmake 0.8.0 for ROCm 5.3.0
Fixed#
Fixed error in prerm scripts created by rocm_create_package that could break uninstall for packages using the PTH option.
Changed#
ROCM_USE_DEV_COMPONENT set to on by default for all platforms. This means that Windows will now generate runtime and devel packages by default
ROCMInstallTargets now defaults CMAKE_INSTALL_LIBDIR to lib if not otherwise specified.
Changed default Debian compression type to xz and enabled multi-threaded package compression.
rocm_create_package will no longer warn upon failure to determine version of program rpmbuild.
rocPRIM 2.11.0#
rocPRIM 2.11.0 for ROCm 5.3.0
Added#
New functions subtract_left and subtract_right in block_adjacent_difference to apply functions on pairs of adjacent items distributed between threads in a block.
New device level adjacent_difference primitives.
Added experimental tooling for automatic kernel configuration tuning for various architectures
Benchmarks collect and output more detailed system information
CMake functionality to improve build parallelism of the test suite that splits compilation units by function or by parameters.
Reverse iterator.
rocRAND 2.10.15#
rocRAND 2.10.15 for ROCm 5.3.0
Changed#
Increased number of warmup iterations for rocrand_benchmark_generate from 5 to 15 to eliminate corner cases that would generate artificially high benchmark scores.
rocSOLVER 3.19.0#
rocSOLVER 3.19.0 for ROCm 5.3.0
Added#
Partial eigensolver routines for symmetric/hermitian matrices:
SYEVX (with batched and strided_batched versions)
HEEVX (with batched and strided_batched versions)
Generalized symmetric- and hermitian-definite partial eigensolvers:
SYGVX (with batched and strided_batched versions)
HEGVX (with batched and strided_batched versions)
Eigensolver routines for symmetric/hermitian matrices using Jacobi algorithm:
SYEVJ (with batched and strided_batched versions)
HEEVJ (with batched and strided_batched versions)
Generalized symmetric- and hermitian-definite eigensolvers using Jacobi algorithm:
SYGVJ (with batched and strided_batched versions)
HEGVJ (with batched and strided_batched versions)
Added --profile_kernels option to rocsolver-bench, which will include kernel calls in the profile log (if profile logging is enabled with --profile).
Changed#
Changed rocsolver-bench result labels cpu_time and gpu_time to cpu_time_us and gpu_time_us, respectively.
Removed#
Removed dependency on cblas from the rocsolver test and benchmark clients.
Fixed#
Fixed incorrect SYGS2/HEGS2, SYGST/HEGST, SYGV/HEGV, and SYGVD/HEGVD results for batch counts larger than 32.
Fixed STEIN memory access fault when nev is 0.
Fixed incorrect STEBZ results for close eigenvalues when range = index.
Fixed git unsafe repository error when building with ./install.sh -cd as a non-root user.
rocThrust 2.16.0#
rocThrust 2.16.0 for ROCm 5.3.0
Changed#
rocThrust functionality that depends on device malloc is functional again because ROCm 5.2 re-enabled device malloc. Device-launched thrust::sort and thrust::sort_by_key are available for use.
rocWMMA 0.8#
rocWMMA 0.8 for ROCm 5.3.0
Tensile 4.34.0#
Tensile 4.34.0 for ROCm 5.3.0
Added#
Lazy loading of solution libraries and code object files
Support for dictionary style logic files
Support for decision tree based logic files using dictionary format
DecisionTreeLibrary for solution selection
DirectToLDS support for HGEMM
DirectToVgpr support for SGEMM
Grid based distance metric for solution selection
Support for gfx11xx
Support for DirectToVgprA/B + TLU=False
ForkParameters Groups as a way of specifying solution parameters
Support for a new Tensile yaml config format
TensileClientConfig for generating Tensile client config files
Options for TensileCreateLibrary to build client and create client config file
Optimizations#
Solution generation is now cached and is not repeated if solution parameters are unchanged
Changed#
Default MACInstruction to FMA
Fixed#
Accept StaggerUStride=0 as valid
Reject invalid data types for UnrollLoopEfficiencyEnable
Fix invalid code generation issues related to DirectToVgpr
Return hipErrorNotFound if no modules are loaded
Fix performance drop for NN ZGEMM with 96x64 macro tile
Fix memory violation for general batched kernels when alpha/beta/K = 0
ROCm 5.2.3#
Changes in this release#
Ubuntu 18.04 end-of-life announcement#
Support for Ubuntu 18.04 ends in this release. Future releases of ROCm will not provide prebuilt packages for Ubuntu 18.04.
HIP runtime#
Fixes#
A bug was discovered in the HIP graph capture implementation in the ROCm v5.2.0 release. If the same kernel is called twice (with different argument values) in a graph capture, the implementation only kept the argument values for the second kernel call.
A bug was introduced in the hiprtc implementation in the ROCm v5.2.0 release. This bug caused the hiprtcGetLoweredName call to fail for named expressions containing whitespace.
Example:
The named expression my_sqrt<complex<double>> passed but my_sqrt<complex<double >> failed.
RCCL#
Additions#
Compatibility with NCCL 2.12.10
Packages for test and benchmark executables on all supported OSes using CPack
Added custom signal handler - opt-in with RCCL_ENABLE_SIGNALHANDLER=1
Additional details provided if Binary File Descriptor library (BFD) is pre-installed.
Added experimental support for using multiple ranks per device
Requires using a new interface to create communicator (ncclCommInitRankMulti), refer to the interface documentation for details.
To avoid potential deadlocks, you might have to set an environment variable that increases the number of hardware queues. For example:
export GPU_MAX_HW_QUEUES=16
Added support for reusing ports in NET/IB channels
Opt-in with NCCL_IB_SOCK_CLIENT_PORT_REUSE=1 and NCCL_IB_SOCK_SERVER_PORT_REUSE=1
If the “Call to bind failed: Address already in use” error occurs in large-scale AlltoAll (for example, >=64 MI200 nodes), opt in to one or both of these options to resolve the massive port usage issue
Avoid using NCCL_IB_SOCK_SERVER_PORT_REUSE when NCCL_NCHANNELS_PER_NET_PEER is tuned >1
Removals#
Removed experimental clique-based kernels
Development tools#
No notable changes in this release for development tools, including the compiler, profiler, and debugger.
Deployment and management tools#
No notable changes in this release for deployment and management tools.
For release information for older ROCm releases, refer to ROCm/ROCm
Library changes in ROCm 5.2.3#
Library | Version |
---|---|
hipBLAS | |
hipCUB | |
hipFFT | |
hipSOLVER | |
hipSPARSE | |
MIVisionX | |
rccl | 2.11.4 ⇒ 2.12.10 |
rocALUTION | |
rocBLAS | |
rocFFT | |
rocPRIM | |
rocRAND | |
rocSOLVER | |
rocSPARSE | |
rocThrust | |
rocWMMA | |
Tensile | |
rccl 2.12.10#
RCCL 2.12.10 for ROCm 5.2.3
Added#
Compatibility with NCCL 2.12.10
Packages for test and benchmark executables on all supported OSes using CPack.
Added custom signal handler - opt-in with RCCL_ENABLE_SIGNALHANDLER=1
Additional details provided if Binary File Descriptor library (BFD) is pre-installed
Added support for reusing ports in NET/IB channels
Opt-in with NCCL_IB_SOCK_CLIENT_PORT_REUSE=1 and NCCL_IB_SOCK_SERVER_PORT_REUSE=1
If the “Call to bind failed: Address already in use” error occurs in large-scale AlltoAll (for example, >=64 MI200 nodes), opt in to one or both of these options to resolve the massive port usage issue
Avoid using NCCL_IB_SOCK_SERVER_PORT_REUSE when NCCL_NCHANNELS_PER_NET_PEER is tuned >1
Removed#
Removed experimental clique-based kernels
ROCm 5.2.1#
Library changes in ROCm 5.2.1#
Library | Version |
---|---|
hipBLAS | |
hipCUB | |
hipFFT | |
hipSOLVER | |
hipSPARSE | |
MIVisionX | 2.2.0 ⇒ 2.3.0 |
rccl | |
rocALUTION | |
rocBLAS | |
rocFFT | |
rocPRIM | |
rocRAND | |
rocSOLVER | |
rocSPARSE | |
rocThrust | |
rocWMMA | |
Tensile | |
MIVisionX 2.3.0#
MIVisionX for ROCm 5.2.1
Added#
Docker Support for ROCm 5.2.X
Optimizations#
Changed#
Fixed#
Tested Configurations#
Windows 10/11
Linux distribution
Ubuntu - 18.04/20.04
CentOS - 7/8
SLES - 15-SP2
ROCm: rocm-core - 5.2.0.50200-65
miopen-hip - 2.16.0.50101-48
miopen-opencl - 2.16.0.50101-48
migraphx - 2.1.0.50101-48
Protobuf - V3.12.4
OpenCV - 4.5.5
RPP - 0.93
FFMPEG - n4.4.2
Dependencies for all the above packages
MIVisionX Setup Script - V2.3.4
Known Issues#
OpenCV 4.X support for some apps missing
MIVisionX Dependency Map#
Docker Image: sudo docker build -f docker/ubuntu20/{DOCKER_LEVEL_FILE_NAME}.dockerfile -t {mivisionx-level-NUMBER} .
new component added to the level
existing component from the previous level
| Build Level | MIVisionX Dependencies | Modules | Libraries and Executables | Docker Tag |
|---|---|---|---|---|
| | cmake <br> gcc <br> g++ | amd_openvx <br> utilities | | |
| | ROCm OpenCL <br> +Level 1 | amd_openvx <br> amd_openvx_extensions <br> utilities | | |
| | OpenCV <br> FFMPEG <br> +Level 2 | amd_openvx <br> amd_openvx_extensions <br> utilities | | |
| | MIOpenGEMM <br> MIOpen <br> ProtoBuf <br> +Level 3 | amd_openvx <br> amd_openvx_extensions <br> apps <br> utilities | | |
| | AMD_RPP <br> rocAL deps <br> +Level 4 | amd_openvx <br> amd_openvx_extensions <br> apps <br> rocAL <br> utilities | | |
ROCm 5.2.0#
What’s new in this release#
HIP enhancements#
The ROCm v5.2 release consists of the following HIP enhancements:
HIP installation guide updates#
The HIP Installation Guide is updated to include building HIP tests from source on the AMD and NVIDIA platforms.
For more details, refer to the HIP Installation Guide v5.2.
Support for device-side malloc on HIP-Clang#
HIP-Clang now supports device-side malloc. This implementation does not require the use of hipDeviceSetLimit(hipLimitMallocHeapSize,value), nor does it respect any such setting. The heap is fully dynamic and can grow until the available free memory on the device is consumed.
The test codes at the following link show how to implement applications using malloc and free functions in device kernels:
New HIP APIs in this release#
The following new HIP APIs are available in the ROCm v5.2 release. Note that this is a pre-official version (beta) release of the new APIs:
Device management HIP APIs#
The new device management HIP APIs are as follows:
Gets a UUID for the device
hipError_t hipDeviceGetUuid(hipUUID* uuid, hipDevice_t device);
Note that this new API corresponds to the following CUDA API:
CUresult cuDeviceGetUuid(CUuuid* uuid, CUdevice dev);
Gets default memory pool of the specified device
hipError_t hipDeviceGetDefaultMemPool(hipMemPool_t* mem_pool, int device);
Sets the current memory pool of a device
hipError_t hipDeviceSetMemPool(int device, hipMemPool_t mem_pool);
Gets the current memory pool for the specified device
hipError_t hipDeviceGetMemPool(hipMemPool_t* mem_pool, int device);
New HIP runtime APIs in memory management#
The new Stream Ordered Memory Allocator functions of HIP runtime APIs in memory management are:
Allocates memory with stream ordered semantics
hipError_t hipMallocAsync(void** dev_ptr, size_t size, hipStream_t stream);
Frees memory with stream ordered semantics
hipError_t hipFreeAsync(void* dev_ptr, hipStream_t stream);
Releases freed memory back to the OS
hipError_t hipMemPoolTrimTo(hipMemPool_t mem_pool, size_t min_bytes_to_hold);
Sets attributes of a memory pool
hipError_t hipMemPoolSetAttribute(hipMemPool_t mem_pool, hipMemPoolAttr attr, void* value);
Gets attributes of a memory pool
hipError_t hipMemPoolGetAttribute(hipMemPool_t mem_pool, hipMemPoolAttr attr, void* value);
Controls visibility of the specified pool between devices
hipError_t hipMemPoolSetAccess(hipMemPool_t mem_pool, const hipMemAccessDesc* desc_list, size_t count);
Returns the accessibility of a pool from a device
hipError_t hipMemPoolGetAccess(hipMemAccessFlags* flags, hipMemPool_t mem_pool, hipMemLocation* location);
Creates a memory pool
hipError_t hipMemPoolCreate(hipMemPool_t* mem_pool, const hipMemPoolProps* pool_props);
Destroys the specified memory pool
hipError_t hipMemPoolDestroy(hipMemPool_t mem_pool);
Allocates memory from a specified pool with stream ordered semantics
hipError_t hipMallocFromPoolAsync(void** dev_ptr, size_t size, hipMemPool_t mem_pool, hipStream_t stream);
Exports a memory pool to the requested handle type
hipError_t hipMemPoolExportToShareableHandle(void* shared_handle, hipMemPool_t mem_pool, hipMemAllocationHandleType handle_type, unsigned int flags);
Imports a memory pool from a shared handle
hipError_t hipMemPoolImportFromShareableHandle(hipMemPool_t* mem_pool, void* shared_handle, hipMemAllocationHandleType handle_type, unsigned int flags);
Exports data to share a memory pool allocation between processes
hipError_t hipMemPoolExportPointer(hipMemPoolPtrExportData* export_data, void* dev_ptr);
Imports a memory pool allocation from another process
hipError_t hipMemPoolImportPointer(void** dev_ptr, hipMemPool_t mem_pool, hipMemPoolPtrExportData* export_data);
HIP graph management APIs#
The new HIP Graph Management APIs are as follows:
Enqueues a host function call in a stream
hipError_t hipLaunchHostFunc(hipStream_t stream, hipHostFn_t fn, void* userData);
Swaps the stream capture mode of a thread
hipError_t hipThreadExchangeStreamCaptureMode(hipStreamCaptureMode* mode);
Sets a node attribute
hipError_t hipGraphKernelNodeSetAttribute(hipGraphNode_t hNode, hipKernelNodeAttrID attr, const hipKernelNodeAttrValue* value);
Gets a node attribute
hipError_t hipGraphKernelNodeGetAttribute(hipGraphNode_t hNode, hipKernelNodeAttrID attr, hipKernelNodeAttrValue* value);
Support for virtual memory management APIs#
The new APIs for virtual memory management are as follows:
Frees an address range reservation made via hipMemAddressReserve
hipError_t hipMemAddressFree(void* devPtr, size_t size);
Reserves an address range
hipError_t hipMemAddressReserve(void** ptr, size_t size, size_t alignment, void* addr, unsigned long long flags);
Creates a memory allocation described by the properties and size
hipError_t hipMemCreate(hipMemGenericAllocationHandle_t* handle, size_t size, const hipMemAllocationProp* prop, unsigned long long flags);
Exports an allocation to a requested shareable handle type
hipError_t hipMemExportToShareableHandle(void* shareableHandle, hipMemGenericAllocationHandle_t handle, hipMemAllocationHandleType handleType, unsigned long long flags);
Gets the access flags set for the given location and ptr
hipError_t hipMemGetAccess(unsigned long long* flags, const hipMemLocation* location, void* ptr);
Calculates either the minimal or recommended granularity
hipError_t hipMemGetAllocationGranularity(size_t* granularity, const hipMemAllocationProp* prop, hipMemAllocationGranularity_flags option);
Retrieves the property structure of the given handle
hipError_t hipMemGetAllocationPropertiesFromHandle(hipMemAllocationProp* prop, hipMemGenericAllocationHandle_t handle);
Imports an allocation from a requested shareable handle type
hipError_t hipMemImportFromShareableHandle(hipMemGenericAllocationHandle_t* handle, void* osHandle, hipMemAllocationHandleType shHandleType);
Maps an allocation handle to a reserved virtual address range
hipError_t hipMemMap(void* ptr, size_t size, size_t offset, hipMemGenericAllocationHandle_t handle, unsigned long long flags);
Maps or unmaps subregions of sparse HIP arrays and sparse HIP mipmapped arrays
hipError_t hipMemMapArrayAsync(hipArrayMapInfo* mapInfoList, unsigned int count, hipStream_t stream);
Releases a memory handle representing a memory allocation that was previously allocated through hipMemCreate
hipError_t hipMemRelease(hipMemGenericAllocationHandle_t handle);
Returns the allocation handle of the backing memory allocation given the address
hipError_t hipMemRetainAllocationHandle(hipMemGenericAllocationHandle_t* handle, void* addr);
Sets the access flags for each location specified in desc for the given virtual address range
hipError_t hipMemSetAccess(void* ptr, size_t size, const hipMemAccessDesc* desc, size_t count);
Unmaps memory allocation of a given address range
hipError_t hipMemUnmap(void* ptr, size_t size);
For more information, refer to the HIP API documentation at hip:doxygen/html/modules.
Planned HIP changes in future releases#
Changes to hipDeviceProp_t, HIPMEMCPY_3D, and hipArray structures (and related HIP APIs) are planned in the next major release. These changes may impact backward compatibility.
Refer to the release notes in subsequent releases for more information.
ROCm math and communication libraries#
In this release, ROCm math and communication libraries consist of the following enhancements and fixes:
New rocWMMA for matrix multiplication and accumulation operations acceleration
This release introduces a new ROCm C++ library for accelerating mixed-precision matrix multiplication and accumulation (MFMA) operations leveraging specialized GPU matrix cores. rocWMMA provides a C++ API to facilitate breaking down matrix multiply accumulate problems into fragments and using them in block-wise operations that are distributed in parallel across GPU wavefronts. The API is a header library of GPU device code, meaning matrix core acceleration may be compiled directly into your kernel device code. This can benefit from compiler optimization in the generation of kernel assembly and does not incur additional overhead costs of linking to external runtime libraries or having to launch separate kernels.
rocWMMA is released as a header library and includes test and sample projects to validate and illustrate example usages of the C++ API. GEMM matrix multiplication is used as primary validation given the heavy precedent for the library. However, the usage portfolio is growing significantly and demonstrates different ways rocWMMA may be consumed.
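The fragment decomposition described above can be pictured with a host-only sketch. The code below is plain C++ and does not use the rocWMMA API: the function name tiled_gemm and the tile edge kTile are hypothetical, and real fragment shapes depend on the data type and the target GPU. It only shows how a matrix multiply-accumulate breaks into tile-sized block operations that, on a GPU, would be distributed across wavefronts.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical tile edge for illustration only.
constexpr std::size_t kTile = 4;

// Computes C += A * B for row-major n x n matrices, where n is a multiple of kTile.
// Each (bi, bj) output tile plays the role of an accumulator fragment; each bk
// step multiplies one A tile by one B tile and accumulates into that fragment.
void tiled_gemm(const std::vector<float>& a, const std::vector<float>& b,
                std::vector<float>& c, std::size_t n) {
    assert(n % kTile == 0);
    assert(a.size() == n * n && b.size() == n * n && c.size() == n * n);
    for (std::size_t bi = 0; bi < n; bi += kTile) {
        for (std::size_t bj = 0; bj < n; bj += kTile) {
            for (std::size_t bk = 0; bk < n; bk += kTile) {
                // One tile-sized multiply-accumulate.
                for (std::size_t i = bi; i < bi + kTile; ++i) {
                    for (std::size_t j = bj; j < bj + kTile; ++j) {
                        float acc = c[i * n + j];
                        for (std::size_t k = bk; k < bk + kTile; ++k) {
                            acc += a[i * n + k] * b[k * n + j];
                        }
                        c[i * n + j] = acc;
                    }
                }
            }
        }
    }
}
```

In this sketch the three outer loops walk over tiles and the three inner loops stand in for the work a single fragment operation would perform; the loop order is a design choice made for readability, not for performance.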
For more information, refer to Communication libraries.
OpenMP enhancements in this release#
OMPT target support#
The OpenMP runtime in ROCm implements a subset of the OMPT device APIs, as described in the OpenMP specification document. These are APIs that allow first-party tools to examine the profile and traces for kernels that execute on a device. A tool may register callbacks for data transfer and kernel dispatch entry points. A tool may use APIs to start and stop tracing for device-related activities, such as data transfer and kernel dispatch timings and associated metadata. If device tracing is enabled, trace records for device activities are collected during program execution and returned to the tool using the APIs described in the specification.
Following is an example demonstrating how a tool would use the OMPT target APIs supported. The README in /opt/rocm/llvm/examples/tools/ompt outlines the steps to follow, and you can run the provided example as indicated below:
cd /opt/rocm/llvm/examples/tools/ompt/veccopy-ompt-target-tracing
make run
The file veccopy-ompt-target-tracing.c simulates how a tool would initiate device activity tracing. The file callbacks.h shows the callbacks that may be registered and implemented by the tool.
Deprecations and warnings#
Linux file system hierarchy standard for ROCm#
ROCm packages have adopted the Linux foundation file system hierarchy standard in this release to ensure ROCm components follow open source conventions for Linux-based distributions. While moving to a new file system hierarchy, ROCm ensures backward compatibility with its 5.1 version or older file system hierarchy. See below for a detailed explanation of the new file system hierarchy and backward compatibility.
New file system hierarchy#
The following is the new file system hierarchy:
/opt/rocm-<ver>
    | --bin
        | --All externally exposed Binaries
    | --libexec
        | --<component>
            | -- Component specific private non-ISA executables (architecture independent)
    | --include
        | -- <component>
            | --<header files>
    | --lib
        | --lib<soname>.so -> lib<soname>.so.major -> lib<soname>.so.major.minor.patch
            (public libraries linked with application)
        | --<component> (component specific private library, executable data)
        | --<cmake>
            | --components
                | --<component>.config.cmake
    | --share
        | --html/<component>/*.html
        | --info/<component>/*.[pdf, md, txt]
        | --man
        | --doc
            | --<component>
                | --<licenses>
        | --<component>
            | --<misc files> (arch independent non-executable)
            | --samples
Note
ROCm will not support backward compatibility with the v5.1 (old) file system hierarchy in its next major release.
For more information, refer to https://refspecs.linuxfoundation.org/fhs.shtml.
Backward compatibility with older file systems#
ROCm has moved header files and libraries to their new locations, as indicated in the above structure, and has included symbolic links and wrapper header files in the old locations for backward compatibility.
Note
ROCm will continue supporting backward compatibility until the next major release.
Wrapper header files#
Wrapper header files are placed in the old location (/opt/rocm-xxx/<component>/include) with a warning message to include files from the new location (/opt/rocm-xxx/include) as shown in the example below:
// Code snippet from hip_runtime.h
#pragma message("This file is deprecated. Use file from include path /opt/rocm-ver/include/ and prefix with hip")
#include "hip/hip_runtime.h"
The wrapper header files’ backward compatibility deprecation is as follows:
#pragma message announcing deprecation – ROCm v5.2 release
#pragma message changed to #warning – Future release
#warning changed to #error – Future release
Backward compatibility wrappers removed – Future release
Library files#
Library files are available in the /opt/rocm-xxx/lib folder. For backward compatibility, the old library location (/opt/rocm-xxx/<component>/lib) has a soft link to the library at the new location.
Example:
$ ls -l /opt/rocm/hip/lib/
total 4
drwxr-xr-x 4 root root 4096 May 12 10:45 cmake
lrwxrwxrwx 1 root root 24 May 10 23:32 libamdhip64.so -> ../../lib/libamdhip64.so
CMake config files#
All CMake configuration files are available in the /opt/rocm-xxx/lib/cmake/<component> folder. For backward compatibility, the old CMake locations (/opt/rocm-xxx/<component>/lib/cmake) contain a soft link to the new CMake config.
Example:
$ ls -l /opt/rocm/hip/lib/cmake/hip/
total 0
lrwxrwxrwx 1 root root 42 May 10 23:32 hip-config.cmake -> ../../../../lib/cmake/hip/hip-config.cmake
Planned deprecation of hip-rocclr and hip-base packages#
In the ROCm v5.2 release, hip-rocclr and hip-base packages (Debian and RPM) are planned for deprecation and will be removed in a future release. hip-runtime-amd and hip-dev(el) will replace these packages respectively. Users of hip-rocclr must install two packages, hip-runtime-amd and hip-dev, to get the same set of packages installed by hip-rocclr previously.
Currently, both the old and new package names are supported: hip-rocclr or hip-runtime-amd, and hip-base or hip-dev(el).
Deprecation of integrated HIP directed tests#
The integrated HIP directed tests, which are currently built by default, are deprecated in this release. The default building and execution support through CMake will be removed in a future release.
Defect fixes#
Defect | Fix |
---|---|
ROCmInfo does not list GPUs | code fix |
Hang observed while restoring cooperative group samples | code fix |
ROCM-SMI over SRIOV: Unsupported commands do not return proper error message | code fix |
Known issues#
This section consists of known issues in this release.
Compiler error on gfx1030 when compiling at -O0#
Issue#
A compiler error occurs when using -O0 flag to compile code for gfx1030 that calls atomicAddNoRet, which is defined in amd_hip_atomic.h. The compiler generates an illegal instruction for gfx1030.
Workaround#
The workaround is not to use the -O0 flag for this case. For higher optimization levels, the compiler does not generate an invalid instruction.
System freeze observed during CUDA memtest checkpoint#
Issue#
Checkpoint/Restore in Userspace (CRIU) requires approximately 20 MB of VRAM to checkpoint and restore. The CRIU process may freeze if the maximum amount of available VRAM is allocated to checkpoint applications.
Workaround#
To use CRIU to checkpoint and restore your application, limit the amount of VRAM the application uses to ensure at least 20 MB is available.
HPC test fails with the “HSA_STATUS_ERROR_MEMORY_FAULT” error#
Issue#
The compiler may incorrectly compile a program that uses the __shfl_sync(mask, value, srcLane) function when the “value” parameter to the function is undefined along some path to the function. For most functions, uninitialized inputs cause undefined behavior, but the definition for __shfl_sync should allow for undefined values.
Workaround#
The workaround is to initialize the parameters to __shfl_sync.
Note
When the -Wall compilation flag is used, the compiler generates a warning indicating that the variable may be uninitialized along some path.
Example:
double res = 0.0; // Initialize the input to __shfl_sync.
if (lane == 0) {
  res = <some expression>;
}
res = __shfl_sync(mask, res, 0);
Kernel produces incorrect result#
Issue#
In recent changes to Clang, insertion of the noundef attribute to all the function arguments has been enabled by default.
In the HIP kernel, variable var in shfl_sync may not be initialized, so LLVM IR treats it as undef.
So, the function argument that is potentially undef (because it is not initialized) has always been assumed to be noundef by LLVM IR (since Clang has inserted the noundef attribute). This leads to ambiguous kernel execution.
Workaround#
Skip adding the noundef attribute to functions tagged with the convergent attribute. Refer to https://reviews.llvm.org/D124158 for more information.
Introduce a shuffle attribute and add it to __shfl-like APIs in the HIP headers. Clang can skip adding the noundef attribute if it finds that the argument is tagged with the shuffle attribute. Refer to https://reviews.llvm.org/D125378 for more information.
Introduce a Clang builtin for __shfl to identify it and skip adding the noundef attribute.
Introduce __builtin_freeze to use on the relevant arguments in library wrappers. The library/header needs to insert freezes on the relevant inputs.
Issue with applications triggering oversubscription#
There is a known issue with applications that trigger oversubscription. A hardware hang occurs when ROCgdb is used on AMD Instinct™ MI50 and MI100 systems.
This issue is under investigation and will be fixed in a future release.
Library changes in ROCm 5.2.0#
Library | Version |
---|---|
hipBLAS | 0.50.0 ⇒ 0.51.0 |
hipCUB | 2.11.0 ⇒ 2.11.1 |
hipFFT | 1.0.7 ⇒ 1.0.8 |
hipSOLVER | 1.3.0 ⇒ 1.4.0 |
hipSPARSE | 2.1.0 ⇒ 2.2.0 |
MIVisionX | |
rccl | |
rocALUTION | 2.0.2 ⇒ 2.0.3 |
rocBLAS | 2.43.0 ⇒ 2.44.0 |
rocFFT | 1.0.16 ⇒ 1.0.17 |
rocPRIM | 2.10.13 ⇒ 2.10.14 |
rocRAND | 2.10.13 ⇒ 2.10.14 |
rocSOLVER | 3.17.0 ⇒ 3.18.0 |
rocSPARSE | 2.1.0 ⇒ 2.2.0 |
rocThrust | 2.14.0 ⇒ 2.15.0 |
rocWMMA | |
Tensile | 4.32.0 ⇒ 4.33.0 |
hipBLAS 0.51.0#
hipBLAS 0.51.0 for ROCm 5.2.0
Added#
Packages for test and benchmark executables on all supported OSes using CPack.
Added File/Folder Reorg Changes with backward compatibility support enabled using ROCM-CMAKE wrapper functions
Added user-specified initialization option to hipblas-bench
Fixed#
Fixed version gathering in performance measuring script
hipCUB 2.11.1#
hipCUB 2.11.1 for ROCm 5.2.0
Added#
Packages for test and benchmark executables on all supported OSes using CPack.
hipFFT 1.0.8#
hipFFT 1.0.8 for ROCm 5.2.0
Added#
Added File/Folder Reorg Changes with backward compatibility support using ROCM-CMAKE wrapper functions.
Packages for test and benchmark executables on all supported OSes using CPack.
hipSOLVER 1.4.0#
hipSOLVER 1.4.0 for ROCm 5.2.0
Added#
Package generation for test and benchmark executables on all supported OSes using CPack.
File/Folder Reorg
Added File/Folder Reorg Changes with backward compatibility support using ROCM-CMAKE wrapper functions.
Fixed#
Fixed the ReadTheDocs documentation generation.
hipSPARSE 2.2.0#
hipSPARSE 2.2.0 for ROCm 5.2.0
Added#
Packages for test and benchmark executables on all supported OSes using CPack.
rocALUTION 2.0.3#
rocALUTION 2.0.3 for ROCm 5.2.0
Added#
Packages for test and benchmark executables on all supported OSes using CPack.
rocBLAS 2.44.0#
rocBLAS 2.44.0 for ROCm 5.2.0
Added#
Packages for test and benchmark executables on all supported OSes using CPack.
Added Denormal number detection to the Numerical checking helper function to detect denormal/subnormal numbers in the input and the output vectors of rocBLAS level 1 and 2 functions.
Added Denormal number detection to the Numerical checking helper function to detect denormal/subnormal numbers in the input and the output general matrices of rocBLAS level 2 and 3 functions.
Added NaN initialization tests to the yaml files of Level 2 rocBLAS batched and strided-batched functions for testing purposes.
Added memory allocation check to avoid disk swapping during rocblas-test runs by skipping tests.
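As a rough illustration of what denormal detection involves, the following host-side sketch counts subnormal values in a buffer using the C++ standard library. The function name count_subnormals is hypothetical, and this is not the rocBLAS numerical checking helper itself.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Counts the subnormal (denormal) entries in x[0..n).
// std::fpclassify reports FP_SUBNORMAL for nonzero values whose magnitude is
// below the smallest normal float; zeros, normals, infinities, and NaNs are
// not counted.
std::size_t count_subnormals(const float* x, std::size_t n) {
    assert(x != nullptr || n == 0);
    std::size_t count = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (std::fpclassify(x[i]) == FP_SUBNORMAL) {
            ++count;
        }
    }
    return count;
}
```

A checking helper of this kind would typically be run on the inputs before a call and on the outputs after it, so that a nonzero count can be reported alongside the function name.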
Optimizations#
Improved performance of non-batched and batched her2 for all sizes and data types.
Improved performance of non-batched and batched amin for all data types using shuffle reductions.
Improved performance of non-batched and batched amax for all data types using shuffle reductions.
Improved performance of trsv for all sizes and data types.
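For readers unfamiliar with shuffle reductions, the sketch below emulates one on the host. It is an assumption-laden illustration, not rocBLAS kernel code: the 64-lane width (kLanes) and the function name wave_abs_max are hypothetical, and only the maximum magnitude is reduced, whereas amax and amin also track the index of the element.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

// Assumed wavefront width for this illustration.
constexpr std::size_t kLanes = 64;

// Returns max |x[i]| over all lanes using a butterfly (XOR) exchange pattern.
// At each step, lane i combines its value with the value held by lane
// (i ^ offset), which is what a cross-lane XOR shuffle would deliver on a GPU.
// After log2(kLanes) steps, every lane holds the full reduction.
float wave_abs_max(std::array<float, kLanes> lanes) {
    static_assert((kLanes & (kLanes - 1)) == 0, "lane count must be a power of two");
    for (float& v : lanes) {
        v = std::fabs(v);
    }
    for (std::size_t offset = kLanes / 2; offset > 0; offset /= 2) {
        const std::array<float, kLanes> snapshot = lanes;  // values before this step
        for (std::size_t i = 0; i < kLanes; ++i) {
            lanes[i] = std::max(snapshot[i], snapshot[i ^ offset]);
        }
    }
    assert(lanes[0] == lanes[kLanes - 1]);  // all lanes agree after the last step
    return lanes[0];
}
```

The point of the pattern is that the exchange happens between registers of neighboring lanes, so no shared-memory round trip is needed for the within-wavefront part of the reduction.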
Changed#
Modified gemm_ex for HBH (High-precision F16). The alpha/beta data type remains as F32 without narrowing to F16 and expanding back to F32 in the kernel. This change prevents rounding errors due to alpha/beta conversion in situations where alpha/beta are not exactly represented as an F16.
Modified non-batched and batched asum, nrm2 functions to use shuffle instruction based reductions.
For gemm, gemm_ex, gemm_ex2 internal API use rocblas_stride datatype for offset.
For symm, hemm, syrk, herk, dgmm, geam internal API use rocblas_stride datatype for offset.
AMD copyright year for all rocBLAS files.
For gemv (transpose-case), typecast the ‘lda’ (offset) datatype to size_t during offset calculation to avoid overflow and remove duplicate template functions.
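The gemm_ex change listed above (keeping alpha/beta in F32) can be motivated with a small numeric sketch. The helper below is hypothetical and only emulates F16 precision on the host by rounding to an 11-bit significand; it does not model F16 overflow or subnormals and is not how rocBLAS performs the conversion.

```cpp
#include <cassert>
#include <cmath>

// Rounds a float to the nearest value with an 11-bit significand, which is
// the precision of IEEE binary16 (F16). Simplified: valid only for zero or
// for finite values whose magnitude lies in the normal F16 range.
float round_to_f16_precision(float x) {
    assert(std::isfinite(x));
    if (x == 0.0f) {
        return x;
    }
    assert(std::fabs(x) >= 6.103515625e-5f && std::fabs(x) <= 65504.0f);
    int exponent = 0;
    const float mantissa = std::frexp(x, &exponent);  // x = mantissa * 2^exponent
    const float scaled = std::ldexp(mantissa, 11);    // keep 11 significant bits
    const float rounded = std::nearbyint(scaled);     // default mode: nearest, ties to even
    return std::ldexp(rounded, exponent - 11);
}
```

With this helper, an alpha of 0.1f comes back as 0.0999755859375f after the round trip, while a value such as 0.5f is unchanged. That difference is the kind of rounding error the change avoids by never narrowing alpha/beta.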
Fixed#
For function her2 avoid overflow in offset calculation.
For trsm when alpha == 0 and on host, allow A to be nullptr.
Fixed memory access issue in trsv.
Fixed git pre-commit script to update only AMD copyright year.
Fixed dgmm, geam test functions to set correct stride values.
For functions ssyr2k and dsyr2k allow trans == rocblas_operation_conjugate_transpose.
Fixed compilation error for clients-only build.
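The offset overflow items for her2 and gemv in this release come down to where the multiplication is performed. The sketch below is a minimal host-side illustration with a hypothetical function name (column_offset); it is not taken from the rocBLAS sources. Widening both operands before multiplying keeps the product from exceeding the 32-bit signed range.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Offset of column `col` in a column-major matrix with leading dimension `lda`.
// Both inputs are 32-bit integers, but each is converted to size_t before the
// multiplication, so the product is formed at the wider width instead of
// overflowing a 32-bit signed intermediate.
std::size_t column_offset(std::int32_t lda, std::int32_t col) {
    assert(lda >= 0 && col >= 0);
    return static_cast<std::size_t>(lda) * static_cast<std::size_t>(col);
}
```

For example, lda = 50000 and col = 50000 give an offset of 2,500,000,000 elements, which is larger than the maximum 32-bit signed value of 2,147,483,647.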
Removed#
Remove Navi12 (gfx1011) from fat binary.
rocFFT 1.0.17#
rocFFT 1.0.17 for ROCm 5.2.0
Added#
Packages for test and benchmark executables on all supported OSes using CPack.
Added File/Folder Reorg Changes with backward compatibility support using ROCM-CMAKE wrapper functions.
Changed#
Improved reuse of twiddle memory between plans.
Set a default load/store callback when only one callback type is set via the API for improved performance.
Optimizations#
Introduced a new access pattern of lds (non-linear) and applied it on sbcc kernels len 64 to get performance improvement.
Fixed#
Fixed plan creation failure in cases where SBCC kernels would need to write to non-unit-stride buffers.
rocPRIM 2.10.14#
rocPRIM 2.10.14 for ROCm 5.2.0
Added#
Packages for test and benchmark executables on all supported OSes using CPack.
Added File/Folder Reorg Changes and Enabled Backward compatibility support using wrapper headers.
rocRAND 2.10.14#
rocRAND 2.10.14 for ROCm 5.2.0
Added#
Backward compatibility for deprecated #include <rocrand.h> using wrapper header files.
Packages for test and benchmark executables on all supported OSes using CPack.
rocSOLVER 3.18.0#
rocSOLVER 3.18.0 for ROCm 5.2.0
Added#
Partial eigenvalue decomposition routines:
STEBZ
STEIN
Package generation for test and benchmark executables on all supported OSes using CPack.
Added tests for multi-level logging
Added tests for rocsolver-bench client
File/Folder Reorg
Added File/Folder Reorg Changes with backward compatibility support using ROCM-CMAKE wrapper functions.
Fixed#
Fixed compatibility with libfmt 8.1
rocSPARSE 2.2.0#
rocSPARSE 2.2.0 for ROCm 5.2.0
Added#
batched SpMM for CSR, COO and Blocked ELL formats.
Packages for test and benchmark executables on all supported OSes using CPack.
Clients file importers and exporters.
Improved#
Clients code size reduction.
Clients error handling.
Clients benchmarking for performance tracking.
Changed#
Test adjustments due to roundoff errors.
Fixed API call compatibility with rocPRIM.
Known Issues#
none
rocThrust 2.15.0#
rocThrust 2.15.0 for ROCm 5.2.0
Added#
Packages for test and benchmark executables on all supported OSes using CPack.
rocWMMA 0.7#
rocWMMA 0.7 for ROCm 5.2.0
Added#
Added unit tests for DLRM kernels
Added GEMM sample
Added DLRM sample
Added SGEMV sample
Added unit tests for cooperative wmma load and stores
Added unit tests for IOBarrier.h
Added wmma load/ store tests for different matrix types (A, B and Accumulator)
Added more block sizes 1, 2, 4, 8 to test MmaSyncMultiTest
Added block sizes 4, 8 to test MmaSynMultiLdsTest
Added support for wmma load / store layouts with block dimension greater than 64
Added IOShape structure to define the attributes of mapping and layouts for all wmma matrix types
Added CI testing for rocWMMA
Changed#
Renamed wmma to rocwmma in cmake, header files and documentation
Renamed library files
Modified Layout.h to use different matrix offset calculations (base offset, incremental offset and cumulative offset)
Opaque load/store continue to use incremental offsets as they fill the entire block
Cooperative load/store use cumulative offsets as they fill only small portions of the entire block
Increased Max split counts to 64 for cooperative load/store
Moved all the wmma definitions, API headers to rocwmma namespace
Modified wmma fill unit tests to validate all matrix types (A, B, Accumulator)
Tensile 4.33.0#
Tensile 4.33.0 for ROCm 5.2.0
Added#
TensileUpdateLibrary for updating old library logic files
Support for TensileRetuneLibrary to use sizes from separate file
ZGEMM DirectToVgpr/DirectToLds/StoreCInUnroll/MIArchVgpr support
Tests for denorm correctness
Option to write different architectures to different TensileLibrary files
Optimizations#
Optimize MessagePackLoadLibraryFile by switching to fread
DGEMM tail loop optimization for PrefetchAcrossPersistentMode=1/DirectToVgpr
Changed#
Alpha/beta datatype remains as F32 for HPA HGEMM
Force assembly kernels to not flush denorms
Use hipDeviceAttributePhysicalMultiProcessorCount as multiProcessorCount
Fixed#
Fix segmentation fault when running i8 datatype with TENSILE_DB=0x80
ROCm 5.1.3#
Library changes in ROCm 5.1.3#
Library | Version |
---|---|
hipBLAS | |
hipCUB | |
hipFFT | |
hipRAND | |
hipSOLVER | |
hipSPARSE | |
MIVisionX | 2.1.0 ⇒ 2.2.0 |
rccl | |
rocALUTION | |
rocBLAS | |
rocFFT | |
rocPRIM | |
rocRAND | |
rocSOLVER | |
rocSPARSE | |
rocThrust | |
Tensile | |
MIVisionX 2.2.0#
MIVisionX for ROCm 5.1.3
Added#
Optimizations#
Changed#
DockerFiles - Updates to install ROCm 5.1.1 Plus
Fixed#
Tested Configurations#
Windows 10/11
Linux distribution
Ubuntu - 18.04/20.04
CentOS - 7/8
SLES - 15-SP2
ROCm: rocm-core - 5.1.1.50101-48
miopen-hip - 2.16.0.50101-48
miopen-opencl - 2.16.0.50101-48
migraphx - 2.1.0.50101-48
Protobuf - V3.12.0
OpenCV - 4.5.5
RPP - 0.93
FFMPEG - n4.0.4
Dependencies for all the above packages
MIVisionX Setup Script - V2.3.0
Known Issues#
MIVisionX Dependency Map#
Docker Image: sudo docker build -f docker/ubuntu20/{DOCKER_LEVEL_FILE_NAME}.dockerfile -t {mivisionx-level-NUMBER} .
new component added to the level
existing component from the previous level
| Build Level | MIVisionX Dependencies | Modules | Libraries and Executables | Docker Tag |
|---|---|---|---|---|
| | cmake <br> gcc <br> g++ | amd_openvx <br> utilities | | |
| | ROCm OpenCL <br> +Level 1 | amd_openvx <br> amd_openvx_extensions <br> utilities | | |
| | OpenCV <br> FFMPEG <br> +Level 2 | amd_openvx <br> amd_openvx_extensions <br> utilities | | |
| | MIOpenGEMM <br> MIOpen <br> ProtoBuf <br> +Level 3 | amd_openvx <br> amd_openvx_extensions <br> apps <br> utilities | | |
| | AMD_RPP <br> rocAL deps <br> +Level 4 | amd_openvx <br> amd_openvx_extensions <br> apps <br> rocAL <br> utilities | | |
ROCm 5.1.1#
Library changes in ROCm 5.1.1#
ROCm 5.1.0#
What’s new in this release#
HIP enhancements#
The ROCm v5.1 release consists of the following HIP enhancements.
HIP installation guide updates#
The HIP installation guide now includes information on installing and building HIP from source on AMD and NVIDIA platforms.
Refer to the HIP Installation Guide v5.1 for more details.
Support for HIP graph#
ROCm v5.1 extends support for HIP Graph.
Planned changes for HIP in future releases#
Separation of hiprtc (libhiprtc) library from hip runtime (amdhip64)#
On ROCm/Linux, to maintain backward compatibility, the hipruntime library (amdhip64) will continue to include hiprtc symbols in future releases. The backward compatible support may be discontinued by removing hiprtc symbols from the hipruntime library (amdhip64) in the next major release.
hipDeviceProp_t structure enhancements#
Changes to the hipDeviceProp_t structure in the next major release may result in backward incompatibility. More details on these changes will be provided in subsequent releases.
ROCDebugger enhancements#
Multi-language source-level debugger#
The compiler now generates source-level variable and function argument debug information.
Accuracy is guaranteed only when the compiler options -g -O0 are used, and this applies only to HIP.
This enhancement enables ROCDebugger users to interact with the HIP source-level variables and function arguments.
Note
The newly-suggested compiler -g option must be used instead of the previously-suggested -ggdb
option. Although the effect of these two options is currently equivalent, this is not guaranteed for the
future, as changes might be made by the upstream LLVM community.
Machine interface lanes support#
ROCDebugger Machine Interface (MI) extends support to lanes, which includes the following enhancements:
Added a new -lane-info command, listing the current thread’s lanes.
The -thread-select command now supports a lane switch to switch to a specific lane of a thread:
-thread-select -l LANE THREAD
The =thread-selected notification gained a lane-id attribute. This enables the frontend to know which lane of the thread was selected.
The *stopped asynchronous record gained lane-id and hit-lanes attributes. The former indicates which lane is selected, and the latter indicates which lanes explain the stop.
MI commands now accept a global --lane option, similar to the global --thread and --frame options.
MI varobjs are now lane-aware.
For more information, refer to the ROC Debugger User Guide at ROCgdb.
Enhanced - clone-inferior command#
The clone-inferior command now ensures that the TTY, CMD, ARGS, and AMDGPU PRECISE-MEMORY settings are copied from the original inferior to the new one. All modifications to the environment variables done using the ‘set environment’ or ‘unset environment’ commands are also copied to the new inferior.
MIOpen support for RDNA GPUs#
This release includes support for AMD Radeon™ Pro W6800, in addition to other bug fixes and performance improvements as listed below:
MIOpen now supports RDNA GPUs!! (via MIOpen PRs 973, 780, 764, 740, 739, 677, 660, 653, 493, 498)
Fixed a correctness issue with ImplicitGemm algorithm
Updated the performance data for new kernel versions
Improved MIOpen build time by splitting large kernel header files
Fixed an issue in reduction kernels for padded tensors
Various other bug fixes and performance improvements
For more information, see Documentation.
Checkpoint restore support with CRIU#
The new Checkpoint Restore in Userspace (CRIU) functionality is implemented to support AMD GPU and ROCm applications.
CRIU is a userspace tool to Checkpoint and Restore an application.
Previously, CRIU could not checkpoint and restore applications that used device files, such as a GPU. With this ROCm release, CRIU is enhanced with a new plugin to support AMD GPUs, which includes:
Single and Multi GPU systems (Gfx9)
Checkpoint / Restore on a different system
Checkpoint / Restore inside a docker container
PyTorch
TensorFlow
Using CRIU Image Streamer
For more information, refer to checkpoint-restore/criu
Note
The CRIU plugin (amdgpu_plugin) is merged upstream with the CRIU repository. The KFD kernel patches are also available upstream with the amd-staging-drm-next branch (public) and the ROCm 5.1 release branch.
Note
This is a Beta release of the Checkpoint and Restore functionality, and some features are not available in this release.
For more information, refer to the following websites:
Defect fixes#
The following defects are fixed in this release.
Driver fails to load after installation#
The issue with the driver failing to load after ROCm installation is now fixed.
The driver installs successfully, and the server reboots with working rocminfo and clinfo.
ROCDebugger defect fixes#
Breakpoints in GPU kernel code before kernel is loaded#
Previously, setting a breakpoint in device code by line number before the device code was loaded into the program resulted in ROCgdb incorrectly moving the breakpoint to the first following line that contains host code.
Now, the breakpoint is left pending. When the GPU kernel gets loaded, the breakpoint resolves to a location in the kernel.
Registers invalidated after write#
Previously, after a register write, the just-written value was presented as the current value even when it had become stale.
ROCgdb now invalidates the cached values of registers whose content might differ after being written, such as registers with read-only bits.
ROCgdb also invalidates all volatile registers when a volatile register is written. For example, writing VCC invalidates the content of STATUS as STATUS.VCCZ may change.
Scheduler-locking and GPU wavefronts#
When scheduler-locking is in effect (for example, after using the “set scheduler-locking” command), new wavefronts created by a resumed thread, whether a CPU thread or a GPU wavefront, are held in the halt state.
ROCDebugger fails before completion of kernel execution#
It was possible (although erroneous) for a debugger to load GPU code in memory, send it to the device, start executing a kernel on the device, and dispose of the original code before the kernel had finished execution. If a breakpoint was hit after this point, the debugger failed with an internal error while trying to access the debug information.
This issue is now fixed by ensuring that the debugger keeps a local copy of the original code and debug information.
Known issues#
Random memory access fault errors observed while running math libraries unit tests#
Issue: Random memory access fault issues are observed while running Math libraries unit tests. This issue is encountered in ROCm v5.0, ROCm v5.0.1, and ROCm v5.0.2.
Note, the faults only occur in the SRIOV environment.
Workaround: Use SDMA to update the page table. The Guest set up steps are as follows:
sudo modprobe amdgpu vm_update_mode=0
To verify, run the following on the guest:
cat /sys/module/amdgpu/parameters/vm_update_mode
The expected output is 0.
CU masking causes application to freeze#
Using CU masking can cause an application to freeze or run exceptionally slowly. This issue is observed only in the GFX10 suite of products.
This issue is under active investigation at this time.
Failed checkpoint in Docker containers#
A defect with Ubuntu images kernel-5.13-30-generic and kernel-5.13-35-generic with Overlay FS results in incorrect reporting of the mount ID.
This issue with Ubuntu causes CRIU checkpointing to fail in Docker containers.
As a workaround, use an older version of the kernel. For example, Ubuntu 5.11.0-46-generic.
Issue with restoring workloads using cooperative groups feature#
Workloads that use the cooperative groups function to ensure all waves can be resident at the same time may fail to restore correctly. This issue is under investigation and will be fixed in a future release.
Radeon Pro V620 and W6800 workstation GPUs#
No support for ROCDebugger on SRIOV#
ROCDebugger is not supported in the SRIOV environment on any GPU.
This is a known issue and will be fixed in a future release.
Random error messages in ROCm SMI for SR-IOV#
Random error messages are generated by unsupported functions or commands.
This is a known issue and will be fixed in a future release.
Library changes in ROCm 5.1.0#
| Library | Version |
|---|---|
| hipBLAS | 0.49.0 ⇒ 0.50.0 |
| hipCUB | 2.10.13 ⇒ 2.11.0 |
| hipFFT | 1.0.4 ⇒ 1.0.7 |
| hipRAND | |
| hipSOLVER | 1.2.0 ⇒ 1.3.0 |
| hipSPARSE | 2.0.0 ⇒ 2.1.0 |
| MIVisionX | |
| rccl | 2.10.3 ⇒ 2.11.4 |
| rocALUTION | 2.0.1 ⇒ 2.0.2 |
| rocBLAS | 2.42.0 ⇒ 2.43.0 |
| rocFFT | 1.0.13 ⇒ 1.0.16 |
| rocPRIM | 2.10.12 ⇒ 2.10.13 |
| rocRAND | 2.10.12 ⇒ 2.10.13 |
| rocSOLVER | 3.16.0 ⇒ 3.17.0 |
| rocSPARSE | 2.0.0 ⇒ 2.1.0 |
| rocThrust | 2.13.0 ⇒ 2.14.0 |
| Tensile | 4.31.0 ⇒ 4.32.0 |
hipBLAS 0.50.0#
hipBLAS 0.50.0 for ROCm 5.1.0
Added#
Added library version and device information to hipblas-test output
Added --rocsolver-path command line option to choose path to pre-built rocSOLVER, as absolute or relative path
Added --cmake_install command line option to update cmake to minimum version if required
Added cmake-arg parameter to pass in cmake arguments while building
Added infrastructure to support readthedocs hipBLAS documentation.
Fixed#
Added hipblasVersionMinor define. hipblaseVersionMinor remains defined for backwards compatibility.
Doxygen warnings in hipblas.h header file.
Changed#
rocblas-path command line option can be specified as either absolute or relative path
Help message improvements in install.sh and rmake.py
Updated googletest dependency from 1.10.0 to 1.11.0
hipCUB 2.11.0#
hipCUB 2.11.0 for ROCm 5.1.0
Added#
Device segmented sort
Warp merge sort, WarpMask and thread sort from cub 1.15.0 supported in hipCUB
Device three way partition
Changed#
Device_scan and device_segmented_scan: inclusive_scan now uses the input-type as accumulator-type, exclusive_scan uses initial-value-type.
This particularly changes the behaviour of small-size input types with large-size output types (e.g. short input, int output), and of low-resolution input with high-resolution output (e.g. float input, double output).
Block merge sort no longer supports non-power-of-two block sizes
hipFFT 1.0.7#
hipFFT 1.0.7 for ROCm 5.1.0
Changed#
Use fft_params struct for accuracy and benchmark clients.
hipRAND 2.10.13#
hipRAND 2.10.13 for ROCm 5.1.0
Changed#
Header file installation location changed to match other libraries.
Code using the hiprand.h header file should now use #include <hiprand/hiprand.h>, rather than #include <hiprand.h>
Symlinks are included for backwards compatibility
hipSOLVER 1.3.0#
hipSOLVER 1.3.0 for ROCm 5.1.0
Added#
Added functions
gels
hipsolverSSgels_bufferSize, hipsolverDDgels_bufferSize, hipsolverCCgels_bufferSize, hipsolverZZgels_bufferSize
hipsolverSSgels, hipsolverDDgels, hipsolverCCgels, hipsolverZZgels
Added library version and device information to hipsolver-test output.
Added compatibility API with hipsolverDn prefix.
Added compatibility-only functions
gesvdj
hipsolverDnSgesvdj_bufferSize, hipsolverDnDgesvdj_bufferSize, hipsolverDnCgesvdj_bufferSize, hipsolverDnZgesvdj_bufferSize
hipsolverDnSgesvdj, hipsolverDnDgesvdj, hipsolverDnCgesvdj, hipsolverDnZgesvdj
gesvdjBatched
hipsolverDnSgesvdjBatched_bufferSize, hipsolverDnDgesvdjBatched_bufferSize, hipsolverDnCgesvdjBatched_bufferSize, hipsolverDnZgesvdjBatched_bufferSize
hipsolverDnSgesvdjBatched, hipsolverDnDgesvdjBatched, hipsolverDnCgesvdjBatched, hipsolverDnZgesvdjBatched
syevj
hipsolverDnSsyevj_bufferSize, hipsolverDnDsyevj_bufferSize, hipsolverDnCheevj_bufferSize, hipsolverDnZheevj_bufferSize
hipsolverDnSsyevj, hipsolverDnDsyevj, hipsolverDnCheevj, hipsolverDnZheevj
syevjBatched
hipsolverDnSsyevjBatched_bufferSize, hipsolverDnDsyevjBatched_bufferSize, hipsolverDnCheevjBatched_bufferSize, hipsolverDnZheevjBatched_bufferSize
hipsolverDnSsyevjBatched, hipsolverDnDsyevjBatched, hipsolverDnCheevjBatched, hipsolverDnZheevjBatched
sygvj
hipsolverDnSsygvj_bufferSize, hipsolverDnDsygvj_bufferSize, hipsolverDnChegvj_bufferSize, hipsolverDnZhegvj_bufferSize
hipsolverDnSsygvj, hipsolverDnDsygvj, hipsolverDnChegvj, hipsolverDnZhegvj
Changed#
The rocSOLVER backend now allows hipsolverXXgels and hipsolverXXgesv to be called in-place when B == X.
The rocSOLVER backend now allows rwork to be passed as a null pointer to hipsolverXgesvd.
Fixed#
bufferSize functions will now return HIPSOLVER_STATUS_NOT_INITIALIZED instead of HIPSOLVER_STATUS_INVALID_VALUE when both handle and lwork are null.
Fixed rare memory allocation failure in syevd/heevd and sygvd/hegvd caused by improper workspace array allocation outside of rocSOLVER.
hipSPARSE 2.1.0#
hipSPARSE 2.1.0 for ROCm 5.1.0
Added#
Added gtsv_interleaved_batch and gpsv_interleaved_batch routines
Add SpGEMM_reuse
Changed#
Replaced BUILD_CUDA with USE_CUDA in install script and cmake files
Updated googletest to 1.11
Improved#
Fixed a bug in SpMM Alg versioning
Known Issues#
none
rccl 2.11.4#
RCCL 2.11.4 for ROCm 5.1.0
Added#
Compatibility with NCCL 2.11.4
Known Issues#
Managed memory is not currently supported for clique-based kernels
rocALUTION 2.0.2#
rocALUTION 2.0.2 for ROCm 5.1.0
Added#
Added out-of-place matrix transpose functionality
Added LocalVector<bool>
rocBLAS 2.43.0#
rocBLAS 2.43.0 for ROCm 5.1.0
Added#
Option to install script for number of jobs to use for rocBLAS and Tensile compilation (-j, --jobs)
Option to install script to build clients without using any Fortran (--clients_no_fortran)
rocblas_client_initialize function, to perform rocBLAS initialization for clients (benchmark/test) and report the execution time.
Added tests for output of reduction functions when given bad input
Added user specified initialization (rand_int/trig_float/hpl) for initializing matrices and vectors in rocblas-bench
Optimizations#
Improved performance of trsm with side == left and n == 1
Improved performance of trsm with side == left and m <= 32 along with side == right and n <= 32
Changed#
For syrkx and trmm internal API use rocblas_stride datatype for offset
For non-batched and batched gemm_ex functions if the C matrix pointer equals the D matrix pointer (aliased) their respective type and leading dimension arguments must now match
Test client dependencies updated to GTest 1.11
Moved non-global false positives reported by cppcheck from file-based suppression to inline suppression. File-based suppression will only be used for global false positives.
Help menu messages in install.sh
For ger function, typecast the ‘lda’(offset) datatype to size_t during offset calculation to avoid overflow and remove duplicate template functions.
Modified default initialization from rand_int to hpl for initializing matrices and vectors in rocblas-bench
Fixed#
For function trmv (non-transposed cases) avoid overflow in offset calculation
Fixed cppcheck errors/warnings
Fixed doxygen warnings
rocFFT 1.0.16#
rocFFT 1.0.16 for ROCm 5.1.0
Changed#
Supported unaligned tile dimension for SBRC_2D kernels.
Improved (more RAII) test and benchmark infrastructure.
Enabled runtime compilation of length-2304 FFT kernel during plan creation.
Optimizations#
Optimized more large 1D cases by using L1D_CC plan.
Optimized 3D 200^3 C2R case.
Optimized 1D 2^30 double precision on MI200.
Fixed#
Fixed correctness of some R2C transforms with unusual strides.
Removed#
The hipFFT API (header) has been removed from rocFFT after a long deprecation period. Please use the hipFFT package/repository to obtain the hipFFT API.
rocPRIM 2.10.13#
rocPRIM 2.10.13 for ROCm 5.1.0
Fixed#
Fixed radix sort int64_t bug introduced in [2.10.11]
Added#
Future value
Added device partition_three_way to partition input to three output iterators based on two predicates
Changed#
The reduce/scan algorithm precision issues in the tests have been resolved for half types.
Known Issues#
device_segmented_radix_sort unit test failing for HIP on Windows
rocRAND 2.10.13#
rocRAND 2.10.13 for ROCm 5.1.0
Added#
Generating a random sequence in calls of different sizes now produces the same sequence without gaps, independent of how many values are generated per call.
Only in the case of XORWOW, MRG32K3A, PHILOX4X32_10, SOBOL32 and SOBOL64
For performance reasons, this only holds true if the size in each call is a divisor of the distribution's output_width
Similarly, the output pointer has to be aligned to output_width * sizeof(output_type)
Changed#
hipRAND split into a separate package
Header file installation location changed to match other libraries.
Code using the rocrand.h header file should now use #include <rocrand/rocrand.h>, rather than #include <rocrand.h>
rocRAND still includes hipRAND using a submodule
The rocRAND package also sets the provides field with hipRAND, so projects which require hipRAND can begin to specify it.
Fixed#
Fix offset behaviour for XORWOW, MRG32K3A and PHILOX4X32_10 generator, setting offset now correctly generates the same sequence starting from the offset.
Only uniform int and float will work as these can be generated with a single call to the generator
Known Issues#
kernel_xorwow unit test is failing for certain GPU architectures.
rocSOLVER 3.17.0#
rocSOLVER 3.17.0 for ROCm 5.1.0
Optimized#
Optimized non-pivoting and batch cases of the LU factorization
Fixed#
Fixed missing synchronization in SYTRF with rocblas_fill_lower that could potentially result in incorrect pivot values.
Fixed multi-level logging output to file with the ROCSOLVER_LOG_PATH, ROCSOLVER_LOG_TRACE_PATH, ROCSOLVER_LOG_BENCH_PATH and ROCSOLVER_LOG_PROFILE_PATH environment variables.
Fixed performance regression in the batched LU factorization of tiny matrices
rocSPARSE 2.1.0#
rocSPARSE 2.1.0 for ROCm 5.1.0
Added#
gtsv_interleaved_batch
gpsv_interleaved_batch
SpGEMM_reuse
Allow copying of mat info struct
Improved#
Optimization for SDDMM
Allow unsorted matrices in csrgemm multipass algorithm
Known Issues#
none
rocThrust 2.14.0#
rocThrust 2.14.0 for ROCm 5.1.0
Added#
Updated to match upstream Thrust 1.15.0
Known Issues#
async_copy, partition, and stable_sort_by_key unit tests are failing on HIP on Windows.
Tensile 4.32.0#
Tensile 4.32.0 for ROCm 5.1.0
Added#
Better control of parallelism to control memory usage
Support for multiprocessing on Windows for TensileCreateLibrary
New JSD metric and metric selection functionality
Initial changes to support two-tier solution selection
Optimized#
Optimized runtime of TensileCreateLibraries by reducing max RAM usage
StoreCInUnroll additional optimizations plus adaptive K support
DGEMM NN optimizations with PrefetchGlobalRead(PGR)=2 support
Changed#
Update Googletest to 1.11.0
Removed#
Remove no longer supported benchmarking steps
ROCm 5.0.2#
Defect fixes#
The following defects are fixed in the ROCm v5.0.2 release.
Issue with hostcall facility in HIP runtime#
In ROCm v5.0, when using the “assert()” call in a HIP kernel, the compiler may sometimes fail to emit kernel metadata related to the hostcall facility, which results in incomplete initialization of the hostcall facility in the HIP runtime. This can cause the HIP kernel to crash when it attempts to execute the “assert()” call.
The root cause was an incorrect check in the compiler to determine whether the hostcall facility is required by the kernel. This is fixed in the ROCm v5.0.2 release.
The resolution includes a compiler change, which emits the required metadata by default, unless the compiler can prove that the hostcall facility is not required by the kernel. This ensures that the “assert()” call never fails.
Note
This fix may lead to breakage in some OpenMP offload use cases, which use print inside a target region and result in an abort in device code. The issue will be fixed in a future release.
The compatibility matrix in the Deep-learning guide is updated for ROCm v5.0.2.
Library changes in ROCm 5.0.2#
| Library | Version |
|---|---|
| hipBLAS | |
| hipCUB | |
| hipFFT | |
| hipSOLVER | |
| hipSPARSE | |
| MIVisionX | 2.0.1 ⇒ 2.1.0 |
| rccl | |
| rocALUTION | |
| rocBLAS | |
| rocFFT | |
| rocPRIM | |
| rocRAND | |
| rocSOLVER | |
| rocSPARSE | |
| rocThrust | |
| Tensile | |
MIVisionX 2.1.0#
MIVisionX for ROCm 5.0.2
Added#
New Tests - AMD_MEDIA
Optimizations#
Readme Updates
HIP Buffer Transfer - Eliminate cupy usage
Changed#
Backend - Default Backend set to HIP
Fixed#
Minor bugs and warnings
AMD_MEDIA - Bug Fixes
Tested Configurations#
Windows 10
Linux distribution
Ubuntu - 18.04 / 20.04
CentOS - 7 / 8
SLES - 15-SP2
ROCm: rocm-dev - 4.5.2.40502-164
rocm-cmake - rocm-4.2.0
MIOpenGEMM - 1.1.5
MIOpen - 2.14.0
Protobuf - V3.12.0
OpenCV - 4.5.5
RPP - 0.92
FFMPEG - n4.0.4
Dependencies for all the above packages
MIVisionX Setup Script - V2.0.0
Known Issues#
TBD
MIVisionX Dependency Map#
Docker Image: sudo docker build -f docker/ubuntu20/{DOCKER_LEVEL_FILE_NAME}.dockerfile -t {mivisionx-level-NUMBER} .
new component added to the level
existing component from the previous level
| Build Level | MIVisionX Dependencies | Modules | Libraries and Executables | Docker Tag |
|---|---|---|---|---|
| | cmake <br> gcc <br> g++ | amd_openvx <br> utilities | | |
| | ROCm OpenCL <br> +Level 1 | amd_openvx <br> amd_openvx_extensions <br> utilities | | |
| | OpenCV <br> FFMPEG <br> +Level 2 | amd_openvx <br> amd_openvx_extensions <br> utilities | | |
| | MIOpenGEMM <br> MIOpen <br> ProtoBuf <br> +Level 3 | amd_openvx <br> amd_openvx_extensions <br> apps <br> utilities | | |
| | AMD_RPP <br> rocAL deps <br> +Level 4 | amd_openvx <br> amd_openvx_extensions <br> apps <br> rocAL <br> utilities | | |
ROCm 5.0.1#
Deprecations and warnings#
Refactor of HIPCC/HIPCONFIG#
In prior ROCm releases, by default, the hipcc/hipconfig Perl scripts were used to identify and set target compiler options, target platform, compiler, and runtime appropriately.
In ROCm v5.0.1, hipcc.bin and hipconfig.bin have been added as the compiled binary implementations of hipcc and hipconfig. These new binaries are currently a work in progress and are considered, and marked as, experimental. ROCm plans to fully transition to hipcc.bin and hipconfig.bin in a future ROCm release. The existing hipcc and hipconfig Perl scripts are renamed to hipcc.pl and hipconfig.pl respectively. New top-level hipcc and hipconfig Perl scripts are created, which can switch between the Perl script and the compiled binary based on the environment variable HIPCC_USE_PERL_SCRIPT.
In ROCm 5.0.1, by default, this environment variable is set to use hipcc and hipconfig through the Perl scripts.
Subsequently, the Perl scripts will no longer be available in ROCm in a future release.
Library changes in ROCm 5.0.1#
ROCm 5.0.0#
What’s new in this release#
HIP enhancements#
The ROCm v5.0 release consists of the following HIP enhancements.
HIP installation guide updates#
The HIP Installation Guide is updated to include building HIP from source on the NVIDIA platform.
Refer to the HIP Installation Guide v5.0 for more details.
Managed memory allocation#
Managed memory, including the __managed__ keyword, is now supported in the HIP combined host/device compilation. Through unified memory allocation, managed memory allows data to be shared and accessible to both the CPU and GPU using a single pointer. The allocation is managed by the AMD GPU driver using the Linux Heterogeneous Memory Management (HMM) mechanism. The user can call managed memory API hipMallocManaged to allocate a large chunk of HMM memory, execute kernels on a device, and fetch data between the host and device as needed.
Note
In a HIP application, it is recommended to do a capability check before calling the managed memory APIs. For example,
int managed_memory = 0;
HIPCHECK(hipDeviceGetAttribute(&managed_memory,
                               hipDeviceAttributeManagedMemory, p_gpuDevice));
if (!managed_memory) {
    printf("info: managed memory access not supported on the device %d\n Skipped\n", p_gpuDevice);
} else {
    HIPCHECK(hipSetDevice(p_gpuDevice));
    HIPCHECK(hipMallocManaged(&Hmm, N * sizeof(T)));
    . . .
}
Note
The managed memory capability check may not be necessary; however, if HMM is not supported, managed malloc will fall back to using system memory. Other managed memory API calls will, then, have
Refer to the HIP API documentation for more details on managed memory APIs.
For the application, see
New environment variable#
The following new environment variable is added in this release:
| Environment Variable | Value | Description |
|---|---|---|
| HSA_COOP_CU_COUNT | 0 or 1 (default is 0) | Some processors support more CUs than can reliably be used in a cooperative dispatch. Setting the environment variable HSA_COOP_CU_COUNT to 1 will cause ROCr to return the correct CU count for cooperative groups through the HSA_AMD_AGENT_INFO_COOPERATIVE_COMPUTE_UNIT_COUNT attribute of hsa_agent_get_info(). Setting HSA_COOP_CU_COUNT to other values, or leaving it unset, will cause ROCr to return the same CU count for the attributes HSA_AMD_AGENT_INFO_COOPERATIVE_COMPUTE_UNIT_COUNT and HSA_AMD_AGENT_INFO_COMPUTE_UNIT_COUNT. Future ROCm releases will make HSA_COOP_CU_COUNT=1 the default. |
Breaking changes#
Runtime breaking change#
The enumerated type hipDeviceAttribute_t in hip_runtime_api.h has been reordered to better match NVIDIA. See below for the difference in enumerated types.
ROCm software will be affected if any of the defined enums listed below are used in the code. Applications built with ROCm v5.0 enumerated types will work with a ROCm 4.5.2 driver. However, undefined behavior will occur when a ROCm v4.5.2 application that uses these enumerated types runs with a ROCm 5.0 runtime.
typedef enum hipDeviceAttribute_t {
- hipDeviceAttributeMaxThreadsPerBlock, ///< Maximum number of threads per block.
- hipDeviceAttributeMaxBlockDimX, ///< Maximum x-dimension of a block.
- hipDeviceAttributeMaxBlockDimY, ///< Maximum y-dimension of a block.
- hipDeviceAttributeMaxBlockDimZ, ///< Maximum z-dimension of a block.
- hipDeviceAttributeMaxGridDimX, ///< Maximum x-dimension of a grid.
- hipDeviceAttributeMaxGridDimY, ///< Maximum y-dimension of a grid.
- hipDeviceAttributeMaxGridDimZ, ///< Maximum z-dimension of a grid.
- hipDeviceAttributeMaxSharedMemoryPerBlock, ///< Maximum shared memory available per block in
- ///< bytes.
- hipDeviceAttributeTotalConstantMemory, ///< Constant memory size in bytes.
- hipDeviceAttributeWarpSize, ///< Warp size in threads.
- hipDeviceAttributeMaxRegistersPerBlock, ///< Maximum number of 32-bit registers available to a
- ///< thread block. This number is shared by all thread
- ///< blocks simultaneously resident on a
- ///< multiprocessor.
- hipDeviceAttributeClockRate, ///< Peak clock frequency in kilohertz.
- hipDeviceAttributeMemoryClockRate, ///< Peak memory clock frequency in kilohertz.
- hipDeviceAttributeMemoryBusWidth, ///< Global memory bus width in bits.
- hipDeviceAttributeMultiprocessorCount, ///< Number of multiprocessors on the device.
- hipDeviceAttributeComputeMode, ///< Compute mode that device is currently in.
- hipDeviceAttributeL2CacheSize, ///< Size of L2 cache in bytes. 0 if the device doesn't have L2
- ///< cache.
- hipDeviceAttributeMaxThreadsPerMultiProcessor, ///< Maximum resident threads per
- ///< multiprocessor.
- hipDeviceAttributeComputeCapabilityMajor, ///< Major compute capability version number.
- hipDeviceAttributeComputeCapabilityMinor, ///< Minor compute capability version number.
- hipDeviceAttributeConcurrentKernels, ///< Device can possibly execute multiple kernels
- ///< concurrently.
- hipDeviceAttributePciBusId, ///< PCI Bus ID.
- hipDeviceAttributePciDeviceId, ///< PCI Device ID.
- hipDeviceAttributeMaxSharedMemoryPerMultiprocessor, ///< Maximum Shared Memory Per
- ///< Multiprocessor.
- hipDeviceAttributeIsMultiGpuBoard, ///< Multiple GPU devices.
- hipDeviceAttributeIntegrated, ///< iGPU
- hipDeviceAttributeCooperativeLaunch, ///< Support cooperative launch
- hipDeviceAttributeCooperativeMultiDeviceLaunch, ///< Support cooperative launch on multiple devices
- hipDeviceAttributeMaxTexture1DWidth, ///< Maximum number of elements in 1D images
- hipDeviceAttributeMaxTexture2DWidth, ///< Maximum dimension width of 2D images in image elements
- hipDeviceAttributeMaxTexture2DHeight, ///< Maximum dimension height of 2D images in image elements
- hipDeviceAttributeMaxTexture3DWidth, ///< Maximum dimension width of 3D images in image elements
- hipDeviceAttributeMaxTexture3DHeight, ///< Maximum dimensions height of 3D images in image elements
- hipDeviceAttributeMaxTexture3DDepth, ///< Maximum dimensions depth of 3D images in image elements
+ hipDeviceAttributeCudaCompatibleBegin = 0,
- hipDeviceAttributeHdpMemFlushCntl, ///< Address of the HDP_MEM_COHERENCY_FLUSH_CNTL register
- hipDeviceAttributeHdpRegFlushCntl, ///< Address of the HDP_REG_COHERENCY_FLUSH_CNTL register
+ hipDeviceAttributeEccEnabled = hipDeviceAttributeCudaCompatibleBegin, ///< Whether ECC support is enabled.
+ hipDeviceAttributeAccessPolicyMaxWindowSize, ///< Cuda only. The maximum size of the window policy in bytes.
+ hipDeviceAttributeAsyncEngineCount, ///< Cuda only. Asynchronous engines number.
+ hipDeviceAttributeCanMapHostMemory, ///< Whether host memory can be mapped into device address space
+ hipDeviceAttributeCanUseHostPointerForRegisteredMem,///< Cuda only. Device can access host registered memory
+ ///< at the same virtual address as the CPU
+ hipDeviceAttributeClockRate, ///< Peak clock frequency in kilohertz.
+ hipDeviceAttributeComputeMode, ///< Compute mode that device is currently in.
+ hipDeviceAttributeComputePreemptionSupported, ///< Cuda only. Device supports Compute Preemption.
+ hipDeviceAttributeConcurrentKernels, ///< Device can possibly execute multiple kernels concurrently.
+ hipDeviceAttributeConcurrentManagedAccess, ///< Device can coherently access managed memory concurrently with the CPU
+ hipDeviceAttributeCooperativeLaunch, ///< Support cooperative launch
+ hipDeviceAttributeCooperativeMultiDeviceLaunch, ///< Support cooperative launch on multiple devices
+ hipDeviceAttributeDeviceOverlap, ///< Cuda only. Device can concurrently copy memory and execute a kernel.
+ ///< Deprecated. Use instead asyncEngineCount.
+ hipDeviceAttributeDirectManagedMemAccessFromHost, ///< Host can directly access managed memory on
+ ///< the device without migration
+ hipDeviceAttributeGlobalL1CacheSupported, ///< Cuda only. Device supports caching globals in L1
+ hipDeviceAttributeHostNativeAtomicSupported, ///< Cuda only. Link between the device and the host supports native atomic operations
+ hipDeviceAttributeIntegrated, ///< Device is integrated GPU
+ hipDeviceAttributeIsMultiGpuBoard, ///< Multiple GPU devices.
+ hipDeviceAttributeKernelExecTimeout, ///< Run time limit for kernels executed on the device
+ hipDeviceAttributeL2CacheSize, ///< Size of L2 cache in bytes. 0 if the device doesn't have L2 cache.
+ hipDeviceAttributeLocalL1CacheSupported, ///< caching locals in L1 is supported
+ hipDeviceAttributeLuid, ///< Cuda only. 8-byte locally unique identifier in 8 bytes. Undefined on TCC and non-Windows platforms
+ hipDeviceAttributeLuidDeviceNodeMask, ///< Cuda only. Luid device node mask. Undefined on TCC and non-Windows platforms
+ hipDeviceAttributeComputeCapabilityMajor, ///< Major compute capability version number.
+ hipDeviceAttributeManagedMemory, ///< Device supports allocating managed memory on this system
+ hipDeviceAttributeMaxBlocksPerMultiProcessor, ///< Cuda only. Max block size per multiprocessor
+ hipDeviceAttributeMaxBlockDimX, ///< Max block size in width.
+ hipDeviceAttributeMaxBlockDimY, ///< Max block size in height.
+ hipDeviceAttributeMaxBlockDimZ, ///< Max block size in depth.
+ hipDeviceAttributeMaxGridDimX, ///< Max grid size in width.
+ hipDeviceAttributeMaxGridDimY, ///< Max grid size in height.
+ hipDeviceAttributeMaxGridDimZ, ///< Max grid size in depth.
+ hipDeviceAttributeMaxSurface1D, ///< Maximum size of 1D surface.
+ hipDeviceAttributeMaxSurface1DLayered, ///< Cuda only. Maximum dimensions of 1D layered surface.
+ hipDeviceAttributeMaxSurface2D, ///< Maximum dimension (width, height) of 2D surface.
+ hipDeviceAttributeMaxSurface2DLayered, ///< Cuda only. Maximum dimensions of 2D layered surface.
+ hipDeviceAttributeMaxSurface3D, ///< Maximum dimension (width, height, depth) of 3D surface.
+ hipDeviceAttributeMaxSurfaceCubemap, ///< Cuda only. Maximum dimensions of Cubemap surface.
+ hipDeviceAttributeMaxSurfaceCubemapLayered, ///< Cuda only. Maximum dimension of Cubemap layered surface.
+ hipDeviceAttributeMaxTexture1DWidth, ///< Maximum size of 1D texture.
+ hipDeviceAttributeMaxTexture1DLayered, ///< Cuda only. Maximum dimensions of 1D layered texture.
+ hipDeviceAttributeMaxTexture1DLinear, ///< Maximum number of elements allocatable in a 1D linear texture.
+ ///< Use cudaDeviceGetTexture1DLinearMaxWidth() instead on Cuda.
+ hipDeviceAttributeMaxTexture1DMipmap, ///< Cuda only. Maximum size of 1D mipmapped texture.
+ hipDeviceAttributeMaxTexture2DWidth, ///< Maximum dimension width of 2D texture.
+ hipDeviceAttributeMaxTexture2DHeight, ///< Maximum dimension hight of 2D texture.
+ hipDeviceAttributeMaxTexture2DGather, ///< Cuda only. Maximum dimensions of 2D texture if gather operations performed.
+ hipDeviceAttributeMaxTexture2DLayered, ///< Cuda only. Maximum dimensions of 2D layered texture.
+ hipDeviceAttributeMaxTexture2DLinear, ///< Cuda only. Maximum dimensions (width, height, pitch) of 2D textures bound to pitched memory.
+ hipDeviceAttributeMaxTexture2DMipmap, ///< Cuda only. Maximum dimensions of 2D mipmapped texture.
+ hipDeviceAttributeMaxTexture3DWidth, ///< Maximum dimension width of 3D texture.
+ hipDeviceAttributeMaxTexture3DHeight, ///< Maximum dimension height of 3D texture.
+ hipDeviceAttributeMaxTexture3DDepth, ///< Maximum dimension depth of 3D texture.
+ hipDeviceAttributeMaxTexture3DAlt, ///< Cuda only. Maximum dimensions of alternate 3D texture.
+ hipDeviceAttributeMaxTextureCubemap, ///< Cuda only. Maximum dimensions of Cubemap texture
+ hipDeviceAttributeMaxTextureCubemapLayered, ///< Cuda only. Maximum dimensions of Cubemap layered texture.
+ hipDeviceAttributeMaxThreadsDim, ///< Maximum dimension of a block
+ hipDeviceAttributeMaxThreadsPerBlock, ///< Maximum number of threads per block.
+ hipDeviceAttributeMaxThreadsPerMultiProcessor, ///< Maximum resident threads per multiprocessor.
+ hipDeviceAttributeMaxPitch, ///< Maximum pitch in bytes allowed by memory copies
+ hipDeviceAttributeMemoryBusWidth, ///< Global memory bus width in bits.
+ hipDeviceAttributeMemoryClockRate, ///< Peak memory clock frequency in kilohertz.
+ hipDeviceAttributeComputeCapabilityMinor, ///< Minor compute capability version number.
+ hipDeviceAttributeMultiGpuBoardGroupID, ///< Cuda only. Unique ID of device group on the same multi-GPU board
+ hipDeviceAttributeMultiprocessorCount, ///< Number of multiprocessors on the device.
+ hipDeviceAttributeName, ///< Device name.
+ hipDeviceAttributePageableMemoryAccess, ///< Device supports coherently accessing pageable memory
+ ///< without calling hipHostRegister on it
+ hipDeviceAttributePageableMemoryAccessUsesHostPageTables, ///< Device accesses pageable memory via the host's page tables
+ hipDeviceAttributePciBusId, ///< PCI Bus ID.
+ hipDeviceAttributePciDeviceId, ///< PCI Device ID.
+ hipDeviceAttributePciDomainID, ///< PCI Domain ID.
+ hipDeviceAttributePersistingL2CacheMaxSize, ///< Cuda11 only. Maximum l2 persisting lines capacity in bytes
+ hipDeviceAttributeMaxRegistersPerBlock, ///< 32-bit registers available to a thread block. This number is shared
+ ///< by all thread blocks simultaneously resident on a multiprocessor.
+ hipDeviceAttributeMaxRegistersPerMultiprocessor, ///< 32-bit registers available per block.
+ hipDeviceAttributeReservedSharedMemPerBlock, ///< Cuda11 only. Shared memory reserved by CUDA driver per block.
+ hipDeviceAttributeMaxSharedMemoryPerBlock, ///< Maximum shared memory available per block in bytes.
+ hipDeviceAttributeSharedMemPerBlockOptin, ///< Cuda only. Maximum shared memory per block usable by special opt in.
+ hipDeviceAttributeSharedMemPerMultiprocessor, ///< Cuda only. Shared memory available per multiprocessor.
+ hipDeviceAttributeSingleToDoublePrecisionPerfRatio, ///< Cuda only. Performance ratio of single precision to double precision.
+ hipDeviceAttributeStreamPrioritiesSupported, ///< Cuda only. Whether to support stream priorities.
+ hipDeviceAttributeSurfaceAlignment, ///< Cuda only. Alignment requirement for surfaces
+ hipDeviceAttributeTccDriver, ///< Cuda only. Whether device is a Tesla device using TCC driver
+ hipDeviceAttributeTextureAlignment, ///< Alignment requirement for textures
+ hipDeviceAttributeTexturePitchAlignment, ///< Pitch alignment requirement for 2D texture references bound to pitched memory;
+ hipDeviceAttributeTotalConstantMemory, ///< Constant memory size in bytes.
+ hipDeviceAttributeTotalGlobalMem, ///< Global memory available on device.
+ hipDeviceAttributeUnifiedAddressing, ///< Cuda only. An unified address space shared with the host.
+ hipDeviceAttributeUuid, ///< Cuda only. Unique ID in 16 byte.
+ hipDeviceAttributeWarpSize, ///< Warp size in threads.
- hipDeviceAttributeMaxPitch, ///< Maximum pitch in bytes allowed by memory copies
- hipDeviceAttributeTextureAlignment, ///<Alignment requirement for textures
- hipDeviceAttributeTexturePitchAlignment, ///<Pitch alignment requirement for 2D texture references bound to pitched memory;
- hipDeviceAttributeKernelExecTimeout, ///<Run time limit for kernels executed on the device
- hipDeviceAttributeCanMapHostMemory, ///<Device can map host memory into device address space
- hipDeviceAttributeEccEnabled, ///<Device has ECC support enabled
+ hipDeviceAttributeCudaCompatibleEnd = 9999,
+ hipDeviceAttributeAmdSpecificBegin = 10000,
- hipDeviceAttributeCooperativeMultiDeviceUnmatchedFunc, ///< Supports cooperative launch on multiple
- ///devices with unmatched functions
- hipDeviceAttributeCooperativeMultiDeviceUnmatchedGridDim, ///< Supports cooperative launch on multiple
- ///devices with unmatched grid dimensions
- hipDeviceAttributeCooperativeMultiDeviceUnmatchedBlockDim, ///< Supports cooperative launch on multiple
- ///devices with unmatched block dimensions
- hipDeviceAttributeCooperativeMultiDeviceUnmatchedSharedMem, ///< Supports cooperative launch on multiple
- ///devices with unmatched shared memories
- hipDeviceAttributeAsicRevision, ///< Revision of the GPU in this device
- hipDeviceAttributeManagedMemory, ///< Device supports allocating managed memory on this system
- hipDeviceAttributeDirectManagedMemAccessFromHost, ///< Host can directly access managed memory on
- /// the device without migration
- hipDeviceAttributeConcurrentManagedAccess, ///< Device can coherently access managed memory
- /// concurrently with the CPU
- hipDeviceAttributePageableMemoryAccess, ///< Device supports coherently accessing pageable memory
- /// without calling hipHostRegister on it
- hipDeviceAttributePageableMemoryAccessUsesHostPageTables, ///< Device accesses pageable memory via
- /// the host's page tables
- hipDeviceAttributeCanUseStreamWaitValue ///< '1' if Device supports hipStreamWaitValue32() and
- ///< hipStreamWaitValue64() , '0' otherwise.
+ hipDeviceAttributeClockInstructionRate = hipDeviceAttributeAmdSpecificBegin, ///< Frequency in khz of the timer used by the device-side "clock*"
+ hipDeviceAttributeArch, ///< Device architecture
+ hipDeviceAttributeMaxSharedMemoryPerMultiprocessor, ///< Maximum Shared Memory PerMultiprocessor.
+ hipDeviceAttributeGcnArch, ///< Device gcn architecture
+ hipDeviceAttributeGcnArchName, ///< Device gcnArch name in 256 bytes
+ hipDeviceAttributeHdpMemFlushCntl, ///< Address of the HDP_MEM_COHERENCY_FLUSH_CNTL register
+ hipDeviceAttributeHdpRegFlushCntl, ///< Address of the HDP_REG_COHERENCY_FLUSH_CNTL register
+ hipDeviceAttributeCooperativeMultiDeviceUnmatchedFunc, ///< Supports cooperative launch on multiple
+ ///< devices with unmatched functions
+ hipDeviceAttributeCooperativeMultiDeviceUnmatchedGridDim, ///< Supports cooperative launch on multiple
+ ///< devices with unmatched grid dimensions
+ hipDeviceAttributeCooperativeMultiDeviceUnmatchedBlockDim, ///< Supports cooperative launch on multiple
+ ///< devices with unmatched block dimensions
+ hipDeviceAttributeCooperativeMultiDeviceUnmatchedSharedMem, ///< Supports cooperative launch on multiple
+ ///< devices with unmatched shared memories
+ hipDeviceAttributeIsLargeBar, ///< Whether it is LargeBar
+ hipDeviceAttributeAsicRevision, ///< Revision of the GPU in this device
+ hipDeviceAttributeCanUseStreamWaitValue, ///< '1' if Device supports hipStreamWaitValue32() and
+ ///< hipStreamWaitValue64() , '0' otherwise.
+ hipDeviceAttributeAmdSpecificEnd = 19999,
+ hipDeviceAttributeVendorSpecificBegin = 20000,
+ // Extended attributes for vendors
} hipDeviceAttribute_t;
enum hipComputeMode {
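The `End` and `Begin` markers in the diff above split the attribute values into numbered ranges. The following is a minimal sketch of how a caller might classify a raw attribute value by range. The four constants are copied from the diff; `AttributeRange`, `classify_attribute`, and the assumption that the CUDA-compatible range starts at 0 are hypothetical and not part of the HIP API.

```cpp
// Range markers copied from the hipDeviceAttribute_t diff.
constexpr int kCudaCompatibleEnd = 9999;
constexpr int kAmdSpecificBegin = 10000;
constexpr int kAmdSpecificEnd = 19999;
constexpr int kVendorSpecificBegin = 20000;

// Hypothetical helper type for illustration; not part of the HIP API.
enum class AttributeRange { CudaCompatible, AmdSpecific, VendorSpecific, Unknown };

// Maps a raw attribute value to the range it falls into.
// Assumption: the CUDA-compatible range starts at 0.
inline AttributeRange classify_attribute(int value) {
    if (value >= 0 && value <= kCudaCompatibleEnd) {
        return AttributeRange::CudaCompatible;
    }
    if (value >= kAmdSpecificBegin && value <= kAmdSpecificEnd) {
        return AttributeRange::AmdSpecific;
    }
    if (value >= kVendorSpecificBegin) {
        return AttributeRange::VendorSpecific;
    }
    return AttributeRange::Unknown;
}
```

For example, `classify_attribute(kAmdSpecificBegin)` yields `AttributeRange::AmdSpecific`, which is the value the diff assigns to `hipDeviceAttributeClockInstructionRate`.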
Known issues#
Incorrect dGPU behavior when using AMDVBFlash tool#
The AMDVBFlash tool, used for flashing the VBIOS image to dGPU, does not communicate with the ROM Controller specifically when the driver is present. This is because the driver, as part of its runtime power management feature, puts the dGPU to a sleep state.
As a workaround, users can set the amdgpu.runpm=0 driver parameter, which temporarily disables the driver's runtime power management feature and dynamically changes some power control-related sysfs files.
Issue with START timestamp in ROCProfiler#
Users may encounter an issue with the enabled timestamp functionality for monitoring one or multiple counters. ROCProfiler outputs the following four timestamps for each kernel:
Dispatch
Start
End
Complete
Issue#
This defect is related to the Start timestamp functionality, which incorrectly shows an earlier time than the Dispatch timestamp.
To reproduce the issue:
1. Enable timing using the --timestamp on flag.
2. Use the -i option with the input filename that contains the name of the counter(s) to monitor.
3. Run the program.
4. Check the output result file.
Current behavior#
BeginNS is lower than DispatchNS, which is incorrect.
Expected behavior#
The correct order is:
Dispatch < Start < End < Complete
Users cannot use ROCProfiler to measure the time spent on each kernel because of the incorrect timestamp with counter collection enabled.
Recommended workaround#
Users are recommended to collect kernel execution timestamps without monitoring counters, as follows:
1. Enable timing using the --timestamp on flag, and run the application.
2. Rerun the application using the -i option with the input filename that contains the name of the counter(s) to monitor, and save this to a different output file using the -o flag.
3. Check the output result file from step 1.
The order of timestamps correctly displays as: DispatchNS < BeginNS < EndNS < CompleteNS
Users can find the values of the collected counters in the output file generated in step 2.
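The expected ordering can also be checked programmatically once the four timestamps have been read from the output file. The following is a minimal sketch under stated assumptions: the `KernelTimestamps` struct and both helper functions are hypothetical names for illustration, and parsing the ROCProfiler output file is left out.

```cpp
#include <cstdint>

// Hypothetical record holding the four per-kernel timestamps, in nanoseconds.
struct KernelTimestamps {
    std::uint64_t dispatch_ns;
    std::uint64_t begin_ns;
    std::uint64_t end_ns;
    std::uint64_t complete_ns;
};

// Returns true when Dispatch < Start < End < Complete holds.
inline bool timestamps_in_order(const KernelTimestamps& t) {
    return t.dispatch_ns < t.begin_ns &&
           t.begin_ns < t.end_ns &&
           t.end_ns < t.complete_ns;
}

// Kernel execution time; only meaningful when the timestamps are in order.
inline std::uint64_t kernel_duration_ns(const KernelTimestamps& t) {
    return t.end_ns - t.begin_ns;
}
```

A record whose `begin_ns` is lower than its `dispatch_ns` fails the check, which matches the defect described in this section.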
Radeon Pro V620 and W6800 workstation GPUs#
No support for SMI and ROCDebugger on SRIOV#
System Management Interface (SMI) and ROCDebugger are not supported in the SRIOV environment on any GPU. For more information, refer to the Systems Management Interface documentation.
Deprecations and warnings#
ROCm libraries changes – deprecations and deprecation removal#
The hipFFT.h header is now provided only by the hipFFT package. Up to ROCm 5.0, users would get hipFFT.h in the rocFFT package too.
The GlobalPairwiseAMG class is now entirely removed; users should use the PairwiseAMG class instead.
The rocsparse_spmm signature in 5.0 was changed to match that of rocsparse_spmm_ex. In 5.0, rocsparse_spmm_ex is still present, but deprecated.
Signature diff for rocsparse_spmm
rocsparse_spmm in 5.0
rocsparse_status rocsparse_spmm(rocsparse_handle handle,
                                rocsparse_operation trans_A,
                                rocsparse_operation trans_B,
                                const void* alpha,
                                const rocsparse_spmat_descr mat_A,
                                const rocsparse_dnmat_descr mat_B,
                                const void* beta,
                                const rocsparse_dnmat_descr mat_C,
                                rocsparse_datatype compute_type,
                                rocsparse_spmm_alg alg,
                                rocsparse_spmm_stage stage,
                                size_t* buffer_size,
                                void* temp_buffer);
rocsparse_spmm in 4.0
rocsparse_status rocsparse_spmm(rocsparse_handle handle,
                                rocsparse_operation trans_A,
                                rocsparse_operation trans_B,
                                const void* alpha,
                                const rocsparse_spmat_descr mat_A,
                                const rocsparse_dnmat_descr mat_B,
                                const void* beta,
                                const rocsparse_dnmat_descr mat_C,
                                rocsparse_datatype compute_type,
                                rocsparse_spmm_alg alg,
                                size_t* buffer_size,
                                void* temp_buffer);
HIP API deprecations and warnings#
Warning - arithmetic operators of HIP complex and vector types#
In this release, arithmetic operators of HIP complex and vector types are deprecated.
As alternatives to arithmetic operators of HIP complex types, users can use arithmetic operators of std::complex types.
As alternatives to arithmetic operators of HIP vector types, users can use the operators of the native clang vector type associated with the data member of HIP vector types.
During the deprecation, two macros, _HIP_ENABLE_COMPLEX_OPERATORS and _HIP_ENABLE_VECTOR_OPERATORS, are provided to allow users to conditionally enable arithmetic operators of HIP complex or vector types.
Note that the two macros are mutually exclusive and, by default, set to Off.
The arithmetic operators of HIP complex and vector types will be removed in a future release.
Refer to the HIP API Guide for more information.
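As an illustration of the std::complex alternative mentioned above, the following is a minimal host-side sketch. The helper name `complex_axpy` is hypothetical, and the sketch makes no claim about how std::complex behaves in device code.

```cpp
#include <complex>

// Hypothetical helper: computes alpha * x + y using the arithmetic
// operators of std::complex instead of HIP complex type operators.
inline std::complex<float> complex_axpy(std::complex<float> alpha,
                                        std::complex<float> x,
                                        std::complex<float> y) {
    return alpha * x + y;
}
```

With alpha = 1 + 2i, x = 3 + 4i, and y = 0.5 - i, the product alpha * x is -5 + 10i, so the helper returns -4.5 + 9i.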
Warning - compiler-generated code object version 4 deprecation#
Support for loading compiler-generated code object version 4 will be deprecated in a future release with no release announcement and replaced with code object 5 as the default version.
The current default is code object version 4.
Warning - MIOpenTensile deprecation#
MIOpenTensile will be deprecated in a future release.
Library changes in ROCm 5.0.0#
Library | Version |
---|---|
hipBLAS | 0.49.0 |
hipCUB | 2.10.13 |
hipFFT | 1.0.4 |
hipSOLVER | 1.2.0 |
hipSPARSE | 2.0.0 |
MIVisionX | 2.0.1 |
rccl | 2.10.3 |
rocALUTION | 2.0.1 |
rocBLAS | 2.42.0 |
rocFFT | 1.0.13 |
rocPRIM | 2.10.12 |
rocRAND | 2.10.12 |
rocSOLVER | 3.16.0 |
rocSPARSE | 2.0.0 |
rocThrust | 2.13.0 |
Tensile | 4.31.0 |
hipBLAS 0.49.0#
hipBLAS 0.49.0 for ROCm 5.0.0
Added#
Added rocSOLVER functions to hipblas-bench
Added option ROCM_MATHLIBS_API_USE_HIP_COMPLEX to opt-in to use hipFloatComplex and hipDoubleComplex
Added compilation warning for future trmm changes
Added documentation to hipblas.h
Added option to forgo pivoting for getrf and getri when ipiv is nullptr
Added code coverage option
Fixed#
Fixed use of incorrect ‘HIP_PATH’ when building from source.
Fixed windows packaging
Allowing negative increments in hipblas-bench
Removed boost dependency
hipCUB 2.10.13#
hipCUB 2.10.13 for ROCm 5.0.0
Fixed#
Added missing includes to hipcub.hpp
Added#
Bfloat16 support to test cases (device_reduce & device_radix_sort)
Device merge sort
Block merge sort
API update to CUB 1.14.0
Changed#
The SetupNVCC.cmake automatic target selector now selects all of the capabilities of all available cards for the NVIDIA backend.
hipFFT 1.0.4#
hipFFT 1.0.4 for ROCm 5.0.0
Fixed#
Add calls to rocFFT setup/cleanup.
Cmake fixes for clients and backend support.
Added#
Added support for Windows 10 as a build target.
hipSOLVER 1.2.0#
hipSOLVER 1.2.0 for ROCm 5.0.0
Added#
Added functions
sytrf
hipsolverSsytrf_bufferSize, hipsolverDsytrf_bufferSize, hipsolverCsytrf_bufferSize, hipsolverZsytrf_bufferSize
hipsolverSsytrf, hipsolverDsytrf, hipsolverCsytrf, hipsolverZsytrf
Fixed#
Fixed use of incorrect HIP_PATH when building from source (#40). Thanks @jakub329homola!
hipSPARSE 2.0.0#
hipSPARSE 2.0.0 for ROCm 5.0.0
Added#
Added (conjugate) transpose support for csrmv, hybmv and spmv routines
MIVisionX 2.0.1#
MIVisionX 2.0.1 for ROCm 5.0.0
Added#
Support for cmake 3.22.X
Support for OpenCV 4.X.X
Support for mv_compile with the HIP GPU backend
Support for tensor_compare node (less/greater/less_than/greater_than/equal onnx operators)
Optimizations#
Code Cleanup
Readme Updates
Changed#
License Updates
Fixed#
Minor bugs and warnings
Inference server application - OpenCL Backend
vxCreateThreshold Fix - Apps & Sample
Tested Configurations#
Windows 10
Linux distribution
Ubuntu - 18.04 / 20.04
CentOS - 7 / 8
SLES - 15-SP2
ROCm: rocm-dev - 4.5.2.40502-164
rocm-cmake - rocm-4.2.0
MIOpenGEMM - 1.1.5
MIOpen - 2.14.0
Protobuf - V3.12.0
OpenCV - 3.4.0
RPP - 0.92
FFMPEG - n4.0.4
Dependencies for all the above packages
MIVisionX Setup Script - V2.0.0
Known Issues#
Package install requires OpenCV v3.4.X to execute AMD OpenCV extensions
MIVisionX Dependency Map#
Docker Image: docker pull kiritigowda/ubuntu-18.04:{TAGNAME}
new component added to the level
existing component from the previous level
| Build Level | MIVisionX Dependencies | Modules | Libraries and Executables | Docker Tag |
|---|---|---|---|---|
| | cmake <br> gcc <br> g++ | amd_openvx <br> utilities | | |
| | ROCm OpenCL <br> +Level 1 | amd_openvx <br> amd_openvx_extensions <br> utilities | | |
| | OpenCV <br> FFMPEG <br> +Level 2 | amd_openvx <br> amd_openvx_extensions <br> utilities | | |
| | MIOpenGEMM <br> MIOpen <br> ProtoBuf <br> +Level 3 | amd_openvx <br> amd_openvx_extensions <br> apps <br> utilities | | |
| | AMD_RPP <br> rocAL deps <br> +Level 4 | amd_openvx <br> amd_openvx_extensions <br> apps <br> rocAL <br> utilities | | |
rccl 2.10.3#
RCCL 2.10.3 for ROCm 5.0.0
Added#
Compatibility with NCCL 2.10.3
Known Issues#
Managed memory is not currently supported for clique-based kernels
rocALUTION 2.0.1#
rocALUTION 2.0.1 for ROCm 5.0.0
Changed#
Removed deprecated GlobalPairwiseAMG class, please use PairwiseAMG instead.
Changed to C++ 14 Standard
Improved#
Added sanitizer option
Improved documentation
rocBLAS 2.42.0#
rocBLAS 2.42.0 for ROCm 5.0.0
Added#
Added rocblas_get_version_string_size convenience function
Added rocblas_xtrmm_outofplace, an out-of-place version of rocblas_xtrmm
Added hpl and trig initialization for gemm_ex to rocblas-bench
Added source code gemm. It can be used as an alternative to Tensile for debugging and development
Added option ROCM_MATHLIBS_API_USE_HIP_COMPLEX to opt-in to use hipFloatComplex and hipDoubleComplex
Optimizations#
Improved performance of non-batched and batched single-precision GER for size m > 1024. Performance enhanced by 5-10% measured on a MI100 (gfx908) GPU.
Improved performance of non-batched and batched HER for all sizes and data types. Performance enhanced by 2-17% measured on a MI100 (gfx908) GPU.
Changed#
Instantiate templated rocBLAS functions to reduce size of librocblas.so
Removed static library dependency on msgpack
Removed boost dependencies for clients
Fixed#
Option to install script to build only rocBLAS clients with a pre-built rocBLAS library
Correctly set output of nrm2_batched_ex and nrm2_strided_batched_ex when given bad input
Fix for dgmm with side == rocblas_side_left and a negative incx
Fixed out-of-bounds read for small trsm
Fixed numerical checking for tbmv_strided_batched
rocFFT 1.0.13#
rocFFT 1.0.13 for ROCm 5.0.0
Optimizations#
Improved many plans by removing unnecessary transpose steps.
Optimized scheme selection for 3D problems.
Imposed fewer restrictions on 3D_BLOCK_RC selection. More problems can use 3D_BLOCK_RC and gain some performance.
Enabled 3D_RC. Some 3D problems with SBCC-supported z-dim can use fewer kernels and benefit from it.
Forced --length 336 336 56 (dp) to use the faster 3D_RC so that it is not skipped by the conservative threshold test.
Optimized some even-length R2C/C2R cases by doing more operations in-place and combining pre/post processing into Stockham kernels.
Added radix-17.
Added#
Added new kernel generator for select fused-2D transforms.
Fixed#
Improved large 1D transform decompositions.
rocPRIM 2.10.12#
rocPRIM 2.10.12 for ROCm 5.0.0
Fixed#
Enable bfloat16 tests and reduce threshold for bfloat16
Fix device scan limit_size feature
Non-optimized builds no longer trigger local memory limit errors
Added#
Added scan size limit feature
Added reduce size limit feature
Added transform size limit feature
Add block_load_striped and block_store_striped
Add gather_to_blocked to gather values from other threads into a blocked arrangement
The block sizes for device merge sort's initial block sort and its merge steps are now separate in its kernel config
The block sort step supports multiple items per thread
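The striped and blocked arrangements named in the list above differ in how a block's items are assigned to threads. The following host-side sketch shows the index mapping commonly used to describe the two layouts. The helper names are hypothetical and are not rocPRIM functions; the sketch only models which element of a block-wide array each thread would own.

```cpp
#include <cstddef>

// Blocked arrangement: thread t owns the consecutive items
// [t * items_per_thread, (t + 1) * items_per_thread).
inline std::size_t blocked_index(std::size_t thread,
                                 std::size_t item,
                                 std::size_t items_per_thread) {
    return thread * items_per_thread + item;
}

// Striped arrangement: consecutive items of one thread are
// block_size elements apart.
inline std::size_t striped_index(std::size_t thread,
                                 std::size_t item,
                                 std::size_t block_size) {
    return item * block_size + thread;
}
```

With 4 threads and 2 items per thread, thread 1 owns elements 2 and 3 in the blocked arrangement and elements 1 and 5 in the striped arrangement.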
Changed#
size_limit for scan, reduce and transform can now be set in the config struct instead of a parameter
Device_scan and device_segmented_scan: inclusive_scan now uses the input-type as accumulator-type, exclusive_scan uses initial-value-type.
This particularly changes behaviour of small-size input types with large-size output types (e.g. short input, int output).
And low-res input with high-res output (e.g. float input, double output)
Reverted the old Fiji workaround, because the issue was solved on the compiler side
Update README cmake minimum version number
Block sort supports multiple items per thread
Currently, only power-of-two block sizes and items per thread are supported, and only for full blocks
Bumped the minimum required version of CMake to 3.16
Known Issues#
Unit tests may soft hang on MI200 when running in hipMallocManaged mode.
device_segmented_radix_sort, device_scan unit tests failing for HIP on Windows
ReduceEmptyInput causes random failure with bfloat16
rocRAND 2.10.12#
rocRAND 2.10.12 for ROCm 5.0.0
Changed#
No updates or changes for ROCm 5.0.0.
rocSOLVER 3.16.0#
rocSOLVER 3.16.0 for ROCm 5.0.0
Added#
Symmetric matrix factorizations:
LASYF
SYTF2, SYTRF (with batched and strided_batched versions)
Added rocsolver_get_version_string_size to help with version string queries
Added rocblas_layer_mode_ex and the ability to print kernel calls in the trace and profile logs
Expanded batched and strided_batched sample programs.
Optimized#
Improved general performance of LU factorization
Increased parallelism of specialized kernels when compiling from source, reducing build times on multi-core systems.
Changed#
The rocsolver-test client now prints the rocSOLVER version used to run the tests, rather than the version used to build them
The rocsolver-bench client now prints the rocSOLVER version used in the benchmark
Fixed#
Added missing stdint.h include to rocsolver.h
rocSPARSE 2.0.0#
rocSPARSE 2.0.0 for ROCm 5.0.0
Added#
csrmv, coomv, ellmv, hybmv for (conjugate) transposed matrices
csrmv for symmetric matrices
Changed#
spmm_ex is now deprecated and will be removed in the next major release
Improved#
Optimization for gtsv
rocThrust 2.13.0#
rocThrust 2.13.0 for ROCm 5.0.0
Added#
Updated to match upstream Thrust 1.13.0
Updated to match upstream Thrust 1.14.0
Added async scan
Changed#
Scan algorithms: inclusive_scan now uses the input-type as accumulator-type, exclusive_scan uses initial-value-type.
This particularly changes behaviour of small-size input types with large-size output types (e.g. short input, int output).
And low-res input with high-res output (e.g. float input, double output)
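The effect of the accumulator-type change can be illustrated with a host-side model. The sketch below is not the rocThrust or rocPRIM implementation; `inclusive_scan_with_acc` is a hypothetical helper that makes the accumulator type an explicit template parameter. It uses an 8-bit unsigned input type rather than short so that the wraparound is well defined in C++.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical host-side model of an inclusive scan (prefix sum) in which
// the accumulator type is chosen independently of the output type.
template <typename AccT, typename OutT, typename InT>
std::vector<OutT> inclusive_scan_with_acc(const std::vector<InT>& in) {
    std::vector<OutT> out;
    out.reserve(in.size());
    AccT acc{};  // running sum, stored in the accumulator type
    for (const InT& x : in) {
        // The sum is narrowed back to AccT at every step, so a small
        // accumulator type wraps even when OutT could hold the result.
        acc = static_cast<AccT>(acc + x);
        out.push_back(static_cast<OutT>(acc));
    }
    return out;
}
```

For the input {200, 100} stored as std::uint8_t, accumulating in the input type gives {200, 44}, because 300 wraps modulo 256, while accumulating in int gives {200, 300}. Callers that relied on the wider output type for accumulation may need to convert the input to the wider type first.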
Tensile 4.31.0#
Tensile 4.31.0 for ROCm 5.0.0
Added#
DirectToLds support (x2/x4)
DirectToVgpr support for DGEMM
Parameter to control number of files kernels are merged into to better parallelize kernel compilation
FP16 alternate implementation for HPA HGEMM on aldebaran
Optimized#
Add DGEMM NN custom kernel for HPL on aldebaran
Changed#
Update tensile_client executable to std=c++14
Removed#
Remove unused old Tensile client code
Fixed#
Fix hipErrorInvalidHandle during benchmarks
Fix addrVgpr for atomic GSU
Fix for Python 3.8: add case for Constant nodeType
Fix architecture mapping for gfx1011 and gfx1012
Fix PrintSolutionRejectionReason verbiage in KernelWriter.py
Fix vgpr alignment problem when enabling flat buffer load