AMD ROCm™ documentation#

Applies to Linux and Windows

2024-04-18

Welcome to the ROCm docs home page! If you’re new to ROCm, you can review the following resources to learn more about our products and what we support:

You can install ROCm on our Radeon™, Radeon™ PRO, and Instinct™ GPUs. If you’re using Radeon GPUs, we recommend reading the Radeon-specific ROCm documentation.

For hands-on applications, refer to our ROCm blogs site.

Our documentation is organized into the following categories:

What is ROCm?#

ROCm is an open-source stack, composed primarily of open-source software, designed for graphics processing unit (GPU) computation. ROCm consists of a collection of drivers, development tools, and APIs that enable GPU programming from low-level kernel to end-user applications.

AMD's ROCm software stack and neighboring technologies.

ROCm is powered by Heterogeneous-computing Interface for Portability (HIP); it supports programming models, such as OpenMP and OpenCL, and includes all necessary open source software compilers, debuggers, and libraries. It’s fully integrated into machine learning (ML) frameworks, such as PyTorch and TensorFlow.
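
As a concrete taste of HIP, the following minimal vector-add program is a sketch (it is not taken from this page): it allocates device memory, copies data, launches a kernel with the triple-chevron syntax, and copies the result back, using only core HIP runtime calls. It can be compiled with hipcc.

    #include <hip/hip_runtime.h>
    #include <cstdio>
    #include <vector>

    // Element-wise addition kernel: c[i] = a[i] + b[i]
    __global__ void vector_add(const float* a, const float* b, float* c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            c[i] = a[i] + b[i];
        }
    }

    int main() {
        const int n = 1 << 20;
        std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);

        float *d_a, *d_b, *d_c;
        hipMalloc(&d_a, n * sizeof(float));
        hipMalloc(&d_b, n * sizeof(float));
        hipMalloc(&d_c, n * sizeof(float));

        hipMemcpy(d_a, a.data(), n * sizeof(float), hipMemcpyHostToDevice);
        hipMemcpy(d_b, b.data(), n * sizeof(float), hipMemcpyHostToDevice);

        const int threads = 256;
        const int blocks = (n + threads - 1) / threads;
        vector_add<<<blocks, threads>>>(d_a, d_b, d_c, n);

        hipMemcpy(c.data(), d_c, n * sizeof(float), hipMemcpyDeviceToHost);
        printf("c[0] = %f\n", c[0]);  // expect 3.0

        hipFree(d_a);
        hipFree(d_b);
        hipFree(d_c);
        return 0;
    }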

Tip

If you’re using Radeon GPUs, refer to the Radeon-specific ROCm documentation.

ROCm components#

ROCm consists of the following components. For information on the license associated with each component, see ROCm licensing.

Libraries#

Machine Learning & Computer Vision#

  • Composable Kernel: Provides a programming model for writing performance-critical kernels for machine learning workloads across multiple architectures

  • MIGraphX: Graph inference engine that accelerates machine learning model inference

  • MIOpen: An open source deep-learning library

  • MIVisionX: Set of comprehensive computer vision and machine learning libraries, utilities, and applications

  • rocAL: An augmentation library designed to decode and process images and videos

  • rocDecode: High-performance SDK for access to video decoding features on AMD GPUs

  • ROCm Performance Primitives (RPP): Comprehensive high-performance computer vision library for AMD processors with HIP/OpenCL/CPU back-ends

Communication#

  • RCCL: Standalone library that provides multi-GPU and multi-node collective communication primitives

Math#

  • half: C++ header-only library that provides an IEEE 754 conformant, 16-bit half-precision floating-point type, along with corresponding arithmetic operators, type conversions, and common mathematical functions

  • hipBLAS: BLAS-marshaling library that supports rocBLAS and cuBLAS backends

  • hipBLASLt: Provides general matrix-matrix operations with a flexible API and extends functionalities beyond traditional BLAS library

  • hipFFT: Fast Fourier transforms (FFT)-marshalling library that supports rocFFT or cuFFT backends

  • hipfort: Fortran interface library for accessing GPU Kernels

  • hipRAND: Ports CUDA applications that use the cuRAND library into the HIP layer

  • hipSOLVER: An LAPACK-marshalling library that supports rocSOLVER and cuSOLVER backends

  • hipSPARSE: SPARSE-marshalling library that supports rocSPARSE and cuSPARSE backends

  • hipSPARSELt: SPARSE-marshalling library with multiple supported backends

  • rocALUTION: Sparse linear algebra library for exploring fine-grained parallelism on ROCm runtime and toolchains

  • rocBLAS: BLAS implementation (in the HIP programming language) on the ROCm runtime and toolchains

  • rocFFT: Software library for computing fast Fourier transforms (FFTs) written in HIP

  • rocRAND: Provides functions that generate pseudorandom and quasirandom numbers

  • rocSOLVER: An implementation of LAPACK routines on ROCm software, implemented in the HIP programming language and optimized for AMD’s latest discrete GPUs

  • rocSPARSE: Exposes a common interface that provides BLAS for sparse computation implemented on ROCm runtime and toolchains (in the HIP programming language)

  • rocWMMA: C++ library for accelerating mixed-precision matrix multiply-accumulate (MMA) operations

  • Tensile: Creates benchmark-driven backend libraries for GEMMs, GEMM-like problems, and general N-dimensional tensor contractions

Primitives#

  • hipCUB: Thin header-only wrapper library on top of rocPRIM or CUB that allows project porting using the CUB library to the HIP layer

  • hipTensor: AMD’s C++ library for accelerating tensor primitives based on the composable kernel library

  • rocPRIM: Header-only library for HIP parallel primitives

  • rocThrust: Parallel algorithm library

Tools#

  • AMD SMI: C library for Linux that provides a user space interface for applications to monitor and control AMD devices

  • HIPIFY: Translates CUDA source code into portable HIP C++

  • Radeon Compute Profiler (RCP): Performance analysis tool that gathers data from the API runtime and GPU for OpenCL and ROCm/HSA applications

  • RocBandwidthTest: Captures the performance characteristics of buffer copying and kernel read/write operations

  • ROCmCC: Clang/LLVM-based compiler

  • ROCm CMake: Collection of CMake modules for common build and development tasks

  • ROCm Data Center Tool: Simplifies administration and addresses key infrastructure challenges in AMD GPUs in cluster and data-center environments

  • ROCm Debug Agent (ROCdebug-agent): Prints the state of all AMD GPU wavefronts that caused a queue error by sending a SIGQUIT signal to the process while the program is running

  • ROCm Debugger (ROCgdb): Source-level debugger for Linux, based on the GNU Debugger (GDB)

  • ROCdbgapi: ROCm debugger API library

  • rocminfo: Reports system information

  • ROCm SMI: C library for Linux that provides a user space interface for applications to monitor and control GPU applications

  • ROCm Validation Suite: Detects and troubleshoots common problems affecting AMD GPUs running in a high-performance computing environment

  • ROCProfiler: Profiling tool for HIP applications

  • ROCTracer: Intercepts runtime API calls and traces asynchronous activity

  • TransferBench: Utility to benchmark simultaneous transfers between user-specified devices (CPUs/GPUs)

Compilers#

  • AOMP: Scripted build of LLVM and supporting software

  • FLANG: An out-of-tree Fortran compiler targeting LLVM

  • hipCC: Compiler driver utility that calls Clang or NVCC and passes the appropriate include and library options for the target compiler and HIP infrastructure

  • LLVM (amdclang): Toolkit for the construction of highly optimized compilers, optimizers, and runtime environments

Runtimes#

  • AMD Common Language Runtime (CLR): Contains source code for AMD’s common language runtimes: HIP and OpenCL

  • HIP: AMD’s GPU programming language extension and the GPU runtime

  • ROCR-Runtime: User-mode API interfaces and libraries necessary for host applications to launch compute kernels on available HSA ROCm kernel agents

ROCm 6.1 release highlights#

The ROCm™ 6.1 release consists of new features and fixes to improve the stability and performance of AMD Instinct™ MI300 GPU applications. Notably, we’ve added:

  • Full support for Ubuntu 22.04.4.

  • rocDecode, a new ROCm component that provides high-performance video decode support for AMD GPUs. With rocDecode, you can decode compressed video streams while keeping the resulting YUV frames in video memory. With decoded frames in video memory, you can run video post-processing using ROCm HIP, avoiding unnecessary data copies via the PCIe bus.

    To learn more, refer to the rocDecode documentation.

OS and GPU support changes#

ROCm 6.1 adds the following operating system support:

  • MI300A: Ubuntu 22.04.4 and RHEL 9.3

  • MI300X: Ubuntu 22.04.4

Future releases will add additional operating systems to match the general offering. For older generations of supported AMD Instinct products, we’ve added Ubuntu 22.04.4 support.

Tip

To view the complete list of supported GPUs and operating systems, refer to the system requirements page for Linux and Windows.

Installation packages#

This release includes a new set of packages for every module (all libraries and binaries default to DT_RPATH). Package names have the suffix rpath; for example, the rpath variant of rocminfo is rocminfo-rpath.

Warning

The new rpath packages will conflict with the default packages; they are meant to be used only in environments where legacy DT_RPATH is the preferred form of linking (instead of DT_RUNPATH). We do not recommend installing both sets of packages.

ROCm components#

The following sections highlight select component-specific changes. For additional details, refer to the Changelog.

AMD System Management Interface (SMI) Tool#

  • New monitor command for GPU metrics. Use the monitor command to customize, capture, collect, and observe GPU metrics on target devices.

  • Integration with E-SMI. The EPYC™ System Management Interface In-band Library is a Linux C-library that provides in-band user space software APIs to monitor and control your CPU’s power, energy, performance, and other system management functionality. This integration enables access to CPU metrics and telemetry through the AMD SMI API and CLI tools.

Composable Kernel (CK)#

  • New architecture support. CK now supports the following AMD GPU architectures, enabling efficient image denoising on them: gfx1030, gfx1100, gfx1031, gfx1101, gfx1032, gfx1102, gfx1034, gfx1103, gfx1035, gfx1036

  • FP8 rounding logic is replaced with stochastic rounding. Stochastic rounding mimics a more realistic data behavior and improves model convergence.

HIP#

  • New environment variable to enable kernel run serialization. The default HIP_LAUNCH_BLOCKING value is 0 (disabled), which causes kernels to run as defined in the queue. When set to 1 (enabled), the HIP runtime serializes the kernel queue, which behaves the same as AMD_SERIALIZE_KERNEL.

hipBLASLt#

  • New GemmTuning extension parameter. GemmTuning allows you to set a split-k value for each solution, which makes performance tuning more flexible.

hipFFT#

  • New multi-GPU support for single-process transforms. Multiple GPUs can be used to perform a transform in a single process. Note that this initial implementation is a functional preview.

HIPIFY#

  • Skipped code blocks: Code blocks that are skipped by the preprocessor are no longer hipified under the --default-preprocessor option. To hipify everything, despite conditional preprocessor directives (#if, #ifdef, #ifndef, #elif, or #else), don’t use the --default-preprocessor or --amap options.

hipSPARSELt#

  • Structured sparsity matrix support extensions. Structured sparsity matrices help speed up deep-learning workloads. We now support B as the sparse matrix and A as the dense matrix in Sparse Matrix-Matrix Multiplication (SPMM); prior to this release, we only supported sparse (matrix A) x dense (matrix B) matrix multiplication.

hipTensor#

  • 4D tensor permutation and contraction support. You can now perform tensor permutation on 4D tensors and 4D contractions for F16, BF16, and Complex F32/F64 datatypes.

MIGraphX#

  • Improved performance for transformer-based models. We added support for FlashAttention, which benefits models like BERT, GPT, and Stable Diffusion.

  • New Torch-MIGraphX driver. This driver calls MIGraphX directly from PyTorch. It provides an mgx_module object that you can invoke like any other Torch module, but which utilizes the MIGraphX inference engine internally. Torch-MIGraphX supports FP32, FP16, and INT8 datatypes.

  • FP8 support. We now offer functional support for inference in the FP8E4M3FNUZ datatype. You can load an ONNX model in FP8E4M3FNUZ using C++ or Python APIs, or migraphx-driver. You can quantize a floating point model to FP8 format by using the --fp8 flag with migraphx-driver. To accelerate inference, MIGraphX uses hardware acceleration on MI300 for FP8 by leveraging FP8 support in various backend kernel libraries.

MIOpen#

  • Improved performance for inference and convolutions. Inference support is now provided for Find 2.0 fusion plans. Additionally, we’ve enhanced the Number of samples, Height, Width, and Channels (NHWC) convolution kernels for heuristics. NHWC stores each sample with the height and width dimensions first, followed by channels (a channels-last layout).

OpenMP#

  • Implicit zero-copy is triggered automatically in XNACK-enabled MI300A systems. Implicit zero-copy behavior in programs that do not use unified_shared_memory is triggered automatically in XNACK-enabled MI300A systems (for example, when using the HSA_XNACK=1 environment variable). OpenMP supports the requires unified_shared_memory directive for programs that don’t want to copy data explicitly between the CPU and GPU, but that directive has to be added to every translation unit of the program; implicit zero-copy provides the same behavior without it. See the sketch after this list.

  • New MI300 FP atomics. Application performance can now improve by leveraging fast floating-point atomics on MI300 (gfx942).
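
The sketch below (an illustration, not taken from the release notes) shows the kind of OpenMP target region this affects: with the requires unified_shared_memory directive, the target region dereferences host memory directly with no map clauses or explicit copies; on an XNACK-enabled MI300A system (HSA_XNACK=1), the same zero-copy behavior is now triggered implicitly even for programs built without that directive.

    #include <cstdio>
    #include <vector>

    // Explicit directive: every translation unit of a unified_shared_memory
    // program needs this line. On XNACK-enabled MI300A (HSA_XNACK=1), the
    // runtime now provides equivalent zero-copy behavior implicitly.
    #pragma omp requires unified_shared_memory

    int main() {
        const int n = 1 << 20;
        std::vector<double> x(n, 1.0);
        double* data = x.data();

        // Host memory is accessed directly inside the target region;
        // no map clauses or explicit host<->device copies are needed.
        #pragma omp target teams distribute parallel for
        for (int i = 0; i < n; ++i) {
            data[i] *= 2.0;
        }

        printf("x[0] = %f\n", x[0]);  // expect 2.0
        return 0;
    }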

RCCL#

  • NCCL 2.18.6 compatibility. RCCL is now compatible with NCCL 2.18.6, which includes increasing the maximum IB network interfaces to 32 and fixing network device ordering when creating communicators with only one GPU per node.

  • Doubled simultaneous communication channels. We improved MI300X performance by increasing the maximum number of simultaneous communication channels from 32 to 64.

rocALUTION#

  • New multiple node and GPU support. Unsmoothed and smoothed aggregations and Ruge-Stueben AMG now work with multiple nodes and GPUs. For more information, refer to the API documentation.

rocDecode#

  • New ROCm component. rocDecode is ROCm’s newest component, providing high-performance video decode support for AMD GPUs. To learn more, refer to the documentation.

ROCm Compiler#

  • Combined projects. ROCm Device-Libs, ROCm Compiler Support, and hipCC are now located in the llvm-project/amd subdirectory of AMD’s fork of the LLVM project. Previously, these projects were maintained in separate repositories. Note that the projects themselves will continue to be packaged separately.

  • Split the ‘rocm-llvm’ package. This package has been split into a required and an optional package:

    • rocm-llvm (required): A package containing the essential binaries needed for compilation.

    • rocm-llvm-dev (optional): A package containing binaries for compiler and application developers.

ROCm Data Center Tool (RDC)#

  • C++ upgrades. RDC was upgraded from C++11 to C++17 to enable a more modern C++ standard when writing RDC plugins.

ROCm Performance Primitives (RPP)#

  • New backend support. Added audio processing support for the HOST backend, and 3D voxel kernel support for the HOST and HIP backends.

ROCm Validation Suite#

  • New datatype support. Added support for the BF16 and FP8 datatypes, based on General Matrix Multiply (GEMM) operations, in the GPU Stress Test (GST) module. This provides additional performance benchmarking and stress testing based on the newly supported datatypes.

rocSOLVER#

  • New EigenSolver routine. Based on the Jacobi algorithm, a new EigenSolver routine was added to the library. This routine computes the eigenvalues and eigenvectors of a matrix with improved performance.

ROCTracer#

  • New versioning and callback enhancements. ROCTracer was updated to match the versioning changes in the HIP runtime, and it supports runtime API callbacks and activity record logging. The APIs of different runtimes at different levels are considered different API domains with assigned domain IDs.

Upcoming changes#

  • ROCm SMI will be deprecated in a future release. We advise migrating to AMD SMI now to prevent future workflow disruptions.

  • hipCC supports, by default, the following compiler invocation flags:

    • -mllvm -amdgpu-early-inline-all=true

    • -mllvm -amdgpu-function-calls=false

    In a future ROCm release, hipCC will no longer support these flags. It will, instead, use the Clang defaults:

    • -mllvm -amdgpu-early-inline-all=false

    • -mllvm -amdgpu-function-calls=true

    To evaluate the impact of this change, include --hipcc-func-supp in your hipCC invocation.

    For information on these flags, and the differences between hipCC and Clang, refer to ROCm Compiler Interfaces.

  • Future ROCm releases will not provide clang-ocl. For more information, refer to the clang-ocl README.

  • The following operating systems will be supported in a future ROCm release. They are currently only available in beta.

    • RHEL 9.4

    • RHEL 8.10

    • SLES 15 SP6

  • As of ROCm 6.2, we’ve planned for end-of-support for:

    • Ubuntu 20.04.5

    • SLES 15 SP4

    • RHEL/CentOS 7.9

Changelog#

This page contains the changelog for AMD ROCm Software.


ROCm 6.1.0#

The ROCm™ 6.1 release consists of new features and fixes to improve the stability and performance of AMD Instinct™ MI300 GPU applications. Notably, we’ve added:

  • Full support for Ubuntu 22.04.4.

  • rocDecode, a new ROCm component that provides high-performance video decode support for AMD GPUs. With rocDecode, you can decode compressed video streams while keeping the resulting YUV frames in video memory. With decoded frames in video memory, you can run video post-processing using ROCm HIP, avoiding unnecessary data copies via the PCIe bus.

    To learn more, refer to the rocDecode documentation.

OS and GPU support changes#

ROCm 6.1 adds the following operating system support:

  • MI300A: Ubuntu 22.04.4 and RHEL 9.3

  • MI300X: Ubuntu 22.04.4

Future releases will add additional operating systems to match our general offering. For older generations of supported AMD Instinct products, we’ve added Ubuntu 22.04.4 support.

Tip

To view the complete list of supported GPUs and operating systems, refer to the system requirements page for Linux and Windows.

Installation packages#

This release includes a new set of packages for every module (all libraries and binaries default to DT_RPATH). Package names have the suffix rpath; for example, the rpath variant of rocminfo is rocminfo-rpath.

Warning

The new rpath packages will conflict with the default packages; they are meant to be used only in environments where legacy DT_RPATH is the preferred form of linking (instead of DT_RUNPATH). We do not recommend trying to install both sets of packages.

AMD SMI#

AMD SMI for ROCm 6.1.0

Additions#
  • Added Monitor command. This provides users the ability to customize the GPU metrics to capture, collect, and observe. Output is provided in a table view. This aligns more closely with ROCm SMI’s rocm-smi (no argument) output, and allows you to customize the view to show the data that is helpful for your use case.

  • Integrated ESMI Tool. You can get CPU metrics and telemetry through our API and CLI tools. You can get this information using the amd-smi static and amd-smi metric commands. This is only available for limited target processors. As of ROCm 6.0.2, this is listed as:

    • AMD Zen3 based CPU Family 19h Models 0h-Fh and 30h-3Fh

    • AMD Zen4 based CPU Family 19h Models 10h-1Fh and A0h-AFh

  • Added support for new metrics: VCN, JPEG engines, and PCIe errors. Using the AMD SMI tool, you can retrieve VCN, JPEG engine, and PCIe error metrics by calling amd-smi metric -P or amd-smi metric --usage. Depending on device support, VCN_ACTIVITY is updated for MI3x ASICs (with 4 separate VCN engine activities), while older ASICs report MM_ACTIVITY with UVD/VCN engine activity (an average of all engines). JPEG_ACTIVITY is a new field for MI3x ASICs, where a device can support up to 32 JPEG engine activities. See our documentation for a more in-depth understanding of these new fields.

  • Added AMDSMI Tool version. AMD SMI will report three versions: AMDSMI Tool, AMDSMI Library version, and ROCm version.

    The AMDSMI Tool version is the CLI/tool version number with commit ID appended after the + sign. The AMDSMI Library version is the library package version number. The ROCm version is the system’s installed ROCm version; if ROCm is not installed, it reports N/A.

  • Added XGMI table. Displays XGMI information for AMD GPU devices in a table format. This is only available on supported ASICs (e.g., MI300). Here, users can view accumulated XGMI or PCIe read/write data transfer sizes (in kilobytes).

  • Added units of measure to JSON output. Units of measure were added to the JSON/CSV output of the amd-smi metric, amd-smi static, and amd-smi monitor commands.

Changes#
  • Topology is now left-aligned and lists the BDF for each device in every table’s rows and columns. We provide each device’s BDF for every table’s rows and columns and left-align the data, so that the AMD SMI tool output is easier to understand and digest. Previously, having to scroll up to find this information made the output difficult to follow, especially on systems with many devices associated with one ASIC.

Fixes#
  • Fix for RDNA3/RDNA2/MI100 ‘amdsmi_get_gpu_pci_bandwidth()’ in ‘frequencies_read’ tests. For devices that do not report this information (e.g., RDNA3/RDNA2/MI100), we have added checks to confirm that these devices return AMDSMI_STATUS_NOT_SUPPORTED. Otherwise, the tests now display a return string.

  • Fix for devices that have an older PyYAML installed. For platforms that are identified as having an older PyYAML version or pip, we now manually update both pip and PyYAML as needed. This fix impacts the following CLI commands:

    • amd-smi list

    • amd-smi static

    • amd-smi firmware

    • amd-smi metric

    • amd-smi topology

  • Fix for crash when user is not a member of video/render groups. AMD SMI now uses the same mutex handler for devices as ROCm SMI. This helps avoid crashes when DRM/device data are inaccessible to the logged-in user.

Known issues#
  • There is an AttributeError while running amd-smi process --csv

  • GPU reset results in an “Unable to reset non-amd GPU” error

  • bad pages results with “ValueError: NULL pointer access”

  • Some RDNA3 cards may enumerate to Slot type = UNKNOWN

HIP#

HIP 6.1 for ROCm 6.1

Additions#
  • New environment variable, HIP_LAUNCH_BLOCKING, which is used for serialization on kernel execution.

  • The default value is 0 (disable): the kernel runs normally, as defined in the queue

  • When set to 1 (enable): the HIP runtime serializes the kernel enqueue and behaves the same as AMD_SERIALIZE_KERNEL

  • Added HIPRTC support for the HIP headers driver_types, math_functions, library_types, hip_math_constants, channel_descriptor, device_functions, hip_complex, surface_types, and texture_types

Changes#
  • HIPRTC now assumes WGP mode for gfx10+. You can enable CU mode by passing -mcumode to the compile options from hiprtcCompileProgram.
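
For illustration, a minimal sketch (assumed usage, not from the changelog) of requesting CU mode by passing -mcumode through hiprtcCompileProgram; link against the hiprtc library when building:

    #include <hip/hiprtc.h>
    #include <cstdio>
    #include <string>

    int main() {
        // Trivial kernel source compiled at runtime with HIPRTC.
        static const char* source = "extern \"C\" __global__ void noop() {}";

        hiprtcProgram prog;
        hiprtcCreateProgram(&prog, source, "noop.cu", 0, nullptr, nullptr);

        // HIPRTC assumes WGP mode on gfx10+; -mcumode requests CU mode instead.
        const char* options[] = {"-mcumode"};
        hiprtcResult result = hiprtcCompileProgram(prog, 1, options);

        size_t log_size = 0;
        hiprtcGetProgramLogSize(prog, &log_size);
        if (log_size > 1) {
            std::string log(log_size, '\0');
            hiprtcGetProgramLog(prog, &log[0]);
            printf("%s\n", log.c_str());
        }
        printf("compile %s\n", result == HIPRTC_SUCCESS ? "succeeded" : "failed");

        hiprtcDestroyProgram(&prog);
        return 0;
    }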

Fixes#
  • HIP complex vector type multiplication and division operations. On an AMD platform, some duplicated complex operators are removed to avoid compilation failures. In HIP, hipFloatComplex and hipDoubleComplex are defined as complex datatypes:

    • typedef float2 hipFloatComplex

    • typedef double2 hipDoubleComplex

    Any application that uses complex multiplication and division operations must replace * and / operators with the following:

    • hipCmulf() and hipCdivf() for hipFloatComplex

    • hipCmul() and hipCdiv() for hipDoubleComplex

    Note that these complex operations are equivalent to corresponding types/functions on an NVIDIA platform.
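
For illustration, a minimal sketch (not from the changelog) of the required change using the functions from hip/hip_complex.h:

    #include <hip/hip_complex.h>
    #include <cstdio>

    int main() {
        hipFloatComplex a = make_hipFloatComplex(1.0f, 2.0f);
        hipFloatComplex b = make_hipFloatComplex(3.0f, -1.0f);

        // Previously written as a * b and a / b; on the AMD platform use the
        // explicit helper functions instead.
        hipFloatComplex c = hipCmulf(a, b);   // complex multiplication
        hipFloatComplex d = hipCdivf(a, b);   // complex division

        printf("c = (%f, %f), d = (%f, %f)\n",
               hipCrealf(c), hipCimagf(c), hipCrealf(d), hipCimagf(d));
        return 0;
    }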

ROCm Compiler#

ROCm Compiler for ROCm 6.1.0

Additions#
  • Compiler now generates .uniform_work_group_size and records it in the metadata. It indicates if the kernel requires that each dimension of global size is a multiple of the corresponding dimension of work-group size. A value of 1 is true, and 0 is false. This metadata is only provided when the value is 1.

  • Added the rocm-llvm-docs package.

  • Added ROCm Device-Libs, ROCm Compiler Support, and hipCC within the llvm-project/amd subdirectory of AMD’s fork of the LLVM project.

  • Added support for C++ Parallel Algorithm Offload via HIP (HIPSTDPAR), which allows standard C++ parallel algorithms to run on the GPU (see the sketch after this list).
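
A minimal sketch of the kind of code HIPSTDPAR targets: the algorithm and execution policy below are standard C++17, and with HIPSTDPAR enabled at compile time (a --hipstdpar style option on the compiler invocation, which is an assumption to check against the compiler documentation) the loop body can be offloaded to the GPU without any HIP-specific source changes.

    #include <algorithm>
    #include <execution>
    #include <vector>
    #include <cstdio>

    int main() {
        std::vector<float> data(1 << 20, 1.0f);

        // A standard parallel algorithm; HIPSTDPAR can offload this to the GPU.
        std::for_each(std::execution::par_unseq, data.begin(), data.end(),
                      [](float& x) { x = x * 2.0f + 1.0f; });

        printf("data[0] = %f\n", data[0]);  // expect 3.0
        return 0;
    }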

Changes#
  • rocm-clang-ocl is now an optional package and will require manual installation.

Deprecations#
  • hipCC adds -mllvm -amdgpu-early-inline-all=true and -mllvm -amdgpu-function-calls=false by default to compiler invocations. These flags will be removed from hipCC in a future ROCm release.

Fixes#

AddressSanitizer (ASan):

  • Added the sanitized_padded_global LLVM IR attribute to identify sanitizer-instrumented globals.

  • For each ASan-instrumented global, two symbols are emitted: one with the actual size and the other with the instrumented size.

Known issues#
  • Due to an issue within the amd-llvm compiler shipping with ROCm 6.1, HIPSTDPAR’s interposition mode, which is enabled by --hipstdpar-interpose-alloc, is currently broken.

    The temporary workaround is to use the upstream LLVM 18 (or newer) compiler. This issue will be addressed in a future ROCm release.

ROCm Data Center (RDC)#

RDC for ROCm 6.1.0

Changes#
  • Added --address flag to rdcd

  • Upgraded from C++11 to C++17

  • Upgraded gRPC

ROCDebugger (ROCgdb)#

ROCgdb for ROCm 6.1.0

Fixes#

Previously, ROCDebugger encountered hangs and crashes when stepping over the s_endpgm instruction at the end of a HIP kernel entry function, which caused the stepped wave to exit. This issue is fixed in the ROCm 6.1 release. You can now step over the last instruction of any HIP kernel without debugger hangs or crashes.

ROCm SMI#

ROCm SMI for ROCm 6.1.0

Additions#
  • Added support to set the max/min clock level for sclk (‘RSMI_CLK_TYPE_SYS’) or mclk (‘RSMI_CLK_TYPE_MEM’). You can now set a maximum or minimum sclk or mclk value through the rsmi_dev_clk_extremum_set() API, provided the ASIC supports it. Alternatively, you can use our Python CLI tool (rocm-smi --setextremum max sclk 1500).

  • Added rsmi_dev_target_graphics_version_get(). You can now query the ROCm SMI API (rsmi_dev_target_graphics_version_get()) to retrieve the target graphics version for a GPU device. Currently, this output is not supplied through our ROCm SMI CLI.

Changes#
  • Removed non-unified API headers: individual GPU metric APIs are no longer supported. The individual metric APIs (rsmi_dev_metrics_*) were removed to make it easier to add support for new GPU metrics; metrics are instead reported through the single rsmi_dev_gpu_metrics_info_get() API. It is worth noting that there is a risk of ABI breakage when using rsmi_dev_gpu_metrics_info_get(); such ABI breaks are necessary (in some cases) in order to support newer ASICs and metrics for our customers. We will continue to support rsmi_dev_gpu_metrics_info_get() with these considerations and limitations in mind.

  • Deprecated ‘rsmi_dev_power_ave_get()’; use the newer API, ‘rsmi_dev_power_get()’. As outlined in the changes for 6.0.0 (which added the generic power API rsmi_dev_power_get), rsmi_dev_power_ave_get() is now deprecated. You must update your ROCm SMI API calls accordingly.

Fixes#
  • Fixed --showpids reporting [PID] [PROCESS NAME] 1 UNKNOWN UNKNOWN UNKNOWN. Output was failing because the cu_occupancy debugfs method is not provided on some graphics cards by design. get_compute_process_info_by_pid was updated to reflect this and now returns the output needed by the CLI.

  • Fixed rocm-smi --showpower output, which was inconsistent on some RDNA3 devices. We updated this to use rsmi_dev_power_get() within the CLI to provide a consistent device power output. This was caused by using the now-deprecated rsmi_dev_average_power_get() API.

  • Fixed rocm-smi --setcomputepartition and rocm-smi --resetcomputepartition to indicate when the device is busy (EBUSY)

  • Fixed rocm-smi --setmemorypartition and rocm-smi --resetmemorypartition to return RSMI_STATUS_NOT_SUPPORTED when SYSFS is read only. The rsmi_dev_memory_partition_set API is updated to handle the read-only SYSFS check. Corresponding tests and CLI calls (rocm-smi --setmemorypartition and rocm-smi --resetmemorypartition) were updated accordingly.

  • Fixed rocm-smi --showclkvolt and rocm-smi --showvc, which were displaying 0 for overdrive and reporting that the voltage curve is not supported.

ROCProfiler#

ROCProfiler for ROCm 6.1.0

Fixes#
  • Fixed ROCprofiler to match versioning changes in HIP Runtime

  • Fixed plugins race condition

  • Updated metrics to MI300

ROCm Validation Suite#
Known issue#
  • In a future release, the ROCm Validation Suite P2P Benchmark and Qualification Tool (PBQT) tests will be optimized to meet the target bandwidth requirements for MI300X.

MI200 SR-IOV#
Known issue#
  • Multimedia applications may encounter compilation errors in the MI200 Single Root Input/Output Virtualization (SR-IOV) environment. This is because MI200 SR-IOV does not currently support multimedia applications.

AMD MI300A RAS#
Fixed defect#
GFX correctable and uncorrectable error inject failures#
  • Previously, the AMD CPU Reliability, Availability, and Serviceability (RAS) installation encountered correctable and uncorrectable failures while injecting an error.

    This issue is resolved in the ROCm 6.1 release, and users will no longer encounter the GFX correctable error (CE) and uncorrectable error (UE) failures.

Library changes in ROCm 6.1.0#

  • AMDMIGraphX: 2.8 ⇒ 2.9

  • composable_kernel: 0.2.0

  • hipBLAS: 2.0.0 ⇒ 2.1.0

  • hipBLASLt: 0.7.0

  • hipCUB: 3.0.0 ⇒ 3.1.0

  • hipFFT: 1.0.13 ⇒ 1.0.14

  • hipRAND: 2.10.17

  • hipSOLVER: 2.0.0 ⇒ 2.1.0

  • hipSPARSE: 3.0.0 ⇒ 3.0.1

  • hipSPARSELt: 0.2.0

  • hipTensor: 1.1.0 ⇒ 1.2.0

  • MIOpen: 2.19.0 ⇒ 3.1.0

  • MIVisionX: 2.5.0

  • rccl: 2.18.6

  • rocALUTION: 3.0.3 ⇒ 3.1.1

  • rocBLAS: 4.0.0 ⇒ 4.1.0

  • rocDecode: 0.5.0

  • rocFFT: 1.0.25 ⇒ 1.0.26

  • rocm-cmake: 0.11.0 ⇒ 0.12.0

  • rocPRIM: 3.0.0 ⇒ 3.1.0

  • rocRAND: 3.0.0 ⇒ 3.0.1

  • rocSOLVER: 3.24.0 ⇒ 3.25.0

  • rocSPARSE: 3.0.2 ⇒ 3.1.2

  • rocThrust: 3.0.0 ⇒ 3.0.1

  • rocWMMA: 1.3.0 ⇒ 1.4.0

  • rpp: 1.4.0 ⇒ 1.5.0

  • Tensile: 4.39.0 ⇒ 4.40.0

AMDMIGraphX 2.9#

MIGraphX 2.9 for ROCm 6.1.0

Additions#
  • Added FP8 support

  • Created a dockerfile with MIGraphX+ONNX Runtime EP+Torch

  • Added support for the Hardmax, DynamicQuantizeLinear, Qlinearconcat, Unique, QLinearAveragePool, QLinearSigmoid, QLinearLeakyRelu, QLinearMul, IsInf operators

  • Created web site examples for Whisper, Llama-2, and Stable Diffusion 2.1

  • Created examples of using the ONNX Runtime MIGraphX Execution Provider with the InceptionV3 and Resnet50 models

  • Updated operators to support ONNX Opset 19

  • Enable fuse_pointwise and fuse_reduce in the driver

  • Add support for dot-(mul)-softmax-dot offloads to MLIR

  • Added Blas auto-tuning for GEMMs

  • Added dynamic shape support for the multinomial operator

  • Added fp16 to accuracy checker

  • Added initial code for running on Windows OS

Optimizations#
  • Improved the output of migraphx-driver command

  • Documentation now shows all environment variables

  • Updates needed for general stride support

  • Enabled Asymmetric Quantization

  • Added ScatterND unsupported reduction modes

  • Rewrote softmax for better performance

  • General improvement to how quantization is performed to support INT8

  • Used problem_cache for gemm tuning

  • Improved performance by always using rocMLIR for quantized convolution

  • Improved group convolutions by using rocMLIR

  • Improved accuracy of fp16 models

  • ScatterElements unsupported reduction

  • Added concat fusions

  • Improved INT8 support to include UINT8

  • Allow reshape ops between dq and quant_op

  • Improve dpp reductions on navi

  • Have the accuracy checker print the whole final buffer

  • Added support for handling dynamic Slice and ConstantOfShape ONNX operators

  • Add support for the dilations attribute to Pooling ops

  • Add layout attribute support for LSTM operator

  • Improved performance by removing contiguous for reshapes

  • Handle all slice input variations

  • Add scales attribute parse in upsample for older opset versions

  • Added support for uneven Split operations

  • Improved unit testing to run in python virtual environments

Fixes#
  • Fixed outstanding issues in autogenerated documentation

  • Update model zoo paths for examples

  • Fixed promote_literals_test by using additional if condition

  • Fixed export API symbols from dynamic library

  • Fixed bug in pad operator from dimension reduction

  • Fixed using the LD to embed files and enable by default when building shared libraries on linux

  • fixed get_version()

  • Fixed Round operator inaccuracy

  • Fixed wrong size check when axes not present for slice

  • Set the .SO version correctly

Changes#
  • Cleanup LSTM and RNN activation functions

  • Placed gemm_pointwise at a higher priority than layernorm_pointwise

  • Updated README to mention the need to include GPU_TARGETS when building MIGraphX

Removals#
  • Removed unused device kernels from Gather and Pad operators

  • Removed int8x4 format

hipBLAS 2.1.0#

hipBLAS 2.1.0 for ROCm 6.1.0

Additions#
  • New build option to automatically use hipconfig --platform to determine HIP platform

  • Level 1 functions have additional ILP64 API for both C and Fortran (_64 name suffix) with int64_t function arguments (see the sketch after this list)

  • New functions hipblasGetMathMode and hipblasSetMathMode
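
A small sketch of the ILP64 naming convention described above, assuming the _64 variants mirror the 32-bit prototypes with int64_t count and increment arguments (verify the exact signatures in the hipBLAS headers):

    #include <hipblas/hipblas.h>
    #include <hip/hip_runtime.h>
    #include <cstdint>
    #include <cstdio>

    int main() {
        hipblasHandle_t handle;
        hipblasCreate(&handle);

        const int64_t n = 1 << 20;   // int64_t counts allow sizes beyond INT32_MAX
        float* d_x = nullptr;
        hipMalloc(&d_x, n * sizeof(float));
        hipMemset(d_x, 0, n * sizeof(float));

        const float alpha = 2.0f;
        // ILP64 variant of the Level 1 scal routine (assumed signature): same call
        // as hipblasSscal, but n and incx are int64_t.
        hipblasStatus_t status = hipblasSscal_64(handle, n, &alpha, d_x, int64_t{1});
        printf("hipblasSscal_64 returned %d\n", static_cast<int>(status));

        hipFree(d_x);
        hipblasDestroy(handle);
        return 0;
    }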

Deprecations#
  • USE_CUDA build option; use HIP_PLATFORM=amd or HIP_PLATFORM=nvidia to override hipconfig

Changes#
  • Some Level 2 function argument names have changed from m to n to match legacy BLAS; there was no change in implementation.

  • Updated client code to use YAML-based testing

  • Renamed .doxygen and .sphinx folders to doxygen and sphinx, respectively

  • Added CMake support for documentation

hipBLASLt 0.7.0#

hipBLASLt 0.7.0 for ROCm 6.1.0

Additions#
  • Added hipblasltExtSoftmax extension API

  • Added hipblasltExtLayerNorm extension API

  • Added hipblasltExtAMax extension API

  • Added GemmTuning extension parameter that lets the user set split-k

  • Support for mixed-precision datatypes: fp16/fp8 in with fp16 out

hipCUB 3.1.0#

hipCUB 3.1.0 for ROCm 6.1.0

Changed#
  • CUB backend references CUB and Thrust version 2.1.0.

  • Updated HIPCUB_HOST_WARP_THREADS macro definition to match host_warp_size changes from rocPRIM 3.0.

  • Implemented __int128_t and __uint128_t support for radix_sort.

Fixed#
  • Fixed build issues with rmake.py on Windows when using VS 2017 15.8 or later due to a breaking fix with extended aligned storage.

Added#
  • Added interface DeviceMemcpy::Batched for batched memcpy from rocPRIM and CUB.

hipFFT 1.0.14#

hipFFT 1.0.14 for ROCm 6.1.0

Changes#
  • When building hipFFT from source, rocFFT code no longer needs to be initialized as a git submodule.

Fixes#
  • Fixed error when creating length-1 plans.

hipSOLVER 2.1.0#

hipSOLVER 2.1.0 for ROCm 6.1.0

Added#
  • Added compatibility API with hipsolverSp prefix

  • Added compatibility-only functions

    • csrlsvchol

      • hipsolverSpScsrlsvcholHost, hipsolverSpDcsrlsvcholHost

      • hipsolverSpScsrlsvchol, hipsolverSpDcsrlsvchol

  • Added rocSPARSE and SuiteSparse as optional dependencies to hipSOLVER (rocSOLVER backend only). Use the BUILD_WITH_SPARSE CMake option to enable functionality for the hipsolverSp API (on by default).

  • Added hipSPARSE as an optional dependency to hipsolver-test. Use the BUILD_WITH_SPARSE CMake option to enable tests of the hipsolverSp API (on by default).

Changed#
  • Relax array length requirements for GESVDA.

Fixed#
  • Fixed incorrect singular vectors returned from GESVDA.

hipSPARSE 3.0.1#

hipSPARSE 3.0.1 for ROCm 6.1.0

Fixes#
  • Fixes to the build chain

hipSPARSELt 0.2.0#

hipSPARSELt 0.2.0 for ROCm 6.1.0

Added#
  • Support for Matrix B as a structured sparsity matrix.

hipTensor 1.2.0#

hipTensor 1.2.0 for ROCm 6.1.0

Additions#
  • API support for permutation of rank 4 tensors: f16 and f32

  • New datatype support in contractions of rank 4: f16, bf16, complex f32, complex f64

  • Added scale and bilinear contraction samples and tests for new supported data types

  • Added permutation samples and tests for f16, f32 types

Fixes#
  • Fixed bug in contraction calculation with data type f32

MIOpen 3.1.0#

MIOpen 3.1.0 for ROCm 6.1.0

Added#
  • CK-based 2d/3d convolution solvers to support nchw/ncdhw layout

  • Fused solver for Fwd Convolution with Residual, Bias and activation

  • AI Based Parameter Prediction Model for conv_hip_igemm_group_fwd_xdlops Solver

  • Forward, backward data and backward weight convolution solver with fp8/bfp8

  • check for packed tensors for convolution solvers

  • Integrate CK’s layer norm

  • Combine gtests into single binary

Fixed#
  • fix for backward passes bwd/wrw for CK group conv 3d

  • Fixed out-of-bounds memory access : ConvOclDirectFwdGen

  • fixed build failure due to hipRTC

Changed#
  • Standardize workspace abstraction

  • Use split CK libraries

Removed#
  • clamping to MAX from CastTensor used in Bwd and WrW convolution

rccl 2.18.6#

RCCL 2.18.6 for ROCm 6.1.0

Changed#
  • Compatibility with NCCL 2.18.6

rocALUTION 3.1.1#

rocALUTION 3.1.1 for ROCm 6.1.0

Additions#
  • TripleMatrixProduct functionality for GlobalMatrix

  • Multi-Node/GPU support for UA-AMG, SA-AMG and RS-AMG

  • Iterative ILU0 preconditioner ItILU0

  • Iterative triangular solve, selectable via SolverDecr class

Deprecations#
  • LocalMatrix::AMGConnect

  • LocalMatrix::AMGAggregate

  • LocalMatrix::AMGPMISAggregate

  • LocalMatrix::AMGSmoothedAggregation

  • LocalMatrix::AMGAggregation

  • PairwiseAMG

Known Issues#
  • PairwiseAMG does not currently support matrix sizes that exceed the int32 range

  • PairwiseAMG might fail building the hierarchy on certain input matrices

rocBLAS 4.1.0#

rocBLAS 4.1.0 for ROCm 6.1.0

Additions#
  • Level 1 and Level 1 Extension functions have additional ILP64 API for both C and FORTRAN (_64 name suffix) with int64_t function arguments.

  • Cache flush timing for gemm_ex.

Changes#
  • Some Level 2 function argument names have changed from ‘m’ to ‘n’ to match legacy BLAS; there was no change in implementation.

  • Standardized the use of non-blocking streams for copying results from device to host.

Fixes#
  • Fixed host-pointer mode reductions for non-blocking streams.

rocDecode 0.5.0#

rocDecode 0.5.0 for ROCm 6.1.0

Changes#
  • Changed setup updates

  • Added AMDGPU package support

  • Optimized package dependencies

  • Updated README

Fixes#
  • Minor bug fix and updates

Tested Configurations#
  • Linux distribution

    • Ubuntu - 20.04 / 22.04

  • ROCm:

    • rocm-core - 6.1.0.60100-28

    • amdgpu-core - 1:6.1.60100-1731559

  • FFMPEG - 4.2.7 / 4.4.2-0

  • rocDecode Setup Script - V1.4

rocFFT 1.0.26#

rocFFT 1.0.26 for ROCm 6.1.0

Changes#
  • Multi-device FFTs now allow batch greater than 1

  • Multi-device, real-complex FFTs are now supported

  • rocFFT now statically links libstdc++ when only std::experimental::filesystem is available (to guard against ABI incompatibilities with newer libstdc++ libraries that include std::filesystem)

rocm-cmake 0.12.0#

rocm-cmake 0.12.0 for ROCm 6.1.0

Changed#
  • ROCMSphinxDoc: Allow separate source and config directories.

  • ROCMCreatePackage: Allow additional PROVIDES on header-only packages.

  • ROCMInstallTargets: Don’t install executable targets by default for ASAN builds.

  • ROCMTest: Add RPATH for installed tests.

  • Finalize rename to ROCmCMakeBuildTools

Fixed#
  • ROCMClangTidy: Fixed invalid list index.

  • Test failures when ROCM_CMAKE_GENERATOR is empty.

rocPRIM 3.1.0#

rocPRIM 3.1.0 for ROCm 6.1.0

Additions#
  • New primitive: block_run_length_decode

  • New primitive: batch_memcpy

Changes#
  • Renamed:

    • scan_config_v2 to scan_config

    • scan_by_key_config_v2 to scan_by_key_config

    • radix_sort_config_v2 to radix_sort_config

    • reduce_by_key_config_v2 to reduce_by_key_config

  • Removed support for custom config types for device algorithms

  • host_warp_size() was moved into rocprim/device/config_types.hpp; it now uses either device_id or a stream parameter to query the proper device and a device_id out parameter

    • The return type is hipError_t

  • Added support for __int128_t in device_radix_sort and block_radix_sort

  • Improved the performance of match_any, and block_histogram which uses it

Deprecations#
  • Removed reduce_by_key_config, MatchAny, scan_config, scan_by_key_config, and radix_sort_config

Fixes#
  • Build issues with rmake.py on Windows when using VS 2017 15.8 or later (due to a breaking fix with extended aligned storage)

rocRAND 3.0.1#

rocRAND 3.0.1 for ROCm 6.1.0

Fixes#
  • Implemented workaround for regressions in XORWOW and LFSR on MI200

rocSOLVER 3.25.0#

rocSOLVER 3.25.0 for ROCm 6.1.0

Added#
  • Eigensolver routines for symmetric/hermitian matrices using Divide & Conquer and Jacobi algorithm:

    • SYEVDJ (with batched and strided_batched versions)

    • HEEVDJ (with batched and strided_batched versions)

  • Generalized symmetric/hermitian-definite eigensolvers using Divide & Conquer and Jacobi algorithm:

    • SYGVDJ (with batched and strided_batched versions)

    • HEGVDJ (with batched and strided_batched versions)

Changed#
  • Relaxed array length requirements for GESVDX with rocblas_srange_index.

Removed#
  • Removed gfx803 and gfx900 from default build targets.

Fixed#
  • Corrected singular vector normalization in BDSVDX and GESVDX

  • Fixed potential memory access fault in STEIN, SYEVX/HEEVX, SYGVX/HEGVX, BDSVDX and GESVDX

rocSPARSE 3.1.2#

rocSPARSE 3.1.2 for ROCm 6.1.0

Additions#
  • New LRB algorithm to SpMV, supporting CSR format

  • rocBLAS is now an optional dependency for SDDMM algorithms

  • Additional verbose output for csrgemm and bsrgemm

Optimizations#
  • Triangular solve with multiple rhs (SpSM, csrsm, …) now calls SpSV, csrsv, etcetera when nrhs equals 1

  • Improved user manual section Installation and Building for Linux and Windows

  • Improved SpMV in CSR format on MI300

rocThrust 3.0.1#

rocThrust 3.0.1 for ROCm 6.1.0

Fixes#
  • Ported a fix from thrust 2.2 that ensures thrust::optional is trivially copyable.

rocWMMA 1.4.0#

rocWMMA 1.4.0 for ROCm 6.1.0

Additions#
  • Added bf16 support for hipRTC sample

Changes#
  • Changed Clang C++ version to C++17

  • Updated rocwmma_coop API

  • Linked rocWMMA to hiprtc

Fixes#
  • Fixed compile/runtime arch checks

  • Built all tests in the large code model

  • Removed inefficient branching in layout loop unrolling

rpp 1.5.0#

rpp for ROCm 6.1.0

Changes#
  • Prerequisites

Tested Configurations#
  • Linux distribution

    • Ubuntu - 20.04 / 22.04

    • CentOS - 7

    • RHEL - 8/9

  • ROCm: rocm-core - 5.5.0.50500-63

  • Clang - Version 5.0.1 and above

  • CMake - Version 3.22.3

  • IEEE 754-based half-precision floating-point library - Version 1.12.0

Tensile 4.40.0#

Tensile 4.40.0 for ROCm 6.1.0

Additions#
  • new DisableKernelPieces values to invalidate local read, local write, and global read

  • stream-K kernel generation, including two-tile stream-k algorithm by setting StreamK=3

  • feature to allow testing stream-k grid multipliers

  • debug output to check occupancy for Stream-K

  • reject condition for FractionalLoad + DepthU!=power of 2

  • new TENSILE_DB debugging value to dump the common kernel parameters

  • predicate for APU libs

  • new parameter (ClusterLocalRead) to turn on/off wider local read opt for TileMajorLDS

  • new parameter (ExtraLatencyForLR) to add extra interval between local read and wait

  • new logic to check LDS size with auto LdsPad(=1) and change LdsPad to 0 if LDS overflows

  • initialization type and general batched options to the rocblas-bench input creator script

Optimizations#
  • enabled MFMA + LocalSplitU=4 for MT16x16

  • enabled (DirectToVgpr + MI4x4) and supported skinny MacroTile

  • optimized postGSU kernel: separate postGSU kernels for different GSU values, loop unroll for GSU loop, wider global load depending on array size, and parallel reduction depending on array size

  • auto LdsPad calculation for TileMajorLds + MI16x16

  • auto LdsPad calculation for UnrollMajorLds + MI16x16 + VectorWidth

Changes#
  • cleared hipErrorNotFound error since it is an expected part of the search

  • modified hipcc search path for Linux

  • changed PCI ID from 32bit to 64bit for ROCm SMI HW monitor

  • changed LdsBlockSizePerPad to LdsBlockSizePerPadA, B to specify LBSPP separately

  • changed the default value of LdsPadA, B, LdsBlockSizePerPadA, B from 0 to -1

  • updated test cases according to parameter changes for LdsPad, LBSPP and ClusterLocalRead

  • Replaced std::regex with fnmatch()/PathMatchSpec as a workaround to std::regex stack overflow known bug

Fixes#
  • hipcc compile append flag parallel-jobs=4

  • race condition in Stream-K that appeared with large grids and small sizes

  • mismatch issue with LdsPad + LdsBlockSizePerPad!=0 and TailLoop

  • mismatch issue with LdsPad + LdsBlockSizePerPad!=0 and SplitLds

  • incorrect reject condition check for DirectToLds + LdsBlockSizePerPad=-1 case

  • small fix for LdsPad optimization (LdsElement calculation)


ROCm 6.0.2#

The ROCm 6.0.2 point release consists of minor bug fixes to improve the stability of MI300 GPU applications. This release introduces several new driver features for system qualification on our partner server offerings.

hipFFT 1.0.13#

hipFFT 1.0.13 for ROCm 6.0.2

Changes#
  • Removed the Git submodule for shared files between rocFFT and hipFFT; instead, just copy the files over (this should help simplify downstream builds and packaging)

Library changes in ROCm 6.0.2#

  • AMDMIGraphX: 2.8

  • composable_kernel: 0.2.0

  • hipBLAS: 2.0.0

  • hipCUB: 3.0.0

  • hipFFT: 1.0.13

  • hipRAND: 2.10.17

  • hipSOLVER: 2.0.0

  • hipSPARSE: 3.0.0

  • hipTensor: 1.1.0

  • MIOpen: 2.19.0

  • MIVisionX: 2.5.0

  • rccl: 2.15.5

  • rocALUTION: 3.0.3

  • rocBLAS: 4.0.0

  • rocFFT: 1.0.25

  • rocm-cmake: 0.11.0

  • rocPRIM: 3.0.0

  • rocRAND: 2.10.17 ⇒ 3.0.0

  • rocSOLVER: 3.24.0

  • rocSPARSE: 3.0.2

  • rocThrust: 3.0.0

  • rocWMMA: 1.3.0

  • rpp: 1.4.0

  • Tensile: 4.39.0

rocRAND 3.0.0#

rocRAND 3.0.0 for ROCm 6.0.2

Changed#
  • Generator classes from rocrand.hpp are no longer copyable; in previous versions, these copies would copy internal references to the generators and would lead to double-free or memory-leak errors. These types should be moved instead of copied, and move constructors and move-assignment operators are now defined for them.

Optimized#
  • Improved MT19937 initialization and generation performance.

Removed#
  • Removed hipRAND submodule from rocRAND. hipRAND is now only available as a separate package.

  • Removed references to and workarounds for deprecated hcc

Fixed#
  • mt19937_engine from rocrand.hpp is now move-constructible and move-assignable. Previously, the move constructor and move assignment operator were deleted for this class.

  • Various fixes for the C++ wrapper header rocrand.hpp

    • fixed the name of mrg31k3p; it is now correctly spelled (it was incorrectly named mrg31k3a in previous versions)

    • added missing order setter method for threefry4x64

    • fixed the default ordering parameter for lfsr113

  • Build error when using clang++ directly due to unsupported references to amdgpu-target


ROCm 6.0.0#

ROCm 6.0 is a major release with new performance optimizations, expanded frameworks and library support, and improved developer experience. This includes initial enablement of the AMD Instinct™ MI300 series. Future releases will further enable and optimize this new platform. Key features include:

  • Improved performance in areas like lower precision math and attention layers.

  • New hipSPARSELt library to accelerate AI workloads via AMD’s sparse matrix core technique.

  • Latest upstream support for popular AI frameworks like PyTorch, TensorFlow, and JAX.

  • New support for libraries, such as DeepSpeed, ONNX-RT, and CuPy.

  • Prepackaged HPC and AI containers on AMD Infinity Hub, with improved documentation and tutorials on the AMD ROCm Docs site.

  • Consolidated developer resources and training on the new AMD ROCm Developer Hub.

The following sections provide a release overview for ROCm 6.0. For additional details, you can refer to the Changelog.

OS and GPU support changes#

AMD Instinct™ MI300A and MI300X Accelerator support has been enabled for limited operating systems.

  • Ubuntu 22.04.3 (MI300A and MI300X)

  • RHEL 8.9 (MI300A)

  • SLES 15 SP5 (MI300A)

We’ve added support for the following operating systems:

  • RHEL 9.3

  • RHEL 8.9

Note that, as of ROCm 6.2, we’ve planned for end-of-support (EoS) for the following operating systems:

  • Ubuntu 20.04.5

  • SLES 15 SP4

  • RHEL/CentOS 7.9

New ROCm meta package#

We’ve added a new ROCm meta package for easy installation of all ROCm core packages, tools, and libraries. For example, the following command will install the full ROCm package: apt-get install rocm (Ubuntu), or yum install rocm (RHEL).

Filesystem Hierarchy Standard#

ROCm 6.0 fully adopts the Filesystem Hierarchy Standard (FHS) reorganization goals. We’ve removed the backward compatibility support for old file locations.

Compiler location change#
  • The installation path of LLVM has been changed from /opt/rocm-<rel>/llvm to /opt/rocm-<rel>/lib/llvm. For backward compatibility, a symbolic link is provided to the old location and will be removed in a future release.

  • The installation path of the device library bitcode has changed from /opt/rocm-<rel>/amdgcn to /opt/rocm-<rel>/lib/llvm/lib/clang/<ver>/lib/amdgcn. For backward compatibility, a symbolic link is provided and will be removed in a future release.

Documentation#

CMake support has been added for documentation in the ROCm repository.

AMD Instinct™ MI50 end-of-support notice#

AMD Instinct MI50, Radeon Pro VII, and Radeon VII products (collectively gfx906 GPUs) enter maintenance mode in ROCm 6.0.

As outlined in 5.6.0, ROCm 5.7 was the final release for gfx906 GPUs in a fully supported state.

  • Henceforth, no new features and performance optimizations will be supported for the gfx906 GPUs.

  • Bug fixes and critical security patches will continue to be supported for the gfx906 GPUs until Q2 2024 (end of maintenance [EOM] will be aligned with the closest ROCm release).

  • Bug fixes will be made up to the next ROCm point release.

  • Bug fixes will not be backported to older ROCm releases for gfx906.

  • Distribution and operating system updates will continue per the ROCm release cadence for gfx906 GPUs until EOM.

Known issues#
  • Hang is observed with rocSPARSE tests: Issue 2726.

  • AddressSanitizer instrumentation is incorrect for device global variables: Issue 2551.

  • Dynamically loaded HIP runtime library references incorrect version of hipDeviceGetProperties API: Issue 2728.

  • Memory access violations when running rocFFT-HMM: Issue 2730.

Library changes#

  • AMDMIGraphX: 2.8

  • HIP: 6.0.0

  • hipBLAS: 2.0.0

  • hipCUB: 3.0.0

  • hipFFT: 1.0.13

  • hipSOLVER: 2.0.0

  • hipSPARSE: 3.0.0

  • hipTensor: 1.1.0

  • MIOpen: 2.19.0

  • rccl: 2.15.5

  • rocALUTION: 3.0.3

  • rocBLAS: 4.0.0

  • rocFFT: 1.0.25

  • ROCgdb: 13.2

  • rocm-cmake: 0.11.0

  • rocPRIM: 3.0.0

  • rocprofiler: 2.0.0

  • rocRAND: 2.10.17

  • rocSOLVER: 3.24.0

  • rocSPARSE: 3.0.2

  • rocThrust: 3.0.0

  • rocWMMA: 1.3.0

  • Tensile: 4.39.0

AMDMIGraphX 2.8#

MIGraphX 2.8 for ROCm 6.0.0

Additions#
  • Support for TorchMIGraphX via PyTorch

  • Boosted overall performance by integrating rocMLIR

  • INT8 support for ONNX Runtime

  • Support for ONNX version 1.14.1

  • Added new operators: Qlinearadd, QlinearGlobalAveragePool, Qlinearconv, Shrink, CastLike, and RandomUniform

  • Added an error message for when gpu_targets is not set during MIGraphX compilation

  • Added parameter to set tolerances with migraphx-driver verify

  • Added support for MXR files > 4 GB

  • Added MIGRAPHX_TRACE_MLIR flag

  • BETA added capability for using ROCm Composable Kernels via the MIGRAPHX_ENABLE_CK=1 environment variable

Optimizations#
  • Improved performance support for INT8

  • Improved time precision while benchmarking candidate kernels from CK or MLIR

  • Removed contiguous from reshape parsing

  • Updated the ConstantOfShape operator to support Dynamic Batch

  • Simplified dynamic shapes-related operators to their static versions, where possible

  • Improved debugging tools for accuracy issues

  • Included a print warning about miopen_fusion while generating mxr

  • General reduction in system memory usage during model compilation

  • Created additional fusion opportunities during model compilation

  • Improved debugging for matchers

  • Improved general debug messages

Fixes#
  • Fixed scatter operator for nonstandard shapes with some models from ONNX Model Zoo

  • Provided a compile option to improve the accuracy of some models by disabling Fast-Math

  • Improved layernorm + pointwise fusion matching to ignore argument order

  • Fixed accuracy issue with ROIAlign operator

  • Fixed computation logic for the Trilu operator

  • Fixed support for the DETR model

Changes#
  • Changed MIGraphX version to 2.8

  • Extracted the test packages into a separate deb file when building MIGraphX from source

Removals#
  • Removed building Python 2.7 bindings

AMD SMI#
  • Integrated the E-SMI library: You can now query CPU-related information directly through AMD SMI. Metrics include power, energy, performance, and other system details.

  • Added support for gfx942 metrics: You can now query MI300 device metrics to get real-time information. Metrics include power, temperature, energy, and performance.

  • Added support for compute and memory partitions

HIP 6.0.0#

HIP 6.0.0 for ROCm 6.0.0

Additions#
  • New fields and structs for external resource interoperability

    • hipExternalMemoryHandleDesc_st

    • hipExternalMemoryBufferDesc_st

    • hipExternalSemaphoreHandleDesc_st

    • hipExternalSemaphoreSignalParams_st

    • hipExternalSemaphoreWaitParams_st

    Enumerations

    • hipExternalMemoryHandleType_enum

    • hipExternalSemaphoreHandleType_enum

    • hipExternalMemoryHandleType_enum

  • New environment variable HIP_LAUNCH_BLOCKING

    • For serialization on kernel execution. The default value is 0 (disable); the kernel will execute normally as defined in the queue. When this environment variable is set to 1 (enable), the HIP runtime will serialize the kernel enqueue, which behaves the same as AMD_SERIALIZE_KERNEL.

  • More members are added in the HIP struct hipDeviceProp_t for new feature capabilities, including the following (see the sketch after this list and the LUID note):

    • Texture

      • int maxTexture1DMipmap;

      • int maxTexture2DMipmap[2];

      • int maxTexture2DLinear[3];

      • int maxTexture2DGather[2];

      • int maxTexture3DAlt[3];

      • int maxTextureCubemap;

      • int maxTexture1DLayered[2];

      • int maxTexture2DLayered[3];

      • int maxTextureCubemapLayered[2];

    • Surface

      • int maxSurface1D;

      • int maxSurface2D[2];

      • int maxSurface3D[3];

      • int maxSurface1DLayered[2];

      • int maxSurface2DLayered[3];

      • int maxSurfaceCubemap;

      • int maxSurfaceCubemapLayered[2];

    • Device

      • hipUUID uuid;

      • char luid[8]; this is an 8-byte unique identifier. Only valid on Windows

      • unsigned int luidDeviceNodeMask;

  • LUID (Locally Unique Identifier) is supported for interoperability between devices. In HIP, more members are added in the struct hipDeviceProp_t, as properties to identify each device:

    • char luid[8];

    • unsigned int luidDeviceNodeMask;

Note

HIP only supports LUID on Windows OS.
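
An illustrative sketch (not part of the release notes) of reading a few of the newly added members through the existing hipGetDeviceProperties call:

    #include <hip/hip_runtime.h>
    #include <cstdio>

    int main() {
        int device = 0;
        hipDeviceProp_t prop;
        if (hipGetDeviceProperties(&prop, device) != hipSuccess) {
            printf("failed to query device %d\n", device);
            return 1;
        }

        // A few of the texture/surface capability members added in HIP 6.0.
        printf("device:             %s\n", prop.name);
        printf("maxSurface2D:       %d x %d\n", prop.maxSurface2D[0], prop.maxSurface2D[1]);
        printf("maxTextureCubemap:  %d\n", prop.maxTextureCubemap);
        // luid / luidDeviceNodeMask identify the device for interop (valid on Windows only).
        printf("luidDeviceNodeMask: %u\n", prop.luidDeviceNodeMask);
        return 0;
    }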

Changes#
  • Some OpenGL Interop HIP APIs are moved from the hip_runtime_api header to a new header file hip_gl_interop.h for the AMD platform, as follows:

    • hipGLGetDevices

    • hipGraphicsGLRegisterBuffer

    • hipGraphicsGLRegisterImage

Changes impacting backward incompatibility#
  • Data types for members in HIP_MEMCPY3D structure are changed from unsigned int to size_t.

  • The value of the flag hipIpcMemLazyEnablePeerAccess is changed to 0x01, which was previously defined as 0

  • Some device property attributes are not currently supported in HIP runtime. In order to maintain consistency, the following related enumeration names are changed in hipDeviceAttribute_t

    • hipDeviceAttributeName is changed to hipDeviceAttributeUnused1

    • hipDeviceAttributeUuid is changed to hipDeviceAttributeUnused2

    • hipDeviceAttributeArch is changed to hipDeviceAttributeUnused3

    • hipDeviceAttributeGcnArch is changed to hipDeviceAttributeUnused4

    • hipDeviceAttributeGcnArchName is changed to hipDeviceAttributeUnused5

  • HIP struct hipArray is removed from driver type header to comply with CUDA

  • hipArray_t replaces hipArray*, as the pointer-to-array type.

    • This allows hipMemcpyAtoH and hipMemcpyHtoA to have the correct array type which is equivalent to corresponding CUDA driver APIs.

Fixes#
  • Kernel launch maximum dimension validation is added specifically on gridY and gridZ in the HIP API hipModuleLaunchKernel. As a result, when hipGetDeviceAttribute is called for the value of hipDeviceAttributeMaxGridDim, the behavior on the AMD platform is equivalent to NVIDIA.

  • The HIP stream synchronization behavior is changed in internal stream functions: a "wait" flag is now added and set when the current stream is the null pointer while synchronizing on other explicitly created streams. This change avoids blocking execution on the null/default stream. It doesn't affect application usage and makes behavior on the AMD platform match NVIDIA.

  • Error handling on unsupported GPUs is fixed. The HIP runtime now logs an error message instead of raising a signal abort that was invisible to developers while the kernel execution process continued. This applies when developers compile an application via hipcc with the --offload-arch option set to a GPU ID that differs from the one on the system.

  • HIP complex vector type multiplication and division operations. On the AMD platform, some duplicated complex operators are removed to avoid compilation failures. In HIP, hipFloatComplex and hipDoubleComplex are defined as complex data types: typedef float2 hipFloatComplex; typedef double2 hipDoubleComplex;. Any application that uses complex multiplication and division operations needs to replace the '*' and '/' operators with the following (see the sketch after this list):

    • hipCmulf() and hipCdivf() for hipFloatComplex

    • hipCmul() and hipCdiv() for hipDoubleComplex

    Note: These complex operations are equivalent to the corresponding types/functions on the NVIDIA platform.
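
A minimal sketch of the replacement described above, assuming the usual make_hipFloatComplex, hipCrealf, and hipCimagf helpers from hip/hip_complex.h:

    #include <cstdio>
    #include <hip/hip_complex.h>

    int main() {
        hipFloatComplex a = make_hipFloatComplex(1.0f, 2.0f);
        hipFloatComplex b = make_hipFloatComplex(3.0f, -4.0f);

        // Previously written as a * b and a / b; on the AMD platform in
        // ROCm 6.0, call the explicit helper functions instead.
        hipFloatComplex prod = hipCmulf(a, b);
        hipFloatComplex quot = hipCdivf(a, b);

        std::printf("prod = (%f, %f)\n", hipCrealf(prod), hipCimagf(prod));
        std::printf("quot = (%f, %f)\n", hipCrealf(quot), hipCimagf(quot));
        return 0;
    }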

Removals#
  • Deprecated Heterogeneous Compute (HCC) symbols and flags are removed from the HIP source code, including:

    • Build options on obsolete HCC_OPTIONS were removed from cmake.

    • Macro definitions are removed:

      • HIP_INCLUDE_HIP_HCC_DETAIL_DRIVER_TYPES_H

      • HIP_INCLUDE_HIP_HCC_DETAIL_HOST_DEFINES_H

    • Compilation flags for the platform definitions

      • AMD platform

        • HIP_PLATFORM_HCC

        • HCC

        • HIP_ROCclr

      • NVIDIA platform

        • HIP_PLATFORM_NVCC

  • The hcc_detail and nvcc_detail directories in the clr repository are removed.

  • Deprecated gcnArch is removed from hip device struct hipDeviceProp_t.

  • Deprecated enum hipMemoryType memoryType; is removed from HIP struct hipPointerAttribute_t union.

hipBLAS 2.0.0#

hipBLAS 2.0.0 for ROCm 6.0.0

Additions#
  • New option to define HIPBLAS_USE_HIP_BFLOAT16 to switch API to use the hip_bfloat16 type

  • New hipblasGemmExWithFlags API

Deprecations#
  • hipblasDatatype_t; use hipDataType instead

  • hipblasComplex; use hipComplex instead

  • hipblasDoubleComplex; use hipDoubleComplex instead

  • Use of hipblasDatatype_t for hipblasGemmEx for compute-type; use hipblasComputeType_t instead

Removals#
  • hipblasXtrmm (calculates B <- alpha * op(A) * B) has been replaced with hipblasXtrmm (calculates C <- alpha * op(A) * B)

hipCUB 3.0.0#

hipCUB 3.0.0 for ROCm 6.0.0

Changes#
  • Removed DOWNLOAD_ROCPRIM: you can force rocPRIM to download using DEPENDENCIES_FORCE_DOWNLOAD

hipFFT 1.0.13#

hipFFT 1.0.13 for ROCm 6.0.0

Changes#
  • hipfft-rider has been renamed to hipfft-bench; it is controlled by the BUILD_CLIENTS_BENCH CMake option (note that a link for the old file name is installed, and the old BUILD_CLIENTS_RIDER CMake option is accepted for backwards compatibility, but both will be removed in a future release)

  • Binaries in debug builds no longer have a -d suffix

  • The minimum rocFFT required version has been updated to 1.0.21

Additions#
  • hipfftXtSetGPUs, hipfftXtMalloc, hipfftXtMemcpy, hipfftXtFree, and hipfftXtExecDescriptor APIs have been implemented to allow FFT computing on multiple devices in a single process

hipSOLVER 2.0.0#

hipSOLVER 2.0.0 for ROCm 6.0.0

Additions#
  • Added hipBLAS as an optional dependency to hipsolver-test

    • You can use the BUILD_HIPBLAS_TESTS CMake option to test the compatibility between hipSOLVER and hipBLAS

Changes#
  • The hipsolverOperation_t type is now an alias of hipblasOperation_t

  • The hipsolverFillMode_t type is now an alias of hipblasFillMode_t

  • The hipsolverSideMode_t type is now an alias of hipblasSideMode_t

Fixes#
  • Tests for hipSOLVER info updates in ORGBR/UNGBR, ORGQR/UNGQR, ORGTR/UNGTR, ORMQR/UNMQR, and ORMTR/UNMTR

hipSPARSE 3.0.0#

hipSPARSE 3.0.0 for ROCm 6.0.0

Additions#
  • Added hipsparseGetErrorName and hipsparseGetErrorString

Changes#
  • Changed the hipsparseSpSV_solve() API function to match the cuSPARSE API

  • Changed generic API functions to use const descriptors

  • Improved documentation

hipTensor 1.1.0#

hipTensor 1.1.0 for ROCm 6.0.0

Additions#
  • Architecture support for gfx942

  • Client tests configuration parameters now support YAML file input format

Changes#
  • Doxygen now treats warnings as errors

Fixes#
  • Client tests output redirections now behave accordingly

  • Removed dependency static library deployment

  • Security issues for documentation

  • Compile issues in debug mode

  • Corrected soft link for ROCm deployment

MIOpen 2.19.0#

MIOpen 2.19.0 for ROCm 6.0.0

Additions#
  • ROCm 5.5 support for gfx1101 (Navi32)

Changes#
  • Tuning results for MLIR on ROCm 5.5

  • Bumped MLIR commit to 5.5.0 release tag

Fixes#
  • 3-D convolution host API bug

  • [HOTFIX][MI200][FP16] ConvHipImplicitGemmBwdXdlops is disabled when FP16_ALT is required

MIVisionX#
  • Added Comprehensive CTests to aid developers

  • Introduced Doxygen support for complete API documentation

  • Simplified dependencies for rocAL

OpenMP#
  • MI300:

    • Added support for gfx942 targets

    • Fixed declare target variable access in unified_shared_memory mode

    • Enabled OMPX_APU_MAPS environment variable for MI200 and gfx942

    • Handled global pointers in forced USM (OMPX_APU_MAPS)

  • Nextgen AMDGPU plugin:

    • Respect GPU_MAX_HW_QUEUES in the AMDGPU Nextgen plugin, which takes precedence over the standard LIBOMPTARGET_AMDGPU_NUM_HSA_QUEUES environment variable

    • Changed the default for LIBOMPTARGET_AMDGPU_TEAMS_PER_CU from 4 to 6

    • Fixed the behavior of the OMPX_FORCE_SYNC_REGIONS environment variable, which is used to force synchronous target regions (the default is to use an asynchronous implementation)

    • Added support for and enabled default of code object version 5

    • Implemented target OMPT callbacks and trace records support in the nextgen plugin

  • Specialized kernels:

    • Removed redundant copying of arrays when xteam reductions are active but not offloaded

    • Tuned the number of teams for BigJumpLoop

    • Enabled specialized kernel generation with nested OpenMP pragmas, as long as there is no nested omp-parallel directive

Additions#
  • -fopenmp-runtimelib={lib,lib-perf,lib-debug} to select libs

  • Warning if mixed HIP / OpenMP offloading (i.e., if HIP language mode is active, but OpenMP target directives are encountered)

  • Introduced compile-time limit for the number of GPUs supported in a system: 16 GPUs in a single node is currently the maximum supported

Changes#
  • Correctly compute number of waves when workgroup size is less than the wave size

  • Implemented LIBOMPTARGET_KERNEL_TRACE=3, which prints DEVID traces and API timings

  • ASAN support for openmp release, debug, and perf libraries

  • Changed LDS lowering default to hybrid

Fixes#
  • Fixed RUNPATH for gdb plugin

  • Fixed hang in OMPT support if flush trace is called when there are no helper threads

rccl 2.15.5#

RCCL 2.15.5 for ROCm 6.0.0

Changes#
  • Compatibility with NCCL 2.15.5

  • Renamed the unit test executable to rccl-UnitTests

Additions#
  • HW-topology-aware binary tree implementation

  • Experimental support for MSCCL

  • New unit tests for hipGraph support

  • NPKit integration

Fixes#
  • rocm-smi ID conversion

  • Support for HIP_VISIBLE_DEVICES for unit tests

  • Support for p2p transfers to non (HIP) visible devices

rocALUTION 3.0.3#

rocALUTION 3.0.3 for ROCm 6.0.0

Additions#
  • Support for 64bit integer vectors

  • Inclusive and exclusive sum functionality for vector classes

  • Transpose functionality for GlobalMatrix and LocalMatrix

  • TripleMatrixProduct functionality for LocalMatrix

  • Sort() function for LocalVector class

  • Multiple stream support to the HIP backend

Optimizations#
  • GlobalMatrix::Apply() now uses multiple streams to better hide communication

Changes#
  • Matrix dimensions and number of non-zeros are now stored using 64-bit integers

  • Improved the ILUT preconditioner

Removals#
  • LocalVector::GetIndexValues(ValueType*)

  • LocalVector::SetIndexValues(const ValueType*)

  • LocalMatrix::RSDirectInterpolation(const LocalVector&, const LocalVector&, LocalMatrix*, LocalMatrix*)

  • LocalMatrix::RSExtPIInterpolation(const LocalVector&, const LocalVector&, bool, float, LocalMatrix*, LocalMatrix*)

  • LocalMatrix::RugeStueben()

  • LocalMatrix::AMGSmoothedAggregation(ValueType, const LocalVector&, const LocalVector&, LocalMatrix*, LocalMatrix*, int)

  • LocalMatrix::AMGAggregation(const LocalVector&, LocalMatrix*, LocalMatrix*)

Fixes#
  • Unit tests no longer ignore BCSR block dimension

  • Fixed documentation typos

  • Bug in multi-coloring for non-symmetric matrix patterns

rocBLAS 4.0.0#

rocBLAS 4.0.0 for ROCm 6.0.0

Additions#
  • Beta API rocblas_gemm_batched_ex3 and rocblas_gemm_strided_batched_ex3

  • Input/output type f16_r/bf16_r and execution type f32_r support for Level 2 gemv_batched and gemv_strided_batched

  • Use of rocblas_status_excluded_from_build when calling functions that require Tensile (when using rocBLAS built without Tensile)

  • System for asynchronous kernel launches that set a rocblas_status failure based on a hipPeekAtLastError discrepancy

Optimizations#
  • TRSM performance for small sizes (m < 32 && n < 32)

Deprecations#
  • Atomic operations will be disabled by default in a future release of rocBLAS; you can enable or disable atomic operations using the rocblas_set_atomics_mode function (see the sketch below)
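
A minimal sketch of setting the atomics mode explicitly so behavior does not change when the default flips. The rocblas/rocblas.h include path and the rocblas_atomics_allowed / rocblas_atomics_not_allowed enum values are assumed from the standard rocBLAS API:

    #include <rocblas/rocblas.h>

    int main() {
        rocblas_handle handle;
        if (rocblas_create_handle(&handle) != rocblas_status_success) return 1;

        // Keep atomics enabled for performance (results may vary run to run) ...
        rocblas_set_atomics_mode(handle, rocblas_atomics_allowed);

        // ... or disable them for repeatable results, which is planned to become
        // the default in a future release.
        rocblas_set_atomics_mode(handle, rocblas_atomics_not_allowed);

        rocblas_destroy_handle(handle);
        return 0;
    }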

Removals#
  • rocblas_gemm_ext2 API function

  • In-place trmm API from Legacy BLAS is replaced by an API that supports both in-place and out-of-place trmm

  • int8x4 support is removed (int8 support is unchanged)

  • #define __STDC_WANT_IEC_60559_TYPES_EXT__ is removed from rocblas-types.h (if you want ISO/IEC TS 18661-3:2015 functionality, you must define __STDC_WANT_IEC_60559_TYPES_EXT__ before including float.h, math.h, and rocblas.h)

  • The default build removes device code for gfx803 architecture from the fat binary

Fixes#
  • Made offset calculations for 64-bit rocBLAS functions safe

    • Fixes for very large leading dimension or increment potentially causing overflow:

      • Level2: gbmv, gemv, hbmv, sbmv, spmv, tbmv, tpmv, tbsv, and tpsv

  • Lazy loading supports heterogeneous architecture setup and load-appropriate tensile library files, based on device architecture

  • Guards against no-op kernel launches that result in a potential hipGetLastError

Changes#
  • Reduced the default verbosity of rocblas-test (you can see all tests by setting the GTEST_LISTENER=PASS_LINE_IN_LOG environment variable)

rocFFT 1.0.25#

rocFFT 1.0.25 for ROCm 6.0.0

Additions#
  • Implemented experimental APIs to allow computing FFTs on data distributed across multiple devices in a single process

    • rocfft_field is a new type that can be added to a plan description to describe the layout of FFT input or output

    • rocfft_field_add_brick can be called to describe the brick decomposition of an FFT field, where each brick can be assigned a different device

    These interfaces are still experimental and subject to change. Your feedback is appreciated. You can raise questions and concerns by opening issues in the rocFFT issue tracker.

    Note that multi-device FFTs currently have several limitations (we plan to address these in future releases):

    • Real-complex (forward or inverse) FFTs are not supported

    • Planar format fields are not supported

    • Batch (the number_of_transforms provided to rocfft_plan_create) must be 1

    • FFT input is gathered to the current device at run time, so all FFT data must fit on that device

Optimizations#
  • Improved the performance of several 2D/3D real FFTs supported by the 2D_SINGLE kernel. Offline tuning provides more optimization for gfx90a

  • Removed an extra kernel launch from even-length, real-complex FFTs that use callbacks

Changes#
  • Built kernels in a solution map to the library kernel cache

  • Real forward transforms (real-to-complex) no longer overwrite input; rocFFT may still overwrite real inverse (complex-to-real) input, as this allows for faster performance

  • rocfft-rider and dyna-rocfft-rider have been renamed to rocfft-bench and dyna-rocfft-bench; these are controlled by the BUILD_CLIENTS_BENCH CMake option

    • Links for the former file names are installed, and the former BUILD_CLIENTS_RIDER CMake option is accepted for compatibility, but both will be removed in a future release

  • Binaries in debug builds no longer have a -d suffix

Fixes#
  • rocFFT now correctly handles load callbacks that convert data from a smaller data type (e.g., 16-bit integers -> 32-bit float)

ROCgdb 13.2#

ROCgdb 13.2 for ROCm 6.0.0

Additions#
  • Support for watchpoints on scratch memory addresses.

  • Added support for gfx1100, gfx1101, and gfx1102.

  • Added support for gfx942.

Optimizations#
  • Improved performances when handling the end of a process with a large number of threads.

Known issues#
  • On certain configurations, ROCgdb can show the following warning message: warning: Probes-based dynamic linker interface failed. Reverting to original interface. This does not affect ROCgdb’s functionality.

  • ROCgdb cannot debug a program on an AMDGPU device past a s_sendmsg sendmsg(MSG_DEALLOC_VGPRS) instruction. If an exception is reported after this instruction has been executed (including asynchronous exceptions), the wave is killed and the exceptions are only reported by the ROCm runtime.

rocm-cmake 0.11.0#

rocm-cmake 0.11.0 for ROCm 6.0.0

Changes#
  • Improved validation, documentation, and rocm-docs-core integration for ROCMSphinxDoc

Fixes#
  • Fixed extra make flags passed for Clang-Tidy (ROCMClangTidy).

  • Fixed issues with ROCMTest when using a module in a subdirectory

ROCm Compiler#
  • On MI300, kernel arguments can be preloaded into SGPRs rather than passed in memory. This feature is enabled with a compiler option, which also controls the number of arguments to pass in SGPRs.

  • Improved register allocation at -O0 to avoid compiler crashes (“ran out of registers during register allocation”)

  • Improved generation of debug information:

    • Improved compile time

    • Avoided compiler crashes

rocPRIM 3.0.0#

rocPRIM 3.0.0 for ROCm 6.0.0

Additions#
  • block_sort::sort() overload for keys and values with a dynamic size, for all block sort algorithms

  • All block_sort::sort() overloads with a dynamic size are now supported for block_sort_algorithm::merge_sort and block_sort_algorithm::bitonic_sort

  • New two-way partition primitive partition_two_way, which can write to two separate iterators

Optimizations#
  • Improved partition performance

Fixes#
  • Fixed rocprim::MatchAny for devices with 64-bit warp size

    • Note that rocprim::MatchAny is deprecated; use rocprim::match_any instead

ROCProfiler 2.0.0#

ROCProfiler 2.0.0 for ROCm 6.0.0

Additions#
  • Updated supported GPU architectures in README with profiler versions

  • Automatic ISA dumping for ATT. See README.

  • CSV mode for ATT. See README.

  • Added option to control kernel name truncation.

  • Limit rocprof(v1) script usage to only supported architectures.

  • Added tool versioning so that rocprofv2 can be run using rocprof. See the README for more information.

  • Added plugin versioning in rocprofv2. See the README for more details.

  • Added a --version option to rocprof and rocprofv2 to show the current rocprof/rocprofv2 version along with the ROCm version information.

rocRAND 2.10.17#

rocRAND 2.10.17 for ROCm 6.0.0

Changes#
  • Generator classes from rocrand.hpp are no longer copyable; in previous versions, copying duplicated internal references to the generators, which could lead to double-free or memory-leak errors (see the sketch after this list)

    • These types should be moved instead of copied; move constructors and operators are now defined
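
A minimal sketch of the move-only usage. The rocrand_cpp namespace, the rocrand/rocrand.hpp include path, and the mt19937 type name are assumptions about the C++ wrapper rather than confirmed identifiers:

    #include <utility>
    #include <rocrand/rocrand.hpp>  // assumed install location of rocrand.hpp

    int main() {
        // Assumption: the wrapper exposes the MT19937 engine as rocrand_cpp::mt19937.
        rocrand_cpp::mt19937 engine_a;

        // Copying is no longer allowed: earlier versions copied internal
        // references and could double free or leak memory.
        // rocrand_cpp::mt19937 engine_b = engine_a;   // no longer compiles

        // Transfer ownership with a move instead; move construction and
        // move assignment are now defined.
        rocrand_cpp::mt19937 engine_b = std::move(engine_a);
        (void)engine_b;
        return 0;
    }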

Optimizations#
  • Improved MT19937 initialization and generation performance

Removals#
  • Removed the hipRAND submodule from rocRAND; hipRAND is now only available as a separate package

  • Removed references to, and workarounds for, the deprecated hcc

Fixes#
  • mt19937_engine from rocrand.hpp is now move-constructible and move-assignable (previously, the move constructor and move assignment operator were deleted for this class)

  • Various fixes for the C++ wrapper header rocrand.hpp

    • The name of mrg31k3p is now correctly spelled (it was incorrectly named mrg31k3a in previous versions)

    • Added the missing order setter method for threefry4x64

    • Fixed the default ordering parameter for lfsr113

  • Build error when using Clang++ directly resulting from unsupported amdgpu-target references

rocSOLVER 3.24.0#

rocSOLVER 3.24.0 for ROCm 6.0.0

Additions#
  • Cholesky refactorization for sparse matrices: CSRRF_REFACTCHOL

  • Added rocsolver_rfinfo_mode and the ability to specify the desired refactorization routine (see rocsolver_set_rfinfo_mode)

Changes#
  • CSRRF_ANALYSIS and CSRRF_SOLVE now support sparse Cholesky factorization

rocSPARSE 3.0.2#

rocSPARSE 3.0.2 for ROCm 6.0.0

Changes#
  • Function arguments for rocsparse_spmv

  • Function arguments for rocsparse_xbsrmv routines

  • When using host pointer mode, you must now call hipStreamSynchronize following doti, dotci, spvv, and csr2ell

  • Improved documentation

  • Improved verbose output during argument checking on API function calls

Removals#
  • Auto stages from spmv, spmm, spgemm, spsv, spsm, and spitsv

  • Formerly deprecated rocsparse_spmm_ex routine

Fixes#
  • Bug in rocsparse-bench where the SpMV algorithm was not taken into account in CSR format

  • BSR and GEBSR routines (bsrmv, bsrsv, bsrmm, bsrgeam, gebsrmv, gebsrmm) didn’t always show block_dim==0 as an invalid size

  • Passing nnz = 0 to doti or dotci wasn’t always returning a dot product of 0

Additions#
  • rocsparse_inverse_permutation

  • Mixed-precisions for SpVV

  • Uniform int8 precision for gather and scatter

rocThrust 3.0.0#

rocThrust 3.0.0 for ROCm 6.0.0

Additions#
  • Updated to match upstream Thrust 2.0.1

  • NV_IF_TARGET macro from libcu++ for NVIDIA backend and HIP implementation for HIP backend

Changes#
  • The CMake build system now accepts GPU_TARGETS in addition to AMDGPU_TARGETS for setting targeted GPU architectures

    • GPU_TARGETS=all compiles for all supported architectures

    • AMDGPU_TARGETS is only provided for backwards compatibility (GPU_TARGETS is preferred)

  • Removed CUB symlink from the root of the repository

  • Removed support for deprecated macros (THRUST_DEVICE_BACKEND and THRUST_HOST_BACKEND)

Known issues#
  • The THRUST_HAS_CUDART macro, which is no longer used in Thrust (it’s provided only for legacy support) is replaced with NV_IF_TARGET and THRUST_RDC_ENABLED in the NVIDIA backend. The HIP backend doesn’t have a THRUST_RDC_ENABLED macro, so some branches in Thrust code may be unreachable in the HIP backend.

rocWMMA 1.3.0#

rocWMMA 1.3.0 for ROCm 6.0.0

Additions#
  • Support for gfx942

  • Support for f8, bf8, and xfloat32 data types

  • Support for HIP_NO_HALF, __HIP_NO_HALF_CONVERSIONS__, and __HIP_NO_HALF_OPERATORS__ (e.g., PyTorch environments)

Changes#
  • rocWMMA with hipRTC now supports bfloat16_t data type

  • gfx11 WMMA now uses lane swap instead of broadcast for layout adjustment

  • Updated samples GEMM parameter validation on host arch

Fixes#
  • Disabled GoogleTest static library deployment

  • Extended tests now build in large code model

Tensile 4.39.0#

Tensile 4.39.0 for ROCm 6.0.0

Additions#
  • Added aquavanjaram support: gfx942, fp8/bf8 datatype, xf32 datatype, and stochastic rounding for various datatypes

  • Added and updated tuning scripts

  • Added DirectToLds support for larger data types with 32-bit global load (old parameter DirectToLds is replaced with DirectToLdsA and DirectToLdsB), and the corresponding test cases

  • Added the average of frequency, power consumption, and temperature information for the winner kernels to the CSV file

  • Added asmcap check for MFMA + const src

  • Added support for wider local read + pack with v_perm (with VgprForLocalReadPacking=True)

  • Added a new parameter to increase miLatencyLeft

Optimizations#
  • Enabled InitAccVgprOpt for MatrixInstruction cases

  • Implemented local read related parameter calculations with DirectToVgpr

  • Enabled dedicated vgpr allocation for local read + pack

  • Optimized code initialization

  • Optimized sgpr allocation

  • Supported DGEMM TLUB + RLVW=2 for odd N (edge shift change)

  • Enabled miLatency optimization for specific data types, and fixed instruction scheduling

Changes#
  • Removed old code for DTL + (bpe * GlobalReadVectorWidth > 4)

  • Changed/updated failed CI tests for gfx11xx, InitAccVgprOpt, and DTLds

  • Removed unused CustomKernels and ReplacementKernels

  • Added a reject condition for DTVB + TransposeLDS=False (not supported so far)

  • Removed unused code for DirectToLds

  • Updated test cases for DTV + TransposeLDS=False

  • Moved the MinKForGSU parameter from globalparameter to BenchmarkCommonParameter to support smaller K

  • Changed how to calculate latencyForLR for miLatency

  • Set minimum value of latencyForLRCount for 1LDSBuffer to avoid getting rejected by overflowedResources=5 (related to miLatency)

  • Refactored allowLRVWBforTLUandMI and renamed it as VectorWidthB

  • Supported multi-gpu for different architectures in lazy library loading

  • Enabled dtree library for batch > 1

  • Added problem scale feature for dtree selection

  • Modified non-lazy load build to skip experimental logic

Fixes#
  • Predicate ordering for fp16alt impl round near zero mode to unbreak distance modes

  • Boundary check for mirror dims and re-enable disabled mirror dims test cases

  • Merge error affecting i8 with WMMA

  • Mismatch issue with DTLds + TSGR + TailLoop

  • Bug with InitAccVgprOpt + GSU>1 and a mismatch issue with PGR=0

  • Override for unloaded solutions when lazy loading

  • Adding missing headers

  • Boost link for a clean build on Ubuntu 22

  • Bug in forcestoresc1 arch selection

  • Compiler directive for gfx942

  • Formatting for DecisionTree_test.cpp

Library changes in ROCm 6.0.0#

Library              Version
AMDMIGraphX          2.7 ⇒ 2.8
composable_kernel    0.2.0
hipBLAS              1.1.0 ⇒ 2.0.0
hipCUB               2.13.1 ⇒ 3.0.0
hipFFT               1.0.12 ⇒ 1.0.13
hipRAND              2.10.16 ⇒ 2.10.17
hipSOLVER            1.8.2 ⇒ 2.0.0
hipSPARSE            2.3.8 ⇒ 3.0.0
hipTensor            1.1.0
MIOpen               2.19.0
MIVisionX            2.5.0
rccl                 2.15.5
rocALUTION           2.1.11 ⇒ 3.0.3
rocBLAS              3.1.0 ⇒ 4.0.0
rocFFT               1.0.24 ⇒ 1.0.25
rocm-cmake           0.10.0 ⇒ 0.11.0
rocPRIM              2.13.1 ⇒ 3.0.0
rocRAND              2.10.17
rocSOLVER            3.23.0 ⇒ 3.24.0
rocSPARSE            2.5.4 ⇒ 3.0.2
rocThrust            2.18.0 ⇒ 3.0.0
rocWMMA              1.2.0 ⇒ 1.3.0
rpp                  1.2.0 ⇒ 1.4.0
Tensile              4.38.0 ⇒ 4.39.0

AMDMIGraphX 2.8#

MIGraphX 2.8 for ROCm 6.0.0

Additions#
  • Support for MI300 GPUs

  • Support for TorchMIGraphX via PyTorch

  • Boosted overall performance by integrating rocMLIR

  • INT8 support for ONNX Runtime

  • Support for ONNX version 1.14.1

  • Added new operators: Qlinearadd, QlinearGlobalAveragePool, Qlinearconv, Shrink, CastLike, and RandomUniform

  • Added an error message for when gpu_targets is not set during MIGraphX compilation

  • Added parameter to set tolerances with migraphx-driver verify

  • Added support for MXR files > 4 GB

  • Added MIGRAPHX_TRACE_MLIR flag

  • BETA added capability for using ROCm Composable Kernels via the MIGRAPHX_ENABLE_CK=1 environment variable

Optimizations#
  • Improved performance support for INT8

  • Improved time precision while benchmarking candidate kernels from CK or MLIR

  • Removed contiguous from reshape parsing

  • Updated the ConstantOfShape operator to support Dynamic Batch

  • Simplified dynamic shapes-related operators to their static versions, where possible

  • Improved debugging tools for accuracy issues

  • Included a print warning about miopen_fusion while generating mxr

  • General reduction in system memory usage during model compilation

  • Created additional fusion opportunities during model compilation

  • Improved debugging for matchers

  • Improved general debug messages

Fixes#
  • Fixed scatter operator for nonstandard shapes with some models from ONNX Model Zoo

  • Provided a compile option to improve the accuracy of some models by disabling Fast-Math

  • Improved layernorm + pointwise fusion matching to ignore argument order

  • Fixed accuracy issue with ROIAlign operator

  • Fixed computation logic for the Trilu operator

  • Fixed support for the DETR model

Changes#
  • Changed MIGraphX version to 2.8

  • Extracted the test packages into a separate deb file when building MIGraphX from source

Removals#
  • Removed building Python 2.7 bindings

hipBLAS 2.0.0#

hipBLAS 2.0.0 for ROCm 6.0.0

Added#
  • added option to define HIPBLAS_USE_HIP_BFLOAT16 to switch API to use hip_bfloat16 type

  • added hipblasGemmExWithFlags API

Deprecated#
  • hipblasDatatype_t is deprecated and will be removed in a future release and replaced with hipDataType

  • hipblasComplex and hipblasDoubleComplex are deprecated and will be removed in a future release and replaced with hipComplex and hipDoubleComplex

  • use of hipblasDatatype_t for hipblasGemmEx for compute-type is deprecated and will be replaced with hipblasComputeType_t in a future release

Removed#
  • hipblasXtrmm that calculates B <- alpha * op(A) * B is removed and replaced with hipblasXtrmm that calculates C <- alpha * op(A) * B

hipCUB 3.0.0#

hipCUB 3.0.0 for ROCm 6.0.0

Changed#
  • Removed DOWNLOAD_ROCPRIM; forcing rocPRIM to download can be done with DEPENDENCIES_FORCE_DOWNLOAD.

hipFFT 1.0.13#

hipFFT 1.0.13 for ROCm 6.0.0

Changed#
  • hipfft-rider has been renamed to hipfft-bench, controlled by the BUILD_CLIENTS_BENCH CMake option. A link for the old file name is installed, and the old BUILD_CLIENTS_RIDER CMake option is accepted for compatibility but both will be removed in a future release.

  • Binaries in debug builds no longer have a “-d” suffix.

  • The minimum rocFFT required version has been updated to 1.0.21.

Added#
  • Implemented hipfftXtSetGPUs, hipfftXtMalloc, hipfftXtMemcpy, hipfftXtFree, hipfftXtExecDescriptor APIs to allow computing FFTs on multiple devices in a single process.

hipRAND 2.10.17#

hipRAND 2.10.17 for ROCm 6.0.0

Fixed#
  • Fixed benchmark and unit test builds on Windows.

hipSOLVER 2.0.0#

hipSOLVER 2.0.0 for ROCm 6.0.0

Added#
  • Added hipBLAS as an optional dependency to hipsolver-test. Use the BUILD_HIPBLAS_TESTS CMake option to test compatibility between hipSOLVER and hipBLAS.

Changed#
  • Types hipsolverOperation_t, hipsolverFillMode_t, and hipsolverSideMode_t are now aliases of hipblasOperation_t, hipblasFillMode_t, and hipblasSideMode_t.

Fixed#
  • Fixed tests for hipsolver info updates in ORGBR/UNGBR, ORGQR/UNGQR, ORGTR/UNGTR, ORMQR/UNMQR, and ORMTR/UNMTR.

hipSPARSE 3.0.0#

hipSPARSE 3.0.0 for ROCm 6.0.0

Added#
  • Added hipsparseGetErrorName and hipsparseGetErrorString

Changed#
  • Changed hipsparseSpSV_solve() API function to match cusparse API

  • Changed generic API functions to use const descriptors

  • Documentation improved

hipTensor 1.1.0#

hipTensor 1.1.0 for ROCm 6.0.0

Additions#
  • Architecture support for gfx940, gfx941, and gfx942

  • Client tests configuration parameters now support YAML file input format

Changes#
  • Doxygen now treats warnings as errors

Fixes#
  • Client tests output redirections now behave accordingly

  • Removed dependency static library deployment

  • Security issues for documentation

  • Compile issues in debug mode

  • Corrected soft link for ROCm deployment

rocALUTION 3.0.3#

rocALUTION 3.0.3 for ROCm 6.0.0

Added#
  • Added support for 64bit integer vectors

  • Added inclusive and exclusive sum functionality for Vector classes

  • Added Transpose functionality for Global/LocalMatrix

  • Added TripleMatrixProduct functionality for LocalMatrix

  • Added Sort() function for LocalVector class

  • Added multiple stream support to the HIP backend

Optimized#
  • GlobalMatrix::Apply() now uses multiple streams to better hide communication

Changed#
  • Matrix dimensions and number of non-zeros are now stored using 64bit integers

  • Improved ILUT preconditioner

Removed#
  • Removed LocalVector::GetIndexValues(ValueType*)

  • Removed LocalVector::SetIndexValues(const ValueType*)

  • Removed LocalMatrix::RSDirectInterpolation(const LocalVector&, const LocalVector&, LocalMatrix*, LocalMatrix*)

  • Removed LocalMatrix::RSExtPIInterpolation(const LocalVector&, const LocalVector&, bool, float, LocalMatrix*, LocalMatrix*)

  • Removed LocalMatrix::RugeStueben()

  • Removed LocalMatrix::AMGSmoothedAggregation(ValueType, const LocalVector&, const LocalVector&, LocalMatrix*, LocalMatrix*, int)

  • Removed LocalMatrix::AMGAggregation(const LocalVector&, LocalMatrix*, LocalMatrix*)

Fixed#
  • Unit tests do not ignore BCSR block dimension anymore

  • Fixed typos in the documentation

  • Fixed a bug in multicoloring for non-symmetric matrix patterns

rocBLAS 4.0.0#

rocBLAS 4.0.0 for ROCm 6.0.0

Added#
  • Addition of beta API rocblas_gemm_batched_ex3 and rocblas_gemm_strided_batched_ex3

  • Added input/output type f16_r/bf16_r and execution type f32_r support for Level 2 gemv_batched and gemv_strided_batched

  • Added rocblas_status_excluded_from_build to be used when calling functions which require Tensile when using rocBLAS built without Tensile

  • Added system for async kernel launches setting a failure rocblas_status based on hipPeekAtLastError discrepancy

Optimized#
  • Trsm performance for small sizes m < 32 && n < 32

Deprecated#
  • In a future release atomic operations will be disabled by default so results will be repeatable. Atomic operations can always be enabled or disabled using the function rocblas_set_atomics_mode. Enabling atomic operations can improve performance.

Removed#
  • rocblas_gemm_ext2 API function is removed

  • in-place trmm API from Legacy BLAS is removed. It is replaced by an API that supports both in-place and out-of-place trmm

  • int8x4 support is removed. int8 support is unchanged

  • The #define __STDC_WANT_IEC_60559_TYPES_EXT__ has been removed from rocblas-types.h. Users who want ISO/IEC TS 18661-3:2015 functionality must define __STDC_WANT_IEC_60559_TYPES_EXT__ before including float.h, math.h, and rocblas.h

  • The default build removes device code for gfx803 architecture from the fat binary

Fixed#
  • Make offset calculations for rocBLAS functions 64 bit safe. Fixes for very large leading dimension or increment potentially causing overflow:

    • Level2: gbmv, gemv, hbmv, sbmv, spmv, tbmv, tpmv, tbsv, tpsv

  • Lazy loading to support heterogeneous architecture setup and load appropriate tensile library files based on the device’s architecture

  • Guard against no-op kernel launches resulting in potential hipGetLastError

Changed#
  • Default verbosity of rocblas-test reduced. To see all tests set environment variable GTEST_LISTENER=PASS_LINE_IN_LOG

rocFFT 1.0.25#

rocFFT 1.0.25 for ROCm 6.0.0

Added#
  • Implemented experimental APIs to allow computing FFTs on data distributed across multiple devices in a single process.

    rocfft_field is a new type that can be added to a plan description, to describe layout of FFT input or output. rocfft_field_add_brick can be called one or more times to describe a brick decomposition of an FFT field, where each brick can be assigned a different device.

    These interfaces are still experimental and subject to change. We are interested to hear feedback on them. Questions and concerns may be raised by opening issues on the rocFFT issue tracker.

    Note that at this time, multi-device FFTs have several limitations:

    • Real-complex (forward or inverse) FFTs are not currently supported.

    • Planar format fields are not currently supported.

    • Batch (i.e. number_of_transforms provided to rocfft_plan_create) must be 1.

    • The FFT input is gathered to the current device at execute time, so all of the FFT data must fit on that device.

    We expect these limitations to be removed in future releases.

Optimizations#
  • Improved performance of some small 2D/3D real FFTs supported by 2D_SINGLE kernel. gfx90a gets more optimization by offline tuning.

  • Removed an extra kernel launch from even-length real-complex FFTs that use callbacks.

Changed#
  • Built kernels in solution-map to library kernel cache.

  • Real forward transforms (real-to-complex) no longer overwrite input. rocFFT still may overwrite real inverse (complex-to-real) input, as this allows for faster performance.

  • rocfft-rider and dyna-rocfft-rider have been renamed to rocfft-bench and dyna-rocfft-bench, controlled by the BUILD_CLIENTS_BENCH CMake option. Links for the old file names are installed, and the old BUILD_CLIENTS_RIDER CMake option is accepted for compatibility but both will be removed in a future release.

  • Binaries in debug builds no longer have a “-d” suffix.

Fixed#
  • rocFFT now correctly handles load callbacks that convert data from a smaller data type (e.g. 16-bit integers -> 32-bit float).

rocm-cmake 0.11.0#

rocm-cmake 0.11.0 for ROCm 6.0.0

Changed#
  • ROCMSphinxDoc: Improved validation, documentation and rocm-docs-core integration.

Fixed#
  • ROCMClangTidy: Fixed extra make flags passed for clang tidy.

  • ROCMTest: Fixed issues when using module in a subdirectory.

rocPRIM 3.0.0#

rocPRIM 3.0.0 for ROCm 6.0.0

Added#
  • block_sort::sort() overload for keys and values with a dynamic size, for all block sort algorithms. Additionally, all block_sort::sort() overloads with a dynamic size are now supported for block_sort_algorithm::merge_sort and block_sort_algorithm::bitonic_sort.

  • New two-way partition primitive partition_two_way which can write to two separate iterators.

Optimizations#
  • Improved the performance of partition.

Fixed#
  • Fixed rocprim::MatchAny for devices with 64-bit warp size. The function rocprim::MatchAny is deprecated and rocprim::match_any is preferred instead.

rocSOLVER 3.24.0#

rocSOLVER 3.24.0 for ROCm 6.0.0

Added#
  • Cholesky refactorization for sparse matrices

    • CSRRF_REFACTCHOL

  • Added rocsolver_rfinfo_mode and the ability to specify the desired refactorization routine (see rocsolver_set_rfinfo_mode).

Changed#
  • CSRRF_ANALYSIS and CSRRF_SOLVE now support sparse Cholesky factorization

rocSPARSE 3.0.2#

rocSPARSE 3.0.2 for ROCm 6.0.0

Added#
  • Added rocsparse_inverse_permutation

  • Added mixed precisions for SpVV

  • Added uniform int8 precision for Gather and Scatter

Optimized#
  • Optimization to doti routine

  • Optimization to spin-looping algorithms

Changed#
  • Changed rocsparse_spmv function arguments

  • Changed rocsparse_xbsrmv routines function arguments

  • doti, dotci, spvv, and csr2ell now require calling hipStreamSynchronize afterward when using host pointer mode

  • Improved documentation

  • Improved verbose output during argument checking on API function calls

Deprecated#
  • Deprecated rocsparse_spmv_ex

  • Deprecated rocsparse_xbsrmv_ex routines

Removed#
  • Removed auto stages from spmv, spmm, spgemm, spsv, spsm, and spitsv.

  • Removed rocsparse_spmm_ex routine

Fixed#
  • Fixed a bug in rocsparse-bench, where SpMV algorithm was not taken into account in CSR format

  • Fixed the BSR/GEBSR routines bsrmv, bsrsv, bsrmm, bsrgeam, gebsrmv, gebsrmm so that block_dim==0 is considered an invalid size

  • Fixed bug where passing nnz = 0 to doti or dotci did not always return a dot product of 0

rocThrust 3.0.0#

rocThrust 3.0.0 for ROCm 6.0.0

Added#
  • Updated to match upstream Thrust 2.0.1

  • NV_IF_TARGET macro from libcu++ for NVIDIA backend and HIP implementation for HIP backend.

Changed#
  • The cmake build system now additionally accepts GPU_TARGETS in addition to AMDGPU_TARGETS for setting the targeted gpu architectures. GPU_TARGETS=all will compile for all supported architectures. AMDGPU_TARGETS is only provided for backwards compatibility, GPU_TARGETS should be preferred.

Removed#
  • Removed cub symlink from the root of the repository.

  • Removed support for deprecated macros (THRUST_DEVICE_BACKEND and THRUST_HOST_BACKEND).

Fixed#
  • Fixed a segmentation fault when binary search / upper bound / lower bound / equal range was invoked with hip_rocprim::execute_on_stream_base policy.

Known Issues#
  • For the NVIDIA backend, NV_IF_TARGET and THRUST_RDC_ENABLED replace the THRUST_HAS_CUDART macro, which is no longer used in Thrust (it is provided only for legacy support). However, there is no THRUST_RDC_ENABLED macro available for the HIP backend, so some branches in Thrust’s code may be unreachable in the HIP backend.

rocWMMA 1.3.0#

rocWMMA 1.3.0 for ROCm 6.0.0

Added#
  • Added support for gfx940, gfx941 and gfx942 targets

  • Added support for f8, bf8 and xfloat32 datatypes

  • Added support for HIP_NO_HALF, __HIP_NO_HALF_CONVERSIONS__ and __HIP_NO_HALF_OPERATORS__ (e.g. pytorch environment)

Changed#
  • rocWMMA with hipRTC now supports bfloat16_t datatype

  • gfx11 wmma now uses lane swap instead of broadcast for layout adjustment

  • Updated samples GEMM parameter validation on host arch

Fixed#
  • Disabled gtest static library deployment

  • Extended tests now build in large code model

rpp 1.4.0#

rpp 1.4.0 for ROCm 6.0.0

Added#
  • New Tests

Optimizations#
  • Readme Updates

Changed#
  • Backend - Default Backend set to HIP

Fixed#
  • Minor bugs and warnings

Tested Configurations#
  • Linux distribution

    • Ubuntu - 18.04 / 20.04

    • CentOS - 8

  • ROCm: rocm-core - 5.0.0.50000-49

  • Clang - Version 6.0

  • CMake - Version 3.22.3

  • Boost - Version 1.72

  • IEEE 754-based half-precision floating-point library - Version 1.12.0

Rpp 1.2.0#
Known Issues#
  • CPU only backend not enabled

Tensile 4.39.0#

Tensile 4.39.0 for ROCm 6.0.0

Added#
  • Added aquavanjaram support: gfx940/gfx941/gfx942, fp8/bf8 datatype, xf32 datatype, and stochastic rounding for various datatypes

  • Added/updated tuning scripts

  • Added DirectToLds support for larger data types with 32bit global load (old parameter DirectToLds is replaced with DirectToLdsA and DirectToLdsB), and the corresponding test cases

  • Added the average of frequency, power consumption, and temperature information for the winner kernels to the CSV file

  • Added asmcap check for MFMA + const src

  • Added support for wider local read + pack with v_perm (with VgprForLocalReadPacking=True)

  • Added a new parameter to increase miLatencyLeft

Optimizations#
  • Enabled InitAccVgprOpt for MatrixInstruction cases

  • Implemented local read related parameter calculations with DirectToVgpr

  • Adjusted miIssueLatency for gfx940

  • Enabled dedicated vgpr allocation for local read + pack

  • Optimized code initialization

  • Optimized sgpr allocation

  • Supported DGEMM TLUB + RLVW=2 for odd N (edge shift change)

  • Enabled miLatency optimization for (gfx940/gfx941 + MFMA) for specific data types, and fixed instruction scheduling

Changed#
  • Removed old code for DTL + (bpe * GlobalReadVectorWidth > 4)

  • Changed/updated failed CI tests for gfx11xx, InitAccVgprOpt, and DTLds

  • Removed unused CustomKernels and ReplacementKernels

  • Added a reject condition for DTVB + TransposeLDS=False (not supported so far)

  • Removed unused code for DirectToLds

  • Updated test cases for DTV + TransposeLDS=False

  • Moved parameter MinKForGSU from globalparameter to BenchmarkCommonParameter to support smaller K

  • Changed how to calculate latencyForLR for miLatency

  • Set minimum value of latencyForLRCount for 1LDSBuffer to avoid getting rejected by overflowedResources=5 (related to miLatency)

  • Refactored allowLRVWBforTLUandMI and renamed it as VectorWidthB

  • Supported multi-gpu for different architectures in lazy library loading

  • Enabled dtree library for batch > 1

  • Added problem scale feature for dtree selection

  • Enabled ROCm SMI for gfx940/941.

  • Modified non-lazy load build to skip experimental logic

Fixed#
  • Fixed predicate ordering for fp16alt impl round near zero mode to unbreak distance modes

  • Fixed boundary check for mirror dims and re-enable disabled mirror dims test cases

  • Fixed merge error affecting i8 with wmma

  • Fixed mismatch issue with DTLds + TSGR + TailLoop

  • Fixed a bug with InitAccVgprOpt + GSU>1 and a mismatch issue with PGR=0

  • Fixed override for unloaded solutions when lazy loading

  • Fixed some build errors (added missing headers)

  • Fixed boost link for a clean build on ubuntu22

  • Fixed bug in forcestoresc1 arch selection

  • Fixed compiler directive for gfx941 and gfx942

  • Fixed formatting for DecisionTree_test.cpp


ROCm 5.7.1#

What’s new in this release#

ROCm 5.7.1 is a point release with several bug fixes in the HIP runtime.

Installing all GPU AddressSanitizer packages with a single command#

ROCm 5.7.1 simplifies the installation steps for the optional AddressSanitizer (ASan) packages. This release provides the meta package rocm-ml-sdk-asan for ease of ASan installation. The following command can be used to install all ASan packages rather than installing each package separately:

    sudo apt-get install rocm-ml-sdk-asan

For more detailed information about using the GPU AddressSanitizer, refer to the user guide.

ROCm libraries#
rocBLAS#

New functionality, rocblas-gemm-tune, and a new environment variable, ROCBLAS_TENSILE_GEMM_OVERRIDE_PATH, have been added to rocBLAS in the ROCm 5.7.1 release.

rocblas-gemm-tune is used to find the best-performing GEMM kernel for each GEMM problem set. It has a command-line interface that mimics the --yaml input used by rocblas-bench. To generate the expected --yaml input, profile logging can be used by setting the environment variable ROCBLAS_LAYER=4.

For more information on rocBLAS logging, see Logging in rocBLAS, in the API Reference Guide.

In the output produced from an example input file (note that the selected GEMM index may differ between runs), the far-right values (solution_index) are the indices of the best-performing kernels for those GEMMs in the rocBLAS kernel library. These indices can be directly used in future GEMM calls. See rocBLAS/samples/example_user_driven_tuning.cpp for sample code that uses kernels directly via their indices.

If the output is stored in a file, the results can be used to override default kernel selection with the kernels found by setting the environment variable ROCBLAS_TENSILE_GEMM_OVERRIDE_PATH, which points to the stored file.

For more details, refer to the rocBLAS Programmer’s Guide.

HIP 5.7.1 (for ROCm 5.7.1)#

ROCm 5.7.1 is a point release with several bug fixes in the HIP runtime.

Defect fixes#

The hipPointerGetAttributes API returns the correct HIP memory type as hipMemoryTypeManaged for managed memory.
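
A minimal sketch of verifying the fix. hipMallocManaged and hipPointerGetAttributes are standard HIP runtime APIs; note that in the ROCm 5.7 series the attribute field is still named memoryType (it is renamed to type in ROCm 6.0):

    #include <cstdio>
    #include <hip/hip_runtime.h>

    int main() {
        void* p = nullptr;
        if (hipMallocManaged(&p, 1024) != hipSuccess) return 1;

        hipPointerAttribute_t attr{};
        if (hipPointerGetAttributes(&attr, p) == hipSuccess) {
            // With the 5.7.1 fix, managed allocations report hipMemoryTypeManaged.
            std::printf("managed: %s\n",
                        attr.memoryType == hipMemoryTypeManaged ? "yes" : "no");
        }
        hipFree(p);
        return 0;
    }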

Library changes in ROCm 5.7.1#

Library              Version
AMDMIGraphX          2.7
composable_kernel    0.2.0
hipBLAS              1.1.0
hipCUB               2.13.1
hipFFT               1.0.12
hipRAND              2.10.16
hipSOLVER            1.8.1 ⇒ 1.8.2
hipSPARSE            2.3.8
MIOpen               2.19.0
MIVisionX            2.5.0
rocALUTION           2.1.11
rocBLAS              3.1.0
rocFFT               1.0.24
rocm-cmake           0.10.0
rocPRIM              2.13.1
rocRAND              2.10.17
rocSOLVER            3.23.0
rocSPARSE            2.5.4
rocThrust            2.18.0
rocWMMA              1.2.0
rpp                  1.2.0
Tensile              4.38.0

hipSOLVER 1.8.2#

hipSOLVER 1.8.2 for ROCm 5.7.1

Fixed#
  • Fixed conflicts between the hipsolver-dev and -asan packages by excluding hipsolver_module.f90 from the latter


ROCm 5.7.0#

Release highlights for ROCm 5.7#

New features include:

  • A new library (hipTensor)

  • Optimizations for rocRAND and MIVisionX

  • AddressSanitizer for host and device code (GPU) is now available as a beta

Note that ROCm 5.7.0 is EOS for MI50. 5.7 versions of ROCm are the last major releases in the ROCm 5 series. This release is Linux-only.

Important

The next major ROCm release (ROCm 6.0) will not be backward compatible with the ROCm 5 series. Changes will include: splitting LLVM packages into more manageable sizes, changes to the HIP runtime API, splitting rocRAND and hipRAND into separate packages, and reorganizing our file structure.

AMD Instinct™ MI50 end-of-support notice#

AMD Instinct MI50, Radeon Pro VII, and Radeon VII products (collectively gfx906 GPUs) will enter maintenance mode starting Q3 2023.

As outlined in the ROCm 5.6.0 release notes, ROCm 5.7 will be the final release for gfx906 GPUs to be in a fully supported state.

  • ROCm 6.0 release will show MI50s as “under maintenance” for Linux and Windows

  • No new features and performance optimizations will be supported for the gfx906 GPUs beyond this major release (ROCm 5.7).

  • Bug fixes and critical security patches will continue to be supported for the gfx906 GPUs until Q2 2024 (end of maintenance [EOM] will be aligned with the closest ROCm release).

  • Bug fixes during the maintenance will be made to the next ROCm point release.

  • Bug fixes will not be backported to older ROCm releases for gfx906.

  • Distribution and operating system updates will continue per the ROCm release cadence for gfx906 GPUs until EOM.

Feature updates#
Non-hostcall HIP printf#

Current behavior

The current version of HIP printf relies on hostcalls, which, in turn, rely on PCIe atomics. However, PCIe atomics are unavailable in some environments and, as a result, HIP printf does not work in those environments. Users may see the following error from the runtime (with AMD_LOG_LEVEL 1 and above):

    Pcie atomics not enabled, hostcall not supported

Workaround

The ROCm 5.7 release introduces an alternative to the current hostcall-based implementation that leverages an older OpenCL-based printf scheme, which does not rely on hostcalls/PCIe atomics.

Note

This option is less robust than hostcall-based implementation and is intended to be a workaround when hostcalls do not work.

The printf variant is now controlled via a new compiler option, -mprintf-kind=. This is supported only for HIP programs and takes the following values (a minimal usage sketch follows the note below):

  • “hostcall” – This currently available implementation relies on hostcalls, which require the system to support PCIe atomics. It is the default scheme.

  • “buffered” – This implementation leverages the older printf scheme used by OpenCL; it relies on a memory buffer where printf arguments are stored during the kernel execution, and then the runtime handles the actual printing once the kernel finishes execution.

NOTE: With the new workaround:

  • The printf buffer is fixed size and non-circular. After the buffer is filled, calls to printf will not result in additional output.

  • The printf call returns either 0 (on success) or -1 (on failure, due to full buffer), unlike the hostcall scheme that returns the number of characters printed.
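
A minimal sketch of using the buffered scheme; only the -mprintf-kind option described above and standard in-kernel printf are assumed:

    // Build with the buffered printf scheme, for example:
    //   hipcc -mprintf-kind=buffered printf_demo.cpp -o printf_demo
    #include <cstdio>
    #include <hip/hip_runtime.h>

    __global__ void hello() {
        // With "buffered", printf returns 0 on success or -1 once the fixed-size,
        // non-circular buffer is full (not the number of characters printed).
        int rc = printf("hello from thread %d\n", static_cast<int>(threadIdx.x));
        (void)rc;
    }

    int main() {
        hello<<<1, 4>>>();
        hipDeviceSynchronize();
        return 0;
    }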

Beta release of LLVM AddressSanitizer (ASan) with the GPU#

The ROCm 5.7 release introduces the beta release of LLVM AddressSanitizer (ASan) with the GPU. The LLVM ASan provides a process that allows developers to detect runtime addressing errors in applications and libraries. The detection is achieved using a combination of compiler-added instrumentation and runtime techniques, including function interception and replacement.

Until now, the LLVM ASan process was only available for traditional purely CPU applications. However, ROCm has extended this mechanism to additionally allow the detection of some addressing errors on the GPU in heterogeneous applications. Ideally, developers should treat heterogeneous HIP and OpenMP applications like pure CPU applications. However, this simplicity has not been achieved yet.

Refer to the documentation on LLVM ASan with the GPU at LLVM AddressSanitizer User Guide.

Note

The beta release of LLVM ASan for ROCm is currently tested and validated on Ubuntu 20.04.

Defect fixes#

The following defects are fixed in ROCm v5.7:

  • Test hangs observed in HMM RCCL

  • NoGpuTst test of Catch2 fails with Docker

  • Failures observed with non-HMM HIP directed catch2 tests with XNACK+

  • Multiple test failures and test hangs observed in hip-directed catch2 tests with xnack+

HIP 5.7.0#
Optimizations#
Additions#
  • Added meta_group_size/rank for getting the number of tiles and the rank of a tile in the partition (see the sketch after this list)

  • Added new APIs supporting Windows only, under development on Linux

    • hipMallocMipmappedArray for allocating a mipmapped array on the device

    • hipFreeMipmappedArray for freeing a mipmapped array on the device

    • hipGetMipmappedArrayLevel for getting a mipmap level of a HIP mipmapped array

    • hipMipmappedArrayCreate for creating a mipmapped array

    • hipMipmappedArrayDestroy for destroying a mipmapped array

    • hipMipmappedArrayGetLevel for getting a mipmapped array at a given mipmap level
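
A minimal sketch of the new tile queries, assuming the HIP cooperative-groups interface mirrors CUDA's tiled_partition API:

    #include <cstdio>
    #include <hip/hip_runtime.h>
    #include <hip/hip_cooperative_groups.h>

    namespace cg = cooperative_groups;

    __global__ void tile_info() {
        cg::thread_block block = cg::this_thread_block();
        // Partition the block into tiles of 16 threads each.
        auto tile = cg::tiled_partition<16>(block);
        if (tile.thread_rank() == 0) {
            // meta_group_size(): number of tiles in the partition.
            // meta_group_rank(): rank of this tile within the partition.
            printf("tile %u of %u\n",
                   static_cast<unsigned>(tile.meta_group_rank()),
                   static_cast<unsigned>(tile.meta_group_size()));
        }
    }

    int main() {
        tile_info<<<1, 64>>>();
        hipDeviceSynchronize();
        return 0;
    }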

Changes#
Fixes#
Known issues#
  • The HIP memory type enum currently has no value equivalent to cudaMemoryTypeUnregistered, due to HIP backward compatibility.

  • The HIP API hipPointerGetAttributes can return an invalid value if the input memory pointer was not allocated through any HIP API on the device or host.

Upcoming changes for HIP in ROCm 6.0 release#
  • Removal of gcnarch from hipDeviceProp_t structure

  • Addition of new fields in hipDeviceProp_t structure

    • maxTexture1D

    • maxTexture2D

    • maxTexture1DLayered

    • maxTexture2DLayered

    • sharedMemPerMultiprocessor

    • deviceOverlap

    • asyncEngineCount

    • surfaceAlignment

    • unifiedAddressing

    • computePreemptionSupported

    • hostRegisterSupported

    • uuid

  • Removal of deprecated code -hip-hcc codes from hip code tree

  • Correct hipArray usage in HIP APIs such as hipMemcpyAtoH and hipMemcpyHtoA

  • HIPMEMCPY_3D fields correction to avoid truncation of “size_t” to “unsigned int” inside hipMemcpy3D()

  • Renaming of ‘memoryType’ in hipPointerAttribute_t structure to ‘type’

  • Correct hipGetLastError to return the last error instead of last API call’s return code

  • Update hipExternalSemaphoreHandleDesc to add “unsigned int reserved[16]”

  • Correct handling of flag values in hipIpcOpenMemHandle for hipIpcMemLazyEnablePeerAccess

  • Remove hipArray* and make it opaque with hipArray_t

Library changes in ROCm 5.7.0#

Library              Version
AMDMIGraphX          2.5 ⇒ 2.7
composable_kernel    0.2.0
hipBLAS              0.54.0 ⇒ 1.1.0
hipCUB               2.13.1
hipFFT               1.0.12
hipRAND              2.10.16
hipSOLVER            1.8.0 ⇒ 1.8.1
hipSPARSE            2.3.7 ⇒ 2.3.8
MIOpen               2.19.0
MIVisionX            2.4.0 ⇒ 2.5.0
rocALUTION           2.1.9 ⇒ 2.1.11
rocBLAS              3.0.0 ⇒ 3.1.0
rocFFT               1.0.23 ⇒ 1.0.24
rocm-cmake           0.9.0 ⇒ 0.10.0
rocPRIM              2.13.0 ⇒ 2.13.1
rocRAND              2.10.17
rocSOLVER            3.22.0 ⇒ 3.23.0
rocSPARSE            2.5.2 ⇒ 2.5.4
rocThrust            2.18.0
rocWMMA              1.1.0 ⇒ 1.2.0
rpp                  1.2.0
Tensile              4.37.0 ⇒ 4.38.0

AMDMIGraphX 2.7#

MIGraphX 2.7 for ROCm 5.7.0

Added#
  • Enabled hipRTC to not require dev packages for migraphx runtime and allow the ROCm install to be in a different directory than it was during build time

  • Add support for multi-target execution

  • Added Dynamic Batch support with C++/Python APIs

  • Add migraphx.create_argument to python API

  • Added dockerfile example for Ubuntu 22.04

  • Add TensorFlow supported ops in the driver, similar to the existing ONNX operator list

  • Add a MIGRAPHX_TRACE_MATCHES_FOR env variable to filter the matcher trace

  • Improved debugging by printing max, min, mean, and stddev values for TRACE_EVAL = 2

  • Use the fast_math flag instead of an ENV flag for GELU

  • Print message from driver if offload copy is set for compiled program

Optimizations#
  • Optimized for ONNX Runtime 1.14.0

  • Improved compile times by only building for the GPU on the system

  • Improve performance of pointwise/reduction kernels when using NHWC layouts

  • Load specific version of the migraphx_py library

  • Annotate functions with the block size so the compiler can do a better job of optimizing

  • Enable reshape on nonstandard shapes

  • Use half HIP APIs to compute max and min

  • Added support for broadcasted scalars to unsqueeze operator

  • Improved multiplies with dot operator

  • Handle broadcasts across dot and concat

  • Add verify namespace for better symbol resolution

Fixed#
  • Resolved accuracy issues with FP16 resnet50

  • Update cpp generator to handle inf from float

  • Fix assertion error during verify and make DCE work with tuples

  • Fix convert operation for NaNs

  • Fix shape typo in API test

  • Fix compile warnings for shadowing variable names

  • Add missing specialization for the nullptr for the hash function

Changed#
  • Bumped version of half library to 5.6.0

  • Bumped CI to support rocm 5.6

  • Make building tests optional

  • replace np.bool with bool as per numpy request

Removed#
  • Removed int8x4 rocBlas calls due to deprecation

  • Removed std::reduce usage since not all operating systems support it

composable_kernel 0.2.0#

CK 0.2.0 for ROCm 5.7.0

Fixed#
  • Fixed a bug in 6-dimensional kernels (#555).

  • Fixed grouped ConvBwdWeight test case failure (#524).

Optimizations#
  • Improved performance of the normalization kernel

Added#
  • Added support on NAVI3x.

  • Added user tutorial (#563).

  • Added more instances for irregular GEMM sizes (#560).

  • Added inter-wave consumer-producer programming model for GEMM kernels (#310).

  • Added multi-D GEMM client APIs (#534).

  • Added multi-embeddings support (#542).

  • Added Navi3x blockwise GEMM and real GEMM support (#541).

  • Added Navi grouped ConvBwdWeight support (#505).

Changed#
  • Changed …

hipBLAS 1.1.0#

hipBLAS 1.1.0 for ROCm 5.7.0

Changed#
  • updated documentation requirements

Dependencies#
  • dependency rocSOLVER now depends on rocSPARSE

hipSOLVER 1.8.1#

hipSOLVER 1.8.1 for ROCm 5.7.0

Changed#
  • Changed hipsolver-test sparse input data search paths to be relative to the test executable

hipSPARSE 2.3.8#

hipSPARSE 2.3.8 for ROCm 5.7.0

Improved#
  • Fix compilation failures when using cusparse 12.1.0 backend

  • Fix compilation failures when using cusparse 12.0.0 backend

  • Fix compilation failures when using cusparse 10.1 (non-update versions) as backend

  • Minor improvements

MIVisionX 2.5.0#

MIVisionX for ROCm 5.7.0

Added#
  • CTest - OpenVX Tests

  • Hardware Support

Optimizations#
  • CMakeList Cleanup

Changed#
  • rocAL - PyBind Link to prebuilt library

    • PyBind11

    • RapidJSON

  • Setup Updates

  • RPP Version - 1.2.0

  • Dockerfiles - Updates & bugfix

Fixed#
  • rocAL bug fix and updates

Tested Configurations#
  • Windows 10 / 11

  • Linux distribution

    • Ubuntu - 20.04 / 22.04

    • CentOS - 7 / 8

    • RHEL - 8 / 9

    • SLES - 15-SP4

  • ROCm: rocm-core - 5.4.3.50403-121

  • miopen-hip - 2.19.0.50403-121

  • miopen-opencl - 2.18.0.50300-63

  • migraphx - 2.4.0.50403-121

  • Protobuf - V3.12.4

  • OpenCV - 4.6.0

  • RPP - 1.2.0

  • FFMPEG - n4.4.2

  • Dependencies for all the above packages

  • MIVisionX Setup Script - V2.5.4

Known Issues#
  • OpenCV 4.X support for some apps missing

MIVisionX Dependency Map#
HIP Backend#

Docker Image: sudo docker build -f docker/ubuntu20/{DOCKER_LEVEL_FILE_NAME}.dockerfile -t {mivisionx-level-NUMBER} .

  • (new) marks a new component added at the level

  • (existing) marks an existing component from the previous level

Build Level: Level_1

  • MIVisionX dependencies: cmake, gcc, g++

  • Modules: amd_openvx, utilities

  • Libraries and executables: (new) libopenvx.so - OpenVX™ Lib - CPU; (new) libvxu.so - OpenVX™ immediate node Lib - CPU; (new) runvx - OpenVX™ Graph Executor - CPU with Display OFF

Build Level: Level_2

  • MIVisionX dependencies: ROCm HIP, plus Level 1

  • Modules: amd_openvx, amd_openvx_extensions, utilities

  • Libraries and executables: (new) libopenvx.so - OpenVX™ Lib - CPU/GPU; (new) libvxu.so - OpenVX™ immediate node Lib - CPU/GPU; (new) runvx - OpenVX™ Graph Executor - Display OFF

Build Level: Level_3

  • MIVisionX dependencies: OpenCV, FFMPEG, plus Level 2

  • Modules: amd_openvx, amd_openvx_extensions, utilities

  • Libraries and executables: (existing) libopenvx.so - OpenVX™ Lib; (existing) libvxu.so - OpenVX™ immediate node Lib; (new) libvx_amd_media.so - OpenVX™ Media Extension; (new) libvx_opencv.so - OpenVX™ OpenCV InterOp Extension; (new) mv_compile - Neural Net Model Compile; (new) runvx - OpenVX™ Graph Executor - Display ON

Build Level: Level_4

  • MIVisionX dependencies: MIOpenGEMM, MIOpen, ProtoBuf, plus Level 3

  • Modules: amd_openvx, amd_openvx_extensions, apps, utilities

  • Libraries and executables: (existing) libopenvx.so - OpenVX™ Lib; (existing) libvxu.so - OpenVX™ immediate node Lib; (existing) libvx_amd_media.so - OpenVX™ Media Extension; (existing) libvx_opencv.so - OpenVX™ OpenCV InterOp Extension; (existing) mv_compile - Neural Net Model Compile; (existing) runvx - OpenVX™ Graph Executor - Display ON; (new) libvx_nn.so - OpenVX™ Neural Net Extension

Build Level: Level_5

  • MIVisionX dependencies: AMD_RPP, rocAL deps, plus Level 4

  • Modules: amd_openvx, amd_openvx_extensions, apps, rocAL, utilities

  • Libraries and executables: (existing) libopenvx.so - OpenVX™ Lib; (existing) libvxu.so - OpenVX™ immediate node Lib; (existing) libvx_amd_media.so - OpenVX™ Media Extension; (existing) libvx_opencv.so - OpenVX™ OpenCV InterOp Extension; (existing) mv_compile - Neural Net Model Compile; (existing) runvx - OpenVX™ Graph Executor - Display ON; (existing) libvx_nn.so - OpenVX™ Neural Net Extension; (new) libvx_rpp.so - OpenVX™ RPP Extension; (new) librocal.so - Radeon Augmentation Library; (new) rocal_pybind.so - rocAL Pybind Lib

OpenCL Backend#

Docker Image: sudo docker build -f docker/ubuntu20/{DOCKER_LEVEL_FILE_NAME}.dockerfile -t {mivisionx-level-NUMBER} .

In the listing below, (new) marks a component added at that level and (existing) marks a component carried over from the previous level.

Level_1

  • MIVisionX dependencies: cmake, gcc, g++

  • Modules: amd_openvx, utilities

  • Libraries and executables: libopenvx.so - OpenVX™ Lib, CPU (new); libvxu.so - OpenVX™ immediate node Lib, CPU (new); runvx - OpenVX™ Graph Executor, CPU with Display OFF (new)

Level_2

  • MIVisionX dependencies: ROCm OpenCL, plus Level 1

  • Modules: amd_openvx, amd_openvx_extensions, utilities

  • Libraries and executables: libopenvx.so - OpenVX™ Lib, CPU/GPU (new); libvxu.so - OpenVX™ immediate node Lib, CPU/GPU (new); libvx_loomsl.so - Loom 360 Stitch Lib (new); loom_shell - 360 Stitch App (new); runcl - OpenCL™ program debug App (new); runvx - OpenVX™ Graph Executor, Display OFF (new)

Level_3

  • MIVisionX dependencies: OpenCV, FFMPEG, plus Level 2

  • Modules: amd_openvx, amd_openvx_extensions, utilities

  • Libraries and executables: libopenvx.so, libvxu.so, libvx_loomsl.so, loom_shell, and runcl (all existing); libvx_amd_media.so - OpenVX™ Media Extension (new); libvx_opencv.so - OpenVX™ OpenCV InterOp Extension (new); mv_compile - Neural Net Model Compile (new); runvx - OpenVX™ Graph Executor, Display ON (new)

Level_4

  • MIVisionX dependencies: MIOpenGEMM, MIOpen, ProtoBuf, plus Level 3

  • Modules: amd_openvx, amd_openvx_extensions, apps, utilities

  • Libraries and executables: libopenvx.so, libvxu.so, libvx_loomsl.so, loom_shell, libvx_amd_media.so, libvx_opencv.so, mv_compile, runcl, and runvx - Display ON (all existing); libvx_nn.so - OpenVX™ Neural Net Extension (new); inference_server_app - Cloud Inference App (new)

Level_5

  • MIVisionX dependencies: AMD_RPP, rocAL deps, plus Level 4

  • Modules: amd_openvx, amd_openvx_extensions, apps, rocAL, utilities

  • Libraries and executables: libopenvx.so, libvxu.so, libvx_loomsl.so, loom_shell, libvx_amd_media.so, libvx_opencv.so, mv_compile, runcl, runvx, libvx_nn.so, and inference_server_app (all existing); libvx_rpp.so - OpenVX™ RPP Extension (new); librocal.so - Radeon Augmentation Library (new); rocal_pybind.so - rocAL Pybind Lib (new)

NOTE: OpenVX and the OpenVX logo are trademarks of the Khronos Group Inc.

rocALUTION 2.1.11#

rocALUTION 2.1.11 for ROCm 5.7.0

Added#
  • Added support for gfx940, gfx941 and gfx942

Improved#
  • Fixed OpenMP runtime issue with Windows toolchain

rocBLAS 3.1.0#

rocBLAS 3.1.0 for ROCm 5.7.0

Added#
  • yaml lock step argument scanning for rocblas-bench and rocblas-test clients. See Programmers Guide for details.

  • rocblas-gemm-tune is used to find the best performing GEMM kernel for each of a given set of GEMM problems.

Fixed#
  • make offset calculations for rocBLAS functions 64-bit safe. This fixes potential overflow for very large leading dimensions or increments:

    • Level 1: axpy, copy, rot, rotm, scal, swap, asum, dot, iamax, iamin, nrm2

    • Level 2: gemv, symv, hemv, trmv, ger, syr, her, syr2, her2, trsv

    • Level 3: gemm, symm, hemm, trmm, syrk, herk, syr2k, her2k, syrkx, herkx, trsm, trtri, dgmm, geam

    • General: set_vector, get_vector, set_matrix, get_matrix

    • Related fixes: internal scalar loads with > 32bit offsets

    • fix in-place functionality for all trtri sizes

Changed#
  • dot when using rocblas_pointer_mode_host is now synchronous to match legacy BLAS as it stores results in host memory

  • enhanced reporting of installation issues caused by runtime libraries (Tensile)

  • standardized internal rocblas C++ interface across most functions

Deprecated#
  • Removal of the __STDC_WANT_IEC_60559_TYPES_EXT__ define in a future release

Dependencies#
  • optional use of AOCL BLIS 4.0 on Linux for clients

  • optional build tool only dependency on python psutil

rocFFT 1.0.24#

rocFFT 1.0.24 for ROCm 5.7.0

Optimizations#
  • Improved performance of complex forward/inverse 1D FFTs (2049 <= length <= 131071) that use Bluestein’s algorithm.

Added#
  • Implemented a solution-map version converter and finished the first conversion from version 0 to version 1, where version 1 removes some incorrect kernels (sbrc/sbcr using half_lds)

Changed#
  • Moved rocfft_rtc_helper executable to lib/rocFFT directory on Linux.

  • Moved library kernel cache to lib/rocFFT directory.

rocm-cmake 0.10.0#

rocm-cmake 0.10.0 for ROCm 5.7.0

Added#
  • Added ROCMTest module

  • ROCMCreatePackage: Added support for ASAN packages

rocPRIM 2.13.1#

rocPRIM 2.13.1 for ROCm 5.7.0

Changed#
  • Deprecated configuration radix_sort_config for device-level radix sort as it no longer matches the algorithm’s parameters. New configuration radix_sort_config_v2 is preferred instead.

  • Removed erroneous implementation of device-level inclusive_scan and exclusive_scan. The prior default implementation using lookback-scan now is the only available implementation.

  • The benchmark metric indicating the bytes processed for exclusive_scan_by_key and inclusive_scan_by_key has been changed to incorporate the key type. Furthermore, the benchmark log has been changed such that these algorithms are reported as scan and scan_by_key instead of scan_exclusive and scan_inclusive.

  • Deprecated configurations scan_config and scan_by_key_config for device-level scans, as they no longer match the algorithm’s parameters. New configurations scan_config_v2 and scan_by_key_config_v2 are preferred instead.

Fixed#
  • Fixed build issue caused by missing header in thread/thread_search.hpp.

rocSOLVER 3.23.0#

rocSOLVER 3.23.0 for ROCm 5.7.0

Added#
  • LU factorization without pivoting for block tridiagonal matrices:

    • GEBLTTRF_NPVT now supports interleaved_batched format

  • Linear system solver without pivoting for block tridiagonal matrices:

    • GEBLTTRS_NPVT now supports interleaved_batched format

Fixed#
  • Fixed stack overflow in sparse tests on Windows

Changed#
  • Changed rocsolver-test sparse input data search paths to be relative to the test executable

  • Changed build scripts to default to compressed debug symbols in Debug builds

rocSPARSE 2.5.4#

rocSPARSE 2.5.4 for ROCm 5.7.0

Added#
  • Added more mixed precisions for SpMV: (matrix: float, vectors: double, calculation: double) and (matrix: rocsparse_float_complex, vectors: rocsparse_double_complex, calculation: rocsparse_double_complex)

  • Added support for gfx940, gfx941 and gfx942

Improved#
  • Fixed a bug in csrsm and bsrsm

Known Issues#

In csritlu0, the algorithm rocsparse_itilu0_alg_sync_split_fusion has some accuracy issues, still under investigation, when XNACK is enabled. The fallback is rocsparse_itilu0_alg_sync_split.

rocWMMA 1.2.0#

rocWMMA 1.2.0 for ROCm 5.7.0

Changed#
  • Fixed a bug with synchronization

  • Updated rocWMMA cmake versioning

rpp 1.2.0#

rpp 1.2.0 for ROCm 5.7.0

Added#
  • New Tests

Optimizations#
  • Readme Updates

Changed#
  • Backend - Default Backend set to HIP

Fixed#
  • Minor bugs and warnings

Tested Configurations#
  • Linux distribution

    • Ubuntu - 18.04 / 20.04

    • CentOS - 8

  • ROCm: rocm-core - 5.0.0.50000-49

  • Clang - Version 6.0

  • CMake - Version 3.22.3

  • Boost - Version 1.72

  • IEEE 754-based half-precision floating-point library - Version 1.12.0

Known Issues#
  • CPU only backend not enabled

Rpp 1.1.0#
Rpp 1.0.0#
Rpp 0.99#
Rpp 0.98#
Rpp 0.97#
Rpp 0.96#
Rpp 0.95#
Rpp 0.93#
Tensile 4.38.0#

Tensile 4.38.0 for ROCm 5.7.0

Added#
  • Added support for FP16 Alt Round Near Zero Mode (this feature allows the generation of alternate kernels with intermediate rounding instead of truncation)

  • Added user-driven solution selection feature

Optimizations#
  • Enabled LocalSplitU with MFMA for I8 data type

  • Optimized K mask code in mfmaIter

  • Enabled TailLoop code in NoLoadLoop to prefetch global/local read

  • Enabled DirectToVgpr in TailLoop for NN, TN, and TT matrix orientations

  • Optimized DirectToLds test cases to reduce the test duration

Changed#
  • Removed DGEMM NT custom kernels and related test cases

  • Changed noTailLoop logic to apply noTailLoop only for NT

  • Changed the range of AssertFree0ElementMultiple and Free1

  • Unified aStr, bStr generation code in mfmaIter

Fixed#
  • Fixed LocalSplitU mismatch issue for SGEMM

  • Fixed BufferStore=0 and Ldc != Ldd case

  • Fixed mismatch issue with TailLoop + MatrixInstB > 1


ROCm 5.6.1#

What’s new in this release#

ROCm 5.6.1 is a point release with several bug fixes in the HIP runtime.

HIP 5.6.1 (for ROCm 5.6.1)#
Defect fixes#
  • hipMemcpy device-to-device (inter-device) is now asynchronous with respect to the host (see the example after this list)

  • Fixed a hang in HIP catch2 tests when executing them with the xnack+ check enabled

  • Fixed a memory leak when code object files are loaded/unloaded via the hipModuleLoad/hipModuleUnload APIs

  • Using hipGraphAddMemFreeNode no longer results in a crash
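
Since the first fix above makes inter-device hipMemcpy asynchronous with respect to the host, host code that immediately uses or frees the destination buffer should synchronize explicitly. The following is a minimal sketch, not taken from the release notes; buffer names and sizes are illustrative.

// Device-to-device copy across two GPUs, followed by an explicit host sync.
#include <hip/hip_runtime.h>

int main() {
  const size_t bytes = 1 << 20;   // illustrative size
  void *src = nullptr, *dst = nullptr;

  hipSetDevice(0);
  hipMalloc(&src, bytes);
  hipSetDevice(1);
  hipMalloc(&dst, bytes);

  // Inter-device copy; as of ROCm 5.6.1 this call does not block the host.
  hipMemcpy(dst, src, bytes, hipMemcpyDeviceToDevice);

  // Synchronize before the host relies on the copy having completed.
  hipDeviceSynchronize();

  hipFree(dst);
  hipSetDevice(0);
  hipFree(src);
  return 0;
}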

Library changes in ROCm 5.6.1#

Library

Version

AMDMIGraphX

2.5

hipBLAS

0.53.0

hipCUB

2.13.1

hipFFT

1.0.12

hipRAND

2.10.16

hipSOLVER

1.8.0

hipSPARSE

2.3.6 ⇒ 2.3.7

MIOpen

2.19.0

MIVisionX

2.4.0

rccl

2.15.5

rocALUTION

2.1.9

rocBLAS

3.0.0

rocFFT

1.0.23

rocm-cmake

0.9.0

rocPRIM

2.13.0

rocRAND

2.10.17

rocSOLVER

3.22.0

rocSPARSE

2.5.2

rocThrust

2.18.0

rocWMMA

1.1.0

Tensile

4.37.0

hipSPARSE 2.3.7#

hipSPARSE 2.3.7 for ROCm 5.6.1

Bugfix#
  • Reverted an undocumented API change in hipSPARSE 2.3.6 that affected hipsparseSpSV_solve function


ROCm 5.6.0#

Release highlights#

ROCm 5.6 consists of several AI software ecosystem improvements for our fast-growing user base. A few examples include:

  • New documentation portal at https://rocm.docs.amd.com

  • Ongoing software enhancements for LLMs, ensuring full compliance with the HuggingFace unit test suite

  • OpenAI Triton, CuPy, HIP Graph support, and many other library performance enhancements

  • Improved ROCm deployment and development tools, including CPU-GPU (rocGDB) debugger, profiler, and docker containers

  • New pseudorandom generators are available in rocRAND. Added support for half-precision transforms in hipFFT/rocFFT. Added LU refactorization and linear system solver for sparse matrices in rocSOLVER.

OS and GPU support changes#
  • SLES15 SP5 support was added in this release. SLES15 SP3 support was dropped.

  • AMD Instinct MI50, Radeon Pro VII, and Radeon VII products (collectively referred to as gfx906 GPUs) will enter maintenance mode starting in Q3 2023. This aligns with the ROCm 5.7 GA release date.

    • No new features or performance optimizations will be supported for the gfx906 GPUs beyond ROCm 5.7

    • Bug fixes and critical security patches will continue to be supported for the gfx906 GPUs until Q2 2024 (EOM will be aligned with the closest ROCm release)

    • Bug fixes during the maintenance period will be made to the next ROCm point release

    • Bug fixes will not be backported to older ROCm releases for this SKU

    • Distribution and operating system updates will continue per the ROCm release cadence for gfx906 GPUs until EOM

AMDSMI CLI 23.0.0.4#
Additions#
  • AMDSMI CLI tool enabled for Linux Bare Metal & Guest

  • Package: amd-smi-lib

Known issues#
  • Not all Error Correction Code (ECC) fields are currently supported

  • RHEL 8 and SLES 15 require extra installation steps

Kernel modules (DKMS)#
Fixes#
  • Stability fix for multi-GPU systems, reproducible via ROCm_Bandwidth_Test, as reported in Issue 2198.

HIP 5.6 (for ROCm 5.6)#
Optimizations#
  • Consolidation of the hipamd, rocclr, and OpenCL projects into clr

  • Optimized lock for graph global capture mode

Additions#
  • Added hipRTC support for amd_hip_fp16

  • Added hipStreamGetDevice implementation to get the device associated with the stream

  • Added HIP_AD_FORMAT_SIGNED_INT16 in hipArray formats

  • hipArrayGetInfo for getting information about the specified array

  • hipArrayGetDescriptor for getting 1D or 2D array descriptor

  • hipArray3DGetDescriptor to get 3D array descriptor

Changes#
  • hipMallocAsync now returns success for zero-size allocations, to match hipMalloc

  • Separation of the hipcc Perl binaries from the HIP project into the hipcc project. The hip-devel package depends on the newly added hipcc package

  • Consolidation of hipamd, ROCclr, and OpenCL repositories into a single repository called clr. Instructions are updated to build HIP from sources in the HIP Installation guide

  • Removed hipBusBandwidth and hipCommander samples from hip-tests

Fixes#
  • Fixed regression in hipMemCpyParam3D when offset is applied

Known issues#
  • Limited testing on xnack+ configuration

    • Multiple HIP tests failures (gpuvm fault or hangs)

  • hipSetDevice and hipSetDeviceFlags APIs return hipErrorInvalidDevice instead of hipErrorNoDevice on a system without a GPU

  • Known memory leak when code object files are loaded/unloaded via the hipModuleLoad/hipModuleUnload APIs. This issue will be fixed in a future ROCm release

Upcoming changes in future release#
  • Removal of gcnarch from hipDeviceProp_t structure

  • Addition of new fields in hipDeviceProp_t structure

    • maxTexture1D

    • maxTexture2D

    • maxTexture1DLayered

    • maxTexture2DLayered

    • sharedMemPerMultiprocessor

    • deviceOverlap

    • asyncEngineCount

    • surfaceAlignment

    • unifiedAddressing

    • computePreemptionSupported

    • uuid

  • Removal of deprecated code

    • hip-hcc codes from hip code tree

  • Correct hipArray usage in HIP APIs such as hipMemcpyAtoH and hipMemcpyHtoA

  • HIPMEMCPY_3D fields correction (unsigned int -> size_t)

  • Renaming of ‘memoryType’ in hipPointerAttribute_t structure to ‘type’

ROCgdb-13 (For ROCm 5.6.0)#
Optimizations#
  • Improved performance when handling the end of a process with a large number of threads.

Known issues#
  • On certain configurations, ROCgdb can show the following warning message:

    warning: Probes-based dynamic linker interface failed. Reverting to original interface.

    This does not affect ROCgdb’s functionalities.

ROCprofiler (for ROCm 5.6.0)#

In ROCm 5.6, the rocprofilerv1 and rocprofilerv2 include and library files from ROCm 5.5 are split into separate files, as shown below. The rocmtools files that were deprecated in ROCm 5.5 have been removed.

  • Tool script: bin/rocprof (rocprofilerv1), bin/rocprofv2 (rocprofilerv2)

  • API include: include/rocprofiler/rocprofiler.h (rocprofilerv1), include/rocprofiler/v2/rocprofiler.h (rocprofilerv2)

  • API library: lib/librocprofiler.so.1 (rocprofilerv1), lib/librocprofiler.so.2 (rocprofilerv2)

The ROCm Profiler Tool that uses rocprofilerV1 can be invoked using the following command:

rocprof 

To write a custom tool based on the rocprofilerV1 API do the following:

main.c:
#include <rocprofiler/rocprofiler.h> // Use the rocprofilerV1 API
int main() {
  // Use the rocprofilerV1 API
  return 0;
}

This can be built in the following manner:

gcc main.c -I/opt/rocm-5.6.0/include -L/opt/rocm-5.6.0/lib -lrocprofiler64

The resulting a.out will depend on /opt/rocm-5.6.0/lib/librocprofiler64.so.1.

The ROCm Profiler that uses rocprofilerV2 API can be invoked using the following command:

rocprofv2 

To write a custom tool based on the rocprofilerV2 API do the following:

main.c:
#include <rocprofiler/v2/rocprofiler.h> // Use the rocprofilerV2 API
int main() {
  // Use the rocprofilerV2 API
  return 0;
}

This can be built in the following manner:

gcc main.c -I/opt/rocm-5.6.0/include -L/opt/rocm-5.6.0/lib -lrocprofiler64-v2

The resulting a.out will depend on /opt/rocm-5.6.0/lib/librocprofiler64.so.2.

Optimizations#
  • Improved Test Suite

Additions#
  • ‘end_time’ needs to be disabled in roctx_trace.txt

Fixes#
  • Fixed the rocprof GPU selector, which was broken in ROCm 5.4.0.

  • Fixed rocprof in ROCm 5.4.1 failing to generate kernel info.

  • Fixed rocprof clobbering LD_PRELOAD.

Library changes in ROCm 5.6.0#

Library

Version

AMDMIGraphX

2.5

hipBLAS

0.53.0

hipCUB

2.13.1

hipFFT

1.0.11 ⇒ 1.0.12

hipRAND

2.10.16

hipSOLVER

1.7.0 ⇒ 1.8.0

hipSPARSE

2.3.5 ⇒ 2.3.6

MIOpen

2.19.0

MIVisionX

2.3.0 ⇒ 2.4.0

rccl

2.15.5

rocALUTION

2.1.8 ⇒ 2.1.9

rocBLAS

2.47.0 ⇒ 3.0.0

rocFFT

1.0.22 ⇒ 1.0.23

rocm-cmake

0.8.1 ⇒ 0.9.0

rocPRIM

2.13.0

rocRAND

2.10.17

rocSOLVER

3.21.0 ⇒ 3.22.0

rocSPARSE

2.5.1 ⇒ 2.5.2

rocThrust

2.17.0 ⇒ 2.18.0

rocWMMA

1.0 ⇒ 1.1.0

Tensile

4.36.0 ⇒ 4.37.0

hipFFT 1.0.12#

hipFFT 1.0.12 for ROCm 5.6.0

Added#
  • Implemented the hipfftXtMakePlanMany, hipfftXtGetSizeMany, hipfftXtExec APIs, to allow requesting half-precision transforms.

Changed#
  • Added --precision argument to benchmark/test clients. --double is still accepted but is deprecated as a method to request a double-precision transform.

hipSOLVER 1.8.0#

hipSOLVER 1.8.0 for ROCm 5.6.0

Added#
  • Added compatibility API with hipsolverRf prefix

hipSPARSE 2.3.6#

hipSPARSE 2.3.6 for ROCm 5.6.0

Added#
  • Added SpGEMM algorithms

Changed#
  • For hipsparseXbsr2csr and hipsparseXcsr2bsr, blockDim == 0 now returns HIPSPARSE_STATUS_INVALID_SIZE

MIVisionX 2.4.0#

MIVisionX 2.4.0 for ROCm 5.6.0

Added#
  • OpenVX FP16 Support

  • rocAL - CPU, HIP, & OCL backends

  • AMD RPP - CPU, HIP, & OCL backends

  • MIVisionX Setup Support for RHEL

  • Extended OS Support

  • Docker Support for Ubuntu 22.04

  • Tests

Optimizations#
  • CMakeList Cleanup

  • MIGraphX Extension Updates

  • rocAL - Documentation

  • CMakeList Updates & Cleanup

Changed#
  • rocAL - Changing Python Lib Path

  • Docker Support - Ubuntu 18 Support Dropped

  • RPP - Link to Version 1.0.0

  • rocAL - support updates

  • Setup Updates

Fixed#
  • rocAL bug fix and updates

  • AMD RPP - bug fixes

  • CMakeLists - Issues

  • RPATH - Link Issues

Tested Configurations#
  • Windows 10 / 11

  • Linux distribution

    • Ubuntu - 20.04 / 22.04

    • CentOS - 7 / 8

    • RHEL - 8 / 9

    • SLES - 15-SP3

  • ROCm: rocm-core - 5.4.3.50403-121

  • miopen-hip - 2.19.0.50403-121

  • miopen-opencl - 2.18.0.50300-63

  • migraphx - 2.4.0.50403-121

  • Protobuf - V3.12.4

  • OpenCV - 4.6.0

  • RPP - 1.0.0

  • FFMPEG - n4.4.2

  • Dependencies for all the above packages

  • MIVisionX Setup Script - V2.4.2

Known Issues#
  • OpenCV 4.X support for some apps missing

MIVisionX Dependency Map#

Docker Image: sudo docker build -f docker/ubuntu20/{DOCKER_LEVEL_FILE_NAME}.dockerfile -t {mivisionx-level-NUMBER} .

In the listing below, (new) marks a component added at that level and (existing) marks a component carried over from the previous level.

Level_1

  • MIVisionX dependencies: cmake, gcc, g++

  • Modules: amd_openvx, utilities

  • Libraries and executables: libopenvx.so - OpenVX™ Lib, CPU (new); libvxu.so - OpenVX™ immediate node Lib, CPU (new); runvx - OpenVX™ Graph Executor, CPU with Display OFF (new)

Level_2

  • MIVisionX dependencies: ROCm OpenCL, plus Level 1

  • Modules: amd_openvx, amd_openvx_extensions, utilities

  • Libraries and executables: libopenvx.so - OpenVX™ Lib, CPU/GPU (new); libvxu.so - OpenVX™ immediate node Lib, CPU/GPU (new); libvx_loomsl.so - Loom 360 Stitch Lib (new); loom_shell - 360 Stitch App (new); runcl - OpenCL™ program debug App (new); runvx - OpenVX™ Graph Executor, Display OFF (new)

Level_3

  • MIVisionX dependencies: OpenCV, FFMPEG, plus Level 2

  • Modules: amd_openvx, amd_openvx_extensions, utilities

  • Libraries and executables: libopenvx.so, libvxu.so, libvx_loomsl.so, loom_shell, and runcl (all existing); libvx_amd_media.so - OpenVX™ Media Extension (new); libvx_opencv.so - OpenVX™ OpenCV InterOp Extension (new); mv_compile - Neural Net Model Compile (new); runvx - OpenVX™ Graph Executor, Display ON (new)

Level_4

  • MIVisionX dependencies: MIOpenGEMM, MIOpen, ProtoBuf, plus Level 3

  • Modules: amd_openvx, amd_openvx_extensions, apps, utilities

  • Libraries and executables: libopenvx.so, libvxu.so, libvx_loomsl.so, loom_shell, libvx_amd_media.so, libvx_opencv.so, mv_compile, runcl, and runvx - Display ON (all existing); libvx_nn.so - OpenVX™ Neural Net Extension (new); inference_server_app - Cloud Inference App (new)

Level_5

  • MIVisionX dependencies: AMD_RPP, rocAL deps, plus Level 4

  • Modules: amd_openvx, amd_openvx_extensions, apps, rocAL, utilities

  • Libraries and executables: libopenvx.so, libvxu.so, libvx_loomsl.so, loom_shell, libvx_amd_media.so, libvx_opencv.so, mv_compile, runcl, runvx, libvx_nn.so, and inference_server_app (all existing); libvx_rpp.so - OpenVX™ RPP Extension (new); librocal.so - Radeon Augmentation Library (new); rocal_pybind.so - rocAL Pybind Lib (new)

NOTE: OpenVX and the OpenVX logo are trademarks of the Khronos Group Inc.

rocALUTION 2.1.9#

rocALUTION 2.1.9 for ROCm 5.6.0

Improved#
  • Fixed synchronization issues in level 1 routines

rocBLAS 3.0.0#

rocBLAS 3.0.0 for ROCm 5.6.0

Optimizations#
  • Improved performance of Level 2 rocBLAS GEMV on gfx90a GPU for non-transposed problems having small matrices and larger batch counts. Performance enhanced for problem sizes when m and n <= 32 and batch_count >= 256.

  • Improved performance of rocBLAS syr2k for single, double, and double-complex precision, and her2k for double-complex precision. Slightly improved performance for general sizes on gfx90a.

Added#
  • Added bf16 inputs and f32 compute support to Level 1 rocBLAS Extension functions axpy_ex, scal_ex and nrm2_ex.
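
As a usage hint for the item above, the bf16-input / f32-compute path is selected through the _ex datatype arguments. This is a minimal sketch under the assumption that dx and dy are device buffers of rocblas_bfloat16; the buffer names and wrapper function are illustrative, not from the release notes.

// axpy_ex with bf16 inputs and f32 compute type.
#include <rocblas/rocblas.h>   // header location may vary by ROCm version

void bf16_axpy(rocblas_handle handle, rocblas_int n, const float* alpha,
               rocblas_bfloat16* dx, rocblas_bfloat16* dy)
{
  rocblas_axpy_ex(handle, n,
                  alpha, rocblas_datatype_f32_r,      // alpha type (f32)
                  dx,    rocblas_datatype_bf16_r, 1,  // x and incx
                  dy,    rocblas_datatype_bf16_r, 1,  // y and incy
                  rocblas_datatype_f32_r);            // compute/execution type
}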

Deprecated#
  • trmm inplace is deprecated. It will be replaced by trmm that has both inplace and out-of-place functionality

  • rocblas_query_int8_layout_flag() is deprecated and will be removed in a future release

  • rocblas_gemm_flags_pack_int8x4 enum is deprecated and will be removed in a future release

  • rocblas_set_device_memory_size() is deprecated and will be replaced by a future function rocblas_increase_device_memory_size()

  • rocblas_is_user_managing_device_memory() is deprecated and will be removed in a future release

Removed#
  • The is_complex helper was deprecated and has now been removed. Use rocblas_is_complex instead.

  • The enum truncate_t and the value truncate were deprecated and have now been removed. They were replaced by rocblas_truncate_t and rocblas_truncate, respectively.

  • rocblas_set_int8_type_for_hipblas was deprecated and is now removed.

  • rocblas_get_int8_type_for_hipblas was deprecated and is now removed.

Dependencies#
  • build-only dependency on the Python joblib package added, as used by the Tensile build

  • fix for CMake install on some operating systems when performed by install.sh -d --cmake_install

Fixed#
  • make trsm offset calculations 64-bit safe

Changed#
  • refactor rotg test code

rocFFT 1.0.23#

rocFFT 1.0.23 for ROCm 5.6.0

Added#
  • Implemented half-precision transforms, which can be requested by passing rocfft_precision_half to rocfft_plan_create (see the sketch after this list).

  • Implemented a hierarchical solution map which saves how to decompose a problem and the kernels to be used.

  • Implemented a first version of the offline tuner to support tuning kernels for C2C/Z2Z problems.
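
The sketch below illustrates the half-precision request mentioned in the first item of this list; apart from rocfft_precision_half, the plan parameters (an in-place 1D complex forward transform of length 4096) are arbitrary examples, not part of the release notes.

// Create a half-precision C2C forward plan.
#include <rocfft/rocfft.h>   // header location may vary by ROCm version

int make_half_plan(rocfft_plan* plan)
{
  const size_t length = 4096;
  rocfft_setup();
  rocfft_status status = rocfft_plan_create(plan,
                                            rocfft_placement_inplace,
                                            rocfft_transform_type_complex_forward,
                                            rocfft_precision_half,   // new in rocFFT 1.0.23
                                            1, &length,              // one dimension
                                            1,                       // one transform
                                            nullptr);                // default plan description
  return status == rocfft_status_success ? 0 : 1;
}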

Changed#
  • Replaced std::complex with hipComplex data types for data generator.

  • FFT plan dimensions are now sorted to be row-major internally where possible, which produces better plans if the dimensions were accidentally specified in a different order (column-major, for example).

  • Added --precision argument to benchmark/test clients. --double is still accepted but is deprecated as a method to request a double-precision transform.

Fixed#
  • Fixed over-allocation of LDS in some real-complex kernels, which was resulting in kernel launch failure.

rocm-cmake 0.9.0#

rocm-cmake 0.9.0 for ROCm 5.6.0

Added#
  • Added the option ROCM_HEADER_WRAPPER_WERROR

    • Compile-time C macro in the wrapper headers causes errors to be emitted instead of warnings.

    • Configure-time CMake option sets the default for the C macro.

rocSOLVER 3.22.0#

rocSOLVER 3.22.0 for ROCm 5.6.0

Added#
  • LU refactorization for sparse matrices

    • CSRRF_ANALYSIS

    • CSRRF_SUMLU

    • CSRRF_SPLITLU

    • CSRRF_REFACTLU

  • Linear system solver for sparse matrices

    • CSRRF_SOLVE

  • Added type rocsolver_rfinfo for use with sparse matrix routines

Optimized#
  • Improved the performance of BDSQR and GESVD when singular vectors are requested

Fixed#
  • BDSQR and GESVD should no longer hang when the input contains NaN or Inf

rocSPARSE 2.5.2#

rocSPARSE 2.5.2 for ROCm 5.6.0

Improved#
  • Fixed a memory leak in csritsv

  • Fixed a bug in csrsm and bsrsm

rocThrust 2.18.0#

rocThrust 2.18.0 for ROCm 5.6.0

Fixed#
  • lower_bound, upper_bound, and binary_search failed to compile for certain types.

Changed#
  • Updated docs directory structure to match the standard of rocm-docs-core.

rocWMMA 1.1.0#

rocWMMA 1.1.0 for ROCm 5.6.0

Added#
  • Added cross-lane operation backends (Blend, Permute, Swizzle and Dpp)

  • Added GPU kernels for rocWMMA unit test pre-process and post-process operations (fill, validation)

  • Added performance gemm samples for half, single and double precision

  • Added rocWMMA cmake versioning

  • Added vectorized support in coordinate transforms

  • Included ROCm smi for runtime clock rate detection

  • Added fragment transforms for transpose and change data layout

Changed#
  • Default to GPU rocBLAS validation against rocWMMA

  • Re-enabled int8 gemm tests on gfx9

  • Upgraded to C++17

  • Restructured unit test folder for consistency

  • Consolidated rocWMMA samples common code

Tensile 4.37.0#

Tensile 4.37.0 for ROCm 5.6.0

Added#
  • Added user driven tuning API

  • Added decision tree fallback feature

  • Added SingleBuffer + AtomicAdd option for GlobalSplitU

  • DirectToVgpr support for fp16 and Int8 with TN orientation

  • Added new test cases for various functions

  • Added SingleBuffer algorithm for ZGEMM/CGEMM

  • Added joblib for parallel map calls

  • Added support for MFMA + LocalSplitU + DirectToVgprA+B

  • Added asmcap check for MIArchVgpr

  • Added support for MFMA + LocalSplitU

  • Added frequency, power, and temperature data to the output

Optimizations#
  • Improved the performance of GlobalSplitU with SingleBuffer algorithm

  • Reduced the running time of the extended and pre_checkin tests

  • Optimized the Tailloop section of the assembly kernel

  • Optimized complex GEMM (fixed vgpr allocation, unified CGEMM and ZGEMM code in MulMIoutAlphaToArch)

  • Improved the performance of the second kernel of MultipleBuffer algorithm

Changed#
  • Updated custom kernels with 64-bit offsets

  • Adapted 64-bit offset arguments for assembly kernels

  • Improved temporary register re-use to reduce max sgpr usage

  • Removed some restrictions on VectorWidth and DirectToVgpr

  • Updated the dependency requirements for Tensile

  • Changed the range of AssertSummationElementMultiple

  • Modified the error messages for more clarity

  • Changed DivideAndReminder to vectorStaticRemainder in case quotient is not used

  • Removed dummy vgpr for vectorStaticRemainder

  • Removed tmpVgpr parameter from vectorStaticRemainder/Divide/DivideAndReminder

  • Removed qReg parameter from vectorStaticRemainder

Fixed#
  • Fixed tmp sgpr allocation to avoid over-writing values (alpha)

  • 64-bit offset parameters for post kernels

  • Fixed gfx908 CI test failures

  • Fixed offset calculation to prevent overflow for large offsets

  • Fixed issues when BufferLoad and BufferStore are equal to zero

  • Fixed StoreCInUnroll + DirectToVgpr + no useInitAccVgprOpt mismatch

  • Fixed DirectToVgpr + LocalSplitU + FractionalLoad mismatch

  • Fixed the memory access error related to StaggerU + large stride

  • Fixed ZGEMM 4x4 MatrixInst mismatch

  • Fixed DGEMM 4x4 MatrixInst mismatch

  • Fixed ASEM + GSU + NoTailLoop opt mismatch

  • Fixed AssertSummationElementMultiple + GlobalSplitU issues

  • Fixed ASEM + GSU + TailLoop inner unroll


ROCm 5.5.1#

What’s new in this release#
HIP SDK for Windows#

AMD is pleased to announce the availability of the HIP SDK for Windows as part of ROCm software. The HIP SDK OS and GPU support page lists the versions of Windows and GPUs validated by AMD. HIP SDK features on Windows are described in detail in our What is ROCm? page and differ from the Linux feature set. Visit the Quick Start page to get started. Known issues are tracked on GitHub.

HIP API change#

The following HIP API is updated in the ROCm 5.5.1 release:

hipDeviceSetCacheConfig#
  • The return value for hipDeviceSetCacheConfig is updated from hipErrorNotSupported to hipSuccess

Library changes in ROCm 5.5.1#

Library

Version

AMDMIGraphX

2.5

hipBLAS

0.54.0

hipBLASLt

0.1.0

hipCUB

2.13.1

hipFFT

1.0.11

hipRAND

2.10.16

hipSOLVER

1.7.0

hipSPARSE

2.3.5

MIOpen

2.19.0

MIVisionX

2.3.0

rccl

2.15.5

rocALUTION

2.1.8

rocBLAS

2.47.0

rocFFT

1.0.22

rocm-cmake

0.8.1

rocPRIM

2.13.0

rocRAND

2.10.17

rocSOLVER

3.21.0

rocSPARSE

2.5.1

rocThrust

2.17.0

rocWMMA

1.0

Tensile

4.36.0


ROCm 5.5.0#

What’s new in this release#
HIP enhancements#

The ROCm v5.5 release consists of the following HIP enhancements:

Enhanced stack size limit#

In this release, the stack size limit is increased from 16 KB to 131056 bytes (or 128 KB - 16). Applications that need to update the stack size can use the hipDeviceSetLimit API.
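
A minimal sketch of querying and raising the limit follows; it assumes the hipLimitStackSize limit enum and uses the 131056-byte value from this note purely as an example.

// Query and set the per-thread device stack size limit.
#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
  size_t stack_size = 0;
  hipDeviceGetLimit(&stack_size, hipLimitStackSize);
  printf("current stack size limit: %zu bytes\n", stack_size);

  hipDeviceSetLimit(hipLimitStackSize, 131056);   // request a larger stack

  hipDeviceGetLimit(&stack_size, hipLimitStackSize);
  printf("new stack size limit: %zu bytes\n", stack_size);
  return 0;
}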

hipcc changes#

The following hipcc changes are implemented in this release:

  • hipcc will not implicitly link to libpthread and librt, as they are no longer a link-time dependency for HIP programs. Applications that depend on these libraries must explicitly link to them.

  • -use-staticlib and -use-sharedlib options are deprecated.

Future changes#
  • Separation of the hipcc binaries (Perl scripts) from HIP into the hipcc project. Users will install the hipcc binaries from a separate hipcc package in future ROCm releases.

  • In a future ROCm release, the following samples will be removed from the hip-tests project.

    Note that the samples will continue to be available in previous release branches.

  • Removal of gcnarch from hipDeviceProp_t structure

  • Addition of new fields in hipDeviceProp_t structure

    • maxTexture1D

    • maxTexture2D

    • maxTexture1DLayered

    • maxTexture2DLayered

    • sharedMemPerMultiprocessor

    • deviceOverlap

    • asyncEngineCount

    • surfaceAlignment

    • unifiedAddressing

    • computePreemptionSupported

    • hostRegisterSupported

    • uuid

  • Removal of deprecated code

    • hip-hcc codes from hip code tree

  • Correct hipArray usage in HIP APIs such as hipMemcpyAtoH and hipMemcpyHtoA

  • HIPMEMCPY_3D fields correction to avoid truncation of “size_t” to “unsigned int” inside hipMemcpy3D()

  • Renaming of ‘memoryType’ in hipPointerAttribute_t structure to ‘type’

  • Correct hipGetLastError to return the last error instead of last API call’s return code

  • Update hipExternalSemaphoreHandleDesc to add “unsigned int reserved[16]”

  • Correct handling of flag values in hipIpcOpenMemHandle for hipIpcMemLazyEnablePeerAccess

  • Remove hiparray* and make it opaque with hipArray_t

New HIP APIs in this release#

Note

This is a pre-official version (beta) release of the new APIs and may contain unresolved issues.

Memory management HIP APIs#

The new memory management HIP API is as follows:

  • Sets information on the specified pointer [BETA].

    hipError_t hipPointerSetAttribute(const void* value, hipPointer_attribute attribute, hipDeviceptr_t ptr);
    
Module management HIP APIs#

The new module management HIP APIs are as follows:

  • Launches kernel f with launch parameters and shared memory on stream with arguments passed to kernelParams, where thread blocks can cooperate and synchronize as they run.

    hipError_t hipModuleLaunchCooperativeKernel(hipFunction_t f, unsigned int gridDimX, unsigned int gridDimY, unsigned int gridDimZ, unsigned int blockDimX, unsigned int blockDimY, unsigned int blockDimZ, unsigned int sharedMemBytes, hipStream_t stream, void** kernelParams);
    
  • Launches kernels on multiple devices where thread blocks can cooperate and synchronize as they run.

    hipError_t hipModuleLaunchCooperativeKernelMultiDevice(hipFunctionLaunchParams* launchParamsList, unsigned int numDevices, unsigned int flags);
    
HIP graph management APIs#

The new HIP graph management APIs are as follows:

  • Creates a memory allocation node and adds it to a graph [BETA]

    hipError_t hipGraphAddMemAllocNode(hipGraphNode_t* pGraphNode, hipGraph_t graph, const hipGraphNode_t* pDependencies, size_t numDependencies, hipMemAllocNodeParams* pNodeParams);
    
  • Return parameters for memory allocation node [BETA]

    hipError_t hipGraphMemAllocNodeGetParams(hipGraphNode_t node, hipMemAllocNodeParams* pNodeParams);
    
  • Creates a memory free node and adds it to a graph [BETA]

    hipError_t hipGraphAddMemFreeNode(hipGraphNode_t* pGraphNode, hipGraph_t graph, const hipGraphNode_t* pDependencies, size_t numDependencies, void* dev_ptr);
    
  • Returns parameters for memory free node [BETA].

    hipError_t hipGraphMemFreeNodeGetParams(hipGraphNode_t node, void* dev_ptr);
    
  • Write a DOT file describing graph structure [BETA] (see the sketch after this list).

    hipError_t hipGraphDebugDotPrint(hipGraph_t graph, const char* path, unsigned int flags);
    
  • Copies attributes from source node to destination node [BETA].

    hipError_t hipGraphKernelNodeCopyAttributes(hipGraphNode_t hSrc, hipGraphNode_t hDst);
    
  • Enables or disables the specified node in the given graphExec [BETA]

    hipError_t hipGraphNodeSetEnabled(hipGraphExec_t hGraphExec, hipGraphNode_t hNode, unsigned int isEnabled);
    
  • Query whether a node in the given graphExec is enabled [BETA]

    hipError_t hipGraphNodeGetEnabled(hipGraphExec_t hGraphExec, hipGraphNode_t hNode, unsigned int* isEnabled);
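
The following is a small usage sketch for the DOT output API listed above; the graph content (a single empty placeholder node) and the flags value 0 are assumptions for illustration only.

// Build a trivial graph and dump its structure to a Graphviz DOT file (BETA API).
#include <hip/hip_runtime.h>

int dump_graph_dot(const char* path) {
  hipGraph_t graph = nullptr;
  hipGraphNode_t node = nullptr;

  hipGraphCreate(&graph, 0);
  hipGraphAddEmptyNode(&node, graph, nullptr, 0);   // placeholder node, no dependencies

  hipError_t err = hipGraphDebugDotPrint(graph, path, 0);

  hipGraphDestroy(graph);
  return err == hipSuccess ? 0 : 1;
}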
    
OpenMP enhancements#

This release consists of the following OpenMP enhancements:

  • Additional support for OMPT functions get_device_time and get_record_type

  • Added support for min/max fast fp atomics on AMD GPUs

  • Fixed the use of the abs function in C device regions

Deprecations and warnings#
HIP deprecation#

The hipcc and hipconfig Perl scripts are deprecated. In a future release, compiled binaries will be available as hipcc.bin and hipconfig.bin as replacements for the Perl scripts.

Note

There will be a transition period where the Perl scripts and compiled binaries are available before the scripts are removed. There will be no functional difference between the Perl scripts and their compiled binary counterparts. No user action is required. Once the binaries are available, users can optionally switch to hipcc.bin and hipconfig.bin. The hipcc and hipconfig soft links will eventually point to the respective compiled binaries as the default option.

Linux file system hierarchy standard for ROCm#

ROCm packages have adopted the Linux foundation file system hierarchy standard in this release to ensure ROCm components follow open source conventions for Linux-based distributions. While moving to a new file system hierarchy, ROCm ensures backward compatibility with its 5.1 version or older file system hierarchy. See below for a detailed explanation of the new file system hierarchy and backward compatibility.

New file system hierarchy#

The following is the new file system hierarchy:

/opt/rocm-<ver>
    | --bin
      | --All externally exposed Binaries
    | --libexec
        | --<component>
            | -- Component specific private non-ISA executables (architecture independent)
    | --include
        | -- <component>
            | --<header files>
    | --lib
        | --lib<soname>.so -> lib<soname>.so.major -> lib<soname>.so.major.minor.patch
            (public libraries linked with application)
        | --<component> (component specific private library, executable data)
        | --<cmake>
            | --components
                | --<component>.config.cmake
    | --share
        | --html/<component>/*.html
        | --info/<component>/*.[pdf, md, txt]
        | --man
        | --doc
            | --<component>
                | --<licenses>
        | --<component>
            | --<misc files> (arch independent non-executable)
            | --samples

Note

ROCm will not support backward compatibility with the v5.1(old) file system hierarchy in its next major release.

For more information, refer to https://refspecs.linuxfoundation.org/fhs.shtml.

Backward compatibility with older file systems#

ROCm has moved header files and libraries to its new location as indicated in the above structure and included symbolic-link and wrapper header files in its old location for backward compatibility.

Note

ROCm will continue supporting backward compatibility until the next major release.

Wrapper header files#

Wrapper header files are placed in the old location (/opt/rocm-xxx/<component>/include) with a warning message to include files from the new location (/opt/rocm-xxx/include) as shown in the example below:

// Code snippet from hip_runtime.h
#pragma message "This file is deprecated. Use file from include path /opt/rocm-ver/include/ and prefix with hip"
#include "hip/hip_runtime.h"

The wrapper header files’ backward compatibility deprecation is as follows:

  • #pragma message announcing deprecation – ROCm v5.2 release

  • #pragma message changed to #warning – Future release

  • #warning changed to #error – Future release

  • Backward compatibility wrappers removed – Future release

Library files#

Library files are available in the /opt/rocm-xxx/lib folder. For backward compatibility, the old library location (/opt/rocm-xxx/<component>/lib) has a soft link to the library at the new location.

Example:

$ ls -l /opt/rocm/hip/lib/
total 4
drwxr-xr-x 4 root root 4096 May 12 10:45 cmake
lrwxrwxrwx 1 root root   24 May 10 23:32 libamdhip64.so -> ../../lib/libamdhip64.so
CMake config files#

All CMake configuration files are available in the /opt/rocm-xxx/lib/cmake/<component> folder. For backward compatibility, the old CMake locations (/opt/rocm-xxx/<component>/lib/cmake) consist of a soft link to the new CMake config.

Example:

$ ls -l /opt/rocm/hip/lib/cmake/hip/
total 0
lrwxrwxrwx 1 root root 42 May 10 23:32 hip-config.cmake -> ../../../../lib/cmake/hip/hip-config.cmake
ROCm support for Code Object V3 deprecated#

Support for Code Object v3 is deprecated and will be removed in a future release.

Comgr V3.0 changes#

The following APIs and macros have been marked as deprecated. They are expected to be removed in a future ROCm release, coinciding with the release of Comgr v3.0.

API changes#
  • amd_comgr_action_info_set_options()

  • amd_comgr_action_info_get_options()

Actions and data types#
  • AMD_COMGR_ACTION_ADD_DEVICE_LIBRARIES

  • AMD_COMGR_ACTION_COMPILE_SOURCE_TO_FATBIN

For replacements, see the AMD_COMGR_ACTION_INFO_GET/SET_OPTION_LIST APIs, and the AMD_COMGR_ACTION_COMPILE_SOURCE_(WITH_DEVICE_LIBS)_TO_BC macros.

Deprecated environment variables#

The following environment variables are removed in this ROCm release:

  • GPU_MAX_COMMAND_QUEUES

  • GPU_MAX_WORKGROUP_SIZE_2D_X

  • GPU_MAX_WORKGROUP_SIZE_2D_Y

  • GPU_MAX_WORKGROUP_SIZE_3D_X

  • GPU_MAX_WORKGROUP_SIZE_3D_Y

  • GPU_MAX_WORKGROUP_SIZE_3D_Z

  • GPU_BLIT_ENGINE_TYPE

  • GPU_USE_SYNC_OBJECTS

  • AMD_OCL_SC_LIB

  • AMD_OCL_ENABLE_MESSAGE_BOX

  • GPU_FORCE_64BIT_PTR

  • GPU_FORCE_OCL20_32BIT

  • GPU_RAW_TIMESTAMP

  • GPU_SELECT_COMPUTE_RINGS_ID

  • GPU_USE_SINGLE_SCRATCH

  • GPU_ENABLE_LARGE_ALLOCATION

  • HSA_LOCAL_MEMORY_ENABLE

  • HSA_ENABLE_COARSE_GRAIN_SVM

  • GPU_IFH_MODE

  • OCL_SYSMEM_REQUIREMENT

  • OCL_CODE_CACHE_ENABLE

  • OCL_CODE_CACHE_RESET

Known issues in this release#

The following are the known issues in this release.

DISTRIBUTED/TEST_DISTRIBUTED_SPAWN fails#

When user applications call ncclCommAbort to destroy communicators and then create new communicators repeatedly, subsequent communicators may fail to initialize.

This issue is under investigation and will be resolved in a future release.

Library changes in ROCm 5.5.0#

Library

Version

AMDMIGraphX

2.5

hipBLAS

0.53.0 ⇒ 0.54.0

hipBLASLt

0.1.0

hipCUB

2.13.0 ⇒ 2.13.1

hipFFT

1.0.10 ⇒ 1.0.11

hipRAND

2.10.16

hipSOLVER

1.6.0 ⇒ 1.7.0

hipSPARSE

2.3.3 ⇒ 2.3.5

MIOpen

2.19.0

MIVisionX

2.3.0

rccl

2.13.4 ⇒ 2.15.5

rocALUTION

2.1.3 ⇒ 2.1.8

rocBLAS

2.46.0 ⇒ 2.47.0

rocFFT

1.0.21 ⇒ 1.0.22

rocm-cmake

0.8.0 ⇒ 0.8.1

rocPRIM

2.12.0 ⇒ 2.13.0

rocRAND

2.10.16 ⇒ 2.10.17

rocSOLVER

3.20.0 ⇒ 3.21.0

rocSPARSE

2.4.0 ⇒ 2.5.1

rocThrust

2.17.0

rocWMMA

0.9 ⇒ 1.0

Tensile

4.35.0 ⇒ 4.36.0

AMDMIGraphX 2.5#

MIGraphX 2.5 for ROCm 5.5.0

Added#
  • Y-Model feature to store tuning information with the optimized model

  • Added Python 3.10 bindings

  • Accuracy checker tool based on ONNX Runtime

  • ONNX Operators parse_split and Trilu

  • Build support for ROCm MLIR

  • Added migraphx-driver flag to print optimizations in python (--python)

  • Added JIT implementation of the Gather and Pad operators, which results in better handling of larger tensor sizes.

Optimizations#
  • Improved performance of Transformer based models

  • Improved performance of the Pad, Concat, Gather, and Pointwise operators

  • Improved onnx/pb file loading speed

  • Added a general optimize pass which runs several passes, such as simplify_reshapes/algebra and DCE, in a loop.

Fixed#
  • Improved parsing of TensorFlow Protobuf files

  • Resolved various accuracy issues with some onnx models

  • Resolved a gcc-12 issue with mivisionx

  • Improved support for larger sized models and batches

  • Use --offload-arch instead of --cuda-gpu-arch for the HIP compiler

  • Changes inside JIT to use float accumulator for large reduce ops of half type to avoid overflow.

  • Changes inside JIT to temporarily use cosine to compute sine function.

Changed#
  • Changed version/location of 3rd party build dependencies to pick up fixes

hipBLAS 0.54.0#

hipBLAS 0.54.0 for ROCm 5.5.0

Added#
  • added an option to opt in to using __half for the hipblasHalf type in the API, for C++ users who define HIPBLAS_USE_HIP_HALF

  • added scripts to plot performance for multiple functions

  • data-driven hipblas-bench and hipblas-test execution via external YAML-format data files

  • client smoke test added for quick validation using the command hipblas-test --yaml hipblas_smoke.yaml

Fixed#
  • fixed datatype conversion functions to support more rocBLAS/cuBLAS datatypes

  • fixed geqrf to return successfully when nullptrs are passed in with n == 0 || m == 0

  • fixed getrs to return successfully when given nullptrs with corresponding size = 0

  • fixed getrs to give info = -1 when transpose is not an expected type

  • fixed gels to return successfully when given nullptrs with corresponding size = 0

  • fixed gels to give info = -1 when transpose is not in (‘N’, ‘T’) for real cases or not in (‘N’, ‘C’) for complex cases

Changed#
  • changed reference code for Windows to OpenBLAS

  • hipblas client executables all now begin with hipblas- prefix

hipBLASLt 0.1.0#

hipBLASLt 0.1.0 for ROCm 5.5.0

Added#
  • Enable hipBLASLt APIs

  • Support gfx90a

  • Support problem type: fp32, fp16, bf16

  • Support activation: relu, gelu

  • Support bias vector

  • Support Scale D vector

  • Integrate with tensilelite kernel generator

  • Add Gtest: hipblaslt-test

  • Add full function tool: hipblaslt-bench

  • Add sample app: example_hipblaslt_preference

Optimizations#
  • Grid-based solution search algorithm for untuned sizes

  • Tune 10k sizes for each problem type

hipCUB 2.13.1#

hipCUB 2.13.1 for ROCm 5.5.0

Added#
  • Benchmarks for BlockShuffle, BlockLoad, and BlockStore.

Changed#
  • CUB backend references CUB and Thrust version 1.17.2.

  • Improved benchmark coverage of BlockScan by adding ExclusiveScan, benchmark coverage of BlockRadixSort by adding SortBlockedToStriped, and benchmark coverage of WarpScan by adding Broadcast.

Fixed#
  • Windows HIP SDK support

Known Issues#
  • BlockRadixRankMatch is currently broken under the rocPRIM backend.

  • BlockRadixRankMatch with a warp size that does not exactly divide the block size is broken under the CUB backend.

hipFFT 1.0.11#

hipFFT 1.0.11 for ROCm 5.5.0

Fixed#
  • Fixed old-version rocm include/lib folders not being removed on upgrade.

hipRAND 2.10.16#

hipRAND 2.10.16 for ROCm 5.5.0

Added#
  • rocRAND backend support for Sobol 64, Scrambled Sobol 32 and 64, and MT19937.

  • hiprandGenerateLongLong for generating 64-bit uniformly distributed integers with Sobol 64 and Scrambled Sobol 64.
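
A minimal host-API sketch of the 64-bit generation path follows; the generator type constant HIPRAND_RNG_QUASI_SOBOL64, the output size, and the buffer name are illustrative assumptions, and error handling is omitted.

// Generate 64-bit uniformly distributed integers with the Sobol 64 generator.
#include <hiprand/hiprand.h>   // header location may vary by ROCm version
#include <hip/hip_runtime.h>

int main() {
  const size_t n = 1024;
  unsigned long long* d_out = nullptr;
  hipMalloc(&d_out, n * sizeof(unsigned long long));

  hiprandGenerator_t gen;
  hiprandCreateGenerator(&gen, HIPRAND_RNG_QUASI_SOBOL64);
  hiprandGenerateLongLong(gen, d_out, n);

  hiprandDestroyGenerator(gen);
  hipFree(d_out);
  return 0;
}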

Changed#
  • Python 2.7 is no longer officially supported.

hipSOLVER 1.7.0#

hipSOLVER 1.7.0 for ROCm 5.5.0

Added#
  • Added functions

    • gesvdj

      • hipsolverSgesvdj_bufferSize, hipsolverDgesvdj_bufferSize, hipsolverCgesvdj_bufferSize, hipsolverZgesvdj_bufferSize

      • hipsolverSgesvdj, hipsolverDgesvdj, hipsolverCgesvdj, hipsolverZgesvdj

    • gesvdjBatched

      • hipsolverSgesvdjBatched_bufferSize, hipsolverDgesvdjBatched_bufferSize, hipsolverCgesvdjBatched_bufferSize, hipsolverZgesvdjBatched_bufferSize

      • hipsolverSgesvdjBatched, hipsolverDgesvdjBatched, hipsolverCgesvdjBatched, hipsolverZgesvdjBatched

hipSPARSE 2.3.5#

hipSPARSE 2.3.5 for ROCm 5.5.0

Improved#
  • Fixed an issue where the rocm folder was not removed on upgrade of meta packages

  • Fixed a compilation issue with cusparse backend

  • Added more detailed messages on unit test failures due to missing input data

  • Improved documentation

  • Fixed a bug with deprecation messages when using gcc9 (Thanks @Maetveis)

MIOpen 2.19.0#

MIOpen 2.19.0 for ROCm 5.5.0

Added#
  • ROCm 5.5 support for gfx1101 (Navi32)

Changed#
  • Tuning results for MLIR on ROCm 5.5

  • Bumping MLIR commit to 5.5.0 release tag

Fixed#
  • Fix 3d convolution Host API bug

  • [HOTFIX][MI200][FP16] Disabled ConvHipImplicitGemmBwdXdlops when FP16_ALT is required.

rccl 2.15.5#

RCCL 2.15.5 for ROCm 5.5.0

Changed#
  • Compatibility with NCCL 2.15.5

  • Unit test executable renamed to rccl-UnitTests

Added#
  • HW-topology aware binary tree implementation

  • Experimental support for MSCCL

  • New unit tests for hipGraph support

  • NPKit integration

Fixed#
  • rocm-smi ID conversion

  • Support for HIP_VISIBLE_DEVICES for unit tests

  • Support for p2p transfers to non (HIP) visible devices

Removed#
  • Removed TransferBench from tools. Exists in standalone repo: https://github.com/ROCmSoftwarePlatform/TransferBench

rocALUTION 2.1.8#

rocALUTION 2.1.8 for ROCm 5.5.0

Added#
  • Added build support for Navi32

Improved#
  • Fixed a typo in MPI backend

  • Fixed a bug with the backend when HIP support is disabled

  • Fixed a bug in SAAMG hierarchy building on HIP backend

  • Improved SAAMG hierarchy build performance on HIP backend

Changed#
  • LocalVector::GetIndexValues(ValueType*) is deprecated, use LocalVector::GetIndexValues(const LocalVector&, LocalVector*) instead

  • LocalVector::SetIndexValues(const ValueType*) is deprecated, use LocalVector::SetIndexValues(const LocalVector&, const LocalVector&) instead

  • LocalMatrix::RSDirectInterpolation(const LocalVector&, const LocalVector&, LocalMatrix*, LocalMatrix*) is deprecated, use LocalMatrix::RSDirectInterpolation(const LocalVector&, const LocalVector&, LocalMatrix*) instead

  • LocalMatrix::RSExtPIInterpolation(const LocalVector&, const LocalVector&, bool, float, LocalMatrix*, LocalMatrix*) is deprecated, use LocalMatrix::RSExtPIInterpolation(const LocalVector&, const LocalVector&, bool, LocalMatrix*) instead

  • LocalMatrix::RugeStueben() is deprecated

  • LocalMatrix::AMGSmoothedAggregation(ValueType, const LocalVector&, const LocalVector&, LocalMatrix*, LocalMatrix*, int) is deprecated, use LocalMatrix::AMGAggregation(ValueType, const LocalVector&, const LocalVector&, LocalMatrix*, int) instead

  • LocalMatrix::AMGAggregation(const LocalVector&, LocalMatrix*, LocalMatrix*) is deprecated, use LocalMatrix::AMGAggregation(const LocalVector&, LocalMatrix*) instead

rocBLAS 2.47.0#

rocBLAS 2.47.0 for ROCm 5.5.0

Added#
  • added functionality rocblas_geam_ex for matrix-matrix minimum operations

  • added HIP Graph support as beta feature for rocBLAS Level 1, Level 2, and Level 3(pointer mode host) functions

  • added beta features API. Exposed using compiler define ROCBLAS_BETA_FEATURES_API

  • added support for vector initialization in the rocBLAS test framework with negative increments

  • added windows build documentation for forthcoming support using ROCm HIP SDK

  • added scripts to plot performance for multiple functions

Optimizations#
  • improved performance of Level 2 rocBLAS GEMV for float and double precision. Performance enhanced by 150-200% for certain problem sizes when (m==n) measured on a gfx90a GPU.

  • improved performance of Level 2 rocBLAS GER for float, double and complex float precisions. Performance enhanced by 5-7% for certain problem sizes measured on a gfx90a GPU.

  • improved performance of Level 2 rocBLAS SYMV for float and double precisions. Performance enhanced by 120-150% for certain problem sizes measured on both gfx908 and gfx90a GPUs.

Fixed#
  • fixed setting of executable mode on client script rocblas_gentest.py to avoid potential permission errors with clients rocblas-test and rocblas-bench

  • fixed deprecated API compatibility with Visual Studio compiler

  • fixed test framework memory exception handling for Level 2 functions when the host memory allocation exceeds the available memory

Changed#
  • install.sh internally runs rmake.py (also used on Windows); rmake.py may be used directly by developers on Linux (use --help)

  • rocblas client executables all now begin with rocblas- prefix

Removed#
  • install.sh removed the -o and --cov options, since Tensile now uses the default COV format, set by the CMake define Tensile_CODE_OBJECT_VERSION=default

rocFFT 1.0.22#

rocFFT 1.0.22 for ROCm 5.5.0

Optimizations#
  • Improved performance of 1D lengths < 2048 that use Bluestein’s algorithm.

  • Reduced time for generating code during plan creation.

  • Optimized 3D R2C/C2R lengths 32, 84, 128.

  • Optimized batched small 1D R2C/C2R cases.

Added#
  • Added gfx1101 to default AMDGPU_TARGETS.

Changed#
  • Moved client programs to C++17.

  • Moved planar kernels and infrequently used Stockham kernels to be runtime-compiled.

  • Moved transpose, real-complex, Bluestein, and Stockham kernels to library kernel cache.

Fixed#
  • Removed zero-length twiddle table allocations, which fixes errors from hipMallocManaged.

  • Fixed incorrect freeing of HIP stream handles during twiddle computation when multiple devices are present.

rocm-cmake 0.8.1#

rocm-cmake 0.8.1 for ROCm 5.5.0

Fixed#
  • ROCMInstallTargets: Added compatibility symlinks for included cmake files in <ROCM>/lib/cmake/<PACKAGE>.

Changed#
  • ROCMHeaderWrapper: The wrapper header deprecation message is now a deprecation warning.

rocPRIM 2.13.0#

rocPRIM 2.13.0 for ROCm 5.5.0

Added#
  • New block level radix_rank primitive.

  • New block level radix_rank_match primitive.

Changed#
  • Improved the performance of block_radix_sort and device_radix_sort.

Known Issues#
  • Disabled GPU error messages relating to incorrect warp operation usage with Navi GPUs on Windows, due to GPU printf performance issues on Windows.

Fixed#
  • Fixed benchmark build on Windows

rocRAND 2.10.17#

rocRAND 2.10.17 for ROCm 5.5.0

Added#
  • MT19937 pseudo random number generator based on M. Matsumoto and T. Nishimura, 1998, Mersenne Twister: A 623-dimensionally equidistributed uniform pseudorandom number generator.

  • New benchmark for the device API using Google Benchmark, benchmark_rocrand_device_api, replacing benchmark_rocrand_kernel. benchmark_rocrand_kernel is deprecated and will be removed in a future version. Likewise, benchmark_curand_host_api is added to replace benchmark_curand_generate and benchmark_curand_device_api is added to replace benchmark_curand_kernel.

  • experimental HIP-CPU feature

  • ThreeFry pseudorandom number generator based on Salmon et al., 2011, “Parallel random numbers: as easy as 1, 2, 3”.

Changed#
  • Python 2.7 is no longer officially supported.

Fixed#
  • Windows HIP SDK support

rocSOLVER 3.21.0#

rocSOLVER 3.21.0 for ROCm 5.5.0

Added#
  • SVD for general matrices using Jacobi algorithm:

    • GESVDJ (with batched and strided_batched versions)

  • LU factorization without pivoting for block tridiagonal matrices:

    • GEBLTTRF_NPVT (with batched and strided_batched versions)

  • Linear system solver without pivoting for block tridiagonal matrices:

    • GEBLTTRS_NPVT (with batched and strided_batched versions)

  • Product of triangular matrices

    • LAUUM

  • Added experimental hipGraph support for rocSOLVER functions

Optimized#
  • Improved the performance of SYEVJ/HEEVJ.

Changed#
  • STEDC, SYEVD/HEEVD and SYGVD/HEGVD now use fully implemented Divide and Conquer approach.

Fixed#
  • SYEVJ/HEEVJ should now be invariant under matrix scaling.

  • SYEVJ/HEEVJ should now properly output the eigenvalues when no sweeps are executed.

  • Fixed GETF2_NPVT and GETRF_NPVT input data initialization in tests and benchmarks.

  • Fixed rocblas missing from the dependency list of the rocsolver deb and rpm packages.

rocSPARSE 2.5.1#

rocSPARSE 2.5.1 for ROCm 5.5.0

Added#
  • Added bsrgemm and spgemm for BSR format

  • Added bsrgeam

  • Added build support for Navi32

  • Added experimental hipGraph support for some rocSPARSE routines

  • Added csritsv, spitsv csr iterative triangular solve

  • Added mixed precisions for SpMV

  • Added batched SpMM for transpose A in COO format with atomic algorithm

Improved#
  • Optimization to csr2bsr

  • Optimization to csr2csr_compress

  • Optimization to csr2coo

  • Optimization to gebsr2csr

  • Optimization to csr2gebsr

  • Fixed documentation issues

  • Fixed a bug in the COO SpMV grid size

  • Fixed a bug in the SpMM grid size when using very large matrices

Known Issues#
  • In csritlu0, the algorithm rocsparse_itilu0_alg_sync_split_fusion has some accuracy issues to investigate with XNACK enabled. The fallback is rocsparse_itilu0_alg_sync_split.

rocWMMA 1.0#

rocWMMA 1.0 for ROCm 5.5.0

Added#
  • Added support for wave32 on gfx11+

  • Added infrastructure changes to support hipRTC

  • Added performance tracking system

Changed#
  • Modified the assignment of hardware information

  • Modified the data access for unsigned datatypes

  • Added library config to support multiple architectures

Tensile 4.36.0#

Tensile 4.36.0 for ROCm 5.5.0

Added#
  • Add functions for user-driven tuning

  • Add GFX11 support: HostLibraryTests YAMLs, rearrange FP32(C)/FP64(C) instruction order, archCaps for the instruction renaming condition, adjust VGPR bank for A/B/C for optimization, separate vscnt and vmcnt, dual MAC

  • Add binary search for Grid-Based algorithm

  • Add reject condition for (StoreCInUnroll + BufferStore=0) and (DirectToVgpr + ScheduleIterAlg<3 + PrefetchGlobalRead==2)

  • Add support for (DirectToLds + hgemm + NN/NT/TT) and (DirectToLds + hgemm + GlobalLoadVectorWidth < 4)

  • Add support for (DirectToLds + hgemm(TLU=True only) or sgemm + NumLoadsCoalesced > 1)

  • Add GSU SingleBuffer algorithm for HSS/BSS

  • Add gfx900:xnack-, gfx1032, gfx1034, gfx1035

  • Enable gfx1031 support

Optimizations#
  • Use AssertSizeLessThan for BufferStoreOffsetLimitCheck if it is smaller than MT1

  • Improve InitAccVgprOpt

Changed#
  • Use global_atomic for GSU instead of flat and global_store for debug code

  • Replace flat_load/store with global_load/store

  • Use global_load/store for BufferLoad/Store=0 and enable scheduling

  • LocalSplitU support for HGEMM+HPA when MFMA disabled

  • Update Code Object Version

  • Type cast local memory to COMPUTE_DATA_TYPE in LDS to avoid precision loss

  • Update asm cap cache arguments

  • Unify SplitGlobalRead into ThreadSeparateGlobalRead and remove SplitGlobalRead

  • Change checks, error messages, assembly syntax, and coverage for DirectToLds

  • Remove unused cmake file

  • Clean up the LLVM dependency code

  • Update ThreadSeparateGlobalRead test cases for PrefetchGlobalRead=2

  • Update sgemm/hgemm test cases for DirectToLds and ThreadSeparateGlobalRead

Fixed#
  • Add build-id to header of compiled source kernels

  • Fix solution index collisions

  • Fix h beta vectorwidth4 correctness issue for WMMA

  • Fix an error with BufferStore=0

  • Fix mismatch issue with (StoreCInUnroll + PrefetchGlobalRead=2)

  • Fix MoveMIoutToArch bug

  • Fix flat load correctness issue on I8 and flat store correctness issue

  • Fix mismatch issue with BufferLoad=0 + TailLoop for large array sizes

  • Fix code generation error with BufferStore=0 and StoreCInUnrollPostLoop

  • Fix issues with DirectToVgpr + ScheduleIterAlg<3

  • Fix mismatch issue with DGEMM TT + LocalReadVectorWidth=2

  • Fix mismatch issue with PrefetchGlobalRead=2

  • Fix mismatch issue with DirectToVgpr + PrefetchGlobalRead=2 + small tile size

  • Fix an error with PersistentKernel=0 + PrefetchAcrossPersistent=1 + PrefetchAcrossPersistentMode=1

  • Fix mismatch issue with DirectToVgpr + DirectToLds + only 1 iteration in unroll loop case

  • Remove duplicate GSU kernels: for GSU = 1, GSUAlgorithm SingleBuffer and MultipleBuffer kernels are identical

  • Fix for failing CI tests due to CpuThreads=0

  • Fix mismatch issue with DirectToLds + PrefetchGlobalRead=2

  • Remove the reject condition for ThreadSeparateGlobalRead and DirectToLds (HGEMM, SGEMM only)

  • Modify reject condition for minimum lanes of ThreadSeparateGlobalRead (SGEMM or larger data type only)


ROCm 5.4.3#

Deprecations and warnings#
HIP Perl scripts deprecation#

The hipcc and hipconfig Perl scripts are deprecated. In a future release, compiled binaries will be available as hipcc.bin and hipconfig.bin as replacements for the Perl scripts.

Note

There will be a transition period during which both the Perl scripts and the compiled binaries are available before the scripts are removed. There is no functional difference between the Perl scripts and their compiled binary counterparts. No user action is required. Once the compiled binaries are available, users can optionally switch to hipcc.bin and hipconfig.bin. The hipcc/hipconfig soft links will eventually point to the respective compiled binaries as the default option.

Linux file system hierarchy standard for ROCm#

ROCm packages have adopted the Linux foundation file system hierarchy standard in this release to ensure ROCm components follow open source conventions for Linux-based distributions. While moving to a new file system hierarchy, ROCm ensures backward compatibility with its 5.1 version or older file system hierarchy. See below for a detailed explanation of the new file system hierarchy and backward compatibility.

New file system hierarchy#

The following is the new file system hierarchy:

/opt/rocm-<ver>
    | --bin
      | --All externally exposed Binaries
    | --libexec
        | --<component>
            | -- Component specific private non-ISA executables (architecture independent)
    | --include
        | -- <component>
            | --<header files>
    | --lib
        | --lib<soname>.so -> lib<soname>.so.major -> lib<soname>.so.major.minor.patch
            (public libraries linked with application)
        | --<component> (component specific private library, executable data)
        | --<cmake>
            | --components
                | --<component>.config.cmake
    | --share
        | --html/<component>/*.html
        | --info/<component>/*.[pdf, md, txt]
        | --man
        | --doc
            | --<component>
                | --<licenses>
        | --<component>
            | --<misc files> (arch independent non-executable)
            | --samples

Note

ROCm will not support backward compatibility with the v5.1(old) file system hierarchy in its next major release.

For more information, refer to https://refspecs.linuxfoundation.org/fhs.shtml.

Backward compatibility with older file systems#

ROCm has moved header files and libraries to its new location as indicated in the above structure and included symbolic-link and wrapper header files in its old location for backward compatibility.

Note

ROCm will continue supporting backward compatibility until the next major release.

Wrapper header files#

Wrapper header files are placed in the old location (/opt/rocm-xxx/<component>/include) with a warning message to include files from the new location (/opt/rocm-xxx/include) as shown in the example below:

// Code snippet from hip_runtime.h
#pragma message "This file is deprecated. Use file from include path /opt/rocm-ver/include/ and prefix with hip."
#include "hip/hip_runtime.h"

The wrapper header files’ backward compatibility deprecation is as follows:

  • #pragma message announcing deprecation – ROCm v5.2 release

  • #pragma message changed to #warning – Future release

  • #warning changed to #error – Future release

  • Backward compatibility wrappers removed – Future release

Library files#

Library files are available in the /opt/rocm-xxx/lib folder. For backward compatibility, the old library location (/opt/rocm-xxx/<component>/lib) has a soft link to the library at the new location.

Example:

$ ls -l /opt/rocm/hip/lib/
total 4
drwxr-xr-x 4 root root 4096 May 12 10:45 cmake
lrwxrwxrwx 1 root root   24 May 10 23:32 libamdhip64.so -> ../../lib/libamdhip64.so
CMake config files#

All CMake configuration files are available in the /opt/rocm-xxx/lib/cmake/<component> folder. For backward compatibility, the old CMake locations (/opt/rocm-xxx/<component>/lib/cmake) consist of a soft link to the new CMake config.

Example:

$ ls -l /opt/rocm/hip/lib/cmake/hip/
total 0
lrwxrwxrwx 1 root root 42 May 10 23:32 hip-config.cmake -> ../../../../lib/cmake/hip/hip-config.cmake
Defect fixes#
Compiler improvements#

In ROCm v5.4.3, improvements to the compiler address errors with the following signatures:

  • “error: unhandled SGPR spill to memory”

  • “cannot scavenge register without an emergency spill slot!”

  • “error: ran out of registers during register allocation”

Known issues#
Compiler option error at runtime#

Some users may encounter a “Cannot find Symbol” error at runtime when using -save-temps. While most -save-temps use cases work correctly, this error may appear occasionally.

This issue is under investigation, and the known workaround is not to use -save-temps when the error appears.

Library changes in ROCm 5.4.3#

Library

Version

hipBLAS

0.53.0

hipCUB

2.13.0

hipFFT

1.0.10

hipSOLVER

1.6.0

hipSPARSE

2.3.3

MIVisionX

2.3.0

rccl

2.13.4

rocALUTION

2.1.3

rocBLAS

2.46.0

rocFFT

1.0.20 ⇒ 1.0.21

rocm-cmake

0.8.0

rocPRIM

2.12.0

rocRAND

2.10.16

rocSOLVER

3.20.0

rocSPARSE

2.4.0

rocThrust

2.17.0

rocWMMA

0.9

Tensile

4.35.0

rocFFT 1.0.21#

rocFFT 1.0.21 for ROCm 5.4.3

Fixed#
  • Removed source directory from rocm_install_targets call to prevent installation of rocfft.h in an unintended location.


ROCm 5.4.2#

Deprecations and warnings#
HIP Perl scripts deprecation#

The hipcc and hipconfig Perl scripts are deprecated. In a future release, compiled binaries will be available as hipcc.bin and hipconfig.bin as replacements for the Perl scripts.

Note

There will be a transition period during which both the Perl scripts and the compiled binaries are available before the scripts are removed. There is no functional difference between the Perl scripts and their compiled binary counterparts. No user action is required. Once the compiled binaries are available, users can optionally switch to hipcc.bin and hipconfig.bin. The hipcc/hipconfig soft links will eventually point to the respective compiled binaries as the default option.

hipcc options deprecation#

The following hipcc options are being deprecated and will be removed in a future release:

  • The --amdgpu-target option is being deprecated; users must use the --offload-arch option to specify the GPU architecture.

  • The --amdhsa-code-object-version option is being deprecated. Users can use the Clang/LLVM option -mllvm -mcode-object-version to debug issues related to code object versions.

  • The --hipcc-func-supp/--hipcc-no-func-supp options are being deprecated, as the function calls are already supported in production on AMD GPUs.

Known issues#

Under certain circumstances typified by high register pressure, users may encounter a compiler abort with one of the following error messages:

  • error: unhandled SGPR spill to memory

  • cannot scavenge register without an emergency spill slot!

  • error: ran out of registers during register allocation

This is a known issue and will be fixed in a future release.

Library changes in ROCm 5.4.2#

Library

Version

hipBLAS

0.53.0

hipCUB

2.13.0

hipFFT

1.0.10

hipSOLVER

1.6.0

hipSPARSE

2.3.3

MIVisionX

2.3.0

rccl

2.13.4

rocALUTION

2.1.3

rocBLAS

2.46.0

rocFFT

1.0.20

rocm-cmake

0.8.0

rocPRIM

2.12.0

rocRAND

2.10.16

rocSOLVER

3.20.0

rocSPARSE

2.4.0

rocThrust

2.17.0

rocWMMA

0.9

Tensile

4.35.0


ROCm 5.4.1#

What’s new in this release#
HIP enhancements#

The ROCm v5.4.1 release consists of the following new HIP API:

New HIP API - hipLaunchHostFunc#

The following new HIP API is introduced in the ROCm v5.4.1 release.

Note

This is a pre-official version (beta) release of the new APIs.

hipError_t hipLaunchHostFunc(hipStream_t stream, hipHostFn_t fn, void* userData);

This enqueues a host function call in the given stream. The host function fn is invoked with userData once all previously enqueued work in the stream has completed.

@param [in] stream - Stream in which to enqueue the host function call

@param [in] fn - Host function to call

@param [in] userData - User-defined data passed to the host function

This API returns #hipSuccess or #hipErrorInvalidValue.
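Below is a minimal, illustrative sketch of how this API can be used; the callback name and message are placeholders rather than part of the official samples.

#include <hip/hip_runtime.h>
#include <cstdio>

// Host callback invoked after previously enqueued work in the stream completes.
static void hostCallback(void* userData) {
    printf("host callback: %s\n", static_cast<const char*>(userData));
}

int main() {
    hipStream_t stream;
    if (hipStreamCreate(&stream) != hipSuccess) return 1;
    const char* msg = "stream work finished";
    // Enqueue the host function on the stream.
    hipLaunchHostFunc(stream, hostCallback, const_cast<char*>(msg));
    hipStreamSynchronize(stream);
    hipStreamDestroy(stream);
    return 0;
}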

For more information, refer to the HIP API documentation at /bundle/HIP_API_Guide/page/modules.html.

Deprecations and warnings#
HIP Perl scripts deprecation#

The hipcc and hipconfig Perl scripts are deprecated. In a future release, compiled binaries will be available as hipcc.bin and hipconfig.bin as replacements for the Perl scripts.

Note

There will be a transition period during which both the Perl scripts and the compiled binaries are available before the scripts are removed. There is no functional difference between the Perl scripts and their compiled binary counterparts. No user action is required. Once the compiled binaries are available, users can optionally switch to hipcc.bin and hipconfig.bin. The hipcc/hipconfig soft links will eventually point to the respective compiled binaries as the default option.

IFWI fixes#

These defects were identified and documented as known issues in previous ROCm releases and are fixed in this release.

AMD Instinct™ MI200 firmware IFWI maintenance update #3#

This IFWI release fixes the following issue in AMD Instinct™ MI210/MI250 Accelerators.

After prolonged periods of operation, certain MI200 Instinct™ Accelerators may perform in a degraded way resulting in application failures.

In this package, AMD delivers a new firmware version for MI200 GPU accelerators and a firmware installation tool – AMD FW FLASH 1.2.

GPU

Production part number

SKU

IFWI name

MI210

113-D673XX

D67302

D6730200V.110

MI210

113-D673XX

D67301

D6730100V.073

MI250

113-D652XX

D65209

D6520900.073

MI250

113-D652XX

D65210

D6521000.073

Instructions on how to download and apply MI200 maintenance updates are available at:

https://www.amd.com/en/support/server-accelerators/amd-instinct/amd-instinct-mi-series/amd-instinct-mi210

AMD Instinct™ MI200 SRIOV virtualization support#

Maintenance update #3, combined with ROCm 5.4.1, now provides SRIOV virtualization support for all AMD Instinct™ MI200 devices.

Library changes in ROCm 5.4.1#

Library

Version

hipBLAS

0.53.0

hipCUB

2.13.0

hipFFT

1.0.10

hipSOLVER

1.6.0

hipSPARSE

2.3.3

MIVisionX

2.3.0

rccl

2.13.4

rocALUTION

2.1.3

rocBLAS

2.46.0

rocFFT

1.0.19 ⇒ 1.0.20

rocm-cmake

0.8.0

rocPRIM

2.12.0

rocRAND

2.10.16

rocSOLVER

3.20.0

rocSPARSE

2.4.0

rocThrust

2.17.0

rocWMMA

0.9

Tensile

4.35.0

rocFFT 1.0.20#

rocFFT 1.0.20 for ROCm 5.4.1

Fixed#
  • Fixed incorrect results on strided large 1D FFTs where batch size does not equal the stride.


ROCm 5.4.0#

What’s new in this release#
HIP enhancements#

The ROCm v5.4 release consists of the following HIP enhancements:

Support for wall_clock64#

A new timer function wall_clock64() is supported, which returns wall clock count at a constant frequency on the device.

long long int wall_clock64();

The constant frequency can be queried in HIP application code via the HIP API, using the hipDeviceAttributeWallClockRate attribute of the device.

Example:

int wallClkRate = 0; //in kilohertz
HIPCHECK(hipDeviceGetAttribute(&wallClkRate, hipDeviceAttributeWallClockRate, deviceId));

Where hipDeviceAttributeWallClockRate is a device attribute.

Note

The wall clock frequency is a per-device attribute.
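The following is a minimal sketch of using wall_clock64() to time device work; the kernel body and the conversion arithmetic are illustrative only.

#include <hip/hip_runtime.h>
#include <cstdio>

__global__ void timed_kernel(long long int* ticks) {
    long long int start = wall_clock64();
    // ... device work would go here ...
    *ticks = wall_clock64() - start;
}

int main() {
    int wallClkRate = 0; // in kilohertz, a per-device attribute
    hipDeviceGetAttribute(&wallClkRate, hipDeviceAttributeWallClockRate, 0);
    long long int* d_ticks = nullptr;
    hipMalloc(&d_ticks, sizeof(*d_ticks));
    timed_kernel<<<1, 1>>>(d_ticks);
    long long int ticks = 0;
    hipMemcpy(&ticks, d_ticks, sizeof(ticks), hipMemcpyDeviceToHost);
    // The rate is in kHz, so ticks / rate gives elapsed time in milliseconds.
    printf("elapsed: %f ms\n", ticks / static_cast<double>(wallClkRate));
    hipFree(d_ticks);
    return 0;
}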

New registry added for GPU_MAX_HW_QUEUES#

The GPU_MAX_HW_QUEUES registry defines the maximum number of independent hardware queues allocated per process per device.

The environment variable controls how many independent hardware queues HIP runtime can create per process, per device. If the application allocates more HIP streams than this number, then the HIP runtime reuses the same hardware queues for the new streams in a round-robin manner.
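As a minimal sketch of the round-robin behavior described above (assuming, for example, that GPU_MAX_HW_QUEUES=4 was exported before launch), creating more streams than the limit is still valid; the extra streams simply share hardware queues:

#include <hip/hip_runtime.h>
#include <vector>

int main() {
    // Eight streams with GPU_MAX_HW_QUEUES=4: the HIP runtime maps the streams
    // onto the four hardware queues in a round-robin manner.
    std::vector<hipStream_t> streams(8);
    for (auto& s : streams) hipStreamCreate(&s);
    for (auto& s : streams) hipStreamDestroy(s);
    return 0;
}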

Note

This maximum number does not apply to hardware queues created for CU-masked HIP streams or cooperative queues for HIP Cooperative Groups (there is only one queue per device).

For more details, refer to the HIP Programming Guide.

New HIP APIs in this release#

The following new HIP APIs are available in the ROCm v5.4 release.

Note

This is a pre-official version (beta) release of the new APIs.

Error handling#
hipError_t hipDrvGetErrorName(hipError_t hipError, const char** errorString);

This returns HIP errors in the text string format.

hipError_t hipDrvGetErrorString(hipError_t hipError, const char** errorString);

This returns text string messages with more details about the error.
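A minimal sketch of using the two new APIs together (the helper function name is illustrative only):

#include <hip/hip_runtime.h>
#include <cstdio>

void reportHipError(hipError_t err) {
    const char* name = nullptr;
    const char* desc = nullptr;
    hipDrvGetErrorName(err, &name);    // text-string name of the error
    hipDrvGetErrorString(err, &desc);  // more detailed description of the error
    printf("HIP error %d: %s - %s\n", static_cast<int>(err), name, desc);
}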

For more information, refer to the HIP API Guide.

HIP tests source separation#

With ROCm v5.4, a separate GitHub project is created at

ROCm/hip-tests

This repository contains the HIP catch2 tests and samples, and new tests will continue to be developed there.

In future ROCm releases, catch2 tests and samples will be removed from the HIP project.

OpenMP enhancements#

This release consists of the following OpenMP enhancements:

  • Enable new device RTL in libomptarget as default.

  • New flag -fopenmp-target-fast to imply -fopenmp-target-ignore-env-vars -fopenmp-assume-no-thread-state -fopenmp-assume-no-nested-parallelism.

  • Support for the collapse clause and non-unit stride in cases where the no-loop specialized kernel is generated.

  • Initial implementation of optimized cross-team sum reduction for float and double type scalars (see the sketch after this list).

  • Pool-based optimization in the OpenMP runtime to reduce locking during data transfer.
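A minimal sketch of an offloaded cross-team sum reduction that exercises the features listed above; the compile line and --offload-arch value are illustrative assumptions, not prescribed settings.

// Example compile line (illustrative):
//   clang++ -fopenmp --offload-arch=gfx90a -fopenmp-target-fast reduce.cpp -o reduce
#include <cstdio>

int main() {
    double sum = 0.0;
    // Offloaded loop with a cross-team sum reduction on a double scalar.
    #pragma omp target teams distribute parallel for map(tofrom: sum) reduction(+ : sum)
    for (int i = 0; i < 1000000; ++i)
        sum += 1.0 / (i + 1.0);
    printf("sum = %f\n", sum);
    return 0;
}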

Deprecations and warnings#
HIP Perl scripts deprecation#

The hipcc and hipconfig Perl scripts are deprecated. In a future release, compiled binaries will be available as hipcc.bin and hipconfig.bin as replacements for the Perl scripts.

Note

There will be a transition period during which both the Perl scripts and the compiled binaries are available before the scripts are removed. There is no functional difference between the Perl scripts and their compiled binary counterparts. No user action is required. Once the compiled binaries are available, users can optionally switch to hipcc.bin and hipconfig.bin. The hipcc/hipconfig soft links will eventually point to the respective compiled binaries as the default option.

Linux file system hierarchy standard for ROCm#

ROCm packages have adopted the Linux foundation file system hierarchy standard in this release to ensure ROCm components follow open source conventions for Linux-based distributions. While moving to a new file system hierarchy, ROCm ensures backward compatibility with its 5.1 version or older file system hierarchy. See below for a detailed explanation of the new file system hierarchy and backward compatibility.

New file system hierarchy#

The following is the new file system hierarchy:

/opt/rocm-<ver>
    | --bin
      | --All externally exposed Binaries
    | --libexec
        | --<component>
            | -- Component specific private non-ISA executables (architecture independent)
    | --include
        | -- <component>
            | --<header files>
    | --lib
        | --lib<soname>.so -> lib<soname>.so.major -> lib<soname>.so.major.minor.patch
            (public libraries linked with application)
        | --<component> (component specific private library, executable data)
        | --<cmake>
            | --components
                | --<component>.config.cmake
    | --share
        | --html/<component>/*.html
        | --info/<component>/*.[pdf, md, txt]
        | --man
        | --doc
            | --<component>
                | --<licenses>
        | --<component>
            | --<misc files> (arch independent non-executable)
            | --samples

Note

ROCm will not support backward compatibility with the v5.1(old) file system hierarchy in its next major release.

For more information, refer to https://refspecs.linuxfoundation.org/fhs.shtml.

Backward compatibility with older file systems#

ROCm has moved header files and libraries to its new location as indicated in the above structure and included symbolic-link and wrapper header files in its old location for backward compatibility.

Note

ROCm will continue supporting backward compatibility until the next major release.

Wrapper header files#

Wrapper header files are placed in the old location (/opt/rocm-xxx/<component>/include) with a warning message to include files from the new location (/opt/rocm-xxx/include) as shown in the example below:

// Code snippet from hip_runtime.h
#pragma message "This file is deprecated. Use file from include path /opt/rocm-ver/include/ and prefix with hip."
#include "hip/hip_runtime.h"

The wrapper header files’ backward compatibility deprecation is as follows:

  • #pragma message announcing deprecation – ROCm v5.2 release

  • #pragma message changed to #warning – Future release

  • #warning changed to #error – Future release

  • Backward compatibility wrappers removed – Future release

Library files#

Library files are available in the /opt/rocm-xxx/lib folder. For backward compatibility, the old library location (/opt/rocm-xxx/<component>/lib) has a soft link to the library at the new location.

Example:

$ ls -l /opt/rocm/hip/lib/
total 4
drwxr-xr-x 4 root root 4096 May 12 10:45 cmake
lrwxrwxrwx 1 root root   24 May 10 23:32 libamdhip64.so -> ../../lib/libamdhip64.so
CMake config files#

All CMake configuration files are available in the /opt/rocm-xxx/lib/cmake/<component> folder. For backward compatibility, the old CMake locations (/opt/rocm-xxx/<component>/lib/cmake) consist of a soft link to the new CMake config.

Example:

$ ls -l /opt/rocm/hip/lib/cmake/hip/
total 0
lrwxrwxrwx 1 root root 42 May 10 23:32 hip-config.cmake -> ../../../../lib/cmake/hip/hip-config.cmake
Defect fixes#

The following defects are fixed in this release.

These defects were identified and documented as known issues in previous ROCm releases and are fixed in this release.

Memory allocated using hipHostMalloc() with flags didn’t exhibit fine-grain behavior#
Issue#

The test was incorrectly using the hipDeviceAttributePageableMemoryAccess device attribute to determine coherent support.

Fix#

hipHostMalloc() allocates memory with fine-grained access by default when the environment variable HIP_HOST_COHERENT=1 is used.

For more information, refer to HIP Runtime API Reference.
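A minimal sketch of the behavior described above, assuming the process is started with HIP_HOST_COHERENT=1 so that the default allocation is fine-grained:

#include <hip/hip_runtime.h>

int main() {
    void* hostPtr = nullptr;
    // With HIP_HOST_COHERENT=1, the default flags yield fine-grained host memory.
    if (hipHostMalloc(&hostPtr, 4096, hipHostMallocDefault) != hipSuccess) return 1;
    hipHostFree(hostPtr);
    return 0;
}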

SoftHang with hipStreamWithCUMask test on AMD Instinct™#
Issue#

On GFX10 GPUs, kernel execution hangs when it is launched on streams created using hipStreamWithCUMask.

Fix#

On GFX10 GPUs, each workgroup processor encompasses two compute units, and the compute units must be enabled as a pair. The hipStreamWithCUMask API unit test cases are updated to set compute unit mask (cuMask) in pairs for GFX10 GPUs.
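As an illustration of the pairing requirement, the sketch below creates a CU-masked stream; it assumes the hipExtStreamCreateWithCUMask extension API and uses an arbitrary mask value that enables CUs 0-3 (two workgroup-processor pairs):

#include <hip/hip_runtime.h>
#include <cstdint>
#include <vector>

int main() {
    // On GFX10, adjacent compute units share a workgroup processor, so mask bits
    // must be set in pairs; 0xF enables CUs 0-3.
    std::vector<uint32_t> cuMask = {0x0000000F};
    hipStream_t stream;
    if (hipExtStreamCreateWithCUMask(&stream, cuMask.size(), cuMask.data()) != hipSuccess)
        return 1;
    hipStreamDestroy(stream);
    return 0;
}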

ROCm tools GPU IDs#

The HIP language device IDs are not the same as the GPU IDs reported by the tools. GPU IDs are globally unique and guaranteed to be consistent across APIs and processes.

GPU IDs reported by ROCTracer and ROCProfiler or ROCm Tools are HSA Driver Node ID of that GPU, as it is a unique ID for that device in that particular node.

Library changes in ROCm 5.4.0#

Library

Version

hipBLAS

0.52.0 ⇒ 0.53.0

hipCUB

2.12.0 ⇒ 2.13.0

hipFFT

1.0.9 ⇒ 1.0.10

hipSOLVER

1.5.0 ⇒ 1.6.0

hipSPARSE

2.3.1 ⇒ 2.3.3

MIVisionX

2.3.0

rccl

2.12.10 ⇒ 2.13.4

rocALUTION

2.1.0 ⇒ 2.1.3

rocBLAS

2.45.0 ⇒ 2.46.0

rocFFT

1.0.18 ⇒ 1.0.19

rocm-cmake

0.8.0

rocPRIM

2.11.0 ⇒ 2.12.0

rocRAND

2.10.15 ⇒ 2.10.16

rocSOLVER

3.19.0 ⇒ 3.20.0

rocSPARSE

2.2.0 ⇒ 2.4.0

rocThrust

2.16.0 ⇒ 2.17.0

rocWMMA

0.8 ⇒ 0.9

Tensile

4.34.0 ⇒ 4.35.0

hipBLAS 0.53.0#

hipBLAS 0.53.0 for ROCm 5.4.0

Added#
  • Allow for selection of int8 datatype

  • Added support for hipblasXgels and hipblasXgelsStridedBatched operations (with s,d,c,z precisions), only supported with rocBLAS backend

  • Added support for hipblasXgelsBatched operations (with s,d,c,z precisions)

hipCUB 2.13.0#

hipCUB 2.13.0 for ROCm 5.4.0

Added#
  • CMake functionality to improve build parallelism of the test suite that splits compilation units by function or by parameters.

  • New overload for BlockAdjacentDifference::SubtractLeftPartialTile that takes a predecessor item.

Changed#
  • Improved build parallelism of the test suite by splitting up large compilation units for DeviceRadixSort, DeviceSegmentedRadixSort and DeviceSegmentedSort.

  • CUB backend references CUB and thrust version 1.17.1.

hipFFT 1.0.10#

hipFFT 1.0.10 for ROCm 5.4.0

Added#
  • Added hipfftExtPlanScaleFactor API to efficiently multiply each output element of an FFT by a given scaling factor. Result scaling must be supported in the backend FFT library.

Changed#
  • When hipFFT is built against the rocFFT backend, rocFFT 1.0.19 or higher is now required.

hipSOLVER 1.6.0#

hipSOLVER 1.6.0 for ROCm 5.4.0

Added#
  • Added compatibility-only functions

    • gesvdaStridedBatched

      • hipsolverDnSgesvdaStridedBatched_bufferSize, hipsolverDnDgesvdaStridedBatched_bufferSize, hipsolverDnCgesvdaStridedBatched_bufferSize, hipsolverDnZgesvdaStridedBatched_bufferSize

      • hipsolverDnSgesvdaStridedBatched, hipsolverDnDgesvdaStridedBatched, hipsolverDnCgesvdaStridedBatched, hipsolverDnZgesvdaStridedBatched

hipSPARSE 2.3.3#

hipSPARSE 2.3.3 for ROCm 5.4.0

Added#
  • Added hipsparseCsr2cscEx2_bufferSize and hipsparseCsr2cscEx2 routines

Changed#
  • HIPSPARSE_ORDER_COLUMN has been renamed to HIPSPARSE_ORDER_COL to match cusparse

rccl 2.13.4#

RCCL 2.13.4 for ROCm 5.4.0

Changed#
  • Compatibility with NCCL 2.13.4

  • Improvements to RCCL when running with hipGraphs

  • RCCL_ENABLE_HIPGRAPH environment variable is no longer necessary to enable hipGraph support

  • Minor latency improvements

Fixed#
  • Resolved potential memory access error due to asynchronous memset

rocALUTION 2.1.3#

rocALUTION 2.1.3 for ROCm 5.4.0

Added#
  • Added build support for Navi31 and Navi33

  • Added support for non-squared global matrices

Improved#
  • Fixed a memory leak in MatrixMult on HIP backend

  • Global structures can now be used with a single process

Changed#
  • Switched GTest death test style to ‘threadsafe’

  • GlobalVector::GetGhostSize() is deprecated and will be removed

  • ParallelManager::GetGlobalSize(), ParallelManager::GetLocalSize(), ParallelManager::SetGlobalSize() and ParallelManager::SetLocalSize() are deprecated and will be removed

  • Vector::GetGhostSize() is deprecated and will be removed

  • Multigrid::SetOperatorFormat(unsigned int) is deprecated and will be removed, use Multigrid::SetOperatorFormat(unsigned int, int) instead

  • RugeStuebenAMG::SetCouplingStrength(ValueType) is deprecated and will be removed, use SetStrengthThreshold(float) instead

rocBLAS 2.46.0#

rocBLAS 2.46.0 for ROCm 5.4.0

Added#
  • Client smoke test dataset added for quick validation using the command rocblas-test --yaml rocblas_smoke.yaml

  • Added stream order device memory allocation as a non-default beta option.

Optimized#
  • Improved trsm performance for small sizes by using a substitution method technique

  • Improved syr2k and her2k performance significantly by using a block-recursive algorithm

Changed#
  • Level 2, Level 1, and Extension functions: argument checking when the handle is set to rocblas_pointer_mode_host now returns the status of rocblas_status_invalid_pointer only for pointers that must be dereferenced based on the alpha and beta argument values. With handle mode rocblas_pointer_mode_device only pointers that are always dereferenced regardless of alpha and beta values are checked and so may lead to a return status of rocblas_status_invalid_pointer. This improves consistency with legacy BLAS behaviour.

  • Add variable to turn on/off ieee16/ieee32 tests for mixed precision gemm

  • Allow hipBLAS to select int8 datatype

  • Disallow B == C && ldb != ldc in rocblas_xtrmm_outofplace

Fixed#
  • FORTRAN interfaces generalized for FORTRAN compilers other than gfortran

  • fix for trsm_strided_batched rocblas-bench performance gathering

  • Fix for rocm-smi path in commandrunner.py script to match ROCm 5.2 and above

rocFFT 1.0.19#

rocFFT 1.0.19 for ROCm 5.4.0

Optimizations#
  • Optimized some strided large 1D plans.

Added#
  • Added rocfft_plan_description_set_scale_factor API to efficiently multiply each output element of an FFT by a given scaling factor.

  • Created a rocfft_kernel_cache.db file next to the installed library. SBCC kernels are moved to this file when built with the library, and are runtime-compiled for new GPU architectures.

  • Added gfx1100 and gfx1102 to default AMDGPU_TARGETS.

Changed#
  • Moved runtime compilation cache to in-memory by default. A default on-disk cache can encounter contention problems on multi-node clusters with a shared filesystem. rocFFT can still be told to use an on-disk cache by setting the ROCFFT_RTC_CACHE_PATH environment variable.

rocPRIM 2.12.0#

rocPRIM 2.12.0 for ROCm 5.4.0

Changed#
  • device_partition, device_unique, and device_reduce_by_key now support problem sizes larger than 2^32 items.

Removed#
  • block_sort::sort() overload for keys and values with a dynamic size. This overload was documented but the implementation is missing. To avoid further confusion the documentation is removed until a decision is made on implementing the function.

Fixed#
  • Fixed the compilation failure in device_merge if the two key iterators don’t match.

rocRAND 2.10.16#

rocRAND 2.10.16 for ROCm 5.4.0

Added#
  • MRG31K3P pseudorandom number generator based on L’Ecuyer and Touzin, 2000, “Fast combined multiple recursive generators with multipliers of the form a = ±2^q ± 2^r”.

  • LFSR113 pseudorandom number generator based on L’Ecuyer, 1999, “Tables of maximally equidistributed combined LFSR generators”.

  • SCRAMBLED_SOBOL32 and SCRAMBLED_SOBOL64 quasirandom number generators. The Scrambled Sobol sequences are generated by scrambling the output of a Sobol sequence.

Changed#
  • The mrg_<distribution>_distribution structures, which provided numbers based on MRG32K3A, are now replaced by mrg_engine_<distribution>_distribution, where <distribution> is log_normal, normal, poisson, or uniform. These structures provide numbers for MRG31K3P (with template type rocrand_state_mrg31k3p) and MRG32K3A (with template type rocrand_state_mrg32k3a).

Fixed#
  • Sobol64 now returns 64 bits random numbers, instead of 32 bits random numbers. As a result, the performance of this generator has regressed.

  • Fixed a bug that prevented compiling code in C++ mode (with a host compiler) when it included the rocRAND headers on Windows.

rocSOLVER 3.20.0#

rocSOLVER 3.20.0 for ROCm 5.4.0

Added#
  • Partial SVD for bidiagonal matrices:

    • BDSVDX

  • Partial SVD for general matrices:

    • GESVDX (with batched and strided_batched versions)

Changed#
  • Changed ROCSOLVER_EMBED_FMT default to ON for users building directly with CMake. This matches the existing default when building with install.sh or rmake.py.

rocSPARSE 2.4.0#

rocSPARSE 2.4.0 for ROCm 5.4.0

Added#
  • Added rocsparse_spmv_ex routine

  • Added rocsparse_bsrmv_ex_analysis and rocsparse_bsrmv_ex routines

  • Added csritilu0 routine

  • Added build support for Navi31 and Navi33

Improved#
  • Optimized the segmented algorithm for COO SpMV by performing analysis

  • Improved performance when generating random matrices

  • Fixed a bug in ellmv

  • Optimized the bsr2csr routine

  • Fixed integer overflow bugs

rocThrust 2.17.0#

rocThrust 2.17.0 for ROCm 5.4.0

Added#
  • Updated to match upstream Thrust 1.17.0

rocWMMA 0.9#

rocWMMA 0.9 for ROCm 5.4.0

Added#
  • Added gemm driver APIs for flow control builtins

  • Added benchmark logging systems

  • Restructured tests to follow naming convention. Added macros for test generation

Changed#
  • Changed CMake to accommodate the modified test infrastructure

  • Fine tuned the multi-block kernels with and without lds

  • Adjusted Maximum Vector Width to dWordx4 Width

  • Updated Efficiencies to display as whole number percentages

  • Updated throughput from GFlops/s to TFlops/s

  • Reset the ad-hoc tests to use smaller sizes

  • Modified the output validation to use CPU-based implementation against rocWMMA

  • Modified the extended vector test to return error codes for memory allocation failures

Tensile 4.35.0#

Tensile 4.35.0 for ROCm 5.4.0

Added#
  • Async DMA support for Transpose Data Layout (ThreadSeparateGlobalReadA/B)

  • Option to output library logic in dictionary format

  • No solution found error message for benchmarking client

  • Exact K check for StoreCInUnrollExact

  • Support for CGEMM + MIArchVgpr

  • client-path parameter for using prebuilt client

  • CleanUpBuildFiles global parameter

  • Debug flag for printing library logic index of winning solution

  • NumWarmups global parameter for benchmarking

  • Windows support for benchmarking client

  • DirectToVgpr support for CGEMM

  • TensileLibLogicToYaml for creating tuning configs from library logic solutions

Optimizations#
  • Put beta code and store separately if StoreCInUnroll = x4 store

  • Improved performance for StoreCInUnroll + b128 store

Changed#
  • Re-enable HardwareMonitor for gfx90a

  • Decision trees use MLFeatures instead of Properties

Fixed#
  • Reject DirectToVgpr + MatrixInstBM/BN > 1

  • Fix benchmark timings when using warmups and/or validation

  • Fix mismatch issue with DirectToVgprB + VectorWidth > 1

  • Fix mismatch issue with DirectToLds + NumLoadsCoalesced > 1 + TailLoop

  • Fix incorrect reject condition for DirectToVgpr

  • Fix reject condition for DirectToVgpr + MIWaveTile < VectorWidth

  • Fix incorrect instruction generation with StoreCInUnroll


ROCm 5.3.3#

Defect fixes#
Issue with rocTHRUST and rocPRIM libraries#

There was a known issue with iterator and type support in the rocTHRUST and rocPRIM libraries in ROCm v5.3.x releases.

  • thrust::merge no longer correctly supports different iterator types for keys_input1 and keys_input2.

  • rocprim::device_merge no longer correctly supports using different types for keys_input1 and keys_input2.

This issue is resolved with the following fixes to compilation failures:

  • rocPRIM: in device_merge if the two key iterators do not match.

  • rocTHRUST: in thrust::merge if the two key iterators do not match.

Library changes in ROCm 5.3.3#

Library

Version

hipBLAS

0.52.0

hipCUB

2.12.0

hipFFT

1.0.9

hipSOLVER

1.5.0

hipSPARSE

2.3.1

MIVisionX

2.3.0

rccl

2.12.10

rocALUTION

2.1.0

rocBLAS

2.45.0

rocFFT

1.0.18

rocm-cmake

0.8.0

rocPRIM

2.11.0

rocRAND

2.10.15

rocSOLVER

3.19.0

rocSPARSE

2.2.0

rocThrust

2.16.0

rocWMMA

0.8

Tensile

4.34.0


ROCm 5.3.2#

Defect fixes#

The following known issues in ROCm v5.3.2 are fixed in this release.

Peer-to-peer DMA mapping errors with SLES and RHEL#

Peer-to-Peer Direct Memory Access (DMA) mapping errors on Dell systems (R7525 and R750XA) with SLES 15 SP3/SP4 and RHEL 9.0 are fixed in this release.

Previously, running rocminfo resulted in Peer-to-Peer DMA mapping errors.

RCCL tuning table#

The RCCL tuning table is updated for supported platforms.

SGEMM (F32 GEMM) routines in rocBLAS#

Functional correctness failures in SGEMM (F32 GEMM) routines in rocBLAS for certain problem sizes and ranges are fixed in this release.

Known issues#

This section consists of known issues in this release.

AMD Instinct™ MI200 SRIOV virtualization issue#

There is a known issue in this ROCm v5.3 release with all AMD Instinct™ MI200 devices running within a virtual function (VF) under SRIOV virtualization. This issue will likely impact the functionality of SRIOV-based workloads but does not impact Discrete Device Assignment (DDA) or bare metal.

Until a fix is provided, users should rely on ROCm v5.2.3 to support their SRIOV workloads.

AMD Instinct™ MI200 firmware updates#

Customers cannot update the Integrated Firmware Image (IFWI) for AMD Instinct™ MI200 accelerators.

An updated firmware maintenance bundle consisting of an installation tool and images specific to AMD Instinct™ MI200 accelerators is under planning and will be available soon.

Known issue with rocThrust and rocPRIM libraries#

There is a known issue with iterator and type support in the rocThrust and rocPRIM libraries in ROCm v5.3.x releases.

  • thrust::merge no longer correctly supports different iterator types for keys_input1 and keys_input2.

  • rocprim::device_merge no longer correctly supports using different types for keys_input1 and keys_input2.

This issue is currently under investigation and will be resolved in a future release.

Library changes in ROCm 5.3.2#

Library

Version

hipBLAS

0.52.0

hipCUB

2.12.0

hipFFT

1.0.9

hipSOLVER

1.5.0

hipSPARSE

2.3.1

MIVisionX

2.3.0

rccl

2.12.10

rocALUTION

2.1.0

rocBLAS

2.45.0

rocFFT

1.0.18

rocm-cmake

0.8.0

rocPRIM

2.11.0

rocRAND

2.10.15

rocSOLVER

3.19.0

rocSPARSE

2.2.0

rocThrust

2.16.0

rocWMMA

0.8

Tensile

4.34.0


ROCm 5.3.0#

Deprecations and warnings#
HIP Perl scripts deprecation#

The hipcc and hipconfig Perl scripts are deprecated. In a future release, compiled binaries will be available as hipcc.bin and hipconfig.bin as replacements for the Perl scripts.

Note

There will be a transition period during which both the Perl scripts and the compiled binaries are available before the scripts are removed. There is no functional difference between the Perl scripts and their compiled binary counterparts. No user action is required. Once the compiled binaries are available, users can optionally switch to hipcc.bin and hipconfig.bin. The hipcc/hipconfig soft links will eventually point to the respective compiled binaries as the default option.

Linux file system hierarchy standard for ROCm#

ROCm packages have adopted the Linux foundation file system hierarchy standard in this release to ensure ROCm components follow open source conventions for Linux-based distributions. While moving to a new file system hierarchy, ROCm ensures backward compatibility with its 5.1 version or older file system hierarchy. See below for a detailed explanation of the new file system hierarchy and backward compatibility.

New file system hierarchy#

The following is the new file system hierarchy:

/opt/rocm-<ver>
    | --bin
      | --All externally exposed Binaries
    | --libexec
        | --<component>
            | -- Component specific private non-ISA executables (architecture independent)
    | --include
        | -- <component>
            | --<header files>
    | --lib
        | --lib<soname>.so -> lib<soname>.so.major -> lib<soname>.so.major.minor.patch
            (public libraries linked with application)
        | --<component> (component specific private library, executable data)
        | --<cmake>
            | --components
                | --<component>.config.cmake
    | --share
        | --html/<component>/*.html
        | --info/<component>/*.[pdf, md, txt]
        | --man
        | --doc
            | --<component>
                | --<licenses>
        | --<component>
            | --<misc files> (arch independent non-executable)
            | --samples

Note

ROCm will not support backward compatibility with the v5.1(old) file system hierarchy in its next major release.

For more information, refer to https://refspecs.linuxfoundation.org/fhs.shtml.

Backward compatibility with older file systems#

ROCm has moved header files and libraries to its new location as indicated in the above structure and included symbolic-link and wrapper header files in its old location for backward compatibility.

Note

ROCm will continue supporting backward compatibility until the next major release.

Wrapper header files#

Wrapper header files are placed in the old location (/opt/rocm-xxx/<component>/include) with a warning message to include files from the new location (/opt/rocm-xxx/include) as shown in the example below:

// Code snippet from hip_runtime.h
#pragma message "This file is deprecated. Use file from include path /opt/rocm-ver/include/ and prefix with hip."
#include "hip/hip_runtime.h"

The wrapper header files’ backward compatibility deprecation is as follows:

  • #pragma message announcing deprecation – ROCm v5.2 release

  • #pragma message changed to #warning – Future release

  • #warning changed to #error – Future release

  • Backward compatibility wrappers removed – Future release

Library files#

Library files are available in the /opt/rocm-xxx/lib folder. For backward compatibility, the old library location (/opt/rocm-xxx/<component>/lib) has a soft link to the library at the new location.

Example:

$ ls -l /opt/rocm/hip/lib/
total 4
drwxr-xr-x 4 root root 4096 May 12 10:45 cmake
lrwxrwxrwx 1 root root   24 May 10 23:32 libamdhip64.so -> ../../lib/libamdhip64.so
CMake config files#

All CMake configuration files are available in the /opt/rocm-xxx/lib/cmake/<component> folder. For backward compatibility, the old CMake locations (/opt/rocm-xxx/<component>/lib/cmake) consist of a soft link to the new CMake config.

Example:

$ ls -l /opt/rocm/hip/lib/cmake/hip/
total 0
lrwxrwxrwx 1 root root 42 May 10 23:32 hip-config.cmake -> ../../../../lib/cmake/hip/hip-config.cmake
Defect fixes#

The following defects are fixed in this release.

These defects were identified and documented as known issues in previous ROCm releases and are fixed in the ROCm v5.3 release.

Kernel produces incorrect results with ROCm 5.2#

User code did not initialize certain data constructs, leading to a correctness issue. A strict reading of the C++ standard suggests that failing to initialize these data constructs is undefined behavior. However, a special case was added for a specific compiler builtin to handle the uninitialized data in a defined manner.

The compiler fix consists of the following patches:

  • A new noundef attribute is added. This attribute denotes when a function call argument or return value may never contain uninitialized bits. For more information, see https://reviews.llvm.org/D81678

  • The application of this attribute was refined such that it was not added to a specific compiler built-in where the compiler knows that inactive lanes do not impact program execution. For more information, see ROCm/llvm-project.

Known issues#

This section consists of known issues in this release.

Issue with OpenMP-extras package upgrade#

The openmp-extras package has been split into runtime (openmp-extras-runtime) and dev (openmp-extras-devel) packages. This change has broken the upgrade support for the openmp-extras package in RHEL/SLES.

An available workaround in RHEL is to use the following command for upgrades:

sudo yum upgrade rocm-language-runtime --allowerasing

An available workaround in SLES is to use the following command for upgrades:

zypper update --force-resolution <meta-package>
AMD Instinct™ MI200 SRIOV virtualization issue#

There is a known issue in this ROCm v5.3 release with all AMD Instinct™ MI200 devices running within a virtual function (VF) under SRIOV virtualization. This issue will likely impact the functionality of SRIOV-based workloads, but does not impact Discrete Device Assignment (DDA) or Bare Metal.

Until a fix is provided, users should rely on ROCm v5.2.3 to support their SRIOV workloads.

System crash when IOMMU is enabled#

If the input-output memory management unit (IOMMU) is enabled in SBIOS and ROCm is installed, the system may report the following failures or errors when running workloads such as bandwidth test, clinfo, and HelloWorld.cl, and may cause a system crash.

  • IO PAGE FAULT

  • IRQ remapping does not support X2APIC mode

  • NMI error

Workaround: To avoid the system crash, add amd_iommu=on iommu=pt to the kernel boot parameters, as indicated in the warning message.

Library changes in ROCm 5.3.0#

Library

Version

hipBLAS

0.51.0 ⇒ 0.52.0

hipCUB

2.11.1 ⇒ 2.12.0

hipFFT

1.0.8 ⇒ 1.0.9

hipSOLVER

1.4.0 ⇒ 1.5.0

hipSPARSE

2.2.0 ⇒ 2.3.1

MIVisionX

2.3.0

rccl

2.12.10

rocALUTION

2.0.3 ⇒ 2.1.0

rocBLAS

2.44.0 ⇒ 2.45.0

rocFFT

1.0.17 ⇒ 1.0.18

rocm-cmake

0.8.0

rocPRIM

2.10.14 ⇒ 2.11.0

rocRAND

2.10.14 ⇒ 2.10.15

rocSOLVER

3.18.0 ⇒ 3.19.0

rocSPARSE

2.2.0

rocThrust

2.15.0 ⇒ 2.16.0

rocWMMA

0.7 ⇒ 0.8

Tensile

4.33.0 ⇒ 4.34.0

hipBLAS 0.52.0#

hipBLAS 0.52.0 for ROCm 5.3.0

Added#
  • Added the --cudapath option to install.sh to allow users to specify which CUDA build they would like to use.

  • Added the --installcuda option to install.sh to install CUDA via a package manager. It can be used with the new --installcudaversion option to specify which version of CUDA to install.

Fixed#
  • Fixed #includes to support a compiler version.

  • Fixed client dependency support in install.sh

hipCUB 2.12.0#

hipCUB 2.12.0 for ROCm 5.3.0

Added#
  • UniqueByKey device algorithm

  • SubtractLeft, SubtractLeftPartialTile, SubtractRight, SubtractRightPartialTile overloads in BlockAdjacentDifference.

    • The old overloads (FlagHeads, FlagTails, FlagHeadsAndTails) are deprecated.

  • DeviceAdjacentDifference algorithm.

  • Extended benchmark suite of DeviceHistogram, DeviceScan, DevicePartition, DeviceReduce, DeviceSegmentedReduce, DeviceSegmentedRadixSort, DeviceRadixSort, DeviceSpmv, DeviceMergeSort, DeviceSegmentedSort

Changed#
  • Obsoleted type traits defined in util_type.hpp. Use the standard library equivalents instead.

  • CUB backend references CUB and thrust version 1.16.0.

  • DeviceRadixSort’s num_items parameter’s type is now templated instead of being an int; see the sketch after this list.

    • If an integral type with a size at most 4 bytes is passed (i.e. an int), the former logic applies.

    • Otherwise the algorithm uses a larger indexing type that makes it possible to sort input data over 2**32 elements.

  • Improved build parallelism of the test suite by splitting up large compilation units
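A minimal sketch of the templated num_items parameter mentioned above; device buffers are assumed to be allocated by the caller, and the helper name is illustrative only.

#include <hipcub/hipcub.hpp>

// Passing a 64-bit num_items selects the wider indexing path described above.
hipError_t sort_keys(const int* d_keys_in, int* d_keys_out, size_t num_items) {
    void* d_temp_storage = nullptr;
    size_t temp_storage_bytes = 0;
    // First call only queries the required temporary storage size.
    hipcub::DeviceRadixSort::SortKeys(d_temp_storage, temp_storage_bytes,
                                      d_keys_in, d_keys_out, num_items);
    hipMalloc(&d_temp_storage, temp_storage_bytes);
    hipError_t err = hipcub::DeviceRadixSort::SortKeys(d_temp_storage, temp_storage_bytes,
                                                       d_keys_in, d_keys_out, num_items);
    hipFree(d_temp_storage);
    return err;
}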

hipFFT 1.0.9#

hipFFT 1.0.9 for ROCm 5.3.0

Changed#
  • Clean up build warnings.

  • GNUInstall Dir enhancements.

  • Requires gtest 1.11.

hipSOLVER 1.5.0#

hipSOLVER 1.5.0 for ROCm 5.3.0

Added#
  • Added functions

    • syevj

      • hipsolverSsyevj_bufferSize, hipsolverDsyevj_bufferSize, hipsolverCheevj_bufferSize, hipsolverZheevj_bufferSize

      • hipsolverSsyevj, hipsolverDsyevj, hipsolverCheevj, hipsolverZheevj

    • syevjBatched

      • hipsolverSsyevjBatched_bufferSize, hipsolverDsyevjBatched_bufferSize, hipsolverCheevjBatched_bufferSize, hipsolverZheevjBatched_bufferSize

      • hipsolverSsyevjBatched, hipsolverDsyevjBatched, hipsolverCheevjBatched, hipsolverZheevjBatched

    • sygvj

      • hipsolverSsygvj_bufferSize, hipsolverDsygvj_bufferSize, hipsolverChegvj_bufferSize, hipsolverZhegvj_bufferSize

      • hipsolverSsygvj, hipsolverDsygvj, hipsolverChegvj, hipsolverZhegvj

  • Added compatibility-only functions

    • syevdx/heevdx

      • hipsolverDnSsyevdx_bufferSize, hipsolverDnDsyevdx_bufferSize, hipsolverDnCheevdx_bufferSize, hipsolverDnZheevdx_bufferSize

      • hipsolverDnSsyevdx, hipsolverDnDsyevdx, hipsolverDnCheevdx, hipsolverDnZheevdx

    • sygvdx/hegvdx

      • hipsolverDnSsygvdx_bufferSize, hipsolverDnDsygvdx_bufferSize, hipsolverDnChegvdx_bufferSize, hipsolverDnZhegvdx_bufferSize

      • hipsolverDnSsygvdx, hipsolverDnDsygvdx, hipsolverDnChegvdx, hipsolverDnZhegvdx

  • Added the --mem_query option to hipsolver-bench, which will print the amount of device memory workspace required by the function.

Changed#
  • The rocSOLVER backend will now set info to zero if rocSOLVER does not reference info. (Applies to orgbr/ungbr, orgqr/ungqr, orgtr/ungtr, ormqr/unmqr, ormtr/unmtr, gebrd, geqrf, getrs, potrs, and sytrd/hetrd).

  • gesvdj will no longer require extra workspace to transpose V when jobz is HIPSOLVER_EIG_MODE_VECTOR and econ is 1.

Fixed#
  • Fixed Fortran return value declarations within hipsolver_module.f90

  • Fixed gesvdj_bufferSize returning HIPSOLVER_STATUS_INVALID_VALUE when jobz is HIPSOLVER_EIG_MODE_NOVECTOR and 1 <= ldv < n

  • Fixed gesvdj returning HIPSOLVER_STATUS_INVALID_VALUE when jobz is HIPSOLVER_EIG_MODE_VECTOR, econ is 1, and m < n

hipSPARSE 2.3.1#

hipSPARSE 2.3.1 for ROCm 5.3.0

Added#
  • Add SpMM and SpMM batched for CSC format

rocALUTION 2.1.0#

rocALUTION 2.1.0 for ROCm 5.3.0

Added#
  • Benchmarking tool

  • Ext+I Interpolation with sparsify strategies added for RS-AMG

Improved#
  • ParallelManager

rocBLAS 2.45.0#

rocBLAS 2.45.0 for ROCm 5.3.0

Added#
  • install.sh option --upgrade_tensile_venv_pip to upgrade pip in the Tensile virtual environment. The corresponding CMake option is TENSILE_VENV_UPGRADE_PIP.

  • install.sh option --relocatable or -r adds rpath and removes the ldconf entry on the rocBLAS build.

  • install.sh option --lazy-library-loading to enable on-demand loading of Tensile library files at runtime to speed up rocBLAS initialization.

  • Support for RHEL9 and CS9.

  • Added a numerical checking routine for symmetric, Hermitian, and triangular matrices so that they can be checked for numerical abnormalities such as NaN, zero, infinity, and denormal values.

Optimizations#
  • trmm_outofplace performance improvements for all sizes and data types using block-recursive algorithm.

  • herkx performance improvements for all sizes and data types using block-recursive algorithm.

  • syrk/herk performance improvements by utilising optimised syrkx/herkx code.

  • symm/hemm performance improvements for all sizes and datatypes using block-recursive algorithm.

Changed#
  • Unifying library logic file names: affects HBH (->HHS_BH), BBH (->BBS_BH), 4xi8BH (->4xi8II_BH). All HPA types are using the new naming convention now.

  • Level 3 function argument checking when the handle is set to rocblas_pointer_mode_host now returns the status of rocblas_status_invalid_pointer only for pointers that must be dereferenced based on the alpha and beta argument values. With handle mode rocblas_pointer_mode_device only pointers that are always dereferenced regardless of alpha and beta values are checked and so may lead to a return status of rocblas_status_invalid_pointer. This improves consistency with legacy BLAS behaviour.

  • Level 1, 2, and 3 function argument checking for enums is now more rigorously matching legacy BLAS so returns rocblas_status_invalid_value if arguments do not match the accepted subset.

  • Add quick-return for internal trmm and gemm template functions.

  • Moved function block sizes to a shared header file.

  • Level 1, 2, and 3 functions use rocblas_stride datatype for offset.

  • Modified the matrix and vector memory allocation in our test infrastructure for all Level 1, 2, 3 and BLAS_EX functions.

  • Added specific initialization for symmetric, Hermitian, and triangular matrix types in our test infrastructure.

  • Added NaN tests to the test infrastructure for the rest of Level 3, BLAS_EX functions.

Fixed#
  • Improved logic to #include <filesystem> vs <experimental/filesystem>.

  • install.sh -s option to build rocblas as a static library.

  • dot function now sets the device results asynchronously for N <= 0

Deprecated#
  • is_complex helper is now deprecated. Use rocblas_is_complex instead.

  • The enum truncate_t and the value truncate are now deprecated and will be removed in ROCm release 6.0. They are replaced by rocblas_truncate_t and rocblas_truncate, respectively. The new enum rocblas_truncate_t and the value rocblas_truncate can be used from this ROCm release onward for an easy transition.

Removed#
  • install.sh options --hip-clang, --no-hip-clang, --merge-files, and --no-merge-files are removed.

rocFFT 1.0.18#

rocFFT 1.0.18 for ROCm 5.3.0

Changed#
  • Runtime compilation cache now looks for environment variables XDG_CACHE_HOME (on Linux) and LOCALAPPDATA (on Windows) before falling back to HOME.

Optimizations#
  • Optimized 2D R2C/C2R to use 2-kernel plans where possible.

  • Improved performance of the Bluestein algorithm.

  • Optimized sbcc-168 and 100 by using half-lds.

Fixed#
  • Fixed occasional failures to parallelize runtime compilation of kernels. Failures would be retried serially and ultimately succeed, but this would take extra time.

  • Fixed failures of some R2C 3D transforms that use the unsupported TILE_UNALIGNED SBRC kernels. An example is 98^3 R2C out-of-place.

  • Fixed bugs in SBRC_ERC type.

rocm-cmake 0.8.0#

rocm-cmake 0.8.0 for ROCm 5.3.0

Fixed#
  • Fixed error in prerm scripts created by rocm_create_package that could break uninstall for packages using the PTH option.

Changed#
  • ROCM_USE_DEV_COMPONENT set to on by default for all platforms. This means that Windows will now generate runtime and devel packages by default

  • ROCMInstallTargets now defaults CMAKE_INSTALL_LIBDIR to lib if not otherwise specified.

  • Changed default Debian compression type to xz and enabled multi-threaded package compression.

  • rocm_create_package will no longer warn upon failure to determine version of program rpmbuild.

rocPRIM 2.11.0#

rocPRIM 2.11.0 for ROCm 5.3.0

Added#
  • New functions subtract_left and subtract_right in block_adjacent_difference to apply functions on pairs of adjacent items distributed between threads in a block.

  • New device level adjacent_difference primitives.

  • Added experimental tooling for automatic kernel configuration tuning for various architectures

  • Benchmarks collect and output more detailed system information

  • CMake functionality to improve build parallelism of the test suite that splits compilation units by function or by parameters.

  • Reverse iterator.

rocRAND 2.10.15#

rocRAND 2.10.15 for ROCm 5.3.0

Changed#
  • Increased number of warmup iterations for rocrand_benchmark_generate from 5 to 15 to eliminate corner cases that would generate artificially high benchmark scores.

rocSOLVER 3.19.0#

rocSOLVER 3.19.0 for ROCm 5.3.0

Added#
  • Partial eigensolver routines for symmetric/hermitian matrices:

    • SYEVX (with batched and strided_batched versions)

    • HEEVX (with batched and strided_batched versions)

  • Generalized symmetric- and hermitian-definite partial eigensolvers:

    • SYGVX (with batched and strided_batched versions)

    • HEGVX (with batched and strided_batched versions)

  • Eigensolver routines for symmetric/hermitian matrices using Jacobi algorithm:

    • SYEVJ (with batched and strided_batched versions)

    • HEEVJ (with batched and strided_batched versions)

  • Generalized symmetric- and hermitian-definite eigensolvers using Jacobi algorithm:

    • SYGVJ (with batched and strided_batched versions)

    • HEGVJ (with batched and strided_batched versions)

  • Added the --profile_kernels option to rocsolver-bench, which will include kernel calls in the profile log (if profile logging is enabled with --profile).

Changed#
  • Changed rocsolver-bench result labels cpu_time and gpu_time to cpu_time_us and gpu_time_us, respectively.

Removed#
  • Removed dependency on cblas from the rocsolver test and benchmark clients.

Fixed#
  • Fixed incorrect SYGS2/HEGS2, SYGST/HEGST, SYGV/HEGV, and SYGVD/HEGVD results for batch counts larger than 32.

  • Fixed STEIN memory access fault when nev is 0.

  • Fixed incorrect STEBZ results for close eigenvalues when range = index.

  • Fixed git unsafe repository error when building with ./install.sh -cd as a non-root user.

rocThrust 2.16.0#

rocThrust 2.16.0 for ROCm 5.3.0

Changed#
  • rocThrust functionality dependent on device malloc is functional again because ROCm 5.2 re-enabled device malloc. Device-launched thrust::sort and thrust::sort_by_key are available for use.

rocWMMA 0.8#

rocWMMA 0.8 for ROCm 5.3.0

Tensile 4.34.0#

Tensile 4.34.0 for ROCm 5.3.0

Added#
  • Lazy loading of solution libraries and code object files

  • Support for dictionary style logic files

  • Support for decision tree based logic files using dictionary format

  • DecisionTreeLibrary for solution selection

  • DirectToLDS support for HGEMM

  • DirectToVgpr support for SGEMM

  • Grid based distance metric for solution selection

  • Support for gfx11xx

  • Support for DirectToVgprA/B + TLU=False

  • ForkParameters Groups as a way of specifying solution parameters

  • Support for a new Tensile yaml config format

  • TensileClientConfig for generating Tensile client config files

  • Options for TensileCreateLibrary to build client and create client config file

Optimizations#
  • Solution generation is now cached and is not repeated if solution parameters are unchanged

Changed#
  • Default MACInstruction to FMA

Fixed#
  • Accept StaggerUStride=0 as valid

  • Reject invalid data types for UnrollLoopEfficiencyEnable

  • Fix invalid code generation issues related to DirectToVgpr

  • Return hipErrorNotFound if no modules are loaded

  • Fix performance drop for NN ZGEMM with 96x64 macro tile

  • Fix memory violation for general batched kernels when alpha/beta/K = 0


ROCm 5.2.3#

Changes in this release#
Ubuntu 18.04 end-of-life announcement#

Support for Ubuntu 18.04 ends in this release. Future releases of ROCm will not provide prebuilt packages for Ubuntu 18.04.

HIP runtime#
Fixes#
  • A bug was discovered in the HIP graph capture implementation in the ROCm v5.2.0 release. If the same kernel was called twice (with different argument values) in a graph capture, the implementation kept only the argument values for the second kernel call.

  • A bug was introduced in the hiprtc implementation in the ROCm v5.2.0 release. This bug caused the hiprtcGetLoweredName call to fail for named expressions that contain whitespace.

Example:

The named expression my_sqrt<complex<double>> passed but my_sqrt<complex<double >> failed.
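
The following minimal sketch (not part of the release notes) illustrates the affected path: a named expression containing whitespace is registered with hiprtcAddNameExpression and later resolved with hiprtcGetLoweredName. With this fix, both spellings of the template arguments resolve successfully. Compile with hipcc; the hiprtc symbols are provided by the HIP runtime library.

#include <hip/hiprtc.h>

#include <cstdio>

int main()
{
    // A templated kernel that is instantiated through a named expression.
    static const char* kernel_src =
        "template <typename T>\n"
        "__global__ void my_sqrt(T* x) { *x = *x; }\n";

    hiprtcProgram prog;
    if (hiprtcCreateProgram(&prog, kernel_src, "prog.cu", 0, nullptr, nullptr) != HIPRTC_SUCCESS)
        return 1;

    // Named expression that contains whitespace; hiprtcGetLoweredName failed for
    // such names in ROCm 5.2.0 and succeeds again with this fix.
    const char* name_expr = "my_sqrt<float >";
    hiprtcAddNameExpression(prog, name_expr);

    if (hiprtcCompileProgram(prog, 0, nullptr) != HIPRTC_SUCCESS)
        return 1;

    const char* lowered = nullptr;
    if (hiprtcGetLoweredName(prog, name_expr, &lowered) == HIPRTC_SUCCESS)
        printf("lowered name: %s\n", lowered);

    hiprtcDestroyProgram(&prog);
    return 0;
}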

RCCL#
Additions#

  • Compatibility with NCCL 2.12.10

  • Packages for test and benchmark executables on all supported OSes using CPack

  • Added custom signal handler - opt-in with RCCL_ENABLE_SIGNALHANDLER=1

    • Additional details provided if Binary File Descriptor library (BFD) is pre-installed.

  • Added experimental support for using multiple ranks per device

    • Requires using a new interface to create communicator (ncclCommInitRankMulti), refer to the interface documentation for details.

    • To avoid potential deadlocks, users might have to set an environment variable to increase the number of hardware queues. For example:

export GPU_MAX_HW_QUEUES=16
  • Added support for reusing ports in NET/IB channels

    • Opt-in with NCCL_IB_SOCK_CLIENT_PORT_REUSE=1 and NCCL_IB_SOCK_SERVER_PORT_REUSE=1

    • If the “Call to bind failed: Address already in use” error occurs in large-scale AlltoAll runs (for example, 64 or more MI200 nodes), opt in to one or both of these options to resolve the excessive port usage

    • Avoid using NCCL_IB_SOCK_SERVER_PORT_REUSE when NCCL_NCHANNELS_PER_NET_PEER is tuned to a value greater than 1

Removals#
  • Removed experimental clique-based kernels

Development tools#

No notable changes in this release for development tools, including the compiler, profiler, and debugger.

No notable changes in this release for deployment and management tools.

For release information for older ROCm releases, refer to ROCm/ROCm

Library changes in ROCm 5.2.3#

| Library | Version |
| --- | --- |
| hipBLAS | 0.51.0 |
| hipCUB | 2.11.1 |
| hipFFT | 1.0.8 |
| hipSOLVER | 1.4.0 |
| hipSPARSE | 2.2.0 |
| MIVisionX | 2.3.0 |
| rccl | 2.11.4 ⇒ 2.12.10 |
| rocALUTION | 2.0.3 |
| rocBLAS | 2.44.0 |
| rocFFT | 1.0.17 |
| rocPRIM | 2.10.14 |
| rocRAND | 2.10.14 |
| rocSOLVER | 3.18.0 |
| rocSPARSE | 2.2.0 |
| rocThrust | 2.15.0 |
| rocWMMA | 0.7 |
| Tensile | 4.33.0 |

rccl 2.12.10#

RCCL 2.12.10 for ROCm 5.2.3

Added#
  • Compatibility with NCCL 2.12.10

  • Packages for test and benchmark executables on all supported OSes using CPack.

  • Added custom signal handler - opt-in with RCCL_ENABLE_SIGNALHANDLER=1

    • Additional details provided if Binary File Descriptor library (BFD) is pre-installed

  • Added support for reusing ports in NET/IB channels

    • Opt-in with NCCL_IB_SOCK_CLIENT_PORT_REUSE=1 and NCCL_IB_SOCK_SERVER_PORT_REUSE=1

    • If the “Call to bind failed: Address already in use” error occurs in large-scale AlltoAll runs (for example, 64 or more MI200 nodes), opt in to one or both of these options to resolve the excessive port usage

    • Avoid using NCCL_IB_SOCK_SERVER_PORT_REUSE when NCCL_NCHANNELS_PER_NET_PEER is tuned to a value greater than 1

Removed#
  • Removed experimental clique-based kernels


ROCm 5.2.1#

Library changes in ROCm 5.2.1#

| Library | Version |
| --- | --- |
| hipBLAS | 0.51.0 |
| hipCUB | 2.11.1 |
| hipFFT | 1.0.8 |
| hipSOLVER | 1.4.0 |
| hipSPARSE | 2.2.0 |
| MIVisionX | 2.2.0 ⇒ 2.3.0 |
| rccl | 2.11.4 |
| rocALUTION | 2.0.3 |
| rocBLAS | 2.44.0 |
| rocFFT | 1.0.17 |
| rocPRIM | 2.10.14 |
| rocRAND | 2.10.14 |
| rocSOLVER | 3.18.0 |
| rocSPARSE | 2.2.0 |
| rocThrust | 2.15.0 |
| rocWMMA | 0.7 |
| Tensile | 4.33.0 |

MIVisionX 2.3.0#

MIVisionX 2.3.0 for ROCm 5.2.1

Added#
  • Docker Support for ROCm 5.2.X

Optimizations#
Changed#
Fixed#
Tested Configurations#
  • Windows 10 / 11

  • Linux distribution

    • Ubuntu - 18.04 / 20.04

    • CentOS - 7 / 8

    • SLES - 15-SP2

  • ROCm: rocm-core - 5.2.0.50200-65

  • miopen-hip - 2.16.0.50101-48

  • miopen-opencl - 2.16.0.50101-48

  • migraphx - 2.1.0.50101-48

  • Protobuf - V3.12.4

  • OpenCV - 4.5.5

  • RPP - 0.93

  • FFMPEG - n4.4.2

  • Dependencies for all the above packages

  • MIVisionX Setup Script - V2.3.4

Known Issues#
  • OpenCV 4.X support for some apps missing

Mivisionx Dependency Map#

Docker Image: sudo docker build -f docker/ubuntu20/{DOCKER_LEVEL_FILE_NAME}.dockerfile -t {mivisionx-level-NUMBER} .

New components are introduced at each build level; in the table below, entries listed after “adds:” are new at that level, and the remaining components carry over from the previous level.

| Build Level | MIVisionX Dependencies | Modules | Libraries and Executables |
| --- | --- | --- | --- |
| Level_1 | cmake, gcc, g++ | amd_openvx, utilities | libopenvx.so (OpenVX™ Lib, CPU), libvxu.so (OpenVX™ immediate node Lib, CPU), runvx (OpenVX™ Graph Executor, CPU with Display OFF) |
| Level_2 | ROCm OpenCL, +Level 1 | amd_openvx, amd_openvx_extensions, utilities | libopenvx.so (OpenVX™ Lib, CPU/GPU), libvxu.so (OpenVX™ immediate node Lib, CPU/GPU), libvx_loomsl.so (Loom 360 Stitch Lib), loom_shell (360 Stitch App), runcl (OpenCL™ program debug App), runvx (OpenVX™ Graph Executor, Display OFF) |
| Level_3 | OpenCV, FFMPEG, +Level 2 | amd_openvx, amd_openvx_extensions, utilities | Level 2 components; adds: libvx_amd_media.so (OpenVX™ Media Extension), libvx_opencv.so (OpenVX™ OpenCV InterOp Extension), mv_compile (Neural Net Model Compile), runvx (OpenVX™ Graph Executor, Display ON) |
| Level_4 | MIOpenGEMM, MIOpen, ProtoBuf, +Level 3 | amd_openvx, amd_openvx_extensions, apps, utilities | Level 3 components; adds: libvx_nn.so (OpenVX™ Neural Net Extension), inference_server_app (Cloud Inference App) |
| Level_5 | AMD_RPP, rocAL deps, +Level 4 | amd_openvx, amd_openvx_extensions, apps, rocAL, utilities | Level 4 components; adds: libvx_rpp.so (OpenVX™ RPP Extension), librali.so (Radeon Augmentation Library), rali_pybind.so (rocAL Pybind Lib) |


ROCm 5.2.0#

What’s new in this release#
HIP enhancements#

The ROCm v5.2 release consists of the following HIP enhancements:

HIP installation guide updates#

The HIP Installation Guide is updated to include building HIP tests from source on the AMD and NVIDIA platforms.

For more details, refer to the HIP Installation Guide v5.2.

Support for device-side malloc on HIP-Clang#

HIP-Clang now supports device-side malloc. This implementation does not require the use of hipDeviceSetLimit(hipLimitMallocHeapSize, value), nor does it respect any such setting. The heap is fully dynamic and can grow until the available free memory on the device is consumed.

The test codes at the following link show how to implement applications using malloc and free functions in device kernels:

ROCm/HIP
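
The following is a minimal, self-contained sketch of device-side malloc and free in a HIP kernel; it is illustrative only, and the linked test codes remain the reference.

#include <hip/hip_runtime.h>

#include <cstdio>

// Each thread allocates a small scratch buffer from the device heap, uses it, and frees it.
__global__ void device_malloc_demo(int* out)
{
    const int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int* scratch = static_cast<int*>(malloc(4 * sizeof(int)));
    if (scratch == nullptr) {
        out[tid] = -1;
        return;
    }
    for (int i = 0; i < 4; ++i)
        scratch[i] = tid + i;
    out[tid] = scratch[0] + scratch[1] + scratch[2] + scratch[3];
    free(scratch);
}

int main()
{
    constexpr int threads = 64;
    int* d_out = nullptr;
    if (hipMalloc(&d_out, threads * sizeof(int)) != hipSuccess)
        return 1;

    hipLaunchKernelGGL(device_malloc_demo, dim3(1), dim3(threads), 0, 0, d_out);
    hipDeviceSynchronize();

    int h_out[threads];
    hipMemcpy(h_out, d_out, sizeof(h_out), hipMemcpyDeviceToHost);
    printf("out[0] = %d\n", h_out[0]);   // expected: 0 + 1 + 2 + 3 = 6

    hipFree(d_out);
    return 0;
}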

New HIP APIs in this release#

The following new HIP APIs are available in the ROCm v5.2 release. Note that this is a pre-official version (beta) release of the new APIs:

Device management HIP APIs#

The new device management HIP APIs are as follows; a short usage sketch follows the list:

  • Gets a UUID for the device

    hipError_t hipDeviceGetUuid(hipUUID* uuid, hipDevice_t device);
    

    Note that this new API corresponds to the following CUDA API:

      CUresult cuDeviceGetUuid(CUuuid* uuid, CUdevice dev);
    
  • Gets default memory pool of the specified device

    hipError_t hipDeviceGetDefaultMemPool(hipMemPool_t* mem_pool, int device);
    
  • Sets the current memory pool of a device

    hipError_t hipDeviceSetMemPool(int device, hipMemPool_t mem_pool);
    
  • Gets the current memory pool for the specified device

    hipError_t hipDeviceGetMemPool(hipMemPool_t* mem_pool, int device);
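
A minimal usage sketch for the device management APIs is shown below; it assumes a single visible device and that hipUUID exposes a 16-byte bytes array, as in the HIP headers. Error handling is reduced to early returns.

#include <hip/hip_runtime.h>

#include <cstdio>

int main()
{
    if (hipInit(0) != hipSuccess)
        return 1;

    // Query the first visible device and print its UUID bytes in hex.
    hipDevice_t device;
    if (hipDeviceGet(&device, 0) != hipSuccess)
        return 1;

    hipUUID uuid;
    if (hipDeviceGetUuid(&uuid, device) != hipSuccess)
        return 1;

    for (int i = 0; i < 16; ++i)
        printf("%02x", static_cast<unsigned char>(uuid.bytes[i]));
    printf("\n");
    return 0;
}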
    
New HIP runtime APIs in memory management#

The new Stream Ordered Memory Allocator functions in the HIP runtime memory management API are as follows; a short usage sketch follows the list:

  • Allocates memory with stream ordered semantics

    hipError_t hipMallocAsync(void** dev_ptr, size_t size, hipStream_t stream);
    
  • Frees memory with stream ordered semantics

    hipError_t hipFreeAsync(void* dev_ptr, hipStream_t stream);
    
  • Releases freed memory back to the OS

    hipError_t hipMemPoolTrimTo(hipMemPool_t mem_pool, size_t min_bytes_to_hold);
    
  • Sets attributes of a memory pool

    hipError_t hipMemPoolSetAttribute(hipMemPool_t mem_pool, hipMemPoolAttr attr, void* value);
    
  • Gets attributes of a memory pool

    hipError_t hipMemPoolGetAttribute(hipMemPool_t mem_pool, hipMemPoolAttr attr, void* value);
    
  • Controls visibility of the specified pool between devices

    hipError_t hipMemPoolSetAccess(hipMemPool_t mem_pool, const hipMemAccessDesc* desc_list, size_t count);
    
  • Returns the accessibility of a pool from a device

    hipError_t hipMemPoolGetAccess(hipMemAccessFlags* flags, hipMemPool_t mem_pool, hipMemLocation* location);
    
  • Creates a memory pool

    hipError_t hipMemPoolCreate(hipMemPool_t* mem_pool, const hipMemPoolProps* pool_props);
    
  • Destroys the specified memory pool

    hipError_t hipMemPoolDestroy(hipMemPool_t mem_pool);
    
  • Allocates memory from a specified pool with stream ordered semantics

    hipError_t hipMallocFromPoolAsync(void** dev_ptr, size_t size, hipMemPool_t mem_pool, hipStream_t stream);
    
  • Exports a memory pool to the requested handle type

    hipError_t hipMemPoolExportToShareableHandle(
        void*                      shared_handle,
        hipMemPool_t               mem_pool,
        hipMemAllocationHandleType handle_type,
        unsigned int               flags);
    
  • Imports a memory pool from a shared handle

    hipError_t hipMemPoolImportFromShareableHandle(
        hipMemPool_t*              mem_pool,
        void*                      shared_handle,
        hipMemAllocationHandleType handle_type,
        unsigned int               flags);
    
  • Exports data to share a memory pool allocation between processes

    hipError_t hipMemPoolExportPointer(hipMemPoolPtrExportData* export_data, void* dev_ptr);

  • Imports a memory pool allocation from another process

    hipError_t hipMemPoolImportPointer(
        void**                   dev_ptr,
        hipMemPool_t             mem_pool,
        hipMemPoolPtrExportData* export_data);
    
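
A minimal usage sketch for the stream-ordered allocator is shown below; the allocation, use, and free are all enqueued on the same stream, so no explicit synchronization is needed between them.

#include <hip/hip_runtime.h>

int main()
{
    hipStream_t stream;
    if (hipStreamCreate(&stream) != hipSuccess)
        return 1;

    // Allocation, use, and deallocation are all ordered with respect to the stream.
    void* buf = nullptr;
    const size_t bytes = 1 << 20;
    if (hipMallocAsync(&buf, bytes, stream) != hipSuccess)
        return 1;
    hipMemsetAsync(buf, 0, bytes, stream);
    hipFreeAsync(buf, stream);

    // The memory is released only after the stream reaches the free.
    hipStreamSynchronize(stream);
    hipStreamDestroy(stream);
    return 0;
}
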
HIP graph management APIs#

The new HIP Graph Management APIs are as follows; a short usage sketch follows the list:

  • Enqueues a host function call in a stream

    hipError_t hipLaunchHostFunc(hipStream_t stream, hipHostFn_t fn, void* userData);
    
  • Swaps the stream capture mode of a thread

    hipError_t hipThreadExchangeStreamCaptureMode(hipStreamCaptureMode* mode);
    
  • Sets a node attribute

    hipError_t hipGraphKernelNodeSetAttribute(hipGraphNode_t hNode, hipKernelNodeAttrID attr, const hipKernelNodeAttrValue* value);
    
  • Gets a node attribute

    hipError_t hipGraphKernelNodeGetAttribute(hipGraphNode_t hNode, hipKernelNodeAttrID attr, hipKernelNodeAttrValue* value);
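
A minimal usage sketch for hipLaunchHostFunc is shown below; the host function is enqueued on a stream and runs after the preceding work in that stream completes.

#include <hip/hip_runtime.h>

#include <cstdio>

// Host function executed by the runtime once all preceding work in the stream completes.
static void host_callback(void* user_data)
{
    printf("stream reached host callback: %s\n", static_cast<const char*>(user_data));
}

int main()
{
    hipStream_t stream;
    if (hipStreamCreate(&stream) != hipSuccess)
        return 1;

    static const char message[] = "done";
    // Enqueue the host function; it must not call HIP APIs itself.
    if (hipLaunchHostFunc(stream, host_callback, const_cast<char*>(message)) != hipSuccess)
        return 1;

    hipStreamSynchronize(stream);
    hipStreamDestroy(stream);
    return 0;
}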
    
Support for virtual memory management APIs#

The new APIs for virtual memory management are as follows; a usage sketch follows the list:

  • Frees an address range reservation made via hipMemAddressReserve

    hipError_t hipMemAddressFree(void* devPtr, size_t size);
    
  • Reserves an address range

    hipError_t hipMemAddressReserve(void** ptr, size_t size, size_t alignment, void* addr, unsigned long long flags);
    
  • Creates a memory allocation described by the properties and size

    hipError_t hipMemCreate(hipMemGenericAllocationHandle_t* handle, size_t size, const hipMemAllocationProp* prop, unsigned long long flags);
    
  • Exports an allocation to a requested shareable handle type

    hipError_t hipMemExportToShareableHandle(void* shareableHandle, hipMemGenericAllocationHandle_t handle, hipMemAllocationHandleType handleType, unsigned long long flags);
    
  • Gets the access flags set for the given location and ptr

    hipError_t hipMemGetAccess(unsigned long long* flags, const hipMemLocation* location, void* ptr);
    
  • Calculates either the minimal or recommended granularity

    hipError_t hipMemGetAllocationGranularity(size_t* granularity, const hipMemAllocationProp* prop, hipMemAllocationGranularity_flags option);
    
  • Retrieves the property structure of the given handle

    hipError_t hipMemGetAllocationPropertiesFromHandle(hipMemAllocationProp* prop, hipMemGenericAllocationHandle_t handle);
    
  • Imports an allocation from a requested shareable handle type

    hipError_t hipMemImportFromShareableHandle(hipMemGenericAllocationHandle_t* handle, void* osHandle, hipMemAllocationHandleType shHandleType);
    
  • Maps an allocation handle to a reserved virtual address range

    hipError_t hipMemMap(void* ptr, size_t size, size_t offset, hipMemGenericAllocationHandle_t handle, unsigned long long flags);
    
  • Maps or unmaps subregions of sparse HIP arrays and sparse HIP mipmapped arrays

    hipError_t hipMemMapArrayAsync(hipArrayMapInfo* mapInfoList, unsigned int count, hipStream_t stream);
    
  • Releases a memory handle representing a memory allocation that was previously allocated through hipMemCreate

    hipError_t hipMemRelease(hipMemGenericAllocationHandle_t handle);
    
  • Returns the allocation handle of the backing memory allocation given the address

    hipError_t hipMemRetainAllocationHandle(hipMemGenericAllocationHandle_t* handle, void* addr);
    
  • Sets the access flags for each location specified in desc for the given virtual address range

    hipError_t hipMemSetAccess(void* ptr, size_t size, const hipMemAccessDesc* desc, size_t count);
    
  • Unmaps memory allocation of a given address range

    hipError_t hipMemUnmap(void* ptr, size_t size);
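
A minimal usage sketch of the virtual memory management flow (reserve, create, map, set access, and tear down) is shown below. Enum and structure-field names follow the HIP virtual memory API headers; error checking is mostly omitted for brevity.

#include <hip/hip_runtime.h>

int main()
{
    const int device = 0;

    // Describe a pinned device allocation on the chosen device.
    hipMemAllocationProp prop = {};
    prop.type = hipMemAllocationTypePinned;
    prop.location.type = hipMemLocationTypeDevice;
    prop.location.id = device;

    // Round the requested size up to the minimum allocation granularity.
    size_t granularity = 0;
    hipMemGetAllocationGranularity(&granularity, &prop, hipMemAllocationGranularityMinimum);
    if (granularity == 0)
        return 1;
    const size_t size = ((1 << 20) + granularity - 1) / granularity * granularity;

    // Create physical memory, reserve a virtual address range, and map them together.
    hipMemGenericAllocationHandle_t handle;
    hipMemCreate(&handle, size, &prop, 0);
    void* ptr = nullptr;
    hipMemAddressReserve(&ptr, size, 0, nullptr, 0);
    hipMemMap(ptr, size, 0, handle, 0);

    // Grant the device read/write access to the mapped range before using it in kernels.
    hipMemAccessDesc access = {};
    access.location.type = hipMemLocationTypeDevice;
    access.location.id = device;
    access.flags = hipMemAccessFlagsProtReadWrite;
    hipMemSetAccess(ptr, size, &access, 1);

    // ... launch kernels that use ptr ...

    // Tear down in reverse order.
    hipMemUnmap(ptr, size);
    hipMemRelease(handle);
    hipMemAddressFree(ptr, size);
    return 0;
}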
    

For more information, refer to the HIP API documentation at Modules.

Planned HIP changes in future releases#

Changes to hipDeviceProp_t, HIPMEMCPY_3D, and hipArray structures (and related HIP APIs) are planned in the next major release. These changes may impact backward compatibility.

Refer to the release notes in subsequent releases for more information.

ROCm math and communication libraries#

In this release, ROCm math and communication libraries consist of the following enhancements and fixes:

  • New rocWMMA for matrix multiplication and accumulation operations acceleration

This release introduces a new ROCm C++ library for accelerating mixed-precision matrix multiplication and accumulation (MFMA) operations leveraging specialized GPU matrix cores. rocWMMA provides a C++ API to facilitate breaking down matrix multiply accumulate problems into fragments and using them in block-wise operations that are distributed in parallel across GPU wavefronts. The API is a header library of GPU device code, meaning matrix core acceleration may be compiled directly into your kernel device code. This can benefit from compiler optimization in the generation of kernel assembly and does not incur additional overhead costs of linking to external runtime libraries or having to launch separate kernels.

rocWMMA is released as a header library and includes test and sample projects to validate and illustrate example usage of the C++ API. GEMM matrix multiplication is used as the primary validation case, given its heavy precedent for the library; however, the usage portfolio is growing significantly and demonstrates different ways rocWMMA may be consumed.

For more information, refer to Communication libraries.
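
The following schematic sketch, modeled on the rocWMMA samples rather than taken from them, shows the fragment-based flow for a single 16x16x16 block using FP16 inputs and an FP32 accumulator. Fragment sizes and data types must match what the target GPU's matrix cores support, and the header path shown is the one used by current rocWMMA releases.

#include <hip/hip_runtime.h>
#include <rocwmma/rocwmma.hpp>

#include <cstdio>
#include <vector>

using rocwmma::float16_t;
using rocwmma::float32_t;

// One wavefront computes a single 16x16x16 block of D = A * B + C.
__global__ void wmma_gemm_16x16x16(const float16_t* a, const float16_t* b,
                                   const float32_t* c, float32_t* d)
{
    rocwmma::fragment<rocwmma::matrix_a, 16, 16, 16, float16_t, rocwmma::row_major> fragA;
    rocwmma::fragment<rocwmma::matrix_b, 16, 16, 16, float16_t, rocwmma::col_major> fragB;
    rocwmma::fragment<rocwmma::accumulator, 16, 16, 16, float32_t> fragAcc;

    rocwmma::load_matrix_sync(fragA, a, 16);
    rocwmma::load_matrix_sync(fragB, b, 16);
    rocwmma::load_matrix_sync(fragAcc, c, 16, rocwmma::mem_row_major);

    rocwmma::mma_sync(fragAcc, fragA, fragB, fragAcc);
    rocwmma::store_matrix_sync(d, fragAcc, 16, rocwmma::mem_row_major);
}

int main()
{
    const size_t n = 16 * 16;
    std::vector<float16_t> h_a(n, static_cast<float16_t>(1.0f));
    std::vector<float16_t> h_b(n, static_cast<float16_t>(1.0f));
    std::vector<float32_t> h_c(n, 0.0f), h_d(n, 0.0f);

    float16_t *d_a, *d_b;
    float32_t *d_c, *d_d;
    hipMalloc(&d_a, n * sizeof(float16_t));
    hipMalloc(&d_b, n * sizeof(float16_t));
    hipMalloc(&d_c, n * sizeof(float32_t));
    hipMalloc(&d_d, n * sizeof(float32_t));
    hipMemcpy(d_a, h_a.data(), n * sizeof(float16_t), hipMemcpyHostToDevice);
    hipMemcpy(d_b, h_b.data(), n * sizeof(float16_t), hipMemcpyHostToDevice);
    hipMemcpy(d_c, h_c.data(), n * sizeof(float32_t), hipMemcpyHostToDevice);

    // A single wavefront (64 work-items) cooperatively computes the block.
    hipLaunchKernelGGL(wmma_gemm_16x16x16, dim3(1), dim3(64), 0, 0, d_a, d_b, d_c, d_d);
    hipMemcpy(h_d.data(), d_d, n * sizeof(float32_t), hipMemcpyDeviceToHost);
    printf("D[0] = %f\n", static_cast<double>(h_d[0]));   // expected: 16.0

    hipFree(d_a); hipFree(d_b); hipFree(d_c); hipFree(d_d);
    return 0;
}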

OpenMP enhancements in this release#
OMPT target support#

The OpenMP runtime in ROCm implements a subset of the OMPT device APIs, as described in the OpenMP specification document. These are APIs that allow first-party tools to examine the profile and traces for kernels that execute on a device. A tool may register callbacks for data transfer and kernel dispatch entry points. A tool may use APIs to start and stop tracing for device-related activities, such as data transfer and kernel dispatch timings and associated metadata. If device tracing is enabled, trace records for device activities are collected during program execution and returned to the tool using the APIs described in the specification.

The following example demonstrates how a tool can use the supported OMPT target APIs. The README in /opt/rocm/llvm/examples/tools/ompt outlines the steps to follow, and you can run the provided example as indicated below:

cd /opt/rocm/llvm/examples/tools/ompt/veccopy-ompt-target-tracing
make run

The file veccopy-ompt-target-tracing.c simulates how a tool would initiate device activity tracing. The file callbacks.h shows the callbacks that may be registered and implemented by the tool.
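
The following is a minimal sketch of a first-party OMPT tool based on the interface defined in the OpenMP specification (omp-tools.h); it registers a single target callback. The veccopy example shipped under /opt/rocm/llvm/examples/tools/ompt remains the authoritative reference, and device activity tracing additionally uses the trace start/stop entry points described there.

#include <omp-tools.h>

#include <cstdio>

// Callback invoked at the begin/end of target regions; data-transfer and kernel-dispatch
// callbacks can be registered the same way.
static void on_target(ompt_target_t kind, ompt_scope_endpoint_t endpoint, int device_num,
                      ompt_data_t* task_data, ompt_id_t target_id, const void* codeptr_ra)
{
    std::printf("target event: kind=%d endpoint=%d device=%d\n",
                static_cast<int>(kind), static_cast<int>(endpoint), device_num);
}

static int ompt_tool_initialize(ompt_function_lookup_t lookup, int initial_device_num,
                                ompt_data_t* tool_data)
{
    auto set_callback = reinterpret_cast<ompt_set_callback_t>(lookup("ompt_set_callback"));
    set_callback(ompt_callback_target, reinterpret_cast<ompt_callback_t>(on_target));
    return 1;   // returning non-zero keeps the tool active
}

static void ompt_tool_finalize(ompt_data_t* tool_data) {}

// Entry point the OpenMP runtime looks up when loading a first-party tool.
extern "C" ompt_start_tool_result_t* ompt_start_tool(unsigned int omp_version,
                                                     const char* runtime_version)
{
    static ompt_start_tool_result_t result = {&ompt_tool_initialize, &ompt_tool_finalize, {0}};
    return &result;
}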

Deprecations and warnings#
Linux file system hierarchy standard for ROCm#

ROCm packages have adopted the Linux foundation file system hierarchy standard in this release to ensure ROCm components follow open source conventions for Linux-based distributions. While moving to a new file system hierarchy, ROCm ensures backward compatibility with its 5.1 version or older file system hierarchy. See below for a detailed explanation of the new file system hierarchy and backward compatibility.

New file system hierarchy#

The following is the new file system hierarchy:

/opt/rocm-<ver>
    | --bin
      | --All externally exposed Binaries
    | --libexec
        | --<component>
            | -- Component specific private non-ISA executables (architecture independent)
    | --include
        | -- <component>
            | --<header files>
    | --lib
        | --lib<soname>.so -> lib<soname>.so.major -> lib<soname>.so.major.minor.patch
            (public libraries linked with application)
        | --<component> (component specific private library, executable data)
        | --<cmake>
            | --components
                | --<component>.config.cmake
    | --share
        | --html/<component>/*.html
        | --info/<component>/*.[pdf, md, txt]
        | --man
        | --doc
            | --<component>
                | --<licenses>
        | --<component>
            | --<misc files> (arch independent non-executable)
            | --samples

Note

ROCm will not support backward compatibility with the v5.1(old) file system hierarchy in its next major release.

For more information, refer to https://refspecs.linuxfoundation.org/fhs.shtml.

Backward compatibility with older file systems#

ROCm has moved header files and libraries to its new location as indicated in the above structure and included symbolic-link and wrapper header files in its old location for backward compatibility.

Note

ROCm will continue supporting backward compatibility until the next major release.

Wrapper header files#

Wrapper header files are placed in the old location (/opt/rocm-xxx/<component>/include) with a warning message to include files from the new location (/opt/rocm-xxx/include) as shown in the example below:

// Code snippet from hip_runtime.h
#pragma message "This file is deprecated. Use file from include path /opt/rocm-ver/include/ and prefix with hip."
#include "hip/hip_runtime.h"

The wrapper header files’ backward compatibility deprecation is as follows:

  • #pragma message announcing deprecation – ROCm v5.2 release

  • #pragma message changed to #warning – Future release

  • #warning changed to #error – Future release

  • Backward compatibility wrappers removed – Future release

Library files#

Library files are available in the /opt/rocm-xxx/lib folder. For backward compatibility, the old library location (/opt/rocm-xxx/<component>/lib) has a soft link to the library at the new location.

Example:

$ ls -l /opt/rocm/hip/lib/
total 4
drwxr-xr-x 4 root root 4096 May 12 10:45 cmake
lrwxrwxrwx 1 root root   24 May 10 23:32 libamdhip64.so -> ../../lib/libamdhip64.so
CMake config files#

All CMake configuration files are available in the /opt/rocm-xxx/lib/cmake/<component> folder. For backward compatibility, the old CMake locations (/opt/rocm-xxx/<component>/lib/cmake) consist of a soft link to the new CMake config.

Example:

$ ls -l /opt/rocm/hip/lib/cmake/hip/
total 0
lrwxrwxrwx 1 root root 42 May 10 23:32 hip-config.cmake -> ../../../../lib/cmake/hip/hip-config.cmake
Planned deprecation of hip-rocclr and hip-base packages#

In the ROCm v5.2 release, hip-rocclr and hip-base packages (Debian and RPM) are planned for deprecation and will be removed in a future release. hip-runtime-amd and hip-dev(el) will replace these packages respectively. Users of hip-rocclr must install two packages, hip-runtime-amd and hip-dev, to get the same set of packages installed by hip-rocclr previously.

Currently, both package names hip-rocclr (or) hip-runtime-amd and hip-base (or) hip-dev(el) are supported.

Deprecation of integrated HIP directed tests#

The integrated HIP directed tests, which are currently built by default, are deprecated in this release. The default building and execution support through CMake will be removed in a future release.

Defect fixes#

| Defect | Fix |
| --- | --- |
| ROCmInfo does not list GPUs | Code fix |
| Hang observed while restoring cooperative group samples | Code fix |
| ROCM-SMI over SRIOV: Unsupported commands do not return proper error message | Code fix |

Known issues#

This section consists of known issues in this release.

Compiler error on gfx1030 when compiling at -O0#
Issue#

A compiler error occurs when using -O0 flag to compile code for gfx1030 that calls atomicAddNoRet, which is defined in amd_hip_atomic.h. The compiler generates an illegal instruction for gfx1030.

Workaround#

The workaround is not to use the -O0 flag for this case. For higher optimization levels, the compiler does not generate an invalid instruction.

System freeze observed during CUDA memtest checkpoint#
Issue#

Checkpoint/Restore in Userspace (CRIU) requires approximately 20 MB of VRAM to checkpoint and restore. The CRIU process may freeze if the maximum amount of available VRAM is allocated to checkpoint applications.

Workaround#

To use CRIU to checkpoint and restore your application, limit the amount of VRAM the application uses to ensure at least 20 MB is available.

HPC test fails with the “HSA_STATUS_ERROR_MEMORY_FAULT” error#
Issue#

The compiler may incorrectly compile a program that uses the __shfl_sync(mask, value, srcLane) function when the “value” parameter to the function is undefined along some path to the function. For most functions, uninitialized inputs cause undefined behavior, but the definition for __shfl_sync should allow for undefined values.

Workaround#

The workaround is to initialize the parameters to __shfl_sync.

Note

When the -Wall compilation flag is used, the compiler generates a warning indicating the variable may be uninitialized along some path.

Example:

double res = 0.0; // Initialize the input to __shfl_sync.
if (lane == 0) {
  res = <some expression>
}
res = __shfl_sync(mask, res, 0);
Kernel produces incorrect result#
Issue#

In recent changes to Clang, insertion of the noundef attribute to all the function arguments has been enabled by default.

In the HIP kernel, variable var in shfl_sync may not be initialized, so LLVM IR treats it as undef.

As a result, a function argument that is potentially undef (because it is not initialized) is assumed to be noundef by LLVM IR, since Clang has inserted the noundef attribute. This leads to ambiguous kernel execution.

Workaround#
  • Skip adding noundef attribute to functions tagged with convergent attribute. Refer to https://reviews.llvm.org/D124158 for more information.

  • Introduce a shuffle attribute and add it to __shfl-like APIs in the HIP headers. Clang can skip adding the noundef attribute if it finds that the argument is tagged with the shuffle attribute. Refer to https://reviews.llvm.org/D125378 for more information.

  • Introduce clang builtin for __shfl to identify it and skip adding noundef attribute.

  • Introduce __builtin_freeze to use on the relevant arguments in library wrappers. The library/header needs to insert freezes on the relevant inputs.

Issue with applications triggering oversubscription#

There is a known issue with applications that trigger oversubscription. A hardware hang occurs when ROCgdb is used on AMD Instinct™ MI50 and MI100 systems.

This issue is under investigation and will be fixed in a future release.

Library changes in ROCm 5.2.0#

| Library | Version |
| --- | --- |
| hipBLAS | 0.50.0 ⇒ 0.51.0 |
| hipCUB | 2.11.0 ⇒ 2.11.1 |
| hipFFT | 1.0.7 ⇒ 1.0.8 |
| hipSOLVER | 1.3.0 ⇒ 1.4.0 |
| hipSPARSE | 2.1.0 ⇒ 2.2.0 |
| MIVisionX | 2.2.0 |
| rccl | 2.11.4 |
| rocALUTION | 2.0.2 ⇒ 2.0.3 |
| rocBLAS | 2.43.0 ⇒ 2.44.0 |
| rocFFT | 1.0.16 ⇒ 1.0.17 |
| rocPRIM | 2.10.13 ⇒ 2.10.14 |
| rocRAND | 2.10.13 ⇒ 2.10.14 |
| rocSOLVER | 3.17.0 ⇒ 3.18.0 |
| rocSPARSE | 2.1.0 ⇒ 2.2.0 |
| rocThrust | 2.14.0 ⇒ 2.15.0 |
| rocWMMA | 0.7 |
| Tensile | 4.32.0 ⇒ 4.33.0 |

hipBLAS 0.51.0#

hipBLAS 0.51.0 for ROCm 5.2.0

Added#
  • Packages for test and benchmark executables on all supported OSes using CPack.

  • Added File/Folder Reorg Changes with backward compatibility support enabled using ROCM-CMAKE wrapper functions

  • Added user-specified initialization option to hipblas-bench

Fixed#
  • Fixed version gathering in performance measuring script

hipCUB 2.11.1#

hipCUB 2.11.1 for ROCm 5.2.0

Added#
  • Packages for tests and benchmark executable on all supported OSes using CPack.

hipFFT 1.0.8#

hipFFT 1.0.8 for ROCm 5.2.0

Added#
  • Added File/Folder Reorg Changes with backward compatibility support using ROCM-CMAKE wrapper functions.

  • Packages for test and benchmark executables on all supported OSes using CPack.

hipSOLVER 1.4.0#

hipSOLVER 1.4.0 for ROCm 5.2.0

Added#
  • Package generation for test and benchmark executables on all supported OSes using CPack.

  • File/Folder Reorg

    • Added File/Folder Reorg Changes with backward compatibility support using ROCM-CMAKE wrapper functions.

Fixed#
  • Fixed the ReadTheDocs documentation generation.

hipSPARSE 2.2.0#

hipSPARSE 2.2.0 for ROCm 5.2.0

Added#
  • Packages for test and benchmark executables on all supported OSes using CPack.

rocALUTION 2.0.3#

rocALUTION 2.0.3 for ROCm 5.2.0

Added#
  • Packages for test and benchmark executables on all supported OSes using CPack.

rocBLAS 2.44.0#

rocBLAS 2.44.0 for ROCm 5.2.0

Added#
  • Packages for test and benchmark executables on all supported OSes using CPack.

  • Added Denormal number detection to the Numerical checking helper function to detect denormal/subnormal numbers in the input and the output vectors of rocBLAS level 1 and 2 functions.

  • Added Denormal number detection to the Numerical checking helper function to detect denormal/subnormal numbers in the input and the output general matrices of rocBLAS level 2 and 3 functions.

  • Added NaN initialization tests to the yaml files of Level 2 rocBLAS batched and strided-batched functions for testing purposes.

  • Added memory allocation check to avoid disk swapping during rocblas-test runs by skipping tests.

Optimizations#
  • Improved performance of non-batched and batched her2 for all sizes and data types.

  • Improved performance of non-batched and batched amin for all data types using shuffle reductions.

  • Improved performance of non-batched and batched amax for all data types using shuffle reductions.

  • Improved performance of trsv for all sizes and data types.

Changed#
  • Modifying gemm_ex for HBH (High-precision F16). The alpha/beta data type remains as F32 without narrowing to F16 and expanding back to F32 in the kernel. This change prevents rounding errors due to alpha/beta conversion in situations where alpha/beta are not exactly represented as an F16.

  • Modified non-batched and batched asum, nrm2 functions to use shuffle instruction based reductions.

  • For gemm, gemm_ex, gemm_ex2 internal API use rocblas_stride datatype for offset.

  • For symm, hemm, syrk, herk, dgmm, geam internal API use rocblas_stride datatype for offset.

  • AMD copyright year for all rocBLAS files.

  • For gemv (transpose-case), typecasted the ‘lda’(offset) datatype to size_t during offset calculation to avoid overflow and remove duplicate template functions.

Fixed#
  • For function her2 avoid overflow in offset calculation.

  • For trsm when alpha == 0 and on host, allow A to be nullptr.

  • Fixed memory access issue in trsv.

  • Fixed git pre-commit script to update only AMD copyright year.

  • Fixed dgmm, geam test functions to set correct stride values.

  • For functions ssyr2k and dsyr2k allow trans == rocblas_operation_conjugate_transpose.

  • Fixed compilation error for clients-only build.

Removed#
  • Remove Navi12 (gfx1011) from fat binary.

rocFFT 1.0.17#

rocFFT 1.0.17 for ROCm 5.2.0

Added#
  • Packages for test and benchmark executables on all supported OSes using CPack.

  • Added File/Folder Reorg Changes with backward compatibility support using ROCM-CMAKE wrapper functions.

Changed#
  • Improved reuse of twiddle memory between plans.

  • Set a default load/store callback when only one callback type is set via the API for improved performance.

Optimizations#
  • Introduced a new access pattern of lds (non-linear) and applied it on sbcc kernels len 64 to get performance improvement.

Fixed#
  • Fixed plan creation failure in cases where SBCC kernels would need to write to non-unit-stride buffers.

rocPRIM 2.10.14#

rocPRIM 2.10.14 for ROCm 5.2.0

Added#
  • Packages for tests and benchmark executable on all supported OSes using CPack.

  • Added File/Folder Reorg Changes and Enabled Backward compatibility support using wrapper headers.

rocRAND 2.10.14#

rocRAND 2.10.14 for ROCm 5.2.0

Added#
  • Backward compatibility for deprecated #include <rocrand.h> using wrapper header files.

  • Packages for test and benchmark executables on all supported OSes using CPack.

rocSOLVER 3.18.0#

rocSOLVER 3.18.0 for ROCm 5.2.0

Added#
  • Partial eigenvalue decomposition routines:

    • STEBZ

    • STEIN

  • Package generation for test and benchmark executables on all supported OSes using CPack.

  • Added tests for multi-level logging

  • Added tests for rocsolver-bench client

  • File/Folder Reorg

    • Added File/Folder Reorg Changes with backward compatibility support using ROCM-CMAKE wrapper functions.

Fixed#
  • Fixed compatibility with libfmt 8.1

rocSPARSE 2.2.0#

rocSPARSE 2.2.0 for ROCm 5.2.0

Added#
  • batched SpMM for CSR, COO and Blocked ELL formats.

  • Packages for test and benchmark executables on all supported OSes using CPack.

  • Clients file importers and exporters.

Improved#
  • Clients code size reduction.

  • Clients error handling.

  • Clients benchmarking for performance tracking.

Changed#
  • Test adjustments due to roundoff errors.

  • Fixed API call compatibility with rocPRIM.

Known Issues#
  • none

rocThrust 2.15.0#

rocThrust 2.15.0 for ROCm 5.2.0

Added#
  • Packages for tests and benchmark executable on all supported OSes using CPack.

rocWMMA 0.7#

rocWMMA 0.7 for ROCm 5.2.0

Added#
  • Added unit tests for DLRM kernels

  • Added GEMM sample

  • Added DLRM sample

  • Added SGEMV sample

  • Added unit tests for cooperative wmma load and stores

  • Added unit tests for IOBarrier.h

  • Added wmma load/ store tests for different matrix types (A, B and Accumulator)

  • Added more block sizes 1, 2, 4, 8 to test MmaSyncMultiTest

  • Added block sizes 4, 8 to test MmaSynMultiLdsTest

  • Added support for wmma load / store layouts with block dimension greater than 64

  • Added IOShape structure to define the attributes of mapping and layouts for all wmma matrix types

  • Added CI testing for rocWMMA

Changed#
  • Renamed wmma to rocwmma in cmake, header files and documentation

  • Renamed library files

  • Modified Layout.h to use different matrix offset calculations (base offset, incremental offset and cumulative offset)

  • Opaque load/store continue to use incremental offsets as they fill the entire block

  • Cooperative load/store use cumulative offsets as they fill only small portions of the entire block

  • Increased Max split counts to 64 for cooperative load/store

  • Moved all the wmma definitions, API headers to rocwmma namespace

  • Modified wmma fill unit tests to validate all matrix types (A, B, Accumulator)

Tensile 4.33.0#

Tensile 4.33.0 for ROCm 5.2.0

Added#
  • TensileUpdateLibrary for updating old library logic files

  • Support for TensileRetuneLibrary to use sizes from separate file

  • ZGEMM DirectToVgpr/DirectToLds/StoreCInUnroll/MIArchVgpr support

  • Tests for denorm correctness

  • Option to write different architectures to different TensileLibrary files

Optimizations#
  • Optimize MessagePackLoadLibraryFile by switching to fread

  • DGEMM tail loop optimization for PrefetchAcrossPersistentMode=1/DirectToVgpr

Changed#
  • Alpha/beta datatype remains as F32 for HPA HGEMM

  • Force assembly kernels to not flush denorms

  • Use hipDeviceAttributePhysicalMultiProcessorCount as multiProcessorCount

Fixed#
  • Fix segmentation fault when run i8 datatype with TENSILE_DB=0x80


ROCm 5.1.3#

Library changes in ROCm 5.1.3#

| Library | Version |
| --- | --- |
| hipBLAS | 0.50.0 |
| hipCUB | 2.11.0 |
| hipFFT | 1.0.7 |
| hipRAND | 2.10.13 |
| hipSOLVER | 1.3.0 |
| hipSPARSE | 2.1.0 |
| MIVisionX | 2.1.0 ⇒ 2.2.0 |
| rccl | 2.11.4 |
| rocALUTION | 2.0.2 |
| rocBLAS | 2.43.0 |
| rocFFT | 1.0.16 |
| rocPRIM | 2.10.13 |
| rocRAND | 2.10.13 |
| rocSOLVER | 3.17.0 |
| rocSPARSE | 2.1.0 |
| rocThrust | 2.14.0 |
| Tensile | 4.32.0 |

MIVisionX 2.2.0#

MIVisionX 2.2.0 for ROCm 5.1.3

Added#
Optimizations#
Changed#
  • DockerFiles - Updates to install ROCm 5.1.1 Plus

Fixed#
Tested Configurations#
  • Windows 10 / 11

  • Linux distribution

    • Ubuntu - 18.04 / 20.04

    • CentOS - 7 / 8

    • SLES - 15-SP2

  • ROCm: rocm-core - 5.1.1.50101-48

  • miopen-hip - 2.16.0.50101-48

  • miopen-opencl - 2.16.0.50101-48

  • migraphx - 2.1.0.50101-48

  • Protobuf - V3.12.0

  • OpenCV - 4.5.5

  • RPP - 0.93

  • FFMPEG - n4.0.4

  • Dependencies for all the above packages

  • MIVisionX Setup Script - V2.3.0

Known Issues#
Mivisionx Dependency Map#

Docker Image: sudo docker build -f docker/ubuntu20/{DOCKER_LEVEL_FILE_NAME}.dockerfile -t {mivisionx-level-NUMBER} .

New components are introduced at each build level; in the table below, entries listed after “adds:” are new at that level, and the remaining components carry over from the previous level.

| Build Level | MIVisionX Dependencies | Modules | Libraries and Executables |
| --- | --- | --- | --- |
| Level_1 | cmake, gcc, g++ | amd_openvx, utilities | libopenvx.so (OpenVX™ Lib, CPU), libvxu.so (OpenVX™ immediate node Lib, CPU), runvx (OpenVX™ Graph Executor, CPU with Display OFF) |
| Level_2 | ROCm OpenCL, +Level 1 | amd_openvx, amd_openvx_extensions, utilities | libopenvx.so (OpenVX™ Lib, CPU/GPU), libvxu.so (OpenVX™ immediate node Lib, CPU/GPU), libvx_loomsl.so (Loom 360 Stitch Lib), loom_shell (360 Stitch App), runcl (OpenCL™ program debug App), runvx (OpenVX™ Graph Executor, Display OFF) |
| Level_3 | OpenCV, FFMPEG, +Level 2 | amd_openvx, amd_openvx_extensions, utilities | Level 2 components; adds: libvx_amd_media.so (OpenVX™ Media Extension), libvx_opencv.so (OpenVX™ OpenCV InterOp Extension), mv_compile (Neural Net Model Compile), runvx (OpenVX™ Graph Executor, Display ON) |
| Level_4 | MIOpenGEMM, MIOpen, ProtoBuf, +Level 3 | amd_openvx, amd_openvx_extensions, apps, utilities | Level 3 components; adds: libvx_nn.so (OpenVX™ Neural Net Extension), inference_server_app (Cloud Inference App) |
| Level_5 | AMD_RPP, rocAL deps, +Level 4 | amd_openvx, amd_openvx_extensions, apps, rocAL, utilities | Level 4 components; adds: libvx_rpp.so (OpenVX™ RPP Extension), librali.so (Radeon Augmentation Library), rali_pybind.so (rocAL Pybind Lib) |


ROCm 5.1.1#

Library changes in ROCm 5.1.1#

| Library | Version |
| --- | --- |
| hipBLAS | 0.50.0 |
| hipCUB | 2.11.0 |
| hipFFT | 1.0.7 |
| hipRAND | 2.10.13 |
| hipSOLVER | 1.3.0 |
| hipSPARSE | 2.1.0 |
| MIVisionX | 2.1.0 |
| rccl | 2.11.4 |
| rocALUTION | 2.0.2 |
| rocBLAS | 2.43.0 |
| rocFFT | 1.0.16 |
| rocPRIM | 2.10.13 |
| rocRAND | 2.10.13 |
| rocSOLVER | 3.17.0 |
| rocSPARSE | 2.1.0 |
| rocThrust | 2.14.0 |
| Tensile | 4.32.0 |


ROCm 5.1.0#

What’s new in this release#
HIP enhancements#

The ROCm v5.1 release consists of the following HIP enhancements.

HIP installation guide updates#

The HIP installation guide now includes information on installing and building HIP from source on AMD and NVIDIA platforms.

Refer to the HIP Installation Guide v5.1 for more details.

Support for HIP graph#

ROCm v5.1 extends support for HIP Graph.

Planned changes for HIP in future releases#
Separation of hiprtc (libhiprtc) library from hip runtime (amdhip64)#

On ROCm/Linux, to maintain backward compatibility, the HIP runtime library (amdhip64) will continue to include hiprtc symbols in future releases. This backward-compatible support may be discontinued by removing hiprtc symbols from the HIP runtime library (amdhip64) in the next major release.

hipDeviceProp_t structure enhancements#

Changes to the hipDeviceProp_t structure in the next major release may result in backward incompatibility. More details on these changes will be provided in subsequent releases.

ROCDebugger enhancements#
Multi-language source-level debugger#

The compiler now generates a source-level variable and function argument debug information.

Accuracy is guaranteed if the compiler options -g -O0 are used; this applies only to HIP.

This enhancement enables ROCDebugger users to interact with the HIP source-level variables and function arguments.

Note

The newly-suggested compiler -g option must be used instead of the previously-suggested -ggdb option. Although the effect of these two options is currently equivalent, this is not guaranteed for the future, as changes might be made by the upstream LLVM community.

Machine interface lanes support#

ROCDebugger Machine Interface (MI) extends support to lanes, which includes the following enhancements:

  • Added a new -lane-info command, listing the current thread’s lanes.

  • The -thread-select command now supports a lane switch to switch to a specific lane of a thread:

    -thread-select -l LANE THREAD
    
  • The =thread-selected notification gained a lane-id attribute. This enables the frontend to know which lane of the thread was selected.

  • The *stopped asynchronous record gained lane-id and hit-lanes attributes. The former indicates which lane is selected, and the latter indicates which lanes explain the stop.

  • MI commands now accept a global --lane option, similar to the global --thread and --frame options.

  • MI varobjs are now lane-aware.

For more information, refer to the ROC Debugger User Guide at ROCgdb.

Enhanced - clone-inferior command#

The clone-inferior command now ensures that the TTY, CMD, ARGS, and AMDGPU PRECISE-MEMORY settings are copied from the original inferior to the new one. All modifications to the environment variables done using the ‘set environment’ or ‘unset environment’ commands are also copied to the new inferior.

MIOpen support for RDNA GPUs#

This release includes support for AMD Radeon™ Pro W6800, in addition to other bug fixes and performance improvements as listed below:

  • MIOpen now supports RDNA GPUs (via MIOpen PRs 973, 780, 764, 740, 739, 677, 660, 653, 493, 498)

  • Fixed a correctness issue with ImplicitGemm algorithm

  • Updated the performance data for new kernel versions

  • Improved MIOpen build time by splitting large kernel header files

  • Fixed an issue in reduction kernels for padded tensors

  • Various other bug fixes and performance improvements

For more information, see Documentation.

Checkpoint restore support with CRIU#

The new Checkpoint Restore in Userspace (CRIU) functionality is implemented to support AMD GPU and ROCm applications.

CRIU is a userspace tool to Checkpoint and Restore an application.

CRIU previously lacked support for checkpointing and restoring applications that use device files, such as a GPU. With this ROCm release, CRIU is enhanced with a new plugin to support AMD GPUs, which includes:

  • Single and Multi GPU systems (Gfx9)

  • Checkpoint / Restore on a different system

  • Checkpoint / Restore inside a docker container

  • PyTorch

  • TensorFlow

  • Using CRIU Image Streamer

For more information, refer to checkpoint-restore/criu

Note

The CRIU plugin (amdgpu_plugin) is merged upstream with the CRIU repository. The KFD kernel patches are also available upstream with the amd-staging-drm-next branch (public) and the ROCm 5.1 release branch.

Note

This is a Beta release of the Checkpoint and Restore functionality, and some features are not available in this release.


Defect fixes#

The following defects are fixed in this release.

Driver fails to load after installation#

The issue with the driver failing to load after ROCm installation is now fixed.

The driver installs successfully, and the server reboots with working rocminfo and clinfo.

ROCDebugger defect fixes#
Breakpoints in GPU kernel code before kernel is loaded#

Previously, setting a breakpoint in device code by line number before the device code was loaded into the program resulted in ROCgdb incorrectly moving the breakpoint to the first following line that contains host code.

Now, the breakpoint is left pending. When the GPU kernel gets loaded, the breakpoint resolves to a location in the kernel.

Registers invalidated after write#

Previously, the stale just-written value was presented as a current value.

ROCgdb now invalidates the cached values of registers whose content might differ after being written. For example, registers with read-only bits.

ROCgdb also invalidates all volatile registers when a volatile register is written. For example, writing VCC invalidates the content of STATUS as STATUS.VCCZ may change.

Scheduler-locking and GPU wavefronts#

When scheduler-locking is in effect (for example, via the “set scheduler-locking” command), new wavefronts created by a resumed thread, whether a CPU thread or a GPU wavefront, are held in the halt state.

ROCDebugger fails before completion of kernel execution#

It was possible (although erroneous) for a debugger to load GPU code in memory, send it to the device, start executing a kernel on the device, and dispose of the original code before the kernel had finished execution. If a breakpoint was hit after this point, the debugger failed with an internal error while trying to access the debug information.

This issue is now fixed by ensuring that the debugger keeps a local copy of the original code and debug information.

Known issues#
Random memory access fault errors observed while running math libraries unit tests#

Issue: Random memory access fault issues are observed while running Math libraries unit tests. This issue is encountered in ROCm v5.0, ROCm v5.0.1, and ROCm v5.0.2.

Note, the faults only occur in the SRIOV environment.

Workaround: Use SDMA to update the page table. The Guest set up steps are as follows:

sudo modprobe amdgpu vm_update_mode=0

To verify the setting, run the following command in the guest:

cat /sys/module/amdgpu/parameters/vm_update_mode

The expected output is 0.

CU masking causes application to freeze#

Using CU masking results in an application freeze or exceptionally slow execution. This issue is observed only in the GFX10 suite of products.

This issue is under active investigation at this time.

Failed checkpoint in Docker containers#

A defect with Ubuntu images kernel-5.13-30-generic and kernel-5.13-35-generic with Overlay FS results in incorrect reporting of the mount ID.

This issue with Ubuntu causes CRIU checkpointing to fail in Docker containers.

As a workaround, use an older version of the kernel. For example, Ubuntu 5.11.0-46-generic.

Issue with restoring workloads using cooperative groups feature#

Workloads that use the cooperative groups function to ensure all waves can be resident at the same time may fail to restore correctly. This issue is under investigation and will be fixed in a future release.

Radeon Pro V620 and W6800 workstation GPUs#
No support for ROCDebugger on SRIOV#

ROCDebugger is not supported in the SRIOV environment on any GPU.

This is a known issue and will be fixed in a future release.

Random error messages in ROCm SMI for SR-IOV#

Random error messages are generated by unsupported functions or commands.

This is a known issue and will be fixed in a future release.

Library changes in ROCm 5.1.0#

| Library | Version |
| --- | --- |
| hipBLAS | 0.49.0 ⇒ 0.50.0 |
| hipCUB | 2.10.13 ⇒ 2.11.0 |
| hipFFT | 1.0.4 ⇒ 1.0.7 |
| hipRAND | 2.10.13 |
| hipSOLVER | 1.2.0 ⇒ 1.3.0 |
| hipSPARSE | 2.0.0 ⇒ 2.1.0 |
| MIVisionX | 2.1.0 |
| rccl | 2.10.3 ⇒ 2.11.4 |
| rocALUTION | 2.0.1 ⇒ 2.0.2 |
| rocBLAS | 2.42.0 ⇒ 2.43.0 |
| rocFFT | 1.0.13 ⇒ 1.0.16 |
| rocPRIM | 2.10.12 ⇒ 2.10.13 |
| rocRAND | 2.10.12 ⇒ 2.10.13 |
| rocSOLVER | 3.16.0 ⇒ 3.17.0 |
| rocSPARSE | 2.0.0 ⇒ 2.1.0 |
| rocThrust | 2.13.0 ⇒ 2.14.0 |
| Tensile | 4.31.0 ⇒ 4.32.0 |

hipBLAS 0.50.0#

hipBLAS 0.50.0 for ROCm 5.1.0

Added#
  • Added library version and device information to hipblas-test output

  • Added --rocsolver-path command line option to choose path to pre-built rocSOLVER, as absolute or relative path

  • Added --cmake_install command line option to update cmake to minimum version if required

  • Added cmake-arg parameter to pass in cmake arguments while building

  • Added infrastructure to support readthedocs hipBLAS documentation.

Fixed#
  • Added hipblasVersionMinor define. hipblaseVersionMinor remains defined for backwards compatibility.

  • Doxygen warnings in hipblas.h header file.

Changed#
  • rocblas-path command line option can be specified as either absolute or relative path

  • Help message improvements in install.sh and rmake.py

  • Updated googletest dependency from 1.10.0 to 1.11.0

hipCUB 2.11.0#

hipCUB 2.11.0 for ROCm 5.1.0

Added#
  • Device segmented sort

  • Warp merge sort, WarpMask and thread sort from cub 1.15.0 supported in hipCUB

  • Device three way partition

Changed#
  • Device_scan and device_segmented_scan: inclusive_scan now uses the input-type as accumulator-type, exclusive_scan uses initial-value-type.

    • This particularly changes behaviour of small-size input types with large-size output types (e.g. short input, int output).

    • And low-res input with high-res output (e.g. float input, double output)

    • Block merge sort no longer supports non-power-of-two block sizes

hipFFT 1.0.7#

hipFFT 1.0.7 for ROCm 5.1.0

Changed#
  • Use fft_params struct for accuracy and benchmark clients.

hipRAND 2.10.13#

hipRAND 2.10.13 for ROCm 5.1.0

Changed#
  • Header file installation location changed to match other libraries.

    • Code using the hiprand.h header file should now use #include <hiprand/hiprand.h>, rather than #include <hiprand.h>

    • Symlinks are included for backwards compatibility

hipSOLVER 1.3.0#

hipSOLVER 1.3.0 for ROCm 5.1.0

Added#
  • Added functions

    • gels

      • hipsolverSSgels_bufferSize, hipsolverDDgels_bufferSize, hipsolverCCgels_bufferSize, hipsolverZZgels_bufferSize

      • hipsolverSSgels, hipsolverDDgels, hipsolverCCgels, hipsolverZZgels

  • Added library version and device information to hipsolver-test output.

  • Added compatibility API with hipsolverDn prefix.

  • Added compatibility-only functions

    • gesvdj

      • hipsolverDnSgesvdj_bufferSize, hipsolverDnDgesvdj_bufferSize, hipsolverDnCgesvdj_bufferSize, hipsolverDnZgesvdj_bufferSize

      • hipsolverDnSgesvdj, hipsolverDnDgesvdj, hipsolverDnCgesvdj, hipsolverDnZgesvdj

    • gesvdjBatched

      • hipsolverDnSgesvdjBatched_bufferSize, hipsolverDnDgesvdjBatched_bufferSize, hipsolverDnCgesvdjBatched_bufferSize, hipsolverDnZgesvdjBatched_bufferSize

      • hipsolverDnSgesvdjBatched, hipsolverDnDgesvdjBatched, hipsolverDnCgesvdjBatched, hipsolverDnZgesvdjBatched

    • syevj

      • hipsolverDnSsyevj_bufferSize, hipsolverDnDsyevj_bufferSize, hipsolverDnCheevj_bufferSize, hipsolverDnZheevj_bufferSize

      • hipsolverDnSsyevj, hipsolverDnDsyevj, hipsolverDnCheevj, hipsolverDnZheevj

    • syevjBatched

      • hipsolverDnSsyevjBatched_bufferSize, hipsolverDnDsyevjBatched_bufferSize, hipsolverDnCheevjBatched_bufferSize, hipsolverDnZheevjBatched_bufferSize

      • hipsolverDnSsyevjBatched, hipsolverDnDsyevjBatched, hipsolverDnCheevjBatched, hipsolverDnZheevjBatched

    • sygvj

      • hipsolverDnSsygvj_bufferSize, hipsolverDnDsygvj_bufferSize, hipsolverDnChegvj_bufferSize, hipsolverDnZhegvj_bufferSize

      • hipsolverDnSsygvj, hipsolverDnDsygvj, hipsolverDnChegvj, hipsolverDnZhegvj

Changed#
  • The rocSOLVER backend now allows hipsolverXXgels and hipsolverXXgesv to be called in-place when B == X.

  • The rocSOLVER backend now allows rwork to be passed as a null pointer to hipsolverXgesvd.

Fixed#
  • bufferSize functions will now return HIPSOLVER_STATUS_NOT_INITIALIZED instead of HIPSOLVER_STATUS_INVALID_VALUE when both handle and lwork are null.

  • Fixed rare memory allocation failure in syevd/heevd and sygvd/hegvd caused by improper workspace array allocation outside of rocSOLVER.

hipSPARSE 2.1.0#

hipSPARSE 2.1.0 for ROCm 5.1.0

Added#
  • Added gtsv_interleaved_batch and gpsv_interleaved_batch routines

  • Add SpGEMM_reuse

Changed#
  • Changed BUILD_CUDA with USE_CUDA in install script and cmake files

  • Update googletest to 11.1

Improved#
  • Fixed a bug in SpMM Alg versioning

Known Issues#
  • none

rccl 2.11.4#

RCCL 2.11.4 for ROCm 5.1.0

Added#
  • Compatibility with NCCL 2.11.4

Known Issues#
  • Managed memory is not currently supported for clique-based kernels

rocALUTION 2.0.2#

rocALUTION 2.0.2 for ROCm 5.1.0

Added#
  • Added out-of-place matrix transpose functionality

  • Added LocalVector<bool>

rocBLAS 2.43.0#

rocBLAS 2.43.0 for ROCm 5.1.0

Added#
  • Option to install script for number of jobs to use for rocBLAS and Tensile compilation (-j, --jobs)

  • Option to install script to build clients without using any Fortran (--clients_no_fortran)

  • rocblas_client_initialize function, to perform rocBLAS initialization for clients (benchmark/test) and report the execution time.

  • Added tests for output of reduction functions when given bad input

  • Added user specified initialization (rand_int/trig_float/hpl) for initializing matrices and vectors in rocblas-bench

Optimizations#
  • Improved performance of trsm with side == left and n == 1

  • Improved performance of trsm with side == left and m <= 32, along with side == right and n <= 32

Changed#
  • For syrkx and trmm internal API use rocblas_stride datatype for offset

  • For non-batched and batched gemm_ex functions if the C matrix pointer equals the D matrix pointer (aliased) their respective type and leading dimension arguments must now match

  • Test client dependencies updated to GTest 1.11

  • Moved non-global false positives reported by cppcheck from file-based suppression to inline suppression. File-based suppression will only be used for global false positives.

  • Help menu messages in install.sh

  • For ger function, typecast the ‘lda’(offset) datatype to size_t during offset calculation to avoid overflow and remove duplicate template functions.

  • Modified default initialization from rand_int to hpl for initializing matrices and vectors in rocblas-bench

Fixed#
  • For function trmv (non-transposed cases) avoid overflow in offset calculation

  • Fixed cppcheck errors/warnings

  • Fixed doxygen warnings

rocFFT 1.0.16#

rocFFT 1.0.16 for ROCm 5.1.0

Changed#
  • Supported unaligned tile dimension for SBRC_2D kernels.

  • Improved (more RAII) test and benchmark infrastructure.

  • Enabled runtime compilation of length-2304 FFT kernel during plan creation.

Optimizations#
  • Optimized more large 1D cases by using L1D_CC plan.

  • Optimized 3D 200^3 C2R case.

  • Optimized 1D 2^30 double precision on MI200.

Fixed#
  • Fixed correctness of some R2C transforms with unusual strides.

Removed#
  • The hipFFT API (header) has been removed after a long deprecation period. Please use the hipFFT package/repository to obtain the hipFFT API.

rocPRIM 2.10.13#

rocPRIM 2.10.13 for ROCm 5.1.0

Fixed#
  • Fixed radix sort int64_t bug introduced in [2.10.11]

Added#
  • Future value

  • Added device partition_three_way to partition input to three output iterators based on two predicates

Changed#
  • The reduce/scan algorithm precision issues in the tests have been resolved for half types.

Known Issues#
  • device_segmented_radix_sort unit test failing for HIP on Windows

rocRAND 2.10.13#

rocRAND 2.10.13 for ROCm 5.1.0

Added#
  • Generating a random sequence of different sizes now produces the same sequence without gaps, independent of how many values are generated per call.

    • Only in the case of XORWOW, MRG32K3A, PHILOX4X32_10, SOBOL32 and SOBOL64

    • This only holds true if the size in each call is a divisor of the distribution's output_width, due to performance considerations

    • Similarly the output pointer has to be aligned to output_width * sizeof(output_type)

Changed#
  • hipRAND split into a separate package

  • Header file installation location changed to match other libraries.

    • Code using the rocrand.h header file should now use #include <rocrand/rocrand.h>, rather than #include <rocrand.h>

  • rocRAND still includes hipRAND using a submodule

    • The rocRAND package also sets the provides field with hipRAND, so projects which require hipRAND can begin to specify it.

Fixed#
  • Fixed offset behaviour for the XORWOW, MRG32K3A and PHILOX4X32_10 generators; setting the offset now correctly generates the same sequence starting from the offset.

    • Only uniform int and float will work as these can be generated with a single call to the generator

Known Issues#
  • kernel_xorwow unit test is failing for certain GPU architectures.

rocSOLVER 3.17.0#

rocSOLVER 3.17.0 for ROCm 5.1.0

Optimized#
  • Optimized non-pivoting and batch cases of the LU factorization

Fixed#
  • Fixed missing synchronization in SYTRF with rocblas_fill_lower that could potentially result in incorrect pivot values.

  • Fixed multi-level logging output to file with the ROCSOLVER_LOG_PATH, ROCSOLVER_LOG_TRACE_PATH, ROCSOLVER_LOG_BENCH_PATH and ROCSOLVER_LOG_PROFILE_PATH environment variables.

  • Fixed performance regression in the batched LU factorization of tiny matrices

rocSPARSE 2.1.0#

rocSPARSE 2.1.0 for ROCm 5.1.0

Added#
  • gtsv_interleaved_batch

  • gpsv_interleaved_batch

  • SpGEMM_reuse

  • Allow copying of mat info struct

Improved#
  • Optimization for SDDMM

  • Allow unsorted matrices in csrgemm multipass algorithm

Known Issues#
  • none

rocThrust 2.14.0#

rocThrust 2.14.0 for ROCm 5.1.0

Added#
  • Updated to match upstream Thrust 1.15.0

Known Issues#
  • async_copy, partition, and stable_sort_by_key unit tests are failing on HIP on Windows.

Tensile 4.32.0#

Tensile 4.32.0 for ROCm 5.1.0

Added#
  • Better control of parallelism to control memory usage

  • Support for multiprocessing on Windows for TensileCreateLibrary

  • New JSD metric and metric selection functionality

  • Initial changes to support two-tier solution selection

Optimized#
  • Optimized runtime of TensileCreateLibraries by reducing max RAM usage

  • StoreCInUnroll additional optimizations plus adaptive K support

  • DGEMM NN optimizations with PrefetchGlobalRead(PGR)=2 support

Changed#
  • Update Googletest to 1.11.0

Removed#
  • Remove no longer supported benchmarking steps


ROCm 5.0.2#

Defect fixes#

The following defects are fixed in the ROCm v5.0.2 release.

Issue with hostcall facility in HIP runtime#

In ROCm v5.0, when using the “assert()” call in a HIP kernel, the compiler may sometimes fail to emit kernel metadata related to the hostcall facility, which results in incomplete initialization of the hostcall facility in the HIP runtime. This can cause the HIP kernel to crash when it attempts to execute the “assert()” call.

The root cause was an incorrect check in the compiler to determine whether the hostcall facility is required by the kernel. This is fixed in the ROCm v5.0.2 release.

The resolution includes a compiler change, which emits the required metadata by default, unless the compiler can prove that the hostcall facility is not required by the kernel. This ensures that the “assert()” call never fails.
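
For context, the following is a hedged sketch (not from the release notes) of the kind of device-side assert() that depends on the hostcall facility; with the v5.0.2 fix, the metadata this pattern needs is emitted by default.

#include <cassert>
#include <hip/hip_runtime.h>

__global__ void check_positive(const int* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        assert(data[i] > 0 && "input values must be positive");  // uses the hostcall facility
    }
}

int main()
{
    int h[4] = {1, 2, 3, 4};
    int* d = nullptr;
    hipMalloc(&d, sizeof(h));
    hipMemcpy(d, h, sizeof(h), hipMemcpyHostToDevice);
    hipLaunchKernelGGL(check_positive, dim3(1), dim3(4), 0, 0, d, 4);
    hipDeviceSynchronize();
    hipFree(d);
    return 0;
}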

Note

This fix may lead to breakage in some OpenMP offload use cases, which use print inside a target region and result in an abort in device code. The issue will be fixed in a future release.

The compatibility matrix in the Deep-learning guide is updated for ROCm v5.0.2.

Library changes in ROCm 5.0.2#

hipBLAS: 0.49.0
hipCUB: 2.10.13
hipFFT: 1.0.4
hipSOLVER: 1.2.0
hipSPARSE: 2.0.0
MIVisionX: 2.0.1 ⇒ 2.1.0
rccl: 2.10.3
rocALUTION: 2.0.1
rocBLAS: 2.42.0
rocFFT: 1.0.13
rocPRIM: 2.10.12
rocRAND: 2.10.12
rocSOLVER: 3.16.0
rocSPARSE: 2.0.0
rocThrust: 2.13.0
Tensile: 4.31.0

MIVisionX 2.1.0#

MIVisionX 2.1.0 for ROCm 5.0.2

Added#
  • New Tests - AMD_MEDIA

Optimizations#
  • Readme Updates

  • HIP Buffer Transfer - Eliminate cupy usage

Changed#
  • Backend - Default Backend set to HIP

Fixed#
  • Minor bugs and warnings

  • AMD_MEDIA - Bug Fixes

Tested Configurations#
  • Windows 10

  • Linux distribution

    • Ubuntu - 18.04 / 20.04

    • CentOS - 7 / 8

    • SLES - 15-SP2

  • ROCm: rocm-dev - 4.5.2.40502-164

  • rocm-cmake - rocm-4.2.0

  • MIOpenGEMM - 1.1.5

  • MIOpen - 2.14.0

  • Protobuf - V3.12.0

  • OpenCV - 4.5.5

  • RPP - 0.92

  • FFMPEG - n4.0.4

  • Dependencies for all the above packages

  • MIVisionX Setup Script - V2.0.0

Known Issues#
  • TBD

MIVisionX Dependency Map#

Docker Image: sudo docker build -f docker/ubuntu20/{DOCKER_LEVEL_FILE_NAME}.dockerfile -t {mivisionx-level-NUMBER} .

  • [new] new component added to the level

  • [prev] existing component from the previous level

Build Level: Level_1
MIVisionX Dependencies: cmake, gcc, g++
Modules: amd_openvx, utilities
Libraries and Executables: [new] libopenvx.so (OpenVX™ Lib, CPU); [new] libvxu.so (OpenVX™ immediate node Lib, CPU); [new] runvx (OpenVX™ Graph Executor, CPU with Display OFF)

Build Level: Level_2
MIVisionX Dependencies: ROCm OpenCL, +Level 1
Modules: amd_openvx, amd_openvx_extensions, utilities
Libraries and Executables: [new] libopenvx.so (OpenVX™ Lib, CPU/GPU); [new] libvxu.so (OpenVX™ immediate node Lib, CPU/GPU); [new] libvx_loomsl.so (Loom 360 Stitch Lib); [new] loom_shell (360 Stitch App); [new] runcl (OpenCL™ program debug App); [new] runvx (OpenVX™ Graph Executor, Display OFF)

Build Level: Level_3
MIVisionX Dependencies: OpenCV, FFMPEG, +Level 2
Modules: amd_openvx, amd_openvx_extensions, utilities
Libraries and Executables: [prev] libopenvx.so (OpenVX™ Lib); [prev] libvxu.so (OpenVX™ immediate node Lib); [prev] libvx_loomsl.so (Loom 360 Stitch Lib); [prev] loom_shell (360 Stitch App); [prev] runcl (OpenCL™ program debug App); [new] libvx_amd_media.so (OpenVX™ Media Extension); [new] libvx_opencv.so (OpenVX™ OpenCV InterOp Extension); [new] mv_compile (Neural Net Model Compile); [new] runvx (OpenVX™ Graph Executor, Display ON)

Build Level: Level_4
MIVisionX Dependencies: MIOpenGEMM, MIOpen, ProtoBuf, +Level 3
Modules: amd_openvx, amd_openvx_extensions, apps, utilities
Libraries and Executables: [prev] libopenvx.so (OpenVX™ Lib); [prev] libvxu.so (OpenVX™ immediate node Lib); [prev] libvx_loomsl.so (Loom 360 Stitch Lib); [prev] loom_shell (360 Stitch App); [prev] libvx_amd_media.so (OpenVX™ Media Extension); [prev] libvx_opencv.so (OpenVX™ OpenCV InterOp Extension); [prev] mv_compile (Neural Net Model Compile); [prev] runcl (OpenCL™ program debug App); [prev] runvx (OpenVX™ Graph Executor, Display ON); [new] libvx_nn.so (OpenVX™ Neural Net Extension); [new] inference_server_app (Cloud Inference App)

Build Level: Level_5
MIVisionX Dependencies: AMD_RPP, rocAL deps, +Level 4
Modules: amd_openvx, amd_openvx_extensions, apps, rocAL, utilities
Libraries and Executables: [prev] libopenvx.so (OpenVX™ Lib); [prev] libvxu.so (OpenVX™ immediate node Lib); [prev] libvx_loomsl.so (Loom 360 Stitch Lib); [prev] loom_shell (360 Stitch App); [prev] libvx_amd_media.so (OpenVX™ Media Extension); [prev] libvx_opencv.so (OpenVX™ OpenCV InterOp Extension); [prev] mv_compile (Neural Net Model Compile); [prev] runcl (OpenCL™ program debug App); [prev] runvx (OpenVX™ Graph Executor, Display ON); [prev] libvx_nn.so (OpenVX™ Neural Net Extension); [prev] inference_server_app (Cloud Inference App); [new] libvx_rpp.so (OpenVX™ RPP Extension); [new] librali.so (Radeon Augmentation Library); [new] rali_pybind.so (rocAL Pybind Lib)


ROCm 5.0.1#

Deprecations and warnings#
Refactor of HIPCC/HIPCONFIG#

In prior ROCm releases, by default, the hipcc/hipconfig Perl scripts were used to identify and set the target compiler options, target platform, compiler, and runtime appropriately.

In ROCm v5.0.1, hipcc.bin and hipconfig.bin have been added as compiled binary implementations of hipcc and hipconfig. These new binaries are currently a work in progress and are marked as experimental. ROCm plans to fully transition to hipcc.bin and hipconfig.bin in a future ROCm release. The existing hipcc and hipconfig Perl scripts are renamed to hipcc.pl and hipconfig.pl respectively. New top-level hipcc and hipconfig Perl scripts are created, which can switch between the Perl script or the compiled binary based on the environment variable HIPCC_USE_PERL_SCRIPT.

In ROCm 5.0.1, by default, this environment variable is set so that hipcc and hipconfig use the Perl scripts.

The Perl scripts will subsequently be removed from ROCm in a future release.

Library changes in ROCm 5.0.1#

hipBLAS: 0.49.0
hipCUB: 2.10.13
hipFFT: 1.0.4
hipSOLVER: 1.2.0
hipSPARSE: 2.0.0
MIVisionX: 2.0.1
rccl: 2.10.3
rocALUTION: 2.0.1
rocBLAS: 2.42.0
rocFFT: 1.0.13
rocPRIM: 2.10.12
rocRAND: 2.10.12
rocSOLVER: 3.16.0
rocSPARSE: 2.0.0
rocThrust: 2.13.0
Tensile: 4.31.0


ROCm 5.0.0#

What’s new in this release#
HIP enhancements#

The ROCm v5.0 release consists of the following HIP enhancements.

HIP installation guide updates#

The HIP Installation Guide is updated to include building HIP from source on the NVIDIA platform.

Refer to the HIP Installation Guide v5.0 for more details.

Managed memory allocation#

Managed memory, including the __managed__ keyword, is now supported in the HIP combined host/device compilation. Through unified memory allocation, managed memory allows data to be shared and accessible to both the CPU and GPU using a single pointer. The allocation is managed by the AMD GPU driver using the Linux Heterogeneous Memory Management (HMM) mechanism. The user can call managed memory API hipMallocManaged to allocate a large chunk of HMM memory, execute kernels on a device, and fetch data between the host and device as needed.

Note

In a HIP application, it is recommended to do a capability check before calling the managed memory APIs. For example,

int managed_memory = 0;
// Query whether the device supports managed (HMM) memory allocations.
HIPCHECK(hipDeviceGetAttribute(&managed_memory,
  hipDeviceAttributeManagedMemory, p_gpuDevice));
if (!managed_memory) {
  printf("info: managed memory access not supported on the device %d\n Skipped\n", p_gpuDevice);
}
else {
  HIPCHECK(hipSetDevice(p_gpuDevice));
  // Allocate a buffer of N elements of type T that is accessible from both host and device.
  HIPCHECK(hipMallocManaged(&Hmm, N * sizeof(T)));
  . . .
}

Note

The managed memory capability check may not be necessary; however, if HMM is not supported, managed malloc will fall back to using system memory. Other managed memory API calls will, then, have

Refer to the HIP API documentation for more details on managed memory APIs.

For the example application, see the ROCm/HIP repository.

New environment variable#

The following new environment variable is added in this release:

Environment variable: HSA_COOP_CU_COUNT

Value: 0 or 1 (default is 0)

Description: Some processors support more CUs than can reliably be used in a cooperative dispatch. Setting the environment variable HSA_COOP_CU_COUNT to 1 will cause ROCr to return the correct CU count for cooperative groups through the HSA_AMD_AGENT_INFO_COOPERATIVE_COMPUTE_UNIT_COUNT attribute of hsa_agent_get_info(). Setting HSA_COOP_CU_COUNT to other values, or leaving it unset, will cause ROCr to return the same CU count for the attributes HSA_AMD_AGENT_INFO_COOPERATIVE_COMPUTE_UNIT_COUNT and HSA_AMD_AGENT_INFO_COMPUTE_UNIT_COUNT. Future ROCm releases will make HSA_COOP_CU_COUNT=1 the default.
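
The following hedged sketch (not part of the release notes) shows how the cooperative CU count can be queried through the ROCr attribute named above; it assumes the ROCr headers hsa/hsa.h and hsa/hsa_ext_amd.h are available and is meant to be run with HSA_COOP_CU_COUNT=1.

#include <cstdint>
#include <cstdio>
#include <hsa/hsa.h>
#include <hsa/hsa_ext_amd.h>

static hsa_status_t print_coop_cus(hsa_agent_t agent, void*)
{
    hsa_device_type_t type;
    hsa_agent_get_info(agent, HSA_AGENT_INFO_DEVICE, &type);
    if (type != HSA_DEVICE_TYPE_GPU)
        return HSA_STATUS_SUCCESS;

    uint32_t coop_cus = 0, all_cus = 0;
    hsa_agent_get_info(agent,
        (hsa_agent_info_t)HSA_AMD_AGENT_INFO_COOPERATIVE_COMPUTE_UNIT_COUNT, &coop_cus);
    hsa_agent_get_info(agent,
        (hsa_agent_info_t)HSA_AMD_AGENT_INFO_COMPUTE_UNIT_COUNT, &all_cus);
    printf("cooperative CUs: %u, total CUs: %u\n", coop_cus, all_cus);
    return HSA_STATUS_SUCCESS;
}

int main()
{
    hsa_init();
    hsa_iterate_agents(print_coop_cus, nullptr);  // print the counts for every GPU agent
    hsa_shut_down();
    return 0;
}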

Breaking changes#
Runtime breaking change#

Re-ordering of the enumerated type in hip_runtime_api.h to better match NVIDIA CUDA. See below for the difference in enumerated types.

ROCm software will be affected if any of the defined enums listed below are used in the code. Applications built with ROCm v5.0 enumerated types will work with a ROCm 4.5.2 driver. However, undefined behavior will occur when a ROCm v4.5.2 application that uses these enumerated types runs with a ROCm 5.0 runtime.
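
The full enumerated-type diff follows below. As a hedged sketch (not from the release notes), code such as the following, which refers to attributes by their symbolic names, only needs to be rebuilt against the ROCm 5.0 headers; the reordering matters for binaries that baked in the old numeric enum values.

#include <hip/hip_runtime.h>

int max_threads_per_block(int device)
{
    int value = 0;
    // The symbolic name maps to the correct numeric value of whichever headers this is built with.
    hipDeviceGetAttribute(&value, hipDeviceAttributeMaxThreadsPerBlock, device);
    return value;
}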

typedef enum hipDeviceAttribute_t {
-    hipDeviceAttributeMaxThreadsPerBlock,       ///< Maximum number of threads per block.
-    hipDeviceAttributeMaxBlockDimX,             ///< Maximum x-dimension of a block.
-    hipDeviceAttributeMaxBlockDimY,             ///< Maximum y-dimension of a block.
-    hipDeviceAttributeMaxBlockDimZ,             ///< Maximum z-dimension of a block.
-    hipDeviceAttributeMaxGridDimX,              ///< Maximum x-dimension of a grid.
-    hipDeviceAttributeMaxGridDimY,              ///< Maximum y-dimension of a grid.
-    hipDeviceAttributeMaxGridDimZ,              ///< Maximum z-dimension of a grid.
-    hipDeviceAttributeMaxSharedMemoryPerBlock,  ///< Maximum shared memory available per block in
-                                                ///< bytes.
-    hipDeviceAttributeTotalConstantMemory,      ///< Constant memory size in bytes.
-    hipDeviceAttributeWarpSize,                 ///< Warp size in threads.
-    hipDeviceAttributeMaxRegistersPerBlock,  ///< Maximum number of 32-bit registers available to a
-                                             ///< thread block. This number is shared by all thread
-                                             ///< blocks simultaneously resident on a
-                                             ///< multiprocessor.
-    hipDeviceAttributeClockRate,             ///< Peak clock frequency in kilohertz.
-    hipDeviceAttributeMemoryClockRate,       ///< Peak memory clock frequency in kilohertz.
-    hipDeviceAttributeMemoryBusWidth,        ///< Global memory bus width in bits.
-    hipDeviceAttributeMultiprocessorCount,   ///< Number of multiprocessors on the device.
-    hipDeviceAttributeComputeMode,           ///< Compute mode that device is currently in.
-    hipDeviceAttributeL2CacheSize,  ///< Size of L2 cache in bytes. 0 if the device doesn't have L2
-                                    ///< cache.
-    hipDeviceAttributeMaxThreadsPerMultiProcessor,  ///< Maximum resident threads per
-                                                    ///< multiprocessor.
-    hipDeviceAttributeComputeCapabilityMajor,       ///< Major compute capability version number.
-    hipDeviceAttributeComputeCapabilityMinor,       ///< Minor compute capability version number.
-    hipDeviceAttributeConcurrentKernels,  ///< Device can possibly execute multiple kernels
-                                          ///< concurrently.
-    hipDeviceAttributePciBusId,           ///< PCI Bus ID.
-    hipDeviceAttributePciDeviceId,        ///< PCI Device ID.
-    hipDeviceAttributeMaxSharedMemoryPerMultiprocessor,  ///< Maximum Shared Memory Per
-                                                         ///< Multiprocessor.
-    hipDeviceAttributeIsMultiGpuBoard,                   ///< Multiple GPU devices.
-    hipDeviceAttributeIntegrated,                        ///< iGPU
-    hipDeviceAttributeCooperativeLaunch,                 ///< Support cooperative launch
-    hipDeviceAttributeCooperativeMultiDeviceLaunch,      ///< Support cooperative launch on multiple devices
-    hipDeviceAttributeMaxTexture1DWidth,    ///< Maximum number of elements in 1D images
-    hipDeviceAttributeMaxTexture2DWidth,    ///< Maximum dimension width of 2D images in image elements
-    hipDeviceAttributeMaxTexture2DHeight,   ///< Maximum dimension height of 2D images in image elements
-    hipDeviceAttributeMaxTexture3DWidth,    ///< Maximum dimension width of 3D images in image elements
-    hipDeviceAttributeMaxTexture3DHeight,   ///< Maximum dimensions height of 3D images in image elements
-    hipDeviceAttributeMaxTexture3DDepth,    ///< Maximum dimensions depth of 3D images in image elements
+    hipDeviceAttributeCudaCompatibleBegin = 0,

-    hipDeviceAttributeHdpMemFlushCntl,      ///< Address of the HDP_MEM_COHERENCY_FLUSH_CNTL register
-    hipDeviceAttributeHdpRegFlushCntl,      ///< Address of the HDP_REG_COHERENCY_FLUSH_CNTL register
+    hipDeviceAttributeEccEnabled = hipDeviceAttributeCudaCompatibleBegin, ///< Whether ECC support is enabled.
+    hipDeviceAttributeAccessPolicyMaxWindowSize,        ///< Cuda only. The maximum size of the window policy in bytes.
+    hipDeviceAttributeAsyncEngineCount,                 ///< Cuda only. Asynchronous engines number.
+    hipDeviceAttributeCanMapHostMemory,                 ///< Whether host memory can be mapped into device address space
+    hipDeviceAttributeCanUseHostPointerForRegisteredMem,///< Cuda only. Device can access host registered memory
+                                                        ///< at the same virtual address as the CPU
+    hipDeviceAttributeClockRate,                        ///< Peak clock frequency in kilohertz.
+    hipDeviceAttributeComputeMode,                      ///< Compute mode that device is currently in.
+    hipDeviceAttributeComputePreemptionSupported,       ///< Cuda only. Device supports Compute Preemption.
+    hipDeviceAttributeConcurrentKernels,                ///< Device can possibly execute multiple kernels concurrently.
+    hipDeviceAttributeConcurrentManagedAccess,          ///< Device can coherently access managed memory concurrently with the CPU
+    hipDeviceAttributeCooperativeLaunch,                ///< Support cooperative launch
+    hipDeviceAttributeCooperativeMultiDeviceLaunch,     ///< Support cooperative launch on multiple devices
+    hipDeviceAttributeDeviceOverlap,                    ///< Cuda only. Device can concurrently copy memory and execute a kernel.
+                                                        ///< Deprecated. Use instead asyncEngineCount.
+    hipDeviceAttributeDirectManagedMemAccessFromHost,   ///< Host can directly access managed memory on
+                                                        ///< the device without migration
+    hipDeviceAttributeGlobalL1CacheSupported,           ///< Cuda only. Device supports caching globals in L1
+    hipDeviceAttributeHostNativeAtomicSupported,        ///< Cuda only. Link between the device and the host supports native atomic operations
+    hipDeviceAttributeIntegrated,                       ///< Device is integrated GPU
+    hipDeviceAttributeIsMultiGpuBoard,                  ///< Multiple GPU devices.
+    hipDeviceAttributeKernelExecTimeout,                ///< Run time limit for kernels executed on the device
+    hipDeviceAttributeL2CacheSize,                      ///< Size of L2 cache in bytes. 0 if the device doesn't have L2 cache.
+    hipDeviceAttributeLocalL1CacheSupported,            ///< caching locals in L1 is supported
+    hipDeviceAttributeLuid,                             ///< Cuda only. 8-byte locally unique identifier in 8 bytes. Undefined on TCC and non-Windows platforms
+    hipDeviceAttributeLuidDeviceNodeMask,               ///< Cuda only. Luid device node mask. Undefined on TCC and non-Windows platforms
+    hipDeviceAttributeComputeCapabilityMajor,           ///< Major compute capability version number.
+    hipDeviceAttributeManagedMemory,                    ///< Device supports allocating managed memory on this system
+    hipDeviceAttributeMaxBlocksPerMultiProcessor,       ///< Cuda only. Max block size per multiprocessor
+    hipDeviceAttributeMaxBlockDimX,                     ///< Max block size in width.
+    hipDeviceAttributeMaxBlockDimY,                     ///< Max block size in height.
+    hipDeviceAttributeMaxBlockDimZ,                     ///< Max block size in depth.
+    hipDeviceAttributeMaxGridDimX,                      ///< Max grid size  in width.
+    hipDeviceAttributeMaxGridDimY,                      ///< Max grid size  in height.
+    hipDeviceAttributeMaxGridDimZ,                      ///< Max grid size  in depth.
+    hipDeviceAttributeMaxSurface1D,                     ///< Maximum size of 1D surface.
+    hipDeviceAttributeMaxSurface1DLayered,              ///< Cuda only. Maximum dimensions of 1D layered surface.
+    hipDeviceAttributeMaxSurface2D,                     ///< Maximum dimension (width, height) of 2D surface.
+    hipDeviceAttributeMaxSurface2DLayered,              ///< Cuda only. Maximum dimensions of 2D layered surface.
+    hipDeviceAttributeMaxSurface3D,                     ///< Maximum dimension (width, height, depth) of 3D surface.
+    hipDeviceAttributeMaxSurfaceCubemap,                ///< Cuda only. Maximum dimensions of Cubemap surface.
+    hipDeviceAttributeMaxSurfaceCubemapLayered,         ///< Cuda only. Maximum dimension of Cubemap layered surface.
+    hipDeviceAttributeMaxTexture1DWidth,                ///< Maximum size of 1D texture.
+    hipDeviceAttributeMaxTexture1DLayered,              ///< Cuda only. Maximum dimensions of 1D layered texture.
+    hipDeviceAttributeMaxTexture1DLinear,               ///< Maximum number of elements allocatable in a 1D linear texture.
+                                                        ///< Use cudaDeviceGetTexture1DLinearMaxWidth() instead on Cuda.
+    hipDeviceAttributeMaxTexture1DMipmap,               ///< Cuda only. Maximum size of 1D mipmapped texture.
+    hipDeviceAttributeMaxTexture2DWidth,                ///< Maximum dimension width of 2D texture.
+    hipDeviceAttributeMaxTexture2DHeight,               ///< Maximum dimension hight of 2D texture.
+    hipDeviceAttributeMaxTexture2DGather,               ///< Cuda only. Maximum dimensions of 2D texture if gather operations  performed.
+    hipDeviceAttributeMaxTexture2DLayered,              ///< Cuda only. Maximum dimensions of 2D layered texture.
+    hipDeviceAttributeMaxTexture2DLinear,               ///< Cuda only. Maximum dimensions (width, height, pitch) of 2D textures bound to pitched memory.
+    hipDeviceAttributeMaxTexture2DMipmap,               ///< Cuda only. Maximum dimensions of 2D mipmapped texture.
+    hipDeviceAttributeMaxTexture3DWidth,                ///< Maximum dimension width of 3D texture.
+    hipDeviceAttributeMaxTexture3DHeight,               ///< Maximum dimension height of 3D texture.
+    hipDeviceAttributeMaxTexture3DDepth,                ///< Maximum dimension depth of 3D texture.
+    hipDeviceAttributeMaxTexture3DAlt,                  ///< Cuda only. Maximum dimensions of alternate 3D texture.
+    hipDeviceAttributeMaxTextureCubemap,                ///< Cuda only. Maximum dimensions of Cubemap texture
+    hipDeviceAttributeMaxTextureCubemapLayered,         ///< Cuda only. Maximum dimensions of Cubemap layered texture.
+    hipDeviceAttributeMaxThreadsDim,                    ///< Maximum dimension of a block
+    hipDeviceAttributeMaxThreadsPerBlock,               ///< Maximum number of threads per block.
+    hipDeviceAttributeMaxThreadsPerMultiProcessor,      ///< Maximum resident threads per multiprocessor.
+    hipDeviceAttributeMaxPitch,                         ///< Maximum pitch in bytes allowed by memory copies
+    hipDeviceAttributeMemoryBusWidth,                   ///< Global memory bus width in bits.
+    hipDeviceAttributeMemoryClockRate,                  ///< Peak memory clock frequency in kilohertz.
+    hipDeviceAttributeComputeCapabilityMinor,           ///< Minor compute capability version number.
+    hipDeviceAttributeMultiGpuBoardGroupID,             ///< Cuda only. Unique ID of device group on the same multi-GPU board
+    hipDeviceAttributeMultiprocessorCount,              ///< Number of multiprocessors on the device.
+    hipDeviceAttributeName,                             ///< Device name.
+    hipDeviceAttributePageableMemoryAccess,             ///< Device supports coherently accessing pageable memory
+                                                        ///< without calling hipHostRegister on it
+    hipDeviceAttributePageableMemoryAccessUsesHostPageTables, ///< Device accesses pageable memory via the host's page tables
+    hipDeviceAttributePciBusId,                         ///< PCI Bus ID.
+    hipDeviceAttributePciDeviceId,                      ///< PCI Device ID.
+    hipDeviceAttributePciDomainID,                      ///< PCI Domain ID.
+    hipDeviceAttributePersistingL2CacheMaxSize,         ///< Cuda11 only. Maximum l2 persisting lines capacity in bytes
+    hipDeviceAttributeMaxRegistersPerBlock,             ///< 32-bit registers available to a thread block. This number is shared
+                                                        ///< by all thread blocks simultaneously resident on a multiprocessor.
+    hipDeviceAttributeMaxRegistersPerMultiprocessor,    ///< 32-bit registers available per block.
+    hipDeviceAttributeReservedSharedMemPerBlock,        ///< Cuda11 only. Shared memory reserved by CUDA driver per block.
+    hipDeviceAttributeMaxSharedMemoryPerBlock,          ///< Maximum shared memory available per block in bytes.
+    hipDeviceAttributeSharedMemPerBlockOptin,           ///< Cuda only. Maximum shared memory per block usable by special opt in.
+    hipDeviceAttributeSharedMemPerMultiprocessor,       ///< Cuda only. Shared memory available per multiprocessor.
+    hipDeviceAttributeSingleToDoublePrecisionPerfRatio, ///< Cuda only. Performance ratio of single precision to double precision.
+    hipDeviceAttributeStreamPrioritiesSupported,        ///< Cuda only. Whether to support stream priorities.
+    hipDeviceAttributeSurfaceAlignment,                 ///< Cuda only. Alignment requirement for surfaces
+    hipDeviceAttributeTccDriver,                        ///< Cuda only. Whether device is a Tesla device using TCC driver
+    hipDeviceAttributeTextureAlignment,                 ///< Alignment requirement for textures
+    hipDeviceAttributeTexturePitchAlignment,            ///< Pitch alignment requirement for 2D texture references bound to pitched memory;
+    hipDeviceAttributeTotalConstantMemory,              ///< Constant memory size in bytes.
+    hipDeviceAttributeTotalGlobalMem,                   ///< Global memory available on devicice.
+    hipDeviceAttributeUnifiedAddressing,                ///< Cuda only. An unified address space shared with the host.
+    hipDeviceAttributeUuid,                             ///< Cuda only. Unique ID in 16 byte.
+    hipDeviceAttributeWarpSize,                         ///< Warp size in threads.

-    hipDeviceAttributeMaxPitch,             ///< Maximum pitch in bytes allowed by memory copies
-    hipDeviceAttributeTextureAlignment,     ///<Alignment requirement for textures
-    hipDeviceAttributeTexturePitchAlignment, ///<Pitch alignment requirement for 2D texture references bound to pitched memory;
-    hipDeviceAttributeKernelExecTimeout,    ///<Run time limit for kernels executed on the device
-    hipDeviceAttributeCanMapHostMemory,     ///<Device can map host memory into device address space
-    hipDeviceAttributeEccEnabled,           ///<Device has ECC support enabled
+    hipDeviceAttributeCudaCompatibleEnd = 9999,
+    hipDeviceAttributeAmdSpecificBegin = 10000,

-    hipDeviceAttributeCooperativeMultiDeviceUnmatchedFunc,        ///< Supports cooperative launch on multiple
-                                                                  ///devices with unmatched functions
-    hipDeviceAttributeCooperativeMultiDeviceUnmatchedGridDim,     ///< Supports cooperative launch on multiple
-                                                                  ///devices with unmatched grid dimensions
-    hipDeviceAttributeCooperativeMultiDeviceUnmatchedBlockDim,    ///< Supports cooperative launch on multiple
-                                                                  ///devices with unmatched block dimensions
-    hipDeviceAttributeCooperativeMultiDeviceUnmatchedSharedMem,   ///< Supports cooperative launch on multiple
-                                                                  ///devices with unmatched shared memories
-    hipDeviceAttributeAsicRevision,         ///< Revision of the GPU in this device
-    hipDeviceAttributeManagedMemory,        ///< Device supports allocating managed memory on this system
-    hipDeviceAttributeDirectManagedMemAccessFromHost, ///< Host can directly access managed memory on
-                                                      /// the device without migration
-    hipDeviceAttributeConcurrentManagedAccess,  ///< Device can coherently access managed memory
-                                                /// concurrently with the CPU
-    hipDeviceAttributePageableMemoryAccess,     ///< Device supports coherently accessing pageable memory
-                                                /// without calling hipHostRegister on it
-    hipDeviceAttributePageableMemoryAccessUsesHostPageTables, ///< Device accesses pageable memory via
-                                                              /// the host's page tables
-    hipDeviceAttributeCanUseStreamWaitValue ///< '1' if Device supports hipStreamWaitValue32() and
-                                            ///< hipStreamWaitValue64() , '0' otherwise.
+    hipDeviceAttributeClockInstructionRate = hipDeviceAttributeAmdSpecificBegin,  ///< Frequency in khz of the timer used by the device-side "clock*"
+    hipDeviceAttributeArch,                                     ///< Device architecture
+    hipDeviceAttributeMaxSharedMemoryPerMultiprocessor,         ///< Maximum Shared Memory PerMultiprocessor.
+    hipDeviceAttributeGcnArch,                                  ///< Device gcn architecture
+    hipDeviceAttributeGcnArchName,                              ///< Device gcnArch name in 256 bytes
+    hipDeviceAttributeHdpMemFlushCntl,                          ///< Address of the HDP_MEM_COHERENCY_FLUSH_CNTL register
+    hipDeviceAttributeHdpRegFlushCntl,                          ///< Address of the HDP_REG_COHERENCY_FLUSH_CNTL register
+    hipDeviceAttributeCooperativeMultiDeviceUnmatchedFunc,      ///< Supports cooperative launch on multiple
+                                                                ///< devices with unmatched functions
+    hipDeviceAttributeCooperativeMultiDeviceUnmatchedGridDim,   ///< Supports cooperative launch on multiple
+                                                                ///< devices with unmatched grid dimensions
+    hipDeviceAttributeCooperativeMultiDeviceUnmatchedBlockDim,  ///< Supports cooperative launch on multiple
+                                                                ///< devices with unmatched block dimensions
+    hipDeviceAttributeCooperativeMultiDeviceUnmatchedSharedMem, ///< Supports cooperative launch on multiple
+                                                                ///< devices with unmatched shared memories
+    hipDeviceAttributeIsLargeBar,                               ///< Whether it is LargeBar
+    hipDeviceAttributeAsicRevision,                             ///< Revision of the GPU in this device
+    hipDeviceAttributeCanUseStreamWaitValue,                    ///< '1' if Device supports hipStreamWaitValue32() and
+                                                                ///< hipStreamWaitValue64() , '0' otherwise.

+    hipDeviceAttributeAmdSpecificEnd = 19999,
+    hipDeviceAttributeVendorSpecificBegin = 20000,
+    // Extended attributes for vendors
 } hipDeviceAttribute_t;

 enum hipComputeMode {
Known issues#
Incorrect dGPU behavior when using AMDVBFlash tool#

The AMDVBFlash tool, used for flashing the VBIOS image to a dGPU, does not communicate with the ROM controller when the driver is present. This is because the driver, as part of its runtime power management feature, puts the dGPU into a sleep state.

As a workaround, users can set the kernel module parameter amdgpu.runpm=0, which temporarily disables the runtime power management feature from the driver and dynamically changes some power control-related sysfs files.

Issue with START timestamp in ROCProfiler#

Users may encounter an issue with the enabled timestamp functionality for monitoring one or multiple counters. ROCProfiler outputs the following four timestamps for each kernel:

  • Dispatch

  • Start

  • End

  • Complete

Issue#

This defect is related to the Start timestamp functionality, which incorrectly shows an earlier time than the Dispatch timestamp.

To reproduce the issue,

  1. Enable timing using the --timestamp on flag.

  2. Use the -i option with the input filename that contains the name of the counter(s) to monitor.

  3. Run the program.

  4. Check the output result file.

Current behavior#

BeginNS is lower than DispatchNS, which is incorrect.

Expected behavior#

The correct order is:

Dispatch < Start < End < Complete

Users cannot use ROCProfiler to measure the time spent on each kernel because of the incorrect timestamp with counter collection enabled.

Radeon Pro V620 and W6800 workstation GPUs#
No support for SMI and ROCDebugger on SRIOV#

System Management Interface (SMI) and ROCDebugger are not supported in the SRIOV environment on any GPU. For more information, refer to the Systems Management Interface documentation.

Deprecations and warnings#
ROCm libraries changes – deprecations and deprecation removal#
  • The hipFFT.h header is now provided only by the hipFFT package. Up to ROCm 5.0, users would get hipFFT.h in the rocFFT package too.

  • The GlobalPairwiseAMG class is now entirely removed; use the PairwiseAMG class instead.

  • The rocsparse_spmm signature in 5.0 was changed to match that of rocsparse_spmm_ex. In 5.0, rocsparse_spmm_ex is still present but deprecated. Signature diff for rocsparse_spmm:

    rocsparse_spmm in 5.0

    rocsparse_status rocsparse_spmm(rocsparse_handle            handle,
                                    rocsparse_operation         trans_A,
                                    rocsparse_operation         trans_B,
                                    const void*                 alpha,
                                    const rocsparse_spmat_descr mat_A,
                                    const rocsparse_dnmat_descr mat_B,
                                    const void*                 beta,
                                    const rocsparse_dnmat_descr mat_C,
                                    rocsparse_datatype          compute_type,
                                    rocsparse_spmm_alg          alg,
                                    rocsparse_spmm_stage        stage,
                                    size_t*                     buffer_size,
                                    void*                       temp_buffer);
    

    rocsparse_spmm in 4.0

    rocsparse_status rocsparse_spmm(rocsparse_handle            handle,
                                    rocsparse_operation         trans_A,
                                    rocsparse_operation         trans_B,
                                    const void*                 alpha,
                                    const rocsparse_spmat_descr mat_A,
                                    const rocsparse_dnmat_descr mat_B,
                                    const void*                 beta,
                                    const rocsparse_dnmat_descr mat_C,
                                    rocsparse_datatype          compute_type,
                                    rocsparse_spmm_alg          alg,
                                    size_t*                     buffer_size,
                                    void*                       temp_buffer);
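
The following hedged sketch (not from the release notes) illustrates how the added stage and buffer_size parameters are typically used: one call queries the workspace size and a second call runs the computation. It assumes the handle and the mat_A, mat_B and mat_C descriptors have already been created.

#include <hip/hip_runtime.h>
#include <rocsparse.h>   // <rocsparse/rocsparse.h> in newer ROCm releases

rocsparse_status spmm_staged(rocsparse_handle            handle,
                             const rocsparse_spmat_descr mat_A,
                             const rocsparse_dnmat_descr mat_B,
                             const rocsparse_dnmat_descr mat_C)
{
    const float alpha = 1.0f, beta = 0.0f;
    size_t buffer_size = 0;
    void*  temp_buffer = nullptr;

    // Stage 1: query the required temporary buffer size.
    rocsparse_status status = rocsparse_spmm(handle,
        rocsparse_operation_none, rocsparse_operation_none,
        &alpha, mat_A, mat_B, &beta, mat_C,
        rocsparse_datatype_f32_r, rocsparse_spmm_alg_default,
        rocsparse_spmm_stage_buffer_size, &buffer_size, nullptr);
    if (status != rocsparse_status_success) return status;

    hipMalloc(&temp_buffer, buffer_size);

    // Stage 2: run the actual SpMM using the allocated workspace.
    status = rocsparse_spmm(handle,
        rocsparse_operation_none, rocsparse_operation_none,
        &alpha, mat_A, mat_B, &beta, mat_C,
        rocsparse_datatype_f32_r, rocsparse_spmm_alg_default,
        rocsparse_spmm_stage_compute, &buffer_size, temp_buffer);

    hipFree(temp_buffer);
    return status;
}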
    
HIP API deprecations and warnings#
Warning - arithmetic operators of HIP complex and vector types#

In this release, arithmetic operators of HIP complex and vector types are deprecated.

  • As alternatives to arithmetic operators of HIP complex types, users can use arithmetic operators of std::complex types.

  • As alternatives to arithmetic operators of HIP vector types, users can use the operators of the native clang vector type associated with the data member of HIP vector types.

During the deprecation, two macros _HIP_ENABLE_COMPLEX_OPERATORS and _HIP_ENABLE_VECTOR_OPERATORS are provided to allow users to conditionally enable arithmetic operators of HIP complex or vector types.

Note, the two macros are mutually exclusive and, by default, set to Off.

The arithmetic operators of HIP complex and vector types will be removed in a future release.
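
As a hedged illustration of the alternatives mentioned above (not from the release notes), a deprecated complex operator expression can be rewritten with the helper functions from hip/hip_complex.h (or std::complex on the host):

#include <hip/hip_runtime.h>
#include <hip/hip_complex.h>

__host__ __device__ hipFloatComplex add_scaled(hipFloatComplex a, hipFloatComplex b,
                                               float scale)
{
    hipFloatComplex s = make_hipFloatComplex(scale, 0.0f);
    return hipCaddf(a, hipCmulf(s, b));   // a + scale * b without the deprecated operators
}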

Refer to the HIP API Guide for more information.

Warning - compiler-generated code object version 4 deprecation#

Support for loading compiler-generated code object version 4 will be deprecated in a future release with no release announcement and replaced with code object 5 as the default version.

The current default is code object version 4.

Warning - MIOpenTensile deprecation#

MIOpenTensile will be deprecated in a future release.

Library changes in ROCm 5.0.0#

hipBLAS: 0.49.0
hipCUB: 2.10.13
hipFFT: 1.0.4
hipSOLVER: 1.2.0
hipSPARSE: 2.0.0
MIVisionX: 2.0.1
rccl: 2.10.3
rocALUTION: 2.0.1
rocBLAS: 2.42.0
rocFFT: 1.0.13
rocPRIM: 2.10.12
rocRAND: 2.10.12
rocSOLVER: 3.16.0
rocSPARSE: 2.0.0
rocThrust: 2.13.0
Tensile: 4.31.0

hipBLAS 0.49.0#

hipBLAS 0.49.0 for ROCm 5.0.0

Added#
  • Added rocSOLVER functions to hipblas-bench

  • Added option ROCM_MATHLIBS_API_USE_HIP_COMPLEX to opt-in to use hipFloatComplex and hipDoubleComplex

  • Added compilation warning for future trmm changes

  • Added documentation to hipblas.h

  • Added option to forgo pivoting for getrf and getri when ipiv is nullptr

  • Added code coverage option

Fixed#
  • Fixed use of incorrect ‘HIP_PATH’ when building from source.

  • Fixed windows packaging

  • Allowing negative increments in hipblas-bench

  • Removed boost dependency

hipCUB 2.10.13#

hipCUB 2.10.13 for ROCm 5.0.0

Fixed#
  • Added missing includes to hipcub.hpp

Added#
  • Bfloat16 support to test cases (device_reduce & device_radix_sort)

  • Device merge sort (see the sketch after this list)

  • Block merge sort

  • API update to CUB 1.14.0
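
The device merge sort sketch referenced above is shown here; it is a hedged example (not from the release notes) using the usual two-phase hipCUB pattern, and it assumes d_keys points to num_items device-resident floats.

#include <hipcub/hipcub.hpp>

struct LessOp
{
    __host__ __device__ bool operator()(float a, float b) const { return a < b; }
};

hipError_t sort_keys(float* d_keys, int num_items)
{
    void*  d_temp_storage = nullptr;
    size_t temp_storage_bytes = 0;

    // First call with a null workspace only reports the required temporary storage size.
    hipcub::DeviceMergeSort::SortKeys(d_temp_storage, temp_storage_bytes,
                                      d_keys, num_items, LessOp{});
    hipMalloc(&d_temp_storage, temp_storage_bytes);

    hipError_t err = hipcub::DeviceMergeSort::SortKeys(d_temp_storage, temp_storage_bytes,
                                                       d_keys, num_items, LessOp{});
    hipFree(d_temp_storage);
    return err;
}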

Changed#
  • The SetupNVCC.cmake automatic target selector selects all of the capabilities of all available cards for the NVIDIA backend.

hipFFT 1.0.4#

hipFFT 1.0.4 for ROCm 5.0.0

Fixed#
  • Add calls to rocFFT setup/cleanup.

  • Cmake fixes for clients and backend support.

Added#
  • Added support for Windows 10 as a build target.

hipSOLVER 1.2.0#

hipSOLVER 1.2.0 for ROCm 5.0.0

Added#
  • Added functions

    • sytrf

      • hipsolverSsytrf_bufferSize, hipsolverDsytrf_bufferSize, hipsolverCsytrf_bufferSize, hipsolverZsytrf_bufferSize

      • hipsolverSsytrf, hipsolverDsytrf, hipsolverCsytrf, hipsolverZsytrf

Fixed#
  • Fixed use of incorrect HIP_PATH when building from source (#40). Thanks @jakub329homola!

hipSPARSE 2.0.0#

hipSPARSE 2.0.0 for ROCm 5.0.0

Added#
  • Added (conjugate) transpose support for csrmv, hybmv and spmv routines

MIVisionX 2.0.1#

MIVisionX 2.0.1 for ROCm 5.0.0

Added#
  • Support for cmake 3.22.X

  • Support for OpenCV 4.X.X

  • Support for mv_compile with the HIP GPU backend

  • Support for tensor_compare node (less/greater/less_than/greater_than/equal onnx operators)

Optimizations#
  • Code Cleanup

  • Readme Updates

Changed#
  • License Updates

Fixed#
  • Minor bugs and warnings

  • Inference server application - OpenCL Backend

  • vxCreateThreshold Fix - Apps & Sample

Tested Configurations#
  • Windows 10

  • Linux distribution

    • Ubuntu - 18.04 / 20.04

    • CentOS - 7 / 8

    • SLES - 15-SP2

  • ROCm: rocm-dev - 4.5.2.40502-164

  • rocm-cmake - rocm-4.2.0

  • MIOpenGEMM - 1.1.5

  • MIOpen - 2.14.0

  • Protobuf - V3.12.0

  • OpenCV - 3.4.0

  • RPP - 0.92

  • FFMPEG - n4.0.4

  • Dependencies for all the above packages

  • MIVisionX Setup Script - V2.0.0

Known Issues#
  • Package install requires OpenCV v3.4.X to execute AMD OpenCV extensions

MIVisionX Dependency Map#

Docker Image: docker pull kiritigowda/ubuntu-18.04:{TAGNAME}

  • [new] new component added to the level

  • [prev] existing component from the previous level

Build Level: Level_1
MIVisionX Dependencies: cmake, gcc, g++
Modules: amd_openvx, utilities
Libraries and Executables: [new] libopenvx.so (OpenVX™ Lib, CPU); [new] libvxu.so (OpenVX™ immediate node Lib, CPU); [new] runvx (OpenVX™ Graph Executor, CPU with Display OFF)

Build Level: Level_2
MIVisionX Dependencies: ROCm OpenCL, +Level 1
Modules: amd_openvx, amd_openvx_extensions, utilities
Libraries and Executables: [new] libopenvx.so (OpenVX™ Lib, CPU/GPU); [new] libvxu.so (OpenVX™ immediate node Lib, CPU/GPU); [new] libvx_loomsl.so (Loom 360 Stitch Lib); [new] loom_shell (360 Stitch App); [new] runcl (OpenCL™ program debug App); [new] runvx (OpenVX™ Graph Executor, Display OFF)

Build Level: Level_3
MIVisionX Dependencies: OpenCV, FFMPEG, +Level 2
Modules: amd_openvx, amd_openvx_extensions, utilities
Libraries and Executables: [prev] libopenvx.so (OpenVX™ Lib); [prev] libvxu.so (OpenVX™ immediate node Lib); [prev] libvx_loomsl.so (Loom 360 Stitch Lib); [prev] loom_shell (360 Stitch App); [prev] runcl (OpenCL™ program debug App); [new] libvx_amd_media.so (OpenVX™ Media Extension); [new] libvx_opencv.so (OpenVX™ OpenCV InterOp Extension); [new] mv_compile (Neural Net Model Compile); [new] runvx (OpenVX™ Graph Executor, Display ON)

Build Level: Level_4
MIVisionX Dependencies: MIOpenGEMM, MIOpen, ProtoBuf, +Level 3
Modules: amd_openvx, amd_openvx_extensions, apps, utilities
Libraries and Executables: [prev] libopenvx.so (OpenVX™ Lib); [prev] libvxu.so (OpenVX™ immediate node Lib); [prev] libvx_loomsl.so (Loom 360 Stitch Lib); [prev] loom_shell (360 Stitch App); [prev] libvx_amd_media.so (OpenVX™ Media Extension); [prev] libvx_opencv.so (OpenVX™ OpenCV InterOp Extension); [prev] mv_compile (Neural Net Model Compile); [prev] runcl (OpenCL™ program debug App); [prev] runvx (OpenVX™ Graph Executor, Display ON); [new] libvx_nn.so (OpenVX™ Neural Net Extension); [new] inference_server_app (Cloud Inference App)

Build Level: Level_5
MIVisionX Dependencies: AMD_RPP, rocAL deps, +Level 4
Modules: amd_openvx, amd_openvx_extensions, apps, rocAL, utilities
Libraries and Executables: [prev] libopenvx.so (OpenVX™ Lib); [prev] libvxu.so (OpenVX™ immediate node Lib); [prev] libvx_loomsl.so (Loom 360 Stitch Lib); [prev] loom_shell (360 Stitch App); [prev] libvx_amd_media.so (OpenVX™ Media Extension); [prev] libvx_opencv.so (OpenVX™ OpenCV InterOp Extension); [prev] mv_compile (Neural Net Model Compile); [prev] runcl (OpenCL™ program debug App); [prev] runvx (OpenVX™ Graph Executor, Display ON); [prev] libvx_nn.so (OpenVX™ Neural Net Extension); [prev] inference_server_app (Cloud Inference App); [new] libvx_rpp.so (OpenVX™ RPP Extension); [new] librali.so (Radeon Augmentation Library); [new] rali_pybind.so (rocAL Pybind Lib)

rccl 2.10.3#

RCCL 2.10.3 for ROCm 5.0.0

Added#
  • Compatibility with NCCL 2.10.3

Known Issues#
  • Managed memory is not currently supported for clique-based kernels

rocALUTION 2.0.1#

rocALUTION 2.0.1 for ROCm 5.0.0

Changed#
  • Removed the deprecated GlobalPairwiseAMG class; use PairwiseAMG instead.

  • Changed to C++ 14 Standard

Improved#
  • Added sanitizer option

  • Improved documentation

rocBLAS 2.42.0#

rocBLAS 2.42.0 for ROCm 5.0.0

Added#
  • Added rocblas_get_version_string_size convenience function

  • Added rocblas_xtrmm_outofplace, an out-of-place version of rocblas_xtrmm

  • Added hpl and trig initialization for gemm_ex to rocblas-bench

  • Added source code gemm. It can be used as an alternative to Tensile for debugging and development

  • Added option ROCM_MATHLIBS_API_USE_HIP_COMPLEX to opt-in to use hipFloatComplex and hipDoubleComplex

Optimizations#
  • Improved performance of non-batched and batched single-precision GER for size m > 1024. Performance enhanced by 5-10% measured on a MI100 (gfx908) GPU.

  • Improved performance of non-batched and batched HER for all sizes and data types. Performance enhanced by 2-17% measured on a MI100 (gfx908) GPU.

Changed#
  • Instantiate templated rocBLAS functions to reduce size of librocblas.so

  • Removed static library dependency on msgpack

  • Removed boost dependencies for clients

Fixed#
  • Option to install script to build only rocBLAS clients with a pre-built rocBLAS library

  • Correctly set output of nrm2_batched_ex and nrm2_strided_batched_ex when given bad input

  • Fix for dgmm with side == rocblas_side_left and a negative incx

  • Fixed out-of-bounds read for small trsm

  • Fixed numerical checking for tbmv_strided_batched

rocFFT 1.0.13#

rocFFT 1.0.13 for ROCm 5.0.0

Optimizations#
  • Improved many plans by removing unnecessary transpose steps.

  • Optimized scheme selection for 3D problems.

    • Imposed fewer restrictions on 3D_BLOCK_RC selection. More problems can use 3D_BLOCK_RC and have some performance gain.

    • Enabled 3D_RC. Some 3D problems with an SBCC-supported z-dim can use fewer kernels and benefit.

    • Forced --length 336 336 56 (double precision) to use the faster 3D_RC to avoid it being skipped by the conservative threshold test.

  • Optimized some even-length R2C/C2R cases by doing more operations in-place and combining pre/post processing into Stockham kernels.

  • Added radix-17.

Added#
  • Added new kernel generator for select fused-2D transforms.

Fixed#
  • Improved large 1D transform decompositions.

rocPRIM 2.10.12#

rocPRIM 2.10.12 for ROCm 5.0.0

Fixed#
  • Enable bfloat16 tests and reduce threshold for bfloat16

  • Fix device scan limit_size feature

  • Non-optimized builds no longer trigger local memory limit errors

Added#
  • Added scan size limit feature

  • Added reduce size limit feature

  • Added transform size limit feature

  • Add block_load_striped and block_store_striped

  • Add gather_to_blocked to gather values from other threads into a blocked arrangement

  • The block sizes for the device merge sort's initial block sort and its merge steps are now separate in its kernel config

    • The block sort step supports multiple items per thread

Changed#
  • size_limit for scan, reduce and transform can now be set in the config struct instead of a parameter

  • Device_scan and device_segmented_scan: inclusive_scan now uses the input-type as accumulator-type, exclusive_scan uses initial-value-type.

    • This particularly changes behaviour of small-size input types with large-size output types (e.g. short input, int output).

    • And low-res input with high-res output (e.g. float input, double output)

  • Reverted the old Fiji workaround, because the issue was solved on the compiler side

  • Update README cmake minimum version number

  • Block sort supports multiple items per thread

    • Currently, only power-of-two block sizes and items per thread are supported, and only for full blocks

  • Bumped the minimum required version of CMake to 3.16

Known Issues#
  • Unit tests may soft hang on MI200 when running in hipMallocManaged mode.

  • device_segmented_radix_sort, device_scan unit tests failing for HIP on Windows

  • ReduceEmptyInput causes random failures with bfloat16

rocRAND 2.10.12#

rocRAND 2.10.12 for ROCm 5.0.0

Changed#
  • No updates or changes for ROCm 5.0.0.

rocSOLVER 3.16.0#

rocSOLVER 3.16.0 for ROCm 5.0.0

Added#
  • Symmetric matrix factorizations:

    • LASYF

    • SYTF2, SYTRF (with batched and strided_batched versions)

  • Added rocsolver_get_version_string_size to help with version string queries

  • Added rocblas_layer_mode_ex and the ability to print kernel calls in the trace and profile logs

  • Expanded batched and strided_batched sample programs.

Optimized#
  • Improved general performance of LU factorization

  • Increased parallelism of specialized kernels when compiling from source, reducing build times on multi-core systems.

Changed#
  • The rocsolver-test client now prints the rocSOLVER version used to run the tests, rather than the version used to build them

  • The rocsolver-bench client now prints the rocSOLVER version used in the benchmark

Fixed#
  • Added missing stdint.h include to rocsolver.h

rocSPARSE 2.0.0#

rocSPARSE 2.0.0 for ROCm 5.0.0

Added#
  • csrmv, coomv, ellmv, hybmv for (conjugate) transposed matrices

  • csrmv for symmetric matrices

Changed#
  • spmm_ex is now deprecated and will be removed in the next major release

Improved#
  • Optimization for gtsv

rocThrust 2.13.0#

rocThrust 2.13.0 for ROCm 5.0.0

Added#
  • Updated to match upstream Thrust 1.13.0

  • Updated to match upstream Thrust 1.14.0

  • Added async scan

Changed#
  • Scan algorithms: inclusive_scan now uses the input-type as accumulator-type, exclusive_scan uses initial-value-type.

    • This particularly changes behaviour of small-size input types with large-size output types (e.g. short input, int output).

    • And low-res input with high-res output (e.g. float input, double output)

Tensile 4.31.0#

Tensile 4.31.0 for ROCm 5.0.0

Added#
  • DirectToLds support (x2/x4)

  • DirectToVgpr support for DGEMM

  • Parameter to control number of files kernels are merged into to better parallelize kernel compilation

  • FP16 alternate implementation for HPA HGEMM on aldebaran

Optimized#
  • Add DGEMM NN custom kernel for HPL on aldebaran

Changed#
  • Update tensile_client executable to std=c++14

Removed#
  • Remove unused old Tensile client code

Fixed#
  • Fix hipErrorInvalidHandle during benchmarks

  • Fix addrVgpr for atomic GSU

  • Fix for Python 3.8: add case for Constant nodeType

  • Fix architecture mapping for gfx1011 and gfx1012

  • Fix PrintSolutionRejectionReason verbiage in KernelWriter.py

  • Fix vgpr alignment problem when enabling flat buffer load

Precision support#

Use the following sections to identify data types and HIP types ROCm™ supports.

Integral types#

The signed and unsigned integral types that are supported by ROCm are listed below, together with their corresponding HIP types and a short description.

int8 (HIP types: int8_t, uint8_t): A signed or unsigned 8-bit integer

int16 (HIP types: int16_t, uint16_t): A signed or unsigned 16-bit integer

int32 (HIP types: int32_t, uint32_t): A signed or unsigned 32-bit integer

int64 (HIP types: int64_t, uint64_t): A signed or unsigned 64-bit integer

Floating-point types#

The floating-point types that are supported by ROCm are listed below, together with their corresponding HIP type and a short description.

float8 (E4M3) (no HIP type): An 8-bit floating-point number that mostly follows IEEE-754 conventions and the S1E4M3 bit layout, as described in 8-bit Numerical Formats for Deep Neural Networks, with expanded range and with no infinity or signed zero. NaN is represented as negative zero.

float8 (E5M2) (no HIP type): An 8-bit floating-point number that mostly follows IEEE-754 conventions and the S1E5M2 bit layout, as described in 8-bit Numerical Formats for Deep Neural Networks, with expanded range and with no infinity or signed zero. NaN is represented as negative zero.

float16 (HIP type: half): A 16-bit floating-point number that conforms to the IEEE 754-2008 half-precision storage format.

bfloat16 (HIP type: bfloat16): A shortened 16-bit version of the IEEE 754 single-precision storage format.

tensorfloat32 (no HIP type): A floating-point number that occupies 32 bits or less of storage, providing improved range compared to the half (16-bit) format, at (potentially) greater throughput than single-precision (32-bit) formats.

float32 (HIP type: float): A 32-bit floating-point number that conforms to the IEEE 754 single-precision storage format.

float64 (HIP type: double): A 64-bit floating-point number that conforms to the IEEE 754 double-precision storage format.

Note

  • The float8 and tensorfloat32 types are internal types used in calculations in Matrix Cores and can be stored in any type of the same size.

  • The encodings for FP8 (E5M2) and FP8 (E4M3) that are natively supported by MI300 differ from the FP8 (E5M2) and FP8 (E4M3) encodings used in H100 (FP8 Formats for Deep Learning).

  • In some AMD documents and articles, float8 (E5M2) is referred to as bfloat8.
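
As a hedged sketch (not part of this page), the two 16-bit formats above can be used directly in HIP device code through the hip_fp16.h and hip_bfloat16.h headers:

#include <hip/hip_runtime.h>
#include <hip/hip_fp16.h>
#include <hip/hip_bfloat16.h>

__global__ void downconvert(const float* in, __half* out_fp16, hip_bfloat16* out_bf16, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out_fp16[i] = __float2half(in[i]);      // float32 -> float16 storage
        out_bf16[i] = hip_bfloat16(in[i]);      // float32 -> bfloat16 storage
    }
}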

ROCm support icons#

In the following sections, we use icons to represent the level of support. These icons, described below, are also used on the library data type support pages.

❌: Not supported

⚠️: Partial support

✅: Full support

Note

  • Full support means that the type is supported natively or with hardware emulation.

  • Native support means that the operations for that type are implemented in hardware. Types that are not natively supported are emulated with the available hardware. The performance of non-natively supported types can differ from the full instruction throughput rate. For example, 16-bit integer operations can be performed on the 32-bit integer ALUs at full rate; however, 64-bit integer operations might need several instructions on the 32-bit integer ALUs.

  • Any type can be emulated by software, but this page does not cover such cases.

Hardware type support#

AMD GPU hardware support for data types is listed in the following tables.

Compute units support#

The following table lists data type support for compute units.

Type name

int8

int16

int32

int64

MI100

MI200 series

MI300 series

Type name

float8 (E4M3)

float8 (E5M2)

float16

bfloat16

tensorfloat32

float32

float64

MI100

MI200 series

MI300 series

Matrix core support#

The following table lists data type support for AMD GPU matrix cores.

Type name

int8

int16

int32

int64

MI100

MI200 series

MI300 series

Type name

float8 (E4M3)

float8 (E5M2)

float16

bfloat16

tensorfloat32

float32

float64

MI100

MI200 series

MI300 series

Atomic operations support#

The following table lists data type support for atomic operations.

Type name

int8

int16

int32

int64

MI100

MI200 series

MI300 series

Type name

float8 (E4M3)

float8 (E5M2)

float16

bfloat16

tensorfloat32

float32

float64

MI100

MI200 series

MI300 series

Note

For cases that are not natively supported, you can emulate atomic operations in software. Software-emulated atomic operations incur a significant performance penalty when they frequently access the same memory address.
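
As a generic, hedged sketch (not an AMD-provided routine), an atomic operation on a type without native support can be emulated with a compare-and-swap loop; the retry loop is exactly why heavy contention on a single address is expensive:

#include <hip/hip_runtime.h>

__device__ double atomic_add_emulated(double* address, double val)
{
    unsigned long long* address_as_ull = reinterpret_cast<unsigned long long*>(address);
    unsigned long long old = *address_as_ull, assumed;
    do {
        assumed = old;
        // Retry until no other thread has modified the location between the read and the CAS.
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);
    return __longlong_as_double(old);
}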

Data Type support in ROCm Libraries#

ROCm library support for int8, float8 (E4M3), float8 (E5M2), int16, float16, bfloat16, int32, tensorfloat32, float32, int64, and float64 is listed in the following tables.

Libraries input/output type support#

The following tables list ROCm library support for specific input and output data types. For a detailed description, refer to the corresponding library data type support page.

Library (input/output support for integral types):

hipSPARSELt (details): int8 ✅/✅; int16 ❌/❌; int32 ❌/❌; int64 ❌/❌

rocRAND (details): int8 -/✅; int16 -/✅; int32 -/✅; int64 -/✅

hipRAND (details): int8 -/✅; int16 -/✅; int32 -/✅; int64 -/✅

rocPRIM (details): int8 ✅/✅; int16 ✅/✅; int32 ✅/✅; int64 ✅/✅

hipCUB (details): int8 ✅/✅; int16 ✅/✅; int32 ✅/✅; int64 ✅/✅

rocThrust (details): int8 ✅/✅; int16 ✅/✅; int32 ✅/✅; int64 ✅/✅

Library (input/output support for floating-point types):

hipSPARSELt (details): float8 (E4M3) ❌/❌; float8 (E5M2) ❌/❌; float16 ✅/✅; bfloat16 ✅/✅; tensorfloat32 ❌/❌; float32 ❌/❌; float64 ❌/❌

rocRAND (details): float8 (E4M3) -/❌; float8 (E5M2) -/❌; float16 -/✅; bfloat16 -/❌; tensorfloat32 -/❌; float32 -/✅; float64 -/✅

hipRAND (details): float8 (E4M3) -/❌; float8 (E5M2) -/❌; float16 -/✅; bfloat16 -/❌; tensorfloat32 -/❌; float32 -/✅; float64 -/✅

rocPRIM (details): float8 (E4M3) ❌/❌; float8 (E5M2) ❌/❌; float16 ✅/✅; bfloat16 ✅/✅; tensorfloat32 ❌/❌; float32 ✅/✅; float64 ✅/✅

hipCUB (details): float8 (E4M3) ❌/❌; float8 (E5M2) ❌/❌; float16 ✅/✅; bfloat16 ✅/✅; tensorfloat32 ❌/❌; float32 ✅/✅; float64 ✅/✅

rocThrust (details): float8 (E4M3) ❌/❌; float8 (E5M2) ❌/❌; float16 ⚠️/⚠️; bfloat16 ⚠️/⚠️; tensorfloat32 ❌/❌; float32 ✅/✅; float64 ✅/✅

Libraries internal calculations type support#

The following tables list ROCm library support for specific internal data types. For a detailed description, refer to the corresponding library data type support page.

Library internal data type name

int8

int16

int32

int64

hipSPARSELt (details)

Library internal data type name

float8 (E4M3)

float8 (E5M2)

float16

bfloat16

tensorfloat32

float32

float64

hipSPARSELt (details)

ROCm API libraries#

ROCm tools#

Accelerator and GPU hardware specifications#

The following tables provide an overview of the hardware specifications for AMD Instinct™ accelerators, and AMD Radeon™ PRO and Radeon™ GPUs.

| Model | Architecture | LLVM target name | VRAM (GiB) | Compute Units | Wavefront Size | LDS (KiB) | L3 Cache (MiB) | L2 Cache (MiB) | L1 Vector Cache (KiB) | L1 Scalar Cache (KiB) | L1 Instruction Cache (KiB) | VGPR File (KiB) | SGPR File (KiB) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MI300X | CDNA3 | gfx941 or gfx942 | 192 | 304 | 64 | 64 | 256 | 32 | 32 | 16 per 2 CUs | 64 per 2 CUs | 512 | 12.5 |
| MI300A | CDNA3 | gfx940 or gfx942 | 128 | 228 | 64 | 64 | 256 | 24 | 32 | 16 per 2 CUs | 64 per 2 CUs | 512 | 12.5 |
| MI250X | CDNA2 | gfx90a | 128 | 220 (110 per GCD) | 64 | 64 |  | 16 (8 per GCD) | 16 | 16 per 2 CUs | 32 per 2 CUs | 512 | 12.5 |
| MI250 | CDNA2 | gfx90a | 128 | 208 | 64 | 64 |  | 16 (8 per GCD) | 16 | 16 per 2 CUs | 32 per 2 CUs | 512 | 12.5 |
| MI210 | CDNA2 | gfx90a | 64 | 104 | 64 | 64 |  | 8 | 16 | 16 per 2 CUs | 32 per 2 CUs | 512 | 12.5 |
| MI100 | CDNA | gfx908 | 32 | 120 | 64 | 64 |  | 8 | 16 | 16 per 3 CUs | 32 per 3 CUs | 256 VGPR and 256 AccVGPR | 12.5 |
| MI60 | GCN5.1 | gfx906 | 32 | 64 | 64 | 64 |  | 4 | 16 | 16 per 3 CUs | 32 per 3 CUs | 256 | 12.5 |
| MI50 (32GB) | GCN5.1 | gfx906 | 32 | 60 | 64 | 64 |  | 4 | 16 | 16 per 3 CUs | 32 per 3 CUs | 256 | 12.5 |
| MI50 (16GB) | GCN5.1 | gfx906 | 16 | 60 | 64 | 64 |  | 4 | 16 | 16 per 3 CUs | 32 per 3 CUs | 256 | 12.5 |
| MI25 | GCN5.0 | gfx900 | 16 | 64 | 64 | 64 |  | 4 | 16 | 16 per 3 CUs | 32 per 3 CUs | 256 | 12.5 |
| MI8 | GCN3.0 | gfx803 | 4 | 64 | 64 | 64 |  | 2 | 16 | 16 per 4 CUs | 32 per 4 CUs | 256 | 12.5 |
| MI6 | GCN4.0 | gfx803 | 16 | 36 | 64 | 64 |  | 2 | 16 | 16 per 4 CUs | 32 per 4 CUs | 256 | 12.5 |

| Model | Architecture | LLVM target name | VRAM (GiB) | Compute Units | Wavefront Size | LDS (KiB) | Infinity Cache (MiB) | L2 Cache (MiB) | Graphics L1 Cache (KiB) | L0 Vector Cache (KiB) | L0 Scalar Cache (KiB) | L0 Instruction Cache (KiB) | VGPR File (KiB) | SGPR File (KiB) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Radeon PRO W7900 | RDNA3 | gfx1100 | 48 | 96 | 32 | 128 | 96 | 6 | 256 | 32 | 16 | 32 | 384 | 20 |
| Radeon PRO W7800 | RDNA3 | gfx1100 | 32 | 70 | 32 | 128 | 64 | 6 | 256 | 32 | 16 | 32 | 384 | 20 |
| Radeon PRO W7700 | RDNA3 | gfx1101 | 16 | 48 | 32 | 128 | 64 | 4 | 256 | 32 | 16 | 32 | 384 | 20 |
| Radeon PRO W6800 | RDNA2 | gfx1030 | 32 | 60 | 32 | 128 | 128 | 4 | 128 | 16 | 16 | 32 | 256 | 20 |
| Radeon PRO W6600 | RDNA2 | gfx1032 | 8 | 28 | 32 | 128 | 32 | 2 | 128 | 16 | 16 | 32 | 256 | 20 |
| Radeon PRO V620 | RDNA2 | gfx1030 | 32 | 72 | 32 | 128 | 128 | 4 | 128 | 16 | 16 | 32 | 256 | 20 |
| Radeon Pro W5500 | RDNA | gfx1012 | 8 | 22 | 32 | 128 |  | 4 | 128 | 16 | 16 | 32 | 256 | 20 |
| Radeon Pro VII | GCN5.1 | gfx906 | 16 | 60 | 64 | 64 |  | 4 |  | 16 | 16 per 3 CUs | 32 per 3 CUs | 256 | 12.5 |

| Model | Architecture | LLVM target name | VRAM (GiB) | Compute Units | Wavefront Size | LDS (KiB) | Infinity Cache (MiB) | L2 Cache (MiB) | Graphics L1 Cache (KiB) | L0 Vector Cache (KiB) | L0 Scalar Cache (KiB) | L0 Instruction Cache (KiB) | VGPR File (KiB) | SGPR File (KiB) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Radeon RX 7900 XTX | RDNA3 | gfx1100 | 24 | 96 | 32 | 128 | 96 | 6 | 256 | 32 | 16 | 32 | 384 | 20 |
| Radeon RX 7900 XT | RDNA3 | gfx1100 | 20 | 84 | 32 | 128 | 80 | 6 | 256 | 32 | 16 | 32 | 384 | 20 |
| Radeon RX 7900 GRE | RDNA3 | gfx1100 | 16 | 80 | 32 | 128 | 64 | 6 | 256 | 32 | 16 | 32 | 384 | 20 |
| Radeon RX 7800 XT | RDNA3 | gfx1101 | 16 | 60 | 32 | 128 | 64 | 4 | 256 | 32 | 16 | 32 | 384 | 20 |
| Radeon RX 7700 XT | RDNA3 | gfx1101 | 12 | 54 | 32 | 128 | 48 | 4 | 256 | 32 | 16 | 32 | 384 | 20 |
| Radeon RX 7600 | RDNA3 | gfx1102 | 8 | 32 | 32 | 128 | 32 | 2 | 256 | 32 | 16 | 32 | 256 | 20 |
| Radeon RX 6950 XT | RDNA2 | gfx1030 | 16 | 80 | 32 | 128 | 128 | 4 | 128 | 16 | 16 | 32 | 256 | 20 |
| Radeon RX 6900 XT | RDNA2 | gfx1030 | 16 | 80 | 32 | 128 | 128 | 4 | 128 | 16 | 16 | 32 | 256 | 20 |
| Radeon RX 6800 XT | RDNA2 | gfx1030 | 16 | 72 | 32 | 128 | 128 | 4 | 128 | 16 | 16 | 32 | 256 | 20 |
| Radeon RX 6800 | RDNA2 | gfx1030 | 16 | 60 | 32 | 128 | 128 | 4 | 128 | 16 | 16 | 32 | 256 | 20 |
| Radeon RX 6750 XT | RDNA2 | gfx1031 | 12 | 40 | 32 | 128 | 96 | 3 | 128 | 16 | 16 | 32 | 256 | 20 |
| Radeon RX 6700 XT | RDNA2 | gfx1031 | 12 | 40 | 32 | 128 | 96 | 3 | 128 | 16 | 16 | 32 | 256 | 20 |
| Radeon RX 6700 | RDNA2 | gfx1031 | 10 | 36 | 32 | 128 | 80 | 3 | 128 | 16 | 16 | 32 | 256 | 20 |
| Radeon RX 6650 XT | RDNA2 | gfx1032 | 8 | 32 | 32 | 128 | 32 | 2 | 128 | 16 | 16 | 32 | 256 | 20 |
| Radeon RX 6600 XT | RDNA2 | gfx1032 | 8 | 32 | 32 | 128 | 32 | 2 | 128 | 16 | 16 | 32 | 256 | 20 |
| Radeon RX 6600 | RDNA2 | gfx1032 | 8 | 28 | 32 | 128 | 32 | 2 | 128 | 16 | 16 | 32 | 256 | 20 |
| Radeon VII | GCN5.1 | gfx906 | 16 | 60 | 64 | 64 per CU |  | 4 |  | 16 | 16 per 3 CUs | 32 per 3 CUs | 256 | 12.5 |

For more information on the terms used here, see the specific documents and guides or Understanding the HIP programming model.

Deep learning guide#

The following sections cover the different framework installations for ROCm and deep-learning applications. The following image shows the sequential flow for using each framework. For each framework’s most current release notes, refer to the ROCm Compatible Frameworks Release Notes at Third-party support.

ROCm Compatible Frameworks Flowchart

Frameworks installation#

GPU-enabled Message Passing Interface#

The Message Passing Interface (MPI) is a standard API for distributed and parallel application development that can scale to multi-node clusters. To facilitate the porting of applications to clusters with GPUs, ROCm enables various technologies. You can use these technologies to pass GPU pointers to MPI calls and to enable ROCm-aware MPI libraries to deliver optimal performance for both intra-node and inter-node GPU-to-GPU communication.

The AMD kernel driver exposes remote direct memory access (RDMA) through PeerDirect interfaces. This allows network interface cards (NICs) to directly read and write to RDMA-capable GPU device memory, resulting in high-speed direct memory access (DMA) transfers between GPU and NIC. These interfaces are used to optimize inter-node MPI message communication.

The Open MPI project is an open source implementation of the MPI standard. It’s developed and maintained by a consortium of academic, research, and industry partners. To compile Open MPI with ROCm support, refer to the following sections:

ROCm-aware Open MPI on InfiniBand and RoCE networks using UCX#

The Unified Communication Framework (UCX) is an open source, cross-platform framework designed to provide a common set of communication interfaces for various network programming models and interfaces. UCX uses ROCm technologies to implement various network operation primitives. UCX is the standard communication library for InfiniBand and RDMA over Converged Ethernet (RoCE) network interconnects. To optimize data transfer operations, many MPI libraries, including Open MPI, can leverage UCX internally.

UCX and Open MPI have a compile option to enable ROCm support. To install and configure UCX to compile Open MPI for ROCm, use the following instructions.

1. Set environment variables to install all software components in the same base directory. We use the home directory in our example, but you can specify a different location if you want.

export INSTALL_DIR=$HOME/ompi_for_gpu
export BUILD_DIR=/tmp/ompi_for_gpu_build
mkdir -p $BUILD_DIR

2. Install UCX. To view UCX and ROCm version compatibility, refer to the communication libraries tables.

export UCX_DIR=$INSTALL_DIR/ucx
cd $BUILD_DIR
git clone https://github.com/openucx/ucx.git -b v1.15.x
cd ucx
./autogen.sh
mkdir build
cd build
../configure --prefix=$UCX_DIR \
    --with-rocm=/opt/rocm
make -j $(nproc)
make -j $(nproc) install
3. Install Open MPI.

export OMPI_DIR=$INSTALL_DIR/ompi
cd $BUILD_DIR
git clone --recursive https://github.com/open-mpi/ompi.git \
    -b v5.0.x
cd ompi
./autogen.pl
mkdir build
cd build
../configure --prefix=$OMPI_DIR --with-ucx=$UCX_DIR \
    --with-rocm=/opt/rocm
make -j $(nproc)
make install

ROCm-enabled OSU benchmarks#

You can use OSU Micro Benchmarks (OMB) to evaluate the performance of various primitives on ROCm-supported AMD GPUs. The --enable-rocm option exposes this functionality.

export OSU_DIR=$INSTALL_DIR/osu
cd $BUILD_DIR
wget http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-7.2.tar.gz
tar xfz osu-micro-benchmarks-7.2.tar.gz
cd osu-micro-benchmarks-7.2
./configure --enable-rocm \
    --with-rocm=/opt/rocm \
    CC=$OMPI_DIR/bin/mpicc CXX=$OMPI_DIR/bin/mpicxx \
    LDFLAGS="-L$OMPI_DIR/lib/ -lmpi -L/opt/rocm/lib/ \
    $(hipconfig -C) -lamdhip64" CXXFLAGS="-std=c++11"
make -j $(nproc)

Intra-node run#

Before running an Open MPI job, you must set the following environment variables to ensure that you’re using the correct versions of Open MPI and UCX.

export LD_LIBRARY_PATH=$OMPI_DIR/lib:$UCX_DIR/lib:/opt/rocm/lib
export PATH=$OMPI_DIR/bin:$PATH

To run the OSU bandwidth benchmark between the first two GPU devices (GPU 0 and GPU 1) inside the same node, use the following code.

$OMPI_DIR/bin/mpirun -np 2 \
-x UCX_TLS=sm,self,rocm \
--mca pml ucx \
./c/mpi/pt2pt/standard/osu_bw D D

This measures the unidirectional bandwidth from the first device (GPU 0) to the second device (GPU 1). To select specific devices, for example GPU 2 and GPU 3, set the following environment variable:

export HIP_VISIBLE_DEVICES=2,3

To force using a copy kernel instead of a DMA engine for the data transfer, use the following command:

export HSA_ENABLE_SDMA=0
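Putting these pieces together, the following is a minimal sketch of a run between GPU 2 and GPU 3 with SDMA disabled; the device numbers and benchmark path are illustrative and assume the OSU build from the previous section.

export HIP_VISIBLE_DEVICES=2,3    # expose only GPU 2 and GPU 3 to the benchmark
export HSA_ENABLE_SDMA=0          # force copy kernels instead of the DMA engine
$OMPI_DIR/bin/mpirun -np 2 \
    -x HIP_VISIBLE_DEVICES -x HSA_ENABLE_SDMA \
    -x UCX_TLS=sm,self,rocm \
    --mca pml ucx \
    ./c/mpi/pt2pt/standard/osu_bw D D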

The following output shows the effective transfer bandwidth measured for inter-die data transfer between GPU 2 and GPU 3 on a system with MI250 GPUs. For messages larger than 67 MB, an effective utilization of about 150 GB/sec is achieved:

Inter-GPU bandwidth for various payload sizes

Collective operations#

Collective operations on GPU buffers are best handled through the Unified Collective Communication (UCC) library component in Open MPI. To accomplish this, you must configure and compile the UCC library with ROCm support.

Note

You can verify UCC and ROCm version compatibility using the communication libraries tables.

export UCC_DIR=$INSTALL_DIR/ucc
git clone https://github.com/openucx/ucc.git -b v1.2.x
cd ucc
./autogen.sh
./configure --with-rocm=/opt/rocm \
            --with-ucx=$UCX_DIR   \
            --prefix=$UCC_DIR
make -j && make install

# Configure and compile Open MPI with UCX, UCC, and ROCm support
cd ompi
./configure --with-rocm=/opt/rocm  \
            --with-ucx=$UCX_DIR    \
            --with-ucc=$UCC_DIR    \
            --prefix=$OMPI_DIR

To use the UCC component with an MPI application, you must set additional parameters:

mpirun --mca pml ucx --mca osc ucx \
   --mca coll_ucc_enable 1     \
   --mca coll_ucc_priority 100 -np 64 ./my_mpi_app

ROCm-aware Open MPI using libfabric#

For network interconnects that are not covered in the previous category, such as HPE Slingshot, ROCm-aware communication can often be achieved through the libfabric library. For more information, refer to the libfabric documentation.

Note

When using Open MPI v5.0.x with libfabric support, shared memory communication between processes on the same node goes through the ob1/sm component. This component has basic support for GPU memory that is accomplished by staging transfers through a host buffer. Consequently, the performance of device-to-device shared memory communication is lower than the theoretical peak performance allowed by the GPU-to-GPU interconnect.

  1. Install libfabric. Note that libfabric is often pre-installed. To determine if it’s already installed, run:

module avail libfabric
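If a libfabric build is available, you can also list the providers it was compiled with as a first indication of whether it is usable on your network. This is a sketch that assumes the fi_info utility shipped with libfabric is in your PATH; the module and provider names vary by system.

module load libfabric    # if libfabric is provided as an environment module
fi_info -l               # list the providers compiled into this libfabric build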

Alternatively, you can download and compile libfabric with ROCm support. Note that not all components required to support some networks (e.g., HPE Slingshot) are available in the open source repository. Therefore, using a pre-installed libfabric library is strongly recommended over compiling libfabric manually.

If a pre-compiled libfabric library is available on your system, you can skip the following step.

  2. Compile libfabric with ROCm support.

export OFI_DIR=$INSTALL_DIR/ofi
cd $BUILD_DIR
git clone https://github.com/ofiwg/libfabric.git -b v1.19.x
cd libfabric
./autogen.sh
./configure --prefix=$OFI_DIR   \
            --with-rocr=/opt/rocm
make -j $(nproc)
make install

Installing Open MPI with libfabric support#

To build Open MPI with libfabric, use the following code:

export OMPI_DIR=$INSTALL_DIR/ompi
cd $BUILD_DIR
git clone --recursive https://github.com/open-mpi/ompi.git \
    -b v5.0.x
cd ompi
./autogen.pl
mkdir build
cd build
../configure --prefix=$OMPI_DIR --with-ofi=$OFI_DIR \
                --with-rocm=/opt/rocm
make -j $(nproc)
make install

ROCm-aware OSU with Open MPI and libfabric#

Compiling a ROCm-aware version of OSU benchmarks with Open MPI and libfabric uses the same process described in ROCm-enabled OSU benchmarks.

To run an OSU benchmark using multiple nodes, use the following code:

export LD_LIBRARY_PATH=$OMPI_DIR/lib:$OFI_DIR/lib64:/opt/rocm/lib
$OMPI_DIR/bin/mpirun --mca pml ob1 --mca btl_ofi_mode 2 -np 2 \
./c/mpi/pt2pt/standard/osu_bw D D

System debugging guide#

ROCm language and system-level debug, flags, and environment variables#

Kernel options to avoid the Ethernet port getting renamed every time you change graphics cards: net.ifnames=0 biosdevname=0
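A minimal sketch of applying these kernel options via GRUB on a Debian/Ubuntu-style system; the existing contents of GRUB_CMDLINE_LINUX_DEFAULT on your system may differ.

# /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash net.ifnames=0 biosdevname=0"

# regenerate the GRUB configuration and reboot
sudo update-grub
sudo reboot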

ROCr error code#

  • 2 Invalid Dimension

  • 4 Invalid Group Memory

  • 8 Invalid (or Null) Code

  • 32 Invalid Format

  • 64 Group is too large

  • 128 Out of VGPRs

  • 0x80000000 Debug Options

Command to dump firmware version and get Linux kernel version#

sudo cat /sys/kernel/debug/dri/1/amdgpu_firmware_info

uname -a

Debug flags#

To print debug messages when developing or debugging the base ROCm driver, you can enable printing from libhsakmt.so by setting the HSAKMT_DEBUG_LEVEL environment variable. Available debug levels are 3-7; the higher the level, the more messages are printed. A usage sketch follows the list below.

  • export HSAKMT_DEBUG_LEVEL=3 : Only pr_err() prints.

  • export HSAKMT_DEBUG_LEVEL=4 : pr_err() and pr_warn() print.

  • export HSAKMT_DEBUG_LEVEL=5 : The “notice” level is currently not implemented, so setting the level to 5 is the same as setting it to 4.

  • export HSAKMT_DEBUG_LEVEL=6 : pr_err(), pr_warn(), and pr_info() print.

  • export HSAKMT_DEBUG_LEVEL=7 : Everything, including pr_debug(), prints.
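For example, to capture the most verbose libhsakmt.so output while reproducing an issue (the application name is illustrative):

export HSAKMT_DEBUG_LEVEL=7    # print everything, including pr_debug() messages
./my_rocm_app                  # hypothetical application being debugged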

ROCr level environment variables for debug#

HSA_ENABLE_SDMA=0

HSA_ENABLE_INTERRUPT=0

HSA_SVM_GUARD_PAGES=0

HSA_DISABLE_CACHE=1

Turn off page retry on GFX9/Vega devices#

sudo -s

echo 1 > /sys/module/amdkfd/parameters/noretry

HIP environment variables 3.x#

OpenCL debug flags#

AMD_OCL_WAIT_COMMAND=1 (0 = OFF, 1 = On)

PCIe-debug#

For information on how to debug and profile HIP applications, see the HIP how-to guide on debugging.

Tuning guides#

Use case-specific system setup and tuning guides.

High-performance computing#

High-performance computing (HPC) workloads have unique requirements. The default hardware and BIOS configurations for OEM platforms may not provide optimal performance for HPC workloads. To enable optimal HPC settings on a per-platform and per-workload level, this guide calls out:

  • BIOS settings that can impact performance

  • Hardware configuration best practices

  • Supported versions of operating systems

  • Workload-specific recommendations for optimal BIOS and operating system settings

There is also a discussion on the AMD Instinct™ software development environment, including information on how to install and run the DGEMM, STREAM, HPCG, and HPL benchmarks. This guidance provides a good starting point but is not exhaustively tested across all compilers.

Prerequisites to understanding this document and to performing tuning of HPC applications include:

  • Experience in configuring servers

  • Administrative access to the server’s Management Interface (BMC)

  • Administrative access to the operating system

  • Familiarity with the OEM server’s BMC (strongly recommended)

  • Familiarity with the OS specific tools for configuration, monitoring, and troubleshooting (strongly recommended)

This document provides guidance on tuning systems with various AMD Instinct™ accelerators for HPC workloads. This document is not an all-inclusive guide, and some items referred to may have similar, but different, names in various OEM systems (for example, OEM-specific BIOS settings). This document also provides suggestions on items that should be the initial focus of additional, application-specific tuning.

This document is based on the AMD EPYC™ 7003-series processor family (former codename “Milan”).

While this guide is a good starting point, developers are encouraged to perform their own performance testing for additional tuning.

AMD Instinct™ MI200

This chapter goes through how to configure your AMD Instinct™ MI200 accelerated compute nodes to get the best performance out of them.

AMD Instinct™ MI100

This chapter briefly reviews hardware aspects of the AMD Instinct™ MI100 accelerators and the CDNA™ 1 architecture that is the foundation of these GPUs.

Workstation#

Workstation workloads, much like high-performance computing workloads, have a unique set of requirements: a blend of both graphics and compute, certification, stability, and more.

The document covers specific software requirements and processes needed to use these GPUs for Single Root I/O Virtualization (SR-IOV) and machine learning (ML).

The main purpose of this document is to help users utilize the RDNA 2 GPUs to their full potential.

AMD Radeon™ PRO W6000 and V620

This chapter describes the AMD GPUs with RDNA™ 2 architecture, namely AMD Radeon PRO W6800 and AMD Radeon PRO V620

MI100 high-performance computing and tuning guide#

System settings#

This chapter reviews system settings that are required to configure the system for AMD Instinct™ MI100 accelerators and that can improve performance of the GPUs. It is advised to configure the system for best possible host configuration according to the high-performance computing tuning guides for AMD EPYC™ 7002 Series and EPYC™ 7003 Series processors, depending on the processor generation of the system.

In addition to the BIOS settings listed below (see System BIOS settings), the following settings must also be enacted via the command line (see Operating system settings):

  • Core C states

  • AMD-PCI-UTIL (on AMD EPYC™ 7002 series processors)

  • IOMMU (if needed)

System BIOS settings#

For maximum MI100 GPU performance on systems with AMD EPYC™ 7002 series processors (codename “Rome”) and AMI System BIOS, the following configuration of System BIOS settings has been validated. These settings must be used for the qualification process and should be set as default values for the system BIOS. Analogous settings for other non-AMI System BIOS providers could be set similarly. For systems with Intel processors, some settings may not apply or be available as listed in the following table.

Recommended settings for the system BIOS in a GIGABYTE platform.#

BIOS Setting Location

Parameter

Value

Comments

Advanced / PCI Subsystem Settings

Above 4G Decoding

Enabled

GPU Large BAR Support

AMD CBS / CPU Common Options

Global C-state Control

Auto

Global C-States

AMD CBS / CPU Common Options

CCD/Core/Thread Enablement

Accept

Global C-States

AMD CBS / CPU Common Options / Performance

SMT Control

Disable

Global C-States

AMD CBS / DF Common Options / Memory Addressing

NUMA nodes per socket

NPS 1,2,4

NUMA Nodes (NPS)

AMD CBS / DF Common Options / Memory Addressing

Memory interleaving

Auto

Numa Nodes (NPS)

AMD CBS / DF Common Options / Link

4-link xGMI max speed

18 Gbps

Set AMD CPU xGMI speed to highest rate supported

AMD CBS / DF Common Options / Link

3-link xGMI max speed

18 Gbps

Set AMD CPU xGMI speed to highest rate supported

AMD CBS / NBIO Common Options

IOMMU

Disable

AMD CBS / NBIO Common Options

PCIe Ten Bit Tag Support

Enable

AMD CBS / NBIO Common Options

Preferred IO

Manual

AMD CBS / NBIO Common Options

Preferred IO Bus

“Use lspci to find pci device id”

AMD CBS / NBIO Common Options

Enhanced Preferred IO Mode

Enable

AMD CBS / NBIO Common Options / SMU Common Options

Determinism Control

Manual

AMD CBS / NBIO Common Options / SMU Common Options

Determinism Slider

Power

AMD CBS / NBIO Common Options / SMU Common Options

cTDP Control

Manual

AMD CBS / NBIO Common Options / SMU Common Options

cTDP

240

AMD CBS / NBIO Common Options / SMU Common Options

Package Power Limit Control

Manual

AMD CBS / NBIO Common Options / SMU Common Options

Package Power Limit

240

AMD CBS / NBIO Common Options / SMU Common Options

xGMI Link Width Control

Manual

AMD CBS / NBIO Common Options / SMU Common Options

xGMI Force Link Width

2

AMD CBS / NBIO Common Options / SMU Common Options

xGMI Force Link Width Control

Force

AMD CBS / NBIO Common Options / SMU Common Options

APBDIS

1

AMD CBS / NBIO Common Options / SMU Common Options

DF C-states

Auto

AMD CBS / NBIO Common Options / SMU Common Options

Fixed SOC P-state

P0

AMD CBS / UMC Common Options / DDR4 Common Options

Enforce POR

Accept

AMD CBS / UMC Common Options / DDR4 Common Options / Enforce POR

Overclock

Enabled

AMD CBS / UMC Common Options / DDR4 Common Options / Enforce POR

Memory Clock Speed

1600 MHz

Set to max Memory Speed, if using 3200 MHz DIMMs

AMD CBS / UMC Common Options / DDR4 Common Options / DRAM Controller Configuration / DRAM Power Options

Power Down Enable

Disabled

RAM Power Down

AMD CBS / Security

TSME

Disabled

Memory Encryption

Memory configuration#

For the memory addressing modes, especially the number of NUMA nodes per socket/processor (NPS), the recommended setting is to follow the guidance of the high-performance computing tuning guides for AMD EPYC™ 7002 Series and AMD EPYC™ 7003 Series processors to provide the optimal configuration for host side computation.

If the system is set to one NUMA domain per socket/processor (NPS1), bidirectional copy bandwidth between host memory and GPU memory may be slightly higher (up to about 16% more) than with four NUMA domains per socket/processor (NPS4). For memory bandwidth sensitive applications using MPI, NPS4 is recommended. For applications that are not optimized for NUMA locality, NPS1 is the recommended setting.

Operating system settings#
CPU core states - C-states#

There are several core states (C-states) that an AMD EPYC CPU can idle within:

  • C0: active. This is the active state while running an application.

  • C1: idle

  • C2: idle and power gated. This is a deeper sleep state and will have a greater latency when moving back to the C0 state, compared to when the CPU is coming out of C1.

Disabling C2 is important for running with a high performance, low-latency network. To disable power-gating on all cores run the following on Linux systems:

cpupower idle-set -d 2

Note that the cpupower tool must be installed, as it is not part of the base packages of most Linux® distributions. The package needed varies with the respective Linux distribution.

sudo apt install linux-tools-common
sudo yum install cpupowerutils
sudo zypper install cpupower
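After disabling C2, you can check the result with the same tool; a minimal sketch, noting that the exact output format depends on the cpupower version:

cpupower idle-info    # verify that the C2 state is reported as disabled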
AMD-IOPM-UTIL#

This section applies to AMD EPYC™ 7002 processors to optimize advanced Dynamic Power Management (DPM) in the I/O logic (see NBIO description above) for performance. Certain I/O workloads may benefit from disabling this power management. This utility disables DPM for all PCI-e root complexes in the system and locks the logic into the highest performance operational mode.

Disabling I/O DPM will reduce the latency and/or improve the throughput of low-bandwidth messages for PCI-e InfiniBand NICs and GPUs. Other workloads with low-bandwidth bursty PCI-e I/O characteristics may benefit as well if multiple such PCI-e devices are installed in the system.

The actions of the utility do not persist across reboots. There is no need to change any existing firmware settings when using this utility. The “Preferred I/O” and “Enhanced Preferred I/O” settings should remain unchanged at enabled.

Tip

The recommended method to use the utility is either to create a system start-up script, for example, a one-shot systemd service unit, or run the utility when starting up a job scheduler on the system. The installer packages (see Power Management Utility) will create and enable a systemd service unit for you. This service unit is configured to run in one-shot mode. This means that even when the service unit runs as expected, the status of the service unit will show inactive. This is the expected behavior when the utility runs normally. If the service unit shows failed, the utility did not run as expected. The output in either case can be shown with the systemctl status command.

Stopping the service unit has no effect since the utility does not leave anything running. To undo the effects of the utility, disable the service unit with the systemctl disable command and reboot the system.

The utility does not have any command-line options, and it must be run with super-user permissions.

Systems with 256 CPU threads - IOMMU configuration#

For systems that have 256 logical CPU cores or more (e.g., 64-core AMD EPYC™ 7763 in a dual-socket configuration and SMT enabled), setting the input-output memory management unit (IOMMU) configuration to “disabled” can limit the number of available logical cores to 255. The reason is that the Linux® kernel disables X2APIC in this case and falls back to Advanced Programmable Interrupt Controller (APIC), which can only enumerate a maximum of 255 (logical) cores.

If SMT is enabled by setting “CCD/Core/Thread Enablement > SMT Control” to “enable”, the following steps can be applied to the system to enable all (logical) cores of the system:

  • In the server BIOS, set IOMMU to “Enabled”.

  • When configuring the Grub boot loader, add the following arguments for the Linux kernel: amd_iommu=on iommu=pt

  • Update Grub to use the modified configuration:

    sudo grub2-mkconfig -o /boot/grub2/grub.cfg
    
  • Reboot the system.

  • Verify IOMMU passthrough mode by inspecting the kernel log via dmesg:

    [...]
    [   0.000000] Kernel command line: [...] amd_iommu=on iommu=pt
       [...]
    

Once the system is properly configured, ROCm software can be installed.

System management#

For a complete guide on how to install/manage/uninstall ROCm on Linux, refer to Quick-start (Linux). To verify that the installation was successful, refer to the post-install instructions and system tools. Should verification fail, consult the System Debugging Guide.

Hardware verification with ROCm#

The AMD ROCm™ platform ships with tools to query the system structure. To query the GPU hardware, the rocm-smi command is available. It can show available GPUs in the system with their device ID and their respective firmware (or VBIOS) versions:

rocm-smi --showhw output on an 8*MI100 system

Another important query is to show the system structure, the localization of the GPUs in the system, and the fabric connections between the system components:

rocm-smi --showtopo output on an 8*MI100 system
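For reference, both queries can be run directly from a shell on the host:

rocm-smi --showhw      # list GPUs with device IDs and firmware (VBIOS) versions
rocm-smi --showtopo    # show the system structure and link types between GPUs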

The previous command shows the system structure in four blocks:

  • The first block of the output shows the distance between the GPUs similar to what the numactl command outputs for the NUMA domains of a system. The weight is a qualitative measure for the “distance” data must travel to reach one GPU from another one. While the values do not carry a special (physical) meaning, the higher the value the more hops are needed to reach the destination from the source GPU.

  • The second block has a matrix for the number of hops required to send data from one GPU to another. For the GPUs in the local hive, this number is one, while for the others it is three (one hop to leave the hive, one hop across the processors, and one hop within the destination hive).

  • The third block outputs the link types between the GPUs. This can either be “XGMI” for AMD Infinity Fabric™ links or “PCIE” for PCIe Gen4 links.

  • The fourth block reveals the localization of a GPU with respect to the NUMA organization of the shared memory of the AMD EPYC™ processors.

To query the compute capabilities of the GPU devices, the rocminfo command is available with the AMD ROCm™ platform. It lists specific details about the GPU devices, including but not limited to the number of compute units, width of the SIMD pipelines, memory information, and Instruction Set Architecture:

rocminfo output fragment on an 8*MI100 system
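To quickly pull out just the LLVM target (gfx) names that rocminfo reports, a simple filter can be used; this is a sketch, and the exact output formatting may vary between ROCm releases:

rocminfo | grep -E "Name:\s+gfx"    # list the gfx target name of each GPU agent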

For a complete list of architecture (LLVM target) names, refer to Linux and Windows support.

Testing inter-device bandwidth#

Hardware verification with ROCm used the rocm-smi --showtopo command to show the system structure and how the GPUs are located and connected within it. For more detail, rocm-bandwidth-test can run benchmarks to show the effective link bandwidth between the components of the system.

The ROCm Bandwidth Test program can be installed with the following package-manager commands:

sudo apt install rocm-bandwidth-test
sudo yum install rocm-bandwidth-test
sudo zypper install rocm-bandwidth-test

Alternatively, the source code can be downloaded and built from source.
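Once installed, the tool can simply be run without arguments to measure all device pairs; a minimal sketch, where the help flag is assumed to list further options:

rocm-bandwidth-test       # run unidirectional and bidirectional tests for all devices
rocm-bandwidth-test -h    # list available options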

The output will list the available compute devices (CPUs and GPUs):

rocm-bandwidth-test output fragment on an 8*MI100 system listing devices

The output will also show a matrix that contains a “1” if a device can communicate to another device (CPU and GPU) of the system and it will show the NUMA distance (similar to rocm-smi):

rocm-bandwidth-test output fragment on an 8*MI100 system showing inter-device access matrix

rocm-bandwidth-test output fragment on an 8*MI100 system showing inter-device NUMA distance

The output also contains the measured bandwidth for unidirectional and bidirectional transfers between the devices (CPU and GPU):

rocm-bandwidth-test output fragment on an 8*MI100 system showing uni- and bidirectional bandwidths

MI200 high-performance computing and tuning guide#

System settings#

This chapter reviews system settings that are required to configure the system for AMD Instinct MI250 accelerators and improve the performance of the GPUs. It is advised to configure the system for the best possible host configuration according to the High Performance Computing (HPC) Tuning Guide for AMD EPYC 7003 Series Processors.

Configure the system BIOS settings as explained in System BIOS settings and enact the below given settings via the command line as explained in Operating system settings:

  • Core C states

  • input-output memory management unit (IOMMU), if needed

System BIOS settings#

For maximum MI250 GPU performance on systems with AMD EPYC™ 7003-series processors (codename “Milan”) and AMI System BIOS, the following configuration of system BIOS settings has been validated. These settings must be used for the qualification process and should be set as default values for the system BIOS. Analogous settings for other non-AMI System BIOS providers could be set similarly. For systems with Intel processors, some settings may not apply or be available as listed in the following table.

BIOS Setting Location

Parameter

Value

Comments

Advanced / PCI Subsystem Settings

Above 4G Decoding

Enabled

GPU Large BAR Support

Advanced / PCI Subsystem Settings

SR-IOV Support

Disabled

Disable Single Root IO Virtualization

AMD CBS / CPU Common Options

Global C-state Control

Auto

Global C-States

AMD CBS / CPU Common Options

CCD/Core/Thread Enablement

Accept

Global C-States

AMD CBS / CPU Common Options / Performance

SMT Control

Disable

Global C-States

AMD CBS / DF Common Options / Memory Addressing

NUMA nodes per socket

NPS 1,2,4

NUMA Nodes (NPS)

AMD CBS / DF Common Options / Memory Addressing

Memory interleaving

Auto

Numa Nodes (NPS)

AMD CBS / DF Common Options / Link

4-link xGMI max speed

18 Gbps

Set AMD CPU xGMI speed to highest rate supported

AMD CBS / NBIO Common Options

IOMMU

Disable

AMD CBS / NBIO Common Options

PCIe Ten Bit Tag Support

Auto

AMD CBS / NBIO Common Options

Preferred IO

Bus

AMD CBS / NBIO Common Options

Preferred IO Bus

“Use lspci to find pci device id”

AMD CBS / NBIO Common Options

Enhanced Preferred IO Mode

Enable

AMD CBS / NBIO Common Options / SMU Common Options

Determinism Control

Manual

AMD CBS / NBIO Common Options / SMU Common Options

Determinism Slider

Power

AMD CBS / NBIO Common Options / SMU Common Options

cTDP Control

Manual

Set cTDP to the maximum supported by the installed CPU

AMD CBS / NBIO Common Options / SMU Common Options

cTDP

280

AMD CBS / NBIO Common Options / SMU Common Options

Package Power Limit Control

Manual

Set Package Power Limit to the maximum supported by the installed CPU

AMD CBS / NBIO Common Options / SMU Common Options

Package Power Limit

280

AMD CBS / NBIO Common Options / SMU Common Options

xGMI Link Width Control

Manual

Set AMD CPU xGMI width to 16 bits

AMD CBS / NBIO Common Options / SMU Common Options

xGMI Force Link Width

2

AMD CBS / NBIO Common Options / SMU Common Options

xGMI Force Link Width Control

Force

AMD CBS / NBIO Common Options / SMU Common Options

APBDIS

1

AMD CBS / NBIO Common Options / SMU Common Options

DF C-states

Enabled

AMD CBS / NBIO Common Options / SMU Common Options

Fixed SOC P-state

P0

AMD CBS / UMC Common Options / DDR4 Common Options

Enforce POR

Accept

AMD CBS / UMC Common Options / DDR4 Common Options / Enforce POR

Overclock

Enabled

AMD CBS / UMC Common Options / DDR4 Common Options / Enforce POR

Memory Clock Speed

1600 MHz

Set to max Memory Speed, if using 3200 MHz DIMMs

AMD CBS / UMC Common Options / DDR4 Common Options / DRAM Controller Configuration / DRAM Power Options

Power Down Enable

Disabled

RAM Power Down

AMD CBS / Security

TSME

Disabled

Memory Encryption

Memory configuration#

For setting the memory addressing modes, especially the number of NUMA nodes per socket/processor (NPS), follow the guidance of the “High Performance Computing (HPC) Tuning Guide for AMD EPYC 7003 Series Processors” to provide the optimal configuration for host side computation. For most HPC workloads, NPS=4 is the recommended value.

Operating system settings#
CPU core states - C-states#

There are several core states (C-states) that an AMD EPYC CPU can idle within:

  • C0: active. This is the active state while running an application.

  • C1: idle

  • C2: idle and power gated. This is a deeper sleep state and will have a greater latency when moving back to the C0 state, compared to when the CPU is coming out of C1.

Disabling C2 is important for running with a high performance, low-latency network. To disable power-gating on all cores run the following on Linux systems:

cpupower idle-set -d 2

Note that the cpupower tool must be installed, as it is not part of the base packages of most Linux® distributions. The package needed varies with the respective Linux distribution.

sudo apt install linux-tools-common
sudo yum install cpupowerutils
sudo zypper install cpupower
AMD-IOPM-UTIL#

This section applies to AMD EPYC™ 7002 processors to optimize advanced Dynamic Power Management (DPM) in the I/O logic (see NBIO description above) for performance. Certain I/O workloads may benefit from disabling this power management. This utility disables DPM for all PCI-e root complexes in the system and locks the logic into the highest performance operational mode.

Disabling I/O DPM will reduce the latency and/or improve the throughput of low-bandwidth messages for PCI-e InfiniBand NICs and GPUs. Other workloads with low-bandwidth bursty PCI-e I/O characteristics may benefit as well if multiple such PCI-e devices are installed in the system.

The actions of the utility do not persist across reboots. There is no need to change any existing firmware settings when using this utility. The “Preferred I/O” and “Enhanced Preferred I/O” settings should remain unchanged at enabled.

Tip

The recommended method to use the utility is either to create a system start-up script, for example, a one-shot systemd service unit, or run the utility when starting up a job scheduler on the system. The installer packages (see Power Management Utility) will create and enable a systemd service unit for you. This service unit is configured to run in one-shot mode. This means that even when the service unit runs as expected, the status of the service unit will show inactive. This is the expected behavior when the utility runs normally. If the service unit shows failed, the utility did not run as expected. The output in either case can be shown with the systemctl status command.

Stopping the service unit has no effect since the utility does not leave anything running. To undo the effects of the utility, disable the service unit with the systemctl disable command and reboot the system.

The utility does not have any command-line options, and it must be run with super-user permissions.

Systems with 256 CPU threads - IOMMU configuration#

For systems that have 256 logical CPU cores or more (e.g., 64-core AMD EPYC™ 7763 in a dual-socket configuration and SMT enabled), setting the input-output memory management unit (IOMMU) configuration to “disabled” can limit the number of available logical cores to 255. The reason is that the Linux® kernel disables X2APIC in this case and falls back to Advanced Programmable Interrupt Controller (APIC), which can only enumerate a maximum of 255 (logical) cores.

If SMT is enabled by setting “CCD/Core/Thread Enablement > SMT Control” to “enable”, the following steps can be applied to the system to enable all (logical) cores of the system:

  • In the server BIOS, set IOMMU to “Enabled”.

  • When configuring the Grub boot loader, add the following arguments for the Linux kernel: amd_iommu=on iommu=pt

  • Update Grub to use the modified configuration:

    sudo grub2-mkconfig -o /boot/grub2/grub.cfg
    
  • Reboot the system.

  • Verify IOMMU passthrough mode by inspecting the kernel log via dmesg:

    [...]
    [   0.000000] Kernel command line: [...] amd_iommu=on iommu=pt
       [...]
    

Once the system is properly configured, ROCm software can be installed.

System management#

For a complete guide on how to install/manage/uninstall ROCm on Linux, refer to Quick-start (Linux). For verifying that the installation was successful, refer to the post-install instructions and system tools. Should verification fail, consult the System Debugging Guide.

Hardware verification with ROCm#

The AMD ROCm™ platform ships with tools to query the system structure. To query the GPU hardware, the rocm-smi command is available. It can show available GPUs in the system with their device ID and their respective firmware (or VBIOS) versions:

rocm-smi --showhw output on an 8*MI200 system

To see the system structure, the localization of the GPUs in the system, and the fabric connections between the system components, use:

rocm-smi --showtopo output on an 8*MI200 system

  • The first block of the output shows the distance between the GPUs similar to what the numactl command outputs for the NUMA domains of a system. The weight is a qualitative measure for the “distance” data must travel to reach one GPU from another one. While the values do not carry a special (physical) meaning, the higher the value the more hops are needed to reach the destination from the source GPU.

  • The second block has a matrix named “Hops between two GPUs”, where 1 means the two GPUs are directly connected with XGMI, 2 means both GPUs are linked to the same CPU socket and GPU communications will go through the CPU, and 3 means both GPUs are linked to different CPU sockets so communications will go through both CPU sockets. This number is one for all GPUs in this case since they are all connected to each other through the Infinity Fabric links.

  • The third block outputs the link types between the GPUs. This can either be “XGMI” for AMD Infinity Fabric links or “PCIE” for PCIe Gen4 links.

  • The fourth block reveals the localization of a GPU with respect to the NUMA organization of the shared memory of the AMD EPYC processors.

To query the compute capabilities of the GPU devices, use rocminfo command. It lists specific details about the GPU devices, including but not limited to the number of compute units, width of the SIMD pipelines, memory information, and Instruction Set Architecture (ISA):

rocminfo output fragment on an 8*MI200 system

For a complete list of architecture (LLVM target) names, refer to GPU OS Support for Linux and Windows.

Testing inter-device bandwidth#

Hardware verification with ROCm used the rocm-smi --showtopo command to show the system structure and how the GPUs are located and connected within it. For more detail, rocm-bandwidth-test can run benchmarks to show the effective link bandwidth between the components of the system.

The ROCm Bandwidth Test program can be installed with the following package-manager commands:

sudo apt install rocm-bandwidth-test
sudo yum install rocm-bandwidth-test
sudo zypper install rocm-bandwidth-test

Alternatively, the source code can be downloaded and built from source.

The output will list the available compute devices (CPUs and GPUs), including their device ID and PCIe ID:

rocm-bandwidth-test output fragment on an 8*MI200 system listing devices

The output will also show a matrix that contains a “1” if a device can communicate to another device (CPU and GPU) of the system and it will show the NUMA distance (similar to rocm-smi):

'rocm-bandwidth-test' output fragment on an 8*MI200 system showing inter-device access matrix and NUMA distances

The output also contains the measured bandwidth for unidirectional and bidirectional transfers between the devices (CPU and GPU):

'rocm-bandwidth-test' output fragment on an 8*MI200 system showing uni- and bidirectional bandwidths

RDNA2 workstation tuning guide#

System settings#

This chapter reviews system settings that are required to configure the system for ROCm virtualization on RDNA2-based AMD Radeon™ PRO GPUs. Installing ROCm on Bare Metal follows the routine ROCm installation procedure.

To enable ROCm virtualization on the V620, you must set up Single Root I/O Virtualization (SR-IOV) in the BIOS using the settings listed in System BIOS settings. A tested configuration is described in Operating system settings.

Attention

SR-IOV is supported on V620 and unsupported on W6800.

System BIOS settings#
Settings for the system BIOS in an ASRock platform.#

| BIOS Setting Location | Parameter | Value | Comments |
|---|---|---|---|
| Advanced / North Bridge Configuration | IOMMU | Enabled | Input-output Memory Management Unit |
| Advanced / North Bridge Configuration | ACS Enable | Enabled | Access Control Service |
| Advanced / PCIe/PCI/PnP Configuration | SR-IOV Support | Enabled | Single Root I/O Virtualization |
| Advanced / ACPI settings | PCI AER Support | Enabled | Advanced Error Reporting |

To set up the host, update SBIOS to version 1.2a.

Operating system settings#
System Configuration Prerequisites#

  • Server: SMC 4124 [AS -4124GS-TNR]

  • Host OS: Ubuntu 20.04.3 LTS

  • Host Kernel: 5.4.0-97-generic

  • CPU: AMD EPYC 7552 48-Core Processor

  • GPU: RDNA2 V620 (D603GLXE)

  • SBIOS: Version SMC_r_1.2a

  • VBIOS: 113-D603GLXE-077

  • Guest OS 1: Ubuntu 20.04.5 LTS

  • Guest OS 2: RHEL 9.0

  • GIM Driver: gim-dkms_1.0.0.1234577_all

  • VM CPU Cores: 32

  • VM RAM: 64 GB

Install the following Kernel-based Virtual Machine (KVM) Hypervisor packages:

sudo apt-get -y install qemu-kvm qemu-utils bridge-utils virt-manager gir1.2-spiceclientgtk* gir1.2-spice-client-gtk* libvirt-daemon-system dnsmasq-base
sudo virsh net-start default    # enable the default virtual network

Enable input-output memory management unit (IOMMU) in GRUB settings by adding the following line to /etc/default/grub:

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amd_iommu=on"    # for AMD CPUs

Update GRUB and reboot:

sudo update-grub
sudo reboot

Install the GPU-IOV Module (GIM, where IOV is I/O Virtualization) driver and follow the steps below.

sudo dpkg -i <gim_driver>
sudo reboot
# Load Host Driver to Create 1VF
sudo modprobe gim vf_num=1
# Note: If GIM driver loaded successfully, we could see "gim info:(gim_init:213) *****Running GIM*****" in dmesg
lspci -d 1002:

Which should output something like:

01:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] Device 1478
02:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] Device 1479
03:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Device 73a1
03:02.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Device 73ae → VF
Guest OS installation#

First, assign GPU virtual function (VF) to VM using the following steps.

  1. Shut down the VM.

  2. Run virt-manager

  3. In the Virtual Machine Manager GUI, select the VM and click Open.

    Virtual Machine Manager

  4. In the VM GUI, go to Show Virtual Hardware Details > Add Hardware to configure hardware.

    Show virtual hardware details

  5. Go to Add Hardware > PCI Host Device > VF and click Finish.

    VF Selection

Then start the VM.

Finally install ROCm on the virtual machine (VM). For detailed instructions, refer to the Linux install guide.

GPU architecture documentation#

AMD Instinct MI300 series

Review hardware aspects of the AMD Instinct™ MI300 series of GPU accelerators and the CDNA™ 3 architecture.

AMD Instinct MI200 series

Review hardware aspects of the AMD Instinct™ MI200 series of GPU accelerators and the CDNA™ 2 architecture.

AMD Instinct MI100

Review hardware aspects of the AMD Instinct™ MI100 series of GPU accelerators and the CDNA™ 1 architecture.

AMD Instinct™ MI300 series microarchitecture#

The AMD Instinct MI300 series accelerators are based on the AMD CDNA 3 architecture which was designed to deliver leadership performance for HPC, artificial intelligence (AI), and machine learning (ML) workloads. The AMD Instinct MI300 series accelerators are well-suited for extreme scalability and compute performance, running on everything from individual servers to the world’s largest exascale supercomputers.

With the MI300 series, AMD is introducing the Accelerator Complex Die (XCD), which contains the GPU computational elements of the processor along with the lower levels of the cache hierarchy.

The following image depicts the structure of a single XCD in the AMD Instinct MI300 accelerator series.

_images/image007.png

XCD-level system architecture showing 40 Compute Units, each with 32 KB L1 cache, a Unified Compute System with 4 ACE Compute Accelerators, shared 4MB of L2 cache and an HWS Hardware Scheduler.#

On the XCD, four Asynchronous Compute Engines (ACEs) send compute shader workgroups to the Compute Units (CUs). The XCD has 40 CUs, of which 38 are active at the aggregate level; the remaining 2 are disabled for yield management. The CUs all share a 4 MB L2 cache that serves to coalesce all memory traffic for the die. With less than half the CUs of the AMD Instinct MI200 series compute die, the AMD CDNA™ 3 XCD die is a smaller building block. However, it uses more advanced packaging, and the processor can include 6 or 8 XCDs for up to 304 CUs, roughly 40% more than the MI250X.

The MI300 series integrates up to 8 vertically stacked XCDs, 8 stacks of High-Bandwidth Memory 3 (HBM3), and 4 I/O dies (containing system infrastructure), using AMD Infinity Fabric™ technology as the interconnect.

The Matrix Cores inside the CDNA 3 CUs have significant improvements, emphasizing AI and machine learning, enhancing throughput of existing data types while adding support for new data types. CDNA 2 Matrix Cores support FP16 and BF16, while offering INT8 for inference. Compared to MI250X accelerators, CDNA 3 Matrix Cores triple the performance for FP16 and BF16, while providing a performance gain of 6.8 times for INT8. FP8 has a performance gain of 16 times compared to FP32, while TF32 has a gain of 4 times compared to FP32.

Peak-performance capabilities of the MI300X for different data types.#

| Computation and Data Type | FLOPS/CLOCK/CU | Peak TFLOPS |
|---|---|---|
| Matrix FP64 | 256 | 163.4 |
| Vector FP64 | 128 | 81.7 |
| Matrix FP32 | 256 | 163.4 |
| Vector FP32 | 256 | 163.4 |
| Vector TF32 | 1024 | 653.7 |
| Matrix FP16 | 2048 | 1307.4 |
| Matrix BF16 | 2048 | 1307.4 |
| Matrix FP8 | 4096 | 2614.9 |
| Matrix INT8 | 4096 | 2614.9 |

The above table summarizes the aggregated peak performance of the AMD Instinct MI300X Open Compute Platform (OCP) Open Accelerator Modules (OAMs) for different data types and command processors. The middle column lists the peak performance (number of data elements processed in a single instruction) of a single compute unit if a SIMD (or matrix) instruction is submitted in each clock cycle. The third column lists the theoretical peak performance of the OAM. The theoretical aggregated peak memory bandwidth of the GPU is 5.3 TB per second.
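As a worked example of how the two columns relate, assuming the MI300X peak engine clock of 2,100 MHz (not listed in the table):

Peak TFLOPS = FLOPS/CLOCK/CU × number of CUs × peak engine clock
Matrix FP64: 256 × 304 × 2.1 GHz ≈ 163.4 TFLOPS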

The following image shows the block diagram of the APU (left) and the OAM package (right) both connected via AMD Infinity Fabric™ network on-chip.

MI300 series system architecture showing MI300A (left) with 6 XCDs and 3 CCDs, while the MI300X (right) has 8 XCDs.#

Node-level architecture#

_images/image009.png

MI300 series node-level architecture showing 8 fully interconnected MI300X OAM modules connected to (optional) PCIe switches via retimers and HGX connectors.#

The image above shows the node-level architecture of a system with AMD EPYC processors in a dual-socket configuration and eight AMD Instinct MI300X accelerators. The MI300X OAMs attach to the host system via PCIe Gen 5 x16 links (yellow lines). The GPUs are using seven high-bandwidth, low-latency AMD Infinity Fabric™ links (red lines) to form a fully connected 8-GPU system.

MI300 and MI200 series performance counters and metrics#

This document lists and describes the hardware performance counters and derived metrics available for the AMD Instinct™ MI300 and MI200 GPUs. You can also access this information using the ROCProfiler tool.

MI300 and MI200 series performance counters#

Series performance counters include the following categories:

The following sections provide additional details for each category.

Note

Preliminary validation of all MI300 and MI200 series performance counters is in progress. Those with an asterisk (*) require further evaluation.

Command processor counters#

Command processor counters are further classified into command processor-fetcher and command processor-compute.

Command processor-fetcher counters#

Hardware counter

Unit

Definition

CPF_CMP_UTCL1_STALL_ON_TRANSLATION

Cycles

Number of cycles one of the compute unified translation caches (L1) is stalled waiting on translation

CPF_CPF_STAT_BUSY

Cycles

Number of cycles command processor-fetcher is busy

CPF_CPF_STAT_IDLE

Cycles

Number of cycles command processor-fetcher is idle

CPF_CPF_STAT_STALL

Cycles

Number of cycles command processor-fetcher is stalled

CPF_CPF_TCIU_BUSY

Cycles

Number of cycles command processor-fetcher texture cache interface unit interface is busy

CPF_CPF_TCIU_IDLE

Cycles

Number of cycles command processor-fetcher texture cache interface unit interface is idle

CPF_CPF_TCIU_STALL

Cycles

Number of cycles command processor-fetcher texture cache interface unit interface is stalled waiting on free tags

The texture cache interface unit is the interface between the command processor and the memory system.

Command processor-compute counters#

Hardware counter

Unit

Definition

CPC_ME1_BUSY_FOR_PACKET_DECODE

Cycles

Number of cycles command processor-compute micro engine is busy decoding packets

CPC_UTCL1_STALL_ON_TRANSLATION

Cycles

Number of cycles one of the unified translation caches (L1) is stalled waiting on translation

CPC_CPC_STAT_BUSY

Cycles

Number of cycles command processor-compute is busy

CPC_CPC_STAT_IDLE

Cycles

Number of cycles command processor-compute is idle

CPC_CPC_STAT_STALL

Cycles

Number of cycles command processor-compute is stalled

CPC_CPC_TCIU_BUSY

Cycles

Number of cycles command processor-compute texture cache interface unit interface is busy

CPC_CPC_TCIU_IDLE

Cycles

Number of cycles command processor-compute texture cache interface unit interface is idle

CPC_CPC_UTCL2IU_BUSY

Cycles

Number of cycles command processor-compute unified translation cache (L2) interface is busy

CPC_CPC_UTCL2IU_IDLE

Cycles

Number of cycles command processor-compute unified translation cache (L2) interface is idle

CPC_CPC_UTCL2IU_STALL

Cycles

Number of cycles command processor-compute unified translation cache (L2) interface is stalled

CPC_ME1_DC0_SPI_BUSY

Cycles

Number of cycles command processor-compute micro engine processor is busy

The micro engine runs packet-processing firmware on the command processor-compute block.

Graphics register bus manager counters#

Hardware counter

Unit

Definition

GRBM_COUNT

Cycles

Number of free-running GPU cycles

GRBM_GUI_ACTIVE

Cycles

Number of GPU active cycles

GRBM_CP_BUSY

Cycles

Number of cycles any of the command processor blocks are busy

GRBM_SPI_BUSY

Cycles

Number of cycles any of the shader processor input is busy in the shader engines

GRBM_TA_BUSY

Cycles

Number of cycles any of the texture addressing unit is busy in the shader engines

GRBM_TC_BUSY

Cycles

Number of cycles any of the texture cache blocks are busy

GRBM_CPC_BUSY

Cycles

Number of cycles the command processor-compute is busy

GRBM_CPF_BUSY

Cycles

Number of cycles the command processor-fetcher is busy

GRBM_UTCL2_BUSY

Cycles

Number of cycles the unified translation cache (Level 2 [L2]) block is busy

GRBM_EA_BUSY

Cycles

Number of cycles the efficiency arbiter block is busy

Texture cache blocks include:

  • Texture cache arbiter

  • Texture cache per pipe, also known as vector Level 1 (L1) cache

  • Texture cache per channel, also known as L2 cache

  • Texture cache interface

Shader processor input counters#

Hardware counter

Unit

Definition

SPI_CSN_BUSY

Cycles

Number of cycles with outstanding waves

SPI_CSN_WINDOW_VALID

Cycles

Number of cycles enabled by perfcounter_start event

SPI_CSN_NUM_THREADGROUPS

Workgroups

Number of dispatched workgroups

SPI_CSN_WAVE

Wavefronts

Number of dispatched wavefronts

SPI_RA_REQ_NO_ALLOC

Cycles

Number of arbiter cycles with requests but no allocation

SPI_RA_REQ_NO_ALLOC_CSN

Cycles

Number of arbiter cycles with compute shader (nth pipe) requests but no compute shader (nth pipe) allocation

SPI_RA_RES_STALL_CSN

Cycles

Number of arbiter stall cycles due to shortage of compute shader (nth pipe) pipeline slots

SPI_RA_TMP_STALL_CSN

Cycles

Number of stall cycles due to shortage of temp space

SPI_RA_WAVE_SIMD_FULL_CSN

SIMD-cycles

Accumulated number of single instruction, multiple data (SIMD) per cycle affected by shortage of wave slots for compute shader (nth pipe) wave dispatch

SPI_RA_VGPR_SIMD_FULL_CSN

SIMD-cycles

Accumulated number of SIMDs per cycle affected by shortage of vector general-purpose register (VGPR) slots for compute shader (nth pipe) wave dispatch

SPI_RA_SGPR_SIMD_FULL_CSN

SIMD-cycles

Accumulated number of SIMDs per cycle affected by shortage of scalar general-purpose register (SGPR) slots for compute shader (nth pipe) wave dispatch

SPI_RA_LDS_CU_FULL_CSN

CU

Number of compute units affected by shortage of local data share (LDS) space for compute shader (nth pipe) wave dispatch

SPI_RA_BAR_CU_FULL_CSN

CU

Number of compute units with compute shader (nth pipe) waves waiting at a BARRIER

SPI_RA_BULKY_CU_FULL_CSN

CU

Number of compute units with compute shader (nth pipe) waves waiting for BULKY resource

SPI_RA_TGLIM_CU_FULL_CSN

Cycles

Number of compute shader (nth pipe) wave stall cycles due to restriction of tg_limit for thread group size

SPI_RA_WVLIM_STALL_CSN

Cycles

Number of cycles compute shader (nth pipe) is stalled due to WAVE_LIMIT

SPI_VWC_CSC_WR

Qcycles

Number of quad-cycles taken to initialize VGPRs when launching waves

SPI_SWC_CSC_WR

Qcycles

Number of quad-cycles taken to initialize SGPRs when launching waves

Compute unit counters#

The compute unit counters are further classified into instruction mix, matrix fused multiply-add (FMA) operation counters, level counters, wavefront counters, wavefront cycle counters, and LDS counters.

Instruction mix#

Hardware counter

Unit

Definition

SQ_INSTS

Instr

Number of instructions issued

SQ_INSTS_VALU

Instr

Number of vector arithmetic logic unit (VALU) instructions including matrix FMA issued

SQ_INSTS_VALU_ADD_F16

Instr

Number of VALU half-precision floating-point (F16) ADD or SUB instructions issued

SQ_INSTS_VALU_MUL_F16

Instr

Number of VALU F16 Multiply instructions issued

SQ_INSTS_VALU_FMA_F16

Instr

Number of VALU F16 FMA or multiply-add instructions issued

SQ_INSTS_VALU_TRANS_F16

Instr

Number of VALU F16 Transcendental instructions issued

SQ_INSTS_VALU_ADD_F32

Instr

Number of VALU full-precision floating-point (F32) ADD or SUB instructions issued

SQ_INSTS_VALU_MUL_F32

Instr

Number of VALU F32 Multiply instructions issued

SQ_INSTS_VALU_FMA_F32

Instr

Number of VALU F32 FMA or multiply-add instructions issued

SQ_INSTS_VALU_TRANS_F32

Instr

Number of VALU F32 Transcendental instructions issued

SQ_INSTS_VALU_ADD_F64

Instr

Number of VALU F64 ADD or SUB instructions issued

SQ_INSTS_VALU_MUL_F64

Instr

Number of VALU F64 Multiply instructions issued

SQ_INSTS_VALU_FMA_F64

Instr

Number of VALU F64 FMA or multiply-add instructions issued

SQ_INSTS_VALU_TRANS_F64

Instr

Number of VALU F64 Transcendental instructions issued

SQ_INSTS_VALU_INT32

Instr

Number of VALU 32-bit integer instructions (signed or unsigned) issued

SQ_INSTS_VALU_INT64

Instr

Number of VALU 64-bit integer instructions (signed or unsigned) issued

SQ_INSTS_VALU_CVT

Instr

Number of VALU Conversion instructions issued

SQ_INSTS_VALU_MFMA_I8

Instr

Number of 8-bit Integer matrix FMA instructions issued

SQ_INSTS_VALU_MFMA_F16

Instr

Number of F16 matrix FMA instructions issued

SQ_INSTS_VALU_MFMA_F32

Instr

Number of F32 matrix FMA instructions issued

SQ_INSTS_VALU_MFMA_F64

Instr

Number of F64 matrix FMA instructions issued

SQ_INSTS_MFMA

Instr

Number of matrix FMA instructions issued

SQ_INSTS_VMEM_WR

Instr

Number of vector memory write instructions (including flat) issued

SQ_INSTS_VMEM_RD

Instr

Number of vector memory read instructions (including flat) issued

SQ_INSTS_VMEM

Instr

Number of vector memory instructions issued, including both flat and buffer instructions

SQ_INSTS_SALU

Instr

Number of scalar arithmetic logic unit (SALU) instructions issued

SQ_INSTS_SMEM

Instr

Number of scalar memory instructions issued

SQ_INSTS_SMEM_NORM

Instr

Number of scalar memory instructions normalized to match smem_level issued

SQ_INSTS_FLAT

Instr

Number of flat instructions issued

SQ_INSTS_FLAT_LDS_ONLY

Instr

(MI200 series only) Number of FLAT instructions that read/write only from/to LDS issued. Works only if EARLY_TA_DONE is enabled.

SQ_INSTS_LDS

Instr

Number of LDS instructions issued (MI200: includes flat; MI300: does not include flat)

SQ_INSTS_GDS

Instr

Number of global data share instructions issued

SQ_INSTS_EXP_GDS

Instr

Number of EXP and global data share instructions excluding skipped export instructions issued

SQ_INSTS_BRANCH

Instr

Number of branch instructions issued

SQ_INSTS_SENDMSG

Instr

Number of SENDMSG instructions including s_endpgm issued

SQ_INSTS_VSKIPPED

Instr

Number of vector instructions skipped

Flat instructions allow read, write, and atomic access to a generic memory address pointer that can resolve to any of the following physical memories:

  • Global Memory

  • Scratch (“private”)

  • LDS (“shared”)

  • Invalid - MEM_VIOL TrapStatus

Matrix fused multiply-add operation counters#

Hardware counter

Unit

Definition

SQ_INSTS_VALU_MFMA_MOPS_I8

IOP

Number of 8-bit integer matrix FMA ops in the unit of 512

SQ_INSTS_VALU_MFMA_MOPS_F16

FLOP

Number of F16 floating matrix FMA ops in the unit of 512

SQ_INSTS_VALU_MFMA_MOPS_BF16

FLOP

Number of BF16 floating matrix FMA ops in the unit of 512

SQ_INSTS_VALU_MFMA_MOPS_F32

FLOP

Number of F32 floating matrix FMA ops in the unit of 512

SQ_INSTS_VALU_MFMA_MOPS_F64

FLOP

Number of F64 floating matrix FMA ops in the unit of 512

Level counters#

Note

All level counters must be followed by SQ_ACCUM_PREV_HIRES counter to measure average latency.

Hardware counter

Unit

Definition

SQ_ACCUM_PREV

Count

Accumulated counter sample value where accumulation takes place once every four cycles

SQ_ACCUM_PREV_HIRES

Count

Accumulated counter sample value where accumulation takes place once every cycle

SQ_LEVEL_WAVES

Waves

Number of inflight waves

SQ_INST_LEVEL_VMEM

Instr

Number of inflight vector memory (including flat) instructions

SQ_INST_LEVEL_SMEM

Instr

Number of inflight scalar memory instructions

SQ_INST_LEVEL_LDS

Instr

Number of inflight LDS (including flat) instructions

SQ_IFETCH_LEVEL

Instr

Number of inflight instruction fetch requests from the cache

Use the following formulae to calculate latencies:

  • Vector memory latency = SQ_ACCUM_PREV_HIRES divided by SQ_INSTS_VMEM

  • Wave latency = SQ_ACCUM_PREV_HIRES divided by SQ_WAVES

  • LDS latency = SQ_ACCUM_PREV_HIRES divided by SQ_INSTS_LDS

  • Scalar memory latency = SQ_ACCUM_PREV_HIRES divided by SQ_INSTS_SMEM_NORM

  • Instruction fetch latency = SQ_ACCUM_PREV_HIRES divided by SQ_IFETCH
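
As a minimal illustration of these formulas, the following C++ sketch computes an average vector memory latency from raw counter samples. The numeric values are hypothetical placeholders, not output of any particular ROCm profiling tool.

#include <cstdio>

int main() {
    // Hypothetical counter samples for one kernel dispatch (replace with real profiler output).
    double sq_accum_prev_hires = 1.2e6;  // SQ_ACCUM_PREV_HIRES collected after SQ_INST_LEVEL_VMEM
    double sq_insts_vmem       = 4.0e4;  // SQ_INSTS_VMEM

    // Vector memory latency = SQ_ACCUM_PREV_HIRES / SQ_INSTS_VMEM
    double vmem_latency_cycles = sq_accum_prev_hires / sq_insts_vmem;
    std::printf("Average vector memory latency: %.1f cycles\n", vmem_latency_cycles);
    return 0;
}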

Wavefront counters#

Hardware counter

Unit

Definition

SQ_WAVES

Waves

Number of wavefronts dispatched to sequencers, including both new and restored wavefronts

SQ_WAVES_SAVED

Waves

Number of context-saved waves

SQ_WAVES_RESTORED

Waves

Number of context-restored waves sent to sequencers

SQ_WAVES_EQ_64

Waves

Number of wavefronts with exactly 64 active threads sent to sequencers

SQ_WAVES_LT_64

Waves

Number of wavefronts with less than 64 active threads sent to sequencers

SQ_WAVES_LT_48

Waves

Number of wavefronts with less than 48 active threads sent to sequencers

SQ_WAVES_LT_32

Waves

Number of wavefronts with less than 32 active threads sent to sequencers

SQ_WAVES_LT_16

Waves

Number of wavefronts with less than 16 active threads sent to sequencers

Wavefront cycle counters#

Hardware counter

Unit

Definition

SQ_CYCLES

Cycles

Clock cycles

SQ_BUSY_CYCLES

Cycles

Number of cycles the sequencer reports itself busy

SQ_BUSY_CU_CYCLES

Qcycles

Number of quad-cycles each compute unit is busy

SQ_VALU_MFMA_BUSY_CYCLES

Cycles

Number of cycles the matrix FMA arithmetic logic unit (ALU) is busy

SQ_WAVE_CYCLES

Qcycles

Number of quad-cycles spent by waves in the compute units

SQ_WAIT_ANY

Qcycles

Number of quad-cycles spent waiting for anything

SQ_WAIT_INST_ANY

Qcycles

Number of quad-cycles spent waiting for any instruction to be issued

SQ_ACTIVE_INST_ANY

Qcycles

Number of quad-cycles spent by each wave to work on an instruction

SQ_ACTIVE_INST_VMEM

Qcycles

Number of quad-cycles spent by the sequencer instruction arbiter to work on a vector memory instruction

SQ_ACTIVE_INST_LDS

Qcycles

Number of quad-cycles spent by the sequencer instruction arbiter to work on an LDS instruction

SQ_ACTIVE_INST_VALU

Qcycles

Number of quad-cycles spent by the sequencer instruction arbiter to work on a VALU instruction

SQ_ACTIVE_INST_SCA

Qcycles

Number of quad-cycles spent by the sequencer instruction arbiter to work on a SALU or scalar memory instruction

SQ_ACTIVE_INST_EXP_GDS

Qcycles

Number of quad-cycles spent by the sequencer instruction arbiter to work on an EXPORT or GDS instruction

SQ_ACTIVE_INST_MISC

Qcycles

Number of quad-cycles spent by the sequencer instruction arbiter to work on a BRANCH or SENDMSG instruction

SQ_ACTIVE_INST_FLAT

Qcycles

Number of quad-cycles spent by the sequencer instruction arbiter to work on a flat instruction

SQ_INST_CYCLES_VMEM_WR

Qcycles

Number of quad-cycles spent to send addr and cmd data for vector memory write instructions

SQ_INST_CYCLES_VMEM_RD

Qcycles

Number of quad-cycles spent to send addr and cmd data for vector memory read instructions

SQ_INST_CYCLES_SMEM

Qcycles

Number of quad-cycles spent to execute scalar memory reads

SQ_INST_CYCLES_SALU

Qcycles

Number of quad-cycles spent to execute non-memory read scalar operations

SQ_THREAD_CYCLES_VALU

Qcycles

Number of quad-cycles spent to execute VALU operations on active threads

SQ_WAIT_INST_LDS

Qcycles

Number of quad-cycles spent waiting for LDS instruction to be issued

SQ_THREAD_CYCLES_VALU is similar to INST_CYCLES_VALU, but it’s multiplied by the number of active threads.

LDS counters#

Hardware counter

Unit

Definition

SQ_LDS_ATOMIC_RETURN

Cycles

Number of atomic return cycles in LDS

SQ_LDS_BANK_CONFLICT

Cycles

Number of cycles LDS is stalled by bank conflicts

SQ_LDS_ADDR_CONFLICT

Cycles

Number of cycles LDS is stalled by address conflicts

SQ_LDS_UNALIGNED_STALL

Cycles

Number of cycles LDS is stalled processing flat unaligned load or store operations

SQ_LDS_MEM_VIOLATIONS

Count

Number of threads that have a memory violation in the LDS

SQ_LDS_IDX_ACTIVE

Cycles

Number of cycles LDS is used for indexed operations

Miscellaneous counters#

Hardware counter

Unit

Definition

SQ_IFETCH

Count

Number of instruction fetch requests from L1i, in 32-byte width

SQ_ITEMS

Threads

Number of valid items per wave

L1 instruction cache (L1i) and scalar L1 data cache (L1d) counters#

Hardware counter

Unit

Definition

SQC_ICACHE_REQ

Req

Number of L1 instruction (L1i) cache requests

SQC_ICACHE_HITS

Count

Number of L1i cache hits

SQC_ICACHE_MISSES

Count

Number of non-duplicate L1i cache misses including uncached requests

SQC_ICACHE_MISSES_DUPLICATE

Count

Number of duplicate L1i cache misses whose previous lookup miss on the same cache line is not fulfilled yet

SQC_DCACHE_REQ

Req

Number of scalar L1d requests

SQC_DCACHE_INPUT_VALID_READYB

Cycles

Number of cycles while sequencer input is valid but scalar L1d is not ready

SQC_DCACHE_HITS

Count

Number of scalar L1d hits

SQC_DCACHE_MISSES

Count

Number of non-duplicate scalar L1d misses including uncached requests

SQC_DCACHE_MISSES_DUPLICATE

Count

Number of duplicate scalar L1d misses

SQC_DCACHE_REQ_READ_1

Req

Number of constant cache read requests in a single 32-bit data word

SQC_DCACHE_REQ_READ_2

Req

Number of constant cache read requests in two 32-bit data words

SQC_DCACHE_REQ_READ_4

Req

Number of constant cache read requests in four 32-bit data words

SQC_DCACHE_REQ_READ_8

Req

Number of constant cache read requests in eight 32-bit data words

SQC_DCACHE_REQ_READ_16

Req

Number of constant cache read requests in 16 32-bit data words

SQC_DCACHE_ATOMIC

Req

Number of atomic requests

SQC_TC_REQ

Req

Number of texture cache requests that were issued by instruction and constant caches

SQC_TC_INST_REQ

Req

Number of instruction requests to the L2 cache

SQC_TC_DATA_READ_REQ

Req

Number of data Read requests to the L2 cache

SQC_TC_DATA_WRITE_REQ

Req

Number of data write requests to the L2 cache

SQC_TC_DATA_ATOMIC_REQ

Req

Number of data atomic requests to the L2 cache

SQC_TC_STALL

Cycles

Number of cycles while the valid requests to the L2 cache are stalled

Vector L1 cache subsystem counters#

The vector L1 cache subsystem counters are further classified into texture addressing unit, texture data unit, vector L1d or texture cache per pipe, and texture cache arbiter counters.

Texture addressing unit counters#

Hardware counter

Unit

Definition

Value range for n

TA_TA_BUSY[n]

Cycles

Texture addressing unit busy cycles

0-15

TA_TOTAL_WAVEFRONTS[n]

Instr

Number of wavefronts processed by texture addressing unit

0-15

TA_BUFFER_WAVEFRONTS[n]

Instr

Number of buffer wavefronts processed by texture addressing unit

0-15

TA_BUFFER_READ_WAVEFRONTS[n]

Instr

Number of buffer read wavefronts processed by texture addressing unit

0-15

TA_BUFFER_WRITE_WAVEFRONTS[n]

Instr

Number of buffer write wavefronts processed by texture addressing unit

0-15

TA_BUFFER_ATOMIC_WAVEFRONTS[n]

Instr

Number of buffer atomic wavefronts processed by texture addressing unit

0-15

TA_BUFFER_TOTAL_CYCLES[n]

Cycles

Number of buffer cycles (including read and write) issued to texture cache

0-15

TA_BUFFER_COALESCED_READ_CYCLES[n]

Cycles

Number of coalesced buffer read cycles issued to texture cache

0-15

TA_BUFFER_COALESCED_WRITE_CYCLES[n]

Cycles

Number of coalesced buffer write cycles issued to texture cache

0-15

TA_ADDR_STALLED_BY_TC_CYCLES[n]

Cycles

Number of cycles texture addressing unit address path is stalled by texture cache

0-15

TA_DATA_STALLED_BY_TC_CYCLES[n]

Cycles

Number of cycles texture addressing unit data path is stalled by texture cache

0-15

TA_ADDR_STALLED_BY_TD_CYCLES[n]

Cycles

Number of cycles texture addressing unit address path is stalled by texture data unit

0-15

TA_FLAT_WAVEFRONTS[n]

Instr

Number of flat opcode wavefronts processed by texture addressing unit

0-15

TA_FLAT_READ_WAVEFRONTS[n]

Instr

Number of flat opcode read wavefronts processed by texture addressing unit

0-15

TA_FLAT_WRITE_WAVEFRONTS[n]

Instr

Number of flat opcode write wavefronts processed by texture addressing unit

0-15

TA_FLAT_ATOMIC_WAVEFRONTS[n]

Instr

Number of flat opcode atomic wavefronts processed by texture addressing unit

0-15

Texture data unit counters#

Hardware counter

Unit

Definition

Value range for n

TD_TD_BUSY[n]

Cycle

Texture data unit busy cycles while it is processing or waiting for data

0-15

TD_TC_STALL[n]

Cycle

Number of cycles texture data unit is stalled waiting for texture cache data

0-15

TD_SPI_STALL[n]

Cycle

Number of cycles texture data unit is stalled by shader processor input

0-15

TD_LOAD_WAVEFRONT[n]

Instr

Number of wavefront instructions (read, write, atomic)

0-15

TD_STORE_WAVEFRONT[n]

Instr

Number of write wavefront instructions

0-15

TD_ATOMIC_WAVEFRONT[n]

Instr

Number of atomic wavefront instructions

0-15

TD_COALESCABLE_WAVEFRONT[n]

Instr

Number of coalescable wavefronts according to texture addressing unit

0-15

Texture cache per pipe counters#

Hardware counter

Unit

Definition

Value range for n

TCP_GATE_EN1[n]

Cycles

Number of cycles vector L1d interface clocks are turned on

0-15

TCP_GATE_EN2[n]

Cycles

Number of cycles vector L1d core clocks are turned on

0-15

TCP_TD_TCP_STALL_CYCLES[n]

Cycles

Number of cycles texture data unit stalls vector L1d

0-15

TCP_TCR_TCP_STALL_CYCLES[n]

Cycles

Number of cycles texture cache router stalls vector L1d

0-15

TCP_READ_TAGCONFLICT_STALL_CYCLES[n]

Cycles

Number of cycles tag RAM conflict stalls on a read

0-15

TCP_WRITE_TAGCONFLICT_STALL_CYCLES[n]

Cycles

Number of cycles tag RAM conflict stalls on a write

0-15

TCP_ATOMIC_TAGCONFLICT_STALL_CYCLES[n]

Cycles

Number of cycles tag RAM conflict stalls on an atomic

0-15

TCP_PENDING_STALL_CYCLES[n]

Cycles

Number of cycles vector L1d is stalled due to data pending from L2 Cache

0-15

TCP_TCP_TA_DATA_STALL_CYCLES

Cycles

Number of cycles texture cache per pipe stalls texture addressing unit data interface

NA

TCP_TA_TCP_STATE_READ[n]

Req

Number of state reads

0-15

TCP_VOLATILE[n]

Req

Number of L1 volatile pixels or buffers from texture addressing unit

0-15

TCP_TOTAL_ACCESSES[n]

Req

Number of vector L1d accesses. Equals TCP_PERF_SEL_TOTAL_READ + TCP_PERF_SEL_TOTAL_NONREAD

0-15

TCP_TOTAL_READ[n]

Req

Number of vector L1d read accesses

0-15

TCP_TOTAL_WRITE[n]

Req

Number of vector L1d write accesses

0-15

TCP_TOTAL_ATOMIC_WITH_RET[n]

Req

Number of vector L1d atomic requests with return

0-15

TCP_TOTAL_ATOMIC_WITHOUT_RET[n]

Req

Number of vector L1d atomic without return

0-15

TCP_TOTAL_WRITEBACK_INVALIDATES[n]

Count

Total number of vector L1d writebacks and invalidates

0-15

TCP_UTCL1_REQUEST[n]

Req

Number of address translation requests to unified translation cache (L1)

0-15

TCP_UTCL1_TRANSLATION_HIT[n]

Req

Number of unified translation cache (L1) translation hits

0-15

TCP_UTCL1_TRANSLATION_MISS[n]

Req

Number of unified translation cache (L1) translation misses

0-15

TCP_UTCL1_PERMISSION_MISS[n]

Req

Number of unified translation cache (L1) permission misses

0-15

TCP_TOTAL_CACHE_ACCESSES[n]

Req

Number of vector L1d cache accesses including hits and misses

0-15

TCP_TCP_LATENCY[n]

Cycles

(MI200 series only) Accumulated wave access latency to vector L1d over all wavefronts

0-15

TCP_TCC_READ_REQ_LATENCY[n]

Cycles

(MI200 series only) Total vector L1d to L2 request latency over all wavefronts for reads and atomics with return

0-15

TCP_TCC_WRITE_REQ_LATENCY[n]

Cycles

(MI200 series only) Total vector L1d to L2 request latency over all wavefronts for writes and atomics without return

0-15

TCP_TCC_READ_REQ[n]

Req

Number of read requests to L2 cache

0-15

TCP_TCC_WRITE_REQ[n]

Req

Number of write requests to L2 cache

0-15

TCP_TCC_ATOMIC_WITH_RET_REQ[n]

Req

Number of atomic requests to L2 cache with return

0-15

TCP_TCC_ATOMIC_WITHOUT_RET_REQ[n]

Req

Number of atomic requests to L2 cache without return

0-15

TCP_TCC_NC_READ_REQ[n]

Req

Number of non-coherently cached read requests to L2 cache

0-15

TCP_TCC_UC_READ_REQ[n]

Req

Number of uncached read requests to L2 cache

0-15

TCP_TCC_CC_READ_REQ[n]

Req

Number of coherently cached read requests to L2 cache

0-15

TCP_TCC_RW_READ_REQ[n]

Req

Number of coherently cached with write read requests to L2 cache

0-15

TCP_TCC_NC_WRITE_REQ[n]

Req

Number of non-coherently cached write requests to L2 cache

0-15

TCP_TCC_UC_WRITE_REQ[n]

Req

Number of uncached write requests to L2 cache

0-15

TCP_TCC_CC_WRITE_REQ[n]

Req

Number of coherently cached write requests to L2 cache

0-15

TCP_TCC_RW_WRITE_REQ[n]

Req

Number of coherently cached with write write requests to L2 cache

0-15

TCP_TCC_NC_ATOMIC_REQ[n]

Req

Number of non-coherently cached atomic requests to L2 cache

0-15

TCP_TCC_UC_ATOMIC_REQ[n]

Req

Number of uncached atomic requests to L2 cache

0-15

TCP_TCC_CC_ATOMIC_REQ[n]

Req

Number of coherently cached atomic requests to L2 cache

0-15

TCP_TCC_RW_ATOMIC_REQ[n]

Req

Number of coherently cached with write atomic requests to L2 cache

0-15

Note that:

  • TCP_TOTAL_READ[n] = TCP_PERF_SEL_TOTAL_HIT_LRU_READ + TCP_PERF_SEL_TOTAL_MISS_LRU_READ + TCP_PERF_SEL_TOTAL_MISS_EVICT_READ

  • TCP_TOTAL_WRITE[n] = TCP_PERF_SEL_TOTAL_MISS_LRU_WRITE + TCP_PERF_SEL_TOTAL_MISS_EVICT_WRITE

  • TCP_TOTAL_WRITEBACK_INVALIDATES[n] = TCP_PERF_SEL_TOTAL_WBINVL1 + TCP_PERF_SEL_TOTAL_WBINVL1_VOL + TCP_PERF_SEL_CP_TCP_INVALIDATE + TCP_PERF_SEL_SQ_TCP_INVALIDATE_VOL

Texture cache arbiter counters#

Hardware counter

Unit

Definition

Value range for n

TCA_CYCLE[n]

Cycles

Number of texture cache arbiter cycles

0-31

TCA_BUSY[n]

Cycles

Number of cycles texture cache arbiter has a pending request

0-31

L2 cache access counters#

L2 cache is also known as texture cache per channel.

Hardware counter

Unit

Definition

Value range for n

TCC_CYCLE[n]

Cycles

Number of L2 cache free-running clocks

0-31

TCC_BUSY[n]

Cycles

Number of L2 cache busy cycles

0-31

TCC_REQ[n]

Req

Number of L2 cache requests of all types (measured at the tag block)

0-31

TCC_STREAMING_REQ[n]

Req

Number of L2 cache streaming requests (measured at the tag block)

0-31

TCC_NC_REQ[n]

Req

Number of non-coherently cached requests (measured at the tag block)

0-31

TCC_UC_REQ[n]

Req

Number of uncached requests. This is measured at the tag block

0-31

TCC_CC_REQ[n]

Req

Number of coherently cached requests. This is measured at the tag block

0-31

TCC_RW_REQ[n]

Req

Number of coherently cached with write requests. This is measured at the tag block

0-31

TCC_PROBE[n]

Req

Number of probe requests

0-31

TCC_PROBE_ALL[n]

Req

Number of external probe requests with EA_TCC_preq_all == 1

0-31

TCC_READ[n]

Req

Number of L2 cache read requests (includes compressed reads but not metadata reads)

0-31

TCC_WRITE[n]

Req

Number of L2 cache write requests

0-31

TCC_ATOMIC[n]

Req

Number of L2 cache atomic requests of all types

0-31

TCC_HIT[n]

Req

Number of L2 cache hits

0-31

TCC_MISS[n]

Req

Number of L2 cache misses

0-31

TCC_WRITEBACK[n]

Req

Number of lines written back to the main memory, including writebacks of dirty lines and uncached write or atomic requests

0-31

TCC_EA0_WRREQ[n]

Req

Number of 32-byte and 64-byte transactions going over the TC_EA_wrreq interface (doesn’t include probe commands)

0-31

TCC_EA0_WRREQ_64B[n]

Req

Total number of 64-byte transactions (write or CMPSWAP) going over the TC_EA_wrreq interface

0-31

TCC_EA0_WR_UNCACHED_32B[n]

Req

Number of 32-byte or 64-byte write or atomic requests going over the TC_EA_wrreq interface due to uncached traffic

0-31

TCC_EA0_WRREQ_STALL[n]

Cycles

Number of cycles a write request is stalled

0-31

TCC_EA0_WRREQ_IO_CREDIT_STALL[n]

Cycles

Number of cycles an efficiency arbiter write request is stalled due to the interface running out of input-output (IO) credits

0-31

TCC_EA0_WRREQ_GMI_CREDIT_STALL[n]

Cycles

Number of cycles an efficiency arbiter write request is stalled due to the interface running out of GMI credits

0-31

TCC_EA0_WRREQ_DRAM_CREDIT_STALL[n]

Cycles

Number of cycles an efficiency arbiter write request is stalled due to the interface running out of DRAM credits

0-31

TCC_TOO_MANY_EA_WRREQS_STALL[n]

Cycles

Number of cycles the L2 cache is unable to send an efficiency arbiter write request due to it reaching its maximum capacity of pending efficiency arbiter write requests

0-31

TCC_EA0_WRREQ_LEVEL[n]

Req

The accumulated number of efficiency arbiter write requests in flight

0-31

TCC_EA0_ATOMIC[n]

Req

Number of 32-byte or 64-byte atomic requests going over the TC_EA_wrreq interface

0-31

TCC_EA0_ATOMIC_LEVEL[n]

Req

The accumulated number of efficiency arbiter atomic requests in flight

0-31

TCC_EA0_RDREQ[n]

Req

Number of 32-byte or 64-byte read requests to efficiency arbiter

0-31

TCC_EA0_RDREQ_32B[n]

Req

Number of 32-byte read requests to efficiency arbiter

0-31

TCC_EA0_RD_UNCACHED_32B[n]

Req

Number of 32-byte efficiency arbiter reads due to uncached traffic. A 64-byte request is counted as 2

0-31

TCC_EA0_RDREQ_IO_CREDIT_STALL[n]

Cycles

Number of cycles there is a stall due to the read request interface running out of IO credits

0-31

TCC_EA0_RDREQ_GMI_CREDIT_STALL[n]

Cycles

Number of cycles there is a stall due to the read request interface running out of GMI credits

0-31

TCC_EA0_RDREQ_DRAM_CREDIT_STALL[n]

Cycles

Number of cycles there is a stall due to the read request interface running out of DRAM credits

0-31

TCC_EA0_RDREQ_LEVEL[n]

Req

The accumulated number of efficiency arbiter read requests in flight

0-31

TCC_EA0_RDREQ_DRAM[n]

Req

Number of 32-byte or 64-byte efficiency arbiter read requests to High Bandwidth Memory (HBM)

0-31

TCC_EA0_WRREQ_DRAM[n]

Req

Number of 32-byte or 64-byte efficiency arbiter write requests to HBM

0-31

TCC_TAG_STALL[n]

Cycles

Number of cycles the normal request pipeline in the tag is stalled for any reason

0-31

TCC_NORMAL_WRITEBACK[n]

Req

Number of writebacks due to requests that are not writeback requests

0-31

TCC_ALL_TC_OP_WB_WRITEBACK[n]

Req

Number of writebacks due to all TC_OP writeback requests

0-31

TCC_NORMAL_EVICT[n]

Req

Number of evictions due to requests that are not invalidate or probe requests

0-31

TCC_ALL_TC_OP_INV_EVICT[n]

Req

Number of evictions due to all TC_OP invalidate requests

0-31

Hardware counter

Unit

Definition

Value range for n

TCC_CYCLE[n]

Cycles

Number of L2 cache free-running clocks

0-31

TCC_BUSY[n]

Cycles

Number of L2 cache busy cycles

0-31

TCC_REQ[n]

Req

Number of L2 cache requests of all types (measured at the tag block)

0-31

TCC_STREAMING_REQ[n]

Req

Number of L2 cache streaming requests (measured at the tag block)

0-31

TCC_NC_REQ[n]

Req

Number of non-coherently cached requests (measured at the tag block)

0-31

TCC_UC_REQ[n]

Req

Number of uncached requests. This is measured at the tag block

0-31

TCC_CC_REQ[n]

Req

Number of coherently cached requests. This is measured at the tag block

0-31

TCC_RW_REQ[n]

Req

Number of coherently cached with write requests. This is measured at the tag block

0-31

TCC_PROBE[n]

Req

Number of probe requests

0-31

TCC_PROBE_ALL[n]

Req

Number of external probe requests with EA_TCC_preq_all == 1

0-31

TCC_READ[n]

Req

Number of L2 cache read requests (includes compressed reads but not metadata reads)

0-31

TCC_WRITE[n]

Req

Number of L2 cache write requests

0-31

TCC_ATOMIC[n]

Req

Number of L2 cache atomic requests of all types

0-31

TCC_HIT[n]

Req

Number of L2 cache hits

0-31

TCC_MISS[n]

Req

Number of L2 cache misses

0-31

TCC_WRITEBACK[n]

Req

Number of lines written back to the main memory, including writebacks of dirty lines and uncached write or atomic requests

0-31

TCC_EA_WRREQ[n]

Req

Number of 32-byte and 64-byte transactions going over the TC_EA_wrreq interface (doesn’t include probe commands)

0-31

TCC_EA_WRREQ_64B[n]

Req

Total number of 64-byte transactions (write or CMPSWAP) going over the TC_EA_wrreq interface

0-31

TCC_EA_WR_UNCACHED_32B[n]

Req

Number of 32-byte write or atomic requests going over the TC_EA_wrreq interface due to uncached traffic. A 64-byte request is counted as 2

0-31

TCC_EA_WRREQ_STALL[n]

Cycles

Number of cycles a write request is stalled

0-31

TCC_EA_WRREQ_IO_CREDIT_STALL[n]

Cycles

Number of cycles an efficiency arbiter write request is stalled due to the interface running out of input-output (IO) credits

0-31

TCC_EA_WRREQ_GMI_CREDIT_STALL[n]

Cycles

Number of cycles an efficiency arbiter write request is stalled due to the interface running out of GMI credits

0-31

TCC_EA_WRREQ_DRAM_CREDIT_STALL[n]

Cycles

Number of cycles an efficiency arbiter write request is stalled due to the interface running out of DRAM credits

0-31

TCC_TOO_MANY_EA_WRREQS_STALL[n]

Cycles

Number of cycles the L2 cache is unable to send an efficiency arbiter write request due to it reaching its maximum capacity of pending efficiency arbiter write requests

0-31

TCC_EA_WRREQ_LEVEL[n]

Req

The accumulated number of efficiency arbiter write requests in flight

0-31

TCC_EA_ATOMIC[n]

Req

Number of 32-byte or 64-byte atomic requests going over the TC_EA_wrreq interface

0-31

TCC_EA_ATOMIC_LEVEL[n]

Req

The accumulated number of efficiency arbiter atomic requests in flight

0-31

TCC_EA_RDREQ[n]

Req

Number of 32-byte or 64-byte read requests to efficiency arbiter

0-31

TCC_EA_RDREQ_32B[n]

Req

Number of 32-byte read requests to efficiency arbiter

0-31

TCC_EA_RD_UNCACHED_32B[n]

Req

Number of 32-byte efficiency arbiter reads due to uncached traffic. A 64-byte request is counted as 2

0-31

TCC_EA_RDREQ_IO_CREDIT_STALL[n]

Cycles

Number of cycles there is a stall due to the read request interface running out of IO credits

0-31

TCC_EA_RDREQ_GMI_CREDIT_STALL[n]

Cycles

Number of cycles there is a stall due to the read request interface running out of GMI credits

0-31

TCC_EA_RDREQ_DRAM_CREDIT_STALL[n]

Cycles

Number of cycles there is a stall due to the read request interface running out of DRAM credits

0-31

TCC_EA_RDREQ_LEVEL[n]

Req

The accumulated number of efficiency arbiter read requests in flight

0-31

TCC_EA_RDREQ_DRAM[n]

Req

Number of 32-byte or 64-byte efficiency arbiter read requests to High Bandwidth Memory (HBM)

0-31

TCC_EA_WRREQ_DRAM[n]

Req

Number of 32-byte or 64-byte efficiency arbiter write requests to HBM

0-31

TCC_TAG_STALL[n]

Cycles

Number of cycles the normal request pipeline in the tag is stalled for any reason

0-31

TCC_NORMAL_WRITEBACK[n]

Req

Number of writebacks due to requests that are not writeback requests

0-31

TCC_ALL_TC_OP_WB_WRITEBACK[n]

Req

Number of writebacks due to all TC_OP writeback requests

0-31

TCC_NORMAL_EVICT[n]

Req

Number of evictions due to requests that are not invalidate or probe requests

0-31

TCC_ALL_TC_OP_INV_EVICT[n]

Req

Number of evictions due to all TC_OP invalidate requests

0-31

Note the following:

  • TCC_REQ[n] may be more than the number of requests arriving at the texture cache per channel, but it’s a good indication of the total amount of work that needs to be performed.

  • For TCC_EA0_WRREQ[n], atomics may travel over the same interface and are generally classified as write requests.

  • Coherently cached (CC) mtypes can produce uncached requests, and those are included in TCC_EA0_WR_UNCACHED_32B[n].

  • TCC_EA0_WRREQ_LEVEL[n] is primarily intended to measure average efficiency arbiter write latency.

    • Average write latency = TCC_PERF_SEL_EA0_WRREQ_LEVEL divided by TCC_PERF_SEL_EA0_WRREQ

  • TCC_EA0_ATOMIC_LEVEL[n] is primarily intended to measure average efficiency arbiter atomic latency

    • Average atomic latency = TCC_PERF_SEL_EA0_WRREQ_ATOMIC_LEVEL divided by TCC_PERF_SEL_EA0_WRREQ_ATOMIC

  • TCC_EA0_RDREQ_LEVEL[n] is primarily intended to measure average efficiency arbiter read latency.

    • Average read latency = TCC_PERF_SEL_EA0_RDREQ_LEVEL divided by TCC_PERF_SEL_EA0_RDREQ

  • Stalls can occur regardless of the need for a read to be performed.

  • Normally, stalls are measured at exactly one point in the pipeline; however, in the case of TCC_TAG_STALL[n], probes can stall the pipeline at a variety of places, so there is no single point that can accurately measure the total stalls.

MI300 and MI200 series derived metrics list#

Hardware counter

Definition

ALUStalledByLDS

Percentage of GPU time ALU units are stalled due to the LDS input queue being full or the output queue not being ready (value range: 0% (optimal) to 100%)

FetchSize

Total kilobytes fetched from the video memory; measured with all extra fetches and any cache or memory effects taken into account

FlatLDSInsts

Average number of flat instructions that read from or write to LDS, run per work item (affected by flow control)

FlatVMemInsts

Average number of flat instructions that read from or write to the video memory, run per work item (affected by flow control). Includes flat instructions that read from or write to scratch

GDSInsts

Average number of global data share read or write instructions run per work item (affected by flow control)

GPUBusy

Percentage of time GPU is busy

L2CacheHit

Percentage of fetch, write, atomic, and other instructions that hit the data in L2 cache (value range: 0% (no hit) to 100% (optimal))

LDSBankConflict

Percentage of GPU time LDS is stalled by bank conflicts (value range: 0% (optimal) to 100%)

LDSInsts

Average number of LDS read or write instructions run per work item (affected by flow control). Excludes flat instructions that read from or write to LDS.

MemUnitBusy

Percentage of GPU time the memory unit is active, which is measured with all extra fetches and writes and any cache or memory effects taken into account (value range: 0% to 100% (fetch-bound))

MemUnitStalled

Percentage of GPU time the memory unit is stalled (value range: 0% (optimal) to 100%)

MemWrites32B

Total number of effective 32B write transactions to the memory

TCA_BUSY_sum

Total number of cycles texture cache arbiter has a pending request, over all texture cache arbiter instances

TCA_CYCLE_sum

Total number of cycles over all texture cache arbiter instances

SALUBusy

Percentage of GPU time scalar ALU instructions are processed (value range: 0% to 100% (optimal))

SALUInsts

Average number of scalar ALU instructions run per work item (affected by flow control)

SFetchInsts

Average number of scalar fetch instructions from the video memory run per work item (affected by flow control)

VALUBusy

Percentage of GPU time vector ALU instructions are processed (value range: 0% to 100% (optimal))

VALUInsts

Average number of vector ALU instructions run per work item (affected by flow control)

VALUUtilization

Percentage of active vector ALU threads in a wave, where a lower number can mean either more thread divergence in a wave or that the work-group size is not a multiple of 64 (value range: 0% to 100%; 100% is optimal, meaning no thread divergence)

VFetchInsts

Average number of vector fetch instructions from the video memory run per work-item (affected by flow control); excludes flat instructions that fetch from video memory

VWriteInsts

Average number of vector write instructions to the video memory run per work-item (affected by flow control); excludes flat instructions that write to video memory

Wavefronts

Total wavefronts

WRITE_REQ_32B

Total number of 32-byte effective memory writes

WriteSize

Total kilobytes written to the video memory; measured with all extra fetches and any cache or memory effects taken into account

WriteUnitStalled

Percentage of GPU time the write unit is stalled (value range: 0% (optimal) to 100%)

You can lower ALUStalledByLDS by reducing LDS bank conflicts or the number of LDS accesses. You can lower MemUnitStalled by reducing the number or size of fetches and writes. MemUnitBusy includes the stall time (MemUnitStalled).

Hardware counters by and over all texture addressing unit instances#

The following table shows the hardware counters by all texture addressing unit instances.

Hardware counter

Definition

TA_BUFFER_WAVEFRONTS_sum

Total number of buffer wavefronts processed

TA_BUFFER_READ_WAVEFRONTS_sum

Total number of buffer read wavefronts processed

TA_BUFFER_WRITE_WAVEFRONTS_sum

Total number of buffer write wavefronts processed

TA_BUFFER_ATOMIC_WAVEFRONTS_sum

Total number of buffer atomic wavefronts processed

TA_BUFFER_TOTAL_CYCLES_sum

Total number of buffer cycles (including read and write) issued to texture cache

TA_BUFFER_COALESCED_READ_CYCLES_sum

Total number of coalesced buffer read cycles issued to texture cache

TA_BUFFER_COALESCED_WRITE_CYCLES_sum

Total number of coalesced buffer write cycles issued to texture cache

TA_FLAT_READ_WAVEFRONTS_sum

Sum of flat opcode reads processed

TA_FLAT_WRITE_WAVEFRONTS_sum

Sum of flat opcode writes processed

TA_FLAT_WAVEFRONTS_sum

Total number of flat opcode wavefronts processed

TA_FLAT_READ_WAVEFRONTS_sum

Total number of flat opcode read wavefronts processed

TA_FLAT_ATOMIC_WAVEFRONTS_sum

Total number of flat opcode atomic wavefronts processed

TA_TOTAL_WAVEFRONTS_sum

Total number of wavefronts processed

The following table shows the hardware counters over all texture addressing unit instances.

Hardware counter

Definition

TA_ADDR_STALLED_BY_TC_CYCLES_sum

Total number of cycles texture addressing unit address path is stalled by texture cache

TA_ADDR_STALLED_BY_TD_CYCLES_sum

Total number of cycles texture addressing unit address path is stalled by texture data unit

TA_BUSY_avr

Average number of busy cycles

TA_BUSY_max

Maximum number of texture addressing unit busy cycles

TA_BUSY_min

Minimum number of texture addressing unit busy cycles

TA_DATA_STALLED_BY_TC_CYCLES_sum

Total number of cycles texture addressing unit data path is stalled by texture cache

TA_TA_BUSY_sum

Total number of texture addressing unit busy cycles

Hardware counters over all texture cache per channel instances#

Hardware counter

Definition

TCC_ALL_TC_OP_WB_WRITEBACK_sum

Total number of writebacks due to all TC_OP writeback requests.

TCC_ALL_TC_OP_INV_EVICT_sum

Total number of evictions due to all TC_OP invalidate requests.

TCC_ATOMIC_sum

Total number of L2 cache atomic requests of all types.

TCC_BUSY_avr

Average number of L2 cache busy cycles.

TCC_BUSY_sum

Total number of L2 cache busy cycles.

TCC_CC_REQ_sum

Total number of coherently cached requests.

TCC_CYCLE_sum

Total number of L2 cache free running clocks.

TCC_EA0_WRREQ_sum

Total number of 32-byte and 64-byte transactions going over the TC_EA0_wrreq interface. Atomics may travel over the same interface and are generally classified as write requests. This does not include probe commands.

TCC_EA0_WRREQ_64B_sum

Total number of 64-byte transactions (write or CMPSWAP) going over the TC_EA0_wrreq interface.

TCC_EA0_WR_UNCACHED_32B_sum

Total number of 32-byte write or atomic requests going over the TC_EA0_wrreq interface due to uncached traffic. Note that coherently cached mtypes can produce uncached requests, and those are included in this. A 64-byte request is counted as 2.

TCC_EA0_WRREQ_STALL_sum

Total number of cycles a write request is stalled, over all instances.

TCC_EA0_WRREQ_IO_CREDIT_STALL_sum

Total number of cycles an efficiency arbiter write request is stalled due to the interface running out of IO credits, over all instances.

TCC_EA0_WRREQ_GMI_CREDIT_STALL_sum

Total number of cycles an efficiency arbiter write request is stalled due to the interface running out of GMI credits, over all instances.

TCC_EA0_WRREQ_DRAM_CREDIT_STALL_sum

Total number of cycles an efficiency arbiter write request is stalled due to the interface running out of DRAM credits, over all instances.

TCC_EA0_WRREQ_LEVEL_sum

Total number of efficiency arbiter write requests in flight.

TCC_EA0_RDREQ_LEVEL_sum

Total number of efficiency arbiter read requests in flight.

TCC_EA0_ATOMIC_sum

Total number of 32-byte or 64-byte atomic requests going over the TC_EA0_wrreq interface.

TCC_EA0_ATOMIC_LEVEL_sum

Total number of efficiency arbiter atomic requests in flight.

TCC_EA0_RDREQ_sum

Total number of 32-byte or 64-byte read requests to efficiency arbiter.

TCC_EA0_RDREQ_32B_sum

Total number of 32-byte read requests to efficiency arbiter.

TCC_EA0_RD_UNCACHED_32B_sum

Total number of 32-byte efficiency arbiter reads due to uncached traffic.

TCC_EA0_RDREQ_IO_CREDIT_STALL_sum

Total number of cycles there is a stall due to the read request interface running out of IO credits.

TCC_EA0_RDREQ_GMI_CREDIT_STALL_sum

Total number of cycles there is a stall due to the read request interface running out of GMI credits.

TCC_EA0_RDREQ_DRAM_CREDIT_STALL_sum

Total number of cycles there is a stall due to the read request interface running out of DRAM credits.

TCC_EA0_RDREQ_DRAM_sum

Total number of 32-byte or 64-byte efficiency arbiter read requests to HBM.

TCC_EA0_WRREQ_DRAM_sum

Total number of 32-byte or 64-byte efficiency arbiter write requests to HBM.

TCC_HIT_sum

Total number of L2 cache hits.

TCC_MISS_sum

Total number of L2 cache misses.

TCC_NC_REQ_sum

Total number of non-coherently cached requests.

TCC_NORMAL_WRITEBACK_sum

Total number of writebacks due to requests that are not writeback requests.

TCC_NORMAL_EVICT_sum

Total number of evictions due to requests that are not invalidate or probe requests.

TCC_PROBE_sum

Total number of probe requests.

TCC_PROBE_ALL_sum

Total number of external probe requests with EA0_TCC_preq_all == 1.

TCC_READ_sum

Total number of L2 cache read requests (including compressed reads but not metadata reads).

TCC_REQ_sum

Total number of all types of L2 cache requests.

TCC_RW_REQ_sum

Total number of coherently cached with write requests.

TCC_STREAMING_REQ_sum

Total number of L2 cache streaming requests.

TCC_TAG_STALL_sum

Total number of cycles the normal request pipeline in the tag is stalled for any reason.

TCC_TOO_MANY_EA0_WRREQS_STALL_sum

Total number of cycles L2 cache is unable to send an efficiency arbiter write request due to it reaching its maximum capacity of pending efficiency arbiter write requests.

TCC_UC_REQ_sum

Total number of uncached requests.

TCC_WRITE_sum

Total number of L2 cache write requests.

TCC_WRITEBACK_sum

Total number of lines written back to the main memory including writebacks of dirty lines and uncached write or atomic requests.

TCC_WRREQ_STALL_max

Maximum number of cycles a write request is stalled.

Hardware counters by, for, or over all texture cache per pipe instances#

The following table shows the hardware counters by all texture cache per pipe instances.

Hardware counter

Definition

TCP_TA_TCP_STATE_READ_sum

Total number of state reads by ATCPPI

TCP_TOTAL_CACHE_ACCESSES_sum

Total number of vector L1d accesses (including hits and misses)

TCP_UTCL1_PERMISSION_MISS_sum

Total number of unified translation cache (L1) permission misses

TCP_UTCL1_REQUEST_sum

Total number of address translation requests to unified translation cache (L1)

TCP_UTCL1_TRANSLATION_MISS_sum

Total number of unified translation cache (L1) translation misses

TCP_UTCL1_TRANSLATION_HIT_sum

Total number of unified translation cache (L1) translation hits

The following table shows the hardware counters for all texture cache per pipe instances.

Hardware counter

Definition

TCP_TCC_READ_REQ_LATENCY_sum

Total vector L1d to L2 request latency over all wavefronts for reads and atomics with return

TCP_TCC_WRITE_REQ_LATENCY_sum

Total vector L1d to L2 request latency over all wavefronts for writes and atomics without return

TCP_TCP_LATENCY_sum

Total wave access latency to vector L1d over all wavefronts

The following table shows the hardware counters over all texture cache per pipe instances.

Hardware counter

Definition

TCP_ATOMIC_TAGCONFLICT_STALL_CYCLES_sum

Total number of cycles tag RAM conflict stalls on an atomic

TCP_GATE_EN1_sum

Total number of cycles vector L1d interface clocks are turned on

TCP_GATE_EN2_sum

Total number of cycles vector L1d core clocks are turned on

TCP_PENDING_STALL_CYCLES_sum

Total number of cycles vector L1d cache is stalled due to data pending from L2 Cache

TCP_READ_TAGCONFLICT_STALL_CYCLES_sum

Total number of cycles tag RAM conflict stalls on a read

TCP_TCC_ATOMIC_WITH_RET_REQ_sum

Total number of atomic requests to L2 cache with return

TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum

Total number of atomic requests to L2 cache without return

TCP_TCC_CC_READ_REQ_sum

Total number of coherently cached read requests to L2 cache

TCP_TCC_CC_WRITE_REQ_sum

Total number of coherently cached write requests to L2 cache

TCP_TCC_CC_ATOMIC_REQ_sum

Total number of coherently cached atomic requests to L2 cache

TCP_TCC_NC_READ_REQ_sum

Total number of non-coherently cached read requests to L2 cache

TCP_TCC_NC_WRITE_REQ_sum

Total number of non-coherently cached write requests to L2 cache

TCP_TCC_NC_ATOMIC_REQ_sum

Total number of non-coherently cached atomic requests to L2 cache

TCP_TCC_READ_REQ_sum

Total number of read requests to L2 cache

TCP_TCC_RW_READ_REQ_sum

Total number of coherently cached with write read requests to L2 cache

TCP_TCC_RW_WRITE_REQ_sum

Total number of coherently cached with write write requests to L2 cache

TCP_TCC_RW_ATOMIC_REQ_sum

Total number of coherently cached with write atomic requests to L2 cache

TCP_TCC_UC_READ_REQ_sum

Total number of uncached read requests to L2 cache

TCP_TCC_UC_WRITE_REQ_sum

Total number of uncached write requests to L2 cache

TCP_TCC_UC_ATOMIC_REQ_sum

Total number of uncached atomic requests to L2 cache

TCP_TCC_WRITE_REQ_sum

Total number of write requests to L2 cache

TCP_TCR_TCP_STALL_CYCLES_sum

Total number of cycles texture cache router stalls vector L1d

TCP_TD_TCP_STALL_CYCLES_sum

Total number of cycles texture data unit stalls vector L1d

TCP_TOTAL_ACCESSES_sum

Total number of vector L1d accesses

TCP_TOTAL_READ_sum

Total number of vector L1d read accesses

TCP_TOTAL_WRITE_sum

Total number of vector L1d write accesses

TCP_TOTAL_ATOMIC_WITH_RET_sum

Total number of vector L1d atomic requests with return

TCP_TOTAL_ATOMIC_WITHOUT_RET_sum

Total number of vector L1d atomic requests without return

TCP_TOTAL_WRITEBACK_INVALIDATES_sum

Total number of vector L1d writebacks and invalidates

TCP_VOLATILE_sum

Total number of L1 volatile pixels or buffers from texture addressing unit

TCP_WRITE_TAGCONFLICT_STALL_CYCLES_sum

Total number of cycles tag RAM conflict stalls on a write

Hardware counters over all texture data unit instances#

Hardware counter

Definition

TD_ATOMIC_WAVEFRONT_sum

Total number of atomic wavefront instructions

TD_COALESCABLE_WAVEFRONT_sum

Total number of coalescable wavefronts according to texture addressing unit

TD_LOAD_WAVEFRONT_sum

Total number of wavefront instructions (read, write, atomic)

TD_SPI_STALL_sum

Total number of cycles texture data unit is stalled by shader processor input

TD_STORE_WAVEFRONT_sum

Total number of write wavefront instructions

TD_TC_STALL_sum

Total number of cycles texture data unit is stalled waiting for texture cache data

TD_TD_BUSY_sum

Total number of texture data unit busy cycles while it is processing or waiting for data

AMD Instinct™ MI250 microarchitecture#

The microarchitecture of the AMD Instinct MI250 accelerators is based on the AMD CDNA 2 architecture, which targets compute applications such as HPC, artificial intelligence (AI), and machine learning (ML) that run on everything from individual servers to the world’s largest exascale supercomputers. The overall system architecture is designed for extreme scalability and compute performance.

The following image shows the components of a single Graphics Compute Die (GCD) of the CDNA 2 architecture. On the top and the bottom are AMD Infinity Fabric™ interfaces and their physical links that are used to connect the GPU die to the other system-level components of the node (see also Section 2.2). Both interfaces can drive four AMD Infinity Fabric links. One of the AMD Infinity Fabric links of the controller at the bottom can be configured as a PCIe link. Each of the AMD Infinity Fabric links between GPUs can run at up to 25 GT/sec, which correlates to a peak transfer bandwidth of 50 GB/sec for a 16-wide link (two bytes per transaction). Section 2.2 has more details on the number of AMD Infinity Fabric links and the resulting transfer rates between the system-level components.
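
As a quick check, the per-link figure follows directly from the stated transfer rate and the two-byte link width:

$$ 25\ \text{GT/s} \times 2\ \text{bytes per transaction} = 50\ \text{GB/s per 16-wide link} $$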

To the left and the right are memory controllers that attach the High Bandwidth Memory (HBM) modules to the GCD. AMD Instinct MI250 GPUs use HBM2e, which offers a peak memory bandwidth of 1.6 TB/sec per GCD.

The execution units of the GPU are depicted in the following image as Compute Units (CU). The MI250 GCD has 104 active CUs. Each compute unit is further subdivided into four SIMD units that process SIMD instructions of 16 data elements per instruction (for the FP64 data type). This enables the CU to process 64 work items (a so-called “wavefront”) at a peak clock frequency of 1.7 GHz. Therefore, the theoretical maximum FP64 peak performance per GCD is 22.6 TFLOPS for vector instructions (45.3 TFLOPS for the two GCDs of an OAM). The MI250 compute units also provide specialized execution units (also called matrix cores), which are geared toward executing matrix operations like matrix-matrix multiplications. For FP64, the peak performance of these units amounts to 45.3 TFLOPS per GCD (90.5 TFLOPS per OAM).
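
The per-GCD vector FP64 figure can be reproduced from the table below, using its 128 FLOPS/clock/CU entry (4 SIMD units × 16 lanes × 2 FLOPs per fused multiply-add, the usual FMA counting convention):

$$ 104\ \text{CU} \times 128\ \tfrac{\text{FLOPS}}{\text{clock}\cdot\text{CU}} \times 1.7\ \text{GHz} \approx 22.6\ \text{TFLOPS per GCD} $$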

Structure of a single GCD in the AMD Instinct MI250 accelerator.

Peak-performance capabilities of the MI250 OAM for different data types.#

Computation and Data Type

FLOPS/CLOCK/CU

Peak TFLOPS

Matrix FP64

256

90.5

Vector FP64

128

45.3

Matrix FP32

256

90.5

Packed FP32

256

90.5

Vector FP32

128

45.3

Matrix FP16

1024

362.1

Matrix BF16

1024

362.1

Matrix INT8

1024

362.1

The above table summarizes the aggregated peak performance of the AMD Instinct MI250 Open Compute Platform (OCP) Accelerator Module (OAM) and its two GCDs for different data types and execution units. The middle column lists the per-compute-unit throughput (operations per clock cycle), assuming a SIMD (or matrix) instruction is retired in each clock cycle. The third column lists the theoretical peak performance of the OAM module. The theoretical aggregated peak memory bandwidth of the GPU is 3.2 TB/sec (1.6 TB/sec per GCD).

Dual-GCD architecture of the AMD Instinct MI250 accelerators

The following image shows the block diagram of an OAM package that consists of two GCDs, each of which constitutes one GPU device in the system. The two GCDs in the package are connected via four AMD Infinity Fabric links running at a theoretical peak rate of 25 GT/sec, giving 200 GB/sec peak transfer bandwidth between the two GCDs of an OAM, or 400 GB/sec of bidirectional peak transfer bandwidth.

Node-level architecture#

The following image shows the node-level architecture of a system that is based on the AMD Instinct MI250 accelerator. The MI250 OAMs attach to the host system via PCIe Gen 4 x16 links (yellow lines). Each GCD maintains its own PCIe x16 link to the host part of the system. Depending on the server platform, the GCD can attach to the AMD EPYC processor directly or via an optional PCIe switch. Note that some platforms may offer an x8 interface to the GCDs, which reduces the available host-to-GPU bandwidth.

Block diagram of AMD Instinct MI250 Accelerators with 3rd Generation AMD EPYC processor

The preceding image shows the node-level architecture of a system with AMD EPYC processors in a dual-socket configuration and four AMD Instinct MI250 accelerators. The MI250 OAMs attach to the host processors via PCIe Gen 4 x16 links (yellow lines). Depending on the system design, a PCIe switch may exist to make more PCIe lanes available for additional components like network interfaces and/or storage devices. Each GCD maintains its own PCIe x16 link to the host part of the system or to the PCIe switch. Note that some platforms may offer an x8 interface to the GCDs, which reduces the available host-to-GPU bandwidth.

Between the OAMs and their respective GCDs, a peer-to-peer (P2P) network allows for direct data exchange between the GPU dies via AMD Infinity Fabric links (black, green, and red lines). Each of these 16-wide links connects to one of the two GPU dies in the MI250 OAM and operates at 25 GT/sec, which corresponds to a theoretical peak transfer rate of 50 GB/sec per link (or 100 GB/sec bidirectional peak transfer bandwidth). The GCD pairs 2 and 6 as well as GCDs 0 and 4 connect via two XGMI links, which is indicated by the thicker red line in the preceding image.

AMD Instinct™ MI100 microarchitecture#

The following image shows the node-level architecture of a system that comprises two AMD EPYC™ processors and (up to) eight AMD Instinct™ accelerators. The two EPYC processors are connected to each other with the AMD Infinity™ fabric, which provides high-bandwidth (up to 18 GT/sec), coherent links such that each processor can access the available node memory as a single shared-memory domain in a non-uniform memory architecture (NUMA) fashion. In a 2P, or dual-socket, configuration, three AMD Infinity™ fabric links are available to connect the processors, and one PCIe Gen 4 x16 link per processor can attach additional I/O devices such as the host adapters for the network fabric.

Structure of a single GCD in the AMD Instinct MI100 accelerator

In a typical node configuration, each processor can host up to four AMD Instinct™ accelerators that are attached using PCIe Gen 4 links at 16 GT/sec, which corresponds to a peak bidirectional link bandwidth of 32 GB/sec. Each hive of four accelerators can participate in a fully connected, coherent AMD Infinity™ fabric that connects the four accelerators using 23 GT/sec AMD Infinity fabric links that run at a higher frequency than the inter-processor links. This inter-GPU link can be established in certified server systems if the GPUs are mounted in neighboring PCIe slots by installing the AMD Infinity Fabric™ bridge for the AMD Instinct™ accelerators.

Microarchitecture#

The microarchitecture of the AMD Instinct accelerators is based on the AMD CDNA architecture, which targets compute applications such as high-performance computing (HPC) and AI & machine learning (ML) that run on everything from individual servers to the world’s largest exascale supercomputers. The overall system architecture is designed for extreme scalability and compute performance.

Structure of the AMD Instinct accelerator (MI100 generation)

The above image shows the AMD Instinct accelerator with its PCIe Gen 4 x16 link (16 GT/sec, at the bottom) that connects the GPU to (one of) the host processor(s). It also shows the three AMD Infinity Fabric ports that provide high-speed links (23 GT/sec, also at the bottom) to the other GPUs of the local hive.

On the left and right of the floor plan, the High Bandwidth Memory (HBM) attaches via the GPU memory controller. The MI100 generation of the AMD Instinct accelerator offers four stacks of HBM generation 2 (HBM2) for a total of 32 GB with a 4,096-bit-wide memory interface. The peak memory bandwidth of the attached HBM2 is 1.228 TB/sec at a memory clock frequency of 1.2 GHz.

The execution units of the GPU are depicted in the above image as Compute Units (CU). There are a total of 120 compute units, physically organized into eight Shader Engines (SE) with fifteen compute units per shader engine. Each compute unit is further subdivided into four SIMD units that process SIMD instructions of 16 data elements per instruction. This enables the CU to process 64 data elements (a so-called “wavefront”) at a peak clock frequency of 1.5 GHz. Therefore, the theoretical maximum FP64 peak performance is 11.5 TFLOPS (4 [SIMD units] x 16 [elements per instruction] x 120 [CU] x 1.5 [GHz]).

Block diagram of an MI100 compute unit with detailed SIMD view of the AMD CDNA architecture

The preceding image shows the block diagram of a single CU of an AMD Instinct™ MI100 accelerator and summarizes how instructions flow through the execution engines. The CU fetches the instructions via a 32KB instruction cache and moves them forward to execution via a dispatcher. The CU can handle up to ten wavefronts at a time and feed their instructions into the execution unit. The execution unit contains 256 vector general-purpose registers (VGPR) and 800 scalar general-purpose registers (SGPR). The VGPR and SGPR are dynamically allocated to the executing wavefronts. A wavefront can access a maximum of 102 scalar registers. Excess scalar-register usage will cause register spilling and thus may affect execution performance.

A wavefront can occupy any number of VGPRs from 0 to 256, directly affecting occupancy; that is, the number of concurrently active wavefronts in the CU. For instance, with 119 VGPRs used, only two wavefronts can be active in the CU at the same time. With the instruction latency of four cycles per SIMD instruction, the occupancy should be as high as possible such that the compute unit can improve execution efficiency by scheduling instructions from multiple wavefronts.

Peak-performance capabilities of MI100 for different data types.#

Computation and Data Type

FLOPS/CLOCK/CU

Peak TFLOPS

Vector FP64

64

11.5

Matrix FP32

256

46.1

Vector FP32

128

23.1

Matrix FP16

1024

184.6

Matrix BF16

512

92.3

GPU memory#

For full memory management API details, see the HIP reference documentation.

Host memory exists on the host (e.g. CPU) of the machine in random access memory (RAM).

Device memory exists on the device (e.g. GPU) of the machine in video random access memory (VRAM). Recent architectures use graphics double data rate (GDDR) synchronous dynamic random-access memory (SDRAM), such as GDDR6, or high-bandwidth memory (HBM), such as HBM2e.

Memory allocation#

Memory can be allocated in two main ways: pageable memory and pinned memory. The following API calls result in these allocations:

API

Data location

Allocation

System allocated

Host

Pageable

hipMallocManaged

Host

Managed

hipHostMalloc

Host

Pinned

hipMalloc

Device

Pinned

Tip

hipMalloc and hipFree are blocking calls. However, HIP also provides non-blocking versions, hipMallocAsync and hipFreeAsync, which take a stream as an additional argument.
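
The following HIP sketch exercises each allocation API from the table above, plus the stream-ordered variants from the tip; the buffer size is arbitrary and error checking is omitted for brevity.

#include <hip/hip_runtime.h>
#include <cstdlib>

int main() {
  const size_t bytes = 1 << 20;  // 1 MiB per allocation, purely illustrative

  // System allocated: pageable host memory.
  void* pageable = std::malloc(bytes);

  // hipMallocManaged: managed (migratable) memory.
  void* managed = nullptr;
  hipMallocManaged(&managed, bytes);

  // hipHostMalloc: pinned (page-locked) host memory.
  void* pinned = nullptr;
  hipHostMalloc(&pinned, bytes, hipHostMallocDefault);

  // hipMalloc: device memory.
  void* device = nullptr;
  hipMalloc(&device, bytes);

  // Stream-ordered, non-blocking allocation and free.
  hipStream_t stream;
  hipStreamCreate(&stream);
  void* device_async = nullptr;
  hipMallocAsync(&device_async, bytes, stream);
  hipFreeAsync(device_async, stream);
  hipStreamSynchronize(stream);
  hipStreamDestroy(stream);

  hipFree(device);
  hipHostFree(pinned);
  hipFree(managed);
  std::free(pageable);
  return 0;
}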

Pageable memory#

Pageable memory is usually obtained by calling malloc or new in a C++ application. It is unique in that it exists on “pages” (blocks of memory) that can be migrated to other memory storage, for example, when memory is migrated between CPU sockets on a motherboard, or when a system runs out of space in RAM and starts dumping pages into the swap partition of the hard drive.

Pinned memory#

Pinned memory (or page-locked memory, or non-pageable memory) is host memory that is mapped into the address space of all GPUs, meaning that the pointer can be used on both host and device. Accessing host-resident pinned memory in device kernels is generally not recommended for performance, as it can force the data to traverse the host-device interconnect (e.g. PCIe), which is much slower than the on-device bandwidth (>40x on MI200).

Pinned host memory can be allocated with one of two types of coherence support: coherent (the default) or non-coherent.

Note

In HIP, pinned memory allocations are coherent by default (hipHostMallocDefault). There are additional pinned memory flags (e.g. hipHostMallocMapped and hipHostMallocPortable). On MI200 these options do not impact performance.

For more information, see the memory allocation flags section in the HIP programming guide.

Much like how a process can be locked to a CPU core by setting affinity, a pinned memory allocator does this with the memory storage system. On multi-socket systems it is important to ensure that pinned memory is located on the same socket as the owning process, or else each cache line will be moved through the CPU-CPU interconnect, thereby increasing latency and potentially decreasing bandwidth.

In practice, pinned memory is used to improve transfer times between host and device. For transfer operations, such as hipMemcpy or hipMemcpyAsync, using pinned memory instead of pageable memory on host can lead to a ~3x improvement in bandwidth.
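A minimal sketch of such a transfer, staging through a pinned host buffer; the buffer size and stream handling are illustrative, and error checking is omitted.

#include <hip/hip_runtime.h>

int main() {
  const size_t bytes = (1 << 20) * sizeof(float);  // illustrative buffer size

  // Pinned host staging buffer and a device destination buffer.
  void* h_pinned = nullptr;
  hipHostMalloc(&h_pinned, bytes, hipHostMallocDefault);
  void* d_buf = nullptr;
  hipMalloc(&d_buf, bytes);

  hipStream_t stream;
  hipStreamCreate(&stream);

  // Because the source is pinned, the copy can proceed asynchronously and
  // typically achieves higher bandwidth than a copy from pageable memory.
  hipMemcpyAsync(d_buf, h_pinned, bytes, hipMemcpyHostToDevice, stream);
  hipStreamSynchronize(stream);

  hipStreamDestroy(stream);
  hipFree(d_buf);
  hipHostFree(h_pinned);
  return 0;
}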

Tip

If the application needs to move data back and forth between device and host (separate allocations), use pinned memory on the host side.

Managed memory#

Managed memory refers to the universally addressable, or unified, memory available on the MI200 series of GPUs. Much like pinned memory, managed memory shares a pointer between host and device and (by default) supports fine-grained coherence; however, managed memory can also automatically migrate pages between host and device. The allocation is managed by the AMD GPU driver using the Linux HMM (Heterogeneous Memory Management) mechanism.

If heterogeneous memory management (HMM) is not available, hipMallocManaged falls back to using system memory and acts like pinned host memory; other managed memory API calls then have undefined behavior. It is therefore recommended to check for managed memory capability by calling hipDeviceGetAttribute with the hipDeviceAttributeManagedMemory attribute.

HIP supports additional calls that work with page migration:

  • hipMemAdvise

  • hipMemPrefetchAsync
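
A minimal sketch that checks for managed memory support, allocates a managed buffer, and then uses the page-migration hints above; the buffer size, the preferred-location advice, and the use of the null stream are illustrative choices.

#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
  int device_id = 0;
  hipGetDevice(&device_id);

  // Check managed memory support before relying on it.
  int managed_supported = 0;
  hipDeviceGetAttribute(&managed_supported, hipDeviceAttributeManagedMemory, device_id);
  if (!managed_supported) {
    std::printf("Managed memory is not supported on device %d\n", device_id);
    return 0;
  }

  const size_t count = 1 << 20;
  float* data = nullptr;
  hipMallocManaged(reinterpret_cast<void**>(&data), count * sizeof(float));

  // Touch the data on the host; pages migrate to the device on first GPU
  // access, or eagerly when prefetched as below.
  for (size_t i = 0; i < count; ++i) {
    data[i] = 1.0f;
  }

  hipMemAdvise(data, count * sizeof(float), hipMemAdviseSetPreferredLocation, device_id);
  hipMemPrefetchAsync(data, count * sizeof(float), device_id, 0);
  hipDeviceSynchronize();

  hipFree(data);
  return 0;
}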

Tip

If the application needs to use data on both host and device regularly, does not want to deal with separate allocations, and is not worried about maxing out the VRAM on MI200 GPUs (64 GB per GCD), use managed memory.

Tip

If managed memory performance is poor, check to see if managed memory is supported on your system and if page migration (XNACK) is enabled.

Access behavior#

Memory allocations for GPUs behave as follows:

API              | Data location | Host access  | Device access
-----------------|---------------|--------------|---------------------
System allocated | Host          | Local access | Unhandled page fault
hipMallocManaged | Host          | Local access | Zero-copy
hipHostMalloc    | Host          | Local access | Zero-copy*
hipMalloc        | Device        | Zero-copy    | Local access

Zero-copy accesses happen over the Infinity Fabric interconnect or PCIe lanes on discrete GPUs.

Note

While hipHostMalloc allocated memory is accessible by a device, the host pointer must be converted to a device pointer with hipHostGetDevicePointer.

Memory allocated through standard system allocators, such as malloc, can be accessed by a device by registering the memory via hipHostRegister. The device pointer to be used in kernels can be retrieved with hipHostGetDevicePointer. Registered memory is treated like hipHostMalloc memory and has similar performance.

On devices that support and have XNACK enabled, such as the MI250X, hipHostRegister is not required as memory accesses are handled via automatic page migration.
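
The sketch below (illustrative only, with error checking omitted) registers a malloc allocation with hipHostRegister and retrieves the corresponding device pointer with hipHostGetDevicePointer for use in a kernel:

#include <hip/hip_runtime.h>
#include <cstdlib>

__global__ void scale(float *p, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) p[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *h = static_cast<float*>(std::malloc(n * sizeof(float)));

    // Register the malloc'ed buffer so the device can access it.
    hipHostRegister(h, n * sizeof(float), hipHostRegisterDefault);

    // Retrieve the device pointer corresponding to the registered host buffer.
    float *d = nullptr;
    hipHostGetDevicePointer(reinterpret_cast<void**>(&d), h, 0);

    hipLaunchKernelGGL(scale, dim3((n + 255) / 256), dim3(256), 0, 0, d, n);
    hipDeviceSynchronize();

    hipHostUnregister(h);
    std::free(h);
    return 0;
}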

XNACK#

Normally, host and device memory are separate and data has to be transferred manually via hipMemcpy.

On a subset of GPUs, such as the MI200, there is an option to automatically migrate pages of memory between host and device. This is important for managed memory, where the locality of the data is important for performance. Depending on the system, page migration may be disabled by default in which case managed memory will act like pinned host memory and suffer degraded performance.

XNACK describes the GPU's ability to retry memory accesses that failed due to a page fault (which normally would lead to a memory access error), and instead retrieve the missing page.

This also affects memory allocated by the system as indicated by the following table:

API              | Data location | Host after device access | Device after host access
-----------------|---------------|--------------------------|-------------------------
System allocated | Host          | Migrate page to host     | Migrate page to device
hipMallocManaged | Host          | Migrate page to host     | Migrate page to device
hipHostMalloc    | Host          | Local access             | Zero-copy
hipMalloc        | Device        | Zero-copy                | Local access

To check if page migration is available on a platform, use rocminfo:

$ rocminfo | grep xnack
      Name:                    amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-

Here, xnack- means that XNACK is available but is disabled by default. Turning on XNACK by setting the environment variable HSA_XNACK=1 gives the expected result, xnack+:

$ HSA_XNACK=1 rocminfo | grep xnack
Name:                    amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack+

hipcc by default will generate code that runs correctly with XNACK either enabled or disabled. Setting the --offload-arch= option with xnack+ or xnack- forces code to only run with XNACK enabled or disabled, respectively.

# Compiled kernels will run regardless if XNACK is enabled or is disabled. 
hipcc --offload-arch=gfx90a

# Compiled kernels will only be run if XNACK is enabled with XNACK=1.
hipcc --offload-arch=gfx90a:xnack+

# Compiled kernels will only be run if XNACK is disabled with XNACK=0.
hipcc --offload-arch=gfx90a:xnack-

Tip

If you want to make use of page migration, use managed memory. While pageable memory will migrate correctly, it is not a portable solution and can have performance issues if the accessed data isn’t page aligned.

Coherence#

  • Coarse-grained coherence means that memory is only considered up to date at kernel boundaries, which can be enforced through hipDeviceSynchronize, hipStreamSynchronize, or any blocking operation that acts on the null stream (e.g. hipMemcpy). For example, cacheable memory is a type of coarse-grained memory where an up-to-date copy of the data can be stored elsewhere (e.g. in an L2 cache).

  • Fine-grained coherence means the coherence is supported while a CPU/GPU kernel is running. This can be useful if both host and device are operating on the same dataspace using system-scope atomic operations (e.g. updating an error code or flag to a buffer). Fine-grained memory implies that up-to-date data may be made visible to others regardless of kernel boundaries as discussed above.

API           | Flag                     | Coherence
--------------|--------------------------|---------------
hipHostMalloc | hipHostMallocDefault     | Fine-grained
hipHostMalloc | hipHostMallocNonCoherent | Coarse-grained

API                   | Flag                       | Coherence
----------------------|----------------------------|---------------
hipExtMallocWithFlags | hipDeviceMallocDefault     | Coarse-grained
hipExtMallocWithFlags | hipDeviceMallocFinegrained | Fine-grained

API              | hipMemAdvise argument      | Coherence
-----------------|----------------------------|---------------
hipMallocManaged | (none)                     | Fine-grained
hipMallocManaged | hipMemAdviseSetCoarseGrain | Coarse-grained
malloc           | (none)                     | Fine-grained
malloc           | hipMemAdviseSetCoarseGrain | Coarse-grained
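
As a hedged illustration of the flags in the tables above (the exact hipExtMallocWithFlags signature shown here is an assumption; check the HIP headers for your ROCm version), coarse-grained pinned host memory and fine-grained device memory might be allocated as follows:

#include <hip/hip_runtime.h>

int main() {
    // Coarse-grained pinned host memory: contents are only guaranteed to be
    // visible to other agents at kernel boundaries / synchronization points.
    float *h_coarse = nullptr;
    hipHostMalloc(reinterpret_cast<void**>(&h_coarse), 1024 * sizeof(float),
                  hipHostMallocNonCoherent);

    // Fine-grained device memory via hipExtMallocWithFlags (signature assumed).
    float *d_fine = nullptr;
    hipExtMallocWithFlags(reinterpret_cast<void**>(&d_fine), 1024 * sizeof(float),
                          hipDeviceMallocFinegrained);

    // ... use the buffers, e.g. for system-scope flags shared by host and device ...

    hipFree(d_fine);
    hipHostFree(h_coarse);
    return 0;
}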

Tip

Try to design your algorithms to avoid host-device memory coherence (e.g. system scope atomics). While it can be a useful feature in very specific cases, it is not supported on all systems, and can negatively impact performance by introducing the host-device interconnect bottleneck.

The availability of fine- and coarse-grained memory pools can be checked with rocminfo:

$ rocminfo
...
*******
Agent 1
*******
Name:                    AMD EPYC 7742 64-Core Processor
...
Pool Info:
Pool 1
Segment:                 GLOBAL; FLAGS: FINE GRAINED
...
Pool 3
Segment:                 GLOBAL; FLAGS: COARSE GRAINED
...
*******
Agent 9
*******
Name:                    gfx90a
...
Pool Info:
Pool 1
Segment:                 GLOBAL; FLAGS: COARSE GRAINED
...

System direct memory access#

In most cases, the default behavior for HIP in transferring data from a pinned host allocation to device will run at the limit of the interconnect. However, there are certain cases where the interconnect is not the bottleneck.

The primary way to transfer data onto and off of a GPU, such as the MI200, is to use the onboard System Direct Memory Access engine, which is used to feed blocks of memory to the off-device interconnect (either GPU-CPU or GPU-GPU). Each GCD has a separate SDMA engine for host-to-device and device-to-host memory transfers. Importantly, SDMA engines are separate from the computing infrastructure, meaning that memory transfers to and from a device will not impact kernel compute performance, though they do impact memory bandwidth to a limited extent. The SDMA engines are mainly tuned for PCIe-4.0 x16, which means they are designed to operate at bandwidths up to 32 GB/s.

Note

An important feature of the MI250X platform is the Infinity Fabric™ interconnect between host and device. The Infinity Fabric interconnect supports improved performance over standard PCIe-4.0 (usually ~50% more bandwidth); however, since the SDMA engine does not run at this speed, it will not max out the bandwidth of the faster interconnect.

The bandwidth limitation can be countered by bypassing the SDMA engine and replacing it with a type of copy kernel known as a “blit” kernel. Blit kernels will use the compute units on the GPU, thereby consuming compute resources, which may not always be beneficial. The easiest way to enable blit kernels is to set an environment variable HSA_ENABLE_SDMA=0, which will disable the SDMA engine. On systems where the GPU uses a PCIe interconnect instead of an Infinity Fabric interconnect, blit kernels will not impact bandwidth, but will still consume compute resources. The use of SDMA vs blit kernels also applies to MPI data transfers and GPU-GPU transfers.

ROCm compilers disambiguation#

ROCm ships multiple compilers of varying origins and purposes. This article disambiguates compiler naming used throughout the documentation.

Compiler terms#

Term

Description

amdclang++

Clang/LLVM-based compiler that is part of rocm-llvm package. The source code is available at https://github.com/ROCm/llvm-project.

AOCC

Closed-source clang-based compiler that includes additional CPU optimizations. Offered as part of ROCm via the rocm-llvm-alt package. See https://developer.amd.com/amd-aocc/ for details.

HIP-Clang

Informal term for the amdclang++ compiler

HIPIFY

Tools including hipify-clang and hipify-perl, used to automatically translate CUDA source code into portable HIP C++. The source code is available at https://github.com/ROCm/HIPIFY

hipcc

HIP compiler driver. A utility that invokes clang or nvcc depending on the target and passes the appropriate include and library options for the target compiler and HIP infrastructure. The source code is available at https://github.com/ROCm/HIPCC.

ROCmCC

Clang/LLVM-based compiler. ROCmCC in itself is not a binary but refers to the overall compiler.

OpenMP support in ROCm#

Introduction#

The ROCm™ installation includes an LLVM-based implementation that fully supports the OpenMP 4.5 standard and a subset of OpenMP 5.0, 5.1, and 5.2 standards. Fortran, C/C++ compilers, and corresponding runtime libraries are included. Along with host APIs, the OpenMP compilers support offloading code and data onto GPU devices. This document briefly describes the installation location of the OpenMP toolchain, example usage of device offloading, and usage of rocprof with OpenMP applications. The GPUs supported are the same as those supported by this ROCm release. See the list of supported GPUs for Linux and Windows.

The ROCm OpenMP compiler is implemented using LLVM compiler technology. The following image illustrates the internal steps taken to translate a user’s application into an executable that can offload computation to the AMDGPU. The compilation is a two-pass process. Pass 1 compiles the application to generate the CPU code and Pass 2 links the CPU code to the AMDGPU device code.

OpenMP toolchain

Installation#

The OpenMP toolchain is automatically installed as part of the standard ROCm installation and is available under /opt/rocm-{version}/llvm. The sub-directories are:

  • bin: Compilers (flang and clang) and other binaries.

  • examples: The usage section below shows how to compile and run these programs.

  • include: Header files.

  • lib: Libraries including those required for target offload.

  • lib-debug: Debug versions of the above libraries.

OpenMP: usage#

The example programs can be compiled and run by pointing the environment variable ROCM_PATH to the ROCm install directory.

Example:

export ROCM_PATH=/opt/rocm-{version}
cd $ROCM_PATH/share/openmp-extras/examples/openmp/veccopy
sudo make run

Note

sudo is required since we are building inside the /opt directory. Alternatively, copy the files to your home directory first.

The above invocation of Make compiles and runs the program. Note the options that are required for target offload from an OpenMP program:

-fopenmp --offload-arch=<gpu-arch>

Note

The compiler also accepts the alternative offloading notation:

-fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=<gpu-arch>

Obtain the value of gpu-arch by running the following command:

% /opt/rocm-{version}/bin/rocminfo | grep gfx

See the complete list of compiler command-line references here.
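
For illustration, a minimal OpenMP offload program (not one of the shipped examples; the file name and the gfx90a architecture below are assumptions) might look like this:

// veccopy_sketch.cpp - minimal OpenMP target offload sketch (illustrative only)
#include <cstdio>

int main() {
    const int n = 1024;
    int in[n], out[n];
    for (int i = 0; i < n; ++i) in[i] = i;

    // Offload the copy loop to the GPU; the map clauses move the data.
    #pragma omp target teams distribute parallel for map(to: in[0:n]) map(from: out[0:n])
    for (int i = 0; i < n; ++i)
        out[i] = in[i];

    printf("out[42] = %d\n", out[42]);
    return 0;
}

It could be compiled with, for example, clang++ -fopenmp --offload-arch=gfx90a veccopy_sketch.cpp, substituting the gpu-arch value reported by rocminfo.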

Using rocprof with OpenMP#

The following steps describe a typical workflow for using rocprof with OpenMP code compiled with AOMP:

  1. Run rocprof with the program command line:

    % rocprof <application> <args>
    

    This produces a results.csv file in the user’s current directory that shows basic stats such as kernel names, grid size, number of registers used, etc. The user can choose to specify the preferred output file name using the -o option.

  2. Add options for a detailed result:

    --stats: % rocprof --stats <application> <args>
    

    The stats option produces timestamps for the kernels. Look into the output CSV file for the field, DurationNs, which is useful in getting an understanding of the critical kernels in the code.

    Apart from --stats, the option --timestamp on produces a timestamp for the kernels.

  3. After learning about the required kernels, the user can take a detailed look at each one of them. rocprof has support for hardware counters: a set of basic and a set of derived ones. See the complete list of counters using options --list-basic and --list-derived. rocprof accepts either a text or an XML file as an input.

For more details on rocprof, refer to the ROCProfilerV1 User Manual.

Using tracing options#

Prerequisite: When using the --sys-trace option, compile the OpenMP program with:

    -Wl,-rpath,/opt/rocm-{version}/lib -lamdhip64

The following tracing options are widely used to generate useful information:

  • --hsa-trace: This option is used to get a JSON output file with the HSA API execution traces and a flat profile in a CSV file.

  • --sys-trace: This allows programmers to trace both HIP and HSA calls. Since this option results in loading libamdhip64.so, follow the prerequisite as mentioned above.

A CSV and a JSON file are produced by the above trace options. The CSV file presents the data in a tabular format, and the JSON file can be visualized using Google Chrome at chrome://tracing/ or Perfetto. Navigate to Chrome or Perfetto and load the JSON file to see the timeline of the HSA calls.

For more details on tracing, refer to the ROCProfilerV1 User Manual.

Environment variables#

Environment Variable

Purpose

OMP_NUM_TEAMS

To set the number of teams for kernel launch, which is otherwise chosen by the implementation by default. You can set this number (subject to implementation limits) for performance tuning.

LIBOMPTARGET_KERNEL_TRACE

To print useful statistics for device operations. Setting it to 1 and running the program emits the name of every kernel launched, the number of teams and threads used, and the corresponding register usage. Setting it to 2 additionally emits timing information for kernel launches and data transfer operations between the host and the device.

LIBOMPTARGET_INFO

To print informational messages from the device runtime as the program executes. Setting it to a value of 1 or higher prints fine-grained information, and setting it to -1 prints complete information.

LIBOMPTARGET_DEBUG

To get detailed debugging information about data transfer operations and kernel launch when using a debug version of the device library. Set this environment variable to 1 to get the detailed information from the library.

GPU_MAX_HW_QUEUES

To set the number of HSA queues in the OpenMP runtime. The HSA queues are created on demand up to the maximum value as supplied here. The queue creation starts with a single initialized queue to avoid unnecessary allocation of resources. The provided value is capped if it exceeds the recommended, device-specific value.

LIBOMPTARGET_AMDGPU_MAX_ASYNC_COPY_BYTES

To set the threshold size up to which data transfers are initiated asynchronously. The default threshold size is 1024*1024 bytes (1 MB).

OMPX_FORCE_SYNC_REGIONS

To force the runtime to execute all operations synchronously, i.e., wait for an operation to complete immediately. This affects data transfers and kernel execution. While it is mainly designed for debugging, it may have a minor positive effect on performance in certain situations.

OpenMP: features#

The OpenMP programming model is greatly enhanced with the following new features implemented in the past releases.

Asynchronous behavior in OpenMP target regions#

  • Controlling Asynchronous Behavior

The OpenMP offloading runtime executes in an asynchronous fashion by default, allowing multiple data transfers to start concurrently. However, if the data to be transferred becomes larger than the default threshold of 1MB, the runtime falls back to a synchronous data transfer. The buffers that have been locked already are always executed asynchronously. You can overrule this default behavior by setting LIBOMPTARGET_AMDGPU_MAX_ASYNC_COPY_BYTES and OMPX_FORCE_SYNC_REGIONS. See the Environment Variables table for details.

  • Multithreaded Offloading on the Same Device

The libomptarget plugin for GPU offloading allows creation of separate configurable HSA queues per chiplet, which enables two or more threads to concurrently offload to the same device.

  • Parallel Memory Copy Invocations

Implicit asynchronous execution of single target region enables parallel memory copy invocations.

Unified shared memory#

Unified Shared Memory (USM) provides a pointer-based approach to memory management. To implement USM, fulfill the following system requirements along with Xnack capability.

Prerequisites#
  • Linux Kernel versions above 5.14

  • Latest KFD driver packaged in ROCm stack

  • Xnack, as USM support can only be tested with applications compiled with Xnack capability

Xnack capability#

When enabled, Xnack capability allows GPU threads to access CPU (system) memory, allocated with OS-allocators, such as malloc, new, and mmap. Xnack must be enabled both at compile- and run-time. To enable Xnack support at compile-time, use:

--offload-arch=gfx908:xnack+

Or use another functionally equivalent option Xnack-any:

--offload-arch=gfx908

To enable Xnack functionality at runtime on a per-application basis, use environment variable:

HSA_XNACK=1

When Xnack support is not needed:

  • Build the applications to maximize resource utilization using:

--offload-arch=gfx908:xnack-
  • At runtime, set the HSA_XNACK environment variable to 0.

Unified shared memory pragma#

This OpenMP pragma is available on MI200 through xnack+ support.

omp requires unified_shared_memory

As stated in the OpenMP specifications, this pragma makes the map clause on target constructs optional. By default, on MI200, all memory allocated on the host is fine grain. Using the map clause on a target clause is allowed, which transforms the access semantics of the associated memory to coarse grain.

A simple program demonstrating the use of this feature is:
$ cat parallel_for.cpp
#include <stdlib.h>
#include <stdio.h>

#define N 64
#pragma omp requires unified_shared_memory
int main() {
  int n = N;
  int *a = new int[n];
  int *b = new int[n];

  for(int i = 0; i < n; i++)
    b[i] = i;

  #pragma omp target parallel for map(to:b[:n])
  for(int i = 0; i < n; i++)
    a[i] = b[i];

  for(int i = 0; i < n; i++)
    if(a[i] != i)
      printf("error at %d: expected %d, got %d\n", i, i+1, a[i]);

  return 0;
}
$ clang++ -O2 -target x86_64-pc-linux-gnu -fopenmp --offload-arch=gfx90a:xnack+ parallel_for.cpp
$ HSA_XNACK=1 ./a.out

In the above code example, pointer “a” is not mapped in the target region, while pointer “b” is. Both are valid pointers on the GPU device and passed by-value to the kernel implementing the target region. This means the pointer values on the host and the device are the same.

The difference between the memory pages pointed to by these two variables is that the pages pointed to by “a” are in fine-grain memory, while the pages pointed to by “b” are in coarse-grain memory during and after the execution of the target region. This is accomplished in the OpenMP runtime library with calls to the ROCr runtime to set the pages pointed to by “b” as coarse grain.

OMPT target support#

The OpenMP runtime in ROCm implements a subset of the OMPT device APIs, as described in the OpenMP specification document. These APIs allow first-party tools to examine the profile and kernel traces that execute on a device. A tool can register callbacks for data transfer and kernel dispatch entry points or use APIs to start and stop tracing for device-related activities such as data transfer and kernel dispatch timings and associated metadata. If device tracing is enabled, trace records for device activities are collected during program execution and returned to the tool using the APIs described in the specification.

The following example demonstrates how a tool uses the supported OMPT target APIs. The README in /opt/rocm/llvm/examples/tools/ompt outlines the steps to be followed, and the provided example can be run as shown below:

cd $ROCM_PATH/share/openmp-extras/examples/tools/ompt/veccopy-ompt-target-tracing
sudo make run

The file veccopy-ompt-target-tracing.c simulates how a tool initiates device activity tracing. The file callbacks.h shows the callbacks registered and implemented by the tool.

Floating point atomic operations#

The MI200-series GPUs support the generation of hardware floating-point atomics using the OpenMP atomic pragma. The support includes single- and double-precision floating-point atomic operations. The programmer must ensure that the memory subjected to the atomic operation is in coarse-grain memory by mapping it explicitly with the help of map clauses when not implicitly mapped by the compiler as per the OpenMP specifications. This makes these hardware floating-point atomic instructions “fast,” as they are faster than using a default compare-and-swap loop scheme, but at the same time “unsafe,” as they are not supported on fine-grain memory. The operation in unified_shared_memory mode also requires programmers to map the memory explicitly when not implicitly mapped by the compiler.

To request fast floating-point atomic instructions at the file level, use compiler flag -munsafe-fp-atomics or a hint clause on a specific pragma:

double a = 0.0;
#pragma omp atomic hint(AMD_fast_fp_atomics)
a = a + 1.0;

Note

AMD_unsafe_fp_atomics is an alias for AMD_fast_fp_atomics, and AMD_safe_fp_atomics is implemented with a compare-and-swap loop.

To disable the generation of fast floating-point atomic instructions at the file level, build using the option -msafe-fp-atomics or use a hint clause on a specific pragma:

double a = 0.0;
#pragma omp atomic hint(AMD_safe_fp_atomics)
a = a + 1.0;

The hint clause value always takes precedence over the compiler flag, which allows programmers to create atomic constructs with a different behavior than the rest of the file.

See the example below, where the user builds the program using -msafe-fp-atomics to select a file-wide “safe atomic” compilation. However, the fast atomics hint clause over variable “a” takes precedence and operates on “a” using a fast/unsafe floating-point atomic, while the variable “b” in the absence of a hint clause is operated upon using safe floating-point atomics as per the compiler flag.

double a = 0.0;
#pragma omp atomic hint(AMD_fast_fp_atomics)
a = a + 1.0;

double b = 0.0;
#pragma omp atomic
b = b + 1.0;

AddressSanitizer tool#

AddressSanitizer (ASan) is a memory error detector tool utilized by applications to detect various errors ranging from spatial issues such as out-of-bound access to temporal issues such as use-after-free. The AOMP compiler supports ASan for AMD GPUs with applications written in both HIP and OpenMP.

Features supported on host platform (Target x86_64):

  • Use-after-free

  • Buffer overflows

  • Heap buffer overflow

  • Stack buffer overflow

  • Global buffer overflow

  • Use-after-return

  • Use-after-scope

  • Initialization order bugs

Features supported on AMDGPU platform (amdgcn-amd-amdhsa):

  • Heap buffer overflow

  • Global buffer overflow

Software (kernel/OS) requirements: Unified Shared Memory support with Xnack capability. See the section on Unified Shared Memory for prerequisites and details on Xnack.

Example:

  • Heap buffer overflow

int main() {
.......  // Some program statements
.......  // Some program statements
#pragma omp target map(to : A[0:N], B[0:N]) map(from: C[0:N])
{
#pragma omp parallel for
    for(int i =0 ; i < N; i++){
    C[i+10] = A[i] + B[i];
  }   // end of for loop
}
.......   // Some program statements
}// end of main

See the complete sample code for heap buffer overflow here.

  • Global buffer overflow

#pragma omp declare target
   int A[N],B[N],C[N];
#pragma omp end declare target
int main() {
......  // some program statements
......  // some program statements
#pragma omp target data map(to:A[0:N],B[0:N]) map(from: C[0:N])
{
#pragma omp target update to(A,B)
#pragma omp target parallel for
for(int i=0; i<N; i++){
    C[i]=A[i*100]+B[i+22];
} // end of for loop
#pragma omp target update from(C)
}
........  // some program statements
} // end of main

See the complete sample code for global buffer overflow here.

Clang compiler option for kernel optimization#

You can use the clang compiler option -fopenmp-target-fast for kernel optimization if certain constraints implied by its component options are satisfied. -fopenmp-target-fast enables the following options:

  • -fopenmp-target-ignore-env-vars: It enables code generation of specialized kernels including no-loop and Cross-team reductions.

  • -fopenmp-assume-no-thread-state: It enables the compiler to assume that no thread in a parallel region modifies an Internal Control Variable (ICV), thus potentially reducing the device runtime code execution.

  • -fopenmp-assume-no-nested-parallelism: It enables the compiler to assume that no thread in a parallel region encounters a parallel region, thus potentially reducing the device runtime code execution.

  • -O3 if no -O* is specified by the user.

Specialized kernels#

Clang will attempt to generate specialized kernels based on compiler options and OpenMP constructs. The following specialized kernels are supported:

  • No-loop

  • Big-jump-loop

  • Cross-team reductions

To enable the generation of specialized kernels, follow these guidelines:

  • Do not specify teams, threads, and schedule-related environment variables. The num_teams clause in an OpenMP target construct acts as an override and prevents the generation of the no-loop kernel. If the specification of num_teams clause is a user requirement then clang tries to generate the big-jump-loop kernel instead of the no-loop kernel.

  • Assert the absence of the teams, threads, and schedule-related environment variables by adding the command-line option -fopenmp-target-ignore-env-vars.

  • To automatically enable the specialized kernel generation, use -Ofast or -fopenmp-target-fast for compilation.

  • To disable specialized kernel generation, use -fno-openmp-target-ignore-env-vars.

No-loop kernel generation#

The no-loop kernel generation feature optimizes the compiler performance by generating a specialized kernel for certain OpenMP target constructs such as target teams distribute parallel for. The specialized kernel generation feature assumes every thread executes a single iteration of the user loop, which leads the runtime to launch a total number of GPU threads equal to or greater than the iteration space size of the target region loop. This allows the compiler to generate code for the loop body without an enclosing loop, resulting in reduced control-flow complexity and potentially better performance.

Big-jump-loop kernel generation#

A no-loop kernel is not generated if the OpenMP teams construct uses a num_teams clause. Instead, the compiler attempts to generate a different specialized kernel called the big-jump-loop kernel. The compiler launches the kernel with a grid size determined by the number of teams specified by the OpenMP num_teams clause and the blocksize chosen either by the compiler or specified by the corresponding OpenMP clause.

Cross-team optimized reduction kernel generation#

If the OpenMP construct has a reduction clause, the compiler attempts to generate optimized code by utilizing efficient cross-team communication. New APIs for cross-team reduction are implemented in the device runtime and are automatically generated by clang.
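
For illustration (this is a hedged sketch, not an example from the ROCm distribution), a loop of the following form, built with -fopenmp-target-fast for an assumed gfx90a target, is a candidate for the cross-team optimized reduction kernel:

#include <cstdio>

int main() {
    const int n = 1 << 20;
    double sum = 0.0;

    // The reduction clause on a target loop lets the compiler generate the
    // optimized cross-team reduction described above.
    #pragma omp target teams distribute parallel for reduction(+ : sum) map(tofrom : sum)
    for (int i = 0; i < n; ++i)
        sum += 1.0 / (i + 1.0);

    printf("sum = %f\n", sum);
    return 0;
}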

ROCm Linux Filesystem Hierarchy Standard reorganization#

Introduction#

The ROCm software stack has adopted the Linux Filesystem Hierarchy Standard (FHS) https://refspecs.linuxfoundation.org/FHS_3.0/fhs/index.html in order to ensure ROCm is consistent with standard open source conventions. The following sections specify how current and future releases of ROCm adhere to the FHS, how the previous ROCm file system is supported, and how improved versioning specifications are applied to ROCm.

Adopting the FHS#

In order to standardize the ROCm directory structure and directory content layout, ROCm has adopted the FHS, adhering to open source conventions for Linux-based distributions. The FHS ensures internal consistency within the ROCm stack, as well as external consistency with other systems and distributions. The proposed ROCm file structure is outlined below:

/opt/rocm-<ver>
    | -- bin
         | -- all public binaries
    | -- lib
         | -- lib<soname>.so->lib<soname>.so.major->lib<soname>.so.major.minor.patch
              (public libraries to link with applications)
         | -- <component>
              | -- architecture dependent libraries and binaries used internally by components
         | -- cmake
              | -- <component>
                   | --<component>-config.cmake
    | -- libexec
         | -- <component>
              | -- non ISA/architecture independent executables used internally by components
    | -- include
         | -- <component>
              | -- public header files
    | -- share
         | -- html
              | -- <component>
                   | -- html documentation
         | -- info
              | -- <component>
                   | -- info files
         | -- man
              | -- <component>
                   | -- man pages
         | -- doc
              | -- <component>
                   | -- license files
         | -- <component>
              | -- samples
              | -- architecture independent misc files

Changes from earlier ROCm versions#

The following table provides a brief overview of the new ROCm FHS layout, compared to the layout of earlier ROCm versions. Note that /opt/ is used to denote the default rocm-installation-path and should be replaced in case of a non-standard installation location of the ROCm distribution.

 ______________________________________________________
|  New ROCm Layout            |  Previous ROCm Layout  |
|_____________________________|________________________|
| /opt/rocm-<ver>             | /opt/rocm-<ver>        |
|     | -- bin                |     | -- bin           |
|     | -- lib                |     | -- lib           |
|          | -- cmake         |     | -- include       |
|     | -- libexec            |     | -- <component_1> |
|     | -- include            |          | -- bin      |
|          | -- <component_1> |          | -- cmake    |
|     | -- share              |          | -- doc      |
|          | -- html          |          | -- lib      |
|          | -- info          |          | -- include  |
|          | -- man           |          | -- samples  |
|          | -- doc           |     | -- <component_n> |
|          | -- <component_1> |          | -- bin      |
|               | -- samples  |          | -- cmake    |
|               | -- ..       |          | -- doc      |
|          | -- <component_n> |          | -- lib      |
|               | -- samples  |          | -- include  |
|               | -- ..       |          | -- samples  |
|______________________________________________________|

ROCm FHS reorganization: backward compatibility#

The FHS file organization for ROCm was first introduced in the ROCm 5.2 release. Backward compatibility was implemented to make sure users could still run their ROCm applications while transitioning to the new FHS. ROCm has moved header files and libraries to their new locations as indicated in the above structure, and includes symbolic links and wrapper header files in their old locations for backward compatibility. The following sections detail the ROCm backward compatibility implementation for wrapper header files, executable files, library files, and CMake config files.

Wrapper header files#

Wrapper header files are placed in the old location ( /opt/rocm-<ver>/<component>/include) with a warning message to include files from the new location (/opt/rocm-<ver>/include) as shown in the example below.

#pragma message "This file is deprecated. Use file from include path /opt/rocm-ver/include/ and prefix with hip."
#include <hip/hip_runtime.h>
  • Starting with the ROCm 5.2 release, deprecation for the backward compatibility wrapper header files is indicated by a #pragma message announcing #warning.

  • Starting with ROCm 6.0 (tentatively), backward compatibility for wrapper header files will be removed, and the #pragma message will announce #error.

Executable files#

Executable files are available in the /opt/rocm-<ver>/bin folder. For backward compatibility, the old binary location (/opt/rocm-<ver>/<component>/bin) has a soft link to the binary at the new location. Soft links will be removed in a future release, tentatively ROCm v6.0.

$ ls -l /opt/rocm/hip/bin/
lrwxrwxrwx 1 root root   24 Jan 1 23:32 hipcc -> ../../bin/hipcc

Library files#

Library files are available in the /opt/rocm-<ver>/lib folder. For backward compatibility, the old library location (/opt/rocm-<ver>/<component>/lib) has a soft link to the library at the new location. Soft links will be removed in a future release, tentatively ROCm v6.0.

$ ls -l /opt/rocm/hip/lib/
drwxr-xr-x 4 root root 4096 Jan 1 10:45 cmake
lrwxrwxrwx 1 root root   24 Jan 1 23:32 libamdhip64.so -> ../../lib/libamdhip64.so

CMake config files#

All CMake configuration files are available in the /opt/rocm-<ver>/lib/cmake/<component> folder. For backward compatibility, the old CMake locations (/opt/rocm-<ver>/<component>/lib/cmake) consist of a soft link to the new CMake config. Soft links will be removed in a future release, tentatively ROCm v6.0.

$ ls -l /opt/rocm/hip/lib/cmake/hip/
lrwxrwxrwx 1 root root 42 Jan 1 23:32 hip-config.cmake -> ../../../../lib/cmake/hip/hip-config.cmake

Changes required in applications using ROCm#

Applications using ROCm are advised to use the new file paths, as the old files will be deprecated in a future release. Applications must make sure to include the correct header files and use the correct search paths.

  1. #include<header_file.h> needs to be changed to #include <component/header_file.h>

    For example: #include <hip.h> needs to change to #include <hip/hip.h>

  2. Any variable in CMake or Makefiles pointing to a component folder needs to be changed.

    For example: VAR1=/opt/rocm/hip needs to be changed to VAR1=/opt/rocm, and VAR2=/opt/rocm/hsa needs to be changed to VAR2=/opt/rocm.

  3. Any reference to /opt/rocm/<component>/bin or /opt/rocm/<component>/lib needs to be changed to /opt/rocm/bin and /opt/rocm/lib/, respectively.

Changes in versioning specifications#

In order to better manage ROCm dependencies specification and allow smoother releases of ROCm while avoiding dependency conflicts, ROCm software shall adhere to the following scheme when numbering and incrementing ROCm files versions:

rocm-<ver>, where <ver> = <x.y.z>

x.y.z denote: MAJOR.MINOR.PATCH

z: PATCH - increment z when implementing backward compatible bug fixes.

y: MINOR - increment y when implementing minor changes that add functionality but are still backward compatible.

x: MAJOR - increment x when implementing major changes that are not backward compatible.

GPU isolation techniques#

Restricting the access of applications to a subset of GPUs, also known as isolating GPUs, allows users to hide GPU resources from programs. By default, programs will only use the “exposed” GPUs, ignoring other (hidden) GPUs in the system.

There are multiple ways to achieve isolation of GPUs in the ROCm software stack, differing in which applications they apply to and the security they provide. This page serves as an overview of the techniques.

Environment variables#

The runtimes in the ROCm software stack read these environment variables to select the exposed or default device to present to applications using them.

Environment variables shouldn’t be used for isolating untrusted applications, as an application can reset them before initializing the runtime.

ROCR_VISIBLE_DEVICES#

A list of device indices or UUIDs that will be exposed to applications.

Runtime : ROCm Software Runtime. Applies to all applications using the user mode ROCm software stack.

Example to expose the first device and a device selected by UUID.#
export ROCR_VISIBLE_DEVICES="0,GPU-DEADBEEFDEADBEEF"

GPU_DEVICE_ORDINAL#

Devices indices exposed to OpenCL and HIP applications.

Runtime : ROCm Common Language Runtime (ROCclr). Applies to applications and runtimes using the ROCclr abstraction layer including HIP and OpenCL applications.

Example to expose the first and third devices in the system.#
export GPU_DEVICE_ORDINAL="0,2"

HIP_VISIBLE_DEVICES#

Device indices exposed to HIP applications.

Runtime: HIP runtime. Applies only to applications using HIP on the AMD platform.

Example to expose the first and third devices in the system.#
export HIP_VISIBLE_DEVICES="0,2"

CUDA_VISIBLE_DEVICES#

Provided for CUDA compatibility, has the same effect as HIP_VISIBLE_DEVICES on the AMD platform.

Runtime : HIP or CUDA Runtime. Applies to HIP applications on the AMD or NVIDIA platform and CUDA applications.

OMP_DEFAULT_DEVICE#

Default device used for OpenMP target offloading.

Runtime : OpenMP Runtime. Applies only to applications using OpenMP offloading.

Example of setting the default device to the third device.#
export OMP_DEFAULT_DEVICE="2"

Docker#

Docker uses Linux kernel namespaces to provide isolated environments for applications. This isolation applies to most devices by default, including GPUs. To access them in containers, explicit access must be granted; see Accessing GPUs in containers for details. Specifically, refer to Restricting GPU access on exposing just a subset of all GPUs.

Docker isolation is more secure than environment variables, and applies to all programs that use the amdgpu kernel module interfaces. Even programs that don’t use the ROCm runtime, like graphics applications using OpenGL or Vulkan, can only access the GPUs exposed to the container.

GPU passthrough to virtual machines#

Virtual machines achieve the highest level of isolation, because even the kernel of the virtual machine is isolated from the host. Devices physically installed in the host system can be passed to the virtual machine using PCIe passthrough. This allows using the GPU with a different operating system, such as a Windows guest on a Linux host.

Setting up PCIe passthrough is specific to the hypervisor used. ROCm officially supports VMware ESXi for select GPUs.

Using the LLVM ASan on a GPU (beta release)#

The LLVM AddressSanitizer (ASan) provides a process that allows developers to detect runtime addressing errors in applications and libraries. The detection is achieved using a combination of compiler-added instrumentation and runtime techniques, including function interception and replacement. Until now, the LLVM ASan process was only available for traditional purely CPU applications. However, ROCm has extended this mechanism to additionally allow the detection of some addressing errors on the GPU in heterogeneous applications. Ideally, developers should treat heterogeneous HIP and OpenMP applications exactly like pure CPU applications. However, this simplicity has not been achieved yet. This document describes how to use ROCm ASan.

For information about LLVM ASan, see the LLVM documentation.

Note: The beta release of LLVM ASan for ROCm is currently tested and validated on Ubuntu 20.04.

Compiling for ASan#

The ASan process begins by compiling the application of interest with the ASan instrumentation.

Recommendations for doing this are:

  • Compile as many application and dependent library sources as possible using an AMD-built clang-based compiler such as amdclang++.

  • Add the following options to the existing compiler and linker options:

    • -fsanitize=address - enables instrumentation

    • -shared-libsan - use shared version of runtime

    • -g - add debug info for improved reporting

  • Explicitly use xnack+ in the offload architecture option. For example, --offload-arch=gfx90a:xnack+

Other architectures are allowed, but their device code will not be instrumented and a warning will be emitted.

Note: It is not an error to compile some files without ASan instrumentation, but doing so reduces the ability of the process to detect addressing errors. However, if the main program “a.out” does not directly depend on the ASan runtime (libclang_rt.asan-x86_64.so) after the build completes (check by running ldd (List Dynamic Dependencies) or readelf), the application will immediately report an error at runtime as described in the next section.

Note: When compiling OpenMP programs with ASan instrumentation, it is currently necessary to set the environment variable LIBRARY_PATH to /opt/rocm-<version>/lib/llvm/lib/asan:/opt/rocm-<version>/lib/asan. At runtime, it may be necessary to add /opt/rocm-<version>/lib/llvm/lib/asan to LD_LIBRARY_PATH.

About compilation time#

When -fsanitize=address is used, the LLVM compiler adds instrumentation code around every memory operation. This added code must be handled by all of the downstream components of the compiler toolchain and results in increased overall compilation time. This increase is especially evident in the AMDGPU device compiler and has in a few instances raised the compile time to an unacceptable level.

There are a few options if the compile time becomes unacceptable:

  • Avoid instrumentation of the files which have the worst compile times. This will reduce the effectiveness of the ASan process.

  • Add the option -fsanitize-recover=address to the compiles with the worst compile times. This option simplifies the added instrumentation resulting in faster compilation. See below for more information.

  • Disable instrumentation on a per-function basis by adding __attribute__((no_sanitize(“address”))) to functions found to be responsible for the large compile time. Again, this will reduce the effectiveness of the process.

Installing ROCm GPU ASan packages#

For a complete ROCm GPU Sanitizer installation, including packages, instrumented HSA and HIP runtimes, tools, and math libraries, use the following command:

    sudo apt-get install rocm-ml-sdk-asan

Using AMD-supplied ASan instrumented libraries#

ROCm releases have optional packages that contain additional ASan instrumented builds of the ROCm libraries (usually found in /opt/rocm-<version>/lib). The instrumented libraries have identical names to the regular uninstrumented libraries, and are located in /opt/rocm-<version>/lib/asan. These additional libraries are built using the amdclang++ and hipcc compilers, while some uninstrumented libraries are built with g++. The preexisting build options are used but, as described above, additional options are used: -fsanitize=address, -shared-libsan and -g.

These additional libraries avoid additional developer effort to locate repositories, identify the correct branch, check out the correct tags, and other efforts needed to build the libraries from the source. And they extend the ability of the process to detect addressing errors into the ROCm libraries themselves.

When adjusting an application build to add instrumentation, linking against these instrumented libraries is unnecessary. For example, any -L /opt/rocm-<version>/lib compiler options need not be changed. However, the instrumented libraries should be used when the application is run. It is particularly important that the instrumented language runtimes, like libamdhip64.so and librocm-core.so, are used; otherwise, device invalid access detections may not be reported.

Running ASan instrumented applications#

Preparing to run an instrumented application#

Here are a few recommendations to consider before running an ASan instrumented heterogeneous application.

  • Ensure the Linux kernel running on the system has Heterogeneous Memory Management (HMM) support. A kernel version of 5.6 or higher should be sufficient.

  • Ensure XNACK is enabled

    • For gfx90a (MI-2X0) or gfx940 (MI-3X0), set the environment variable HSA_XNACK=1.

    • For gfx906 (MI-50) or gfx908 (MI-100), set the environment variable HSA_XNACK=1, but also ensure the amdgpu kernel module is loaded with module argument noretry=0. This requirement is due to the fact that the XNACK setting for these GPUs is system-wide.

  • Ensure that the application will use the instrumented libraries when it runs. The output from the shell command ldd <application name> can be used to see which libraries will be used. If the instrumented libraries are not listed by ldd, the environment variable LD_LIBRARY_PATH may need to be adjusted, or in some cases an RPATH compiled into the application may need to be changed and the application recompiled.

  • Ensure that the application depends on the ASan runtime. This can be checked by running the command readelf -d <application name> | grep NEEDED and verifying that shared library: libclang_rt.asan-x86_64.so appears in the output. If it does not appear, when executed the application will quickly output an ASan error that looks like:

==3210==ASan runtime does not come first in initial library list; you should either link runtime to your application or manually preload it with LD_PRELOAD.
  • Ensure that the llvm-symbolizer application can be executed, and that it is located in /opt/rocm-<version>/llvm/bin. This executable is not strictly required, but if found, it is used to translate (“symbolize”) a host-side instruction address into a more useful function name, file name, and line number (assuming the application has been built to include debug information).

There is an environment variable, ASAN_OPTIONS, that can be used to adjust the runtime behavior of the ASAN runtime itself. There are more than a hundred “flags” that can be adjusted (see an old list at flags) but the default settings are correct and should be used in most cases. It must be noted that these options only affect the host ASAN runtime. The device runtime only currently supports the default settings for the few relevant options.

There are two ASAN_OPTIONS flags of particular note.

  • halt_on_error=0/1 default 1.

This tells the ASan runtime to halt the application immediately after detecting and reporting an addressing error. The default makes sense because the application has entered the realm of undefined behavior. If the developer wishes to have the application continue anyway, this option can be set to zero. However, the application and libraries should then be compiled with the additional option -fsanitize-recover=address. Note that the ROCm optional ASan instrumented libraries are not compiled with this option and if an error is detected within one of them, but halt_on_error is set to 0, more undefined behavior will occur.

  • detect_leaks=0/1 default 1.

This option directs the ASan runtime to enable the Leak Sanitizer (LSAN). Unfortunately, for heterogeneous applications, this default will result in significant output from the leak sanitizer when the application exits, due to allocations made by the language runtime which are not considered to be leaks. This output can be avoided by adding detect_leaks=0 to the ASAN_OPTIONS, or alternatively by producing an LSAN suppression file (syntax described here) and activating it with the environment variable LSAN_OPTIONS=suppressions=/path/to/suppression/file. When using a suppression file, a suppression report is printed by default. The suppression report can be disabled by using the LSAN_OPTIONS flag print_suppressions=0.

Runtime overhead#

Running an ASan instrumented application incurs overheads which may result in unacceptably long runtimes or failure to run at all.

Higher execution time#

ASan detection works by checking each address at runtime before the address is actually accessed by a load, store, or atomic instruction. This checking involves an additional load to “shadow” memory, which records whether the address is “poisoned” or not, and additional logic that decides whether to produce a detection report or not.

This extra runtime work can cause the application to slow down by a factor of three or more, depending on how many memory accesses are executed. For heterogeneous applications, the shadow memory must be accessible by all devices and this can mean that shadow accesses from some devices may be more costly than non-shadow accesses.

Higher memory use#

The address checking described above relies on the compiler to surround each program variable with a red zone, and on the ASan runtime to surround each runtime memory allocation with a red zone and fill the shadow corresponding to each red zone with poison. The added memory for the red zones is additional overhead on top of the 13% overhead for the shadow memory itself.

Applications which consume most of one or more available memory pools when run normally are likely to encounter allocation failures when run with instrumentation.

Runtime reporting#

It is not the intention of this document to provide a detailed explanation of all of the types of reports that can be output by the ASan runtime. Instead, the focus is on the differences between the standard reports for CPU issues, and reports for GPU issues.

An invalid address detection report for the CPU always starts with

==<PID>==ERROR: AddressSanitizer: <problem type> on address <memory address> at pc <pc> bp <bp> sp <sp> <access> of size <N> at <memory address> thread T0

and continues with a stack trace for the access, a stack trace for the allocation and deallocation, if relevant, and a dump of the shadow memory near the <memory address>.

In contrast, an invalid address detection report for the GPU always starts with

==<PID>==ERROR: AddressSanitizer: <problem type> on amdgpu device <device> at pc <pc> <access> of size <n> in workgroup id (<X>,<Y>,<Z>)

Above, <device> is the integer device ID, and (<X>, <Y>, <Z>) is the ID of the workgroup or block where the invalid address was detected.

While the CPU report includes a call stack for the thread attempting the invalid access, the GPU report is currently limited to a call stack of size one, i.e. the (symbolized) <pc> of the invalid access, e.g.

#0 <pc> in <function signature> at /path/to/file.hip:<line>:<column>

This short call stack is followed by a GPU unique section that looks like

Thread ids and accessed addresses:
<lid0> <maddr 0> : <lid1> <maddr1> : ...

where each <lid j> <maddr j> indicates the lane ID and the invalid memory address held by lane j of the wavefront attempting the invalid access.

Additionally, reports for invalid GPU accesses to memory allocated by GPU code via malloc or new, which start with, for example,

==1234==ERROR: AddressSanitizer: heap-buffer-overflow on amdgpu device 0 at pc 0x7fa9f5c92dcc

or

==5678==ERROR: AddressSanitizer: heap-use-after-free on amdgpu device 3 at pc 0x7f4c10062d74

currently may include one or two surprising CPU-side tracebacks mentioning “hostcall”. This is due to how malloc and free are implemented for GPU code, and these call stacks can be ignored.

Running with rocgdb#

rocgdb can be used to further investigate ASan detected errors, with some preparation.

Currently, the ASan runtime complains when starting rocgdb without preparation.

$ rocgdb my_app
==1122==ASan runtime does not come first in initial library list; you should either link runtime to your application or manually preload it with LD_PRELOAD.

This is solved by setting environment variable LD_PRELOAD to the path to the ASan runtime, whose path can be obtained using the command

amdclang++ -print-file-name=libclang_rt.asan-x86_64.so

It is also recommended to set the environment variable HIP_ENABLE_DEFERRED_LOADING=0 before debugging HIP applications.

After starting rocgdb breakpoints can be set on the ASan runtime error reporting entry points of interest. For example, if an ASan error report includes

WRITE of size 4 in workgroup id (10,0,0)

the rocgdb command needed to stop the program before the report is printed is

(gdb) break __asan_report_store4

Similarly, the appropriate command for a report including

READ of size <N> in workgroup ID (1,2,3)

is

(gdb) break __asan_report_load<N>

It is possible to set breakpoints on all ASan report functions using these commands:

$ rocgdb <path to application>
(gdb) start <command line arguments>
(gdb) rbreak ^__asan_report
(gdb) c

Using ASan with a short HIP application#

Consider the following simple and short demo of using the Address Sanitizer with a HIP application:

#include <cstdio>
#include <cstdlib>
#include <hip/hip_runtime.h>

__global__ void
set1(int *p)
{
    int i = blockDim.x*blockIdx.x + threadIdx.x;
    p[i] = 1;
}

int
main(int argc, char **argv)
{
    int m = std::atoi(argv[1]);
    int n1 = std::atoi(argv[2]);
    int n2 = std::atoi(argv[3]);
    int c = std::atoi(argv[4]);
    int *dp;
    hipMalloc(&dp, m*sizeof(int));
    hipLaunchKernelGGL(set1, dim3(n1), dim3(n2), 0, 0, dp);
    int *hp = (int*)malloc(c * sizeof(int));
    hipMemcpy(hp, dp, m*sizeof(int), hipMemcpyDeviceToHost);
    hipDeviceSynchronize();
    hipFree(dp);
    free(hp);
    std::puts("Done.");
    return 0;
}

This application will attempt to access invalid addresses for certain command line arguments. In particular, if m < n1 * n2 some device threads will attempt to access unallocated device memory.

Or, if c < m, the hipMemcpy function will copy past the end of the malloc allocated memory.

Note: The hipcc compiler is used here for simplicity.

Compiling without XNACK results in a warning.

$ hipcc -g --offload-arch=gfx90a:xnack- -fsanitize=address -shared-libsan mini.hip -o mini
clang++: warning: ignoring '-fsanitize=address' option for offload arch 'gfx90a:xnack-', as it is not currently supported there. Use it with an offload arch containing 'xnack+' instead [-Woption-ignored]

The binary compiled above will run, but the GPU code will not be instrumented and the m < n1 * n2 error will not be detected. Switching to --offload-arch=gfx90a:xnack+ in the command above results in a warning-free compilation and an instrumented application. After setting PATH, LD_LIBRARY_PATH and HSA_XNACK as described earlier, a check of the binary with ldd yields the following,

$ ldd mini
        linux-vdso.so.1 (0x00007ffd1a5ae000)
        libclang_rt.asan-x86_64.so => /opt/rocm-6.1.0-99999/llvm/lib/clang/17.0.0/lib/linux/libclang_rt.asan-x86_64.so (0x00007fb9c14b6000)
        libamdhip64.so.5 => /opt/rocm-6.1.0-99999/lib/asan/libamdhip64.so.5 (0x00007fb9bedd3000)
        libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007fb9beba8000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fb9bea59000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fb9bea3e000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fb9be84a000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fb9be844000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fb9be821000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fb9be817000)
        libamd_comgr.so.2 => /opt/rocm-6.1.0-99999/lib/asan/libamd_comgr.so.2 (0x00007fb9b4382000)
        libhsa-runtime64.so.1 => /opt/rocm-6.1.0-99999/lib/asan/libhsa-runtime64.so.1 (0x00007fb9b3b00000)
        libnuma.so.1 => /lib/x86_64-linux-gnu/libnuma.so.1 (0x00007fb9b3af3000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fb9c2027000)
        libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007fb9b3ad7000)
        libtinfo.so.6 => /lib/x86_64-linux-gnu/libtinfo.so.6 (0x00007fb9b3aa7000)
        libelf.so.1 => /lib/x86_64-linux-gnu/libelf.so.1 (0x00007fb9b3a89000)
        libdrm.so.2 => /opt/amdgpu/lib/x86_64-linux-gnu/libdrm.so.2 (0x00007fb9b3a70000)
        libdrm_amdgpu.so.1 => /opt/amdgpu/lib/x86_64-linux-gnu/libdrm_amdgpu.so.1 (0x00007fb9b3a62000)

This confirms that the address sanitizer runtime is linked in, and the ASAN instrumented version of the runtime libraries are used. Checking the PATH yields

$ which llvm-symbolizer
/opt/rocm-6.1.0-99999/llvm/bin/llvm-symbolizer

Lastly, a check of the OS kernel version yields

$ uname -rv
5.15.0-73-generic #80~20.04.1-Ubuntu SMP Wed May 17 14:58:14 UTC 2023

which indicates that the required HMM support (kernel version > 5.6) is available. This completes the necessary setup. Running with m = 100, n1 = 11, n2 = 10 and c = 100 should produce a report for an invalid access by the last 10 threads.

=================================================================
==3141==ERROR: AddressSanitizer: heap-buffer-overflow on amdgpu device 0 at pc 0x7fb1410d2cc4
WRITE of size 4 in workgroup id (10,0,0)
  #0 0x7fb1410d2cc4 in set1(int*) at /home/dave/mini/mini.cpp:0:10

Thread ids and accessed addresses:
00 : 0x7fb14371d190 01 : 0x7fb14371d194 02 : 0x7fb14371d198 03 : 0x7fb14371d19c 04 : 0x7fb14371d1a0 05 : 0x7fb14371d1a4 06 : 0x7fb14371d1a8 07 : 0x7fb14371d1ac
08 : 0x7fb14371d1b0 09 : 0x7fb14371d1b4

0x7fb14371d190 is located 0 bytes after 400-byte region [0x7fb14371d000,0x7fb14371d190)
allocated by thread T0 here:
    #0 0x7fb151c76828 in hsa_amd_memory_pool_allocate /work/dave/git/compute/external/llvm-project/compiler-rt/lib/asan/asan_interceptors.cpp:692:3
    #1 ...

    #12 0x7fb14fb99ec4 in hipMalloc /work/dave/git/compute/external/clr/hipamd/src/hip_memory.cpp:568:3
    #13 0x226630 in hipError_t hipMalloc<int>(int**, unsigned long) /opt/rocm-6.1.0-99999/include/hip/hip_runtime_api.h:8367:12
    #14 0x226630 in main /home/dave/mini/mini.cpp:19:5
    #15 0x7fb14ef02082 in __libc_start_main /build/glibc-SzIz7B/glibc-2.31/csu/../csu/libc-start.c:308:16

Shadow bytes around the buggy address:
  0x7fb14371cf00: ...

=>0x7fb14371d180: 00 00[fa]fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x7fb14371d200: ...

Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:       fa
  ...
==3141==ABORTING

Running with m = 100, n1 = 10, n2 = 10 and c = 99 should produce a report for an invalid copy.

=================================================================
==2817==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x514000150dcc at pc 0x7f5509551aca bp 0x7ffc90a7ae50 sp 0x7ffc90a7a610
WRITE of size 400 at 0x514000150dcc thread T0
    #0 0x7f5509551ac9 in __asan_memcpy /work/dave/git/compute/external/llvm-project/compiler-rt/lib/asan/asan_interceptors_memintrinsics.cpp:61:3
    #1 ...

    #9 0x7f5507462a28 in hipMemcpy_common(void*, void const*, unsigned long, hipMemcpyKind, ihipStream_t*) /work/dave/git/compute/external/clr/hipamd/src/hip_memory.cpp:637:10
    #10 0x7f5507464205 in hipMemcpy /work/dave/git/compute/external/clr/hipamd/src/hip_memory.cpp:642:3
    #11 0x226844 in main /home/dave/mini/mini.cpp:22:5
    #12 0x7f55067c3082 in __libc_start_main /build/glibc-SzIz7B/glibc-2.31/csu/../csu/libc-start.c:308:16
    #13 0x22605d in _start (/home/dave/mini/mini+0x22605d)

0x514000150dcc is located 0 bytes after 396-byte region [0x514000150c40,0x514000150dcc)
allocated by thread T0 here:
    #0 0x7f5509553dcf in malloc /work/dave/git/compute/external/llvm-project/compiler-rt/lib/asan/asan_malloc_linux.cpp:69:3
    #1 0x226817 in main /home/dave/mini/mini.cpp:21:21
    #2 0x7f55067c3082 in __libc_start_main /build/glibc-SzIz7B/glibc-2.31/csu/../csu/libc-start.c:308:16

SUMMARY: AddressSanitizer: heap-buffer-overflow /work/dave/git/compute/external/llvm-project/compiler-rt/lib/asan/asan_interceptors_memintrinsics.cpp:61:3 in __asan_memcpy
Shadow bytes around the buggy address:
  0x514000150b00: ...

=>0x514000150d80: 00 00 00 00 00 00 00 00 00[04]fa fa fa fa fa fa
  0x514000150e00: ...

Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:       fa
  ...
==2817==ABORTING

Known issues with using GPU sanitizer#

  • Red zones must have a limited size, so it is possible for an invalid access to miss a red zone entirely and go undetected.

  • Lack of detection or false reports can be caused by the runtime not properly maintaining red zone shadows.

  • Lack of detection on the GPU might also be due to the implementation not instrumenting accesses to all GPU specific address spaces. For example, in the current implementation accesses to “private” or “stack” variables on the GPU are not instrumented, and accesses to HIP shared variables (also known as “local data store” or “LDS”) are also not instrumented.

  • It can also be the case that a memory fault is hit for an invalid address even with the instrumentation. This is usually caused by the invalid address being so wild that its shadow address is outside of any memory region, and the fault actually occurs on the access to the shadow address. It is also possible to hit a memory fault for the NULL pointer. While address 0 does have a shadow location, it is not poisoned by the runtime.

Using CMake#

Most components in ROCm support CMake. Projects depending on header-only or library components typically require CMake 3.5 or higher, whereas those that want to use the CMake HIP language support require CMake 3.21 or higher.

Finding dependencies#

Note

For a complete reference on how to deal with dependencies in CMake, refer to the CMake docs on find_package and the Using Dependencies Guide to get an overview of CMake related facilities.

In short, CMake supports finding dependencies in two ways:

  • In Module mode, it consults a file Find<PackageName>.cmake which tries to find the component in typical install locations and layouts. CMake ships a few dozen such scripts, but users and projects may ship them as well.

  • In Config mode, it locates a file named <packagename>-config.cmake or <PackageName>Config.cmake which describes the installed component in all regards needed to consume it.

ROCm predominantly relies on Config mode, one notable exception being the Module driving the compilation of HIP programs on NVIDIA runtimes. As such, when dependencies are not found in standard system locations, one either has to instruct CMake to search for package config files in additional folders using the CMAKE_PREFIX_PATH variable (a semicolon-separated list of file system paths), or set the <PackageName>_ROOT variable on a project-specific basis.

There are nearly a dozen ways to set these variables, and one may be more convenient than another depending on your workflow. Conceptually, the simplest is adding the variable to your CMake configuration command on the command line via -D CMAKE_PREFIX_PATH=.... AMD-packaged ROCm installs can typically be added to the config-file search paths as follows:

  • Windows: -D CMAKE_PREFIX_PATH=${env:HIP_PATH}

  • Linux: -D CMAKE_PREFIX_PATH=/opt/rocm

ROCm provides the respective config-file packages, and this enables find_package to be used directly. ROCm does not require any Find module as the config-file packages are shipped with the upstream projects, such as rocPRIM and other ROCm libraries.

For a complete guide on where and how ROCm may be installed on a system, refer to the installation guides for Linux and Windows.

Using HIP in CMake#

ROCm components providing a C/C++ interface support consumption via any C/C++ toolchain that CMake knows how to drive. ROCm also supports the CMake HIP language features, allowing users to program using the HIP single-source programming model. When a program (or translation-unit) uses the HIP API without compiling any GPU device code, HIP can be treated in CMake as a simple C/C++ library.

Using the HIP single-source programming model#

Source code written in the HIP dialect of C++ typically uses the .hip extension. When the HIP CMake language is enabled, it will automatically associate such source files with the HIP toolchain being used.

cmake_minimum_required(VERSION 3.21) # HIP language support requires 3.21
cmake_policy(VERSION 3.21.3...3.27)
project(MyProj LANGUAGES HIP)
add_executable(MyApp Main.hip)

Should you have existing CUDA code that is part of the source-compatible subset of HIP, you can tell CMake that, despite the .cu extension, these are HIP sources. Note that this mostly facilitates compiling kernel-only source files, as host-side CUDA API calls won't compile in this fashion.

add_library(MyLib MyLib.cu)
set_source_files_properties(MyLib.cu PROPERTIES LANGUAGE HIP)

CMake itself hosts only part of the HIP language support, such as defining HIP-specific properties, while the other half ships with the HIP implementation, such as ROCm. CMake searches for a file hip-lang-config.cmake that describes how the properties defined by CMake translate to toolchain invocations. If ROCm is installed using non-standard methods or layouts and CMake can't locate this file or detect parts of the SDK, there's a catch-all, last-resort variable consulted to locate this file, -D CMAKE_HIP_COMPILER_ROCM_ROOT:PATH=, which should be set to the root of the ROCm installation.

Note

Imported targets defined by hip-lang-config.cmake are for internal use only.

If the user doesn’t provide a semicolon-delimited list of device architectures via CMAKE_HIP_ARCHITECTURES, CMake selects a sensible default. However, if you know which devices you wish to target, it is advisable to set this variable explicitly.

Consuming ROCm C/C++ libraries#

Libraries such as rocBLAS, rocFFT, and MIOpen behave as ordinary C/C++ libraries. The example below illustrates a C++ application using MIOpen from CMake. It calls find_package(miopen), which provides the MIOpen imported target. This target can then be linked with target_link_libraries:

cmake_minimum_required(VERSION 3.5) # find_package(miopen) requires 3.5
cmake_policy(VERSION 3.5...3.27)
project(MyProj LANGUAGES CXX)
find_package(miopen)
add_library(MyLib ...)
target_link_libraries(MyLib PUBLIC MIOpen)

Note

Most libraries are designed as host-only APIs, so using a GPU device compiler is not necessary for downstream projects unless they use GPU device code.

Consuming the HIP API in C++ code#

Consuming the HIP API without compiling single-source GPU device code can be done using any C++ compiler. The find_package(hip) provides the hip::host imported target to use HIP in this scenario.

cmake_minimum_required(VERSION 3.5) # find_package(hip) requires 3.5
cmake_policy(VERSION 3.5...3.27)
project(MyProj LANGUAGES CXX)
find_package(hip REQUIRED)
add_executable(MyApp ...)
target_link_libraries(MyApp PRIVATE hip::host)

When mixing such CXX sources with HIP sources holding device-code, link only to hip::host. If HIP sources don’t have .hip as their extension, use set_source_files_properties(<hip_sources>… PROPERTIES LANGUAGE HIP) on them. Linking to hip::host will set all the necessary flags for the CXX sources while HIP sources inherit all flags from the built-in language support. Having HIP sources in a target will turn the LINKER_LANGUAGE into HIP.

Compiling device code in C++ language mode#

Attention

The workflow detailed here is considered legacy and is shown for understanding’s sake. It pre-dates the existence of HIP language support in CMake. If source code has HIP device code in it, it is a HIP source file and should be compiled as such. Only resort to the method below if your HIP-enabled CMake code path can’t mandate CMake version 3.21.

If code uses the HIP API and compiles GPU device code, it requires using a device compiler. The compiler for CMake can be set using either the CMAKE_C_COMPILER and CMAKE_CXX_COMPILER variables or the CC and CXX environment variables. This can be set when configuring CMake or put into a CMake toolchain file. The device compiler must be set to a compiler that supports AMD GPU targets, which is usually Clang.

The find_package(hip) provides the hip::device imported target to add all the flags necessary for device compilation.

cmake_minimum_required(VERSION 3.8) # cxx_std_11 requires 3.8
cmake_policy(VERSION 3.8...3.27)
project(MyProj LANGUAGES CXX)
find_package(hip REQUIRED)
add_library(MyLib ...)
target_link_libraries(MyLib PRIVATE hip::device)
target_compile_features(MyLib PRIVATE cxx_std_11)

Note

Compiling for the GPU device requires at least C++11.

This project can then be configured with the following CMake commands:

  • Windows: cmake -D CMAKE_CXX_COMPILER:PATH=${env:HIP_PATH}\bin\clang++.exe

  • Linux: cmake -D CMAKE_CXX_COMPILER:PATH=/opt/rocm/bin/amdclang++

These commands use the device compiler provided by the binary packages of the ROCm HIP SDK and repo.radeon.com, respectively.

When using the CXX language support to compile HIP device code, selecting the target GPU architectures is done by setting the GPU_TARGETS variable. CMAKE_HIP_ARCHITECTURES only exists when the HIP language is enabled. By default, GPU_TARGETS is set to some subset of the architectures currently supported by AMD ROCm. It can be set with the CMake option -D GPU_TARGETS="gfx1032;gfx1035".

ROCm CMake packages#

| Component | Package | Targets |
|---|---|---|
| HIP | hip | hip::host, hip::device |
| rocPRIM | rocprim | roc::rocprim |
| rocThrust | rocthrust | roc::rocthrust |
| hipCUB | hipcub | hip::hipcub |
| rocRAND | rocrand | roc::rocrand |
| rocBLAS | rocblas | roc::rocblas |
| rocSOLVER | rocsolver | roc::rocsolver |
| hipBLAS | hipblas | roc::hipblas |
| rocFFT | rocfft | roc::rocfft |
| hipFFT | hipfft | hip::hipfft |
| rocSPARSE | rocsparse | roc::rocsparse |
| hipSPARSE | hipsparse | roc::hipsparse |
| rocALUTION | rocalution | roc::rocalution |
| RCCL | rccl | rccl |
| MIOpen | miopen | MIOpen |
| MIGraphX | migraphx | migraphx::migraphx, migraphx::migraphx_c, migraphx::migraphx_cpu, migraphx::migraphx_gpu, migraphx::migraphx_onnx, migraphx::migraphx_tf |

Using CMake presets#

CMake command lines can grow to unwieldy lengths, depending on how specific users like to be when compiling code. This is the primary reason why projects tend to bake script snippets into their build definitions to control compiler warning levels, change CMake defaults (CMAKE_BUILD_TYPE or BUILD_SHARED_LIBS, to name just a few), and all sorts of other anti-patterns, all in the name of convenience.

The load on the command-line interface (CLI) starts immediately with selecting a toolchain, the set of utilities used to compile programs. To ease some of the toolchain-related pain, CMake consults the CC and CXX environment variables when setting the default CMAKE_C_COMPILER and CMAKE_CXX_COMPILER, respectively, but that is just the tip of the iceberg. There's a fair number of variables related to the toolchain itself (typically supplied using toolchain files), and that is before considering user preferences or project-specific options.

IDEs supporting CMake (Visual Studio, Visual Studio Code, CLion, and so on) all came up with their own ways to register command-line fragments for different purposes in a set-up-and-forget fashion, for quick assembly using graphical front-ends. This is all nice, but such configurations aren't portable, nor can they be reused in Continuous Integration (CI) pipelines. CMake has condensed existing practice into a portable JSON format that works in all IDEs and can be invoked from any command line: CMake Presets.

There are two types of preset files: one supplied by the project, called CMakePresets.json, which is meant to be committed to version control and is typically used to drive CI; and one meant for the user to provide, called CMakeUserPresets.json, typically used to house user preferences and to adapt the build to the user's environment. These JSON files are allowed to include other JSON files, and the user presets always implicitly include the non-user variant.

Using HIP with presets#

Following is an example CMakeUserPresets.json file which actually compiles the amd/rocm-examples suite of sample applications on a typical ROCm installation:

{
  "version": 3,
  "cmakeMinimumRequired": {
    "major": 3,
    "minor": 21,
    "patch": 0
  },
  "configurePresets": [
    {
      "name": "layout",
      "hidden": true,
      "binaryDir": "${sourceDir}/build/${presetName}",
      "installDir": "${sourceDir}/install/${presetName}"
    },
    {
      "name": "generator-ninja-multi-config",
      "hidden": true,
      "generator": "Ninja Multi-Config"
    },
    {
      "name": "toolchain-makefiles-c/c++-amdclang",
      "hidden": true,
      "cacheVariables": {
        "CMAKE_C_COMPILER": "/opt/rocm/bin/amdclang",
        "CMAKE_CXX_COMPILER": "/opt/rocm/bin/amdclang++",
        "CMAKE_HIP_COMPILER": "/opt/rocm/bin/amdclang++"
      }
    },
    {
      "name": "clang-strict-iso-high-warn",
      "hidden": true,
      "cacheVariables": {
        "CMAKE_C_FLAGS": "-Wall -Wextra -pedantic",
        "CMAKE_CXX_FLAGS": "-Wall -Wextra -pedantic",
        "CMAKE_HIP_FLAGS": "-Wall -Wextra -pedantic"
      }
    },
    {
      "name": "ninja-mc-rocm",
      "displayName": "Ninja Multi-Config ROCm",
      "inherits": [
        "layout",
        "generator-ninja-multi-config",
        "toolchain-makefiles-c/c++-amdclang",
        "clang-strict-iso-high-warn"
      ]
    }
  ],
  "buildPresets": [
    {
      "name": "ninja-mc-rocm-debug",
      "displayName": "Debug",
      "configuration": "Debug",
      "configurePreset": "ninja-mc-rocm"
    },
    {
      "name": "ninja-mc-rocm-release",
      "displayName": "Release",
      "configuration": "Release",
      "configurePreset": "ninja-mc-rocm"
    },
    {
      "name": "ninja-mc-rocm-debug-verbose",
      "displayName": "Debug (verbose)",
      "configuration": "Debug",
      "configurePreset": "ninja-mc-rocm",
      "verbose": true
    },
    {
      "name": "ninja-mc-rocm-release-verbose",
      "displayName": "Release (verbose)",
      "configuration": "Release",
      "configurePreset": "ninja-mc-rocm",
      "verbose": true
    }
  ],
  "testPresets": [
    {
      "name": "ninja-mc-rocm-debug",
      "displayName": "Debug",
      "configuration": "Debug",
      "configurePreset": "ninja-mc-rocm",
      "execution": {
        "jobs": 0
      }
    },
    {
      "name": "ninja-mc-rocm-release",
      "displayName": "Release",
      "configuration": "Release",
      "configurePreset": "ninja-mc-rocm",
      "execution": {
        "jobs": 0
      }
    }
  ]
}

Note

Getting presets to work reliably on Windows requires some CMake improvements and/or support from compiler vendors. (Refer to Add support to the Visual Studio generators and Sourcing environment scripts.)

How ROCm uses PCIe atomics#

ROCm PCIe feature and overview of BAR memory#

ROCm is an extension of the HSA platform architecture, so it shares the queuing model, memory model, and signaling and synchronization protocols. Platform atomics are integral to performing queuing and signaling memory operations where there may be multiple writers across CPU and GPU agents.

The full list of HSA system architecture platform requirements is available here: HSA Sys Arch Features.

AMD ROCm software uses the PCI Express 3.0 (Peripheral Component Interconnect Express [PCIe] 3.0) features for atomic read-modify-write transactions, which extend inter-processor synchronization mechanisms to I/O devices to support the defined set of HSA capabilities needed for queuing and signaling memory operations.

The new PCIe atomic operations operate as completers for CAS (Compare and Swap), FetchADD, and SWAP atomics. The atomic operations are initiated by the I/O device, which supports 32-bit, 64-bit, and 128-bit operands whose target addresses have to be naturally aligned to the operation size.

Platform atomics are used in ROCm in the following ways:

  • Update HSA queue’s read_dispatch_id: a 64-bit atomic add used by the command processor on the GPU agent to update the packet ID it processed.

  • Update HSA queue’s write_dispatch_id: a 64-bit atomic add used by the CPU and GPU agents to support multi-writer queue insertions.

  • Update HSA signals – 64-bit atomic operations are used for CPU and GPU synchronization.

The PCIe 3.0 atomic operations feature allows atomic transactions to be requested by, routed through, and completed by PCIe components. Routing and completion do not require software support. Component support for each is detectable via the Device Capabilities 2 (DevCap2) register. Upstream bridges need to have atomic operations routing enabled, or the atomic operations will fail even though the PCIe endpoint and PCIe I/O devices are capable of atomic operations.

To route atomic operations between two or more Root Ports, each associated Root Port must indicate that capability via the atomic operations routing supported bit in the DevCap2 register.

If your system has a PCI Express switch, it needs to support atomic operations routing. Atomic operations requests are permitted only if a component's DEVCTL2.ATOMICOP_REQUESTER_ENABLE field is set. These requests can only be serviced if the upstream components support atomic operation completion or routing to a component that does. If the atomic operations routing support bit is 1, routing is supported; if it is 0, routing is not supported.

An atomic operation is a non-posted transaction supporting 32-bit and 64-bit address formats, so there must be a Completion response containing the result of the operation. Errors associated with the operation (an uncorrectable error accessing the target location or carrying out the atomic operation) are signaled to the requester by setting the Completion Status field in the completion descriptor to Completer Abort (CA) or Unsupported Request (UR).

To understand more about how PCIe atomic operations work, see PCIe atomics.

Linux Kernel Patch to pci_enable_atomic_request

There are also a number of papers which talk about these new capabilities:

Other I/O devices with PCIe atomics support:

  • Mellanox ConnectX-5 InfiniBand Card

  • Cray Aries Interconnect

  • Xilinx 7 Series Devices

Future bus technologies with richer I/O atomic operation support:

  • GenZ

New PCIe endpoints with atomics support, in addition to AMD Ryzen and EPYC CPUs and Intel Haswell or newer CPUs with PCIe Generation 3.0 support:

  • Mellanox Bluefield SOC

  • Cavium Thunder X2

In ROCm, we also take advantage of PCIe ID-based ordering technology for peer-to-peer (P2P) transfers when the GPU originates two writes to two different targets:

  • Write to another GPU memory

  • Write to system memory to indicate transfer complete

These writes are routed to different ends of the computer, but the write to system memory indicating transfer completion must occur AFTER the P2P write to the GPU has completed.

BAR memory overview#

On a Xeon E5-based system, above-4GB PCIe addressing can be turned on in the BIOS; if so, you need to set the memory-mapped input/output (MMIO) base address (MMIOH base) and range (MMIO high size) in the BIOS.

On a Supermicro system, you need to set the following in the system BIOS:

  • Advanced->PCIe/PCI/PnP configuration-> Above 4G Decoding = Enabled

  • Advanced->PCIe/PCI/PnP Configuration->MMIOH Base = 512G

  • Advanced->PCIe/PCI/PnP Configuration->MMIO High Size = 256G

When Large BAR capability is supported, there is a Large BAR VBIOS, which also disables the I/O BAR.

GFX9 and Vega10 GPUs have a 44-bit physical address and a 48-bit virtual address.

  • BAR0-1 registers: 64bit, prefetchable, GPU memory. 8GB or 16GB depending on Vega10 SKU. Must be placed < 2^44 to support P2P access from other Vega10.

  • BAR2-3 registers: 64bit, prefetchable, Doorbell. Must be placed < 2^44 to support P2P access from other Vega10.

  • BAR4 register: Optional, not a boot device.

  • BAR5 register: 32bit, non-prefetchable, MMIO. Must be placed < 4GB.

Here is how the base address register (BAR) works on GFX8 GPUs, which have a 40-bit physical address limit:

11:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Fiji [Radeon R9 FURY / NANO
Series] (rev c1)

Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Device 0b35

Flags: bus master, fast devsel, latency 0, IRQ 119

Memory at bf40000000 (64-bit, prefetchable) [size=256M]

Memory at bf50000000 (64-bit, prefetchable) [size=2M]

I/O ports at 3000 [size=256]

Memory at c7400000 (32-bit, non-prefetchable) [size=256K]

Expansion ROM at c7440000 [disabled] [size=128K]

Legend:

1 : GPU Frame Buffer BAR – In this example it happens to be 256M, but typically this will be the size of the GPU memory (typically 4GB+). This BAR has to be placed < 2^40 to allow peer-to-peer access from other GFX8 AMD GPUs. For GFX9 (Vega GPU) the BAR has to be placed < 2^44 to allow peer-to-peer access from other GFX9 AMD GPUs.

2 : Doorbell BAR – The size of this BAR is typically < 10MB (currently fixed at 2MB) for this generation of GPUs. This BAR has to be placed < 2^40 to allow peer-to-peer access from other current generation AMD GPUs.

3 : IO BAR – This is for legacy VGA and boot device support, but since the GPUs in this project are not VGA devices (they are headless), this is not a concern even if the SBIOS does not set it up.

4 : MMIO BAR – This is required for the AMD driver software to access the configuration registers. Since the remainder of the BAR available is only 1 DWORD (32-bit), this is placed < 4GB. It is fixed at 256KB.

5 : Expansion ROM – This is required for the AMD Driver SW to access the GPU video-bios. This is currently fixed at 128KB.

For more information, you can review Overview of Changes to PCI Express 3.0.

Deep learning: Inception V3 with PyTorch#

Deep learning training#

Deep-learning models are designed to capture the complexity of the problem and the underlying data. These models are “deep,” comprising multiple component layers. Training is finding the best parameters for each model layer to achieve a well-defined objective.

The training data consists of input features in supervised learning, similar to what the learned model is expected to see during the evaluation or inference phase. The target output is also included, which serves to teach the model. A loss metric is defined as part of training that evaluates the model’s performance during the training process.

Training also includes the choice of an optimization algorithm that reduces the loss by adjusting the model’s parameters. Training is an iterative process where training data is fed in, usually split into different batches, with the entirety of the training data passed during one training epoch. Training usually is run for multiple epochs.

Training phases#

Training occurs in multiple phases for every batch of training data. The following table provides an explanation of the types of training phases.

Types of Training Phases#

| Types of Phases | Description |
|---|---|
| Forward Pass | The input features are fed into the model, whose parameters may be randomly initialized initially. Activations (outputs) of each layer are retained during this pass to help in the loss gradient computation during the backward pass. |
| Loss Computation | The output is compared against the target outputs, and the loss is computed. |
| Backward Pass | The loss is propagated backward, and the model's error gradients are computed and stored for each trainable parameter. |
| Optimization Pass | The optimization algorithm updates the model parameters using the stored error gradients. |
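
These phases map directly onto a few lines of PyTorch. The following minimal sketch, with a hypothetical model and random data purely for illustration, shows one training step:

import torch

# Hypothetical model, loss, and optimizer purely for illustration.
model = torch.nn.Linear(16, 4)
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

inputs = torch.randn(8, 16)             # one batch of input features
targets = torch.randint(0, 4, (8,))     # target outputs for the batch

outputs = model(inputs)                 # forward pass (activations retained)
loss = criterion(outputs, targets)      # loss computation against the targets
optimizer.zero_grad()
loss.backward()                         # backward pass: error gradients stored per parameter
optimizer.step()                        # optimization pass: parameters updated from the gradients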

Training is different from inference, particularly from the hardware perspective. The following table shows the contrast between training and inference.

Training vs. Inference#

| Training | Inference |
|---|---|
| Training is measured in hours/days. | Inference is measured in minutes. |
| Training is generally run offline in a data center or cloud setting. | Inference is made on edge devices. |
| The memory requirements for training are higher than for inference due to storing intermediate data, such as activations and error gradients. | The memory requirements are lower for inference than for training. |
| Data for training is available on disk before the training process and is generally significant. Training performance is measured by how fast the data batches can be processed. | Inference data usually arrives stochastically and may be batched to improve performance. Inference performance is generally measured by the throughput to process a batch of data and the delay in responding to the input (latency). |

Different data types are typically chosen for training (FP32, BF16) and inference (FP16, INT8). The computation hardware has different specializations for different data types, leading to improved performance when a faster data type can be selected for the corresponding task.
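
As an illustration of this choice, the following PyTorch sketch (with hypothetical shapes, not tied to the case studies below) runs a forward pass in BF16 mixed precision, as might be done during training, and then casts the model to FP16 for inference:

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"   # "cuda" also selects ROCm GPUs
model = torch.nn.Linear(128, 10).to(device)
x = torch.randn(32, 128, device=device)

# Training-style forward pass in mixed precision: compute in bfloat16, keep FP32 parameters.
with torch.autocast(device_type=device, dtype=torch.bfloat16):
    y = model(x)

# Inference-style deployment: cast weights and inputs to FP16 for lower memory use.
if device == "cuda":
    model_fp16 = model.half()
    with torch.no_grad():
        y = model_fp16(x.half())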

Case studies#

The following sections contain case studies for the Inception V3 model.

Inception V3 with PyTorch#

Convolutional Neural Networks (CNNs) are a form of artificial neural network commonly used for image processing. One of the core layers of such a network is the convolutional layer, which convolves the input with a weight tensor and passes the result to the next layer. Inception V3[^inception_arch] is an architectural development over the ImageNet competition-winning entry, AlexNet, using deeper and broader networks while attempting to meet computational and memory budgets.
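
To make the convolution description concrete, the following minimal PyTorch sketch (layer sizes are hypothetical) applies a single convolutional layer to one RGB image at Inception V3's 299x299 input size:

import torch

# A convolutional layer with a 3x3 weight tensor: 3 input channels (RGB), 16 output channels.
conv = torch.nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)

x = torch.randn(1, 3, 299, 299)   # a batch containing one 299x299 RGB image
y = conv(x)                       # convolve the input with the layer's weight tensor
print(y.shape)                    # torch.Size([1, 16, 297, 297]), passed to the next layer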

The implementation uses PyTorch as a framework. This case study utilizes TorchVision, a repository of popular datasets and model architectures, for obtaining the model. TorchVision also provides pre-trained weights as a starting point to develop new models or fine-tune the model for a new task.

Evaluating a pre-trained model#

This example introduces a simple image classification task using a pre-trained Inception V3 model. It does not involve training but utilizes an already pre-trained model from TorchVision.

This example is adapted from the PyTorch research hub page on Inception V3.

Follow these steps:

  1. Run the PyTorch ROCm-based Docker image or refer to the section Installing PyTorch for setting up a PyTorch environment on ROCm.

    docker run -it -v $HOME:/data --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 8G rocm/pytorch:latest
    
  2. Run the Python shell and import packages and libraries for model creation.

    import torch
    import torchvision
    
  3. Set the model in evaluation mode. Evaluation mode directs PyTorch not to store intermediate data, which would have been used in training.

    model = torch.hub.load('pytorch/vision:v0.10.0', 'inception_v3', pretrained=True)
    model.eval()
    
  4. Download a sample image for inference.

    import urllib
    url, filename = ("https://github.com/pytorch/hub/raw/master/images/dog.jpg", "dog.jpg")
    try: urllib.URLopener().retrieve(url, filename)
    except: urllib.request.urlretrieve(url, filename)
    
  5. Import torchvision and PILImage support libraries.

    from PIL import Image
    from torchvision import transforms
    input_image = Image.open(filename)
    
  6. Apply preprocessing and normalization.

    preprocess = transforms.Compose([
        transforms.Resize(299),
        transforms.CenterCrop(299),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])
    
  7. Preprocess the image into an input tensor and unsqueeze it to create a mini-batch. Move the input and model to the GPU if available.

    input_tensor = preprocess(input_image)
    input_batch = input_tensor.unsqueeze(0)
    if torch.cuda.is_available():
        input_batch = input_batch.to('cuda')
        model.to('cuda')
    
  8. Run the model and compute the class probabilities.

    with torch.no_grad():
        output = model(input_batch)
    print(output[0])
    probabilities = torch.nn.functional.softmax(output[0], dim=0)
    print(probabilities)
    
  9. To understand the probabilities, download and examine the ImageNet labels.

    wget https://raw.githubusercontent.com/pytorch/hub/master/imagenet_classes.txt
    
  10. Read the categories and show the top categories for the image.

    with open("imagenet_classes.txt", "r") as f:
        categories = [s.strip() for s in f.readlines()]
    top5_prob, top5_catid = torch.topk(probabilities, 5)
    for i in range(top5_prob.size(0)):
        print(categories[top5_catid[i]], top5_prob[i].item())
    
Training Inception V3#

The previous section focused on downloading and using the Inception V3 model for a simple image classification task. This section walks through training the model on a new dataset.

Follow these steps:

  1. Run the PyTorch ROCm Docker image or refer to the section Installing PyTorch for setting up a PyTorch environment on ROCm.

    docker pull rocm/pytorch:latest
    docker run -it --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 8G rocm/pytorch:latest
    
  2. Download an ImageNet database. This example uses tiny-imagenet-200[^Stanford_deep_learning], a smaller ImageNet variant with 200 image classes and a training dataset of 100,000 images downsized to 64x64 color images.

    wget http://cs231n.stanford.edu/tiny-imagenet-200.zip
    
  3. Process the database to set the validation directory to the format expected by PyTorch’s DataLoader.

  4. Run the following script:

    import io
    import glob
    import os
    from shutil import move
    from os.path import join
    from os import listdir, rmdir
    target_folder = './tiny-imagenet-200/val/'
    val_dict = {}
    with open('./tiny-imagenet-200/val/val_annotations.txt', 'r') as f:
        for line in f.readlines():
            split_line = line.split('\t')
            val_dict[split_line[0]] = split_line[1]
    
    paths = glob.glob('./tiny-imagenet-200/val/images/*')
    for path in paths:
        file = path.split('/')[-1]
        folder = val_dict[file]
        if not os.path.exists(target_folder + str(folder)):
            os.mkdir(target_folder + str(folder))
            os.mkdir(target_folder + str(folder) + '/images')
    
    for path in paths:
        file = path.split('/')[-1]
        folder = val_dict[file]
        dest = target_folder + str(folder) + '/images/' + str(file)
        move(path, dest)
    
    rmdir('./tiny-imagenet-200/val/images')
    
  5. Open a Python shell.

  6. Import dependencies, including Torch, OS, and TorchVision.

    import torch
    import os
    import torchvision
    from torchvision import transforms
    from torchvision.transforms.functional import InterpolationMode
    
  7. Set parameters to guide the training process.

    Note

    The device is set to "cuda". In PyTorch, "cuda" is a generic keyword to denote a GPU.

    device = "cuda"
    
  8. Set the data_path to the location of the training and validation data. In this case, the tiny-imagenet-200 is present as a subdirectory to the current directory.

    data_path = "tiny-imagenet-200"
    

    The training image size is cropped for input into Inception V3.

    train_crop_size = 299
    
  9. To smooth the image, use bilinear interpolation, a resampling method that uses the distance weighted average of the four nearest pixel values to estimate a new pixel value.

    interpolation = "bilinear"
    

    The next parameters control the size to which the validation image is cropped and resized.

    val_crop_size = 299
    val_resize_size = 342
    

    The pre-trained Inception V3 model is chosen to be downloaded from torchvision.

    model_name = "inception_v3"
    pretrained = True
    

    During each training step, a batch of images is processed to compute the loss gradient and perform the optimization. In the following setting, the size of the batch is determined.

    batch_size = 32
    

    This refers to the number of CPU threads the data loader uses to perform efficient multi-process data loading.

    num_workers = 16
    

    The torch.optim package provides methods to adjust the learning rate as the training progresses. This example uses the StepLR scheduler, which decays the learning rate by lr_gamma at every lr_step_size number of epochs.

    learning_rate = 0.1
    momentum = 0.9
    weight_decay = 1e-4
    lr_step_size = 30
    lr_gamma = 0.1
    

    Note

    One training epoch is when the neural network passes an entire dataset forward and backward.

    epochs = 90
    

    The train and validation directories are determined.

    train_dir = os.path.join(data_path, "train")
    val_dir = os.path.join(data_path, "val")
    
  10. Set up the training and testing data loaders.

    interpolation = InterpolationMode(interpolation)
    
    TRAIN_TRANSFORM_IMG = transforms.Compose([
        # Normalizing and standardizing the image
        transforms.RandomResizedCrop(train_crop_size, interpolation=interpolation),
        transforms.PILToTensor(),
        transforms.ConvertImageDtype(torch.float),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                            std=[0.229, 0.224, 0.225] )
        ])
    dataset = torchvision.datasets.ImageFolder(
        train_dir,
        transform=TRAIN_TRANSFORM_IMG
    )
    TEST_TRANSFORM_IMG = transforms.Compose([
        transforms.Resize(val_resize_size, interpolation=interpolation),
        transforms.CenterCrop(val_crop_size),
        transforms.PILToTensor(),
        transforms.ConvertImageDtype(torch.float),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                            std=[0.229, 0.224, 0.225] )
        ])
    
    dataset_test = torchvision.datasets.ImageFolder(
        val_dir,
        transform=TEST_TRANSFORM_IMG
    )
    
    print("Creating data loaders")
    train_sampler = torch.utils.data.RandomSampler(dataset)
    test_sampler = torch.utils.data.SequentialSampler(dataset_test)
    
    data_loader = torch.utils.data.DataLoader(
        dataset,
        batch_size=batch_size,
        sampler=train_sampler,
        num_workers=num_workers,
        pin_memory=True
    )
    
    data_loader_test = torch.utils.data.DataLoader(
        dataset_test, batch_size=batch_size, sampler=test_sampler, num_workers=num_workers, pin_memory=True
    )
    

    Note

    Use torchvision to obtain the Inception V3 model. Use the pre-trained model weights to speed up training.

    print("Creating model")
    print("Num classes = ", len(dataset.classes))
    model = torchvision.models.__dict__[model_name](pretrained=pretrained)
    
  11. Adapt Inception V3 for the current dataset. tiny-imagenet-200 contains only 200 classes, whereas Inception V3 is designed for 1,000-class output. The last layer of Inception V3 is replaced to match the output features required.

    model.fc = torch.nn.Linear(model.fc.in_features, len(dataset.classes))
    model.aux_logits = False
    model.AuxLogits = None
    
  12. Move the model to the GPU device.

    model.to(device)
    
  13. Set the loss criteria. For this example, Cross Entropy Loss[^cross_entropy] is used.

    criterion = torch.nn.CrossEntropyLoss()
    
  14. Set the optimizer to Stochastic Gradient Descent.

    optimizer = torch.optim.SGD(
        model.parameters(),
        lr=learning_rate,
        momentum=momentum,
        weight_decay=weight_decay
    )
    
  15. Set the learning rate scheduler.

    lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=lr_step_size, gamma=lr_gamma)
    
  16. Iterate over epochs. Each epoch is a complete pass through the training data.

    print("Start training")
    for epoch in range(epochs):
        model.train()
        epoch_loss = 0
        len_dataset = 0
    
  17. Iterate over steps. The data is processed in batches, and each step passes through a full batch.

    for step, (image, target) in enumerate(data_loader):
    
  18. Pass the image and target to the GPU device.

    image, target = image.to(device), target.to(device)
    

    The following is the core training logic:

    a. The image is fed into the model.

    b. The output is compared with the target in the training data to obtain the loss.

    c. This loss is back propagated to all parameters that require optimization.

    d. The optimizer updates the parameters based on the selected optimization algorithm.

            output = model(image)
            loss = criterion(output, target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    

    The epoch loss is updated, and the step loss prints.

            epoch_loss += output.shape[0] * loss.item()
            len_dataset += output.shape[0];
            if step % 10 == 0:
                print('Epoch: ', epoch, '| step : %d' % step, '| train loss : %0.4f' % loss.item() )
        epoch_loss = epoch_loss / len_dataset
        print('Epoch: ', epoch, '| train loss :  %0.4f' % epoch_loss )
    

    The learning rate is updated at the end of each epoch.

    lr_scheduler.step()
    

    After training for the epoch, the model evaluates against the validation dataset.

    model.eval()
    with torch.inference_mode():
        running_loss = 0
        for step, (image, target) in enumerate(data_loader_test):
            image, target = image.to(device), target.to(device)

            output = model(image)
            loss = criterion(output, target)

            running_loss += loss.item()
    running_loss = running_loss / len(data_loader_test)
    print('Epoch: ', epoch, '| test loss : %0.4f' % running_loss )
    
  19. Save the model for use in inferencing tasks.

# save model
torch.save(model.state_dict(), "trained_inception_v3.pt")

Plotting the train and test loss shows both metrics reducing over training epochs. This is demonstrated in the following image.

Inception V3 train and loss graph
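
Such a plot can be produced with Matplotlib if the per-epoch losses are collected into lists during the loop above, for example by appending epoch_loss and running_loss each epoch. The list values below are placeholders for illustration only:

import matplotlib.pyplot as plt

# Placeholder values; in practice, append epoch_loss and running_loss to these
# lists inside the training loop above.
train_losses = [2.9, 2.1, 1.6, 1.3, 1.1]
test_losses = [3.0, 2.4, 2.0, 1.8, 1.7]

plt.plot(train_losses, label="train loss")
plt.plot(test_losses, label="test loss")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.savefig("inception_v3_loss.png")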

Custom model with CIFAR-10 on PyTorch#

The Canadian Institute for Advanced Research (CIFAR)-10 dataset is a subset of the Tiny Images dataset (which contains 80 million 32x32 images collected from the Internet) and consists of 60,000 32x32 color images. The images are labeled with one of 10 mutually exclusive classes: airplane, motor car, bird, cat, deer, dog, frog, cruise ship, stallion, and truck (but not pickup truck). There are 6,000 images per class, with 5,000 training and 1,000 testing images per class. Let us prepare a custom model for classifying these images using the PyTorch framework, going step by step as illustrated below.

Follow these steps:

  1. Import dependencies, including Torch, OS, and TorchVision.

    import torch
    import torchvision
    import torchvision.transforms as transforms
    import matplotlib.pyplot as plot
    import numpy as np
    
  2. The output of torchvision datasets is PILImage images of range [0, 1]. Transform them to Tensors of normalized range [-1, 1].

    transform = transforms.Compose(
            [transforms.ToTensor(),
                transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
    

    During each training step, a batch of images is processed to compute the loss gradient and perform the optimization. In the following setting, the size of the batch is determined.

    batch_size = 4
    
  3. Download the dataset train and test datasets as follows. Specify the batch size, shuffle the dataset once, and specify the number of workers to the number of CPU threads used by the data loader to perform efficient multi-process data loading.

    train_set = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
    train_loader = torch.utils.data.DataLoader(train_set, batch_size=batch_size, shuffle=True, num_workers=2)
    
  4. Follow the same procedure for the testing set.

    test_set = torchvision.datasets.CIFAR10(root='./data', train=False, download=True, transform=transform)
    test_loader = torch.utils.data.DataLoader(test_set, batch_size=batch_size, shuffle=False, num_workers=2)
    print("test set and test loader")
    
  5. Specify the defined classes of images belonging to this dataset.

    classes = ('Aeroplane', 'motorcar', 'bird', 'cat', 'deer', 'puppy', 'frog', 'stallion', 'cruise', 'truck')
    print("defined classes")
    
  6. Denormalize the images and then iterate over them.

    global image_number
    image_number = 0
    def show_image(img):
        global image_number
        image_number = image_number + 1
        img = img / 2 + 0.5     # de-normalizing input image
        npimg = img.numpy()
        plot.imshow(np.transpose(npimg, (1, 2, 0)))
        plot.savefig("fig{}.jpg".format(image_number))
        print("fig{}.jpg".format(image_number))
        plot.show()
    data_iter = iter(train_loader)
    images, labels = next(data_iter)
    show_image(torchvision.utils.make_grid(images))
    print(' '.join('%5s' % classes[labels[j]] for j in range(batch_size)))
    print("image created and saved ")
    
  7. Import the torch.nn for constructing neural networks and torch.nn.functional to use the convolution functions.

    import torch.nn as nn
    import torch.nn.functional as F
    
  8. Define the CNN (Convolutional Neural Network) and the relevant activation functions.

    class Net(nn.Module):
        def __init__(self):
            super().__init__()
            self.conv1 = nn.Conv2d(3, 6, 5)
            self.pool = nn.MaxPool2d(2, 2)
            self.conv2 = nn.Conv2d(6, 16, 5)
            self.fc1 = nn.Linear(16 * 5 * 5, 120)
            self.fc2 = nn.Linear(120, 84)
            self.fc3 = nn.Linear(84, 10)
    
        def forward(self, x):
            x = self.pool(F.relu(self.conv1(x)))
            x = self.pool(F.relu(self.conv2(x)))
            x = torch.flatten(x, 1) # flatten all dimensions except batch
            x = F.relu(self.fc1(x))
            x = F.relu(self.fc2(x))
            x = self.fc3(x)
            return x
    net = Net()
    print("created Net() ")
    
  9. Import the torch.optim package, which provides the Stochastic Gradient Descent optimizer used below.

    import torch.optim as optim
    
  10. Set the loss criterion and the optimizer. For this example, Cross Entropy Loss[^cross_entropy] and Stochastic Gradient Descent are used.

    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)
    
  11. Iterate over epochs. Each epoch is a complete pass through the training data.

    for epoch in range(2):  # loop over the dataset multiple times
    
        running_loss = 0.0
        for i, data in enumerate(train_loader, 0):
            # get the inputs; data is a list of [inputs, labels]
            inputs, labels = data
    
            # zero the parameter gradients
            optimizer.zero_grad()
    
            # forward + backward + optimize
            outputs = net(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
    
            # print statistics
            running_loss += loss.item()
            if i % 2000 == 1999:    # print every 2000 mini-batches
                print('[%d, %5d] loss: %.3f' % (epoch + 1, i + 1, running_loss / 2000))
                running_loss = 0.0
    print('Finished Training')
    
    PATH = './cifar_net.pth'
    torch.save(net.state_dict(), PATH)
    print("saved model to path :",PATH)
    net = Net()
    net.load_state_dict(torch.load(PATH))
    print("loding back saved model")
    outputs = net(images)
    _, predicted = torch.max(outputs, 1)
    print('Predicted: ', ' '.join('%5s' % classes[predicted[j]] for j in range(4)))
    correct = 0
    total = 0
    

    As this is not training, calculating the gradients for outputs is not required.

    # calculate outputs by running images through the network
    with torch.no_grad():
        for data in test_loader:
            images, labels = data
            # calculate outputs by running images through the network
            outputs = net(images)
            # the class with the highest energy is what you can choose as prediction
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    print('Accuracy of the network on the 10000 test images: %d %%' % ( 100 * correct / total))
    # prepare to count predictions for each class
    correct_pred = {classname: 0 for classname in classes}
    total_pred = {classname: 0 for classname in classes}
    
    # again no gradients needed
    with torch.no_grad():
        for data in test_loader:
            images, labels = data
            outputs = net(images)
            _, predictions = torch.max(outputs, 1)
            # collect the correct predictions for each class
            for label, prediction in zip(labels, predictions):
                if label == prediction:
                    correct_pred[classes[label]] += 1
                total_pred[classes[label]] += 1
    # print accuracy for each class
    for classname, correct_count in correct_pred.items():
        accuracy = 100 * float(correct_count) / total_pred[classname]
        print("Accuracy for class {:5s} is: {:.1f} %".format(classname,accuracy))
    
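
The loop above runs on the CPU. To train the same network on a ROCm GPU, move the model and each batch of data to the device, as the Inception V3 example does. The following is a minimal sketch assuming net, criterion, optimizer, and train_loader are defined as above:

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"   # "cuda" also selects ROCm GPUs
net.to(device)

for epoch in range(2):
    running_loss = 0.0
    for i, data in enumerate(train_loader, 0):
        inputs, labels = data[0].to(device), data[1].to(device)   # move the batch to the GPU

        optimizer.zero_grad()
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()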

Case study: TensorFlow with Fashion-MNIST#

Fashion-MNIST is a dataset that contains 70,000 grayscale images in 10 categories.

Implement and train a neural network model using the TensorFlow framework to classify images of clothing, like sneakers and shirts.

The dataset has 60,000 images you will use to train the network and 10,000 to evaluate how accurately the network learned to classify images. The Fashion-MNIST dataset can be accessed via TensorFlow internal libraries.

Access the source code from the following repository:

ROCm/tensorflow_fashionmnist

To understand the code step by step, follow these steps:

  1. Import libraries like TensorFlow, NumPy, and Matplotlib to train the neural network and calculate and plot graphs.

    import tensorflow as tf
    import numpy as np
    import matplotlib.pyplot as plt
    
  2. To verify that TensorFlow is installed, print the version of TensorFlow by using the below print statement:

    print(tf.__version__)
    
  3. Load the Fashion-MNIST dataset from the available internal libraries to analyze and train a neural network on it. Loading the dataset returns four NumPy arrays. The model uses the training set arrays, train_images and train_labels, to learn.

  4. The model is tested against the test set: the test_images and test_labels arrays.

    fashion_mnist = tf.keras.datasets.fashion_mnist
    (train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()
    

    Since you have 10 types of images in the dataset, assign labels from zero to nine. Each image is assigned one label. The images are 28x28 NumPy arrays, with pixel values ranging from zero to 255.

  5. Each image is mapped to a single label. Since the class names are not included with the dataset, store them, and later use them when plotting the images:

    class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat','Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']
    
  6. Use this code to explore the dataset by knowing its dimensions:

    train_images.shape
    
  7. Use this code to print the size of this training set:

    print(len(train_labels))
    
  8. Use this code to print the labels of this training set:

    print(train_labels)
    
  9. Preprocess the data before training the network. Start by inspecting the first image; its pixel values fall in the range of zero to 255.

    plt.figure()
    plt.imshow(train_images[0])
    plt.colorbar()
    plt.grid(False)
    plt.show()
    

  10. From the above picture, you can see that the values range from zero to 255. Before training the neural network on them, you must bring them into the range of zero to one. Hence, divide the values by 255.

    train_images = train_images / 255.0
    
    test_images = test_images / 255.0
    
  11. To ensure the data is in the correct format and ready to build and train the network, display the first 25 images from the training set and the class name below each image.

    plt.figure(figsize=(10,10))
    for i in range(25):
        plt.subplot(5,5,i+1)
        plt.xticks([])
        plt.yticks([])
        plt.grid(False)
        plt.imshow(train_images[i], cmap=plt.cm.binary)
        plt.xlabel(class_names[train_labels[i]])
    plt.show()
    

    The basic building block of a neural network is the layer. Layers extract representations from the data fed into them. Deep learning consists of chaining together simple layers. Most layers, such as tf.keras.layers.Dense, have parameters that are learned during training.

    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(10)
    ])
    
    • The first layer in this network tf.keras.layers.Flatten transforms the format of the images from a two-dimensional array (of 28 x 28 pixels) to a one-dimensional array (of 28 * 28 = 784 pixels). Think of this layer as unstacking rows of pixels in the image and lining them up. This layer has no parameters to learn; it only reformats the data.

    • After the pixels are flattened, the network consists of a sequence of two tf.keras.layers.Dense layers. These are densely connected or fully connected neural layers. The first Dense layer has 128 nodes (or neurons). The second (and last) layer returns a logits array with a length of 10. Each node contains a score that indicates the current image belongs to one of the 10 classes.
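
    To verify the layer shapes described above, you can print a summary of the model defined in the previous code block (a small illustrative snippet; it assumes the model object from above):

    # Print each layer's output shape and parameter count for the model defined above.
    model.summary()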

  12. You must add the Loss function, Metrics, and Optimizer at the time of model compilation.

    model.compile(optimizer='adam',
                loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                metrics=['accuracy'])
    
    • Loss function —This measures how accurate the model is during training; you want to minimize this function to “steer” the model in the right direction.

    • Optimizer —This is how the model is updated based on the data it sees and its loss function.

    • Metrics —This is used to monitor the training and testing steps.

    The following example uses accuracy, the fraction of the correctly classified images.

    To train the neural network model, follow these steps:

    1. Feed the training data to the model. The training data is in the train_images and train_labels arrays in this example. The model learns to associate images and labels.

    2. Ask the model to make predictions about a test set—in this example, the test_images array.

    3. Verify that the predictions match the labels from the test_labels array.

    4. To start training, call the model.fit method because it “fits” the model to the training data.

      model.fit(train_images, train_labels, epochs=10)
      
    5. Compare how the model will perform on the test dataset.

      test_loss, test_acc = model.evaluate(test_images,  test_labels, verbose=2)
      
      print('\nTest accuracy:', test_acc)
      
    6. With the model trained, you can use it to make predictions about some images. The model outputs logits (linear outputs); attach a softmax layer to convert the logits to probabilities, which are easier to interpret.

      probability_model = tf.keras.Sequential([model,
                                              tf.keras.layers.Softmax()])
      
      predictions = probability_model.predict(test_images)
      
    7. The model has predicted the label for each image in the testing set. Look at the first prediction:

      predictions[0]
      

      A prediction is an array of 10 numbers. They represent the model’s “confidence” that the image corresponds to each of the 10 different articles of clothing. You can see which label has the highest confidence value:

      np.argmax(predictions[0])
      
    8. Plot a graph to look at the complete set of 10 class predictions.

      def plot_image(i, predictions_array, true_label, img):
          true_label, img = true_label[i], img[i]
          plt.grid(False)
          plt.xticks([])
          plt.yticks([])

          plt.imshow(img, cmap=plt.cm.binary)

          predicted_label = np.argmax(predictions_array)
          if predicted_label == true_label:
              color = 'blue'
          else:
              color = 'red'

          plt.xlabel("{} {:2.0f}% ({})".format(class_names[predicted_label],
                                               100*np.max(predictions_array),
                                               class_names[true_label]),
                     color=color)

      def plot_value_array(i, predictions_array, true_label):
          true_label = true_label[i]
          plt.grid(False)
          plt.xticks(range(10))
          plt.yticks([])
          thisplot = plt.bar(range(10), predictions_array, color="#777777")
          plt.ylim([0, 1])
          predicted_label = np.argmax(predictions_array)

          thisplot[predicted_label].set_color('red')
          thisplot[true_label].set_color('blue')
      
    9. With the model trained, you can use it to make predictions about some images. Review the 0th image predictions and the prediction array. Correct prediction labels are blue, and incorrect prediction labels are red. The number gives the percentage (out of 100) for the predicted label.

      i = 0
      plt.figure(figsize=(6,3))
      plt.subplot(1,2,1)
      plot_image(i, predictions[i], test_labels, test_images)
      plt.subplot(1,2,2)
      plot_value_array(i, predictions[i],  test_labels)
      plt.show()
      

      i = 12
      plt.figure(figsize=(6,3))
      plt.subplot(1,2,1)
      plot_image(i, predictions[i], test_labels, test_images)
      plt.subplot(1,2,2)
      plot_value_array(i, predictions[i],  test_labels)
      plt.show()
      

    10. Use the trained model to predict a single image.

      # Grab an image from the test dataset.
      img = test_images[1]
      print(img.shape)
      
    11. tf.keras models are optimized to make predictions on a batch, or collection, of examples at once. Accordingly, even though you are using a single image, you must add it to a list.

      # Add the image to a batch where it's the only member.
      img = (np.expand_dims(img,0))
      
      print(img.shape)
      
    12. Predict the correct label for this image.

      predictions_single = probability_model.predict(img)
      
      print(predictions_single)
      
      plot_value_array(1, predictions_single[0], test_labels)
      _ = plt.xticks(range(10), class_names, rotation=45)
      plt.show()
      

    13. tf.keras.Model.predict returns a list of lists—one for each image in the batch of data. Grab the predictions for our (only) image in the batch.

      np.argmax(predictions_single[0])
      

Case study: TensorFlow with text classification#

This procedure demonstrates text classification starting from plain text files stored on disk. You will train a binary classifier to perform sentiment analysis on an IMDB dataset. At the end of the notebook, there is an exercise for you to try in which you will train a multi-class classifier to predict the tag for a programming question on Stack Overflow.

Follow these steps:

  1. Import the necessary libraries.

    import matplotlib.pyplot as plt
    import os
    import re
    import shutil
    import string
    import tensorflow as tf
    
    from tensorflow.keras import layers
    from tensorflow.keras import losses
    
  2. Download the IMDB dataset for text classification and extract the archive from the following URL.

    url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
    
    dataset = tf.keras.utils.get_file("aclImdb_v1", url,
                                        untar=True, cache_dir='.',
                                        cache_subdir='')
    
    Downloading data from https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
    84131840/84125825 [==============================]  1s 0us/step
    84149932/84125825 [==============================]  1s 0us/step
    
  3. Fetch the data from the directory.

    dataset_dir = os.path.join(os.path.dirname(dataset), 'aclImdb')
    print(os.listdir(dataset_dir))
    
  4. Load the data for training purposes.

    train_dir = os.path.join(dataset_dir, 'train')
    os.listdir(train_dir)
    
    ['labeledBow.feat',
    'urls_pos.txt',
    'urls_unsup.txt',
    'unsup',
    'pos',
    'unsupBow.feat',
    'urls_neg.txt',
    'neg']
    
  5. The directories contain many text files, each of which is a single movie review. To look at one of them, use the following:

    sample_file = os.path.join(train_dir, 'pos/1181_9.txt')
    with open(sample_file) as f:
        print(f.read())
    
  6. As the IMDB dataset contains additional folders, remove them before using this utility.

    remove_dir = os.path.join(train_dir, 'unsup')
    shutil.rmtree(remove_dir)
    batch_size = 32
    seed = 42
    
  7. The IMDB dataset has already been divided into train and test but lacks a validation set. Create a validation set using an 80:20 split of the training data by using the validation_split argument below:

    raw_train_ds = tf.keras.utils.text_dataset_from_directory(
        'aclImdb/train', batch_size=batch_size, validation_split=0.2,
        subset='training', seed=seed)
    
  8. As you will see in a moment, you can train a model by passing a dataset directly to model.fit. If you are new to tf.data, you can also iterate over the dataset and print a few examples as follows:

    for text_batch, label_batch in raw_train_ds.take(1):
        for i in range(3):
            print("Review", text_batch.numpy()[i])
            print("Label", label_batch.numpy()[i])
    
  9. The labels are zero or one. To see which of these correspond to positive and negative movie reviews, check the class_names property on the dataset.

    print("Label 0 corresponds to", raw_train_ds.class_names[0])
    print("Label 1 corresponds to", raw_train_ds.class_names[1])
    
  10. Next, create the validation and test datasets. Use the remaining 5,000 reviews from the training set for validation; these cover both classes (positive and negative reviews).

    raw_val_ds = tf.keras.utils.text_dataset_from_directory(
        'aclImdb/train', batch_size=batch_size, validation_split=0.2,
        subset='validation', seed=seed)

    raw_test_ds = tf.keras.utils.text_dataset_from_directory(
        'aclImdb/test', batch_size=batch_size)
    

To prepare the data for training, follow these steps:

  1. Standardize, tokenize, and vectorize the data using the helpful tf.keras.layers.TextVectorization layer.

    def custom_standardization(input_data):
        lowercase = tf.strings.lower(input_data)
        stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')
        return tf.strings.regex_replace(stripped_html,
                                        '[%s]' % re.escape(string.punctuation), '')
    
  2. Create a TextVectorization layer. Use this layer to standardize, tokenize, and vectorize your data. Set output_mode to int to create unique integer indices for each token. Note that this uses the default split function and the custom standardization function defined above. You will also define some constants for the model, such as an explicit maximum sequence_length, which causes the layer to pad or truncate sequences to exactly sequence_length values.

    max_features = 10000
    sequence_length = 250
    vectorize_layer = layers.TextVectorization(
        standardize=custom_standardization,
        max_tokens=max_features,
        output_mode='int',
        output_sequence_length=sequence_length)
    
  3. Call adapt to fit the state of the preprocessing layer to the dataset. This causes the model to build an index of strings to integers.

    # Make a text-only dataset (without labels), then call adapt
    train_text = raw_train_ds.map(lambda x, y: x)
    vectorize_layer.adapt(train_text)
    
  4. Create a function to see the result of using this layer to preprocess some data.

    def vectorize_text(text, label):
        text = tf.expand_dims(text, -1)
        return vectorize_layer(text), label
    
    text_batch, label_batch = next(iter(raw_train_ds))
    first_review, first_label = text_batch[0], label_batch[0]
    print("Review", first_review)
    print("Label", raw_train_ds.class_names[first_label])
    print("Vectorized review", vectorize_text(first_review, first_label))
    

  5. As you can see above, each token has been replaced by an integer. Look up the token (string) that each integer corresponds to by calling get_vocabulary() on the layer.

    print("1287 ---> ",vectorize_layer.get_vocabulary()[1287])
    print(" 313 ---> ",vectorize_layer.get_vocabulary()[313])
    print('Vocabulary size: {}'.format(len(vectorize_layer.get_vocabulary())))
    
  6. You are nearly ready to train your model. As a final preprocessing step, apply the TextVectorization layer you created earlier to the training, validation, and test datasets.

    train_ds = raw_train_ds.map(vectorize_text)
    val_ds = raw_val_ds.map(vectorize_text)
    test_ds = raw_test_ds.map(vectorize_text)
    

    The cache() function keeps data in memory after it is loaded off disk. This ensures the dataset does not become a bottleneck while training your model. If your dataset is too large to fit into memory, you can also use this method to create a performant on-disk cache, which is more efficient to read than many small files.

    The prefetch() function overlaps data preprocessing and model execution while training.

    AUTOTUNE = tf.data.AUTOTUNE
    
    train_ds = train_ds.cache().prefetch(buffer_size=AUTOTUNE)
    val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)
    test_ds = test_ds.cache().prefetch(buffer_size=AUTOTUNE)
    
  7. Create your neural network.

    embedding_dim = 16
    model = tf.keras.Sequential([
        layers.Embedding(max_features + 1, embedding_dim),
        layers.Dropout(0.2),
        layers.GlobalAveragePooling1D(),
        layers.Dropout(0.2),
        layers.Dense(1)])
    model.summary()
    

  8. A model needs a loss function and an optimizer for training. Since this is a binary classification problem and the model outputs a logit (a single-unit layer without an activation), use the losses.BinaryCrossentropy loss function with from_logits=True.

    model.compile(loss=losses.BinaryCrossentropy(from_logits=True),
                  optimizer='adam',
                  metrics=tf.metrics.BinaryAccuracy(threshold=0.0))
    
  9. Train the model by passing the dataset object to the fit method.

    epochs = 10
    history = model.fit(train_ds,validation_data=val_ds,epochs=epochs)
    

  10. See how the model performs. Two values are returned: loss (a number representing our error; lower values are better) and accuracy.

    loss, accuracy = model.evaluate(test_ds)
    
    print("Loss: ", loss)
    print("Accuracy: ", accuracy)
    

    Note

    model.fit() returns a History object that contains a dictionary with everything that happened during training.

    history_dict = history.history
    history_dict.keys()
    
  11. There are four entries: one for each monitored metric during training and validation. Use these to plot the training and validation loss for comparison, as well as the training and validation accuracy:

    acc = history_dict['binary_accuracy']
    val_acc = history_dict['val_binary_accuracy']
    loss = history_dict['loss']
    val_loss = history_dict['val_loss']
    
    epochs = range(1, len(acc) + 1)
    
    # "bo" is for "blue dot"
    plt.plot(epochs, loss, 'bo', label='Training loss')
    # b is for "solid blue line"
    plt.plot(epochs, val_loss, 'b', label='Validation loss')
    plt.title('Training and validation loss')
    plt.xlabel('Epochs')
    plt.ylabel('Loss')
    plt.legend()
    
    plt.show()
    

    The following images illustrate the training and validation loss and the training and validation accuracy.

    Training and validation loss

    Training and validation accuracy

  12. Export the model.

    export_model = tf.keras.Sequential([
        vectorize_layer,
        model,
        layers.Activation('sigmoid')
    ])
    
    export_model.compile(
        loss=losses.BinaryCrossentropy(from_logits=False), optimizer="adam", metrics=['accuracy']
    )
    
    # Test it with `raw_test_ds`, which yields raw strings
    loss, accuracy = export_model.evaluate(raw_test_ds)
    print(accuracy)
    
  13. To get predictions for new examples, call export_model.predict().

    examples = [
        "The movie was great!",
        "The movie was okay.",
        "The movie was terrible..."
    ]
    
    export_model.predict(examples)
    

Inference optimization with MIGraphX#

The following sections cover inference and introduce MIGraphX.

Inference#

Inference is where the capabilities learned during deep-learning training are put to work. It refers to using a fully trained neural network to make predictions on unseen data that the model has never interacted with before. Deep-learning inference is achieved by feeding new data, such as new images, to the network, giving the deep neural network (DNN) a chance to classify the image.

Taking the earlier MNIST example, the DNN can be fed new images of handwritten digits, allowing the neural network to classify them. A fully trained DNN should make accurate predictions about what an image represents; inference cannot happen without training.
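
The following is a minimal tf.keras sketch of this flow; the small untrained model and the random 28x28 image are stand-ins for a real trained network and a real unseen handwritten-digit image.

import numpy as np
import tensorflow as tf

# Stand-in for a trained classifier (in practice, reuse a model trained
# as in the earlier case study).
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10)
])

# A "new" 28x28 grayscale image the network has never seen (random here).
new_image = np.random.rand(28, 28).astype('float32')

# tf.keras models predict on batches, so add a batch dimension.
batch = np.expand_dims(new_image, 0)

# Convert the logits to probabilities and report the most likely class.
probability_model = tf.keras.Sequential([model, tf.keras.layers.Softmax()])
probabilities = probability_model.predict(batch)
print("Predicted class:", np.argmax(probabilities[0]))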

MIGraphX introduction#

MIGraphX is a graph compiler, focused on accelerating machine-learning inference, that can target AMD GPUs and CPUs. MIGraphX accelerates machine-learning models by leveraging several graph-level transformations and optimizations. These optimizations include:

  • Operator fusion

  • Arithmetic simplifications

  • Dead-code elimination

  • Common subexpression elimination (CSE)

  • Constant propagation

After applying these transformations, MIGraphX emits code for the AMD GPU by calling MIOpen or rocBLAS, or by creating HIP kernels for a particular operator. MIGraphX can also target CPUs using the DNNL or ZenDNN libraries.

MIGraphX provides easy-to-use APIs in C++ and Python to import machine-learning models in ONNX or TensorFlow format. Users can compile, save, load, and run these models using the MIGraphX C++ and Python APIs. Internally, MIGraphX parses ONNX or TensorFlow models into an internal graph representation where each operator in the model is mapped to an operator within MIGraphX. Each of these operators defines various attributes, such as:

  • Number of arguments

  • Type of arguments

  • Shape of arguments

After optimization passes, all these operators get mapped to different kernels on GPUs or CPUs.

After importing a model into MIGraphX, the model is represented as a migraphx::program. A migraphx::program is made up of migraphx::module objects. The program can consist of several modules, but it always has one main_module. Modules, in turn, are made up of migraphx::instruction_ref objects. Instructions contain the migraphx::op and the arguments to the operator.
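
As a minimal illustration of this structure, the following Python sketch (using only the API calls that appear in the examples later in this section, with a placeholder model.onnx file) parses a model and prints the resulting program before and after compilation; the printed output lists the modules and their instructions, that is, the MIGraphX operator and arguments each model operator was mapped to.

import migraphx

# Parse an ONNX model (placeholder file name) into a migraphx program.
prog = migraphx.parse_onnx("model.onnx")

# Print the parsed program: its modules and their instructions.
prog.print()

# After compiling for a target (requires a working ROCm GPU setup), the
# instructions are lowered to the kernels that will actually run on the GPU.
prog.compile(migraphx.get_target("gpu"))
prog.print()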

Installing MIGraphX#

There are three options for installing MIGraphX. MIGraphX depends on ROCm libraries; the following instructions assume the machine already has ROCm installed.

Option 1: installing binaries#

To install MIGraphX on Debian-based systems like Ubuntu, use the following command:

sudo apt update && sudo apt install -y migraphx

The header files and libraries are installed under /opt/rocm-<version>, where <version> is the ROCm version.

Option 2: building from source#

There are two ways to build the MIGraphX sources.

  • Use the ROCm build tool - This approach uses rbuild (https://github.com/ROCm/rbuild) to install the prerequisites and build the libraries with just one command.

    or

  • Use CMake - This approach uses a script to install the prerequisites, then uses CMake to build the source.

For detailed steps on building from source and installing dependencies, refer to the following README file:

ROCm/AMDMIGraphX

Option 3: use docker#

To use Docker, follow these steps:

  1. The easiest way to set up the development environment is to use Docker. To build the Docker image from scratch, first clone the MIGraphX repository by running:

    git clone --recursive https://github.com/ROCm/AMDMIGraphX
    
  2. The repository contains a Dockerfile that you can use to build the Docker image:

    docker build -t migraphx .
    
  3. Then to enter the development environment, use Docker run:

    docker run --device='/dev/kfd' --device='/dev/dri' -v=`pwd`:/code/AMDMIGraphX -w /code/AMDMIGraphX --group-add video -it migraphx
    

The Docker image contains all the prerequisites required for the installation, so users can go to the /code/AMDMIGraphX folder and follow the steps mentioned in Option 2: building from source.

MIGraphX example#

MIGraphX provides both C++ and Python APIs. The following sections show examples of both using the Inception v3 model. To walk through the examples, fetch the Inception v3 ONNX model by running the following:

import torch
import torchvision.models as models
inception = models.inception_v3(pretrained=True)
torch.onnx.export(inception,torch.randn(1,3,299,299), "inceptioni1.onnx")

This will create inceptioni1.onnx, which can be imported in MIGraphX using C++ or Python API.

MIGraphX Python API#

Follow these steps:

  1. To import the MIGraphX module in a Python script, set PYTHONPATH to the MIGraphX library installation path. If the binaries were installed using the steps in Option 1: installing binaries, run the following:

    export PYTHONPATH=$PYTHONPATH:/opt/rocm/
    
  2. The following script shows how to use the Python API to import the ONNX model, compile it, and run inference on it. Set LD_LIBRARY_PATH to /opt/rocm/ if required.

    # import migraphx and numpy
    import migraphx
    import numpy as np
    # import and parse inception model
    model = migraphx.parse_onnx("inceptioni1.onnx")
    # compile model for the GPU target
    model.compile(migraphx.get_target("gpu"))
    # optionally print compiled model
    model.print()
    # create random input image
    input_image = np.random.rand(1, 3, 299, 299).astype('float32')
    # feed image to model; 'x.1' is the input param name
    results = model.run({'x.1': input_image})
    # get the results back
    result_np = np.array(results[0])
    # print the inferred class of the input image
    print(np.argmax(result_np))
    

    Find additional examples of the Python API in the /examples directory of the MIGraphX repository.

MIGraphX C++ API#

Follow these steps:

  1. The following minimal example shows how to use the MIGraphX C++ API to load an ONNX file, compile it for the GPU, and run inference on it. To use the MIGraphX C++ API, you only need to include the migraphx/migraphx.hpp header. This example runs inference on the Inception v3 model.

    #include <algorithm>
    #include <ctime>
    #include <iostream>
    #include <random>
    #include <string>
    #include <vector>
    #include <migraphx/migraphx.hpp>
    
    int main(int argc, char** argv)
    {
        migraphx::program prog;
        migraphx::onnx_options onnx_opts;
        // import and parse onnx file into migraphx::program
        prog = parse_onnx("inceptioni1.onnx", onnx_opts);
        // print imported model
        prog.print();
        migraphx::target targ = migraphx::target("gpu");
        migraphx::compile_options comp_opts;
        comp_opts.set_offload_copy();
        // compile for the GPU
        prog.compile(targ, comp_opts);
        // print the compiled program
        prog.print();
        // randomly generate input image
        // of shape (1, 3, 299, 299)
        std::srand(unsigned(std::time(nullptr)));
        std::vector<float> input_image(1*299*299*3);
        std::generate(input_image.begin(), input_image.end(), std::rand);
        // users need to provide data for the input
        // parameters in order to run inference
        // you can query into migraph program for the parameters
        migraphx::program_parameters prog_params;
        auto param_shapes = prog.get_parameter_shapes();
        auto input        = param_shapes.names().front();
        // create argument for the parameter
        prog_params.add(input, migraphx::argument(param_shapes[input], input_image.data()));
        // run inference
        auto outputs = prog.eval(prog_params);
        // read back the output
        float* results = reinterpret_cast<float*>(outputs[0].data());
        float* max     = std::max_element(results, results + 1000);
        int answer = max - results;
        std::cout << "answer: " << answer << std::endl;
    }
    
  2. To compile this program, you can use CMake; you only need to link against the migraphx::c library to use MIGraphX’s C++ API. The following CMakeLists.txt file can build the earlier example:

    cmake_minimum_required(VERSION 3.5)
    project (CAI)
    
    set (CMAKE_CXX_STANDARD 14)
    set (EXAMPLE inception_inference)
    
    list (APPEND CMAKE_PREFIX_PATH /opt/rocm/hip /opt/rocm)
    find_package (migraphx)
    
    message("source file: " ${EXAMPLE}.cpp " ---> bin: " ${EXAMPLE})
    add_executable(${EXAMPLE} ${EXAMPLE}.cpp)
    
    target_link_libraries(${EXAMPLE} migraphx::c)
    
  3. To build the executable file, run the following from the directory containing the inception_inference.cpp file:

    mkdir build
    cd build
    cmake ..
    make -j$(nproc)
    ./inception_inference
    

Note

Set LD_LIBRARY_PATH to /opt/rocm/lib if required during the build. Additional examples can be found in the MIGraphX repository under the /examples/ directory.

Tuning MIGraphX#

MIGraphX uses MIOpen kernels to target AMD GPUs. For a model compiled with MIGraphX, tune MIOpen to pick the best possible kernel implementation; MIOpen tuning results in a significant performance boost. Enable tuning by setting the environment variable MIOPEN_FIND_ENFORCE=3.

Note

The tuning process can take a long time to finish.
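
For instance, using the Python API from the earlier example, the variable can be set in the process environment before the model is compiled and run. This is a minimal sketch; it assumes MIOpen picks the variable up from the environment of the running process.

import os

# Enable exhaustive MIOpen tuning for this process before compiling the model.
os.environ["MIOPEN_FIND_ENFORCE"] = "3"

import migraphx
import numpy as np

model = migraphx.parse_onnx("inceptioni1.onnx")
model.compile(migraphx.get_target("gpu"))

# The first inference run triggers tuning; later runs reuse the tuned kernels.
input_image = np.random.rand(1, 3, 299, 299).astype('float32')
results = model.run({'x.1': input_image})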

Example: The average inference time of the Inception model example shown previously, measured over 100 iterations with untuned kernels, is 0.01383 ms. After tuning, it reduces to 0.00459 ms, a roughly 3x improvement. This result is from ROCm v4.5 on an MI100 GPU.

Note

The results may vary depending on the system configurations.

For reference, the following code snippet shows inference runs for only the first 10 iterations for both tuned and untuned kernels:

### UNTUNED ###
iterator : 0
Inference complete
Inference time: 0.063ms
iterator : 1
Inference complete
Inference time: 0.008ms
iterator : 2
Inference complete
Inference time: 0.007ms
iterator : 3
Inference complete
Inference time: 0.007ms
iterator : 4
Inference complete
Inference time: 0.007ms
iterator : 5
Inference complete
Inference time: 0.008ms
iterator : 6
Inference complete
Inference time: 0.007ms
iterator : 7
Inference complete
Inference time: 0.028ms
iterator : 8
Inference complete
Inference time: 0.029ms
iterator : 9
Inference complete
Inference time: 0.029ms

### TUNED ###
iterator : 0
Inference complete
Inference time: 0.063ms
iterator : 1
Inference complete
Inference time: 0.004ms
iterator : 2
Inference complete
Inference time: 0.004ms
iterator : 3
Inference complete
Inference time: 0.004ms
iterator : 4
Inference complete
Inference time: 0.004ms
iterator : 5
Inference complete
Inference time: 0.004ms
iterator : 6
Inference complete
Inference time: 0.004ms
iterator : 7
Inference complete
Inference time: 0.004ms
iterator : 8
Inference complete
Inference time: 0.004ms
iterator : 9
Inference complete
Inference time: 0.004ms

YModel#

The best inference performance through MIGraphX is conditioned upon having tuned kernel configurations stored in a local user database (DB) under /home. If a user moves their model to a different server, or allows a different user to run it, the MIOpen tuning process must be run again to populate the new user DB with the best kernel configurations and corresponding solvers.

Tuning is time consuming, and users who have not performed tuning may see discrepancies between the expected (or claimed) inference performance and the actual inference performance. This has led to repetitive, time-consuming tuning tasks for each user.

MIGraphX introduces a feature, known as YModel, that stores the kernel configuration parameters found during tuning in a .mxr file. This ensures the same level of expected performance, even when a model is copied to a different user or system.

The YModel feature is available starting from ROCm 5.4.1 and UIF 1.1.

YModel example#

Through the migraphx-driver functionality, you can generate .mxr files with the tuning information stored inside them by passing --binary --output model.mxr to migraphx-driver along with the rest of the necessary flags.

For example, to generate a .mxr file from an ONNX model, use the following:

./path/to/migraphx-driver compile --onnx resnet50.onnx --enable-offload-copy --binary --output resnet50.mxr

To run generated .mxr files through migraphx-driver, use the following:

./path/to/migraphx-driver run --migraphx resnet50.mxr --enable-offload-copy

Alternatively, you can use the MIGraphX C++ or Python API to generate .mxr files.
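
The following is a rough Python sketch of that workflow; it assumes the migraphx.save and migraphx.load helpers are available in your MIGraphX Python installation and reuses the resnet50.onnx model from the driver example above.

import migraphx

# Parse and compile the model, then save the compiled program, including
# its tuning information, to a .mxr file.
prog = migraphx.parse_onnx("resnet50.onnx")
prog.compile(migraphx.get_target("gpu"))
migraphx.save(prog, "resnet50.mxr")

# On another system (or for another user), load the compiled program directly
# instead of repeating the tuning process.
prog = migraphx.load("resnet50.mxr")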

Generating an MXR file

Contribute to ROCm documentation#

All ROCm projects are GitHub-based. You can contribute by submitting a pull request or by creating an issue, as described in the following sections.

Important

By creating a pull request (PR), you agree to allow your contribution to be licensed under the terms of the LICENSE.txt file in the corresponding repository. Different repositories may use different licenses.

Submit a pull request#

To make edits to our documentation via PR, follow these steps:

  1. Identify the repository and the file you want to update. For example, to update this page, you would need to modify content located in this file: https://github.com/ROCm/ROCm/blob/develop/docs/contribute/contributing.md

  2. (optional, but recommended) Fork the repository.

  3. Clone the repository locally and (optionally) add your fork. Select the green ‘Code’ button and copy the URL (e.g., git@github.com:ROCm/ROCm.git).

    • From your terminal, run:

      git clone git@github.com:ROCm/ROCm.git
      
    • Optionally add your fork to this local copy of the repository by running:

      git remote add <name-of-my-fork> <git@github.com:my-username/ROCm.git>
      

      To get the URL of your fork, go to your GitHub profile, select the fork and click the green ‘Code’ button (the same process you followed to get the main GitHub repository URL).

  4. Change directory into your local copy of the repository, and run git pull (or git pull origin develop) to ensure your local copy has the most recent content.

  5. Create and checkout a new branch using the following command:

    git checkout -b <branch_name>
    
  6. Change directory into the ./docs folder and make any documentation changes locally using your preferred code editor. Follow the guidelines listed on the documentation structure page.

  7. Optionally run a local test build of the documentation to ensure the content builds and looks as expected. In your terminal, run the following commands from within the ./docs folder of your cloned repository:

    pip3 install -r sphinx/requirements.txt  # You only need to run this command once
    python3 -m sphinx -T -E -b html -d _build/doctrees -D language=en . _build/html
    

    The build output files are located in the docs/_build folder. To preview your build, open the docs/_build/html/index.html file in a web browser. For more information, see Building documentation. To learn more about our build tools, see Documentation toolchain.

  8. Commit your changes and push them to GitHub by running:

    git add <path-to-my-modified-file> # To add all modified files, you can use: git add .
    git commit -m "my-updates"
    git push <name-of-my-fork>
    

    After pushing, you will get a GitHub link in the terminal output. Copy this link and paste it into a browser to create your PR.

Create an issue#

  1. To create a new GitHub issue, select the ‘Issues’ tab in the appropriate repository (e.g., https://github.com/ROCm/ROCm/issues).

  2. Use the search bar to make sure the issue doesn’t already exist.

  3. If your issue is not already listed, select the green ‘New issue’ button to the right of the page. Select the type of issue and fill in the resulting template.

General issue guidelines#

  • Use your best judgment when creating issues. If your issue is already listed, upvote it and comment or post to provide additional details, such as how you reproduced the issue.

  • If you’re not sure if your issue is the same, err on the side of caution and file your issue. You can add a comment to include the issue number (and link) for the similar issue. If we evaluate your issue as being the same as the existing issue, we’ll close the duplicate.

  • If your issue doesn’t exist, use the issue template to file a new issue.

    • When filing an issue, be sure to provide as much information as possible, including script output so we can collect information about your configuration. This helps reduce the time required to reproduce your issue.

    • Check your issue regularly, as we may require additional information to successfully reproduce the issue.

Suggest a new feature#

Use the GitHub Discussion forum (Ideas category) to propose new features. Our maintainers are happy to provide direction and feedback on feature development.

Future development workflow#

The current ROCm development workflow is GitHub-based. If, in the future, we change this platform, the tools and links may change. In this instance, we will update contribution guidelines accordingly.

Documentation structure#

Our documentation follows the Pitchfork folder structure. Most documentation files are stored in the /docs folder. Some special files (such as release, contributing, and changelog) are stored in the root (/) folder.

All images are stored in the /docs/data folder. An image’s file path mirrors that of the documentation file where it is used.

Our naming structure uses kebab case; for example, my-file-name.rst.

Supported formats and syntax#

Our documentation includes both Markdown and RST files. We are gradually transitioning existing Markdown to RST in order to more effectively meet our documentation needs. When contributing, RST is preferred; if you must use Markdown, use GitHub-flavored Markdown.

We use Sphinx Design syntax and compile our API references using Doxygen.

The following table shows some common documentation components and the syntax convention we use for each:

Code blocks

  .. code-block:: language-name

    My code block.

Cross-referencing internal files

  :doc:`Title <../path/to/file/filename>`

External links

  `link name <URL>`_

Headings

  ******************
  Chapter title (H1)
  ******************

  Section title (H2)
  ==================

  Subsection title (H3)
  ---------------------

  Sub-subsection title (H4)
  ^^^^^^^^^^^^^^^^^^^^^^^^^

Images

  .. image:: image1.png

Internal links

  1. Add a tag to the section you want to reference:

     .. _section-1:

     Section 1
     =========

  2. Link to your tag:

     As shown in :ref:`section-1`.

Lists

  #. Ordered (numbered) list item

  * Unordered (bulleted) list item

Math (block)

  .. math::

    A = \begin{pmatrix}
            0.0 & 1.0 & 1.0 & 3.0 \\
            4.0 & 5.0 & 6.0 & 7.0 \\
        \end{pmatrix}

Math (inline)

  :math:`2 \times 2`

Notes

  .. note::

    My note here.

Tables

  .. csv-table:: Optional title here
    :widths: 30, 70
    :header: "entry1 header", "entry2 header"

    "entry1", "entry2"

Language and style#

We use the Google developer documentation style guide to guide our content.

Font size and type, page layout, white space control, and other formatting details are controlled via rocm-docs-core. If you want to notify us of any formatting issues, create a pull request in our rocm-docs-core GitHub repository.

Building our documentation#

To learn how to build our documentation, refer to Building documentation.

ROCm documentation toolchain#

Our documentation relies on several open source toolchains and sites.

rocm-docs-core#

rocm-docs-core is an AMD-maintained project that applies customization for our documentation. This project is the tool most ROCm repositories use as part of the documentation build. It is also available as a pip package on PyPI.

See the user and developer guides for rocm-docs-core at rocm-docs-core documentation.

Sphinx#

Sphinx is a documentation generator originally used for Python. It is now widely used in the open source community.

Sphinx External ToC#

Sphinx External ToC is a Sphinx extension used for ROCm documentation navigation. This tool generates a navigation menu on the left based on a YAML file (_toc.yml.in) that contains the table of contents.

Sphinx-book-theme#

Sphinx-book-theme is a Sphinx theme that defines the base appearance for ROCm documentation. ROCm documentation applies some customization, such as a custom header and footer on top of the Sphinx Book Theme.

Sphinx Design#

Sphinx design is a Sphinx extension that adds design functionality. ROCm documentation uses Sphinx Design for grids, cards, and synchronized tabs.

Doxygen#

Doxygen is a documentation generator that extracts information from inline code. ROCm projects typically use Doxygen for public API documentation (unless the upstream project uses a different tool).

Breathe#

Breathe is a Sphinx plugin to integrate Doxygen content.

MyST#

Markedly Structured Text (MyST) is an extended flavor of Markdown (CommonMark) influenced by reStructuredText (RST) and Sphinx. It’s integrated into ROCm documentation by the Sphinx extension myst-parser. A MyST syntax cheat sheet is available on the Jupyter reference site.

Read the Docs#

Read the Docs is the service that builds and hosts the HTML documentation generated using Sphinx to our end users.

Building documentation#

You can build our documentation via GitHub (in a pull request) or locally (using the command line or Visual Studio (VS) Code).

GitHub#

If you open a pull request on the develop branch of a ROCm repository and scroll to the bottom of the page, there is a summary panel. Next to the line docs/readthedocs.com:advanced-micro-devices-demo, there is a Details link. If you click this, it takes you to the Read the Docs build for your pull request.

Screenshot of the GitHub documentation build link

If you don’t see this line, click Show all checks to get an itemized view.

Command line#

You can build our documentation via the command line using Python. We use Python 3.8; other versions may not support the build.

Use the Python Virtual Environment (venv) and run the following commands from the project root:

python3 -m venv .venv

.venv/bin/python -m pip install -r docs/sphinx/requirements.txt
.venv/bin/python -m sphinx -T -E -b html -d _build/doctrees -D language=en docs _build/html

Navigate to _build/html/index.html and open this file in a web browser.

Visual Studio Code#

With the help of a few extensions, you can create a productive environment to author and test documentation locally using Visual Studio (VS) Code. Follow these steps to configure VS Code:

  1. Install the required extensions:

    • Python: (ms-python.python)

    • Live Server: (ritwickdey.LiveServer)

  2. Add the following entries to .vscode/settings.json.

      {
        "liveServer.settings.root": "/.vscode/build/html",
        "liveServer.settings.wait": 1000,
        "python.terminal.activateEnvInCurrentTerminal": true
      }
    
    • liveServer.settings.root: Sets the root of the output website for live previews. Must be changed alongside the tasks.json command.

    • liveServer.settings.wait: Tells the live server to wait with the update in order to give Sphinx time to regenerate the site contents and not refresh before the build is complete.

    • python.terminal.activateEnvInCurrentTerminal: Activates the automatic virtual environment, so you can build the site from the integrated terminal.

  3. Add the following tasks to .vscode/tasks.json.

      {
        "version": "2.0.0",
        "tasks": [
          {
            "label": "Build Docs",
            "type": "process",
            "windows": {
              "command": "${workspaceFolder}/.venv/Scripts/python.exe"
            },
            "command": "${workspaceFolder}/.venv/bin/python3",
            "args": [
              "-m",
              "sphinx",
              "-j",
              "auto",
              "-T",
              "-b",
              "html",
              "-d",
              "${workspaceFolder}/.vscode/build/doctrees",
              "-D",
              "language=en",
              "${workspaceFolder}/docs",
              "${workspaceFolder}/.vscode/build/html"
            ],
            "problemMatcher": [
              {
                "owner": "sphinx",
                "fileLocation": "absolute",
                "pattern": {
                  "regexp": "^(?:.*\\.{3}\\s+)?(\\/[^:]*|[a-zA-Z]:\\\\[^:]*):(\\d+):\\s+(WARNING|ERROR):\\s+(.*)$",
                  "file": 1,
                  "line": 2,
                  "severity": 3,
                  "message": 4
                }
              },
              {
              "owner": "sphinx",
                "fileLocation": "absolute",
                "pattern": {
                  "regexp": "^(?:.*\\.{3}\\s+)?(\\/[^:]*|[a-zA-Z]:\\\\[^:]*):{1,2}\\s+(WARNING|ERROR):\\s+(.*)$",
                  "file": 1,
                  "severity": 2,
                  "message": 3
                }
              }
            ],
            "group": {
              "kind": "build",
              "isDefault": true
            }
          }
        ]
      }
    

    Implementation detail: two problem matchers had to be defined because VS Code doesn’t tolerate problem information that is potentially absent. While a single regex could match all types of errors, if a capture group remains empty (the line number doesn’t show up in all warning/error messages) but the pattern references that empty capture group, VS Code discards the message completely.

  4. Configure the Python virtual environment (venv).

    From the Command Palette, run Python: Create Environment. Select venv environment and docs/sphinx/requirements.txt.

  5. Build the docs.

    Launch the default build task using one of the following options:

    • A hotkey (the default is Ctrl+Shift+B)

    • Issuing the Tasks: Run Build Task from the Command Palette

  6. Open the live preview.

    Navigate to the site output within VS Code: right-click on .vscode/build/html/index.html and select Open with Live Server. The contents should update on every rebuild without having to refresh the browser.

Providing feedback#

There are four standard ways to provide feedback on this repository.

Pull request#

All contributions to ROCm documentation should arrive via the GitHub Flow targeting the develop branch of the repository. If you are unable to contribute via the GitHub Flow, feel free to email us at rocm-feedback@amd.com.

For more in-depth information on creating a pull request (PR), see Contributing.

GitHub discussions#

To ask questions or view answers to frequently asked questions, refer to GitHub Discussions. On GitHub Discussions, in addition to asking and answering questions, members can share updates, have open-ended conversations, and follow along via public announcements.

GitHub issue#

Issues on existing or absent documentation can be filed in GitHub Issues.

Email#

Send other feedback or questions to rocm-feedback@amd.com.

ROCm license#

MIT License

Copyright © 2023 Advanced Micro Devices, Inc. All rights reserved.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Note

The preceding license applies to the ROCm repository, which primarily contains documentation. For licenses related to other ROCm components, refer to the following section.

ROCm component licenses#

ROCm is released by Advanced Micro Devices, Inc. and is licensed per component separately. The following table lists ROCm components with links to their respective license terms. These components may include third-party components subject to additional licenses. Review the individual repositories for more information.

Component

License

AMDMIGraphX

MIT

HIPCC

MIT

HIPIFY

MIT

HIP

MIT

MIOpenGEMM

MIT

MIOpen

MIT

MIVisionX

MIT

RCP

MIT

ROCK-Kernel-Driver

GPL 2.0 WITH Linux-syscall-note

ROCR-Runtime

The University of Illinois/NCSA

ROCT-Thunk-Interface

MIT

ROCclr

MIT

ROCdbgapi

MIT

ROCgdb

GNU General Public License v2.0

ROCm-CompilerSupport

The University of Illinois/NCSA

ROCm-Device-Libs

The University of Illinois/NCSA

ROCm-OpenCL-Runtime/api/opencl/khronos/icd

Apache 2.0

ROCm-OpenCL-Runtime

MIT

ROCmValidationSuite

MIT

Tensile

MIT

aomp-extras

MIT

aomp

Apache 2.0

atmi

MIT

clang-ocl

MIT

flang

Apache 2.0

half

MIT

hipBLAS

MIT

hipCUB

Custom

hipFFT

MIT

hipSOLVER

MIT

hipSPARSELt

MIT

hipSPARSE

MIT

hipTensor

MIT

hipamd

MIT

hipfort

MIT

llvm-project

Apache

rccl

Custom

rdc

MIT

rocALUTION

MIT

rocBLAS

MIT

rocFFT

MIT

rocPRIM

MIT

rocRAND

MIT

rocSOLVER

BSD-2-Clause

rocSPARSE

MIT

rocThrust

Apache 2.0

rocWMMA

MIT

rocm-cmake

MIT

rocm_bandwidth_test

The University of Illinois/NCSA

rocm_smi_lib

The University of Illinois/NCSA

rocminfo

The University of Illinois/NCSA

rocprofiler

MIT

rocr_debug_agent

The University of Illinois/NCSA

roctracer

MIT

rocm-llvm-alt

AMD Proprietary License

Open-source ROCm components are released via public GitHub repositories, packages on https://repo.radeon.com, and other distribution channels. Proprietary products are only available on https://repo.radeon.com. Currently, only one ROCm component, rocm-llvm-alt, is governed by a proprietary license. Proprietary components are organized in a proprietary subdirectory in the package repositories to distinguish them from open-source packages.

Note

The following additional terms and conditions apply to your use of ROCm technical documentation.

©2023 Advanced Micro Devices, Inc. All rights reserved.

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions, and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. Any computer system has risks of security vulnerabilities that cannot be completely prevented or mitigated. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

THIS INFORMATION IS PROVIDED “AS IS.” AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS, OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY RELIANCE, DIRECT, INDIRECT, SPECIAL, OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

AMD, the AMD Arrow logo, ROCm, and combinations thereof are trademarks of Advanced Micro Devices, Inc. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.

Package licensing#

Attention

AQL Profiler and AOCC CPU optimization are both provided in binary form, each subject to the license agreement enclosed in the directory for the binary and available at /opt/rocm/share/doc/rocm-llvm-alt/EULA. By using, installing, copying, or distributing AQL Profiler and/or AOCC CPU Optimizations, you agree to the terms and conditions of this license agreement. If you do not agree to the terms of this agreement, do not install, copy, or use the AQL Profiler and/or the AOCC CPU Optimizations.

For the rest of the ROCm packages, you can find the licensing information at the following location: /opt/rocm/share/doc/<component-name>/

For example, you can fetch the licensing information of the amd_comgr component (Code Object Manager) from the amd_comgr folder. A file named LICENSE.txt contains the license details at: /opt/rocm-5.4.3/share/doc/amd_comgr/LICENSE.txt
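
For instance, a small script along the following lines (a sketch; the exact paths depend on your installed ROCm version) can list the per-component license files shipped with an installation:

import glob
import os

# List per-component license files under the ROCm documentation directory.
for path in sorted(glob.glob("/opt/rocm/share/doc/*/LICENSE*")):
    component = os.path.basename(os.path.dirname(path))
    print(component, "->", path)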