AMD ROCm™ Documentation#

Applies to Linux

2023-06-22


What is ROCm?#

ROCm is an open-source stack for GPU computation. ROCm is primarily Open-Source Software (OSS) that gives developers the freedom to customize and tailor their GPU software to their own needs while collaborating with a community of other developers and helping each other find solutions in an agile, flexible, rapid and secure manner.

ROCm is a collection of drivers, development tools and APIs that enable GPU programming from the low-level kernel to end-user applications. ROCm is powered by AMD’s Heterogeneous-computing Interface for Portability (HIP), an OSS C++ GPU programming environment and its corresponding runtime. HIP allows ROCm developers to create portable applications by deploying code on a range of platforms, from dedicated gaming GPUs to exascale HPC clusters. ROCm supports programming models such as OpenMP and OpenCL, and includes all the necessary OSS compilers, debuggers and libraries. ROCm is fully integrated into ML frameworks such as PyTorch and TensorFlow. ROCm can be deployed in several ways, including in Docker containers, via Spack, or by building from source.

ROCm’s goal is to allow our users to maximize their GPU hardware investment. ROCm is designed to help develop, test and deploy GPU accelerated HPC, AI, scientific computing, CAD, and other applications in a free, open-source, integrated and secure software ecosystem.

Quick Start (Linux)#

Add Repositories#

1. Download and convert the package signing key

# Make the directory if it doesn't exist yet.
# This location is recommended by the distribution maintainers.
sudo mkdir --parents --mode=0755 /etc/apt/keyrings
# Download the key, convert the signing-key to a full
# keyring required by apt and store in the keyring directory
wget https://repo.radeon.com/rocm/rocm.gpg.key -O - | \
    gpg --dearmor | sudo tee /etc/apt/keyrings/rocm.gpg > /dev/null

2. Add the repositories

# Kernel driver repository for focal
sudo tee /etc/apt/sources.list.d/amdgpu.list <<'EOF'
deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/amdgpu/latest/ubuntu focal main
EOF
# ROCm repository for focal
sudo tee /etc/apt/sources.list.d/rocm.list <<'EOF'
deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/rocm/apt/debian focal main
EOF
echo -e 'Package: *\nPin: release o=repo.radeon.com\nPin-Priority: 600' | sudo tee /etc/apt/preferences.d/rocm-pin-600
# Kernel driver repository for jammy
sudo tee /etc/apt/sources.list.d/amdgpu.list <<'EOF'
deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/amdgpu/latest/ubuntu jammy main
EOF
# ROCm repository for jammy
sudo tee /etc/apt/sources.list.d/rocm.list <<'EOF'
deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/rocm/apt/debian jammy main
EOF
echo -e 'Package: *\nPin: release o=repo.radeon.com\nPin-Priority: 600' | sudo tee /etc/apt/preferences.d/rocm-pin-600

3. Update the list of packages

sudo apt update

1. Add the repositories

# Add the amdgpu module repository for RHEL 8.6
sudo tee /etc/yum.repos.d/amdgpu.repo <<'EOF'
[amdgpu]
name=amdgpu
baseurl=https://repo.radeon.com/amdgpu/latest/rhel/8.6/main/x86_64
enabled=1
gpgcheck=1
gpgkey=https://repo.radeon.com/rocm/rocm.gpg.key
EOF
# Add the rocm repository for RHEL 8
sudo tee /etc/yum.repos.d/rocm.repo <<'EOF'
[rocm]
name=rocm
baseurl=https://repo.radeon.com/rocm/rhel8/latest/main
enabled=1
priority=50
gpgcheck=1
gpgkey=https://repo.radeon.com/rocm/rocm.gpg.key
EOF
# Add the amdgpu module repository for RHEL 8.7
sudo tee /etc/yum.repos.d/amdgpu.repo <<'EOF'
[amdgpu]
name=amdgpu
baseurl=https://repo.radeon.com/amdgpu/latest/rhel/8.7/main/x86_64
enabled=1
gpgcheck=1
gpgkey=https://repo.radeon.com/rocm/rocm.gpg.key
EOF
# Add the rocm repository for RHEL 8
sudo tee /etc/yum.repos.d/rocm.repo <<'EOF'
[rocm]
name=rocm
baseurl=https://repo.radeon.com/rocm/rhel8/latest/main
enabled=1
priority=50
gpgcheck=1
gpgkey=https://repo.radeon.com/rocm/rocm.gpg.key
EOF
# Add the amdgpu module repository for RHEL 9.1
sudo tee /etc/yum.repos.d/amdgpu.repo <<'EOF'
[amdgpu]
name=amdgpu
baseurl=https://repo.radeon.com/amdgpu/latest/rhel/9.1/main/x86_64
enabled=1
gpgcheck=1
gpgkey=https://repo.radeon.com/rocm/rocm.gpg.key
EOF
# Add the rocm repository for RHEL 9
sudo tee /etc/yum.repos.d/rocm.repo <<'EOF'
[rocm]
name=rocm
baseurl=https://repo.radeon.com/rocm/rhel9/latest/main
enabled=1
priority=50
gpgcheck=1
gpgkey=https://repo.radeon.com/rocm/rocm.gpg.key
EOF

2. Clean cached files from enabled repositories

sudo yum clean all

1. Add the repositories

# Add the amdgpu module repository for SLES 15.4
sudo tee /etc/zypp/repos.d/amdgpu.repo <<'EOF'
[amdgpu]
name=amdgpu
baseurl=https://repo.radeon.com/amdgpu/latest/sle/15.4/main/x86_64
enabled=1
gpgcheck=1
gpgkey=https://repo.radeon.com/rocm/rocm.gpg.key
EOF
# Add the rocm repository for SLES
sudo tee /etc/zypp/repos.d/rocm.repo <<'EOF'
[rocm]
name=rocm
baseurl=https://repo.radeon.com/rocm/zyp/zypper
enabled=1
priority=50
gpgcheck=1
gpgkey=https://repo.radeon.com/rocm/rocm.gpg.key
EOF

2. Update the new repository

sudo zypper ref

Install Drivers#

Install the amdgpu-dkms kernel module (the kernel-mode driver) on your system, using the command that matches your distribution.

sudo apt install amdgpu-dkms        # Ubuntu
sudo yum install amdgpu-dkms        # RHEL
sudo zypper install amdgpu-dkms     # SLES

Install ROCm Runtimes#

Install the rocm-hip-libraries meta-package. This contains dependencies for most common ROCm applications.

sudo apt install rocm-hip-libraries        # Ubuntu
sudo yum install rocm-hip-libraries        # RHEL
sudo zypper install rocm-hip-libraries     # SLES

Reboot the system#

Loading the new driver requires a reboot of the system.

sudo reboot
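
After the system comes back up, a quick sanity check confirms that the driver and runtime are in place; both commands are described in more detail in the verification sections later in this guide (rocminfo is typically installed with the ROCm runtime packages):

dkms status
/opt/rocm/bin/rocminfo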

Deploy ROCm on Linux#

Start with Quick Start (Linux) or follow the detailed instructions below.

Prepare to Install#

Prerequisites

The prerequisites page lists the required steps before installation.

Install Choices

Package manager vs AMDGPU Installer

Standard Packages vs Multi-Version Packages

Choose your install method#

Package Manager

Directly use your distribution’s package manager to install ROCm.

AMDGPU Installer

Use an installer tool that orchestrates changes via the package manager.

See Also#

ROCm Installation Options (Linux)#

Users installing ROCm must choose between various installation options. A new user should follow the Quick Start guide.

Package Manager versus AMDGPU Installer?#

ROCm supports two methods for installation:

  • Directly using the Linux distribution’s package manager

  • The amdgpu-install script

There is no difference in the final installation state when choosing either option.

Using the distribution’s package manager lets the user install, upgrade and uninstall using familiar commands and workflows. Third-party ecosystem support is the same as that of your OS package manager.

The amdgpu-install script is a wrapper around the package manager; it installs the same packages as the package manager method.

The installer automates the installation process for the AMDGPU and ROCm stack. It handles the complete installation process for ROCm, including setting up the repository, cleaning the system, updating, and installing the desired drivers and meta-packages. Users who are less familiar with the package manager can choose this method for ROCm installation.

Single Version ROCm install versus Multi-Version#

ROCm packages are versioned with both package-specific semantic versioning and a ROCm release version.

Single-version Installation#

The single-version ROCm installation refers to the following:

  • Installation of a single instance of the ROCm release on a system

  • Use of non-versioned ROCm meta-packages

Multi-version Installation#

The multi-version installation refers to the following:

  • Installation of multiple instances of the ROCm stack on a system. Extending the package name and its dependencies with the release version adds the ability to support multiple versions of packages simultaneously.

  • Use of versioned ROCm meta-packages.
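
For illustration only, a minimal sketch of how the two types differ in practice on an Ubuntu system. The package names follow the versioned naming pattern shown later in this guide, and the two commands are alternatives, not meant to be combined (see the Attention note below):

# Single-version: non-versioned meta-package, one ROCm release on the system
sudo apt install rocm-hip-sdk
# Multi-version: versioned meta-packages, one install tree per release
sudo apt install rocm-hip-sdk5.0.0 rocm-hip-sdk5.0.2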

Attention

ROCm packages that were previously installed from a single-version installation must be removed before proceeding with the multi-version installation to avoid conflicts.

Note

Multi-version installation is not available for the kernel-mode driver module, also referred to as AMDGPU.

The following image demonstrates the difference between single-version and multi-version ROCm installation types:

ROCm Installation Types#

Installation Prerequisites (Linux)#

Perform the following steps before installing ROCm to check that the system meets all the requirements for the installation.

Confirm the System Has a Supported Linux Distribution Version#

The ROCm installation is supported only on specific Linux distributions and kernel versions.

Check the Linux Distribution and Kernel Version on Your System#

This section discusses obtaining information about the Linux distribution and kernel version.

Linux Distribution Information#

Verify the Linux distribution using the following steps:

  1. To obtain the Linux distribution information, type the following command on your system from the Command Line Interface (CLI):

    uname -m && cat /etc/*release
    
  2. Confirm that the obtained Linux distribution information matches with those listed in Supported Distributions.

    Example: Running the command above on an Ubuntu system results in the following output:

    x86_64
    DISTRIB_ID=Ubuntu
    DISTRIB_RELEASE=20.04
    DISTRIB_CODENAME=focal
    DISTRIB_DESCRIPTION="Ubuntu 20.04.5 LTS"
    
Kernel Information#

Verify the kernel version using the following steps:

  1. To check the kernel version of your Linux system, type the following command:

    uname -srmv
    

    Example: The output of the command above lists the kernel version in the following format:

    Linux 5.15.0-46-generic #44~20.04.5-Ubuntu SMP Fri Jun 24 13:27:29 UTC 2022 x86_64
    
  2. Confirm that the obtained kernel version information matches with system requirements as listed in Supported Distributions.

Additional package repositories#

On some distributions, the ROCm packages depend on packages outside the default package repositories. These extra repositories need to be enabled before installation. Follow the instructions below for your distribution.

All packages are available in the default Ubuntu repositories, therefore no additional repositories need to be added.

1. Add the EPEL repository

# RHEL 8
wget https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm
sudo rpm -ivh epel-release-latest-8.noarch.rpm
# RHEL 9
wget https://dl.fedoraproject.org/pub/epel/epel-release-latest-9.noarch.rpm
sudo rpm -ivh epel-release-latest-9.noarch.rpm

2. Enable the CodeReady Linux Builder repository

Run the following command and follow the instructions.

sudo crb enable

Add the perl languages repository.

zypper addrepo https://download.opensuse.org/repositories/devel:languages:perl/SLE_15_SP4/devel:languages:perl.repo

Kernel headers and development packages#

The driver package uses DKMS to build the amdgpu-dkms module (driver) for the installed kernels. This requires the Linux kernel headers and modules to be installed for each kernel. Usually these are installed automatically with the kernel, but if you have multiple kernel versions or have downloaded kernel images rather than the kernel meta-packages, they must be installed manually.

To install for the currently active kernel run the command corresponding to your distribution.

sudo apt install "linux-headers-$(uname -r)" "linux-modules-extra-$(uname -r)"
sudo yum install kernel-headers kernel-devel
sudo zypper install kernel-default-devel
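
To confirm that headers matching the running kernel are present, a quick check such as the following can be used (an Ubuntu-specific sketch; the package name pattern matches the install command above):

# Show the running kernel, then list the matching headers package if installed
uname -r
dpkg -l "linux-headers-$(uname -r)"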

Setting Permissions for Groups#

This section provides steps to add the current user to the render and video groups to access GPU resources. Use of the video group is recommended for all ROCm-supported operating systems.

  1. To check the groups in your system, issue the following command:

    groups
    
  2. Add yourself to the render and video group using the command:

    sudo usermod -a -G render,video $LOGNAME
    

To add all future users to the video and render groups by default, run the following commands:

echo 'ADD_EXTRA_GROUPS=1' | sudo tee -a /etc/adduser.conf
echo 'EXTRA_GROUPS="video render"' | sudo tee -a /etc/adduser.conf

Installation via Package manager#

Install

How to install ROCm?

Upgrade

Instructions for upgrading an existing ROCm installation.

Uninstall

Steps for removing ROCm packages, libraries and tools.

Package Manager Integration

Information about packages.

See Also#

Installation (Linux)#

Understanding the Release-specific AMDGPU and ROCm Repositories on Linux Distributions#

The release-specific repositories consist of packages from a specific release of the AMDGPU and ROCm stacks. The repositories are not updated with the latest packages from subsequent releases. When a new ROCm release is available, a new repository specific to that release is added. You can select a specific release to install, update a previously installed single version to a later release, or add the latest version of ROCm alongside the currently installed version by using the multi-version ROCm packages.

Step by Step Instructions#

1. Download and convert the package signing key

# Make the directory if it doesn't exist yet.
# This location is recommended by the distribution maintainers.
sudo mkdir --parents --mode=0755 /etc/apt/keyrings
# Download the key, convert the signing-key to a full
# keyring required by apt and store in the keyring directory
wget https://repo.radeon.com/rocm/rocm.gpg.key -O - | \
    gpg --dearmor | sudo tee /etc/apt/keyrings/rocm.gpg > /dev/null

Note

The GPG key may change; ensure it is updated when installing a new release. If the key signature verification fails while updating, re-add the key from the ROCm apt repository as shown above. The current rocm.gpg.key is not available in a standard key ring distribution, but it has the following SHA1 checksum: 73f5d8100de6048aa38a8b84cd9a87f05177d208  rocm.gpg.key
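
As a sanity check, the downloaded key can be compared against this checksum before it is converted (a minimal sketch; it assumes the key is saved to a local file rather than piped directly into gpg as above):

# Download the key to a file and compare its SHA1 sum with the value in the note above
wget https://repo.radeon.com/rocm/rocm.gpg.key
sha1sum rocm.gpg.key
# Expected: 73f5d8100de6048aa38a8b84cd9a87f05177d208  rocm.gpg.key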

2. Add the AMDGPU Repository and Install the Kernel-mode Driver

Tip

If you have a version of the kernel-mode driver installed, you may skip this section.

To add the AMDGPU repository, follow these steps:

# amdgpu repository for bionic
echo 'deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/amdgpu/21.50.2/ubuntu bionic main' \
    | sudo tee /etc/apt/sources.list.d/amdgpu.list
sudo apt update
# amdgpu repository for focal
echo 'deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/amdgpu/21.50.2/ubuntu focal main' \
    | sudo tee /etc/apt/sources.list.d/amdgpu.list
sudo apt update

Install the kernel mode driver and reboot the system using the following commands:

sudo apt install amdgpu-dkms
sudo reboot

3. Add the ROCm Repository

To add the ROCm repository, use the following steps:

# ROCm repositories for bionic
for ver in 5.0 5.0.2; do
echo "deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/rocm/apt/$ver bionic main" \
    | sudo tee --append /etc/apt/sources.list.d/rocm.list
done
echo -e 'Package: *\nPin: release o=repo.radeon.com\nPin-Priority: 600' \
    | sudo tee /etc/apt/preferences.d/rocm-pin-600
sudo apt update
# ROCm repositories for focal
for ver in 5.0 5.0.2; do
echo "deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/rocm/apt/$ver focal main" \
    | sudo tee --append /etc/apt/sources.list.d/rocm.list
done
echo -e 'Package: *\nPin: release o=repo.radeon.com\nPin-Priority: 600' \
    | sudo tee /etc/apt/preferences.d/rocm-pin-600
sudo apt update

4. Install packages

Install packages of your choice in a single-version ROCm install or in a multi-version ROCm install fashion. For more information on what single/multi-version installations are, refer to Single Version ROCm install versus Multi-Version. For a comprehensive list of meta-packages, refer to Meta-packages and Their Descriptions.

  • Sample Single-version installation

    sudo apt install rocm-hip-sdk
    
  • Sample Multi-version installation

    sudo apt install rocm-hip-sdk5.0.2
    

1. Add the AMDGPU Stack Repository and Install the Kernel-mode Driver

Tip

If you have a version of the kernel-mode driver installed, you may skip this section.

sudo tee /etc/yum.repos.d/amdgpu.repo <<EOF
[amdgpu]
name=amdgpu
baseurl=https://repo.radeon.com/amdgpu/21.50.2/rhel/7.9/main/x86_64/
enabled=1
priority=50
gpgcheck=1
gpgkey=https://repo.radeon.com/rocm/rocm.gpg.key
EOF
sudo yum clean all
sudo tee /etc/yum.repos.d/amdgpu.repo <<EOF
[amdgpu]
name=amdgpu
baseurl=https://repo.radeon.com/amdgpu/21.50.2/rhel/8.4/main/x86_64/
enabled=1
priority=50
gpgcheck=1
gpgkey=https://repo.radeon.com/rocm/rocm.gpg.key
EOF
sudo yum clean all
sudo tee /etc/yum.repos.d/amdgpu.repo <<EOF
[amdgpu]
name=amdgpu
baseurl=https://repo.radeon.com/amdgpu/21.50.2/rhel/8.5/main/x86_64/
enabled=1
priority=50
gpgcheck=1
gpgkey=https://repo.radeon.com/rocm/rocm.gpg.key
EOF
sudo yum clean all

Install the kernel mode driver and reboot the system using the following commands:

sudo yum install amdgpu-dkms
sudo reboot

2. Add the ROCm Stack Repository

To add the ROCm repository, use the following steps, based on your distribution:

for ver in 5.0 5.0.2; do
sudo tee --append /etc/yum.repos.d/rocm.repo <<EOF
[ROCm-$ver]
name=ROCm$ver
baseurl=https://repo.radeon.com/rocm/yum/$ver/main
enabled=1
priority=50
gpgcheck=1
gpgkey=https://repo.radeon.com/rocm/rocm.gpg.key
EOF
done
sudo yum clean all
for ver in 5.0 5.0.2; do
sudo tee --append /etc/yum.repos.d/rocm.repo <<EOF
[ROCm-$ver]
name=ROCm$ver
baseurl=https://repo.radeon.com/rocm/rhel8/$ver/main
enabled=1
priority=50
gpgcheck=1
gpgkey=https://repo.radeon.com/rocm/rocm.gpg.key
EOF
done
sudo yum clean all

3. Install packages

Install packages of your choice in a single-version ROCm install or in a multi-version ROCm install fashion. For more information on what single/multi-version installations are, refer to Single Version ROCm install versus Multi-Version. For a comprehensive list of meta-packages, refer to Meta-packages and Their Descriptions.

  • Sample Single-version installation

    sudo yum install rocm-hip-sdk
    
  • Sample Multi-version installation

    sudo yum install rocm-hip-sdk5.0.2
    

1. Add the AMDGPU Repository and Install the Kernel-mode Driver

Tip

If you have a version of the kernel-mode driver installed, you may skip this section.

sudo tee /etc/zypp/repos.d/amdgpu.repo <<EOF
[amdgpu]
name=amdgpu
baseurl=https://repo.radeon.com/amdgpu/21.50.2/sle/15.3/main/x86_64
enabled=1
gpgcheck=1
gpgkey=https://repo.radeon.com/rocm/rocm.gpg.key
EOF
sudo zypper ref

Install the kernel mode driver and reboot the system using the following commands:

sudo zypper --gpg-auto-import-keys install amdgpu-dkms
sudo reboot

2. Add the ROCm Stack Repository

To add the ROCm repository, use the following steps:

for ver in 5.0 5.0.2; do
sudo tee --append /etc/zypp/repos.d/rocm.repo <<EOF
[ROCm-$ver]
name=ROCm$ver
baseurl=https://repo.radeon.com/rocm/zyp/$ver/main
enabled=1
gpgcheck=1
gpgkey=https://repo.radeon.com/rocm/rocm.gpg.key
EOF
done
sudo zypper ref

3. Install packages

Install packages of your choice in a single-version ROCm install or in a multi-version ROCm install fashion. For more information on what single/multi-version installations are, refer to Single Version ROCm install versus Multi-Version. For a comprehensive list of meta-packages, refer to Meta-packages and Their Descriptions.

  • Sample Single-version installation

    sudo zypper --gpg-auto-import-keys install rocm-hip-sdk
    
  • Sample Multi-version installation

    sudo zypper --gpg-auto-import-keys install rocm-hip-sdk5.0.2
    
Post-install Actions and Verification Process#

The post-install actions listed here are optional and depend on your use case, but are generally useful. Verification of the install is advised.

Post-install Actions#
  1. Instruct the system linker where to find the shared objects (.so files) for ROCm applications.

    sudo tee --append /etc/ld.so.conf.d/rocm.conf <<EOF
    /opt/rocm/lib
    /opt/rocm/lib64
    EOF
    sudo ldconfig
    

    Note

    Multi-version installations require extra care. Having multiple versions on the system linker library search path is not advised. Take care both at compile time and at run time to ensure that the proper libraries are picked up. You can override ld.so.conf entries on a case-by-case basis using the LD_LIBRARY_PATH environment variable (see the sketch after this list).

  2. Add binary paths to the PATH environment variable.

    export PATH=$PATH:/opt/rocm-5.0.2/bin:/opt/rocm-5.0.2/opencl/bin
    

    Attention

    When using CMake to build applications, having the ROCm install location on the PATH subtly affects how ROCm libraries are searched for. See Config Mode Search Procedure and CMAKE_FIND_USE_SYSTEM_ENVIRONMENT_PATH for details.

    (Entries in the PATH, minus their trailing bin and sbin, are added to the library search paths; therefore, this convenience affects builds and results in the ROCm libraries almost always being found. This may be an issue when you are developing these libraries or want to use self-built versions of them.)
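
For multi-version installations, a minimal sketch of pinning a shell session and a build to one specific release instead of relying on ld.so.conf and PATH (paths follow the 5.0.2 example above; CMAKE_PREFIX_PATH is a standard CMake mechanism, not a ROCm-specific option):

# Prefer the 5.0.2 libraries for this shell session only
export LD_LIBRARY_PATH=/opt/rocm-5.0.2/lib:/opt/rocm-5.0.2/lib64:$LD_LIBRARY_PATH
# Point CMake at the same release when configuring a build
cmake -DCMAKE_PREFIX_PATH=/opt/rocm-5.0.2 ..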

Verifying Kernel-mode Driver Installation#

Check the installation of the kernel-mode driver by typing the command given below:

dkms status

Verifying ROCm Installation#

After completing the ROCm installation, execute the following commands on the system to verify if the installation is successful. If you see your GPUs listed by both commands, the installation is considered successful:

/opt/rocm/bin/rocminfo
# OR
/opt/rocm/opencl/bin/clinfo

Verifying Package Installation#

To ensure the packages are installed successfully, use the following commands:

sudo apt list --installed              # Ubuntu
sudo yum list installed                # RHEL
sudo zypper search --installed-only    # SLES
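
To narrow these listings down to ROCm packages, the output can be filtered; for example (the grep pattern simply matches the rocm prefix used by the packages):

sudo apt list --installed 2>/dev/null | grep rocm     # Ubuntu
sudo yum list installed | grep rocm                   # RHEL
sudo zypper search --installed-only | grep -i rocm    # SLES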

Upgrade ROCm with the package manager#

This section explains how to upgrade the existing AMDGPU driver and ROCm packages to the latest version using your OS’s distributed package manager.

Note

Package upgrade is applicable to single-version packages only. If you prefer to install an updated version of ROCm alongside the currently installed version, refer to the Installation (Linux) page.

Upgrade Steps#
Update the AMDGPU repository#

Execute the commands below based on your distribution to point the amdgpu repository to the new release.

# amdgpu repository for bionic
echo 'deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/amdgpu/21.50.2/ubuntu bionic main' \
    | sudo tee /etc/apt/sources.list.d/amdgpu.list
sudo apt update
# amdgpu repository for focal
echo 'deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/amdgpu/21.50.2/ubuntu focal main' \
    | sudo tee /etc/apt/sources.list.d/amdgpu.list
sudo apt update
sudo tee /etc/yum.repos.d/amdgpu.repo <<EOF
[amdgpu]
name=amdgpu
baseurl=https://repo.radeon.com/amdgpu/21.50.2/rhel/7.9/main/x86_64/
enabled=1
priority=50
gpgcheck=1
gpgkey=https://repo.radeon.com/rocm/rocm.gpg.key
EOF
sudo yum clean all
sudo tee /etc/yum.repos.d/amdgpu.repo <<EOF
[amdgpu]
name=amdgpu
baseurl=https://repo.radeon.com/amdgpu/21.50.2/rhel/8.4/main/x86_64/
enabled=1
priority=50
gpgcheck=1
gpgkey=https://repo.radeon.com/rocm/rocm.gpg.key
EOF
sudo yum clean all
sudo tee /etc/yum.repos.d/amdgpu.repo <<EOF
[amdgpu]
name=amdgpu
baseurl=https://repo.radeon.com/amdgpu/21.50.2/rhel/8.5/main/x86_64/
enabled=1
priority=50
gpgcheck=1
gpgkey=https://repo.radeon.com/rocm/rocm.gpg.key
EOF
sudo yum clean all
sudo tee /etc/zypp/repos.d/amdgpu.repo <<EOF
[amdgpu]
name=amdgpu
baseurl=https://repo.radeon.com/amdgpu/21.50.2/sle/15.3/main/x86_64
enabled=1
gpgcheck=1
gpgkey=https://repo.radeon.com/rocm/rocm.gpg.key
EOF
sudo zypper ref

Upgrade the kernel-mode driver & reboot#

Upgrade the kernel mode driver and reboot the system using the following commands based on your distribution:

# Ubuntu
sudo apt install amdgpu-dkms
sudo reboot
# RHEL
sudo yum install amdgpu-dkms
sudo reboot
# SLES
sudo zypper --gpg-auto-import-keys install amdgpu-dkms
sudo reboot

Update the ROCm repository#

Execute the commands below based on your distribution to point the rocm repository to the new release.

echo "deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/rocm/apt/5.0.5 bionic main" \
    | sudo tee /etc/apt/sources.list.d/rocm.list
echo -e 'Package: *\nPin: release o=repo.radeon.com\nPin-Priority: 600' \
    | sudo tee /etc/apt/preferences.d/rocm-pin-600
sudo apt update
echo "deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/rocm/apt/5.0.2 focal main" \
    | sudo tee /etc/apt/sources.list.d/rocm.list
echo -e 'Package: *\nPin: release o=repo.radeon.com\nPin-Priority: 600' \
    | sudo tee /etc/apt/preferences.d/rocm-pin-600
sudo apt update
sudo tee /etc/yum.repos.d/rocm.repo <<EOF
[ROCm-5.0.2]
name=ROCm5.0.2
baseurl=https://repo.radeon.com/rocm/yum/5.0.2/main
enabled=1
priority=50
gpgcheck=1
gpgkey=https://repo.radeon.com/rocm/rocm.gpg.key
EOF
sudo yum clean all
sudo tee /etc/yum.repos.d/rocm.repo <<EOF
[ROCm-5.0.2]
name=ROCm5.0.2
baseurl=https://repo.radeon.com/rocm/rhel8/5.0.2/main
enabled=1
priority=50
gpgcheck=1
gpgkey=https://repo.radeon.com/rocm/rocm.gpg.key
EOF
sudo yum clean all
sudo tee /etc/zypp/repos.d/rocm.repo <<EOF
[ROCm-5.0.2]
name=ROCm5.0.2
baseurl=https://repo.radeon.com/rocm/zyp/5.0.2/main
enabled=1
gpgcheck=1
gpgkey=https://repo.radeon.com/rocm/rocm.gpg.key
EOF
sudo zypper ref

Upgrade the ROCm packages#

Your packages can now be upgraded through their meta-packages; see the following examples based on your distribution:

# Ubuntu
sudo apt install --only-upgrade rocm-hip-sdk
# RHEL
sudo yum update rocm-hip-sdk
# SLES
sudo zypper --gpg-auto-import-keys update rocm-hip-sdk

Verification Process#

To verify if the upgrade is successful, refer to the Post-install Actions and Verification Process given in the Installation section.

Uninstallation with package manager (Linux)#

This section describes how to uninstall ROCm with the Linux distribution’s package manager. This method should be used if ROCm was installed via the package manager. If the installer script was used for installation, then it should also be used for uninstallation; refer to Installer Script Uninstallation (Linux).

Uninstalling Specific Meta-packages

# Uninstall single-version ROCm packages
sudo apt autoremove <package-name>
# Uninstall multiversion ROCm packages
sudo apt autoremove <package-name with release version>

Complete Uninstallation of ROCm Packages

# Uninstall single-version ROCm packages
sudo apt autoremove rocm-core
# Uninstall multiversion ROCm packages
sudo apt autoremove rocm-core<release version>

Uninstall Kernel-mode Driver

sudo apt autoremove amdgpu-dkms

Remove ROCm and AMDGPU Repositories

  1. Execute these commands:

    sudo rm /etc/apt/sources.list.d/<rocm_repository-name>.list
    sudo rm /etc/apt/sources.list.d/<amdgpu_repository-name>.list
    
  2. Clear the cache and clean the system.

    sudo rm -rf /var/cache/apt/*
    sudo apt-get clean all
    
  3. Restart the system.

    sudo reboot
    

Uninstalling Specific Meta-packages

# Uninstall single-version ROCm packages
sudo yum remove <package-name>
# Uninstall multiversion ROCm packages
sudo yum remove <package-name with release version>

Complete Uninstallation of ROCm Packages

# Uninstall single-version ROCm packages
sudo yum remove rocm-core
# Uninstall multiversion ROCm packages
sudo yum remove rocm-core<release version>

Uninstall Kernel-mode Driver

sudo yum autoremove amdgpu-dkms

Remove ROCm and AMDGPU Repositories

  1. Execute these commands:

    sudo rm -rf /etc/yum.repos.d/<rocm_repository-name> # Remove only rocm repo
    sudo rm -rf /etc/yum.repos.d/<amdgpu_repository-name> # Remove only amdgpu repo
    
  2. Clear the cache and clean the system.

    sudo rm -rf /var/cache/yum #Remove the cache
    sudo yum clean all
    
  3. Restart the system.

    sudo reboot
    

Uninstalling Specific Meta-packages

# Uninstall all single-version ROCm packages
sudo zypper remove <package-name>
# Uninstall all multiversion ROCm packages
sudo zypper remove <package-name with release version>

Complete Uninstallation of ROCm Packages

# Uninstall all single-version ROCm packages
sudo zypper remove rocm-core
# Uninstall all multiversion ROCm packages
sudo zypper remove rocm-core<release version>

Uninstall Kernel-mode Driver

sudo zypper remove --clean-deps amdgpu-dkms

Remove ROCm and AMDGPU Repositories

  1. Execute these commands:

    sudo zypper removerepo <rocm_repository-name>
    sudo zypper removerepo <amdgpu_repository-name>
    
  2. Clear the cache and clean the system.

    sudo zypper clean --all
    
  3. Restart the system.

    sudo reboot
    

Package Manager Integration#

This section provides information about the required meta-packages for the following AMD ROCm programming models:

  • Heterogeneous-Computing Interface for Portability (HIP)

  • OpenCL™

  • OpenMP™

ROCm Package Naming Conventions#

A meta-package is a grouping of related packages and dependencies used to support a specific use case.

Example: Running HIP applications

All meta-packages exist in both versioned and non-versioned forms.

  • Non-versioned packages – For a single-version installation of the ROCm stack

  • Versioned packages – For multi-version installations of the ROCm stack

ROCm Release Package Naming#

Fig. 2 demonstrates the single and multi-version ROCm packages’ naming structure, including examples for various Linux distributions. See terms below:

Module - The part of the package name that identifies the ROCm component.

Example: The examples mentioned in the image represent the ROCm HIP module.

Module version - The version of the library released in that package. It increases with each new release.

Release version - The ROCm release version in which the package was released.

Example: 50400 corresponds to the ROCm 5.4.0 release.

Build id - The Jenkins build number for that release.

Arch - The architecture for which the package was built.

Distro - The distribution for which the package was built. It is only valid for rpm packages.

Example: el8 represents RHEL 8.x packages.

Components of ROCm Programming Models#

Fig. 3 demonstrates the high-level layered architecture of ROCm programming models and their meta-packages. All meta-packages are a combination of required packages and libraries.

Example:

  • rocm-hip-runtime is used to deploy on supported machines to execute HIP applications.

  • rocm-hip-sdk contains runtime components to deploy and execute HIP applications.

ROCm Meta Packages#

Note

rocm-llvm is not a meta-package but a single package that installs the ROCm clang compiler files.

Meta-packages and Their Descriptions#

Meta-package             Description
rocm-language-runtime    The ROCm runtime
rocm-hip-runtime         Run HIP applications written for the AMD platform
rocm-opencl-runtime      Run OpenCL-based applications on the AMD platform
rocm-hip-runtime-devel   Develop applications on HIP or port them from CUDA
rocm-opencl-sdk          Develop applications in OpenCL for the AMD platform
rocm-hip-libraries       HIP libraries optimized for the AMD platform
rocm-hip-sdk             Develop or port HIP applications and libraries for the AMD platform
rocm-developer-tools     Debug and profile HIP applications
rocm-ml-sdk              Develop and run machine learning applications optimized for AMD platforms
rocm-ml-libraries        Key machine learning libraries, notably MIOpen
rocm-openmp-sdk          Develop OpenMP-based applications for the AMD platform
rocm-openmp-runtime      Run OpenMP-based applications on the AMD platform

Packages in ROCm Programming Models#

This section discusses the available meta-packages and their packages. The following image visualizes the meta-packages and their associated packages in a ROCm programming model.

Associated Packages#

  • Meta-packages can include another meta-package.

  • rocm-core package is common across all the meta-packages.

  • Meta-packages and associated packages are represented in the same color.

Note

Fig. 4 is for informational purposes only, as the individual packages in a meta-package are subject to change. Install meta-packages, and not individual packages, to avoid conflicts.

AMDGPU Install Script#

Install

How to install ROCm?

Upgrade

Instructions for upgrading an existing ROCm installation.

Uninstall

Steps for removing ROCm packages, libraries and tools.

See Also#

Installation with install script#

Prior to beginning, please ensure you have the prerequisites installed.

Download the Installer Script#

To download and install the amdgpu-install script on the system, use the following commands based on your distribution.

sudo apt update
wget https://repo.radeon.com/amdgpu-install/21.50.2/ubuntu/bionic/amdgpu-install_21.50.2.50002-1_all.deb
sudo apt install ./amdgpu-install_21.50.2.50002-1_all.deb
sudo apt update
wget https://repo.radeon.com/amdgpu-install/21.50.2/ubuntu/focal/amdgpu-install_21.50.2.50002-1_all.deb
sudo apt install ./amdgpu-install_21.50.2.50002-1_all.deb
sudo yum install https://repo.radeon.com/amdgpu-install/21.50.2/rhel/7.9/amdgpu-install-21.50.2.50002-1.el7.noarch.rpm
sudo yum install https://repo.radeon.com/amdgpu-install/21.50.2/rhel/8.4/amdgpu-install-21.50.2.50002-1.el7.noarch.rpm
sudo yum install https://repo.radeon.com/amdgpu-install/21.50.2/rhel/8.5/amdgpu-install-21.50.2.50002-1.el7.noarch.rpm
sudo zypper --no-gpg-checks install https://repo.radeon.com/amdgpu-install/21.50.2/sle/15/amdgpu-install-21.50.2.50002-1.noarch.rpm

Use cases#

Instead of installing individual applications or libraries, the installer script groups packages into specific use cases that match typical workflows and runtimes.

To display a list of available use cases execute the command:

sudo amdgpu-install --list-usecase

The available use-cases will be printed in a format similar to the example output below.

If --usecase option is not present, the default selection is "graphics,opencl,hip"

Available use cases:
rocm(for users and developers requiring full ROCm stack)
- OpenCL (ROCr/KFD based) runtime
- HIP runtimes
- Machine learning framework
- All ROCm libraries and applications
- ROCm Compiler and device libraries
- ROCr runtime and thunk
lrt(for users of applications requiring ROCm runtime)
- ROCm Compiler and device libraries
- ROCr runtime and thunk
opencl(for users of applications requiring OpenCL on Vega or
later products)
- ROCr based OpenCL
- ROCm Language runtime

openclsdk (for application developers requiring ROCr based OpenCL)
- ROCr based OpenCL
- ROCm Language runtime
- development and SDK files for ROCr based OpenCL

hip(for users of HIP runtime on AMD products)
- HIP runtimes
hiplibsdk (for application developers requiring HIP on AMD products)
- HIP runtimes
- ROCm math libraries
- HIP development libraries

To install use cases specific to your requirements, use the installer amdgpu-install as follows:

  • To install a single use case add it with the --usecase option:

    sudo amdgpu-install --usecase=rocm
    
  • For multiple use cases separate them with commas:

    sudo amdgpu-install --usecase=hiplibsdk,rocm
    
Single-version ROCm Installation#

By default (without the --rocmrelease option) the installer script will install packages in the single-version layout.

Multi-version ROCm Installation#

For the multi-version ROCm installation you must use the installer script from the latest release of ROCm that you wish to install.

Example: If you want to install ROCm releases 5.0.0 and 5.0.2 simultaneously, you are required to download the installer from the latest ROCm release v5.0.2.

Add Required Repositories#

You must add the ROCm repositories manually for all ROCm releases you want to install except the latest one. The amdgpu-install script automatically adds the required repositories for the latest release.

Run the following commands based on your distribution to add the repositories:

for ver in 5.0; do
echo "deb [arch=amd64 signed-by=/etc/apt/trusted.gpg.d/rocm-keyring.gpg] https://repo.radeon.com/rocm/apt/$ver bionic main" | sudo tee /etc/apt/sources.list.d/rocm.list
done
echo -e 'Package: *\nPin: release o=repo.radeon.com\nPin-Priority: 600' | sudo tee /etc/apt/preferences.d/rocm-pin-600
sudo apt update
for ver in 5.0; do
echo "deb [arch=amd64 signed-by=/etc/apt/trusted.gpg.d/rocm-keyring.gpg] https://repo.radeon.com/rocm/apt/$ver focal main" | sudo tee /etc/apt/sources.list.d/rocm.list
done
echo -e 'Package: *\nPin: release o=repo.radeon.com\nPin-Priority: 600' | sudo tee /etc/apt/preferences.d/rocm-pin-600
sudo apt update
for ver in 5.0; do
sudo tee --append /etc/yum.repos.d/rocm.repo <<EOF
[ROCm-$ver]
name=ROCm$ver
baseurl=https://repo.radeon.com/rocm/yum/$ver/main
enabled=1
priority=50
gpgcheck=1
gpgkey=https://repo.radeon.com/rocm/rocm.gpg.key
EOF
done
sudo yum clean all
for ver in 5.0; do
sudo tee --append /etc/yum.repos.d/rocm.repo <<EOF
[ROCm-$ver]
name=ROCm$ver
baseurl=https://repo.radeon.com/rocm/rhel8/$ver/main
enabled=1
priority=50
gpgcheck=1
gpgkey=https://repo.radeon.com/rocm/rocm.gpg.key
EOF
done
sudo yum clean all
for ver in 5.0; do
sudo tee --append /etc/zypp/repos.d/rocm.repo <<EOF
[ROCm-$ver]
name=rocm
baseurl=https://repo.radeon.com/rocm/$ver/sle/15/main/x86_64
enabled=1
gpgcheck=1
gpgkey=https://repo.radeon.com/rocm/rocm.gpg.key
EOF
done
sudo zypper ref

Install packages#

Use the installer script as given below:

sudo amdgpu-install --usecase=rocm --rocmrelease=<release-number-1>
sudo amdgpu-install --usecase=rocm --rocmrelease=<release-number-2>
sudo amdgpu-install --usecase=rocm --rocmrelease=<release-number-3>

The following are examples of a ROCm multi-version installation. The kernel-mode driver associated with the latest ROCm release in the list (v5.0.2 in this example) will be installed.

sudo amdgpu-install --usecase=rocm --rocmrelease=5.0.0
sudo amdgpu-install --usecase=rocm --rocmrelease=5.0.2

Additional options#
Unattended installation#

Adding -y as a parameter to amdgpu-install skips user prompts (for automation). Example: amdgpu-install -y --usecase=rocm

Skipping kernel mode driver installation#

The installer script tries to install the kernel-mode driver along with the requested use cases. This might be unnecessary, as in the case of Docker containers, or you may wish to keep a specific driver version when using a multi-version installation and not have the most recently installed ROCm release overwrite the kernel-mode driver.

To skip the installation of the kernel-mode driver add the --no-dkms option when calling the installer script.
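
For example, inside a Docker container where the host already provides the driver, an unattended installation without the driver might look like the following (a sketch combining the options described above):

sudo amdgpu-install -y --usecase=rocm --no-dkms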

Upgrading with the Installer Script (Linux)#

The upgrade procedure with the installer script is exactly the same as a first-time installation. Refer to the Installation with install script section for the exact procedure to follow.

Installer Script Uninstallation (Linux)#

To uninstall all ROCm packages and the kernel-mode driver, use the following commands.

Uninstalling Single-Version Install

sudo amdgpu-install --uninstall

Uninstalling a Specific ROCm Release

sudo amdgpu-install --uninstall --rocmrelease=<release-number>

Uninstalling all ROCm Releases

sudo amdgpu-install --uninstall --rocmrelease=all

Deploy ROCm Docker containers#

Prerequisites#

Docker containers share the kernel with the host operating system, therefore the ROCm kernel-mode driver must be installed on the host. Refer to Installation via Package manager for installing amdgpu-dkms. The other, user-space parts of the ROCm stack (such as the HIP runtime and the math libraries) are loaded from the container image and do not need to be installed on the host.

Accessing GPUs in containers#

In order to access GPUs in a container (to run applications using HIP, OpenCL or OpenMP offloading) explicit access to the GPUs must be granted.

The ROCm runtimes make use of multiple device files:

  • /dev/kfd: the main compute interface shared by all GPUs

  • /dev/dri/renderD<node>: direct rendering interface (DRI) devices for each GPU. <node> is a number for each card in the system starting from 128.
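
To see which render nodes exist on the host before deciding what to expose, a quick listing such as the following can be used:

# List the DRI render nodes available on the host
ls -l /dev/dri/renderD*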

Exposing these devices to a container is done by using the --device option, i.e. to allow access to all GPUs expose /dev/kfd and all /dev/dri/renderD devices:

docker run --device /dev/kfd --device /dev/dri/renderD128 --device /dev/dri/renderD129 ...

More conveniently, instead of listing all devices, the entire /dev/dri folder can be exposed to the new container:

docker run --device /dev/kfd --device /dev/dri

Note that this gives more access than strictly required, as it also exposes the other device files found in that folder to the container.

Restricting a container to a subset of the GPUs#

If a /dev/dri/renderD device is not exposed to a container, then it cannot use the GPU associated with it; this allows restricting a container to any subset of devices.

For example to allow the container to access the first and third GPU start it like:

docker run --device /dev/kfd --device /dev/dri/renderD128 --device /dev/dri/renderD130 <image>

Additional Options#

The performance of an application can vary depending on the assignment of GPUs and CPUs to the task. Typically, numactl is installed as part of many HPC applications to provide GPU/CPU mappings. This Docker runtime option supports memory mapping and can improve performance.

--security-opt seccomp=unconfined

This option is recommended for Docker Containers running HPC applications.

docker run --device /dev/kfd --device /dev/dri --security-opt seccomp=unconfined ...

Docker images in the ROCm ecosystem#

Base images#

RadeonOpenCompute/ROCm-docker hosts images useful for users wishing to build their own containers leveraging ROCm. The built images are available from Docker Hub. In particular rocm/rocm-terminal is a small image with the prerequisites to build HIP applications, but does not include any libraries.
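
Putting the above together, such a base image can be started with the device and security options described earlier (a sketch; the interactive flags and image tag are illustrative):

docker run -it --device /dev/kfd --device /dev/dri --security-opt seccomp=unconfined rocm/rocm-terminal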

Applications#

AMD provides pre-built images for various GPU-ready applications through its Infinity Hub at https://www.amd.com/en/technologies/infinity-hub. Examples for invoking each application and suggested parameters used for benchmarking are also provided there.

Release Notes#

The release notes for the ROCm platform.


ROCm 5.0.2#

Fixed Defects#

The following defects are fixed in the ROCm v5.0.2 release.

Issue with hostcall Facility in HIP Runtime#

In ROCm v5.0, when using the “assert()” call in a HIP kernel, the compiler may sometimes fail to emit kernel metadata related to the hostcall facility, which results in incomplete initialization of the hostcall facility in the HIP runtime. This can cause the HIP kernel to crash when it attempts to execute the “assert()” call.

The root cause was an incorrect check in the compiler to determine whether the hostcall facility is required by the kernel. This is fixed in the ROCm v5.0.2 release.

The resolution includes a compiler change, which emits the required metadata by default, unless the compiler can prove that the hostcall facility is not required by the kernel. This ensures that the “assert()” call never fails.

Note: This fix may lead to breakage in some OpenMP offload use cases, which use print inside a target region and result in an abort in device code. The issue will be fixed in a future release.

Compatibility Matrix Updates to ROCm Deep Learning Guide#

The compatibility matrix in the AMD Deep Learning Guide is updated for ROCm v5.0.2.

Library Changes in ROCM 5.0.2#

Library       Version
hipBLAS       0.49.0
hipCUB        2.10.13
hipFFT        1.0.4
hipSOLVER     1.2.0
hipSPARSE     2.0.0
rccl          2.10.3
rocALUTION    2.0.1
rocBLAS       2.42.0
rocFFT        1.0.13
rocPRIM       2.10.12
rocRAND       2.10.12
rocSOLVER     3.16.0
rocSPARSE     2.0.0
rocThrust     2.13.0
Tensile       4.31.0


ROCm 5.0.1#

Deprecations and Warnings#

Refactor of HIPCC/HIPCONFIG#

In prior ROCm releases, by default, the hipcc/hipconfig Perl scripts were used to identify and set target compiler options, target platform, compiler, and runtime appropriately.

In ROCm v5.0.1, hipcc.bin and hipconfig.bin have been added as compiled binary implementations of hipcc and hipconfig. These new binaries are currently a work in progress and are considered and marked as experimental. ROCm plans to fully transition to hipcc.bin and hipconfig.bin in a future ROCm release. The existing hipcc and hipconfig Perl scripts are renamed to hipcc.pl and hipconfig.pl respectively. New top-level hipcc and hipconfig Perl scripts are created, which can switch between the Perl scripts or the compiled binaries based on the environment variable HIPCC_USE_PERL_SCRIPT.

In ROCm 5.0.1, by default, this environment variable is set to use hipcc and hipconfig through the Perl scripts.

The Perl scripts will no longer be available in ROCm in a future release.

Library Changes in ROCM 5.0.1#

Library       Version
hipBLAS       0.49.0
hipCUB        2.10.13
hipFFT        1.0.4
hipSOLVER     1.2.0
hipSPARSE     2.0.0
rccl          2.10.3
rocALUTION    2.0.1
rocBLAS       2.42.0
rocFFT        1.0.13
rocPRIM       2.10.12
rocRAND       2.10.12
rocSOLVER     3.16.0
rocSPARSE     2.0.0
rocThrust     2.13.0
Tensile       4.31.0


ROCm 5.0.0#

What’s New in This Release#

HIP Enhancements#

The ROCm v5.0 release consists of the following HIP enhancements.

HIP Installation Guide Updates#

The HIP Installation Guide is updated to include building HIP from source on the NVIDIA platform.

Refer to the HIP Installation Guide v5.0 for more details.

Managed Memory Allocation#

Managed memory, including the __managed__ keyword, is now supported in the HIP combined host/device compilation. Through unified memory allocation, managed memory allows data to be shared and accessible to both the CPU and GPU using a single pointer. The allocation is managed by the AMD GPU driver using the Linux Heterogeneous Memory Management (HMM) mechanism. The user can call managed memory API hipMallocManaged to allocate a large chunk of HMM memory, execute kernels on a device, and fetch data between the host and device as needed.

Note

In a HIP application, it is recommended to do a capability check before calling the managed memory APIs. For example,

int managed_memory = 0;
HIPCHECK(hipDeviceGetAttribute(&managed_memory,
  hipDeviceAttributeManagedMemory,p_gpuDevice));
if (!managed_memory ) {
  printf ("info: managed memory access not supported on the device %d\n Skipped\n", p_gpuDevice);
}
else {
  HIPCHECK(hipSetDevice(p_gpuDevice));
  HIPCHECK(hipMallocManaged(&Hmm, N * sizeof(T)));
. . .
}

Note

The managed memory capability check may not be necessary; however, if HMM is not supported, managed malloc will fall back to using system memory. Other managed memory API calls will, then, have

Refer to the HIP API documentation for more details on managed memory APIs.

For a sample application, see the ROCm-Developer-Tools/HIP repository.

New Environment Variable#

The following new environment variable is added in this release:

Environment Variable: HSA_COOP_CU_COUNT

Value: 0 or 1 (default is 0)

Description: Some processors support more CUs than can reliably be used in a cooperative dispatch. Setting the environment variable HSA_COOP_CU_COUNT to 1 will cause ROCr to return the correct CU count for cooperative groups through the HSA_AMD_AGENT_INFO_COOPERATIVE_COMPUTE_UNIT_COUNT attribute of hsa_agent_get_info(). Setting HSA_COOP_CU_COUNT to other values, or leaving it unset, will cause ROCr to return the same CU count for the attributes HSA_AMD_AGENT_INFO_COOPERATIVE_COMPUTE_UNIT_COUNT and HSA_AMD_AGENT_INFO_COMPUTE_UNIT_COUNT. Future ROCm releases will make HSA_COOP_CU_COUNT=1 the default.

Breaking Changes#

Runtime Breaking Change#

Re-ordering of the enumerated type in hip_runtime_api.h to better match NV. See below for the difference in enumerated types.

ROCm software will be affected if any of the defined enums listed below are used in the code. Applications built with ROCm v5.0 enumerated types will work with a ROCm 4.5.2 driver. However, an undefined behavior error will occur with a ROCm v4.5.2 application that uses these enumerated types with a ROCm 5.0 runtime.

typedef enum hipDeviceAttribute_t {
-    hipDeviceAttributeMaxThreadsPerBlock,       ///< Maximum number of threads per block.
-    hipDeviceAttributeMaxBlockDimX,             ///< Maximum x-dimension of a block.
-    hipDeviceAttributeMaxBlockDimY,             ///< Maximum y-dimension of a block.
-    hipDeviceAttributeMaxBlockDimZ,             ///< Maximum z-dimension of a block.
-    hipDeviceAttributeMaxGridDimX,              ///< Maximum x-dimension of a grid.
-    hipDeviceAttributeMaxGridDimY,              ///< Maximum y-dimension of a grid.
-    hipDeviceAttributeMaxGridDimZ,              ///< Maximum z-dimension of a grid.
-    hipDeviceAttributeMaxSharedMemoryPerBlock,  ///< Maximum shared memory available per block in
-                                                ///< bytes.
-    hipDeviceAttributeTotalConstantMemory,      ///< Constant memory size in bytes.
-    hipDeviceAttributeWarpSize,                 ///< Warp size in threads.
-    hipDeviceAttributeMaxRegistersPerBlock,  ///< Maximum number of 32-bit registers available to a
-                                             ///< thread block. This number is shared by all thread
-                                             ///< blocks simultaneously resident on a
-                                             ///< multiprocessor.
-    hipDeviceAttributeClockRate,             ///< Peak clock frequency in kilohertz.
-    hipDeviceAttributeMemoryClockRate,       ///< Peak memory clock frequency in kilohertz.
-    hipDeviceAttributeMemoryBusWidth,        ///< Global memory bus width in bits.
-    hipDeviceAttributeMultiprocessorCount,   ///< Number of multiprocessors on the device.
-    hipDeviceAttributeComputeMode,           ///< Compute mode that device is currently in.
-    hipDeviceAttributeL2CacheSize,  ///< Size of L2 cache in bytes. 0 if the device doesn't have L2
-                                    ///< cache.
-    hipDeviceAttributeMaxThreadsPerMultiProcessor,  ///< Maximum resident threads per
-                                                    ///< multiprocessor.
-    hipDeviceAttributeComputeCapabilityMajor,       ///< Major compute capability version number.
-    hipDeviceAttributeComputeCapabilityMinor,       ///< Minor compute capability version number.
-    hipDeviceAttributeConcurrentKernels,  ///< Device can possibly execute multiple kernels
-                                          ///< concurrently.
-    hipDeviceAttributePciBusId,           ///< PCI Bus ID.
-    hipDeviceAttributePciDeviceId,        ///< PCI Device ID.
-    hipDeviceAttributeMaxSharedMemoryPerMultiprocessor,  ///< Maximum Shared Memory Per
-                                                         ///< Multiprocessor.
-    hipDeviceAttributeIsMultiGpuBoard,                   ///< Multiple GPU devices.
-    hipDeviceAttributeIntegrated,                        ///< iGPU
-    hipDeviceAttributeCooperativeLaunch,                 ///< Support cooperative launch
-    hipDeviceAttributeCooperativeMultiDeviceLaunch,      ///< Support cooperative launch on multiple devices
-    hipDeviceAttributeMaxTexture1DWidth,    ///< Maximum number of elements in 1D images
-    hipDeviceAttributeMaxTexture2DWidth,    ///< Maximum dimension width of 2D images in image elements
-    hipDeviceAttributeMaxTexture2DHeight,   ///< Maximum dimension height of 2D images in image elements
-    hipDeviceAttributeMaxTexture3DWidth,    ///< Maximum dimension width of 3D images in image elements
-    hipDeviceAttributeMaxTexture3DHeight,   ///< Maximum dimensions height of 3D images in image elements
-    hipDeviceAttributeMaxTexture3DDepth,    ///< Maximum dimensions depth of 3D images in image elements
+    hipDeviceAttributeCudaCompatibleBegin = 0,

-    hipDeviceAttributeHdpMemFlushCntl,      ///< Address of the HDP_MEM_COHERENCY_FLUSH_CNTL register
-    hipDeviceAttributeHdpRegFlushCntl,      ///< Address of the HDP_REG_COHERENCY_FLUSH_CNTL register
+    hipDeviceAttributeEccEnabled = hipDeviceAttributeCudaCompatibleBegin, ///< Whether ECC support is enabled.
+    hipDeviceAttributeAccessPolicyMaxWindowSize,        ///< Cuda only. The maximum size of the window policy in bytes.
+    hipDeviceAttributeAsyncEngineCount,                 ///< Cuda only. Asynchronous engines number.
+    hipDeviceAttributeCanMapHostMemory,                 ///< Whether host memory can be mapped into device address space
+    hipDeviceAttributeCanUseHostPointerForRegisteredMem,///< Cuda only. Device can access host registered memory
+                                                        ///< at the same virtual address as the CPU
+    hipDeviceAttributeClockRate,                        ///< Peak clock frequency in kilohertz.
+    hipDeviceAttributeComputeMode,                      ///< Compute mode that device is currently in.
+    hipDeviceAttributeComputePreemptionSupported,       ///< Cuda only. Device supports Compute Preemption.
+    hipDeviceAttributeConcurrentKernels,                ///< Device can possibly execute multiple kernels concurrently.
+    hipDeviceAttributeConcurrentManagedAccess,          ///< Device can coherently access managed memory concurrently with the CPU
+    hipDeviceAttributeCooperativeLaunch,                ///< Support cooperative launch
+    hipDeviceAttributeCooperativeMultiDeviceLaunch,     ///< Support cooperative launch on multiple devices
+    hipDeviceAttributeDeviceOverlap,                    ///< Cuda only. Device can concurrently copy memory and execute a kernel.
+                                                        ///< Deprecated. Use instead asyncEngineCount.
+    hipDeviceAttributeDirectManagedMemAccessFromHost,   ///< Host can directly access managed memory on
+                                                        ///< the device without migration
+    hipDeviceAttributeGlobalL1CacheSupported,           ///< Cuda only. Device supports caching globals in L1
+    hipDeviceAttributeHostNativeAtomicSupported,        ///< Cuda only. Link between the device and the host supports native atomic operations
+    hipDeviceAttributeIntegrated,                       ///< Device is integrated GPU
+    hipDeviceAttributeIsMultiGpuBoard,                  ///< Multiple GPU devices.
+    hipDeviceAttributeKernelExecTimeout,                ///< Run time limit for kernels executed on the device
+    hipDeviceAttributeL2CacheSize,                      ///< Size of L2 cache in bytes. 0 if the device doesn't have L2 cache.
+    hipDeviceAttributeLocalL1CacheSupported,            ///< caching locals in L1 is supported
+    hipDeviceAttributeLuid,                             ///< Cuda only. 8-byte locally unique identifier in 8 bytes. Undefined on TCC and non-Windows platforms
+    hipDeviceAttributeLuidDeviceNodeMask,               ///< Cuda only. Luid device node mask. Undefined on TCC and non-Windows platforms
+    hipDeviceAttributeComputeCapabilityMajor,           ///< Major compute capability version number.
+    hipDeviceAttributeManagedMemory,                    ///< Device supports allocating managed memory on this system
+    hipDeviceAttributeMaxBlocksPerMultiProcessor,       ///< Cuda only. Max block size per multiprocessor
+    hipDeviceAttributeMaxBlockDimX,                     ///< Max block size in width.
+    hipDeviceAttributeMaxBlockDimY,                     ///< Max block size in height.
+    hipDeviceAttributeMaxBlockDimZ,                     ///< Max block size in depth.
+    hipDeviceAttributeMaxGridDimX,                      ///< Max grid size  in width.
+    hipDeviceAttributeMaxGridDimY,                      ///< Max grid size  in height.
+    hipDeviceAttributeMaxGridDimZ,                      ///< Max grid size  in depth.
+    hipDeviceAttributeMaxSurface1D,                     ///< Maximum size of 1D surface.
+    hipDeviceAttributeMaxSurface1DLayered,              ///< Cuda only. Maximum dimensions of 1D layered surface.
+    hipDeviceAttributeMaxSurface2D,                     ///< Maximum dimension (width, height) of 2D surface.
+    hipDeviceAttributeMaxSurface2DLayered,              ///< Cuda only. Maximum dimensions of 2D layered surface.
+    hipDeviceAttributeMaxSurface3D,                     ///< Maximum dimension (width, height, depth) of 3D surface.
+    hipDeviceAttributeMaxSurfaceCubemap,                ///< Cuda only. Maximum dimensions of Cubemap surface.
+    hipDeviceAttributeMaxSurfaceCubemapLayered,         ///< Cuda only. Maximum dimension of Cubemap layered surface.
+    hipDeviceAttributeMaxTexture1DWidth,                ///< Maximum size of 1D texture.
+    hipDeviceAttributeMaxTexture1DLayered,              ///< Cuda only. Maximum dimensions of 1D layered texture.
+    hipDeviceAttributeMaxTexture1DLinear,               ///< Maximum number of elements allocatable in a 1D linear texture.
+                                                        ///< Use cudaDeviceGetTexture1DLinearMaxWidth() instead on Cuda.
+    hipDeviceAttributeMaxTexture1DMipmap,               ///< Cuda only. Maximum size of 1D mipmapped texture.
+    hipDeviceAttributeMaxTexture2DWidth,                ///< Maximum dimension width of 2D texture.
+    hipDeviceAttributeMaxTexture2DHeight,               ///< Maximum dimension hight of 2D texture.
+    hipDeviceAttributeMaxTexture2DGather,               ///< Cuda only. Maximum dimensions of 2D texture if gather operations  performed.
+    hipDeviceAttributeMaxTexture2DLayered,              ///< Cuda only. Maximum dimensions of 2D layered texture.
+    hipDeviceAttributeMaxTexture2DLinear,               ///< Cuda only. Maximum dimensions (width, height, pitch) of 2D textures bound to pitched memory.
+    hipDeviceAttributeMaxTexture2DMipmap,               ///< Cuda only. Maximum dimensions of 2D mipmapped texture.
+    hipDeviceAttributeMaxTexture3DWidth,                ///< Maximum dimension width of 3D texture.
+    hipDeviceAttributeMaxTexture3DHeight,               ///< Maximum dimension height of 3D texture.
+    hipDeviceAttributeMaxTexture3DDepth,                ///< Maximum dimension depth of 3D texture.
+    hipDeviceAttributeMaxTexture3DAlt,                  ///< Cuda only. Maximum dimensions of alternate 3D texture.
+    hipDeviceAttributeMaxTextureCubemap,                ///< Cuda only. Maximum dimensions of Cubemap texture
+    hipDeviceAttributeMaxTextureCubemapLayered,         ///< Cuda only. Maximum dimensions of Cubemap layered texture.
+    hipDeviceAttributeMaxThreadsDim,                    ///< Maximum dimension of a block
+    hipDeviceAttributeMaxThreadsPerBlock,               ///< Maximum number of threads per block.
+    hipDeviceAttributeMaxThreadsPerMultiProcessor,      ///< Maximum resident threads per multiprocessor.
+    hipDeviceAttributeMaxPitch,                         ///< Maximum pitch in bytes allowed by memory copies
+    hipDeviceAttributeMemoryBusWidth,                   ///< Global memory bus width in bits.
+    hipDeviceAttributeMemoryClockRate,                  ///< Peak memory clock frequency in kilohertz.
+    hipDeviceAttributeComputeCapabilityMinor,           ///< Minor compute capability version number.
+    hipDeviceAttributeMultiGpuBoardGroupID,             ///< Cuda only. Unique ID of device group on the same multi-GPU board
+    hipDeviceAttributeMultiprocessorCount,              ///< Number of multiprocessors on the device.
+    hipDeviceAttributeName,                             ///< Device name.
+    hipDeviceAttributePageableMemoryAccess,             ///< Device supports coherently accessing pageable memory
+                                                        ///< without calling hipHostRegister on it
+    hipDeviceAttributePageableMemoryAccessUsesHostPageTables, ///< Device accesses pageable memory via the host's page tables
+    hipDeviceAttributePciBusId,                         ///< PCI Bus ID.
+    hipDeviceAttributePciDeviceId,                      ///< PCI Device ID.
+    hipDeviceAttributePciDomainID,                      ///< PCI Domain ID.
+    hipDeviceAttributePersistingL2CacheMaxSize,         ///< Cuda11 only. Maximum l2 persisting lines capacity in bytes
+    hipDeviceAttributeMaxRegistersPerBlock,             ///< 32-bit registers available to a thread block. This number is shared
+                                                        ///< by all thread blocks simultaneously resident on a multiprocessor.
+    hipDeviceAttributeMaxRegistersPerMultiprocessor,    ///< 32-bit registers available per block.
+    hipDeviceAttributeReservedSharedMemPerBlock,        ///< Cuda11 only. Shared memory reserved by CUDA driver per block.
+    hipDeviceAttributeMaxSharedMemoryPerBlock,          ///< Maximum shared memory available per block in bytes.
+    hipDeviceAttributeSharedMemPerBlockOptin,           ///< Cuda only. Maximum shared memory per block usable by special opt in.
+    hipDeviceAttributeSharedMemPerMultiprocessor,       ///< Cuda only. Shared memory available per multiprocessor.
+    hipDeviceAttributeSingleToDoublePrecisionPerfRatio, ///< Cuda only. Performance ratio of single precision to double precision.
+    hipDeviceAttributeStreamPrioritiesSupported,        ///< Cuda only. Whether to support stream priorities.
+    hipDeviceAttributeSurfaceAlignment,                 ///< Cuda only. Alignment requirement for surfaces
+    hipDeviceAttributeTccDriver,                        ///< Cuda only. Whether device is a Tesla device using TCC driver
+    hipDeviceAttributeTextureAlignment,                 ///< Alignment requirement for textures
+    hipDeviceAttributeTexturePitchAlignment,            ///< Pitch alignment requirement for 2D texture references bound to pitched memory;
+    hipDeviceAttributeTotalConstantMemory,              ///< Constant memory size in bytes.
+    hipDeviceAttributeTotalGlobalMem,                   ///< Global memory available on devicice.
+    hipDeviceAttributeUnifiedAddressing,                ///< Cuda only. An unified address space shared with the host.
+    hipDeviceAttributeUuid,                             ///< Cuda only. Unique ID in 16 byte.
+    hipDeviceAttributeWarpSize,                         ///< Warp size in threads.

-    hipDeviceAttributeMaxPitch,             ///< Maximum pitch in bytes allowed by memory copies
-    hipDeviceAttributeTextureAlignment,     ///<Alignment requirement for textures
-    hipDeviceAttributeTexturePitchAlignment, ///<Pitch alignment requirement for 2D texture references bound to pitched memory;
-    hipDeviceAttributeKernelExecTimeout,    ///<Run time limit for kernels executed on the device
-    hipDeviceAttributeCanMapHostMemory,     ///<Device can map host memory into device address space
-    hipDeviceAttributeEccEnabled,           ///<Device has ECC support enabled
+    hipDeviceAttributeCudaCompatibleEnd = 9999,
+    hipDeviceAttributeAmdSpecificBegin = 10000,

-    hipDeviceAttributeCooperativeMultiDeviceUnmatchedFunc,        ///< Supports cooperative launch on multiple
-                                                                  ///devices with unmatched functions
-    hipDeviceAttributeCooperativeMultiDeviceUnmatchedGridDim,     ///< Supports cooperative launch on multiple
-                                                                  ///devices with unmatched grid dimensions
-    hipDeviceAttributeCooperativeMultiDeviceUnmatchedBlockDim,    ///< Supports cooperative launch on multiple
-                                                                  ///devices with unmatched block dimensions
-    hipDeviceAttributeCooperativeMultiDeviceUnmatchedSharedMem,   ///< Supports cooperative launch on multiple
-                                                                  ///devices with unmatched shared memories
-    hipDeviceAttributeAsicRevision,         ///< Revision of the GPU in this device
-    hipDeviceAttributeManagedMemory,        ///< Device supports allocating managed memory on this system
-    hipDeviceAttributeDirectManagedMemAccessFromHost, ///< Host can directly access managed memory on
-                                                      /// the device without migration
-    hipDeviceAttributeConcurrentManagedAccess,  ///< Device can coherently access managed memory
-                                                /// concurrently with the CPU
-    hipDeviceAttributePageableMemoryAccess,     ///< Device supports coherently accessing pageable memory
-                                                /// without calling hipHostRegister on it
-    hipDeviceAttributePageableMemoryAccessUsesHostPageTables, ///< Device accesses pageable memory via
-                                                              /// the host's page tables
-    hipDeviceAttributeCanUseStreamWaitValue ///< '1' if Device supports hipStreamWaitValue32() and
-                                            ///< hipStreamWaitValue64() , '0' otherwise.
+    hipDeviceAttributeClockInstructionRate = hipDeviceAttributeAmdSpecificBegin,  ///< Frequency in khz of the timer used by the device-side "clock*"
+    hipDeviceAttributeArch,                                     ///< Device architecture
+    hipDeviceAttributeMaxSharedMemoryPerMultiprocessor,         ///< Maximum Shared Memory PerMultiprocessor.
+    hipDeviceAttributeGcnArch,                                  ///< Device gcn architecture
+    hipDeviceAttributeGcnArchName,                              ///< Device gcnArch name in 256 bytes
+    hipDeviceAttributeHdpMemFlushCntl,                          ///< Address of the HDP_MEM_COHERENCY_FLUSH_CNTL register
+    hipDeviceAttributeHdpRegFlushCntl,                          ///< Address of the HDP_REG_COHERENCY_FLUSH_CNTL register
+    hipDeviceAttributeCooperativeMultiDeviceUnmatchedFunc,      ///< Supports cooperative launch on multiple
+                                                                ///< devices with unmatched functions
+    hipDeviceAttributeCooperativeMultiDeviceUnmatchedGridDim,   ///< Supports cooperative launch on multiple
+                                                                ///< devices with unmatched grid dimensions
+    hipDeviceAttributeCooperativeMultiDeviceUnmatchedBlockDim,  ///< Supports cooperative launch on multiple
+                                                                ///< devices with unmatched block dimensions
+    hipDeviceAttributeCooperativeMultiDeviceUnmatchedSharedMem, ///< Supports cooperative launch on multiple
+                                                                ///< devices with unmatched shared memories
+    hipDeviceAttributeIsLargeBar,                               ///< Whether it is LargeBar
+    hipDeviceAttributeAsicRevision,                             ///< Revision of the GPU in this device
+    hipDeviceAttributeCanUseStreamWaitValue,                    ///< '1' if Device supports hipStreamWaitValue32() and
+                                                                ///< hipStreamWaitValue64() , '0' otherwise.

+    hipDeviceAttributeAmdSpecificEnd = 19999,
+    hipDeviceAttributeVendorSpecificBegin = 20000,
+    // Extended attributes for vendors
 } hipDeviceAttribute_t;

 enum hipComputeMode {

Known Issues#

Incorrect dGPU Behavior When Using AMDVBFlash Tool#

The AMDVBFlash tool, used for flashing the VBIOS image to dGPU, does not communicate with the ROM Controller specifically when the driver is present. This is because the driver, as part of its runtime power management feature, puts the dGPU to a sleep state.

As a workaround, users can set the kernel module parameter amdgpu.runpm=0, which temporarily disables the runtime power management feature of the driver and dynamically changes some power control-related sysfs files.
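A minimal sketch of applying the workaround persistently, assuming a distribution that reads module options from /etc/modprobe.d (the file name below is only illustrative):

# Disable runtime power management for the amdgpu driver
echo "options amdgpu runpm=0" | sudo tee /etc/modprobe.d/amdgpu-runpm.conf
# On Ubuntu-based systems, regenerate the initramfs if amdgpu is loaded from it, then reboot
sudo update-initramfs -u

Alternatively, amdgpu.runpm=0 can be appended to the kernel command line for a one-off boot.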

Issue with START Timestamp in ROCProfiler#

Users may encounter an issue with the enabled timestamp functionality for monitoring one or multiple counters. ROCProfiler outputs the following four timestamps for each kernel:

  • Dispatch

  • Start

  • End

  • Complete

Issue#

This defect is related to the Start timestamp functionality, which incorrectly shows an earlier time than the Dispatch timestamp.

To reproduce the issue (an example invocation is shown after the list):

  1. Enable timing using the --timestamp on flag.

  2. Use the -i option with the input filename that contains the name of the counter(s) to monitor.

  3. Run the program.

  4. Check the output result file.
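The following is an illustrative invocation only; the counter input file (input.txt) and application (./app) are placeholders:

rocprof --timestamp on -i input.txt -o results.csv ./app

In the resulting output file, compare the dispatch and start timestamp columns for each kernel.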

Current behavior#

BeginNS is lower than DispatchNS, which is incorrect.

Expected behavior#

The correct order is:

Dispatch < Start < End < Complete

Because the Start timestamp is incorrect when counter collection is enabled, users cannot use ROCProfiler to measure the time spent in each kernel.

Radeon Pro V620 and W6800 Workstation GPUs#
No Support for SMI and ROCDebugger on SRIOV#

System Management Interface (SMI) and ROCDebugger are not supported in the SRIOV environment on any GPU. For more information, refer to the System Management Interface documentation.

Deprecations and Warnings#

ROCm Libraries Changes – Deprecations and Deprecation Removal#
  • The hipFFT.h header is now provided only by the hipFFT package. Up to ROCm 5.0, users would get hipFFT.h in the rocFFT package too.

  • The GlobalPairwiseAMG class has been entirely removed; users should use the PairwiseAMG class instead.

  • The rocsparse_spmm signature in 5.0 was changed to match that of rocsparse_spmm_ex. In 5.0, rocsparse_spmm_ex is still present but deprecated. Signature diff for rocsparse_spmm:

    rocsparse_spmm in 5.0

    rocsparse_status rocsparse_spmm(rocsparse_handle            handle,
                                    rocsparse_operation         trans_A,
                                    rocsparse_operation         trans_B,
                                    const void*                 alpha,
                                    const rocsparse_spmat_descr mat_A,
                                    const rocsparse_dnmat_descr mat_B,
                                    const void*                 beta,
                                    const rocsparse_dnmat_descr mat_C,
                                    rocsparse_datatype          compute_type,
                                    rocsparse_spmm_alg          alg,
                                    rocsparse_spmm_stage        stage,
                                    size_t*                     buffer_size,
                                    void*                       temp_buffer);
    

    rocsparse_spmm in 4.0

    rocsparse_status rocsparse_spmm(rocsparse_handle            handle,
                                    rocsparse_operation         trans_A,
                                    rocsparse_operation         trans_B,
                                    const void*                 alpha,
                                    const rocsparse_spmat_descr mat_A,
                                    const rocsparse_dnmat_descr mat_B,
                                    const void*                 beta,
                                    const rocsparse_dnmat_descr mat_C,
                                    rocsparse_datatype          compute_type,
                                    rocsparse_spmm_alg          alg,
                                    size_t*                     buffer_size,
                                    void*                       temp_buffer);
    
HIP API Deprecations and Warnings#
Warning - Arithmetic Operators of HIP Complex and Vector Types#

In this release, arithmetic operators of HIP complex and vector types are deprecated.

  • As alternatives to arithmetic operators of HIP complex types, users can use arithmetic operators of std::complex types.

  • As alternatives to arithmetic operators of HIP vector types, users can use the operators of the native clang vector type associated with the data member of HIP vector types.

During the deprecation, two macros _HIP_ENABLE_COMPLEX_OPERATORS and _HIP_ENABLE_VECTOR_OPERATORS are provided to allow users to conditionally enable arithmetic operators of HIP complex or vector types.

Note, the two macros are mutually exclusive and, by default, set to Off.

The arithmetic operators of HIP complex and vector types will be removed in a future release.
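For illustration, a HIP source file that still relies on the deprecated complex-type operators (the file name is a placeholder) can keep compiling during the deprecation period by defining the corresponding macro on the command line:

hipcc -D_HIP_ENABLE_COMPLEX_OPERATORS complex_math.hip -o complex_math

The analogous _HIP_ENABLE_VECTOR_OPERATORS define applies to the vector-type operators; remember that the two macros are mutually exclusive.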

Refer to the HIP API Guide for more information.

Warning - Compiler-Generated Code Object Version 4 Deprecation#

Support for loading compiler-generated code object version 4 will be deprecated in a future release with no release announcement and replaced with code object 5 as the default version.

The current default is code object version 4.

Warning - MIOpenTensile Deprecation#

MIOpenTensile will be deprecated in a future release.

Library Changes in ROCm 5.0.0#

Library       Version
hipBLAS       0.49.0
hipCUB        2.10.13
hipFFT        1.0.4
hipSOLVER     1.2.0
hipSPARSE     2.0.0
rccl          2.10.3
rocALUTION    2.0.1
rocBLAS       2.42.0
rocFFT        1.0.13
rocPRIM       2.10.12
rocRAND       2.10.12
rocSOLVER     3.16.0
rocSPARSE     2.0.0
rocThrust     2.13.0
Tensile       4.31.0

hipBLAS 0.49.0#

hipBLAS 0.49.0 for ROCm 5.0.0

Added#
  • Added rocSOLVER functions to hipblas-bench

  • Added option ROCM_MATHLIBS_API_USE_HIP_COMPLEX to opt-in to use hipFloatComplex and hipDoubleComplex

  • Added compilation warning for future trmm changes

  • Added documentation to hipblas.h

  • Added option to forgo pivoting for getrf and getri when ipiv is nullptr

  • Added code coverage option

Fixed#
  • Fixed use of incorrect ‘HIP_PATH’ when building from source.

  • Fixed windows packaging

  • Allowing negative increments in hipblas-bench

  • Removed boost dependency

hipCUB 2.10.13#

hipCUB 2.10.13 for ROCm 5.0.0

Fixed#
  • Added missing includes to hipcub.hpp

Added#
  • Bfloat16 support to test cases (device_reduce & device_radix_sort)

  • Device merge sort

  • Block merge sort

  • API update to CUB 1.14.0

Changed#
  • The SetupNVCC.cmake automatic target selector selects all of the capabilities of all available cards for the NVIDIA backend.

hipFFT 1.0.4#

hipFFT 1.0.4 for ROCm 5.0.0

Fixed#
  • Add calls to rocFFT setup/cleanup.

  • CMake fixes for clients and backend support.

Added#
  • Added support for Windows 10 as a build target.

hipSOLVER 1.2.0#

hipSOLVER 1.2.0 for ROCm 5.0.0

Added#
  • Added functions

    • sytrf

      • hipsolverSsytrf_bufferSize, hipsolverDsytrf_bufferSize, hipsolverCsytrf_bufferSize, hipsolverZsytrf_bufferSize

      • hipsolverSsytrf, hipsolverDsytrf, hipsolverCsytrf, hipsolverZsytrf

Fixed#
  • Fixed use of incorrect HIP_PATH when building from source (#40). Thanks @jakub329homola!

hipSPARSE 2.0.0#

hipSPARSE 2.0.0 for ROCm 5.0.0

Added#
  • Added (conjugate) transpose support for csrmv, hybmv and spmv routines

rccl 2.10.3#

RCCL 2.10.3 for ROCm 5.0.0

Added#
  • Compatibility with NCCL 2.10.3

Known Issues#
  • Managed memory is not currently supported for clique-based kernels

rocALUTION 2.0.1#

rocALUTION 2.0.1 for ROCm 5.0.0

Changed#
  • Removed the deprecated GlobalPairwiseAMG class; please use PairwiseAMG instead.

  • Changed to C++ 14 Standard

Improved#
  • Added sanitizer option

  • Improved documentation

rocBLAS 2.42.0#

rocBLAS 2.42.0 for ROCm 5.0.0

Added#
  • Added rocblas_get_version_string_size convenience function

  • Added rocblas_xtrmm_outofplace, an out-of-place version of rocblas_xtrmm

  • Added hpl and trig initialization for gemm_ex to rocblas-bench

  • Added source code gemm. It can be used as an alternative to Tensile for debugging and development

  • Added option ROCM_MATHLIBS_API_USE_HIP_COMPLEX to opt-in to use hipFloatComplex and hipDoubleComplex

Optimizations#
  • Improved performance of non-batched and batched single-precision GER for size m > 1024. Performance enhanced by 5-10% measured on a MI100 (gfx908) GPU.

  • Improved performance of non-batched and batched HER for all sizes and data types. Performance enhanced by 2-17% measured on a MI100 (gfx908) GPU.

Changed#
  • Instantiate templated rocBLAS functions to reduce size of librocblas.so

  • Removed static library dependency on msgpack

  • Removed boost dependencies for clients

Fixed#
  • Option to install script to build only rocBLAS clients with a pre-built rocBLAS library

  • Correctly set output of nrm2_batched_ex and nrm2_strided_batched_ex when given bad input

  • Fix for dgmm with side == rocblas_side_left and a negative incx

  • Fixed out-of-bounds read for small trsm

  • Fixed numerical checking for tbmv_strided_batched

rocFFT 1.0.13#

rocFFT 1.0.13 for ROCm 5.0.0

Optimizations#
  • Improved many plans by removing unnecessary transpose steps.

  • Optimized scheme selection for 3D problems.

    • Imposed less restrictions on 3D_BLOCK_RC selection. More problems can use 3D_BLOCK_RC and have some performance gain.

    • Enabled 3D_RC. Some 3D problems with SBCC-supported z-dim can use less kernels and get benefit.

    • Force --length 336 336 56 (dp) to use the faster 3D_RC scheme so it is not skipped by the conservative threshold test.

  • Optimized some even-length R2C/C2R cases by doing more operations in-place and combining pre/post processing into Stockham kernels.

  • Added radix-17.

Added#
  • Added new kernel generator for select fused-2D transforms.

Fixed#
  • Improved large 1D transform decompositions.

rocPRIM 2.10.12#

rocPRIM 2.10.12 for ROCm 5.0.0

Fixed#
  • Enable bfloat16 tests and reduce threshold for bfloat16

  • Fix device scan limit_size feature

  • Non-optimized builds no longer trigger local memory limit errors

Added#
  • Added scan size limit feature

  • Added reduce size limit feature

  • Added transform size limit feature

  • Add block_load_striped and block_store_striped

  • Add gather_to_blocked to gather values from other threads into a blocked arrangement

  • The block sizes for the device merge sort's initial block sort and its merge steps are now separate in its kernel config

    • The block sort step supports multiple items per thread

Changed#
  • size_limit for scan, reduce and transform can now be set in the config struct instead of a parameter

  • Device_scan and device_segmented_scan: inclusive_scan now uses the input-type as accumulator-type, exclusive_scan uses initial-value-type.

    • This particularly changes behaviour of small-size input types with large-size output types (e.g. short input, int output).

    • And low-res input with high-res output (e.g. float input, double output)

  • Reverted the old Fiji workaround, because the issue has been solved on the compiler side

  • Update README cmake minimum version number

  • Block sort supports multiple items per thread

    • Currently, only power-of-two block sizes and items per thread are supported, and only for full blocks

  • Bumped the minimum required version of CMake to 3.16

Known Issues#
  • Unit tests may soft hang on MI200 when running in hipMallocManaged mode.

  • device_segmented_radix_sort, device_scan unit tests failing for HIP on Windows

  • ReduceEmptyInput causes a random failure with bfloat16

rocRAND 2.10.12#

rocRAND 2.10.12 for ROCm 5.0.0

Changed#
  • No updates or changes for ROCm 5.0.0.

rocSOLVER 3.16.0#

rocSOLVER 3.16.0 for ROCm 5.0.0

Added#
  • Symmetric matrix factorizations:

    • LASYF

    • SYTF2, SYTRF (with batched and strided_batched versions)

  • Added rocsolver_get_version_string_size to help with version string queries

  • Added rocblas_layer_mode_ex and the ability to print kernel calls in the trace and profile logs

  • Expanded batched and strided_batched sample programs.

Optimized#
  • Improved general performance of LU factorization

  • Increased parallelism of specialized kernels when compiling from source, reducing build times on multi-core systems.

Changed#
  • The rocsolver-test client now prints the rocSOLVER version used to run the tests, rather than the version used to build them

  • The rocsolver-bench client now prints the rocSOLVER version used in the benchmark

Fixed#
  • Added missing stdint.h include to rocsolver.h

rocSPARSE 2.0.0#

rocSPARSE 2.0.0 for ROCm 5.0.0

Added#
  • csrmv, coomv, ellmv, hybmv for (conjugate) transposed matrices

  • csrmv for symmetric matrices

Changed#
  • spmm_ex is now deprecated and will be removed in the next major release

Improved#
  • Optimization for gtsv

rocThrust 2.13.0#

rocThrust 2.13.0 for ROCm 5.0.0

Added#
  • Updated to match upstream Thrust 1.13.0

  • Updated to match upstream Thrust 1.14.0

  • Added async scan

Changed#
  • Scan algorithms: inclusive_scan now uses the input-type as accumulator-type, exclusive_scan uses initial-value-type.

    • This particularly changes behaviour of small-size input types with large-size output types (e.g. short input, int output).

    • And low-res input with high-res output (e.g. float input, double output)

Tensile 4.31.0#

Tensile 4.31.0 for ROCm 5.0.0

Added#
  • DirectToLds support (x2/x4)

  • DirectToVgpr support for DGEMM

  • Parameter to control the number of files that kernels are merged into, to better parallelize kernel compilation

  • FP16 alternate implementation for HPA HGEMM on aldebaran

Optimized#
  • Add DGEMM NN custom kernel for HPL on aldebaran

Changed#
  • Update tensile_client executable to std=c++14

Removed#
  • Remove unused old Tensile client code

Fixed#
  • Fix hipErrorInvalidHandle during benchmarks

  • Fix addrVgpr for atomic GSU

  • Fix for Python 3.8: add case for Constant nodeType

  • Fix architecture mapping for gfx1011 and gfx1012

  • Fix PrintSolutionRejectionReason verbiage in KernelWriter.py

  • Fix vgpr alignment problem when enabling flat buffer load

GPU and OS Support (Linux)#

Supported Distributions#

AMD ROCm™ Platform supports the following Linux distributions.

Distribution          Processor Architectures    Validated Kernel
CentOS 8.3            x86-64                     4.18
CentOS 7.9            x86-64                     3.10
RHEL 8.5, 8.4         x86-64                     4.18
RHEL 7.9              x86-64                     3.10
SLES 15 SP3           x86-64                     5.3.18
Ubuntu 20.04.3 LTS    x86-64                     5.8
Ubuntu 18.04.5 LTS    x86-64                     5.4.0

Virtualization Support#

ROCm supports virtualization for select GPUs only as shown below.

Hypervisor    Version    GPU      Validated Guest OS (validated kernel)
VMWare        ESXi 8     MI250    Ubuntu 20.04 (5.15.0-56-generic)
VMWare        ESXi 8     MI210    Ubuntu 20.04 (5.15.0-56-generic), SLES 15 SP4 (5.14.21-150400.24.18-default)
VMWare        ESXi 7     MI210    Ubuntu 20.04 (5.15.0-56-generic), SLES 15 SP4 (5.14.21-150400.24.18-default)

GPU Support Table#

Use Driver Shipped with ROCm

Product Name             Architecture    LLVM Target    Support
AMD Instinct™ MI250X     CDNA2           gfx90a
AMD Instinct™ MI250      CDNA2           gfx90a
AMD Instinct™ MI210      CDNA2           gfx90a
AMD Instinct™ MI100      CDNA            gfx908
AMD Instinct™ MI50       GCN5.1          gfx906
AMD Instinct™ MI25       GCN5.0          gfx900

Use Radeon Pro Driver

Name                     Architecture    LLVM Target    Support
AMD Radeon™ Pro W6800    RDNA2           gfx1030
AMD Radeon™ Pro V620     RDNA2           gfx1030
AMD Radeon™ Pro VII      GCN5.1          gfx906

Use Radeon Pro Driver

Name                     Architecture    LLVM Target    Support
AMD Radeon™ VII          GCN5.1          gfx906

Support Status#

  • ✅: Supported - AMD enables these GPUs in our software distributions for the corresponding ROCm product.

  • ⚠️: Deprecated - Support will be removed in a future release.

  • ❌: Unsupported - This configuration is not enabled in our software distributions.

CPU Support#

ROCm requires CPUs that support PCIe™ Atomics. Modern CPUs after the release of 1st generation AMD Zen CPU and Intel™ Haswell support PCIe Atomics.

Compatibility#

User space & Kernel Fusion Driver

Forward and backward compatibility of ROCm user space components and the kernel space Kernel Fusion Driver (KFD).

Docker Image Support

ROCm releases several Docker container images.

3rd Party Support

Several 3rd party libraries ship with ROCm enablement as well as several ROCm components provide interfaces compatible with 3rd party solutions.

User/Kernel-Space Support Matrix#

ROCm™ provides forward and backward compatibility between the Kernel Fusion Driver (KFD) and its user space software for +/- 2 releases. This table shows the compatibility combinations that are currently supported.

KFD      Tested user space versions
5.0.2    5.1.0, 5.2.0
5.1.0    5.0.2
5.1.3    5.2.0, 5.3.0
5.2.0    5.0.2, 5.1.3

Docker Image Support Matrix#

The software support matrices for ROCm container releases are listed below.

ROCm 5.6#

PyTorch#

  • Ubuntu + rocm5.6_internal_testing +169530b

  • CentOS7 + rocm5.6_internal_testing +169530b

  • 1.13 +bfeb431

  • 1.12 +05d5d04

TensorFlow#

  • tensorflow_develop-upstream-QA-rocm56 +c88a9f4

  • r2.11-rocm-enhanced +5be4141

  • r2.10-rocm-enhanced +72789a3

3rd Party Support Matrix#

ROCm™ supports various 3rd party libraries and frameworks. Supported versions are tested and known to work. Non-supported versions of 3rd parties may also work, but aren’t tested.

Deep Learning#

ROCm releases support the most recent and two prior releases of PyTorch and TensorFlow.

ROCm     PyTorch                       TensorFlow              MAGMA
5.0.2    1.8, 1.9, 1.10                2.6, 2.7, 2.8
5.1.3    1.9, 1.10, 1.11               2.7, 2.8, 2.9
5.2.x    1.10, 1.11, 1.12              2.8, 2.9, 2.9
5.3.x    1.10.1, 1.11, 1.12.1, 1.13    2.8, 2.9, 2.10
5.4.x    1.10.1, 1.11, 1.12.1, 1.13    2.8, 2.9, 2.10, 2.11    2.5.4

Communication libraries#

ROCm supports OpenUCX, “an open-source, production-grade communication framework for data-centric and high-performance applications”.

UCX version    ROCm 5.4 and older    ROCm 5.5 and newer
-1.14.0        COMPATIBLE            INCOMPATIBLE
1.14.1+        COMPATIBLE            COMPATIBLE

Algorithm libraries#

ROCm releases provide algorithm libraries with interfaces compatible with contemporary CUDA / NVIDIA HPC SDK alternatives.

  • Thrust → rocThrust

  • CUB → hipCUB

ROCm     Thrust / CUB    HPC SDK
5.0.2    1.14            21.9
5.1.3    1.15            22.1
5.2.x    1.15            22.2, 22.3

For the latest documentation of these libraries, refer to the associated documentation.
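As a sketch of what this interface compatibility looks like in practice, a source file written against the Thrust API (for example, one using thrust::device_vector and thrust::sort) can typically be built unchanged against rocThrust with hipcc; the file name and GPU architecture below are placeholders:

hipcc --offload-arch=gfx90a thrust_sort.cpp -o thrust_sort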

Licensing Terms#

ROCm™ is released by Advanced Micro Devices, Inc. and is licensed per component separately. The following table is a list of ROCm components with links to their respective license terms. These components may include third party components subject to additional licenses. Please review individual repositories for more information. The table shows ROCm components, the name of license and link to the license terms. The table is ordered to follow ROCm’s manifest file.

Component                                     License
ROCK-Kernel-Driver                            GPL 2.0 WITH Linux-syscall-note
ROCT-Thunk-Interface                          MIT
ROCR-Runtime                                  The University of Illinois/NCSA
rocm_smi_lib                                  The University of Illinois/NCSA
rocm-cmake                                    MIT
rocminfo                                      The University of Illinois/NCSA
rocprofiler                                   MIT
roctracer                                     MIT
ROCm-OpenCL-Runtime                           MIT
ROCm-OpenCL-Runtime/api/opencl/khronos/icd    Apache 2.0
clang-ocl                                     MIT
HIP                                           MIT
hipamd                                        MIT
ROCclr                                        MIT
HIPIFY                                        MIT
HIPCC                                         MIT
llvm-project                                  Apache
rocm-llvm-alt                                 AMD Proprietary License
ROCm-Device-Libs                              The University of Illinois/NCSA
atmi                                          MIT
ROCm-CompilerSupport                          The University of Illinois/NCSA
rocr_debug_agent                              The University of Illinois/NCSA
rocm_bandwidth_test                           The University of Illinois/NCSA
half                                          MIT
RCP                                           MIT
ROCgdb                                        GNU General Public License v2.0
ROCdbgapi                                     MIT
rdc                                           MIT
rocBLAS                                       MIT
Tensile                                       MIT
hipBLAS                                       MIT
rocFFT                                        MIT
hipFFT                                        MIT
rocRAND                                       MIT
rocSPARSE                                     MIT
rocSOLVER                                     BSD-2-Clause
hipSOLVER                                     MIT
hipSPARSE                                     MIT
rocALUTION                                    MIT
MIOpenGEMM                                    MIT
MIOpen                                        MIT
rccl                                          Custom
MIVisionX                                     MIT
rocThrust                                     Apache 2.0
hipCUB                                        Custom
rocPRIM                                       MIT
rocWMMA                                       MIT
hipfort                                       MIT
ROCmValidationSuite                           MIT
aomp                                          Apache 2.0
aomp-extras                                   MIT
flang                                         Apache 2.0

Open sourced ROCm components are released via public GitHub repositories, packages on https://repo.radeon.com and other distribution channels. Proprietary products are only available on https://repo.radeon.com. Currently, only one component of ROCm, rocm-llvm-alt is governed by a proprietary license. Proprietary components are organized in a proprietary subdirectory in the package repositories to distinguish from open sourced packages.

The additional terms and conditions below apply to your use of ROCm technical documentation.

©2023 Advanced Micro Devices, Inc. All rights reserved.

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions, and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. Any computer system has risks of security vulnerabilities that cannot be completely prevented or mitigated. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

THIS INFORMATION IS PROVIDED “AS IS.” AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS, OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY RELIANCE, DIRECT, INDIRECT, SPECIAL, OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

AMD, the AMD Arrow logo, ROCm, and combinations thereof are trademarks of Advanced Micro Devices, Inc. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.

Package Licensing#

Attention

AQL Profiler and AOCC CPU optimization are both provided in binary form, each subject to the license agreement enclosed in the directory for the binary and is available here: /opt/rocm/share/doc/rocm-llvm-alt/EULA. By using, installing, copying or distributing AQL Profiler and/or AOCC CPU Optimizations, you agree to the terms and conditions of this license agreement. If you do not agree to the terms of this agreement, do not install, copy or use the AQL Profiler and/or the AOCC CPU Optimizations.

For the rest of the ROCm packages, you can find the licensing information at the following location: /opt/rocm/share/doc/<component-name>/

For example, you can fetch the licensing information of the amd_comgr component (Code Object Manager) from the amd_comgr folder. A file named LICENSE.txt contains the license details at: /opt/rocm-5.0.2/share/doc/amd_comgr/LICENSE.txt
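A minimal sketch of inspecting these files from a shell (the version-specific path is taken from the example above; adjust it to match your installation):

# View the license for the amd_comgr component
cat /opt/rocm-5.0.2/share/doc/amd_comgr/LICENSE.txt
# List the license files shipped with all installed components
ls /opt/rocm*/share/doc/*/LICENSE*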

All Reference Material#

ROCm Software Groups#

HIP is both AMD’s GPU programming language extension and the GPU runtime.

HIP Math Libraries support the following domains:

ROCm template libraries for C++ primitives and algorithms are as follows:

Inter and intra-node communication is supported by the following projects:

Libraries related to AI.

Computer vision related projects.

Compilers and Tools#

ROCmCC

ROCmCC is a Clang/LLVM-based compiler. It is optimized for high-performance computing on AMD GPUs and CPUs and supports various heterogeneous programming models such as HIP, OpenMP, and OpenCL.

ROCgdb

This is ROCgdb, the ROCm source-level debugger for Linux, based on GDB, the GNU source-level debugger.

ROCProfiler

The ROC Profiler library enables profiling with performance counters and derived metrics, and supports GFX8/GFX9. It is a hardware-specific, low-level performance analysis interface for profiling GPU compute applications, including hardware performance counters with complex performance metrics.

ROCTracer

Callback/activity library for performance tracing of AMD GPUs.

ROCdbgapi

The AMD Debugger API is a library that provides all the support necessary for a debugger and other tools to perform low level control of the execution and inspection of execution state of AMD’s commercially available GPU architectures.

See Also#

Compiler Reference Guide#

Introduction to Compiler Reference Guide#

ROCmCC is a Clang/LLVM-based compiler. It is optimized for high-performance computing on AMD GPUs and CPUs and supports various heterogeneous programming models such as HIP, OpenMP, and OpenCL.

ROCmCC is made available via two packages: rocm-llvm and rocm-llvm-alt. The differences are listed in the table below.

Differences between rocm-llvm and rocm-llvm-alt#

rocm-llvm:

  • Installed by default when ROCm™ itself is installed

  • Provides an open-source compiler

rocm-llvm-alt:

  • An optional package

  • Provides an additional closed-source compiler for users interested in additional CPU optimizations not available in rocm-llvm

For more details, see:

ROCm Compiler Interfaces#

ROCm currently provides two compiler interfaces for compiling HIP programs:

  • /opt/rocm/bin/hipcc

  • /opt/rocm/bin/amdclang++

Both leverage the same LLVM compiler technology with the AMD GCN GPU support; however, they offer a slightly different user experience. The hipcc command-line interface aims to provide a more familiar user interface to users who are experienced in CUDA but relatively new to the ROCm/HIP development environment. On the other hand, amdclang++ provides a user interface identical to the clang++ compiler. It is more suitable for experienced developers who want to directly interact with the clang compiler and gain full control of their application’s build process.

The major differences between hipcc and amdclang++ are listed below:

Differences between hipcc and amdclang++#

Compiling HIP source files

  • hipcc: Treats all source files as HIP language source files.

  • amdclang++: Enables the HIP language support for files with the .hip extension or through the -x hip compiler option.

Detecting GPU architecture

  • hipcc: Auto-detects the GPUs available on the system and generates code for those devices when no GPU architecture is specified.

  • amdclang++: Has AMD GCN gfx803 as the default GPU architecture. The --offload-arch compiler option may be used to target other GPU architectures.

Finding a HIP installation

  • hipcc: Finds the HIP installation based on its own location and its knowledge about the ROCm directory structure.

  • amdclang++: First looks for HIP under the same parent directory as its own LLVM directory and then falls back on /opt/rocm. Users can use the --rocm-path option to instruct the compiler to use HIP from the specified ROCm installation.

Linking to the HIP runtime library

  • hipcc: Is configured to automatically link to the HIP runtime from the detected HIP installation.

  • amdclang++: Requires the --hip-link flag to be specified to link to the HIP runtime. Alternatively, users can use the -l<dir> -lamdhip64 option to link to a HIP runtime library.

Device function inlining

  • hipcc: Inlines all GPU device functions, which provides greater performance and compatibility for codes that contain file scoped or device function scoped __shared__ variables. However, it may increase compile time.

  • amdclang++: Relies on inlining heuristics to control inlining. Users experiencing performance or compilation issues with code using file scoped or device function scoped __shared__ variables could try -mllvm -amdgpu-early-inline-all=true -mllvm -amdgpu-function-calls=false to work around the issue. There are plans to address these issues with future compiler improvements.

Source code location

  • hipcc: ROCm-Developer-Tools/HIPCC

  • amdclang++: RadeonOpenCompute/llvm-project
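For illustration, the following commands compile a hypothetical HIP source file square.hip with each interface; the GPU architecture is only an example:

# hipcc: sources are treated as HIP and the HIP runtime is linked automatically
hipcc -O3 square.hip -o square

# amdclang++: HIP language support and runtime linking are requested explicitly
amdclang++ -x hip --offload-arch=gfx90a --hip-link -O3 square.hip -o square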

Compiler Options and Features#

This chapter discusses compiler options and features.

AMD GPU Compilation#

This section outlines commonly used compiler flags for hipcc and amdclang++.

-x hip#

Compiles the source file as a HIP program.

-fopenmp#

Enables the OpenMP support.

-fopenmp-targets=<gpu>#

Enables the OpenMP target offload support of the specified GPU architecture.

Gpu:

The GPU architecture. E.g. gfx908.

--gpu-max-threads-per-block=<value>:#

Sets the default limit of threads per block. Also referred to as the launch bounds.

Value:

The default maximum amount of threads per block.

-munsafe-fp-atomics#

Enables unsafe floating point atomic instructions (AMDGPU only).

-ffast-math#

Allows aggressive, lossy floating-point optimizations.

-mwavefrontsize64, -mno-wavefrontsize64#

Sets wavefront size to be 64 or 32 on RDNA architectures.

-mcumode#

Switches between CU and WGP modes on RDNA architectures.

--offload-arch=<gpu>#

HIP offloading target ID. May be specified more than once.

Gpu:

The device architecture followed by target ID features delimited by a colon. Each target ID feature is a predefined string followed by a plus or minus sign (e.g. gfx908:xnack+:sramecc-).

-g#

Generates source-level debug information.

-fgpu-rdc, -fno-gpu-rdc#

Generates relocatable device code, also known as separate compilation mode.
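As a sketch that combines several of the flags above (the source file name and GPU architecture are illustrative):

# Compile a HIP source with debug info, targeting gfx90a with XNACK enabled,
# and allow unsafe floating point atomics on the GPU
hipcc --offload-arch=gfx90a:xnack+ -munsafe-fp-atomics -g saxpy.cpp -o saxpy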

AMD Optimizations for Zen Architectures#

The CPU compiler optimizations described in this chapter originate from the AMD Optimizing C/C++ Compiler (AOCC). They are available in ROCmCC if the optional rocm-llvm-alt package is installed. The user’s interaction with the compiler does not change once rocm-llvm-alt is installed: the same compiler entry points are used. AMD provides high-performance compiler optimizations for Zen-based processors in AOCC.

For more information, refer to https://www.amd.com/en/developer/aocc.html.

-famd-opt#

Enables a default set of AMD proprietary optimizations for the AMD Zen CPU architectures.

-fno-amd-opt disables the AMD proprietary optimizations.

The -famd-opt flag is useful when a user wants to build with the proprietary optimization compiler and not have to depend on setting any of the other proprietary optimization flags.

Note

-famd-opt can be used in addition to the other proprietary CPU optimization flags. The table of optimizations below implicitly enables the invocation of the AMD proprietary optimizations compiler, whereas the -famd-opt flag requires this to be handled explicitly.
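A minimal sketch of enabling the proprietary optimizations explicitly (this assumes rocm-llvm-alt is installed; the source file name is a placeholder):

# Build with link-time optimization and the default set of AMD proprietary Zen optimizations
clang -O3 -flto -famd-opt foo.c -o foo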

-fstruct-layout=[1,2,3,4,5,6,7]#

Analyzes the whole program to determine if the structures in the code can be peeled and the pointer or integer fields in the structure can be compressed. If feasible, this optimization transforms the code to enable these improvements. This transformation is likely to improve cache utilization and memory bandwidth. It is expected to improve the scalability of programs executed on multiple cores.

This is effective only under -flto, as the whole program analysis is required to perform this optimization. Users can choose different levels of aggressiveness with which this optimization can be applied to the application, with 1 being the least aggressive and 7 being the most aggressive level.

-fstruct-layout Values and Their Effects#

-fstruct-layout value    Structure peeling    Pointer size (1)    Field types (2)    Safety check (3)
1                        Enabled              NA                  NA                 NA
2                        Enabled              32-bit              NA                 NA
3                        Enabled              16-bit              NA                 NA
4                        Enabled              32-bit              Integer            Yes
5                        Enabled              16-bit              Integer            Yes
6                        Enabled              32-bit              64-bit int (4)     No (5)
7                        Enabled              16-bit              64-bit int (4)     No (5)

(1) Pointer size after selective compression of self-referential pointers in structures, wherever safe.
(2) Type of structure fields eligible for compression.
(3) Whether compression is performed under a safety check.
(4) 64-bit signed int or unsigned int. Users must ensure that the values assigned to 64-bit signed int fields are in range -(2^31 - 1) to +(2^31 - 1) and 64-bit unsigned int fields are in the range 0 to +(2^31 - 1). Otherwise, you may obtain incorrect results.
(5) No. Users must ensure the safety based on the program compiled.

-fitodcalls#

Promotes indirect-to-direct calls by placing conditional calls. Applications or benchmarks that have a small and deterministic set of target functions for function pointers passed as call parameters benefit from this optimization. Indirect-to-direct call promotion transforms the code to use all possible determined targets under runtime checks and falls back to the original code for all the other cases. Runtime checks are introduced by the compiler for each of these possible function pointer targets, followed by direct calls to the targets.

This is a link time optimization, which is invoked as -flto -fitodcalls

-fitodcallsbyclone#

Performs value specialization for functions with function pointers passed as an argument. It does this specialization by generating a clone of the function. The cloning of the function happens in the call chain as needed, to allow conversion of indirect function call to direct call.

This complements -fitodcalls optimization and is also a link time optimization, which is invoked as -flto -fitodcallsbyclone.

-fremap-arrays#

Transforms the data layout of a single dimensional array to provide better cache locality. This optimization is effective only under -flto, as the whole program needs to be analyzed to perform this optimization, which can be invoked as -flto -fremap-arrays.

-finline-aggressive#

Enables improved inlining capability through better heuristics. This optimization is more effective when used with -flto, as the whole program analysis is required to perform this optimization, which can be invoked as -flto -finline-aggressive.

-fnt-store (non-temporal store)#

Generates a non-temporal store instruction for array accesses in a loop with a large trip count.

-fnt-store=aggressive#

This is an experimental option to generate non-temporal store instruction for array accesses in a loop, whose iteration count cannot be determined at compile time. In this case, the compiler assumes the iteration count to be huge.

Optimizations Through Driver -mllvm <options>#

The following optimization options must be invoked through driver -mllvm <options>:

-enable-partial-unswitch#

Enables partial loop unswitching, which is an enhancement to the existing loop unswitching optimization in LLVM. Partial loop unswitching hoists a condition inside a loop from a path for which the execution condition remains invariant, whereas the original loop unswitching works for a condition that is completely loop invariant. The condition inside the loop gets hoisted out from the invariant path, and the original loop is retained for the path where the condition is variant.
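A sketch of how such an option is passed through the driver (the source file name is a placeholder):

# Forward the optimization option to LLVM via the driver's -mllvm mechanism
clang -O3 -mllvm -enable-partial-unswitch foo.c -o foo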

-aggressive-loop-unswitch#

Experimental option that enables aggressive loop unswitching heuristic (including -enable-partial-unswitch) based on the usage of the branch conditional values. Loop unswitching leads to code bloat. Code bloat can be minimized if the hoisted condition is executed more often. This heuristic prioritizes the conditions based on the number of times they are used within the loop. The heuristic can be controlled with the following options:

  • -unswitch-identical-branches-min-count=<n>

    • Enables unswitching of a loop with respect to a branch conditional value (B), where B appears in at least <n> compares in the loop. This option is enabled with -aggressive-loop-unswitch. The default value is 3.

    Usage: -mllvm -aggressive-loop-unswitch -mllvm -unswitch-identical-branches-min-count=<n>

    where n is a positive integer and a lower value of <n> facilitates more unswitching.

  • -unswitch-identical-branches-max-count=<n>

    • Enables unswitching of a loop with respect to a branch conditional value (B), where B appears in at most <n> compares in the loop. This option is enabled with -aggressive-loop-unswitch. The default value is 6.

    Usage: -mllvm -aggressive-loop-unswitch -mllvm -unswitch-identical-branches-max-count=<n>

    where n is a positive integer and a higher value of <n> facilitates more unswitching.

    Note

    These options may facilitate more unswitching under some workloads. Since loop-unswitching inherently leads to code bloat, facilitating more unswitching may significantly increase the code size. Hence, it may also lead to longer compilation times.

-enable-strided-vectorization#

Enables strided memory vectorization as an enhancement to the interleaved vectorization framework present in LLVM. It enables the effective use of gather and scatter kind of instruction patterns. This flag must be used along with the interleave vectorization flag.

-enable-epilog-vectorization#

Enables vectorization of epilog-iterations as an enhancement to existing vectorization framework. This enables generation of an additional epilog vector loop version for the remainder iterations of the original vector loop. The vector size or factor of the original loop should be large enough to allow an effective epilog vectorization of the remaining iterations. This optimization takes place only when the original vector loop is vectorized with a vector width or factor of 16. This vectorization width of 16 may be overwritten by -min-width-epilog-vectorization command-line option.

-enable-redundant-movs#

Removes any redundant mov operations including redundant loads from memory and stores to memory. This can be invoked using -Wl,-plugin-opt=-enable-redundant-movs.

-merge-constant#

Attempts to promote frequently occurring constants to registers. The aim is to reduce the size of the instruction encoding for instructions using constants and obtain a performance improvement.

-function-specialize#

Optimizes the functions with compile time constant formal arguments.

-lv-function-specialization#

Generates specialized function versions when the loops inside function are vectorizable and the arguments are not aliased with each other.

-enable-vectorize-compares#

Enables vectorization on certain loops with conditional breaks assuming the memory accesses are safely bound within the page boundary.

-inline-recursion=[1,2,3,4]#

Enables inlining for recursive functions based on heuristics where the aggressiveness of heuristics increases with the level (1-4). The default level is 2. Higher levels may lead to code bloat due to expansion of recursive functions at call sites.

-inline-recursion Level and Their Effects#

-inline-recursion value    Inline depth of heuristics used to enable inlining for recursive functions
1                          1
2                          1
3                          1
4                          10

This is more effective with -flto as the whole program needs to be analyzed to perform this optimization, which can be invoked as -flto -inline-recursion=[1,2,3,4].

-reduce-array-computations=[1,2,3]#

Performs array data flow analysis and optimizes the unused array computations.

-reduce-array-computations Values and Their Effects#

-reduce-array-computations value    Array elements eligible for elimination of computations
1                                   Unused
2                                   Zero valued
3                                   Both unused and zero valued

This optimization is effective with -flto as the whole program needs to be analyzed to perform this optimization, which can be invoked as -flto -reduce-array-computations=[1,2,3].

-global-vectorize-slp={true,false}#

Vectorizes the straight-line code inside a basic block with data reordering vector operations. This option is set to true by default.

-region-vectorize#

Experimental flag for enabling vectorization on certain loops with complex control flow, which the normal vectorizer cannot handle.

This optimization is effective with -flto as the whole program needs to be analyzed to perform this optimization, which can be invoked as -flto -region-vectorize.

-enable-x86-prefetching#

Enables the generation of x86 prefetch instruction for the memory references inside a loop or inside an innermost loop of a loop nest to prefetch the second dimension of multidimensional array/memory references in the innermost loop of a loop nest. This is an experimental pass; its profitability is being improved.

-suppress-fmas#

Identifies the reduction patterns on FMA and suppresses the FMA generation, as it is not profitable on the reduction patterns.

-enable-icm-vrp#

Enables estimation of the virtual register pressure before performing loop invariant code motion. This estimation is used to control the number of loop invariants that will be hoisted during the loop invariant code motion.

-loop-splitting#

Enables splitting of loops into multiple loops to eliminate the branches, which compare the loop induction with an invariant or constant expression. This option is enabled under -O3 by default. To disable this optimization, use -loop-splitting=false.

-enable-ipo-loop-split#

Enables splitting of loops into multiple loops to eliminate the branches, which compares the loop induction with a constant expression. This constant expression can be derived through inter-procedural analysis. This option is enabled under -O3 by default. To disable this optimization, use -enable-ipo-loop-split=false.

-compute-interchange-order#

Enables heuristic for finding the best possible interchange order for a loop nest. To enable this option, use -enable-loopinterchange. This option is set to false by default.

Usage:

-mllvm -enable-loopinterchange -mllvm -compute-interchange-order
-convert-pow-exp-to-int={true,false}#

Converts the call to floating point exponent version of pow to its integer exponent version if the floating-point exponent can be converted to integer. This option is set to true by default.

-do-lock-reordering={none,normal,aggressive}#

Reorders the control predicates in increasing order of complexity from outer predicate to inner when it is safe. The normal mode reorders simple expressions, while the aggressive mode reorders predicates involving function calls if no side effects are determined. This option is set to normal by default.

-fuse-tile-inner-loop#

Enables fusion of adjacent tiled loops as a part of loop tiling transformation. This option is set to false by default.

-Hz,1,0x1 [Fortran]#

Helps to preserve array index information for array access expressions which get linearized in the compiler front end. The preserved information is used by the compiler optimization phase in performing optimizations such as loop transformations. It is recommended that any user who is using optimizations such as loop transformations and other optimizations requiring de-linearized index expressions should use the Hz option. This option has no impact on any other aspects of the Flang front end.

Inline ASM Statements#

Inline assembly (ASM) statements allow a developer to include assembly instructions directly in either host or device code. While the ROCm compiler supports ASM statements, their use is not recommended for the following reasons:

  • The compiler’s ability to produce both correct code and to optimize surrounding code is impeded.

  • The compiler does not parse the content of the ASM statements and so cannot “see” its contents.

  • The compiler must make conservative assumptions in an effort to retain correctness.

  • The conservative assumptions may yield code that, on the whole, is less performant compared to code without ASM statements. It is possible that a syntactically correct ASM statement may cause incorrect runtime behavior.

  • ASM statements are often ASIC-specific; code containing them is less portable and adds a maintenance burden to the developer if different ASICs are targeted.

  • Writing correct ASM statements is often difficult; we strongly recommend thorough testing of any use of ASM statements.

Note

For developers who choose to include ASM statements in the code, AMD is interested in understanding the use case and appreciates feedback at RadeonOpenCompute/ROCm#issues

Miscellaneous OpenMP Compiler Features#

This section discusses features that have been added or enhanced in the OpenMP compiler.

Offload-arch Tool#

An LLVM library and tool that is used to query the execution capability of the current system as well as to query requirements of a binary file. It is used by OpenMP device runtime to ensure compatibility of an image with the current system while loading it. It is compatible with target ID support and multi-image fat binary support.

Usage:

offload-arch [Options] [Optional lookup-value]

When used without an option, offload-arch prints the value of the first offload arch found in the underlying system. This can be used by various clang front ends. For example, to compile for OpenMP offloading on your current system, invoke clang with the following command:

clang -fopenmp -fopenmp-targets=`offload-arch` foo.c

If an optional lookup-value is specified, offload-arch will check if the value is either a valid offload-arch or a codename and look up requested additional information.

The following command provides all the information for offload-arch gfx906:

offload-arch gfx906 -v

The options are listed below:

-a#

Prints values for all devices. Do not stop at the first device found.

-m#

Prints device code name (often found in pci.ids file).

-n#

Prints numeric pci-id.

-t#

Prints clang offload triple to use for the offload arch.

-v#

Verbose. Implies -a -m -n -t: for all devices, prints the codename, numeric value, and triple.

-f <file>#

Prints offload requirements including offload-arch for each compiled offload image built into an application binary file.

-c#

Prints offload capabilities of the underlying system. This option is used by the language runtime to select an image when multiple images are available. A capability must exist for each requirement of the selected image.

There are symbolic link aliases, amdgpu-offload-arch and nvidia-arch, for offload-arch. These aliases return 1 if no AMD GCN GPU or CUDA GPU, respectively, is found. They are useful for determining whether architecture-specific tests should be run, or for conditionally loading architecture-specific software, as in the sketch below.
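For example, a test script might use the exit status of the amdgpu-offload-arch alias to decide whether to run AMD GPU-specific tests. This is a minimal sketch; the script name is a placeholder.

# Exit status is non-zero when no AMD GCN GPU is found.
if amdgpu-offload-arch > /dev/null 2>&1; then
    ./run_amdgpu_tests.sh
else
    echo "No AMD GCN GPU detected; skipping GPU-specific tests."
fi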

Command-Line Simplification Using offload-arch Flag#

The legacy mechanism of specifying an offloading target for OpenMP involves using three flags: -fopenmp-targets, -Xopenmp-target, and -march. The first two flags take a target triple (such as amdgcn-amd-amdhsa or nvptx64-nvidia-cuda), while the last flag takes a device name (such as gfx908 or sm_70). Alternatively, users of the ROCmCC compiler can use the --offload-arch flag for the combined effect of the above three flags.

Example:

# Legacy mechanism
clang -fopenmp -target x86_64-linux-gnu \
-fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa \
-march=gfx906 helloworld.c -o helloworld

Example:

# Using offload-arch flag
clang -fopenmp -target x86_64-linux-gnu \
--offload-arch=gfx906 helloworld.c -o helloworld

To ensure backward compatibility, both styles are supported. This option is compatible with target ID support and multi-image fat binaries.

Target ID Support for OpenMP#

The ROCmCC compiler supports specifying target features along with the GPU name when specifying a target offload device on the command line, using the -march or --offload-arch options. In such cases, the compiled image is specialized for the given configuration of device and target features (the target ID).

Example:

# compiling for a gfx908 device with XNACK paging support turned ON
clang -fopenmp -target x86_64-linux-gnu \
-fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa \
-march=gfx908:xnack+ helloworld.c -o helloworld

Example:

# compiling for a gfx908 device with SRAMECC support turned OFF
clang -fopenmp -target x86_64-linux-gnu \
-fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa \
-march=gfx908:sramecc- helloworld.c -o helloworld

Example:

# compiling for a gfx908 device with SRAMECC support turned ON and XNACK paging support turned OFF
clang -fopenmp -target x86_64-linux-gnu \
-fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa \
-march=gfx908:sramecc+:xnack- helloworld.c -o helloworld

The target ID specified on the command line is passed to the clang driver using the target-feature flag, to the LLVM optimizer and back end using the -mattr flag, and to the linker using the -plugin-opt=-mattr flag. This feature is compatible with the --offload-arch command-line option and with multi-image binaries for multiple architectures.

Multi-image Fat Binary for OpenMP#

The ROCmCC compiler is enhanced to generate binaries that can contain heterogeneous images. This heterogeneity could be in terms of:

  • Images of different architectures, like AMD GCN and NVPTX

  • Images of the same architecture but for different GPUs, like gfx906 and gfx908

  • Images of the same architecture and GPU but with different target features, like gfx908:xnack+ and gfx908:xnack-

An appropriate image is selected by the OpenMP device runtime for execution depending on the capability of the current system. This feature is compatible with target ID support and the --offload-arch command-line option, and it uses the offload-arch tool to determine the capability of the current system.

Example:

clang -fopenmp -target x86_64-linux-gnu \
-fopenmp-targets=amdgcn-amd-amdhsa,amdgcn-amd-amdhsa \
-Xopenmp-target=amdgcn-amd-amdhsa -march=gfx906 \
-Xopenmp-target=amdgcn-amd-amdhsa -march=gfx908 \
helloworld.c -o helloworld

Example:

clang -fopenmp -target x86_64-linux-gnu \
--offload-arch=gfx906 \
--offload-arch=gfx908 \
helloworld.c -o helloworld

Example:

clang -fopenmp -target x86_64-linux-gnu \
-fopenmp-targets=amdgcn-amd-amdhsa,amdgcn-amd-amdhsa,amdgcn-amd-amdhsa,amdgcn-amd-amdhsa \
-Xopenmp-target=amdgcn-amd-amdhsa -march=gfx908:sramecc+:xnack+ \
-Xopenmp-target=amdgcn-amd-amdhsa -march=gfx908:sramecc-:xnack+ \
-Xopenmp-target=amdgcn-amd-amdhsa -march=gfx908:sramecc+:xnack- \
-Xopenmp-target=amdgcn-amd-amdhsa -march=gfx908:sramecc-:xnack- \
helloworld.c -o helloworld

The ROCmCC compiler creates an instance of the toolchain for each unique combination of target triple and target GPU (along with the associated target features). The clang-offload-wrapper tool is modified to insert a new structure, __tgt_image_info, along with each image in the binary. The device runtime is also modified to query this structure to identify a compatible image based on the capability of the current system.
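The images embedded in the resulting fat binary can be listed with the -f option of the offload-arch tool described above. For example, for the helloworld binary built in the examples:

offload-arch -f helloworld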

Unified Shared Memory (USM)#

The following OpenMP pragma is available on MI200, and the program must be run with XNACK enabled (xnack+).

omp requires unified_shared_memory
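A minimal sketch of a C program using the directive is shown below; the source file name and the offload architecture in the compile command are illustrative.

// usm_example.c
// Compile, for example, with:
//   clang -fopenmp --offload-arch=gfx90a:xnack+ usm_example.c -o usm_example
#include <stdio.h>

// Request unified shared memory for this compilation unit.
#pragma omp requires unified_shared_memory

int main(void) {
  int data[4] = {0, 0, 0, 0};

  // With unified shared memory, the host array is accessible in the
  // target region without explicit map clauses.
  #pragma omp target
  for (int i = 0; i < 4; ++i)
    data[i] = i * i;

  printf("%d %d %d %d\n", data[0], data[1], data[2], data[3]);
  return 0;
}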

For more details on USM, refer to the Unified Shared Memory section of the OpenMP Guide.

Support Status of Other Clang Options#

The following table lists the other Clang options and their support status.

Clang Options#

Option

Support Status

Description

-###

Supported

Prints (but does not run) the commands to run for this compilation

--analyzer-output <value>

Supported

Static analyzer report output format (html|plist|plist-multi-file|plist-html|sarif|text)

--analyze

Supported

Runs the static analyzer

-arcmt-migrate-emit-errors

Unsupported

Emits ARC errors even if the migrator can fix them

-arcmt-migrate-report-output  <value>

Unsupported

Output path for the plist report

-byteswapio

Supported

Swaps byte-order for unformatted input/output

-B <dir>

Supported

Adds <dir> to search path for binaries and object files used implicitly

-CC

Supported

Includes comments from within the macros in the preprocessed output

-cl-denorms-are-zero

Supported

OpenCL only. Allows denormals to be flushed to zero

-cl-fast-relaxed-math

Supported

OpenCL only. Sets -cl-finite-math-only and -cl-unsafe-math-optimizations and defines __FAST_RELAXED_MATH__

-cl-finite-math-only

Supported

OpenCL only. Allows floating-point optimizations that assume arguments and results are not NaNs or +-Inf

-cl-fp32-correctly-rounded-divide-sqrt

Supported

OpenCL only. Specifies that single-precision floating-point divide and sqrt used in the program source are correctly rounded

-cl-kernel-arg-info

Supported

OpenCL only. Generates kernel argument metadata

-cl-mad-enable

Supported

OpenCL only. Allows use of less precise MAD computations in the generated binary

-cl-no-signed-zeros

Supported

OpenCL only. Allows use of less precise no-signed-zeros computations in the generated binary

-cl-opt-disable

Supported

OpenCL only. Disables all optimizations. By default, optimizations are enabled.

-cl-single-precision-constant

Supported

OpenCL only. Treats double-precision floating-point constant as single precision constant

-cl-std= <value>

Supported

OpenCL language standard to compile for

-cl-strict-aliasing

Supported

OpenCL only. This option is added for compatibility with OpenCL 1.0.

-cl-uniform-work-group-size

Supported

OpenCL only. Defines the global work-size to be a multiple of the work-group size specified for clEnqueueNDRangeKernel

-cl-unsafe-math-optimizations

Supported

OpenCL only. Allows unsafe floating-point optimizations. Also implies -cl-no-signed-zeros and -cl-mad-enable

--config <value>

Supported

Specifies configuration file

--cuda-compile-host-device

Supported

Compiles CUDA code for both host and device (default). Has no effect on non-CUDA compilations

--cuda-device-only

Supported

Compiles CUDA code for device only

--cuda-host-only

Supported

Compiles CUDA code for host only. Has no effect on non-CUDA compilations

--cuda-include-ptx=<value>

Unsupported

Includes PTX for the following GPU architecture (e.g. sm_35) or “all.” May be specified more than once

--cuda-noopt-device-debug

Unsupported

Enables device-side debug info generation. Disables ptxas optimizations

--cuda-path-ignore-env

Unsupported

Ignores environment variables to detect CUDA installation

--cuda-path=<value>

Unsupported

CUDA installation path

-cxx-isystem <directory>

Supported

Adds a directory to the C++ SYSTEM include search path

-C

Supported

Includes comments in the preprocessed output

-c

Supported

Runs only preprocess, compile, and assemble steps

-dD

Supported

Prints macro definitions in -E mode in addition to the normal output

-dependency-dot <value>

Supported

Writes DOT-formatted header dependencies to the specified filename

-dependency-file <value>

Supported

Writes dependency output to the specified filename (or -)

-dI

Supported

Prints include directives in -E mode in addition to the normal output

-dM

Supported

Prints macro definitions in -E mode instead of the normal output

-dsym-dir <dir>

Unsupported

Outputs dSYMs (if any) to the specified directory

-D <macro>

Supported

=<value>. Defines <macro> to <value> (or 1 if <value> omitted)

-emit-ast

Supported

Emits Clang AST files for source inputs

-emit-interface-stubs

Supported

Generates interface stub files

-emit-llvm

Supported

Uses the LLVM representation for assembler and object files

-emit-merged-ifs

Supported

Generates interface stub files and emits merged text not binary

--emit-static-lib

Supported

Enables linker job to emit a static library

-enable-trivial-auto-var-init-zero-knowing-it-will-be-removed-from-clang

Supported

Declares enabling trivial automatic variable initialization to zero for benchmarking purpose with the knowledge that it will eventually be removed

-E

Supported

Runs the preprocessor only

-fAAPCSBitfieldLoad

Unsupported

Follows the AAPCS standard where all volatile bit-field writes generate at least one load (ARM only)

-faddrsig

Supported

Emits an address-significance table

-faligned-allocation

Supported

Enables C++17 aligned allocation functions

-fallow-editor-placeholders

Supported

Treats editor placeholders as valid source code

-fallow-fortran-gnu-ext

Supported

Allows Fortran GNU extensions

-fansi-escape-codes

Supported

Uses ANSI escape codes for diagnostics

-fapple-kext

Unsupported

Uses Apple’s kernel extensions ABI

-fapple-link-rtlib

Unsupported

Forces linking of the clang built-ins runtime library

-fapple-pragma-pack

Unsupported

Enables Apple gcc-compatible #pragma pack handling

-fapplication-extension

Unsupported

Restricts code to those available for App Extensions

-fbackslash

Supported

Treats backslash as C-style escape character

-fbasic-block-sections= <value>

Supported

Places each function’s basic blocks in unique sections (ELF only): all | labels | none | list=<file>

-fblocks

Supported

Enables the ‘blocks’ language feature

-fborland-extensions

Unsupported

Accepts non-standard constructs supported by the Borland compile

-fbuild-session-file= <file>

Supported

Uses the last modification time of <file> as the build session timestamp

-fbuild-session-timestamp= <time since Epoch in seconds>

Supported

Specifies starting time of the current build session

-fbuiltin-module-map

Unsupported

Loads the Clang built-ins module map file

-fcall-saved-x10

Unsupported

Makes the x10 register call-saved (AArch64 only)

-fcall-saved-x11

Unsupported

Makes the x11 register call-saved (AArch64 only)

-fcall-saved-x12

Unsupported

Makes the x12 register call-saved (AArch64 only)

-fcall-saved-x13

Unsupported

Makes the x13 register call-saved (AArch64 only)

-fcall-saved-x14

Unsupported

Makes the x14 register call-saved (AArch64 only)

-fcall-saved-x15

Unsupported

Makes the x15 register call-saved (AArch64 only)

-fcall-saved-x18

Unsupported

Makes the x18 register call-saved (AArch64 only)

-fcall-saved-x8

Unsupported

Makes the x8 register call-saved (AArch64 only)

-fcall-saved-x9

Unsupported

Makes the x9 register call-saved (AArch64 only)

-fcf-protection= <value>

Unsupported

Specifies the instrument control-flow architecture protection using options: return, branch, full, none

-fcf-protection

Unsupported

Enables cf-protection in ‘full’ mode

-fchar8_t

Supported

Enables C++ built-in type char8_t

-fclang-abi-compat= <version>

Supported

Attempts to match the ABI of Clang <version>

-fcolor-diagnostics

Supported

Enables colors in diagnostics

-fcomment-block-commands= <arg>

Supported

Treats each comma-separated argument in <arg> as a documentation comment block command

-fcommon

Supported

Places uninitialized global variables in a common block

-fcomplete-member-pointers

Supported

Requires member pointer base types to be complete if they are significant under the Microsoft ABI

-fconvergent-functions

Supported

Assumes functions to be convergent

-fcoroutines-ts

Supported

Enables support for the C++ Coroutines TS

-fcoverage-mapping

Unsupported

Generates coverage mapping to enable code coverage analysis

-fcs-profile-generate= <directory>

Unsupported

Generates instrumented code to collect context-sensitive execution counts into <directory>/default.profraw (overridden by LLVM_PROFILE_FILE env var)

-fcs-profile-generate

Unsupported

Generates instrumented code to collect context-sensitive execution counts into default.profraw (overridden by LLVM_PROFILE_FILE env var)

-fcuda-approx-transcendentals

Unsupported

Uses approximate transcendental functions

-fcuda-flush-denormals-to-zero

Supported

Flushes denormal floating-point values to zero in CUDA device mode

-fcuda-short-ptr

Unsupported

Uses 32-bit pointers for accessing const/local/shared address spaces

-fcxx-exceptions

Supported

Enables C++ exceptions

-fdata-sections

Supported

Places each data in its section

-fdebug-compilation-dir <value>

Supported

Specifies the compilation directory for embedding the debug info

-fdebug-default-version= <value>

Supported

Specifies the default DWARF version to use, if a -g option caused DWARF debug info to be produced

-fdebug-info-for-profiling

Supported

Emits extra debug info to make the sample profile more accurate

-fdebug-macro

Supported

Emits macro debug information

-fdebug-prefix-map= <value>

Supported

Remaps file source paths in debug info

-fdebug-ranges-base-address

Supported

Uses DWARF base address selection entries in .debug ranges

-fdebug-types-section

Supported

Places debug types in their section (ELF only)

-fdeclspec

Supported

Allows __declspec as a keyword

-fdelayed-template-parsing

Supported

Parses templated function definitions at the end of the translation unit

-fdelete-null-pointer-checks

Supported

Treats usage of null pointers as undefined behavior (default)

-fdiagnostics-absolute-paths

Supported

Prints absolute paths in diagnostics

-fdiagnostics-hotness-threshold= <number>

Unsupported

Prevents optimization remarks from being output if they do not have at least the specified number of profile count

-fdiagnostics-parseable-fixits

Supported

Prints fix-its in machine parseable form

-fdiagnostics-print-source-range-info

Supported

Prints source range spans in numeric form

-fdiagnostics-show-hotness

Unsupported

Enables profile hotness information in diagnostic line

-fdiagnostics-show-note-include-stack

Supported

Displays include stacks for diagnostic notes

-fdiagnostics-show-option

Supported

Prints option name with mappable diagnostics

-fdiagnostics-show-template-tree

Supported

Prints a template comparison tree for differing templates

-fdigraphs

Supported

Enables alternative token representations ‘<:’, ‘:>’, ‘<%’, ‘%>’, ‘%:’, ‘%:%:’ (default)

-fdiscard-value-names

Supported

Discards value names in LLVM IR

-fdollars-in-identifiers

Supported

Allows “$” in identifiers

-fdouble-square-bracket-attributes

Supported

Enables ‘[[]]’ attributes in all C and C++ language modes

-fdwarf-exceptions

Unsupported

Uses DWARF style exceptions

-feliminate-unused-debug-types

Supported

Eliminates debug info for defined but unused types

-fembed-bitcode-marker

Supported

Embeds placeholder LLVM IR data as a marker

-fembed-bitcode= <option>

Supported

Embeds LLVM bitcode (option: off, all, bitcode, marker)

-fembed-bitcode

Supported

Embeds LLVM IR bitcode as data

-femit-all-decls

Supported

Emits all declarations, even if unused

-femulated-tls

Supported

Uses emutls functions to access thread_local variables

-fenable-matrix

Supported

Enables matrix data type and related built-in functions

-fexceptions

Supported

Enables support for exception handling

-fexperimental-new-constant-interpreter

Supported

Enables the experimental new constant interpreter

-fexperimental-new-pass-manager

Supported

Enables an experimental new pass manager in LLVM

-fexperimental-relative-c++-abi-vtables

Supported

Uses the experimental C++ class ABI for classes with virtual tables

-fexperimental-strict-floating-point

Supported

Enables experimental strict floating point in LLVM

-ffast-math

Supported

Allows aggressive, lossy floating-point optimizations

-ffile-prefix-map= <value>

Supported

Remaps file source paths in debug info and predefined preprocessor macros

-ffine-grained-bitfield-accesses

Supported

Uses separate accesses for consecutive bitfield runs with legal widths and alignments

-ffixed-form

Supported

Enables fixed-form format for Fortran

-ffixed-point

Supported

Enables fixed point types

-ffixed-r19

Unsupported

Reserves the r19 register (Hexagon only)

-ffixed-r9

Unsupported

Reserves the r9 register (ARM only)

-ffixed-x10

Unsupported

Reserves the x10 register (AArch64/RISC-V only)

-ffixed-x11

Unsupported

Reserves the x11 register (AArch64/RISC-V only)

-ffixed-x12

Unsupported

Reserves the x12 register (AArch64/RISC-V only)

-ffixed-x13

Unsupported

Reserves the x13 register (AArch64/RISC-V only)

-ffixed-x14

Unsupported

Reserves the x14 register (AArch64/RISC-V only)

-ffixed-x15

Unsupported

Reserves the x15 register (AArch64/RISC-V only)

-ffixed-x16

Unsupported

Reserves the x16 register (AArch64/RISC-V only)

-ffixed-x17

Unsupported

Reserves the x17 register (AArch64/RISC-V only)

-ffixed-x18

Unsupported

Reserves the x18 register (AArch64/RISC-V only)

-ffixed-x19

Unsupported

Reserves the x19 register (AArch64/RISC-V only)

-ffixed-x1

Unsupported

Reserves the x1 register (AArch64/RISC-V only)

-ffixed-x20

Unsupported

Reserves the x20 register (AArch64/RISC-V only)

-ffixed-x21

Unsupported

Reserves the x21 register (AArch64/RISC-V only)

-ffixed-x22

Unsupported

Reserves the x22 register (AArch64/RISC-V only)

-ffixed-x23

Unsupported

Reserves the x23 register (AArch64/RISC-V only)

-ffixed-x24

Unsupported

Reserves the x24 register (AArch64/RISC-V only)

-ffixed-x25

Unsupported

Reserves the x25 register (AArch64/RISC-V only)

-ffixed-x26

Unsupported

Reserves the x26 register (AArch64/RISC-V only)

-ffixed-x27

Unsupported

Reserves the x27 register (AArch64/RISC-V only)

-ffixed-x28

Unsupported

Reserves the x28 register (AArch64/RISC-V only)

-ffixed-x29

Unsupported

Reserves the x29 register (AArch64/RISC-V only)

-ffixed-x2

Unsupported

Reserves the x2 register (AArch64/RISC-V only)

-ffixed-x30

Unsupported

Reserves the x30 register (AArch64/RISC-V only)

-ffixed-x31

Unsupported

Reserves the x31 register (AArch64/RISC-V only)

-ffixed-x3

Unsupported

Reserves the x3 register (AArch64/RISC-V only)

-ffixed-x4

Unsupported

Reserves the x4 register (AArch64/RISC-V only)

-ffixed-x5

Unsupported

Reserves the x5 register (AArch64/RISC-V only)

-ffixed-x6

Unsupported

Reserves the x6 register (AArch64/RISC-V only)

-ffixed-x7

Unsupported

Reserves the x7 register (AArch64/RISC-V only)

-ffixed-x8

Unsupported

Reserves the x8 register (AArch64/RISC-V only)

-ffixed-x9

Unsupported

Reserves the x9 register (AArch64/RISC-V only)

-fforce-dwarf-frame

Supported

Mandatorily emits a debug frame section

-fforce-emit-vtables

Supported

Emits more virtual tables to improve devirtualization

-fforce-enable-int128

Supported

Enables support for int128_t type

-ffp-contract= <value>

Supported

Forms fused FP ops (e.g. FMAs): fast (everywhere) | on (according to FP_CONTRACT pragma) | off (never fuse). Default is “fast” for CUDA/HIP and “on” for others.

-ffp-exception-behavior= <value>

Supported

Specifies the exception behavior of floating-point operations

-ffp-model= <value>

Supported

Controls the semantics of floating-point calculations

-ffree-form

Supported

Enables free-form format for Fortran

-ffreestanding

Supported

Asserts the compilation to take place in a freestanding environment

-ffunc-args-alias

Supported

Allows function arguments to alias (equivalent to ansi alias)

-ffunction-sections

Supported

Places each function in its section

-fglobal-isel

Supported

Enables the global instruction selector

-fgnu-keywords

Supported

Allows GNU-extension keywords regardless of a language standard

-fgnu-runtime

Unsupported

Generates output compatible with the standard GNU Objective-C runtime

-fgnu89-inline

Unsupported

Uses the gnu89 inline semantics

-fgnuc-version= <value>

Supported

Sets various macros to claim compatibility with the given GCC version (default is 4.2.1)

-fgpu-allow-device-init

Supported

Allows device-side init function in HIP

-fgpu-rdc

Supported

Generates relocatable device code, also known as separate compilation mode

-fhip-new-launch-api

Supported

Uses new kernel launching API for HIP

-fignore-exceptions

Supported

Enables support for ignoring exception handling constructs

-fimplicit-module-maps

Unsupported

Implicitly searches the file system for module map files

-finline-functions

Supported

Inlines suitable functions

-finline-hint-functions

Supported

Inlines functions that are (explicitly or implicitly) marked inline

-finstrument-function-entry-bare

Unsupported

Allows instrument function entry only after inlining, without arguments to the instrumentation call

-finstrument-functions-after-inlining

Unsupported

Similar to -finstrument-functions option but inserts the calls after inlining

-finstrument-functions

Unsupported

Generates calls to instrument function entry and exit

-fintegrated-as

Supported

Enables the integrated assembler

-fintegrated-cc1

Supported

Runs cc1 in-process

-fjump-tables

Supported

Uses jump tables for lowering switches

-fkeep-static-consts

Supported

Keeps static const variables if unused

-flax-vector-conversions= <value>

Supported

Enables implicit vector bit-casts

-flto-jobs= <value>

Unsupported

Controls the backend parallelism of -flto=thin (A default value of 0 means the number of threads will be derived from the number of CPUs detected.)

-flto= <value>

Unsupported

Sets LTO mode to either “full” or “thin”

-flto

Unsupported

Enables LTO in “full” mode

-fmacro-prefix-map= <value>

Supported

Remaps file source paths in predefined preprocessor macros

-fmath-errno

Supported

Requires math functions to indicate errors by setting errno

-fmax-tokens= <value>

Supported

Specifies max total number of preprocessed tokens for -Wmax-tokens

-fmax-type-align= <value>

Supported

Specifies the maximum alignment to enforce on pointers lacking an explicit alignment

-fmemory-profile

Supported

Enables heap memory profiling

-fmerge-all-constants

Supported

Allows merging of constants

-fmessage-length= <value>

Supported

Formats message diagnostics to fit within N columns

-fmodule-file=[ <name>=] <file>

Unsupported

Specifies the mapping of module name to precompiled module file. Loads a module file if name is omitted

-fmodule-map-file= <file>

Unsupported

Loads the specified module map file

-fmodule-name= <name>

Unsupported

Specifies the name of the module to build

-fmodules-cache-path= <directory>

Unsupported

Specifies the module cache path

-fmodules-decluse

Unsupported

Asserts declaration of modules used within a module

-fmodules-disable-diagnostic-validation

Unsupported

Disables validation of the diagnostic options when loading the module

-fmodules-ignore-macro= <value>

Unsupported

Ignores the definition of the specified macro when building and loading modules

-fmodules-prune-after= <seconds>

Unsupported

Specifies the interval (in seconds) after which a module file is to be considered unused

-fmodules-prune-interval= <seconds>

Unsupported

Specifies the interval (in seconds) between attempts to prune the module cache

-fmodules-search-all

Unsupported

Searches even non-imported modules to resolve references

-fmodules-strict-decluse

Unsupported

Similar to -fmodules-decluse option but requires all headers to be in the modules

-fmodules-ts

Unsupported

Enables support for the C++ Modules TS

-fmodules-user-build-path <directory>

Unsupported

Specifies the module user build path

-fmodules-validate-input-files-content

Supported

Validates PCM input files based on content if mtime differs

-fmodules-validate-once-per-build-session

Unsupported

Prohibits verification of input files for the modules if the module has been successfully validated or loaded during the current build session

-fmodules-validate-system-headers

Supported

Validates the system headers that a module depends on when loading the module

-fmodules

Unsupported

Enables the “modules” language feature

-fms-compatibility-version= <value>

Supported

Specifies the dot-separated value representing the Microsoft compiler version number to report in _MSC_VER (0 = do not define it (default))

-fms-compatibility

Supported

Enables full Microsoft Visual C++ compatibility

-fms-extensions

Supported

Accepts some non-standard constructs supported by the Microsoft compiler

-fmsc-version= <value>

Supported

Specifies the Microsoft compiler version number to report in _MSC_VER (0 = do not define it (default))

-fnew-alignment= <align>

Supported

Specifies the largest alignment guaranteed by “::operator new(size_t)”

-fno-addrsig

Supported

Prohibits emitting an address-significance table

-fno-allow-fortran-gnu-ext

Supported

Disallows Fortran GNU extensions

-fno-assume-sane-operator-new

Supported

Prohibits the assumption that C++’s global operator new cannot alias any pointer

-fno-autolink

Supported

Disables generation of linker directives for automatic library linking

-fno-backslash

Supported

Allows treatment of backslash like any other character in character strings

-fno-builtin- <value>

Supported

Disables implicit built-in knowledge of a specific function

-fno-builtin

Supported

Disables implicit built-in knowledge of functions

-fno-c++-static-destructors

Supported

Disables C++ static destructor registration

-fno-char8_t

Supported

Disables C++ built-in type char8_t

-fno-color-diagnostics

Supported

Disables colors in diagnostics

-fno-common

Supported

Compiles common globals like normal definitions

-fno-complete-member-pointers

Supported

Eliminates the requirement for the member pointer base types to be complete if they would be significant under the Microsoft ABI

-fno-constant-cfstrings

Supported

Disables creation of CodeFoundation-type constant strings

-fno-coverage-mapping

Supported

Disables code coverage analysis

-fno-crash-diagnostics

Supported

Disables auto-generation of preprocessed source files and a script for reproduction during a Clang crash

-fno-cuda-approx-transcendentals

Unsupported

Eliminates the usage of approximate transcendental functions

-fno-debug-macro

Supported

Prohibits emitting the macro debug information

-fno-declspec

Unsupported

Disallows declspec as a keyword

-fno-delayed-template-parsing

Supported

Disables delayed template parsing

-fno-delete-null-pointer-checks

Supported

Prohibits the treatment of null pointers as undefined behavior

-fno-diagnostics-fixit-info

Supported

Prohibits including fixit information in diagnostics

-fno-digraphs

Supported

Disallows alternative token representations ‘<:’, ‘:>’, ‘<%’, ‘%>’, ‘%:’, ‘%:%:’

-fno-discard-value-names

Supported

Prohibits discarding value names in LLVM IR

-fno-dollars-in-identifiers

Supported

Disallows ‘$’ in identifiers

-fno-double-square-bracket-attributes

Supported

Disables ‘[[]]’ attributes in all C and C++ language modes

-fno-elide-constructors

Supported

Disables C++ copy constructor elision

-fno-elide-type

Supported

Prohibits eliding types when printing diagnostics

-fno-eliminate-unused-debug-types

Supported

Emits debug info for defined but unused types

-fno-exceptions

Supported

Disables support for exception handling

-fno-experimental-new-pass-manager

Supported

Disables an experimental new pass manager in LLVM

-fno-experimental-relative-c++-abi-vtables

Supported

Prohibits using the experimental C++ class ABI for classes with virtual tables

-fno-fine-grained-bitfield-accesses

Supported

Allows using large-integer access for consecutive bitfield runs

-fno-fixed-form

Supported

Disables fixed-form format for Fortran

-fno-fixed-point

Supported

Disables fixed point types

-fno-force-enable-int128

Supported

Disables support for int128_t type

-fno-fortran-main

Supported

Prohibits linking in Fortran main

-fno-free-form

Supported

Disables free-form format for Fortran

-fno-func-args-alias

Supported

Disallows function argument aliasing

-fno-global-isel

Supported

Disables the global instruction selector

-fno-gnu-inline-asm

Supported

Disables GNU style inline asm

-fno-gpu-allow-device-init

Supported

Disallows device-side init function in HIP

-fno-hip-new-launch-api

Supported

Disallows new kernel launching API for HIP

-fno-integrated-as

Supported

Disables the integrated assembler

-fno-integrated-cc1

Supported

Spawns a separate process for each cc1

-fno-jump-tables

Supported

Disallows jump tables for lowering switches

-fno-keep-static-consts

Supported

Prohibits keeping static const variables if unused

-fno-lto

Supported

Disables LTO mode (default)

-fno-memory-profile

Supported

Disables heap memory profiling

-fno-merge-all-constants

Supported

Disallows merging of constants

-fno-no-access-control

Supported

Disables C++ access control

-fno-objc-infer-related-result-type

Supported

Prohibits inferring Objective-C related result type based on the method family

-fno-operator-names

Supported

Disallows treatment of C++ operator name keywords as synonyms for operators

-fno-pch-codegen

Supported

Disallows code-generation for uses of the PCH that assumes building an explicit object file for the PCH

-fno-pch-debuginfo

Supported

Prohibits generation of debug info for types in an object file built from this PCH or elsewhere

-fno-plt

Supported

Asserts usage of GOT indirection instead of PLT to make external function calls (x86 only)

-fno-preserve-as-comments

Supported

Prohibits preserving comments in inline assembly

-fno-profile-generate

Supported

Disables generation of profile instrumentation

-fno-profile-instr-generate

Supported

Disables generation of profile instrumentation

-fno-profile-instr-use

Supported

Disables usage of instrumentation data for profile-guided optimization

-fno-register-global-dtors-with-atexit

Supported

Disallows usage of atexit or __cxa_atexit to register global destructors

-fno-rtlib-add-rpath

Supported

Prohibits adding -rpath with architecture-specific resource directory to the linker flags

-fno-rtti-data

Supported

Disables generation of RTTI data

-fno-rtti

Supported

Disables generation of rtti information

-fno-sanitize-address-poison-custom-array-cookie

Supported on Host only

Disables poisoning of array cookies when using custom operator new[] in AddressSanitizer

-fno-sanitize-address-use-after-scope

Supported on Host only

Disables use-after-scope detection in AddressSanitizer

-fno-sanitize-address-use-odr-indicator

Supported on Host only

Disables ODR indicator globals

-fno-sanitize-blacklist

Supported on Host only

Prohibits using blacklist file for sanitizers

-fno-sanitize-cfi-canonical-jump-tables

Supported on Host only

Prohibits making the jump table addresses canonical in the symbol table

-fno-sanitize-cfi-cross-dso

Supported on Host only

Disables control flow integrity (CFI) checks for cross-DSO calls

-fno-sanitize-coverage= <value>

Supported on Host only

Disables specified features of coverage instrumentation for Sanitizers

-fno-sanitize-memory-track-origins

Supported on Host only

Disables origins tracking in MemorySanitizer

-fno-sanitize-memory-use-after-dtor

Supported on Host only

Disables use-after-destroy detection in MemorySanitizer

-fno-sanitize-recover= <value>

Supported on Host only

Disables recovery for specified sanitizers

-fno-sanitize-stats

Supported on Host only

Disables sanitizer statistics gathering

-fno-sanitize-thread-atomics

Supported on Host only

Disables atomic operations instrumentation in ThreadSanitizer

-fno-sanitize-thread-func-entry-exit

Supported on Host only

Disables function entry/exit instrumentation in ThreadSanitizer

-fno-sanitize-thread-memory-access

Supported on Host only

Disables memory access instrumentation in ThreadSanitizer

-fno-sanitize-trap= <value>

Supported on Host only

Disables trapping for specified sanitizers

-fno-sanitize-trap

Supported on Host only

Disables trapping for all sanitizers

-fno-short-wchar

Supported

Forces wchar_t to be an unsigned int

-fno-show-column

Supported

Prohibits including column number on diagnostics

-fno-show-source-location

Supported

Prohibits including source location information with diagnostics

-fno-signed-char

Supported

char is unsigned

-fno-signed-zeros

Supported

Allows optimizations that ignore the sign of floating point zeros

-fno-spell-checking

Supported

Disables spell-check

-fno-split-machine-functions

Supported

Disables late function splitting using profile information (x86 ELF)

-fno-stack-clash-protection

Supported

Disables stack clash protection

-fno-stack-protector

Supported

Disables the use of stack protectors

-fno-standalone-debug

Supported

Limits debug information produced to reduce size of debug binary

-fno-strict-float-cast-overflow

Supported

Relaxes language rules and tries to match the behavior of the target’s native float-to-int conversion instructions

-fno-strict-return

Supported

Prohibits treating the control flow paths that fall off the end of a non-void function as unreachable

-fno-sycl

Unsupported

Disables SYCL kernels compilation for device

-fno-temp-file

Supported

Asserts direct creation of compilation output files. This may lead to incorrect incremental builds if the compiler crashes.

-fno-threadsafe-statics

Supported

Prohibits emitting code to make initialization of local statics thread safe

-fno-trigraphs

Supported

Prohibits processing trigraph sequences

-fno-unique-section-names

Supported

Prohibits the usage of unique names for text and data sections

-fno-unroll-loops

Supported

Turns off the loop unroller

-fno-use-cxa-atexit

Supported

Prohibits the usage of __cxa_atexit for calling destructors

-fno-use-flang-math-libs

Supported

Uses LLVM math intrinsics instead of the Flang internal runtime math library

-fno-use-init-array

Supported

Asserts the usage of .ctors/.dtors instead of .init_array/.fini_array

-fno-visibility-inlines-hidden-static-local-var

Supported

Disables -fvisibility-inlines-hidden-static-local-var (This is the default on non-darwin targets.)

-fno-xray-function-index

Unsupported

Allows omitting function index section at the expense of single-function patching performance

-fno-zero-initialized-in-bss

Supported

Prohibits placing zero initialized data in BSS

-fobjc-arc-exceptions

Unsupported

Asserts using EH-safe code when synthesizing retains and releases in -fobjc-arc

-fobjc-arc

Unsupported

Synthesizes retain and release calls for Objective-C pointers

-fobjc-exceptions

Unsupported

Enables Objective-C exceptions

-fobjc-runtime= <value>

Unsupported

Specifies the target Objective-C runtime kind and version

-fobjc-weak

Unsupported

Enables ARC-style weak references in Objective-C

-fopenmp-simd

Unsupported

Emits OpenMP code only for SIMD-based constructs

-fopenmp-targets= <value>

Unsupported

Specifies a comma-separated list of triples OpenMP offloading targets to be supported

-fopenmp

Unsupported

Parses OpenMP pragmas and generates parallel code

-foptimization-record-file= <file>

Supported

Specifies the output name of the file containing the optimization remarks. Implies -fsave-optimization-record. On Darwin platforms, this cannot be used with multiple -arch <arch> options.

-foptimization-record-passes= <regex>

Supported

Exclusively allows the inclusion of passes that match a specified regular expression in the generated optimization record (By default, include all passes.)

-forder-file-instrumentation

Supported

Generates instrumented code to collect order file into default.profraw file (overridden by ‘=’ form of option or LLVM_PROFILE_FILE env var)

-fpack-struct= <value>

Unsupported

Specifies the default maximum struct packing alignment

-fpascal-strings

Supported

Recognizes and constructs Pascal-style string literals

-fpass-plugin= <dsopath>

Supported

Loads pass plugin from a dynamic shared object file (only with new pass manager)

-fpatchable-function-entry= <N,M>

Supported

Generates M NOPs before function entry and N-M NOPs after function entry

-fpcc-struct-return

Unsupported

Overrides the default ABI to return all structs on the stack

-fpch-codegen

Supported

Generates code for using this PCH that assumes building an explicit object file for the PCH

-fpch-debuginfo

Supported

Generates debug info for types exclusively in an object file built from this PCH

-fpch-instantiate-templates

Supported

Instantiates templates already while building a PCH

-fpch-validate-input-files-content

Supported

Validates PCH input files based on the content if mtime differs

-fplugin= <dsopath>

Supported

Loads the named plugin (dynamic shared object)

-fprebuilt-module-path= <directory>

Unsupported

Specifies the prebuilt module path

-fprofile-exclude-files= <value>

Unsupported

Exclusively instruments those functions from files where names do not match all the regexes separated by a semicolon

-fprofile-filter-files= <value>

Unsupported

Exclusively instruments those functions from files where names match any regex separated by a semicolon

-fprofile-generate= <directory>

Unsupported

Generates instrumented code to collect execution counts into <directory>/default.profraw (overridden by LLVM_PROFILE_FILE env var)

-fprofile-generate

Unsupported

Generates instrumented code to collect execution counts into default.profraw (overridden by LLVM_PROFILE_FILE env var)

-fprofile-instr-generate= <file>

Unsupported

Generates instrumented code to collect execution counts into <file> (overridden by LLVM_PROFILE_FILE env var)

-fprofile-instr-generate

Unsupported

Generates instrumented code to collect execution counts into default.profraw file (overridden by ‘=’ form of option or LLVM_PROFILE_FILE env var)

-fprofile-instr-use= <value>

Unsupported

Uses instrumentation data for profile-guided optimization

-fprofile-remapping-file= <file>

Unsupported

Uses the remappings described in <file> to match the profile data against the names in the program

-fprofile-sample-accurate

Unsupported

Specifies that the sample profile is accurate

-fprofile-sample-use= <value>

Unsupported

Enables sample-based profile-guided optimizations

-fprofile-use= <pathname>

Unsupported

Uses instrumentation data for profile-guided optimization. If pathname is a directory, it reads from <pathname>/default.profdata. Otherwise, it reads from file <pathname>.

-freciprocal-math

Supported

Allows division operations to be reassociated

-freg-struct-return

Unsupported

Overrides the default ABI to return small structs in registers

-fregister-global-dtors-with-atexit

Supported

Uses atexit or __cxa_atexit to register global destructors

-frelaxed-template-template-args

Supported

Enables C++17 relaxed template argument matching

-freroll-loops

Supported

Turns on loop reroller

-fropi

Unsupported

Generates read-only position independent code (ARM only)

-frtlib-add-rpath

Supported

Adds -rpath with architecture-specific resource directory to the linker flags

-frwpi

Unsupported

Generates read-write position-independent code (ARM only)

-fsanitize-address-field-padding= <value>

Supported on Host only

Specifies the level of field padding for AddressSanitizer

-fsanitize-address-globals-dead-stripping

Supported on Host only

Enables linker dead stripping of globals in AddressSanitizer

-fsanitize-address-poison-custom-array-cookie

Supported on Host only

Enables poisoning of array cookies when using custom operator new[] in AddressSanitizer

-fsanitize-address-use-after-scope

Supported on Host only

Enables use-after-scope detection in AddressSanitizer

-fsanitize-address-use-odr-indicator

Supported on Host only

Enables ODR indicator globals to avoid false ODR violation reports in partially sanitized programs at the cost of an increase in binary size

-fsanitize-blacklist= <value>

Supported on Host only

Specifies the path to blacklisted files for sanitizers

-fsanitize-cfi-canonical-jump-tables

Supported on Host only

Makes the jump table addresses canonical in the symbol table

-fsanitize-cfi-cross-dso

Supported on Host only

Enables control flow integrity (CFI) checks for cross-DSO calls

-fsanitize-cfi-icall-generalize-pointers

Supported on Host only

Generalizes pointers in CFI indirect call type signature checks

-fsanitize-coverage-allowlist= <value>

Supported on Host only

Restricts sanitizer coverage instrumentation exclusively to modules and functions that match the provided special case list, except the blocked ones

-fsanitize-coverage-blacklist= <value>

Supported on Host only

Deprecated; use -fsanitize-coverage-blocklist= instead.

-fsanitize-coverage-blocklist= <value>

Supported on Host only

Disables sanitizer coverage instrumentation for modules and functions that match the provided special case list, even the allowed ones

-fsanitize-coverage-whitelist= <value>

Supported on Host only

Deprecated; use -fsanitize-coverage-allowlist= instead.

-fsanitize-coverage= <value>

Supported on Host only

Specifies the type of coverage instrumentation for Sanitizers

-fsanitize-hwaddress-abi= <value>

Supported on Host only

Selects the HWAddressSanitizer ABI to target (interceptor or platform, default interceptor). This option is currently unused.

-fsanitize-memory-track-origins= <value>

Supported on Host only

Enables origins tracking in MemorySanitizer

-fsanitize-memory-track-origins

Supported on Host only

Enables origins tracking in MemorySanitizer

-fsanitize-memory-use-after-dtor

Supported on Host only

Enables use-after-destroy detection in MemorySanitizer

-fsanitize-recover= <value>

Supported on Host only

Enables recovery for specified sanitizers

-fsanitize-stats

Supported on Host only

Enables sanitizer statistics gathering

-fsanitize-system-blacklist= <value>

Supported on Host only

Specifies the path to system blacklist files for sanitizers

-fsanitize-thread-atomics

Supported on Host only

Enables atomic operations instrumentation in ThreadSanitizer (default)

-fsanitize-thread-func-entry-exit

Supported on Host only

Enables function entry/exit instrumentation in ThreadSanitizer (default)

-fsanitize-thread-memory-access

Supported on Host only

Enables memory access instrumentation in ThreadSanitizer (default)

-fsanitize-trap= <value>

Supported on Host only

Enables trapping for specified sanitizers

-fsanitize-trap

Supported on Host only

Enables trapping for all sanitizers

-fsanitize-undefined-strip-path-components= <number>

Supported on Host only

Strips (or keeps only, if negative) the given number of path components when emitting check metadata

-fsanitize= <check>

Supported on Host only

Turns on runtime checks for various forms of undefined or suspicious behavior. See user manual for available checks.

-fsave-optimization-record= <format>

Supported

Generates an optimization record file in the specified format

-fsave-optimization-record

Supported

Generates a YAML optimization record file

-fseh-exceptions

Supported

Uses SEH style exceptions

-fshort-enums

Supported

Allocates to an enum type only as many bytes as it needs for the declared range of possible values

-fshort-wchar

Unsupported

Forces wchar_t to be a short unsigned int

-fshow-overloads= <value>

Supported

Specifies which overload candidates are shown when overload resolution fails. Values: best | all; default value = “all”

-fsigned-char

Supported

Asserts that the char is signed

-fsized-deallocation

Supported

Enables C++14 sized global deallocation functions

-fsjlj-exceptions

Supported

Uses SjLj style exceptions

-fslp-vectorize

Supported

Enables the superword-level parallelism vectorization passes

-fsplit-dwarf-inlining

Unsupported

Provides minimal debug info in the object/executable to facilitate online symbolication/stack traces in the absence of .dwo/.dwp files when using Split DWARF

-fsplit-lto-unit

Unsupported

Enables splitting of the LTO unit

-fsplit-machine-functions

Supported

Enables late function splitting using profile information (x86 ELF)

-fstack-clash-protection

Supported

Enables stack clash protection

-fstack-protector-all

Unsupported

Enables stack protectors for all functions

-fstack-protector-strong

Unsupported

Enables stack protectors for some functions vulnerable to stack smashing. Compared to -fstack-protector, this uses a stronger heuristic that includes functions containing arrays of any size (and any type), as well as any calls to allocate or the taking of an address from a local variable.

-fstack-protector

Unsupported

Enables stack protectors for some functions vulnerable to stack smashing. This uses a loose heuristic that considers the functions to be vulnerable if they contain a char (or 8bit integer) array or constant-size calls to alloca, which are of greater size than ssp-buffer-size (default: 8 bytes). All variable-size calls to alloca are considered vulnerable. A function with a stack protector has a guard value added to the stack frame that is checked on function exit. The guard value must be positioned in the stack frame such that a buffer overflow from a vulnerable variable will overwrite the guard value before overwriting the function’s return address. The reference stack guard value is stored in a global variable.

-fstack-size-section

Supported

Emits section containing metadata on function stack sizes

-fstandalone-debug

Supported

Emits full debug info for all types used by the program

-fstrict-enums

Supported

Enables optimizations based on the strict definition of an enum’s value range

-fstrict-float-cast-overflow

Supported

Assumes the overflowing float-to-int casts to be undefined (default)

-fstrict-vtable-pointers

Supported

Enables optimizations based on the strict rules for overwriting polymorphic C++ objects

-fsycl

Unsupported

Enables SYCL kernels compilation for device

-fsystem-module

Unsupported

Builds this module as a system module. Only used with -emit-module

-fthin-link-bitcode= <value>

Supported

Writes minimized bitcode to <file> for the ThinLTO thin link only

-fthinlto-index= <value>

Unsupported

Performs ThinLTO import using the provided function summary index

-ftime-trace-granularity= <value>

Supported

Specifies the minimum time granularity (in microseconds) traced by time profiler

-ftime-trace

Supported

Turns on time profiler. Generates JSON file based on output filename

-ftrap-function= <value>

Unsupported

Issues call to specified function rather than a trap instruction

-ftrapv-handler= <function name>

Unsupported

Specifies the function to be called on overflow

-ftrapv

Unsupported

Traps on integer overflow

-ftrigraphs

Supported

Processes trigraph sequences

-ftrivial-auto-var-init-stop-after= <value>

Supported

Stops initializing trivial automatic stack variables after the specified number of instances

-ftrivial-auto-var-init= <value>

Supported

Initializes trivial automatic stack variables. Values: uninitialized (default) / pattern

-funique-basic-block-section-names

Supported

Uses unique names for basic block sections (ELF only)

-funique-internal-linkage-names

Supported

Makes the Internal Linkage Symbol names unique by appending the MD5 hash of the module path

-funroll-loops

Supported

Turns on loop unroller

-fuse-flang-math-libs

Supported

Uses Flang internal runtime math library instead of LLVM math intrinsics

-fuse-line-directives

Supported

Uses #line in preprocessed output

-fvalidate-ast-input-files-content

Supported

Computes and stores the hash of input files used to build an AST. Files with mismatching mtimes are considered valid if both have identical contents.

-fveclib= <value>

Unsupported

Uses the given vector functions library

-fvectorize

Unsupported

Enables the loop vectorization passes

-fverbose-asm

Supported

Generates verbose assembly output

-fvirtual-function-elimination

Supported

Enables dead virtual function elimination optimization. Requires -flto=full

-fvisibility-global-new-delete-hidden

Supported

Marks the visibility of global C++ operators “new” and “delete” as hidden

-fvisibility-inlines-hidden-static-local-var

Supported

Marks the visibility of static variables in inline C++ member functions as hidden by default when -fvisibility-inlines-hidden is enabled

-fvisibility-inlines-hidden

Supported

Marks the visibility of inline C++ member functions as hidden by default

-fvisibility-ms-compat

Supported

Marks the visibility of global types as default and global functions and variables as hidden by default

-fvisibility= <value>

Supported

Sets the default symbol visibility for all global declarations to the specified value

-fwasm-exceptions

Unsupported

Uses WebAssembly style exceptions

-fwhole-program-vtables

Unsupported

Enables whole program vtable optimization. Requires -flto

-fwrapv

Supported

Treats signed integer overflow as two’s complement

-fwritable-strings

Supported

Stores string literals as writable data

-fxray-always-emit-customevents

Unsupported

Mandates emitting __xray_customevent(…) calls even if the containing function is not always instrumented

-fxray-always-emit-typedevents

Unsupported

Mandates emitting __xray_typedevent(…) calls even if the containing function is not always instrumented

-fxray-always-instrument= <value>

Unsupported

Deprecated: Specifies the filename defining the whitelist for imbuing the “always instrument” XRay attribute

-fxray-attr-list= <value>

Unsupported

Specifies the filename defining the list of functions/types for imbuing XRay attributes

-fxray-ignore-loops

Unsupported

Prohibits instrumenting functions with loops unless they also meet the minimum function size

-fxray-instruction-threshold= <value>

Unsupported

Sets the minimum function size to instrument with Xray

-fxray-instrumentation-bundle= <value>

Unsupported

Specifies which XRay instrumentation points to emit. Values: all/ none/ function-entry/ function-exit/ function/ custom. Default is “all,” and “function” includes both “function-entry” and “function-exit.”

-fxray-instrument

Unsupported

Generates XRay instrumentation sleds on function entry and exit

-fxray-link-deps

Unsupported

Informs Clang to add the link dependencies for XRay

-fxray-modes= <value>

Unsupported

Specifies the list of modes to link in by default into the XRay instrumented binaries

-fxray-never-instrument= <value>

Unsupported

Deprecated: Specifies the filename defining the whitelist for imbuing the “never instrument” XRay attribute

-fzvector

Supported

Enables System z vector language extension

-F <value>

Unsupported

Adds directory to the framework include search path

--gcc-toolchain= <value>

Supported

Uses the gcc toolchain at the given directory

-gcodeview-ghash

Supported

Emits type record hashes in a .debug$H section

-gcodeview

Supported

Generates code view debug information

-gdwarf-2

Supported

Generates source-level debug information with dwarf version 2

-gdwarf-3

Supported

Generates source-level debug information with dwarf version 3

-gdwarf-4

Supported

Generates source-level debug information with dwarf version 4

-gdwarf-5

Supported

Generates source-level debug information with dwarf version 5

-gdwarf

Supported

Generates source-level debug information with the default DWARF version

-gembed-source

Supported

Embeds source text in DWARF debug sections

-gline-directives-only

Supported

Emits debug line info directives only.

-gline-tables-only

Supported

Emits debug line number tables only.

-gmodules

Supported

Generates debug info with external references to clang modules or precompiled headers

-gno-embed-source

Supported

Restores the default behavior of not embedding the source text in DWARF debug sections

-gno-inline-line-tables

Supported

Prohibits emitting inline line tables

--gpu-max-threads-per-block= <value>

Supported

Specifies the default max threads per block for kernel launch bounds for HIP

-gsplit-dwarf= <value>

Supported

Sets DWARF fission mode to values: “split”/ “single”

-gz= <value>

Supported

Specifies DWARF debug section’s compression type

-gz

Supported

Enables compression of DWARF debug sections using the default compression type

-G <size>

Unsupported

Puts objects of maximum <size> bytes into small data section (MIPS / Hexagon)

-g

Supported

Generates source-level debug information

--help-hidden

Supported

Displays help for hidden options

-help

Supported

Displays available options

--hip-device-lib= <value>

Supported

Specifies the HIP device library

--hip-link

Supported

Links clang-offload-bundler bundles for HIP

--hip-version= <value>

Supported

Allows specification of HIP version in the format: major/minor/patch

-H

Supported

Shows header “includes” and nesting depth

-I-

Supported

Restricts all prior -I flags to double-quoted inclusion and removes the current directory from include path

-ibuiltininc

Supported

Enables built-in #include directories even when -nostdinc is used before or after -ibuiltininc. Using -nobuiltininc after the option disables it

-idirafter <value>

Supported

Adds the directory to AFTER include search path

-iframeworkwithsysroot <directory>

Unsupported

Adds the directory to SYSTEM framework search path; absolute paths are relative to -isysroot

-iframework <value>

Unsupported

Adds the directory to SYSTEM framework search path

-imacros <file>

Supported

Specifies the file containing macros to be included before parsing

-include-pch <file>

Supported

Includes the specified precompiled header file

-include <file>

Supported

Includes the specified file before parsing

-index-header-map

Supported

Makes the next included directory (-I or -F) an indexer header map

-iprefix <dir>

Supported

Sets the -iwithprefix/-iwithprefixbefore prefix

-iquote <directory>

Supported

Adds the directory to QUOTE include search path

-isysroot <dir>

Supported

Sets the system root directory (usually /)

-isystem-after <directory>

Supported

Adds the directory to end of the SYSTEM include search path

-isystem <directory>

Supported

Adds the directory to SYSTEM include search path

-ivfsoverlay <value>

Supported

Overlays the virtual filesystem described by the specified file over the real file system

-iwithprefixbefore <dir>

Supported

Sets the directory to include search path with prefix

-iwithprefix <dir>

Supported

Sets the directory to SYSTEM include search path with prefix

-iwithsysroot <directory>

Supported

Adds directory to SYSTEM include search path; absolute paths are relative to -isysroot

-I <dir>

Supported

Adds directory to include search path. If there are multiple -I options, these directories are searched in the order they are given before the standard system directories are searched. If the same directory is in the SYSTEM include search paths, for example, if also specified with -isystem, the -I option is ignored.

--libomptarget-nvptx-path= <value>

Unsupported

Specifies path to libomptarget-nvptx libraries

-L <dir>

Supported

Adds directory to library search path

-mabicalls

Unsupported

Enables SVR4-style position-independent code (Mips only)

-maix-struct-return

Unsupported

Returns all structs in memory (PPC32 only)

-malign-branch-boundary= <value>

Supported

Specifies the boundary’s size to align branches

-malign-branch= <value>

Supported

Specifies the types of branches to align

-malign-double

Supported

Aligns doubles to two words in structs (x86 only)

-Mallocatable= <value>

Unsupported

Provides semantics for assignments to allocatables. Value: F03/ F95.

-mbackchain

Unsupported

Links stack frames through backchain on System Z

-mbranch-protection= <value>

Unsupported

Enforces targets of indirect branches and function returns

-mbranches-within-32B-boundaries

Supported

Aligns selected branches (fused, jcc, jmp) within 32-byte boundary

-mcmodel=medany

Unsupported

Equivalent to -mcmodel=medium, compatible with RISC-V gcc

-mcmodel=medlow

Unsupported

Equivalent to -mcmodel=small, compatible with RISC-V gcc

-mcmse

Unsupported

Allows use of CMSE (Armv8-M Security Extensions)

-mcode-object-v3

Supported

Legacy option to specify code object ABI V2 (-mnocode-object-v3) or V3 (-mcode-object-v3) (AMDGPU only)

-mcode-object-version= <version>

Supported

Specifies code object ABI version. Default value: 4. (AMDGPU only).

-mcrc

Unsupported

Allows use of CRC instructions (ARM/Mips only)

-mcumode

Supported

Specifies CU (-mcumode) or WGP (-mno-cumode) wavefront execution mode (AMDGPU only)

-mdouble= <value>

Supported

Forces double to be 32 bits or 64 bits

-MD

Supported

Writes a depfile containing user and system headers

-meabi <value>

Supported

Sets EABI type. Value: 4/ 5/ gnu. Default depends on triple

-membedded-data

Unsupported

Places constants in the .rodata section instead of the .sdata section even if they meet the -G <size> threshold (MIPS)

-menable-experimental-extensions

Unsupported

Enables usage of experimental RISC-V extensions.

-mexec-model= <value>

Unsupported

Specifies the execution model (WebAssembly only)

-mexecute-only

Unsupported

Disallows generation of data access to code sections (ARM only)

-mextern-sdata

Unsupported

Assumes externally defined data to be in the small data if it meets the -G <size> threshold (MIPS)

-mfentry

Unsupported

Inserts calls to fentry at function entry (x86/SystemZ only)

-mfix-cortex-a53-835769

Unsupported

Workaround Cortex-A53 erratum 835769 (AArch64 only)

-mfp32

Unsupported

Asserts usage of 32-bit floating point registers (MIPS only)

-mfp64

Unsupported

Asserts usage of 64-bit floating point registers (MIPS only)

-MF <file>

Supported

Writes depfile output from -MMD, -MD, -MM, or -M to <file>

-mgeneral-regs-only

Unsupported

Generates code that exclusively uses the general-purpose registers (AArch64 only)

-mglobal-merge

Supported

Enables merging of globals

-mgpopt

Unsupported

Allows using GP relative accesses for symbols known to be in a small data section (MIPS)

-MG

Supported

Adds missing headers to depfile

-mharden-sls= <value>

Unsupported

Sets straight-line speculation hardening scope

-mhvx-length= <value>

Unsupported

Sets Hexagon Vector Length

-mhvx= <value>

Unsupported

Sets Hexagon Vector eXtensions

-mhvx

Unsupported

Enables Hexagon Vector eXtensions

-miamcu

Unsupported

Allows using Intel MCU ABI

--migrate

Unsupported

Runs the migrator

-mincremental-linker-compatible

Supported

(integrated-as) Emits an object file that can be used with an incremental linker

-mindirect-jump= <value>

Unsupported

Changes indirect jump instructions to inhibit speculation

-Minform= <value>

Supported

Sets error level of messages to display

-mios-version-min= <value>

Unsupported

Sets iOS deployment target

-MJ <value>

Unsupported

Writes a compilation database entry per input

-mllvm <value>

Supported

Specifies additional arguments to forward to LLVM’s option processing

-mlocal-sdata

Unsupported

Extends the -G behavior to object local data (MIPS)

-mlong-calls

Supported

Generates branches with extended addressability, usually via indirect jumps

-mlong-double-128

Supported on Host only

Forces long double to be 128 bits

-mlong-double-64

Supported

Forces long double to be 64 bits

-mlong-double-80

Supported on Host only

Forces long double to be 80 bits, padded to 128 bits for storage

-mlvi-cfi

Supported on Host only

Enables only control-flow mitigations for Load Value Injection (LVI)

-mlvi-hardening

Supported on Host only

Enables all mitigations for Load Value Injection (LVI)

-mmacosx-version-min= <value>

Unsupported

Sets Mac OS X deployment target

-mmadd4

Supported

Enables the generation of 4-operand madd.s, madd.d, and related instructions

-mmark-bti-property

Unsupported

Adds .note.gnu.property with BTI to assembly files (AArch64 only)

-MMD

Supported

Writes a depfile containing user headers

-mmemops

Supported

Enables generation of memop instructions

-mms-bitfields

Unsupported

Sets the default structure layout to be compatible with the Microsoft compiler standard

-mmsa

Unsupported

Enables MSA ASE (MIPS only)

-mmt

Unsupported

Enables MT ASE (MIPS only)

-MM

Supported

Similar to -MMD but also implies -E and writes to stdout by default

-mno-abicalls

Unsupported

Disables SVR4-style position-independent code (Mips only)

-mno-crc

Unsupported

Disallows use of CRC instructions (MIPS only)

-mno-embedded-data

Unsupported

Prohibits placing constants in the .rodata section instead of the .sdata if they meet the -G <size> threshold (MIPS)

-mno-execute-only

Unsupported

Allows generation of data access to code sections (ARM only)

-mno-extern-sdata

Unsupported

Prohibits assuming the externally defined data to be in the small data if it meets the -G <size> threshold (MIPS)

-mno-fix-cortex-a53-835769

Unsupported

Disallows workaround Cortex-A53 erratum 835769 (AArch64 only)

-mno-global-merge

Supported

Disables merging of globals

-mno-gpopt

Unsupported

Prohibits using GP relative accesses for symbols known to be in a small data section (MIPS)

-mno-hvx

Unsupported

Disables Hexagon Vector eXtensions.

-mno-implicit-float

Supported

Prohibits generating implicit floating-point instructions

-mno-incremental-linker-compatible

Supported

(integrated-as) Emits an object file that cannot be used with an incremental linker

-mno-local-sdata

Unsupported

Prohibits extending the -G behavior to object local data (MIPS)

-mno-long-calls

Supported

Restores the default behavior of not generating long calls

-mno-lvi-cfi

Supported on Host only

Disables control-flow mitigations for Load Value Injection (LVI)

-mno-lvi-hardening

Supported on Host only

Disables mitigations for Load Value Injection (LVI)

-mno-madd4

Supported

Disables the generation of 4-operand madd.s, madd.d, and related instructions

-mno-memops

Supported

Disables the generation of memop instructions

-mno-movt

Supported

Disallows usage of movt/movw pairs (ARM only)

-mno-ms-bitfields

Supported

Prohibits setting the default structure layout to be compatible with the Microsoft compiler standard

-mno-msa

Unsupported

Disables MSA ASE (MIPS only)

-mno-mt

Unsupported

Disables MT ASE (MIPS only)

-mno-neg-immediates

Supported

Disallows converting instructions with negative immediates to their negation or inversion

-mno-nvj

Supported

Disables generation of new-value jumps

-mno-nvs

Supported

Disables generation of new-value stores

-mno-outline

Unsupported

Disables function outlining (AArch64 only)

-mno-packets

Supported

Disables generation of instruction packets

-mno-relax

Supported

Disables linker relaxation

-mno-restrict-it

Unsupported

Allows generation of deprecated IT blocks for ARMv8. It is off by default for ARMv8 Thumb mode

-mno-save-restore

Unsupported

Disables usage of library calls for save and restore

-mno-seses

Unsupported

Disables speculative execution side-effect suppression (SESES)

-mno-stack-arg-probe

Supported

Disables stack probes which are enabled by default

-mno-tls-direct-seg-refs

Supported

Disables direct TLS access through segment registers

-mno-unaligned-access

Unsupported

Forces all memory accesses to be aligned (AArch32/AArch64 only)

-mno-wavefrontsize64

Supported

Asserts wavefront size to 32 (AMDGPU only)

-mnocrc

Unsupported

Disallows usage of CRC instructions (ARM only)

-mnop-mcount

Supported

Generates mcount/__fentry__ calls as nops. To activate, they need to be patched in

-mnvj

Supported

Enables generation of new-value jumps

-mnvs

Supported

Enables generation of new-value stores

-module-dependency-dir <value>

Unsupported

Specifies directory for dumping module dependencies

-module-file-info

Unsupported

Provides information about a particular module file

-momit-leaf-frame-pointer

Supported

Omits frame pointer setup for leaf functions

-moutline

Unsupported

Enables function outlining (AArch64 only)

-mpacked-stack

Unsupported

Asserts the usage of packed stack layout (SystemZ only)

-mpackets

Supported

Enables generation of instruction packets

-mpad-max-prefix-size= <value>

Supported

Specifies maximum number of prefixes to use for padding

-mpie-copy-relocations

Supported

Asserts the usage of copy relocations support for PIE builds

-mprefer-vector-width= <value>

Unsupported

Specifies preferred vector width for auto-vectorization. Default value: “none,” which allows target specific decisions.

-MP

Supported

Creates phony target for each dependency (other than the main file)

-mqdsp6-compat

Unsupported

Enables hexagon-qdsp6 backward compatibility

-MQ <value>

Supported

Specifies the name of the main file output to quote in depfile

-mrecord-mcount

Supported

Generates a __mcount_loc section entry for each fentry call

-mrelax-all

Supported

(integrated-as) Relaxes all machine instructions

-mrelax

Supported

Enables linker relaxation

-mrestrict-it

Unsupported

Disallows generation of deprecated IT blocks for ARMv8. It is on by default for ARMv8 Thumb mode.

-mrtd

Unsupported

Makes StdCall calling the default convention

-msave-restore

Unsupported

Enables using library calls for save and restore

-mseses

Unsupported

Enables speculative execution side effect suppression (SESES). Includes LVI control flow integrity mitigations.

-msign-return-address= <value>

Unsupported

Specifies the return address signing scope

-msmall-data-limit= <value>

Supported

Puts global and static data smaller than the specified limit into a special section

-msoft-float

Supported

Uses software floating point

-msram-ecc

Supported

Legacy option to specify SRAM ECC mode (AMDGPU only). Should use --offload-arch with sramecc+ instead.

-mstack-alignment= <value>

Unsupported

Sets the stack alignment

-mstack-arg-probe

Unsupported

Enables stack probes

-mstack-probe-size= <value>

Unsupported

Sets the stack probe size

-mstackrealign

Unsupported

Forces realign the stack at entry on every function

-msve-vector-bits= <value>

Unsupported

Specifies the size in bits of an SVE vector register. Defaults to the vector length agnostic value of “scalable” (AArch64 only).

-msvr4-struct-return

Unsupported

Returns small structs in registers (PPC32 only)

-mthread-model <value>

Supported

Specifies the thread model to use. Value: posix/single. Default: posix.

-mtls-direct-seg-refs

Supported

Enables direct TLS access through segment registers (default)

-mtls-size= <value>

Unsupported

Specifies the bit size of immediate TLS offsets (AArch64 ELF only). Value: 12 (for 4KB)/ 24 (for 16MB, default)/ 32 (for 4GB)/ 48 (for 256TB, needs -mcmodel=large).

-mtp= <value>

Unsupported

Specifies the thread pointer access method (AArch32/AArch64 only)

-mtune= <value>

Supported on Host only

Supported on X86 only. Otherwise accepted for compatibility with GCC.

-MT <value>

Unsupported

Specifies the name of main file output in depfile

-munaligned-access

Unsupported

Allows memory accesses to be unaligned (AArch32/AArch64 only)

-MV

Supported

Uses NMake/Jom format for the depfile

-mwavefrontsize64

Supported

Asserts wavefront size of 64 (AMDGPU only)

-mxnack

Supported

Legacy option to specify XNACK mode (AMDGPU only). Use --offload-arch with :xnack+ instead.

-M

Supported

Similar to -MD but also implies -E and writes to stdout by default

--no-cuda-include-ptx= <value>

Supported

Prohibits including PTX for the specified GPU architecture (e.g. sm_35) or “all”. May be specified more than once.

--no-cuda-version-check

Supported

Disallows erroring out if the detected version of the CUDA install is too low for the requested CUDA GPU architecture

-no-flang-libs

Supported

Prohibits linking against Flang libraries

--no-offload-arch= <value>

Supported

Removes CUDA/HIP offloading device architecture (e.g. sm_35, gfx906) from the list of devices to compile for. “all” resets the list to its default value

--no-system-header-prefix= <prefix>

Supported

Assumes no system header for all #include paths starting with the given <prefix>

-nobuiltininc

Supported

Disables built-in #include directories

-nogpuinc

Supported

Prohibits adding CUDA/HIP include paths and includes default CUDA/HIP wrapper header files

-nogpulib

Supported

Prohibits linking device library for CUDA/HIP device compilation

-nostdinc++

Unsupported

Disables standard #include directories for the C++ standard library

-ObjC++

Unsupported

Treats source input files as Objective-C++ inputs

-objcmt-atomic-property

Unsupported

Enables migration to “atomic” properties

-objcmt-migrate-all

Unsupported

Enables migration to modern ObjC

-objcmt-migrate-annotation

Unsupported

Enables migration to property and method annotations

-objcmt-migrate-designated-init

Unsupported

Enables migration to infer NS_DESIGNATED_INITIALIZER for initializer methods

-objcmt-migrate-instancetype

Unsupported

Enables migration to infer instancetype for method result type

-objcmt-migrate-literals

Unsupported

Enables migration to modern ObjC literals

-objcmt-migrate-ns-macros

Unsupported

Enables migration to NS_ENUM/NS_OPTIONS macros

-objcmt-migrate-property-dot-syntax

Unsupported

Enables migration of setter/getter messages to property-dot syntax

-objcmt-migrate-property

Unsupported

Enables migration to modern ObjC property

-objcmt-migrate-protocol-conformance

Unsupported

Enables migration to add protocol conformance on classes

-objcmt-migrate-readonly-property

Unsupported

Enables migration to modern ObjC readonly property

-objcmt-migrate-readwrite-property

Unsupported

Enables migration to modern ObjC readwrite property

-objcmt-migrate-subscripting

Unsupported

Enables migration to modern ObjC subscripting

-objcmt-ns-nonatomic-iosonly

Unsupported

Enables migration to use NS_NONATOMIC_IOSONLY macro for setting property’s “atomic” attribute

-objcmt-returns-innerpointer-property

Unsupported

Enables migration to annotate property with NS_RETURNS_INNER_POINTER

-objcmt-whitelist-dir-path= <value>

Unsupported

Modifies exclusively the files with the filename present in the given directory

-ObjC

Unsupported

Treats source input files as Objective-C inputs

--offload-arch= <value>

Supported

Specifies CUDA offloading device architecture (e.g. sm_35), or HIP offloading target ID in the form of a device architecture followed by target ID features delimited by a colon. Each target ID feature is a predefined string followed by a plus or minus sign (e.g. gfx908:xnack+:sramecc-). May be specified more than once.

-o <file>

Supported

Writes output to the given <file>

-parallel-jobs= <value>

Supported

Specifies the number of parallel jobs allowed

-pg

Supported

Enables mcount instrumentation

-pipe

Supported

Asserts using pipes between commands, when possible.

--precompile

Supported

Only precompiles the input

-print-effective-triple

Supported

Prints the effective target triple

-print-file-name= <file>

Supported

Prints the full library path of the given <file>

-print-ivar-layout

Unsupported

Enables Objective-C Ivar layout bitmap print trace

-print-libgcc-file-name

Supported

Prints the library path for the currently used compiler runtime library ("libgcc.a" or "libclang_rt.builtins.*.a")

-print-prog-name= <name>

Supported

Prints the full program path of the given <name>

-print-resource-dir

Supported

Prints the resource directory pathname

-print-search-dirs

Supported

Prints the paths used for finding libraries and programs

-print-supported-cpus

Supported

Prints the supported CPU models for the given target. If target is not specified, it prints the supported CPUs for the default target.

-print-target-triple

Supported

Prints the normalized target triple

-print-targets

Supported

Prints the registered targets

-pthread

Supported

Supports POSIX threads in the generated code

--ptxas-path= <value>

Unsupported

Specifies the path to ptxas (used for compiling CUDA code)

-P

Supported

Disables linemarker output in -E mode

-Qn

Supported

Prohibits emitting metadata containing compiler name and version

-Qunused-arguments

Supported

Prohibits emitting warning for unused driver arguments

-Qy

Supported

Emits metadata containing compiler name and version

-relocatable-pch

Supported

Allows to build a relocatable precompiled header

-rewrite-legacy-objc

Unsupported

Rewrites Legacy Objective-C source to C++

-rewrite-objc

Unsupported

Rewrites Objective-C source to C++

--rocm-device-lib-path= <value>

Supported

Specifies ROCm device library path. Alternative to rocm-path

--rocm-path= <value>

Supported

Specifies ROCm installation path that is used for finding and automatically linking required bitcode libraries

-Rpass-analysis= <value>

Supported

Reports transformation analysis by optimization passes whose names match the given POSIX regular expression

-Rpass-missed= <value>

Supported

Reports missed transformations by optimization passes whose names match the given POSIX regular expression

-Rpass= <value>

Supported

Reports transformations by optimization passes whose names match the given POSIX regular expression

-rtlib= <value>

Unsupported

Specifies the compiler runtime library to be used

-R <remark>

Unsupported

Enables the specified remark

-save-stats= <value>

Supported

Saves llvm statistics

-save-stats

Supported

Saves llvm statistics

-save-temps= <value>

Supported

Saves intermediate compilation results

-save-temps

Supported

Saves intermediate compilation results

-serialize-diagnostics= <value>

Supported

Serializes compiler diagnostics to the specified file

-shared-libsan

Unsupported

Dynamically links the sanitizer runtime

-static-flang-libs

Supported

Asserts linking using static Flang libraries

-static-libsan

Unsupported

Statically links the sanitizer runtime

-static-openmp

Supported

Asserts using the static host OpenMP runtime while linking

-std= <value>

Supported

Specifies the language standard to compile for.

-stdlib++-isystem <directory>

Supported

Specifies the directory to be used as the C++ standard library include path

-stdlib= <value>

Supported

Specifies the C++ standard library to be used

-sycl-std= <value>

Unsupported

Specifies the SYCL language standard to compile for

--system-header-prefix= <prefix>

Supported

Assumes all #include paths starting with the given <prefix> to include a system header

-S

Supported

Runs only preprocess and compilation steps

--target= <value>

Supported

Generates code for the given target

-Tbss <addr>

Supported

Sets the starting address of BSS to the given <addr>

-Tdata <addr>

Supported

Sets the starting address of DATA to the given <addr>

-time

Supported

Times individual commands

-traditional-cpp

Unsupported

Enables some traditional CPP emulation

-trigraphs

Supported

Processes trigraph sequences

-Ttext <addr>

Supported

Sets starting address of TEXT to the given <addr>

-T <script>

Unsupported

Specifies the given <script> as the linker script

-undef

Supported

Undefines all system defines

-unwindlib= <value>

Supported

Specifies the unwind library to be used

-U <macro>

Supported

Undefines the given <macro>

--verify-debug-info

Supported

Verifies the binary representation of the debug output

-verify-pch

Unsupported

Loads and verifies if a precompiled header file is stale

--version

Supported

Prints version information

-v

Supported

Shows commands to be run, and uses verbose output

-Wa, <arg>

Supported

Passes the comma-separated arguments in the given <arg> to the assembler

-Wdeprecated

Supported

Enables warnings for deprecated constructs and defines __DEPRECATED

-Wl, <arg>

Supported

Passes comma-separated arguments in <arg> to the linker.

-working-directory <value>

Supported

Resolves file paths relative to the specified directory

-Wp, <arg>

Supported

Passes comma-separated arguments in <arg> to the preprocessor

-W <warning>

Supported

Enables the specified warning

-w

Supported

Suppresses all warnings

-Xanalyzer <arg>

Supported

Passes <arg> to the static analyzer

-Xarch_device <arg>

Supported

Passes <arg> to the CUDA/HIP device compilation

-Xarch_host <arg>

Supported

Passes <arg> to the CUDA/HIP host compilation

-Xassembler <arg>

Supported

Passes <arg> to the assembler

-Xclang <arg>

Supported

Passes <arg> to the clang compiler

-Xcuda-fatbinary <arg>

Supported

Passes <arg> to fatbinary invocation

-Xcuda-ptxas <arg>

Supported

Passes <arg> to the ptxas assembler

-Xlinker <arg>

Supported

Passes <arg> to the linker

-Xopenmp-target= <triple> <arg>

Supported

Passes <arg> to the target offloading toolchain identified by <triple>

-Xopenmp-target <arg>

Supported

Passes <arg> to the target offloading toolchain

-Xpreprocessor <arg>

Supported

Passes <arg> to the preprocessor

-x <language>

Supported

Assumes subsequent input files to have the given type <language>

-z <arg>

Supported

Passes -z <arg> to the linker
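
For illustration only, a hedged example of combining several of the options listed above in a single HIP compilation follows; the gfx90a:xnack+ target and the saxpy.hip file name are assumptions, not part of the option reference:

amdclang++ -x hip --offload-arch=gfx90a:xnack+ -O3 -gdwarf-5 \
    -munsafe-fp-atomics -save-temps -o saxpy saxpy.hip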

HIP#

HIP is both AMD’s GPU programming language extension and the GPU runtime. This page introduces the HIP runtime and other HIP libraries and tools.

HIP Runtime#

The HIP Runtime is used to enable GPU acceleration for all HIP language based products.
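
As a brief, hedged illustration of the HIP runtime in use (not an official sample; the file name hip_probe.cpp and the launch geometry are arbitrary), the following program allocates device memory, launches a trivial kernel, and copies the result back:

#include <hip/hip_runtime.h>
#include <cstdio>

// Trivial kernel: each thread writes its global index into the output buffer.
__global__ void fill_index(int *out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = i;
}

int main() {
  int count = 0;
  hipGetDeviceCount(&count);
  printf("HIP devices found: %d\n", count);

  const int n = 256;
  int *d_out = nullptr;
  hipMalloc((void**)&d_out, n * sizeof(int));

  // Launch one thread per element; 64 threads per block.
  fill_index<<<(n + 63) / 64, 64>>>(d_out, n);

  int h_out[n];
  hipMemcpy(h_out, d_out, n * sizeof(int), hipMemcpyDeviceToHost);
  printf("h_out[42] = %d\n", h_out[42]);

  hipFree(d_out);
  return 0;
}

It can be compiled with, for example, hipcc hip_probe.cpp -o hip_probe.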

Porting tools#

HIPIFY assists with porting applications based on CUDA to the HIP Runtime. Supported CUDA APIs are documented here as well.

OpenMP Support in ROCm#

Introduction#

The ROCm™ installation includes an LLVM-based implementation that fully supports the OpenMP 4.5 standard and a subset of OpenMP 5.0, 5.1, and 5.2 standards. Fortran, C/C++ compilers, and corresponding runtime libraries are included. Along with host APIs, the OpenMP compilers support offloading code and data onto GPU devices. This document briefly describes the installation location of the OpenMP toolchain, example usage of device offloading, and usage of rocprof with OpenMP applications. The GPUs supported are the same as those supported by this ROCm release. See the list of supported GPUs in GPU and OS Support (Linux).

Installation#

The OpenMP toolchain is automatically installed as part of the standard ROCm installation and is available under /opt/rocm-{version}/llvm. The sub-directories are:

  • bin: Compilers (flang and clang) and other binaries.

  • examples: The usage section below shows how to compile and run these programs.

  • include: Header files.

  • lib: Libraries including those required for target offload.

  • lib-debug: Debug versions of the above libraries.

OpenMP: Usage#

The example programs can be compiled and run by pointing the environment variable ROCM_PATH to the ROCm install directory.

Example:

export ROCM_PATH=/opt/rocm-{version}
cd $ROCM_PATH/share/openmp-extras/examples/openmp/veccopy
sudo make run

Note

sudo is required since we are building inside the /opt directory. Alternatively, copy the files to your home directory first.

The above invocation of Make compiles and runs the program. Note the options that are required for target offload from an OpenMP program:

-fopenmp --offload-arch=<gpu-arch>

Note

The Makefile in the example above uses a more classical and verbose set of flags which can also be used:

-fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa

Obtain the value of gpu-arch by running the following command:

% /opt/rocm-{version}/bin/rocminfo | grep gfx

See the complete list of compiler command-line references here.
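
For reference, a self-contained program in the same spirit as the veccopy example can be built with the flags shown above. This is only a sketch; the gfx90a architecture and the veccopy.c file name are assumptions, and the actual gpu-arch should be obtained from rocminfo as described earlier:

#include <stdio.h>

int main() {
  const int n = 1024;
  int a[1024], b[1024];
  for (int i = 0; i < n; i++) b[i] = i;

  // Offload the copy loop to the GPU; data is mapped explicitly.
  #pragma omp target teams distribute parallel for map(to: b[0:n]) map(from: a[0:n])
  for (int i = 0; i < n; i++)
    a[i] = b[i];

  int errors = 0;
  for (int i = 0; i < n; i++)
    if (a[i] != i) errors++;
  printf("%s\n", errors ? "FAILED" : "PASSED");
  return 0;
}

$ROCM_PATH/llvm/bin/clang -fopenmp --offload-arch=gfx90a veccopy.c -o veccopy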

Using rocprof with OpenMP#

The following steps describe a typical workflow for using rocprof with OpenMP code compiled with AOMP:

  1. Run rocprof with the program command line:

    % rocprof <application> <args>
    

    This produces a results.csv file in the user’s current directory that shows basic stats such as kernel names, grid size, number of registers used, etc. The user can choose to specify the preferred output file name using the -o option.

  2. Add options for a detailed result:

    --stats: % rocprof --stats <application> <args>
    

    The stats option produces timestamps for the kernels. Look into the output CSV file for the field, DurationNs, which is useful in getting an understanding of the critical kernels in the code.

    Apart from --stats, the option --timestamp on produces a timestamp for the kernels.

  3. After learning about the required kernels, the user can take a detailed look at each one of them. rocprof has support for hardware counters: a set of basic and a set of derived ones. See the complete list of counters using options --list-basic and --list-derived. rocprof accepts either a text or an XML file as an input; a sketch of a text input file is shown below.

For more details on rocprof, refer to the ROCm Profiling Tools document on rocprof.
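
As a hedged sketch of the text input format (the counter names are only examples, and counter availability varies by GPU and rocprof version), a counter file and its use might look like the following:

pmc : Wavefronts VALUUtilization

% rocprof -i input.txt <application> <args>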

Using Tracing Options#

Prerequisite: When using the --sys-trace option, compile the OpenMP program with:

    -Wl,-rpath,/opt/rocm-{version}/lib -lamdhip64

The following tracing options are widely used to generate useful information:

  • --hsa-trace: This option is used to get a JSON output file with the HSA API execution traces and a flat profile in a CSV file.

  • --sys-trace: This allows programmers to trace both HIP and HSA calls. Since this option results in loading libamdhip64.so, follow the prerequisite as mentioned above.

A CSV and a JSON file are produced by the above trace options. The CSV file presents the data in a tabular format, and the JSON file can be visualized using Google Chrome at chrome://tracing/ or Perfetto. Navigate to Chrome or Perfetto and load the JSON file to see the timeline of the HSA calls.

For more details on tracing, refer to the ROCm Profiling Tools document on rocprof.

Environment Variables#

Environment Variable

Description

OMP_NUM_TEAMS

The implementation chooses the number of teams for kernel launch. The user can change this number for performance tuning using this environment variable, subject to implementation limits.

LIBOMPTARGET_KERNEL_TRACE

This environment variable is used to print useful statistics for device operations. Setting it to 1 and running the program emits the name of every kernel launched, the number of teams and threads used, and the corresponding register usage. Setting it to 2 additionally emits timing information for kernel launches and data transfer operations between the host and the device.

LIBOMPTARGET_INFO

This environment variable is used to print informational messages from the device runtime as the program executes. Users can request fine-grain information by setting it to the value of 1 or higher and can set the value of -1 for complete information.

LIBOMPTARGET_DEBUG

If a debug version of the device library is present, setting this environment variable to 1 and using that library emits further detailed debugging information about data transfer operations and kernel launch.

GPU_MAX_HW_QUEUES

This environment variable is used to set the number of HSA queues in the OpenMP runtime.
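
For example, to cap the number of teams and print per-kernel launch statistics for a single run (the binary name ./veccopy is only a placeholder):

export OMP_NUM_TEAMS=240
export LIBOMPTARGET_KERNEL_TRACE=1
./veccopy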

OpenMP: Features#

The OpenMP programming model is greatly enhanced with the following new features implemented in past releases.

Unified Shared Memory#

Unified Shared Memory (USM) provides a pointer-based approach to memory management. To implement USM, fulfill the following system requirements along with Xnack capability.

Prerequisites#
  • Linux Kernel versions above 5.14

  • Latest KFD driver packaged in ROCm stack

  • Xnack, as USM support can only be tested with applications compiled with Xnack capability

Xnack Capability#

When enabled, Xnack capability allows GPU threads to access CPU (system) memory, allocated with OS-allocators, such as malloc, new, and mmap. Xnack must be enabled both at compile- and run-time. To enable Xnack support at compile-time, the programmer should use

--offload-arch=gfx908:xnack+

Or, equivalently

--offload-arch=gfx908

Note

The second case is called Xnack-any and it is functionally equivalent to the first case.

At runtime, programmers enable Xnack functionality on a per-application basis using an environment variable:

HSA_XNACK=1

When Xnack support is not needed, applications can be built to maximize resource utilization using:

--offload-arch=gfx908:xnack-

At runtime, the HSA_XNACK environment variable can be set to 0, as Xnack functionality is not needed.

Unified Shared Memory Pragma#

This OpenMP pragma is available on MI200 through xnack+ support.

omp requires unified_shared_memory

As stated in the OpenMP specifications, this pragma makes the map clause on target constructs optional. By default, on MI200, all memory allocated on the host is fine grain. Using the map clause on a target clause is allowed, which transforms the access semantics of the associated memory to coarse grain.

A simple program demonstrating the use of this feature is:
$ cat parallel_for.cpp
#include <stdlib.h>
#include <stdio.h>

#define N 64
#pragma omp requires unified_shared_memory
int main() {
  int n = N;
  int *a = new int[n];
  int *b = new int[n];

  for(int i = 0; i < n; i++)
    b[i] = i;

  #pragma omp target parallel for map(to:b[:n])
  for(int i = 0; i < n; i++)
    a[i] = b[i];

  for(int i = 0; i < n; i++)
    if(a[i] != i)
      printf("error at %d: expected %d, got %d\n", i, i+1, a[i]);

  return 0;
}
$ clang++ -O2 -target x86_64-pc-linux-gnu -fopenmp --offload-arch=gfx90a:xnack+ parallel_for.cpp
$ HSA_XNACK=1 ./a.out

In the above code example, pointer “a” is not mapped in the target region, while pointer “b” is. Both are valid pointers on the GPU device and passed by-value to the kernel implementing the target region. This means the pointer values on the host and the device are the same.

The difference between the memory pages pointed to by these two variables is that the pages pointed by “a” are in fine-grain memory, while the pages pointed to by “b” are in coarse-grain memory during and after the execution of the target region. This is accomplished in the OpenMP runtime library with calls to the ROCR runtime to set the pages pointed by “b” as coarse grain.

OMPT Target Support#

The OpenMP runtime in ROCm implements a subset of the OMPT device APIs, as described in the OpenMP specification document. These APIs allow first-party tools to examine the profile and kernel traces that execute on a device. A tool can register callbacks for data transfer and kernel dispatch entry points or use APIs to start and stop tracing for device-related activities such as data transfer and kernel dispatch timings and associated metadata. If device tracing is enabled, trace records for device activities are collected during program execution and returned to the tool using the APIs described in the specification.

The following example demonstrates how a tool uses the supported OMPT target APIs. The README in /opt/rocm/llvm/examples/tools/ompt outlines the steps to be followed, and the provided example can be run as shown below:

cd $ROCM_PATH/share/openmp-extras/examples/tools/ompt/veccopy-ompt-target-tracing
sudo make run

The file veccopy-ompt-target-tracing.c simulates how a tool initiates device activity tracing. The file callbacks.h shows the callbacks registered and implemented by the tool.
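
The shipped example is the authoritative reference. Purely as a hedged sketch of the general mechanism described above (this uses the standard OpenMP 5.x OMPT tool entry points, not ROCm-specific APIs), a minimal first-party tool that logs target-region begin and end events could look like:

#include <stdio.h>
#include <omp-tools.h>

// Callback fired on entry/exit of every target region.
static void on_target(ompt_target_t kind, ompt_scope_endpoint_t endpoint,
                      int device_num, ompt_data_t *task_data,
                      ompt_id_t target_id, const void *codeptr_ra) {
  printf("target region %s on device %d (id=%lu)\n",
         endpoint == ompt_scope_begin ? "begin" : "end",
         device_num, (unsigned long)target_id);
}

static int tool_initialize(ompt_function_lookup_t lookup, int initial_device_num,
                           ompt_data_t *tool_data) {
  ompt_set_callback_t set_callback =
      (ompt_set_callback_t)lookup("ompt_set_callback");
  set_callback(ompt_callback_target, (ompt_callback_t)on_target);
  return 1; // non-zero keeps the tool active
}

static void tool_finalize(ompt_data_t *tool_data) {}

// Entry point discovered by the OpenMP runtime when the tool is linked in or preloaded.
ompt_start_tool_result_t *ompt_start_tool(unsigned int omp_version,
                                          const char *runtime_version) {
  static ompt_start_tool_result_t result = {tool_initialize, tool_finalize, {0}};
  return &result;
}

If built as a shared library, such a tool can be activated through the standard OMP_TOOL_LIBRARIES environment variable, or it can be linked directly into the application.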

Floating Point Atomic Operations#

The MI200-series GPUs support the generation of hardware floating-point atomics using the OpenMP atomic pragma. The support includes single- and double-precision floating-point atomic operations. The programmer must ensure that the memory subjected to the atomic operation is in coarse-grain memory by mapping it explicitly with the help of map clauses when not implicitly mapped by the compiler as per the OpenMP specifications. This makes these hardware floating-point atomic instructions “fast,” as they are faster than using a default compare-and-swap loop scheme, but at the same time “unsafe,” as they are not supported on fine-grain memory. The operation in unified_shared_memory mode also requires programmers to map the memory explicitly when not implicitly mapped by the compiler.

To request fast floating-point atomic instructions at the file level, use compiler flag -munsafe-fp-atomics or a hint clause on a specific pragma:

double a = 0.0;
#pragma omp atomic hint(AMD_fast_fp_atomics)
a = a + 1.0;

NOTE AMD_unsafe_fp_atomics is an alias for AMD_fast_fp_atomics, and AMD_safe_fp_atomics is implemented with a compare-and-swap loop.

To disable the generation of fast floating-point atomic instructions at the file level, build using the option -msafe-fp-atomics or use a hint clause on a specific pragma:

double a = 0.0;
#pragma omp atomic hint(AMD_safe_fp_atomics)
a = a + 1.0;

The hint clause value always has precedence over the compiler flag, which allows programmers to create atomic constructs with a different behavior than the rest of the file.

See the example below, where the user builds the program using -msafe-fp-atomics to select a file-wide “safe atomic” compilation. However, the fast atomics hint clause over variable “a” takes precedence and operates on “a” using a fast/unsafe floating-point atomic, while the variable “b” in the absence of a hint clause is operated upon using safe floating-point atomics as per the compiler flag.

double a = 0.0;
#pragma omp atomic hint(AMD_fast_fp_atomics)
a = a + 1.0;

double b = 0.0;
#pragma omp atomic
b = b + 1.0;

Address Sanitizer (ASan) Tool#

Address Sanitizer is a memory error detector tool utilized by applications to detect various errors ranging from spatial issues such as out-of-bound access to temporal issues such as use-after-free. The AOMP compiler supports ASan for AMD GPUs with applications written in both HIP and OpenMP.

Features Supported on Host Platform (Target x86_64):

  • Use-after-free

  • Buffer overflows

  • Heap buffer overflow

  • Stack buffer overflow

  • Global buffer overflow

  • Use-after-return

  • Use-after-scope

  • Initialization order bugs

Features Supported on AMDGPU Platform (amdgcn-amd-amdhsa):

  • Heap buffer overflow

  • Global buffer overflow

Software (Kernel/OS) Requirements: Unified Shared Memory support with Xnack capability. See the section on Unified Shared Memory for prerequisites and details on Xnack.

Example:

  • Heap buffer overflow

void  main() {
.......  // Some program statements
.......  // Some program statements
#pragma omp target map(to : A[0:N], B[0:N]) map(from: C[0:N])
{
#pragma omp parallel for
    for(int i =0 ; i < N; i++){
    C[i+10] = A[i] + B[i];
  }   // end of for loop
}
.......   // Some program statements
}// end of main

See the complete sample code for heap buffer overflow here.

  • Global buffer overflow

#pragma omp declare target
   int A[N],B[N],C[N];
#pragma omp end declare target
void main(){
......  // some program statements
......  // some program statements
#pragma omp target data map(to:A[0:N],B[0:N]) map(from: C[0:N])
{
#pragma omp target update to(A,B)
#pragma omp target parallel for
for(int i=0; i<N; i++){
    C[i]=A[i*100]+B[i+22];
} // end of for loop
#pragma omp target update from(C)
}
........  // some program statements
} // end of main

See the complete sample code for global buffer overflow here.
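
The build recipes shipped with the samples are the reference. As a rough, hedged sketch (the gfx90a:xnack+ target and the source file name are assumptions), an ASan-instrumented OpenMP offload build and run might look like:

clang++ -fopenmp --offload-arch=gfx90a:xnack+ -fsanitize=address \
    -shared-libsan -g global_overflow.cpp -o global_overflow
HSA_XNACK=1 ./global_overflow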

No-loop Kernel Generation#

The No-loop kernel generation feature optimizes the compiler performance by generating a specialized kernel for certain OpenMP Target Constructs such as target teams distribute parallel for. The specialized kernel generation assumes that every thread executes a single iteration of the user loop, which implies that the runtime launches a total number of GPU threads equal to or greater than the iteration space size of the target region loop. This allows the compiler to generate code for the loop body without an enclosing loop, resulting in reduced control-flow complexity and potentially better performance.

To enable the generation of the specialized kernel, follow these guidelines:

  • Do not specify teams, threads, and schedule-related environment variables. The num_teams or a thread_limit clause in an OpenMP target construct acts as an override and prevents the generation of the specialized kernel. As the user is unable to specify the number of teams and threads used within target regions in the absence of the above-mentioned environment variables, the runtime will select the best values for the launch configuration based on runtime knowledge of the program.

  • Assert the absence of the above-mentioned environment variables by adding the command-line option -fopenmp-target-ignore-env-vars. This option also allows programmers to enable the No-loop functionality at lower optimization levels.

  • Also, the No-loop functionality is automatically enabled when -O3 or -Ofast is used for compilation. To disable this feature, use -fno-openmp-target-ignore-env-vars.

Note The compiler might not generate the No-loop kernel in certain scenarios where the performance improvement is not substantial.

Cross-Team Optimized Reductions#

In scenarios where a No-loop kernel is generated but the OpenMP construct has a reduction clause, the compiler may generate optimized code utilizing efficient Cross-Team (Xteam) communication. No separate user option is required, and there is a significant performance improvement with Xteam reduction. New APIs for Xteam reduction are implemented in the device runtime, and clang generates these APIs automatically.
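
For example, a dot-product loop written as a single combined construct with a reduction clause, compiled at -O3 with no num_teams/thread_limit clauses and no teams- or threads-related environment variables set, is the kind of pattern these specializations target. This is a sketch; whether the specialized kernel is actually generated depends on the compiler's analysis:

#include <stdio.h>
#define N 100000

int main() {
  static double x[N], y[N];
  for (int i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

  double dot = 0.0;
  // Single combined construct with a reduction clause; no num_teams/thread_limit,
  // so the runtime is free to pick the launch configuration.
  #pragma omp target teams distribute parallel for map(to: x[0:N], y[0:N]) reduction(+: dot)
  for (int i = 0; i < N; i++)
    dot += x[i] * y[i];

  printf("dot = %f (expected %f)\n", dot, 2.0 * N);
  return 0;
}

An example compile command, with the architecture as an assumption, is clang -O3 -fopenmp --offload-arch=gfx90a dot.c -o dot.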

Math Libraries#

AMD provides various math domain and support libraries as part of ROCm.

rocLIB vs. hipLIB#

Several libraries are prefixed with either “roc” or “hip”. The rocLIB variants (such as rocRAND, rocBLAS) are tested and optimized for AMD hardware using supported toolchains. The hipLIB variants (such as hipRAND, hipBLAS) are compatibility layers that provide an interface akin to their cuLIB (such as cuRAND, cuBLAS) variants while performing static dispatching of API calls to the appropriate vendor libraries as their back-ends. Due to their static dispatch nature, support for either vendor is decided at compile-time of the hipLIB in question. For dynamic dispatch between vendor implementations, refer to the Orochi library.

Linear Algebra Libraries#

ROCm libraries for linear algebra are as follows:

rocBLAS is an AMD GPU optimized library for BLAS (Basic Linear Algebra Subprograms).

hipBLAS is a compatibility layer for GPU accelerated BLAS optimized for AMD GPUs via rocBLAS and rocSOLVER. hipBLAS allows for a common interface for other GPU BLAS libraries.

hipBLASLt

hipBLASLt is a library that provides general matrix-matrix operations with a flexible API and extends functionality beyond the traditional BLAS library. hipBLASLt exposes APIs in the HIP programming language with an underlying optimized generator as a back-end kernel provider.

rocALUTION is a sparse linear algebra library with focus on exploring fine-grained parallelism on top of AMD’s ROCm runtime and toolchains, targeting modern CPU and GPU platforms.

rocSOLVER provides a subset of LAPACK (Linear Algebra Package) functionality on the ROCm platform.

hipSOLVER is a LAPACK marshalling library supporting both rocSOLVER and cuSOLVER as backends whilst exporting a unified interface.

rocSPARSE is a library to provide BLAS for sparse computations.

hipSPARSE is a marshalling library to provide sparse BLAS functionality, supporting both rocSPARSE and cuSPARSE as backends.

Fast Fourier Transforms#

ROCm libraries for FFT are as follows:

rocFFT is an AMD GPU optimized library for FFT.

hipFFT is a compatibility layer for GPU accelerated FFT optimized for AMD GPUs using rocFFT. hipFFT allows for a common interface for other non AMD GPU FFT libraries.

Random Numbers#

rocRAND is an AMD GPU optimized library for pseudo-random number generators (PRNG).

hipRAND

hipRAND is a compatibility layer for GPU accelerated pseudo-random number generation (PRNG) optimized for AMD GPUs using rocRAND. hipRAND allows for a common interface for other non AMD GPU PRNG libraries.

C++ Primitive Libraries#

ROCm template libraries for algorithms are as follows:

rocPRIM is an AMD GPU optimized template library of algorithm primitives, like transforms, reductions, scans, etc. It also serves as a common back-end for similar libraries found inside ROCm.

rocThrust is a template library of algorithm primitives with a Thrust-compatible interface. Their CPU back-ends are identical, while the GPU back-end calls into rocPRIM.

hipCUB is a template library of algorithm primitives with a CUB-compatible interface. Its back-end is rocPRIM.

Communication Libraries#

RCCL (pronounced “Rickle”) is a stand-alone library of standard collective communication routines for GPUs, implementing all-reduce, all-gather, reduce, broadcast, reduce-scatter, gather, scatter, and all-to-all. The collective operations are implemented using ring and tree algorithms and have been optimized for throughput and latency.

AI Libraries#

MIOpen is AMD’s library for high performance machine learning primitives.

Composable Kernel

Composable Kernel: Performance Portable Programming Model for Machine Learning Tensor Operators

  • Documentation

Computer Vision#

MIVisionX toolkit is a set of comprehensive computer vision and machine intelligence libraries, utilities, and applications bundled into a single toolkit. AMD MIVisionX also delivers a highly optimized open-source implementation of the Khronos OpenVX™ and OpenVX™ Extensions.

rocAL

The AMD ROCm Augmentation Library (rocAL) is designed to efficiently decode and process images and videos from a variety of storage formats and modify them through a processing graph programmable by the user. rocAL currently provides a C API.

  • Documentation

Management Tools#

AMD SMI

The AMD System Management Interface Library, or AMD SMI library, is a C library for Linux that provides a user space interface for applications to monitor and control AMD devices.

ROCm SMI

This tool acts as a command line interface for manipulating and monitoring the AMD GPU kernel, and is intended to replace and deprecate the existing rocm_smi.py CLI tool. It uses ctypes to call the rocm_smi_lib API.

The ROCm™ Data Center Tool simplifies the administration and addresses key infrastructure challenges in AMD GPUs in cluster and data center environments.

Validation Tools#

The ROCm Validation Suite is a system administrator’s and cluster manager’s tool for detecting and troubleshooting common problems affecting AMD GPU(s) running in a high-performance computing environment, enabled using the ROCm software stack on a compatible platform.

TransferBench

TransferBench is a simple utility capable of benchmarking simultaneous transfers between user-specified devices (CPUs/GPUs).

  • Documentation

  • Changelog

  • Examples

All Explanation Material#

Compiler Nomenclature

ROCm ships multiple compilers of varying origins and purposes. This article disambiguates compiler naming used throughout the documentation.

Using CMake

ROCm components ship with 1st party CMake support. This article details how that support works and how to use it.

Linux Folder Structure Reorganization

ROCm™ packages have adopted the Linux foundation file system hierarchy standard to ensure ROCm components follow open source conventions for Linux-based distributions.

GPU Isolation Techniques

Restricting the access of applications to a subset of GPUs, also known as GPU isolation, allows users to hide GPU resources from programs.

GPU Architectures

AMD documentation around architectural details from both the CDNA and RDNA product lines.

ROCm Compilers Disambiguation#

ROCm ships multiple compilers of varying origins and purposes. This article disambiguates compiler naming used throughout the documentation.

Compiler Terms#

Term

Description

amdclang++

Clang/LLVM-based compiler that is part of rocm-llvm package. The source code is available at https://github.com/RadeonOpenCompute/llvm-project.

AOCC

Closed-source clang-based compiler that includes additional CPU optimizations. Offered as part of ROCm via the rocm-llvm-alt package. For details, see https://developer.amd.com/amd-aocc/.

HIP-Clang

Informal term for the amdclang++ compiler

HIPify

Tools including hipify-clang and hipify-perl, used to automatically translate CUDA source code into portable HIP C++. The source code is available at https://github.com/ROCm-Developer-Tools/HIPIFY

hipcc

HIP compiler driver. A utility that invokes clang or nvcc depending on the target and passes the appropriate include and library options for the target compiler and HIP infrastructure. The source code is available at https://github.com/ROCm-Developer-Tools/HIPCC.

ROCmCC

Clang/LLVM-based compiler. ROCmCC in itself is not a binary but refers to the overall compiler.
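
For instance, the hipcc driver described above is typically invoked directly on HIP sources; the target architecture and file names below are placeholders:

hipcc --offload-arch=gfx90a -O2 vector_add.hip -o vector_add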

Using CMake#

Most components in ROCm support CMake. Projects depending on header-only or library components typically require CMake 3.5 or higher whereas those wanting to make use of CMake’s HIP language support will require CMake 3.21 or higher.

Finding Dependencies#

Note

For a complete reference on how to deal with dependencies in CMake, refer to the CMake docs on find_package and the Using Dependencies Guide to get an overview of CMake’s related facilities.

In short, CMake supports finding dependencies in two ways:

  • In Module mode, it consults a file Find<PackageName>.cmake which tries to find the component in typical install locations and layouts. CMake ships a few dozen such scripts, but users and projects may ship them as well.

  • In Config mode, it locates a file named <packagename>-config.cmake or <PackageName>Config.cmake which describes the installed component in all regards needed to consume it.

ROCm predominantly relies on Config mode, one notable exception being the Module driving the compilation of HIP programs on Nvidia runtimes. As such, when dependencies are not found in standard system locations, one either has to instruct CMake to search for package config files in additional folders using the CMAKE_PREFIX_PATH variable (a semi-colon separated list of filesystem paths), or use the <PackageName>_ROOT variable on a project-specific basis.

There are nearly a dozen ways to set these variables. One may be more convenient than the other depending on your workflow. Conceptually, the simplest is adding it to your CMake configuration command on the command line via -D CMAKE_PREFIX_PATH=.... AMD packaged ROCm installs can typically be added to the config file search paths such as:

  • Windows: -D CMAKE_PREFIX_PATH=${env:HIP_PATH}

  • Linux: -D CMAKE_PREFIX_PATH=/opt/rocm

ROCm provides the respective config-file packages, and this enables find_package to be used directly. ROCm does not require any Find module as the config-file packages are shipped with the upstream projects, such as rocPRIM and other ROCm libraries.

For a complete guide on where and how ROCm may be installed on a system, refer to the installation guides in these docs (Linux).
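
Putting the above together, a typical out-of-source configure and build against an AMD-packaged ROCm on Linux might look like this (the source and build directory layout is an assumption):

cmake -S . -B build -D CMAKE_PREFIX_PATH=/opt/rocm
cmake --build build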

Using HIP in CMake#

ROCm components providing a C/C++ interface support being consumed using any C/C++ toolchain that CMake knows how to drive. ROCm also supports CMake’s HIP language features, allowing users to program using the HIP single-source programming model. When a program (or translation-unit) uses the HIP API without compiling any GPU device code, HIP can be treated in CMake as a simple C/C++ library.

Using the HIP single-source programming model#

Source code written in the HIP dialect of C++ typically uses the .hip extension. When the HIP CMake language is enabled, it will automatically associate such source files with the HIP toolchain being used.

cmake_minimum_required(VERSION 3.21) # HIP language support requires 3.21
cmake_policy(VERSION 3.21.3...3.27)
project(MyProj LANGUAGES HIP)
add_executable(MyApp Main.hip)

Should you have existing CUDA code that is from the source-compatible subset of HIP, you can tell CMake that despite their .cu extension, they are HIP sources. Do note that this mostly facilitates compiling kernel code-only source files, as host-side CUDA API won’t compile in this fashion.

add_library(MyLib MyLib.cu)
set_source_files_properties(MyLib.cu PROPERTIES LANGUAGE HIP)

CMake itself only hosts part of the HIP language support, such as defining HIP-specific properties, while the other half ships with the HIP implementation, such as ROCm. CMake will search for a file hip-lang-config.cmake describing how the properties defined by CMake translate to toolchain invocations. If one installs ROCm using non-standard methods or layouts and CMake can’t locate this file or detect parts of the SDK, there’s a catch-all, last-resort variable consulted for locating this file, -D CMAKE_HIP_COMPILER_ROCM_ROOT:PATH=, which should be set to the root of the ROCm installation.

If the user doesn’t provide a semi-colon delimited list of device architectures via CMAKE_HIP_ARCHITECTURES, CMake will select a sensible default. However, if a user knows which devices they wish to target, it is advised to set this variable explicitly.

Consuming ROCm C/C++ Libraries#

Libraries such as rocBLAS, rocFFT, MIOpen, etc. behave as C/C++ libraries. Illustrated in the example below is a C++ application using MIOpen from CMake. It calls find_package(miopen), which provides the MIOpen imported target. This can be linked with target_link_libraries:

cmake_minimum_required(VERSION 3.5) # find_package(miopen) requires 3.5
cmake_policy(VERSION 3.5...3.27)
project(MyProj LANGUAGES CXX)
find_package(miopen)
add_library(MyLib ...)
target_link_libraries(MyLib PUBLIC MIOpen)

Note

Most libraries are designed as host-only API, so using a GPU device compiler is not necessary for downstream projects unless they use GPU device code.

Consuming the HIP API in C++ code#

Use the HIP API without compiling the GPU device code. As there is no GPU code, any C or C++ compiler can be used. The find_package(hip) provides the hip::host imported target to use HIP in this context.

cmake_minimum_required(VERSION 3.5) # find_package(hip) requires 3.5
cmake_policy(VERSION 3.5...3.27)
project(MyProj LANGUAGES CXX)
find_package(hip REQUIRED)
add_executable(MyApp ...)
target_link_libraries(MyApp PRIVATE hip::host)

Compiling device code in C++ language mode#

Attention

The workflow detailed here is considered legacy and is shown for understanding’s sake. It pre-dates the existence of HIP language support in CMake. If source code has HIP device code in it, it is a HIP source file and should be compiled as such. Only resort to the method below if your HIP-enabled CMake codepath can’t mandate CMake version 3.21.

If code uses the HIP API and compiles GPU device code, it requires using a device compiler. The compiler for CMake can be set using either the CMAKE_C_COMPILER and CMAKE_CXX_COMPILER variable or using the CC and CXX environment variables. This can be set when configuring CMake or put into a CMake toolchain file. The device compiler must be set to a compiler that supports AMD GPU targets, which is usually Clang.

The find_package(hip) provides the hip::device imported target to add all the flags necessary for device compilation.

cmake_minimum_required(VERSION 3.8) # cxx_std_11 requires 3.8
cmake_policy(VERSION 3.8...3.27)
project(MyProj LANGUAGES CXX)
find_package(hip REQUIRED)
add_library(MyLib ...)
target_link_libraries(MyLib PRIVATE hip::device)
target_compile_features(MyLib PRIVATE cxx_std_11)

Note

Compiling for the GPU device requires at least C++11.

This project can then be configured with, for example:

  • Windows: cmake -D CMAKE_CXX_COMPILER:PATH=${env:HIP_PATH}\bin\clang++.exe

  • Linux: cmake -D CMAKE_CXX_COMPILER:PATH=/opt/rocm/bin/amdclang++

These use the device compiler provided by the binary packages of the ROCm HIP SDK and repo.radeon.com, respectively.

When using the CXX language support to compile HIP device code, selecting the target GPU architectures is done via setting the GPU_TARGETS variable. CMAKE_HIP_ARCHITECTURES only exists when the HIP language is enabled. By default, this is set to some subset of the currently supported architectures of AMD ROCm. It can be set to, for example, -D GPU_TARGETS="gfx1032;gfx1035".

ROCm CMake Packages#

Component

Package

Targets

HIP

hip

hip::host, hip::device

rocPRIM

rocprim

roc::rocprim

rocThrust

rocthrust

roc::rocthrust

hipCUB

hipcub

hip::hipcub

rocRAND

rocrand

roc::rocrand

rocBLAS

rocblas

roc::rocblas

rocSOLVER

rocsolver

roc::rocsolver

hipBLAS

hipblas

roc::hipblas

rocFFT

rocfft

roc::rocfft

hipFFT

hipfft

hip::hipfft

rocSPARSE

rocsparse

roc::rocsparse

hipSPARSE

hipsparse

roc::hipsparse

rocALUTION

rocalution

roc::rocalution

RCCL

rccl

rccl

MIOpen

miopen

MIOpen
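
As a hedged illustration of the package and imported target names listed above, a host-only project consuming rocBLAS and rocFFT might contain the following (main.cpp is a placeholder):

cmake_minimum_required(VERSION 3.5)
project(MyMathApp LANGUAGES CXX)

# Both packages ship config files under the ROCm install prefix,
# found via CMAKE_PREFIX_PATH (e.g. /opt/rocm).
find_package(rocblas REQUIRED)
find_package(rocfft REQUIRED)

add_executable(MyMathApp main.cpp)
target_link_libraries(MyMathApp PRIVATE roc::rocblas roc::rocfft)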

Using CMake Presets#

CMake command lines, depending on how specific users like to be when compiling code, can grow to unwieldy lengths. This is the primary reason why projects tend to bake script snippets into their build definitions controlling compiler warning levels, changing CMake defaults (CMAKE_BUILD_TYPE or BUILD_SHARED_LIBS just to name a few) and all sorts of anti-patterns, all in the name of convenience.

Load on the command-line interface (CLI) starts immediately with selecting a toolchain, the set of utilities used to compile programs. To ease some of the toolchain-related pains, CMake does consult the CC and CXX environment variables when setting a default CMAKE_C[XX]_COMPILER respectively, but that is just the tip of the iceberg. There’s a fair number of variables related to just the toolchain itself (typically supplied using toolchain files), and then we still haven’t talked about user preference or project-specific options.

IDEs supporting CMake (Visual Studio, Visual Studio Code, CLion, etc.) all came up with their own way to register command-line fragments of different purpose in a setup’n’forget fashion for quick assembly using graphical front-ends. This is all nice, but configurations aren’t portable, nor can they be reused in Continuous Integration (CI) pipelines. CMake has condensed existing practice into a portable JSON format that works in all IDEs and can be invoked from any command line. This is CMake Presets.

There are two types of preset files: one supplied by the project, called CMakePresets.json, which is meant to be committed to version control and typically used to drive CI; and one meant for the user to provide, called CMakeUserPresets.json, typically used to house user preferences and adapt the build to the user’s environment. These JSON files are allowed to include other JSON files, and the user presets always implicitly include the non-user variant.

Using HIP with presets#

Following is an example CMakeUserPresets.json file which actually compiles the amd/rocm-examples suite of sample applications on a typical ROCm installation:

{
  "version": 3,
  "cmakeMinimumRequired": {
    "major": 3,
    "minor": 21,
    "patch": 0
  },
  "configurePresets": [
    {
      "name": "layout",
      "hidden": true,
      "binaryDir": "${sourceDir}/build/${presetName}",
      "installDir": "${sourceDir}/install/${presetName}"
    },
    {
      "name": "generator-ninja-multi-config",
      "hidden": true,
      "generator": "Ninja Multi-Config"
    },
    {
      "name": "toolchain-makefiles-c/c++-amdclang",
      "hidden": true,
      "cacheVariables": {
        "CMAKE_C_COMPILER": "/opt/rocm/bin/amdclang",
        "CMAKE_CXX_COMPILER": "/opt/rocm/bin/amdclang++",
        "CMAKE_HIP_COMPILER": "/opt/rocm/bin/amdclang++"
      }
    },
    {
      "name": "clang-strict-iso-high-warn",
      "hidden": true,
      "cacheVariables": {
        "CMAKE_C_FLAGS": "-Wall -Wextra -pedantic",
        "CMAKE_CXX_FLAGS": "-Wall -Wextra -pedantic",
        "CMAKE_HIP_FLAGS": "-Wall -Wextra -pedantic"
      }
    },
    {
      "name": "ninja-mc-rocm",
      "displayName": "Ninja Multi-Config ROCm",
      "inherits": [
        "layout",
        "generator-ninja-multi-config",
        "toolchain-makefiles-c/c++-amdclang",
        "clang-strict-iso-high-warn"
      ]
    }
  ],
  "buildPresets": [
    {
      "name": "ninja-mc-rocm-debug",
      "displayName": "Debug",
      "configuration": "Debug",
      "configurePreset": "ninja-mc-rocm"
    },
    {
      "name": "ninja-mc-rocm-release",
      "displayName": "Release",
      "configuration": "Release",
      "configurePreset": "ninja-mc-rocm"
    },
    {
      "name": "ninja-mc-rocm-debug-verbose",
      "displayName": "Debug (verbose)",
      "configuration": "Debug",
      "configurePreset": "ninja-mc-rocm",
      "verbose": true
    },
    {
      "name": "ninja-mc-rocm-release-verbose",
      "displayName": "Release (verbose)",
      "configuration": "Release",
      "configurePreset": "ninja-mc-rocm",
      "verbose": true
    }
  ],
  "testPresets": [
    {
      "name": "ninja-mc-rocm-debug",
      "displayName": "Debug",
      "configuration": "Debug",
      "configurePreset": "ninja-mc-rocm",
      "execution": {
        "jobs": 0
      }
    },
    {
      "name": "ninja-mc-rocm-release",
      "displayName": "Release",
      "configuration": "Release",
      "configurePreset": "ninja-mc-rocm",
      "execution": {
        "jobs": 0
      }
    }
  ]
}
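
With a preset file such as the one above next to the project's top-level CMakeLists.txt, the configure, build, and test presets can be invoked from any command line, for example:

cmake --preset ninja-mc-rocm
cmake --build --preset ninja-mc-rocm-release
ctest --preset ninja-mc-rocm-release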

Note

Getting presets to work reliably on Windows requires some CMake improvements and/or support from compiler vendors. (Refer to Add support to the Visual Studio generators and Sourcing environment scripts.)

Linux Folder Structure Reorganization#

Introduction#

ROCm™ packages have adopted the Linux Foundation Filesystem Hierarchy Standard to ensure ROCm components follow open source conventions for Linux-based distributions. The following is the proposed ROCm file structure.

/opt/rocm-<ver>
    | -- bin
         | -- all public binaries
    | -- lib
         | -- lib<soname>.so->lib<soname>.so.major->lib<soname>.so.major.minor.patch
              (public libraries to link with applications)
         | -- <component>
              | -- architecture dependent libraries and binaries used internally by components
         | -- cmake
              | -- <component>
                   | --<component>-config.cmake
    | -- libexec
         | -- <component>
              | -- non ISA/architecture independent executables used internally by components
    | -- include
         | -- <component>
              | -- public header files
    | -- share
         | -- html
              | -- <component>
                   | -- html documentation
         | -- info
              | -- <component>
                   | -- info files
         | -- man
              | -- <component>
                   | -- man pages
         | -- doc
              | -- <component>
                   | -- license files
         | -- <component>
              | -- samples
              | -- architecture independent misc files

Changes from earlier ROCm versions#

With the file reorganization, ROCm has a leaner structure. The following table compares the new and old folder structures.

 ______________________________________________________
|  New File Structure         |  Old File Structure    |
|_____________________________|________________________|
| /opt/rocm-<ver>             | /opt/rocm-<ver>        |
|     | -- bin                |     | -- bin           |
|     | -- lib                |     | -- lib           |
|          | -- cmake         |     | -- include       |
|     | -- libexec            |     | -- <component_1> |
|     | -- include            |          | -- bin      |
|          | -- <component_1> |          | -- cmake    |
|     | -- share              |          | -- doc      |
|          | -- html          |          | -- lib      |
|          | -- info          |          | -- include  |
|          | -- man           |          | -- samples  |
|          | -- doc           |     | -- <component_n> |
|          | -- <component_1> |          | -- bin      |
|               | -- samples  |          | -- cmake    |
|               | -- ..       |          | -- doc      |
|          | -- <component_n> |          | -- lib      |
|               | -- samples  |          | -- include  |
|               | -- ..       |          | -- samples  |
|______________________________________________________|

ROCm File reorganization transition plan#

The new file organization for ROCm was first introduced in the ROCm v5.2 release. Backward compatibility was put in place to make sure users had a chance to adapt their applications that use ROCm. ROCm has moved header files and libraries to their new locations as indicated in the above structure and included symbolic links and wrapper header files in the old locations for backward compatibility.

Wrapper header files#

Wrapper header files are placed in the old location (/opt/rocm-xxx/<component>/include) with a warning message to include files from the new location (/opt/rocm-xxx/include) as shown in the example below.

#pragma message "This file is deprecated. Use file from include path /opt/rocm-ver/include/ and prefix with hip."
#include "hip/hip_runtime.h"

The deprecation plan for backward compatibility wrapper header files is as follows:

  • #pragma message announcing deprecation – ROCm v5.2 release.

  • #pragma message changed to #warning – Future release, tentatively ROCm v5.5.

  • #warning changed to #error – Future release, tentatively ROCm v5.6.

  • Backward compatibility wrappers removed – Future release, tentatively ROCm v6.0.

Executable files#

Executable files are available in the /opt/rocm-xxx/bin folder. For backward compatibility, the old location (/opt/rocm-xxx/<component>/bin) has a soft link to the executable at the new location. Soft links will be removed in a future release, tentatively ROCm v6.0.

$ ls -l /opt/rocm/hip/bin/
lrwxrwxrwx 1 root root   24 Jan 1 23:32 hipcc -> ../../bin/hipcc

Library files#

Library files are available in the /opt/rocm-xxx/lib folder. For backward compatibility, the old library location (/opt/rocm-xxx/<component>/lib) has a soft link to the library at the new location. Soft links will be removed in a future release, tentatively ROCm v6.0.

$ ls -l /opt/rocm/hip/lib/
drwxr-xr-x 4 root root 4096 Jan 1 10:45 cmake
lrwxrwxrwx 1 root root   24 Jan 1 23:32 libamdhip64.so -> ../../lib/libamdhip64.so

CMake Config files#

All CMake configuration files are available in the /opt/rocm-xxx/lib/cmake/<component> folder. For backward compatibility, the old CMake locations (/opt/rocm-xxx/<component>/lib/cmake) consist of a soft link to the new CMake config. Soft links will be removed in a future release, tentatively ROCm v6.0.

$ ls -l /opt/rocm/hip/lib/cmake/hip/
lrwxrwxrwx 1 root root 42 Jan 1 23:32 hip-config.cmake -> ../../../../lib/cmake/hip/hip-config.cmake

Changes required in applications using ROCm#

Applications using ROCm are advised to use the new file paths, as the old files will be deprecated in a future release. Applications have to make sure to include the correct header files and use the correct search paths.

  1. #include<header_file.h> needs to be changed to #include <component/header_file.h>

    For example: #include <hip.h> needs to change to #include <hip/hip.h>

  2. Any variable in CMake or Makefiles pointing to a component folder needs to be changed.

    For example: VAR1=/opt/rocm/hip needs to be changed to VAR1=/opt/rocm, and VAR2=/opt/rocm/hsa needs to be changed to VAR2=/opt/rocm.

  3. Any reference to /opt/rocm/<component>/bin or /opt/rocm/<component>/lib needs to be changed to /opt/rocm/bin and /opt/rocm/lib/ respectively.
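
As an illustration of items 2 and 3, a compile invocation would change along the following lines (hypothetical flags; the exact options depend on the application's build system):

# Old (deprecated) component-specific search paths
hipcc -I/opt/rocm/hip/include -L/opt/rocm/hip/lib app.cpp -o app
# New search paths after the file reorganization
hipcc -I/opt/rocm/include -L/opt/rocm/lib app.cpp -o app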

References#

ROCm deprecation warning

Linux File System Standard

GPU Isolation Techniques#

Restricting the access of applications to a subset of GPUs, also known as isolating GPUs, allows users to hide GPU resources from programs. By default, programs will only use the "exposed" GPUs, ignoring other (hidden) GPUs in the system.

There are multiple ways to achieve isolation of GPUs in the ROCm software stack, differing in which applications they apply to and the security they provide. This page serves as an overview of the techniques.

Environment Variables#

The runtimes in the ROCm software stack read these environment variables to select the exposed or default device to present to applications using them.

Environment variables shouldn’t be used for isolating untrusted applications, as an application can reset them before initializing the runtime.

ROCR_VISIBLE_DEVICES#

A list of device indices or UUIDs that will be exposed to applications.

Runtime : ROCm Platform Runtime. Applies to all applications using the user mode ROCm software stack.

Example to expose the first device and a device based on UUID.#
export ROCR_VISIBLE_DEVICES="0,GPU-DEADBEEFDEADBEEF"

GPU_DEVICE_ORDINAL#

Device indices exposed to OpenCL and HIP applications.

Runtime : ROCm Common Language Runtime (ROCclr). Applies to applications and runtimes using the ROCclr abstraction layer including HIP and OpenCL applications.

Example to expose the first and third devices in the system.#
export GPU_DEVICE_ORDINAL="0,2"

HIP_VISIBLE_DEVICES#

Device indices exposed to HIP applications.

Runtime : HIP Runtime. Applies only to applications using HIP on the AMD platform.

Example to expose the first and third devices in the system.#
export HIP_VISIBLE_DEVICES="0,2"

CUDA_VISIBLE_DEVICES#

Provided for CUDA compatibility, this has the same effect as HIP_VISIBLE_DEVICES on the AMD platform.

Runtime : HIP or CUDA Runtime. Applies to HIP applications on the AMD or NVIDIA platform and CUDA applications.
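
For example, to expose the first and third devices in the system, mirroring the HIP_VISIBLE_DEVICES example above:

export CUDA_VISIBLE_DEVICES="0,2"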

OMP_DEFAULT_DEVICE#

Default device used for OpenMP target offloading.

Runtime : OpenMP Runtime. Applies only to applications using OpenMP offloading.

Example of setting the default device to the third device.#
export OMP_DEFAULT_DEVICE="2"

Docker#

Docker uses Linux kernel namespaces to provide isolated environments for applications. This isolation applies to most devices by default, including GPUs. To access them in containers, explicit access must be granted; please see Accessing GPUs in containers for details. Specifically, refer to Restricting a container to a subset of the GPUs on exposing just a subset of all GPUs.

Docker isolation is more secure than environment variables, and applies to all programs that use the amdgpu kernel module interfaces. Even programs that don’t use the ROCm runtime, like graphics applications using OpenGL or Vulkan, can only access the GPUs exposed to the container.
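
A minimal sketch of restricting a container to a single GPU is shown below; it assumes that the render node /dev/dri/renderD128 corresponds to the desired GPU (device naming varies per system, see the links above) and uses the rocm/rocm-terminal image purely as an example:

# Expose the compute interface and only one render node to the container
docker run -it --device=/dev/kfd --device=/dev/dri/renderD128 rocm/rocm-terminal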

GPU Passthrough to Virtual Machines#

Virtual machines achieve the highest level of isolation, because even the kernel of the virtual machine is isolated from the host. Devices physically installed in the host system can be passed to the virtual machine using PCIe passthrough. This allows the GPU to be used with a different operating system, such as a Windows guest from a Linux host.

Setting up PCIe passthrough is specific to the hypervisor used. ROCm officially supports VMware ESXi for select GPUs.

GPU Architectures#

Architecture Guides#

AMD Instinct MI200

Review hardware aspects of the AMD Instinct™ MI250 accelerators and the CDNA™ 2 architecture that is the foundation of these GPUs.

AMD Instinct MI100

Review hardware aspects of the AMD Instinct™ MI100 accelerators and the CDNA™ 1 architecture that is the foundation of these GPUs.

ISA Documentation#

White Papers#

AMD Instinct Hardware#

This chapter briefly reviews hardware aspects of the AMD Instinct MI250 accelerators and the CDNA™ 2 architecture that is the foundation of these GPUs.

AMD CDNA 2 Micro-architecture#

The micro-architecture of the AMD Instinct MI250 accelerators is based on the AMD CDNA 2 architecture that targets compute applications such as HPC, artificial intelligence (AI), and Machine Learning (ML) and that run on everything from individual servers to the world’s largest exascale supercomputers. The overall system architecture is designed for extreme scalability and compute performance.

Fig. 5 shows the components of a single Graphics Compute Die (GCD) of the CDNA 2 architecture. On the top and the bottom are AMD Infinity Fabric™ interfaces and their physical links that are used to connect the GPU die to the other system-level components of the node (see also Section 2.2). Both interfaces can drive four AMD Infinity Fabric links. One of the AMD Infinity Fabric links of the controller at the bottom can be configured as a PCIe link. Each of the AMD Infinity Fabric links between GPUs can run at up to 25 GT/sec, which correlates to a peak transfer bandwidth of 50 GB/sec for a 16-wide link (two bytes per transaction). Section 2.2 has more details on the number of AMD Infinity Fabric links and the resulting transfer rates between the system-level components.

To the left and the right are memory controllers that attach the High Bandwidth Memory (HBM) modules to the GCD. AMD Instinct MI250 GPUs use HBM2e, which offers a peak memory bandwidth of 1.6 TB/sec per GCD.

The execution units of the GPU are depicted in Fig. 5 as Compute Units (CU). The MI250 GCD has 104 active CUs. Each compute unit is further subdivided into four SIMD units that process SIMD instructions of 16 data elements per instruction (for the FP64 data type). This enables the CU to process 64 work items (a so-called "wavefront") at a peak clock frequency of 1.7 GHz. Therefore, the theoretical maximum FP64 peak performance per GCD is 22.6 TFLOPS for vector instructions (45.3 TFLOPS for the two GCDs of an MI250 OAM, see Table 8). The MI250 compute units also provide specialized execution units (also called matrix cores), which are geared toward executing matrix operations like matrix-matrix multiplications. For FP64, the peak performance of these units amounts to 45.3 TFLOPS per GCD (90.5 TFLOPS per OAM).

Figure 1: Structure of a single GCD in the AMD Instinct MI250 accelerator.#

Peak-performance capabilities of the MI250 OAM for different data types.#

| Computation and Data Type | FLOPS/CLOCK/CU | Peak TFLOPS |
|---------------------------|----------------|-------------|
| Matrix FP64               | 256            | 90.5        |
| Vector FP64               | 128            | 45.3        |
| Matrix FP32               | 256            | 90.5        |
| Packed FP32               | 256            | 90.5        |
| Vector FP32               | 128            | 45.3        |
| Matrix FP16               | 1024           | 362.1       |
| Matrix BF16               | 1024           | 362.1       |
| Matrix INT8               | 1024           | 362.1       |

Table 8 summarizes the aggregated peak performance of the AMD Instinct MI250 OCP Open Accelerator Modules (OAM, OCP is short for Open Compute Platform) and its two GCDs for different data types and execution units. The middle column lists the peak performance (number of data elements processed in a single instruction) of a single compute unit if a SIMD (or matrix) instruction is being retired in each clock cycle. The third column lists the theoretical peak performance of the OAM module. The theoretical aggregated peak memory bandwidth of the GPU is 3.2 TB/sec (1.6 TB/sec per GCD).

Dual-GCD architecture of the AMD Instinct MI250 accelerators.#

Fig. 6 shows the block diagram of an OAM package that consists of two GCDs, each of which constitutes one GPU device in the system. The two GCDs in the package are connected via four AMD Infinity Fabric links running at a theoretical peak rate of 25 GT/sec, giving 200 GB/sec peak transfer bandwidth between the two GCDs of an OAM, or a bidirectional peak transfer bandwidth of 400 GB/sec for the same.

Node-level Architecture#

Fig. 7 shows the node-level architecture of a system that is based on the AMD Instinct MI250 accelerator. The MI250 OAMs attach to the host system via PCIe Gen 4 x16 links (yellow lines). Each GCD maintains its own PCIe x16 link to the host part of the system. Depending on the server platform, the GCD can attach to the AMD EPYC processor directly or via an optional PCIe switch. Note that some platforms may offer an x8 interface to the GCDs, which reduces the available host-to-GPU bandwidth.

Block diagram of AMD Instinct MI250 Accelerators with 3rd Generation AMD EPYC processor.#

Fig. 7 shows the node-level architecture of a system with AMD EPYC processors in a dual-socket configuration and four AMD Instinct MI250 accelerators. The MI250 OAMs attach to the host processors via PCIe Gen 4 x16 links (yellow lines). Depending on the system design, a PCIe switch may exist to make more PCIe lanes available for additional components like network interfaces and/or storage devices. Each GCD maintains its own PCIe x16 link to the host part of the system or to the PCIe switch. Please note that some platforms may offer an x8 interface to the GCDs, which will reduce the available host-to-GPU bandwidth.

Between the OAMs and their respective GCDs, a peer-to-peer (P2P) network allows for direct data exchange between the GPU dies via AMD Infinity Fabric links (black, green, and red lines). Each of these 16-wide links connects to one of the two GPU dies in the MI250 OAM and operates at 25 GT/sec, which corresponds to a theoretical peak transfer rate of 50 GB/sec per link (or 100 GB/sec bidirectional peak transfer bandwidth). The GCD pairs 2 and 6 as well as GCDs 0 and 4 connect via two XGMI links, which is indicated by the thicker red line in Fig. 7.

AMD Instinct™ MI100 Hardware#

In this chapter, we are going to briefly review hardware aspects of the AMD Instinct™ MI100 accelerators and the CDNA architecture that is the foundation of these GPUs.

System Architecture#

Fig. 8 shows the node-level architecture of a system that comprises two AMD EPYC™ processors and (up to) eight AMD Instinct™ accelerators. The two EPYC processors are connected to each other with the AMD Infinity™ fabric, which provides high-bandwidth (up to 18 GT/sec), coherent links so that each processor can access the available node memory as a single shared-memory domain in a non-uniform memory architecture (NUMA) fashion. In a 2P, or dual-socket, configuration, three AMD Infinity™ fabric links are available to connect the processors, plus one PCIe Gen 4 x16 link per processor to attach additional I/O devices such as the host adapters for the network fabric.

Node-level system architecture with two AMD EPYC™ processors and eight AMD Instinct™ accelerators.#

In a typical node configuration, each processor can host up to four AMD Instinct™ accelerators that are attached using PCIe Gen 4 links at 16 GT/sec, which corresponds to a peak bidirectional link bandwidth of 32 GB/sec. Each hive of four accelerators can participate in a fully connected, coherent AMD Instinct™ fabric that connects the four accelerators using 23 GT/sec AMD Infinity fabric links that run at a higher frequency than the inter-processor links. This inter-GPU link can be established in certified server systems if the GPUs are mounted in neighboring PCIe slots by installing the AMD Infinity Fabric™ bridge for the AMD Instinct™ accelerators.

Micro-architecture#

The micro-architecture of the AMD Instinct accelerators is based on the AMD CDNA architecture, which targets compute applications such as high-performance computing (HPC) and AI & machine learning (ML) that run on everything from individual servers to the world’s largest exascale supercomputers. The overall system architecture is designed for extreme scalability and compute performance.

Structure of the AMD Instinct accelerator (MI100 generation).#

Fig. 9 shows the AMD Instinct accelerator with its PCIe Gen 4 x16 link (16 GT/sec, at the bottom) that connects the GPU to (one of) the host processor(s). It also shows the three AMD Infinity Fabric ports that provide high-speed links (23 GT/sec, also at the bottom) to the other GPUs of the local hive as shown in Fig. 8.

On the left and right of the floor plan, the High Bandwidth Memory (HBM) attaches via the GPU's memory controller. The MI100 generation of the AMD Instinct accelerator offers four stacks of HBM generation 2 (HBM2) for a total of 32 GB with a 4,096-bit-wide memory interface. The peak memory bandwidth of the attached HBM2 is 1.228 TB/sec at a memory clock frequency of 1.2 GHz.

The execution units of the GPU are depicted in Fig. 9 as Compute Units (CU). There are a total of 120 compute units that are physically organized into eight Shader Engines (SE) with fifteen compute units per shader engine. Each compute unit is further sub-divided into four SIMD units that process SIMD instructions of 16 data elements per instruction. This enables the CU to process 64 data elements (a so-called "wavefront") at a peak clock frequency of 1.5 GHz. Therefore, the theoretical maximum FP64 peak performance is 11.5 TFLOPS (4 [SIMD units] x 16 [elements per instruction] x 120 [CU] x 1.5 [GHz]).

Block diagram of an MI100 compute unit with detailed SIMD view of the AMD CDNA architecture#

Fig. 10 shows the block diagram of a single CU of an AMD Instinct™ MI100 accelerator and summarizes how instructions flow through the execution engines. The CU fetches the instructions via a 32KB instruction cache and moves them forward to execution via a dispatcher. The CU can handle up to ten wavefronts at a time and feed their instructions into the execution unit. The execution unit contains 256 vector general-purpose registers (VGPR) and 800 scalar general-purpose registers (SGPR). The VGPR and SGPR are dynamically allocated to the executing wavefronts. A wavefront can access a maximum of 102 scalar registers. Excess scalar-register usage will cause register spilling and thus may affect execution performance.

A wavefront can occupy any number of VGPRs from 0 to 256, directly affecting occupancy; that is, the number of concurrently active wavefronts in the CU. For instance, with 119 VGPRs used, only two wavefronts can be active in the CU at the same time. With the instruction latency of four cycles per SIMD instruction, the occupancy should be as high as possible such that the compute unit can improve execution efficiency by scheduling instructions from multiple wavefronts.

Peak-performance capabilities of MI100 for different data types.#

| Computation and Data Type | FLOPS/CLOCK/CU | Peak TFLOPS |
|---------------------------|----------------|-------------|
| Vector FP64               | 64             | 11.5        |
| Matrix FP32               | 256            | 46.1        |
| Vector FP32               | 128            | 23.1        |
| Matrix FP16               | 1024           | 184.6       |
| Matrix BF16               | 512            | 92.3        |

How ROCm uses PCIe Atomics#

ROCm PCIe Feature and Overview BAR Memory#

ROCm is an extension of the HSA platform architecture, so it shares the queueing model, memory model, signaling and synchronization protocols. Platform atomics are integral to performing queuing and signaling memory operations where there may be multiple writers across CPU and GPU agents.

The full list of HSA system architecture platform requirements is available here: HSA Sys Arch Features.

The ROCm platform uses the PCI Express 3.0 (PCIe 3.0) features for Atomic Read-Modify-Write Transactions, which extend inter-processor synchronization mechanisms to I/O to support the defined set of HSA capabilities needed for queuing and signaling memory operations.

The PCIe AtomicOps operate as completers for CAS (Compare and Swap), FetchADD, and SWAP atomics. The AtomicOps are initiated by the I/O device, which supports 32-bit, 64-bit and 128-bit operands; the target address has to be naturally aligned to the operation size.

ROCm uses platform atomics in the following ways:

  • Update HSA queue’s read_dispatch_id: 64 bit atomic add used by the command processor on the GPU agent to update the packet ID it processed.

  • Update HSA queue’s write_dispatch_id: 64 bit atomic add used by the CPU and GPU agent to support multi-writer queue insertions.

  • Update HSA Signals – 64bit atomic ops are used for CPU & GPU synchronization.

The PCIe 3.0 AtomicOp feature allows atomic transactions to be requested by, routed through, and completed by PCIe components. Routing and completion do not require software support. Component support for each is detectable via the DEVCAP2 register. Upstream bridges need to have AtomicOp routing enabled, or the atomic operations will fail even though the PCIe endpoint and PCIe I/O devices have the capability for atomic operations.

For AtomicOp routing capability between two or more Root Ports, each associated Root Port must indicate that capability via the AtomicOp Routing Supported bit in the Device Capabilities 2 register.

If your system has a PCI Express switch, it needs to support AtomicOp routing. Again, AtomicOp requests are permitted only if a component's DEVCTL2.ATOMICOP_REQUESTER_ENABLE field is set. These requests can only be serviced if the upstream components support AtomicOp completion and/or routing to a component which does. AtomicOp Routing Support=1 means routing is supported; AtomicOp Routing Support=0 means routing is not supported.

An atomic operation is a non-posted transaction supporting 32-bit and 64-bit address formats; there must be a response for completion containing the result of the operation. Errors associated with the operation (an uncorrectable error accessing the target location or carrying out the atomic operation) are signaled to the requester by setting the Completion Status field in the completion descriptor to Completer Abort (CA) or Unsupported Request (UR).
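
As an illustration (not part of the original text), recent pciutils releases report these capability and control bits, which can help verify AtomicOp support along the path between the GPU and its root port:

# Look for AtomicOpsCap (DevCap2) and AtomicOpsCtl (DevCtl2) entries
sudo lspci -vvv | grep -iE "atomicops"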

To understand more about how PCIe atomic operations work, see PCIe Atomics.

Linux Kernel Patch to pci_enable_atomic_request

There are also a number of papers which talk about these new capabilities:

Other I/O devices with PCIe Atomics support

Future bus technology with richer I/O Atomics Operation Support

New PCIe endpoints with support beyond AMD Ryzen and EPYC CPUs; Intel Haswell or newer CPUs with PCIe Generation 3.0 support.

In ROCm, we also take advantage of PCIe ID-based ordering technology for P2P when the GPU originates two writes to two different targets:

1. a write to another GPU's memory,
2. then a write to system memory to indicate the transfer is complete.

These are routed to different ends of the computer, but we want to make sure the write to system memory that indicates the transfer is complete occurs AFTER the P2P write to GPU memory has completed.

Good Paper on Understanding PCIe Generation 3 Throughput

BAR Memory Overview#

On a Xeon E5-based system, above-4GB PCIe addressing can be turned on in the BIOS; if so, you need to set the MMIO base address (MMIOH Base) and range (MMIO High Size) in the BIOS.

On a Supermicro system, you need to set the following in the system BIOS:

  • Advanced->PCIe/PCI/PnP configuration-> Above 4G Decoding = Enabled

  • Advanced->PCIe/PCI/PnP Configuration->MMIOH Base = 512G

  • Advanced->PCIe/PCI/PnP Configuration->MMIO High Size = 256G

When we support Large BAR capability, there is a Large BAR VBIOS which also disables the IO BAR.

GFX9 and Vega10 have up to 44-bit physical addresses and 48-bit virtual addresses:

  • BAR0-1 registers: 64bit, prefetchable, GPU memory. 8GB or 16GB depending on Vega10 SKU. Must be placed < 2^44 to support P2P access from other Vega10.

  • BAR2-3 registers: 64bit, prefetchable, Doorbell. Must be placed < 2^44 to support P2P access from other Vega10.

  • BAR4 register: Optional, not a boot device.

  • BAR5 register: 32bit, non-prefetchable, MMIO. Must be placed < 4GB.

Here is how the BARs work on GFX8 GPUs with a 40-bit physical address limit:

11:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Fiji [Radeon R9 FURY / NANO Series] (rev c1)

Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Device 0b35

Flags: bus master, fast devsel, latency 0, IRQ 119

Memory at bf40000000 (64-bit, prefetchable) [size=256M]

Memory at bf50000000 (64-bit, prefetchable) [size=2M]

I/O ports at 3000 [size=256]

Memory at c7400000 (32-bit, non-prefetchable) [size=256K]

Expansion ROM at c7440000 [disabled] [size=128K]

Legend:

1 : GPU Frame Buffer BAR – In this example it happens to be 256M, but typically this will be size of the GPU memory (typically 4GB+). This BAR has to be placed < 2^40 to allow peer-to-peer access from other GFX8 AMD GPUs. For GFX9 (Vega GPU) the BAR has to be placed < 2^44 to allow peer-to-peer access from other GFX9 AMD GPUs.

2 : Doorbell BAR – The size of the BAR is typically will be < 10MB (currently fixed at 2MB) for this generation GPUs. This BAR has to be placed < 2^40 to allow peer-to-peer access from other current generation AMD GPUs.

3 : IO BAR - This is for legacy VGA and boot device support; since the GPUs in this project are not VGA devices (headless), this is not a concern even if the SBIOS does not set it up.

4 : MMIO BAR – This is required for the AMD driver software to access the configuration registers. Since the remainder of the available BAR is only 1 DWORD (32-bit), this is placed < 4GB. This is fixed at 256KB.

5 : Expansion ROM – This is required for the AMD Driver SW to access the GPU’s video-bios. This is currently fixed at 128KB.

Excerpts from Overview of Changes to PCI Express 3.0#

By Mike Jackson, Senior Staff Architect, MindShare, Inc.#

Atomic Operations – Goal:#

Support SMP-type operations across a PCIe network to allow for things like offloading tasks between CPU cores and accelerators like a GPU. The spec says this enables advanced synchronization mechanisms that are particularly useful with multiple producers or consumers that need to be synchronized in a non-blocking fashion. Three new atomic non-posted requests were added, plus the corresponding completion (the address must be naturally aligned with the operand size or the TLP is malformed):

  • Fetch and Add – uses one operand as the “add” value. Reads the target location, adds the operand, and then writes the result back to the original location.

  • Unconditional Swap – uses one operand as the “swap” value. Reads the target location and then writes the swap value to it.

  • Compare and Swap – uses 2 operands: first data is compare value, second is swap value. Reads the target location, checks it against the compare value and, if equal, writes the swap value to the target location.

  • AtomicOpCompletion – a new completion that returns the result of an atomic request and indicates that the atomicity of the transaction has been maintained.

Since AtomicOps are not locked they don’t have the performance downsides of the PCI locked protocol. Compared to locked cycles, they provide “lower latency, higher scalability, advanced synchronization algorithms, and dramatically lower impact on other PCIe traffic.” The lock mechanism can still be used across a bridge to PCI or PCI-X to achieve the desired operation.

AtomicOps can go from device to device, device to host, or host to device. Each completer indicates whether it supports this capability and guarantees atomic access if it does. The ability to route AtomicOps is also indicated in the registers for a given port.

ID-based Ordering – Goal:#

Improve performance by avoiding stalls caused by ordering rules. For example, posted writes are never normally allowed to pass each other in a queue, but if they are requested by different functions, we can have some confidence that the requests are not dependent on each other. The previously reserved Attribute bit [2] is now combined with the RO bit to indicate ID ordering with or without relaxed ordering.

This only has meaning for memory requests, and is reserved for Configuration or IO requests. Completers are not required to copy this bit into a completion, and only use the bit if their enable bit is set for this operation.

To read more on the new PCIe Gen 3 options, see https://www.mindshare.com/files/resources/PCIe%203-0.pdf

All How-To Material#

Tuning Guides

Use case-specific system setup and tuning guides.

Deep Learning Guide

Installation of various Deep Learning frameworks and applications.

GPU-Enabled MPI

This chapter exemplifies how to set up Open MPI with the ROCm platform.

System Debugging Guide

Useful commands to debug misbehaving ROCm installations.

Tuning Guides#

Use case-specific system setup and tuning guides.

High Performance Computing#

High Performance Computing (HPC) workloads have unique requirements. The default hardware and BIOS configurations for OEM platforms may not provide optimal performance for HPC workloads. To enable optimal HPC settings on a per-platform and per-workload level, this guide calls out:

  • BIOS settings that can impact performance

  • Hardware configuration best practices

  • Supported versions of operating systems

  • Workload-specific recommendations for optimal BIOS and operating system settings

There is also a discussion on the AMD Instinct™ software development environment, including information on how to install and run the DGEMM, STREAM, HPCG, and HPL benchmarks. This guidance provides a good starting point but is not exhaustively tested across all compilers.

Prerequisites to understanding this document and to performing tuning of HPC applications include:

  • Experience in configuring servers

  • Administrative access to the server’s Management Interface (BMC)

  • Administrative access to the operating system

  • Familiarity with the OEM server’s BMC (strongly recommended)

  • Familiarity with the OS specific tools for configuration, monitoring, and troubleshooting (strongly recommended)

This document provides guidance on tuning systems with various AMD Instinct™ accelerators for HPC workloads. This document is not an all-inclusive guide, and some items referred to may have similar, but different, names in various OEM systems (for example, OEM-specific BIOS settings). This document also provides suggestions on items that should be the initial focus of additional, application-specific tuning.

This document is based on the AMD EPYC™ 7003-series processor family (former codename “Milan”).

While this guide is a good starting point, developers are encouraged to perform their own performance testing for additional tuning.

AMD Instinct™ MI200

This chapter goes through how to configure your AMD Instinct™ MI200 accelerated compute nodes to get the best performance out of them.

AMD Instinct™ MI100

This chapter briefly reviews hardware aspects of the AMD Instinct™ MI100 accelerators and the CDNA™ 1 architecture that is the foundation of these GPUs.

Workstation#

Workstation workloads, much like High Performance Computing workloads, have a unique set of requirements: a blend of both graphics and compute, certification, stability, and the list continues.

The document covers specific software requirements and processes needed to use these GPUs for Single Root I/O Virtualization (SR-IOV) and Machine Learning (ML).

The main purpose of this document is to help users utilize the RDNA 2 GPUs to their full potential.

AMD Radeon™ PRO W6000 and V620

This chapter describes the AMD GPUs with RDNA™ 2 architecture, namely AMD Radeon PRO W6800 and AMD Radeon PRO V620.

MI200 High Performance Computing and Tuning Guide#

System Settings#

This chapter reviews system settings that are required to configure the system for AMD Instinct MI250 accelerators and improve the performance of the GPUs. It is advised to configure the system for the best possible host configuration according to the “High Performance Computing (HPC) Tuning Guide for AMD EPYC 7003 Series Processors.”

Configure the system BIOS settings as explained in System BIOS Settings and enact the settings given below via the command line as explained in Operating System Settings:

  • Core C states

  • IOMMU (if needed)

System BIOS Settings#

For maximum MI250 GPU performance on systems with AMD EPYC™ 7003-series processors (codename “Milan”) and AMI System BIOS, the following configuration of system BIOS settings has been validated. These settings must be used for the qualification process and should be set as default values for the system BIOS. Analogous settings for other non-AMI System BIOS providers could be set similarly. For systems with Intel processors, some settings may not apply or be available as listed in Table 10.

Recommended settings for the system BIOS in a GIGABYTE platform.#

| BIOS Setting Location | Parameter | Value | Comments |
|---|---|---|---|
| Advanced / PCI Subsystem Settings | Above 4G Decoding | Enabled | GPU Large BAR Support |
| Advanced / PCI Subsystem Settings | SR-IOV Support | Disabled | Disable Single Root IO Virtualization |
| AMD CBS / CPU Common Options | Global C-state Control | Auto | Global Core C-States |
| AMD CBS / CPU Common Options | CCD/Core/Thread Enablement | Accept | Global Core C-States |
| AMD CBS / CPU Common Options / Performance | SMT Control | Disable | Global Core C-States |
| AMD CBS / DF Common Options / Memory Addressing | NUMA nodes per socket | NPS 1,2,4 | NUMA Nodes (NPS) |
| AMD CBS / DF Common Options / Memory Addressing | Memory interleaving | Auto | NUMA Nodes (NPS) |
| AMD CBS / DF Common Options / Link | 4-link xGMI max speed | 18 Gbps | Set AMD CPU xGMI speed to highest rate supported |
| AMD CBS / NBIO Common Options | IOMMU | Disable | |
| AMD CBS / NBIO Common Options | PCIe Ten Bit Tag Support | Auto | |
| AMD CBS / NBIO Common Options | Preferred IO | Bus | |
| AMD CBS / NBIO Common Options | Preferred IO Bus | "Use lspci to find pci device id" | |
| AMD CBS / NBIO Common Options | Enhanced Preferred IO Mode | Enable | |
| AMD CBS / NBIO Common Options / SMU Common Options | Determinism Control | Manual | |
| AMD CBS / NBIO Common Options / SMU Common Options | Determinism Slider | Power | |
| AMD CBS / NBIO Common Options / SMU Common Options | cTDP Control | Manual | Set cTDP to the maximum supported by the installed CPU |
| AMD CBS / NBIO Common Options / SMU Common Options | cTDP | 280 | |
| AMD CBS / NBIO Common Options / SMU Common Options | Package Power Limit Control | Manual | Set Package Power Limit to the maximum supported by the installed CPU |
| AMD CBS / NBIO Common Options / SMU Common Options | Package Power Limit | 280 | |
| AMD CBS / NBIO Common Options / SMU Common Options | xGMI Link Width Control | Manual | Set AMD CPU xGMI width to 16 bits |
| AMD CBS / NBIO Common Options / SMU Common Options | xGMI Force Link Width | 2 | |
| AMD CBS / NBIO Common Options / SMU Common Options | xGMI Force Link Width Control | Force | |
| AMD CBS / NBIO Common Options / SMU Common Options | APBDIS | 1 | |
| AMD CBS / NBIO Common Options / SMU Common Options | DF C-states | Enabled | |
| AMD CBS / NBIO Common Options / SMU Common Options | Fixed SOC P-state | P0 | |
| AMD CBS / UMC Common Options / DDR4 Common Options | Enforce POR | Accept | |
| AMD CBS / UMC Common Options / DDR4 Common Options / Enforce POR | Overclock | Enabled | |
| AMD CBS / UMC Common Options / DDR4 Common Options / Enforce POR | Memory Clock Speed | 1600 MHz | Set to max Memory Speed, if using 3200 MHz DIMMs |
| AMD CBS / UMC Common Options / DDR4 Common Options / DRAM Controller Configuration / DRAM Power Options | Power Down Enable | Disabled | RAM Power Down |
| AMD CBS / Security | TSME | Disabled | Memory Encryption |

Memory Configuration#

For setting the memory addressing modes (see Table 10), especially the number of NUMA nodes per socket/processor (NPS), follow the guidance of the “High Performance Computing (HPC) Tuning Guide for AMD EPYC 7003 Series Processors” to provide the optimal configuration for host side computation. For most HPC workloads, NPS=4 is the recommended value.

Operating System Settings#
CPU Core State - “C States”#

There are several Core-States, or C-states that an AMD EPYC CPU can idle within:

  • C0: active. This is the active state while running an application.

  • C1: idle

  • C2: idle and power gated. This is a deeper sleep state and will have a greater latency when moving back to the C0 state, compared to when the CPU is coming out of C1.

Disabling C2 is important for running with a high performance, low-latency network. To disable power-gating on all cores run the following on Linux systems:

cpupower idle-set -d 2

Note that the cpupower tool must be installed, as it is not part of the base packages of most Linux® distributions. The package needed varies with the respective Linux distribution.

sudo apt install linux-tools-common
sudo yum install cpupowerutils
sudo zypper install cpupower
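
To confirm that the C2 state is disabled after running the command above, the current idle-state configuration can be inspected, for example with this illustrative check:

cpupower idle-info
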
AMD-IOPM-UTIL#

This section applies to AMD EPYC™ 7002 processors to optimize advanced Dynamic Power Management (DPM) in the I/O logic (see NBIO description above) for performance. Certain I/O workloads may benefit from disabling this power management. This utility disables DPM for all PCI-e root complexes in the system and locks the logic into the highest performance operational mode.

Disabling I/O DPM will reduce the latency and/or improve the throughput of low-bandwidth messages for PCI-e InfiniBand NICs and GPUs. Other workloads with low-bandwidth bursty PCI-e I/O characteristics may benefit as well if multiple such PCI-e devices are installed in the system.

The actions of the utility do not persist across reboots. There is no need to change any existing firmware settings when using this utility. The “Preferred I/O” and “Enhanced Preferred I/O” settings should remain unchanged at enabled.

Tip

The recommended method to use the utility is either to create a system start-up script, for example, a one-shot systemd service unit, or run the utility when starting up a job scheduler on the system. The installer packages (see Power Management Utility) will create and enable a systemd service unit for you. This service unit is configured to run in one-shot mode. This means that even when the service unit runs as expected, the status of the service unit will show inactive. This is the expected behavior when the utility runs normally. If the service unit shows failed, the utility did not run as expected. The output in either case can be shown with the systemctl status command.

Stopping the service unit has no effect since the utility does not leave anything running. To undo the effects of the utility, disable the service unit with the systemctl disable command and reboot the system.

The utility does not have any command-line options, and it must be run with super-user permissions.

Systems with 256 CPU Threads - IOMMU Configuration#

For systems that have 256 logical CPU cores or more (e.g., 64-core AMD EPYC™ 7763 in a dual-socket configuration and SMT enabled), setting the Input-Output Memory Management Unit (IOMMU) configuration to “disabled” can limit the number of available logical cores to 255. The reason is that the Linux® kernel disables X2APIC in this case and falls back to Advanced Programmable Interrupt Controller (APIC), which can only enumerate a maximum of 255 (logical) cores.

If SMT is enabled by setting “CCD/Core/Thread Enablement > SMT Control” to “enable”, the following steps can be applied to the system to enable all (logical) cores of the system:

  • In the server BIOS, set IOMMU to “Enabled”.

  • When configuring the Grub boot loader, add the following arguments for the Linux kernel: amd_iommu=on iommu=pt (see the sketch at the end of this section)

  • Update Grub to use the modified configuration:

    sudo grub2-mkconfig -o /boot/grub2/grub.cfg
    
  • Reboot the system.

  • Verify IOMMU passthrough mode by inspecting the kernel log via dmesg:

    [...]
    [   0.000000] Kernel command line: [...] amd_iommu=on iommu=pt
       [...]
    

Once the system is properly configured, the AMD ROCm platform can be installed.
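
A minimal sketch of the Grub step above for distributions that keep the kernel command line in /etc/default/grub (the file location and existing options vary; back up the file before editing):

# Append the IOMMU arguments to the default kernel command line
sudo sed -i 's/^GRUB_CMDLINE_LINUX="\(.*\)"$/GRUB_CMDLINE_LINUX="\1 amd_iommu=on iommu=pt"/' /etc/default/grub
# Regenerate the Grub configuration (as shown above) and reboot
sudo grub2-mkconfig -o /boot/grub2/grub.cfg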

System Management#

For a complete guide on how to install/manage/uninstall ROCm on Linux, refer to Deploy ROCm on Linux. For verifying that the installation was successful, refer to Verifying Kernel-mode Driver Installation and Validation Tools. Should verification fail, consult the System Debugging Guide.

Hardware Verification with ROCm#

The AMD ROCm™ platform ships with tools to query the system structure. To query the GPU hardware, the rocm-smi command is available. It can show available GPUs in the system with their device ID and their respective firmware (or VBIOS) versions:

rocm-smi --showhw output on an 8*MI200 system.#

To see the system structure, the localization of the GPUs in the system, and the fabric connections between the system components, use:

rocm-smi --showtopo output on an 8*MI200 system.#

  • The first block of the output shows the distance between the GPUs similar to what the numactl command outputs for the NUMA domains of a system. The weight is a qualitative measure for the “distance” data must travel to reach one GPU from another one. While the values do not carry a special (physical) meaning, the higher the value the more hops are needed to reach the destination from the source GPU.

  • The second block has a matrix named “Hops between two GPUs”, where 1 means the two GPUs are directly connected with XGMI, 2 means both GPUs are linked to the same CPU socket and GPU communications will go through the CPU, and 3 means both GPUs are linked to different CPU sockets so communications will go through both CPU sockets. This number is one for all GPUs in this case since they are all connected to each other through the Infinity Fabric links.

  • The third block outputs the link types between the GPUs. This can either be “XGMI” for AMD Infinity Fabric links or “PCIE” for PCIe Gen4 links.

  • The fourth block reveals the localization of a GPU with respect to the NUMA organization of the shared memory of the AMD EPYC processors.

To query the compute capabilities of the GPU devices, use rocminfo command. It lists specific details about the GPU devices, including but not limited to the number of compute units, width of the SIMD pipelines, memory information, and instruction set architecture:

rocminfo output fragment on an 8*MI200 system.#

For a complete list of architecture (LLVM target) names, refer to GPU OS Support.

Testing Inter-device Bandwidth#

The Hardware Verification with ROCm section showed how the rocm-smi --showtopo command displays the system structure and how the GPUs are located and connected in this structure. For more details, rocm-bandwidth-test can run benchmarks to show the effective link bandwidth between the components of the system.

The ROCm Bandwidth Test program can be installed with the following package-manager commands:

sudo apt install rocm-bandwidth-test
sudo yum install rocm-bandwidth-test
sudo zypper install rocm-bandwidth-test

Alternatively, the source code can be downloaded and built from source.
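
Once installed, running the tool with no arguments performs its default suite of copy benchmarks between all device pairs (illustrative invocation):

rocm-bandwidth-test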

The output will list the available compute devices (CPUs and GPUs), including their device ID and PCIe ID:

rocm-bandwidth-test output fragment on an 8*MI200 system listing devices.#

The output will also show a matrix that contains a “1” if a device can communicate to another device (CPU and GPU) of the system and it will show the NUMA distance (similar to rocm-smi):

rocm-bandwidth-test output fragment on an 8*MI200 system showing inter-device access matrix and NUMA distances.#

The output also contains the measured bandwidth for unidirectional and bidirectional transfers between the devices (CPU and GPU):

rocm-bandwidth-test output fragment on an 8*MI200 system showing uni- and bidirectional bandwidths.#

MI100 High Performance Computing and Tuning Guide#

System Settings#

This chapter reviews system settings that are required to configure the system for AMD Instinct™ MI100 accelerators and that can improve the performance of the GPUs. It is advised to configure the system for the best possible host configuration according to the “High Performance Computing (HPC) Tuning Guide for AMD EPYC™ 7002 Series Processors” or “High Performance Computing (HPC) Tuning Guide for AMD EPYC™ 7003 Series Processors”, depending on the processor generation of the system.

In addition to the BIOS settings listed below (see System BIOS Settings), the following settings will also have to be enacted via the command line (see Operating System Settings):

  • Core C states

  • AMD-PCI-UTIL (on AMD EPYC™ 7002 series processors)

  • IOMMU (if needed)

System BIOS Settings#

For maximum MI100 GPU performance on systems with AMD EPYC™ 7002 series processors (codename “Rome”) and AMI System BIOS, the following configuration of System BIOS settings has been validated. These settings must be used for the qualification process and should be set as default values for the system BIOS. Analogous settings for other non-AMI System BIOS providers could be set similarly. For systems with Intel processors, some settings may not apply or be available as listed in Table 11.

Recommended settings for the system BIOS in a GIGABYTE platform.#

| BIOS Setting Location | Parameter | Value | Comments |
|---|---|---|---|
| Advanced / PCI Subsystem Settings | Above 4G Decoding | Enabled | GPU Large BAR Support |
| AMD CBS / CPU Common Options | Global C-state Control | Auto | Global Core C-States |
| AMD CBS / CPU Common Options | CCD/Core/Thread Enablement | Accept | Global Core C-States |
| AMD CBS / CPU Common Options / Performance | SMT Control | Disable | Global Core C-States |
| AMD CBS / DF Common Options / Memory Addressing | NUMA nodes per socket | NPS 1,2,4 | NUMA Nodes (NPS) |
| AMD CBS / DF Common Options / Memory Addressing | Memory interleaving | Auto | NUMA Nodes (NPS) |
| AMD CBS / DF Common Options / Link | 4-link xGMI max speed | 18 Gbps | Set AMD CPU xGMI speed to highest rate supported |
| AMD CBS / DF Common Options / Link | 3-link xGMI max speed | 18 Gbps | Set AMD CPU xGMI speed to highest rate supported |
| AMD CBS / NBIO Common Options | IOMMU | Disable | |
| AMD CBS / NBIO Common Options | PCIe Ten Bit Tag Support | Enable | |
| AMD CBS / NBIO Common Options | Preferred IO | Manual | |
| AMD CBS / NBIO Common Options | Preferred IO Bus | "Use lspci to find pci device id" | |
| AMD CBS / NBIO Common Options | Enhanced Preferred IO Mode | Enable | |
| AMD CBS / NBIO Common Options / SMU Common Options | Determinism Control | Manual | |
| AMD CBS / NBIO Common Options / SMU Common Options | Determinism Slider | Power | |
| AMD CBS / NBIO Common Options / SMU Common Options | cTDP Control | Manual | |
| AMD CBS / NBIO Common Options / SMU Common Options | cTDP | 240 | |
| AMD CBS / NBIO Common Options / SMU Common Options | Package Power Limit Control | Manual | |
| AMD CBS / NBIO Common Options / SMU Common Options | Package Power Limit | 240 | |
| AMD CBS / NBIO Common Options / SMU Common Options | xGMI Link Width Control | Manual | |
| AMD CBS / NBIO Common Options / SMU Common Options | xGMI Force Link Width | 2 | |
| AMD CBS / NBIO Common Options / SMU Common Options | xGMI Force Link Width Control | Force | |
| AMD CBS / NBIO Common Options / SMU Common Options | APBDIS | 1 | |
| AMD CBS / NBIO Common Options / SMU Common Options | DF C-states | Auto | |
| AMD CBS / NBIO Common Options / SMU Common Options | Fixed SOC P-state | P0 | |
| AMD CBS / UMC Common Options / DDR4 Common Options | Enforce POR | Accept | |
| AMD CBS / UMC Common Options / DDR4 Common Options / Enforce POR | Overclock | Enabled | |
| AMD CBS / UMC Common Options / DDR4 Common Options / Enforce POR | Memory Clock Speed | 1600 MHz | Set to max Memory Speed, if using 3200 MHz DIMMs |
| AMD CBS / UMC Common Options / DDR4 Common Options / DRAM Controller Configuration / DRAM Power Options | Power Down Enable | Disabled | RAM Power Down |
| AMD CBS / Security | TSME | Disabled | Memory Encryption |

Memory Configuration#

For the memory addressing modes (see Table 11), especially the number of NUMA nodes per socket/processor (NPS), the recommended setting is to follow the guidance of the “High Performance Computing (HPC) Tuning Guide for AMD EPYC™ 7002 Series Processors” and “High Performance Computing (HPC) Tuning Guide for AMD EPYC™ 7003 Series Processors” to provide the optimal configuration for host side computation.

If the system is set to one NUMA domain per socket/processor (NPS1), bidirectional copy bandwidth between host memory and GPU memory may be slightly higher (up to about 16% more) than with four NUMA domains per socket processor (NPS4). For memory bandwidth sensitive applications using MPI, NPS4 is recommended. For applications that are not optimized for NUMA locality, NPS1 is the recommended setting.

Operating System Settings#
CPU Core State - “C States”#

There are several Core-States, or C-states that an AMD EPYC CPU can idle within:

  • C0: active. This is the active state while running an application.

  • C1: idle

  • C2: idle and power gated. This is a deeper sleep state and will have a greater latency when moving back to the C0 state, compared to when the CPU is coming out of C1.

Disabling C2 is important for running with a high performance, low-latency network. To disable power-gating on all cores run the following on Linux systems:

cpupower idle-set -d 2

Note that the cpupower tool must be installed, as it is not part of the base packages of most Linux® distributions. The package needed varies with the respective Linux distribution.

sudo apt install linux-tools-common
sudo yum install cpupowerutils
sudo zypper install cpupower
AMD-IOPM-UTIL#

This section applies to AMD EPYC™ 7002 processors to optimize advanced Dynamic Power Management (DPM) in the I/O logic (see NBIO description above) for performance. Certain I/O workloads may benefit from disabling this power management. This utility disables DPM for all PCI-e root complexes in the system and locks the logic into the highest performance operational mode.

Disabling I/O DPM will reduce the latency and/or improve the throughput of low-bandwidth messages for PCI-e InfiniBand NICs and GPUs. Other workloads with low-bandwidth bursty PCI-e I/O characteristics may benefit as well if multiple such PCI-e devices are installed in the system.

The actions of the utility do not persist across reboots. There is no need to change any existing firmware settings when using this utility. The “Preferred I/O” and “Enhanced Preferred I/O” settings should remain unchanged at enabled.

Tip

The recommended method to use the utility is either to create a system start-up script, for example, a one-shot systemd service unit, or run the utility when starting up a job scheduler on the system. The installer packages (see Power Management Utility) will create and enable a systemd service unit for you. This service unit is configured to run in one-shot mode. This means that even when the service unit runs as expected, the status of the service unit will show inactive. This is the expected behavior when the utility runs normally. If the service unit shows failed, the utility did not run as expected. The output in either case can be shown with the systemctl status command.

Stopping the service unit has no effect since the utility does not leave anything running. To undo the effects of the utility, disable the service unit with the systemctl disable command and reboot the system.

The utility does not have any command-line options, and it must be run with super-user permissions.

Systems with 256 CPU Threads - IOMMU Configuration#

For systems that have 256 logical CPU cores or more (e.g., 64-core AMD EPYC™ 7763 in a dual-socket configuration and SMT enabled), setting the Input-Output Memory Management Unit (IOMMU) configuration to “disabled” can limit the number of available logical cores to 255. The reason is that the Linux® kernel disables X2APIC in this case and falls back to Advanced Programmable Interrupt Controller (APIC), which can only enumerate a maximum of 255 (logical) cores.

If SMT is enabled by setting “CCD/Core/Thread Enablement > SMT Control” to “enable”, the following steps can be applied to the system to enable all (logical) cores of the system:

  • In the server BIOS, set IOMMU to “Enabled”.

  • When configuring the Grub boot loader, add the following arguments for the Linux kernel: amd_iommu=on iommu=pt

  • Update Grub to use the modified configuration:

    sudo grub2-mkconfig -o /boot/grub2/grub.cfg
    
  • Reboot the system.

  • Verify IOMMU passthrough mode by inspecting the kernel log via dmesg:

    [...]
    [   0.000000] Kernel command line: [...] amd_iommu=on iommu=pt
       [...]
    

Once the system is properly configured, the AMD ROCm platform can be installed.

System Management#

For a complete guide on how to install/manage/uninstall ROCm on Linux, refer to Deploy ROCm on Linux. For verifying that the installation was successful, refer to Verifying Kernel-mode Driver Installation and Validation Tools. Should verification fail, consult the System Debugging Guide.

Hardware Verification with ROCm#

The AMD ROCm™ platform ships with tools to query the system structure. To query the GPU hardware, the rocm-smi command is available. It can show available GPUs in the system with their device ID and their respective firmware (or VBIOS) versions:

rocm-smi --showhw output on an 8*MI100 system.#

Another important query is to show the system structure, the localization of the GPUs in the system, and the fabric connections between the system components:

rocm-smi --showtopo output on an 8*MI100 system.#
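
For reference, the outputs shown in the two figures above can be generated with the following commands (assuming rocm-smi is on the PATH):

# List GPUs with their device IDs and VBIOS/firmware versions
rocm-smi --showhw
# Show the system topology: link weights, hop counts, link types, and NUMA affinity
rocm-smi --showtopo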

The previous command shows the system structure in four blocks:

  • The first block of the output shows the distance between the GPUs similar to what the numactl command outputs for the NUMA domains of a system. The weight is a qualitative measure for the “distance” data must travel to reach one GPU from another one. While the values do not carry a special (physical) meaning, the higher the value the more hops are needed to reach the destination from the source GPU.

  • The second block has a matrix for the number of hops required to send data from one GPU to another. For the GPUs in the local hive, this number is one, while for the others it is three (one hop to leave the hive, one hop across the processors, and one hop within the destination hive).

  • The third block outputs the link types between the GPUs. This can either be “XGMI” for AMD Infinity Fabric™ links or “PCIE” for PCIe Gen4 links.

  • The fourth block reveals the localization of a GPU with respect to the NUMA organization of the shared memory of the AMD EPYC™ processors.

To query the compute capabilities of the GPU devices, the rocminfo command is available with the AMD ROCm™ platform. It lists specific details about the GPU devices, including but not limited to the number of compute units, width of the SIMD pipelines, memory information, and instruction set architecture:

rocminfo output fragment on an 8*MI100 system.#

For a complete list of architecture (LLVM target) names, refer to GPU OS Support.

Testing Inter-device Bandwidth#

The previous section showed how the rocm-smi --showtopo command displays the system structure and how the GPUs are located and connected within it. For more detail, rocm-bandwidth-test can run benchmarks that measure the effective link bandwidth between the components of the system.

The ROCm Bandwidth Test program can be installed with the following package-manager commands:

sudo apt install rocm-bandwidth-test
sudo yum install rocm-bandwidth-test
sudo zypper install rocm-bandwidth-test

Alternatively, the source code can be downloaded and built from source.
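
Once installed, the benchmark can be launched directly; running it without arguments is expected to execute the full set of tests described below (a minimal sketch):

# Run the full suite: device discovery, access matrix, and uni-/bidirectional bandwidth
rocm-bandwidth-test
# List the available command-line options
rocm-bandwidth-test -h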

The output will list the available compute devices (CPUs and GPUs):

rocm-bandwidth-test output fragment on an 8*MI100 system listing devices.#

The output will also show a matrix that contains a "1" if a device can communicate with another device (CPU or GPU) in the system, and it will show the NUMA distance (similar to rocm-smi):

rocm-bandwidth-test output fragment on an 8*MI100 system showing inter-device access matrix.#

rocm-bandwidth-test output fragment on an 8*MI100 system showing inter-device NUMA distance.#

The output also contains the measured bandwidth for unidirectional and bidirectional transfers between the devices (CPU and GPU):

rocm-bandwidth-test output fragment on an 8*MI100 system showing uni- and bidirectional bandwidths.#

RDNA2 Workstation Tuning Guide#

System Settings#

This chapter reviews the system settings required to configure a system for ROCm virtualization on RDNA2-based AMD Radeon™ PRO GPUs. Installing ROCm on bare metal follows the routine ROCm installation procedure.

To enable ROCm virtualization on the V620, Single Root I/O Virtualization (SR-IOV) must be enabled in the BIOS via the settings listed in System BIOS Settings. A tested configuration is described in Operating System Settings.

Attention

SR-IOV is supported on V620 and unsupported on W6800.

System BIOS Settings#
Settings for the system BIOS in an ASrock platform.#

  • Advanced / North Bridge Configuration > IOMMU: Enabled (Input-Output Memory Management Unit)

  • Advanced / North Bridge Configuration > ACS Enable: Enabled (Access Control Service)

  • Advanced / PCIe/PCI/PnP Configuration > SR-IOV Support: Enabled (Single Root I/O Virtualization)

  • Advanced / ACPI Settings > PCI AER Support: Enabled (Advanced Error Reporting)

To set up the host, update SBIOS to version 1.2a.

Operating System Settings#
System Configuration Prerequisites#

  • Server: SMC 4124 [AS -4124GS-TNR]

  • Host OS: Ubuntu 20.04.3 LTS

  • Host Kernel: 5.4.0-97-generic

  • CPU: AMD EPYC 7552 48-Core Processor

  • GPU: RDNA2 V620 (D603GLXE)

  • SBIOS: Version SMC_r_1.2a

  • VBIOS: 113-D603GLXE-077

  • Guest OS 1: Ubuntu 20.04.5 LTS

  • Guest OS 2: RHEL 9.0

  • GIM Driver: gim-dkms_1.0.0.1234577_all

  • VM CPU Cores: 32

  • VM RAM: 64 GB

Install the following Kernel-based Virtual Machine (KVM) Hypervisor packages:

sudo apt-get -y install qemu-kvm qemu-utils bridge-utils virt-manager gir1.2-spiceclientgtk* gir1.2-spice-client-gtk* libvirt-daemon-system dnsmasq-base
sudo virsh net-start default   # enable the default virtual network

Enable IOMMU in GRUB settings by adding the following kernel parameters to /etc/default/grub (for AMD CPUs):

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amd_iommu=on"

Update GRUB and reboot:

sudo update-grub
sudo reboot

Install the GPU-IOV Module (GIM, where IOV is I/O Virtualization) driver and follow the steps below. To obtain the GIM driver, write to us here:

sudo dpkg -i <gim_driver>
sudo reboot
# Load Host Driver to Create 1VF
sudo modprobe gim vf_num=1
# Note: If GIM driver loaded successfully, we could see "gim info:(gim_init:213) *****Running GIM*****" in dmesg
lspci -d 1002:

Which should output something like:

01:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] Device 1478
02:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] Device 1479
03:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Device 73a1
03:02.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Device 73ae → VF
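
To confirm that the GIM driver initialized successfully (see the note above about the expected message), a quick check of the kernel log can be used:

# Look for the GIM initialization message in the kernel log
sudo dmesg | grep -i gim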
Guest OS installation#

First, assign the GPU virtual function (VF) to the VM using the following steps.

  1. Shut down the VM.

  2. Run virt-manager

  3. In the Virtual Machine Manager GUI, select the VM and click Open.

    Virtual Machine Manager#

  4. In the VM GUI, go to Show Virtual Hardware Details > Add Hardware to configure hardware.

    Show Virtual Hardware Details#

  5. Go to Add Hardware > PCI Host Device > VF and click Finish.

    VF Selection#

Then start the VM.

Finally install ROCm on the virtual machine (VM). For detailed instructions, refer to the ROCm Installation Guide. For any issue encountered during installation, write to us here.

Deep Learning Guide#

The following sections cover the different framework installations for ROCm and Deep Learning applications. Fig. 27 provides the sequential flow for the use of each framework. For each framework's most current release notes, refer to the ROCm Compatible Frameworks Release Notes at Deep Learning.

ROCm Compatible Frameworks Flowchart#

Frameworks Installation#

Magma Installation for ROCm#

MAGMA for ROCm#

Matrix Algebra on GPU and Multi-core Architectures, abbreviated as MAGMA, is a collection of next-generation dense linear algebra libraries that is designed for heterogeneous architectures, such as multiple GPUs and multi- or many-core CPUs.

MAGMA provides implementations for CUDA, HIP, Intel Xeon Phi, and OpenCL™. For more information, refer to https://icl.utk.edu/magma/index.html.

Using MAGMA for PyTorch#

Tensors are fundamental to Deep Learning techniques because they provide extensive representational functionality and math operations. This data structure is represented as a multidimensional matrix. MAGMA accelerates tensor operations with a variety of solutions, including driver routines, computational routines, BLAS routines, auxiliary routines, and utility routines.

Build MAGMA from Source#

To build MAGMA from the source, follow these steps:

  1. In the event you want to compile only for your uarch, use:

    export PYTORCH_ROCM_ARCH=<uarch>
    

    <uarch> is the architecture reported by the rocminfo command.

  2. Use the following:

    export PYTORCH_ROCM_ARCH=<uarch>
    
    # "install" hipMAGMA into /opt/rocm/magma by copying after build
    git clone https://bitbucket.org/icl/magma.git
    pushd magma
    # Fixes memory leaks of magma found while executing linalg UTs
    git checkout 5959b8783e45f1809812ed96ae762f38ee701972
    cp make.inc-examples/make.inc.hip-gcc-mkl make.inc
    echo 'LIBDIR += -L$(MKLROOT)/lib' >> make.inc
    echo 'LIB += -Wl,--enable-new-dtags -Wl,--rpath,/opt/rocm/lib -Wl,--rpath,$(MKLROOT)/lib -Wl,--rpath,/opt/rocm/magma/lib' >> make.inc
    echo 'DEVCCFLAGS += --gpu-max-threads-per-block=256' >> make.inc
    export PATH="${PATH}:/opt/rocm/bin"
    if [[ -n "$PYTORCH_ROCM_ARCH" ]]; then
      amdgpu_targets=`echo $PYTORCH_ROCM_ARCH | sed 's/;/ /g'`
    else
      amdgpu_targets=`rocm_agent_enumerator | grep -v gfx000 | sort -u | xargs`
    fi
    for arch in $amdgpu_targets; do
      echo "DEVCCFLAGS += --amdgpu-target=$arch" >> make.inc
    done
    # hipcc with openmp flag may cause isnan() on __device__ not to be found; depending on context, compiler may attempt to match with host definition
    sed -i 's/^FOPENMP/#FOPENMP/g' make.inc
    make -f make.gen.hipMAGMA -j $(nproc)
    LANG=C.UTF-8 make lib/libmagma.so -j $(nproc) MKLROOT=/opt/conda
    make testing/testing_dgemm -j $(nproc) MKLROOT=/opt/conda
    popd
    mv magma /opt/rocm
    

References#

C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens and Z. Wojna, “Rethinking the Inception Architecture for Computer Vision,” CoRR, p. abs/1512.00567, 2015

PyTorch, [Online]. Available: https://pytorch.org/vision/stable/index.html

PyTorch, [Online]. Available: https://pytorch.org/hub/pytorch_vision_inception_v3/

Stanford, [Online]. Available: http://cs231n.stanford.edu/

Wikipedia, [Online]. Available: https://en.wikipedia.org/wiki/Cross_entropy

AMD, “ROCm issues,” [Online]. Available: RadeonOpenCompute/ROCm#issues

PyTorch, [Online image]. https://pytorch.org/assets/brand-guidelines/PyTorch-Brand-Guidelines.pdf

TensorFlow, [Online image]. https://www.tensorflow.org/extras/tensorflow_brand_guidelines.pdf

MAGMA, [Online image]. https://bitbucket.org/icl/magma/src/master/docs/

Docker, [Online]. https://docs.docker.com/get-started/overview/

Torchvision, [Online]. Available https://pytorch.org/vision/master/index.html?highlight=torchvision#module-torchvision

PyTorch Installation for ROCm#

PyTorch#

PyTorch is an open source Machine Learning Python library, primarily differentiated by Tensor computing with GPU acceleration and a type-based automatic differentiation. Other advanced features include:

  • Support for distributed training

  • Native ONNX support

  • C++ front-end

  • The ability to deploy at scale using TorchServe

  • A production-ready deployment mechanism through TorchScript

Installing PyTorch#

To install ROCm on bare metal, refer to the sections GPU and OS Support (Linux) and Compatibility for hardware, software and 3rd-party framework compatibility between ROCm and PyTorch. The recommended option to get a PyTorch environment is through Docker. However, installing the PyTorch wheels package on bare metal is also supported.

Option 2: Install PyTorch Using Wheels Package#

PyTorch supports the ROCm platform by providing tested wheels packages. To access this feature, refer to https://pytorch.org/get-started/locally/ and choose the “ROCm” compute platform. Fig. 28 is a matrix from http://pytorch.org/ that illustrates the installation compatibility between ROCm and the PyTorch build.

Installation Matrix from PyTorch#

To install PyTorch using the wheels package, follow these installation steps:

  1. Choose one of the following options:

    a. Obtain a base Docker image with the correct user-space ROCm version installed from https://hub.docker.com/repository/docker/rocm/dev-ubuntu-20.04.

    or

    b. Download a base OS Docker image and install ROCm following the installation directions in the section Installation. ROCm 5.2 is installed in this example, as supported by the installation matrix from http://pytorch.org/.

    or

    c. Install on bare metal. Skip to Step 3.

    docker run -it --device=/dev/kfd --device=/dev/dri --group-add video rocm/dev-ubuntu-20.04:latest
    
  2. Start the Docker container, if not installing on bare metal.

    docker run -it --device=/dev/kfd --device=/dev/dri --group-add video rocm/dev-ubuntu-20.04:latest
    
  3. Install any dependencies needed for installing the wheels package.

    sudo apt update
    sudo apt install libjpeg-dev python3-dev
    pip3 install wheel setuptools
    
  4. Install torch, torchvision, and torchaudio as specified by the installation matrix.

    Note

    ROCm 5.2 PyTorch wheel in the command below is shown for reference.

    pip3 install --pre torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/nightly/rocm5.2/
    
Option 3: Install PyTorch Using PyTorch ROCm Base Docker Image#

A prebuilt base Docker image is used to build PyTorch in this option. The base Docker has all dependencies installed, including:

  • ROCm

  • Torchvision

  • Conda packages

  • Compiler toolchain

Additionally, a particular environment flag (BUILD_ENVIRONMENT) is set, and the build scripts utilize that to determine the build environment configuration.
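
For illustration, once the container from the steps below is running, you can inspect this flag; the exact value depends on the image and is shown here only as an assumption:

# Inside the rocm/pytorch:latest-base container
echo "$BUILD_ENVIRONMENT"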

Follow these steps:

  1. Obtain the Docker image.

    docker pull rocm/pytorch:latest-base
    

    The above will download the base container, which does not contain PyTorch.

  2. Start a Docker container using the image.

    docker run -it --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 8G rocm/pytorch:latest-base
    

    You can also pass the -v argument to mount any data directories from the host onto the container.

  3. Clone the PyTorch repository.

    cd ~
    git clone https://github.com/pytorch/pytorch.git
    cd pytorch
    git submodule update --init --recursive
    
  4. Build PyTorch for ROCm.

    Note

    By default in the rocm/pytorch:latest-base, PyTorch builds for these architectures simultaneously:

    • gfx900

    • gfx906

    • gfx908

    • gfx90a

    • gfx1030

  5. To determine your AMD uarch, run:

    rocminfo | grep gfx
    
  6. In the event you want to compile only for your uarch, use:

    export PYTORCH_ROCM_ARCH=<uarch>
    

    <uarch> is the architecture reported by the rocminfo command.

  7. Build PyTorch using the following command:

    ./.jenkins/pytorch/build.sh
    

    This will first convert PyTorch sources for HIP compatibility and build the PyTorch framework.

  8. Alternatively, build PyTorch by issuing the following commands:

    python3 tools/amd_build/build_amd.py
    USE_ROCM=1 MAX_JOBS=4 python3 setup.py install --user
    
Option 4: Install Using PyTorch Upstream Docker File#

Instead of using a prebuilt base Docker image, you can build a custom base Docker image using scripts from the PyTorch repository. This will utilize a standard Docker image from operating system maintainers and install all the dependencies required to build PyTorch, including

  • ROCm

  • Torchvision

  • Conda packages

  • Compiler toolchain

Follow these steps:

  1. Clone the PyTorch repository on the host.

    cd ~
    git clone https://github.com/pytorch/pytorch.git
    cd pytorch
    git submodule update --init --recursive
    
  2. Build the PyTorch Docker image.

    cd .circleci/docker
    ./build.sh pytorch-linux-bionic-rocm<version>-py3.7
    # eg. ./build.sh pytorch-linux-bionic-rocm3.10-py3.7
    

    This should complete with the message "Successfully built <image_id>."

  3. Start a Docker container using the image:

    docker run -it --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 8G <image_id>
    

    You can also pass the -v argument to mount any data directories from the host onto the container.

  4. Clone the PyTorch repository.

    cd ~
    git clone https://github.com/pytorch/pytorch.git
    cd pytorch
    git submodule update --init --recursive
    
  5. Build PyTorch for ROCm.

    Note

    By default in the rocm/pytorch:latest-base, PyTorch builds for these architectures simultaneously:

    • gfx900

    • gfx906

    • gfx908

    • gfx90a

    • gfx1030

  6. To determine your AMD uarch, run:

    rocminfo | grep gfx
    
  7. If you want to compile only for your uarch:

    export PYTORCH_ROCM_ARCH=<uarch>
    

    <uarch> is the architecture reported by the rocminfo command.

  8. Build PyTorch using:

    ./.jenkins/pytorch/build.sh
    

This will first convert PyTorch sources to be HIP compatible and then build the PyTorch framework.

Alternatively, build PyTorch by issuing the following commands:

python3 tools/amd_build/build_amd.py
USE_ROCM=1 MAX_JOBS=4 python3 setup.py install --user
Test the PyTorch Installation#

You can use PyTorch unit tests to validate a PyTorch installation. If using a prebuilt PyTorch Docker image from AMD ROCm DockerHub or installing an official wheels package, these tests are already run on those configurations. Alternatively, you can manually run the unit tests to validate the PyTorch installation fully.

Follow these steps:

  1. Test if PyTorch is installed and accessible by importing the torch package in Python.

    Note

    Do not run in the PyTorch git folder.

    python3 -c 'import torch' 2> /dev/null && echo 'Success' || echo 'Failure'
    
  2. Test if the GPU is accessible from PyTorch. In the PyTorch framework, torch.cuda is a generic mechanism to access the GPU; it will access an AMD GPU only if available.

    python3 -c 'import torch; print(torch.cuda.is_available())'
    
  3. Run the unit tests to validate the PyTorch installation fully. Run the following command from the PyTorch home directory:

    BUILD_ENVIRONMENT=${BUILD_ENVIRONMENT:-rocm} ./.jenkins/pytorch/test.sh
    

    This ensures that even for wheel installs in a non-controlled environment, the required environment variable will be set to skip certain unit tests for ROCm.

    Note

    Make sure the PyTorch source code corresponds to the PyTorch wheel or installation in the Docker image. Incompatible PyTorch source code might give errors when running the unit tests.

    This will first install some dependencies, such as a supported torchvision version for PyTorch. torchvision is used in some PyTorch tests for loading models. Next, this will run all the unit tests.

    Note

    Some tests may be skipped, as appropriate, based on your system configuration. Not all features of PyTorch are supported on ROCm, and the tests that evaluate these features are skipped. In addition, depending on the host memory or the number of available GPUs, other tests may be skipped. No test should fail if the compilation and installation are correct.

  4. Run individual unit tests with the following command:

    PYTORCH_TEST_WITH_ROCM=1 python3 test/test_nn.py --verbose
    

    test_nn.py can be replaced with any other test set.

Run a Basic PyTorch Example#

The PyTorch examples repository provides basic examples that exercise the functionality of the framework. The MNIST (Modified National Institute of Standards and Technology) database is a collection of handwritten digits that may be used to train a Convolutional Neural Network for handwriting recognition. Alternatively, ImageNet is a database of images used to train a network for visual object recognition.

Follow these steps:

  1. Clone the PyTorch examples repository.

    git clone https://github.com/pytorch/examples.git
    
  2. Run the MNIST example.

    cd examples/mnist
    
  3. Follow the instructions in the README file in this folder. In this case:

    pip3 install -r requirements.txt
    python3 main.py
    
  4. Run the ImageNet example.

    cd examples/imagenet
    
  5. Follow the instructions in the README file in this folder. In this case:

    pip3 install -r requirements.txt
    python3 main.py
    

References#

C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens and Z. Wojna, “Rethinking the Inception Architecture for Computer Vision,” CoRR, p. abs/1512.00567, 2015

PyTorch, [Online]. Available: https://pytorch.org/vision/stable/index.html

PyTorch, [Online]. Available: https://pytorch.org/hub/pytorch_vision_inception_v3/

Stanford, [Online]. Available: http://cs231n.stanford.edu/

Wikipedia, [Online]. Available: https://en.wikipedia.org/wiki/Cross_entropy

AMD, “ROCm issues,” [Online]. Available: RadeonOpenCompute/ROCm#issues

PyTorch, [Online image]. https://pytorch.org/assets/brand-guidelines/PyTorch-Brand-Guidelines.pdf

TensorFlow, [Online image]. https://www.tensorflow.org/extras/tensorflow_brand_guidelines.pdf

MAGMA, [Online image]. https://bitbucket.org/icl/magma/src/master/docs/

Docker, [Online]. https://docs.docker.com/get-started/overview/

Torchvision, [Online]. Available https://pytorch.org/vision/master/index.html?highlight=torchvision#module-torchvision

TensorFlow Installation for ROCm#

TensorFlow#

TensorFlow is an open source library for solving Machine Learning, Deep Learning, and Artificial Intelligence problems. It can be used to solve many problems across different sectors and industries but primarily focuses on training and inference in neural networks. It is one of the most popular and in-demand frameworks and is very active in open source contribution and development.

Installing TensorFlow#

The following sections contain options for installing TensorFlow.

Option 1: Install TensorFlow Using Docker Image#

To install ROCm on bare metal, follow the section Installation (Linux). The recommended option to get a TensorFlow environment is through Docker.

Using Docker provides portability and access to a prebuilt Docker container that has been rigorously tested within AMD. This might also save compilation time and should perform as tested without facing potential installation issues. Follow these steps:

  1. Pull the latest public TensorFlow Docker image.

    docker pull rocm/tensorflow:latest
    
  2. Once you have pulled the image, run it by using the command below:

    docker run -it --network=host --device=/dev/kfd --device=/dev/dri --ipc=host --shm-size 16G --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined rocm/tensorflow:latest
    
Option 2: Install TensorFlow Using Wheels Package#

To install TensorFlow using the wheels package, follow these steps:

  1. Check the Python version.

    python3 --version
    

    • If the Python version is less than 3.7, upgrade Python.

    • If the Python version is 3.7 or greater, skip this step and go to Step 3.

    Note

    The supported Python versions are:

    • 3.7

    • 3.8

    • 3.9

    • 3.10

    sudo apt-get install python3.7 # or python3.8, python3.9, or python3.10
    
  2. Set up multiple Python versions using update-alternatives.

    update-alternatives --query python3
    sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python[version] [priority]
    

    Note

    Follow the instruction in Step 2 for incompatible Python versions.

    sudo update-alternatives --config python3
    
  3. Follow the screen prompts, and select the Python version installed in Step 2.

  4. Install or upgrade PIP.

    sudo apt install python3-pip
    

    To install PIP, use the following:

    /usr/bin/python[version]  -m pip install --upgrade pip
    

    Upgrade PIP for Python version installed in step 2:

    sudo pip3 install --upgrade pip
    
  5. Install TensorFlow for the Python version as indicated in Step 2.

    /usr/bin/python[version] -m pip install --user tensorflow-rocm==[wheel-version] --upgrade
    

    For a valid wheel version for a given ROCm release, refer to the tensorflow-rocm compatibility note at the end of this section.
    
  6. Update protobuf to 3.19 or lower.

    /usr/bin/python3.7  -m pip install protobuf==3.19.0
    sudo pip3 install tensorflow
    
  7. Set the environment variable PYTHONPATH.

    export PYTHONPATH="./.local/lib/python[version]/site-packages:$PYTHONPATH"  #Use same python version as in step 2
    
  8. Install libraries.

    sudo apt install rocm-libs rccl
    
  9. Test installation.

    python3 -c 'import tensorflow' 2> /dev/null && echo 'Success' || echo 'Failure'
    

    Note

    For details on tensorflow-rocm wheels and ROCm version compatibility, see: ROCmSoftwarePlatform/tensorflow-upstream

Test the TensorFlow Installation#

To test the installation of TensorFlow, run the container image as specified in the previous section Installing TensorFlow. Ensure you have access to the Python shell in the Docker container.

python3 -c 'import tensorflow' 2> /dev/null && echo 'Success' || echo 'Failure'
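
Optionally, you can also confirm that TensorFlow can see the GPU devices; this sketch uses the standard TensorFlow 2.x device-listing API:

# Should print one entry per visible GPU device
python3 -c 'import tensorflow as tf; print(tf.config.list_physical_devices("GPU"))'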
Run a Basic TensorFlow Example#

The TensorFlow examples repository provides basic examples that exercise the framework’s functionality. The MNIST database is a collection of handwritten digits that may be used to train a Convolutional Neural Network for handwriting recognition.

Follow these steps:

  1. Clone the TensorFlow example repository.

    cd ~
    git clone https://github.com/tensorflow/models.git
    
  2. Install the dependencies of the code, and run the code.

    pip3 install -r requirements.txt
    python3 mnist_tf.py
    

References#

C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens and Z. Wojna, “Rethinking the Inception Architecture for Computer Vision,” CoRR, p. abs/1512.00567, 2015

PyTorch, [Online]. Available: https://pytorch.org/vision/stable/index.html

PyTorch, [Online]. Available: https://pytorch.org/hub/pytorch_vision_inception_v3/

Stanford, [Online]. Available: http://cs231n.stanford.edu/

Wikipedia, [Online]. Available: https://en.wikipedia.org/wiki/Cross_entropy

AMD, “ROCm issues,” [Online]. Available: RadeonOpenCompute/ROCm#issues

PyTorch, [Online image]. https://pytorch.org/assets/brand-guidelines/PyTorch-Brand-Guidelines.pdf

TensorFlow, [Online image]. https://www.tensorflow.org/extras/tensorflow_brand_guidelines.pdf

MAGMA, [Online image]. https://bitbucket.org/icl/magma/src/master/docs/

Docker, [Online]. https://docs.docker.com/get-started/overview/

Torchvision, [Online]. Available https://pytorch.org/vision/master/index.html?highlight=torchvision#module-torchvision

GPU-Enabled MPI#

The Message Passing Interface (MPI) is a standard API for distributed and parallel application development that can scale to multi-node clusters. To facilitate the porting of applications to clusters with GPUs, ROCm enables various technologies. These technologies allow users to directly use GPU pointers in MPI calls and enable ROCm-aware MPI libraries to deliver optimal performance for both intra-node and inter-node GPU-to-GPU communication.

The AMD kernel driver exposes Remote Direct Memory Access (RDMA) through the PeerDirect interfaces to allow Host Channel Adapters (HCA, a type of Network Interface Card or NIC) to directly read and write to the GPU device memory with RDMA capabilities. These interfaces are currently registered as a peer_memory_client with Mellanox’s OpenFabrics Enterprise Distribution (OFED) ib_core kernel module to allow high-speed DMA transfers between GPU and HCA. These interfaces are used to optimize inter-node MPI message communication.

This chapter exemplifies how to set up Open MPI with the ROCm platform. The Open MPI project is an open source implementation of the Message Passing Interface (MPI) that is developed and maintained by a consortium of academic, research, and industry partners.

Several MPI implementations can be made ROCm-aware by compiling them with Unified Communication Framework (UCX) support. One notable exception is MVAPICH2: it directly supports AMD GPUs without using UCX, and you can download it here. Use the latest version of the MVAPICH2-GDR package.

The Unified Communication Framework (UCX) is an open source, cross-platform framework whose goal is to provide a common set of communication interfaces targeting a broad set of network programming models and interfaces. UCX is ROCm-aware, and ROCm technologies are used directly to implement various network operation primitives. For more details on the UCX design, refer to its documentation.

Building UCX#

The following section describes how to set up UCX so it can be used to compile Open MPI. The following environment variables are set such that all software components are installed in the same base directory (we assume installation into your home directory; for other locations, adjust the environment variables below accordingly and make sure you have write permission for that location):

export INSTALL_DIR=$HOME/ompi_for_gpu
export BUILD_DIR=/tmp/ompi_for_gpu_build
mkdir -p $BUILD_DIR

The following sequences of build commands assume that either the ROCmCC or the AOMP compiler is active in the environment, as it will be used to execute the commands.

Install UCX#

The next step is to set up UCX by compiling its source code and installing it:

export UCX_DIR=$INSTALL_DIR/ucx
cd $BUILD_DIR
git clone https://github.com/openucx/ucx.git -b v1.14.1
cd ucx
./autogen.sh
mkdir build
cd build
../contrib/configure-release --prefix=$UCX_DIR \
    --with-rocm=/opt/rocm \
    --without-cuda --enable-optimizations --disable-logging \
    --disable-debug --disable-assertions \
    --disable-params-check --without-java
make -j $(nproc)
make -j $(nproc) install
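
After the build completes, you can optionally check that the ROCm transports were detected; ucx_info ships with UCX, and the exact output format may vary between versions:

# List the transports and devices UCX was built with; ROCm entries indicate GPU support
$UCX_DIR/bin/ucx_info -d | grep -i rocm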

The following table documents the compatibility of UCX versions with ROCm versions.

Install Open MPI#

These are the steps to build Open MPI:

export OMPI_DIR=$INSTALL_DIR/ompi
cd $BUILD_DIR
git clone --recursive https://github.com/open-mpi/ompi.git \
    -b v5.0.x
cd ompi
./autogen.pl
mkdir build
cd build
../configure --prefix=$OMPI_DIR --with-ucx=$UCX_DIR \
    --with-rocm=/opt/rocm \
    --enable-mca-no-build=btl-uct --enable-mpi1-compatibility \
    CC=clang CXX=clang++ FC=flang
make -j $(nproc)
make -j $(nproc) install
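
Optionally, verify that the resulting Open MPI build picked up the UCX components; ompi_info is part of the Open MPI installation:

# The UCX-based components (e.g., the pml) should be listed
$OMPI_DIR/bin/ompi_info | grep -i ucx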

ROCm-enabled OSU#

The OSU Micro Benchmarks v5.9 (OMB) can be used to evaluate the performance of various primitives with an AMD GPU device and ROCm support. This functionality is exposed when configured with --enable-rocm option. We can use the following steps to compile OMB:

export OSU_DIR=$INSTALL_DIR/osu
cd $BUILD_DIR
wget http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-5.9.tar.gz
tar xfz osu-micro-benchmarks-5.9.tar.gz
cd osu-micro-benchmarks-5.9
./configure --prefix=$INSTALL_DIR/osu --enable-rocm \
    --with-rocm=/opt/rocm \
    CC=$OMPI_DIR/bin/mpicc CXX=$OMPI_DIR/bin/mpicxx \
    LDFLAGS="-L$OMPI_DIR/lib/ -lmpi -L/opt/rocm/lib/ \
    $(hipconfig -C) -lamdhip64" CXXFLAGS="-std=c++11"
make -j $(nproc)

Intra-node Run#

Before running an Open MPI job, it is essential to set some environment variables to ensure that the correct version of Open MPI and UCX is being used.

export LD_LIBRARY_PATH=$OMPI_DIR/lib:$UCX_DIR/lib:/opt/rocm/lib
export PATH=$OMPI_DIR/bin:$PATH

The following command runs the OSU bandwidth benchmark between the first two GPU devices (i.e., GPU 0 and GPU 1, same OAM) by default inside the same node. It measures the unidirectional bandwidth from the first device to the other.

$OMPI_DIR/bin/mpirun -np 2               \
   -x UCX_TLS=sm,self,rocm               \
   --mca pml ucx mpi/pt2pt/osu_bw -d rocm D D

To select different devices, for example 2 and 3, use the following command:

export HIP_VISIBLE_DEVICES=2,3
export HSA_ENABLE_SDMA=0

The following output shows the effective transfer bandwidth measured for inter-die data transfer between GPU device 2 and 3 (same OAM). For messages larger than 67MB, an effective utilization of about 150GB/sec is achieved, which corresponds to 75% of the peak transfer bandwidth of 200GB/sec for that connection:

Inter-GPU bandwidth with various payload sizes.#

Collective Operations#

Collective Operations on GPU buffers are best handled through the Unified Collective Communication Library (UCC) component in Open MPI. For this, the UCC library has to be configured and compiled with ROCm support. An example for configuring UCC and Open MPI with ROCm support is shown below:

export UCC_DIR=$INSTALL_DIR/ucc
git clone https://github.com/openucx/ucc.git
cd ucc
./configure --with-rocm=/opt/rocm \
            --with-ucx=$UCX_DIR   \
            --prefix=$UCC_DIR
make -j && make install

# Configure and compile Open MPI with UCX, UCC, and ROCm support
cd ompi
./configure --with-rocm=/opt/rocm  \
            --with-ucx=$UCX_DIR    \
            --with-ucc=$UCC_DIR    \
            --prefix=$OMPI_DIR

Using the UCC component with an MPI application requires setting some additional parameters:

mpirun --mca pml ucx --mca osc ucx \
       --mca coll_ucc_enable 1     \
       --mca coll_ucc_priority 100 -np 64 ./my_mpi_app

System Debugging Guide#

ROCm Language and System Level Debug, Flags, and Environment Variables#

Kernel options to avoid the Ethernet port being renamed every time you change graphics cards: net.ifnames=0 biosdevname=0

ROCr Error Code#

  • 2 Invalid Dimension

  • 4 Invalid Group Memory

  • 8 Invalid (or Null) Code

  • 32 Invalid Format

  • 64 Group is too large

  • 128 Out of VGPRs

  • 0x80000000 Debug Options

Command to Dump Firmware Version and Get Linux Kernel Version#

sudo cat /sys/kernel/debug/dri/1/amdgpu_firmware_info

uname -a

Debug Flags#

Debug messages are useful when developing or debugging the base ROCm driver. You can enable printing from libhsakmt.so by setting the environment variable HSAKMT_DEBUG_LEVEL. Available debug levels are 3-7. The higher the level you set, the more messages will print.

  • export HSAKMT_DEBUG_LEVEL=3 : Only pr_err() prints.

  • export HSAKMT_DEBUG_LEVEL=4 : pr_err() and pr_warn() print.

  • export HSAKMT_DEBUG_LEVEL=5 : "notice" is currently not implemented; setting the level to 5 is the same as setting it to 4.

  • export HSAKMT_DEBUG_LEVEL=6 : pr_err(), pr_warn(), and pr_info() print.

  • export HSAKMT_DEBUG_LEVEL=7 : Everything, including pr_debug(), prints.
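
For example, to capture the most verbose output while reproducing a problem, export the variable before launching the application (./my_rocm_app is a placeholder for your own ROCm application):

# Placeholder application name; replace with your own ROCm application
export HSAKMT_DEBUG_LEVEL=7
./my_rocm_app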

ROCr Level Environment Variables for Debug#

HSA_ENABLE_SDMA=0

HSA_ENABLE_INTERRUPT=0

HSA_SVM_GUARD_PAGES=0

HSA_DISABLE_CACHE=1

Turn Off Page Retry on GFX9/Vega Devices#

sudo -s

echo 1 > /sys/module/amdkfd/parameters/noretry

HIP Environment Variables 3.x#

OpenCL Debug Flags#

AMD_OCL_WAIT_COMMAND=1 (0 = OFF, 1 = On)

PCIe-Debug#

Refer to ROCm PCIe Debug, https://rocmdocs.amd.com/en/latest/Other_Solutions/PCIe-Debug.html#pcie-debug. For information on how to debug and profile HIP applications, see HIP Debugging.

Machine Learning, Deep Learning, and Artificial Intelligence#

Inception V3 with PyTorch

A collection of detailed and guided examples for working with Inception V3 with PyTorch on ROCm.

Inception V3 with PyTorch#

Deep Learning Training#

Deep Learning models are designed to capture the complexity of the problem and the underlying data. These models are “deep,” comprising multiple component layers. Training is finding the best parameters for each model layer to achieve a well-defined objective.

The training data consists of input features in supervised learning, similar to what the learned model is expected to see during the evaluation or inference phase. The target output is also included, which serves to teach the model. A loss metric is defined as part of training that evaluates the model’s performance during the training process.

Training also includes the choice of an optimization algorithm that reduces the loss by adjusting the model’s parameters. Training is an iterative process where training data is fed in, usually split into different batches, with the entirety of the training data passed during one training epoch. Training usually is run for multiple epochs.

Training Phases#

Training occurs in multiple phases for every batch of training data. Table 14 provides an explanation of the types of training phases.

Types of Training Phases#

  • Forward Pass: The input features are fed into the model, whose parameters may be randomly initialized at first. Activations (outputs) of each layer are retained during this pass to help in the loss gradient computation during the backward pass.

  • Loss Computation: The output is compared against the target outputs, and the loss is computed.

  • Backward Pass: The loss is propagated backward, and the model's error gradients are computed and stored for each trainable parameter.

  • Optimization Pass: The optimization algorithm updates the model parameters using the stored error gradients.

Training is different from inference, particularly from the hardware perspective. Table 15 shows the contrast between training and inference.

Training vs. Inference#

  • Training is measured in hours or days; inference is typically measured in minutes.

  • Training is generally run offline in a data center or cloud setting; inference is often made on edge devices.

  • The memory requirements for training are higher than for inference due to the need to store intermediate data, such as activations and error gradients.

  • Data for training is available on disk before the training process and is generally large. Training performance is measured by how fast batches of data can be processed. Inference data usually arrives stochastically and may be batched to improve performance. Inference performance is generally measured by the throughput to process a batch of data and the delay in responding to the input (latency).

  • Different quantization data types are typically chosen for training (FP32, BF16) and inference (FP16, INT8). The computation hardware has specializations for different datatypes, leading to improved performance when a faster datatype can be selected for the corresponding task.

Case Studies#

The following sections contain case studies for the Inception v3 model.

Inception v3 with PyTorch#

Convolutional Neural Networks are a form of artificial neural network commonly used for image processing. One of the core layers of such a network is the convolutional layer, which convolves the input with a weight tensor and passes the result to the next layer. Inception v3[1] is an architectural development over the ImageNet competition-winning entry, AlexNet, using deeper and broader networks while attempting to meet computational and memory budgets.

The implementation uses PyTorch as a framework. This case study utilizes torchvision[2], a repository of popular datasets and model architectures, for obtaining the model. torchvision also provides pre-trained weights as a starting point to develop new models or fine-tune the model for a new task.

Evaluating a Pre-Trained Model#

This example introduces a simple image classification task using the pre-trained Inception v3 model. It does not involve training but utilizes an already pre-trained model from torchvision.

This example is adapted from the PyTorch research hub page on Inception v3[3].

Follow these steps:

  1. Run the PyTorch ROCm-based Docker image or refer to the section Installing PyTorch for setting up a PyTorch environment on ROCm.

    docker run -it -v $HOME:/data --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 8G rocm/pytorch:latest
    
  2. Run the Python shell and import packages and libraries for model creation.

    import torch
    import torchvision
    
  3. Set the model in evaluation mode. Evaluation mode directs PyTorch not to store intermediate data, which would have been used in training.

    model = torch.hub.load('pytorch/vision:v0.10.0', 'inception_v3', pretrained=True)
    model.eval()
    
  4. Download a sample image for inference.

    import urllib
    url, filename = ("https://github.com/pytorch/hub/raw/master/images/dog.jpg", "dog.jpg")
    try: urllib.URLopener().retrieve(url, filename)
    except: urllib.request.urlretrieve(url, filename)
    
  5. Import torchvision and PIL.Image support libraries.

    from PIL import Image
    from torchvision import transforms
    input_image = Image.open(filename)
    
  6. Apply preprocessing and normalization.

    preprocess = transforms.Compose([
        transforms.Resize(299),
        transforms.CenterCrop(299),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])
    
  7. Preprocess the input image into a tensor, add a batch dimension with unsqueeze, and move the input and model to the GPU if available.

    input_tensor = preprocess(input_image)
    input_batch = input_tensor.unsqueeze(0)
    if torch.cuda.is_available():
        input_batch = input_batch.to('cuda')
        model.to('cuda')
    
  8. Run the model and compute the class probabilities.

    with torch.no_grad():
        output = model(input_batch)
    print(output[0])
    probabilities = torch.nn.functional.softmax(output[0], dim=0)
    print(probabilities)
    
  9. To understand the probabilities, download and examine the ImageNet labels.

    wget https://raw.githubusercontent.com/pytorch/hub/master/imagenet_classes.txt
    
  10. Read the categories and show the top categories for the image.

    with open("imagenet_classes.txt", "r") as f:
        categories = [s.strip() for s in f.readlines()]
    top5_prob, top5_catid = torch.topk(probabilities, 5)
    for i in range(top5_prob.size(0)):
        print(categories[top5_catid[i]], top5_prob[i].item())
    
Training Inception v3#

The previous section focused on downloading and using the Inception v3 model for a simple image classification task. This section walks through training the model on a new dataset.

Follow these steps:

  1. Run the PyTorch ROCm Docker image or refer to the section Installing PyTorch for setting up a PyTorch environment on ROCm.

    docker pull rocm/pytorch:latest
    docker run -it --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 8G rocm/pytorch:latest
    
  2. Download an ImageNet database. This example uses tiny-imagenet-200[4], a smaller ImageNet variant with 200 image classes and a training dataset of 100,000 images downsized to 64x64 color images.

    wget http://cs231n.stanford.edu/tiny-imagenet-200.zip
    
  3. Process the database to set the validation directory to the format expected by PyTorch’s DataLoader.

  4. Run the following script:

    import io
    import glob
    import os
    from shutil import move
    from os.path import join
    from os import listdir, rmdir
    target_folder = './tiny-imagenet-200/val/'
    val_dict = {}
    with open('./tiny-imagenet-200/val/val_annotations.txt', 'r') as f:
        for line in f.readlines():
            split_line = line.split('\t')
            val_dict[split_line[0]] = split_line[1]
    
    paths = glob.glob('./tiny-imagenet-200/val/images/*')
    for path in paths:
        file = path.split('/')[-1]
        folder = val_dict[file]
        if not os.path.exists(target_folder + str(folder)):
            os.mkdir(target_folder + str(folder))
            os.mkdir(target_folder + str(folder) + '/images')
    
    for path in paths:
        file = path.split('/')[-1]
        folder = val_dict[file]
        dest = target_folder + str(folder) + '/images/' + str(file)
        move(path, dest)
    
    rmdir('./tiny-imagenet-200/val/images')
    
  5. Open a Python shell.

  6. Import dependencies, including torch, os, and torchvision.

    import torch
    import os
    import torchvision
    from torchvision import transforms
    from torchvision.transforms.functional import InterpolationMode
    
  7. Set parameters to guide the training process.

    Note

    The device is set to "cuda". In PyTorch, "cuda" is a generic keyword to denote a GPU.

    device = "cuda"
    
  8. Set the data_path to the location of the training and validation data. In this case, the tiny-imagenet-200 is present as a subdirectory to the current directory.

    data_path = "tiny-imagenet-200"
    

    The training image size is cropped for input into Inception v3.

    train_crop_size = 299
    
  9. To smooth the image, use bilinear interpolation, a resampling method that uses the distance weighted average of the four nearest pixel values to estimate a new pixel value.

    interpolation = "bilinear"
    

    The next parameters control the size to which the validation image is cropped and resized.

    val_crop_size = 299
    val_resize_size = 342
    

    The pre-trained Inception v3 model is chosen to be downloaded from torchvision.

    model_name = "inception_v3"
    pretrained = True
    

    During each training step, a batch of images is processed to compute the loss gradient and perform the optimization. In the following setting, the size of the batch is determined.

    batch_size = 32
    

    This refers to the number of CPU threads the data loader uses to perform efficient multi-process data loading.

    num_workers = 16
    

    The torch.optim package provides methods to adjust the learning rate as the training progresses. This example uses the StepLR scheduler, which decays the learning rate by lr_gamma at every lr_step_size number of epochs.

    learning_rate = 0.1
    momentum = 0.9
    weight_decay = 1e-4
    lr_step_size = 30
    lr_gamma = 0.1
    

    Note

    One training epoch is when the neural network passes an entire dataset forward and backward.

    epochs = 90
    

    The train and validation directories are determined.

    train_dir = os.path.join(data_path, "train")
    val_dir = os.path.join(data_path, "val")
    
  10. Set up the training and testing data loaders.

    interpolation = InterpolationMode(interpolation)
    
    TRAIN_TRANSFORM_IMG = transforms.Compose([
        # Normalizing and standardizing the image
        transforms.RandomResizedCrop(train_crop_size, interpolation=interpolation),
        transforms.PILToTensor(),
        transforms.ConvertImageDtype(torch.float),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                            std=[0.229, 0.224, 0.225] )
        ])
    dataset = torchvision.datasets.ImageFolder(
        train_dir,
        transform=TRAIN_TRANSFORM_IMG
    )
    TEST_TRANSFORM_IMG = transforms.Compose([
        transforms.Resize(val_resize_size, interpolation=interpolation),
        transforms.CenterCrop(val_crop_size),
        transforms.PILToTensor(),
        transforms.ConvertImageDtype(torch.float),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                            std=[0.229, 0.224, 0.225] )
        ])
    
    dataset_test = torchvision.datasets.ImageFolder(
        val_dir,
        transform=TEST_TRANSFORM_IMG
    )
    
    print("Creating data loaders")
    train_sampler = torch.utils.data.RandomSampler(dataset)
    test_sampler = torch.utils.data.SequentialSampler(dataset_test)
    
    data_loader = torch.utils.data.DataLoader(
        dataset,
        batch_size=batch_size,
        sampler=train_sampler,
        num_workers=num_workers,
        pin_memory=True
    )
    
    data_loader_test = torch.utils.data.DataLoader(
        dataset_test, batch_size=batch_size, sampler=test_sampler, num_workers=num_workers, pin_memory=True
    )
    

    Note

    Use torchvision to obtain the Inception v3 model. Use the pre-trained model weights to speed up training.

    print("Creating model")
    print("Num classes = ", len(dataset.classes))
    model = torchvision.models.__dict__[model_name](pretrained=pretrained)
    
  11. Adapt Inception v3 for the current dataset. tiny-imagenet-200 contains only 200 classes, whereas Inception v3 is designed for 1,000-class output. The last layer of Inception v3 is replaced to match the output features required.

    model.fc = torch.nn.Linear(model.fc.in_features, len(dataset.classes))
    model.aux_logits = False
    model.AuxLogits = None
    
  12. Move the model to the GPU device.

    model.to(device)
    
  13. Set the loss criteria. For this example, Cross Entropy Loss[5] is used.

    criterion = torch.nn.CrossEntropyLoss()
    
  14. Set the optimizer to Stochastic Gradient Descent.

    optimizer = torch.optim.SGD(
        model.parameters(),
        lr=learning_rate,
        momentum=momentum,
        weight_decay=weight_decay
    )
    
  15. Set the learning rate scheduler.

    lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=lr_step_size, gamma=lr_gamma)
    
  16. Iterate over epochs. Each epoch is a complete pass through the training data.

    print("Start training")
    for epoch in range(epochs):
        model.train()
        epoch_loss = 0
        len_dataset = 0
    
  17. Iterate over steps. The data is processed in batches, and each step passes through a full batch.

    for step, (image, target) in enumerate(data_loader):
    
  18. Pass the image and target to the GPU device.

    image, target = image.to(device), target.to(device)
    

    The following is the core training logic:

    a. The image is fed into the model.

    b. The output is compared with the target in the training data to obtain the loss.

    c. This loss is back propagated to all parameters that require optimization.

    d. The optimizer updates the parameters based on the selected optimization algorithm.

            output = model(image)
            loss = criterion(output, target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    

    The epoch loss is updated, and the step loss prints.

            epoch_loss += output.shape[0] * loss.item()
            len_dataset += output.shape[0];
            if step % 10 == 0:
                print('Epoch: ', epoch, '| step : %d' % step, '| train loss : %0.4f' % loss.item() )
        epoch_loss = epoch_loss / len_dataset
        print('Epoch: ', epoch, '| train loss :  %0.4f' % epoch_loss )
    

    The learning rate is updated at the end of each epoch.

    lr_scheduler.step()
    

    After training for the epoch, the model evaluates against the validation dataset.

        model.eval()
        with torch.inference_mode():
            running_loss = 0
            for step, (image, target) in enumerate(data_loader_test):
                image, target = image.to(device), target.to(device)
    
                output = model(image)
                loss = criterion(output, target)
    
                running_loss += loss.item()
        running_loss = running_loss / len(data_loader_test)
        print('Epoch: ', epoch, '| test loss : %0.4f' % running_loss )
    
  19. Save the model for use in inferencing tasks.

# save model
torch.save(model.state_dict(), "trained_inception_v3.pt")

Plotting the train and test loss shows both metrics reducing over training epochs. This is demonstrated in Fig. 30.

Inception v3 Train and Loss Graph#

Custom Model with CIFAR-10 on PyTorch#

The CIFAR-10 (Canadian Institute for Advanced Research) dataset is a subset of the Tiny Images dataset (which contains 80 million images of 32x32 collected from the Internet) and consists of 60,000 32x32 color images. The images are labeled with one of 10 mutually exclusive classes: airplane, motor car, bird, cat, deer, dog, frog, cruise ship, stallion, and truck (but not pickup truck). There are 6,000 images per class, with 5,000 training and 1,000 testing images per class. Let us prepare a custom model for classifying these images using the PyTorch framework and go step-by-step as illustrated below.

Follow these steps:

  1. Import dependencies, including torch, os, and torchvision.

    import torch
    import torchvision
    import torchvision.transforms as transforms
    import matplotlib.pyplot as plot
    import numpy as np
    
  2. The output of torchvision datasets is PILImage images of range [0, 1]. Transform them to Tensors of normalized range [-1, 1].

    transform = transforms.Compose(
            [transforms.ToTensor(),
                transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
    

    During each training step, a batch of images is processed to compute the loss gradient and perform the optimization. In the following setting, the size of the batch is determined.

    batch_size = 4
    
  3. Download the train and test datasets as follows. Specify the batch size, shuffle the dataset once, and set the number of workers to the number of CPU threads used by the data loader to perform efficient multi-process data loading.

    train_set = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
    train_loader = torch.utils.data.DataLoader(train_set, batch_size=batch_size, shuffle=True, num_workers=2)
    
  4. Follow the same procedure for the testing set.

    test_set = torchvision.datasets.CIFAR10(root='./data', train=False, download=True, transform=transform)
    test_loader = torch.utils.data.DataLoader(test_set, batch_size=batch_size, shuffle=False, num_workers=2)
    print("test set and test loader")
    
  5. Specify the defined classes of images belonging to this dataset.

    classes = ('Aeroplane', 'motorcar', 'bird', 'cat', 'deer', 'puppy', 'frog', 'stallion', 'cruise', 'truck')
    print("defined classes")
    
  6. Denormalize the images and then iterate over them.

    global image_number
    image_number = 0
    def show_image(img):
        global image_number
        image_number = image_number + 1
        img = img / 2 + 0.5     # de-normalizing input image
        npimg = img.numpy()
        plot.imshow(np.transpose(npimg, (1, 2, 0)))
        plot.savefig("fig{}.jpg".format(image_number))
        print("fig{}.jpg".format(image_number))
        plot.show()
    data_iter = iter(train_loader)
    images, labels = next(data_iter)
    show_image(torchvision.utils.make_grid(images))
    print(' '.join('%5s' % classes[labels[j]] for j in range(batch_size)))
    print("image created and saved ")
    
  7. Import the torch.nn for constructing neural networks and torch.nn.functional to use the convolution functions.

    import torch.nn as nn
    import torch.nn.functional as F
    
  8. Define the CNN (Convolutional Neural Network) and relevant activation functions.

    class Net(nn.Module):
        def __init__(self):
            super().__init__()
            self.conv1 = nn.Conv2d(3, 6, 5)
            self.pool = nn.MaxPool2d(2, 2)
            self.conv2 = nn.Conv2d(6, 16, 5)
            # 16 channels of 5x5 feature maps remain after the two conv+pool stages
            self.fc1 = nn.Linear(16 * 5 * 5, 120)
            self.fc2 = nn.Linear(120, 84)
            self.fc3 = nn.Linear(84, 10)

        def forward(self, x):
            x = self.pool(F.relu(self.conv1(x)))
            x = self.pool(F.relu(self.conv2(x)))
            x = torch.flatten(x, 1) # flatten all dimensions except batch
            x = F.relu(self.fc1(x))
            x = F.relu(self.fc2(x))
            x = self.fc3(x)
            return x
    net = Net()
    print("created Net()")
    
  9. Set the optimizer to Stochastic Gradient Descent.

    import torch.optim as optim
    
  10. Set the loss criteria. For this example, Cross Entropy Loss[5] is used.

    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)
    
  11. Iterate over epochs. Each epoch is a complete pass through the training data.

    for epoch in range(2):  # loop over the dataset multiple times
    
        running_loss = 0.0
        for i, data in enumerate(train_loader, 0):
            # get the inputs; data is a list of [inputs, labels]
            inputs, labels = data
    
            # zero the parameter gradients
            optimizer.zero_grad()
    
            # forward + backward + optimize
            outputs = net(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
    
            # print statistics
            running_loss += loss.item()
            if i % 2000 == 1999:    # print every 2000 mini-batches
                print('[%d, %5d] loss: %.3f' % (epoch + 1, i + 1, running_loss / 2000))
                running_loss = 0.0
    print('Finished Training')
    
    PATH = './cifar_net.pth'
    torch.save(net.state_dict(), PATH)
    print("saved model to path :",PATH)
    net = Net()
    net.load_state_dict(torch.load(PATH))
    print("loding back saved model")
    outputs = net(images)
    _, predicted = torch.max(outputs, 1)
    print('Predicted: ', ' '.join('%5s' % classes[predicted[j]] for j in range(4)))
    correct = 0
    total = 0
    

    As this is not training, calculating the gradients for outputs is not required.

    # calculate outputs by running images through the network
    with torch.no_grad():
        for data in test_loader:
            images, labels = data
            # calculate outputs by running images through the network
            outputs = net(images)
            # the class with the highest energy is what you can choose as prediction
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    print('Accuracy of the network on the 10000 test images: %d %%' % ( 100 * correct / total))
    # prepare to count predictions for each class
    correct_pred = {classname: 0 for classname in classes}
    total_pred = {classname: 0 for classname in classes}
    
    # again no gradients needed
    with torch.no_grad():
        for data in test_loader:
            images, labels = data
            outputs = net(images)
            _, predictions = torch.max(outputs, 1)
            # collect the correct predictions for each class
            for label, prediction in zip(labels, predictions):
                if label == prediction:
                    correct_pred[classes[label]] += 1
                total_pred[classes[label]] += 1
    # print accuracy for each class
    for classname, correct_count in correct_pred.items():
        accuracy = 100 * float(correct_count) / total_pred[classname]
        print("Accuracy for class {:5s} is: {:.1f} %".format(classname,accuracy))
    
Case Study: TensorFlow with Fashion MNIST#

Fashion MNIST is a dataset that contains 70,000 grayscale images in 10 categories.

Implement and train a neural network model using the TensorFlow framework to classify images of clothing, like sneakers and shirts.

The dataset has 60,000 images you will use to train the network and 10,000 to evaluate how accurately the network learned to classify images. The Fashion MNIST dataset can be accessed via TensorFlow internal libraries.

Access the source code from the following repository:

ROCmSoftwarePlatform/tensorflow_fashionmnist

To understand the code step by step, follow these steps:

  1. Import libraries like TensorFlow, NumPy, and Matplotlib to train the neural network and calculate and plot graphs.

    import tensorflow as tf
    import numpy as np
    import matplotlib.pyplot as plt
    
  2. To verify that TensorFlow is installed, print the version of TensorFlow by using the below print statement:

    print(tf.__version__)
    
  3. Load the dataset from the available internal libraries to analyze and train a neural network on the Fashion MNIST dataset. Loading the dataset returns four NumPy arrays. The model uses the training set arrays, train_images and train_labels, to learn.

  4. The model is tested against the test set: the test_images and test_labels arrays.

    fashion_mnist = tf.keras.datasets.fashion_mnist
    (train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()
    

    Since you have 10 types of images in the dataset, assign labels from zero to nine. Each image is assigned one label. The images are 28x28 NumPy arrays, with pixel values ranging from zero to 255.

  5. Each image is mapped to a single label. Since the class names are not included with the dataset, store them, and later use them when plotting the images:

    class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat','Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']
    
  6. Use this code to explore the dataset by checking its dimensions:

    train_images.shape
    
  7. Use this code to print the size of this training set:

    print(len(train_labels))
    
  8. Use this code to print the labels of this training set:

    print(train_labels)
    
  9. Preprocess the data before training the network. Start by inspecting the first image; its pixel values fall in the range of zero to 255.

    plt.figure()
    plt.imshow(train_images[0])
    plt.colorbar()
    plt.grid(False)
    plt.show()
    
    _images/mnist_1.png
  10. From the above picture, you can see that the values range from zero to 255. Before training the neural network, you must scale them to the range of zero to one. Hence, divide the values by 255.

    train_images = train_images / 255.0
    
    test_images = test_images / 255.0
    
  11. To ensure the data is in the correct format and ready to build and train the network, display the first 25 images from the training set and the class name below each image.

    plt.figure(figsize=(10,10))
    for i in range(25):
        plt.subplot(5,5,i+1)
        plt.xticks([])
        plt.yticks([])
        plt.grid(False)
        plt.imshow(train_images[i], cmap=plt.cm.binary)
        plt.xlabel(class_names[train_labels[i]])
    plt.show()
    
    _images/mnist_2.png

    The basic building block of a neural network is the layer. Layers extract representations from the data fed into them. Deep Learning consists of chaining together simple layers. Most layers, such as tf.keras.layers.Dense, have parameters that are learned during training.

    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(10)
    ])
    
    • The first layer in this network tf.keras.layers.Flatten transforms the format of the images from a two-dimensional array (of 28 x 28 pixels) to a one-dimensional array (of 28 * 28 = 784 pixels). Think of this layer as unstacking rows of pixels in the image and lining them up. This layer has no parameters to learn; it only reformats the data.

    • After the pixels are flattened, the network consists of a sequence of two tf.keras.layers.Dense layers. These are densely connected or fully connected neural layers. The first Dense layer has 128 nodes (or neurons). The second (and last) layer returns a logits array with a length of 10. Each node contains a score that indicates the current image belongs to one of the 10 classes.
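
    As an optional check (not part of the original steps), you can print a Keras model summary to confirm the layer shapes described above:

    # Prints each layer with its output shape and parameter count.
    model.summary()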

  12. You must add the Loss function, Metrics, and Optimizer at the time of model compilation.

    model.compile(optimizer='adam',
                loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                metrics=['accuracy'])
    
    • Loss function —This measures how accurate the model is during training. You want to minimize this function to “steer” the model in the right direction.

    • Optimizer —This is how the model is updated based on the data it sees and its loss function.

    • Metrics —This is used to monitor the training and testing steps.

    The following example uses accuracy, the fraction of the correctly classified images.

    To train the neural network model, follow these steps:

    1. Feed the training data to the model. The training data is in the train_images and train_labels arrays in this example. The model learns to associate images and labels.

    2. Ask the model to make predictions about a test set—in this example, the test_images array.

    3. Verify that the predictions match the labels from the test_labels array.

    4. To start training, call the model.fit method, so called because it “fits” the model to the training data.

      model.fit(train_images, train_labels, epochs=10)
      
    5. Compare how the model will perform on the test dataset.

      test_loss, test_acc = model.evaluate(test_images,  test_labels, verbose=2)
      
      print('\nTest accuracy:', test_acc)
      
    6. With the model trained, you can use it to make predictions about some images. The model outputs linear values (logits). Attach a softmax layer to convert the logits to probabilities, which are easier to interpret.

      probability_model = tf.keras.Sequential([model,
                                              tf.keras.layers.Softmax()])
      
      predictions = probability_model.predict(test_images)
      
    7. The model has predicted the label for each image in the testing set. Look at the first prediction:

      predictions[0]
      

      A prediction is an array of 10 numbers. They represent the model’s “confidence” that the image corresponds to each of the 10 different articles of clothing. You can see which label has the highest confidence value:

      np.argmax(predictions[0])
      
    8. Plot a graph to look at the complete set of 10 class predictions.

      def plot_image(i, predictions_array, true_label, img):
          true_label, img = true_label[i], img[i]
          plt.grid(False)
          plt.xticks([])
          plt.yticks([])

          plt.imshow(img, cmap=plt.cm.binary)

          predicted_label = np.argmax(predictions_array)
          if predicted_label == true_label:
              color = 'blue'
          else:
              color = 'red'

          plt.xlabel("{} {:2.0f}% ({})".format(class_names[predicted_label],
                                               100*np.max(predictions_array),
                                               class_names[true_label]),
                     color=color)

      def plot_value_array(i, predictions_array, true_label):
          true_label = true_label[i]
          plt.grid(False)
          plt.xticks(range(10))
          plt.yticks([])
          thisplot = plt.bar(range(10), predictions_array, color="#777777")
          plt.ylim([0, 1])
          predicted_label = np.argmax(predictions_array)

          thisplot[predicted_label].set_color('red')
          thisplot[true_label].set_color('blue')
      
    9. With the model trained, you can use it to make predictions about some images. Review the 0-th image predictions and the prediction array. Correct prediction labels are blue, and incorrect prediction labels are red. The number gives the percentage (out of 100) for the predicted label.

      i = 0
      plt.figure(figsize=(6,3))
      plt.subplot(1,2,1)
      plot_image(i, predictions[i], test_labels, test_images)
      plt.subplot(1,2,2)
      plot_value_array(i, predictions[i],  test_labels)
      plt.show()
      
      _images/mnist_3.png
      i = 12
      plt.figure(figsize=(6,3))
      plt.subplot(1,2,1)
      plot_image(i, predictions[i], test_labels, test_images)
      plt.subplot(1,2,2)
      plot_value_array(i, predictions[i],  test_labels)
      plt.show()
      
      _images/mnist_4.png
    10. Use the trained model to predict a single image.

      # Grab an image from the test dataset.
      img = test_images[1]
      print(img.shape)
      
    11. tf.keras models are optimized to make predictions on a batch, or collection, of examples at once. Accordingly, even though you are using a single image, you must add it to a list.

      # Add the image to a batch where it's the only member.
      img = (np.expand_dims(img,0))
      
      print(img.shape)
      
    12. Predict the correct label for this image.

      predictions_single = probability_model.predict(img)
      
      print(predictions_single)
      
      plot_value_array(1, predictions_single[0], test_labels)
      _ = plt.xticks(range(10), class_names, rotation=45)
      plt.show()
      
      _images/mnist_5.png
    13. tf.keras.Model.predict returns a list of lists—one for each image in the batch of data. Grab the predictions for our (only) image in the batch.

      np.argmax(predictions_single[0])
      
Case Study: TensorFlow with Text Classification#

This procedure demonstrates text classification starting from plain text files stored on disk. You will train a binary classifier to perform sentiment analysis on an IMDB dataset. At the end of the notebook, there is an exercise for you to try in which you will train a multi-class classifier to predict the tag for a programming question on Stack Overflow.

Follow these steps:

  1. Import the necessary libraries.

    import matplotlib.pyplot as plt
    import os
    import re
    import shutil
    import string
    import tensorflow as tf
    
    from tensorflow.keras import layers
    from tensorflow.keras import losses
    
  2. Get the data for the text classification, and extract the IMDB dataset from the following link.

    url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
    
    dataset = tf.keras.utils.get_file("aclImdb_v1", url,
                                        untar=True, cache_dir='.',
                                        cache_subdir='')
    
    Downloading data from https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
    84131840/84125825 [==============================]  1s 0us/step
    84149932/84125825 [==============================]  1s 0us/step
    
  3. Fetch the data from the directory.

    dataset_dir = os.path.join(os.path.dirname(dataset), 'aclImdb')
    print(os.listdir(dataset_dir))
    
  4. Load the data for training purposes.

    train_dir = os.path.join(dataset_dir, 'train')
    os.listdir(train_dir)
    
    ['labeledBow.feat',
    'urls_pos.txt',
    'urls_unsup.txt',
    'unsup',
    'pos',
    'unsupBow.feat',
    'urls_neg.txt',
    'neg']
    
  5. The directories contain many text files, each of which is a single movie review. To look at one of them, use the following:

    sample_file = os.path.join(train_dir, 'pos/1181_9.txt')
    with open(sample_file) as f:
        print(f.read())
    
  6. As the IMDB dataset contains additional folders, remove them before using this utility.

    remove_dir = os.path.join(train_dir, 'unsup')
    shutil.rmtree(remove_dir)
    batch_size = 32
    seed = 42
    
  7. The IMDB dataset has already been divided into train and test but lacks a validation set. Create a validation set using an 80:20 split of the training data by using the validation_split argument below:

    raw_train_ds=tf.keras.utils.text_dataset_from_directory('aclImdb/train',batch_size=batch_size, validation_split=0.2,subset='training', seed=seed)
    
  8. As you will see in a moment, you can train a model by passing a dataset directly to model.fit. If you are new to tf.data, you can also iterate over the dataset and print a few examples as follows:

    for text_batch, label_batch in raw_train_ds.take(1):
        for i in range(3):
            print("Review", text_batch.numpy()[i])
            print("Label", label_batch.numpy()[i])
    
  9. The labels are zero or one. To see which of these correspond to positive and negative movie reviews, check the class_names property on the dataset.

    print("Label 0 corresponds to", raw_train_ds.class_names[0])
    print("Label 1 corresponds to", raw_train_ds.class_names[1])
    
  10. Next, create the validation and test datasets. Use the remaining 5,000 reviews from the training set for validation (roughly 2,500 reviews from each of the two classes).

    raw_val_ds = tf.keras.utils.text_dataset_from_directory('aclImdb/train',
        batch_size=batch_size, validation_split=0.2, subset='validation', seed=seed)
    
    raw_test_ds = tf.keras.utils.text_dataset_from_directory(
        'aclImdb/test',
        batch_size=batch_size)
    

To prepare the data for training, follow these steps:

  1. Standardize, tokenize, and vectorize the data using the helpful tf.keras.layers.TextVectorization layer.

    def custom_standardization(input_data):
        lowercase = tf.strings.lower(input_data)
        stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')
        return tf.strings.regex_replace(stripped_html,
                                        '[%s]' % re.escape(string.punctuation), '')
    
  2. Create a TextVectorization layer. Use this layer to standardize, tokenize, and vectorize our data. Set the output_mode to int to create unique integer indices for each token. Note that we are using the default split function and the custom standardization function you defined above. You will also define some constants for the model, like an explicit maximum sequence_length, which will cause the layer to pad or truncate sequences to exactly sequence_length values.

    max_features = 10000
    sequence_length = 250
    vectorize_layer = layers.TextVectorization(
        standardize=custom_standardization,
        max_tokens=max_features,
        output_mode='int',
        output_sequence_length=sequence_length)
    
  3. Call adapt to fit the state of the preprocessing layer to the dataset. This causes the model to build an index of strings to integers.

    # Make a text-only dataset (without labels), then call adapt
    train_text = raw_train_ds.map(lambda x, y: x)
    vectorize_layer.adapt(train_text)
    
  4. Create a function to see the result of using this layer to preprocess some data.

    def vectorize_text(text, label):
    text = tf.expand_dims(text, -1)
    return vectorize_layer(text), label
    
    text_batch, label_batch = next(iter(raw_train_ds))
    first_review, first_label = text_batch[0], label_batch[0]
    print("Review", first_review)
    print("Label", raw_train_ds.class_names[first_label])
    print("Vectorized review", vectorize_text(first_review, first_label))
    
    _images/TextClassification_3.png
  5. As you can see above, each token has been replaced by an integer. Look up the token (string) that each integer corresponds to by calling get_vocabulary() on the layer.

    print("1287 ---> ",vectorize_layer.get_vocabulary()[1287])
    print(" 313 ---> ",vectorize_layer.get_vocabulary()[313])
    print('Vocabulary size: {}'.format(len(vectorize_layer.get_vocabulary())))
    
  6. You are nearly ready to train your model. As a final preprocessing step, apply the TextVectorization layer created earlier to the train, validation, and test datasets.

    train_ds = raw_train_ds.map(vectorize_text)
    val_ds = raw_val_ds.map(vectorize_text)
    test_ds = raw_test_ds.map(vectorize_text)
    

    The cache() function keeps data in memory after it is loaded off disk. This ensures the dataset does not become a bottleneck while training your model. If your dataset is too large to fit into memory, you can also use this method to create a performant on-disk cache, which is more efficient to read than many small files (an on-disk example is sketched after the code below).

    The prefetch() function overlaps data preprocessing and model execution while training.

    AUTOTUNE = tf.data.AUTOTUNE
    
    train_ds = train_ds.cache().prefetch(buffer_size=AUTOTUNE)
    val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)
    test_ds = test_ds.cache().prefetch(buffer_size=AUTOTUNE)
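
    For the on-disk cache mentioned above, cache() also accepts a file path. The following is a minimal sketch; the ./imdb_cache path is only an example and is not part of the original steps:

    # Passing a file path makes cache() write to disk instead of keeping data in memory.
    on_disk_train_ds = raw_train_ds.map(vectorize_text).cache('./imdb_cache').prefetch(buffer_size=AUTOTUNE)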
    
  7. Create your neural network.

    embedding_dim = 16
    model = tf.keras.Sequential([
        layers.Embedding(max_features + 1, embedding_dim),
        layers.Dropout(0.2),
        layers.GlobalAveragePooling1D(),
        layers.Dropout(0.2),
        layers.Dense(1)])
    model.summary()
    
    _images/TextClassification_4.png
  8. A model needs a loss function and an optimizer for training. Since this is a binary classification problem and the model outputs logits (a single-unit layer with no activation), use the losses.BinaryCrossentropy loss function with from_logits=True.

    model.compile(loss=losses.BinaryCrossentropy(from_logits=True),
                  optimizer='adam',
                  metrics=tf.metrics.BinaryAccuracy(threshold=0.0))
    
  9. Train the model by passing the dataset object to the fit method.

    epochs = 10
    history = model.fit(train_ds,validation_data=val_ds,epochs=epochs)
    
    _images/TextClassification_5.png
  10. See how the model performs. Two values are returned: loss (a number representing our error; lower values are better) and accuracy.

    loss, accuracy = model.evaluate(test_ds)
    
    print("Loss: ", loss)
    print("Accuracy: ", accuracy)
    

    Note

    model.fit() returns a History object that contains a dictionary with everything that happened during training.

    history_dict = history.history
    history_dict.keys()
    
  11. There are four entries: one for each monitored metric during training and validation. Use these to plot the training and validation loss for comparison, as well as the training and validation accuracy (a sketch of the accuracy plot follows the loss plot below):

    acc = history_dict['binary_accuracy']
    val_acc = history_dict['val_binary_accuracy']
    loss = history_dict['loss']
    val_loss = history_dict['val_loss']
    
    epochs = range(1, len(acc) + 1)
    
    # "bo" is for "blue dot"
    plt.plot(epochs, loss, 'bo', label='Training loss')
    # b is for "solid blue line"
    plt.plot(epochs, val_loss, 'b', label='Validation loss')
    plt.title('Training and validation loss')
    plt.xlabel('Epochs')
    plt.ylabel('Loss')
    plt.legend()
    
    plt.show()
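
    The training and validation accuracy referenced above can be plotted with the same pattern. The following is a minimal sketch that reuses the acc and val_acc values retrieved from history_dict earlier:

    plt.figure()
    # "bo" is for "blue dot"
    plt.plot(epochs, acc, 'bo', label='Training accuracy')
    # "b" is for "solid blue line"
    plt.plot(epochs, val_acc, 'b', label='Validation accuracy')
    plt.title('Training and validation accuracy')
    plt.xlabel('Epochs')
    plt.ylabel('Accuracy')
    plt.legend(loc='lower right')

    plt.show()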
    

    Fig. 31 and Fig. 32 illustrate the training and validation loss and the training and validation accuracy.

    _images/TextClassification_6.png

    Training and Validation Loss#

    _images/TextClassification_7.png

    Training and Validation Accuracy#

  12. Export the model.

    export_model = tf.keras.Sequential([
        vectorize_layer,
        model,
        layers.Activation('sigmoid')
    ])
    
    export_model.compile(
        loss=losses.BinaryCrossentropy(from_logits=False), optimizer="adam", metrics=['accuracy']
    )
    
    # Test it with `raw_test_ds`, which yields raw strings
    loss, accuracy = export_model.evaluate(raw_test_ds)
    print(accuracy)
    
  13. To get predictions for new examples, call predict() on the exported model.

    examples = [
        "The movie was great!",
        "The movie was okay.",
        "The movie was terrible..."
    ]
    
    export_model.predict(examples)
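
    The exported model returns a probability for each example. As a small illustrative sketch (not part of the original notebook), the probabilities can be mapped back to the class names exposed by raw_train_ds:

    probabilities = export_model.predict(examples)
    for text, prob in zip(examples, probabilities):
        # prob is a one-element array holding the sigmoid output;
        # values above 0.5 map to the positive class.
        label = raw_train_ds.class_names[int(prob[0] > 0.5)]
        print("{!r} -> {} ({:.2f})".format(text, label, prob[0]))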
    

References#


About ROCm Documentation#

ROCm documentation is made available under open source licenses. Documentation is built using open source toolchains. Contributions to our documentation are encouraged and welcome. As a contributor, please familiarize yourself with our documentation toolchain.

ReadTheDocs#

ReadTheDocs is the front end for our documentation; that is, it is the tool that serves our HTML-based documentation to end users.

Doxygen#

Doxygen is the most common inline code documentation standard. ROCm projects use Doxygen for public API documentation (unless the upstream project uses a different tool).

Sphinx#

Sphinx is a documentation generator originally used for Python. It is now widely used in the open source community. Originally, Sphinx supported RST-based documentation; Markdown support is now available. ROCm documentation plans to default to Markdown for new projects. Existing projects using RST are under no obligation to convert to Markdown. New projects that believe Markdown is not suitable should contact the documentation team prior to selecting RST.

MyST#

Markedly Structured Text (MyST) is an extended flavor of Markdown (CommonMark) influenced by reStructuredText (RST) and Sphinx. It is integrated via myst-parser. A cheat sheet that showcases how to use the MyST syntax is available over at the Jupyter reference.

Sphinx Theme#

ROCm is using the Sphinx Book Theme. This theme is used by Jupyter Books. ROCm documentation applies some customization, including a header and footer, on top of the Sphinx Book Theme. A custom ROCm theme is a future documentation goal.

Sphinx Design#

Sphinx Design is an extension for Sphinx-based websites that adds design functionality. Please see the documentation here. ROCm documentation uses Sphinx Design for grids, cards, and synchronized tabs. Other features may be used in the future.

Sphinx External TOC#

ROCm uses sphinx-external-toc for navigation. This tool allows a YAML file-based left navigation menu. It was selected due to its flexibility, which allows scripts to operate on the YAML file. Please transition to this file for the project’s navigation. You can see the _toc.yml.in file in the docs/sphinx folder of this repository for an example.

Breathe#

Sphinx uses Breathe to integrate Doxygen content.

rocm-docs-core pip package#

rocm-docs-core is an AMD maintained project that applies customization for our documentation. This project is the tool most ROCm repositories will use as part of the documentation build.

Contributing to ROCm Docs#

AMD values and encourages the ROCm community to contribute to our code and documentation. This repository is focused on ROCm documentation, and this contribution guide describes the recommended method for creating and modifying our documentation.

While interacting with ROCm documentation, we encourage you to be polite and respectful in your contributions, content or otherwise. Authors and maintainers of these docs act with good intentions and to the best of their knowledge. Keep that in mind while you engage. Should you have issues with contributing itself, refer to the discussions on the GitHub repository.

Supported Formats#

Our documentation includes both markdown and rst files. Markdown is encouraged over rst due to the lower barrier to participation. GitHub flavored markdown is preferred for all submissions as it will render accurately on our GitHub repositories. For existing documentation, MyST markdown is used to implement certain features unsupported in GitHub markdown. This is not encouraged for new documentation. AMD will transition to stricter use of GitHub flavored markdown with a few caveats. ROCm documentation also uses sphinx-design in our markdown and rst files. We also will use breathe syntax for doxygen documentation in our markdown files. Other design elements for effective HTML rendering of the documents may be added to our markdown files. Please see GitHub’s guide on writing and formatting on GitHub as a starting point.

ROCm documentation adds additional requirements to markdown and rst based files as follows:

  • Level one headers are only used for page titles. There must be only one level 1 header per file for both Markdown and Restructured Text.

  • Pass markdownlint check via our automated github action on a Pull Request (PR).

Filenames and folder structure#

Please use snake case for file names. Our documentation follows pitchfork for folder structure. All documentation is in /docs except for special files like the contributing guide in the / folder. All images used in the documentation are placed in the /docs/data folder.

How to provide feedback for ROCm documentation#

There are three standard ways to provide feedback for this repository.

Pull Request#

All contributions to ROCm documentation should arrive via the GitHub Flow targeting the develop branch of the repository. If you are unable to contribute via the GitHub Flow, feel free to email us. TODO, confirm email address.

GitHub Issue#

Issues on existing or absent docs can be filed as GitHub issues.

Email Feedback#

Language and Style#

ROCm documentation adopts the Microsoft CPP-Docs guidelines for Voice and Tone.

ROCm documentation templates will be made public shortly. ROCm templates dictate the recommended structure and flow of the documentation. Guidelines on how to integrate figures, equations, and tables are all based on MyST.

Font size and selection, page layout, white space control, and other formatting details are controlled via rocm-docs-core, a Sphinx extension. Please raise issues in rocm-docs-core for any formatting concerns and requested changes.

Building Documentation#

While contributing, one may build the documentation locally on the command-line or rely on Continuous Integration for previewing the resulting HTML pages in a browser.

Command line documentation builds#

Python versions known to build documentation:

  • 3.8

To build the docs locally using Python Virtual Environment (venv), execute the following commands from the project root:

python3 -mvenv .venv
# Windows
.venv/Scripts/python -m pip install -r docs/sphinx/requirements.txt
.venv/Scripts/python -m sphinx -T -E -b html -d _build/doctrees -D language=en docs _build/html
# Linux
.venv/bin/python     -m pip install -r docs/sphinx/requirements.txt
.venv/bin/python     -m sphinx -T -E -b html -d _build/doctrees -D language=en docs _build/html

Then open up _build/html/index.html in your favorite browser.

Pull Requests documentation builds#

When opening a PR to the develop branch on GitHub, the page corresponding to the PR (https://github.com/RadeonOpenCompute/ROCm/pull/<pr_number>) will have a summary at the bottom. This requires the user be logged in to GitHub.

  • There, click Show all checks and Details of the Read the Docs pipeline. It will take you to https://readthedocs.com/projects/advanced-micro-devices-rocm/builds/<some_build_num>/

    • The list of commands shown are the exact ones used by CI to produce a render of the documentation.

  • There, click on the small blue link View docs (which is not the same as the bigger button with the same text). It will take you to the built HTML site with a URL of the form https://advanced-micro-devices-demo--<pr_number>.com.readthedocs.build/projects/alpha/en/<pr_number>/.

Build the docs using VS Code#

One can put together a productive environment to author documentation and test it locally using VS Code with only a handful of extensions. Even though the extension landscape of VS Code is ever-changing, here is one example setup that proved useful at the time of writing. In it, one can change or add content, build a new version of the docs using a single VS Code Task (or hotkey), see all errors/warnings emitted by Sphinx in the Problems pane, and immediately see the resulting website show up on a locally serving web server.

Configuring VS Code#
  1. Install the following extensions:

    • Python (ms-python.python)

    • Live Server (ritwickdey.LiveServer)

  2. Add the following entries in .vscode/settings.json

    {
      "liveServer.settings.root": "/.vscode/build/html",
      "liveServer.settings.wait": 1000,
      "python.terminal.activateEnvInCurrentTerminal": true
    }
    

    The settings in order are set for the following reasons:

    • Sets the root of the output website for live previews. Must be changed alongside the tasks.json command.

    • Tells Live Server to wait before updating, giving Sphinx time to regenerate site contents so the page does not refresh before all is done. (Empirical value)

    • Automatic virtual env activation is a nice touch, should you want to build the site from the integrated terminal.

  3. Add the following tasks in .vscode/tasks.json

    {
      "version": "2.0.0",
      "tasks": [
        {
          "label": "Build Docs",
          "type": "process",
          "windows": {
            "command": "${workspaceFolder}/.venv/Scripts/python.exe"
          },
          "command": "${workspaceFolder}/.venv/bin/python3",
          "args": [
            "-m",
            "sphinx",
            "-j",
            "auto",
            "-T",
            "-b",
            "html",
            "-d",
            "${workspaceFolder}/.vscode/build/doctrees",
            "-D",
            "language=en",
            "${workspaceFolder}/docs",
            "${workspaceFolder}/.vscode/build/html"
          ],
          "problemMatcher": [
            {
              "owner": "sphinx",
              "fileLocation": "absolute",
              "pattern": {
                "regexp": "^(?:.*\\.{3}\\s+)?(\\/[^:]*|[a-zA-Z]:\\\\[^:]*):(\\d+):\\s+(WARNING|ERROR):\\s+(.*)$",
                "file": 1,
                "line": 2,
                "severity": 3,
                "message": 4
              },
            },
            {
              "owner": "sphinx",
              "fileLocation": "absolute",
              "pattern": {
                "regexp": "^(?:.*\\.{3}\\s+)?(\\/[^:]*|[a-zA-Z]:\\\\[^:]*):{1,2}\\s+(WARNING|ERROR):\\s+(.*)$",
                "file": 1,
                "severity": 2,
                "message": 3
              }
            }
          ],
          "group": {
            "kind": "build",
            "isDefault": true
          }
        },
      ],
    }
    

    (Implementation detail: two problem matchers had to be defined, because VS Code doesn’t tolerate some problem information being potentially absent. While a single regex could match all types of errors, if a capture group remains empty (the line number doesn’t show up in all warning/error messages) but the pattern references that empty capture group, VS Code discards the message completely.)

  4. Configure Python virtual environment (venv)

    • From the Command Palette, run Python: Create Environment

      • Select the venv environment and the docs/sphinx/requirements.txt file. (Simply pressing Enter while hovering over the file in the dropdown is insufficient; you have to select the radio button with the ‘Space’ key if using the keyboard.)

  5. Build the docs

    • Launch the default build Task using either:

      • a hotkey (default is ‘Ctrl+Shift+B’) or

      • by issuing the Tasks: Run Build Task from the Command Palette.

  6. Open the live preview

    • Navigate to the output of the site within VS Code, right-click on .vscode/build/html/index.html and select Open with Live Server. The contents should update on every rebuild without having to refresh the browser.