ROCm compiler reference#

ROCm includes compilers optimized for high-performance computing on AMD GPUs and CPUs supporting various heterogeneous programming models such as Heterogeneous-computing Interface for Portability (HIP), OpenMP, and OpenCL.

Important

The ROCm compilers only build the x86 and AMDGPU targets. Functionality and options that are relevant to other targets are not included within the ROCm compilers.

For more details, see:

ROCm compiler interfaces#

ROCm provides two compiler interfaces for compiling HIP programs:

  • /opt/rocm/bin/amdclang++

  • /opt/rocm/bin/hipcc

The ROCm compilers leverage the same LLVM compiler technology with the AMD GCN GPU support; however, they offer a slightly different user experience. The hipcc command-line interface provides a more familiar user interface to users who are experienced in CUDA but relatively new to the ROCm/HIP development environment. On the other hand, amdclang++ provides a user interface identical to the clang++ compiler. It is more suitable for experienced developers who want to directly interact with the clang compiler and gain full control of the application build process.

The major differences between hipcc and amdclang++ are listed below:

Feature

hipcc

amdclang++

Compiling HIP source files

Treats all source files as HIP language source files

Enables the HIP language support for files with the .hip extension or through the -x hip compiler option

Detecting GPU architecture

Auto-detects the GPUs available on the system and generates code for those devices when no GPU architecture is specified

Has AMD GCN gfx803 as the default GPU architecture. The --offload-arch compiler option may be used to target other GPU architectures

Finding a HIP installation

Finds the HIP installation based on its own location and its knowledge about the ROCm directory structure

First looks for HIP under the same parent directory as its own LLVM directory and then falls back on /opt/rocm. Users can use the --rocm-path option to instruct the compiler to use HIP from the specified ROCm installation.

Linking to the HIP runtime library

Is configured to automatically link to the HIP runtime from the detected HIP installation

Requires the --hip-link flag to be specified to link to the HIP runtime. Alternatively, users can use the -l<dir>  - Lamdhip64 option to link to a HIP runtime library.

Device function inlining

Inlines all GPU device functions, which provide greater performance and compatibility for codes that contain file scoped or device function scoped __shared__ variables. However, it may increase compile time.

Relies on inlining heuristics to control inlining. Users experiencing performance or compilation issues with code using file scoped or device function scoped __shared__ variables could try -mllvm  - Amdgpu-early-inline-all=true -mllvm  - Amdgpu-function-calls=false to work around the issue. There are plans to address these issues with future compiler improvements.

Source code location:

ROCm/llvm-project/amd/hipcc

ROCm/llvm-project/clang

Compiler options and features#

This section discusses compiler options and features.

AMD GPU compilation#

This section outlines commonly used compiler flags for hipcc and amdclang++.

Options

Description

-x hip

Compiles the source file as a HIP program.

-fopenmp

Enables the OpenMP support.

-fopenmp-targets=<gpu>

Enables the OpenMP target offload support of the specified GPU architecture, where <gpu> specifies the GPU architecture. For example: gfx908

--gpu-max-threads-per-block=<value>:

Sets the default limit of threads per block, also referred to as the launch bounds, where <value> specifies the default maximum amount of threads per block.

-munsafe-fp-atomics

Enables unsafe floating point atomic instructions (AMDGPU only).

-ffast-math

Allows aggressive, lossy floating-point optimizations.

-mwavefrontsize64, -mno-wavefrontsize64

Sets wavefront size to be 64 or 32 on RDNA architectures.

-mcumode

Switches between CU and WGP modes on RDNA architectures.

--offload-arch=<gpu>

HIP offloading target ID.
May be specified more than once, where <gpu> specifies the device architecture followed by target ID features delimited by a colon.
Each target ID feature is a predefined string followed by a plus or minus sign (e.g. gfx908:xnack+:sramecc-).

-g

Generates source-level debug information.

-fgpu-rdc,    -fno-gpu-rdc

Generates relocatable device code, also known as separate compilation mode.

Inline ASM statements#

Inline assembly (ASM) statements allow a developer to include assembly instructions directly in either host or device code. The ROCm compiler supports ASM statements, however you should not use them for the following reasons:

  • The compiler’s ability to produce both correct code and optimize surrounding code is impeded.

  • The compiler does not parse the content of the ASM statements and cannot examine its contents.

  • The compiler must make conservative assumptions in an effort to retain correctness.

  • The conservative assumptions may yield code that is less performant compared to code without ASM statements. It is possible that a syntactically correct ASM statement may cause incorrect runtime behavior.

  • ASM statements are often ASIC-specific; code containing them is less portable and adds a maintenance burden for the developer if different ASICs are targeted.

  • Writing correct ASM statements is often difficult; thorough testing of any ASM statements is strongly encouraged.

Note

For developers who include ASM statements in the code, AMD is interested in understanding the use case and appreciates your feedback at ROCm/ROCm#issues

Miscellaneous OpenMP compiler features#

This section discusses features that have been added or enhanced in the OpenMP compiler.

Offload-arch tool#

An LLVM library and tool that is used to query the execution capability of the current system as well as to query requirements of a binary file. It is used by OpenMP device runtime to ensure compatibility of an image with the current system while loading it. It is compatible with target ID support and multi-image fat binary support.

Usage:

offload-arch [Options] [Optional lookup-value]

When used without an option, offload-arch prints the value of the first offload architecture found in the underlying system. This can be used by various clang front ends. For example, to compile for OpenMP offloading on your current system, invoke clang with the following command:

clang -fopenmp  -fopenmp-targets=``offload-arch`` foo.c

If an optional lookup-value is specified, offload-arch will check if the value is either a valid offload-arch or a codename and look up requested additional information.

The following command provides all the information for offload-arch gfx906:

offload-arch gfx906  - V

The options are listed below:

Options

Description

-h

Prints the help message.

-a

Prints values for all devices. Do not stop at the first device found.

-m

Prints device code name (often found in pci.ids file).

-n

Prints numeric pci-id.

-t

Prints clang offload triple to use for the offload arch.

-v

Verbose. Implies the following options -a -m -n -t

-f <file>

Prints offload requirements including offload-arch for each compiled offload image built into an application binary file.

-c

Prints offload capabilities of the underlying system. This option is used by the language runtime to select an image when multiple images are available. A capability must exist for each requirement of the selected image.

There are symbolic link aliases amdgpu-offload-arch and nvidia-arch for offload-arch. These aliases return 1 if no AMD GCN GPU or CUDA GPU is found. These aliases are useful in determining whether architecture-specific tests should be run or to conditionally load architecture-specific software.

Command-line simplification using offload-arch flag#

Legacy mechanism of specifying offloading target for OpenMP involves using three flags, -fopenmp-targets, -Xopenmp-target, and -march. The first two flags take a target triple (like amdgcn-amd-amdhsa or nvptx64-nvidia-cuda), while the last flag takes device name (like gfx908 or sm_70) as input.

Alternatively, users of the ROCm compiler can simply use the flag -offload-arch for a combined effect of the preceding three flags.

Example:

# Legacy mechanism
clang -fopenmp -target x86_64-linux-gnu \
-fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa \
-march=gfx906 helloworld.c -o helloworld

Example:

# Using offload-arch flag
clang -fopenmp -target x86_64-linux-gnu \
--offload-arch=gfx906 helloworld.c -o helloworld.

To ensure backward compatibility, both styles are supported. This option is compatible with target ID support and multi-image fat binaries.

Target ID support for OpenMP#

The ROCm compiler supports specification of target features along with the GPU name while specifying a target offload device in the command line, using -march or --offload-arch options. The compiled image in such cases is specialized for a given configuration of device and target features (target ID).

Example:

# compiling for a gfx908 device with XNACK paging support turned ON
clang -fopenmp -target x86_64-linux-gnu \
-fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa \
-march=gfx908:xnack+ helloworld.c -o helloworld

Example:

# compiling for a gfx908 device with SRAMECC support turned OFF
clang -fopenmp -target x86_64-linux-gnu \
-fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa \
-march=gfx908:sramecc- helloworld.c -o helloworld

Example:

# compiling for a gfx908 device with SRAMECC support turned ON and XNACK paging support turned OFF
clang -fopenmp -target x86_64-linux-gnu \
-fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa \
-march=gfx908:sramecc+:xnack- helloworld.c -o helloworld

The target ID specified on the command line is passed to the clang driver using target-feature flag, to the LLVM optimizer and back end using -mattr flag, and to linker using -plugin-opt=-mattr flag. This feature is compatible with offload-arch command-line option and multi-image binaries for multiple architectures.

Multi-image fat binary for OpenMP#

The ROCm compiler is enhanced to generate binaries that can contain heterogenous images. This heterogeneity could be in terms of:

  • Images of different architectures, like AMD GCN and NVPTX

  • Images of same architectures but for different GPUs, like gfx906 and gfx908

  • Images of same architecture and same GPU but for different target features, like gfx908:xnack+ and gfx908:xnack-

An appropriate image is selected by the OpenMP device runtime for execution depending on the capability of the current system. This feature is compatible with target ID support and offload-arch command-line options and uses offload-arch tool to determine capability of the current system.

Example:

clang -fopenmp -target x86_64-linux-gnu \
-fopenmp-targets=amdgcn-amd-amdhsa,amdgcn-amd-amdhsa \
-Xopenmp-target=amdgcn-amd-amdhsa -march=gfx906 \
-Xopenmp-target=amdgcn-amd-amdhsa -march=gfx908 \
helloworld.c -o helloworld

Example:

clang -fopenmp -target x86_64-linux-gnu \
--offload-arch=gfx906 \
--offload-arch=gfx908 \
helloworld.c -o helloworld

Example:

clang -fopenmp -target x86_64-linux-gnu \
-fopenmp-targets=amdgcn-amd-amdhsa,amdgcn-amd-amdhsa,amdgcn-amd-amdhsa,amdgcn-amd-amdhsa \
-Xopenmp-target=amdgcn-amd-amdhsa -march=gfx908:sramecc+:xnack+ \
-Xopenmp-target=amdgcn-amd-amdhsa -march=gfx908:sramecc-:xnack+ \
-Xopenmp-target=amdgcn-amd-amdhsa -march=gfx908:sramecc+:xnack- \
-Xopenmp-target=amdgcn-amd-amdhsa -march=gfx908:sramecc-:xnack- \
helloworld.c -o helloworld

ROCmCC compilers create an instance of toolchain for each unique combination of target triple and the target GPU (along with the associated target features). clang-offload-wrapper tool is modified to insert a new structure __tgt_image_info along with each image in the binary. Device runtime is also modified to query this structure to identify a compatible image based on the capability of the current system.

Unified shared memory#

The following OpenMP pragma is available on MI200, and it must be executed with xnack+ support.

omp requires unified_shared_memory

For more details on unified shared memory, see OpenMP support.