ROCm compiler reference

ROCm compiler reference#

ROCm includes compilers optimized for high-performance computing on AMD GPUs and CPUs supporting various heterogeneous programming models such as Heterogeneous-computing Interface for Portability (HIP), OpenMP, and OpenCL.

Important

The ROCm compilers only build the x86 and AMDGPU targets. Functionality and options that are relevant to other targets are not included within the ROCm compilers.

For more details, see:

LLVM User Guide for AMD GPU Backend
Releases and source: ROCm/llvm-project

ROCm compiler interfaces#

ROCm provides two compiler interfaces for compiling HIP programs:

/opt/rocm/bin/amdclang++
/opt/rocm/bin/hipcc

The ROCm compilers leverage the same LLVM compiler technology with the AMD GCN GPU support; however, they offer a slightly different user experience. The hipcc command-line interface provides a more familiar user interface to users who are experienced in CUDA but relatively new to the ROCm/HIP development environment. On the other hand, amdclang++ provides a user interface identical to the clang++ compiler. It is more suitable for experienced developers who want to directly interact with the clang compiler and gain full control of the application build process.

The major differences between hipcc and amdclang++ are listed below:

Feature	`hipcc`	`amdclang++`
Compiling HIP source files	Treats all source files as HIP language source files	Enables the HIP language support for files with the `.hip` extension or through the `-x hip` compiler option
Detecting GPU architecture	Auto-detects the GPUs available on the system and generates code for those devices when no GPU architecture is specified	Has AMD GCN gfx803 as the default GPU architecture. The `--offload-arch` compiler option may be used to target other GPU architectures
Finding a HIP installation	Finds the HIP installation based on its own location and its knowledge about the ROCm directory structure	First looks for HIP under the same parent directory as its own LLVM directory and then falls back on `/opt/rocm`. Users can use the `--rocm-path` option to instruct the compiler to use HIP from the specified ROCm installation.
Linking to the HIP runtime library	Is configured to automatically link to the HIP runtime from the detected HIP installation	Requires the `--hip-link` flag to be specified to link to the HIP runtime. Alternatively, users can use the `-l<dir> - Lamdhip64` option to link to a HIP runtime library.
Device function inlining	Inlines all GPU device functions, which provide greater performance and compatibility for codes that contain file scoped or device function scoped `__shared__` variables. However, it may increase compile time.	Relies on inlining heuristics to control inlining. Users experiencing performance or compilation issues with code using file scoped or device function scoped `__shared__` variables could try `-mllvm - Amdgpu-early-inline-all=true -mllvm - Amdgpu-function-calls=false` to work around the issue. There are plans to address these issues with future compiler improvements.
Source code location:	ROCm/llvm-project/amd/hipcc	ROCm/llvm-project/clang

Compiler options and features#

This section discusses compiler options and features.

AMD GPU compilation#

This section outlines commonly used compiler flags for hipcc and amdclang++.

Options	Description
`-x hip`	Compiles the source file as a HIP program.
`-fopenmp`	Enables the OpenMP support.
`-fopenmp-targets=<gpu>`	Enables the OpenMP target offload support of the specified GPU architecture, where `<gpu>` specifies the GPU architecture. For example: gfx908
`--gpu-max-threads-per-block=<value>:`	Sets the default limit of threads per block, also referred to as the launch bounds, where `<value>` specifies the default maximum amount of threads per block.
`-munsafe-fp-atomics`	Enables unsafe floating point atomic instructions (AMDGPU only).
`-ffast-math`	Allows aggressive, lossy floating-point optimizations.
`-mwavefrontsize64, -mno-wavefrontsize64`	Sets wavefront size to be 64 or 32 on RDNA architectures.
`-mcumode`	Switches between CU and WGP modes on RDNA architectures.
`--offload-arch=<gpu>`	HIP offloading target ID. May be specified more than once, where `<gpu>` specifies the device architecture followed by target ID features delimited by a colon. Each target ID feature is a predefined string followed by a plus or minus sign (e.g. `gfx908:xnack+:sramecc-`).
`-g`	Generates source-level debug information.
`-fgpu-rdc, -fno-gpu-rdc`	Generates relocatable device code, also known as separate compilation mode.

Inline ASM statements#

Inline assembly (ASM) statements allow a developer to include assembly instructions directly in either host or device code. The ROCm compiler supports ASM statements, however you should not use them for the following reasons:

The compiler’s ability to produce both correct code and optimize surrounding code is impeded.
The compiler does not parse the content of the ASM statements and cannot examine its contents.
The compiler must make conservative assumptions in an effort to retain correctness.
The conservative assumptions may yield code that is less performant compared to code without ASM statements. It is possible that a syntactically correct ASM statement may cause incorrect runtime behavior.
ASM statements are often ASIC-specific; code containing them is less portable and adds a maintenance burden for the developer if different ASICs are targeted.
Writing correct ASM statements is often difficult; thorough testing of any ASM statements is strongly encouraged.

Note

For developers who include ASM statements in the code, AMD is interested in understanding the use case and appreciates your feedback at ROCm/ROCm#issues

Miscellaneous OpenMP compiler features#

This section discusses features that have been added or enhanced in the OpenMP compiler.

Offload-arch tool#

An LLVM library and tool that is used to query the execution capability of the current system as well as to query requirements of a binary file. It is used by OpenMP device runtime to ensure compatibility of an image with the current system while loading it. It is compatible with target ID support and multi-image fat binary support.

Usage:

offload-arch [Options] [Optional lookup-value]

When used without an option, offload-arch prints the value of the first offload architecture found in the underlying system. This can be used by various clang front ends. For example, to compile for OpenMP offloading on your current system, invoke clang with the following command:

clang -fopenmp  -fopenmp-targets=``offload-arch`` foo.c

If an optional lookup-value is specified, offload-arch will check if the value is either a valid offload-arch or a codename and look up requested additional information.

The following command provides all the information for offload-arch gfx906:

offload-arch gfx906  - V

The options are listed below:

Options	Description
`-h`	Prints the help message.
`-a`	Prints values for all devices. Do not stop at the first device found.
`-m`	Prints device code name (often found in `pci.ids` file).
`-n`	Prints numeric `pci-id`.
`-t`	Prints clang offload triple to use for the offload arch.
`-v`	Verbose. Implies the following options `-a -m -n -t`
`-f <file>`	Prints offload requirements including offload-arch for each compiled offload image built into an application binary file.
`-c`	Prints offload capabilities of the underlying system. This option is used by the language runtime to select an image when multiple images are available. A capability must exist for each requirement of the selected image.

There are symbolic link aliases amdgpu-offload-arch and nvidia-arch for offload-arch. These aliases return 1 if no AMD GCN GPU or CUDA GPU is found. These aliases are useful in determining whether architecture-specific tests should be run or to conditionally load architecture-specific software.

Command-line simplification using offload-arch flag#

Legacy mechanism of specifying offloading target for OpenMP involves using three flags, -fopenmp-targets, -Xopenmp-target, and -march. The first two flags take a target triple (like amdgcn-amd-amdhsa or nvptx64-nvidia-cuda), while the last flag takes device name (like gfx908 or sm_70) as input.

Alternatively, users of the ROCm compiler can simply use the flag -offload-arch for a combined effect of the preceding three flags.

Example:

# Legacy mechanism
clang -fopenmp -target x86_64-linux-gnu \
-fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa \
-march=gfx906 helloworld.c -o helloworld

Example:

# Using offload-arch flag
clang -fopenmp -target x86_64-linux-gnu \
--offload-arch=gfx906 helloworld.c -o helloworld.

To ensure backward compatibility, both styles are supported. This option is compatible with target ID support and multi-image fat binaries.

Target ID support for OpenMP#

The ROCm compiler supports specification of target features along with the GPU name while specifying a target offload device in the command line, using -march or --offload-arch options. The compiled image in such cases is specialized for a given configuration of device and target features (target ID).

Example:

# compiling for a gfx908 device with XNACK paging support turned ON
clang -fopenmp -target x86_64-linux-gnu \
-fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa \
-march=gfx908:xnack+ helloworld.c -o helloworld

Example:

# compiling for a gfx908 device with SRAMECC support turned OFF
clang -fopenmp -target x86_64-linux-gnu \
-fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa \
-march=gfx908:sramecc- helloworld.c -o helloworld

Example:

# compiling for a gfx908 device with SRAMECC support turned ON and XNACK paging support turned OFF
clang -fopenmp -target x86_64-linux-gnu \
-fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa \
-march=gfx908:sramecc+:xnack- helloworld.c -o helloworld

The target ID specified on the command line is passed to the clang driver using target-feature flag, to the LLVM optimizer and back end using -mattr flag, and to linker using -plugin-opt=-mattr flag. This feature is compatible with offload-arch command-line option and multi-image binaries for multiple architectures.

Multi-image fat binary for OpenMP#

The ROCm compiler is enhanced to generate binaries that can contain heterogenous images. This heterogeneity could be in terms of:

Images of different architectures, like AMD GCN and NVPTX
Images of same architectures but for different GPUs, like gfx906 and gfx908
Images of same architecture and same GPU but for different target features, like gfx908:xnack+ and gfx908:xnack-

An appropriate image is selected by the OpenMP device runtime for execution depending on the capability of the current system. This feature is compatible with target ID support and offload-arch command-line options and uses offload-arch tool to determine capability of the current system.

Example:

clang -fopenmp -target x86_64-linux-gnu \
-fopenmp-targets=amdgcn-amd-amdhsa,amdgcn-amd-amdhsa \
-Xopenmp-target=amdgcn-amd-amdhsa -march=gfx906 \
-Xopenmp-target=amdgcn-amd-amdhsa -march=gfx908 \
helloworld.c -o helloworld

Example:

clang -fopenmp -target x86_64-linux-gnu \
--offload-arch=gfx906 \
--offload-arch=gfx908 \
helloworld.c -o helloworld

Example:

clang -fopenmp -target x86_64-linux-gnu \
-fopenmp-targets=amdgcn-amd-amdhsa,amdgcn-amd-amdhsa,amdgcn-amd-amdhsa,amdgcn-amd-amdhsa \
-Xopenmp-target=amdgcn-amd-amdhsa -march=gfx908:sramecc+:xnack+ \
-Xopenmp-target=amdgcn-amd-amdhsa -march=gfx908:sramecc-:xnack+ \
-Xopenmp-target=amdgcn-amd-amdhsa -march=gfx908:sramecc+:xnack- \
-Xopenmp-target=amdgcn-amd-amdhsa -march=gfx908:sramecc-:xnack- \
helloworld.c -o helloworld

ROCmCC compilers create an instance of toolchain for each unique combination of target triple and the target GPU (along with the associated target features). clang-offload-wrapper tool is modified to insert a new structure __tgt_image_info along with each image in the binary. Device runtime is also modified to query this structure to identify a compatible image based on the capability of the current system.

Unified shared memory#

The following OpenMP pragma is available on MI200, and it must be executed with xnack+ support.

omp requires unified_shared_memory

For more details on unified shared memory, see OpenMP support.

ROCm compiler reference

Contents

ROCm compiler reference#

ROCm compiler interfaces#

Compiler options and features#

AMD GPU compilation#

Inline ASM statements#

Miscellaneous OpenMP compiler features#

Offload-arch tool#

Command-line simplification using offload-arch flag#

Target ID support for OpenMP#

Multi-image fat binary for OpenMP#

Unified shared memory#