ROCm compiler reference#

ROCm includes compilers optimized for high-performance computing on AMD GPUs and CPUs, supporting heterogeneous programming models such as the Heterogeneous-computing Interface for Portability (HIP), OpenMP, and OpenCL.

Important

The ROCm compilers only build the x86 and AMDGPU targets. Functionality and options that are relevant to other targets are not included within the ROCm compilers.

ROCm compilers are available via two packages: rocm-llvm and rocm-llvm-alt. The differences are listed in the following table.

| rocm-llvm | rocm-llvm-alt |
|---|---|
| Installed by default when ROCm is installed | An optional package |
| Provides an open-source compiler | Provides an additional closed-source compiler for users interested in additional CPU optimizations not available in rocm-llvm |

For more details, see:

ROCm compiler interfaces#

ROCm provides two compiler interfaces for compiling HIP programs:

  • /opt/rocm/bin/amdclang++

  • /opt/rocm/bin/hipcc

The ROCm compilers leverage the same LLVM compiler technology with AMD GCN GPU support; however, they offer slightly different user experiences. The hipcc command-line interface is more familiar to users who are experienced with CUDA but relatively new to the ROCm/HIP development environment. amdclang++, on the other hand, provides a user interface identical to the clang++ compiler and is better suited to experienced developers who want to interact directly with the clang compiler and take full control of the application build process.
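For example, the same HIP program can be built through either interface. This sketch assumes a default ROCm installation under /opt/rocm, a gfx908 target, and a hypothetical source file square.hip:

```shell
# hipcc: treats the input as HIP and links the HIP runtime automatically
/opt/rocm/bin/hipcc square.hip -o square

# amdclang++: equivalent invocation with the choices spelled out explicitly
/opt/rocm/bin/amdclang++ -x hip --offload-arch=gfx908 --hip-link \
    square.hip -o square
```

Both commands produce a host executable with embedded device code; the amdclang++ form simply makes the HIP language mode, target architecture, and runtime linking explicit.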

The major differences between hipcc and amdclang++ are listed below:

| Feature | hipcc | amdclang++ |
|---|---|---|
| Compiling HIP source files | Treats all source files as HIP language source files | Enables HIP language support for files with the .hip extension or through the -x hip compiler option |
| Detecting GPU architecture | Auto-detects the GPUs available on the system and generates code for those devices when no GPU architecture is specified | Has AMD GCN gfx803 as the default GPU architecture. The --offload-arch compiler option may be used to target other GPU architectures |
| Finding a HIP installation | Finds the HIP installation based on its own location and its knowledge of the ROCm directory structure | First looks for HIP under the same parent directory as its own LLVM directory and then falls back on /opt/rocm. Users can use the --rocm-path option to instruct the compiler to use HIP from the specified ROCm installation |
| Linking to the HIP runtime library | Is configured to automatically link to the HIP runtime from the detected HIP installation | Requires the --hip-link flag to be specified to link to the HIP runtime. Alternatively, users can use the -L<dir> -lamdhip64 options to link to a HIP runtime library |
| Device function inlining | Inlines all GPU device functions, which provides greater performance and compatibility for code that contains file-scoped or device-function-scoped __shared__ variables. However, it may increase compile time | Relies on inlining heuristics to control inlining. Users experiencing performance or compilation issues with code using file-scoped or device-function-scoped __shared__ variables can try -mllvm -amdgpu-early-inline-all=true -mllvm -amdgpu-function-calls=false to work around the issue. There are plans to address these issues with future compiler improvements |
| Source code location | ROCm/llvm-project/amd/hipcc | ROCm/llvm-project/clang |

Compiler options and features#

This section discusses compiler options and features.

AMD GPU compilation#

This section outlines commonly used compiler flags for hipcc and amdclang++.

| Option | Description |
|---|---|
| -x hip | Compiles the source file as a HIP program. |
| -fopenmp | Enables OpenMP support. |
| -fopenmp-targets=<gpu> | Enables OpenMP target offload support for the specified GPU architecture, where <gpu> specifies the GPU architecture. For example: gfx908. |
| --gpu-max-threads-per-block=<value> | Sets the default limit of threads per block, also referred to as the launch bounds, where <value> specifies the default maximum number of threads per block. |
| -munsafe-fp-atomics | Enables unsafe floating-point atomic instructions (AMDGPU only). |
| -ffast-math | Allows aggressive, lossy floating-point optimizations. |
| -mwavefrontsize64, -mno-wavefrontsize64 | Sets the wavefront size to 64 or 32 on RDNA architectures. |
| -mcumode | Switches between CU and WGP modes on RDNA architectures. |
| --offload-arch=<gpu> | HIP offloading target ID. May be specified more than once, where <gpu> specifies the device architecture followed by target ID features delimited by a colon. Each target ID feature is a predefined string followed by a plus or minus sign (for example, gfx908:xnack+:sramecc-). |
| -g | Generates source-level debug information. |
| -fgpu-rdc, -fno-gpu-rdc | Generates relocatable device code, also known as separate compilation mode. |
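Several of these options are commonly combined in one invocation. The following sketch (hypothetical source file saxpy.hip; architectures chosen for illustration) builds with debug information and relocatable device code for two target IDs:

```shell
# Debug build with separate compilation for two GPU targets,
# one of them specialized for XNACK paging support turned on.
hipcc -x hip -g -fgpu-rdc \
    --offload-arch=gfx90a:xnack+ --offload-arch=gfx908 \
    saxpy.hip -o saxpy
```

The resulting fat binary contains one code object per --offload-arch value, and the runtime selects the image matching the device at load time.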

AMD optimizations for zen architectures#

The CPU compiler optimizations described in this chapter originate from the AMD Optimizing C/C++ Compiler (AOCC). They are available in the ROCm compilers when the optional rocm-llvm-alt package is installed. The user's interaction with the compiler does not change once rocm-llvm-alt is installed; the same compiler entry points are used. AMD provides high-performance compiler optimizations for Zen-based processors in AOCC.

For more information, refer to https://www.amd.com/en/developer/aocc.html.

-famd-opt#

Enables a default set of AMD proprietary optimizations for the AMD Zen CPU architectures.

-fno-amd-opt disables the AMD proprietary optimizations.

The -famd-opt flag is useful when a user wants to build with the proprietary optimization compiler without having to set any of the other proprietary optimization flags.

Note

-famd-opt can be used in addition to the other proprietary CPU optimization flags. The table of optimizations below implicitly enables the invocation of the AMD proprietary optimizations compiler, whereas the -famd-opt flag requires this to be handled explicitly.

-fstruct-layout=[1,2,3,4,5,6,7]#

Analyzes the whole program to determine if the structures in the code can be peeled and the pointer or integer fields in the structure can be compressed. If feasible, this optimization transforms the code to enable these improvements. This transformation is likely to improve cache utilization and memory bandwidth. It is expected to improve the scalability of programs executed on multiple cores.

This is effective only under -flto, as the whole program analysis is required to perform this optimization. Users can choose different levels of aggressiveness with which this optimization can be applied to the application, with 1 being the least aggressive and 7 being the most aggressive level.

| -fstruct-layout value | Structure peeling | Pointer size after selective compression of self-referential pointers in structures, wherever safe | Type of structure fields eligible for compression | Whether compression is performed under safety check |
|---|---|---|---|---|
| 1 | Enabled | NA | NA | NA |
| 2 | Enabled | 32-bit | NA | NA |
| 3 | Enabled | 16-bit | NA | NA |
| 4 | Enabled | 32-bit | Integer | Yes |
| 5 | Enabled | 16-bit | Integer | Yes |
| 6 | Enabled | 32-bit | 64-bit signed or unsigned int. Users must ensure that values assigned to 64-bit signed int fields are in the range -(2^31 - 1) to +(2^31 - 1) and that 64-bit unsigned int fields are in the range 0 to +(2^31 - 1); otherwise, the results may be incorrect | No. Users must ensure safety based on the program compiled |
| 7 | Enabled | 16-bit | 64-bit signed or unsigned int. Users must ensure that values assigned to 64-bit signed int fields are in the range -(2^31 - 1) to +(2^31 - 1) and that 64-bit unsigned int fields are in the range 0 to +(2^31 - 1); otherwise, the results may be incorrect | No. Users must ensure safety based on the program compiled |

-fitodcalls#

Promotes indirect-to-direct calls by placing conditional calls. Applications or benchmarks that have a small and deterministic set of target functions for function pointers passed as call parameters benefit from this optimization. Indirect-to-direct call promotion transforms the code to use all possible determined targets under runtime checks and falls back to the original code for all other cases. The compiler introduces a runtime check for each possible function pointer target, followed by a direct call to that target.

This is a link-time optimization, which is invoked as -flto -fitodcalls.

-fitodcallsbyclone#

Performs value specialization for functions with function pointers passed as an argument. It does this specialization by generating a clone of the function. The function is cloned along the call chain as needed to allow conversion of an indirect function call to a direct call.

This complements -fitodcalls optimization and is also a link time optimization, which is invoked as -flto -fitodcallsbyclone.

-fremap-arrays#

Transforms the data layout of a single dimensional array to provide better cache locality. This optimization is effective only under -flto, as the whole program needs to be analyzed to perform this optimization, which can be invoked as -flto -fremap-arrays.

-finline-aggressive#

Enables improved inlining capability through better heuristics. This optimization is more effective when used with -flto, as the whole program analysis is required to perform this optimization, which can be invoked as -flto -finline-aggressive.

-fnt-store (non-temporal store)#

Generates a non-temporal store instruction for array accesses in a loop with a large trip count.

-fnt-store=aggressive#

This is an experimental option to generate non-temporal store instructions for array accesses in a loop whose iteration count cannot be determined at compile time. In this case, the compiler assumes the iteration count to be huge.

Optimizations through driver -mllvm <options>#

The following optimization options must be invoked through driver -mllvm <options>:

-enable-partial-unswitch#

Enables partial loop unswitching, which is an enhancement to the existing loop unswitching optimization in LLVM. Partial loop unswitching hoists a condition inside a loop from a path for which the execution condition remains invariant, whereas the original loop unswitching works for a condition that is completely loop invariant. The condition inside the loop gets hoisted out from the invariant path, and the original loop is retained for the path where the condition is variant.

-aggressive-loop-unswitch#

Experimental option that enables aggressive loop unswitching heuristic (including -enable-partial-unswitch) based on the usage of the branch conditional values. Loop unswitching leads to code bloat. Code bloat can be minimized if the hoisted condition is executed more often. This heuristic prioritizes the conditions based on the number of times they are used within the loop. The heuristic can be controlled with the following options:

  • -unswitch-identical-branches-min-count=<n> Enables unswitching of a loop with respect to a branch conditional value (B), where B appears in at least <n> compares in the loop. This option is enabled with -aggressive-loop-unswitch. The default value is 3.

    Usage: -mllvm -aggressive-loop-unswitch -mllvm -unswitch-identical-branches-min-count=<n>

    Here, n is a positive integer; a lower value of <n> facilitates more unswitching.

  • -unswitch-identical-branches-max-count=<n> Enables unswitching of a loop with respect to a branch conditional value (B), where B appears in at most <n> compares in the loop. This option is enabled with -aggressive-loop-unswitch. The default value is 6.

    Usage: -mllvm -aggressive-loop-unswitch -mllvm -unswitch-identical-branches-max-count=<n>

    Here, n is a positive integer; a higher value of <n> facilitates more unswitching.

    Note

    These options may facilitate more unswitching under some workloads. Since loop-unswitching inherently leads to code bloat, facilitating more unswitching may significantly increase the code size. Hence, it may also lead to longer compilation times.

-enable-strided-vectorization#

Enables strided memory vectorization as an enhancement to the interleaved vectorization framework present in LLVM. It enables the effective use of gather and scatter kind of instruction patterns. This flag must be used along with the interleave vectorization flag.

-enable-epilog-vectorization#

Enables vectorization of epilog iterations as an enhancement to the existing vectorization framework. This enables generation of an additional epilog vector loop version for the remainder iterations of the original vector loop. The vector size or factor of the original loop should be large enough to allow effective epilog vectorization of the remaining iterations. This optimization takes place only when the original vector loop is vectorized with a vector width or factor of 16. This vectorization width of 16 may be overridden by the -min-width-epilog-vectorization command-line option.

-enable-redundant-movs#

Removes any redundant mov operations including redundant loads from memory and stores to memory. This can be invoked using -Wl,-plugin-opt=-enable-redundant-movs.

-merge-constant#

Attempts to promote frequently occurring constants to registers. The aim is to reduce the size of the instruction encoding for instructions using constants and obtain a performance improvement.

-function-specialize#

Optimizes functions with compile-time constant formal arguments.

-lv-function-specialization#

Generates specialized function versions when the loops inside the function are vectorizable and the arguments are not aliased with each other.

-enable-vectorize-compares#

Enables vectorization on certain loops with conditional breaks assuming the memory accesses are safely bound within the page boundary.

-inline-recursion=[1,2,3,4]#

Enables inlining for recursive functions based on heuristics where the aggressiveness of heuristics increases with the level (1-4). The default level is 2. Higher levels may lead to code bloat due to expansion of recursive functions at call sites.

| -inline-recursion value | Inline depth of heuristics used to enable inlining for recursive functions |
|---|---|
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 4 | 10 |

This is more effective with -flto as the whole program needs to be analyzed to perform this optimization, which can be invoked as -flto -inline-recursion=[1,2,3,4].

-reduce-array-computations=[1,2,3]#

Performs array data flow analysis and optimizes the unused array computations.

| -reduce-array-computations value | Array elements eligible for elimination of computations |
|---|---|
| 1 | Unused |
| 2 | Zero valued |
| 3 | Both unused and zero valued |

This optimization is effective with -flto as the whole program needs to be analyzed to perform this optimization, which can be invoked as -flto -reduce-array-computations=[1,2,3].

-global-vectorize-slp={true,false}#

Vectorizes the straight-line code inside a basic block with data reordering vector operations. This option is set to true by default.

-region-vectorize#

Experimental flag for enabling vectorization on certain loops with complex control flow, which the normal vectorizer cannot handle.

This optimization is effective with -flto as the whole program needs to be analyzed to perform this optimization, which can be invoked as -flto -region-vectorize.

-enable-x86-prefetching#

Enables generation of x86 prefetch instructions for memory references inside a loop, or inside the innermost loop of a loop nest, to prefetch the second dimension of multidimensional array/memory references. This is an experimental pass; its profitability is being improved.

-suppress-fmas#

Identifies the reduction patterns on FMA and suppresses the FMA generation, as it is not profitable on the reduction patterns.

-enable-icm-vrp#

Enables estimation of the virtual register pressure before performing loop invariant code motion. This estimation is used to control the number of loop invariants that will be hoisted during the loop invariant code motion.

-loop-splitting#

Enables splitting of loops into multiple loops to eliminate the branches, which compare the loop induction with an invariant or constant expression. This option is enabled under -O3 by default. To disable this optimization, use -loop-splitting=false.

-enable-ipo-loop-split#

Enables splitting of loops into multiple loops to eliminate the branches, which compares the loop induction with a constant expression. This constant expression can be derived through inter-procedural analysis. This option is enabled under -O3 by default. To disable this optimization, use -enable-ipo-loop-split=false.

-compute-interchange-order#

Enables heuristic for finding the best possible interchange order for a loop nest. To enable this option, use -enable-loopinterchange. This option is set to false by default.

Usage:

-mllvm -enable-loopinterchange -mllvm -compute-interchange-order

-convert-pow-exp-to-int={true,false}#

Converts a call to the floating-point exponent version of pow to its integer-exponent version if the floating-point exponent can be converted to an integer. This option is set to true by default.

-do-lock-reordering={none,normal,aggressive}#

Reorders the control predicates in increasing order of complexity from outer predicate to inner when it is safe. The normal mode reorders simple expressions, while the aggressive mode reorders predicates involving function calls if no side effects are determined. This option is set to normal by default.

-fuse-tile-inner-loop#

Enables fusion of adjacent tiled loops as a part of loop tiling transformation. This option is set to false by default.

-Hz,1,0x1 [Fortran]#

Helps preserve array index information for array access expressions that get linearized in the compiler front end. The preserved information is used by the compiler optimization phase to perform optimizations such as loop transformations. Any user relying on loop transformations or other optimizations that require de-linearized index expressions should use this option. It has no impact on any other aspects of the Flang front end.

Inline ASM statements#

Inline assembly (ASM) statements allow a developer to include assembly instructions directly in either host or device code. The ROCm compiler supports ASM statements; however, their use is discouraged for the following reasons:

  • The compiler’s ability to produce both correct code and optimize surrounding code is impeded.

  • The compiler does not parse the content of the ASM statements and cannot examine its contents.

  • The compiler must make conservative assumptions in an effort to retain correctness.

  • The conservative assumptions may yield code that is less performant compared to code without ASM statements. It is possible that a syntactically correct ASM statement may cause incorrect runtime behavior.

  • ASM statements are often ASIC-specific; code containing them is less portable and adds a maintenance burden for the developer if different ASICs are targeted.

  • Writing correct ASM statements is often difficult; thorough testing of any ASM statements is strongly encouraged.

Note

For developers who include ASM statements in their code, AMD is interested in understanding the use case and appreciates your feedback at ROCm/ROCm#issues.

Miscellaneous OpenMP compiler features#

This section discusses features that have been added or enhanced in the OpenMP compiler.

offload-arch is an LLVM library and tool used to query the execution capability of the current system as well as the requirements of a binary file. It is used by the OpenMP device runtime to ensure compatibility of an image with the current system while loading it. It is compatible with target ID support and multi-image fat binary support.

Usage:

offload-arch [Options] [Optional lookup-value]

When used without an option, offload-arch prints the value of the first offload architecture found on the underlying system. This can be used by various clang front ends. For example, to compile for OpenMP offloading on your current system, invoke clang with the following command:

clang -fopenmp -fopenmp-targets=`offload-arch` foo.c

If an optional lookup-value is specified, offload-arch will check if the value is either a valid offload-arch or a codename and look up requested additional information.

The following command provides all the information for offload-arch gfx906:

offload-arch gfx906 -v

The options are listed below:

| Option | Description |
|---|---|
| -h | Prints the help message. |
| -a | Prints values for all devices. Does not stop at the first device found. |
| -m | Prints the device code name (often found in the pci.ids file). |
| -n | Prints the numeric pci-id. |
| -t | Prints the clang offload triple to use for the offload arch. |
| -v | Verbose. Implies the options -a -m -n -t. |
| -f <file> | Prints offload requirements, including offload-arch, for each compiled offload image built into an application binary file. |
| -c | Prints offload capabilities of the underlying system. This option is used by the language runtime to select an image when multiple images are available. A capability must exist for each requirement of the selected image. |

There are symbolic link aliases amdgpu-offload-arch and nvidia-arch for offload-arch. These aliases return 1 if no AMD GCN GPU or CUDA GPU is found. These aliases are useful in determining whether architecture-specific tests should be run or to conditionally load architecture-specific software.
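As an illustration, a test script might use the amdgpu-offload-arch alias to gate architecture-specific tests (sketch; `run_gpu_tests.sh` is a hypothetical test driver, and a return status of 1 means no AMD GCN GPU was found):

```shell
#!/bin/sh
# Run the GPU test suite only when an AMD GCN device is present.
if amdgpu-offload-arch > /dev/null 2>&1; then
    echo "AMD GPU found: $(amdgpu-offload-arch)"
    ./run_gpu_tests.sh
else
    echo "No AMD GPU detected; skipping GPU tests."
fi
```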

The legacy mechanism of specifying an offloading target for OpenMP involves three flags: -fopenmp-targets, -Xopenmp-target, and -march. The first two flags take a target triple (such as amdgcn-amd-amdhsa or nvptx64-nvidia-cuda), while the last takes a device name (such as gfx908 or sm_70) as input.

Alternatively, users of the ROCm compiler can simply use the --offload-arch flag for the combined effect of the preceding three flags.

Example:

# Legacy mechanism
clang -fopenmp -target x86_64-linux-gnu \
-fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa \
-march=gfx906 helloworld.c -o helloworld

Example:

# Using the --offload-arch flag
clang -fopenmp -target x86_64-linux-gnu \
--offload-arch=gfx906 helloworld.c -o helloworld

To ensure backward compatibility, both styles are supported. This option is compatible with target ID support and multi-image fat binaries.

The ROCm compiler supports specification of target features along with the GPU name while specifying a target offload device in the command line, using -march or --offload-arch options. The compiled image in such cases is specialized for a given configuration of device and target features (target ID).

Example:

# compiling for a gfx908 device with XNACK paging support turned ON
clang -fopenmp -target x86_64-linux-gnu \
-fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa \
-march=gfx908:xnack+ helloworld.c -o helloworld

Example:

# compiling for a gfx908 device with SRAMECC support turned OFF
clang -fopenmp -target x86_64-linux-gnu \
-fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa \
-march=gfx908:sramecc- helloworld.c -o helloworld

Example:

# compiling for a gfx908 device with SRAMECC support turned ON and XNACK paging support turned OFF
clang -fopenmp -target x86_64-linux-gnu \
-fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa \
-march=gfx908:sramecc+:xnack- helloworld.c -o helloworld

The target ID specified on the command line is passed to the clang driver using the target-feature flag, to the LLVM optimizer and back end using the -mattr flag, and to the linker using the -plugin-opt=-mattr flag. This feature is compatible with the --offload-arch command-line option and multi-image binaries for multiple architectures.

The ROCm compiler is enhanced to generate binaries that can contain heterogeneous images. This heterogeneity can be in terms of:

  • Images of different architectures, like AMD GCN and NVPTX

  • Images of the same architecture but for different GPUs, like gfx906 and gfx908

  • Images of the same architecture and same GPU but for different target features, like gfx908:xnack+ and gfx908:xnack-

An appropriate image is selected by the OpenMP device runtime for execution depending on the capability of the current system. This feature is compatible with target ID support and the --offload-arch command-line option, and it uses the offload-arch tool to determine the capability of the current system.

Example:

clang -fopenmp -target x86_64-linux-gnu \
-fopenmp-targets=amdgcn-amd-amdhsa,amdgcn-amd-amdhsa \
-Xopenmp-target=amdgcn-amd-amdhsa -march=gfx906 \
-Xopenmp-target=amdgcn-amd-amdhsa -march=gfx908 \
helloworld.c -o helloworld

Example:

clang -fopenmp -target x86_64-linux-gnu \
--offload-arch=gfx906 \
--offload-arch=gfx908 \
helloworld.c -o helloworld

Example:

clang -fopenmp -target x86_64-linux-gnu \
-fopenmp-targets=amdgcn-amd-amdhsa,amdgcn-amd-amdhsa,amdgcn-amd-amdhsa,amdgcn-amd-amdhsa \
-Xopenmp-target=amdgcn-amd-amdhsa -march=gfx908:sramecc+:xnack+ \
-Xopenmp-target=amdgcn-amd-amdhsa -march=gfx908:sramecc-:xnack+ \
-Xopenmp-target=amdgcn-amd-amdhsa -march=gfx908:sramecc+:xnack- \
-Xopenmp-target=amdgcn-amd-amdhsa -march=gfx908:sramecc-:xnack- \
helloworld.c -o helloworld

The ROCmCC compilers create a toolchain instance for each unique combination of target triple and target GPU (along with the associated target features). The clang-offload-wrapper tool is modified to insert a new structure, __tgt_image_info, along with each image in the binary. The device runtime is also modified to query this structure to identify a compatible image based on the capability of the current system.

The following OpenMP pragma is available on MI200, and it must be executed with xnack+ support:

#pragma omp requires unified_shared_memory

For more details on unified shared memory, see OpenMP support.