ROCm compiler reference#
ROCm includes compilers optimized for high-performance computing on AMD GPUs and CPUs. They support heterogeneous programming models such as the Heterogeneous-computing Interface for Portability (HIP), OpenMP, and OpenCL.
Important
The ROCm compilers build only for the x86 and AMDGPU targets. Functionality and options relevant to other targets are not included in the ROCm compilers.
ROCm compilers are available via two packages: rocm-llvm and rocm-llvm-alt. The differences are listed in the following table.

| rocm-llvm | rocm-llvm-alt |
|---|---|
| Installed by default when ROCm is installed | An optional package |
| Provides an open-source compiler | Provides an additional closed-source compiler for users interested in additional CPU optimizations not available in rocm-llvm |
For more details, see:
Releases and source: ROCm/llvm-project
ROCm compiler interfaces#
ROCm provides two compiler interfaces for compiling HIP programs:
/opt/rocm/bin/amdclang++
/opt/rocm/bin/hipcc
The ROCm compilers leverage the same LLVM compiler technology with AMD GCN GPU support; however, they offer slightly different user experiences. The hipcc command-line interface is more familiar to users who are experienced in CUDA but relatively new to the ROCm/HIP development environment. On the other hand, amdclang++ provides a user interface identical to the clang++ compiler. It is more suitable for experienced developers who want to interact directly with the clang compiler and gain full control of the application build process.
The major differences between hipcc and amdclang++ are listed below:

| Feature | hipcc | amdclang++ |
|---|---|---|
| Compiling HIP source files | Treats all source files as HIP language source files | Enables the HIP language support for files with the .hip extension or through the -x hip compiler option |
| Detecting GPU architecture | Auto-detects the GPUs available on the system and generates code for those devices when no GPU architecture is specified | Has AMD GCN gfx803 as the default GPU architecture. The --offload-arch compiler option may be used to target other GPU architectures |
| Finding a HIP installation | Finds the HIP installation based on its own location and its knowledge about the ROCm directory structure | First looks for HIP under the same parent directory as its own LLVM directory and then falls back on /opt/rocm |
| Linking to the HIP runtime library | Is configured to automatically link to the HIP runtime from the detected HIP installation | Requires the --hip-link flag to be specified to link to the HIP runtime |
| Device function inlining | Inlines all GPU device functions, which provides greater performance and compatibility for codes that contain file scoped or device function scoped __shared__ variables | Relies on inlining heuristics to control inlining. Users experiencing performance or compilation issues with code using file scoped or device function scoped __shared__ variables could pass -mllvm -amdgpu-early-inline-all=true -mllvm -amdgpu-function-calls=false as a workaround |
| Source code location | ROCm/llvm-project (amd/hipcc) | ROCm/llvm-project |
Compiler options and features#
This section discusses compiler options and features.
AMD GPU compilation#
This section outlines commonly used compiler flags for hipcc and amdclang++.
| Option | Description |
|---|---|
| -x hip | Compiles the source file as a HIP program. |
| -fopenmp | Enables the OpenMP support. |
| -fopenmp-targets=<gpu> | Enables the OpenMP target offload support of the specified GPU architecture, where <gpu> is the GPU architecture (for example, gfx908). |
| --gpu-max-threads-per-block=<num> | Sets the default limit of threads per block, also referred to as the launch bounds, where <num> is the maximum number of threads per block. |
| -munsafe-fp-atomics | Enables unsafe floating point atomic instructions (AMDGPU only). |
| -ffast-math | Allows aggressive, lossy floating-point optimizations. |
| -mwavefrontsize64 / -mno-wavefrontsize64 | Sets wavefront size to be 64 or 32 on RDNA architectures. |
| -mcumode / -mno-cumode | Switches between CU and WGP modes on RDNA architectures. |
| --offload-arch=<gpu> | HIP offloading target ID. |
| -g | Generates source-level debug information. |
| -fgpu-rdc | Generates relocatable device code, also known as separate compilation mode. |
AMD optimizations for zen architectures#
The CPU compiler optimizations described in this section originate from the AMD Optimizing C/C++ Compiler (AOCC). They are available in the ROCm compilers when the optional rocm-llvm-alt package is installed. The user's interaction with the compiler does not change once rocm-llvm-alt is installed; the same compiler entry points are used, and AMD provides high-performance compiler optimizations for Zen-based processors through AOCC.
For more information, refer to https://www.amd.com/en/developer/aocc.html.
-famd-opt#
Enables a default set of AMD proprietary optimizations for the AMD Zen CPU architectures. -fno-amd-opt disables the AMD proprietary optimizations.
The -famd-opt flag is useful when a user wants to build with the proprietary optimization compiler without having to set any of the other proprietary optimization flags.
Note
-famd-opt can be used in addition to the other proprietary CPU optimization flags. The optimization flags described below implicitly enable the invocation of the AMD proprietary optimizations compiler, whereas the -famd-opt flag enables it explicitly.
-fstruct-layout=[1,2,3,4,5,6,7]#
Analyzes the whole program to determine if the structures in the code can be peeled and the pointer or integer fields in the structure can be compressed. If feasible, this optimization transforms the code to enable these improvements. This transformation is likely to improve cache utilization and memory bandwidth. It is expected to improve the scalability of programs executed on multiple cores.
This is effective only under -flto, as whole-program analysis is required to perform this optimization. Users can choose different levels of aggressiveness with which this optimization is applied to the application, with 1 being the least aggressive and 7 the most aggressive level.
| Level | Structure peeling | Pointer size after selective compression of self-referential pointers in structures, wherever safe | Type of structure fields eligible for compression | Whether compression performed under safety check |
|---|---|---|---|---|
| 1 | Enabled | NA | NA | NA |
| 2 | Enabled | 32-bit | NA | NA |
| 3 | Enabled | 16-bit | NA | NA |
| 4 | Enabled | 32-bit | Integer | Yes |
| 5 | Enabled | 16-bit | Integer | Yes |
| 6 | Enabled | 32-bit | 64-bit signed int or unsigned int. Users must ensure that the values assigned to 64-bit signed int fields are in the range -(2^31 - 1) to +(2^31 - 1) and 64-bit unsigned int fields are in the range 0 to +(2^31 - 1). Otherwise, you may obtain incorrect results. | No. Users must ensure the safety based on the program compiled. |
| 7 | Enabled | 16-bit | 64-bit signed int or unsigned int. Users must ensure that the values assigned to 64-bit signed int fields are in the range -(2^31 - 1) to +(2^31 - 1) and 64-bit unsigned int fields are in the range 0 to +(2^31 - 1). Otherwise, you may obtain incorrect results. | No. Users must ensure the safety based on the program compiled. |
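To illustrate the kind of transformation -fstruct-layout performs, the following hand-written C sketch (an illustration, not compiler-generated output; all names are hypothetical) shows a structure peeled into hot and cold parts, with a self-referential 64-bit pointer compressed into a 32-bit index into a node pool:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Original layout: hot field, cold fields, and a self-referential
 * 64-bit pointer all share one cache line per element. */
struct node {
    long   key;          /* frequently accessed (hot) */
    double stats[4];     /* rarely accessed (cold)    */
    struct node *next;   /* self-referential, 8 bytes */
};

/* Peeled layout: hot data is packed densely, cold data lives in a
 * parallel array, and the pointer is compressed to a 32-bit index
 * into the node pool (safe only while the pool stays small enough). */
#define POOL_MAX 1024
struct node_hot {
    long     key;
    uint32_t next_idx;   /* compressed self-referential "pointer" */
};
struct node_cold {
    double stats[4];
};

static struct node_hot  hot_pool[POOL_MAX];
static struct node_cold cold_pool[POOL_MAX];

/* Follow the compressed link, mimicking p = p->next. */
static uint32_t follow(uint32_t idx) { return hot_pool[idx].next_idx; }

/* Cold fields are still reachable through the same index. */
double first_stat(uint32_t idx) { return cold_pool[idx].stats[0]; }

long sum_keys(uint32_t head, int n) {
    long sum = 0;
    for (int i = 0; i < n; i++) {
        sum += hot_pool[head].key;   /* touches only hot, dense data */
        head = follow(head);
    }
    return sum;
}
```

Traversals that only read the key now touch far fewer cache lines, which is the cache-utilization benefit the option description refers to.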
-fitodcalls#
Promotes indirect-to-direct calls by placing conditional calls. Applications or benchmarks that have a small and deterministic set of target functions for function pointers passed as call parameters benefit from this optimization. Indirect-to-direct call promotion transforms the code to use all possible determined targets under runtime checks and falls back to the original code for all the other cases. Runtime checks are introduced by the compiler for each of these possible function pointer targets, followed by direct calls to the targets.
This is a link time optimization, which is invoked as -flto -fitodcalls.
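The transformation can be pictured in plain C. The following is a hand-written sketch of what the compiler does internally (square and negate are hypothetical target functions):

```c
#include <assert.h>

static int square(int x) { return x * x; }
static int negate(int x) { return -x; }

/* Original code: an opaque indirect call through a function pointer. */
int apply_indirect(int (*fn)(int), int v) {
    return fn(v);
}

/* After indirect-to-direct promotion: each statically determined target
 * is guarded by a runtime pointer check and called directly (so it can
 * be inlined), with a fallback to the original indirect call. */
int apply_promoted(int (*fn)(int), int v) {
    if (fn == square) return square(v);  /* direct call, inlinable */
    if (fn == negate) return negate(v);  /* direct call, inlinable */
    return fn(v);                        /* fallback: indirect call */
}
```

Both versions compute the same result; the promoted form simply exposes the common targets to inlining and further optimization.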
-fitodcallsbyclone#
Performs value specialization for functions with function pointers passed as arguments by generating a clone of the function. The cloning happens in the call chain as needed to allow conversion of an indirect function call to a direct call.
This complements the -fitodcalls optimization and is also a link time optimization, which is invoked as -flto -fitodcallsbyclone.
-fremap-arrays#
Transforms the data layout of a single-dimensional array to provide better cache locality. This optimization is effective only under -flto, as the whole program needs to be analyzed to perform this optimization; it can be invoked as -flto -fremap-arrays.
-finline-aggressive#
Enables improved inlining capability through better heuristics. This optimization is more effective when used with -flto, as whole-program analysis is required to perform it; it can be invoked as -flto -finline-aggressive.
-fnt-store (non-temporal store)#
Generates a non-temporal store instruction for array accesses in a loop with a large trip count.
-fnt-store=aggressive#
An experimental option to generate non-temporal store instructions for array accesses in a loop whose iteration count cannot be determined at compile time. In this case, the compiler assumes the iteration count to be huge.
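The effect of a non-temporal store can be sketched with explicit intrinsics (a hand-written illustration of what the compiler emits automatically under -fnt-store; the streaming path below assumes an x86 target and falls back to plain stores elsewhere):

```c
#include <assert.h>
#include <stddef.h>
#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#endif

/* Fill a large array. For huge trip counts the stored data will not be
 * re-read soon, so caching it only evicts useful lines; non-temporal
 * (streaming) stores bypass the cache hierarchy instead. */
void fill_stream(int *dst, size_t n, int value) {
    for (size_t i = 0; i < n; i++) {
#if defined(__x86_64__) || defined(__i386__)
        _mm_stream_si32(&dst[i], value);  /* non-temporal store */
#else
        dst[i] = value;                   /* plain store on other targets */
#endif
    }
#if defined(__x86_64__) || defined(__i386__)
    _mm_sfence();  /* make the streaming stores globally visible */
#endif
}
```

The fence after the loop mirrors what a correct lowering must do, since streaming stores are weakly ordered.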
Optimizations through driver -mllvm <options>#
The following optimization options must be invoked through the driver option -mllvm <options>:
-enable-partial-unswitch#
Enables partial loop unswitching, which is an enhancement to the existing loop unswitching optimization in LLVM. Partial loop unswitching hoists a condition inside a loop from a path for which the execution condition remains invariant, whereas the original loop unswitching works for a condition that is completely loop invariant. The condition inside the loop gets hoisted out from the invariant path, and the original loop is retained for the path where the condition is variant.
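A hand-written before/after sketch of partial unswitching (the compiler performs this automatically; the code below is only an illustration of the transformation):

```c
#include <assert.h>

/* Before: `flag` is invariant on the path where data[i] >= 0, but the
 * branch is still evaluated inside the loop on every iteration. */
long sum_before(const int *data, int n, int flag) {
    long sum = 0;
    for (int i = 0; i < n; i++) {
        if (data[i] < 0)
            sum -= data[i];          /* variant path, depends on data */
        else if (flag)               /* invariant on this path */
            sum += 2L * data[i];
        else
            sum += data[i];
    }
    return sum;
}

/* After partial unswitching: the invariant condition `flag` is hoisted
 * out, producing two loop versions, while the variant data[i] < 0
 * branch is retained inside both loops. */
long sum_after(const int *data, int n, int flag) {
    long sum = 0;
    if (flag) {
        for (int i = 0; i < n; i++)
            sum += (data[i] < 0) ? -(long)data[i] : 2L * data[i];
    } else {
        for (int i = 0; i < n; i++)
            sum += (data[i] < 0) ? -(long)data[i] : (long)data[i];
    }
    return sum;
}
```

Each specialized loop body evaluates one branch fewer per iteration, at the cost of duplicating the loop.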
-aggressive-loop-unswitch#
Experimental option that enables an aggressive loop unswitching heuristic (including -enable-partial-unswitch) based on the usage of the branch conditional values. Loop unswitching leads to code bloat, which can be minimized if the hoisted condition is executed more often. This heuristic prioritizes the conditions based on the number of times they are used within the loop. The heuristic can be controlled with the following options:
-unswitch-identical-branches-min-count=<n>
Enables unswitching of a loop with respect to a branch conditional value (B), where B appears in at least <n> compares in the loop. This option is enabled with -aggressive-loop-unswitch. The default value is 3.
Usage: -mllvm -aggressive-loop-unswitch -mllvm -unswitch-identical-branches-min-count=<n>
Here, <n> is a positive integer; a lower value of <n> facilitates more unswitching.
-unswitch-identical-branches-max-count=<n>
Enables unswitching of a loop with respect to a branch conditional value (B), where B appears in at most <n> compares in the loop. This option is enabled with -aggressive-loop-unswitch. The default value is 6.
Usage: -mllvm -aggressive-loop-unswitch -mllvm -unswitch-identical-branches-max-count=<n>
Here, <n> is a positive integer; a higher value of <n> facilitates more unswitching.
Note
These options may facilitate more unswitching under some workloads. Since loop unswitching inherently leads to code bloat, facilitating more unswitching may significantly increase the code size and may also lead to longer compilation times.
-enable-strided-vectorization#
Enables strided memory vectorization as an enhancement to the interleaved vectorization framework present in LLVM. It enables the effective use of gather and scatter kind of instruction patterns. This flag must be used along with the interleave vectorization flag.
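The access pattern this flag targets looks like the following (an illustrative example; with strided vectorization enabled, the compiler can map such constant-stride loads onto gather-style vector instructions):

```c
#include <assert.h>

/* De-interleave one channel of interleaved stereo samples: the loads of
 * src[2*i] have a constant stride of 2, a classic candidate pattern for
 * strided (gather-style) vectorization, while the store is unit-stride. */
void extract_left_channel(const float *src, float *dst, int frames) {
    for (int i = 0; i < frames; i++)
        dst[i] = src[2 * i];
}
```

Without strided vectorization support, such loops are usually left scalar or handled less efficiently by the interleaved vectorizer alone.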
-enable-epilog-vectorization#
Enables vectorization of epilog iterations as an enhancement to the existing vectorization framework. This enables generation of an additional epilog vector loop version for the remainder iterations of the original vector loop. The vector size or factor of the original loop should be large enough to allow effective epilog vectorization of the remaining iterations. This optimization takes place only when the original vector loop is vectorized with a vector width or factor of 16. This vectorization width of 16 may be overridden by the -min-width-epilog-vectorization command-line option.
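A scalar sketch of the loop structure epilog vectorization produces (the widths are illustrative, and the "vector" bodies are modeled with inner scalar loops purely for readability):

```c
#include <assert.h>

/* Conceptual shape of the generated code: a main loop processing 16
 * elements per iteration, an epilog "vector" loop for a smaller factor,
 * and a final scalar remainder loop. */
void add_one(float *a, int n) {
    int i = 0;
    for (; i + 16 <= n; i += 16)        /* main vector loop, VF = 16 */
        for (int j = 0; j < 16; j++)
            a[i + j] += 1.0f;
    for (; i + 4 <= n; i += 4)          /* epilog vector loop, VF = 4 */
        for (int j = 0; j < 4; j++)
            a[i + j] += 1.0f;
    for (; i < n; i++)                  /* scalar remainder */
        a[i] += 1.0f;
}
```

Without the epilog vector loop, all n % 16 remainder iterations would run scalar; with it, most of the remainder is still processed in vector form.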
-enable-redundant-movs#
Removes any redundant mov operations, including redundant loads from memory and stores to memory. This can be invoked using -Wl,-plugin-opt=-enable-redundant-movs.
-merge-constant#
Attempts to promote frequently occurring constants to registers. The aim is to reduce the size of the instruction encoding for instructions using constants and obtain a performance improvement.
-function-specialize#
Optimizes functions with compile-time constant formal arguments.
-lv-function-specialization#
Generates specialized function versions when the loops inside the function are vectorizable and the arguments are not aliased with each other.
-enable-vectorize-compares#
Enables vectorization on certain loops with conditional breaks assuming the memory accesses are safely bound within the page boundary.
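Loops of the following shape, with a data-dependent early exit, are what this flag targets (an illustrative example; the compiler may then perform the compares a vector at a time, under the assumption that speculative loads stay within the page containing the accessed data):

```c
#include <assert.h>

/* Find the first index where two buffers differ. The conditional break
 * normally blocks vectorization, but the comparisons themselves can be
 * evaluated in vector chunks when the loads are known to be page-safe. */
int first_mismatch(const char *a, const char *b, int n) {
    for (int i = 0; i < n; i++) {
        if (a[i] != b[i])
            return i;       /* conditional break out of the loop */
    }
    return -1;
}
```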
-inline-recursion=[1,2,3,4]#
Enables inlining for recursive functions based on heuristics where the aggressiveness of heuristics increases with the level (1-4). The default level is 2. Higher levels may lead to code bloat due to expansion of recursive functions at call sites.
| Level | Inline depth of heuristics used to enable inlining for recursive functions |
|---|---|
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 4 | 10 |
This is more effective with -flto, as the whole program needs to be analyzed to perform this optimization; it can be invoked as -flto -inline-recursion=[1,2,3,4].
-reduce-array-computations=[1,2,3]#
Performs array data flow analysis and optimizes the unused array computations.
| Level | Array elements eligible for elimination of computations |
|---|---|
| 1 | Unused |
| 2 | Zero valued |
| 3 | Both unused and zero valued |
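A minimal illustration of the computations this optimization can eliminate (hand-written; the compiler derives the same conclusion automatically from whole-program data flow analysis under -flto):

```c
#include <assert.h>

#define N 8

/* Whole-program analysis can prove that only even-indexed elements of
 * `out` are ever read (see use_even below), so the odd-index work in
 * this loop is dead and eligible for elimination under
 * -flto -reduce-array-computations. */
void compute(const int *in, int *out) {
    for (int i = 0; i < N; i++)
        out[i] = in[i] * in[i];   /* odd i: unused computation */
}

long use_even(const int *out) {
    long sum = 0;
    for (int i = 0; i < N; i += 2)
        sum += out[i];            /* only even elements are consumed */
    return sum;
}
```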
This optimization is effective with -flto
as the whole program needs to be
analyzed to perform this optimization, which can be invoked as
-flto -reduce-array-computations=[1,2,3]
.
-global-vectorize-slp={true,false}#
Vectorizes the straight-line code inside a basic block with data reordering vector operations. This option is set to true by default.
-region-vectorize#
Experimental flag for enabling vectorization on certain loops with complex control flow, which the normal vectorizer cannot handle.
This optimization is effective with -flto
as the whole program needs to be
analyzed to perform this optimization, which can be invoked as
-flto -region-vectorize
.
-enable-x86-prefetching#
Enables the generation of x86 prefetch instructions for memory references inside a loop, or inside the innermost loop of a loop nest, to prefetch the second dimension of multidimensional array/memory references. This is an experimental pass; its profitability is being improved.
-suppress-fmas#
Identifies reduction patterns on FMA and suppresses the FMA generation, as it is not profitable on the reduction patterns.
-enable-icm-vrp#
Enables estimation of the virtual register pressure before performing loop invariant code motion. This estimation is used to control the number of loop invariants that will be hoisted during the loop invariant code motion.
-loop-splitting#
Enables splitting of loops into multiple loops to eliminate the branches that compare the loop induction with an invariant or constant expression. This option is enabled under -O3 by default. To disable this optimization, use -loop-splitting=false.
-enable-ipo-loop-split#
Enables splitting of loops into multiple loops to eliminate the branches that compare the loop induction with a constant expression. This constant expression can be derived through inter-procedural analysis. This option is enabled under -O3 by default. To disable this optimization, use -enable-ipo-loop-split=false.
-compute-interchange-order#
Enables a heuristic for finding the best possible interchange order for a loop nest. To enable this option, use -enable-loopinterchange. This option is set to false by default.
Usage:
-mllvm -enable-loopinterchange -mllvm -compute-interchange-order
-convert-pow-exp-to-int={true,false}#
Converts the call to floating point exponent version of pow to its integer exponent version if the floating-point exponent can be converted to integer. This option is set to true by default.
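The effect of this conversion can be sketched as follows: a call such as pow(x, 3.0), whose floating-point exponent is provably a whole number, is rewritten to an integer-exponent expansion (powi below is a hypothetical helper name standing in for the compiler's internal lowering):

```c
#include <assert.h>

/* Integer-exponent expansion via binary exponentiation: the kind of
 * code the compiler substitutes for pow(x, n) when the floating-point
 * exponent can be converted to an integer. */
double powi(double x, unsigned n) {
    double result = 1.0;
    while (n) {
        if (n & 1)
            result *= x;
        x *= x;
        n >>= 1;
    }
    return result;
}
```

This avoids the general-purpose pow library call, which is far more expensive than a handful of multiplications.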
-do-lock-reordering={none,normal,aggressive}#
Reorders the control predicates in increasing order of complexity from outer predicate to inner when it is safe. The normal mode reorders simple expressions, while the aggressive mode reorders predicates involving function calls if no side effects are determined. This option is set to normal by default.
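A hand-written sketch of the reordering: the simple predicate is moved ahead of the more complex one so the costly check is evaluated only when necessary (is_expensive_check is a hypothetical side-effect-free predicate; the call counter exists only to make the effect observable):

```c
#include <assert.h>

static int calls = 0;

/* Stands in for a costly predicate involving a function call, which
 * the aggressive mode may reorder when no side effects are determined. */
static int is_expensive_check(int x) {
    calls++;                  /* count invocations for the illustration */
    return x % 7 == 0;
}

/* Before: the complex predicate is evaluated first. */
int guarded_before(int x, int enabled) {
    if (is_expensive_check(x) && enabled)
        return 1;
    return 0;
}

/* After reordering: the simple flag test short-circuits the call. */
int guarded_after(int x, int enabled) {
    if (enabled && is_expensive_check(x))
        return 1;
    return 0;
}
```

Both functions return the same value for all inputs; the reordered form simply skips the expensive evaluation whenever the cheap predicate already decides the outcome.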
-fuse-tile-inner-loop#
Enables fusion of adjacent tiled loops as a part of loop tiling transformation. This option is set to false by default.
-Hz,1,0x1 [Fortran]#
Helps to preserve array index information for array access expressions that get linearized in the compiler front end. The preserved information is used by the compiler optimization phases to perform optimizations such as loop transformations. Any user relying on loop transformations or other optimizations requiring de-linearized index expressions should use the -Hz,1,0x1 option. This option has no impact on any other aspects of the Flang front end.
Inline ASM statements#
Inline assembly (ASM) statements allow a developer to include assembly instructions directly in either host or device code. While the ROCm compiler supports ASM statements, their use is not recommended for the following reasons:
The compiler's ability to produce both correct code and to optimize surrounding code is impeded.
The compiler does not parse the content of ASM statements and therefore cannot examine their contents.
The compiler must make conservative assumptions in an effort to retain correctness.
The conservative assumptions may yield code that is less performant than code without ASM statements. A syntactically correct ASM statement may still cause incorrect runtime behavior.
ASM statements are often ASIC-specific; code containing them is less portable and adds a maintenance burden for the developer if different ASICs are targeted.
Writing correct ASM statements is often difficult; thorough testing of any ASM statements is strongly encouraged.
Note
For developers who include ASM statements in their code, AMD is interested in understanding the use case and appreciates feedback at ROCm/ROCm#issues.
Miscellaneous OpenMP compiler features#
This section discusses features that have been added or enhanced in the OpenMP compiler.
offload-arch tool#
offload-arch is an LLVM library and tool used to query the execution capability of the current system as well as the requirements of a binary file. It is used by the OpenMP device runtime to ensure compatibility of an image with the current system while loading it. It is compatible with target ID support and multi-image fat binary support.
Usage:
offload-arch [Options] [Optional lookup-value]
When used without an option, offload-arch
prints the value of the first offload
architecture found in the underlying system. This can be used by various clang
front ends. For example, to compile for OpenMP offloading on your current system,
invoke clang with the following command:
clang -fopenmp -fopenmp-targets=`offload-arch` foo.c
If an optional lookup-value is specified, offload-arch
will check if the value
is either a valid offload-arch
or a codename and look up requested additional
information.
The following command provides all the information for offload-arch gfx906:
offload-arch gfx906 -v
The options are listed below:
| Option | Description |
|---|---|
| -h | Prints the help message. |
| -a | Prints values for all devices. Does not stop at the first device found. |
| -m | Prints the device code name (often found in the pci.ids file). |
| -n | Prints the numeric pci-id. |
| -t | Prints the clang offload triple to use for the offload arch. |
| -v | Verbose. Implies the options -a -m -n -t. |
| -f <file> | Prints offload requirements, including offload-arch, for each compiled offload image built into an application binary file. |
| -c | Prints offload capabilities of the underlying system. This option is used by the language runtime to select an image when multiple images are available. A capability must exist for each requirement of the selected image. |
There are symbolic link aliases amdgpu-offload-arch
and nvidia-arch
for
offload-arch
. These aliases return 1 if no AMD GCN GPU or CUDA GPU is found.
These aliases are useful in determining whether architecture-specific tests
should be run or to conditionally load architecture-specific software.
The legacy mechanism of specifying an offloading target for OpenMP involves three flags: -fopenmp-targets, -Xopenmp-target, and -march. The first two flags take a target triple (such as amdgcn-amd-amdhsa or nvptx64-nvidia-cuda), while the last flag takes a device name (such as gfx908 or sm_70) as input. Alternatively, users of the ROCm compiler can simply use the --offload-arch flag for the combined effect of the preceding three flags.
Example:
# Legacy mechanism
clang -fopenmp -target x86_64-linux-gnu \
-fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa \
-march=gfx906 helloworld.c -o helloworld
Example:
# Using offload-arch flag
clang -fopenmp -target x86_64-linux-gnu \
--offload-arch=gfx906 helloworld.c -o helloworld
To ensure backward compatibility, both styles are supported. This option is compatible with target ID support and multi-image fat binaries.
The ROCm compiler supports specification of target features along with the GPU
name while specifying a target offload device in the command line, using
-march
or --offload-arch
options. The compiled image in such cases is
specialized for a given configuration of device and target features (target ID).
Example:
# compiling for a gfx908 device with XNACK paging support turned ON
clang -fopenmp -target x86_64-linux-gnu \
-fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa \
-march=gfx908:xnack+ helloworld.c -o helloworld
Example:
# compiling for a gfx908 device with SRAMECC support turned OFF
clang -fopenmp -target x86_64-linux-gnu \
-fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa \
-march=gfx908:sramecc- helloworld.c -o helloworld
Example:
# compiling for a gfx908 device with SRAMECC support turned ON and XNACK paging support turned OFF
clang -fopenmp -target x86_64-linux-gnu \
-fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa \
-march=gfx908:sramecc+:xnack- helloworld.c -o helloworld
The target ID specified on the command line is passed to the clang driver using the target-feature flag, to the LLVM optimizer and back end using the -mattr flag, and to the linker using the -plugin-opt=-mattr flag. This feature is compatible with the offload-arch command-line option and multi-image binaries for multiple architectures.
The ROCm compiler is enhanced to generate binaries that can contain heterogeneous images. This heterogeneity can be in terms of:
Images of different architectures, like AMD GCN and NVPTX
Images of the same architecture but for different GPUs, like gfx906 and gfx908
Images of the same architecture and GPU but for different target features, like gfx908:xnack+ and gfx908:xnack-
An appropriate image is selected by the OpenMP device runtime for execution depending on the capability of the current system. This feature is compatible with target ID support and the offload-arch command-line option, and it uses the offload-arch tool to determine the capability of the current system.
Example:
clang -fopenmp -target x86_64-linux-gnu \
-fopenmp-targets=amdgcn-amd-amdhsa,amdgcn-amd-amdhsa \
-Xopenmp-target=amdgcn-amd-amdhsa -march=gfx906 \
-Xopenmp-target=amdgcn-amd-amdhsa -march=gfx908 \
helloworld.c -o helloworld
Example:
clang -fopenmp -target x86_64-linux-gnu \
--offload-arch=gfx906 \
--offload-arch=gfx908 \
helloworld.c -o helloworld
Example:
clang -fopenmp -target x86_64-linux-gnu \
-fopenmp-targets=amdgcn-amd-amdhsa,amdgcn-amd-amdhsa,amdgcn-amd-amdhsa,amdgcn-amd-amdhsa \
-Xopenmp-target=amdgcn-amd-amdhsa -march=gfx908:sramecc+:xnack+ \
-Xopenmp-target=amdgcn-amd-amdhsa -march=gfx908:sramecc-:xnack+ \
-Xopenmp-target=amdgcn-amd-amdhsa -march=gfx908:sramecc+:xnack- \
-Xopenmp-target=amdgcn-amd-amdhsa -march=gfx908:sramecc-:xnack- \
helloworld.c -o helloworld
The ROCm compilers create a toolchain instance for each unique combination of target triple and target GPU (along with the associated target features). The clang-offload-wrapper tool is modified to insert a new structure, __tgt_image_info, along with each image in the binary. The device runtime is also modified to query this structure to identify a compatible image based on the capability of the current system.
The following OpenMP pragma is available on MI200, and it must be executed with xnack+ support:
#pragma omp requires unified_shared_memory
For more details on unified shared memory, see OpenMP support.