HIP porting guide#
HIP is designed to ease the porting of existing CUDA code into the HIP environment. This page describes the available tools and provides practical suggestions on how to port CUDA code and work through common issues.
Porting a CUDA Project#
Mixing HIP and CUDA code results in valid CUDA code on the NVIDIA platform. This enables users to port CUDA to HIP incrementally, and still compile and test the code during the transition.
The only notable exception is hipError_t, which is not just an alias of cudaError_t. For this case, HIP provides functions to convert between the two error code spaces:
hipErrorToCudaError()
hipErrorToCUResult()
hipCUDAErrorTohipError()
hipCUResultTohipError()
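For example, when mixed code on the NVIDIA platform receives a cudaError_t from a CUDA API call, it can be converted into HIP’s error space so that the rest of the hipified code handles errors uniformly. The following is a minimal sketch (the wrapper name is illustrative; the conversion functions are only available when compiling for the NVIDIA platform):
#include <hip/hip_runtime.h>
#include <cstddef>

#ifdef __HIP_PLATFORM_NVIDIA__
// Wrap a raw CUDA allocation so callers only ever see hipError_t.
hipError_t checked_alloc(void** ptr, size_t size) {
    cudaError_t cuda_err = cudaMalloc(ptr, size); // returns a CUDA error code
    return hipCUDAErrorTohipError(cuda_err);      // convert to the HIP error space
}
#endif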
General Tips#
- Starting the port on an NVIDIA machine is often the easiest approach, as the code can be tested for functionality and performance even if it is not yet fully ported to HIP.
- Once the CUDA code is ported to HIP and runs on the NVIDIA machine, compile the HIP code for an AMD machine.
- Handle platform-specific features through conditional compilation, or by adding them to the open-source HIP infrastructure.
- Use the HIPIFY tools to automatically convert CUDA code to HIP, as described in the following section.
HIPIFY#
HIPIFY is a collection of tools that automatically translate CUDA code to HIP code. There are two flavours available: hipify-clang and hipify-perl.
hipify-clang is, as the name implies, a Clang-based tool. It actually parses the code, translates it into an abstract syntax tree, and generates the HIP source from that. For this, hipify-clang needs to be able to actually compile the code, so the CUDA code must be correct, and a CUDA installation with all necessary headers must be provided.
hipify-perl uses pattern matching to translate CUDA code to HIP. It does not require a working CUDA installation and can also convert CUDA code that is not syntactically correct. It is therefore easier to set up and use, but it is not as powerful as hipify-clang.
Scanning existing CUDA code to scope the porting effort#
The --examine option, supported by both the Clang and the Perl version, tells hipify to do a test run without changing the files: it scans the CUDA code to determine which files contain CUDA code and how much of that code can be hipified automatically.
The hipexamine-perl.sh (for hipify-perl) and hipexamine.sh (for hipify-clang) scripts automatically scan whole directories.
For example, the following is a scan of one of the cuda-samples:
> cd Samples/2_Concepts_and_Techniques/convolutionSeparable/
> hipexamine-perl.sh
[HIPIFY] info: file './convolutionSeparable.cu' statistics:
CONVERTED refs count: 2
TOTAL lines of code: 214
WARNINGS: 0
[HIPIFY] info: CONVERTED refs by names:
cooperative_groups.h => hip/hip_cooperative_groups.h: 1
cudaMemcpyToSymbol => hipMemcpyToSymbol: 1
[HIPIFY] info: file './main.cpp' statistics:
CONVERTED refs count: 13
TOTAL lines of code: 174
WARNINGS: 0
[HIPIFY] info: CONVERTED refs by names:
cudaDeviceSynchronize => hipDeviceSynchronize: 2
cudaFree => hipFree: 3
cudaMalloc => hipMalloc: 3
cudaMemcpy => hipMemcpy: 2
cudaMemcpyDeviceToHost => hipMemcpyDeviceToHost: 1
cudaMemcpyHostToDevice => hipMemcpyHostToDevice: 1
cuda_runtime.h => hip/hip_runtime.h: 1
[HIPIFY] info: file 'GLOBAL' statistics:
CONVERTED refs count: 15
TOTAL lines of code: 512
WARNINGS: 0
[HIPIFY] info: CONVERTED refs by names:
cooperative_groups.h => hip/hip_cooperative_groups.h: 1
cudaDeviceSynchronize => hipDeviceSynchronize: 2
cudaFree => hipFree: 3
cudaMalloc => hipMalloc: 3
cudaMemcpy => hipMemcpy: 2
cudaMemcpyDeviceToHost => hipMemcpyDeviceToHost: 1
cudaMemcpyHostToDevice => hipMemcpyHostToDevice: 1
cudaMemcpyToSymbol => hipMemcpyToSymbol: 1
cuda_runtime.h => hip/hip_runtime.h: 1
hipexamine-perl.sh reports how many CUDA references are going to be converted to HIP (e.g. CONVERTED refs count: 2), and lists them by name together with their corresponding HIP version (see the lines following [HIPIFY] info: CONVERTED refs by names:). It also lists the total lines of code of each file and potential warnings. At the end it prints a summary over all files.
Automatically converting a CUDA project#
To directly replace the files, the --inplace option of hipify-perl or hipify-clang can be used. This creates a backup of each original file in a <filename>.prehip file and overwrites the existing files, keeping their file extensions. If the --inplace option is not given, the scripts print the hipified code to stdout.
The hipconvertinplace.sh or hipconvertinplace-perl.sh scripts operate on whole directories.
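For example, typical invocations of the Perl tools might look like the following (file and directory names are illustrative; check each script's help for exact arguments):
hipify-perl main.cu > main_hip.cpp  # print the hipified code to stdout and redirect it
hipify-perl --inplace main.cu       # overwrite main.cu, keeping a backup in main.cu.prehip
hipconvertinplace-perl.sh src       # hipify all files under the src directory in place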
Library Equivalents#
ROCm provides libraries to ease porting of code relying on CUDA libraries. Most CUDA libraries have a corresponding HIP library.
There are two flavours of libraries provided by ROCm: ones prefixed with hip and ones prefixed with roc. While both are written using HIP, in general only the hip-libraries are portable. The libraries with the roc-prefix might also run on CUDA-capable GPUs, but they have been optimized for AMD GPUs and might use assembly code or a different API to achieve the best performance.
Note: If the application is only required to run on AMD GPUs, it is recommended to use the roc-libraries.
In the case where a library provides both a roc- and a hip-version, the hip-version is a marshalling library: a thin layer that redirects the function calls to either the roc-library or the corresponding CUDA library, depending on the platform, to provide compatibility.
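For example, a program written against the hipBLAS marshalling layer compiles unchanged on both platforms. The following is a minimal sketch, assuming hipBLAS is installed (error checking omitted for brevity):
#include <hipblas/hipblas.h>

int main() {
    hipblasHandle_t handle;
    // On the AMD platform this call is forwarded to rocBLAS,
    // on the NVIDIA platform to cuBLAS; the calling code is identical.
    hipblasCreate(&handle);
    // ... BLAS calls such as hipblasSgemm() would go here ...
    hipblasDestroy(handle);
}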
| CUDA Library | hip Library | roc Library | Comment |
|---|---|---|---|
| cuBLAS | hipBLAS | rocBLAS | Basic Linear Algebra Subroutines |
| cuBLASLt | hipBLASLt | | Linear Algebra Subroutines, lightweight and new flexible API |
| cuFFT | hipFFT | rocFFT | Fast Fourier Transform library |
| cuSPARSE | hipSPARSE | rocSPARSE | Sparse BLAS + SPMV |
| cuSOLVER | hipSOLVER | rocSOLVER | LAPACK library |
| AmgX | | rocALUTION | Sparse iterative solvers and preconditioners with algebraic multigrid |
| Thrust | | rocThrust | C++ parallel algorithms library |
| CUB | hipCUB | rocPRIM | Low-level optimized parallel primitives |
| cuDNN | | MIOpen | Deep learning solver library |
| cuRAND | hipRAND | rocRAND | Random number generator library |
| NCCL | | RCCL | Communication primitives library based on the MPI equivalents. RCCL is a drop-in replacement for NCCL |
Distinguishing compilers and platforms#
Identifying the HIP Target Platform#
HIP projects can target either the AMD or the NVIDIA platform. The platform affects which backend headers are included and which libraries are used for linking. The resulting binaries are not portable between AMD and NVIDIA platforms.
To write platform-specific code, use the C++ macros described in the following section.
Compiler Defines: Summary#
This section lists macros that are defined by compilers and by the HIP/CUDA APIs, and the compiler/platform combinations for which they are defined.
The following table lists the macros that can be used when compiling HIP. Most of these macros are not defined directly by the compilers, but in hip_common.h, which is included by hip_runtime.h.
| Macro | amdclang++ | nvcc | Other (GCC, ICC, Clang, etc.) |
|---|---|---|---|
| __HIP_PLATFORM_AMD__ | Defined | Undefined | Undefined, needs to be set explicitly |
| __HIP_PLATFORM_NVIDIA__ | Undefined | Defined | Undefined, needs to be set explicitly |
| __HIPCC__ | Defined when compiling | Defined when compiling | Undefined |
| __HIP_DEVICE_COMPILE__ | 1 if compiling for device, undefined if compiling for host | 1 if compiling for device, undefined if compiling for host | Undefined |
| __HIP_ARCH_<FEATURE>__ | 0 or 1 depending on feature support of targeted hardware (see Identifying Device Architecture Features) | 0 or 1 depending on feature support of targeted hardware | 0 |
| __HIP__ | Defined when compiling | Undefined | Undefined |
The following table lists macros related to nvcc and to CUDA as the HIP backend.
| Macro | amdclang++ | nvcc | Other (GCC, ICC, Clang, etc.) |
|---|---|---|---|
| __CUDACC__ | Undefined | Defined | Undefined (Clang defines this when explicitly compiling CUDA code) |
| __NVCC__ | Undefined | Defined | Undefined |
| __CUDA_ARCH__ | Undefined | Defined in device code as an integer representing the compute capability; must not be used in host code | Undefined |
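As a sketch of how these defines can be combined in practice, for example to isolate a compiler-specific workaround:
#if defined(__HIP_PLATFORM_NVIDIA__) && defined(__CUDACC__)
// HIP code being compiled by nvcc (or by clang in CUDA mode) for the NVIDIA platform.
#elif defined(__HIP_PLATFORM_AMD__) && defined(__HIP__)
// HIP code being compiled by amdclang++ for the AMD platform.
#endif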
Identifying the compilation target platform#
Despite HIP’s portability, it can be necessary to tailor code to a specific platform, either to use platform-specific features or to apply platform-specific performance optimizations.
For this, the __HIP_PLATFORM_AMD__ and __HIP_PLATFORM_NVIDIA__ macros can be used, e.g.:
#ifdef __HIP_PLATFORM_AMD__
// This code path is compiled when amdclang++ is used for compilation
#endif
#ifdef __HIP_PLATFORM_NVIDIA__
// This code path is compiled when nvcc is used for compilation
// Could be compiling with CUDA language extensions enabled (for example, a .cu file)
// Could be in pass-through mode to an underlying host compiler (for example, a .cpp file)
#endif
When using hipcc, the environment variable HIP_PLATFORM specifies the runtime to use. When an AMD graphics driver and an AMD GPU are detected, HIP_PLATFORM is set to amd. If both runtimes are installed and a specific one should be used, or if hipcc can’t detect the runtime, the environment variable has to be set manually.
To explicitly use the CUDA compilation path, use:
export HIP_PLATFORM=nvidia
hipcc main.cpp
Identifying Host or Device Compilation Pass#
amdclang++ makes multiple passes over the code: one for the host code, and one for the device code of each GPU architecture being compiled for. nvcc makes two passes over the code: one for host code and one for device code.
The __HIP_DEVICE_COMPILE__ macro is defined when the compiler is compiling for the device. It is a portable check that can replace uses of __CUDA_ARCH__.
#include "hip/hip_runtime.h"
#include <iostream>
__host__ __device__ void call_func(){
#ifdef __HIP_DEVICE_COMPILE__
printf("device\n");
#else
std::cout << "host" << std::endl;
#endif
}
__global__ void test_kernel(){
call_func();
}
int main(int argc, char** argv) {
test_kernel<<<1, 1, 0, 0>>>();
call_func();
}
Identifying Device Architecture Features#
GPUs of different generations and architectures do not all provide the same level of hardware feature support. To guard device code that uses architecture-dependent features, the __HIP_ARCH_<FEATURE>__ C++ macros can be used.
Device Code Feature Identification#
Some CUDA code tests __CUDA_ARCH__ for a specific value to determine whether the GPU supports a certain architectural feature, depending on its compute capability. This requires knowing which __CUDA_ARCH__ value corresponds to which feature set. HIP simplifies this by replacing architecture-specific macros with feature-specific ones.
For instance,
//#if __CUDA_ARCH__ >= 130 // does not specify which feature is required; not portable
#if __HIP_ARCH_HAS_DOUBLES__ == 1 // explicitly states the required feature; portable between AMD and NVIDIA GPUs
    // device code
#endif
For host code, the __HIP_ARCH_<FEATURE>__ defines are set to 0 if hip_runtime.h is included, and are undefined otherwise. They should not be relied upon in host code.
Host Code Feature Identification#
Host code must not rely on the __HIP_ARCH_<FEATURE>__ macros, as the GPUs available to a system cannot be known at compile time, and their architectural features differ.
Host code can instead query the architecture feature flags at runtime, using hipGetDeviceProperties() or hipDeviceGetAttribute().
#include <hip/hip_runtime.h>
#include <cstdlib>
#include <iostream>

#define HIP_CHECK(expression) { \
    const hipError_t err = expression; \
    if (err != hipSuccess) { \
        std::cout << "HIP Error: " << hipGetErrorString(err) \
                  << " at line " << __LINE__ << std::endl; \
        std::exit(EXIT_FAILURE); \
    } \
}

int main(){
    int deviceCount;
    HIP_CHECK(hipGetDeviceCount(&deviceCount));

    int device = 0; // Query the first available GPU. Can be replaced with any
                    // integer up to, but not including, deviceCount.
    hipDeviceProp_t deviceProp;
    HIP_CHECK(hipGetDeviceProperties(&deviceProp, device));

    std::cout << "The queried device ";
    if (deviceProp.arch.hasSharedInt32Atomics) // portable HIP feature query
        std::cout << "supports";
    else
        std::cout << "does not support";
    std::cout << " shared int32 atomic operations" << std::endl;
}
Table of Architecture Properties#
The table below shows the full set of architectural properties that HIP supports, together with the corresponding macros and device properties.
| Macro (for device code) | Device property (host runtime query) | Comment |
|---|---|---|
| __HIP_ARCH_HAS_GLOBAL_INT32_ATOMICS__ | hasGlobalInt32Atomics | 32-bit integer atomics for global memory |
| __HIP_ARCH_HAS_GLOBAL_FLOAT_ATOMIC_EXCH__ | hasGlobalFloatAtomicExch | 32-bit float atomic exchange for global memory |
| __HIP_ARCH_HAS_SHARED_INT32_ATOMICS__ | hasSharedInt32Atomics | 32-bit integer atomics for shared memory |
| __HIP_ARCH_HAS_SHARED_FLOAT_ATOMIC_EXCH__ | hasSharedFloatAtomicExch | 32-bit float atomic exchange for shared memory |
| __HIP_ARCH_HAS_FLOAT_ATOMIC_ADD__ | hasFloatAtomicAdd | 32-bit float atomic add in global and shared memory |
| __HIP_ARCH_HAS_GLOBAL_INT64_ATOMICS__ | hasGlobalInt64Atomics | 64-bit integer atomics for global memory |
| __HIP_ARCH_HAS_SHARED_INT64_ATOMICS__ | hasSharedInt64Atomics | 64-bit integer atomics for shared memory |
| __HIP_ARCH_HAS_DOUBLES__ | hasDoubles | Double-precision floating-point operations |
| __HIP_ARCH_HAS_WARP_VOTE__ | hasWarpVote | Warp vote instructions (__any, __all) |
| __HIP_ARCH_HAS_WARP_BALLOT__ | hasWarpBallot | Warp ballot instructions |
| __HIP_ARCH_HAS_WARP_SHUFFLE__ | hasWarpShuffle | Warp shuffle operations (__shfl_*) |
| __HIP_ARCH_HAS_WARP_FUNNEL_SHIFT__ | hasFunnelShift | Funnel shift of two input words into one |
| __HIP_ARCH_HAS_THREAD_FENCE_SYSTEM__ | hasThreadFenceSystem | __threadfence_system |
| __HIP_ARCH_HAS_SYNC_THREAD_EXT__ | hasSyncThreadsExt | __syncthreads_count, __syncthreads_and, __syncthreads_or |
| __HIP_ARCH_HAS_SURFACE_FUNCS__ | hasSurfaceFuncs | Supports surface functions |
| __HIP_ARCH_HAS_3DGRID__ | has3dGrid | Grids and groups are 3D |
| __HIP_ARCH_HAS_DYNAMIC_PARALLEL__ | hasDynamicParallelism | Ability to launch a kernel from within a kernel |
Compilation#
hipcc is a portable compiler driver that calls nvcc or amdclang++ and passes the appropriate options through to the target compiler. Tools that call hipcc must ensure the compiler options are appropriate for the target compiler.
hipconfig is a helpful tool for identifying the current system’s platform, compiler and runtime. It can also help set options appropriately. For example, it can provide the path to HIP in Makefiles:
HIP_PATH ?= $(shell hipconfig --path)
HIP Headers#
The hip_runtime.h header defines all the necessary types, functions, macros, etc. needed to compile a HIP program; this includes host as well as device code. hip_runtime_api.h is a subset of hip_runtime.h.
CUDA has slightly different contents for these two files. In some cases you may need to convert hipified code to include the richer hip_runtime.h instead of hip_runtime_api.h.
Using a Standard C++ Compiler#
You can compile hip_runtime_api.h using a standard C or C++ compiler (e.g. gcc or icc). A source file that only calls HIP APIs, but neither defines nor launches any kernels, can be compiled with such a standard host compiler even when hip_runtime_api.h or hip_runtime.h are included. The HIP include paths and one of the platform macros (__HIP_PLATFORM_AMD__ or __HIP_PLATFORM_NVIDIA__) must be passed to the compiler.
hipconfig can help in finding the necessary options, for example on an AMD platform:
hipconfig --cpp_config
-D__HIP_PLATFORM_AMD__= -I/opt/rocm/include
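A full compile line for such a host-only source file on an AMD platform might then look like the following (a sketch, assuming the default /opt/rocm install location and linking against the HIP runtime library):
g++ main.cpp $(hipconfig --cpp_config) -L/opt/rocm/lib -lamdhip64 -o main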
nvcc includes some headers by default, while hipcc does not; all required files must be included explicitly. The hipify tools automatically convert cuda_runtime.h to hip_runtime.h and cuda_runtime_api.h to hip_runtime_api.h, but they may miss nested headers or macros.
warpSize#
Code should not assume a warp size of 32 or 64, as that is not portable between platforms and architectures. The warpSize built-in should be used in device code, while the host can query the warp size at runtime via the device properties. See the HIP language extension for warpSize for information on how to write portable wave-aware code.
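For example, the following sketch uses the warpSize built-in in device code and the runtime query on the host, instead of hard-coding 32 or 64 (the kernel name is illustrative; error checking omitted for brevity):
#include <hip/hip_runtime.h>
#include <iostream>

__global__ void print_lane() {
    // warpSize is a device built-in; its value depends on the architecture.
    int lane = static_cast<int>(threadIdx.x) % warpSize;
    printf("thread %d is lane %d\n", static_cast<int>(threadIdx.x), lane);
}

int main() {
    hipDeviceProp_t prop;
    hipGetDeviceProperties(&prop, 0);
    // The host queries the warp size at runtime via the device properties.
    std::cout << "warpSize: " << prop.warpSize << std::endl;
    print_lane<<<1, prop.warpSize>>>();
    hipDeviceSynchronize(); // flush device printf output
}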
Porting from CUDA __launch_bounds__#
CUDA also defines a __launch_bounds__ qualifier, which works similarly to HIP’s implementation but uses different parameters:
__launch_bounds__(MAX_THREADS_PER_BLOCK, MIN_BLOCKS_PER_MULTIPROCESSOR)
The first parameter is the same as in HIP’s implementation, but MIN_BLOCKS_PER_MULTIPROCESSOR must be converted to MIN_WARPS_PER_EXECUTION_UNIT, which counts warps and execution units rather than blocks and multiprocessors. This conversion is performed automatically by HIPIFY, or can be done manually with the following equation:
MIN_WARPS_PER_EXECUTION_UNIT = (MIN_BLOCKS_PER_MULTIPROCESSOR * MAX_THREADS_PER_BLOCK) / warpSize
Directly controlling the warps per execution unit makes it easier to reason about the occupancy, unlike with blocks, where the occupancy depends on the block size.
The use of execution units rather than multiprocessors also provides support for architectures with multiple execution units per multiprocessor. For example, the AMD GCN architecture has 4 execution units per multiprocessor.
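For example, a CUDA kernel declared with at most 256 threads per block and a minimum of 2 blocks per multiprocessor converts as follows on a device with a warp size of 64 (the kernel name is illustrative):
// CUDA: at most 256 threads per block, at least 2 blocks per multiprocessor.
// __global__ void __launch_bounds__(256, 2) my_kernel(float* data);

// HIP: (2 * 256) / 64 = 8 minimum warps per execution unit.
__global__ void __launch_bounds__(256, 8) my_kernel(float* data);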
maxrregcount#
Unlike nvcc, amdclang++ does not support the --maxrregcount option. Instead, users are encouraged to use the __launch_bounds__ directive, since its parameters are more intuitive and portable than micro-architecture details like registers. The directive allows per-kernel control.
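For example, instead of capping registers globally with a compiler flag, the kernel that needs the limit can be bounded directly (the kernel name is illustrative):
// Bounding the block size lets the compiler derive a register budget
// for this kernel only, instead of applying a global cap.
__global__ void __launch_bounds__(128) my_register_heavy_kernel(float* data);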