Kernel Language Syntax#
HIP provides a C++ syntax that is suitable for compiling most code that commonly appears in compute kernels, including classes, namespaces, operator overloading, templates and more. Additionally, it defines other language features designed specifically to target accelerators, such as the following:
A kernel-launch syntax that uses standard C++, resembles a function call and is portable to all HIP targets
Short-vector headers that can serve on a host or a device
Math functions resembling those in the “math.h” header included with standard C++ compilers
Built-in functions for accessing specific GPU hardware capabilities
This section describes the built-in variables and functions accessible from the HIP kernel. It’s intended for readers who are familiar with Cuda kernel syntax and want to understand how HIP is different.
Features are marked with one of the following keywords:
Supported—HIP supports the feature with a Cuda-equivalent function
Not supported—HIP does not support the feature
Under development—the feature is under development but not yet available
Function-Type Qualifiers#
__device__
#
Supported __device__
functions are
Executed on the device
Called from the device only
The __device__
keyword can combine with the host keyword (see __host__).
__global__
#
Supported __global__
functions are
Executed on the device
Called (“launched”) from the host
HIP __global__
functions must have a void
return type, and the first parameter to a HIP __global__
function must have the type hipLaunchParm
. See Kernel-Launch Example.
HIP lacks dynamic-parallelism support, so __global__
functions cannot be called from the device.
__host__
#
Supported __host__
functions are
Executed on the host
Called from the host
__host__
can combine with __device__
, in which case the function compiles for both the host and device. These functions cannot use the HIP grid coordinate functions (for example, “threadIdx.x”). A possible workaround is to pass the necessary coordinate info as an argument to the function.
__host__
cannot combine with __global__
.
HIP parses the __noinline__
and __forceinline__
keywords and converts them to the appropriate Clang attributes.
Calling __global__
Functions#
__global__
functions are often referred to as kernels, and calling one is termed launching the kernel. These functions require the caller to specify an “execution configuration” that includes the grid and block dimensions. The execution configuration can also include other information for the launch, such as the amount of additional shared memory to allocate and the stream where the kernel should execute. HIP introduces a standard C++ calling convention to pass the execution configuration to the kernel in addition to the Cuda <<< >>> syntax. In HIP,
Kernels launch with either <<< >>> syntax or the “hipLaunchKernelGGL” function
The first five parameters to hipLaunchKernelGGL are the following:
symbol kernelName: the name of the kernel to launch. To support template kernels which contains “,” use the HIP_KERNEL_NAME macro. The hipify tools insert this automatically.
dim3 gridDim: 3D-grid dimensions specifying the number of blocks to launch.
dim3 blockDim: 3D-block dimensions specifying the number of threads in each block.
size_t dynamicShared: amount of additional shared memory to allocate when launching the kernel (see shared)
hipStream_t: stream where the kernel should execute. A value of 0 corresponds to the NULL stream (see Synchronization Functions).
Kernel arguments follow these first five parameters
// Example pseudo code introducing hipLaunchKernelGGL:
__global__ MyKernel(hipLaunchParm lp, float *A, float *B, float *C, size_t N)
{
...
}
MyKernel<<<dim3(gridDim), dim3(groupDim), 0, 0>>> (a,b,c,n);
// Alternatively, kernel can be launched by
// hipLaunchKernelGGL(MyKernel, dim3(gridDim), dim3(groupDim), 0/*dynamicShared*/, 0/*stream), a, b, c, n);
The hipLaunchKernelGGL macro always starts with the five parameters specified above, followed by the kernel arguments. HIPIFY tools optionally convert Cuda launch syntax to hipLaunchKernelGGL, including conversion of optional arguments in <<< >>> to the five required hipLaunchKernelGGL parameters. The dim3 constructor accepts zero to three arguments and will by default initialize unspecified dimensions to 1. See dim3. The kernel uses the coordinate built-ins (thread*, block*, grid*) to determine coordinate index and coordinate bounds of the work item that’s currently executing. See Coordinate Built-Ins.
Please note, HIP does not support kernel launch with total work items defined in dimension with size gridDim x blockDim >= 2^32.
Kernel-Launch Example#
// Example showing device function, __device__ __host__
// <- compile for both device and host
float PlusOne(float x)
{
return x + 1.0;
}
__global__
void
MyKernel (hipLaunchParm lp, /*lp parm for execution configuration */
const float *a, const float *b, float *c, unsigned N)
{
unsigned gid = threadIdx.x; // <- coordinate index function
if (gid < N) {
c[gid] = a[gid] + PlusOne(b[gid]);
}
}
void callMyKernel()
{
float *a, *b, *c; // initialization not shown...
unsigned N = 1000000;
const unsigned blockSize = 256;
MyKernel<<<dim3(gridDim), dim3(groupDim), 0, 0>>> (a,b,c,n);
// Alternatively, kernel can be launched by
// hipLaunchKernelGGL(MyKernel, dim3(N/blockSize), dim3(blockSize), 0, 0, a,b,c,N);
}
Variable-Type Qualifiers#
__constant__
#
The __constant__
keyword is supported. The host writes constant memory before launching the kernel; from the GPU, this memory is read-only during kernel execution. The functions for accessing constant memory (hipGetSymbolAddress(), hipGetSymbolSize(), hipMemcpyToSymbol(), hipMemcpyToSymbolAsync(), hipMemcpyFromSymbol(), hipMemcpyFromSymbolAsync()) are available.
__managed__
#
Managed memory, including the __managed__
keyword, are supported in HIP combined host/device compilation.
__restrict__
#
The __restrict__
keyword tells the compiler that the associated memory pointer will not alias with any other pointer in the kernel or function. This feature can help the compiler generate better code. In most cases, all pointer arguments must use this keyword to realize the benefit.
Built-In Variables#
Coordinate Built-Ins#
Built-ins determine the coordinate of the active work item in the execution grid. They are defined in amd_hip_runtime.h (rather than being implicitly defined by the compiler). In HIP, built-ins coordinate variable definitions are the same as in Cuda, for instance: threadIdx.x, blockIdx.y, gridDim.y, etc. The products gridDim.x * blockDim.x, gridDim.y * blockDim.y and gridDim.z * blockDim.z are always less than 2^32.
warpSize#
The warpSize variable is of type int and contains the warp size (in threads) for the target device. Note that all current Nvidia devices return 32 for this variable, and all current AMD devices return 64. Device code should use the warpSize built-in to develop portable wave-aware code.
Vector Types#
Note that these types are defined in hip_runtime.h and are not automatically provided by the compiler.
Short Vector Types#
Short vector types derive from the basic integer and floating-point types. They are structures defined in hip_vector_types.h. The first, second, third and fourth components of the vector are accessible through the x
, y
, z
and w
fields, respectively. All the short vector types support a constructor function of the form make_<type_name>()
. For example, float4 make_float4(float x, float y, float z, float w)
creates a vector of type float4
and value (x,y,z,w)
.
HIP supports the following short vector formats:
Signed Integers:
char1, char2, char3, char4
short1, short2, short3, short4
int1, int2, int3, int4
long1, long2, long3, long4
longlong1, longlong2, longlong3, longlong4
Unsigned Integers:
uchar1, uchar2, uchar3, uchar4
ushort1, ushort2, ushort3, ushort4
uint1, uint2, uint3, uint4
ulong1, ulong2, ulong3, ulong4
ulonglong1, ulonglong2, ulonglong3, ulonglong4
Floating Points
float1, float2, float3, float4
double1, double2, double3, double4
dim3#
dim3 is a three-dimensional integer vector type commonly used to specify grid and group dimensions. Unspecified dimensions are initialized to 1.
typedef struct dim3 {
uint32_t x;
uint32_t y;
uint32_t z;
dim3(uint32_t _x=1, uint32_t _y=1, uint32_t _z=1) : x(_x), y(_y), z(_z) {};
};
Memory-Fence Instructions#
HIP supports __threadfence() and __threadfence_block().
HIP provides workaround for threadfence_system() under the HIP-Clang path. To enable the workaround, HIP should be built with environment variable HIP_COHERENT_HOST_ALLOC enabled. In addition,the kernels that use __threadfence_system() should be modified as follows:
The kernel should only operate on finegrained system memory; which should be allocated with hipHostMalloc().
Remove all memcpy for those allocated finegrained system memory regions.
Synchronization Functions#
The __syncthreads() built-in function is supported in HIP. The __syncthreads_count(int), __syncthreads_and(int) and __syncthreads_or(int) functions are under development.
Math Functions#
HIP-Clang supports a set of math operations callable from the device.
Single Precision Mathematical Functions#
Following is the list of supported single precision mathematical functions.
Function |
Supported on Host |
Supported on Device |
---|---|---|
float acosf ( float x ) |
✓ |
✓ |
float acoshf ( float x ) |
✓ |
✓ |
float asinf ( float x ) |
✓ |
✓ |
float asinhf ( float x ) |
✓ |
✓ |
float atan2f ( float y, float x ) |
✓ |
✓ |
float atanf ( float x ) |
✓ |
✓ |
float atanhf ( float x ) |
✓ |
✓ |
float cbrtf ( float x ) |
✓ |
✓ |
float ceilf ( float x ) |
✓ |
✓ |
float copysignf ( float x, float y ) |
✓ |
✓ |
float cosf ( float x ) |
✓ |
✓ |
float coshf ( float x ) |
✓ |
✓ |
float erfcf ( float x ) |
✓ |
✓ |
float erff ( float x ) |
✓ |
✓ |
float exp10f ( float x ) |
✓ |
✓ |
float exp2f ( float x ) |
✓ |
✓ |
float expf ( float x ) |
✓ |
✓ |
float expm1f ( float x ) |
✓ |
✓ |
float fabsf ( float x ) |
✓ |
✓ |
float fdimf ( float x, float y ) |
✓ |
✓ |
float floorf ( float x ) |
✓ |
✓ |
float fmaf ( float x, float y, float z ) |
✓ |
✓ |
float fmaxf ( float x, float y ) |
✓ |
✓ |
float fminf ( float x, float y ) |
✓ |
✓ |
float fmodf ( float x, float y ) |
✓ |
✓ |
float frexpf ( float x, int* nptr ) |
✓ |
✗ |
float hypotf ( float x, float y ) |
✓ |
✓ |
int ilogbf ( float x ) |
✓ |
✓ |
__RETURN_TYPE[1] isfinite ( float a ) |
✓ |
✓ |
__RETURN_TYPE[1] isinf ( float a ) |
✓ |
✓ |
__RETURN_TYPE[1] isnan ( float a ) |
✓ |
✓ |
float ldexpf ( float x, int exp ) |
✓ |
✓ |
float log10f ( float x ) |
✓ |
✓ |
float log1pf ( float x ) |
✓ |
✓ |
float logbf ( float x ) |
✓ |
✓ |
float log2f ( float x ) |
✓ |
✓ |
float logf ( float x ) |
✓ |
✓ |
float modff ( float x, float* iptr ) |
✓ |
✗ |
float nanf ( const char* tagp ) |
✗ |
✓ |
float nearbyintf ( float x ) |
✓ |
✓ |
float powf ( float x, float y ) |
✓ |
✓ |
float remainderf ( float x, float y ) |
✓ |
✓ |
float remquof ( float x, float y, int* quo ) |
✓ |
✗ |
float roundf ( float x ) |
✓ |
✓ |
float scalbnf ( float x, int n ) |
✓ |
✓ |
__RETURN_TYPE[1] signbit ( float a ) |
✓ |
✓ |
void sincosf ( float x, float* sptr, float* cptr ) |
✓ |
✗ |
float sinf ( float x ) |
✓ |
✓ |
float sinhf ( float x ) |
✓ |
✓ |
float sqrtf ( float x ) |
✓ |
✓ |
float tanf ( float x ) |
✓ |
✓ |
float tanhf ( float x ) |
✓ |
✓ |
float truncf ( float x ) |
✓ |
✓ |
float tgammaf ( float x ) |
✓ |
✓ |
float erfcinvf ( float y ) |
✓ |
✓ |
float erfcxf ( float x ) |
✓ |
✓ |
float erfinvf ( float y ) |
✓ |
✓ |
float fdividef ( float x, float y ) |
✓ |
✓ |
float frexpf ( float x, int *nptr ) |
✓ |
✓ |
float j0f ( float x ) |
✓ |
✓ |
float j1f ( float x ) |
✓ |
✓ |
float jnf ( int n, float x ) |
✓ |
✓ |
float lgammaf ( float x ) |
✓ |
✓ |
long long int llrintf ( float x ) |
✓ |
✓ |
long long int llroundf ( float x ) |
✓ |
✓ |
long int lrintf ( float x ) |
✓ |
✓ |
long int lroundf ( float x ) |
✓ |
✓ |
float modff ( float x, float *iptr ) |
✓ |
✓ |
float nextafterf ( float x, float y ) |
✓ |
✓ |
float norm3df ( float a, float b, float c ) |
✓ |
✓ |
float norm4df ( float a, float b, float c, float d ) |
✓ |
✓ |
float normcdff ( float y ) |
✓ |
✓ |
float normcdfinvf ( float y ) |
✓ |
✓ |
float normf ( int dim, const float *a ) |
✓ |
✓ |
float rcbrtf ( float x ) |
✓ |
✓ |
float remquof ( float x, float y, int *quo ) |
✓ |
✓ |
float rhypotf ( float x, float y ) |
✓ |
✓ |
float rintf ( float x ) |
✓ |
✓ |
float rnorm3df ( float a, float b, float c ) |
✓ |
✓ |
float rnorm4df ( float a, float b, float c, float d ) |
✓ |
✓ |
float rnormf ( int dim, const float *a ) |
✓ |
✓ |
float scalblnf ( float x, long int n ) |
✓ |
✓ |
void sincosf ( float x, float *sptr, float *cptr ) |
✓ |
✓ |
void sincospif ( float x, float *sptr, float *cptr ) |
✓ |
✓ |
float y0f ( float x ) |
✓ |
✓ |
float y1f ( float x ) |
✓ |
✓ |
float ynf ( int n, float x ) |
✓ |
✓ |
Double Precision Mathematical Functions#
Following is the list of supported double precision mathematical functions.
Function |
Supported on Host |
Supported on Device |
---|---|---|
double acos ( double x ) |
✓ |
✓ |
double acosh ( double x ) |
✓ |
✓ |
double asin ( double x ) |
✓ |
✓ |
double asinh ( double x ) |
✓ |
✓ |
double atan ( double x ) |
✓ |
✓ |
double atan2 ( double y, double x ) |
✓ |
✓ |
double atanh ( double x ) |
✓ |
✓ |
double cbrt ( double x ) |
✓ |
✓ |
double ceil ( double x ) |
✓ |
✓ |
double copysign ( double x, double y ) |
✓ |
✓ |
double cos ( double x ) |
✓ |
✓ |
double cosh ( double x ) |
✓ |
✓ |
double erf ( double x ) |
✓ |
✓ |
double erfc ( double x ) |
✓ |
✓ |
double exp ( double x ) |
✓ |
✓ |
double exp10 ( double x ) |
✓ |
✓ |
double exp2 ( double x ) |
✓ |
✓ |
double expm1 ( double x ) |
✓ |
✓ |
double fabs ( double x ) |
✓ |
✓ |
double fdim ( double x, double y ) |
✓ |
✓ |
double floor ( double x ) |
✓ |
✓ |
double fma ( double x, double y, double z ) |
✓ |
✓ |
double fmax ( double , double ) |
✓ |
✓ |
double fmin ( double x, double y ) |
✓ |
✓ |
double fmod ( double x, double y ) |
✓ |
✓ |
double frexp ( double x, int* nptr ) |
✓ |
✗ |
double hypot ( double x, double y ) |
✓ |
✓ |
int ilogb ( double x ) |
✓ |
✓ |
__RETURN_TYPE[1] isfinite ( double a ) |
✓ |
✓ |
__RETURN_TYPE[1] isinf ( double a ) |
✓ |
✓ |
__RETURN_TYPE[1] isnan ( double a ) |
✓ |
✓ |
double ldexp ( double x, int exp ) |
✓ |
✓ |
double log ( double x ) |
✓ |
✓ |
double log10 ( double x ) |
✓ |
✓ |
double log1p ( double x ) |
✓ |
✓ |
double log2 ( double x ) |
✓ |
✓ |
double logb ( double x ) |
✓ |
✓ |
double modf ( double x, double* iptr ) |
✓ |
✗ |
double nan ( const char* tagp ) |
✗ |
✓ |
double nearbyint ( double x ) |
✓ |
✓ |
double pow ( double x, double y ) |
✓ |
✓ |
double remainder ( double x, double y ) |
✓ |
✓ |
double remquo ( double x, double y, int* quo ) |
✓ |
✗ |
double round ( double x ) |
✓ |
✓ |
double scalbn ( double x, int n ) |
✓ |
✓ |
__RETURN_TYPE[1] signbit ( double a ) |
✓ |
✓ |
double sin ( double x ) |
✓ |
✓ |
void sincos ( double x, double* sptr, double* cptr ) |
✓ |
✗ |
double sinh ( double x ) |
✓ |
✓ |
double sqrt ( double x ) |
✓ |
✓ |
double tan ( double x ) |
✓ |
✓ |
double tanh ( double x ) |
✓ |
✓ |
double tgamma ( double x ) |
✓ |
✓ |
double trunc ( double x ) |
✓ |
✓ |
double erfcinv ( double y ) |
✓ |
✓ |
double erfcx ( double x ) |
✓ |
✓ |
double erfinv ( double y ) |
✓ |
✓ |
double frexp ( float x, int *nptr ) |
✓ |
✓ |
double j0 ( double x ) |
✓ |
✓ |
double j1 ( double x ) |
✓ |
✓ |
double jn ( int n, double x ) |
✓ |
✓ |
double lgamma ( double x ) |
✓ |
✓ |
long long int llrint ( double x ) |
✓ |
✓ |
long long int llround ( double x ) |
✓ |
✓ |
long int lrint ( double x ) |
✓ |
✓ |
long int lround ( double x ) |
✓ |
✓ |
double modf ( double x, double *iptr ) |
✓ |
✓ |
double nextafter ( double x, double y ) |
✓ |
✓ |
double norm3d ( double a, double b, double c ) |
✓ |
✓ |
float norm4d ( double a, double b, double c, double d ) |
✓ |
✓ |
double normcdf ( double y ) |
✓ |
✓ |
double normcdfinv ( double y ) |
✓ |
✓ |
double rcbrt ( double x ) |
✓ |
✓ |
double remquo ( double x, double y, int *quo ) |
✓ |
✓ |
double rhypot ( double x, double y ) |
✓ |
✓ |
double rint ( double x ) |
✓ |
✓ |
double rnorm3d ( double a, double b, double c ) |
✓ |
✓ |
double rnorm4d ( double a, double b, double c, double d ) |
✓ |
✓ |
double rnorm ( int dim, const double *a ) |
✓ |
✓ |
double scalbln ( double x, long int n ) |
✓ |
✓ |
void sincos ( double x, double *sptr, double *cptr ) |
✓ |
✓ |
void sincospi ( double x, double *sptr, double *cptr ) |
✓ |
✓ |
double y0f ( double x ) |
✓ |
✓ |
double y1 ( double x ) |
✓ |
✓ |
double yn ( int n, double x ) |
✓ |
✓ |
Integer Intrinsics#
Following is the list of supported integer intrinsics. Note that intrinsics are supported on device only.
Function |
---|
unsigned int __brev ( unsigned int x ) |
unsigned long long int __brevll ( unsigned long long int x ) |
int __clz ( int x ) |
unsigned int __clz(unsigned int x) |
int __clzll ( long long int x ) |
unsigned int __clzll(long long int x) |
unsigned int __ffs(unsigned int x) |
unsigned int __ffs(int x) |
unsigned int __ffsll(unsigned long long int x) |
unsigned int __ffsll(long long int x) |
unsigned int __popc ( unsigned int x ) |
unsigned int __popcll ( unsigned long long int x ) |
int __mul24 ( int x, int y ) |
unsigned int __umul24 ( unsigned int x, unsigned int y ) |
[^f3] |
The HIP-Clang implementation of __ffs() and __ffsll() contains code to add a constant +1 to produce the ffs result format. |
For the cases where this overhead is not acceptable and programmer is willing to specialize for the platform, |
HIP-Clang provides __lastbit_u32_u32(unsigned int input) and __lastbit_u32_u64(unsigned long long int input). |
The index returned by _lastbit instructions starts at -1, while for ffs the index starts at 0. |
Floating-point Intrinsics#
Following is the list of supported floating-point intrinsics. Note that intrinsics are supported on device only.
Function |
---|
float __cosf ( float x ) |
float __expf ( float x ) |
float __frsqrt_rn ( float x ) |
float __fsqrt_rn ( float x ) |
float __log10f ( float x ) |
float __log2f ( float x ) |
float __logf ( float x ) |
float __powf ( float x, float y ) |
float __sinf ( float x ) |
float __tanf ( float x ) |
double __dsqrt_rn ( double x ) |
Texture Functions#
The supported Texture functions are listed in header files “texture_fetch_functions.h”(https://github.com/ROCm-Developer-Tools/HIP/blob/main/include/hip/hcc_detail/texture_fetch_functions.h) and”texture_indirect_functions.h” (https://github.com/ROCm-Developer-Tools/HIP/blob/main/include/hip/hcc_detail/texture_indirect_functions.h).
Texture functions are not supported on some devices. Macro __HIP_NO_IMAGE_SUPPORT == 1 can be used to check whether texture functions are not supported in device code. Attribute hipDeviceAttributeImageSupport can be queried to check whether texture functions are supported in host runtime code.
Surface Functions#
Surface functions are not supported.
Timer Functions#
HIP provides the following built-in functions for reading a high-resolution timer from the device.
clock_t clock()
long long int clock64()
Returns the value of counter that is incremented every clock cycle on device. Difference in values returned provides the cycles used.
long long int wall_clock64()
Returns wall clock count at a constant frequency on the device, which can be queried via HIP API with hipDeviceAttributeWallClockRate attribute of the device in HIP application code, for example,
int wallClkRate = 0; //in kilohertz
HIPCHECK(hipDeviceGetAttribute(&wallClkRate, hipDeviceAttributeWallClockRate, deviceId));
Where hipDeviceAttributeWallClockRate is a device attribute. Note that, wall clock frequency is a per-device attribute.
Atomic Functions#
Atomic functions execute as read-modify-write operations residing in global or shared memory. No other device or thread can observe or modify the memory location during an atomic operation. If multiple instructions from different devices or threads target the same memory location, the instructions are serialized in an undefined order.
HIP adds new APIs with _system as suffix to support system scope atomic operations. For example, the atomicAnd
function is meant to be atomic and coherent within the GPU device executing the function. atomicAnd_system
will allow developers to extend the atomic operation to system scope, from the GPU device to other CPUs and GPU devices in the system.
HIP supports the following atomic operations.
Function |
Supported in HIP |
Supported in CUDA |
---|---|---|
int atomicAdd(int* address, int val) |
✓ |
✓ |
int atomicAdd_system(int* address, int val) |
✓ |
✓ |
unsigned int atomicAdd(unsigned int* address,unsigned int val) |
✓ |
✓ |
unsigned int atomicAdd_system(unsigned int* address, unsigned int val) |
✓ |
✓ |
unsigned long long atomicAdd(unsigned long long* address,unsigned long long val) |
✓ |
✓ |
unsigned long long atomicAdd_system(unsigned long long* address, unsigned long long val) |
✓ |
✓ |
float atomicAdd(float* address, float val) |
✓ |
✓ |
float atomicAdd_system(float* address, float val) |
✓ |
✓ |
double atomicAdd(double* address, double val) |
✓ |
✓ |
double atomicAdd_system(double* address, double val) |
✓ |
✓ |
float unsafeAtomicAdd(float* address, float val) |
✓ |
✗ |
float safeAtomicAdd(float* address, float val) |
✓ |
✗ |
double unsafeAtomicAdd(double* address, double val) |
✓ |
✗ |
double safeAtomicAdd(double* address, double val) |
✓ |
✗ |
int atomicSub(int* address, int val) |
✓ |
✓ |
int atomicSub_system(int* address, int val) |
✓ |
✓ |
unsigned int atomicSub(unsigned int* address,unsigned int val) |
✓ |
✓ |
unsigned int atomicSub_system(unsigned int* address, unsigned int val) |
✓ |
✓ |
int atomicExch(int* address, int val) |
✓ |
✓ |
int atomicExch_system(int* address, int val) |
✓ |
✓ |
unsigned int atomicExch(unsigned int* address,unsigned int val) |
✓ |
✓ |
unsigned int atomicExch_system(unsigned int* address, unsigned int val) |
✓ |
✓ |
unsigned long long atomicExch(unsigned long long int* address,unsigned long long int val) |
✓ |
✓ |
unsigned long long atomicExch_system(unsigned long long* address, unsigned long long val) |
✓ |
✓ |
unsigned long long atomicExch_system(unsigned long long* address, unsigned long long val) |
✓ |
✓ |
float atomicExch(float* address, float val) |
✓ |
✓ |
int atomicMin(int* address, int val) |
✓ |
✓ |
int atomicMin_system(int* address, int val) |
✓ |
✓ |
unsigned int atomicMin(unsigned int* address,unsigned int val) |
✓ |
✓ |
unsigned int atomicMin_system(unsigned int* address, unsigned int val) |
✓ |
✓ |
unsigned long long atomicMin(unsigned long long* address,unsigned long long val) |
✓ |
✓ |
int atomicMax(int* address, int val) |
✓ |
✓ |
int atomicMax_system(int* address, int val) |
✓ |
✓ |
unsigned int atomicMax(unsigned int* address,unsigned int val) |
✓ |
✓ |
unsigned int atomicMax_system(unsigned int* address, unsigned int val) |
✓ |
✓ |
unsigned long long atomicMax(unsigned long long* address,unsigned long long val) |
✓ |
✓ |
unsigned int atomicInc(unsigned int* address) |
✗ |
✓ |
unsigned int atomicDec(unsigned int* address) |
✗ |
✓ |
int atomicCAS(int* address, int compare, int val) |
✓ |
✓ |
int atomicCAS_system(int* address, int compare, int val) |
✓ |
✓ |
unsigned int atomicCAS(unsigned int* address,unsigned int compare,unsigned int val) |
✓ |
✓ |
unsigned int atomicCAS_system(unsigned int* address, unsigned int compare, unsigned int val) |
✓ |
✓ |
unsigned long long atomicCAS(unsigned long long* address,unsigned long long compare,unsigned long long val) |
✓ |
✓ |
unsigned long long atomicCAS_system(unsigned long long* address, unsigned long long compare, unsigned long long val) |
✓ |
✓ |
int atomicAnd(int* address, int val) |
✓ |
✓ |
int atomicAnd_system(int* address, int val) |
✓ |
✓ |
unsigned int atomicAnd(unsigned int* address,unsigned int val) |
✓ |
✓ |
unsigned int atomicAnd_system(unsigned int* address, unsigned int val) |
✓ |
✓ |
unsigned long long atomicAnd(unsigned long long* address,unsigned long long val) |
✓ |
✓ |
unsigned long long atomicAnd_system(unsigned long long* address, unsigned long long val) |
✓ |
✓ |
int atomicOr(int* address, int val) |
✓ |
✓ |
int atomicOr_system(int* address, int val) |
✓ |
✓ |
unsigned int atomicOr(unsigned int* address,unsigned int val) |
✓ |
✓ |
unsigned int atomicOr_system(unsigned int* address, unsigned int val) |
✓ |
✓ |
unsigned int atomicOr_system(unsigned int* address, unsigned int val) |
✓ |
✓ |
unsigned long long atomicOr(unsigned long long int* address,unsigned long long val) |
✓ |
✓ |
unsigned long long atomicOr_system(unsigned long long* address, unsigned long long val) |
✓ |
✓ |
int atomicXor(int* address, int val) |
✓ |
✓ |
int atomicXor_system(int* address, int val) |
✓ |
✓ |
unsigned int atomicXor(unsigned int* address,unsigned int val) |
✓ |
✓ |
unsigned int atomicXor_system(unsigned int* address, unsigned int val) |
✓ |
✓ |
unsigned long long atomicXor(unsigned long long* address,unsigned long long val)) |
✓ |
✓ |
unsigned long long atomicXor_system(unsigned long long* address, unsigned long long val) |
✓ |
✓ |
Unsafe Floating-Point Atomic RMW Operations#
Some HIP devices support fast atomic read-modify-write (RMW) operations on floating-point values.
For example, atomicAdd
on single- or double-precision floating-point values may generate a hardware RMW instruction that is faster than emulating the atomic operation using an atomic compare-and-swap (CAS) loop.
On some devices, these fast atomic RMW instructions can produce different results when compared with the same functions implemented with atomic CAS loops. For example, some devices will produce incorrect answers if a fast atomic floating-point RMW instruction targets fine-grained memory allocations. As another example, some devices will use different rounding or denormal modes when using fast atomic floating-point RMW instructions.
As such, the HIP-Clang compiler offers a compile-time option for users to choose whether their code will use the fast, potentially unsafe, atomic instructions.
On devices that support these fast, but unsafe, floating-point atomic RMW instructions, the compiler option -munsafe-fp-atomics
will allow the compiler to generate them when it sees appropriate atomic RMW function calls.
By passing the -munsafe-fp-atomics
flag to the compiler, the user is indicating that all floating-point atomic function calls are allowed to use an unsafe version if one exists.
For instance, on some devices, this flag indicates to the compiler that that no floating-point atomicAdd
function targets fine-grained memory.
If the user instead compiles with -mno-unsafe-fp-atomics
, the user is telling the compiler to never use a floating-point atomic RMW that may not be safe.
The compiler will default to not producing unsafe floating-point atomic RMW instructions, so the -mno-unsafe-fp-atomics
compilation option is not strictly necessary.
Explicitly passing this flag to the compiler is good practice, however.
Whenever either of the two options described above, -munsafe-fp-atomics
and -mno-unsafe-fp-atomics
are passed to the compiler’s command line, they are applied globally for that entire compilation.
If only a subset of the atomic RMW function calls could safely use the faster floating-point atomic RMW instructions, the developer would instead need to compile with -mno-unsafe-fp-atomics
in order to ensure the remaining atomic RMW function calls produce correct results.
Towards this end, HIP has four extra functions to help developers more precisely control which floating-point atomic RMW functions produce unsafe atomic RMW instructions:
float unsafeAtomicAdd(float* address, float val)
double unsafeAtomicAdd(double* address, double val)
These functions will always produce fast atomic RMW instructions on devices that have them, even when
-mno-unsafe-fp-atomics
is set
float safeAtomicAdd(float* address, float val)
double safeAtomicAdd(double* address, double val)
These functions will always produce safe atomic RMW operations, even when
-munsafe-fp-atomics
is set
Warp Cross-Lane Functions#
Warp cross-lane functions operate across all lanes in a warp. The hardware guarantees that all warp lanes will execute in lockstep, so additional synchronization is unnecessary, and the instructions use no shared memory.
Note that Nvidia and AMD devices have different warp sizes, so portable code should use the warpSize built-ins to query the warp size. Hipified code from the Cuda path requires careful review to ensure it doesn’t assume a waveSize of 32. “Wave-aware” code that assumes a waveSize of 32 will run on a wave-64 machine, but it will utilize only half of the machine resources. WarpSize built-ins should only be used in device functions and its value depends on GPU arch. Users should not assume warpSize to be a compile-time constant. Host functions should use hipGetDeviceProperties to get the default warp size of a GPU device:
cudaDeviceProp props;
cudaGetDeviceProperties(&props, deviceID);
int w = props.warpSize;
// implement portable algorithm based on w (rather than assume 32 or 64)
Note that assembly kernels may be built for a warp size which is different than the default warp size.
Warp Vote and Ballot Functions#
int __all(int predicate)
int __any(int predicate)
uint64_t __ballot(int predicate)
Threads in a warp are referred to as lanes and are numbered from 0 to warpSize – 1. For these functions, each warp lane contributes 1 – the bit value (the predicate), which is efficiently broadcast to all lanes in the warp. The 32-bit int predicate from each lane reduces to a 1-bit value: 0 (predicate = 0) or 1 (predicate != 0). __any
and __all
provide a summary view of the predicates that the other warp lanes contribute:
__any()
returns 1 if any warp lane contributes a nonzero predicate, or 0 otherwise__all()
returns 1 if all other warp lanes contribute nonzero predicates, or 0 otherwise
Applications can test whether the target platform supports the any/all instruction using the hasWarpVote
device property or the HIP_ARCH_HAS_WARP_VOTE compiler define.
__ballot
provides a bit mask containing the 1-bit predicate value from each lane. The nth bit of the result contains the 1 bit contributed by the nth warp lane. Note that HIP’s __ballot
function supports a 64-bit return value (compared with Cuda’s 32 bits). Code ported from Cuda should support the larger warp sizes that the HIP version of this instruction supports. Applications can test whether the target platform supports the ballot instruction using the hasWarpBallot
device property or the HIP_ARCH_HAS_WARP_BALLOT compiler define.
Warp Shuffle Functions#
Half-float shuffles are not supported. The default width is warpSize—see Warp Cross-Lane Functions. Applications should not assume the warpSize is 32 or 64.
int __shfl (int var, int srcLane, int width=warpSize);
float __shfl (float var, int srcLane, int width=warpSize);
int __shfl_up (int var, unsigned int delta, int width=warpSize);
float __shfl_up (float var, unsigned int delta, int width=warpSize);
int __shfl_down (int var, unsigned int delta, int width=warpSize);
float __shfl_down (float var, unsigned int delta, int width=warpSize);
int __shfl_xor (int var, int laneMask, int width=warpSize);
float __shfl_xor (float var, int laneMask, int width=warpSize);
Cooperative Groups Functions#
Cooperative groups is a mechanism for forming and communicating between groups of threads at a granularity different than the block. This feature was introduced in Cuda 9.
HIP supports the following kernel language cooperative groups types or functions.
Function |
Supported in HIP |
Supported in CUDA |
---|---|---|
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
Warp Matrix Functions#
Warp matrix functions allow a warp to cooperatively operate on small matrices whose elements are spread over the lanes in an unspecified manner. This feature was introduced in Cuda 9.
HIP does not support any of the kernel language warp matrix types or functions.
Function |
Supported in HIP |
Supported in CUDA |
---|---|---|
|
✓ |
|
|
✓ |
|
|
✓ |
|
|
✓ |
|
|
✓ |
Independent Thread Scheduling#
The hardware support for independent thread scheduling introduced in certain architectures supporting Cuda allows threads to progress independently of each other and enables intra-warp synchronizations that were previously not allowed.
HIP does not support this type of scheduling.
Profiler Counter Function#
The Cuda __prof_trigger()
instruction is not supported.
Assert#
The assert function is supported in HIP. Assert function is used for debugging purpose, when the input expression equals to zero, the execution will be stopped.
void assert(int input)
There are two kinds of implementations for assert functions depending on the use sceneries,
One is for the host version of assert, which is defined in assert.h,
Another is the device version of assert, which is implemented in hip/hip_runtime.h. Users need to include assert.h to use assert. For assert to work in both device and host functions, users need to include “hip/hip_runtime.h”.
Printf#
Printf function is supported in HIP. The following is a simple example to print information in the kernel.
#include <hip/hip_runtime.h>
__global__ void run_printf() { printf("Hello World\n"); }
int main() {
run_printf<<<dim3(1), dim3(1), 0, 0>>>();
}
Device-Side Dynamic Global Memory Allocation#
Device-side dynamic global memory allocation is under development. HIP now includes a preliminary implementation of malloc and free that can be called from device functions.
__launch_bounds__
#
GPU multiprocessors have a fixed pool of resources (primarily registers and shared memory) which are shared by the actively running warps. Using more resources can increase IPC of the kernel but reduces the resources available for other warps and limits the number of warps that can be simulaneously running. Thus GPUs have a complex relationship between resource usage and performance.
launch_bounds allows the application to provide usage hints that influence the resources (primarily registers) used by the generated code. It is a function attribute that must be attached to a global function:
__global__ void `__launch_bounds__`(MAX_THREADS_PER_BLOCK, MIN_WARPS_PER_EXECUTION_UNIT)
MyKernel(hipGridLaunch lp, ...)
...
launch_bounds supports two parameters:
MAX_THREADS_PER_BLOCK - The programmers guarantees that kernel will be launched with threads less than MAX_THREADS_PER_BLOCK. (On NVCC this maps to the .maxntid PTX directive). If no launch_bounds is specified, MAX_THREADS_PER_BLOCK is the maximum block size supported by the device (typically 1024 or larger). Specifying MAX_THREADS_PER_BLOCK less than the maximum effectively allows the compiler to use more resources than a default unconstrained compilation that supports all possible block sizes at launch time. The threads-per-block is the product of (blockDim.x * blockDim.y * blockDim.z).
MIN_WARPS_PER_EXECUTION_UNIT - directs the compiler to minimize resource usage so that the requested number of warps can be simultaneously active on a multi-processor. Since active warps compete for the same fixed pool of resources, the compiler must reduce resources required by each warp(primarily registers). MIN_WARPS_PER_EXECUTION_UNIT is optional and defaults to 1 if not specified. Specifying a MIN_WARPS_PER_EXECUTION_UNIT greater than the default 1 effectively constrains the compiler’s resource usage.
When launch kernel with HIP APIs, for example, hipModuleLaunchKernel(), HIP will do validation to make sure input kernel dimension size is not larger than specified launch_bounds. In case exceeded, HIP would return launch failure, if AMD_LOG_LEVEL is set with proper value (for details, please refer to docs/markdown/hip_logging.md), detail information will be shown in the error log message, including launch parameters of kernel dim size, launch bounds, and the name of the faulting kernel. It’s helpful to figure out which is the faulting kernel, besides, the kernel dim size and launch bounds values will also assist in debugging such failures.
Compiler Impact#
The compiler uses these parameters as follows:
The compiler uses the hints only to manage register usage, and does not automatically reduce shared memory or other resources.
Compilation fails if compiler cannot generate a kernel which meets the requirements of the specified launch bounds.
From MAX_THREADS_PER_BLOCK, the compiler derives the maximum number of warps/block that can be used at launch time. Values of MAX_THREADS_PER_BLOCK less than the default allows the compiler to use a larger pool of registers : each warp uses registers, and this hint constains the launch to a warps/block size which is less than maximum.
From MIN_WARPS_PER_EXECUTION_UNIT, the compiler derives a maximum number of registers that can be used by the kernel (to meet the required #simultaneous active blocks). If MIN_WARPS_PER_EXECUTION_UNIT is 1, then the kernel can use all registers supported by the multiprocessor.
The compiler ensures that the registers used in the kernel is less than both allowed maximums, typically by spilling registers (to shared or global memory), or by using more instructions.
The compiler may use hueristics to increase register usage, or may simply be able to avoid spilling. The MAX_THREADS_PER_BLOCK is particularly useful in this cases, since it allows the compiler to use more registers and avoid situations where the compiler constrains the register usage (potentially spilling) to meet the requirements of a large block size that is never used at launch time.
CU and EU Definitions#
A compute unit (CU) is responsible for executing the waves of a work-group. It is composed of one or more execution units (EU) which are responsible for executing waves. An EU can have enough resources to maintain the state of more than one executing wave. This allows an EU to hide latency by switching between waves in a similar way to symmetric multithreading on a CPU. In order to allow the state for multiple waves to fit on an EU, the resources used by a single wave have to be limited. Limiting such resources can allow greater latency hiding, but can result in having to spill some register state to memory. This attribute allows an advanced developer to tune the number of waves that are capable of fitting within the resources of an EU. It can be used to ensure at least a certain number will fit to help hide latency, and can also be used to ensure no more than a certain number will fit to limit cache thrashing.
Porting from CUDA __launch_bounds#
CUDA defines a __launch_bounds which is also designed to control occupancy:
__launch_bounds(MAX_THREADS_PER_BLOCK, MIN_BLOCKS_PER_MULTIPROCESSOR)
The second parameter __launch_bounds parameters must be converted to the format used __hip_launch_bounds, which uses warps and execution-units rather than blocks and multi-processors (this conversion is performed automatically by hipify tools).
MIN_WARPS_PER_EXECUTION_UNIT = (MIN_BLOCKS_PER_MULTIPROCESSOR * MAX_THREADS_PER_BLOCK) / 32
The key differences in the interface are:
Warps (rather than blocks): The developer is trying to tell the compiler to control resource utilization to guarantee some amount of active Warps/EU for latency hiding. Specifying active warps in terms of blocks appears to hide the micro-architectural details of the warp size, but makes the interface more confusing since the developer ultimately needs to compute the number of warps to obtain the desired level of control.
Execution Units (rather than multiProcessor): The use of execution units rather than multiprocessors provides support for architectures with multiple execution units/multi-processor. For example, the AMD GCN architecture has 4 execution units per multiProcessor. The hipDeviceProps has a field executionUnitsPerMultiprocessor. Platform-specific coding techniques such as #ifdef can be used to specify different launch_bounds for NVCC and HIP-Clang platforms, if desired.
maxregcount#
Unlike nvcc, HIP-Clang does not support the “–maxregcount” option. Instead, users are encouraged to use the hip_launch_bounds directive since the parameters are more intuitive and portable than micro-architecture details like registers, and also the directive allows per-kernel control rather than an entire file. hip_launch_bounds works on both HIP-Clang and nvcc targets.
Register Keyword#
The register keyword is deprecated in C++, and is silently ignored by both nvcc and HIP-Clang. You can pass the option -Wdeprecated-register
the compiler warning message.
Pragma Unroll#
Unroll with a bounds that is known at compile-time is supported. For example:
#pragma unroll 16 /* hint to compiler to unroll next loop by 16 */
for (int i=0; i<16; i++) ...
#pragma unroll 1 /* tell compiler to never unroll the loop */
for (int i=0; i<16; i++) ...
#pragma unroll /* hint to compiler to completely unroll next loop. */
for (int i=0; i<16; i++) ...
In-Line Assembly#
GCN ISA In-line assembly, is supported. For example:
asm volatile ("v_mac_f32_e32 %0, %2, %3" : "=v" (out[i]) : "0"(out[i]), "v" (a), "v" (in[i]));
We insert the GCN isa into the kernel using asm()
Assembler statement.
volatile
keyword is used so that the optimizers must not change the number of volatile operations or change their order of execution relative to other volatile operations.
v_mac_f32_e32
is the GCN instruction, for more information please refer - AMD GCN3 ISA architecture manual
Index for the respective operand in the ordered fashion is provided by %
followed by position in the list of operands
"v"
is the constraint code (for target-specific AMDGPU) for 32-bit VGPR register, for more info please refer - Supported Constraint Code List for AMDGPU
Output Constraints are specified by an "="
prefix as shown above (“=v”). This indicate that assemby will write to this operand, and the operand will then be made available as a return value of the asm expression. Input constraints do not have a prefix - just the constraint code. The constraint string of "0"
says to use the assigned register for output as an input as well (it being the 0’th constraint).
C++ Support#
The following C++ features are not supported:
Run-time-type information (RTTI)
Try/catch
Virtual functions Virtual functions are not supported if objects containing virtual function tables are passed between GPU’s of different offload arch’s, e.g. between gfx906 and gfx1030. Otherwise virtual functions are supported.
Kernel Compilation#
hipcc now supports compiling C++/HIP kernels to binary code objects.
The file format for binary is .co
which means Code Object. The following command builds the code object using hipcc
.
hipcc --genco --offload-arch=[TARGET GPU] [INPUT FILE] -o [OUTPUT FILE]
[TARGET GPU] = GPU architecture
[INPUT FILE] = Name of the file containing kernels
[OUTPUT FILE] = Name of the generated code object file
Note: When using binary code objects is that the number of arguments to the kernel is different on HIP-Clang and NVCC path. Refer to the sample in samples/0_Intro/module_api for differences in the arguments to be passed to the kernel.
gfx-arch-specific-kernel#
Clang defined ‘gfx*’ macros can be used to execute gfx arch specific codes inside the kernel. Refer to the sample 14_gpu_arch in samples/2_Cookbook.