C++ language extensions

Contents

C++ language extensions#

HIP provides a C++ syntax that is suitable for compiling most code that commonly appears in compute kernels (classes, namespaces, operator overloading, and templates). HIP also defines other language features that are designed to target accelerators, such as:

  • A kernel-launch syntax that uses standard C++ (this resembles a function call and is portable to all HIP targets)

  • Short-vector headers that can serve on a host or device

  • Math functions that resemble those in math.h, which is included with standard C++ compilers

  • Built-in functions for accessing specific GPU hardware capabilities

Note

This chapter describes the built-in variables and functions that are accessible from the HIP kernel. It’s intended for users who are familiar with CUDA kernel syntax and want to learn how HIP differs from CUDA.

Features are labeled with one of the following keywords:

  • Supported: HIP supports the feature with a CUDA-equivalent function

  • Not supported: HIP does not support the feature

  • Under development: The feature is under development and not yet available

Function-type qualifiers#

__device__#

Supported __device__ functions are:

  • Run on the device

  • Called from the device only

You can combine __device__ with the host keyword (__host__).

__global__#

Supported __global__ functions are:

  • Run on the device

  • Called (launched) from the host

HIP __global__ functions must have a void return type.

HIP doesn’t support dynamic-parallelism, which means that you can’t call __global__ functions from the device.

__host__#

Supported __host__ functions are:

  • Run on the host

  • Called from the host

You can combine __host__ with __device__; in this case, the function compiles for the host and the device. Note that these functions can’t use the HIP grid coordinate functions (e.g., threadIdx.x). If you need to use HIP grid coordinate functions, you can pass the necessary coordinate information as an argument.

You can’t combine __host__ with __global__.

HIP parses the __noinline__ and __forceinline__ keywords and converts them into the appropriate Clang attributes.

Calling __global__ functions#

__global__ functions are often referred to as kernels. When you call a global function, you’re launching a kernel. When launching a kernel, you must specify an execution configuration that includes the grid and block dimensions. The execution configuration can also include other information for the launch, such as the amount of additional shared memory to allocate and the stream where you want to execute the kernel.

HIP introduces a standard C++ calling convention (hipLaunchKernelGGL) to pass the run configuration to the kernel. However, you can also use the CUDA <<< >>> syntax.

When using hipLaunchKernelGGL, your first five parameters must be:

  • symbol kernelName: The name of the kernel you want to launch. To support template kernels that contain ",", use the HIP_KERNEL_NAME macro (HIPIFY tools insert this automatically).

  • dim3 gridDim: 3D-grid dimensions that specify the number of blocks to launch.

  • dim3 blockDim: 3D-block dimensions that specify the number of threads in each block.

  • size_t dynamicShared: The amount of additional shared memory that you want to allocate when launching the kernel (see __shared__).

  • hipStream_t: The stream where you want to run the kernel. A value of 0 corresponds to the NULL stream (see Synchronization functions).

You can include your kernel arguments after these parameters.

// Example hipLaunchKernelGGL pseudocode:
__global__ MyKernel(float *A, float *B, float *C, size_t N)
{
...
}

MyKernel<<<dim3(gridDim), dim3(groupDim), 0, 0>>> (a,b,c,n);

// Alternatively, you can launch the kernel using:
// hipLaunchKernelGGL(MyKernel, dim3(gridDim), dim3(groupDim), 0/*dynamicShared*/, 0/*stream), a, b, c, n);

You can use HIPIFY tools to convert CUDA launch syntax to hipLaunchKernelGGL. This includes the conversion of optional <<< >>> arguments into the five required hipLaunchKernelGGL parameters.

Note

HIP doesn’t support dimension sizes of \(gridDim * blockDim \ge 2^{32}\) when launching a kernel.

Kernel launch example#

// Example showing device function, __device__ __host__
// <- compile for both device and host
float PlusOne(float x)
{
  return x + 1.0;
}

__global__
void
MyKernel (hipLaunchParm lp, /*lp parm for execution configuration */
          const float *a, const float *b, float *c, unsigned N)
{
  unsigned gid = threadIdx.x; // <- coordinate index function
  if (gid < N) {
    c[gid] = a[gid] + PlusOne(b[gid]);
  }
}
void callMyKernel()
{
  float *a, *b, *c; // initialization not shown...
  unsigned N = 1000000;
  const unsigned blockSize = 256;

  MyKernel<<<dim3(gridDim), dim3(groupDim), 0, 0>>> (a,b,c,n);
  // Alternatively, kernel can be launched by
  // hipLaunchKernelGGL(MyKernel, dim3(N/blockSize), dim3(blockSize), 0, 0, a,b,c,N);
}

Variable type qualifiers#

__constant__#

The host writes constant memory before launching the kernel. This memory is read-only from the GPU while the kernel is running. The functions for accessing constant memory are:

  • hipGetSymbolAddress()

  • hipGetSymbolSize()

  • hipMemcpyToSymbol()

  • hipMemcpyToSymbolAsync()

  • hipMemcpyFromSymbol()

  • hipMemcpyFromSymbolAsync()

__shared__#

To allow the host to dynamically allocate shared memory, you can specify extern __shared__ as a launch parameter.

Note

Prior to the HIP-Clang compiler, dynamic shared memory had to be declared using the HIP_DYNAMIC_SHARED macro in order to ensure accuracy. This is because using static shared memory in the same kernel could’ve resulted in overlapping memory ranges and data-races. The HIP-Clang compiler provides support for extern __shared_ declarations, so HIP_DYNAMIC_SHARED is no longer required.

__managed__#

Managed memory, including the __managed__ keyword, is supported in HIP combined host/device compilation.

__restrict__#

__restrict__ tells the compiler that the associated memory pointer not to alias with any other pointer in the kernel or function. This can help the compiler generate better code. In most use cases, every pointer argument should use this keyword in order to achieve the benefit.

Built-in variables#

Coordinate built-ins#

The kernel uses coordinate built-ins (thread*, block*, grid*) to determine the coordinate index and bounds for the active work item.

Built-ins are defined in amd_hip_runtime.h, rather than being implicitly defined by the compiler.

Coordinate variable definitions for built-ins are the same for HIP and CUDA. For example: threadIdx.x, blockIdx.y, and gridDim.y. The products gridDim.x * blockDim.x, gridDim.y * blockDim.y, and gridDim.z * blockDim.z are always less than 2^32.

Coordinate built-ins are implemented as structures for improved performance. When used with printf, they must be explicitly cast to integer types.

warpSize#

The warpSize variable type is int. It contains the warp size (in threads) for the target device. warpSize should only be used in device functions that develop portable wave-aware code.

Note

NVIDIA devices return 32 for this variable; AMD devices return 64 for gfx9 and 32 for gfx10 and above.

Vector types#

The following vector types are defined in hip_runtime.h. They are not automatically provided by the compiler.

Short vector types#

Short vector types derive from basic integer and floating-point types. These structures are defined in hip_vector_types.h. The first, second, third, and fourth components of the vector are defined by the x, y, z, and w fields, respectively. All short vector types support a constructor function of the form make_<type_name>(). For example, float4 make_float4(float x, float y, float z, float w) creates a vector with type float4 and value (x,y,z,w).

HIP supports the following short vector formats:

  • Signed Integers:

    • char1, char2, char3, char4

    • short1, short2, short3, short4

    • int1, int2, int3, int4

    • long1, long2, long3, long4

    • longlong1, longlong2, longlong3, longlong4

  • Unsigned Integers:

    • uchar1, uchar2, uchar3, uchar4

    • ushort1, ushort2, ushort3, ushort4

    • uint1, uint2, uint3, uint4

    • ulong1, ulong2, ulong3, ulong4

    • ulonglong1, ulonglong2, ulonglong3, ulonglong4

  • Floating Points:

    • float1, float2, float3, float4

    • double1, double2, double3, double4

dim3#

dim3 is a three-dimensional integer vector type that is commonly used to specify grid and group dimensions.

The dim3 constructor accepts between zero and three arguments. By default, it initializes unspecified dimensions to 1.

typedef struct dim3 {
  uint32_t x;
  uint32_t y;
  uint32_t z;

  dim3(uint32_t _x=1, uint32_t _y=1, uint32_t _z=1) : x(_x), y(_y), z(_z) {};
};

Memory fence instructions#

HIP supports __threadfence() and __threadfence_block(). If you’re using threadfence_system() in the HIP-Clang path, you can use the following workaround:

  1. Build HIP with the HIP_COHERENT_HOST_ALLOC environment variable enabled.

  2. Modify kernels that use __threadfence_system() as follows:

  • Ensure the kernel operates only on fine-grained system memory, which should be allocated with hipHostMalloc().

  • Remove memcpy for all allocated fine-grained system memory regions.

Synchronization functions#

Synchronization functions causes all threads in the group to wait at this synchronization point, and for all shared and global memory accesses by the threads to complete, before running synchronization. This guarantees the visibility of accessed data for all threads in the group.

The __syncthreads() built-in function is supported in HIP. The __syncthreads_count(int), __syncthreads_and(int), and __syncthreads_or(int) functions are under development.

The Cooperative Groups API offer options to do synchronization on a developer defined set of thread groups. For further information, check Cooperative Groups API or Cooperative Groups how to.

Math functions#

HIP-Clang supports a set of math operations that are callable from the device. HIP supports most of the device functions supported by CUDA. These are described on Math API page.

Texture functions#

The supported texture functions are listed in texture_fetch_functions.h and texture_indirect_functions.h header files in the HIP-AMD backend repository.

Texture functions are not supported on some devices. To determine if texture functions are supported on your device, use Macro __HIP_NO_IMAGE_SUPPORT == 1. You can query the attribute hipDeviceAttributeImageSupport to check if texture functions are supported in the host runtime code.

Surface functions#

The following surface functions are supported in HIP:

hipError_t hipCreateSurfaceObject(hipSurfaceObject_t *pSurfObject, const hipResourceDesc *pResDesc)#

Create a surface object.

Parameters:
  • pSurfObject[out] Pointer of surface object to be created.

  • pResDesc[in] Pointer of suface object descriptor.

Returns:

hipSuccess, hipErrorInvalidValue

hipError_t hipDestroySurfaceObject(hipSurfaceObject_t surfaceObject)#

Destroy a surface object.

Parameters:

surfaceObject[in] Surface object to be destroyed.

Returns:

hipSuccess, hipErrorInvalidValue

template<typename T, typename std::enable_if<__hip_is_tex_surf_channel_type<T>::value>::type* = nullptr>
static void surf1Dread(T *data, hipSurfaceObject_t surfObj, int x, int boundaryMode = hipBoundaryModeZero)#

Reads the value at coordinate x from the one-dimensional surface.

Template Parameters:

T – The data type of the surface.

Parameters:
  • data – [out] The T type result is stored in this pointer.

  • surfObj – [in] The surface descriptor.

  • x – [in] The coordinate where the value will be read out.

  • boundaryMode – [in] The boundary mode is currently ignored.

template<typename T, typename std::enable_if<__hip_is_tex_surf_channel_type<T>::value>::type* = nullptr>
static void surf1Dwrite(T data, hipSurfaceObject_t surfObj, int x)#

Writes the value data to the one-dimensional surface at coordinate x.

Template Parameters:

T – The data type of the surface.

Parameters:
  • data – [in] The T type value is written to surface.

  • surfObj – [in] The surface descriptor.

  • x – [in] The coordinate where the data will be written.

template<typename T, typename std::enable_if<__hip_is_tex_surf_channel_type<T>::value>::type* = nullptr>
static void surf2Dread(T *data, hipSurfaceObject_t surfObj, int x, int y)#

Reads the value from the two-dimensional surface at coordinate x, y.

Template Parameters:

T – The data type of the surface.

Parameters:
  • data – [out] The T type result is stored in this pointer.

  • surfObj – [in] The surface descriptor.

  • x – [in] The x coordinate where the value will be read out.

  • y – [in] The y coordinate where the value will be read out.

template<typename T, typename std::enable_if<__hip_is_tex_surf_channel_type<T>::value>::type* = nullptr>
static void surf2Dwrite(T data, hipSurfaceObject_t surfObj, int x, int y)#

Writes the value data to the two-dimensional surface at coordinate x, y.

Template Parameters:

T – The data type of the surface.

Parameters:
  • data – [in] The T type value is written to surface.

  • surfObj – [in] The surface descriptor.

  • x – [in] The x coordinate where the data will be written.

  • y – [in] The y coordinate where the data will be written.

template<typename T, typename std::enable_if<__hip_is_tex_surf_channel_type<T>::value>::type* = nullptr>
static void surf3Dread(T *data, hipSurfaceObject_t surfObj, int x, int y, int z)#

Reads the value from the three-dimensional surface at coordinate x, y, z.

Template Parameters:

T – The data type of the surface.

Parameters:
  • data – [out] The T type result is stored in this pointer.

  • surfObj – [in] The surface descriptor.

  • x – [in] The x coordinate where the value will be read out.

  • y – [in] The y coordinate where the value will be read out.

  • z – [in] The z coordinate where the value will be read out.

template<typename T, typename std::enable_if<__hip_is_tex_surf_channel_type<T>::value>::type* = nullptr>
static void surf3Dwrite(T data, hipSurfaceObject_t surfObj, int x, int y, int z)#

Writes the value data to the three-dimensional surface at coordinate x, y, z.

Template Parameters:

T – The data type of the surface.

Parameters:
  • data – [in] The T type value is written to surface.

  • surfObj – [in] The surface descriptor.

  • x – [in] The x coordinate where the data will be written.

  • y – [in] The y coordinate where the data will be written.

  • z – [in] The z coordinate where the data will be written.

template<typename T, typename std::enable_if<__hip_is_tex_surf_channel_type<T>::value>::type* = nullptr>
static void surf1DLayeredread(T *data, hipSurfaceObject_t surfObj, int x, int layer)#

Reads the value from the one-dimensional layered surface at coordinate x and layer index.

Template Parameters:

T – The data type of the surface.

Parameters:
  • data – [out] The T type result is stored in this pointer.

  • surfObj – [in] The surface descriptor.

  • x – [in] The coordinate where the value will be read out.

  • layer – [in] The layer index where the value will be read out.

template<typename T, typename std::enable_if<__hip_is_tex_surf_channel_type<T>::value>::type* = nullptr>
static void surf1DLayeredwrite(T data, hipSurfaceObject_t surfObj, int x, int layer)#

Writes the value data to the one-dimensional layered surface at coordinate x and layer index.

Template Parameters:

T – The data type of the surface.

Parameters:
  • data – [in] The T type value is written to surface.

  • surfObj – [in] The surface descriptor.

  • x – [in] The x coordinate where the data will be written.

  • layer – [in] The layer index where the data will be written.

template<typename T, typename std::enable_if<__hip_is_tex_surf_channel_type<T>::value>::type* = nullptr>
static void surf2DLayeredread(T *data, hipSurfaceObject_t surfObj, int x, int y, int layer)#

Reads the value from the two-dimensional layered surface at coordinate x, y and layer index.

Template Parameters:

T – The data type of the surface.

Parameters:
  • data – [out] The T type result is stored in this pointer.

  • surfObj – [in] The surface descriptor.

  • x – [in] The x coordinate where the value will be read out.

  • y – [in] The y coordinate where the value will be read out.

  • layer – [in] The layer index where the value will be read out.

template<typename T, typename std::enable_if<__hip_is_tex_surf_channel_type<T>::value>::type* = nullptr>
static void surf2DLayeredwrite(T data, hipSurfaceObject_t surfObj, int x, int y, int layer)#

Writes the value data to the two-dimensional layered surface at coordinate x, y and layer index.

Template Parameters:

T – The data type of the surface.

Parameters:
  • data – [in] The T type value is written to surface.

  • surfObj – [in] The surface descriptor.

  • x – [in] The x coordinate where the data will be written.

  • y – [in] The y coordinate where the data will be written.

  • layer – [in] The layer index where the data will be written.

template<typename T, typename std::enable_if<__hip_is_tex_surf_channel_type<T>::value>::type* = nullptr>
static void surfCubemapread(T *data, hipSurfaceObject_t surfObj, int x, int y, int face)#

Reads the value from the cubemap surface at coordinate x, y and face index.

Template Parameters:

T – The data type of the surface.

Parameters:
  • data – [out] The T type result is stored in this pointer.

  • surfObj – [in] The surface descriptor.

  • x – [in] The x coordinate where the value will be read out.

  • y – [in] The y coordinate where the value will be read out.

  • face – [in] The face index where the value will be read out.

template<typename T, typename std::enable_if<__hip_is_tex_surf_channel_type<T>::value>::type* = nullptr>
static void surfCubemapwrite(T data, hipSurfaceObject_t surfObj, int x, int y, int face)#

Writes the value data to the cubemap surface at coordinate x, y and face index.

Template Parameters:

T – The data type of the surface.

Parameters:
  • data – [in] The T type value is written to surface.

  • surfObj – [in] The surface descriptor.

  • x – [in] The x coordinate where the data will be written.

  • y – [in] The y coordinate where the data will be written.

  • face – [in] The face index where the data will be written.

template<typename T, typename std::enable_if<__hip_is_tex_surf_channel_type<T>::value>::type* = nullptr>
static void surfCubemapLayeredread(T *data, hipSurfaceObject_t surfObj, int x, int y, int face, int layer)#

Reads the value from the layered cubemap surface at coordinate x, y and face, layer index.

Template Parameters:

T – The data type of the surface.

Parameters:
  • data – [out] The T type result is stored in this pointer.

  • surfObj – [in] The surface descriptor.

  • x – [in] The x coordinate where the value will be read out.

  • y – [in] The y coordinate where the value will be read out.

  • face – [in] The face index where the value will be read out.

  • layer – [in] The layer index where the data will be written.

template<typename T, typename std::enable_if<__hip_is_tex_surf_channel_type<T>::value>::type* = nullptr>
static void surfCubemapLayeredwrite(T *data, hipSurfaceObject_t surfObj, int x, int y, int face, int layer)#

Writes the value data to the layered cubemap surface at coordinate x, y and face, layer index.

Template Parameters:

T – The data type of the surface.

Parameters:
  • data – [in] The T type value to write to the surface.

  • surfObj – [in] The surface descriptor.

  • x – [in] The x coordinate where the data will be written.

  • y – [in] The y coordinate where the data will be written.

  • face – [in] The face index where the data will be written.

  • layer – [in] The layer index where the data will be written.

Timer functions#

To read a high-resolution timer from the device, HIP provides the following built-in functions:

  • Returning the incremental counter value for every clock cycle on a device:

    clock_t clock()
    long long int clock64()
    

    The difference between the values that are returned represents the cycles used.

  • Returning the wall clock count at a constant frequency on the device:

    long long int wall_clock64()
    

    This can be queried using the HIP API with the hipDeviceAttributeWallClockRate attribute of the device in HIP application code. For example:

    int wallClkRate = 0; //in kilohertz
    HIPCHECK(hipDeviceGetAttribute(&wallClkRate, hipDeviceAttributeWallClockRate, deviceId));
    

    Where hipDeviceAttributeWallClockRate is a device attribute. Note that wall clock frequency is a per-device attribute.

    Note that clock() and clock64() do not work properly on AMD RDNA3 (GFX11) graphic processors.

Atomic functions#

Atomic functions are run as read-modify-write (RMW) operations that reside in global or shared memory. No other device or thread can observe or modify the memory location during an atomic operation. If multiple instructions from different devices or threads target the same memory location, the instructions are serialized in an undefined order.

To support system scope atomic operations, you can use the HIP APIs that contain the _system suffix. For example:

  • atomicAnd: This function is atomic and coherent within the GPU device running the function

  • atomicAnd_system: This function extends the atomic operation from the GPU device to other CPUs and GPU devices in the system.

HIP supports the following atomic operations.

Atomic operations#

Function

Supported in HIP

Supported in CUDA

int atomicAdd(int* address, int val)

int atomicAdd_system(int* address, int val)

unsigned int atomicAdd(unsigned int* address,unsigned int val)

unsigned int atomicAdd_system(unsigned int* address, unsigned int val)

unsigned long long atomicAdd(unsigned long long* address,unsigned long long val)

unsigned long long atomicAdd_system(unsigned long long* address, unsigned long long val)

float atomicAdd(float* address, float val)

float atomicAdd_system(float* address, float val)

double atomicAdd(double* address, double val)

double atomicAdd_system(double* address, double val)

float unsafeAtomicAdd(float* address, float val)

float safeAtomicAdd(float* address, float val)

double unsafeAtomicAdd(double* address, double val)

double safeAtomicAdd(double* address, double val)

int atomicSub(int* address, int val)

int atomicSub_system(int* address, int val)

unsigned int atomicSub(unsigned int* address,unsigned int val)

unsigned int atomicSub_system(unsigned int* address, unsigned int val)

int atomicExch(int* address, int val)

int atomicExch_system(int* address, int val)

unsigned int atomicExch(unsigned int* address,unsigned int val)

unsigned int atomicExch_system(unsigned int* address, unsigned int val)

unsigned long long atomicExch(unsigned long long int* address,unsigned long long int val)

unsigned long long atomicExch_system(unsigned long long* address, unsigned long long val)

unsigned long long atomicExch_system(unsigned long long* address, unsigned long long val)

float atomicExch(float* address, float val)

int atomicMin(int* address, int val)

int atomicMin_system(int* address, int val)

unsigned int atomicMin(unsigned int* address,unsigned int val)

unsigned int atomicMin_system(unsigned int* address, unsigned int val)

unsigned long long atomicMin(unsigned long long* address,unsigned long long val)

int atomicMax(int* address, int val)

int atomicMax_system(int* address, int val)

unsigned int atomicMax(unsigned int* address,unsigned int val)

unsigned int atomicMax_system(unsigned int* address, unsigned int val)

unsigned long long atomicMax(unsigned long long* address,unsigned long long val)

unsigned int atomicInc(unsigned int* address)

unsigned int atomicDec(unsigned int* address)

int atomicCAS(int* address, int compare, int val)

int atomicCAS_system(int* address, int compare, int val)

unsigned int atomicCAS(unsigned int* address,unsigned int compare,unsigned int val)

unsigned int atomicCAS_system(unsigned int* address, unsigned int compare, unsigned int val)

unsigned long long atomicCAS(unsigned long long* address,unsigned long long compare,unsigned long long val)

unsigned long long atomicCAS_system(unsigned long long* address, unsigned long long compare, unsigned long long val)

int atomicAnd(int* address, int val)

int atomicAnd_system(int* address, int val)

unsigned int atomicAnd(unsigned int* address,unsigned int val)

unsigned int atomicAnd_system(unsigned int* address, unsigned int val)

unsigned long long atomicAnd(unsigned long long* address,unsigned long long val)

unsigned long long atomicAnd_system(unsigned long long* address, unsigned long long val)

int atomicOr(int* address, int val)

int atomicOr_system(int* address, int val)

unsigned int atomicOr(unsigned int* address,unsigned int val)

unsigned int atomicOr_system(unsigned int* address, unsigned int val)

unsigned int atomicOr_system(unsigned int* address, unsigned int val)

unsigned long long atomicOr(unsigned long long int* address,unsigned long long val)

unsigned long long atomicOr_system(unsigned long long* address, unsigned long long val)

int atomicXor(int* address, int val)

int atomicXor_system(int* address, int val)

unsigned int atomicXor(unsigned int* address,unsigned int val)

unsigned int atomicXor_system(unsigned int* address, unsigned int val)

unsigned long long atomicXor(unsigned long long* address,unsigned long long val)

unsigned long long atomicXor_system(unsigned long long* address, unsigned long long val)

Unsafe floating-point atomic RMW operations#

Some HIP devices support fast atomic RMW operations on floating-point values. For example, atomicAdd on single- or double-precision floating-point values may generate a hardware RMW instruction that is faster than emulating the atomic operation using an atomic compare-and-swap (CAS) loop.

On some devices, fast atomic RMW instructions can produce results that differ from the same functions implemented with atomic CAS loops. For example, some devices will use different rounding or denormal modes, and some devices produce incorrect answers if fast floating-point atomic RMW instructions target fine-grained memory allocations.

The HIP-Clang compiler offers a compile-time option, so you can choose fast–but potentially unsafe–atomic instructions for your code. On devices that support these instructions, you can include the -munsafe-fp-atomics option. This flag indicates to the compiler that all floating-point atomic function calls are allowed to use an unsafe version, if one exists. For example, on some devices, this flag indicates to the compiler that no floating-point atomicAdd function can target fine-grained memory.

If you want to avoid using unsafe use a floating-point atomic RMW operations, you can use the -mno-unsafe-fp-atomics option. Note that the compiler default is to not produce unsafe floating-point atomic RMW instructions, so the -mno-unsafe-fp-atomics option is not necessarily required. However, passing this option to the compiler is good practice.

When you pass -munsafe-fp-atomics or -mno-unsafe-fp-atomics to the compiler’s command line, the option is applied globally for the entire compilation. Note that if some of the atomic RMW function calls cannot safely use the faster floating-point atomic RMW instructions, you must use -mno-unsafe-fp-atomics in order to ensure that your atomic RMW function calls produce correct results.

HIP has four extra functions that you can use to more precisely control which floating-point atomic RMW functions produce unsafe atomic RMW instructions:

  • float unsafeAtomicAdd(float* address, float val)

  • double unsafeAtomicAdd(double* address, double val) (Always produces fast atomic RMW instructions on devices that have them, even when -mno-unsafe-fp-atomics is used)

  • float safeAtomicAdd(float* address, float val)

  • double safeAtomicAdd(double* address, double val) (Always produces safe atomic RMW operations, even when -munsafe-fp-atomics is used)

Warp cross-lane functions#

Threads in a warp are referred to as lanes and are numbered from 0 to warpSize - 1. Warp cross-lane functions operate across all lanes in a warp. The hardware guarantees that all warp lanes will execute in lockstep, so additional synchronization is unnecessary, and the instructions use no shared memory.

Note that NVIDIA and AMD devices have different warp sizes. You can use warpSize built-ins in you portable code to query the warp size.

Tip

Be sure to review HIP code generated from the CUDA path to ensure that it doesn’t assume a waveSize of 32. “Wave-aware” code that assumes a waveSize of 32 can run on a wave-64 machine, but it only utilizes half of the machine’s resources.

To get the default warp size of a GPU device, use hipGetDeviceProperties in you host functions.

cudaDeviceProp props;
cudaGetDeviceProperties(&props, deviceID);
int w = props.warpSize;
  // implement portable algorithm based on w (rather than assume 32 or 64)

Only use warpSize built-ins in device functions, and don’t assume warpSize to be a compile-time constant.

Note that assembly kernels may be built for a warp size that is different from the default. All mask values either returned or accepted by these builtins are 64-bit unsigned integer values, even when compiled for a wave-32 device, where all the higher bits are unused. CUDA code ported to HIP requires changes to ensure that the correct type is used.

Note that the __sync variants are made available in ROCm 6.2, but disabled by default to help with the transition to 64-bit masks. They can be enabled by setting the preprocessor macro HIP_ENABLE_WARP_SYNC_BUILTINS. These builtins will be enabled unconditionally in ROCm 6.3. Wherever possible, the implementation includes a static assert to check that the program source uses the correct type for the mask.

Warp vote and ballot functions#

int __all(int predicate)
int __any(int predicate)
unsigned long long __ballot(int predicate)
unsigned long long __activemask()

int __all_sync(unsigned long long mask, int predicate)
int __any_sync(unsigned long long mask, int predicate)
int __ballot(unsigned long long mask, int predicate)

You can use __any and __all to get a summary view of the predicates evaluated by the participating lanes.

  • __any(): Returns 1 if the predicate is non-zero for any participating lane, otherwise it returns 0.

  • __all(): Returns 1 if the predicate is non-zero for all participating lanes, otherwise it returns 0.

To determine if the target platform supports the any/all instruction, you can use the hasWarpVote device property or the HIP_ARCH_HAS_WARP_VOTE compiler definition.

__ballot returns a bit mask containing the 1-bit predicate value from each lane. The nth bit of the result contains the 1 bit contributed by the nth warp lane.

__activemask() returns a bit mask of currently active warp lanes. The nth bit of the result is 1 if the nth warp lane is active.

Note that the __ballot and __activemask builtins in HIP have a 64-bit return value (unlike the 32-bit value returned by the CUDA builtins). Code ported from CUDA should be adapted to support the larger warp sizes that the HIP version requires.

Applications can test whether the target platform supports the __ballot or __activemask instructions using the hasWarpBallot device property in host code or the HIP_ARCH_HAS_WARP_BALLOT macro defined by the compiler for device code.

The _sync variants require a 64-bit unsigned integer mask argument that specifies the lanes in the warp that will participate in cross-lane communication with the calling lane. Each participating thread must have its own bit set in its mask argument, and all active threads specified in any mask argument must execute the same call with the same mask, otherwise the result is undefined.

Warp match functions#

unsigned long long __match_any(T value)
unsigned long long __match_all(T value, int *pred)

unsigned long long __match_any_sync(unsigned long long mask, T value)
unsigned long long __match_all_sync(unsigned long long mask, T value, int *pred)

T can be a 32-bit integer type, 64-bit integer type or a single precision or double precision floating point type.

__match_any returns a bit mask containing a 1-bit for every participating lane if and only if that lane has the same value in value as the current lane, and a 0-bit for all other lanes.

__match_all returns a bit mask containing a 1-bit for every participating lane if and only if they all have the same value in value as the current lane, and a 0-bit for all other lanes. The predicate pred is set to true if and only if all participating threads have the same value in value.

The _sync variants require a 64-bit unsigned integer mask argument that specifies the lanes in the warp that will participate in cross-lane communication with the calling lane. Each participating thread must have its own bit set in its mask argument, and all active threads specified in any mask argument must execute the same call with the same mask, otherwise the result is undefined.

Warp shuffle functions#

The default width is warpSize (see Warp cross-lane functions). Half-float shuffles are not supported.

int   __shfl      (T var,   int srcLane, int width=warpSize);
int   __shfl_up   (T var,   unsigned int delta, int width=warpSize);
int   __shfl_down (T var,   unsigned int delta, int width=warpSize);
int   __shfl_xor  (T var,   int laneMask, int width=warpSize);

int   __shfl_sync      (unsigned long long mask, T var,   int srcLane, int width=warpSize);
int   __shfl_up_sync   (unsigned long long mask, T var,   unsigned int delta, int width=warpSize);
int   __shfl_down_sync (unsigned long long mask, T var,   unsigned int delta, int width=warpSize);
int   __shfl_xor_sync  (unsigned long long mask, T var,   int laneMask, int width=warpSize);

T can be a 32-bit integer type, 64-bit integer type or a single precision or double precision floating point type.

The _sync variants require a 64-bit unsigned integer mask argument that specifies the lanes in the warp that will participate in cross-lane communication with the calling lane. Each participating thread must have its own bit set in its mask argument, and all active threads specified in any mask argument must execute the same call with the same mask, otherwise the result is undefined.

Cooperative groups functions#

You can use cooperative groups to synchronize groups of threads. Cooperative groups also provide a way of communicating between groups of threads at a granularity that is different from the block.

HIP supports the following kernel language cooperative groups types and functions:

Cooperative groups functions#

Function

Supported in HIP

Supported in CUDA

void thread_group.sync();

unsigned thread_group.size();

unsigned thread_group.thread_rank()

bool thread_group.is_valid();

grid_group this_grid()

void grid_group.sync()

unsigned grid_group.size()

unsigned grid_group.thread_rank()

bool grid_group.is_valid()

multi_grid_group this_multi_grid()

void multi_grid_group.sync()

unsigned multi_grid_group.size()

unsigned multi_grid_group.thread_rank()

bool multi_grid_group.is_valid()

unsigned multi_grid_group.num_grids()

unsigned multi_grid_group.grid_rank()

thread_block this_thread_block()

multi_grid_group this_multi_grid()

void multi_grid_group.sync()

void thread_block.sync()

unsigned thread_block.size()

unsigned thread_block.thread_rank()

bool thread_block.is_valid()

dim3 thread_block.group_index()

dim3 thread_block.thread_index()

For further information, check Cooperative Groups API or Cooperative Groups how to.

Warp matrix functions#

Warp matrix functions allow a warp to cooperatively operate on small matrices that have elements spread over lanes in an unspecified manner.

HIP does not support kernel language warp matrix types or functions.

Warp matrix functions#

Function

Supported in HIP

Supported in CUDA

void load_matrix_sync(fragment<...> &a, const T* mptr, unsigned lda)

void load_matrix_sync(fragment<...> &a, const T* mptr, unsigned lda, layout_t layout)

void store_matrix_sync(T* mptr, fragment<...> &a, unsigned lda, layout_t layout)

void fill_fragment(fragment<...> &a, const T &value)

void mma_sync(fragment<...> &d, const fragment<...> &a, const fragment<...> &b, const fragment<...> &c , bool sat)

Independent thread scheduling#

Certain architectures that support CUDA allow threads to progress independently of each other. This independent thread scheduling makes intra-warp synchronization possible.

HIP does not support this type of scheduling.

Profiler Counter Function#

The CUDA __prof_trigger() instruction is not supported.

Assert#

The assert function is supported in HIP. Assert function is used for debugging purpose, when the input expression equals to zero, the execution will be stopped.

void assert(int input)

There are two kinds of implementations for assert functions depending on the use sceneries, - One is for the host version of assert, which is defined in assert.h, - Another is the device version of assert, which is implemented in hip/hip_runtime.h. Users need to include assert.h to use assert. For assert to work in both device and host functions, users need to include "hip/hip_runtime.h".

HIP provides the function abort() which can be used to terminate the application when terminal failures are detected. It is implemented using the __builtin_trap() function.

This function produces a similar effect of using asm("trap") in the CUDA code.

Note

In HIP, the function terminates the entire application, while in CUDA, asm("trap") only terminates the dispatch and the application continues to run.

printf#

printf function is supported in HIP. The following is a simple example to print information in the kernel.

#include <hip/hip_runtime.h>

__global__ void run_printf() { printf("Hello World\n"); }

int main() {
  run_printf<<<dim3(1), dim3(1), 0, 0>>>();
}

Device-Side Dynamic Global Memory Allocation#

Device-side dynamic global memory allocation is under development. HIP now includes a preliminary implementation of malloc and free that can be called from device functions.

__launch_bounds__#

GPU multiprocessors have a fixed pool of resources (primarily registers and shared memory) which are shared by the actively running warps. Using more resources can increase IPC of the kernel but reduces the resources available for other warps and limits the number of warps that can be simultaneously running. Thus GPUs have a complex relationship between resource usage and performance.

__launch_bounds__ allows the application to provide usage hints that influence the resources (primarily registers) used by the generated code. It is a function attribute that must be attached to a __global__ function:

__global__ void __launch_bounds__(MAX_THREADS_PER_BLOCK, MIN_WARPS_PER_EXECUTION_UNIT)
MyKernel(hipGridLaunch lp, ...)
...

__launch_bounds__ supports two parameters: - MAX_THREADS_PER_BLOCK - The programmers guarantees that kernel will be launched with threads less than MAX_THREADS_PER_BLOCK. (On NVCC this maps to the .maxntid PTX directive). If no launch_bounds is specified, MAX_THREADS_PER_BLOCK is the maximum block size supported by the device (typically 1024 or larger). Specifying MAX_THREADS_PER_BLOCK less than the maximum effectively allows the compiler to use more resources than a default unconstrained compilation that supports all possible block sizes at launch time. The threads-per-block is the product of (blockDim.x * blockDim.y * blockDim.z). - MIN_WARPS_PER_EXECUTION_UNIT - directs the compiler to minimize resource usage so that the requested number of warps can be simultaneously active on a multi-processor. Since active warps compete for the same fixed pool of resources, the compiler must reduce resources required by each warp(primarily registers). MIN_WARPS_PER_EXECUTION_UNIT is optional and defaults to 1 if not specified. Specifying a MIN_WARPS_PER_EXECUTION_UNIT greater than the default 1 effectively constrains the compiler’s resource usage.

When launch kernel with HIP APIs, for example, hipModuleLaunchKernel(), HIP will do validation to make sure input kernel dimension size is not larger than specified launch_bounds. In case exceeded, HIP would return launch failure, if AMD_LOG_LEVEL is set with proper value (for details, please refer to docs/markdown/hip_logging.md), detail information will be shown in the error log message, including launch parameters of kernel dim size, launch bounds, and the name of the faulting kernel. It’s helpful to figure out which is the faulting kernel, besides, the kernel dim size and launch bounds values will also assist in debugging such failures.

Compiler Impact#

The compiler uses these parameters as follows: - The compiler uses the hints only to manage register usage, and does not automatically reduce shared memory or other resources. - Compilation fails if compiler cannot generate a kernel which meets the requirements of the specified launch bounds. - From MAX_THREADS_PER_BLOCK, the compiler derives the maximum number of warps/block that can be used at launch time. Values of MAX_THREADS_PER_BLOCK less than the default allows the compiler to use a larger pool of registers : each warp uses registers, and this hint constrains the launch to a warps/block size which is less than maximum. - From MIN_WARPS_PER_EXECUTION_UNIT, the compiler derives a maximum number of registers that can be used by the kernel (to meet the required #simultaneous active blocks). If MIN_WARPS_PER_EXECUTION_UNIT is 1, then the kernel can use all registers supported by the multiprocessor. - The compiler ensures that the registers used in the kernel is less than both allowed maximums, typically by spilling registers (to shared or global memory), or by using more instructions. - The compiler may use heuristics to increase register usage, or may simply be able to avoid spilling. The MAX_THREADS_PER_BLOCK is particularly useful in this cases, since it allows the compiler to use more registers and avoid situations where the compiler constrains the register usage (potentially spilling) to meet the requirements of a large block size that is never used at launch time.

CU and EU Definitions#

A compute unit (CU) is responsible for executing the waves of a work-group. It is composed of one or more execution units (EU) which are responsible for executing waves. An EU can have enough resources to maintain the state of more than one executing wave. This allows an EU to hide latency by switching between waves in a similar way to symmetric multithreading on a CPU. In order to allow the state for multiple waves to fit on an EU, the resources used by a single wave have to be limited. Limiting such resources can allow greater latency hiding, but can result in having to spill some register state to memory. This attribute allows an advanced developer to tune the number of waves that are capable of fitting within the resources of an EU. It can be used to ensure at least a certain number will fit to help hide latency, and can also be used to ensure no more than a certain number will fit to limit cache thrashing.

Porting from CUDA __launch_bounds#

CUDA defines a __launch_bounds which is also designed to control occupancy:

__launch_bounds(MAX_THREADS_PER_BLOCK, MIN_BLOCKS_PER_MULTIPROCESSOR)
  • The second parameter __launch_bounds parameters must be converted to the format used __hip_launch_bounds, which uses warps and execution-units rather than blocks and multi-processors (this conversion is performed automatically by HIPIFY tools).

MIN_WARPS_PER_EXECUTION_UNIT = (MIN_BLOCKS_PER_MULTIPROCESSOR * MAX_THREADS_PER_BLOCK) / 32

The key differences in the interface are: - Warps (rather than blocks): The developer is trying to tell the compiler to control resource utilization to guarantee some amount of active Warps/EU for latency hiding. Specifying active warps in terms of blocks appears to hide the micro-architectural details of the warp size, but makes the interface more confusing since the developer ultimately needs to compute the number of warps to obtain the desired level of control. - Execution Units (rather than multiprocessor): The use of execution units rather than multiprocessors provides support for architectures with multiple execution units/multi-processor. For example, the AMD GCN architecture has 4 execution units per multiprocessor. The hipDeviceProps has a field executionUnitsPerMultiprocessor. Platform-specific coding techniques such as #ifdef can be used to specify different launch_bounds for NVCC and HIP-Clang platforms, if desired.

maxregcount#

Unlike NVCC, HIP-Clang does not support the --maxregcount option. Instead, users are encouraged to use the hip_launch_bounds directive since the parameters are more intuitive and portable than micro-architecture details like registers, and also the directive allows per-kernel control rather than an entire file. hip_launch_bounds works on both HIP-Clang and NVCC targets.

Asynchronous Functions#

Memory stream#

typedef void (*hipStreamCallback_t)(hipStream_t stream, hipError_t status, void *userData)#

Stream CallBack struct

hipError_t hipStreamCreate(hipStream_t *stream)#

Create an asynchronous stream.

Create a new asynchronous stream. stream returns an opaque handle that can be used to reference the newly created stream in subsequent hipStream* commands. The stream is allocated on the heap and will remain allocated even if the handle goes out-of-scope. To release the memory used by the stream, application must call hipStreamDestroy.

Parameters:

stream[inout] Valid pointer to hipStream_t. This function writes the memory with the newly created stream.

Returns:

hipSuccess, hipErrorInvalidValue

Returns:

hipSuccess, hipErrorInvalidValue

hipError_t hipStreamCreateWithFlags(hipStream_t *stream, unsigned int flags)#

Create an asynchronous stream.

Create a new asynchronous stream. stream returns an opaque handle that can be used to reference the newly created stream in subsequent hipStream* commands. The stream is allocated on the heap and will remain allocated even if the handle goes out-of-scope. To release the memory used by the stream, application must call hipStreamDestroy. Flags controls behavior of the stream. See hipStreamDefault, hipStreamNonBlocking.

Parameters:
  • stream[inout] Pointer to new stream

  • flags[in] to control stream creation.

Returns:

hipSuccess, hipErrorInvalidValue

hipError_t hipStreamCreateWithPriority(hipStream_t *stream, unsigned int flags, int priority)#

Create an asynchronous stream with the specified priority.

Create a new asynchronous stream with the specified priority. stream returns an opaque handle that can be used to reference the newly created stream in subsequent hipStream* commands. The stream is allocated on the heap and will remain allocated even if the handle goes out-of-scope. To release the memory used by the stream, application must call hipStreamDestroy. Flags controls behavior of the stream. See hipStreamDefault, hipStreamNonBlocking.

Parameters:
  • stream[inout] Pointer to new stream

  • flags[in] to control stream creation.

  • priority[in] of the stream. Lower numbers represent higher priorities.

Returns:

hipSuccess, hipErrorInvalidValue

hipError_t hipDeviceGetStreamPriorityRange(int *leastPriority, int *greatestPriority)#

Returns numerical values that correspond to the least and greatest stream priority.

Returns in *leastPriority and *greatestPriority the numerical values that correspond to the least and greatest stream priority respectively. Stream priorities follow a convention where lower numbers imply greater priorities. The range of meaningful stream priorities is given by [*greatestPriority, *leastPriority]. If the user attempts to create a stream with a priority value that is outside the meaningful range as specified by this API, the priority is automatically clamped to within the valid range.

Parameters:
  • leastPriority[inout] pointer in which value corresponding to least priority is returned.

  • greatestPriority[inout] pointer in which value corresponding to greatest priority is returned.

Returns:

hipSuccess

hipError_t hipStreamDestroy(hipStream_t stream)#

Destroys the specified stream.

Destroys the specified stream.

If commands are still executing on the specified stream, some may complete execution before the queue is deleted.

The queue may be destroyed while some commands are still inflight, or may wait for all commands queued to the stream before destroying it.

Parameters:

stream[in] stream identifier.

Returns:

hipSuccess hipErrorInvalidHandle

hipError_t hipStreamQuery(hipStream_t stream)#

Return hipSuccess if all of the operations in the specified stream have completed, or hipErrorNotReady if not.

This is thread-safe and returns a snapshot of the current state of the queue. However, if other host threads are sending work to the stream, the status may change immediately after the function is called. It is typically used for debug.

Parameters:

stream[in] stream to query

Returns:

hipSuccess, hipErrorNotReady, hipErrorInvalidHandle

hipError_t hipStreamSynchronize(hipStream_t stream)#

Wait for all commands in stream to complete.

This command is host-synchronous : the host will block until the specified stream is empty.

This command follows standard null-stream semantics. Specifically, specifying the null stream will cause the command to wait for other streams on the same device to complete all pending operations.

This command honors the hipDeviceLaunchBlocking flag, which controls whether the wait is active or blocking.

Parameters:

stream[in] stream identifier.

Returns:

hipSuccess, hipErrorInvalidHandle

hipError_t hipStreamWaitEvent(hipStream_t stream, hipEvent_t event, unsigned int flags)#

Make the specified compute stream wait for an event.

This function inserts a wait operation into the specified stream. All future work submitted to stream will wait until event reports completion before beginning execution.

This function only waits for commands in the current stream to complete. Notably, this function does not implicitly wait for commands in the default stream to complete, even if the specified stream is created with hipStreamNonBlocking = 0.

Parameters:
  • stream[in] stream to make wait.

  • event[in] event to wait on

  • flags[in] control operation [must be 0]

Returns:

hipSuccess, hipErrorInvalidHandle

hipError_t hipStreamGetFlags(hipStream_t stream, unsigned int *flags)#

Return flags associated with this stream.

Return flags associated with this stream in *flags.

Parameters:
  • stream[in] stream to be queried

  • flags[inout] Pointer to an unsigned integer in which the stream’s flags are returned

Returns:

hipSuccess, hipErrorInvalidValue, hipErrorInvalidHandle

Returns:

hipSuccess hipErrorInvalidValue hipErrorInvalidHandle

hipError_t hipStreamGetPriority(hipStream_t stream, int *priority)#

Query the priority of a stream.

Query the priority of a stream. The priority is returned in in priority.

Parameters:
  • stream[in] stream to be queried

  • priority[inout] Pointer to an unsigned integer in which the stream’s priority is returned

Returns:

hipSuccess, hipErrorInvalidValue, hipErrorInvalidHandle

Returns:

hipSuccess hipErrorInvalidValue hipErrorInvalidHandle

hipError_t hipStreamGetDevice(hipStream_t stream, hipDevice_t *device)#

Get the device assocaited with the stream.

Parameters:
  • stream[in] stream to be queried

  • device[out] device associated with the stream

Returns:

hipSuccess, hipErrorInvalidValue, hipErrorContextIsDestroyed, hipErrorInvalidHandle, hipErrorNotInitialized, hipErrorDeinitialized, hipErrorInvalidContext

hipError_t hipExtStreamCreateWithCUMask(hipStream_t *stream, uint32_t cuMaskSize, const uint32_t *cuMask)#

Create an asynchronous stream with the specified CU mask.

Create a new asynchronous stream with the specified CU mask. stream returns an opaque handle that can be used to reference the newly created stream in subsequent hipStream* commands. The stream is allocated on the heap and will remain allocated even if the handle goes out-of-scope. To release the memory used by the stream, application must call hipStreamDestroy.

Parameters:
  • stream[inout] Pointer to new stream

  • cuMaskSize[in] Size of CU mask bit array passed in.

  • cuMask[in] Bit-vector representing the CU mask. Each active bit represents using one CU. The first 32 bits represent the first 32 CUs, and so on. If its size is greater than physical CU number (i.e., multiProcessorCount member of hipDeviceProp_t), the extra elements are ignored. It is user’s responsibility to make sure the input is meaningful.

Returns:

hipSuccess, hipErrorInvalidHandle, hipErrorInvalidValue

hipError_t hipExtStreamGetCUMask(hipStream_t stream, uint32_t cuMaskSize, uint32_t *cuMask)#

Get CU mask associated with an asynchronous stream.

Parameters:
  • stream[in] stream to be queried

  • cuMaskSize[in] number of the block of memories (uint32_t *) allocated by user

  • cuMask[out] Pointer to a pre-allocated block of memories (uint32_t *) in which the stream’s CU mask is returned. The CU mask is returned in a chunck of 32 bits where each active bit represents one active CU

Returns:

hipSuccess, hipErrorInvalidHandle, hipErrorInvalidValue

hipError_t hipStreamAddCallback(hipStream_t stream, hipStreamCallback_t callback, void *userData, unsigned int flags)#

Adds a callback to be called on the host after all currently enqueued items in the stream have completed. For each hipStreamAddCallback call, a callback will be executed exactly once. The callback will block later work in the stream until it is finished.

Parameters:
  • stream[in] - Stream to add callback to

  • callback[in] - The function to call once preceding stream operations are complete

  • userData[in] - User specified data to be passed to the callback function

  • flags[in] - Reserved for future use, must be 0

Returns:

hipSuccess, hipErrorInvalidHandle, hipErrorNotSupported

static inline hipError_t hipMallocAsync(void **dev_ptr, size_t size, hipMemPool_t mem_pool, hipStream_t stream)#

C++ wrappers for allocations from a memory pool.

This section describes wrappers for stream Ordered allocation from memory pool functions of HIP runtime API.

This is an alternate C++ calls for hipMallocFromPoolAsync made available through function overloading.

Note

APIs in this section are implemented on Linux, under development on Windows.

Note

This API is implemented on Linux and is under development on Microsoft Windows.

template<class T>
static inline hipError_t hipMallocAsync(T **dev_ptr, size_t size, hipMemPool_t mem_pool, hipStream_t stream)#

C++ wrappers for allocations from a memory pool on the stream.

This is an alternate C++ calls for hipMallocFromPoolAsync made available through function overloading.

Note

This API is implemented on Linux and is under development on Microsoft Windows.

template<class T>
static inline hipError_t hipMallocAsync(T **dev_ptr, size_t size, hipStream_t stream)#

C++ wrappers for allocations from a memory pool.

This is an alternate C++ calls for hipMallocFromPoolAsync made available through function overloading.

Note

This API is implemented on Linux and is under development on Microsoft Windows.

template<class T>
static inline hipError_t hipMallocFromPoolAsync(T **dev_ptr, size_t size, hipMemPool_t mem_pool, hipStream_t stream)#

C++ wrappers for allocations from a memory pool.

This is an alternate C++ calls for hipMallocFromPoolAsync made available through function overloading.

Note

This API is implemented on Linux and is under development on Microsoft Windows.

hipError_t hipMallocAsync(void **dev_ptr, size_t size, hipStream_t stream)#

Allocates memory with stream ordered semantics.

Inserts a memory allocation operation into stream. A pointer to the allocated memory is returned immediately in *dptr. The allocation must not be accessed until the allocation operation completes. The allocation comes from the memory pool associated with the stream’s device.

Note

The default memory pool of a device contains device memory from that device.

Note

Basic stream ordering allows future work submitted into the same stream to use the allocation. Stream query, stream synchronize, and HIP events can be used to guarantee that the allocation operation completes before work submitted in a separate stream runs.

Note

During stream capture, this function results in the creation of an allocation node. In this case, the allocation is owned by the graph instead of the memory pool. The memory pool’s properties are used to set the node’s creation parameters.

Note

This API is implemented on Linux and is under development on Microsoft Windows.

Warning

This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.

Parameters:
  • dev_ptr[out] Returned device pointer of memory allocation

  • size[in] Number of bytes to allocate

  • stream[in] The stream establishing the stream ordering contract and the memory pool to allocate from

Returns:

hipSuccess, hipErrorInvalidValue, hipErrorNotSupported, hipErrorOutOfMemory

hipError_t hipFreeAsync(void *dev_ptr, hipStream_t stream)#

Frees memory with stream ordered semantics.

Inserts a free operation into stream. The allocation must not be used after stream execution reaches the free. After this API returns, accessing the memory from any subsequent work launched on the GPU or querying its pointer attributes results in undefined behavior.

Note

During stream capture, this function results in the creation of a free node and must therefore be passed the address of a graph allocation.

Note

This API is implemented on Linux and is under development on Microsoft Windows.

Warning

This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.

Parameters:
  • dev_ptr[in] Pointer to device memory to free

  • stream[in] The stream, where the destruciton will occur according to the execution order

Returns:

hipSuccess, hipErrorInvalidValue, hipErrorNotSupported

hipError_t hipMemPoolTrimTo(hipMemPool_t mem_pool, size_t min_bytes_to_hold)#

Releases freed memory back to the OS.

Releases memory back to the OS until the pool contains fewer than min_bytes_to_keep reserved bytes, or there is no more memory that the allocator can safely release. The allocator cannot release OS allocations that back outstanding asynchronous allocations. The OS allocations may happen at different granularity from the user allocations.

Note

Allocations that have not been freed count as outstanding.

Note

Allocations that have been asynchronously freed but whose completion has not been observed on the host (eg. by a synchronize) can count as outstanding.

Note

This API is implemented on Linux and is under development on Microsoft Windows.

Warning

This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.

Parameters:
  • mem_pool[in] The memory pool to trim allocations

  • min_bytes_to_hold[in] If the pool has less than min_bytes_to_hold reserved, then the TrimTo operation is a no-op. Otherwise the memory pool will contain at least min_bytes_to_hold bytes reserved after the operation.

Returns:

hipSuccess, hipErrorInvalidValue

hipError_t hipMemPoolSetAttribute(hipMemPool_t mem_pool, hipMemPoolAttr attr, void *value)#

Sets attributes of a memory pool.

Supported attributes are:

  • hipMemPoolAttrReleaseThreshold: (value type = cuuint64_t) Amount of reserved memory in bytes to hold onto before trying to release memory back to the OS. When more than the release threshold bytes of memory are held by the memory pool, the allocator will try to release memory back to the OS on the next call to stream, event or context synchronize. (default 0)

  • hipMemPoolReuseFollowEventDependencies: (value type = int) Allow hipMallocAsync to use memory asynchronously freed in another stream as long as a stream ordering dependency of the allocating stream on the free action exists. HIP events and null stream interactions can create the required stream ordered dependencies. (default enabled)

  • hipMemPoolReuseAllowOpportunistic: (value type = int) Allow reuse of already completed frees when there is no dependency between the free and allocation. (default enabled)

  • hipMemPoolReuseAllowInternalDependencies: (value type = int) Allow hipMallocAsync to insert new stream dependencies in order to establish the stream ordering required to reuse a piece of memory released by hipFreeAsync (default enabled).

Note

This API is implemented on Linux and is under development on Microsoft Windows.

Warning

This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.

Parameters:
  • mem_pool[in] The memory pool to modify

  • attr[in] The attribute to modify

  • value[in] Pointer to the value to assign

Returns:

hipSuccess, hipErrorInvalidValue

hipError_t hipMemPoolGetAttribute(hipMemPool_t mem_pool, hipMemPoolAttr attr, void *value)#

Gets attributes of a memory pool.

Supported attributes are:

  • hipMemPoolAttrReleaseThreshold: (value type = cuuint64_t) Amount of reserved memory in bytes to hold onto before trying to release memory back to the OS. When more than the release threshold bytes of memory are held by the memory pool, the allocator will try to release memory back to the OS on the next call to stream, event or context synchronize. (default 0)

  • hipMemPoolReuseFollowEventDependencies: (value type = int) Allow hipMallocAsync to use memory asynchronously freed in another stream as long as a stream ordering dependency of the allocating stream on the free action exists. HIP events and null stream interactions can create the required stream ordered dependencies. (default enabled)

  • hipMemPoolReuseAllowOpportunistic: (value type = int) Allow reuse of already completed frees when there is no dependency between the free and allocation. (default enabled)

  • hipMemPoolReuseAllowInternalDependencies: (value type = int) Allow hipMallocAsync to insert new stream dependencies in order to establish the stream ordering required to reuse a piece of memory released by hipFreeAsync (default enabled).

Note

This API is implemented on Linux and is under development on Microsoft Windows.

Warning

This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.

Parameters:
  • mem_pool[in] The memory pool to get attributes of

  • attr[in] The attribute to get

  • value[in] Retrieved value

Returns:

hipSuccess, hipErrorInvalidValue

hipError_t hipMemPoolSetAccess(hipMemPool_t mem_pool, const hipMemAccessDesc *desc_list, size_t count)#

Controls visibility of the specified pool between devices.

Note

This API is implemented on Linux and is under development on Microsoft Windows.

Warning

This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.

Parameters:
  • mem_pool[in] Memory pool for acccess change

  • desc_list[in] Array of access descriptors. Each descriptor instructs the access to enable for a single gpu

  • count[in] Number of descriptors in the map array.

Returns:

hipSuccess, hipErrorInvalidValue

hipError_t hipMemPoolGetAccess(hipMemAccessFlags *flags, hipMemPool_t mem_pool, hipMemLocation *location)#

Returns the accessibility of a pool from a device.

Returns the accessibility of the pool’s memory from the specified location.

Note

This API is implemented on Linux and is under development on Microsoft Windows.

Warning

This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.

Parameters:
  • flags[out] Accessibility of the memory pool from the specified location/device

  • mem_pool[in] Memory pool being queried

  • location[in] Location/device for memory pool access

Returns:

hipSuccess, hipErrorInvalidValue

hipError_t hipMemPoolCreate(hipMemPool_t *mem_pool, const hipMemPoolProps *pool_props)#

Creates a memory pool.

Creates a HIP memory pool and returns the handle in mem_pool. The pool_props determines the properties of the pool such as the backing device and IPC capabilities.

By default, the memory pool will be accessible from the device it is allocated on.

Note

Specifying hipMemHandleTypeNone creates a memory pool that will not support IPC.

Note

This API is implemented on Linux and is under development on Microsoft Windows.

Warning

This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.

Parameters:
  • mem_pool[out] Contains createed memory pool

  • pool_props[in] Memory pool properties

Returns:

hipSuccess, hipErrorInvalidValue, hipErrorNotSupported

hipError_t hipMemPoolDestroy(hipMemPool_t mem_pool)#

Destroys the specified memory pool.

If any pointers obtained from this pool haven’t been freed or the pool has free operations that haven’t completed when hipMemPoolDestroy is invoked, the function will return immediately and the resources associated with the pool will be released automatically once there are no more outstanding allocations.

Destroying the current mempool of a device sets the default mempool of that device as the current mempool for that device.

Note

A device’s default memory pool cannot be destroyed.

Note

This API is implemented on Linux and is under development on Microsoft Windows.

Warning

This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.

Parameters:

mem_pool[in] Memory pool for destruction

Returns:

hipSuccess, hipErrorInvalidValue

hipError_t hipMallocFromPoolAsync(void **dev_ptr, size_t size, hipMemPool_t mem_pool, hipStream_t stream)#

Allocates memory from a specified pool with stream ordered semantics.

Inserts an allocation operation into stream. A pointer to the allocated memory is returned immediately in dev_ptr. The allocation must not be accessed until the allocation operation completes. The allocation comes from the specified memory pool.

Basic stream ordering allows future work submitted into the same stream to use the allocation. Stream query, stream synchronize, and HIP events can be used to guarantee that the allocation operation completes before work submitted in a separate stream runs.

Note

The specified memory pool may be from a device different than that of the specified stream.

Note

During stream capture, this function results in the creation of an allocation node. In this case, the allocation is owned by the graph instead of the memory pool. The memory pool’s properties are used to set the node’s creation parameters.

Note

This API is implemented on Linux and is under development on Microsoft Windows.

Warning

This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.

Parameters:
  • dev_ptr[out] Returned device pointer

  • size[in] Number of bytes to allocate

  • mem_pool[in] The pool to allocate from

  • stream[in] The stream establishing the stream ordering semantic

Returns:

hipSuccess, hipErrorInvalidValue, hipErrorNotSupported, hipErrorOutOfMemory

hipError_t hipMemPoolExportToShareableHandle(void *shared_handle, hipMemPool_t mem_pool, hipMemAllocationHandleType handle_type, unsigned int flags)#

Exports a memory pool to the requested handle type.

Given an IPC capable mempool, create an OS handle to share the pool with another process. A recipient process can convert the shareable handle into a mempool with hipMemPoolImportFromShareableHandle. Individual pointers can then be shared with the hipMemPoolExportPointer and hipMemPoolImportPointer APIs. The implementation of what the shareable handle is and how it can be transferred is defined by the requested handle type.

Note

To create an IPC capable mempool, create a mempool with a hipMemAllocationHandleType other than hipMemHandleTypeNone.

Note

This API is implemented on Linux and is under development on Microsoft Windows.

Warning

This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.

Parameters:
  • shared_handle[out] Pointer to the location in which to store the requested handle

  • mem_pool[in] Pool to export

  • handle_type[in] The type of handle to create

  • flags[in] Must be 0

Returns:

hipSuccess, hipErrorInvalidValue, hipErrorOutOfMemory

hipError_t hipMemPoolImportFromShareableHandle(hipMemPool_t *mem_pool, void *shared_handle, hipMemAllocationHandleType handle_type, unsigned int flags)#

Imports a memory pool from a shared handle.

Specific allocations can be imported from the imported pool with hipMemPoolImportPointer.

Note

Imported memory pools do not support creating new allocations. As such imported memory pools may not be used in hipDeviceSetMemPool or hipMallocFromPoolAsync calls.

Note

This API is implemented on Linux and is under development on Microsoft Windows.

Warning

This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.

Parameters:
  • mem_pool[out] Returned memory pool

  • shared_handle[in] OS handle of the pool to open

  • handle_type[in] The type of handle being imported

  • flags[in] Must be 0

Returns:

hipSuccess, hipErrorInvalidValue, hipErrorOutOfMemory

hipError_t hipMemPoolExportPointer(hipMemPoolPtrExportData *export_data, void *dev_ptr)#

Export data to share a memory pool allocation between processes.

Constructs export_data for sharing a specific allocation from an already shared memory pool. The recipient process can import the allocation with the hipMemPoolImportPointer api. The data is not a handle and may be shared through any IPC mechanism.

Note

This API is implemented on Linux and is under development on Microsoft Windows.

Warning

This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.

Parameters:
  • export_data[out] Returned export data

  • dev_ptr[in] Pointer to memory being exported

Returns:

hipSuccess, hipErrorInvalidValue, hipErrorOutOfMemory

hipError_t hipMemPoolImportPointer(void **dev_ptr, hipMemPool_t mem_pool, hipMemPoolPtrExportData *export_data)#

Import a memory pool allocation from another process.

Returns in dev_ptr a pointer to the imported memory. The imported memory must not be accessed before the allocation operation completes in the exporting process. The imported memory must be freed from all importing processes before being freed in the exporting process. The pointer may be freed with hipFree or hipFreeAsync. If hipFreeAsync is used, the free must be completed on the importing process before the free operation on the exporting process.

Note

The hipFreeAsync api may be used in the exporting process before the hipFreeAsync operation completes in its stream as long as the hipFreeAsync in the exporting process specifies a stream with a stream dependency on the importing process’s hipFreeAsync.

Note

This API is implemented on Linux and is under development on Microsoft Windows.

Warning

This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.

Parameters:
  • dev_ptr[out] Pointer to imported memory

  • mem_pool[in] Memory pool from which to import a pointer

  • export_data[in] Data specifying the memory to import

Returns:

hipSuccess, hipErrorInvalidValue, hipErrorNotInitialized, hipErrorOutOfMemory

Peer to peer#

hipError_t hipDeviceCanAccessPeer(int *canAccessPeer, int deviceId, int peerDeviceId)#

Determine if a device can access a peer’s memory.

Returns “1” in canAccessPeer if the specified device is capable of directly accessing memory physically located on peerDevice , or “0” if not.

Returns “0” in canAccessPeer if deviceId == peerDeviceId, and both are valid devices : a device is not a peer of itself.

Parameters:
  • canAccessPeer[out] Returns the peer access capability (0 or 1)

  • deviceId[in] - device from where memory may be accessed.

  • peerDeviceId[in] - device where memory is physically located

Returns:

hipSuccess,

Returns:

hipErrorInvalidDevice if deviceId or peerDeviceId are not valid devices

hipError_t hipDeviceEnablePeerAccess(int peerDeviceId, unsigned int flags)#

Enable direct access from current device’s virtual address space to memory allocations physically located on a peer device.

Memory which already allocated on peer device will be mapped into the address space of the current device. In addition, all future memory allocations on peerDeviceId will be mapped into the address space of the current device when the memory is allocated. The peer memory remains accessible from the current device until a call to hipDeviceDisablePeerAccess or hipDeviceReset.

Returns hipSuccess, hipErrorInvalidDevice, hipErrorInvalidValue,

Parameters:
  • peerDeviceId[in] Peer device to enable direct access to from the current device

  • flags[in] Reserved for future use, must be zero

Returns:

hipErrorPeerAccessAlreadyEnabled if peer access is already enabled for this device.

hipError_t hipDeviceDisablePeerAccess(int peerDeviceId)#

Disable direct access from current device’s virtual address space to memory allocations physically located on a peer device.

Returns hipErrorPeerAccessNotEnabled if direct access to memory on peerDevice has not yet been enabled from the current device.

Parameters:

peerDeviceId[in] Peer device to disable direct access to

Returns:

hipSuccess, hipErrorPeerAccessNotEnabled

hipError_t hipMemGetAddressRange(hipDeviceptr_t *pbase, size_t *psize, hipDeviceptr_t dptr)#

Get information on memory allocations.

See also

hipCtxCreate, hipCtxDestroy, hipCtxGetFlags, hipCtxPopCurrent, hipCtxGetCurrent, hipCtxSetCurrent, hipCtxPushCurrent, hipCtxSetCacheConfig, hipCtxSynchronize, hipCtxGetDevice

Parameters:
  • pbase[out] - BAse pointer address

  • psize[out] - Size of allocation

  • dptr-[in] Device Pointer

Returns:

hipSuccess, hipErrorNotFound

USE_PEER_NON_UNIFIED#

Memory management#

hipError_t hipPointerSetAttribute(const void *value, hipPointer_attribute attribute, hipDeviceptr_t ptr)#

Sets information on the specified pointer.[BETA].

Warning

This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.

Parameters:
  • value[in] Sets pointer attribute value

  • attribute[in] Attribute to set

  • ptr[in] Pointer to set attributes for

Returns:

hipSuccess, hipErrorInvalidDevice, hipErrorInvalidValue

hipError_t hipPointerGetAttributes(hipPointerAttribute_t *attributes, const void *ptr)#

Returns attributes for the specified pointer.

The output parameter ‘attributes’ has a member named ‘type’ that describes what memory the pointer is associated with, such as device memory, host memory, managed memory, and others. Otherwise, the API cannot handle the pointer and returns hipErrorInvalidValue.

Note

The unrecognized memory type is unsupported to keep the HIP functionality backward compatibility due to hipMemoryType enum values.

Note

The current behavior of this HIP API corresponds to the CUDA API before version 11.0.

Parameters:
  • attributes[out] attributes for the specified pointer

  • ptr[in] pointer to get attributes for

Returns:

hipSuccess, hipErrorInvalidDevice, hipErrorInvalidValue

hipError_t hipPointerGetAttribute(void *data, hipPointer_attribute attribute, hipDeviceptr_t ptr)#

Returns information about the specified pointer.[BETA].

Warning

This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.

Parameters:
  • data[inout] Returned pointer attribute value

  • attribute[in] Attribute to query for

  • ptr[in] Pointer to get attributes for

Returns:

hipSuccess, hipErrorInvalidDevice, hipErrorInvalidValue

hipError_t hipDrvPointerGetAttributes(unsigned int numAttributes, hipPointer_attribute *attributes, void **data, hipDeviceptr_t ptr)#

Returns information about the specified pointer.[BETA].

Warning

This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.

Parameters:
  • numAttributes[in] number of attributes to query for

  • attributes[in] attributes to query for

  • data[inout] a two-dimensional containing pointers to memory locations where the result of each attribute query will be written to

  • ptr[in] pointer to get attributes for

Returns:

hipSuccess, hipErrorInvalidDevice, hipErrorInvalidValue

hipError_t hipMalloc(void **ptr, size_t size)#

Allocate memory on the default accelerator.

If size is 0, no memory is allocated, *ptr returns nullptr, and hipSuccess is returned.

Parameters:
  • ptr[out] Pointer to the allocated memory

  • size[in] Requested memory size

Returns:

hipSuccess, hipErrorOutOfMemory, hipErrorInvalidValue (bad context, null *ptr)

hipError_t hipExtMallocWithFlags(void **ptr, size_t sizeBytes, unsigned int flags)#

Allocate memory on the default accelerator.

If requested memory size is 0, no memory is allocated, *ptr returns nullptr, and hipSuccess is returned.

The memory allocation flag should be either hipDeviceMallocDefault, hipDeviceMallocFinegrained, hipDeviceMallocUncached, or hipMallocSignalMemory. If the flag is any other value, the API returns hipErrorInvalidValue.

Parameters:
  • ptr[out] Pointer to the allocated memory

  • sizeBytes[in] Requested memory size

  • flags[in] Type of memory allocation

Returns:

hipSuccess, hipErrorOutOfMemory, hipErrorInvalidValue (bad context, null *ptr)

hipError_t hipMallocHost(void **ptr, size_t size)#

Allocate pinned host memory [Deprecated].

If size is 0, no memory is allocated, *ptr returns nullptr, and hipSuccess is returned.

Warning

This API is deprecated, use hipHostMalloc() instead

Parameters:
  • ptr[out] Pointer to the allocated host pinned memory

  • size[in] Requested memory size

Returns:

hipSuccess, hipErrorOutOfMemory

hipError_t hipMemAllocHost(void **ptr, size_t size)#

Allocate pinned host memory [Deprecated].

If size is 0, no memory is allocated, *ptr returns nullptr, and hipSuccess is returned.

Warning

This API is deprecated, use hipHostMalloc() instead

Parameters:
  • ptr[out] Pointer to the allocated host pinned memory

  • size[in] Requested memory size

Returns:

hipSuccess, hipErrorOutOfMemory

hipError_t hipHostMalloc(void **ptr, size_t size, unsigned int flags)#

Allocates device accessible page locked (pinned) host memory.

This API allocates pinned host memory which is mapped into the address space of all GPUs in the system, the memory can be accessed directly by the GPU device, and can be read or written with much higher bandwidth than pageable memory obtained with functions such as malloc().

Using the pinned host memory, applications can implement faster data transfers for HostToDevice and DeviceToHost. The runtime tracks the hipHostMalloc allocations and can avoid some of the setup required for regular unpinned memory.

When the memory accesses are infrequent, zero-copy memory can be a good choice, for coherent allocation. GPU can directly access the host memory over the CPU/GPU interconnect, without need to copy the data.

Currently the allocation granularity is 4KB for the API.

Developers need to choose proper allocation flag with consideration of synchronization.

If no input for flags, it will be the default pinned memory allocation on the host.

See also

hipSetDeviceFlags, hipHostFree

Parameters:
  • ptr[out] Pointer to the allocated host pinned memory

  • size[in] Requested memory size in bytes If size is 0, no memory is allocated, *ptr returns nullptr, and hipSuccess is returned.

  • flags[in] Type of host memory allocation. See the description of flags in hipSetDeviceFlags.

Returns:

hipSuccess, hipErrorOutOfMemory

hipError_t hipHostAlloc(void **ptr, size_t size, unsigned int flags)#

Allocate device accessible page locked host memory [Deprecated].

If size is 0, no memory is allocated, *ptr returns nullptr, and hipSuccess is returned.

Warning

This API is deprecated, use hipHostMalloc() instead

Parameters:
  • ptr[out] Pointer to the allocated host pinned memory

  • size[in] Requested memory size in bytes

  • flags[in] Type of host memory allocation

Returns:

hipSuccess, hipErrorOutOfMemory

hipError_t hipHostGetDevicePointer(void **devPtr, void *hstPtr, unsigned int flags)#

Get Device pointer from Host Pointer allocated through hipHostMalloc.

See also

hipSetDeviceFlags, hipHostMalloc

Parameters:
  • devPtr[out] Device Pointer mapped to passed host pointer

  • hstPtr[in] Host Pointer allocated through hipHostMalloc

  • flags[in] Flags to be passed for extension

Returns:

hipSuccess, hipErrorInvalidValue, hipErrorOutOfMemory

hipError_t hipHostGetFlags(unsigned int *flagsPtr, void *hostPtr)#

Return flags associated with host pointer.

See also

hipHostMalloc

Parameters:
  • flagsPtr[out] Memory location to store flags

  • hostPtr[in] Host Pointer allocated through hipHostMalloc

Returns:

hipSuccess, hipErrorInvalidValue

hipError_t hipHostRegister(void *hostPtr, size_t sizeBytes, unsigned int flags)#

Register host memory so it can be accessed from the current device.

Flags:

  • hipHostRegisterDefault Memory is Mapped and Portable

  • hipHostRegisterPortable Memory is considered registered by all contexts. HIP only supports one context so this is always assumed true.

  • hipHostRegisterMapped Map the allocation into the address space for the current device. The device pointer can be obtained with hipHostGetDevicePointer.

After registering the memory, use hipHostGetDevicePointer to obtain the mapped device pointer. On many systems, the mapped device pointer will have a different value than the mapped host pointer. Applications must use the device pointer in device code, and the host pointer in host code.

On some systems, registered memory is pinned. On some systems, registered memory may not be actually be pinned but uses OS or hardware facilities to all GPU access to the host memory.

Developers are strongly encouraged to register memory blocks which are aligned to the host cache-line size. (typically 64-bytes but can be obtains from the CPUID instruction).

If registering non-aligned pointers, the application must take care when register pointers from the same cache line on different devices. HIP’s coarse-grained synchronization model does not guarantee correct results if different devices write to different parts of the same cache block - typically one of the writes will “win” and overwrite data from the other registered memory region.

Parameters:
  • hostPtr[out] Pointer to host memory to be registered.

  • sizeBytes[in] Size of the host memory

  • flags[in] See below.

Returns:

hipSuccess, hipErrorOutOfMemory

hipError_t hipHostUnregister(void *hostPtr)#

Un-register host pointer.

See also

hipHostRegister

Parameters:

hostPtr[in] Host pointer previously registered with hipHostRegister

Returns:

Error code

hipError_t hipMallocPitch(void **ptr, size_t *pitch, size_t width, size_t height)#

Allocates at least width (in bytes) * height bytes of linear memory Padding may occur to ensure alighnment requirements are met for the given row The change in width size due to padding will be returned in *pitch. Currently the alignment is set to 128 bytes

If size is 0, no memory is allocated, *ptr returns nullptr, and hipSuccess is returned.

Parameters:
  • ptr[out] Pointer to the allocated device memory

  • pitch[out] Pitch for allocation (in bytes)

  • width[in] Requested pitched allocation width (in bytes)

  • height[in] Requested pitched allocation height

Returns:

Error code

hipError_t hipMemAllocPitch(hipDeviceptr_t *dptr, size_t *pitch, size_t widthInBytes, size_t height, unsigned int elementSizeBytes)#

Allocates at least width (in bytes) * height bytes of linear memory Padding may occur to ensure alighnment requirements are met for the given row The change in width size due to padding will be returned in *pitch. Currently the alignment is set to 128 bytes

If size is 0, no memory is allocated, ptr returns nullptr, and hipSuccess is returned. The intended usage of pitch is as a separate parameter of the allocation, used to compute addresses within the 2D array. Given the row and column of an array element of type T, the address is computed as: T pElement = (T*)((char*)BaseAddress + Row * Pitch) + Column;

Parameters:
  • dptr[out] Pointer to the allocated device memory

  • pitch[out] Pitch for allocation (in bytes)

  • widthInBytes[in] Requested pitched allocation width (in bytes)

  • height[in] Requested pitched allocation height

  • elementSizeBytes[in] The size of element bytes, should be 4, 8 or 16

Returns:

Error code

hipError_t hipFree(void *ptr)#

Free memory allocated by the hcc hip memory allocation API. This API performs an implicit hipDeviceSynchronize() call. If pointer is NULL, the hip runtime is initialized and hipSuccess is returned.

Parameters:

ptr[in] Pointer to memory to be freed

Returns:

hipSuccess

Returns:

hipErrorInvalidDevicePointer (if pointer is invalid, including host pointers allocated with hipHostMalloc)

hipError_t hipFreeHost(void *ptr)#

Free memory allocated by the hcc hip host memory allocation API [Deprecated].

Warning

This API is deprecated, use hipHostFree() instead

Parameters:

ptr[in] Pointer to memory to be freed

Returns:

hipSuccess, hipErrorInvalidValue (if pointer is invalid, including device pointers allocated with hipMalloc)

hipError_t hipHostFree(void *ptr)#

Free memory allocated by the hcc hip host memory allocation API This API performs an implicit hipDeviceSynchronize() call. If pointer is NULL, the hip runtime is initialized and hipSuccess is returned.

Parameters:

ptr[in] Pointer to memory to be freed

Returns:

hipSuccess, hipErrorInvalidValue (if pointer is invalid, including device pointers allocated with hipMalloc)

hipError_t hipMemcpy(void *dst, const void *src, size_t sizeBytes, hipMemcpyKind kind)#

Copy data from src to dst.

It supports memory from host to device, device to host, device to device and host to host The src and dst must not overlap.

For hipMemcpy, the copy is always performed by the current device (set by hipSetDevice). For multi-gpu or peer-to-peer configurations, it is recommended to set the current device to the device where the src data is physically located. For optimal peer-to-peer copies, the copy device must be able to access the src and dst pointers (by calling hipDeviceEnablePeerAccess with copy agent as the current device and src/dest as the peerDevice argument. if this is not done, the hipMemcpy will still work, but will perform the copy using a staging buffer on the host. Calling hipMemcpy with dst and src pointers that do not match the hipMemcpyKind results in undefined behavior.

See also

hipArrayCreate, hipArrayDestroy, hipArrayGetDescriptor, hipMemAlloc, hipMemAllocHost, hipMemAllocPitch, hipMemcpy2D, hipMemcpy2DAsync, hipMemcpy2DUnaligned, hipMemcpyAtoA, hipMemcpyAtoD, hipMemcpyAtoH, hipMemcpyAtoHAsync, hipMemcpyDtoA, hipMemcpyDtoD, hipMemcpyDtoDAsync, hipMemcpyDtoH, hipMemcpyDtoHAsync, hipMemcpyHtoA, hipMemcpyHtoAAsync, hipMemcpyHtoDAsync, hipMemFree, hipMemFreeHost, hipMemGetAddressRange, hipMemGetInfo, hipMemHostAlloc, hipMemHostGetDevicePointer

Parameters:
  • dst[out] Data being copy to

  • src[in] Data being copy from

  • sizeBytes[in] Data size in bytes

  • kind[in] Kind of transfer

Returns:

hipSuccess, hipErrorInvalidValue, hipErrorUnknown

hipError_t hipMemcpyWithStream(void *dst, const void *src, size_t sizeBytes, hipMemcpyKind kind, hipStream_t stream)#

Memory copy on the stream. It allows single or multiple devices to do memory copy on single or multiple streams.

See also

hipMemcpy, hipStreamCreate, hipStreamSynchronize, hipStreamDestroy, hipSetDevice, hipLaunchKernelGGL

Parameters:
  • dst[out] Data being copy to

  • src[in] Data being copy from

  • sizeBytes[in] Data size in bytes

  • kind[in] Kind of transfer

  • stream[in] Valid stream

Returns:

hipSuccess, hipErrorInvalidValue, hipErrorUnknown, hipErrorContextIsDestroyed

hipError_t hipMemcpyHtoD(hipDeviceptr_t dst, void *src, size_t sizeBytes)#

Copy data from Host to Device.

See also

hipArrayCreate, hipArrayDestroy, hipArrayGetDescriptor, hipMemAlloc, hipMemAllocHost, hipMemAllocPitch, hipMemcpy2D, hipMemcpy2DAsync, hipMemcpy2DUnaligned, hipMemcpyAtoA, hipMemcpyAtoD, hipMemcpyAtoH, hipMemcpyAtoHAsync, hipMemcpyDtoA, hipMemcpyDtoD, hipMemcpyDtoDAsync, hipMemcpyDtoH, hipMemcpyDtoHAsync, hipMemcpyHtoA, hipMemcpyHtoAAsync, hipMemcpyHtoDAsync, hipMemFree, hipMemFreeHost, hipMemGetAddressRange, hipMemGetInfo, hipMemHostAlloc, hipMemHostGetDevicePointer

Parameters:
  • dst[out] Data being copy to

  • src[in] Data being copy from

  • sizeBytes[in] Data size in bytes

Returns:

hipSuccess, hipErrorDeinitialized, hipErrorNotInitialized, hipErrorInvalidContext, hipErrorInvalidValue

hipError_t hipMemcpyDtoH(void *dst, hipDeviceptr_t src, size_t sizeBytes)#

Copy data from Device to Host.

See also

hipArrayCreate, hipArrayDestroy, hipArrayGetDescriptor, hipMemAlloc, hipMemAllocHost, hipMemAllocPitch, hipMemcpy2D, hipMemcpy2DAsync, hipMemcpy2DUnaligned, hipMemcpyAtoA, hipMemcpyAtoD, hipMemcpyAtoH, hipMemcpyAtoHAsync, hipMemcpyDtoA, hipMemcpyDtoD, hipMemcpyDtoDAsync, hipMemcpyDtoH, hipMemcpyDtoHAsync, hipMemcpyHtoA, hipMemcpyHtoAAsync, hipMemcpyHtoDAsync, hipMemFree, hipMemFreeHost, hipMemGetAddressRange, hipMemGetInfo, hipMemHostAlloc, hipMemHostGetDevicePointer

Parameters:
  • dst[out] Data being copy to

  • src[in] Data being copy from

  • sizeBytes[in] Data size in bytes

Returns:

hipSuccess, hipErrorDeinitialized, hipErrorNotInitialized, hipErrorInvalidContext, hipErrorInvalidValue

hipError_t hipMemcpyDtoD(hipDeviceptr_t dst, hipDeviceptr_t src, size_t sizeBytes)#

Copy data from Device to Device.

See also

hipArrayCreate, hipArrayDestroy, hipArrayGetDescriptor, hipMemAlloc, hipMemAllocHost, hipMemAllocPitch, hipMemcpy2D, hipMemcpy2DAsync, hipMemcpy2DUnaligned, hipMemcpyAtoA, hipMemcpyAtoD, hipMemcpyAtoH, hipMemcpyAtoHAsync, hipMemcpyDtoA, hipMemcpyDtoD, hipMemcpyDtoDAsync, hipMemcpyDtoH, hipMemcpyDtoHAsync, hipMemcpyHtoA, hipMemcpyHtoAAsync, hipMemcpyHtoDAsync, hipMemFree, hipMemFreeHost, hipMemGetAddressRange, hipMemGetInfo, hipMemHostAlloc, hipMemHostGetDevicePointer

Parameters:
  • dst[out] Data being copy to

  • src[in] Data being copy from

  • sizeBytes[in] Data size in bytes

Returns:

hipSuccess, hipErrorDeinitialized, hipErrorNotInitialized, hipErrorInvalidContext, hipErrorInvalidValue

hipError_t hipMemcpyHtoDAsync(hipDeviceptr_t dst, void *src, size_t sizeBytes, hipStream_t stream)#

Copy data from Host to Device asynchronously.

See also

hipArrayCreate, hipArrayDestroy, hipArrayGetDescriptor, hipMemAlloc, hipMemAllocHost, hipMemAllocPitch, hipMemcpy2D, hipMemcpy2DAsync, hipMemcpy2DUnaligned, hipMemcpyAtoA, hipMemcpyAtoD, hipMemcpyAtoH, hipMemcpyAtoHAsync, hipMemcpyDtoA, hipMemcpyDtoD, hipMemcpyDtoDAsync, hipMemcpyDtoH, hipMemcpyDtoHAsync, hipMemcpyHtoA, hipMemcpyHtoAAsync, hipMemcpyHtoDAsync, hipMemFree, hipMemFreeHost, hipMemGetAddressRange, hipMemGetInfo, hipMemHostAlloc, hipMemHostGetDevicePointer

Parameters:
  • dst[out] Data being copy to

  • src[in] Data being copy from

  • sizeBytes[in] Data size in bytes

  • stream[in] Stream identifier

Returns:

hipSuccess, hipErrorDeinitialized, hipErrorNotInitialized, hipErrorInvalidContext, hipErrorInvalidValue

hipError_t hipMemcpyDtoHAsync(void *dst, hipDeviceptr_t src, size_t sizeBytes, hipStream_t stream)#

Copy data from Device to Host asynchronously.

See also

hipArrayCreate, hipArrayDestroy, hipArrayGetDescriptor, hipMemAlloc, hipMemAllocHost, hipMemAllocPitch, hipMemcpy2D, hipMemcpy2DAsync, hipMemcpy2DUnaligned, hipMemcpyAtoA, hipMemcpyAtoD, hipMemcpyAtoH, hipMemcpyAtoHAsync, hipMemcpyDtoA, hipMemcpyDtoD, hipMemcpyDtoDAsync, hipMemcpyDtoH, hipMemcpyDtoHAsync, hipMemcpyHtoA, hipMemcpyHtoAAsync, hipMemcpyHtoDAsync, hipMemFree, hipMemFreeHost, hipMemGetAddressRange, hipMemGetInfo, hipMemHostAlloc, hipMemHostGetDevicePointer

Parameters:
  • dst[out] Data being copy to

  • src[in] Data being copy from

  • sizeBytes[in] Data size in bytes

  • stream[in] Stream identifier

Returns:

hipSuccess, hipErrorDeinitialized, hipErrorNotInitialized, hipErrorInvalidContext, hipErrorInvalidValue

hipError_t hipMemcpyDtoDAsync(hipDeviceptr_t dst, hipDeviceptr_t src, size_t sizeBytes, hipStream_t stream)#

Copy data from Device to Device asynchronously.

See also

hipArrayCreate, hipArrayDestroy, hipArrayGetDescriptor, hipMemAlloc, hipMemAllocHost, hipMemAllocPitch, hipMemcpy2D, hipMemcpy2DAsync, hipMemcpy2DUnaligned, hipMemcpyAtoA, hipMemcpyAtoD, hipMemcpyAtoH, hipMemcpyAtoHAsync, hipMemcpyDtoA, hipMemcpyDtoD, hipMemcpyDtoDAsync, hipMemcpyDtoH, hipMemcpyDtoHAsync, hipMemcpyHtoA, hipMemcpyHtoAAsync, hipMemcpyHtoDAsync, hipMemFree, hipMemFreeHost, hipMemGetAddressRange, hipMemGetInfo, hipMemHostAlloc, hipMemHostGetDevicePointer

Parameters:
  • dst[out] Data being copy to

  • src[in] Data being copy from

  • sizeBytes[in] Data size in bytes

  • stream[in] Stream identifier

Returns:

hipSuccess, hipErrorDeinitialized, hipErrorNotInitialized, hipErrorInvalidContext, hipErrorInvalidValue

hipError_t hipModuleGetGlobal(hipDeviceptr_t *dptr, size_t *bytes, hipModule_t hmod, const char *name)#

Returns a global pointer from a module. Returns in *dptr and *bytes the pointer and size of the global of name name located in module hmod. If no variable of that name exists, it returns hipErrorNotFound. Both parameters dptr and bytes are optional. If one of them is NULL, it is ignored and hipSuccess is returned.

Parameters:
  • dptr[out] Returns global device pointer

  • bytes[out] Returns global size in bytes

  • hmod[in] Module to retrieve global from

  • name[in] Name of global to retrieve

Returns:

hipSuccess, hipErrorInvalidValue, hipErrorNotFound, hipErrorInvalidContext

hipError_t hipGetSymbolAddress(void **devPtr, const void *symbol)#

Gets device pointer associated with symbol on the device.

Parameters:
  • devPtr[out] pointer to the device associated the symbole

  • symbol[in] pointer to the symbole of the device

Returns:

hipSuccess, hipErrorInvalidValue

hipError_t hipGetSymbolSize(size_t *size, const void *symbol)#

Gets the size of the given symbol on the device.

Parameters:
  • symbol[in] pointer to the device symbole

  • size[out] pointer to the size

Returns:

hipSuccess, hipErrorInvalidValue

hipError_t hipGetProcAddress(const char *symbol, void **pfn, int hipVersion, uint64_t flags, hipDriverProcAddressQueryResult *symbolStatus)#

Gets the pointer of requested HIP driver function.

Returns hipSuccess if the returned pfn is addressed to the pointer of found driver function.

Parameters:
  • symbol[in] The Symbol name of the driver function to request.

  • pfn[out] Output pointer to the requested driver function.

  • hipVersion[in] The HIP version for the requested driver function symbol. HIP version is defined as 100*version_major + version_minor. For example, in HIP 6.1, the hipversion is 601, for the symbol function “hipGetDeviceProperties”, the specified hipVersion 601 is greater or equal to the version 600, the symbol function will be handle properly as backend compatible function.

  • flags[in] Currently only default flag is suppported.

  • symbolStatus[out] Optional enumeration for returned status of searching for symbol driver function based on the input hipVersion.

Returns:

hipSuccess, hipErrorInvalidValue.

hipError_t hipMemcpyToSymbol(const void *symbol, const void *src, size_t sizeBytes, size_t offset, hipMemcpyKind kind)#

Copies data to the given symbol on the device. Symbol HIP APIs allow a kernel to define a device-side data symbol which can be accessed on the host side. The symbol can be in __constant or device space. Note that the symbol name needs to be encased in the HIP_SYMBOL macro. This also applies to hipMemcpyFromSymbol, hipGetSymbolAddress, and hipGetSymbolSize. For detailed usage, see the memcpyToSymbol example in the HIP Porting Guide.

Parameters:
  • symbol[out] pointer to the device symbole

  • src[in] pointer to the source address

  • sizeBytes[in] size in bytes to copy

  • offset[in] offset in bytes from start of symbole

  • kind[in] type of memory transfer

Returns:

hipSuccess, hipErrorInvalidValue

hipError_t hipMemcpyToSymbolAsync(const void *symbol, const void *src, size_t sizeBytes, size_t offset, hipMemcpyKind kind, hipStream_t stream)#

Copies data to the given symbol on the device asynchronously.

Parameters:
  • symbol[out] pointer to the device symbole

  • src[in] pointer to the source address

  • sizeBytes[in] size in bytes to copy

  • offset[in] offset in bytes from start of symbole

  • kind[in] type of memory transfer

  • stream[in] stream identifier

Returns:

hipSuccess, hipErrorInvalidValue

hipError_t hipMemcpyFromSymbol(void *dst, const void *symbol, size_t sizeBytes, size_t offset, hipMemcpyKind kind)#

Copies data from the given symbol on the device.

Parameters:
  • dst[out] Returns pointer to destinition memory address

  • symbol[in] Pointer to the symbole address on the device

  • sizeBytes[in] Size in bytes to copy

  • offset[in] Offset in bytes from the start of symbole

  • kind[in] Type of memory transfer

Returns:

hipSuccess, hipErrorInvalidValue

hipError_t hipMemcpyFromSymbolAsync(void *dst, const void *symbol, size_t sizeBytes, size_t offset, hipMemcpyKind kind, hipStream_t stream)#

Copies data from the given symbol on the device asynchronously.

Parameters:
  • dst[out] Returns pointer to destinition memory address

  • symbol[in] pointer to the symbole address on the device

  • sizeBytes[in] size in bytes to copy

  • offset[in] offset in bytes from the start of symbole

  • kind[in] type of memory transfer

  • stream[in] stream identifier

Returns:

hipSuccess, hipErrorInvalidValue

hipError_t hipMemcpyAsync(void *dst, const void *src, size_t sizeBytes, hipMemcpyKind kind, hipStream_t stream)#

Copy data from src to dst asynchronously.

For multi-gpu or peer-to-peer configurations, it is recommended to use a stream which is a attached to the device where the src data is physically located. For optimal peer-to-peer copies, the copy device must be able to access the src and dst pointers (by calling hipDeviceEnablePeerAccess with copy agent as the current device and src/dest as the peerDevice argument. if this is not done, the hipMemcpy will still work, but will perform the copy using a staging buffer on the host.

Warning

If host or dest are not pinned, the memory copy will be performed synchronously. For best performance, use hipHostMalloc to allocate host memory that is transferred asynchronously.

Warning

on HCC hipMemcpyAsync does not support overlapped H2D and D2H copies. For hipMemcpy, the copy is always performed by the device associated with the specified stream.

Parameters:
  • dst[out] Data being copy to

  • src[in] Data being copy from

  • sizeBytes[in] Data size in bytes

  • kind[in] Type of memory transfer

  • stream[in] Stream identifier

Returns:

hipSuccess, hipErrorInvalidValue, hipErrorUnknown

hipError_t hipMemset(void *dst, int value, size_t sizeBytes)#

Fills the first sizeBytes bytes of the memory area pointed to by dest with the constant byte value value.

Parameters:
  • dst[out] Data being filled

  • value[in] Value to be set

  • sizeBytes[in] Data size in bytes

Returns:

hipSuccess, hipErrorInvalidValue, hipErrorNotInitialized

hipError_t hipMemsetD8(hipDeviceptr_t dest, unsigned char value, size_t count)#

Fills the first sizeBytes bytes of the memory area pointed to by dest with the constant byte value value.

Parameters:
  • dest[out] Data ptr to be filled

  • value[in] Value to be set

  • count[in] Number of values to be set

Returns:

hipSuccess, hipErrorInvalidValue, hipErrorNotInitialized

hipError_t hipMemsetD8Async(hipDeviceptr_t dest, unsigned char value, size_t count, hipStream_t stream)#

Fills the first sizeBytes bytes of the memory area pointed to by dest with the constant byte value value.

hipMemsetD8Async() is asynchronous with respect to the host, so the call may return before the memset is complete. The operation can optionally be associated to a stream by passing a non-zero stream argument. If stream is non-zero, the operation may overlap with operations in other streams.

Parameters:
  • dest[out] Data ptr to be filled

  • value[in] Constant value to be set

  • count[in] Number of values to be set

  • stream[in] Stream identifier

Returns:

hipSuccess, hipErrorInvalidValue, hipErrorNotInitialized

hipError_t hipMemsetD16(hipDeviceptr_t dest, unsigned short value, size_t count)#

Fills the first sizeBytes bytes of the memory area pointed to by dest with the constant short value value.

Parameters:
  • dest[out] Data ptr to be filled

  • value[in] Constant value to be set

  • count[in] Number of values to be set

Returns:

hipSuccess, hipErrorInvalidValue, hipErrorNotInitialized

hipError_t hipMemsetD16Async(hipDeviceptr_t dest, unsigned short value, size_t count, hipStream_t stream)#

Fills the first sizeBytes bytes of the memory area pointed to by dest with the constant short value value.

hipMemsetD16Async() is asynchronous with respect to the host, so the call may return before the memset is complete. The operation can optionally be associated to a stream by passing a non-zero stream argument. If stream is non-zero, the operation may overlap with operations in other streams.

Parameters:
  • dest[out] Data ptr to be filled

  • value[in] Constant value to be set

  • count[in] Number of values to be set

  • stream[in] Stream identifier

Returns:

hipSuccess, hipErrorInvalidValue, hipErrorNotInitialized

hipError_t hipMemsetD32(hipDeviceptr_t dest, int value, size_t count)#

Fills the memory area pointed to by dest with the constant integer value for specified number of times.

Parameters:
  • dest[out] Data being filled

  • value[in] Constant value to be set

  • count[in] Number of values to be set

Returns:

hipSuccess, hipErrorInvalidValue, hipErrorNotInitialized

hipError_t hipMemsetAsync(void *dst, int value, size_t sizeBytes, hipStream_t stream)#

Fills the first sizeBytes bytes of the memory area pointed to by dev with the constant byte value value.

hipMemsetAsync() is asynchronous with respect to the host, so the call may return before the memset is complete. The operation can optionally be associated to a stream by passing a non-zero stream argument. If stream is non-zero, the operation may overlap with operations in other streams.

Parameters:
  • dst[out] Pointer to device memory

  • value[in] Value to set for each byte of specified memory

  • sizeBytes[in] Size in bytes to set

  • stream[in] Stream identifier

Returns:

hipSuccess, hipErrorInvalidValue

hipError_t hipMemsetD32Async(hipDeviceptr_t dst, int value, size_t count, hipStream_t stream)#

Fills the memory area pointed to by dev with the constant integer value for specified number of times.

hipMemsetD32Async() is asynchronous with respect to the host, so the call may return before the memset is complete. The operation can optionally be associated to a stream by passing a non-zero stream argument. If stream is non-zero, the operation may overlap with operations in other streams.

Parameters:
  • dst[out] Pointer to device memory

  • value[in] Value to set for each byte of specified memory

  • count[in] Number of values to be set

  • stream[in] Stream identifier

Returns:

hipSuccess, hipErrorInvalidValue

hipError_t hipMemset2D(void *dst, size_t pitch, int value, size_t width, size_t height)#

Fills the memory area pointed to by dst with the constant value.

Parameters:
  • dst[out] Pointer to device memory

  • pitch[in] Data size in bytes

  • value[in] Constant value to be set

  • width[in]

  • height[in]

Returns:

hipSuccess, hipErrorInvalidValue

hipError_t hipMemset2DAsync(void *dst, size_t pitch, int value, size_t width, size_t height, hipStream_t stream)#

Fills asynchronously the memory area pointed to by dst with the constant value.

Parameters:
  • dst[in] Pointer to 2D device memory

  • pitch[in] Pitch size in bytes

  • value[in] Value to be set for each byte of specified memory

  • width[in] Width of matrix set columns in bytes

  • height[in] Height of matrix set rows in bytes

  • stream[in] Stream identifier

Returns:

hipSuccess, hipErrorInvalidValue

hipError_t hipMemset3D(hipPitchedPtr pitchedDevPtr, int value, hipExtent extent)#

Fills synchronously the memory area pointed to by pitchedDevPtr with the constant value.

Parameters:
  • pitchedDevPtr[in] Pointer to pitched device memory

  • value[in] Value to set for each byte of specified memory

  • extent[in] Size parameters for width field in bytes in device memory

Returns:

hipSuccess, hipErrorInvalidValue

hipError_t hipMemset3DAsync(hipPitchedPtr pitchedDevPtr, int value, hipExtent extent, hipStream_t stream)#

Fills asynchronously the memory area pointed to by pitchedDevPtr with the constant value.

Parameters:
  • pitchedDevPtr[in] Pointer to pitched device memory

  • value[in] Value to set for each byte of specified memory

  • extent[in] Size parameters for width field in bytes in device memory

  • stream[in] Stream identifier

Returns:

hipSuccess, hipErrorInvalidValue

hipError_t hipMemGetInfo(size_t *free, size_t *total)#

Query memory info.

On ROCM, this function gets the actual free memory left on the current device, so supports the cases while running multi-workload (such as multiple processes, multiple threads, and multiple GPUs).

Warning

On Windows, the free memory only accounts for memory allocated by this process and may be optimistic.

Parameters:
  • free[out] Returns free memory on the current device in bytes

  • total[out] Returns total allocatable memory on the current device in bytes

Returns:

hipSuccess, hipErrorInvalidDevice, hipErrorInvalidValue

hipError_t hipMemPtrGetInfo(void *ptr, size_t *size)#

Get allocated memory size via memory pointer.

This function gets the allocated shared virtual memory size from memory pointer.

Parameters:
  • ptr[in] Pointer to allocated memory

  • size[out] Returns the allocated memory size in bytes

Returns:

hipSuccess, hipErrorInvalidValue

hipError_t hipMallocArray(hipArray_t *array, const hipChannelFormatDesc *desc, size_t width, size_t height, unsigned int flags)#

Allocate an array on the device.

Parameters:
  • array[out] Pointer to allocated array in device memory

  • desc[in] Requested channel format

  • width[in] Requested array allocation width

  • height[in] Requested array allocation height

  • flags[in] Requested properties of allocated array

Returns:

hipSuccess, hipErrorOutOfMemory

hipError_t hipArrayCreate(hipArray_t *pHandle, const HIP_ARRAY_DESCRIPTOR *pAllocateArray)#

Create an array memory pointer on the device.

Parameters:
  • pHandle[out] Pointer to the array memory

  • pAllocateArray[in] Requested array desciptor

Returns:

hipSuccess, hipErrorInvalidValue, hipErrorNotSupported

hipError_t hipArrayDestroy(hipArray_t array)#

Destroy an array memory pointer on the device.

Parameters:

array[in] Pointer to the array memory

Returns:

hipSuccess, hipErrorInvalidValue

hipError_t hipArray3DCreate(hipArray_t *array, const HIP_ARRAY3D_DESCRIPTOR *pAllocateArray)#

Create a 3D array memory pointer on the device.

Parameters:
  • array[out] Pointer to the 3D array memory

  • pAllocateArray[in] Requested array desciptor

Returns:

hipSuccess, hipErrorInvalidValue, hipErrorNotSupported

hipError_t hipMalloc3D(hipPitchedPtr *pitchedDevPtr, hipExtent extent)#

Create a 3D memory pointer on the device.

Parameters:
  • pitchedDevPtr[out] Pointer to the 3D memory

  • extent[in] Requested extent

Returns:

hipSuccess, hipErrorInvalidValue, hipErrorNotSupported

hipError_t hipFreeArray(hipArray_t array)#

Frees an array on the device.

Parameters:

array[in] Pointer to array to free

Returns:

hipSuccess, hipErrorInvalidValue, hipErrorNotInitialized

hipError_t hipMalloc3DArray(hipArray_t *array, const struct hipChannelFormatDesc *desc, struct hipExtent extent, unsigned int flags)#

Allocate an array on the device.

Parameters:
  • array[out] Pointer to allocated array in device memory

  • desc[in] Requested channel format

  • extent[in] Requested array allocation width, height and depth

  • flags[in] Requested properties of allocated array

Returns:

hipSuccess, hipErrorOutOfMemory

hipError_t hipArrayGetInfo(hipChannelFormatDesc *desc, hipExtent *extent, unsigned int *flags, hipArray_t array)#

Gets info about the specified array.

Parameters:
  • desc[out] - Returned array type

  • extent[out] - Returned array shape. 2D arrays will have depth of zero

  • flags[out] - Returned array flags

  • array[in] - The HIP array to get info for

Returns:

hipSuccess, hipErrorInvalidValue hipErrorInvalidHandle

hipError_t hipArrayGetDescriptor(HIP_ARRAY_DESCRIPTOR *pArrayDescriptor, hipArray_t array)#

Gets a 1D or 2D array descriptor.

See also

hipArray3DCreate, hipArray3DGetDescriptor, hipArrayCreate, hipArrayDestroy, hipMemAlloc, hipMemAllocHost, hipMemAllocPitch, hipMemcpy2D, hipMemcpy2DAsync, hipMemcpy2DUnaligned, hipMemcpy3D, hipMemcpy3DAsync, hipMemcpyAtoA, hipMemcpyAtoD, hipMemcpyAtoH, hipMemcpyAtoHAsync, hipMemcpyDtoA, hipMemcpyDtoD, hipMemcpyDtoDAsync, hipMemcpyDtoH, hipMemcpyDtoHAsync, hipMemcpyHtoA, hipMemcpyHtoAAsync, hipMemcpyHtoD, hipMemcpyHtoDAsync, hipMemFree, hipMemFreeHost, hipMemGetAddressRange, hipMemGetInfo, hipMemHostAlloc, hipMemHostGetDevicePointer, hipMemsetD8, hipMemsetD16, hipMemsetD32, hipArrayGetInfo

Parameters:
  • pArrayDescriptor[out] - Returned array descriptor

  • array[in] - Array to get descriptor of

Returns:

hipSuccess, hipErrorDeinitialized, hipErrorNotInitialized, hipErrorInvalidContext, hipErrorInvalidValue hipErrorInvalidHandle

hipError_t hipArray3DGetDescriptor(HIP_ARRAY3D_DESCRIPTOR *pArrayDescriptor, hipArray_t array)#

Gets a 3D array descriptor.

See also

hipArray3DCreate, hipArrayCreate, hipArrayDestroy, hipArrayGetDescriptor, hipMemAlloc, hipMemAllocHost, hipMemAllocPitch, hipMemcpy2D, hipMemcpy2DAsync, hipMemcpy2DUnaligned, hipMemcpy3D, hipMemcpy3DAsync, hipMemcpyAtoA, hipMemcpyAtoD, hipMemcpyAtoH, hipMemcpyAtoHAsync, hipMemcpyDtoA, hipMemcpyDtoD, hipMemcpyDtoDAsync, hipMemcpyDtoH, hipMemcpyDtoHAsync, hipMemcpyHtoA, hipMemcpyHtoAAsync, hipMemcpyHtoD, hipMemcpyHtoDAsync, hipMemFree, hipMemFreeHost, hipMemGetAddressRange, hipMemGetInfo, hipMemHostAlloc, hipMemHostGetDevicePointer, hipMemsetD8, hipMemsetD16, hipMemsetD32, hipArrayGetInfo

Parameters:
  • pArrayDescriptor[out] - Returned 3D array descriptor

  • array[in] - 3D array to get descriptor of

Returns:

hipSuccess, hipErrorDeinitialized, hipErrorNotInitialized, hipErrorInvalidContext, hipErrorInvalidValue hipErrorInvalidHandle, hipErrorContextIsDestroyed

hipError_t hipMemcpy2D(void *dst, size_t dpitch, const void *src, size_t spitch, size_t width, size_t height, hipMemcpyKind kind)#

Copies data between host and device.

Parameters:
  • dst[in] Destination memory address

  • dpitch[in] Pitch of destination memory

  • src[in] Source memory address

  • spitch[in] Pitch of source memory

  • width[in] Width of matrix transfer (columns in bytes)

  • height[in] Height of matrix transfer (rows)

  • kind[in] Type of transfer

Returns:

hipSuccess, hipErrorInvalidValue, hipErrorInvalidPitchValue, hipErrorInvalidDevicePointer, hipErrorInvalidMemcpyDirection

hipError_t hipMemcpyParam2D(const hip_Memcpy2D *pCopy)#

Copies memory for 2D arrays.

Parameters:

pCopy[in] Parameters for the memory copy

Returns:

hipSuccess, hipErrorInvalidValue, hipErrorInvalidPitchValue, hipErrorInvalidDevicePointer, hipErrorInvalidMemcpyDirection

hipError_t hipMemcpyParam2DAsync(const hip_Memcpy2D *pCopy, hipStream_t stream)#

Copies memory for 2D arrays.

Parameters:
  • pCopy[in] Parameters for the memory copy

  • stream[in] Stream to use

Returns:

hipSuccess, hipErrorInvalidValue, hipErrorInvalidPitchValue, hipErrorInvalidDevicePointer, hipErrorInvalidMemcpyDirection

hipError_t hipMemcpy2DAsync(void *dst, size_t dpitch, const void *src, size_t spitch, size_t width, size_t height, hipMemcpyKind kind, hipStream_t stream)#

Copies data between host and device.

Parameters:
  • dst[in] Destination memory address

  • dpitch[in] Pitch of destination memory

  • src[in] Source memory address

  • spitch[in] Pitch of source memory

  • width[in] Width of matrix transfer (columns in bytes)

  • height[in] Height of matrix transfer (rows)

  • kind[in] Type of transfer

  • stream[in] Stream to use

Returns:

hipSuccess, hipErrorInvalidValue, hipErrorInvalidPitchValue, hipErrorInvalidDevicePointer, hipErrorInvalidMemcpyDirection

hipError_t hipMemcpy2DToArray(hipArray_t dst, size_t wOffset, size_t hOffset, const void *src, size_t spitch, size_t width, size_t height, hipMemcpyKind kind)#

Copies data between host and device.

Parameters:
  • dst[in] Destination memory address

  • wOffset[in] Destination starting X offset

  • hOffset[in] Destination starting Y offset

  • src[in] Source memory address

  • spitch[in] Pitch of source memory

  • width[in] Width of matrix transfer (columns in bytes)

  • height[in] Height of matrix transfer (rows)

  • kind[in] Type of transfer

Returns:

hipSuccess, hipErrorInvalidValue, hipErrorInvalidPitchValue, hipErrorInvalidDevicePointer, hipErrorInvalidMemcpyDirection

hipError_t hipMemcpy2DToArrayAsync(hipArray_t dst, size_t wOffset, size_t hOffset, const void *src, size_t spitch, size_t width, size_t height, hipMemcpyKind kind, hipStream_t stream)#

Copies data between host and device.

Parameters:
  • dst[in] Destination memory address

  • wOffset[in] Destination starting X offset

  • hOffset[in] Destination starting Y offset

  • src[in] Source memory address

  • spitch[in] Pitch of source memory

  • width[in] Width of matrix transfer (columns in bytes)

  • height[in] Height of matrix transfer (rows)

  • kind[in] Type of transfer

  • stream[in] Accelerator view which the copy is being enqueued

Returns:

hipSuccess, hipErrorInvalidValue, hipErrorInvalidPitchValue, hipErrorInvalidDevicePointer, hipErrorInvalidMemcpyDirection

hipError_t hipMemcpyToArray(hipArray_t dst, size_t wOffset, size_t hOffset, const void *src, size_t count, hipMemcpyKind kind)#

Copies data between host and device.

Warning

This API is deprecated.

Parameters:
  • dst[in] Destination memory address

  • wOffset[in] Destination starting X offset

  • hOffset[in] Destination starting Y offset

  • src[in] Source memory address

  • count[in] size in bytes to copy

  • kind[in] Type of transfer

Returns:

hipSuccess, hipErrorInvalidValue, hipErrorInvalidPitchValue, hipErrorInvalidDevicePointer, hipErrorInvalidMemcpyDirection

hipError_t hipMemcpyFromArray(void *dst, hipArray_const_t srcArray, size_t wOffset, size_t hOffset, size_t count, hipMemcpyKind kind)#

Copies data between host and device.

Warning

This API is deprecated.

Parameters:
  • dst[in] Destination memory address

  • srcArray[in] Source memory address

  • wOffset[in] Source starting X offset

  • hOffset[in] Source starting Y offset

  • count[in] Size in bytes to copy

  • kind[in] Type of transfer

Returns:

hipSuccess, hipErrorInvalidValue, hipErrorInvalidPitchValue, hipErrorInvalidDevicePointer, hipErrorInvalidMemcpyDirection

hipError_t hipMemcpy2DFromArray(void *dst, size_t dpitch, hipArray_const_t src, size_t wOffset, size_t hOffset, size_t width, size_t height, hipMemcpyKind kind)#

Copies data between host and device.

Parameters:
  • dst[in] Destination memory address

  • dpitch[in] Pitch of destination memory

  • src[in] Source memory address

  • wOffset[in] Source starting X offset

  • hOffset[in] Source starting Y offset

  • width[in] Width of matrix transfer (columns in bytes)

  • height[in] Height of matrix transfer (rows)

  • kind[in] Type of transfer

Returns:

hipSuccess, hipErrorInvalidValue, hipErrorInvalidPitchValue, hipErrorInvalidDevicePointer, hipErrorInvalidMemcpyDirection

hipError_t hipMemcpy2DFromArrayAsync(void *dst, size_t dpitch, hipArray_const_t src, size_t wOffset, size_t hOffset, size_t width, size_t height, hipMemcpyKind kind, hipStream_t stream)#

Copies data between host and device asynchronously.

Parameters:
  • dst[in] Destination memory address

  • dpitch[in] Pitch of destination memory

  • src[in] Source memory address

  • wOffset[in] Source starting X offset

  • hOffset[in] Source starting Y offset

  • width[in] Width of matrix transfer (columns in bytes)

  • height[in] Height of matrix transfer (rows)

  • kind[in] Type of transfer

  • stream[in] Accelerator view which the copy is being enqueued

Returns:

hipSuccess, hipErrorInvalidValue, hipErrorInvalidPitchValue, hipErrorInvalidDevicePointer, hipErrorInvalidMemcpyDirection

hipError_t hipMemcpyAtoH(void *dst, hipArray_t srcArray, size_t srcOffset, size_t count)#

Copies data between host and device.

Parameters:
  • dst[in] Destination memory address

  • srcArray[in] Source array

  • srcOffset[in] Offset in bytes of source array

  • count[in] Size of memory copy in bytes

Returns:

hipSuccess, hipErrorInvalidValue, hipErrorInvalidPitchValue, hipErrorInvalidDevicePointer, hipErrorInvalidMemcpyDirection

hipError_t hipMemcpyHtoA(hipArray_t dstArray, size_t dstOffset, const void *srcHost, size_t count)#

Copies data between host and device.

Parameters:
  • dstArray[in] Destination memory address

  • dstOffset[in] Offset in bytes of destination array

  • srcHost[in] Source host pointer

  • count[in] Size of memory copy in bytes

Returns:

hipSuccess, hipErrorInvalidValue, hipErrorInvalidPitchValue, hipErrorInvalidDevicePointer, hipErrorInvalidMemcpyDirection

hipError_t hipMemcpy3D(const struct hipMemcpy3DParms *p)#

Copies data between host and device.

Parameters:

p[in] 3D memory copy parameters

Returns:

hipSuccess, hipErrorInvalidValue, hipErrorInvalidPitchValue, hipErrorInvalidDevicePointer, hipErrorInvalidMemcpyDirection

hipError_t hipMemcpy3DAsync(const struct hipMemcpy3DParms *p, hipStream_t stream)#

Copies data between host and device asynchronously.

Parameters:
  • p[in] 3D memory copy parameters

  • stream[in] Stream to use

Returns:

hipSuccess, hipErrorInvalidValue, hipErrorInvalidPitchValue, hipErrorInvalidDevicePointer, hipErrorInvalidMemcpyDirection

hipError_t hipDrvMemcpy3D(const HIP_MEMCPY3D *pCopy)#

Copies data between host and device.

Parameters:

pCopy[in] 3D memory copy parameters

Returns:

hipSuccess, hipErrorInvalidValue, hipErrorInvalidPitchValue, hipErrorInvalidDevicePointer, hipErrorInvalidMemcpyDirection

hipError_t hipDrvMemcpy3DAsync(const HIP_MEMCPY3D *pCopy, hipStream_t stream)#

Copies data between host and device asynchronously.

Parameters:
  • pCopy[in] 3D memory copy parameters

  • stream[in] Stream to use

Returns:

hipSuccess, hipErrorInvalidValue, hipErrorInvalidPitchValue, hipErrorInvalidDevicePointer, hipErrorInvalidMemcpyDirection

template<typename T>
hipError_t hipGetSymbolAddress(void **devPtr, const T &symbol)#

Gets the address of a symbol.

Parameters:
  • devPtr[out] - Returns device pointer associated with symbol.

  • symbol[in] - Device symbol.

Returns:

hipSuccess, hipErrorInvalidValue

template<typename T>
hipError_t hipGetSymbolSize(size_t *size, const T &symbol)#

Gets the size of a symbol.

Parameters:
  • size[out] - Returns the size of a symbol.

  • symbol[in] - Device symbol address.

Returns:

hipSuccess, hipErrorInvalidValue

template<typename T>
hipError_t hipMemcpyToSymbol(const T &symbol, const void *src, size_t sizeBytes, size_t offset, hipMemcpyKind kind)#

Copies data to the given symbol on the device.

Returns:

hipSuccess, hipErrorInvalidMemcpyDirection, hipErrorInvalidValue

template<typename T>
hipError_t hipMemcpyToSymbolAsync(const T &symbol, const void *src, size_t sizeBytes, size_t offset, hipMemcpyKind kind, hipStream_t stream)#

Copies data to the given symbol on the device asynchronously on the stream.

Returns:

hipSuccess, hipErrorInvalidMemcpyDirection, hipErrorInvalidValue

template<typename T>
hipError_t hipMemcpyFromSymbol(void *dst, const T &symbol, size_t sizeBytes, size_t offset, hipMemcpyKind kind)#

Copies data from the given symbol on the device.

Returns:

hipSuccess, hipErrorInvalidMemcpyDirection, hipErrorInvalidValue

template<typename T>
hipError_t hipMemcpyFromSymbolAsync(void *dst, const T &symbol, size_t sizeBytes, size_t offset, hipMemcpyKind kind, hipStream_t stream)#

Copies data from the given symbol on the device asynchronously on the stream.

Returns:

hipSuccess, hipErrorInvalidMemcpyDirection, hipErrorInvalidValue

template<class T>
static inline hipError_t hipMalloc(T **devPtr, size_t size)#

: C++ wrapper for hipMalloc

Perform automatic type conversion to eliminate need for excessive typecasting (ie void**)

HIP_DISABLE_CPP_FUNCTIONS macro can be defined to suppress these wrappers. It is useful for applications which need to obtain decltypes of HIP runtime APIs.

See also

hipMalloc

template<class T>
static inline hipError_t hipHostMalloc(T **ptr, size_t size, unsigned int flags = hipHostMallocDefault)#

: C++ wrapper for hipHostMalloc

Provide an override to automatically typecast the pointer type from void**, and also provide a default for the flags.

HIP_DISABLE_CPP_FUNCTIONS macro can be defined to suppress these wrappers. It is useful for applications which need to obtain decltypes of HIP runtime APIs.

See also

hipHostMalloc

External Resource Interoperability#

hipError_t hipImportExternalSemaphore(hipExternalSemaphore_t *extSem_out, const hipExternalSemaphoreHandleDesc *semHandleDesc)#

Imports an external semaphore.

See also

Parameters:
  • extSem_out[out] External semaphores to be waited on

  • semHandleDesc[in] Semaphore import handle descriptor

Returns:

hipSuccess, hipErrorInvalidDevice, hipErrorInvalidValue

hipError_t hipSignalExternalSemaphoresAsync(const hipExternalSemaphore_t *extSemArray, const hipExternalSemaphoreSignalParams *paramsArray, unsigned int numExtSems, hipStream_t stream)#

Signals a set of external semaphore objects.

See also

Parameters:
  • extSemArray[in] External semaphores to be waited on

  • paramsArray[in] Array of semaphore parameters

  • numExtSems[in] Number of semaphores to wait on

  • stream[in] Stream to enqueue the wait operations in

Returns:

hipSuccess, hipErrorInvalidDevice, hipErrorInvalidValue

hipError_t hipWaitExternalSemaphoresAsync(const hipExternalSemaphore_t *extSemArray, const hipExternalSemaphoreWaitParams *paramsArray, unsigned int numExtSems, hipStream_t stream)#

Waits on a set of external semaphore objects.

See also

Parameters:
  • extSemArray[in] External semaphores to be waited on

  • paramsArray[in] Array of semaphore parameters

  • numExtSems[in] Number of semaphores to wait on

  • stream[in] Stream to enqueue the wait operations in

Returns:

hipSuccess, hipErrorInvalidDevice, hipErrorInvalidValue

hipError_t hipDestroyExternalSemaphore(hipExternalSemaphore_t extSem)#

Destroys an external semaphore object and releases any references to the underlying resource. Any outstanding signals or waits must have completed before the semaphore is destroyed.

See also

Parameters:

extSem[in] handle to an external memory object

Returns:

hipSuccess, hipErrorInvalidDevice, hipErrorInvalidValue

hipError_t hipImportExternalMemory(hipExternalMemory_t *extMem_out, const hipExternalMemoryHandleDesc *memHandleDesc)#

Imports an external memory object.

See also

Parameters:
  • extMem_out[out] Returned handle to an external memory object

  • memHandleDesc[in] Memory import handle descriptor

Returns:

hipSuccess, hipErrorInvalidDevice, hipErrorInvalidValue

hipError_t hipExternalMemoryGetMappedBuffer(void **devPtr, hipExternalMemory_t extMem, const hipExternalMemoryBufferDesc *bufferDesc)#

Maps a buffer onto an imported memory object.

See also

Parameters:
  • devPtr[out] Returned device pointer to buffer

  • extMem[in] Handle to external memory object

  • bufferDesc[in] Buffer descriptor

Returns:

hipSuccess, hipErrorInvalidDevice, hipErrorInvalidValue

hipError_t hipDestroyExternalMemory(hipExternalMemory_t extMem)#

Destroys an external memory object.

See also

Parameters:

extMem[in] External memory object to be destroyed

Returns:

hipSuccess, hipErrorInvalidDevice, hipErrorInvalidValue

hipError_t hipExternalMemoryGetMappedMipmappedArray(hipMipmappedArray_t *mipmap, hipExternalMemory_t extMem, const hipExternalMemoryMipmappedArrayDesc *mipmapDesc)#

Maps a mipmapped array onto an external memory object.

Returned mipmapped array must be freed using hipFreeMipmappedArray.

Parameters:
  • mipmap[out] mipmapped array to return

  • extMem[in] external memory object handle

  • mipmapDesc[in] external mipmapped array descriptor

Returns:

hipSuccess, hipErrorInvalidValue, hipErrorInvalidResourceHandle

Register Keyword#

The register keyword is deprecated in C++, and is silently ignored by both NVCC and HIP-Clang. You can pass the option -Wdeprecated-register the compiler warning message.

Pragma Unroll#

Unroll with a bounds that is known at compile-time is supported. For example:

#pragma unroll 16 /* hint to compiler to unroll next loop by 16 */
for (int i=0; i<16; i++) ...
#pragma unroll 1 /* tell compiler to never unroll the loop */
for (int i=0; i<16; i++) ...
#pragma unroll /* hint to compiler to completely unroll next loop. */
for (int i=0; i<16; i++) ...

In-Line Assembly#

GCN ISA In-line assembly, is supported. For example:

asm volatile ("v_mac_f32_e32 %0, %2, %3" : "=v" (out[i]) : "0"(out[i]), "v" (a), "v" (in[i]));

We insert the GCN isa into the kernel using asm() Assembler statement. volatile keyword is used so that the optimizers must not change the number of volatile operations or change their order of execution relative to other volatile operations. v_mac_f32_e32 is the GCN instruction, for more information please refer - [AMD GCN3 ISA architecture manual](http://gpuopen.com/compute-product/amd-gcn3-isa-architecture-manual/) Index for the respective operand in the ordered fashion is provided by % followed by position in the list of operands “v” is the constraint code (for target-specific AMDGPU) for 32-bit VGPR register, for more info please refer - [Supported Constraint Code List for AMDGPU](https://llvm.org/docs/LangRef.html#supported-constraint-code-list) Output Constraints are specified by an “=” prefix as shown above (“=v”). This indicate that assembly will write to this operand, and the operand will then be made available as a return value of the asm expression. Input constraints do not have a prefix - just the constraint code. The constraint string of “0” says to use the assigned register for output as an input as well (it being the 0’th constraint).

## C++ Support The following C++ features are not supported: - Run-time-type information (RTTI) - Try/catch - Virtual functions Virtual functions are not supported if objects containing virtual function tables are passed between GPU’s of different offload arch’s, e.g. between gfx906 and gfx1030. Otherwise virtual functions are supported.

Kernel Compilation#

hipcc now supports compiling C++/HIP kernels to binary code objects. The file format for binary is .co which means Code Object. The following command builds the code object using hipcc.

hipcc --genco --offload-arch=[TARGET GPU] [INPUT FILE] -o [OUTPUT FILE]

[TARGET GPU] = GPU architecture
[INPUT FILE] = Name of the file containing kernels
[OUTPUT FILE] = Name of the generated code object file

Note

When using binary code objects is that the number of arguments to the kernel is different on HIP-Clang and NVCC path. Refer to the HIP module_api sample for differences in the arguments to be passed to the kernel.

gfx-arch-specific-kernel#

Clang defined ‘__gfx*__’ macros can be used to execute gfx arch specific codes inside the kernel. Refer to the sample in HIP 14_gpu_arch sample.