C++ Language Extensions#
HIP provides a C++ syntax that is suitable for compiling most code that commonly appears in compute kernels (classes, namespaces, operator overloading, and templates). HIP also defines other language features that are designed to target accelerators, such as:
A kernel-launch syntax that uses standard C++ (this resembles a function call and is portable to all HIP targets)
Short-vector headers that can serve on a host or device
Math functions that resemble those in
math.h
, which is included with standard C++ compilersBuilt-in functions for accessing specific GPU hardware capabilities
Note
This chapter describes the built-in variables and functions that are accessible from the HIP kernel. It’s intended for users who are familiar with CUDA kernel syntax and want to learn how HIP differs from CUDA.
Features are labeled with one of the following keywords:
Supported: HIP supports the feature with a CUDA-equivalent function
Not supported: HIP does not support the feature
Under development: The feature is under development and not yet available
Function-type qualifiers#
__device__
#
Supported __device__
functions are:
Run on the device
Called from the device only
You can combine __device__
with the host keyword (__host__).
__global__
#
Supported __global__
functions are:
Run on the device
Called (launched) from the host
HIP __global__
functions must have a void
return type. The first parameter in a HIP __global__
function must have the type hipLaunchParm
. Refer to Kernel launch example to see usage.
HIP doesn’t support dynamic-parallelism, which means that you can’t call __global__
functions from
the device.
__host__
#
Supported __host__
functions are:
Run on the host
Called from the host
You can combine __host__
with __device__
; in this case, the function compiles for the host and the
device. Note that these functions can’t use the HIP grid coordinate functions (e.g., threadIdx.x
). If
you need to use HIP grid coordinate functions, you can pass the necessary coordinate information as
an argument.
You can’t combine __host__
with __global__
.
HIP parses the __noinline__
and __forceinline__
keywords and converts them into the appropriate
Clang attributes.
Calling __global__
functions#
__global__ functions are often referred to as kernels. When you call a global function, you’re launching a kernel. When launching a kernel, you must specify an execution configuration that includes the grid and block dimensions. The execution configuration can also include other information for the launch, such as the amount of additional shared memory to allocate and the stream where you want to execute the kernel.
HIP introduces a standard C++ calling convention (hipLaunchKernelGGL
) to pass the run
configuration to the kernel. However, you can also use the CUDA <<< >>>
syntax.
When using hipLaunchKernelGGL
, your first five parameters must be:
symbol kernelName
: The name of the kernel you want to launch. To support template kernels that contain","
, use theHIP_KERNEL_NAME
macro (HIPIFY tools insert this automatically).
dim3 gridDim
: 3D-grid dimensions that specify the number of blocks to launch.
dim3 blockDim
: 3D-block dimensions that specify the number of threads in each block.
size_t dynamicShared
: The amount of additional shared memory that you want to allocate when launching the kernel (see __shared__).
hipStream_t
: The stream where you want to run the kernel. A value of0
corresponds to the NULL stream (see Synchronization functions).
You can include your kernel arguments after these parameters.
// Example hipLaunchKernelGGL pseudocode:
__global__ MyKernel(hipLaunchParm lp, float *A, float *B, float *C, size_t N)
{
...
}
MyKernel<<<dim3(gridDim), dim3(groupDim), 0, 0>>> (a,b,c,n);
// Alternatively, you can launch the kernel using:
// hipLaunchKernelGGL(MyKernel, dim3(gridDim), dim3(groupDim), 0/*dynamicShared*/, 0/*stream), a, b, c, n);
You can use HIPIFY tools to convert CUDA launch syntax to hipLaunchKernelGGL
. This includes the
conversion of optional <<< >>>
arguments into the five required hipLaunchKernelGGL
parameters.
Note
HIP doesn’t support dimension sizes of \(gridDim * blockDim \ge 2^{32}\) when launching a kernel.
Kernel launch example#
// Example showing device function, __device__ __host__
// <- compile for both device and host
float PlusOne(float x)
{
return x + 1.0;
}
__global__
void
MyKernel (hipLaunchParm lp, /*lp parm for execution configuration */
const float *a, const float *b, float *c, unsigned N)
{
unsigned gid = threadIdx.x; // <- coordinate index function
if (gid < N) {
c[gid] = a[gid] + PlusOne(b[gid]);
}
}
void callMyKernel()
{
float *a, *b, *c; // initialization not shown...
unsigned N = 1000000;
const unsigned blockSize = 256;
MyKernel<<<dim3(gridDim), dim3(groupDim), 0, 0>>> (a,b,c,n);
// Alternatively, kernel can be launched by
// hipLaunchKernelGGL(MyKernel, dim3(N/blockSize), dim3(blockSize), 0, 0, a,b,c,N);
}
Variable type qualifiers#
__constant__
#
The host writes constant memory before launching the kernel. This memory is read-only from the GPU while the kernel is running. The functions for accessing constant memory are:
hipGetSymbolAddress()
hipGetSymbolSize()
hipMemcpyToSymbol()
hipMemcpyToSymbolAsync()
hipMemcpyFromSymbol()
hipMemcpyFromSymbolAsync()
__managed__
#
Managed memory, including the __managed__
keyword, is supported in HIP combined host/device
compilation.
__restrict__
#
__restrict__
tells the compiler that the associated memory pointer not to alias with any other pointer
in the kernel or function. This can help the compiler generate better code. In most use cases, every
pointer argument should use this keyword in order to achieve the benefit.
Built-in variables#
Coordinate built-ins#
The kernel uses coordinate built-ins (thread*
, block*
, grid*
) to determine the coordinate index
and bounds for the active work item.
Built-ins are defined in amd_hip_runtime.h
, rather than being implicitly defined by the compiler.
Coordinate variable definitions for built-ins are the same for HIP and CUDA. For example: threadIdx.x
,
blockIdx.y
, and gridDim.y
. The products gridDim.x * blockDim.x
, gridDim.y * blockDim.y
, and
gridDim.z * blockDim.z
are always less than 2^32
.
Coordinate built-ins are implemented as structures for improved performance. When used with
printf
, they must be explicitly cast to integer types.
warpSize
#
The warpSize
variable type is int
. It contains the warp size (in threads) for the target device.
warpSize
should only be used in device functions that develop portable wave-aware code.
Note
NVIDIA devices return 32 for this variable; AMD devices return 64 for gfx9 and 32 for gfx10 and above.
Vector types#
The following vector types are defined in hip_runtime.h
. They are not automatically provided by the
compiler.
Short vector types#
Short vector types derive from basic integer and floating-point types. These structures are defined in
hip_vector_types.h
. The first, second, third, and fourth components of the vector are defined by the
x
, y
, z
, and w
fields, respectively. All short vector types support a constructor function of the
form make_<type_name>()
. For example, float4 make_float4(float x, float y, float z, float w)
creates
a vector with type float4
and value (x,y,z,w)
.
HIP supports the following short vector formats:
Signed Integers:
char1
,char2
,char3
,char4
short1
,short2
,short3
,short4
int1
,int2
,int3
,int4
long1
,long2
,long3
,long4
longlong1
,longlong2
,longlong3
,longlong4
Unsigned Integers:
uchar1
,uchar2
,uchar3
,uchar4
ushort1
,ushort2
,ushort3
,ushort4
uint1
,uint2
,uint3
,uint4
ulong1
,ulong2
,ulong3
,ulong4
ulonglong1
,ulonglong2
,ulonglong3
,ulonglong4
Floating Points:
float1
,float2
,float3
,float4
double1
,double2
,double3
,double4
dim3#
dim3
is a three-dimensional integer vector type that is commonly used to specify grid and group
dimensions.
The dim3 constructor accepts between zero and three arguments. By default, it initializes unspecified dimensions to 1.
typedef struct dim3 {
uint32_t x;
uint32_t y;
uint32_t z;
dim3(uint32_t _x=1, uint32_t _y=1, uint32_t _z=1) : x(_x), y(_y), z(_z) {};
};
Memory fence instructions#
HIP supports __threadfence()
and __threadfence_block()
. If you’re using threadfence_system()
in
the HIP-Clang path, you can use the following workaround:
Build HIP with the
HIP_COHERENT_HOST_ALLOC
environment variable enabled.Modify kernels that use
__threadfence_system()
as follows:
Ensure the kernel operates only on fine-grained system memory, which should be allocated with
hipHostMalloc()
.Remove
memcpy
for all allocated fine-grained system memory regions.
Synchronization functions#
The __syncthreads()
built-in function is supported in HIP. The __syncthreads_count(int)
,
__syncthreads_and(int)
, and __syncthreads_or(int)
functions are under development.
Math functions#
HIP-Clang supports a set of math operations that are callable from the device. HIP supports most of the device functions supported by CUDA. These are described in the following sections.
Single precision mathematical functions#
Following is the list of supported single precision mathematical functions.
Function |
Supported on Host |
Supported on Device |
float abs(float x) Returns the absolute value of \(x\)
|
✓ |
✓ |
float acosf(float x) Returns the arc cosine of \(x\).
|
✓ |
✓ |
float acoshf(float x) Returns the nonnegative arc hyperbolic cosine of \(x\).
|
✓ |
✓ |
float asinf(float x) Returns the arc sine of \(x\).
|
✓ |
✓ |
float asinhf(float x) Returns the arc hyperbolic sine of \(x\).
|
✓ |
✓ |
float atanf(float x) Returns the arc tangent of \(x\).
|
✓ |
✓ |
float atan2f(float x, float y) Returns the arc tangent of the ratio of \(x\) and \(y\).
|
✓ |
✓ |
float atanhf(float x) Returns the arc hyperbolic tangent of \(x\).
|
✓ |
✓ |
float cbrtf(float x) Returns the cube root of \(x\).
|
✓ |
✓ |
float ceilf(float x) Returns ceiling of \(x\).
|
✓ |
✓ |
float copysignf(float x, float y) Create value with given magnitude, copying sign of second value.
|
✓ |
✓ |
float cosf(float x) Returns the cosine of \(x\).
|
✓ |
✓ |
float coshf(float x) Returns the hyperbolic cosine of \(x\).
|
✓ |
✓ |
float cospif(float x) Returns the cosine of \(\pi \cdot x\).
|
✓ |
✓ |
float cyl_bessel_i0f(float x) Returns the value of the regular modified cylindrical Bessel function of order 0 for \(x\).
|
✗ |
✗ |
float cyl_bessel_i1f(float x) Returns the value of the regular modified cylindrical Bessel function of order 1 for \(x\).
|
✗ |
✗ |
float erff(float x) Returns the error function of \(x\).
|
✓ |
✓ |
float erfcf(float x) Returns the complementary error function of \(x\).
|
✓ |
✓ |
float erfcinvf(float x) Returns the inverse complementary function of \(x\).
|
✓ |
✓ |
float erfcxf(float x) Returns the scaled complementary error function of \(x\).
|
✓ |
✓ |
float erfinvf(float x) Returns the inverse error function of \(x\).
|
✓ |
✓ |
float expf(float x) Returns \(e^x\).
|
✓ |
✓ |
float exp10f(float x) Returns \(10^x\).
|
✓ |
✓ |
float exp2f( float x) Returns \(2^x\).
|
✓ |
✓ |
float expm1f(float x) Returns \(ln(x - 1)\)
|
✓ |
✓ |
float fabsf(float x) Returns the absolute value of x
|
✓ |
✓ |
float fdimf(float x, float y) Returns the positive difference between \(x\) and \(y\).
|
✓ |
✓ |
float fdividef(float x, float y) Divide two floating point values.
|
✓ |
✓ |
float floorf(float x) Returns the largest integer less than or equal to \(x\).
|
✓ |
✓ |
float fmaf(float x, float y, float z) Returns \(x \cdot y + z\) as a single operation.
|
✓ |
✓ |
float fmaxf(float x, float y) Determine the maximum numeric value of \(x\) and \(y\).
|
✓ |
✓ |
float fminf(float x, float y) Determine the minimum numeric value of \(x\) and \(y\).
|
✓ |
✓ |
float fmodf(float x, float y) Returns the floating-point remainder of \(x / y\).
|
✓ |
✓ |
float modff(float x, float* iptr) Break down \(x\) into fractional and integral parts.
|
✓ |
✗ |
float frexpf(float x, int* nptr) Extract mantissa and exponent of \(x\).
|
✓ |
✗ |
float hypotf(float x, float y) Returns the square root of the sum of squares of \(x\) and \(y\).
|
✓ |
✓ |
int ilogbf(float x) Returns the unbiased integer exponent of \(x\).
|
✓ |
✓ |
bool isfinite(float x) Determine whether \(x\) is finite.
|
✓ |
✓ |
bool isinf(float x) Determine whether \(x\) is infinite.
|
✓ |
✓ |
bool isnan(float x) Determine whether \(x\) is a
NAN . |
✓ |
✓ |
float j0f(float x) Returns the value of the Bessel function of the first kind of order 0 for \(x\).
|
✓ |
✓ |
float j1f(float x) Returns the value of the Bessel function of the first kind of order 1 for \(x\).
|
✓ |
✓ |
float jnf(int n, float x) Returns the value of the Bessel function of the first kind of order n for \(x\).
|
✓ |
✓ |
float ldexpf(float x, int exp) Returns the natural logarithm of the absolute value of the gamma function of \(x\).
|
✓ |
✓ |
float lgammaf(float x) Returns the natural logarithm of the absolute value of the gamma function of \(x\).
|
✓ |
✗ |
long int lrintf(float x) Round \(x\) to nearest integer value.
|
✓ |
✓ |
long long int llrintf(float x) Round \(x\) to nearest integer value.
|
✓ |
✓ |
long int lroundf(float x) Round to nearest integer value.
|
✓ |
✓ |
long long int llroundf(float x) Round to nearest integer value.
|
✓ |
✓ |
float log10f(float x) Returns the base 10 logarithm of \(x\).
|
✓ |
✓ |
float log1pf(float x) Returns the natural logarithm of \(x + 1\).
|
✓ |
✓ |
float log2f(float x) Returns the base 2 logarithm of \(x\).
|
✓ |
✓ |
float logf(float x) Returns the natural logarithm of \(x\).
|
✓ |
✓ |
float logbf(float x) Returns the floating point representation of the exponent of \(x\).
|
✓ |
✓ |
float nanf(const char* tagp) Returns “Not a Number” value.
|
✗ |
✓ |
float nearbyintf(float x) Round \(x\) to the nearest integer.
|
✓ |
✓ |
float nextafterf(float x, float y) Returns next representable single-precision floating-point value after argument.
|
✓ |
✗ |
float norm3df(float x, float y, float z) Returns the square root of the sum of squares of \(x\), \(y\) and \(z\).
|
✓ |
✓ |
float norm4df(float x, float y, float z, float w) Returns the square root of the sum of squares of \(x\), \(y\), \(z\) and \(w\).
|
✓ |
✓ |
float normcdff(float y) Returns the standard normal cumulative distribution function.
|
✓ |
✓ |
float normcdfinvf(float y) Returns the inverse of the standard normal cumulative distribution function.
|
✓ |
✓ |
float normf(int dim, const float *a) Returns the square root of the sum of squares of any number of coordinates.
|
✓ |
✓ |
float powf(float x, float y) Returns \(x^y\).
|
✓ |
✓ |
float powif(float base, int iexp) Returns the value of first argument to the power of second argument.
|
✓ |
✓ |
float remainderf(float x, float y) Returns single-precision floating-point remainder.
|
✓ |
✓ |
float remquof(float x, float y, int* quo) Returns single-precision floating-point remainder and part of quotient.
|
✓ |
✓ |
float roundf(float x) Round to nearest integer value in floating-point.
|
✓ |
✓ |
float rcbrtf(float x) Returns the reciprocal cube root function.
|
✓ |
✓ |
float rhypotf(float x, float y) Returns one over the square root of the sum of squares of two arguments.
|
✓ |
✓ |
float rintf(float x) Round input to nearest integer value in floating-point.
|
✓ |
✓ |
float rnorm3df(float x, float y, float z) Returns one over the square root of the sum of squares of three coordinates of the argument.
|
✓ |
✓ |
float rnorm4df(float x, float y, float z, float w) Returns one over the square root of the sum of squares of four coordinates of the argument.
|
✓ |
✓ |
float rnormf(int dim, const float *a) Returns the reciprocal of square root of the sum of squares of any number of coordinates.
|
✓ |
✓ |
float scalblnf(float x, long int n) Scale \(x\) by \(2^n\).
|
✓ |
✓ |
float scalbnf(float x, int n) Scale \(x\) by \(2^n\).
|
✓ |
✓ |
bool signbit(float x) Return the sign bit of \(x\).
|
✓ |
✓ |
float sinf(float x) Returns the sine of \(x\).
|
✓ |
✓ |
float sinhf(float x) Returns the hyperbolic sine of \(x\).
|
✓ |
✓ |
float sinpif(float x) Returns the hyperbolic sine of \(\pi \cdot x\).
|
✓ |
✓ |
void sincosf(float x, float *sptr, float *cptr) Returns the sine and cosine of \(x\).
|
✓ |
✓ |
void sincospif(float x, float *sptr, float *cptr) Returns the sine and cosine of \(\pi \cdot x\).
|
✓ |
✓ |
float sqrtf(float x) Returns the square root of \(x\).
|
✓ |
✓ |
float rsqrtf(float x) Returns the reciprocal of the square root of \(x\).
|
✗ |
✓ |
float tanf(float x) Returns the tangent of \(x\).
|
✓ |
✓ |
float tanhf(float x) Returns the hyperbolic tangent of \(x\).
|
✓ |
✓ |
float tgammaf(float x) Returns the gamma function of \(x\).
|
✓ |
✓ |
float truncf(float x) Truncate \(x\) to the integral part.
|
✓ |
✓ |
float y0f(float x) Returns the value of the Bessel function of the second kind of order 0 for \(x\).
|
✓ |
✓ |
float y1f(float x) Returns the value of the Bessel function of the second kind of order 1 for \(x\).
|
✓ |
✓ |
float ynf(int n, float x) Returns the value of the Bessel function of the second kind of order n for \(x\).
|
✓ |
✓ |
Double precision mathematical functions#
Following is the list of supported double precision mathematical functions.
Function |
Supported on Host |
Supported on Device |
double abs(double x) Returns the absolute value of \(x\)
|
✓ |
✓ |
double acos(double x) Returns the arc cosine of \(x\).
|
✓ |
✓ |
double acosh(double x) Returns the nonnegative arc hyperbolic cosine of \(x\).
|
✓ |
✓ |
double asin(double x) Returns the arc sine of \(x\).
|
✓ |
✓ |
double asinh(double x) Returns the arc hyperbolic sine of \(x\).
|
✓ |
✓ |
double atan(double x) Returns the arc tangent of \(x\).
|
✓ |
✓ |
double atan2(double x, double y) Returns the arc tangent of the ratio of \(x\) and \(y\).
|
✓ |
✓ |
double atanh(double x) Returns the arc hyperbolic tangent of \(x\).
|
✓ |
✓ |
double cbrt(double x) Returns the cube root of \(x\).
|
✓ |
✓ |
double ceil(double x) Returns ceiling of \(x\).
|
✓ |
✓ |
double copysign(double x, double y) Create value with given magnitude, copying sign of second value.
|
✓ |
✓ |
double cos(double x) Returns the cosine of \(x\).
|
✓ |
✓ |
double cosh(double x) Returns the hyperbolic cosine of \(x\).
|
✓ |
✓ |
double cospi(double x) Returns the cosine of \(\pi \cdot x\).
|
✓ |
✓ |
double cyl_bessel_i0(double x) Returns the value of the regular modified cylindrical Bessel function of order 0 for \(x\).
|
✗ |
✗ |
double cyl_bessel_i1(double x) Returns the value of the regular modified cylindrical Bessel function of order 1 for \(x\).
|
✗ |
✗ |
double erf(double x) Returns the error function of \(x\).
|
✓ |
✓ |
double erfc(double x) Returns the complementary error function of \(x\).
|
✓ |
✓ |
double erfcinv(double x) Returns the inverse complementary function of \(x\).
|
✓ |
✓ |
double erfcx(double x) Returns the scaled complementary error function of \(x\).
|
✓ |
✓ |
double erfinv(double x) Returns the inverse error function of \(x\).
|
✓ |
✓ |
double exp(double x) Returns \(e^x\).
|
✓ |
✓ |
double exp10(double x) Returns \(10^x\).
|
✓ |
✓ |
double exp2( double x) Returns \(2^x\).
|
✓ |
✓ |
double expm1(double x) Returns \(ln(x - 1)\)
|
✓ |
✓ |
double fabs(double x) Returns the absolute value of x
|
✓ |
✓ |
double fdim(double x, double y) Returns the positive difference between \(x\) and \(y\).
|
✓ |
✓ |
double floor(double x) Returns the largest integer less than or equal to \(x\).
|
✓ |
✓ |
double fma(double x, double y, double z) Returns \(x \cdot y + z\) as a single operation.
|
✓ |
✓ |
double fmax(double x, double y) Determine the maximum numeric value of \(x\) and \(y\).
|
✓ |
✓ |
double fmin(double x, double y) Determine the minimum numeric value of \(x\) and \(y\).
|
✓ |
✓ |
double fmod(double x, double y) Returns the floating-point remainder of \(x / y\).
|
✓ |
✓ |
double modf(double x, double* iptr) Break down \(x\) into fractional and integral parts.
|
✓ |
✗ |
double frexp(double x, int* nptr) Extract mantissa and exponent of \(x\).
|
✓ |
✗ |
double hypot(double x, double y) Returns the square root of the sum of squares of \(x\) and \(y\).
|
✓ |
✓ |
int ilogb(double x) Returns the unbiased integer exponent of \(x\).
|
✓ |
✓ |
bool isfinite(double x) Determine whether \(x\) is finite.
|
✓ |
✓ |
bool isin(double x) Determine whether \(x\) is infinite.
|
✓ |
✓ |
bool isnan(double x) Determine whether \(x\) is a
NAN . |
✓ |
✓ |
double j0(double x) Returns the value of the Bessel function of the first kind of order 0 for \(x\).
|
✓ |
✓ |
double j1(double x) Returns the value of the Bessel function of the first kind of order 1 for \(x\).
|
✓ |
✓ |
double jn(int n, double x) Returns the value of the Bessel function of the first kind of order n for \(x\).
|
✓ |
✓ |
double ldexp(double x, int exp) Returns the natural logarithm of the absolute value of the gamma function of \(x\).
|
✓ |
✓ |
double lgamma(double x) Returns the natural logarithm of the absolute value of the gamma function of \(x\).
|
✓ |
✗ |
long int lrint(double x) Round \(x\) to nearest integer value.
|
✓ |
✓ |
long long int llrint(double x) Round \(x\) to nearest integer value.
|
✓ |
✓ |
long int lround(double x) Round to nearest integer value.
|
✓ |
✓ |
long long int llround(double x) Round to nearest integer value.
|
✓ |
✓ |
double log10(double x) Returns the base 10 logarithm of \(x\).
|
✓ |
✓ |
double log1p(double x) Returns the natural logarithm of \(x + 1\).
|
✓ |
✓ |
double log2(double x) Returns the base 2 logarithm of \(x\).
|
✓ |
✓ |
double log(double x) Returns the natural logarithm of \(x\).
|
✓ |
✓ |
double logb(double x) Returns the floating point representation of the exponent of \(x\).
|
✓ |
✓ |
double nan(const char* tagp) Returns “Not a Number” value.
|
✗ |
✓ |
double nearbyint(double x) Round \(x\) to the nearest integer.
|
✓ |
✓ |
double nextafter(double x, double y) Returns next representable double-precision floating-point value after argument.
|
✓ |
✓ |
double norm3d(double x, double y, double z) Returns the square root of the sum of squares of \(x\), \(y\) and \(z\).
|
✓ |
✓ |
double norm4d(double x, double y, double z, double w) Returns the square root of the sum of squares of \(x\), \(y\), \(z\) and \(w\).
|
✓ |
✓ |
double normcdf(double y) Returns the standard normal cumulative distribution function.
|
✓ |
✓ |
double normcdfinv(double y) Returns the inverse of the standard normal cumulative distribution function.
|
✓ |
✓ |
double norm(int dim, const double *a) Returns the square root of the sum of squares of any number of coordinates.
|
✓ |
✓ |
double pow(double x, double y) Returns \(x^y\).
|
✓ |
✓ |
double powi(double base, int iexp) Returns the value of first argument to the power of second argument.
|
✓ |
✓ |
double remainder(double x, double y) Returns double-precision floating-point remainder.
|
✓ |
✓ |
double remquo(double x, double y, int* quo) Returns double-precision floating-point remainder and part of quotient.
|
✓ |
✗ |
double round(double x) Round to nearest integer value in floating-point.
|
✓ |
✓ |
double rcbrt(double x) Returns the reciprocal cube root function.
|
✓ |
✓ |
double rhypot(double x, double y) Returns one over the square root of the sum of squares of two arguments.
|
✓ |
✓ |
double rint(double x) Round input to nearest integer value in floating-point.
|
✓ |
✓ |
double rnorm3d(double x, double y, double z) Returns one over the square root of the sum of squares of three coordinates of the argument.
|
✓ |
✓ |
double rnorm4d(double x, double y, double z, double w) Returns one over the square root of the sum of squares of four coordinates of the argument.
|
✓ |
✓ |
double rnorm(int dim, const double *a) Returns the reciprocal of square root of the sum of squares of any number of coordinates.
|
✓ |
✓ |
double scalbln(double x, long int n) Scale \(x\) by \(2^n\).
|
✓ |
✓ |
double scalbn(double x, int n) Scale \(x\) by \(2^n\).
|
✓ |
✓ |
bool signbit(double x) Return the sign bit of \(x\).
|
✓ |
✓ |
double sin(double x) Returns the sine of \(x\).
|
✓ |
✓ |
double sinh(double x) Returns the hyperbolic sine of \(x\).
|
✓ |
✓ |
double sinpi(double x) Returns the hyperbolic sine of \(\pi \cdot x\).
|
✓ |
✓ |
void sincos(double x, double *sptr, double *cptr) Returns the sine and cosine of \(x\).
|
✓ |
✓ |
void sincospi(double x, double *sptr, double *cptr) Returns the sine and cosine of \(\pi \cdot x\).
|
✓ |
✓ |
double sqrt(double x) Returns the square root of \(x\).
|
✓ |
✓ |
double rsqrt(double x) Returns the reciprocal of the square root of \(x\).
|
✗ |
✓ |
double tan(double x) Returns the tangent of \(x\).
|
✓ |
✓ |
double tanh(double x) Returns the hyperbolic tangent of \(x\).
|
✓ |
✓ |
double tgamma(double x) Returns the gamma function of \(x\).
|
✓ |
✓ |
double trunc(double x) Truncate \(x\) to the integral part.
|
✓ |
✓ |
double y0(double x) Returns the value of the Bessel function of the second kind of order 0 for \(x\).
|
✓ |
✓ |
double y1(double x) Returns the value of the Bessel function of the second kind of order 1 for \(x\).
|
✓ |
✓ |
double yn(int n, double x) Returns the value of the Bessel function of the second kind of order n for \(x\).
|
✓ |
✓ |
Integer intrinsics#
Following is the list of supported integer intrinsics. Note that intrinsics are supported on device only.
Function |
unsigned int __brev(unsigned int x) Reverse the bit order of a 32 bit unsigned integer.
|
unsigned long long int __brevll(unsigned long long int x) Reverse the bit order of a 64 bit unsigned integer.
|
unsigned int __byte_perm(unsigned int x, unsigned int y, unsigned int z) Return selected bytes from two 32-bit unsigned integers.
|
unsigned int __clz(int x) Return the number of consecutive high-order zero bits in 32 bit integer.
|
unsigned int __clzll(long long int x) Return the number of consecutive high-order zero bits in 64 bit integer.
|
unsigned int __ffs(int x) Find the position of least significant bit set to 1 in a 32 bit integer.
|
unsigned int __ffsll(long long int x) Find the position of least significant bit set to 1 in a 64 bit signed integer.
|
unsigned int __fns32(unsigned long long mask, unsigned int base, int offset) Find the position of the n-th set to 1 bit in a 32-bit integer.
|
unsigned int __fns64(unsigned long long int mask, unsigned int base, int offset) Find the position of the n-th set to 1 bit in a 64-bit integer.
|
unsigned int __funnelshift_l(unsigned int lo, unsigned int hi, unsigned int shift) Concatenate \(hi\) and \(lo\), shift left by shift & 31 bits, return the most significant 32 bits.
|
unsigned int __funnelshift_lc(unsigned int lo, unsigned int hi, unsigned int shift) Concatenate \(hi\) and \(lo\), shift left by min(shift, 32) bits, return the most significant 32 bits.
|
unsigned int __funnelshift_r(unsigned int lo, unsigned int hi, unsigned int shift) Concatenate \(hi\) and \(lo\), shift right by shift & 31 bits, return the least significant 32 bits.
|
unsigned int __funnelshift_rc(unsigned int lo, unsigned int hi, unsigned int shift) Concatenate \(hi\) and \(lo\), shift right by min(shift, 32) bits, return the least significant 32 bits.
|
unsigned int __hadd(int x, int y) Compute average of signed input arguments, avoiding overflow in the intermediate sum.
|
unsigned int __rhadd(int x, int y) Compute rounded average of signed input arguments, avoiding overflow in the intermediate sum.
|
unsigned int __uhadd(int x, int y) Compute average of unsigned input arguments, avoiding overflow in the intermediate sum.
|
unsigned int __urhadd (unsigned int x, unsigned int y) Compute rounded average of unsigned input arguments, avoiding overflow in the intermediate sum.
|
int __sad(int x, int y, int z) Returns \(|x - y| + z\), the sum of absolute difference.
|
unsigned int __usad(unsigned int x, unsigned int y, unsigned int z) Returns \(|x - y| + z\), the sum of absolute difference.
|
unsigned int __popc(unsigned int x) Count the number of bits that are set to 1 in a 32 bit integer.
|
unsigned int __popcll(unsigned long long int x) Count the number of bits that are set to 1 in a 64 bit integer.
|
int __mul24(int x, int y) Multiply two 24bit integers.
|
unsigned int __umul24(unsigned int x, unsigned int y) Multiply two 24bit unsigned integers.
|
int __mulhi(int x, int y) Returns the most significant 32 bits of the product of the two 32-bit integers.
|
unsigned int __umulhi(unsigned int x, unsigned int y) Returns the most significant 32 bits of the product of the two 32-bit unsigned integers.
|
long long int __mul64hi(long long int x, long long int y) Returns the most significant 64 bits of the product of the two 64-bit integers.
|
unsigned long long int __umul64hi(unsigned long long int x, unsigned long long int y) Returns the most significant 64 bits of the product of the two 64 unsigned bit integers.
|
The HIP-Clang implementation of __ffs()
and __ffsll()
contains code to add a constant +1 to produce the ffs
result format.
For the cases where this overhead is not acceptable and programmer is willing to specialize for the platform,
HIP-Clang provides __lastbit_u32_u32(unsigned int input)
and __lastbit_u32_u64(unsigned long long int input)
.
The index returned by __lastbit_
instructions starts at -1, while for ffs
the index starts at 0.
Floating-point Intrinsics#
Following is the list of supported floating-point intrinsics. Note that intrinsics are supported on device only.
Note
Only the nearest even rounding mode supported on AMD GPUs by defaults. The _rz
, _ru
and
_rd
suffixed intrinsic functions are existing in HIP AMD backend, if the
OCML_BASIC_ROUNDED_OPERATIONS
macro is defined.
Function |
float __cosf(float x) Returns the fast approximate cosine of \(x\).
|
float __exp10f(float x) Returns the fast approximate for 10 x.
|
float __expf(float x) Returns the fast approximate for e x.
|
float __fadd_rn(float x, float y) Add two floating-point values in round-to-nearest-even mode.
|
float __fdiv_rn(float x, float y) Divide two floating point values in round-to-nearest-even mode.
|
float __fmaf_rn(float x, float y, float z) Returns
x × y + z as a single operation in round-to-nearest-even mode. |
float __fmul_rn(float x, float y) Multiply two floating-point values in round-to-nearest-even mode.
|
float __frcp_rn(float x, float y) Returns
1 / x in round-to-nearest-even mode. |
float __frsqrt_rn(float x) Returns
1 / √x in round-to-nearest-even mode. |
float __fsqrt_rn(float x) Returns
√x in round-to-nearest-even mode. |
float __fsub_rn(float x, float y) Subtract two floating-point values in round-to-nearest-even mode.
|
float __log10f(float x) Returns the fast approximate for base 10 logarithm of \(x\).
|
float __log2f(float x) Returns the fast approximate for base 2 logarithm of \(x\).
|
float __logf(float x) Returns the fast approximate for natural logarithm of \(x\).
|
float __powf(float x, float y) Returns the fast approximate of x y.
|
float __saturatef(float x) Clamp \(x\) to [+0.0, 1.0].
|
float __sincosf(float x, float* sinptr, float* cosptr) Returns the fast approximate of sine and cosine of \(x\).
|
float __sinf(float x) Returns the fast approximate sine of \(x\).
|
float __tanf(float x) Returns the fast approximate tangent of \(x\).
|
Function |
double __dadd_rn(double x, double y) Add two floating-point values in round-to-nearest-even mode.
|
double __ddiv_rn(double x, double y) Divide two floating-point values in round-to-nearest-even mode.
|
double __dmul_rn(double x, double y) Multiply two floating-point values in round-to-nearest-even mode.
|
double __drcp_rn(double x, double y) Returns
1 / x in round-to-nearest-even mode. |
double __dsqrt_rn(double x) Returns
√x in round-to-nearest-even mode. |
double __dsub_rn(double x, double y) Subtract two floating-point values in round-to-nearest-even mode.
|
double __fma_rn(double x, double y, double z) Returns
x × y + z as a single operation in round-to-nearest-even mode. |
Texture functions#
The supported texture functions are listed in texture_fetch_functions.h
and
texture_indirect_functions.h
header files in the
HIP-AMD backend repository.
Texture functions are not supported on some devices. To determine if texture functions are supported
on your device, use Macro __HIP_NO_IMAGE_SUPPORT == 1
. You can query the attribute
hipDeviceAttributeImageSupport
to check if texture functions are supported in the host runtime
code.
Surface functions#
Surface functions are not supported.
Timer functions#
To read a high-resolution timer from the device, HIP provides the following built-in functions:
Returning the incremental counter value for every clock cycle on a device:
clock_t clock() long long int clock64()
The difference between the values that are returned represents the cycles used.
Returning the wall clock count at a constant frequency on the device:
long long int wall_clock64()
This can be queried using the HIP API with the
hipDeviceAttributeWallClockRate
attribute of the device in HIP application code. For example:int wallClkRate = 0; //in kilohertz HIPCHECK(hipDeviceGetAttribute(&wallClkRate, hipDeviceAttributeWallClockRate, deviceId));
Where
hipDeviceAttributeWallClockRate
is a device attribute. Note that wall clock frequency is a per-device attribute.
Atomic functions#
Atomic functions are run as read-modify-write (RMW) operations that reside in global or shared memory. No other device or thread can observe or modify the memory location during an atomic operation. If multiple instructions from different devices or threads target the same memory location, the instructions are serialized in an undefined order.
To support system scope atomic operations, you can use the HIP APIs that contain the _system
suffix.
For example:
atomicAnd
: This function is atomic and coherent within the GPU device running the functionatomicAnd_system
: This function extends the atomic operation from the GPU device to other CPUs and GPU devices in the system.
HIP supports the following atomic operations.
Function |
Supported in HIP |
Supported in CUDA |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✗ |
|
✓ |
✗ |
|
✓ |
✗ |
|
✓ |
✗ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✗ |
✓ |
|
✗ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
Unsafe floating-point atomic RMW operations#
Some HIP devices support fast atomic RMW operations on floating-point values. For example,
atomicAdd
on single- or double-precision floating-point values may generate a hardware RMW
instruction that is faster than emulating the atomic operation using an atomic compare-and-swap
(CAS) loop.
On some devices, fast atomic RMW instructions can produce results that differ from the same functions implemented with atomic CAS loops. For example, some devices will use different rounding or denormal modes, and some devices produce incorrect answers if fast floating-point atomic RMW instructions target fine-grained memory allocations.
The HIP-Clang compiler offers a compile-time option, so you can choose fast–but potentially
unsafe–atomic instructions for your code. On devices that support these instructions, you can include
the -munsafe-fp-atomics
option. This flag indicates to the compiler that all floating-point atomic
function calls are allowed to use an unsafe version, if one exists. For example, on some devices, this
flag indicates to the compiler that no floating-point atomicAdd
function can target fine-grained
memory.
If you want to avoid using unsafe use a floating-point atomic RMW operations, you can use the
-mno-unsafe-fp-atomics
option. Note that the compiler default is to not produce unsafe
floating-point atomic RMW instructions, so the -mno-unsafe-fp-atomics
option is not necessarily
required. However, passing this option to the compiler is good practice.
When you pass -munsafe-fp-atomics
or -mno-unsafe-fp-atomics
to the compiler’s command line,
the option is applied globally for the entire compilation. Note that if some of the atomic RMW function
calls cannot safely use the faster floating-point atomic RMW instructions, you must use
-mno-unsafe-fp-atomics
in order to ensure that your atomic RMW function calls produce correct
results.
HIP has four extra functions that you can use to more precisely control which floating-point atomic RMW functions produce unsafe atomic RMW instructions:
float unsafeAtomicAdd(float* address, float val)
double unsafeAtomicAdd(double* address, double val)
(Always produces fast atomic RMW instructions on devices that have them, even when-mno-unsafe-fp-atomics
is used)float safeAtomicAdd(float* address, float val)
double safeAtomicAdd(double* address, double val)
(Always produces safe atomic RMW operations, even when-munsafe-fp-atomics
is used)
Warp cross-lane functions#
Threads in a warp are referred to as lanes
and are numbered from 0
to warpSize - 1
.
Warp cross-lane functions operate across all lanes in a warp. The hardware guarantees that all warp
lanes will execute in lockstep, so additional synchronization is unnecessary, and the instructions
use no shared memory.
Note that NVIDIA and AMD devices have different warp sizes. You can use warpSize
built-ins in you
portable code to query the warp size.
Tip
Be sure to review HIP code generated from the CUDA path to ensure that it doesn’t assume a
waveSize
of 32. “Wave-aware” code that assumes a waveSize
of 32 can run on a wave-64
machine, but it only utilizes half of the machine’s resources.
To get the default warp size of a GPU device, use hipGetDeviceProperties
in you host functions.
cudaDeviceProp props;
cudaGetDeviceProperties(&props, deviceID);
int w = props.warpSize;
// implement portable algorithm based on w (rather than assume 32 or 64)
Only use warpSize
built-ins in device functions, and don’t assume warpSize
to be a compile-time
constant.
Note that assembly kernels may be built for a warp size that is different from the default. All mask values either returned or accepted by these builtins are 64-bit unsigned integer values, even when compiled for a wave-32 device, where all the higher bits are unused. CUDA code ported to HIP requires changes to ensure that the correct type is used.
Note that the __sync
variants are made available in ROCm 6.2, but disabled by
default to help with the transition to 64-bit masks. They can be enabled by
setting the preprocessor macro HIP_ENABLE_WARP_SYNC_BUILTINS
. These builtins
will be enabled unconditionally in ROCm 6.3. Wherever possible, the
implementation includes a static assert to check that the program source uses
the correct type for the mask.
Warp vote and ballot functions#
int __all(int predicate)
int __any(int predicate)
unsigned long long __ballot(int predicate)
unsigned long long __activemask()
int __all_sync(unsigned long long mask, int predicate)
int __any_sync(unsigned long long mask, int predicate)
int __ballot(unsigned long long mask, int predicate)
You can use __any
and __all
to get a summary view of the predicates evaluated by the
participating lanes.
__any()
: Returns 1 if the predicate is non-zero for any participating lane, otherwise it returns 0.__all()
: Returns 1 if the predicate is non-zero for all participating lanes, otherwise it returns 0.
To determine if the target platform supports the any/all instruction, you can use the hasWarpVote
device property or the HIP_ARCH_HAS_WARP_VOTE
compiler definition.
__ballot
returns a bit mask containing the 1-bit predicate value from each
lane. The nth bit of the result contains the 1 bit contributed by the nth warp
lane.
__activemask()
returns a bit mask of currently active warp lanes. The nth bit
of the result is 1 if the nth warp lane is active.
Note that the __ballot
and __activemask
builtins in HIP have a 64-bit return
value (unlike the 32-bit value returned by the CUDA builtins). Code ported from
CUDA should be adapted to support the larger warp sizes that the HIP version
requires.
Applications can test whether the target platform supports the __ballot
or
__activemask
instructions using the hasWarpBallot
device property in host
code or the HIP_ARCH_HAS_WARP_BALLOT
macro defined by the compiler for device
code.
The _sync
variants require a 64-bit unsigned integer mask argument that
specifies the lanes in the warp that will participate in cross-lane
communication with the calling lane. Each participating thread must have its own
bit set in its mask argument, and all active threads specified in any mask
argument must execute the same call with the same mask, otherwise the result is
undefined.
Warp match functions#
unsigned long long __match_any(T value)
unsigned long long __match_all(T value, int *pred)
unsigned long long __match_any_sync(unsigned long long mask, T value)
unsigned long long __match_all_sync(unsigned long long mask, T value, int *pred)
T
can be a 32-bit integer type, 64-bit integer type or a single precision or
double precision floating point type.
__match_any
returns a bit mask containing a 1-bit for every participating lane
if and only if that lane has the same value in value
as the current lane, and
a 0-bit for all other lanes.
__match_all
returns a bit mask containing a 1-bit for every participating lane
if and only if they all have the same value in value
as the current lane, and
a 0-bit for all other lanes. The predicate pred
is set to true if and only if
all participating threads have the same value in value
.
The _sync
variants require a 64-bit unsigned integer mask argument that
specifies the lanes in the warp that will participate in cross-lane
communication with the calling lane. Each participating thread must have its own
bit set in its mask argument, and all active threads specified in any mask
argument must execute the same call with the same mask, otherwise the result is
undefined.
Warp shuffle functions#
The default width is warpSize
(see Warp cross-lane functions). Half-float shuffles are not supported.
T __shfl (T var, int srcLane, int width=warpSize);
T __shfl_up (T var, unsigned int delta, int width=warpSize);
T __shfl_down (T var, unsigned int delta, int width=warpSize);
T __shfl_xor (T var, int laneMask, int width=warpSize);
T __shfl_sync (unsigned long long mask, T var, int srcLane, int width=warpSize);
T __shfl_up_sync (unsigned long long mask, T var, unsigned int delta, int width=warpSize);
T __shfl_down_sync (unsigned long long mask, T var, unsigned int delta, int width=warpSize);
T __shfl_xor_sync (unsigned long long mask, T var, int laneMask, int width=warpSize);
T
can be a 32-bit integer type, 64-bit integer type or a single precision or
double precision floating point type.
The _sync
variants require a 64-bit unsigned integer mask argument that
specifies the lanes in the warp that will participate in cross-lane
communication with the calling lane. Each participating thread must have its own
bit set in its mask argument, and all active threads specified in any mask
argument must execute the same call with the same mask, otherwise the result is
undefined.
Cooperative groups functions#
You can use cooperative groups to synchronize groups of threads. Cooperative groups also provide a way of communicating between groups of threads at a granularity that is different from the block.
HIP supports the following kernel language cooperative groups types and functions:
Function |
Supported in HIP |
Supported in CUDA |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
|
✓ |
✓ |
Warp matrix functions#
Warp matrix functions allow a warp to cooperatively operate on small matrices that have elements spread over lanes in an unspecified manner.
HIP does not support kernel language warp matrix types or functions.
Function |
Supported in HIP |
Supported in CUDA |
|
✗ |
✓ |
|
✗ |
✓ |
|
✗ |
✓ |
|
✗ |
✓ |
|
✗ |
✓ |
Independent thread scheduling#
Certain architectures that support CUDA allow threads to progress independently of each other. This independent thread scheduling makes intra-warp synchronization possible.
HIP does not support this type of scheduling.
Profiler Counter Function#
The CUDA __prof_trigger()
instruction is not supported.
Assert#
The assert function is supported in HIP. Assert function is used for debugging purpose, when the input expression equals to zero, the execution will be stopped.
void assert(int input)
There are two kinds of implementations for assert functions depending on the use sceneries,
- One is for the host version of assert, which is defined in assert.h
,
- Another is the device version of assert, which is implemented in hip/hip_runtime.h
.
Users need to include assert.h
to use assert
. For assert to work in both device and host functions, users need to include "hip/hip_runtime.h"
.
HIP provides the function abort()
which can be used to terminate the application when terminal failures are detected. It is implemented using the __builtin_trap()
function.
This function produces a similar effect of using asm("trap")
in the CUDA code.
Note
In HIP, the function terminates the entire application, while in CUDA, asm("trap")
only terminates the dispatch and the application continues to run.
printf
#
printf
function is supported in HIP.
The following is a simple example to print information in the kernel.
#include <hip/hip_runtime.h>
__global__ void run_printf() { printf("Hello World\n"); }
int main() {
run_printf<<<dim3(1), dim3(1), 0, 0>>>();
}
Device-Side Dynamic Global Memory Allocation#
Device-side dynamic global memory allocation is under development. HIP now includes a preliminary implementation of malloc and free that can be called from device functions.
__launch_bounds__
#
GPU multiprocessors have a fixed pool of resources (primarily registers and shared memory) which are shared by the actively running warps. Using more resources can increase IPC of the kernel but reduces the resources available for other warps and limits the number of warps that can be simultaneously running. Thus GPUs have a complex relationship between resource usage and performance.
__launch_bounds__
allows the application to provide usage hints that influence the resources (primarily registers) used by the generated code. It is a function attribute that must be attached to a __global__ function:
__global__ void __launch_bounds__(MAX_THREADS_PER_BLOCK, MIN_WARPS_PER_EXECUTION_UNIT)
MyKernel(hipGridLaunch lp, ...)
...
__launch_bounds__
supports two parameters:
- MAX_THREADS_PER_BLOCK - The programmers guarantees that kernel will be launched with threads less than MAX_THREADS_PER_BLOCK. (On NVCC this maps to the .maxntid
PTX directive). If no launch_bounds is specified, MAX_THREADS_PER_BLOCK is the maximum block size supported by the device (typically 1024 or larger). Specifying MAX_THREADS_PER_BLOCK less than the maximum effectively allows the compiler to use more resources than a default unconstrained compilation that supports all possible block sizes at launch time.
The threads-per-block is the product of (blockDim.x * blockDim.y * blockDim.z
).
- MIN_WARPS_PER_EXECUTION_UNIT - directs the compiler to minimize resource usage so that the requested number of warps can be simultaneously active on a multi-processor. Since active warps compete for the same fixed pool of resources, the compiler must reduce resources required by each warp(primarily registers). MIN_WARPS_PER_EXECUTION_UNIT is optional and defaults to 1 if not specified. Specifying a MIN_WARPS_PER_EXECUTION_UNIT greater than the default 1 effectively constrains the compiler’s resource usage.
When launch kernel with HIP APIs, for example, hipModuleLaunchKernel()
, HIP will do validation to make sure input kernel dimension size is not larger than specified launch_bounds.
In case exceeded, HIP would return launch failure, if AMD_LOG_LEVEL is set with proper value (for details, please refer to docs/markdown/hip_logging.md
), detail information will be shown in the error log message, including
launch parameters of kernel dim size, launch bounds, and the name of the faulting kernel. It’s helpful to figure out which is the faulting kernel, besides, the kernel dim size and launch bounds values will also assist in debugging such failures.
Compiler Impact#
The compiler uses these parameters as follows: - The compiler uses the hints only to manage register usage, and does not automatically reduce shared memory or other resources. - Compilation fails if compiler cannot generate a kernel which meets the requirements of the specified launch bounds. - From MAX_THREADS_PER_BLOCK, the compiler derives the maximum number of warps/block that can be used at launch time. Values of MAX_THREADS_PER_BLOCK less than the default allows the compiler to use a larger pool of registers : each warp uses registers, and this hint constrains the launch to a warps/block size which is less than maximum. - From MIN_WARPS_PER_EXECUTION_UNIT, the compiler derives a maximum number of registers that can be used by the kernel (to meet the required #simultaneous active blocks). If MIN_WARPS_PER_EXECUTION_UNIT is 1, then the kernel can use all registers supported by the multiprocessor. - The compiler ensures that the registers used in the kernel is less than both allowed maximums, typically by spilling registers (to shared or global memory), or by using more instructions. - The compiler may use heuristics to increase register usage, or may simply be able to avoid spilling. The MAX_THREADS_PER_BLOCK is particularly useful in this cases, since it allows the compiler to use more registers and avoid situations where the compiler constrains the register usage (potentially spilling) to meet the requirements of a large block size that is never used at launch time.
CU and EU Definitions#
A compute unit (CU) is responsible for executing the waves of a work-group. It is composed of one or more execution units (EU) which are responsible for executing waves. An EU can have enough resources to maintain the state of more than one executing wave. This allows an EU to hide latency by switching between waves in a similar way to symmetric multithreading on a CPU. In order to allow the state for multiple waves to fit on an EU, the resources used by a single wave have to be limited. Limiting such resources can allow greater latency hiding, but can result in having to spill some register state to memory. This attribute allows an advanced developer to tune the number of waves that are capable of fitting within the resources of an EU. It can be used to ensure at least a certain number will fit to help hide latency, and can also be used to ensure no more than a certain number will fit to limit cache thrashing.
Porting from CUDA __launch_bounds
#
CUDA defines a __launch_bounds
which is also designed to control occupancy:
__launch_bounds(MAX_THREADS_PER_BLOCK, MIN_BLOCKS_PER_MULTIPROCESSOR)
The second parameter
__launch_bounds
parameters must be converted to the format used __hip_launch_bounds, which uses warps and execution-units rather than blocks and multi-processors (this conversion is performed automatically by HIPIFY tools).
MIN_WARPS_PER_EXECUTION_UNIT = (MIN_BLOCKS_PER_MULTIPROCESSOR * MAX_THREADS_PER_BLOCK) / 32
The key differences in the interface are:
- Warps (rather than blocks):
The developer is trying to tell the compiler to control resource utilization to guarantee some amount of active Warps/EU for latency hiding. Specifying active warps in terms of blocks appears to hide the micro-architectural details of the warp size, but makes the interface more confusing since the developer ultimately needs to compute the number of warps to obtain the desired level of control.
- Execution Units (rather than multiprocessor):
The use of execution units rather than multiprocessors provides support for architectures with multiple execution units/multi-processor. For example, the AMD GCN architecture has 4 execution units per multiprocessor. The hipDeviceProps
has a field executionUnitsPerMultiprocessor
.
Platform-specific coding techniques such as #ifdef
can be used to specify different launch_bounds for NVCC and HIP-Clang platforms, if desired.
maxregcount
#
Unlike NVCC, HIP-Clang does not support the --maxregcount
option. Instead, users are encouraged to use the hip_launch_bounds directive since the parameters are more intuitive and portable than
micro-architecture details like registers, and also the directive allows per-kernel control rather than an entire file. hip_launch_bounds works on both HIP-Clang and NVCC targets.
Asynchronous Functions#
Memory stream#
-
typedef void (*hipStreamCallback_t)(hipStream_t stream, hipError_t status, void *userData)#
Stream CallBack struct
-
hipError_t hipStreamCreate(hipStream_t *stream)#
Create an asynchronous stream.
Create a new asynchronous stream.
stream
returns an opaque handle that can be used to reference the newly created stream in subsequent hipStream* commands. The stream is allocated on the heap and will remain allocated even if the handle goes out-of-scope. To release the memory used by the stream, application must call hipStreamDestroy.See also
hipStreamCreateWithFlags, hipStreamCreateWithPriority, hipStreamSynchronize, hipStreamWaitEvent, hipStreamDestroy
- Parameters:
stream – [inout] Valid pointer to hipStream_t. This function writes the memory with the newly created stream.
- Returns:
hipSuccess, hipErrorInvalidValue
- Returns:
hipSuccess, hipErrorInvalidValue
-
hipError_t hipStreamCreateWithFlags(hipStream_t *stream, unsigned int flags)#
Create an asynchronous stream.
Create a new asynchronous stream.
stream
returns an opaque handle that can be used to reference the newly created stream in subsequent hipStream* commands. The stream is allocated on the heap and will remain allocated even if the handle goes out-of-scope. To release the memory used by the stream, application must call hipStreamDestroy. Flags controls behavior of the stream. See hipStreamDefault, hipStreamNonBlocking.See also
hipStreamCreate, hipStreamCreateWithPriority, hipStreamSynchronize, hipStreamWaitEvent, hipStreamDestroy
- Parameters:
stream – [inout] Pointer to new stream
flags – [in] to control stream creation.
- Returns:
hipSuccess, hipErrorInvalidValue
-
hipError_t hipStreamCreateWithPriority(hipStream_t *stream, unsigned int flags, int priority)#
Create an asynchronous stream with the specified priority.
Create a new asynchronous stream with the specified priority.
stream
returns an opaque handle that can be used to reference the newly created stream in subsequent hipStream* commands. The stream is allocated on the heap and will remain allocated even if the handle goes out-of-scope. To release the memory used by the stream, application must call hipStreamDestroy. Flags controls behavior of the stream. See hipStreamDefault, hipStreamNonBlocking.- Parameters:
stream – [inout] Pointer to new stream
flags – [in] to control stream creation.
priority – [in] of the stream. Lower numbers represent higher priorities.
- Returns:
hipSuccess, hipErrorInvalidValue
-
hipError_t hipDeviceGetStreamPriorityRange(int *leastPriority, int *greatestPriority)#
Returns numerical values that correspond to the least and greatest stream priority.
Returns in *leastPriority and *greatestPriority the numerical values that correspond to the least and greatest stream priority respectively. Stream priorities follow a convention where lower numbers imply greater priorities. The range of meaningful stream priorities is given by [*greatestPriority, *leastPriority]. If the user attempts to create a stream with a priority value that is outside the meaningful range as specified by this API, the priority is automatically clamped to within the valid range.
- Parameters:
leastPriority – [inout] pointer in which value corresponding to least priority is returned.
greatestPriority – [inout] pointer in which value corresponding to greatest priority is returned.
- Returns:
hipSuccess
-
hipError_t hipStreamDestroy(hipStream_t stream)#
Destroys the specified stream.
Destroys the specified stream.
If commands are still executing on the specified stream, some may complete execution before the queue is deleted.
The queue may be destroyed while some commands are still inflight, or may wait for all commands queued to the stream before destroying it.
See also
hipStreamCreate, hipStreamCreateWithFlags, hipStreamCreateWithPriority, hipStreamQuery, hipStreamWaitEvent, hipStreamSynchronize
- Parameters:
stream – [in] stream identifier.
- Returns:
hipSuccess hipErrorInvalidHandle
-
hipError_t hipStreamQuery(hipStream_t stream)#
Return hipSuccess if all of the operations in the specified
stream
have completed, or hipErrorNotReady if not.This is thread-safe and returns a snapshot of the current state of the queue. However, if other host threads are sending work to the stream, the status may change immediately after the function is called. It is typically used for debug.
See also
hipStreamCreate, hipStreamCreateWithFlags, hipStreamCreateWithPriority, hipStreamWaitEvent, hipStreamSynchronize, hipStreamDestroy
- Parameters:
stream – [in] stream to query
- Returns:
hipSuccess, hipErrorNotReady, hipErrorInvalidHandle
-
hipError_t hipStreamSynchronize(hipStream_t stream)#
Wait for all commands in stream to complete.
This command is host-synchronous : the host will block until the specified stream is empty.
This command follows standard null-stream semantics. Specifically, specifying the null stream will cause the command to wait for other streams on the same device to complete all pending operations.
This command honors the hipDeviceLaunchBlocking flag, which controls whether the wait is active or blocking.
See also
hipStreamCreate, hipStreamCreateWithFlags, hipStreamCreateWithPriority, hipStreamWaitEvent, hipStreamDestroy
- Parameters:
stream – [in] stream identifier.
- Returns:
hipSuccess, hipErrorInvalidHandle
-
hipError_t hipStreamWaitEvent(hipStream_t stream, hipEvent_t event, unsigned int flags)#
Make the specified compute stream wait for an event.
This function inserts a wait operation into the specified stream. All future work submitted to
stream
will wait untilevent
reports completion before beginning execution.This function only waits for commands in the current stream to complete. Notably, this function does not implicitly wait for commands in the default stream to complete, even if the specified stream is created with hipStreamNonBlocking = 0.
See also
hipStreamCreate, hipStreamCreateWithFlags, hipStreamCreateWithPriority, hipStreamSynchronize, hipStreamDestroy
- Parameters:
stream – [in] stream to make wait.
event – [in] event to wait on
flags – [in] control operation [must be 0]
- Returns:
hipSuccess, hipErrorInvalidHandle
-
hipError_t hipStreamGetFlags(hipStream_t stream, unsigned int *flags)#
Return flags associated with this stream.
Return flags associated with this stream in *
flags
.See also
- Parameters:
stream – [in] stream to be queried
flags – [inout] Pointer to an unsigned integer in which the stream’s flags are returned
- Returns:
hipSuccess, hipErrorInvalidValue, hipErrorInvalidHandle
- Returns:
hipSuccess hipErrorInvalidValue hipErrorInvalidHandle
-
hipError_t hipStreamGetPriority(hipStream_t stream, int *priority)#
Query the priority of a stream.
Query the priority of a stream. The priority is returned in in priority.
See also
- Parameters:
stream – [in] stream to be queried
priority – [inout] Pointer to an unsigned integer in which the stream’s priority is returned
- Returns:
hipSuccess, hipErrorInvalidValue, hipErrorInvalidHandle
- Returns:
hipSuccess hipErrorInvalidValue hipErrorInvalidHandle
-
hipError_t hipStreamGetDevice(hipStream_t stream, hipDevice_t *device)#
Get the device assocaited with the stream.
- Parameters:
stream – [in] stream to be queried
device – [out] device associated with the stream
- Returns:
hipSuccess, hipErrorInvalidValue, hipErrorContextIsDestroyed, hipErrorInvalidHandle, hipErrorNotInitialized, hipErrorDeinitialized, hipErrorInvalidContext
-
hipError_t hipExtStreamCreateWithCUMask(hipStream_t *stream, uint32_t cuMaskSize, const uint32_t *cuMask)#
Create an asynchronous stream with the specified CU mask.
Create a new asynchronous stream with the specified CU mask.
stream
returns an opaque handle that can be used to reference the newly created stream in subsequent hipStream* commands. The stream is allocated on the heap and will remain allocated even if the handle goes out-of-scope. To release the memory used by the stream, application must call hipStreamDestroy.- Parameters:
stream – [inout] Pointer to new stream
cuMaskSize – [in] Size of CU mask bit array passed in.
cuMask – [in] Bit-vector representing the CU mask. Each active bit represents using one CU. The first 32 bits represent the first 32 CUs, and so on. If its size is greater than physical CU number (i.e., multiProcessorCount member of hipDeviceProp_t), the extra elements are ignored. It is user’s responsibility to make sure the input is meaningful.
- Returns:
hipSuccess, hipErrorInvalidHandle, hipErrorInvalidValue
-
hipError_t hipExtStreamGetCUMask(hipStream_t stream, uint32_t cuMaskSize, uint32_t *cuMask)#
Get CU mask associated with an asynchronous stream.
- Parameters:
stream – [in] stream to be queried
cuMaskSize – [in] number of the block of memories (uint32_t *) allocated by user
cuMask – [out] Pointer to a pre-allocated block of memories (uint32_t *) in which the stream’s CU mask is returned. The CU mask is returned in a chunck of 32 bits where each active bit represents one active CU
- Returns:
hipSuccess, hipErrorInvalidHandle, hipErrorInvalidValue
-
hipError_t hipStreamAddCallback(hipStream_t stream, hipStreamCallback_t callback, void *userData, unsigned int flags)#
Adds a callback to be called on the host after all currently enqueued items in the stream have completed. For each hipStreamAddCallback call, a callback will be executed exactly once. The callback will block later work in the stream until it is finished.
See also
hipStreamCreate, hipStreamCreateWithFlags, hipStreamQuery, hipStreamSynchronize, hipStreamWaitEvent, hipStreamDestroy, hipStreamCreateWithPriority
- Parameters:
stream – [in] - Stream to add callback to
callback – [in] - The function to call once preceding stream operations are complete
userData – [in] - User specified data to be passed to the callback function
flags – [in] - Reserved for future use, must be 0
- Returns:
hipSuccess, hipErrorInvalidHandle, hipErrorNotSupported
-
hipError_t hipMallocAsync(void **dev_ptr, size_t size, hipStream_t stream)#
Allocates memory with stream ordered semantics.
Inserts a memory allocation operation into
stream
. A pointer to the allocated memory is returned immediately in *dptr. The allocation must not be accessed until the allocation operation completes. The allocation comes from the memory pool associated with the stream’s device.See also
hipMallocFromPoolAsync, hipFreeAsync, hipMemPoolTrimTo, hipMemPoolGetAttribute, hipDeviceSetMemPool, hipMemPoolSetAttribute, hipMemPoolSetAccess, hipMemPoolGetAccess
Note
The default memory pool of a device contains device memory from that device.
Note
Basic stream ordering allows future work submitted into the same stream to use the allocation. Stream query, stream synchronize, and HIP events can be used to guarantee that the allocation operation completes before work submitted in a separate stream runs.
Note
During stream capture, this function results in the creation of an allocation node. In this case, the allocation is owned by the graph instead of the memory pool. The memory pool’s properties are used to set the node’s creation parameters.
Note
This API is implemented on Linux, under development on Windows.
Warning
: This API is marked as beta, meaning, while this is feature complete, it is still open to changes and may have outstanding issues.
- Parameters:
dev_ptr – [out] Returned device pointer of memory allocation
size – [in] Number of bytes to allocate
stream – [in] The stream establishing the stream ordering contract and the memory pool to allocate from
- Returns:
hipSuccess, hipErrorInvalidValue, hipErrorNotSupported, hipErrorOutOfMemory
-
hipError_t hipFreeAsync(void *dev_ptr, hipStream_t stream)#
Frees memory with stream ordered semantics.
Inserts a free operation into
stream
. The allocation must not be used after stream execution reaches the free. After this API returns, accessing the memory from any subsequent work launched on the GPU or querying its pointer attributes results in undefined behavior.See also
hipMallocFromPoolAsync, hipMallocAsync, hipMemPoolTrimTo, hipMemPoolGetAttribute, hipDeviceSetMemPool, hipMemPoolSetAttribute, hipMemPoolSetAccess, hipMemPoolGetAccess
Note
During stream capture, this function results in the creation of a free node and must therefore be passed the address of a graph allocation.
Note
This API is implemented on Linux, under development on Windows.
Warning
: This API is marked as beta, meaning, while this is feature complete, it is still open to changes and may have outstanding issues.
- Parameters:
dev_ptr – [in] Pointer to device memory to free
stream – [in] The stream, where the destruciton will occur according to the execution order
- Returns:
hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-
hipError_t hipMemPoolTrimTo(hipMemPool_t mem_pool, size_t min_bytes_to_hold)#
Releases freed memory back to the OS.
Releases memory back to the OS until the pool contains fewer than
min_bytes_to_keep
reserved bytes, or there is no more memory that the allocator can safely release. The allocator cannot release OS allocations that back outstanding asynchronous allocations. The OS allocations may happen at different granularity from the user allocations.See also
hipMallocFromPoolAsync, hipMallocAsync, hipFreeAsync, hipMemPoolGetAttribute, hipDeviceSetMemPool, hipMemPoolSetAttribute, hipMemPoolSetAccess, hipMemPoolGetAccess
Note
: Allocations that have not been freed count as outstanding.
Note
: Allocations that have been asynchronously freed but whose completion has not been observed on the host (eg. by a synchronize) can count as outstanding.
Note
This API is implemented on Linux, under development on Windows.
Warning
: This API is marked as beta, meaning, while this is feature complete, it is still open to changes and may have outstanding issues.
- Parameters:
mem_pool – [in] The memory pool to trim allocations
min_bytes_to_hold – [in] If the pool has less than min_bytes_to_hold reserved, then the TrimTo operation is a no-op. Otherwise the memory pool will contain at least min_bytes_to_hold bytes reserved after the operation.
- Returns:
hipSuccess, hipErrorInvalidValue
-
hipError_t hipMemPoolSetAttribute(hipMemPool_t mem_pool, hipMemPoolAttr attr, void *value)#
Sets attributes of a memory pool.
Supported attributes are:
hipMemPoolAttrReleaseThreshold:
(value type = cuuint64_t) Amount of reserved memory in bytes to hold onto before trying to release memory back to the OS. When more than the release threshold bytes of memory are held by the memory pool, the allocator will try to release memory back to the OS on the next call to stream, event or context synchronize. (default 0)hipMemPoolReuseFollowEventDependencies:
(value type = int) AllowhipMallocAsync
to use memory asynchronously freed in another stream as long as a stream ordering dependency of the allocating stream on the free action exists. HIP events and null stream interactions can create the required stream ordered dependencies. (default enabled)hipMemPoolReuseAllowOpportunistic:
(value type = int) Allow reuse of already completed frees when there is no dependency between the free and allocation. (default enabled)hipMemPoolReuseAllowInternalDependencies:
(value type = int) AllowhipMallocAsync
to insert new stream dependencies in order to establish the stream ordering required to reuse a piece of memory released byhipFreeAsync
(default enabled).
See also
hipMallocFromPoolAsync, hipMallocAsync, hipFreeAsync, hipMemPoolGetAttribute, hipMemPoolTrimTo, hipDeviceSetMemPool, hipMemPoolSetAccess, hipMemPoolGetAccess
Note
This API is implemented on Linux, under development on Windows.
Warning
: This API is marked as beta, meaning, while this is feature complete, it is still open to changes and may have outstanding issues.
- Parameters:
mem_pool – [in] The memory pool to modify
attr – [in] The attribute to modify
value – [in] Pointer to the value to assign
- Returns:
hipSuccess, hipErrorInvalidValue
-
hipError_t hipMemPoolGetAttribute(hipMemPool_t mem_pool, hipMemPoolAttr attr, void *value)#
Gets attributes of a memory pool.
Supported attributes are:
hipMemPoolAttrReleaseThreshold:
(value type = cuuint64_t) Amount of reserved memory in bytes to hold onto before trying to release memory back to the OS. When more than the release threshold bytes of memory are held by the memory pool, the allocator will try to release memory back to the OS on the next call to stream, event or context synchronize. (default 0)hipMemPoolReuseFollowEventDependencies:
(value type = int) AllowhipMallocAsync
to use memory asynchronously freed in another stream as long as a stream ordering dependency of the allocating stream on the free action exists. HIP events and null stream interactions can create the required stream ordered dependencies. (default enabled)hipMemPoolReuseAllowOpportunistic:
(value type = int) Allow reuse of already completed frees when there is no dependency between the free and allocation. (default enabled)hipMemPoolReuseAllowInternalDependencies:
(value type = int) AllowhipMallocAsync
to insert new stream dependencies in order to establish the stream ordering required to reuse a piece of memory released byhipFreeAsync
(default enabled).
See also
hipMallocFromPoolAsync, hipMallocAsync, hipFreeAsync, hipMemPoolTrimTo, hipDeviceSetMemPool, hipMemPoolSetAttribute, hipMemPoolSetAccess, hipMemPoolGetAccess
Note
This API is implemented on Linux, under development on Windows.
Warning
: This API is marked as beta, meaning, while this is feature complete, it is still open to changes and may have outstanding issues.
- Parameters:
mem_pool – [in] The memory pool to get attributes of
attr – [in] The attribute to get
value – [in] Retrieved value
- Returns:
hipSuccess, hipErrorInvalidValue
-
hipError_t hipMemPoolSetAccess(hipMemPool_t mem_pool, const hipMemAccessDesc *desc_list, size_t count)#
Controls visibility of the specified pool between devices.
See also
hipMallocFromPoolAsync, hipMallocAsync, hipFreeAsync, hipMemPoolGetAttribute, hipMemPoolTrimTo, hipDeviceSetMemPool, hipMemPoolSetAttribute, hipMemPoolGetAccess
Note
This API is implemented on Linux, under development on Windows.
Warning
: This API is marked as beta, meaning, while this is feature complete, it is still open to changes and may have outstanding issues.
- Parameters:
mem_pool – [in] Memory pool for acccess change
desc_list – [in] Array of access descriptors. Each descriptor instructs the access to enable for a single gpu
count – [in] Number of descriptors in the map array.
- Returns:
hipSuccess, hipErrorInvalidValue
-
hipError_t hipMemPoolGetAccess(hipMemAccessFlags *flags, hipMemPool_t mem_pool, hipMemLocation *location)#
Returns the accessibility of a pool from a device.
Returns the accessibility of the pool’s memory from the specified location.
See also
hipMallocFromPoolAsync, hipMallocAsync, hipFreeAsync, hipMemPoolGetAttribute, hipMemPoolTrimTo, hipDeviceSetMemPool, hipMemPoolSetAttribute, hipMemPoolSetAccess
Note
This API is implemented on Linux, under development on Windows.
Warning
: This API is marked as beta, meaning, while this is feature complete, it is still open to changes and may have outstanding issues.
- Parameters:
flags – [out] Accessibility of the memory pool from the specified location/device
mem_pool – [in] Memory pool being queried
location – [in] Location/device for memory pool access
- Returns:
hipSuccess, hipErrorInvalidValue
-
hipError_t hipMemPoolCreate(hipMemPool_t *mem_pool, const hipMemPoolProps *pool_props)#
Creates a memory pool.
Creates a HIP memory pool and returns the handle in
mem_pool
. Thepool_props
determines the properties of the pool such as the backing device and IPC capabilities.By default, the memory pool will be accessible from the device it is allocated on.
See also
hipMallocFromPoolAsync, hipMallocAsync, hipFreeAsync, hipMemPoolGetAttribute, hipMemPoolDestroy, hipMemPoolTrimTo, hipDeviceSetMemPool, hipMemPoolSetAttribute, hipMemPoolSetAccess, hipMemPoolGetAccess
Note
Specifying hipMemHandleTypeNone creates a memory pool that will not support IPC.
Note
This API is implemented on Linux, under development on Windows.
Warning
: This API is marked as beta, meaning, while this is feature complete, it is still open to changes and may have outstanding issues.
- Parameters:
mem_pool – [out] Contains createed memory pool
pool_props – [in] Memory pool properties
- Returns:
hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-
hipError_t hipMemPoolDestroy(hipMemPool_t mem_pool)#
Destroys the specified memory pool.
If any pointers obtained from this pool haven’t been freed or the pool has free operations that haven’t completed when
hipMemPoolDestroy
is invoked, the function will return immediately and the resources associated with the pool will be released automatically once there are no more outstanding allocations.Destroying the current mempool of a device sets the default mempool of that device as the current mempool for that device.
See also
hipMallocFromPoolAsync, hipMallocAsync, hipFreeAsync, hipMemPoolGetAttribute, hipMemPoolCreate hipMemPoolTrimTo, hipDeviceSetMemPool, hipMemPoolSetAttribute, hipMemPoolSetAccess, hipMemPoolGetAccess
Note
A device’s default memory pool cannot be destroyed.
Note
This API is implemented on Linux, under development on Windows.
Warning
: This API is marked as beta, meaning, while this is feature complete, it is still open to changes and may have outstanding issues.
- Parameters:
mem_pool – [in] Memory pool for destruction
- Returns:
hipSuccess, hipErrorInvalidValue
-
hipError_t hipMallocFromPoolAsync(void **dev_ptr, size_t size, hipMemPool_t mem_pool, hipStream_t stream)#
Allocates memory from a specified pool with stream ordered semantics.
Inserts an allocation operation into
stream
. A pointer to the allocated memory is returned immediately indev_ptr
. The allocation must not be accessed until the allocation operation completes. The allocation comes from the specified memory pool.Basic stream ordering allows future work submitted into the same stream to use the allocation. Stream query, stream synchronize, and HIP events can be used to guarantee that the allocation operation completes before work submitted in a separate stream runs.
See also
hipMallocAsync, hipFreeAsync, hipMemPoolGetAttribute, hipMemPoolCreate hipMemPoolTrimTo, hipDeviceSetMemPool, hipMemPoolSetAttribute, hipMemPoolSetAccess, hipMemPoolGetAccess,
Note
The specified memory pool may be from a device different than that of the specified
stream
.Note
During stream capture, this function results in the creation of an allocation node. In this case, the allocation is owned by the graph instead of the memory pool. The memory pool’s properties are used to set the node’s creation parameters.
Note
This API is implemented on Linux, under development on Windows.
Warning
: This API is marked as beta, meaning, while this is feature complete, it is still open to changes and may have outstanding issues.
- Parameters:
dev_ptr – [out] Returned device pointer
size – [in] Number of bytes to allocate
mem_pool – [in] The pool to allocate from
stream – [in] The stream establishing the stream ordering semantic
- Returns:
hipSuccess, hipErrorInvalidValue, hipErrorNotSupported, hipErrorOutOfMemory
Exports a memory pool to the requested handle type.
Given an IPC capable mempool, create an OS handle to share the pool with another process. A recipient process can convert the shareable handle into a mempool with
hipMemPoolImportFromShareableHandle
. Individual pointers can then be shared with thehipMemPoolExportPointer
andhipMemPoolImportPointer
APIs. The implementation of what the shareable handle is and how it can be transferred is defined by the requested handle type.See also
Note
: To create an IPC capable mempool, create a mempool with a
hipMemAllocationHandleType
other thanhipMemHandleTypeNone
.Note
This API is implemented on Linux, under development on Windows.
Warning
: This API is marked as beta, meaning, while this is feature complete, it is still open to changes and may have outstanding issues.
- Parameters:
shared_handle – [out] Pointer to the location in which to store the requested handle
mem_pool – [in] Pool to export
handle_type – [in] The type of handle to create
flags – [in] Must be 0
- Returns:
hipSuccess, hipErrorInvalidValue, hipErrorOutOfMemory
Imports a memory pool from a shared handle.
Specific allocations can be imported from the imported pool with
hipMemPoolImportPointer
.See also
Note
Imported memory pools do not support creating new allocations. As such imported memory pools may not be used in
hipDeviceSetMemPool
orhipMallocFromPoolAsync
calls.Note
This API is implemented on Linux, under development on Windows.
Warning
: This API is marked as beta, meaning, while this is feature complete, it is still open to changes and may have outstanding issues.
- Parameters:
mem_pool – [out] Returned memory pool
shared_handle – [in] OS handle of the pool to open
handle_type – [in] The type of handle being imported
flags – [in] Must be 0
- Returns:
hipSuccess, hipErrorInvalidValue, hipErrorOutOfMemory
-
hipError_t hipMemPoolExportPointer(hipMemPoolPtrExportData *export_data, void *dev_ptr)#
Export data to share a memory pool allocation between processes.
Constructs
export_data
for sharing a specific allocation from an already shared memory pool. The recipient process can import the allocation with thehipMemPoolImportPointer
api. The data is not a handle and may be shared through any IPC mechanism.See also
Note
This API is implemented on Linux, under development on Windows.
Warning
: This API is marked as beta, meaning, while this is feature complete, it is still open to changes and may have outstanding issues.
- Parameters:
export_data – [out] Returned export data
dev_ptr – [in] Pointer to memory being exported
- Returns:
hipSuccess, hipErrorInvalidValue, hipErrorOutOfMemory
-
hipError_t hipMemPoolImportPointer(void **dev_ptr, hipMemPool_t mem_pool, hipMemPoolPtrExportData *export_data)#
Import a memory pool allocation from another process.
Returns in
dev_ptr
a pointer to the imported memory. The imported memory must not be accessed before the allocation operation completes in the exporting process. The imported memory must be freed from all importing processes before being freed in the exporting process. The pointer may be freed withhipFree
orhipFreeAsync
. IfhipFreeAsync
is used, the free must be completed on the importing process before the free operation on the exporting process.See also
Note
The
hipFreeAsync
api may be used in the exporting process before thehipFreeAsync
operation completes in its stream as long as thehipFreeAsync
in the exporting process specifies a stream with a stream dependency on the importing process’shipFreeAsync
.Note
This API is implemented on Linux, under development on Windows.
Warning
: This API is marked as beta, meaning, while this is feature complete, it is still open to changes and may have outstanding issues.
- Parameters:
dev_ptr – [out] Pointer to imported memory
mem_pool – [in] Memory pool from which to import a pointer
export_data – [in] Data specifying the memory to import
- Returns:
hipSuccess, hipErrorInvalidValue, hipErrorNotInitialized, hipErrorOutOfMemory
Peer to peer#
-
hipError_t hipDeviceCanAccessPeer(int *canAccessPeer, int deviceId, int peerDeviceId)#
Determine if a device can access a peer’s memory.
Returns “1” in
canAccessPeer
if the specifieddevice
is capable of directly accessing memory physically located on peerDevice , or “0” if not.Returns “0” in
canAccessPeer
if deviceId == peerDeviceId, and both are valid devices : a device is not a peer of itself.- Parameters:
canAccessPeer – [out] Returns the peer access capability (0 or 1)
deviceId – [in] - device from where memory may be accessed.
peerDeviceId – [in] - device where memory is physically located
- Returns:
hipSuccess,
- Returns:
hipErrorInvalidDevice if deviceId or peerDeviceId are not valid devices
-
hipError_t hipDeviceEnablePeerAccess(int peerDeviceId, unsigned int flags)#
Enable direct access from current device’s virtual address space to memory allocations physically located on a peer device.
Memory which already allocated on peer device will be mapped into the address space of the current device. In addition, all future memory allocations on peerDeviceId will be mapped into the address space of the current device when the memory is allocated. The peer memory remains accessible from the current device until a call to hipDeviceDisablePeerAccess or hipDeviceReset.
Returns hipSuccess, hipErrorInvalidDevice, hipErrorInvalidValue,
- Parameters:
peerDeviceId – [in] Peer device to enable direct access to from the current device
flags – [in] Reserved for future use, must be zero
- Returns:
hipErrorPeerAccessAlreadyEnabled if peer access is already enabled for this device.
-
hipError_t hipDeviceDisablePeerAccess(int peerDeviceId)#
Disable direct access from current device’s virtual address space to memory allocations physically located on a peer device.
Returns hipErrorPeerAccessNotEnabled if direct access to memory on peerDevice has not yet been enabled from the current device.
- Parameters:
peerDeviceId – [in] Peer device to disable direct access to
- Returns:
hipSuccess, hipErrorPeerAccessNotEnabled
-
hipError_t hipMemGetAddressRange(hipDeviceptr_t *pbase, size_t *psize, hipDeviceptr_t dptr)#
Get information on memory allocations.
See also
hipCtxCreate, hipCtxDestroy, hipCtxGetFlags, hipCtxPopCurrent, hipCtxGetCurrent, hipCtxSetCurrent, hipCtxPushCurrent, hipCtxSetCacheConfig, hipCtxSynchronize, hipCtxGetDevice
- Parameters:
pbase – [out] - BAse pointer address
psize – [out] - Size of allocation
dptr- – [in] Device Pointer
- Returns:
hipSuccess, hipErrorNotFound
-
hipError_t hipMemcpyPeer(void *dst, int dstDeviceId, const void *src, int srcDeviceId, size_t sizeBytes)#
Copies memory from one device to memory on another device.
- Parameters:
dst – [out] - Destination device pointer.
dstDeviceId – [in] - Destination device
src – [in] - Source device pointer
srcDeviceId – [in] - Source device
sizeBytes – [in] - Size of memory copy in bytes
- Returns:
hipSuccess, hipErrorInvalidValue, hipErrorInvalidDevice
-
hipError_t hipMemcpyPeerAsync(void *dst, int dstDeviceId, const void *src, int srcDevice, size_t sizeBytes, hipStream_t stream)#
Copies memory from one device to memory on another device.
- Parameters:
dst – [out] - Destination device pointer.
dstDeviceId – [in] - Destination device
src – [in] - Source device pointer
srcDevice – [in] - Source device
sizeBytes – [in] - Size of memory copy in bytes
stream – [in] - Stream identifier
- Returns:
hipSuccess, hipErrorInvalidValue, hipErrorInvalidDevice
-
USE_PEER_NON_UNIFIED#
Memory management#
-
hipError_t hipPointerSetAttribute(const void *value, hipPointer_attribute attribute, hipDeviceptr_t ptr)#
Sets information on the specified pointer.[BETA].
Warning
This API is marked as beta, meaning, while this is feature complete, it is still open to changes and may have outstanding issues.
- Parameters:
value – [in] Sets pointer attribute value
attribute – [in] Attribute to set
ptr – [in] Pointer to set attributes for
- Returns:
hipSuccess, hipErrorInvalidDevice, hipErrorInvalidValue
-
hipError_t hipPointerGetAttributes(hipPointerAttribute_t *attributes, const void *ptr)#
Returns attributes for the specified pointer.
The output parameter ‘attributes’ has a member named ‘type’ that describes what memory the pointer is associated with, such as device memory, host memory, managed memory, and others. Otherwise, the API cannot handle the pointer and returns hipErrorInvalidValue.
See also
Note
The unrecognized memory type is unsupported to keep the HIP functionality backward compatibility due to hipMemoryType enum values.
Note
The current behavior of this HIP API corresponds to the CUDA API before version 11.0.
- Parameters:
attributes – [out] attributes for the specified pointer
ptr – [in] pointer to get attributes for
- Returns:
hipSuccess, hipErrorInvalidDevice, hipErrorInvalidValue
-
hipError_t hipPointerGetAttribute(void *data, hipPointer_attribute attribute, hipDeviceptr_t ptr)#
Returns information about the specified pointer.[BETA].
See also
Warning
This API is marked as beta, meaning, while this is feature complete, it is still open to changes and may have outstanding issues.
- Parameters:
data – [inout] Returned pointer attribute value
attribute – [in] Attribute to query for
ptr – [in] Pointer to get attributes for
- Returns:
hipSuccess, hipErrorInvalidDevice, hipErrorInvalidValue
-
hipError_t hipDrvPointerGetAttributes(unsigned int numAttributes, hipPointer_attribute *attributes, void **data, hipDeviceptr_t ptr)#
Returns information about the specified pointer.[BETA].
See also
Warning
This API is marked as beta, meaning, while this is feature complete, it is still open to changes and may have outstanding issues.
- Parameters:
numAttributes – [in] number of attributes to query for
attributes – [in] attributes to query for
data – [inout] a two-dimensional containing pointers to memory locations where the result of each attribute query will be written to
ptr – [in] pointer to get attributes for
- Returns:
hipSuccess, hipErrorInvalidDevice, hipErrorInvalidValue
-
hipError_t hipMalloc(void **ptr, size_t size)#
Allocate memory on the default accelerator.
If size is 0, no memory is allocated, *ptr returns nullptr, and hipSuccess is returned.
See also
hipMallocPitch, hipFree, hipMallocArray, hipFreeArray, hipMalloc3D, hipMalloc3DArray, hipHostFree, hipHostMalloc
- Parameters:
ptr – [out] Pointer to the allocated memory
size – [in] Requested memory size
- Returns:
hipSuccess, hipErrorOutOfMemory, hipErrorInvalidValue (bad context, null *ptr)
-
hipError_t hipExtMallocWithFlags(void **ptr, size_t sizeBytes, unsigned int flags)#
Allocate memory on the default accelerator.
If requested memory size is 0, no memory is allocated, *ptr returns nullptr, and hipSuccess is returned.
The memory allocation flag should be either hipDeviceMallocDefault, hipDeviceMallocFinegrained, hipDeviceMallocUncached, or hipMallocSignalMemory. If the flag is any other value, the API returns hipErrorInvalidValue.
See also
hipMallocPitch, hipFree, hipMallocArray, hipFreeArray, hipMalloc3D, hipMalloc3DArray, hipHostFree, hipHostMalloc
- Parameters:
ptr – [out] Pointer to the allocated memory
sizeBytes – [in] Requested memory size
flags – [in] Type of memory allocation
- Returns:
hipSuccess, hipErrorOutOfMemory, hipErrorInvalidValue (bad context, null *ptr)
-
hipError_t hipMallocHost(void **ptr, size_t size)#
Allocate pinned host memory [Deprecated].
If size is 0, no memory is allocated, *ptr returns nullptr, and hipSuccess is returned.
Warning
This API is deprecated, use hipHostMalloc() instead
- Parameters:
ptr – [out] Pointer to the allocated host pinned memory
size – [in] Requested memory size
- Returns:
hipSuccess, hipErrorOutOfMemory
-
hipError_t hipMemAllocHost(void **ptr, size_t size)#
Allocate pinned host memory [Deprecated].
If size is 0, no memory is allocated, *ptr returns nullptr, and hipSuccess is returned.
Warning
This API is deprecated, use hipHostMalloc() instead
- Parameters:
ptr – [out] Pointer to the allocated host pinned memory
size – [in] Requested memory size
- Returns:
hipSuccess, hipErrorOutOfMemory
-
hipError_t hipHostMalloc(void **ptr, size_t size, unsigned int flags)#
Allocates device accessible page locked (pinned) host memory.
This API allocates pinned host memory which is mapped into the address space of all GPUs in the system, the memory can be accessed directly by the GPU device, and can be read or written with much higher bandwidth than pageable memory obtained with functions such as malloc().
Using the pinned host memory, applications can implement faster data transfers for HostToDevice and DeviceToHost. The runtime tracks the hipHostMalloc allocations and can avoid some of the setup required for regular unpinned memory.
When the memory accesses are infrequent, zero-copy memory can be a good choice, for coherent allocation. GPU can directly access the host memory over the CPU/GPU interconnect, without need to copy the data.
Currently the allocation granularity is 4KB for the API.
Developers need to choose proper allocation flag with consideration of synchronization.
If no input for flags, it will be the default pinned memory allocation on the host.
See also
hipSetDeviceFlags, hipHostFree
- Parameters:
ptr – [out] Pointer to the allocated host pinned memory
size – [in] Requested memory size in bytes If size is 0, no memory is allocated, *ptr returns nullptr, and hipSuccess is returned.
flags – [in] Type of host memory allocation
- Returns:
hipSuccess, hipErrorOutOfMemory
-
hipError_t hipHostAlloc(void **ptr, size_t size, unsigned int flags)#
Allocate device accessible page locked host memory [Deprecated].
If size is 0, no memory is allocated, *ptr returns nullptr, and hipSuccess is returned.
- Parameters:
ptr – [out] Pointer to the allocated host pinned memory
size – [in] Requested memory size in bytes
flags – [in] Type of host memory allocation
- Returns:
hipSuccess, hipErrorOutOfMemory
-
hipError_t hipHostGetDevicePointer(void **devPtr, void *hstPtr, unsigned int flags)#
Get Device pointer from Host Pointer allocated through hipHostMalloc.
See also
hipSetDeviceFlags, hipHostMalloc
- Parameters:
devPtr – [out] Device Pointer mapped to passed host pointer
hstPtr – [in] Host Pointer allocated through hipHostMalloc
flags – [in] Flags to be passed for extension
- Returns:
hipSuccess, hipErrorInvalidValue, hipErrorOutOfMemory
-
hipError_t hipHostGetFlags(unsigned int *flagsPtr, void *hostPtr)#
Return flags associated with host pointer.
See also
- Parameters:
flagsPtr – [out] Memory location to store flags
hostPtr – [in] Host Pointer allocated through hipHostMalloc
- Returns:
hipSuccess, hipErrorInvalidValue
-
hipError_t hipHostRegister(void *hostPtr, size_t sizeBytes, unsigned int flags)#
Register host memory so it can be accessed from the current device.
Flags:
hipHostRegisterDefault Memory is Mapped and Portable
hipHostRegisterPortable Memory is considered registered by all contexts. HIP only supports one context so this is always assumed true.
hipHostRegisterMapped Map the allocation into the address space for the current device. The device pointer can be obtained with hipHostGetDevicePointer.
After registering the memory, use hipHostGetDevicePointer to obtain the mapped device pointer. On many systems, the mapped device pointer will have a different value than the mapped host pointer. Applications must use the device pointer in device code, and the host pointer in device code.
On some systems, registered memory is pinned. On some systems, registered memory may not be actually be pinned but uses OS or hardware facilities to all GPU access to the host memory.
Developers are strongly encouraged to register memory blocks which are aligned to the host cache-line size. (typically 64-bytes but can be obtains from the CPUID instruction).
If registering non-aligned pointers, the application must take care when register pointers from the same cache line on different devices. HIP’s coarse-grained synchronization model does not guarantee correct results if different devices write to different parts of the same cache block - typically one of the writes will “win” and overwrite data from the other registered memory region.
- Parameters:
hostPtr – [out] Pointer to host memory to be registered.
sizeBytes – [in] Size of the host memory
flags – [in] See below.
- Returns:
hipSuccess, hipErrorOutOfMemory
-
hipError_t hipHostUnregister(void *hostPtr)#
Un-register host pointer.
See also
- Parameters:
hostPtr – [in] Host pointer previously registered with hipHostRegister
- Returns:
Error code
-
hipError_t hipMallocPitch(void **ptr, size_t *pitch, size_t width, size_t height)#
Allocates at least width (in bytes) * height bytes of linear memory Padding may occur to ensure alighnment requirements are met for the given row The change in width size due to padding will be returned in *pitch. Currently the alignment is set to 128 bytes
If size is 0, no memory is allocated, *ptr returns nullptr, and hipSuccess is returned.
See also
hipMalloc, hipFree, hipMallocArray, hipFreeArray, hipHostFree, hipMalloc3D, hipMalloc3DArray, hipHostMalloc
- Parameters:
ptr – [out] Pointer to the allocated device memory
pitch – [out] Pitch for allocation (in bytes)
width – [in] Requested pitched allocation width (in bytes)
height – [in] Requested pitched allocation height
- Returns:
Error code
-
hipError_t hipMemAllocPitch(hipDeviceptr_t *dptr, size_t *pitch, size_t widthInBytes, size_t height, unsigned int elementSizeBytes)#
Allocates at least width (in bytes) * height bytes of linear memory Padding may occur to ensure alighnment requirements are met for the given row The change in width size due to padding will be returned in *pitch. Currently the alignment is set to 128 bytes
If size is 0, no memory is allocated, ptr returns nullptr, and hipSuccess is returned. The intended usage of pitch is as a separate parameter of the allocation, used to compute addresses within the 2D array. Given the row and column of an array element of type T, the address is computed as: T pElement = (T*)((char*)BaseAddress + Row * Pitch) + Column;
See also
hipMalloc, hipFree, hipMallocArray, hipFreeArray, hipHostFree, hipMalloc3D, hipMalloc3DArray, hipHostMalloc
- Parameters:
dptr – [out] Pointer to the allocated device memory
pitch – [out] Pitch for allocation (in bytes)
widthInBytes – [in] Requested pitched allocation width (in bytes)
height – [in] Requested pitched allocation height
elementSizeBytes – [in] The size of element bytes, should be 4, 8 or 16
- Returns:
Error code
-
hipError_t hipFree(void *ptr)#
Free memory allocated by the hcc hip memory allocation API. This API performs an implicit hipDeviceSynchronize() call. If pointer is NULL, the hip runtime is initialized and hipSuccess is returned.
See also
hipMalloc, hipMallocPitch, hipMallocArray, hipFreeArray, hipHostFree, hipMalloc3D, hipMalloc3DArray, hipHostMalloc
- Parameters:
ptr – [in] Pointer to memory to be freed
- Returns:
hipSuccess
- Returns:
hipErrorInvalidDevicePointer (if pointer is invalid, including host pointers allocated with hipHostMalloc)
-
hipError_t hipFreeHost(void *ptr)#
Free memory allocated by the hcc hip host memory allocation API [Deprecated].
Warning
This API is deprecated, use hipHostFree() instead
- Parameters:
ptr – [in] Pointer to memory to be freed
- Returns:
hipSuccess, hipErrorInvalidValue (if pointer is invalid, including device pointers allocated with hipMalloc)
-
hipError_t hipHostFree(void *ptr)#
Free memory allocated by the hcc hip host memory allocation API This API performs an implicit hipDeviceSynchronize() call. If pointer is NULL, the hip runtime is initialized and hipSuccess is returned.
See also
hipMalloc, hipMallocPitch, hipFree, hipMallocArray, hipFreeArray, hipMalloc3D, hipMalloc3DArray, hipHostMalloc
- Parameters:
ptr – [in] Pointer to memory to be freed
- Returns:
hipSuccess, hipErrorInvalidValue (if pointer is invalid, including device pointers allocated with hipMalloc)
-
hipError_t hipMemcpy(void *dst, const void *src, size_t sizeBytes, hipMemcpyKind kind)#
Copy data from src to dst.
It supports memory from host to device, device to host, device to device and host to host The src and dst must not overlap.
For hipMemcpy, the copy is always performed by the current device (set by hipSetDevice). For multi-gpu or peer-to-peer configurations, it is recommended to set the current device to the device where the src data is physically located. For optimal peer-to-peer copies, the copy device must be able to access the src and dst pointers (by calling hipDeviceEnablePeerAccess with copy agent as the current device and src/dest as the peerDevice argument. if this is not done, the hipMemcpy will still work, but will perform the copy using a staging buffer on the host. Calling hipMemcpy with dst and src pointers that do not match the hipMemcpyKind results in undefined behavior.
See also
hipArrayCreate, hipArrayDestroy, hipArrayGetDescriptor, hipMemAlloc, hipMemAllocHost, hipMemAllocPitch, hipMemcpy2D, hipMemcpy2DAsync, hipMemcpy2DUnaligned, hipMemcpyAtoA, hipMemcpyAtoD, hipMemcpyAtoH, hipMemcpyAtoHAsync, hipMemcpyDtoA, hipMemcpyDtoD, hipMemcpyDtoDAsync, hipMemcpyDtoH, hipMemcpyDtoHAsync, hipMemcpyHtoA, hipMemcpyHtoAAsync, hipMemcpyHtoDAsync, hipMemFree, hipMemFreeHost, hipMemGetAddressRange, hipMemGetInfo, hipMemHostAlloc, hipMemHostGetDevicePointer
- Parameters:
dst – [out] Data being copy to
src – [in] Data being copy from
sizeBytes – [in] Data size in bytes
kind – [in] Kind of transfer
- Returns:
hipSuccess, hipErrorInvalidValue, hipErrorUnknown
-
hipError_t hipMemcpyWithStream(void *dst, const void *src, size_t sizeBytes, hipMemcpyKind kind, hipStream_t stream)#
Memory copy on the stream. It allows single or multiple devices to do memory copy on single or multiple streams.
See also
hipMemcpy, hipStreamCreate, hipStreamSynchronize, hipStreamDestroy, hipSetDevice, hipLaunchKernelGGL
- Parameters:
dst – [out] Data being copy to
src – [in] Data being copy from
sizeBytes – [in] Data size in bytes
kind – [in] Kind of transfer
stream – [in] Valid stream
- Returns:
hipSuccess, hipErrorInvalidValue, hipErrorUnknown, hipErrorContextIsDestroyed
-
hipError_t hipMemcpyHtoD(hipDeviceptr_t dst, void *src, size_t sizeBytes)#
Copy data from Host to Device.
See also
hipArrayCreate, hipArrayDestroy, hipArrayGetDescriptor, hipMemAlloc, hipMemAllocHost, hipMemAllocPitch, hipMemcpy2D, hipMemcpy2DAsync, hipMemcpy2DUnaligned, hipMemcpyAtoA, hipMemcpyAtoD, hipMemcpyAtoH, hipMemcpyAtoHAsync, hipMemcpyDtoA, hipMemcpyDtoD, hipMemcpyDtoDAsync, hipMemcpyDtoH, hipMemcpyDtoHAsync, hipMemcpyHtoA, hipMemcpyHtoAAsync, hipMemcpyHtoDAsync, hipMemFree, hipMemFreeHost, hipMemGetAddressRange, hipMemGetInfo, hipMemHostAlloc, hipMemHostGetDevicePointer
- Parameters:
dst – [out] Data being copy to
src – [in] Data being copy from
sizeBytes – [in] Data size in bytes
- Returns:
hipSuccess, hipErrorDeinitialized, hipErrorNotInitialized, hipErrorInvalidContext, hipErrorInvalidValue
-
hipError_t hipMemcpyDtoH(void *dst, hipDeviceptr_t src, size_t sizeBytes)#
Copy data from Device to Host.
See also
hipArrayCreate, hipArrayDestroy, hipArrayGetDescriptor, hipMemAlloc, hipMemAllocHost, hipMemAllocPitch, hipMemcpy2D, hipMemcpy2DAsync, hipMemcpy2DUnaligned, hipMemcpyAtoA, hipMemcpyAtoD, hipMemcpyAtoH, hipMemcpyAtoHAsync, hipMemcpyDtoA, hipMemcpyDtoD, hipMemcpyDtoDAsync, hipMemcpyDtoH, hipMemcpyDtoHAsync, hipMemcpyHtoA, hipMemcpyHtoAAsync, hipMemcpyHtoDAsync, hipMemFree, hipMemFreeHost, hipMemGetAddressRange, hipMemGetInfo, hipMemHostAlloc, hipMemHostGetDevicePointer
- Parameters:
dst – [out] Data being copy to
src – [in] Data being copy from
sizeBytes – [in] Data size in bytes
- Returns:
hipSuccess, hipErrorDeinitialized, hipErrorNotInitialized, hipErrorInvalidContext, hipErrorInvalidValue
-
hipError_t hipMemcpyDtoD(hipDeviceptr_t dst, hipDeviceptr_t src, size_t sizeBytes)#
Copy data from Device to Device.
See also
hipArrayCreate, hipArrayDestroy, hipArrayGetDescriptor, hipMemAlloc, hipMemAllocHost, hipMemAllocPitch, hipMemcpy2D, hipMemcpy2DAsync, hipMemcpy2DUnaligned, hipMemcpyAtoA, hipMemcpyAtoD, hipMemcpyAtoH, hipMemcpyAtoHAsync, hipMemcpyDtoA, hipMemcpyDtoD, hipMemcpyDtoDAsync, hipMemcpyDtoH, hipMemcpyDtoHAsync, hipMemcpyHtoA, hipMemcpyHtoAAsync, hipMemcpyHtoDAsync, hipMemFree, hipMemFreeHost, hipMemGetAddressRange, hipMemGetInfo, hipMemHostAlloc, hipMemHostGetDevicePointer
- Parameters:
dst – [out] Data being copy to
src – [in] Data being copy from
sizeBytes – [in] Data size in bytes
- Returns:
hipSuccess, hipErrorDeinitialized, hipErrorNotInitialized, hipErrorInvalidContext, hipErrorInvalidValue
-
hipError_t hipMemcpyAtoD(hipDeviceptr_t dstDevice, hipArray_t srcArray, size_t srcOffset, size_t ByteCount)#
Copies from one 1D array to device memory.
See also
hipArrayCreate, hipArrayDestroy, hipArrayGetDescriptor, hipMemAlloc, hipMemAllocHost, hipMemAllocPitch, hipMemcpy2D, hipMemcpy2DAsync, hipMemcpy2DUnaligned, hipMemcpyAtoA, hipMemcpyAtoD, hipMemcpyAtoH, hipMemcpyAtoHAsync, hipMemcpyDtoA, hipMemcpyDtoD, hipMemcpyDtoDAsync, hipMemcpyDtoH, hipMemcpyDtoHAsync, hipMemcpyHtoA, hipMemcpyHtoAAsync, hipMemcpyHtoDAsync, hipMemFree, hipMemFreeHost, hipMemGetAddressRange, hipMemGetInfo, hipMemHostAlloc, hipMemHostGetDevicePointer
- Parameters:
dstDevice – [out] Destination device pointer
srcArray – [in] Source array
srcOffset – [in] Offset in bytes of source array
ByteCount – [in] Size of memory copy in bytes
- Returns:
hipSuccess, hipErrorDeinitialized, hipErrorNotInitialized, hipErrorInvalidContext, hipErrorInvalidValue
-
hipError_t hipMemcpyDtoA(hipArray_t dstArray, size_t dstOffset, hipDeviceptr_t srcDevice, size_t ByteCount)#
Copies from device memory to a 1D array.
See also
hipArrayCreate, hipArrayDestroy, hipArrayGetDescriptor, hipMemAlloc, hipMemAllocHost, hipMemAllocPitch, hipMemcpy2D, hipMemcpy2DAsync, hipMemcpy2DUnaligned, hipMemcpyAtoA, hipMemcpyAtoD, hipMemcpyAtoH, hipMemcpyAtoHAsync, hipMemcpyDtoA, hipMemcpyDtoD, hipMemcpyDtoDAsync, hipMemcpyDtoH, hipMemcpyDtoHAsync, hipMemcpyHtoA, hipMemcpyHtoAAsync, hipMemcpyHtoDAsync, hipMemFree, hipMemFreeHost, hipMemGetAddressRange, hipMemGetInfo, hipMemHostAlloc, hipMemHostGetDevicePointer
- Parameters:
dstArray – [out] Destination array
dstOffset – [in] Offset in bytes of destination array
srcDevice – [in] Source device pointer
ByteCount – [in] Size of memory copy in bytes
- Returns:
hipSuccess, hipErrorDeinitialized, hipErrorNotInitialized, hipErrorInvalidContext, hipErrorInvalidValue
-
hipError_t hipMemcpyAtoA(hipArray_t dstArray, size_t dstOffset, hipArray_t srcArray, size_t srcOffset, size_t ByteCount)#
Copies from one 1D array to another.
See also
hipArrayCreate, hipArrayDestroy, hipArrayGetDescriptor, hipMemAlloc, hipMemAllocHost, hipMemAllocPitch, hipMemcpy2D, hipMemcpy2DAsync, hipMemcpy2DUnaligned, hipMemcpyAtoA, hipMemcpyAtoD, hipMemcpyAtoH, hipMemcpyAtoHAsync, hipMemcpyDtoA, hipMemcpyDtoD, hipMemcpyDtoDAsync, hipMemcpyDtoH, hipMemcpyDtoHAsync, hipMemcpyHtoA, hipMemcpyHtoAAsync, hipMemcpyHtoDAsync, hipMemFree, hipMemFreeHost, hipMemGetAddressRange, hipMemGetInfo, hipMemHostAlloc, hipMemHostGetDevicePointer
- Parameters:
dstArray – [out] Destination array
dstOffset – [in] Offset in bytes of destination array
srcArray – [in] Source array
srcOffset – [in] Offset in bytes of source array
ByteCount – [in] Size of memory copy in bytes
- Returns:
hipSuccess, hipErrorDeinitialized, hipErrorNotInitialized, hipErrorInvalidContext, hipErrorInvalidValue
-
hipError_t hipMemcpyHtoDAsync(hipDeviceptr_t dst, void *src, size_t sizeBytes, hipStream_t stream)#
Copy data from Host to Device asynchronously.
See also
hipArrayCreate, hipArrayDestroy, hipArrayGetDescriptor, hipMemAlloc, hipMemAllocHost, hipMemAllocPitch, hipMemcpy2D, hipMemcpy2DAsync, hipMemcpy2DUnaligned, hipMemcpyAtoA, hipMemcpyAtoD, hipMemcpyAtoH, hipMemcpyAtoHAsync, hipMemcpyDtoA, hipMemcpyDtoD, hipMemcpyDtoDAsync, hipMemcpyDtoH, hipMemcpyDtoHAsync, hipMemcpyHtoA, hipMemcpyHtoAAsync, hipMemcpyHtoDAsync, hipMemFree, hipMemFreeHost, hipMemGetAddressRange, hipMemGetInfo, hipMemHostAlloc, hipMemHostGetDevicePointer
- Parameters:
dst – [out] Data being copy to
src – [in] Data being copy from
sizeBytes – [in] Data size in bytes
stream – [in] Stream identifier
- Returns:
hipSuccess, hipErrorDeinitialized, hipErrorNotInitialized, hipErrorInvalidContext, hipErrorInvalidValue
-
hipError_t hipMemcpyDtoHAsync(void *dst, hipDeviceptr_t src, size_t sizeBytes, hipStream_t stream)#
Copy data from Device to Host asynchronously.
See also
hipArrayCreate, hipArrayDestroy, hipArrayGetDescriptor, hipMemAlloc, hipMemAllocHost, hipMemAllocPitch, hipMemcpy2D, hipMemcpy2DAsync, hipMemcpy2DUnaligned, hipMemcpyAtoA, hipMemcpyAtoD, hipMemcpyAtoH, hipMemcpyAtoHAsync, hipMemcpyDtoA, hipMemcpyDtoD, hipMemcpyDtoDAsync, hipMemcpyDtoH, hipMemcpyDtoHAsync, hipMemcpyHtoA, hipMemcpyHtoAAsync, hipMemcpyHtoDAsync, hipMemFree, hipMemFreeHost, hipMemGetAddressRange, hipMemGetInfo, hipMemHostAlloc, hipMemHostGetDevicePointer
- Parameters:
dst – [out] Data being copy to
src – [in] Data being copy from
sizeBytes – [in] Data size in bytes
stream – [in] Stream identifier
- Returns:
hipSuccess, hipErrorDeinitialized, hipErrorNotInitialized, hipErrorInvalidContext, hipErrorInvalidValue
-
hipError_t hipMemcpyDtoDAsync(hipDeviceptr_t dst, hipDeviceptr_t src, size_t sizeBytes, hipStream_t stream)#
Copy data from Device to Device asynchronously.
See also
hipArrayCreate, hipArrayDestroy, hipArrayGetDescriptor, hipMemAlloc, hipMemAllocHost, hipMemAllocPitch, hipMemcpy2D, hipMemcpy2DAsync, hipMemcpy2DUnaligned, hipMemcpyAtoA, hipMemcpyAtoD, hipMemcpyAtoH, hipMemcpyAtoHAsync, hipMemcpyDtoA, hipMemcpyDtoD, hipMemcpyDtoDAsync, hipMemcpyDtoH, hipMemcpyDtoHAsync, hipMemcpyHtoA, hipMemcpyHtoAAsync, hipMemcpyHtoDAsync, hipMemFree, hipMemFreeHost, hipMemGetAddressRange, hipMemGetInfo, hipMemHostAlloc, hipMemHostGetDevicePointer
- Parameters:
dst – [out] Data being copy to
src – [in] Data being copy from
sizeBytes – [in] Data size in bytes
stream – [in] Stream identifier
- Returns:
hipSuccess, hipErrorDeinitialized, hipErrorNotInitialized, hipErrorInvalidContext, hipErrorInvalidValue
-
hipError_t hipMemcpyAtoHAsync(void *dstHost, hipArray_t srcArray, size_t srcOffset, size_t ByteCount, hipStream_t stream)#
Copies from one 1D array to host memory.
See also
hipArrayCreate, hipArrayDestroy, hipArrayGetDescriptor, hipMemAlloc, hipMemAllocHost, hipMemAllocPitch, hipMemcpy2D, hipMemcpy2DAsync, hipMemcpy2DUnaligned, hipMemcpyAtoA, hipMemcpyAtoD, hipMemcpyAtoH, hipMemcpyAtoHAsync, hipMemcpyDtoA, hipMemcpyDtoD, hipMemcpyDtoDAsync, hipMemcpyDtoH, hipMemcpyDtoHAsync, hipMemcpyHtoA, hipMemcpyHtoAAsync, hipMemcpyHtoDAsync, hipMemFree, hipMemFreeHost, hipMemGetAddressRange, hipMemGetInfo, hipMemHostAlloc, hipMemHostGetDevicePointer
- Parameters:
dstHost – [out] Destination pointer
srcArray – [in] Source array
srcOffset – [in] Offset in bytes of source array
ByteCount – [in] Size of memory copy in bytes
stream – [in] Stream identifier
- Returns:
hipSuccess, hipErrorDeinitialized, hipErrorNotInitialized, hipErrorInvalidContext, hipErrorInvalidValue
-
hipError_t hipMemcpyHtoAAsync(hipArray_t dstArray, size_t dstOffset, const void *srcHost, size_t ByteCount, hipStream_t stream)#
Copies from host memory to a 1D array.
See also
hipArrayCreate, hipArrayDestroy, hipArrayGetDescriptor, hipMemAlloc, hipMemAllocHost, hipMemAllocPitch, hipMemcpy2D, hipMemcpy2DAsync, hipMemcpy2DUnaligned, hipMemcpyAtoA, hipMemcpyAtoD, hipMemcpyAtoH, hipMemcpyAtoHAsync, hipMemcpyDtoA, hipMemcpyDtoD, hipMemcpyDtoDAsync, hipMemcpyDtoH, hipMemcpyDtoHAsync, hipMemcpyHtoA, hipMemcpyHtoAAsync, hipMemcpyHtoDAsync, hipMemFree, hipMemFreeHost, hipMemGetAddressRange, hipMemGetInfo, hipMemHostAlloc, hipMemHostGetDevicePointer
- Parameters:
dstArray – [out] Destination array
dstOffset – [in] Offset in bytes of destination array
srcHost – [in] Source host pointer
ByteCount – [in] Size of memory copy in bytes
stream – [in] Stream identifier
- Returns:
hipSuccess, hipErrorDeinitialized, hipErrorNotInitialized, hipErrorInvalidContext, hipErrorInvalidValue
-
hipError_t hipModuleGetGlobal(hipDeviceptr_t *dptr, size_t *bytes, hipModule_t hmod, const char *name)#
Returns a global pointer from a module. Returns in *dptr and *bytes the pointer and size of the global of name name located in module hmod. If no variable of that name exists, it returns hipErrorNotFound. Both parameters dptr and bytes are optional. If one of them is NULL, it is ignored and hipSuccess is returned.
- Parameters:
dptr – [out] Returns global device pointer
bytes – [out] Returns global size in bytes
hmod – [in] Module to retrieve global from
name – [in] Name of global to retrieve
- Returns:
hipSuccess, hipErrorInvalidValue, hipErrorNotFound, hipErrorInvalidContext
-
hipError_t hipGetSymbolAddress(void **devPtr, const void *symbol)#
Gets device pointer associated with symbol on the device.
- Parameters:
devPtr – [out] pointer to the device associated the symbole
symbol – [in] pointer to the symbole of the device
- Returns:
hipSuccess, hipErrorInvalidValue
-
hipError_t hipGetSymbolSize(size_t *size, const void *symbol)#
Gets the size of the given symbol on the device.
- Parameters:
symbol – [in] pointer to the device symbole
size – [out] pointer to the size
- Returns:
hipSuccess, hipErrorInvalidValue
-
hipError_t hipGetProcAddress(const char *symbol, void **pfn, int hipVersion, uint64_t flags, hipDriverProcAddressQueryResult *symbolStatus)#
Gets the pointer of requested HIP driver function.
Returns hipSuccess if the returned pfn is addressed to the pointer of found driver function.
- Parameters:
symbol – [in] The Symbol name of the driver function to request.
pfn – [out] Output pointer to the requested driver function.
hipVersion – [in] The HIP version for the requested driver function symbol. HIP version is defined as 100*version_major + version_minor. For example, in HIP 6.1, the hipversion is 601, for the symbol function “hipGetDeviceProperties”, the specified hipVersion 601 is greater or equal to the version 600, the symbol function will be handle properly as backend compatible function.
flags – [in] Currently only default flag is suppported.
symbolStatus – [out] Optional enumeration for returned status of searching for symbol driver function based on the input hipVersion.
- Returns:
hipSuccess, hipErrorInvalidValue.
-
hipError_t hipMemcpyToSymbol(const void *symbol, const void *src, size_t sizeBytes, size_t offset, hipMemcpyKind kind)#
Copies data to the given symbol on the device. Symbol HIP APIs allow a kernel to define a device-side data symbol which can be accessed on the host side. The symbol can be in __constant or device space. Note that the symbol name needs to be encased in the HIP_SYMBOL macro. This also applies to hipMemcpyFromSymbol, hipGetSymbolAddress, and hipGetSymbolSize. For detailed usage, see the memcpyToSymbol example in the HIP Porting Guide.
- Parameters:
symbol – [out] pointer to the device symbole
src – [in] pointer to the source address
sizeBytes – [in] size in bytes to copy
offset – [in] offset in bytes from start of symbole
kind – [in] type of memory transfer
- Returns:
hipSuccess, hipErrorInvalidValue
-
hipError_t hipMemcpyToSymbolAsync(const void *symbol, const void *src, size_t sizeBytes, size_t offset, hipMemcpyKind kind, hipStream_t stream)#
Copies data to the given symbol on the device asynchronously.
- Parameters:
symbol – [out] pointer to the device symbole
src – [in] pointer to the source address
sizeBytes – [in] size in bytes to copy
offset – [in] offset in bytes from start of symbole
kind – [in] type of memory transfer
stream – [in] stream identifier
- Returns:
hipSuccess, hipErrorInvalidValue
-
hipError_t hipMemcpyFromSymbol(void *dst, const void *symbol, size_t sizeBytes, size_t offset, hipMemcpyKind kind)#
Copies data from the given symbol on the device.
- Parameters:
dst – [out] Returns pointer to destinition memory address
symbol – [in] Pointer to the symbole address on the device
sizeBytes – [in] Size in bytes to copy
offset – [in] Offset in bytes from the start of symbole
kind – [in] Type of memory transfer
- Returns:
hipSuccess, hipErrorInvalidValue
-
hipError_t hipMemcpyFromSymbolAsync(void *dst, const void *symbol, size_t sizeBytes, size_t offset, hipMemcpyKind kind, hipStream_t stream)#
Copies data from the given symbol on the device asynchronously.
- Parameters:
dst – [out] Returns pointer to destinition memory address
symbol – [in] pointer to the symbole address on the device
sizeBytes – [in] size in bytes to copy
offset – [in] offset in bytes from the start of symbole
kind – [in] type of memory transfer
stream – [in] stream identifier
- Returns:
hipSuccess, hipErrorInvalidValue
-
hipError_t hipMemcpyAsync(void *dst, const void *src, size_t sizeBytes, hipMemcpyKind kind, hipStream_t stream)#
Copy data from src to dst asynchronously.
For multi-gpu or peer-to-peer configurations, it is recommended to use a stream which is a attached to the device where the src data is physically located. For optimal peer-to-peer copies, the copy device must be able to access the src and dst pointers (by calling hipDeviceEnablePeerAccess with copy agent as the current device and src/dest as the peerDevice argument. if this is not done, the hipMemcpy will still work, but will perform the copy using a staging buffer on the host.
See also
hipMemcpy, hipMemcpy2D, hipMemcpyToArray, hipMemcpy2DToArray, hipMemcpyFromArray, hipMemcpy2DFromArray, hipMemcpyArrayToArray, hipMemcpy2DArrayToArray, hipMemcpyToSymbol, hipMemcpyFromSymbol, hipMemcpy2DAsync, hipMemcpyToArrayAsync, hipMemcpy2DToArrayAsync, hipMemcpyFromArrayAsync, hipMemcpy2DFromArrayAsync, hipMemcpyToSymbolAsync, hipMemcpyFromSymbolAsync
Warning
If host or dest are not pinned, the memory copy will be performed synchronously. For best performance, use hipHostMalloc to allocate host memory that is transferred asynchronously.
Warning
on HCC hipMemcpyAsync does not support overlapped H2D and D2H copies. For hipMemcpy, the copy is always performed by the device associated with the specified stream.
- Parameters:
dst – [out] Data being copy to
src – [in] Data being copy from
sizeBytes – [in] Data size in bytes
kind – [in] Type of memory transfer
stream – [in] Stream identifier
- Returns:
hipSuccess, hipErrorInvalidValue, hipErrorUnknown
-
hipError_t hipMemset(void *dst, int value, size_t sizeBytes)#
Fills the first sizeBytes bytes of the memory area pointed to by dest with the constant byte value value.
- Parameters:
dst – [out] Data being filled
value – [in] Value to be set
sizeBytes – [in] Data size in bytes
- Returns:
hipSuccess, hipErrorInvalidValue, hipErrorNotInitialized
-
hipError_t hipMemsetD8(hipDeviceptr_t dest, unsigned char value, size_t count)#
Fills the first sizeBytes bytes of the memory area pointed to by dest with the constant byte value value.
- Parameters:
dest – [out] Data ptr to be filled
value – [in] Value to be set
count – [in] Number of values to be set
- Returns:
hipSuccess, hipErrorInvalidValue, hipErrorNotInitialized
-
hipError_t hipMemsetD8Async(hipDeviceptr_t dest, unsigned char value, size_t count, hipStream_t stream)#
Fills the first sizeBytes bytes of the memory area pointed to by dest with the constant byte value value.
hipMemsetD8Async() is asynchronous with respect to the host, so the call may return before the memset is complete. The operation can optionally be associated to a stream by passing a non-zero stream argument. If stream is non-zero, the operation may overlap with operations in other streams.
- Parameters:
dest – [out] Data ptr to be filled
value – [in] Constant value to be set
count – [in] Number of values to be set
stream – [in] Stream identifier
- Returns:
hipSuccess, hipErrorInvalidValue, hipErrorNotInitialized
-
hipError_t hipMemsetD16(hipDeviceptr_t dest, unsigned short value, size_t count)#
Fills the first sizeBytes bytes of the memory area pointed to by dest with the constant short value value.
- Parameters:
dest – [out] Data ptr to be filled
value – [in] Constant value to be set
count – [in] Number of values to be set
- Returns:
hipSuccess, hipErrorInvalidValue, hipErrorNotInitialized
-
hipError_t hipMemsetD16Async(hipDeviceptr_t dest, unsigned short value, size_t count, hipStream_t stream)#
Fills the first sizeBytes bytes of the memory area pointed to by dest with the constant short value value.
hipMemsetD16Async() is asynchronous with respect to the host, so the call may return before the memset is complete. The operation can optionally be associated to a stream by passing a non-zero stream argument. If stream is non-zero, the operation may overlap with operations in other streams.
- Parameters:
dest – [out] Data ptr to be filled
value – [in] Constant value to be set
count – [in] Number of values to be set
stream – [in] Stream identifier
- Returns:
hipSuccess, hipErrorInvalidValue, hipErrorNotInitialized
-
hipError_t hipMemsetD32(hipDeviceptr_t dest, int value, size_t count)#
Fills the memory area pointed to by dest with the constant integer value for specified number of times.
- Parameters:
dest – [out] Data being filled
value – [in] Constant value to be set
count – [in] Number of values to be set
- Returns:
hipSuccess, hipErrorInvalidValue, hipErrorNotInitialized
-
hipError_t hipMemsetAsync(void *dst, int value, size_t sizeBytes, hipStream_t stream)#
Fills the first sizeBytes bytes of the memory area pointed to by dev with the constant byte value value.
hipMemsetAsync() is asynchronous with respect to the host, so the call may return before the memset is complete. The operation can optionally be associated to a stream by passing a non-zero stream argument. If stream is non-zero, the operation may overlap with operations in other streams.
- Parameters:
dst – [out] Pointer to device memory
value – [in] Value to set for each byte of specified memory
sizeBytes – [in] Size in bytes to set
stream – [in] Stream identifier
- Returns:
hipSuccess, hipErrorInvalidValue
-
hipError_t hipMemsetD32Async(hipDeviceptr_t dst, int value, size_t count, hipStream_t stream)#
Fills the memory area pointed to by dev with the constant integer value for specified number of times.
hipMemsetD32Async() is asynchronous with respect to the host, so the call may return before the memset is complete. The operation can optionally be associated to a stream by passing a non-zero stream argument. If stream is non-zero, the operation may overlap with operations in other streams.
- Parameters:
dst – [out] Pointer to device memory
value – [in] Value to set for each byte of specified memory
count – [in] Number of values to be set
stream – [in] Stream identifier
- Returns:
hipSuccess, hipErrorInvalidValue
-
hipError_t hipMemset2D(void *dst, size_t pitch, int value, size_t width, size_t height)#
Fills the memory area pointed to by dst with the constant value.
- Parameters:
dst – [out] Pointer to device memory
pitch – [in] Data size in bytes
value – [in] Constant value to be set
width – [in]
height – [in]
- Returns:
hipSuccess, hipErrorInvalidValue
-
hipError_t hipMemset2DAsync(void *dst, size_t pitch, int value, size_t width, size_t height, hipStream_t stream)#
Fills asynchronously the memory area pointed to by dst with the constant value.
- Parameters:
dst – [in] Pointer to 2D device memory
pitch – [in] Pitch size in bytes
value – [in] Value to be set for each byte of specified memory
width – [in] Width of matrix set columns in bytes
height – [in] Height of matrix set rows in bytes
stream – [in] Stream identifier
- Returns:
hipSuccess, hipErrorInvalidValue
-
hipError_t hipMemset3D(hipPitchedPtr pitchedDevPtr, int value, hipExtent extent)#
Fills synchronously the memory area pointed to by pitchedDevPtr with the constant value.
- Parameters:
pitchedDevPtr – [in] Pointer to pitched device memory
value – [in] Value to set for each byte of specified memory
extent – [in] Size parameters for width field in bytes in device memory
- Returns:
hipSuccess, hipErrorInvalidValue
-
hipError_t hipMemset3DAsync(hipPitchedPtr pitchedDevPtr, int value, hipExtent extent, hipStream_t stream)#
Fills asynchronously the memory area pointed to by pitchedDevPtr with the constant value.
- Parameters:
pitchedDevPtr – [in] Pointer to pitched device memory
value – [in] Value to set for each byte of specified memory
extent – [in] Size parameters for width field in bytes in device memory
stream – [in] Stream identifier
- Returns:
hipSuccess, hipErrorInvalidValue
-
hipError_t hipMemGetInfo(size_t *free, size_t *total)#
Query memory info.
On ROCM, this function gets the actual free memory left on the current device, so supports the cases while running multi-workload (such as multiple processes, multiple threads, and multiple GPUs).
Warning
On Windows, the free memory only accounts for memory allocated by this process and may be optimistic.
- Parameters:
free – [out] Returns free memory on the current device in bytes
total – [out] Returns total allocatable memory on the current device in bytes
- Returns:
hipSuccess, hipErrorInvalidDevice, hipErrorInvalidValue
-
hipError_t hipMemPtrGetInfo(void *ptr, size_t *size)#
Get allocated memory size via memory pointer.
This function gets the allocated shared virtual memory size from memory pointer.
- Parameters:
ptr – [in] Pointer to allocated memory
size – [out] Returns the allocated memory size in bytes
- Returns:
hipSuccess, hipErrorInvalidValue
-
hipError_t hipMallocArray(hipArray_t *array, const hipChannelFormatDesc *desc, size_t width, size_t height, unsigned int flags)#
Allocate an array on the device.
See also
hipMalloc, hipMallocPitch, hipFree, hipFreeArray, hipHostMalloc, hipHostFree
- Parameters:
array – [out] Pointer to allocated array in device memory
desc – [in] Requested channel format
width – [in] Requested array allocation width
height – [in] Requested array allocation height
flags – [in] Requested properties of allocated array
- Returns:
hipSuccess, hipErrorOutOfMemory
-
hipError_t hipArrayCreate(hipArray_t *pHandle, const HIP_ARRAY_DESCRIPTOR *pAllocateArray)#
Create an array memory pointer on the device.
See also
- Parameters:
pHandle – [out] Pointer to the array memory
pAllocateArray – [in] Requested array desciptor
- Returns:
hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-
hipError_t hipArrayDestroy(hipArray_t array)#
Destroy an array memory pointer on the device.
See also
- Parameters:
array – [in] Pointer to the array memory
- Returns:
hipSuccess, hipErrorInvalidValue
-
hipError_t hipArray3DCreate(hipArray_t *array, const HIP_ARRAY3D_DESCRIPTOR *pAllocateArray)#
Create a 3D array memory pointer on the device.
See also
- Parameters:
array – [out] Pointer to the 3D array memory
pAllocateArray – [in] Requested array desciptor
- Returns:
hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-
hipError_t hipMalloc3D(hipPitchedPtr *pitchedDevPtr, hipExtent extent)#
Create a 3D memory pointer on the device.
See also
- Parameters:
pitchedDevPtr – [out] Pointer to the 3D memory
extent – [in] Requested extent
- Returns:
hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-
hipError_t hipFreeArray(hipArray_t array)#
Frees an array on the device.
See also
hipMalloc, hipMallocPitch, hipFree, hipMallocArray, hipHostMalloc, hipHostFree
- Parameters:
array – [in] Pointer to array to free
- Returns:
hipSuccess, hipErrorInvalidValue, hipErrorNotInitialized
-
hipError_t hipMalloc3DArray(hipArray_t *array, const struct hipChannelFormatDesc *desc, struct hipExtent extent, unsigned int flags)#
Allocate an array on the device.
See also
hipMalloc, hipMallocPitch, hipFree, hipFreeArray, hipHostMalloc, hipHostFree
- Parameters:
array – [out] Pointer to allocated array in device memory
desc – [in] Requested channel format
extent – [in] Requested array allocation width, height and depth
flags – [in] Requested properties of allocated array
- Returns:
hipSuccess, hipErrorOutOfMemory
-
hipError_t hipArrayGetInfo(hipChannelFormatDesc *desc, hipExtent *extent, unsigned int *flags, hipArray_t array)#
Gets info about the specified array.
See also
- Parameters:
desc – [out] - Returned array type
extent – [out] - Returned array shape. 2D arrays will have depth of zero
flags – [out] - Returned array flags
array – [in] - The HIP array to get info for
- Returns:
hipSuccess, hipErrorInvalidValue hipErrorInvalidHandle
-
hipError_t hipArrayGetDescriptor(HIP_ARRAY_DESCRIPTOR *pArrayDescriptor, hipArray_t array)#
Gets a 1D or 2D array descriptor.
See also
hipArray3DCreate, hipArray3DGetDescriptor, hipArrayCreate, hipArrayDestroy, hipMemAlloc, hipMemAllocHost, hipMemAllocPitch, hipMemcpy2D, hipMemcpy2DAsync, hipMemcpy2DUnaligned, hipMemcpy3D, hipMemcpy3DAsync, hipMemcpyAtoA, hipMemcpyAtoD, hipMemcpyAtoH, hipMemcpyAtoHAsync, hipMemcpyDtoA, hipMemcpyDtoD, hipMemcpyDtoDAsync, hipMemcpyDtoH, hipMemcpyDtoHAsync, hipMemcpyHtoA, hipMemcpyHtoAAsync, hipMemcpyHtoD, hipMemcpyHtoDAsync, hipMemFree, hipMemFreeHost, hipMemGetAddressRange, hipMemGetInfo, hipMemHostAlloc, hipMemHostGetDevicePointer, hipMemsetD8, hipMemsetD16, hipMemsetD32, hipArrayGetInfo
- Parameters:
pArrayDescriptor – [out] - Returned array descriptor
array – [in] - Array to get descriptor of
- Returns:
hipSuccess, hipErrorDeinitialized, hipErrorNotInitialized, hipErrorInvalidContext, hipErrorInvalidValue hipErrorInvalidHandle
-
hipError_t hipArray3DGetDescriptor(HIP_ARRAY3D_DESCRIPTOR *pArrayDescriptor, hipArray_t array)#
Gets a 3D array descriptor.
See also
hipArray3DCreate, hipArrayCreate, hipArrayDestroy, hipArrayGetDescriptor, hipMemAlloc, hipMemAllocHost, hipMemAllocPitch, hipMemcpy2D, hipMemcpy2DAsync, hipMemcpy2DUnaligned, hipMemcpy3D, hipMemcpy3DAsync, hipMemcpyAtoA, hipMemcpyAtoD, hipMemcpyAtoH, hipMemcpyAtoHAsync, hipMemcpyDtoA, hipMemcpyDtoD, hipMemcpyDtoDAsync, hipMemcpyDtoH, hipMemcpyDtoHAsync, hipMemcpyHtoA, hipMemcpyHtoAAsync, hipMemcpyHtoD, hipMemcpyHtoDAsync, hipMemFree, hipMemFreeHost, hipMemGetAddressRange, hipMemGetInfo, hipMemHostAlloc, hipMemHostGetDevicePointer, hipMemsetD8, hipMemsetD16, hipMemsetD32, hipArrayGetInfo
- Parameters:
pArrayDescriptor – [out] - Returned 3D array descriptor
array – [in] - 3D array to get descriptor of
- Returns:
hipSuccess, hipErrorDeinitialized, hipErrorNotInitialized, hipErrorInvalidContext, hipErrorInvalidValue hipErrorInvalidHandle, hipErrorContextIsDestroyed
-
hipError_t hipMemcpy2D(void *dst, size_t dpitch, const void *src, size_t spitch, size_t width, size_t height, hipMemcpyKind kind)#
Copies data between host and device.
See also
hipMemcpy, hipMemcpyToArray, hipMemcpy2DToArray, hipMemcpyFromArray, hipMemcpyToSymbol, hipMemcpyAsync
- Parameters:
dst – [in] Destination memory address
dpitch – [in] Pitch of destination memory
src – [in] Source memory address
spitch – [in] Pitch of source memory
width – [in] Width of matrix transfer (columns in bytes)
height – [in] Height of matrix transfer (rows)
kind – [in] Type of transfer
- Returns:
hipSuccess, hipErrorInvalidValue, hipErrorInvalidPitchValue, hipErrorInvalidDevicePointer, hipErrorInvalidMemcpyDirection
-
hipError_t hipMemcpyParam2D(const hip_Memcpy2D *pCopy)#
Copies memory for 2D arrays.
See also
hipMemcpy, hipMemcpy2D, hipMemcpyToArray, hipMemcpy2DToArray, hipMemcpyFromArray, hipMemcpyToSymbol, hipMemcpyAsync
- Parameters:
pCopy – [in] Parameters for the memory copy
- Returns:
hipSuccess, hipErrorInvalidValue, hipErrorInvalidPitchValue, hipErrorInvalidDevicePointer, hipErrorInvalidMemcpyDirection
-
hipError_t hipMemcpyParam2DAsync(const hip_Memcpy2D *pCopy, hipStream_t stream)#
Copies memory for 2D arrays.
See also
hipMemcpy, hipMemcpy2D, hipMemcpyToArray, hipMemcpy2DToArray, hipMemcpyFromArray, hipMemcpyToSymbol, hipMemcpyAsync
- Parameters:
pCopy – [in] Parameters for the memory copy
stream – [in] Stream to use
- Returns:
hipSuccess, hipErrorInvalidValue, hipErrorInvalidPitchValue, hipErrorInvalidDevicePointer, hipErrorInvalidMemcpyDirection
-
hipError_t hipMemcpy2DAsync(void *dst, size_t dpitch, const void *src, size_t spitch, size_t width, size_t height, hipMemcpyKind kind, hipStream_t stream)#
Copies data between host and device.
See also
hipMemcpy, hipMemcpyToArray, hipMemcpy2DToArray, hipMemcpyFromArray, hipMemcpyToSymbol, hipMemcpyAsync
- Parameters:
dst – [in] Destination memory address
dpitch – [in] Pitch of destination memory
src – [in] Source memory address
spitch – [in] Pitch of source memory
width – [in] Width of matrix transfer (columns in bytes)
height – [in] Height of matrix transfer (rows)
kind – [in] Type of transfer
stream – [in] Stream to use
- Returns:
hipSuccess, hipErrorInvalidValue, hipErrorInvalidPitchValue, hipErrorInvalidDevicePointer, hipErrorInvalidMemcpyDirection
-
hipError_t hipMemcpy2DToArray(hipArray_t dst, size_t wOffset, size_t hOffset, const void *src, size_t spitch, size_t width, size_t height, hipMemcpyKind kind)#
Copies data between host and device.
See also
hipMemcpy, hipMemcpyToArray, hipMemcpy2D, hipMemcpyFromArray, hipMemcpyToSymbol, hipMemcpyAsync
- Parameters:
dst – [in] Destination memory address
wOffset – [in] Destination starting X offset
hOffset – [in] Destination starting Y offset
src – [in] Source memory address
spitch – [in] Pitch of source memory
width – [in] Width of matrix transfer (columns in bytes)
height – [in] Height of matrix transfer (rows)
kind – [in] Type of transfer
- Returns:
hipSuccess, hipErrorInvalidValue, hipErrorInvalidPitchValue, hipErrorInvalidDevicePointer, hipErrorInvalidMemcpyDirection
-
hipError_t hipMemcpy2DToArrayAsync(hipArray_t dst, size_t wOffset, size_t hOffset, const void *src, size_t spitch, size_t width, size_t height, hipMemcpyKind kind, hipStream_t stream)#
Copies data between host and device.
See also
hipMemcpy, hipMemcpyToArray, hipMemcpy2D, hipMemcpyFromArray, hipMemcpyToSymbol, hipMemcpyAsync
- Parameters:
dst – [in] Destination memory address
wOffset – [in] Destination starting X offset
hOffset – [in] Destination starting Y offset
src – [in] Source memory address
spitch – [in] Pitch of source memory
width – [in] Width of matrix transfer (columns in bytes)
height – [in] Height of matrix transfer (rows)
kind – [in] Type of transfer
stream – [in] Accelerator view which the copy is being enqueued
- Returns:
hipSuccess, hipErrorInvalidValue, hipErrorInvalidPitchValue, hipErrorInvalidDevicePointer, hipErrorInvalidMemcpyDirection
-
hipError_t hipMemcpy2DArrayToArray(hipArray_t dst, size_t wOffsetDst, size_t hOffsetDst, hipArray_const_t src, size_t wOffsetSrc, size_t hOffsetSrc, size_t width, size_t height, hipMemcpyKind kind)#
Copies data between host and device.
See also
hipMemcpy, hipMemcpyToArray, hipMemcpy2D, hipMemcpyFromArray, hipMemcpyToSymbol, hipMemcpyAsync
- Parameters:
dst – [in] Destination memory address
wOffsetDst – [in] Destination starting X offset
hOffsetDst – [in] Destination starting Y offset
src – [in] Source memory address
wOffsetSrc – [in] Source starting X offset
hOffsetSrc – [in] Source starting Y offset (columns in bytes)
width – [in] Width of matrix transfer (columns in bytes)
height – [in] Height of matrix transfer (rows)
kind – [in] Type of transfer
- Returns:
hipSuccess, hipErrorInvalidValue, hipErrorInvalidMemcpyDirection
-
hipError_t hipMemcpyToArray(hipArray_t dst, size_t wOffset, size_t hOffset, const void *src, size_t count, hipMemcpyKind kind)#
Copies data between host and device.
See also
hipMemcpy, hipMemcpy2DToArray, hipMemcpy2D, hipMemcpyFromArray, hipMemcpyToSymbol, hipMemcpyAsync
Warning
This API is deprecated.
- Parameters:
dst – [in] Destination memory address
wOffset – [in] Destination starting X offset
hOffset – [in] Destination starting Y offset
src – [in] Source memory address
count – [in] size in bytes to copy
kind – [in] Type of transfer
- Returns:
hipSuccess, hipErrorInvalidValue, hipErrorInvalidPitchValue, hipErrorInvalidDevicePointer, hipErrorInvalidMemcpyDirection
-
hipError_t hipMemcpyFromArray(void *dst, hipArray_const_t srcArray, size_t wOffset, size_t hOffset, size_t count, hipMemcpyKind kind)#
Copies data between host and device.
See also
hipMemcpy, hipMemcpy2DToArray, hipMemcpy2D, hipMemcpyFromArray, hipMemcpyToSymbol, hipMemcpyAsync
Warning
This API is deprecated.
- Parameters:
dst – [in] Destination memory address
srcArray – [in] Source memory address
wOffset – [in] Source starting X offset
hOffset – [in] Source starting Y offset
count – [in] Size in bytes to copy
kind – [in] Type of transfer
- Returns:
hipSuccess, hipErrorInvalidValue, hipErrorInvalidPitchValue, hipErrorInvalidDevicePointer, hipErrorInvalidMemcpyDirection
-
hipError_t hipMemcpy2DFromArray(void *dst, size_t dpitch, hipArray_const_t src, size_t wOffset, size_t hOffset, size_t width, size_t height, hipMemcpyKind kind)#
Copies data between host and device.
See also
hipMemcpy, hipMemcpy2DToArray, hipMemcpy2D, hipMemcpyFromArray, hipMemcpyToSymbol, hipMemcpyAsync
- Parameters:
dst – [in] Destination memory address
dpitch – [in] Pitch of destination memory
src – [in] Source memory address
wOffset – [in] Source starting X offset
hOffset – [in] Source starting Y offset
width – [in] Width of matrix transfer (columns in bytes)
height – [in] Height of matrix transfer (rows)
kind – [in] Type of transfer
- Returns:
hipSuccess, hipErrorInvalidValue, hipErrorInvalidPitchValue, hipErrorInvalidDevicePointer, hipErrorInvalidMemcpyDirection
-
hipError_t hipMemcpy2DFromArrayAsync(void *dst, size_t dpitch, hipArray_const_t src, size_t wOffset, size_t hOffset, size_t width, size_t height, hipMemcpyKind kind, hipStream_t stream)#
Copies data between host and device asynchronously.
See also
hipMemcpy, hipMemcpy2DToArray, hipMemcpy2D, hipMemcpyFromArray, hipMemcpyToSymbol, hipMemcpyAsync
- Parameters:
dst – [in] Destination memory address
dpitch – [in] Pitch of destination memory
src – [in] Source memory address
wOffset – [in] Source starting X offset
hOffset – [in] Source starting Y offset
width – [in] Width of matrix transfer (columns in bytes)
height – [in] Height of matrix transfer (rows)
kind – [in] Type of transfer
stream – [in] Accelerator view which the copy is being enqueued
- Returns:
hipSuccess, hipErrorInvalidValue, hipErrorInvalidPitchValue, hipErrorInvalidDevicePointer, hipErrorInvalidMemcpyDirection
-
hipError_t hipMemcpyAtoH(void *dst, hipArray_t srcArray, size_t srcOffset, size_t count)#
Copies data between host and device.
See also
hipMemcpy, hipMemcpy2DToArray, hipMemcpy2D, hipMemcpyFromArray, hipMemcpyToSymbol, hipMemcpyAsync
- Parameters:
dst – [in] Destination memory address
srcArray – [in] Source array
srcOffset – [in] Offset in bytes of source array
count – [in] Size of memory copy in bytes
- Returns:
hipSuccess, hipErrorInvalidValue, hipErrorInvalidPitchValue, hipErrorInvalidDevicePointer, hipErrorInvalidMemcpyDirection
-
hipError_t hipMemcpyHtoA(hipArray_t dstArray, size_t dstOffset, const void *srcHost, size_t count)#
Copies data between host and device.
See also
hipMemcpy, hipMemcpy2DToArray, hipMemcpy2D, hipMemcpyFromArray, hipMemcpyToSymbol, hipMemcpyAsync
- Parameters:
dstArray – [in] Destination memory address
dstOffset – [in] Offset in bytes of destination array
srcHost – [in] Source host pointer
count – [in] Size of memory copy in bytes
- Returns:
hipSuccess, hipErrorInvalidValue, hipErrorInvalidPitchValue, hipErrorInvalidDevicePointer, hipErrorInvalidMemcpyDirection
-
hipError_t hipMemcpy3D(const struct hipMemcpy3DParms *p)#
Copies data between host and device.
See also
hipMemcpy, hipMemcpy2DToArray, hipMemcpy2D, hipMemcpyFromArray, hipMemcpyToSymbol, hipMemcpyAsync
- Parameters:
p – [in] 3D memory copy parameters
- Returns:
hipSuccess, hipErrorInvalidValue, hipErrorInvalidPitchValue, hipErrorInvalidDevicePointer, hipErrorInvalidMemcpyDirection
-
hipError_t hipMemcpy3DAsync(const struct hipMemcpy3DParms *p, hipStream_t stream)#
Copies data between host and device asynchronously.
See also
hipMemcpy, hipMemcpy2DToArray, hipMemcpy2D, hipMemcpyFromArray, hipMemcpyToSymbol, hipMemcpyAsync
- Parameters:
p – [in] 3D memory copy parameters
stream – [in] Stream to use
- Returns:
hipSuccess, hipErrorInvalidValue, hipErrorInvalidPitchValue, hipErrorInvalidDevicePointer, hipErrorInvalidMemcpyDirection
-
hipError_t hipDrvMemcpy3D(const HIP_MEMCPY3D *pCopy)#
Copies data between host and device.
See also
hipMemcpy, hipMemcpy2DToArray, hipMemcpy2D, hipMemcpyFromArray, hipMemcpyToSymbol, hipMemcpyAsync
- Parameters:
pCopy – [in] 3D memory copy parameters
- Returns:
hipSuccess, hipErrorInvalidValue, hipErrorInvalidPitchValue, hipErrorInvalidDevicePointer, hipErrorInvalidMemcpyDirection
-
hipError_t hipDrvMemcpy3DAsync(const HIP_MEMCPY3D *pCopy, hipStream_t stream)#
Copies data between host and device asynchronously.
See also
hipMemcpy, hipMemcpy2DToArray, hipMemcpy2D, hipMemcpyFromArray, hipMemcpyToSymbol, hipMemcpyAsync
- Parameters:
pCopy – [in] 3D memory copy parameters
stream – [in] Stream to use
- Returns:
hipSuccess, hipErrorInvalidValue, hipErrorInvalidPitchValue, hipErrorInvalidDevicePointer, hipErrorInvalidMemcpyDirection
External Resource Interoperability#
-
hipError_t hipImportExternalSemaphore(hipExternalSemaphore_t *extSem_out, const hipExternalSemaphoreHandleDesc *semHandleDesc)#
Imports an external semaphore.
See also
- Parameters:
extSem_out – [out] External semaphores to be waited on
semHandleDesc – [in] Semaphore import handle descriptor
- Returns:
hipSuccess, hipErrorInvalidDevice, hipErrorInvalidValue
-
hipError_t hipSignalExternalSemaphoresAsync(const hipExternalSemaphore_t *extSemArray, const hipExternalSemaphoreSignalParams *paramsArray, unsigned int numExtSems, hipStream_t stream)#
Signals a set of external semaphore objects.
See also
- Parameters:
extSemArray – [in] External semaphores to be waited on
paramsArray – [in] Array of semaphore parameters
numExtSems – [in] Number of semaphores to wait on
stream – [in] Stream to enqueue the wait operations in
- Returns:
hipSuccess, hipErrorInvalidDevice, hipErrorInvalidValue
-
hipError_t hipWaitExternalSemaphoresAsync(const hipExternalSemaphore_t *extSemArray, const hipExternalSemaphoreWaitParams *paramsArray, unsigned int numExtSems, hipStream_t stream)#
Waits on a set of external semaphore objects.
See also
- Parameters:
extSemArray – [in] External semaphores to be waited on
paramsArray – [in] Array of semaphore parameters
numExtSems – [in] Number of semaphores to wait on
stream – [in] Stream to enqueue the wait operations in
- Returns:
hipSuccess, hipErrorInvalidDevice, hipErrorInvalidValue
-
hipError_t hipDestroyExternalSemaphore(hipExternalSemaphore_t extSem)#
Destroys an external semaphore object and releases any references to the underlying resource. Any outstanding signals or waits must have completed before the semaphore is destroyed.
See also
- Parameters:
extSem – [in] handle to an external memory object
- Returns:
hipSuccess, hipErrorInvalidDevice, hipErrorInvalidValue
-
hipError_t hipImportExternalMemory(hipExternalMemory_t *extMem_out, const hipExternalMemoryHandleDesc *memHandleDesc)#
Imports an external memory object.
See also
- Parameters:
extMem_out – [out] Returned handle to an external memory object
memHandleDesc – [in] Memory import handle descriptor
- Returns:
hipSuccess, hipErrorInvalidDevice, hipErrorInvalidValue
-
hipError_t hipExternalMemoryGetMappedBuffer(void **devPtr, hipExternalMemory_t extMem, const hipExternalMemoryBufferDesc *bufferDesc)#
Maps a buffer onto an imported memory object.
See also
- Parameters:
devPtr – [out] Returned device pointer to buffer
extMem – [in] Handle to external memory object
bufferDesc – [in] Buffer descriptor
- Returns:
hipSuccess, hipErrorInvalidDevice, hipErrorInvalidValue
-
hipError_t hipDestroyExternalMemory(hipExternalMemory_t extMem)#
Destroys an external memory object.
See also
- Parameters:
extMem – [in] External memory object to be destroyed
- Returns:
hipSuccess, hipErrorInvalidDevice, hipErrorInvalidValue
-
hipError_t hipExternalMemoryGetMappedMipmappedArray(hipMipmappedArray_t *mipmap, hipExternalMemory_t extMem, const hipExternalMemoryMipmappedArrayDesc *mipmapDesc)#
Maps a mipmapped array onto an external memory object.
Returned mipmapped array must be freed using hipFreeMipmappedArray.
See also
hipImportExternalMemory, hipDestroyExternalMemory, hipExternalMemoryGetMappedBuffer, hipFreeMipmappedArray
- Parameters:
mipmap – [out] mipmapped array to return
extMem – [in] external memory object handle
mipmapDesc – [in] external mipmapped array descriptor
- Returns:
hipSuccess, hipErrorInvalidValue, hipErrorInvalidResourceHandle
Register Keyword#
The register keyword is deprecated in C++, and is silently ignored by both NVCC and HIP-Clang. You can pass the option -Wdeprecated-register
the compiler warning message.
Pragma Unroll#
Unroll with a bounds that is known at compile-time is supported. For example:
#pragma unroll 16 /* hint to compiler to unroll next loop by 16 */
for (int i=0; i<16; i++) ...
#pragma unroll 1 /* tell compiler to never unroll the loop */
for (int i=0; i<16; i++) ...
#pragma unroll /* hint to compiler to completely unroll next loop. */
for (int i=0; i<16; i++) ...
In-Line Assembly#
GCN ISA In-line assembly, is supported. For example:
asm volatile ("v_mac_f32_e32 %0, %2, %3" : "=v" (out[i]) : "0"(out[i]), "v" (a), "v" (in[i]));
We insert the GCN isa into the kernel using asm()
Assembler statement.
volatile
keyword is used so that the optimizers must not change the number of volatile operations or change their order of execution relative to other volatile operations.
v_mac_f32_e32
is the GCN instruction, for more information please refer - [AMD GCN3 ISA architecture manual](http://gpuopen.com/compute-product/amd-gcn3-isa-architecture-manual/)
Index for the respective operand in the ordered fashion is provided by % followed by position in the list of operands
“v” is the constraint code (for target-specific AMDGPU) for 32-bit VGPR register, for more info please refer - [Supported Constraint Code List for AMDGPU](https://llvm.org/docs/LangRef.html#supported-constraint-code-list)
Output Constraints are specified by an “=” prefix as shown above (“=v”). This indicate that assembly will write to this operand, and the operand will then be made available as a return value of the asm
expression. Input constraints do not have a prefix - just the constraint code. The constraint string of “0” says to use the assigned register for output as an input as well (it being the 0’th constraint).
## C++ Support The following C++ features are not supported: - Run-time-type information (RTTI) - Try/catch - Virtual functions Virtual functions are not supported if objects containing virtual function tables are passed between GPU’s of different offload arch’s, e.g. between gfx906 and gfx1030. Otherwise virtual functions are supported.
Kernel Compilation#
hipcc now supports compiling C++/HIP kernels to binary code objects.
The file format for binary is .co
which means Code Object. The following command builds the code object using hipcc
.
hipcc --genco --offload-arch=[TARGET GPU] [INPUT FILE] -o [OUTPUT FILE]
[TARGET GPU] = GPU architecture
[INPUT FILE] = Name of the file containing kernels
[OUTPUT FILE] = Name of the generated code object file
Note
When using binary code objects is that the number of arguments to the kernel is different on HIP-Clang and NVCC path. Refer to the HIP module_api sample for differences in the arguments to be passed to the kernel.
gfx-arch-specific-kernel#
Clang defined ‘__gfx*__’ macros can be used to execute gfx arch specific codes inside the kernel. Refer to the sample in HIP 14_gpu_arch sample.