Basic Usage and Examples#

2023-06-23

Applies to Linux

Advanced Micro Devices, Inc.

This chapter explains how to use HIP Python’s main interfaces. The usage of the CUDA® interoperability layer is discussed in a separate chapter. We first aim to give an introduction to the Python API of HIP Python by means of basic examples before discussing the Cython API in the last sections of this chapter.

Note

All examples in this chapter have been tested with ROCm™ 5.4.3 on Ubuntu 22. The License applies to all examples in this chapter.

Basic Usage (Python)#

What will I learn?

How to use HIP Python modules in your Python code.

After installing the HIP Python package hip-python, you can import the individual modules that you need as shown below:

Listing 1 Importing HIP Python Modules#

from hip import hip
from hip import hiprtc
# ...

And you are ready to go!

Obtaining Device Properties#

What will I learn?

How I can obtain device attributes/properties via hipGetDeviceProperties.
How I can obtain device attributes/properties via hipDeviceGetAttribute.

Obtaining device properties such as the architecture or the number of compute units is important for many applications.

Via `hipGetDeviceProperties`#

A number of device properties can be obtained via the hipDeviceProp_t object. After creation (line 12) this object must be passed to the hipGetDeviceProperties routine (line 13). The second argument (0) is the device number.

Running the example below will print out the values of all queried device properties before the program eventually prints "ok" and quits.

Note

The hip_check routine in the snippet unpacks the result tuple (HIP Python routines always return a tuple), checks the error code stored in the first element, and finally returns the rest of the tuple, either as a single value or as a tuple without the error code. Such error check routines are used throughout this and the following sections.
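
To make the unpacking behaviour concrete without a GPU, the following sketch runs a simplified variant of such a check on hand-made result tuples. Status and check are hypothetical stand-ins for an error enum such as hip.hipError_t and for hip_check; they are not part of HIP Python.

```python
import enum

class Status(enum.IntEnum):
    """Hypothetical stand-in for an error enum such as hip.hipError_t."""
    SUCCESS = 0
    INVALID_VALUE = 1

def check(call_result):
    """Simplified variant of hip_check that works on the stand-in enum."""
    err, *rest = call_result
    if isinstance(err, Status) and err != Status.SUCCESS:
        raise RuntimeError(str(err))
    # A single payload is returned as-is, anything else as a tuple.
    return rest[0] if len(rest) == 1 else tuple(rest)

single = check((Status.SUCCESS, 42))    # one payload -> plain value
pair = check((Status.SUCCESS, 1, 2))    # two payloads -> tuple
nothing = check((Status.SUCCESS,))      # no payload -> empty tuple

try:
    check((Status.INVALID_VALUE, None))
    failed = False
except RuntimeError:
    failed = True

print(single, pair, nothing, failed)
```

Returning a plain value for single-payload results is what lets the listings write, for example, x_d = hip_check(hip.hipMalloc(num_bytes)) without an extra index.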

Listing 2 Obtaining Device Properties via hipGetDeviceProperties#

from hip import hip

def hip_check(call_result):
    err = call_result[0]
    result = call_result[1:]
    if len(result) == 1:
        result = result[0]
    if isinstance(err, hip.hipError_t) and err != hip.hipError_t.hipSuccess:
        raise RuntimeError(str(err))
    return result

props = hip.hipDeviceProp_t()
hip_check(hip.hipGetDeviceProperties(props,0))

for attrib in sorted(props.PROPERTIES()):
    print(f"props.{attrib}={getattr(props,attrib)}")
print("ok")

Via `hipDeviceGetAttribute`#

You can also obtain some of the properties that appeared in the previous example, plus a number of additional properties, via the hipDeviceGetAttribute routine as shown in the example below (line 23). In the example, we query integer-type device attributes/properties. The respective attribute, the first argument, is passed as an enum constant of type hipDeviceAttribute_t; the second argument is the device number. The attribute value is returned in the result tuple, after the error code.

Running this example will print out the values of all queried device attributes before the program prints "ok" and quits.

Listing 3 Obtaining Device Properties via hipDeviceGetAttribute#

from hip import hip

def hip_check(call_result):
    err = call_result[0]
    result = call_result[1:]
    if len(result) == 1:
        result = result[0]
    if isinstance(err, hip.hipError_t) and err != hip.hipError_t.hipSuccess:
        raise RuntimeError(str(err))
    return result

device_num = 0

for attrib in (
   hip.hipDeviceAttribute_t.hipDeviceAttributeMaxBlockDimX,
   hip.hipDeviceAttribute_t.hipDeviceAttributeMaxBlockDimY,
   hip.hipDeviceAttribute_t.hipDeviceAttributeMaxBlockDimZ,
   hip.hipDeviceAttribute_t.hipDeviceAttributeMaxGridDimX,
   hip.hipDeviceAttribute_t.hipDeviceAttributeMaxGridDimY,
   hip.hipDeviceAttribute_t.hipDeviceAttributeMaxGridDimZ,
   hip.hipDeviceAttribute_t.hipDeviceAttributeWarpSize,
):
    value = hip_check(hip.hipDeviceGetAttribute(attrib,device_num))
    print(f"{attrib.name}: {value}")
print("ok")

HIP Streams#

What will I learn?

How I can use HIP Python’s hipStream_t objects and the associated HIP Python routines.
That I can directly pass Python 3 array objects to HIP runtime routines such as hipMemcpy and hipMemcpyAsync.

Streams are an important concept in HIP. They allow you to overlap host and device work as well as device computations with data movement to or from that same device.

The example below showcases how to use HIP Python’s hipStream_t objects and the associated HIP Python routines. It further demonstrates that you can pass Python 3 array.array objects directly to HIP Python interfaces that expect a host buffer. One example of such an interface is hipMemcpyAsync (lines 23 and 25).

Listing 4 HIP Streams#

import ctypes
import random
import array

from hip import hip

def hip_check(call_result):
    err = call_result[0]
    result = call_result[1:]
    if len(result) == 1:
        result = result[0]
    if isinstance(err, hip.hipError_t) and err != hip.hipError_t.hipSuccess:
        raise RuntimeError(str(err))
    return result

# inputs
n = 100
x_h = array.array("i",[int(random.random()*10) for i in range(0,n)])
num_bytes = x_h.itemsize * len(x_h)
x_d = hip_check(hip.hipMalloc(num_bytes))

stream = hip_check(hip.hipStreamCreate())
hip_check(hip.hipMemcpyAsync(x_d,x_h,num_bytes,hip.hipMemcpyKind.hipMemcpyHostToDevice,stream))
hip_check(hip.hipMemsetAsync(x_d,0,num_bytes,stream))
hip_check(hip.hipMemcpyAsync(x_h,x_d,num_bytes,hip.hipMemcpyKind.hipMemcpyDeviceToHost,stream))
hip_check(hip.hipStreamSynchronize(stream))
hip_check(hip.hipStreamDestroy(stream))

# deallocate device data 
hip_check(hip.hipFree(x_d))

for i,x in enumerate(x_h):
    if x != 0:
        raise ValueError(f"expected '0' for element {i}, is: '{x}'")
print("ok")

What is happening?

A host buffer is filled with random numbers (line 18) before it is asynchronously copied to the device (line 23), where an asynchronous hipMemsetAsync (same stream) resets all bytes to 0 (line 24).
An asynchronous memcpy (same stream) is then issued to copy the device data back to the host (line 25). All operations within the stream are executed in order.
As the ~Async operations are non-blocking, the host waits via hipStreamSynchronize until operations in the stream have been completed (line 26) before destroying the stream (line 27).
Eventually the program deallocates device data via hipFree and checks if all bytes in the host buffer are now set to 0. If so, it quits with an “ok”.

Launching Kernels#

What will I learn?

How I can compile a HIP C++ kernel at runtime via hiprtcCompileProgram.
How I can launch kernels via hipModuleLaunchKernel.

HIP Python does not provide the necessary infrastructure to express device code in native Python. However, you can compile and launch kernels from within Python code via the just-in-time (JIT) compilation interface provided by HIP Python module hiprtc together with the kernel launch routines provided by HIP Python module hip. The example below demonstrates how to do so.

Listing 5 Compiling and Launching Kernels#

from hip import hip, hiprtc

def hip_check(call_result):
    err = call_result[0]
    result = call_result[1:]
    if len(result) == 1:
        result = result[0]
    if isinstance(err, hip.hipError_t) and err != hip.hipError_t.hipSuccess:
        raise RuntimeError(str(err))
    elif (
        isinstance(err, hiprtc.hiprtcResult)
        and err != hiprtc.hiprtcResult.HIPRTC_SUCCESS
    ):
        raise RuntimeError(str(err))
    return result


source = b"""\
extern "C" __global__ void print_tid() {
  printf("tid: %d\\n", (int) threadIdx.x);
}
"""

prog = hip_check(hiprtc.hiprtcCreateProgram(source, b"print_tid", 0, [], []))

props = hip.hipDeviceProp_t()
hip_check(hip.hipGetDeviceProperties(props,0))
arch = props.gcnArchName

print(f"Compiling kernel for {arch}")

cflags = [b"--offload-arch="+arch]
err, = hiprtc.hiprtcCompileProgram(prog, len(cflags), cflags)
if err != hiprtc.hiprtcResult.HIPRTC_SUCCESS:
    log_size = hip_check(hiprtc.hiprtcGetProgramLogSize(prog))
    log = bytearray(log_size)
    hip_check(hiprtc.hiprtcGetProgramLog(prog, log))
    raise RuntimeError(log.decode())
code_size = hip_check(hiprtc.hiprtcGetCodeSize(prog))
code = bytearray(code_size)
hip_check(hiprtc.hiprtcGetCode(prog, code))
module = hip_check(hip.hipModuleLoadData(code))
kernel = hip_check(hip.hipModuleGetFunction(module, b"print_tid"))
#
hip_check(
    hip.hipModuleLaunchKernel(
        kernel,
        *(1, 1, 1), # grid
        *(32, 1, 1),  # block
        sharedMemBytes=0,
        stream=None,
        kernelParams=None,
        extra=None,
    )
)

hip_check(hip.hipDeviceSynchronize())
hip_check(hip.hipModuleUnload(module))
hip_check(hiprtc.hiprtcDestroyProgram(prog.createRef()))

print("ok")

What is happening?

In the example, the kernel print_tid defined within the string source simply prints the block-local thread ID (threadIdx.x) for every thread running the kernel (line 20).
A program prog is then created in line 24 via hiprtcCreateProgram. We pass source as first argument, give the program a name (note the b".."), and specify zero headers and include names (last three arguments).
Next we query the architecture name via hipGetDeviceProperties (more details: Obtaining Device Properties) and use it in lines 32-33, where we specify compile flags (cflags) and compile prog via hiprtcCompileProgram. In case of a failure, we obtain the program log and raise it as RuntimeError.
In case of success, we query the code size via hiprtcGetCodeSize, create a buffer with that information, and then copy the code into this buffer via hiprtcGetCode. Afterwards, we load the code as module via hipModuleLoadData and then obtain our device kernel with name "print_tid" from it via hipModuleGetFunction.
This object is then passed as first argument to the hipModuleLaunchKernel routine, followed by the usual grid and block dimension triples, the size of the required shared memory, and the stream to use (None means the null stream). The last two arguments, kernelParams and extra, can be used for passing kernel arguments. We will look at how to pass kernel arguments via extra in the next section.
After the kernel launch, the host waits on completion via hipDeviceSynchronize and then unloads the code module again via hipModuleUnload before quitting with an "ok".

Kernels with Arguments#

What will I learn?

How I can pass arguments to hipModuleLaunchKernel.

One of the difficulties that programmers face when attempting to launch kernels via hipModuleLaunchKernel is passing arguments to the kernels. When using the extra argument, the kernel arguments must be aligned in a certain way. In C/C++ programs, one can simply put all arguments into a struct and let the compiler take care of the argument alignment. Similarly, one could create a ctypes.Structure in Python to do the same.
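
As an illustration of the alignment problem, the following sketch lets ctypes compute the layout for the argument list (float, int, short, int, float, float *) that the listing further below uses. ScaleVectorArgs is a hypothetical struct written for this illustration only; the pointer offset and total size depend on the platform's pointer size.

```python
import ctypes

class ScaleVectorArgs(ctypes.Structure):
    """Hypothetical argument struct for a kernel with the signature
    (float factor, int n, short unused1, int unused2, float unused3, float *x)."""
    _fields_ = [
        ("factor", ctypes.c_float),
        ("n", ctypes.c_int),
        ("unused1", ctypes.c_short),
        ("unused2", ctypes.c_int),    # preceded by 2 padding bytes
        ("unused3", ctypes.c_float),
        ("x", ctypes.c_void_p),       # padded up to pointer alignment
    ]

# Byte offset of every field as chosen by ctypes (native alignment rules).
offsets = {
    name: getattr(ScaleVectorArgs, name).offset
    for name, _ in ScaleVectorArgs._fields_
}
total_size = ctypes.sizeof(ScaleVectorArgs)

for name, offset in offsets.items():
    print(f"{name:8s} offset={offset}")
print(f"total size: {total_size} bytes")
```

The padding before unused2 and x is what the cumulative byte counts in the comments of the extra tuple in the listing account for.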

However, we do not want to burden HIP Python users with writing such glue code. Instead, users can directly pass a list or tuple of arguments to hipModuleLaunchKernel. The entries of these objects must either be of type DeviceArray (or be convertible to DeviceArray) or one of the ctypes types.

The former are typically the result of a hipMalloc call (or similar memory allocation routines). Please also see HIP Python’s Adapter Types for details on what other types can be converted to DeviceArray. The ctypes types are typically used to convert a scalar of the Python bool, int, and float scalar types to a fixed precision.

The example below demonstrates the usage of hipModuleLaunchKernel by means of a simple kernel that scales a vector by a factor. Here, we pass multiple arguments that require different alignments to the aforementioned routine in lines 85-90. We insert some additional unused* arguments into the extra tuple to stress the argument buffer allocator. Note the ctypes object construction for scalars and the direct passing of the device array x_d. Compare the argument list with the signature of the kernel defined in line 23. The example also introduces HIP Python’s dim3 struct (default value per dimension is 1), which can be unpacked just like a tuple or list.
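
The following GPU-free sketch shows why unpacking a dim3-like value works in a call that expects six separate grid and block integers. Dim3 and launch are hypothetical stand-ins written for this illustration; they only mimic the unpacking behaviour described above and are not HIP Python types.

```python
import math
from dataclasses import dataclass

@dataclass
class Dim3:
    """Hypothetical stand-in that mimics how a dim3-like value unpacks."""
    x: int = 1
    y: int = 1
    z: int = 1

    def __iter__(self):
        # Allows `*Dim3(...)` to expand into three positional arguments.
        return iter((self.x, self.y, self.z))

def launch(kernel, gx, gy, gz, bx, by, bz):
    """Hypothetical launcher that only reports the launch configuration."""
    return {"grid": (gx, gy, gz), "block": (bx, by, bz)}

n = 100
block = Dim3(x=32)
grid = Dim3(math.ceil(n / block.x))  # enough blocks to cover all n elements
config = launch("scale_vector", *grid, *block)
print(config)
```

Rounding the grid size up means the last block contains threads beyond n, which is why the kernel in the listing guards its write with tid < n.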

Listing 6 Compiling and Launching Kernels With Arguments#

import ctypes
import array
import random
import math

from hip import hip, hiprtc

def hip_check(call_result):
    err = call_result[0]
    result = call_result[1:]
    if len(result) == 1:
        result = result[0]
    if isinstance(err, hip.hipError_t) and err != hip.hipError_t.hipSuccess:
        raise RuntimeError(str(err))
    elif (
        isinstance(err, hiprtc.hiprtcResult)
        and err != hiprtc.hiprtcResult.HIPRTC_SUCCESS
    ):
        raise RuntimeError(str(err))
    return result

source = b"""\
extern "C" __global__ void scale_vector(float factor, int n, short unused1, int unused2, float unused3, float *x) {
  int tid = threadIdx.x + blockIdx.x * blockDim.x;
  if ( tid == 0 ) {
    printf("tid: %d, factor: %f, x*: %lu, n: %d, unused1: %d, unused2: %d, unused3: %f\\n",tid,factor,(unsigned long) x,n,(int) unused1,unused2,unused3);
  }
  if (tid < n) {
     x[tid] *= factor;
  }
}
"""

prog = hip_check(hiprtc.hiprtcCreateProgram(source, b"scale_vector", 0, [], []))

props = hip.hipDeviceProp_t()
hip_check(hip.hipGetDeviceProperties(props,0))
arch = props.gcnArchName

print(f"Compiling kernel for {arch}")

cflags = [b"--offload-arch="+arch]
err, = hiprtc.hiprtcCompileProgram(prog, len(cflags), cflags)
if err != hiprtc.hiprtcResult.HIPRTC_SUCCESS:
    log_size = hip_check(hiprtc.hiprtcGetProgramLogSize(prog))
    log = bytearray(log_size)
    hip_check(hiprtc.hiprtcGetProgramLog(prog, log))
    raise RuntimeError(log.decode())
code_size = hip_check(hiprtc.hiprtcGetCodeSize(prog))
code = bytearray(code_size)
hip_check(hiprtc.hiprtcGetCode(prog, code))
module = hip_check(hip.hipModuleLoadData(code))
kernel = hip_check(hip.hipModuleGetFunction(module, b"scale_vector"))

# kernel launch

## inputs
n = 100
x_h = array.array("f",[random.random() for i in range(0,n)])
num_bytes = x_h.itemsize * len(x_h)
x_d = hip_check(hip.hipMalloc(num_bytes))
print(f"{hex(int(x_d))=}")

## upload host data
hip_check(hip.hipMemcpy(x_d,x_h,num_bytes,hip.hipMemcpyKind.hipMemcpyHostToDevice))

factor = 1.23

## expected result
x_expected = [a*factor for a in x_h]

block = hip.dim3(x=32)
grid = hip.dim3(math.ceil(n/block.x))

## launch
hip_check(
    hip.hipModuleLaunchKernel(
        kernel,
        *grid,
        *block,
        sharedMemBytes=0,
        stream=None,
        kernelParams=None,
        extra=( 
          ctypes.c_float(factor), # 4 bytes
          ctypes.c_int(n),  # 8 bytes
          ctypes.c_short(5), # unused1, 10 bytes
          ctypes.c_int(2), # unused2, 16 bytes (+2 padding bytes)
          ctypes.c_float(5.6), # unused3 20 bytes
          x_d, # 32 bytes (+4 padding bytes)
        )
    )
)

# copy result back
hip_check(hip.hipMemcpy(x_h,x_d,num_bytes,hip.hipMemcpyKind.hipMemcpyDeviceToHost))

for i,x_h_i in enumerate(x_h):
    if not math.isclose(x_h_i,x_expected[i],rel_tol=1e-6):
        raise RuntimeError(f"values do not match, {x_h[i]=} vs. {x_expected[i]=}, {i=}")

hip_check(hip.hipFree(x_d))

hip_check(hip.hipModuleUnload(module))
hip_check(hiprtc.hiprtcDestroyProgram(prog.createRef()))

print("ok")

What is happening?

See the previous section Launching Kernels for a textual description of the main steps.

hipBLAS and NumPy Interoperability#

What will I learn?

How I can use HIP Python’s hipblas module.
That I can pass numpy arrays to HIP runtime routines such as hipMemcpy and hipMemcpyAsync.

This example demonstrates how to initialize and use HIP Python’s hipblas module. Furthermore, it shows that you can simply pass numpy arrays to HIP runtime routines such as hipMemcpy and hipMemcpyAsync. This works because some of HIP Python’s interfaces support automatic conversion from various types, in particular types that implement the Python buffer protocol. Numpy arrays implement the Python buffer protocol and can thus be passed directly to those interfaces.
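
To see what implementing the buffer protocol means in practice, the following standard-library sketch inspects a few objects via memoryview. describe_buffer is a hypothetical helper written for this illustration; it says nothing about how HIP Python performs its conversions internally.

```python
import array

def describe_buffer(obj):
    """Return (format, item size, total bytes) of a buffer-protocol object.

    Hypothetical helper; raises TypeError for objects without buffer support.
    """
    with memoryview(obj) as view:
        return view.format, view.itemsize, view.nbytes

x_h = array.array("f", [0.5, 1.5, 2.5, 3.5])
info_array = describe_buffer(x_h)
info_bytes = describe_buffer(bytearray(16))

try:
    describe_buffer([0.5, 1.5])  # a plain list does not expose a buffer
    list_supported = True
except TypeError:
    list_supported = False

print(info_array, info_bytes, list_supported)
```

A buffer exposes a contiguous memory region together with its element format and size, which is the information a copy routine needs from a host array.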

Listing 7 hipBLAS and NumPy Interoperability#

import ctypes
import math
import numpy as np

from hip import hip
from hip import hipblas

def hip_check(call_result):
    err = call_result[0]
    result = call_result[1:]
    if len(result) == 1:
        result = result[0]
    if isinstance(err,hip.hipError_t) and err != hip.hipError_t.hipSuccess:
        raise RuntimeError(str(err))
    elif isinstance(err,hipblas.hipblasStatus_t) and err != hipblas.hipblasStatus_t.HIPBLAS_STATUS_SUCCESS:
        raise RuntimeError(str(err))
    return result

num_elements = 100

# input data on host
alpha = ctypes.c_float(2)
x_h = np.random.rand(num_elements).astype(dtype=np.float32)
y_h = np.random.rand(num_elements).astype(dtype=np.float32)

# expected result
y_expected = alpha*x_h + y_h

# device vectors
num_bytes = num_elements * np.dtype(np.float32).itemsize
x_d = hip_check(hip.hipMalloc(num_bytes))
y_d = hip_check(hip.hipMalloc(num_bytes))

# copy input data to device
hip_check(hip.hipMemcpy(x_d,x_h,num_bytes,hip.hipMemcpyKind.hipMemcpyHostToDevice))
hip_check(hip.hipMemcpy(y_d,y_h,num_bytes,hip.hipMemcpyKind.hipMemcpyHostToDevice))

# call hipblasSaxpy + initialization & destruction of handle
handle = hip_check(hipblas.hipblasCreate())
hip_check(hipblas.hipblasSaxpy(handle, num_elements, ctypes.addressof(alpha), x_d, 1, y_d, 1))
hip_check(hipblas.hipblasDestroy(handle))

# copy result (stored in y_d) back to host (store in y_h)
hip_check(hip.hipMemcpy(y_h,y_d,num_bytes,hip.hipMemcpyKind.hipMemcpyDeviceToHost))

# compare to expected result
if np.allclose(y_expected,y_h):
    print("ok")
else:
    print("FAILED")
#print(f"{y_h=}")
#print(f"{y_expected=}")

# clean up
hip_check(hip.hipFree(x_d))
hip_check(hip.hipFree(y_d))

What is happening?

We initialize two float32-typed numpy arrays x_h and y_h on the host and fill them with random data (lines 23-24).
We compute the expected result on the host via numpy array operations (line 27).
We allocate device analogues for x_h and y_h (lines 31-32) and copy the host data over (line 35-36). Note that we can directly pass the numpy arrays x_h and y_h to hipMemcpy.
Before being able to call one of the compute routines of hipblas, it’s necessary to create a hipblas handle via hipblasCreate that will be passed to every hipblas routine as first argument (line 39).
In line 40 follows the call to hipblasSaxpy, where we pass the handle as first argument and the address of host ctypes.c_float variable alpha as third argument.
In line 41 the handle is destroyed via hipblasDestroy because it is not needed anymore.
The device data is downloaded in line 44, where we pass the numpy array y_h as destination array.
We compare the expected host result with the downloaded device result (lines 47-50) and print "ok" if all is fine.
Finally, we deallocate the device arrays in lines 55-56.

HIP Python Device Arrays#

What will I learn?

How I can change the shape and datatype of HIP Python’s DeviceArray objects.
How I can obtain subarrays from HIP Python’s DeviceArray objects — which are again of that type — via array subscript.

This example demonstrates how to configure the shape and data type of a DeviceArray returned by hipMalloc (and related routines). It further shows how to retrieve single elements / contiguous subarrays with respect to the specified type and shape information.

Listing 8 Configuring and Slicing HIP Python’s DeviceArray#

verbose = False

import ctypes

from hip import hip, hipblas
import numpy as np

def hip_check(call_result):
    err = call_result[0]
    result = call_result[1:]
    if len(result) == 1:
        result = result[0]
    if isinstance(err,hip.hipError_t) and err != hip.hipError_t.hipSuccess:
        raise RuntimeError(str(err))
    elif isinstance(err,hipblas.hipblasStatus_t) and err != hipblas.hipblasStatus_t.HIPBLAS_STATUS_SUCCESS:
        raise RuntimeError(str(err))
    return result

# init host array and fill with ones
shape = (3,20) # shape[1]: inner dim
x_h = np.ones(shape,dtype="float32")
num_bytes = x_h.size * x_h.itemsize

# init device array and upload host data
x_d = hip_check(hip.hipMalloc(num_bytes)).configure(
    typestr="float32",shape=shape
)
hip_check(hip.hipMemcpy(x_d,x_h,num_bytes,hip.hipMemcpyKind.hipMemcpyHostToDevice))

# scale device array entries by row index using hipblasSscal
handle = hip_check(hipblas.hipblasCreate())
for r in range(0,shape[0]):
    row = x_d[r,:] # extract subarray
    row_len = row.size
    alpha = ctypes.c_float(r)
    hip_check(hipblas.hipblasSscal(handle, row_len, ctypes.addressof(alpha), row, 1))
    hip_check(hip.hipDeviceSynchronize())
hip_check(hipblas.hipblasDestroy(handle))

# copy device data back to host
hip_check(hip.hipMemcpy(x_h,x_d,num_bytes,hip.hipMemcpyKind.hipMemcpyDeviceToHost))

# deallocate device data
hip_check(hip.hipFree(x_d))

for r in range(0,shape[0]):
    row_rounded = [round(el) for el in x_h[r,:]]
    for c,e in enumerate(row_rounded):
        if e != r:
            raise ValueError(f"expected '{r}' for element ({r},{c}), is '{e}'")
    if verbose:
        print("\t".join((str(i) for i in row_rounded))+"\n")
print("ok")

What is happening?

A two-dimensional row-major array of size (3,20) is created on the host. All elements are initialized to 1 (lines 20-21).
A device array with the same number of bytes is created on the device (line 25).
The device array is reconfigured to have float32 type and the shape of the host array (lines 25-27).
The host data is copied to the device array (line 28).
Within a loop over the row indices (index: r):
1. A pointer to the row with index r is created via array subscript (line 33). This yields row.
2. row is passed to a hipblasSscal call that scales all elements of the row by r (line 36). As all elements were initialized to 1, every element of the row then equals r.
Data is copied back from the device to the host array.
Finally, a check is performed on the host if the row values equal the respective row index (lines 46-52). The program quits with "ok" if all went well.
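
The same row-wise scaling can be mimicked on the host with the standard library, which may help to picture what a contiguous subarray of a row-major array is. This is only a host-side analogy under the assumption that a row subscript refers to the original memory rather than a copy; it uses memoryview slices, not DeviceArray.

```python
import array

rows, cols = 3, 20
flat = array.array("f", [1.0] * (rows * cols))  # row-major storage, all ones
view = memoryview(flat)

for r in range(rows):
    row = view[r * cols:(r + 1) * cols]  # contiguous view of row r, no copy
    for c in range(cols):
        row[c] = row[c] * r              # scale in place by the row index

# Read the rows back from the original storage to show it was modified.
result = [list(flat[r * cols:(r + 1) * cols]) for r in range(rows)]
print([row[0] for row in result])
```

Because the slices are views, the writes land in flat itself, just as the scaled rows end up in the one device allocation in the listing.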

Note

Please also see HIP Python’s Adapter Types for more details on the capabilities of type DeviceArray and the CUDA Array interface that it implements.

Monte Carlo with hipRAND#

What will I learn?

How I can create a hiprand random number generator via hiprandCreateGenerator.
How I can generate uniformly-distributed random numbers via hiprandGenerateUniformDouble.

This example uses hiprand to estimate $π$ by means of the Monte-Carlo method.

Background

The unit square has the area $1^{2}$, while the quarter of the unit circle that lies within it has the area $\frac{π}{4}$. Therefore, the ratio between the latter and the former area is $\frac{π}{4}$. Using the Monte-Carlo method, we randomly choose $N$ $(x, y)$-coordinates in the unit square. We then estimate the ratio of areas as the ratio between the number of samples located within the unit circle and the total number of samples $N$. Multiplying this ratio by 4 yields the estimate for $π$. The accuracy of the approach increases with $N$.
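
The estimator itself can be tried out without a GPU. The following sketch uses Python's random module in place of hiprand; estimate_pi and the fixed seed are choices made for this illustration only. A sample counts as inside if $x^{2} + y^{2} \leq 1$.

```python
import math
import random

def estimate_pi(n, seed=0):
    """Estimate pi from n uniformly distributed samples in the unit square."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    inside = 0
    for _ in range(n):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    # inside / n approximates pi / 4, so scale by 4.
    return 4 * inside / n

for n in (1_000, 10_000, 100_000):
    estimate = estimate_pi(n)
    print(f"{n:8d}\t{estimate:.6f}\t{abs(estimate - math.pi) / math.pi:.6f}")
```

The listing below follows the same idea but generates all samples on the device in one call and counts them with numpy array operations.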

Note

This example was derived from a similar example in the rocRAND repository on GitHub. See this repository for another higher-level interface to hiprand/rocrand (ctypes-based, no Cython interfaces).

Listing 9 Monte Carlo with hipRAND#

from hip import hip, hiprand
import numpy as np
import math

def hip_check(call_result):
    err = call_result[0]
    result = call_result[1:]
    if len(result) == 1:
        result = result[0]
    if isinstance(err, hiprand.hiprandStatus) and err != hiprand.hiprandStatus.HIPRAND_STATUS_SUCCESS:
        raise RuntimeError(str(err))
    if isinstance(err, hip.hipError_t) and err != hip.hipError_t.hipSuccess:
        raise RuntimeError(str(err))
    return result

print("Estimating Pi via the Monte Carlo method:\n")

def calculate_pi(n):
    """Calculate Pi for the given number of samples.
    """
    xy = np.empty(shape=(2, n)) # host array, default type is float64
    gen = hip_check(hiprand.hiprandCreateGenerator(hiprand.hiprandRngType.HIPRAND_RNG_PSEUDO_DEFAULT))
    xy_d = hip_check(hip.hipMalloc(xy.size*xy.itemsize)) # create same size device array
    hip_check(hiprand.hiprandGenerateUniformDouble(gen,xy_d,xy.size)) # generate device random numbers
    hip_check(hip.hipMemcpy(xy,xy_d,xy.size*xy.itemsize,hip.hipMemcpyKind.hipMemcpyDeviceToHost)) # copy to host
    hip_check(hip.hipFree(xy_d)) # free device array
    hip_check(hiprand.hiprandDestroyGenerator(gen))

    inside = xy[0]**2 + xy[1]**2 <= 1.0
    in_xy  = xy[:,  inside]
    estimate = 4*in_xy[0,:].size/n
    return estimate

print(f"#samples\testimate\trelative error")
n = 100
imax = 5
for i in range(1,imax):
    n *= 10
    estimate = calculate_pi(n)
    print(f"{n:12}\t{estimate:1.9f}\t{abs(estimate-math.pi)/math.pi:1.9f}")
print("ok")

What is happening?

Within a loop that per iteration multiplies the problem size n by 10 (lines 37-38), we call a function calculate_pi with n as argument, in which:

We first create a two-dimensional host array xy of type double with shape (2, n) (line 21).
We then create a generator of type HIPRAND_RNG_PSEUDO_DEFAULT via hiprandCreateGenerator (line 22).
We create a device array xy_d that stores the same number of bytes as xy (line 23).
We fill xy_d with random data via hiprandGenerateUniformDouble (line 24).
We then copy to xy from xy_d and free xy_d (lines 25-26) and destroy the generator (line 27).
We use numpy array operations to count the number of randomly generated $(x, y)$-coordinates within the unit circle (lines 29-30).
Finally, we compute the ratio estimate for the given n and return it (lines 31-32).

A simple complex FFT with hipFFT#

What will I learn?

How I can create a hipfft 1D plan via hipfftPlan1d.
How I can run a complex in-place forward FFT via hipfftExecZ2Z.

This example demonstrates the usage of HIP Python’s hipfft library.

We perform a double-complex-to-double-complex in-place forward FFT of a constant time signal $f (t) = 1 - 1 j$ of which we have $N$ samples. The resulting FFT coefficients are all zero — aside from the first one, which has the value $N - N j$ .
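
The expected coefficients can be checked on the host with a naive discrete Fourier transform. The sketch below assumes the unnormalized forward convention $X_k = \sum_{n} x_n e^{-2 π j k n / N}$, which is consistent with the result stated above; dft is a hypothetical helper and does not use hipfft.

```python
import cmath

def dft(samples):
    """Naive O(N^2) forward DFT with the sign convention exp(-2*pi*j*k*n/N)."""
    n_samples = len(samples)
    return [
        sum(s * cmath.exp(-2j * cmath.pi * k * n / n_samples)
            for n, s in enumerate(samples))
        for k in range(n_samples)
    ]

N = 8
coefficients = dft([1 - 1j] * N)  # constant signal f(t) = 1 - 1j
print(coefficients[0])
print(max(abs(c) for c in coefficients[1:]))
```

For a constant signal, all terms with $k \neq 0$ sum over a full set of roots of unity and cancel up to rounding error, so only the first coefficient remains.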

Listing 10 A simple complex FFT with hipFFT#

import numpy as np
from hip import hip, hipfft

def hip_check(call_result):
    err = call_result[0]
    result = call_result[1:]
    if len(result) == 1:
        result = result[0]
    if isinstance(err, hip.hipError_t) and err != hip.hipError_t.hipSuccess:
        raise RuntimeError(str(err))
    if isinstance(err, hipfft.hipfftResult) and err != hipfft.hipfftResult.HIPFFT_SUCCESS:
        raise RuntimeError(str(err))
    return result

# initial data
N = 100
hx = np.zeros(N,dtype=np.cdouble)
hx[:] = 1 - 1j

# copy to device
dx = hip_check(hip.hipMalloc(hx.size*hx.itemsize))
hip_check(hip.hipMemcpy(dx, hx, dx.size, hip.hipMemcpyKind.hipMemcpyHostToDevice))

# create plan
plan = hip_check(hipfft.hipfftPlan1d(N, hipfft.hipfftType.HIPFFT_Z2Z, 1))

# execute plan
hip_check(hipfft.hipfftExecZ2Z(plan, idata=dx, odata=dx, direction=hipfft.HIPFFT_FORWARD))
hip_check(hip.hipDeviceSynchronize())

# copy to host and free device data
hip_check(hip.hipMemcpy(hx,dx,dx.size,hip.hipMemcpyKind.hipMemcpyDeviceToHost))
hip_check(hip.hipFree(dx))

if not np.isclose(hx[0].real,N) or not np.isclose(hx[0].imag,-N):
    raise RuntimeError(f"element 0 must be '{N}-j{N}'.")
for i in range(1,N):
    if not np.isclose(abs(hx[i]),0):
        raise RuntimeError(f"element {i} must be '0'")

hip_check(hipfft.hipfftDestroy(plan))
print("ok")

What is happening?

We start with creating the initial data in lines 17-18, where we use numpy for convenience.
We then create a device array of the same size and copy the host data over (lines 21-22).
We create a plan in line 25, where we specify the number of samples N and the type of the FFT as double-complex-to-double-complex, HIPFFT_Z2Z.
Afterwards, we execute the FFT in-place (idata=dx and odata=dx) and specify that we run a forward FFT, HIPFFT_FORWARD (line 28).
The host then waits for completion of all activity on the device before copying data back to the host and freeing the device array (lines 29-33).
Finally, we check if the result is as expected and print "ok" if that’s the case (lines 35-42).

A multi-GPU broadcast with RCCL#

What will I learn?

How I can create a multi-GPU communicator via ncclCommInitAll.
How I can destroy a communicator again via ncclCommDestroy.
How I can open and close a communication group via ncclGroupStart and ncclGroupEnd, respectively.
How I can perform a broadcast via ncclBcast.

This example implements a single-node multi-GPU broadcast of a small array from one GPU’s device buffer to that of the other ones.

Listing 11 A multi-GPU broadcast with RCCL#

import numpy as np
from hip import hip, rccl

def hip_check(call_result):
    err = call_result[0]
    result = call_result[1:]
    if len(result) == 1:
        result = result[0]
    if isinstance(err, hip.hipError_t) and err != hip.hipError_t.hipSuccess:
        raise RuntimeError(str(err))
    if isinstance(err, rccl.ncclResult_t) and err != rccl.ncclResult_t.ncclSuccess:
        raise RuntimeError(str(err))
    return result

# init the communicators
num_gpus = hip_check(hip.hipGetDeviceCount())
comms = np.empty(num_gpus,dtype="uint64") # size of pointer type, such as ncclComm
devlist = np.array(range(0,num_gpus),dtype="int32")
hip_check(rccl.ncclCommInitAll(comms, num_gpus, devlist))

# init data on the devices
N = 4
ones = np.ones(N,dtype="int32")
zeros = np.zeros(ones.size,dtype="int32")
dxlist = []
for dev in devlist:
    hip_check(hip.hipSetDevice(dev))
    dx = hip_check(hip.hipMalloc(ones.size*ones.itemsize)) # items are bytes
    dxlist.append(dx)
    hx = ones if dev == 0 else zeros
    hip_check(hip.hipMemcpy(dx,hx,dx.size,hip.hipMemcpyKind.hipMemcpyHostToDevice))

# perform a broadcast
hip_check(rccl.ncclGroupStart())
for dev in devlist:
    hip_check(hip.hipSetDevice(dev))
    hip_check(rccl.ncclBcast(dxlist[dev], N, rccl.ncclDataType_t.ncclInt32, 0, int(comms[dev]), None)) 
    # conversion to Python int is required so that the numpy scalar is not interpreted as a single-element Py_buffer
hip_check(rccl.ncclGroupEnd())

# download and check the output; confirm all entries are one
hx = np.empty(N,dtype="int32")
for dev in devlist:
    dx=dxlist[dev]
    hx[:] = 0
    hip_check(hip.hipMemcpy(hx,dx,dx.size,hip.hipMemcpyKind.hipMemcpyDeviceToHost)) 
    for i,item in enumerate(hx):
        if item != 1:
            raise RuntimeError(f"failed for element {i}")

# clean up
for dx in dxlist:
    hip_check(hip.hipFree(dx))
for comm in comms:
    hip_check(rccl.ncclCommDestroy(int(comm)))
    # conversion to Python int is required so that the numpy scalar is not interpreted as a single-element Py_buffer

print("ok")

What is happening?

In line 17, we use the device count num_gpus (via hipGetDeviceCount) to create an array of pointers (same size as unsigned long, dtype="uint64"). This array named comms is intended to store a pointer to each device’s communicator.
We then create an array of device identifiers (line 18).
We pass both arrays to ncclCommInitAll as first and last argument, respectively (line 19). The second argument is the device count. The aforementioned routine initializes all communicators and writes their addresses to the comms array.
In lines 22-31, we create an array dx on each device of size N that is initialized with zeros on all devices except device 0. The latter’s array is filled with ones.
We start a communication group in line 34, and then call ncclBcast per device in line 37. The first argument of the call is the per-device dx, the second the number of elements in dx. Then follow the ncclDataType_t, the root (device 0), the communicator (int(comms[dev])), and finally the stream (None). Casting comms[dev] to int is required as the result is otherwise interpreted as a single-element Py_buffer by HIP Python’s ncclBcast instead of as an address.
In line 39, we close the communication group again.
We download the data from each device to the host and check that all elements are set to 1 (lines 42-50). Otherwise, a runtime error is raised.
Finally, we clean up by deallocating all device memory and destroying the per-device communicators via ncclCommDestroy in line 55. Note that here again the comm must be converted to int before passing it to the HIP Python routine.
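The element-wise check from the download step can also be expressed with NumPy array operations. The following is a minimal sketch, assuming the data has already been copied into a host array; first_mismatch is a hypothetical helper and not part of HIP Python.

```python
import numpy as np

def first_mismatch(host_array, expected=1):
    """Return the index of the first entry that differs from `expected`, or None."""
    mismatches = np.flatnonzero(host_array != expected)
    return int(mismatches[0]) if mismatches.size else None

# Stand-in for the host buffer `hx` after the device-to-host copy.
hx = np.ones(8, dtype="int32")
if first_mismatch(hx) is not None:
    raise RuntimeError(f"failed for element {first_mismatch(hx)}")
```

Compared to the explicit loop, this variant runs the comparison in NumPy and still reports the first failing index.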

Note

Please also see HIP Python’s Adapter Types for more details on automatic type conversions supported by HIP Python’s datatypes.
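Why the int conversion matters can be reproduced without a GPU. The sketch below only uses NumPy and the built-in memoryview; the addresses are made up for illustration and exports_buffer is a hypothetical helper. It shows that an element taken from a uint64 array exports the Python buffer protocol while a plain Python int does not, which is consistent with the behavior described above.

```python
import numpy as np

def exports_buffer(obj):
    """Return True if `obj` implements the Python buffer protocol."""
    try:
        memoryview(obj)
    except TypeError:
        return False
    return True

# Made-up addresses standing in for the communicator handles in `comms`.
comms = np.array([0x7F0000001000, 0x7F0000002000], dtype="uint64")
handle = comms[0]

print(type(handle).__name__, exports_buffer(handle))            # NumPy scalar
print(type(int(handle)).__name__, exports_buffer(int(handle)))  # Python int
```

Because the NumPy scalar is itself a buffer, a routine that accepts both buffers and integer addresses has to pick one interpretation; the explicit int(...) removes the ambiguity.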

Basic Usage (Cython)#

What will I learn?

How I can use HIP Python’s Cython modules in my Cython code.
How to compile my Cython code that uses HIP Python’s Cython modules.

In this section, we show how to use HIP Python’s Cython modules and how to compile projects that use them.

Cython Recap#

Note

This section assumes that the reader has at least some basic knowledge of the programming language Cython. If you are unfamiliar with the language, please refer to the Cython tutorials and the Language Basics page.

Cython modules are often split into a *.pxd and a *.pyx file, which are a Cython module’s declaration and implementation part, respectively. While the former are to some degree comparable to header files in C/C++, the latter can be compared to source files. The declaration part may only contain cdef fields, variables, and function prototypes, while the implementation part may contain the implementation of those entities as well as Python fields, variables, and functions.

The implementation part is the interface between the C/C++ and the Python world. Here, you can import Python code via Python’s import statements, you can C-import cdef declarations from other Cython declaration files (*.pxd) via cimport statements, and you can include C/C++ declarations from C/C++ header files as cdef declarations.

To build a Python module from a Cython module, the implementation part must first be “cythonized”, i.e. converted into a C/C++ file, and then compiled with a C/C++ compiler. It is recommended to use the same compiler that was used to build the Python interpreter in use. Most people don’t do this manually but instead use the build infrastructure provided by setuptools. They write a setup.py script that contains the code that performs the aforementioned two tasks.

Cython modules in HIP Python#

For each Python module (hip.hip, hip.hiprtc, …), HIP Python ships an additional c-prefixed hip.c<pkg_name> module.

The module without the c prefix is compiled into the interface for HIP Python’s Python users. However, all cdef declarations therein can also be cimported by Cython users (typically cdef class declarations) and all Python objects therein can be imported by Cython users too (typically enum and function objects).
The module with the c prefix builds the bridge to the underlying HIP C library by including C definitions from the corresponding header files. This code is located in the declaration part, which further declares runtime function loader prototypes. The definitions of these function loaders in the implementation part first try to load the underlying C library and then, if successful, try to load the function symbol from that shared object.

Note

The lazy loading of functions at runtime can, under some circumstances, make it possible to use a HIP Python version that covers a superset or only a subset of the functions available within the respective library of a ROCm™ installation.
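The two-step lookup described above (load the library first, then resolve the symbol) can be illustrated in plain Python with ctypes. This is only a sketch of the general technique and not HIP Python's actual implementation; the class LazyCFunction is hypothetical, and the C standard library is used as a stand-in so that the example runs without a ROCm™ installation.

```python
import ctypes
import ctypes.util

class LazyCFunction:
    """Resolves a C function on first call: library first, then the symbol."""

    _libs = {}  # cache of already loaded libraries, keyed by library name

    def __init__(self, libname, symbol, restype, argtypes):
        self.libname = libname
        self.symbol = symbol
        self.restype = restype
        self.argtypes = argtypes
        self._func = None  # nothing is loaded at construction time

    def _load(self):
        lib = self._libs.get(self.libname)
        if lib is None:
            # If the lookup returns None, CDLL(None) falls back to the
            # symbols already loaded into the current process (Linux).
            lib = ctypes.CDLL(ctypes.util.find_library(self.libname))
            self._libs[self.libname] = lib
        try:
            func = getattr(lib, self.symbol)
        except AttributeError as err:
            raise RuntimeError(
                f"symbol '{self.symbol}' not found in library '{self.libname}'"
            ) from err
        func.restype = self.restype
        func.argtypes = self.argtypes
        return func

    def __call__(self, *args):
        if self._func is None:
            self._func = self._load()
        return self._func(*args)

# Declaring a wrapper never fails, even if the symbol does not exist.
c_abs = LazyCFunction("c", "abs", ctypes.c_int, [ctypes.c_int])
missing = LazyCFunction("c", "no_such_function_xyz", ctypes.c_int, [])

print(c_abs(-3))  # 3
```

In this sketch, the wrapper for the nonexistent symbol only raises an error once it is called. This is the property that lets a binding declare more functions than a particular library version provides, as long as the missing ones are never used.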

Using the Cython API#

You can import the Python objects that you need as shown below:

Listing 12 Importing HIP Python Modules into Cython *.pyx file#

from hip import hip # enum types, enum aliases, fields
from hip import hiprtc
# ...

In the same file, you can additionally, or alternatively, cimport the cdef entities as shown below:

Listing 13 Importing HIP Python Cython declaration files (*.pxd) into a Cython *.pxd or *.pyx file#

from hip cimport chip   # direct access to C interfaces and lazy function loaders
from hip cimport chiprtc
# ...

from hip cimport hip # access to `cdef class` and `ctypedef` types
                     # that have been created per C struct/union/typedef
from hip cimport hiprtc
# ...

Compiling a Cython module#

After having written your own mymodule.pyx file that uses HIP Python’s Cython API, you can compile it using a setup.py script as shown below. In the setup.py script, we assume that only HIP or HIPRTC is used. Therefore, only amdhip64 is put into the rocm_libs list. It is also important to specify the HIP platform because the header files that declare the C interfaces are included at compile time by the underlying C/C++ compiler. The include path must therefore contain all of these header files.

Listing 14 Compiling a Cython module that uses HIP Python’s Cython API.#

import os, sys

mymodule = "mymodule"

from setuptools import Extension, setup
from Cython.Build import cythonize

ROCM_PATH=os.environ.get("ROCM_PATH", "/opt/rocm")
HIP_PLATFORM = os.environ.get("HIP_PLATFORM", "amd")

if HIP_PLATFORM not in ("amd", "hcc"):
   raise RuntimeError("Currently only HIP_PLATFORM=amd is supported")

def create_extension(name, sources):
   global ROCM_PATH
   global HIP_PLATFORM
   rocm_inc = os.path.join(ROCM_PATH,"include")
   rocm_lib_dir = os.path.join(ROCM_PATH,"lib")
   rocm_libs = ["amdhip64"]  # only HIP/HIPRTC are used; both are provided by amdhip64
   platform = HIP_PLATFORM.upper()
   cflags = ["-D", f"__HIP_PLATFORM_{platform}__"]

   return Extension(
      name,
      sources=sources,
      include_dirs=[rocm_inc],
      library_dirs=[rocm_lib_dir],
      libraries=rocm_libs,
      language="c",
      extra_compile_args=cflags,
   )

setup(
   ext_modules = cythonize(
      [create_extension(mymodule, [f"{mymodule}.pyx"]),],
      compiler_directives=dict(language_level=3),
      compile_time_env=dict(HIP_PYTHON=True),
   )
)
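To make the platform handling in the script above easier to inspect, the following minimal sketch isolates how the preprocessor define is derived from HIP_PLATFORM. The helper name hip_platform_cflags is hypothetical and not part of HIP Python or setuptools; it only mirrors the string handling from the listing.

```python
def hip_platform_cflags(hip_platform="amd"):
    """Build the compiler flags that select the HIP platform (hypothetical helper)."""
    if hip_platform not in ("amd", "hcc"):
        raise RuntimeError("Currently only HIP_PLATFORM=amd is supported")
    return ["-D", f"__HIP_PLATFORM_{hip_platform.upper()}__"]

print(hip_platform_cflags("amd"))  # ['-D', '__HIP_PLATFORM_AMD__']
```

With such a setup.py placed next to mymodule.pyx, the extension is typically built in place with the standard setuptools command python3 setup.py build_ext --inplace.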
