Basic Usage and Examples#
Advanced Micro Devices, Inc.
2023-06-23
60 min read
This chapter explains how to use HIP Python’s main interfaces. The usage of the CUDA® interoperability layer is discussed in a separate chapter. We first aim to give an introduction to the Python API of HIP Python by means of basic examples before discussing the Cython API in the last sections of this chapter.
Note
All examples in this chapter have been tested with ROCm™ 5.4.3 on Ubuntu 22. The License applies to all examples in this chapter.
Basic Usage (Python)#
What will I learn?
How to use HIP Python modules in your Python code.
After installing the HIP Python package hip-python
, you can import the individual
modules that you need as shown below:
1from hip import hip
2from hip import hiprtc
3# ...
And you are ready to go!
Obtaining Device Properties#
What will I learn?
How I can obtain device attributes/properties via
hipGetDeviceProperties
.How I can obtain device attributes/properties via
hipDeviceGetAttribute
.
Obtaining device properties such as the architecture or the number of compute units is important for many applications.
Via hipGetDeviceProperties
#
A number of device properties can be obtained via the
hipDeviceProp_t
object. After creation (line 12) this object must
be passed to the hipGetDeviceProperties
routine (line 13).
The second argument (0
) is the device number.
Running the example below will print out the values of all queried device properties
before the program eventually prints "ok"
and quits.
Note
The hip_check
routine in the snippet unpacks the result tuple – HIP Python routines always return a tuple, then checks the therein contained error code (first argument),
and finally returns the rest of the tuple – either as single value or tuple sans error code. Such error check routines will be used throughout this and the following sections.
1from hip import hip
2
3def hip_check(call_result):
4 err = call_result[0]
5 result = call_result[1:]
6 if len(result) == 1:
7 result = result[0]
8 if isinstance(err, hip.hipError_t) and err != hip.hipError_t.hipSuccess:
9 raise RuntimeError(str(err))
10 return result
11
12props = hip.hipDeviceProp_t()
13hip_check(hip.hipGetDeviceProperties(props,0))
14
15for attrib in sorted(props.PROPERTIES()):
16 print(f"props.{attrib}={getattr(props,attrib)}")
17print("ok")
Via hipDeviceGetAttribute
#
You can also obtain some of the properties that appeared in the
previous example plus a number of additional properties via the hipDeviceGetAttribute
routine as shown in the example below (line 26).
In the example below, we query integer-type device attributes/properties.
Therefore, we supply the address of a ctypes.c_int
variable
as first argument. The respective property, the second argument, is passed
as enum constant of type hipDeviceAttribute_t
.
Running this example will print out the values of all queried device
attributes before the program prints "ok"
and quits.
1from hip import hip
2
3def hip_check(call_result):
4 err = call_result[0]
5 result = call_result[1:]
6 if len(result) == 1:
7 result = result[0]
8 if isinstance(err, hip.hipError_t) and err != hip.hipError_t.hipSuccess:
9 raise RuntimeError(str(err))
10 return result
11
12device_num = 0
13
14for attrib in (
15 hip.hipDeviceAttribute_t.hipDeviceAttributeMaxBlockDimX,
16 hip.hipDeviceAttribute_t.hipDeviceAttributeMaxBlockDimY,
17 hip.hipDeviceAttribute_t.hipDeviceAttributeMaxBlockDimZ,
18 hip.hipDeviceAttribute_t.hipDeviceAttributeMaxGridDimX,
19 hip.hipDeviceAttribute_t.hipDeviceAttributeMaxGridDimY,
20 hip.hipDeviceAttribute_t.hipDeviceAttributeMaxGridDimZ,
21 hip.hipDeviceAttribute_t.hipDeviceAttributeWarpSize,
22):
23 value = hip_check(hip.hipDeviceGetAttribute(attrib,device_num))
24 print(f"{attrib.name}: {value}")
25print("ok")
HIP Streams#
What will I learn?
How I can use HIP Python’s
hipStream_t
objects and the associated HIP Python routines.That I can directly pass Python 3
array
objects to HIP runtime routines such ashipMemcpy
andhipMemcpyAsync
.
An important concept in HIP are streams. They allow to overlap host and device work as well as device computations with data movement to or from that same device.
The below example showcases how to use HIP Python’s hipStream_t
objects and the associated HIP Python routines.
The example further demonstrates that you can pass Python 3 array.array
types
directly to HIP Python interfaces that expect an host buffer. One example of such
interfaces is hipMemcpyAsync
(lines 23 and 25).
1import ctypes
2import random
3import array
4
5from hip import hip
6
7def hip_check(call_result):
8 err = call_result[0]
9 result = call_result[1:]
10 if len(result) == 1:
11 result = result[0]
12 if isinstance(err, hip.hipError_t) and err != hip.hipError_t.hipSuccess:
13 raise RuntimeError(str(err))
14 return result
15
16# inputs
17n = 100
18x_h = array.array("i",[int(random.random()*10) for i in range(0,n)])
19num_bytes = x_h.itemsize * len(x_h)
20x_d = hip_check(hip.hipMalloc(num_bytes))
21
22stream = hip_check(hip.hipStreamCreate())
23hip_check(hip.hipMemcpyAsync(x_d,x_h,num_bytes,hip.hipMemcpyKind.hipMemcpyHostToDevice,stream))
24hip_check(hip.hipMemsetAsync(x_d,0,num_bytes,stream))
25hip_check(hip.hipMemcpyAsync(x_h,x_d,num_bytes,hip.hipMemcpyKind.hipMemcpyDeviceToHost,stream))
26hip_check(hip.hipStreamSynchronize(stream))
27hip_check(hip.hipStreamDestroy(stream))
28
29# deallocate device data
30hip_check(hip.hipFree(x_d))
31
32for i,x in enumerate(x_h):
33 if x != 0:
34 raise ValueError(f"expected '0' for element {i}, is: '{x}'")
35print("ok")
What is happening?
A host buffer is filled with random numbers (line 18) before it is asynchronously copied to the device (line 23), where a asynchronous
hipMemsetAsync
(same stream) resets all bytes to0
(line 24).An asynchronous memcpy (same stream) is then issued to copy the device data back to the host (line 25). All operations within the stream are executed in order.
As the
~Async
operations are non-blocking, the host waits viahipStreamSynchronize
until operations in the stream have been completed (line 26) before destroying the stream (line 27).Eventually the program deallocates device data via
hipFree
and checks if all bytes in the host buffer are now set to0
. If so, it quits with an “ok”.
Launching Kernels#
What will I learn?
How I can compile a HIP C++ kernel at runtime via
hiprtcCompileProgram
.How I can launch kernels via
hipModuleLaunchKernel
.
HIP Python does not provide the necessary infrastructure to express device code
in native Python. However, you can compile and launch kernels from within Python code
via the just-in-time (JIT) compilation interface provided by HIP Python module hiprtc
together with
the kernel launch routines provided by HIP Python module hip
.
The example below demonstrates how to do so.
1from hip import hip, hiprtc
2
3def hip_check(call_result):
4 err = call_result[0]
5 result = call_result[1:]
6 if len(result) == 1:
7 result = result[0]
8 if isinstance(err, hip.hipError_t) and err != hip.hipError_t.hipSuccess:
9 raise RuntimeError(str(err))
10 elif (
11 isinstance(err, hiprtc.hiprtcResult)
12 and err != hiprtc.hiprtcResult.HIPRTC_SUCCESS
13 ):
14 raise RuntimeError(str(err))
15 return result
16
17
18source = b"""\
19extern "C" __global__ void print_tid() {
20 printf("tid: %d\\n", (int) threadIdx.x);
21}
22"""
23
24prog = hip_check(hiprtc.hiprtcCreateProgram(source, b"print_tid", 0, [], []))
25
26props = hip.hipDeviceProp_t()
27hip_check(hip.hipGetDeviceProperties(props,0))
28arch = props.gcnArchName
29
30print(f"Compiling kernel for {arch}")
31
32cflags = [b"--offload-arch="+arch]
33err, = hiprtc.hiprtcCompileProgram(prog, len(cflags), cflags)
34if err != hiprtc.hiprtcResult.HIPRTC_SUCCESS:
35 log_size = hip_check(hiprtc.hiprtcGetProgramLogSize(prog))
36 log = bytearray(log_size)
37 hip_check(hiprtc.hiprtcGetProgramLog(prog, log))
38 raise RuntimeError(log.decode())
39code_size = hip_check(hiprtc.hiprtcGetCodeSize(prog))
40code = bytearray(code_size)
41hip_check(hiprtc.hiprtcGetCode(prog, code))
42module = hip_check(hip.hipModuleLoadData(code))
43kernel = hip_check(hip.hipModuleGetFunction(module, b"print_tid"))
44#
45hip_check(
46 hip.hipModuleLaunchKernel(
47 kernel,
48 *(1, 1, 1), # grid
49 *(32, 1, 1), # block
50 sharedMemBytes=0,
51 stream=None,
52 kernelParams=None,
53 extra=None,
54 )
55)
56
57hip_check(hip.hipDeviceSynchronize())
58hip_check(hip.hipModuleUnload(module))
59hip_check(hiprtc.hiprtcDestroyProgram(prog.createRef()))
60
61print("ok")
What is happening?
In the example, the kernel
print_tid
defined within the stringsource
simply prints the block-local thread ID (threadIDx.x
) for every thread running the kernel (line 20).A program
prog
is then created in line 25 viahiprtcCreateProgram
, where we passsource
as first argument, we further give the program a name (note theb".."
), specify zero headers and include names (last three arguments).Next we query the architecture name via
hipGetDeviceProperties
(more details: Obtaining Device Properties) and use it in lines 33-34, where we specify compile flags (cflags
) and compileprog
viahiprtcCompileProgram
. In case of a failure, we obtain the program log and raise it asRuntimeError
.In case of success, we query the code size via
hiprtcGetCodeSize
, create a buffer with that information, and then copy the code into this buffer viahiprtcGetCode
. Afterwards, we load the code asmodule
viahipModuleLoadData
and then obtain our device kernel with name"print_tid"
from it viahipModuleGetFunction
.This object is then passed as first argument to the
hipModuleLaunchKernel
routine, followed by the usual grid and block dimension triples, the size of the required shared memory, and stream to use (None
means the null stream). The latter two arguments,kernelParams
andextra
, can be used for passing kernel arguments. We will take a look how to pass kernel arguments viaextra
in the next section.After the kernel launch, the host waits on completion via
hipDeviceSynchronize
and then unloads the code module again viahipModuleUnload
before quitting with an"ok"
.
Kernels with Arguments#
What will I learn?
How I can pass arguments to hipModuleLaunchKernel
.
One of the difficulties that programmers face when attempting to launch
kernels via hipModuleLaunchKernel
is passing arguments to the kernels.
When using the extra
argument, the kernel arguments must be aligned in a certain way.
In C/C++ programs, one can simply put all arguments into a struct and let the
compiler take care of the argument alignment. Similarly,
one could create a ctypes.Structure
in python to do the same.
However, we do not want to oblige HIP Python users with creating such glue code.
Instead, users can directly pass a list
or tuple
of arguments to
the hipModuleLaunchKernel
. The entries of these objects must
either be of type DeviceArray
(or can be converted to DeviceArray
)
or one of the ctypes
types.
The former are typically the result of a hipMalloc
call (or similar memory allocation routines).
Please also see HIP Python’s Adapter Types for details on what other types can be converted to DeviceArray
.
The ctypes
types are typically used to convert a scalar of the python bool
, int
, and float
scalar types
to a fixed precision.
The below example demonstrates the usage of hipModuleLaunchKernel
by means of a simple kernel, which scales a vector by a factor.
Here, We pass multiple arguments that require different alignments to the aforementioned routine
in lines 85-90. We insert some additional unused*
arguments into the extra
tuple
to stress the
argument buffer allocator. Note the ctypes
object construction for scalars and the direct passing of the device array x_d
.
Compare the argument list with the signature of the kernel defined in line 23.
The example also introduces HIP Python’s dim3
struct (default value per dimension is 1), which can be unpacked just
like a tuple
or list
.
1import ctypes
2import array
3import random
4import math
5
6from hip import hip, hiprtc
7
8def hip_check(call_result):
9 err = call_result[0]
10 result = call_result[1:]
11 if len(result) == 1:
12 result = result[0]
13 if isinstance(err, hip.hipError_t) and err != hip.hipError_t.hipSuccess:
14 raise RuntimeError(str(err))
15 elif (
16 isinstance(err, hiprtc.hiprtcResult)
17 and err != hiprtc.hiprtcResult.HIPRTC_SUCCESS
18 ):
19 raise RuntimeError(str(err))
20 return result
21
22source = b"""\
23extern "C" __global__ void scale_vector(float factor, int n, short unused1, int unused2, float unused3, float *x) {
24 int tid = threadIdx.x + blockIdx.x * blockDim.x;
25 if ( tid == 0 ) {
26 printf("tid: %d, factor: %f, x*: %lu, n: %lu, unused1: %d, unused2: %d, unused3: %f\\n",tid,factor,x,n,(int) unused1,unused2,unused3);
27 }
28 if (tid < n) {
29 x[tid] *= factor;
30 }
31}
32"""
33
34prog = hip_check(hiprtc.hiprtcCreateProgram(source, b"scale_vector", 0, [], []))
35
36props = hip.hipDeviceProp_t()
37hip_check(hip.hipGetDeviceProperties(props,0))
38arch = props.gcnArchName
39
40print(f"Compiling kernel for {arch}")
41
42cflags = [b"--offload-arch="+arch]
43err, = hiprtc.hiprtcCompileProgram(prog, len(cflags), cflags)
44if err != hiprtc.hiprtcResult.HIPRTC_SUCCESS:
45 log_size = hip_check(hiprtc.hiprtcGetProgramLogSize(prog))
46 log = bytearray(log_size)
47 hip_check(hiprtc.hiprtcGetProgramLog(prog, log))
48 raise RuntimeError(log.decode())
49code_size = hip_check(hiprtc.hiprtcGetCodeSize(prog))
50code = bytearray(code_size)
51hip_check(hiprtc.hiprtcGetCode(prog, code))
52module = hip_check(hip.hipModuleLoadData(code))
53kernel = hip_check(hip.hipModuleGetFunction(module, b"scale_vector"))
54
55# kernel launch
56
57## inputs
58n = 100
59x_h = array.array("f",[random.random() for i in range(0,n)])
60num_bytes = x_h.itemsize * len(x_h)
61x_d = hip_check(hip.hipMalloc(num_bytes))
62print(f"{hex(int(x_d))=}")
63
64## upload host data
65hip_check(hip.hipMemcpy(x_d,x_h,num_bytes,hip.hipMemcpyKind.hipMemcpyHostToDevice))
66
67factor = 1.23
68
69## expected result
70x_expected = [a*factor for a in x_h]
71
72block = hip.dim3(x=32)
73grid = hip.dim3(math.ceil(n/block.x))
74
75## launch
76hip_check(
77 hip.hipModuleLaunchKernel(
78 kernel,
79 *grid,
80 *block,
81 sharedMemBytes=0,
82 stream=None,
83 kernelParams=None,
84 extra=(
85 ctypes.c_float(factor), # 4 bytes
86 ctypes.c_int(n), # 8 bytes
87 ctypes.c_short(5), # unused1, 10 bytes
88 ctypes.c_int(2), # unused2, 16 bytes (+2 padding bytes)
89 ctypes.c_float(5.6), # unused3 20 bytes
90 x_d, # 32 bytes (+4 padding bytes)
91 )
92 )
93)
94
95# copy result back
96hip_check(hip.hipMemcpy(x_h,x_d,num_bytes,hip.hipMemcpyKind.hipMemcpyDeviceToHost))
97
98for i,x_h_i in enumerate(x_h):
99 if not math.isclose(x_h_i,x_expected[i],rel_tol=1e-6):
100 raise RuntimeError(f"values do not match, {x_h[i]=} vs. {x_expected[i]=}, {i=}")
101
102hip_check(hip.hipFree(x_d))
103
104hip_check(hip.hipModuleUnload(module))
105hip_check(hiprtc.hiprtcDestroyProgram(prog.createRef()))
106
107print("ok")
What is happening?
See the previous section Launching Kernels for a textual description of the main steps.
hipBLAS and NumPy Interoperability#
What will I learn?
How I can use HIP Python’s
hipblas
module.That I can pass
numpy
arrays to HIP runtime routines such ashipMemcpy
andhipMemcpyAsync
.
This example demonstrates how to initialize and use HIP Python’s hipblas
module. Furthermore, it shows that you can simply pass numpy
arrays to HIP runtime routines such
as hipMemcpy
and hipMemcpyAsync
. This works because some of HIP Python’s interfaces
support automatic conversion from various different types—in particular such types that implement the Python buffer protocol. The numpy
arrays implement the Python buffer protocol
and thus can be directly passed to those interfaces.
1import ctypes
2import math
3import numpy as np
4
5from hip import hip
6from hip import hipblas
7
8def hip_check(call_result):
9 err = call_result[0]
10 result = call_result[1:]
11 if len(result) == 1:
12 result = result[0]
13 if isinstance(err,hip.hipError_t) and err != hip.hipError_t.hipSuccess:
14 raise RuntimeError(str(err))
15 elif isinstance(err,hipblas.hipblasStatus_t) and err != hipblas.hipblasStatus_t.HIPBLAS_STATUS_SUCCESS:
16 raise RuntimeError(str(err))
17 return result
18
19num_elements = 100
20
21# input data on host
22alpha = ctypes.c_float(2)
23x_h = np.random.rand(num_elements).astype(dtype=np.float32)
24y_h = np.random.rand(num_elements).astype(dtype=np.float32)
25
26# expected result
27y_expected = alpha*x_h + y_h
28
29# device vectors
30num_bytes = num_elements * np.dtype(np.float32).itemsize
31x_d = hip_check(hip.hipMalloc(num_bytes))
32y_d = hip_check(hip.hipMalloc(num_bytes))
33
34# copy input data to device
35hip_check(hip.hipMemcpy(x_d,x_h,num_bytes,hip.hipMemcpyKind.hipMemcpyHostToDevice))
36hip_check(hip.hipMemcpy(y_d,y_h,num_bytes,hip.hipMemcpyKind.hipMemcpyHostToDevice))
37
38# call hipblasSaxpy + initialization & destruction of handle
39handle = hip_check(hipblas.hipblasCreate())
40hip_check(hipblas.hipblasSaxpy(handle, num_elements, ctypes.addressof(alpha), x_d, 1, y_d, 1))
41hip_check(hipblas.hipblasDestroy(handle))
42
43# copy result (stored in y_d) back to host (store in y_h)
44hip_check(hip.hipMemcpy(y_h,y_d,num_bytes,hip.hipMemcpyKind.hipMemcpyDeviceToHost))
45
46# compare to expected result
47if np.allclose(y_expected,y_h):
48 print("ok")
49else:
50 print("FAILED")
51#print(f"{y_h=}")
52#print(f"{y_expected=}")
53
54# clean up
55hip_check(hip.hipFree(x_d))
56hip_check(hip.hipFree(y_d))
What is happening?
We initialize two
float32
-typednumpy
arraysx_h
andy_h
on the host and fill them with random data (lines 23-24).We compute the expected result on the host via
numpy
array operations (line 27).We allocate device analogues for
x_h
andy_h
(lines 31-32) and copy the host data over (line 35-36). Note that we can directly pass thenumpy
arraysx_h
andy_h
tohipMemcpy
.Before being able to call one of the compute routines of
hipblas
, it’s necessary to create ahipblas
handle viahipblasCreate
that will be passed to everyhipblas
routine as first argument (line 39).In line 40 follows the call to
hipblasSaxpy
, where we pass the handle as first argument and the address of hostctypes.c_float
variablealpha
as third argument.In line 41 the handle is destroyed via
hipblasDestroy
because it is not needed anymore.The device data is downloaded in line 44. where we pass
numpy
arrayy_h
as destination array.We compare the expected host result with the downloaded device result (lines 47-50) and print
"ok"
if all is fine.Finally, we deallocate the device arrays in lines 55-56.
HIP Python Device Arrays#
What will I learn?
How I can change the shape and datatype of HIP Python’s
DeviceArray
objects.How I can obtain subarrays from HIP Python’s
DeviceArray
objects — which are again of that type — via array subscript.
This example demonstrates how to configure
the shape and data type
of a DeviceArray
returned by hipMalloc
(and related routines).
It further shows how to retrieve single elements / contiguous subarrays
with respect to the specified type and shape information.
1verbose = False
2
3import ctypes
4
5from hip import hip, hipblas
6import numpy as np
7
8def hip_check(call_result):
9 err = call_result[0]
10 result = call_result[1:]
11 if len(result) == 1:
12 result = result[0]
13 if isinstance(err,hip.hipError_t) and err != hip.hipError_t.hipSuccess:
14 raise RuntimeError(str(err))
15 elif isinstance(err,hipblas.hipblasStatus_t) and err != hipblas.hipblasStatus_t.HIPBLAS_STATUS_SUCCESS:
16 raise RuntimeError(str(err))
17 return result
18
19# init host array and fill with ones
20shape = (3,20) # shape[1]: inner dim
21x_h = np.ones(shape,dtype="float32")
22num_bytes = x_h.size * x_h.itemsize
23
24# init device array and upload host data
25x_d = hip_check(hip.hipMalloc(num_bytes)).configure(
26 typestr="float32",shape=shape
27)
28hip_check(hip.hipMemcpy(x_d,x_h,num_bytes,hip.hipMemcpyKind.hipMemcpyHostToDevice))
29
30# scale device array entries by row index using hipblasSscal
31handle = hip_check(hipblas.hipblasCreate())
32for r in range(0,shape[0]):
33 row = x_d[r,:] # extract subarray
34 row_len = row.size
35 alpha = ctypes.c_float(r)
36 hip_check(hipblas.hipblasSscal(handle, row_len, ctypes.addressof(alpha), row, 1))
37 hip_check(hip.hipDeviceSynchronize())
38hip_check(hipblas.hipblasDestroy(handle))
39
40# copy device data back to host
41hip_check(hip.hipMemcpy(x_h,x_d,num_bytes,hip.hipMemcpyKind.hipMemcpyDeviceToHost))
42
43# deallocate device data
44hip_check(hip.hipFree(x_d))
45
46for r in range(0,shape[0]):
47 row_rounded = [round(el) for el in x_h[r,:]]
48 for c,e in enumerate(row_rounded):
49 if e != r:
50 raise ValueError(f"expected '{r}' for element ({r},{c}), is '{e}")
51 if verbose:
52 print("\t".join((str(i) for i in row_rounded))+"\n")
53print("ok")
What is happening?
A two-dimensional row-major array of size
(3,20)
is created on the host. All elements are initialized to1
(line 20-21).A device array with the same number of bytes is created on the device (line 25).
The device array is reconfigured to have
float32
type and the shape of the host array (line 25-27).The host data is copied to the device array (line 28).
Within a loop over the row indices (index:
r
):A pointer to row with index
r
is created via array subscript (line 33). This yieldsrow
.row
is passed to ahipblasSscal
call that writes indexr
to all elements of the row (line 36).
Data is copied back from the device to the host array.
Finally, a check is performed on the host if the row values equal the respective row index (lines 44-50). The program quits with
"ok"
if all went well.
Note
Please also see HIP Python’s Adapter Types for more details on the capabilities of type DeviceArray
and the CUDA Array interface
that it implements.
Monte Carlo with hipRAND#
What will I learn?
How I can create an
hiprand
random number generator viahiprandCreateGenerator
.How I can generate uniformly-distributed random numbers via
hiprandGenerateUniformDouble
.
This example uses hiprand
to estimate \(\pi\) by means of the Monte-Carlo method.
Background
The unit square has the area \(1^2\), while the unit circle has the area \(\pi\,(\frac{1}{2})^2\). Therefore, the ratio between the latter and the former area is \(\frac{\pi}{4}\). Using the Monte-Carlo method, we randomly choose \(N\) \((x,y)\)-coordinates in the unit square. We then estimate the ratio of areas as the ratio between the number of samples located within the unit circle and the total number of samples \(N\). The accuracy of the approach increases with \(N\).
Note
This example was derived from a similar example in the rocRAND repository on Github. See this repository for another higher-level interface to hiprand/rocrand (ctypes
-based, no Cython interfaces).
1from hip import hip, hiprand
2import numpy as np
3import math
4
5def hip_check(call_result):
6 err = call_result[0]
7 result = call_result[1:]
8 if len(result) == 1:
9 result = result[0]
10 if isinstance(err, hiprand.hiprandStatus) and err != hiprand.hiprandStatus.HIPRAND_STATUS_SUCCESS:
11 raise RuntimeError(str(err))
12 if isinstance(err, hip.hipError_t) and err != hip.hipError_t.hipSuccess:
13 raise RuntimeError(str(err))
14 return result
15
16print("Estimating Pi via the Monte Carlo method:\n")
17
18def calculate_pi(n):
19 """Calculate Pi for the given number of samples.
20 """
21 xy = np.empty(shape=(2, n)) # host array, default type is float64
22 gen = hip_check(hiprand.hiprandCreateGenerator(hiprand.hiprandRngType.HIPRAND_RNG_PSEUDO_DEFAULT))
23 xy_d = hip_check(hip.hipMalloc(xy.size*xy.itemsize)) # create same size device array
24 hip_check(hiprand.hiprandGenerateUniformDouble(gen,xy_d,xy.size)) # generate device random numbers
25 hip_check(hip.hipMemcpy(xy,xy_d,xy.size*xy.itemsize,hip.hipMemcpyKind.hipMemcpyDeviceToHost)) # copy to host
26 hip_check(hip.hipFree(xy_d)) # free device array
27 hip_check(hiprand.hiprandDestroyGenerator(gen))
28
29 inside = xy[0]**2 + xy[1]**2 <= 1.0
30 in_xy = xy[:, inside]
31 estimate = 4*in_xy[0,:].size/n
32 return estimate
33
34print(f"#samples\testimate\trelative error")
35n = 100
36imax = 5
37for i in range(1,imax):
38 n *= 10
39 estimate = calculate_pi(n)
40 print(f"{n:12}\t{estimate:1.9f}\t{abs(estimate-math.pi)/math.pi:1.9f}")
41print("ok")
What is happening?
Within a loop that per iteration multiplies the problem size n
by 10
(line 37-38), we call a function
calculate_pi
with n
as argument, in which:
We first create a two-dimensional host array
xy
of typedouble
withn
elements (line 21).We then create a
hiprandCreateGenerator
generator of typeHIPRAND_RNG_PSEUDO_DEFAULT
(line 22).We create a device array
xy_d
that stores the same number of bytes asxy
.We fill
xy_d
with random data viahiprandGenerateUniformDouble
(line 24).We then copy to
xy
fromxy_d
and freex_d
(lines 25-26) and destroy the generator (line 27).We use
numpy
array operations to count the number of random-generated \(x-y\)-coordinates within the unit circle (lines 29-30).Finally, we compute the ratio estimate for the given
n
and return it (lines 31-32).
A simple complex FFT with hipFFT#
What will I learn?
How I can create an
hipfft
1D plan viahipfftPlan1d
.How I can run a complex in-place forward FFT via
hipfftExecZ2Z
.
This example demonstrates the usage of HIP Python’s hipfft
library.
We perform a double-complex-to-double-complex in-place forward FFT of a constant time signal \(f(t) = 1-1j\) of which we have \(N\) samples. The resulting FFT coefficients are all zero — aside from the first one, which has the value \(N-Nj\).
1import numpy as np
2from hip import hip, hipfft
3
4def hip_check(call_result):
5 err = call_result[0]
6 result = call_result[1:]
7 if len(result) == 1:
8 result = result[0]
9 if isinstance(err, hip.hipError_t) and err != hip.hipError_t.hipSuccess:
10 raise RuntimeError(str(err))
11 if isinstance(err, hipfft.hipfftResult) and err != hipfft.hipfftResult.HIPFFT_SUCCESS:
12 raise RuntimeError(str(err))
13 return result
14
15# initial data
16N = 100
17hx = np.zeros(N,dtype=np.cdouble)
18hx[:] = 1 - 1j
19
20# copy to device
21dx = hip_check(hip.hipMalloc(hx.size*hx.itemsize))
22hip_check(hip.hipMemcpy(dx, hx, dx.size, hip.hipMemcpyKind.hipMemcpyHostToDevice))
23
24# create plan
25plan = hip_check(hipfft.hipfftPlan1d(N, hipfft.hipfftType.HIPFFT_Z2Z, 1))
26
27# execute plan
28hip_check(hipfft.hipfftExecZ2Z(plan, idata=dx, odata=dx, direction=hipfft.HIPFFT_FORWARD))
29hip_check(hip.hipDeviceSynchronize())
30
31# copy to host and free device data
32hip_check(hip.hipMemcpy(hx,dx,dx.size,hip.hipMemcpyKind.hipMemcpyDeviceToHost))
33hip_check(hip.hipFree(dx))
34
35if not np.isclose(hx[0].real,N) or not np.isclose(hx[0].imag,-N):
36 raise RuntimeError("element 0 must be '{N}-j{N}'.")
37for i in range(1,N):
38 if not np.isclose(abs(hx[i]),0):
39 raise RuntimeError(f"element {i} must be '0'")
40
41hip_check(hipfft.hipfftDestroy(plan))
42print("ok")
What is happening?
We start with creating the initial data in lines 17-18, where we use
numpy
for convenience.We then create a device array of the same size and copy the device data over (lines 21-22).
We create a plan in line 25, where we specify the number of samples
N
and the the type of the FFT as double-complex-to-double-complex,HIPFFT_Z2Z
.Afterwards, we execute the FFT in-place (
idata=dx
andodata=dx
) and specify that we run an forward FFT,HIPFFT_FORWARD
(line 28).The host then waits for completion of all activity on the device before copying data back to the host and freeing the device array (lines 29-33).
Finally, we check if the result is as expected and print
"ok"
if that’s the case (lines 35-42).
A multi-GPU broadcast with RCCL#
What will I learn?
How I can create a multi-GPU communicator via
ncclCommInitAll
.How I can destroy a communicator again via
ncclCommDestroy
.How I can open and close a communication group via
ncclGroupStart
andncclGroupEnd
, respectively.How I can perform a broadcast via
ncclBcast
.
This example implements a single-node multi-GPU broadcast of a small array from one GPU’s device buffer to that of the other ones.
1import numpy as np
2from hip import hip, rccl
3
4def hip_check(call_result):
5 err = call_result[0]
6 result = call_result[1:]
7 if len(result) == 1:
8 result = result[0]
9 if isinstance(err, hip.hipError_t) and err != hip.hipError_t.hipSuccess:
10 raise RuntimeError(str(err))
11 if isinstance(err, rccl.ncclResult_t) and err != rccl.ncclResult_t.ncclSuccess:
12 raise RuntimeError(str(err))
13 return result
14
15# init the communicators
16num_gpus = hip_check(hip.hipGetDeviceCount())
17comms = np.empty(num_gpus,dtype="uint64") # size of pointer type, such as ncclComm
18devlist = np.array(range(0,num_gpus),dtype="int32")
19hip_check(rccl.ncclCommInitAll(comms, num_gpus, devlist))
20
21# init data on the devices
22N = 4
23ones = np.ones(N,dtype="int32")
24zeros = np.zeros(ones.size,dtype="int32")
25dxlist = []
26for dev in devlist:
27 hip_check(hip.hipSetDevice(dev))
28 dx = hip_check(hip.hipMalloc(ones.size*ones.itemsize)) # items are bytes
29 dxlist.append(dx)
30 hx = ones if dev == 0 else zeros
31 hip_check(hip.hipMemcpy(dx,hx,dx.size,hip.hipMemcpyKind.hipMemcpyHostToDevice))
32
33# perform a broadcast
34hip_check(rccl.ncclGroupStart())
35for dev in devlist:
36 hip_check(hip.hipSetDevice(dev))
37 hip_check(rccl.ncclBcast(dxlist[dev], N, rccl.ncclDataType_t.ncclInt32, 0, int(comms[dev]), None))
38 # conversion to Python int is required to not let the numpy datatype to be interpreted as single-element Py_buffer
39hip_check(rccl.ncclGroupEnd())
40
41# download and check the output; confirm all entries are one
42hx = np.empty(N,dtype="int32")
43for dev in devlist:
44 dx=dxlist[dev]
45 hx[:] = 0
46 hip_check(hip.hipMemcpy(hx,dx,dx.size,hip.hipMemcpyKind.hipMemcpyDeviceToHost))
47 for i,item in enumerate(hx):
48 if item != 1:
49 raise RuntimeError(f"failed for element {i}")
50
51# clean up
52for dx in dxlist:
53 hip_check(hip.hipFree(dx))
54for comm in comms:
55 hip_check(rccl.ncclCommDestroy(int(comm)))
56 # conversion to Python int is required to not let the numpy datatype to be interpreted as single-element Py_buffer
57
58print("ok")
What is happening?
In line 17, we use the device count
num_gpus
(viahipGetDeviceCount
) to create an array of pointers (same size asunsigned long
,dtype="uint64"
). This array namedcomms
is intended to store a pointer to each device’s communicator.We then create an array of device identifiers (line 18).
We pass both arrays to
ncclCommInitAll
as first and last argument, respectively (line 19). The second element is the device count. The aforementioned routine initializes all communicators and writes their address to thecomms
array.In lines 22-28, we create an array
dx
on each device of sizeN
that is initialized with zeros on all devices except device0
. The latter’s array is filled with ones.We start a communication group in line 34, and then call
ncclBcast
per device in line 37. The first argument of the call is per-devicedx
, the second the size ofdx
. Then follows thencclDataType_t
, the root (device0
), then the communicator (int(comms[dev])
) and finally the stream (None
). Castingcomms[dev]
toint
is required as the result is otherwise interpreted as single-elementPy_buffer
by HIP Python’sncclBcast
instead of as an address.In line 39, we close the communication group again.
We download all data to the host per device and check if the elements are set to
1
(lines 42-50). Otherwise, a runtime error is emitted.Finally, we clean up by deallocating all device memory and destroying the per-device communicators via
ncclCommDestroy
in line 55. Note that here again thecomm
must be converted toint
before passing it to the HIP Python routine.
Note
Please also see HIP Python’s Adapter Types for more details on automatic type conversions supported by HIP Python’s datatypes.
Basic Usage (Cython)#
What will I learn?
How I can use HIP Python’s Cython modules in my Cython code.
How to compile my Cython code that uses HIP Python’s Cython modules.
In this section, we show how to use HIP Python’s Cython modules and how to compile projects that use them.
Cython Recap#
Note
This section expects that the user has at least some basic knowledge about the programming language Cython. If you are unfamiliar with the language, we refer to the Cython tutorials and the Language Basics page.
Cython modules are often split into a *.pxd
and a *.pyx
file, which are a Cython module’s
declaration and implementation part respectively.
While the former files are to some degree comparable to header files in C/C++,
the latter can be compared to sources files.
The declaration part may only contain cdef
fields, variables, and function prototypes
while the implementation part may contain the implementation of those entities
as well as Python fields, variables, and functions.
The implementation part is the interface between the C/C++ and the Python world.
Here, you can import Python code via Python’s import
statements,
you can C-import cdef
declarations from other Cython
declaration files (*.pxd
) via cimport
statements,
and you can include C/C++ declarations from C/C++ header files as cdef
declarations.
To build a Python module from a Cython module,
the implementation part must be first “cythonized”, i.e. converted into a C/C++ file
and then compiled with a compiler. It is recommended to use the compiler that
was used for compiling the used python interpreter.
Most people don’t do this manually but instead prefer to use the build infrastructure
provided by setuptools
. They then write a setup.py
script
that contains the code that performs the aforementioned two tasks.
Cython modules in HIP Python#
Per Python module hip.hip
, hip.hiprtc
, … ,
HIP Python ships an additional c
-prefixed hip.c<pkg_name>
module.
The module without the
c
prefix is compiled into the interface for HIP Python’s Python users. However, allcdef
declarations therein can also becimport
ed by Cython users (typicallycdef class
declarations) and all Python objects therein can beimport
ed by Cython users too (typically enum and function objects).The module with the
c
prefix builds the bridge to the underlying HIP C library by including C definitions from the corresponding header files. This code is located in the declaration part. This part further declares runtime function loader prototypes. The definition of these function loaders in the implementation part first try to load the underlying C library and then if successful, try to load the function symbol from that shared object.
Note
The lazy-loading of functions at runtime can, under some circumstances, allow to use a HIP Python version that covers a superset or only a subset of the functions available within the respective library of a ROCm™ installation.
Using the Cython API#
You can import the Python objects that you need as shown below:
1from hip import hip # enum types, enum aliases, fields
2from hip import hiprtc
3# ...
In the same file, you can also or alternatively cimport
the cdef
entities
as shown below:
1from hip cimport chip # direct access to C interfaces and lazy function loaders
2from hip cimport chiprtc
3# ...
4
5from hip cimport hip # access to `cdef class` and `ctypedef` types
6 # that have been created per C struct/union/typedef
7from hip cimport hiprtc
8# ...
Compiling a Cython module#
After having written your own mymodule.pyx
file that uses HIP Python’s Cython API,
you can compile the result using a setup.py
script as shown below. In the setup.py
script, we only assume that HIP or HIPRTC is used. Therefore, only amdhip64
is put into the rocm_libs
list.
It is further important to specify the HIP Platform as the header files from which we include the C interfaces will be included at compile time by the underlying C/C++ compiler. The compilation path must include all these interfaces.
1import os, sys
2
3mymodule = "mymodule"
4
5from setuptools import Extension, setup
6from Cython.Build import cythonize
7
8ROCM_PATH=os.environ.get("ROCM_PATH", "/opt/rocm")
9HIP_PLATFORM = os.environ.get("HIP_PLATFORM", "amd")
10
11if HIP_PLATFORM not in ("amd", "hcc"):
12 raise RuntimeError("Currently only HIP_PLATFORM=amd is supported")
13
14def create_extension(name, sources):
15 global ROCM_PATH
16 global HIP_PLATFORM
17 rocm_inc = os.path.join(ROCM_PATH,"include")
18 rocm_lib_dir = os.path.join(ROCM_PATH,"lib")
19 platform = HIP_PLATFORM.upper()
20 cflags = ["-D", f"__HIP_PLATFORM_{platform}__"]
21
22 return Extension(
23 name,
24 sources=sources,
25 include_dirs=[rocm_inc],
26 library_dirs=[rocm_lib_dir],
27 libraries=rocm_libs,
28 language="c",
29 extra_compile_args=cflags,
30 )
31
32setup(
33 ext_modules = cythonize(
34 [create_extension(mymodule, [f"{mymodule}.pyx"]),],
35 compiler_directives=dict(language_level=3),
36 compile_time_env=dict(HIP_PYTHON=True),
37 )
38)