SAXPY - Hello, HIP

SAXPY - Hello, HIP#

This tutorial explains the basic concepts of the single-source Heterogeneous-computing Interface for Portability (HIP) programming model and the essential tooling around it. It also reviews some commonalities of heterogenous APIs in general. This topic assumes basic familiarity with the C/C++ compilation model and language.

Prerequisites#

To follow this tutorial, you’ll need installed drivers and a HIP compiler toolchain to compile your code. Because HIP for ROCm supports compiling and running on Linux and Windows with AMD and NVIDIA GPUs, the combination of install instructions is more than worth covering as part of this tutorial. For more information about installing HIP development packages, see Install HIP.

Heterogeneous programming#

Heterogeneous programming and offloading APIs are often mentioned together. Heterogeneous programming deals with devices of varying capabilities simultaneously. Offloading focuses on the “remote” and asynchronous aspects of computation. HIP encompasses both. It exposes GPGPU (general-purpose GPU) programming much like ordinary host-side CPU programming and lets you move data across various devices.

When programming in HIP (and other heterogenous APIs for that matter), remember that target devices are built for a specific purpose. They are designed with different tradeoffs than traditional CPUs and therefore have very different performance characteristics. Even subtle changes in code might adversely affect execution time.

Your first lines of HIP code#

First, let’s do the “Hello, World!” of GPGPU: SAXPY. Single-precision A times X Plus Y (SAXPY) is a mathematical acronym; a vector equation $a \cdot x + y = z$ where $a \in R$ is a scalar and $x, y, z \in V$ are vector quantities of some large dimensionality. This vector space is defined over the set of reals. Practically speaking, you can compute this using a single for loop over three arrays.

for (int i = 0 ; i < N ; ++i)
    z[i] = a * x[i] + y[i];

In linear algebra libraries, such as BLAS (Basic Linear Algebra Subsystem) this operation is defined as AXPY “A times X Plus Y”. The “S” comes from single-precision, meaning that array element is float -s (IEEE 754 binary32 representation).

To quickly get started, use the set of HIP samples from GitHub. With Git configured on your machine, open a command-line and navigate to your desired working directory, then run:

git clone https://github.com/amd/rocm-examples.git

A simple implementation of SAXPY resides in the HIP-Basic/saxpy/main.hip file in this repository. The HIP code here mostly deals with where data has to be and when, and how devices transform this data. The first HIP calls deal with allocating device-side memory and copying data from host-side memory to device side in a C runtime-like fashion.

// Allocate and copy vectors to device memory.
float* d_x{};
float* d_y{};
HIP_CHECK(hipMalloc(&d_x, size_bytes));
HIP_CHECK(hipMalloc(&d_y, size_bytes));
HIP_CHECK(hipMemcpy(d_x, x.data(), size_bytes, hipMemcpyHostToDevice));
HIP_CHECK(hipMemcpy(d_y, y.data(), size_bytes, hipMemcpyHostToDevice));

HIP_CHECK is a custom macro borrowed from the examples utilities which checks the error code returned by API functions for errors and reports them to the console. It is not essential to the API, but it is a good practice to check the error codes of the HIP APIs in case you pass on incorrect values to the API, or the API might be out of resources.

The code selects the device to allocate to and to copy to. Commands are issued to the HIP runtime per thread, and every thread has a device set as the target of commands. The default device is 0, which is equivalent to calling hipSetDevice(0).

Launch the calculation on the device after the input data has been prepared.

__global__ void saxpy_kernel(const float a, const float* d_x, float* d_y, const unsigned int size)
{
    // ...
}

int main()
{
    // ...

    // Launch the kernel on the default stream.
    saxpy_kernel<<<dim3(grid_size), dim3(block_size), 0, hipStreamDefault>>>(a, d_x, d_y, size);
}

Analyze at the signature of the offloaded function:

__global__ instructs the compiler to generate code for this function as an entrypoint to a device program, such that it can be launched from the host.
The function does not return anything, because there is no trivial way to construct a return channel of a parallel invocation. Device-side entrypoints may not return a value, their results should be communicated using output parameters.
Device-side functions are typically called compute kernels, or just kernels for short. This is to distinguish them from non-graphics-related graphics shaders, or just shaders for short.
Arguments are taken by value and all arguments shall be TriviallyCopyable, meaning they should be memcpy-friendly. (Imagine if they had custom copy constructors. Where would that logic execute? On the host? On the device?) Pointer arguments are pointers to device memory, one typically backed by VRAM.
We said that we’ll be computing $a \cdot x + y = z$ , however we only pass two pointers to the function. We’ll be canonically reusing one of the inputs as outputs.

This function is launched from the host using a language extension often called the triple chevron syntax. Inside the angle brackets, provide the following.

The number of blocks to launch (our grid size)
The number of threads in a block (our block size)
The amount of shared memory to allocate by the host
The device stream to enqueue the operation on

The block size and shared memory become important later in Reduction. For now, a hardcoded 256 is a safe default for simple kernels such as this. Following the triple chevron is ordinary function argument passing.

Look at how the kernel is implemented.

__global__ void saxpy_kernel(const float a, const float* d_x, float* d_y, const unsigned int size)
{
    // Compute the current thread's index in the grid.
    const unsigned int global_idx = blockIdx.x * blockDim.x + threadIdx.x;

    // The grid can be larger than the number of items in the vectors. Avoid out-of-bounds addressing.
    if(global_idx < size)
    {
        d_y[global_idx] = a * d_x[global_idx] + d_y[global_idx];
    }
}

The unique linear index identifying the thread is computed from the block ID the thread is a member of, the block’s size and the ID of the thread within the block.
A check is made to avoid overindexing the input.
The useful part of the computation is carried out.

Retrieval of the result from the device is done much like input data copy. In this current step the results copied from device to host. The opposite direction of the input data copy:

HIP_CHECK(hipMemcpy(y.data(), d_y, size_bytes, hipMemcpyDeviceToHost));

SAXPY - Hello, HIP

Contents

SAXPY - Hello, HIP#

Prerequisites#

Heterogeneous programming#

Your first lines of HIP code#

Compiling on the command line#

Setting up the command line#

Invoking the compiler manually#