Two-dimensional kernels: Matrix multiplication tutorial#
GPUs provide a massively parallel architecture consisting of thousands of cores, making them exceptionally well-suited for data-parallel computations. Two-dimensional kernel patterns are commonly data-parallel, so they map naturally onto this architecture and let us exploit its inherent parallelism.
Tasks involving large matrices, which are common in image processing and machine learning applications, can be significantly accelerated by distributing computations across GPU cores. This tutorial explores how to implement matrix multiplication using two-dimensional GPU kernels with HIP.
Prerequisites#
To follow this tutorial, you’ll need installed drivers and a HIP compiler toolchain to compile your code. HIP supports compiling and running on Linux and Windows with AMD GPUs; the resulting combinations of install instructions are more than is worth covering as part of this tutorial. For more information about installing HIP development packages, see Install HIP.
Characteristics of 2D computational problems#
Spatial locality and data dependencies: Adjacent elements in the grid often exhibit strong spatial correlations, making memory access patterns and cache utilization critical for performance.
Natural 2D data representation: Many datasets—such as images, numerical matrices, and discretized physical fields—map directly onto a two-dimensional coordinate space.
Prevalence in simulation and modeling: Numerous scientific and engineering workloads, such as finite difference methods, fluid dynamics, heat transfer, and image processing, are inherently two-dimensional.
Modern GPU architectures are engineered to exploit the parallelism of 2D computational grids. By designing kernels that operate on two-dimensional thread blocks and memory layouts, developers can optimize global memory access, minimize latency, and maximize throughput. Leveraging 2D kernel configurations not only aligns the computation with the GPU’s hardware topology but also enables substantial performance improvements for domain-specific applications.
Matrix multiplication#
Let \(A\) and \(B\) be two matrices defined as \(A \in \mathbb{R}^{m \times n}\) and \(B \in \mathbb{R}^{n \times k}\). The matrix product \(C = A \cdot B\) is defined only when the number of columns of \(A\) equals the number of rows of \(B\). The resulting matrix \(C \in \mathbb{R}^{m \times k}\) has elements given by:
\[C_{ij} = \sum_{l=1}^{n} A_{il} \cdot B_{lj}\]
for all \(i = 1, \dots, m\) and \(j = 1, \dots, k\).
In other words, each element \(C_{ij}\) is computed as the dot product of the \(i\)-th row vector of \(A\) and the \(j\)-th column vector of \(B\). This operation is repeated for all valid pairs of \((i, j)\) to construct the complete matrix \(C\).
For two square matrices of size \(N \times N\), the computational cost of classical matrix multiplication is \(O(N^3)\). As an example, consider \(N = 32\):
Multiplication operations: \(32^3 = 32{,}768\)
Addition operations: \(32^2 \times (32 - 1) = 31{,}744\)
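Combining these counts for general \(N\) gives
\[N^3 + N^2 (N - 1) = 2N^3 - N^2\]
total arithmetic operations; for \(N = 32\) this is \(32{,}768 + 31{,}744 = 64{,}512\), consistent with the \(O(N^3)\) complexity stated above.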
Each element in the resulting matrix \(C\) is computed independently of the others, since it depends only on a single row of \(A\) and a single column of \(B\). This property makes matrix multiplication highly parallelizable and well-suited for execution on GPUs, multi-core CPUs, or distributed computing architectures.
CPU implementation#
A baseline CPU implementation provides a clear understanding of the classical matrix multiplication algorithm before exploring parallel GPU execution.
#include <iostream>
#include <cstdlib>
#define N 32
void cpu_matrix_multiplication(float *a, float *b, float *c, int n) {
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < n; ++k) {
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
}

int main() {
    float *a, *b, *c;
    a = (float*)malloc(sizeof(float) * N * N);
    b = (float*)malloc(sizeof(float) * N * N);
    c = (float*)malloc(sizeof(float) * N * N);

    // Initialize matrix A
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            a[i * N + j] = static_cast<float>(rand()) / RAND_MAX;
        }
    }

    // Initialize matrix B
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            b[i * N + j] = static_cast<float>(rand()) / RAND_MAX;
        }
    }

    cpu_matrix_multiplication(a, b, c, N);

    free(a);
    free(b);
    free(c);
    return 0;
}
The cpu_matrix_multiplication function performs the classical \(O(N^3)\)
matrix multiplication algorithm using three nested loops. The implementation
proceeds as follows:
Input parameters: Three pointers to contiguous memory blocks representing matrices \(A\), \(B\), and the output matrix \(C\).
Outer and middle loops: The indices \(i\) and \(j\) iterate over the rows and columns of the output matrix \(C\), respectively.
Innermost loop: For each element \(C_{ij}\), the loop over \(k\) performs a dot product between the \(i\)-th row of \(A\) and the \(j\)-th column of \(B\):
\[C_{ij} = \sum_{k=0}^{n-1} A_{ik} \cdot B_{kj}\]
Temporary accumulation: A local scalar sum accumulates the intermediate sum before being written to c[i * n + j].
This implementation has a computational complexity of \(O(N^3)\) and poor cache locality for large matrices, but it serves as a reference for understanding sequential computation before introducing GPU parallelization.
GPU implementation#
The following example demonstrates a complete HIP implementation of matrix multiplication, including host and device memory management, kernel implementation, configuration, and synchronization.
GPU kernel#
The core computation is performed by the GPU kernel, where each thread computes one element of the output matrix. For comparison, the CPU implementation is also provided.
GPU version:

__global__ void gpu_matrix_multiplication(float *a, float *b, float *c, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;
    if (row < n && col < n) {
        for (int k = 0; k < n; ++k) {
            sum += a[row * n + k] * b[k * n + col];
        }
        c[row * n + col] = sum;
    }
}

CPU version:

void cpu_matrix_multiplication(float *a, float *b, float *c, int n) {
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < n; ++k) {
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
}
The outer and middle loops of the CPU implementation are replaced by the parallel execution of the GPU implementation. Each GPU thread computes a single element \(C_{ij}\) of the output matrix, corresponding to one dot product between a row of \(A\) and a column of \(B\). This decomposition exposes massive parallelism, as all elements of \(C\) can be computed independently and concurrently.
Thread and block identification#
Each thread’s global position within the grid is determined by:
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
Here:
threadIdx.(x|y): Local thread indices within a block.
blockIdx.(x|y): Block indices within the grid.
blockDim.(x|y): Dimensions of each block (used to scale offsets).
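To make the mapping concrete, the following standalone sketch (not part of the tutorial program; the kernel name print_coordinates is our own) prints the global coordinates each thread computes with exactly these expressions:

#include <hip/hip_runtime.h>
#include <cstdio>

// Each thread reports the output element (row, col) it would be responsible for.
__global__ void print_coordinates() {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    printf("block (%d,%d), thread (%d,%d) -> row %d, col %d\n",
           (int)blockIdx.x, (int)blockIdx.y,
           (int)threadIdx.x, (int)threadIdx.y, row, col);
}

int main() {
    dim3 threadsPerBlock(4, 4);   // small sizes keep the output readable
    dim3 blocksPerGrid(2, 2);
    hipLaunchKernelGGL(print_coordinates, blocksPerGrid, threadsPerBlock, 0, 0);
    hipDeviceSynchronize();       // wait for device-side printf output to complete
    return 0;
}

For instance, with 4 × 4 blocks, the thread at threadIdx = (1, 2) in block blockIdx = (1, 0) computes col = 1 * 4 + 1 = 5 and row = 0 * 4 + 2 = 2, so it works on row 2, column 5 of \(C\) (zero-based indices).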
Boundary checking#
Since the total number of threads launched may exceed \(N^2\), boundary checking ensures that threads outside the matrix domain do not perform invalid memory accesses:
if (row < n && col < n) {
    // Safe computation region
}
Dot product computation#
Within the valid region, each thread executes a dot product over \(k\):
Loads one element from row row of matrix \(A\).
Loads one element from column col of matrix \(B\).
Multiplies and accumulates these values into sum.
Writes the final scalar result to c[row * n + col].
This kernel performs the same \(O(N^3)\) arithmetic operations as the CPU version but distributes them across thousands of concurrent GPU threads, achieving significant acceleration through parallel execution and memory throughput optimization.
Step 1: Host memory allocation and initialization#
Host memory is allocated for matrices and initialized with random floating-point values.
int main() {
    float *h_a, *h_b, *h_c;
    h_a = (float*)malloc(sizeof(float) * N * N);
    h_b = (float*)malloc(sizeof(float) * N * N);
    h_c = (float*)malloc(sizeof(float) * N * N);

    // Initialize matrix A
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            h_a[i * N + j] = static_cast<float>(rand()) / RAND_MAX;
        }
    }

    // Initialize matrix B
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            h_b[i * N + j] = static_cast<float>(rand()) / RAND_MAX;
        }
    }
Description:
Declare host (CPU) pointers h_a, h_b, and h_c.
Allocate contiguous memory for each \(N \times N\) matrix.
Initialize input matrices \(A\) and \(B\) with pseudo-random floating-point values in the range [0, 1).
Step 2: Device memory allocation and data transfer#
Memory is allocated on the GPU, and input matrices are transferred from host to device.
    float *d_a, *d_b, *d_c;
    hipMalloc((void**)&d_a, sizeof(float) * N * N);
    hipMalloc((void**)&d_b, sizeof(float) * N * N);
    hipMalloc((void**)&d_c, sizeof(float) * N * N);

    hipMemcpy(d_a, h_a, sizeof(float) * N * N, hipMemcpyHostToDevice);
    hipMemcpy(d_b, h_b, sizeof(float) * N * N, hipMemcpyHostToDevice);
Operations:
Allocate GPU (device) memory for matrices d_a, d_b, and d_c.
Transfer data from host to device using hipMemcpy() with direction hipMemcpyHostToDevice.
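The snippets in this tutorial omit error checking for brevity. In practice, HIP runtime calls return a hipError_t that should be inspected; below is a minimal sketch of a checking macro (the name HIP_CHECK is our own convention, not part of the HIP API) that could wrap the allocations and copies above.

#include <hip/hip_runtime.h>
#include <cstdio>
#include <cstdlib>

// Abort with a readable message if a HIP runtime call fails.
#define HIP_CHECK(expr)                                                   \
    do {                                                                  \
        hipError_t status = (expr);                                       \
        if (status != hipSuccess) {                                       \
            fprintf(stderr, "HIP error %s at %s:%d\n",                    \
                    hipGetErrorString(status), __FILE__, __LINE__);       \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

// Possible usage with the allocations and copies from this step:
// HIP_CHECK(hipMalloc((void**)&d_a, sizeof(float) * N * N));
// HIP_CHECK(hipMemcpy(d_a, h_a, sizeof(float) * N * N, hipMemcpyHostToDevice));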
Step 3: Configure and launch kernel#
Kernel launch parameters define how threads are organized across blocks and the grid.
    dim3 threadsPerBlock(BLOCK_SIZE, BLOCK_SIZE);
    int n_blocks = static_cast<int>(ceil(static_cast<float>(N) / BLOCK_SIZE));
    dim3 blocksPerGrid(n_blocks, n_blocks);

    hipLaunchKernelGGL(gpu_matrix_multiplication,
                       blocksPerGrid,
                       threadsPerBlock,
                       0, 0,
                       d_a, d_b, d_c, N);
    hipDeviceSynchronize();
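hipLaunchKernelGGL is HIP’s portable launch macro. When compiling with hipcc, HIP also accepts the CUDA-style triple-chevron launch syntax, so the launch above could equivalently be written as:

    // Equivalent launch using the triple-chevron syntax supported by hipcc.
    gpu_matrix_multiplication<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, N);
    hipDeviceSynchronize();

These lines assume BLOCK_SIZE has been defined earlier in the program, for example with #define BLOCK_SIZE 16 to match the 16 × 16 block discussed below.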
Configuration details#
The dim3 type defines thread and block dimensions:
threadsPerBlock: Number of threads per block. A 16 × 16 block (256 threads total).
n_blocks: Number of blocks per dimension. Computed as \(\lceil N / \mathrm{BLOCK\_SIZE} \rceil\).
blocksPerGrid: A grid of blocks covering the entire \(N \times N\) matrix.
For example, with \(N = 256\) and BLOCK_SIZE = 16:
n_blocks: \(\lceil 256 / 16 \rceil = 16\)
blocksPerGrid: \(16 \times 16 = 256\) blocks
Total threads: \(256 \text{ blocks} \times 256 \text{ threads/block} = 65{,}536\) threads.
Rounding up ensures full coverage of the matrix even when \(N\) is not an
exact multiple of BLOCK_SIZE. The boundary check in the kernel
if (row < n && col < n)
prevents out-of-bounds memory access for extra threads.
Synchronization#
The call to hipDeviceSynchronize() ensures that all GPU computations
complete before the CPU accesses results or proceeds to subsequent operations.
This is essential for correctness and debugging.
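Note that a kernel launch itself does not return an error code. A common pattern, sketched below using the hypothetical HIP_CHECK macro from Step 2, is to query hipGetLastError() right after the launch to catch configuration problems, and to check the status returned by hipDeviceSynchronize() for errors raised during execution.

    // HIP_CHECK is the error-checking helper sketched in Step 2 (our own macro).
    HIP_CHECK(hipGetLastError());        // catches launch-configuration errors
    HIP_CHECK(hipDeviceSynchronize());   // catches errors raised while the kernel ran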
Step 4: Copy results back and cleanup#
After the kernel execution, results are transferred back to host memory, and all allocated resources are released.
    hipMemcpy(h_c, d_c, sizeof(float) * N * N, hipMemcpyDeviceToHost);

    hipFree(d_a);
    hipFree(d_b);
    hipFree(d_c);

    free(h_a);
    free(h_b);
    free(h_c);

    return 0;
}
Summary of final steps:
Copy the result matrix from device to host using hipMemcpy.
Free all device memory allocations with hipFree.
Release host memory with free.
Return control to the operating system, indicating successful program termination.
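Although not part of the listing above, a common extra step is to validate the GPU output against the CPU reference implementation before freeing the host buffers. The sketch below shows one way to do this; the function name, tolerance, and the extra h_ref buffer are our own choices.

#include <cmath>

// Compare the GPU result against the CPU reference element by element.
bool results_match(const float *reference, const float *gpu, int n, float tolerance = 1e-4f) {
    for (int i = 0; i < n * n; ++i) {
        if (std::fabs(reference[i] - gpu[i]) > tolerance) {
            return false;  // mismatch found
        }
    }
    return true;
}

// Possible usage before the free() calls:
// cpu_matrix_multiplication(h_a, h_b, h_ref, N);  // h_ref: extra host buffer, allocated like h_c
// bool ok = results_match(h_ref, h_c, N);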
Parallelization benefits#
For a 256 × 256 matrix multiplication:
Sequential CPU version: Computes 65,536 output elements serially.
Parallel GPU version: Executes up to 65,536 independent threads concurrently.
This results in a theoretical performance gain proportional to the number of active GPU threads and the device’s compute throughput.
Each output element \(C_{ij}\) is computed independently from others, since it depends solely on row \(i\) of matrix \(A\) and column \(j\) of matrix \(B\). This independence allows full utilization of GPU streaming multiprocessors and makes the algorithm highly scalable.
Best practices#
Choose optimal block sizes
Powers of two (e.g., 16 or 32) often yield better occupancy and memory alignment.
Handle boundary conditions
Always include thread boundary checks.
Synchronize appropriately
Use hipDeviceSynchronize() after kernel launches to ensure data consistency.
Memory coalescing
Arrange data access patterns so consecutive threads access contiguous memory locations, maximizing bandwidth utilization.
Use shared memory
Use shared memory to cache sub-blocks (tiles) of the matrices, significantly reducing global memory latency; a sketch of this technique follows this list.
Profile and tune
Use tools such as rocprofv3 or the ROCm Compute Profiler to identify bottlenecks and fine-tune kernel launch configurations.
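As a concrete illustration of the shared-memory tiling mentioned above, here is a minimal sketch of a tiled kernel. It assumes the same row-major layout as the rest of the tutorial and a 16 × 16 thread block; the TILE macro and kernel name are our own, and the code is illustrative rather than a tuned implementation.

#define TILE 16  // assumed tile width, matching the 16 x 16 thread block

__global__ void gpu_matrix_multiplication_tiled(float *a, float *b, float *c, int n) {
    __shared__ float tile_a[TILE][TILE];
    __shared__ float tile_b[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    // Walk over the tiles of A and B that contribute to this output element.
    for (int t = 0; t < (n + TILE - 1) / TILE; ++t) {
        int a_col = t * TILE + threadIdx.x;
        int b_row = t * TILE + threadIdx.y;

        // Stage one tile of A and one tile of B in shared memory,
        // padding with zeros outside the matrix.
        tile_a[threadIdx.y][threadIdx.x] = (row < n && a_col < n) ? a[row * n + a_col] : 0.0f;
        tile_b[threadIdx.y][threadIdx.x] = (b_row < n && col < n) ? b[b_row * n + col] : 0.0f;
        __syncthreads();

        // Accumulate the partial dot product contributed by this tile.
        for (int k = 0; k < TILE; ++k) {
            sum += tile_a[threadIdx.y][k] * tile_b[k][threadIdx.x];
        }
        __syncthreads();
    }

    if (row < n && col < n) {
        c[row * n + col] = sum;
    }
}

The tiled kernel can be launched with the same blocksPerGrid and threadsPerBlock configuration as the naive kernel shown earlier.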
Conclusion#
Two-dimensional GPU kernels provide an efficient mechanism to accelerate dense linear algebra computations such as matrix multiplication by exploiting fine-grained data parallelism. This example demonstrates:
Structuring GPU kernels for 2D problems.
Managing memory transfers between host and device.
Configuring thread and block hierarchies.
Achieving substantial speedups via massive parallel execution.
Understanding these concepts enables developers to implement optimized GPU solutions for computationally intensive workloads, including scientific simulations, numerical linear algebra, and machine learning. Because each output element is computed independently, matrix multiplication serves as an ideal introductory example for mastering GPU programming paradigms applicable to a wide range of data-parallel applications.