Stencil operations: Image convolution tutorial#

Stencil operations represent an important class of embarrassingly parallel algorithms that are ideally suited for GPU acceleration. A stencil algorithm iteratively updates each element of an array based on the values of its neighboring cells, making it a fundamental technique in various computational domains.

Prerequisites#

To follow this tutorial, you'll need installed drivers and a HIP compiler toolchain to compile your code. HIP supports compiling and running on Linux and Windows with AMD GPUs; covering every combination of platform-specific install instructions is beyond the scope of this tutorial. For more information about installing HIP development packages, see Install HIP.

Applications of stencil operations#

Stencil algorithms are commonly used in:

  • Physics simulations: Modeling heat transfer, fluid dynamics, and wave propagation

  • Partial differential equations: Numerical solutions to scientific computing problems

  • Image processing: Convolutional operations for smoothing, sharpening, and edge detection

  • Convolutional Neural Networks: A major building block in modern deep learning

By applying different image convolution kernels (not to be confused with GPU kernels), these algorithms can smooth and sharpen image features and detect edges effectively.

Image convolution#

An image convolution applies a small matrix (the mask or filter kernel) to an input image. For each pixel \((x, y)\), the output is computed as:

\[I'(x, y) = \sum_{i=-r}^{r} \sum_{j=-r}^{r} M(i, j) \cdot I(x + i, y + j)\]

where \(M(i, j)\) is the mask coefficient and \(r\) is the mask radius, half the mask width (assuming an odd-sized square kernel).

Step-by-step description of the equation:

  1. Center the mask over the current pixel

  2. Multiply each mask value by the corresponding image pixel

  3. Sum all the products

  4. Store the result as the new pixel value
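
To make the formula concrete, the following host-side sketch applies it to a single pixel of a tiny grayscale image. This is plain C++ for illustration only; the 3×3 image, the box-blur mask, and the function name convolve_pixel are made up for this example and are not part of the tutorial's GPU code.

#include <cstdio>

// Reference implementation of the convolution formula for one pixel.
// r is the mask radius, matching the equation above.
float convolve_pixel(const float *image, int width, int height,
                     const float *mask, int r, int x, int y)
{
    float sum = 0.0f;
    for (int j = -r; j <= r; ++j) {
        for (int i = -r; i <= r; ++i) {
            int ix = x + i;
            int iy = y + j;
            // Skip samples that fall outside the image.
            if (ix < 0 || ix >= width || iy < 0 || iy >= height) {
                continue;
            }
            sum += mask[(j + r) * (2 * r + 1) + (i + r)] * image[iy * width + ix];
        }
    }
    return sum;
}

int main()
{
    // A 3x3 grayscale image with a single bright center pixel, and a
    // 3x3 box-blur mask whose weights sum to 1.
    const float image[9] = {0, 0, 0,  0, 9, 0,  0, 0, 0};
    const float mask[9] = {1.0f / 9, 1.0f / 9, 1.0f / 9,
                           1.0f / 9, 1.0f / 9, 1.0f / 9,
                           1.0f / 9, 1.0f / 9, 1.0f / 9};
    // The center pixel averages its neighborhood: 9 * (1/9) = 1.
    std::printf("%f\n", convolve_pixel(image, 3, 3, mask, 1, 1, 1));
    return 0;
}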

Dimensionality of stencils#

Stencil computations extend beyond 2D image grids:

  • 1D: Signal filtering, time series processing

  • 2D: Image processing, texture analysis

  • 3D: Volume data, fluid flow, and physical field simulation

This tutorial focuses on 2D image convolution, the most common stencil operation in visual and scientific computing.

The smoothing operation#

The tutorial implements a box blur (uniform smoothing filter). Each pixel’s new value is the average of its local neighborhood. This operation:

  • Reduces noise by averaging local intensity variations.

  • Acts as a low-pass filter, attenuating high-frequency components.

  • Provides an ideal example of a stencil computation with uniform weights.
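
For example, a 3×3 box blur uses the uniform mask

\[M = \frac{1}{9}\begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix}\]

whose weights sum to 1, so a region of uniform brightness passes through the filter unchanged.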

Two-dimensional grid architecture#

The tutorial uses a two-dimensional grid that maps to the shape of the image, significantly simplifying the implementation. This approach:

  • Maps threads directly to pixel positions

  • Simplifies coordinate calculations

  • Enables intuitive spatial reasoning

  • Aligns with the natural structure of images

Grid configuration#

Rather than using a single integer to represent the size of the grid, the tutorial uses dim3 objects, which hold one value per dimension. The grid is sized so that:

  • x dimension: Contains enough blocks to cover the width of the image

  • y dimension: Contains enough blocks to cover the height of the image

  • z dimension: Set to 1 for 2D problems
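
As a concrete example (the image size here is hypothetical, chosen only for illustration), a 1920×1080 image processed with 16×16 blocks needs a 120×68×1 grid:

dim3 block_size = {16, 16, 1};
// ceil(1920 / 16) = 120 blocks in x, ceil(1080 / 16) = 68 blocks in y.
dim3 grid_size = {(1920 + 15) / 16, (1080 + 15) / 16, 1};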

Complete implementation#

Header and setup#

#include <hip/hip_runtime.h>
#include <vector>
#include "image.h"

The convolution kernel#

Here’s the complete 2D convolution kernel for image smoothing. It reads from an input buffer and writes to a separate output buffer, because updating the image in place would let a thread read neighboring pixels that other threads have already overwritten:

__global__ void conv2d(const uint8_t *input, uint8_t *output, float *mask,
                       int image_width, int image_height,
                       int mask_width, int mask_height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    if (x >= image_width || y >= image_height) {
        return;
    }

    float sum = 0;
    for (int i = 0; i < mask_width; i++) {
        for (int j = 0; j < mask_height; j++) {
            // Calculate the coordinate of the pixel to read.
            int image_x = x + i - mask_width / 2;
            int image_y = y + j - mask_height / 2;

            // Do not read outside the image.
            if (image_x < 0 || image_x >= image_width ||
                image_y < 0 || image_y >= image_height) {
                continue;
            }

            // Accumulate the value of the pixel.
            int image_index = image_y * image_width + image_x;
            int mask_index = j * mask_width + i;
            sum += input[image_index] / 255.0f * mask[mask_index];
        }
    }

    // Write to the separate output buffer, scaling back to the 0-255 range.
    int image_index = y * image_width + x;
    output[image_index] = sum * 255;
}

Thread identification in 2D#

To obtain thread IDs in both x and y dimensions:

int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;

This calculation combines:

  • threadIdx.x and threadIdx.y: Local thread coordinates within a block

  • blockIdx.x and blockIdx.y: Block coordinates within the grid

  • blockDim.x and blockDim.y: Block dimensions
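
For example, with 16×16 blocks (blockDim = (16, 16)), the thread with threadIdx = (3, 5) in the block with blockIdx = (2, 1) computes x = 2 · 16 + 3 = 35 and y = 1 · 16 + 5 = 21, so it processes the pixel at (35, 21).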

Boundary checking#

if (x >= image_width || y >= image_height) {
    return;
}

This ensures threads don’t process pixels outside the image bounds. Threads that exceed the image dimensions simply return without doing work.

Mask application loop#

The nested loops iterate over the mask dimensions:

for (int i = 0; i < mask_width; i++) {
    for (int j = 0; j < mask_height; j++) {
        // Process each mask element
    }
}

Coordinate calculation#

For each position in the mask, calculate the corresponding image coordinate:

int image_x = x + i - mask_width / 2;
int image_y = y + j - mask_height / 2;

The - mask_width / 2 and - mask_height / 2 terms center the mask over the current pixel. For example, with a 5×5 mask, mask_width / 2 is 2, so i = 0 maps to image_x = x - 2 and i = 4 maps to image_x = x + 2.

Edge handling#

if (image_x < 0 || image_x >= image_width ||
    image_y < 0 || image_y >= image_height) {
    continue;
}

This prevents reading outside the image boundaries. When the mask extends beyond the image edge, the code simply skips those pixels (continue to the next iteration).
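
Skipping out-of-bounds samples means edge pixels accumulate less than the full set of mask weights, which slightly darkens the border under a normalized averaging mask. A common alternative, shown here as a hypothetical variant rather than part of the tutorial's kernel, clamps coordinates to the nearest valid pixel instead:

// Clamp-to-edge variant: out-of-range reads snap to the nearest valid pixel,
// so no samples are dropped and the boundary check above becomes unnecessary.
int image_x = min(max(x + i - mask_width / 2, 0), image_width - 1);
int image_y = min(max(y + j - mask_height / 2, 0), image_height - 1);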

Accumulation#

int image_index = image_y * image_width + image_x;
int mask_index = j * mask_width + i;
sum += input[image_index] / 255.0f * mask[mask_index];

For each mask position:

  1. Calculate the flattened array index for the image pixel

  2. Calculate the flattened array index for the mask value

  3. Normalize the pixel value (divide by 255 to get 0-1 range)

  4. Multiply by the mask weight and accumulate

Writing the result#

int image_index = y * image_width + x;
output[image_index] = sum * 255;

After processing all mask positions, write the accumulated result to the output buffer, scaling back to the 0-255 range. Using a buffer separate from the input is what makes the parallel update safe: with an in-place update, a thread could read neighboring pixels that other threads had already overwritten.

Host code implementation#

Main function setup#

int main() {
    int width, height, channels;

    // Load an image from disk into a uint8_t buffer named image
    // (implementation not shown). The loader sets width, height, and
    // channels; this example assumes a single-channel (grayscale)
    // image, so channels is expected to be 1.

    static const int maskWidth = 200;
    static const int maskHeight = 200;
    std::vector<float> mask(maskWidth * maskHeight);

    // Initialize mask with uniform averaging weights that sum to 1.
    for (int i = 0; i < maskWidth * maskHeight; ++i) {
        mask[i] = 1.0f / (maskWidth * maskHeight);
    }

Mask initialization#

The mask is initialized with uniform weights that sum to 1.0:

  • Each element is 1.0 / (maskWidth * maskHeight)

  • This creates an averaging filter

  • When applied, it produces a smoothing (blurring) effect

Memory allocation and data transfer#

// Allocate GPU memory and copy data to the GPU. A separate output buffer
// prevents the kernel from reading pixels that other threads have already
// updated.
uint8_t *d_input;
uint8_t *d_output;
float *d_mask;
hipMalloc(&d_input, width * height * channels * sizeof(uint8_t));
hipMalloc(&d_output, width * height * channels * sizeof(uint8_t));
hipMalloc(&d_mask, maskWidth * maskHeight * sizeof(float));

hipMemcpy(d_input, image, width * height * channels * sizeof(uint8_t),
          hipMemcpyHostToDevice);
hipMemcpy(d_mask, mask.data(), maskWidth * maskHeight * sizeof(float),
          hipMemcpyHostToDevice);

Grid configuration and kernel launch#

// Calculate grid size and launch the kernel.
dim3 block_size = {16, 16, 1};
dim3 grid_size = {(width + block_size.x - 1) / block_size.x,
                  (height + block_size.y - 1) / block_size.y, 1};

conv2d<<<grid_size, block_size>>>(d_input, d_output, d_mask, width, height,
                                  maskWidth, maskHeight);
hipDeviceSynchronize();

Grid size calculation:

  • block_size: 16 × 16 threads per block (256 threads total).

  • grid_size: Calculated to cover the entire image.

  • The (width + block_size.x - 1) / block_size.x formula ensures there are enough blocks to cover all pixels, rounding up.
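
The host snippets above omit error checking for brevity. In production code, every HIP runtime call returns a hipError_t worth inspecting, and kernel launch failures must be queried with hipGetLastError(). Below is a minimal sketch of one common pattern; the HIP_CHECK macro name is our own convention, not a HIP API, and the calls reuse the variables from the host code above.

#include <cstdio>
#include <cstdlib>

#define HIP_CHECK(expr)                                               \
    do {                                                              \
        hipError_t err = (expr);                                      \
        if (err != hipSuccess) {                                      \
            std::fprintf(stderr, "HIP error %s at %s:%d\n",           \
                         hipGetErrorString(err), __FILE__, __LINE__); \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Wrap runtime calls directly:
HIP_CHECK(hipMemcpy(d_input, image, width * height * channels * sizeof(uint8_t),
                    hipMemcpyHostToDevice));

// Kernel launches do not return an error; query it after the launch:
conv2d<<<grid_size, block_size>>>(d_input, d_output, d_mask, width, height,
                                  maskWidth, maskHeight);
HIP_CHECK(hipGetLastError());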

Retrieving results and cleanup#

    // Copy the result back to the host.
    hipMemcpy(image, d_output, width * height * channels * sizeof(uint8_t),
              hipMemcpyDeviceToHost);

    // Store the image to disk (implementation not shown)

    hipFree(d_input);
    hipFree(d_output);
    hipFree(d_mask);

    return 0;
}

Performance considerations#

For high-resolution images (for example, 4096×4096 pixels), GPU acceleration provides:

  • Massive parallelism: Tens of thousands of concurrent threads.

  • High throughput: Leveraging GPU memory bandwidth and computational density.

  • Scalability: Performance that scales with the number of available compute units.

Typical speedups over CPU implementations range from 10× to 100×, depending on image size, mask complexity, and GPU architecture.

Memory access patterns#

Image convolution requires repeated access to neighboring pixels, which makes the memory access pattern a key optimization target:

  • Non-coalesced memory accesses (not all accesses are contiguous).

  • Repeated reads of the same pixels by adjacent threads.

  • Potential for optimization using shared memory (an advanced technique, sketched below).
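
The following kernel sketches the shared-memory idea for the special case of a fixed 3×3 mask and 16×16 blocks. It is a simplified, hypothetical variant (the names conv2d_tiled, TILE, and MASK_RADIUS are ours, not part of the tutorial's code), and it clamps halo reads to the image edge rather than skipping them:

#define TILE 16
#define MASK_RADIUS 1  // fixed 3x3 mask for simplicity

__global__ void conv2d_tiled(const uint8_t *input, uint8_t *output,
                             const float *mask, int width, int height)
{
    // The shared tile holds the block's pixels plus a halo of
    // MASK_RADIUS pixels on every side.
    __shared__ float tile[TILE + 2 * MASK_RADIUS][TILE + 2 * MASK_RADIUS];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;

    // Cooperatively load the tile, each thread loading one or more pixels.
    for (int ty = threadIdx.y; ty < TILE + 2 * MASK_RADIUS; ty += TILE) {
        for (int tx = threadIdx.x; tx < TILE + 2 * MASK_RADIUS; tx += TILE) {
            // Clamp halo reads to the nearest valid pixel.
            int gx = min(max((int)(blockIdx.x * TILE) + tx - MASK_RADIUS, 0), width - 1);
            int gy = min(max((int)(blockIdx.y * TILE) + ty - MASK_RADIUS, 0), height - 1);
            tile[ty][tx] = input[gy * width + gx] / 255.0f;
        }
    }
    __syncthreads();

    if (x >= width || y >= height) {
        return;
    }

    // Each neighborhood read now comes from fast shared memory.
    float sum = 0.0f;
    for (int j = 0; j < 2 * MASK_RADIUS + 1; j++) {
        for (int i = 0; i < 2 * MASK_RADIUS + 1; i++) {
            sum += tile[threadIdx.y + j][threadIdx.x + i] *
                   mask[j * (2 * MASK_RADIUS + 1) + i];
        }
    }
    output[y * width + x] = sum * 255;
}

With this layout, each block reads every input pixel from global memory once, and the repeated neighborhood reads are served from low-latency shared memory (the LDS on AMD hardware).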

Best practices#

  1. Handle boundaries carefully

    Conditional branches in edge regions can force the divergent execution paths within a wavefront (or warp) to be serialized. Prefer approaches that avoid per-thread branching, including:

    • Pre-clamping coordinates to valid index ranges (as sketched in the Edge handling section above)

    • Padding or haloing input images so all threads operate on valid data

    • Using hardware boundary modes (such as texture sampling modes on RDNA) to offload boundary handling

    These techniques help maintain high SIMD lane utilization across the entire grid.

  2. Center your stencil properly

    Compute the stencil origin using mask_width / 2 (or the equivalent for rectangular masks) to ensure correct alignment between the input data and the mask coefficients. This prevents off-by-one misalignment that can propagate as spatial artifacts.

  3. Select mask sizes based on compute–memory tradeoffs

    Larger kernels increase arithmetic intensity but also expand the set of neighbor loads per output element. Balance mask dimensions with available bandwidth, register pressure, and shared-memory capacity, particularly when implementing separable or multi-pass stencils (see the sketch after this list).

  4. Normalize properly

    Ensure mask weights sum to the intended normalization constant, commonly 1.0 for averaging operations. When using integer or half-precision paths, verify scaling behavior to avoid overflow or unintended bias.

  5. Consider edge strategies

    Adopt a clear policy for pixels whose neighborhoods extend outside the valid domain. Options include skipping output generation, clamping to the nearest valid coordinate, wrapping coordinates, or mirroring.
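
A box blur is separable: averaging over a square window equals a horizontal 1D average followed by a vertical one, cutting neighbor loads per pixel from mask_width × mask_height to mask_width + mask_height. The following host-side sketch shows one 1D pass; it is plain C++ with hypothetical names, it clamps at the edges, and the vertical pass is identical with the roles of x and y swapped.

#include <algorithm>

// One horizontal pass of a separable box blur over a grayscale image
// stored as floats in [0, 1]. A second pass along y completes the blur.
void box_blur_pass_x(const float *in, float *out,
                     int width, int height, int radius)
{
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float sum = 0.0f;
            for (int i = -radius; i <= radius; ++i) {
                // Clamp reads to the image border.
                int ix = std::min(std::max(x + i, 0), width - 1);
                sum += in[y * width + ix];
            }
            out[y * width + x] = sum / (2 * radius + 1);
        }
    }
}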

Conclusion#

Stencil operations are a fundamental pattern in GPU computing that enables efficient parallel processing of spatially dependent data. The 2D convolution example demonstrates:

  • How to structure kernels for stencil patterns

  • Proper boundary handling for neighborhood operations

  • Effective use of 2D thread grids that map naturally to image structure

  • Memory access patterns for adjacent data elements

By understanding stencil operations, developers can implement a wide range of image processing algorithms, scientific simulations, and deep learning operations on GPUs. The patterns demonstrated here extend beyond image processing to any computational problem involving spatial relationships in multi-dimensional data.

The key to successful stencil implementations is carefully managing boundary conditions, ensuring correct coordinate calculations, and leveraging the GPU’s parallel architecture to process many independent stencil operations simultaneously.