Atomic operations: Histogram tutorial#
In GPU programming, a core design principle is to avoid simultaneous writes to the same memory address by multiple threads. When multiple threads write to the same location without proper synchronization, this creates a race condition, where the final result depends on unpredictable thread execution order.
Unlike CPUs, GPUs are designed for high-throughput parallel execution with relaxed memory consistency models and limited cache coherence mechanisms. This architectural choice maximizes bandwidth and scalability but introduces challenges when multiple threads need to safely update shared state.
This tutorial demonstrates how to safely handle concurrent memory updates using atomic operations, illustrated through the practical example of computing an image brightness histogram on the GPU.
Prerequisites#
To follow this tutorial, you'll need installed drivers and a HIP compiler toolchain to compile your code. HIP supports compiling and running on Linux and Windows with AMD GPUs; covering every combination of install instructions is beyond the scope of this tutorial. For more information about installing HIP development packages, see Install HIP.
Race condition#
A race condition occurs when two or more threads attempt to read-modify-write the same memory location concurrently without proper synchronization. Because GPU threads execute asynchronously across multiple cores (compute units), concurrent writes can interleave unpredictably, leading to incorrect results.
For example, if two threads simultaneously attempt:
```cpp
histogram[bin] = histogram[bin] + 1;
```
both may read the same old value before either writes back, resulting in only one increment being reflected. This results in lost updates and nondeterministic output, which must be avoided.
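The effect is easy to reproduce. The following minimal sketch (the kernel name and launch shape are illustrative) launches many threads that all increment one counter without synchronization; the printed total is typically far below the intended count:

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

// Every thread increments the same counter with a plain read-modify-write.
// The load, add, and store are separate instructions, so concurrent threads
// can read the same stale value and overwrite each other's updates.
__global__ void racyIncrement(int* counter)
{
    *counter = *counter + 1; // data race: increments get lost
}

int main()
{
    int* dCounter = nullptr;
    hipMalloc(&dCounter, sizeof(int));
    hipMemset(dCounter, 0, sizeof(int));

    racyIncrement<<<256, 256>>>(dCounter); // 65,536 intended increments

    int result = 0;
    hipMemcpy(&result, dCounter, sizeof(int), hipMemcpyDeviceToHost);
    printf("expected 65536, got %d\n", result); // usually much less
    hipFree(dCounter);
    return 0;
}
```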
Histogram#
A histogram partitions continuous data into discrete intervals called bins and counts how many data points fall into each bin. In image processing, a histogram typically represents the distribution of pixel intensities, such as brightness or color-channel values.
The histogram algorithm can be expressed as:

\[H(b) = \sum_{i=1}^{N} \delta\big(f(x_i) = b\big)\]

where \(f(x_i)\) maps each of the \(N\) data values to its corresponding bin index \(b\), and \(\delta(\cdot)\) is 1 when the value belongs to bin \(b\) and 0 otherwise.
The basic computational steps are:
1. Iterate through all pixels (or data points).
2. Determine the appropriate bin for each value.
3. Increment that bin's count.
In a serial CPU program, this is straightforward. On a GPU, thousands of threads may attempt to increment the same bin concurrently, leading to race conditions unless atomic synchronization is used.
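For reference, a serial C++ sketch of these steps (assuming the input values are brightness levels already normalized to [0, 1]):

```cpp
#include <algorithm>
#include <vector>

// Serial histogram: one increment at a time, so no race is possible.
std::vector<int> histogramSerial(const std::vector<float>& values, int numBins)
{
    std::vector<int> bins(numBins, 0);
    for (float v : values) {
        // Map [0, 1] to a bin index; clamp v == 1.0f into the last bin.
        int bin = std::min(static_cast<int>(v * numBins), numBins - 1);
        ++bins[bin];
    }
    return bins;
}
```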
The challenge in a parallel context#
When multiple threads attempt to increment the same bin:
- One thread's update can overwrite another's pending increment.
- Memory coherence cannot guarantee ordered visibility across thread blocks.
- The final result may be inconsistent or incorrect.
This necessitates synchronization mechanisms to ensure that updates occur in a mutually exclusive manner without introducing high overhead.
Atomic operations#
An atomic operation ensures that a compound operation — typically a read-modify-write sequence — executes as an indivisible unit. From the programmer’s perspective, atomicity guarantees that no other thread can observe a partially completed operation.
Formally, an operation \(O(x)\) on a shared variable \(x\) is atomic if its execution satisfies:

\[\forall\, i \neq j:\quad O_i(x) \prec O_j(x) \;\lor\; O_j(x) \prec O_i(x)\]

where \(\prec\) denotes "completes before". That is, any two operations on \(x\) are totally ordered: all threads observe results as if the operations occurred in a single, sequential order.
Mechanics#
Atomic operations on GPUs are implemented in hardware through a memory arbitration unit that locks a cache line, performs the modification, and releases the lock. This ensures correctness even under massive parallelism.
When a thread performs an atomic operation:
1. The target memory location is temporarily locked.
2. The value is fetched and updated.
3. The update is written back, and the lock is released.

No other thread can modify the same memory location during this sequence.
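Conceptually, atomicAdd behaves as if the read-modify-write below executed as a single indivisible step. This is a mental model only; in hardware, the sequence is a single operation handled by the memory subsystem:

```cpp
// Conceptual model of atomicAdd(addr, val) -- not real code: in hardware,
// the three steps below happen as one indivisible operation.
int atomicAddModel(int* addr, int val)
{
    int old = *addr;   // 1. fetch the current value
    *addr = old + val; // 2. apply the update
    return old;        // 3. return the previous value to the caller
}
```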
Atomic functions#
HIP provides a wide set of atomic primitives to synchronize updates to shared memory or global memory locations:
| Operation | Description |
|---|---|
| atomicAdd | Atomically adds a value to a memory location and returns the old value. |
| atomicSub | Atomically subtracts a value. |
| atomicExch | Atomically exchanges values between a register and memory. |
| atomicCAS | Performs an atomic compare-and-swap; fundamental for implementing locks. |
| atomicMax / atomicMin | Updates a memory location to the maximum or minimum of two values. |
| atomicInc / atomicDec | Atomically increments or decrements a counter, wrapping at a boundary. |
Atomic operations in kernels can operate on block scope (shared memory), device scope (global memory), or system scope (system memory), depending on hardware support.
For more information, see atomic functions.
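To illustrate why atomicCAS is described as fundamental, here is a common sketch (not part of the histogram code) that builds an atomic maximum for float out of a compare-and-swap loop, useful on targets without a native floating-point atomicMax. It assumes the __float_as_uint and __uint_as_float bit-cast intrinsics are available in device code:

```cpp
// Atomic max for float built from atomicCAS on the underlying bits.
__device__ float atomicMaxFloat(float* addr, float value)
{
    unsigned int* bits = reinterpret_cast<unsigned int*>(addr);
    unsigned int old = *bits;
    unsigned int assumed;
    do {
        assumed = old;
        if (__uint_as_float(assumed) >= value)
            break; // current value is already at least as large
        // Try to install the new maximum; retry if another thread raced us.
        old = atomicCAS(bits, assumed, __float_as_uint(value));
    } while (old != assumed);
    return __uint_as_float(old); // previous value, as atomics conventionally return
}
```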
Image brightness histogram#
We will compute a histogram that captures the distribution of pixel brightness in an RGB image. The algorithm:
1. Reads image data in interleaved height-width-channel format.
2. Converts RGB values to grayscale brightness.
3. Maps brightness to a histogram bin.
4. Atomically increments the corresponding bin counter.
Kernel implementation#
```cpp
__global__ void calculateHistogram(float* imageData, int* histogram,
                                   int width, int height,
                                   int channels, int numBins)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    if (x >= width || y >= height)
        return;

    int idx = (y * width + x) * channels;

    // Average the channels; input values are assumed to lie in [0, 1].
    float brightness = 0.0f;
    for (int c = 0; c < channels; ++c)
        brightness += imageData[idx + c];
    brightness /= channels;

    // Map brightness to a bin, clamping so that brightness == 1.0f
    // does not index one past the last bin.
    int bin = min(static_cast<int>(brightness * numBins), numBins - 1);

    // Atomic increment to avoid race conditions
    atomicAdd(&histogram[bin], 1);
}
```
Thread identification#
Each thread computes one pixel’s contribution using its 2D thread and block indices:
```cpp
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
```
This mapping provides a 1:1 correspondence between threads and pixels, making the computation naturally parallel.
Brightness computation#
Each pixel's brightness is computed as the arithmetic mean of its RGB channels:

\[\text{brightness} = \frac{1}{C} \sum_{c=1}^{C} p_c\]

where \(C\) is the number of channels and \(p_c\) is the value of channel \(c\). For input values in [0, 1], the mean stays in [0, 1] and is mapped to one of numBins histogram intervals; with numBins = 256, for example, a brightness of 0.5 lands in bin 128.
Safe histogram update#
The key step is:
```cpp
atomicAdd(&histogram[bin], 1);
```
This ensures that even if thousands of threads map to the same bin, each increment is serialized correctly, maintaining an accurate bin count.
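For completeness, here is a sketch of the host-side setup that launches this kernel, assuming calculateHistogram from above is defined in the same file. Error handling and real image loading are omitted; the dimensions and the 16×16 block shape are illustrative:

```cpp
#include <hip/hip_runtime.h>
#include <vector>

int main()
{
    // Illustrative dimensions; a real program would load an actual image.
    const int width = 1024, height = 768, channels = 3, numBins = 256;
    std::vector<float> image(width * height * channels, 0.5f);
    std::vector<int> histogram(numBins, 0);

    float* dImage = nullptr;
    int* dHistogram = nullptr;
    hipMalloc(&dImage, image.size() * sizeof(float));
    hipMalloc(&dHistogram, numBins * sizeof(int));
    hipMemcpy(dImage, image.data(), image.size() * sizeof(float),
              hipMemcpyHostToDevice);
    hipMemset(dHistogram, 0, numBins * sizeof(int)); // bins start at zero

    // One thread per pixel, rounded up to whole 16x16 blocks.
    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    calculateHistogram<<<grid, block>>>(dImage, dHistogram, width, height,
                                        channels, numBins);

    hipMemcpy(histogram.data(), dHistogram, numBins * sizeof(int),
              hipMemcpyDeviceToHost);
    hipFree(dImage);
    hipFree(dHistogram);
    return 0;
}
```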
Performance characteristics#
Benefits#
- Correctness under parallel updates: ensures race-free accumulation.
- Simplified synchronization: no explicit locks or barriers are needed.
- Hardware-level efficiency: implemented directly in the GPU memory subsystem.
Limitations#
While atomic operations guarantee correctness, they can serialize execution when multiple threads target the same memory address. This causes contention and reduces effective parallelism.
Typical sources of performance degradation include:

- Hot bins: many pixels fall into a small subset of bins.
- Global memory atomics: slower than shared-memory atomics due to higher access latency.
- Warp serialization: threads within a warp that target the same atomic address execute their updates one after another.
Best practices#
**Apply atomic operations only where necessary**

Atomic instructions serialize access to a memory location, and overuse can diminish SIMT parallel efficiency and increase warp stalls. Restrict atomic usage to code paths where data races cannot be eliminated through algorithmic restructuring.
**Minimize contention**

High contention on a single address or a small set of addresses leads to serialization. Distribute writes across independent memory locations.
**Leverage shared memory**

Use fast, low-latency shared memory to aggregate partial results within a block before issuing a single atomic update per bin to global memory, as in the sketch below.
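A sketch of this privatization pattern applied to the histogram (a 1D launch over pixels for brevity; the dynamic shared-memory size numBins * sizeof(int) is passed at launch):

```cpp
__global__ void calculateHistogramShared(const float* imageData, int* histogram,
                                         int numPixels, int channels, int numBins)
{
    // Per-block private histogram in low-latency shared memory.
    extern __shared__ int localBins[];

    // Cooperatively zero the shared bins.
    for (int b = threadIdx.x; b < numBins; b += blockDim.x)
        localBins[b] = 0;
    __syncthreads();

    int pixel = blockIdx.x * blockDim.x + threadIdx.x;
    if (pixel < numPixels) {
        float brightness = 0.0f;
        for (int c = 0; c < channels; ++c)
            brightness += imageData[pixel * channels + c];
        brightness /= channels;
        int bin = min(static_cast<int>(brightness * numBins), numBins - 1);
        atomicAdd(&localBins[bin], 1); // contention limited to this block
    }
    __syncthreads();

    // One global atomic per bin per block instead of one per pixel.
    for (int b = threadIdx.x; b < numBins; b += blockDim.x)
        atomicAdd(&histogram[b], localBins[b]);
}
```

Launched as calculateHistogramShared<<<grid, block, numBins * sizeof(int)>>>(...), this reduces global-memory atomic traffic from one operation per pixel to one per bin per block.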
**Validate correctness**

Validate the numerical and logical correctness of GPU kernels by comparing against single-threaded or deterministic multi-threaded CPU baselines.
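For the histogram this check is simple, because counting is exact integer arithmetic: the GPU and CPU results should match bin for bin. A sketch, reusing the serial reference shown earlier (the helper name is illustrative):

```cpp
#include <cassert>
#include <vector>

// Compare the GPU histogram against a CPU reference, bin by bin.
void validateHistogram(const std::vector<int>& gpuResult,
                       const std::vector<int>& cpuReference)
{
    assert(gpuResult.size() == cpuReference.size());
    for (size_t b = 0; b < gpuResult.size(); ++b)
        assert(gpuResult[b] == cpuReference[b]);
}
```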
**Profile regularly**

GPU performance is highly sensitive to thread divergence, memory-access patterns, and workload distribution. Regularly use profiling tools such as rocprofv3 or the ROCm Compute Profiler to examine warp-level execution efficiency, memory-coalescing behavior, occupancy, and atomic-throughput bottlenecks.
Conclusion#
Atomic operations provide a low-level synchronization mechanism that allows
correct and deterministic parallel updates to shared data structures. In the
histogram example, atomicAdd() ensures that all threads safely
contribute to their corresponding bins, preventing race conditions.
While atomics incur some serialization overhead, they are indispensable for algorithms that require concurrent accumulation or counting. By applying techniques like privatization and reduction, developers can achieve both correctness and high performance on modern GPUs.
Atomic operations form the foundation for more advanced synchronization patterns, including parallel reductions, prefix sums, and graph traversal, and are essential for developing scalable, data-parallel GPU algorithms.