Reduction#
Reduction is a common algorithmic operation used in parallel programming to reduce an array of elements into a shorter array of elements or a single value. This document exploits reduction to introduce some key considerations while designing and optimizing GPU algorithms.
This document is a rejuvenation and extension of the invaluable work of Mark Harris. While the author approaches the topic with a less naive approach, reviewing some original material is valuable to see how much the underlying hardware has changed. This document provides a greater insight to demonstrate progress.
The algorithm#
Reduction has many names depending on the domain; in functional programming it’s referred to as fold, in C++, it’s called std::accumulate and in C++17, as std::reduce. A reduction takes a range of inputs and “reduces” the given range with a binary operation to a singular or scalar output. Canonically, a reduction requires a “zero” element that bootstraps the algorithm and serves as one of the initial operands to the binary operation. The “zero” element is generally called identity or neutral element in the group theory, which implies that it is an operand that doesn’t change the result. Some typical use cases are: calculating a sum or normalizing a dataset and finding the maximum value in the dataset. The latter use case is discussed further in this tutorial.
There are multiple variations of reduction that allow parallel processing. The approach taken by std::reduce requires the user-provided binary operator to operate on any combination of identity and input range elements, or even exclusively on any of them. This allows you to insert any number of identities to facilitate parallel processing and then combine the partial results of parallel execution.
Reduction on GPUs#
Implementing reductions on GPUs requires a basic understanding of the /understand/programming_model_reference. The document explores aspects of low-level optimization best discussed through the Inherent thread model, and refrains from using cooperative groups.
Synchronizing parallel threads of execution across a GPU is crucial for correctness as the partial results can’t be synchronized before they manifest. Synchronizing all the threads running on a GPU at any given time is possible, however, it is a costly and intricate operation. If synchronization is not absolutely necessary, map the parallel algorithm so that multiprocessors and blocks can make independent progress and need not sync frequently.
There are ten reduction implementations in the rocm-examples, which are described in the following sections.
Reducing thread divergence#
You can reduce divergence by keeping dataflow between memory addresses identical but reassigning the thread ids.
// Shared reduction
for (uint32_t i = 1; i < blockDim.x; i *= 2)
{
-    if (tid % (2 * i) == 0)
-        shared[tid] = op(shared[tid], shared[tid + i]);
+    if (uint32_t j = 2 * i * tid; j < blockDim.x)
+        shared[j] = op(shared[j], shared[j + i]);
    __syncthreads();
}
This way inactive threads start accumulating uniformly towards the higher thread ID index range and might uniformly skip to __syncthreads(). However, this introduces a bank conflicts issue.
Resolving bank conflicts#
Both AMD and NVIDIA implement shared memory in the hardware by organizing storage into banks of various sizes. This hardware element is known as Local Data Share (LDS) on AMD hardware. On NVIDIA hardware, it’s implemented using the same silicon as the L1 data cache. You can think of shared memory as a striped 2-dimensional range of memory. Shared memory bank’s count, width, and depth depend on the architecture. A bank conflict occurs when different threads in a warp access the same bank during the same operation. In this case, the hardware prevents the attempted concurrent accesses to the same bank by converting them into serial accesses.
A notable exception is when the shared read uniformly broadcasts to the same address across the entire warp. A better implementation of the naive algorithm is to form continuous ranges of the threads activities and their memory accesses.
// Shared reduction
-for (uint32_t i = 1; i < blockDim.x; i *= 2)
-{
-    if (tid % (2 * i) == 0)
+for (uint32_t i = blockDim.x / 2; i != 0; i /= 2)
+{
+    if (tid < i)
        shared[tid] = op(shared[tid], shared[tid + i]);
    __syncthreads();
}
Note
To avoid bank conflicts, read shared memory in a coalesced manner, which implies that reads/writes of each lane in a warp evaluate to consecutive locations. Analyzing the read/write patterns could help you to understand the cause of bank conflicts. For more details, check CDNA3 ISA or RDNA3 ISA data share operations chapter.
Utilize upper half of the block#
The preceding implementation is free of low-level GPU-specific anti-patterns. However, it still exhibits some common shortcomings. The loop performing the reduction in the shared memory starts from i = blockDim.x / 2 and the first predicate if (tid < i) immediately disables half of the block, which only helps load the data into the shared memory. You can change the kernel along with the calculation of factor on the host, as shown here:
const uint32_t tid = threadIdx.x,
               bid = blockIdx.x,
-              gid = bid * blockDim.x + tid;
+              gid = bid * (blockDim.x * 2) + tid;
// Read input from front buffer to shared
-shared[tid] = read_global_safe(gid);
+shared[tid] = op(read_global_safe(gid), read_global_safe(gid + blockDim.x));
__syncthreads();
By eliminating half of the threads and giving meaningful work to all the threads by unconditionally performing a binary op, you can prevent the wastage of half of the threads.
Even though global memory is read in a coalesced fashion, as preferred by the memory controller, optimal performance is still limited by the instruction throughput. Omit superfluous synchronization ——————————–
Warps are known to execute in a strict lockstep fashion. Therefore, once shared reduction reaches a point where only a single warp participates meaningfully, you can cut short the loop and let the rest of the warps terminate. Moreover, you can also unroll the loop without syncing the entire block.
The tmp namespace used beyond this point in this document holds a handful of template meta-programmed utilities to facilitate writing flexible and optimal code.
tmp::static_for is not just a constant folding within the optimizer but a variation of the language for loop, where the running index is a compile-time constant and is eligible for use in compile-time evaluated contexts.
Consider the following code:
constexpr int size = 4;
for (int i = 0 ; i < size ; ++i)
{
    printf("%d", i);
}
This compiles to the following binaries:
LLVM Block
main:
    push    rbx
    lea     rbx, [rip + .L.str]
    mov     rdi, rbx
    xor     esi, esi
    xor     eax, eax
    call    printf@PLT
    mov     rdi, rbx
    mov     esi, 1
    xor     eax, eax
    call    printf@PLT
    mov     rdi, rbx
    mov     esi, 2
    xor     eax, eax
    call    printf@PLT
    mov     rdi, rbx
    mov     esi, 3
    xor     eax, eax
    call    printf@PLT
    xor     eax, eax
    pop     rbx
    ret
.L.str:
    .asciz  "%d"
GCC
.LC0:
    .string "%d"
main:
    push    rbx
    xor     ebx, ebx
.L2:
    mov     esi, ebx
    mov     edi, OFFSET FLAT:.LC0
    xor     eax, eax
    add     ebx, 1
    call    printf
    cmp     ebx, 4
    jne     .L2
    xor     eax, eax
    pop     rbx
    ret
MSVC
main    PROC
    $LN12:
    push    rbx
    sub     rsp, 32
    xor     ebx, ebx
    npad    8
$LL4@main:
    mov     edx, ebx
    lea     rcx, OFFSET FLAT:'string'
    call    printf
    inc     ebx
    cmp     ebx, 4
    jl      SHORT $LL4@main
    xor     eax, eax
    add     rsp, 32
    pop     rbx
    ret     0
main    ENDP
LLVM unrolls the loop and compiles to a flat series of printf invocations, while both GCC and MSVC keep the loop intact, as visible from the compare (cmp) and the jump (jne, jl) instructions. LLVM code generation is identical to manually writing the unrolled loop:
printf("%d", 0);
printf("%d", 1);
printf("%d", 2);
printf("%d", 3);
While various non-standard pragmas are available to hint or force the compiler to unroll the loop, we instead use template meta-programming to force feed the compiler the unrolled loop.
constexpr int size = 4;
// Maybe unrolled loop
for (int i = 0 ; i < size ; ++i)
{
    printf("%d", i);
}
// Force unrolled loop
using namespace tmp;
static_for<0, less_than<size>, increment<1>>([]<int i>()
{
    printf("%d", i);
});
The most notable structural difference is that in the language for loop, the loop variable is given a name in the beginning, while in the static_for utility, the loop variable is given a name in the end. An important bonus is that in the loop’s body, you can use the running index i in contexts requiring constant expressions such as template arguments or inside if constexpr.
tmp::static_switch takes runtime value and runtime dispatches to a range of set of tabulated functions, where said value is a compile-time constant and is eligible for use in compile-time evaluated contexts.
Consider the following code:
int warp_size = device_props.warpSize;
switch (warp_size)
{
case 32:
    hipLaunchKernelGGL(kernel<32>, ...);
    break;
case 64:
    hipLaunchKernelGGL(kernel<64>, ...);
    break;
}
In the preceding code, note the code repetition for all possible values of warp_size, the code is prepared to handle. To avoid this, use tmp::static_switch, as shown:
tmp::static_switch<std::array{32, 64}>(warp_size, [&]<int WarpSize>()
{
    hipLaunchKernelGGL(kernel<WarpSize>, ...);
});
-template<typename T, typename F>
+template<uint32_t WarpSize, typename T, typename F>
__global__ void kernel(
    ...
)
{
    ...
// Shared reduction
-for (uint32_t i = blockDim.x / 2; i != 0; i /= 2)
+for (uint32_t i = blockDim.x / 2; i > WarpSize; i /= 2)
{
    if (tid < i)
        shared[tid] = op(shared[tid], shared[tid + i]);
    __syncthreads();
}
+// Warp reduction
+tmp::static_for<WarpSize, tmp::not_equal<0>, tmp::divide<2>>([&]<int I>()
+{
+    if (tid < I)
+        shared[tid] = op(shared[tid], shared[tid + I]);
+#ifdef __HIP_PLATFORM_NVIDIA__
+    __syncwarp(0xffffffff >> (WarpSize - I));
+#endif
+});
Because HIP typically targets hardware with warp sizes of 32 (NVIDIA GPUs and RDNA AMD GPUs) and 64 (CDNA AMD GPUs), portable HIP code must handle both. That is why instead of assuming a warp size of 32, make the warp size a template argument of the kernel. This allows you to unroll the final loop using tmp::static_for in a parametric way but still having the code read much like an ordinary loop.
Promoting the warp size to being a compile-time constant also requires you to handle it similarly on the host-side. You can sandwich the kernel launch with tmp::static_switch, promoting the snake-case run-time warp_size variable to a camel-case compile-time constant WarpSize.
// Device-side reduction
for (uint32_t curr = input_count; curr > 1;)
{
+    tmp::static_range_switch<std::array{32, 64}>(warp_size, [&]<int WarpSize>() noexcept
+    {
        hipLaunchKernelGGL(
-            kernel,
+            kernel<WarpSize>,
            dim3(new_size(curr)),
            dim3(block_size),
            factor * sizeof(unsigned),
            hipStreamDefault,
            front,
            back,
            kernel_op,
            zero_elem,
            curr);
+    });
    ...
}
Note
Neither RDNA- nor CDNA-based AMD hardware provides guaranteed independent progress to lanes of the same warp. When targeting NVIDIA hardware, lanes of a warp might execute somewhat independently as long as the programmer assists the compiler using dedicated built-in functions. This feature is called Independent Thread Scheduling. The HIP headers don’t expose the necessary warp primitives and their overloads.
Portable applications can still tap into this feature with carefully #ifdef -ed code, but at this particular optimization level, it’s a requirement. The code implicitly relies on the lockstep behavior of an ROCm wavefront, but CUDA warps don’t share this property. You must synchronize all the active lanes of a warp to avoid a data race with some lanes progressing faster than others in the same warp.
Unroll all loops#
While the previous step primarily aims to remove unnecessary syncing, it also unrolls the end of the loop. However, you could also force unrolling the first part of the loop. This saves a few scalar registers (values the compiler can prove to be uniform across warps).
-template<uint32_t WarpSize, typename T, typename F>
-__global__ void kernel(
+template<uint32_t BlockSize, uint32_t WarpSize, typename T, typename F>
+__global__ __launch_bounds__(BlockSize) void kernel(
    T* front,
    T* back,
    F op,
    T zero_elem,
    uint32_t front_size)
{
-    extern __shared__ T shared[];
+    __shared__ T shared[BlockSize];
    ...
    // Shared reduction
-    for (uint32_t i = blockDim.x / 2; i > WarpSize; i /= 2)
+    tmp::static_for<BlockSize / 2, tmp::greater_than<WarpSize>, tmp::divide<2>>([&]<int I>()
    {
-        if (tid < i)
-            shared[tid] = op(shared[tid], shared[tid + i]);
+        if (tid < I)
+            shared[tid] = op(shared[tid], shared[tid + I]);
        __syncthreads();
    }
+    );
Introducing yet another template argument for the kernel and moving from for to tmp::static_for leads to the following two notable improvements:
- Introducing new attribute - __launch_bounds__(BlockSize)to the kernel instructs the compiler that the kernel will only be launched using the designated block size. This implies that the launches of differing block sizes will fail. This allows the optimizer to enroll the- blockDim.xvariable in constant folding as well as get information about register usage.
- Turning the block size into a compile-time constant allows you to statically allocate the shared memory. 
Communicate using warp-collective functions#
Shared memory provides a fast communication path within a block, however when performing reduction within the last warp, you can use faster means of communication, which is warp-collective or cross-lane functions. Instead of using the hardware-backed shared memory, you can directly copy between the local memory (registers) of each lane in a warp. This can be achieve using the shuffle functions.
See how to use __shfl_down(), which is one of the most restrictive but also the most structured communication schemes.
// Warp reduction
if (tid < WarpSize)
{
    T res = op(shared[tid], shared[tid + WarpSize]);
    tmp::static_for<WarpSize / 2, tmp::not_equal<0>, tmp::divide<2>>([&]<int Delta>()
    {
        res = op(res, __shfl_down(res, Delta));
    });
    // Write result from shared to back buffer
    if (tid == 0)
        back[bid] = res;
}
Using warp-collective functions for communication requires the control flow to be uniform across warps, as the name warp-collective implies. Therefore, you can see that the thread ID is being checked outside the loop, but the result is written inside due to variable scoping.
Amortize bookkeeping variable overhead#
The previous sections explained how to reduce register usage to improve occupancy. This allows more blocks to execute in parallel on all multiprocessors, leading to more global store/load latency to be hidden. Reducing the number of kernels in flight while still carrying out the same workload reduces the wastage of registers while loading and maintaining bookkeeping variables such as kernel indices.
An example of this optimization is performing one binary op while loading input from global. Even though the operation is said to be carried out “in flight”, the two values are loaded into local memory (registers) before op is called.
A more general form of this optimization is wrapping most kernel logic in loops that carry out the workload of multiple kernel instances but require storing only a single instance of most of the bookkeeping logic. In code, this multiplicity factor is referred to via the ItemsPerThread compile-time constant, which is supplied by a template argument to allow for loop unrolling.
This kernel variant utilizes another generally applicable utility known as hip::static_array, which is a more restrictive wrapper over the builtin array than std::array, as it allows indexing only compile-time constants using the usual tuple-like template <size_t I> auto get<I>(...) interface.
Note
On a GPU, there is no stack, and the local memory is provisioned from the register file. This provisioning takes place statically. To paraphrase, the address range of a thread’s local memory is determined at compile-time. When an array is defined and used in the local storage, the compiler can only maintain its storage in the register file as long as all accesses to the array are computable by the compiler at compile-time. It doesn’t need to be a compile-time constant as long as the compiler can resolve the addresses of the accesses through constant folding or some other means. If the compiler fails to do so, the array will be backed by global memory, which is indicated by allocating a non-zero number of spill registers observable using static analysis tools. However, this is slower by the magnitude of multiple order. hip::static_array via its hip::get<> interface ensures that no such spills occur.
template<uint32_t BlockSize, uint32_t WarpSize, uint32_t ItemsPerThread>
__global__ static __launch_bounds__(BlockSize) void kernel(...)
The kernel now has three compile-time configurable parameters. The only part of the kernel that changes depends on how you load data from global and perform the binary operation on those loaded values. So, the following step to read input from front buffer to global is now split into two steps: reading ``ItemsPerThread` <reading-items>`and processing ``ItemsPerThread` <processing-items>`.
// Read input from front buffer to local
T res = op(read_global_safe(gid), read_global_safe(gid + blockDim.x));
Reading ItemsPerThread#
The change to reading happens inside read_global_safe:
auto read_global_safe = [&](const int32_t i) -> hip::static_array<T, ItemsPerThread>
{
    return [&]<int32_t... I>(std::integer_sequence<int32_t, I...>)
    {
        if(i + ItemsPerThread < front_size)
            return hip::static_array<T, ItemsPerThread>{
                front[i + I]...
            };
        else
            return hip::static_array<T, ItemsPerThread>{
                (i + I < front_size ? front[i + I] : zero_elem)...
            };
    }(std::make_integer_sequence<int32_t, ItemsPerThread>());
};
Note that each array element is being loaded consecutively without the flexibility of a configurable ItemsPerThread property. This is morally equivalent to:
T arr[4] = {
    front[gid + 0],
    front[gid + 1],
    front[gid + 2],
    front[gid + 3]
}
This is exactly what’s happening in the front[i + I]... fold-expression. However, this can only be issued if the entire read operates on real input without padding using zero_elem. If some reads over-index the input, the read turns into:
T arr[4] = {
    i + 0 < front_size ? front[i + 0] : zero_elem,
    i + 1 < front_size ? front[i + 1] : zero_elem,
    i + 2 < front_size ? front[i + 2] : zero_elem,
    i + 3 < front_size ? front[i + 3] : zero_elem
}
This makes it easier for the compiler to recognize vector loads from global. As the performance at large is dominated by how you move the data, it’s only natural to utilize dedicated instructions to move more data with less binary. This is evident by the huge performance improvement when loading two values per thread. For more information, see the compiler explorer to learn how loading for AMD (both RDNA and CDNA) compiles to global_load_dwordx4, where x4 denotes the 4-vector variant of the instruction.
Note
Note that read_global_safe, which used to take an uint32_t as the index type, now takes a signed integer. When indexing an array with unsigned integers, the compiler has to handle integer overflows, as the C/C++ standards defined them. It might happen that some part of the vector load indices overflow, thus resulting in a non-contiguous read. If you change the previously linked code to use an unsigned integer as the thread ID, the compiler won’t emit a vector load. Signed integer overflow is an undefined behavior, and hence, unknown to the optimizer. To convey the absence of overflow to the compiler with unsigned indices, add __builtin_assume(gid + 4 > gid), or the more portable [[assume]](gid + 4 > gid), once amdclang++ supports it.
read_global_safe implementation is an Immediately Invoked Lambda Expression (IILE), because ItemsPerThread is an integer value, while you need a compile-time iota-like sequence of integers as a pack for the fold-expressions to expand on. This can only occur as part of template argument deduction on the IILE.
Processing ItemsPerThread#
Once the kernel reads ItemsPerThread number of inputs to local, it immediately reduces them to a scalar. There is no reason to propagate the input element multiplicity to the warp reduction phase.
T res = [&]()
{
    // Read input from front buffer to local
    hip::static_array<T, ItemsPerThread> arr = read_global_safe(gid);
    // Reduce ItemsPerThread to scalar
    tmp::static_for<1, tmp::less_than<ItemsPerThread>, tmp::increment<1>>([&]<int I>()
    {
        get<0>(arr) = op(get<0>(arr), get<I>(arr));
    });
    return get<0>(arr);
}();
Two-pass reduction#
Alter kernel launch and input fetching such that no more blocks are launched than what a subsequent kernel launch’s single block can conveniently reduce, while performing multiple passes of input reading from global and combining their results before engaging in the end game tree-like reduction.
With this method, you can save at least one to two kernel launches for large inputs.
Conclusion#
Optimizing code on GPUs, like on any other architecture, requires careful consideration and balancing of resources and costs of various operations to obtain optimal performance. This document explored optimizing reductions much beyond the territory of diminishing returns. This approach introduced multiple optimization techniques and discussed opportunities.
The document focused on reductions when an entire device participates in it. Still, the choice of optimal compile-time constants or even the algorithm itself might not be optimal when its multiple blocks participate in multiple parallel reductions or when each thread performs its reduction. However, when multiple devices participate in the same reduction, other aspects must be considered.
Most solutions, including the ones covered in this document, are given to the end users in a turnkey fashion via algorithm primitive libraries. These solutions might not be the fastest in all cases, but they are close to being the gold standard for carrying out certain operations as reasonably as possible.