Reduce

Reduce#

template<class T, unsigned int WarpSize = device_warp_size(), bool UseAllReduce = false> class warp_reduce#

The warp_reduce class is a warp level parallel primitive which provides methods for performing reduction operations on items partitioned across threads in a hardware warp.

Overview

WarpSize must be equal to or less than the size of hardware warp (see rocprim::device_warp_size()
). If it is less, reduce is performed separately within groups determined by WarpSize.

For example, if
WarpSize is 4, hardware warp is 64, reduction will be performed in logical warps grouped like this: { {0, 1, 2, 3}, {4, 5, 6, 7 }, ..., {60, 61, 62, 63} } (thread is represented here by its id within hardware warp).
Logical warp is a group of WarpSize consecutive threads from the same hardware warp.
Supports non-commutative reduce operators. However, a reduce operator should be associative. When used with non-associative functions the results may be non-deterministic and/or vary in precision.
Number of threads executing warp_reduce’s function must be a multiple of WarpSize;
All threads from a logical warp must be in the same hardware warp.

Examples

In the examples reduce operation is performed on groups of 16 threads, each provides one int value, result is returned using the same variable as for input. Hardware warp size is 64. Block (tile) size is 64.

__global__ void example_kernel(...)
{
    // specialize warp_reduce for int and logical warp of 16 threads
    using warp_reduce_int = rocprim::warp_reduce<int, 16>;
    // allocate storage in shared memory
    __shared__ warp_reduce_int::storage_type temp[4];

    int logical_warp_id = threadIdx.x/16;
    int value = ...;
    // execute reduce
    warp_reduce_int().reduce(
        value, // input
        value, // output
        temp[logical_warp_id]
    );
    ...
}

Template Parameters:

T – the input/output type.
WarpSize – the size of logical warp size, which can be equal to or less than the size of hardware warp (see rocprim::device_warp_size()). Reduce operations are performed separately within groups determined by WarpSize.
UseAllReduce – input parameter to determine whether to broadcast final reduction value to all threads (default is false).

Public Types

using storage_type = typename base_type::storage_type#

Struct used to allocate a temporary memory that is required for thread communication during operations provided by related parallel primitive.

Depending on the implemention the operations exposed by parallel primitive may require a temporary storage for thread communication. The storage should be allocated using keywords . It can be aliased to an externally allocated memory, or be a part of a union type with other storage types to increase shared memory reusability.

Public Functions

template<class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize> __device__ inline auto reduce(T input, T &output, storage_type &storage, BinaryFunction reduce_op = BinaryFunction()) -> typename std::enable_if<(FunctionWarpSize <= device_warp_size()), void>::type#

Performs reduction across threads in a logical warp.

Storage reusage

Synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().

Examples

In the examples reduce operation is performed on groups of 16 threads, each provides one int value, result is returned using the same variable as for input. Hardware warp size is 64. Block (tile) size is 64.

__global__ void example_kernel(...)
{
    // specialize warp_reduce for int and logical warp of 16 threads
    using warp_reduce_int = rocprim::warp_reduce<int, 16>;
    // allocate storage in shared memory
    __shared__ warp_reduce_int::storage_type temp[4];

    int logical_warp_id = threadIdx.x/16;
    int value = ...;
    // execute reduction
    warp_reduce_int().reduce(
        value, // input
        value, // output
        temp[logical_warp_id],
        rocprim::minimum<float>()
    );
    ...
}

Template Parameters:

BinaryFunction – type of binary function used for reduce. Default type is rocprim::plus<T>.

Parameters:

input – [in] thread input value.
output – [out] reference to a thread output value. May be aliased with input.
storage – [in] reference to a temporary storage object of type storage_type.
reduce_op – [in] binary operation function object that will be used for reduce. The signature of the function should be equivalent to the following: T f(const T &a, const T &b);. The signature does not need to have const &, but function object must not modify the objects passed to it.

template<class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize> __device__ inline auto reduce(T, T&, storage_type&, BinaryFunction reduce_op = BinaryFunction()) -> typename std::enable_if<(FunctionWarpSize > device_warp_size()), void>::type#: Performs reduction across threads in a logical warp. Invalid Warp Size.

template<class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize> __device__ inline auto reduce(T input, T &output, int valid_items, storage_type &storage, BinaryFunction reduce_op = BinaryFunction()) -> typename std::enable_if<(FunctionWarpSize <= device_warp_size()), void>::type#

Performs reduction across threads in a logical warp.

Storage reusage

Synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().

Examples

In the examples reduce operation is performed on groups of 16 threads, each provides one int value, result is returned using the same variable as for input. Hardware warp size is 64. Block (tile) size is 64.

__global__ void example_kernel(...)
{
    // specialize warp_reduce for int and logical warp of 16 threads
    using warp_reduce_int = rocprim::warp_reduce<int, 16>;
    // allocate storage in shared memory
    __shared__ warp_reduce_int::storage_type temp[4];

    int logical_warp_id = threadIdx.x/16;
    int value = ...;
    int valid_items = 4;
    // execute reduction
    warp_reduce_int().reduce(
        value, // input
        value, // output
        valid_items,
        temp[logical_warp_id]
    );
    ...
}

Template Parameters:

BinaryFunction – type of binary function used for reduce. Default type is rocprim::plus<T>.

Parameters:

input – [in] thread input value.
output – [out] reference to a thread output value. May be aliased with input.
valid_items – [in] number of items that will be reduced in the warp.
storage – [in] reference to a temporary storage object of type storage_type.
reduce_op – [in] binary operation function object that will be used for reduce. The signature of the function should be equivalent to the following: T f(const T &a, const T &b);. The signature does not need to have const &, but function object must not modify the objects passed to it.

template<class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize> __device__ inline auto reduce(T, T&, int, storage_type&, BinaryFunction reduce_op = BinaryFunction()) -> typename std::enable_if<(FunctionWarpSize > device_warp_size()), void>::type#: Performs reduction across threads in a logical warp. Invalid Warp Size.

template<class Flag, class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize> __device__ inline auto head_segmented_reduce(T input, T &output, Flag flag, storage_type &storage, BinaryFunction reduce_op = BinaryFunction()) -> typename std::enable_if<(FunctionWarpSize <= device_warp_size()), void>::type#

Performs head-segmented reduction across threads in a logical warp.

Storage reusage: Synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().

Template Parameters:

Flag – type of head flags. Must be contextually convertible to bool.
BinaryFunction – type of binary function used for reduce. Default type is rocprim::plus<T>.

Parameters:

input – [in] thread input value.
output – [out] reference to a thread output value. May be aliased with input.
flag – [in] thread head flag, true flags mark beginnings of segments.
storage – [in] reference to a temporary storage object of type storage_type.
reduce_op – [in] binary operation function object that will be used for reduce. The signature of the function should be equivalent to the following: T f(const T &a, const T &b);. The signature does not need to have const &, but function object must not modify the objects passed to it.

template<class Flag, class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize> __device__ inline auto head_segmented_reduce(T, T&, Flag, storage_type&, BinaryFunction reduce_op = BinaryFunction()) -> typename std::enable_if<(FunctionWarpSize > device_warp_size()), void>::type#: Performs head-segmented reduction across threads in a logical warp. Invalid Warp Size.

template<class Flag, class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize> __device__ inline auto tail_segmented_reduce(T input, T &output, Flag flag, storage_type &storage, BinaryFunction reduce_op = BinaryFunction()) -> typename std::enable_if<(FunctionWarpSize <= device_warp_size()), void>::type#

Performs tail-segmented reduction across threads in a logical warp.

Storage reusage: Synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().

Template Parameters:

Flag – type of tail flags. Must be contextually convertible to bool.
BinaryFunction – type of binary function used for reduce. Default type is rocprim::plus<T>.

Parameters:

input – [in] thread input value.
output – [out] reference to a thread output value. May be aliased with input.
flag – [in] thread tail flag, true flags mark ends of segments.
storage – [in] reference to a temporary storage object of type storage_type.
reduce_op – [in] binary operation function object that will be used for reduce. The signature of the function should be equivalent to the following: T f(const T &a, const T &b);. The signature does not need to have const &, but function object must not modify the objects passed to it.

template<class Flag, class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize> __device__ inline auto tail_segmented_reduce(T, T&, Flag, storage_type&, BinaryFunction reduce_op = BinaryFunction()) -> typename std::enable_if<(FunctionWarpSize > device_warp_size()), void>::type#: Performs tail-segmented reduction across threads in a logical warp. Invalid Warp Size.

Reduce

Contents

Reduce#