Reduce#
-
template<class T, unsigned int WarpSize = device_warp_size(), bool UseAllReduce = false>
class warp_reduce# The warp_reduce class is a warp level parallel primitive which provides methods for performing reduction operations on items partitioned across threads in a hardware warp.
- Overview
WarpSize
must be equal to or less than the size of hardware warp (see rocprim::device_warp_size()). If it is less, reduce is performed separately within groups determined by WarpSize.
For example, if
WarpSize
is 4, hardware warp is 64, reduction will be performed in logical warps grouped like this:{ {0, 1, 2, 3}, {4, 5, 6, 7 }, ..., {60, 61, 62, 63} }
(thread is represented here by its id within hardware warp).Logical warp is a group of
WarpSize
consecutive threads from the same hardware warp.Supports non-commutative reduce operators. However, a reduce operator should be associative. When used with non-associative functions the results may be non-deterministic and/or vary in precision.
Number of threads executing warp_reduce’s function must be a multiple of
WarpSize
;All threads from a logical warp must be in the same hardware warp.
- Examples
In the examples reduce operation is performed on groups of 16 threads, each provides one
int
value, result is returned using the same variable as for input. Hardware warp size is 64. Block (tile) size is 64.__global__ void example_kernel(...) { // specialize warp_reduce for int and logical warp of 16 threads using warp_reduce_int = rocprim::warp_reduce<int, 16>; // allocate storage in shared memory __shared__ warp_reduce_int::storage_type temp[4]; int logical_warp_id = threadIdx.x/16; int value = ...; // execute reduce warp_reduce_int().reduce( value, // input value, // output temp[logical_warp_id] ); ... }
- Template Parameters:
T – - the input/output type.
WarpSize – - the size of logical warp size, which can be equal to or less than the size of hardware warp (see rocprim::device_warp_size()). Reduce operations are performed separately within groups determined by WarpSize.
UseAllReduce – - input parameter to determine whether to broadcast final reduction value to all threads (default is false).
Public Types
-
using storage_type = typename base_type::storage_type#
Struct used to allocate a temporary memory that is required for thread communication during operations provided by related parallel primitive.
Depending on the implemention the operations exposed by parallel primitive may require a temporary storage for thread communication. The storage should be allocated using keywords
. It can be aliased to an externally allocated memory, or be a part of a union type with other storage types to increase shared memory reusability.
Public Functions
-
template<class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize>
__device__ inline auto reduce(T input, T &output, storage_type &storage, BinaryFunction reduce_op = BinaryFunction()) -> typename std::enable_if<(FunctionWarpSize <= device_warp_size()), void>::type# Performs reduction across threads in a logical warp.
- Storage reusage
Synchronization barrier should be placed before
storage
is reused or repurposed:__syncthreads()
orrocprim::syncthreads()
.- Examples
In the examples reduce operation is performed on groups of 16 threads, each provides one
int
value, result is returned using the same variable as for input. Hardware warp size is 64. Block (tile) size is 64.__global__ void example_kernel(...) { // specialize warp_reduce for int and logical warp of 16 threads using warp_reduce_int = rocprim::warp_reduce<int, 16>; // allocate storage in shared memory __shared__ warp_reduce_int::storage_type temp[4]; int logical_warp_id = threadIdx.x/16; int value = ...; // execute reduction warp_reduce_int().reduce( value, // input value, // output temp[logical_warp_id], rocprim::minimum<float>() ); ... }
- Template Parameters:
BinaryFunction – - type of binary function used for reduce. Default type is rocprim::plus<T>.
- Parameters:
input – [in] - thread input value.
output – [out] - reference to a thread output value. May be aliased with
input
.storage – [in] - reference to a temporary storage object of type storage_type.
reduce_op – [in] - binary operation function object that will be used for reduce. The signature of the function should be equivalent to the following:
T f(const T &a, const T &b);
. The signature does not need to haveconst &
, but function object must not modify the objects passed to it.
-
template<class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize>
__device__ inline auto reduce(T, T&, storage_type&, BinaryFunction reduce_op = BinaryFunction()) -> typename std::enable_if<(FunctionWarpSize > device_warp_size()), void>::type# Performs reduction across threads in a logical warp. Invalid Warp Size.
-
template<class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize>
__device__ inline auto reduce(T input, T &output, int valid_items, storage_type &storage, BinaryFunction reduce_op = BinaryFunction()) -> typename std::enable_if<(FunctionWarpSize <= device_warp_size()), void>::type# Performs reduction across threads in a logical warp.
- Storage reusage
Synchronization barrier should be placed before
storage
is reused or repurposed:__syncthreads()
orrocprim::syncthreads()
.- Examples
In the examples reduce operation is performed on groups of 16 threads, each provides one
int
value, result is returned using the same variable as for input. Hardware warp size is 64. Block (tile) size is 64.__global__ void example_kernel(...) { // specialize warp_reduce for int and logical warp of 16 threads using warp_reduce_int = rocprim::warp_reduce<int, 16>; // allocate storage in shared memory __shared__ warp_reduce_int::storage_type temp[4]; int logical_warp_id = threadIdx.x/16; int value = ...; int valid_items = 4; // execute reduction warp_reduce_int().reduce( value, // input value, // output valid_items, temp[logical_warp_id] ); ... }
- Template Parameters:
BinaryFunction – - type of binary function used for reduce. Default type is rocprim::plus<T>.
- Parameters:
input – [in] - thread input value.
output – [out] - reference to a thread output value. May be aliased with
input
.valid_items – [in] - number of items that will be reduced in the warp.
storage – [in] - reference to a temporary storage object of type storage_type.
reduce_op – [in] - binary operation function object that will be used for reduce. The signature of the function should be equivalent to the following:
T f(const T &a, const T &b);
. The signature does not need to haveconst &
, but function object must not modify the objects passed to it.
-
template<class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize>
__device__ inline auto reduce(T, T&, int, storage_type&, BinaryFunction reduce_op = BinaryFunction()) -> typename std::enable_if<(FunctionWarpSize > device_warp_size()), void>::type# Performs reduction across threads in a logical warp. Invalid Warp Size.
-
template<class Flag, class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize>
__device__ inline auto head_segmented_reduce(T input, T &output, Flag flag, storage_type &storage, BinaryFunction reduce_op = BinaryFunction()) -> typename std::enable_if<(FunctionWarpSize <= device_warp_size()), void>::type# Performs head-segmented reduction across threads in a logical warp.
- Storage reusage
Synchronization barrier should be placed before
storage
is reused or repurposed:__syncthreads()
orrocprim::syncthreads()
.
- Template Parameters:
Flag – - type of head flags. Must be contextually convertible to
bool
.BinaryFunction – - type of binary function used for reduce. Default type is rocprim::plus<T>.
- Parameters:
input – [in] - thread input value.
output – [out] - reference to a thread output value. May be aliased with
input
.flag – [in] - thread head flag,
true
flags mark beginnings of segments.storage – [in] - reference to a temporary storage object of type storage_type.
reduce_op – [in] - binary operation function object that will be used for reduce. The signature of the function should be equivalent to the following:
T f(const T &a, const T &b);
. The signature does not need to haveconst &
, but function object must not modify the objects passed to it.
-
template<class Flag, class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize>
__device__ inline auto head_segmented_reduce(T, T&, Flag, storage_type&, BinaryFunction reduce_op = BinaryFunction()) -> typename std::enable_if<(FunctionWarpSize > device_warp_size()), void>::type# Performs head-segmented reduction across threads in a logical warp. Invalid Warp Size.
-
template<class Flag, class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize>
__device__ inline auto tail_segmented_reduce(T input, T &output, Flag flag, storage_type &storage, BinaryFunction reduce_op = BinaryFunction()) -> typename std::enable_if<(FunctionWarpSize <= device_warp_size()), void>::type# Performs tail-segmented reduction across threads in a logical warp.
- Storage reusage
Synchronization barrier should be placed before
storage
is reused or repurposed:__syncthreads()
orrocprim::syncthreads()
.
- Template Parameters:
Flag – - type of tail flags. Must be contextually convertible to
bool
.BinaryFunction – - type of binary function used for reduce. Default type is rocprim::plus<T>.
- Parameters:
input – [in] - thread input value.
output – [out] - reference to a thread output value. May be aliased with
input
.flag – [in] - thread tail flag,
true
flags mark ends of segments.storage – [in] - reference to a temporary storage object of type storage_type.
reduce_op – [in] - binary operation function object that will be used for reduce. The signature of the function should be equivalent to the following:
T f(const T &a, const T &b);
. The signature does not need to haveconst &
, but function object must not modify the objects passed to it.
-
template<class Flag, class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize>
__device__ inline auto tail_segmented_reduce(T, T&, Flag, storage_type&, BinaryFunction reduce_op = BinaryFunction()) -> typename std::enable_if<(FunctionWarpSize > device_warp_size()), void>::type# Performs tail-segmented reduction across threads in a logical warp. Invalid Warp Size.