Reduce#
- 
template<class T, unsigned int WarpSize = device_warp_size(), bool UseAllReduce = false>
 class warp_reduce#
- The warp_reduce class is a warp level parallel primitive which provides methods for performing reduction operations on items partitioned across threads in a hardware warp. - Overview
- WarpSizemust be equal to or less than the size of hardware warp (see rocprim::device_warp_size()- ). If it is less, reduce is performed separately within groups determined by WarpSize. - For example, if - WarpSizeis 4, hardware warp is 64, reduction will be performed in logical warps grouped like this:- { {0, 1, 2, 3}, {4, 5, 6, 7 }, ..., {60, 61, 62, 63} }(thread is represented here by its id within hardware warp).
- Logical warp is a group of - WarpSizeconsecutive threads from the same hardware warp.
- Supports non-commutative reduce operators. However, a reduce operator should be associative. When used with non-associative functions the results may be non-deterministic and/or vary in precision. 
- Number of threads executing warp_reduce’s function must be a multiple of - WarpSize;
- All threads from a logical warp must be in the same hardware warp. 
 
- Examples
- In the examples reduce operation is performed on groups of 16 threads, each provides one - intvalue, result is returned using the same variable as for input. Hardware warp size is 64. Block (tile) size is 64.- __global__ void example_kernel(...) { // specialize warp_reduce for int and logical warp of 16 threads using warp_reduce_int = rocprim::warp_reduce<int, 16>; // allocate storage in shared memory __shared__ warp_reduce_int::storage_type temp[4]; int logical_warp_id = threadIdx.x/16; int value = ...; // execute reduce warp_reduce_int().reduce( value, // input value, // output temp[logical_warp_id] ); ... } 
 - Template Parameters:
- T – - the input/output type. 
- WarpSize – - the size of logical warp size, which can be equal to or less than the size of hardware warp (see rocprim::device_warp_size()). Reduce operations are performed separately within groups determined by WarpSize. 
- UseAllReduce – - input parameter to determine whether to broadcast final reduction value to all threads (default is false). 
 
 - Public Types - 
using storage_type = typename base_type::storage_type#
- Struct used to allocate a temporary memory that is required for thread communication during operations provided by related parallel primitive. - Depending on the implemention the operations exposed by parallel primitive may require a temporary storage for thread communication. The storage should be allocated using keywords - . It can be aliased to an externally allocated memory, or be a part of a union type with other storage types to increase shared memory reusability.
 - Public Functions - 
template<class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize>
 __device__ inline auto reduce(T input, T &output, storage_type &storage, BinaryFunction reduce_op = BinaryFunction()) -> typename std::enable_if<(FunctionWarpSize <= device_warp_size()), void>::type#
- Performs reduction across threads in a logical warp. - Storage reusage
- Synchronization barrier should be placed before - storageis reused or repurposed:- __syncthreads()or- rocprim::syncthreads().
- Examples
- In the examples reduce operation is performed on groups of 16 threads, each provides one - intvalue, result is returned using the same variable as for input. Hardware warp size is 64. Block (tile) size is 64.- __global__ void example_kernel(...) { // specialize warp_reduce for int and logical warp of 16 threads using warp_reduce_int = rocprim::warp_reduce<int, 16>; // allocate storage in shared memory __shared__ warp_reduce_int::storage_type temp[4]; int logical_warp_id = threadIdx.x/16; int value = ...; // execute reduction warp_reduce_int().reduce( value, // input value, // output temp[logical_warp_id], rocprim::minimum<float>() ); ... } 
 - Template Parameters:
- BinaryFunction – - type of binary function used for reduce. Default type is rocprim::plus<T>. 
- Parameters:
- input – [in] - thread input value. 
- output – [out] - reference to a thread output value. May be aliased with - input.
- storage – [in] - reference to a temporary storage object of type storage_type. 
- reduce_op – [in] - binary operation function object that will be used for reduce. The signature of the function should be equivalent to the following: - T f(const T &a, const T &b);. The signature does not need to have- const &, but function object must not modify the objects passed to it.
 
 
 - 
template<class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize>
 __device__ inline auto reduce(T, T&, storage_type&, BinaryFunction reduce_op = BinaryFunction()) -> typename std::enable_if<(FunctionWarpSize > device_warp_size()), void>::type#
- Performs reduction across threads in a logical warp. Invalid Warp Size. 
 - 
template<class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize>
 __device__ inline auto reduce(T input, T &output, int valid_items, storage_type &storage, BinaryFunction reduce_op = BinaryFunction()) -> typename std::enable_if<(FunctionWarpSize <= device_warp_size()), void>::type#
- Performs reduction across threads in a logical warp. - Storage reusage
- Synchronization barrier should be placed before - storageis reused or repurposed:- __syncthreads()or- rocprim::syncthreads().
- Examples
- In the examples reduce operation is performed on groups of 16 threads, each provides one - intvalue, result is returned using the same variable as for input. Hardware warp size is 64. Block (tile) size is 64.- __global__ void example_kernel(...) { // specialize warp_reduce for int and logical warp of 16 threads using warp_reduce_int = rocprim::warp_reduce<int, 16>; // allocate storage in shared memory __shared__ warp_reduce_int::storage_type temp[4]; int logical_warp_id = threadIdx.x/16; int value = ...; int valid_items = 4; // execute reduction warp_reduce_int().reduce( value, // input value, // output valid_items, temp[logical_warp_id] ); ... } 
 - Template Parameters:
- BinaryFunction – - type of binary function used for reduce. Default type is rocprim::plus<T>. 
- Parameters:
- input – [in] - thread input value. 
- output – [out] - reference to a thread output value. May be aliased with - input.
- valid_items – [in] - number of items that will be reduced in the warp. 
- storage – [in] - reference to a temporary storage object of type storage_type. 
- reduce_op – [in] - binary operation function object that will be used for reduce. The signature of the function should be equivalent to the following: - T f(const T &a, const T &b);. The signature does not need to have- const &, but function object must not modify the objects passed to it.
 
 
 - 
template<class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize>
 __device__ inline auto reduce(T, T&, int, storage_type&, BinaryFunction reduce_op = BinaryFunction()) -> typename std::enable_if<(FunctionWarpSize > device_warp_size()), void>::type#
- Performs reduction across threads in a logical warp. Invalid Warp Size. 
 - 
template<class Flag, class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize>
 __device__ inline auto head_segmented_reduce(T input, T &output, Flag flag, storage_type &storage, BinaryFunction reduce_op = BinaryFunction()) -> typename std::enable_if<(FunctionWarpSize <= device_warp_size()), void>::type#
- Performs head-segmented reduction across threads in a logical warp. - Storage reusage
- Synchronization barrier should be placed before - storageis reused or repurposed:- __syncthreads()or- rocprim::syncthreads().
 - Template Parameters:
- Flag – - type of head flags. Must be contextually convertible to - bool.
- BinaryFunction – - type of binary function used for reduce. Default type is rocprim::plus<T>. 
 
- Parameters:
- input – [in] - thread input value. 
- output – [out] - reference to a thread output value. May be aliased with - input.
- flag – [in] - thread head flag, - trueflags mark beginnings of segments.
- storage – [in] - reference to a temporary storage object of type storage_type. 
- reduce_op – [in] - binary operation function object that will be used for reduce. The signature of the function should be equivalent to the following: - T f(const T &a, const T &b);. The signature does not need to have- const &, but function object must not modify the objects passed to it.
 
 
 - 
template<class Flag, class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize>
 __device__ inline auto head_segmented_reduce(T, T&, Flag, storage_type&, BinaryFunction reduce_op = BinaryFunction()) -> typename std::enable_if<(FunctionWarpSize > device_warp_size()), void>::type#
- Performs head-segmented reduction across threads in a logical warp. Invalid Warp Size. 
 - 
template<class Flag, class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize>
 __device__ inline auto tail_segmented_reduce(T input, T &output, Flag flag, storage_type &storage, BinaryFunction reduce_op = BinaryFunction()) -> typename std::enable_if<(FunctionWarpSize <= device_warp_size()), void>::type#
- Performs tail-segmented reduction across threads in a logical warp. - Storage reusage
- Synchronization barrier should be placed before - storageis reused or repurposed:- __syncthreads()or- rocprim::syncthreads().
 - Template Parameters:
- Flag – - type of tail flags. Must be contextually convertible to - bool.
- BinaryFunction – - type of binary function used for reduce. Default type is rocprim::plus<T>. 
 
- Parameters:
- input – [in] - thread input value. 
- output – [out] - reference to a thread output value. May be aliased with - input.
- flag – [in] - thread tail flag, - trueflags mark ends of segments.
- storage – [in] - reference to a temporary storage object of type storage_type. 
- reduce_op – [in] - binary operation function object that will be used for reduce. The signature of the function should be equivalent to the following: - T f(const T &a, const T &b);. The signature does not need to have- const &, but function object must not modify the objects passed to it.
 
 
 - 
template<class Flag, class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize>
 __device__ inline auto tail_segmented_reduce(T, T&, Flag, storage_type&, BinaryFunction reduce_op = BinaryFunction()) -> typename std::enable_if<(FunctionWarpSize > device_warp_size()), void>::type#
- Performs tail-segmented reduction across threads in a logical warp. Invalid Warp Size.