Reduce
Class
- 
template<class T, unsigned int BlockSizeX, block_reduce_algorithm Algorithm = block_reduce_algorithm::default_algorithm, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
 class block_reduce
- The block_reduce class is a block-level parallel primitive that provides methods for performing reduction operations on items partitioned across threads in a block.
- Overview
- Supports non-commutative reduce operators. However, a reduce operator should be associative. When used with non-associative functions the results may be non-deterministic and/or vary in precision. 
- Computation can be more efficient when:
- ItemsPerThread is greater than one,
- T is an arithmetic type,
- the reduce operation is a simple addition, and
- the number of threads in the block is a multiple of the hardware warp size (see rocprim::device_warp_size()). 
 
- block_reduce has three alternative implementations:
- block_reduce_algorithm::using_warp_reduce,
- block_reduce_algorithm::raking_reduce, and
- block_reduce_algorithm::raking_reduce_commutative_only.
- If the block size is less than 64, only one warp reduction is used. When the block size is larger than the warp size, the block reduction algorithm stores the result only in the first thread (lane_id = 0, warp_id = 0).
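The note on associativity can be illustrated on the host. The following is a minimal sketch, not rocPRIM code: `reduce_sequential` and `reduce_tree` are hypothetical helpers that apply float addition in two different orders, which is one way a parallel reduction can differ from a serial loop.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Left-to-right reduction: ((v0 + v1) + v2) + ...
inline float reduce_sequential(const std::vector<float>& values)
{
    assert(!values.empty());
    float acc = values[0];
    for (std::size_t i = 1; i < values.size(); ++i)
    {
        acc = acc + values[i];
    }
    return acc;
}

// Pairwise (tree-shaped) reduction: (v0 + v1) + (v2 + v3) + ...
// The operand order is preserved, only the grouping changes.
inline float reduce_tree(std::vector<float> values)
{
    assert(!values.empty());
    while (values.size() > 1)
    {
        std::vector<float> next;
        for (std::size_t i = 0; i + 1 < values.size(); i += 2)
        {
            next.push_back(values[i] + values[i + 1]);
        }
        if (values.size() % 2 != 0)
        {
            next.push_back(values.back()); // odd element is carried over
        }
        values = next;
    }
    return values[0];
}
```

With the input `{1e8f, 1.0f, -1e8f, 1.0f}` and IEEE 754 single precision arithmetic, the sequential order yields 1 while the pairwise order yields 0, because float addition is not associative. A block-level float sum is therefore not guaranteed to match a serial sum bit for bit.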
 
- Examples
- In the example, a reduce operation is performed on a block of 192 threads. Each thread provides one int value, and the result is returned in the same variable that was used for input.

    __global__ void example_kernel(...)
    {
        // specialize block_reduce for int and a block of 192 threads
        using block_reduce_int = rocprim::block_reduce<int, 192>;
        // allocate storage in shared memory
        __shared__ block_reduce_int::storage_type storage;

        int value = ...;
        // execute reduce
        block_reduce_int().reduce(
            value, // input
            value, // output
            storage
        );
        ...
    }
 - Template Parameters:
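- T – the input/output type.
- BlockSizeX – the number of threads in the x dimension of a block.
- Algorithm – the selected reduce algorithm, block_reduce_algorithm::default_algorithm by default.
- BlockSizeY – the number of threads in the y dimension of a block, 1 by default.
- BlockSizeZ – the number of threads in the z dimension of a block, 1 by default.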
 
 - Public Types - 
using storage_type = typename base_type::storage_type
- Struct used to allocate temporary memory that is required for thread communication during operations provided by the related parallel primitive.
- Depending on the implementation, the operations exposed by the parallel primitive may require temporary storage for thread communication. The storage should be allocated using the __shared__ keyword. It can be aliased to externally allocated memory, or be part of a union type with other storage types to increase shared memory reusability.
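The union technique mentioned above can be sketched with stand-in types. `reduce_storage_model` and `scan_storage_model` are hypothetical placeholders for the storage_type of two primitives, and their sizes are made up for illustration. In a kernel, the union would be declared in shared memory and a synchronization barrier would separate the two uses.

```cpp
#include <cstddef>

// Hypothetical stand-ins for the storage_type of two block-level primitives.
struct reduce_storage_model
{
    int data[64];
};

struct scan_storage_model
{
    int data[128];
};

// Both primitives share the same bytes; only one member is live at a time.
union shared_storage_model
{
    reduce_storage_model reduce;
    scan_storage_model   scan;
};

// For comparison: separate allocations occupy the sum of both sizes.
struct separate_storage_model
{
    reduce_storage_model reduce;
    scan_storage_model   scan;
};

// Number of bytes saved by sharing the storage through the union.
constexpr std::size_t bytes_saved()
{
    return sizeof(separate_storage_model) - sizeof(shared_storage_model);
}
```

In this sketch the union is as large as its largest member, so the saving equals the size of the smaller storage type.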
 - Public Functions - 
template<class BinaryFunction = ::rocprim::plus<T>>
 __device__ inline void reduce(T input, T &output, storage_type &storage, BinaryFunction reduce_op = BinaryFunction())
- Performs reduction across threads in a block.
- Storage reuse
- A synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().
- Examples
- The example presents a min reduce operation performed on a block of 256 threads, where each thread provides one float value.

    __global__ void example_kernel(...) // blockDim.x = 256
    {
        // specialize block_reduce for float and a block of 256 threads
        using block_reduce_float = rocprim::block_reduce<float, 256>;
        // allocate storage in shared memory for the block
        __shared__ block_reduce_float::storage_type storage;

        float input = ...;
        float output;
        // execute min reduce
        block_reduce_float().reduce(
            input,
            output,
            storage,
            rocprim::minimum<float>()
        );
        ...
    }

- If the input values across threads in a block are {1, -2, 3, -4, ..., 255, -256}, then the output value will be {-256}.
 - Template Parameters:
- BinaryFunction – type of the binary function used for reduce. The default type is rocprim::plus<T>.
- Parameters:
- input – [in] thread input value.
- output – [out] reference to a thread output value. May be aliased with input.
- storage – [in] reference to a temporary storage object of type storage_type.
- reduce_op – [in] binary operation function object that will be used for reduce. The signature of the function should be equivalent to the following: T f(const T &a, const T &b). The signature does not need to have const &, but the function object must not modify the objects passed to it.
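The reduce_op requirements can be illustrated with a user-defined function object. This is a minimal host-side sketch: `abs_max` and `host_reduce` are hypothetical names, and `host_reduce` is only a serial stand-in for the block-wide reduction. For device code, `operator()` would additionally need the `__device__` qualifier.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reduce operator: returns the operand with the larger magnitude
// and prefers the left operand on ties. It has the shape
// T f(const T &a, const T &b) and does not modify its arguments.
// It is associative, but not commutative when magnitudes are equal.
struct abs_max
{
    int operator()(const int& a, const int& b) const
    {
        // Widen before negating so that the most negative int is handled.
        const long long mag_a = a < 0 ? -static_cast<long long>(a) : a;
        const long long mag_b = b < 0 ? -static_cast<long long>(b) : b;
        return mag_a >= mag_b ? a : b;
    }
};

// Serial stand-in for a block reduce: folds the per-thread values in thread order.
template <class T, class BinaryFunction>
T host_reduce(const std::vector<T>& values, BinaryFunction reduce_op)
{
    assert(!values.empty());
    T result = values[0];
    for (std::size_t i = 1; i < values.size(); ++i)
    {
        result = reduce_op(result, values[i]);
    }
    return result;
}
```

For the values `{1, -7, 3, 7}` this fold returns -7: both -7 and 7 have the largest magnitude, and the operator keeps the one that appears first.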
 
 
 - 
template<class BinaryFunction = ::rocprim::plus<T>>
 __device__ inline void reduce(T input, T &output, BinaryFunction reduce_op = BinaryFunction())
- Performs reduction across threads in a block.
- This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
- This overload does not accept a storage argument. The required shared memory is allocated by the method itself.
 - Template Parameters:
- BinaryFunction – type of the binary function used for reduce. The default type is rocprim::plus<T>.
- Parameters:
- input – [in] thread input value.
- output – [out] reference to a thread output value. May be aliased with input.
- reduce_op – [in] binary operation function object that will be used for reduce. The signature of the function should be equivalent to the following: T f(const T &a, const T &b). The signature does not need to have const &, but the function object must not modify the objects passed to it.
 
 
 - 
template<unsigned int ItemsPerThread, class BinaryFunction = ::rocprim::plus<T>>
 __device__ inline void reduce(T (&input)[ItemsPerThread], T &output, storage_type &storage, BinaryFunction reduce_op = BinaryFunction())
- Performs reduction across threads in a block.
- Storage reuse
- A synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().
- Examples
- The example presents a maximum reduce operation performed on a block of 128 threads, where each thread provides two long values.

    __global__ void example_kernel(...) // blockDim.x = 128
    {
        // specialize block_reduce for long and a block of 128 threads
        using block_reduce_long = rocprim::block_reduce<long, 128>;
        // allocate storage in shared memory for the block
        __shared__ block_reduce_long::storage_type storage;

        long input[2] = ...;
        long output; // the reduce method writes a single T value
        // execute max reduce
        block_reduce_long().reduce(
            input,
            output,
            storage,
            rocprim::maximum<long>()
        );
        ...
    }

- If the input values across threads in a block are {-1, 2, -3, 4, ..., -255, 256}, then the output value will be {256}.
 - Template Parameters:
- ItemsPerThread – number of items in the input array.
- BinaryFunction – type of the binary function used for reduce. The default type is rocprim::plus<T>.
 
- Parameters:
- input – [in] reference to an array containing thread input values.
- output – [out] reference to a thread output value (the signature takes T &output).
- storage – [in] reference to a temporary storage object of type storage_type.
- reduce_op – [in] binary operation function object that will be used for reduce. The signature of the function should be equivalent to the following: T f(const T &a, const T &b). The signature does not need to have const &, but the function object must not modify the objects passed to it.
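The result of the ItemsPerThread overload can be pictured with a host-side model. The sketch below assumes a blocked arrangement (thread t holds items 2t and 2t+1) and a two-stage order in which each thread first reduces its own items and the per-thread partial results are then combined; both are assumptions made for illustration, not a description of the library's internals. `model_block_reduce_items`, `make_example_input`, and `max_long` are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Serial model of a block reduce where every thread owns ItemsPerThread items.
template <class T, std::size_t ItemsPerThread, class BinaryFunction>
T model_block_reduce_items(const std::vector<std::array<T, ItemsPerThread>>& per_thread,
                           BinaryFunction reduce_op)
{
    static_assert(ItemsPerThread > 0, "each thread needs at least one item");
    assert(!per_thread.empty());

    // Stage 1: each thread reduces its own items.
    std::vector<T> partials;
    partials.reserve(per_thread.size());
    for (const auto& items : per_thread)
    {
        T acc = items[0];
        for (std::size_t i = 1; i < ItemsPerThread; ++i)
        {
            acc = reduce_op(acc, items[i]);
        }
        partials.push_back(acc);
    }

    // Stage 2: the per-thread partial results are combined in thread order.
    T result = partials[0];
    for (std::size_t t = 1; t < partials.size(); ++t)
    {
        result = reduce_op(result, partials[t]);
    }
    return result;
}

// Builds {-1, 2, -3, 4, ..., -255, 256} for 128 threads with two items each.
inline std::vector<std::array<long, 2>> make_example_input()
{
    std::vector<std::array<long, 2>> per_thread(128);
    for (std::size_t t = 0; t < per_thread.size(); ++t)
    {
        for (std::size_t i = 0; i < 2; ++i)
        {
            const long k = static_cast<long>(2 * t + i + 1); // 1..256
            per_thread[t][i] = (k % 2 != 0) ? -k : k;        // odd values are negated
        }
    }
    return per_thread;
}

// Host stand-in for a maximum operator on long values.
struct max_long
{
    long operator()(const long& a, const long& b) const { return a < b ? b : a; }
};
```

Calling `model_block_reduce_items(make_example_input(), max_long{})` returns 256, which agrees with the example for this overload.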
 
 
 - 
template<unsigned int ItemsPerThread, class BinaryFunction = ::rocprim::plus<T>>
 __device__ inline void reduce(T (&input)[ItemsPerThread], T &output, BinaryFunction reduce_op = BinaryFunction())
- Performs reduction across threads in a block.
- This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
- This overload does not accept a storage argument. The required shared memory is allocated by the method itself.
 - Template Parameters:
- ItemsPerThread – number of items in the input array.
- BinaryFunction – type of the binary function used for reduce. The default type is rocprim::plus<T>.
 
- Parameters:
- input – [in] reference to an array containing thread input values.
- output – [out] reference to a thread output value (the signature takes T &output).
- reduce_op – [in] binary operation function object that will be used for reduce. The signature of the function should be equivalent to the following: T f(const T &a, const T &b). The signature does not need to have const &, but the function object must not modify the objects passed to it.
 
 
 - 
template<class BinaryFunction = ::rocprim::plus<T>>
 __device__ inline void reduce(T input, T &output, unsigned int valid_items, storage_type &storage, BinaryFunction reduce_op = BinaryFunction())
- Performs reduction across threads in a block.
- Storage reuse
- A synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().
- Examples
- The example presents a min reduce operation performed on a block of 256 threads, where each thread provides one float value and only 250 items are reduced.

    __global__ void example_kernel(...) // blockDim.x = 256
    {
        // specialize block_reduce for float and a block of 256 threads
        using block_reduce_float = rocprim::block_reduce<float, 256>;
        // allocate storage in shared memory for the block
        __shared__ block_reduce_float::storage_type storage;

        float input = ...;
        unsigned int valid_items = 250;
        float output;
        // execute min reduce
        block_reduce_float().reduce(
            input,
            output,
            valid_items,
            storage,
            rocprim::minimum<float>()
        );
        ...
    }
 - Template Parameters:
- BinaryFunction – type of the binary function used for reduce. The default type is rocprim::plus<T>.
- Parameters:
- input – [in] thread input value.
- output – [out] reference to a thread output value. May be aliased with input.
- valid_items – [in] number of items that will be reduced in the block.
- storage – [in] reference to a temporary storage object of type storage_type.
- reduce_op – [in] binary operation function object that will be used for reduce. The signature of the function should be equivalent to the following: T f(const T &a, const T &b). The signature does not need to have const &, but the function object must not modify the objects passed to it.
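The effect of valid_items can be sketched on the host. The model below assumes that the valid items are the inputs of the first valid_items threads in thread order; that reading, and the names `model_reduce_valid`, `alternating_inputs`, and `min_float`, are assumptions introduced for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Serial model: only the inputs of the first valid_items threads take part.
template <class T, class BinaryFunction>
T model_reduce_valid(const std::vector<T>& thread_inputs,
                     unsigned int          valid_items,
                     BinaryFunction        reduce_op)
{
    assert(valid_items >= 1 && valid_items <= thread_inputs.size());
    T result = thread_inputs[0];
    for (unsigned int i = 1; i < valid_items; ++i)
    {
        result = reduce_op(result, thread_inputs[i]);
    }
    return result;
}

// Builds {1, -2, 3, -4, ...} with one value per thread.
inline std::vector<float> alternating_inputs(unsigned int count)
{
    std::vector<float> values(count);
    for (unsigned int i = 0; i < count; ++i)
    {
        const float k = static_cast<float>(i + 1);
        values[i]     = ((i + 1) % 2 == 0) ? -k : k; // even values are negated
    }
    return values;
}

// Host stand-in for a minimum operator on float values.
struct min_float
{
    float operator()(const float& a, const float& b) const { return b < a ? b : a; }
};
```

Under this assumption, 256 inputs `{1, -2, 3, -4, ..., 255, -256}` with valid_items = 250 reduce to -250, because the values of the last six threads, including -256, are ignored.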
 
 
 - 
template<class BinaryFunction = ::rocprim::plus<T>>
 __device__ inline void reduce(T input, T &output, unsigned int valid_items, BinaryFunction reduce_op = BinaryFunction())
- Performs reduction across threads in a block.
- This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
- This overload does not accept a storage argument. The required shared memory is allocated by the method itself.
 - Template Parameters:
- BinaryFunction – type of the binary function used for reduce. The default type is rocprim::plus<T>.
 
- Parameters:
- input – [in] thread input value.
- output – [out] reference to a thread output value. May be aliased with input.
- valid_items – [in] number of items that will be reduced in the block.
- reduce_op – [in] binary operation function object that will be used for reduce. The signature of the function should be equivalent to the following: T f(const T &a, const T &b). The signature does not need to have const &, but the function object must not modify the objects passed to it.
 
 
 
Algorithms
- 
enum class rocprim::block_reduce_algorithm
- Available algorithms for the block_reduce primitive.
- Values:
- 
enumerator using_warp_reduce
- A warp_reduce based algorithm. 
 - 
enumerator raking_reduce
- An algorithm which limits calculations to a single hardware warp. 
 - 
enumerator raking_reduce_commutative_only
- A raking reduce that supports only commutative operators.
 - 
enumerator default_algorithm
- Default block_reduce algorithm. 
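The idea behind a raking reduction can be pictured with a conceptual host-side model. It assumes that each lane of a single warp first reduces a contiguous segment of the values written by the block, and that the per-lane partial results are then combined in lane order. This is an illustration under those assumptions, not rocPRIM's implementation; `model_raking_reduce` and `concat` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Conceptual model of a raking reduce over values stored by all threads of a block.
template <class T, class BinaryFunction>
T model_raking_reduce(const std::vector<T>& shared_values,
                      std::size_t           warp_size,
                      BinaryFunction        reduce_op)
{
    assert(!shared_values.empty() && warp_size > 0);
    const std::size_t count   = shared_values.size();
    const std::size_t segment = (count + warp_size - 1) / warp_size; // items per lane

    // Step 1: each lane rakes over its own contiguous segment.
    std::vector<T> partials;
    for (std::size_t lane = 0; lane < warp_size; ++lane)
    {
        const std::size_t begin = lane * segment;
        if (begin >= count)
        {
            break; // no values are left for the remaining lanes
        }
        const std::size_t end = (begin + segment < count) ? begin + segment : count;
        T acc = shared_values[begin];
        for (std::size_t i = begin + 1; i < end; ++i)
        {
            acc = reduce_op(acc, shared_values[i]);
        }
        partials.push_back(acc);
    }

    // Step 2: the per-lane partial results are combined in lane order.
    T result = partials[0];
    for (std::size_t lane = 1; lane < partials.size(); ++lane)
    {
        result = reduce_op(result, partials[lane]);
    }
    return result;
}

// String concatenation is associative but not commutative.
struct concat
{
    std::string operator()(const std::string& a, const std::string& b) const
    {
        return a + b;
    }
};
```

Because this model keeps the operands in their original order, concatenating `"a"` through `"h"` with a model warp size of 4 gives `"abcdefgh"`. An order-preserving scheme like this is what allows a non-commutative operator to be used, whereas a commutative-only variant is free to combine operands in any order.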
 