Reduce#
Class#
-
template<class T, unsigned int BlockSizeX, block_reduce_algorithm Algorithm = block_reduce_algorithm::default_algorithm, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
class block_reduce# The block_reduce class is a block level parallel primitive which provides methods for performing reductions operations on items partitioned across threads in a block.
- Overview
Supports non-commutative reduce operators. However, a reduce operator should be associative. When used with non-associative functions the results may be non-deterministic and/or vary in precision.
Computation can more efficient when:
ItemsPerThread
is greater than one,T
is an arithmetic type,reduce operation is simple addition operator, and
the number of threads in the block is a multiple of the hardware warp size (see rocprim::device_warp_size()).
block_reduce has two alternative implementations:
block_reduce_algorithm::using_warp_reduce
, block_reduce_algorithm::raking_reduce and block_reduce_algorithm::raking_reduce_commutative_only.If the block sizes less than 64 only one warp reduction is used. The block reduction algorithm stores the result only in the first thread(lane_id = 0 warp_id = 0), when the block size is larger then the warp size.
- Examples
In the examples reduce operation is performed on block of 192 threads, each provides one
int
value, result is returned using the same variable as for input.__global__ void example_kernel(...) { // specialize warp_reduce for int and logical warp of 192 threads using block_reduce_int = rocprim::block_reduce<int, 192>; // allocate storage in shared memory __shared__ block_reduce_int::storage_type storage; int value = ...; // execute reduce block_reduce_int().reduce( value, // input value, // output storage ); ... }
- Template Parameters:
T – - the input/output type.
BlockSize – - the number of threads in a block.
Algorithm – - selected reduce algorithm, block_reduce_algorithm::default_algorithm by default.
Public Types
-
using storage_type = typename base_type::storage_type#
Struct used to allocate a temporary memory that is required for thread communication during operations provided by related parallel primitive.
Depending on the implemention the operations exposed by parallel primitive may require a temporary storage for thread communication. The storage should be allocated using keywords
. It can be aliased to an externally allocated memory, or be a part of a union type with other storage types to increase shared memory reusability.
Public Functions
-
template<class BinaryFunction = ::rocprim::plus<T>>
__device__ inline void reduce(T input, T &output, storage_type &storage, BinaryFunction reduce_op = BinaryFunction())# Performs reduction across threads in a block.
- Storage reusage
Synchronization barrier should be placed before
storage
is reused or repurposed:__syncthreads()
orrocprim::syncthreads()
.- Examples
The examples present min reduce operations performed on a block of 256 threads, each provides one
float
value.__global__ void example_kernel(...) // blockDim.x = 256 { // specialize block_reduce for float and block of 256 threads using block_reduce_f = rocprim::block_reduce<float, 256>; // allocate storage in shared memory for the block __shared__ block_reduce_float::storage_type storage; float input = ...; float output; // execute min reduce block_reduce_float().reduce( input, output, storage, rocprim::minimum<float>() ); ... }
If the
input
values across threads in a block are{1, -2, 3, -4, ..., 255, -256}
, thenoutput
value will be{-256}
.
- Template Parameters:
BinaryFunction – - type of binary function used for reduce. Default type is rocprim::plus<T>.
- Parameters:
input – [in] - thread input value.
output – [out] - reference to a thread output value. May be aliased with
input
.storage – [in] - reference to a temporary storage object of type storage_type.
reduce_op – [in] - binary operation function object that will be used for reduce. The signature of the function should be equivalent to the following:
T f(const T &a, const T &b);
. The signature does not need to haveconst &
, but function object must not modify the objects passed to it.
-
template<class BinaryFunction = ::rocprim::plus<T>>
__device__ inline void reduce(T input, T &output, BinaryFunction reduce_op = BinaryFunction())# Performs reduction across threads in a block.
This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
This overload does not accept storage argument. Required shared memory is allocated by the method itself.
- Template Parameters:
BinaryFunction – - type of binary function used for reduce. Default type is rocprim::plus<T>.
- Parameters:
input – [in] - thread input value.
output – [out] - reference to a thread output value. May be aliased with
input
.reduce_op – [in] - binary operation function object that will be used for reduce. The signature of the function should be equivalent to the following:
T f(const T &a, const T &b);
. The signature does not need to haveconst &
, but function object must not modify the objects passed to it.
-
template<unsigned int ItemsPerThread, class BinaryFunction = ::rocprim::plus<T>>
__device__ inline void reduce(T (&input)[ItemsPerThread], T &output, storage_type &storage, BinaryFunction reduce_op = BinaryFunction())# Performs reduction across threads in a block.
- Storage reusage
Synchronization barrier should be placed before
storage
is reused or repurposed:__syncthreads()
orrocprim::syncthreads()
.- Examples
The examples present maximum reduce operations performed on a block of 128 threads, each provides two
long
value.__global__ void example_kernel(...) // blockDim.x = 128 { // specialize block_reduce for long and block of 128 threads using block_reduce_f = rocprim::block_reduce<long, 128>; // allocate storage in shared memory for the block __shared__ block_reduce_long::storage_type storage; long input[2] = ...; long output[2]; // execute max reduce block_reduce_long().reduce( input, output, storage, rocprim::maximum<long>() ); ... }
If the
input
values across threads in a block are{-1, 2, -3, 4, ..., -255, 256}
, thenoutput
value will be{256}
.
- Template Parameters:
ItemsPerThread – - number of items in the
input
array.BinaryFunction – - type of binary function used for reduce. Default type is rocprim::plus<T>.
- Parameters:
input – [in] - reference to an array containing thread input values.
output – [out] - reference to a thread output array. May be aliased with
input
.storage – [in] - reference to a temporary storage object of type storage_type.
reduce_op – [in] - binary operation function object that will be used for reduce. The signature of the function should be equivalent to the following:
T f(const T &a, const T &b);
. The signature does not need to haveconst &
, but function object must not modify the objects passed to it.
-
template<unsigned int ItemsPerThread, class BinaryFunction = ::rocprim::plus<T>>
__device__ inline void reduce(T (&input)[ItemsPerThread], T &output, BinaryFunction reduce_op = BinaryFunction())# Performs reduction across threads in a block.
This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
This overload does not accept storage argument. Required shared memory is allocated by the method itself.
- Template Parameters:
ItemsPerThread – - number of items in the
input
array.BinaryFunction – - type of binary function used for reduce. Default type is rocprim::plus<T>.
- Parameters:
input – [in] - reference to an array containing thread input values.
output – [out] - reference to a thread output array. May be aliased with
input
.reduce_op – [in] - binary operation function object that will be used for reduce. The signature of the function should be equivalent to the following:
T f(const T &a, const T &b);
. The signature does not need to haveconst &
, but function object must not modify the objects passed to it.
-
template<class BinaryFunction = ::rocprim::plus<T>>
__device__ inline void reduce(T input, T &output, unsigned int valid_items, storage_type &storage, BinaryFunction reduce_op = BinaryFunction())# Performs reduction across threads in a block.
- Storage reusage
Synchronization barrier should be placed before
storage
is reused or repurposed:__syncthreads()
orrocprim::syncthreads()
.- Examples
The examples present min reduce operations performed on a block of 256 threads, each provides one
float
value.__global__ void example_kernel(...) // blockDim.x = 256 { // specialize block_reduce for float and block of 256 threads using block_reduce_f = rocprim::block_reduce<float, 256>; // allocate storage in shared memory for the block __shared__ block_reduce_float::storage_type storage; float input = ...; unsigned int valid_items = 250; float output; // execute min reduce block_reduce_float().reduce( input, output, valid_items, storage, rocprim::minimum<float>() ); ... }
- Template Parameters:
BinaryFunction – - type of binary function used for reduce. Default type is rocprim::plus<T>.
- Parameters:
input – [in] - thread input value.
output – [out] - reference to a thread output value. May be aliased with
input
.valid_items – [in] - number of items that will be reduced in the block.
storage – [in] - reference to a temporary storage object of type storage_type.
reduce_op – [in] - binary operation function object that will be used for reduce. The signature of the function should be equivalent to the following:
T f(const T &a, const T &b);
. The signature does not need to haveconst &
, but function object must not modify the objects passed to it.
-
template<class BinaryFunction = ::rocprim::plus<T>>
__device__ inline void reduce(T input, T &output, unsigned int valid_items, BinaryFunction reduce_op = BinaryFunction())# Performs reduction across threads in a block.
This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
This overload does not accept storage argument. Required shared memory is allocated by the method itself.
- Template Parameters:
ItemsPerThread – - number of items in the
input
array.BinaryFunction – - type of binary function used for reduce. Default type is rocprim::plus<T>.
- Parameters:
input – [in] - reference to an array containing thread input values.
output – [out] - reference to a thread output array. May be aliased with
input
.valid_items – [in] - number of items that will be reduced in the block.
reduce_op – [in] - binary operation function object that will be used for reduce. The signature of the function should be equivalent to the following:
T f(const T &a, const T &b);
. The signature does not need to haveconst &
, but function object must not modify the objects passed to it.
Algorithms#
-
enum class rocprim::block_reduce_algorithm#
Available algorithms for block_reduce primitive.
Values:
-
enumerator using_warp_reduce#
A warp_reduce based algorithm.
-
enumerator raking_reduce#
An algorithm which limits calculations to a single hardware warp.
-
enumerator raking_reduce_commutative_only#
raking reduce that supports only commutative operators
-
enumerator default_algorithm#
Default block_reduce algorithm.
-
enumerator using_warp_reduce#