Reduce#
Configuring the kernel#
reduce#
-
template<unsigned int BlockSize = 256, unsigned int ItemsPerThread = 8, ::rocprim::block_reduce_algorithm BlockReduceMethod = ::rocprim::block_reduce_algorithm::default_algorithm, unsigned int SizeLimit = std::numeric_limits<unsigned int>::max()>
struct reduce_config : public rocprim::detail::reduce_config_params# Configuration of device-level reduce primitives.
- Template Parameters:
BlockSize – - number of threads in a block.
ItemsPerThread – - number of items processed by each thread.
BlockReduceMethod – - algorithm for block reduce.
SizeLimit – - limit on the number of items reduced by a single launch
Subclassed by rocprim::detail::default_reduce_config< arch, key_type, enable >
reduce_by_key#
-
template<unsigned int BlockSize, unsigned int ItemsPerThread, block_load_method LoadKeysMethod = block_load_method::block_load_transpose, block_load_method LoadValuesMethod = block_load_method::block_load_transpose, block_scan_algorithm ScanAlgorithm = block_scan_algorithm::using_warp_scan, unsigned int TilesPerBlock = 1, unsigned int SizeLimit = std::numeric_limits<unsigned int>::max()>
struct reduce_by_key_config : public rocprim::detail::reduce_by_key_config_params# Configuration of device-level reduce-by-key operation.
- Template Parameters:
BlockSize – number of threads in a block.
ItemsPerThread – number of items processed by each thread per tile.
LoadKeysMethod – method of loading keys
LoadValuesMethod – method of loading values
ScanAlgorithm – block level scan algorithm to use
TilesPerBlock – number of tiles (
BlockSize
*ItemsPerThread
items) to process per blockSizeLimit – limit on the number of items for a single reduce_by_key kernel launch.
reduce#
-
template<class Config = default_config, class InputIterator, class OutputIterator, class InitValueType, class BinaryFunction = ::rocprim::plus<typename std::iterator_traits<InputIterator>::value_type>>
inline hipError_t rocprim::reduce(void *temporary_storage, size_t &storage_size, InputIterator input, OutputIterator output, const InitValueType initial_value, const size_t size, BinaryFunction reduce_op = BinaryFunction(), const hipStream_t stream = 0, bool debug_synchronous = false)# Parallel reduction primitive for device level.
reduce function performs a device-wide reduction operation using binary
reduce_op
operator.- Overview
Does not support non-commutative reduction operators. Reduction operator should also be associative. When used with non-associative functions the results may be non-deterministic and/or vary in precision.
Returns the required size of
temporary_storage
instorage_size
iftemporary_storage
in a null pointer.Ranges specified by
input
must have at leastsize
elements, whileoutput
only needs one element.By default, the input type is used for accumulation. A custom type can be specified using
rocprim::transform_iterator
, see the example below.
- Example
In this example a device-level min-reduction operation is performed on an array of integer values (
short
s are reduced intoint
s) using custom operator.#include <rocprim/rocprim.hpp> // custom reduce function auto min_op = [] __device__ (int a, int b) -> int { return a < b ? a : b; }; // Prepare input and output (declare pointers, allocate device memory etc.) size_t input_size; // e.g., 8 short * input; // e.g., [4, 7, 6, 2, 5, 1, 3, 8] int * output; // empty array of 1 element int start_value; // e.g., 9 size_t temporary_storage_size_bytes; void * temporary_storage_ptr = nullptr; // Get required size of the temporary storage rocprim::reduce( temporary_storage_ptr, temporary_storage_size_bytes, input, output, start_value, input_size, min_op ); // allocate temporary storage hipMalloc(&temporary_storage_ptr, temporary_storage_size_bytes); // perform reduce rocprim::reduce( temporary_storage_ptr, temporary_storage_size_bytes, input, output, start_value, input_size, min_op ); // output: [1]
The same example as above, but now a custom accumulator type is specified.
#include <rocprim/rocprim.hpp> auto min_op = [] __device__ (int a, int b) -> int { return a < b ? a : b; }; size_t input_size; short * input; int * output; int start_value; // Use a transform iterator to specifiy a custom accumulator type auto input_iterator = rocprim::make_transform_iterator( input, [] __device__ (T in) { return static_cast<int>(in); }); size_t temporary_storage_size_bytes; void * temporary_storage_ptr = nullptr; // Use the transform iterator rocprim::reduce( temporary_storage_ptr, temporary_storage_size_bytes, input_iterator, output, start_value, input_size, min_op ); hipMalloc(&temporary_storage_ptr, temporary_storage_size_bytes); rocprim::reduce( temporary_storage_ptr, temporary_storage_size_bytes, input_iterator, output, start_value, input_size, min_op );
- Template Parameters:
Config – - [optional] Configuration of the primitive, must be
default_config
orreduce_config
.InputIterator – - random-access iterator type of the input range. Must meet the requirements of a C++ InputIterator concept. It can be a simple pointer type.
OutputIterator – - random-access iterator type of the output range. Must meet the requirements of a C++ OutputIterator concept. It can be a simple pointer type.
InitValueType – - type of the initial value.
BinaryFunction – - type of binary function used for reduction. Default type is
rocprim::plus<T>
, whereT
is avalue_type
ofInputIterator
.
- Parameters:
temporary_storage – [in] - pointer to a device-accessible temporary storage. When a null pointer is passed, the required allocation size (in bytes) is written to
storage_size
and function returns without performing the reduction operation.storage_size – [inout] - reference to a size (in bytes) of
temporary_storage
.input – [in] - iterator to the first element in the range to reduce.
output – [out] - iterator to the first element in the output range. It can be same as
input
.initial_value – [in] - initial value to start the reduction.
size – [in] - number of element in the input range.
reduce_op – [in] - binary operation function object that will be used for reduction. The signature of the function should be equivalent to the following:
T f(const T &a, const T &b);
. The signature does not need to haveconst &
, but function object must not modify the objects passed to it. The default value isBinaryFunction()
.stream – [in] - [optional] HIP stream object. The default is
0
(default stream).debug_synchronous – [in] - [optional] If true, synchronization after every kernel launch is forced in order to check for errors. The default value is
false
.
- Returns:
hipSuccess
(0
) after successful reduction; otherwise a HIP runtime error of typehipError_t
.
-
template<class Config = default_config, class InputIterator, class OutputIterator, class BinaryFunction = ::rocprim::plus<typename std::iterator_traits<InputIterator>::value_type>>
inline hipError_t rocprim::reduce(void *temporary_storage, size_t &storage_size, InputIterator input, OutputIterator output, const size_t size, BinaryFunction reduce_op = BinaryFunction(), const hipStream_t stream = 0, bool debug_synchronous = false)# Parallel reduce primitive for device level.
reduce function performs a device-wide reduction operation using binary
reduce_op
operator.- Overview
Does not support non-commutative reduction operators. Reduction operator should also be associative. When used with non-associative functions the results may be non-deterministic and/or vary in precision.
Returns the required size of
temporary_storage
instorage_size
iftemporary_storage
in a null pointer.Ranges specified by
input
must have at leastsize
elements, whileoutput
only needs one element.By default, the input type is used for accumulation. A custom type can be specified using
rocprim::transform_iterator
, see the example below.
- Example
In this example a device-level sum operation is performed on an array of integer values (
short
s are reduced intoint
s).#include <rocprim/rocprim.hpp> // Prepare input and output (declare pointers, allocate device memory etc.) size_t input_size; // e.g., 8 short * input; // e.g., [1, 2, 3, 4, 5, 6, 7, 8] int * output; // empty array of 1 element size_t temporary_storage_size_bytes; void * temporary_storage_ptr = nullptr; // Get required size of the temporary storage rocprim::reduce( temporary_storage_ptr, temporary_storage_size_bytes, input, output, input_size, rocprim::plus<int>() ); // allocate temporary storage hipMalloc(&temporary_storage_ptr, temporary_storage_size_bytes); // perform reduce rocprim::reduce( temporary_storage_ptr, temporary_storage_size_bytes, input, output, input_size, rocprim::plus<int>() ); // output: [36]
The same example as above, but now a custom accumulator type is specified.
#include <rocprim/rocprim.hpp> size_t input_size; short * input; int * output; // Use a transform iterator to specifiy a custom accumulator type auto input_iterator = rocprim::make_transform_iterator( input, [] __device__ (T in) { return static_cast<int>(in); }); size_t temporary_storage_size_bytes; void * temporary_storage_ptr = nullptr; // Use the transform iterator rocprim::reduce( temporary_storage_ptr, temporary_storage_size_bytes, input_iterator, output, start_value, input_size, rocprim::plus<int>() ); hipMalloc(&temporary_storage_ptr, temporary_storage_size_bytes); rocprim::reduce( temporary_storage_ptr, temporary_storage_size_bytes, input_iterator, output, start_value, input_size, rocprim::plus<int>() );
- Template Parameters:
Config – - [optional] Configuration of the primitive, must be
default_config
orreduce_config
.InputIterator – - random-access iterator type of the input range. Must meet the requirements of a C++ InputIterator concept. It can be a simple pointer type.
OutputIterator – - random-access iterator type of the output range. Must meet the requirements of a C++ OutputIterator concept. It can be a simple pointer type.
BinaryFunction – - type of binary function used for reduction. Default type is
rocprim::plus<T>
, whereT
is avalue_type
ofInputIterator
.
- Parameters:
temporary_storage – [in] - pointer to a device-accessible temporary storage. When a null pointer is passed, the required allocation size (in bytes) is written to
storage_size
and function returns without performing the reduction operation.storage_size – [inout] - reference to a size (in bytes) of
temporary_storage
.input – [in] - iterator to the first element in the range to reduce.
output – [out] - iterator to the first element in the output range. It can be same as
input
.size – [in] - number of element in the input range.
reduce_op – [in] - binary operation function object that will be used for reduction. The signature of the function should be equivalent to the following:
T f(const T &a, const T &b);
. The signature does not need to haveconst &
, but function object must not modify the objects passed to it. Default is BinaryFunction().stream – [in] - [optional] HIP stream object. Default is
0
(default stream).debug_synchronous – [in] - [optional] If true, synchronization after every kernel launch is forced in order to check for errors. Default value is
false
.
- Returns:
hipSuccess
(0
) after successful reduction; otherwise a HIP runtime error of typehipError_t
.
segmented_reduce#
-
template<class Config = default_config, class InputIterator, class OutputIterator, class OffsetIterator, class BinaryFunction = ::rocprim::plus<typename std::iterator_traits<InputIterator>::value_type>, class InitValueType = typename std::iterator_traits<InputIterator>::value_type>
inline hipError_t rocprim::segmented_reduce(void *temporary_storage, size_t &storage_size, InputIterator input, OutputIterator output, unsigned int segments, OffsetIterator begin_offsets, OffsetIterator end_offsets, BinaryFunction reduce_op = BinaryFunction(), InitValueType initial_value = InitValueType(), hipStream_t stream = 0, bool debug_synchronous = false)# Parallel segmented reduction primitive for device level.
segmented_reduce function performs a device-wide reduction operation across multiple sequences using binary
reduce_op
operator.- Overview
Returns the required size of
temporary_storage
instorage_size
iftemporary_storage
in a null pointer.Ranges specified by
input
must have at leastsize
elements,output
must havesegments
elements.Ranges specified by
begin_offsets
andend_offsets
must have at leastsegments
elements. They may use the same sequenceoffsets
of at leastsegments + 1
elements:offsets
forbegin_offsets
andoffsets + 1
forend_offsets
.
- Example
In this example a device-level segmented min-reduction operation is performed on an array of integer values (
short
s are reduced intoint
s) using custom operator.#include <rocprim/rocprim.hpp> // custom reduce function auto min_op = [] __device__ (int a, int b) -> int { return a < b ? a : b; }; // Prepare input and output (declare pointers, allocate device memory etc.) unsigned int segments; // e.g., 3 short * input; // e.g., [4, 7, 6, 2, 5, 1, 3, 8] int * output; // empty array of 3 elements int * offsets; // e.g. [0, 2, 3, 8] int init_value; // e.g., 9 size_t temporary_storage_size_bytes; void * temporary_storage_ptr = nullptr; // Get required size of the temporary storage rocprim::segmented_reduce( temporary_storage_ptr, temporary_storage_size_bytes, input, output, segments, offsets, offsets + 1, min_op, init_value ); // allocate temporary storage hipMalloc(&temporary_storage_ptr, temporary_storage_size_bytes); // perform segmented reduction rocprim::segmented_reduce( temporary_storage_ptr, temporary_storage_size_bytes, input, output, segments, offsets, offsets + 1, min_op, init_value ); // output: [4, 6, 1]
- Template Parameters:
Config – - [optional] Configuration of the primitive, must be
default_config
orreduce_config
.InputIterator – - random-access iterator type of the input range. Must meet the requirements of a C++ InputIterator concept. It can be a simple pointer type.
OutputIterator – - random-access iterator type of the output range. Must meet the requirements of a C++ OutputIterator concept. It can be a simple pointer type.
OffsetIterator – - random-access iterator type of segment offsets. Must meet the requirements of a C++ OutputIterator concept. It can be a simple pointer type.
BinaryFunction – - type of binary function used for reduction. Default type is
rocprim::plus<T>
, whereT
is avalue_type
ofInputIterator
.InitValueType – - type of the initial value.
- Parameters:
temporary_storage – [in] - pointer to a device-accessible temporary storage. When a null pointer is passed, the required allocation size (in bytes) is written to
storage_size
and function returns without performing the reduction operation.storage_size – [inout] - reference to a size (in bytes) of
temporary_storage
.input – [in] - iterator to the first element in the range to reduce.
output – [out] - iterator to the first element in the output range.
segments – [in] - number of segments in the input range.
begin_offsets – [in] - iterator to the first element in the range of beginning offsets.
end_offsets – [in] - iterator to the first element in the range of ending offsets.
initial_value – [in] - initial value to start the reduction.
reduce_op – [in] - binary operation function object that will be used for reduction. The signature of the function should be equivalent to the following:
T f(const T &a, const T &b);
. The signature does not need to haveconst &
, but function object must not modify the objects passed to it. The default value isBinaryFunction()
.stream – [in] - [optional] HIP stream object. The default is
0
(default stream).debug_synchronous – [in] - [optional] If true, synchronization after every kernel launch is forced in order to check for errors. The default value is
false
.
- Returns:
hipSuccess
(0
) after successful reduction; otherwise a HIP runtime error of typehipError_t
.
reduce_by_key#
-
template<class Config = default_config, class KeysInputIterator, class ValuesInputIterator, class UniqueOutputIterator, class AggregatesOutputIterator, class UniqueCountOutputIterator, class BinaryFunction = ::rocprim::plus<typename std::iterator_traits<ValuesInputIterator>::value_type>, class KeyCompareFunction = ::rocprim::equal_to<typename std::iterator_traits<KeysInputIterator>::value_type>>
inline hipError_t rocprim::reduce_by_key(void *temporary_storage, size_t &storage_size, KeysInputIterator keys_input, ValuesInputIterator values_input, const size_t size, UniqueOutputIterator unique_output, AggregatesOutputIterator aggregates_output, UniqueCountOutputIterator unique_count_output, BinaryFunction reduce_op = BinaryFunction(), KeyCompareFunction key_compare_op = KeyCompareFunction(), hipStream_t stream = 0, bool debug_synchronous = false)# Parallel reduce-by-key primitive for device level.
reduce_by_key function performs a device-wide reduction operation on groups of consecutive values having the same key using binary
reduce_op
operator. The first key of each group is copied tounique_output
and the reduction of the group is written toaggregates_output
. The total number of groups is written tounique_count_output
.- Overview
Supports non-commutative reduction operators. However, a reduction operator should be associative.
When used with non-associative functions (e.g. floating point arithmetic operations):
the results may be non-deterministic and/or vary in precision,
and bit-wise reproducibility is not guaranteed, that is, results from multiple runs using the same input values on the same device may not be bit-wise identical. If deterministic behavior is required, Use rocprim::deterministic_reduce_by_key instead.
Returns the required size of
temporary_storage
instorage_size
iftemporary_storage
in a null pointer.Ranges specified by
keys_input
andvalues_input
must have at leastsize
elements.Range specified by
unique_count_output
must have at least 1 element.Ranges specified by
unique_output
andaggregates_output
must have at least*unique_count_output
(i.e. the number of unique keys) elements.
- Example
In this example a device-level sum operation is performed on an array of integer values and integer keys.
#include <rocprim/rocprim.hpp> // Prepare input and output (declare pointers, allocate device memory etc.) size_t input_size; // e.g., 8 int * keys_input; // e.g., [1, 1, 1, 2, 10, 10, 10, 88] int * values_input; // e.g., [1, 2, 3, 4, 5, 6, 7, 8] int * unique_output; // empty array of at least 4 elements int * aggregates_output; // empty array of at least 4 elements int * unique_count_output; // empty array of 1 element size_t temporary_storage_size_bytes; void * temporary_storage_ptr = nullptr; // Get required size of the temporary storage rocprim::reduce_by_key( temporary_storage_ptr, temporary_storage_size_bytes, keys_input, values_input, input_size, unique_output, aggregates_output, unique_count_output ); // allocate temporary storage hipMalloc(&temporary_storage_ptr, temporary_storage_size_bytes); // perform reduction rocprim::reduce_by_key( temporary_storage_ptr, temporary_storage_size_bytes, keys_input, values_input, input_size, unique_output, aggregates_output, unique_count_output ); // unique_output: [1, 2, 10, 88] // aggregates_output: [6, 4, 18, 8] // unique_count_output: [4]
- Template Parameters:
Config – - [optional] Configuration of the primitive, must be
default_config
orreduce_by_key_config
.KeysInputIterator – - random-access iterator type of the input range. Must meet the requirements of a C++ InputIterator concept. It can be a simple pointer type.
ValuesInputIterator – - random-access iterator type of the input range. Must meet the requirements of a C++ InputIterator concept. It can be a simple pointer type.
UniqueOutputIterator – - random-access iterator type of the output range. Must meet the requirements of a C++ OutputIterator concept. It can be a simple pointer type.
AggregatesOutputIterator – - random-access iterator type of the output range. Must meet the requirements of a C++ OutputIterator concept. It can be a simple pointer type.
UniqueCountOutputIterator – - random-access iterator type of the output range. Must meet the requirements of a C++ OutputIterator concept. It can be a simple pointer type.
BinaryFunction – - type of binary function used for reduction. Default type is
rocprim::plus<T>
, whereT
is avalue_type
ofValuesInputIterator
.KeyCompareFunction – - type of binary function used to determine keys equality. Default type is
rocprim::equal_to<T>
, whereT
is avalue_type
ofKeysInputIterator
.
- Parameters:
temporary_storage – [in] - pointer to a device-accessible temporary storage. When a null pointer is passed, the required allocation size (in bytes) is written to
storage_size
and function returns without performing the reduction operation.storage_size – [inout] - reference to a size (in bytes) of
temporary_storage
.keys_input – [in] - iterator to the first element in the range of keys.
values_input – [in] - iterator to the first element in the range of values to reduce.
size – [in] - number of element in the input range.
unique_output – [out] - iterator to the first element in the output range of unique keys.
aggregates_output – [out] - iterator to the first element in the output range of reductions.
unique_count_output – [out] - iterator to total number of groups.
reduce_op – [in] - binary operation function object that will be used for reduction. The signature of the function should be equivalent to the following:
T f(const T &a, const T &b);
. The signature does not need to haveconst &
, but function object must not modify the objects passed to it and must not have any side effects since the function may be called on uninitalized data. Default is BinaryFunction().key_compare_op – [in] - binary operation function object that will be used to determine key equality. The signature of the function should be equivalent to the following:
bool f(const T &a, const T &b);
. The signature does not need to haveconst &
, but function object must not modify the objects passed to it and must not have any side effects since the function may be called on uninitalized data. Default is KeyCompareFunction().stream – [in] - [optional] HIP stream object. Default is
0
(default stream).debug_synchronous – [in] - [optional] If true, synchronization after every kernel launch is forced in order to check for errors. Default value is
false
.
- Returns:
hipSuccess
(0
) after successful reduction; otherwise a HIP runtime error of typehipError_t
.
deterministic#
-
template<class Config = default_config, class KeysInputIterator, class ValuesInputIterator, class UniqueOutputIterator, class AggregatesOutputIterator, class UniqueCountOutputIterator, class BinaryFunction = ::rocprim::plus<typename std::iterator_traits<ValuesInputIterator>::value_type>, class KeyCompareFunction = ::rocprim::equal_to<typename std::iterator_traits<KeysInputIterator>::value_type>>
inline hipError_t rocprim::deterministic_reduce_by_key(void *temporary_storage, size_t &storage_size, KeysInputIterator keys_input, ValuesInputIterator values_input, const size_t size, UniqueOutputIterator unique_output, AggregatesOutputIterator aggregates_output, UniqueCountOutputIterator unique_count_output, BinaryFunction reduce_op = BinaryFunction(), KeyCompareFunction key_compare_op = KeyCompareFunction(), hipStream_t stream = 0, bool debug_synchronous = false)# Bitwise-reproducible parallel reduce-by-key primitive for device level.
This function behaves the same as
reduce_by_key()
, except that unlikereduce_by_key()
, it provides run-to-run deterministic behavior for non-associative scan operators like floating point arithmetic operations. Refer to the documentation for rocprim::reduce_by_key for a detailed description of this function.