Reduce#
Configuring the kernel#
reduce#
- 
template<unsigned int BlockSize = 256, unsigned int ItemsPerThread = 8, ::rocprim::block_reduce_algorithm BlockReduceMethod = ::rocprim::block_reduce_algorithm::default_algorithm, unsigned int SizeLimit = std::numeric_limits<unsigned int>::max()>
 struct reduce_config : public rocprim::detail::reduce_config_params#
- Configuration of device-level reduce primitives. - Template Parameters:
- BlockSize – - number of threads in a block. 
- ItemsPerThread – - number of items processed by each thread. 
- BlockReduceMethod – - algorithm for block reduce. 
- SizeLimit – - limit on the number of items reduced by a single launch 
 
 - Subclassed by rocprim::detail::default_reduce_config< arch, key_type, enable >, rocprim::detail::default_segmented_reduce_config< arch, key_type, enable > 
reduce_by_key#
- 
template<unsigned int BlockSize, unsigned int ItemsPerThread, block_load_method LoadKeysMethod = block_load_method::block_load_transpose, block_load_method LoadValuesMethod = block_load_method::block_load_transpose, block_scan_algorithm ScanAlgorithm = block_scan_algorithm::using_warp_scan, unsigned int TilesPerBlock = 1, unsigned int SizeLimit = std::numeric_limits<unsigned int>::max()>
 struct reduce_by_key_config : public rocprim::detail::reduce_by_key_config_params#
- Configuration of device-level reduce-by-key operation. - Template Parameters:
- BlockSize – number of threads in a block. 
- ItemsPerThread – number of items processed by each thread per tile. 
- LoadKeysMethod – method of loading keys 
- LoadValuesMethod – method of loading values 
- ScanAlgorithm – block level scan algorithm to use 
- TilesPerBlock – number of tiles ( - BlockSize*- ItemsPerThreaditems) to process per block. This parameter is only here for legacy purposes. Its no longer used.
- SizeLimit – limit on the number of items for a single reduce_by_key kernel launch. 
 
 
reduce#
- 
template<class Config = default_config, class InputIterator, class OutputIterator, class InitValueType, class BinaryFunction = ::rocprim::plus<typename std::iterator_traits<InputIterator>::value_type>>
 inline hipError_t rocprim::reduce(void *temporary_storage, size_t &storage_size, InputIterator input, OutputIterator output, const InitValueType initial_value, const size_t size, BinaryFunction reduce_op = BinaryFunction(), const hipStream_t stream = 0, bool debug_synchronous = false)#
- Parallel reduction primitive for device level. - reduce function performs a device-wide reduction operation using binary - reduce_opoperator.- Overview
- Does not support non-commutative reduction operators. Reduction operator should also be associative. When used with non-associative functions the results may be non-deterministic and/or vary in precision. 
- Returns the required size of - temporary_storagein- storage_sizeif- temporary_storagein a null pointer.
- Ranges specified by - inputmust have at least- sizeelements, while- outputonly needs one element.
- By default, the input type is used for accumulation. A custom type can be specified using - rocprim::transform_iterator, see the example below.
 
- Example
- In this example a device-level min-reduction operation is performed on an array of integer values ( - shorts are reduced into- ints) using custom operator.- #include <rocprim/rocprim.hpp> // custom reduce function auto min_op = [] __device__ (int a, int b) -> int { return a < b ? a : b; }; // Prepare input and output (declare pointers, allocate device memory etc.) size_t input_size; // e.g., 8 short * input; // e.g., [4, 7, 6, 2, 5, 1, 3, 8] int * output; // empty array of 1 element int start_value; // e.g., 9 size_t temporary_storage_size_bytes; void * temporary_storage_ptr = nullptr; // Get required size of the temporary storage rocprim::reduce( temporary_storage_ptr, temporary_storage_size_bytes, input, output, start_value, input_size, min_op ); // allocate temporary storage hipMalloc(&temporary_storage_ptr, temporary_storage_size_bytes); // perform reduce rocprim::reduce( temporary_storage_ptr, temporary_storage_size_bytes, input, output, start_value, input_size, min_op ); // output: [1] - The same example as above, but now a custom accumulator type is specified. - #include <rocprim/rocprim.hpp> auto min_op = [] __device__ (int a, int b) -> int { return a < b ? a : b; }; size_t input_size; short * input; int * output; int start_value; // Use a transform iterator to specifiy a custom accumulator type auto input_iterator = rocprim::make_transform_iterator( input, [] __device__ (T in) { return static_cast<int>(in); }); size_t temporary_storage_size_bytes; void * temporary_storage_ptr = nullptr; // Use the transform iterator rocprim::reduce( temporary_storage_ptr, temporary_storage_size_bytes, input_iterator, output, start_value, input_size, min_op ); hipMalloc(&temporary_storage_ptr, temporary_storage_size_bytes); rocprim::reduce( temporary_storage_ptr, temporary_storage_size_bytes, input_iterator, output, start_value, input_size, min_op ); 
 - Template Parameters:
- Config – - [optional] Configuration of the primitive, must be - default_configor- reduce_config.
- InputIterator – - random-access iterator type of the input range. Must meet the requirements of a C++ InputIterator concept. It can be a simple pointer type. 
- OutputIterator – - random-access iterator type of the output range. Must meet the requirements of a C++ OutputIterator concept. It can be a simple pointer type. 
- InitValueType – - type of the initial value. 
- BinaryFunction – - type of binary function used for reduction. Default type is - rocprim::plus<T>, where- Tis a- value_typeof- InputIterator.
 
- Parameters:
- temporary_storage – [in] - pointer to a device-accessible temporary storage. When a null pointer is passed, the required allocation size (in bytes) is written to - storage_sizeand function returns without performing the reduction operation.
- storage_size – [inout] - reference to a size (in bytes) of - temporary_storage.
- input – [in] - iterator to the first element in the range to reduce. 
- output – [out] - iterator to the first element in the output range. It can be same as - input.
- initial_value – [in] - initial value to start the reduction. 
- size – [in] - number of element in the input range. 
- reduce_op – [in] - binary operation function object that will be used for reduction. The signature of the function should be equivalent to the following: - T f(const T &a, const T &b);. The signature does not need to have- const &, but function object must not modify the objects passed to it. The default value is- BinaryFunction().
- stream – [in] - [optional] HIP stream object. The default is - 0(default stream).
- debug_synchronous – [in] - [optional] If true, synchronization after every kernel launch is forced in order to check for errors. The default value is - false.
 
- Returns:
- hipSuccess(- 0) after successful reduction; otherwise a HIP runtime error of type- hipError_t.
 
- 
template<class Config = default_config, class InputIterator, class OutputIterator, class BinaryFunction = ::rocprim::plus<typename std::iterator_traits<InputIterator>::value_type>>
 inline hipError_t rocprim::reduce(void *temporary_storage, size_t &storage_size, InputIterator input, OutputIterator output, const size_t size, BinaryFunction reduce_op = BinaryFunction(), const hipStream_t stream = 0, bool debug_synchronous = false)#
- Parallel reduce primitive for device level. - reduce function performs a device-wide reduction operation using binary - reduce_opoperator.- Overview
- Does not support non-commutative reduction operators. Reduction operator should also be associative. When used with non-associative functions the results may be non-deterministic and/or vary in precision. 
- Returns the required size of - temporary_storagein- storage_sizeif- temporary_storagein a null pointer.
- Ranges specified by - inputmust have at least- sizeelements, while- outputonly needs one element.
- By default, the input type is used for accumulation. A custom type can be specified using - rocprim::transform_iterator, see the example below.
 
- Example
- In this example a device-level sum operation is performed on an array of integer values ( - shorts are reduced into- ints).- #include <rocprim/rocprim.hpp> // Prepare input and output (declare pointers, allocate device memory etc.) size_t input_size; // e.g., 8 short * input; // e.g., [1, 2, 3, 4, 5, 6, 7, 8] int * output; // empty array of 1 element size_t temporary_storage_size_bytes; void * temporary_storage_ptr = nullptr; // Get required size of the temporary storage rocprim::reduce( temporary_storage_ptr, temporary_storage_size_bytes, input, output, input_size, rocprim::plus<int>() ); // allocate temporary storage hipMalloc(&temporary_storage_ptr, temporary_storage_size_bytes); // perform reduce rocprim::reduce( temporary_storage_ptr, temporary_storage_size_bytes, input, output, input_size, rocprim::plus<int>() ); // output: [36] - The same example as above, but now a custom accumulator type is specified. - #include <rocprim/rocprim.hpp> size_t input_size; short * input; int * output; // Use a transform iterator to specifiy a custom accumulator type auto input_iterator = rocprim::make_transform_iterator( input, [] __device__ (T in) { return static_cast<int>(in); }); size_t temporary_storage_size_bytes; void * temporary_storage_ptr = nullptr; // Use the transform iterator rocprim::reduce( temporary_storage_ptr, temporary_storage_size_bytes, input_iterator, output, start_value, input_size, rocprim::plus<int>() ); hipMalloc(&temporary_storage_ptr, temporary_storage_size_bytes); rocprim::reduce( temporary_storage_ptr, temporary_storage_size_bytes, input_iterator, output, start_value, input_size, rocprim::plus<int>() ); 
 - Template Parameters:
- Config – - [optional] Configuration of the primitive, must be - default_configor- reduce_config.
- InputIterator – - random-access iterator type of the input range. Must meet the requirements of a C++ InputIterator concept. It can be a simple pointer type. 
- OutputIterator – - random-access iterator type of the output range. Must meet the requirements of a C++ OutputIterator concept. It can be a simple pointer type. 
- BinaryFunction – - type of binary function used for reduction. Default type is - rocprim::plus<T>, where- Tis a- value_typeof- InputIterator.
 
- Parameters:
- temporary_storage – [in] - pointer to a device-accessible temporary storage. When a null pointer is passed, the required allocation size (in bytes) is written to - storage_sizeand function returns without performing the reduction operation.
- storage_size – [inout] - reference to a size (in bytes) of - temporary_storage.
- input – [in] - iterator to the first element in the range to reduce. 
- output – [out] - iterator to the first element in the output range. It can be same as - input.
- size – [in] - number of element in the input range. 
- reduce_op – [in] - binary operation function object that will be used for reduction. The signature of the function should be equivalent to the following: - T f(const T &a, const T &b);. The signature does not need to have- const &, but function object must not modify the objects passed to it. Default is BinaryFunction().
- stream – [in] - [optional] HIP stream object. Default is - 0(default stream).
- debug_synchronous – [in] - [optional] If true, synchronization after every kernel launch is forced in order to check for errors. Default value is - false.
 
- Returns:
- hipSuccess(- 0) after successful reduction; otherwise a HIP runtime error of type- hipError_t.
 
segmented_reduce#
- 
template<class Config = default_config, class InputIterator, class OutputIterator, class OffsetIterator, class BinaryFunction = ::rocprim::plus<typename std::iterator_traits<InputIterator>::value_type>, class InitValueType = typename std::iterator_traits<InputIterator>::value_type>
 inline hipError_t rocprim::segmented_reduce(void *temporary_storage, size_t &storage_size, InputIterator input, OutputIterator output, unsigned int segments, OffsetIterator begin_offsets, OffsetIterator end_offsets, BinaryFunction reduce_op = BinaryFunction(), InitValueType initial_value = InitValueType(), hipStream_t stream = 0, bool debug_synchronous = false)#
- Parallel segmented reduction primitive for device level. - segmented_reduce function performs a device-wide reduction operation across multiple sequences using binary - reduce_opoperator.- Overview
- Returns the required size of - temporary_storagein- storage_sizeif- temporary_storagein a null pointer.
- Ranges specified by - inputmust have at least- sizeelements,- outputmust have- segmentselements.
- Ranges specified by - begin_offsetsand- end_offsetsmust have at least- segmentselements. They may use the same sequence- offsetsof at least- segments + 1elements:- offsetsfor- begin_offsetsand- offsets + 1for- end_offsets.
 
- Example
- In this example a device-level segmented min-reduction operation is performed on an array of integer values ( - shorts are reduced into- ints) using custom operator.- #include <rocprim/rocprim.hpp> // custom reduce function auto min_op = [] __device__ (int a, int b) -> int { return a < b ? a : b; }; // Prepare input and output (declare pointers, allocate device memory etc.) unsigned int segments; // e.g., 3 short * input; // e.g., [4, 7, 6, 2, 5, 1, 3, 8] int * output; // empty array of 3 elements int * offsets; // e.g. [0, 2, 3, 8] int init_value; // e.g., 9 size_t temporary_storage_size_bytes; void * temporary_storage_ptr = nullptr; // Get required size of the temporary storage rocprim::segmented_reduce( temporary_storage_ptr, temporary_storage_size_bytes, input, output, segments, offsets, offsets + 1, min_op, init_value ); // allocate temporary storage hipMalloc(&temporary_storage_ptr, temporary_storage_size_bytes); // perform segmented reduction rocprim::segmented_reduce( temporary_storage_ptr, temporary_storage_size_bytes, input, output, segments, offsets, offsets + 1, min_op, init_value ); // output: [4, 6, 1] 
 - Template Parameters:
- Config – - [optional] Configuration of the primitive, must be - default_configor- reduce_config.
- InputIterator – - random-access iterator type of the input range. Must meet the requirements of a C++ InputIterator concept. It can be a simple pointer type. 
- OutputIterator – - random-access iterator type of the output range. Must meet the requirements of a C++ OutputIterator concept. It can be a simple pointer type. 
- OffsetIterator – - random-access iterator type of segment offsets. Must meet the requirements of a C++ OutputIterator concept. It can be a simple pointer type. 
- BinaryFunction – - type of binary function used for reduction. Default type is - rocprim::plus<T>, where- Tis a- value_typeof- InputIterator.
- InitValueType – - type of the initial value. 
 
- Parameters:
- temporary_storage – [in] - pointer to a device-accessible temporary storage. When a null pointer is passed, the required allocation size (in bytes) is written to - storage_sizeand function returns without performing the reduction operation.
- storage_size – [inout] - reference to a size (in bytes) of - temporary_storage.
- input – [in] - iterator to the first element in the range to reduce. 
- output – [out] - iterator to the first element in the output range. 
- segments – [in] - number of segments in the input range. 
- begin_offsets – [in] - iterator to the first element in the range of beginning offsets. 
- end_offsets – [in] - iterator to the first element in the range of ending offsets. 
- initial_value – [in] - initial value to start the reduction. 
- reduce_op – [in] - binary operation function object that will be used for reduction. The signature of the function should be equivalent to the following: - T f(const T &a, const T &b);. The signature does not need to have- const &, but function object must not modify the objects passed to it. The default value is- BinaryFunction().
- stream – [in] - [optional] HIP stream object. The default is - 0(default stream).
- debug_synchronous – [in] - [optional] If true, synchronization after every kernel launch is forced in order to check for errors. The default value is - false.
 
- Returns:
- hipSuccess(- 0) after successful reduction; otherwise a HIP runtime error of type- hipError_t.
 
reduce_by_key#
- 
template<typename Config = default_config, typename KeysInputIterator, typename ValuesInputIterator, typename UniqueOutputIterator, typename AggregatesOutputIterator, typename UniqueCountOutputIterator, typename BinaryFunction = ::rocprim::plus<typename std::iterator_traits<ValuesInputIterator>::value_type>, typename KeyCompareFunction = ::rocprim::equal_to<typename std::iterator_traits<KeysInputIterator>::value_type>>
 inline hipError_t rocprim::reduce_by_key(void *temporary_storage, size_t &storage_size, KeysInputIterator keys_input, ValuesInputIterator values_input, const size_t size, UniqueOutputIterator unique_output, AggregatesOutputIterator aggregates_output, UniqueCountOutputIterator unique_count_output, BinaryFunction reduce_op = BinaryFunction(), KeyCompareFunction key_compare_op = KeyCompareFunction(), hipStream_t stream = 0, bool debug_synchronous = false)#
- Parallel reduce-by-key primitive for device level. - reduce_by_key function performs a device-wide reduction operation on groups of consecutive values having the same key using binary - reduce_opoperator. The first key of each group is copied to- unique_outputand the reduction of the group is written to- aggregates_output. The total number of groups is written to- unique_count_output.- Overview
- Supports non-commutative reduction operators. However, a reduction operator should be associative. 
- When used with non-associative functions (e.g. floating point arithmetic operations): - the results may be non-deterministic and/or vary in precision, 
- and bit-wise reproducibility is not guaranteed, that is, results from multiple runs using the same input values on the same device may not be bit-wise identical. If deterministic behavior is required, Use rocprim::deterministic_reduce_by_key instead. 
 
- Returns the required size of - temporary_storagein- storage_sizeif- temporary_storagein a null pointer.
- Ranges specified by - keys_inputand- values_inputmust have at least- sizeelements.
- Range specified by - unique_count_outputmust have at least 1 element.
- Ranges specified by - unique_outputand- aggregates_outputmust have at least- *unique_count_output(i.e. the number of unique keys) elements.
 
- Example
- In this example a device-level sum operation is performed on an array of integer values and integer keys. - #include <rocprim/rocprim.hpp> // Prepare input and output (declare pointers, allocate device memory etc.) size_t input_size; // e.g., 8 int * keys_input; // e.g., [1, 1, 1, 2, 10, 10, 10, 88] int * values_input; // e.g., [1, 2, 3, 4, 5, 6, 7, 8] int * unique_output; // empty array of at least 4 elements int * aggregates_output; // empty array of at least 4 elements int * unique_count_output; // empty array of 1 element size_t temporary_storage_size_bytes; void * temporary_storage_ptr = nullptr; // Get required size of the temporary storage rocprim::reduce_by_key( temporary_storage_ptr, temporary_storage_size_bytes, keys_input, values_input, input_size, unique_output, aggregates_output, unique_count_output ); // allocate temporary storage hipMalloc(&temporary_storage_ptr, temporary_storage_size_bytes); // perform reduction rocprim::reduce_by_key( temporary_storage_ptr, temporary_storage_size_bytes, keys_input, values_input, input_size, unique_output, aggregates_output, unique_count_output ); // unique_output: [1, 2, 10, 88] // aggregates_output: [6, 4, 18, 8] // unique_count_output: [4] 
 - Template Parameters:
- Config – - [optional] Configuration of the primitive, must be - default_configor- reduce_by_key_config.
- KeysInputIterator – - random-access iterator type of the input range. Must meet the requirements of a C++ InputIterator concept. It can be a simple pointer type. 
- ValuesInputIterator – - random-access iterator type of the input range. Must meet the requirements of a C++ InputIterator concept. It can be a simple pointer type. 
- UniqueOutputIterator – - random-access iterator type of the output range. Must meet the requirements of a C++ OutputIterator concept. It can be a simple pointer type. 
- AggregatesOutputIterator – - random-access iterator type of the output range. Must meet the requirements of a C++ OutputIterator concept. It can be a simple pointer type. 
- UniqueCountOutputIterator – - random-access iterator type of the output range. Must meet the requirements of a C++ OutputIterator concept. It can be a simple pointer type. 
- BinaryFunction – - type of binary function used for reduction. Default type is - rocprim::plus<T>, where- Tis a- value_typeof- ValuesInputIterator.
- KeyCompareFunction – - type of binary function used to determine keys equality. Default type is - rocprim::equal_to<T>, where- Tis a- value_typeof- KeysInputIterator.
 
- Parameters:
- temporary_storage – [in] - pointer to a device-accessible temporary storage. When a null pointer is passed, the required allocation size (in bytes) is written to - storage_sizeand function returns without performing the reduction operation.
- storage_size – [inout] - reference to a size (in bytes) of - temporary_storage.
- keys_input – [in] - iterator to the first element in the range of keys. 
- values_input – [in] - iterator to the first element in the range of values to reduce. 
- size – [in] - number of element in the input range. 
- unique_output – [out] - iterator to the first element in the output range of unique keys. 
- aggregates_output – [out] - iterator to the first element in the output range of reductions. 
- unique_count_output – [out] - iterator to total number of groups. 
- reduce_op – [in] - binary operation function object that will be used for reduction. The signature of the function should be equivalent to the following: - T f(const T &a, const T &b);. The signature does not need to have- const &, but function object must not modify the objects passed to it and must not have any side effects since the function may be called on uninitalized data. Default is BinaryFunction().
- key_compare_op – [in] - binary operation function object that will be used to determine key equality. The signature of the function should be equivalent to the following: - bool f(const T &a, const T &b);. The signature does not need to have- const &, but function object must not modify the objects passed to it and must not have any side effects since the function may be called on uninitalized data. Default is KeyCompareFunction().
- stream – [in] - [optional] HIP stream object. Default is - 0(default stream).
- debug_synchronous – [in] - [optional] If true, synchronization after every kernel launch is forced in order to check for errors. Default value is - false.
 
- Returns:
- hipSuccess(- 0) after successful reduction; otherwise a HIP runtime error of type- hipError_t.
 
deterministic#
- 
template<typename Config = default_config, typename KeysInputIterator, typename ValuesInputIterator, typename UniqueOutputIterator, typename AggregatesOutputIterator, typename UniqueCountOutputIterator, typename BinaryFunction = ::rocprim::plus<typename std::iterator_traits<ValuesInputIterator>::value_type>, typename KeyCompareFunction = ::rocprim::equal_to<typename std::iterator_traits<KeysInputIterator>::value_type>>
 inline hipError_t rocprim::deterministic_reduce_by_key(void *temporary_storage, size_t &storage_size, KeysInputIterator keys_input, ValuesInputIterator values_input, const size_t size, UniqueOutputIterator unique_output, AggregatesOutputIterator aggregates_output, UniqueCountOutputIterator unique_count_output, BinaryFunction reduce_op = BinaryFunction(), KeyCompareFunction key_compare_op = KeyCompareFunction(), hipStream_t stream = 0, bool debug_synchronous = false)#
- Bitwise-reproducible parallel reduce-by-key primitive for device level. - This function behaves the same as - reduce_by_key(), except that unlike- reduce_by_key(), it provides run-to-run deterministic behavior for non-associative scan operators like floating point arithmetic operations. Refer to the documentation for rocprim::reduce_by_key for a detailed description of this function.