Reduce#

Configuring the kernel#

reduce#

template<unsigned int BlockSize = 256, unsigned int ItemsPerThread = 8, ::rocprim::block_reduce_algorithm BlockReduceMethod = ::rocprim::block_reduce_algorithm::default_algorithm, unsigned int SizeLimit = std::numeric_limits<unsigned int>::max()>
struct reduce_config : public rocprim::detail::reduce_config_params#

Configuration of device-level reduce primitives.

Template Parameters:
  • BlockSize – - number of threads in a block.

  • ItemsPerThread – - number of items processed by each thread.

  • BlockReduceMethod – - algorithm for block reduce.

  • SizeLimit – - limit on the number of items reduced by a single launch

Subclassed by rocprim::detail::default_reduce_config< arch, key_type, enable >

reduce_by_key#

template<unsigned int BlockSize, unsigned int ItemsPerThread, block_load_method LoadKeysMethod = block_load_method::block_load_transpose, block_load_method LoadValuesMethod = block_load_method::block_load_transpose, block_scan_algorithm ScanAlgorithm = block_scan_algorithm::using_warp_scan, unsigned int TilesPerBlock = 1, unsigned int SizeLimit = std::numeric_limits<unsigned int>::max()>
struct reduce_by_key_config : public rocprim::detail::reduce_by_key_config_params#

Configuration of device-level reduce-by-key operation.

Template Parameters:
  • BlockSize – number of threads in a block.

  • ItemsPerThread – number of items processed by each thread per tile.

  • LoadKeysMethod – method of loading keys

  • LoadValuesMethod – method of loading values

  • ScanAlgorithm – block level scan algorithm to use

  • TilesPerBlock – number of tiles (BlockSize * ItemsPerThread items) to process per block

  • SizeLimit – limit on the number of items for a single reduce_by_key kernel launch.

reduce#

template<class Config = default_config, class InputIterator, class OutputIterator, class InitValueType, class BinaryFunction = ::rocprim::plus<typename std::iterator_traits<InputIterator>::value_type>>
inline hipError_t rocprim::reduce(void *temporary_storage, size_t &storage_size, InputIterator input, OutputIterator output, const InitValueType initial_value, const size_t size, BinaryFunction reduce_op = BinaryFunction(), const hipStream_t stream = 0, bool debug_synchronous = false)#

Parallel reduction primitive for device level.

reduce function performs a device-wide reduction operation using binary reduce_op operator.

Overview

  • Does not support non-commutative reduction operators. Reduction operator should also be associative. When used with non-associative functions the results may be non-deterministic and/or vary in precision.

  • Returns the required size of temporary_storage in storage_size if temporary_storage in a null pointer.

  • Ranges specified by input must have at least size elements, while output only needs one element.

  • By default, the input type is used for accumulation. A custom type can be specified using rocprim::transform_iterator, see the example below.

Example

In this example a device-level min-reduction operation is performed on an array of integer values (shorts are reduced into ints) using custom operator.

#include <rocprim/rocprim.hpp>

// custom reduce function
auto min_op =
    [] __device__ (int a, int b) -> int
    {
        return a < b ? a : b;
    };

// Prepare input and output (declare pointers, allocate device memory etc.)
size_t input_size;    // e.g., 8
short * input;        // e.g., [4, 7, 6, 2, 5, 1, 3, 8]
int * output;         // empty array of 1 element
int start_value;      // e.g., 9

size_t temporary_storage_size_bytes;
void * temporary_storage_ptr = nullptr;
// Get required size of the temporary storage
rocprim::reduce(
    temporary_storage_ptr, temporary_storage_size_bytes,
    input, output, start_value, input_size, min_op
);

// allocate temporary storage
hipMalloc(&temporary_storage_ptr, temporary_storage_size_bytes);

// perform reduce
rocprim::reduce(
    temporary_storage_ptr, temporary_storage_size_bytes,
    input, output, start_value, input_size, min_op
);
// output: [1]

The same example as above, but now a custom accumulator type is specified.

#include <rocprim/rocprim.hpp>

auto min_op =
    [] __device__ (int a, int b) -> int
    {
        return a < b ? a : b;
    };

size_t input_size;
short * input;
int * output;
int start_value;

// Use a transform iterator to specifiy a custom accumulator type
auto input_iterator = rocprim::make_transform_iterator(
    input, [] __device__ (T in) { return static_cast<int>(in); });

size_t temporary_storage_size_bytes;
void * temporary_storage_ptr = nullptr;
// Use the transform iterator
rocprim::reduce(
    temporary_storage_ptr, temporary_storage_size_bytes,
    input_iterator, output, start_value, input_size, min_op
);

hipMalloc(&temporary_storage_ptr, temporary_storage_size_bytes);

rocprim::reduce(
    temporary_storage_ptr, temporary_storage_size_bytes,
    input_iterator, output, start_value, input_size, min_op
);

Template Parameters:
  • Config – - [optional] Configuration of the primitive, must be default_config or reduce_config.

  • InputIterator – - random-access iterator type of the input range. Must meet the requirements of a C++ InputIterator concept. It can be a simple pointer type.

  • OutputIterator – - random-access iterator type of the output range. Must meet the requirements of a C++ OutputIterator concept. It can be a simple pointer type.

  • InitValueType – - type of the initial value.

  • BinaryFunction – - type of binary function used for reduction. Default type is rocprim::plus<T>, where T is a value_type of InputIterator.

Parameters:
  • temporary_storage[in] - pointer to a device-accessible temporary storage. When a null pointer is passed, the required allocation size (in bytes) is written to storage_size and function returns without performing the reduction operation.

  • storage_size[inout] - reference to a size (in bytes) of temporary_storage.

  • input[in] - iterator to the first element in the range to reduce.

  • output[out] - iterator to the first element in the output range. It can be same as input.

  • initial_value[in] - initial value to start the reduction.

  • size[in] - number of element in the input range.

  • reduce_op[in] - binary operation function object that will be used for reduction. The signature of the function should be equivalent to the following: T f(const T &a, const T &b);. The signature does not need to have const &, but function object must not modify the objects passed to it. The default value is BinaryFunction().

  • stream[in] - [optional] HIP stream object. The default is 0 (default stream).

  • debug_synchronous[in] - [optional] If true, synchronization after every kernel launch is forced in order to check for errors. The default value is false.

Returns:

hipSuccess (0) after successful reduction; otherwise a HIP runtime error of type hipError_t.

template<class Config = default_config, class InputIterator, class OutputIterator, class BinaryFunction = ::rocprim::plus<typename std::iterator_traits<InputIterator>::value_type>>
inline hipError_t rocprim::reduce(void *temporary_storage, size_t &storage_size, InputIterator input, OutputIterator output, const size_t size, BinaryFunction reduce_op = BinaryFunction(), const hipStream_t stream = 0, bool debug_synchronous = false)#

Parallel reduce primitive for device level.

reduce function performs a device-wide reduction operation using binary reduce_op operator.

Overview

  • Does not support non-commutative reduction operators. Reduction operator should also be associative. When used with non-associative functions the results may be non-deterministic and/or vary in precision.

  • Returns the required size of temporary_storage in storage_size if temporary_storage in a null pointer.

  • Ranges specified by input must have at least size elements, while output only needs one element.

  • By default, the input type is used for accumulation. A custom type can be specified using rocprim::transform_iterator, see the example below.

Example

In this example a device-level sum operation is performed on an array of integer values (shorts are reduced into ints).

#include <rocprim/rocprim.hpp>

// Prepare input and output (declare pointers, allocate device memory etc.)
size_t input_size;    // e.g., 8
short * input;        // e.g., [1, 2, 3, 4, 5, 6, 7, 8]
int * output;         // empty array of 1 element

size_t temporary_storage_size_bytes;
void * temporary_storage_ptr = nullptr;
// Get required size of the temporary storage
rocprim::reduce(
    temporary_storage_ptr, temporary_storage_size_bytes,
    input, output, input_size, rocprim::plus<int>()
);

// allocate temporary storage
hipMalloc(&temporary_storage_ptr, temporary_storage_size_bytes);

// perform reduce
rocprim::reduce(
    temporary_storage_ptr, temporary_storage_size_bytes,
    input, output, input_size, rocprim::plus<int>()
);
// output: [36]

The same example as above, but now a custom accumulator type is specified.

#include <rocprim/rocprim.hpp>

size_t input_size;
short * input;
int * output;

// Use a transform iterator to specifiy a custom accumulator type
auto input_iterator = rocprim::make_transform_iterator(
    input, [] __device__ (T in) { return static_cast<int>(in); });

size_t temporary_storage_size_bytes;
void * temporary_storage_ptr = nullptr;
// Use the transform iterator
rocprim::reduce(
    temporary_storage_ptr, temporary_storage_size_bytes,
    input_iterator, output, start_value, input_size, rocprim::plus<int>()
);

hipMalloc(&temporary_storage_ptr, temporary_storage_size_bytes);

rocprim::reduce(
    temporary_storage_ptr, temporary_storage_size_bytes,
    input_iterator, output, start_value, input_size, rocprim::plus<int>()
);

Template Parameters:
  • Config – - [optional] Configuration of the primitive, must be default_config or reduce_config.

  • InputIterator – - random-access iterator type of the input range. Must meet the requirements of a C++ InputIterator concept. It can be a simple pointer type.

  • OutputIterator – - random-access iterator type of the output range. Must meet the requirements of a C++ OutputIterator concept. It can be a simple pointer type.

  • BinaryFunction – - type of binary function used for reduction. Default type is rocprim::plus<T>, where T is a value_type of InputIterator.

Parameters:
  • temporary_storage[in] - pointer to a device-accessible temporary storage. When a null pointer is passed, the required allocation size (in bytes) is written to storage_size and function returns without performing the reduction operation.

  • storage_size[inout] - reference to a size (in bytes) of temporary_storage.

  • input[in] - iterator to the first element in the range to reduce.

  • output[out] - iterator to the first element in the output range. It can be same as input.

  • size[in] - number of element in the input range.

  • reduce_op[in] - binary operation function object that will be used for reduction. The signature of the function should be equivalent to the following: T f(const T &a, const T &b);. The signature does not need to have const &, but function object must not modify the objects passed to it. Default is BinaryFunction().

  • stream[in] - [optional] HIP stream object. Default is 0 (default stream).

  • debug_synchronous[in] - [optional] If true, synchronization after every kernel launch is forced in order to check for errors. Default value is false.

Returns:

hipSuccess (0) after successful reduction; otherwise a HIP runtime error of type hipError_t.

segmented_reduce#

template<class Config = default_config, class InputIterator, class OutputIterator, class OffsetIterator, class BinaryFunction = ::rocprim::plus<typename std::iterator_traits<InputIterator>::value_type>, class InitValueType = typename std::iterator_traits<InputIterator>::value_type>
inline hipError_t rocprim::segmented_reduce(void *temporary_storage, size_t &storage_size, InputIterator input, OutputIterator output, unsigned int segments, OffsetIterator begin_offsets, OffsetIterator end_offsets, BinaryFunction reduce_op = BinaryFunction(), InitValueType initial_value = InitValueType(), hipStream_t stream = 0, bool debug_synchronous = false)#

Parallel segmented reduction primitive for device level.

segmented_reduce function performs a device-wide reduction operation across multiple sequences using binary reduce_op operator.

Overview

  • Returns the required size of temporary_storage in storage_size if temporary_storage in a null pointer.

  • Ranges specified by input must have at least size elements, output must have segments elements.

  • Ranges specified by begin_offsets and end_offsets must have at least segments elements. They may use the same sequence offsets of at least segments + 1 elements: offsets for begin_offsets and offsets + 1 for end_offsets.

Example

In this example a device-level segmented min-reduction operation is performed on an array of integer values (shorts are reduced into ints) using custom operator.

#include <rocprim/rocprim.hpp>

// custom reduce function
auto min_op =
    [] __device__ (int a, int b) -> int
    {
        return a < b ? a : b;
    };

// Prepare input and output (declare pointers, allocate device memory etc.)
unsigned int segments;   // e.g., 3
short * input;           // e.g., [4, 7, 6, 2, 5, 1, 3, 8]
int * output;            // empty array of 3 elements
int * offsets;           // e.g. [0, 2, 3, 8]
int init_value;          // e.g., 9

size_t temporary_storage_size_bytes;
void * temporary_storage_ptr = nullptr;
// Get required size of the temporary storage
rocprim::segmented_reduce(
    temporary_storage_ptr, temporary_storage_size_bytes,
    input, output,
    segments, offsets, offsets + 1,
    min_op, init_value
);

// allocate temporary storage
hipMalloc(&temporary_storage_ptr, temporary_storage_size_bytes);

// perform segmented reduction
rocprim::segmented_reduce(
    temporary_storage_ptr, temporary_storage_size_bytes,
    input, output,
    segments, offsets, offsets + 1,
    min_op, init_value
);
// output: [4, 6, 1]

Template Parameters:
  • Config – - [optional] Configuration of the primitive, must be default_config or reduce_config.

  • InputIterator – - random-access iterator type of the input range. Must meet the requirements of a C++ InputIterator concept. It can be a simple pointer type.

  • OutputIterator – - random-access iterator type of the output range. Must meet the requirements of a C++ OutputIterator concept. It can be a simple pointer type.

  • OffsetIterator – - random-access iterator type of segment offsets. Must meet the requirements of a C++ OutputIterator concept. It can be a simple pointer type.

  • BinaryFunction – - type of binary function used for reduction. Default type is rocprim::plus<T>, where T is a value_type of InputIterator.

  • InitValueType – - type of the initial value.

Parameters:
  • temporary_storage[in] - pointer to a device-accessible temporary storage. When a null pointer is passed, the required allocation size (in bytes) is written to storage_size and function returns without performing the reduction operation.

  • storage_size[inout] - reference to a size (in bytes) of temporary_storage.

  • input[in] - iterator to the first element in the range to reduce.

  • output[out] - iterator to the first element in the output range.

  • segments[in] - number of segments in the input range.

  • begin_offsets[in] - iterator to the first element in the range of beginning offsets.

  • end_offsets[in] - iterator to the first element in the range of ending offsets.

  • initial_value[in] - initial value to start the reduction.

  • reduce_op[in] - binary operation function object that will be used for reduction. The signature of the function should be equivalent to the following: T f(const T &a, const T &b);. The signature does not need to have const &, but function object must not modify the objects passed to it. The default value is BinaryFunction().

  • stream[in] - [optional] HIP stream object. The default is 0 (default stream).

  • debug_synchronous[in] - [optional] If true, synchronization after every kernel launch is forced in order to check for errors. The default value is false.

Returns:

hipSuccess (0) after successful reduction; otherwise a HIP runtime error of type hipError_t.

reduce_by_key#

template<class Config = default_config, class KeysInputIterator, class ValuesInputIterator, class UniqueOutputIterator, class AggregatesOutputIterator, class UniqueCountOutputIterator, class BinaryFunction = ::rocprim::plus<typename std::iterator_traits<ValuesInputIterator>::value_type>, class KeyCompareFunction = ::rocprim::equal_to<typename std::iterator_traits<KeysInputIterator>::value_type>>
inline hipError_t rocprim::reduce_by_key(void *temporary_storage, size_t &storage_size, KeysInputIterator keys_input, ValuesInputIterator values_input, const size_t size, UniqueOutputIterator unique_output, AggregatesOutputIterator aggregates_output, UniqueCountOutputIterator unique_count_output, BinaryFunction reduce_op = BinaryFunction(), KeyCompareFunction key_compare_op = KeyCompareFunction(), hipStream_t stream = 0, bool debug_synchronous = false)#

Parallel reduce-by-key primitive for device level.

reduce_by_key function performs a device-wide reduction operation on groups of consecutive values having the same key using binary reduce_op operator. The first key of each group is copied to unique_output and the reduction of the group is written to aggregates_output. The total number of groups is written to unique_count_output.

Overview

  • Supports non-commutative reduction operators. However, a reduction operator should be associative.

  • When used with non-associative functions (e.g. floating point arithmetic operations):

    • the results may be non-deterministic and/or vary in precision,

    • and bit-wise reproducibility is not guaranteed, that is, results from multiple runs using the same input values on the same device may not be bit-wise identical. If deterministic behavior is required, Use rocprim::deterministic_reduce_by_key instead.

  • Returns the required size of temporary_storage in storage_size if temporary_storage in a null pointer.

  • Ranges specified by keys_input and values_input must have at least size elements.

  • Range specified by unique_count_output must have at least 1 element.

  • Ranges specified by unique_output and aggregates_output must have at least *unique_count_output (i.e. the number of unique keys) elements.

Example

In this example a device-level sum operation is performed on an array of integer values and integer keys.

#include <rocprim/rocprim.hpp>

// Prepare input and output (declare pointers, allocate device memory etc.)
size_t input_size;          // e.g., 8
int * keys_input;           // e.g., [1, 1, 1, 2, 10, 10, 10, 88]
int * values_input;         // e.g., [1, 2, 3, 4,  5,  6,  7,  8]
int * unique_output;        // empty array of at least 4 elements
int * aggregates_output;    // empty array of at least 4 elements
int * unique_count_output;  // empty array of 1 element

size_t temporary_storage_size_bytes;
void * temporary_storage_ptr = nullptr;
// Get required size of the temporary storage
rocprim::reduce_by_key(
    temporary_storage_ptr, temporary_storage_size_bytes,
    keys_input, values_input, input_size,
    unique_output, aggregates_output, unique_count_output
);

// allocate temporary storage
hipMalloc(&temporary_storage_ptr, temporary_storage_size_bytes);

// perform reduction
rocprim::reduce_by_key(
    temporary_storage_ptr, temporary_storage_size_bytes,
    keys_input, values_input, input_size,
    unique_output, aggregates_output, unique_count_output
);
// unique_output:       [1, 2, 10, 88]
// aggregates_output:   [6, 4, 18,  8]
// unique_count_output: [4]

Template Parameters:
  • Config – - [optional] Configuration of the primitive, must be default_config or reduce_by_key_config.

  • KeysInputIterator – - random-access iterator type of the input range. Must meet the requirements of a C++ InputIterator concept. It can be a simple pointer type.

  • ValuesInputIterator – - random-access iterator type of the input range. Must meet the requirements of a C++ InputIterator concept. It can be a simple pointer type.

  • UniqueOutputIterator – - random-access iterator type of the output range. Must meet the requirements of a C++ OutputIterator concept. It can be a simple pointer type.

  • AggregatesOutputIterator – - random-access iterator type of the output range. Must meet the requirements of a C++ OutputIterator concept. It can be a simple pointer type.

  • UniqueCountOutputIterator – - random-access iterator type of the output range. Must meet the requirements of a C++ OutputIterator concept. It can be a simple pointer type.

  • BinaryFunction – - type of binary function used for reduction. Default type is rocprim::plus<T>, where T is a value_type of ValuesInputIterator.

  • KeyCompareFunction – - type of binary function used to determine keys equality. Default type is rocprim::equal_to<T>, where T is a value_type of KeysInputIterator.

Parameters:
  • temporary_storage[in] - pointer to a device-accessible temporary storage. When a null pointer is passed, the required allocation size (in bytes) is written to storage_size and function returns without performing the reduction operation.

  • storage_size[inout] - reference to a size (in bytes) of temporary_storage.

  • keys_input[in] - iterator to the first element in the range of keys.

  • values_input[in] - iterator to the first element in the range of values to reduce.

  • size[in] - number of element in the input range.

  • unique_output[out] - iterator to the first element in the output range of unique keys.

  • aggregates_output[out] - iterator to the first element in the output range of reductions.

  • unique_count_output[out] - iterator to total number of groups.

  • reduce_op[in] - binary operation function object that will be used for reduction. The signature of the function should be equivalent to the following: T f(const T &a, const T &b);. The signature does not need to have const &, but function object must not modify the objects passed to it and must not have any side effects since the function may be called on uninitalized data. Default is BinaryFunction().

  • key_compare_op[in] - binary operation function object that will be used to determine key equality. The signature of the function should be equivalent to the following: bool f(const T &a, const T &b);. The signature does not need to have const &, but function object must not modify the objects passed to it and must not have any side effects since the function may be called on uninitalized data. Default is KeyCompareFunction().

  • stream[in] - [optional] HIP stream object. Default is 0 (default stream).

  • debug_synchronous[in] - [optional] If true, synchronization after every kernel launch is forced in order to check for errors. Default value is false.

Returns:

hipSuccess (0) after successful reduction; otherwise a HIP runtime error of type hipError_t.

deterministic#

template<class Config = default_config, class KeysInputIterator, class ValuesInputIterator, class UniqueOutputIterator, class AggregatesOutputIterator, class UniqueCountOutputIterator, class BinaryFunction = ::rocprim::plus<typename std::iterator_traits<ValuesInputIterator>::value_type>, class KeyCompareFunction = ::rocprim::equal_to<typename std::iterator_traits<KeysInputIterator>::value_type>>
inline hipError_t rocprim::deterministic_reduce_by_key(void *temporary_storage, size_t &storage_size, KeysInputIterator keys_input, ValuesInputIterator values_input, const size_t size, UniqueOutputIterator unique_output, AggregatesOutputIterator aggregates_output, UniqueCountOutputIterator unique_count_output, BinaryFunction reduce_op = BinaryFunction(), KeyCompareFunction key_compare_op = KeyCompareFunction(), hipStream_t stream = 0, bool debug_synchronous = false)#

Bitwise-reproducible parallel reduce-by-key primitive for device level.

This function behaves the same as reduce_by_key(), except that unlike reduce_by_key(), it provides run-to-run deterministic behavior for non-associative scan operators like floating point arithmetic operations. Refer to the documentation for rocprim::reduce_by_key for a detailed description of this function.