Reduce#

Class#

template<class T, unsigned int BlockSizeX, block_reduce_algorithm Algorithm = block_reduce_algorithm::default_algorithm, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
class block_reduce#

The block_reduce class is a block level parallel primitive which provides methods for performing reductions operations on items partitioned across threads in a block.

Overview

  • Supports non-commutative reduce operators. However, a reduce operator should be associative. When used with non-associative functions the results may be non-deterministic and/or vary in precision.

  • Computation can more efficient when:

    • ItemsPerThread is greater than one,

    • T is an arithmetic type,

    • reduce operation is simple addition operator, and

    • the number of threads in the block is a multiple of the hardware warp size (see rocprim::device_warp_size()).

  • block_reduce has two alternative implementations: block_reduce_algorithm::using_warp_reduce, block_reduce_algorithm::raking_reduce and block_reduce_algorithm::raking_reduce_commutative_only.

  • If the block sizes less than 64 only one warp reduction is used. The block reduction algorithm stores the result only in the first thread(lane_id = 0 warp_id = 0), when the block size is larger then the warp size.

Examples

In the examples reduce operation is performed on block of 192 threads, each provides one int value, result is returned using the same variable as for input.

__global__ void example_kernel(...)
{
    // specialize warp_reduce for int and logical warp of 192 threads
    using block_reduce_int = rocprim::block_reduce<int, 192>;
    // allocate storage in shared memory
    __shared__ block_reduce_int::storage_type storage;

    int value = ...;
    // execute reduce
    block_reduce_int().reduce(
        value, // input
        value, // output
        storage
    );
    ...
}

Template Parameters:

Public Types

using storage_type = typename base_type::storage_type#

Struct used to allocate a temporary memory that is required for thread communication during operations provided by related parallel primitive.

Depending on the implemention the operations exposed by parallel primitive may require a temporary storage for thread communication. The storage should be allocated using keywords __shared__. It can be aliased to an externally allocated memory, or be a part of a union type with other storage types to increase shared memory reusability.

Public Functions

template<class BinaryFunction = ::rocprim::plus<T>>
__device__ inline void reduce(T input, T &output, storage_type &storage, BinaryFunction reduce_op = BinaryFunction())#

Performs reduction across threads in a block.

Storage reusage

Synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().

Examples

The examples present min reduce operations performed on a block of 256 threads, each provides one float value.

__global__ void example_kernel(...) // blockDim.x = 256
{
    // specialize block_reduce for float and block of 256 threads
    using block_reduce_f = rocprim::block_reduce<float, 256>;
    // allocate storage in shared memory for the block
    __shared__ block_reduce_float::storage_type storage;

    float input = ...;
    float output;
    // execute min reduce
    block_reduce_float().reduce(
        input,
        output,
        storage,
        rocprim::minimum<float>()
    );
    ...
}

If the input values across threads in a block are {1, -2, 3, -4, ..., 255, -256}, then output value will be {-256}.

Template Parameters:

BinaryFunction – - type of binary function used for reduce. Default type is rocprim::plus<T>.

Parameters:
  • input[in] - thread input value.

  • output[out] - reference to a thread output value. May be aliased with input.

  • storage[in] - reference to a temporary storage object of type storage_type.

  • reduce_op[in] - binary operation function object that will be used for reduce. The signature of the function should be equivalent to the following: T f(const T &a, const T &b);. The signature does not need to have const &, but function object must not modify the objects passed to it.

template<class BinaryFunction = ::rocprim::plus<T>>
__device__ inline void reduce(T input, T &output, BinaryFunction reduce_op = BinaryFunction())#

Performs reduction across threads in a block.

This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

  • This overload does not accept storage argument. Required shared memory is allocated by the method itself.

Template Parameters:

BinaryFunction – - type of binary function used for reduce. Default type is rocprim::plus<T>.

Parameters:
  • input[in] - thread input value.

  • output[out] - reference to a thread output value. May be aliased with input.

  • reduce_op[in] - binary operation function object that will be used for reduce. The signature of the function should be equivalent to the following: T f(const T &a, const T &b);. The signature does not need to have const &, but function object must not modify the objects passed to it.

template<unsigned int ItemsPerThread, class BinaryFunction = ::rocprim::plus<T>>
__device__ inline void reduce(T (&input)[ItemsPerThread], T &output, storage_type &storage, BinaryFunction reduce_op = BinaryFunction())#

Performs reduction across threads in a block.

Storage reusage

Synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().

Examples

The examples present maximum reduce operations performed on a block of 128 threads, each provides two long value.

__global__ void example_kernel(...) // blockDim.x = 128
{
    // specialize block_reduce for long and block of 128 threads
    using block_reduce_f = rocprim::block_reduce<long, 128>;
    // allocate storage in shared memory for the block
    __shared__ block_reduce_long::storage_type storage;

    long input[2] = ...;
    long output[2];
    // execute max reduce
    block_reduce_long().reduce(
        input,
        output,
        storage,
        rocprim::maximum<long>()
    );
    ...
}

If the input values across threads in a block are {-1, 2, -3, 4, ..., -255, 256}, then output value will be {256}.

Template Parameters:
  • ItemsPerThread – - number of items in the input array.

  • BinaryFunction – - type of binary function used for reduce. Default type is rocprim::plus<T>.

Parameters:
  • input[in] - reference to an array containing thread input values.

  • output[out] - reference to a thread output array. May be aliased with input.

  • storage[in] - reference to a temporary storage object of type storage_type.

  • reduce_op[in] - binary operation function object that will be used for reduce. The signature of the function should be equivalent to the following: T f(const T &a, const T &b);. The signature does not need to have const &, but function object must not modify the objects passed to it.

template<unsigned int ItemsPerThread, class BinaryFunction = ::rocprim::plus<T>>
__device__ inline void reduce(T (&input)[ItemsPerThread], T &output, BinaryFunction reduce_op = BinaryFunction())#

Performs reduction across threads in a block.

This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

  • This overload does not accept storage argument. Required shared memory is allocated by the method itself.

Template Parameters:
  • ItemsPerThread – - number of items in the input array.

  • BinaryFunction – - type of binary function used for reduce. Default type is rocprim::plus<T>.

Parameters:
  • input[in] - reference to an array containing thread input values.

  • output[out] - reference to a thread output array. May be aliased with input.

  • reduce_op[in] - binary operation function object that will be used for reduce. The signature of the function should be equivalent to the following: T f(const T &a, const T &b);. The signature does not need to have const &, but function object must not modify the objects passed to it.

template<class BinaryFunction = ::rocprim::plus<T>>
__device__ inline void reduce(T input, T &output, unsigned int valid_items, storage_type &storage, BinaryFunction reduce_op = BinaryFunction())#

Performs reduction across threads in a block.

Storage reusage

Synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().

Examples

The examples present min reduce operations performed on a block of 256 threads, each provides one float value.

__global__ void example_kernel(...) // blockDim.x = 256
{
    // specialize block_reduce for float and block of 256 threads
    using block_reduce_f = rocprim::block_reduce<float, 256>;
    // allocate storage in shared memory for the block
    __shared__ block_reduce_float::storage_type storage;

    float input = ...;
    unsigned int valid_items = 250;
    float output;
    // execute min reduce
    block_reduce_float().reduce(
        input,
        output,
        valid_items,
        storage,
        rocprim::minimum<float>()
    );
    ...
}

Template Parameters:

BinaryFunction – - type of binary function used for reduce. Default type is rocprim::plus<T>.

Parameters:
  • input[in] - thread input value.

  • output[out] - reference to a thread output value. May be aliased with input.

  • valid_items[in] - number of items that will be reduced in the block.

  • storage[in] - reference to a temporary storage object of type storage_type.

  • reduce_op[in] - binary operation function object that will be used for reduce. The signature of the function should be equivalent to the following: T f(const T &a, const T &b);. The signature does not need to have const &, but function object must not modify the objects passed to it.

template<class BinaryFunction = ::rocprim::plus<T>>
__device__ inline void reduce(T input, T &output, unsigned int valid_items, BinaryFunction reduce_op = BinaryFunction())#

Performs reduction across threads in a block.

This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

  • This overload does not accept storage argument. Required shared memory is allocated by the method itself.

Template Parameters:
  • ItemsPerThread – - number of items in the input array.

  • BinaryFunction – - type of binary function used for reduce. Default type is rocprim::plus<T>.

Parameters:
  • input[in] - reference to an array containing thread input values.

  • output[out] - reference to a thread output array. May be aliased with input.

  • valid_items[in] - number of items that will be reduced in the block.

  • reduce_op[in] - binary operation function object that will be used for reduce. The signature of the function should be equivalent to the following: T f(const T &a, const T &b);. The signature does not need to have const &, but function object must not modify the objects passed to it.

Algorithms#

enum rocprim::block_reduce_algorithm#

Available algorithms for block_reduce primitive.

Values:

enumerator using_warp_reduce#

A warp_reduce based algorithm.

enumerator raking_reduce#

An algorithm which limits calculations to a single hardware warp.

enumerator raking_reduce_commutative_only#

raking reduce that supports only commutative operators

enumerator default_algorithm#

Default block_reduce algorithm.