Scan#

Class#

template<class T, unsigned int BlockSizeX, block_scan_algorithm Algorithm = block_scan_algorithm::default_algorithm, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
class block_scan#

The block_scan class is a block level parallel primitive which provides methods for performing inclusive and exclusive scan operations of items partitioned across threads in a block.

Overview

  • Supports non-commutative scan operators. However, a scan operator should be associative. When used with non-associative functions the results may be non-deterministic and/or vary in precision.

  • Computation can more efficient when:

    • ItemsPerThread is greater than one,

    • T is an arithmetic type,

    • scan operation is simple addition operator, and

    • the number of threads in the block is a multiple of the hardware warp size (see rocprim::device_warp_size()).

  • block_scan has two alternative implementations: block_scan_algorithm::using_warp_scan and block_scan_algorithm::reduce_then_scan.

Examples

In the examples scan operation is performed on block of 192 threads, each provides one int value, result is returned using the same variable as for input.

__global__ void example_kernel(...)
{
    // specialize warp_scan for int and logical warp of 192 threads
    using block_scan_int = rocprim::block_scan<int, 192>;
    // allocate storage in shared memory
    __shared__ block_scan_int::storage_type storage;

    int value = ...;
    // execute inclusive scan
    block_scan_int().inclusive_scan(
        value, // input
        value, // output
        storage
    );
    ...
}

Template Parameters:
  • T – - the input/output type.

  • BlockSizeX – - the number of threads in a block’s x dimension.

  • Algorithm – - selected scan algorithm, block_scan_algorithm::default_algorithm by default.

  • BlockSizeY – - the number of threads in a block’s y dimension, defaults to 1.

  • BlockSizeZ – - the number of threads in a block’s z dimension, defaults to 1.

Public Types

using storage_type = typename base_type::storage_type#

Struct used to allocate a temporary memory that is required for thread communication during operations provided by related parallel primitive.

Depending on the implemention the operations exposed by parallel primitive may require a temporary storage for thread communication. The storage should be allocated using keywords . It can be aliased to an externally allocated memory, or be a part of a union type with other storage types to increase shared memory reusability.

Public Functions

template<class BinaryFunction = ::rocprim::plus<T>>
__device__ inline void inclusive_scan(T input, T &output, storage_type &storage, BinaryFunction scan_op = BinaryFunction())#

Performs inclusive scan across threads in a block.

Storage reusage

Synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().

Examples

The examples present inclusive min scan operations performed on a block of 256 threads, each provides one float value.

__global__ void example_kernel(...) // blockDim.x = 256
{
    // specialize block_scan for float and block of 256 threads
    using block_scan_f = rocprim::block_scan<float, 256>;
    // allocate storage in shared memory for the block
    __shared__ block_scan_float::storage_type storage;

    float input = ...;
    float output;
    // execute inclusive min scan
    block_scan_float().inclusive_scan(
        input,
        output,
        storage,
        rocprim::minimum<float>()
    );
    ...
}

If the input values across threads in a block are {1, -2, 3, -4, ..., 255, -256}, then output values in will be {1, -2, -2, -4, ..., -254, -256}.

Template Parameters:

BinaryFunction – - type of binary function used for scan. Default type is rocprim::plus<T>.

Parameters:
  • input[in] - thread input value.

  • output[out] - reference to a thread output value. May be aliased with input.

  • storage[in] - reference to a temporary storage object of type storage_type.

  • scan_op[in] - binary operation function object that will be used for scan. The signature of the function should be equivalent to the following: T f(const T &a, const T &b);. The signature does not need to have const &, but function object must not modify the objects passed to it.

template<class BinaryFunction = ::rocprim::plus<T>>
__device__ inline void inclusive_scan(T input, T &output, BinaryFunction scan_op = BinaryFunction())#

Performs inclusive scan across threads in a block.

This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

  • This overload does not accept storage argument. Required shared memory is allocated by the method itself.

Template Parameters:

BinaryFunction – - type of binary function used for scan. Default type is rocprim::plus<T>.

Parameters:
  • input[in] - thread input value.

  • output[out] - reference to a thread output value. May be aliased with input.

  • scan_op[in] - binary operation function object that will be used for scan. The signature of the function should be equivalent to the following: T f(const T &a, const T &b);. The signature does not need to have const &, but function object must not modify the objects passed to it.

template<class BinaryFunction = ::rocprim::plus<T>>
__device__ inline void inclusive_scan(T input, T &output, T &reduction, storage_type &storage, BinaryFunction scan_op = BinaryFunction())#

Performs inclusive scan and reduction across threads in a block.

Storage reusage

Synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().

Examples

The examples present inclusive min scan operations performed on a block of 256 threads, each provides one float value.

__global__ void example_kernel(...) // blockDim.x = 256
{
    // specialize block_scan for float and block of 256 threads
    using block_scan_f = rocprim::block_scan<float, 256>;
    // allocate storage in shared memory for the block
    __shared__ block_scan_float::storage_type storage;

    float input = ...;
    float output;
    float reduction;
    // execute inclusive min scan
    block_scan_float().inclusive_scan(
        input,
        output,
        reduction,
        storage,
        rocprim::minimum<float>()
    );
    ...
}

If the input values across threads in a block are {1, -2, 3, -4, ..., 255, -256}, then output values in will be {1, -2, -2, -4, ..., -254, -256}, and the reduction will be -256.

Template Parameters:

BinaryFunction – - type of binary function used for scan. Default type is rocprim::plus<T>.

Parameters:
  • input[in] - thread input value.

  • output[out] - reference to a thread output value. May be aliased with input.

  • reduction[out] - result of reducing of all input values in a block.

  • storage[in] - reference to a temporary storage object of type storage_type.

  • scan_op[in] - binary operation function object that will be used for scan. The signature of the function should be equivalent to the following: T f(const T &a, const T &b);. The signature does not need to have const &, but function object must not modify the objects passed to it.

template<class BinaryFunction = ::rocprim::plus<T>>
__device__ inline void inclusive_scan(T input, T &output, T &reduction, BinaryFunction scan_op = BinaryFunction())#

Performs inclusive scan and reduction across threads in a block.

This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

  • This overload does not accept storage argument. Required shared memory is allocated by the method itself.

Template Parameters:

BinaryFunction – - type of binary function used for scan. Default type is rocprim::plus<T>.

Parameters:
  • input[in] - thread input value.

  • output[out] - reference to a thread output value. May be aliased with input.

  • reduction[out] - result of reducing of all input values in a block.

  • scan_op[in] - binary operation function object that will be used for scan. The signature of the function should be equivalent to the following: T f(const T &a, const T &b);. The signature does not need to have const &, but function object must not modify the objects passed to it.

template<class PrefixCallback, class BinaryFunction = ::rocprim::plus<T>>
__device__ inline void inclusive_scan(T input, T &output, storage_type &storage, PrefixCallback &prefix_callback_op, BinaryFunction scan_op)#

Performs inclusive scan across threads in a block, and uses prefix_callback_op to generate prefix value for the whole block.

Storage reusage

Synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().

Examples

The examples present inclusive prefix sum operations performed on a block of 256 threads, each thread provides one int value.

struct my_block_prefix
{
    int prefix;

    __device__ my_block_prefix(int prefix) : prefix(prefix) {}

    __device__ int operator()(int block_reduction)
    {
        int old_prefix = prefix;
        prefix = prefix + block_reduction;
        return old_prefix;
    }
};

__global__ void example_kernel(...) // blockDim.x = 256
{
    // specialize block_scan for int and block of 256 threads
    using block_scan_f = rocprim::block_scan<int, 256>;
    // allocate storage in shared memory for the block
    __shared__ block_scan_int::storage_type storage;

    // init prefix functor
    my_block_prefix prefix_callback(10);

    int input;
    int output;
    // execute inclusive prefix sum
    block_scan_int().inclusive_scan(
        input,
        output,
        storage,
        prefix_callback,
        rocprim::plus<int>()
    );
    ...
}

If the input values across threads in a block are {1, 1, 1, ..., 1}, then output values in will be {11, 12, 13, ..., 266}, and the prefix will be 266.

Template Parameters:
  • PrefixCallback – - type of the unary function object used for generating block-wide prefix value for the scan operation.

  • BinaryFunction – - type of binary function used for scan. Default type is rocprim::plus<T>.

Parameters:
  • input[in] - thread input value.

  • output[out] - reference to a thread output value. May be aliased with input.

  • storage[in] - reference to a temporary storage object of type storage_type.

  • prefix_callback_op[inout] - function object for generating block prefix value. The signature of the prefix_callback_op should be equivalent to the following: T f(const T &block_reduction);. The signature does not need to have const &, but function object must not modify the objects passed to it. The object will be called by the first warp of the block with block reduction of input values as input argument. The result of the first thread will be used as the block-wide prefix.

  • scan_op[in] - binary operation function object that will be used for scan. The signature of the function should be equivalent to the following: T f(const T &a, const T &b);. The signature does not need to have const &, but function object must not modify the objects passed to it.

template<unsigned int ItemsPerThread, class BinaryFunction = ::rocprim::plus<T>>
__device__ inline void inclusive_scan(T (&input)[ItemsPerThread], T (&output)[ItemsPerThread], storage_type &storage, BinaryFunction scan_op = BinaryFunction())#

Performs inclusive scan across threads in a block.

Storage reusage

Synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().

Examples

The examples present inclusive maximum scan operations performed on a block of 128 threads, each provides two long value.

__global__ void example_kernel(...) // blockDim.x = 128
{
    // specialize block_scan for long and block of 128 threads
    using block_scan_f = rocprim::block_scan<long, 128>;
    // allocate storage in shared memory for the block
    __shared__ block_scan_long::storage_type storage;

    long input[2] = ...;
    long output[2];
    // execute inclusive min scan
    block_scan_long().inclusive_scan(
        input,
        output,
        storage,
        rocprim::maximum<long>()
    );
    ...
}

If the input values across threads in a block are {-1, 2, -3, 4, ..., -255, 256}, then output values in will be {-1, 2, 2, 4, ..., 254, 256}.

Template Parameters:
  • ItemsPerThread – - number of items in the input array.

  • BinaryFunction – - type of binary function used for scan. Default type is rocprim::plus<T>.

Parameters:
  • input[in] - reference to an array containing thread input values.

  • output[out] - reference to a thread output array. May be aliased with input.

  • storage[in] - reference to a temporary storage object of type storage_type.

  • scan_op[in] - binary operation function object that will be used for scan. The signature of the function should be equivalent to the following: T f(const T &a, const T &b);. The signature does not need to have const &, but function object must not modify the objects passed to it.

template<unsigned int ItemsPerThread, class BinaryFunction = ::rocprim::plus<T>>
__device__ inline void inclusive_scan(T (&input)[ItemsPerThread], T (&output)[ItemsPerThread], BinaryFunction scan_op = BinaryFunction())#

Performs inclusive scan across threads in a block.

This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

  • This overload does not accept storage argument. Required shared memory is allocated by the method itself.

Template Parameters:
  • ItemsPerThread – - number of items in the input array.

  • BinaryFunction – - type of binary function used for scan. Default type is rocprim::plus<T>.

Parameters:
  • input[in] - reference to an array containing thread input values.

  • output[out] - reference to a thread output array. May be aliased with input.

  • scan_op[in] - binary operation function object that will be used for scan. The signature of the function should be equivalent to the following: T f(const T &a, const T &b);. The signature does not need to have const &, but function object must not modify the objects passed to it.

template<unsigned int ItemsPerThread, class BinaryFunction = ::rocprim::plus<T>>
__device__ inline void inclusive_scan(T (&input)[ItemsPerThread], T (&output)[ItemsPerThread], T &reduction, storage_type &storage, BinaryFunction scan_op = BinaryFunction())#

Performs inclusive scan and reduction across threads in a block.

Storage reusage

Synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().

Examples

The examples present inclusive maximum scan operations performed on a block of 128 threads, each provides two long value.

__global__ void example_kernel(...) // blockDim.x = 128
{
    // specialize block_scan for long and block of 128 threads
    using block_scan_f = rocprim::block_scan<long, 128>;
    // allocate storage in shared memory for the block
    __shared__ block_scan_long::storage_type storage;

    long input[2] = ...;
    long output[2];
    long reduction;
    // execute inclusive min scan
    block_scan_long().inclusive_scan(
        input,
        output,
        reduction,
        storage,
        rocprim::maximum<long>()
    );
    ...
}

If the input values across threads in a block are {-1, 2, -3, 4, ..., -255, 256}, then output values in will be {-1, 2, 2, 4, ..., 254, 256} and the reduction will be 256.

Template Parameters:
  • ItemsPerThread – - number of items in the input array.

  • BinaryFunction – - type of binary function used for scan. Default type is rocprim::plus<T>.

Parameters:
  • input[in] - reference to an array containing thread input values.

  • output[out] - reference to a thread output array. May be aliased with input.

  • reduction[out] - result of reducing of all input values in a block.

  • storage[in] - reference to a temporary storage object of type storage_type.

  • scan_op[in] - binary operation function object that will be used for scan. The signature of the function should be equivalent to the following: T f(const T &a, const T &b);. The signature does not need to have const &, but function object must not modify the objects passed to it.

template<unsigned int ItemsPerThread, class BinaryFunction = ::rocprim::plus<T>>
__device__ inline void inclusive_scan(T (&input)[ItemsPerThread], T (&output)[ItemsPerThread], T &reduction, BinaryFunction scan_op = BinaryFunction())#

Performs inclusive scan and reduction across threads in a block.

This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

  • This overload does not accept storage argument. Required shared memory is allocated by the method itself.

Template Parameters:
  • ItemsPerThread – - number of items in the input array.

  • BinaryFunction – - type of binary function used for scan. Default type is rocprim::plus<T>.

Parameters:
  • input[in] - reference to an array containing thread input values.

  • output[out] - reference to a thread output array. May be aliased with input.

  • reduction[out] - result of reducing of all input values in a block.

  • scan_op[in] - binary operation function object that will be used for scan. The signature of the function should be equivalent to the following: T f(const T &a, const T &b);. The signature does not need to have const &, but function object must not modify the objects passed to it.

template<unsigned int ItemsPerThread, class PrefixCallback, class BinaryFunction>
__device__ inline void inclusive_scan(T (&input)[ItemsPerThread], T (&output)[ItemsPerThread], storage_type &storage, PrefixCallback &prefix_callback_op, BinaryFunction scan_op)#

Performs inclusive scan across threads in a block, and uses prefix_callback_op to generate prefix value for the whole block.

Storage reusage

Synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().

Examples

The examples present inclusive prefix sum operations performed on a block of 128 threads, each thread provides two int value.

struct my_block_prefix
{
    int prefix;

    __device__ my_block_prefix(int prefix) : prefix(prefix) {}

    __device__ int operator()(int block_reduction)
    {
        int old_prefix = prefix;
        prefix = prefix + block_reduction;
        return old_prefix;
    }
};

__global__ void example_kernel(...) // blockDim.x = 128
{
    // specialize block_scan for int and block of 128 threads
    using block_scan_f = rocprim::block_scan<int, 128>;
    // allocate storage in shared memory for the block
    __shared__ block_scan_int::storage_type storage;

    // init prefix functor
    my_block_prefix prefix_callback(10);

    int input[2] = ...;
    int output[2];
    // execute inclusive prefix sum
    block_scan_int().inclusive_scan(
        input,
        output,
        storage,
        prefix_callback,
        rocprim::plus<int>()
    );
    ...
}

If the input values across threads in a block are {1, 1, 1, ..., 1}, then output values in will be {11, 12, 13, ..., 266}, and the prefix will be 266.

Template Parameters:
  • ItemsPerThread – - number of items in the input array.

  • PrefixCallback – - type of the unary function object used for generating block-wide prefix value for the scan operation.

  • BinaryFunction – - type of binary function used for scan. Default type is rocprim::plus<T>.

Parameters:
  • input[in] - reference to an array containing thread input values.

  • output[out] - reference to a thread output array. May be aliased with input.

  • storage[in] - reference to a temporary storage object of type storage_type.

  • prefix_callback_op[inout] - function object for generating block prefix value. The signature of the prefix_callback_op should be equivalent to the following: T f(const T &block_reduction);. The signature does not need to have const &, but function object must not modify the objects passed to it. The object will be called by the first warp of the block with block reduction of input values as input argument. The result of the first thread will be used as the block-wide prefix.

  • scan_op[in] - binary operation function object that will be used for scan. The signature of the function should be equivalent to the following: T f(const T &a, const T &b);. The signature does not need to have const &, but function object must not modify the objects passed to it.

template<class BinaryFunction = ::rocprim::plus<T>>
__device__ inline void exclusive_scan(T input, T &output, T init, storage_type &storage, BinaryFunction scan_op = BinaryFunction())#

Performs exclusive scan across threads in a block.

Storage reusage

Synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().

Examples

The examples present exclusive min scan operations performed on a block of 256 threads, each provides one float value.

__global__ void example_kernel(...) // blockDim.x = 256
{
    // specialize block_scan for float and block of 256 threads
    using block_scan_f = rocprim::block_scan<float, 256>;
    // allocate storage in shared memory for the block
    __shared__ block_scan_float::storage_type storage;

    float init = ...;
    float input = ...;
    float output;
    // execute exclusive min scan
    block_scan_float().exclusive_scan(
        input,
        output,
        init,
        storage,
        rocprim::minimum<float>()
    );
    ...
}

If the input values across threads in a block are {1, -2, 3, -4, ..., 255, -256} and init is 0, then output values in will be {0, 0, -2, -2, -4, ..., -254, -254}.

Template Parameters:

BinaryFunction – - type of binary function used for scan. Default type is rocprim::plus<T>.

Parameters:
  • input[in] - thread input value.

  • output[out] - reference to a thread output value. May be aliased with input.

  • storage[in] - reference to a temporary storage object of type storage_type.

  • init[in] - initial value used to start the exclusive scan. Should be the same for all threads in a block.

  • scan_op[in] - binary operation function object that will be used for scan. The signature of the function should be equivalent to the following: T f(const T &a, const T &b);. The signature does not need to have const &, but function object must not modify the objects passed to it.

template<class BinaryFunction = ::rocprim::plus<T>>
__device__ inline void exclusive_scan(T input, T &output, T init, BinaryFunction scan_op = BinaryFunction())#

Performs exclusive scan across threads in a block.

This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

  • This overload does not accept storage argument. Required shared memory is allocated by the method itself.

Template Parameters:

BinaryFunction – - type of binary function used for scan. Default type is rocprim::plus<T>.

Parameters:
  • input[in] - thread input value.

  • output[out] - reference to a thread output value. May be aliased with input.

  • init[in] - initial value used to start the exclusive scan. Should be the same for all threads in a block.

  • scan_op[in] - binary operation function object that will be used for scan. The signature of the function should be equivalent to the following: T f(const T &a, const T &b);. The signature does not need to have const &, but function object must not modify the objects passed to it.

template<class BinaryFunction = ::rocprim::plus<T>>
__device__ inline void exclusive_scan(T input, T &output, T init, T &reduction, storage_type &storage, BinaryFunction scan_op = BinaryFunction())#

Performs exclusive scan and reduction across threads in a block.

Storage reusage

Synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().

Examples

The examples present exclusive min scan operations performed on a block of 256 threads, each provides one float value.

__global__ void example_kernel(...) // blockDim.x = 256
{
    // specialize block_scan for float and block of 256 threads
    using block_scan_f = rocprim::block_scan<float, 256>;
    // allocate storage in shared memory for the block
    __shared__ block_scan_float::storage_type storage;

    float init = 0;
    float input = ...;
    float output;
    float reduction;
    // execute exclusive min scan
    block_scan_float().exclusive_scan(
        input,
        output,
        init,
        reduction,
        storage,
        rocprim::minimum<float>()
    );
    ...
}

If the input values across threads in a block are {1, -2, 3, -4, ..., 255, -256} and init is 0, then output values in will be {0, 0, -2, -2, -4, ..., -254, -254} and the reduction will be -256.

Template Parameters:

BinaryFunction – - type of binary function used for scan. Default type is rocprim::plus<T>.

Parameters:
  • input[in] - thread input value.

  • output[out] - reference to a thread output value. May be aliased with input.

  • init[in] - initial value used to start the exclusive scan. Should be the same for all threads in a block.

  • reduction[out] - result of reducing of all input values in a block.

  • storage[in] - reference to a temporary storage object of type storage_type.

  • scan_op[in] - binary operation function object that will be used for scan. The signature of the function should be equivalent to the following: T f(const T &a, const T &b);. The signature does not need to have const &, but function object must not modify the objects passed to it.

template<class BinaryFunction = ::rocprim::plus<T>>
__device__ inline void exclusive_scan(T input, T &output, T init, T &reduction, BinaryFunction scan_op = BinaryFunction())#

Performs exclusive scan and reduction across threads in a block.

This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

  • This overload does not accept storage argument. Required shared memory is allocated by the method itself.

Template Parameters:

BinaryFunction – - type of binary function used for scan. Default type is rocprim::plus<T>.

Parameters:
  • input[in] - thread input value.

  • output[out] - reference to a thread output value. May be aliased with input.

  • init[in] - initial value used to start the exclusive scan. Should be the same for all threads in a block.

  • reduction[out] - result of reducing of all input values in a block.

  • scan_op[in] - binary operation function object that will be used for scan. The signature of the function should be equivalent to the following: T f(const T &a, const T &b);. The signature does not need to have const &, but function object must not modify the objects passed to it.

template<class PrefixCallback, class BinaryFunction = ::rocprim::plus<T>>
__device__ inline void exclusive_scan(T input, T &output, storage_type &storage, PrefixCallback &prefix_callback_op, BinaryFunction scan_op)#

Performs exclusive scan across threads in a block, and uses prefix_callback_op to generate prefix value for the whole block.

Storage reusage

Synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().

Examples

The examples present exclusive prefix sum operations performed on a block of 256 threads, each thread provides one int value.

struct my_block_prefix
{
    int prefix;

    __device__ my_block_prefix(int prefix) : prefix(prefix) {}

    __device__ int operator()(int block_reduction)
    {
        int old_prefix = prefix;
        prefix = prefix + block_reduction;
        return old_prefix;
    }
};

__global__ void example_kernel(...) // blockDim.x = 256
{
    // specialize block_scan for int and block of 256 threads
    using block_scan_f = rocprim::block_scan<int, 256>;
    // allocate storage in shared memory for the block
    __shared__ block_scan_int::storage_type storage;

    // init prefix functor
    my_block_prefix prefix_callback(10);

    int input;
    int output;
    // execute exclusive prefix sum
    block_scan_int().exclusive_scan(
        input,
        output,
        storage,
        prefix_callback,
        rocprim::plus<int>()
    );
    ...
}

If the input values across threads in a block are {1, 1, 1, ..., 1}, then output values in will be {10, 11, 12, 13, ..., 265}, and the prefix will be 266.

Template Parameters:
  • PrefixCallback – - type of the unary function object used for generating block-wide prefix value for the scan operation.

  • BinaryFunction – - type of binary function used for scan. Default type is rocprim::plus<T>.

Parameters:
  • input[in] - thread input value.

  • output[out] - reference to a thread output value. May be aliased with input.

  • storage[in] - reference to a temporary storage object of type storage_type.

  • prefix_callback_op[inout] - function object for generating block prefix value. The signature of the prefix_callback_op should be equivalent to the following: T f(const T &block_reduction);. The signature does not need to have const &, but function object must not modify the objects passed to it. The object will be called by the first warp of the block with block reduction of input values as input argument. The result of the first thread will be used as the block-wide prefix.

  • scan_op[in] - binary operation function object that will be used for scan. The signature of the function should be equivalent to the following: T f(const T &a, const T &b);. The signature does not need to have const &, but function object must not modify the objects passed to it.

template<unsigned int ItemsPerThread, class BinaryFunction = ::rocprim::plus<T>>
__device__ inline void exclusive_scan(T (&input)[ItemsPerThread], T (&output)[ItemsPerThread], T init, storage_type &storage, BinaryFunction scan_op = BinaryFunction())#

Performs exclusive scan across threads in a block.

Storage reusage

Synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().

Examples

The examples present exclusive maximum scan operations performed on a block of 128 threads, each provides two long value.

__global__ void example_kernel(...) // blockDim.x = 128
{
    // specialize block_scan for long and block of 128 threads
    using block_scan_f = rocprim::block_scan<long, 128>;
    // allocate storage in shared memory for the block
    __shared__ block_scan_long::storage_type storage;

    long init = ...;
    long input[2] = ...;
    long output[2];
    // execute exclusive min scan
    block_scan_long().exclusive_scan(
        input,
        output,
        init,
        storage,
        rocprim::maximum<long>()
    );
    ...
}

If the input values across threads in a block are {-1, 2, -3, 4, ..., -255, 256} and init is 0, then output values in will be {0, 0, 2, 2, 4, ..., 254, 254}.

Template Parameters:
  • ItemsPerThread – - number of items in the input array.

  • BinaryFunction – - type of binary function used for scan. Default type is rocprim::plus<T>.

Parameters:
  • input[in] - reference to an array containing thread input values.

  • output[out] - reference to a thread output array. May be aliased with input.

  • init[in] - initial value used to start the exclusive scan. Should be the same for all threads in a block.

  • storage[in] - reference to a temporary storage object of type storage_type.

  • scan_op[in] - binary operation function object that will be used for scan. The signature of the function should be equivalent to the following: T f(const T &a, const T &b);. The signature does not need to have const &, but function object must not modify the objects passed to it.

template<unsigned int ItemsPerThread, class BinaryFunction = ::rocprim::plus<T>>
__device__ inline void exclusive_scan(T (&input)[ItemsPerThread], T (&output)[ItemsPerThread], T init, BinaryFunction scan_op = BinaryFunction())#

Performs exclusive scan across threads in a block.

This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

  • This overload does not accept storage argument. Required shared memory is allocated by the method itself.

Template Parameters:
  • ItemsPerThread – - number of items in the input array.

  • BinaryFunction – - type of binary function used for scan. Default type is rocprim::plus<T>.

Parameters:
  • input[in] - reference to an array containing thread input values.

  • output[out] - reference to a thread output array. May be aliased with input.

  • init[in] - initial value used to start the exclusive scan. Should be the same for all threads in a block.

  • scan_op[in] - binary operation function object that will be used for scan. The signature of the function should be equivalent to the following: T f(const T &a, const T &b);. The signature does not need to have const &, but function object must not modify the objects passed to it.

template<unsigned int ItemsPerThread, class BinaryFunction = ::rocprim::plus<T>>
__device__ inline void exclusive_scan(T (&input)[ItemsPerThread], T (&output)[ItemsPerThread], T init, T &reduction, storage_type &storage, BinaryFunction scan_op = BinaryFunction())#

Performs exclusive scan and reduction across threads in a block.

Storage reusage

Synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().

Examples

The examples present exclusive maximum scan operations performed on a block of 128 threads, each provides two long value.

__global__ void example_kernel(...) // blockDim.x = 128
{
    // specialize block_scan for long and block of 128 threads
    using block_scan_f = rocprim::block_scan<long, 128>;
    // allocate storage in shared memory for the block
    __shared__ block_scan_long::storage_type storage;

    long init = ...;
    long input[2] = ...;
    long output[2];
    long reduction;
    // execute exclusive min scan
    block_scan_long().exclusive_scan(
        input,
        output,
        init,
        reduction,
        storage,
        rocprim::maximum<long>()
    );
    ...
}

If the input values across threads in a block are {-1, 2, -3, 4, ..., -255, 256} and init is 0, then output values in will be {0, 0, 2, 2, 4, ..., 254, 254} and the reduction will be 256.

Template Parameters:
  • ItemsPerThread – - number of items in the input array.

  • BinaryFunction – - type of binary function used for scan. Default type is rocprim::plus<T>.

Parameters:
  • input[in] - reference to an array containing thread input values.

  • output[out] - reference to a thread output array. May be aliased with input.

  • init[in] - initial value used to start the exclusive scan. Should be the same for all threads in a block.

  • reduction[out] - result of reducing of all input values in a block.

  • storage[in] - reference to a temporary storage object of type storage_type.

  • scan_op[in] - binary operation function object that will be used for scan. The signature of the function should be equivalent to the following: T f(const T &a, const T &b);. The signature does not need to have const &, but function object must not modify the objects passed to it.

template<unsigned int ItemsPerThread, class BinaryFunction = ::rocprim::plus<T>>
__device__ inline void exclusive_scan(T (&input)[ItemsPerThread], T (&output)[ItemsPerThread], T init, T &reduction, BinaryFunction scan_op = BinaryFunction())#

Performs exclusive scan and reduction across threads in a block.

This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

  • This overload does not accept storage argument. Required shared memory is allocated by the method itself.

Template Parameters:
  • ItemsPerThread – - number of items in the input array.

  • BinaryFunction – - type of binary function used for scan. Default type is rocprim::plus<T>.

Parameters:
  • input[in] - reference to an array containing thread input values.

  • output[out] - reference to a thread output array. May be aliased with input.

  • init[in] - initial value used to start the exclusive scan. Should be the same for all threads in a block.

  • reduction[out] - result of reducing of all input values in a block.

  • scan_op[in] - binary operation function object that will be used for scan. The signature of the function should be equivalent to the following: T f(const T &a, const T &b);. The signature does not need to have const &, but function object must not modify the objects passed to it.

template<unsigned int ItemsPerThread, class PrefixCallback, class BinaryFunction>
__device__ inline void exclusive_scan(T (&input)[ItemsPerThread], T (&output)[ItemsPerThread], storage_type &storage, PrefixCallback &prefix_callback_op, BinaryFunction scan_op)#

Performs exclusive scan across threads in a block, and uses prefix_callback_op to generate prefix value for the whole block.

Storage reusage

Synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().

Examples

The examples present exclusive prefix sum operations performed on a block of 128 threads, each thread provides two int value.

struct my_block_prefix
{
    int prefix;

    __device__ my_block_prefix(int prefix) : prefix(prefix) {}

    __device__ int operator()(int block_reduction)
    {
        int old_prefix = prefix;
        prefix = prefix + block_reduction;
        return old_prefix;
    }
};

__global__ void example_kernel(...) // blockDim.x = 128
{
    // specialize block_scan for int and block of 128 threads
    using block_scan_f = rocprim::block_scan<int, 128>;
    // allocate storage in shared memory for the block
    __shared__ block_scan_int::storage_type storage;

    // init prefix functor
    my_block_prefix prefix_callback(10);

    int input[2] = ...;
    int output[2];
    // execute exclusive prefix sum
    block_scan_int().exclusive_scan(
        input,
        output,
        storage,
        prefix_callback,
        rocprim::plus<int>()
    );
    ...
}

If the input values across threads in a block are {1, 1, 1, ..., 1}, then output values in will be {10, 11, 12, 13, ..., 265}, and the prefix will be 266.

Template Parameters:
  • ItemsPerThread – - number of items in the input array.

  • PrefixCallback – - type of the unary function object used for generating block-wide prefix value for the scan operation.

  • BinaryFunction – - type of binary function used for scan. Default type is rocprim::plus<T>.

Parameters:
  • input[in] - reference to an array containing thread input values.

  • output[out] - reference to a thread output array. May be aliased with input.

  • storage[in] - reference to a temporary storage object of type storage_type.

  • prefix_callback_op[inout] - function object for generating block prefix value. The signature of the prefix_callback_op should be equivalent to the following: T f(const T &block_reduction);. The signature does not need to have const &, but function object must not modify the objects passed to it. The object will be called by the first warp of the block with block reduction of input values as input argument. The result of the first thread will be used as the block-wide prefix.

  • scan_op[in] - binary operation function object that will be used for scan. The signature of the function should be equivalent to the following: T f(const T &a, const T &b);. The signature does not need to have const &, but function object must not modify the objects passed to it.

Algorithms#

enum class rocprim::block_scan_algorithm#

Available algorithms for block_scan primitive.

Values:

enumerator using_warp_scan#

A warp_scan based algorithm.

enumerator reduce_then_scan#

An algorithm which limits calculations to a single hardware warp.

enumerator default_algorithm#

Default block_scan algorithm.