Scan#

template<class T, unsigned int WarpSize = device_warp_size()>
class warp_scan#

The warp_scan class is a warp level parallel primitive which provides methods for performing inclusive and exclusive scan operations of items partitioned across threads in a hardware warp.

Overview

  • WarpSize must be equal to or less than the size of hardware warp (see rocprim::device_warp_size()

    ). If it is less, scan is performed separately within groups determined by WarpSize.

    For example, if

    WarpSize is 4, hardware warp is 64, scan will be performed in logical warps grouped like this: { {0, 1, 2, 3}, {4, 5, 6, 7 }, ..., {60, 61, 62, 63} } (thread is represented here by its id within hardware warp).

  • Logical warp is a group of WarpSize consecutive threads from the same hardware warp.

  • Supports non-commutative scan operators. However, a scan operator should be associative. When used with non-associative functions the results may be non-deterministic and/or vary in precision.

  • Number of threads executing warp_scan’s function must be a multiple of WarpSize;

  • All threads from a logical warp must be in the same hardware warp.

Examples

In the examples scan operation is performed on groups of 16 threads, each provides one int value, result is returned using the same variable as for input. Hardware warp size is 64. Block (tile) size is 64.

__global__ void example_kernel(...)
{
    // specialize warp_scan for int and logical warp of 16 threads
    using warp_scan_int = rocprim::warp_scan<int, 16>;
    // allocate storage in shared memory
    __shared__ warp_scan_int::storage_type temp[4];

    int logical_warp_id = threadIdx.x/16;
    int value = ...;
    // execute inclusive scan
    warp_scan_int().inclusive_scan(
        value, // input
        value, // output
        temp[logical_warp_id]
    );
    ...
}

Template Parameters:
  • T – - the input/output type.

  • WarpSize – - the size of logical warp size, which can be equal to or less than the size of hardware warp (see rocprim::device_warp_size()). Scan operations are performed separately within groups determined by WarpSize.

Public Types

using storage_type = typename base_type::storage_type#

Struct used to allocate a temporary memory that is required for thread communication during operations provided by related parallel primitive.

Depending on the implemention the operations exposed by parallel primitive may require a temporary storage for thread communication. The storage should be allocated using keywords . It can be aliased to an externally allocated memory, or be a part of a union type with other storage types to increase shared memory reusability.

Public Functions

template<class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize>
__device__ inline auto inclusive_scan(T input, T &output, storage_type &storage, BinaryFunction scan_op = BinaryFunction()) -> typename std::enable_if<(FunctionWarpSize <= 64), void>::type#

Performs inclusive scan across threads in a logical warp.

Storage reusage

Synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().

Examples

The examples present inclusive min scan operations performed on groups of 32 threads, each provides one float value, result is returned using the same variable as for input. Hardware warp size is 64. Block (tile) size is 256.

__global__ void example_kernel(...) // blockDim.x = 256
{
    // specialize warp_scan for float and logical warp of 32 threads
    using warp_scan_f = rocprim::warp_scan<float, 32>;
    // allocate storage in shared memory
    __shared__ warp_scan_float::storage_type temp[8]; // 256/32 = 8

    int logical_warp_id = threadIdx.x/32;
    float value = ...;
    // execute inclusive min scan
    warp_scan_float().inclusive_scan(
        value, // input
        value, // output
        temp[logical_warp_id],
        rocprim::minimum<float>()
    );
    ...
}

If the input values across threads in a block/tile are {1, -2, 3, -4, ..., 255, -256}, then output values in the first logical warp will be {1, -2, -2, -4, ..., -32}, in the second: {33, -34, -34, -36, ..., -64} etc.

Template Parameters:

BinaryFunction – - type of binary function used for scan. Default type is rocprim::plus<T>.

Parameters:
  • input[in] - thread input value.

  • output[out] - reference to a thread output value. May be aliased with input.

  • storage[in] - reference to a temporary storage object of type storage_type.

  • scan_op[in] - binary operation function object that will be used for scan. The signature of the function should be equivalent to the following: T f(const T &a, const T &b);. The signature does not need to have const &, but function object must not modify the objects passed to it.

template<class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize>
__device__ inline auto inclusive_scan(T, T&, storage_type&, BinaryFunction scan_op = BinaryFunction()) -> typename std::enable_if<(FunctionWarpSize > 64), void>::type#

Performs inclusive scan across threads in a logical warp. Invalid Warp Size.

template<class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize>
__device__ inline auto inclusive_scan(T input, T &output, T &reduction, storage_type &storage, BinaryFunction scan_op = BinaryFunction()) -> typename std::enable_if<(FunctionWarpSize <= 64), void>::type#

Performs inclusive scan and reduction across threads in a logical warp.

Storage reusage

Synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().

Examples

The examples present inclusive prefix sum operations performed on groups of 64 threads, each thread provides one int value. Hardware warp size is 64. Block (tile) size is 256.

__global__ void example_kernel(...) // blockDim.x = 256
{
    // specialize warp_scan for int and logical warp of 64 threads
    using warp_scan_int = rocprim::warp_scan<int, 64>;
    // allocate storage in shared memory
    __shared__ warp_scan_int::storage_type temp[4]; // 256/64 = 4

    int logical_warp_id = threadIdx.x/64;
    int input = ...;
    int output, reduction;
    // inclusive prefix sum
    warp_scan_int().inclusive_scan(
        input,
        output,
        reduction,
        temp[logical_warp_id]
    );
    ...
}

If the input values across threads in a block/tile are {1, 1, 1, 1, ..., 1, 1}, then output values in the every logical warp will be {1, 2, 3, 4, ..., 64}. The reduction will be equal 64.

Template Parameters:

BinaryFunction – - type of binary function used for scan. Default type is rocprim::plus<T>.

Parameters:
  • input[in] - thread input value.

  • output[out] - reference to a thread output value. May be aliased with input.

  • reduction[out] - result of reducing of all input values in logical warp.

  • storage[in] - reference to a temporary storage object of type storage_type.

  • scan_op[in] - binary operation function object that will be used for scan. The signature of the function should be equivalent to the following: T f(const T &a, const T &b);. The signature does not need to have const &, but function object must not modify the objects passed to it.

template<class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize>
__device__ inline auto inclusive_scan(T, T&, T&, storage_type&, BinaryFunction scan_op = BinaryFunction()) -> typename std::enable_if<(FunctionWarpSize > 64), void>::type#

Performs inclusive scan and reduction across threads in a logical warp. Invalid Warp Size.

template<class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize>
__device__ inline auto exclusive_scan(T input, T &output, T init, storage_type &storage, BinaryFunction scan_op = BinaryFunction()) -> typename std::enable_if<(FunctionWarpSize <= 64), void>::type#

Performs exclusive scan across threads in a logical warp.

Storage reusage

Synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().

Examples

The examples present exclusive min scan operations performed on groups of 32 threads, each provides one float value, result is returned using the same variable as for input. Hardware warp size is 64. Block (tile) size is 256.

__global__ void example_kernel(...) // blockDim.x = 256
{
    // specialize warp_scan for float and logical warp of 32 threads
    using warp_scan_f = rocprim::warp_scan<float, 32>;
    // allocate storage in shared memory
    __shared__ warp_scan_float::storage_type temp[8]; // 256/32 = 8

    int logical_warp_id = threadIdx.x/32;
    float value = ...;
    // execute exclusive min scan
    warp_scan_float().exclusive_scan(
        value, // input
        value, // output
        100.0f, // init
        temp[logical_warp_id],
        rocprim::minimum<float>()
    );
    ...
}

If the initial value is 100 and input values across threads in a block/tile are {1, -2, 3, -4, ..., 255, -256}, then output values in the first logical warp will be {100, 1, -2, -2, -4, ..., -30}, in the second: {100, 33, -34, -34, -36, ..., -62} etc.

Template Parameters:

BinaryFunction – - type of binary function used for scan. Default type is rocprim::plus<T>.

Parameters:
  • input[in] - thread input value.

  • output[out] - reference to a thread output value. May be aliased with input.

  • init[in] - initial value used to start the exclusive scan. Should be the same for all threads in a logical warp.

  • storage[in] - reference to a temporary storage object of type storage_type.

  • scan_op[in] - binary operation function object that will be used for scan. The signature of the function should be equivalent to the following: T f(const T &a, const T &b);. The signature does not need to have const &, but function object must not modify the objects passed to it.

template<class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize>
__device__ inline auto exclusive_scan(T, T&, T, storage_type&, BinaryFunction scan_op = BinaryFunction()) -> typename std::enable_if<(FunctionWarpSize > 64), void>::type#

Performs exclusive scan across threads in a logical warp. Invalid Warp Size.

template<class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize>
__device__ inline auto exclusive_scan(T input, T &output, T init, T &reduction, storage_type &storage, BinaryFunction scan_op = BinaryFunction()) -> typename std::enable_if<(FunctionWarpSize <= 64), void>::type#

Performs exclusive scan and reduction across threads in a logical warp.

Storage reusage

Synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().

Examples

The examples present exclusive prefix sum operations performed on groups of 64 threads, each thread provides one int value. Hardware warp size is 64. Block (tile) size is 256.

__global__ void example_kernel(...) // blockDim.x = 256
{
    // specialize warp_scan for int and logical warp of 64 threads
    using warp_scan_int = rocprim::warp_scan<int, 64>;
    // allocate storage in shared memory
    __shared__ warp_scan_int::storage_type temp[4]; // 256/64 = 4

    int logical_warp_id = threadIdx.x/64;
    int input = ...;
    int output, reduction;
    // exclusive prefix sum
    warp_scan_int().exclusive_scan(
        input,
        output,
        10, // init
        reduction,
        temp[logical_warp_id]
    );
    ...
}

If the initial value is 10 and input values across threads in a block/tile are {1, 1, ..., 1, 1}, then output values in every logical warp will be {10, 11, 12, 13, ..., 73}. The reduction will be 64.

Template Parameters:

BinaryFunction – - type of binary function used for scan. Default type is rocprim::plus<T>.

Parameters:
  • input[in] - thread input value.

  • output[out] - reference to a thread output value. May be aliased with input.

  • init[in] - initial value used to start the exclusive scan. Should be the same for all threads in a logical warp.

  • reduction[out] - result of reducing of all input values in logical warp. init value is not included in the reduction.

  • storage[in] - reference to a temporary storage object of type storage_type.

  • scan_op[in] - binary operation function object that will be used for scan. The signature of the function should be equivalent to the following: T f(const T &a, const T &b);. The signature does not need to have const &, but function object must not modify the objects passed to it.

template<class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize>
__device__ inline auto exclusive_scan(T, T&, T, T&, storage_type&, BinaryFunction scan_op = BinaryFunction()) -> typename std::enable_if<(FunctionWarpSize > 64), void>::type#

Performs exclusive scan and reduction across threads in a logical warp. Invalid Warp Size.

template<class BinaryFunction = ::rocprim::plus<>, unsigned int FunctionWarpSize = WarpSize>
__device__ inline auto exclusive_scan(T input, T &output, storage_type &storage, BinaryFunction scan_op = BinaryFunction()) -> void#

Performs exclusive scan without an initial value across threads in a logical warp.

Template Parameters:

BinaryFunction – binary function used for scan

Parameters:
  • input – Thread input value

  • output[out] Reference to thread output value. Each threads value for the scan will be written to it. May be aliased with input. The value written is unspecified for the first thread of each logical warp.

  • storage[in] Reference to a temporary storage object of type storage_type.

  • scan_op – The function object used to combine elements used for the scan

template<class BinaryFunction = ::rocprim::plus<>, unsigned int FunctionWarpSize = WarpSize>
__device__ inline auto exclusive_scan(T input, T &output, storage_type &storage, T &reduction, BinaryFunction scan_op = BinaryFunction()) -> void#

Performs exclusive scan and reduction without an initial value across threads in a logical warp.

Template Parameters:

BinaryFunction – binary function used for scan

Parameters:
  • input – Thread input value

  • output[out] Reference to thread output value. Each threads value for the scan will be written to it. May be aliased with input. The value written is unspecified for the first thread of each logical warp.

  • reduction[out] Result of reducing of all input values in the logical warp.

  • storage[in] Reference to a temporary storage object of type storage_type.

  • scan_op – The function object used to combine elements used for the scan

template<class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize>
__device__ inline auto scan(T input, T &inclusive_output, T &exclusive_output, T init, storage_type &storage, BinaryFunction scan_op = BinaryFunction()) -> typename std::enable_if<(FunctionWarpSize <= 64), void>::type#

Performs inclusive and exclusive scan operations across threads in a logical warp.

Storage reusage

Synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().

Examples

The examples present min inclusive and exclusive scan operations performed on groups of 32 threads, each provides one float value, result is returned using the same variable as for input. Hardware warp size is 64. Block (tile) size is 256.

__global__ void example_kernel(...) // blockDim.x = 256
{
    // specialize warp_scan for float and logical warp of 32 threads
    using warp_scan_f = rocprim::warp_scan<float, 32>;
    // allocate storage in shared memory
    __shared__ warp_scan_float::storage_type temp[8]; // 256/32 = 8

    int logical_warp_id = threadIdx.x/32;
    float input = ...;
    float ex_output, in_output;
    // execute exclusive min scan
    warp_scan_float().scan(
        input,
        in_output,
        ex_output,
        100.0f, // init
        temp[logical_warp_id],
        rocprim::minimum<float>()
    );
    ...
}

If the initial value is 100 and input values across threads in a block/tile are {1, -2, 3, -4, ..., 255, -256}, then in_output values in the first logical warp will be {1, -2, -2, -4, ..., -32}, in the second: {33, -34, -34, -36, ..., -64} and so forth, ex_output values in the first logical warp will be {100, 1, -2, -2, -4, ..., -30}, in the second: {100, 33, -34, -34, -36, ..., -62} etc.

Template Parameters:

BinaryFunction – - type of binary function used for scan. Default type is rocprim::plus<T>.

Parameters:
  • input[in] - thread input value.

  • inclusive_output[out] - reference to a thread inclusive-scan output value.

  • exclusive_output[out] - reference to a thread exclusive-scan output value.

  • init[in] - initial value used to start the exclusive scan. Should be the same for all threads in a logical warp.

  • storage[in] - reference to a temporary storage object of type storage_type.

  • scan_op[in] - binary operation function object that will be used for scan. The signature of the function should be equivalent to the following: T f(const T &a, const T &b);. The signature does not need to have const &, but function object must not modify the objects passed to it.

template<class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize>
__device__ inline auto scan(T, T&, T&, T, storage_type&, BinaryFunction scan_op = BinaryFunction()) -> typename std::enable_if<(FunctionWarpSize > 64), void>::type#

Performs inclusive and exclusive scan operations across threads Invalid Warp Size.

template<class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize>
__device__ inline auto scan(T input, T &inclusive_output, T &exclusive_output, T init, T &reduction, storage_type &storage, BinaryFunction scan_op = BinaryFunction()) -> typename std::enable_if<(FunctionWarpSize <= 64), void>::type#

Performs inclusive and exclusive scan operations, and reduction across threads in a logical warp.

Storage reusage

Synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().

Examples

The examples present inclusive and exclusive prefix sum operations performed on groups of 64 threads, each thread provides one int value. Hardware warp size is 64. Block (tile) size is 256.

__global__ void example_kernel(...) // blockDim.x = 256
{
    // specialize warp_scan for int and logical warp of 64 threads
    using warp_scan_int = rocprim::warp_scan<int, 64>;
    // allocate storage in shared memory
    __shared__ warp_scan_int::storage_type temp[4]; // 256/64 = 4

    int logical_warp_id = threadIdx.x/64;
    int input = ...;
    int in_output, ex_output, reduction;
    // inclusive and exclusive prefix sum
    warp_scan_int().scan(
        input,
        in_output,
        ex_output,
        init,
        reduction,
        temp[logical_warp_id]
    );
    ...
}

If the initial value is 10 and input values across threads in a block/tile are {1, 1, ..., 1, 1}, then in_output values in every logical warp will be {1, 2, 3, 4, ..., 63, 64}, and ex_output values in every logical warp will be {10, 11, 12, 13, ..., 73}. The reduction will be 64.

Template Parameters:

BinaryFunction – - type of binary function used for scan. Default type is rocprim::plus<T>.

Parameters:
  • input[in] - thread input value.

  • inclusive_output[out] - reference to a thread inclusive-scan output value.

  • exclusive_output[out] - reference to a thread exclusive-scan output value.

  • init[in] - initial value used to start the exclusive scan. Should be the same for all threads in a logical warp.

  • reduction[out] - result of reducing of all input values in logical warp. init value is not included in the reduction.

  • storage[in] - reference to a temporary storage object of type storage_type.

  • scan_op[in] - binary operation function object that will be used for scan. The signature of the function should be equivalent to the following: T f(const T &a, const T &b);. The signature does not need to have const &, but function object must not modify the objects passed to it.

template<class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize>
__device__ inline auto scan(T, T&, T&, T, T&, storage_type&, BinaryFunction scan_op = BinaryFunction()) -> typename std::enable_if<(FunctionWarpSize > 64), void>::type#

Performs inclusive and exclusive scan operations across threads Invalid Warp Size.

template<unsigned int FunctionWarpSize = WarpSize>
__device__ inline auto broadcast(T input, const unsigned int src_lane, storage_type &storage) -> typename std::enable_if<(FunctionWarpSize <= 64), T>::type#

Broadcasts value from one thread to all threads in logical warp.

Storage reusage

Synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().

Parameters:
  • input[in] - value to broadcast.

  • src_lane[in] - id of the thread whose value should be broadcasted

  • storage[in] - reference to a temporary storage object of type storage_type.

template<unsigned int FunctionWarpSize = WarpSize>
__device__ inline auto broadcast(T, const unsigned int, storage_type&) -> typename std::enable_if<(FunctionWarpSize > 64), T>::type#

Broadcasts value from one thread to all threads in logical warp. Invalid Warp Size.