Scan#
- 
template<class T, unsigned int WarpSize = device_warp_size()>
 class warp_scan#
- The warp_scan class is a warp level parallel primitive which provides methods for performing inclusive and exclusive scan operations of items partitioned across threads in a hardware warp. - Overview
- WarpSizemust be equal to or less than the size of hardware warp (see rocprim::device_warp_size()- ). If it is less, scan is performed separately within groups determined by WarpSize. - For example, if - WarpSizeis 4, hardware warp is 64, scan will be performed in logical warps grouped like this:- { {0, 1, 2, 3}, {4, 5, 6, 7 }, ..., {60, 61, 62, 63} }(thread is represented here by its id within hardware warp).
- Logical warp is a group of - WarpSizeconsecutive threads from the same hardware warp.
- Supports non-commutative scan operators. However, a scan operator should be associative. When used with non-associative functions the results may be non-deterministic and/or vary in precision. 
- Number of threads executing warp_scan’s function must be a multiple of - WarpSize;
- All threads from a logical warp must be in the same hardware warp. 
 
- Examples
- In the examples scan operation is performed on groups of 16 threads, each provides one - intvalue, result is returned using the same variable as for input. Hardware warp size is 64. Block (tile) size is 64.- __global__ void example_kernel(...) { // specialize warp_scan for int and logical warp of 16 threads using warp_scan_int = rocprim::warp_scan<int, 16>; // allocate storage in shared memory __shared__ warp_scan_int::storage_type temp[4]; int logical_warp_id = threadIdx.x/16; int value = ...; // execute inclusive scan warp_scan_int().inclusive_scan( value, // input value, // output temp[logical_warp_id] ); ... } 
 - Template Parameters:
- T – - the input/output type. 
- WarpSize – - the size of logical warp size, which can be equal to or less than the size of hardware warp (see rocprim::device_warp_size()). Scan operations are performed separately within groups determined by WarpSize. 
 
 - Public Types - 
using storage_type = typename base_type::storage_type#
- Struct used to allocate a temporary memory that is required for thread communication during operations provided by related parallel primitive. - Depending on the implemention the operations exposed by parallel primitive may require a temporary storage for thread communication. The storage should be allocated using keywords - . It can be aliased to an externally allocated memory, or be a part of a union type with other storage types to increase shared memory reusability.
 - Public Functions - 
template<class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize>
 __device__ inline auto inclusive_scan(T input, T &output, storage_type &storage, BinaryFunction scan_op = BinaryFunction()) -> typename std::enable_if<(FunctionWarpSize <= device_warp_size()), void>::type#
- Performs inclusive scan across threads in a logical warp. - Storage reusage
- Synchronization barrier should be placed before - storageis reused or repurposed:- __syncthreads()or- rocprim::syncthreads().
- Examples
- The examples present inclusive min scan operations performed on groups of 32 threads, each provides one - floatvalue, result is returned using the same variable as for input. Hardware warp size is 64. Block (tile) size is 256.- __global__ void example_kernel(...) // blockDim.x = 256 { // specialize warp_scan for float and logical warp of 32 threads using warp_scan_f = rocprim::warp_scan<float, 32>; // allocate storage in shared memory __shared__ warp_scan_float::storage_type temp[8]; // 256/32 = 8 int logical_warp_id = threadIdx.x/32; float value = ...; // execute inclusive min scan warp_scan_float().inclusive_scan( value, // input value, // output temp[logical_warp_id], rocprim::minimum<float>() ); ... } - If the input values across threads in a block/tile are - {1, -2, 3, -4, ..., 255, -256}, then output values in the first logical warp will be- {1, -2, -2, -4, ..., -32},in the second:- {33, -34, -34, -36, ..., -64}etc.
 - Template Parameters:
- BinaryFunction – - type of binary function used for scan. Default type is rocprim::plus<T>. 
- Parameters:
- input – [in] - thread input value. 
- output – [out] - reference to a thread output value. May be aliased with - input.
- storage – [in] - reference to a temporary storage object of type storage_type. 
- scan_op – [in] - binary operation function object that will be used for scan. The signature of the function should be equivalent to the following: - T f(const T &a, const T &b);. The signature does not need to have- const &, but function object must not modify the objects passed to it.
 
 
 - 
template<class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize>
 __device__ inline auto inclusive_scan(T, T&, storage_type&, BinaryFunction scan_op = BinaryFunction()) -> typename std::enable_if<(FunctionWarpSize > device_warp_size()), void>::type#
- Performs inclusive scan across threads in a logical warp. Invalid Warp Size. 
 - 
template<class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize>
 __device__ inline auto inclusive_scan(T input, T &output, T &reduction, storage_type &storage, BinaryFunction scan_op = BinaryFunction()) -> typename std::enable_if<(FunctionWarpSize <= device_warp_size()), void>::type#
- Performs inclusive scan and reduction across threads in a logical warp. - Storage reusage
- Synchronization barrier should be placed before - storageis reused or repurposed:- __syncthreads()or- rocprim::syncthreads().
- Examples
- The examples present inclusive prefix sum operations performed on groups of 64 threads, each thread provides one - intvalue. Hardware warp size is 64. Block (tile) size is 256.- __global__ void example_kernel(...) // blockDim.x = 256 { // specialize warp_scan for int and logical warp of 64 threads using warp_scan_int = rocprim::warp_scan<int, 64>; // allocate storage in shared memory __shared__ warp_scan_int::storage_type temp[4]; // 256/64 = 4 int logical_warp_id = threadIdx.x/64; int input = ...; int output, reduction; // inclusive prefix sum warp_scan_int().inclusive_scan( input, output, reduction, temp[logical_warp_id] ); ... } - If the - inputvalues across threads in a block/tile are- {1, 1, 1, 1, ..., 1, 1}, then- outputvalues in the every logical warp will be- {1, 2, 3, 4, ..., 64}. The- reductionwill be equal- 64.
 - Template Parameters:
- BinaryFunction – - type of binary function used for scan. Default type is rocprim::plus<T>. 
- Parameters:
- input – [in] - thread input value. 
- output – [out] - reference to a thread output value. May be aliased with - input.
- reduction – [out] - result of reducing of all - inputvalues in logical warp.
- storage – [in] - reference to a temporary storage object of type storage_type. 
- scan_op – [in] - binary operation function object that will be used for scan. The signature of the function should be equivalent to the following: - T f(const T &a, const T &b);. The signature does not need to have- const &, but function object must not modify the objects passed to it.
 
 
 - 
template<class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize>
 __device__ inline auto inclusive_scan(T, T&, T&, storage_type&, BinaryFunction scan_op = BinaryFunction()) -> typename std::enable_if<(FunctionWarpSize > device_warp_size()), void>::type#
- Performs inclusive scan and reduction across threads in a logical warp. Invalid Warp Size. 
 - 
template<class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize>
 __device__ inline auto exclusive_scan(T input, T &output, T init, storage_type &storage, BinaryFunction scan_op = BinaryFunction()) -> typename std::enable_if<(FunctionWarpSize <= device_warp_size()), void>::type#
- Performs exclusive scan across threads in a logical warp. - Storage reusage
- Synchronization barrier should be placed before - storageis reused or repurposed:- __syncthreads()or- rocprim::syncthreads().
- Examples
- The examples present exclusive min scan operations performed on groups of 32 threads, each provides one - floatvalue, result is returned using the same variable as for input. Hardware warp size is 64. Block (tile) size is 256.- __global__ void example_kernel(...) // blockDim.x = 256 { // specialize warp_scan for float and logical warp of 32 threads using warp_scan_f = rocprim::warp_scan<float, 32>; // allocate storage in shared memory __shared__ warp_scan_float::storage_type temp[8]; // 256/32 = 8 int logical_warp_id = threadIdx.x/32; float value = ...; // execute exclusive min scan warp_scan_float().exclusive_scan( value, // input value, // output 100.0f, // init temp[logical_warp_id], rocprim::minimum<float>() ); ... } - If the initial value is - 100and input values across threads in a block/tile are- {1, -2, 3, -4, ..., 255, -256}, then output values in the first logical warp will be- {100, 1, -2, -2, -4, ..., -30},in the second:- {100, 33, -34, -34, -36, ..., -62}etc.
 - Template Parameters:
- BinaryFunction – - type of binary function used for scan. Default type is rocprim::plus<T>. 
- Parameters:
- input – [in] - thread input value. 
- output – [out] - reference to a thread output value. May be aliased with - input.
- init – [in] - initial value used to start the exclusive scan. Should be the same for all threads in a logical warp. 
- storage – [in] - reference to a temporary storage object of type storage_type. 
- scan_op – [in] - binary operation function object that will be used for scan. The signature of the function should be equivalent to the following: - T f(const T &a, const T &b);. The signature does not need to have- const &, but function object must not modify the objects passed to it.
 
 
 - 
template<class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize>
 __device__ inline auto exclusive_scan(T, T&, T, storage_type&, BinaryFunction scan_op = BinaryFunction()) -> typename std::enable_if<(FunctionWarpSize > device_warp_size()), void>::type#
- Performs exclusive scan across threads in a logical warp. Invalid Warp Size. 
 - 
template<class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize>
 __device__ inline auto exclusive_scan(T input, T &output, T init, T &reduction, storage_type &storage, BinaryFunction scan_op = BinaryFunction()) -> typename std::enable_if<(FunctionWarpSize <= device_warp_size()), void>::type#
- Performs exclusive scan and reduction across threads in a logical warp. - Storage reusage
- Synchronization barrier should be placed before - storageis reused or repurposed:- __syncthreads()or- rocprim::syncthreads().
- Examples
- The examples present exclusive prefix sum operations performed on groups of 64 threads, each thread provides one - intvalue. Hardware warp size is 64. Block (tile) size is 256.- __global__ void example_kernel(...) // blockDim.x = 256 { // specialize warp_scan for int and logical warp of 64 threads using warp_scan_int = rocprim::warp_scan<int, 64>; // allocate storage in shared memory __shared__ warp_scan_int::storage_type temp[4]; // 256/64 = 4 int logical_warp_id = threadIdx.x/64; int input = ...; int output, reduction; // exclusive prefix sum warp_scan_int().exclusive_scan( input, output, 10, // init reduction, temp[logical_warp_id] ); ... } - If the initial value is - 10and- inputvalues across threads in a block/tile are- {1, 1, ..., 1, 1}, then- outputvalues in every logical warp will be- {10, 11, 12, 13, ..., 73}. The- reductionwill be 64.
 - Template Parameters:
- BinaryFunction – - type of binary function used for scan. Default type is rocprim::plus<T>. 
- Parameters:
- input – [in] - thread input value. 
- output – [out] - reference to a thread output value. May be aliased with - input.
- init – [in] - initial value used to start the exclusive scan. Should be the same for all threads in a logical warp. 
- reduction – [out] - result of reducing of all - inputvalues in logical warp.- initvalue is not included in the reduction.
- storage – [in] - reference to a temporary storage object of type storage_type. 
- scan_op – [in] - binary operation function object that will be used for scan. The signature of the function should be equivalent to the following: - T f(const T &a, const T &b);. The signature does not need to have- const &, but function object must not modify the objects passed to it.
 
 
 - 
template<class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize>
 __device__ inline auto exclusive_scan(T, T&, T, T&, storage_type&, BinaryFunction scan_op = BinaryFunction()) -> typename std::enable_if<(FunctionWarpSize > device_warp_size()), void>::type#
- Performs exclusive scan and reduction across threads in a logical warp. Invalid Warp Size. 
 - 
template<class BinaryFunction = ::rocprim::plus<>, unsigned int FunctionWarpSize = WarpSize>
 __device__ inline auto exclusive_scan(T input, T &output, storage_type &storage, BinaryFunction scan_op = BinaryFunction()) -> void#
- Performs exclusive scan without an initial value across threads in a logical warp. - Template Parameters:
- BinaryFunction – binary function used for scan 
- Parameters:
- input – Thread input value 
- output – [out] Reference to thread output value. Each threads value for the scan will be written to it. May be aliased with - input. The value written is unspecified for the first thread of each logical warp.
- storage – [in] Reference to a temporary storage object of type storage_type. 
- scan_op – The function object used to combine elements used for the scan 
 
 
 - 
template<class BinaryFunction = ::rocprim::plus<>, unsigned int FunctionWarpSize = WarpSize>
 __device__ inline auto exclusive_scan(T input, T &output, storage_type &storage, T &reduction, BinaryFunction scan_op = BinaryFunction()) -> void#
- Performs exclusive scan and reduction without an initial value across threads in a logical warp. - Template Parameters:
- BinaryFunction – binary function used for scan 
- Parameters:
- input – Thread input value 
- output – [out] Reference to thread output value. Each threads value for the scan will be written to it. May be aliased with - input. The value written is unspecified for the first thread of each logical warp.
- reduction – [out] Result of reducing of all - inputvalues in the logical warp.
- storage – [in] Reference to a temporary storage object of type storage_type. 
- scan_op – The function object used to combine elements used for the scan 
 
 
 - 
template<class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize>
 __device__ inline auto scan(T input, T &inclusive_output, T &exclusive_output, T init, storage_type &storage, BinaryFunction scan_op = BinaryFunction()) -> typename std::enable_if<(FunctionWarpSize <= device_warp_size()), void>::type#
- Performs inclusive and exclusive scan operations across threads in a logical warp. - Storage reusage
- Synchronization barrier should be placed before - storageis reused or repurposed:- __syncthreads()or- rocprim::syncthreads().
- Examples
- The examples present min inclusive and exclusive scan operations performed on groups of 32 threads, each provides one - floatvalue, result is returned using the same variable as for input. Hardware warp size is 64. Block (tile) size is 256.- __global__ void example_kernel(...) // blockDim.x = 256 { // specialize warp_scan for float and logical warp of 32 threads using warp_scan_f = rocprim::warp_scan<float, 32>; // allocate storage in shared memory __shared__ warp_scan_float::storage_type temp[8]; // 256/32 = 8 int logical_warp_id = threadIdx.x/32; float input = ...; float ex_output, in_output; // execute exclusive min scan warp_scan_float().scan( input, in_output, ex_output, 100.0f, // init temp[logical_warp_id], rocprim::minimum<float>() ); ... } - If the initial value is - 100and input values across threads in a block/tile are- {1, -2, 3, -4, ..., 255, -256}, then- in_outputvalues in the first logical warp will be- {1, -2, -2, -4, ..., -32},in the second:- {33, -34, -34, -36, ..., -64}and so forth,- ex_outputvalues in the first logical warp will be- {100, 1, -2, -2, -4, ..., -30},in the second:- {100, 33, -34, -34, -36, ..., -62}etc.
 - Template Parameters:
- BinaryFunction – - type of binary function used for scan. Default type is rocprim::plus<T>. 
- Parameters:
- input – [in] - thread input value. 
- inclusive_output – [out] - reference to a thread inclusive-scan output value. 
- exclusive_output – [out] - reference to a thread exclusive-scan output value. 
- init – [in] - initial value used to start the exclusive scan. Should be the same for all threads in a logical warp. 
- storage – [in] - reference to a temporary storage object of type storage_type. 
- scan_op – [in] - binary operation function object that will be used for scan. The signature of the function should be equivalent to the following: - T f(const T &a, const T &b);. The signature does not need to have- const &, but function object must not modify the objects passed to it.
 
 
 - 
template<class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize>
 __device__ inline auto scan(T, T&, T&, T, storage_type&, BinaryFunction scan_op = BinaryFunction()) -> typename std::enable_if<(FunctionWarpSize > device_warp_size()), void>::type#
- Performs inclusive and exclusive scan operations across threads Invalid Warp Size. 
 - 
template<class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize>
 __device__ inline auto scan(T input, T &inclusive_output, T &exclusive_output, T init, T &reduction, storage_type &storage, BinaryFunction scan_op = BinaryFunction()) -> typename std::enable_if<(FunctionWarpSize <= device_warp_size()), void>::type#
- Performs inclusive and exclusive scan operations, and reduction across threads in a logical warp. - Storage reusage
- Synchronization barrier should be placed before - storageis reused or repurposed:- __syncthreads()or- rocprim::syncthreads().
- Examples
- The examples present inclusive and exclusive prefix sum operations performed on groups of 64 threads, each thread provides one - intvalue. Hardware warp size is 64. Block (tile) size is 256.- __global__ void example_kernel(...) // blockDim.x = 256 { // specialize warp_scan for int and logical warp of 64 threads using warp_scan_int = rocprim::warp_scan<int, 64>; // allocate storage in shared memory __shared__ warp_scan_int::storage_type temp[4]; // 256/64 = 4 int logical_warp_id = threadIdx.x/64; int input = ...; int in_output, ex_output, reduction; // inclusive and exclusive prefix sum warp_scan_int().scan( input, in_output, ex_output, init, reduction, temp[logical_warp_id] ); ... } - If the initial value is - 10and- inputvalues across threads in a block/tile are- {1, 1, ..., 1, 1}, then- in_outputvalues in every logical warp will be- {1, 2, 3, 4, ..., 63, 64}, and- ex_outputvalues in every logical warp will be- {10, 11, 12, 13, ..., 73}. The- reductionwill be 64.
 - Template Parameters:
- BinaryFunction – - type of binary function used for scan. Default type is rocprim::plus<T>. 
- Parameters:
- input – [in] - thread input value. 
- inclusive_output – [out] - reference to a thread inclusive-scan output value. 
- exclusive_output – [out] - reference to a thread exclusive-scan output value. 
- init – [in] - initial value used to start the exclusive scan. Should be the same for all threads in a logical warp. 
- reduction – [out] - result of reducing of all - inputvalues in logical warp.- initvalue is not included in the reduction.
- storage – [in] - reference to a temporary storage object of type storage_type. 
- scan_op – [in] - binary operation function object that will be used for scan. The signature of the function should be equivalent to the following: - T f(const T &a, const T &b);. The signature does not need to have- const &, but function object must not modify the objects passed to it.
 
 
 - 
template<class BinaryFunction = ::rocprim::plus<T>, unsigned int FunctionWarpSize = WarpSize>
 __device__ inline auto scan(T, T&, T&, T, T&, storage_type&, BinaryFunction scan_op = BinaryFunction()) -> typename std::enable_if<(FunctionWarpSize > device_warp_size()), void>::type#
- Performs inclusive and exclusive scan operations across threads Invalid Warp Size. 
 - 
template<unsigned int FunctionWarpSize = WarpSize>
 __device__ inline auto broadcast(T input, const unsigned int src_lane, storage_type &storage) -> typename std::enable_if<(FunctionWarpSize <= device_warp_size()), T>::type#
- Broadcasts value from one thread to all threads in logical warp. - Storage reusage
- Synchronization barrier should be placed before - storageis reused or repurposed:- __syncthreads()or- rocprim::syncthreads().
 - Parameters:
- input – [in] - value to broadcast. 
- src_lane – [in] - id of the thread whose value should be broadcasted 
- storage – [in] - reference to a temporary storage object of type storage_type. 
 
 
 - 
template<unsigned int FunctionWarpSize = WarpSize>
 __device__ inline auto broadcast(T, const unsigned int, storage_type&) -> typename std::enable_if<(FunctionWarpSize > device_warp_size()), T>::type#
- Broadcasts value from one thread to all threads in logical warp. Invalid Warp Size.