Scan#
Class#
-
template<class T, unsigned int BlockSizeX, block_scan_algorithm Algorithm = block_scan_algorithm::default_algorithm, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
class block_scan# The block_scan class is a block level parallel primitive which provides methods for performing inclusive and exclusive scan operations of items partitioned across threads in a block.
- Overview
Supports non-commutative scan operators. However, a scan operator should be associative. When used with non-associative functions the results may be non-deterministic and/or vary in precision.
Computation can more efficient when:
ItemsPerThread
is greater than one,T
is an arithmetic type,scan operation is simple addition operator, and
the number of threads in the block is a multiple of the hardware warp size (see rocprim::device_warp_size()).
block_scan has two alternative implementations:
block_scan_algorithm::using_warp_scan
and block_scan_algorithm::reduce_then_scan.
- Examples
In the examples scan operation is performed on block of 192 threads, each provides one
int
value, result is returned using the same variable as for input.__global__ void example_kernel(...) { // specialize warp_scan for int and logical warp of 192 threads using block_scan_int = rocprim::block_scan<int, 192>; // allocate storage in shared memory __shared__ block_scan_int::storage_type storage; int value = ...; // execute inclusive scan block_scan_int().inclusive_scan( value, // input value, // output storage ); ... }
- Template Parameters:
T – - the input/output type.
BlockSizeX – - the number of threads in a block’s x dimension.
Algorithm – - selected scan algorithm, block_scan_algorithm::default_algorithm by default.
BlockSizeY – - the number of threads in a block’s y dimension, defaults to 1.
BlockSizeZ – - the number of threads in a block’s z dimension, defaults to 1.
Public Types
-
using storage_type = typename base_type::storage_type#
Struct used to allocate a temporary memory that is required for thread communication during operations provided by related parallel primitive.
Depending on the implemention the operations exposed by parallel primitive may require a temporary storage for thread communication. The storage should be allocated using keywords
. It can be aliased to an externally allocated memory, or be a part of a union type with other storage types to increase shared memory reusability.
Public Functions
-
template<class BinaryFunction = ::rocprim::plus<T>>
__device__ inline void inclusive_scan(T input, T &output, storage_type &storage, BinaryFunction scan_op = BinaryFunction())# Performs inclusive scan across threads in a block.
- Storage reusage
Synchronization barrier should be placed before
storage
is reused or repurposed:__syncthreads()
orrocprim::syncthreads()
.- Examples
The examples present inclusive min scan operations performed on a block of 256 threads, each provides one
float
value.__global__ void example_kernel(...) // blockDim.x = 256 { // specialize block_scan for float and block of 256 threads using block_scan_f = rocprim::block_scan<float, 256>; // allocate storage in shared memory for the block __shared__ block_scan_float::storage_type storage; float input = ...; float output; // execute inclusive min scan block_scan_float().inclusive_scan( input, output, storage, rocprim::minimum<float>() ); ... }
If the
input
values across threads in a block are{1, -2, 3, -4, ..., 255, -256}
, thenoutput
values in will be{1, -2, -2, -4, ..., -254, -256}
.
- Template Parameters:
BinaryFunction – - type of binary function used for scan. Default type is rocprim::plus<T>.
- Parameters:
input – [in] - thread input value.
output – [out] - reference to a thread output value. May be aliased with
input
.storage – [in] - reference to a temporary storage object of type storage_type.
scan_op – [in] - binary operation function object that will be used for scan. The signature of the function should be equivalent to the following:
T f(const T &a, const T &b);
. The signature does not need to haveconst &
, but function object must not modify the objects passed to it.
-
template<class BinaryFunction = ::rocprim::plus<T>>
__device__ inline void inclusive_scan(T input, T &output, BinaryFunction scan_op = BinaryFunction())# Performs inclusive scan across threads in a block.
This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
This overload does not accept storage argument. Required shared memory is allocated by the method itself.
- Template Parameters:
BinaryFunction – - type of binary function used for scan. Default type is rocprim::plus<T>.
- Parameters:
input – [in] - thread input value.
output – [out] - reference to a thread output value. May be aliased with
input
.scan_op – [in] - binary operation function object that will be used for scan. The signature of the function should be equivalent to the following:
T f(const T &a, const T &b);
. The signature does not need to haveconst &
, but function object must not modify the objects passed to it.
-
template<class BinaryFunction = ::rocprim::plus<T>>
__device__ inline void inclusive_scan(T input, T &output, T &reduction, storage_type &storage, BinaryFunction scan_op = BinaryFunction())# Performs inclusive scan and reduction across threads in a block.
- Storage reusage
Synchronization barrier should be placed before
storage
is reused or repurposed:__syncthreads()
orrocprim::syncthreads()
.- Examples
The examples present inclusive min scan operations performed on a block of 256 threads, each provides one
float
value.__global__ void example_kernel(...) // blockDim.x = 256 { // specialize block_scan for float and block of 256 threads using block_scan_f = rocprim::block_scan<float, 256>; // allocate storage in shared memory for the block __shared__ block_scan_float::storage_type storage; float input = ...; float output; float reduction; // execute inclusive min scan block_scan_float().inclusive_scan( input, output, reduction, storage, rocprim::minimum<float>() ); ... }
If the
input
values across threads in a block are{1, -2, 3, -4, ..., 255, -256}
, thenoutput
values in will be{1, -2, -2, -4, ..., -254, -256}
, and thereduction
will be-256
.
- Template Parameters:
BinaryFunction – - type of binary function used for scan. Default type is rocprim::plus<T>.
- Parameters:
input – [in] - thread input value.
output – [out] - reference to a thread output value. May be aliased with
input
.reduction – [out] - result of reducing of all
input
values in a block.storage – [in] - reference to a temporary storage object of type storage_type.
scan_op – [in] - binary operation function object that will be used for scan. The signature of the function should be equivalent to the following:
T f(const T &a, const T &b);
. The signature does not need to haveconst &
, but function object must not modify the objects passed to it.
-
template<class BinaryFunction = ::rocprim::plus<T>>
__device__ inline void inclusive_scan(T input, T &output, T &reduction, BinaryFunction scan_op = BinaryFunction())# Performs inclusive scan and reduction across threads in a block.
This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
This overload does not accept storage argument. Required shared memory is allocated by the method itself.
- Template Parameters:
BinaryFunction – - type of binary function used for scan. Default type is rocprim::plus<T>.
- Parameters:
input – [in] - thread input value.
output – [out] - reference to a thread output value. May be aliased with
input
.reduction – [out] - result of reducing of all
input
values in a block.scan_op – [in] - binary operation function object that will be used for scan. The signature of the function should be equivalent to the following:
T f(const T &a, const T &b);
. The signature does not need to haveconst &
, but function object must not modify the objects passed to it.
-
template<class PrefixCallback, class BinaryFunction = ::rocprim::plus<T>>
__device__ inline void inclusive_scan(T input, T &output, storage_type &storage, PrefixCallback &prefix_callback_op, BinaryFunction scan_op)# Performs inclusive scan across threads in a block, and uses
prefix_callback_op
to generate prefix value for the whole block.- Storage reusage
Synchronization barrier should be placed before
storage
is reused or repurposed:__syncthreads()
orrocprim::syncthreads()
.- Examples
The examples present inclusive prefix sum operations performed on a block of 256 threads, each thread provides one
int
value.struct my_block_prefix { int prefix; __device__ my_block_prefix(int prefix) : prefix(prefix) {} __device__ int operator()(int block_reduction) { int old_prefix = prefix; prefix = prefix + block_reduction; return old_prefix; } }; __global__ void example_kernel(...) // blockDim.x = 256 { // specialize block_scan for int and block of 256 threads using block_scan_f = rocprim::block_scan<int, 256>; // allocate storage in shared memory for the block __shared__ block_scan_int::storage_type storage; // init prefix functor my_block_prefix prefix_callback(10); int input; int output; // execute inclusive prefix sum block_scan_int().inclusive_scan( input, output, storage, prefix_callback, rocprim::plus<int>() ); ... }
If the
input
values across threads in a block are{1, 1, 1, ..., 1}
, thenoutput
values in will be{11, 12, 13, ..., 266}
, and theprefix
will be266
.
- Template Parameters:
PrefixCallback – - type of the unary function object used for generating block-wide prefix value for the scan operation.
BinaryFunction – - type of binary function used for scan. Default type is rocprim::plus<T>.
- Parameters:
input – [in] - thread input value.
output – [out] - reference to a thread output value. May be aliased with
input
.storage – [in] - reference to a temporary storage object of type storage_type.
prefix_callback_op – [inout] - function object for generating block prefix value. The signature of the
prefix_callback_op
should be equivalent to the following:T f(const T &block_reduction);
. The signature does not need to haveconst &
, but function object must not modify the objects passed to it. The object will be called by the first warp of the block with block reduction ofinput
values as input argument. The result of the first thread will be used as the block-wide prefix.scan_op – [in] - binary operation function object that will be used for scan. The signature of the function should be equivalent to the following:
T f(const T &a, const T &b);
. The signature does not need to haveconst &
, but function object must not modify the objects passed to it.
-
template<unsigned int ItemsPerThread, class BinaryFunction = ::rocprim::plus<T>>
__device__ inline void inclusive_scan(T (&input)[ItemsPerThread], T (&output)[ItemsPerThread], storage_type &storage, BinaryFunction scan_op = BinaryFunction())# Performs inclusive scan across threads in a block.
- Storage reusage
Synchronization barrier should be placed before
storage
is reused or repurposed:__syncthreads()
orrocprim::syncthreads()
.- Examples
The examples present inclusive maximum scan operations performed on a block of 128 threads, each provides two
long
value.__global__ void example_kernel(...) // blockDim.x = 128 { // specialize block_scan for long and block of 128 threads using block_scan_f = rocprim::block_scan<long, 128>; // allocate storage in shared memory for the block __shared__ block_scan_long::storage_type storage; long input[2] = ...; long output[2]; // execute inclusive min scan block_scan_long().inclusive_scan( input, output, storage, rocprim::maximum<long>() ); ... }
If the
input
values across threads in a block are{-1, 2, -3, 4, ..., -255, 256}
, thenoutput
values in will be{-1, 2, 2, 4, ..., 254, 256}
.
- Template Parameters:
ItemsPerThread – - number of items in the
input
array.BinaryFunction – - type of binary function used for scan. Default type is rocprim::plus<T>.
- Parameters:
input – [in] - reference to an array containing thread input values.
output – [out] - reference to a thread output array. May be aliased with
input
.storage – [in] - reference to a temporary storage object of type storage_type.
scan_op – [in] - binary operation function object that will be used for scan. The signature of the function should be equivalent to the following:
T f(const T &a, const T &b);
. The signature does not need to haveconst &
, but function object must not modify the objects passed to it.
-
template<unsigned int ItemsPerThread, class BinaryFunction = ::rocprim::plus<T>>
__device__ inline void inclusive_scan(T (&input)[ItemsPerThread], T (&output)[ItemsPerThread], BinaryFunction scan_op = BinaryFunction())# Performs inclusive scan across threads in a block.
This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
This overload does not accept storage argument. Required shared memory is allocated by the method itself.
- Template Parameters:
ItemsPerThread – - number of items in the
input
array.BinaryFunction – - type of binary function used for scan. Default type is rocprim::plus<T>.
- Parameters:
input – [in] - reference to an array containing thread input values.
output – [out] - reference to a thread output array. May be aliased with
input
.scan_op – [in] - binary operation function object that will be used for scan. The signature of the function should be equivalent to the following:
T f(const T &a, const T &b);
. The signature does not need to haveconst &
, but function object must not modify the objects passed to it.
-
template<unsigned int ItemsPerThread, class BinaryFunction = ::rocprim::plus<T>>
__device__ inline void inclusive_scan(T (&input)[ItemsPerThread], T (&output)[ItemsPerThread], T &reduction, storage_type &storage, BinaryFunction scan_op = BinaryFunction())# Performs inclusive scan and reduction across threads in a block.
- Storage reusage
Synchronization barrier should be placed before
storage
is reused or repurposed:__syncthreads()
orrocprim::syncthreads()
.- Examples
The examples present inclusive maximum scan operations performed on a block of 128 threads, each provides two
long
value.__global__ void example_kernel(...) // blockDim.x = 128 { // specialize block_scan for long and block of 128 threads using block_scan_f = rocprim::block_scan<long, 128>; // allocate storage in shared memory for the block __shared__ block_scan_long::storage_type storage; long input[2] = ...; long output[2]; long reduction; // execute inclusive min scan block_scan_long().inclusive_scan( input, output, reduction, storage, rocprim::maximum<long>() ); ... }
If the
input
values across threads in a block are{-1, 2, -3, 4, ..., -255, 256}
, thenoutput
values in will be{-1, 2, 2, 4, ..., 254, 256}
and thereduction
will be256
.
- Template Parameters:
ItemsPerThread – - number of items in the
input
array.BinaryFunction – - type of binary function used for scan. Default type is rocprim::plus<T>.
- Parameters:
input – [in] - reference to an array containing thread input values.
output – [out] - reference to a thread output array. May be aliased with
input
.reduction – [out] - result of reducing of all
input
values in a block.storage – [in] - reference to a temporary storage object of type storage_type.
scan_op – [in] - binary operation function object that will be used for scan. The signature of the function should be equivalent to the following:
T f(const T &a, const T &b);
. The signature does not need to haveconst &
, but function object must not modify the objects passed to it.
-
template<unsigned int ItemsPerThread, class BinaryFunction = ::rocprim::plus<T>>
__device__ inline void inclusive_scan(T (&input)[ItemsPerThread], T (&output)[ItemsPerThread], T &reduction, BinaryFunction scan_op = BinaryFunction())# Performs inclusive scan and reduction across threads in a block.
This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
This overload does not accept storage argument. Required shared memory is allocated by the method itself.
- Template Parameters:
ItemsPerThread – - number of items in the
input
array.BinaryFunction – - type of binary function used for scan. Default type is rocprim::plus<T>.
- Parameters:
input – [in] - reference to an array containing thread input values.
output – [out] - reference to a thread output array. May be aliased with
input
.reduction – [out] - result of reducing of all
input
values in a block.scan_op – [in] - binary operation function object that will be used for scan. The signature of the function should be equivalent to the following:
T f(const T &a, const T &b);
. The signature does not need to haveconst &
, but function object must not modify the objects passed to it.
-
template<unsigned int ItemsPerThread, class PrefixCallback, class BinaryFunction>
__device__ inline void inclusive_scan(T (&input)[ItemsPerThread], T (&output)[ItemsPerThread], storage_type &storage, PrefixCallback &prefix_callback_op, BinaryFunction scan_op)# Performs inclusive scan across threads in a block, and uses
prefix_callback_op
to generate prefix value for the whole block.- Storage reusage
Synchronization barrier should be placed before
storage
is reused or repurposed:__syncthreads()
orrocprim::syncthreads()
.- Examples
The examples present inclusive prefix sum operations performed on a block of 128 threads, each thread provides two
int
value.struct my_block_prefix { int prefix; __device__ my_block_prefix(int prefix) : prefix(prefix) {} __device__ int operator()(int block_reduction) { int old_prefix = prefix; prefix = prefix + block_reduction; return old_prefix; } }; __global__ void example_kernel(...) // blockDim.x = 128 { // specialize block_scan for int and block of 128 threads using block_scan_f = rocprim::block_scan<int, 128>; // allocate storage in shared memory for the block __shared__ block_scan_int::storage_type storage; // init prefix functor my_block_prefix prefix_callback(10); int input[2] = ...; int output[2]; // execute inclusive prefix sum block_scan_int().inclusive_scan( input, output, storage, prefix_callback, rocprim::plus<int>() ); ... }
If the
input
values across threads in a block are{1, 1, 1, ..., 1}
, thenoutput
values in will be{11, 12, 13, ..., 266}
, and theprefix
will be266
.
- Template Parameters:
ItemsPerThread – - number of items in the
input
array.PrefixCallback – - type of the unary function object used for generating block-wide prefix value for the scan operation.
BinaryFunction – - type of binary function used for scan. Default type is rocprim::plus<T>.
- Parameters:
input – [in] - reference to an array containing thread input values.
output – [out] - reference to a thread output array. May be aliased with
input
.storage – [in] - reference to a temporary storage object of type storage_type.
prefix_callback_op – [inout] - function object for generating block prefix value. The signature of the
prefix_callback_op
should be equivalent to the following:T f(const T &block_reduction);
. The signature does not need to haveconst &
, but function object must not modify the objects passed to it. The object will be called by the first warp of the block with block reduction ofinput
values as input argument. The result of the first thread will be used as the block-wide prefix.scan_op – [in] - binary operation function object that will be used for scan. The signature of the function should be equivalent to the following:
T f(const T &a, const T &b);
. The signature does not need to haveconst &
, but function object must not modify the objects passed to it.
-
template<class BinaryFunction = ::rocprim::plus<T>>
__device__ inline void exclusive_scan(T input, T &output, T init, storage_type &storage, BinaryFunction scan_op = BinaryFunction())# Performs exclusive scan across threads in a block.
- Storage reusage
Synchronization barrier should be placed before
storage
is reused or repurposed:__syncthreads()
orrocprim::syncthreads()
.- Examples
The examples present exclusive min scan operations performed on a block of 256 threads, each provides one
float
value.__global__ void example_kernel(...) // blockDim.x = 256 { // specialize block_scan for float and block of 256 threads using block_scan_f = rocprim::block_scan<float, 256>; // allocate storage in shared memory for the block __shared__ block_scan_float::storage_type storage; float init = ...; float input = ...; float output; // execute exclusive min scan block_scan_float().exclusive_scan( input, output, init, storage, rocprim::minimum<float>() ); ... }
If the
input
values across threads in a block are{1, -2, 3, -4, ..., 255, -256}
andinit
is0
, thenoutput
values in will be{0, 0, -2, -2, -4, ..., -254, -254}
.
- Template Parameters:
BinaryFunction – - type of binary function used for scan. Default type is rocprim::plus<T>.
- Parameters:
input – [in] - thread input value.
output – [out] - reference to a thread output value. May be aliased with
input
.storage – [in] - reference to a temporary storage object of type storage_type.
init – [in] - initial value used to start the exclusive scan. Should be the same for all threads in a block.
scan_op – [in] - binary operation function object that will be used for scan. The signature of the function should be equivalent to the following:
T f(const T &a, const T &b);
. The signature does not need to haveconst &
, but function object must not modify the objects passed to it.
-
template<class BinaryFunction = ::rocprim::plus<T>>
__device__ inline void exclusive_scan(T input, T &output, T init, BinaryFunction scan_op = BinaryFunction())# Performs exclusive scan across threads in a block.
This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
This overload does not accept storage argument. Required shared memory is allocated by the method itself.
- Template Parameters:
BinaryFunction – - type of binary function used for scan. Default type is rocprim::plus<T>.
- Parameters:
input – [in] - thread input value.
output – [out] - reference to a thread output value. May be aliased with
input
.init – [in] - initial value used to start the exclusive scan. Should be the same for all threads in a block.
scan_op – [in] - binary operation function object that will be used for scan. The signature of the function should be equivalent to the following:
T f(const T &a, const T &b);
. The signature does not need to haveconst &
, but function object must not modify the objects passed to it.
-
template<class BinaryFunction = ::rocprim::plus<T>>
__device__ inline void exclusive_scan(T input, T &output, T init, T &reduction, storage_type &storage, BinaryFunction scan_op = BinaryFunction())# Performs exclusive scan and reduction across threads in a block.
- Storage reusage
Synchronization barrier should be placed before
storage
is reused or repurposed:__syncthreads()
orrocprim::syncthreads()
.- Examples
The examples present exclusive min scan operations performed on a block of 256 threads, each provides one
float
value.__global__ void example_kernel(...) // blockDim.x = 256 { // specialize block_scan for float and block of 256 threads using block_scan_f = rocprim::block_scan<float, 256>; // allocate storage in shared memory for the block __shared__ block_scan_float::storage_type storage; float init = 0; float input = ...; float output; float reduction; // execute exclusive min scan block_scan_float().exclusive_scan( input, output, init, reduction, storage, rocprim::minimum<float>() ); ... }
If the
input
values across threads in a block are{1, -2, 3, -4, ..., 255, -256}
andinit
is0
, thenoutput
values in will be{0, 0, -2, -2, -4, ..., -254, -254}
and thereduction
will be-256
.
- Template Parameters:
BinaryFunction – - type of binary function used for scan. Default type is rocprim::plus<T>.
- Parameters:
input – [in] - thread input value.
output – [out] - reference to a thread output value. May be aliased with
input
.init – [in] - initial value used to start the exclusive scan. Should be the same for all threads in a block.
reduction – [out] - result of reducing of all
input
values in a block.storage – [in] - reference to a temporary storage object of type storage_type.
scan_op – [in] - binary operation function object that will be used for scan. The signature of the function should be equivalent to the following:
T f(const T &a, const T &b);
. The signature does not need to haveconst &
, but function object must not modify the objects passed to it.
-
template<class BinaryFunction = ::rocprim::plus<T>>
__device__ inline void exclusive_scan(T input, T &output, T init, T &reduction, BinaryFunction scan_op = BinaryFunction())# Performs exclusive scan and reduction across threads in a block.
This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
This overload does not accept storage argument. Required shared memory is allocated by the method itself.
- Template Parameters:
BinaryFunction – - type of binary function used for scan. Default type is rocprim::plus<T>.
- Parameters:
input – [in] - thread input value.
output – [out] - reference to a thread output value. May be aliased with
input
.init – [in] - initial value used to start the exclusive scan. Should be the same for all threads in a block.
reduction – [out] - result of reducing of all
input
values in a block.scan_op – [in] - binary operation function object that will be used for scan. The signature of the function should be equivalent to the following:
T f(const T &a, const T &b);
. The signature does not need to haveconst &
, but function object must not modify the objects passed to it.
-
template<class PrefixCallback, class BinaryFunction = ::rocprim::plus<T>>
__device__ inline void exclusive_scan(T input, T &output, storage_type &storage, PrefixCallback &prefix_callback_op, BinaryFunction scan_op)# Performs exclusive scan across threads in a block, and uses
prefix_callback_op
to generate prefix value for the whole block.- Storage reusage
Synchronization barrier should be placed before
storage
is reused or repurposed:__syncthreads()
orrocprim::syncthreads()
.- Examples
The examples present exclusive prefix sum operations performed on a block of 256 threads, each thread provides one
int
value.struct my_block_prefix { int prefix; __device__ my_block_prefix(int prefix) : prefix(prefix) {} __device__ int operator()(int block_reduction) { int old_prefix = prefix; prefix = prefix + block_reduction; return old_prefix; } }; __global__ void example_kernel(...) // blockDim.x = 256 { // specialize block_scan for int and block of 256 threads using block_scan_f = rocprim::block_scan<int, 256>; // allocate storage in shared memory for the block __shared__ block_scan_int::storage_type storage; // init prefix functor my_block_prefix prefix_callback(10); int input; int output; // execute exclusive prefix sum block_scan_int().exclusive_scan( input, output, storage, prefix_callback, rocprim::plus<int>() ); ... }
If the
input
values across threads in a block are{1, 1, 1, ..., 1}
, thenoutput
values in will be{10, 11, 12, 13, ..., 265}
, and theprefix
will be266
.
- Template Parameters:
PrefixCallback – - type of the unary function object used for generating block-wide prefix value for the scan operation.
BinaryFunction – - type of binary function used for scan. Default type is rocprim::plus<T>.
- Parameters:
input – [in] - thread input value.
output – [out] - reference to a thread output value. May be aliased with
input
.storage – [in] - reference to a temporary storage object of type storage_type.
prefix_callback_op – [inout] - function object for generating block prefix value. The signature of the
prefix_callback_op
should be equivalent to the following:T f(const T &block_reduction);
. The signature does not need to haveconst &
, but function object must not modify the objects passed to it. The object will be called by the first warp of the block with block reduction ofinput
values as input argument. The result of the first thread will be used as the block-wide prefix.scan_op – [in] - binary operation function object that will be used for scan. The signature of the function should be equivalent to the following:
T f(const T &a, const T &b);
. The signature does not need to haveconst &
, but function object must not modify the objects passed to it.
-
template<unsigned int ItemsPerThread, class BinaryFunction = ::rocprim::plus<T>>
__device__ inline void exclusive_scan(T (&input)[ItemsPerThread], T (&output)[ItemsPerThread], T init, storage_type &storage, BinaryFunction scan_op = BinaryFunction())# Performs exclusive scan across threads in a block.
- Storage reusage
Synchronization barrier should be placed before
storage
is reused or repurposed:__syncthreads()
orrocprim::syncthreads()
.- Examples
The examples present exclusive maximum scan operations performed on a block of 128 threads, each provides two
long
value.__global__ void example_kernel(...) // blockDim.x = 128 { // specialize block_scan for long and block of 128 threads using block_scan_f = rocprim::block_scan<long, 128>; // allocate storage in shared memory for the block __shared__ block_scan_long::storage_type storage; long init = ...; long input[2] = ...; long output[2]; // execute exclusive min scan block_scan_long().exclusive_scan( input, output, init, storage, rocprim::maximum<long>() ); ... }
If the
input
values across threads in a block are{-1, 2, -3, 4, ..., -255, 256}
andinit
is 0, thenoutput
values in will be{0, 0, 2, 2, 4, ..., 254, 254}
.
- Template Parameters:
ItemsPerThread – - number of items in the
input
array.BinaryFunction – - type of binary function used for scan. Default type is rocprim::plus<T>.
- Parameters:
input – [in] - reference to an array containing thread input values.
output – [out] - reference to a thread output array. May be aliased with
input
.init – [in] - initial value used to start the exclusive scan. Should be the same for all threads in a block.
storage – [in] - reference to a temporary storage object of type storage_type.
scan_op – [in] - binary operation function object that will be used for scan. The signature of the function should be equivalent to the following:
T f(const T &a, const T &b);
. The signature does not need to haveconst &
, but function object must not modify the objects passed to it.
-
template<unsigned int ItemsPerThread, class BinaryFunction = ::rocprim::plus<T>>
__device__ inline void exclusive_scan(T (&input)[ItemsPerThread], T (&output)[ItemsPerThread], T init, BinaryFunction scan_op = BinaryFunction())# Performs exclusive scan across threads in a block.
This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
This overload does not accept storage argument. Required shared memory is allocated by the method itself.
- Template Parameters:
ItemsPerThread – - number of items in the
input
array.BinaryFunction – - type of binary function used for scan. Default type is rocprim::plus<T>.
- Parameters:
input – [in] - reference to an array containing thread input values.
output – [out] - reference to a thread output array. May be aliased with
input
.init – [in] - initial value used to start the exclusive scan. Should be the same for all threads in a block.
scan_op – [in] - binary operation function object that will be used for scan. The signature of the function should be equivalent to the following:
T f(const T &a, const T &b);
. The signature does not need to haveconst &
, but function object must not modify the objects passed to it.
-
template<unsigned int ItemsPerThread, class BinaryFunction = ::rocprim::plus<T>>
__device__ inline void exclusive_scan(T (&input)[ItemsPerThread], T (&output)[ItemsPerThread], T init, T &reduction, storage_type &storage, BinaryFunction scan_op = BinaryFunction())# Performs exclusive scan and reduction across threads in a block.
- Storage reusage
Synchronization barrier should be placed before
storage
is reused or repurposed:__syncthreads()
orrocprim::syncthreads()
.- Examples
The examples present exclusive maximum scan operations performed on a block of 128 threads, each provides two
long
value.__global__ void example_kernel(...) // blockDim.x = 128 { // specialize block_scan for long and block of 128 threads using block_scan_f = rocprim::block_scan<long, 128>; // allocate storage in shared memory for the block __shared__ block_scan_long::storage_type storage; long init = ...; long input[2] = ...; long output[2]; long reduction; // execute exclusive min scan block_scan_long().exclusive_scan( input, output, init, reduction, storage, rocprim::maximum<long>() ); ... }
If the
input
values across threads in a block are{-1, 2, -3, 4, ..., -255, 256}
andinit
is 0, thenoutput
values in will be{0, 0, 2, 2, 4, ..., 254, 254}
and thereduction
will be256
.
- Template Parameters:
ItemsPerThread – - number of items in the
input
array.BinaryFunction – - type of binary function used for scan. Default type is rocprim::plus<T>.
- Parameters:
input – [in] - reference to an array containing thread input values.
output – [out] - reference to a thread output array. May be aliased with
input
.init – [in] - initial value used to start the exclusive scan. Should be the same for all threads in a block.
reduction – [out] - result of reducing of all
input
values in a block.storage – [in] - reference to a temporary storage object of type storage_type.
scan_op – [in] - binary operation function object that will be used for scan. The signature of the function should be equivalent to the following:
T f(const T &a, const T &b);
. The signature does not need to haveconst &
, but function object must not modify the objects passed to it.
-
template<unsigned int ItemsPerThread, class BinaryFunction = ::rocprim::plus<T>>
__device__ inline void exclusive_scan(T (&input)[ItemsPerThread], T (&output)[ItemsPerThread], T init, T &reduction, BinaryFunction scan_op = BinaryFunction())# Performs exclusive scan and reduction across threads in a block.
This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
This overload does not accept storage argument. Required shared memory is allocated by the method itself.
- Template Parameters:
ItemsPerThread – - number of items in the
input
array.BinaryFunction – - type of binary function used for scan. Default type is rocprim::plus<T>.
- Parameters:
input – [in] - reference to an array containing thread input values.
output – [out] - reference to a thread output array. May be aliased with
input
.init – [in] - initial value used to start the exclusive scan. Should be the same for all threads in a block.
reduction – [out] - result of reducing of all
input
values in a block.scan_op – [in] - binary operation function object that will be used for scan. The signature of the function should be equivalent to the following:
T f(const T &a, const T &b);
. The signature does not need to haveconst &
, but function object must not modify the objects passed to it.
-
template<unsigned int ItemsPerThread, class PrefixCallback, class BinaryFunction>
__device__ inline void exclusive_scan(T (&input)[ItemsPerThread], T (&output)[ItemsPerThread], storage_type &storage, PrefixCallback &prefix_callback_op, BinaryFunction scan_op)# Performs exclusive scan across threads in a block, and uses
prefix_callback_op
to generate prefix value for the whole block.- Storage reusage
Synchronization barrier should be placed before
storage
is reused or repurposed:__syncthreads()
orrocprim::syncthreads()
.- Examples
The examples present exclusive prefix sum operations performed on a block of 128 threads, each thread provides two
int
value.struct my_block_prefix { int prefix; __device__ my_block_prefix(int prefix) : prefix(prefix) {} __device__ int operator()(int block_reduction) { int old_prefix = prefix; prefix = prefix + block_reduction; return old_prefix; } }; __global__ void example_kernel(...) // blockDim.x = 128 { // specialize block_scan for int and block of 128 threads using block_scan_f = rocprim::block_scan<int, 128>; // allocate storage in shared memory for the block __shared__ block_scan_int::storage_type storage; // init prefix functor my_block_prefix prefix_callback(10); int input[2] = ...; int output[2]; // execute exclusive prefix sum block_scan_int().exclusive_scan( input, output, storage, prefix_callback, rocprim::plus<int>() ); ... }
If the
input
values across threads in a block are{1, 1, 1, ..., 1}
, thenoutput
values in will be{10, 11, 12, 13, ..., 265}
, and theprefix
will be266
.
- Template Parameters:
ItemsPerThread – - number of items in the
input
array.PrefixCallback – - type of the unary function object used for generating block-wide prefix value for the scan operation.
BinaryFunction – - type of binary function used for scan. Default type is rocprim::plus<T>.
- Parameters:
input – [in] - reference to an array containing thread input values.
output – [out] - reference to a thread output array. May be aliased with
input
.storage – [in] - reference to a temporary storage object of type storage_type.
prefix_callback_op – [inout] - function object for generating block prefix value. The signature of the
prefix_callback_op
should be equivalent to the following:T f(const T &block_reduction);
. The signature does not need to haveconst &
, but function object must not modify the objects passed to it. The object will be called by the first warp of the block with block reduction ofinput
values as input argument. The result of the first thread will be used as the block-wide prefix.scan_op – [in] - binary operation function object that will be used for scan. The signature of the function should be equivalent to the following:
T f(const T &a, const T &b);
. The signature does not need to haveconst &
, but function object must not modify the objects passed to it.
Algorithms#
-
enum class rocprim::block_scan_algorithm#
Available algorithms for block_scan primitive.
Values:
-
enumerator reduce_then_scan#
An algorithm which limits calculations to a single hardware warp.
-
enumerator default_algorithm#
Default block_scan algorithm.
-
enumerator reduce_then_scan#