Adjacent difference

template<class T, unsigned int BlockSizeX, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
class block_adjacent_difference

The block_adjacent_difference class is a block-level parallel primitive which provides methods for applying binary functions to pairs of consecutive items partitioned across a thread block.

Overview

  • There are two types of flags:

    • Head flags.

    • Tail flags.

  • The above flags are used to differentiate items from their predecessors or successors.

  • E.g. Head flags are convenient for differentiating disjoint data segments as part of a segmented reduction/scan.
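As a minimal host-side sketch of what head flags express, the plain C++ helper below marks the first item of every run of equal values, which is the information a segmented reduction or scan needs. It is sequential code, not rocPRIM code; the name head_flags_reference and the choice of inequality as the comparison are assumptions made for illustration.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sequential reference, not part of rocPRIM:
// marks the first item of every run of equal values.
std::vector<int> head_flags_reference(const std::vector<int>& input)
{
    std::vector<int> flags(input.size(), 0);
    for (std::size_t i = 0; i < input.size(); ++i)
    {
        // The first item has no predecessor, so it always starts a segment.
        flags[i] = (i == 0) || (input[i] != input[i - 1]);
    }
    return flags;
}
```

For the input {1, 1, 2, 2, 2, 5} this yields the flags {1, 0, 1, 0, 0, 1}: three segments, starting at positions 0, 2 and 5.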

Examples

In the examples, the discontinuity operation is performed on a block of 128 threads, using type int.

__global__ void example_kernel(...)
{
    // specialize discontinuity for int and a block of 128 threads
    using block_adjacent_difference_int = rocprim::block_adjacent_difference<int, 128>;
    // allocate storage in shared memory
    __shared__ block_adjacent_difference_int::storage_type storage;

    // segment of consecutive items to be used
    int input[8];
    ...
    int head_flags[8];
    block_adjacent_difference_int b_discontinuity;
    using flag_op_type = typename rocprim::greater<int>;
    b_discontinuity.flag_heads(head_flags, input, flag_op_type(), storage);
    ...
}

Template Parameters:
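  • T - the input type.

  • BlockSizeX - the number of threads in the block's x dimension.

  • BlockSizeY - the number of threads in the block's y dimension, defaults to 1.

  • BlockSizeZ - the number of threads in the block's z dimension, defaults to 1.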

Public Types

using storage_type = storage_type_

Struct used to allocate the temporary memory that is required for thread communication during the operations provided by the related parallel primitive.

Depending on the implementation, the operations exposed by the parallel primitive may require temporary storage for thread communication. The storage should be allocated using the __shared__ keyword. It can be aliased to externally allocated memory, or be part of a union type with other storage types to increase shared memory reusability.

Public Functions

template<unsigned int ItemsPerThread, class Flag, class FlagOp>
__device__ inline void flag_heads(Flag (&head_flags)[ItemsPerThread], const T (&input)[ItemsPerThread], FlagOp flag_op, storage_type &storage)

Tags head_flags that indicate discontinuities between items partitioned across the thread block, where the first item has no reference and is always flagged.

Deprecated:

The flags API of block_adjacent_difference is deprecated, use subtract_left() or block_discontinuity::flag_heads() instead.

Storage reuse

Synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().

Example.
__global__ void example_kernel(...)
{
    // specialize discontinuity for int and a block of 128 threads
    using block_adjacent_difference_int = rocprim::block_adjacent_difference<int, 128>;
    // allocate storage in shared memory
    __shared__ block_adjacent_difference_int::storage_type storage;

    // segment of consecutive items to be used
    int input[8];
    ...
    int head_flags[8];
    block_adjacent_difference_int b_discontinuity;
    using flag_op_type = typename rocprim::greater<int>;
    b_discontinuity.flag_heads(head_flags, input, flag_op_type(), storage);
    ...
}

Template Parameters:
  • ItemsPerThread - [inferred] the number of items to be processed by each thread.

  • Flag - [inferred] the flag type.

  • FlagOp - [inferred] type of binary function used for flagging.

Parameters:
  • head_flags[out] - array that contains the head flags.

  • input[in] - array that data is loaded from.

  • flag_op[in] - binary operation function object that will be used for flagging. The signature of the function should be equivalent to the following: bool f(const T &a, const T &b); or bool f(const T &a, const T &b, unsigned int b_index);. The signature does not need to have const &, but the function object must not modify the objects passed to it.

  • storage[in] - reference to a temporary storage object of type storage_type.
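The flagging rule can be modelled sequentially on the host. The sketch below is not rocPRIM code: flag_heads_model is a hypothetical helper, the tile is a flat vector holding the items of all threads in order, and it assumes that flag_op(a, b) receives the preceding item as a and the current item as b.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical host-side model of flag_heads() without a tile predecessor.
// Assumption: flag_op(a, b) is called with a = predecessor, b = current item.
template<class T, class FlagOp>
std::vector<int> flag_heads_model(const std::vector<T>& tile, FlagOp flag_op)
{
    std::vector<int> head_flags(tile.size());
    for (std::size_t i = 0; i < tile.size(); ++i)
    {
        // The first item has no reference and is always flagged.
        head_flags[i] = (i == 0) ? 1 : (flag_op(tile[i - 1], tile[i]) ? 1 : 0);
    }
    return head_flags;
}
```

With std::greater<int> standing in for rocprim::greater<int>, the tile {1, 4, 9, 2, 3, 3, 1} gives {1, 0, 0, 1, 0, 0, 1}: a flag wherever an item is smaller than its predecessor, plus the unconditional flag on the first item.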

template<unsigned int ItemsPerThread, class Flag, class FlagOp>
__device__ inline void flag_heads(Flag (&head_flags)[ItemsPerThread], const T (&input)[ItemsPerThread], FlagOp flag_op)

This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

Deprecated:

The flags API of block_adjacent_difference is deprecated, use subtract_left() or block_discontinuity::flag_heads() instead.

This overload does not accept a reference to temporary storage; instead, the storage is declared as part of the function itself. Note that this does NOT decrease the shared memory requirements of a kernel using this function.

template<unsigned int ItemsPerThread, class Flag, class FlagOp>
__device__ inline void flag_heads(Flag (&head_flags)[ItemsPerThread], T tile_predecessor_item, const T (&input)[ItemsPerThread], FlagOp flag_op, storage_type &storage)

Tags head_flags that indicate discontinuities between items partitioned across the thread block, where the first item of the first thread is compared against a tile_predecessor_item.

Deprecated:

The flags API of block_adjacent_difference is deprecated, use subtract_left() or block_discontinuity::flag_heads() instead.

Storage reuse

Synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().

Example.
__global__ void example_kernel(...)
{
    // specialize discontinuity for int and a block of 128 threads
    using block_adjacent_difference_int = rocprim::block_adjacent_difference<int, 128>;
    // allocate storage in shared memory
    __shared__ block_adjacent_difference_int::storage_type storage;

    // segment of consecutive items to be used
    int input[8];
    int tile_item = 0;
    if (threadIdx.x == 0)
    {
        tile_item = ...
    }
    ...
    int head_flags[8];
    block_adjacent_difference_int b_discontinuity;
    using flag_op_type = typename rocprim::greater<int>;
    b_discontinuity.flag_heads(head_flags, tile_item, input, flag_op_type(),
                               storage);
    ...
}

Template Parameters:
  • ItemsPerThread - [inferred] the number of items to be processed by each thread.

  • Flag - [inferred] the flag type.

  • FlagOp - [inferred] type of binary function used for flagging.

Parameters:
  • head_flags[out] - array that contains the head flags.

  • tile_predecessor_item[in] - the item preceding the tile, against which the first item of the first thread is compared.

  • input[in] - array that data is loaded from.

  • flag_op[in] - binary operation function object that will be used for flagging. The signature of the function should be equivalent to the following: bool f(const T &a, const T &b); or bool f(const T &a, const T &b, unsigned int b_index);. The signature does not need to have const &, but the function object must not modify the objects passed to it.

  • storage[in] - reference to a temporary storage object of type storage_type.
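The effect of the tile predecessor can be seen in a small host-side model. This is a sketch rather than rocPRIM code: flag_heads_with_predecessor_model is a hypothetical name, the tile is a flat vector of all items in order, and flag_op(a, b) is assumed to receive the earlier item as a and the later item as b.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical host-side model, not rocPRIM code. Assumption: flag_op(a, b)
// receives the earlier item as a and the later item as b.
template<class T, class FlagOp>
std::vector<int> flag_heads_with_predecessor_model(const std::vector<T>& tile,
                                                   const T& tile_predecessor_item,
                                                   FlagOp flag_op)
{
    std::vector<int> head_flags(tile.size());
    for (std::size_t i = 0; i < tile.size(); ++i)
    {
        // The first item is compared against the item that precedes the tile
        // instead of being flagged unconditionally.
        const T previous = (i == 0) ? tile_predecessor_item : tile[i - 1];
        head_flags[i] = flag_op(previous, tile[i]) ? 1 : 0;
    }
    return head_flags;
}
```

For the tile {3, 3, 4} compared with std::not_equal_to<int>, a predecessor of 3 gives {0, 0, 1}, because the first item continues the segment of the previous tile, while a predecessor of 2 gives {1, 0, 1}.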

template<unsigned int ItemsPerThread, class Flag, class FlagOp>
__device__ inline void flag_heads(Flag (&head_flags)[ItemsPerThread], T tile_predecessor_item, const T (&input)[ItemsPerThread], FlagOp flag_op)

This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

Deprecated:

The flags API of block_adjacent_difference is deprecated, use subtract_left() or block_discontinuity::flag_heads() instead.

This overload does not accept a reference to temporary storage; instead, the storage is declared as part of the function itself. Note that this does NOT decrease the shared memory requirements of a kernel using this function.

template<unsigned int ItemsPerThread, class Flag, class FlagOp>
__device__ inline void flag_tails(Flag (&tail_flags)[ItemsPerThread], const T (&input)[ItemsPerThread], FlagOp flag_op, storage_type &storage)

Tags tail_flags that indicate discontinuities between items partitioned across the thread block, where the last item has no reference and is always flagged.

Deprecated:

The flags API of block_adjacent_difference is deprecated, use subtract_right() or block_discontinuity::flag_tails() instead.

Storage reuse

Synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().

Example.
__global__ void example_kernel(...)
{
    // specialize discontinuity for int and a block of 128 threads
    using block_adjacent_difference_int = rocprim::block_adjacent_difference<int, 128>;
    // allocate storage in shared memory
    __shared__ block_adjacent_difference_int::storage_type storage;

    // segment of consecutive items to be used
    int input[8];
    ...
    int tail_flags[8];
    block_adjacent_difference_int b_discontinuity;
    using flag_op_type = typename rocprim::greater<int>;
    b_discontinuity.flag_tails(tail_flags, input, flag_op_type(), storage);
    ...
}

Template Parameters:
  • ItemsPerThread - [inferred] the number of items to be processed by each thread.

  • Flag - [inferred] the flag type.

  • FlagOp - [inferred] type of binary function used for flagging.

Parameters:
  • tail_flags[out] - array that contains the tail flags.

  • input[in] - array that data is loaded from.

  • flag_op[in] - binary operation function object that will be used for flagging. The signature of the function should be equivalent to the following: bool f(const T &a, const T &b); or bool f(const T &a, const T &b, unsigned int b_index);. The signature does not need to have const &, but the function object must not modify the objects passed to it.

  • storage[in] - reference to a temporary storage object of type storage_type.

template<unsigned int ItemsPerThread, class Flag, class FlagOp>
__device__ inline void flag_tails(Flag (&tail_flags)[ItemsPerThread], const T (&input)[ItemsPerThread], FlagOp flag_op)

This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

Deprecated:

The flags API of block_adjacent_difference is deprecated, use subtract_right() or block_discontinuity::flag_tails() instead.

This overload does not accept a reference to temporary storage; instead, the storage is declared as part of the function itself. Note that this does NOT decrease the shared memory requirements of a kernel using this function.

template<unsigned int ItemsPerThread, class Flag, class FlagOp>
__device__ inline void flag_tails(Flag (&tail_flags)[ItemsPerThread], T tile_successor_item, const T (&input)[ItemsPerThread], FlagOp flag_op, storage_type &storage)

Tags tail_flags that indicate discontinuities between items partitioned across the thread block, where the last item of the last thread is compared against a tile_successor_item.

Deprecated:

The flags API of block_adjacent_difference is deprecated, use subtract_right() or block_discontinuity::flag_tails() instead.

Storage reuse

Synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().

Example.
__global__ void example_kernel(...)
{
    // specialize discontinuity for int and a block of 128 threads
    using block_adjacent_difference_int = rocprim::block_adjacent_difference<int, 128>;
    // allocate storage in shared memory
    __shared__ block_adjacent_difference_int::storage_type storage;

    // segment of consecutive items to be used
    int input[8];
    int tile_item = 0;
    if (threadIdx.x == 0)
    {
        tile_item = ...
    }
    ...
    int tail_flags[8];
    block_adjacent_difference_int b_discontinuity;
    using flag_op_type = typename rocprim::greater<int>;
    b_discontinuity.flag_tails(tail_flags, tile_item, input, flag_op_type(),
                               storage);
    ...
}

Template Parameters:
  • ItemsPerThread - [inferred] the number of items to be processed by each thread.

  • Flag - [inferred] the flag type.

  • FlagOp - [inferred] type of binary function used for flagging.

Parameters:
  • tail_flags[out] - array that contains the tail flags.

  • tile_successor_item[in] - the item following the tile, against which the last item of the last thread is compared.

  • input[in] - array that data is loaded from.

  • flag_op[in] - binary operation function object that will be used for flagging. The signature of the function should be equivalent to the following: bool f(const T &a, const T &b); or bool f(const T &a, const T &b, unsigned int b_index);. The signature does not need to have const &, but the function object must not modify the objects passed to it.

  • storage[in] - reference to a temporary storage object of type storage_type.
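Tail flagging with a tile successor mirrors the head case. The host-side model below is only a sketch: flag_tails_with_successor_model is a hypothetical name, the tile is a flat vector of all items in order, and the argument order passed to flag_op (current item first, following item second) is an assumption. The example uses a symmetric comparison so that the result does not depend on that order.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical host-side model, not rocPRIM code. Assumption: flag_op is
// called with the current item first and the following item second.
template<class T, class FlagOp>
std::vector<int> flag_tails_with_successor_model(const std::vector<T>& tile,
                                                 const T& tile_successor_item,
                                                 FlagOp flag_op)
{
    std::vector<int> tail_flags(tile.size());
    for (std::size_t i = 0; i < tile.size(); ++i)
    {
        // The last item is compared against the item that follows the tile.
        const T next = (i + 1 == tile.size()) ? tile_successor_item : tile[i + 1];
        tail_flags[i] = flag_op(tile[i], next) ? 1 : 0;
    }
    return tail_flags;
}
```

For the tile {3, 3, 4} compared with std::not_equal_to<int>, a successor of 4 gives {0, 1, 0} and a successor of 5 gives {0, 1, 1}.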

template<unsigned int ItemsPerThread, class Flag, class FlagOp>
__device__ inline void flag_tails(Flag (&tail_flags)[ItemsPerThread], T tile_successor_item, const T (&input)[ItemsPerThread], FlagOp flag_op)

This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

Deprecated:

The flags API of block_adjacent_difference is deprecated, use subtract_right() or block_discontinuity::flag_tails() instead.

This overload does not accept a reference to temporary storage; instead, the storage is declared as part of the function itself. Note that this does NOT decrease the shared memory requirements of a kernel using this function.

template<unsigned int ItemsPerThread, class Flag, class FlagOp>
__device__ inline void flag_heads_and_tails(Flag (&head_flags)[ItemsPerThread], Flag (&tail_flags)[ItemsPerThread], const T (&input)[ItemsPerThread], FlagOp flag_op, storage_type &storage)

Tags both head_flags and tail_flags that indicate discontinuities between items partitioned across the thread block.

Storage reuse

Synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().

Example.
__global__ void example_kernel(...)
{
    // specialize discontinuity for int and a block of 128 threads
    using block_adjacent_difference_int = rocprim::block_adjacent_difference<int, 128>;
    // allocate storage in shared memory
    __shared__ block_adjacent_difference_int::storage_type storage;

    // segment of consecutive items to be used
    int input[8];
    ...
    int head_flags[8];
    int tail_flags[8];
    block_adjacent_difference_int b_discontinuity;
    using flag_op_type = typename rocprim::greater<int>;
    b_discontinuity.flag_heads_and_tails(head_flags, tail_flags, input,
                                         flag_op_type(), storage);
    ...
}

Template Parameters:
  • ItemsPerThread - [inferred] the number of items to be processed by each thread.

  • Flag - [inferred] the flag type.

  • FlagOp - [inferred] type of binary function used for flagging.

Parameters:
  • head_flags[out] - array that contains the head flags.

  • tail_flags[out] - array that contains the tail flags.

  • input[in] - array that data is loaded from.

  • flag_op[in] - binary operation function object that will be used for flagging. The signature of the function should be equivalent to the following: bool f(const T &a, const T &b); or bool f(const T &a, const T &b, unsigned int b_index);. The signature does not need to have const &, but the function object must not modify the objects passed to it.

  • storage[in] - reference to a temporary storage object of type storage_type.
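Taken together, head and tail flags delimit each segment on both sides. The following host-side sketch is not rocPRIM code: heads_and_tails_model is a hypothetical helper that flags wherever neighbouring items differ, and it assumes that the first item is always head-flagged and the last item is always tail-flagged, as in the separate flag_heads() and flag_tails() overloads without tile items.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical host-side model (not rocPRIM code): computes head and tail
// flags in one pass, flagging wherever neighbouring items differ.
std::pair<std::vector<int>, std::vector<int>>
    heads_and_tails_model(const std::vector<int>& tile)
{
    const std::size_t n = tile.size();
    std::vector<int> head_flags(n);
    std::vector<int> tail_flags(n);
    for (std::size_t i = 0; i < n; ++i)
    {
        // The first item has no predecessor and the last item has no
        // successor, so both are flagged unconditionally.
        head_flags[i] = (i == 0) || (tile[i - 1] != tile[i]);
        tail_flags[i] = (i + 1 == n) || (tile[i] != tile[i + 1]);
    }
    return {head_flags, tail_flags};
}
```

For the tile {7, 7, 8, 9, 9} the head flags are {1, 0, 1, 1, 0} and the tail flags are {0, 1, 1, 0, 1}. The item 8 carries both flags, which identifies it as a segment of length one.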

template<unsigned int ItemsPerThread, class Flag, class FlagOp>
__device__ inline void flag_heads_and_tails(Flag (&head_flags)[ItemsPerThread], Flag (&tail_flags)[ItemsPerThread], const T (&input)[ItemsPerThread], FlagOp flag_op)

This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

Deprecated:

The flags API of block_adjacent_difference is deprecated, use block_discontinuity::flag_heads_and_tails() instead.

This overload does not accept a reference to temporary storage; instead, the storage is declared as part of the function itself. Note that this does NOT decrease the shared memory requirements of a kernel using this function.

template<unsigned int ItemsPerThread, class Flag, class FlagOp>
__device__ inline void flag_heads_and_tails(Flag (&head_flags)[ItemsPerThread], Flag (&tail_flags)[ItemsPerThread], T tile_successor_item, const T (&input)[ItemsPerThread], FlagOp flag_op, storage_type &storage)

Tags both head_flags and tail_flags that indicate discontinuities between items partitioned across the thread block, where the last item of the last thread is compared against a tile_successor_item.

Deprecated:

The flags API of block_adjacent_difference is deprecated, use block_discontinuity::flag_heads_and_tails() instead.

Storage reuse

Synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().

Example.
__global__ void example_kernel(...)
{
    // specialize discontinuity for int and a block of 128 threads
    using block_adjacent_difference_int = rocprim::block_adjacent_difference<int, 128>;
    // allocate storage in shared memory
    __shared__ block_adjacent_difference_int::storage_type storage;

    // segment of consecutive items to be used
    int input[8];
    int tile_item = 0;
    if (threadIdx.x == 0)
    {
        tile_item = ...
    }
    ...
    int head_flags[8];
    int tail_flags[8];
    block_adjacent_difference_int b_discontinuity;
    using flag_op_type = typename rocprim::greater<int>;
    b_discontinuity.flag_heads_and_tails(head_flags, tail_flags, tile_item,
                                         input, flag_op_type(),
                                         storage);
    ...
}

Template Parameters:
  • ItemsPerThread - [inferred] the number of items to be processed by each thread.

  • Flag - [inferred] the flag type.

  • FlagOp - [inferred] type of binary function used for flagging.

Parameters:
  • head_flags[out] - array that contains the head flags.

  • tail_flags[out] - array that contains the tail flags.

  • tile_successor_item[in] - the item following the tile, against which the last item of the last thread is compared.

  • input[in] - array that data is loaded from.

  • flag_op[in] - binary operation function object that will be used for flagging. The signature of the function should be equivalent to the following: bool f(const T &a, const T &b); or bool f(const T &a, const T &b, unsigned int b_index);. The signature does not need to have const &, but the function object must not modify the objects passed to it.

  • storage[in] - reference to a temporary storage object of type storage_type.

template<unsigned int ItemsPerThread, class Flag, class FlagOp>
__device__ inline void flag_heads_and_tails(Flag (&head_flags)[ItemsPerThread], Flag (&tail_flags)[ItemsPerThread], T tile_successor_item, const T (&input)[ItemsPerThread], FlagOp flag_op)

This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

Deprecated:

The flags API of block_adjacent_difference is deprecated, use block_discontinuity::flag_heads_and_tails() instead.

This overload does not accept a reference to temporary storage; instead, the storage is declared as part of the function itself. Note that this does NOT decrease the shared memory requirements of a kernel using this function.

template<unsigned int ItemsPerThread, class Flag, class FlagOp>
__device__ inline void flag_heads_and_tails(Flag (&head_flags)[ItemsPerThread], T tile_predecessor_item, Flag (&tail_flags)[ItemsPerThread], const T (&input)[ItemsPerThread], FlagOp flag_op, storage_type &storage)

Tags both head_flags and tail_flags that indicate discontinuities between items partitioned across the thread block, where the first item of the first thread is compared against a tile_predecessor_item.

Deprecated:

The flags API of block_adjacent_difference is deprecated, use block_discontinuity::flag_heads_and_tails() instead.

Storage reuse

Synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().

Example.
__global__ void example_kernel(...)
{
    // specialize discontinuity for int and a block of 128 threads
    using block_adjacent_difference_int = rocprim::block_adjacent_difference<int, 128>;
    // allocate storage in shared memory
    __shared__ block_adjacent_difference_int::storage_type storage;

    // segment of consecutive items to be used
    int input[8];
    int tile_item = 0;
    if (threadIdx.x == 0)
    {
        tile_item = ...
    }
    ...
    int head_flags[8];
    int tail_flags[8];
    block_adjacent_difference_int b_discontinuity;
    using flag_op_type = typename rocprim::greater<int>;
    b_discontinuity.flag_heads_and_tails(head_flags, tile_item, tail_flags,
                                         input, flag_op_type(),
                                         storage);
    ...
}

Template Parameters:
  • ItemsPerThread - [inferred] the number of items to be processed by each thread.

  • Flag - [inferred] the flag type.

  • FlagOp - [inferred] type of binary function used for flagging.

Parameters:
  • head_flags[out] - array that contains the head flags.

  • tile_predecessor_item[in] - the item preceding the tile, against which the first item of the first thread is compared.

  • tail_flags[out] - array that contains the tail flags.

  • input[in] - array that data is loaded from.

  • flag_op[in] - binary operation function object that will be used for flagging. The signature of the function should be equivalent to the following: bool f(const T &a, const T &b); or bool f(const T &a, const T &b, unsigned int b_index);. The signature does not need to have const &, but the function object must not modify the objects passed to it.

  • storage[in] - reference to a temporary storage object of type storage_type.

template<unsigned int ItemsPerThread, class Flag, class FlagOp>
__device__ inline void flag_heads_and_tails(Flag (&head_flags)[ItemsPerThread], T tile_predecessor_item, Flag (&tail_flags)[ItemsPerThread], const T (&input)[ItemsPerThread], FlagOp flag_op)

This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

Deprecated:

The flags API of block_adjacent_difference is deprecated, use block_discontinuity::flag_heads_and_tails() instead.

This overload does not accept a reference to temporary storage; instead, the storage is declared as part of the function itself. Note that this does NOT decrease the shared memory requirements of a kernel using this function.

template<unsigned int ItemsPerThread, class Flag, class FlagOp>
__device__ inline void flag_heads_and_tails(Flag (&head_flags)[ItemsPerThread], T tile_predecessor_item, Flag (&tail_flags)[ItemsPerThread], T tile_successor_item, const T (&input)[ItemsPerThread], FlagOp flag_op, storage_type &storage)

Tags both head_flags and tail_flags that indicate discontinuities between items partitioned across the thread block, where the first item of the first thread is compared against a tile_predecessor_item and the last item of the last thread is compared against a tile_successor_item.

Deprecated:

The flags API of block_adjacent_difference is deprecated, use block_discontinuity::flag_heads_and_tails() instead.

Storage reuse

Synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().

Example.
__global__ void example_kernel(...)
{
    // specialize discontinuity for int and a block of 128 threads
    using block_adjacent_difference_int = rocprim::block_adjacent_difference<int, 128>;
    // allocate storage in shared memory
    __shared__ block_adjacent_difference_int::storage_type storage;

    // segment of consecutive items to be used
    int input[8];
    int tile_predecessor_item = 0;
    int tile_successor_item = 0;
    if (threadIdx.x == 0)
    {
        tile_predecessor_item = ...
        tile_successor_item = ...
    }
    ...
    int head_flags[8];
    int tail_flags[8];
    block_adjacent_difference_int b_discontinuity;
    using flag_op_type = typename rocprim::greater<int>;
    b_discontinuity.flag_heads_and_tails(head_flags, tile_predecessor_item,
                                         tail_flags, tile_successor_item,
                                         input, flag_op_type(),
                                         storage);
    ...
}

Template Parameters:
  • ItemsPerThread - [inferred] the number of items to be processed by each thread.

  • Flag - [inferred] the flag type.

  • FlagOp - [inferred] type of binary function used for flagging.

Parameters:
  • head_flags[out] - array that contains the head flags.

  • tile_predecessor_item[in] - the item preceding the tile, against which the first item of the first thread is compared.

  • tail_flags[out] - array that contains the tail flags.

  • tile_successor_item[in] - the item following the tile, against which the last item of the last thread is compared.

  • input[in] - array that data is loaded from.

  • flag_op[in] - binary operation function object that will be used for flagging. The signature of the function should be equivalent to the following: bool f(const T &a, const T &b); or bool f(const T &a, const T &b, unsigned int b_index);. The signature does not need to have const &, but the function object must not modify the objects passed to it.

  • storage[in] - reference to a temporary storage object of type storage_type.

template<unsigned int ItemsPerThread, class Flag, class FlagOp>
__device__ inline void flag_heads_and_tails(Flag (&head_flags)[ItemsPerThread], T tile_predecessor_item, Flag (&tail_flags)[ItemsPerThread], T tile_successor_item, const T (&input)[ItemsPerThread], FlagOp flag_op)

This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

Deprecated:

The flags API of block_adjacent_difference is deprecated, use block_discontinuity::flag_heads_and_tails() instead.

This overload does not accept a reference to temporary storage; instead, the storage is declared as part of the function itself. Note that this does NOT decrease the shared memory requirements of a kernel using this function.

template<typename Output, unsigned int ItemsPerThread, typename BinaryFunction>
__device__ inline void subtract_left(const T (&input)[ItemsPerThread], Output (&output)[ItemsPerThread], const BinaryFunction op, storage_type &storage)

Apply a function to each consecutive pair of elements partitioned across threads in the block and write the output to the position of the left item.

The first item in the first thread is copied from the input; for the remaining items, the following code applies.

// For each i in [1, block_size * ItemsPerThread) across threads in a block
output[i] = op(input[i], input[i-1]);

Storage reuse

Synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().

Template Parameters:
  • Output - [inferred] the type of output, must be assignable from the result of op

  • ItemsPerThread - [inferred] the number of items processed by each thread

  • BinaryFunction - [inferred] the type of the function to apply

Parameters:
  • input[in] - array that data is loaded from partitioned across the threads in the block

  • output[out] - array where the result of function application will be written to

  • op[in] - binary function applied to the items. The signature of the function should be equivalent to the following: R f(const T &a, const T &b);, where the result type R must be assignable to Output. The signature does not need to have const &, but the function object must not modify the objects passed to it.

  • storage - reference to a temporary storage object of type storage_type
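To illustrate what "across threads in a block" means for the index i, here is a host-side model in plain C++. It is a sketch, not rocPRIM code: subtract_left_model and items_per_thread are hypothetical names, and it assumes a blocked arrangement in which thread t owns the tile items [t * items_per_thread, (t + 1) * items_per_thread).

```cpp
#include <array>
#include <cstddef>
#include <functional>
#include <vector>

// Illustrative constant; in the real API this is the ItemsPerThread parameter.
constexpr std::size_t items_per_thread = 2;
using thread_items = std::array<int, items_per_thread>;

// Hypothetical host-side model: input[t] holds the items owned by thread t.
template<class BinaryFunction>
std::vector<thread_items> subtract_left_model(const std::vector<thread_items>& input,
                                              BinaryFunction op)
{
    std::vector<thread_items> output(input.size());
    for (std::size_t t = 0; t < input.size(); ++t)
    {
        for (std::size_t j = 0; j < items_per_thread; ++j)
        {
            if (t == 0 && j == 0)
            {
                // The first item in the first thread is copied from the input.
                output[t][j] = input[t][j];
                continue;
            }
            // The left neighbour is in the same thread, except for a thread's
            // first item, whose neighbour is the previous thread's last item.
            const int left = (j == 0) ? input[t - 1][items_per_thread - 1]
                                      : input[t][j - 1];
            output[t][j] = op(input[t][j], left);
        }
    }
    return output;
}
```

With std::minus<int> as op, two threads holding {1, 3} and {6, 10} produce {1, 2} and {3, 4}: the adjacent differences of the sequence 1, 3, 6, 10, with the first item passed through unchanged.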

template<typename Output, unsigned int ItemsPerThread, typename BinaryFunction>
__device__ inline void subtract_left(const T (&input)[ItemsPerThread], Output (&output)[ItemsPerThread], const BinaryFunction op, const T tile_predecessor, storage_type &storage)

Apply a function to each consecutive pair of elements partitioned across threads in the block and write the output to the position of the left item, with an explicit item before the tile.

// For the first item on the first thread use the tile predecessor
output[0] = op(input[0], tile_predecessor);
// For other items, i in [1, block_size * ItemsPerThread) across threads in a block
output[i] = op(input[i], input[i-1]);

Storage reuse

Synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().

Template Parameters:
  • Output - [inferred] the type of output, must be assignable from the result of op

  • ItemsPerThread - [inferred] the number of items processed by each thread

  • BinaryFunction - [inferred] the type of the function to apply

Parameters:
  • input[in] - array that data is loaded from partitioned across the threads in the block

  • output[out] - array where the result of function application will be written to

  • op[in] - binary function applied to the items. The signature of the function should be equivalent to the following: R f(const T &a, const T &b);, where the result type R must be assignable to Output. The signature does not need to have const &, but the function object must not modify the objects passed to it.

  • tile_predecessor[in] - the item before the tile, will be used as the input of the first application of op

  • storage - reference to a temporary storage object of type storage_type
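A tile predecessor is useful when a long sequence is processed one tile at a time. The host-side sketch below models that case sequentially; it is not rocPRIM code, subtract_left_with_predecessor_model is a hypothetical name, and the tile is represented as a flat vector of all items in order.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical host-side model of subtract_left() with a tile predecessor.
template<class T, class BinaryFunction>
std::vector<T> subtract_left_with_predecessor_model(const std::vector<T>& input,
                                                    BinaryFunction op,
                                                    const T& tile_predecessor)
{
    std::vector<T> output(input.size());
    for (std::size_t i = 0; i < input.size(); ++i)
    {
        // The first item uses the tile predecessor as its left neighbour.
        const T left = (i == 0) ? tile_predecessor : input[i - 1];
        output[i] = op(input[i], left);
    }
    return output;
}
```

If the sequence 1, 3, 6, 10 is split into two tiles, the second tile {6, 10} with predecessor 3 and std::minus<int> gives {3, 4}, so the differences stay continuous across the tile boundary.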

template<typename Output, unsigned int ItemsPerThread, typename BinaryFunction>
__device__ inline void subtract_left_partial(const T (&input)[ItemsPerThread], Output (&output)[ItemsPerThread], const BinaryFunction op, const unsigned int valid_items, storage_type &storage)

Apply a function to each consecutive pair of elements partitioned across threads in the block and write the output to the position of the left item, in a partial tile.

// The first item is copied from the input
output[0] = input[0];
// For each item i in [1, valid_items) across threads in a block
output[i] = op(input[i], input[i-1]);
// Just copy "invalid" items i in [valid_items, block_size * ItemsPerThread)
output[i] = input[i];

Storage reuse

Synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().

Template Parameters:
  • Output - [inferred] the type of output, must be assignable from the result of op

  • ItemsPerThread - [inferred] the number of items processed by each thread

  • BinaryFunction - [inferred] the type of the function to apply

Parameters:
  • input[in] - array that data is loaded from partitioned across the threads in the block

  • output[out] - array where the result of function application will be written to

  • op[in] - binary function applied to the items. The signature of the function should be equivalent to the following: R f(const T &a, const T &b);, where the result type R must be assignable to Output. The signature does not need to have const &, but the function object must not modify the objects passed to it.

  • valid_items[in] - number of items in the block which are considered “valid” and will be used. Must be less than or equal to BlockSize * ItemsPerThread

  • storage - reference to a temporary storage object of type storage_type
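The partial-tile rule can be checked against a sequential host-side model. This is a sketch under the semantics listed above, not rocPRIM code; subtract_left_partial_model is a hypothetical name and the tile is a flat vector of all items in order.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical host-side model of subtract_left_partial().
template<class T, class BinaryFunction>
std::vector<T> subtract_left_partial_model(const std::vector<T>& input,
                                           BinaryFunction op,
                                           std::size_t valid_items)
{
    std::vector<T> output(input.size());
    for (std::size_t i = 0; i < input.size(); ++i)
    {
        // op is applied only to items in [1, valid_items); the first item and
        // all "invalid" items are copied from the input unchanged.
        const bool apply_op = (i >= 1) && (i < valid_items);
        output[i] = apply_op ? op(input[i], input[i - 1]) : input[i];
    }
    return output;
}
```

For the tile {1, 3, 6, 10, 7, 7} with valid_items = 4 and std::minus<int>, the result is {1, 2, 3, 4, 7, 7}: the last two items are outside the valid range and pass through untouched.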

template<typename Output, unsigned int ItemsPerThread, typename BinaryFunction>
__device__ inline void subtract_left_partial(const T (&input)[ItemsPerThread], Output (&output)[ItemsPerThread], const BinaryFunction op, const T tile_predecessor, const unsigned int valid_items, storage_type &storage)

Apply a function to each consecutive pair of elements partitioned across threads in the block and write the output to the position of the left item, in a partial tile with a predecessor.

This combines subtract_left_partial() with a tile predecessor.

Storage reuse

Synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().

Template Parameters:
  • Output - [inferred] the type of output, must be assignable from the result of op

  • ItemsPerThread - [inferred] the number of items processed by each thread

  • BinaryFunction - [inferred] the type of the function to apply

Parameters:
  • input[in] - array that data is loaded from partitioned across the threads in the block

  • output[out] - array where the result of function application will be written to

  • op[in] - binary function applied to the items. The signature of the function should be equivalent to the following: bool f(const T &a, const T &b). The signature does not need to have const &, but the function object must not modify the objects passed to it.

  • tile_predecessor[in] - the item before the tile, will be used as the input of the first application of op

  • valid_items[in] - number of items in the block which are considered “valid” and will be used. Must be less than or equal to BlockSize * ItemsPerThread

  • storage – - reference to a temporary storage object of type storage_type

template<typename Output, unsigned int ItemsPerThread, typename BinaryFunction>
__device__ inline void subtract_right(const T (&input)[ItemsPerThread], Output (&output)[ItemsPerThread], const BinaryFunction op, storage_type &storage)#

Apply a function to each consecutive pair of elements partitioned across threads in the block and write the output to the position of the right item.

The last item in the last thread is copied from the input; for all other items, the following code applies.

// For each i in [0, block_size * ItemsPerThread - 1) across threads in a block
output[i] = op(input[i], input[i+1]);

Storage reuse

A synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().

Template Parameters:
  • Output – - [inferred] the type of output, must be assignable from the result of op

  • ItemsPerThread – - [inferred] the number of items processed by each thread

  • BinaryFunction – - [inferred] the type of the function to apply

Parameters:
  • input[in] - array that data is loaded from partitioned across the threads in the block

  • output[out] - array where the result of function application will be written to

  • op[in] - binary function applied to the items. The signature of the function should be equivalent to the following: bool f(const T &a, const T &b). The signature does not need to have const &, but the function object must not modify the objects passed to it.

  • storage – - reference to a temporary storage object of type storage_type

template<typename Output, unsigned int ItemsPerThread, typename BinaryFunction>
__device__ inline void subtract_right(const T (&input)[ItemsPerThread], Output (&output)[ItemsPerThread], const BinaryFunction op, const T tile_successor, storage_type &storage)#

Apply a function to each consecutive pair of elements partitioned across threads in the block and write the output to the position of the right item, with an explicit item after the tile.

// For each item i in [0, block_size * ItemsPerThread - 1) across threads in a block
output[i] = op(input[i], input[i+1]);
// For the last item on the last thread use the tile successor
output[block_size * ItemsPerThread - 1] =
     op(input[block_size * ItemsPerThread - 1], tile_successor);

Storage reuse

A synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().

Template Parameters:
  • Output – - [inferred] the type of output, must be assignable from the result of op

  • ItemsPerThread – - [inferred] the number of items processed by each thread

  • BinaryFunction – - [inferred] the type of the function to apply

Parameters:
  • input[in] - array that data is loaded from partitioned across the threads in the block

  • output[out] - array where the result of function application will be written to

  • op[in] - binary function applied to the items. The signature of the function should be equivalent to the following: bool f(const T &a, const T &b). The signature does not need to have const &, but the function object must not modify the objects passed to it.

  • tile_successor[in] - the item after the tile, will be used as the input of the last application of op

  • storage – - reference to a temporary storage object of type storage_type

template<typename Output, unsigned int ItemsPerThread, typename BinaryFunction>
__device__ inline void subtract_right_partial(const T (&input)[ItemsPerThread], Output (&output)[ItemsPerThread], const BinaryFunction op, const unsigned int valid_items, storage_type &storage)#

Apply a function to each consecutive pair of elements partitioned across threads in the block and write the output to the position of the right item, in a partial tile.

// For each item i in [0, valid_items) across threads in a block
output[i] = op(input[i], input[i + 1]);
// Just copy "invalid" items in [valid_items, block_size * ItemsPerThread)
output[i] = input[i];

Storage reuse

A synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().

Template Parameters:
  • Output – - [inferred] the type of output, must be assignable from the result of op

  • ItemsPerThread – - [inferred] the number of items processed by each thread

  • BinaryFunction – - [inferred] the type of the function to apply

Parameters:
  • input[in] - array that data is loaded from partitioned across the threads in the block

  • output[out] - array where the result of function application will be written to

  • op[in] - binary function applied to the items. The signature of the function should be equivalent to the following: bool f(const T &a, const T &b). The signature does not need to have const &, but the function object must not modify the objects passed to it.

  • valid_items[in] - number of items in the block which are considered “valid” and will be used. Must be less than or equal to BlockSize * ItemsPerThread

  • storage – - reference to a temporary storage object of type storage_type