Shuffle

template<class T, unsigned int BlockSizeX, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
class block_shuffle

The block_shuffle class is a block-level parallel primitive that provides methods for shuffling data partitioned across a block.

Overview

It is commonplace for blocks of threads to rearrange data items between threads. The block_shuffle abstraction allows threads to efficiently shift items either (a) up to their successor or (b) down to their predecessor.

  • Computation can be more efficient when:

    • ItemsPerThread is greater than one,

    • T is an arithmetic type,

    • the number of threads in the block is a multiple of the hardware warp size (see rocprim::warp_size()).

Examples

In the examples, the shuffle operation is performed on a block of 192 threads. Each thread provides one int value, and the result is returned in the same variable that was used for input.

__global__ void example_kernel(...)
{
    // specialize block_shuffle for int and a block of 192 threads
    using block__shuffle_int = rocprim::block_shuffle<int, 192>;

    int value = ...;
    // execute block shuffle; this overload allocates its temporary
    // storage itself, so no storage argument is passed
    block__shuffle_int().offset(
        value, // input
        value  // output
    );
    ...
}

Template Parameters:
  • T - the input/output type.

  • BlockSizeX - the number of threads in a block’s x dimension; it has no default value.

  • BlockSizeY - the number of threads in a block’s y dimension; defaults to 1.

  • BlockSizeZ - the number of threads in a block’s z dimension; defaults to 1.

Public Types

using storage_type = storage_type_

Struct used to allocate the temporary memory required for thread communication during operations provided by the related parallel primitive.

Depending on the implementation, the operations exposed by a parallel primitive may require temporary storage for thread communication. The storage should be allocated using the __shared__ keyword. It can be aliased to externally allocated memory, or be part of a union type with other storage types to increase shared memory reusability.

Public Functions

__device__ inline void offset(T input, T &output, int distance = 1)

Shuffles data across threads in a block, offset by the distance value.

A thread with threadId i receives data from the thread with threadId (i - distance), where distance may be a negative value.

The temporary storage required by the operation is allocated by the method itself.

A shuffle operation whose input or output threadId is invalid (threadId < 0 || threadId >= BlockSize) is not carried out.

Example.
__global__ void example_kernel(...)
{
    // specialize block_shuffle for int and a block of 192 threads
    using block__shuffle_int = rocprim::block_shuffle<int, 192>;

    int value = ...;
    // execute block shuffle
    block__shuffle_int().offset(
        value, // input
        value  // output
    );
    ...
}

Parameters:
  • input[in] - input data to be shuffled to another thread.

  • output[out] - reference to an output value that receives data from another thread.

  • distance[in] - the offset between threads: input threadId + distance = output threadId.
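To make the indexing concrete, here is a minimal host-side sketch in plain C++ that models the offset semantics as described above. It does not call rocPRIM: model_offset is a hypothetical helper, each vector element stands in for one thread's value, and the direction (thread i receives from thread i - distance) is taken from the description rather than checked against the library implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side model of the documented offset semantics
// (not part of rocPRIM). Element i plays the role of thread i.
// `output` holds each thread's previous output value; threads whose
// source id is outside [0, BlockSize) keep that value unchanged.
std::vector<int> model_offset(const std::vector<int>& input,
                              std::vector<int> output,
                              int distance = 1)
{
    assert(input.size() == output.size());
    const int block_size = static_cast<int>(input.size());
    for (int i = 0; i < block_size; ++i)
    {
        // Thread i reads from thread (i - distance), as described above.
        const int src = i - distance;
        if (src >= 0 && src < block_size)
        {
            output[i] = input[src];
        }
    }
    return output;
}
```

With inputs {10, 11, 12, 13}, outputs initialised to -1 and distance = 1, the model returns {-1, 10, 11, 12}: thread 0 has no valid source and keeps its previous output. With distance = -1 it returns {11, 12, 13, -1}.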

__device__ inline void rotate(T input, T &output, unsigned int distance = 1)

Shuffles data across threads in a block, rotated by the distance value.

A thread with threadId i receives data from the thread with threadId (i - distance) modulo BlockSize. Because distance is unsigned, it cannot be negative.

The temporary storage required by the operation is allocated by the method itself.

Data is rotated around the block, using (input_threadId + distance) modulo BlockSize to ensure valid threadIds.

Example.
__global__ void example_kernel(...)
{
    // specialize block_shuffle for int and a block of 192 threads
    using block__shuffle_int = rocprim::block_shuffle<int, 192>;

    int value = ...;
    // execute block shuffle
    block__shuffle_int().rotate(
        value, // input
        value  // output
    );
    ...
}

Parameters:
  • input[in] - input data to be shuffled to another thread.

  • output[out] - reference to an output value that receives data from another thread.

  • distance[in] - the rotation distance: (input threadId + distance) modulo BlockSize = output threadId.
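The wrap-around can be illustrated with a small host-side sketch in plain C++. It is a model of the description above, not rocPRIM code: model_rotate is a hypothetical helper and each vector element stands in for one thread's value.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side model of the documented rotate semantics
// (not part of rocPRIM). Element i plays the role of thread i.
std::vector<int> model_rotate(const std::vector<int>& input,
                              unsigned int distance = 1)
{
    assert(!input.empty());
    const std::size_t block_size = input.size();
    std::vector<int> output(block_size);
    for (std::size_t i = 0; i < block_size; ++i)
    {
        // The value provided by thread i lands in thread
        // (i + distance) modulo BlockSize, so every id stays valid.
        output[(i + distance) % block_size] = input[i];
    }
    return output;
}
```

For inputs {10, 11, 12, 13} and distance = 1 the model returns {13, 10, 11, 12}; a distance of 5 gives the same result in a block of four, because only the distance modulo the block size matters.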

template<unsigned int ItemsPerThread>
__device__ inline void up(T (&input)[ItemsPerThread], T (&prev)[ItemsPerThread])

The thread block rotates a blocked arrangement of input items, shifting it up by one item.

Example.
__global__ void example_kernel(...)
{
    // specialize block_shuffle for int and a block of 192 threads
    using block__shuffle_int = rocprim::block_shuffle<int, 192>;

    // each thread holds two items of the blocked arrangement
    int values[2] = ...;
    // execute block shuffle
    block__shuffle_int().up(
        values, // input
        values  // output (prev), aliased to input
    );
    ...
}

Parameters:
  • input[in] - The calling thread’s input items

  • prev[out] - The corresponding predecessor items (may be aliased to input). The item prev[0] is not updated for thread 0.

template<unsigned int ItemsPerThread>
__device__ inline void up(T (&input)[ItemsPerThread], T (&prev)[ItemsPerThread], T &block_suffix)

The thread block rotates a blocked arrangement of input items, shifting it up by one item.

Parameters:
  • input[in] - The calling thread’s input items

  • prev[out] - The corresponding predecessor items (may be aliased to input). The item prev[0] is not updated for thread 0.

  • block_suffix[out] - The item input[ItemsPerThread-1] from thread BlockSize-1, provided to all threads.
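The effect of up on a blocked arrangement can be sketched on the host in plain C++. This is a model under stated assumptions rather than rocPRIM code: model_up and block_items are hypothetical names, items[t][i] stands for item i of thread t, and block_suffix is assumed to be the last item of the last thread.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// items[t][i] is item i held by thread t (blocked arrangement).
// Hypothetical type used only for this illustration.
using block_items = std::vector<std::vector<int>>;

// Hypothetical host-side model of up(input, prev, block_suffix).
// Returns the value that would be written to block_suffix.
int model_up(const block_items& input, block_items& prev)
{
    assert(!input.empty() && !input[0].empty());
    assert(prev.size() == input.size());
    const std::size_t block_size       = input.size();
    const std::size_t items_per_thread = input[0].size();

    // Read the suffix first so that prev may alias input.
    const int block_suffix = input[block_size - 1][items_per_thread - 1];

    // Walk backwards so each source item is read before it is overwritten.
    for (std::size_t t = block_size; t-- > 0;)
    {
        for (std::size_t i = items_per_thread; i-- > 1;)
        {
            prev[t][i] = input[t][i - 1];
        }
        if (t > 0)
        {
            // The first item comes from the last item of the previous thread.
            prev[t][0] = input[t - 1][items_per_thread - 1];
        }
        // Thread 0 leaves prev[0] untouched.
    }
    return block_suffix;
}
```

For three threads holding {1, 2}, {3, 4} and {5, 6}, with prev initialised to zeros, the model produces {0, 1}, {2, 3}, {4, 5} and returns 6 as the block suffix.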

template<unsigned int ItemsPerThread>
__device__ inline void down(T (&input)[ItemsPerThread], T (&next)[ItemsPerThread])

The thread block rotates a blocked arrangement of input items, shifting it down by one item.

Example.
__global__ void example_kernel(...)
{
    // specialize block_shuffle for int and a block of 192 threads
    using block__shuffle_int = rocprim::block_shuffle<int, 192>;

    // each thread holds two items of the blocked arrangement
    int values[2] = ...;
    // execute block shuffle
    block__shuffle_int().down(
        values, // input
        values  // output (next), aliased to input
    );
    ...
}

Parameters:
  • input[in] - The calling thread’s input items

  • next[out] - The corresponding successor items (may be aliased to input). The item next[ItemsPerThread-1] is not updated for thread BlockSize-1.

template<unsigned int ItemsPerThread>
__device__ inline void down(T (&input)[ItemsPerThread], T (&next)[ItemsPerThread], T &block_prefix)

The thread block rotates a blocked arrangement of input items, shifting it down by one item.

Parameters:
  • input[in] - The calling thread’s input items

  • next[out] - The corresponding successor items (may be aliased to input). The item next[ItemsPerThread-1] is not updated for thread BlockSize-1.

  • block_prefix[out] - The item input[0] from thread 0, provided to all threads.
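The mirror-image behaviour of down can be sketched the same way in plain C++ on the host. Again this is a model under stated assumptions, not rocPRIM code: model_down and block_items are hypothetical names, items[t][i] stands for item i of thread t, the last item of the last thread is assumed to be the one left untouched, and block_prefix is assumed to be the first item of thread 0.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// items[t][i] is item i held by thread t (blocked arrangement).
// Hypothetical type used only for this illustration.
using block_items = std::vector<std::vector<int>>;

// Hypothetical host-side model of down(input, next, block_prefix).
// Returns the value that would be written to block_prefix.
int model_down(const block_items& input, block_items& next)
{
    assert(!input.empty() && !input[0].empty());
    assert(next.size() == input.size());
    const std::size_t block_size       = input.size();
    const std::size_t items_per_thread = input[0].size();

    // Read the prefix first so that next may alias input.
    const int block_prefix = input[0][0];

    // Walk forwards so each source item is read before it is overwritten.
    for (std::size_t t = 0; t < block_size; ++t)
    {
        for (std::size_t i = 0; i + 1 < items_per_thread; ++i)
        {
            next[t][i] = input[t][i + 1];
        }
        if (t + 1 < block_size)
        {
            // The last item comes from the first item of the next thread.
            next[t][items_per_thread - 1] = input[t + 1][0];
        }
        // The last thread leaves its last item untouched.
    }
    return block_prefix;
}
```

For three threads holding {1, 2}, {3, 4} and {5, 6}, with next initialised to zeros, the model produces {2, 3}, {4, 5}, {6, 0} and returns 1 as the block prefix.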