Shuffle#
-
template<class T, unsigned int BlockSizeX, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
class block_shuffle# The block_shuffle class is a block level parallel primitive which provides methods for shuffling data partitioned across a block.
- Overview
It is commonplace for blocks of threads to rearrange data items between threads. The BlockShuffle abstraction allows threads to efficiently shift items either (a) up to their successor or (b) down to their predecessor.
Computation can more efficient when:
ItemsPerThread
is greater than one,T
is an arithmetic type,the number of threads in the block is a multiple of the hardware warp size (see rocprim::warp_size()).
- Examples
In the examples shuffle operation is performed on block of 192 threads, each provides one
int
value, result is returned using the same variable as for input.__global__ void example_kernel(...) { // specialize block__shuffle_int for int and logical warp of 192 threads using block__shuffle_int = rocprim::block_shuffle<int, 192>; // allocate storage in shared memory __shared__ block_shuffle::storage_type storage; int value = ...; // execute block shuffle block__shuffle_int().inclusive_up( value, // input value, // output storage ); ... }
- Template Parameters:
T – - the input/output type.
BlockSizeX – - the number of threads in a block’s x dimension, it has no defaults value.
BlockSizeY – - the number of threads in a block’s y dimension, defaults to 1.
BlockSizeZ – - the number of threads in a block’s z dimension, defaults to 1.
Public Types
-
using storage_type = storage_type_#
Struct used to allocate a temporary memory that is required for thread communication during operations provided by related parallel primitive.
Depending on the implemention the operations exposed by parallel primitive may require a temporary storage for thread communication. The storage should be allocated using keywords
. It can be aliased to an externally allocated memory, or be a part of a union type with other storage types to increase shared memory reusability.
Public Functions
-
__device__ inline void offset(T input, T &output, int distance = 1)#
Shuffles data across threads in a block, offseted by the distance value.
- A thread with threadId i receives data from a thread with threadIdx (i-distance), whre distance may be a negative value.
allocated by the method itself.
- Any shuffle operation with invalid input or output threadIds are not carried out, i.e. threadId < 0 || threadId >= BlockSize.
- Example.
__global__ void example_kernel(...) { // specialize block__shuffle_int for int and logical warp of 192 threads using block__shuffle_int = rocprim::block_shuffle<int, 192>; int value = ...; // execute block shuffle block__shuffle_int().offset( value, // input value // output ); ... }
- Parameters:
input – [in] - input data to be shuffled to another thread.
output – [out] - reference to a output value, that receives data from another thread
distance – [in] - The input threadId + distance = output threadId.
-
__device__ inline void rotate(T input, T &output, unsigned int distance = 1)#
Shuffles data across threads in a block, offseted by the distance value.
- A thread with threadId i receives data from a thread with threadIdx (i-distance)BlockSize, whre distance may be a negative value.
allocated by the method itself.
- Data is rotated around the block, using (input_threadId + distance) modulous BlockSize to ensure valid threadIds.
- Example.
__global__ void example_kernel(...) { // specialize block__shuffle_int for int and logical warp of 192 threads using block__shuffle_int = rocprim::block_shuffle<int, 192>; int value = ...; // execute block shuffle block__shuffle_int().rotate( value, // input value // output ); ... }
- Parameters:
input – [in] - input data to be shuffled to another thread.
output – [out] - reference to a output value, that receives data from another thread
distance – [in] - The input threadId + distance = output threadId.
-
template<unsigned int ItemsPerThread>
__device__ inline void up(T (&input)[ItemsPerThread], T (&prev)[ItemsPerThread])# The thread block rotates a blocked arrange of input items, shifting it up by one item.
- Example.
__global__ void example_kernel(...) { // specialize block__shuffle_int for int and logical warp of 192 threads using block__shuffle_int = rocprim::block_shuffle<int, 192>; int value = ...; // execute block shuffle block__shuffle_int().up( value, // input value // output ); ... }
- Parameters:
input – [in] - The calling thread’s input items
prev – [out] - The corresponding predecessor items (may be aliased to
input
). The itemprev
[0] is not updated for thread0.
-
template<unsigned int ItemsPerThread>
__device__ inline void up(T (&input)[ItemsPerThread], T (&prev)[ItemsPerThread], T &block_suffix)# The thread block rotates a blocked arrange of input items, shifting it up by one item.
- Parameters:
input – [in] - The calling thread’s input items
prev – [out] - The corresponding predecessor items (may be aliased to
input
). The itemprev
[0] is not updated for thread0.block_suffix – [out] - The item
input
[ItemsPerThread-1] from thread, provided to all threads
-
template<unsigned int ItemsPerThread>
__device__ inline void down(T (&input)[ItemsPerThread], T (&next)[ItemsPerThread])# The thread block rotates a blocked arrange of input items, shifting it down by one item.
- Example.
__global__ void example_kernel(...) { // specialize block__shuffle_int for int and logical warp of 192 threads using block__shuffle_int = rocprim::block_shuffle<int, 192>; int value = ...; // execute block shuffle block__shuffle_int().down( value, // input value // output ); ... }
- Parameters:
input – [in] - The calling thread’s input items
next – [out] - The corresponding successor items (may be aliased to
input
). The itemprev
[0] is not updated for threadBlockSize - 1.
-
template<unsigned int ItemsPerThread>
__device__ inline void down(T (&input)[ItemsPerThread], T (&next)[ItemsPerThread], T &block_prefix)# The thread block rotates a blocked arrange of input items, shifting it down by one item.
- Parameters:
input – [in] - The calling thread’s input items
next – [out] - The corresponding successor items (may be aliased to
input
). The itemprev
[0] is not updated for threadBlockSize - 1.block_prefix – [out] - The item
input
[0] from thread, provided to all threads