Store#
Class#
-
template<class T, unsigned int BlockSizeX, unsigned int ItemsPerThread, block_store_method Method = block_store_method::block_store_direct, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
class block_store# The
block_store
class is a block level parallel primitive which provides methods for storing an arrangement of items into a blocked/striped arrangement on continous memory.- Overview
The
block_store
class has a number of different methods to store data:
- Example:
In the examples store operation is performed on block of 128 threads, using type
int
and 8 items per thread.__global__ void kernel(int * output) { const int offset = blockIdx.x * 128 * 8; int items[8]; rocprim::block_store<int, 128, 8, store_method> blockstore; blockstore.store(output + offset, items); ... }
- Template Parameters:
T – - the output/output type.
BlockSize – - the number of threads in a block.
ItemsPerThread – - the number of items to be processed by each thread.
Method – - the method to store data.
Public Types
-
using storage_type = storage_type_#
Struct used to allocate a temporary memory that is required for thread communication during operations provided by related parallel primitive.
Depending on the implemention the operations exposed by parallel primitive may require a temporary storage for thread communication. The storage should be allocated using keywords
. It can be aliased to an externally allocated memory, or be a part of a union with other storage types to increase shared memory reusability.
Public Functions
-
template<class OutputIterator>
__device__ inline void store(OutputIterator block_output, T (&items)[ItemsPerThread])# Stores an arrangement of items from across the thread block into an arrangement on continuous memory.
- Overview
The type
T
must be such that an object of typeInputIterator
can be dereferenced and then implicitly converted toT
.
- Template Parameters:
OutputIterator – - [inferred] an iterator type for output (can be a simple pointer.
- Parameters:
block_output – [out] - the output iterator from the thread block to store to.
items – [in] - array that data is read from.
-
template<class OutputIterator>
__device__ inline void store(OutputIterator block_output, T (&items)[ItemsPerThread], unsigned int valid)# Stores an arrangement of items from across the thread block into an arrangement on continuous memory, which is guarded by range
valid
.- Overview
The type
T
must be such that an object of typeInputIterator
can be dereferenced and then implicitly converted toT
.
- Template Parameters:
OutputIterator – - [inferred] an iterator type for output (can be a simple pointer.
- Parameters:
block_output – [out] - the output iterator from the thread block to store to.
items – [in] - array that data is read from.
valid – [in] - maximum range of valid numbers to read.
-
template<class OutputIterator>
__device__ inline void store(OutputIterator block_output, T (&items)[ItemsPerThread], storage_type &storage)# Stores an arrangement of items from across the thread block into an arrangement on continuous memory, using temporary storage.
- Overview
The type
T
must be such that an object of typeInputIterator
can be dereferenced and then implicitly converted toT
.
- Storage reusage
Synchronization barrier should be placed before
storage
is reused or repurposed:__syncthreads()
orrocprim::syncthreads()
.- Example.
__global__ void kernel(...) { int items[8]; using block_store_int = rocprim::block_store<int, 128, 8>; block_store_int bstore; __shared__ typename block_store_int::storage_type storage; bstore.store(..., items, storage); ... }
- Template Parameters:
OutputIterator – - [inferred] an iterator type for output (can be a simple pointer.
- Parameters:
block_output – [out] - the output iterator from the thread block to store to.
items – [in] - array that data is read from.
storage – [in] - temporary storage for outputs.
-
template<class OutputIterator>
__device__ inline void store(OutputIterator block_output, T (&items)[ItemsPerThread], unsigned int valid, storage_type &storage)# Stores an arrangement of items from across the thread block into an arrangement on continuous memory, which is guarded by range
valid
, using temporary storage.- Overview
The type
T
must be such that an object of typeInputIterator
can be dereferenced and then implicitly converted toT
.
- Storage reusage
Synchronization barrier should be placed before
storage
is reused or repurposed:__syncthreads()
orrocprim::syncthreads()
.- Example.
__global__ void kernel(...) { int items[8]; using block_store_int = rocprim::block_store<int, 128, 8>; block_store_int bstore; __shared__ typename block_store_int::storage_type storage; bstore.store(..., items, valid, storage); ... }
- Template Parameters:
OutputIterator – - [inferred] an iterator type for output (can be a simple pointer.
- Parameters:
block_output – [out] - the output iterator from the thread block to store to.
items – [in] - array that data is read from.
valid – [in] - maximum range of valid numbers to read.
storage – [in] - temporary storage for outputs.
Algorithms#
-
enum class rocprim::block_store_method#
block_store_method
enumerates the methods available to store a striped arrangement of items into a blocked/striped arrangement on continuous memoryValues:
-
enumerator block_store_direct#
A blocked arrangement of items is stored into a blocked arrangement on continuous memory.
- Performance Notes:
Performance decreases with increasing number of items per thread (stride between reads), because of reduced memory coalescing.
-
enumerator block_store_striped#
A striped arrangement of items is stored into a blocked arrangement on continuous memory.
-
enumerator block_store_vectorize#
A blocked arrangement of items is stored into a blocked arrangement on continuous memory using vectorization as an optimization.
- Performance Notes:
Performance remains high due to increased memory coalescing, provided that vectorization requirements are fulfilled. Otherwise, performance will default to
block_store_direct
.
- Requirements:
The output offset (
block_output
) must be quad-item aligned.The following conditions will prevent vectorization and switch to default
block_store_direct:
ItemsPerThread
is odd.The datatype
T
is not a primitive or a HIP vector type (e.g. int2, int4, etc.
-
enumerator block_store_transpose#
A blocked arrangement of items is locally transposed and stored as a striped arrangement of data on continuous memory.
- Performance Notes:
Performance remains high due to increased memory coalescing, regardless of the number of items per thread.
Performance may be better compared to
block_store_direct
andblock_store_vectorize
due to reordering on local memory.
-
enumerator block_store_warp_transpose#
A blocked arrangement of items is locally transposed and stored as a warp-striped arrangement of data on continuous memory.
- Requirements:
The number of threads in the block must be a multiple of the size of hardware warp.
- Performance Notes:
Performance remains high due to increased memory coalescing, regardless of the number of items per thread.
Performance may be better compared to
block_store_direct
andblock_store_vectorize
due to reordering on local memory.
-
enumerator default_method#
Defaults to
block_store_direct
.
-
enumerator block_store_direct#