Store
Class
- 
template<class T, unsigned int BlockSizeX, unsigned int ItemsPerThread, block_store_method Method = block_store_method::block_store_direct, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
 class block_store
- The block_store class is a block-level parallel primitive which provides methods for storing an arrangement of items into a blocked/striped arrangement on continuous memory.
- Overview
- The block_store class has a number of different methods to store data:
  - block_store_direct
  - block_store_striped
  - block_store_vectorize
  - block_store_transpose
  - block_store_warp_transpose
- Example:
- In the example, a store operation is performed on a block of 128 threads, using type int and 8 items per thread.
    __global__ void kernel(int * output)
    {
        const int offset = blockIdx.x * 128 * 8;
        int items[8];
        rocprim::block_store<int, 128, 8, store_method> blockstore;
        blockstore.store(output + offset, items);
        ...
    }
 - Template Parameters:
- T – - the type of the items to store. 
- BlockSizeX – - the number of threads in a block (X dimension). 
- ItemsPerThread – - the number of items to be processed by each thread. 
- Method – - the method to store data. 
 
 - Public Types - 
using storage_type = storage_type_
- Struct used to allocate the temporary memory that is required for thread communication during operations provided by the related parallel primitive.
- Depending on the implementation, the operations exposed by a parallel primitive may require temporary storage for thread communication. The storage should be allocated using the keyword __shared__. It can be aliased to externally allocated memory, or be part of a union with other storage types to increase shared memory reusability.
 - Public Functions - 
template<class OutputIterator>
 __device__ inline void store(OutputIterator block_output, T (&items)[ItemsPerThread])
- Stores an arrangement of items from across the thread block into an arrangement on continuous memory.
- Overview
- The type T must be such that an object of type T can be assigned to the result of dereferencing an object of type OutputIterator.
 
 - Template Parameters:
- OutputIterator – - [inferred] an iterator type for output (can be a simple pointer). 
- Parameters:
- block_output – [out] - the output iterator from the thread block to store to. 
- items – [in] - array that data is read from. 
 
 
 - 
template<class OutputIterator>
 __device__ inline void store(OutputIterator block_output, T (&items)[ItemsPerThread], unsigned int valid)
- Stores an arrangement of items from across the thread block into an arrangement on continuous memory, which is guarded by range valid.
- Overview
- The type T must be such that an object of type T can be assigned to the result of dereferencing an object of type OutputIterator.
 
 - Template Parameters:
- OutputIterator – - [inferred] an iterator type for output (can be a simple pointer). 
- Parameters:
- block_output – [out] - the output iterator from the thread block to store to. 
- items – [in] - array that data is read from. 
- valid – [in] - the number of valid items; only output offsets below valid are written. 
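The effect of the guard can be sketched on the host. The snippet below is an illustration only, not rocPRIM code: `guarded_blocked_store` and its `fill` parameter are hypothetical names, and it assumes a blocked arrangement in which thread `t` owns the output offsets `t * ItemsPerThread + i`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side emulation (not rocPRIM code): items[t] holds the values owned by
// simulated thread t. Only values whose block-wide offset is below `valid`
// are written; every other output slot keeps the `fill` value.
std::vector<int> guarded_blocked_store(const std::vector<std::vector<int>>& items,
                                       unsigned int valid,
                                       int fill)
{
    const std::size_t block_size = items.size();
    assert(block_size > 0);
    const std::size_t items_per_thread = items[0].size();

    std::vector<int> output(block_size * items_per_thread, fill);
    for(std::size_t thread = 0; thread < block_size; ++thread)
    {
        assert(items[thread].size() == items_per_thread);
        for(std::size_t i = 0; i < items_per_thread; ++i)
        {
            // Blocked arrangement: each thread owns a contiguous run of offsets.
            const std::size_t offset = thread * items_per_thread + i;
            if(offset < valid) // the guard: offsets >= valid are left untouched
            {
                output[offset] = items[thread][i];
            }
        }
    }
    return output;
}
```

With 3 simulated threads, 2 items per thread and `valid = 3`, only the first three output slots are overwritten.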
 
 
 - 
template<class OutputIterator>
 __device__ inline void store(OutputIterator block_output, T (&items)[ItemsPerThread], storage_type &storage)
- Stores an arrangement of items from across the thread block into an arrangement on continuous memory, using temporary storage.
- Overview
- The type T must be such that an object of type T can be assigned to the result of dereferencing an object of type OutputIterator.
 
- Storage reuse
- A synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().
- Example.
    __global__ void kernel(...)
    {
        int items[8];
        using block_store_int = rocprim::block_store<int, 128, 8>;
        block_store_int bstore;
        __shared__ typename block_store_int::storage_type storage;
        bstore.store(..., items, storage);
        ...
    }
 - Template Parameters:
- OutputIterator – - [inferred] an iterator type for output (can be a simple pointer). 
- Parameters:
- block_output – [out] - the output iterator from the thread block to store to. 
- items – [in] - array that data is read from. 
- storage – [in] - temporary storage for outputs. 
 
 
 - 
template<class OutputIterator>
 __device__ inline void store(OutputIterator block_output, T (&items)[ItemsPerThread], unsigned int valid, storage_type &storage)
- Stores an arrangement of items from across the thread block into an arrangement on continuous memory, which is guarded by range valid, using temporary storage.
- Overview
- The type T must be such that an object of type T can be assigned to the result of dereferencing an object of type OutputIterator.
 
- Storage reuse
- A synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().
- Example.
    __global__ void kernel(...)
    {
        int items[8];
        using block_store_int = rocprim::block_store<int, 128, 8>;
        block_store_int bstore;
        __shared__ typename block_store_int::storage_type storage;
        bstore.store(..., items, valid, storage);
        ...
    }
 - Template Parameters:
- OutputIterator – - [inferred] an iterator type for output (can be a simple pointer). 
- Parameters:
- block_output – [out] - the output iterator from the thread block to store to. 
- items – [in] - array that data is read from. 
- valid – [in] - the number of valid items; only output offsets below valid are written. 
- storage – [in] - temporary storage for outputs. 
 
 
 
Algorithms
- 
enum class rocprim::block_store_method
- block_store_method enumerates the methods available to store an arrangement of items into a blocked/striped arrangement on continuous memory.
- Values:
enumerator block_store_direct
- A blocked arrangement of items is stored into a blocked arrangement on continuous memory.
- Performance Notes:
- Performance decreases with increasing number of items per thread (stride between writes), because of reduced memory coalescing. 
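The stride mentioned in the performance note can be made concrete with a small host-side calculation. This is a sketch with hypothetical helper names (`blocked_offset`, `blocked_step_stride`), assuming thread `t` writes its `i`-th item to offset `t * ItemsPerThread + i`.

```cpp
#include <cassert>
#include <cstddef>

// Offset written by `thread` for its `item`-th value in a blocked arrangement.
inline std::size_t blocked_offset(std::size_t thread,
                                  std::size_t item,
                                  std::size_t items_per_thread)
{
    assert(item < items_per_thread);
    return thread * items_per_thread + item;
}

// Distance, in items, between the offsets that two neighbouring threads
// write in the same step. A stride of 1 means adjacent threads touch
// adjacent addresses; larger strides spread one step's writes further apart.
inline std::size_t blocked_step_stride(std::size_t items_per_thread)
{
    return blocked_offset(1, 0, items_per_thread) - blocked_offset(0, 0, items_per_thread);
}
```

With 8 items per thread, neighbouring threads write 8 items apart in every step, which is why coalescing degrades as ItemsPerThread grows.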
 
 
 - 
enumerator block_store_striped
- A striped arrangement of items is stored into a blocked arrangement on continuous memory. 
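A host-side sketch of the striped mapping may help here. It is an emulation, not rocPRIM code: `striped_offset` and `striped_store` are hypothetical names, and the sketch assumes that in a striped arrangement thread `t` holds, as its `i`-th item, the element that belongs at offset `i * BlockSize + t`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Offset of the `item`-th value held by `thread` in a striped arrangement.
inline std::size_t striped_offset(std::size_t thread, std::size_t item, std::size_t block_size)
{
    assert(thread < block_size);
    return item * block_size + thread;
}

// Host-side emulation: items[t] holds the values owned by simulated thread t.
std::vector<int> striped_store(const std::vector<std::vector<int>>& items)
{
    const std::size_t block_size = items.size();
    assert(block_size > 0);
    const std::size_t items_per_thread = items[0].size();

    std::vector<int> output(block_size * items_per_thread);
    for(std::size_t i = 0; i < items_per_thread; ++i)
    {
        // In one step, neighbouring threads write neighbouring offsets.
        for(std::size_t thread = 0; thread < block_size; ++thread)
        {
            output[striped_offset(thread, i, block_size)] = items[thread][i];
        }
    }
    return output;
}
```

For two threads holding {0, 2, 4} and {1, 3, 5}, the output is the sequence 0 to 5 in order.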
 - 
enumerator block_store_vectorize
- A blocked arrangement of items is stored into a blocked arrangement on continuous memory using vectorization as an optimization.
- Performance Notes:
- Performance remains high due to increased memory coalescing, provided that vectorization requirements are fulfilled. Otherwise, performance will default to block_store_direct.
 
- Requirements:
- The output offset (block_output) must be quad-item aligned.
- The following conditions will prevent vectorization and switch to default block_store_direct:
  - ItemsPerThread is odd.
  - The datatype T is not a primitive or a HIP vector type (e.g. int2, int4, etc.).
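The requirements above can be collected into a single predicate. This is a simplified host-side sketch, not the check rocPRIM performs: `may_vectorize` is a hypothetical helper, HIP vector types are not recognised by it (only arithmetic types are accepted), and "quad-item aligned" is interpreted, as an assumption, as an address that is a multiple of `4 * sizeof(T)`.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Returns true when the simplified vectorization conditions hold.
template<class T, unsigned int ItemsPerThread>
bool may_vectorize(const T* block_output)
{
    static_assert(ItemsPerThread > 0, "ItemsPerThread must be positive");
    assert(block_output != nullptr);

    // An odd ItemsPerThread prevents vectorization.
    const bool even_items = (ItemsPerThread % 2 == 0);

    // Simplification: only primitive (arithmetic) types are accepted here;
    // HIP vector types such as int2 or int4 are not detected on the host.
    const bool primitive = std::is_arithmetic<T>::value;

    // Assumption: "quad-item aligned" means a multiple of 4 * sizeof(T).
    const auto address = reinterpret_cast<std::uintptr_t>(block_output);
    const bool aligned = (address % (4 * sizeof(T)) == 0);

    return even_items && primitive && aligned;
}
```

When the predicate is false, the documented behaviour is a fallback to block_store_direct rather than an error.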
 
 
 
 - 
enumerator block_store_transpose
- A blocked arrangement of items is locally transposed and stored as a striped arrangement of data on continuous memory.
- Performance Notes:
- Performance remains high due to increased memory coalescing, regardless of the number of items per thread. 
- Performance may be better compared to block_store_direct and block_store_vectorize due to reordering on local memory.
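The idea of reordering through local memory can be sketched on the host in two phases. This illustrates the general technique only and is not rocPRIM's implementation: `transposed_store` is a hypothetical name and the `staging` vector stands in for shared memory.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side emulation: items[t] holds the blocked items of simulated thread t.
std::vector<int> transposed_store(const std::vector<std::vector<int>>& items)
{
    const std::size_t block_size = items.size();
    assert(block_size > 0);
    const std::size_t items_per_thread = items[0].size();
    const std::size_t total = block_size * items_per_thread;

    // Phase 1: every thread deposits its blocked items into the staging area.
    std::vector<int> staging(total);
    for(std::size_t thread = 0; thread < block_size; ++thread)
    {
        for(std::size_t i = 0; i < items_per_thread; ++i)
        {
            staging[thread * items_per_thread + i] = items[thread][i];
        }
    }

    // On a device, a block-wide barrier would be needed at this point.

    // Phase 2: every thread picks up a striped selection and writes it out,
    // so that in each step neighbouring threads write neighbouring offsets.
    std::vector<int> output(total);
    for(std::size_t i = 0; i < items_per_thread; ++i)
    {
        for(std::size_t thread = 0; thread < block_size; ++thread)
        {
            const std::size_t offset = i * block_size + thread;
            output[offset] = staging[offset];
        }
    }
    return output;
}
```

The final memory contents match a direct blocked store; only the order in which threads touch memory changes.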
 
 
 - 
enumerator block_store_warp_transpose
- A blocked arrangement of items is locally transposed and stored as a warp-striped arrangement of data on continuous memory.
- Requirements:
- The number of threads in the block must be a multiple of the size of the hardware warp. 
 
- Performance Notes:
- Performance remains high due to increased memory coalescing, regardless of the number of items per thread. 
- Performance may be better compared to block_store_direct and block_store_vectorize due to reordering on local memory.
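A warp-striped layout can be described with a short offset formula. The sketch below assumes the common convention that each warp owns a contiguous segment of `warp_size * ItemsPerThread` items and stripes its items within that segment; `warp_striped_offset` is a hypothetical helper, and the warp size is passed as a parameter because it depends on the hardware.

```cpp
#include <cassert>
#include <cstddef>

// Offset of the `item`-th value handled by `thread` in a warp-striped layout.
inline std::size_t warp_striped_offset(std::size_t thread,
                                       std::size_t item,
                                       std::size_t items_per_thread,
                                       std::size_t warp_size)
{
    assert(warp_size > 0);
    assert(item < items_per_thread);

    const std::size_t warp_id = thread / warp_size; // which warp the thread is in
    const std::size_t lane    = thread % warp_size; // position inside the warp

    // Each warp owns warp_size * items_per_thread consecutive offsets and
    // stripes over them with a stride of warp_size.
    return warp_id * warp_size * items_per_thread + item * warp_size + lane;
}
```

Under this layout the transposition only needs data exchanged within a warp, which is consistent with the requirement that the block size be a multiple of the warp size.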
 
 
 - 
enumerator default_method
- Defaults to block_store_direct.
 