Store#

Class#

template<class T, unsigned int BlockSizeX, unsigned int ItemsPerThread, block_store_method Method = block_store_method::block_store_direct, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
class block_store#

The block_store class is a block level parallel primitive which provides methods for storing an arrangement of items into a blocked/striped arrangement on continous memory.

Overview

Example:

In the examples store operation is performed on block of 128 threads, using type int and 8 items per thread.

__global__ void kernel(int * output)
{
    const int offset = blockIdx.x * 128 * 8;
    int items[8];
    rocprim::block_store<int, 128, 8, store_method> blockstore;
    blockstore.store(output + offset, items);
    ...
}

Template Parameters:
  • T – - the output/output type.

  • BlockSize – - the number of threads in a block.

  • ItemsPerThread – - the number of items to be processed by each thread.

  • Method – - the method to store data.

Public Types

using storage_type = storage_type_#

Struct used to allocate a temporary memory that is required for thread communication during operations provided by related parallel primitive.

Depending on the implemention the operations exposed by parallel primitive may require a temporary storage for thread communication. The storage should be allocated using keywords . It can be aliased to an externally allocated memory, or be a part of a union with other storage types to increase shared memory reusability.

Public Functions

template<class OutputIterator>
__device__ inline void store(OutputIterator block_output, T (&items)[ItemsPerThread])#

Stores an arrangement of items from across the thread block into an arrangement on continuous memory.

Overview

  • The type T must be such that an object of type InputIterator can be dereferenced and then implicitly converted to T.

Template Parameters:

OutputIterator – - [inferred] an iterator type for output (can be a simple pointer.

Parameters:
  • block_output[out] - the output iterator from the thread block to store to.

  • items[in] - array that data is read from.

template<class OutputIterator>
__device__ inline void store(OutputIterator block_output, T (&items)[ItemsPerThread], unsigned int valid)#

Stores an arrangement of items from across the thread block into an arrangement on continuous memory, which is guarded by range valid.

Overview

  • The type T must be such that an object of type InputIterator can be dereferenced and then implicitly converted to T.

Template Parameters:

OutputIterator – - [inferred] an iterator type for output (can be a simple pointer.

Parameters:
  • block_output[out] - the output iterator from the thread block to store to.

  • items[in] - array that data is read from.

  • valid[in] - maximum range of valid numbers to read.

template<class OutputIterator>
__device__ inline void store(OutputIterator block_output, T (&items)[ItemsPerThread], storage_type &storage)#

Stores an arrangement of items from across the thread block into an arrangement on continuous memory, using temporary storage.

Overview

  • The type T must be such that an object of type InputIterator can be dereferenced and then implicitly converted to T.

Storage reusage

Synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().

Example.
__global__ void kernel(...)
{
    int items[8];
    using block_store_int = rocprim::block_store<int, 128, 8>;
    block_store_int bstore;
    __shared__ typename block_store_int::storage_type storage;
    bstore.store(..., items, storage);
    ...
}

Template Parameters:

OutputIterator – - [inferred] an iterator type for output (can be a simple pointer.

Parameters:
  • block_output[out] - the output iterator from the thread block to store to.

  • items[in] - array that data is read from.

  • storage[in] - temporary storage for outputs.

template<class OutputIterator>
__device__ inline void store(OutputIterator block_output, T (&items)[ItemsPerThread], unsigned int valid, storage_type &storage)#

Stores an arrangement of items from across the thread block into an arrangement on continuous memory, which is guarded by range valid, using temporary storage.

Overview

  • The type T must be such that an object of type InputIterator can be dereferenced and then implicitly converted to T.

Storage reusage

Synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().

Example.
__global__ void kernel(...)
{
    int items[8];
    using block_store_int = rocprim::block_store<int, 128, 8>;
    block_store_int bstore;
    __shared__ typename block_store_int::storage_type storage;
    bstore.store(..., items, valid, storage);
    ...
}

Template Parameters:

OutputIterator – - [inferred] an iterator type for output (can be a simple pointer.

Parameters:
  • block_output[out] - the output iterator from the thread block to store to.

  • items[in] - array that data is read from.

  • valid[in] - maximum range of valid numbers to read.

  • storage[in] - temporary storage for outputs.

Algorithms#

enum class rocprim::block_store_method#

block_store_method enumerates the methods available to store a striped arrangement of items into a blocked/striped arrangement on continuous memory

Values:

enumerator block_store_direct#

A blocked arrangement of items is stored into a blocked arrangement on continuous memory.

Performance Notes:

  • Performance decreases with increasing number of items per thread (stride between reads), because of reduced memory coalescing.

enumerator block_store_striped#

A striped arrangement of items is stored into a blocked arrangement on continuous memory.

enumerator block_store_vectorize#

A blocked arrangement of items is stored into a blocked arrangement on continuous memory using vectorization as an optimization.

Performance Notes:

  • Performance remains high due to increased memory coalescing, provided that vectorization requirements are fulfilled. Otherwise, performance will default to block_store_direct.

Requirements:

  • The output offset (block_output) must be quad-item aligned.

  • The following conditions will prevent vectorization and switch to default block_store_direct:

    • ItemsPerThread is odd.

    • The datatype T is not a primitive or a HIP vector type (e.g. int2, int4, etc.

enumerator block_store_transpose#

A blocked arrangement of items is locally transposed and stored as a striped arrangement of data on continuous memory.

Performance Notes:

  • Performance remains high due to increased memory coalescing, regardless of the number of items per thread.

  • Performance may be better compared to block_store_direct and block_store_vectorize due to reordering on local memory.

enumerator block_store_warp_transpose#

A blocked arrangement of items is locally transposed and stored as a warp-striped arrangement of data on continuous memory.

Requirements:

  • The number of threads in the block must be a multiple of the size of hardware warp.

Performance Notes:

  • Performance remains high due to increased memory coalescing, regardless of the number of items per thread.

  • Performance may be better compared to block_store_direct and block_store_vectorize due to reordering on local memory.

enumerator default_method#

Defaults to block_store_direct.