Load#

Class#

template<class T, unsigned int BlockSizeX, unsigned int ItemsPerThread, block_load_method Method = block_load_method::block_load_direct, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
class block_load#

The block_load class is a block level parallel primitive which provides methods for loading data from continuous memory into a blocked arrangement of items across the thread block.

Overview

Example:

In the examples load operation is performed on block of 128 threads, using type int and 8 items per thread.

__global__ void example_kernel(int * input, ...)
{
    const int offset = blockIdx.x * 128 * 8;
    int items[8];
    rocprim::block_load<int, 128, 8, load_method> blockload;
    blockload.load(input + offset, items);
    ...
}

Template Parameters:
  • T – - the input/output type.

  • BlockSize – - the number of threads in a block.

  • ItemsPerThread – - the number of items to be processed by each thread.

  • Method – - the method to load data.

Public Types

using storage_type = storage_type_#

Struct used to allocate a temporary memory that is required for thread communication during operations provided by related parallel primitive.

Depending on the implemention the operations exposed by parallel primitive may require a temporary storage for thread communication. The storage should be allocated using keywords . It can be aliased to an externally allocated memory, or be a part of a union with other storage types to increase shared memory reusability.

Public Functions

template<class InputIterator>
__device__ inline void load(InputIterator block_input, T (&items)[ItemsPerThread])#

Loads data from continuous memory into an arrangement of items across the thread block.

Overview

  • The type T must be such that an object of type InputIterator can be dereferenced and then implicitly converted to T.

Template Parameters:

InputIterator – - [inferred] an iterator type for input (can be a simple pointer.

Parameters:
  • block_input[in] - the input iterator from the thread block to load from.

  • items[out] - array that data is loaded to.

template<class InputIterator>
__device__ inline void load(InputIterator block_input, T (&items)[ItemsPerThread], unsigned int valid)#

Loads data from continuous memory into an arrangement of items across the thread block, which is guarded by range valid.

Overview

  • The type T must be such that an object of type InputIterator can be dereferenced and then implicitly converted to T.

Template Parameters:

InputIterator – - [inferred] an iterator type for input (can be a simple pointer.

Parameters:
  • block_input[in] - the input iterator from the thread block to load from.

  • items[out] - array that data is loaded to.

  • valid[in] - maximum range of valid numbers to load.

template<class InputIterator, class Default>
__device__ inline void load(InputIterator block_input, T (&items)[ItemsPerThread], unsigned int valid, Default out_of_bounds)#

Loads data from continuous memory into an arrangement of items across the thread block, which is guarded by range with a fall-back value for out-of-bound elements.

Overview

  • The type T must be such that an object of type InputIterator can be dereferenced and then implicitly converted to T.

Template Parameters:
  • InputIterator – - [inferred] an iterator type for input (can be a simple pointer.

  • Default – - [inferred] The data type of the default value.

Parameters:
  • block_input[in] - the input iterator from the thread block to load from.

  • items[out] - array that data is loaded to.

  • valid[in] - maximum range of valid numbers to load.

  • out_of_bounds[in] - default value assigned to out-of-bound items.

template<class InputIterator>
__device__ inline void load(InputIterator block_input, T (&items)[ItemsPerThread], storage_type &storage)#

Loads data from continuous memory into an arrangement of items across the thread block, using temporary storage.

Overview

  • The type T must be such that an object of type InputIterator can be dereferenced and then implicitly converted to T.

Storage reusage

Synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().

Example.
__global__ void example_kernel(...)
{
    int items[8];
    using block_load_int = rocprim::block_load<int, 128, 8>;
    block_load_int bload;
    __shared__ typename block_load_int::storage_type storage;
    bload.load(..., items, storage);
    ...
}

Template Parameters:

InputIterator – - [inferred] an iterator type for input (can be a simple pointer.

Parameters:
  • block_input[in] - the input iterator from the thread block to load from.

  • items[out] - array that data is loaded to.

  • storage[in] - temporary storage for inputs.

template<class InputIterator>
__device__ inline void load(InputIterator block_input, T (&items)[ItemsPerThread], unsigned int valid, storage_type &storage)#

Loads data from continuous memory into an arrangement of items across the thread block, which is guarded by range valid, using temporary storage.

Overview

  • The type T must be such that an object of type InputIterator can be dereferenced and then implicitly converted to T.

Storage reusage

Synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().

Example.
__global__ void example_kernel(...)
{
    int items[8];
    using block_load_int = rocprim::block_load<int, 128, 8>;
    block_load_int bload;
    tile_static typename block_load_int::storage_type storage;
    bload.load(..., items, valid, storage);
    ...
}

Template Parameters:

InputIterator – - [inferred] an iterator type for input (can be a simple pointer

Parameters:
  • block_input[in] - the input iterator from the thread block to load from.

  • items[out] - array that data is loaded to.

  • valid[in] - maximum range of valid numbers to load.

  • storage[in] - temporary storage for inputs.

template<class InputIterator, class Default>
__device__ inline void load(InputIterator block_input, T (&items)[ItemsPerThread], unsigned int valid, Default out_of_bounds, storage_type &storage)#

Loads data from continuous memory into an arrangement of items across the thread block, which is guarded by range with a fall-back value for out-of-bound elements, using temporary storage.

Overview

  • The type T must be such that an object of type InputIterator can be dereferenced and then implicitly converted to T.

Storage reusage

Synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().

Example.
__global__ void example_kernel(...)
{
    int items[8];
    using block_load_int = rocprim::block_load<int, 128, 8>;
    block_load_int bload;
    __shared__ typename block_load_int::storage_type storage;
    bload.load(..., items, valid, out_of_bounds, storage);
    ...
}

Template Parameters:
  • InputIterator – - [inferred] an iterator type for input (can be a simple pointer.

  • Default – - [inferred] The data type of the default value.

Parameters:
  • block_input[in] - the input iterator from the thread block to load from.

  • items[out] - array that data is loaded to.

  • valid[in] - maximum range of valid numbers to load.

  • out_of_bounds[in] - default value assigned to out-of-bound items.

  • storage[in] - temporary storage for inputs.

Algorithms#

enum class rocprim::block_load_method#

block_load_method enumerates the methods available to load data from continuous memory into a blocked arrangement of items across the thread block

Values:

enumerator block_load_direct#

Data from continuous memory is loaded into a blocked arrangement of items.

Performance Notes:

  • Performance decreases with increasing number of items per thread (stride between reads), because of reduced memory coalescing.

enumerator block_load_striped#

A striped arrangement of data is read directly from memory.

enumerator block_load_vectorize#

Data from continuous memory is loaded into a blocked arrangement of items using vectorization as an optimization.

Performance Notes:

  • Performance remains high due to increased memory coalescing, provided that vectorization requirements are fulfilled. Otherwise, performance will default to block_load_direct.

Requirements:

  • The input offset (block_input) must be quad-item aligned.

  • The following conditions will prevent vectorization and switch to default block_load_direct:

    • ItemsPerThread is odd.

    • The datatype T is not a primitive or a HIP vector type (e.g. int2, int4, etc.

enumerator block_load_transpose#

A striped arrangement of data from continuous memory is locally transposed into a blocked arrangement of items.

Performance Notes:

  • Performance remains high due to increased memory coalescing, regardless of the number of items per thread.

  • Performance may be better compared to block_load_direct and block_load_vectorize due to reordering on local memory.

enumerator block_load_warp_transpose#

A warp-striped arrangement of data from continuous memory is locally transposed into a blocked arrangement of items.

Requirements:

  • The number of threads in the block must be a multiple of the size of hardware warp.

Performance Notes:

  • Performance remains high due to increased memory coalescing, regardless of the number of items per thread.

  • Performance may be better compared to block_load_direct and block_load_vectorize due to reordering on local memory.

enumerator default_method#

Defaults to block_load_direct.