Load

Load#

Class#

template<class T, unsigned int ItemsPerThread, unsigned int VirtualWaveSize = ::rocprim::arch::wavefront::min_size(), warp_load_method Method = warp_load_method::warp_load_direct, ::rocprim::arch::wavefront::target TargetWaveSize = ::rocprim::arch::wavefront::get_target(), typename Enabled = void> class warp_load#

The warp_load class is a warp level parallel primitive which provides methods for loading data from continuous memory into a blocked arrangement of items across a warp.

Overview

The warp_load class has a number of different methods to load data:

Example:

In the example a load operation is performed on a warp of 8 threads, using type int and 4 items per thread.

__global__ void example_kernel(int * input, ...)
{
    constexpr unsigned int threads_per_block = 128;
    constexpr unsigned int threads_per_warp  =   8;
    constexpr unsigned int items_per_thread  =   4;
    constexpr unsigned int warps_per_block   = threads_per_block / threads_per_warp;
    const unsigned int warp_id = hipThreadIdx_x / threads_per_warp;
    const int offset = blockIdx.x * threads_per_block * items_per_thread
        + warp_id * threads_per_warp * items_per_thread;
    int items[items_per_thread];
    rocprim::warp_load<int, items_per_thread, threads_per_warp, load_method> warp_load;
    warp_load.load(input + offset, items);
    ...
}

Template Parameters:

T – the input/output type.
ItemsPerThread – the number of items to be processed by each thread.
VirtualWaveSize – the number of threads in the warp. It must be a divisor of the kernel block size.
Method – the method to load data.
TargetWaveSize – The hardware wavefront size. It can be used to specialize the targeted wavefront size when compiling to SPIR-V.

Public Types

using storage_type = storage_type_#

Struct used to allocate a temporary memory that is required for thread communication during operations provided by related parallel primitive.

Depending on the implemention the operations exposed by parallel primitive may require a temporary storage for thread communication. The storage should be allocated using keywords . It can be aliased to an externally allocated memory, or be a part of a union with other storage types to increase shared memory reusability.

Public Functions

template<class InputIterator> __device__ inline void load(InputIterator input, T (&items)[ItemsPerThread], storage_type&)#

Loads data from continuous memory into an arrangement of items across the warp.

Overview

The type T must be such that an object of type InputIterator can be dereferenced and then implicitly converted to T.

Template Parameters:

InputIterator – [inferred] an iterator type for input (can be a simple pointer.

Parameters:

input – [in] the input iterator to load from.
items – [out] array that data is loaded to.
- – [in] temporary storage for inputs.

template<class InputIterator> __device__ inline void load(InputIterator input, T (&items)[ItemsPerThread], unsigned int valid, storage_type&)#

Loads data from continuous memory into an arrangement of items across the warp.

Overview

The type T must be such that an object of type InputIterator can be dereferenced and then implicitly converted to T.

Template Parameters:

InputIterator – [inferred] an iterator type for input (can be a simple pointer.

Parameters:

input – [in] the input iterator to load from.
items – [out] array that data is loaded to.
valid – [in] maximum range of valid numbers to load.
- – [in] temporary storage for inputs.

template<class InputIterator, class Default> __device__ inline void load(InputIterator input, T (&items)[ItemsPerThread], unsigned int valid, Default out_of_bounds, storage_type&)#

Loads data from continuous memory into an arrangement of items across the warp.

Overview

The type T must be such that an object of type InputIterator can be dereferenced and then implicitly converted to T.

Template Parameters:

InputIterator – [inferred] an iterator type for input (can be a simple pointer.

Parameters:

input – [in] the input iterator to load from.
items – [out] array that data is loaded to.
valid – [in] maximum range of valid numbers to load.
out_of_bounds – [in] default value assigned to out-of-bound items.
- – [in] temporary storage for inputs.

Algorithms#

enum class rocprim::warp_load_method#

warp_load_method enumerates the methods available to load data from continuous memory into a blocked/striped arrangement of items across the warp

Values:

enumerator warp_load_direct#

Data from continuous memory is loaded into a blocked arrangement of items.

Performance Notes:

Performance decreases with increasing number of items per thread (stride between reads), because of reduced memory coalescing.

enumerator warp_load_striped#: A striped arrangement of data is read directly from memory.

enumerator warp_load_vectorize#

Data from continuous memory is loaded into a blocked arrangement of items using vectorization as an optimization.

Performance Notes:

Performance remains high due to increased memory coalescing, provided that vectorization requirements are fulfilled. Otherwise, performance will default to warp_load_direct.

Requirements:

The input offset (block_input) must be quad-item aligned.
The following conditions will prevent vectorization and switch to default warp_load_direct:
- ItemsPerThread is odd.
- The datatype T is not a primitive or a HIP vector type (e.g. int2, int4, etc.

enumerator warp_load_transpose#

A striped arrangement of data from continuous memory is locally transposed into a blocked arrangement of items.

Performance Notes:

Performance remains high due to increased memory coalescing, regardless of the number of items per thread.
Performance may be better compared to warp_load_direct and warp_load_vectorize due to reordering on local memory.

enumerator default_method#: Defaults to warp_load_direct.

Load

Contents

Load#

Class#

Algorithms#