Load#
Class#
- 
template<class T, unsigned int ItemsPerThread, unsigned int WarpSize = ::rocprim::device_warp_size(), warp_load_method Method = warp_load_method::warp_load_direct>
 class warp_load#
- The - warp_loadclass is a warp level parallel primitive which provides methods for loading data from continuous memory into a blocked arrangement of items across a warp.- Overview
- The - warp_loadclass has a number of different methods to load data:
 
- Example:
- In the example a load operation is performed on a warp of 8 threads, using type - intand 4 items per thread.- __global__ void example_kernel(int * input, ...) { constexpr unsigned int threads_per_block = 128; constexpr unsigned int threads_per_warp = 8; constexpr unsigned int items_per_thread = 4; constexpr unsigned int warps_per_block = threads_per_block / threads_per_warp; const unsigned int warp_id = hipThreadIdx_x / threads_per_warp; const int offset = blockIdx.x * threads_per_block * items_per_thread + warp_id * threads_per_warp * items_per_thread; int items[items_per_thread]; rocprim::warp_load<int, items_per_thread, threads_per_warp, load_method> warp_load; warp_load.load(input + offset, items); ... } 
 - Template Parameters:
- T – - the input/output type. 
- ItemsPerThread – - the number of items to be processed by each thread. 
- WarpSize – - the number of threads in the warp. It must be a divisor of the kernel block size. 
- Method – - the method to load data. 
 
 - Public Types - 
using storage_type = storage_type_#
- Struct used to allocate a temporary memory that is required for thread communication during operations provided by related parallel primitive. - Depending on the implemention the operations exposed by parallel primitive may require a temporary storage for thread communication. The storage should be allocated using keywords - . It can be aliased to an externally allocated memory, or be a part of a union with other storage types to increase shared memory reusability.
 - Public Functions - 
template<class InputIterator>
 __device__ inline void load(InputIterator input, T (&items)[ItemsPerThread], storage_type&)#
- Loads data from continuous memory into an arrangement of items across the warp. - Overview
- The type - Tmust be such that an object of type- InputIteratorcan be dereferenced and then implicitly converted to- T.
 
 - Template Parameters:
- InputIterator – - [inferred] an iterator type for input (can be a simple pointer. 
- Parameters:
- input – [in] - the input iterator to load from. 
- items – [out] - array that data is loaded to. 
- - – [in] temporary storage for inputs. 
 
 
 - 
template<class InputIterator>
 __device__ inline void load(InputIterator input, T (&items)[ItemsPerThread], unsigned int valid, storage_type&)#
- Loads data from continuous memory into an arrangement of items across the warp. - Overview
- The type - Tmust be such that an object of type- InputIteratorcan be dereferenced and then implicitly converted to- T.
 
 - Template Parameters:
- InputIterator – - [inferred] an iterator type for input (can be a simple pointer. 
- Parameters:
- input – [in] - the input iterator to load from. 
- items – [out] - array that data is loaded to. 
- valid – [in] - maximum range of valid numbers to load. 
- - – [in] temporary storage for inputs. 
 
 
 - 
template<class InputIterator, class Default>
 __device__ inline void load(InputIterator input, T (&items)[ItemsPerThread], unsigned int valid, Default out_of_bounds, storage_type&)#
- Loads data from continuous memory into an arrangement of items across the warp. - Overview
- The type - Tmust be such that an object of type- InputIteratorcan be dereferenced and then implicitly converted to- T.
 
 - Template Parameters:
- InputIterator – - [inferred] an iterator type for input (can be a simple pointer. 
- Parameters:
- input – [in] - the input iterator to load from. 
- items – [out] - array that data is loaded to. 
- valid – [in] - maximum range of valid numbers to load. 
- out_of_bounds – [in] - default value assigned to out-of-bound items. 
- - – [in] temporary storage for inputs. 
 
 
 
Algorithms#
- 
enum class rocprim::warp_load_method#
- warp_load_methodenumerates the methods available to load data from continuous memory into a blocked/striped arrangement of items across the warp- Values: - 
enumerator warp_load_direct#
- Data from continuous memory is loaded into a blocked arrangement of items. - Performance Notes:
- Performance decreases with increasing number of items per thread (stride between reads), because of reduced memory coalescing. 
 
 
 - 
enumerator warp_load_striped#
- A striped arrangement of data is read directly from memory. 
 - 
enumerator warp_load_vectorize#
- Data from continuous memory is loaded into a blocked arrangement of items using vectorization as an optimization. - Performance Notes:
- Performance remains high due to increased memory coalescing, provided that vectorization requirements are fulfilled. Otherwise, performance will default to - warp_load_direct.
 
- Requirements:
- The input offset ( - block_input) must be quad-item aligned.
- The following conditions will prevent vectorization and switch to default - warp_load_direct:- ItemsPerThreadis odd.
- The datatype - Tis not a primitive or a HIP vector type (e.g. int2, int4, etc.
 
 
 
 - 
enumerator warp_load_transpose#
- A striped arrangement of data from continuous memory is locally transposed into a blocked arrangement of items. - Performance Notes:
- Performance remains high due to increased memory coalescing, regardless of the number of items per thread. 
- Performance may be better compared to - warp_load_directand- warp_load_vectorizedue to reordering on local memory.
 
 
 - 
enumerator default_method#
- Defaults to - warp_load_direct.
 
- 
enumerator warp_load_direct#