BlockRunLengthDecode< ItemT, BLOCK_DIM_X, RUNS_PER_THREAD, DECODED_ITEMS_PER_THREAD, DecodedOffsetT, BLOCK_DIM_Y, BLOCK_DIM_Z > Class Template Reference

BlockRunLengthDecode&lt; ItemT, BLOCK_DIM_X, RUNS_PER_THREAD, DECODED_ITEMS_PER_THREAD, DecodedOffsetT, BLOCK_DIM_Y, BLOCK_DIM_Z &gt; Class Template Reference#

hipCUB: hipcub::BlockRunLengthDecode< ItemT, BLOCK_DIM_X, RUNS_PER_THREAD, DECODED_ITEMS_PER_THREAD, DecodedOffsetT, BLOCK_DIM_Y, BLOCK_DIM_Z > Class Template Reference
hipcub::BlockRunLengthDecode< ItemT, BLOCK_DIM_X, RUNS_PER_THREAD, DECODED_ITEMS_PER_THREAD, DecodedOffsetT, BLOCK_DIM_Y, BLOCK_DIM_Z > Class Template Reference

The BlockRunLengthDecode class supports decoding a run-length encoded array of items. That is, given the two arrays run_value[N] and run_lengths[N], run_value[i] is repeated run_lengths[i] many times in the output array. Due to the nature of the run-length decoding algorithm ("decompression"), the output size of the run-length decoded array is runtime-dependent and potentially without any upper bound. To address this, BlockRunLengthDecode allows retrieving a "window" from the run-length decoded array. The window's offset can be specified and BLOCK_THREADS * DECODED_ITEMS_PER_THREAD (i.e., referred to as window_size) decoded items from the specified window will be returned. More...

#include <block_run_length_decode.hpp>

Classes

struct  TempStorage
 

Public Member Functions

template<typename RunLengthT , typename TotalDecodedSizeT >
__device__ __forceinline__ BlockRunLengthDecode (TempStorage &temp_storage, ItemT(&run_values)[RUNS_PER_THREAD], RunLengthT(&run_lengths)[RUNS_PER_THREAD], TotalDecodedSizeT &total_decoded_size)
 Constructor specialised for user-provided temporary storage, initializing using the runs' lengths. The algorithm's temporary storage may not be repurposed between the constructor call and subsequent RunLengthDecode calls.
 
template<typename UserRunOffsetT >
__device__ __forceinline__ BlockRunLengthDecode (TempStorage &temp_storage, ItemT(&run_values)[RUNS_PER_THREAD], UserRunOffsetT(&run_offsets)[RUNS_PER_THREAD])
 Constructor specialised for user-provided temporary storage, initializing using the runs' offsets. The algorithm's temporary storage may not be repurposed between the constructor call and subsequent RunLengthDecode calls.
 
template<typename RunLengthT , typename TotalDecodedSizeT >
__device__ __forceinline__ BlockRunLengthDecode (ItemT(&run_values)[RUNS_PER_THREAD], RunLengthT(&run_lengths)[RUNS_PER_THREAD], TotalDecodedSizeT &total_decoded_size)
 Constructor specialised for static temporary storage, initializing using the runs' lengths.
 
template<typename UserRunOffsetT >
__device__ __forceinline__ BlockRunLengthDecode (ItemT(&run_values)[RUNS_PER_THREAD], UserRunOffsetT(&run_offsets)[RUNS_PER_THREAD])
 Constructor specialised for static temporary storage, initializing using the runs' offsets.
 
template<typename RelativeOffsetT >
__device__ __forceinline__ void RunLengthDecode (ItemT(&decoded_items)[DECODED_ITEMS_PER_THREAD], RelativeOffsetT(&item_offsets)[DECODED_ITEMS_PER_THREAD], DecodedOffsetT from_decoded_offset=0)
 Run-length decodes the runs previously passed via a call to Init(...) and returns the run-length decoded items in a blocked arrangement to decoded_items. If the number of run-length decoded items exceeds the run-length decode buffer (i.e., DECODED_ITEMS_PER_THREAD * BLOCK_THREADS), only the items that fit within the buffer are returned. Subsequent calls to RunLengthDecode adjusting from_decoded_offset can be used to retrieve the remaining run-length decoded items. Calling __syncthreads() between any two calls to RunLengthDecode is not required. item_offsets can be used to retrieve each run-length decoded item's relative index within its run. E.g., the run-length encoded array of 3, 1, 4 with the respective run lengths of 2, 1, 3 would yield the run-length decoded array of 3, 3, 1, 4, 4, 4 with the relative offsets of 0, 1, 0, 0, 1, 2. \smemreuse. More...
 
__device__ __forceinline__ void RunLengthDecode (ItemT(&decoded_items)[DECODED_ITEMS_PER_THREAD], DecodedOffsetT from_decoded_offset=0)
 Run-length decodes the runs previously passed via a call to Init(...) and returns the run-length decoded items in a blocked arrangement to decoded_items. If the number of run-length decoded items exceeds the run-length decode buffer (i.e., DECODED_ITEMS_PER_THREAD * BLOCK_THREADS), only the items that fit within the buffer are returned. Subsequent calls to RunLengthDecode adjusting from_decoded_offset can be used to retrieve the remaining run-length decoded items. Calling __syncthreads() between any two calls to RunLengthDecode is not required. More...
 

Detailed Description

template<typename ItemT, int BLOCK_DIM_X, int RUNS_PER_THREAD, int DECODED_ITEMS_PER_THREAD, typename DecodedOffsetT = uint32_t, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1>
class hipcub::BlockRunLengthDecode< ItemT, BLOCK_DIM_X, RUNS_PER_THREAD, DECODED_ITEMS_PER_THREAD, DecodedOffsetT, BLOCK_DIM_Y, BLOCK_DIM_Z >

The BlockRunLengthDecode class supports decoding a run-length encoded array of items. That is, given the two arrays run_value[N] and run_lengths[N], run_value[i] is repeated run_lengths[i] many times in the output array. Due to the nature of the run-length decoding algorithm ("decompression"), the output size of the run-length decoded array is runtime-dependent and potentially without any upper bound. To address this, BlockRunLengthDecode allows retrieving a "window" from the run-length decoded array. The window's offset can be specified and BLOCK_THREADS * DECODED_ITEMS_PER_THREAD (i.e., referred to as window_size) decoded items from the specified window will be returned.

Note
: Trailing runs of length 0 are supported (i.e., they may only appear at the end of the run_lengths array). A run of length zero may not be followed by a run length that is not zero.
__global__ void ExampleKernel(...)
{
// Specialising BlockRunLengthDecode to run-length decode items of type uint64_t
using RunItemT = uint64_t;
// Type large enough to index into the run-length decoded array
using RunLengthT = uint32_t;
// Specialising BlockRunLengthDecode for a 1D block of 128 threads
constexpr int BLOCK_DIM_X = 128;
// Specialising BlockRunLengthDecode to have each thread contribute 2 run-length encoded runs
constexpr int RUNS_PER_THREAD = 2;
// Specialising BlockRunLengthDecode to have each thread hold 4 run-length decoded items
constexpr int DECODED_ITEMS_PER_THREAD = 4;
// Specialize BlockRadixSort for a 1D block of 128 threads owning 4 integer items each
using BlockRunLengthDecodeT =
// Allocate shared memory for BlockRunLengthDecode
__shared__ typename BlockRunLengthDecodeT::TempStorage temp_storage;
// The run-length encoded items and how often they shall be repeated in the run-length decoded output
RunItemT run_values[RUNS_PER_THREAD];
RunLengthT run_lengths[RUNS_PER_THREAD];
...
// Initialize the BlockRunLengthDecode with the runs that we want to run-length decode
uint32_t total_decoded_size = 0;
BlockRunLengthDecodeT block_rld(temp_storage, run_values, run_lengths, total_decoded_size);
// Run-length decode ("decompress") the runs into a window buffer of limited size. This is repeated until all runs
// have been decoded.
uint32_t decoded_window_offset = 0U;
while (decoded_window_offset < total_decoded_size)
{
RunLengthT relative_offsets[DECODED_ITEMS_PER_THREAD];
RunItemT decoded_items[DECODED_ITEMS_PER_THREAD];
// The number of decoded items that are valid within this window (aka pass) of run-length decoding
uint32_t num_valid_items = total_decoded_size - decoded_window_offset;
block_rld.RunLengthDecode(decoded_items, relative_offsets, decoded_window_offset);
decoded_window_offset += BLOCK_DIM_X * DECODED_ITEMS_PER_THREAD;
...
}
}
The BlockRunLengthDecode class supports decoding a run-length encoded array of items....
Definition: block_run_length_decode.hpp:131
Suppose the set of input run_values across the block of threads is { [0, 1], [2, 3], [4, 5], [6, 7], ..., [254, 255] } and run_lengths is { [1, 2], [3, 4], [5, 1], [2, 3], ..., [5, 1] }. The corresponding output decoded_items in those threads will be { [0, 1, 1, 2], [2, 2, 3, 3], [3, 3, 4, 4], [4, 4, 4, 5], ..., [169, 169, 170, 171] } and relative_offsets will be { [0, 0, 1, 0], [1, 2, 0, 1], [2, 3, 0, 1], [2, 3, 4, 0], ..., [3, 4, 0, 0] } during the first iteration of the while loop.
Template Parameters
ItemTThe data type of the items being run-length decoded
BLOCK_DIM_XThe thread block length in threads along the X dimension
RUNS_PER_THREADThe number of consecutive runs that each thread contributes
DECODED_ITEMS_PER_THREADThe maximum number of decoded items that each thread holds
DecodedOffsetTType used to index into the block's decoded items (large enough to hold the sum over all the runs' lengths)
BLOCK_DIM_YThe thread block length in threads along the Y dimension
BLOCK_DIM_ZThe thread block length in threads along the Z dimension

Member Function Documentation

◆ RunLengthDecode() [1/2]

template<typename ItemT , int BLOCK_DIM_X, int RUNS_PER_THREAD, int DECODED_ITEMS_PER_THREAD, typename DecodedOffsetT = uint32_t, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1>
template<typename RelativeOffsetT >
__device__ __forceinline__ void hipcub::BlockRunLengthDecode< ItemT, BLOCK_DIM_X, RUNS_PER_THREAD, DECODED_ITEMS_PER_THREAD, DecodedOffsetT, BLOCK_DIM_Y, BLOCK_DIM_Z >::RunLengthDecode ( ItemT(&)  decoded_items[DECODED_ITEMS_PER_THREAD],
RelativeOffsetT(&)  item_offsets[DECODED_ITEMS_PER_THREAD],
DecodedOffsetT  from_decoded_offset = 0 
)
inline

Run-length decodes the runs previously passed via a call to Init(...) and returns the run-length decoded items in a blocked arrangement to decoded_items. If the number of run-length decoded items exceeds the run-length decode buffer (i.e., DECODED_ITEMS_PER_THREAD * BLOCK_THREADS), only the items that fit within the buffer are returned. Subsequent calls to RunLengthDecode adjusting from_decoded_offset can be used to retrieve the remaining run-length decoded items. Calling __syncthreads() between any two calls to RunLengthDecode is not required. item_offsets can be used to retrieve each run-length decoded item's relative index within its run. E.g., the run-length encoded array of 3, 1, 4 with the respective run lengths of 2, 1, 3 would yield the run-length decoded array of 3, 3, 1, 4, 4, 4 with the relative offsets of 0, 1, 0, 0, 1, 2. \smemreuse.

Parameters
[out]decoded_itemsThe run-length decoded items to be returned in a blocked arrangement
[out]item_offsetsThe run-length decoded items' relative offset within the run they belong to
[in]from_decoded_offsetIf invoked with from_decoded_offset that is larger than total_decoded_size results in undefined behavior.

◆ RunLengthDecode() [2/2]

template<typename ItemT , int BLOCK_DIM_X, int RUNS_PER_THREAD, int DECODED_ITEMS_PER_THREAD, typename DecodedOffsetT = uint32_t, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1>
__device__ __forceinline__ void hipcub::BlockRunLengthDecode< ItemT, BLOCK_DIM_X, RUNS_PER_THREAD, DECODED_ITEMS_PER_THREAD, DecodedOffsetT, BLOCK_DIM_Y, BLOCK_DIM_Z >::RunLengthDecode ( ItemT(&)  decoded_items[DECODED_ITEMS_PER_THREAD],
DecodedOffsetT  from_decoded_offset = 0 
)
inline

Run-length decodes the runs previously passed via a call to Init(...) and returns the run-length decoded items in a blocked arrangement to decoded_items. If the number of run-length decoded items exceeds the run-length decode buffer (i.e., DECODED_ITEMS_PER_THREAD * BLOCK_THREADS), only the items that fit within the buffer are returned. Subsequent calls to RunLengthDecode adjusting from_decoded_offset can be used to retrieve the remaining run-length decoded items. Calling __syncthreads() between any two calls to RunLengthDecode is not required.

Parameters
[out]decoded_itemsThe run-length decoded items to be returned in a blocked arrangement
[in]from_decoded_offsetIf invoked with from_decoded_offset that is larger than total_decoded_size results in undefined behavior.

The documentation for this class was generated from the following file:
  • /home/docs/checkouts/readthedocs.org/user_builds/advanced-micro-devices-hipcub/checkouts/docs-5.3.3/hipcub/include/hipcub/backend/rocprim/block/block_run_length_decode.hpp