Run-length decode#
-
template<typename ItemT, unsigned int BlockSizeX, int RUNS_PER_THREAD, int DECODED_ITEMS_PER_THREAD, typename DecodedOffsetT = uint32_t, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
class block_run_length_decode# The block_run_length_decode class supports decoding a run-length encoded array of items. That is, given the two arrays
run_value[N]
andrun_lengths[N]
,run_value[i]
is repeatedrun_lengths[i]
many times in the output array.- Examples
Due to the nature of the run-length decoding algorithm (“decompression”), the output size of the run-length decoded array is runtime-dependent and potentially without any upper bound. To address this,
block_run_length_decode
allows retrieving a “window” from the run-length decoded array. The window’s offset can be specified andBLOCK_THREADS * DECODED_ITEMS_PER_THREAD
(i.e., referred to aswindow_size
) decoded items from the specified window will be returned.__global__ void ExampleKernel(...) { // Specialising block_run_length_decode to run-length decode items of type uint64_t using RunItemT = uint64_t; // Type large enough to index into the run-length decoded array using RunLengthT = uint32_t; // Specialising block_run_length_decode for a 1D block of 128 threads constexpr int BLOCK_DIM_X = 128; // Specialising block_run_length_decode to have each thread contribute 2 run-length encoded runs constexpr int RUNS_PER_THREAD = 2; // Specialising block_run_length_decode to have each thread hold 4 run-length decoded items constexpr int DECODED_ITEMS_PER_THREAD = 4; // Specialize BlockRadixSort for a 1D block of 128 threads owning 4 integer items each using block_run_length_decodeT = hipcub::block_run_length_decode<RunItemT, BLOCK_DIM_X, RUNS_PER_THREAD, DECODED_ITEMS_PER_THREAD>; // Allocate shared memory for block_run_length_decode __shared__ typename block_run_length_decodeT::TempStorage temp_storage; // The run-length encoded items and how often they shall be repeated in the run-length decoded output RunItemT run_values[RUNS_PER_THREAD]; RunLengthT run_lengths[RUNS_PER_THREAD]; ... // Initialize the block_run_length_decode with the runs that we want to run-length decode uint32_t total_decoded_size = 0; block_run_length_decodeT block_rld(temp_storage, run_values, run_lengths, total_decoded_size); // Run-length decode ("decompress") the runs into a window buffer of limited size. This is repeated until all runs // have been decoded. uint32_t decoded_window_offset = 0U; while (decoded_window_offset < total_decoded_size) { RunLengthT relative_offsets[DECODED_ITEMS_PER_THREAD]; RunItemT decoded_items[DECODED_ITEMS_PER_THREAD]; // The number of decoded items that are valid within this window (aka pass) of run-length decoding uint32_t num_valid_items = total_decoded_size - decoded_window_offset; block_rld.run_length_decode(decoded_items, relative_offsets, decoded_window_offset); decoded_window_offset += BLOCK_DIM_X * DECODED_ITEMS_PER_THREAD; ... } }
Suppose the set of input
run_values
across the block of threads is{ [0, 1], [2, 3], [4, 5], [6, 7], ..., [254, 255] }
andrun_lengths
is{ [1, 2], [3, 4], [5, 1], [2, 3], ..., [5, 1] }
.The corresponding output
decoded_items
in those threads will be{ [0, 1, 1, 2], [2, 2, 3, 3], [3, 3, 4, 4], [4, 4, 4, 5], ..., [169, 169, 170, 171] }
andrelative_offsets
will be{ [0, 0, 1, 0], [1, 2, 0, 1], [2, 3, 0, 1], [2, 3, 4, 0], ..., [3, 4, 0, 0] }
during the first iteration of the while loop.
Note
Trailing runs of length 0 are supported (i.e., they may only appear at the end of the
run_lengths
array). A run of length zero may not be followed by a run length that is not zero.- Template Parameters:
ItemT – The data type of the items being run-length decoded
BLOCK_DIM_X – The thread block length in threads along the X dimension
RUNS_PER_THREAD – The number of consecutive runs that each thread contributes
DECODED_ITEMS_PER_THREAD – The maximum number of decoded items that each thread holds
DecodedOffsetT – Type used to index into the block’s decoded items (large enough to hold the sum over all the runs’ lengths)
BLOCK_DIM_Y – The thread block length in threads along the Y dimension
BLOCK_DIM_Z – The thread block length in threads along the Z dimension
Public Types
-
using storage_type = detail::raw_storage<storage_type_>#
Struct used to allocate a temporary memory that is required for thread communication during operations provided by related parallel primitive.
Depending on the implemention the operations exposed by parallel primitive may require a temporary storage for thread communication. The storage should be allocated using keywords
. It can be aliased to an externally allocated memory, or be a part of a union type with other storage types to increase shared memory reusability.
Public Functions
-
template<typename RunLengthT, typename TotalDecodedSizeT>
__device__ inline block_run_length_decode(storage_type &temp_storage, ItemT (&run_values)[RUNS_PER_THREAD], RunLengthT (&run_lengths)[RUNS_PER_THREAD], TotalDecodedSizeT &total_decoded_size)# Constructor specialised for user-provided temporary storage, initializing using the runs’ lengths. The algorithm’s temporary storage may not be repurposed between the constructor call and subsequent run_length_decode calls.
-
template<typename UserRunOffsetT>
__device__ inline block_run_length_decode(storage_type &temp_storage, ItemT (&run_values)[RUNS_PER_THREAD], UserRunOffsetT (&run_offsets)[RUNS_PER_THREAD])# Constructor specialised for user-provided temporary storage, initializing using the runs’ offsets. The algorithm’s temporary storage may not be repurposed between the constructor call and subsequent run_length_decode calls.
-
template<typename RunLengthT, typename TotalDecodedSizeT>
__device__ inline block_run_length_decode(ItemT (&run_values)[RUNS_PER_THREAD], RunLengthT (&run_lengths)[RUNS_PER_THREAD], TotalDecodedSizeT &total_decoded_size)# Constructor specialised for static temporary storage, initializing using the runs’ lengths.
-
template<typename UserRunOffsetT>
__device__ inline block_run_length_decode(ItemT (&run_values)[RUNS_PER_THREAD], UserRunOffsetT (&run_offsets)[RUNS_PER_THREAD])# Constructor specialised for static temporary storage, initializing using the runs’ offsets.
-
template<typename RelativeOffsetT>
__device__ inline void run_length_decode(ItemT (&decoded_items)[DECODED_ITEMS_PER_THREAD], RelativeOffsetT (&item_offsets)[DECODED_ITEMS_PER_THREAD], DecodedOffsetT from_decoded_offset = 0)# Run-length decodes the runs previously passed via a call to Init(…) and returns the run-length decoded items in a blocked arrangement to
decoded_items
. If the number of run-length decoded items exceeds the run-length decode buffer (i.e., DECODED_ITEMS_PER_THREAD * BLOCK_THREADS), only the items that fit within the buffer are returned. Subsequent calls to run_length_decode adjustingfrom_decoded_offset
can be used to retrieve the remaining run-length decoded items. Calling __syncthreads() between any two calls to run_length_decode is not required.item_offsets
can be used to retrieve each run-length decoded item’s relative index within its run. E.g., the run-length encoded array of3, 1, 4
with the respective run lengths of2, 1, 3
would yield the run-length decoded array of3, 3, 1, 4, 4, 4
with the relative offsets of0, 1, 0, 0, 1, 2
.- Parameters:
decoded_items – [out] The run-length decoded items to be returned in a blocked arrangement
item_offsets – [out] The run-length decoded items’ relative offset within the run they belong to
from_decoded_offset – [in] If invoked with from_decoded_offset that is larger than total_decoded_size results in undefined behavior.
-
__device__ inline void run_length_decode(ItemT (&decoded_items)[DECODED_ITEMS_PER_THREAD], DecodedOffsetT from_decoded_offset = 0)#
Run-length decodes the runs previously passed via a call to Init(…) and returns the run-length decoded items in a blocked arrangement to
decoded_items
. If the number of run-length decoded items exceeds the run-length decode buffer (i.e., DECODED_ITEMS_PER_THREAD * BLOCK_THREADS), only the items that fit within the buffer are returned. Subsequent calls to run_length_decode adjustingfrom_decoded_offset
can be used to retrieve the remaining run-length decoded items. Calling __syncthreads() between any two calls to run_length_decode is not required.- Parameters:
decoded_items – [out] The run-length decoded items to be returned in a blocked arrangement
from_decoded_offset – [in] If invoked with from_decoded_offset that is larger than total_decoded_size results in undefined behavior.