BlockRunLengthDecode< ItemT, BLOCK_DIM_X, RUNS_PER_THREAD, DECODED_ITEMS_PER_THREAD, DecodedOffsetT, BLOCK_DIM_Y, BLOCK_DIM_Z > Class Template Reference#
The BlockRunLengthDecode class supports decoding a run-length encoded array of items. That is, given the two arrays run_value[N] and run_lengths[N], run_value[i] is repeated run_lengths[i] many times in the output array. Due to the nature of the run-length decoding algorithm ("decompression"), the output size of the run-length decoded array is runtime-dependent and potentially without any upper bound. To address this, BlockRunLengthDecode allows retrieving a "window" from the run-length decoded array. The window's offset can be specified and BLOCK_THREADS * DECODED_ITEMS_PER_THREAD (i.e., referred to as window_size) decoded items from the specified window will be returned. More...
#include <block_run_length_decode.hpp>
Classes | |
struct | TempStorage |
Public Member Functions | |
template<typename RunLengthT , typename TotalDecodedSizeT > | |
__device__ __forceinline__ | BlockRunLengthDecode (TempStorage &temp_storage, ItemT(&run_values)[RUNS_PER_THREAD], RunLengthT(&run_lengths)[RUNS_PER_THREAD], TotalDecodedSizeT &total_decoded_size) |
Constructor specialised for user-provided temporary storage, initializing using the runs' lengths. The algorithm's temporary storage may not be repurposed between the constructor call and subsequent RunLengthDecode calls. | |
template<typename UserRunOffsetT > | |
__device__ __forceinline__ | BlockRunLengthDecode (TempStorage &temp_storage, ItemT(&run_values)[RUNS_PER_THREAD], UserRunOffsetT(&run_offsets)[RUNS_PER_THREAD]) |
Constructor specialised for user-provided temporary storage, initializing using the runs' offsets. The algorithm's temporary storage may not be repurposed between the constructor call and subsequent RunLengthDecode calls. | |
template<typename RunLengthT , typename TotalDecodedSizeT > | |
__device__ __forceinline__ | BlockRunLengthDecode (ItemT(&run_values)[RUNS_PER_THREAD], RunLengthT(&run_lengths)[RUNS_PER_THREAD], TotalDecodedSizeT &total_decoded_size) |
Constructor specialised for static temporary storage, initializing using the runs' lengths. | |
template<typename UserRunOffsetT > | |
__device__ __forceinline__ | BlockRunLengthDecode (ItemT(&run_values)[RUNS_PER_THREAD], UserRunOffsetT(&run_offsets)[RUNS_PER_THREAD]) |
Constructor specialised for static temporary storage, initializing using the runs' offsets. | |
template<typename RelativeOffsetT > | |
__device__ __forceinline__ void | RunLengthDecode (ItemT(&decoded_items)[DECODED_ITEMS_PER_THREAD], RelativeOffsetT(&item_offsets)[DECODED_ITEMS_PER_THREAD], DecodedOffsetT from_decoded_offset=0) |
Run-length decodes the runs previously passed via a call to Init(...) and returns the run-length decoded items in a blocked arrangement to decoded_items . If the number of run-length decoded items exceeds the run-length decode buffer (i.e., DECODED_ITEMS_PER_THREAD * BLOCK_THREADS), only the items that fit within the buffer are returned. Subsequent calls to RunLengthDecode adjusting from_decoded_offset can be used to retrieve the remaining run-length decoded items. Calling __syncthreads() between any two calls to RunLengthDecode is not required. item_offsets can be used to retrieve each run-length decoded item's relative index within its run. E.g., the run-length encoded array of 3, 1, 4 with the respective run lengths of 2, 1, 3 would yield the run-length decoded array of 3, 3, 1, 4, 4, 4 with the relative offsets of 0, 1, 0, 0, 1, 2 . \smemreuse. More... | |
__device__ __forceinline__ void | RunLengthDecode (ItemT(&decoded_items)[DECODED_ITEMS_PER_THREAD], DecodedOffsetT from_decoded_offset=0) |
Run-length decodes the runs previously passed via a call to Init(...) and returns the run-length decoded items in a blocked arrangement to decoded_items . If the number of run-length decoded items exceeds the run-length decode buffer (i.e., DECODED_ITEMS_PER_THREAD * BLOCK_THREADS), only the items that fit within the buffer are returned. Subsequent calls to RunLengthDecode adjusting from_decoded_offset can be used to retrieve the remaining run-length decoded items. Calling __syncthreads() between any two calls to RunLengthDecode is not required. More... | |
Detailed Description
template<typename ItemT, int BLOCK_DIM_X, int RUNS_PER_THREAD, int DECODED_ITEMS_PER_THREAD, typename DecodedOffsetT = uint32_t, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1>
class hipcub::BlockRunLengthDecode< ItemT, BLOCK_DIM_X, RUNS_PER_THREAD, DECODED_ITEMS_PER_THREAD, DecodedOffsetT, BLOCK_DIM_Y, BLOCK_DIM_Z >
The BlockRunLengthDecode class supports decoding a run-length encoded array of items. That is, given the two arrays run_value[N] and run_lengths[N], run_value[i] is repeated run_lengths[i] many times in the output array. Due to the nature of the run-length decoding algorithm ("decompression"), the output size of the run-length decoded array is runtime-dependent and potentially without any upper bound. To address this, BlockRunLengthDecode allows retrieving a "window" from the run-length decoded array. The window's offset can be specified and BLOCK_THREADS * DECODED_ITEMS_PER_THREAD (i.e., referred to as window_size) decoded items from the specified window will be returned.
- Note
- : Trailing runs of length 0 are supported (i.e., they may only appear at the end of the run_lengths array). A run of length zero may not be followed by a run length that is not zero.
- __global__ void ExampleKernel(...){// Specialising BlockRunLengthDecode to run-length decode items of type uint64_tusing RunItemT = uint64_t;// Type large enough to index into the run-length decoded arrayusing RunLengthT = uint32_t;// Specialising BlockRunLengthDecode for a 1D block of 128 threadsconstexpr int BLOCK_DIM_X = 128;// Specialising BlockRunLengthDecode to have each thread contribute 2 run-length encoded runsconstexpr int RUNS_PER_THREAD = 2;// Specialising BlockRunLengthDecode to have each thread hold 4 run-length decoded itemsconstexpr int DECODED_ITEMS_PER_THREAD = 4;// Specialize BlockRadixSort for a 1D block of 128 threads owning 4 integer items eachusing BlockRunLengthDecodeT =// Allocate shared memory for BlockRunLengthDecode__shared__ typename BlockRunLengthDecodeT::TempStorage temp_storage;// The run-length encoded items and how often they shall be repeated in the run-length decoded outputRunItemT run_values[RUNS_PER_THREAD];RunLengthT run_lengths[RUNS_PER_THREAD];...// Initialize the BlockRunLengthDecode with the runs that we want to run-length decodeuint32_t total_decoded_size = 0;BlockRunLengthDecodeT block_rld(temp_storage, run_values, run_lengths, total_decoded_size);// Run-length decode ("decompress") the runs into a window buffer of limited size. This is repeated until all runs// have been decoded.uint32_t decoded_window_offset = 0U;while (decoded_window_offset < total_decoded_size){RunLengthT relative_offsets[DECODED_ITEMS_PER_THREAD];RunItemT decoded_items[DECODED_ITEMS_PER_THREAD];// The number of decoded items that are valid within this window (aka pass) of run-length decodinguint32_t num_valid_items = total_decoded_size - decoded_window_offset;block_rld.RunLengthDecode(decoded_items, relative_offsets, decoded_window_offset);decoded_window_offset += BLOCK_DIM_X * DECODED_ITEMS_PER_THREAD;...}}The BlockRunLengthDecode class supports decoding a run-length encoded array of items....Definition: block_run_length_decode.hpp:131
- Suppose the set of input
run_values
across the block of threads is{ [0, 1], [2, 3], [4, 5], [6, 7], ..., [254, 255] }
andrun_lengths
is{ [1, 2], [3, 4], [5, 1], [2, 3], ..., [5, 1] }
. The corresponding outputdecoded_items
in those threads will be{ [0, 1, 1, 2], [2, 2, 3, 3], [3, 3, 4, 4], [4, 4, 4, 5], ..., [169, 169, 170, 171] }
andrelative_offsets
will be{ [0, 0, 1, 0], [1, 2, 0, 1], [2, 3, 0, 1], [2, 3, 4, 0], ..., [3, 4, 0, 0] }
during the first iteration of the while loop.
- Template Parameters
-
ItemT The data type of the items being run-length decoded BLOCK_DIM_X The thread block length in threads along the X dimension RUNS_PER_THREAD The number of consecutive runs that each thread contributes DECODED_ITEMS_PER_THREAD The maximum number of decoded items that each thread holds DecodedOffsetT Type used to index into the block's decoded items (large enough to hold the sum over all the runs' lengths) BLOCK_DIM_Y The thread block length in threads along the Y dimension BLOCK_DIM_Z The thread block length in threads along the Z dimension
Member Function Documentation
◆ RunLengthDecode() [1/2]
|
inline |
Run-length decodes the runs previously passed via a call to Init(...) and returns the run-length decoded items in a blocked arrangement to decoded_items
. If the number of run-length decoded items exceeds the run-length decode buffer (i.e., DECODED_ITEMS_PER_THREAD * BLOCK_THREADS), only the items that fit within the buffer are returned. Subsequent calls to RunLengthDecode adjusting from_decoded_offset
can be used to retrieve the remaining run-length decoded items. Calling __syncthreads() between any two calls to RunLengthDecode is not required. item_offsets
can be used to retrieve each run-length decoded item's relative index within its run. E.g., the run-length encoded array of 3, 1, 4
with the respective run lengths of 2, 1, 3
would yield the run-length decoded array of 3, 3, 1, 4, 4, 4
with the relative offsets of 0, 1, 0, 0, 1, 2
. \smemreuse.
- Parameters
-
[out] decoded_items The run-length decoded items to be returned in a blocked arrangement [out] item_offsets The run-length decoded items' relative offset within the run they belong to [in] from_decoded_offset If invoked with from_decoded_offset that is larger than total_decoded_size results in undefined behavior.
◆ RunLengthDecode() [2/2]
|
inline |
Run-length decodes the runs previously passed via a call to Init(...) and returns the run-length decoded items in a blocked arrangement to decoded_items
. If the number of run-length decoded items exceeds the run-length decode buffer (i.e., DECODED_ITEMS_PER_THREAD * BLOCK_THREADS), only the items that fit within the buffer are returned. Subsequent calls to RunLengthDecode adjusting from_decoded_offset
can be used to retrieve the remaining run-length decoded items. Calling __syncthreads() between any two calls to RunLengthDecode is not required.
- Parameters
-
[out] decoded_items The run-length decoded items to be returned in a blocked arrangement [in] from_decoded_offset If invoked with from_decoded_offset that is larger than total_decoded_size results in undefined behavior.
The documentation for this class was generated from the following file:
- /home/docs/checkouts/readthedocs.org/user_builds/advanced-micro-devices-hipcub/checkouts/docs-5.4.3/hipcub/include/hipcub/backend/rocprim/block/block_run_length_decode.hpp