Histogram
Class

template<class T, unsigned int BlockSizeX, unsigned int ItemsPerThread, unsigned int Bins, block_histogram_algorithm Algorithm = block_histogram_algorithm::default_algorithm, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
class block_histogram
The block_histogram class is a block-level parallel primitive that provides methods for constructing block-wide histograms from items partitioned across the threads of a block.
 Overview
block_histogram has two alternative implementations: block_histogram_algorithm::using_atomic and block_histogram_algorithm::using_sort.
 Examples
In the example, a histogram operation is performed on a block of 192 threads. Each thread provides two int values, and the result is written to the shared-memory array hist.
__global__ void example_kernel(...)
{
    // specialize block_histogram for int, logical block of 192 threads,
    // 2 items per thread and 192 bins
    using block_histogram_int = rocprim::block_histogram<int, 192, 2, 192>;
    // allocate storage in shared memory
    __shared__ block_histogram_int::storage_type storage;
    __shared__ int hist[192];

    int value[2];
    ...
    // execute histogram
    block_histogram_int().histogram(
        value, // input
        hist,  // output
        storage
    );
    ...
}
 Template Parameters:
T –  the input/output type.
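BlockSizeX –  the number of threads in the block's x dimension.
ItemsPerThread –  the number of items to be processed by each thread.
Bins –  the number of bins within the histogram.
Algorithm –  selected histogram algorithm, block_histogram_algorithm::default_algorithm by default.
BlockSizeY –  the number of threads in the block's y dimension, 1 by default.
BlockSizeZ –  the number of threads in the block's z dimension, 1 by default.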
Public Types

using storage_type = typename base_type::storage_type
Struct used to allocate the temporary memory required for thread communication during the operations provided by this parallel primitive.
Depending on the implementation, the operations exposed by the parallel primitive may require temporary storage for thread communication. The storage should be allocated using the __shared__ keyword. It can be aliased to externally allocated memory, or be part of a union type with other storage types to increase shared memory reusability.
Public Functions

template<class Counter>
__device__ inline void init_histogram(Counter hist[Bins])
Initialize histogram counters to zero.
 Template Parameters:
Counter –  [inferred] counter type of histogram.
 Parameters:
hist – [out]  histogram bin count.
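As a rough illustration of what cooperative zero-initialization can look like, the following host-side C++ model simulates the threads of a block clearing the counters with a block-strided loop. This is only a sketch: the function name init_histogram_model is hypothetical, the strided access pattern is an assumption rather than a statement about rocPRIM's implementation, and the simulated threads run one after another on the CPU.

```cpp
#include <array>
#include <cstddef>

// Host-side model, not rocPRIM code. The strided pattern is an assumption
// made for illustration.
template <std::size_t BlockSize, std::size_t Bins, class Counter>
void init_histogram_model(std::array<Counter, Bins>& hist)
{
    // One outer iteration per simulated thread of the block.
    for (std::size_t tid = 0; tid < BlockSize; ++tid)
    {
        // Thread tid clears bins tid, tid + BlockSize, tid + 2 * BlockSize, ...
        for (std::size_t bin = tid; bin < Bins; bin += BlockSize)
        {
            hist[bin] = Counter(0);
        }
    }
}
```

With this pattern every bin is cleared exactly once, whether Bins is smaller or larger than the block size. On the device, a barrier is still needed before other threads read or update the counters, as the composite example does with rocprim::syncthreads().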

template<class Counter>
__device__ inline void composite(T (&input)[ItemsPerThread], Counter hist[Bins], storage_type &storage)
Update an existing block-wide histogram. Each thread composites an array of input elements.
 Storage reuse
A synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().
 Examples
In the example, an existing histogram is updated by a block of 192 threads. Each thread provides two int values, and the counts are accumulated into the shared-memory array hist.
__global__ void example_kernel(...)
{
    // specialize block_histogram for int, logical block of 192 threads,
    // 2 items per thread and 192 bins
    using block_histogram_int = rocprim::block_histogram<int, 192, 2, 192>;
    // allocate storage in shared memory
    __shared__ block_histogram_int::storage_type storage;
    __shared__ int hist[192];

    int value[2];
    ...
    // initialize histogram
    block_histogram_int().init_histogram(
        hist // output
    );

    rocprim::syncthreads();

    // update histogram
    block_histogram_int().composite(
        value, // input
        hist,  // output
        storage
    );
    ...
}
 Template Parameters:
Counter –  [inferred] counter type of histogram.
 Parameters:
input – [in]  reference to an array containing thread input values. The function expects each value to satisfy 0 <= input[i] < Bins.
hist – [in, out]  histogram bin counts, updated in place.
storage – [in]  reference to a temporary storage object of type storage_type.

template<class Counter>
__device__ inline void composite(T (&input)[ItemsPerThread], Counter hist[Bins])
Update an existing block-wide histogram. Each thread composites an array of input elements.
This is an overloaded member function, provided for convenience. It differs from the above function only in the arguments it accepts.
This overload does not accept a storage argument. The required shared memory is allocated by the method itself.
 Template Parameters:
Counter –  [inferred] counter type of histogram.
 Parameters:
input – [in]  reference to an array containing thread input values. The function expects each value to satisfy 0 <= input[i] < Bins.
hist – [in, out]  histogram bin counts, updated in place.

template<class Counter>
__device__ inline void histogram(T (&input)[ItemsPerThread], Counter hist[Bins], storage_type &storage)
Construct a new block-wide histogram. Each thread contributes an array of input elements.
 Storage reuse
A synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().
 Examples
In the example, a histogram operation is performed on a block of 192 threads. Each thread provides two int values, and the result is written to the shared-memory array hist.
__global__ void example_kernel(...)
{
    // specialize block_histogram for int, logical block of 192 threads,
    // 2 items per thread and 192 bins
    using block_histogram_int = rocprim::block_histogram<int, 192, 2, 192>;
    // allocate storage in shared memory
    __shared__ block_histogram_int::storage_type storage;
    __shared__ int hist[192];

    int value[2];
    ...
    // execute histogram
    block_histogram_int().histogram(
        value, // input
        hist,  // output
        storage
    );
    ...
}
 Template Parameters:
Counter –  [inferred] counter type of histogram.
 Parameters:
input – [in]  reference to an array containing thread input values. The function expects each value to satisfy 0 <= input[i] < Bins.
hist – [out]  histogram bin count.
storage – [in]  reference to a temporary storage object of type storage_type.

template<class Counter>
__device__ inline void histogram(T (&input)[ItemsPerThread], Counter hist[Bins])
Construct a new block-wide histogram. Each thread contributes an array of input elements.
This is an overloaded member function, provided for convenience. It differs from the above function only in the arguments it accepts.
This overload does not accept a storage argument. The required shared memory is allocated by the method itself.
 Template Parameters:
Counter –  [inferred] counter type of histogram.
 Parameters:
input – [in]  reference to an array containing thread input values. The function expects each value to satisfy 0 <= input[i] < Bins.
hist – [out]  histogram bin count.
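The difference between histogram (construct a new histogram) and composite (update an existing one) can be modelled on the host as follows. This is a minimal reference sketch in plain C++, not rocPRIM code: histogram_model and composite_model are hypothetical names, all items of the block are flattened into one vector, and skipping out-of-range values is a choice of the sketch, since the documentation above only states the precondition 0 <= input[i] < Bins.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical host-side model of composite(): add counts to an existing histogram.
template <std::size_t Bins>
void composite_model(const std::vector<int>& block_input,
                     std::array<unsigned, Bins>& hist)
{
    for (int v : block_input)
    {
        // Precondition of the device API: 0 <= v < Bins.
        // The model ignores values that violate it (an assumption of this sketch).
        if (v >= 0 && static_cast<std::size_t>(v) < Bins)
        {
            ++hist[static_cast<std::size_t>(v)];
        }
    }
}

// Hypothetical host-side model of histogram(): clear the counters, then accumulate.
template <std::size_t Bins>
void histogram_model(const std::vector<int>& block_input,
                     std::array<unsigned, Bins>& hist)
{
    hist.fill(0u);
    composite_model(block_input, hist);
}
```

A model like this can serve as a CPU reference when checking kernel output: calling composite_model repeatedly accumulates counts across batches, while histogram_model discards whatever hist held before.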
Algorithms

enum class rocprim::block_histogram_algorithm
Available algorithms for block_histogram primitive.
Values:

enumerator using_atomic
Atomic addition is used to update bin counts directly.
 Performance Notes:
Performance depends on the hardware implementation of atomic addition.
Performance may decrease for non-uniform random input distributions, where many concurrent updates may be made to the same bin counter.
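The idea behind the atomic approach can be sketched on the host, with std::thread standing in for GPU threads and std::atomic::fetch_add standing in for the device atomic addition. This is an analogy under those assumptions, not the rocPRIM implementation; atomic_histogram_model is a hypothetical name, and inputs are assumed to satisfy 0 <= value < Bins.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Host-side analogy of an atomic histogram. Each inner vector holds the
// items of one simulated thread; every value must be in [0, Bins).
template <std::size_t Bins>
std::array<unsigned, Bins>
atomic_histogram_model(const std::vector<std::vector<int>>& per_thread_items)
{
    std::array<std::atomic<unsigned>, Bins> counters;
    for (auto& c : counters)
    {
        c.store(0u);
    }

    std::vector<std::thread> threads;
    for (const auto& items : per_thread_items)
    {
        threads.emplace_back([&counters, &items] {
            for (int v : items)
            {
                // Every item is one atomic increment of its bin counter.
                counters[static_cast<std::size_t>(v)].fetch_add(1u, std::memory_order_relaxed);
            }
        });
    }
    for (auto& t : threads)
    {
        t.join();
    }

    std::array<unsigned, Bins> hist{};
    for (std::size_t b = 0; b < Bins; ++b)
    {
        hist[b] = counters[b].load();
    }
    return hist;
}
```

When most values fall into one bin, all increments target the same counter and have to be serialized by the hardware, which is the contention effect described in the performance notes.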

enumerator using_sort
A two-phase operation is used:
Data is sorted using radix sort.
"Runs" of same-valued keys are detected using discontinuity; run lengths are bin counts.
 Performance Notes:
Performance is consistent regardless of the sample bin distribution.
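The two phases can be sketched sequentially on the host. In this sketch std::sort stands in for the block-wide radix sort and a simple comparison with the previous key stands in for the discontinuity primitive; sort_histogram_model is a hypothetical name and the input is assumed to satisfy 0 <= key < Bins. It illustrates the technique, not the library's code.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <vector>

// Host-side sketch of a sort-based histogram. Every key must be in [0, Bins).
template <std::size_t Bins>
std::array<unsigned, Bins> sort_histogram_model(std::vector<int> keys)
{
    std::array<unsigned, Bins> hist{};

    // Phase 1: sort the keys (stand-in for the block-wide radix sort).
    std::sort(keys.begin(), keys.end());

    // Phase 2: a run ends where a key differs from its predecessor,
    // or at the end of the data. The run length is the bin count.
    std::size_t run_start = 0;
    for (std::size_t i = 1; i <= keys.size(); ++i)
    {
        const bool discontinuity = (i == keys.size()) || (keys[i] != keys[i - 1]);
        if (discontinuity)
        {
            hist[static_cast<std::size_t>(keys[run_start])] += static_cast<unsigned>(i - run_start);
            run_start = i;
        }
    }
    return hist;
}
```

The amount of work depends on the number of keys rather than on how they are distributed over the bins, which matches the note that performance is consistent regardless of the distribution.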

enumerator default_algorithm
Default block_histogram algorithm.
