Sort#

generic#

template<class Key, unsigned int BlockSizeX, unsigned int ItemsPerThread = 1, class Value = empty_type, block_sort_algorithm Algorithm = block_sort_algorithm::default_algorithm, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
class block_sort#

The block_sort class is a block level parallel primitive which provides methods sorting items (keys or key-value pairs) partitioned across threads in a block using comparison-based sort algorithm.

Overview

  • Accepts custom compare_functions for sorting across a block.

  • Performance depends on BlockSize.

    • It is better if BlockSize is a power of two.

    • If BlockSize is not a power of two, or when function with size overload is used odd-even sort is used instead of bitonic sort, leading to decreased performance.

Examples

In the examples sort is performed on a block of 256 threads, each thread provides one int value, results are returned using the same variable as for input.

__global__ void example_kernel(...)
{
    // specialize block_sort for int, block of 256 threads,
    // key-only sort
    using block_sort_int = rocprim::block_sort<int, 256>;
    // allocate storage in shared memory
    __shared__ block_sort_int::storage_type storage;

    int input = ...;
    // execute block sort (ascending)
    block_sort_int().sort(
        input,
        storage
    );
    ...
}

Template Parameters:
  • Key – - the key type.

  • BlockSize – - the number of threads in a block.

  • ItemsPerThread – - number of items processed by each thread. The total range will be BlockSize * ItemsPerThread long

  • Value – - the value type. Default type empty_type indicates a keys-only sort.

  • Algorithm – - selected sort algorithm, block_sort_algorithm::default_algorithm by default.

Public Types

using storage_type = typename base_type::storage_type#

Struct used to allocate a temporary memory that is required for thread communication during operations provided by related parallel primitive.

Depending on the implemention the operations exposed by parallel primitive may require a temporary storage for thread communication. The storage should be allocated using keywords . It can be aliased to an externally allocated memory, or be a part of a union type with other storage types to increase shared memory reusability.

Public Functions

template<class BinaryFunction = ::rocprim::less<Key>>
__device__ inline void sort(Key &thread_key, BinaryFunction compare_function = BinaryFunction())#

Block sort for any data type.

Template Parameters:

BinaryFunction – - type of binary function used for sort. Default type is rocprim::less<T>.

Parameters:
  • thread_key[inout] - reference to a key provided by a thread.

  • compare_function[in] - comparison function object which returns true if the first argument is is ordered before the second. The signature of the function should be equivalent to the following: bool f(const T &a, const T &b);. The signature does not need to have const &, but function object must not modify the objects passed to it.

template<class BinaryFunction = ::rocprim::less<Key>>
__device__ inline void sort(Key &thread_key, storage_type &storage, BinaryFunction compare_function = BinaryFunction())#

Block sort for any data type.

Storage reusage

Synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().

Examples

In the examples sort is performed on a block of 256 threads, each thread provides one int value, results are returned using the same variable as for input.

__global__ void example_kernel(...)
{
    // specialize block_sort for int, block of 256 threads,
    // key-only sort
    using block_sort_int = rocprim::block_sort<int, 256>;
    // allocate storage in shared memory
    __shared__ block_sort_int::storage_type storage;

    int input = ...;
    // execute block sort (ascending)
    block_sort_int().sort(
        input,
        storage
    );
    ...
}

Template Parameters:

BinaryFunction – - type of binary function used for sort. Default type is rocprim::less<T>.

Parameters:
  • thread_key[inout] - reference to a key provided by a thread.

  • storage[in] - reference to a temporary storage object of type storage_type.

  • compare_function[in] - comparison function object which returns true if the first argument is is ordered before the second. The signature of the function should be equivalent to the following: bool f(const T &a, const T &b);. The signature does not need to have const &, but function object must not modify the objects passed to it.

template<class BinaryFunction = ::rocprim::less<Key>>
__device__ inline void sort(Key &thread_key, Value &thread_value, BinaryFunction compare_function = BinaryFunction())#

Block sort by key for any data type.

Template Parameters:

BinaryFunction – - type of binary function used for sort. Default type is rocprim::less<T>.

Parameters:
  • thread_key[inout] - reference to a key provided by a thread.

  • thread_value[inout] - reference to a value provided by a thread.

  • compare_function[in] - comparison function object which returns true if the first argument is is ordered before the second. The signature of the function should be equivalent to the following: bool f(const T &a, const T &b);. The signature does not need to have const &, but function object must not modify the objects passed to it.

template<class BinaryFunction = ::rocprim::less<Key>>
__device__ inline void sort(Key &thread_key, Value &thread_value, storage_type &storage, BinaryFunction compare_function = BinaryFunction())#

Block sort by key for any data type.

In the examples sort is performed on a block of 256 threads, each thread provides one int key and one int value, results are returned using the same variable as for input.

__global__ void example_kernel(...)
{
    // specialize block_sort for int, block of 256 threads,
    using block_sort_int = rocprim::block_sort<int, 256, int>;
    // allocate storage in shared memory
    __shared__ block_sort_int::storage_type storage;

    int key = ...;
    int value = ...;
    // execute block sort (ascending)
    block_sort_int().sort(
        key,
        value,
        storage
    );
    ...
}
Storage reusage

Synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().

Template Parameters:

BinaryFunction – - type of binary function used for sort. Default type is rocprim::less<T>.

Parameters:
  • thread_key[inout] - reference to a key provided by a thread.

  • thread_value[inout] - reference to a value provided by a thread.

  • storage[in] - reference to a temporary storage object of type storage_type.

  • compare_function[in] - comparison function object which returns true if the first argument is is ordered before the second. The signature of the function should be equivalent to the following: bool f(const T &a, const T &b);. The signature does not need to have const &, but function object must not modify the objects passed to it.

template<class BinaryFunction = ::rocprim::less<Key>>
__device__ inline void sort(Key &thread_key, storage_type &storage, const unsigned int size, BinaryFunction compare_function = BinaryFunction())#

Block sort by key for any data type. If size is greater than BlockSize, this function does nothing.

Remark

Not implemented for block_sort_algorithm::merge_sort

Template Parameters:

BinaryFunction – - type of binary function used for sort. Default type is rocprim::less<T>.

Parameters:
  • thread_key[inout] - reference to a key provided by a thread.

  • storage[in] - reference to a temporary storage object of type storage_type.

  • size[in] - custom size of block to be sorted.

  • compare_function[in] - comparison function object which returns true if the first argument is is ordered before the second. The signature of the function should be equivalent to the following: bool f(const T &a, const T &b);. The signature does not need to have const &, but function object must not modify the objects passed to it.

template<class BinaryFunction = ::rocprim::less<Key>>
__device__ inline void sort(Key (&thread_keys)[ItemsPerThread], storage_type &storage, const unsigned int size, BinaryFunction compare_function = BinaryFunction())#

Block sort by key for any data type. This function sorts up to size elements blocked across threads.

Remark

Not implemented for block_sort_algorithm::merge_sort and block_sort_algorithm::bitonic_sort

Template Parameters:

BinaryFunction – - type of binary function used for sort. Default type is rocprim::less<T>.

Parameters:
  • thread_keys[inout] - reference to keys provided by a thread.

  • storage[in] - reference to a temporary storage object of type storage_type.

  • size[in] - custom size of block to be sorted.

  • compare_function[in] - comparison function object which returns true if the first argument is is ordered before the second. The signature of the function should be equivalent to the following: bool f(const T &a, const T &b);. The signature does not need to have const &, but function object must not modify the objects passed to it.

template<class BinaryFunction = ::rocprim::less<Key>>
__device__ inline void sort(Key (&thread_keys)[ItemsPerThread], Value (&thread_values)[ItemsPerThread], storage_type &storage, const unsigned int size, BinaryFunction compare_function = BinaryFunction())#

Block sort by key for any data type. This function sorts up to size elements blocked across threads.

Remark

Not implemented for block_sort_algorithm::merge_sort and block_sort_algorithm::bitonic_sort

Template Parameters:

BinaryFunction – - type of binary function used for sort. Default type is rocprim::less<T>.

Parameters:
  • thread_keys[inout] - reference to keys provided by a thread.

  • thread_values[inout] - reference to values provided by a thread.

  • storage[in] - reference to a temporary storage object of type storage_type.

  • size[in] - custom size of block to be sorted.

  • compare_function[in] - comparison function object which returns true if the first argument is is ordered before the second. The signature of the function should be equivalent to the following: bool f(const T &a, const T &b);. The signature does not need to have const &, but function object must not modify the objects passed to it.

enum class rocprim::block_sort_algorithm#

Available algorithms for block_sort primitive.

Values:

enumerator bitonic_sort#

A bitonic sort based algorithm.

enumerator merge_sort#

A merge sort based algorithm.

enumerator stable_merge_sort#

A merged sort based algorithm which sorts stably.

enumerator default_algorithm#

Default block_sort algorithm.

radix sort#

template<class Key, unsigned int BlockSizeX, unsigned int ItemsPerThread, class Value = empty_type, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
class block_radix_sort#

The block_radix_sort class is a block level parallel primitive which provides methods for sorting of items (keys or key-value pairs) partitioned across threads in a block using radix sort algorithm.

Overview

  • Key type must be an arithmetic type (that is, an integral type or a floating-point type).

  • Performance depends on BlockSize and ItemsPerThread.

    • It is usually better for BlockSize to be a multiple of the size of the hardware warp.

    • It is usually increased when ItemsPerThread is greater than one. However, when there are too many items per thread, each thread may need so much registers and/or shared memory that occupancy will fall too low, decreasing the performance.

    • If Key is an integer type and the range of keys is known in advance, the performance can be improved by setting begin_bit and end_bit, for example if all keys are in range [100, 10000], begin_bit = 0 and end_bit = 14 will cover the whole range.

Examples

In the examples radix sort is performed on a block of 256 threads, each thread provides eight int value, results are returned using the same array as for input.

__global__ void example_kernel(...)
{
    // specialize block_radix_sort for int, block of 256 threads,
    // and eight items per thread; key-only sort
    using block_rsort_int = rocprim::block_radix_sort<int, 256, 8>;
    // allocate storage in shared memory
    __shared__ block_rsort_int::storage_type storage;

    int input[8] = ...;
    // execute block radix sort (ascending)
    block_rsort_int().sort(
        input,
        storage
    );
    ...
}

Template Parameters:
  • Key – - the key type.

  • BlockSize – - the number of threads in a block.

  • ItemsPerThread – - the number of items contributed by each thread.

  • Value – - the value type. Default type empty_type indicates a keys-only sort.

Public Types

using storage_type = storage_type_#

Struct used to allocate a temporary memory that is required for thread communication during operations provided by related parallel primitive.

Depending on the implemention the operations exposed by parallel primitive may require a temporary storage for thread communication. The storage should be allocated using keywords . It can be aliased to an externally allocated memory, or be a part of a union type with other storage types to increase shared memory reusability.

Public Functions

__device__ inline void sort(Key (&keys)[ItemsPerThread], storage_type &storage, unsigned int begin_bit = 0, unsigned int end_bit = 8 * sizeof(Key))#

Performs ascending radix sort over keys partitioned across threads in a block.

Storage reusage

Synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().

Examples

In the examples radix sort is performed on a block of 128 threads, each thread provides two float value, results are returned using the same array as for input.

__global__ void example_kernel(...)
{
    // specialize block_radix_sort for float, block of 128 threads,
    // and two items per thread; key-only sort
    using block_rsort_float = rocprim::block_radix_sort<float, 128, 2>;
    // allocate storage in shared memory
    __shared__ block_rsort_float::storage_type storage;

    float input[2] = ...;
    // execute block radix sort (ascending)
    block_rsort_float().sort(
        input,
        storage
    );
    ...
}

If the input values across threads in a block are {[256, 255], ..., [4, 3], [2, 1]}}, then then after sort they will be equal {[1, 2], [3, 4] ..., [255, 256]}.

Parameters:
  • keys[inout] - reference to an array of keys provided by a thread.

  • storage[in] - reference to a temporary storage object of type storage_type.

  • begin_bit[in] - [optional] index of the first (least significant) bit used in key comparison. Must be in range [0; 8 * sizeof(Key)). Default value: 0.

  • end_bit[in] - [optional] past-the-end index (most significant) bit used in key comparison. Must be in range (begin_bit; 8 * sizeof(Key)]. Default value: * sizeof(Key).

__device__ inline void sort(Key (&keys)[ItemsPerThread], unsigned int begin_bit = 0, unsigned int end_bit = 8 * sizeof(Key))#

Performs ascending radix sort over keys partitioned across threads in a block.

This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

  • This overload does not accept storage argument. Required shared memory is allocated by the method itself.

Parameters:
  • keys[inout] - reference to an array of keys provided by a thread.

  • begin_bit[in] - [optional] index of the first (least significant) bit used in key comparison. Must be in range [0; 8 * sizeof(Key)). Default value: 0.

  • end_bit[in] - [optional] past-the-end index (most significant) bit used in key comparison. Must be in range (begin_bit; 8 * sizeof(Key)]. Default value: * sizeof(Key).

__device__ inline void sort_desc(Key (&keys)[ItemsPerThread], storage_type &storage, unsigned int begin_bit = 0, unsigned int end_bit = 8 * sizeof(Key))#

Performs descending radix sort over keys partitioned across threads in a block.

Storage reusage

Synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().

Examples

In the examples radix sort is performed on a block of 128 threads, each thread provides two float value, results are returned using the same array as for input.

__global__ void example_kernel(...)
{
    // specialize block_radix_sort for float, block of 128 threads,
    // and two items per thread; key-only sort
    using block_rsort_float = rocprim::block_radix_sort<float, 128, 2>;
    // allocate storage in shared memory
    __shared__ block_rsort_float::storage_type storage;

    float input[2] = ...;
    // execute block radix sort (descending)
    block_rsort_float().sort_desc(
        input,
        storage
    );
    ...
}

If the input values across threads in a block are {[1, 2], [3, 4] ..., [255, 256]}, then after sort they will be equal {[256, 255], ..., [4, 3], [2, 1]}.

Parameters:
  • keys[inout] - reference to an array of keys provided by a thread.

  • storage[in] - reference to a temporary storage object of type storage_type.

  • begin_bit[in] - [optional] index of the first (least significant) bit used in key comparison. Must be in range [0; 8 * sizeof(Key)). Default value: 0.

  • end_bit[in] - [optional] past-the-end index (most significant) bit used in key comparison. Must be in range (begin_bit; 8 * sizeof(Key)]. Default value: * sizeof(Key).

__device__ inline void sort_desc(Key (&keys)[ItemsPerThread], unsigned int begin_bit = 0, unsigned int end_bit = 8 * sizeof(Key))#

Performs descending radix sort over keys partitioned across threads in a block.

This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

  • This overload does not accept storage argument. Required shared memory is allocated by the method itself.

Parameters:
  • keys[inout] - reference to an array of keys provided by a thread.

  • begin_bit[in] - [optional] index of the first (least significant) bit used in key comparison. Must be in range [0; 8 * sizeof(Key)). Default value: 0.

  • end_bit[in] - [optional] past-the-end index (most significant) bit used in key comparison. Must be in range (begin_bit; 8 * sizeof(Key)]. Default value: * sizeof(Key).

template<bool WithValues = with_values>
__device__ inline void sort(Key (&keys)[ItemsPerThread], typename std::enable_if<WithValues, Value>::type (&values)[ItemsPerThread], storage_type &storage, unsigned int begin_bit = 0, unsigned int end_bit = 8 * sizeof(Key))#

Performs ascending radix sort over key-value pairs partitioned across threads in a block.

Storage reusage

Synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().

Examples

In the examples radix sort is performed on a block of 128 threads, each thread provides two key-value int-float pairs, results are returned using the same arrays as for input.

__global__ void example_kernel(...)
{
    // specialize block_radix_sort for int-float pairs, block of 128
    // threads, and two items per thread
    using block_rsort_ii = rocprim::block_radix_sort<int, 128, 2, int>;
    // allocate storage in shared memory
    __shared__ block_rsort_ii::storage_type storage;

    int keys[2] = ...;
    float values[2] = ...;
    // execute block radix sort-by-key (ascending)
    block_rsort_ii().sort(
        keys, values,
        storage
    );
    ...
}

If the keys across threads in a block are {[256, 255], ..., [4, 3], [2, 1]} and the values are {[1, 1], [2, 2] ..., [128, 128]}, then after sort the keys will be equal {[1, 2], [3, 4] ..., [255, 256]} and the values will be equal {[128, 128], [127, 127] ..., [2, 2], [1, 1]}.

Parameters:
  • keys[inout] - reference to an array of keys provided by a thread.

  • values[inout] - reference to an array of values provided by a thread.

  • storage[in] - reference to a temporary storage object of type storage_type.

  • begin_bit[in] - [optional] index of the first (least significant) bit used in key comparison. Must be in range [0; 8 * sizeof(Key)). Default value: 0.

  • end_bit[in] - [optional] past-the-end index (most significant) bit used in key comparison. Must be in range (begin_bit; 8 * sizeof(Key)]. Default value: * sizeof(Key).

Pre:

Method is enabled only if Value type is different than empty_type.

template<bool WithValues = with_values>
__device__ inline void sort(Key (&keys)[ItemsPerThread], typename std::enable_if<WithValues, Value>::type (&values)[ItemsPerThread], unsigned int begin_bit = 0, unsigned int end_bit = 8 * sizeof(Key))#

Performs ascending radix sort over key-value pairs partitioned across threads in a block.

This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

  • This overload does not accept storage argument. Required shared memory is allocated by the method itself.

Parameters:
  • keys[inout] - reference to an array of keys provided by a thread.

  • values[inout] - reference to an array of values provided by a thread.

  • begin_bit[in] - [optional] index of the first (least significant) bit used in key comparison. Must be in range [0; 8 * sizeof(Key)). Default value: 0.

  • end_bit[in] - [optional] past-the-end index (most significant) bit used in key comparison. Must be in range (begin_bit; 8 * sizeof(Key)]. Default value: * sizeof(Key).

Pre:

Method is enabled only if Value type is different than empty_type.

template<bool WithValues = with_values>
__device__ inline void sort_desc(Key (&keys)[ItemsPerThread], typename std::enable_if<WithValues, Value>::type (&values)[ItemsPerThread], storage_type &storage, unsigned int begin_bit = 0, unsigned int end_bit = 8 * sizeof(Key))#

Performs descending radix sort over key-value pairs partitioned across threads in a block.

Storage reusage

Synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().

Examples

In the examples radix sort is performed on a block of 128 threads, each thread provides two key-value int-float pairs, results are returned using the same arrays as for input.

__global__ void example_kernel(...)
{
    // specialize block_radix_sort for int-float pairs, block of 128
    // threads, and two items per thread
    using block_rsort_ii = rocprim::block_radix_sort<int, 128, 2, int>;
    // allocate storage in shared memory
    __shared__ block_rsort_ii::storage_type storage;

    int keys[2] = ...;
    float values[2] = ...;
    // execute block radix sort-by-key (descending)
    block_rsort_ii().sort_desc(
        keys, values,
        storage
    );
    ...
}

If the keys across threads in a block are {[1, 2], [3, 4] ..., [255, 256]} and the values are {[128, 128], [127, 127] ..., [2, 2], [1, 1]}, then after sort the keys will be equal {[256, 255], ..., [4, 3], [2, 1]} and the values will be equal {[1, 1], [2, 2] ..., [128, 128]}.

Parameters:
  • keys[inout] - reference to an array of keys provided by a thread.

  • values[inout] - reference to an array of values provided by a thread.

  • storage[in] - reference to a temporary storage object of type storage_type.

  • begin_bit[in] - [optional] index of the first (least significant) bit used in key comparison. Must be in range [0; 8 * sizeof(Key)). Default value: 0.

  • end_bit[in] - [optional] past-the-end index (most significant) bit used in key comparison. Must be in range (begin_bit; 8 * sizeof(Key)]. Default value: * sizeof(Key).

Pre:

Method is enabled only if Value type is different than empty_type.

template<bool WithValues = with_values>
__device__ inline void sort_desc(Key (&keys)[ItemsPerThread], typename std::enable_if<WithValues, Value>::type (&values)[ItemsPerThread], unsigned int begin_bit = 0, unsigned int end_bit = 8 * sizeof(Key))#

Performs descending radix sort over key-value pairs partitioned across threads in a block.

This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

  • This overload does not accept storage argument. Required shared memory is allocated by the method itself.

Parameters:
  • keys[inout] - reference to an array of keys provided by a thread.

  • values[inout] - reference to an array of values provided by a thread.

  • begin_bit[in] - [optional] index of the first (least significant) bit used in key comparison. Must be in range [0; 8 * sizeof(Key)). Default value: 0.

  • end_bit[in] - [optional] past-the-end index (most significant) bit used in key comparison. Must be in range (begin_bit; 8 * sizeof(Key)]. Default value: * sizeof(Key).

Pre:

Method is enabled only if Value type is different than empty_type.

__device__ inline void sort_to_striped(Key (&keys)[ItemsPerThread], storage_type &storage, unsigned int begin_bit = 0, unsigned int end_bit = 8 * sizeof(Key))#

Performs ascending radix sort over keys partitioned across threads in a block, results are saved in a striped arrangement.

Storage reusage

Synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().

Examples

In the examples radix sort is performed on a block of 128 threads, each thread provides two float value, results are returned using the same array as for input.

__global__ void example_kernel(...)
{
    // specialize block_radix_sort for float, block of 128 threads,
    // and two items per thread; key-only sort
    using block_rsort_float = rocprim::block_radix_sort<float, 128, 2>;
    // allocate storage in shared memory
    __shared__ block_rsort_float::storage_type storage;

    float keys[2] = ...;
    // execute block radix sort (ascending)
    block_rsort_float().sort_to_striped(
        keys,
        storage
    );
    ...
}

If the input values across threads in a block are {[256, 255], ..., [4, 3], [2, 1]}}, then then after sort they will be equal {[1, 129], [2, 130] ..., [128, 256]}.

Parameters:
  • keys[inout] - reference to an array of keys provided by a thread.

  • storage[in] - reference to a temporary storage object of type storage_type.

  • begin_bit[in] - [optional] index of the first (least significant) bit used in key comparison. Must be in range [0; 8 * sizeof(Key)). Default value: 0.

  • end_bit[in] - [optional] past-the-end index (most significant) bit used in key comparison. Must be in range (begin_bit; 8 * sizeof(Key)]. Default value: * sizeof(Key).

__device__ inline void sort_to_striped(Key (&keys)[ItemsPerThread], unsigned int begin_bit = 0, unsigned int end_bit = 8 * sizeof(Key))#

Performs ascending radix sort over keys partitioned across threads in a block, results are saved in a striped arrangement.

This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

  • This overload does not accept storage argument. Required shared memory is allocated by the method itself.

Parameters:
  • keys[inout] - reference to an array of keys provided by a thread.

  • begin_bit[in] - [optional] index of the first (least significant) bit used in key comparison. Must be in range [0; 8 * sizeof(Key)). Default value: 0.

  • end_bit[in] - [optional] past-the-end index (most significant) bit used in key comparison. Must be in range (begin_bit; 8 * sizeof(Key)]. Default value: * sizeof(Key).

__device__ inline void sort_desc_to_striped(Key (&keys)[ItemsPerThread], storage_type &storage, unsigned int begin_bit = 0, unsigned int end_bit = 8 * sizeof(Key))#

Performs descending radix sort over keys partitioned across threads in a block, results are saved in a striped arrangement.

Storage reusage

Synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().

Examples

In the examples radix sort is performed on a block of 128 threads, each thread provides two float value, results are returned using the same array as for input.

__global__ void example_kernel(...)
{
    // specialize block_radix_sort for float, block of 128 threads,
    // and two items per thread; key-only sort
    using block_rsort_float = rocprim::block_radix_sort<float, 128, 2>;
    // allocate storage in shared memory
    __shared__ block_rsort_float::storage_type storage;

    float input[2] = ...;
    // execute block radix sort (descending)
    block_rsort_float().sort_desc_to_striped(
        input,
        storage
    );
    ...
}

If the input values across threads in a block are {[1, 2], [3, 4] ..., [255, 256]}, then after sort they will be equal {[256, 128], ..., [130, 2], [129, 1]}.

Parameters:
  • keys[inout] - reference to an array of keys provided by a thread.

  • storage[in] - reference to a temporary storage object of type storage_type.

  • begin_bit[in] - [optional] index of the first (least significant) bit used in key comparison. Must be in range [0; 8 * sizeof(Key)). Default value: 0.

  • end_bit[in] - [optional] past-the-end index (most significant) bit used in key comparison. Must be in range (begin_bit; 8 * sizeof(Key)]. Default value: * sizeof(Key).

__device__ inline void sort_desc_to_striped(Key (&keys)[ItemsPerThread], unsigned int begin_bit = 0, unsigned int end_bit = 8 * sizeof(Key))#

Performs descending radix sort over keys partitioned across threads in a block, results are saved in a striped arrangement.

This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

  • This overload does not accept storage argument. Required shared memory is allocated by the method itself.

Parameters:
  • keys[inout] - reference to an array of keys provided by a thread.

  • begin_bit[in] - [optional] index of the first (least significant) bit used in key comparison. Must be in range [0; 8 * sizeof(Key)). Default value: 0.

  • end_bit[in] - [optional] past-the-end index (most significant) bit used in key comparison. Must be in range (begin_bit; 8 * sizeof(Key)]. Default value: * sizeof(Key).

template<bool WithValues = with_values>
__device__ inline void sort_to_striped(Key (&keys)[ItemsPerThread], typename std::enable_if<WithValues, Value>::type (&values)[ItemsPerThread], storage_type &storage, unsigned int begin_bit = 0, unsigned int end_bit = 8 * sizeof(Key))#

Performs ascending radix sort over key-value pairs partitioned across threads in a block, results are saved in a striped arrangement.

Storage reusage

Synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().

Examples

In the examples radix sort is performed on a block of 4 threads, each thread provides two key-value int-float pairs, results are returned using the same arrays as for input.

__global__ void example_kernel(...)
{
    // specialize block_radix_sort for int-float pairs, block of 4
    // threads, and two items per thread
    using block_rsort_ii = rocprim::block_radix_sort<int, 4, 2, int>;
    // allocate storage in shared memory
    __shared__ block_rsort_ii::storage_type storage;

    int keys[2] = ...;
    float values[2] = ...;
    // execute block radix sort-by-key (ascending)
    block_rsort_ii().sort_to_striped(
        keys, values,
        storage
    );
    ...
}

If the keys across threads in a block are {[8, 7], [6, 5], [4, 3], [2, 1]} and the values are {[-1, -2], [-3, -4], [-5, -6], [-7, -8]}, then after sort the keys will be equal {[1, 5], [2, 6], [3, 7], [4, 8]} and the values will be equal {[-8, -4], [-7, -3], [-6, -2], [-5, -1]}.

Parameters:
  • keys[inout] - reference to an array of keys provided by a thread.

  • values[inout] - reference to an array of values provided by a thread.

  • storage[in] - reference to a temporary storage object of type storage_type.

  • begin_bit[in] - [optional] index of the first (least significant) bit used in key comparison. Must be in range [0; 8 * sizeof(Key)). Default value: 0.

  • end_bit[in] - [optional] past-the-end index (most significant) bit used in key comparison. Must be in range (begin_bit; 8 * sizeof(Key)]. Default value: * sizeof(Key).

Pre:

Method is enabled only if Value type is different than empty_type.

template<bool WithValues = with_values>
__device__ inline void sort_to_striped(Key (&keys)[ItemsPerThread], typename std::enable_if<WithValues, Value>::type (&values)[ItemsPerThread], unsigned int begin_bit = 0, unsigned int end_bit = 8 * sizeof(Key))#

Performs ascending radix sort over key-value pairs partitioned across threads in a block, results are saved in a striped arrangement.

This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

  • This overload does not accept storage argument. Required shared memory is allocated by the method itself.

Parameters:
  • keys[inout] - reference to an array of keys provided by a thread.

  • values[inout] - reference to an array of values provided by a thread.

  • begin_bit[in] - [optional] index of the first (least significant) bit used in key comparison. Must be in range [0; 8 * sizeof(Key)). Default value: 0.

  • end_bit[in] - [optional] past-the-end index (most significant) bit used in key comparison. Must be in range (begin_bit; 8 * sizeof(Key)]. Default value: * sizeof(Key).

template<bool WithValues = with_values>
__device__ inline void sort_desc_to_striped(Key (&keys)[ItemsPerThread], typename std::enable_if<WithValues, Value>::type (&values)[ItemsPerThread], storage_type &storage, unsigned int begin_bit = 0, unsigned int end_bit = 8 * sizeof(Key))#

Performs descending radix sort over key-value pairs partitioned across threads in a block, results are saved in a striped arrangement.

Storage reusage

Synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().

Examples

In the examples radix sort is performed on a block of 4 threads, each thread provides two key-value int-float pairs, results are returned using the same arrays as for input.

__global__ void example_kernel(...)
{
    // specialize block_radix_sort for int-float pairs, block of 4
    // threads, and two items per thread
    using block_rsort_ii = rocprim::block_radix_sort<int, 4, 2, int>;
    // allocate storage in shared memory
    __shared__ block_rsort_ii::storage_type storage;

    int keys[2] = ...;
    float values[2] = ...;
    // execute block radix sort-by-key (descending)
    block_rsort_ii().sort_desc_to_striped(
        keys, values,
        storage
    );
    ...
}

If the keys across threads in a block are {[1, 2], [3, 4], [5, 6], [7, 8]} and the values are {[80, 70], [60, 50], [40, 30], [20, 10]}, then after sort the keys will be equal {[8, 4], [7, 3], [6, 2], [5, 1]} and the values will be equal {[10, 50], [20, 60], [30, 70], [40, 80]}.

Parameters:
  • keys[inout] - reference to an array of keys provided by a thread.

  • values[inout] - reference to an array of values provided by a thread.

  • storage[in] - reference to a temporary storage object of type storage_type.

  • begin_bit[in] - [optional] index of the first (least significant) bit used in key comparison. Must be in range [0; 8 * sizeof(Key)). Default value: 0.

  • end_bit[in] - [optional] past-the-end index (most significant) bit used in key comparison. Must be in range (begin_bit; 8 * sizeof(Key)]. Default value: * sizeof(Key).

Pre:

Method is enabled only if Value type is different than empty_type.

template<bool WithValues = with_values>
__device__ inline void sort_desc_to_striped(Key (&keys)[ItemsPerThread], typename std::enable_if<WithValues, Value>::type (&values)[ItemsPerThread], unsigned int begin_bit = 0, unsigned int end_bit = 8 * sizeof(Key))#

Performs descending radix sort over key-value pairs partitioned across threads in a block, results are saved in a striped arrangement.

This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

  • This overload does not accept storage argument. Required shared memory is allocated by the method itself.

Parameters:
  • keys[inout] - reference to an array of keys provided by a thread.

  • values[inout] - reference to an array of values provided by a thread.

  • begin_bit[in] - [optional] index of the first (least significant) bit used in key comparison. Must be in range [0; 8 * sizeof(Key)). Default value: 0.

  • end_bit[in] - [optional] past-the-end index (most significant) bit used in key comparison. Must be in range (begin_bit; 8 * sizeof(Key)]. Default value: * sizeof(Key).