Intrinsics#

Bitwise#

__device__ inline int rocprim::get_bit(int x, int i)#

Returns a single bit at ‘i’ from ‘x’.

__device__ inline unsigned int rocprim::bit_count(unsigned int x)#

Bit count.

Returns the number of bit of x set.

__device__ inline unsigned int rocprim::bit_count(unsigned long long x)#

Bit count.

Returns the number of bit of x set.

Warp size#

__host__ __device__ inline constexpr unsigned int rocprim::warp_size()#

[DEPRECATED] Returns a number of threads in a hardware warp.

It is constant for a device. This function is not supported for the gfx1030 architecture and will be removed in a future release. Please use the new host_warp_size() and device_warp_size() functions.

__host__ inline unsigned int rocprim::host_warp_size()#

Returns a number of threads in a hardware warp for the actual device. At host side this constant is available at runtime time only.

It is constant for a device.

__device__ inline constexpr unsigned int rocprim::device_warp_size()#

Returns a number of threads in a hardware warp for the actual target. At device side this constant is available at compile time.

It is constant for a device.

Lane and Warp ID#

__device__ inline unsigned int warp_id()#

Returns warp id in a block (tile).

Returns warp id in a block (tile). Use template parameters to optimize 1D or 2D kernels.

Flat ID#

__device__ inline unsigned int flat_block_thread_id()#

Returns flat (linear, 1D) thread identifier in a multidimensional block (tile).

Returns flat (linear, 1D) thread identifier in a multidimensional block (tile). Use template parameters to optimize 1D or 2D kernels.

__device__ inline unsigned int flat_block_id()#

Returns flat (linear, 1D) block identifier in a multidimensional grid.

Flat Size#

__device__ inline unsigned int rocprim::flat_block_size()#

Returns flat size of a multidimensional block (tile).

__device__ inline unsigned int rocprim::flat_tile_size()#

Returns flat size of a multidimensional tile (block).

Synchronization#

__device__ inline void rocprim::syncthreads()#

Synchronize all threads in a block (tile)

__device__ inline void rocprim::wave_barrier()#

Synchronize all threads in the wavefront.

Wait for all threads in the wavefront before continuing execution. Memory ordering is guaranteed by this function between threads in the same wavefront. Threads can communicate by storing to global / shared memory, executing wave_barrier() then reading values stored by other threads in the same wavefront.

wave_barrier() should be executed by all threads in the wavefront in convergence, this means that if the function is executed in a conditional statement all threads in the wave must enter the conditional statement.

Note

On SIMT architectures all lanes come to a convergence point simultaneously, thus no special instruction is needed in the ISA.

Active threads#

__device__ inline lane_mask_type rocprim::ballot(int predicate)#

Evaluate predicate for all active work-items in the warp and return an integer whose i-th bit is set if and only if predicate is true for the i-th thread of the warp and the i-th thread is active.

Parameters:

predicate – - input to be evaluated for all active lanes

__device__ inline unsigned int rocprim::masked_bit_count(lane_mask_type x, unsigned int add = 0)#

Masked bit count.

For each thread, this function returns the number of active threads which have i-th bit of x set and come before the current thread.