Memcpy#

Configuring the kernel#

template<unsigned int NonBlevBlockSize = 256, unsigned int NonBlevBuffersPerThreaed = 2, unsigned int TlevBytesPerThread = 8, unsigned int BlevBlockSize = 128, unsigned int BlevBytesPerThread = 32, unsigned int WlevSizeThreshold = 128, unsigned int BlevSizeThreshold = 1024>
struct batch_memcpy_config#
Template Parameters
  • NonBlevBlockSize – - number of threads per block for thread- and warp-level copy.

  • NonBlevBuffersPerThreaed – - number of buffers processed per thread.

  • TlevBytesPerThread – - number of bytes per thread for thread-level copy.

  • BlevBlockSize – - number of thread per block for block-level copy.

  • BlevBytesPerThread – - number of bytes per thread for block-level copy.

  • WlevSizeThreshold – - minimum size to use warp-level copy instead of thread-level.

  • BlevSizeThreshold – - minimum size to use block-level copy instead of warp-level.

batch_memcpy#

template<class Config_ = default_config, class InputBufferItType, class OutputBufferItType, class BufferSizeItType>
static inline hipError_t rocprim::batch_memcpy(void *temporary_storage, size_t &storage_size, InputBufferItType sources, OutputBufferItType destinations, BufferSizeItType sizes, uint32_t num_copies, hipStream_t stream = hipStreamDefault, bool debug_synchronous = false)#

Copy sizes[i] bytes from sources[i] to destinations[i] for all i in the range [0, num_copies].

Performs multiple device to device memory copies as a single batched operation. Roughly equivalent to

for (auto i = 0; i < num_copies; ++i) {
    char* src = sources[i];
    char* dst = destinations[i];
    auto size = sizes[i];
    hipMemcpyAsync(dst, src, size, hipMemcpyDeviceToDevice, stream);
}
except executed on the device in parallel. Note that sources and destinations do not have to be part of the same array. I.e. you can copy from both array A and B to array C and D with a single call to this function. Source ranges are allowed to overlap, however, destinations overlapping with either other destinations or with sources is not allowed, and will result in undefined behaviour.

Example

In this example multiple sections of data are copied from a to b .

#include <rocprim/rocprim.hpp

// Device allocated data:
int* a;             // e.g, [9, 1, 2, 3, 4, 5, 6, 7, 8]
int* b;             // e.g, [0, 0, 0, 0, 0, 0, 0, 0, 0]

// Batch memcpy parameters:
int   num_copies;   // Number of buffers to copy.
                     // e.g, 4.
int** sources;       // Pointer to source pointers.
                     // e.g, [&a[0], &a[4] &a[7]]
int** destinations;  // Pointer to destination pointers.
                     // e.g, [&b[5], &b[2] &b[0]]
int*  sizes;         // Size of buffers to copy.
                     // e.g., [3 * sizeof(int), 2 * sizeof(int), 2 * sizeof(int)]

// Calculate the required temporary storage.
size_t temporary_storage_size_bytes;
void* temporary_storage_ptr = nullptr;
rocprim::batch_memcpy(
    temporary_storage_ptr,
    temporary_storage_size_bytes,
    sources,
    destinations,
    sizes,
    num_buffers);

// Allocate temporary storage.
hipMalloc(&temporary_storage_ptr, temporary_storage_size_bytes);

// Copy buffers.
rocprim::batch_memcpy(
    temporary_storage_ptr,
    temporary_storage_size_bytes,
    sources,
    destinations,
    sizes,
    num_copies);

// b is now: [7, 8, 4, 5, 0, 9, 1, 2, 0]
//   3rd copy ^--^  ^--^     ^--^--^ 1st copy
//                2nd copy

Template Parameters
  • Config – [optional] configuration of the primitive. It has to be batch_memcpy_config .

  • InputBufferItType – type of iterator to source pointers.

  • OutputBufferItType – type of iterator to desetination pointers.

  • BufferSizeItType – type of iterator to sizes.

Parameters
  • temporary_storage[in] pointer to device-accessible temporary storage. When a null pointer is passed, the required allocation size in bytes is written to storage_size and the function returns without performing the copy.

  • storage_size[inout] reference to the size in bytes of temporary_storage.

  • sources[in] iterator of source pointers.

  • destinations[in] iterator of destination pointers.

  • sizes[in] iterator of range sizes to copy.

  • num_copies[in] number of ranges to copy

  • stream[in] [optional] HIP stream object to enqueue the copy on. Default is hipStreamDefault.

  • debug_synchronous[in] - [optional] If true, synchronization after every kernel launch is forced in order to check for errors. The default value is false.