Memcpy#

Configuring the kernel#

template<unsigned int NonBlevBlockSize, unsigned int NonBlevItemsPerThread, unsigned int TlevItemsPerThread, unsigned int BlevBlockSize, unsigned int BlevItemsPerThread, unsigned int WlevSizeThreshold, unsigned int BlevSizeThreshold, unsigned int SizeLimit = std::numeric_limits<unsigned int>::max()>
struct batch_memcpy_config : public rocprim::detail::batch_memcpy_config_params#

Configuration of device-level batch memcopy primitives.

Template Parameters:
  • BlockSize – - number of threads in a block.

  • ItemsPerThread – - number of items processed by each thread.

  • BlockLoadMethod – - method for loading input values.

  • BlockStoreMethod – - method for storing values.

  • SizeLimit – - limit on the number of items for a single adjacent difference kernel launch.

Subclassed by rocprim::batch_copy_config< 256, 2, 8, 128, 32, 128, 1024 >, rocprim::detail::default_batch_copy_config< arch, value_type, enable >, rocprim::detail::default_batch_memcpy_config< arch, value_type, enable >

batch_memcpy#

template<class Config_ = default_config, class InputBufferItType, class OutputBufferItType, class BufferSizeItType>
static inline hipError_t rocprim::batch_memcpy(void *temporary_storage, size_t &storage_size, InputBufferItType sources, OutputBufferItType destinations, BufferSizeItType sizes, uint32_t num_copies, hipStream_t stream = hipStreamDefault, bool debug_synchronous = false)#

Copy sizes[i] bytes from sources[i] to destinations[i] for all i in the range [0, num_copies].

Performs multiple device to device memory copies as a single batched operation. Roughly equivalent to

for (auto i = 0; i < num_copies; ++i) {
    char* src = sources[i];
    char* dst = destinations[i];
    auto size = sizes[i];
    hipMemcpyAsync(dst, src, size, hipMemcpyDeviceToDevice, stream);
}
except executed on the device in parallel. Note that sources and destinations do not have to be part of the same array. I.e. you can copy from both array A and B to array C and D with a single call to this function. Source ranges are allowed to overlap, however, destinations overlapping with either other destinations or with sources is not allowed, and will result in undefined behaviour.

Example

In this example multiple sections of data are copied from a to b .

#include <rocprim/rocprim.hpp

// Device allocated data:
int* a;             // e.g, [9, 1, 2, 3, 4, 5, 6, 7, 8]
int* b;             // e.g, [0, 0, 0, 0, 0, 0, 0, 0, 0]

// Batch memcpy parameters:
int   num_copies;   // Number of buffers to copy.
                     // e.g, 4.
int** sources;       // Pointer to source pointers.
                     // e.g, [&a[0], &a[4] &a[7]]
int** destinations;  // Pointer to destination pointers.
                     // e.g, [&b[5], &b[2] &b[0]]
int*  sizes;         // Size of buffers to copy.
                     // e.g., [3 * sizeof(int), 2 * sizeof(int), 2 * sizeof(int)]

// Calculate the required temporary storage.
size_t temporary_storage_size_bytes;
void* temporary_storage_ptr = nullptr;
rocprim::batch_memcpy(
    temporary_storage_ptr,
    temporary_storage_size_bytes,
    sources,
    destinations,
    sizes,
    num_buffers);

// Allocate temporary storage.
hipMalloc(&temporary_storage_ptr, temporary_storage_size_bytes);

// Copy buffers.
rocprim::batch_memcpy(
    temporary_storage_ptr,
    temporary_storage_size_bytes,
    sources,
    destinations,
    sizes,
    num_copies);

// b is now: [7, 8, 4, 5, 0, 9, 1, 2, 0]
//   3rd copy ^--^  ^--^     ^--^--^ 1st copy
//                2nd copy

Template Parameters:
  • Config – [optional] Configuration of the primitive, must be default_config or batch_memcpy_config.

  • InputBufferItType – type of iterator to source pointers.

  • OutputBufferItType – type of iterator to desetination pointers.

  • BufferSizeItType – type of iterator to sizes.

Parameters:
  • temporary_storage[in] pointer to device-accessible temporary storage. When a null pointer is passed, the required allocation size in bytes is written to storage_size and the function returns without performing the copy.

  • storage_size[inout] reference to the size in bytes of temporary_storage.

  • sources[in] iterator of source pointers.

  • destinations[in] iterator of destination pointers.

  • sizes[in] iterator of range sizes to copy.

  • num_copies[in] number of ranges to copy

  • stream[in] [optional] HIP stream object to enqueue the copy on. Default is hipStreamDefault.

  • debug_synchronous[in] [optional] If true, synchronization after every kernel launch is forced in order to check for errors. The default value is false.