Memcpy#
Configuring the kernel#
-
template<unsigned int NonBlevBlockSize = 256, unsigned int NonBlevBuffersPerThreaed = 2, unsigned int TlevBytesPerThread = 8, unsigned int BlevBlockSize = 128, unsigned int BlevBytesPerThread = 32, unsigned int WlevSizeThreshold = 128, unsigned int BlevSizeThreshold = 1024>
struct batch_memcpy_config# - Template Parameters:
NonBlevBlockSize – - number of threads per block for thread- and warp-level copy.
NonBlevBuffersPerThreaed – - number of buffers processed per thread.
TlevBytesPerThread – - number of bytes per thread for thread-level copy.
BlevBlockSize – - number of thread per block for block-level copy.
BlevBytesPerThread – - number of bytes per thread for block-level copy.
WlevSizeThreshold – - minimum size to use warp-level copy instead of thread-level.
BlevSizeThreshold – - minimum size to use block-level copy instead of warp-level.
batch_memcpy#
-
template<class Config_ = default_config, class InputBufferItType, class OutputBufferItType, class BufferSizeItType>
static inline hipError_t rocprim::batch_memcpy(void *temporary_storage, size_t &storage_size, InputBufferItType sources, OutputBufferItType destinations, BufferSizeItType sizes, uint32_t num_copies, hipStream_t stream = hipStreamDefault, bool debug_synchronous = false)# Copy
sizes[i]
bytes fromsources[i]
todestinations[i]
for alli
in the range [0,num_copies
].Performs multiple device to device memory copies as a single batched operation. Roughly equivalent to
except executed on the device in parallel. Note that sources and destinations do not have to be part of the same array. I.e. you can copy from both array A and B to array C and D with a single call to this function. Source ranges are allowed to overlap, however, destinations overlapping with either other destinations or with sources is not allowed, and will result in undefined behaviour.for (auto i = 0; i < num_copies; ++i) { char* src = sources[i]; char* dst = destinations[i]; auto size = sizes[i]; hipMemcpyAsync(dst, src, size, hipMemcpyDeviceToDevice, stream); }
- Example
In this example multiple sections of data are copied from
a
tob
.#include <rocprim/rocprim.hpp // Device allocated data: int* a; // e.g, [9, 1, 2, 3, 4, 5, 6, 7, 8] int* b; // e.g, [0, 0, 0, 0, 0, 0, 0, 0, 0] // Batch memcpy parameters: int num_copies; // Number of buffers to copy. // e.g, 4. int** sources; // Pointer to source pointers. // e.g, [&a[0], &a[4] &a[7]] int** destinations; // Pointer to destination pointers. // e.g, [&b[5], &b[2] &b[0]] int* sizes; // Size of buffers to copy. // e.g., [3 * sizeof(int), 2 * sizeof(int), 2 * sizeof(int)] // Calculate the required temporary storage. size_t temporary_storage_size_bytes; void* temporary_storage_ptr = nullptr; rocprim::batch_memcpy( temporary_storage_ptr, temporary_storage_size_bytes, sources, destinations, sizes, num_buffers); // Allocate temporary storage. hipMalloc(&temporary_storage_ptr, temporary_storage_size_bytes); // Copy buffers. rocprim::batch_memcpy( temporary_storage_ptr, temporary_storage_size_bytes, sources, destinations, sizes, num_copies); // b is now: [7, 8, 4, 5, 0, 9, 1, 2, 0] // 3rd copy ^--^ ^--^ ^--^--^ 1st copy // 2nd copy
- Template Parameters:
Config – [optional] configuration of the primitive. It has to be
batch_memcpy_config
.InputBufferItType – type of iterator to source pointers.
OutputBufferItType – type of iterator to desetination pointers.
BufferSizeItType – type of iterator to sizes.
- Parameters:
temporary_storage – [in] pointer to device-accessible temporary storage. When a null pointer is passed, the required allocation size in bytes is written to
storage_size
and the function returns without performing the copy.storage_size – [inout] reference to the size in bytes of
temporary_storage
.sources – [in] iterator of source pointers.
destinations – [in] iterator of destination pointers.
sizes – [in] iterator of range sizes to copy.
num_copies – [in] number of ranges to copy
stream – [in] [optional] HIP stream object to enqueue the copy on. Default is
hipStreamDefault
.debug_synchronous – [in] - [optional] If true, synchronization after every kernel launch is forced in order to check for errors. The default value is
false
.