Device memory allocation in rocBLAS
For temporary device memory, rocBLAS uses per-handle memory allocation with out-of-band management. For more information, see the device memory allocation section of the Programmer's Guide.
The following computational functions use temporary device memory.
| Function | Use of temporary device memory |
|---|---|
| L1 reduction functions | Reduction array |
| L2 functions | Result array before overwriting input; column reductions of skinny transposed matrices applicable for |
| L3 GEMM-based functions | Block of matrix |
Environment variable for preallocating memory
The environment variable ROCBLAS_DEVICE_MEMORY_SIZE is used to set how much memory to preallocate:
If it is greater than 0, it sets the default handle device memory size to the specified size (in bytes).
If it is equal to 0 or unset, rocBLAS manages the device memory itself. It starts from a default size, such as 32 MiB or 128 MiB, and expands it when necessary.
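The two cases above can be restated as a small decision function. This is only an illustrative sketch of the documented behavior, not rocBLAS source code: the names `PreallocPolicy` and `interpret_device_memory_size` are hypothetical, and treating a non-numeric value as 0 is an assumption.

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical summary of how a ROCBLAS_DEVICE_MEMORY_SIZE value is interpreted.
struct PreallocPolicy
{
    bool        rocblas_managed; // true: rocBLAS picks a default size and grows it when needed
    std::size_t bytes;           // preallocated size in bytes when rocblas_managed is false
};

// `value` is the raw environment string, or nullptr when the variable is unset.
inline PreallocPolicy interpret_device_memory_size(const char* value)
{
    // Assumption: a string that does not parse as a number counts as 0.
    unsigned long long n = value ? std::strtoull(value, nullptr, 10) : 0ULL;
    if(n > 0)
        return {false, static_cast<std::size_t>(n)}; // fixed size requested by the user
    return {true, 0};                                 // 0 or unset: rocBLAS manages memory
}
```

A caller could pass `std::getenv("ROCBLAS_DEVICE_MEMORY_SIZE")` to see which of the two cases applies in the current environment.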
Memory allocation functions
rocBLAS includes functions for manually setting the memory size and determining the memory requirements.
Functions for manually setting the memory size
rocblas_set_device_memory_size
rocblas_get_device_memory_size
rocblas_is_user_managing_device_memory
Function for setting a user-owned workspace
rocblas_set_workspace
Functions for determining memory requirements
rocblas_start_device_memory_size_query
rocblas_stop_device_memory_size_query
rocblas_is_managing_device_memory
See the API section for information about these functions.
rocBLAS function return values for insufficient device memory
If the user preallocates or manually allocates, that size is used as the limit and no resizing or synchronizing ever occurs. The following two function return values indicate insufficient memory:
rocblas_status == rocblas_status_memory_error: indicates there is insufficient device memory for a rocBLAS function.
rocblas_status == rocblas_status_perf_degraded: indicates that a slower algorithm was used because of insufficient device memory for the optimal algorithm.
Stream-ordered memory allocation
rocBLAS supports stream-ordered device memory allocation. The asynchronous allocators
hipMallocAsync() and hipFreeAsync() are used so that allocation and deallocation happen in stream order.
This is a non-default beta option that can be enabled by setting the environment variable ROCBLAS_STREAM_ORDER_ALLOC.
To check whether the device supports stream-ordered allocation, call hipDeviceGetAttribute() with the
device attribute hipDeviceAttributeMemoryPoolsSupported.
Enabling stream-ordered memory allocation
On supported platforms, the environment variable ROCBLAS_STREAM_ORDER_ALLOC is used to enable stream-ordered memory allocation.
If it is greater than 0 (> 0), it sets the allocation to be stream-ordered and uses hipMallocAsync/hipFreeAsync to manage device memory.
If it is equal to 0 (= 0) or unset, it uses hipMalloc and hipFree to manage device memory.
Switching streams without synchronization
Stream-ordered memory allocation lets the application switch streams without having to call hipStreamSynchronize().