Device memory allocation in rocBLAS#
rocBLAS uses per-handle device memory allocation to manage temporary memory efficiently. Each handle maintains its own memory and executes kernels sequentially in a single stream, allowing memory reuse across kernels.
There are two memory allocation schemes:
rocBLAS_managed (default): rocBLAS manages memory internally and allocates more if needed. Allocated memory persists with the handle for reuse.
user_owned: Users allocate memory and provide it to rocBLAS via rocblas_set_workspace.
rocBLAS_managed is the default scheme. This scheme uses hipMallocAsync and hipFreeAsync (stream-order allocation) to allocate and free memory in stream order, avoiding global synchronization. This enables seamless stream switching without needing hipStreamSynchronize().
The following computational functions use temporary device memory.
Function | Use of temporary device memory |
---|---|
L1 reduction functions | Reduction array |
L2 functions | Result array before overwriting input. Column reductions of skinny transposed matrices applicable for |
L3 GEMM-based functions | Block of matrix |
Memory allocation functions#
rocBLAS includes functions for manually setting the memory size and determining the memory requirements.
Function for setting a user-owned workspace#
rocblas_set_workspace
Functions for determining memory requirements#
rocblas_start_device_memory_size_query
rocblas_stop_device_memory_size_query
rocblas_is_managing_device_memory
See the API section for information about these functions.
rocBLAS function return values for insufficient device memory#
If the user manually allocates memory (user_owned scheme) and provides it using rocblas_set_workspace(rocblas_handle handle, void* addr, size_t size), that size is used as the limit and no resizing or synchronizing ever occurs.
The following two function return values indicate insufficient memory:
rocblas_status == rocblas_status_memory_error: indicates there is insufficient device memory for a rocBLAS function.
rocblas_status == rocblas_status_perf_degraded: indicates that a slower algorithm was used because of insufficient device memory for the optimal algorithm.
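The two return values call for different reactions: one means the call could not run, the other means it ran with a slower algorithm. The sketch below shows one possible way to separate them. The `memory_outcome` enum and the `classify_status` and `advice` helpers are hypothetical names introduced for illustration and are not part of rocBLAS. The fallback `enum` exists only so the snippet compiles on a machine without ROCm, and its numeric values are placeholders.

```cpp
#if __has_include(<rocblas/rocblas.h>)
#include <rocblas/rocblas.h>
#else
// Stand-in so this sketch compiles without ROCm installed. Only the three
// status names used below are declared; the values are placeholders.
enum rocblas_status
{
    rocblas_status_success,
    rocblas_status_memory_error,
    rocblas_status_perf_degraded
};
#endif

// Hypothetical classification of a rocBLAS return value (not a rocBLAS type).
enum class memory_outcome
{
    ok,           // enough device memory for the optimal algorithm
    degraded,     // a slower algorithm was used
    insufficient, // not enough device memory for the function
    other_error   // a status unrelated to device memory
};

memory_outcome classify_status(rocblas_status status)
{
    switch(status)
    {
    case rocblas_status_success:
        return memory_outcome::ok;
    case rocblas_status_perf_degraded:
        return memory_outcome::degraded;
    case rocblas_status_memory_error:
        return memory_outcome::insufficient;
    default:
        return memory_outcome::other_error;
    }
}

// Short, human-readable suggestion for each outcome.
const char* advice(memory_outcome outcome)
{
    switch(outcome)
    {
    case memory_outcome::ok:
        return "no action needed";
    case memory_outcome::degraded:
        return "consider a larger workspace for better performance";
    case memory_outcome::insufficient:
        return "provide a larger workspace and retry";
    default:
        return "inspect the status; it is not a device memory issue";
    }
}
```

An application using a user-owned workspace could pass the status of each rocBLAS call through `classify_status` and log `advice(...)` whenever the outcome is not `ok`, since in that scheme the workspace is never resized automatically.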
Switching streams without synchronization#
Stream-order memory allocation lets the application switch streams without having to call hipStreamSynchronize().