Device memory allocation in rocBLAS#
rocBLAS uses per-handle device memory allocation to manage temporary memory efficiently. Each handle maintains its own memory and executes kernels sequentially in a single stream, allowing memory reuse across kernels.
There are two memory allocation schemes:
rocBLAS_managed (default): rocBLAS manages memory internally and allocates more when needed. Allocated memory persists with the handle for reuse.
user_owned: Users allocate memory and provide it to rocBLAS via rocblas_set_workspace.
The rocBLAS_managed scheme uses hipMallocAsync and hipFreeAsync to allocate and free memory in stream order, which avoids global synchronization. As a result, the application can switch streams without calling hipStreamSynchronize().
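The grow-and-reuse behavior described for the rocBLAS_managed scheme can be pictured with a small host-side model. This is only an illustrative sketch: `ToyHandle` and its members are hypothetical names, the buffer lives in host memory rather than device memory, and nothing here reflects the actual rocBLAS implementation or its use of hipMallocAsync.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a handle that owns a growable workspace.
// Real rocBLAS keeps this memory on the device; a std::vector is used
// here only so the sketch runs without a GPU.
class ToyHandle {
public:
    // Return a workspace of at least `bytes` bytes, growing only when
    // the current workspace is too small.
    unsigned char* acquire(std::size_t bytes) {
        if (bytes > workspace_.size()) {
            workspace_.resize(bytes); // the only place where allocation happens
            ++allocations_;
        }
        assert(workspace_.size() >= bytes); // the request always fits afterwards
        return workspace_.data();
    }

    // Current workspace size in bytes; it persists between calls.
    std::size_t capacity() const { return workspace_.size(); }

    // Number of times the workspace had to grow.
    int allocations() const { return allocations_; }

private:
    std::vector<unsigned char> workspace_;
    int allocations_ = 0;
};
```

In this model, a request for 1024 bytes followed by a request for 256 bytes triggers a single allocation, because the second request fits in the memory that already persists with the handle. A later request for 4096 bytes grows the workspace once more.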
The following computational functions use temporary device memory.
| Function | Use of temporary device memory |
|---|---|
| L1 reduction functions | Reduction array |
| L2 functions | Result array before overwriting input; column reductions of skinny transposed matrices (where applicable) |
| L3 GEMM-based functions | Block of matrix |
Memory allocation functions#
rocBLAS includes functions for manually setting the memory size and determining the memory requirements.
Function for setting a user-owned workspace#
rocblas_set_workspace
Functions for determining memory requirements#
rocblas_start_device_memory_size_query
rocblas_stop_device_memory_size_query
rocblas_is_managing_device_memory
See the API section for information about these functions.
rocBLAS function return values for insufficient device memory#
If the user allocates the workspace manually (user_owned scheme) by calling rocblas_set_workspace(rocblas_handle handle, void* addr, size_t size), rocBLAS uses that size as the limit and never resizes the workspace or synchronizes.
The following two function return values indicate insufficient memory:
rocblas_status == rocblas_status_memory_error: indicates there is insufficient device memory for a rocBLAS function.
rocblas_status == rocblas_status_perf_degraded: indicates that a slower algorithm was used because of insufficient device memory for the optimal algorithm.
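The relationship between a fixed workspace size and these two return values can be sketched with a small model. The names below (`ToyStatus`, `ToyRequirement`, `classify`) are hypothetical and are not part of the rocBLAS API. The sketch also assumes that an operation has exactly one optimal algorithm and one slower fallback with a smaller memory requirement, which is a simplification made only for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for the statuses named above; this is NOT the
// rocblas_status enum.
enum class ToyStatus { success, perf_degraded, memory_error };

// Hypothetical description of what one operation needs.
struct ToyRequirement {
    std::size_t optimal_bytes;  // workspace needed by the fastest algorithm
    std::size_t fallback_bytes; // workspace needed by a slower algorithm
};

// Decide the outcome for a workspace whose size is a hard limit, as in
// the user-owned scheme where no resizing occurs.
ToyStatus classify(const ToyRequirement& need, std::size_t workspace_bytes) {
    assert(need.fallback_bytes <= need.optimal_bytes);
    if (workspace_bytes >= need.optimal_bytes) {
        return ToyStatus::success;       // the optimal algorithm fits
    }
    if (workspace_bytes >= need.fallback_bytes) {
        return ToyStatus::perf_degraded; // only the slower algorithm fits
    }
    return ToyStatus::memory_error;      // nothing fits in the workspace
}
```

In application code, a result that corresponds to the degraded case still means the function ran, so it is usually handled as a hint to enlarge the workspace rather than as a failure.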
Switching streams without synchronization#
Stream-order memory allocation lets the application switch streams without having to call hipStreamSynchronize().