Device memory allocation in rocBLAS

For temporary device memory, rocBLAS uses a per-handle memory allocation with out-of-band management. For more information, see the device memory allocation section of the programmer's guide.

The following computational functions use temporary device memory.

Function: L1 reduction functions
Use of temporary device memory: Reduction array

rocblas_Xasum, rocblas_Xasum_batched, rocblas_Xasum_strided_batched
rocblas_Xdot, rocblas_Xdot_batched, rocblas_Xdot_strided_batched
rocblas_Xmax, rocblas_Xmax_batched, rocblas_Xmax_strided_batched
rocblas_Xmin, rocblas_Xmin_batched, rocblas_Xmin_strided_batched
rocblas_Xnrm2, rocblas_Xnrm2_batched, rocblas_Xnrm2_strided_batched
rocblas_dot_ex, rocblas_dot_batched_ex, rocblas_dot_strided_batched_ex
rocblas_nrm2_ex, rocblas_nrm2_batched_ex, rocblas_nrm2_strided_batched_ex

Function: L2 functions
Use of temporary device memory: Result array before overwriting input; column reductions of skinny transposed matrices applicable for gemv functions

rocblas_Xgemv (optional), rocblas_Xgemv_batched, rocblas_Xgemv_strided_batched
rocblas_Xtbmv, rocblas_Xtbmv_batched, rocblas_Xtbmv_strided_batched
rocblas_Xtpmv, rocblas_Xtpmv_batched, rocblas_Xtpmv_strided_batched
rocblas_Xtrmv, rocblas_Xtrmv_batched, rocblas_Xtrmv_strided_batched
rocblas_Xtrsv, rocblas_Xtrsv_batched, rocblas_Xtrsv_strided_batched
rocblas_Xhemv, rocblas_Xhemv_batched, rocblas_Xhemv_strided_batched
rocblas_Xsymv, rocblas_Xsymv_batched, rocblas_Xsymv_strided_batched
rocblas_Xtrsv_ex, rocblas_Xtrsv_batched_ex, rocblas_Xtrsv_strided_batched_ex

Function: L3 GEMM-based functions
Use of temporary device memory: Block of matrix

rocblas_Xtrsm, rocblas_Xtrsm_batched, rocblas_Xtrsm_strided_batched
rocblas_Xsymm, rocblas_Xsymm_batched, rocblas_Xsymm_strided_batched
rocblas_Xsyrk, rocblas_Xsyrk_batched, rocblas_Xsyrk_strided_batched
rocblas_Xsyr2k, rocblas_Xsyr2k_batched, rocblas_Xsyr2k_strided_batched
rocblas_Xsyrkx, rocblas_Xsyrkx_batched, rocblas_Xsyrkx_strided_batched
rocblas_Xtrmm, rocblas_Xtrmm_batched, rocblas_Xtrmm_strided_batched
rocblas_Xhemm, rocblas_Xhemm_batched, rocblas_Xhemm_strided_batched
rocblas_Xherk, rocblas_Xherk_batched, rocblas_Xherk_strided_batched
rocblas_Xher2k, rocblas_Xher2k_batched, rocblas_Xher2k_strided_batched
rocblas_Xherkx, rocblas_Xherkx_batched, rocblas_Xherkx_strided_batched
rocblas_Xgemm, rocblas_Xgemm_batched, rocblas_Xgemm_strided_batched
rocblas_gemm_ex, rocblas_gemm_ex_batched, rocblas_gemm_ex_strided_batched
rocblas_Xtrtri, rocblas_Xtrtri_batched, rocblas_Xtrtri_strided_batched

Environment variable for preallocating memory#

The environment variable ROCBLAS_DEVICE_MEMORY_SIZE sets how much memory to preallocate:

If it is greater than 0, it sets the default handle device memory size to the specified size (in bytes).
If it is equal to 0 or unset, rocBLAS manages the device memory itself. It starts from a default size (for example, 32 MiB or 128 MiB) and expands it when necessary.
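As a minimal sketch of the first case, an application could set the variable from its own process before using rocBLAS. The helper name request_rocblas_preallocation is hypothetical, the code relies on the POSIX setenv call, and it assumes the variable is read when a handle is created, so it would have to run before rocblas_create_handle.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>

// Hypothetical helper (not part of rocBLAS): request `mib` MiB of preallocated
// device memory by setting ROCBLAS_DEVICE_MEMORY_SIZE, which is given in bytes.
// Assumption: rocBLAS reads the variable at handle creation, so call this first.
inline bool request_rocblas_preallocation(std::size_t mib)
{
    const std::string bytes = std::to_string(mib * 1024 * 1024);
    // setenv is POSIX; the final argument 1 overwrites any existing value.
    return setenv("ROCBLAS_DEVICE_MEMORY_SIZE", bytes.c_str(), 1) == 0;
}
```

Setting the variable in the launching shell has the same effect and avoids any ordering concerns inside the program.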

Memory allocation functions#

rocBLAS includes functions for manually setting the memory size and determining the memory requirements.

Functions for manually setting the memory size#

rocblas_set_device_memory_size
rocblas_get_device_memory_size
rocblas_is_user_managing_device_memory

Function for setting a user-owned workspace#

rocblas_set_workspace

Functions for determining memory requirements#

rocblas_start_device_memory_size_query
rocblas_stop_device_memory_size_query
rocblas_is_managing_device_memory

See the API section for information about these functions.

rocBLAS function return values for insufficient device memory#

If the user preallocates memory or sets the memory size manually, that size is used as the limit, and rocBLAS never resizes the allocation or synchronizes. In that case, the following two return values indicate insufficient memory:

rocblas_status == rocblas_status_memory_error: there is insufficient device memory for a rocBLAS function.
rocblas_status == rocblas_status_perf_degraded: a slower algorithm was used because there was insufficient device memory for the optimal algorithm.
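The query functions and the two return values above can be combined as in the following sketch. It is illustrative only: the wrapper name solve_with_sized_workspace is hypothetical, rocblas_strsm stands in for any function from the table, the header path rocblas/rocblas.h is an assumption that depends on the ROCm version, and d_A and d_B are assumed to be device buffers the caller already allocated and filled.

```cpp
#include <cstddef>
#include <cstdio>
#include <rocblas/rocblas.h> // assumption: older ROCm releases use <rocblas.h>

// Hypothetical wrapper: size the handle's device memory for one strsm call,
// fix that size manually, then run the call and inspect its status.
rocblas_status solve_with_sized_workspace(rocblas_handle handle,
                                          rocblas_int m, rocblas_int n,
                                          const float* d_A, rocblas_int lda,
                                          float* d_B, rocblas_int ldb)
{
    const float alpha = 1.0f; // host scalar (default host pointer mode)

    // 1. Query mode: the call below is expected to record its memory
    //    requirement instead of computing, so its status is not checked here.
    rocblas_status st = rocblas_start_device_memory_size_query(handle);
    if(st != rocblas_status_success) return st;
    rocblas_strsm(handle, rocblas_side_left, rocblas_fill_lower,
                  rocblas_operation_none, rocblas_diagonal_non_unit,
                  m, n, &alpha, d_A, lda, d_B, ldb);
    size_t bytes = 0;
    st = rocblas_stop_device_memory_size_query(handle, &bytes);
    if(st != rocblas_status_success) return st;

    // 2. Set the size manually; from here on it acts as a hard limit.
    if(bytes > 0)
    {
        st = rocblas_set_device_memory_size(handle, bytes);
        if(st != rocblas_status_success) return st;
    }

    // 3. Real call: distinguish the two insufficient-memory outcomes.
    st = rocblas_strsm(handle, rocblas_side_left, rocblas_fill_lower,
                       rocblas_operation_none, rocblas_diagonal_non_unit,
                       m, n, &alpha, d_A, lda, d_B, ldb);
    if(st == rocblas_status_perf_degraded)
        std::fprintf(stderr, "strsm ran with a slower algorithm\n");
    else if(st == rocblas_status_memory_error)
        std::fprintf(stderr, "strsm: insufficient device memory\n");
    return st;
}
```

A caller would typically treat rocblas_status_perf_degraded as a warning, since the result is still computed, and rocblas_status_memory_error as a failure.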

Stream-ordered memory allocation#

rocBLAS supports stream-ordered device memory allocation. The asynchronous allocators hipMallocAsync() and hipFreeAsync() allow allocation and deallocation to happen in stream order.

This is a non-default beta option that can be enabled by setting the environment variable ROCBLAS_STREAM_ORDER_ALLOC.

To check whether the device supports stream-ordered allocation, call hipDeviceGetAttribute() with the device attribute hipDeviceAttributeMemoryPoolsSupported.
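The check described above might look like the following sketch. The function name device_supports_stream_ordered_alloc is hypothetical; hipDeviceGetAttribute() and hipDeviceAttributeMemoryPoolsSupported are the names given in the text.

```cpp
#include <hip/hip_runtime.h>

// Hypothetical helper: returns true when the given device reports support
// for memory pools, which stream-ordered allocation depends on.
bool device_supports_stream_ordered_alloc(int device_id)
{
    int supported = 0;
    const hipError_t err = hipDeviceGetAttribute(
        &supported, hipDeviceAttributeMemoryPoolsSupported, device_id);
    // Treat a failed query the same as "not supported".
    return err == hipSuccess && supported != 0;
}
```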

Enabling stream-ordered memory allocation#

On supported platforms, the environment variable ROCBLAS_STREAM_ORDER_ALLOC is used to enable stream-ordered memory allocation.

If it is greater than 0, allocation is stream-ordered and rocBLAS uses hipMallocAsync and hipFreeAsync to manage device memory.
If it is equal to 0 or unset, rocBLAS uses hipMalloc and hipFree to manage device memory.
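An application that wants to know which mode was requested, for example to log it at startup, could mirror the rule above on its own side. This is a sketch of the documented rule only, not of how rocBLAS parses the variable internally, and the function name stream_order_alloc_requested is hypothetical.

```cpp
#include <cstdlib>

// Hypothetical application-side check: true when ROCBLAS_STREAM_ORDER_ALLOC
// is set to a value greater than 0, false when it is 0, unset, or non-numeric.
inline bool stream_order_alloc_requested()
{
    const char* value = std::getenv("ROCBLAS_STREAM_ORDER_ALLOC");
    return value != nullptr && std::strtol(value, nullptr, 10) > 0;
}
```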

Switching streams without synchronization#

Stream-ordered memory allocation lets the application switch streams without having to call hipStreamSynchronize().
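As an illustration, the sketch below issues two rocblas_sgemm calls through one handle on two different streams with no synchronization in between. It assumes stream-ordered allocation has been enabled as described above, that the two calls work on independent output buffers, and that all pointers are device buffers holding n-by-n column-major matrices. The function name gemm_on_two_streams is hypothetical, and the header path rocblas/rocblas.h may differ between ROCm versions.

```cpp
#include <hip/hip_runtime.h>
#include <rocblas/rocblas.h> // assumption: older ROCm releases use <rocblas.h>

// Hypothetical helper: C1 = A * B on stream s1, then C2 = A * B on stream s2,
// switching the handle's stream without calling hipStreamSynchronize().
rocblas_status gemm_on_two_streams(rocblas_handle handle,
                                   hipStream_t s1, hipStream_t s2,
                                   rocblas_int n,
                                   const float* d_A, const float* d_B,
                                   float* d_C1, float* d_C2)
{
    const float alpha = 1.0f;
    const float beta  = 0.0f;

    rocblas_status st = rocblas_set_stream(handle, s1);
    if(st != rocblas_status_success) return st;
    st = rocblas_sgemm(handle, rocblas_operation_none, rocblas_operation_none,
                       n, n, n, &alpha, d_A, n, d_B, n, &beta, d_C1, n);
    if(st != rocblas_status_success) return st;

    // No hipStreamSynchronize(s1) here: with stream-ordered allocation the
    // temporary device memory is allocated and freed in stream order.
    st = rocblas_set_stream(handle, s2);
    if(st != rocblas_status_success) return st;
    return rocblas_sgemm(handle, rocblas_operation_none, rocblas_operation_none,
                         n, n, n, &alpha, d_A, n, d_B, n, &beta, d_C2, n);
}
```

The caller still has to synchronize both streams, or use events, before reading d_C1 and d_C2 on the host.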