rocBLAS Beta Features#
To allow for future growth and changes, the features in this section are not subject to the same level of backwards compatibility and support as the normal rocBLAS API. These features are subject to change and/or removal in future release of rocBLAS.
To use the following beta API features ROCBLAS_BETA_FEATURES_API
must be defined before including rocblas.h
.
rocblas_gemm_ex_get_solutions + batched, strided_batched#

rocblas_status rocblas_gemm_ex_get_solutions(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, rocblas_int k, const void *alpha, const void *a, rocblas_datatype a_type, rocblas_int lda, const void *b, rocblas_datatype b_type, rocblas_int ldb, const void *beta, const void *c, rocblas_datatype c_type, rocblas_int ldc, void *d, rocblas_datatype d_type, rocblas_int ldd, rocblas_datatype compute_type, rocblas_gemm_algo algo, uint32_t flags, rocblas_int *list_array, rocblas_int *list_size)
BLAS BETA API
gemm_ex_get_solutions gets the indices for all the solutions that can solve a corresponding call to gemm_ex. Which solution is used by gemm_ex is controlled by the solution_index parameter.
All parameters correspond to gemm_ex except for list_array and list_size, which are used as input and output for getting the solution indices. If list_array is NULL, list_size is an output and will be filled with the number of solutions that can solve the GEMM. If list_array is not NULL, then it must be pointing to an array with at least list_size elements and will be filled with the solution indices that can solve the GEMM: the number of elements filled is min(list_size, # of solutions).
 Parameters:
handle – [in] [rocblas_handle] handle to the rocblas library context queue.
transA – [in] [rocblas_operation] specifies the form of op( A ).
transB – [in] [rocblas_operation] specifies the form of op( B ).
m – [in] [rocblas_int] matrix dimension m.
n – [in] [rocblas_int] matrix dimension n.
k – [in] [rocblas_int] matrix dimension k.
alpha – [in] [const void *] device pointer or host pointer specifying the scalar alpha. Same datatype as compute_type.
a – [in] [void *] device pointer storing matrix A.
a_type – [in] [rocblas_datatype] specifies the datatype of matrix A.
lda – [in] [rocblas_int] specifies the leading dimension of A.
b – [in] [void *] device pointer storing matrix B.
b_type – [in] [rocblas_datatype] specifies the datatype of matrix B.
ldb – [in] [rocblas_int] specifies the leading dimension of B.
beta – [in] [const void *] device pointer or host pointer specifying the scalar beta. Same datatype as compute_type.
c – [in] [void *] device pointer storing matrix C.
c_type – [in] [rocblas_datatype] specifies the datatype of matrix C.
ldc – [in] [rocblas_int] specifies the leading dimension of C.
d – [out] [void *] device pointer storing matrix D. If d and c pointers are to the same matrix then d_type must equal c_type and ldd must equal ldc or the respective invalid status will be returned.
d_type – [in] [rocblas_datatype] specifies the datatype of matrix D.
ldd – [in] [rocblas_int] specifies the leading dimension of D.
compute_type – [in] [rocblas_datatype] specifies the datatype of computation.
algo – [in] [rocblas_gemm_algo] enumerant specifying the algorithm type.
flags – [in] [uint32_t] optional gemm flags.
list_array – [out] [rocblas_int *] output array for solution indices or NULL if getting number of solutions
list_size – [inout] [rocblas_int *] size of list_array if getting solution indices or output with number of solutions if list_array is NULL

rocblas_status rocblas_gemm_ex_get_solutions_by_type(rocblas_handle handle, rocblas_datatype input_type, rocblas_datatype output_type, rocblas_datatype compute_type, uint32_t flags, rocblas_int *list_array, rocblas_int *list_size)
BLAS BETA API
rocblas_gemm_ex_get_solutions_by_type gets the indices for all the solutions that match the given types for gemm_ex. Which solution is used by gemm_ex is controlled by the solution_index parameter.
If list_array is NULL, list_size is an output and will be filled with the number of solutions that can solve the GEMM. If list_array is not NULL, then it must be pointing to an array with at least list_size elements and will be filled with the solution indices that can solve the GEMM: the number of elements filled is min(list_size, # of solutions).
 Parameters:
handle – [in] [rocblas_handle] handle to the rocblas library context queue.
input_type – [in] [rocblas_datatype] specifies the datatype of matrix A.
output_type – [in] [rocblas_datatype] specifies the datatype of matrix D.
compute_type – [in] [rocblas_datatype] specifies the datatype of computation.
flags – [in] [uint32_t] optional gemm flags.
list_array – [out] [rocblas_int *] output array for solution indices or NULL if getting number of solutions
list_size – [inout] [rocblas_int *] size of list_array if getting solution indices or output with number of solutions if list_array is NULL

rocblas_status rocblas_gemm_batched_ex_get_solutions(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, rocblas_int k, const void *alpha, const void *a, rocblas_datatype a_type, rocblas_int lda, const void *b, rocblas_datatype b_type, rocblas_int ldb, const void *beta, const void *c, rocblas_datatype c_type, rocblas_int ldc, void *d, rocblas_datatype d_type, rocblas_int ldd, rocblas_int batch_count, rocblas_datatype compute_type, rocblas_gemm_algo algo, uint32_t flags, rocblas_int *list_array, rocblas_int *list_size)
BLAS BETA API
rocblas_gemm_batched_ex_get_solutions gets the indices for all the solutions that can solve a corresponding call to gemm_batched_ex. Which solution is used by gemm_batched_ex is controlled by the solution_index parameter.
All parameters correspond to gemm_batched_ex except for list_array and list_size, which are used as input and output for getting the solution indices. If list_array is NULL, list_size is an output and will be filled with the number of solutions that can solve the GEMM. If list_array is not NULL, then it must be pointing to an array with at least list_size elements and will be filled with the solution indices that can solve the GEMM: the number of elements filled is min(list_size, # of solutions).
 Parameters:
handle – [in] [rocblas_handle] handle to the rocblas library context queue.
transA – [in] [rocblas_operation] specifies the form of op( A ).
transB – [in] [rocblas_operation] specifies the form of op( B ).
m – [in] [rocblas_int] matrix dimension m.
n – [in] [rocblas_int] matrix dimension n.
k – [in] [rocblas_int] matrix dimension k.
alpha – [in] [const void *] device pointer or host pointer specifying the scalar alpha. Same datatype as compute_type.
a – [in] [void *] device pointer storing array of pointers to each matrix A_i.
a_type – [in] [rocblas_datatype] specifies the datatype of each matrix A_i.
lda – [in] [rocblas_int] specifies the leading dimension of each A_i.
b – [in] [void *] device pointer storing array of pointers to each matrix B_i.
b_type – [in] [rocblas_datatype] specifies the datatype of each matrix B_i.
ldb – [in] [rocblas_int] specifies the leading dimension of each B_i.
beta – [in] [const void *] device pointer or host pointer specifying the scalar beta. Same datatype as compute_type.
c – [in] [void *] device array of device pointers to each matrix C_i.
c_type – [in] [rocblas_datatype] specifies the datatype of each matrix C_i.
ldc – [in] [rocblas_int] specifies the leading dimension of each C_i.
d – [out] [void *] device array of device pointers to each matrix D_i. If d and c are the same array of matrix pointers then d_type must equal c_type and ldd must equal ldc or the respective invalid status will be returned.
d_type – [in] [rocblas_datatype] specifies the datatype of each matrix D_i.
ldd – [in] [rocblas_int] specifies the leading dimension of each D_i.
batch_count – [in] [rocblas_int] number of gemm operations in the batch.
compute_type – [in] [rocblas_datatype] specifies the datatype of computation.
algo – [in] [rocblas_gemm_algo] enumerant specifying the algorithm type.
flags – [in] [uint32_t] optional gemm flags.
list_array – [out] [rocblas_int *] output array for solution indices or NULL if getting number of solutions
list_size – [inout] [rocblas_int *] size of list_array if getting solution indices or output with number of solutions if list_array is NULL

rocblas_status rocblas_gemm_batched_ex_get_solutions_by_type(rocblas_handle handle, rocblas_datatype input_type, rocblas_datatype output_type, rocblas_datatype compute_type, uint32_t flags, rocblas_int *list_array, rocblas_int *list_size)
BLAS BETA API
rocblas_gemm_batched_ex_get_solutions_by_type gets the indices for all the solutions that match the given types for gemm_batched_ex. Which solution is used by gemm_ex is controlled by the solution_index parameter.
If list_array is NULL, list_size is an output and will be filled with the number of solutions that can solve the GEMM. If list_array is not NULL, then it must be pointing to an array with at least list_size elements and will be filled with the solution indices that can solve the GEMM: the number of elements filled is min(list_size, # of solutions).
 Parameters:
handle – [in] [rocblas_handle] handle to the rocblas library context queue.
input_type – [in] [rocblas_datatype] specifies the datatype of matrix A.
output_type – [in] [rocblas_datatype] specifies the datatype of matrix D.
compute_type – [in] [rocblas_datatype] specifies the datatype of computation.
flags – [in] [uint32_t] optional gemm flags.
list_array – [out] [rocblas_int *] output array for solution indices or NULL if getting number of solutions
list_size – [inout] [rocblas_int *] size of list_array if getting solution indices or output with number of solutions if list_array is NULL

rocblas_status rocblas_gemm_strided_batched_ex_get_solutions(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, rocblas_int k, const void *alpha, const void *a, rocblas_datatype a_type, rocblas_int lda, rocblas_stride stride_a, const void *b, rocblas_datatype b_type, rocblas_int ldb, rocblas_stride stride_b, const void *beta, const void *c, rocblas_datatype c_type, rocblas_int ldc, rocblas_stride stride_c, void *d, rocblas_datatype d_type, rocblas_int ldd, rocblas_stride stride_d, rocblas_int batch_count, rocblas_datatype compute_type, rocblas_gemm_algo algo, uint32_t flags, rocblas_int *list_array, rocblas_int *list_size)
BLAS BETA API
gemm_strided_batched_ex_get_solutions gets the indices for all the solutions that can solve a corresponding call to gemm_strided_batched_ex. Which solution is used by gemm_strided_batched_ex is controlled by the solution_index parameter.
All parameters correspond to gemm_strided_batched_ex except for list_array and list_size, which are used as input and output for getting the solution indices. If list_array is NULL, list_size is an output and will be filled with the number of solutions that can solve the GEMM. If list_array is not NULL, then it must be pointing to an array with at least list_size elements and will be filled with the solution indices that can solve the GEMM: the number of elements filled is min(list_size, # of solutions).
 Parameters:
handle – [in] [rocblas_handle] handle to the rocblas library context queue.
transA – [in] [rocblas_operation] specifies the form of op( A ).
transB – [in] [rocblas_operation] specifies the form of op( B ).
m – [in] [rocblas_int] matrix dimension m.
n – [in] [rocblas_int] matrix dimension n.
k – [in] [rocblas_int] matrix dimension k.
alpha – [in] [const void *] device pointer or host pointer specifying the scalar alpha. Same datatype as compute_type.
a – [in] [void *] device pointer pointing to first matrix A_1.
a_type – [in] [rocblas_datatype] specifies the datatype of each matrix A_i.
lda – [in] [rocblas_int] specifies the leading dimension of each A_i.
stride_a – [in] [rocblas_stride] specifies stride from start of one A_i matrix to the next A_(i + 1).
b – [in] [void *] device pointer pointing to first matrix B_1.
b_type – [in] [rocblas_datatype] specifies the datatype of each matrix B_i.
ldb – [in] [rocblas_int] specifies the leading dimension of each B_i.
stride_b – [in] [rocblas_stride] specifies stride from start of one B_i matrix to the next B_(i + 1).
beta – [in] [const void *] device pointer or host pointer specifying the scalar beta. Same datatype as compute_type.
c – [in] [void *] device pointer pointing to first matrix C_1.
c_type – [in] [rocblas_datatype] specifies the datatype of each matrix C_i.
ldc – [in] [rocblas_int] specifies the leading dimension of each C_i.
stride_c – [in] [rocblas_stride] specifies stride from start of one C_i matrix to the next C_(i + 1).
d – [out] [void *] device pointer storing each matrix D_i. If d and c pointers are to the same matrix then d_type must equal c_type and ldd must equal ldc and stride_d must equal stride_c or the respective invalid status will be returned.
d_type – [in] [rocblas_datatype] specifies the datatype of each matrix D_i.
ldd – [in] [rocblas_int] specifies the leading dimension of each D_i.
stride_d – [in] [rocblas_stride] specifies stride from start of one D_i matrix to the next D_(i + 1).
batch_count – [in] [rocblas_int] number of gemm operations in the batch.
compute_type – [in] [rocblas_datatype] specifies the datatype of computation.
algo – [in] [rocblas_gemm_algo] enumerant specifying the algorithm type.
flags – [in] [uint32_t] optional gemm flags.
list_array – [out] [rocblas_int *] output array for solution indices or NULL if getting number of solutions
list_size – [inout] [rocblas_int *] size of list_array if getting solution indices or output with number of solutions if list_array is NULL
rocblas_gemm_ex3 + batched, strided_batched#

rocblas_status rocblas_gemm_ex3(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, rocblas_int k, const void *alpha, const void *a, rocblas_datatype a_type, rocblas_int lda, const void *b, rocblas_datatype b_type, rocblas_int ldb, const void *beta, const void *c, rocblas_datatype c_type, rocblas_int ldc, void *d, rocblas_datatype d_type, rocblas_int ldd, rocblas_computetype compute_type, rocblas_gemm_algo algo, int32_t solution_index, uint32_t flags)
BLAS BETA API
gemm_ex3 performs one of the matrixmatrix operations
where op( X ) is one ofD = alpha*op( A )*op( B ) + beta*C,
alpha and beta are scalars, and A, B, C, and D are matrices, with op( A ) an m by k matrix, op( B ) a k by n matrix and C and D are m by n matrices.op( X ) = X or op( X ) = X**T or op( X ) = X**H,
gemm_ex3 is a temporary API to support float8 computation. Trying to run this API on unsupported hardware will return rocblas_status_arch_mismatch.
Supported computetypes are as follows:
Supported types are as follows: alpha/beta always floatrocblas_compute_type_f32 = 300 or rocblas_compute_type_f8_f8_f32 = 301 or (internalAType_internalBType_AccumulatorType) rocblas_compute_type_f8_bf8_f32 = 302 or rocblas_compute_type_bf8_f8_f32 = 303 or rocblas_compute_type_bf8_bf8_f32 = 304
A type
B type
C type
D type
Compute type
fp8 or bf8
fp8 or bf8
fp32
fp32
f32
fp8
fp8
fp8
fp8
f32
fp8 or bf8
fp8 or bf8
bf8
bf8
f32
fp8 or bf8
fp8 or bf8
fp16
fp16
f32
fp16
fp16
fp16
fp16
f8_f8_f32 or f8_bf8_f32 or bf8_f8_f32 or bf8_bf8_f32
fp8 or fp32
fp8 or fp32
fp16
fp16
f8_f8_f32
fp32
bfp8
bfp8 or fp32
bfp8 or fp32
f8_bf8_f32
bfp8
fp32
fp32
fp32
bf8_f8_f32
Note: When using rocBLAS numerical checking with the flag ROCBLAS_CHECK_NUMERICS, gemm_ex3 will check the input data after the quantization stage. Consequently, when rocblas_compute_type_f32 is not used, only the post processed input will be checked. Numerical checking will always take place on f8/bf8 data.
 Parameters:
handle – [in] [rocblas_handle] handle to the rocblas library context queue.
transA – [in] [rocblas_operation] specifies the form of op( A ).
transB – [in] [rocblas_operation] specifies the form of op( B ).
m – [in] [rocblas_int] matrix dimension m.
n – [in] [rocblas_int] matrix dimension n.
k – [in] [rocblas_int] matrix dimension k.
alpha – [in] [const void *] device pointer or host pointer specifying the scalar alpha. Same datatype as compute_type.
a – [in] [void *] device pointer storing matrix A.
a_type – [in] [rocblas_datatype] specifies the datatype of matrix A.
lda – [in] [rocblas_int] specifies the leading dimension of A.
b – [in] [void *] device pointer storing matrix B.
b_type – [in] [rocblas_datatype] specifies the datatype of matrix B.
ldb – [in] [rocblas_int] specifies the leading dimension of B.
beta – [in] [const void *] device pointer or host pointer specifying the scalar beta. Same datatype as compute_type.
c – [in] [void *] device pointer storing matrix C.
c_type – [in] [rocblas_datatype] specifies the datatype of matrix C.
ldc – [in] [rocblas_int] specifies the leading dimension of C.
d – [out] [void *] device pointer storing matrix D.
d_type – [in] [rocblas_datatype] specifies the datatype of matrix D.
ldd – [in] [rocblas_int] specifies the leading dimension of D.
compute_type – [in] [rocblas_computetype] specifies the datatype of computation.
algo – [in] [rocblas_gemm_algo] enumerant specifying the algorithm type.
solution_index – [in] [int32_t] reserved for future use.
flags – [in] [uint32_t] optional gemm flags.

rocblas_status rocblas_gemm_batched_ex3(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, rocblas_int k, const void *alpha, const void *a, rocblas_datatype a_type, rocblas_int lda, const void *b, rocblas_datatype b_type, rocblas_int ldb, const void *beta, const void *c, rocblas_datatype c_type, rocblas_int ldc, void *d, rocblas_datatype d_type, rocblas_int ldd, rocblas_int batch_count, rocblas_computetype compute_type, rocblas_gemm_algo algo, int32_t solution_index, uint32_t flags)
BLAS BETA API
gemm_batched_ex3 performs one of the batched matrixmatrix operations:
where op( X ) is one ofD_i = alpha*op(A_i)*op(B_i) + beta*C_i, for i = 1, ..., batch_count.
alpha and beta are scalars, and A, B, C, and D are batched pointers to matrices, with op( A ) an m by k by batch_count batched matrix, op( B ) a k by n by batch_count batched matrix and C and D are m by n by batch_count batched matrices. The batched matrices are an array of pointers to matrices. The number of pointers to matrices is batch_count. C and D may point to the same matrices if their parameters are identical.op( X ) = X or op( X ) = X**T or op( X ) = X**H,
gemm_ex3 is a temporary API to support float8 computation. Trying to run this API on unsupported hardware will return rocblas_status_arch_mismatch.
Supported computetypes are as follows:
Supported types are as follows: alpha/beta always floatrocblas_compute_type_f32 = 300 or rocblas_compute_type_f8_f8_f32 = 301 or (internalAType_internalBType_AccumulatorType) rocblas_compute_type_f8_bf8_f32 = 302 or rocblas_compute_type_bf8_f8_f32 = 303 or rocblas_compute_type_bf8_bf8_f32 = 304
A type
B type
C type
D type
Compute type
fp8 or bf8
fp8 or bf8
fp32
fp32
f32
fp8
fp8
fp8
fp8
f32
fp8 or bf8
fp8 or bf8
bf8
bf8
f32
fp8 or bf8
fp8 or bf8
fp16
fp16
f32
fp16
fp16
fp16
fp16
f8_f8_f32 or f8_bf8_f32 or bf8_f8_f32 or bf8_bf8_f32
fp8 or fp32
fp8 or fp32
fp16
fp16
f8_f8_f32
fp32
bfp8
bfp8 or fp32
bfp8 or fp32
f8_bf8_f32
bfp8
fp32
fp32
fp32
bf8_f8_f32
Note: When using rocBLAS numerical checking with the flag ROCBLAS_CHECK_NUMERICS, gemm_batched_ex3 will check the input data after the quantization stage. Consequently, when rocblas_compute_type_f32 is not used, only the post processed input will be checked. Numerical checking will always take place on f8/bf8 data.
 Parameters:
handle – [in] [rocblas_handle] handle to the rocblas library context queue.
transA – [in] [rocblas_operation] specifies the form of op( A ).
transB – [in] [rocblas_operation] specifies the form of op( B ).
m – [in] [rocblas_int] matrix dimension m.
n – [in] [rocblas_int] matrix dimension n.
k – [in] [rocblas_int] matrix dimension k.
alpha – [in] [const void *] device pointer or host pointer specifying the scalar alpha. Same datatype as compute_type.
a – [in] [void *] device pointer storing array of pointers to each matrix A_i.
a_type – [in] [rocblas_datatype] specifies the datatype of each matrix A_i.
lda – [in] [rocblas_int] specifies the leading dimension of each A_i.
b – [in] [void *] device pointer storing array of pointers to each matrix B_i.
b_type – [in] [rocblas_datatype] specifies the datatype of each matrix B_i.
ldb – [in] [rocblas_int] specifies the leading dimension of each B_i.
beta – [in] [const void *] device pointer or host pointer specifying the scalar beta. Same datatype as compute_type.
c – [in] [void *] device array of device pointers to each matrix C_i.
c_type – [in] [rocblas_datatype] specifies the datatype of each matrix C_i.
ldc – [in] [rocblas_int] specifies the leading dimension of each C_i.
d – [out] [void *] device array of device pointers to each matrix D_i. If d and c are the same array of matrix pointers then d_type must equal c_type and ldd must equal ldc or the respective invalid status will be returned.
d_type – [in] [rocblas_datatype] specifies the datatype of each matrix D_i.
ldd – [in] [rocblas_int] specifies the leading dimension of each D_i.
batch_count – [in] [rocblas_int] number of gemm operations in the batch.
compute_type – [in] [rocblas_computetype] specifies the datatype of computation.
algo – [in] [rocblas_gemm_algo] enumerant specifying the algorithm type.
solution_index – [in] [int32_t] if algo is rocblas_gemm_algo_solution_index, this controls which solution is used. When algo is not rocblas_gemm_algo_solution_index, or if solution_index <= 0, the default solution is used. This parameter was unused in previous releases and instead always used the default solution
flags – [in] [uint32_t] optional gemm flags.

rocblas_status rocblas_gemm_strided_batched_ex3(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, rocblas_int k, const void *alpha, const void *a, rocblas_datatype a_type, rocblas_int lda, rocblas_stride stride_a, const void *b, rocblas_datatype b_type, rocblas_int ldb, rocblas_stride stride_b, const void *beta, const void *c, rocblas_datatype c_type, rocblas_int ldc, rocblas_stride stride_c, void *d, rocblas_datatype d_type, rocblas_int ldd, rocblas_stride stride_d, rocblas_int batch_count, rocblas_computetype compute_type, rocblas_gemm_algo algo, int32_t solution_index, uint32_t flags)
BLAS EX API
gemm_strided_batched_ex3 performs one of the strided_batched matrixmatrix operations:
where op( X ) is one ofD_i = alpha*op(A_i)*op(B_i) + beta*C_i, for i = 1, ..., batch_count
alpha and beta are scalars, and A, B, C, and D are strided_batched matrices, with op( A ) an m by k by batch_count strided_batched matrix, op( B ) a k by n by batch_count strided_batched matrix and C and D are m by n by batch_count strided_batched matrices. C and D may point to the same matrices if their parameters are identical.op( X ) = X or op( X ) = X**T or op( X ) = X**H,
The strided_batched matrices are multiple matrices separated by a constant stride. The number of matrices is batch_count.
gemm_ex3 is a temporary API to support float8 computation. Trying to run this API on unsupported hardware will return rocblas_status_arch_mismatch.
Supported computetypes are as follows:
Supported types are as follows: alpha/beta always floatrocblas_compute_type_f32 = 300 or rocblas_compute_type_f8_f8_f32 = 301 or (internalAType_internalBType_AccumulatorType) rocblas_compute_type_f8_bf8_f32 = 302 or rocblas_compute_type_bf8_f8_f32 = 303 or rocblas_compute_type_bf8_bf8_f32 = 304
A type
B type
C type
D type
Compute type
fp8 or bf8
fp8 or bf8
fp32
fp32
f32
fp8
fp8
fp8
fp8
f32
fp8 or bf8
fp8 or bf8
bf8
bf8
f32
fp8 or bf8
fp8 or bf8
fp16
fp16
f32
fp16
fp16
fp16
fp16
f8_f8_f32 or f8_bf8_f32 or bf8_f8_f32 or bf8_bf8_f32
fp8 or fp32
fp8 or fp32
fp16
fp16
f8_f8_f32
fp32
bfp8
bfp8 or fp32
bfp8 or fp32
f8_bf8_f32
bfp8
fp32
fp32
fp32
bf8_f8_f32
Note: When using rocBLAS numerical checking with the flag ROCBLAS_CHECK_NUMERICS, gemm_strided_batched_ex3 will check the input data after the quantization stage. Consequently, when rocblas_compute_type_f32 is not used, only the post processed input will be checked. Numerical checking will always take place on f8/bf8 data.
 Parameters:
handle – [in] [rocblas_handle] handle to the rocblas library context queue.
transA – [in] [rocblas_operation] specifies the form of op( A ).
transB – [in] [rocblas_operation] specifies the form of op( B ).
m – [in] [rocblas_int] matrix dimension m.
n – [in] [rocblas_int] matrix dimension n.
k – [in] [rocblas_int] matrix dimension k.
alpha – [in] [const void *] device pointer or host pointer specifying the scalar alpha. Same datatype as compute_type.
a – [in] [void *] device pointer pointing to first matrix A_1.
a_type – [in] [rocblas_datatype] specifies the datatype of each matrix A_i.
lda – [in] [rocblas_int] specifies the leading dimension of each A_i.
stride_a – [in] [rocblas_stride] specifies stride from start of one A_i matrix to the next A_(i + 1).
b – [in] [void *] device pointer pointing to first matrix B_1.
b_type – [in] [rocblas_datatype] specifies the datatype of each matrix B_i.
ldb – [in] [rocblas_int] specifies the leading dimension of each B_i.
stride_b – [in] [rocblas_stride] specifies stride from start of one B_i matrix to the next B_(i + 1).
beta – [in] [const void *] device pointer or host pointer specifying the scalar beta. Same datatype as compute_type.
c – [in] [void *] device pointer pointing to first matrix C_1.
c_type – [in] [rocblas_datatype] specifies the datatype of each matrix C_i.
ldc – [in] [rocblas_int] specifies the leading dimension of each C_i.
stride_c – [in] [rocblas_stride] specifies stride from start of one C_i matrix to the next C_(i + 1).
d – [out] [void *] device pointer storing each matrix D_i. If d and c pointers are to the same matrix then d_type must equal c_type and ldd must equal ldc and stride_d must equal stride_c or the respective invalid status will be returned.
d_type – [in] [rocblas_datatype] specifies the datatype of each matrix D_i.
ldd – [in] [rocblas_int] specifies the leading dimension of each D_i.
stride_d – [in] [rocblas_stride] specifies stride from start of one D_i matrix to the next D_(i + 1).
batch_count – [in] [rocblas_int] number of gemm operations in the batch.
compute_type – [in] [rocblas_computetype] specifies the datatype of computation.
algo – [in] [rocblas_gemm_algo] enumerant specifying the algorithm type.
solution_index – [in] [int32_t] if algo is rocblas_gemm_algo_solution_index, this controls which solution is used. When algo is not rocblas_gemm_algo_solution_index, or if solution_index <= 0, the default solution is used. This parameter was unused in previous releases and instead always used the default solution
flags – [in] [uint32_t] optional gemm flags.
Graph Support for rocBLAS#
Most of the rocBLAS functions can be captured into a graph node via Graph Management HIP APIs, except those listed in Functions Unsupported with Graph Capture. For a list of graph related HIP APIs, refer to Graph Management HIP API.
CHECK_HIP_ERROR((hipStreamBeginCapture(stream, hipStreamCaptureModeGlobal));
rocblas_<function>(<arguments>);
CHECK_HIP_ERROR(hipStreamEndCapture(stream, &graph));
The above code will create a graph with rocblas_function()
as graph node. The captured graph can be launched as shown below:
CHECK_HIP_ERROR(hipGraphInstantiate(&instance, graph, NULL, NULL, 0));
CHECK_HIP_ERROR(hipGraphLaunch(instance, stream));
Graph support requires Asynchronous HIP APIs, hence, users must enable streamorder memory allocation. For more details refer to section StreamOrdered Memory Allocation.
During stream capture, rocBLAS stores the allocated host and device memory in the handle and the allocated memory will be freed when the handle is destroyed.
Functions Unsupported with Graph Capture#
The following Level1 functions place results into host buffers (in pointer mode host) which enforces synchronization.
dot
asum
nrm2
imax
imin
BLAS Level3 and BLASEX functions in pointer mode device do not support HIP Graph. Support will be added in future releases.
HIP Graph Known Issues in rocBLAS#
On Windows platform, batched functions (Level1, Level2 and Level3) produce incorrect results.