rocBLAS Beta Features#

To allow for future growth and changes, the features in this section are not subject to the same level of backwards compatibility and support as the normal rocBLAS API. These features are subject to change and/or removal in future release of rocBLAS.

To use the following beta API features ROCBLAS_BETA_FEATURES_API must be defined before including rocblas.h.

rocblas_gemm_ex_get_solutions + batched, strided_batched#

rocblas_status rocblas_gemm_ex_get_solutions(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, rocblas_int k, const void *alpha, const void *a, rocblas_datatype a_type, rocblas_int lda, const void *b, rocblas_datatype b_type, rocblas_int ldb, const void *beta, const void *c, rocblas_datatype c_type, rocblas_int ldc, void *d, rocblas_datatype d_type, rocblas_int ldd, rocblas_datatype compute_type, rocblas_gemm_algo algo, uint32_t flags, rocblas_int *list_array, rocblas_int *list_size)#

BLAS BETA API

gemm_ex_get_solutions gets the indices for all the solutions that can solve a corresponding call to gemm_ex. Which solution is used by gemm_ex is controlled by the solution_index parameter.

All parameters correspond to gemm_ex except for list_array and list_size, which are used as input and output for getting the solution indices. If list_array is NULL, list_size is an output and will be filled with the number of solutions that can solve the GEMM. If list_array is not NULL, then it must be pointing to an array with at least list_size elements and will be filled with the solution indices that can solve the GEMM: the number of elements filled is min(list_size, # of solutions).

Parameters:
  • handle[in] [rocblas_handle] handle to the rocblas library context queue.

  • transA[in] [rocblas_operation] specifies the form of op( A ).

  • transB[in] [rocblas_operation] specifies the form of op( B ).

  • m[in] [rocblas_int] matrix dimension m.

  • n[in] [rocblas_int] matrix dimension n.

  • k[in] [rocblas_int] matrix dimension k.

  • alpha[in] [const void *] device pointer or host pointer specifying the scalar alpha. Same datatype as compute_type.

  • a[in] [void *] device pointer storing matrix A.

  • a_type[in] [rocblas_datatype] specifies the datatype of matrix A.

  • lda[in] [rocblas_int] specifies the leading dimension of A.

  • b[in] [void *] device pointer storing matrix B.

  • b_type[in] [rocblas_datatype] specifies the datatype of matrix B.

  • ldb[in] [rocblas_int] specifies the leading dimension of B.

  • beta[in] [const void *] device pointer or host pointer specifying the scalar beta. Same datatype as compute_type.

  • c[in] [void *] device pointer storing matrix C.

  • c_type[in] [rocblas_datatype] specifies the datatype of matrix C.

  • ldc[in] [rocblas_int] specifies the leading dimension of C.

  • d[out] [void *] device pointer storing matrix D. If d and c pointers are to the same matrix then d_type must equal c_type and ldd must equal ldc or the respective invalid status will be returned.

  • d_type[in] [rocblas_datatype] specifies the datatype of matrix D.

  • ldd[in] [rocblas_int] specifies the leading dimension of D.

  • compute_type[in] [rocblas_datatype] specifies the datatype of computation.

  • algo[in] [rocblas_gemm_algo] enumerant specifying the algorithm type.

  • flags[in] [uint32_t] optional gemm flags.

  • list_array[out] [rocblas_int *] output array for solution indices or NULL if getting number of solutions

  • list_size[inout] [rocblas_int *] size of list_array if getting solution indices or output with number of solutions if list_array is NULL

rocblas_status rocblas_gemm_ex_get_solutions_by_type(rocblas_handle handle, rocblas_datatype input_type, rocblas_datatype output_type, rocblas_datatype compute_type, uint32_t flags, rocblas_int *list_array, rocblas_int *list_size)#

BLAS BETA API

rocblas_gemm_ex_get_solutions_by_type gets the indices for all the solutions that match the given types for gemm_ex. Which solution is used by gemm_ex is controlled by the solution_index parameter.

If list_array is NULL, list_size is an output and will be filled with the number of solutions that can solve the GEMM. If list_array is not NULL, then it must be pointing to an array with at least list_size elements and will be filled with the solution indices that can solve the GEMM: the number of elements filled is min(list_size, # of solutions).

Parameters:
  • handle[in] [rocblas_handle] handle to the rocblas library context queue.

  • input_type[in] [rocblas_datatype] specifies the datatype of matrix A.

  • output_type[in] [rocblas_datatype] specifies the datatype of matrix D.

  • compute_type[in] [rocblas_datatype] specifies the datatype of computation.

  • flags[in] [uint32_t] optional gemm flags.

  • list_array[out] [rocblas_int *] output array for solution indices or NULL if getting number of solutions

  • list_size[inout] [rocblas_int *] size of list_array if getting solution indices or output with number of solutions if list_array is NULL

rocblas_status rocblas_gemm_batched_ex_get_solutions(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, rocblas_int k, const void *alpha, const void *a, rocblas_datatype a_type, rocblas_int lda, const void *b, rocblas_datatype b_type, rocblas_int ldb, const void *beta, const void *c, rocblas_datatype c_type, rocblas_int ldc, void *d, rocblas_datatype d_type, rocblas_int ldd, rocblas_int batch_count, rocblas_datatype compute_type, rocblas_gemm_algo algo, uint32_t flags, rocblas_int *list_array, rocblas_int *list_size)#

BLAS BETA API

rocblas_gemm_batched_ex_get_solutions gets the indices for all the solutions that can solve a corresponding call to gemm_batched_ex. Which solution is used by gemm_batched_ex is controlled by the solution_index parameter.

All parameters correspond to gemm_batched_ex except for list_array and list_size, which are used as input and output for getting the solution indices. If list_array is NULL, list_size is an output and will be filled with the number of solutions that can solve the GEMM. If list_array is not NULL, then it must be pointing to an array with at least list_size elements and will be filled with the solution indices that can solve the GEMM: the number of elements filled is min(list_size, # of solutions).

Parameters:
  • handle[in] [rocblas_handle] handle to the rocblas library context queue.

  • transA[in] [rocblas_operation] specifies the form of op( A ).

  • transB[in] [rocblas_operation] specifies the form of op( B ).

  • m[in] [rocblas_int] matrix dimension m.

  • n[in] [rocblas_int] matrix dimension n.

  • k[in] [rocblas_int] matrix dimension k.

  • alpha[in] [const void *] device pointer or host pointer specifying the scalar alpha. Same datatype as compute_type.

  • a[in] [void *] device pointer storing array of pointers to each matrix A_i.

  • a_type[in] [rocblas_datatype] specifies the datatype of each matrix A_i.

  • lda[in] [rocblas_int] specifies the leading dimension of each A_i.

  • b[in] [void *] device pointer storing array of pointers to each matrix B_i.

  • b_type[in] [rocblas_datatype] specifies the datatype of each matrix B_i.

  • ldb[in] [rocblas_int] specifies the leading dimension of each B_i.

  • beta[in] [const void *] device pointer or host pointer specifying the scalar beta. Same datatype as compute_type.

  • c[in] [void *] device array of device pointers to each matrix C_i.

  • c_type[in] [rocblas_datatype] specifies the datatype of each matrix C_i.

  • ldc[in] [rocblas_int] specifies the leading dimension of each C_i.

  • d[out] [void *] device array of device pointers to each matrix D_i. If d and c are the same array of matrix pointers then d_type must equal c_type and ldd must equal ldc or the respective invalid status will be returned.

  • d_type[in] [rocblas_datatype] specifies the datatype of each matrix D_i.

  • ldd[in] [rocblas_int] specifies the leading dimension of each D_i.

  • batch_count[in] [rocblas_int] number of gemm operations in the batch.

  • compute_type[in] [rocblas_datatype] specifies the datatype of computation.

  • algo[in] [rocblas_gemm_algo] enumerant specifying the algorithm type.

  • flags[in] [uint32_t] optional gemm flags.

  • list_array[out] [rocblas_int *] output array for solution indices or NULL if getting number of solutions

  • list_size[inout] [rocblas_int *] size of list_array if getting solution indices or output with number of solutions if list_array is NULL

rocblas_status rocblas_gemm_batched_ex_get_solutions_by_type(rocblas_handle handle, rocblas_datatype input_type, rocblas_datatype output_type, rocblas_datatype compute_type, uint32_t flags, rocblas_int *list_array, rocblas_int *list_size)#

BLAS BETA API

rocblas_gemm_batched_ex_get_solutions_by_type gets the indices for all the solutions that match the given types for gemm_batched_ex. Which solution is used by gemm_ex is controlled by the solution_index parameter.

If list_array is NULL, list_size is an output and will be filled with the number of solutions that can solve the GEMM. If list_array is not NULL, then it must be pointing to an array with at least list_size elements and will be filled with the solution indices that can solve the GEMM: the number of elements filled is min(list_size, # of solutions).

Parameters:
  • handle[in] [rocblas_handle] handle to the rocblas library context queue.

  • input_type[in] [rocblas_datatype] specifies the datatype of matrix A.

  • output_type[in] [rocblas_datatype] specifies the datatype of matrix D.

  • compute_type[in] [rocblas_datatype] specifies the datatype of computation.

  • flags[in] [uint32_t] optional gemm flags.

  • list_array[out] [rocblas_int *] output array for solution indices or NULL if getting number of solutions

  • list_size[inout] [rocblas_int *] size of list_array if getting solution indices or output with number of solutions if list_array is NULL

rocblas_status rocblas_gemm_strided_batched_ex_get_solutions(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, rocblas_int k, const void *alpha, const void *a, rocblas_datatype a_type, rocblas_int lda, rocblas_stride stride_a, const void *b, rocblas_datatype b_type, rocblas_int ldb, rocblas_stride stride_b, const void *beta, const void *c, rocblas_datatype c_type, rocblas_int ldc, rocblas_stride stride_c, void *d, rocblas_datatype d_type, rocblas_int ldd, rocblas_stride stride_d, rocblas_int batch_count, rocblas_datatype compute_type, rocblas_gemm_algo algo, uint32_t flags, rocblas_int *list_array, rocblas_int *list_size)#

BLAS BETA API

gemm_strided_batched_ex_get_solutions gets the indices for all the solutions that can solve a corresponding call to gemm_strided_batched_ex. Which solution is used by gemm_strided_batched_ex is controlled by the solution_index parameter.

All parameters correspond to gemm_strided_batched_ex except for list_array and list_size, which are used as input and output for getting the solution indices. If list_array is NULL, list_size is an output and will be filled with the number of solutions that can solve the GEMM. If list_array is not NULL, then it must be pointing to an array with at least list_size elements and will be filled with the solution indices that can solve the GEMM: the number of elements filled is min(list_size, # of solutions).

Parameters:
  • handle[in] [rocblas_handle] handle to the rocblas library context queue.

  • transA[in] [rocblas_operation] specifies the form of op( A ).

  • transB[in] [rocblas_operation] specifies the form of op( B ).

  • m[in] [rocblas_int] matrix dimension m.

  • n[in] [rocblas_int] matrix dimension n.

  • k[in] [rocblas_int] matrix dimension k.

  • alpha[in] [const void *] device pointer or host pointer specifying the scalar alpha. Same datatype as compute_type.

  • a[in] [void *] device pointer pointing to first matrix A_1.

  • a_type[in] [rocblas_datatype] specifies the datatype of each matrix A_i.

  • lda[in] [rocblas_int] specifies the leading dimension of each A_i.

  • stride_a[in] [rocblas_stride] specifies stride from start of one A_i matrix to the next A_(i + 1).

  • b[in] [void *] device pointer pointing to first matrix B_1.

  • b_type[in] [rocblas_datatype] specifies the datatype of each matrix B_i.

  • ldb[in] [rocblas_int] specifies the leading dimension of each B_i.

  • stride_b[in] [rocblas_stride] specifies stride from start of one B_i matrix to the next B_(i + 1).

  • beta[in] [const void *] device pointer or host pointer specifying the scalar beta. Same datatype as compute_type.

  • c[in] [void *] device pointer pointing to first matrix C_1.

  • c_type[in] [rocblas_datatype] specifies the datatype of each matrix C_i.

  • ldc[in] [rocblas_int] specifies the leading dimension of each C_i.

  • stride_c[in] [rocblas_stride] specifies stride from start of one C_i matrix to the next C_(i + 1).

  • d[out] [void *] device pointer storing each matrix D_i. If d and c pointers are to the same matrix then d_type must equal c_type and ldd must equal ldc and stride_d must equal stride_c or the respective invalid status will be returned.

  • d_type[in] [rocblas_datatype] specifies the datatype of each matrix D_i.

  • ldd[in] [rocblas_int] specifies the leading dimension of each D_i.

  • stride_d[in] [rocblas_stride] specifies stride from start of one D_i matrix to the next D_(i + 1).

  • batch_count[in] [rocblas_int] number of gemm operations in the batch.

  • compute_type[in] [rocblas_datatype] specifies the datatype of computation.

  • algo[in] [rocblas_gemm_algo] enumerant specifying the algorithm type.

  • flags[in] [uint32_t] optional gemm flags.

  • list_array[out] [rocblas_int *] output array for solution indices or NULL if getting number of solutions

  • list_size[inout] [rocblas_int *] size of list_array if getting solution indices or output with number of solutions if list_array is NULL

rocblas_gemm_ex3 + batched, strided_batched#

rocblas_status rocblas_gemm_ex3(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, rocblas_int k, const void *alpha, const void *a, rocblas_datatype a_type, rocblas_int lda, const void *b, rocblas_datatype b_type, rocblas_int ldb, const void *beta, const void *c, rocblas_datatype c_type, rocblas_int ldc, void *d, rocblas_datatype d_type, rocblas_int ldd, rocblas_computetype compute_type, rocblas_gemm_algo algo, int32_t solution_index, uint32_t flags)#

BLAS BETA API

gemm_ex3 performs one of the matrix-matrix operations

D = alpha*op( A )*op( B ) + beta*C,
where op( X ) is one of
op( X ) = X      or
op( X ) = X**T   or
op( X ) = X**H,
alpha and beta are scalars, and A, B, C, and D are matrices, with op( A ) an m by k matrix, op( B ) a k by n matrix and C and D are m by n matrices.

gemm_ex3 is a temporary API to support float8 computation. Trying to run this API on unsupported hardware will return rocblas_status_arch_mismatch.

Supported compute-types are as follows:

rocblas_compute_type_f32         = 300 or
rocblas_compute_type_f8_f8_f32   = 301 or (internalAType_internalBType_AccumulatorType)
rocblas_compute_type_f8_bf8_f32  = 302 or
rocblas_compute_type_bf8_f8_f32  = 303 or
rocblas_compute_type_bf8_bf8_f32 = 304
Supported types are as follows: alpha/beta always float

A type

B type

C type

D type

Compute type

fp8 or bf8

fp8 or bf8

fp32

fp32

f32

fp8

fp8

fp8

fp8

f32

fp8 or bf8

fp8 or bf8

bf8

bf8

f32

fp8 or bf8

fp8 or bf8

fp16

fp16

f32

fp16

fp16

fp16

fp16

f8_f8_f32 or f8_bf8_f32 or bf8_f8_f32 or bf8_bf8_f32

fp8 or fp32

fp8 or fp32

fp16

fp16

f8_f8_f32

fp32

bfp8

bfp8 or fp32

bfp8 or fp32

f8_bf8_f32

bfp8

fp32

fp32

fp32

bf8_f8_f32

Note: When using rocBLAS numerical checking with the flag ROCBLAS_CHECK_NUMERICS, gemm_ex3 will check the input data after the quantization stage. Consequently, when rocblas_compute_type_f32 is not used, only the post processed input will be checked. Numerical checking will always take place on f8/bf8 data.

Parameters:
  • handle[in] [rocblas_handle] handle to the rocblas library context queue.

  • transA[in] [rocblas_operation] specifies the form of op( A ).

  • transB[in] [rocblas_operation] specifies the form of op( B ).

  • m[in] [rocblas_int] matrix dimension m.

  • n[in] [rocblas_int] matrix dimension n.

  • k[in] [rocblas_int] matrix dimension k.

  • alpha[in] [const void *] device pointer or host pointer specifying the scalar alpha. Same datatype as compute_type.

  • a[in] [void *] device pointer storing matrix A.

  • a_type[in] [rocblas_datatype] specifies the datatype of matrix A.

  • lda[in] [rocblas_int] specifies the leading dimension of A.

  • b[in] [void *] device pointer storing matrix B.

  • b_type[in] [rocblas_datatype] specifies the datatype of matrix B.

  • ldb[in] [rocblas_int] specifies the leading dimension of B.

  • beta[in] [const void *] device pointer or host pointer specifying the scalar beta. Same datatype as compute_type.

  • c[in] [void *] device pointer storing matrix C.

  • c_type[in] [rocblas_datatype] specifies the datatype of matrix C.

  • ldc[in] [rocblas_int] specifies the leading dimension of C.

  • d[out] [void *] device pointer storing matrix D.

  • d_type[in] [rocblas_datatype] specifies the datatype of matrix D.

  • ldd[in] [rocblas_int] specifies the leading dimension of D.

  • compute_type[in] [rocblas_computetype] specifies the datatype of computation.

  • algo[in] [rocblas_gemm_algo] enumerant specifying the algorithm type.

  • solution_index[in] [int32_t] reserved for future use.

  • flags[in] [uint32_t] optional gemm flags.

rocblas_status rocblas_gemm_batched_ex3(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, rocblas_int k, const void *alpha, const void *a, rocblas_datatype a_type, rocblas_int lda, const void *b, rocblas_datatype b_type, rocblas_int ldb, const void *beta, const void *c, rocblas_datatype c_type, rocblas_int ldc, void *d, rocblas_datatype d_type, rocblas_int ldd, rocblas_int batch_count, rocblas_computetype compute_type, rocblas_gemm_algo algo, int32_t solution_index, uint32_t flags)#

BLAS BETA API

gemm_batched_ex3 performs one of the batched matrix-matrix operations:

D_i = alpha*op(A_i)*op(B_i) + beta*C_i, for i = 1, ..., batch_count.
where op( X ) is one of
op( X ) = X      or
op( X ) = X**T   or
op( X ) = X**H,
alpha and beta are scalars, and A, B, C, and D are batched pointers to matrices, with op( A ) an m by k by batch_count batched matrix, op( B ) a k by n by batch_count batched matrix and C and D are m by n by batch_count batched matrices. The batched matrices are an array of pointers to matrices. The number of pointers to matrices is batch_count. C and D may point to the same matrices if their parameters are identical.

gemm_ex3 is a temporary API to support float8 computation. Trying to run this API on unsupported hardware will return rocblas_status_arch_mismatch.

Supported compute-types are as follows:

rocblas_compute_type_f32         = 300 or
rocblas_compute_type_f8_f8_f32   = 301 or (internalAType_internalBType_AccumulatorType)
rocblas_compute_type_f8_bf8_f32  = 302 or
rocblas_compute_type_bf8_f8_f32  = 303 or
rocblas_compute_type_bf8_bf8_f32 = 304
Supported types are as follows: alpha/beta always float

A type

B type

C type

D type

Compute type

fp8 or bf8

fp8 or bf8

fp32

fp32

f32

fp8

fp8

fp8

fp8

f32

fp8 or bf8

fp8 or bf8

bf8

bf8

f32

fp8 or bf8

fp8 or bf8

fp16

fp16

f32

fp16

fp16

fp16

fp16

f8_f8_f32 or f8_bf8_f32 or bf8_f8_f32 or bf8_bf8_f32

fp8 or fp32

fp8 or fp32

fp16

fp16

f8_f8_f32

fp32

bfp8

bfp8 or fp32

bfp8 or fp32

f8_bf8_f32

bfp8

fp32

fp32

fp32

bf8_f8_f32

Note: When using rocBLAS numerical checking with the flag ROCBLAS_CHECK_NUMERICS, gemm_batched_ex3 will check the input data after the quantization stage. Consequently, when rocblas_compute_type_f32 is not used, only the post processed input will be checked. Numerical checking will always take place on f8/bf8 data.

Parameters:
  • handle[in] [rocblas_handle] handle to the rocblas library context queue.

  • transA[in] [rocblas_operation] specifies the form of op( A ).

  • transB[in] [rocblas_operation] specifies the form of op( B ).

  • m[in] [rocblas_int] matrix dimension m.

  • n[in] [rocblas_int] matrix dimension n.

  • k[in] [rocblas_int] matrix dimension k.

  • alpha[in] [const void *] device pointer or host pointer specifying the scalar alpha. Same datatype as compute_type.

  • a[in] [void *] device pointer storing array of pointers to each matrix A_i.

  • a_type[in] [rocblas_datatype] specifies the datatype of each matrix A_i.

  • lda[in] [rocblas_int] specifies the leading dimension of each A_i.

  • b[in] [void *] device pointer storing array of pointers to each matrix B_i.

  • b_type[in] [rocblas_datatype] specifies the datatype of each matrix B_i.

  • ldb[in] [rocblas_int] specifies the leading dimension of each B_i.

  • beta[in] [const void *] device pointer or host pointer specifying the scalar beta. Same datatype as compute_type.

  • c[in] [void *] device array of device pointers to each matrix C_i.

  • c_type[in] [rocblas_datatype] specifies the datatype of each matrix C_i.

  • ldc[in] [rocblas_int] specifies the leading dimension of each C_i.

  • d[out] [void *] device array of device pointers to each matrix D_i. If d and c are the same array of matrix pointers then d_type must equal c_type and ldd must equal ldc or the respective invalid status will be returned.

  • d_type[in] [rocblas_datatype] specifies the datatype of each matrix D_i.

  • ldd[in] [rocblas_int] specifies the leading dimension of each D_i.

  • batch_count[in] [rocblas_int] number of gemm operations in the batch.

  • compute_type[in] [rocblas_computetype] specifies the datatype of computation.

  • algo[in] [rocblas_gemm_algo] enumerant specifying the algorithm type.

  • solution_index[in] [int32_t] if algo is rocblas_gemm_algo_solution_index, this controls which solution is used. When algo is not rocblas_gemm_algo_solution_index, or if solution_index <= 0, the default solution is used. This parameter was unused in previous releases and instead always used the default solution

  • flags[in] [uint32_t] optional gemm flags.

rocblas_status rocblas_gemm_strided_batched_ex3(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, rocblas_int k, const void *alpha, const void *a, rocblas_datatype a_type, rocblas_int lda, rocblas_stride stride_a, const void *b, rocblas_datatype b_type, rocblas_int ldb, rocblas_stride stride_b, const void *beta, const void *c, rocblas_datatype c_type, rocblas_int ldc, rocblas_stride stride_c, void *d, rocblas_datatype d_type, rocblas_int ldd, rocblas_stride stride_d, rocblas_int batch_count, rocblas_computetype compute_type, rocblas_gemm_algo algo, int32_t solution_index, uint32_t flags)#

BLAS EX API

gemm_strided_batched_ex3 performs one of the strided_batched matrix-matrix operations:

D_i = alpha*op(A_i)*op(B_i) + beta*C_i, for i = 1, ..., batch_count
where op( X ) is one of
op( X ) = X      or
op( X ) = X**T   or
op( X ) = X**H,
alpha and beta are scalars, and A, B, C, and D are strided_batched matrices, with op( A ) an m by k by batch_count strided_batched matrix, op( B ) a k by n by batch_count strided_batched matrix and C and D are m by n by batch_count strided_batched matrices. C and D may point to the same matrices if their parameters are identical.

The strided_batched matrices are multiple matrices separated by a constant stride. The number of matrices is batch_count.

gemm_ex3 is a temporary API to support float8 computation. Trying to run this API on unsupported hardware will return rocblas_status_arch_mismatch.

Supported compute-types are as follows:

rocblas_compute_type_f32         = 300 or
rocblas_compute_type_f8_f8_f32   = 301 or (internalAType_internalBType_AccumulatorType)
rocblas_compute_type_f8_bf8_f32  = 302 or
rocblas_compute_type_bf8_f8_f32  = 303 or
rocblas_compute_type_bf8_bf8_f32 = 304
Supported types are as follows: alpha/beta always float

A type

B type

C type

D type

Compute type

fp8 or bf8

fp8 or bf8

fp32

fp32

f32

fp8

fp8

fp8

fp8

f32

fp8 or bf8

fp8 or bf8

bf8

bf8

f32

fp8 or bf8

fp8 or bf8

fp16

fp16

f32

fp16

fp16

fp16

fp16

f8_f8_f32 or f8_bf8_f32 or bf8_f8_f32 or bf8_bf8_f32

fp8 or fp32

fp8 or fp32

fp16

fp16

f8_f8_f32

fp32

bfp8

bfp8 or fp32

bfp8 or fp32

f8_bf8_f32

bfp8

fp32

fp32

fp32

bf8_f8_f32

Note: When using rocBLAS numerical checking with the flag ROCBLAS_CHECK_NUMERICS, gemm_strided_batched_ex3 will check the input data after the quantization stage. Consequently, when rocblas_compute_type_f32 is not used, only the post processed input will be checked. Numerical checking will always take place on f8/bf8 data.

Parameters:
  • handle[in] [rocblas_handle] handle to the rocblas library context queue.

  • transA[in] [rocblas_operation] specifies the form of op( A ).

  • transB[in] [rocblas_operation] specifies the form of op( B ).

  • m[in] [rocblas_int] matrix dimension m.

  • n[in] [rocblas_int] matrix dimension n.

  • k[in] [rocblas_int] matrix dimension k.

  • alpha[in] [const void *] device pointer or host pointer specifying the scalar alpha. Same datatype as compute_type.

  • a[in] [void *] device pointer pointing to first matrix A_1.

  • a_type[in] [rocblas_datatype] specifies the datatype of each matrix A_i.

  • lda[in] [rocblas_int] specifies the leading dimension of each A_i.

  • stride_a[in] [rocblas_stride] specifies stride from start of one A_i matrix to the next A_(i + 1).

  • b[in] [void *] device pointer pointing to first matrix B_1.

  • b_type[in] [rocblas_datatype] specifies the datatype of each matrix B_i.

  • ldb[in] [rocblas_int] specifies the leading dimension of each B_i.

  • stride_b[in] [rocblas_stride] specifies stride from start of one B_i matrix to the next B_(i + 1).

  • beta[in] [const void *] device pointer or host pointer specifying the scalar beta. Same datatype as compute_type.

  • c[in] [void *] device pointer pointing to first matrix C_1.

  • c_type[in] [rocblas_datatype] specifies the datatype of each matrix C_i.

  • ldc[in] [rocblas_int] specifies the leading dimension of each C_i.

  • stride_c[in] [rocblas_stride] specifies stride from start of one C_i matrix to the next C_(i + 1).

  • d[out] [void *] device pointer storing each matrix D_i. If d and c pointers are to the same matrix then d_type must equal c_type and ldd must equal ldc and stride_d must equal stride_c or the respective invalid status will be returned.

  • d_type[in] [rocblas_datatype] specifies the datatype of each matrix D_i.

  • ldd[in] [rocblas_int] specifies the leading dimension of each D_i.

  • stride_d[in] [rocblas_stride] specifies stride from start of one D_i matrix to the next D_(i + 1).

  • batch_count[in] [rocblas_int] number of gemm operations in the batch.

  • compute_type[in] [rocblas_computetype] specifies the datatype of computation.

  • algo[in] [rocblas_gemm_algo] enumerant specifying the algorithm type.

  • solution_index[in] [int32_t] if algo is rocblas_gemm_algo_solution_index, this controls which solution is used. When algo is not rocblas_gemm_algo_solution_index, or if solution_index <= 0, the default solution is used. This parameter was unused in previous releases and instead always used the default solution

  • flags[in] [uint32_t] optional gemm flags.

Graph Support for rocBLAS#

Most of the rocBLAS functions can be captured into a graph node via Graph Management HIP APIs, except those listed in Functions Unsupported with Graph Capture. For a list of graph related HIP APIs, refer to Graph Management HIP API.

CHECK_HIP_ERROR((hipStreamBeginCapture(stream, hipStreamCaptureModeGlobal));
rocblas_<function>(<arguments>);
CHECK_HIP_ERROR(hipStreamEndCapture(stream, &graph));

The above code will create a graph with rocblas_function() as graph node. The captured graph can be launched as shown below:

CHECK_HIP_ERROR(hipGraphInstantiate(&instance, graph, NULL, NULL, 0));
CHECK_HIP_ERROR(hipGraphLaunch(instance, stream));

Graph support requires Asynchronous HIP APIs, hence, users must enable stream-order memory allocation. For more details refer to section Stream-Ordered Memory Allocation.

During stream capture, rocBLAS stores the allocated host and device memory in the handle and the allocated memory will be freed when the handle is destroyed.

Functions Unsupported with Graph Capture#

  • The following Level-1 functions place results into host buffers (in pointer mode host) which enforces synchronization.

    • dot

    • asum

    • nrm2

    • imax

    • imin

  • BLAS Level-3 and BLAS-EX functions in pointer mode device do not support HIP Graph. Support will be added in future releases.

HIP Graph Known Issues in rocBLAS#

  • On Windows platform, batched functions (Level-1, Level-2 and Level-3) produce incorrect results.