rocBLAS Extension#
Level-1 Extension functions support the ILP64 API. For more information on these _64 functions, refer to section ILP64 Interface.
rocblas_axpy_ex + batched, strided_batched#
rocblas_status rocblas_axpy_ex(rocblas_handle handle, rocblas_int n, const void *alpha, rocblas_datatype alpha_type, const void *x, rocblas_datatype x_type, rocblas_int incx, void *y, rocblas_datatype y_type, rocblas_int incy, rocblas_datatype execution_type)#
rocblas_status rocblas_axpy_batched_ex(rocblas_handle handle, rocblas_int n, const void *alpha, rocblas_datatype alpha_type, const void *x, rocblas_datatype x_type, rocblas_int incx, void *y, rocblas_datatype y_type, rocblas_int incy, rocblas_int batch_count, rocblas_datatype execution_type)#
rocblas_status rocblas_axpy_strided_batched_ex(rocblas_handle handle, rocblas_int n, const void *alpha, rocblas_datatype alpha_type, const void *x, rocblas_datatype x_type, rocblas_int incx, rocblas_stride stridex, void *y, rocblas_datatype y_type, rocblas_int incy, rocblas_stride stridey, rocblas_int batch_count, rocblas_datatype execution_type)#
- BLAS EX API

axpy_strided_batched_ex computes constant alpha multiplied by vector x, plus vector y, over a set of strided batched vectors:

y_i := alpha * x_i + y_i

Currently supported datatypes are as follows:

| alpha_type | x_type | y_type | execution_type |
|------------|--------|--------|----------------|
| bf16_r | bf16_r | bf16_r | f32_r |
| f32_r | bf16_r | bf16_r | f32_r |
| f16_r | f16_r | f16_r | f16_r |
| f16_r | f16_r | f16_r | f32_r |
| f32_r | f16_r | f16_r | f32_r |
| f32_r | f32_r | f32_r | f32_r |
| f64_r | f64_r | f64_r | f64_r |
| f32_c | f32_c | f32_c | f32_c |
| f64_c | f64_c | f64_c | f64_c |

- Parameters:
- handle – [in] [rocblas_handle] handle to the rocblas library context queue. 
- n – [in] [rocblas_int] the number of elements in each x_i and y_i. 
- alpha – [in] device pointer or host pointer to specify the scalar alpha. 
- alpha_type – [in] [rocblas_datatype] specifies the datatype of alpha. 
- x – [in] device pointer to the first vector x_1. 
- x_type – [in] [rocblas_datatype] specifies the datatype of each vector x_i. 
- incx – [in] [rocblas_int] specifies the increment for the elements of each x_i. 
- stridex – [in] [rocblas_stride] stride from the start of one vector (x_i) to the next one (x_i+1). There are no restrictions placed on stridex. However, ensure that stridex is of appropriate size. For a typical case, this means stridex >= n * incx. 
- y – [inout] device pointer to the first vector y_1. 
- y_type – [in] [rocblas_datatype] specifies the datatype of each vector y_i. 
- incy – [in] [rocblas_int] specifies the increment for the elements of each y_i. 
- stridey – [in] [rocblas_stride] stride from the start of one vector (y_i) to the next one (y_i+1). There are no restrictions placed on stridey. However, ensure that stridey is of appropriate size. For a typical case, this means stridey >= n * incy. 
- batch_count – [in] [rocblas_int] number of instances in the batch. 
- execution_type – [in] [rocblas_datatype] specifies the datatype of computation. 
 
 
axpy_ex, axpy_batched_ex, and axpy_strided_batched_ex functions support the _64 interface. Refer to section ILP64 Interface.
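The operation and the strided batch layout can be sketched with a plain-C host-side reference. This is a hypothetical helper for the f32_r case with positive increments, not the library call itself:

```c
#include <stddef.h>

/* Host-side sketch of what axpy_strided_batched_ex computes for f32_r data:
   for each batch b, y_b := alpha * x_b + y_b, where vector b starts at
   x + b * stridex (respectively y + b * stridey). */
void axpy_strided_batched_ref(int n, float alpha,
                              const float *x, int incx, ptrdiff_t stridex,
                              float *y, int incy, ptrdiff_t stridey,
                              int batch_count)
{
    for (int b = 0; b < batch_count; ++b) {
        const float *xb = x + b * stridex;  /* start of x_b */
        float       *yb = y + b * stridey;  /* start of y_b */
        for (int j = 0; j < n; ++j)
            yb[j * incy] += alpha * xb[j * incx];
    }
}
```

A typical choice satisfying the stride guidance above is stridex = n * incx, so consecutive vectors are packed back to back.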
rocblas_dot_ex + batched, strided_batched#
rocblas_status rocblas_dot_ex(rocblas_handle handle, rocblas_int n, const void *x, rocblas_datatype x_type, rocblas_int incx, const void *y, rocblas_datatype y_type, rocblas_int incy, void *result, rocblas_datatype result_type, rocblas_datatype execution_type)#
rocblas_status rocblas_dot_batched_ex(rocblas_handle handle, rocblas_int n, const void *x, rocblas_datatype x_type, rocblas_int incx, const void *y, rocblas_datatype y_type, rocblas_int incy, rocblas_int batch_count, void *result, rocblas_datatype result_type, rocblas_datatype execution_type)#
rocblas_status rocblas_dot_strided_batched_ex(rocblas_handle handle, rocblas_int n, const void *x, rocblas_datatype x_type, rocblas_int incx, rocblas_stride stride_x, const void *y, rocblas_datatype y_type, rocblas_int incy, rocblas_stride stride_y, rocblas_int batch_count, void *result, rocblas_datatype result_type, rocblas_datatype execution_type)#
- BLAS EX API

dot_strided_batched_ex performs a batch of dot products of vectors x and y:

result_i = x_i * y_i;

dotc_strided_batched_ex performs a batch of dot products of the conjugate of complex vector x and complex vector y:

result_i = conjugate(x_i) * y_i;

where (x_i, y_i) is the i-th instance of the batch, and x_i and y_i are vectors, for i = 1, …, batch_count.

Currently supported datatypes are as follows:

| x_type | y_type | result_type | execution_type |
|--------|--------|-------------|----------------|
| f16_r | f16_r | f16_r | f16_r |
| f16_r | f16_r | f16_r | f32_r |
| bf16_r | bf16_r | bf16_r | f32_r |
| f32_r | f32_r | f32_r | f32_r |
| f32_r | f32_r | f64_r | f64_r |
| f64_r | f64_r | f64_r | f64_r |
| f32_c | f32_c | f32_c | f32_c |
| f64_c | f64_c | f64_c | f64_c |

- Parameters:
- handle – [in] [rocblas_handle] handle to the rocblas library context queue. 
- n – [in] [rocblas_int] the number of elements in each x_i and y_i. 
- x – [in] device pointer to the first vector (x_1) in the batch. 
- x_type – [in] [rocblas_datatype] specifies the datatype of each vector x_i. 
- incx – [in] [rocblas_int] specifies the increment for the elements of each x_i. 
- stride_x – [in] [rocblas_stride] stride from the start of one vector (x_i) to the next one (x_i+1). 
- y – [in] device pointer to the first vector (y_1) in the batch. 
- y_type – [in] [rocblas_datatype] specifies the datatype of each vector y_i. 
- incy – [in] [rocblas_int] specifies the increment for the elements of each y_i. 
- stride_y – [in] [rocblas_stride] stride from the start of one vector (y_i) to the next one (y_i+1). 
- batch_count – [in] [rocblas_int] number of instances in the batch. 
- result – [inout] device array or host array of size batch_count to store the dot products of each batch. Returns 0.0 for each element if n <= 0. 
- result_type – [in] [rocblas_datatype] specifies the datatype of the result. 
- execution_type – [in] [rocblas_datatype] specifies the datatype of computation. 
 
 
dot_ex, dot_batched_ex, and dot_strided_batched_ex functions support the _64 interface. Refer to section ILP64 Interface.
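For reference, the strided batched real dot product can be sketched in plain C. This is a hypothetical helper for f32_r data with positive increments, not the library call:

```c
#include <stddef.h>

/* Host-side sketch of dot_strided_batched_ex for f32_r data:
   result[b] = sum_j x_b[j] * y_b[j] for each batch b. */
void dot_strided_batched_ref(int n,
                             const float *x, int incx, ptrdiff_t stridex,
                             const float *y, int incy, ptrdiff_t stridey,
                             int batch_count, float *result)
{
    for (int b = 0; b < batch_count; ++b) {
        float acc = 0.0f;   /* stays 0.0 when n <= 0, as the API documents */
        for (int j = 0; j < n; ++j)
            acc += x[b * stridex + j * incx] * y[b * stridey + j * incy];
        result[b] = acc;
    }
}
```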
rocblas_dotc_ex + batched, strided_batched#
rocblas_status rocblas_dotc_ex(rocblas_handle handle, rocblas_int n, const void *x, rocblas_datatype x_type, rocblas_int incx, const void *y, rocblas_datatype y_type, rocblas_int incy, void *result, rocblas_datatype result_type, rocblas_datatype execution_type)#
rocblas_status rocblas_dotc_batched_ex(rocblas_handle handle, rocblas_int n, const void *x, rocblas_datatype x_type, rocblas_int incx, const void *y, rocblas_datatype y_type, rocblas_int incy, rocblas_int batch_count, void *result, rocblas_datatype result_type, rocblas_datatype execution_type)#
rocblas_status rocblas_dotc_strided_batched_ex(rocblas_handle handle, rocblas_int n, const void *x, rocblas_datatype x_type, rocblas_int incx, rocblas_stride stride_x, const void *y, rocblas_datatype y_type, rocblas_int incy, rocblas_stride stride_y, rocblas_int batch_count, void *result, rocblas_datatype result_type, rocblas_datatype execution_type)#
- BLAS EX API

dot_strided_batched_ex performs a batch of dot products of vectors x and y:

result_i = x_i * y_i;

dotc_strided_batched_ex performs a batch of dot products of the conjugate of complex vector x and complex vector y:

result_i = conjugate(x_i) * y_i;

where (x_i, y_i) is the i-th instance of the batch, and x_i and y_i are vectors, for i = 1, …, batch_count.

Currently supported datatypes are as follows:

| x_type | y_type | result_type | execution_type |
|--------|--------|-------------|----------------|
| f16_r | f16_r | f16_r | f16_r |
| f16_r | f16_r | f16_r | f32_r |
| bf16_r | bf16_r | bf16_r | f32_r |
| f32_r | f32_r | f32_r | f32_r |
| f32_r | f32_r | f64_r | f64_r |
| f64_r | f64_r | f64_r | f64_r |
| f32_c | f32_c | f32_c | f32_c |
| f64_c | f64_c | f64_c | f64_c |

- Parameters:
- handle – [in] [rocblas_handle] handle to the rocblas library context queue. 
- n – [in] [rocblas_int] the number of elements in each x_i and y_i. 
- x – [in] device pointer to the first vector (x_1) in the batch. 
- x_type – [in] [rocblas_datatype] specifies the datatype of each vector x_i. 
- incx – [in] [rocblas_int] specifies the increment for the elements of each x_i. 
- stride_x – [in] [rocblas_stride] stride from the start of one vector (x_i) to the next one (x_i+1). 
- y – [in] device pointer to the first vector (y_1) in the batch. 
- y_type – [in] [rocblas_datatype] specifies the datatype of each vector y_i. 
- incy – [in] [rocblas_int] specifies the increment for the elements of each y_i. 
- stride_y – [in] [rocblas_stride] stride from the start of one vector (y_i) to the next one (y_i+1). 
- batch_count – [in] [rocblas_int] number of instances in the batch. 
- result – [inout] device array or host array of size batch_count to store the dot products of each batch. Returns 0.0 for each element if n <= 0. 
- result_type – [in] [rocblas_datatype] specifies the datatype of the result. 
- execution_type – [in] [rocblas_datatype] specifies the datatype of computation. 
 
 
dotc_ex, dotc_batched_ex, and dotc_strided_batched_ex functions support the _64 interface. Refer to section ILP64 Interface.
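The conjugated variant differs only in conjugating each element of x. Below is a sketch of the batched (array-of-pointers) form for f32_c data using C99 complex types; the helper name and layout are illustrative, not the library call:

```c
#include <complex.h>

/* Host-side sketch of dotc_batched_ex for f32_c data:
   result[b] = sum_j conj(x_b[j]) * y_b[j], with x and y given as
   arrays of batch_count pointers, one per vector. */
void dotc_batched_ref(int n,
                      const float complex *const x[],
                      const float complex *const y[],
                      int batch_count, float complex *result)
{
    for (int b = 0; b < batch_count; ++b) {
        float complex acc = 0.0f;
        for (int j = 0; j < n; ++j)
            acc += conjf(x[b][j]) * y[b][j];
        result[b] = acc;
    }
}
```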
rocblas_nrm2_ex + batched, strided_batched#
rocblas_status rocblas_nrm2_ex(rocblas_handle handle, rocblas_int n, const void *x, rocblas_datatype x_type, rocblas_int incx, void *results, rocblas_datatype result_type, rocblas_datatype execution_type)#
rocblas_status rocblas_nrm2_batched_ex(rocblas_handle handle, rocblas_int n, const void *x, rocblas_datatype x_type, rocblas_int incx, rocblas_int batch_count, void *results, rocblas_datatype result_type, rocblas_datatype execution_type)#
rocblas_status rocblas_nrm2_strided_batched_ex(rocblas_handle handle, rocblas_int n, const void *x, rocblas_datatype x_type, rocblas_int incx, rocblas_stride stride_x, rocblas_int batch_count, void *results, rocblas_datatype result_type, rocblas_datatype execution_type)#
- BLAS EX API

nrm2_strided_batched_ex computes the Euclidean norm over a batch of real or complex vectors:

result_i := sqrt( x_i' * x_i ) for real vectors x_i, for i = 1, ..., batch_count
result_i := sqrt( x_i**H * x_i ) for complex vectors x_i, for i = 1, ..., batch_count

Currently supported datatypes are as follows:

| x_type | result | execution_type |
|--------|--------|----------------|
| bf16_r | bf16_r | f32_r |
| f16_r | f16_r | f32_r |
| f32_r | f32_r | f32_r |
| f64_r | f64_r | f64_r |
| f32_c | f32_r | f32_r |
| f64_c | f64_r | f64_r |

- Parameters:
- handle – [in] [rocblas_handle] handle to the rocblas library context queue. 
- n – [in] [rocblas_int] number of elements in each x_i. 
- x – [in] device pointer to the first vector x_1. 
- x_type – [in] [rocblas_datatype] specifies the datatype of each vector x_i. 
- incx – [in] [rocblas_int] specifies the increment for the elements of each x_i. incx must be > 0. 
- stride_x – [in] [rocblas_stride] stride from the start of one vector (x_i) to the next one (x_i+1). There are no restrictions placed on stride_x. However, ensure that stride_x is of appropriate size. For a typical case, this means stride_x >= n * incx. 
- batch_count – [in] [rocblas_int] number of instances in the batch. 
- results – [out] device pointer or host pointer to an array for storing batch_count contiguous results. Returns 0.0 for each element if n <= 0 or incx <= 0. 
- result_type – [in] [rocblas_datatype] specifies the datatype of the result. 
- execution_type – [in] [rocblas_datatype] specifies the datatype of computation. 
 
 
nrm2_ex, nrm2_batched_ex, and nrm2_strided_batched_ex functions support the _64 interface. Refer to section ILP64 Interface.
rocblas_rot_ex + batched, strided_batched#
rocblas_status rocblas_rot_ex(rocblas_handle handle, rocblas_int n, void *x, rocblas_datatype x_type, rocblas_int incx, void *y, rocblas_datatype y_type, rocblas_int incy, const void *c, const void *s, rocblas_datatype cs_type, rocblas_datatype execution_type)#
rocblas_status rocblas_rot_batched_ex(rocblas_handle handle, rocblas_int n, void *x, rocblas_datatype x_type, rocblas_int incx, void *y, rocblas_datatype y_type, rocblas_int incy, const void *c, const void *s, rocblas_datatype cs_type, rocblas_int batch_count, rocblas_datatype execution_type)#
rocblas_status rocblas_rot_strided_batched_ex(rocblas_handle handle, rocblas_int n, void *x, rocblas_datatype x_type, rocblas_int incx, rocblas_stride stride_x, void *y, rocblas_datatype y_type, rocblas_int incy, rocblas_stride stride_y, const void *c, const void *s, rocblas_datatype cs_type, rocblas_int batch_count, rocblas_datatype execution_type)#
- BLAS Level 1 API

rot_strided_batched_ex applies the Givens rotation matrix defined by c=cos(alpha) and s=sin(alpha) to strided batched vectors x_i and y_i, for i = 1, …, batch_count. Scalars c and s may be stored in either host or device memory; the location is specified by calling rocblas_set_pointer_mode.

In the case where cs_type is real:

x := c * x + s * y
y := c * y - s * x

In the case where cs_type is complex, the imaginary part of c is ignored:

x := real(c) * x + s * y
y := real(c) * y - conj(s) * x

Currently supported datatypes are as follows:

| x_type | y_type | cs_type | execution_type |
|--------|--------|---------|----------------|
| bf16_r | bf16_r | bf16_r | f32_r |
| f16_r | f16_r | f16_r | f32_r |
| f32_r | f32_r | f32_r | f32_r |
| f64_r | f64_r | f64_r | f64_r |
| f32_c | f32_c | f32_c | f32_c |
| f32_c | f32_c | f32_r | f32_c |
| f64_c | f64_c | f64_c | f64_c |
| f64_c | f64_c | f64_r | f64_c |

- Parameters:
- handle – [in] [rocblas_handle] handle to the rocblas library context queue. 
- n – [in] [rocblas_int] number of elements in each x_i and y_i vectors. 
- x – [inout] device pointer to the first vector x_1. 
- x_type – [in] [rocblas_datatype] specifies the datatype of each vector x_i. 
- incx – [in] [rocblas_int] specifies the increment between elements of each x_i. 
- stride_x – [in] [rocblas_stride] specifies the increment from the beginning of x_i to the beginning of x_(i+1). 
- y – [inout] device pointer to the first vector y_1. 
- y_type – [in] [rocblas_datatype] specifies the datatype of each vector y_i. 
- incy – [in] [rocblas_int] specifies the increment between elements of each y_i. 
- stride_y – [in] [rocblas_stride] specifies the increment from the beginning of y_i to the beginning of y_(i+1). 
- c – [in] device pointer or host pointer to scalar cosine component of the rotation matrix. 
- s – [in] device pointer or host pointer to scalar sine component of the rotation matrix. 
- cs_type – [in] [rocblas_datatype] specifies the datatype of c and s. 
- batch_count – [in] [rocblas_int] the number of x and y arrays, the number of batches. 
- execution_type – [in] [rocblas_datatype] specifies the datatype of computation. 
 
 
rot_ex, rot_batched_ex, and rot_strided_batched_ex functions support the _64 interface. Refer to section ILP64 Interface.
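For the real cs_type case, the rotation applied to one vector pair can be sketched as follows (a hypothetical f32_r helper, not the library call):

```c
/* Host-side sketch of rot_ex with real c and s, applied in place:
   [x_j]   [ c  s ] [x_j]
   [y_j] = [-s  c ] [y_j]   for each element j. */
void rot_ref(int n, float *x, int incx, float *y, int incy, float c, float s)
{
    for (int j = 0; j < n; ++j) {
        float xj = x[j * incx];
        float yj = y[j * incy];
        x[j * incx] = c * xj + s * yj;
        y[j * incy] = c * yj - s * xj;
    }
}
```

The strided batched variant repeats this per batch instance, offsetting x and y by stride_x and stride_y.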
rocblas_scal_ex + batched, strided_batched#
rocblas_status rocblas_scal_ex(rocblas_handle handle, rocblas_int n, const void *alpha, rocblas_datatype alpha_type, void *x, rocblas_datatype x_type, rocblas_int incx, rocblas_datatype execution_type)#
rocblas_status rocblas_scal_batched_ex(rocblas_handle handle, rocblas_int n, const void *alpha, rocblas_datatype alpha_type, void *x, rocblas_datatype x_type, rocblas_int incx, rocblas_int batch_count, rocblas_datatype execution_type)#
rocblas_status rocblas_scal_strided_batched_ex(rocblas_handle handle, rocblas_int n, const void *alpha, rocblas_datatype alpha_type, void *x, rocblas_datatype x_type, rocblas_int incx, rocblas_stride stridex, rocblas_int batch_count, rocblas_datatype execution_type)#
- BLAS EX API

scal_strided_batched_ex scales each element of vector x with scalar alpha over a set of strided batched vectors:

x := alpha * x

Currently supported datatypes are as follows:

| alpha_type | x_type | execution_type |
|------------|--------|----------------|
| f32_r | bf16_r | f32_r |
| bf16_r | bf16_r | f32_r |
| f16_r | f16_r | f16_r |
| f16_r | f16_r | f32_r |
| f32_r | f16_r | f32_r |
| f32_r | f32_r | f32_r |
| f64_r | f64_r | f64_r |
| f32_c | f32_c | f32_c |
| f64_c | f64_c | f64_c |
| f32_r | f32_c | f32_c |
| f64_r | f64_c | f64_c |

- Parameters:
- handle – [in] [rocblas_handle] handle to the rocblas library context queue. 
- n – [in] [rocblas_int] the number of elements in x. 
- alpha – [in] device pointer or host pointer for the scalar alpha. 
- alpha_type – [in] [rocblas_datatype] specifies the datatype of alpha. 
- x – [inout] device pointer to the first vector x_1. 
- x_type – [in] [rocblas_datatype] specifies the datatype of each vector x_i. 
- incx – [in] [rocblas_int] specifies the increment for the elements of each x_i. 
- stridex – [in] [rocblas_stride] stride from the start of one vector (x_i) to the next one (x_i+1). There are no restrictions placed on stridex. However, ensure that stridex is of appropriate size. For a typical case this means stridex >= n * incx. 
- batch_count – [in] [rocblas_int] number of instances in the batch. 
- execution_type – [in] [rocblas_datatype] specifies the datatype of computation. 
 
 
scal_ex, scal_batched_ex, and scal_strided_batched_ex functions support the _64 interface. Refer to section ILP64 Interface.
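The scaling over strided batched vectors can be sketched in plain C (a hypothetical f32_r helper with positive increments, not the library call):

```c
#include <stddef.h>

/* Host-side sketch of scal_strided_batched_ex for f32_r data:
   x_b := alpha * x_b for each batch b, where vector b starts at
   x + b * stridex. */
void scal_strided_batched_ref(int n, float alpha, float *x, int incx,
                              ptrdiff_t stridex, int batch_count)
{
    for (int b = 0; b < batch_count; ++b)
        for (int j = 0; j < n; ++j)
            x[b * stridex + j * incx] *= alpha;
}
```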
rocblas_gemm_ex + batched, strided_batched#
rocblas_status rocblas_gemm_ex(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, rocblas_int k, const void *alpha, const void *a, rocblas_datatype a_type, rocblas_int lda, const void *b, rocblas_datatype b_type, rocblas_int ldb, const void *beta, const void *c, rocblas_datatype c_type, rocblas_int ldc, void *d, rocblas_datatype d_type, rocblas_int ldd, rocblas_datatype compute_type, rocblas_gemm_algo algo, int32_t solution_index, uint32_t flags)#
- BLAS EX API

gemm_ex performs one of the matrix-matrix operations:

D = alpha*op( A )*op( B ) + beta*C,

where op( X ) is one of

op( X ) = X or op( X ) = X**T or op( X ) = X**H,

alpha and beta are scalars, and A, B, C, and D are matrices, with op( A ) an m by k matrix, op( B ) a k by n matrix, and C and D m by n matrices. C and D may point to the same matrix if their parameters are identical.

Supported types are as follows:

- rocblas_datatype_f64_r = a_type = b_type = c_type = d_type = compute_type 
- rocblas_datatype_f32_r = a_type = b_type = c_type = d_type = compute_type 
- rocblas_datatype_f16_r = a_type = b_type = c_type = d_type = compute_type 
- rocblas_datatype_f16_r = a_type = b_type = c_type = d_type; rocblas_datatype_f32_r = compute_type 
- rocblas_datatype_f16_r = a_type = b_type; rocblas_datatype_f32_r = c_type = d_type = compute_type 
- rocblas_datatype_bf16_r = a_type = b_type = c_type = d_type; rocblas_datatype_f32_r = compute_type 
- rocblas_datatype_bf16_r = a_type = b_type; rocblas_datatype_f32_r = c_type = d_type = compute_type 
- rocblas_datatype_i8_r = a_type = b_type; rocblas_datatype_i32_r = c_type = d_type = compute_type 
- rocblas_datatype_f32_c = a_type = b_type = c_type = d_type = compute_type 
- rocblas_datatype_f64_c = a_type = b_type = c_type = d_type = compute_type 
- Although not widespread, some gemm kernels used by gemm_ex may use atomic operations. See Atomic Operations in the API Reference Guide for more information.
- Parameters:
- handle – [in] [rocblas_handle] handle to the rocblas library context queue. 
- transA – [in] [rocblas_operation] specifies the form of op( A ). 
- transB – [in] [rocblas_operation] specifies the form of op( B ). 
- m – [in] [rocblas_int] matrix dimension m. 
- n – [in] [rocblas_int] matrix dimension n. 
- k – [in] [rocblas_int] matrix dimension k. 
- alpha – [in] [const void *] device pointer or host pointer specifying the scalar alpha. Same datatype as compute_type. 
- a – [in] [void *] device pointer storing matrix A. 
- a_type – [in] [rocblas_datatype] specifies the datatype of matrix A. 
- lda – [in] [rocblas_int] specifies the leading dimension of A. 
- b – [in] [void *] device pointer storing matrix B. 
- b_type – [in] [rocblas_datatype] specifies the datatype of matrix B. 
- ldb – [in] [rocblas_int] specifies the leading dimension of B. 
- beta – [in] [const void *] device pointer or host pointer specifying the scalar beta. Same datatype as compute_type. 
- c – [in] [void *] device pointer storing matrix C. 
- c_type – [in] [rocblas_datatype] specifies the datatype of matrix C. 
- ldc – [in] [rocblas_int] specifies the leading dimension of C. 
- d – [out] [void *] device pointer storing matrix D. If d and c pointers are to the same matrix then d_type must equal c_type and ldd must equal ldc or the respective invalid status will be returned. 
- d_type – [in] [rocblas_datatype] specifies the datatype of matrix D. 
- ldd – [in] [rocblas_int] specifies the leading dimension of D. 
- compute_type – [in] [rocblas_datatype] specifies the datatype of computation. 
- algo – [in] [rocblas_gemm_algo] enumerant specifying the algorithm type. 
- solution_index – [in] [int32_t] if algo is rocblas_gemm_algo_solution_index, this controls which solution is used. When algo is not rocblas_gemm_algo_solution_index, or if solution_index <= 0, the default solution is used. This parameter was unused in previous releases; the default solution was always used instead. 
- flags – [in] [uint32_t] optional gemm flags. 
 
 
gemm_ex functions support the _64 interface. However, no arguments larger than (int32_t max value * 16) are currently supported. Refer to section ILP64 Interface.
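Ignoring transposes and mixed precision, the operation reduces to a triple loop over column-major matrices with leading dimensions. A plain-C reference sketch for the f32_r, non-transposed case (a hypothetical helper, not the library implementation):

```c
/* Host-side sketch of the math gemm_ex performs for f32_r data with
   transA = transB = none: D = alpha * A * B + beta * C, column-major,
   with element (row, col) of a matrix M of leading dimension ld at
   M[row + col * ld]. */
void gemm_ref(int m, int n, int k, float alpha,
              const float *A, int lda, const float *B, int ldb,
              float beta, const float *C, int ldc,
              float *D, int ldd)
{
    for (int col = 0; col < n; ++col) {
        for (int row = 0; row < m; ++row) {
            float acc = 0.0f;
            for (int p = 0; p < k; ++p)
                acc += A[row + p * lda] * B[p + col * ldb];
            D[row + col * ldd] = alpha * acc + beta * C[row + col * ldc];
        }
    }
}
```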
rocblas_status rocblas_gemm_batched_ex(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, rocblas_int k, const void *alpha, const void *a, rocblas_datatype a_type, rocblas_int lda, const void *b, rocblas_datatype b_type, rocblas_int ldb, const void *beta, const void *c, rocblas_datatype c_type, rocblas_int ldc, void *d, rocblas_datatype d_type, rocblas_int ldd, rocblas_int batch_count, rocblas_datatype compute_type, rocblas_gemm_algo algo, int32_t solution_index, uint32_t flags)#
- BLAS EX API

gemm_batched_ex performs one of the batched matrix-matrix operations:

D_i = alpha*op( A_i )*op( B_i ) + beta*C_i, for i = 1, …, batch_count,

where op( X ) is one of

op( X ) = X or op( X ) = X**T or op( X ) = X**H,

alpha and beta are scalars, and A, B, C, and D are batched pointers to matrices, with op( A ) an m by k by batch_count batched matrix, op( B ) a k by n by batch_count batched matrix, and C and D m by n by batch_count batched matrices. The batched matrices are an array of pointers to matrices; the number of pointers is batch_count. C and D may point to the same matrices if their parameters are identical.

Supported types are as follows:

- rocblas_datatype_f64_r = a_type = b_type = c_type = d_type = compute_type 
- rocblas_datatype_f32_r = a_type = b_type = c_type = d_type = compute_type 
- rocblas_datatype_f16_r = a_type = b_type = c_type = d_type = compute_type 
- rocblas_datatype_f16_r = a_type = b_type = c_type = d_type; rocblas_datatype_f32_r = compute_type 
- rocblas_datatype_bf16_r = a_type = b_type = c_type = d_type; rocblas_datatype_f32_r = compute_type 
- rocblas_datatype_i8_r = a_type = b_type; rocblas_datatype_i32_r = c_type = d_type = compute_type 
- rocblas_datatype_f32_c = a_type = b_type = c_type = d_type = compute_type 
- rocblas_datatype_f64_c = a_type = b_type = c_type = d_type = compute_type 
 - Parameters:
- handle – [in] [rocblas_handle] handle to the rocblas library context queue. 
- transA – [in] [rocblas_operation] specifies the form of op( A ). 
- transB – [in] [rocblas_operation] specifies the form of op( B ). 
- m – [in] [rocblas_int] matrix dimension m. 
- n – [in] [rocblas_int] matrix dimension n. 
- k – [in] [rocblas_int] matrix dimension k. 
- alpha – [in] [const void *] device pointer or host pointer specifying the scalar alpha. Same datatype as compute_type. 
- a – [in] [void *] device pointer storing array of pointers to each matrix A_i. 
- a_type – [in] [rocblas_datatype] specifies the datatype of each matrix A_i. 
- lda – [in] [rocblas_int] specifies the leading dimension of each A_i. 
- b – [in] [void *] device pointer storing array of pointers to each matrix B_i. 
- b_type – [in] [rocblas_datatype] specifies the datatype of each matrix B_i. 
- ldb – [in] [rocblas_int] specifies the leading dimension of each B_i. 
- beta – [in] [const void *] device pointer or host pointer specifying the scalar beta. Same datatype as compute_type. 
- c – [in] [void *] device array of device pointers to each matrix C_i. 
- c_type – [in] [rocblas_datatype] specifies the datatype of each matrix C_i. 
- ldc – [in] [rocblas_int] specifies the leading dimension of each C_i. 
- d – [out] [void *] device array of device pointers to each matrix D_i. If d and c are the same array of matrix pointers then d_type must equal c_type and ldd must equal ldc or the respective invalid status will be returned. 
- d_type – [in] [rocblas_datatype] specifies the datatype of each matrix D_i. 
- ldd – [in] [rocblas_int] specifies the leading dimension of each D_i. 
- batch_count – [in] [rocblas_int] number of gemm operations in the batch. 
- compute_type – [in] [rocblas_datatype] specifies the datatype of computation. 
- algo – [in] [rocblas_gemm_algo] enumerant specifying the algorithm type. 
- solution_index – [in] [int32_t] if algo is rocblas_gemm_algo_solution_index, this controls which solution is used. When algo is not rocblas_gemm_algo_solution_index, or if solution_index <= 0, the default solution is used. This parameter was unused in previous releases; the default solution was always used instead. 
- flags – [in] [uint32_t] optional gemm flags. 
 
 
gemm_batched_ex functions support the _64 interface. Currently, only the batch_count parameter may be larger than (int32_t max value * 16). Refer to section ILP64 Interface.
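The batched variant consumes arrays of pointers: entry i of a, b, c, and d addresses the i-th matrix of the batch. A minimal sketch of that layout, using hypothetical 1x1 matrices so only the indexing is shown:

```c
/* Sketch of the array-of-pointers layout gemm_batched_ex expects:
   one independent gemm per pointer entry. Matrices here are 1x1, so the
   per-matrix work collapses to a scalar multiply-add. */
void gemm_batched_1x1_ref(int batch_count, float alpha,
                          const float *const a[], const float *const b[],
                          float beta, const float *const c[],
                          float *const d[])
{
    for (int i = 0; i < batch_count; ++i)   /* one gemm per batch entry */
        d[i][0] = alpha * a[i][0] * b[i][0] + beta * c[i][0];
}
```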
rocblas_status rocblas_gemm_strided_batched_ex(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, rocblas_int k, const void *alpha, const void *a, rocblas_datatype a_type, rocblas_int lda, rocblas_stride stride_a, const void *b, rocblas_datatype b_type, rocblas_int ldb, rocblas_stride stride_b, const void *beta, const void *c, rocblas_datatype c_type, rocblas_int ldc, rocblas_stride stride_c, void *d, rocblas_datatype d_type, rocblas_int ldd, rocblas_stride stride_d, rocblas_int batch_count, rocblas_datatype compute_type, rocblas_gemm_algo algo, int32_t solution_index, uint32_t flags)#
- BLAS EX API

gemm_strided_batched_ex performs one of the strided batched matrix-matrix operations:

D_i = alpha*op( A_i )*op( B_i ) + beta*C_i, for i = 1, ..., batch_count,

where op( X ) is one of

op( X ) = X or op( X ) = X**T or op( X ) = X**H,

alpha and beta are scalars, and A, B, C, and D are strided batched matrices, with op( A ) an m by k by batch_count strided batched matrix, op( B ) a k by n by batch_count strided batched matrix, and C and D m by n by batch_count strided batched matrices. The strided batched matrices are multiple matrices separated by a constant stride; the number of matrices is batch_count. C and D may point to the same matrices if their parameters are identical.

Supported types are as follows:

- rocblas_datatype_f64_r = a_type = b_type = c_type = d_type = compute_type 
- rocblas_datatype_f32_r = a_type = b_type = c_type = d_type = compute_type 
- rocblas_datatype_f16_r = a_type = b_type = c_type = d_type = compute_type 
- rocblas_datatype_f16_r = a_type = b_type = c_type = d_type; rocblas_datatype_f32_r = compute_type 
- rocblas_datatype_bf16_r = a_type = b_type = c_type = d_type; rocblas_datatype_f32_r = compute_type 
- rocblas_datatype_i8_r = a_type = b_type; rocblas_datatype_i32_r = c_type = d_type = compute_type 
- rocblas_datatype_f32_c = a_type = b_type = c_type = d_type = compute_type 
- rocblas_datatype_f64_c = a_type = b_type = c_type = d_type = compute_type 
 - Parameters:
- handle – [in] [rocblas_handle] handle to the rocblas library context queue. 
- transA – [in] [rocblas_operation] specifies the form of op( A ). 
- transB – [in] [rocblas_operation] specifies the form of op( B ). 
- m – [in] [rocblas_int] matrix dimension m. 
- n – [in] [rocblas_int] matrix dimension n. 
- k – [in] [rocblas_int] matrix dimension k. 
- alpha – [in] [const void *] device pointer or host pointer specifying the scalar alpha. Same datatype as compute_type. 
- a – [in] [void *] device pointer pointing to first matrix A_1. 
- a_type – [in] [rocblas_datatype] specifies the datatype of each matrix A_i. 
- lda – [in] [rocblas_int] specifies the leading dimension of each A_i. 
- stride_a – [in] [rocblas_stride] specifies stride from start of one A_i matrix to the next A_(i + 1). 
- b – [in] [void *] device pointer pointing to first matrix B_1. 
- b_type – [in] [rocblas_datatype] specifies the datatype of each matrix B_i. 
- ldb – [in] [rocblas_int] specifies the leading dimension of each B_i. 
- stride_b – [in] [rocblas_stride] specifies stride from start of one B_i matrix to the next B_(i + 1). 
- beta – [in] [const void *] device pointer or host pointer specifying the scalar beta. Same datatype as compute_type. 
- c – [in] [void *] device pointer pointing to first matrix C_1. 
- c_type – [in] [rocblas_datatype] specifies the datatype of each matrix C_i. 
- ldc – [in] [rocblas_int] specifies the leading dimension of each C_i. 
- stride_c – [in] [rocblas_stride] specifies stride from start of one C_i matrix to the next C_(i + 1). 
- d – [out] [void *] device pointer storing each matrix D_i. If d and c pointers are to the same matrix then d_type must equal c_type and ldd must equal ldc and stride_d must equal stride_c or the respective invalid status will be returned. 
- d_type – [in] [rocblas_datatype] specifies the datatype of each matrix D_i. 
- ldd – [in] [rocblas_int] specifies the leading dimension of each D_i. 
- stride_d – [in] [rocblas_stride] specifies stride from start of one D_i matrix to the next D_(i + 1). 
- batch_count – [in] [rocblas_int] number of gemm operations in the batch. 
- compute_type – [in] [rocblas_datatype] specifies the datatype of computation. 
- algo – [in] [rocblas_gemm_algo] enumerant specifying the algorithm type. 
- solution_index – [in] [int32_t] if algo is rocblas_gemm_algo_solution_index, this controls which solution is used. When algo is not rocblas_gemm_algo_solution_index, or if solution_index <= 0, the default solution is used. This parameter was unused in previous releases; the default solution was always used instead. 
- flags – [in] [uint32_t] optional gemm flags. 
 
 
gemm_strided_batched_ex functions support the _64 interface. Currently, only the batch_count parameter may be larger than (int32_t max value * 16). Refer to section ILP64 Interface.
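In the strided batched layout, matrix i of a batch begins at a fixed offset from the base pointer. The addressing can be sketched with a hypothetical helper for f32_r, column-major data:

```c
#include <stddef.h>

/* Element (row, col) of batch-member i in a strided batched, column-major
   matrix with leading dimension ld and a constant stride between members:
   base[i * stride + row + col * ld]. */
static inline float strided_batched_elem(const float *base, ptrdiff_t stride,
                                         int i, int row, int col, int ld)
{
    return base[(ptrdiff_t)i * stride + row + (ptrdiff_t)col * ld];
}
```

For example, with a non-transposed A, a fully packed layout uses stride_a = lda * k, so member i+1 begins immediately after member i.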
rocblas_trsm_ex + batched, strided_batched#
rocblas_status rocblas_trsm_ex(rocblas_handle handle, rocblas_side side, rocblas_fill uplo, rocblas_operation transA, rocblas_diagonal diag, rocblas_int m, rocblas_int n, const void *alpha, const void *A, rocblas_int lda, void *B, rocblas_int ldb, const void *invA, rocblas_int invA_size, rocblas_datatype compute_type)#
rocblas_status rocblas_trsm_batched_ex(rocblas_handle handle, rocblas_side side, rocblas_fill uplo, rocblas_operation transA, rocblas_diagonal diag, rocblas_int m, rocblas_int n, const void *alpha, const void *A, rocblas_int lda, void *B, rocblas_int ldb, rocblas_int batch_count, const void *invA, rocblas_int invA_size, rocblas_datatype compute_type)#
rocblas_status rocblas_trsm_strided_batched_ex(rocblas_handle handle, rocblas_side side, rocblas_fill uplo, rocblas_operation transA, rocblas_diagonal diag, rocblas_int m, rocblas_int n, const void *alpha, const void *A, rocblas_int lda, rocblas_stride stride_A, void *B, rocblas_int ldb, rocblas_stride stride_B, rocblas_int batch_count, const void *invA, rocblas_int invA_size, rocblas_stride stride_invA, rocblas_datatype compute_type)#
- BLAS EX API

trsm_strided_batched_ex solves

op( A_i )*X_i = alpha*B_i or X_i*op( A_i ) = alpha*B_i,

for i = 1, …, batch_count, where alpha is a scalar, X and B are strided batched m by n matrices, A is a strided batched triangular matrix, and op( A_i ) is one of

op( A_i ) = A_i or op( A_i ) = A_i^T or op( A_i ) = A_i^H.

Each matrix X_i is overwritten on B_i.

This function gives the user the ability to reuse each invA_i matrix between runs. If invA == NULL, rocblas_trsm_strided_batched_ex will automatically calculate each invA_i on every run.

Setting up invA: each accepted invA_i matrix consists of the packed 128x128 inverses of the diagonal blocks of matrix A_i, followed by any smaller diagonal block that remains. To set up invA_i, it is recommended that rocblas_trtri_batched be used with matrix A_i as the input. invA is a contiguous piece of memory holding each invA_i.

Device memory of size 128 x k should be allocated for each invA_i ahead of time, where k is m when rocblas_side_left and n when rocblas_side_right. The actual number of elements in each invA_i should be passed as invA_size.

To begin, rocblas_trtri_batched must be called on the full 128x128-sized diagonal blocks of each matrix A_i with the following restricted parameters:

- n = 128 
- ldinvA = 128 
- stride_invA = 128x128 
- batch_count = k / 128 
 - Then any remaining block may be added: - n = k % 128 
- invA = invA + stride_invA * previous_batch_count 
- ldinvA = 128 
- batch_count = 1 
 - Parameters:
- handle – [in] [rocblas_handle] handle to the rocblas library context queue. 
- side – [in] [rocblas_side] - rocblas_side_left: op(A)*X = alpha*B 
- rocblas_side_right: X*op(A) = alpha*B 
 
- uplo – [in] [rocblas_fill] - rocblas_fill_upper: each A_i is an upper triangular matrix. 
- rocblas_fill_lower: each A_i is a lower triangular matrix. 
 
- transA – [in] [rocblas_operation] - rocblas_operation_none: op(A) = A. 
- rocblas_operation_transpose: op(A) = A^T 
- rocblas_operation_conjugate_transpose: op(A) = A^H 
 
- diag – [in] [rocblas_diagonal] - rocblas_diagonal_unit: each A_i is assumed to be unit triangular. 
- rocblas_diagonal_non_unit: each A_i is not assumed to be unit triangular. 
 
- m – [in] [rocblas_int] m specifies the number of rows of each B_i. m >= 0. 
- n – [in] [rocblas_int] n specifies the number of columns of each B_i. n >= 0. 
- alpha – [in] [void *] device pointer or host pointer specifying the scalar alpha. When alpha is zero, A is not referenced and B need not be set before entry. 
- A – [in] [void *] device pointer storing matrix A, of dimension ( lda, k ), where k is m when side == rocblas_side_left and n when side == rocblas_side_right. Only the upper/lower triangular part is accessed. 
- lda – [in] [rocblas_int] lda specifies the first dimension of A. - if side = rocblas_side_left, lda >= max( 1, m ), if side = rocblas_side_right, lda >= max( 1, n ). 
- stride_A – [in] [rocblas_stride] The stride between each A matrix. 
- B – [inout] [void *] device pointer pointing to the first matrix B_1. Each B_i is of dimension ( ldb, n ). Before entry, the leading m by n part of each B_i must contain the right-hand side matrix, and on exit it is overwritten by the solution matrix X_i. 
- ldb – [in] [rocblas_int] ldb specifies the first dimension of each B_i. ldb >= max( 1, m ). 
- stride_B – [in] [rocblas_stride] The stride between each B_i matrix. 
- batch_count – [in] [rocblas_int] specifies how many batches. 
- invA – [in] [void *] device pointer storing the inverse diagonal blocks of each A_i. invA points to the first matrix invA_1. Each invA_i is of dimension ( ld_invA, k ), where k is m when side == rocblas_side_left and n when side == rocblas_side_right. ld_invA must be equal to 128. 
- invA_size – [in] [rocblas_int] invA_size specifies the number of elements of device memory in each invA_i. 
- stride_invA – [in] [rocblas_stride] The stride between each invA matrix. 
- compute_type – [in] [rocblas_datatype] specifies the datatype of computation. 
 
 
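The invA sizing rules above reduce to simple host-side arithmetic. The sketch below (names are illustrative, not part of the rocBLAS API) computes the per-matrix allocation size and the parameters of the two rocblas_trtri_batched calls for a given k:

```cpp
#include <cassert>
#include <cstddef>

// Host-side helper sketching the invA setup rules described above.
// The struct and function names are illustrative only.
struct InvASetup {
    std::size_t invA_elements;    // 128 x k elements to allocate per invA_i
    int full_blocks;              // batch_count for the first trtri call
    int remainder_n;              // n for the trailing trtri call (0 if none)
    std::size_t remainder_offset; // element offset of the remainder block
};

InvASetup plan_invA(int k) {
    InvASetup s;
    s.invA_elements = 128u * static_cast<std::size_t>(k); // device memory per invA_i
    s.full_blocks = k / 128;             // number of full 128x128 diagonal blocks
    s.remainder_n = k % 128;             // size of any trailing diagonal block
    // stride_invA = 128x128 for the full blocks, so the remainder starts at:
    s.remainder_offset = static_cast<std::size_t>(128 * 128) * s.full_blocks;
    return s;
}
```

For example, with k = 300 this yields two full 128x128 blocks followed by one 44x44 remainder block.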
rocblas_Xgeam + batched, strided_batched#
- 
rocblas_status rocblas_sgeam(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, const float *alpha, const float *A, rocblas_int lda, const float *beta, const float *B, rocblas_int ldb, float *C, rocblas_int ldc)#
- 
rocblas_status rocblas_dgeam(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, const double *alpha, const double *A, rocblas_int lda, const double *beta, const double *B, rocblas_int ldb, double *C, rocblas_int ldc)#
- 
rocblas_status rocblas_cgeam(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, const rocblas_float_complex *alpha, const rocblas_float_complex *A, rocblas_int lda, const rocblas_float_complex *beta, const rocblas_float_complex *B, rocblas_int ldb, rocblas_float_complex *C, rocblas_int ldc)#
- 
rocblas_status rocblas_zgeam(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, const rocblas_double_complex *alpha, const rocblas_double_complex *A, rocblas_int lda, const rocblas_double_complex *beta, const rocblas_double_complex *B, rocblas_int ldb, rocblas_double_complex *C, rocblas_int ldc)#
- BLAS Level 3 API - geam performs one of the matrix-matrix operations: - C = alpha*op( A ) + beta*op( B ), where op( X ) is one of op( X ) = X or op( X ) = X**T or op( X ) = X**H, alpha and beta are scalars, and A, B and C are matrices, with op( A ) an m by n matrix, op( B ) an m by n matrix, and C an m by n matrix. - Parameters:
- handle – [in] [rocblas_handle] handle to the rocblas library context queue. 
- transA – [in] [rocblas_operation] specifies the form of op( A ). 
- transB – [in] [rocblas_operation] specifies the form of op( B ). 
- m – [in] [rocblas_int] matrix dimension m. 
- n – [in] [rocblas_int] matrix dimension n. 
- alpha – [in] device pointer or host pointer specifying the scalar alpha. 
- A – [in] device pointer storing matrix A. 
- lda – [in] [rocblas_int] specifies the leading dimension of A. 
- beta – [in] device pointer or host pointer specifying the scalar beta. 
- B – [in] device pointer storing matrix B. 
- ldb – [in] [rocblas_int] specifies the leading dimension of B. 
- C – [inout] device pointer storing matrix C. 
- ldc – [in] [rocblas_int] specifies the leading dimension of C. 
 
 
The geam functions support the _64 interface. Refer to section ILP64 Interface.
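As a cross-check of the semantics above, the following is a minimal host-side, column-major reference of what sgeam computes. It is illustrative only, not the rocBLAS implementation; the bool flags stand in for rocblas_operation_none vs. rocblas_operation_transpose:

```cpp
#include <cassert>
#include <vector>

// Column-major reference for C = alpha*op(A) + beta*op(B).
// transA/transB select op(X) = X (false) or op(X) = X^T (true).
void geam_ref(bool transA, bool transB, int m, int n,
              float alpha, const float* A, int lda,
              float beta, const float* B, int ldb,
              float* C, int ldc) {
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i) {
            // op(X)(i,j) is X(i,j) untransposed, X(j,i) transposed
            float a = transA ? A[j + i * lda] : A[i + j * lda];
            float b = transB ? B[j + i * ldb] : B[i + j * ldb];
            C[i + j * ldc] = alpha * a + beta * b;
        }
}
```

Note that, unlike gemm, C may not alias A or B unless the operation degenerates (the rocBLAS in-place rules are documented with the function itself).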
- 
rocblas_status rocblas_sgeam_batched(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, const float *alpha, const float *const A[], rocblas_int lda, const float *beta, const float *const B[], rocblas_int ldb, float *const C[], rocblas_int ldc, rocblas_int batch_count)#
- 
rocblas_status rocblas_dgeam_batched(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, const double *alpha, const double *const A[], rocblas_int lda, const double *beta, const double *const B[], rocblas_int ldb, double *const C[], rocblas_int ldc, rocblas_int batch_count)#
- 
rocblas_status rocblas_cgeam_batched(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, const rocblas_float_complex *alpha, const rocblas_float_complex *const A[], rocblas_int lda, const rocblas_float_complex *beta, const rocblas_float_complex *const B[], rocblas_int ldb, rocblas_float_complex *const C[], rocblas_int ldc, rocblas_int batch_count)#
- 
rocblas_status rocblas_zgeam_batched(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, const rocblas_double_complex *alpha, const rocblas_double_complex *const A[], rocblas_int lda, const rocblas_double_complex *beta, const rocblas_double_complex *const B[], rocblas_int ldb, rocblas_double_complex *const C[], rocblas_int ldc, rocblas_int batch_count)#
- BLAS Level 3 API - geam_batched performs one of the batched matrix-matrix operations: - C_i = alpha*op( A_i ) + beta*op( B_i ) for i = 0, 1, ... batch_count - 1, where alpha and beta are scalars, and op(A_i), op(B_i) and C_i are m by n matrices and op( X ) is one of op( X ) = X or op( X ) = X**T - Parameters:
- handle – [in] [rocblas_handle] handle to the rocblas library context queue. 
- transA – [in] [rocblas_operation] specifies the form of op( A ). 
- transB – [in] [rocblas_operation] specifies the form of op( B ). 
- m – [in] [rocblas_int] matrix dimension m. 
- n – [in] [rocblas_int] matrix dimension n. 
- alpha – [in] device pointer or host pointer specifying the scalar alpha. 
- A – [in] device array of device pointers storing each matrix A_i on the GPU. Each A_i is of dimension ( lda, k ), where k is m when transA == rocblas_operation_none and is n when transA == rocblas_operation_transpose. 
- lda – [in] [rocblas_int] specifies the leading dimension of A. 
- beta – [in] device pointer or host pointer specifying the scalar beta. 
- B – [in] device array of device pointers storing each matrix B_i on the GPU. Each B_i is of dimension ( ldb, k ), where k is m when transB == rocblas_operation_none and is n when transB == rocblas_operation_transpose. 
- ldb – [in] [rocblas_int] specifies the leading dimension of B. 
- C – [inout] device array of device pointers storing each matrix C_i on the GPU. Each C_i is of dimension ( ldc, n ). 
- ldc – [in] [rocblas_int] specifies the leading dimension of C. 
- batch_count – [in] [rocblas_int] number of instances i in the batch. 
 
 
The geam_batched functions support the _64 interface. Refer to section ILP64 Interface.
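The distinguishing feature of the *_batched APIs is that each matrix argument is an array of pointers, one per problem instance. The sketch below shows that layout with host vectors standing in for device allocations (illustrative only; in real use the pointer array itself must also reside on the device):

```cpp
#include <cassert>
#include <vector>

// Build the per-instance pointer array a *_batched API expects.
// Host memory stands in for device memory here.
std::vector<const float*> make_batch_pointers(
        const std::vector<std::vector<float>>& instances) {
    std::vector<const float*> ptrs;
    ptrs.reserve(instances.size());
    for (const auto& inst : instances)
        ptrs.push_back(inst.data()); // entry i addresses matrix A_i
    return ptrs;                     // ptrs.size() becomes batch_count
}
```

This contrasts with the *_strided_batched variants, where one base pointer plus a fixed stride addresses every instance.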
- 
rocblas_status rocblas_sgeam_strided_batched(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, const float *alpha, const float *A, rocblas_int lda, rocblas_stride stride_A, const float *beta, const float *B, rocblas_int ldb, rocblas_stride stride_B, float *C, rocblas_int ldc, rocblas_stride stride_C, rocblas_int batch_count)#
- 
rocblas_status rocblas_dgeam_strided_batched(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, const double *alpha, const double *A, rocblas_int lda, rocblas_stride stride_A, const double *beta, const double *B, rocblas_int ldb, rocblas_stride stride_B, double *C, rocblas_int ldc, rocblas_stride stride_C, rocblas_int batch_count)#
- 
rocblas_status rocblas_cgeam_strided_batched(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, const rocblas_float_complex *alpha, const rocblas_float_complex *A, rocblas_int lda, rocblas_stride stride_A, const rocblas_float_complex *beta, const rocblas_float_complex *B, rocblas_int ldb, rocblas_stride stride_B, rocblas_float_complex *C, rocblas_int ldc, rocblas_stride stride_C, rocblas_int batch_count)#
- 
rocblas_status rocblas_zgeam_strided_batched(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, const rocblas_double_complex *alpha, const rocblas_double_complex *A, rocblas_int lda, rocblas_stride stride_A, const rocblas_double_complex *beta, const rocblas_double_complex *B, rocblas_int ldb, rocblas_stride stride_B, rocblas_double_complex *C, rocblas_int ldc, rocblas_stride stride_C, rocblas_int batch_count)#
- BLAS Level 3 API - geam_strided_batched performs one of the batched matrix-matrix operations: - C_i = alpha*op( A_i ) + beta*op( B_i ) for i = 0, 1, ... batch_count - 1, where alpha and beta are scalars, and op(A_i), op(B_i) and C_i are m by n matrices and op( X ) is one of op( X ) = X or op( X ) = X**T - Parameters:
- handle – [in] [rocblas_handle] handle to the rocblas library context queue. 
- transA – [in] [rocblas_operation] specifies the form of op( A ). 
- transB – [in] [rocblas_operation] specifies the form of op( B ). 
- m – [in] [rocblas_int] matrix dimension m. 
- n – [in] [rocblas_int] matrix dimension n. 
- alpha – [in] device pointer or host pointer specifying the scalar alpha. 
- A – [in] device pointer to the first matrix A_0 on the GPU. Each A_i is of dimension ( lda, k ), where k is m when transA == rocblas_operation_none and is n when transA == rocblas_operation_transpose. 
- lda – [in] [rocblas_int] specifies the leading dimension of A. 
- stride_A – [in] [rocblas_stride] stride from the start of one matrix (A_i) to the next one (A_i+1). 
- beta – [in] device pointer or host pointer specifying the scalar beta. 
- B – [in] pointer to the first matrix B_0 on the GPU. Each B_i is of dimension ( ldb, k ), where k is m when transB == rocblas_operation_none and is n when transB == rocblas_operation_transpose. 
- ldb – [in] [rocblas_int] specifies the leading dimension of B. 
- stride_B – [in] [rocblas_stride] stride from the start of one matrix (B_i) to the next one (B_i+1). 
- C – [inout] pointer to the first matrix C_0 on the GPU. Each C_i is of dimension ( ldc, n ). 
- ldc – [in] [rocblas_int] specifies the leading dimension of C. 
- stride_C – [in] [rocblas_stride] stride from the start of one matrix (C_i) to the next one (C_i+1). 
- batch_count – [in] [rocblas_int] number of instances i in the batch. 
 
 
The geam_strided_batched functions support the _64 interface. Refer to section ILP64 Interface.
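In the strided-batched layout a single allocation holds every instance, and instance i begins a fixed number of elements after instance i-1. A host-side sketch of the addressing (illustrative; in real use these are device pointers):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Address of the i-th instance in a strided batch: A_i = A + i * stride.
// To keep instances from overlapping, the stride should be at least
// lda times the number of columns of each A_i.
const float* strided_instance(const float* A0, long long stride, int i) {
    return A0 + static_cast<std::ptrdiff_t>(stride) * i;
}
```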
rocblas_geam_ex#
- 
rocblas_status rocblas_geam_ex(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, rocblas_int k, const void *alpha, const void *A, rocblas_datatype a_type, rocblas_int lda, const void *B, rocblas_datatype b_type, rocblas_int ldb, const void *beta, const void *C, rocblas_datatype c_type, rocblas_int ldc, void *D, rocblas_datatype d_type, rocblas_int ldd, rocblas_datatype compute_type, rocblas_geam_ex_operation geam_ex_op)#
- BLAS EX API - geam_ex performs one of the matrix-matrix operations:- Dij = min(alpha * (Aik + Bkj), beta * Cij) Dij = min(alpha * Aik, alpha * Bkj) + beta * Cij - where alpha and beta are scalars, and A, B, C, and D are matrices, with op( A ) an m by k matrix, op( B ) a k by n matrix, and C and D m by n matrices. C and D may point to the same matrix if their type and leading dimensions are identical. - Aik refers to the element at the i-th row and k-th column of op( A ), Bkj refers to the element at the k-th row and j-th column of op( B ), and Cij/Dij refers to the element at the i-th row and j-th column of C/D. - Supported types are as follows: - rocblas_datatype_f64_r = a_type = b_type = c_type = d_type = compute_type 
- rocblas_datatype_f32_r = a_type = b_type = c_type = d_type = compute_type 
- rocblas_datatype_f16_r = a_type = b_type = c_type = d_type = compute_type 
 - If transA == rocblas_operation_none, must have lda >= max(1, m); otherwise, must have lda >= max(1, k). - If transB == rocblas_operation_none, must have ldb >= max(1, k); otherwise, must have ldb >= max(1, n). - Parameters:
- handle – [in] [rocblas_handle] handle to the rocblas library context queue. 
- transA – [in] [rocblas_operation] specifies the form of op( A ). 
- transB – [in] [rocblas_operation] specifies the form of op( B ). 
- m – [in] [rocblas_int] matrix dimension m. 
- n – [in] [rocblas_int] matrix dimension n. 
- k – [in] [rocblas_int] matrix dimension k. 
- alpha – [in] [const void *] device pointer or host pointer specifying the scalar alpha. Same datatype as compute_type. 
- A – [in] [void *] device pointer storing matrix A. 
- a_type – [in] [rocblas_datatype] specifies the datatype of matrix A. 
- lda – [in] [rocblas_int] specifies the leading dimension of A 
- B – [in] [void *] device pointer storing matrix B. 
- b_type – [in] [rocblas_datatype] specifies the datatype of matrix B. 
- ldb – [in] [rocblas_int] specifies the leading dimension of B 
- beta – [in] [const void *] device pointer or host pointer specifying the scalar beta. Same datatype as compute_type. 
- C – [in] [void *] device pointer storing matrix C. 
- c_type – [in] [rocblas_datatype] specifies the datatype of matrix C. 
- ldc – [in] [rocblas_int] specifies the leading dimension of C, must have ldc >= max(1, m). 
- D – [out] [void *] device pointer storing matrix D. If D and C pointers are to the same matrix then d_type must equal c_type and ldd must equal ldc or the respective invalid status will be returned. 
- d_type – [in] [rocblas_datatype] specifies the datatype of matrix D. 
- ldd – [in] [rocblas_int] specifies the leading dimension of D, must have ldd >= max(1, m). 
- compute_type – [in] [rocblas_datatype] specifies the datatype of computation. 
- geam_ex_op – [in] [rocblas_geam_ex_operation] enumerant specifying the operation type; rocblas_geam_ex_operation_min_plus and rocblas_geam_ex_operation_plus_min are supported. 
 
 
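The min-plus variant can be cross-checked with a small host-side reference. This sketch is illustrative only and assumes column-major storage, no transposes, and that the min in Dij = min(alpha * (Aik + Bkj), beta * Cij) also reduces over the k index (a tropical matrix product folded with beta * C):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Host-side sketch of the min-plus geam_ex operation described above.
void geam_min_plus_ref(int m, int n, int k, float alpha,
                       const float* A, int lda, const float* B, int ldb,
                       float beta, const float* C, int ldc,
                       float* D, int ldd) {
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i) {
            float acc = beta * C[i + j * ldc];
            for (int p = 0; p < k; ++p)
                acc = std::min(acc, alpha * (A[i + p * lda] + B[p + j * ldb]));
            D[i + j * ldd] = acc;
        }
}
```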
rocblas_Xdgmm + batched, strided_batched#
- 
rocblas_status rocblas_sdgmm(rocblas_handle handle, rocblas_side side, rocblas_int m, rocblas_int n, const float *A, rocblas_int lda, const float *x, rocblas_int incx, float *C, rocblas_int ldc)#
- 
rocblas_status rocblas_ddgmm(rocblas_handle handle, rocblas_side side, rocblas_int m, rocblas_int n, const double *A, rocblas_int lda, const double *x, rocblas_int incx, double *C, rocblas_int ldc)#
- 
rocblas_status rocblas_cdgmm(rocblas_handle handle, rocblas_side side, rocblas_int m, rocblas_int n, const rocblas_float_complex *A, rocblas_int lda, const rocblas_float_complex *x, rocblas_int incx, rocblas_float_complex *C, rocblas_int ldc)#
- 
rocblas_status rocblas_zdgmm(rocblas_handle handle, rocblas_side side, rocblas_int m, rocblas_int n, const rocblas_double_complex *A, rocblas_int lda, const rocblas_double_complex *x, rocblas_int incx, rocblas_double_complex *C, rocblas_int ldc)#
- BLAS Level 3 API - dgmm performs one of the matrix-matrix operations: - C = A * diag(x) if side == rocblas_side_right C = diag(x) * A if side == rocblas_side_left, where C and A are m by n matrices. diag( x ) is a diagonal matrix and x is a vector of dimension n if side == rocblas_side_right and of dimension m if side == rocblas_side_left. - Parameters:
- handle – [in] [rocblas_handle] handle to the rocblas library context queue. 
- side – [in] [rocblas_side] specifies the side of diag(x). 
- m – [in] [rocblas_int] matrix dimension m. 
- n – [in] [rocblas_int] matrix dimension n. 
- A – [in] device pointer storing matrix A. 
- lda – [in] [rocblas_int] specifies the leading dimension of A. 
- x – [in] device pointer storing vector x. 
- incx – [in] [rocblas_int] specifies the increment between values of x 
- C – [inout] device pointer storing matrix C. 
- ldc – [in] [rocblas_int] specifies the leading dimension of C. 
 
 
The dgmm functions support the _64 interface. Refer to section ILP64 Interface.
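The two side variants amount to scaling columns or rows of A. A minimal host-side, column-major reference (illustrative only, assuming a positive incx):

```cpp
#include <cassert>
#include <vector>

// Column-major reference for dgmm:
//   side_right: C = A * diag(x)  -> column j of A scaled by x[j*incx]
//   side_left : C = diag(x) * A  -> row i of A scaled by x[i*incx]
void dgmm_ref(bool side_right, int m, int n,
              const float* A, int lda, const float* x, int incx,
              float* C, int ldc) {
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i) {
            float s = side_right ? x[j * incx] : x[i * incx];
            C[i + j * ldc] = A[i + j * lda] * s;
        }
}
```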
- 
rocblas_status rocblas_sdgmm_batched(rocblas_handle handle, rocblas_side side, rocblas_int m, rocblas_int n, const float *const A[], rocblas_int lda, const float *const x[], rocblas_int incx, float *const C[], rocblas_int ldc, rocblas_int batch_count)#
- 
rocblas_status rocblas_ddgmm_batched(rocblas_handle handle, rocblas_side side, rocblas_int m, rocblas_int n, const double *const A[], rocblas_int lda, const double *const x[], rocblas_int incx, double *const C[], rocblas_int ldc, rocblas_int batch_count)#
- 
rocblas_status rocblas_cdgmm_batched(rocblas_handle handle, rocblas_side side, rocblas_int m, rocblas_int n, const rocblas_float_complex *const A[], rocblas_int lda, const rocblas_float_complex *const x[], rocblas_int incx, rocblas_float_complex *const C[], rocblas_int ldc, rocblas_int batch_count)#
- 
rocblas_status rocblas_zdgmm_batched(rocblas_handle handle, rocblas_side side, rocblas_int m, rocblas_int n, const rocblas_double_complex *const A[], rocblas_int lda, const rocblas_double_complex *const x[], rocblas_int incx, rocblas_double_complex *const C[], rocblas_int ldc, rocblas_int batch_count)#
- BLAS Level 3 API - dgmm_batched performs one of the batched matrix-matrix operations: - C_i = A_i * diag(x_i) for i = 0, 1, ... batch_count-1 if side == rocblas_side_right C_i = diag(x_i) * A_i for i = 0, 1, ... batch_count-1 if side == rocblas_side_left, where C_i and A_i are m by n matrices. diag(x_i) is a diagonal matrix and x_i is a vector of dimension n if side == rocblas_side_right and of dimension m if side == rocblas_side_left. - Parameters:
- handle – [in] [rocblas_handle] handle to the rocblas library context queue. 
- side – [in] [rocblas_side] specifies the side of diag(x). 
- m – [in] [rocblas_int] matrix dimension m. 
- n – [in] [rocblas_int] matrix dimension n. 
- A – [in] device array of device pointers storing each matrix A_i on the GPU. Each A_i is of dimension ( lda, n ). 
- lda – [in] [rocblas_int] specifies the leading dimension of A_i. 
- x – [in] device array of device pointers storing each vector x_i on the GPU. Each x_i is of dimension n if side == rocblas_side_right and dimension m if side == rocblas_side_left. 
- incx – [in] [rocblas_int] specifies the increment between values of x_i. 
- C – [inout] device array of device pointers storing each matrix C_i on the GPU. Each C_i is of dimension ( ldc, n ). 
- ldc – [in] [rocblas_int] specifies the leading dimension of C_i. 
- batch_count – [in] [rocblas_int] number of instances in the batch. 
 
 
The dgmm_batched functions support the _64 interface. Refer to section ILP64 Interface.
- 
rocblas_status rocblas_sdgmm_strided_batched(rocblas_handle handle, rocblas_side side, rocblas_int m, rocblas_int n, const float *A, rocblas_int lda, rocblas_stride stride_A, const float *x, rocblas_int incx, rocblas_stride stride_x, float *C, rocblas_int ldc, rocblas_stride stride_C, rocblas_int batch_count)#
- 
rocblas_status rocblas_ddgmm_strided_batched(rocblas_handle handle, rocblas_side side, rocblas_int m, rocblas_int n, const double *A, rocblas_int lda, rocblas_stride stride_A, const double *x, rocblas_int incx, rocblas_stride stride_x, double *C, rocblas_int ldc, rocblas_stride stride_C, rocblas_int batch_count)#
- 
rocblas_status rocblas_cdgmm_strided_batched(rocblas_handle handle, rocblas_side side, rocblas_int m, rocblas_int n, const rocblas_float_complex *A, rocblas_int lda, rocblas_stride stride_A, const rocblas_float_complex *x, rocblas_int incx, rocblas_stride stride_x, rocblas_float_complex *C, rocblas_int ldc, rocblas_stride stride_C, rocblas_int batch_count)#
- 
rocblas_status rocblas_zdgmm_strided_batched(rocblas_handle handle, rocblas_side side, rocblas_int m, rocblas_int n, const rocblas_double_complex *A, rocblas_int lda, rocblas_stride stride_A, const rocblas_double_complex *x, rocblas_int incx, rocblas_stride stride_x, rocblas_double_complex *C, rocblas_int ldc, rocblas_stride stride_C, rocblas_int batch_count)#
- BLAS Level 3 API - dgmm_strided_batched performs one of the batched matrix-matrix operations: - C_i = A_i * diag(x_i) if side == rocblas_side_right for i = 0, 1, ... batch_count-1 C_i = diag(x_i) * A_i if side == rocblas_side_left for i = 0, 1, ... batch_count-1, where C_i and A_i are m by n matrices. diag(x_i) is a diagonal matrix and x_i is a vector of dimension n if side == rocblas_side_right and of dimension m if side == rocblas_side_left. - Parameters:
- handle – [in] [rocblas_handle] handle to the rocblas library context queue. 
- side – [in] [rocblas_side] specifies the side of diag(x). 
- m – [in] [rocblas_int] matrix dimension m. 
- n – [in] [rocblas_int] matrix dimension n. 
- A – [in] device pointer to the first matrix A_0 on the GPU. Each A_i is of dimension ( lda, n ). 
- lda – [in] [rocblas_int] specifies the leading dimension of A. 
- stride_A – [in] [rocblas_stride] stride from the start of one matrix (A_i) to the next one (A_i+1). 
- x – [in] pointer to the first vector x_0 on the GPU. Each x_i is of dimension n if side == rocblas_side_right and dimension m if side == rocblas_side_left. 
- incx – [in] [rocblas_int] specifies the increment between values of x. 
- stride_x – [in] [rocblas_stride] stride from the start of one vector (x_i) to the next one (x_i+1). 
- C – [inout] device pointer to the first matrix C_0 on the GPU. Each C_i is of dimension ( ldc, n ). 
- ldc – [in] [rocblas_int] specifies the leading dimension of C. 
- stride_C – [in] [rocblas_stride] stride from the start of one matrix (C_i) to the next one (C_i+1). 
- batch_count – [in] [rocblas_int] number of instances i in the batch. 
 
 
The dgmm_strided_batched functions support the _64 interface. Refer to section ILP64 Interface.
rocblas_Xgemmt + batched, strided_batched#
- 
rocblas_status rocblas_sgemmt(rocblas_handle handle, rocblas_fill uplo, rocblas_operation transA, rocblas_operation transB, rocblas_int n, rocblas_int k, const float *alpha, const float *A, rocblas_int lda, const float *B, rocblas_int ldb, const float *beta, float *C, rocblas_int ldc)#
- 
rocblas_status rocblas_dgemmt(rocblas_handle handle, rocblas_fill uplo, rocblas_operation transA, rocblas_operation transB, rocblas_int n, rocblas_int k, const double *alpha, const double *A, rocblas_int lda, const double *B, rocblas_int ldb, const double *beta, double *C, rocblas_int ldc)#
- 
rocblas_status rocblas_cgemmt(rocblas_handle handle, rocblas_fill uplo, rocblas_operation transA, rocblas_operation transB, rocblas_int n, rocblas_int k, const rocblas_float_complex *alpha, const rocblas_float_complex *A, rocblas_int lda, const rocblas_float_complex *B, rocblas_int ldb, const rocblas_float_complex *beta, rocblas_float_complex *C, rocblas_int ldc)#
- 
rocblas_status rocblas_zgemmt(rocblas_handle handle, rocblas_fill uplo, rocblas_operation transA, rocblas_operation transB, rocblas_int n, rocblas_int k, const rocblas_double_complex *alpha, const rocblas_double_complex *A, rocblas_int lda, const rocblas_double_complex *B, rocblas_int ldb, const rocblas_double_complex *beta, rocblas_double_complex *C, rocblas_int ldc)#
- BLAS Level 3 API - gemmt performs matrix-matrix operations and updates the upper or lower triangular part of the result matrix: - C = alpha*op( A )*op( B ) + beta*C, where op( X ) is one of op( X ) = X or op( X ) = X**T or op( X ) = X**H, alpha and beta are scalars. A, B are general matrices and C is either an upper or lower triangular matrix, with op( A ) an n by k matrix, op( B ) a k by n matrix and C an n by n matrix. - Parameters:
- handle – [in] [rocblas_handle] handle to the rocblas library context queue. 
- uplo – [in] [rocblas_fill] - rocblas_fill_upper: C is an upper triangular matrix 
- rocblas_fill_lower: C is a lower triangular matrix 
 
- transA – [in] [rocblas_operation] - rocblas_operation_none: op(A) = A. 
- rocblas_operation_transpose: op(A) = A^T 
- rocblas_operation_conjugate_transpose: op(A) = A^H 
 
- transB – [in] [rocblas_operation] - rocblas_operation_none: op(B) = B. 
- rocblas_operation_transpose: op(B) = B^T 
- rocblas_operation_conjugate_transpose: op(B) = B^H 
 
- n – [in] [rocblas_int] number of rows of matrices op( A ), columns of op( B ), and (rows, columns) of C. 
- k – [in] [rocblas_int] number of rows of matrices op( B ) and columns of op( A ). 
- alpha – [in] device pointer or host pointer specifying the scalar alpha. 
- A – [in] device pointer storing matrix A. If transA == rocblas_operation_none, the leading n-by-k part of the array contains the matrix A; otherwise the leading k-by-n part of the array contains the matrix A. 
- lda – [in] [rocblas_int] specifies the leading dimension of A. If transA == rocblas_operation_none, must have lda >= max(1, n), otherwise, must have lda >= max(1, k). 
- B – [in] device pointer storing matrix B. If transB == rocblas_operation_none, the leading k-by-n part of the array contains the matrix B; otherwise the leading n-by-k part of the array contains the matrix B. 
- ldb – [in] [rocblas_int] specifies the leading dimension of B. If transB == rocblas_operation_none, must have ldb >= max(1, k), otherwise, must have ldb >= max(1, n) 
- beta – [in] device pointer or host pointer specifying the scalar beta. 
- C – [inout] device pointer storing matrix C on the GPU. If uplo == rocblas_fill_upper, the upper triangular part of the leading n-by-n array contains the matrix C, otherwise the lower triangular part of the leading n-by-n array contains the matrix C. 
- ldc – [in] [rocblas_int] specifies the leading dimension of C. Must have ldc >= max(1, n). 
 
 
The gemmt functions support the _64 interface. Refer to section ILP64 Interface.
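What distinguishes gemmt from gemm is that only one triangle of C is updated. A minimal host-side reference for the upper-fill, no-transpose case (column-major, illustrative only):

```cpp
#include <cassert>
#include <vector>

// Column-major reference for C = alpha*A*B + beta*C restricted to the
// upper triangle of C; the strictly lower part is left untouched.
void gemmt_upper_ref(int n, int k, float alpha,
                     const float* A, int lda, const float* B, int ldb,
                     float beta, float* C, int ldc) {
    for (int j = 0; j < n; ++j)
        for (int i = 0; i <= j; ++i) {          // upper triangle only
            float acc = 0.0f;
            for (int p = 0; p < k; ++p)
                acc += A[i + p * lda] * B[p + j * ldb];
            C[i + j * ldc] = alpha * acc + beta * C[i + j * ldc];
        }
}
```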
- 
rocblas_status rocblas_sgemmt_batched(rocblas_handle handle, rocblas_fill uplo, rocblas_operation transA, rocblas_operation transB, rocblas_int n, rocblas_int k, const float *alpha, const float *const A[], rocblas_int lda, const float *const B[], rocblas_int ldb, const float *beta, float *const C[], rocblas_int ldc, rocblas_int batch_count)#
- 
rocblas_status rocblas_dgemmt_batched(rocblas_handle handle, rocblas_fill uplo, rocblas_operation transA, rocblas_operation transB, rocblas_int n, rocblas_int k, const double *alpha, const double *const A[], rocblas_int lda, const double *const B[], rocblas_int ldb, const double *beta, double *const C[], rocblas_int ldc, rocblas_int batch_count)#
- 
rocblas_status rocblas_cgemmt_batched(rocblas_handle handle, rocblas_fill uplo, rocblas_operation transA, rocblas_operation transB, rocblas_int n, rocblas_int k, const rocblas_float_complex *alpha, const rocblas_float_complex *const A[], rocblas_int lda, const rocblas_float_complex *const B[], rocblas_int ldb, const rocblas_float_complex *beta, rocblas_float_complex *const C[], rocblas_int ldc, rocblas_int batch_count)#
- 
rocblas_status rocblas_zgemmt_batched(rocblas_handle handle, rocblas_fill uplo, rocblas_operation transA, rocblas_operation transB, rocblas_int n, rocblas_int k, const rocblas_double_complex *alpha, const rocblas_double_complex *const A[], rocblas_int lda, const rocblas_double_complex *const B[], rocblas_int ldb, const rocblas_double_complex *beta, rocblas_double_complex *const C[], rocblas_int ldc, rocblas_int batch_count)#
- BLAS Level 3 API - gemmt_batched performs matrix-matrix operations and updates the upper or lower triangular part of each result matrix: - C_i = alpha*op( A_i )*op( B_i ) + beta*C_i, for i = 1, ..., batch_count, where op( X ) is one of op( X ) = X or op( X ) = X**T or op( X ) = X**H, and alpha and beta are scalars. Each A_i and B_i is a general matrix and each C_i is either an upper or lower triangular matrix, with op( A_i ) an n by k matrix, op( B_i ) a k by n matrix, and C_i an n by n matrix. - Parameters:
- handle – [in] [rocblas_handle] handle to the rocblas library context queue. 
- uplo – [in] [rocblas_fill] - rocblas_fill_upper: C is an upper triangular matrix 
- rocblas_fill_lower: C is a lower triangular matrix 
 
- transA – [in] [rocblas_operation] - rocblas_operation_none: op(A_i) = A_i. 
- rocblas_operation_transpose: op(A_i) = A_i^T 
- rocblas_operation_conjugate_transpose: op(A_i) = A_i^H 
 
- transB – [in] [rocblas_operation] - rocblas_operation_none: op(B_i) = B_i. 
- rocblas_operation_transpose: op(B_i) = B_i^T 
- rocblas_operation_conjugate_transpose: op(B_i) = B_i^H 
 
- n – [in] [rocblas_int] number of rows of matrices op( A_i ), columns of op( B_i ), and (rows, columns) of C_i. 
- k – [in] [rocblas_int] number of rows of matrices op( B_i ) and columns of op( A_i ). 
- alpha – [in] device pointer or host pointer specifying the scalar alpha. 
- A – [in] device array of device pointers storing each matrix A_i. If transA == rocblas_operation_none, the leading n-by-k part of the array contains each matrix A_i; otherwise the leading k-by-n part of the array contains each matrix A_i. 
- lda – [in] [rocblas_int] specifies the leading dimension of each A_i. If transA == rocblas_operation_none, must have lda >= max(1, n), otherwise, must have lda >= max(1, k). 
- B – [in] device array of device pointers storing each matrix B_i. If transB == rocblas_operation_none, the leading k-by-n part of the array contains each matrix B_i; otherwise the leading n-by-k part of the array contains each matrix B_i. 
- ldb – [in] [rocblas_int] specifies the leading dimension of each B_i. If transB == rocblas_operation_none, must have ldb >= max(1, k), otherwise, must have ldb >= max(1, n). 
- beta – [in] device pointer or host pointer specifying the scalar beta. 
- C – [inout] device array of device pointers storing each matrix C_i. If uplo == rocblas_fill_upper, the upper triangular part of the leading n-by-n array contains each matrix C_i, otherwise the lower triangular part of the leading n-by-n array contains each matrix C_i. 
- ldc – [in] [rocblas_int] specifies the leading dimension of each C_i. Must have ldc >= max(1, n). 
- batch_count – [in] [rocblas_int] number of gemm operations in the batch. 
 
 
The gemmt_batched functions support the _64 interface. Refer to section ILP64 Interface.
- 
rocblas_status rocblas_sgemmt_strided_batched(rocblas_handle handle, rocblas_fill uplo, rocblas_operation transA, rocblas_operation transB, rocblas_int n, rocblas_int k, const float *alpha, const float *A, rocblas_int lda, rocblas_stride stride_a, const float *B, rocblas_int ldb, rocblas_stride stride_b, const float *beta, float *C, rocblas_int ldc, rocblas_stride stride_c, rocblas_int batch_count)#
- 
rocblas_status rocblas_dgemmt_strided_batched(rocblas_handle handle, rocblas_fill uplo, rocblas_operation transA, rocblas_operation transB, rocblas_int n, rocblas_int k, const double *alpha, const double *A, rocblas_int lda, rocblas_stride stride_a, const double *B, rocblas_int ldb, rocblas_stride stride_b, const double *beta, double *C, rocblas_int ldc, rocblas_stride stride_c, rocblas_int batch_count)#
- 
rocblas_status rocblas_cgemmt_strided_batched(rocblas_handle handle, rocblas_fill uplo, rocblas_operation transA, rocblas_operation transB, rocblas_int n, rocblas_int k, const rocblas_float_complex *alpha, const rocblas_float_complex *A, rocblas_int lda, rocblas_stride stride_a, const rocblas_float_complex *B, rocblas_int ldb, rocblas_stride stride_b, const rocblas_float_complex *beta, rocblas_float_complex *C, rocblas_int ldc, rocblas_stride stride_c, rocblas_int batch_count)#
- 
rocblas_status rocblas_zgemmt_strided_batched(rocblas_handle handle, rocblas_fill uplo, rocblas_operation transA, rocblas_operation transB, rocblas_int n, rocblas_int k, const rocblas_double_complex *alpha, const rocblas_double_complex *A, rocblas_int lda, rocblas_stride stride_a, const rocblas_double_complex *B, rocblas_int ldb, rocblas_stride stride_b, const rocblas_double_complex *beta, rocblas_double_complex *C, rocblas_int ldc, rocblas_stride stride_c, rocblas_int batch_count)#
- BLAS Level 3 API - gemmt_strided_batched performs matrix-matrix operations and updates only the upper or lower triangular part of each result matrix: - C_i = alpha*op( A_i )*op( B_i ) + beta*C_i, for i = 1, ..., batch_count, where op( X ) is one of op( X ) = X, op( X ) = X**T, or op( X ) = X**H, and alpha and beta are scalars. A_i and B_i are general matrices and each C_i is either an upper or a lower triangular matrix, with op( A ) an n by k by batch_count strided_batched matrix, op( B ) a k by n by batch_count strided_batched matrix, and C an n by n by batch_count strided_batched matrix. - Parameters:
- handle – [in] [rocblas_handle] handle to the rocblas library context queue. 
- uplo – [in] [rocblas_fill] - rocblas_fill_upper: C is an upper triangular matrix 
- rocblas_fill_lower: C is a lower triangular matrix 
 
- transA – [in] [rocblas_operation] - rocblas_operation_none: op(A_i) = A_i. 
- rocblas_operation_transpose: op(A_i) = A_i^T 
- rocblas_operation_conjugate_transpose: op(A_i) = A_i^H 
 
- transB – [in] [rocblas_operation] - rocblas_operation_none: op(B_i) = B_i. 
- rocblas_operation_transpose: op(B_i) = B_i^T 
- rocblas_operation_conjugate_transpose: op(B_i) = B_i^H 
 
- n – [in] [rocblas_int] number of rows of matrices op( A_i ), columns of op( B_i ), and (rows, columns) of C_i. 
- k – [in] [rocblas_int] number of rows of matrices op( B_i ) and columns of op( A_i ). 
- alpha – [in] device pointer or host pointer specifying the scalar alpha. 
- A – [in] device pointer to the first matrix A_1. If transA == rocblas_operation_none, the leading n-by-k part of the array contains each matrix A_i, otherwise the leading k-by-n part of the array contains each matrix A_i. 
- lda – [in] [rocblas_int] specifies the leading dimension of each A_i. If transA == rocblas_operation_none, must have lda >= max(1, n), otherwise, must have lda >= max(1, k). 
- stride_a – [in] [rocblas_stride] stride from the start of one A_i matrix to the next A_(i + 1). 
- B – [in] device pointer to the first matrix B_1. If transB == rocblas_operation_none, the leading k-by-n part of the array contains each matrix B_i, otherwise the leading n-by-k part of the array contains each matrix B_i. 
- ldb – [in] [rocblas_int] specifies the leading dimension of each B_i. If transB == rocblas_operation_none, must have ldb >= max(1, k), otherwise, must have ldb >= max(1, n). 
- stride_b – [in] [rocblas_stride] stride from the start of one B_i matrix to the next B_(i + 1). 
- beta – [in] device pointer or host pointer specifying the scalar beta. 
- C – [inout] device pointer to the first matrix C_1. If uplo == rocblas_fill_upper, the upper triangular part of the leading n-by-n array contains each matrix C_i, otherwise the lower triangular part of the leading n-by-n array contains each matrix C_i. 
- ldc – [in] [rocblas_int] specifies the leading dimension of each C_i. Must have ldc >= max(1, n). 
- stride_c – [in] [rocblas_stride] stride from the start of one C_i matrix to the next C_(i + 1). 
- batch_count – [in] [rocblas_int] number of gemm operations in the batch. 
 
 
The gemmt_strided_batched functions support the _64 interface. Refer to section ILP64 Interface.