This page contains proposed changes for a future release of ROCm. Read the latest Linux release of ROCm documentation for your production environments.

rocBLAS Extension

Contents

rocBLAS Extension#

Level-1 Extension functions support the ILP64 API. For more information on these _64 functions, refer to section ILP64 Interface.

rocblas_axpy_ex + batched, strided_batched#

rocblas_status rocblas_axpy_ex(rocblas_handle handle, rocblas_int n, const void *alpha, rocblas_datatype alpha_type, const void *x, rocblas_datatype x_type, rocblas_int incx, void *y, rocblas_datatype y_type, rocblas_int incy, rocblas_datatype execution_type)#
rocblas_status rocblas_axpy_batched_ex(rocblas_handle handle, rocblas_int n, const void *alpha, rocblas_datatype alpha_type, const void *x, rocblas_datatype x_type, rocblas_int incx, void *y, rocblas_datatype y_type, rocblas_int incy, rocblas_int batch_count, rocblas_datatype execution_type)#
rocblas_status rocblas_axpy_strided_batched_ex(rocblas_handle handle, rocblas_int n, const void *alpha, rocblas_datatype alpha_type, const void *x, rocblas_datatype x_type, rocblas_int incx, rocblas_stride stridex, void *y, rocblas_datatype y_type, rocblas_int incy, rocblas_stride stridey, rocblas_int batch_count, rocblas_datatype execution_type)#

BLAS EX API

axpy_strided_batched_ex computes constant alpha multiplied by vector x, plus vector y over a set of strided batched vectors.

y := alpha * x + y
Currently supported datatypes are as follows:

alpha_type

x_type

y_type

execution_type

bf16_r

bf16_r

bf16_r

f32_r

f32_r

bf16_r

bf16_r

f32_r

f16_r

f16_r

f16_r

f16_r

f16_r

f16_r

f16_r

f32_r

f32_r

f16_r

f16_r

f32_r

f32_r

f32_r

f32_r

f32_r

f64_r

f64_r

f64_r

f64_r

f32_c

f32_c

f32_c

f32_c

f64_c

f64_c

f64_c

f64_c

Parameters:
  • handle[in] [rocblas_handle] handle to the rocblas library context queue.

  • n[in] [rocblas_int] the number of elements in each x_i and y_i.

  • alpha[in] device pointer or host pointer to specify the scalar alpha.

  • alpha_type[in] [rocblas_datatype] specifies the datatype of alpha.

  • x[in] device pointer to the first vector x_1.

  • x_type[in] [rocblas_datatype] specifies the datatype of each vector x_i.

  • incx[in] [rocblas_int] specifies the increment for the elements of each x_i.

  • stridex[in] [rocblas_stride] stride from the start of one vector (x_i) to the next one (x_i+1). There are no restrictions placed on stridex. However, ensure that stridex is of appropriate size. For a typical case this means stridex >= n * incx.

  • y[inout] device pointer to the first vector y_1.

  • y_type[in] [rocblas_datatype] specifies the datatype of each vector y_i.

  • incy[in] [rocblas_int] specifies the increment for the elements of each y_i.

  • stridey[in] [rocblas_stride] stride from the start of one vector (y_i) to the next one (y_i+1). There are no restrictions placed on stridey. However, ensure that stridey is of appropriate size. For a typical case this means stridey >= n * incy.

  • batch_count[in] [rocblas_int] number of instances in the batch.

  • execution_type[in] [rocblas_datatype] specifies the datatype of computation.

axpy_ex, axpy_batched_ex, and axpy_strided_batched_ex functions support the _64 interface. Refer to section ILP64 Interface.

rocblas_dot_ex + batched, strided_batched#

rocblas_status rocblas_dot_ex(rocblas_handle handle, rocblas_int n, const void *x, rocblas_datatype x_type, rocblas_int incx, const void *y, rocblas_datatype y_type, rocblas_int incy, void *result, rocblas_datatype result_type, rocblas_datatype execution_type)#
rocblas_status rocblas_dot_batched_ex(rocblas_handle handle, rocblas_int n, const void *x, rocblas_datatype x_type, rocblas_int incx, const void *y, rocblas_datatype y_type, rocblas_int incy, rocblas_int batch_count, void *result, rocblas_datatype result_type, rocblas_datatype execution_type)#
rocblas_status rocblas_dot_strided_batched_ex(rocblas_handle handle, rocblas_int n, const void *x, rocblas_datatype x_type, rocblas_int incx, rocblas_stride stride_x, const void *y, rocblas_datatype y_type, rocblas_int incy, rocblas_stride stride_y, rocblas_int batch_count, void *result, rocblas_datatype result_type, rocblas_datatype execution_type)#

BLAS EX API

dot_strided_batched_ex performs a batch of dot products of vectors x and y.

result_i = x_i * y_i;
dotc_strided_batched_ex performs a batch of dot products of the conjugate of complex vector x and complex vector y
result_i = conjugate (x_i) * y_i;
where (x_i, y_i) is the i-th instance of the batch. x_i and y_i are vectors, for i = 1, …, batch_count

Currently supported datatypes are as follows:

x_type

y_type

result_type

execution_type

f16_r

f16_r

f16_r

f16_r

f16_r

f16_r

f16_r

f32_r

bf16_r

bf16_r

bf16_r

f32_r

f32_r

f32_r

f32_r

f32_r

f32_r

f32_r

f64_r

f64_r

f64_r

f64_r

f64_r

f64_r

f32_c

f32_c

f32_c

f32_c

f64_c

f64_c

f64_c

f64_c

Parameters:
  • handle[in] [rocblas_handle] handle to the rocblas library context queue.

  • n[in] [rocblas_int] the number of elements in each x_i and y_i.

  • x[in] device pointer to the first vector (x_1) in the batch.

  • x_type[in] [rocblas_datatype] specifies the datatype of each vector x_i.

  • incx[in] [rocblas_int] specifies the increment for the elements of each x_i.

  • stride_x[in] [rocblas_stride] stride from the start of one vector (x_i) and the next one (x_i+1)

  • y[in] device pointer to the first vector (y_1) in the batch.

  • y_type[in] [rocblas_datatype] specifies the datatype of each vector y_i.

  • incy[in] [rocblas_int] specifies the increment for the elements of each y_i.

  • stride_y[in] [rocblas_stride] stride from the start of one vector (y_i) and the next one (y_i+1)

  • batch_count[in] [rocblas_int] number of instances in the batch.

  • result[inout] device array or host array of batch_count size to store the dot products of each batch. return 0.0 for each element if n <= 0.

  • result_type[in] [rocblas_datatype] specifies the datatype of the result.

  • execution_type[in] [rocblas_datatype] specifies the datatype of computation.

dot_ex, dot_batched_ex, and dot_strided_batched_ex functions support the _64 interface. Refer to section ILP64 Interface.

rocblas_dotc_ex + batched, strided_batched#

rocblas_status rocblas_dotc_ex(rocblas_handle handle, rocblas_int n, const void *x, rocblas_datatype x_type, rocblas_int incx, const void *y, rocblas_datatype y_type, rocblas_int incy, void *result, rocblas_datatype result_type, rocblas_datatype execution_type)#
rocblas_status rocblas_dotc_batched_ex(rocblas_handle handle, rocblas_int n, const void *x, rocblas_datatype x_type, rocblas_int incx, const void *y, rocblas_datatype y_type, rocblas_int incy, rocblas_int batch_count, void *result, rocblas_datatype result_type, rocblas_datatype execution_type)#
rocblas_status rocblas_dotc_strided_batched_ex(rocblas_handle handle, rocblas_int n, const void *x, rocblas_datatype x_type, rocblas_int incx, rocblas_stride stride_x, const void *y, rocblas_datatype y_type, rocblas_int incy, rocblas_stride stride_y, rocblas_int batch_count, void *result, rocblas_datatype result_type, rocblas_datatype execution_type)#

BLAS EX API

dot_strided_batched_ex performs a batch of dot products of vectors x and y.

result_i = x_i * y_i;
dotc_strided_batched_ex performs a batch of dot products of the conjugate of complex vector x and complex vector y
result_i = conjugate (x_i) * y_i;
where (x_i, y_i) is the i-th instance of the batch. x_i and y_i are vectors, for i = 1, …, batch_count

Currently supported datatypes are as follows:

x_type

y_type

result_type

execution_type

f16_r

f16_r

f16_r

f16_r

f16_r

f16_r

f16_r

f32_r

bf16_r

bf16_r

bf16_r

f32_r

f32_r

f32_r

f32_r

f32_r

f32_r

f32_r

f64_r

f64_r

f64_r

f64_r

f64_r

f64_r

f32_c

f32_c

f32_c

f32_c

f64_c

f64_c

f64_c

f64_c

Parameters:
  • handle[in] [rocblas_handle] handle to the rocblas library context queue.

  • n[in] [rocblas_int] the number of elements in each x_i and y_i.

  • x[in] device pointer to the first vector (x_1) in the batch.

  • x_type[in] [rocblas_datatype] specifies the datatype of each vector x_i.

  • incx[in] [rocblas_int] specifies the increment for the elements of each x_i.

  • stride_x[in] [rocblas_stride] stride from the start of one vector (x_i) and the next one (x_i+1)

  • y[in] device pointer to the first vector (y_1) in the batch.

  • y_type[in] [rocblas_datatype] specifies the datatype of each vector y_i.

  • incy[in] [rocblas_int] specifies the increment for the elements of each y_i.

  • stride_y[in] [rocblas_stride] stride from the start of one vector (y_i) and the next one (y_i+1)

  • batch_count[in] [rocblas_int] number of instances in the batch.

  • result[inout] device array or host array of batch_count size to store the dot products of each batch. return 0.0 for each element if n <= 0.

  • result_type[in] [rocblas_datatype] specifies the datatype of the result.

  • execution_type[in] [rocblas_datatype] specifies the datatype of computation.

dotc_ex, dotc_batched_ex, and dotc_strided_batched_ex functions support the _64 interface. Refer to section ILP64 Interface.

rocblas_nrm2_ex + batched, strided_batched#

rocblas_status rocblas_nrm2_ex(rocblas_handle handle, rocblas_int n, const void *x, rocblas_datatype x_type, rocblas_int incx, void *results, rocblas_datatype result_type, rocblas_datatype execution_type)#
rocblas_status rocblas_nrm2_batched_ex(rocblas_handle handle, rocblas_int n, const void *x, rocblas_datatype x_type, rocblas_int incx, rocblas_int batch_count, void *results, rocblas_datatype result_type, rocblas_datatype execution_type)#
rocblas_status rocblas_nrm2_strided_batched_ex(rocblas_handle handle, rocblas_int n, const void *x, rocblas_datatype x_type, rocblas_int incx, rocblas_stride stride_x, rocblas_int batch_count, void *results, rocblas_datatype result_type, rocblas_datatype execution_type)#

BLAS_EX API.

nrm2_strided_batched_ex computes the euclidean norm over a batch of real or complex vectors.

result := sqrt( x_i'*x_i ) for real vectors x, for i = 1, ..., batch_count
result := sqrt( x_i**H*x_i ) for complex vectors, for i = 1, ..., batch_count
Currently supported datatypes are as follows:

x_type

result

execution_type

bf16_r

bf16_r

f32_r

f16_r

f16_r

f32_r

f32_r

f32_r

f32_r

f64_r

f64_r

f64_r

f32_c

f32_r

f32_r

f64_c

f64_r

f64_r

Parameters:
  • handle[in] [rocblas_handle] handle to the rocblas library context queue.

  • n[in] [rocblas_int] number of elements in each x_i.

  • x[in] device pointer to the first vector x_1.

  • x_type[in] [rocblas_datatype] specifies the datatype of each vector x_i.

  • incx[in] [rocblas_int] specifies the increment for the elements of each x_i. incx must be > 0.

  • stride_x[in] [rocblas_stride] stride from the start of one vector (x_i) and the next one (x_i+1). There are no restrictions placed on stride_x. However, ensure that stride_x is of appropriate size. For a typical case this means stride_x >= n * incx.

  • batch_count[in] [rocblas_int] number of instances in the batch.

  • results[out] device pointer or host pointer to array for storing contiguous batch_count results. return is 0.0 for each element if n <= 0, incx<=0.

  • result_type[in] [rocblas_datatype] specifies the datatype of the result.

  • execution_type[in] [rocblas_datatype] specifies the datatype of computation.

nrm2_ex, nrm2_batched_ex, and nrm2_strided_batched_ex functions support the _64 interface. Refer to section ILP64 Interface.

rocblas_rot_ex + batched, strided_batched#

rocblas_status rocblas_rot_ex(rocblas_handle handle, rocblas_int n, void *x, rocblas_datatype x_type, rocblas_int incx, void *y, rocblas_datatype y_type, rocblas_int incy, const void *c, const void *s, rocblas_datatype cs_type, rocblas_datatype execution_type)#
rocblas_status rocblas_rot_batched_ex(rocblas_handle handle, rocblas_int n, void *x, rocblas_datatype x_type, rocblas_int incx, void *y, rocblas_datatype y_type, rocblas_int incy, const void *c, const void *s, rocblas_datatype cs_type, rocblas_int batch_count, rocblas_datatype execution_type)#
rocblas_status rocblas_rot_strided_batched_ex(rocblas_handle handle, rocblas_int n, void *x, rocblas_datatype x_type, rocblas_int incx, rocblas_stride stride_x, void *y, rocblas_datatype y_type, rocblas_int incy, rocblas_stride stride_y, const void *c, const void *s, rocblas_datatype cs_type, rocblas_int batch_count, rocblas_datatype execution_type)#

BLAS Level 1 API

rot_strided_batched_ex applies the Givens rotation matrix defined by c=cos(alpha) and s=sin(alpha) to strided batched vectors x_i and y_i, for i = 1, …, batch_count. Scalars c and s may be stored in either host or device memory. Location is specified by calling rocblas_set_pointer_mode.

In the case where cs_type is real:

x := c * x + s * y
y := c * y - s * x
In the case where cs_type is complex, the imaginary part of c is ignored:
x := real(c) * x + s * y
y := real(c) * y - conj(s) * x
Currently supported datatypes are as follows:

x_type

y_type

cs_type

execution_type

bf16_r

bf16_r

bf16_r

f32_r

f16_r

f16_r

f16_r

f32_r

f32_r

f32_r

f32_r

f32_r

f64_r

f64_r

f64_r

f64_r

f32_c

f32_c

f32_c

f32_c

f32_c

f32_c

f32_r

f32_c

f64_c

f64_c

f64_c

f64_c

f64_c

f64_c

f64_r

f64_c

Parameters:
  • handle[in] [rocblas_handle] handle to the rocblas library context queue.

  • n[in] [rocblas_int] number of elements in each x_i and y_i vectors.

  • x[inout] device pointer to the first vector x_1.

  • x_type[in] [rocblas_datatype] specifies the datatype of each vector x_i.

  • incx[in] [rocblas_int] specifies the increment between elements of each x_i.

  • stride_x[in] [rocblas_stride] specifies the increment from the beginning of x_i to the beginning of x_(i+1)

  • y[inout] device pointer to the first vector y_1.

  • y_type[in] [rocblas_datatype] specifies the datatype of each vector y_i.

  • incy[in] [rocblas_int] specifies the increment between elements of each y_i.

  • stride_y[in] [rocblas_stride] specifies the increment from the beginning of y_i to the beginning of y_(i+1)

  • c[in] device pointer or host pointer to scalar cosine component of the rotation matrix.

  • s[in] device pointer or host pointer to scalar sine component of the rotation matrix.

  • cs_type[in] [rocblas_datatype] specifies the datatype of c and s.

  • batch_count[in] [rocblas_int] the number of x and y arrays, the number of batches.

  • execution_type[in] [rocblas_datatype] specifies the datatype of computation.

rot_ex, rot_batched_ex, and rot_strided_batched_ex functions support the _64 interface. Refer to section ILP64 Interface.

rocblas_scal_ex + batched, strided_batched#

rocblas_status rocblas_scal_ex(rocblas_handle handle, rocblas_int n, const void *alpha, rocblas_datatype alpha_type, void *x, rocblas_datatype x_type, rocblas_int incx, rocblas_datatype execution_type)#
rocblas_status rocblas_scal_batched_ex(rocblas_handle handle, rocblas_int n, const void *alpha, rocblas_datatype alpha_type, void *x, rocblas_datatype x_type, rocblas_int incx, rocblas_int batch_count, rocblas_datatype execution_type)#
rocblas_status rocblas_scal_strided_batched_ex(rocblas_handle handle, rocblas_int n, const void *alpha, rocblas_datatype alpha_type, void *x, rocblas_datatype x_type, rocblas_int incx, rocblas_stride stridex, rocblas_int batch_count, rocblas_datatype execution_type)#

BLAS EX API

scal_strided_batched_ex scales each element of vector x with scalar alpha over a set of strided batched vectors.

x := alpha * x
Currently supported datatypes are as follows:

alpha_type

x_type

execution_type

f32_r

bf16_r

f32_r

bf16_r

bf16_r

f32_r

f16_r

f16_r

f16_r

f16_r

f16_r

f32_r

f32_r

f16_r

f32_r

f32_r

f32_r

f32_r

f64_r

f64_r

f64_r

f32_c

f32_c

f32_c

f64_c

f64_c

f64_c

f32_r

f32_c

f32_c

f64_r

f64_c

f64_c

Parameters:
  • handle[in] [rocblas_handle] handle to the rocblas library context queue.

  • n[in] [rocblas_int] the number of elements in x.

  • alpha[in] device pointer or host pointer for the scalar alpha.

  • alpha_type[in] [rocblas_datatype] specifies the datatype of alpha.

  • x[inout] device pointer to the first vector x_1.

  • x_type[in] [rocblas_datatype] specifies the datatype of each vector x_i.

  • incx[in] [rocblas_int] specifies the increment for the elements of each x_i.

  • stridex[in] [rocblas_stride] stride from the start of one vector (x_i) to the next one (x_i+1). There are no restrictions placed on stridex. However, ensure that stridex is of appropriate size. For a typical case this means stridex >= n * incx.

  • batch_count[in] [rocblas_int] number of instances in the batch.

  • execution_type[in] [rocblas_datatype] specifies the datatype of computation.

scal_ex, scal_batched_ex, and scal_strided_batched_ex functions support the _64 interface. Refer to section ILP64 Interface.

rocblas_gemm_ex + batched, strided_batched#

rocblas_status rocblas_gemm_ex(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, rocblas_int k, const void *alpha, const void *a, rocblas_datatype a_type, rocblas_int lda, const void *b, rocblas_datatype b_type, rocblas_int ldb, const void *beta, const void *c, rocblas_datatype c_type, rocblas_int ldc, void *d, rocblas_datatype d_type, rocblas_int ldd, rocblas_datatype compute_type, rocblas_gemm_algo algo, int32_t solution_index, uint32_t flags)#

BLAS EX API

gemm_ex performs one of the matrix-matrix operations:

D = alpha*op( A )*op( B ) + beta*C,
where op( X ) is one of
op( X ) = X      or
op( X ) = X**T   or
op( X ) = X**H,
alpha and beta are scalars, and A, B, C, and D are matrices, with op( A ) an m by k matrix, op( B ) a k by n matrix and C and D are m by n matrices. C and D may point to the same matrix if their parameters are identical.

Supported types are as follows:

  • rocblas_datatype_f64_r = a_type = b_type = c_type = d_type = compute_type

  • rocblas_datatype_f32_r = a_type = b_type = c_type = d_type = compute_type

  • rocblas_datatype_f16_r = a_type = b_type = c_type = d_type = compute_type

  • rocblas_datatype_f16_r = a_type = b_type = c_type = d_type; rocblas_datatype_f32_r = compute_type

  • rocblas_datatype_f16_r = a_type = b_type; rocblas_datatype_f32_r = c_type = d_type = compute_type

  • rocblas_datatype_bf16_r = a_type = b_type = c_type = d_type; rocblas_datatype_f32_r = compute_type

  • rocblas_datatype_bf16_r = a_type = b_type; rocblas_datatype_f32_r = c_type = d_type = compute_type

  • rocblas_datatype_i8_r = a_type = b_type; rocblas_datatype_i32_r = c_type = d_type = compute_type

  • rocblas_datatype_f32_c = a_type = b_type = c_type = d_type = compute_type

  • rocblas_datatype_f64_c = a_type = b_type = c_type = d_type = compute_type

Although not widespread, some gemm kernels used by gemm_ex may use atomic operations. See Atomic Operations in the API Reference Guide for more information.

Parameters:
  • handle[in] [rocblas_handle] handle to the rocblas library context queue.

  • transA[in] [rocblas_operation] specifies the form of op( A ).

  • transB[in] [rocblas_operation] specifies the form of op( B ).

  • m[in] [rocblas_int] matrix dimension m.

  • n[in] [rocblas_int] matrix dimension n.

  • k[in] [rocblas_int] matrix dimension k.

  • alpha[in] [const void *] device pointer or host pointer specifying the scalar alpha. Same datatype as compute_type.

  • a[in] [void *] device pointer storing matrix A.

  • a_type[in] [rocblas_datatype] specifies the datatype of matrix A.

  • lda[in] [rocblas_int] specifies the leading dimension of A.

  • b[in] [void *] device pointer storing matrix B.

  • b_type[in] [rocblas_datatype] specifies the datatype of matrix B.

  • ldb[in] [rocblas_int] specifies the leading dimension of B.

  • beta[in] [const void *] device pointer or host pointer specifying the scalar beta. Same datatype as compute_type.

  • c[in] [void *] device pointer storing matrix C.

  • c_type[in] [rocblas_datatype] specifies the datatype of matrix C.

  • ldc[in] [rocblas_int] specifies the leading dimension of C.

  • d[out] [void *] device pointer storing matrix D. If d and c pointers are to the same matrix then d_type must equal c_type and ldd must equal ldc or the respective invalid status will be returned.

  • d_type[in] [rocblas_datatype] specifies the datatype of matrix D.

  • ldd[in] [rocblas_int] specifies the leading dimension of D.

  • compute_type[in] [rocblas_datatype] specifies the datatype of computation.

  • algo[in] [rocblas_gemm_algo] enumerant specifying the algorithm type.

  • solution_index[in] [int32_t] if algo is rocblas_gemm_algo_solution_index, this controls which solution is used. When algo is not rocblas_gemm_algo_solution_index, or if solution_index <= 0, the default solution is used. This parameter was unused in previous releases and instead always used the default solution

  • flags[in] [uint32_t] optional gemm flags.

gemm_ex functions support the _64 interface. However, no arguments larger than (int32_t max value * 16) are currently supported. Refer to section ILP64 Interface.

rocblas_status rocblas_gemm_batched_ex(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, rocblas_int k, const void *alpha, const void *a, rocblas_datatype a_type, rocblas_int lda, const void *b, rocblas_datatype b_type, rocblas_int ldb, const void *beta, const void *c, rocblas_datatype c_type, rocblas_int ldc, void *d, rocblas_datatype d_type, rocblas_int ldd, rocblas_int batch_count, rocblas_datatype compute_type, rocblas_gemm_algo algo, int32_t solution_index, uint32_t flags)#

BLAS EX API

gemm_batched_ex performs one of the batched matrix-matrix operations: D_i = alpha*op(A_i)*op(B_i) + beta*C_i, for i = 1, …, batch_count. where op( X ) is one of op( X ) = X or op( X ) = X**T or op( X ) = X**H, alpha and beta are scalars, and A, B, C, and D are batched pointers to matrices, with op( A ) an m by k by batch_count batched matrix, op( B ) a k by n by batch_count batched matrix and C and D are m by n by batch_count batched matrices. The batched matrices are an array of pointers to matrices. The number of pointers to matrices is batch_count. C and D may point to the same matrices if their parameters are identical.

Supported types are as follows:

  • rocblas_datatype_f64_r = a_type = b_type = c_type = d_type = compute_type

  • rocblas_datatype_f32_r = a_type = b_type = c_type = d_type = compute_type

  • rocblas_datatype_f16_r = a_type = b_type = c_type = d_type = compute_type

  • rocblas_datatype_f16_r = a_type = b_type = c_type = d_type; rocblas_datatype_f32_r = compute_type

  • rocblas_datatype_bf16_r = a_type = b_type = c_type = d_type; rocblas_datatype_f32_r = compute_type

  • rocblas_datatype_i8_r = a_type = b_type; rocblas_datatype_i32_r = c_type = d_type = compute_type

  • rocblas_datatype_f32_c = a_type = b_type = c_type = d_type = compute_type

  • rocblas_datatype_f64_c = a_type = b_type = c_type = d_type = compute_type

Parameters:
  • handle[in] [rocblas_handle] handle to the rocblas library context queue.

  • transA[in] [rocblas_operation] specifies the form of op( A ).

  • transB[in] [rocblas_operation] specifies the form of op( B ).

  • m[in] [rocblas_int] matrix dimension m.

  • n[in] [rocblas_int] matrix dimension n.

  • k[in] [rocblas_int] matrix dimension k.

  • alpha[in] [const void *] device pointer or host pointer specifying the scalar alpha. Same datatype as compute_type.

  • a[in] [void *] device pointer storing array of pointers to each matrix A_i.

  • a_type[in] [rocblas_datatype] specifies the datatype of each matrix A_i.

  • lda[in] [rocblas_int] specifies the leading dimension of each A_i.

  • b[in] [void *] device pointer storing array of pointers to each matrix B_i.

  • b_type[in] [rocblas_datatype] specifies the datatype of each matrix B_i.

  • ldb[in] [rocblas_int] specifies the leading dimension of each B_i.

  • beta[in] [const void *] device pointer or host pointer specifying the scalar beta. Same datatype as compute_type.

  • c[in] [void *] device array of device pointers to each matrix C_i.

  • c_type[in] [rocblas_datatype] specifies the datatype of each matrix C_i.

  • ldc[in] [rocblas_int] specifies the leading dimension of each C_i.

  • d[out] [void *] device array of device pointers to each matrix D_i. If d and c are the same array of matrix pointers then d_type must equal c_type and ldd must equal ldc or the respective invalid status will be returned.

  • d_type[in] [rocblas_datatype] specifies the datatype of each matrix D_i.

  • ldd[in] [rocblas_int] specifies the leading dimension of each D_i.

  • batch_count[in] [rocblas_int] number of gemm operations in the batch.

  • compute_type[in] [rocblas_datatype] specifies the datatype of computation.

  • algo[in] [rocblas_gemm_algo] enumerant specifying the algorithm type.

  • solution_index[in] [int32_t] if algo is rocblas_gemm_algo_solution_index, this controls which solution is used. When algo is not rocblas_gemm_algo_solution_index, or if solution_index <= 0, the default solution is used. This parameter was unused in previous releases and instead always used the default solution

  • flags[in] [uint32_t] optional gemm flags.

gemm_batched_ex functions support the _64 interface. Only the parameter batch_count larger than (int32_t max value * 16) is currently supported. Refer to section ILP64 Interface.

rocblas_status rocblas_gemm_strided_batched_ex(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, rocblas_int k, const void *alpha, const void *a, rocblas_datatype a_type, rocblas_int lda, rocblas_stride stride_a, const void *b, rocblas_datatype b_type, rocblas_int ldb, rocblas_stride stride_b, const void *beta, const void *c, rocblas_datatype c_type, rocblas_int ldc, rocblas_stride stride_c, void *d, rocblas_datatype d_type, rocblas_int ldd, rocblas_stride stride_d, rocblas_int batch_count, rocblas_datatype compute_type, rocblas_gemm_algo algo, int32_t solution_index, uint32_t flags)#

BLAS EX API

gemm_strided_batched_ex performs one of the strided_batched matrix-matrix operations:

D_i = alpha*op(A_i)*op(B_i) + beta*C_i, for i = 1, ..., batch_count
where op( X ) is one of
op( X ) = X      or
op( X ) = X**T   or
op( X ) = X**H,
alpha and beta are scalars, and A, B, C, and D are strided_batched matrices, with op( A ) an m by k by batch_count strided_batched matrix, op( B ) a k by n by batch_count strided_batched matrix and C and D are m by n by batch_count strided_batched matrices. C and D may point to the same matrices if their parameters are identical.

The strided_batched matrices are multiple matrices separated by a constant stride. The number of matrices is batch_count.

Supported types are as follows:

  • rocblas_datatype_f64_r = a_type = b_type = c_type = d_type = compute_type

  • rocblas_datatype_f32_r = a_type = b_type = c_type = d_type = compute_type

  • rocblas_datatype_f16_r = a_type = b_type = c_type = d_type = compute_type

  • rocblas_datatype_f16_r = a_type = b_type = c_type = d_type; rocblas_datatype_f32_r = compute_type

  • rocblas_datatype_bf16_r = a_type = b_type = c_type = d_type; rocblas_datatype_f32_r = compute_type

  • rocblas_datatype_i8_r = a_type = b_type; rocblas_datatype_i32_r = c_type = d_type = compute_type

  • rocblas_datatype_f32_c = a_type = b_type = c_type = d_type = compute_type

  • rocblas_datatype_f64_c = a_type = b_type = c_type = d_type = compute_type

Parameters:
  • handle[in] [rocblas_handle] handle to the rocblas library context queue.

  • transA[in] [rocblas_operation] specifies the form of op( A ).

  • transB[in] [rocblas_operation] specifies the form of op( B ).

  • m[in] [rocblas_int] matrix dimension m.

  • n[in] [rocblas_int] matrix dimension n.

  • k[in] [rocblas_int] matrix dimension k.

  • alpha[in] [const void *] device pointer or host pointer specifying the scalar alpha. Same datatype as compute_type.

  • a[in] [void *] device pointer pointing to first matrix A_1.

  • a_type[in] [rocblas_datatype] specifies the datatype of each matrix A_i.

  • lda[in] [rocblas_int] specifies the leading dimension of each A_i.

  • stride_a[in] [rocblas_stride] specifies stride from start of one A_i matrix to the next A_(i + 1).

  • b[in] [void *] device pointer pointing to first matrix B_1.

  • b_type[in] [rocblas_datatype] specifies the datatype of each matrix B_i.

  • ldb[in] [rocblas_int] specifies the leading dimension of each B_i.

  • stride_b[in] [rocblas_stride] specifies stride from start of one B_i matrix to the next B_(i + 1).

  • beta[in] [const void *] device pointer or host pointer specifying the scalar beta. Same datatype as compute_type.

  • c[in] [void *] device pointer pointing to first matrix C_1.

  • c_type[in] [rocblas_datatype] specifies the datatype of each matrix C_i.

  • ldc[in] [rocblas_int] specifies the leading dimension of each C_i.

  • stride_c[in] [rocblas_stride] specifies stride from start of one C_i matrix to the next C_(i + 1).

  • d[out] [void *] device pointer storing each matrix D_i. If d and c pointers are to the same matrix then d_type must equal c_type and ldd must equal ldc and stride_d must equal stride_c or the respective invalid status will be returned.

  • d_type[in] [rocblas_datatype] specifies the datatype of each matrix D_i.

  • ldd[in] [rocblas_int] specifies the leading dimension of each D_i.

  • stride_d[in] [rocblas_stride] specifies stride from start of one D_i matrix to the next D_(i + 1).

  • batch_count[in] [rocblas_int] number of gemm operations in the batch.

  • compute_type[in] [rocblas_datatype] specifies the datatype of computation.

  • algo[in] [rocblas_gemm_algo] enumerant specifying the algorithm type.

  • solution_index[in] [int32_t] if algo is rocblas_gemm_algo_solution_index, this controls which solution is used. When algo is not rocblas_gemm_algo_solution_index, or if solution_index <= 0, the default solution is used. This parameter was unused in previous releases and instead always used the default solution

  • flags[in] [uint32_t] optional gemm flags.

gemm_strided_batched_ex functions support the _64 interface. Only the parameter batch_count larger than (int32_t max value * 16) is currently supported. Refer to section ILP64 Interface.

rocblas_trsm_ex + batched, strided_batched#

rocblas_status rocblas_trsm_ex(rocblas_handle handle, rocblas_side side, rocblas_fill uplo, rocblas_operation transA, rocblas_diagonal diag, rocblas_int m, rocblas_int n, const void *alpha, const void *A, rocblas_int lda, void *B, rocblas_int ldb, const void *invA, rocblas_int invA_size, rocblas_datatype compute_type)#
rocblas_status rocblas_trsm_batched_ex(rocblas_handle handle, rocblas_side side, rocblas_fill uplo, rocblas_operation transA, rocblas_diagonal diag, rocblas_int m, rocblas_int n, const void *alpha, const void *A, rocblas_int lda, void *B, rocblas_int ldb, rocblas_int batch_count, const void *invA, rocblas_int invA_size, rocblas_datatype compute_type)#
rocblas_status rocblas_trsm_strided_batched_ex(rocblas_handle handle, rocblas_side side, rocblas_fill uplo, rocblas_operation transA, rocblas_diagonal diag, rocblas_int m, rocblas_int n, const void *alpha, const void *A, rocblas_int lda, rocblas_stride stride_A, void *B, rocblas_int ldb, rocblas_stride stride_B, rocblas_int batch_count, const void *invA, rocblas_int invA_size, rocblas_stride stride_invA, rocblas_datatype compute_type)#

BLAS EX API

trsm_strided_batched_ex solves:

op(A_i)*X_i = alpha*B_i or X_i*op(A_i) = alpha*B_i,
for i = 1, …, batch_count; and where alpha is a scalar, X and B are strided batched m by n matrices, A is a strided batched triangular matrix and op(A_i) is one of
op( A_i ) = A_i   or   op( A_i ) = A_i^T   or   op( A_i ) = A_i^H.
Each matrix X_i is overwritten on B_i.

This function gives the user the ability to reuse each invA_i matrix between runs. If invA == NULL, rocblas_trsm_batched_ex will automatically calculate each invA_i on every run.

Setting up invA: Each accepted invA_i matrix consists of the packed 128x128 inverses of the diagonal blocks of matrix A_i, followed by any smaller diagonal block that remains. To set up invA_i it is recommended that rocblas_trtri_batched be used with matrix A_i as the input. invA is a contiguous piece of memory holding each invA_i.

Device memory of size 128 x k should be allocated for each invA_i ahead of time, where k is m when rocblas_side_left and is n when rocblas_side_right. The actual number of elements in each invA_i should be passed as invA_size.

To begin, rocblas_trtri_batched must be called on the full 128x128-sized diagonal blocks of each matrix A_i. Below are the restricted parameters:

  • n = 128

  • ldinvA = 128

  • stride_invA = 128x128

  • batch_count = k / 128

Then any remaining block may be added:

  • n = k % 128

  • invA = invA + stride_invA * previous_batch_count

  • ldinvA = 128

  • batch_count = 1

Parameters:
  • handle[in] [rocblas_handle] handle to the rocblas library context queue.

  • side[in] [rocblas_side]

    • rocblas_side_left: op(A)*X = alpha*B

    • rocblas_side_right: X*op(A) = alpha*B

  • uplo[in] [rocblas_fill]

    • rocblas_fill_upper: each A_i is an upper triangular matrix.

    • rocblas_fill_lower: each A_i is a lower triangular matrix.

  • transA[in] [rocblas_operation]

    • transB: op(A) = A.

    • rocblas_operation_transpose: op(A) = A^T

    • rocblas_operation_conjugate_transpose: op(A) = A^H

  • diag[in] [rocblas_diagonal]

    • rocblas_diagonal_unit: each A_i is assumed to be unit triangular.

    • rocblas_diagonal_non_unit: each A_i is not assumed to be unit triangular.

  • m[in] [rocblas_int] m specifies the number of rows of each B_i. m >= 0.

  • n[in] [rocblas_int] n specifies the number of columns of each B_i. n >= 0.

  • alpha[in] [void *] device pointer or host pointer specifying the scalar alpha. When alpha is &zero then A is not referenced, and B need not be set before entry.

  • A[in] [void *] device pointer storing matrix A. of dimension ( lda, k ), where k is m when rocblas_side_left and is n when rocblas_side_right only the upper/lower triangular part is accessed.

  • lda[in] [rocblas_int] lda specifies the first dimension of A.

    if side = rocblas_side_left,  lda >= max( 1, m ),
    if side = rocblas_side_right, lda >= max( 1, n ).
    

  • stride_A[in] [rocblas_stride] The stride between each A matrix.

  • B[inout] [void *] device pointer pointing to first matrix B_i. each B_i is of dimension ( ldb, n ). Before entry, the leading m by n part of each array B_i must contain the right-hand side of matrix B_i, and on exit is overwritten by the solution matrix X_i.

  • ldb[in] [rocblas_int] ldb specifies the first dimension of each B_i. ldb >= max( 1, m ).

  • stride_B[in] [rocblas_stride] The stride between each B_i matrix.

  • batch_count[in] [rocblas_int] specifies how many batches.

  • invA[in] [void *] device pointer storing the inverse diagonal blocks of each A_i. invA points to the first invA_1. each invA_i is of dimension ( ld_invA, k ), where k is m when rocblas_side_left and is n when rocblas_side_right. ld_invA must be equal to 128.

  • invA_size[in] [rocblas_int] invA_size specifies the number of elements of device memory in each invA_i.

  • stride_invA[in] [rocblas_stride] The stride between each invA matrix.

  • compute_type[in] [rocblas_datatype] specifies the datatype of computation.

rocblas_Xgeam + batched, strided_batched#

rocblas_status rocblas_sgeam(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, const float *alpha, const float *A, rocblas_int lda, const float *beta, const float *B, rocblas_int ldb, float *C, rocblas_int ldc)#
rocblas_status rocblas_dgeam(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, const double *alpha, const double *A, rocblas_int lda, const double *beta, const double *B, rocblas_int ldb, double *C, rocblas_int ldc)#
rocblas_status rocblas_cgeam(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, const rocblas_float_complex *alpha, const rocblas_float_complex *A, rocblas_int lda, const rocblas_float_complex *beta, const rocblas_float_complex *B, rocblas_int ldb, rocblas_float_complex *C, rocblas_int ldc)#
rocblas_status rocblas_zgeam(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, const rocblas_double_complex *alpha, const rocblas_double_complex *A, rocblas_int lda, const rocblas_double_complex *beta, const rocblas_double_complex *B, rocblas_int ldb, rocblas_double_complex *C, rocblas_int ldc)#

BLAS Level 3 API

geam performs one of the matrix-matrix operations:

C = alpha*op( A ) + beta*op( B ),

where op( X ) is one of

op( X ) = X      or
op( X ) = X**T   or
op( X ) = X**H,

alpha and beta are scalars, and A, B and C are matrices, with
op( A ) an m by n matrix, op( B ) an m by n matrix, and C an m by n matrix.

Parameters:
  • handle[in] [rocblas_handle] handle to the rocblas library context queue.

  • transA[in] [rocblas_operation] specifies the form of op( A ).

  • transB[in] [rocblas_operation] specifies the form of op( B ).

  • m[in] [rocblas_int] matrix dimension m.

  • n[in] [rocblas_int] matrix dimension n.

  • alpha[in] device pointer or host pointer specifying the scalar alpha.

  • A[in] device pointer storing matrix A.

  • lda[in] [rocblas_int] specifies the leading dimension of A.

  • beta[in] device pointer or host pointer specifying the scalar beta.

  • B[in] device pointer storing matrix B.

  • ldb[in] [rocblas_int] specifies the leading dimension of B.

  • C[inout] device pointer storing matrix C.

  • ldc[in] [rocblas_int] specifies the leading dimension of C.

The geam functions support the _64 interface. Refer to section ILP64 Interface.

rocblas_status rocblas_sgeam_batched(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, const float *alpha, const float *const A[], rocblas_int lda, const float *beta, const float *const B[], rocblas_int ldb, float *const C[], rocblas_int ldc, rocblas_int batch_count)#
rocblas_status rocblas_dgeam_batched(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, const double *alpha, const double *const A[], rocblas_int lda, const double *beta, const double *const B[], rocblas_int ldb, double *const C[], rocblas_int ldc, rocblas_int batch_count)#
rocblas_status rocblas_cgeam_batched(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, const rocblas_float_complex *alpha, const rocblas_float_complex *const A[], rocblas_int lda, const rocblas_float_complex *beta, const rocblas_float_complex *const B[], rocblas_int ldb, rocblas_float_complex *const C[], rocblas_int ldc, rocblas_int batch_count)#
rocblas_status rocblas_zgeam_batched(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, const rocblas_double_complex *alpha, const rocblas_double_complex *const A[], rocblas_int lda, const rocblas_double_complex *beta, const rocblas_double_complex *const B[], rocblas_int ldb, rocblas_double_complex *const C[], rocblas_int ldc, rocblas_int batch_count)#

BLAS Level 3 API

geam_batched performs one of the batched matrix-matrix operations:

C_i = alpha*op( A_i ) + beta*op( B_i )  for i = 0, 1, ... batch_count - 1,

where alpha and beta are scalars, and op(A_i), op(B_i) and C_i are m by n matrices
and op( X ) is one of

op( X ) = X      or
op( X ) = X**T

Parameters:
  • handle[in] [rocblas_handle] handle to the rocblas library context queue.

  • transA[in] [rocblas_operation] specifies the form of op( A ).

  • transB[in] [rocblas_operation] specifies the form of op( B ).

  • m[in] [rocblas_int] matrix dimension m.

  • n[in] [rocblas_int] matrix dimension n.

  • alpha[in] device pointer or host pointer specifying the scalar alpha.

  • A[in] device array of device pointers storing each matrix A_i on the GPU. Each A_i is of dimension ( lda, k ), where k is m when transA == rocblas_operation_none and is n when transA == rocblas_operation_transpose.

  • lda[in] [rocblas_int] specifies the leading dimension of A.

  • beta[in] device pointer or host pointer specifying the scalar beta.

  • B[in] device array of device pointers storing each matrix B_i on the GPU. Each B_i is of dimension ( ldb, k ), where k is m when transB == rocblas_operation_none and is n when transB == rocblas_operation_transpose.

  • ldb[in] [rocblas_int] specifies the leading dimension of B.

  • C[inout] device array of device pointers storing each matrix C_i on the GPU. Each C_i is of dimension ( ldc, n ).

  • ldc[in] [rocblas_int] specifies the leading dimension of C.

  • batch_count[in] [rocblas_int] number of instances i in the batch.

The geam_batched functions support the _64 interface. Refer to section ILP64 Interface.

rocblas_status rocblas_sgeam_strided_batched(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, const float *alpha, const float *A, rocblas_int lda, rocblas_stride stride_A, const float *beta, const float *B, rocblas_int ldb, rocblas_stride stride_B, float *C, rocblas_int ldc, rocblas_stride stride_C, rocblas_int batch_count)#
rocblas_status rocblas_dgeam_strided_batched(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, const double *alpha, const double *A, rocblas_int lda, rocblas_stride stride_A, const double *beta, const double *B, rocblas_int ldb, rocblas_stride stride_B, double *C, rocblas_int ldc, rocblas_stride stride_C, rocblas_int batch_count)#
rocblas_status rocblas_cgeam_strided_batched(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, const rocblas_float_complex *alpha, const rocblas_float_complex *A, rocblas_int lda, rocblas_stride stride_A, const rocblas_float_complex *beta, const rocblas_float_complex *B, rocblas_int ldb, rocblas_stride stride_B, rocblas_float_complex *C, rocblas_int ldc, rocblas_stride stride_C, rocblas_int batch_count)#
rocblas_status rocblas_zgeam_strided_batched(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, const rocblas_double_complex *alpha, const rocblas_double_complex *A, rocblas_int lda, rocblas_stride stride_A, const rocblas_double_complex *beta, const rocblas_double_complex *B, rocblas_int ldb, rocblas_stride stride_B, rocblas_double_complex *C, rocblas_int ldc, rocblas_stride stride_C, rocblas_int batch_count)#

BLAS Level 3 API

geam_strided_batched performs one of the batched matrix-matrix operations:

C_i = alpha*op( A_i ) + beta*op( B_i )  for i = 0, 1, ... batch_count - 1,

where alpha and beta are scalars, and op(A_i), op(B_i) and C_i are m by n matrices
and op( X ) is one of

op( X ) = X      or
op( X ) = X**T

Parameters:
  • handle[in] [rocblas_handle] handle to the rocblas library context queue.

  • transA[in] [rocblas_operation] specifies the form of op( A ).

  • transB[in] [rocblas_operation] specifies the form of op( B ).

  • m[in] [rocblas_int] matrix dimension m.

  • n[in] [rocblas_int] matrix dimension n.

  • alpha[in] device pointer or host pointer specifying the scalar alpha.

  • A[in] device pointer to the first matrix A_0 on the GPU. Each A_i is of dimension ( lda, k ), where k is m when transA == rocblas_operation_none and is n when transA == rocblas_operation_transpose.

  • lda[in] [rocblas_int] specifies the leading dimension of A.

  • stride_A[in] [rocblas_stride] stride from the start of one matrix (A_i) and the next one (A_i+1).

  • beta[in] device pointer or host pointer specifying the scalar beta.

  • B[in] pointer to the first matrix B_0 on the GPU. Each B_i is of dimension ( ldb, k ), where k is m when transB == rocblas_operation_none and is n when transB == rocblas_operation_transpose.

  • ldb[in] [rocblas_int] specifies the leading dimension of B.

  • stride_B[in] [rocblas_stride] stride from the start of one matrix (B_i) and the next one (B_i+1)

  • C[inout] pointer to the first matrix C_0 on the GPU. Each C_i is of dimension ( ldc, n ).

  • ldc[in] [rocblas_int] specifies the leading dimension of C.

  • stride_C[in] [rocblas_stride] stride from the start of one matrix (C_i) and the next one (C_i+1).

  • batch_count[in] [rocblas_int] number of instances i in the batch.

The geam_strided_batched functions support the _64 interface. Refer to section ILP64 Interface.

rocblas_geam_ex#

rocblas_status rocblas_geam_ex(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, rocblas_int k, const void *alpha, const void *A, rocblas_datatype a_type, rocblas_int lda, const void *B, rocblas_datatype b_type, rocblas_int ldb, const void *beta, const void *C, rocblas_datatype c_type, rocblas_int ldc, void *D, rocblas_datatype d_type, rocblas_int ldd, rocblas_datatype compute_type, rocblas_geam_ex_operation geam_ex_op)#

BLAS EX API

geam_ex performs one of the matrix-matrix operations:

Dij = min(alpha * (Aik + Bkj), beta * Cij)
Dij = min(alpha * Aik, alpha * Bkj) + beta * Cij
alpha and beta are scalars, and A, B, C, and D are matrices, with op( A ) an m by k matrix, op( B ) a k by n matrix and C and D are m by n matrices. C and D may point to the same matrix if their type and leading dimensions are identical.

Aik refers to the element at the i-th row and k-th column of op( A ), Bkj refers to the element at the k-th row and j-th column of op( B ), and Cij/Dij refers to the element at the i-th row and j-th column of C/D.

Supported types are as follows:

  • rocblas_datatype_f64_r = a_type = b_type = c_type = d_type = compute_type

  • rocblas_datatype_f32_r = a_type = b_type = c_type = d_type = compute_type

  • rocblas_datatype_f16_r = a_type = b_type = c_type = d_type = compute_type

if transA == N, must have lda >= max(1, m) otherwise, must have lda >= max(1, k)

if transB == N, must have ldb >= max(1, k) otherwise, must have ldb >= max(1, n)

Parameters:
  • handle[in] [rocblas_handle] handle to the rocblas library context queue.

  • transA[in] [rocblas_operation] specifies the form of op( A ).

  • transB[in] [rocblas_operation] specifies the form of op( B ).

  • m[in] [rocblas_int] matrix dimension m.

  • n[in] [rocblas_int] matrix dimension n.

  • k[in] [rocblas_int] matrix dimension k.

  • alpha[in] [const void *] device pointer or host pointer specifying the scalar alpha. Same datatype as compute_type.

  • A[in] [void *] device pointer storing matrix A.

  • a_type[in] [rocblas_datatype] specifies the datatype of matrix A.

  • lda[in] [rocblas_int] specifies the leading dimension of A

  • B[in] [void *] device pointer storing matrix B.

  • b_type[in] [rocblas_datatype] specifies the datatype of matrix B.

  • ldb[in] [rocblas_int] specifies the leading dimension of B

  • beta[in] [const void *] device pointer or host pointer specifying the scalar beta. Same datatype as compute_type.

  • C[in] [void *] device pointer storing matrix C.

  • c_type[in] [rocblas_datatype] specifies the datatype of matrix C.

  • ldc[in] [rocblas_int] specifies the leading dimension of C, must have ldc >= max(1, m).

  • D[out] [void *] device pointer storing matrix D. If D and C pointers are to the same matrix then d_type must equal c_type and ldd must equal ldc or the respective invalid status will be returned.

  • d_type[in] [rocblas_datatype] specifies the datatype of matrix D.

  • ldd[in] [rocblas_int] specifies the leading dimension of D, must have ldd >= max(1, m).

  • compute_type[in] [rocblas_datatype] specifies the datatype of computation.

  • geam_ex_op[in] [rocblas_geam_ex_operation] enumerant specifying the operation type, support for rocblas_geam_ex_operation_min_plus and rocblas_geam_ex_operation_plus_min.

rocblas_Xdgmm + batched, strided_batched#

rocblas_status rocblas_sdgmm(rocblas_handle handle, rocblas_side side, rocblas_int m, rocblas_int n, const float *A, rocblas_int lda, const float *x, rocblas_int incx, float *C, rocblas_int ldc)#
rocblas_status rocblas_ddgmm(rocblas_handle handle, rocblas_side side, rocblas_int m, rocblas_int n, const double *A, rocblas_int lda, const double *x, rocblas_int incx, double *C, rocblas_int ldc)#
rocblas_status rocblas_cdgmm(rocblas_handle handle, rocblas_side side, rocblas_int m, rocblas_int n, const rocblas_float_complex *A, rocblas_int lda, const rocblas_float_complex *x, rocblas_int incx, rocblas_float_complex *C, rocblas_int ldc)#
rocblas_status rocblas_zdgmm(rocblas_handle handle, rocblas_side side, rocblas_int m, rocblas_int n, const rocblas_double_complex *A, rocblas_int lda, const rocblas_double_complex *x, rocblas_int incx, rocblas_double_complex *C, rocblas_int ldc)#

BLAS Level 3 API

dgmm performs one of the matrix-matrix operations:

C = A * diag(x) if side == rocblas_side_right
C = diag(x) * A if side == rocblas_side_left

where C and A are m by n dimensional matrices. diag( x ) is a diagonal matrix
and x is vector of dimension n if side == rocblas_side_right and dimension m
if side == rocblas_side_left.

Parameters:
  • handle[in] [rocblas_handle] handle to the rocblas library context queue.

  • side[in] [rocblas_side] specifies the side of diag(x).

  • m[in] [rocblas_int] matrix dimension m.

  • n[in] [rocblas_int] matrix dimension n.

  • A[in] device pointer storing matrix A.

  • lda[in] [rocblas_int] specifies the leading dimension of A.

  • x[in] device pointer storing vector x.

  • incx[in] [rocblas_int] specifies the increment between values of x

  • C[inout] device pointer storing matrix C.

  • ldc[in] [rocblas_int] specifies the leading dimension of C.

The dgmm functions support the _64 interface. Refer to section ILP64 Interface.

rocblas_status rocblas_sdgmm_batched(rocblas_handle handle, rocblas_side side, rocblas_int m, rocblas_int n, const float *const A[], rocblas_int lda, const float *const x[], rocblas_int incx, float *const C[], rocblas_int ldc, rocblas_int batch_count)#
rocblas_status rocblas_ddgmm_batched(rocblas_handle handle, rocblas_side side, rocblas_int m, rocblas_int n, const double *const A[], rocblas_int lda, const double *const x[], rocblas_int incx, double *const C[], rocblas_int ldc, rocblas_int batch_count)#
rocblas_status rocblas_cdgmm_batched(rocblas_handle handle, rocblas_side side, rocblas_int m, rocblas_int n, const rocblas_float_complex *const A[], rocblas_int lda, const rocblas_float_complex *const x[], rocblas_int incx, rocblas_float_complex *const C[], rocblas_int ldc, rocblas_int batch_count)#
rocblas_status rocblas_zdgmm_batched(rocblas_handle handle, rocblas_side side, rocblas_int m, rocblas_int n, const rocblas_double_complex *const A[], rocblas_int lda, const rocblas_double_complex *const x[], rocblas_int incx, rocblas_double_complex *const C[], rocblas_int ldc, rocblas_int batch_count)#

BLAS Level 3 API

dgmm_batched performs one of the batched matrix-matrix operations:

C_i = A_i * diag(x_i) for i = 0, 1, ... batch_count-1 if side == rocblas_side_right
C_i = diag(x_i) * A_i for i = 0, 1, ... batch_count-1 if side == rocblas_side_left,

where C_i and A_i are m by n dimensional matrices. diag(x_i) is a diagonal matrix
and x_i is vector of dimension n if side == rocblas_side_right and dimension m
if side == rocblas_side_left.

Parameters:
  • handle[in] [rocblas_handle] handle to the rocblas library context queue.

  • side[in] [rocblas_side] specifies the side of diag(x).

  • m[in] [rocblas_int] matrix dimension m.

  • n[in] [rocblas_int] matrix dimension n.

  • A[in] device array of device pointers storing each matrix A_i on the GPU. Each A_i is of dimension ( lda, n ).

  • lda[in] [rocblas_int] specifies the leading dimension of A_i.

  • x[in] device array of device pointers storing each vector x_i on the GPU. Each x_i is of dimension n if side == rocblas_side_right and dimension m if side == rocblas_side_left.

  • incx[in] [rocblas_int] specifies the increment between values of x_i.

  • C[inout] device array of device pointers storing each matrix C_i on the GPU. Each C_i is of dimension ( ldc, n ).

  • ldc[in] [rocblas_int] specifies the leading dimension of C_i.

  • batch_count[in] [rocblas_int] number of instances in the batch.

The dgmm_batched functions support the _64 interface. Refer to section ILP64 Interface.

rocblas_status rocblas_sdgmm_strided_batched(rocblas_handle handle, rocblas_side side, rocblas_int m, rocblas_int n, const float *A, rocblas_int lda, rocblas_stride stride_A, const float *x, rocblas_int incx, rocblas_stride stride_x, float *C, rocblas_int ldc, rocblas_stride stride_C, rocblas_int batch_count)#
rocblas_status rocblas_ddgmm_strided_batched(rocblas_handle handle, rocblas_side side, rocblas_int m, rocblas_int n, const double *A, rocblas_int lda, rocblas_stride stride_A, const double *x, rocblas_int incx, rocblas_stride stride_x, double *C, rocblas_int ldc, rocblas_stride stride_C, rocblas_int batch_count)#
rocblas_status rocblas_cdgmm_strided_batched(rocblas_handle handle, rocblas_side side, rocblas_int m, rocblas_int n, const rocblas_float_complex *A, rocblas_int lda, rocblas_stride stride_A, const rocblas_float_complex *x, rocblas_int incx, rocblas_stride stride_x, rocblas_float_complex *C, rocblas_int ldc, rocblas_stride stride_C, rocblas_int batch_count)#
rocblas_status rocblas_zdgmm_strided_batched(rocblas_handle handle, rocblas_side side, rocblas_int m, rocblas_int n, const rocblas_double_complex *A, rocblas_int lda, rocblas_stride stride_A, const rocblas_double_complex *x, rocblas_int incx, rocblas_stride stride_x, rocblas_double_complex *C, rocblas_int ldc, rocblas_stride stride_C, rocblas_int batch_count)#

BLAS Level 3 API

dgmm_strided_batched performs one of the batched matrix-matrix operations:

C_i = A_i * diag(x_i)   if side == rocblas_side_right   for i = 0, 1, ... batch_count-1
C_i = diag(x_i) * A_i   if side == rocblas_side_left    for i = 0, 1, ... batch_count-1,

where C_i and A_i are m by n dimensional matrices. diag(x_i) is a diagonal matrix
and x_i is vector of dimension n if side == rocblas_side_right and dimension m
if side == rocblas_side_left.

Parameters:
  • handle[in] [rocblas_handle] handle to the rocblas library context queue.

  • side[in] [rocblas_side] specifies the side of diag(x).

  • m[in] [rocblas_int] matrix dimension m.

  • n[in] [rocblas_int] matrix dimension n.

  • A[in] device pointer to the first matrix A_0 on the GPU. Each A_i is of dimension ( lda, n ).

  • lda[in] [rocblas_int] specifies the leading dimension of A.

  • stride_A[in] [rocblas_stride] stride from the start of one matrix (A_i) and the next one (A_i+1).

  • x[in] pointer to the first vector x_0 on the GPU. Each x_i is of dimension n if side == rocblas_side_right and dimension m if side == rocblas_side_left.

  • incx[in] [rocblas_int] specifies the increment between values of x.

  • stride_x[in] [rocblas_stride] stride from the start of one vector(x_i) and the next one (x_i+1).

  • C[inout] device pointer to the first matrix C_0 on the GPU. Each C_i is of dimension ( ldc, n ).

  • ldc[in] [rocblas_int] specifies the leading dimension of C.

  • stride_C[in] [rocblas_stride] stride from the start of one matrix (C_i) and the next one (C_i+1).

  • batch_count[in] [rocblas_int] number of instances i in the batch.

The dgmm_strided_batched functions support the _64 interface. Refer to section ILP64 Interface.

rocblas_Xgemmt + batched, strided_batched#

rocblas_status rocblas_sgemmt(rocblas_handle handle, rocblas_fill uplo, rocblas_operation transA, rocblas_operation transB, rocblas_int n, rocblas_int k, const float *alpha, const float *A, rocblas_int lda, const float *B, rocblas_int ldb, const float *beta, float *C, rocblas_int ldc)#
rocblas_status rocblas_dgemmt(rocblas_handle handle, rocblas_fill uplo, rocblas_operation transA, rocblas_operation transB, rocblas_int n, rocblas_int k, const double *alpha, const double *A, rocblas_int lda, const double *B, rocblas_int ldb, const double *beta, double *C, rocblas_int ldc)#
rocblas_status rocblas_cgemmt(rocblas_handle handle, rocblas_fill uplo, rocblas_operation transA, rocblas_operation transB, rocblas_int n, rocblas_int k, const rocblas_float_complex *alpha, const rocblas_float_complex *A, rocblas_int lda, const rocblas_float_complex *B, rocblas_int ldb, const rocblas_float_complex *beta, rocblas_float_complex *C, rocblas_int ldc)#
rocblas_status rocblas_zgemmt(rocblas_handle handle, rocblas_fill uplo, rocblas_operation transA, rocblas_operation transB, rocblas_int n, rocblas_int k, const rocblas_double_complex *alpha, const rocblas_double_complex *A, rocblas_int lda, const rocblas_double_complex *B, rocblas_int ldb, const rocblas_double_complex *beta, rocblas_double_complex *C, rocblas_int ldc)#

BLAS Level 3 API

gemmt performs matrix-matrix operations and updates the upper or lower triangular part of the result matrix:

C = alpha*op( A )*op( B ) + beta*C,

where op( X ) is one of

op( X ) = X      or
op( X ) = X**T   or
op( X ) = X**H,

alpha and beta are scalars. A, B  are general matrices and C is either an upper or lower triangular matrix, with
op( A ) an n by k matrix, op( B ) a k by n matrix and C an n by n matrix.

Parameters:
  • handle[in] [rocblas_handle] handle to the rocblas library context queue.

  • uplo[in] [rocblas_fill]

    • rocblas_fill_upper: C is an upper triangular matrix

    • rocblas_fill_lower: C is a lower triangular matrix

  • transA[in] [rocblas_operation]

    • rocblas_operation_none: op(A) = A.

    • rocblas_operation_transpose: op(A) = A^T

    • rocblas_operation_conjugate_transpose: op(A) = A^H

  • transB[in] [rocblas_operation]

    • rocblas_operation_none: op(B) = B.

    • rocblas_operation_transpose: op(B) = B^T

    • rocblas_operation_conjugate_transpose: op(B) = B^H

  • n[in] [rocblas_int] number or rows of matrices op( A ), columns of op( B ), and (rows, columns) of C.

  • k[in] [rocblas_int] number of rows of matrices op( B ) and columns of op( A ).

  • alpha[in] device pointer or host pointer specifying the scalar alpha.

  • A[in] device pointer storing matrix A. If transa = rocblas_operation_none, then, the leading n-by-k part of the array contains the matrix A, otherwise the leading k-by-n part of the array contains the matrix A.

  • lda[in] [rocblas_int] specifies the leading dimension of A. If transA == rocblas_operation_none, must have lda >= max(1, n), otherwise, must have lda >= max(1, k).

  • B[in] device pointer storing matrix B. If transB = rocblas_operation_none, then, the leading k-by-n part of the array contains the matrix B, otherwise the leading n-by-k part of the array contains the matrix B.

  • ldb[in] [rocblas_int] specifies the leading dimension of B. If transB == rocblas_operation_none, must have ldb >= max(1, k), otherwise, must have ldb >= max(1, n)

  • beta[in] device pointer or host pointer specifying the scalar beta.

  • C[inout] device pointer storing matrix C on the GPU. If uplo == rocblas_fill_upper, the upper triangular part of the leading n-by-n array contains the matrix C, otherwise the lower triangular part of the leading n-by-n array contains the matrix C.

  • ldc[in] [rocblas_int] specifies the leading dimension of C. Must have ldc >= max(1, n).

The gemmt functions support the _64 interface. Refer to section ILP64 Interface.

rocblas_status rocblas_sgemmt_batched(rocblas_handle handle, rocblas_fill uplo, rocblas_operation transA, rocblas_operation transB, rocblas_int n, rocblas_int k, const float *alpha, const float *const A[], rocblas_int lda, const float *const B[], rocblas_int ldb, const float *beta, float *const C[], rocblas_int ldc, rocblas_int batch_count)#
rocblas_status rocblas_dgemmt_batched(rocblas_handle handle, rocblas_fill uplo, rocblas_operation transA, rocblas_operation transB, rocblas_int n, rocblas_int k, const double *alpha, const double *const A[], rocblas_int lda, const double *const B[], rocblas_int ldb, const double *beta, double *const C[], rocblas_int ldc, rocblas_int batch_count)#
rocblas_status rocblas_cgemmt_batched(rocblas_handle handle, rocblas_fill uplo, rocblas_operation transA, rocblas_operation transB, rocblas_int n, rocblas_int k, const rocblas_float_complex *alpha, const rocblas_float_complex *const A[], rocblas_int lda, const rocblas_float_complex *const B[], rocblas_int ldb, const rocblas_float_complex *beta, rocblas_float_complex *const C[], rocblas_int ldc, rocblas_int batch_count)#
rocblas_status rocblas_zgemmt_batched(rocblas_handle handle, rocblas_fill uplo, rocblas_operation transA, rocblas_operation transB, rocblas_int n, rocblas_int k, const rocblas_double_complex *alpha, const rocblas_double_complex *const A[], rocblas_int lda, const rocblas_double_complex *const B[], rocblas_int ldb, const rocblas_double_complex *beta, rocblas_double_complex *const C[], rocblas_int ldc, rocblas_int batch_count)#

BLAS Level 3 API

gemmt_batched performs matrix-matrix operations and updates the upper or lower triangular part of the result matrix:

C_i = alpha*op( A_i )*op( B_i ) + beta*C_i, for i = 1, ..., batch_count,

where op( X ) is one of

op( X ) = X      or
op( X ) = X**T   or
op( X ) = X**H,

alpha and beta are scalars. A, B  are general matrices and C is either an upper or lower triangular matrix, with

op( A ) an n by k by batch_count matrices,
op( B ) an k by n by batch_count matrices and
C an n by n by batch_count matrices.

Parameters:
  • handle[in] [rocblas_handle handle to the rocblas library context queue.

  • uplo[in] [rocblas_fill]

    • rocblas_fill_upper: C is an upper triangular matrix

    • rocblas_fill_lower: C is a lower triangular matrix

  • transA[in] [rocblas_operation]

    • rocblas_operation_none: op(A_i) = A_i.

    • rocblas_operation_transpose: op(A_i) = A_i^T

    • rocblas_operation_conjugate_transpose: op(A_i) = A_i^H

  • transB[in] [rocblas_operation]

    • rocblas_operation_none: op(B_i) = B_i.

    • rocblas_operation_transpose: op(B_i) = B_i^T

    • rocblas_operation_conjugate_transpose: op(B_i) = B_i^H

  • n[in] [rocblas_int] number or rows of matrices op( A_i ), columns of op( B_i ), and (rows, columns) of C_i.

  • k[in] [rocblas_int] number of rows of matrices op( B_i ) and columns of op( A_i ).

  • alpha[in] device pointer or host pointer specifying the scalar alpha.

  • A[in] device array of device pointers storing each matrix A_i. If transa = rocblas_operation_none, then, the leading n-by-k part of the array contains each matrix A_i, otherwise the leading k-by-n part of the array contains each matrix A_i.

  • lda[in] [rocblas_int] specifies the leading dimension of each A_i. If transA == rocblas_operation_none, must have lda >= max(1, n), otherwise, must have lda >= max(1, k).

  • B[in] device array of device pointers storing each matrix B_i. If transB = rocblas_operation_none, then, the leading k-by-n part of the array contains each matrix B_i, otherwise the leading n-by-k part of the array contains each matrix B_i.

  • ldb[in] [rocblas_int] specifies the leading dimension of each B_i. If transB == rocblas_operation_none, must have ldb >= max(1, k), otherwise, must have ldb >= max(1, n).

  • beta[in] device pointer or host pointer specifying the scalar beta.

  • C[inout] device array of device pointers storing each matrix C_i. If uplo == rocblas_fill_upper, the upper triangular part of the leading n-by-n array contains each matrix C_i, otherwise the lower triangular part of the leading n-by-n array contains each matrix C_i.

  • ldc[in] [rocblas_int] specifies the leading dimension of each C_i. Must have ldc >= max(1, n).

  • batch_count[in] [rocblas_int] number of gemm operations in the batch.

The gemmt_batched functions support the _64 interface. Refer to section ILP64 Interface.

rocblas_status rocblas_sgemmt_strided_batched(rocblas_handle handle, rocblas_fill uplo, rocblas_operation transA, rocblas_operation transB, rocblas_int n, rocblas_int k, const float *alpha, const float *A, rocblas_int lda, rocblas_stride stride_a, const float *B, rocblas_int ldb, rocblas_stride stride_b, const float *beta, float *C, rocblas_int ldc, rocblas_stride stride_c, rocblas_int batch_count)#
rocblas_status rocblas_dgemmt_strided_batched(rocblas_handle handle, rocblas_fill uplo, rocblas_operation transA, rocblas_operation transB, rocblas_int n, rocblas_int k, const double *alpha, const double *A, rocblas_int lda, rocblas_stride stride_a, const double *B, rocblas_int ldb, rocblas_stride stride_b, const double *beta, double *C, rocblas_int ldc, rocblas_stride stride_c, rocblas_int batch_count)#
rocblas_status rocblas_cgemmt_strided_batched(rocblas_handle handle, rocblas_fill uplo, rocblas_operation transA, rocblas_operation transB, rocblas_int n, rocblas_int k, const rocblas_float_complex *alpha, const rocblas_float_complex *A, rocblas_int lda, rocblas_stride stride_a, const rocblas_float_complex *B, rocblas_int ldb, rocblas_stride stride_b, const rocblas_float_complex *beta, rocblas_float_complex *C, rocblas_int ldc, rocblas_stride stride_c, rocblas_int batch_count)#
rocblas_status rocblas_zgemmt_strided_batched(rocblas_handle handle, rocblas_fill uplo, rocblas_operation transA, rocblas_operation transB, rocblas_int n, rocblas_int k, const rocblas_double_complex *alpha, const rocblas_double_complex *A, rocblas_int lda, rocblas_stride stride_a, const rocblas_double_complex *B, rocblas_int ldb, rocblas_stride stride_b, const rocblas_double_complex *beta, rocblas_double_complex *C, rocblas_int ldc, rocblas_stride stride_c, rocblas_int batch_count)#

BLAS Level 3 API

gemmt_strided_batched performs matrix-matrix operations and updates the upper or lower triangular part of the result matrix:

C_i = alpha*op( A_i )*op( B_i ) + beta*C_i, for i = 1, ..., batch_count,

where op( X ) is one of

op( X ) = X      or
op( X ) = X**T   or
op( X ) = X**H,

alpha and beta are scalars. A, B  are general matrices and C is either an upper or lower triangular matrix, with
op( A ) an n by k by batch_count strided_batched matrix,
op( B ) an k by n by batch_count strided_batched matrix and
C an n by n by batch_count strided_batched matrix.

Parameters:
  • handle[in] [rocblas_handle] handle to the rocblas library context queue.

  • uplo[in] [rocblas_fill]

    • rocblas_fill_upper: C is an upper triangular matrix

    • rocblas_fill_lower: C is a lower triangular matrix

  • transA[in] [rocblas_operation]

    • rocblas_operation_none: op(A_i) = A_i.

    • rocblas_operation_transpose: op(A_i) = A_i^T

    • rocblas_operation_conjugate_transpose: op(A_i) = A_i^H

  • transB[in] [rocblas_operation]

    • rocblas_operation_none: op(B_i) = B_i.

    • rocblas_operation_transpose: op(B_i) = B_i^T

    • rocblas_operation_conjugate_transpose: op(B_i) = B_i^H

  • n[in] [rocblas_int] number or rows of matrices op( A_i ), columns of op( B_i ), and (rows, columns) of C_i.

  • k[in] [rocblas_int] number of rows of matrices op( B_i ) and columns of op( A_i ).

  • alpha[in] device pointer or host pointer specifying the scalar alpha.

  • A[in] device array of device pointers storing each matrix A_i. If transa = rocblas_operation_none, then, the leading n-by-k part of the array contains each matrix A_i, otherwise the leading k-by-n part of the array contains each matrix A_i.

  • lda[in] [rocblas_int] specifies the leading dimension of each A_i. If transA == rocblas_operation_none, must have lda >= max(1, n), otherwise, must have lda >= max(1, k).

  • stride_a[in] [rocblas_stride] stride from the start of one A_i matrix to the next A_(i + 1).

  • B[in] device array of device pointers storing each matrix B_i. If transB = rocblas_operation_none, then, the leading k-by-n part of the array contains each matrix B_i, otherwise the leading n-by-k part of the array contains each matrix B_i.

  • ldb[in] [rocblas_int] specifies the leading dimension of each B_i. If transB == rocblas_operation_none, must have ldb >= max(1, k), otherwise, must have ldb >= max(1, n).

  • stride_b[in] [rocblas_stride] stride from the start of one B_i matrix to the next B_(i + 1).

  • beta[in] device pointer or host pointer specifying the scalar beta.

  • C[inout] device array of device pointers storing each matrix C_i. If uplo == rocblas_fill_upper, the upper triangular part of the leading n-by-n array contains each matrix C_i, otherwise the lower triangular part of the leading n-by-n array contains each matrix C_i.

  • ldc[in] [rocblas_int] specifies the leading dimension of each C_i. Must have ldc >= max(1, n).

  • stride_c[in] [rocblas_stride] stride from the start of one C_i matrix to the next C_(i + 1).

  • batch_count[in] [rocblas_int] number of gemm operatons in the batch.

The gemmt_strided_batched functions support the _64 interface. Refer to section ILP64 Interface.