rocblas_gemm_strided_batched_ex Interface Reference

HIPFORT API Reference: hipfort_rocblas::rocblas_gemm_strided_batched_ex Interface Reference

BLAS EX API.

Public Member Functions

integer(kind(rocblas_status_success)) function rocblas_gemm_strided_batched_ex_ (handle, transA, transB, m, n, k, alpha, a, a_type, lda, stride_a, b, b_type, ldb, stride_b, beta, c, c_type, ldc, stride_c, d, d_type, ldd, stride_d, batch_count, compute_type, algo, solution_index, flags)
 

Detailed Description

BLAS EX API.

gemm_strided_batched_ex performs one of the strided_batched matrix-matrix operations

D_i = alpha*op(A_i)*op(B_i) + beta*C_i, for i = 1, ..., batch_count

where op( X ) is one of

op( X ) = X      or
op( X ) = X**T   or
op( X ) = X**H,

alpha and beta are scalars, and A, B, C, and D are strided_batched matrices, with op( A ) an m by k by batch_count strided_batched matrix, op( B ) a k by n by batch_count strided_batched matrix and C and D are m by n by batch_count strided_batched matrices.

The strided_batched matrices are multiple matrices separated by a constant stride. The number of matrices is batch_count.
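
The operation and the role of the strides can be illustrated with a small host-side reference. This is only a sketch under stated assumptions, not rocBLAS code: the function name ref_gemm_strided_batched_f32 is hypothetical, it handles single precision with op( X ) = X only, it assumes column-major storage as in BLAS, and it indexes batches from 0. Unlike a tuned library it always reads C, even when beta is zero.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical host reference for D_i = alpha*A_i*B_i + beta*C_i, op(X) = X.
 * Column-major: element (row, col) of a matrix with leading dimension ld
 * is stored at index row + col*ld. */
static void ref_gemm_strided_batched_f32(
    int m, int n, int k, float alpha,
    const float *a, int lda, int64_t stride_a,
    const float *b, int ldb, int64_t stride_b,
    float beta,
    const float *c, int ldc, int64_t stride_c,
    float *d, int ldd, int64_t stride_d,
    int batch_count)
{
    /* Each leading dimension must cover the stored rows of its matrix. */
    assert(lda >= m && ldb >= k && ldc >= m && ldd >= m);

    for (int i = 0; i < batch_count; i++) {
        /* Matrix i of each operand starts i strides after the first one. */
        const float *a_i = a + i * stride_a;
        const float *b_i = b + i * stride_b;
        const float *c_i = c + i * stride_c;
        float       *d_i = d + i * stride_d;

        for (int col = 0; col < n; col++) {
            for (int row = 0; row < m; row++) {
                float acc = 0.0f;
                for (int p = 0; p < k; p++)
                    acc += a_i[row + (int64_t)p * lda] * b_i[p + (int64_t)col * ldb];
                d_i[row + (int64_t)col * ldd] =
                    alpha * acc + beta * c_i[row + (int64_t)col * ldc];
            }
        }
    }
}
```

A reference like this is mainly useful for checking GPU results on small inputs; the strides are counted in elements, not bytes.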

Supported types are as follows:

  • rocblas_datatype_f64_r = a_type = b_type = c_type = d_type = compute_type
  • rocblas_datatype_f32_r = a_type = b_type = c_type = d_type = compute_type
  • rocblas_datatype_f16_r = a_type = b_type = c_type = d_type = compute_type
  • rocblas_datatype_f16_r = a_type = b_type = c_type = d_type; rocblas_datatype_f32_r = compute_type
  • rocblas_datatype_bf16_r = a_type = b_type = c_type = d_type; rocblas_datatype_f32_r = compute_type
  • rocblas_datatype_i8_r = a_type = b_type; rocblas_datatype_i32_r = c_type = d_type = compute_type
  • rocblas_datatype_f32_c = a_type = b_type = c_type = d_type = compute_type
  • rocblas_datatype_f64_c = a_type = b_type = c_type = d_type = compute_type
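
The combinations listed above can be read as a small decision rule. The sketch below encodes that rule for illustration; the enum dt and the function gemm_ex_types_supported are hypothetical local stand-ins, and the enumerator values are arbitrary rather than those of the real rocblas_datatype. It reflects only the list in this section, not what a particular rocBLAS build accepts.

```c
#include <stdbool.h>

/* Local stand-ins for the datatypes named in the list above.
 * The numeric values are arbitrary and NOT the real rocblas_datatype values. */
typedef enum {
    DT_F16_R, DT_F32_R, DT_F64_R, DT_BF16_R,
    DT_I8_R, DT_I32_R, DT_F32_C, DT_F64_C
} dt;

/* Returns true when (a, b, c, d, compute) matches one of the listed rows. */
static bool gemm_ex_types_supported(dt a, dt b, dt c, dt d, dt compute)
{
    /* Every listed row has a_type == b_type and c_type == d_type. */
    if (a != b || c != d)
        return false;

    /* int8 inputs: int32 outputs and int32 computation. */
    if (a == DT_I8_R)
        return c == DT_I32_R && compute == DT_I32_R;

    /* All remaining rows use one type for inputs and outputs. */
    if (a != c)
        return false;

    switch (a) {
    case DT_F64_R:
    case DT_F32_R:
    case DT_F32_C:
    case DT_F64_C:
        return compute == a;                                   /* uniform type */
    case DT_F16_R:
        return compute == DT_F16_R || compute == DT_F32_R;     /* f16 or f32 compute */
    case DT_BF16_R:
        return compute == DT_F32_R;                            /* f32 compute only */
    default:
        return false;
    }
}
```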

ROCm 4.2 supports two different versions of a = b = i8_r (in) and c = d = i32_r (out):

  • Both versions are rocblas_datatype_i8_r = a_type = b_type; rocblas_datatype_i32_r = c_type = d_type = compute_type. They differ in the last parameter, flags, which indicates whether the input is packed.
     - With flags left unset (default = none), inputs a and b are not packed into int8x4, so the size restrictions and packing pseudo-code below are not necessary. This version is supported on gfx908 or later GPUs only.
     - With flags |= rocblas_gemm_flags_pack_int8x4, inputs a and b are packed into int8x4, which imposes some size restrictions on A or B (see below). GPUs before gfx908 support only this packed-int8 version, so the flag and the packing are required there, while gfx908 GPUs support both versions.

Below are restrictions for rocblas_datatype_i8_r = a_type = b_type; rocblas_datatype_i32_r = c_type = d_type = compute_type; flags |= rocblas_gemm_flags_pack_int8x4:

  • k must be a multiple of 4
  • lda must be a multiple of 4 if transA == rocblas_operation_transpose
  • ldb must be a multiple of 4 if transB == rocblas_operation_none
  • for transA == rocblas_operation_none or transB == rocblas_operation_transpose, each group of 4 consecutive values in the k dimension of A and B must be packed together. This packing can be achieved with the following pseudo-code. The code assumes the original matrices are in A and B, and the packed matrices are A_packed and B_packed. The A_packed matrix has the same size as the A matrix, and the B_packed matrix has the same size as the B matrix.
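
    if(transa == rocblas_operation_none)
    {
        // A is m by k: group every 4 consecutive k values of a row together
        int nb = 4;
        for(int i_m = 0; i_m < m; i_m++)
        {
            for(int i_k = 0; i_k < k; i_k++)
            {
                a_packed[i_k % nb + (i_m + (i_k / nb) * lda) * nb] = a[i_m + i_k * lda];
            }
        }
    }
    else
    {
        // transposed A already has k as its fastest dimension
        a_packed = a;
    }
    if(transb == rocblas_operation_transpose)
    {
        // B is n by k: group every 4 consecutive k values of a row together
        int nb = 4;
        for(int i_n = 0; i_n < n; i_n++)
        {
            for(int i_k = 0; i_k < k; i_k++)
            {
                b_packed[i_k % nb + (i_n + (i_k / nb) * ldb) * nb] = b[i_n + i_k * ldb];
            }
        }
    }
    else
    {
        // non-transposed B already has k as its fastest dimension
        b_packed = b;
    }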
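
The restrictions and the packing loop above can be tried out on the host with a short C sketch. The names int8x4_dims_ok and pack_int8x4 are hypothetical helpers written for this illustration and are not part of rocBLAS or hipfort; the booleans trans_a and trans_b stand for "the operation is a transpose" and replace the rocblas_operation values. The same routine serves A (rows = m, ld = lda) and transposed B (rows = n, ld = ldb).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Checks the size restrictions listed for rocblas_gemm_flags_pack_int8x4.
 * trans_a / trans_b are true when the operand is used transposed. */
static bool int8x4_dims_ok(bool trans_a, bool trans_b, int k, int lda, int ldb)
{
    if (k % 4 != 0)
        return false;                 /* k must be a multiple of 4 */
    if (trans_a && lda % 4 != 0)
        return false;                 /* lda restriction for transposed A */
    if (!trans_b && ldb % 4 != 0)
        return false;                 /* ldb restriction for non-transposed B */
    return true;
}

/* Packs a column-major (rows x k) int8 matrix so that each group of 4
 * consecutive k values of a row becomes contiguous in dst.
 * dst must hold ld*k elements, the same size as src. */
static void pack_int8x4(int rows, int k, const int8_t *src, int ld, int8_t *dst)
{
    const int nb = 4;
    assert(k % nb == 0 && ld >= rows);
    for (int r = 0; r < rows; r++)
        for (int i_k = 0; i_k < k; i_k++)
            dst[i_k % nb + (r + (i_k / nb) * ld) * nb] = src[r + i_k * ld];
}
```

For rows = 2, k = 4 and ld = 2, source bytes 0 through 7 are rearranged to 0, 2, 4, 6, 1, 3, 5, 7: the four k values of row 0 first, then the four k values of row 1.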

Parameters

  [in]  handle          [rocblas_handle] handle to the rocblas library context queue.
  [in]  transA          [rocblas_operation] specifies the form of op( A ).
  [in]  transB          [rocblas_operation] specifies the form of op( B ).
  [in]  m               [rocblas_int] matrix dimension m.
  [in]  n               [rocblas_int] matrix dimension n.
  [in]  k               [rocblas_int] matrix dimension k.
  [in]  alpha           [void *] device pointer or host pointer specifying the scalar alpha. Same datatype as compute_type.
  [in]  a               [void *] device pointer pointing to first matrix A_1.
  [in]  a_type          [rocblas_datatype] specifies the datatype of each matrix A_i.
  [in]  lda             [rocblas_int] specifies the leading dimension of each A_i.
  [in]  stride_a        [rocblas_stride] specifies stride from start of one A_i matrix to the next A_(i + 1).
  [in]  b               [void *] device pointer pointing to first matrix B_1.
  [in]  b_type          [rocblas_datatype] specifies the datatype of each matrix B_i.
  [in]  ldb             [rocblas_int] specifies the leading dimension of each B_i.
  [in]  stride_b        [rocblas_stride] specifies stride from start of one B_i matrix to the next B_(i + 1).
  [in]  beta            [void *] device pointer or host pointer specifying the scalar beta. Same datatype as compute_type.
  [in]  c               [void *] device pointer pointing to first matrix C_1.
  [in]  c_type          [rocblas_datatype] specifies the datatype of each matrix C_i.
  [in]  ldc             [rocblas_int] specifies the leading dimension of each C_i.
  [in]  stride_c        [rocblas_stride] specifies stride from start of one C_i matrix to the next C_(i + 1).
  [out] d               [void *] device pointer storing each matrix D_i.
  [in]  d_type          [rocblas_datatype] specifies the datatype of each matrix D_i.
  [in]  ldd             [rocblas_int] specifies the leading dimension of each D_i.
  [in]  stride_d        [rocblas_stride] specifies stride from start of one D_i matrix to the next D_(i + 1).
  [in]  batch_count     [rocblas_int] number of gemm operations in the batch.
  [in]  compute_type    [rocblas_datatype] specifies the datatype of computation.
  [in]  algo            [rocblas_gemm_algo] enumerant specifying the algorithm type.
  [in]  solution_index  [int32_t] reserved for future use.
  [in]  flags           [uint32_t] optional gemm flags.
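
The leading-dimension and stride parameters interact when buffers are sized. The C sketch below shows one common way to compute them, assuming column-major storage and back-to-back matrices; the helper names stored_cols, contiguous_stride and batch_elems are hypothetical and do not come from rocBLAS or hipfort. All quantities are counted in elements, not bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Number of stored columns of one operand whose op(X) is op_rows by op_cols:
 * a transposed operand is stored as op_cols by op_rows. */
static int stored_cols(bool transposed, int op_rows, int op_cols)
{
    return transposed ? op_rows : op_cols;
}

/* Smallest stride that places consecutive matrices back to back
 * without overlap: one full matrix of ld * cols elements. */
static int64_t contiguous_stride(int ld, int cols)
{
    return (int64_t)ld * cols;
}

/* Elements a buffer must hold for batch_count matrices: the start of the
 * last matrix plus the extent of that last matrix. */
static int64_t batch_elems(int ld, int cols, int64_t stride, int batch_count)
{
    assert(batch_count >= 1 && stride >= 0);
    return stride * (batch_count - 1) + (int64_t)ld * cols;
}
```

For example, with m = 3, k = 5, lda = 4 and a non-transposed A, each A_i stores 5 columns, so a contiguous stride_a is 20 elements and a batch of 10 needs 200 elements.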

Member Function/Subroutine Documentation

◆ rocblas_gemm_strided_batched_ex_()

integer(kind(rocblas_status_success)) function hipfort_rocblas::rocblas_gemm_strided_batched_ex::rocblas_gemm_strided_batched_ex_ ( type(c_ptr), value  handle,
integer(kind(rocblas_operation_none)), value  transA,
integer(kind(rocblas_operation_none)), value  transB,
integer(c_int), value  m,
integer(c_int), value  n,
integer(c_int), value  k,
type(c_ptr), value  alpha,
type(c_ptr), value  a,
integer(kind(rocblas_datatype_f16_r)), value  a_type,
integer(c_int), value  lda,
integer(c_int64_t), value  stride_a,
type(c_ptr), value  b,
integer(kind(rocblas_datatype_f16_r)), value  b_type,
integer(c_int), value  ldb,
integer(c_int64_t), value  stride_b,
type(c_ptr), value  beta,
type(c_ptr), value  c,
integer(kind(rocblas_datatype_f16_r)), value  c_type,
integer(c_int), value  ldc,
integer(c_int64_t), value  stride_c,
type(c_ptr), value  d,
integer(kind(rocblas_datatype_f16_r)), value  d_type,
integer(c_int), value  ldd,
integer(c_int64_t), value  stride_d,
integer(c_int), value  batch_count,
integer(kind(rocblas_datatype_f16_r)), value  compute_type,
integer(kind(rocblas_gemm_algo_standard)), value  algo,
integer(c_int32_t), value  solution_index,
integer(c_int), value  flags 
)

The documentation for this interface was generated from the following file: