API reference guide#

This document provides information about rocWMMA functions, data types, and other programming constructs.

Synchronous API#

rocWMMA API functions such as load_matrix_sync, store_matrix_sync, and mma_sync are synchronous when used with global memory. However, when you use these functions with shared memory, for example, LDS memory, explicit workgroup synchronization (synchronize_workgroup) might be required.

Supported GPU architectures#

Supported CDNA architectures (wave64):

  • gfx908

  • gfx90a

  • gfx942

  • gfx950

Note

gfx9 refers to gfx908, gfx90a, gfx942, and gfx950.

Supported RDNA architectures (wave32):

  • gfx1100

  • gfx1101

  • gfx1102

  • gfx1200

  • gfx1201

Note

gfx11 refers to gfx1100, gfx1101, and gfx1102. gfx12 refers to gfx1200 and gfx1201.

Supported data types#

rocWMMA mixed precision multiply-accumulate operations support the following data type combinations.

Data Types <Ti / To / Tc> = <Input type / Output Type / Compute Type>, where:

  • Input Type = Matrix A / B

  • Output Type = Matrix C / D

  • Compute Type = Math / Accumulation type

Supported data types:

  • i8: 8-bit precision integer

  • f8: 8-bit precision floating point

  • bf8: 8-bit precision brain floating point

  • f16: half-precision floating point

  • bf16: half-precision brain floating point

  • f32: single-precision floating point

  • i32: 32-bit precision integer

  • xf32: single-precision tensor floating point

  • f64: double-precision floating point

Note

f16 includes support for both _Float16 and __half types.

f8 NANOO (optimized) format is only supported on gfx942, otherwise f8 OCP is assumed on targets that support f8 datatypes.

Ti / To / Tc

BlockM

BlockN

BlockK Range* (Powers of 2)

CDNA Support

RDNA Support

bf8 / f32 / f32

16

16

32+

gfx940, gfx950

gfx12

32

32

16+

-

f8 / f32 / f32

16

16

32+

gfx940, gfx950

gfx12

32

32

16+

-

i8 / i32 / i32

16

16

16

gfx908, gfx90a

gfx11, gfx12

32

gfx940, gfx950

-

64+

gfx950

-

32

32

8

gfx908, gfx90a

-

16

gfx940, gfx950

-

32+

gfx950

-

i8 / i8 / i32

16

16

16

gfx908, gfx90a

gfx11, gfx12

32

gfx940, gfx950

-

64+

gfx950

-

32

32

8

gfx908, gfx90a

-

16

gfx940, gfx950

-

32+

gfx950

-

f16 / f32 / f32

16

16

16

gfx9

gfx11, gfx12

32+

gfx950

-

32

32

8

gfx9

-

16+

gfx950

-

f16 / f16 / f32

16

16

16

gfx9

gfx11, gfx12

32+

gfx950

-

32

32

8

gfx9

-

16+

gfx950

-

f16 / f16 / f16**

16

16

16

gfx9

gfx11, gfx12

32+

gfx950

-

32

32

8

gfx9

-

16+

gfx950

-

bf16 / f32 / f32

16

16

8

gfx908

-

16

gfx90a, gfx942, gfx950

gfx11, gfx12

32+

gfx950

-

32

32

4+

gfx908

-

8

gfx90a, gfx942, gfx950

-

16+

gfx950

-

bf16 / bf16 / f32

16

16

8

gfx908

-

16

gfx90a, gfx942, gfx950

gfx11, gfx12

32+

gfx950

-

32

32

4+

gfx908

-

8

gfx90a, gfx942, gfx950

-

16+

gfx950

-

bf16 / bf16 / bf16**

16

16

8

gfx908

-

16

gfx90a, gfx942, gfx950

gfx11, gfx12

32+

gfx950

-

32

32

4+

gfx908

-

8

gfx90a, gfx942, gfx950

-

16+

gfx950

-

f32 / f32 / f32

16

16

4+

gfx9

-

32

32

2+

gfx9

-

xf32 / xf32 / xf32

16

16

8+

gfx942

-

32

32

4+

f64 / f64 / f64

16

16

4+

gfx90a, gfx942, gfx950

-

Note

BlockM/N values are minimum recommended values. Below these values padding is used which may impact performance. Above this value powers of 2 are acceptable.

* BlockK range specifies the minimum recommended value. Below this value padding is used which may impact performance. Above this value powers of 2 are acceptable. In practice, BlockK values are typically 32 or less.

** On CDNA architectures, matrix unit accumulation is performed in natively 32-bit precision and then converted to the target data type.

Note

rocWMMA supports partial fragment sizes where FragMNK may be smaller than the BlockMNK sizes listed in the table above. These fragments are internally padded to nearest supported BlockMNK sizes.

Supported matrix layouts#

(N = col major, T = row major)

LayoutA

LayoutB

Layout C

LayoutD

N

N

N

N

N

N

T

T

N

T

N

N

N

T

T

T

T

N

N

N

T

N

T

T

T

T

N

N

T

T

T

T

Supported thread block sizes#

rocWMMA supports up to four wavefronts per thread block. The X dimension should be a multiple of the wave size and is scaled accordingly.

TBlock_X

TBlock_Y

WaveSize

1

WaveSize

2

WaveSize

4

WaveSize*2

1

WaveSize*2

2

WaveSize*4

1

Note

WaveSize (RDNA) = 32

WaveSize (CDNA) = 64

Using rocWMMA API#

This section describes how to use the rocWMMA library API.

rocWMMA datatypes#

matrix_a#

struct matrix_a#

Meta-tag indicating data context is input Matrix A.

matrix_b#

struct matrix_b#

Meta-tag indicating data context is input Matrix B.

accumulator#

struct accumulator#

Meta-tag indicating data context is Accumulator (also used as Matrix C / D).

row_major#

struct row_major#

Meta-tag indicating 2D in-memory data layout as row major.

col_major#

struct col_major#

Meta-tag indicating 2D in-memory data layout as column major.

default_schedule#

typedef IOScheduler::Default rocwmma::fragment_scheduler::default_schedule#

The default fragment scheduler; each wave operates independently.

coop_row_major_2d#

typedef IOScheduler::RowMajor2d<TBlockX, TBlockY> rocwmma::fragment_scheduler::coop_row_major_2d#

A cooperative scheduling strategy where each wave in the 2d threadblock will contribute to the fragment operation in row_major grid order. All waves are scheduled in row_major order. E.g. (TBlockX, TBlockY) => 2x2 waves w0 = (0, 0), w1 = (0, 1), w2 = (1, 0), w3 = (1, 1)

Template Parameters:
  • TBlockX – the size of the thread-block in the X dimension

  • TBlockY – the size of the thread-block in the Y dimension

coop_col_major_2d#

typedef IOScheduler::ColMajor2d<TBlockX, TBlockY> rocwmma::fragment_scheduler::coop_col_major_2d#

A cooperative scheduling strategy where each wave in the 2d threadblock will contribute to the fragment operation in col_major grid order. All waves are scheduled in row_major order. E.g. (TBlockX, TBlockY) => 2x2 waves w0 = (0, 0), w2 = (0, 1), w1 = (1, 0), w3 = (1, 1)

Template Parameters:
  • TBlockX – the size of the thread-block in the X dimension

  • TBlockY – the size of the thread-block in the Y dimension

coop_row_slice_2d#

typedef IOScheduler::RowSlice2d<TBlockX, TBlockY> rocwmma::fragment_scheduler::coop_row_slice_2d#

A cooperative scheduling strategy where each row of waves in the 2d threadblock will contribute to the fragment operation. Waves are partitioned into rows. Only waves in the same row participate together. E.g. (TBlockX, TBlockY) = 2x2 waves RowSlice0: w0 = (0, 0), w1 = (0, 1) RowSlice1: w0 = (1, 0), w1 = (1, 1)

Template Parameters:
  • TBlockX – the size of the thread-block in the X dimension

  • TBlockY – the size of the thread-block in the Y dimension

coop_col_slice_2d#

typedef IOScheduler::ColSlice2d<TBlockX, TBlockY> rocwmma::fragment_scheduler::coop_col_slice_2d#

A cooperative scheduling strategy where each col of waves in the 2d threadblock will contribute to the fragment operation. Waves are partitioned into cols. Only waves in the same col participate together. E.g. (TBlockX, TBlockY) = 2x2 waves ColSlice0: ColSlice1: w0 = (0, 0), w0 = (0, 1), w1 = (1, 0) w1 = (1, 1)

Template Parameters:
  • TBlockX – the size of the thread-block in the X dimension

  • TBlockY – the size of the thread-block in the Y dimension

single#

typedef IOScheduler::Single<TBlockX, TBlockY, WaveIdx> rocwmma::fragment_scheduler::single#

A cooperative scheduling strategy where only one wave in the thread block will participate.

Template Parameters:
  • TBlockX – the size of the thread-block in the X dimension

  • TBlockY – the size of the thread-block in the Y dimension

  • WaveIdx – the index of the wave which will participate

fragment#

template<typename MatrixT, uint32_t FragM, uint32_t FragN, uint32_t FragK, typename DataT, typename DataLayoutT = void, typename Scheduler = fragment_scheduler::default_schedule>
class fragment#

rocWMMA fragment class. This is the primary object used in block-wise decomposition of the matrix multiply-accumulate (mma) problem space. In general, fragment data is associated with a matrix context (matrix_a, matrix_b or accumulator), a block size (BlockM/N/K), a datatype (e.g. single-precision float, etc.) and an in-memory 2D layout (e.g. row_major or col_major). These fragment properties are used to define how data is handled and stored locally, and to drive API implementations for loading / storing, mma and transforms. Fragment abstractions are designed to promote a simple wavefront programming model, which can accelerate development time. Internal thread-level details are handled by rocWMMA which frees the user to focus on wavefront block-wise decomposition. Written purely in device code, the programmer can use this object in their own device kernels.

Note

Fragments are stored in packed registers, however vector elements have no guaranteed order or locality.

Template Parameters:
  • MatrixT – fragment context

  • FragM/N/K – fragment dimensions

  • DataT – datatype

  • DataLayoutT – in-memory layout as col_major or row_major

  • Scheduler – wave-wise scheduler

Public Types

using IOTraits = typename IOConfig<MatrixT, FragM, FragN, FragK, DataT, DataLayoutT, Scheduler>::IOTraits#

Input / output traits specific to AMDGCN architecture.

Public Functions

inline DataT &operator[](uint32_t index)#
Parameters:

index – Element index

Returns:

Mutable unpacked element accessor at given index

inline DataT const &operator[](uint32_t index) const#
Parameters:

index – Element index

Returns:

Immutable unpacked element accessor at given index

inline Traits::StorageT &operator*()#
Returns:

Mutable packed storage vector accessor

inline Traits::StorageT const &operator*() const#
Returns:

Immutable packed storage vector accessor

Public Members

union rocwmma::fragment::[anonymous] [anonymous]#

Internal data storage views. Compatibility with nvcuda::wmma.

Public Static Functions

static inline constexpr uint32_t height()#
Returns:

The geometric height of fragment

static inline constexpr uint32_t width()#
Returns:

The geometric width of fragment

static inline constexpr uint32_t blockDim()#
Returns:

The leading block dimension (non-K)

static inline constexpr uint32_t kDim()#
Returns:

The k dimension

static inline constexpr uint32_t size()#
Returns:

The size of the unpacked elements vector

struct Traits#

Public Types

using AccessT = VecT<UnpackedElementT, Size>#

Unpacked data access view.

using StorageT = VecT<PackedElementT, IOTraits::PackedSize / WaveCount>#

Packed data storage view.

Public Static Attributes

static constexpr uint32_t Size = IOTraits::UnpackedSize / WaveCount#

Assert the fragment occupies at least one packed register.

Assert the fragment is equally splittable among the wave count

rocWMMA enumeration#

layout_t#

enum rocwmma::layout_t#

Values:

enumerator mem_row_major#
enumerator mem_col_major#

rocWMMA API functions#

template<typename FragT, typename DataT>
void rocwmma::fill_fragment(FragT &frag, DataT value)#

Fills the entire fragment with the desired value.

Parameters:
  • frag – Fragment of type MatrixT with its associated block sizes, data type and layout

  • value – Fill value of type DataT

Template Parameters:
  • FragT – Opaque fragment type

  • DataT – Datatype

template<typename FragT, typename DataT>
void rocwmma::load_matrix_sync(FragT &frag, const DataT *data, uint32_t ldm)#

Loads the entire fragment from the data pointer according to its matrix and data layout contexts. Data pointer may point to either local or global memory.

Parameters:
  • frag – Fragment of type MatrixT with its associated block sizes, data type and layout

  • data – Data pointer to global or local memory

  • ldm – Leading dimension size

Template Parameters:
  • FragT – Opaque fragment type

  • DataT – Datatype

template<typename FragT, typename DataT>
void rocwmma::load_matrix_sync(FragT &frag, const DataT *data, uint32_t ldm, layout_t layout)#

Loads the entire fragment from the data pointer according to its matrix layout and data layout contexts. Data pointer may point to either local or global memory. This overload provides manual selection of data layout of the incoming memory pointer, which will be transformed to conform to the data layout of the fragment.

Parameters:
  • frag – Fragment of type MatrixT with its associated block sizes, data type and layout

  • data – Data pointer to global/local memory

  • ldm – Leading dimension size

  • layout – Data layout

Template Parameters:
  • FragT – Opaque fragment type

  • DataT – Datatype

template<typename FragT, typename DataT>
void rocwmma::store_matrix_sync(DataT *data, FragT const &frag, uint32_t ldm)#

Stores the entire fragment to the data pointer according to its matrix and data layouts. Data pointer may point to either local or global memory.

Parameters:
  • frag – Fragment of type MatrixT with its associated block sizes, data type and layout

  • data – Data pointer to global/local memory

  • ldm – Leading dimension size

Template Parameters:
  • FragT – Opaque fragment type

  • DataT – Datatype

template<typename FragT, typename DataT>
void rocwmma::store_matrix_sync(DataT *data, FragT const &frag, uint32_t ldm, layout_t layout)#

Stores the entire fragment to the data pointer according to its matrix layout and data layout contexts. Data pointer may point to either local or global memory. This overload provides manual selection of data layout of the outgoing memory pointer, which the data layout of the fragment will be transformed to.

Parameters:
  • frag – Fragment of type MatrixT with its associated block sizes, data type and layout

  • data – Data pointer to global/local memory

  • ldm – Leading dimension size

  • layout – Data layout

Template Parameters:
  • FragT – Opaque fragment type

  • DataT – Datatype

template<typename FragA, typename FragB, typename FragAccumIn, typename FragAccumOut>
void rocwmma::mma_sync(FragAccumOut &d, FragA const &a, FragB const &b, FragAccumIn &c)#

Performs the Multiply-Accumulate operation on the fragments A, B, C and D (D = A * B + C)

Note

Frag c = d is valid

Parameters:
  • d – Accumulator output D

  • a – Input fragment A

  • b – Input fragment B

  • c – Input accumulator fragment C

Template Parameters:
  • FragA – Opaque fragment type for matrix A data

  • FragB – Opaque fragment type for matrix A data

  • FragAccumIn – Opaque fragment type for input accumulation data

  • FragAccumOut – Opaque fragment type for output accumulation data

void rocwmma::synchronize_workgroup()#

Synchronization point for all wavefronts in a workgroup. Guarantees pending reads / writes to LDS are flushed.

rocWMMA transforms API functions#

template<typename FragT>
static inline T rocwmma::apply_transpose(FragT &&frag)#

Applies the transpose transform the input fragment. Transpose is defined as orthogonal matrix and data layout. E.g. T(fragment<matrix_a, BlockM, BlockN, BlockK, DataT, row_major>) = fragment<matrix_b, BlockN, BlockM, BlockK, DataT, col_major>

Parameters:

frag – Fragment of type MatrixT with its associated block sizes, data type and layout

Template Parameters:

FragT – The incoming fragment type

Returns:

Transposed (orthogonal) fragment

template<typename DataLayoutT, typename FragT>
static inline T rocwmma::apply_data_layout(FragT &&frag)#

Transforms the input fragment to have the desired data layout.

Parameters:

frag – Fragment of type MatrixT with its associated block sizes, data type and layout

Template Parameters:
  • DataLayoutT – The desired fragment data layout to apply

  • FragT – The incoming fragment type

Returns:

Fragment with transformed data layout

template<typename DstFragT, typename FragT>
static inline T rocwmma::apply_fragment(FragT &&frag)#

Transforms the input fragment to the target fragment type. This could include changing matrix context and/or changing data layout, as long as there is a path from the source register layout to the destination register layout.

Parameters:

frag – Source fragment of type MatrixT with its associated block sizes, data type and layout

Template Parameters:
  • DstFragT – The target fragment type to transform to

  • FragT – The source incoming fragment type

Returns:

Target fragment after transformation

template<typename FragT>
static inline T rocwmma::to_register_file(FragT &&frag)#

Transforms the input fragment to a “register file” fragment type. Register contents are directly mapped to a 2D matrix space represented by [RegCount x WaveSize]. This transform is a geometry reinterpretation.

Parameters:

frag – Source fragment of type MatrixT with its associated block sizes, data type and layout

Template Parameters:

FragT – The source incoming fragment type

Returns:

Target fragment after transformation

template<typename DstFragT, typename FragT>
static inline T rocwmma::from_register_file(FragT &&frag)#

Transforms the “register file” fragment type to a target fragment type. Register contents are directly mapped to a 2D matrix space represented by [RegCount x WaveSize]. This transform is a geometry reinterpretation.

Parameters:

frag – Source fragment of type MatrixT with its associated block sizes, data type and layout

Template Parameters:
  • DstFragT – The target frag to transform to

  • FragT – The source incoming fragment type as register file

Returns:

Fragment after transformation

Sample programs#

A sample demonstrating the use of rocWMMA functions load_matrix_sync, store_matrix_sync, fill_fragment, and mma_sync is available here. For more sample programs, refer to the samples directory.

Emulation tests#

The emulation test is a smaller test suite designed for emulators. It includes a subset of ROCWMMA test cases for faster execution on emulated platforms. It supports smoke, regression, and extended modes.

For example, to run a smoke test:

rtest.py --install_dir <build_dir> --emulation smoke