API reference guide#

This document provides information about rocWMMA functions, data types, and other programming constructs.

Synchronous API#

In general, rocWMMA API functions ( load_matrix_sync, store_matrix_sync, mma_sync ) are assumed to be synchronous when used in the context of global memory.

When using these functions in the context of shared memory (e.g. LDS memory), additional explicit workgroup synchronization may be required due to the nature of this memory usage.

Supported data types#

rocWMMA mixed precision multiply-accumulate operations support the following data type combinations.

Data Types <Ti / To / Tc> = <Input type / Output Type / Compute Type>

where,

Input Type = Matrix A/B

Output Type = Matrix C/D

Compute Type = Math / accumulation type

Ti / To / Tc

BlockM

BlockN

BlockK

i8 / i32 / i32

16

16

Min: 16, pow2

i8 / i32 / i32

32

32

Min: 8, pow2

i8 / i8 / i32

16

16

Min: 16, pow2

i8 / i32 / i32

32

32

Min: 8, pow2

f16 / f32 / f32

16

16

Min: 16, pow2

f16 / f32 / f32

32

32

Min: 8, pow2

f16 / f16 / f32

16

16

Min: 16, pow2

f16 / f16 / f32

32

32

Min: 8, pow2

f16 / f16 / f16*

16

16

Min: 16, pow2

f16 / f16 / f16*

32

32

Min: 8, pow2

__half / f32 / f32

16

16

Min: 16, pow2

__half / f32 / f32

32

32

Min: 8, pow2

__half / __half / f32

16

16

Min: 16, pow2

__half / __half / f32

32

32

Min: 8, pow2

__half / __half / __half*

16

16

Min: 16, pow2

__half / __half / __half*

32

32

Min: 8, pow2

bf16 / f32 / f32

16

16

Min: 8, pow2

bf16 / f32 / f32

32

32

Min: 4, pow2

bf16 / bf16 / f32

16

16

Min: 8, pow2

bf16 / bf16 / f32

32

32

Min: 4, pow2

bf16 / bf16 / bf16*

16

16

Min: 8, pow2

bf16 / bf16 / bf16*

32

32

Min: 4, pow2

f32 / f32 / f32

16

16

Min: 4, pow2

f32 / f32 / f32

32

32

Min: 2, pow2

f64** / f64** / f64**

16

16

Min: 4, pow2

*= Matrix unit accumulation is natively 32-bit precision and is converted to the desired type.

**= f64 datatype is only supported on MI-200 class AMDGPU and successors.

Supported matrix layouts#

(N = col major, T = row major)

LayoutA

LayoutB

Layout C

LayoutD

N

N

N

N

N

N

T

T

N

T

N

N

N

T

T

T

T

N

N

N

T

N

T

T

T

T

N

N

T

T

T

T

Using rocWMMA API#

This section describes how to use the rocWMMA library API.

rocWMMA datatypes#

struct matrix_a#

Input Matrix A.

struct matrix_b#

Input Matrix B.

struct accumulator#

Input/Output Matrix Accumulator.

struct row_major#

Data/In-memory Layout as Row Major.

struct col_major#

Data/In-memory Layout as Column Major.

class VecT#

HIP vector class.

Template Parameters:
  • DataT – vector data type

  • Rank – vector size

template<typename MatrixT, uint32_t BlockM, uint32_t BlockN, uint32_t BlockK, typename DataT, typename DataLayoutT>
struct IOConfig#

Definition of fragment input / output configurations in specific matrix context.

Template Parameters:
  • Matrix – fragment context

  • BlockM/N/K – block dimensions

  • DataT – data type

  • DataLayoutT – in-memory layout as col_major or row_major

Param IOShape:

dimensional properties of the fragment

Param IOLayout:

1d and 2d layouts of the fragment

Param IOTraits:

meta-properties for input and output of the fragment

Param PackUtil:

utility for packing / unpacking fragment data

Param Broadcaster:

utility for assigning a single value to entire fragment

Param MappingUtil:

global mapping utility for current fragment

Param Loader:

Issues load instructions for raw fragment data

Param Storer:

Issues store instructions for raw fragment data

template<typename MatrixT, uint32_t BlockM, uint32_t BlockN, uint32_t BlockK>
struct IOShape#

Definition of fragment dimensions in specific matrix context.

Template Parameters:
  • MatrixT – fragment context

  • BlockM/N/K – block dimensions

rocWMMA enumeration#

Note

The enumeration constants numbering is consistent with the standard C++ libraries.

enum rocwmma::layout_t#

Values:

enumerator mem_row_major#
enumerator mem_col_major#

rocWMMA API functions#

template<typename MatrixT, uint32_t BlockM, uint32_t BlockN, uint32_t BlockK, typename DataT, typename DataLayout>
void rocwmma::fill_fragment(fragment<MatrixT, BlockM, BlockN, BlockK, DataT, DataLayout> &frag, DataT value)#

Fills the entire fragment with the desired value.

Parameters:
  • frag – Fragment of type MatrixT with its associated block sizes, data type and layout

  • value – Value of type DataT.

Template Parameters:
  • Matrix – fragment context

  • BlockM/N/K – block dimensions

  • DataT – data type

  • DataLayout – in-memory layout as col_major or row_major

template<typename MatrixT, uint32_t BlockM, uint32_t BlockN, uint32_t BlockK, typename DataT, typename DataLayout>
void rocwmma::load_matrix_sync(fragment<MatrixT, BlockM, BlockN, BlockK, DataT, DataLayout> &frag, const DataT *data, uint32_t ldm)#

Loads the entire fragment from the data pointer according to its matrix and data layouts. Data pointer may point to either local or global memory.

Parameters:
  • frag – Fragment of type MatrixT with its associated block sizes, data type and layout

  • data – Data pointer to global/local memory

  • ldm – Leading dimension size

Template Parameters:
  • MatrixT – fragment context

  • BlockM/N/K – block dimensions

  • DataT – data type

  • DataLayout – in-memory layout as col_major or row_major

template<typename MatrixT, uint32_t BlockM, uint32_t BlockN, uint32_t BlockK, typename DataT>
void rocwmma::load_matrix_sync(fragment<MatrixT, BlockM, BlockN, BlockK, DataT> &frag, const DataT *data, uint32_t ldm, layout_t layout)#

Loads the entire fragment from the data pointer according to its matrix layout.Data pointer may point to either local or global memory. This overload provides a run-time ability to choose the data layout of the target fragment.

Parameters:
  • frag – Fragment of type MatrixT with its associated block sizes, data type and layout

  • data – Data pointer to global/local memory

  • ldm – Leading dimension size

  • layout – Matrix layout

Template Parameters:
  • MatrixT – fragment context

  • BlockM/N/K – block dimensions

  • DataT – data type

  • DataLayout – in-memory layout as col_major or row_major

template<typename MatrixT, uint32_t BlockM, uint32_t BlockN, uint32_t BlockK, typename DataT, typename DataLayout>
void rocwmma::store_matrix_sync(DataT *data, fragment<MatrixT, BlockM, BlockN, BlockK, DataT, DataLayout> const &frag, uint32_t ldm)#

Stores the entire fragment to the data pointer according to its matrix and data layouts. Data pointer may point to either local or global memory.

Parameters:
  • frag – Fragment of type MatrixT with its associated block sizes, data type and layout

  • data – Data pointer to global/local memory

  • ldm – Leading dimension size

Template Parameters:
  • MatrixT – fragment context

  • BlockM/N/K – block dimensions

  • DataT – data type

  • DataLayout – in-memory layout as col_major or row_major

template<typename MatrixT, uint32_t BlockM, uint32_t BlockN, uint32_t BlockK, typename DataT>
void rocwmma::store_matrix_sync(DataT *data, fragment<MatrixT, BlockM, BlockN, BlockK, DataT> const &frag, uint32_t ldm, layout_t layout)#

Stores the entire fragment to the data pointer according to its matrix layout. Data pointer may point to either local or global memory. This overload provides a run-time ability to choose the data layout of the target fragment.

Parameters:
  • frag – Fragment of type MatrixT with its associated block sizes, data type and layout

  • data – Data pointer to global/local memory

  • ldm – Leading dimension size

  • layout – Data layout

Template Parameters:
  • MatrixT – fragment context

  • BlockM/N/K – block dimensions

  • DataT – data type

  • DataLayout – in-memory layout as col_major or row_major

template<uint32_t BlockM, uint32_t BlockN, uint32_t BlockK, typename InputT, typename ComputeT, typename LayoutA, typename LayoutB, typename LayoutC, typename LayoutD>
void rocwmma::mma_sync(fragment<accumulator, BlockM, BlockN, BlockK, ComputeT, LayoutD> &d, fragment<matrix_a, BlockM, BlockN, BlockK, InputT, LayoutA> const &a, fragment<matrix_b, BlockM, BlockN, BlockK, InputT, LayoutB> const &b, fragment<accumulator, BlockM, BlockN, BlockK, ComputeT, LayoutC> const &c)#

Performs the Multiply-Accumulate operation on the fragments A, B, C and D(D = A * B + C)

Note

Frag c = d is valid

Parameters:
  • d – Accumulator output D

  • a – Input fragment A

  • b – Input fragment B

  • c – Input accumulator fragment C

Template Parameters:
  • BlockM/N/K – block dimensions

  • InputT – data type of input frags A and B

  • ComputeT – data type of accumulator fragment C / D

  • LayoutA – in-memory layout of frag A as col_major or row_major

  • LayoutB – in-memory layout of frag B as col_major or row_major

void rocwmma::synchronize_workgroup()#

Synchronization point for all wavefronts in a workgroup.

template<typename MatrixT, uint32_t BlockM, uint32_t BlockN, uint32_t BlockK, typename DataT, typename DataLayout>
void rocwmma::load_matrix_coop_sync(fragment<MatrixT, BlockM, BlockN, BlockK, DataT, DataLayout> &frag, const DataT *data, uint32_t ldm, uint32_t waveIndex, uint32_t waveCount, uint32_t splitCount)#

Cooperative Load Matrix - Loads the entire fragment with data from memory address cooperatively across waves. Each cooperative wave is responsible in loading a portion of the final fragment. Note that the full fragment data is not cohesive for individual waves as they only load a piece of the data. This function may be paired with store_matrix_coop_sync to move a single fragment collaboratively between memory locations.

The full load is split into work items (splitCount). Work items are assigned in round robin fashion to waves in the range of [0, waveCount). The current wave index determines the order of the current wave in the collaboration pool. Work items are consumed in order by waves [0, waveCount) until there are no more work items and the operation is completed.

Parameters:
  • frag – Fragment of type MatrixT with its associated block sizes, data type and layout

  • data – Data pointer to global/local memory

  • ldm – Leading dimension size

  • waveIndex – Index assignment of current wave in collaboration

  • waveCount – Number of waves assigned for collaboration

  • splitCount – Number of work items to split the operation

Template Parameters:
  • MatrixT – fragment context

  • BlockM/N/K – block dimensions

  • DataT – data type

  • DataLayout – in-memory layout as col_major or row_major

template<typename MatrixT, uint32_t BlockM, uint32_t BlockN, uint32_t BlockK, typename DataT, typename DataLayout>
inline void rocwmma::load_matrix_coop_sync(fragment<MatrixT, BlockM, BlockN, BlockK, DataT, DataLayout> &frag, const DataT *data, uint32_t ldm, uint32_t waveIndex, uint32_t waveCount)#

Cooperative Load Matrix - Loads the entire fragment with data from memory address cooperatively across waves. Each cooperative wave is responsible in loading a portion of the final fragment. Note that the full fragment data is not cohesive for individual waves as they only load a piece of the data. This function may be paired with store_matrix_coop_sync to move a single fragment collaboratively between memory locations.

The full load is split into work items (default = waveCount). Work items are assigned in round robin fashion to waves in the range of [0, waveCount). The current wave index determines the order of the current wave in the collaboration pool. Work items are consumed in order by waves [0, waveCount) until there are no more work items and the operation is completed.

Parameters:
  • frag – Fragment of type MatrixT with its associated block sizes, data type and layout

  • data – Data pointer to global/local memory

  • ldm – Leading dimension size

  • waveIndex – Index assignment of current wave in collaboration

  • waveCount – Number of waves assigned for collaboration

Template Parameters:
  • MatrixT – fragment context

  • BlockM/N/K – block dimensions

  • DataT – data type

  • DataLayout – in-memory layout as col_major or row_major

template<typename MatrixT, uint32_t BlockM, uint32_t BlockN, uint32_t BlockK, typename DataT, typename DataLayout>
void rocwmma::load_matrix_coop_sync(fragment<MatrixT, BlockM, BlockN, BlockK, DataT, DataLayout> &frag, const DataT *data, uint32_t ldm)#

Cooperative Load Matrix - Loads the entire fragment with data from memory address cooperatively across waves. Each cooperative wave is responsible in loading a portion of the final fragment. Note that the full fragment data is not cohesive for individual waves as they only load a piece of the data. This function may be paired with store_matrix_coop_sync to move a single fragment collaboratively between memory locations.

The full load is split into work items (current waveCount). Work items are assigned in round robin fashion to waves in the range of [0, waveCount). The current wave index determines the order of the current wave in the collaboration pool. Work items are consumed in order by waves [0, waveCount) until there are no more work items and the operation is completed.

Parameters:
  • frag – Fragment of type MatrixT with its associated block sizes, data type and layout

  • data – Data pointer to global/local memory

  • ldm – Leading dimension size

Template Parameters:
  • MatrixT – fragment context

  • BlockM/N/K – block dimensions

  • DataT – data type

  • DataLayout – in-memory layout as col_major or row_major

template<typename MatrixT, uint32_t BlockM, uint32_t BlockN, uint32_t BlockK, typename DataT, typename DataLayout>
void rocwmma::store_matrix_coop_sync(DataT *data, fragment<MatrixT, BlockM, BlockN, BlockK, DataT, DataLayout> const &frag, uint32_t ldm, uint32_t waveIndex, uint32_t waveCount, uint32_t splitCount)#

Cooperative Store Matrix - Stores the entire fragment to data address cooperatively across waves. Each cooperative wave is responsible in storing a portion of the final fragment. Note that the full fragment data is not required to be cohesive for individual waves as they only store a piece of the data. This function may be paired with load_matrix_coop_sync to move a single fragment collaboratively between memory locations.

The full store is split into work items (splitCount). Work items are assigned in round robin fashion to waves in the range of [0, waveCount). The current wave index determines the order of the current wave in the collaboration pool. Work items are consumed in order by waves [0, waveCount) until there are no more work items and the operation is completed.

Parameters:
  • data – Data pointer to global/local memory

  • frag – Fragment of type MatrixT with its associated block sizes, data type and layout

  • ldm – Leading dimension size

  • waveIndex – Index assignment of current wave in collaboration

  • waveCount – Number of waves assigned for collaboration

  • splitCount – Number of work items to split the operation

Template Parameters:
  • MatrixT – fragment context

  • BlockM/N/K – block dimensions

  • DataT – data type

  • DataLayout – in-memory layout as col_major or row_major

template<typename MatrixT, uint32_t BlockM, uint32_t BlockN, uint32_t BlockK, typename DataT, typename DataLayout>
void rocwmma::store_matrix_coop_sync(DataT *data, fragment<MatrixT, BlockM, BlockN, BlockK, DataT, DataLayout> const &frag, uint32_t ldm, uint32_t waveIndex, uint32_t waveCount)#

Cooperative Store Matrix - Stores the entire fragment to data address cooperatively across waves. Each cooperative wave is responsible in storing a portion of the final fragment. Note that the full fragment data is not required to be cohesive for individual waves as they only store a piece of the data. This function may be paired with load_matrix_coop_sync to move a single fragment collaboratively between memory locations.

The full store is split into work items (default = waveCount). Work items are assigned in round robin fashion to waves in the range of [0, waveCount). The current wave index determines the order of the current wave in the collaboration pool. Work items are consumed in order by waves [0, waveCount) until there are no more work items and the operation is completed.

Parameters:
  • data – Data pointer to global/local memory

  • frag – Fragment of type MatrixT with its associated block sizes, data type and layout

  • ldm – Leading dimension size

  • waveIndex – Index assignment of current wave in collaboration

  • waveCount – Number of waves assigned for collaboration

Template Parameters:
  • MatrixT – fragment context

  • BlockM/N/K – block dimensions

  • DataT – data type

  • DataLayout – in-memory layout as col_major or row_major

template<typename MatrixT, uint32_t BlockM, uint32_t BlockN, uint32_t BlockK, typename DataT, typename DataLayout>
void rocwmma::store_matrix_coop_sync(DataT *data, fragment<MatrixT, BlockM, BlockN, BlockK, DataT, DataLayout> const &frag, uint32_t ldm)#

Cooperative Store Matrix - Stores the entire fragment to data address cooperatively across waves. Each cooperative wave is responsible in storing a portion of the final fragment. Note that the full fragment data is not required to be cohesive for individual waves as they only store a piece of the data. This function may be paired with load_matrix_coop_sync to move a single fragment collaboratively between memory locations.

The full store is split into work items (current waveCount). Work items are assigned in round robin fashion to waves in the range of [0, waveCount). The current wave index determines the order of the current wave in the collaboration pool. Work items are consumed in order by waves [0, waveCount) until there are no more work items and the operation is completed.

Parameters:
  • data – Data pointer to global/local memory

  • frag – Fragment of type MatrixT with its associated block sizes, data type and layout

  • ldm – Leading dimension size

Template Parameters:
  • MatrixT – fragment context

  • BlockM/N/K – block dimensions

  • DataT – data type

  • DataLayout – in-memory layout as col_major or row_major

Sample programs#

See a sample code for calling rocWMMA functions load_matrix_sync, store_matrix_sync, fill_fragment, and mma_sync here. For more such sample programs, refer to the Samples directory.