API reference guide#
This document provides information about rocWMMA functions, data types, and other programming constructs.
Synchronous API#
In general, rocWMMA API functions ( load_matrix_sync
, store_matrix_sync
, mma_sync
) are assumed to be synchronous when
used in the context of global memory.
When using these functions in the context of shared memory (e.g. LDS memory), additional explicit workgroup synchronization may be required due to the nature of this memory usage.
Supported data types#
rocWMMA mixed precision multiply-accumulate operations support the following data type combinations.
Data Types <Ti / To / Tc> = <Input type / Output Type / Compute Type>
where,
Input Type = Matrix A/B
Output Type = Matrix C/D
Compute Type = Math / accumulation type
Ti / To / Tc |
BlockM |
BlockN |
BlockK |
---|---|---|---|
i8 / i32 / i32 |
16 |
16 |
Min: 16, pow2 |
i8 / i32 / i32 |
32 |
32 |
Min: 8, pow2 |
i8 / i8 / i32 |
16 |
16 |
Min: 16, pow2 |
i8 / i32 / i32 |
32 |
32 |
Min: 8, pow2 |
f16 / f32 / f32 |
16 |
16 |
Min: 16, pow2 |
f16 / f32 / f32 |
32 |
32 |
Min: 8, pow2 |
f16 / f16 / f32 |
16 |
16 |
Min: 16, pow2 |
f16 / f16 / f32 |
32 |
32 |
Min: 8, pow2 |
f16 / f16 / f16* |
16 |
16 |
Min: 16, pow2 |
f16 / f16 / f16* |
32 |
32 |
Min: 8, pow2 |
__half / f32 / f32 |
16 |
16 |
Min: 16, pow2 |
__half / f32 / f32 |
32 |
32 |
Min: 8, pow2 |
__half / __half / f32 |
16 |
16 |
Min: 16, pow2 |
__half / __half / f32 |
32 |
32 |
Min: 8, pow2 |
__half / __half / __half* |
16 |
16 |
Min: 16, pow2 |
__half / __half / __half* |
32 |
32 |
Min: 8, pow2 |
bf16 / f32 / f32 |
16 |
16 |
Min: 8, pow2 |
bf16 / f32 / f32 |
32 |
32 |
Min: 4, pow2 |
bf16 / bf16 / f32 |
16 |
16 |
Min: 8, pow2 |
bf16 / bf16 / f32 |
32 |
32 |
Min: 4, pow2 |
bf16 / bf16 / bf16* |
16 |
16 |
Min: 8, pow2 |
bf16 / bf16 / bf16* |
32 |
32 |
Min: 4, pow2 |
f32 / f32 / f32 |
16 |
16 |
Min: 4, pow2 |
f32 / f32 / f32 |
32 |
32 |
Min: 2, pow2 |
f64** / f64** / f64** |
16 |
16 |
Min: 4, pow2 |
*= Matrix unit accumulation is natively 32-bit precision and is converted to the desired type.
**= f64 datatype is only supported on MI-200 class AMDGPU and successors.
Supported matrix layouts#
(N = col major, T = row major)
LayoutA |
LayoutB |
Layout C |
LayoutD |
---|---|---|---|
N |
N |
N |
N |
N |
N |
T |
T |
N |
T |
N |
N |
N |
T |
T |
T |
T |
N |
N |
N |
T |
N |
T |
T |
T |
T |
N |
N |
T |
T |
T |
T |
Using rocWMMA API#
This section describes how to use the rocWMMA library API.
rocWMMA datatypes#
-
struct matrix_a#
Input Matrix A.
-
struct matrix_b#
Input Matrix B.
-
struct accumulator#
Input/Output Matrix Accumulator.
-
struct row_major#
Data/In-memory Layout as Row Major.
-
struct col_major#
Data/In-memory Layout as Column Major.
-
class VecT#
HIP vector class.
- Template Parameters:
DataT – vector data type
Rank – vector size
-
template<typename MatrixT, uint32_t BlockM, uint32_t BlockN, uint32_t BlockK, typename DataT, typename DataLayoutT>
struct IOConfig# Definition of fragment input / output configurations in specific matrix context.
- Template Parameters:
- Param IOShape:
dimensional properties of the fragment
- Param IOLayout:
1d and 2d layouts of the fragment
- Param IOTraits:
meta-properties for input and output of the fragment
- Param PackUtil:
utility for packing / unpacking fragment data
- Param Broadcaster:
utility for assigning a single value to entire fragment
- Param MappingUtil:
global mapping utility for current fragment
- Param Loader:
Issues load instructions for raw fragment data
- Param Storer:
Issues store instructions for raw fragment data
-
template<typename MatrixT, uint32_t BlockM, uint32_t BlockN, uint32_t BlockK>
struct IOShape# Definition of fragment dimensions in specific matrix context.
- Template Parameters:
MatrixT – fragment context
BlockM/N/K – block dimensions
rocWMMA enumeration#
Note
The enumeration constants numbering is consistent with the standard C++ libraries.
rocWMMA API functions#
-
template<typename MatrixT, uint32_t BlockM, uint32_t BlockN, uint32_t BlockK, typename DataT, typename DataLayout>
void rocwmma::fill_fragment(fragment<MatrixT, BlockM, BlockN, BlockK, DataT, DataLayout> &frag, DataT value)# Fills the entire fragment with the desired value.
-
template<typename MatrixT, uint32_t BlockM, uint32_t BlockN, uint32_t BlockK, typename DataT, typename DataLayout>
void rocwmma::load_matrix_sync(fragment<MatrixT, BlockM, BlockN, BlockK, DataT, DataLayout> &frag, const DataT *data, uint32_t ldm)# Loads the entire fragment from the data pointer according to its matrix and data layouts. Data pointer may point to either local or global memory.
- Parameters:
frag – Fragment of type MatrixT with its associated block sizes, data type and layout
data – Data pointer to global/local memory
ldm – Leading dimension size
- Template Parameters:
-
template<typename MatrixT, uint32_t BlockM, uint32_t BlockN, uint32_t BlockK, typename DataT>
void rocwmma::load_matrix_sync(fragment<MatrixT, BlockM, BlockN, BlockK, DataT> &frag, const DataT *data, uint32_t ldm, layout_t layout)# Loads the entire fragment from the data pointer according to its matrix layout.Data pointer may point to either local or global memory. This overload provides a run-time ability to choose the data layout of the target fragment.
- Parameters:
frag – Fragment of type MatrixT with its associated block sizes, data type and layout
data – Data pointer to global/local memory
ldm – Leading dimension size
layout – Matrix layout
- Template Parameters:
-
template<typename MatrixT, uint32_t BlockM, uint32_t BlockN, uint32_t BlockK, typename DataT, typename DataLayout>
void rocwmma::store_matrix_sync(DataT *data, fragment<MatrixT, BlockM, BlockN, BlockK, DataT, DataLayout> const &frag, uint32_t ldm)# Stores the entire fragment to the data pointer according to its matrix and data layouts. Data pointer may point to either local or global memory.
- Parameters:
frag – Fragment of type MatrixT with its associated block sizes, data type and layout
data – Data pointer to global/local memory
ldm – Leading dimension size
- Template Parameters:
-
template<typename MatrixT, uint32_t BlockM, uint32_t BlockN, uint32_t BlockK, typename DataT>
void rocwmma::store_matrix_sync(DataT *data, fragment<MatrixT, BlockM, BlockN, BlockK, DataT> const &frag, uint32_t ldm, layout_t layout)# Stores the entire fragment to the data pointer according to its matrix layout. Data pointer may point to either local or global memory. This overload provides a run-time ability to choose the data layout of the target fragment.
- Parameters:
frag – Fragment of type MatrixT with its associated block sizes, data type and layout
data – Data pointer to global/local memory
ldm – Leading dimension size
layout – Data layout
- Template Parameters:
-
template<uint32_t BlockM, uint32_t BlockN, uint32_t BlockK, typename InputT, typename ComputeT, typename LayoutA, typename LayoutB, typename LayoutC, typename LayoutD>
void rocwmma::mma_sync(fragment<accumulator, BlockM, BlockN, BlockK, ComputeT, LayoutD> &d, fragment<matrix_a, BlockM, BlockN, BlockK, InputT, LayoutA> const &a, fragment<matrix_b, BlockM, BlockN, BlockK, InputT, LayoutB> const &b, fragment<accumulator, BlockM, BlockN, BlockK, ComputeT, LayoutC> const &c)# Performs the Multiply-Accumulate operation on the fragments A, B, C and D(D = A * B + C)
Note
Frag c = d is valid
- Parameters:
d – Accumulator output D
a – Input fragment A
b – Input fragment B
c – Input accumulator fragment C
- Template Parameters:
-
void rocwmma::synchronize_workgroup()#
Synchronization point for all wavefronts in a workgroup.
-
template<typename MatrixT, uint32_t BlockM, uint32_t BlockN, uint32_t BlockK, typename DataT, typename DataLayout>
void rocwmma::load_matrix_coop_sync(fragment<MatrixT, BlockM, BlockN, BlockK, DataT, DataLayout> &frag, const DataT *data, uint32_t ldm, uint32_t waveIndex, uint32_t waveCount, uint32_t splitCount)# Cooperative Load Matrix - Loads the entire fragment with data from memory address cooperatively across waves. Each cooperative wave is responsible in loading a portion of the final fragment. Note that the full fragment data is not cohesive for individual waves as they only load a piece of the data. This function may be paired with store_matrix_coop_sync to move a single fragment collaboratively between memory locations.
The full load is split into work items (splitCount). Work items are assigned in round robin fashion to waves in the range of [0, waveCount). The current wave index determines the order of the current wave in the collaboration pool. Work items are consumed in order by waves [0, waveCount) until there are no more work items and the operation is completed.
- Parameters:
frag – Fragment of type MatrixT with its associated block sizes, data type and layout
data – Data pointer to global/local memory
ldm – Leading dimension size
waveIndex – Index assignment of current wave in collaboration
waveCount – Number of waves assigned for collaboration
splitCount – Number of work items to split the operation
- Template Parameters:
-
template<typename MatrixT, uint32_t BlockM, uint32_t BlockN, uint32_t BlockK, typename DataT, typename DataLayout>
inline void rocwmma::load_matrix_coop_sync(fragment<MatrixT, BlockM, BlockN, BlockK, DataT, DataLayout> &frag, const DataT *data, uint32_t ldm, uint32_t waveIndex, uint32_t waveCount)# Cooperative Load Matrix - Loads the entire fragment with data from memory address cooperatively across waves. Each cooperative wave is responsible in loading a portion of the final fragment. Note that the full fragment data is not cohesive for individual waves as they only load a piece of the data. This function may be paired with store_matrix_coop_sync to move a single fragment collaboratively between memory locations.
The full load is split into work items (default = waveCount). Work items are assigned in round robin fashion to waves in the range of [0, waveCount). The current wave index determines the order of the current wave in the collaboration pool. Work items are consumed in order by waves [0, waveCount) until there are no more work items and the operation is completed.
- Parameters:
frag – Fragment of type MatrixT with its associated block sizes, data type and layout
data – Data pointer to global/local memory
ldm – Leading dimension size
waveIndex – Index assignment of current wave in collaboration
waveCount – Number of waves assigned for collaboration
- Template Parameters:
-
template<typename MatrixT, uint32_t BlockM, uint32_t BlockN, uint32_t BlockK, typename DataT, typename DataLayout>
void rocwmma::load_matrix_coop_sync(fragment<MatrixT, BlockM, BlockN, BlockK, DataT, DataLayout> &frag, const DataT *data, uint32_t ldm)# Cooperative Load Matrix - Loads the entire fragment with data from memory address cooperatively across waves. Each cooperative wave is responsible in loading a portion of the final fragment. Note that the full fragment data is not cohesive for individual waves as they only load a piece of the data. This function may be paired with store_matrix_coop_sync to move a single fragment collaboratively between memory locations.
The full load is split into work items (current waveCount). Work items are assigned in round robin fashion to waves in the range of [0, waveCount). The current wave index determines the order of the current wave in the collaboration pool. Work items are consumed in order by waves [0, waveCount) until there are no more work items and the operation is completed.
- Parameters:
frag – Fragment of type MatrixT with its associated block sizes, data type and layout
data – Data pointer to global/local memory
ldm – Leading dimension size
- Template Parameters:
-
template<typename MatrixT, uint32_t BlockM, uint32_t BlockN, uint32_t BlockK, typename DataT, typename DataLayout>
void rocwmma::store_matrix_coop_sync(DataT *data, fragment<MatrixT, BlockM, BlockN, BlockK, DataT, DataLayout> const &frag, uint32_t ldm, uint32_t waveIndex, uint32_t waveCount, uint32_t splitCount)# Cooperative Store Matrix - Stores the entire fragment to data address cooperatively across waves. Each cooperative wave is responsible in storing a portion of the final fragment. Note that the full fragment data is not required to be cohesive for individual waves as they only store a piece of the data. This function may be paired with load_matrix_coop_sync to move a single fragment collaboratively between memory locations.
The full store is split into work items (splitCount). Work items are assigned in round robin fashion to waves in the range of [0, waveCount). The current wave index determines the order of the current wave in the collaboration pool. Work items are consumed in order by waves [0, waveCount) until there are no more work items and the operation is completed.
- Parameters:
data – Data pointer to global/local memory
frag – Fragment of type MatrixT with its associated block sizes, data type and layout
ldm – Leading dimension size
waveIndex – Index assignment of current wave in collaboration
waveCount – Number of waves assigned for collaboration
splitCount – Number of work items to split the operation
- Template Parameters:
-
template<typename MatrixT, uint32_t BlockM, uint32_t BlockN, uint32_t BlockK, typename DataT, typename DataLayout>
void rocwmma::store_matrix_coop_sync(DataT *data, fragment<MatrixT, BlockM, BlockN, BlockK, DataT, DataLayout> const &frag, uint32_t ldm, uint32_t waveIndex, uint32_t waveCount)# Cooperative Store Matrix - Stores the entire fragment to data address cooperatively across waves. Each cooperative wave is responsible in storing a portion of the final fragment. Note that the full fragment data is not required to be cohesive for individual waves as they only store a piece of the data. This function may be paired with load_matrix_coop_sync to move a single fragment collaboratively between memory locations.
The full store is split into work items (default = waveCount). Work items are assigned in round robin fashion to waves in the range of [0, waveCount). The current wave index determines the order of the current wave in the collaboration pool. Work items are consumed in order by waves [0, waveCount) until there are no more work items and the operation is completed.
- Parameters:
data – Data pointer to global/local memory
frag – Fragment of type MatrixT with its associated block sizes, data type and layout
ldm – Leading dimension size
waveIndex – Index assignment of current wave in collaboration
waveCount – Number of waves assigned for collaboration
- Template Parameters:
-
template<typename MatrixT, uint32_t BlockM, uint32_t BlockN, uint32_t BlockK, typename DataT, typename DataLayout>
void rocwmma::store_matrix_coop_sync(DataT *data, fragment<MatrixT, BlockM, BlockN, BlockK, DataT, DataLayout> const &frag, uint32_t ldm)# Cooperative Store Matrix - Stores the entire fragment to data address cooperatively across waves. Each cooperative wave is responsible in storing a portion of the final fragment. Note that the full fragment data is not required to be cohesive for individual waves as they only store a piece of the data. This function may be paired with load_matrix_coop_sync to move a single fragment collaboratively between memory locations.
The full store is split into work items (current waveCount). Work items are assigned in round robin fashion to waves in the range of [0, waveCount). The current wave index determines the order of the current wave in the collaboration pool. Work items are consumed in order by waves [0, waveCount) until there are no more work items and the operation is completed.
- Parameters:
data – Data pointer to global/local memory
frag – Fragment of type MatrixT with its associated block sizes, data type and layout
ldm – Leading dimension size
- Template Parameters:
Sample programs#
See a sample code for calling rocWMMA functions load_matrix_sync
, store_matrix_sync
, fill_fragment
, and mma_sync
here.
For more such sample programs, refer to the Samples directory.