API reference guide#
This document provides information about rocWMMA functions, data types, and other programming constructs.
Synchronous API#
In general, rocWMMA API functions ( load_matrix_sync, store_matrix_sync, mma_sync ) are assumed to be synchronous when
used in the context of global memory.
When using these functions in the context of shared memory (e.g. LDS memory), additional explicit workgroup synchronization (synchronize_workgroup)
may be required due to the nature of this memory usage.
Supported GPU architectures#
List of supported CDNA architectures (wave64):
- gfx908 
- gfx90a 
- gfx940 
- gfx941 
- gfx942 
Note
gfx9 = gfx908, gfx90a, gfx940, gfx941, gfx942
gfx940+ = gfx940, gfx941, gfx942
List of supported RDNA architectures (wave32):
- gfx1100 
- gfx1101 
- gfx1102 
Note
gfx11 = gfx1100, gfx1101, gfx1102
Supported data types#
rocWMMA mixed precision multiply-accumulate operations support the following data type combinations.
Data Types <Ti / To / Tc> = <Input type / Output Type / Compute Type>, where:
- Input Type = Matrix A / B 
- Output Type = Matrix C / D 
- Compute Type = Math / accumulation type 
- i8 = 8-bit precision integer 
- f8 = 8-bit precision floating point 
- bf8 = 8-bit precision brain floating point 
- f16 = half-precision floating point 
- bf16 = half-precision brain floating point 
- f32 = single-precision floating point 
- i32 = 32-bit precision integer 
- xf32 = single-precision tensor floating point 
- f64 = double-precision floating point 
Note
f16 represents equivalent support for both _Float16 and __half types.
Current f8 support is NANOO (optimized) format.
| Ti / To / Tc | BlockM | BlockN | BlockK Range* (Powers of 2) | CDNA Support | RDNA Support | 
|---|---|---|---|---|---|
| bf8 / f32 / f32 | 16 | 16 | 32+ | gfx940+ | - | 
| 32 | 32 | 16+ | |||
| f8 / f32 / f32 | 16 | 16 | 32+ | gfx940+ | - | 
| 32 | 32 | 16+ | |||
| i8 / i32 / i32 | 16 | 16 | 16+ | gfx908, gfx90a | gfx11 | 
| 32+ | gfx940+ | - | |||
| 32 | 32 | 8+ | gfx908, gfx90a | - | |
| 16+ | gfx940+ | - | |||
| i8 / i8 / i32 | 16 | 16 | 16+ | gfx908, gfx90a | gfx11 | 
| 32+ | gfx940+ | - | |||
| 32 | 32 | 8+ | gfx908, gfx90a | - | |
| 16+ | gfx940+ | - | |||
| f16 / f32 / f32 | 16 | 16 | 16+ | gfx9 | gfx11 | 
| 32 | 32 | 8+ | gfx9 | - | |
| f16 / f16 / f32 | 16 | 16 | 16+ | gfx9 | gfx11 | 
| 32 | 32 | 8+ | gfx9 | - | |
| f16 / f16 / f16** | 16 | 16 | 16+ | gfx9 | gfx11 | 
| 32 | 32 | 8+ | gfx9 | - | |
| bf16 / f32 / f32 | 16 | 16 | 8+ | gfx908 | - | 
| 16+ | gfx90a, gfx940+ | gfx11 | |||
| 32 | 32 | 4+ | gfx908 | - | |
| 8+ | gfx90a, gfx940+ | - | |||
| bf16 / bf16 / f32 | 16 | 16 | 8+ | gfx908 | - | 
| 16+ | gfx90a, gfx940+ | gfx11 | |||
| 32 | 32 | 4+ | gfx908 | - | |
| 8+ | gfx90a, gfx940+ | - | |||
| bf16 / bf16 / bf16** | 16 | 16 | 8+ | gfx908 | - | 
| 16+ | gfx90a, gfx940+ | gfx11 | |||
| 32 | 32 | 4+ | gfx908 | - | |
| 8+ | gfx90a, gfx940+ | - | |||
| f32 / f32 / f32 | 16 | 16 | 4+ | gfx9 | - | 
| 32 | 32 | 2+ | gfx9 | - | |
| xf32 / xf32 / xf32 | 16 | 16 | 8+ | gfx940+ | - | 
| 32 | 32 | 4+ | |||
| f64 / f64 / f64 | 16 | 16 | 4+ | gfx90a, gfx940+ | - | 
Note
* = BlockK range lists the minimum possible value. Other values in the range are powers of 2 larger than the minimum. Practical BlockK values are usually 32 and smaller.
** = CDNA architectures matrix unit accumulation is natively 32-bit precision and is converted to the desired type.
Supported matrix layouts#
(N = col major, T = row major)
| LayoutA | LayoutB | Layout C | LayoutD | 
|---|---|---|---|
| N | N | N | N | 
| N | N | T | T | 
| N | T | N | N | 
| N | T | T | T | 
| T | N | N | N | 
| T | N | T | T | 
| T | T | N | N | 
| T | T | T | T | 
Supported thread block sizes#
rocWMMA generally supports and tests up to 4 wavefronts per threadblock. The X dimension is expected to be a multiple of the wave size and will be scaled as such.
| TBlock_X | TBlock_Y | 
|---|---|
| WaveSize | 1 | 
| WaveSize | 2 | 
| WaveSize | 4 | 
| WaveSize*2 | 1 | 
| WaveSize*2 | 2 | 
| WaveSize*4 | 1 | 
Note
WaveSize (RDNA) = 32
WaveSize (CDNA) = 64
Using rocWMMA API#
This section describes how to use the rocWMMA library API.
rocWMMA datatypes#
matrix_a#
- 
struct matrix_a#
- Meta-tag indicating data context is input Matrix A. 
matrix_b#
- 
struct matrix_b#
- Meta-tag indicating data context is input Matrix B. 
accumulator#
- 
struct accumulator#
- Meta-tag indicating data context is Accumulator (also used as Matrix C / D). 
row_major#
- 
struct row_major#
- Meta-tag indicating 2D in-memory data layout as row major. 
col_major#
- 
struct col_major#
- Meta-tag indicating 2D in-memory data layout as column major. 
fragment#
- 
template<typename MatrixT, uint32_t BlockM, uint32_t BlockN, uint32_t BlockK, typename DataT, typename DataLayoutT = void>
 class fragment#
- rocWMMA fragment class. This is the primary object used in block-wise decomposition of the matrix multiply-accumulate (mma) problem space. In general, fragment data is associated with a matrix context (matrix_a, matrix_b or accumulator), a block size (BlockM/N/K), a datatype (e.g. single-precision float, etc.) and an in-memory 2D layout (e.g. row_major or col_major). These fragment properties are used to define how data is handled and stored locally, and to drive API implementations for loading / storing, mma and transforms. Fragment abstractions are designed to promote a simple wavefront programming model, which can accelerate development time. Internal thread-level details are handled by rocWMMA which frees the user to focus on wavefront block-wise decomposition. Written purely in device code, the programmer can use this object in their own device kernels. - Note - Fragments are stored in packed registers, however vector elements have no guaranteed order or locality. - Template Parameters:
 - Public Types - Public Functions - 
inline DataT &operator[](uint32_t index)#
- Parameters:
- index – Element index 
- Returns:
- Mutable unpacked element accessor at given index 
 
 - Public Members - Public Static Functions - 
static inline constexpr uint32_t height()#
- Returns:
- The geometric height of fragment 
 
 - 
static inline constexpr uint32_t width()#
- Returns:
- The geometric width of fragment 
 
 - 
static inline constexpr uint32_t blockDim()#
- Returns:
- The leading block dimension (non-K) 
 
 - 
static inline constexpr uint32_t kDim()#
- Returns:
- The k dimension 
 
 - 
static inline constexpr uint32_t size()#
- Returns:
- The size of the unpacked elements vector 
 
 - 
struct Traits#
 
rocWMMA enumeration#
layout_t#
rocWMMA API functions#
- 
template<typename MatrixT, uint32_t BlockM, uint32_t BlockN, uint32_t BlockK, typename DataT, typename DataLayoutT>
 void rocwmma::fill_fragment(fragment<MatrixT, BlockM, BlockN, BlockK, DataT, DataLayoutT> &frag, DataT value)#
- Fills the entire fragment with the desired value. 
- 
template<typename MatrixT, uint32_t BlockM, uint32_t BlockN, uint32_t BlockK, typename DataT, typename DataLayoutT>
 void rocwmma::load_matrix_sync(fragment<MatrixT, BlockM, BlockN, BlockK, DataT, DataLayoutT> &frag, const DataT *data, uint32_t ldm)#
- Loads the entire fragment from the data pointer according to its matrix and data layout contexts. Data pointer may point to either local or global memory. - Parameters:
- frag – Fragment of type MatrixT with its associated block sizes, data type and layout 
- data – Data pointer to global or local memory 
- ldm – Leading dimension size 
 
- Template Parameters:
 
- 
template<typename MatrixT, uint32_t BlockM, uint32_t BlockN, uint32_t BlockK, typename DataT>
 void rocwmma::load_matrix_sync(fragment<MatrixT, BlockM, BlockN, BlockK, DataT> &frag, const DataT *data, uint32_t ldm, layout_t layout)#
- Loads the entire fragment from the data pointer according to its matrix layout and data layout contexts. Data pointer may point to either local or global memory. This overload provides a run-time ability to choose the data layout of the target fragment. - Parameters:
- frag – Fragment of type MatrixT with its associated block sizes, data type and layout 
- data – Data pointer to global/local memory 
- ldm – Leading dimension size 
- layout – Data layout 
 
- Template Parameters:
- MatrixT – Fragment context 
- BlockM/N/K – Block dimensions 
- DataT – Datatype 
 
 
- 
template<typename MatrixT, uint32_t BlockM, uint32_t BlockN, uint32_t BlockK, typename DataT, typename DataLayoutT>
 void rocwmma::store_matrix_sync(DataT *data, fragment<MatrixT, BlockM, BlockN, BlockK, DataT, DataLayoutT> const &frag, uint32_t ldm)#
- Stores the entire fragment to the data pointer according to its matrix and data layouts. Data pointer may point to either local or global memory. - Parameters:
- frag – Fragment of type MatrixT with its associated block sizes, data type and layout 
- data – Data pointer to global/local memory 
- ldm – Leading dimension size 
 
- Template Parameters:
 
- 
template<typename MatrixT, uint32_t BlockM, uint32_t BlockN, uint32_t BlockK, typename DataT>
 void rocwmma::store_matrix_sync(DataT *data, fragment<MatrixT, BlockM, BlockN, BlockK, DataT> const &frag, uint32_t ldm, layout_t layout)#
- Stores the entire fragment to the data pointer according to its matrix layout. Data pointer may point to either local or global memory. This overload provides a run-time ability to choose the data layout of the target fragment. - Parameters:
- frag – Fragment of type MatrixT with its associated block sizes, data type and layout 
- data – Data pointer to global/local memory 
- ldm – Leading dimension size 
- layout – Data layout 
 
- Template Parameters:
- MatrixT – Fragment context 
- BlockM/N/K – Block dimensions 
- DataT – Datatype 
 
 
- 
template<uint32_t BlockM, uint32_t BlockN, uint32_t BlockK, typename InputT, typename ComputeT, typename LayoutA, typename LayoutB, typename LayoutC, typename LayoutD>
 void rocwmma::mma_sync(fragment<accumulator, BlockM, BlockN, BlockK, ComputeT, LayoutD> &d, fragment<matrix_a, BlockM, BlockN, BlockK, InputT, LayoutA> const &a, fragment<matrix_b, BlockM, BlockN, BlockK, InputT, LayoutB> const &b, fragment<accumulator, BlockM, BlockN, BlockK, ComputeT, LayoutC> const &c)#
- Performs the Multiply-Accumulate operation on the fragments A, B, C and D (D = A * B + C) - Note - Frag c = d is valid - Parameters:
- d – Accumulator output D 
- a – Input fragment A 
- b – Input fragment B 
- c – Input accumulator fragment C 
 
- Template Parameters:
 
- 
void rocwmma::synchronize_workgroup()#
- Synchronization point for all wavefronts in a workgroup. Guarantees pending reads / writes to LDS are flushed. 
rocWMMA cooperative API functions#
- 
template<typename MatrixT, uint32_t BlockM, uint32_t BlockN, uint32_t BlockK, typename DataT, typename DataLayoutT>
 inline void rocwmma::load_matrix_coop_sync(fragment<MatrixT, BlockM, BlockN, BlockK, DataT, DataLayoutT> &frag, const DataT *data, uint32_t ldm, uint32_t waveIndex, uint32_t waveCount)#
- Loads the fragment from memory address cooperatively across wavefronts. Each cooperating wavefront is responsible in loading a portion of the final fragment. This function may be paired with store_matrix_coop_sync to move a single fragment collaboratively between memory locations. - The full load is split into work items (default = waveCount). Work items are assigned in round robin fashion to waves in the range of [0, waveCount). The current wave index determines the order of the current wave in the collaboration pool. Work items are consumed in order by waves [0, waveCount) until there are no more work items and the operation is completed. - Note - Individual wavefronts only load a smaller portion of the full data that they are responsible for. - Parameters:
- frag – Fragment of type MatrixT with its associated block sizes, data type and layout 
- data – Data pointer to global/local memory 
- ldm – Leading dimension size 
- waveIndex – Index assignment of current wave in collaboration 
- waveCount – Number of waves assigned for collaboration 
 
- Template Parameters:
 
- 
template<typename MatrixT, uint32_t BlockM, uint32_t BlockN, uint32_t BlockK, typename DataT, typename DataLayoutT>
 void rocwmma::load_matrix_coop_sync(fragment<MatrixT, BlockM, BlockN, BlockK, DataT, DataLayoutT> &frag, const DataT *data, uint32_t ldm)#
- Loads the fragment from memory address cooperatively across wavefronts. Each cooperating wavefront is responsible in loading a portion of the final fragment. This function may be paired with store_matrix_coop_sync to move a single fragment collaboratively between memory locations. - The full load is split into work items (current waveCount). Work items are assigned in round robin fashion to waves in the range of [0, waveCount). The current wave index determines the order of the current wave in the collaboration pool. Work items are consumed in order by waves [0, waveCount) until there are no more work items and the operation is completed. - Note - Individual wavefronts only load a smaller portion of the full data that they are responsible for. - Parameters:
- frag – Fragment of type MatrixT with its associated block sizes, data type and layout 
- data – Data pointer to global/local memory 
- ldm – Leading dimension size 
 
- Template Parameters:
 
- 
template<uint32_t WaveCount, typename MatrixT, uint32_t BlockM, uint32_t BlockN, uint32_t BlockK, typename DataT, typename DataLayoutT>
 void rocwmma::load_matrix_coop_sync(fragment<MatrixT, BlockM, BlockN, BlockK, DataT, DataLayoutT> &frag, const DataT *data, uint32_t ldm, uint32_t waveIndex)#
- Loads the fragment from memory address cooperatively across wavefronts. Each cooperating wavefront is responsible in loading a portion of the final fragment. This function may be paired with store_matrix_coop_sync to move a single fragment collaboratively between memory locations. - This flavor of cooperative load includes WaveCount as a template parameter that may be used to optimize during compile time, and is preferred over providing this value as runtime function argument. - The full load is split into work items (WaveCount). Work items are assigned in round robin fashion to waves in the range of [0, waveCount). The current wave index determines the order of the current wave in the collaboration pool. Work items are consumed in order by waves [0, waveCount) until there are no more work items and the operation is completed. - Note - Individual wavefronts only load a smaller portion of the full data that they are responsible for. - Parameters:
- frag – Fragment of type MatrixT with its associated block sizes, data type and layout 
- data – Data pointer to global/local memory 
- ldm – Leading dimension size 
- waveIndex – Index assignment of current wave in collaboration 
 
- Template Parameters:
 
- 
template<typename MatrixT, uint32_t BlockM, uint32_t BlockN, uint32_t BlockK, typename DataT, typename DataLayoutT>
 void rocwmma::store_matrix_coop_sync(DataT *data, fragment<MatrixT, BlockM, BlockN, BlockK, DataT, DataLayoutT> const &frag, uint32_t ldm, uint32_t waveIndex, uint32_t waveCount)#
- Cooperative Store Matrix - Stores the entire fragment to data address cooperatively across waves. Each cooperative wave is responsible in storing a portion of the final fragment. - The full store is split into work items (default = waveCount). Work items are assigned in round robin fashion to waves in the range of [0, waveCount). The current wave index determines the order of the current wave in the collaboration pool. Work items are consumed in order by waves [0, waveCount) until there are no more work items and the operation is completed. - Note - The full fragment data is not required to be cohesive for individual waves as they only store a piece of the data. This function may be paired with load_matrix_coop_sync to move a single fragment collaboratively between memory locations. - Parameters:
- data – Data pointer to global/local memory 
- frag – Fragment of type MatrixT with its associated block sizes, data type and layout 
- ldm – Leading dimension size 
- waveIndex – Index assignment of current wave in collaboration 
- waveCount – Number of waves assigned for collaboration 
 
- Template Parameters:
 
- 
template<typename MatrixT, uint32_t BlockM, uint32_t BlockN, uint32_t BlockK, typename DataT, typename DataLayoutT>
 void rocwmma::store_matrix_coop_sync(DataT *data, fragment<MatrixT, BlockM, BlockN, BlockK, DataT, DataLayoutT> const &frag, uint32_t ldm)#
- Cooperative Store Matrix - Stores the entire fragment to data address cooperatively across waves. Each cooperative wave is responsible in storing a portion of the final fragment. - The full store is split into work items (current waveCount). Work items are assigned in round robin fashion to waves in the range of [0, waveCount). The current wave index determines the order of the current wave in the collaboration pool. Work items are consumed in order by waves [0, waveCount) until there are no more work items and the operation is completed. - Note - The full fragment data is not required to be cohesive for individual waves as they only store a piece of the data. This function may be paired with load_matrix_coop_sync to move a single fragment collaboratively between memory locations. - Parameters:
- data – Data pointer to global/local memory 
- frag – Fragment of type MatrixT with its associated block sizes, data type and layout 
- ldm – Leading dimension size 
 
- Template Parameters:
 
- 
template<uint32_t WaveCount, typename MatrixT, uint32_t BlockM, uint32_t BlockN, uint32_t BlockK, typename DataT, typename DataLayoutT>
 void rocwmma::store_matrix_coop_sync(DataT *data, fragment<MatrixT, BlockM, BlockN, BlockK, DataT, DataLayoutT> const &frag, uint32_t ldm, uint32_t waveIndex)#
- Cooperative Store Matrix - Stores the entire fragment to data address cooperatively across waves. Each cooperative wave is responsible in storing a portion of the final fragment. - This flavor of cooperative store includes WaveCount as a template parameter that may be used to optimize during compile time, and is preferred over providing this value as runtime function argument. - The full store is split into work items (WaveCount). Work items are assigned in round robin fashion to waves in the range of [0, waveCount). The current wave index determines the order of the current wave in the collaboration pool. Work items are consumed in order by waves [0, waveCount) until there are no more work items and the operation is completed. - Note - The full fragment data is not required to be cohesive for individual waves as they only store a piece of the data. This function may be paired with load_matrix_coop_sync to move a single fragment collaboratively between memory locations. - Parameters:
- data – Data pointer to global/local memory 
- frag – Fragment of type MatrixT with its associated block sizes, data type and layout 
- ldm – Leading dimension size 
- waveIndex – Index assignment of current wave in collaboration 
 
- Template Parameters:
 
rocWMMA transforms API functions#
- 
template<typename FragT>
 static inline T rocwmma::applyTranspose(FragT &&frag)#
- Applies the transpose transform the input fragment. Transpose is defined as orthogonal matrix and data layout. E.g. T(fragment<matrix_a, BlockM, BlockN, BlockK, DataT, row_major>) = fragment<matrix_b, BlockN, BlockM, BlockK, DataT, col_major> - Parameters:
- frag – Fragment of type MatrixT with its associated block sizes, data type and layout 
- Template Parameters:
- FragT – The incoming fragment type 
- Returns:
- Transposed (orthogonal) fragment 
 
- 
template<typename DataLayoutT, uint32_t WaveCount = 1, typename FragT>
 static inline T rocwmma::applyDataLayout(FragT &&frag)#
- Transforms the input fragment to have the desired data layout. - Parameters:
- frag – Fragment of type MatrixT with its associated block sizes, data type and layout 
- Template Parameters:
- DataLayoutT – The desired fragment data layout to apply 
- WaveCount – The number of cooperative waves for cooperative fragments (defaults to 1, or non-cooperative) 
- FragT – The incoming fragment type 
 
- Returns:
- Fragment with transformed data layout 
 
Sample programs#
See a sample code for calling rocWMMA functions load_matrix_sync, store_matrix_sync, fill_fragment, and mma_sync here.
For more such sample programs, refer to the Samples directory.
Emulation tests#
The emulation test is a smaller test suite specifically designed for emulators. It comprises a selection of test cases from the full ROCWMM test set, allowing for significantly faster execution on emulated platforms. Despite its concise nature, the emulation test supports smoke, regression, and extended modes.
For example, run a smoke test.
rtest.py --install_dir <build_dir> --emulation smoke