Programmer’s guide#
This document provides insight into the library design choices, source code organization and helpful information for new development.
Infrastructure#
Doxygen and Sphinx are used to generate the project’s documentation.
Jenkins is used to automate Continuous Integration (CI) testing (
.jenkins
folder has configurations).rocWMMA is hosted and maintained by AMD on Github.
The rocWMMA project is organized and configured via
CMake
and the collection ofCMakeLists.txt
in the base of each directory.clang-format
is used to format C++ code..githooks/install
ensures that a clang-format pass will run on each committed file.GTest
is used to implement test suite organization and execution.CTest
is used to consolidate and invoke multiple test targets. In the<rocWMMA_install_dir>/bin/rocWMMA/CTestTestfile.cmake
file, testing targets are listed that will be run whenctest
is invoked.The preferred compiler for rocWMMA is
CC=<path_to_rocm>/bin/amdclang and CXX=<path_to_rocm>/bin/amdclang++
.hipcc
is also supported, however may be deprecated in future ROCm releases.
hipRTC Support#
The HIP runtime compilation (hipRTC) environment allows on-the-fly runtime compilation, loading, and execution of device code on AMD GPUs. The rocWMMA library is compatible with hipRTC, so it can be leveraged for runtime-generated kernels. A simple GEMM sample is included to demonstrate compatibility.
For more information, refer to the HIP API Reference
General Design Concepts#
rocWMMA is developed with the C++17
language standard. The library takes advantage of several meta-programming techniques that help to statically
optimize code at compile time and generate more efficient GPU kernels. All of the code is contained within the rocwmma
namespace. Due to the templated
nature of the code, the rocWMMA API is a header-only library. The developer has the advantage of full visibility of the implementation, as well as the ability to
integrate rocWMMA API calls directly within their own kernels. An important feature is that the integrated code has higher visibility
to the compiler’s optimization passes where there is higher potential to generate more efficient device code. Moreover, no expensive host-device transfers or
external kernel invokations are imposed by the API.
The programming model that is the most useful with the rocWMMA API is wavefront-centric. In a more general sense, loading and storing data or mma calls are assumed to involve the entire wavefront (or, warp). In the collaborative API, we can assume that other wavefronts in the same threadblock may collaborate moving data from one location to another.
The rocWMMA API can eliminate a very large amount of boiler-plate code as it provides tools to decompose matrix multiply-accumulate based problems into
blocks, or fragments of data that may be efficiently processed by individual wavefronts. Wavefronts will abstract blocks of data into fragments
, which are designed to encapsulate meta-data properties about the blocks in different contexts, along with the data itself:
a general geometric shape (e.g. BlockM/N/K dimensions)
a context of the provenance of the data (e.g.
matrix_a
ormatrix_b
)a context of how the data was stored (e.g. row or column major)
a datatype (e.g. single or half-precision floating point)
These basic traits are then used to decide things like how much storage is needed, and how rocWMMA will organize individual threads in a layout to fetch or store the data.
Fragments
are powerful objects because they are versatile in configuring and representing data, and their meta-properties propagate easily via Argument
Dependent Lookup (ADL) and other deduction techniques. In practice, the user must simply configure the fragments correctly for their problem and rocWMMA conveniently takes care of the rest.
The implementation code is generally encapsulated into different layers of objects, fulfilling specific interface communications (from low level functions to API level):
Unit backend operations. These are often wrappers to device-specific functions such as intrinsics that usually prefixed with
amdgcn_*
. These functions translate inputs into raw vector forms and addresses required by the low-level intrinsics. This backend-area of the code usually handles architecture or device-specific behavior differences.Vector operations. This level of objects such as
OpaqueLoad
orOpaqueStore
handle variable-sized vector inputs and marshall them into unrolled unit backend operations. They encompass thread-wise details of vector operations. These classes provide a consistent functional interface for input vectors of all sizes, independent of device architecture, whose details are handled at a lower level.Fragment operations. This is the API level of rocWMMA where user data is stored and manipulated in
fragment
objects. Fragment objects can be visualized as geometric ‘blocks’ of data in the perspective of the current wavefront, which are to be stored as vectors. Each of the loading / storing and mma operations provided by rocWMMA are in the perspective of the wavefront, assuming that all threads in the wavefront are participating under the hood. This layer’s implementation translates wavefront fragment operations into vector operations and so on. The encapsulation of rocWMMA into these separate layers allows for an API experience that is seamless across different device architectures and platforms.
Nomenclature#
GEMM#
GEneralized Matrix-Matrix multiplication (or, GEMM) is one of the fundamental applications for rocWMMA. In general, GEMM solves the equation D = alpha * A x B + beta * C
, where A, B, C, D
are matrices and alpha
and beta
are scalars.
Matrices are generally sized by M x N x K
, such that A = M x K
, B = K x N
and C, D = M x N
.
rocWMMA implements many varieties of testing and sample kernels for this purpose and encompasses a wide variety of parameters. Testing kernels are grouped into executables that are named as a string of parameters that describe their implementations.
PGR# - Prefetch Global Read lookup stages. PGR0 = no global read prefetch. PGR1 = 1 stage global read prefetch.
LB# - LDS buffer count. LB0 = no LDS usage, LB2 = 2 LDS buffers used for swap.
MP# - MFMA instruction priority. MP0 = default MFMA instruction priority of 0. MP1 = raise MFMA instruction priority to 1.
MB - Multiple output blocks targeted per wave
SB - Single output block target per wave
NC - Non-Cooperative load / store
CP - Cooperative load / store
BLK - Cooperative load / store per block tile
WV - Cooperative load / store per wave tile
WG - Cooperative load / store per macro tile
gemm_PGR0_LB0_MP0_SB_NC
: The simplest blocked GEMM example, which targets one output block of matrix multiplication per wave. No prefetch, no LDS usage, default MFMA prioritization, single block output and non-collaborative.gemm_PGR0_LB0_MP0_MB_NC
: Implements a multi-block GEMM where each wave is responsible for a BlocksX x BlocksY grid of output blocks. No prefetch, no LDS usage, default MFMA prioritization, multiple blocks output, and non-collaborative.gemm_PGR1_LB2_MP0_MB_CP_BLK
: Implements a multi-block GEMM where each wave is responsible for a BlocksX x BlocksY grid of output blocks. This kernel leverages shared memory to implement a data prefetching pipeline and collaborates with other waves to improve performance. Implements single stage prefetch, double LDS buffer, default MFMA prioritization, multiple blocks output, and is block-tile collaborative in global read and local write.gemm_PGR1_LB2_MP0_MB_CP_WV
: Implements a multi-block GEMM where each wave is responsible for a BlocksX x BlocksY grid of output blocks. This kernel leverages shared memory to implement a data prefetching pipeline and collaborates with other waves to improve performance. Implements single stage prefetch, double LDS buffer, default MFMA prioritization, multiple blocks output, and is wave-tile collaborative in global read and local write.gemm_PGR1_LB2_MP0_MB_CP_WG
: Implements a multi-block GEMM where each wave is responsible for a BlocksX x BlocksY grid of output blocks. This kernel leverages shared memory to implement a data prefetching pipeline and collaborates with other waves to improve performance. Implements single stage prefetch, double LDS buffer, default MFMA prioritization, multiple blocks output and is macro-tile collaborative in global read and local write.Ad Hoc Test
: An executable that focuses on a specific set of kernel parameters. This is used as a quick mock-up of a situational investigation of a particular GEMM kernel.
Validation tests are postfixed with -validate
. Benchmark tests are postfixed with -bench
.
Sample kernels are constructed with as minimal infrastructure as possible. Their namings are much different to appeal to a broader audience.
simple_sgemm
: a simple GEMM kernel withs
denoting single-precision floating point datatype.simple_dgemm
: a simple GEMM kernel withd
denoting double-precision floating point datatype.simple_hgemm
: a simple GEMM kernel withh
denoting half-precision floating point datatype.perf_sgemm
: a performant GEMM kernel withs
denoting single-precision floating point datatype.perf_dgemm
: a performant GEMM kernel withd
denoting double-precision floating point datatype.perf_hgemm
: a performant GEMM kernel withh
denoting half-precision floating point datatype.
GEMV#
GEneralized Matrix-Vector multiplication (or, GEMV) is another application for rocWMMA. In general, GEMV solves the equation y = alpha * A * x + beta * y
, where A
is a matrix, x and y
are vectors and alpha and beta
are scalars.
Matrix A
is generally sized as M x K
, vector X
as K x 1
and vector Y
as M x 1
.
rocWMMA implements some samples of simple GEMV demonstrations as below:
simple_sgemv
: Simple GEMV kernel withs
denoting single-precision floating point datatype.simple_dgemv
: Simple GEMV kernel withd
denoting double-precision floating point datatype.
DLRM#
rocWMMA implements a simple component of Deep Learning Recommendation Model (DLRM) for machine learning. Both forward and backwards passes on half-precision inputs and outputs are demonstrated.
simple_dlrm
: Simple GEMV kernel withs
denoting single-precision floating point datatype.
Library source code organization#
The rocWMMA code is split into four major parts:
The
library
directory contains the header library API and implementation.The
samples
directory contains real-world sample use-cases of the rocWMMA API.The
test
directory contains testing infrastructure for rocWMMA.The
docs
directory contains documentation generation sources.
library
directory#
The library
directory contains the following structure:
library/include/rocwmma/
: C++ include files for the rocWMMA API. These files also contain Doxygen content that documents the API.
The API currently has three API contexts:
rocwmma.hpp
: The main API for rocWMMA, defining fragment data abstractions, wave-wise storing, loading, matrix multiply-accumulate (mma) and threadblock synchronization. This API’s function signatures are portable from nvcuda::wmma.
rocwmma_coop.hpp
: A complimentary API for rocWMMA, defining functionality that allows GPU wavefronts to collaborate in the loading / storing of fragment data. These are unique to rocWMMA.
rocwmma_transforms.hpp
: A complimentary API for rocWMMA, defining functionality to manipulate fragment data (e.g. transpose and data layout changes). These are unique to rocWMMA.
library/include/internal
: Internal include files define the main infrastructure driving the rocWMMA API:Configuration of platforms and architectures
Type support
Input and output configuration, shapes and traits
Loading and storing utilities
Layouts of memory and registers
Mapping utilities
Intrinsic wrappers
Vector class implementations
Vector conversion, permutation and transform utilities
Vector packing and unpacking
Matrix multiply-accumulate
Cooperative loading and storing
Threadblock synchronization and flow control
Utility code
Data layout transformation utilities
samples
directory#
The samples
directory contains the sample codes for the following use cases:
samples/hipRTC_gemm.cpp
: For calling simple General Matrix Multiply (GEMM) algorithm demonstration without LDS memory usage and no transpose, from within the hipRTC environment.samples/simple_sgemv.cpp
: For calling simple matrix multiply-accumulate with a vector demonstration, without LDS and no transpose for single-precision floating point types.samples/simple_dgemv.cpp
: For calling simple matrix multiply-accumulate with a vector demonstration, without LDS and no transpose for double-precision floating point types.samples/simple_sgemm.cpp
: For calling simple GEMM algorithm demonstration without LDS memory usage and no transpose for single-precision floating point types.samples/simple_dgemm.cpp
: For calling simple GEMM algorithm demonstration without LDS memory usage and no transpose for double-precision floating point types.samples/simple_hgemm.cpp
: For calling simple GEMM algorithm demonstration without LDS memory usage and no transpose for half-precision floating point types.samples/perf_sgemm.cpp
: For calling the high performing multi-block GEMM algorithm demonstration with LDS memory, macro tile collaboration, data reuse and optimized pipeline for single-precision floating point types.samples/perf_dgemm.cpp
: For calling the high performing multi-block GEMM algorithm demonstration with LDS memory, macro tile collaboration, data reuse and optimized pipeline for double-precision floating point types.samples/perf_hgemm.cpp
: For calling the high performant multi-block GEMM algorithm demonstration with LDS memory, macro tile collaboration, data reuse and optimized pipeline for half-precision floating point types.samples/simple_dlrm.cpp
: For calling simple Deep Learning Recommendation Model (DLRM) for machine learning.samples/common.hpp
: Common code used by all the above rocWMMA samples files.
test
directory#
The test
directory contains the test code support:
test/bin
: To generate benchmark plots from thegtest
output dumps of rocWMMA’s benchmark tests.test/device
: Device utility kernels to support test setup and validation on GPU.test/dlrm
: For various strategies of DLRM application. This test is used to validate DLRM functions using rocWMMA API.test/gemm
: For various strategies of GEMM application. This test is used to validate and benchmark GEMM functions using rocWMMA API.test/unit
: For testing the basic functional units of rocWMMA library.
docs
directory#
Sphinx and Doxygen are used to generate project documentation.
api-reference-guide.rst
pulls from Doxygen documentation to format API documentation.installation.rst
builds installation / build instructions for rocWMMA.license.rst
includes information pertaining to rocWMMA licensing.programmers-guide.rst
includes information about project organization and expectations.what-is-rocwmma.rst
includes a description of rocWMMA.
Contributing#
For those wishing to contribute to the project, please see Contributing to rocWMMA.