Kernel parameters#
Kernel parameters are configuration parameters used by Tensile to make decisions about what assembly code to generate. The kernel parameters affect many aspects of performance. Changing a parameter might help address one performance bottleneck but worsen another. Hence, searching through the parameter space is vital to discovering the fastest kernel for a given problem.
The following table lists the kernel parameters:
| Kernel parameter | Description |
|---|---|
| ``LoopDoWhile`` | Setting to True = Do-While loop; setting to False = While or For loop. |
| ``LoopTail`` | Additional loop with ``LoopUnroll``=1. |
| ``EdgeType`` | Branch, ShiftPtr, or None. |
| ``WorkGroup`` | [dim0, dim1, LocalSplitU] |
| ``ThreadTile`` | [dim0, dim1] |
| ``MatrixInstruction`` | Type of matrix instruction used for the calculation and wave tiling parameters [M, N, K, B]. |
| ``GlobalSplitU`` | Split up summation among work-groups to create more concurrency. This option launches a kernel to handle the beta scaling and then a second kernel with atomic writes to the global memory. |
| ``PrefetchGlobalRead`` | Setting to True ensures that the outer loop prefetches global data one iteration ahead. |
| ``PrefetchLocalRead`` | Setting to True ensures that the inner loop prefetches local data one iteration ahead. |
| ``WorkGroupMapping`` | The order in which the work-groups compute C. This affects caching. |
| ``LoopUnroll`` | The number of iterations to unroll the inner loop. This helps in loading coalesced memory. |
| ``MacroTile0`` | Derived using ``WorkGroup0`` and ``ThreadTile0``. |
| ``MacroTile1`` | Derived using ``WorkGroup1`` and ``ThreadTile1``. |
| ``NumLoadsCoalescedA`` | The number of loads from A in the coalesced dimension. |
| ``GlobalReadCoalesceGroupA`` | Setting to True ensures that adjacent threads map to adjacent global read elements. However, when transposing data, the write to LDS is scattered. |
| ``GlobalReadCoalesceVectorA`` | Setting to True ensures that the vector components map to adjacent global read elements. However, when transposing data, the write to LDS is scattered. |
| ``VectorWidth`` | As the thread tile elements are contiguous for faster memory accesses, a ``VectorWidth``=4 implies that a thread reads a float4 from memory instead of four non-contiguous floats. |
| ``KernelLanguage`` | Decides whether the kernels are written in source code (HIP) or assembly (gfx803, gfx900, …). |
For the exhaustive list of solution parameters and their defaults, see Common.py.
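As noted above, the fastest kernel is found by searching through combinations of these parameters. A minimal sketch of what enumerating such a search space looks like is below; the parameter names are real Tensile parameters, but the candidate values are illustrative rather than Tensile's defaults, and this is not Tensile's benchmarking code.

```python
# Sketch: enumerate every combination of a few kernel-parameter candidates.
# The value lists here are invented for illustration.
from itertools import product

search_space = {
    "ThreadTile": [[4, 4], [8, 8]],
    "WorkGroup": [[16, 16, 1], [8, 32, 1]],
    "PrefetchGlobalRead": [True, False],
    "GlobalSplitU": [1, 2, 4],
}

# Cartesian product of all candidate values -> one dict per candidate kernel.
candidates = [
    dict(zip(search_space, values))
    for values in product(*search_space.values())
]
print(len(candidates))  # 2 * 2 * 2 * 3 = 24 candidate kernels
```

Even this tiny space yields 24 candidates, which is why the search is typically driven by a benchmark rather than done by hand.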
GPU kernel dimensions#
Tensile allows for a 3-dimensional grid of work-groups. Each work-group can be a 3-dimensional grid of work-items. Tensile assigns D0 to dimension-0 and D1 to dimension-1 of the work-group and work-item grids. All other free or batch dimensions are flattened into the final dimension-2 of the work-group and work-item grids. Within the GPU kernel, dimension-2 is reconstituted back into whatever dimensions it represents.
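The flattening and reconstitution described above is ordinary linear-index arithmetic. The following is an illustrative sketch (not Tensile code) of folding several extra dimensions into a single dimension-2 index and recovering them inside the kernel:

```python
# Sketch: fold N extra dimensions into one linear index (row-major), then
# reconstitute them. Sizes and indices are invented for illustration.
def flatten_index(indices, sizes):
    """Collapse N-d indices into one linear dimension-2 index."""
    flat = 0
    for i, s in zip(indices, sizes):
        flat = flat * s + i
    return flat

def unflatten_index(flat, sizes):
    """Recover the N-d indices from the linear dimension-2 index."""
    indices = []
    for s in reversed(sizes):
        flat, i = divmod(flat, s)
        indices.append(i)
    return tuple(reversed(indices))

sizes = (3, 5, 7)  # three extra free/batch dimensions folded into dimension-2
flat = flatten_index((2, 4, 6), sizes)
print(flat)                            # 104
print(unflatten_index(flat, sizes))    # (2, 4, 6)
```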
Mapping between N-dimensional tensor contractions and finite-dimensional GPU kernels#
For a traditional GEMM, the 2-dimensional output, C[i,j], is mapped to launching a 2-dimensional grid of work-groups. Each work-group has a 2-dimensional grid of work-items with one dimension belonging to i and another to j. The 1-dimensional summation is represented by a single loop within the kernel body.
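The mapping above can be modeled in a few lines of Python. This is not Tensile's implementation, only a sketch: a 2-dimensional grid of work-groups covers C[i, j], each work-group owns one tile of C, and the 1-dimensional summation is a loop inside the kernel body.

```python
# Sketch: each (wg0, wg1) "work-group" computes one tile of C; the summation
# over k is the single loop inside the kernel body. Sizes are illustrative.
def gemm_workgroup(A, B, C, wg0, wg1, tile0, tile1):
    """Compute the tile of C owned by work-group (wg0, wg1)."""
    K = len(B)
    for i in range(wg0 * tile0, (wg0 + 1) * tile0):
        for j in range(wg1 * tile1, (wg1 + 1) * tile1):
            acc = 0
            for k in range(K):  # the 1-D summation loop
                acc += A[i][k] * B[k][j]
            C[i][j] = acc

M, N, K = 4, 6, 3
tile0, tile1 = 2, 3
A = [[i + k for k in range(K)] for i in range(M)]
B = [[1 + k * j for j in range(N)] for k in range(K)]
C = [[0] * N for _ in range(M)]

# "Launch" the 2-D grid of work-groups: (M/tile0) x (N/tile1) of them.
for wg0 in range(M // tile0):
    for wg1 in range(N // tile1):
        gemm_workgroup(A, B, C, wg0, wg1, tile0, tile1)
```

On a GPU the two outer loops are the hardware grid launch, and each work-group's i/j ranges are further subdivided among its work-items.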
Special dimensions: D0, D1, and DU#
To handle arbitrary dimensionality, Tensile begins by determining three special dimensions: D0, D1, and DU.
D0 and D1 are the free indices of A and B respectively having the shortest strides. This allows the fastest reads for innermost loops from A and B via coalescing. In a traditional GEMM, every matrix has a dimension with a shortest stride of one, but Tensile doesn’t rely on this assumption. Of these two dimensions, D0 is the dimension with the shortest tensor C stride, which allows for fast writing.
DU represents the summation index with the shortest combined stride (stride in A + stride in B); DU is the innermost loop, which gets “U”nrolled. This assignment is also meant to ensure fast reading in the innermost summation loop. In the case of multiple summation indices (nested loops), DU corresponds to the innermost loop.
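The selection rules above can be sketched directly. This is a hedged illustration, not Tensile code; the function name and the stride values in the example are invented.

```python
# Sketch of the D0/D1/DU selection rules described above.
def pick_special_dims(a_free, b_free, c_strides, a_sum, b_sum):
    """a_free/b_free: stride of each free index in A/B; c_strides: stride of
    those indices in C; a_sum/b_sum: stride of each summation index."""
    dA = min(a_free, key=a_free.get)   # A's free index with shortest stride
    dB = min(b_free, key=b_free.get)   # B's free index with shortest stride
    # D0 is whichever of the two has the shorter stride in C (fast writes).
    d0, d1 = sorted((dA, dB), key=lambda d: c_strides[d])
    # DU: summation index with the shortest combined stride in A and B.
    du = min(a_sum, key=lambda u: a_sum[u] + b_sum[u])
    return d0, d1, du

# Illustrative GEMM-like problem: C[i, j] = sum_k A[i, k] * B[k, j].
d0, d1, du = pick_special_dims(
    a_free={"i": 1}, b_free={"j": 64},
    c_strides={"i": 1, "j": 128},
    a_sum={"k": 128}, b_sum={"k": 1},
)
print(d0, d1, du)  # i j k
```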
Kernel names#
Kernel names contain abbreviations of relevant parameters along with their value. Here is what a typical kernel name looks like:
Cijk_Ailk_Bjlk_SB_MT64x256x16_<PARAMETERS>
The given kernel name example is a GEMM. The different parts of the kernel name are described here:
The first part (C***_A***_B***) indicates the type of operation the kernel performs.
The second part indicates the data type supported by the kernel. In the preceding example, “S” indicates single-precision floating-point numbers and “B” indicates that the kernel can use beta values.
For a list of supported data types and their corresponding code names, please refer to Precision support.
The third part, “MT”, stands for macro tile, which is 64x256 here. The third number listed with the macro tile (16 in the example) is the unroll depth, specified by the ``DepthU`` parameter.
The last part, “<PARAMETERS>”, is an alphabetized list of abbreviations of relevant kernel parameters. The table below lists the parameters, their kernel name abbreviations, and their default values to help interpret the meaning of a kernel name:
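Splitting a kernel name along this layout is mechanical. The sketch below is an illustrative parser, not a Tensile utility, and the trailing parameter abbreviations in the sample name are made up:

```python
# Sketch: split a kernel name into operation, precision/beta flags,
# macro tile, DepthU, and the trailing parameter abbreviations.
import re

name = "Cijk_Ailk_Bjlk_SB_MT64x256x16_SE_K1"  # trailing parts are invented

m = re.match(
    r"(C[a-z]+)_(A[a-z]+)_(B[a-z]+)_([A-Z]+)_MT(\d+)x(\d+)x(\d+)_(.*)", name
)
operation = m.group(1, 2, 3)        # ("Cijk", "Ailk", "Bjlk")
precision = m.group(4)              # "SB": single precision, uses beta
macro_tile = (int(m.group(5)), int(m.group(6)))  # (64, 256)
depth_u = int(m.group(7))           # 16, the unroll depth (DepthU)
parameters = m.group(8).split("_")  # alphabetized parameter abbreviations
print(operation, precision, macro_tile, depth_u, parameters)
```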
Table 3 kernel name parameters#

| Code | Parameter | Default |
|---|---|---|
| 1LDSB | 1LDSBuffer | 0 |
| APM | AggressivePerfMode | 1 |
| AAV | AssertAlphaValue | False |
| ABV | AssertBetaValue | False |
| ACED | AssertCEqualsD | False |
| AF0EM | AssertFree0ElementMultiple | 1 |
| AF1EM | AssertFree1ElementMultiple | 1 |
| AMAS | AssertMinApproxSize | -1 |
| ASE | AssertSizeEqual | {} |
| ASGT | AssertSizeGreaterThan | {} |
| ASLT | AssertSizeLessThan | {} |
| ASM | AssertSizeMultiple | {} |
| ASAE | AssertStrideAEqual | {} |
| ASBE | AssertStrideBEqual | {} |
| ASCE | AssertStrideCEqual | {} |
| ASDE | AssertStrideDEqual | {} |
| ASEM | AssertSummationElementMultiple | 1 |
| AAC | AtomicAddC | False |
| BL | BufferLoad | True |
| BS | BufferStore | True |
| CDO | CheckDimOverflow | 0 |
| CTDA | CheckTensorDimAsserts | False |
|  | CustomKernelName | “” |
| DU | DepthU | -1 |
| DULD | DepthULdsDivisor | 1 |
| DTL | DirectToLds | False |
| DTVA | DirectToVgprA | False |
| DTVB | DirectToVgprB | False |
| DAF | DisableAtomicFail | 0 |
| DKP | DisableKernelPieces | 0 |
| DVO | DisableVgprOverlapping | False |
| ET | EdgeType | Branch |
| EPS | ExpandPointerSwap | True |
| R | Fp16AltImpl | False |
| FL | FractionalLoad | 0 |
| GR2A | GlobalRead2A | True |
| GR2B | GlobalRead2B | True |
| GRCGA | GlobalReadCoalesceGroupA | True |
| GRCGB | GlobalReadCoalesceGroupB | True |
| GRCVA | GlobalReadCoalesceVectorA | True |
| GRCVB | GlobalReadCoalesceVectorB | True |
| GRPM | GlobalReadPerMfma | 1 |
| GRVW | GlobalReadVectorWidth | -1 |
| GSU | GlobalSplitU | 1 |
| GSUA | GlobalSplitUAlgorithm | SingleBuffer |
| GSUSARR | GlobalSplitUSummationAssignmentRoundRobin | True |
| GSUWGMRR | GlobalSplitUWorkGroupMappingRoundRobin | False |
| GLS | GroupLoadStore | False |
| ISA | ISA |  |
| IU | InnerUnroll | 1 |
| IA | InterleaveAlpha | 0 |
| KL | KernelLanguage | Source |
| LEL | LdcEqualsLdd | True |
| LBSPP | LdsBlockSizePerPad | -1 |
| LPA | LdsPadA | 0 |
| LPB | LdsPadB | 0 |
| LDL | LocalDotLayout | 1 |
| LRVW | LocalReadVectorWidth | -1 |
| LWPM | LocalWritePerMfma | -1 |
| LR2A | LocalRead2A | True |
| LR2B | LocalRead2B | True |
| LW2A | LocalWrite2A | True |
| LW2B | LocalWrite2B | True |
| LDW | LoopDoWhile | False |
| LT | LoopTail | True |
| MAD or FMA | MACInstruction | FMA |
| MT | MacroTile |  |
| MTSM | MacroTileShapeMax | 64 |
| MTSM | MacroTileShapeMin | 1 |
| MDA | MagicDivAlg | 2 |
| MI | MatrixInstruction | [] |
| MO | MaxOccupancy | 40 |
| MVN | MaxVgprNumber | 256 |
| MIAV | MIArchVgpr | False |
| MVN | MinVgprNumber | 0 |
| NTA | NonTemporalA | 0 |
| NTB | NonTemporalB | 0 |
| NTC | NonTemporalC | 0 |
| NTD | NonTemporalD | 0 |
| NR | NoReject | False |
| NEPBS | NumElementsPerBatchStore | 0 |
| NLCA | NumLoadsCoalescedA | 1 |
| NLCB | NumLoadsCoalescedB | 1 |
| ONLL | OptNoLoadLoop | 1 |
| OPLV | OptPreLoopVmcnt | True |
| PBD | PackBatchDims | 0 |
| PFD | PackFreeDims | 1 |
| PG | PackGranularity | 2 |
| PSD | PackSummationDims | 0 |
| PSL | PerformanceSyncLocation | -1 |
| PWC | PerformanceWaitCount | -1 |
| PWL | PerformanceWaitLocation | -1 |
| PK | PersistentKernel | 0 |
| PKAB | PersistentKernelAlongBatch | False |
| PAP | PrefetchAcrossPersistent | 0 |
| PAPM | PrefetchAcrossPersistentMode | 0 |
| PGR | PrefetchGlobalRead | True |
| PLR | PrefetchLocalRead | 1 |
| RK | ReplacementKernel | False |
| SGR | ScheduleGlobalRead | 1 |
| SIA | ScheduleIterAlg | 1 |
| SLW | ScheduleLocalWrite | 1 |
| SS | SourceSwap | False |
| SU | StaggerU | 32 |
| SUM | StaggerUMapping | 0 |
| SUS | StaggerUStride | 256 |
| SCIU | StoreCInUnroll | False |
| SCIUE | StoreCInUnrollExact | False |
| SCIUI | StoreCInUnrollInterval | 1 |
| SCIUP | StoreCInUnrollPostLoop | False |
| SPO | StorePriorityOpt | False |
| SRVW | StoreRemapVectorWidth | 0 |
| SSO | StoreSyncOpt | 0 |
| SVW | StoreVectorWidth | -1 |
| SNLL | SuppressNoLoadLoop | False |
| TSGRA | ThreadSeparateGlobalReadA | 0 |
| TSGRB | ThreadSeparateGlobalReadB | 0 |
| TT | ThreadTile | [4, 4] |
| TLDS | TransposeLDS | 0 |
| UIIDU | UnrollIncIsDepthU | 0 |
| UMF | UnrollMemFence | False |
| U64SL | Use64bShadowLimit | 1 |
| UIOFGRO | UseInstOffsetForGRO | 0 |
| USFGRO | UseSgprForGRO | -1 |
| VAW | VectorAtomicWidth | -1 |
| VS | VectorStore | True |
| VW | VectorWidth | -1 |
| WSGRA | WaveSeparateGlobalReadA | 0 |
| WSGRB | WaveSeparateGlobalReadB | 0 |
| WS | WavefrontSize | 64 |
| WG | WorkGroup | [16, 16, 1] |
| WGM | WorkGroupMapping | 8 |
| WGMT | WorkGroupMappingType | B |