Index

A | B | C | D | E | F | G | H | I | K | L | M | N | O | P | R | S | T | U | V | W

A
Add+Multiply
alignment
arithmetic logic unit

B
bank conflict
batched GEMM
block Size
block tile

C
Col2Im
compute unit
coordinate transformation primitives

D
dense tensor
descriptor
device
dilation

E
elementwise
epilogue

F
fast changing dimension
fused add multiply

G
GEMM
GEMV
general matrix multiply
general matrix vector multiplication
GGEMM
global memory
grid
grouped GEMM

H
host
host-device transfer

I
Im2Col
inner dimension
input

K
kernel

L
launch parameters
LDS
LDS banks
load tile
local data share

M
matrix core
matrix fused multiply-add
memory coalescing
MFMA

N
naive GEMM

O
occupancy
operation
outer dimension

P
padding
permute
pinned memory
pipeline
policy
problem
problem shape

R
reference kernel
register

S
scalar general purpose register
SGPR
SIMD
SIMT
single-instruction, multi-data
single-instruction, multi-thread
sparse tensor
Split-K GEMM
store tile
stride

T
tile
tile distribution
tile partitioner
tile programming API
tile window
transpose

U
user customized tile pipeline
user customized tile pipeline optimization

V
vanilla GEMM
vector
vector general purpose register
VGEMM
VGPR

W
wave tile
wavefront
work group
work-item