- Add+Multiply#
See fused add multiply.
- alignment#
Alignment is a memory management strategy where data structures are stored at addresses that are multiples of a specific value.
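For example, C++'s `alignas` specifier forces a type onto a given boundary. A minimal sketch (the `Float4` struct is hypothetical, not a Composable Kernel type):

```cpp
// Hypothetical example: alignas(16) places every Float4 at a 16-byte
// boundary, so a vectorized load can fetch all four floats in one
// aligned transaction.
struct alignas(16) Float4 {
    float x, y, z, w;
};

static_assert(alignof(Float4) == 16, "Float4 must be 16-byte aligned");
```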
- arithmetic logic unit#
The arithmetic logic unit (ALU) is the GPU component responsible for arithmetic and logic operations.
- bank conflict#
A bank conflict occurs when multiple work-items in a wavefront access different addresses that map to the same shared memory bank.
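As an illustration, the HIP kernel below (a sketch, not a Composable Kernel API, assuming 32 four-byte LDS banks) contrasts a conflict-free row read with a column read that maps every work-item in the wavefront to the same bank:

```cpp
#include <hip/hip_runtime.h>

__global__ void bank_conflict_demo(float* out)
{
    __shared__ float lds[32][32];
    int tid = threadIdx.x;

    // Fill one row and one column so the reads below are defined.
    lds[0][tid] = static_cast<float>(tid);
    lds[tid][0] = static_cast<float>(tid);
    __syncthreads();

    // Conflict-free: consecutive work-items read consecutive addresses,
    // which map to different banks.
    float row_read = lds[0][tid];

    // Bank conflict: addresses down a column are 32 floats (128 bytes)
    // apart, so with 32 four-byte banks they all map to the same bank
    // and the accesses are serialized.
    float col_read = lds[tid][0];

    out[tid] = row_read + col_read;
}
```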
- batched GEMM#
A batched GEMM is a kernel that runs VGEMMs over multiple batches of data. All the batches have the same problem shape.
- block size#
The block size is the number of work-items in a work group.
- block tile#
A block tile is a memory tile processed by a work group.
- Col2Im#
Col2Im is a data transformation technique that converts column data to image format.
- compute unit#
The compute unit (CU) is the parallel vector processor in an AMD GPU, containing multiple ALUs. Each compute unit runs all the wavefronts in a work group. A compute unit is equivalent to NVIDIA’s streaming multiprocessor.
- coordinate transformation primitives#
Coordinate transformation primitives are Composable Kernel utilities for converting between different coordinate systems.
- dense tensor#
A dense tensor is a tensor where most of its elements are non-zero. Dense tensors are typically stored in a contiguous block of memory.
- descriptor#
A descriptor is a metadata structure that defines tile properties, memory layouts, and coordinate transformations for Composable Kernel operations.
- device#
Device refers to the GPU hardware that runs parallel kernels. The device contains the compute units, memory hierarchy, and specialized accelerators.
- dilation#
Dilation is the spacing between kernel elements in convolution operations, allowing the receptive field to grow without increasing kernel size.
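For instance, a kernel of size \(k\) with dilation \(d\) covers an effective width of \(d(k-1)+1\). The helper functions below are illustrative, not Composable Kernel APIs:

```cpp
#include <cstdio>

int effective_kernel_size(int kernel, int dilation)
{
    // Dilation inserts (dilation - 1) gaps between kernel elements.
    return dilation * (kernel - 1) + 1;
}

int conv_output_size(int input, int kernel, int stride, int padding, int dilation)
{
    // Standard convolution size arithmetic.
    return (input + 2 * padding - effective_kernel_size(kernel, dilation)) / stride + 1;
}

int main()
{
    // A 3-wide kernel with dilation 2 covers a 5-wide receptive field.
    std::printf("%d\n", effective_kernel_size(3, 2));      // prints 5
    std::printf("%d\n", conv_output_size(32, 3, 1, 2, 2)); // prints 32
}
```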
- elementwise#
An elementwise operation is an operation applied to each tensor element independently.
- epilogue#
The epilogue is the final stage of a kernel. Activation functions, bias, and other post-processing steps are applied in the epilogue.
- fast changing dimension#
The fast changing dimension is the innermost dimension in memory layout.
- fused add multiply#
A common fused operation in machine learning and linear algebra, where an elementwise addition is immediately followed by a multiplication. Fused add multiply is often used for bias and scaling in neural network layers.
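A minimal HIP sketch of the idea (not the Composable Kernel implementation): each work-item performs the elementwise add and multiply in a single kernel, so the intermediate sum never round-trips through global memory:

```cpp
#include <hip/hip_runtime.h>

__global__ void fused_add_multiply(const float* x, const float* bias,
                                   const float* scale, float* y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
    {
        // Bias then scale, applied independently to each element.
        y[i] = (x[i] + bias[i]) * scale[i];
    }
}
```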
- GEMM#
See general matrix multiply.
- GEMV#
See general matrix vector multiplication.
- general matrix multiply#
A general matrix multiply (GEMM) is a core matrix operation in linear algebra and deep learning. A GEMM is defined as \(C = {\alpha}AB + {\beta}C\), where \(A\), \(B\), and \(C\) are matrices, and \(\alpha\) and \(\beta\) are scalars.
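A reference sketch of this definition for row-major matrices (illustrative only; production kernels tile and parallelize this loop nest):

```cpp
void gemm_reference(int M, int N, int K, float alpha, const float* A,
                    const float* B, float beta, float* C)
{
    for (int m = 0; m < M; ++m)
    {
        for (int n = 0; n < N; ++n)
        {
            float acc = 0.0f;
            for (int k = 0; k < K; ++k)
            {
                acc += A[m * K + k] * B[k * N + n];
            }
            // C = alpha * A * B + beta * C
            C[m * N + n] = alpha * acc + beta * C[m * N + n];
        }
    }
}
```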
- general matrix vector multiplication#
General matrix vector multiplication (GEMV) is an operation where a matrix is multiplied by a vector, producing another vector.
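A reference sketch for a row-major \(M \times N\) matrix, computing \(y = Ax\) (illustrative only):

```cpp
void gemv_reference(int M, int N, const float* A, const float* x, float* y)
{
    for (int m = 0; m < M; ++m)
    {
        // Each output element is the dot product of one matrix row with x.
        float acc = 0.0f;
        for (int n = 0; n < N; ++n)
        {
            acc += A[m * N + n] * x[n];
        }
        y[m] = acc;
    }
}
```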
- GGEMM#
See grouped GEMM.
- global memory#
Global memory is the main device memory, accessible by all work-items. It offers high capacity but higher latency than the local data share.
- grid#
A grid is a collection of work groups that run a kernel. Each work group within the grid operates independently and can be scheduled on a different compute unit. A grid can be organized into one, two, or three dimensions. A grid is equivalent to an NVIDIA grid of thread blocks.
- grouped GEMM#
A grouped GEMM is a kernel that makes multiple VGEMM calls. Each call can have a different problem shape.
- host#
Host refers to the CPU and the main memory system that manages GPU execution. The host is responsible for launching kernels, transferring data, and coordinating overall computation.
- host-device transfer#
A host-device transfer is the process of moving data between host and device memory.
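A minimal HIP sketch of the round trip, using the standard runtime calls `hipMalloc`, `hipMemcpy`, and `hipFree`:

```cpp
#include <hip/hip_runtime.h>
#include <vector>

int main()
{
    std::vector<float> host(1024, 1.0f);
    float* device = nullptr;
    const size_t bytes = host.size() * sizeof(float);

    // Allocate device memory and copy host data to it (host-to-device).
    hipMalloc(reinterpret_cast<void**>(&device), bytes);
    hipMemcpy(device, host.data(), bytes, hipMemcpyHostToDevice);

    // ... launch kernels that read and write `device` ...

    // Copy the results back (device-to-host) and release the allocation.
    hipMemcpy(host.data(), device, bytes, hipMemcpyDeviceToHost);
    hipFree(device);
}
```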
- Im2Col#
Im2Col is a data transformation technique that converts image data to column format.
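A minimal single-channel sketch (no padding or dilation; not the Composable Kernel implementation). Each output column holds one kernel-sized patch of the image, so the convolution reduces to a GEMM over the column matrix:

```cpp
#include <vector>

std::vector<float> im2col(const std::vector<float>& image, int H, int W,
                          int kernel, int stride)
{
    int out_h = (H - kernel) / stride + 1;
    int out_w = (W - kernel) / stride + 1;

    // (kernel * kernel) rows by (out_h * out_w) columns, row-major.
    std::vector<float> columns(kernel * kernel * out_h * out_w);
    for (int kh = 0; kh < kernel; ++kh)
        for (int kw = 0; kw < kernel; ++kw)
            for (int oh = 0; oh < out_h; ++oh)
                for (int ow = 0; ow < out_w; ++ow)
                {
                    // Patch element (kh, kw) for output position (oh, ow)
                    // comes from input pixel (oh*stride + kh, ow*stride + kw).
                    int row = kh * kernel + kw;
                    int col = oh * out_w + ow;
                    columns[row * (out_h * out_w) + col] =
                        image[(oh * stride + kh) * W + (ow * stride + kw)];
                }
    return columns;
}
```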
- inner dimension#
The inner dimension is the faster-changing dimension in memory layout.
- input#
See problem shape.
- kernel#
A kernel is a function that runs an operation or a collection of operations. A kernel will run in parallel on several work-items across the GPU. In Composable Kernel, kernels require pipelines.
- launch parameters#
Launch parameters are the configuration values, such as grid and block size, that determine how a kernel is mapped to hardware resources.
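A minimal HIP sketch: the block dimension sets the work-group size, and the grid dimension is derived by dividing the problem size by it (the kernel and its names are illustrative):

```cpp
#include <hip/hip_runtime.h>

__global__ void scale(float* data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

void launch_scale(float* data, float factor, int n)
{
    // Launch parameters: 256 work-items per work group, and enough work
    // groups in the grid to cover all n elements.
    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);
    scale<<<grid, block>>>(data, factor, n);
}
```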
- LDS#
See local data share.
- LDS banks#
LDS banks are a type of memory organization where consecutive addresses are distributed across multiple memory banks for parallel access. LDS banks are used to prevent memory access conflicts and improve bandwidth when LDS is used.
- load tile#
Load tile is an operation that transfers data from global memory or the local data share to vector general purpose registers.
- local data share#
Local data share (LDS) is high-bandwidth, low-latency on-chip memory accessible to all the work-items in a work group. LDS is equivalent to NVIDIA’s shared memory.
- matrix core#
A matrix core is a specialized GPU unit that accelerates matrix operations for AI and deep learning tasks. A GPU contains multiple matrix cores.
- matrix fused multiply-add#
Matrix fused multiply-add (MFMA) is a matrix core instruction for GEMM operations.
- memory coalescing#
Memory coalescing is an optimization strategy where consecutive work-items access consecutive memory addresses in such a way that a single memory transaction serves multiple work-items.
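The HIP sketch below contrasts a coalesced copy with a strided one; the kernels are illustrative, not Composable Kernel APIs:

```cpp
#include <hip/hip_runtime.h>

// Coalesced: work-item i touches element i, so a wavefront's accesses
// fall within a few contiguous cache lines and are served together.
__global__ void copy_coalesced(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: adjacent work-items are `stride` elements apart, so each
// access may require its own memory transaction.
__global__ void copy_strided(const float* in, float* out, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}
```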
- MFMA#
See matrix fused multiply-add.
- naive GEMM#
The naive GEMM, sometimes referred to as a vanilla GEMM or VGEMM, is the simplest form of GEMM in Composable Kernel. The naive GEMM is defined as \(C = AB\), where \(A\), \(B\), and \(C\) are matrices. The naive GEMM is the baseline GEMM that all other GEMM operations build on.
- occupancy#
Occupancy is the ratio of active wavefronts to the maximum number of wavefronts that can be resident on a compute unit.
- operation#
An operation is a computation on input data.
- outer dimension#
The outer dimension is the slower-changing dimension in memory layout.
- padding#
Padding is the addition of extra elements, often zeros, to tensor edges in order to control output size in convolution and pooling, or to align data for memory access.
- permute#
Permute is an operation that rearranges the order of tensor axes, often to match kernel input formats or to optimize memory access patterns.
- pinned memory#
Pinned memory is host memory that is page-locked to accelerate transfers between the CPU and GPU.
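A minimal HIP sketch: `hipHostMalloc` returns page-locked host memory, which the GPU's DMA engines can transfer directly and which enables asynchronous copies:

```cpp
#include <hip/hip_runtime.h>

int main()
{
    const size_t bytes = 1024 * sizeof(float);
    float* pinned = nullptr;
    float* device = nullptr;

    // Page-locked host allocation and a matching device allocation.
    hipHostMalloc(reinterpret_cast<void**>(&pinned), bytes, hipHostMallocDefault);
    hipMalloc(reinterpret_cast<void**>(&device), bytes);

    // Pinned memory permits asynchronous host-to-device copies.
    hipMemcpyAsync(device, pinned, bytes, hipMemcpyHostToDevice, /*stream=*/0);
    hipDeviceSynchronize();

    hipFree(device);
    hipHostFree(pinned);
}
```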
- pipeline#
A Composable Kernel pipeline schedules the sequence of operations for a kernel, such as the data loading, computation, and storage phases. A pipeline consists of a problem and a policy.
- policy#
The policy is the part of the pipeline that defines memory access patterns and hardware-specific optimizations.
- problem#
The problem is the part of the pipeline that defines input and output shapes, data types, and mathematical operations.
- problem shape#
The problem shape is the set of dimensions and data types of the input tensors that define the problem.
- reference kernel#
A reference kernel is a baseline kernel implementation used to verify correctness and performance. Composable Kernel provides two reference kernels: one for the CPU and one for the GPU.
- register#
Registers are the fastest tier of memory. They’re used for storing temporary values during computations and are private to the work-items that use them.
- scalar general purpose register#
A scalar general purpose register (SGPR) is a register shared by all the work-items in a wave. SGPRs are used for constants, addresses, and control flow common across the entire wave.
- SGPR#
See scalar general purpose register.
- SIMD#
See single-instruction, multi-data.
- SIMT#
See single-instruction, multi-thread.
- single-instruction, multi-data#
Single-instruction, multi-data (SIMD) is a parallel computing model where the same instruction is run on different data simultaneously.
- single-instruction, multi-thread#
Single-instruction, multi-thread (SIMT) is a parallel computing model where all the work-items within a wavefront run the same instruction on different data.
- sparse tensor#
A sparse tensor is a tensor where most of its elements are zero. Typically only the non-zero elements of a sparse tensor and their indices are stored.
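A minimal sketch of one common storage scheme, the coordinate (COO) format; the type and function names here are illustrative:

```cpp
#include <cstddef>
#include <vector>

// Store only the non-zero values, together with their row and column indices.
struct SparseMatrixCOO
{
    std::size_t rows = 0;
    std::size_t cols = 0;
    std::vector<std::size_t> row_index;
    std::vector<std::size_t> col_index;
    std::vector<float> value;
};

// y += A * x, touching only the stored non-zeros.
// Assumes y has A.rows elements and x has A.cols elements.
void spmv(const SparseMatrixCOO& A, const std::vector<float>& x,
          std::vector<float>& y)
{
    for (std::size_t i = 0; i < A.value.size(); ++i)
    {
        y[A.row_index[i]] += A.value[i] * x[A.col_index[i]];
    }
}
```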
- Split-K GEMM#
Split-K GEMM is a parallelization strategy that partitions the reduction dimension (K) of a GEMM across multiple compute units, increasing parallelism for large matrix multiplications.
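A serial sketch of the idea for \(C = AB\) with \(C\) zero-initialized (illustrative, not the Composable Kernel implementation). On a GPU, each K slice would run on its own work group, followed by a reduction across the partial results:

```cpp
#include <algorithm>

void splitk_gemm_sketch(int M, int N, int K, int splits,
                        const float* A, const float* B, float* C)
{
    int chunk = (K + splits - 1) / splits;
    for (int s = 0; s < splits; ++s) // one "work group" per K slice
    {
        int k_begin = s * chunk;
        int k_end   = std::min(k_begin + chunk, K);
        for (int m = 0; m < M; ++m)
        {
            for (int n = 0; n < N; ++n)
            {
                float partial = 0.0f;
                for (int k = k_begin; k < k_end; ++k)
                {
                    partial += A[m * K + k] * B[k * N + n];
                }
                // Reduction across slices; on a GPU this would be an
                // atomic add or a separate reduction kernel.
                C[m * N + n] += partial;
            }
        }
    }
}
```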
- store tile#
Store tile is an operation that transfers data from vector general purpose registers to global memory or the local data share.
- stride#
A stride is the step size to move from one element to the next in a specific dimension of a tensor or matrix. In convolution and pooling, the stride determines how far the kernel moves at each step.
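For example, in a row-major 3-D tensor each dimension's stride is the number of elements skipped per step along it, which also makes the last dimension the fast changing (inner) one. A minimal sketch:

```cpp
// Linear offset into a row-major tensor of shape (D0, D1, D2).
// D0 is not needed for the offset itself, only for bounds.
int linear_offset(int i0, int i1, int i2, int D1, int D2)
{
    int stride0 = D1 * D2; // elements skipped per step along dim 0
    int stride1 = D2;      // per step along dim 1
    int stride2 = 1;       // per step along dim 2 (innermost)
    return i0 * stride0 + i1 * stride1 + i2 * stride2;
}
```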
- tile#
A tile is a rectangular sub-region of a tensor or matrix that is processed by a work group or work-item. These rectangular data blocks are the unit of computation and memory transfer in Composable Kernel, and are the basis for tiled algorithms.
- tile distribution#
The tile distribution is the hierarchical data mapping from work-items to data in memory.
- tile partitioner#
The tile partitioner defines the mapping between the problem dimensions and GPU hierarchy. It specifies workgroup-level tile sizes and determines grid dimensions by dividing the problem size by the tile sizes.
- tile programming API#
The tile programming API is Composable Kernel’s high-level interface for defining tile-based computations with predefined hardware mappings for data loading and storing.
- tile window#
A tile window is a viewport into a larger tensor that defines the current tile’s position and boundaries for computation.
- transpose#
Transpose is an operation that rearranges the order of tensor axes, often to match kernel input formats or to optimize memory access patterns.
- user customized tile pipeline#
A user customized tile pipeline is a tile pipeline that combines custom problem and policy components for specialized computations.
- user customized tile pipeline optimization#
User customized tile pipeline optimization is the process of tuning the tile size, memory access pattern, and hardware utilization for specific workloads.
- vanilla GEMM#
See naive GEMM.
- vector#
The vector is the smallest data unit processed by an individual work-item. A vector is typically four to sixteen elements, depending on the data type and hardware.
- vector general purpose register#
A vector general purpose register (VGPR) is a register that stores data private to an individual work-item. Each work-item in a wave has its own set of VGPRs for private variables and calculations.
- VGEMM#
See naive GEMM.
- VGPR#
See vector general purpose register.
- wave tile#
A wave tile is a sub-tile processed by a single wavefront within a work group. The wave tile is the base-level granularity of the single-instruction, multi-thread (SIMT) model.
- wavefront#
Also referred to as a wave, a wavefront is a group of work-items that run the same instruction. A wavefront is equivalent to an NVIDIA warp.
- work group#
A work group is a collection of work-items that can synchronize and share memory. A work group is equivalent to NVIDIA’s thread block.
- work-item#
A work-item is the smallest unit of parallel execution. A work-item runs a single independent instruction stream on a single data element. A work-item is equivalent to an NVIDIA thread.