CK Tile Hardware Documentation#
This section provides in-depth coverage of hardware-specific concepts and optimizations for CK Tile on AMD GPUs.
Overview#
Understanding the underlying hardware architecture is crucial for achieving optimal performance with CK Tile. This documentation covers:
AMD CDNA architecture fundamentals
Memory hierarchy and optimization techniques
Practical examples of high-performance kernels
Documentation Structure#
Hardware Topics
GPU Architecture Basics#
Intro to AMD CDNA Architecture covers the fundamentals of the AMD CDNA architecture.
LDS and Bank Conflicts#
Understanding AMD GPU LDS and Bank Conflicts explains Local Data Share (LDS) optimization.
GEMM Optimization Case Study#
A Block GEMM on MI300 demonstrates a complete optimization journey.
Key Hardware Considerations#
Memory Hierarchy#
Global Memory: High latency, high bandwidth
Optimize with coalesced access patterns
Use tile windows for automatic optimization
L2/Infinity Cache: Intermediate storage
Benefits from spatial and temporal locality
CK Tile’s tiling naturally improves cache hit rates
LDS: Low latency, shared within CU
64KB per CU, organized in 32 banks
CK Tile handles bank conflict avoidance
Registers: Lowest latency, per-thread storage
Up to 512 VGPRs available per wavefront
CK Tile’s compile-time optimization minimizes usage
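The LDS bank layout above can be made concrete with a small host-side model. This is a minimal sketch, not CK Tile code: it takes the 32-bank figure from the list above and assumes each bank is 4 bytes wide and that the lanes in question are serviced together. The names `lds_bank` and `column_read_conflict_degree` are hypothetical helpers introduced for illustration only.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>

// LDS geometry used by this model.
constexpr std::size_t kNumBanks = 32;       // from the text above
constexpr std::size_t kBankWidthBytes = 4;  // assumption: one 32-bit word per bank

// Bank that serves a given LDS byte address under the model above.
constexpr std::size_t lds_bank(std::size_t byte_address) {
    return (byte_address / kBankWidthBytes) % kNumBanks;
}

// Worst-case number of lanes that land in the same bank when `lanes` lanes
// each read one float from column 0 of consecutive rows of a row-major tile.
// `row_stride_elems` is the row pitch in floats, including any padding.
inline std::size_t column_read_conflict_degree(std::size_t lanes,
                                               std::size_t row_stride_elems) {
    assert(row_stride_elems > 0 && "row pitch must be positive");
    std::array<std::size_t, kNumBanks> hits{};  // zero-initialized counters
    for (std::size_t lane = 0; lane < lanes; ++lane) {
        const std::size_t byte_address = lane * row_stride_elems * sizeof(float);
        ++hits[lds_bank(byte_address)];
    }
    return *std::max_element(hits.begin(), hits.end());
}
```

Under these assumptions, `column_read_conflict_degree(32, 32)` is 32 (every lane hits bank 0), while padding the row pitch to 33 floats gives `column_read_conflict_degree(32, 33) == 1`. Padding is only one way to avoid conflicts; the sketch does not claim to show the scheme CK Tile itself applies.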
Compute Resources#
Wavefront Execution: 64 threads in lockstep
CK Tile ensures coalesced memory access
Automatic wavefront-level (warp-level) synchronization
Matrix Units: Specialized MFMA instructions
16x16x16 operations in 16 cycles
CK Tile can leverage these automatically
Occupancy: Balancing threads vs resources
Register pressure affects occupancy
CK Tile helps through efficient register use
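The link between register pressure and occupancy can be estimated with simple arithmetic. The sketch below is a simplified model under stated assumptions, not a hardware specification: it treats the 512-VGPR figure from the memory hierarchy list as a per-lane budget shared by all wavefronts resident on one SIMD, and it assumes a cap of 8 wavefronts per SIMD and an allocation granule of 8 registers. `waves_per_simd` is a hypothetical helper, and other limits (LDS usage, scalar registers) are ignored.

```cpp
#include <algorithm>
#include <cassert>

constexpr int kVgprBudgetPerLane = 512;  // from the text: up to 512 VGPRs per wavefront
constexpr int kMaxWavesPerSimd = 8;      // assumption: hardware cap on resident wavefronts
constexpr int kVgprAllocGranule = 8;     // assumption: VGPRs are allocated in blocks of 8

// Round `value` up to the next multiple of `multiple`.
constexpr int round_up(int value, int multiple) {
    return (value + multiple - 1) / multiple * multiple;
}

// Estimated number of wavefronts that can be resident on one SIMD when each
// wavefront needs `vgprs_per_wave` vector registers per lane.
inline int waves_per_simd(int vgprs_per_wave) {
    assert(vgprs_per_wave >= 1 && vgprs_per_wave <= kVgprBudgetPerLane);
    const int allocated = round_up(vgprs_per_wave, kVgprAllocGranule);
    return std::min(kVgprBudgetPerLane / allocated, kMaxWavesPerSimd);
}
```

In this model a kernel using 128 VGPRs fits 4 wavefronts per SIMD, while one extra register (129, rounded up to 136) drops that to 3. This is why a small reduction in register use near such a boundary can noticeably change occupancy.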
Performance Guidelines#
To achieve optimal performance with CK Tile:
Choose appropriate tile sizes:
Match hardware capabilities (e.g., 256x256 for GEMM)
Consider LDS capacity and register pressure
Align problem dimensions:
Make the number of workgroups a multiple of the CU count when possible (304 on MI300X)
Use padding for non-aligned sizes
Enable pipelining:
Use double buffering for latency hiding
CK Tile supports async operations
Profile and verify:
Use rocprof to check for bottlenecks
Verify bank conflict avoidance
Monitor occupancy and register usage
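The tile-size, padding, and CU-count guidelines above can be checked with back-of-the-envelope arithmetic before writing a kernel. The following is a minimal sketch under stated assumptions: LDS stages one A tile (tile_m x tile_k) and one B tile (tile_n x tile_k) per buffer, each CU runs one workgroup at a time, and the 64KB and 304-CU figures are the ones quoted in this section. `gemm_lds_bytes`, `LaunchShape`, and `plan_launch` are hypothetical names and are not part of CK Tile.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kLdsBytesPerCu = 64 * 1024;  // from the text: 64KB per CU
constexpr std::size_t kComputeUnits = 304;         // from the text: CU count for MI300

constexpr std::size_t ceil_div(std::size_t a, std::size_t b) {
    return (a + b - 1) / b;
}

// LDS bytes needed to stage one A tile and one B tile, times the number of
// buffers (2 for double buffering).
constexpr std::size_t gemm_lds_bytes(std::size_t tile_m, std::size_t tile_n,
                                     std::size_t tile_k, std::size_t elem_bytes,
                                     std::size_t num_buffers) {
    return (tile_m + tile_n) * tile_k * elem_bytes * num_buffers;
}

struct LaunchShape {
    std::size_t padded_m;    // M rounded up to a multiple of tile_m
    std::size_t padded_n;    // N rounded up to a multiple of tile_n
    std::size_t workgroups;  // one workgroup per output tile
    std::size_t rounds;      // passes over all CUs, assuming one workgroup per CU
};

// Pad an M x N output to whole tiles and count how many rounds of workgroups
// the device needs under the one-workgroup-per-CU assumption.
inline LaunchShape plan_launch(std::size_t m, std::size_t n,
                               std::size_t tile_m, std::size_t tile_n) {
    assert(tile_m > 0 && tile_n > 0);
    const std::size_t tiles_m = ceil_div(m, tile_m);
    const std::size_t tiles_n = ceil_div(n, tile_n);
    const std::size_t workgroups = tiles_m * tiles_n;
    return LaunchShape{tiles_m * tile_m, tiles_n * tile_n, workgroups,
                       ceil_div(workgroups, kComputeUnits)};
}
```

With 2-byte elements, a 256x256 tile with tile_k = 32 and double buffering needs exactly 65536 bytes, which just fits the 64KB budget, while tile_k = 64 needs 131072 bytes and does not. For an 8192x8192 output, `plan_launch` reports 1024 workgroups and 4 rounds; the last round fills only 112 of 304 CUs, which is the kind of tail effect the CU-count guideline is meant to reduce.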
Next Steps#
Review Intro to AMD CDNA Architecture for architecture fundamentals
Study Understanding AMD GPU LDS and Bank Conflicts for shared memory optimization
Explore A Block GEMM on MI300 for a complete optimization example
For practical implementation, refer back to the main CK Tile Conceptual Documentation to see how these hardware concepts integrate with CK Tile’s abstractions.