Understanding AMD GPU LDS and Bank Conflicts#
Introduction#
Local Data Share (LDS) is AMD’s shared memory within a compute unit (see Intro to AMD CDNA Architecture for architecture details). It is organized into 32 or 64 banks depending on the hardware architecture, and each bank is 4 bytes wide. Understanding how memory addresses map to banks is key to avoiding bank conflicts.
Bank Mapping#
For the AMD GCN architecture, the LDS bank mapping is typically:
bank_index = (byte_address / 4) % num_banks
This means:
Addresses that differ by a multiple of num_banks * 4 bytes map to the same bank.
Conflicts occur when multiple threads in the same wave access the same bank in the same cycle.
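As a small illustration of this mapping, the following sketch (illustrative only, not CK code) computes the bank index of a byte address and shows that, with 32 banks, addresses 128 bytes apart alias to the same bank:
// Illustrative LDS bank mapping (4-byte banks; num_banks is 32 or 64 depending on the architecture).
constexpr unsigned bank_index(unsigned byte_address, unsigned num_banks)
{
    return (byte_address / 4) % num_banks;
}

// With 32 banks, addresses that differ by 32 * 4 = 128 bytes land in the same bank.
static_assert(bank_index(0, 32) == bank_index(128, 32), "128-byte stride aliases to the same bank");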
Not all lanes can produce bank conflicts with each other. The hardware divides a wavefront’s LDS access into phases, and which lanes fall into each phase depends on the width of the instruction. Consider ds_write_b128 as an example, since it is the widest write instruction and offers the highest performance. Here the access is divided into 8 phases for a 64-lane wavefront. If no two threads within a phase access the same bank, there is no bank conflict:
lane0~lane7
lane8~lane15
lane16~lane23
lane24~lane31
lane32~lane39
lane40~lane47
lane48~lane55
lane56~lane63
If there is no conflict within each of these lane groups, the write access is free of LDS bank conflicts.
Bank Access Patterns#
LDS bank access can be simulated for a given set of thread addresses, as sketched below. With a 32-bank LDS and 4 bytes per bank, each thread writes 8 two-byte elements (16 bytes total), consuming 4 banks. fp16 and bf16 are the most common formats GPU kernels have to deal with. With the phase access pattern described above, this layout is by default a bank conflict free LDS write.
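To make the phase rule concrete, here is a minimal host-side sketch (an illustrative simulation, not CK code) that checks a set of 64 per-lane byte addresses for a ds_write_b128-style access: lanes are split into 8 phases of 8 lanes, each lane’s 16-byte write touches 4 consecutive banks, and the access is conflict free only if no bank is touched twice within a phase.
#include <array>
#include <cstdio>

// Illustrative simulation of per-phase LDS bank conflicts (not CK code).
constexpr int num_banks       = 32;
constexpr int bytes_per_bank  = 4;
constexpr int lanes_per_phase = 8;   // ds_write_b128: 8 phases of 8 lanes on a 64-lane wavefront
constexpr int bytes_per_lane  = 16;  // 8 fp16 elements per lane

bool is_conflict_free(const std::array<unsigned, 64>& byte_address)
{
    for (int phase = 0; phase < 64 / lanes_per_phase; ++phase)
    {
        int hits[num_banks] = {};
        for (int lane = phase * lanes_per_phase; lane < (phase + 1) * lanes_per_phase; ++lane)
        {
            // Each lane's 16-byte write touches 4 consecutive banks.
            for (int b = 0; b < bytes_per_lane / bytes_per_bank; ++b)
            {
                int bank = (byte_address[lane] / bytes_per_bank + b) % num_banks;
                if (hits[bank]++) return false; // two accesses to one bank within a phase
            }
        }
    }
    return true;
}

int main()
{
    std::array<unsigned, 64> addr;
    for (int lane = 0; lane < 64; ++lane) addr[lane] = lane * bytes_per_lane; // contiguous layout
    std::printf("conflict free: %s\n", is_conflict_free(addr) ? "yes" : "no");
}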
Write Access Pattern#
For LDS write instructions like ds_write_b128, the hardware provides conflict-free access when threads write to consecutive addresses. Each phase of 8 lanes writes to different banks, avoiding conflicts.
Read Access Pattern#
Similarly, for the LDS read instruction ds_read_b128, when there is no bank conflict within any of these 8-lane groups:
0:3+20:23
4:7+16:19
8:11+28:31
12:15+24:27
32:35+52:55
36:39+48:51
40:43+60:63
44:47+56:59
then it’s bank conflict-free for LDS reading.
The reason for accessing the data vertically is that in most cases the data read from LDS feeds an MFMA instruction in the next step, and MFMA operands are required to be accessed vertically, as shown above.
A typical LDS read access pattern in machine learning workloads can generate 4-way bank conflicts in every phase of access. You can experiment with row_padding (padding expressed as a number of banks) to see whether the problem can be solved this way, but remember that in practice this requires additional LDS storage. The bigger the padding, the more additional storage is necessary.
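Before turning to XOR, here is a minimal sketch of the row-padding alternative (the names PadBanks and RowStrideElems are illustrative, not CK identifiers): padding each row by a few banks makes successive rows start in different banks, which breaks up the vertical access conflicts at the cost of extra LDS storage.
// Illustrative row-padding layout for fp16 data (names are not CK identifiers).
using index_t = int;

constexpr index_t KPerBlock      = 64;                   // fp16 elements per logical row
constexpr index_t PadBanks       = 4;                    // extra padding per row, in 4-byte banks
constexpr index_t PadElems       = PadBanks * 4 / 2;     // banks -> fp16 elements (2 bytes each)
constexpr index_t RowStrideElems = KPerBlock + PadElems; // padded row stride in elements

// Element offset of (row, col) in the padded LDS layout; successive rows start in
// different banks, which breaks up column-wise (vertical) conflicts.
constexpr index_t padded_offset(index_t row, index_t col)
{
    return row * RowStrideElems + col;
}
Padding trades LDS capacity for conflict avoidance; the XOR preshuffle described next achieves the same goal without the extra storage.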
XOR Preshuffle: An Alternative to Padding#
Another technique to reduce LDS bank conflicts is XOR preshuffling (see Load Data Share Index Swapping for a detailed implementation). Instead of adding padding between rows, we can permute the column indices within each row using XOR. This method can help avoid bank conflicts without allocating extra storage in LDS.
For a wavefront of 64 threads, if each thread writes a vector of 8 fp16 elements (16 bytes), and the row size is 64 elements, the column index for each element in a row is adjusted as follows:
KTypeSize = 2
KPerBlock = 64 // 64 elements per row
KPack = 8      // 8 elements per thread
The adjusted column position for element (x, y) is:
\[ y' = y \oplus \left( x \bmod \tfrac{\text{KPerBlock}}{\text{KPack}} \right) \]
where \(\oplus\) is the bitwise XOR, and \(x, y\) are the original row and column positions of a vector element with respect to the LDS banks.
C++ Implementation#
Here’s how CK implements XOR preshuffling:
// XOR-based column index adjustment
template <index_t KPerBlock, index_t KPack>
__device__ constexpr index_t xor_preshuffle(index_t row, index_t col)
{
    constexpr index_t num_cols = KPerBlock / KPack;
    return (row % num_cols) ^ col;
}

// LDS write with XOR preshuffle
template <typename DataType, index_t RowStride>
__device__ void lds_write_with_xor(DataType* lds_ptr,
                                   const DataType* src,
                                   index_t row,
                                   index_t col)
{
    // Apply XOR preshuffle to column index
    index_t col_xor = xor_preshuffle<64, 8>(row, col);

    // Write to LDS with adjusted column
    index_t offset = row * RowStride + col_xor * 8;

    // Vectorized write (assuming 128-bit write)
    *reinterpret_cast<float4*>(lds_ptr + offset) =
        *reinterpret_cast<const float4*>(src);
}

// LDS read with XOR preshuffle
template <typename DataType, index_t RowStride>
__device__ void lds_read_with_xor(DataType* dst,
                                  const DataType* lds_ptr,
                                  index_t row,
                                  index_t col)
{
    // Apply same XOR preshuffle for read
    index_t col_xor = xor_preshuffle<64, 8>(row, col);

    // Read from LDS with adjusted column
    index_t offset = row * RowStride + col_xor * 8;

    // Vectorized read
    *reinterpret_cast<float4*>(dst) =
        *reinterpret_cast<const float4*>(lds_ptr + offset);
}
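As a usage sketch (the lane-to-(row, col) mapping and the function below are illustrative assumptions, not CK code), each thread can derive its row and column from its lane ID and call the helpers above; because the same xor_preshuffle is applied on the write and the read path, every thread reads back the vector it wrote:
// Illustrative per-thread round trip through LDS using the helpers above (assumed mapping, not CK code).
template <typename DataType>
__device__ void stage_vector_through_lds(DataType* lds_ptr, const DataType* src_regs, DataType* dst_regs)
{
    constexpr index_t KPerBlock = 64; // elements per LDS row
    constexpr index_t KPack     = 8;  // elements per thread vector (16 bytes of fp16)

    const index_t lane = threadIdx.x % 64;            // lane within the 64-wide wavefront
    const index_t row  = lane / (KPerBlock / KPack);  // 8 vectors of KPack elements per row
    const index_t col  = lane % (KPerBlock / KPack);

    // src_regs / dst_regs hold the thread's 8-element vector (assumed 16-byte aligned).
    lds_write_with_xor<DataType, KPerBlock>(lds_ptr, src_regs, row, col);
    __syncthreads(); // make the staged tile visible to all threads
    lds_read_with_xor<DataType, KPerBlock>(dst_regs, lds_ptr, row, col);
}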
Integration with CK Tile#
CK Tile handles LDS bank conflict avoidance through its abstractions:
TileWindow (Tile Window - Data Access Gateway): Automatically applies XOR preshuffling when loading/storing to LDS
StaticDistributedTensor (Static Distributed Tensor): Manages LDS allocation with proper alignment
LoadStoreTraits (LoadStoreTraits - Memory Access Optimization Engine): Selects optimal access patterns to minimize conflicts
Example usage in CK Tile:
// CK Tile automatically handles bank conflict avoidance
template <typename TileDistribution>
__device__ void gemm_kernel()
{
    // Create tile window with automatic XOR preshuffle
    auto a_window = make_tile_window(
        a_tensor_view,
        tile_size,
        origin,
        tile_distribution);

    // Load to LDS - XOR preshuffle applied automatically
    auto a_lds_tensor = make_static_distributed_tensor<
        element_type,
        decltype(tile_distribution)>();

    a_window.load(a_lds_tensor);

    // Subsequent reads from LDS are conflict-free
    // See :ref:`ck_tile_sweep_tile` for sweep operations
    sweep_tile(a_lds_tensor, [](auto idx, auto& val) {
        // Process data...
    });
}
Performance Impact#
Proper LDS bank conflict avoidance can have a significant performance impact:
4-way conflicts: Can reduce effective LDS bandwidth by 75%
XOR preshuffle: Restores full bandwidth with zero storage overhead
Padding: Also effective but requires 12.5-25% more LDS storage
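To make the padding numbers concrete: a 64-element fp16 row occupies 128 bytes (32 banks), so padding each row by 4 banks adds 8 elements (12.5% more storage), while padding by 8 banks adds 16 elements (25% more storage).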
Best Practices#
Use CK Tile abstractions: They automatically handle bank conflict avoidance
Prefer XOR preshuffle: No storage overhead compared to padding
Verify with profiling: Use rocprof to check for LDS bank conflicts (see the example after this list)
Consider access patterns: Design algorithms with bank-friendly patterns
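For the profiling step, one minimal approach (counter names vary by architecture and ROCm version, so treat SQ_LDS_BANK_CONFLICT as an assumption and verify it against rocprof --list-basic on your system) is to put a line such as "pmc : SQ_LDS_BANK_CONFLICT" into a counter file and run rocprof -i counters.txt ./your_app; a non-zero conflict count in the resulting CSV points to contended LDS accesses.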
By understanding LDS bank conflicts and using CK Tile’s automatic conflict avoidance mechanisms, developers can achieve optimal shared memory performance without manual optimization.