CK Tile buffer view

CK Tile buffer view#

Buffer view is an abstraction that provides structured access to memory. The buffer_view class is exposed in include/ck_tile/core/tensor/buffer_view.hpp.

Buffer view serves as the foundation for Tensor Views - Multi-Dimensional Structure. BufferView handles memory addressing and type safety, while TensorView builds upon this to add multi-dimensional coordinates (shape and strides).

Buffer view provides the following advantages:

A unified interface across global, shared, and register memory
Address spaces encoded in types, taking advantage of compile-time type checking
Configurable handling of invalid values, out-of-bounds operations, and conditional access patterns
Atomic operations for parallel algorithms
AMD GPU-specific optimizations
Automatic application of appropriate memory ordering constraints and cache control directives based on the target address space and operation type

[TO DO: do we want to say more about these items? There wasn’t a lot of detail in the original text, so I put them in a list for now]

Address Space Usage Patterns#

[TO DO: explain in words what the diagram shows] ..

Original mermaid diagram (edit here, then run update_diagrams.py)

Basic Creation#

[TO DO: remove “modern C++ template metaprogramming” and “zero-overhead abstraction”]

[TO DO: might want to move the implementation details to a separate section under “reference”]

#include <ck_tile/core/tensor/buffer_view.hpp>
#include <ck_tile/core/numeric/integral_constant.hpp>

// Create buffer view in C++
__device__ void example_buffer_creation()
{
    // Static array in global memory
    float data[8] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f};
    constexpr index_t buffer_size = 8;

// Create buffer view for global memory
// Template parameters: <AddressSpace>
auto buffer_view = make_buffer_view<address_space_enum::global>(
    data,        // pointer to data
    buffer_size  // number of elements
);


// Implementation detail: The actual C++ template is:
// template <address_space_enum BufferAddressSpace,
//           typename T,
//           typename BufferSizeType,
//           bool InvalidElementUseNumericalZeroValue = true,
//           amd_buffer_coherence_enum Coherence = amd_buffer_coherence_enum::coherence_default>
// struct buffer_view

    // Alternative: Create with explicit type
    using buffer_t = buffer_view<float*, address_space_enum::global>;
    buffer_t explicit_buffer{data, number<buffer_size>{}};

    // Access properties at compile time
    constexpr auto size = buffer_view.get_buffer_size();
    constexpr auto space = buffer_view.get_address_space();

    // The buffer_view type encodes:
    // - Data type (float)
    // - Address space (global memory)
    // - Size (known at compile time for optimization)
    static_assert(size == 8, "Buffer size should be 8");
    static_assert(space == address_space_enum::global, "Should be global memory");
}

[TO DO: add details and remove unnecessary comments; the “implementation detail” comment can be moved out and either placed outside and explained further, or just removed, depending on what we want to do]

[TO DO: might want to put this implementation detail in the reference section]

Buffer view uses two modes, zero value mode and custom value mode, that can prevent serialization during bounds checking.

Zero value mode returns zero without branching when an access falls outside the valid buffer range. This is useful in convolutions where out-of-bounds accesses correspond to zero-padding.

Custom value mode returns a custom value without branching when an access falls outside the valid buffer range. Custom value mode accommodates algorithms that require specific values for boundary conditions.

[TO DO: there were two examples of custom value mode that I removed. I removed them because unlike for zero value mode where the example was convolution, the example was vague in custom value. Is there a more specific example of where custom value would be used?]

// Basic buffer view creation with automatic zero for invalid elements
void basic_creation_example() {
    // Create data array
    constexpr size_t buffer_size = 8;
    float data[buffer_size] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f};

    // Create global memory buffer view
    auto buffer_view = make_buffer_view<address_space_enum::global>(data, buffer_size);
}

// Custom invalid value mode
void custom_invalid_value_example() {
    constexpr size_t buffer_size = 8;
    float data[buffer_size] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f};
    float custom_invalid = 13.0f;

    // Create buffer view with custom invalid value
    auto buffer_view = make_buffer_view<address_space_enum::global>(
        data, buffer_size, custom_invalid);
}

When InvalidElementUseNumericalZeroValue is set to true, the system uses zero value mode for out of bounds checking. When InvalidElementUseNumericalZeroValue is set to false, custom value mode is used. Zero value mode is used by default.

Note

Zero or custom invalid value is only returned for complete invalid values or out of bound access, for example when the first address of the vector is invalid. Partial out of bounds access during vector reads will not return useful results.

// Create data array
constexpr size_t buffer_size = 8;
float data[buffer_size] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f};
float custom_invalid = 13.0f;

// Create global memory buffer view with zero invalid value mode (default)
auto buffer_view = make_buffer_view<address_space_enum::global>(data, buffer_size, custom_invalid);

// Invalid element access with is_valid_element=false
// Returns custom_invalid due to custom invalid value mode
auto invalid_value = buffer_view.template get<float>(0, 0, false);
printf("Invalid element: %.1f\n", invalid_value.get(0));

// Out of bounds access - AMD buffer addressing handles bounds checking
// Will return custom_invalid when accessing beyond buffer_size
auto oob_value = buffer_view.template get<float>(0, 100, true);
printf("Out of bounds: %.1f\n", oob_value.get(0));

Get Operations#

[TO DO: might want to put this implementation detail in the reference section]

The signature for the buffer_view get() takes four parameters:

i: the primary offset into the buffer expressed in terms of elements of type T rather than raw bytes.

linear_offset: [TO DO: what is this?]

is_valid_element: [TO DO: what is this?]

[TO DO: the last param, that’s the out of bounds handling, yes? .. code:: cpp

get(index_t i,
index_t linear_offset, bool is_valid_element, bool_constant<oob_conditional_check> = {})

[TO DO: need some context around the code]

[TO DO: code chunks need to have detail and explanation so that the reader can see what they’re trying to demonstrate.]

 // Create buffer view
 float data[8] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f};
 auto buffer_view = make_buffer_view<address_space_enum::global>(data, 8);

 // Simple get - compile-time bounds checking when possible
 auto value_buf = buffer_view.template get<float>(0,1,true); //get the buffer from the buffer view
 float value = value_buf.get(0); //get the value from the buffer

    // Get with valid flag - branchless conditional access
    bool valid_flag = false;
    value_buf = buffer_view.template get<float>(0,1,valid_flag);
    value = value_buf.get(0);
    // Returns 0 valid_flag is false

    // vectorized get
    using float2 = ext_vector_t<float, 2>;
    auto vector_buf = buffer_view.template get<float2>(0, 0, true);
    // Loads 2 floats in a single instruction
    float val1 = vector_buf.get(0);
    float val2 = vector_buf.get(1);
}

ext_vector_t<float, N> enables compile-time selection of optimal load and store instructions that can transfer multiple data elements in a single memory transaction.

[TO DO: what is it actually doing? When does one use scalars vs vectors? Is it application specific or are there ]

Understanding BufferView Indexing#

[TO DO: an explanation of the diagram is needed]

Update Operations#

Update operations modify the buffer content. The set() method writes a value to a specific location.

void scalar_set_operations_example() {

    // Create data array
    constexpr size_t buffer_size = 8;
    float data[buffer_size] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f};

    // Create global memory buffer view
    auto buffer_view = make_buffer_view<address_space_enum::global>(data, buffer_size);

    // Basic set: set<T>(i, linear_offset, is_valid_element, value)
    // Sets element at position i + linear_offset = 0 + 2 = 2
    buffer_view.template set<float>(0, 2, true, 99.0f);

    // Invalid write with is_valid_element=false (ignored)
    buffer_view.template set<float>(0, 3, false, 777.0f);

    // Out of bounds write - handled safely by AMD buffer addressing
    buffer_view.template set<float>(0, 100, true, 555.0f);

    // Vector set
    using float2 = ext_vector_t<float, 2>;
    float2 pair_values{100.0f, 200.0f};
    buffer_view.template set<float2>(0, 5, true, pair_values);
}

Atomic Operations#

[TO DO: this needs information]

Atomic vs Non-Atomic Operations#

C++ Atomic Operations#

__device__ void example_atomic_operations()
{
    // Shared memory for workgroup-level reductions
    __shared__ float shared_sum[256];
    auto shared_buffer_view = make_buffer_view<address_space_enum::lds>(
        shared_sum, 256
    );

    // Initialize shared memory
    if (threadIdx.x < 256) {
        shared_buffer_view.template set<float>(threadIdx.x, 0.0f, true);
    }
    __syncthreads();

    // Each thread atomically adds to shared memory
    auto my_value = static_cast<float>(threadIdx.x);
    shared_buffer_view.template update<memory_operation_enum::atomic_add, float>(0,0,true,my_value);

    // Atomic max for finding maximum value
    shared_buffer_view.template update<memory_operation_enum::atomic_max, float>(0,1,true,my_value);

    __syncthreads();
}