Buffer assignment design document for rocFFT#

Summary#

Buffer assignment in rocFFT is the work of coordinating the input and output buffers of each step in a rocFFT plan.

Observations#

Some observations can be made about the FFT planning and buffer assignment process:

  1. Buffer assignment begins after the plan structure is decided. This means all of the node schemes, as well as input lengths and input/output strides are known.

    Note that at this time, output lengths are not directly known.

  2. The first child of any non-leaf node must read from its parent’s input, and the last child must write to its parent’s output.

  3. The input of any other child node must be the same as the output of its preceding sibling.

    There is one exception: a CS_KERNEL_CHIRP node is only used to populate the Bluestein temporary buffer, and can essentially be ignored during buffer processing as it does not actually read input.

  4. The top-level node in the tree must read from the user-defined input buffer, and write to the user-defined output buffer. These buffers will be the same for in-place transforms.

  5. When deciding buffer assignments for a node, only the output buffer for nodes besides the last requires actual decision making. The input buffer follows either from the top-level input, a preceding sibling, or a parent node’s input. The last node’s output must be the user-defined output.

  6. During buffer assignment, some number of buffers are available. At minimum, we have the user input and output buffers (which may be the same for in-place transforms) whose sizes are defined by the user.

    Zero or more temporary buffers may be needed, whose sizes are dynamic and can be as big as necessary for the transform to succeed.

  7. Some choices of output buffers are clearly invalid. For example:

    • Transpose nodes must always be out-of-place (i.e. input buffer cannot equal output buffer).

    • Some internal kernels only support interleaved formats as their input or output. For example, the input of a copy-kernel (like COPY_CMPLX_TO_R or COPY_CMPLX_TO_HERM) must be interleaved.

    • Internal temp buffers are allocated contigously, so they can be used on both planar and interlevead formats. This is not always true for user-provided buffers. An obvious example of this planar data: users typically create these using two buffers.

    • A node cannot write data to a buffer if the buffer is too small for that data. This really only applies to user input/output buffers, as temp buffers are always made large enough.

Solution#

We implement a decision function that determines whether a buffer assignment is valid based on the observations above.

Buffer assignment should do an exhaustive search through the space of possible buffer assignments for the tree, calling the decision function for each potential choice. If we arrive at the end of the tree and all assignments are valid, then the buffer assignment operation is complete.

Returning the first valid buffer assignment found is a simple solution. However, not all valid buffer assignments are equal in terms of memory usage and/or performance: some buffer assignments allow more kernel fusions and/or use more inplace kernels. This implies that we should keep all valid assignment candidates in a list and subsequently return the “best” one.

The first pass of buffer assignment shall attempt to assign buffers starting with just the user input buffer (and output buffer, if distinct) and a Bluestein temp buffer (if the plan requires it), but without any other temp buffers.

If that first pass is not successful, we retry with one temp buffer added to the list of available buffers. If that is still not successful, we retry with a second temp buffer.

Implementation#

A Structure Storing A Try#

We store our current assignment try in a tree-like structure. We don’t assign to the tree-node directly since there could be many valid assignment paths for one plan. Once we determine the best assignment path from this tree, we fill the assignment back to the real tree-node.

struct PlacementTrace
{
    TreeNode* referedNode;
    Buffer inBuffer, outBuffer;
    ArrayType inArrayType, outArrayType;

    // each branch stands for a possible try on the next node
    vector<PlacementTrace*> branches;

    // a parent pointer to backtracing
    PlacementTrace* parent;

    // propagate these values from the root to leaves
    size_t numFusedNodes;
    size_t numInplace;
    size_t numTypwSwithching;
}

Decision Function and Output Lengths#

Much of the remaining complexity lies in the ValidOutBuffer() decision function mentioned above.

Output lengths are not always trivially knowable from input lengths in some cases:

  • Padding introduced in some plans such as L1D_TRTRT, 2D_RTRT, or REAL_2D_EVEN. This is partially alleviated by setting a “outputHasPadding” flag before the buffer assignment process.

  • R2C/C2R handling of Hermitian redundancy. Without explicit output length information, it’s not possible to make a decision on buffers. In order to handle this, we calculate the real buffer length of the root plan itself, depending on if it’s R2C or C2R, IP or OP, which gives us the explicit length of in/out buffer size.

Since the exhaustive search is a depth-first-search, so when we go back to the upper nodes and proceed to another branch, the node-vs-buffer-test in the deeper nodes could be repeated, so we can put the test results in a cache to make a slight optimization.

Fusions#

Kernel-fusion is essential for improving performance. Unfortunately fusion depends heavily on buffer assignment. Two (or more) kernels can be fused into one kernel only when the resulting buffer assignment remains valid.

To maximise kernel fusion, we also implement a FuseShim framework. A FuseShim class is a container/shell indicating that there is a potentially-fusable kernel-fusion. Each FuseShim defines its own requirements to fulfill the fusion, including the expected buffer assignment.

During the buffer assignment process, we can use the test function to get the final number of the achievable kernel fusions. This number plays the most important role when making the final decision: we always pick the one which can fuse the most kernels.

Future Work#

Strides#

Currently, rocFFT does not guarantee that strides on user buffers are respected if temporary data is written to those buffers.

With this implementation, it would be simpler to begin enforcing such a guarantee.

Enforcing Read-only Input#

rocFFT may currently overwrite user input buffers for out-of-place real-transforms (not C2C-transform). Although we’ve documented this behaviour and it is common practice in other libraries, it might still be unintuitive for some users.

If we ever wanted to start guaranteeing that user input is left unmodified, this buffer assignment implementation would make that work trivial - only the decision function needs to be made aware of this policy change, and buffer assignment will work fine.

However, we may need to introduce yet another temp buffer, since we’d be taking away a potential work space from existing plans.

Flexibility Between Minimizing Memory or Maximizing Fusions#

We can’t always expect there is a perfect assignment that maximises kernel fusions while also minimising temporary buffers. In some cases, these two goals are contradictory: if we choose an assignment using minimal buffers, we may loose the oppurtunity to fuse more kernels. On the other hand, if we are allowed to use more memory, we have more buffers available for out-of-place kernel-fusions.

With this implementation, it is possible to introduce an optimization strategy option to users.

For example, if the memory usage is the main concern of users, we can return the assignment with least buffer usage. Otherwise, we return the result which maximizes the kernel fusions regardless of the memory consumption.

Make C Buffer as Temp2 Buffer#

There is no reason to limit the “C” buffer to real-transforms only. We can make the C buffer as another generic temporary buffer throughout; this can also avoid any confusion about the purpose of C and T.