Runtime Compilation Design Document for rocFFT#

Summary#

This document describes runtime compilation (RTC) as it is used in rocFFT. Runtime compilation helps reduce binary size and build times for the library, and can allow for optimizations that are not practical versus ahead-of-time compiled kernels.

Problem#

Stockham FFT kernels make up the vast majority of the rocFFT library. Kernels handling specific problem sizes are chosen as part of the rocFFT build process, and are compiled for all of the variants that might be required at runtime.

The count of variants for each problem size has a number of stacking multipliers applied to it. Any given problem size needs variants for:

  • Each supported GPU architecture;

  • Six interleaved/planar and in-place/out-of-place variants;

  • Forward and inverse transforms;

  • At least two precisions (single/double);

  • Unit- and non-unit- strides;

  • With and without callback support.

Runtime compilation has advantages over pre-compiling all of the above variants for all problem sizes. These include:

  • Handling of new problem sizes does not require rebuilding the library.

  • Build times are faster.

  • The library binary is smaller. This in turn means: installation is faster; and difficulties arising from limited memory addressing in shared objects (the default memory model for shared objects built with -fPIC only allows for 2 GiB binaries, resulting in build breaks) are reduced.

Solution#

HIP provides runtime compilation facilities, in the hiprtc.h header.

The code generator is embedded into the library itself. During FFT planning, we can run the code generator to produce suitable FFT source code for the specific problem being solved. Then, we can runtime-compile that source into GPU kernels that we launch at FFT execution time.

Empirical testing of runtime compilation on our FFT kernels shows that compilation times for single kernels range between 0.5 and 2 seconds on modern hardware, with more complicated kernels taking more time. Kernel execution time is identical to the ahead-of-time compiled version.

Implementation#

Embedding and running the generator#

A generator implemented in C++ can be built into the library like any other C++ code that makes up the library.

During plan building, we can execute the generator code to produce a string of source code for the required problem sizes.

The generator needs sufficient input to have it produce exactly the variant that is required, e.g. length-336, inverse, out-of-place, interleaved-to-planar, double precision, etc.

Compilation#

Compilation should be done during plan building, and the generated kernels can be attached directly to the TreeNode for that step of the FFT.

If the kernels are available on the TreeNode, we will have less overhead at execution time, since we don’t need to do any work to find the right kernel to run.

Caching kernels at runtime#

If a process needs to create multiple plans that would compile the same FFT kernel variant, it’s nice to reuse kernels we’ve already compiled. Reusing an already-compiled kernel would save compilation time on subsequent runs.

Compiled kernels may be persisted to disk for maximal reuse. However, rocFFT may be used in distributed systems where the filesystem is shared among multiple compute nodes, and having multiple nodes all contend for the same shared file is problematic for performance.

By default, kernels are only cached in memory, to prevent this contention. The cache location may still be overridden at runtime using mechanisms described below.

Cache keys#

The cache keys need to be chosen carefully to ensure that an obsolete kernel is not reused when a new version really should be recompiled.

The kernel function name shall be the main key field in the cache. The function name shall encode all of the parameters by which kernel functions could differ, including:

  • scheme (e.g. whether this is a standard Stockham kernel, or a variant that does different read/write tiling)

  • length (typically 1D length, may be 2D or more for single kernels that handle higher dimension transforms)

  • placement (in-place or out-of-place)

  • direction (forward or backward)

  • input and output formats (planar, interleaved, real)

  • precision (single or double)

  • stride type (unit stride or non-unit stride)

  • large 1D twiddle type and base (for kernels that need to integrate a twiddle multiplication step)

  • callback type (run callbacks or don’t run callbacks)

Encoding all of these parameters into the kernel name is necessary anyways, so that logs and profilers will tell users and developers exactly which kernel is running, even if it’s been runtime-compiled.

Using just the kernel name as the main key is also helpful because the caching code needn’t be aware of all the possible parameters that kernels could differ by. New parameters can be added at anytime, and as long as the kernel names are updated accordingly, the cache will just work.

The cache will also need to store other key fields to ensure that a kernel is compiled if any of these changes:

  • GPU architecture

  • HIP runtime version

  • Kernel generator version

Practically, these key field choices will ensure that users are always running the latest kernels that rocFFT provides and which are appropriate for the hardware present.

User control of cache#

Distributed workflows will want additional control over the cache. For example, a workload that distributes FFT computation over a large number of MPI nodes will want to ensure that the kernels are built once centrally rather than by each node.

MPI nodes might also have no access to disk (either shared with other nodes or local to each node).

rocFFT needs to expose APIs to:

  • Serialize the current cache to a library-allocated buffer

  • Free the library-allocated serialization buffer

  • Deserialize a buffer into a cache (which might need to be in-memory for diskless nodes)

The example MPI computation described above would be able to build plans on the rank 0 node to populate the cache once. Then, it can use these new APIs along with MPI APIs to distribute the cache to each work node.

Backing store implementation#

The cache may be written to disk, and if so it must be robust in the face of concurrent access, crashes during library operation, and so on.

We really would like the cache to have ACID properties of database systems.

The easiest way to achieve this is to use SQLite to manage the storage. It’s easily embeddable in our library (or is readily available as its own library), and provides all of the properties we’d want for the storage backend.

It also provides APIs to serialize a database, as required for the distributed workflows described above.

Pre-built kernels#

Even if rocFFT is prepared to runtime-compile any FFT kernel, we can still pre-compile kernels by populating a cache at library build time and shipping the cache with the library.

Cache location#

The main challenge here is installing this pre-built cache in a place that the library will be able to find.

The easiest solution here, as employed by other math libraries is to look for this the cache file relative to the shared library itself.

Environment variables can override the locations of caches used by rocFFT. During normal operation, we would expect one read-only cache shipped with the library and one modifiable cache updated as the user runs transforms that use new kernels.

We support two environment variables for these two locations:

  • ROCFFT_RTC_SYS_CACHE_PATH - the pre-built read-only system-level cache.

  • ROCFFT_RTC_CACHE_PATH - the read-write user-level cache.

Note that if the library is linked statically, we will not be able to find any files relative to the library. The ROCFFT_RTC_SYS_CACHE_PATH environment variable will then be required for rocFFT to find the system-level cache, but rocFFT will still update the user-level cache and have correct behaviour without a system-level cache.

Populating the cache#

Populating this shipped cache is done via a helper executable that is built and run during the rocFFT build. A separate helper executable (which is not itself shipped with rocFFT) is necessary so that it can share rocFFT’s generator and RTC code, without requiring rocFFT to expose extra symbols just for this task.

This helper should work at the kernel level, e.g. build Stockham kernels for all desired combinations of:

  • supported architectures (gfx908, gfx90a, gfx1030, etc.)

  • precisions

  • problem sizes

  • array formats

  • etc.

The criteria for which kernels to pre-build can be arbitrary. Less common choices will be runtime-compiled, and runtime compilation is still a fallback in case a pre-built kernel is not available for whatever reason.

An inferior option would be for the helper to work at the plan level (i.e. use rocFFT to build a set of plans and save the resulting RTC kernels). However, creating plans involves doing a lot of other unnecessary work, like generating twiddle tables and deciding on buffer assignment.

Impact on tests#

Accuracy tests are maximally affected in terms of runtime by this change, since they run a huge number of problem sizes in the context of a single process. That means the costs of generating and compiling a large variety of kernel variants will be the most painful here, once more problem sizes are handled by the new generator.

An increase in test runtime is an unfortunate side effect of runtime compilation. This cost is made more acceptable because the compile time of the library has already been reduced prior to running the tests.

A possible solution here might be to do a parallel traversal of the test cases, building rocFFT plans for each of them (but not actually executing plans). This would runtime-compile the whole suite’s kernels in parallel, which would save a lot of time.

Interaction with callbacks#

Callback-enabled FFTs require a different kernel variant to be generated, but the decision of whether to actually run with a callback is made by the user after the plan is constructed.

To solve this, we generate both a callback and non-callback variant where necessary during plan creation.

Parallel compilation#

Because of the potential need for callback-enabled kernels, most plans will be generated faster if kernels can be compiled in parallel. Unfortunately, hipRTC has process-wide locks in it that prevent useful multithreading of compilation.

Instead, we can spawn a helper process for subsequent compilations if a compilation is already in-progress in the original process. This helper would need to be shipped with the library, in a location that’s knowable by the library. If we fail to find or spawn that helper, compilation must fall back to compiling in-process.

Code organization#

The whole of rocFFT runtime compilation can be broken down into separate subsystems:

  1. Generating source to be compiled, further subdivided into generators for each type of kernel (Stockham, transpose, Bluestein, etc). Input specifications of the desired kernel include problem size, precision, result placement, and so on.

    Files to implement this are named:

    • rtc_stockham_gen.cpp

    • rtc_transpose_gen.cpp

    • etc.

  2. Compiling source code into object code, which can be further subdivided:

    1. Compiling code in the current process

    2. Compiling code in a subprocess

    The files to implement these are named:

    rtc_compile.cpp rtc_subprocess.cpp

  3. Reading/writing the cache of compiled object code.

    The file to implement this is named:

    rtc_cache.cpp

  4. Compiling and launching the correct kernel for a TreeNode in an FFT plan. This subsystem would need to derive the correct input specifications for the generator, given the data in the TreeNode. It would also need to derive the correct launch arguments to pass to the kernel.

    Files to implement this are named:

    • rtc_stockham_kernel.cpp

    • rtc_transpose_kernel.cpp

    • etc.

    These files are named rtc_*_kernel.cpp because they implement subclasses of the generic RTCKernel type.

In this list, 1 and 2 are independent. 2b depends on 2a. 3 depends on 1 and 2. 4 depends on 3. 2a requires the hipRTC library, 3 requires the SQLite library, and 4 requires the full HIP runtime library (amdhip64).

Build-time processes that populate a cache to ship with the library depend on 3. The helper process to support parallel compilation depends on 2a.

It’s important to avoid using the full HIP runtime at build time - Windows build environments in particular may not have the sufficient libraries or infrastructure to successfully load the full runtime, but they are able to load hipRTC.

Future work#

Moving away from chosen problem sizes#

Once the infrastructure is in place, we could consider enabling runtime compilation for all FFT sizes, not just those that are chosen ahead of time. The generator is already able to auto-factorize arbitrary sizes, though we haven’t yet tested the limits of this ability.