Performance tuning#
Algorithms often perform better if their launch parameters (number of blocks, block size, items per thread, etc.) are tailored for the particular architecture they are run on. rocPRIM achieves this by passing structs called configs to algorithms as template parameters. A config struct encapsulates all the information that’s needed to run a particular algorithm in the most performant way for a specific device. Default configurations (non custom-defined by users) can be selected by device code at compile time, since the architecture is known, while device algorithms detect at runtime which configuration should be used.
What we call autotuning is a method of generating the above-mentioned architecture-optimized default configurations for algorithms. The process to run the autotune is described below, as well as all the templates and scripts used.
Configure the project for autotuning. Autotune is an extension on top of the regular benchmarking process and it is enabled with a CMake option
BENCHMARK_CONFIG_TUNING, which doubles as a C++ macro to determine whether autotuning is enabled.When the project is configured, a large amount of C++ benchmark files are generated with variation in parameters such as block size, items per thread, and method. The files are generated based on a template (
benchmark/benchmark_*.parallel.cpp.in) and arguments defined inConfigAutotuneSettings.cmake. CMake will automatically detect when the input template changes and will reconfigure the required files as necessary.Compile results in one executable based on all generated files for an algorithm.
Run the executable and gather the JSON output files. The generation of output files is triggered by the use of
--benchmark_out_format=jsonand--benchmark_out=<output_file_name>.jsonoptions when running the executable.Convert the benchmark results into a config with
scripts/autotune/create_optimization.py. This python script injects the optimal configurations into the templates inscripts/autotune/templates.
The option
--out_basedircan be used to place the output config(s) in a specific path, otherwise the config(s) will be placed in the current directory.
If
--out_basedir rocprim/include/device/detail/configwas not used in the previous step, place the generated config(s) from the output path torocprim/include/device/detail/config.
Device-level algorithm dependencies#
Due to the modularity of rocPRIM, some device-level algorithms depend on other device-level algorithms. The implication is that when an algorithm is changed, the performance of algorithms that depend on it must be checked as well. This also applies to configuration tuning. Below is a list of device-level algorithm dependencies and additional considerations for the tuning of these algorithms.
lower_bound,upper_bound, andbinary_searchdepend ontransform. However, all these algorithms are all tuned separately.merge_sorthas two stages that are tuned separately:merge_sort_block_sortandmerge_sort_block_merge. Since the latter algorithm depends on the sorted block size, the bestmerge_sortconfigurations are obtained by tuningmerge_sort_block_sortfirst, adding the configurations, and then tuningmerge_sort_block_merge.partition_two_way,partition,partition_three_way,select,unique, andunique_by_keyall use the same underlying implementation. However, all these algorithms have separate tuning.radix_sorthas three sub-algorithms:radix_sort_block_sort,radix_sort_onesweep, and a merge sort for small sizes.radix_sort_block_sortandradix_sort_onesweepare tuned. The threshold for radix sort that determines whether to perform merge sort or onesweep is manually set.segmented_radix_sortdepends onpartitionandpartition_three_waybut does not use the tuned configurations.segmented_reducedoes not depend onreducebut uses the same tuned configurations.segmented_scandoes not depend onscanbut uses the same tuned configurations.run_length_encodedepends onreduce_by_key, but does not use its tuned configurations.run_length_encode_non_trivial_runsdepends onreduce_by_keyandselect, but does not use the tuned configurations fromreduce_by_keyandselect.