Performance tuning#
Algorithms often perform better if their launch parameters (number of blocks, block size, items per thread, etc.) are tailored for the particular architecture they are run on. rocPRIM achieves this by passing structs called configs to algorithms as template parameters. A config struct encapsulates all the information that’s needed to run a particular algorithm in the most performant way for a specific device. Default configurations (non custom-defined by users) can be selected by device code at compile time, since the architecture is known, while device algorithms detect at runtime which configuration should be used.
What we call autotuning is a method of generating the above-mentioned architecture-optimized default configurations for algorithms. The process to run the autotune is described below, as well as all the templates and scripts used.
Configure the project for autotuning. Autotune is an extension on top of the regular benchmarking process and it is enabled with a CMake option
BENCHMARK_CONFIG_TUNING
, which doubles as a C++ macro to determine whether autotuning is enabled.When the project is configured, a large amount of C++ benchmark files are generated with variation in parameters such as block size, items per thread, and method. The files are generated based on a template (
benchmark/benchmark_*.parallel.cpp.in
) and arguments defined inConfigAutotuneSettings.cmake
. CMake will automatically detect when the input template changes and will reconfigure the required files as necessary.Compile results in one executable based on all generated files for an algorithm.
Run the executable and gather the JSON output files. The generation of output files is triggered by the use of
--benchmark_out_format=json
and--benchmark_out=<output_file_name>.json
options when running the executable.Convert the benchmark results into a config with
scripts/autotune/create_optimization.py
. This python script injects the optimal configurations into the templates inscripts/autotune/templates
.
The option
--out_basedir
can be used to place the output config(s) in a specific path, otherwise the config(s) will be placed in the current directory.
If
--out_basedir rocprim/include/device/detail/config
was not used in the previous step, place the generated config(s) from the output path torocprim/include/device/detail/config
.
Device-level algorithm dependencies#
Due to the modularity of rocPRIM, some device-level algorithms depend on other device-level algorithms. The implication is that when an algorithm is changed, the performance of algorithms that depend on it must be checked as well. This also applies to configuration tuning. Below is a list of device-level algorithm dependencies and additional considerations for the tuning of these algorithms.
lower_bound
,upper_bound
, andbinary_search
depend ontransform
. However, all these algorithms are all tuned separately.merge_sort
has two stages that are tuned separately:merge_sort_block_sort
andmerge_sort_block_merge
. Since the latter algorithm depends on the sorted block size, the bestmerge_sort
configurations are obtained by tuningmerge_sort_block_sort
first, adding the configurations, and then tuningmerge_sort_block_merge
.partition_two_way
,partition
,partition_three_way
,select
,unique
, andunique_by_key
all use the same underlying implementation. However, all these algorithms have separate tuning.radix_sort
has three sub-algorithms:radix_sort_block_sort
,radix_sort_onesweep
, and a merge sort for small sizes.radix_sort_block_sort
andradix_sort_onesweep
are tuned. The threshold for radix sort that determines whether to perform merge sort or onesweep is manually set.segmented_radix_sort
depends onpartition
andpartition_three_way
but does not use the tuned configurations.segmented_reduce
does not depend onreduce
but uses the same tuned configurations.segmented_scan
does not depend onscan
but uses the same tuned configurations.run_length_encode
depends onreduce_by_key
, but does not use its tuned configurations.run_length_encode_non_trivial_runs
depends onreduce_by_key
andselect
, but does not use the tuned configurations fromreduce_by_key
andselect
.