ROCm Communication Collectives Library (RCCL) test configuration files

ROCm Communication Collectives Library (RCCL) test configuration files#

2025-11-13

7 min read time

Applies to Linux

These RCCL tests are comprehensive benchmarks that validate distributed GPU communication performance across AMD GPU clusters. These tests ensure optimal performance for AI training, HPC workloads, and distributed computing.

`rccl_config.json`#

Here’s a code snippet of the rccl_config.json file for reference:

rccl_config.json

{
  "rccl":
  {
    "no_of_nodes": "2",
    "no_of_global_ranks": "16",
    "no_of_local_ranks": "8",
    "ranks_per_node": "8",
    "rccl_dir": "/root/cache/INSTALL/rccl-tests",
    "rccl_tests_dir": "/root/cache/INSTALL/rccl-tests/build",
    "mpi_dir": "/root/cache/INSTALL/ompi-4.1.6/install/bin",
    "mpi_path_var": "/root/cache/INSTALL/ompi-4.1.6/install",
    "rocm_path_var": "/opt/rocm-7.0.2/",
    "rccl_path_var": "/root/cache/INSTALL/rccl-tests",
    "cluster_snapshot_debug": "False",
    "env_source_script": "None",
    "rccl_collective": [ "all_reduce_perf", "all_gather_perf", "scatter_perf", "gather_perf", "reduce_scatter_perf", "sendrecv_perf", "alltoall_perf", "alltoallv_perf", "reduce_scatter_perf", "broadcast_perf" ],
    "rccl_algo": [ "ring" ],
    "rccl_protocol": [ "simple" ],
    "qp_scale": [ "1" ],
    "ib_hca_list": "bnxt_re0,bnxt_re1,bnxt_re2,bnxt_re3,bnxt_re4,bnxt_re5,bnxt_re6,bnxt_re7",
    "net_dev_list": "ens11np0,ens12np0,ens21np0,ens22np0,ens31np0,ens32np0,ens41np0,ens42np0",
    "oob_port": "ens51f0np0",
    "gid_index": "1",
    "qp_count": "1",
    "start_msg_size": "1024",
    "end_msg_size": "16g",
    "step_function": "2",
    "threads_per_gpu": "1",
    "warmup_iterations": "10",
    "no_of_iterations": "1",
    "check_iteration_count": "1",
    "nccl_ib_timeout": "30",
    "ib_rx_queue_len": "8192",
    "ucx_tls": "tcp",
    "hcoll_enable_mcast_all": "0",
    "nccl_cumem_enable": "0",
    "nccl_ib_sl": "0",
    "nccl_ib_tc": "0",
    "nccl_ib_split_data_on_qps": "0",
    "nccl_pxn_disable": "0",
    "nccl_net_plugin": "none",
    "verify_bus_bw": "False",
    "debug_level": "ERROR",
    "rccl_result_file": "/tmp/rccl_result_file.json",
    "_comments_results": "expected results below are for 2 node cluster, will vary based on cluster size",
    "results":
    {
        "all_reduce_perf":
        {
            "bus_bw":
            {
                "8589934592": "330.00",
                "17179869184": "350.00"
            }
        },
        "all_gather_perf":
        {
            "bus_bw":
            {
                "8589934592": "330.00",
                "17179869184": "350.00"
            }
        },
        "gather_perf":
        {
            "bus_bw":
            {
                "8589934592": "330.00",
                "17179869184": "350.00"
            }
        },
        "scatter_perf":
        {
            "bus_bw":
            {
                "8589934592": "330.0",
                "17179869184": "350.0"
            }
        },
        "reduce_perf":
        {
            "bus_bw":
            {
                "8589934592": "300.0",
                "17179869184": "310.00"
            }
        },
        "reduce_scatter_perf":
        {
            "bus_bw":
            {
                "8589934592": "340.0",
                "17179869184": "360.00"
            }
        },
        "alltoall_perf":
        {
            "bus_bw":
            {
                "8589934592": "45.00",
                "17179869184": "50.00"
            }
        },
        "alltoallv_perf":
        {
            "bus_bw":
            {
                "8589934592": "45.00",
                "17179869184": "50.00"
            }
        },
        "sendrecv_perf":
        {
            "bus_bw":
            {
                "8589934592": "47.00",
                "17179869184": "48.00"
            }
        },
        "broadcast_perf":
        {
            "bus_bw":
            {
                "8589934592": "310.00",
                "17179869184": "312.00"
            }
        }
      }
  }

}

Parameters#

Here’s an exhaustive list of the available parameters in the rccl_config.json RCCL configuration file:

Configuration parameters	Default values	Description
`no_of_nodes`	2	Number of nodes in a cluster
`no_of_global_ranks`	16	Total MPI ranks across the entire cluster
`no_of_local_ranks`	8	MPI ranks running on each individual node
`ranks_per_node`	8	Confirms GPUs per node
`rccl_dir`	`/opt/rccl-tests/`	Directory where RCCL is installed
`rccl_tests_dir`	`/opt/rccl-tests/build`	Directory where RCCL tests are installed
`mpi_dir`	`/usr/bin`	Directory where mpi bin is located
`mpi_path_var`	`/usr`	Directory where mpi is located
`rocm_path_var`	`/opt/rocm-7.0.2/`	Path to ROCm installation
`rccl_path_var`	`/opt/rccl-tests/`	Directory where RCCL tests are located
`cluster_snapshot_debug`	False	Enables/disables cluster debug snapshot
`env_source_script`	None	Path to environment setup script
`rccl_collective`	Values: `all_reduce_perf` `all_gather_perf` `scatter_perf` `gather_perf` `reduce_scatter_perf` `sendrecv_perf` `alltoall_perf` `alltoallv_perf` `reduce_scatter_perf` `broadcast_perf`	RCCL tests list
`rccl_algo`	Values `ring` `tree`	Communication algorithms
`rccl_protocol`	Simple	Communication protocol
`qp_scale`	1, 2	Queue pair scaling factors
`ib_hca_list`	Values: `bnxt_re0` `bnxt_re1` `bnxt_re2` `bnxt_re3` `bnxt_re4` `bnxt_re5` `bnxt_re6` `bnxt_re7`	List of IB hosts
`net_dev_list`	Values: `ens28np0` `ens27np0` `ens25np0` `ens26np0` `ens24np0` `ens23np0` `ens21np0` `ens22np0`	List of network hosts
`oob_port`	eth0	Out of band port
`gid_index`	1	Global ID index for InfiniBand
`start_msg_size`	1024	Start with 1KB messages
`end_msg_size`	16g	End with 16GB messages
`step_function`	2	Double message size each step
`threads_per_gpu`	1	One thread per GPU
`warmup_iterations`	10	Warmup runs
`no_of_iterations`	1	Number of iterations to run the RCCL tests
`check_iteration_count`	1	Verification iteration
`nccl_ib_timeout`	30	InfiniBand timeout
`ib_rx_queue_len`	8192	Receive queue length
`ucx_tls`	Tcp	UCX transport layer
`hcoll_enable_mcast_all`	0	Disable multicast
`nccl_cumem_enable`	0	CUDA memory optimizations
`nccl_ib_sl`	0	InfiniBand service level
`nccl_ib_tc`	0	InfiniBand traffic class
`nccl_ib_split_data_on_qps`	0	Don’t split data across queue pairs
`nccl_pxn_disable`	0, 1	PXN disable options
`nccl_net_plugin`	None	Network plugin
`verify_bus_bw`	False	Verify bus bandwidth
`verify_bw_dip`	True	Check for bandwidth drops
`verify_lat_dip`	True	Check for latency spikes
`debug_level`	ERROR	Set the debug level
`rccl_result_file`	`/tmp/rccl_result_file.json`	Path where RCCL results are captured
`_comments_results`	`Expected results are for the two-node cluster and vary based on cluster size`	Expected results are for the two-node cluster and vary based on cluster size
`all_reduce_perf`	`bus_bw`: “8589934592”: “330.00” “17179869184”: “350.00”	Global reduction: sum/max/min across all GPUs
`all_gather_perf`	`bus_bw`: “8589934592”: “330.00” “17179869184”: “350.00”	All GPUs receive the complete combined dataset from all ranks
`gather_perf`	`bus_bw`: “8589934592”: “330.00” “17179869184”: “350.00”	Collect data from all ranks to one root rank
`scatter_perf`	`bus_bw`: “8589934592”: “330.00” “17179869184”: “350.00”	Distribute data from one rank to all ranks
`reduce_perf`	`bus_bw`: “8589934592”: “300.0” “17179869184”: “310.00”	Reduce operation
`reduce_scatter_perf`	`bus_bw`: “8589934592”: “340.0” “17179869184”: “360.00”	Reduce operation followed by scatter
`alltoall_perf`	`bus_bw`: “8589934592”: “45.00” “17179869184”: “50.00”	Every rank sends unique data to every other rank
`alltoallv_perf`	`bus_bw`: “8589934592”: “45.00” “17179869184”: “50.00”	Variable-size all-to-all exchange
`sendrecv_perf`	`bus_bw`: “8589934592”: “47.00” “17179869184”: “48.00”	Point-to-point communication between rank pairs
`broadcast_perf`	`bus_bw`: “8589934592”: “310.00” “17179869184”: “312.00”	One-to-all communication

`single_node_mi355_rccl.json`#

Here’s a code snippet of the single_node_mi355_rccl.json file for reference:

Parameters#

Here’s an exhaustive list of the available parameters in the single_node_mi355_rccl.json RCCL configuration file:

Configuration parameters	Default values	Description
`no_of_local_ranks`	8	MPI ranks running on each individual node
`rccl_dir`	`/opt/rccl-tests/`	Directory where RCCL is installed
`rccl_tests_dir`	`/opt/rccl-tests/build`	Directory where RCCL tests are installed
`rocm_path_var`	`/opt/rocm-7.0.2/`	Path to ROCm installation
`rccl_path_var`	`/opt/rccl-tests/`	Directory where RCCL tests are located
`env_source_script`	`/root/env_source_file.sh`	Path to environment setup script
`rccl_collective`	Values: `all_reduce_perf` `all_gather_perf` `scatter_perf` `gather_perf` `reduce_scatter_perf` `sendrecv_perf` `alltoall_perf` `alltoallv_perf` `reduce_scatter_perf` `broadcast_perf`	RCCL tests list
`start_msg_size`	1024	Start with 1KB messages
`end_msg_size`	16g	End with 16GB messages
`step_function`	2	Double message size each step
`warmup_iterations`	10	Warmup runs
`no_of_iterations`	1	Number of iterations to run the RCCL tests
`check_iteration_count`	1	Verification iteration
`verify_bus_bw`	False	Verify bus bandwidth
`verify_bw_dip`	True	Check for bandwidth drops
`verify_lat_dip`	True	Check for latency spikes
`debug_level`	ERROR	Set the debug level
`rccl_result_file`	`/tmp/rccl_result_file.json`	Path where RCCL results are captured
`_comments_results`	`Expected results are for the two-node cluster and vary based on cluster size`	Expected results are for the two-node cluster and vary based on cluster size
`all_reduce_perf`	`bus_bw`: “8589934592”: “390.00” “17179869184”: “393.00”	Global reduction: sum/max/min across all GPUs
`all_gather_perf`	`bus_bw`: “8589934592”: “380.00” “17179869184”: “383.00”	All GPUs receive the complete combined dataset from all ranks
`gather_perf`	`bus_bw`: “8589934592”: “430.00” “17179869184”: “430.00”	Collect data from all ranks to one root rank
`reduce_scatter_perf`	`bus_bw`: “8589934592”: “400.0” “17179869184”: “400.00”	Reduce operation followed by scatter
`alltoall_perf`	`bus_bw`: “8589934592”: “340.00” “17179869184”: “345.00”	Every rank sends unique data to every other rank
`alltoallv_perf`	`bus_bw`: “8589934592”: “340.00” “17179869184”: “345.00”	Variable-size all-to-all exchange
`broadcast_perf`	`bus_bw`: “8589934592”: “370.00” “17179869184”: “370.00”	One-to-all communication

ROCm Communication Collectives Library (RCCL) test configuration files

Contents

ROCm Communication Collectives Library (RCCL) test configuration files#

rccl_config.json#

Parameters#

single_node_mi355_rccl.json#

Parameters#

`rccl_config.json`#

`single_node_mi355_rccl.json`#