Multi-node network configuration for AMD Instinct accelerators#
After single-node configuration and testing have been completed and verified, validate the network connections between node pairs. All the tests described in this topic must be run between two nodes in a client-server relationship. Both nodes must be configured and verified according to Single-node network configuration for AMD Instinct accelerators before running any node-to-node performance tests.
Prerequisites#
Before following the steps in this guide, complete the following prerequisites.
Install all required software for MPI as described in the ROCm documentation.
Specifically, follow the installation instructions for Open MPI, OSU benchmarks, and collective operations.
Install Slurm Workload Manager (if applicable). Refer to the Slurm Workload Manager documentation.
Implement passwordless SSH.
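For reference, a minimal sketch of setting up passwordless SSH between two nodes follows; the user name and hostname (user@node2) are placeholders for your environment.

ssh-keygen -t ed25519          # accept the defaults to create a key pair under ~/.ssh
ssh-copy-id user@node2         # copy the public key to the peer node
ssh user@node2 hostname        # verify login works without a password prompt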
Evaluate platform-specific BIOS tunings#
Check your BIOS settings to make sure they are optimized for AMD GPUs. See the AMD Instinct system optimization guides for more information.
Enable large BAR addressing in the BIOS to support peer-to-peer GPU memory access.
Verify SR-IOV is enabled, if needed.
Disable ACS (ACS forces P2P transactions through the PCIe root complex).
Note
If using virtual devices, AER and ACS should be enabled.
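To confirm these settings from the OS, checks like the following can help. This is a sketch, not an exhaustive validation; output fields vary by platform, and 1002 is the AMD PCI vendor ID.

sudo lspci -vvv | grep -i "ACSCtl"           # GPU-facing ports should report ACS source validation disabled (SrcValid-)
sudo lspci -vv -d 1002: | grep -i "Region 0" # with large BAR enabled, the GPU BAR0 region should report a multi-GB size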
Single tier switch configuration#
Take these actions on each single tier (leaf/edge) switch you plan to include in network testing.
Configure remote access to the switch management console.
Verify that the switch sees all hosts and that the ports are active.
For an InfiniBand switch, configure Fabric Manager on the switch or start OpenSM on a host in the network if a subnet manager isn’t already in place.
For an Ethernet switch, configure the MTU size, priority flow control (PFC), and ECN support as needed.
Clear all port counters after the switch is ready to use.
OFED perftest installation and benchmarking#
Install and run the OFED performance tests for host-to-host (H2H) testing. Loopback is implemented in the tests to remove the switch from benchmark results. Remember to install the OFED perftests on both nodes you plan to use in this section. Commands may require sudo depending on user privileges.
From the CLI of your host, clone the perftest repository.
git clone https://github.com/linux-rdma/perftest.git
Navigate to the installation directory and build the tests.
cd perftest
./autogen.sh
./configure --prefix=$PWD/install --enable-rocm --with-rocm=/opt/rocm

Locate and open Makefile in your editor of choice, then append -D__HIP_PLATFORM_AMD__ to CFLAGS and CXXFLAGS. This is required to compile the code correctly for this guide.

Run make && make install.

Repeat these steps on a second node connected to the same switch.
Run host-based (CPU) performance tests#
Once installed, there are six main modules available with OFED perftests:

- ib_write_bw - Test bandwidth with RDMA write transactions.
- ib_write_lat - Test latency with RDMA write transactions.
- ib_read_bw - Test bandwidth with RDMA read transactions.
- ib_read_lat - Test latency with RDMA read transactions.
- ib_send_bw - Test bandwidth with send transactions.
- ib_send_lat - Test latency with send transactions.
The examples in this section use the ib_send_bw tool, but you can achieve similar results with other benchmarking tools, depending on your requirements. The primary objective of these tests is to verify high-speed Host-to-Host (H2H) data transfer rates between nodes before introducing GPU traffic; as a result, the --use_rocm flag is intentionally omitted from all commands.
Run H2H RDMA benchmark#
To run the OFED perftest, establish an SSH connection to both nodes you installed the OFED perftests on.
Initiate a server connection on the first node:
$ cd perftest    # if not already in the directory
$ numactl -C 1 ./ib_send_bw -a -F -d <IB/RoCE interface>

************************************
* Waiting for client to connect... *
************************************
Initiate a client connection on the second node:
$ cd perftest    # if not already in the directory
$ numactl -C 1 ./ib_send_bw <node1 IP> -a -F -d <IB/RoCE interface>
The test should run and complete within a few moments.
Note
The use of numactl or taskset ensures NUMA domains are not crossed when communicating, which would otherwise create overhead and latency. When running tests, make sure you use cores local to the network device.
Consult the following table for an explanation of the flags used in the numactl examples and other optional flags that may be useful to you.

| Flag | Description |
|---|---|
| -d <IB/RoCE interface> | Specifies the NIC to use. Ensure you use a NIC that is adjacent to a GPU and does not cross NUMA domains or otherwise need to pass traffic between CPUs before egressing from the host. Tools like rocm-smi --showtopo and lstopo can help determine which NICs are adjacent to which GPUs. |
| -p <port #> | Assigns a port number to the server/client. Each instance must run on a different port when executed simultaneously. |
| --report_gbits | Reports results in Gb/s instead of Mb/s. |
| -m <mtu> | Sets the MTU size. |
| -b | Runs bidirectional traffic. |
| -a | Runs messages in all sizes. |
| -n <number> | Sets the number of iterations. |
| -F | Does not show a warning if cpufreq_ondemand is loaded. |
| --use_rocm=<rocm_device_number> | For device testing; lets you specify which GPU to use (zero-based numbering). |
| --perform_warm_up | Runs several iterations before benchmarking to warm up the memory cache. |
As servers typically have one NIC per GPU, you must change the device location frequently as you iterate through tests.
Run multithreaded H2H RDMA benchmark#
To perform a multithreaded RDMA benchmark using the OFED perftest, run it
concurrently on each NIC in the server. Use the taskset
command to assign a
CPU core within the same NUMA domain as the NICs. While testing the
XGMI/Infinity Fabric link between CPUs is not required at this stage, it can be
an optional test if desired.
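The following is a minimal sketch of such a run on the client side, assuming four RDMA NICs. The NIC names, core IDs, and peer hostname are placeholders to replace based on your topology (rocm-smi --showtopo and lstopo help here), and matching ib_send_bw servers must already be listening on the peer node on the same ports.

#!/bin/bash
# Launch one ib_send_bw client per NIC, pinned to a core in the NIC's NUMA domain.
PEER=node1                              # placeholder: host running the matching servers
NICS=(mlx5_0 mlx5_1 mlx5_2 mlx5_3)      # placeholder RDMA device names
CORES=(1 17 33 49)                      # placeholder cores, one per NIC's NUMA domain
BASE_PORT=18515

for i in "${!NICS[@]}"; do
    taskset -c "${CORES[$i]}" ./ib_send_bw "$PEER" -d "${NICS[$i]}" \
        -p $((BASE_PORT + i)) -a -F --report_gbits &
done
wait    # wait for all concurrent benchmarks to finish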
Run extended multithreaded H2H RDMA benchmark#
Repeat the multithreaded RDMA benchmark, but loop the test and run it continuously for at least 8 hours. This extended test is designed to stress the I/O network fabric over a prolonged period to assess stability and performance under sustained load.
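One simple way to loop the run for a prolonged period is sketched below; run_multithreaded_rdma_test.sh is a hypothetical wrapper around the concurrent ib_send_bw launches shown earlier.

#!/bin/bash
# Repeat the multithreaded benchmark for at least 8 hours.
END=$((SECONDS + 8 * 3600))
while [ ${SECONDS} -lt ${END} ]; do
    ./run_multithreaded_rdma_test.sh    # hypothetical wrapper script
done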
Run device-based (GPU) OFED performance tests#
After confirming Host-to-Host (H2H) performance, proceed to run Device-to-Device (D2D) OFED perftests, which include GPU traffic. This will evaluate RDMA performance between GPUs.
Run D2D RDMA benchmark#
To run a D2D RDMA benchmark, use the following example setup to test GPU pairs (for example, GPU0 to GPU1, or GPU2 to GPU3).
Note
If you have Mellanox or NVIDIA NICs, be aware that the default OFED perftest installation doesn’t include ROCm support. Follow the installation instructions if you haven’t done so already.
In this example, localhost is used by the client to call the server. You may use a specific IP address to ensure the network is tested.

$ (ib_write_bw -b -a -d <RDMA-NIC-1> --report_gbits -F --use_rocm=0 >> /dev/null &); sleep 1; ib_write_bw -b -a -d <RDMA-NIC-2> --report_gbits --use_rocm=0 -F localhost
---------------------------------------------------------------------------------------
RDMA_Write Bidirectional BW Test
Dual-port : OFF Device : <RDMA-NIC-2>
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : OFF
TX depth : 128
CQ Moderation : 100
Mtu : 4096[B]
Link type : Ethernet
GID index : 3
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0000 QPN 0x0901 PSN 0x5e30c8 RKey 0x2000201 VAddr 0x007fe663d20000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:101:45
remote address: LID 0000 QPN 0x0901 PSN 0xf40c3c RKey 0x2000201 VAddr 0x007f282a06e000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:101:35
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps]
2 5000 0.142947 0.012281 0.767588
4 5000 0.28 0.26 8.255475
8 5000 0.55 0.54 8.471791
16 5000 1.16 1.16 9.025968
32 5000 2.31 2.27 8.865877
64 5000 4.49 4.43 8.647051
128 5000 8.98 8.96 8.745890
256 5000 17.57 16.32 7.969287
512 5000 34.63 34.41 8.400441
1024 5000 67.22 66.92 8.168969
2048 5000 129.04 126.20 7.702863
4096 5000 188.76 188.56 5.754307
8192 5000 194.79 192.62 2.939080
16384 5000 195.32 195.21 1.489355
32768 5000 203.15 203.13 0.774887
65536 5000 204.12 203.85 0.388818
131072 5000 204.44 204.43 0.194964
262144 5000 204.51 204.51 0.097517
524288 5000 204.56 204.56 0.048770
1048576 5000 204.57 204.57 0.024387
2097152 5000 204.59 204.59 0.012194
4194304 5000 204.59 204.59 0.006097
8388608 5000 204.59 204.59 0.003049
---------------------------------------------------------------------------------------
Note
If you run the test with different values for --use_rocm=#
on the server
and the client, the output will show results from whichever GPU is local to
the node you’re looking at. The tool is unable to show server and client
simultaneously.
Run H2D/D2H RDMA benchmark#
This is similar to the D2D test, but also includes the CPU on either the server or client side of the test-case scenarios.
For a 2-CPU/8-GPU node, this yields 32 test scenarios per server pair, as summarized in the tables below.
In the first set of scenarios, the server side uses host (CPU) memory and the client side uses GPU memory; in the second set, the roles are reversed.

| Server \ Client | GPU 0 | GPU 1 | GPU 2 | GPU 3 | GPU 4 | GPU 5 | GPU 6 | GPU 7 |
|---|---|---|---|---|---|---|---|---|
| CPU 0 |  |  |  |  |  |  |  |  |
| CPU 1 |  |  |  |  |  |  |  |  |

| Client \ Server | GPU 0 | GPU 1 | GPU 2 | GPU 3 | GPU 4 | GPU 5 | GPU 6 | GPU 7 |
|---|---|---|---|---|---|---|---|---|
| CPU 0 |  |  |  |  |  |  |  |  |
| CPU 1 |  |  |  |  |  |  |  |  |
To run this test, use a command similar to the example in the D2D benchmark, but add the --use_rocm flag on only the server or the client side so that one node communicates through its GPUs while the other uses its CPUs. Then, run the test a second time with the --use_rocm flag on the other side. Continue to use the NIC most adjacent to the GPU or CPU being tested so that communication doesn't have to pass between intra-node CPUs (testing the internal CPU-to-CPU fabric isn't a goal at this stage).
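As a sketch, an H2D run based on the D2D command above might look like the following; <RDMA-NIC-1>, <RDMA-NIC-2>, and <server IP> are placeholders, and the GPU index passed to --use_rocm is illustrative.

# Server side (host memory, no --use_rocm):
ib_write_bw -b -a -d <RDMA-NIC-1> --report_gbits -F
# Client side (GPU 0 memory):
ib_write_bw -b -a -d <RDMA-NIC-2> --report_gbits -F --use_rocm=0 <server IP>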
D2D RDMA multithread benchmark#
For this test, you must run the previous D2D benchmark simultaneously on all GPUs. Scripting is required to accomplish this; the command input should resemble the following sketch, adapted to your RDMA device naming scheme.
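This is a minimal sketch only; the device names, ports, and peer hostname are placeholders, and matching ib_write_bw servers must already be running on the remote node with the same ports and GPU indices.

#!/bin/bash
# Launch one D2D ib_write_bw client per GPU/NIC pair, each on its own port.
PEER=node1                                                        # placeholder: server node hostname or IP
NICS=(mlx5_0 mlx5_1 mlx5_2 mlx5_3 mlx5_4 mlx5_5 mlx5_6 mlx5_7)    # placeholder NIC names, one per GPU
BASE_PORT=18515

for gpu in "${!NICS[@]}"; do
    ib_write_bw -b -a -d "${NICS[$gpu]}" --report_gbits -F \
        --use_rocm="$gpu" -p $((BASE_PORT + gpu)) "$PEER" &
done
wait    # wait for all eight concurrent tests to finish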
Important OFED perftest flags for this effort include:
| Flag | Description |
|---|---|
| -p <port#> | Lets you assign specific ports for server/client combinations. Each pair needs an independent port number so you don't inadvertently use the wrong server. |
| -n <# of iterations> | The default is 1000; increase this to make the test run longer. |

For bandwidth tests only:

| Flag | Description |
|---|---|
| -D <seconds> | Defines how long the test runs. |
| --run_infinitely | Runs indefinitely until the user interrupts the test. |
D2D RDMA multithread extended benchmark#
Perform the D2D RDMA multithread benchmark again but set the duration for a minimum of 8 hours.
Build collective tests#
This section guides you through setting up the remaining tools necessary to simulate an AI workload on your GPU nodes after they have been sufficiently traffic-tested. Per the prerequisites, UCX, UCC, MPI and the OSU benchmarks must already be installed.
Install RCCL#
RCCL is likely already installed as part of ROCm on your compute nodes. Sometimes newer features and fixes might be available in the latest version of RCCL, which you can build from source at ROCm/rccl.
Build RCCL collective tests#
To more easily build and run the RCCL collective tests, review and implement the script provided in the drop-down (the script also includes an option to install MPICH if needed). Otherwise, you can follow the steps to manually install at ROCm/rccl-tests.
build-and-run_rccl-tests_sweep_multinode.sh
#!/bin/bash -x

## change this if ROCm is installed in a non-standard path
ROCM_PATH=/opt/rocm

## to use pre-installed MPI, change `build_mpi` to 0 and ensure that libmpi.so exists at `MPI_INSTALL_DIR/lib`.
build_mpi=1
MPI_INSTALL_DIR=/opt/ompi

## to use pre-installed RCCL, change `build_rccl` to 0 and ensure that librccl.so exists at `RCCL_INSTALL_DIR/lib`.
build_rccl=1
RCCL_INSTALL_DIR=${ROCM_PATH}


WORKDIR=$PWD

## building mpich
if [ ${build_mpi} -eq 1 ]
then
    cd ${WORKDIR}
    if [ ! -d mpich ]
    then
        wget https://www.mpich.org/static/downloads/4.1.2/mpich-4.1.2.tar.gz
        mkdir -p mpich
        tar -zxf mpich-4.1.2.tar.gz -C mpich --strip-components=1
        cd mpich
        mkdir build
        cd build
        ../configure --prefix=${WORKDIR}/mpich/install --disable-fortran --with-ucx=embedded
        make -j 16
        make install
    fi
    MPI_INSTALL_DIR=${WORKDIR}/mpich/install
fi


## building rccl (develop)
if [ ${build_rccl} -eq 1 ]
then
    cd ${WORKDIR}
    if [ ! -d rccl ]
    then
        git clone https://github.com/ROCm/rccl -b develop
        cd rccl
        ./install.sh -l
    fi
    RCCL_INSTALL_DIR=${WORKDIR}/rccl/build/release
fi


## building rccl-tests (develop)
cd ${WORKDIR}
if [ ! -d rccl-tests ]
then
    git clone https://github.com/ROCm/rccl-tests
    cd rccl-tests
    make MPI=1 MPI_HOME=${MPI_INSTALL_DIR} NCCL_HOME=${RCCL_INSTALL_DIR} -j
fi


## running multi-node rccl-tests for each collective, message sizes 1 byte to 16GB
cd ${WORKDIR}

## requires a hostfile named hostfile.txt for the multi-node setup in ${WORKDIR}/

n=`wc --lines < hostfile.txt`    # count the number of nodes in hostfile.txt
echo "No. of nodes: ${n}"        # print number of nodes
m=8                              # assuming 8 GPUs per node
echo "No. of GPUs/node: ${m}"    # print number of GPUs per node
total=$((n * m))                 # total number of MPI ranks (1 per GPU)
echo "Total ranks: ${total}"     # print total number of ranks

### set these environment variables if using Infiniband interconnect
## export NCCL_IB_HCA=^mlx5_8

### set these environment variables if using RoCE interconnect
## export NCCL_IB_GID_INDEX=3

for coll in all_reduce all_gather alltoall alltoallv broadcast gather reduce reduce_scatter scatter sendrecv
do
    # using MPICH; comment next line if using OMPI
    mpirun -np ${total} --bind-to numa -env NCCL_DEBUG=VERSION -env PATH=${MPI_INSTALL_DIR}/bin:${ROCM_PATH}/bin:$PATH -env LD_LIBRARY_PATH=${RCCL_INSTALL_DIR}/lib:${MPI_INSTALL_DIR}/lib:$LD_LIBRARY_PATH ${WORKDIR}/rccl-tests/build/${coll}_perf -b 1 -e 16G -f 2 -g 1 2>&1 | tee ${WORKDIR}/stdout_rccl-tests_${coll}_1-16G_nodes${n}_gpus${total}.txt

    ## uncomment, if using OMPI
    ## mpirun -np ${total} --bind-to numa -x NCCL_DEBUG=VERSION -x PATH=${MPI_INSTALL_DIR}/bin:${ROCM_PATH}/bin:$PATH -x LD_LIBRARY_PATH=${RCCL_INSTALL_DIR}/lib:${MPI_INSTALL_DIR}/lib:$LD_LIBRARY_PATH --mca pml ucx --mca btl ^openib ${WORKDIR}/rccl-tests/build/${coll}_perf -b 1 -e 16G -f 2 -g 1 2>&1 | tee ${WORKDIR}/stdout_rccl-tests_${coll}_1-16G_nodes${n}_gpus${total}.txt

    sleep 10
done
Run OSU Micro Benchmarks#
Running the OSU Micro Benchmarks (OMB) with MPI simulates conditions similar to an AI/HPC workload over your cluster network. Successful MPI runs require that passwordless SSH is configured between all server pairs where OMB is installed and that their host keys have been accepted (fingerprinted); otherwise, the runs fail.

This section covers the two types of OMB:
Point to point (pt2pt) benchmarks test communication from one discrete component on a server (host or device) to another.
Collectives benchmarks support the use of multiple devices in a single run.
In a typical use case, you start with a pair of nodes, run the pt2pt benchmarks, and then move on to collectives.
Point to point (pt2pt) OSU benchmarks#
Commands in the table below must run on two nodes with RoCE or InfiniBand interconnect from Host to Host (CPU to CPU). You can invoke the command from either node, but directories must mirror one another or the tests will hang.
Note
The paths for the MPI and OMB commands presume both are installed in the /opt
directory. Installation paths for your environment may be different and should be updated accordingly.
| Command | Description |
|---|---|
| osu_bw | Measures unidirectional point-to-point bandwidth between two ranks. |
| osu_bibw | Measures bidirectional point-to-point bandwidth between two ranks. |
| osu_mbw_mr | Measures aggregate bandwidth and message rate across multiple rank pairs. |
| osu_latency | Measures point-to-point latency between two ranks. |
| osu_multi_lat | Measures point-to-point latency across multiple rank pairs. |
You can change the communication mode by appending D D to the end of the command for D2D, or D H for D2H (and vice versa).
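A minimal sketch of a two-node H2H osu_bw run follows. The hostnames and the /opt install prefixes are assumptions carried over from the note above; adjust them to your environment, and append D D only if OMB was built with ROCm support.

/opt/ompi/bin/mpirun -np 2 --host node1,node2 \
    -x PATH -x LD_LIBRARY_PATH \
    /opt/osu/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw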
Collective OSU benchmarks#
| Command | Description |
|---|---|
| osu_allreduce | Measures MPI_Allreduce latency across all ranks. |
| osu_allreduce 2N 16Proc | The same test run across 2 nodes with 16 total processes. |
| osu_alltoall | Measures MPI_Alltoall latency across all ranks. |
| osu_alltoall 2N 16Proc | The same test run across 2 nodes with 16 total processes. |
| osu_allgather | Measures MPI_Allgather latency across all ranks. |
| osu_allgather 2N 16Proc | The same test run across 2 nodes with 16 total processes. |
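As a sketch, a 2-node, 16-process osu_allreduce run might look like the following; the hostnames and install paths are assumptions, and -d rocm (device buffers) applies only if OMB was built with ROCm support.

/opt/ompi/bin/mpirun -np 16 --host node1:8,node2:8 \
    -x PATH -x LD_LIBRARY_PATH \
    /opt/osu/libexec/osu-micro-benchmarks/mpi/collective/osu_allreduce -d rocm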
Run RCCL collective benchmark#
RCCL is a library of collective communication primitives for multi-GPU and multi-node communication, optimized for AMD Instinct accelerators. The RCCL tests are typically launched using MPI; both MPICH and Open MPI work.
An example launch of the RCCL tests with MPI is shown below.
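This sketch reuses the paths and variables from the build script above (WORKDIR, MPI_INSTALL_DIR, RCCL_INSTALL_DIR, and hostfile.txt are the same assumptions made there); the -f hostfile.txt option is added here to distribute the 16 ranks across the nodes listed in the hostfile.

${MPI_INSTALL_DIR}/bin/mpirun -f ${WORKDIR}/hostfile.txt -np 16 --bind-to numa \
    -env NCCL_DEBUG=VERSION \
    -env LD_LIBRARY_PATH=${RCCL_INSTALL_DIR}/lib:${MPI_INSTALL_DIR}/lib:$LD_LIBRARY_PATH \
    ${WORKDIR}/rccl-tests/build/all_reduce_perf -b 1 -e 16G -f 2 -g 1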