Single-node network configuration for AMD Instinct accelerators

This section explains how to set up a testing environment on a single accelerator node and run benchmarks that simulate an AI or HPC workload.

Prerequisites#

Before you begin, ensure you have completed these prerequisites.

- Install GPU and network hardware. Refer to the hardware support matrix.
- Install the OS and the required GPU and network software on each node:
  - Install the ROCm software stack.
  - Install network drivers for NICs. If using InfiniBand, also install OpenSM.
- Ensure network settings are correctly configured for your hardware.
- Configure system BIOS and OS settings according to System optimization for your architecture (MI300, MI200, and so on).
- Disable NUMA balancing.
  1. Run sudo sysctl kernel.numa_balancing=0.
  2. To verify NUMA balancing is disabled, run cat /proc/sys/kernel/numa_balancing and confirm that 0 is returned.
  3. See Disable NUMA auto-balancing for more information.
- Disable PCI ACS (access control services). Run the disable ACS script on all PCIe devices supporting it. This must be done after each reboot.
- Configure IOMMU settings.
  1. Add iommu=pt to the GRUB_CMDLINE_LINUX_DEFAULT entry in /etc/default/grub.
  2. Run sudo update-grub, then reboot.
  3. See GRUB settings and Issue #5: Application hangs on Multi-GPU systems for more information.
- Verify group permissions.
  1. Ensure the user belongs to the render and video groups.
  2. Refer to Setting permissions for groups for guidance.
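
The NUMA balancing, IOMMU, and group prerequisites can be spot-checked with a short script. The following is a minimal sketch, not an official tool: the function names are illustrative, and the file-path parameters exist only so the checks can be exercised against sample files (the defaults are the real kernel interfaces).

```shell
#!/bin/sh
# Sketch: report the state of three prerequisites from this section.
# Function names are hypothetical; default paths are the real kernel files.

# NUMA balancing is disabled when the file contains 0.
check_numa_balancing() {
    file="${1:-/proc/sys/kernel/numa_balancing}"
    if [ "$(cat "$file" 2>/dev/null)" = "0" ]; then
        echo "numa_balancing: disabled"
    else
        echo "numa_balancing: NOT disabled"
        return 1
    fi
}

# iommu=pt must appear as its own token on the running kernel's command line.
check_iommu_pt() {
    file="${1:-/proc/cmdline}"
    if tr ' ' '\n' < "$file" | grep -qx 'iommu=pt'; then
        echo "iommu=pt: present"
    else
        echo "iommu=pt: MISSING"
        return 1
    fi
}

# The user must belong to both the render and video groups.
# $1 is a space-separated group list; it defaults to `id -nG` for the current user.
check_groups() {
    groups_list="${1:-$(id -nG)}"
    status=0
    for g in render video; do
        case " $groups_list " in
            *" $g "*) echo "group $g: ok" ;;
            *) echo "group $g: MISSING"; status=1 ;;
        esac
    done
    return $status
}
```

To use it, call check_numa_balancing, check_iommu_pt, and check_groups with no arguments after the reboot that applies the GRUB change. Each function returns a non-zero status when its check fails.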

Best practices for software consistency#

To ensure consistent software configurations across systems:

Use a shared NFS (network file system) mount. Install the necessary software on a common NFS mount accessible to all systems.
Create a system image with all the software installed. Re-image when software changes are made.

Validate PCIe performance#

Check that the relevant PCIe devices (GPUs, NICs, and internal switches) are running at the maximum transfer speed and width their bus supports. Confirming this now saves you from troubleshooting link problems later, during tests where the cause may not be obvious.

Tip

Gather all the PCIe addresses for your GPUs, NICs, and switches in advance and take note of them so you have them on hand for next steps.
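
One way to collect these addresses is to filter the lspci device list by device class. The sketch below assumes the class names that lspci commonly prints for GPUs, NICs, and bridges; the function name is illustrative, and the patterns may need adjusting if your devices are reported under a different class.

```shell
# Sketch: keep only GPU, NIC, and PCIe bridge lines from `lspci -D` output.
# The class-name patterns are assumptions based on common lspci class strings.
filter_pcie_devices() {
    grep -E -i 'display controller|processing accelerators|vga compatible controller|ethernet controller|infiniband controller|pci bridge'
}

# Typical use (requires pciutils); -D prints the full domain:bus:device.function address:
#   lspci -D | filter_pcie_devices
```

The first column of each remaining line is the PCI address to use in the following steps.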

Check PCIe device speed and width#

From the command line of your host, run lspci to retrieve a list of PCIe devices and locate your GPU and network devices.
Run sudo lspci -s <PCI address> -vvv | grep Speed to review the speed and width of your device. This example shows the speed and width for a GPU at the address 02:00.0.
Shell output
$ sudo lspci -s 02:00.0 -vvv | grep Speed
LnkCap: Port #0, Speed 32GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
LnkSta: Speed 32GT/s (ok), Width x16 (ok)
LnkCap reports the maximum speed and width the GPU supports (here 32GT/s and x16). LnkSta reports the current link status; in this example both speed and width match LnkCap. Your values may differ depending on your hardware.
Query and validate all GPUs in your node with the previous steps.
Gather the PCI addresses for your NICs and validate them next. See this example from a NIC at the address 05:00.0:
Shell output
$ sudo lspci -s 05:00.0 -vvv | grep Speed
LnkCap: Port #0, Speed 16GT/s, Width x16, ASPM not supported
LnkSta: Speed 16GT/s (ok), Width x16 (ok)
Here, the NIC is running at 16GT/s. Because this NIC supports only PCIe Gen4, which tops out at 16GT/s per lane, this is the expected value.

Once you verify all GPUs and NICs are running at maximum supported speeds and widths, then proceed to the next section.
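
The comparison of LnkCap against LnkSta can also be scripted. The following is a minimal sketch that parses the lspci output format shown in the examples above; check_link is a hypothetical helper name, and it expects the output for a single device on standard input.

```shell
# Sketch: read `lspci -vvv -s <PCI address>` output on stdin and compare the
# negotiated link (LnkSta) with the device capability (LnkCap).
check_link() {
    awk '
        # Return the value that follows the keyword matched by re, or "?" if absent.
        function grab(line, re,    s) {
            if (match(line, re)) {
                s = substr(line, RSTART, RLENGTH)
                sub(/^[A-Za-z]+ /, "", s)
                return s
            }
            return "?"
        }
        /LnkCap:/ { cap = grab($0, "Speed [0-9.]+GT/s") " " grab($0, "Width x[0-9]+") }
        /LnkSta:/ { sta = grab($0, "Speed [0-9.]+GT/s") " " grab($0, "Width x[0-9]+") }
        END {
            if (cap == "" || sta == "") { print "no link information found"; exit 2 }
            if (cap == sta) { print "OK: " sta }
            else { print "MISMATCH: LnkSta " sta ", LnkCap " cap; exit 1 }
        }
    '
}

# Typical use:
#   sudo lspci -vvv -s 02:00.0 | check_link
```

A mismatch does not always indicate a fault: as with the Gen4 NIC above, the limit can come from the device or the slot, so interpret the result against your hardware's specifications.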

Note

If you’re running a cloud instance, hardware passthrough to your guest OS might not be accurate. Verify your lspci results with your cloud provider.

Check PCIe switch speed and width#

Next, check that the PCIe switches report the maximum speed and width in LnkSta (link status).

Run lspci -vv and lspci -tv to identify PCIe switch locations on the server.
Run sudo lspci -vvv -s <PCI address> | grep Speed to verify speed and width as previously demonstrated.
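
On a node with many switch ports, it can be quicker to scan the whole device tree for links running below their capability. The sketch below relies on lspci annotating LnkSta with "(downgraded)", which is an assumption about your pciutils version (versions that print "(ok)", as in the examples above, do so). The function name is illustrative.

```shell
# Sketch: read full `lspci -vvv` output on stdin and print every device whose
# LnkSta line is annotated as downgraded, followed by that LnkSta line.
find_downgraded_links() {
    awk '
        /^[0-9a-fA-F]/ { device = $0 }   # device header line, e.g. "01:00.0 PCI bridge: ..."
        /LnkSta:/ && /downgraded/ {
            print device
            sub(/^[ \t]+/, "")           # strip the leading indentation of the LnkSta line
            print "    " $0
        }
    '
}

# Typical use:
#   sudo lspci -vvv | find_downgraded_links
```

No output means no link was flagged. Devices that lower their link speed when idle for power saving may appear in this list, so re-check any flagged device under load before treating it as a problem.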

Check max payload size and max read request#

The MaxPayload and MaxReadReq attributes define the maximum payload size of a PCIe packet and the maximum size of a single read request, respectively. For optimal bandwidth, ensure that all GPUs and NICs are configured to use the maximum values for both attributes.

Run sudo lspci -vvv -s <PCI address> | grep DevCtl: -C 2 to review max payload size and max read request. Here is an example using the same NIC as before.

Shell output

$ sudo lspci -vvv -s 05:00.0 | grep DevCtl: -C 2

DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <4us, L1 <64us
         ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 40.000W
DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq-
         RlxdOrd+ ExtTag+ PhantFunc- AuxPwr+ NoSnoop+ FLReset-
         MaxPayload 512 bytes, MaxReadReq 4096 bytes

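MaxReadReq is unique because it can be changed at runtime with the setpci command. If your value here is lower than expected, you can correct it as follows. The example rewrites the device control register, which sits at offset 68 on this NIC (the offset is device-specific), so that MaxReadReq changes from 512 to 4096 bytes: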

Shell output

$ sudo lspci -vvvs a1:00.0 | grep axReadReq

MaxPayload 512 bytes, MaxReadReq 512 bytes

$ sudo setpci -s a1:00.0 68.w

295e

$ sudo setpci -s a1:00.0 68.w=595e

$ sudo lspci -vvvs a1:00.0 | grep axReadReq

MaxPayload 512 bytes, MaxReadReq 4096 bytes

Note

Changes made with setpci are not persistent across reboots. This example uses a single NIC for simplicity, but in practice you must run the change for each NIC in the node.
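
The register values in the example (295e and 595e) can be decoded by hand. In the PCIe Device Control register, bits 7:5 encode MaxPayload and bits 14:12 encode MaxReadReq, each as 128 bytes shifted left by the field value. The sketch below decodes a value and computes a new value with only the MaxReadReq field changed. The helper names are hypothetical, and the sketch assumes the value you pass in was read from the Device Control register of your device.

```shell
# Sketch: decode a 16-bit PCIe Device Control value (hex, as printed by setpci).
# Field encodings 6 and 7 are reserved and are not treated specially here.
decode_devctl() {
    val=$((0x$1))
    mps=$(( 128 << ((val >> 5) & 7) ))     # bits 7:5   -> MaxPayload in bytes
    mrrs=$(( 128 << ((val >> 12) & 7) ))   # bits 14:12 -> MaxReadReq in bytes
    echo "MaxPayload $mps bytes, MaxReadReq $mrrs bytes"
}

# Print the register value with MaxReadReq set to $2 bytes and all other bits unchanged.
with_max_read_req() {
    val=$((0x$1))
    case "$2" in
        128) code=0 ;; 256) code=1 ;; 512) code=2 ;;
        1024) code=3 ;; 2048) code=4 ;; 4096) code=5 ;;
        *) echo "unsupported size: $2" >&2; return 1 ;;
    esac
    printf '%04x\n' $(( (val & ~0x7000) | (code << 12) ))
}

# Values from the example above:
#   decode_devctl 295e           prints: MaxPayload 512 bytes, MaxReadReq 512 bytes
#   with_max_read_req 295e 4096  prints: 595e
```

Computing the new value from the one you read back, rather than copying 595e, preserves whatever other control bits your device has set.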

Validate NIC configuration#

After you’ve verified optimal PCIe speeds for all devices, configure your NICs according to best practices in the manufacturer or vendor documentation. That documentation might already cover some of the pre-assessment steps outlined in this guide and provide more hardware-specific tuning.

Vendor-specific NIC tuning#

Your NICs may require tuning if it has not already been done. Some steps differ based on the type of NIC you’re deploying (InfiniBand or RoCE).

Ensure ACS is disabled.

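For Mellanox NICs (InfiniBand or RoCE): Disable ATS, enable PCI Relaxed Ordering, increase max read requests, and enable advanced PCI settings. In the commands below, replace /dev/mst/mt4123_pciconf0 with the device name that mst status reports for your NIC.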

sudo mst start

sudo mst status

sudo mlxconfig -d /dev/mst/mt4123_pciconf0 s ADVANCED_PCI_SETTINGS=1

sudo mlxconfig -d /dev/mst/mt4123_pciconf0 s MAX_ACC_OUT_READ=44

sudo mlxconfig -d /dev/mst/mt4123_pciconf0 s PCI_WR_ORDERING=1

sudo reboot

For Broadcom NICs, ensure RoCE is enabled and consider disabling any unused ports. See the Broadcom RoCE configuration scripts for more details.
Ensure Relaxed Ordering is enabled in the PCIe settings for your system BIOS as well.
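
On a node with several Mellanox NICs, the same three mlxconfig settings have to be applied to each device. The following sketch only prints the commands so you can review them before running anything; the function name is hypothetical, and the device paths you pass in should come from mst status on your system.

```shell
# Sketch: print (do not run) the mlxconfig commands from this section for each
# MST device path given as an argument.
print_mlx_tuning() {
    for dev in "$@"; do
        for setting in ADVANCED_PCI_SETTINGS=1 MAX_ACC_OUT_READ=44 PCI_WR_ORDERING=1; do
            echo "sudo mlxconfig -d $dev s $setting"
        done
    done
}

# Example with the device name used above:
#   print_mlx_tuning /dev/mst/mt4123_pciconf0
```

After reviewing the printed commands, run them and reboot as described above.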

Note

All instructions for RoCE networks in this guide and additional guides are based on the v2 protocol.

Check NIC link speed#

Verify the NICs in your servers are reporting the correct speeds. Several commands and utilities are available to measure speed based on your network type.

RoCE / Ethernet
- sudo ethtool <interface> | grep -i speed
- cat /sys/class/net/<interface>/speed
InfiniBand
- ibdiagnet writes a report on the entire fabric to its default log files, where you can verify link speeds.
- ibstat or ibstatus tells you if the link is up and the speed at which it is running for all HCAs in the server.
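
For RoCE and Ethernet interfaces, the sysfs speed file can be read for every interface in one pass. The sketch below is illustrative: list_link_speeds is a hypothetical name, and the directory parameter exists only so the function can be tried against a sample tree. The kernel reports the value in Mb/s.

```shell
# Sketch: print "<interface> <speed in Mb/s>" for each interface that exposes
# a speed file under the given sysfs directory (default: /sys/class/net).
list_link_speeds() {
    net_dir="${1:-/sys/class/net}"
    for path in "$net_dir"/*; do
        [ -e "$path/speed" ] || continue
        iface=${path##*/}
        # Reading speed can fail on some interfaces, for example when the link is down.
        speed=$(cat "$path/speed" 2>/dev/null) || speed="unknown"
        [ -n "$speed" ] || speed="unknown"
        echo "$iface $speed"
    done
}

# Typical use:
#   list_link_speeds
```

A 400 Gb/s link, for example, is reported as 400000. A value of -1 or unknown usually means the link is down or the driver does not report a speed.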

Verify Mellanox OFED and firmware installation#

Note

This step is only necessary for InfiniBand networks.

Download the latest version of Mellanox OFED (MLNX_OFED) from NVIDIA. Run the installer and the flint tool to verify that the latest versions of MLNX_OFED and the firmware are on the HCAs.
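
As a quick cross-check after installation, the firmware version that each HCA reports can be read from ibv_devinfo output. This sketch complements the installer and flint checks rather than replacing them; the function name is hypothetical, and it assumes the usual ibv_devinfo layout of hca_id and fw_ver fields.

```shell
# Sketch: read `ibv_devinfo` output on stdin and print "<hca_id> <fw_ver>" per HCA.
list_hca_firmware() {
    awk '
        $1 == "hca_id:" { hca = $2 }
        $1 == "fw_ver:" { print hca, $2 }
    '
}

# Typical use:
#   ibv_devinfo | list_hca_firmware
#   ofed_info -s        # prints the installed MLNX_OFED version string
```

Compare the reported versions against the release notes for the MLNX_OFED version you installed.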

Configuration scripts#

Run these scripts where indicated to aid in the configuration and setup of your devices.
