mGPU setup and configuration#

Hardware and software considerations#

Refer to the following hardware and software considerations to ensure optimal performance.

Hardware considerations#

  • PCIe® slots
    AMD recommends a system with multiple PCIe® x16 (Gen 4) slots. Optimal performance is achieved with a 1:1 ratio between the number of x16 slots and the number of GPUs used.

    Functionality is maintained when only one x16 slot is available, at the cost of some performance.

  • mGPU power setup
    Multi-GPU configurations require adequate power for all components.
    Consult the AMD Radeon™ RX or AMD Radeon™ PRO product pages for GPU specifications and graphics card power requirements.
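
To confirm that each GPU has negotiated the expected link, you can inspect the `LnkSta` lines printed by `sudo lspci -vv -d 1002:` (vendor ID `1002` is AMD). The sketch below parses such output; the sample text stands in for real `lspci` output, and the values shown are illustrative.

```python
import re

# Illustrative sample of `lspci -vv` output; a Gen 4 x16 link reports
# "Speed 16GT/s, Width x16". Real output comes from the command above.
sample_lspci = """\
03:00.0 VGA compatible controller: Advanced Micro Devices, Inc.
\t\tLnkSta:\tSpeed 16GT/s (ok), Width x16 (ok)
0a:00.0 VGA compatible controller: Advanced Micro Devices, Inc.
\t\tLnkSta:\tSpeed 16GT/s (ok), Width x8 (downgraded)
"""

def link_status(lspci_text):
    """Return (speed, width) pairs for every LnkSta line."""
    return re.findall(r"LnkSta:\s*Speed ([\d.]+GT/s)[^,]*, Width (x\d+)",
                      lspci_text)

for speed, width in link_status(sample_lspci):
    print(speed, width)
```

A width below x16, or a speed below 16GT/s, indicates the card is not getting the recommended Gen 4 x16 link.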

Software considerations#

There are no differences in software requirements between single-GPU and multi-GPU usage.

mGPU configuration by framework#

PyTorch, ONNX, and TensorFlow may have additional guidelines regarding mGPU configuration. Refer to the official mGPU support documentation of the applicable framework for more information.

mGPU known issues and limitations#

AMD has identified the following common errors when running ROCm™ on Radeon™ multi-GPU configurations, along with recommendations to address them.

IOMMU limitations and guidance#

For IOMMU limitations and guidance, see Issue #5: Application hangs on Multi-GPU systems.

Windows Subsystem for Linux (WSL) support#

Microsoft does not currently support mGPU setup in WSL.

Simultaneous parallel compute workloads#

Radeon GPUs do not support large numbers of simultaneous, parallel compute workloads. AMD recommends running no more than 2 simultaneous compute workloads, assuming the workloads run alongside a graphics environment (e.g., a Linux desktop).
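
One way to enforce this two-workload limit from a single launcher process is a counting semaphore. The sketch below is an assumption, not part of ROCm: `run_workload` is a placeholder for submitting a real compute job.

```python
import threading
import time

MAX_CONCURRENT_WORKLOADS = 2  # per the guideline above
gpu_slots = threading.BoundedSemaphore(MAX_CONCURRENT_WORKLOADS)

active = 0
peak = 0
lock = threading.Lock()

def run_workload(name):
    """Placeholder for launching one compute workload."""
    global active, peak
    with gpu_slots:              # blocks while two workloads are in flight
        with lock:
            active += 1
            peak = max(peak, active)
        time.sleep(0.05)         # stand-in for the actual compute work
        with lock:
            active -= 1

threads = [threading.Thread(target=run_workload, args=(f"job-{i}",))
           for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(peak)  # never exceeds MAX_CONCURRENT_WORKLOADS
```

Queuing additional jobs behind the semaphore keeps the GPU load within the recommended limit instead of oversubscribing it.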

GPU isolation techniques#

For more information, see GPU isolation techniques.
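
As one isolation approach, the HIP runtime honors the `HIP_VISIBLE_DEVICES` environment variable, which restricts a process to a subset of GPUs (each worker then sees its assigned GPU as device 0). The sketch below sets it per child process; the child command here merely echoes the variable, where a real launcher would run a training script.

```python
import os
import subprocess
import sys

def spawn_on_gpu(gpu_index):
    """Launch a child process that can see only the given GPU."""
    env = dict(os.environ, HIP_VISIBLE_DEVICES=str(gpu_index))
    out = subprocess.run(
        [sys.executable, "-c",
         "import os; print(os.environ['HIP_VISIBLE_DEVICES'])"],
        env=env, capture_output=True, text=True, check=True)
    return out.stdout.strip()

# One worker per GPU; each inherits a different device mask.
print([spawn_on_gpu(i) for i in range(2)])
```

`ROCR_VISIBLE_DEVICES` works similarly at the ROCm runtime level; see the linked GPU isolation documentation for the differences.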

PCIe atomic operations#

Some consumer-grade motherboards may support PCIe atomic operations only on the first PCIe slot. For unexpected issues, see How ROCm uses PCIe atomics.

Errors due to GPU and PCIe configuration#

When using two AMD Radeon RX 7900 XTX GPUs, the following HIP error is observed when running PyTorch micro-benchmarking if either of the two GPUs is connected to a non-CPU PCIe slot (PCIe on chipset):

RuntimeError: HIP error: the operation cannot be performed in the present state
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing HIP_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.
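
As the message suggests, forcing synchronous kernel launches makes the error surface at the failing call site rather than at a later API call. A minimal sketch (the benchmark command is a placeholder, not a real script name):

```shell
# Report HIP errors at the failing launch instead of asynchronously.
export HIP_LAUNCH_BLOCKING=1
# Then rerun the workload, e.g. (placeholder command):
# python micro_benchmark.py
```

Connecting both GPUs to CPU-attached PCIe slots, per the hardware considerations above, avoids the error itself.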

Potential GPU reset with some mixed graphics and compute workloads#

Working with certain mixed graphics and compute workloads may result in a GPU reset on Radeon GPUs.

Currently identified scenarios include:

  • Running multiple ML workloads simultaneously while using the desktop

  • Running ML workloads while simultaneously using Blender/HIP