This section describes software and configuration limitations.
Because validation of ROCm™ on Radeon™ multi-GPU configurations is limited at this time, this section lists common errors and the applicable recommendations.
Important! The ROCm 6.0.2 release provides only preview support for multi-GPU configurations.
Only a limited amount of validation has been performed so far. AMD recommends proceeding only with advanced know-how and at the user's discretion.
The PCIe® slots used by the GPUs must have identical PCIe lane width or bifurcation settings, and must support PCIe 3.0 Atomics.
Refer to How ROCm uses PCIe Atomics for more information.
✓ - GPU0 PCIe x16 connection + GPU1 PCIe x16 connection
✓ - GPU0 PCIe x8 connection + GPU1 PCIe x8 connection
X - GPU0 PCIe x16 connection + GPU1 PCIe x8 connection
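On Linux, the negotiated link width of each GPU can be read from standard PCI sysfs attributes, which makes mismatches easy to spot. A minimal sketch (the sysfs paths are standard amdgpu/PCI attributes; this only inspects the configuration, it does not change it):

```python
# Sketch: compare the negotiated PCIe link width across GPUs via sysfs.
# Mismatched widths (e.g. one x16 and one x8) indicate an unsupported
# multi-GPU configuration per the guidance above.
from pathlib import Path

widths = {}
for dev in Path("/sys/class/drm").glob("card*/device"):
    attr = dev / "current_link_width"
    if attr.is_file():
        widths[str(dev)] = attr.read_text().strip()

for dev, width in widths.items():
    print(f"{dev}: x{width}")

if len(set(widths.values())) > 1:
    print("Warning: GPUs are running at different PCIe link widths")
```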
Only use PCIe slots connected to the CPU, and avoid PCIe slots connected via the chipset. Refer to product-specific motherboard documentation for the PCIe electrical configuration.
Ensure the PSU has sufficient wattage to support multiple GPUs.
Errors due to GPU and PCIe configuration
When using two AMD Radeon™ RX 7900 XTX GPUs, the following HIP error is observed when running PyTorch micro-benchmarking if either of the two GPUs is connected to a chipset PCIe slot rather than a CPU-connected slot:
```
RuntimeError: HIP error: the operation cannot be performed in the present state
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing HIP_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.
```
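As the error text suggests, setting `HIP_LAUNCH_BLOCKING=1` makes kernel launches synchronous so the failure is reported at the actual failing call. A minimal sketch of applying it from Python (assumes a ROCm build of PyTorch, where `torch.cuda` maps to HIP devices; the import fallback is illustrative):

```python
# Sketch: surface HIP errors at the failing call instead of a later API call.
import os

# HIP_LAUNCH_BLOCKING must be set before PyTorch initializes HIP,
# so set it before importing torch.
os.environ["HIP_LAUNCH_BLOCKING"] = "1"

try:
    import torch
except ImportError:
    torch = None  # PyTorch not installed; the env var above still applies

if torch is not None and torch.cuda.is_available():
    # Enumerate the GPUs PyTorch can see; in a correctly configured
    # dual-GPU setup both devices should appear here.
    for i in range(torch.cuda.device_count()):
        print(i, torch.cuda.get_device_name(i))
```

Alternatively, export the variable in the shell (`HIP_LAUNCH_BLOCKING=1 python benchmark.py`) before launching the workload.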
Potential GPU reset with some mixed graphics and compute workloads
Working with certain mixed graphics and compute workloads may result in a GPU reset on Radeon GPUs.
Currently identified scenarios include:
- Running multiple ML workloads simultaneously while using the desktop
- Running ML workloads while simultaneously using Blender/HIP