Limitations#
This section provides information on software and configuration limitations.
Multi-GPU configuration#
Because validation of ROCm on Radeon multi-GPU configurations is limited at this time, we have identified common errors and applicable recommendations.
Important! The ROCm 5.7 release is limited to preview support for multi-GPU configurations.
At this time, only a limited amount of validation has been performed. AMD recommends proceeding only with advanced know-how and at your own discretion.
Visit the AI community to share feedback, and Report a bug if you find any issues.
Recommended system configuration for multi-GPU#
The PCIe® slots that the GPUs are connected to must have identical PCIe lane width or bifurcation settings, and must support PCIe 3.0 Atomics.
Refer to How ROCm uses PCIe Atomics for more information.
Example:
✓ - GPU0 PCIe x16 connection + GPU1 PCIe x16 connection
✓ - GPU0 PCIe x8 connection + GPU1 PCIe x8 connection
X - GPU0 PCIe x16 connection + GPU1 PCIe x8 connection
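To verify that both GPUs negotiate the same lane width, you can read the standard PCIe attributes the Linux kernel exposes in sysfs. The following is a minimal sketch, assuming a Linux system where each amdgpu device appears under /sys/class/drm/cardN; it only reports what the kernel exposes and does not change any configuration.

```python
# Sketch: print the negotiated and maximum PCIe link width and the link speed
# for each AMD GPU, so you can confirm all GPUs report identical lane widths
# (e.g. x16 + x16, or x8 + x8). Assumes Linux sysfs paths for amdgpu devices.
import glob
import os

def read_attr(dev, attr):
    """Return a PCIe sysfs attribute, or 'n/a' if the kernel doesn't expose it."""
    path = os.path.join(dev, attr)
    return open(path).read().strip() if os.path.exists(path) else "n/a"

for card in sorted(glob.glob("/sys/class/drm/card?")):
    dev = os.path.join(card, "device")
    try:
        with open(os.path.join(dev, "vendor")) as f:
            vendor = f.read().strip()
    except OSError:
        continue
    if vendor != "0x1002":          # 0x1002 = AMD PCI vendor ID
        continue
    print(f"{card}: width x{read_attr(dev, 'current_link_width')} "
          f"(max x{read_attr(dev, 'max_link_width')}), "
          f"speed {read_attr(dev, 'current_link_speed')}")
```

If the reported widths differ between GPUs (for example x16 on one and x8 on the other), adjust the slot assignment or bifurcation settings so they match.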
Important!
Only use PCIe slots connected to the CPU, and avoid PCIe slots connected through the chipset. Refer to product-specific motherboard documentation for the PCIe electrical configuration.
Ensure the PSU has sufficient wattage to support multiple GPUs.
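To cross-check slot wiring against your motherboard documentation, it can help to see each GPU's PCIe address and the chain of upstream bridges leading to the PCIe root. The sketch below assumes a Linux system with the amdgpu driver; it only prints the topology, and whether a given slot is wired to the CPU or to the chipset must still be confirmed from the board-specific documentation.

```python
# Sketch: print each AMD GPU's PCIe bus address together with the chain of
# upstream bridges leading to the PCIe root. Compare the resulting slot
# against your motherboard manual to confirm it is wired to the CPU rather
# than to the chipset. Assumes Linux sysfs; interpretation is board-specific.
import glob
import os

for card in sorted(glob.glob("/sys/class/drm/card?")):
    dev = os.path.realpath(os.path.join(card, "device"))
    try:
        with open(os.path.join(dev, "vendor")) as f:
            if f.read().strip() != "0x1002":     # 0x1002 = AMD PCI vendor ID
                continue
    except OSError:
        continue
    # Walk up the sysfs hierarchy, collecting the PCI addresses of the GPU
    # and every bridge above it, stopping at the root complex.
    chain = []
    node = dev
    while os.path.basename(node).count(":") == 2:   # looks like 0000:xx:yy.z
        chain.append(os.path.basename(node))
        node = os.path.dirname(node)
    print(f"{card}: " + " <- ".join(chain))
```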
Errors due to GPU and PCIe configuration#
When using two AMD Radeon RX 7900 XTX GPUs, the following HIP error is observed when running PyTorch micro-benchmarking if either of the two GPUs is connected to a non-CPU PCIe slot (a slot connected through the chipset):
RuntimeError: HIP error: the operation cannot be performed in the present state
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing HIP_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.
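This error typically surfaces as soon as a kernel is launched on the chipset-attached GPU. A minimal multi-GPU check such as the sketch below (a hypothetical script, assuming a ROCm build of PyTorch with both GPUs visible) can confirm whether each GPU executes kernels correctly; setting HIP_LAUNCH_BLOCKING=1, as the error message suggests, makes the failing call report synchronously.

```python
# Sketch: run a small matrix multiply on each visible GPU. With a ROCm build
# of PyTorch, the torch.cuda.* API maps to HIP devices. Run with
# HIP_LAUNCH_BLOCKING=1 so any HIP launch error is reported at the offending
# call rather than at a later, unrelated API call.
import torch

assert torch.cuda.is_available(), "No ROCm/HIP device visible to PyTorch"

for i in range(torch.cuda.device_count()):
    device = torch.device(f"cuda:{i}")
    print(f"Device {i}: {torch.cuda.get_device_name(i)}")
    a = torch.randn(1024, 1024, device=device)
    b = torch.randn(1024, 1024, device=device)
    c = a @ b                       # kernel launch; fails here on a misconfigured GPU
    torch.cuda.synchronize(device)  # force completion so errors are reported now
    print(f"  matmul OK, result norm = {c.norm().item():.2f}")
```

Run it as, for example, `HIP_LAUNCH_BLOCKING=1 python check_gpus.py` (the script name is illustrative). If the check fails only on one GPU, verify that its slot is connected to the CPU rather than to the chipset, as described above.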