Limitations#

This section provides information on software and configuration limitations.

Important!
Radeon™ PRO Series graphics cards are not designed nor recommended for datacenter usage. Use in a datacenter setting may adversely affect manageability, efficiency, reliability, and/or performance. GD-239.

Important!
ROCm is not officially supported on any mobile SKUs.

Multi-GPU configuration#

Because validation of ROCm™ on Radeon™ multi-GPU configurations is limited at this time, this section lists commonly identified errors and applicable recommendations.

Important!
The ROCm 6.0.2 release is limited to preview support for multi-GPU configurations.

At this time, only a limited amount of validation has been performed. AMD recommends proceeding only with advanced know-how, and at the user's discretion.

Visit the AI community to share feedback, and Report a bug if you find any issues.

The PCIe® slots that the GPUs are connected to must have identical PCIe lane widths or bifurcation settings, and must support PCIe 3.0 Atomics.

Refer to How ROCm uses PCIe Atomics for more information.

Example:

- Supported: GPU0 PCIe x16 connection + GPU1 PCIe x16 connection

- Supported: GPU0 PCIe x8 connection + GPU1 PCIe x8 connection

- Not supported: GPU0 PCIe x16 connection + GPU1 PCIe x8 connection
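
To confirm that both GPUs negotiated identical lane widths, you can read the link width directly from Linux sysfs. The following Python sketch is illustrative only (it assumes a Linux host and filters on the AMD vendor ID 0x1002); it is not an official ROCm tool.

```python
from pathlib import Path

# Collect the current PCIe lane width of every AMD display device.
widths = {}
for dev in sorted(Path("/sys/bus/pci/devices").iterdir()):
    try:
        vendor = (dev / "vendor").read_text().strip()
        pci_class = (dev / "class").read_text().strip()
    except OSError:
        continue
    # 0x1002 is the AMD vendor ID; 0x03... is the PCI display-controller class.
    if vendor == "0x1002" and pci_class.startswith("0x03"):
        widths[dev.name] = (dev / "current_link_width").read_text().strip()

for addr, width in widths.items():
    print(f"{addr}: x{width}")
if len(set(widths.values())) > 1:
    print("Warning: GPUs are running at different PCIe lane widths.")
```

`lspci -vv` reports the same information under `LnkSta` if you prefer to inspect it manually.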

Important!

  • Only use PCIe slots connected to the CPU, and avoid PCIe slots connected via the chipset. Refer to product-specific motherboard documentation for the PCIe electrical configuration.

  • Ensure the PSU has sufficient wattage to support multiple GPUs.

Errors due to GPU and PCIe configuration#

When using two AMD Radeon™ RX 7900 XTX GPUs, the following HIP error is observed when running PyTorch micro-benchmarking if either of the two GPUs is connected to a non-CPU PCIe slot (a PCIe slot provided by the chipset):

```
RuntimeError: HIP error: the operation cannot be performed in the present state
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing HIP_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.
```
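
To localize this failure, it can help to exercise each GPU in isolation with launch blocking enabled, as the error message itself suggests. The following Python sketch is an illustrative diagnostic, not an official reproduction of the micro-benchmark; the tensor sizes are arbitrary.

```python
import os
# Surface HIP errors synchronously; must be set before PyTorch initializes HIP.
os.environ["HIP_LAUNCH_BLOCKING"] = "1"

import torch

# Launch a small kernel on each GPU in isolation to see which device
# (and therefore which PCIe slot) triggers the error.
for i in range(torch.cuda.device_count()):
    try:
        x = torch.randn(1024, 1024, device=f"cuda:{i}")
        torch.matmul(x, x).sum().item()  # .item() forces a device sync
        print(f"cuda:{i} ({torch.cuda.get_device_name(i)}): OK")
    except RuntimeError as err:
        print(f"cuda:{i}: FAILED -> {err}")
```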

Errors due to GPU and PCIe configuration with AI workloads#

When using two AMD Radeon™ RX 7900 XTX GPUs, both cards must be connected to CPU-controlled PCIe slots. Failing to do so might result in errors during AI workflows.

Potential GPU reset with some mixed graphics and compute workloads#

Working with certain mixed graphics and compute workloads may result in a GPU reset on Radeon GPUs.

Currently identified scenarios include:

  • Running multiple ML workloads simultaneously while using the desktop

  • Running ML workloads while simultaneously using Blender/HIP

6.0.2 release known issues#

  • Running PyTorch with both the iGPU and a discrete GPU enabled may cause crashes (see the mitigation sketch after this list).

  • A GPU reset may occur when running multiple heavy Machine Learning workloads at the same time for an extended period.

  • Intermittent GPU reset errors may be seen with the AUTOMATIC1111 Stable Diffusion web UI when IOMMU is enabled. See https://community.amd.com/t5/knowledge-base/tkb-p/amd-rocm-tkb for suggested resolutions.

  • The RX 7900 GRE may hang rather than report an Out of Memory error on BERT FP32 training workloads.

  • A soft hang may be observed when running multi-queue workloads.
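
For the iGPU + discrete GPU crash noted above, one possible mitigation is to hide the iGPU from the HIP runtime so that PyTorch enumerates only the discrete GPU. The sketch below assumes the discrete GPU is HIP device 0; verify the actual index on your system (for example, with rocminfo) before relying on it.

```python
import os
# Expose only the discrete GPU to the HIP runtime; must be set before
# importing torch. Device index 0 is an assumption -- verify with rocminfo.
os.environ["HIP_VISIBLE_DEVICES"] = "0"

import torch

print(torch.cuda.device_count())      # expected: 1
print(torch.cuda.get_device_name(0))  # should name the discrete GPU
```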