Limitations and recommended settings#
This section provides information on software and configuration limitations.
Note
For ROCm on Instinct known issues, refer to AMD ROCm Documentation. For OpenMPI limitations, see ROCm UCX OpenMPI on GitHub.
6.3.2 release known issues#
- Intermittent script failures may be observed while running Stable Diffusion training workloads using TensorFlow.
- Intermittent script failures may be observed while running Triton examples.
- Increased memory consumption may be observed while running TensorFlow ResNet50 training workloads.
- A performance drop may be observed while running ONNX Runtime scripts with INT8 precision.
- A script hang may be observed while running RetinaNet training workloads with batch size 32 using TensorFlow.
- A performance drop may be observed while running BERT training workloads across multi-GPU configurations.
- A black image may be generated while running Stable Diffusion 2.1 FP16 using PyTorch.
WSL-specific issues#
- A TDR may be triggered by some long-running rocSPARSE kernels.
- A TDR may be triggered by running rocsolver-test.
- Text-to-text generation causes a driver timeout when using torch-2.3.0.
- Command line scripts may hang when running models that exceed the memory capacity of the GPU.
- Failures may occur when running ONNX Runtime training scripts.
Important!
Radeon™ PRO Series graphics cards are not designed nor recommended for datacenter usage. Use in a datacenter setting may adversely affect manageability, efficiency, reliability, and/or performance. GD-239.
Important!
ROCm is not officially supported on any mobile SKUs.
Multi-GPU configuration#
Windows Subsystem for Linux (WSL)#
WSL recommended settings and limitations.
WSL recommended settings#
Optimizing GPU utilization
WSL overhead is a known bottleneck for GPU utilization. Increasing the batch size of operations loads the GPU more fully, reducing the time required for AI workloads. Optimal batch sizes vary by model and macro-parameters.
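Since the optimal batch size is model-dependent, one practical approach is to time the same workload at several batch sizes and compare. The sketch below is illustrative only: `train.py` and its `--batch-size` flag are placeholders for your own training entry point.

```shell
# Rough batch-size sweep (sketch): run the same workload at several
# batch sizes and report wall-clock time for each run.
for bs in 16 32 64 128; do
  start=$(date +%s)
  # python train.py --batch-size "$bs"   # substitute your real training command
  end=$(date +%s)
  echo "batch_size=$bs elapsed=$((end - start))s"
done
```

When runs process different numbers of samples, compare samples per second rather than raw wall time.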
ROCm support in WSL environments#
Due to WSL architectural limitations for the native Linux User Kernel Interface (UKI), rocm-smi is not supported.

| Issue | Limitation |
|---|---|
| UKI does not currently support `rocm-smi` | Not currently supported. |
Running PyTorch in virtual environments#
Running PyTorch in a virtual environment requires a manual update of `libhsa-runtime64.so`. When using the WSL usecase with the `hsa-runtime-rocr4wsl-amdgpu` package (installed with PyTorch wheels), you must replace the bundled runtime with a WSL-compatible runtime library.
Solution:
Enter the following commands:
```shell
# Locate the pip-installed torch package
location=$(pip show torch | grep Location | awk -F ": " '{print $2}')
cd "${location}/torch/lib/"
# Remove the bundled HSA runtime and replace it with the WSL-compatible one
rm libhsa-runtime64.so*
cp /opt/rocm/lib/libhsa-runtime64.so.1.2 libhsa-runtime64.so
```
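After replacing the library, you can sanity-check which runtime file the torch package now contains. This is a sketch that assumes torch was installed with pip; it simply lists the runtime library in the package's lib directory.

```shell
# List the HSA runtime now present in the torch package. The awk pattern
# matches the "Location:" line of `pip show torch`.
location=$(pip show torch 2>/dev/null | awk -F ': ' '/^Location/ {print $2}')
if [ -n "$location" ]; then
  ls -l "${location}/torch/lib/"libhsa-runtime64.so*
else
  echo "torch not found via pip in this environment"
fi
```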