Training a model with PyTorch on ROCm

Training a model with PyTorch on ROCm#

2025-10-23

41 min read time

Applies to Linux

Note

For a unified training solution on AMD GPUs with ROCm, the rocm/pytorch-training Docker Hub registry will be deprecated soon in favor of rocm/primus. The rocm/primus Docker containers will cover PyTorch training ecosystem frameworks, including torchtitan and Megatron-LM.

See Training a model with Primus and PyTorch for details.

PyTorch is an open-source machine learning framework that is widely used for model training with GPU-optimized components for transformer-based models. The PyTorch for ROCm training Docker image provides a prebuilt optimized environment for fine-tuning and pretraining a model on AMD Instinct MI325X and MI300X GPUs. It includes the following software components to accelerate training workloads:

MI355X and MI350X

Software component	Version
ROCm	7.0.0
Primus	aab4234
PyTorch	2.9.0.dev20250821+rocm7.0.0.lw.git125803b7
Python	3.10
Transformer Engine	2.2.0.dev0+54dd2bdc
Flash Attention	2.8.3
hipBLASLt	911283acd1
Triton	3.4.0+rocm7.0.0.git56765e8c
RCCL	2.26.6

MI325X and MI300X

Software component	Version
ROCm	7.0.0
Primus	aab4234
PyTorch	2.9.0.dev20250821+rocm7.0.0.lw.git125803b7
Python	3.10
Transformer Engine	2.2.0.dev0+54dd2bdc
Flash Attention	2.8.3
hipBLASLt	911283acd1
Triton	3.4.0+rocm7.0.0.git56765e8c
RCCL	2.26.6

Supported models#

The following models are pre-optimized for performance on the AMD Instinct MI355X, MI350X, MI325X, and MI300X GPUs. Some instructions, commands, and training recommendations in this documentation might vary by model – select one to get started.

Model

Meta Llama

OpenAI

Qwen

Stable Diffusion

Flux

NCF

Variant

Llama 4 Scout 17B-16E

Llama 3.3 70B

Llama 3.2 1B

Llama 3.2 3B

Llama 3.2 Vision 11B

Llama 3.2 Vision 90B

Llama 3.1 8B

Llama 3.1 70B

Llama 3.1 405B

Llama 3 8B

Llama 3 70B

Llama 2 7B

Llama 2 13B

Llama 2 70B

GPT OSS 20B

GPT OSS 120B

Qwen 3 8B

Qwen 3 32B

Qwen 2.5 32B

Qwen 2.5 72B

Qwen 2 1.5B

Qwen 2 7B

Stable Diffusion XL

FLUX.1-dev

NCF

The following table lists supported training modes per model.

Supported training modes

Model	Supported training modes
Llama 4 Scout 17B-16E	`finetune_fw`, `finetune_lora`
Llama 3.3 70B	`finetune_fw`, `finetune_lora`, `finetune_qlora`
Llama 3.2 1B	`finetune_fw`, `finetune_lora`
Llama 3.2 3B	`finetune_fw`, `finetune_lora`
Llama 3.2 Vision 11B	`finetune_fw`
Llama 3.2 Vision 90B	`finetune_fw`
Llama 3.1 8B	`pretrain`, `finetune_fw`, `finetune_lora`, `HF_pretrain`
Llama 3.1 70B	`pretrain`, `finetune_fw`, `finetune_lora`
Llama 3.1 405B	`finetune_qlora`
Llama 3 8B	`finetune_fw`, `finetune_lora`
Llama 3 70B	`finetune_fw`, `finetune_lora`
Llama 2 7B	`finetune_fw`, `finetune_lora`, `finetune_qlora`
Llama 2 13B	`finetune_fw`, `finetune_lora`
Llama 2 70B	`finetune_lora`, `finetune_qlora`
GPT OSS 20B	`HF_finetune_lora`
GPT OSS 120B	`HF_finetune_lora`
Qwen 3 8B	`finetune_fw`, `finetune_lora`
Qwen 3 32B	`finetune_lora`
Qwen 2.5 32B	`finetune_lora`
Qwen 2.5 72B	`finetune_lora`
Qwen 2 1.5B	`finetune_fw`, `finetune_lora`
Qwen 2 7B	`finetune_fw`, `finetune_lora`
Stable Diffusion XL	`posttrain-p`
FLUX.1-dev	`posttrain-p`

Note

Some model and fine-tuning combinations are not listed. This is because the upstream torchtune repository doesn’t provide default YAML configurations for them. For advanced usage, you can create a custom configuration to enable unlisted fine-tuning methods by using an existing file in the /workspace/torchtune/recipes/configs directory as a template.

Performance measurements#

To evaluate performance, the Performance results with AMD ROCm software page provides reference throughput and latency measurements for training popular AI models.

Note

The performance data presented in Performance results with AMD ROCm software should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X GPUs or ROCm software.

System validation#

Before running AI workloads, it’s important to validate that your AMD hardware is configured correctly and performing optimally.

If you have already validated your system settings, including aspects like NUMA auto-balancing, you can skip this step. Otherwise, complete the procedures in the System validation and optimization guide to properly configure your system settings before starting training.

To test for optimal performance, consult the recommended System health benchmarks. This suite of tests will help you verify and fine-tune your system’s configuration.

This Docker image is optimized for specific model configurations outlined below. Performance can vary for other training workloads, as AMD doesn’t test configurations and run conditions outside those described.

Run training#

Once the setup is complete, choose between two options to start benchmarking training:

MAD-integrated benchmarking

The following run command is tailored to Llama 4 Scout 17B-16E. See Supported models to switch to another available model.

Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.
```
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
```
For example, use this command to run the performance benchmark test on the Llama 4 Scout 17B-16E model using one node with the BF16 data type on the host machine.
```
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
madengine run \
    --tags pyt_train_llama-4-scout-17b-16e \
    --keep-model-dir \
    --live-output \
    --timeout 28800
```
MAD launches a Docker container with the name container_ci-pyt_train_llama-4-scout-17b-16e. The latency and throughput reports of the model are collected in ~/MAD/perf.csv.

The following run command is tailored to Llama 3.3 70B. See Supported models to switch to another available model.

Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.
```
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
```
For example, use this command to run the performance benchmark test on the Llama 3.3 70B model using one node with the BF16 data type on the host machine.
```
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
madengine run \
    --tags pyt_train_llama-3.3-70b \
    --keep-model-dir \
    --live-output \
    --timeout 28800
```
MAD launches a Docker container with the name container_ci-pyt_train_llama-3.3-70b. The latency and throughput reports of the model are collected in ~/MAD/perf.csv.

The following run command is tailored to Llama 3.2 1B. See Supported models to switch to another available model.

Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.
```
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
```
For example, use this command to run the performance benchmark test on the Llama 3.2 1B model using one node with the BF16 data type on the host machine.
```
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
madengine run \
    --tags pyt_train_llama-3.2-1b \
    --keep-model-dir \
    --live-output \
    --timeout 28800
```
MAD launches a Docker container with the name container_ci-pyt_train_llama-3.2-1b. The latency and throughput reports of the model are collected in ~/MAD/perf.csv.

The following run command is tailored to Llama 3.2 3B. See Supported models to switch to another available model.

Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.
```
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
```
For example, use this command to run the performance benchmark test on the Llama 3.2 3B model using one node with the BF16 data type on the host machine.
```
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
madengine run \
    --tags pyt_train_llama-3.2-3b \
    --keep-model-dir \
    --live-output \
    --timeout 28800
```
MAD launches a Docker container with the name container_ci-pyt_train_llama-3.2-3b. The latency and throughput reports of the model are collected in ~/MAD/perf.csv.

The following run command is tailored to Llama 3.2 Vision 11B. See Supported models to switch to another available model.

Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.
```
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
```
For example, use this command to run the performance benchmark test on the Llama 3.2 Vision 11B model using one node with the BF16 data type on the host machine.
```
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
madengine run \
    --tags pyt_train_llama-3.2-vision-11b \
    --keep-model-dir \
    --live-output \
    --timeout 28800
```
MAD launches a Docker container with the name container_ci-pyt_train_llama-3.2-vision-11b. The latency and throughput reports of the model are collected in ~/MAD/perf.csv.

The following run command is tailored to Llama 3.2 Vision 90B. See Supported models to switch to another available model.

Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.
```
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
```
For example, use this command to run the performance benchmark test on the Llama 3.2 Vision 90B model using one node with the BF16 data type on the host machine.
```
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
madengine run \
    --tags pyt_train_llama-3.2-vision-90b \
    --keep-model-dir \
    --live-output \
    --timeout 28800
```
MAD launches a Docker container with the name container_ci-pyt_train_llama-3.2-vision-90b. The latency and throughput reports of the model are collected in ~/MAD/perf.csv.

The following run command is tailored to Llama 3.1 8B. See Supported models to switch to another available model.

Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.
```
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
```
For example, use this command to run the performance benchmark test on the Llama 3.1 8B model using one node with the BF16 data type on the host machine.
```
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
madengine run \
    --tags pyt_train_llama-3.1-8b \
    --keep-model-dir \
    --live-output \
    --timeout 28800
```
MAD launches a Docker container with the name container_ci-pyt_train_llama-3.1-8b. The latency and throughput reports of the model are collected in ~/MAD/perf.csv.

The following run command is tailored to Llama 3.1 70B. See Supported models to switch to another available model.

Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.
```
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
```
For example, use this command to run the performance benchmark test on the Llama 3.1 70B model using one node with the BF16 data type on the host machine.
```
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
madengine run \
    --tags pyt_train_llama-3.1-70b \
    --keep-model-dir \
    --live-output \
    --timeout 28800
```
MAD launches a Docker container with the name container_ci-pyt_train_llama-3.1-70b. The latency and throughput reports of the model are collected in ~/MAD/perf.csv.

The following run command is tailored to Llama 3.1 405B. See Supported models to switch to another available model.

Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.
```
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
```
For example, use this command to run the performance benchmark test on the Llama 3.1 405B model using one node with the BF16 data type on the host machine.
```
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
madengine run \
    --tags pyt_train_llama-3.1-405b \
    --keep-model-dir \
    --live-output \
    --timeout 28800
```
MAD launches a Docker container with the name container_ci-pyt_train_llama-3.1-405b. The latency and throughput reports of the model are collected in ~/MAD/perf.csv.

The following run command is tailored to Llama 3 8B. See Supported models to switch to another available model.

Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.
```
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
```
For example, use this command to run the performance benchmark test on the Llama 3 8B model using one node with the BF16 data type on the host machine.
```
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
madengine run \
    --tags pyt_train_llama-3-8b \
    --keep-model-dir \
    --live-output \
    --timeout 28800
```
MAD launches a Docker container with the name container_ci-pyt_train_llama-3-8b. The latency and throughput reports of the model are collected in ~/MAD/perf.csv.

The following run command is tailored to Llama 3 70B. See Supported models to switch to another available model.

Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.
```
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
```
For example, use this command to run the performance benchmark test on the Llama 3 70B model using one node with the BF16 data type on the host machine.
```
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
madengine run \
    --tags pyt_train_llama-3-70b \
    --keep-model-dir \
    --live-output \
    --timeout 28800
```
MAD launches a Docker container with the name container_ci-pyt_train_llama-3-70b. The latency and throughput reports of the model are collected in ~/MAD/perf.csv.

The following run command is tailored to Llama 2 7B. See Supported models to switch to another available model.

Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.
```
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
```
For example, use this command to run the performance benchmark test on the Llama 2 7B model using one node with the BF16 data type on the host machine.
```
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
madengine run \
    --tags pyt_train_llama-2-7b \
    --keep-model-dir \
    --live-output \
    --timeout 28800
```
MAD launches a Docker container with the name container_ci-pyt_train_llama-2-7b. The latency and throughput reports of the model are collected in ~/MAD/perf.csv.

The following run command is tailored to Llama 2 13B. See Supported models to switch to another available model.

Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.
```
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
```
For example, use this command to run the performance benchmark test on the Llama 2 13B model using one node with the BF16 data type on the host machine.
```
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
madengine run \
    --tags pyt_train_llama-2-13b \
    --keep-model-dir \
    --live-output \
    --timeout 28800
```
MAD launches a Docker container with the name container_ci-pyt_train_llama-2-13b. The latency and throughput reports of the model are collected in ~/MAD/perf.csv.

The following run command is tailored to Llama 2 70B. See Supported models to switch to another available model.

Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.
```
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
```
For example, use this command to run the performance benchmark test on the Llama 2 70B model using one node with the BF16 data type on the host machine.
```
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
madengine run \
    --tags pyt_train_llama-2-70b \
    --keep-model-dir \
    --live-output \
    --timeout 28800
```
MAD launches a Docker container with the name container_ci-pyt_train_llama-2-70b. The latency and throughput reports of the model are collected in ~/MAD/perf.csv.

The following run command is tailored to GPT OSS 20B. See Supported models to switch to another available model.

Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.
```
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
```
For example, use this command to run the performance benchmark test on the GPT OSS 20B model using one node with the BF16 data type on the host machine.
```
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
madengine run \
    --tags pyt_train_gpt_oss_20b \
    --keep-model-dir \
    --live-output \
    --timeout 28800
```
MAD launches a Docker container with the name container_ci-pyt_train_gpt_oss_20b. The latency and throughput reports of the model are collected in ~/MAD/perf.csv.

The following run command is tailored to GPT OSS 120B. See Supported models to switch to another available model.

Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.
```
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
```
For example, use this command to run the performance benchmark test on the GPT OSS 120B model using one node with the BF16 data type on the host machine.
```
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
madengine run \
    --tags pyt_train_gpt_oss_120b \
    --keep-model-dir \
    --live-output \
    --timeout 28800
```
MAD launches a Docker container with the name container_ci-pyt_train_gpt_oss_120b. The latency and throughput reports of the model are collected in ~/MAD/perf.csv.

The following run command is tailored to Qwen 3 8B. See Supported models to switch to another available model.

Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.
```
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
```
For example, use this command to run the performance benchmark test on the Qwen 3 8B model using one node with the BF16 data type on the host machine.
```
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
madengine run \
    --tags pyt_train_qwen3-8b \
    --keep-model-dir \
    --live-output \
    --timeout 28800
```
MAD launches a Docker container with the name container_ci-pyt_train_qwen3-8b. The latency and throughput reports of the model are collected in ~/MAD/perf.csv.

The following run command is tailored to Qwen 3 32B. See Supported models to switch to another available model.

Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.
```
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
```
For example, use this command to run the performance benchmark test on the Qwen 3 32B model using one node with the BF16 data type on the host machine.
```
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
madengine run \
    --tags pyt_train_qwen3-32b \
    --keep-model-dir \
    --live-output \
    --timeout 28800
```
MAD launches a Docker container with the name container_ci-pyt_train_qwen3-32b. The latency and throughput reports of the model are collected in ~/MAD/perf.csv.

The following run command is tailored to Qwen 2.5 32B. See Supported models to switch to another available model.

Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.
```
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
```
For example, use this command to run the performance benchmark test on the Qwen 2.5 32B model using one node with the BF16 data type on the host machine.
```
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
madengine run \
    --tags pyt_train_qwen2.5-32b \
    --keep-model-dir \
    --live-output \
    --timeout 28800
```
MAD launches a Docker container with the name container_ci-pyt_train_qwen2.5-32b. The latency and throughput reports of the model are collected in ~/MAD/perf.csv.

The following run command is tailored to Qwen 2.5 72B. See Supported models to switch to another available model.

Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.
```
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
```
For example, use this command to run the performance benchmark test on the Qwen 2.5 72B model using one node with the BF16 data type on the host machine.
```
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
madengine run \
    --tags pyt_train_qwen2.5-72b \
    --keep-model-dir \
    --live-output \
    --timeout 28800
```
MAD launches a Docker container with the name container_ci-pyt_train_qwen2.5-72b. The latency and throughput reports of the model are collected in ~/MAD/perf.csv.

The following run command is tailored to Qwen 2 1.5B. See Supported models to switch to another available model.

Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.
```
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
```
For example, use this command to run the performance benchmark test on the Qwen 2 1.5B model using one node with the BF16 data type on the host machine.
```
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
madengine run \
    --tags pyt_train_qwen2-1.5b \
    --keep-model-dir \
    --live-output \
    --timeout 28800
```
MAD launches a Docker container with the name container_ci-pyt_train_qwen2-1.5b. The latency and throughput reports of the model are collected in ~/MAD/perf.csv.

The following run command is tailored to Qwen 2 7B. See Supported models to switch to another available model.

Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.
```
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
```
For example, use this command to run the performance benchmark test on the Qwen 2 7B model using one node with the BF16 data type on the host machine.
```
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
madengine run \
    --tags pyt_train_qwen2-7b \
    --keep-model-dir \
    --live-output \
    --timeout 28800
```
MAD launches a Docker container with the name container_ci-pyt_train_qwen2-7b. The latency and throughput reports of the model are collected in ~/MAD/perf.csv.

The following run command is tailored to Stable Diffusion XL. See Supported models to switch to another available model.

Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.
```
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
```
For example, use this command to run the performance benchmark test on the Stable Diffusion XL model using one node with the BF16 data type on the host machine.
```
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
madengine run \
    --tags pyt_huggingface_stable_diffusion_xl_2k_lora_finetuning \
    --keep-model-dir \
    --live-output \
    --timeout 28800
```
MAD launches a Docker container with the name container_ci-pyt_huggingface_stable_diffusion_xl_2k_lora_finetuning. The latency and throughput reports of the model are collected in ~/MAD/perf.csv.

The following run command is tailored to FLUX.1-dev. See Supported models to switch to another available model.

Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.
```
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
```
For example, use this command to run the performance benchmark test on the FLUX.1-dev model using one node with the BF16 data type on the host machine.
```
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
madengine run \
    --tags pyt_train_flux \
    --keep-model-dir \
    --live-output \
    --timeout 28800
```
MAD launches a Docker container with the name container_ci-pyt_train_flux. The latency and throughput reports of the model are collected in ~/MAD/perf.csv.

The following run command is tailored to NCF. See Supported models to switch to another available model.

Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.
```
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
```
For example, use this command to run the performance benchmark test on the NCF model using one node with the FP32 data type on the host machine.
```
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
madengine run \
    --tags pyt_ncf_training \
    --keep-model-dir \
    --live-output \
    --timeout 28800
```
MAD launches a Docker container with the name container_ci-pyt_ncf_training. The latency and throughput reports of the model are collected in ~/MAD/perf.csv.

Standalone benchmarking

The following commands are tailored to Llama 4 Scout 17B-16E. See Supported models to switch to another available model.

The following commands are tailored to Llama 3.3 70B. See Supported models to switch to another available model.

The following commands are tailored to Llama 3.2 1B. See Supported models to switch to another available model.

The following commands are tailored to Llama 3.2 3B. See Supported models to switch to another available model.

The following commands are tailored to Llama 3.2 Vision 11B. See Supported models to switch to another available model.

The following commands are tailored to Llama 3.2 Vision 90B. See Supported models to switch to another available model.

The following commands are tailored to Llama 3.1 8B. See Supported models to switch to another available model.

The following commands are tailored to Llama 3.1 70B. See Supported models to switch to another available model.

The following commands are tailored to Llama 3.1 405B. See Supported models to switch to another available model.

The following commands are tailored to Llama 3 8B. See Supported models to switch to another available model.

The following commands are tailored to Llama 3 70B. See Supported models to switch to another available model.

The following commands are tailored to Llama 2 7B. See Supported models to switch to another available model.

The following commands are tailored to Llama 2 13B. See Supported models to switch to another available model.

The following commands are tailored to Llama 2 70B. See Supported models to switch to another available model.

The following commands are tailored to GPT OSS 20B. See Supported models to switch to another available model.

The following commands are tailored to GPT OSS 120B. See Supported models to switch to another available model.

The following commands are tailored to Qwen 3 8B. See Supported models to switch to another available model.

The following commands are tailored to Qwen 3 32B. See Supported models to switch to another available model.

The following commands are tailored to Qwen 2.5 32B. See Supported models to switch to another available model.

The following commands are tailored to Qwen 2.5 72B. See Supported models to switch to another available model.

The following commands are tailored to Qwen 2 1.5B. See Supported models to switch to another available model.

The following commands are tailored to Qwen 2 7B. See Supported models to switch to another available model.

The following commands are tailored to Stable Diffusion XL. See Supported models to switch to another available model.

The following commands are tailored to FLUX.1-dev. See Supported models to switch to another available model.

The following commands are tailored to NCF. See Supported models to switch to another available model.

Download the Docker image and required packages

Use the following command to pull the Docker image from Docker Hub.
MI355X and MI350X
docker pull rocm/pytorch-training:v25.9_gfx950
MI325X and MI300X
docker pull rocm/pytorch-training:v25.9_gfx942

Launch the Docker container.

MI355X and MI350X

docker run -it \
    --device /dev/dri \
    --device /dev/kfd \
    --network host \
    --ipc host \
    --group-add video \
    --cap-add SYS_PTRACE \
    --security-opt seccomp=unconfined \
    --privileged \
    -v $HOME:$HOME \
    -v $HOME/.ssh:/root/.ssh \
    --shm-size 64G \
    --name training_env \
    rocm/pytorch-training:v25.9_gfx950

MI325X and MI300X

docker run -it \
    --device /dev/dri \
    --device /dev/kfd \
    --network host \
    --ipc host \
    --group-add video \
    --cap-add SYS_PTRACE \
    --security-opt seccomp=unconfined \
    --privileged \
    -v $HOME:$HOME \
    -v $HOME/.ssh:/root/.ssh \
    --shm-size 64G \
    --name training_env \
    rocm/pytorch-training:v25.9_gfx942

Use these commands if you exit the training_env container and need to return to it.

docker start training_env
docker exec -it training_env bash

In the Docker container, clone the ROCm/MAD repository and navigate to the benchmark scripts directory /workspace/MAD/scripts/pytorch_train.
```
git clone https://github.com/ROCm/MAD
cd MAD/scripts/pytorch_train
```

Prepare training datasets and dependencies

The following benchmarking examples require downloading models and datasets from Hugging Face. To ensure successful access to gated repos, set your HF_TOKEN.
```
export HF_TOKEN=$your_personal_hugging_face_access_token
```

Run the setup script to install libraries and datasets needed for benchmarking.

./pytorch_benchmark_setup.sh

pytorch_benchmark_setup.sh installs the following libraries for Llama 3.1 8B:

Library	Reference
`accelerate`	Hugging Face Accelerate
`datasets`	Hugging Face Datasets 3.2.0

pytorch_benchmark_setup.sh installs the following libraries for Llama 3.1 70B:

Library	Reference
`datasets`	Hugging Face Datasets 3.2.0
`torchdata`	TorchData
`tomli`	Tomli
`tiktoken`	tiktoken
`blobfile`	blobfile
`tabulate`	tabulate
`wandb`	Weights & Biases
`sentencepiece`	SentencePiece 0.2.0
`tensorboard`	TensorBoard 2.18.0

pytorch_benchmark_setup.sh installs the following libraries for FLUX:

Library	Reference
`accelerate`	Hugging Face Accelerate
`datasets`	Hugging Face Datasets 3.2.0
`sentencepiece`	SentencePiece 0.2.0
`tensorboard`	TensorBoard 2.18.0
`csvkit`	csvkit 2.0.1
`deepspeed`	DeepSpeed 0.16.2
`diffusers`	Hugging Face Diffusers 0.31.0
`GitPython`	GitPython 3.1.44
`opencv-python-headless`	opencv-python-headless 4.10.0.84
`peft`	PEFT 0.14.0
`protobuf`	Protocol Buffers 5.29.2
`pytest`	PyTest 8.3.4
`python-dotenv`	python-dotenv 1.0.1
`seaborn`	Seaborn 0.13.2
`transformers`	Transformers 4.47.0

pytorch_benchmark_setup.sh downloads the following datasets from Hugging Face:

frank-chieng/chinese_architecture_siheyuan

Fine-tuning

To start the fine-tuning benchmark, use the following command with the appropriate options. See the following list of options and their descriptions. See supported training modes.

./pytorch_benchmark_report.sh -t $training_mode \
    -m Llama-4-17B_16E \
    -p $datatype \
    -s $sequence_length

Name	Options	Description
`$training_mode`	`finetune_fw`	Full weight fine-tuning (BF16 and FP8 supported).
	`finetune_lora`	LoRA fine-tuning (BF16 supported).
`$datatype`	`BF16` or `FP8`	All models support BF16. FP8 is only available for full weight fine-tuning.
`$sequence_length`	Between 2048 and 16384.	Sequence length for the language model.

Fine-tuning

To start the fine-tuning benchmark, use the following command with the appropriate options. See the following list of options and their descriptions. See supported training modes.

./pytorch_benchmark_report.sh -t $training_mode \
    -m Llama-3.3-70B \
    -p $datatype \
    -s $sequence_length

Name	Options	Description
`$training_mode`	`finetune_fw`	Full weight fine-tuning (BF16 and FP8 supported).
	`finetune_lora`	LoRA fine-tuning (BF16 supported).
	`finetune_qlora`	QLoRA fine-tuning (BF16 supported).
`$datatype`	`BF16` or `FP8`	All models support BF16. FP8 is only available for full weight fine-tuning.
`$sequence_length`	Between 2048 and 16384.	Sequence length for the language model.

Fine-tuning

To start the fine-tuning benchmark, use the following command with the appropriate options. See the following list of options and their descriptions. See supported training modes.

./pytorch_benchmark_report.sh -t $training_mode \
    -m Llama-3.2-1B \
    -p $datatype \
    -s $sequence_length

Name	Options	Description
`$training_mode`	`finetune_fw`	Full weight fine-tuning (BF16 and FP8 supported).
	`finetune_lora`	LoRA fine-tuning (BF16 supported).
`$datatype`	`BF16` or `FP8`	All models support BF16. FP8 is only available for full weight fine-tuning.
`$sequence_length`	Between 2048 and 16384.	Sequence length for the language model.

Fine-tuning

To start the fine-tuning benchmark, use the following command with the appropriate options. See the following list of options and their descriptions. See supported training modes.

./pytorch_benchmark_report.sh -t $training_mode \
    -m Llama-3.2-3B \
    -p $datatype \
    -s $sequence_length

Name	Options	Description
`$training_mode`	`finetune_fw`	Full weight fine-tuning (BF16 and FP8 supported).
	`finetune_lora`	LoRA fine-tuning (BF16 supported).
`$datatype`	`BF16` or `FP8`	All models support BF16. FP8 is only available for full weight fine-tuning.
`$sequence_length`	Between 2048 and 16384.	Sequence length for the language model.

Fine-tuning

To start the fine-tuning benchmark, use the following command with the appropriate options. See the following list of options and their descriptions. See supported training modes.

./pytorch_benchmark_report.sh -t $training_mode \
    -m Llama-3.2-Vision-11B \
    -p $datatype \
    -s $sequence_length

Name	Options	Description
`$training_mode`	`finetune_fw`	Full weight fine-tuning (BF16 and FP8 supported).
`$datatype`	`BF16` or `FP8`	All models support BF16. FP8 is only available for full weight fine-tuning.
`$sequence_length`	Between 2048 and 16384.	Sequence length for the language model.

Fine-tuning

To start the fine-tuning benchmark, use the following command with the appropriate options. See the following list of options and their descriptions. See supported training modes.

./pytorch_benchmark_report.sh -t $training_mode \
    -m Llama-3.2-Vision-90B \
    -p $datatype \
    -s $sequence_length

Name	Options	Description
`$training_mode`	`finetune_fw`	Full weight fine-tuning (BF16 and FP8 supported).
`$datatype`	`BF16` or `FP8`	All models support BF16. FP8 is only available for full weight fine-tuning.
`$sequence_length`	Between 2048 and 16384.	Sequence length for the language model.

Note

For LoRA and QLoRA support with vision models (Llama 3.2 11B and 90B), use the following torchtune commit for compatibility:

git checkout 48192e23188b1fc524dd6d127725ceb2348e7f0e

Pre-training

To start the pre-training benchmark, use the following command with the appropriate options. See the following list of options and their descriptions.

./pytorch_benchmark_report.sh -t $training_mode \
    -m Llama-3.1-8B \
    -p $datatype \
    -s $sequence_length

Name	Options	Description
`$training_mode`	`pretrain`	Benchmark pre-training.
	`HF_pretrain`	Llama 3.1 8B pre-training with FP8 precision.
`$datatype`	`BF16` or `FP8`	Only Llama 3.1 8B supports FP8 precision.
`$sequence_length`	Sequence length for the language model.	Between 2048 and 8192. 8192 by default.

Fine-tuning

To start the fine-tuning benchmark, use the following command with the appropriate options. See the following list of options and their descriptions. See supported training modes.

./pytorch_benchmark_report.sh -t $training_mode \
    -m Llama-3.1-8B \
    -p $datatype \
    -s $sequence_length

Name	Options	Description
`$training_mode`	`finetune_fw`	Full weight fine-tuning (BF16 and FP8 supported).
	`finetune_lora`	LoRA fine-tuning (BF16 supported).
`$datatype`	`BF16` or `FP8`	All models support BF16. FP8 is only available for full weight fine-tuning.
`$sequence_length`	Between 2048 and 16384.	Sequence length for the language model.

Pre-training

To start the pre-training benchmark, use the following command with the appropriate options. See the following list of options and their descriptions.

./pytorch_benchmark_report.sh -t pretrain \
    -m Llama-3.1-70B \
    -p $datatype \
    -s $sequence_length

Name	Options	Description
`$training_mode`	`pretrain`	Benchmark pre-training.
`$datatype`	`BF16`	Only Llama 3.1 8B supports FP8 precision.
`$sequence_length`	Sequence length for the language model.	Between 2048 and 8192. 8192 by default.

Fine-tuning

To start the fine-tuning benchmark, use the following command with the appropriate options. See the following list of options and their descriptions. See supported training modes.

./pytorch_benchmark_report.sh -t $training_mode \
    -m Llama-3.1-70B \
    -p $datatype \
    -s $sequence_length

Name	Options	Description
`$training_mode`	`finetune_fw`	Full weight fine-tuning (BF16 and FP8 supported).
	`finetune_lora`	LoRA fine-tuning (BF16 supported).
`$datatype`	`BF16` or `FP8`	All models support BF16. FP8 is only available for full weight fine-tuning.
`$sequence_length`	Between 2048 and 16384.	Sequence length for the language model.

Fine-tuning

To start the fine-tuning benchmark, use the following command with the appropriate options. See the following list of options and their descriptions. See supported training modes.

./pytorch_benchmark_report.sh -t $training_mode \
    -m Llama-3.1-405B \
    -p $datatype \
    -s $sequence_length

Name	Options	Description
`$training_mode`	`finetune_qlora`	QLoRA fine-tuning (BF16 supported).
`$datatype`	`BF16`	All models support BF16.
`$sequence_length`	Between 2048 and 16384.	Sequence length for the language model.

Fine-tuning

To start the fine-tuning benchmark, use the following command with the appropriate options. See the following list of options and their descriptions. See supported training modes.

./pytorch_benchmark_report.sh -t $training_mode \
    -m Llama-3-8B \
    -p $datatype \
    -s $sequence_length

Name	Options	Description
`$training_mode`	`finetune_fw`	Full weight fine-tuning (BF16 and FP8 supported).
	`finetune_lora`	LoRA fine-tuning (BF16 supported).
`$datatype`	`BF16` or `FP8`	All models support BF16. FP8 is only available for full weight fine-tuning.
`$sequence_length`	Between 2048 and 16384.	Sequence length for the language model.

Fine-tuning

To start the fine-tuning benchmark, use the following command with the appropriate options. See the following list of options and their descriptions. See supported training modes.

./pytorch_benchmark_report.sh -t $training_mode \
    -m Llama-3-70B \
    -p $datatype \
    -s $sequence_length

Name	Options	Description
`$training_mode`	`finetune_fw`	Full weight fine-tuning (BF16 and FP8 supported).
	`finetune_lora`	LoRA fine-tuning (BF16 supported).
`$datatype`	`BF16` or `FP8`	All models support BF16. FP8 is only available for full weight fine-tuning.
`$sequence_length`	Between 2048 and 16384.	Sequence length for the language model.

Fine-tuning

To start the fine-tuning benchmark, use the following command with the appropriate options. See the following list of options and their descriptions. See supported training modes.

./pytorch_benchmark_report.sh -t $training_mode \
    -m Llama-2-7B \
    -p $datatype \
    -s $sequence_length

Name	Options	Description
`$training_mode`	`finetune_fw`	Full weight fine-tuning (BF16 and FP8 supported).
	`finetune_lora`	LoRA fine-tuning (BF16 supported).
	`finetune_qlora`	QLoRA fine-tuning (BF16 supported).
`$datatype`	`BF16` or `FP8`	All models support BF16. FP8 is only available for full weight fine-tuning.
`$sequence_length`	Between 2048 and 16384.	Sequence length for the language model.

Note

You might encounter the following error with Llama 2: ValueError: seq_len (16384) of input tensor should be smaller than max_seq_len (4096). This error indicates that an input sequence is longer than the model’s maximum context window.

Ensure your tokenized input does not exceed the model’s max_seq_len (4096 tokens in this case). You can resolve this by truncating the input or splitting it into smaller chunks before passing it to the model.

Note on reproducibility: The results in this guide are based on commit b4c98ac from the upstream pytorch/torchtune repository. For the latest updates, you can use the main branch.

Fine-tuning

To start the fine-tuning benchmark, use the following command with the appropriate options. See the following list of options and their descriptions. See supported training modes.

./pytorch_benchmark_report.sh -t $training_mode \
    -m Llama-2-13B \
    -p $datatype \
    -s $sequence_length

Name	Options	Description
`$training_mode`	`finetune_fw`	Full weight fine-tuning (BF16 and FP8 supported).
	`finetune_lora`	LoRA fine-tuning (BF16 supported).
`$datatype`	`BF16` or `FP8`	All models support BF16. FP8 is only available for full weight fine-tuning.
`$sequence_length`	Between 2048 and 16384.	Sequence length for the language model.

Note

You might encounter the following error with Llama 2: ValueError: seq_len (16384) of input tensor should be smaller than max_seq_len (4096). This error indicates that an input sequence is longer than the model’s maximum context window.

Ensure your tokenized input does not exceed the model’s max_seq_len (4096 tokens in this case). You can resolve this by truncating the input or splitting it into smaller chunks before passing it to the model.

Note on reproducibility: The results in this guide are based on commit b4c98ac from the upstream pytorch/torchtune repository. For the latest updates, you can use the main branch.

Fine-tuning

To start the fine-tuning benchmark, use the following command with the appropriate options. See the following list of options and their descriptions. See supported training modes.

./pytorch_benchmark_report.sh -t $training_mode \
    -m Llama-2-70B \
    -p $datatype \
    -s $sequence_length

Name	Options	Description
`$training_mode`	`finetune_lora`	LoRA fine-tuning (BF16 supported).
	`finetune_qlora`	QLoRA fine-tuning (BF16 supported).
`$datatype`	`BF16`	All models support BF16.
`$sequence_length`	Between 2048 and 16384.	Sequence length for the language model.

Note

You might encounter the following error with Llama 2: ValueError: seq_len (16384) of input tensor should be smaller than max_seq_len (4096). This error indicates that an input sequence is longer than the model’s maximum context window.

Ensure your tokenized input does not exceed the model’s max_seq_len (4096 tokens in this case). You can resolve this by truncating the input or splitting it into smaller chunks before passing it to the model.

Note on reproducibility: The results in this guide are based on commit b4c98ac from the upstream pytorch/torchtune repository. For the latest updates, you can use the main branch.

Fine-tuning

To start the fine-tuning benchmark, use the following command with the appropriate options. See the following list of options and their descriptions. See supported training modes.

./pytorch_benchmark_report.sh -t $training_mode \
    -m GPT-OSS-20B \
    -p $datatype \
    -s $sequence_length

Name	Options	Description
`$training_mode`	`HF_finetune_lora`	LoRA fine-tuning with Hugging Face PEFT.
`$datatype`	`BF16`	All models support BF16.
`$sequence_length`	Between 2048 and 16384.	Sequence length for the language model.

Fine-tuning

To start the fine-tuning benchmark, use the following command with the appropriate options. See the following list of options and their descriptions. See supported training modes.

./pytorch_benchmark_report.sh -t $training_mode \
    -m GPT-OSS-120B \
    -p $datatype \
    -s $sequence_length

Name	Options	Description
`$training_mode`	`HF_finetune_lora`	LoRA fine-tuning with Hugging Face PEFT.
`$datatype`	`BF16`	All models support BF16.
`$sequence_length`	Between 2048 and 16384.	Sequence length for the language model.

Fine-tuning

To start the fine-tuning benchmark, use the following command with the appropriate options. See the following list of options and their descriptions. See supported training modes.

./pytorch_benchmark_report.sh -t $training_mode \
    -m Qwen3-8B \
    -p $datatype \
    -s $sequence_length

Name	Options	Description
`$training_mode`	`finetune_fw`	Full weight fine-tuning (BF16 and FP8 supported).
	`finetune_lora`	LoRA fine-tuning (BF16 supported).
`$datatype`	`BF16` or `FP8`	All models support BF16. FP8 is only available for full weight fine-tuning.
`$sequence_length`	Between 2048 and 16384.	Sequence length for the language model.

Fine-tuning

To start the fine-tuning benchmark, use the following command with the appropriate options. See the following list of options and their descriptions. See supported training modes.

./pytorch_benchmark_report.sh -t $training_mode \
    -m Qwen3-32 \
    -p $datatype \
    -s $sequence_length

Name	Options	Description
`$training_mode`	`finetune_lora`	LoRA fine-tuning (BF16 supported).
`$datatype`	`BF16`	All models support BF16.
`$sequence_length`	Between 2048 and 16384.	Sequence length for the language model.

Fine-tuning

To start the fine-tuning benchmark, use the following command with the appropriate options. See the following list of options and their descriptions. See supported training modes.

./pytorch_benchmark_report.sh -t $training_mode \
    -m Qwen2.5-32B \
    -p $datatype \
    -s $sequence_length

Name	Options	Description
`$training_mode`	`finetune_lora`	LoRA fine-tuning (BF16 supported).
`$datatype`	`BF16`	All models support BF16.
`$sequence_length`	Between 2048 and 16384.	Sequence length for the language model.

Fine-tuning

To start the fine-tuning benchmark, use the following command with the appropriate options. See the following list of options and their descriptions. See supported training modes.

./pytorch_benchmark_report.sh -t $training_mode \
    -m Qwen2.5-72B \
    -p $datatype \
    -s $sequence_length

Name	Options	Description
`$training_mode`	`finetune_lora`	LoRA fine-tuning (BF16 supported).
`$datatype`	`BF16`	All models support BF16.
`$sequence_length`	Between 2048 and 16384.	Sequence length for the language model.

Fine-tuning

To start the fine-tuning benchmark, use the following command with the appropriate options. See the following list of options and their descriptions. See supported training modes.

./pytorch_benchmark_report.sh -t $training_mode \
    -m Qwen2-1.5B \
    -p $datatype \
    -s $sequence_length

Name	Options	Description
`$training_mode`	`finetune_fw`	Full weight fine-tuning (BF16 and FP8 supported).
	`finetune_lora`	LoRA fine-tuning (BF16 supported).
`$datatype`	`BF16` or `FP8`	All models support BF16. FP8 is only available for full weight fine-tuning.
`$sequence_length`	Between 2048 and 16384.	Sequence length for the language model.

Fine-tuning

To start the fine-tuning benchmark, use the following command with the appropriate options. See the following list of options and their descriptions. See supported training modes.

./pytorch_benchmark_report.sh -t $training_mode \
    -m Qwen2-7B \
    -p $datatype \
    -s $sequence_length

Name	Options	Description
`$training_mode`	`finetune_fw`	Full weight fine-tuning (BF16 and FP8 supported).
	`finetune_lora`	LoRA fine-tuning (BF16 supported).
`$datatype`	`BF16` or `FP8`	All models support BF16. FP8 is only available for full weight fine-tuning.
`$sequence_length`	Between 2048 and 16384.	Sequence length for the language model.

Benchmarking examples

For examples of benchmarking commands, see ROCm/MAD.

Multi-node training#

Refer to Multi-node setup for AI workloads to configure your environment for multi-node training. See PyTorch training for example Slurm run commands.

Pre-training#

Multi-node training with torchtitan is supported. The provided SLURM script is pre-configured for Llama 3 70B.

To launch the training job on a SLURM cluster for Llama 3 70B, run the following commands from the MAD repository.

# In the MAD repository
cd scripts/pytorch_train
sbatch run_slurm_train.sh

Fine-tuning#

Multi-node training with torchtune is supported. The provided SLURM script is pre-configured for Llama 3.3 70B.

To launch the training job on a SLURM cluster for Llama 3.3 70B, run the following commands from the MAD repository.

huggingface-cli login # Get access to HF Llama model space
huggingface-cli download meta-llama/Llama-3.3-70B-Instruct --local-dir ./models/Llama-3.3-70B-Instruct # Download the Llama 3.3 model locally
# In the MAD repository
cd scripts/pytorch_train
sbatch Torchtune_Multinode.sh

Note

Information regarding benchmark setup:

By default, Llama 3.3 70B is fine-tuned using alpaca_dataset.
You can adjust the torchtune YAML configuration file if you’re using a different model.
The number of nodes and other parameters can be tuned in the SLURM script Torchtune_Multinode.sh.
Set the mounting_paths inside the SLURM script.

Once the run is finished, you can find the log files in the result_torchtune/ directory.

Known issues#

PyTorch Profiler may produce inaccurate traces when CPU activity profiling is enabled.

Previous versions#

See PyTorch training performance testing version history to find documentation for previous releases of the ROCm/pytorch-training Docker image.

Training a model with PyTorch on ROCm

Contents

Training a model with PyTorch on ROCm#

Supported models#

Performance measurements#

System validation#

Run training#

Multi-node training#

Pre-training#

Fine-tuning#

Known issues#

Further reading#

Previous versions#