Training a model with Megatron-LM on ROCm

2026-01-30

Applies to Linux

Caution

For a unified training solution on AMD GPUs with ROCm, the rocm/megatron-lm Docker Hub registry will be deprecated soon in favor of rocm/primus. The rocm/primus Docker containers will cover PyTorch training ecosystem frameworks, including Megatron-LM and torchtitan.

Primus with Megatron is designed to replace this ROCm Megatron-LM training workflow. To learn how to migrate workloads from Megatron-LM to Primus with Megatron, see Migrating workloads to Primus (Megatron backend) from Megatron-LM.

The Megatron-LM framework for ROCm is a specialized fork of the robust Megatron-LM, designed to enable efficient training of large-scale language models on AMD GPUs. By leveraging AMD Instinct™ GPUs, Megatron-LM delivers enhanced scalability, performance, and resource utilization for AI workloads. It is purpose-built to support models like Llama, DeepSeek, and Mixtral, enabling developers to train next-generation AI models more efficiently.

AMD provides ready-to-use Docker images for MI355X, MI350X, MI325X, and MI300X GPUs containing essential components, including PyTorch, ROCm libraries, and Megatron-LM utilities. The image contains the following software components to accelerate training workloads:

rocm/primus:v26.1

Software component	Version
ROCm	7.1.0
PyTorch	2.10.0.dev20251112+rocm7.1
Python	3.10
Transformer Engine	2.6.0.dev0+f141f34b
Flash Attention	2.8.3
hipBLASLt	34459f66ea
Triton	3.4.0
RCCL	2.27.7

Supported models#

The following models are supported for training performance benchmarking with Megatron-LM and ROCm on AMD Instinct MI300X Series GPUs. Some instructions, commands, and training recommendations in this documentation vary by model.

Model	Variants
Meta Llama	Llama 3.3 70B, Llama 3.1 8B, Llama 3.1 70B, Llama 2 7B, Llama 2 70B
DeepSeek	DeepSeek-V3 (proxy), DeepSeek-V2-Lite
Mistral AI	Mixtral 8x7B, Mixtral 8x22B (proxy)
Qwen	Qwen 2.5 7B, Qwen 2.5 72B

Note

Some models, such as Llama, require an external license agreement through a third party (for example, Meta).

Performance measurements#

To evaluate performance, the Performance results with AMD ROCm software page provides reference throughput and latency measurements for training popular AI models.

Important

The performance data presented in Performance results with AMD ROCm software only reflects the latest version of this training benchmarking environment. The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X GPUs or ROCm software.

System validation#

Before running AI workloads, it’s important to validate that your AMD hardware is configured correctly and performing optimally.

If you have already validated your system settings, including aspects like NUMA auto-balancing, you can skip this step. Otherwise, complete the procedures in the System validation and optimization guide to properly configure your system settings before starting training.

To test for optimal performance, consult the recommended System health benchmarks. This suite of tests will help you verify and fine-tune your system’s configuration.
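As a quick illustration of the kind of setting the validation guide covers, the following sketch reports the current NUMA auto-balancing state by reading the standard Linux sysctl file `/proc/sys/kernel/numa_balancing`. The helper name `describe_numa_balancing` is made up for this example, and the linked guide remains the authority on which value to use.

```shell
#!/usr/bin/env bash
# Sketch: report whether NUMA auto-balancing is enabled on this host.
# describe_numa_balancing is an illustrative helper, not part of ROCm.

describe_numa_balancing() {
    # $1 is the value read from /proc/sys/kernel/numa_balancing
    case "$1" in
        0)  echo "disabled" ;;
        "") echo "unknown" ;;
        *)  echo "enabled" ;;   # any non-zero mode means balancing is active
    esac
}

numa_file=/proc/sys/kernel/numa_balancing
if [ -r "$numa_file" ]; then
    echo "NUMA auto-balancing: $(describe_numa_balancing "$(cat "$numa_file")")"
else
    echo "NUMA auto-balancing: unknown ($numa_file is not readable)"
fi
```

The script only reads the setting. Changing it requires root privileges and should follow the procedure in the System validation and optimization guide.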

Environment setup#

Use the following instructions to set up the environment, configure the script to train models, and reproduce the benchmark results on MI300X Series GPUs with the AMD Megatron-LM Docker image.

Download the Docker image#

Use the following command to pull the Docker image from Docker Hub.
```
docker pull rocm/primus:v26.1
```

Launch the Docker container.

```
docker run -it \
    --device /dev/dri \
    --device /dev/kfd \
    --device /dev/infiniband \
    --network host --ipc host \
    --group-add video \
    --cap-add SYS_PTRACE \
    --security-opt seccomp=unconfined \
    --privileged \
    -v $HOME:$HOME \
    -v $HOME/.ssh:/root/.ssh \
    --shm-size 128G \
    --name megatron_training_env \
    rocm/primus:v26.1
```
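The launch command passes three device paths into the container, and `docker run` fails if a path given to `--device` does not exist on the host. A minimal pre-flight sketch, assuming the same three paths as above (the helper name `missing_devices` is hypothetical):

```shell
#!/usr/bin/env bash
# Sketch: list which of the given device paths are absent on the host.
# missing_devices is an illustrative helper, not part of Docker or ROCm.

missing_devices() {
    local path
    for path in "$@"; do
        # Print only the paths that do not exist.
        [ -e "$path" ] || echo "$path"
    done
}

missing=$(missing_devices /dev/kfd /dev/dri /dev/infiniband)
if [ -n "$missing" ]; then
    echo "Not found on this host:"
    echo "$missing"
else
    echo "All device paths are present."
fi
```

If `/dev/infiniband` is reported missing on a single-node system without RDMA hardware, one option is to drop that `--device` line from the launch command.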

Use these commands if you exit the megatron_training_env container and need to return to it.
```
docker start megatron_training_env
docker exec -it megatron_training_env bash
```
Megatron-LM backward compatibility setup: this Docker image is primarily intended for use with Primus, but it maintains Megatron-LM compatibility with limited support. To roll back to using Megatron-LM, follow these steps:
```
cd /workspace/Megatron-LM/
pip uninstall megatron-core
pip install -e .
```

The Docker container hosts a verified commit of ROCm/Megatron-LM.
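After the rollback steps, it can be useful to confirm which copy of `megatron.core` Python actually imports. The sketch below prints the file a module resolves to; `module_path` is a hypothetical helper, and the expectation that an editable install resolves to a path under `/workspace/Megatron-LM/` is an assumption based on the steps above.

```shell
#!/usr/bin/env bash
# Sketch: show where Python resolves a module from.
# module_path is an illustrative helper; it assumes python3 is on PATH.

module_path() {
    # $1: dotted module name, for example megatron.core
    python3 -c 'import importlib, sys; print(importlib.import_module(sys.argv[1]).__file__)' "$1"
}

location=$(module_path megatron.core 2>/dev/null) || location=""
if [ -n "$location" ]; then
    echo "megatron.core resolves to: $location"
else
    echo "megatron.core is not importable in this environment"
fi
```

If the printed path points into `site-packages` rather than the checkout, the editable install probably did not take effect.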

Configuration#

Update the configuration script for your model in ROCm/Megatron-LM to configure your training run. Options can also be passed as command-line arguments as described in Run training.

Model	Configuration script	Directory
Llama 3.3 and Llama 3.1	train_llama3.sh	examples/llama
Llama 2	train_llama2.sh	examples/llama
DeepSeek-V3	train_deepseekv3.sh	examples/deepseek_v3
DeepSeek-V2-Lite	train_deepseekv2.sh	examples/deepseek_v2
Mixtral	train_mixtral_moe.sh	examples/mixtral

Note

See Key options for more information on configuration options.
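The following sketch illustrates one common way a shell script can accept overrides without being edited: it reads each option from the environment and falls back to a default. This assumes the training scripts follow that pattern, so check the script for your model before relying on it. `show_config` and the default path are hypothetical stand-ins, and `TOKENIZER_MODEL` is the only option name taken from this page.

```shell
#!/usr/bin/env bash
# Sketch: environment-variable overrides with fallback defaults.
# show_config is a hypothetical stand-in for a training script.

show_config() {
    # Use the caller's TOKENIZER_MODEL if set, otherwise a placeholder default.
    local tokenizer="${TOKENIZER_MODEL:-default/tokenizer/path}"
    echo "tokenizer=$tokenizer"
}

# Without an override, the default applies.
show_config

# With a one-off override on the command line.
TOKENIZER_MODEL=/data/tokenizers/llama3 show_config
```

The prefix form sets the variable only for that single invocation, which keeps separate runs from leaking settings into each other.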

Multi-node configuration#

Refer to Multi-node setup for AI workloads to configure your environment for multi-node training. See Multi-node training examples for example run commands.

Tokenizer#

You can assign the path of an existing tokenizer to the TOKENIZER_MODEL variable as shown in the following examples. If the tokenizer is not found locally, it is downloaded if publicly available.

If you do not have the Llama 3.3 tokenizer locally, you need your personal Hugging Face access token, HF_TOKEN, to download it. See Llama-3.3-70B-Instruct. After you are authorized, use your HF_TOKEN to download the tokenizer and set the variable TOKENIZER_MODEL to the tokenizer path.

```
export HF_TOKEN="<your personal Hugging Face access token>"
```
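One possible way to fetch the tokenizer files ahead of time is the `huggingface-cli download` command from the `huggingface_hub` package. The sketch below assumes that CLI is installed in the container. The helper names, the `*.json` include pattern, and the target directory are illustrative choices rather than part of the documented workflow.

```shell
#!/usr/bin/env bash
# Sketch: guard on HF_TOKEN, then download tokenizer metadata for a gated model.
# require_hf_token and download_tokenizer are hypothetical helper names.

require_hf_token() {
    if [ -z "${HF_TOKEN:-}" ]; then
        echo "HF_TOKEN is not set" >&2
        return 1
    fi
}

download_tokenizer() {
    # $1: Hugging Face repo id, $2: local target directory
    require_hf_token || return 1
    # "*.json" is an assumed pattern: it fetches the small JSON files
    # (tokenizer and config metadata) and skips the weight shards.
    huggingface-cli download "$1" \
        --include "*.json" \
        --local-dir "$2" \
        --token "$HF_TOKEN"
}

# Example invocation (not run here):
#   download_tokenizer meta-llama/Llama-3.3-70B-Instruct "$HOME/tokenizers/llama-3.3-70b"
#   export TOKENIZER_MODEL="$HOME/tokenizers/llama-3.3-70b"
```

Checking the token first gives a clear error message instead of an authorization failure partway through the download.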