Stanford Megatron-LM compatibility#
2025-10-21
5 min read time
Stanford Megatron-LM is a large-scale language model training framework derived from NVIDIA's Megatron-LM and maintained in the stanford-futuredata/Megatron-LM repository. It is designed to train massive transformer-based language models efficiently using model and data parallelism.
It provides efficient tensor, pipeline, and sequence-based model parallelism for pre-training transformer-based language models such as GPT (decoder-only), BERT (encoder-only), and T5 (encoder-decoder).
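To give a rough sense of what tensor (model) parallelism means in practice, the sketch below splits a single linear layer's weight column-wise across a toy "world" of two shards and checks that the sharded result matches the unsharded one. The sizes and variable names are illustrative assumptions; Megatron-LM implements this with communication-aware parallel layers rather than this simplified, single-process form.

```python
# Minimal sketch of column-parallel tensor parallelism (illustrative only;
# Megatron-LM's real parallel layers add cross-GPU communication and fused kernels).
import torch

torch.manual_seed(0)

hidden, ffn, world_size = 8, 16, 2          # toy sizes; real models are far larger
x = torch.randn(4, hidden)                  # a batch of activations
full_weight = torch.randn(ffn, hidden)      # the full projection weight

# Each "rank" holds only a shard of the weight matrix.
shards = torch.chunk(full_weight, world_size, dim=0)

# Every rank computes its partial output from the same input ...
partial_outputs = [x @ w.t() for w in shards]

# ... and gathering along the feature dimension reconstructs the full output.
y_parallel = torch.cat(partial_outputs, dim=-1)

# The sharded computation matches the unsharded linear layer.
y_reference = x @ full_weight.t()
assert torch.allclose(y_parallel, y_reference, atol=1e-6)
print(y_parallel.shape)  # torch.Size([4, 16])
```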
Support overview#
The ROCm-supported version of Stanford Megatron-LM is maintained in the official ROCm/Stanford-Megatron-LM repository, a downstream fork of the upstream stanford-futuredata/Megatron-LM repository.
To get started and install Stanford Megatron-LM on ROCm, use the prebuilt Docker image, which includes ROCm, Stanford Megatron-LM, and all required dependencies.
See the ROCm Stanford Megatron-LM installation guide for installation and setup instructions.
You can also consult the upstream Installation guide for additional context.
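Once inside the container (or any ROCm-enabled PyTorch environment), a quick sanity check such as the following confirms that PyTorch was built against the HIP runtime and can see the expected GPUs before you launch training. The printed values depend on the image and hardware; the snippet itself only uses standard PyTorch APIs.

```python
# Quick sanity check for a ROCm-enabled PyTorch environment.
import torch

print("PyTorch version:", torch.__version__)
# On ROCm builds, torch.version.hip is set; on CUDA builds it is None.
print("HIP runtime:", torch.version.hip)
print("GPU available:", torch.cuda.is_available())

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        # AMD Instinct devices are exposed through the familiar CUDA device API.
        print(f"device {i}:", torch.cuda.get_device_name(i))
```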
Version support#
Stanford Megatron-LM is supported on ROCm 6.3.0.
Supported devices#
Officially Supported: AMD Instinct™ MI300X
Partially Supported (functionality or performance limitations): AMD Instinct™ MI250X, MI210
Supported models and features#
This section details the models and features supported by the ROCm-enabled version of Stanford Megatron-LM.
Models:
BERT
GPT
T5
ICT
Features:
Distributed Pre-training
Activation Checkpointing and Recomputation (see the sketch after this list)
Distributed Optimizer
Mixture-of-Experts
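As a minimal illustration of the activation checkpointing and recomputation feature listed above, the snippet below uses stock torch.utils.checkpoint on a toy feed-forward block: activations inside the block are not stored after the forward pass and are recomputed during backward. Megatron-LM applies the same idea inside its transformer layers through its own configuration options, so treat this only as a stand-alone sketch of the technique.

```python
# Minimal activation-checkpointing sketch using stock PyTorch
# (Megatron-LM wires the same idea into its transformer layers).
import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
)

x = torch.randn(8, 1024, requires_grad=True)

# Activations inside `block` are discarded and recomputed during backward,
# reducing peak memory at the cost of extra forward FLOPs.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
print(x.grad.shape)  # torch.Size([8, 1024])
```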
Use cases and recommendations#
The following blog post focuses on Megablocks, but you can follow the same steps with Stanford Megatron-LM to pre-process datasets and pre-train on AMD GPUs:
The Efficient MoE training on AMD ROCm: How-to use Megablocks on AMD GPUs blog post explains how to leverage the ROCm platform for pre-training with the Megablocks framework. It introduces a streamlined approach for training Mixture-of-Experts (MoE) models using the Megablocks library on AMD hardware. Focusing on GPT-2, it demonstrates how block-sparse computations can enhance scalability and efficiency in MoE training. The guide provides step-by-step instructions for setting up the environment, including cloning the repository, building the Docker image, and running the training container. It also offers insights into using the oscar-1GB.json dataset for pre-training language models. By leveraging Megablocks and the ROCm platform, you can optimize your MoE training workflows for large-scale transformer models.
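For context on the dataset step, Megatron-style preprocessing tools typically consume loose JSON with one record per line and a text field, which is the general shape of datasets such as oscar-1GB.json. The file name and field name below are assumptions for illustration; check the repository's preprocessing instructions for the exact expected format.

```python
# Write a tiny corpus in the loose-JSON layout typically consumed by
# Megatron-style preprocessing tools (one JSON object per line with a
# "text" field). File and field names are illustrative assumptions.
import json

samples = [
    {"text": "ROCm is AMD's open software platform for GPU computing."},
    {"text": "Mixture-of-Experts models route tokens to a subset of experts."},
]

with open("tiny-corpus.json", "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")
```

A file in this shape would then typically be run through the repository's data preprocessing script to produce the indexed binary files used during pre-training.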
The blog post also shows how to pre-process datasets and how to begin pre-training on AMD GPUs through:
Single-GPU pre-training
Multi-GPU pre-training (see the initialization sketch below)
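For the multi-GPU case, Megatron-style pre-training is typically launched with one process per GPU (for example via torchrun), with each process joining a torch.distributed process group; on ROCm the nccl backend is provided by RCCL. The skeleton below shows only that generic per-process initialization pattern, not Stanford Megatron-LM's actual entry point or argument parsing.

```python
# Generic per-process initialization pattern for multi-GPU training,
# as used by torchrun-style launches. Megatron-LM's own pretraining
# scripts handle this internally; this is only an illustrative skeleton.
import os
import torch
import torch.distributed as dist

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    # On ROCm, the "nccl" backend is backed by RCCL.
    dist.init_process_group(backend="nccl")

    print(f"rank {dist.get_rank()} / {dist.get_world_size()} "
          f"on device {torch.cuda.current_device()}")

    # ... build the model, wrap it for data parallelism, run the training loop ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Such a script would be launched with, for example, torchrun --nproc_per_node=<number of GPUs> script.py, where the script name is a placeholder.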
Docker image compatibility#
AMD validates and publishes Stanford Megatron-LM images with ROCm and PyTorch backends on Docker Hub. The following Docker image tags and associated inventories represent the latest Stanford Megatron-LM version from the official Docker Hub. Click a tag to view the image on Docker Hub.