Stanford Megatron-LM compatibility#
2025-10-21
5 min read time
Stanford Megatron-LM is a large-scale language model training framework derived from NVIDIA's Megatron-LM and maintained in the stanford-futuredata/Megatron-LM repository. It is designed to train massive transformer-based language models efficiently using model and data parallelism.
It provides efficient tensor, pipeline, and sequence-based model parallelism for pre-training transformer-based language models such as GPT (decoder-only), BERT (encoder-only), and T5 (encoder-decoder).
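To give a rough sense of what tensor (model) parallelism means in practice, the sketch below splits a single linear layer's weight column-wise across a toy "world" of two shards and checks that the sharded result matches the unsharded one. The sizes and variable names are illustrative assumptions; Megatron-LM implements this with communication-aware parallel layers rather than this simplified, single-process form.

```python
# Minimal sketch of column-parallel tensor parallelism (illustrative only;
# Megatron-LM's real parallel layers add cross-GPU communication and fused kernels).
import torch

torch.manual_seed(0)

hidden, ffn, world_size = 8, 16, 2          # toy sizes; real models are far larger
x = torch.randn(4, hidden)                  # a batch of activations
full_weight = torch.randn(ffn, hidden)      # the full projection weight

# Each "rank" holds only a shard of the weight matrix.
shards = torch.chunk(full_weight, world_size, dim=0)

# Every rank computes its partial output from the same input ...
partial_outputs = [x @ w.t() for w in shards]

# ... and gathering along the feature dimension reconstructs the full output.
y_parallel = torch.cat(partial_outputs, dim=-1)

# The sharded computation matches the unsharded linear layer.
y_reference = x @ full_weight.t()
assert torch.allclose(y_parallel, y_reference, atol=1e-6)
print(y_parallel.shape)  # torch.Size([4, 16])
```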
Support overview#
The ROCm-supported version of Stanford Megatron-LM is maintained in the official ROCm/Stanford-Megatron-LM repository, a downstream fork of the upstream stanford-futuredata/Megatron-LM repository.
To get started and install Stanford Megatron-LM on ROCm, use the prebuilt Docker image, which includes ROCm, Stanford Megatron-LM, and all required dependencies.
See the ROCm Stanford Megatron-LM installation guide for installation and setup instructions.
You can also consult the upstream Installation guide for additional context.
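Once inside the container (or any ROCm-enabled PyTorch environment), a quick sanity check such as the following confirms that PyTorch was built against the HIP runtime and can see the expected GPUs before you launch training. The printed values depend on the image and hardware; the snippet itself only uses standard PyTorch APIs.

```python
# Quick sanity check for a ROCm-enabled PyTorch environment.
import torch

print("PyTorch version:", torch.__version__)
# On ROCm builds, torch.version.hip is set; on CUDA builds it is None.
print("HIP runtime:", torch.version.hip)
print("GPU available:", torch.cuda.is_available())

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        # AMD Instinct devices are exposed through the familiar CUDA device API.
        print(f"device {i}:", torch.cuda.get_device_name(i))
```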
Version support#
Stanford Megatron-LM is supported on ROCm 6.3.0.
Supported devices#
Officially Supported: AMD Instinct™ MI300X
Partially Supported (functionality or performance limitations): AMD Instinct™ MI250X, MI210
Supported models and features#
This section details the models and features supported by the ROCm-enabled version of Stanford Megatron-LM.
Models:
BERT
GPT
T5
ICT
Features:
Distributed Pre-training
Activation Checkpointing and Recomputation (see the sketch after this list)
Distributed Optimizer
Mixture-of-Experts
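As a minimal illustration of the activation checkpointing and recomputation feature listed above, the snippet below uses stock torch.utils.checkpoint on a toy feed-forward block: activations inside the block are not stored after the forward pass and are recomputed during backward. Megatron-LM applies the same idea inside its transformer layers through its own configuration options, so treat this only as a stand-alone sketch of the technique.

```python
# Minimal activation-checkpointing sketch using stock PyTorch
# (Megatron-LM wires the same idea into its transformer layers).
import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
)

x = torch.randn(8, 1024, requires_grad=True)

# Activations inside `block` are discarded and recomputed during backward,
# reducing peak memory at the cost of extra forward FLOPs.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
print(x.grad.shape)  # torch.Size([8, 1024])
```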
Use cases and recommendations#
The following blog post focuses on Megablocks, but you can follow the same steps with Stanford Megatron-LM to pre-process datasets and pre-train on AMD GPUs:
The Efficient MoE training on AMD ROCm: How-to use Megablocks on AMD GPUs blog post explains how to leverage the ROCm platform for pre-training with the Megablocks framework. It introduces a streamlined approach for training Mixture-of-Experts (MoE) models using the Megablocks library on AMD hardware. Focusing on GPT-2, it demonstrates how block-sparse computations can enhance scalability and efficiency in MoE training. The guide provides step-by-step instructions for setting up the environment, including cloning the repository, building the Docker image, and running the training container. It also offers insights into using the oscar-1GB.json dataset for pre-training language models. By leveraging Megablocks and the ROCm platform, you can optimize your MoE training workflows for large-scale transformer models.
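For context on the dataset step, Megatron-style preprocessing tools typically consume loose JSON with one record per line and a text field, which is the general shape of datasets such as oscar-1GB.json. The file name and field name below are assumptions for illustration; check the repository's preprocessing instructions for the exact expected format.

```python
# Write a tiny corpus in the loose-JSON layout typically consumed by
# Megatron-style preprocessing tools (one JSON object per line with a
# "text" field). File and field names are illustrative assumptions.
import json

samples = [
    {"text": "ROCm is AMD's open software platform for GPU computing."},
    {"text": "Mixture-of-Experts models route tokens to a subset of experts."},
]

with open("tiny-corpus.json", "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")
```

A file in this shape would then typically be run through the repository's data preprocessing script to produce the indexed binary files used during pre-training.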
The blog post also shows how to pre-process datasets and how to begin pre-training on AMD GPUs through:
Single-GPU pre-training
Multi-GPU pre-training (see the initialization sketch below)
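For the multi-GPU case, Megatron-style pre-training is typically launched with one process per GPU (for example via torchrun), with each process joining a torch.distributed process group; on ROCm the nccl backend is provided by RCCL. The skeleton below shows only that generic per-process initialization pattern, not Stanford Megatron-LM's actual entry point or argument parsing.

```python
# Generic per-process initialization pattern for multi-GPU training,
# as used by torchrun-style launches. Megatron-LM's own pretraining
# scripts handle this internally; this is only an illustrative skeleton.
import os
import torch
import torch.distributed as dist

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    # On ROCm, the "nccl" backend is backed by RCCL.
    dist.init_process_group(backend="nccl")

    print(f"rank {dist.get_rank()} / {dist.get_world_size()} "
          f"on device {torch.cuda.current_device()}")

    # ... build the model, wrap it for data parallelism, run the training loop ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Such a script would be launched with, for example, torchrun --nproc_per_node=<number of GPUs> script.py, where the script name is a placeholder.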
Docker image compatibility#
AMD validates and publishes Stanford Megatron-LM images with ROCm and PyTorch backends on Docker Hub. The following Docker image tags and associated inventories represent the latest Stanford Megatron-LM version from the official Docker Hub. Click a tag to view the image on Docker Hub.