Training a model#

Applies to Linux

2024-09-10

The following is a brief overview of popular component paths for AI development use cases such as training, LLMs, and inferencing.

Accelerating model training#

When you train a large model such as GPT-2 or Llama 2 70B, a single accelerator or GPU might not be able to store all the model parameters required for training. What if you could convert the single-GPU training code to run on multiple accelerators or GPUs? PyTorch offers distributed training solutions to facilitate this.

PyTorch distributed#

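As of PyTorch 1.6.0, features in torch.distributed are categorized into three main components: distributed data-parallel training (DDP), RPC-based distributed training (RPC), and collective communication (c10d).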

In this guide, the focus is on the distributed data-parallelism strategy as it’s the most popular. To get started with DDP, let’s first understand how to coordinate the model and its training data across multiple accelerators or GPUs.

The DDP workflow on multiple accelerators or GPUs is as follows:

  1. Split the current global training batch into small local batches on each GPU. For instance, if you have 8 GPUs and the global batch is set at 32 samples, each of the 8 GPUs will have a local batch size of 4 samples.

  2. Copy the model to every device so each device can process its local batches independently.

  3. Run a forward pass, then a backward pass, and compute the gradient of the loss with respect to the model weights for that local batch. This happens in parallel on multiple devices.

  4. Synchronize the local gradients computed by each device and combine them to update the model weights. Because every device applies the same combined gradients, the model replicas stay identical across devices.

In DDP training, each process (worker) owns a replica of the model and processes its own batch of data. The reducer then uses an allreduce operation to sum the gradients across workers and average them.
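The four workflow steps can be illustrated with a toy simulation in plain Python. This is a sketch under simplifying assumptions, not PyTorch's implementation: the "model" is a single weight `w` in `y = w * x`, the workers are list entries rather than GPUs, and the helper names (`split_batch`, `local_gradient`, `all_reduce_mean`) are hypothetical stand-ins for what DDP does internally.

```python
# Toy simulation of the DDP workflow (no GPUs, no PyTorch).
# Model: y = w * x with mean squared error loss. All names are illustrative.

def local_gradient(w, batch):
    """Gradient of the mean squared error with respect to w on one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def split_batch(global_batch, world_size):
    """Step 1: split the global batch into equal local batches."""
    local_size = len(global_batch) // world_size
    return [global_batch[i * local_size:(i + 1) * local_size]
            for i in range(world_size)]

def all_reduce_mean(values):
    """Step 4 stand-in for allreduce: every worker gets the same average."""
    return sum(values) / len(values)

world_size = 8
global_batch = [(float(i), 3.0 * i) for i in range(32)]  # 32 samples of y = 3x
local_batches = split_batch(global_batch, world_size)    # 4 samples per worker

replicas = [0.5] * world_size  # Step 2: identical copy of the weight per worker
lr = 0.001

# Step 3: each worker computes a gradient on its own local batch.
local_grads = [local_gradient(w, b) for w, b in zip(replicas, local_batches)]

# Step 4: gradients are combined and every replica applies the same update.
synced_grad = all_reduce_mean(local_grads)
replicas = [w - lr * synced_grad for w in replicas]

print("local batch size:", len(local_batches[0]))
print("replicas identical:", len(set(replicas)) == 1)
```

Because the local batches are the same size, the average of the local gradients equals the gradient computed on the full global batch, which is why the replicas can train independently and still stay in sync.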

See the following developer blogs for more in-depth explanations and examples.

PyTorch FSDP#

As noted in PyTorch distributed, DDP replicates the model weights and optimizer states on every worker. Fully Sharded Data Parallel (FSDP) is a type of data parallelism that instead shards model parameters, optimizer states, and gradients across DDP ranks.

When training with FSDP, the per-GPU memory footprint is smaller than when training with DDP across all workers. This makes training some very large models feasible by allowing larger models or batch sizes to fit on-device. However, this comes at the cost of increased communication volume. Internal optimizations, such as overlapping communication with computation, reduce this overhead.
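To get a feel for the memory difference, the following back-of-the-envelope estimate compares model-state memory per GPU under replication and under full sharding. The byte counts are assumptions that do not come from this guide: mixed-precision training with the Adam optimizer, at 2 bytes per parameter for weights, 2 for gradients, and 12 for optimizer state. The function name `model_state_gb` is hypothetical. Activations, communication buffers, and the parameters FSDP temporarily gathers during computation are ignored, so real usage is higher.

```python
# Rough per-GPU memory estimate for model state only (assumed byte counts).
BYTES_WEIGHTS = 2     # assumed: 16-bit weights
BYTES_GRADS = 2       # assumed: 16-bit gradients
BYTES_OPTIMIZER = 12  # assumed: FP32 master weights + Adam momentum and variance

def model_state_gb(num_params, world_size, sharded):
    """Return estimated model-state memory per GPU in GB (10**9 bytes)."""
    total_bytes = num_params * (BYTES_WEIGHTS + BYTES_GRADS + BYTES_OPTIMIZER)
    if sharded:
        # Full sharding: each rank holds 1/world_size of the model state.
        total_bytes /= world_size
    return total_bytes / 1e9

num_params = 70e9  # a 70B-parameter model
world_size = 8

ddp_gb = model_state_gb(num_params, world_size, sharded=False)
fsdp_gb = model_state_gb(num_params, world_size, sharded=True)

print(f"Replicated (DDP-style): {ddp_gb:.0f} GB per GPU")
print(f"Fully sharded (FSDP-style): {fsdp_gb:.0f} GB per GPU")
```

Under these assumptions, a 70B-parameter model needs about 1120 GB of model state per GPU when replicated, compared with about 140 GB per GPU when sharded across 8 GPUs.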

For a high-level overview of how FSDP works, review Getting started with Fully Sharded Data Parallel.

For detailed training steps, refer to the PyTorch FSDP examples.

DeepSpeed#

DeepSpeed offers system innovations that make large-scale deep learning training effective, efficient, and easy to use. Innovations such as ZeRO, 3D-Parallelism, DeepSpeed-MoE, ZeRO-Infinity, and so on fall under the training pillar.

See Pre-training a large language model with Megatron-DeepSpeed on multiple AMD GPUs — ROCm Blogs for a detailed example of training with DeepSpeed on an AMD accelerator or GPU.

Automatic mixed precision (AMP)#

As models increase in size, so do the time and memory needed to train them, and therefore their cost. Reducing training time and memory usage through automatic mixed precision (AMP) is highly beneficial for most use cases.
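The trade-off behind mixed precision can be seen with Python's standard library alone. The sketch below is not AMP itself. It only shows that a 16-bit float takes half the storage of a 32-bit float, and that very small values, such as some gradients, underflow to zero in FP16 unless they are scaled first. Gradient scaling is not covered in this guide; it is included here as an assumption about how mixed-precision training is typically kept numerically stable, and the scale factor of 65536 is chosen for illustration.

```python
import struct

def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", value))[0]

# Storage per value: half precision uses half the bytes of single precision.
fp16_bytes = struct.calcsize("e")
fp32_bytes = struct.calcsize("f")
print("bytes per FP16 value:", fp16_bytes)
print("bytes per FP32 value:", fp32_bytes)

# A tiny gradient is below the smallest positive FP16 value and is lost.
small_grad = 1e-8
unscaled = to_fp16(small_grad)
print("stored directly in FP16:", unscaled)

# Scaling before the FP16 conversion keeps the value representable.
scale = 65536.0  # illustrative scale factor
scaled = to_fp16(small_grad * scale)
recovered = scaled / scale  # unscale in higher precision
print("recovered after scaling:", recovered)
```

AMP frameworks automate these decisions by choosing which operations run in lower precision and by managing the scale factor during training.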

See Automatic mixed precision in PyTorch using AMD GPUs — ROCm Blogs for more information about running AMP on an AMD accelerator.

Fine-tuning your model#

ROCm supports multiple techniques for optimizing fine-tuning, for example, LoRA, QLoRA, PEFT, and FSDP.

Learn more about challenges and solutions for model fine-tuning in Fine-tuning LLMs and inference optimization.

The following developer blogs showcase examples of how to fine-tune a model on an AMD accelerator or GPU.