Fine-tuning Llama-3.2 3B with LoRA#
This tutorial demonstrates how to fine-tune the Llama-3.2 3B large language model (LLM) using Low-Rank Adaptation (LoRA) on AMD ROCm GPUs. Llama-3.2, developed by Meta, is a widely used open-source large language model. For more information, see Meta’s Llama page.
Fine-tuning large language models can be computationally intensive because it traditionally requires optimizing all of a model's parameters. This approach, known as full-parameter fine-tuning, updates every weight in the model and places a significant demand on memory and compute resources, often up to four times the size of the model itself.
To address these challenges, the tutorial uses LoRA, a parameter-efficient fine-tuning (PEFT) technique. As described by Hu et al. in their 2021 paper, LoRA freezes the pre-trained model weights and introduces trainable rank-decomposition matrices into each layer of the Transformer architecture. This significantly reduces the number of trainable parameters for downstream tasks while maintaining performance, making it possible to fine-tune large models efficiently on resource-constrained hardware.
Reference: Hu et al., “LoRA: Low-Rank Adaptation of Large Language Models,” 2021.
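To make the parameter savings concrete, the following sketch counts the trainable parameters for a single weight matrix under full fine-tuning versus a rank-r LoRA update. The dimensions and rank are illustrative assumptions, not Llama-3.2 values.
# Illustrative only: trainable parameters for one d_out x d_in weight matrix,
# comparing full fine-tuning with a rank-r LoRA update (W + B @ A,
# where B is d_out x r and A is r x d_in)
d_out, d_in, r = 3072, 3072, 8  # assumed sizes for illustration

full_params = d_out * d_in            # every weight is trainable
lora_params = d_out * r + r * d_in    # only the two low-rank factors are trainable

print(f"Full fine-tuning: {full_params:,} trainable parameters")
print(f"LoRA (r={r}):     {lora_params:,} trainable parameters "
      f"({100 * lora_params / full_params:.2f}% of full)")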
Prerequisites#
This tutorial was developed and tested using the following setup.
Operating system#
Ubuntu 22.04: Ensure your system is running Ubuntu version 22.04.
Hardware#
AMD Instinct™ GPUs: This tutorial was tested on an AMD Instinct MI300X GPU. Ensure you are using an AMD Instinct GPU or compatible hardware with ROCm support and that your system meets the official requirements.
Software#
ROCm 6.2: Install and verify ROCm by following the ROCm install guide. After installation, confirm your setup using:
rocm-smi
This command lists your AMD GPU(s) along with details such as temperature, power, and VRAM usage.
Docker: Ensure Docker is installed and configured correctly. Follow the Docker installation guide for your operating system.
Note: Ensure the Docker permissions are correctly configured. To configure permissions to allow non-root access, run the following commands:
sudo usermod -aG docker $USER
newgrp docker
Verify Docker is working correctly:
docker run hello-world
Hugging Face API access#
Obtain an API token from Hugging Face for downloading models.
Ensure the Hugging Face API token has the necessary permissions and approval to access Meta’s Llama checkpoints.
Data preparation#
This tutorial uses a sample dataset from Hugging Face, which is prepared during the setup steps.
Prepare the training environment#
1. Pull the Docker image#
Ensure your system meets the System Requirements.
Pull the Docker image required for this tutorial:
docker pull rocm/pytorch:rocm6.2.3_ubuntu22.04_py3.10_pytorch_release_2.3.0
2. Launch the Docker container#
Launch the Docker container and map the necessary directories. The command below mounts your current working directory into the container at /workspace, so run it from the directory where these notebooks are stored, or replace $(pwd) with the full path to that directory.
docker run -it --rm \
--network=host \
--device=/dev/kfd \
--device=/dev/dri \
--group-add=video \
--ipc=host \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--shm-size 8G \
--hostname=ROCm-FT \
-v $(pwd):/workspace \
-w /workspace/notebooks \
rocm/pytorch:rocm6.2.3_ubuntu22.04_py3.10_pytorch_release_2.3.0
Note: This command mounts the current directory to the /workspace
directory in the container. Ensure the notebook file is either copied to this directory before running the Docker command or uploaded into the Jupyter Notebook environment after it starts. Save the token or URL provided in the terminal output to access the notebook from your web browser. You can download this notebook from the AI Developer Hub GitHub repository.
3. Install and launch Jupyter#
Inside the Docker container, install Jupyter using the following command:
pip install jupyter
Start the Jupyter server:
jupyter-lab --ip=0.0.0.0 --port=8888 --no-browser --allow-root
Note: Ensure port 8888
is not already in use on your system before running the above command. If it is, you can specify a different port by replacing --port=8888
with another port number, for example, --port=8890
.
4. Install the required libraries#
Install the libraries required for this tutorial. Run the following commands inside the Jupyter notebook running within the Docker container:
# Install necessary libraries for fine-tuning, including parameter-efficient fine-tuning (peft) and transformers
!pip install pandas peft==0.14.0 transformers==4.47.1 trl==0.13.0 accelerate==1.2.1 scipy tensorboardX
Verify the installation:
# Verify the installation and version of the required libraries
!pip list | grep peft
!pip list | grep transformer
!pip list | grep accelerate
!pip list | grep trl
Here is the expected output:
peft 0.14.0
transformers 4.47.1
accelerate 1.2.1
trl 0.13.0
5. Install bitsandbytes for ROCm 6.2#
Install bitsandbytes for ROCm 6.2 from source:
# Install bitsandbytes from source to enable quantization on ROCm GPUs
!git clone --recurse https://github.com/ROCm/bitsandbytes.git && cd bitsandbytes && git checkout rocm6.2_internal_testing && make hip && python setup.py install
Verify the bitsandbytes installation, which should be version 0.42.0:
# Verify the installation and version of bitsandbytes
try:
    import bitsandbytes as bnb
    print("bitsandbytes version:", bnb.__version__)
except ImportError as e:
    print("Error:", e)
⚠️ Important: ensure the correct kernel is selected
If the verification process fails, ensure the correct Jupyter kernel is selected for your notebook. To change the kernel, follow these steps:
1. Go to the Kernel menu.
2. Select Change Kernel.
3. Select Python 3 (ipykernel) from the list.
Failure to select the correct kernel can lead to unexpected issues when running the notebook.
6. Provide your Hugging Face token#
You’ll need a Hugging Face API token to access Llama-3.2 3B. Generate your token at Hugging Face Tokens and request access to the Llama-3.2 3B model. Tokens typically start with “hf_”.
Run the following interactive block in your Jupyter notebook to set up the token:
Note: Uncheck the “Add token as Git credential” option.
from huggingface_hub import notebook_login, HfApi
# Prompt the user to log in
notebook_login()
Verify that your token was accepted correctly:
from huggingface_hub import HfApi
try:
    api = HfApi()
    user_info = api.whoami()
    print(f"Token validated successfully! Logged in as: {user_info['name']}")
except Exception as e:
    print(f"Token validation failed. Error: {e}")
Fine-tuning the model#
This section covers the process of setting up and running fine-tuning for the Llama-3.2 model using the LoRA technique. The following steps describe how to set up GPUs, import the required libraries, configure the model and training parameters, and run the fine-tuning process.
⚠️ Important: ensure the correct kernel is selected
Ensure the correct Jupyter kernel is selected for your notebook. To change the kernel, follow these steps:
1. Go to the Kernel menu.
2. Select Change Kernel.
3. Select Python 3 (ipykernel) from the list.
Failure to select the correct kernel can lead to unexpected issues when running the notebook.
Set and verify the GPU availability#
Begin by specifying the GPUs available for fine-tuning. Verify that they are properly detected by PyTorch.
import os
import torch
# gpus = [0, 1, 2, 3]  # Specify the GPUs to use for training on an MI300X system
gpus = [0]  # For a single-GPU system, such as a Radeon PRO W7900
os.environ.setdefault("CUDA_VISIBLE_DEVICES", ','.join(map(str, gpus)))
# Ensure PyTorch detects the GPUs correctly
print(f"PyTorch detected number of available devices: {torch.cuda.device_count()}")
Import the required packages#
Next, import the libraries necessary for fine-tuning, including utilities for dataset loading, model configuration, training setup, and evaluation.
# Load datasets and transformers for handling the Llama-3.2 model
from datasets import load_dataset
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
TrainingArguments,
pipeline
)
# Import utilities for LoRA fine-tuning and training configurations
from peft import LoraConfig
from trl import SFTTrainer
print("Successfully imported required libraries for dataset handling, model configuration, and LoRA fine-tuning.")
Configure the model#
Load the base model and tokenizer, and configure them for efficient fine-tuning on ROCm-enabled GPUs.
base_model_name = "meta-llama/Llama-3.2-3B" # Hugging Face model repository name
new_model_name = "Llama-3.2-3B-lora" # Name for the fine-tuned model
# Load and configure the tokenizer for padding and tokenization
llama_tokenizer = AutoTokenizer.from_pretrained(
base_model_name,
trust_remote_code=True,
use_fast=True
)
llama_tokenizer.pad_token = llama_tokenizer.eos_token
llama_tokenizer.padding_side = "right"
# Load the pre-trained Llama-3.2 model with device mapping for GPU
base_model = AutoModelForCausalLM.from_pretrained(
base_model_name,
device_map="auto",
trust_remote_code=True
)
# Disable caching to optimize for fine-tuning
base_model.config.use_cache = False
base_model.config.pretraining_tp = 1
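Optionally, confirm the model loaded as expected by checking its parameter count and approximate memory footprint. A minimal sketch using standard transformers model methods:
# Report the total parameter count and approximate in-memory size of the base model
print(f"Total parameters: {base_model.num_parameters():,}")
print(f"Approximate memory footprint: {base_model.get_memory_footprint() / 1e9:.2f} GB")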
Load and prepare the dataset#
Fine-tune the base model for a question-and-answer task using a small dataset called mlabonne/guanaco-llama2-1k. This dataset is a subset of 1,000 samples from the timdettmers/openassistant-guanaco dataset, which is derived from a human-generated, human-annotated, assistant-style conversation corpus containing 161,443 messages in 35 different languages, annotated with 461,292 quality ratings and resulting in over 10,000 fully annotated conversation trees.
# Dataset
data_name = "mlabonne/guanaco-llama2-1k"
# Load the fine-tuning dataset from Hugging Face
training_data = load_dataset(data_name, split="train")
# Display dataset structure and a sample for verification
print(training_data.shape)
# Sample 11 is a question-and-answer example in English
print(training_data[11])
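You can also inspect the dataset's column names and an excerpt of the raw training text to confirm the prompt format before training. A minimal sketch, assuming the dataset exposes a single text column as the printed sample above shows:
# Inspect the dataset columns and a short excerpt of the raw training text
print(training_data.column_names)
print(training_data[11]["text"][:200])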
Fine-tuning configuration#
Define the hyperparameters and configurations for the fine-tuning process.
# Define training arguments, including output directory and optimization settings
# Specify number of epochs, batch size, learning rate, and logging steps
train_params = TrainingArguments(
output_dir="./results_lora",
num_train_epochs=1,
per_device_train_batch_size=4,
gradient_accumulation_steps=1,
optim="paged_adamw_32bit",
save_steps=50,
logging_steps=50,
learning_rate=4e-5,
weight_decay=0.001,
fp16=False,
bf16=True,
max_grad_norm=0.3,
max_steps=-1,
warmup_ratio=0.03,
group_by_length=True,
lr_scheduler_type="constant",
report_to="tensorboard"
)
print("Training parameters configured!.")
Note: If you encounter out-of-memory (OOM) errors, reduce the per_device_train_batch_size or enable gradient checkpointing; a memory-conservative variant is sketched below. Use rocm-smi to monitor VRAM usage during fine-tuning.
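The following is a minimal sketch of a more memory-conservative alternative to the configuration above. The specific values are illustrative assumptions rather than tuned recommendations; use it in place of the previous cell if needed.
# Optional, memory-conservative variant (illustrative values): reduce the
# per-device batch size, accumulate gradients, and enable gradient checkpointing
train_params = TrainingArguments(
    output_dir="./results_lora",
    num_train_epochs=1,
    per_device_train_batch_size=1,                     # smaller batch to lower peak VRAM
    gradient_accumulation_steps=4,                     # keep a similar effective batch size
    gradient_checkpointing=True,                       # trade extra compute for memory
    gradient_checkpointing_kwargs={"use_reentrant": False},
    optim="paged_adamw_32bit",
    learning_rate=4e-5,
    bf16=True,
    logging_steps=50,
    report_to="tensorboard"
)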
LoRA configuration#
Low-Rank Adaptation (LoRA) introduces lightweight rank-decomposition matrices into the base model. By focusing only on updating these additional matrices, LoRA significantly reduces the number of trainable parameters, enabling efficient fine-tuning of large models.
from peft import get_peft_model
# Configure LoRA parameters for low-rank adaptation
peft_parameters = LoraConfig(
lora_alpha=8, # Alpha controls the scaling parameter
lora_dropout=0.1,
r=8, # r specifies the rank of the low-rank matrices
bias="none",
task_type="CAUSAL_LM"
)
model = get_peft_model(base_model, peft_parameters)
model.print_trainable_parameters()
This indicates that only a small portion of the total parameters are trainable during fine-tuning, ensuring resource efficiency.
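By default, PEFT applies LoRA to a standard set of attention projection layers for Llama-style models. If you want to control exactly which modules receive adapters, you can set target_modules explicitly. A minimal sketch, assuming the module names used by the Hugging Face Llama implementation:
# Optional: explicitly choose which projection layers receive LoRA adapters
# (module names assume the Hugging Face Llama implementation)
explicit_peft_parameters = LoraConfig(
    lora_alpha=8,
    lora_dropout=0.1,
    r=8,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"]
)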
Fine-tuning with LoRA#
LoRA’s lightweight approach allows fine-tuning while maintaining high efficiency in terms of computation and memory usage. The following section defines a training pipeline using the LoRA-integrated model.
# Initialize the trainer with the fine-tuning dataset and configurations
fine_tuning = SFTTrainer(
model=base_model,
train_dataset=training_data,
peft_config=peft_parameters,
args=train_params
)
# Execute the training process
fine_tuning.train()
During training, the model outputs metrics such as training loss, step progress, and runtime performance, which can be monitored for insights.
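Because report_to="tensorboard" is set in the training arguments, the logged metrics can also be visualized interactively. A minimal sketch, assuming the tensorboard package is installed in the container (for example, via pip install tensorboard):
# Visualize training metrics logged under the output directory
%load_ext tensorboard
%tensorboard --logdir ./results_lora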
Save the fine-tuned model#
After training is complete, save the fine-tuned model under the name specified earlier (new_model_name).
# Save the fine-tuned model to the specified directory
fine_tuning.model.save_pretrained(new_model_name)
print("Successfully saved the model!")
Monitoring GPU memory#
To monitor GPU memory during training, run the following command in a Jupyter cell (or run rocm-smi directly in a terminal):
!rocm-smi
This command displays memory usage and other GPU metrics to ensure your hardware resources are being optimally used.
Comparison: fine-tuning with and without LoRA#
To understand the benefits of LoRA, compare the fine-tuning metrics (such as memory usage, training speed, and loss) between these scenarios:
Fine-tuning with LoRA (low-rank adaptation layers).
Full fine-tuning (updating all model parameters).
LoRA’s resource-efficient approach is especially beneficial for training on hardware with limited memory or computational power.
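One simple way to gather memory numbers for such a comparison is to query PyTorch's allocator statistics after each run. The sketch below reports only PyTorch-managed memory, so treat the values as indicative rather than exact; rocm-smi gives the device-level view.
# Report the peak GPU memory allocated by PyTorch during the run, per visible device.
# Call torch.cuda.reset_peak_memory_stats(i) before a run to start from a clean slate.
for i in range(torch.cuda.device_count()):
    peak_gb = torch.cuda.max_memory_allocated(i) / 1e9
    print(f"Device {i}: peak allocated memory {peak_gb:.2f} GB")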
Testing the fine-tuned model#
Load the fine-tuned model and run inference to evaluate its performance.
# Reload the base model and merge it with the LoRA adapter weights
base_model = AutoModelForCausalLM.from_pretrained(base_model_name)
from peft import PeftModel
peft_model = PeftModel.from_pretrained(base_model, new_model_name)
peft_model = peft_model.merge_and_unload()
# Configure the tokenizer for text generation
llama_tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
llama_tokenizer.pad_token = llama_tokenizer.eos_token
llama_tokenizer.padding_side = "right"
# Build a text-generation pipeline around the merged model
# (assigned to text_gen so it does not shadow the transformers pipeline function)
text_gen = pipeline(
    "text-generation",
    model=peft_model,
    tokenizer=llama_tokenizer,
    max_length=1024,
    device_map="auto"
)
Now run a query and view the response generated by your fine-tuned model.
# Use the fine-tuned model to generate responses for a query
query = "What do you think is the most important part of building an AI chatbot?"
output = text_gen(f"<s>[INST] {query} [/INST]")
print(output[0]['generated_text'])
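To qualitatively compare against the un-tuned model, you can optionally load a fresh copy of the base checkpoint and run the same prompt through it. This is a minimal sketch; note that the base_model object above was modified when the LoRA weights were merged, so a fresh load is required.
# Load an unmodified copy of the base checkpoint for a side-by-side comparison
original_model = AutoModelForCausalLM.from_pretrained(base_model_name, device_map="auto")
baseline_generator = pipeline(
    "text-generation",
    model=original_model,
    tokenizer=llama_tokenizer,
    max_length=1024
)
baseline_output = baseline_generator(f"<s>[INST] {query} [/INST]")
print(baseline_output[0]['generated_text'])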