SuperAGI GPT-2

Assignment brief: https://contlo.notion.site/contlo/Assignment-32610c8f37dd4435b1f97ecaff93bdaf

Table of Contents

  • Introduction
  • Setup
  • Usage

Introduction

This repository contains a PyTorch-based training loop for GPT-2 that supports three training setups: single GPU, Distributed Data Parallel (DDP), and Fully Sharded Data Parallel (FSDP). It trains on custom datasets and can be easily adapted to specific project requirements.

The interactive notebook model.ipynb contains the functional training loop. Here is a brief overview of each piece in the script:

  • create_model_optimizer(lr=5e-5): Creates a GPT-2 model and an AdamW optimizer with the specified learning rate (a sketch of this piece follows the list).
  • train_single_gpu(model, optimizer, criterion, dataloader, device): Trains the model on a single GPU.
  • train_ddp(model, optimizer, criterion, dataloader, device): Trains the model across multiple GPUs with Distributed Data Parallel (DDP).
  • train_fsdp(model, optimizer, criterion, dataloader, device): Trains the model with Fully Sharded Data Parallel (FSDP), which shards parameters, gradients, and optimizer state across GPUs.
  • SampleDataset and its dataloader: Placeholders for the dataset and dataloader; replace them with your actual implementation.
  • criterion: CrossEntropyLoss, used as the loss function.
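
A minimal sketch of how create_model_optimizer and criterion could be defined, assuming the Hugging Face transformers GPT2LMHeadModel and PyTorch's AdamW; the actual definitions live in model.ipynb / train.py:

    # Sketch only: assumes the Hugging Face transformers package is available.
    import torch
    from torch.optim import AdamW
    from transformers import GPT2LMHeadModel

    def create_model_optimizer(lr=5e-5):
        # Load pretrained GPT-2 weights and pair them with an AdamW optimizer.
        model = GPT2LMHeadModel.from_pretrained("gpt2")
        optimizer = AdamW(model.parameters(), lr=lr)
        return model, optimizer

    # CrossEntropyLoss serves as the criterion for next-token prediction.
    criterion = torch.nn.CrossEntropyLoss()

If you follow this sketch, the transformers package needs to be installed in addition to the dependencies listed under Setup.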

Setup

  1. Clone the repository:

    git clone https://github.com/Mannxxx/SuperAGI_AI_Assignment_Submission.git
    cd SuperAGI_AI_Assignment_Submission
    
  2. Install the required dependencies:

    pip install torch torchvision

  3. Replace the sample dataset and dataloader in the script (train.py) with your actual dataset and dataloader (a hypothetical example follows this list).
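
A hypothetical replacement for the placeholder SampleDataset; the class name, tokenization, and label handling here are illustrative and should be adapted to your own data:

    # Hypothetical custom dataset; adapt tokenization and loading to your data.
    import torch
    from torch.utils.data import Dataset, DataLoader
    from transformers import GPT2Tokenizer

    class MyTextDataset(Dataset):
        def __init__(self, texts, max_length=128):
            tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
            tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token
            self.encodings = tokenizer(
                texts, truncation=True, padding="max_length",
                max_length=max_length, return_tensors="pt",
            )

        def __len__(self):
            return self.encodings["input_ids"].size(0)

        def __getitem__(self, idx):
            input_ids = self.encodings["input_ids"][idx]
            # For causal language modeling, the labels are the inputs themselves.
            return input_ids, input_ids.clone()

    dataloader = DataLoader(
        MyTextDataset(["Hello world.", "GPT-2 fine-tuning example."]),
        batch_size=2, shuffle=True,
    )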

Usage

Single GPU Training

To train the model on a single GPU, use the train_single_gpu function.
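
A minimal sketch of what train_single_gpu could look like, assuming batches yield (input_ids, labels) pairs as in the dataset sketch above; the real loop is in model.ipynb:

    # Sketch only; not the repository's exact implementation.
    def train_single_gpu(model, optimizer, criterion, dataloader, device, epochs=1):
        model.to(device)
        model.train()
        for _ in range(epochs):
            for input_ids, labels in dataloader:
                input_ids, labels = input_ids.to(device), labels.to(device)
                optimizer.zero_grad()
                logits = model(input_ids).logits
                # Shift so each position predicts the next token.
                loss = criterion(
                    logits[:, :-1, :].reshape(-1, logits.size(-1)),
                    labels[:, 1:].reshape(-1),
                )
                loss.backward()
                optimizer.step()

With the pieces above, a run then reduces to creating the model and optimizer and calling train_single_gpu(model, optimizer, criterion, dataloader, torch.device("cuda")).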

Distributed Data Parallel (DDP)

To train the model across multiple GPUs with DDP, use the train_ddp function.
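
A hedged sketch of how train_ddp might set things up, assuming a torchrun launch (so RANK and WORLD_SIZE are set in the environment) and that each process passes its own CUDA device:

    # Sketch only; assumes launch via `torchrun --nproc_per_node=<num_gpus> train.py`.
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def train_ddp(model, optimizer, criterion, dataloader, device, epochs=1):
        dist.init_process_group(backend="nccl")  # reads rank/world size from env
        torch.cuda.set_device(device)
        model = DDP(model.to(device), device_ids=[device.index])
        model.train()
        for _ in range(epochs):
            for input_ids, labels in dataloader:
                input_ids, labels = input_ids.to(device), labels.to(device)
                optimizer.zero_grad()
                logits = model(input_ids).logits
                loss = criterion(
                    logits[:, :-1, :].reshape(-1, logits.size(-1)),
                    labels[:, 1:].reshape(-1),
                )
                loss.backward()
                optimizer.step()
        dist.destroy_process_group()

In a real multi-GPU run, the dataloader would normally use torch.utils.data.DistributedSampler so that each rank sees a distinct shard of the data.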

Fully Sharded Data Parallel (FSDP)

To train the model with FSDP, use the train_fsdp function.
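
A minimal FSDP sketch (PyTorch 1.12+), again assuming a torchrun launch; rebuilding the optimizer after wrapping, so it tracks the flattened FSDP parameters, is an assumption about how the pieces fit together rather than the repository's exact code:

    # Sketch only; assumes launch via torchrun and PyTorch >= 1.12.
    import torch
    import torch.distributed as dist
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
    from torch.optim import AdamW

    def train_fsdp(model, optimizer, criterion, dataloader, device, epochs=1):
        dist.init_process_group(backend="nccl")
        torch.cuda.set_device(device)
        model = FSDP(model.to(device))
        # Rebuild the optimizer over the sharded (flattened) parameters.
        optimizer = AdamW(model.parameters(), lr=optimizer.defaults["lr"])
        model.train()
        for _ in range(epochs):
            for input_ids, labels in dataloader:
                input_ids, labels = input_ids.to(device), labels.to(device)
                optimizer.zero_grad()
                logits = model(input_ids).logits
                loss = criterion(
                    logits[:, :-1, :].reshape(-1, logits.size(-1)),
                    labels[:, 1:].reshape(-1),
                )
                loss.backward()
                optimizer.step()
        dist.destroy_process_group()

Unlike DDP, which replicates the full model on every GPU, FSDP shards parameters, gradients, and optimizer state across ranks, reducing per-GPU memory usage.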
