minChatGPT

This is a custom project from the Stanford CS224N Winter 2023 class. The goal of this project is to answer this question:

Will alignment from human feedback also help small language models such as GPT-2?

And the answer is YES! With RLHF, evaluation shows that ChatGPT prefers the aligned GPT-2 outputs over the vanilla GPT-2 outputs 96% of the time, and over the supervised fine-tuning baseline 88% of the time. Please see the technical report for more details.

You can also try minChatGPT in a Colab notebook.

Disclaimer:

This model has not been tested or evaluated for safety. It may generate harmful or toxic content.
The demo is only meant to show how to improve small models with RLHF. Its performance is not comparable to that of conversational systems backed by large language models.
This is not an error-free codebase! In fact, there may be bugs here and there. Please open an issue if you have any questions.

Poster

Get Started

First of all, you should know that ChatGPT (or InstructGPT, which is the last open publication on this topic from OpenAI) is trained in three stages: Supervised Fine-tuning, Reward Model Training, and RL with Human Feedback (using Proximal Policy Optimization, PPO). Here's a diagram from the InstructGPT paper:

Training

You need a GPU with at least 16GB VRAM, NVIDIA Driver 515+, CUDA 11.7+, and Python 3.8 or higher (other versions of CUDA and Python might work, but I didn't test them; PyTorch 2.0 is very demanding).
Install PyTorch 2.0
Install dependencies with

pip install -r requirements.txt
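
Before starting a long run, it can help to confirm that the environment roughly matches the requirements listed above. The following is a minimal sketch and not part of the repository: the helper names `python_ok` and `describe_gpu` are hypothetical, and the script only reports what it finds rather than enforcing the 16GB figure (drivers often report slightly less than the nominal VRAM).

```python
import sys


def python_ok(min_version=(3, 8)):
    """Return True if the running interpreter meets the minimum version."""
    return sys.version_info[:2] >= min_version


def describe_gpu():
    """Return a short, human-readable summary of the PyTorch/CUDA setup."""
    try:
        import torch  # imported lazily so the check still runs without PyTorch
    except ImportError:
        return "PyTorch is not installed"
    if not torch.cuda.is_available():
        return f"PyTorch {torch.__version__} found, but no CUDA device is visible"
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / 1024 ** 3
    return f"PyTorch {torch.__version__}, {props.name}, {vram_gb:.1f} GB VRAM"


if __name__ == "__main__":
    print("Python version OK:", python_ok())
    print(describe_gpu())
```

If the summary shows much less than 16 GB of VRAM, a smaller batch size is probably the first thing to try.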

The first step is to train an SFT model. Inside the src directory, run the command below. You can change the batch size via -b; the more VRAM you have, the larger the batch size you can afford. Optionally, you can open tensorboard.ipynb to check the training progress.

python train_sft.py -n experiment_name -b 2

Once you have finished the SFT stage, you can start training the reward model. You should have a directory whose name starts with sft_ in your runs directory. Find the final model weights and run the command below. This should start reward model training for 1 epoch and generate a directory whose name starts with rm_ with the weights in it.

python train_rm.py -b 2 -n experiment_name -p "./runs/path/to/your/weights"
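
For intuition about what a reward model learns in this stage, here is a minimal sketch of the pairwise ranking loss described in the InstructGPT paper: given a human-preferred ("chosen") response and a "rejected" one, the loss is small when the chosen response receives the higher scalar reward. This is an illustration in plain Python, not code from this repository; the function name `pairwise_ranking_loss` is hypothetical, and whether train_rm.py uses exactly this form is an assumption.

```python
import math


def pairwise_ranking_loss(reward_chosen, reward_rejected):
    """Return -log(sigmoid(r_chosen - r_rejected)) in a numerically stable way."""
    diff = reward_chosen - reward_rejected
    # -log(sigmoid(x)) equals softplus(-x) = log(1 + exp(-x)).
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    # For negative x, rewrite as -x + log(1 + exp(x)) to avoid overflow.
    return -diff + math.log1p(math.exp(diff))


if __name__ == "__main__":
    # The loss shrinks as the chosen response is scored further above the rejected one.
    for chosen, rejected in [(2.0, 0.0), (0.0, 0.0), (0.0, 2.0)]:
        print(chosen, rejected, pairwise_ranking_loss(chosen, rejected))
```

In a real training loop, the same expression would be averaged over a batch of comparison pairs and minimized by gradient descent.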

Finally, you can start RLHF with the reward model and the SFT model from the previous two steps. Run the command below. Because training is sometimes unstable, I stopped early at around 12K steps with a batch size of 1. The final weights will be in a directory whose name starts with ppo_.

python train_rlhf.py -b 2 -n experiment_name -a "./runs/path/to/sft/weights" -c "./runs/path/to/reward_model/weights" -s naive
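
The instability mentioned above is one reason PPO clips its policy updates. As a rough illustration, the sketch below computes the standard clipped surrogate objective for a single action. It is not taken from this repository: the function name `ppo_clipped_objective` and the clip value of 0.2 are assumptions, and the actual training script may add further terms (for example a KL penalty or a value loss).

```python
import math


def ppo_clipped_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """Clipped surrogate objective for one action (a quantity to maximize)."""
    # Probability ratio between the updated policy and the policy that sampled the action.
    ratio = math.exp(logp_new - logp_old)
    # Restrict the ratio to [1 - eps, 1 + eps] so one update cannot move the policy too far.
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # Taking the minimum makes the objective a pessimistic bound on the unclipped one.
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    # With a positive advantage, the gain from raising the action's probability is capped.
    print(ppo_clipped_objective(math.log(2.0), 0.0, advantage=1.0))
    # With a negative advantage, the penalty is not capped.
    print(ppo_clipped_objective(math.log(2.0), 0.0, advantage=-1.0))
```

The asymmetry is deliberate: the policy gets no extra credit for moving far toward a good action, but is still fully penalized for moving toward a bad one.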

Evaluate

You need an OpenAI account with an API key.
Put your key into a JSON file called "openai.key". It should contain a dictionary with a key called "OPENAI_API_KEY" whose value is your key.
Inside src, run this:

python evaluate.py -s "/path/to/sft/model" -p "/path/to/ppo/model"

This should generate the "human" preference for each model, using ChatGPT as a proxy.
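
The key file described above is a small JSON dictionary, so it can be created and checked with a few lines of Python. This is a minimal sketch: the helper names `write_key_file` and `read_key_file` are hypothetical, and the assumption that the file sits in the directory where evaluate.py is run is mine, not something stated in this README.

```python
import json
from pathlib import Path


def write_key_file(api_key, path="openai.key"):
    """Write the key as a JSON dictionary with a single OPENAI_API_KEY entry."""
    Path(path).write_text(json.dumps({"OPENAI_API_KEY": api_key}))


def read_key_file(path="openai.key"):
    """Load the JSON file and return the stored API key."""
    return json.loads(Path(path).read_text())["OPENAI_API_KEY"]
```

For example, calling `write_key_file("<your key>")` once and then `read_key_file()` should return the same string. Keep this file out of version control, since it contains a secret.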

Citation

Just in case you need this:

@misc{minChatGPT,
  author = {Yanjia Li},
  title = {minChatGPT: A minimum example of aligning language models with RLHF similar to ChatGPT},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/ethanyanjiali/minChatGPT}},
}

Acknowledgement

This project wouldn't have been possible without help from:

Stanford CS224N, Professor Manning and the TAs
nanoGPT
ColossalAI
OpenAI Baselines
OpenAssistant
Anthropic HH RLHF
And my project mentor Jesse Mu!
