rcalix1/RLHF

Reinforcement Learning from Human Feedback (RLHF)

RLHF pipeline:

STEP1: Ziegler2020
STEP2: HF trl
STEP3: trlx

Problems

Problem1: Training GPT2 with PPO and a reward model
Problem2: MathGPT
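
As a rough illustration of the two quantities at the heart of Problem1, here is a minimal sketch in plain Python. It is not the code from the notebooks. The function names and the default values for `beta` and `clip_eps` are assumptions chosen for illustration. The first function applies the KL-penalized reward used in Ziegler et al. (reward model score minus a penalty for drifting from the reference model), and the second computes the standard PPO clipped surrogate for a single sample.

```python
import math


def kl_shaped_reward(reward, logp_policy, logp_ref, beta=0.1):
    """Reward-model score minus a KL penalty toward the reference model.

    reward:      scalar score from the reward model (hypothetical input)
    logp_policy: log-probability of the sample under the policy being trained
    logp_ref:    log-probability of the same sample under the frozen reference model
    beta:        penalty weight (assumed value, tune per task)
    """
    return reward - beta * (logp_policy - logp_ref)


def ppo_clipped_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO clipped surrogate objective for one sample (to be maximized)."""
    # Probability ratio between the updated policy and the policy that
    # generated the sample.
    ratio = math.exp(logp_new - logp_old)
    # Keep the ratio inside [1 - clip_eps, 1 + clip_eps].
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # Take the pessimistic (smaller) of the two candidate objectives.
    return min(ratio * advantage, clipped_ratio * advantage)
```

In a real training loop these values are computed per token over batches of tensors; libraries such as trl and trlx handle that, along with the value function and advantage estimation.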

AI Cloud

Lambda Labs (https://lambdalabs.com)