A curated list of resources for Reinforcement Learning from Human Feedback (RLHF) and Language Models.
Reinforcement learning from human feedback (RLHF) has gained popularity with ChatGPT, which combines language models with RLHF.
The paper Transformer models: an introduction and catalog contains a very comprehensive catalog of existing language models.
- Anthropic
- OpenAI ChatGPT
- ChatGPT (https://openai.com/blog/chatgpt/)
- InstructGPT (https://openai.com/research/instruction-following)
- Google Bard
- Reinforcement Learning from Human Feedback: From Zero to ChatGPT
- CS224n: Natural Language Processing with Deep Learning course at Stanford
- Stanford CS234: Reinforcement Learning | Winter 2019
- Hugging Face Deep Reinforcement Learning Course
2022
- Fine-tuning language models to find agreement among humans with diverse preferences
- Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
- Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
2023
- Is ChatGPT a General-Purpose Natural Language Processing Task Solver?
- The Capacity for Moral Self-Correction in Large Language Models
- Is Reinforcement Learning (Not) for Natural Language Processing: Benchmarks, Baselines, and Building Blocks for Natural Language Policy Optimization
- https://github.com/openai/lm-human-preferences - The first code released by OpenAI to perform RLHF on language models
- https://github.com/allenai/RL4LMs - Provides easily customizable building blocks for training language models, including implementations of on-policy algorithms, reward functions, metrics, datasets, and LM-based actor-critic policies
- https://github.com/lvwerra/trl - Train transformer language models with Proximal Policy Optimization (PPO). The library is built on top of the transformers library by Hugging Face (see the sketch after this list).
- https://github.com/lucidrains/PaLM-rlhf-pytorch - Implementation of RLHF (Reinforcement Learning with Human Feedback) on top of the PaLM architecture.
- https://github.com/CarperAI/trlx - A repo for distributed training of language models with Reinforcement Learning via Human Feedback (RLHF)
- https://github.com/voidful/TextRL - Implementation of ChatGPT-style RLHF (Reinforcement Learning with Human Feedback) on any generative model in Hugging Face's transformers
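
For orientation, here is a minimal sketch of a single PPO update with trl, assuming the pre-0.12 `PPOConfig` / `PPOTrainer` / `AutoModelForCausalLMWithValueHead` API (signatures vary across trl versions); the constant reward is only a placeholder for a trained reward model:

```python
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

model_name = "gpt2"  # any causal LM from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Policy with a value head, plus a frozen reference copy for the KL penalty.
model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)

config = PPOConfig(batch_size=1, mini_batch_size=1)
ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer)

# One PPO step: generate a continuation for a prompt, score it, update the policy.
query_tensor = tokenizer.encode("Explain RLHF in one sentence:", return_tensors="pt")[0]
response_tensor = ppo_trainer.generate(
    query_tensor,
    return_prompt=False,
    max_new_tokens=32,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id,
)[0]

# In a real run the reward comes from a reward model trained on human preferences;
# a constant tensor is used here only to keep the sketch self-contained.
reward = [torch.tensor(1.0)]
stats = ppo_trainer.step([query_tensor], [response_tensor], reward)
```

In a full RLHF pipeline the scalar reward would come from a reward model trained on preference data such as the datasets listed below.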
- https://huggingface.co/datasets/Anthropic/hh-rlhf - Human preference data about helpfulness and harmlessness
- https://huggingface.co/datasets/stanfordnlp/SHP - SHP is a dataset of 385K collective human preferences over responses to questions/instructions in 18 different subject areas, from cooking to legal advice.
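
Both preference datasets load directly with the Hugging Face `datasets` library; a minimal sketch (the `chosen`/`rejected` columns follow the hh-rlhf dataset card, while SHP's columns are printed rather than assumed):

```python
from datasets import load_dataset

# Pairwise preference data: each example holds a preferred ("chosen") and a
# non-preferred ("rejected") dialogue transcript.
hh = load_dataset("Anthropic/hh-rlhf", split="train")
print(hh[0]["chosen"][:200])

# Stanford Human Preferences: questions/instructions with two candidate replies
# and an aggregate preference label; inspect the columns before use.
shp = load_dataset("stanfordnlp/SHP", split="train")
print(shp.column_names)
```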