raghavc/LLM-RLHF-Tuning-with-PPO-and-DPO
Comprehensive toolkit for Reinforcement Learning from Human Feedback (RLHF) training, featuring instruction fine-tuning, reward model training, and support for PPO and DPO algorithms with various configurations for the Alpaca, LLaMA, and LLaMA2 models.
Python
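Since the toolkit centers on DPO (alongside PPO), the snippet below sketches the core DPO preference loss for orientation. It is a minimal, hypothetical illustration, not code from this repository: the function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions, and it presumes per-sequence log-probabilities of the chosen and rejected responses have already been computed under both the policy being tuned and a frozen reference model.

```python
# Minimal DPO loss sketch (illustrative only, not this repository's implementation).
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Push the policy to prefer the chosen response over the rejected one
    # relative to the reference model; beta scales the implicit KL penalty.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()


# Dummy usage with a batch of 4 preference pairs.
if __name__ == "__main__":
    b = 4
    loss = dpo_loss(torch.randn(b), torch.randn(b), torch.randn(b), torch.randn(b))
    print(loss.item())
```

Smaller `beta` tolerates larger deviation from the reference model; larger `beta` keeps the tuned policy closer to it.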