verandah/llm_ppo_deepspeed

Customized LLM PPO (reinforcement learning) pipeline built on DeepSpeed. For Amex external usage. Trains a reward model and actor-critic models, using a supervised fine-tuned (SFT) model as the reference.

Python

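The description names the usual pieces of a PPO setup for language models: a reward model, an actor, a critic, and a frozen SFT reference. The sketch below is not taken from this repository. It is a minimal, dependency-free illustration of how those pieces commonly fit together, assuming the widely used formulation in which the reward model's score is added on the last token, a per-token KL penalty keeps the actor close to the reference, and the actor is updated with the clipped PPO objective. The function names and the defaults `kl_coef=0.1` and `clip_eps=0.2` are hypothetical.

```python
import math


def kl_shaped_rewards(reward_score, actor_logprobs, ref_logprobs, kl_coef=0.1):
    """Per-token rewards for one generated sequence (illustrative only).

    Each token is penalized by kl_coef * (log p_actor - log p_ref), and the
    scalar score from the reward model is added on the final token.
    """
    rewards = [-kl_coef * (a - r) for a, r in zip(actor_logprobs, ref_logprobs)]
    rewards[-1] += reward_score
    return rewards


def ppo_clipped_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    """Mean clipped PPO policy loss over tokens (illustrative only).

    In a full pipeline the advantages would be derived from the critic's
    value estimates; here they are simply passed in.
    """
    losses = []
    for new, old, adv in zip(new_logprobs, old_logprobs, advantages):
        ratio = math.exp(new - old)  # probability ratio pi_new / pi_old
        clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
        # Take the pessimistic (smaller) surrogate, negated to form a loss.
        losses.append(-min(ratio * adv, clipped * adv))
    return sum(losses) / len(losses)


if __name__ == "__main__":
    print(kl_shaped_rewards(1.0, [-1.0, -2.0], [-1.0, -1.5]))
    print(ppo_clipped_loss([math.log(2.0)], [0.0], [1.0]))
```

In a real DeepSpeed-based pipeline these quantities would be tensors computed by the actor, critic, reference, and reward models; plain Python lists are used here only to keep the arithmetic visible.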