Issues
concerns about the length of the input
#14 opened by andyclsr
Running PPO with a subset of RMs
#13 opened by vishwa27yvs
What does lastgaelam refer to when computing advantages?
#8 opened by Congcong-Song
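For context on issue #8: in common PPO implementations, `lastgaelam` is the running accumulator used by Generalized Advantage Estimation (GAE), carried backward through the rollout so each step's advantage folds in the discounted advantage of later steps. A minimal sketch of that loop (the function name and arguments here are illustrative, not taken from this repository's code):

```python
import numpy as np

def compute_gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Compute GAE advantages for one trajectory.

    rewards, values: per-step lists of the same length T.
    last_value: value estimate for the state after the final step.
    """
    T = len(rewards)
    advantages = np.zeros(T)
    # lastgaelam carries the discounted sum of future TD errors
    # backward through the trajectory; it starts at 0 past the end.
    lastgaelam = 0.0
    for t in reversed(range(T)):
        next_value = last_value if t == T - 1 else values[t + 1]
        # One-step TD error at time t.
        delta = rewards[t] + gamma * next_value - values[t]
        # Fold in the (gamma * lam)-discounted future advantage.
        lastgaelam = delta + gamma * lam * lastgaelam
        advantages[t] = lastgaelam
    return advantages
```

With `gamma = lam = 1` and zero value estimates, the advantages reduce to reward-to-go sums, which is a quick sanity check for the recursion.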
Issue when running train_sft.sh
#12 opened by yunsaijc
Ran the source code without modifying the model or parameters, but the loss and reward curves oscillate heavily
#9 opened by Congcong-Song
Weights of reward functions in RLHF
#10 opened by jsw7460
Is multi-GPU parallel training supported? During PPO training all the models seem to be loaded onto the same GPU
#7 opened by Congcong-Song
Could you provide the trained modeling_output, e.g. the preference model and reward model?
#6 opened by Congcong-Song
transformers.generation not found during SFT training
#5 opened by Congcong-Song
Is there any plan to share the pre-trained rewards (R1, R2, R3 and R_pref) on HuggingFace?
#4 opened by ZHZisZZ
Open-sourcing the reward models
#1 opened by Glavin001
rename requrements.txt
#2 opened by nishkalavallabhi