Issues
About the results of vanilla DPO
#28 opened by lucasliunju - 1
Reward-KL Comparison
#27 opened by vincezh2000 - 3
SFT training objective
#24 opened by ljb121002 - 4
Reference policy ablations
#18 opened by yesiam-png - 4
Iterative pipeline question
#7 opened by matouk98 - 1
Question about CUDA/NVCC setups
#22 opened by rqzhangberkeley - 8
Questions about Nectar Datasets
#20 opened by XinZhao0211 - 2
pip's dependency conflict: accelerate
#19 opened by liwd190019 - 6
Phi3 has a nearly constant DPO loss of 0.69xx
#17 opened by Arnav0400 - 1
Update the figure in README
#9 opened by WayXG - 5
Questions about DPO
#8 opened by hong-xl - 1
numpy version and transformers version
#14 opened by WayXG - 1
More RLHF algorithms in the implementation
#13 opened by WayXG - 5
Model evaluation issue
#6 opened by matouk98 - 1
Question about the DPO dataset
#12 opened by LiuChen19960902 - 3
Cannot Reproduce the DPO Checkpoint
#3 opened by gesy17 - 1
How to train SFT on an RTX 4090?
#2 opened by utrobinmv - 1
Large max_steps?
#16 opened by hunterlang - 2
Distributed training in stage 3.3 keeps hanging
#11 opened by srzer