Issues
About the results of vanilla DPO
#28 opened by lucasliunju - 1
Reward-KL Comparison
#27 opened by vincezh2000 - 3
SFT training objective
#24 opened by ljb121002 - 4
Reference policy ablations
#18 opened by yesiam-png - 4
Iterative pipeline question
#7 opened by matouk98 - 1
Question about CUDA/NVCC setups
#22 opened by rqzhangberkeley - 8
Questions about Nectar Datasets
#20 opened by XinZhao0211 - 2
pip's dependency conflict: accelerate
#19 opened by liwd190019 - 6
Phi3 has a nearly constant DPO loss of 0.69xx
#17 opened by Arnav0400 - 1
Update the figure in README
#9 opened by WayXG - 5
Questions about DPO
#8 opened by hong-xl - 1
numpy version and transformers version
#14 opened by WayXG - 1
More RLHF algorithms in the implementation
#13 opened by WayXG - 5
Model evaluation issue
#6 opened by matouk98 - 1
Question about the DPO dataset
#12 opened by LiuChen19960902 - 3
Cannot Reproduce the DPO Checkpoint
#3 opened by gesy17 - 1
How to train SFT on an RTX 4090?
#2 opened by utrobinmv - 1
Large max_steps?
#16 opened by hunterlang - 2
Distributed training in stage 3.3 keeps hanging
#11 opened by srzer