eric-mitchell/direct-preference-optimization
Reference implementation for DPO (Direct Preference Optimization)
Python · Apache-2.0
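The repository implements the DPO objective: a preference pair is scored by the difference between the policy-vs-reference log-ratios of the chosen and rejected responses. As a minimal sketch, assuming PyTorch (the `dpo_loss` name and argument names below are illustrative, not necessarily the repo's exact API):

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Per-example DPO loss from summed log-probabilities of each response.

    All inputs are 1-D tensors of shape (batch,). beta scales the implicit
    reward and controls how far the policy may drift from the reference.
    Illustrative sketch only, not the repository's exact function.
    """
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    # DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))
    losses = -F.logsigmoid(beta * (chosen_logratios - rejected_logratios))
    # Implicit rewards, handy for logging accuracies and margins
    chosen_rewards = beta * chosen_logratios.detach()
    rejected_rewards = beta * rejected_logratios.detach()
    return losses, chosen_rewards, rejected_rewards
```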
Issues
When trying to reproduce the complete example, "NotImplementedError: offload_to_cpu=True and NO_SHARD is not supported yet" is thrown
#91 opened by ZSvedic - 2
Error when following the README to train SFT on multiple cards using FSDPTrainer
#51 opened by NekoMimiUnagi - 0
ValueError when using peft on FSDPTrainer
#90 opened by AragornHorse - 5
Qwen model issues: embedding and loss contain NaN
#52 opened by lylcst - 2
In DPO training, I got this ‘train stats after 160768 examples: {'rewards_train/chosen': 'nan', 'rewards_train/rejected': 'nan', 'rewards_train/accuracies': '0', 'rewards_train/margins': 'nan', 'logps_train/rejected': 'nan', 'logps_train/chosen': 'nan', 'loss/train': 'nan', 'examples_per_second': '5.4876', 'grad_norm': 'nan', 'counters/examples': 160768, 'counters/updates': 5024}’
#89 opened by Alan-D-Chen - 15
llama7B issue
#42 opened by JiuhaiChen - 3
Hi @eric-mitchell,
#82 opened by Gryff1ndor - 1
Training process got stuck when loss=dpo sample_during_eval=true trainer=FSDPTrainer
#85 opened by kygguo - 0
GPT4 prompt when evaluating DPO
#88 opened by kygguo - 0
Computing faster logps
#72 opened by alexvishnevskiy - 5
Question about average_log_prob
#40 opened by Kyeongpil - 0
How are evals done on trained models?
#83 opened by lesnikow - 2
Unable to Run SFT
#66 opened by Rui-Yuan91 - 3
Question about _get_batch_logps of trainers.py (see the sketch after this list)
#57 opened by wulaoshi - 1
Where is the config documentation for IPO?
#81 opened by 3244we - 0
Using Mistral 7B with transformers v4.38.1 on the MATH dataset and facing memory leaks
#80 opened by Jayant1234 - 1
Division by Zero error sporadically occurs
#78 opened by Jayant1234 - 2
Question about fine-tuning steps (epochs)
#58 opened by gyuwon12 - 1
Implementation for Plackett-Luce rank model
#71 opened by rohan598 - 1
Reproducing Win Rate inference for TL;DR
#62 opened by jdchang1 - 2
Pythia 2.8B model weights
#50 opened by alexv-cerebras - 0
Question about IPO loss vs DPO loss
#64 opened by MoonBlvd - 2
Using cross entropy loss to calculate DPO?
#67 opened by zachares - 0
Can DPO work on BERT-style models?
#75 opened by Leo-T-Zang - 0
The number of training steps in the SHP dataset
#73 opened by bonin147 - 9
Question about average_log_prob
#48 opened by LSX-Sneakerprogrammer - 0
My Code to Reproduce IMDB
#69 opened by QiyaoWei - 2
Bug in loading Llama tokenizer?
#65 opened by ajyl - 2
Appendix A.4 of the paper: the derived gradient is not consistent with the main text
#63 opened by yflyzhang - 1
Does reference-free mode work?
#38 opened by JosephZZ - 1
Training cost: RLHF vs DPO
#55 opened by kartheekmedathati - 1
Unable to run the code for Step 2: Run SFT
#59 opened by ppsmk388 - 1
Possible Inconsistency(Possibly Typo) in Gradient Definition Between Eq. 7 and Appendix A.4 in DPO paper
#60 opened by rustic-snob - 2
Questions about the IMDB Sentiment dataset
#45 opened by stevie1023 - 2
How to load trained model for inference?
#49 opened by VibhuAg - 0
Is fine-tuning with, e.g., LoRA supported?
#43 opened by Emerald01 - 0
Strange loss pattern
#41 opened by puyuanOT - 0
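Several of the issues above (e.g. #40, #48, #57, #72) ask how per-sequence log-probabilities are computed and whether they should be summed or length-averaged over response tokens. A minimal sketch of that computation, assuming PyTorch; the function mirrors the role of `_get_batch_logps` in trainers.py, but the exact signature here is an assumption:

```python
import torch

def get_batch_logps(logits, labels, average_log_prob=False):
    """Log-probability of each labeled sequence under the model.

    logits: (batch, seq_len, vocab). labels: (batch, seq_len), with -100
    marking prompt and padding tokens that must not contribute.
    Sketch only; names follow common usage, not necessarily the repo's API.
    """
    # Shift so that the logits at position t-1 predict the label at position t.
    logits = logits[:, :-1, :]
    labels = labels[:, 1:].clone()
    loss_mask = labels != -100
    labels[labels == -100] = 0  # dummy index for gather; masked out below

    # Per-token log-probabilities of the observed labels.
    per_token_logps = torch.gather(
        logits.log_softmax(-1), dim=2, index=labels.unsqueeze(2)
    ).squeeze(2)

    if average_log_prob:
        # Length-normalized score: mean log-prob over response tokens.
        return (per_token_logps * loss_mask).sum(-1) / loss_mask.sum(-1)
    # Unnormalized score: sum of log-probs over response tokens.
    return (per_token_logps * loss_mask).sum(-1)
```

Whether to sum or to length-normalize is exactly what the average_log_prob questions are about; the summed form matches the DPO derivation, while averaging changes how the implicit reward scales with response length.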