eric-mitchell/direct-preference-optimization

Why do chosen_rewards sometimes become negative?

DwarfWarriors opened this issue · 0 comments

During DPO training on some datasets, the chosen rewards recorded in the logger (wandb, tensorboard, etc.) are always negative. Is this normal? Why does this happen?
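For context, the "rewards" logged during DPO training are implicit rewards derived from the policy and reference log-probabilities, not values from a separate reward model. A minimal sketch of the computation (variable names here are illustrative, not necessarily the repo's exact identifiers):

```python
import torch

# In DPO, the implicit reward of a response y given prompt x is
#   r(x, y) = beta * (log pi_policy(y|x) - log pi_ref(y|x))
# so chosen_rewards are negative whenever the policy assigns the chosen
# response LOWER probability than the frozen reference model does.
beta = 0.1

# Hypothetical summed log-probs of two chosen responses under the
# trained policy and under the reference model.
policy_chosen_logps = torch.tensor([-120.0, -95.0])
reference_chosen_logps = torch.tensor([-115.0, -100.0])

chosen_rewards = beta * (policy_chosen_logps - reference_chosen_logps)
print(chosen_rewards)  # first entry negative, second positive
```

The DPO loss only pushes the *margin* between chosen and rejected rewards apart; it does not force chosen rewards to be positive, so both rewards drifting negative (with the chosen one less negative) is a commonly observed pattern.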