Why do chosen_rewards sometimes become negative?
DwarfWarriors opened this issue · 0 comments
DwarfWarriors commented
During DPO training on some datasets, the chosen rewards recorded in the logger (wandb, TensorBoard, etc.) are always negative. Is this normal? Why does this happen?
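For context, here is a minimal sketch of how I understand the logged value to be computed. It assumes the logger reports the implicit reward from the DPO formulation, i.e. beta times the difference between the policy and reference log-probabilities of a response. The function name, the log-probability numbers, and beta = 0.1 are all made up for illustration and are not taken from any particular trainer.

```python
import math


def dpo_implicit_reward(policy_logp: float, ref_logp: float, beta: float = 0.1) -> float:
    """Implicit DPO reward: beta * (log pi_policy(y|x) - log pi_ref(y|x)).

    Hypothetical helper for illustration only.
    """
    return beta * (policy_logp - ref_logp)


# Made-up sequence log-probabilities: the policy assigns LESS probability
# than the reference to both responses, but the drop is much larger for
# the rejected one.
chosen_reward = dpo_implicit_reward(policy_logp=-42.0, ref_logp=-40.0)
rejected_reward = dpo_implicit_reward(policy_logp=-55.0, ref_logp=-45.0)

# The DPO loss depends only on the margin between the two rewards.
margin = chosen_reward - rejected_reward
loss = -math.log(1.0 / (1.0 + math.exp(-margin)))

print(f"chosen reward:   {chosen_reward:.3f}")
print(f"rejected reward: {rejected_reward:.3f}")
print(f"margin:          {margin:.3f}")
print(f"loss:            {loss:.3f}")
```

Under this assumption, a negative chosen reward only means the policy's log-probability of the chosen response has dropped below the reference model's. The margin can still be positive, so the loss can keep decreasing while the chosen reward stays below zero.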