eric-mitchell/direct-preference-optimization

Why do chosen_rewards sometimes become negative?

DwarfWarriors opened this issue · 0 comments

During DPO training on some datasets, the chosen rewards recorded in the logger (wandb, tensorboard, etc.) are always negative. Is this normal? Why does this happen?
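For context, the "rewards" logged during DPO training are implicit rewards derived from the policy and reference log-probabilities, not values from a separate reward model. A minimal sketch of the computation (variable names here are illustrative, not necessarily the repo's exact identifiers):

```python
import torch

# In DPO, the implicit reward of a response y given prompt x is
#   r(x, y) = beta * (log pi_policy(y|x) - log pi_ref(y|x))
# so chosen_rewards are negative whenever the policy assigns the chosen
# response LOWER probability than the frozen reference model does.
beta = 0.1

# Hypothetical summed log-probs of two chosen responses under the
# trained policy and under the reference model.
policy_chosen_logps = torch.tensor([-120.0, -95.0])
reference_chosen_logps = torch.tensor([-115.0, -100.0])

chosen_rewards = beta * (policy_chosen_logps - reference_chosen_logps)
print(chosen_rewards)  # first entry negative, second positive
```

The DPO loss only pushes the *margin* between chosen and rejected rewards apart; it does not force chosen rewards to be positive, so both rewards drifting negative (with the chosen one less negative) is a commonly observed pattern.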