junkangwu/beta-DPO

Lines 93 and 94 in Trainer.py don’t seem to use beta? Am I missing something?

Closed this issue · 1 comment

Hi, maybe I didn’t understand your paper well enough: why is beta only applied to the loss, and not to the policy-reference model difference used for the rewards? Could you explain that to me?

```python
losses = -F.logsigmoid(beta_used.detach() * logits)
chosen_rewards = beta * (policy_chosen_logps - reference_chosen_logps).detach()
rejected_rewards = beta * (policy_rejected_logps - reference_rejected_logps).detach()
```

You are correct that beta should also be applied to the policy-reference model difference in the reward terms. Since those rewards are detached and do not affect the training of the model, we did not make that change.
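For reference, here is a minimal sketch of how the reward terms could use the same dynamic `beta_used` as the loss. The function name and signature are assumptions for illustration; only the variable names come from the quoted snippet.

```python
import torch
import torch.nn.functional as F

def dpo_loss_consistent_beta(policy_chosen_logps, policy_rejected_logps,
                             reference_chosen_logps, reference_rejected_logps,
                             beta_used):
    """Sketch: apply beta_used to both the loss and the (logged) rewards.

    Hypothetical helper; structure follows the standard DPO loss, with
    beta_used taken from the quoted snippet.
    """
    # Difference of policy/reference log-ratios between chosen and rejected responses
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = reference_chosen_logps - reference_rejected_logps
    logits = pi_logratios - ref_logratios

    # Loss uses beta_used, detached so beta itself receives no gradient
    losses = -F.logsigmoid(beta_used.detach() * logits)

    # Rewards are detached, so they only serve as logging/eval metrics and do not
    # affect training; using beta_used here keeps them consistent with the loss
    chosen_rewards = beta_used.detach() * (policy_chosen_logps - reference_chosen_logps).detach()
    rejected_rewards = beta_used.detach() * (policy_rejected_logps - reference_rejected_logps).detach()
    return losses, chosen_rewards, rejected_rewards
```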

Thank you for your feedback; we will update the code so the loss and the reward terms are consistent.