Lines 93 and 94 in Trainer.py don’t seem to use beta? Am I missing something?
Closed this issue · 1 comments
geighz commented
Hi, maybe I didn’t understand your paper well enough to see why beta is only applied to the loss and not to the difference between the policy model and the reference model. Could you explain that to me?
```python
losses = -F.logsigmoid(beta_used.detach() * logits)
# return
chosen_rewards = beta * (policy_chosen_logps - reference_chosen_logps).detach()
rejected_rewards = beta * (policy_rejected_logps - reference_rejected_logps).detach()
```
junkangwu commented
You are correct that beta should also be applied to the difference between the policy model and the reference model. Since those reward terms are detached and do not affect training, we did not make that change.
Thank you for your feedback; we will update the code for consistency.
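To make the point above concrete, here is a minimal pure-Python sketch (hypothetical function, names mirroring the snippet above) showing where beta enters: it scales the margin inside the loss, while the chosen/rejected rewards are logging-only quantities (the `.detach()` calls in the PyTorch code), so scaling them by beta changes reported values but not gradients.

```python
import math

def logsigmoid(x):
    # Numerically stable log(sigmoid(x)) for the scalar case.
    return -math.log1p(math.exp(-x))

def dpo_loss_and_rewards(policy_chosen_logps, policy_rejected_logps,
                         reference_chosen_logps, reference_rejected_logps,
                         beta):
    # Implicit reward margin between the chosen and rejected responses.
    logits = ((policy_chosen_logps - reference_chosen_logps)
              - (policy_rejected_logps - reference_rejected_logps))
    # Training loss: beta scales the margin inside the logsigmoid.
    loss = -logsigmoid(beta * logits)
    # Rewards are detached in the real code, i.e. metrics only; applying
    # beta here only rescales what gets logged.
    chosen_rewards = beta * (policy_chosen_logps - reference_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - reference_rejected_logps)
    return loss, chosen_rewards, rejected_rewards

loss, chosen, rejected = dpo_loss_and_rewards(-1.0, -2.0, -1.5, -1.8, beta=0.1)
```

Note that the reward margin `chosen - rejected` equals `beta * logits`, the same quantity the loss is built from, which is why the scaling is purely cosmetic for training.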