A question about `normalize_after`
liutianlin0121 opened this issue · 3 comments
Hi Costa!
A quick question about `normalize_after` for reward normalization:
The current implementation seems to normalize the gain and bias of the reward model using the reward model's backbone (the logit-returning language model). Specifically, for both `normalize_before` and `normalize_after`, `accelerator.unwrap_model(reward_model).pretrained_model` is used to generate the responses.
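For concreteness, here is a hedged sketch of what that normalization step looks like; the `fit_reward_gain_bias` helper and the `reward_model(query_responses)` scoring call are hypothetical (not the repo's actual API), and `rollout_lm` stands for whichever model supplies the responses:

```python
# Minimal sketch (not the repo's exact code) of the gain/bias reward
# normalization, assuming a HuggingFace-style causal LM with `.generate` and a
# hypothetical `reward_model(query_responses)` call that returns scalar scores.
import torch


@torch.no_grad()
def fit_reward_gain_bias(reward_model, rollout_lm, tokenizer, queries,
                         target_mean=0.0, target_std=1.0, max_new_tokens=24):
    # 1) Sample responses from `rollout_lm`. In the current implementation this
    #    is accelerator.unwrap_model(reward_model).pretrained_model, i.e. the
    #    reward model's own backbone.
    inputs = tokenizer(queries, return_tensors="pt", padding=True)
    query_responses = rollout_lm.generate(
        **inputs, max_new_tokens=max_new_tokens, do_sample=True, top_k=0
    )

    # 2) Score the sampled query/response pairs with the (unnormalized) reward model.
    rewards = reward_model(query_responses)  # shape: (batch,)

    # 3) Solve for gain and bias so that `gain * r + bias` has the target mean/std.
    mean, std = rewards.mean(), rewards.std()
    gain = target_std / std
    bias = target_mean - gain * mean
    return gain, bias
```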
However, according to OAI's paper and implementation, it seems they normalize the reward model based on responses generated by the pretrained model. For `normalize_before`, the pretrained model is the same as the reward model's backbone. But for `normalize_after`, differences might arise because `reward_model.pretrained_model` may have been updated during reward learning.
Using the notation of the paper, the responses for normalization come from the fixed pre-trained language model `ref_policy` (link) for both `normalize_before` and `normalize_after`, and it seems `ref_policy` doesn't update during reward learning.
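Reusing the hypothetical `fit_reward_gain_bias` helper from the sketch above (again, not the repo's actual API), the suggested change is just to draw the normalization rollouts from the frozen `ref_policy` in both passes:

```python
# Hedged usage sketch, reusing the hypothetical fit_reward_gain_bias above.
# The point: pass the *frozen* reference policy as the rollout model in both
# normalization passes, instead of reward_model.pretrained_model.

# normalize_before: the backbone and ref_policy still coincide, so nothing changes.
gain, bias = fit_reward_gain_bias(reward_model, ref_policy, tokenizer, queries)

# ... reward learning updates reward_model (and hence its backbone) ...

# normalize_after: keep sampling from the fixed ref_policy, matching OAI's setup.
gain, bias = fit_reward_gain_bias(reward_model, ref_policy, tokenizer, queries)
```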
Thought this detail might interest you! Nevertheless, with a low learning rate and just 1 epoch of reward learning, the practical difference is likely small, as the parameters of the reward model's backbone may not deviate significantly from their initialization.
Oh nice! You're absolutely spot on. We should definitely match this implementation detail. I have just created a PR to fix this issue and will be running experiments to examine its effect.
Hey @liutianlin0121, let's keep the issue open until we have a benchmark :)
No noticeable difference. We are good! Closing the issue.
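For reference, here is the `openrlbenchmark` command used to compare the runs before and after the fix against OAI's original `lm-human-preferences` runs: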
```bash
python -m openrlbenchmark.rlops_multi_metrics \
--filters '?we=openrlbenchmark&wpn=lm-human-preferences&xaxis=_step&ceik=task_id&cen=task.value.policy.initial_model&metrics=ppo/objective/score&metrics=ppo/objective/kl&metrics=ppo/objective/entropy&metrics=ppo/objective/score_total&metrics=ppo/objective/kl_coef&metrics=ppo/ppo/loss/total&metrics=ppo/ppo/loss/value&metrics=ppo/ppo/loss/policy&metrics=ppo/ppo/policy/clipfrac&metrics=ppo/ppo/policy/entropy&metrics=ppo/ppo/returns/mean&metrics=ppo/ppo/policy/approxkl&metrics=ppo/ppo/val/clipfrac&metrics=ppo/ppo/val/error&metrics=ppo/ppo/val/mean&metrics=ppo/ppo/returns/var&metrics=ppo/ppo/val/vpred' \
'124M' \
--filters '?we=openrlbenchmark&wpn=lm_human_preference_details&xaxis=_step&ceik=rewards.value.label_dataset&cen=exp_name&metrics=objective/scores&metrics=objective/kl&metrics=objective/entropy&metrics=objective/score_total&metrics=objective/kl_coef&metrics=ppo/loss/total&metrics=ppo/loss/value&metrics=ppo/loss/policy_avg&metrics=ppo/policy/clipfrac_avg&metrics=ppo/policy/entropy_avg&metrics=ppo/returns/mean&metrics=ppo/policy/approxkl_avg&metrics=ppo/val/clipfrac_avg&metrics=ppo/val/error&metrics=ppo/val/mean&metrics=ppo/returns/var&metrics=ppo/val/vpred' \
'train_policy_accelerate?tag=v0.1.0-68-g2f3aa38&tag=tf_adam&tag=gpt2&cl=tf_adam,gpt2' \
'train_policy_accelerate?tag=v0.1.0-58-g4f42012&tag=tf_adam&tag=gpt2&cl=tf_adam,gpt2 (before PR-10)' \
--env-ids sentiment descriptiveness \
--env-ids sentiment/offline_5k.json descriptiveness/offline_5k.json \
--no-check-empty-runs \
--pc.ncols 6 \
--pc.ncols-legend 1 \
--output-filename static/0compare \
--scan-history
```