vwxyzjn/lm-human-preference-details

A question about `normalize_after`

liutianlin0121 opened this issue · 3 comments

Hi Costa!

A quick question about `normalize_after` for reward normalization:

The current implementation seems to normalize the gain and bias of the reward model using the reward model's backbone (the logit-returning language model). Specifically, for both `normalize_before` and `normalize_after`, `accelerator.unwrap_model(reward_model).pretrained_model` is used to generate the responses.
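
For concreteness, by normalizing the gain and bias I mean the affine rescaling from the paper, as I read it: pick a gain $g$ and a bias $b$ and use

$$r(x, y) = g \cdot r_{\text{raw}}(x, y) + b,$$

with $g$ and $b$ chosen so that the rescaled reward has mean $0$ and variance $1$ over prompts $x \sim \mathcal{D}$ and responses $y$ sampled from the model used for normalization.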

However, according to OAI's paper and implementation, it seems they normalize the reward model based on responses generated by the pretrained model. For `normalize_before`, the pretrained model is the same as the reward model's backbone. But for `normalize_after`, differences might arise because `reward_model.pretrained_model` could have been updated during reward learning.

Using the notation of the paper, the responses used for normalization come from the fixed pretrained language model $\rho$; see the text after Equation (1). In their code, they use `ref_policy` (link) for both `normalize_before` and `normalize_after`, and it seems `ref_policy` is not updated during reward learning.
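
To make the distinction concrete, here is a rough, self-contained sketch with toy stand-ins (not the actual code of either repo; `ToyRewardModel`, `sample_responses`, and the gain/bias arithmetic are simplified placeholders I made up for illustration):

```python
import copy
import torch
import torch.nn as nn

class ToyRewardModel(nn.Module):
    """A stand-in reward model: an LM 'backbone' plus a scalar head with gain/bias."""
    def __init__(self, dim=8):
        super().__init__()
        self.pretrained_model = nn.Linear(dim, dim)  # stand-in for the LM backbone
        self.reward_head = nn.Linear(dim, 1)
        self.reward_gain = nn.Parameter(torch.ones(1))
        self.reward_bias = nn.Parameter(torch.zeros(1))

    def raw_score(self, responses):
        # Un-normalized reward, before applying gain and bias.
        return self.reward_head(self.pretrained_model(responses)).squeeze(-1)

    def forward(self, responses):
        return self.reward_gain * self.raw_score(responses) + self.reward_bias

def sample_responses(generator, queries):
    # Stand-in for autoregressively sampling responses from `generator`.
    with torch.no_grad():
        return generator(queries)

def normalize(reward_model, generator, queries):
    # Set gain/bias so raw scores on `generator`'s responses get mean 0, std 1.
    with torch.no_grad():
        responses = sample_responses(generator, queries)
        scores = reward_model.raw_score(responses)
        gain = 1.0 / (scores.std() + 1e-8)
        bias = -scores.mean() * gain
        reward_model.reward_gain.copy_(gain)
        reward_model.reward_bias.copy_(bias)

reward_model = ToyRewardModel()
queries = torch.randn(64, 8)

# Frozen copy of the pretrained LM (rho), taken *before* any reward learning.
ref_policy = copy.deepcopy(reward_model.pretrained_model).eval()

# normalize_before: the backbone still equals rho, so the two choices coincide.
normalize(reward_model, reward_model.pretrained_model, queries)

# ... reward learning would happen here; the backbone may drift away from rho ...

# normalize_after, current behavior: responses come from the (possibly updated) backbone.
normalize(reward_model, reward_model.pretrained_model, queries)

# normalize_after, as in OAI's code: responses come from the fixed rho (ref_policy).
normalize(reward_model, ref_policy, queries)
```

The only thing that changes between the two `normalize_after` variants is which model is passed as `generator`.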

Thought this detail might interest you! Nevertheless, with a low learning rate and just one epoch of reward learning, the practical difference is likely small, as the parameters of the reward model's backbone may not deviate significantly from their initialization.

Oh nice! You're absolutely spot on. We should definitely match this implementation detail. I have just created a PR to fix this issue and will be running experiments to examine its effect.

Hey @liutianlin0121, let's keep the issue open until we have the benchmark results :)

No noticeable difference. We are good! Closing the issue.

```bash
python -m openrlbenchmark.rlops_multi_metrics \
    --filters '?we=openrlbenchmark&wpn=lm-human-preferences&xaxis=_step&ceik=task_id&cen=task.value.policy.initial_model&metrics=ppo/objective/score&metrics=ppo/objective/kl&metrics=ppo/objective/entropy&metrics=ppo/objective/score_total&metrics=ppo/objective/kl_coef&metrics=ppo/ppo/loss/total&metrics=ppo/ppo/loss/value&metrics=ppo/ppo/loss/policy&metrics=ppo/ppo/policy/clipfrac&metrics=ppo/ppo/policy/entropy&metrics=ppo/ppo/returns/mean&metrics=ppo/ppo/policy/approxkl&metrics=ppo/ppo/val/clipfrac&metrics=ppo/ppo/val/error&metrics=ppo/ppo/val/mean&metrics=ppo/ppo/returns/var&metrics=ppo/ppo/val/vpred' \
        '124M' \
    --filters '?we=openrlbenchmark&wpn=lm_human_preference_details&xaxis=_step&ceik=rewards.value.label_dataset&cen=exp_name&metrics=objective/scores&metrics=objective/kl&metrics=objective/entropy&metrics=objective/score_total&metrics=objective/kl_coef&metrics=ppo/loss/total&metrics=ppo/loss/value&metrics=ppo/loss/policy_avg&metrics=ppo/policy/clipfrac_avg&metrics=ppo/policy/entropy_avg&metrics=ppo/returns/mean&metrics=ppo/policy/approxkl_avg&metrics=ppo/val/clipfrac_avg&metrics=ppo/val/error&metrics=ppo/val/mean&metrics=ppo/returns/var&metrics=ppo/val/vpred' \
        'train_policy_accelerate?tag=v0.1.0-68-g2f3aa38&tag=tf_adam&tag=gpt2&cl=tf_adam,gpt2' \
        'train_policy_accelerate?tag=v0.1.0-58-g4f42012&tag=tf_adam&tag=gpt2&cl=tf_adam,gpt2 (before PR-10)' \
    --env-ids sentiment descriptiveness \
    --env-ids sentiment/offline_5k.json  descriptiveness/offline_5k.json \
    --no-check-empty-runs \
    --pc.ncols 6 \
    --pc.ncols-legend 1 \
    --output-filename static/0compare \
    --scan-history
```

[figure: 0compare — benchmark comparison plots]