In this task we are required to rank comments in a social network. We decided to use the reward-model part of the RLHF approach.
The exploratory data analysis can be found in explore.ipynb.
Pairwise training of a reward model, as used in the InstructGPT paper by OpenAI, is an effective way to rate comments on a given post. Instead of relying on an absolute rating scale, the model learns from the relative difference between two comments, which is especially useful when the scale is inconsistent or poorly defined. Trained on such pairs, the model does not depend on a predefined scale and generalizes better to new data, which makes it a sound choice for this task: it produces more nuanced ratings and is more robust to inconsistencies in the labels.
For each row in the dataset, a list of comments is given, each with a score from 0 to 4 (0 = best, 4 = worst). We use this data to train a reward model that maps a (post, comment) pair to a scalar reward r. The reward model is trained to predict which comment a human would prefer, using the rewards as logits.
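For illustration, here is a minimal sketch of how such rankings could be turned into (chosen, rejected) pairs. The field names ("post", "comments", "text", "score") are assumptions, not the actual dataset schema:

```python
from itertools import combinations

def build_pairs(row):
    """Yield (chosen, rejected) comment pairs from one dataset row.

    A lower score means a better comment (0 = best, 4 = worst), so the
    comment with the smaller score becomes the "chosen" side of the pair.
    """
    pairs = []
    for a, b in combinations(row["comments"], 2):
        if a["score"] == b["score"]:
            continue  # a tie carries no preference signal
        chosen, rejected = (a, b) if a["score"] < b["score"] else (b, a)
        pairs.append({
            "post": row["post"],
            "chosen": chosen["text"],
            "rejected": rejected["text"],
        })
    return pairs
```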
We used the first two steps of the RLHF pipeline:
- Supervised fine-tuning on the given dataset
- Reward model training based on comparisons
We used the following loss function to train our reward model:
```python
loss = -torch.log(torch.sigmoid(chosen_rewards - rejected_rewards)).mean()
```
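For context, here is a minimal sketch of how a GPT-2-based reward model with a scalar head could produce `chosen_rewards` and `rejected_rewards` for this loss. It is an illustration under stated assumptions, not the exact code in reward_model/train_reward_model.py:

```python
import torch
import torch.nn as nn
from transformers import GPT2Model

class RewardModel(nn.Module):
    """GPT-2 backbone with a scalar value head that scores a (post, comment) sequence."""

    def __init__(self, name="gpt2"):
        super().__init__()
        self.backbone = GPT2Model.from_pretrained(name)
        self.v_head = nn.Linear(self.backbone.config.n_embd, 1, bias=False)

    def forward(self, input_ids, attention_mask=None):
        hidden = self.backbone(input_ids, attention_mask=attention_mask).last_hidden_state
        # Take the final position as the sequence representation; this assumes left
        # padding (real implementations usually pick the last non-padding token).
        rewards = self.v_head(hidden[:, -1, :]).squeeze(-1)  # shape: (batch,)
        return rewards

def pairwise_loss(model, chosen_ids, rejected_ids):
    chosen_rewards = model(chosen_ids)
    rejected_rewards = model(rejected_ids)
    # Push the preferred comment's reward above the rejected one's.
    return -torch.log(torch.sigmoid(chosen_rewards - rejected_rewards)).mean()
```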
We used the NDCG metric to compare runs.
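A sketch of how NDCG can be computed per post, assuming scikit-learn; the 0-4 labels are inverted so that a higher relevance means a better comment:

```python
import numpy as np
from sklearn.metrics import ndcg_score

def ndcg_for_post(human_scores, predicted_rewards):
    """human_scores: 0 (best) .. 4 (worst); predicted_rewards: reward model outputs."""
    relevance = 4 - np.asarray(human_scores)  # invert so that higher = better
    return ndcg_score([relevance], [np.asarray(predicted_rewards)])

# Example: the model ranks the best comment highest, so NDCG is 1.0.
print(ndcg_for_post([0, 1, 2, 3, 4], [2.1, 1.3, 0.7, 0.2, -0.5]))
```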
We compared two main models:
- with no additional context
- with post context from the Google BigQuery dataset (see the input-construction sketch below): link
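The two variants differ only in how the model input is assembled. A minimal sketch of the assumed construction (the separator and template are assumptions; the exact formatting lives in the dataset preprocessing):

```python
from typing import Optional

def build_input(comment: str, post: Optional[str] = None, sep: str = "\n\n") -> str:
    """Assumed input construction for the two compared variants."""
    if post is None:
        return comment             # "no additional context": the comment alone
    return post + sep + comment    # "with post context": prepend the post text
```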
W&B report with the run comparison: https://wandb.ai/aleksey-korshuk/huggingface/reports/CUP-IT-Report--VmlldzozODMzODI0
Supervised fine-tuning (step 1 of the pipeline):

```bash
deepspeed reward_model/train_sft.py \
    --model_name_or_path gpt2 \
    --dataset_name AlekseyKorshuk/up-it-ds-sft \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-clm \
    --push_to_hub
```
Reward model training on the pairwise dataset without post context (step 2):

```bash
deepspeed reward_model/train_reward_model.py \
    --model_path AlekseyKorshuk/cup-it-ds-sft-pretrained \
    --dataset_path AlekseyKorshuk/cup-it-ds-pairwise \
    --output_dir no-context
```
Resulting model: https://huggingface.co/AlekseyKorshuk/cup-it-ds-reward-model-no-context
Reward model training with post context:

```bash
deepspeed reward_model/train_reward_model.py \
    --model_path AlekseyKorshuk/cup-it-ds-sft-pretrained \
    --dataset_path ummagumm-a/cup-it-ds-classification-pairwise-train-val \
    --output_dir with-context
```
Resulting model: https://huggingface.co/AlekseyKorshuk/cup-it-ds-reward-model-with-context
To generate scores for the test dataset:
```bash
wget https://huggingface.co/AlekseyKorshuk/cup-it-ds-reward-model-no-context/resolve/main/pytorch_model.bin -O ./rm_checkpoint/no-context/checkpoint-4956/pytorch_model.bin
python3 reward_model/inference.py
```
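Conceptually, the inference step scores every (post, comment) pair and orders the comments of each post by descending reward. A hedged sketch of that idea (the tokenizer choice, input template, and batching are assumptions; the real logic lives in reward_model/inference.py), reusing the RewardModel class sketched earlier:

```python
import torch
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token   # GPT-2 has no pad token by default
tokenizer.padding_side = "left"             # so the final position is a real token

@torch.no_grad()
def rank_comments(model, post, comments, max_length=512, device="cpu"):
    """Return the comments sorted from best to worst according to the reward model.

    model: an instance of the RewardModel sketched above, with trained weights loaded.
    """
    texts = [post + "\n\n" + c for c in comments]  # assumed input template
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=max_length, return_tensors="pt").to(device)
    rewards = model(batch["input_ids"], attention_mask=batch["attention_mask"])
    order = torch.argsort(rewards, descending=True).tolist()
    return [comments[i] for i in order]
```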