In this task we are required to rank comments in a social network. We decided to use the reward-model part of the RLHF approach.
The exploratory data analysis can be found in explore.ipynb.
Pairwise training of a reward model, as used in the InstructGPT paper by OpenAI, is an effective way to rate comments on a given post. Instead of relying on an absolute rating scale, the model learns from the relative difference between two comments, which is especially useful when the scale is inconsistent or poorly defined. Trained on such pairs, the model does not depend on a predefined scale and generalizes better to new data, which makes it a sound choice for this task: it produces more nuanced ratings and is more robust to inconsistencies in the labels.
For each row in the dataset, a list of comments is given, each with a score from 0 to 4 (0 = best, 4 = worst). We use this data to train a reward model that maps a (post, comment) pair to a scalar reward r. The reward model is trained to predict which comment a human would prefer, using the rewards as logits.
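For illustration, here is a minimal sketch of how such rankings could be turned into (chosen, rejected) pairs. The field names ("post", "comments", "text", "score") are assumptions, not the actual dataset schema:

```python
from itertools import combinations

def build_pairs(row):
    """Yield (chosen, rejected) comment pairs from one dataset row.

    A lower score means a better comment (0 = best, 4 = worst), so the
    comment with the smaller score becomes the "chosen" side of the pair.
    """
    pairs = []
    for a, b in combinations(row["comments"], 2):
        if a["score"] == b["score"]:
            continue  # a tie carries no preference signal
        chosen, rejected = (a, b) if a["score"] < b["score"] else (b, a)
        pairs.append({
            "post": row["post"],
            "chosen": chosen["text"],
            "rejected": rejected["text"],
        })
    return pairs
```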
We used the first two steps of the RLHF pipeline:
- Supervised fine-tuning on the given dataset
- Reward model training based on comparisons
We used the following loss function to train our reward model:
```python
loss = -torch.log(torch.sigmoid(chosen_rewards - rejected_rewards)).mean()
```
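For context, here is a minimal sketch of how a GPT-2-based reward model with a scalar head could produce `chosen_rewards` and `rejected_rewards` for this loss. It is an illustration under stated assumptions, not the exact code in reward_model/train_reward_model.py:

```python
import torch
import torch.nn as nn
from transformers import GPT2Model

class RewardModel(nn.Module):
    """GPT-2 backbone with a scalar value head that scores a (post, comment) sequence."""

    def __init__(self, name="gpt2"):
        super().__init__()
        self.backbone = GPT2Model.from_pretrained(name)
        self.v_head = nn.Linear(self.backbone.config.n_embd, 1, bias=False)

    def forward(self, input_ids, attention_mask=None):
        hidden = self.backbone(input_ids, attention_mask=attention_mask).last_hidden_state
        # Take the final position as the sequence representation; this assumes left
        # padding (real implementations usually pick the last non-padding token).
        rewards = self.v_head(hidden[:, -1, :]).squeeze(-1)  # shape: (batch,)
        return rewards

def pairwise_loss(model, chosen_ids, rejected_ids):
    chosen_rewards = model(chosen_ids)
    rejected_rewards = model(rejected_ids)
    # Push the preferred comment's reward above the rejected one's.
    return -torch.log(torch.sigmoid(chosen_rewards - rejected_rewards)).mean()
```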
We used the NDCG metric to compare runs.
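A sketch of how NDCG can be computed per post, assuming scikit-learn; the 0-4 labels are inverted so that a higher relevance means a better comment:

```python
import numpy as np
from sklearn.metrics import ndcg_score

def ndcg_for_post(human_scores, predicted_rewards):
    """human_scores: 0 (best) .. 4 (worst); predicted_rewards: reward model outputs."""
    relevance = 4 - np.asarray(human_scores)  # invert so that higher = better
    return ndcg_score([relevance], [np.asarray(predicted_rewards)])

# Example: the model ranks the best comment highest, so NDCG is 1.0.
print(ndcg_for_post([0, 1, 2, 3, 4], [2.1, 1.3, 0.7, 0.2, -0.5]))
```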
We compared two main models:
- with no additional context
- with post context from the Google BigQuery dataset (see the input-construction sketch below): link
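The two variants differ only in how the model input is assembled. A minimal sketch of the assumed construction (the separator and template are assumptions; the exact formatting lives in the dataset preprocessing):

```python
from typing import Optional

def build_input(comment: str, post: Optional[str] = None, sep: str = "\n\n") -> str:
    """Assumed input construction for the two compared variants."""
    if post is None:
        return comment             # "no additional context": the comment alone
    return post + sep + comment    # "with post context": prepend the post text
```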
W&B report with the run comparison: https://wandb.ai/aleksey-korshuk/huggingface/reports/CUP-IT-Report--VmlldzozODMzODI0
Supervised fine-tuning (step 1 of the pipeline):

```bash
deepspeed reward_model/train_sft.py \
    --model_name_or_path gpt2 \
    --dataset_name AlekseyKorshuk/up-it-ds-sft \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-clm \
    --push_to_hub
```
Reward model training on the pairwise dataset without post context (step 2):

```bash
deepspeed reward_model/train_reward_model.py \
    --model_path AlekseyKorshuk/cup-it-ds-sft-pretrained \
    --dataset_path AlekseyKorshuk/cup-it-ds-pairwise \
    --output_dir no-context
```
Resulting model: https://huggingface.co/AlekseyKorshuk/cup-it-ds-reward-model-no-context
Reward model training with post context:

```bash
deepspeed reward_model/train_reward_model.py \
    --model_path AlekseyKorshuk/cup-it-ds-sft-pretrained \
    --dataset_path ummagumm-a/cup-it-ds-classification-pairwise-train-val \
    --output_dir with-context
```
Resulting model: https://huggingface.co/AlekseyKorshuk/cup-it-ds-reward-model-with-context
To generate scores for the test dataset:
```bash
wget https://huggingface.co/AlekseyKorshuk/cup-it-ds-reward-model-no-context/resolve/main/pytorch_model.bin -O ./rm_checkpoint/no-context/checkpoint-4956/pytorch_model.bin
python3 reward_model/inference.py
```
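Conceptually, the inference step scores every (post, comment) pair and orders the comments of each post by descending reward. A hedged sketch of that idea (the tokenizer choice, input template, and batching are assumptions; the real logic lives in reward_model/inference.py), reusing the RewardModel class sketched earlier:

```python
import torch
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token   # GPT-2 has no pad token by default
tokenizer.padding_side = "left"             # so the final position is a real token

@torch.no_grad()
def rank_comments(model, post, comments, max_length=512, device="cpu"):
    """Return the comments sorted from best to worst according to the reward model.

    model: an instance of the RewardModel sketched above, with trained weights loaded.
    """
    texts = [post + "\n\n" + c for c in comments]  # assumed input template
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=max_length, return_tensors="pt").to(device)
    rewards = model(batch["input_ids"], attention_mask=batch["attention_mask"])
    order = torch.argsort(rewards, descending=True).tolist()
    return [comments[i] for i in order]
```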