lqtrung1998/mwp_ReFT

Hyperparameters used for paper 125M experiments

Closed this issue · 8 comments

Hi,

Could I check what hyperparameters you used for the 125M P-CoT experiments in Table 5 of the paper, for both SFT and RL? Running the SFT scripts in https://github.com/lqtrung1998/mwp_ReFT/tree/main/exps/small_model_exps doesn't reproduce the results. I attach the SFT graphs in the picture below; I am running on a single GPU.

It looks like the effective batch size must have been 8x larger than the batch_size=6 in the scripts, judging by the global step counts of the SFT checkpoints referenced in the RL scripts. For example, for the gsm8k experiment I get global_step_12260_epoch_10, but the RL script in https://github.com/lqtrung1998/mwp_ReFT/blob/main/exps/small_model_exps/rl_gsm8k.sh uses global_step_1540_epoch_10.
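As a back-of-the-envelope check (assuming both checkpoints come from the same 10 full epochs, just dividing the step counts from the checkpoint names):

# Step counts taken from the checkpoint names above; the ratio of optimizer
# steps per epoch gives the ratio of effective batch sizes.
sft_steps_per_epoch = 12260 / 10  # my run with batch_size=6
rl_steps_per_epoch = 1540 / 10    # the checkpoint the RL script expects
print(sft_steps_per_epoch / rl_steps_per_epoch)  # ~7.96, i.e. ~8x, so 6 * 8 = 48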

Would you consider releasing the best SFT checkpoints for the 125M experiments as well?

Thank you!

[Image: SFT evaluation curves]

I have tried with batch_size=48 and see similar results. Is there some other detail about the evaluation that is not mentioned? For example, are the results presented not top@1 but instead obtained from majority voting?

Hi @conglu1997 ,
The hyperparameters are the default settings in the scripts, and the results presented are top@1.

Your evaluation curve looks really off ... did you make any changes to the code that I could take a look at?

To be safe, I would suggest using the same library versions as in requirements.txt. Also, the evaluation code uses multiprocessing with a timeout ... in case your environment's resources are really tight, I suggest increasing the timeout limit.
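(Not the exact evaluation code, but the general pattern is like this sketch; run_program and the 10-second limit are illustrative placeholders:)

import multiprocessing as mp

def run_program(code):
    # Hypothetical stand-in for executing one generated Python solution.
    env = {}
    exec(code, env)
    return env.get("answer")

if __name__ == "__main__":
    with mp.Pool(processes=4) as pool:
        result = pool.apply_async(run_program, ("answer = 1 + 1",))
        try:
            pred = result.get(timeout=10)  # raise this limit on a loaded machine
        except mp.TimeoutError:
            pred = None  # a timeout gets scored as an incorrect answer
    print(pred)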

Best Regards,
Trung

No changes to the code (except I am not running with accelerate and just call python train_sft_model.py), and I used exactly the versions in requirements.txt. I've checked for timeouts and there are not enough of them to make a difference.

Here are my curves for running ReFT on SVAMP with Galactica 125M: pink = SFT, purple/green = ReFT from different SFT checkpoints (either the epoch-10 checkpoint or the best one). The results look significantly lower than those in Table 5.

[Image: SVAMP curves: pink = SFT; purple/green = ReFT from different SFT checkpoints]

Can I see the command that you ran to train? I also reran the experiment ... here are a few of the initial value accuracies printed out:

cat paper_final/_models_outputs_sft_small/gsm8k_python_sdp_galactica_125m/gsm8k_python_sdp_galactica_125m.log | grep "value_accuracy:"
[Eval Info] value_accuracy: 0.75815%
[Eval Info] value_accuracy: 3.9424%
[Eval Info] value_accuracy: 6.141%
[Eval Info] value_accuracy: 10.235%
[Eval Info] value_accuracy: 13.495%
[Eval Info] value_accuracy: 15.466%
[Eval Info] value_accuracy: 16.452%

Here is my command, running with an effective batch size of 48 right now (batch_size=12 with gradient_accumulation_steps=4); previously I used batch_size=6. I am using https://huggingface.co/facebook/galactica-125m and an A100 GPU.

#!/bin/bash

exp_name="gsm8k_python_sdp_galactica_125m_8x"
keep_num_ckpt='40'
batch_size="12"
eval_batch_size="64"
gradient_accumulation_steps="4"
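# effective batch size: batch_size * gradient_accumulation_steps = 12 * 4 = 48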

train_file="data/gsm8k_python_sdp.json"
test_file="data/gsm8k_test_set.json"
engine="python" # 'python' or 'nl'
model_name_or_path="hf_models/galactica-125m"
tokenizer_name_or_path="hf_models/galactica-125m"
model_dir="checkpoints/models_outputs_sft_small/${exp_name}/"
wandb_run_name="${exp_name}"
wandb_log="True"
wandb_project="ReFT_small"
n_epochs="40"
num_workers="8"
learning_rate="2e-5"
weight_decay="0"
warmup_step="-100"
clip_grad_norm="1"
evaluating_epoch_freq="1"
logging_epoch_freq="1"
saving_epoch_freq="1"
logging_step_freq="10"
evaluating_step_freq="-100"
saving_step_freq="-100"
seed="42"
max_input_length="1024"

mkdir -p "${model_dir}"
python train_sft_model.py \
            --model_name_or_path "${model_name_or_path}" \
            --tokenizer_name_or_path "${tokenizer_name_or_path}" \
            --train_file "${train_file}" \
            --test_file "${test_file}" \
            --model_dir "${model_dir}" \
            --batch_size "${batch_size}" \
            --eval_batch_size "${eval_batch_size}" \
            --n_epochs "${n_epochs}" \
            --num_workers "${num_workers}" \
            --learning_rate "${learning_rate}" \
            --weight_decay "${weight_decay}" \
            --warmup_step "${warmup_step}" \
            --clip_grad_norm "${clip_grad_norm}" \
            --evaluating_epoch_freq "${evaluating_epoch_freq}" \
            --logging_epoch_freq "${logging_epoch_freq}" \
            --saving_epoch_freq "${saving_epoch_freq}" \
            --evaluating_step_freq "${evaluating_step_freq}" \
            --logging_step_freq "${logging_step_freq}" \
            --saving_step_freq "${saving_step_freq}" \
            --seed "${seed}" \
            --max_input_length "${max_input_length}" \
            --gradient_accumulation_steps "${gradient_accumulation_steps}" \
            --keep_num_ckpt "${keep_num_ckpt}" \
            --wandb_log "${wandb_log}" \
            --wandb_project "${wandb_project}" \
            --wandb_run_name "${wandb_run_name}" \
            --engine "${engine}" \
            1> >(tee "${model_dir}"/"${exp_name}".log) \
            2> >(tee "${model_dir}"/"${exp_name}".err >&2)

This is mine for that setting:

[Eval Info] value_accuracy: 0.22745%
[Eval Info] value_accuracy: 0.75815%
[Eval Info] value_accuracy: 0.9856%
[Eval Info] value_accuracy: 2.047%

I switched

output_ = model.module.generate(

to:

output_ = model.generate(

because model had no module attribute with my command.
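For reference, a getattr fallback would handle both cases (a minimal sketch, not the repo's code; model.module only exists when the model is wrapped, e.g. by the DistributedDataParallel wrapper that accelerate applies under the hood):

from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("facebook/galactica-125m")
model = AutoModelForCausalLM.from_pretrained("facebook/galactica-125m")

# getattr falls back to the model itself when no .module attribute is present,
# so the same call works with and without a DDP wrapper.
generator = getattr(model, "module", model)
input_ids = tok("Question: 1 + 1 = ?", return_tensors="pt").input_ids
output_ = generator.generate(input_ids, max_new_tokens=16)
print(tok.decode(output_[0]))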

Hi @conglu1997 ,
Why don't you run with accelerate? You can still use the script on a single node with your config by setting num_processes=1, batch_size=12, and gradient_accumulation_steps=4, and the result is still ok ..

cat ppo_paper_final_new/_models_outputs_sft_small/gsm8k_python_sdp_galactica_125m/gsm8k_python_sdp_galactica_125m.log  | grep "value_accuracy:"
[Eval Info] value_accuracy: 5.3071%
[Eval Info] value_accuracy: 10.538%
[Eval Info] value_accuracy: 13.874%
[Eval Info] value_accuracy: 15.997%

I've never tested this script without accelerate though ... so it could be incompatible ...
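Something along these lines should work as the launcher (a sketch; only the launcher changes, the training flags stay as in your script above):

# Replace the bare `python` call with accelerate's launcher; --num_processes 1
# runs a single process on the local GPU.
accelerate launch --num_processes 1 train_sft_model.py \
    --batch_size "12" \
    --gradient_accumulation_steps "4" \
    --model_name_or_path "hf_models/galactica-125m"
    # ...plus the remaining arguments from the SFT script above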

Best Regards,
Trung

OK, changing to accelerate fixed everything. Thanks again for all the help!