alisawuffles/proxy-tuning

About reproducing Llama2-13B-LoRA results on the GSM dataset

Closed this issue · 8 comments

Hi Alisa,
When I was trying to reproduce the results of the Llama-2-13B-LoRA model on the GSM dataset, my experimental results did not reach the level reported in the appendix of your paper:
[image: reported results from the paper]
Here are my questions:

  1. Is the training data the same version as in https://github.com/openai/grade-school-math/blob/master/grade_school_math/data/train.jsonl, or is there preprocessing that is not specified?
  2. By running https://github.com/alisawuffles/proxy-tuning/blob/main/scripts/finetune_lora_with_accelerate.sh, can I obtain a Llama2-13B-LoRA model properly trained on the GSM dataset?
    I tried training with finetune_lora_with_accelerate.sh -> merge_lora.py -> /scripts/eval/gsm.sh (I used the code under "#Evaluation with Llama2", with --model_name_or_path set to the directory where I saved the merged LoRA model). Is this the correct way to use the code? A sketch of my pipeline is below.
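For concreteness, this is roughly the pipeline I ran. The output directories and the merge_lora.py argument names are from my own setup and may not match the script's actual argparse, so please treat them as placeholders:

```bash
# Sketch of the pipeline I ran (output paths and the merge_lora.py flag
# names below are placeholders from my own setup).

# 1. LoRA-finetune Llama-2-13B on GSM
bash scripts/finetune_lora_with_accelerate.sh

# 2. Merge the LoRA weights into the base model
#    (check merge_lora.py for the exact argument names in your checkout)
python merge_lora.py \
    --base_model_name_or_path meta-llama/Llama-2-13b-hf \
    --lora_model_name_or_path output/llama2_13b_gsm_lora/ \
    --output_dir output/llama2_13b_gsm_lora_merged/

# 3. In scripts/eval/gsm.sh, under "#Evaluation with Llama2", set
#    --model_name_or_path to output/llama2_13b_gsm_lora_merged/, then run:
bash scripts/eval/gsm.sh
```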

Thank you!

I also have this question. Could the author provide the finetuned model or the details of the LoRA training?

Hi both! Here's an updated data folder that contains the processed training data for GSM and TriviaQA. @hyy-2000, the second step you describe is correct. In finetune_lora_with_accelerate.sh, just set TRAIN_FILE=data/train/gsm/train.jsonl.
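Concretely, the only change needed is the training-file variable inside the script (the rest can stay as is):

```bash
# Inside scripts/finetune_lora_with_accelerate.sh, point the training data
# at the processed GSM file from the data folder linked above:
TRAIN_FILE=data/train/gsm/train.jsonl

# then launch the script as usual
bash scripts/finetune_lora_with_accelerate.sh
```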

Hi Alisa,
Thank you for your response and update! When I was using the GSM training data from the original source, the code reported that the columns 'prompt' and 'completion' were missing, so I directly renamed the keys 'question' and 'answer' in train.jsonl to 'prompt' and 'completion' (roughly as sketched below).
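Just to explain the workaround, my naive conversion looked roughly like this (it assumes jq is installed and only renames the keys; it does not reproduce whatever prompt formatting your processed file uses, so your released data/train/gsm/train.jsonl should be preferred):

```bash
# Naive key rename on the original GSM train.jsonl (requires jq).
# Only renames question/answer to prompt/completion; no other preprocessing.
jq -c '{prompt: .question, completion: .answer}' train.jsonl > train_prompt_completion.jsonl
```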
After training -> merging -> testing, I get approximately 13% accuracy for Llama2-13B-LoRA on the GSM dataset, which I believe is a successful reproduction of /main/results/gsm/llama2-gsm-lora-13B-lr2e-5/metrics.json. However, the hyperparameters for LoRA finetuning specified in Table 13 of your paper are partly different from those in finetune_lora_with_accelerate.sh, so I'm wondering:

  1. Were the results in the picture above obtained with lr 2e-5 (according to the paper) or lr 1e-4 (according to finetune_lora_with_accelerate.sh), or some other learning rate?
  2. Are the hyperparameters the same for both GSM and TriviaQA?
  3. Are the hyperparameters the same for LoRA-finetuning both the 13B and 70B models?

Thank you for your time!

Thanks for your reply!

I have rerun the experiments following the author's suggestions, but I still cannot reproduce the results.
Could the author provide the details of the LoRA training?

This would be beneficial to our research community.

Hi both, there is a mistake in Table 13; the learning rate in the script is correct, which is 1e-4. This learning rate comes from Tulu 2's QLoRA implementation (Appendix B). As you have found, we also saw that LoRA was sensitive to the learning rate; the results we reported use 1e-4 because it worked better than 2e-5, but we did not search over more values.
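For a quick sanity check, you can grep the LoRA script for the learning-rate setting (a small sketch, assuming the value appears as a learning_rate argument in the script):

```bash
# Print the learning-rate line(s) from the LoRA script; this should show 1e-4.
grep -n "learning_rate" scripts/finetune_lora_with_accelerate.sh
```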

Yes, the hyperparameters are the same for both GSM and TriviaQA, and for both the 13B and 70B models.

We will update the preprint with the correct learning rate; it should be live on ArXiv on Monday. I apologize for the mistake and thank you so much for catching this!

If you still have trouble reproducing the results, please let me know!

@hyy-2000 As for the columns of the dataframe, the dataset that I uploaded has prompt and completion as the fields. Are you sure that you saw question and answer instead? The correct training file is data/train/gsm/train.jsonl.
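A quick way to double-check the fields (a small sketch; it assumes jq is available):

```bash
# Print the JSON keys of the first record in the processed training file;
# for data/train/gsm/train.jsonl this should list "completion" and "prompt".
head -n 1 data/train/gsm/train.jsonl | jq 'keys'
```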

Thank you Alisa!
I have reproduced the LoRA-finetuning results for the Llama2-13B model on the GSM dataset.
As for the columns issue, sorry for the confusion: I was trying to explain how I had used the GSM train.jsonl from https://github.com/openai/grade-school-math/blob/master/grade_school_math/data/train.jsonl and how my preprocessing went wrong. Your updated training data is correct and worked perfectly when I reproduced the results. Thanks again!