WooooDyy/Self-Polish

The experiment accuracy seems to be too low for gpt-3.5-turbo

ShadyPi opened this issue · 4 comments

Good idea! Polishing the question instead of the prompt is a new angle for improving LLM performance. However, the performance of gpt-3.5-turbo in your experiment is quite confusing because the accuracy is too low. According to my own experiment, without any extra prompt, gpt-3.5-turbo reaches 68% accuracy on GSM8K, higher than all of your GSM8K results. The standard prompt in your code seems to have a counterproductive effect on gpt-3.5-turbo's performance.

Thanks for your comment!

First, the prompts for standard Few-shot (without CoT or other techniques) are collected from paper [1], following [2]. We report the results without performing any prompt engineering. Second, all the answer-side prompts are orthogonal to our method; here we just want to demonstrate the effectiveness of our problem-side method.

[1] Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

[2] Progressive-Hint Prompting Improves Reasoning in Large Language Models

Thanks for your reply!

I understand what you mean. You have shown that your method works on davinci-002 and davinci-003. However, while the prompts from paper [1] and paper [2] may be useful for GPT-3 and InstructGPT (e.g., davinci-002 and 003), they don't work for gpt-3.5-turbo, which is fine-tuned for chat.

According to my own experiment, with no prompt at all and just extracting the last number in the model's response as the answer, the accuracy reaches 68%. Therefore, while your method does improve performance, it's quite strange that your polish method combined with CoT or LtM only reaches 65.5% on GSM8K.
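For concreteness, the check is roughly the following (a simplified sketch, not my exact script; the regex and the float comparison are illustrative, while the `####` gold-answer delimiter is GSM8K's own format):

```python
import re

# Illustrative pattern: matches integers or decimals, ignoring thousands separators.
NUMBER_RE = re.compile(r"-?\d+(?:\.\d+)?")

def extract_last_number(text: str) -> str | None:
    """Return the last number appearing in the model's response, if any."""
    matches = NUMBER_RE.findall(text.replace(",", ""))
    return matches[-1] if matches else None

def extract_gold(gsm8k_answer: str) -> str:
    """GSM8K stores the gold answer after the '####' delimiter."""
    return gsm8k_answer.split("####")[-1].strip().replace(",", "")

def is_correct(response: str, gsm8k_answer: str) -> bool:
    """Score a prediction by comparing the last number in the response to the gold answer."""
    pred = extract_last_number(response)
    return pred is not None and float(pred) == float(extract_gold(gsm8k_answer))
```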

Thank you very much for the detailed comments!

Actually, in the Standard Few-shot setting, the prompt does not elicit the model to generate intermediate reasoning steps; instead, the model generates the answer directly.
But if you feed gpt-3.5-turbo only the question (without any prompt), it tends to generate intermediate reasoning steps, which conflicts with our setting.
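Schematically, a Standard Few-shot exemplar (following the standard-prompting baseline of [1]) looks like this, so the model is conditioned to answer directly:

```
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: The answer is 11.

Q: <test question>
A:
```

With no such exemplars, a chat-tuned model falls back to its default behavior and explains its reasoning step by step.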

We tested one example without prompts and found that gpt-3.5-turbo outputs intermediate steps:

{"question": "A robe takes 2 bolts of blue fiber and half that much white fiber.  How many bolts in total does it take?", "answer": "It takes 2/2=<<2/2=1>>1 bolt of white fiber\nSo the total amount of fabric is 2+1=<<2+1=3>>3 bolts of fabric\n#### 3"}

By the way, the 65.5% result you mentioned was achieved with davinci-003, which is the main model of our experiments.

Thanks for your explanation! I'm sorry that I got the backbone model in Table 3 wrong. In that case, I look forward to your further research on polishing questions to improve the performance of gpt-3.5-turbo in the no-prompt setting!