yingweima2022/CodeLLM

[Question about paper] Does reasoning ability not come from code?


Wonderful work! But I have a few questions about the experimental results.

  1. In Table 2, NL (2.6B) has benchmark scores very close to CODE (2.6B) on everything besides MBPP. Doesn't this seem to indicate that training on pure natural language can still teach the model to reason, and that adding code only increases its score on code generation?
    [screenshot of Table 2]

  2. In Table 4, both NL and CODE benefit a lot from CoT. Is this strong evidence that CoT ability does not come from code?
    [screenshot of Table 4]

Again, wonderful work! Thanks!

Thanks for your insightful and constructive comments.

  1. We updated Table 2 in the latest version and added a significance test: we conducted a t-test on the predicted scores, shown in the table below. All p-values are less than 0.05, indicating that the results are statistically significant. This illustrates the impact of code data on general reasoning ability in the pre-training stage (a minimal sketch of how such a test can be run is included after this list).
    [table of t-test p-values]

    To supplement the experimental results, we selected the high-school mathematics and high-school physics subsets of the MMLU [1] test set to evaluate the model at the pre-training stage. MMLU is a widely used benchmark for evaluating the comprehensive ability of LLMs [2, 3], and its mathematics and physics subsets better reflect a model's reasoning ability. The results and conclusions are shown in the table below. We have added these results to the revised paper and highlighted them.

    [1] Hendrycks D, Burns C, Basart S, et al. Measuring massive multitask language understanding[J]. arXiv preprint arXiv:2009.03300, 2020.

    [2] https://openai.com/research/gpt-4

    [3] Touvron H, Martin L, Stone K, et al. Llama 2: Open foundation and fine-tuned chat models[J]. arXiv preprint arXiv:2307.09288, 2023.
    [table of MMLU results]
    *Note: Source of LLaMA results: https://github.com/baichuan-inc/Baichuan-7B

  2. In Table 4, both NL (2.6B) and CODE (2.6B) effectively utilize CoT information, and CODE (2.6B) achieves better results than NL (2.6B). Therefore, we infer that code data may help the model better utilize the Chain-of-Thought for reasoning. This paper does not delve into whether code data is the cause of Chain-of-Thought ability, but we agree this is an interesting direction, because many recent works try to use programming methods [4] to solve reasoning problems (a small illustration follows after this list). In addition, for the Chain-of-Thought question, you may find an answer in this blog [5].

    [4] Gao L, Madaan A, Zhou S, et al. Pal: Program-aided language models[C]//International Conference on Machine Learning. PMLR, 2023: 10764-10799.
    [5] https://yaofu.notion.site/How-does-GPT-Obtain-its-Ability-Tracing-Emergent-Abilities-of-Language-Models-to-their-Sources-b9a57ac0fcf74f30a1ab9e3e36fa1dc1
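
A minimal sketch of how such a significance test can be run (a paired t-test on per-question correctness; not necessarily the exact setup used in the paper), with the MMLU high-school subsets loaded from the public `cais/mmlu` configs on the Hugging Face Hub:

```python
# Sketch only: paired t-test on per-question correctness of two checkpoints.
# The score arrays are placeholders standing in for real evaluation outputs.
import numpy as np
from scipy.stats import ttest_rel      # pip install scipy
from datasets import load_dataset      # pip install datasets

# 1 = question answered correctly, 0 = wrong, for the same set of questions.
nl_scores   = np.array([1, 0, 0, 1, 0, 1, 0, 0, 1, 0])   # e.g. NL(2.6B)
code_scores = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])   # e.g. CODE(2.6B)

t_stat, p_value = ttest_rel(code_scores, nl_scores)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")  # p < 0.05 -> difference is significant

# MMLU high-school mathematics / physics subsets (public Hugging Face configs);
# accuracy on each subset is simply the mean of the per-question scores.
hs_math    = load_dataset("cais/mmlu", "high_school_mathematics", split="test")
hs_physics = load_dataset("cais/mmlu", "high_school_physics", split="test")
print(len(hs_math), len(hs_physics))
```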
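And to make the "programming methods" point concrete, a small illustration of the program-aided idea in [4]: the model writes Python as its reasoning steps and an interpreter, not the model, computes the final answer. The `generate` function below is a hypothetical stand-in for any code-LLM completion call, not an API from [4]:

```python
# Sketch of program-aided reasoning: the LLM emits a Python program as its
# chain of reasoning, and executing that program yields the answer.

QUESTION = ("Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
            "Each can has 3 tennis balls. How many tennis balls does he have now?")

def generate(prompt: str) -> str:
    # Placeholder for a real model call; returns the kind of program expected.
    return (
        "def solution():\n"
        "    initial_balls = 5\n"
        "    cans = 2\n"
        "    balls_per_can = 3\n"
        "    return initial_balls + cans * balls_per_can\n"
    )

program = generate(f"Write a Python function solution() that answers:\n{QUESTION}")
namespace: dict = {}
exec(program, namespace)          # run the model-written reasoning steps
print(namespace["solution"]())    # -> 11
```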

Finally, thank you for paying attention to our work. If you have any questions, please feel free to discuss them with us.

Thanks for your prompt and detailed answer!

  1. Sure, CODE (2.6B) does have higher scores than NL (2.6B),
    but it doesn't seem to improve that much over the original table; ScienceQA improves by less than 0.2 points, while the code-related benchmarks improve much more. That looks more like a domain-specific improvement.
    Considering the noise introduced by differences other than code between the CODE and NL training data, we can't tell much from that. But the new result on MMLU_Physics seems quite supportive.

  2. Thanks for the additional information! Regarding CoT, I only roughly checked the original paper, in which CoT is reported as an emergent ability at large scale (>8B). I didn't expect it to appear at 2.6B as well. Will read that, thanks!
    [screenshot of the referenced CoT emergence result]