TencentARC/LLaMA-Pro

Comparison with PEFT

LaVieEnRose365 opened this issue · 1 comment

Hi there! It's really interesting work, and I have the following questions:

  1. I think the proposed block expansion is quite similar to the idea of adapter tuning. Can you explain the main difference?
  2. The results demonstrate that more expansion blocks lead to better results, which add 1B additional parameters in total. Block expansion is also claimed to be superior to LoRA, but LoRA's low-rank property means it adds only a few parameters. Did you compare the performance of block expansion and LoRA under the same number of additional parameters (a rough back-of-the-envelope count is sketched below)?
    It would really be a pleasure if you could reply.

Thanks for your attention!
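For reference, here is a rough back-of-the-envelope count of what LoRA adds per rank versus what one full expansion block adds. This is my own illustration with assumed LLaMA-7B shapes (hidden size 4096, MLP intermediate size 11008, 32 layers), not numbers from the paper:

```python
# Rough parameter-count comparison (my own illustration, not from the paper).
# Assumed LLaMA-7B shapes: hidden size 4096, MLP intermediate size 11008, 32 layers.
hidden, inter, n_layers = 4096, 11008, 32

# One LLaMA decoder block: q/k/v/o attention projections + gate/up/down MLP
# projections (layer norms ignored, they are negligible).
block_params = 4 * hidden * hidden + 3 * hidden * inter

# LoRA adds two low-rank matrices (d x r and r x d) per adapted weight matrix.
def lora_params(rank, adapted_weights_per_layer=4):  # e.g. adapting q/k/v/o
    return n_layers * adapted_weights_per_layer * 2 * hidden * rank

print(f"one expansion block:   {block_params / 1e6:.1f}M params")     # ~202.4M
print(f"LoRA r=8 on q/k/v/o:   {lora_params(8) / 1e6:.1f}M params")   # ~8.4M
print(f"LoRA r=256 on q/k/v/o: {lora_params(256) / 1e6:.1f}M params") # ~268.4M
```

Under these assumptions, matching the parameter count of even a single expansion block would require LoRA ranks in the hundreds, which is the regime the question is asking about.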

I think the main difference between our work and PEFT methods is that we scale up the parameters. We have seen the power of scaling in models like GPT, Claude, and so on. We ran an experiment where LoRA tunes roughly as many parameters as we add through expansion; however, it cannot generalize well in the specific domain. We hypothesize that PEFT methods are limited in their capacity to learn new knowledge, which matters for (continual) pretraining. PEFT is useful for SFT: as one group recently noted (URIAL), at the SFT stage the model mainly learns style or format. So I think PEFT methods are better suited to tasks like learning style or format, rather than learning more knowledge, which requires dense parameters to hold it during pretraining.
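For context, here is a minimal sketch of what the block-expansion step looks like, assuming a Hugging Face `LlamaForCausalLM`. The zero-initialization of the output projections (so each copied block starts as an identity mapping on the residual stream) follows my understanding of the paper, but the helper function and the example model id are my own simplification, not the repo's training code:

```python
# Minimal block-expansion sketch; my own simplified illustration, not the repo's code.
import copy
import torch
from transformers import LlamaForCausalLM

def expand_blocks(model, num_groups=8):
    """Interleave identity-initialized copies of existing decoder blocks."""
    layers = model.model.layers
    group = len(layers) // num_groups   # one new block after every `group` originals
    new_layers = torch.nn.ModuleList()
    for i, layer in enumerate(layers):
        new_layers.append(layer)
        if (i + 1) % group == 0:
            block = copy.deepcopy(layer)
            # Zero the output projections so the copied block initially contributes
            # nothing and the residual stream passes through unchanged.
            torch.nn.init.zeros_(block.self_attn.o_proj.weight)
            torch.nn.init.zeros_(block.mlp.down_proj.weight)
            block.requires_grad_(True)  # only the new blocks are trained
            new_layers.append(block)
    model.model.layers = new_layers
    model.config.num_hidden_layers = len(new_layers)
    # (A full implementation would also renumber each layer's self_attn.layer_idx
    # so KV-cache indexing stays consistent.)
    return model

model = LlamaForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model.requires_grad_(False)                  # freeze the original backbone
model = expand_blocks(model, num_groups=8)   # e.g. 32 -> 40 layers
```

The key point is that the new blocks are full, dense decoder layers rather than low-rank adapters, so they add real capacity for the continual-pretraining corpus.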

Recently, another interesting work, Yi-9B, also notes this property. It likewise uses depth expansion and then trains on math and code corpora, and it reports that without scaling the parameters, continual training only marginally improves performance.

So basically, I think the main difference is that we increase the parameters on top of the initial model for continual pretraining, while PEFT is better suited to the subsequent SFT stage.

I hope this will be helpful!