Comparison with full ShareGPT, Alpaca, and Tulu in Table 1 / Figure 1
apoorvumang opened this issue · 3 comments
Hi,
I noticed that Figure 1 and Table 1 do not include numbers for the full versions of ShareGPT, Alpaca, and Tulu, but Figure 2 seems to indicate that you have these numbers available.
Could you please share the numbers and comparisons for these datasets, if possible, and explain how IM compares to IT in the full-dataset setting?
Hi, many thanks for your question.
I noticed that Figure 1 and Table 1 do not include numbers for the full versions of ShareGPT, Alpaca, and Tulu.
We are using the datasets provided by the Tulu V2 paper. Please refer to https://github.com/allenai/open-instruct.
But Figure 2 seems to indicate that you have these numbers available. Could you please share the numbers and comparisons for these datasets, if possible?
Please refer to Appendices A and C, where we provide more details about these experiments.
How does IM compare to IT in the full-dataset setting?
When the full dataset is used, employing our Instruction Modelling (IM) may not always be beneficial. In such cases, if you have a substantial number of training examples, you can use the Instruction Tuning (IT) approach directly. However, when dealing with limited instruction tuning data and short completions, using IM during training can be more advantageous. Our goal is to offer a flexible solution that can be tailored to the specific requirements of different projects.
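For clarity, here is a minimal sketch of the distinction between the two setups, assuming the common Hugging Face convention of masking labels with -100; the helper and variable names below are illustrative rather than our exact training code.

```python
# Minimal sketch (not the paper's implementation) of how IT and IM differ
# when building training labels. Labels set to -100 are ignored by the
# cross-entropy loss under the usual Hugging Face convention.

IGNORE_INDEX = -100  # tokens with this label contribute no loss

def build_labels(prompt_ids, completion_ids, use_instruction_modelling):
    """Concatenate prompt and completion tokens and build the label sequence.

    IT (use_instruction_modelling=False): loss only on completion tokens;
       prompt tokens are masked out with IGNORE_INDEX.
    IM (use_instruction_modelling=True): loss on both prompt and completion
       tokens, i.e. the model also learns to predict the instruction.
    """
    input_ids = prompt_ids + completion_ids
    if use_instruction_modelling:
        labels = list(input_ids)  # IM: predict every token
    else:
        labels = [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids)  # IT
    return input_ids, labels

# Hypothetical token ids for a short instruction and its completion.
prompt, completion = [101, 102, 103, 104], [201, 202]
print(build_labels(prompt, completion, use_instruction_modelling=False))
# -> ([101, 102, 103, 104, 201, 202], [-100, -100, -100, -100, 201, 202])
print(build_labels(prompt, completion, use_instruction_modelling=True))
# -> ([101, 102, 103, 104, 201, 202], [101, 102, 103, 104, 201, 202])
```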
Thanks, I got great insights from your paper! I think it may be somewhat connected to this older work, https://arxiv.org/abs/2209.14389, which finds that pretraining on fine-tuning data boosts performance (this was from the BERT era). The connection is that being able to predict input tokens is a valuable signal.
The drawback might be that predicting input tokens can reduce generalizability to out-of-distribution prompts. Looking forward to hearing any thoughts you have on that.
Thank you for sharing this work with us. It is particularly relevant to our earlier research at NeurIPS 2023, from the BERT era. In that study, we also explored continued pre-training on the downstream task data and incorporated a prompt template during continued pre-training.
For both our work and the paper you shared, it seems reasonable to trade some generalizability for improved performance on specific downstream tasks, since we fine-tune LMs on the target tasks after continued pre-training anyway.
Regarding our work on IM, we believe that including the prompt in the loss, in the scenarios mentioned above, can actually help mitigate overfitting, which suggests that it does not necessarily reduce generalizability.