TencentAILabHealthcare/scBERT

performance without pretraining

liuzh2016 opened this issue · 2 comments

I did an ablation study on whether pretraining benefits downstream tasks, by fine-tuning without loading the state dict from the checkpoint. All other settings were kept the same, and the fine-tuning process stopped at epoch 34. All reported metrics are comparable to the paper's results; the metrics for some rare cell types, and overall, are even better.
How can the benefit of pretraining be demonstrated?

Did you ever figure this out?

We would like to clarify this issue for our users.
The advantage of large models does not lie in achieving the best performance on every task, but rather in the fact that the knowledge encoded in the pre-trained parameters transfers to a wide variety of downstream tasks, including those with few labeled examples.
The tests on unseen data and the benchmarking on diverse downstream tasks across tissues and cell types presented in the original scBERT paper demonstrate the success of the BERT paradigm of pre-training followed by fine-tuning. Moreover, as originally reported in Figure 5d of the scBERT paper, we observed that after pre-training scBERT generates more effective cell representations for unseen data than the raw data provides.
Beyond the fact that scBERT performs better than the model without pretraining, we think that simply skipping the checkpoint-loading step makes this ablation of a non-pretrained scBERT problematic. In our fine-tuning setup, we unfreeze only the last few layers of the pre-trained scBERT architecture for parameter updates. In the ablation experiment described above, readers modified our fine-tuning script by simply skipping the checkpoint-loading step. As a result, all parameters were randomly initialized, yet only the small fraction of parameters in the last few layers was unfrozen for updates, rather than training the entire model on the training set. This is not a correct or sufficient training procedure for a model without pre-training (see the sketch below).
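To make the difference concrete, here is a minimal PyTorch sketch of the two setups. The model, layer names, and checkpoint path are placeholders for illustration, not the exact code from our fine-tuning script:

```python
import torch.nn as nn

# Hypothetical stand-in for the scBERT backbone and classification head.
class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.ModuleList([nn.Linear(200, 200) for _ in range(6)])
        self.head = nn.Linear(200, 10)

    def forward(self, x):
        for layer in self.encoder:
            x = layer(x).relu()
        return self.head(x)

model = TinyModel()

# In fine-tuning, a pre-trained checkpoint would normally be loaded first, e.g.:
# model.load_state_dict(torch.load("pretrained.pth"))

# Fine-tuning recipe: freeze everything, then unfreeze only the last layer + head.
for p in model.parameters():
    p.requires_grad = False
for p in model.encoder[-1].parameters():
    p.requires_grad = True
for p in model.head.parameters():
    p.requires_grad = True

# (a) Flawed ablation: run the code above *without* loading the checkpoint.
#     The frozen encoder layers keep their random initialization and are never trained.

# (b) Proper no-pretraining baseline: skip the checkpoint, but unfreeze everything
#     so that the entire model is actually trained on the labeled data.
for p in model.parameters():
    p.requires_grad = True
```

In setup (a), most of the network is both randomly initialized and excluded from optimization, so the comparison does not measure what training without pre-training can achieve.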
Moreover, such an incorrect ablation study implicitly demonstrates the soundness of the scBERT architecture design and its efficiency in embedding, transforming, and abstracting raw single-cell data into useful knowledge. A similar case is the replacement of AlexNet by ResNet for various visual tasks, which is due to the superiority of ResNet's architecture; the overall effectiveness of that architecture does not depend on a specific set of parameters (it can start from weights pre-trained on ImageNet or from other initializations such as He or Xavier initialization). Large models inherently contain redundant parameters that can be pruned and compressed without significant performance degradation. When we first proposed scBERT, it surpassed all the state-of-the-art algorithms and achieved the best results in the benchmarking, further demonstrating the superiority of the architecture design. Since then, the Transformer-based architecture we first proposed has become the mainstream model architecture in the single-cell foundation model area.
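For illustration, a minimal sketch of what "other initializations" means in practice (the small network and file name below are placeholders, not scBERT's actual code):

```python
import torch.nn as nn

# Hypothetical small network standing in for "the same architecture".
model = nn.Sequential(nn.Linear(200, 200), nn.GELU(), nn.Linear(200, 10))

def xavier_init(module: nn.Module) -> None:
    """Apply Xavier (Glorot) initialization to every linear layer."""
    if isinstance(module, nn.Linear):
        nn.init.xavier_uniform_(module.weight)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

# Option 1: load pre-trained weights, as in fine-tuning:
# model.load_state_dict(torch.load("pretrained.pth"))

# Option 2: train the same architecture from a principled random initialization:
model.apply(xavier_init)
```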