What is the appropriate accuracy of a pre-trained model?
Hi, thank you for your work! I pre-trained the model on my own data with the following input. The training data is the whole-genome sequence of Chlorella, sliced into rows of 1,000 bases each, giving 53,160 rows in total; the first 80% (42,528 rows) was used as the training set and the remaining 20% (10,632 rows) as the test set.
I used the run_mlm.py script from Hugging Face: https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling.
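A minimal sketch of this preparation (file names such as `chlorella_genome.fasta`, `train.txt`, and `validation.txt` are illustrative, not my actual paths):

```python
def read_fasta_sequence(path):
    """Concatenate all sequence lines of a FASTA file into one string."""
    chunks = []
    with open(path) as handle:
        for line in handle:
            if not line.startswith(">"):
                chunks.append(line.strip().upper())
    return "".join(chunks)

genome = read_fasta_sequence("chlorella_genome.fasta")

# Slice into non-overlapping 1,000-base rows (one sequence per line);
# the final row may be shorter than 1,000 bases.
rows = [genome[i:i + 1000] for i in range(0, len(genome), 1000)]

# First 80% of the rows -> training set, remaining 20% -> test set,
# written as plain-text files that run_mlm.py can consume.
split = int(len(rows) * 0.8)
with open("train.txt", "w") as f:
    f.write("\n".join(rows[:split]))
with open("validation.txt", "w") as f:
    f.write("\n".join(rows[split:]))
```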
However, the accuracy of the pre-trained model was only 0.1117. How can I improve the pre-training accuracy, and what accuracy should a pre-trained model reach before it is suitable for subsequent fine-tuning?
Thank you so much!
Thanks for using our work! First of all, the MLM accuracy is highly related to the vocabulary size and masking rate. Are you using DNABERT-2's tokenizer and a masking rate of 15%? In our case, the MLM accuracy in pre-training is around 30%.
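For reference, a quick way to check which tokenizer and vocabulary size you are actually training with (a minimal sketch using the standard `transformers` API; the printed example sequence is arbitrary):

```python
from transformers import AutoTokenizer

# trust_remote_code=True is needed because DNABERT-2 ships custom code on the Hub.
tokenizer = AutoTokenizer.from_pretrained(
    "zhihan1996/DNABERT-2-117M", trust_remote_code=True
)

print(type(tokenizer).__name__)            # which tokenizer class was resolved
print(len(tokenizer))                      # vocabulary size the MLM head predicts over
print(tokenizer("ATTGGCAT")["input_ids"])  # example DNA sequence tokenized into BPE ids
```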
Thanks for your reply! The `tokenizer_name` argument of run_mlm.py points to the downloaded zhihan1996/DNABERT-2-117M model (/share/home/yuyadan/workspace/DNABERT_2/DNABERT-2-117M). The masking rate is `mlm_probability`, which defaults to 0.15. Should I modify the vocabulary size and masking rate to improve the accuracy of the pre-trained model?
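For context, this is how I understand `mlm_probability` to be applied inside run_mlm.py (a minimal sketch using the standard `transformers` data collator; the public model name is used here instead of my local path):

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained(
    "zhihan1996/DNABERT-2-117M", trust_remote_code=True
)

# The collator randomly masks this fraction of tokens in each batch;
# MLM accuracy is then computed only on those masked positions.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,  # the default masking rate referenced above
)

encoding = tokenizer("ATTGGCATGCATTA", return_special_tokens_mask=True)
batch = collator([encoding])
print(batch["input_ids"])  # some token ids replaced by the mask token id
print(batch["labels"])     # -100 everywhere except the masked positions
```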