What is the appropriate accuracy of a pre-trained model?
Hi, thank you for your work! I pre-trained the model on my own data with the following input. The training data is the whole-genome sequence of Chlorella, sliced into rows of 1,000 bases each, giving 53,160 rows in total; the first 80% (42,528 rows) was used as the training set and the remaining 20% (10,632 rows) as the test set.
I used the run_mlm.py script from Hugging Face: https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling.
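A minimal sketch of this preparation (file names such as `chlorella_genome.fasta`, `train.txt`, and `validation.txt` are illustrative, not my actual paths):

```python
def read_fasta_sequence(path):
    """Concatenate all sequence lines of a FASTA file into one string."""
    chunks = []
    with open(path) as handle:
        for line in handle:
            if not line.startswith(">"):
                chunks.append(line.strip().upper())
    return "".join(chunks)

genome = read_fasta_sequence("chlorella_genome.fasta")

# Slice into non-overlapping 1,000-base rows (one sequence per line);
# the final row may be shorter than 1,000 bases.
rows = [genome[i:i + 1000] for i in range(0, len(genome), 1000)]

# First 80% of the rows -> training set, remaining 20% -> test set,
# written as plain-text files that run_mlm.py can consume.
split = int(len(rows) * 0.8)
with open("train.txt", "w") as f:
    f.write("\n".join(rows[:split]))
with open("validation.txt", "w") as f:
    f.write("\n".join(rows[split:]))
```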
However, the accuracy of the pre-trained model was only 0.1117. How can I improve the pre-training accuracy, and what accuracy should a pre-trained model reach before it is suitable for subsequent fine-tuning?
Thank you so much!
Thanks for using our work! First of all, the MLM accuracy is highly related to the vocabulary size and masking rate. Are you using DNABERT-2's tokenizer and a masking rate of 15%? In our case, the MLM accuracy in pre-training is around 30%.
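For reference, a quick way to check which tokenizer and vocabulary size you are actually training with (a minimal sketch using the standard `transformers` API; the printed example sequence is arbitrary):

```python
from transformers import AutoTokenizer

# trust_remote_code=True is needed because DNABERT-2 ships custom code on the Hub.
tokenizer = AutoTokenizer.from_pretrained(
    "zhihan1996/DNABERT-2-117M", trust_remote_code=True
)

print(type(tokenizer).__name__)            # which tokenizer class was resolved
print(len(tokenizer))                      # vocabulary size the MLM head predicts over
print(tokenizer("ATTGGCAT")["input_ids"])  # example DNA sequence tokenized into BPE ids
```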
Thanks for your reply! The `tokenizer_name` argument of run_mlm.py points to the downloaded zhihan1996/DNABERT-2-117M model (/share/home/yuyadan/workspace/DNABERT_2/DNABERT-2-117M). The masking rate is `mlm_probability`, which defaults to 0.15. Should I modify the vocabulary size and masking rate to improve the accuracy of the pre-trained model?
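For context, this is how I understand `mlm_probability` to be applied inside run_mlm.py (a minimal sketch using the standard `transformers` data collator; the public model name is used here instead of my local path):

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained(
    "zhihan1996/DNABERT-2-117M", trust_remote_code=True
)

# The collator randomly masks this fraction of tokens in each batch;
# MLM accuracy is then computed only on those masked positions.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,  # the default masking rate referenced above
)

encoding = tokenizer("ATTGGCATGCATTA", return_special_tokens_mask=True)
batch = collator([encoding])
print(batch["input_ids"])  # some token ids replaced by the mask token id
print(batch["labels"])     # -100 everywhere except the masked positions
```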