MAGICS-LAB/DNABERT_2

Unable to reproduce covid results

anihab opened this issue · 2 comments

Hello,

I am trying to reproduce your results on the GUE covid dataset using the same script and parameters you've provided on GitHub. However, after multiple attempts I am still unable to reproduce the results reported in the paper. All three models (DNABERT-2, DNABERT-1, NT) are reporting low MCC scores close to 0.
For example, I received the following results when finetuning DNABERT-2 on the covid task:

{"eval_loss": 2.181016683578491, "eval_accuracy": 0.13514397905759162, "eval_f1": 0.09417480427631085, "eval_matthews_correlation": 0.01867806816220235, "eval_precision": 0.10709376616065985, "eval_recall": 0.1250636815111456, "eval_runtime": 20.7597, "eval_samples_per_second": 441.625, "eval_steps_per_second": 13.825, "epoch": 5.0}

Since all of the models are producing such low results, I was wondering if there may be an error in the uploaded dataset or the provided scripts for the covid task. Thank you in advance for any help!

Thanks for sharing your results here. For this dataset, the model sometimes converges to a local minimum early in training due to the large difference between DNABERT-2's pretraining data and the evaluation dataset. But you should be able to get a reasonable result by simply using another random seed.
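If it helps when comparing several seeds, here is a minimal sketch for flagging runs that collapsed the way the one above did. It assumes you have collected each run's final eval metrics into a dict keyed by seed; the function name `find_collapsed_runs` and the `mcc_threshold` of 0.05 are hypothetical choices for illustration, not part of the DNABERT-2 scripts.

```python
def find_collapsed_runs(results_by_seed, mcc_threshold=0.05):
    """Return the keys of runs whose eval MCC is within mcc_threshold of zero.

    results_by_seed maps a seed (or any run label) to a metrics dict that
    contains "eval_matthews_correlation", as in the output posted above.
    The threshold is an arbitrary illustrative cutoff.
    """
    return sorted(
        seed
        for seed, metrics in results_by_seed.items()
        if abs(metrics["eval_matthews_correlation"]) < mcc_threshold
    )


# The seed of the run reported in this issue is not known, so it is
# labeled "reported_run" here.
reported = {"eval_matthews_correlation": 0.01867806816220235}
print(find_collapsed_runs({"reported_run": reported}))  # ['reported_run']
```

Any run this flags is a candidate for rerunning with a different seed.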

I got similar results. In particular, I noticed that the uploaded Covid dataset contains characters such as Y and R, while DNA sequences should only contain {A,T,G,C}. I suppose this could be the source of the issue.
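For anyone who wants to check this on their own copy, here is a rough sketch that counts the characters outside {A, C, G, T} in a set of sequences. It assumes the data is a CSV with a column named `sequence`; that column name and the helper names are assumptions, so adjust them to match your files.

```python
import csv
from collections import Counter

CANONICAL = set("ACGT")


def non_canonical_counts(sequences):
    """Count every character outside {A, C, G, T} across all sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(ch for ch in seq.upper() if ch not in CANONICAL)
    return counts


def fraction_affected(sequences):
    """Return the fraction of sequences with at least one non-ACGT character."""
    seqs = list(sequences)
    if not seqs:
        return 0.0
    affected = sum(1 for seq in seqs if set(seq.upper()) - CANONICAL)
    return affected / len(seqs)


def read_sequences(csv_path, column="sequence"):
    """Read one column of sequences from a CSV file (column name is assumed)."""
    with open(csv_path, newline="") as handle:
        return [row[column] for row in csv.DictReader(handle)]
```

Calling `non_canonical_counts(read_sequences(path))` on each split shows which extra characters occur and how often, and `fraction_affected` shows how many sequences contain them, which should help judge whether they are frequent enough to matter.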