Unable to reproduce covid results

Question


anihab opened this issue 6 months ago · 2 comments

Hello,

I am trying to reproduce your results on the GUE covid dataset using the script and parameters you provided on GitHub. However, after multiple attempts I am still unable to reproduce the results reported in the paper. All three models (DNABERT-2, DNABERT-1, NT) report MCC scores close to 0.
For example, I received the following results when fine-tuning DNABERT-2 on the covid task:

{"eval_loss": 2.181016683578491, "eval_accuracy": 0.13514397905759162, "eval_f1": 0.09417480427631085, "eval_matthews_correlation": 0.01867806816220235, "eval_precision": 0.10709376616065985, "eval_recall": 0.1250636815111456, "eval_runtime": 20.7597, "eval_samples_per_second": 441.625, "eval_steps_per_second": 13.825, "epoch": 5.0}

Since all of the models are producing such low results, I was wondering whether there may be an error in the uploaded dataset or in the provided scripts for the covid task. Thank you in advance for any help!

Answer 1 · 2024-07-25T21:20:05.000Z

Thanks for sharing your results here. For this dataset, the model sometimes converges to a local minimum early in training because of the large difference between DNABERT-2's pretraining data and the evaluation dataset. You should be able to get a reasonable result by simply using another random seed.
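If you rerun with several seeds, a small helper can flag which runs collapsed and pick the best one. This is only a sketch: the function names `is_collapsed` and `best_run` and the 0.1 threshold are hypothetical choices for illustration, not part of the repository. It only assumes the metrics come as JSON in the same shape as the result posted above.

```python
import json

def is_collapsed(metrics, mcc_threshold=0.1):
    """Return True when a run's MCC is near zero, i.e. close to chance.

    The 0.1 threshold is an arbitrary illustrative cut-off.
    """
    return abs(metrics["eval_matthews_correlation"]) < mcc_threshold

def best_run(runs):
    """Given a dict mapping seed -> metrics, return the (seed, metrics)
    pair with the highest MCC."""
    return max(runs.items(), key=lambda kv: kv[1]["eval_matthews_correlation"])

# Two fields copied from the DNABERT-2 result posted in the question.
reported = json.loads(
    '{"eval_accuracy": 0.13514397905759162, '
    '"eval_matthews_correlation": 0.01867806816220235}'
)
print(is_collapsed(reported))  # True
```

Collecting one metrics dict per seed and passing them to `best_run` would then show whether any seed escapes the collapsed state.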

Answer 2 · 2024-11-27T08:34:57.000Z

I got similar results. In particular, I noticed that the uploaded Covid dataset contains characters such as Y and R, whereas DNA sequences should only use {A,T,G,C}. I suppose this could be the source of the issue.
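To check how widespread these characters are, something like the following could be run over the sequences. It is a minimal sketch: `count_invalid_chars`, `fraction_with_invalid`, and `load_sequences` are hypothetical helpers, and the `sequence` column name is an assumption about the CSV layout that should be adjusted to the actual file. The two sequences in the example are toy data, not taken from the dataset.

```python
import csv
from collections import Counter

VALID_BASES = set("ACGT")

def load_sequences(path, column="sequence"):
    """Read one column of a CSV file; the column name is an assumption."""
    with open(path, newline="") as handle:
        return [row[column] for row in csv.DictReader(handle)]

def count_invalid_chars(sequences):
    """Count every character that is not A, C, G, or T (case-insensitive)."""
    counts = Counter()
    for seq in sequences:
        for ch in seq.upper():
            if ch not in VALID_BASES:
                counts[ch] += 1
    return counts

def fraction_with_invalid(sequences):
    """Fraction of sequences containing at least one non-ACGT character."""
    if not sequences:
        return 0.0
    bad = sum(1 for seq in sequences if set(seq.upper()) - VALID_BASES)
    return bad / len(sequences)

# Toy example, not real dataset content.
toy = ["ACGTACGT", "ACGYTRAC"]
print(dict(count_invalid_chars(toy)))  # {'Y': 1, 'R': 1}
print(fraction_with_invalid(toy))      # 0.5
```

If only a small fraction of sequences is affected, the characters alone may not explain MCC scores near 0, so it is worth comparing this count against the seed-related explanation above.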