Pretraining, Pretraining, Pretraining!!!
multydoffer opened this issue · 2 comments
Please, please, please release the code for pretraining. I am dying for it.
Sorry for the delay in sharing the pre-training code. For DNABERT-2 we used a slightly modified version of the MosaicBERT implementation: https://github.com/mosaicml/examples/tree/main/examples/benchmarks/bert . You should be able to replicate the model training by following the instructions there.
Alternatively, you can use run_mlm.py from https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling and import the BertModelForMaskedLM class from https://huggingface.co/zhihan1996/DNABERT-2-117M/blob/main/bert_layers.py. It should produce a very similar model.
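In case it helps, here is a minimal sketch of that second route using the Hugging Face Trainer instead of run_mlm.py. It is not the exact DNABERT-2 recipe: the dataset file, hyperparameters, and sequence length are placeholders, and it assumes the model repo's auto_map resolves to the masked-LM class in bert_layers.py when loaded with trust_remote_code.

```python
# Minimal MLM pretraining sketch (assumptions noted above; not the official recipe).
from datasets import load_dataset
from transformers import (
    AutoConfig,
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "zhihan1996/DNABERT-2-117M"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# For pretraining from scratch, build the model from the config rather than
# loading the released checkpoint weights. Assumes the repo's auto_map exposes
# the masked-LM head defined in bert_layers.py.
config = AutoConfig.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForMaskedLM.from_config(config, trust_remote_code=True)

# Placeholder corpus: a plain-text file with one DNA sequence per line.
raw = load_dataset("text", data_files={"train": "dna_sequences.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

# Standard dynamic masking for MLM.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

args = TrainingArguments(
    output_dir="dnabert2-mlm",
    per_device_train_batch_size=32,
    num_train_epochs=1,
    learning_rate=5e-4,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```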
The training data is available here: https://drive.google.com/file/d/1dSXJfwGpDSJ59ry9KAp8SugQLK35V83f/view?usp=sharing
Thanks a lot!