BarcodeBERT

A pre-trained transformer model for inference on insect DNA barcoding data.

Primary language: Python. License: MIT.
Model weights

4-mers
5-mers
6-mers

Reproducing the results from the paper

  1. Clone this repository and install the required libraries by running:
pip install -e .
  2. Download the data:
wget https://vault.cs.uwaterloo.ca/s/x7gXQKnmRX3GAZm/download -O data.zip
unzip data.zip
mv new_data/* data/
rm -r new_data
rm data.zip

CNN model

Training:

cd scripts/CNN/
python 1D_CNN_supervised.py

Evaluation:

python 1D_CNN_genus.py
python 1D_CNN_Linear_probing.py

BarcodeBERT

Model Pretraining:

cd scripts/BarcodeBERT/
python MGPU_MLM_train.py --input_path=../../data/pre_training.tsv --k_mer=4 --stride=4
python MGPU_MLM_train.py --input_path=../../data/pre_training.tsv --k_mer=5 --stride=5
python MGPU_MLM_train.py --input_path=../../data/pre_training.tsv --k_mer=6 --stride=6
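The three commands above differ only in the k-mer length and stride. As a rough illustration of what those two numbers control, here is a minimal sketch that assumes `--k_mer` is the token length in bases and `--stride` is the step between consecutive tokens. The function `kmer_tokens` is hypothetical and is not the tokenizer used by `MGPU_MLM_train.py`; in particular, dropping a trailing partial k-mer is an assumption.

```python
from __future__ import annotations


def kmer_tokens(sequence: str, k: int, stride: int) -> list[str]:
    """Split a DNA sequence into k-mers, advancing `stride` bases per step.

    Hypothetical helper for illustration only; a trailing fragment shorter
    than k is dropped here, which may differ from the repository's behaviour.
    """
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    # Stop at the last start position where a full k-mer still fits.
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]


barcode = "ACGTACGTAC"
# stride == k gives non-overlapping tokens, as in the commands above.
print(kmer_tokens(barcode, k=4, stride=4))
# stride == 1 gives fully overlapping tokens.
print(kmer_tokens(barcode, k=4, stride=1))
```

With the stride equal to k, as in all three pretraining commands, each base belongs to at most one token.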

Evaluation:

python MLM_genus_test.py 4
python MLM_genus_test.py 5
python MLM_genus_test.py 6

python Linear_probing.py 4
python Linear_probing.py 5
python Linear_probing.py 6

Model Fine-tuning:

To fine-tune the model, you need a folder with three files: "train", "test", and "dev". Each file should have two columns, one called "sequence" and the other called "label". You also need to specify the path to the pre-trained model you want to use for fine-tuning, using --Pretrained_checkpoint_path.
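The folder layout described above can be checked before launching a run. The following is a minimal sketch, not part of the repository: `check_finetuning_folder` is a hypothetical helper, and it assumes the files are tab-separated with a header row and may carry any extension (for example `train.tsv`). Adjust the delimiter if your files differ.

```python
import csv
from pathlib import Path

REQUIRED_SPLITS = ("train", "test", "dev")
REQUIRED_COLUMNS = {"sequence", "label"}


def check_finetuning_folder(folder, delimiter="\t"):
    """Return a list of problems found in the folder (empty if none).

    Assumes each split file starts with a header row; the delimiter and
    the file extension are assumptions, not taken from the repository.
    """
    problems = []
    folder = Path(folder)
    for split in REQUIRED_SPLITS:
        # Accept any extension, since the text only names the file stems.
        matches = sorted(p for p in folder.glob(f"{split}*") if p.is_file())
        if not matches:
            problems.append(f"missing file for split '{split}'")
            continue
        # Read only the header row and compare it with the required columns.
        with matches[0].open(newline="") as handle:
            header = next(csv.reader(handle, delimiter=delimiter), [])
        missing = REQUIRED_COLUMNS - set(header)
        if missing:
            problems.append(f"{matches[0].name}: missing columns {sorted(missing)}")
    return problems
```

Calling `check_finetuning_folder("path_to_the_input_folder")` returns an empty list when all three files are present and each header contains both column names.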

python Fine_tuning.py --input_path=path_to_the_input_folder --Pretrained_checkpoint_path path_to_the_pretrained_model --k_mer=4 --stride=4
python Fine_tuning.py --input_path=path_to_the_input_folder --Pretrained_checkpoint_path path_to_the_pretrained_model --k_mer=5 --stride=5
python Fine_tuning.py --input_path=path_to_the_input_folder --Pretrained_checkpoint_path path_to_the_pretrained_model --k_mer=6 --stride=6

DNABERT

To fine-tune the model on our data, first follow the instructions in the original DNABERT repository to download the model weights. Place them in the dnabert folder and then run the following:

cd scripts/DNABERT/
python supervised_learning.py --input_path=../../data -k 4 --model dnabert --checkpoint dnabert/4-new-12w-0
python supervised_learning.py --input_path=../../data -k 5 --model dnabert --checkpoint dnabert/5-new-12w-0
python supervised_learning.py --input_path=../../data -k 6 --model dnabert --checkpoint dnabert/6-new-12w-0

DNABERT-2

To fine-tune the model on our dataset, follow the instructions in the DNABERT-2 repository for fine-tuning the model on a new dataset. You can use the same input path that is used for fine-tuning BarcodeBERT.

Citation

If you find BarcodeBERT useful in your research, please consider citing:

@misc{arias2023barcodebert,
  title={{BarcodeBERT}: Transformers for Biodiversity Analysis},
  author={Pablo Millan Arias
    and Niousha Sadjadi
    and Monireh Safari
    and ZeMing Gong
    and Austin T. Wang
    and Scott C. Lowe
    and Joakim Bruslund Haurum
    and Iuliia Zarubiieva
    and Dirk Steinke
    and Lila Kari
    and Angel X. Chang
    and Graham W. Taylor
  },
  year={2023},
  eprint={2311.02401},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  doi={10.48550/arxiv.2311.02401},
}