
CLIBD: Bridging Vision and Genomics for Biodiversity Monitoring at Scale

This is the official implementation for "CLIBD: Bridging Vision and Genomics for Biodiversity Monitoring at Scale". Links: website | paper

Overview

Taxonomically classifying organisms at scale is crucial for monitoring biodiversity, understanding ecosystems, and promoting sustainability. Organisms can be classified taxonomically from either their images or their DNA barcodes. While DNA barcodes are precise for species identification, they are less readily available than images. We therefore investigate whether DNA barcodes can be used to improve taxonomic classification from images.

We introduce CLIBD, a model that uses contrastive learning to map biological images, DNA barcodes, and textual taxonomic labels into the same latent space. The model is initialized with pretrained encoders for images (vit-base-patch16-224), DNA barcodes (BarcodeBERT), and textual taxonomic labels (BERT-small), and the encoder weights are fine-tuned using LoRA. The aligned image-DNA embedding space improves taxonomic classification from images and allows cross-modal retrieval from image to DNA. We train CLIBD on the BIOSCAN-1M and BIOSCAN-5M insect datasets, which provide paired images of insects and their DNA barcodes, along with their taxonomic labels.
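At its core, training relies on CLIP-style contrastive alignment between modalities. Below is a minimal sketch of a symmetric InfoNCE loss between image and DNA embeddings (illustrative only, not the repository's implementation; the batch size, embedding dimension, and temperature are placeholder values, and the full model also aligns the text modality):

import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, dna_emb, temperature=0.07):
    # Normalize so the dot product becomes cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    dna_emb = F.normalize(dna_emb, dim=-1)
    logits = image_emb @ dna_emb.t() / temperature
    # Paired image/DNA records share the same index within the batch.
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric InfoNCE: image-to-DNA plus DNA-to-image.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

# Random features standing in for encoder outputs (batch of 8, dim 768).
loss = contrastive_loss(torch.randn(8, 768), torch.randn(8, 768))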

Setup environment

CLIBD was developed using Python 3.10 and PyTorch 2.0.1. We recommend using a GPU with CUDA for efficient training and inference; our models were developed with CUDA 11.7 and 12.4.
We also recommend using miniconda to manage your environments.

To set up the environment with the necessary dependencies, run the following commands:

conda create -n CLIBD python=3.10 -y
conda activate CLIBD
conda install pytorch=2.0.1 torchvision=0.15.2 torchtext=0.15.2 pytorch-cuda=11.7 -c pytorch -c nvidia -y
pip install -r requirements.txt
pip install -e .
pip install git+https://github.com/Baijiong-Lin/LoRA-Torch

Depending on your GPU and CUDA version, you may have to modify the torch version and other package versions in requirements.txt.
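After installation, you can verify that the installed PyTorch build can use your GPU (a simple sanity check, not part of the CLIBD scripts):

import torch

print(torch.__version__)           # e.g. 2.0.1
print(torch.cuda.is_available())   # should print True if CUDA is set up correctly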

Pretrained embeddings and models

We provide pretrained embeddings and model weights. We evaluate our models by encoding the image or DNA barcode and using the taxonomic labels of the closest matching embedding (from either an image or a DNA barcode). See Download dataset and Running experiments for how to get the data and how to train and evaluate the models.

Training data | Aligned modalities | Embeddings | Model | Config
BIOSCAN-1M | None | Embedding | N/A | Link
BIOSCAN-1M | Image + DNA | Embedding | Link | Link
BIOSCAN-1M | Image + DNA + Tax | Embedding | Link | Link
BIOSCAN-5M | None | Embedding | N/A | Link
BIOSCAN-5M | Image + DNA | Embedding | Link | Link
BIOSCAN-5M | Image + DNA + Tax | Embedding | Link | Link

We also provide checkpoints trained with LoRA layers. You can download them from this Link

Quick start

Instead of running a full training, you can download pre-trained models or pre-extracted embeddings for evaluation from the table above. You may need to place the downloaded checkpoints and extracted features in the locations expected by the config file.

Download dataset

Data Partitioning Visual
For BIOSCAN-1M, we partition the dataset for our CLIBD experiments into a training set for contrastive learning, plus validation and test partitions. The training set contains records without species labels as well as records from a set of seen species. The validation and test sets include both seen and unseen species, and are further split into query and key subpartitions for evaluation.

For BIOSCAN-5M, we use the dataset partitioning established in the BIOSCAN-5M paper.

For training and reproducing our experiments, we provide HDF5 files with BIOSCAN-1M and BIOSCAN-5M images. See DATA.md for format details. We also provide scripts for generating the HDF5 files directly from the BIOSCAN-1M and BIOSCAN-5M data.

Download BIOSCAN-1M data (79.7 GB)

# From project folder
mkdir -p data/BIOSCAN_1M/split_data
cd data/BIOSCAN_1M/split_data
wget https://aspis.cmpt.sfu.ca/projects/bioscan/clip_project/data/version_0.2.1/BioScan_data_in_splits.hdf5

Download BIOSCAN-5M data (190.4 GB)

# From project folder
mkdir -p data/BIOSCAN_5M/split_data
cd data/BIOSCAN_5M/split_data
wget https://aspis.cmpt.sfu.ca/projects/bioscan/BIOSCAN_CLIP_for_downloading/BIOSCAN_5M.hdf5

For more information about the hdf5 files, please check DATA.md.
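If you would like to inspect the contents of a downloaded HDF5 file, a small sketch using h5py is shown below (it simply lists whatever groups and datasets the file contains; see DATA.md for the authoritative layout):

import h5py

# Print every group and dataset stored in the BIOSCAN-1M HDF5 file.
with h5py.File("data/BIOSCAN_1M/split_data/BioScan_data_in_splits.hdf5", "r") as f:
    f.visititems(lambda name, obj: print(name, obj))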

Download data for generating hdf5 files

You can check BIOSCAN-1M and BIOSCAN-5M to download the TSV files needed to regenerate the HDF5 files. They are not required if you use the pre-generated HDF5 files above.

Running experiments

We recommend using Weights & Biases (wandb) to track and log experiments.

Activate Wandb

Register for or log in to a free wandb account, then authenticate from the command line:

wandb login
# Paste your wandb's API key

Checkpoints

Download the checkpoints for BarcodeBERT and bioscan_clip and place them under ckpt:

# From project folder
mkdir -p ckpt/BarcodeBERT/5_mer
cd ckpt/BarcodeBERT/5_mer
wget https://aspis.cmpt.sfu.ca/projects/bioscan/clip_project/ckpt/BarcodeBERT/model_41.pth
cd ../../..
mkdir -p ckpt/bioscan_clip/trained_with_bioscan_1m
cd ckpt/bioscan_clip/trained_with_bioscan_1m
wget https://aspis.cmpt.sfu.ca/projects/bioscan/BIOSCAN_CLIP_for_downloading/ckpt/bioscan_clip/trained_with_bioscan_1m/image_dna_text.pth
cd ../../..
mkdir -p ckpt/bioscan_clip/trained_with_bioscan_5m
cd ckpt/bioscan_clip/trained_with_bioscan_5m
wget https://aspis.cmpt.sfu.ca/projects/bioscan/BIOSCAN_CLIP_for_downloading/ckpt/bioscan_clip/trained_with_bioscan_5m/image_dna_text.pth
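To confirm a downloaded checkpoint is readable, you can load it on CPU (a quick hedged check; the internal structure of the checkpoint is defined by the training script, not assumed here):

import torch

# Load the checkpoint on CPU just to confirm the file downloaded intact.
state = torch.load("ckpt/bioscan_clip/trained_with_bioscan_1m/image_dna_text.pth", map_location="cpu")
print(type(state))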

To download all CLIBD pre-trained models: Link

Train

Use train_cl.py with the appropriate model_config to train CLIBD.

# From project folder
python scripts/train_cl.py 'model_config={config_name}'

To train the full model (I+D+T) using BIOSCAN-1M:

# From project folder
python scripts/train_cl.py 'model_config=lora_vit_lora_barcode_bert_lora_bert_ssl'

For multi-GPU training, you may need to specify the transport used for communication between GPUs with NCCL_P2P_LEVEL:

NCCL_P2P_LEVEL=NVL python scripts/train_cl.py 'model_config=lora_vit_lora_barcode_bert_lora_bert_ssl'

For example, the following command loads the pre-trained ViT-B, BarcodeBERT, and BERT-small encoders and fine-tunes them through contrastive learning on BIOSCAN-5M. Note that this training updates only their LoRA layers, not all of the parameters.

python scripts/train_cl.py 'model_config=lora_vit_lora_barcode_bert_lora_bert_5m'
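To illustrate what updating only the LoRA layers means, here is a conceptual sketch in plain PyTorch (not the LoRA-Torch code this repository actually uses; the rank, scaling, and layer sizes are arbitrary):

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a pretrained linear layer; only the low-rank A/B matrices are trained."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # freeze the pretrained weight
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.02)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # Frozen base projection plus the trainable low-rank update.
        return self.base(x) + self.scale * (x @ self.lora_a.t() @ self.lora_b.t())

layer = LoRALinear(nn.Linear(768, 768))
print([n for n, p in layer.named_parameters() if p.requires_grad])  # ['lora_a', 'lora_b']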

Evaluation

During evaluation, we use the trained encoders to obtain embeddings for the input images or DNA barcodes, then find the closest matching image or DNA embedding and use its taxonomic labels as the predicted labels. We report both micro and class-averaged accuracy for seen and unseen species.
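A minimal sketch of this nearest-neighbour label transfer and the two accuracy metrics is shown below (illustrative only; scripts/inference_and_eval.py handles the actual query/key partitions and the different taxonomic levels):

import numpy as np

def knn_predict(query_emb, key_emb, key_labels):
    # Cosine similarity via L2-normalised dot products; take the closest key's label.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    k = key_emb / np.linalg.norm(key_emb, axis=1, keepdims=True)
    nearest = (q @ k.T).argmax(axis=1)
    return key_labels[nearest]

def micro_and_macro_accuracy(pred, true):
    micro = (pred == true).mean()
    macro = np.mean([(pred[true == c] == c).mean() for c in np.unique(true)])
    return micro, macro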

TODO: specify how to run evaluation for different models, and different query and key combinations.

To run evaluation for BIOSCAN-1M:

# From project folder
python scripts/inference_and_eval.py 'model_config=lora_vit_lora_barcode_bert_lora_bert_ssl'

To run evaluation for BIOSCAN-5M:

python scripts/inference_and_eval.py 'model_config=lora_vit_lora_barcode_bert_lora_bert_5m'

BZSL experiments with the INSECT dataset

TODO add some acknowledgement about the INSECT dataset. Also, the description of the BZSL experiments should be added.

To download the unprocessed INSECT dataset, please refer to BZSL:

mkdir -p data/INSECT
cd data/INSECT
# Download the images and metadata here.

# Note that we need to obtain the other three taxonomic labels, because the INSECT dataset only provides the species label.
# To do so, edit get_all_species_taxo_labels_dict_and_save_to_json.py and change Entrez.email = None to your email (see the Entrez sketch after this block).
python get_all_species_taxo_labels_dict_and_save_to_json.py

# Then, generate CSV and hdf5 file for the dataset.
python process_insect_dataset.py
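For reference, the taxonomic lookup goes through NCBI Entrez via Biopython. A minimal example of fetching the lineage for one species name is shown below (illustrative only; the repository script may batch and cache these queries differently):

from Bio import Entrez

Entrez.email = "your.name@example.com"  # NCBI requires a contact email

# Look up the taxon ID for a species name, then fetch its lineage.
search = Entrez.read(Entrez.esearch(db="taxonomy", term="Abax parallelepipedus"))
record = Entrez.read(Entrez.efetch(db="taxonomy", id=search["IdList"][0], retmode="xml"))[0]
lineage = {item["Rank"]: item["ScientificName"] for item in record["LineageEx"]}
print({rank: lineage.get(rank) for rank in ("order", "family", "genus")})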

The downloaded data should be organized in this way:

data
├── INSECT
│   ├── att_splits.mat
│   ├── res101.mat
│   ├── images
│   │   ├── Abax parallelepipedus
│   │   │   ├── BC_ZSM_COL_02878+1311934584.jpg
│   │   │   ├── BC_ZSM_COL_05487+1338577126.JPG
│   │   │   ├── ...
│   │   ├── Abax parallelus
│   │   ├── Acordulecera dorsalis
│   │   ├── ...

You can also download the processed file with:

wget https://aspis.cmpt.sfu.ca/projects/bioscan/BIOSCAN_CLIP_for_downloading/INSECT_data/processed_data.zip
unzip processed_data.zip

Train CLIBD with the INSECT dataset

python scripts/train_cl.py 'model_config=lora_vit_lora_barcode_bert_lora_bert_ssl_on_insect.yaml'

Extract image and DNA features for the INSECT dataset:

python scripts/extract_feature_for_insect_dataset.py 'model_config=lora_vit_lora_barcode_bert_lora_bert_ssl_on_insect.yaml'

Then, you can move the extracted features to the BZSL folder, or download the pre-extracted features.

mkdir -p Fine-Grained-ZSL-with-DNA/data/INSECT/embeddings_from_bioscan_clip
cp extracted_embedding/INSECT/dna_embedding_from_bioscan_clip.csv Fine-Grained-ZSL-with-DNA/data/INSECT/embeddings_from_bioscan_clip/dna_embedding_from_bioscan_clip.csv
cp extracted_embedding/INSECT/image_embedding_from_bioscan_clip.csv Fine-Grained-ZSL-with-DNA/data/INSECT/embeddings_from_bioscan_clip/image_embedding_from_bioscan_clip.csv
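You can quickly confirm the copied CSVs load correctly (a small sketch; the exact column layout is produced by the extraction script and not assumed here):

import pandas as pd

# Load the image embedding CSV and report its dimensions.
emb = pd.read_csv("Fine-Grained-ZSL-with-DNA/data/INSECT/embeddings_from_bioscan_clip/image_embedding_from_bioscan_clip.csv")
print(emb.shape)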

Run BZSL for evaluation.

cd Fine-Grained-ZSL-with-DNA/BZSL-Python
python Demo.py --using_bioscan_clip_image_feature --datapath ../data --side_info dna_bioscan_clip --alignment --tuning

Flatten the results.csv.

python scripts/flattenCsv.py -i PATH_TO_RESULTS_CSV -o PATH_TO_FLATTEN_CSV

Citing CLIBD

If you use CLIBD in your research, please cite:

@article{gong2024clibd,
  title={{CLIBD}: Bridging Vision and Genomics for Biodiversity Monitoring at Scale},
  author={Gong, ZeMing and Wang, Austin T. and Huo, Xiaoliang and Haurum, Joakim Bruslund and Lowe, Scott C. and Taylor, Graham W. and Chang, Angel X.},
  journal={arXiv preprint arXiv:2405.17537},
  year={2024},
  eprint={2405.17537},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  doi={10.48550/arxiv.2405.17537},
}

Acknowledgements

We would like to express our gratitude for the use of the INSECT dataset, which played a pivotal role in the completion of our experiments. Additionally, we acknowledge the use and modification of code from the Fine-Grained-ZSL-with-DNA repository, which facilitated part of our experimental work. The contributions of these resources have been invaluable to our project, and we appreciate the efforts of all developers and researchers involved.

This research was supported by the Government of Canada’s New Frontiers in Research Fund (NFRF) [NFRFT-2020-00073], Canada CIFAR AI Chair grants, and the Pioneer Centre for AI (DNRF grant number P1). This research was also enabled in part by support provided by the Digital Research Alliance of Canada (alliancecan.ca).