Investigation of the BERT model on nucleotide sequences with non-standard pre-training and evaluation of different k-mer embeddings

In this study, we investigate a BERT model pre-trained on nucleotide sequences using a non-standard pre-training approach that incorporates randomness at the data and model levels.

Data

Pre-training data

data/ptData: source code for generating random sequences.
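
As a rough illustration of what generating random pre-training sequences can look like, here is a minimal sketch. It is not the code in data/ptData: the function names, the uniform sampling over A, C, G and T, and the sequence length are all assumptions made for the example.

```python
import random

NUCLEOTIDES = "ACGT"

def random_sequence(length, rng):
    """Return one sequence with each base drawn uniformly from A, C, G, T."""
    return "".join(rng.choice(NUCLEOTIDES) for _ in range(length))

def random_sequences(n, length, seed=0):
    """Return n random sequences; a fixed seed makes the output reproducible."""
    rng = random.Random(seed)
    return [random_sequence(length, rng) for _ in range(n)]

# Hypothetical sizes, chosen only for illustration.
seqs = random_sequences(n=3, length=20, seed=42)
for s in seqs:
    print(s)
```

Seeding a dedicated random.Random instance keeps the generated data reproducible without touching the global random state.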

Fine-tuning data

TATA: the human and mouse TATA datasets are in the folder data/ftData/TATA.
TFBS: the curated motif_discovery (690) and motif_occupancy (422) datasets provided by Zeng et al. (https://academic.oup.com/bioinformatics/article/32/12/i121/2240609). Please download them from the URL given in the paper.

Source code

ft_tasks: source code for using different k-mer embeddings in the downstream tasks of TATA prediction and TFBS prediction.

Environment and required packages

Python 3
PyTorch 1.11
DNABERT
dna2vec

Evaluated k-mer embeddings

k-mer embedding	Description	Required files
dnabert	k-mer embedding from DNABERT pre-trained on hg38	pre-trained model provided by DNABERT
dnabert	k-mer embedding from DNABERT pre-trained on random data	DNABERT model pre-trained on random data
onehot	one-hot embedding	None
dna2vec	k-mer embedding from dna2vec	pre-trained dna2vec model
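
To make the k-mer representation concrete, the sketch below splits a sequence into overlapping k-mers (stride 1, the tokenization DNABERT uses) and one-hot encodes each k-mer over the 4^k vocabulary. The helper names are hypothetical, and encoding over the full k-mer vocabulary is an assumption about what "onehot" means; the repository's implementation may differ.

```python
from itertools import product

def kmer_tokens(seq, k):
    """Split a sequence into overlapping k-mers with stride 1."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def kmer_vocab(k, alphabet="ACGT"):
    """Map each of the 4**k possible k-mers to an integer index."""
    return {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}

def one_hot(tokens, vocab):
    """Encode each k-mer as a 0/1 vector with a single 1 at its vocabulary index."""
    vectors = []
    for tok in tokens:
        vec = [0] * len(vocab)
        vec[vocab[tok]] = 1  # raises KeyError for k-mers with non-ACGT bases
        vectors.append(vec)
    return vectors

tokens = kmer_tokens("ACGTAC", k=3)
vocab = kmer_vocab(3)
encoded = one_hot(tokens, vocab)
print(tokens)                         # ['ACG', 'CGT', 'GTA', 'TAC']
print(len(encoded), len(encoded[0]))  # 4 64
```

The learned embeddings (dnabert, dna2vec) replace the sparse 4^k-dimensional vector with a dense vector looked up from a pre-trained model.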

TATA prediction task

KMER=5
SPIECE="human"    # or "mouse"
MODEL="deepPromoterNet"
MODEL_SAVE_PATH="model/"
DATA_PATH="ftData/TATA/TATA_${SPIECE}/overall"
EMBEDDING="dnabert"    # or "onehot", "dna2vec"
embed_file="FOLD_PATH_OF_THE_PRETRAINED_MODEL"    # folder of the pre-trained model, or NONE
KERNEL="5,5,5"
LR=1e-4
EPOCH=20
BS=64
DROPOUT=0.1

CODE="ft_tasks/TATA/tata_train.py"
python $CODE --kmer $KMER --cnn_kernel_size $KERNEL --model $MODEL --model_dir $MODEL_SAVE_PATH \
    --data_dir $DATA_PATH  --embedding $EMBEDDING --embedding_file $embed_file \
    --lr $LR --epoch $EPOCH --batch_size $BS --dropout $DROPOUT --device "cuda:0"

TFBS prediction task

KMER=5
MODEL="zeng_CNN"
KERNEL="24"
MODEL_SAVE_PATH="model/"
DATA_PATH="TBFS/motif_discovery/"    # or "TBFS/motif_occupancy/"
EMBEDDING="dnabert"    # or "onehot", "dna2vec"
embed_file="FOLD_PATH_OF_THE_PRETRAINED_MODEL"    # folder of the pre-trained model, or NONE
LR=0.001
EPOCH=10
BS=64
DROPOUT=0.1

CODE="ft_tasks/TFBS/TBFS_all_run.py"
python $CODE --kmer $KMER --cnn_kernel_size $KERNEL --model $MODEL --model_dir $MODEL_SAVE_PATH \
	--data_dir $DATA_PATH --embedding $EMBEDDING --embedding_file $embed_file \
	--lr $LR --epoch $EPOCH --batch_size $BS --dropout $DROPOUT --device "cuda:0"
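
To run the TFBS command above for both dataset folders and all three embeddings, the argument lists can be assembled programmatically. The sketch below only builds and prints the commands, using the flags and values shown above. The helper name is hypothetical, the pre-trained model path is left as the same placeholder, and passing NONE as the embedding file for onehot is an assumption.

```python
import shlex

SCRIPT = "ft_tasks/TFBS/TBFS_all_run.py"
DATA_DIRS = ["TBFS/motif_discovery/", "TBFS/motif_occupancy/"]

# Placeholder paths: replace with the folder of the pre-trained model.
# Assumption: "NONE" is passed when no embedding file is needed (onehot).
EMBEDDING_FILES = {
    "dnabert": "FOLD_PATH_OF_THE_PRETRAINED_MODEL",
    "dna2vec": "FOLD_PATH_OF_THE_PRETRAINED_MODEL",
    "onehot": "NONE",
}

def build_command(data_dir, embedding, embedding_file):
    """Return the argument list for one TFBS run with the settings shown above."""
    return [
        "python", SCRIPT,
        "--kmer", "5",
        "--cnn_kernel_size", "24",
        "--model", "zeng_CNN",
        "--model_dir", "model/",
        "--data_dir", data_dir,
        "--embedding", embedding,
        "--embedding_file", embedding_file,
        "--lr", "0.001",
        "--epoch", "10",
        "--batch_size", "64",
        "--dropout", "0.1",
        "--device", "cuda:0",
    ]

commands = [
    build_command(data_dir, embedding, embedding_file)
    for data_dir in DATA_DIRS
    for embedding, embedding_file in EMBEDDING_FILES.items()
]

for cmd in commands:
    print(shlex.join(cmd))  # pass cmd to subprocess.run(cmd, check=True) to execute
```

Printing first makes it easy to check each command before launching the runs.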

Pre-trained models

pt_models: download links for the models pre-trained on random data.
4mer pre-trained on randomly generated sequences: https://drive.google.com/file/d/1YKKoX_8NRrPR13uGdEQAKWqxBcOvq2su/view?usp=share_link
5mer pre-trained on randomly generated sequences: https://drive.google.com/file/d/1a2OjubusbsXkC2xAp8W0BbVZHuqyCAhk/view?usp=share_link
6mer pre-trained on randomly generated sequences: https://drive.google.com/file/d/1-6XMO70jY9Tdj9R19vgq8u9DtCzDGPkm/view?usp=share_link

Experiment results

results: detailed results for each dataset of the TFBS tasks.

yaozhong/bert_investigation
