megaDNA: a long-context language model for deciphering and generating bacteriophage genomes

Generative pre-trained transformers (GPTs) have revolutionized the field of natural language processing. Inspired by the success of large language models, we developed a long-context generative model for genomes. Our multiscale transformer model was pre-trained on unannotated bacteriophage genomes with byte-level tokenization. We demonstrate the foundational capabilities of our model, including the prediction of essential genes, genetic variant effects, regulatory element activity, and the taxonomy of unannotated sequences. Furthermore, the model generates de novo sequences of up to 96K base pairs that contain functional regulatory elements and novel proteins with phage-related functions.
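
For intuition, byte-level tokenization simply maps each nucleotide to a single token ID. The short sketch below is illustrative only: it assumes the vocabulary used in the sequence-generation example further down, and the roles of the special tokens are assumptions rather than documented behavior.

# Minimal sketch of byte-level tokenization (mapping assumed, consistent with
# the vocabulary in the sequence-generation example below).
nucleotides = ['**', 'A', 'T', 'C', 'G', '#']  # roles of '**' and '#' (padding/start, end) are assumptions
token_of = {nt: i for i, nt in enumerate(nucleotides)}

dna = "ATGCATTTGA"
tokens = [token_of[base] for base in dna]  # one token per nucleotide
print(tokens)  # [1, 2, 4, 3, 1, 2, 2, 2, 4, 1]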

Install

To install megaDNA, run the following commands:

git clone https://github.com/lingxusb/megaDNA.git
cd megaDNA
pip install .

Trained model

Sequence generation

import torch

# load the trained model (model_path points to the downloaded checkpoint)
device = 'cpu'  # use 'cuda' for GPU
model = torch.load(model_path, map_location=torch.device(device))

# new sequences can be generated from a primer sequence
nucleotides = ['**', 'A', 'T', 'C', 'G', '#']  # vocabulary

seq_tokenized = model.generate(primer_sequence,       # tokenized primer sequence
                               seq_len=context_length,
                               temperature=0.95,
                               filter_thres=0.0)

# transform tokens back to a DNA nucleotide sequence
def token2nucleotide(s):
    return nucleotides[s]

generated_sequence = ''.join(map(token2nucleotide, seq_tokenized.squeeze().cpu().int()))
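
Once generated_sequence is available, it can be trimmed and written to a FASTA file for downstream annotation. This is a minimal sketch: treating '#' as the end-of-sequence token and '*' as padding is an assumption based on the vocabulary above, and the output file name is hypothetical.

# Optional post-processing: trim at the first end token and save as FASTA (sketch;
# '#' as end-of-sequence and '*' as padding are assumptions based on the vocabulary above).
clean_sequence = generated_sequence.split('#')[0].replace('*', '')

with open('generated_phage.fasta', 'w') as f:  # hypothetical output file name
    f.write('>megaDNA_generated\n')
    f.write(clean_sequence + '\n')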

Please check our Jupyter notebook megaDNA_generate.ipynb; a GPU is recommended.

Alternatively, you can run the Colab notebook in the browser. Please make sure to connect to a GPU instance (e.g., a T4 GPU).

Features of the generated sequences

  • Annotated genes (Fig. 2b)
  • Annotated proteins with diverse functions (Fig. 2i & Fig. S12)
  • Folding of annotated proteins (Fig. 2h & Fig. S11)
  • Virus scores comparable to those of natural phages (Fig. 2c)
  • Phage marker genes (Fig. 2h)
  • ~37% classified as Caudoviricetes (Fig. 2d)
  • Predicted hosts for ~40% of sequences (Fig. S9)
  • Regulatory elements including promoters and RBS (Fig. 2f, 2g, Fig. S10)

Please check our preprint for more details.

Model embeddings and loss

import numpy as np

# a random input sequence of 100 tokens (1-4 encode A, T, C, G)
encoded_sequence = np.random.choice(np.arange(1, 5), 100)
input_seq = torch.tensor(encoded_sequence).unsqueeze(0).to(device)

# get embeddings; output[0:3] stores embeddings from the three transformer layers
output = model(input_seq, return_value='embedding')

# get model loss
output = model(input_seq, return_value='loss')
print(output)
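
For downstream tasks such as the gene or taxonomy predictions described above, the token-level embeddings can be pooled into a fixed-length vector. Below is a minimal sketch with mean pooling, assuming each entry of the embedding output is a tensor of shape (batch, tokens, dim).

# Reduce token-level embeddings to one fixed-length vector per sequence (sketch;
# assumes each entry of the embedding output is shaped (batch, tokens, dim)).
embeddings = model(input_seq, return_value='embedding')
layer_embedding = embeddings[0]                # embeddings from the first transformer layer
sequence_vector = layer_embedding.mean(dim=1)  # average over token positions
print(sequence_vector.shape)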

In silico mutagenesis analysis

Please check our Jupyter notebook megaDNA_mutagenesis.ipynb. The FASTA file and gene annotation for lambda phage can be downloaded from https://www.ncbi.nlm.nih.gov/nuccore/NC_001416.1
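
As a rough illustration of how the model loss can be used to score single-nucleotide variants, the sketch below compares the loss of a wild-type and a mutated sequence. The example sequence, variant, and scoring scheme are hypothetical; the notebook above contains the analysis used in the paper.

import torch

# Score a single-nucleotide variant by the change in model loss (illustrative sketch;
# the scoring in megaDNA_mutagenesis.ipynb may differ).
nucleotides = ['**', 'A', 'T', 'C', 'G', '#']
token_of = {nt: i for i, nt in enumerate(nucleotides)}

wild_type = "ATGCATTTGAATGCATTTGA"                 # hypothetical example sequence
wt_tokens = torch.tensor([[token_of[b] for b in wild_type]]).to(device)
wt_loss = model(wt_tokens, return_value='loss').item()

position, alt_base = 5, 'G'                        # hypothetical variant: position 5 -> G
mut_tokens = wt_tokens.clone()
mut_tokens[0, position] = token_of[alt_base]
mut_loss = model(mut_tokens, return_value='loss').item()

print(f"variant effect score (loss change): {mut_loss - wt_loss:.4f}")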

Reference