Generative pre-trained transformers (GPTs) have revolutionized the field of natural language processing. Inspired by the success of large language models, we develop a long-context generative model for genomes. Our multiscale transformer model was pre-trained on unannotated bacteriophage genomes with byte-level tokenization. We demonstrate the foundational capabilities of our model, including the prediction of essential genes, genetic variant effects, regulatory element activity and the taxonomy of unannotated sequences. Furthermore, it generates de novo sequences of up to 96K base pairs that contain functional regulatory elements and novel proteins with phage-related functions.
To install megaDNA, run the following commands:
git clone https://github.com/lingxusb/megaDNA.git
cd megaDNA
pip install .
Trained model checkpoints:
- The original model: megaDNA_145M.
- Other model sizes: megaDNA_78M and megaDNA_277M.
- Model fine-tuned on E. coli phages: megaDNA_ecoli.
import torch
# load the model (model_path is the path to a downloaded checkpoint, e.g. megaDNA_145M)
device = 'cpu' # use 'cuda' for GPU
model = torch.load(model_path, map_location=torch.device(device))
# new sequences can be generated by a primer sequence:
nucleotides = ['**', 'A', 'T', 'C', 'G', '#'] # vocabulary
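# Example primer (an illustrative sketch, not from the original README: it assumes
# the index of each base in `nucleotides` is its token id, i.e. A=1, T=2, C=3, G=4)
primer_dna = 'ATGCATCG'
primer_sequence = torch.tensor([[nucleotides.index(nt) for nt in primer_dna]]).to(device)
context_length = 96 * 1024  # total length to generate; the model handles up to ~96K bp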
seq_tokenized = model.generate(primer_sequence,
                               seq_len=context_length,
                               temperature=0.95,
                               filter_thres=0.0)
# To transform tokens back to a DNA nucleotide sequence:
def token2nucleotide(s):
    return nucleotides[s]

generated_sequence = ''.join(map(token2nucleotide, seq_tokenized.squeeze().cpu().int()))
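If helpful, the generated sequence can be written to a FASTA file for downstream annotation. This is a minimal sketch (the file name and header are arbitrary, and the special tokens are stripped first):

# write the generated sequence to a FASTA file, dropping the special tokens ('*' and '#')
with open('generated_phage.fasta', 'w') as f:
    f.write('>megaDNA_generated\n' + generated_sequence.replace('*', '').replace('#', '') + '\n')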
Please check our Jupyter notebook megaDNA_generate.ipynb; a GPU is recommended.
Alternatively, you can run the Colab notebook in the browser. Please make sure to connect to a GPU instance (e.g. a T4 GPU).
Features of the generated sequences
- Annotated genes (Fig. 2b)
- Annotated proteins with diverse functions (Fig. 2i & Fig. S12)
- Folding of annotated proteins (Fig. 2h & Fig. S11)
- Virus scores comparable to those of natural phages (Fig. 2c)
- Marker genes for phage (Fig. 2h)
- Classified as Caudoviricetes (~37%, Fig. 2d)
- Predicted hosts (~40%, Fig. S9)
- Regulatory elements including promoters and RBS (Fig. 2f, 2g, Fig. S10)
Please check our preprint for more details.
import numpy as np

# a random input sequence (tokens 1-4 encode the nucleotides A, T, C, G)
encoded_sequence = np.random.choice(np.arange(1, 5), 100)
input_seq = torch.tensor(encoded_sequence).unsqueeze(0).to(device)
# get embeddings
output = model(input_seq, return_value = 'embedding')
# output[0:3] stores embeddings from three transformer layers.
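# Example (an illustration, not from the original README): assuming each entry
# of `output` is a tensor whose last dimension is the embedding size, a single
# sequence-level embedding can be obtained by mean-pooling over positions, e.g.
# seq_embedding = output[-1].mean(dim=-2)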
# get model loss
output = model(input_seq, return_value = 'loss')
print(output)
Please check our Jupyter notebook megaDNA_mutagenesis.ipynb. The FASTA file and gene annotation for lambda phage can be downloaded from https://www.ncbi.nlm.nih.gov/nuccore/NC_001416.1.
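As a minimal sketch of how the model loss can be used for in silico mutagenesis (scoring single-nucleotide variants), assuming that tokens 1-4 encode A/T/C/G, that `model`, `device` and `input_seq` are defined as above, and that the returned loss is a scalar tensor; the exact procedure in the notebook may differ:

# score every single-nucleotide substitution by its change in model loss
# relative to the unmutated sequence (a larger increase suggests a more
# disruptive variant)
wt_loss = model(input_seq, return_value='loss').item()
variant_scores = []
for pos in range(input_seq.shape[1]):
    wt_token = input_seq[0, pos].item()
    for token in range(1, 5):
        if token == wt_token:
            continue  # skip the wild-type base
        mutated = input_seq.clone()
        mutated[0, pos] = token
        mut_loss = model(mutated, return_value='loss').item()
        variant_scores.append((pos, token, mut_loss - wt_loss))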
- A long-context language model for deciphering and generating bacteriophage genomes
- MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers
- MEGABYTE-pytorch by Phil Wang
- Protein language models learn evolutionary statistics of interacting sequence motifs
- Please contact shaobinlx@gmail.com or raise an issue in the GitHub repo with any questions.