/ProteinMLM

Generate amino acid sequences using masked language models (MLM).


Generating novel protein sequences using Gibbs sampling of masked language models

This repository contains the code supporting the work described in Generating novel protein sequences using Gibbs sampling of masked language models. Since then, the code has been continuously updated. For the version of the code used in that preprint, see: here
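
As a rough illustration of the procedure, the sketch below shows a single-sequence Gibbs sampling loop against a masked language model. It is not the package's implementation; masked_lm_logits is a hypothetical stand-in for whatever masked LM you load (for example an ESM checkpoint) and is assumed to return per-position logits over the 20 amino acids.

import random
import torch

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")
MASK = "<mask>"

def masked_lm_logits(tokens):
    # Hypothetical helper: run the masked LM and return a (len(tokens), 20) logits tensor.
    raise NotImplementedError

def gibbs_sample(seed_seq, num_iters=20, temperature=1.0):
    seq = list(seed_seq)
    for _ in range(num_iters):
        pos = random.randrange(len(seq))             # choose a position to resample
        masked = seq[:pos] + [MASK] + seq[pos + 1:]  # mask it out
        logits = masked_lm_logits(masked)[pos]       # model's distribution at the masked site
        probs = torch.softmax(logits / temperature, dim=-1)
        seq[pos] = AMINO_ACIDS[torch.multinomial(probs, 1).item()]
    return "".join(seq)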

Install

Clone the repo and move into the new directory

git clone https://github.com/seanrjohnson/protein_gibbs_sampler.git
cd protein_gibbs_sampler

Then install either through conda or Docker

Conda/Pip

Make a clean new Conda environment

conda create -n protein_gibbs_sampler python~=3.8
conda activate protein_gibbs_sampler

Install this package and its prereqs with pip

pip install -e .

Test the install

pytest .

If you have CUDA installed, everything should pass; otherwise, one test will be skipped.

Docker

Setup container environment:

# From the root of this repo
# Create container -- starts the container in detached mode with the root of
# the repo mounted to /workspace in the container
make run-cpu # cpu-only
# make run # GPU pass through

# Attach to container
make attach

# ---Running inside container---
# Pip install the source for this package
pip install -e .

Additional Commands:

# Additional Commands
make start
make stop
make shell
make remove

Generating new protein sequences from the command line

This package contains three command line programs to make it easy to generate new sequences. For good performance, a GPU is recommended; the programs will still run on a CPU, but excruciatingly slowly for anything other than small proteins.
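
The samplers run on PyTorch, so a quick way to confirm that a GPU will actually be used (assuming a standard CUDA-enabled PyTorch install) is:

import torch
print("CUDA available:", torch.cuda.is_available())  # True means the samplers can run on GPU
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))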

pgen_esm.py

Given a seed sequence, generates new sequences.

pgen_esm_input.tsv

test_seq	{'num_iters': 20, 'burnin': 10, 'mask': True, 'in_order':False, 'num_positions_percent': 10, 'seed_seq': "MEPAATGQEAEECAHSGRGEAWEEV"}
test_seq2	{'num_iters': 20, 'burnin': 10, 'mask': True, 'in_order':False, 'num_positions_percent': 10, 'seed_seq': "MLEGADIVIIPAGV"}
pgen_esm.py -o pgen_out -i pgen_esm_input.tsv --num_output_sequences 10

For detailed help: pgen_esm.py -h
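
If you are generating many variants, the input TSV can also be written from a short script rather than by hand. A minimal sketch, assuming the two-column layout shown above (sequence name, then a dict of sampler arguments):

# Write a pgen_esm input TSV: name <TAB> sampler-argument dict, matching the example above.
rows = [
    ("test_seq", {"num_iters": 20, "burnin": 10, "mask": True, "in_order": False,
                  "num_positions_percent": 10, "seed_seq": "MEPAATGQEAEECAHSGRGEAWEEV"}),
]
with open("pgen_esm_input.tsv", "w") as handle:
    for name, args in rows:
        handle.write(f"{name}\t{args}\n")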

pgen_msa.py

Given a seed MSA, uses ESM-MSA to generate new sequences.

fasta_input1.fasta

>s1
MEPAATGQEAE--AHSGRGEAWEEV
>s2
MCP-ATGR-AEMCAHS--GEAWLLV
>s3
MEQ-AGGRLAEM-AHHC-GEAWLLV

fasta_input2.fasta

>s1
MLEGADIVIIP-GV
>s2
MLDG---VLLPGAV
>s3
M-EPADILVV--GV

pgen_msa_input.tsv

test_seq	{'num_iters': 20, 'burnin': 10, 'mask': True, 'in_order':False, 'num_positions_percent': 10}	fasta_input1.fasta
test_seq2	{'num_iters': 20, 'burnin': 10, 'mask': True, 'in_order':False, 'num_positions_percent': 10}	fasta_input2.fasta
pgen_msa.py -o pgen_msa_out -i pgen_msa_input.tsv --num_output_sequences 10

For detailed help: pgen_msa.py -h

pgen_esm_from_fasta.py

Like pgen_esm.py, except that the seed sequences come from fasta files instead of being defined in the sampler arguments. The sequences in the fasta can be either aligned or unaligned. If they are aligned, gaps are removed before sampling; setting --keep_gap_positions adds the gaps back in after sampling. If the fasta contains more than one sequence, a random sequence is selected for each round of sampling.
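
A minimal sketch of that gap handling (illustrative only, not the package's exact code): record where the '-' columns are, sample on the ungapped sequence, then put the gaps back.

def strip_gaps(aligned_seq):
    # Return the ungapped sequence and the indices of the gap columns.
    gap_positions = [i for i, ch in enumerate(aligned_seq) if ch == "-"]
    return aligned_seq.replace("-", ""), gap_positions

def restore_gaps(sampled_seq, gap_positions):
    # Re-insert '-' at the original gap columns after sampling (what --keep_gap_positions does).
    out = list(sampled_seq)
    for i in sorted(gap_positions):
        out.insert(i, "-")
    return "".join(out)

ungapped, gaps = strip_gaps("MEPAATGQEAE--AHSGRGEAWEEV")
print(restore_gaps(ungapped, gaps))  # the gaps land back in columns 11 and 12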

pgen_esm_from_fasta.py -o pgen_from_fasta_out -i pgen_msa_input.tsv --num_output_sequences 10 --keep_gap_positions

For detailed help: pgen_esm_from_fasta.py -h

References

This repository builds on the related resources cited below.

bert-gen

Github

Paper

@article{wang2019bert,
  title={BERT has a Mouth, and It Must Speak: BERT as a Markov Random Field Language Model},
  author={Wang, Alex and Cho, Kyunghyun},
  journal={arXiv preprint arXiv:1902.04094},
  year={2019}
}

ESM

Github

Paper

@article{rives2019biological,
  author={Rives, Alexander and Meier, Joshua and Sercu, Tom and Goyal, Siddharth and Lin, Zeming and Guo, Demi and Ott, Myle and Zitnick, C. Lawrence and Ma, Jerry and Fergus, Rob},
  title={Biological Structure and Function Emerge from Scaling Unsupervised Learning to 250 Million Protein Sequences},
  year={2019},
  doi={10.1101/622803},
  url={https://www.biorxiv.org/content/10.1101/622803v3},
  journal={bioRxiv}
}

ESM-MSA

Paper

@article{rao2021msa,
  author = {Rao, Roshan and Liu, Jason and Verkuil, Robert and Meier, Joshua and Canny, John F. and Abbeel, Pieter and Sercu, Tom and Rives, Alexander},
  title={MSA Transformer},
  year={2021},
  doi={10.1101/2021.02.12.430858},
  url={https://www.biorxiv.org/content/10.1101/2021.02.12.430858v1},
  journal={bioRxiv}
}

ProtTrans

Github

Paper

@article{Elnaggar2020.07.12.199554,
  author={Elnaggar, Ahmed and Heinzinger, Michael and Dallago, Christian and Rehawi, Ghalia and Wang, Yu and Jones, Llion and Gibbs, Tom and Feher, Tamas and Angerer, Christoph and Steinegger, Martin and Bhowmik, Debsindhu and Rost, Burkhard},
  title={ProtTrans: Towards Cracking the Language of Life{\textquoteright}s Code Through Self-Supervised Deep Learning and High Performance Computing},
  year={2020},
  doi={10.1101/2020.07.12.199554},
  url={https://www.biorxiv.org/content/early/2020/07/21/2020.07.12.199554},
  journal={bioRxiv}
}