protein_scoring

Generating and scoring novel enzyme sequences with a variety of models and metrics

Primary language: Jupyter Notebook. License: MIT.

Computational Scoring and Experimental Evaluation of Enzymes Generated by Neural Networks

Colab notebooks

  • ESM-MSA sampler: uses the ESM-MSA model (a transformer-based neural network trained on protein multiple sequence alignments) to generate new protein sequences by iteratively mutating sequences from an input alignment.
  • Metrics: calculates sequence- and structure-based quality scores for proteins, such as those produced by generative models.
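As a rough illustration of what a sequence-based score can look like, the sketch below computes the identity of a generated sequence to its closest match in a set of aligned reference sequences. It is not taken from the Metrics notebook: the function names `identity` and `max_identity` and the toy sequences are hypothetical, and the sequences are assumed to be pre-aligned to the same length.

```python
from typing import Iterable


def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two aligned, equal-length sequences.

    Columns where both sequences have a gap ('-') are ignored.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue  # a shared gap carries no information
        compared += 1
        if x == y:
            matches += 1
    return matches / compared if compared else 0.0


def max_identity(query: str, references: Iterable[str]) -> float:
    """Identity of the query to its closest reference sequence."""
    return max(identity(query, ref) for ref in references)


# Toy, made-up aligned sequences (illustration only).
natural = ["MKV-LA", "MKVALA", "MRVALG"]
generated = "MKVALG"
score = max_identity(generated, natural)
print(round(score, 3))
```

A score close to 1.0 means the generated sequence is nearly a copy of a known sequence, while a low score means it has diverged further from the reference set. Structure-based scores need a predicted structure and are not covered by this sketch.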

Figures

Setup

conda env create --name protein_scoring -f conda_env.yml

conda activate protein_scoring

jupyter lab

Related data and repositories

  • Source data: AlphaFold2-predicted structures, full sequence lists, tables of metrics, tables of experimental results, and phylogenetic trees. The Jupyter notebooks under "notebooks_for_figures" download the necessary data from Zenodo automatically; if you want the data for some other purpose, it is also available at this link.
  • protein_gibbs_sampler: command line tools for generating new sequences using ESM-MSA sampling (used in the notebook above).
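The idea of generating a sequence by iteratively mutating a member of an input alignment can be sketched as a simple resampling loop. This is a toy under stated assumptions, not the protein_gibbs_sampler implementation or its command line interface: the ESM-MSA model is replaced by a stand-in that uses per-column residue frequencies with pseudocounts, and the names `column_distribution` and `sample_sequence` are hypothetical. A real masked language model would condition each prediction on the rest of the sequence and alignment, which this stand-in does not.

```python
import random
from collections import Counter

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")


def column_distribution(alignment, col, pseudocount=1.0):
    """Stand-in for a model's prediction at a masked position.

    Returns residue probabilities from the column's frequencies in the alignment.
    """
    counts = Counter(seq[col] for seq in alignment if seq[col] in AMINO_ACIDS)
    weights = [counts[aa] + pseudocount for aa in AMINO_ACIDS]
    total = sum(weights)
    return [w / total for w in weights]


def sample_sequence(alignment, seed_seq, n_iterations=100, rng=None):
    """Iteratively resample single positions of seed_seq; gap positions stay fixed."""
    rng = rng or random.Random()
    seq = list(seed_seq)
    mutable = [i for i, c in enumerate(seq) if c != "-"]
    if not mutable:
        return seed_seq  # nothing to resample
    for _ in range(n_iterations):
        pos = rng.choice(mutable)                      # pick a position to "mask"
        probs = column_distribution(alignment, pos)    # stand-in model prediction
        seq[pos] = rng.choices(AMINO_ACIDS, weights=probs, k=1)[0]
    return "".join(seq)


# Toy, made-up alignment (illustration only).
alignment = ["MKVALA", "MKVALG", "MRV-LG"]
new_seq = sample_sequence(alignment, alignment[0], n_iterations=50, rng=random.Random(0))
print(new_seq)
```

Passing a seeded `random.Random` makes a run reproducible. Raising `n_iterations` moves the result further from the starting sequence.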

References

  • Johnson, Sean R., Xiaozhi Fu, Sandra Viknander, Clara Goldin, Sarah Monaco, Aleksej Zelezniak, and Kevin K. Yang. “Computational Scoring and Experimental Evaluation of Enzymes Generated by Neural Networks.” Nature Biotechnology, April 23, 2024. https://doi.org/10.1038/s41587-024-02214-2.