ml-protein-design-sav-gold: A Jupyter Notebook repository from LAS @ ETH Zurich

Content: Enhanced Sequence-Activity Mapping and Evolution of Artificial Metalloenzymes by Active Learning

This repository contains scripts for the analysis, preparation and reporting of results from a study published in ACS Central Science:

Authors: Tobias Vornholt, Mojmír Mutny, Gregor W. Schmidt, Christian Schellhaas, Ryo Tachibana, Sven Panke, Thomas R. Ward, Andreas Krause, and Markus Jeschek

Title: Enhanced Sequence-Activity Mapping and Evolution of Artificial Metalloenzymes by Active Learning

Journal: ACS Central Science

Year: 2024

For the full paper, please refer to the link.

This repository contains:

plotting scripts in /plots
training scripts in bechmark_run
sequential decision making scripts in active_learning_X

As part of the project, we have developed a standalone Python package, mutedpy, which is listed in the Dependencies section.

Updates

21/05/2024 Initial version of public code online

Dependencies

This repository contains only basic scripts, which build upon the following libraries:

1. mutedpy https://github.com/Mojusko/mutedpy

2. stpy https://github.com/Mojusko/stpy
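
Before running any script, it can help to confirm that both libraries are visible to the Python interpreter. The following is a minimal sketch, assuming the import names match the repository names (mutedpy and stpy), which is not stated here; the helper name missing_packages is purely illustrative.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level package names that the interpreter cannot find."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    # Assumption: the import names equal the repository names.
    missing = missing_packages(["mutedpy", "stpy"])
    if missing:
        print("Not importable:", ", ".join(missing))
    else:
        print("All dependencies found.")
```

If a name is reported as not importable, the corresponding repository probably still needs to be installed or added to the Python path.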

Further resources

A large part of the dataset did not fit into the repository. Additional data is available at the following locations:

Embeddings from /data can be downloaded from here.
The NGS sequencing analysis can be downloaded from here.
A 10% subset of the structures generated with the Rosetta software can be downloaded here.
Pretrained and saved models for the plotting can be found here.

How to set up the code?

The easiest way to rerun the code is to clone this repository together with the mutedpy repository, as follows:

git clone https://github.com/Mojusko/mutedpy
cd mutedpy/experiments
git clone https://github.com/lasgroup/ml-protein-design-sav-gold 
mv ml-protein-design-sav-gold streptavidin
cd streptavidin 
wget https://polybox.ethz.ch/index.php/s/XKNUFIGRY08py63 #retrieve saved embeddings data 
unzip data.zip data
wget https://polybox.ethz.ch/index.php/s/Bd9bi0ITfBI6xur #retrieve saved pickled models
unzip models.zip models
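
The wget calls above save whatever the server returns, so it can be worth checking that the downloaded files really are zip archives before unpacking them. The sketch below does this with Python's standard zipfile module. The helper name extract_archive is hypothetical, the file names data.zip and models.zip are taken from the commands above, and extracting into the current folder is an assumption, since the internal layout of the archives is not described here.

```python
import zipfile
from pathlib import Path


def extract_archive(archive, target="."):
    """Extract `archive` into `target` and return the archive's member names."""
    archive = Path(archive)
    if not zipfile.is_zipfile(archive):
        # For example, the download may have saved a web page instead.
        raise ValueError(f"{archive} is not a valid zip archive")
    Path(target).mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target)
        return zf.namelist()


if __name__ == "__main__":
    # Assumption: both archives sit in the current working directory.
    for name in ("data.zip", "models.zip"):
        if Path(name).exists():
            members = extract_archive(name)
            print(f"{name}: extracted {len(members)} entries")
        else:
            print(f"{name} not found, skipping")
```

A ValueError at this point suggests that the download itself should be repeated rather than the extraction.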

Rerunning analyses

The benchmarking analysis can be rerun using the code in the bechmark_run sub-folder. Specifically, the run with the final parameters for the chemical features can be started with:

cd experiments/streptavidin
mkdir results_strep
cd bechmark_run
mkdir job_files_exp
python benchmark_run/run_final_aa.py
sh job_files_exp/job0.sh
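
The commands above execute a single job script, job0.sh. If run_final_aa.py writes several such scripts (an assumption; only job0.sh is mentioned here), they could be run one after another with a small helper like the sketch below. The naming pattern job<N>.sh and the function names job_scripts and run_jobs are illustrative and not part of the repository.

```python
import re
import subprocess
from pathlib import Path


def job_scripts(folder):
    """Return the job<N>.sh files in `folder`, ordered by their numeric index."""
    pattern = re.compile(r"job(\d+)\.sh")
    found = []
    for path in Path(folder).glob("job*.sh"):
        match = pattern.fullmatch(path.name)
        if match:
            found.append((int(match.group(1)), path))
    return [path for _, path in sorted(found)]


def run_jobs(folder):
    """Run each job script with `sh`, stopping at the first failure."""
    for script in job_scripts(folder):
        print("running", script)
        subprocess.run(["sh", str(script)], check=True)


if __name__ == "__main__":
    # Assumption: called from the folder that contains job_files_exp.
    run_jobs("job_files_exp")
```

Sorting by the numeric index rather than by file name keeps job10.sh after job2.sh.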

The final model is then saved to the results_strep subfolder, along with plots of the different cross-validation splits. Rerunning the extensive benchmark requires access to our MongoDB database. The code and the calculation of structures are, however, available online. We used benchmark_run/run_extra_analysis_bench_2.py to generate the hyperparameters for benchmarking.

Plots

The plots for the publication and statistical analysis can be found in the subfolder plots/.

Citation

To cite this work, please use:

@article{Vornholt2024,
	author = {Vornholt*, Tobias and Mutn{\'y}*, Mojm{\'\i}r and Schmidt, Gregor and Schellhaas, Christian and Tachibana, Ryo and Panke, Sven and Ward, Thomas R. and Krause, Andreas and Jeschek, Markus},
	journal = {ACS Central Science},
	title = {{Enhanced Sequence-Activity Mapping and Evolution of Artificial Metalloenzymes by Active Learning}},
	url = {https://www.biorxiv.org/content/10.1101/2024.02.06.579157v1.full.pdf},
	year = {2024}
}

Authors & Contact

This repository was assembled by Mojmir Mutny (ETH Zuerich) and Tobias Vornholt (ETH Zuerich and University of Basel).

For any inquiries regarding the code, please use: mmutny@inf.ethz.ch