Few-shot prediction of the experimental functional measurements for proteins with single point mutations
Code for quantitative prediction of the impact of single point mutations on experimentally measured protein function.
The data must be preprocessed before model training can be run.
Wild-type and mutated sequence representations are generated using the ESM-1b model (Rao et al. 2021, ICML'21 version, June 2021).
The location of the ESM embedder files must be specified before the sequence embeddings can be computed.
The paths are defined in utils.py.
The files can be downloaded from here: Model ESM1b and Regression weights ESM1b
Run the data creation script with an argument specifying where to write the resulting dumps.
Each dump file contains all the protein mutations, scores and embeddings.
run_prism_data_creation.py -dump_root [path_to_folder]
This will create pickled files:
To run the model's zero-shot pre-training, specify where the data dump files are located (a parameter in config.win.ini or config.linux.ini).
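As an illustration of how a dump location could be read from one of these files, here is a minimal sketch using Python's standard configparser. The section name `paths` and the key `dump_root` are hypothetical placeholders, not the names used in the project's configuration files; check config.win.ini or config.linux.ini for the actual parameter.

```python
import configparser
import platform


def config_file_for_platform() -> str:
    """Pick one of the two config file names mentioned above, based on the OS."""
    return "config.win.ini" if platform.system() == "Windows" else "config.linux.ini"


def read_dump_root(ini_text: str, section: str = "paths", key: str = "dump_root") -> str:
    """Return the dump folder from INI text.

    NOTE: `section` and `key` are assumptions made for this example only.
    """
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    return parser.get(section, key)


# Hypothetical file content, used only to demonstrate the parsing.
example_ini = """
[paths]
dump_root = /data/prism_dumps
"""

dump_root = read_dump_root(example_ini)
config_name = config_file_for_platform()
```

For a real file, `parser.read(config_name)` would replace `parser.read_string(...)`.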
Once the pickled data with per-protein embeddings has been created (this might take hours), the model can be run for training and evaluation.
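Before starting a long training run it can be useful to confirm that the dumps exist and load correctly. The sketch below is a generic pickle round trip, not the project's own loader: the `*.pkl` pattern, the file name `protein_1.pkl`, and the dictionary layout are all assumptions made for the example.

```python
import pickle
import tempfile
from pathlib import Path


def list_dumps(dump_root, pattern="*.pkl"):
    """List dump files in a folder. The '*.pkl' pattern is an assumption."""
    return sorted(Path(dump_root).glob(pattern))


def load_dump(path):
    """Load one pickled dump. Only unpickle files you created or trust."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Demonstration with a temporary folder and a made-up dump layout.
with tempfile.TemporaryDirectory() as tmp:
    sample = {"mutations": ["A1G"], "scores": [0.5]}  # hypothetical structure
    with open(Path(tmp) / "protein_1.pkl", "wb") as handle:
        pickle.dump(sample, handle)

    found = list_dumps(tmp)
    loaded = load_dump(found[0])
```

Pointing `list_dumps` at the folder passed to `-dump_root` would show which dumps are already available.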
Run the model for zero-shot and fine-tuning, where [X] is an integer from 1 to 39 that selects a protein from the list specified in utils.py:
run_full_flow.py -ep [X]
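To process every protein in turn, the command can be generated for each index. The following is a minimal sketch that only builds the command lines; it assumes the script is started with the current Python interpreter from the folder that contains run_full_flow.py. The helper name `build_commands` is introduced here for illustration.

```python
import sys
from typing import List


def build_commands(first: int = 1, last: int = 39) -> List[List[str]]:
    """Build one 'run_full_flow.py -ep X' command per protein index."""
    return [
        [sys.executable, "run_full_flow.py", "-ep", str(index)]
        for index in range(first, last + 1)
    ]


commands = build_commands()

# To actually launch the runs one after another:
#   import subprocess
#   for cmd in commands:
#       subprocess.run(cmd, check=True)
```

Running the proteins sequentially keeps a single training job on the GPU at a time; `check=True` stops the loop at the first failing run.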
MAVE data files are included in the 'mave' folder.
The complete project is written in Python with PyTorch and is located in the 'code' folder.
Several configuration files are provided for specifying various parameters of the system.
One can specify the size of the model (number of channels, number of attention layers) as well as training parameters such as learning rate, number of epochs, and patience.
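The patience parameter is commonly used for early stopping. The sketch below assumes the usual meaning, namely the number of consecutive epochs without an improvement in validation loss that is tolerated before training stops; the project's own stopping rule may differ. The function name and the loss values are made up for the example.

```python
def epochs_until_stop(val_losses, patience):
    """Return the 1-based epoch at which training would stop.

    Training stops once `patience` consecutive epochs fail to improve on the
    best validation loss seen so far. If that never happens, all epochs run.
    """
    best = float("inf")
    stale_epochs = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            stale_epochs = 0  # improvement: reset the counter
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                return epoch
    return len(val_losses)


# Made-up validation losses: the best value so far (0.7) is reached at epoch 3.
losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.69]
stopped_at = epochs_until_stop(losses, patience=3)
```

With `patience=3` the run stops after epoch 6 and never sees the better loss at epoch 7, which shows the trade-off between training time and a larger patience value.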