/fs-mutation-prediction

Official repository for the paper "Few-shot Prediction of the experimental functional measurements for proteins with single point mutations".

Primary language: Python. License: MIT.

Few-shot prediction of the experimental functional measurements for proteins with single point mutations

Code for quantitative prediction of the impact of single point mutations on experimentally measured protein function.

Usage

Data must be preprocessed before the model can be trained.

Wild-type and mutated sequence representations are generated using the ESM-1b model (Rao et al. 2021; ICML'21 version, June 2021).
The location of the ESM embedder files must be specified so that the sequence embeddings can be computed.
The paths are defined in utils.py.

The files can be downloaded from here: Model ESM1b and Regression weights ESM1b
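
Before running the data creation script, it can help to confirm that both downloaded files are where the code expects them. The following is a minimal sketch, not code from this repository: the function names and any paths passed to them are hypothetical and are not the variables used in utils.py. The optional loader relies on `esm.pretrained.load_model_and_alphabet_local` from the fair-esm package, which expects the regression weights to sit next to the model checkpoint. Whether this repository loads the model in the same way is an assumption.

```python
from pathlib import Path


def check_esm_files(model_path, regression_path):
    """Return both paths as Path objects, or raise if either file is missing."""
    paths = (Path(model_path), Path(regression_path))
    missing = [str(p) for p in paths if not p.is_file()]
    if missing:
        raise FileNotFoundError("Missing ESM-1b files: " + ", ".join(missing))
    return paths


def load_local_esm(model_path):
    """Load a local ESM checkpoint with fair-esm (hypothetical helper).

    The import is done lazily so the path check above works without fair-esm.
    """
    import esm  # requires the fair-esm package

    return esm.pretrained.load_model_and_alphabet_local(str(model_path))
```

Calling `check_esm_files` first gives a clear error message if one of the downloads is missing or if a path in utils.py is wrong.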

Run the data creation script with an argument specifying where to put the resulting dumps.

Each dump file contains all the protein mutations, scores and embeddings.

run_prism_data_creation.py -dump_root [path_to_folder]

This will create the pickled dump files in the specified folder.
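
To inspect the generated dumps, they can be loaded back with the standard `pickle` module. This is a sketch under stated assumptions: the `*.pkl` file pattern and the `load_dumps` helper are hypothetical, and the internal layout of each dump is not documented here, so the loaded objects are returned as-is. If the dumps contain classes defined in this project, unpickling probably has to be run from the 'code' folder so that those classes can be imported. Only unpickle files you created yourself.

```python
import pickle
from pathlib import Path


def load_dumps(dump_root, pattern="*.pkl"):
    """Load every file under dump_root that matches pattern, keyed by file name.

    The default pattern is an assumption; adjust it to the actual dump names.
    """
    dumps = {}
    for path in sorted(Path(dump_root).glob(pattern)):
        with path.open("rb") as handle:
            dumps[path.name] = pickle.load(handle)
    return dumps
```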

To run the zero-shot pre-training process, specify where the data dump files are located (a parameter in config.win.ini or config.linux.ini).

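
The configuration files are INI files, so a setting can be checked with the standard `configparser` module. The sketch below is illustrative only: the section and option names used in the usage comment (`general`, `dump_root`) are hypothetical placeholders, not the names used in the actual configuration files, and choosing the file by operating system is an assumption based on the file names.

```python
import configparser
import platform


def config_file_for_platform(system=None):
    """Pick config.win.ini on Windows and config.linux.ini otherwise."""
    system = system or platform.system()
    return "config.win.ini" if system == "Windows" else "config.linux.ini"


def read_option(config_path, section, option):
    """Return one option from an INI file, raising if the file cannot be read."""
    parser = configparser.ConfigParser()
    if not parser.read(config_path):
        raise FileNotFoundError(f"Cannot read config file: {config_path}")
    return parser.get(section, option)


# Hypothetical usage; replace the section and option with the real names:
# dump_root = read_option(config_file_for_platform(), "general", "dump_root")
```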
Once the pickled data with per-protein embeddings has been created (this might take hours), the model can be trained and evaluated.

Run the model for zero-shot and fine-tuning, where [X] is an integer from 1 to 39 that selects a protein from the list specified in utils.py.

run_full_flow.py -ep [X] 
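
To run the full flow for every protein in turn, the command above can be wrapped in a small driver. This is a sketch rather than part of the repository: the helper names are hypothetical, and it assumes the script is started with the current Python interpreter from the folder that contains run_full_flow.py.

```python
import subprocess
import sys


def build_commands(first=1, last=39, script="run_full_flow.py"):
    """Build one command line per protein index in [first, last]."""
    return [[sys.executable, script, "-ep", str(x)] for x in range(first, last + 1)]


def run_all(commands):
    """Run the commands one after another, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


# Hypothetical usage: run_all(build_commands())
```

Running the proteins sequentially keeps a single job on the GPU at a time; `check=True` makes the driver stop as soon as one run fails.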

Data

MAVE data files are included in the 'mave' folder.

Source code

The complete project is written in Python with PyTorch and is located in the 'code' folder.

Configuration files

Several configuration files are provided for specifying various parameters of the system.

One can specify the size of the model (number of channels, number of attention layers) as well as training parameters such as the learning rate, number of epochs and patience.
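
For readers unfamiliar with the patience parameter, the sketch below shows how it is commonly used for early stopping: training stops once the validation loss has failed to improve for `patience` consecutive epochs. This assumes patience has its usual meaning here. The class is a generic illustration, not the implementation used in this repository, and `min_delta` is an extra hypothetical parameter.

```python
class EarlyStopping:
    """Stop training after `patience` epochs without validation improvement."""

    def __init__(self, patience, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            # Improvement: remember the new best loss and reset the counter.
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

In a training loop, `step` would be called once per epoch with the validation loss, and the loop would exit when it returns True or when the configured number of epochs is reached.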