An evolutionary context-integrated deep learning framework for protein engineering
ECNet (evolutionary context-integrated neural network) is a deep learning model that guides protein engineering by predicting protein fitness from sequence. It integrates the local evolutionary context from homologous sequences, which explicitly models residue-residue epistasis for the protein of interest, with the global evolutionary context that encodes rich semantic and structural features from the enormous protein sequence universe. Please see our Nature Communications paper for details.
Clone the GitHub repository and add its directory to the Python path:
git clone https://github.com/luoyunan/ECNet.git
cd ECNet
export PYTHONPATH=$PWD:$PYTHONPATH
This package is tested with Python 3.7 and CUDA 10.1 on Ubuntu 18.04, with access to an Nvidia GeForce TITAN X GPU (12GB RAM) and Intel Xeon E5-2650 v3 CPU (2.30 GHz, 512G RAM). Please see requirements.txt for necessary python dependencies, all of which can be easily installed with pip or conda. Due to an issue of installing pytorch 1.4.0 with pip, please install pytorch with conda first.
conda install pytorch==1.4.0 cudatoolkit=10.1 -c pytorch
pip install -r requirements.txt
- Download example data (~5.4MB) from Dropbox.
wget https://www.dropbox.com/s/nkgubuwfwiyy0ze/data.tar.gz
tar xf data.tar.gz
- Run the example script. The following script trains an ECNet model using the fitness data of the second RRM domain of Pab1 (source). The script randomly splits the data into 70% training, 10% validation, and 20% test sets. It typically takes no more than 15 min in our tested environment to run this example. The output (printed to stdout) is the correlation between predicted and ground-truth fitness values.
CUDA_VISIBLE_DEVICES=0 python scripts/run_example.py \
    --train data/RRM_single.tsv \
    --fasta data/RRM.fasta \
    --local_feature data/RRM.braw \
    --output_dir ./output/RRM_CV \
    --save_prediction \
    --n_ensembles 2 \
    --epochs 100
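The random 70/10/20 split mentioned above can be illustrated with a small standalone sketch. This is not the code used by run_example.py; the function name `split_indices`, the fixed seed, and the rounding rule are assumptions made only for illustration.

```python
import random


def split_indices(n_items, ratios=(0.7, 0.1, 0.2), seed=0):
    """Shuffle the indices 0..n_items-1 and cut them into train/valid/test lists.

    Hypothetical helper: ECNet's own splitting logic may differ in detail.
    """
    if abs(sum(ratios) - 1.0) > 1e-8:
        raise ValueError("ratios must sum to 1")
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_train = round(ratios[0] * n_items)
    n_valid = round(ratios[1] * n_items)
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]  # whatever remains goes to the test set
    return train, valid, test


train, valid, test = split_indices(1000)
print(len(train), len(valid), len(test))
```

Fixing the seed keeps the three sets identical across runs, which makes comparisons between model configurations easier to interpret.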
ECNet has two required input files: 1) a FASTA file of the wild-type sequence, and 2) a TSV file that describes the fitness values of variants. Optional input files include the output of CCMPred, used for extracting local features, and a separate test TSV file.
- Sequence FASTA file (--fasta, required). A regular FASTA file of the wild-type sequence. This file should contain only one sequence.
- Fitness TSV file (--train, required). Each line has two columns, mutation and score, separated by a tab, describing the fitness value of a variant. The mutation column is a string in the format [ref][pos][alt], e.g., S100T, meaning that the 100th amino acid (index starting from 1) is mutated from S to T. If a variant has multiple mutations, ; is used to concatenate them. The score column is a numerical value that quantifies the variant's fitness. Example:
    mutation    score
    M1S    1.0
    F12I;L30K    2.0
    G89A    0.06
  Note: This file is supplied using the --train argument. If no separate test data is provided through the --test argument, this TSV file will be split into three sets (train, valid, and test) using the ratios specified by --split_ratio (three float numbers). If another test TSV file is provided, this TSV file will be split into two sets (train and valid) as specified by --split_ratio (two float numbers).
- Local features (--local_feature, optional). A binary file generated by CCMPred using the -b option (note that to use the -b option you need to install CCMPred from its latest GitHub branch instead of the release; you may also need to install libmsgpack-dev. See instructions below). ECNet will extract local features from this file. If it is not provided, add the --no_local_feature flag when running run_example.py (or, equivalently, set use_local_features=False for the ECNet class) and ECNet will not use the local features. See below for instructions on generating this binary file using HHblits and CCMPred.
- Additional test TSV file (--test, optional). This file has the same format as the --train TSV file.
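To make the mutation format concrete, the following sketch parses a mutation string such as F12I;L30K and applies it to a wild-type sequence. It is only an illustration of the [ref][pos][alt] convention described above, not ECNet's own parser: the helper names `parse_variant` and `apply_variant` and the toy sequence are hypothetical, and the pattern assumes single uppercase letters for residues.

```python
import re

# One mutation token: reference residue, 1-based position, alternate residue.
# Assumption: residues are single uppercase letters (no stop codons or gaps).
MUTATION_PATTERN = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def parse_variant(variant):
    """Split a string like 'F12I;L30K' into (ref, pos, alt) tuples."""
    mutations = []
    for token in variant.split(";"):
        match = MUTATION_PATTERN.match(token.strip())
        if match is None:
            raise ValueError(f"malformed mutation: {token!r}")
        ref, pos, alt = match.groups()
        mutations.append((ref, int(pos), alt))
    return mutations


def apply_variant(wild_type, variant):
    """Return the mutated sequence, checking each reference residue first."""
    residues = list(wild_type)
    for ref, pos, alt in parse_variant(variant):
        if not 1 <= pos <= len(residues):
            raise ValueError(f"position {pos} is outside the sequence")
        if residues[pos - 1] != ref:  # positions in the TSV are 1-based
            raise ValueError(f"expected {ref} at position {pos}, found {residues[pos - 1]}")
        residues[pos - 1] = alt
    return "".join(residues)


toy_wild_type = "MKTAYIAKQR"  # made-up sequence, not the RRM domain
print(parse_variant("F12I;L30K"))
print(apply_variant(toy_wild_type, "M1S;K2R"))
```

Checking the reference residue against the FASTA sequence before training is a cheap way to catch off-by-one position errors in a fitness file.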
We suggest that users tune the hyperparameters for each new protein. Several hyperparameters are exposed as arguments, e.g., d_embed, d_model, d_h, n_layers, etc.
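One simple way to organize such tuning is a small grid search that generates one command line per configuration. The sketch below only builds the commands and does not run them. It assumes the hyperparameters are passed as flags of the form --d_embed, --d_model, and --n_layers; the candidate values and the output directory naming are placeholders, not recommendations.

```python
import itertools

# Hypothetical search space; the values are placeholders, not tuned settings.
SEARCH_SPACE = {
    "d_embed": [20, 40],
    "d_model": [128, 256],
    "n_layers": [1, 2],
}

# Fixed arguments taken from the example command in this README.
BASE_COMMAND = [
    "python", "scripts/run_example.py",
    "--train", "data/RRM_single.tsv",
    "--fasta", "data/RRM.fasta",
    "--local_feature", "data/RRM.braw",
]


def grid_commands(base_command, search_space):
    """Yield one argument list per combination of hyperparameter values."""
    names = sorted(search_space)
    for values in itertools.product(*(search_space[name] for name in names)):
        command = list(base_command)
        tag_parts = []
        for name, value in zip(names, values):
            command += [f"--{name}", str(value)]  # assumed flag spelling
            tag_parts.append(f"{name}{value}")
        # Give each run its own output directory so results do not overwrite.
        command += ["--output_dir", "./output/tune_" + "_".join(tag_parts)]
        yield command


commands = list(grid_commands(BASE_COMMAND, SEARCH_SPACE))
for command in commands:
    print(" ".join(command))
```

Each printed line can be launched by hand or passed to `subprocess.run`. Comparing configurations on the validation split rather than the test split keeps the final test score unbiased.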
- Install HHsuite and CCMPred following their instructions. Note that CCMPred should be installed from the latest branch instead of the release; otherwise the -b option is not available. Also, as CCMPred uses msgpack to create the binary file, you may need to install libmsgpack-dev on your system if it is not available. For example, on Ubuntu, you can run sudo apt update, then sudo apt install libmsgpack-dev.
- Prepare a FASTA file example.fasta of the wild-type sequence of the protein of interest.
- Search for homologous sequences of the wild-type sequence using hhblits in HHsuite. (There are multiple ways to search for homologous sequences and format the alignment. Below we describe one way that uses hhblits. Other ways are also feasible, e.g., using jackhmmer as described in the DeepSequence paper.)
    hhblits -i example.fasta \
        -d ${path_to_hhblits_database} \
        -o example.hhr \
        -oa3m example.a3m \
        -n 3 \
        -id 99 \
        -cov 50 \
        -cpu 8
- Reformat the a3m output of hhblits to PSICOV format (solution modified from here). To run CCMpred, the alignment must be reformatted to the "PSICOV" format used by CCMpred. We can first use the reformat.pl script from the hh-suite/scripts directory to get an alignment in fasta format, and then convert_alignment.py from the CCMpred/scripts directory to get the PSICOV format:
    ${path_to_hh-suite}/scripts/reformat.pl example.a3m example.fas -r
    python ${path_to_CCMpred}/scripts/convert_alignment.py example.fas fasta example.psc
- Run CCMPred:
    ccmpred example.psc example.mat -b example.braw -d 0
- Use the argument --local_feature example.braw to provide the local features to ECNet.
The following example shows how to train ECNet on dataset A (passed via --train) and test it on another dataset B (passed via --test).
- Example 1: train on single-mutant fitness data of RRM (source), and predict for double-mutants
CUDA_VISIBLE_DEVICES=0 python scripts/run_example.py \
    --train data/RRM_single.tsv \
    --test data/RRM_double.tsv \
    --fasta data/RRM.fasta \
    --split_ratio 0.9 0.1 \
    --local_feature data/RRM.braw \
    --output_dir ./output/RRM \
    --save_checkpoint \
    --n_ensembles 2 \
    --epochs 100
- Example 2: you can also load the trained model using the --saved_model_dir argument and predict for the test dataset:
CUDA_VISIBLE_DEVICES=0 python scripts/run_example.py \
    --test data/RRM_double.tsv \
    --fasta data/RRM.fasta \
    --local_feature data/RRM.braw \
    --n_ensembles 2 \
    --output_dir ./output/RRM \
    --saved_model_dir ./output/RRM
Luo, Y. et al. ECNet is an evolutionary context-integrated deep learning framework for protein engineering. Nat Commun 12, 5743 (2021). https://doi.org/10.1038/s41467-021-25976-8
@article{luo2021ecnet,
doi = {10.1038/s41467-021-25976-8},
url = {https://doi.org/10.1038/s41467-021-25976-8},
year = {2021},
month = sep,
publisher = {Springer Science and Business Media {LLC}},
volume = {12},
number = {1},
author = {Yunan Luo and Guangde Jiang and Tianhao Yu and Yang Liu and Lam Vo and Hantian Ding and Yufeng Su and Wesley Wei Qian and Huimin Zhao and Jian Peng},
title = {{ECNet} is an evolutionary context-integrated deep learning framework for protein engineering},
journal = {Nature Communications}
}
Please submit GitHub issues or contact Yunan Luo (luoyunan[at]gmail[dot]com) for any questions related to the source code.