Deep learning methods for interpreting protein contact maps
This work has been developed and used by the Bioinformatics, Data Mining and Machine Learning Laboratory (BDM) . Help from every community member is very valuable to make the tool better for everyone. Checkout the Lab Page.
PSI-BLAST 2.2.26 (For generating PSSM sequence profile)
CCMpred (For generating pseudo-likelihood maximization)
numpy 1.18.1
pandas 1.1.2
tensorflow-gpu(Recommended if a CUDA-compatibale GPU is available) /tensorflow 1.15.2
keras 2.1.6
We also provided the Dockerfile to help users setup the required environment when Docker is available on the host system. Please refer to the Docker Tutorials (https://www.docker.com/101-tutorial) for how to build your own image under different systems.
Two types of features are required: PSSM sequence profile which can be generated from PSI-BLAST, and PLM (pseudo-likelihood maximization) which can be generated from CCMpred from multiple sequence alignments (MSAs) produced by DeepMSA. The details of how to acquire both features are described in the DeepDist paper, which is also developed by the BDM lab.
The sequence databases used in the DeepMSA homologous sequences search include Uniclust30 (2017-10), Uniref90 (2018-04) and Metaclust50 (2018-01), our in-house customized database which combines Uniref100 (2018-04) and metagenomics sequence databases (2018-04), and NR90 database (2016). Sample features can be found under the example folder, and users can build both features from their own customized sequence databases.
Predict from given PLM and PSSM data (predict.py):
-h, --help
show this help message and exit-m, --model_type
Type of model, can be one of "sequence", "regional", or "combine"-l, --plm_data
Path to PLM data. Should be a numpy array flatten from (441,L,L), where L is the length of the input sequence. It should be saved as .npy format (https://numpy.org/doc/stable/reference/generated/numpy.save.html).-s, --pssm_data
Path to PSSM data. Should be a text file start with " # PSSM" as the first line, and the following contents should be 20 lines each contains L values.-o, --out_file
Path to output contact map. An L by L numeric matrix saved as TSV format.-w, --weights
Should attention weights be extracted. Sequence attention weights would have shape (heads,L,L). Regional attention weights would have shape (heads,n2,L,L), where n is the side length of the scope of attention mechanism. Both weights are saved as .npy format. Detailed description of how attention weights are computed can be found at https://www.biorxiv.org/content/10.1101/2020.09.04.283937v1.
Example:
python predict.py -m sequence -l example/plm/T0970.plm -s example/other/T0970.pssm.npy -o ./outmap.tsv
Train models for protein contact prediction with given PLM, PSSM and labels for training and validation datasets (train.py):
-
-h, --help
show this help message and exit. -
-plm, --plm_data_path
Path to PLM data files. -
-pssm, --pssm_data_path
Path to PSSM data files. -
-l, --label_path
Path to label data files. Should be text files with 0/1 indicate the contacts. -
-s, --sample_list_file
Config file indicating the sample names for training and validation. -
-m, --model_type
Type of model, can be "sequence" or "regional". -
-o, --output_dir
Path where the trained models and history are saved. -
-pa, --patience
Stop the training early for no improvements in validation after x epochs, default is 5. -
-e, --epochs
Number of epochs, default is 60.
Example:
python train.py -m sequence -plm example/plm/ -pssm example/other/ -l example/label/ -s example/train_sample_list.txt -v example/val_sample_list.txt -o ./