DeepDDG: Predicting the Stability Change of Protein Point Mutations Using Neural Networks
Further analysis of model building and evaluation result can be found at the report.
- dataset_1: is generated from original_dataset_oct_2020 by changing the column names as "_" separated with lower case letters except for "pH" and "T". All the columns are kept as they were in the original one.
- dataset_2: the mutation column is seprated into three columns as wild_residue, mutaion_site and mutant_residue. Wild and mutant residue are 3-letter amino acid representation.
- The train set contains 5444 mutation data points from 209 proteins.
- The test set contains 276 mutation data points from 37 proteins.
- Different ddG values are reported for same mutation, which is removed in dataset_3.
- Now the train and test set contains 4344 mutations of 209 proteins and 253 mutations of 37 proteins
- The train and test set contains no common proteins.
- The output_images/data_distribution directory compares the train and test set distributions upon:
- Amino acid vs number of mutations
- ddG vs number of mutations
- Mutation site vs number of mutations
- pH vs number of mutations
- Proteins vs number of mutations
- Temperature vs number of mutations
- Some data issues are reported at the bottom of this file.
-
Backbone dihedral: phi, psi and omega angels of backbone atoms [N, CA, C] of a residue. (6 features)
-
Solvent accessible surface area (SASA): using Naccess
- To download: http://www.bioinf.manchester.ac.uk/naccess/
- Get the access key from the authors, and compile it from /home directory. If the directory path contains space (i.e. ..../New volme/....) it cannot it compiled. It requires gfortran and csh.
- Sample usage: naccess 1a43A.pdb
- After installation the path should be:
"3rd_party_items/naccess"
-
Secondary structures (SS): using Stride
- Downloaded and compiled following http://webclu.bio.wzw.tum.de/stride/
- DeepDDG mentions they use 3 bits for SS, but did not mention for which 3 types.
- So, I keep coil (C), beta-sheet (E) and alpha-helix (H). Although DSSP classifies SS as 8 types.
- After installation the path should be:
"3rd_party_items/stride"
-
Hydrogen bonds: using HBPLUS
- To download: https://www.ebi.ac.uk/thornton-srv/software/HBPLUS/
- Usage manual: https://www.ebi.ac.uk/thornton-srv/software/HBPLUS/manual.html
- After installation the path should be:
"3rd_party_items/hbplus"
-
Position-specific scoring matrix (PSSM): using PSI-BLAST (authors version 2.7.1)
-
Download the latest version ncbi-blast executables from https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/
- File name: ncbi-blast-2.12.0+-x64-linux.tar.gz
- ncbi-blast executables contain blastp, blastn, psiblast and so on.
- Quick start on BLAST: https://www.ncbi.nlm.nih.gov/books/NBK569856/
- After installation the path should be:
"3rd_party_items/ncbi-blast-2.12.0+/bin/psiblast"
-
Database setup
- I used rp-seq-15 instead of rp-seq-55, because rp-seq-55 is more than 20GB of size where as rp-seq-15 is less than 6GB of size.
- Download link: https://proteininformationresource.org/rps/
- Make the database compatible for psi-blast:
3rd_party_items/ncbi-blast-2.12.0+/bin/makeblastdb -dbtype prot -in path/to/rp-seq-55.fasta -input_type fasta -out 3rd_party_items/rp_seq_15/
-
Since PSSM feature generation takes longer, this is generated distributedly in GMU argo cluster:
sbatch jobs/distributed_pssm_generator.sh
-
Authors' also applied softmax over PSSM row-wise as the final PSSM.
-
The following items are optional for practicing purposes:
-
Download the database from https://ftp.ncbi.nlm.nih.gov/blast/db/*
- File name: swissprot.tar.gz (because it is small to setup and test)
-
Biopython provides
NcbipsiblastCommandline
to use psi-blast software.from Bio.Blast.Applications import NcbipsiblastCommandline
-
Sample example of BLAST usage from command line:
- To download swissprot:
3rd_party_items/ncbi-blast-2.12.0+/bin/update_blastdb.pl --decompress swissprot
- To download all "nr" databases: "nr" is the non-redundent version of the sequence database.
3rd_party_items/ncbi-blast-2.12.0+/bin/update_blastdb.pl --decompress nr [*]
ncbi-blast-2.12.0+/bin/update_blastdb.pl --decompress nr.02
- To create blast-db from fasta sequences:
3rd_party_items/ncbi-blast-2.12.0+/bin/makeblastdb -dbtype prot -in path/4eiuA.fasta -input_type fasta -out path_to_save/db_name
3rd_party_items/ncbi-blast-2.12.0+/bin/makeblastdb -dbtype prot -in 3rd_party_items/rp-seqs-15.fasta -input_type fasta -out 3rd_party_items/rp_req_15/rp_req_15
- To download swissprot:
-
-
-
Pairwise fitness score (PFS) after Multiple sequence alignment (MSA): using HH-suite3.0
- HH-suite homepage: https://github.com/soedinglab/hh-suite
- To download precompiled HH-suite: https://mmseqs.com/hhsuite/
- File name: hhsuite-linux-sse2.tar.gz
- User guide: https://github.com/soedinglab/hh-suite/wiki#generating-a-multiple-sequence-alignment-using-hhblits
- To set up the database I first download the scop40_01Mar17.tgz from https://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/ (because it is small 381M)
- Sample usage: ./3rd_party_items/hhsuite/bin/hhblits -i data/fastas/4eiuA.fasta -o test.msa -d ./3rd_party_items/scop40_01Mar17/scop40
- To download uniprot_20: https://wwwuser.gwdg.de/~compbiol/data/hhsuite/benchmark/
- After installation the path should be:
"3rd_party_items/hhsuite/bin/hhblits"
-
After the avobe setup successfully, run the following command
python datasets/input_data_generator.py
- To generate train/test set specific feature, line 19 and 20 must be changed accordingly.
- To train the model, run:
python DeepDDG/train.py
- To test the model, run:
python DeepDDG/test.py
- Column name: PDB ID with modifications to be made
- 1A43 -> if I want to take a chain, I shall take the "A"
- 1SPB:P -> I thought P is a chain but not!!! what is this then?
- 1A7V:Q1A -> the 1st residue is Q, which will be substituted by A.
- 1ACB:I:F10W -> there is no I chain,
- 1CEY_F14N -> you can see that although it's the 13th residue according to the PDB, the author labeled it 14.
- 1CEY_WT
- 1CFD_1-75: I think I need to consider only 1-75th residue
- 1CFD_1-78_F19Y: I think I need to consider 1-78th residue, here 19th is F, will be substituted by Y.
- 1VII_N68AK70M: N 68 ->A and K 70-> M
- The SASA value is taken from rsa file the absolute (ABS) all atoms value.
- For the same mutation of the same protein the dataset contains different ddg values. i.e 1A43 has 9 mutations where 4 of them are repeated with different ddG values. Took the average following Potapov et al (https://doi.org/10.1093/protein/gzp030).
- No chain id is not given in the original dataset. I selected the 1st chain by default. Because in many cases chain A is not present. And in many cases 1st chain is not A. Such cases are:
-
PDBID Chain-id 1lmb 4 1tup A 1azp A 1bf4 A 1otr B 1glu A 1hcq A 1iv7 B
-
- 1hfzA has the 1st residue as 1x. I take the starting index as where the residue id becomes integer.
- Entry 2A01 was removed. Check this: https://www.rcsb.org/structure/removed/2A01
- Last residue does not have any dihedral angles. Therefore, last residue cannot be neighbor residue.
- 1a7cA does not have residue from 334 to 347.
- Some ddG values are out of range ([-10, 10]).
- 1am7A does not have residue of number 17.
- 1amqA does not have residue of number 407.
- In many cases, a particular neighbor residue does not have Ca, C or N atoms which is required to compute dihedral angles. To solve this problem, this neighbor residue is avoided and set the next residue as neighbor and follow the process again.
- 2ptlA_T_19_A: this does not have any hit while computing PSSM, need bigger database, and no PSSM file generated. Therefore returned 0 for softmax pssm.
- 4hxj_A_D_141_A: the secondary structure is not of the same size of the number of residues, manually corrected.
- 5np8_A_T_378_P: this does not have residue from 373 to 381. But 369th residue is Thr (T). So manually changed mutation point.
- 2arf_A_H_1069_Q: when SASA is computed there is no gap between chain_id and residue_num column. This is corrected manually.
- Note that: this type of errors may occur in other cases, which is not reported since the adapted code successfully avoided or solved them.