DeepDDG reconstruction

DeepDDG: Predicting the Stability Change of Protein Point Mutations Using Neural Networks

Further analysis of model building and evaluation results can be found in the report.

Data cleaning

  • dataset_1: generated from original_dataset_oct_2020 by renaming the columns to lower-case, "_"-separated names, except for "pH" and "T". All columns of the original dataset are kept.
  • dataset_2: the mutation column is separated into three columns: wild_residue, mutation_site and mutant_residue. The wild and mutant residues use the 3-letter amino acid representation.
  • The train set contains 5444 mutation data points from 209 proteins.
  • The test set contains 276 mutation data points from 37 proteins.
  • Some mutations are reported with several different ddG values; these duplicates are removed in dataset_3.
  • After this step, the train set contains 4344 mutations from 209 proteins and the test set contains 253 mutations from 37 proteins.
  • The train and test sets have no proteins in common.
  • The output_images/data_distribution directory compares the train and test set distributions of:
    • Amino acid vs number of mutations
    • ddG vs number of mutations
    • Mutation site vs number of mutations
    • pH vs number of mutations
    • Proteins vs number of mutations
    • Temperature vs number of mutations
  • Some data issues are reported in the "Data issues" section at the bottom of this file.
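
The renaming and splitting steps behind dataset_1 and dataset_2 could be sketched roughly as below. This is a minimal illustration and not the repository's own code: the helper names to_snake_case and split_mutation are hypothetical, the input mutation is assumed to use one-letter codes such as Q1A, and the upper-case 3-letter codes are an assumption about the output format.

```python
import re

# One-letter to three-letter codes for the 20 standard amino acids.
ONE_TO_THREE = {
    "A": "ALA", "R": "ARG", "N": "ASN", "D": "ASP", "C": "CYS",
    "Q": "GLN", "E": "GLU", "G": "GLY", "H": "HIS", "I": "ILE",
    "L": "LEU", "K": "LYS", "M": "MET", "F": "PHE", "P": "PRO",
    "S": "SER", "T": "THR", "W": "TRP", "Y": "TYR", "V": "VAL",
}

# Column names that are left unchanged.
KEEP_AS_IS = {"pH", "T"}

# A single point mutation in one-letter notation, e.g. "Q1A" (assumed format).
MUTATION = re.compile(r"([ACDEFGHIKLMNPQRSTVWY])(-?\d+)([ACDEFGHIKLMNPQRSTVWY])")


def to_snake_case(name):
    """Lower-case a column name and join its words with "_"."""
    if name in KEEP_AS_IS:
        return name
    words = re.findall(r"[A-Za-z0-9]+", name)
    return "_".join(word.lower() for word in words)


def split_mutation(mutation):
    """Split e.g. "Q1A" into (wild_residue, mutation_site, mutant_residue)."""
    match = MUTATION.fullmatch(mutation.strip())
    if match is None:
        raise ValueError(f"unrecognised mutation string: {mutation!r}")
    wild, site, mutant = match.groups()
    return ONE_TO_THREE[wild], int(site), ONE_TO_THREE[mutant]
```

For example, split_mutation("Q1A") gives the triple ("GLN", 1, "ALA"), and to_snake_case leaves "pH" and "T" untouched.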

Feature computation

Training and testing

  • To train the model, run: python DeepDDG/train.py
  • To test the model, run: python DeepDDG/test.py

Clarifications

  • Column name: PDB ID with modifications to be made
  • 1A43 -> if a chain is needed, chain "A" is taken.
  • 1SPB:P -> P looks like a chain ID but is not one; its meaning is still unclear.
  • 1A7V:Q1A -> the 1st residue is Q, which is substituted by A.
  • 1ACB:I:F10W -> there is no chain I.
  • 1CEY_F14N -> the residue is the 13th according to the PDB, but the authors labeled it 14.
  • 1CEY_WT
  • 1CFD_1-75 -> probably only residues 1 to 75 need to be considered.
  • 1CFD_1-78_F19Y -> probably only residues 1 to 78 need to be considered; residue 19 is F and is substituted by Y.
  • 1VII_N68AK70M -> two substitutions: N68 -> A and K70 -> M.
  • The SASA value is the absolute (ABS) all-atoms value taken from the rsa file.
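
The entry formats listed above could be parsed with something like the following. This is only a sketch under assumptions: parse_pdb_field and the dictionary keys are hypothetical names, ":" and "_" are both treated as separators, and any token that is not "WT", a residue range or a run of point mutations (such as the unexplained "P" in 1SPB:P) is kept aside under "other" rather than interpreted.

```python
import re

MUTATION = re.compile(r"([A-Z])(\d+)([A-Z])")
MUTATION_RUN = re.compile(r"(?:[A-Z]\d+[A-Z])+")  # e.g. "F19Y" or "N68AK70M"
RESIDUE_RANGE = re.compile(r"(\d+)-(\d+)")        # e.g. "1-78"


def parse_pdb_field(field):
    """Split a PDB ID entry into the ID and the modifiers that follow it."""
    pdb_id, *tokens = re.split(r"[:_]", field.strip())
    parsed = {
        "pdb_id": pdb_id,
        "wild_type": False,
        "residue_range": None,
        "mutations": [],
        "other": [],
    }
    for token in tokens:
        range_match = RESIDUE_RANGE.fullmatch(token)
        if token == "WT":
            parsed["wild_type"] = True
        elif range_match:
            start, end = range_match.groups()
            parsed["residue_range"] = (int(start), int(end))
        elif MUTATION_RUN.fullmatch(token):
            # A token may hold several substitutions written back to back.
            for wild, site, mutant in MUTATION.findall(token):
                parsed["mutations"].append((wild, int(site), mutant))
        else:
            # Unexplained tokens are stored, not guessed at.
            parsed["other"].append(token)
    return parsed
```

The parser reports residue numbers exactly as they are written in the entry, so a numbering offset such as the one noted for 1CEY_F14N still has to be handled separately.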

Data issues

  • For the same mutation of the same protein, the dataset contains different ddG values. For example, 1A43 has 9 mutations, 4 of which are repeated with different ddG values. The values are averaged, following Potapov et al. (https://doi.org/10.1093/protein/gzp030).
  • No chain ID is given in the original dataset. The 1st chain is selected by default, because in many cases chain A is not present or the 1st chain is not A. Such cases are:
    • PDBID Chain-id
      1lmb     4
      1tup     A
      1azp     A
      1bf4     A
      1otr     B
      1glu     A
      1hcq     A
      1iv7     B
      
  • 1hfzA has "1x" as its 1st residue ID. The starting index is taken as the position where the residue ID becomes an integer.
  • Entry 2A01 was removed. Check this: https://www.rcsb.org/structure/removed/2A01
  • The last residue does not have any dihedral angles. Therefore, the last residue cannot be a neighbor residue.
  • 1a7cA does not have residues 334 to 347.
  • Some ddG values fall outside the range [-10, 10].
  • 1am7A does not have residue number 17.
  • 1amqA does not have residue number 407.
  • In many cases, a neighbor residue lacks the Ca, C or N atoms required to compute dihedral angles. Such a residue is skipped, the next residue is taken as the neighbor instead, and the process is repeated.
  • 2ptlA_T_19_A: no hit was found while computing the PSSM, so no PSSM file was generated; a bigger database is needed. Therefore 0 is returned for the softmax PSSM.
  • 4hxj_A_D_141_A: the secondary structure does not have the same length as the number of residues; this was corrected manually.
  • 5np8_A_T_378_P: the structure does not have residues 373 to 381, but the 369th residue is Thr (T), so the mutation point was changed manually.
  • 2arf_A_H_1069_Q: in the computed SASA output there is no gap between the chain_id and residue_num columns. This was corrected manually.
  • Note that errors of this type may occur in other cases; they are not reported because the adapted code successfully avoided or solved them.
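
The averaging of repeated ddG values mentioned in the first item above can be illustrated with a small sketch. It is not taken from the repository: the function name average_duplicate_ddg is hypothetical, records are assumed to be (protein, mutation, ddG) tuples, and the sample values are made up purely for illustration.

```python
from collections import defaultdict
from statistics import mean


def average_duplicate_ddg(records):
    """Collapse repeated (protein, mutation) measurements into their mean ddG."""
    grouped = defaultdict(list)
    for pdb_id, mutation, ddg in records:
        grouped[(pdb_id, mutation)].append(ddg)
    return {key: mean(values) for key, values in grouped.items()}


# Made-up records: the first mutation is reported twice with different ddG values.
records = [
    ("XXXX", "A10G", 1.0),
    ("XXXX", "A10G", 2.0),
    ("XXXX", "L5V", -0.5),
]
averaged = average_duplicate_ddg(records)
print(averaged[("XXXX", "A10G")])  # mean of 1.0 and 2.0
```

Grouping on the (protein, mutation) pair follows the wording "same mutation of the same protein"; whether experimental conditions such as pH and T should also be part of the key is left open here.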