/korpm

Predicting the effect of mutations on protein stability using a simple orientational potential.

Primary LanguageC++OtherNOASSERTION

korpm

A fast method for predicting the stability change upon mutation from a 3D structure. Predicting protein stability changes upon mutation using a simple orientational potential. I. Martín-Hernández, Y. Dehouck, U. Bastolla, J.R. López-Blanco and P. Chacón. 2023. Bioinformatics, 39, 1, 2023

Git users: Be sure that LFS is installed to support long files, otherwise download the repo as a zip file.
The size of korp6Dv1.bin should be 331.8MB

Usage

The usage is very simple:

sbg/bin/korpm input.txt --dir Ssym --score_file pot/korp6Dv1.bin -o out.txt

it only requires: 1) a two-column input file specifying both the PDB file and mutation, i.e. 1BNI IA76A, and 2) the path where the PDB files are located (--dir) and the KORP potential file (--score_file). The results are also stored in the out.txt (-o option) file.

more out.txt
1BNI IA76A       -1.396
1EY0 TA44V        0.108
1IHB FA82Q       -0.727

The mutation columns stand for: the 1st letter is the wild-type amino acid, 2nd is the chain ID, the digits correspond to the PDB residue position, and the last letter is the mutated amino acid. We follow the standard convention ΔΔG >= 0 (positives) is stabilizing and ΔΔG < 0 (negatives) is destabilizing.

Source Code

The complete source C++ code of KORPM is under sbg directory. Just execute compile_korpm.sh script.

ΔΔG Curated Databases

We extracted from Thermomut and ProThermDB unique mutations trying to avoid entries that potentially interact with ligands or belong to protein-protein interfaces, and removing entries measured at extreme temperature or pH conditions. The initial curated database data comprise 3824 mutations from 139 protein families (seq. identity <25%) with an average of ΔΔG -1.0 Kcal/mol and a standard deviation of 1.6 Kcal/mol. In total, 73% are destabilizing (ΔΔG>0) and 27% are stabilizing (ΔΔG<0). By removing mainly alanines' destabilizing mutations, we obtain a more balanced subset that includes 2371 mutations from 129 protein families, 58% destabilizing and 42% stabilizing with an average of ΔΔG -0.7 Kcal/mol and a standard deviation of 1.6 Kcal/mol. This subset, named Id25c03_1merNCLB.txt, was used for extract training and validation datasets for k-fold cross-validation experiments. Note that this subset is far from being perfectly balanced, e.g., the most frequent amino acid involved in the mutation still is alanine, and cysteines, tryptophans, and, prolines still are underpopulated. This dataset does not include any of the mutations of the Ssym test set. There are two additional subsets excluding all the proteins presenting at least 25% of the sequence of identity with either Ssym dataset Id25c03_1merNCLB_Ssym.txt or S641 dataset Id25c03_1merNCLB_S461.txt. In our case, these different datasets were employed to test the absence of significant overfitting in our simple potential.

Id25c03_1merNCL.txt Id25c03_1merNCLB.txt

In the directory Db you can find all the corresponding PDB files.

Results with Ssym

Ssym is a data set with an equal number of stabilizing and destabilizing mutations compiled by Pucci et al. (https://doi.org/10.1093/bioinformatics/bty348) for which the structure of both the wild-type and mutant protein are available.

sbg/bin/korpm Ssym.txt --dexp --dir Ssym --score_file pot/korp6Dv1.bin -o Ssym_all.txt

Where Ssym.txt is the mutations input file and the Ssym directory is where the input PDB files are stored. Since this input contains the experimental ΔΔG (see Appendix for small corrections) you can cross-check the predictions by:

scripts/Mstat.pl Ssym_all.txt 10 11 2

# Current Positives|Negatives threshold (thr) is 0 (ddG >= 0 are not-destabilizing [positives] and ddG < 0 are destabilizing [negatives]).

aa     S     D     T   TP  avg  err   FP   TN  avg  err   FN     P     N   SEN   SPE   PPV   NPV   ACC  accn  RMSE  MAE   PCC    Sc    Ob1   Ob2  MCC
 X   342   342   684  264  1.5  0.9   71  271 -1.5  0.8   78   335   349 0.772 0.792 0.788 0.777 0.782 0.782 1.331 0.969 0.695  64.6  34.6   0.7  0.56
 A    97    97   194   80  1.9  1.0   15   82 -1.9  0.9   17    95    99 0.825 0.845 0.842 0.828 0.835 0.835 1.525 1.089 0.744  67.0  32.0   1.0  0.67
 V   106   106   212   82  1.2  0.8   24   82 -1.2  0.8   24   106   106 0.774 0.774 0.774 0.774 0.774 0.774 1.187 0.871 0.691  64.6  35.4   0.0  0.55
 I    68    68   136   57  1.6  0.8   11   57 -1.6  0.7   11    68    68 0.838 0.838 0.838 0.838 0.838 0.838 1.128 0.890 0.810  66.9  31.6   1.5  0.68
 L    41    41    82   30  1.7  1.0   12   29 -1.7  1.0   11    42    40 0.732 0.707 0.714 0.725 0.720 0.720 1.507 1.123 0.638  62.2  37.8   0.0  0.44
 .....etc
 

Check ΔΔG Anti-symmetry in Ssym

sbg/bin/korpm SsymD.txt --dexp --dir Ssym --score_file pot/korp6Dv1.bin -o Ssym_dir.txt
sbg/bin/korpm SsymR.txt --dexp --dir Ssym --score_file pot/korp6Dv1.bin -o Ssym_rev.txt
paste Ssym_dir.txt  Ssym_rev.txt  > temp
awk 'function abs(x){return (x < 0) ? -x : x;} {printf "%s %s %s %s %s %s %s %f  %f %s %s\n",$1,$19, $2, $20, $10, $11,$29, ($11+$29), abs(($11+$29)), $3, $4  }' temp > KORPM_Ssym.txt

you can see the results in your favourite plot, for example in gnuplot:

plot  "KORPM_Ssym.txt" u 6:7
stat "KORPM_Ssym.txt" u 6:7
...
  Linear Model:       y = -0.8207 x + 0.03672
  Slope:              -0.8207 +- 0.02468
  Intercept:          0.03672 +- 0.03807
  Correlation:        r = -0.8745
...
Italian Trulli

Comparative results Ssym

Here you can find some comparative results with state-of-the-art stability prediction programs:

METHODRMSEMAEPCCScOb1Ob2TPRTNR PPVNPVACCMCCAROCAPRC
KORPM1.330.970.6964.634.60.70.770.790.790.780.780.560.860.86
ACDCNN1.381.010.6961.538.10.00.700.700.700.700.700.400.800.80
FoldX1.861.290.5460.134.55.40.550.780.710.630.660.330.740.75
EvoFF1.561.120.5461.734.93.40.610.700.670.640.660.310.740.75
PopMusic-S1.581.150.5256.642.41.00.670.710.700.680.690.380.760.74
Dynamut1.881.370.3854.438.27.50.210.880.640.530.550.130.620.62
DDGun3D1.431.040.6361.837.40.70.680.690.690.690.690.370.750.76
ThermoNet1.531.090.5558.240.90.90.650.700.690.670.680.350.750.74
Cartddg3.442.630.6352.341.16.60.580.870.820.670.730.470.810.82

you can find complete results in Ssym_results

Results with S461

you can find complete results in S461_results

Appendix.

Corrections of original Ssym dataset based on ThermoMutDB data.

PDBMutationOriginal Corrected Medline References from ThermoMutDB
1BNIIA96V-3.1-0.92669964 (-0.90); 1569557 (0.95); 9551101 (-0.80)
1BNISA91A-2.4-1.814516751
1L63SA44T0.00.018289284
1L63SA38N0.0-0.011911773
1L63LA91A-3.9-2.610545167
1L63AA130S1.0-1.08218201
1LZ1VA2G-1.3-2.2911087397
1LZ1VA2L0.3-0.0511087397
1LZ1IA56T-4.3-3.69010773; 10556244
1LZ1VA74I-1.90.4511087397
1LZ1VA74L-0.40.1911087397
1LZ1VA74M-0.40.6511927576; 11087397
1LZ1VA110G-2.20.4811927576; 11087397
1LZ1VA110I-0.80.8611927576; 11087397
1LZ1VA110F-1.9-0.0511927576; 11087397
2LZMIA3C0.01.23405287
2LZMRA119E0.0-0.041942034
4LYZGA49A-0.7-1.911112507; 8771183
4LYZGA71A-2.1-0.3811112507
4LYZGA102A-1.20.0211112507
4LYZGA117A-0.8-1.4611112507
1RN1QC25K1.40.932663837
2LZMRA96K0.0-0.001Avoiding zero for the binary classification
1L63SA44E0.00.001Avoiding zero for the binary classification
2LZMKA60P0.0-0.001Avoiding zero for the binary classification

Funding

This work was supported by Spanish grants PID2019-109041GB-C21/AEI/10.13039/501100011033