GA Error Sources Framework

Quantification of Error Sources Accounting for Misidentification of Protein Partners in Coevolutionary Approaches

Physical interactions in proteins are maintained throughout evolution via compensatory mutations. As extensively investigated in recent years, the coevolutionary signal is considered highly relevant for the ab initio resolution of specific protein partners based on multiple sequence alignments (MSAs). Despite recent advances in the field, primarily rooted in mutual information (I) correlation analysis, the problem of predicting protein partners remains unsolved for sequence ensembles in general. This is primarily because there is no effective non-degenerate heuristic to search for the correct set of protein partners across the immense space of possibilities inherent in this type of problem. In recent publications, we showed that genetic algorithm simulations starting from minimum mutual information fail to pair native sequences correctly in a system with two MSAs, even after mutual information maximization. These errors arise from mismatches among (i) similar and (ii) non-similar sequences. However, a quantitative description of these error sources is still lacking. Thus, to elucidate the degeneracy of I in protein interaction space, we contribute here a statistical framework to describe the probability distribution of interaction models of proteins A and B for a large number of sequences M that feature a unique “native” arrangement (‘) at a maximum mutual information content. Our specific aim is a quantitative description of the mutual information along the simulated native→scrambled→near-native path.
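As an illustration of the quantity discussed above, the following minimal Python sketch computes the mutual information I between two toy MSAs for a given pairing of their sequences. It is only a sketch under stated assumptions, not the method used in this work: the function names (`column_mi`, `total_mi`), the toy alignments, and the choice to sum I over all inter-protein column pairs without any finite-sample correction are hypothetical and chosen for brevity.

```python
from collections import Counter
from math import log2


def column_mi(col_a, col_b):
    """Mutual information (in bits) between two alignment columns.

    Frequencies are plain empirical counts; no pseudocounts or
    finite-sample corrections are applied (an assumption of this sketch).
    """
    n = len(col_a)
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    return sum(
        (c / n) * log2((c / n) / ((count_a[x] / n) * (count_b[y] / n)))
        for (x, y), c in count_ab.items()
    )


def total_mi(msa_a, msa_b, pairing):
    """Sum of column-pair mutual information for one interaction model.

    pairing[i] = j means sequence i of protein A is paired with
    sequence j of protein B.
    """
    paired_b = [msa_b[j] for j in pairing]
    cols_a = list(zip(*msa_a))      # columns of MSA A
    cols_b = list(zip(*paired_b))   # columns of the re-ordered MSA B
    return sum(column_mi(ca, cb) for ca in cols_a for cb in cols_b)


# Hypothetical toy alignments: M = 4 sequences, 2 columns each.
msa_a = ["AK", "AK", "GE", "GE"]
msa_b = ["LD", "LD", "VR", "VR"]

native = [0, 1, 2, 3]       # the "native" arrangement
similar_swap = [1, 0, 3, 2] # mismatches among identical (similar) sequences
scrambled = [0, 2, 1, 3]    # mismatches among non-similar sequences

mi_native = total_mi(msa_a, msa_b, native)
mi_similar_swap = total_mi(msa_a, msa_b, similar_swap)
mi_scrambled = total_mi(msa_a, msa_b, scrambled)

print(mi_native, mi_similar_swap, mi_scrambled)
```

In this toy case, swapping partners among identical sequences leaves I unchanged, whereas swapping partners among non-similar sequences lowers it. This is one simple way to picture why maximizing I alone need not single out the native pairing.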

Authors