Parallel-DCA is a Python script that performs Direct Coupling Analysis (DCA) on multiple sequence alignments (MSAs) using parallel processing. It uses Python's multiprocessing module to speed up computations, which makes it suitable for large MSAs. It is a direct implementation of the method described in the original paper [1].
The preprocessing step handles gaps and filters columns based on a gap cutoff. It involves the following quantities:
- $\text{MSA}$ is the multiple sequence alignment.
- $\text{numSequences}$ is the number of sequences in the MSA.
- $\text{gapCutoff}$ is the threshold for filtering columns based on the proportion of gaps.
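As an illustration of the column filter, here is a minimal sketch. It assumes that a column is dropped when its proportion of gaps exceeds `gap_cutoff` and that gaps are written as `-`; the function name `filter_gappy_columns` and the default cutoff are hypothetical and are not taken from the script.

```python
import numpy as np

def filter_gappy_columns(msa, gap_cutoff=0.5, gap_char="-"):
    """Drop alignment columns whose gap fraction exceeds gap_cutoff.

    msa: list of equal-length aligned sequences (strings).
    Returns the filtered sequences and the indices of the kept columns.
    """
    arr = np.array([list(seq) for seq in msa])     # shape (numSequences, numColumns)
    gap_fraction = (arr == gap_char).mean(axis=0)  # proportion of gaps per column
    keep = gap_fraction <= gap_cutoff              # assumed rule: keep if not too gappy
    filtered = ["".join(row) for row in arr[:, keep]]
    return filtered, np.flatnonzero(keep)
```

Returning the kept column indices makes it possible to map DCA results back to positions in the original alignment.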
Once the MSA has been preprocessed, the script performs the following steps:

- Estimates the frequency counts $f_i(A)$ and $f_{ij}(A, B)$ from the MSA, using the pseudocount $\lambda = M_{eff}$ in Eqs. 1 and 2. Here $q$ is the number of residue states, $M$ is the number of sequences, $1/m^a$ is the weight of sequence $a$, and $M_{eff} = \sum_a 1/m^a$ is the effective number of sequences.

  $$f_i(A) = \frac{1}{M_{eff}+\lambda} \left( \frac{\lambda}{q} + \sum_{a=1}^{M} \frac{1}{m^a} \delta(A, A_i^a) \right) \tag{1}$$

  $$f_{ij}(A, B) = \frac{1}{M_{eff}+\lambda} \left( \frac{\lambda}{q^2} + \sum_{a=1}^{M} \frac{1}{m^a} \delta(A, A_i^a) \delta(B, A_j^a) \right) \tag{2}$$

- Determines the empirical estimate of the connected-correlation matrix $C_{ij}(A, B)$ using Eq. 3.

  $$C_{ij}(A, B) = f_{ij}(A, B) - f_i(A) f_j(B) \tag{3}$$

- Determines the couplings $e_{ij}(A, B)$ according to Eq. 4.

  $$e_{ij}(A, B) = -(C^{-1})_{ij}(A, B) \tag{4}$$

- For each column pair $i < j$, estimates the direct information $DI_{ij}$ by solving Eqs. 5 and 6 for $P_{ij}^{(dir)}(A, B)$ and plugging the result into Eq. 7.

  $$P_{ij}^{(dir)}(A, B) = \frac{\exp(e_{ij}(A, B) + h_i(A) + h_j(B))}{Z_{ij}} \tag{5}$$

  $$f_{i}(A) = \sum_{B} P_{ij}^{(dir)}(A, B), \quad f_{j}(B) = \sum_{A} P_{ij}^{(dir)}(A, B) \tag{6}$$

  $$DI_{ij} = \sum_{A, B} P_{ij}^{(dir)}(A, B) \ln \left( \frac{P_{ij}^{(dir)}(A, B)}{f_i(A) f_j(B)} \right) \tag{7}$$
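The steps above (Eqs. 1 to 7) can be sketched in NumPy as follows. This is a serial illustration under stated assumptions, not the code of `parallel_dca.py`: the function names are hypothetical, the MSA is assumed to be already encoded as integers in `0..q-1`, the diagonal blocks use $f_{ii}(A, B) = f_i(A)\,\delta(A, B)$, the last state is dropped before inversion so that the correlation matrix is invertible, and the fields $h_i$, $h_j$ are found by a simple fixed-point iteration.

```python
import numpy as np

def frequencies(msa, weights, q):
    """Eqs. 1 and 2 with pseudocount lambda = M_eff.

    msa: integer array (M, L) with states 0..q-1; weights: 1/m^a per sequence.
    """
    m_eff = weights.sum()
    lam = m_eff
    onehot = np.eye(q)[msa]                                    # (M, L, q)
    fi = np.einsum("a,aiA->iA", weights, onehot)
    fij = np.einsum("a,aiA,ajB->ijAB", weights, onehot, onehot)
    fi = (lam / q + fi) / (m_eff + lam)                        # Eq. 1
    fij = (lam / q**2 + fij) / (m_eff + lam)                   # Eq. 2
    # Assumption: for i == j use f_ii(A, B) = f_i(A) * delta(A, B).
    for i in range(msa.shape[1]):
        fij[i, i] = np.diag(fi[i])
    return fi, fij

def couplings(fi, fij):
    """Eqs. 3 and 4, with the last state dropped (assumed gauge choice)."""
    L, q = fi.shape
    C = fij - np.einsum("iA,jB->ijAB", fi, fi)                 # Eq. 3
    C = C[:, :, : q - 1, : q - 1]
    C_flat = C.transpose(0, 2, 1, 3).reshape(L * (q - 1), L * (q - 1))
    e_flat = -np.linalg.inv(C_flat)                            # Eq. 4
    return e_flat.reshape(L, q - 1, L, q - 1).transpose(0, 2, 1, 3)

def direct_information(e_ij, f_i, f_j, tol=1e-10, max_iter=1000):
    """Eqs. 5 to 7 for one column pair (i, j)."""
    q = f_i.size
    W = np.ones((q, q))                                        # exp(0) for the dropped state
    W[: q - 1, : q - 1] = np.exp(e_ij)
    mu_i = np.full(q, 1.0 / q)                                 # mu = exp(h), normalized
    mu_j = np.full(q, 1.0 / q)
    for _ in range(max_iter):
        new_i = f_i / (W @ mu_j)
        new_i /= new_i.sum()
        new_j = f_j / (W.T @ mu_i)
        new_j /= new_j.sum()
        diff = max(np.abs(new_i - mu_i).max(), np.abs(new_j - mu_j).max())
        mu_i, mu_j = new_i, new_j
        if diff < tol:                                         # marginals of Eq. 6 matched
            break
    P = W * np.outer(mu_i, mu_j)
    P /= P.sum()                                               # Eq. 5 (division by Z_ij)
    return float(np.sum(P * np.log(P / np.outer(f_i, f_j))))   # Eq. 7
```

Because each call to `direct_information` depends only on one pair $(i, j)$, the pairs are independent of one another, which is what makes this last step a natural candidate for parallel processing.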
- Preprocess MSA: Handles gaps and filters columns based on a gap cutoff.
- Map Residues to Integers: Converts amino acid residues to integer representations.
- Sequence Weights: Adjusts for sequence redundancy.
- Frequencies: Calculates single-site and pairwise frequencies.
- Connected Correlation Matrix: Computes the connected-correlation matrix from the frequencies.
- Infer Couplings: Uses the regularized inverse of the correlation matrix.
- Direct Information: Computes the direct information between residue pairs.
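The residue mapping and sequence weighting can be illustrated with the sketch below. The 21-letter alphabet, the treatment of unknown symbols as gaps, the 80% identity threshold, and the function names are all assumptions made for illustration; the script may use different choices.

```python
import numpy as np

# Assumed alphabet: 20 standard amino acids plus the gap symbol, so q = 21.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"
RESIDUE_TO_INT = {res: idx for idx, res in enumerate(ALPHABET)}

def encode_msa(msa):
    """Map aligned sequences (strings) to an integer array of shape (M, L).

    Unknown symbols (for example X, B, Z) are treated as gaps in this sketch.
    """
    gap = RESIDUE_TO_INT["-"]
    return np.array(
        [[RESIDUE_TO_INT.get(res, gap) for res in seq.upper()] for seq in msa]
    )

def sequence_weights(encoded, identity_threshold=0.8):
    """Return 1/m^a for every sequence a.

    m^a counts the sequences (including a itself) whose fractional identity
    with sequence a is at least identity_threshold (an assumed value).
    """
    # Pairwise identity matrix; needs O(M^2 * L) memory, fine for small MSAs.
    identity = (encoded[:, None, :] == encoded[None, :, :]).mean(axis=2)
    m = (identity >= identity_threshold).sum(axis=1)
    return 1.0 / m
```

The sum of the returned weights gives the effective number of sequences $M_{eff}$. For large alignments the pairwise comparison would need to be done in chunks rather than in a single array.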
To install the dependencies, use the following command:
pip install numpy biopython matplotlib
- Run the analysis: Place your MSA file in the repository directory and, if necessary, modify the input file name in the script. Then run:

  python parallel_dca.py

- Checkpoints implementation: For unreliable environments or long compute times, a second script, parallel_dca_checkpoints.py, is provided. It writes regular checkpoints and saves the NumPy arrays at every step. It can also handle problematic or computationally intensive frequency calculations, which are computed first. Run it with:

  python parallel_dca_checkpoints.py
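The checkpointing idea can be sketched as a small load-or-compute helper. This is only an illustration of the pattern, not the code of `parallel_dca_checkpoints.py`; the helper name `load_or_compute` and the file name in the usage note are hypothetical.

```python
import os
import numpy as np

def load_or_compute(path, compute):
    """Return the array stored at path, or compute it and save it there.

    path should end in ".npy", since np.save appends that extension otherwise.
    compute is a zero-argument callable that returns an array.
    """
    if os.path.exists(path):
        return np.load(path)          # resume from an earlier run
    result = np.asarray(compute())
    np.save(path, result)             # checkpoint this step
    return result
```

Each pipeline step can then be wrapped, for example `fi = load_or_compute("fi.npy", lambda: compute_fi(msa))` with a hypothetical `compute_fi`, so that a restarted run skips the steps whose arrays are already on disk.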
Footnotes
1. Morcos, F., et al. (2011). Direct-coupling analysis of residue coevolution captures native contacts across many protein families. Proceedings of the National Academy of Sciences, 108(49), E1293-E1301.