prototype scripts for detecting convergent adaptation of continuous traits at the amino acid level Continuous PCOC scripts and the ctenophore enzyme alignments in this repo are featured in Combing Transcriptomes for Secrets of Deep-Sea Survival: Environmental Diversity Drives Patterns of Protein Evolution (Winnikoff et al., ICB 2018), Open Access. They were also presented at the Convergent Adaptation to Polar Environments Workshop in Tampa, FL on Jan. 3, 2019.
As of 1/13/19, the scripts here implement two distinct methods: PCOC and CPGLS.
Original method: Rey et al. 2018
pcoc_cont_scenarios.py
is a wrapper for the pcoc_det.py
script published with
the above paper. It attempts to identify amino acid sites that are convergently
adapted in association with a phenotypic trait. This script takes a continuous
trait; pcoc_det.py
requires a binary one.
pcoc_cont_scenarios
executes as follows (illustrated in this flowchart):
-
Perform an ancestral trait reconstruction using the supplied species tree and continuous trait table. At time of writing, this is a simple Brownian motion (BM) reconstruction.
-
Generate a series of "convergent scenarios" for
pcoc_det
by proposing "cutoffs" for discretizing the continuous trait into a binary one. ('A' above) Cutoffs are established thus:-
List all node trait values (terminal and internal) in order.
-
Average trait values within a certain range of each other (termed trait precision, set with
-p
). -
Place cutoffs at the midpoints between each pair of consecutive trait values.
-
-
Run
pcoc_det
for each unique scenario. ('B' above) It is possible that two or more consecutive cutoffs result in the same binary trait mapping. -
Collate all
pcoc_det
posterior probability (PP) values into a master dataframe for downstream analysis. ('C' above) -
Visualize these PP data in a heatmap or a Manhattan plot, optionally plotted above a multiple sequence alignment with high-PP columns highlighted as specified via the
-pp
option. -
If the user requests simulation-based bootstrapping (
--sim
option) to check PCOC's work, recover the amino acid profiles assigned to the site and simulate neutral and convergent evolution along the tree 1000 times each using these profiles. -
Calculate upper quantiles (alpha,
--sim_alpha_vals
) of PPs for the convergent scenario and lower quantiles (beta,--sim_beta_vals
) of PPs for the neutral scenario. ('D' above) -
Superimpose these confidence intervals on the Manhattan plot. Running steps 1-7 above will give output like this:
Secondary-structural motifs were manually painted into this figure based on results from I-TASSER. The column range is cropped with option --xlim
, e.g. --xlim 43 93
for panel 'A'.
-
A series of command-line arguments (see
pcoc_cont_scenarios.py -h
) allow one to run only a subset of the steps above. e.g. Steps (4) and (5) above can be run on pre-generated PCOC results, and (7) and (8) can be run on existing simulation results for enormous time savings, but trait values and precision must be specified the same as when the analysis was first run, to ensure the same trait cutoff values. -
Convergence scenarios applied to the actual data indexed by trait cutoff value and stored in the
Scenarios
subdirectory. Simulation confidence curves are pickled in theSimulations
subdir and indexed by convergent scenario topology, to avoid replicating intensive simulations. -
pcoc_cont_scenarios
runspcoc_det
through a system call. This is becausepcoc_det.py
as provided has nomain()
, and it means the user should check thesubprocess.call()
inpcoc_cont_scenarios.py
to make sure it's compatible with their shell environment.pcoc_det
also requires moduleBpp
(Bio++), which apparently changes a lot.
A straightforward solution to these issues is to use the Docker container
distributed with PCOC:
docker pull carinerey/pcoc
and then install additional dependencies in it:
docker run -it carinerey/pcoc /bin/bash
pip install pandas biopython matplotlib
and then outside the container:
docker commit <container ID> pcoc-matplot
To run pcoc_cont_scenarios
in the Docker container, do something like this:
cd continuous-converge/
# or other dir containing both this repo and your data
# launch Docker container with $pwd mounted as /data/
docker run --mount type=bind,source="$(pwd)",target=/data/ -it pcoc-matplot /bin/bash
# run pcoc_cont_scenarios.py on your data. For the example dataset inside the repo:
xvfb-run python2 /data/pcoc_cont_scenarios.py -t data/example/FPs_data/tree/RAxML_bipartitions.FPs_62genes_MAFFT-gblocks-raxML-ultram.tree -o data/example/results -c data/example/FPs_data/trait/FPemWL_62genes_noref.tab -aa data/example/FPs_data/seq/FPs_62genes_MAFFT.fasta -bw 5.0 -f 2 -d -k _Avictoria -m master-table_2nm.tab -hm heatmap_2nm.pdf -mp manhattan_2nm.pdf
# ^xvfb-run is needed for graphics functionality in Docker
This is an original method based on phylogenetic regression sensu Grafen 1989.
A separate least-squares regression is performed for each column of an amino acid alignment. The predictor is AA state (a 20-D binary variable) and response is a 1-D continuous trait.
At time of writing, the phylogenetic variance-covariance matrix is constructed under a simple BM model transformed with Pagel's lambda. It is then adjusted to reflect the non-uniform substitution rates of amino acids. Read on for how this is done.
-
Construct phylogenetic variance-covariance "Q" matrix for the entire dataset.
-
Load an empirical or database-derived transition rate matrix and normalize it so that probabilities in each column sum to 1, rendering it asymmetrical.
Right now only BLOSUM62 is implemented.
-
Construct a biochemical variance-covariance "B" matrix for each column in the alignment:
-
Let B be a matrix with the same dimensions as Q, i.e. a row and a column for each taxon.
-
Let taxon i encode amino acid A at the column in question, while taxon j encodes amino acid B.
-
Let Bi, j = P(A->B) + P(B->A), i.e. the probability of transition from residue A to residue B or vice versa per the normalized transition rate matrix in (2) above.
-
-
Calculate the element-wise non-exclusive sum of Q and B. So for corresponding elements p and q: p + q - pq.
-
Run the regression followed by a variety of hypothesis tests and return the p-value for each, using any applicable multi-test correction.
-
CPGLS performs terribly in any simulation where either the "ancestral" or "convergent" taxa contain several amino acid states. 2 possible solutions to this:
-
ad hoc "pooling" of AA states that correspond to sufficiently similar trait values. This imposes no explicit hypothesis about which residues group together functionally.
-
Use a CAT-like profile mixture model Si Quang et al. 2008 to correlate continuous trait change with profile change along branches.
-
-
CPGLS is roughly 10x faster than PCOC as presently implemented.
-
Logistic regression is probably more appropriate for this purpose than phylogenetic regression, since amino acid state is fundamentally a categorical variable.
(no need to run from a Docker container or anything)
python2 cpgls.py -r example/FPs_data/trait/FPemWL_62genes_noref.tab -p example/FPs_data/seq/FPs_62genes_MAFFT.fasta -k _Avictoria -t example/FPs_data/tree/RAxML_bipartitions.FPs_62genes_MAFFT-gblocks-raxML-ultram.tree -m > FPs-pgls.tab