/ERC-Pipeline

The pipeline in this repository allows for the generation of evolutionary rate correlations (ERCs). ERCs use a phylogeny-based approach to look for evolutionary signatures of potential protein-protein interaction. Since ERCs can be fairly straightforward to produce, they can be used as an easy method of discovering candidate protein-protein interactions.

Primary LanguagePythonMIT LicenseMIT

ERC-Pipeline

The pipeline in this repository allows for the generation of evolutionary rate correlations (ERCs). ERCs use a phylogeny-based approach to look for evolutionary signatures of potential protein-protein interaction. Since ERCs can be fairly straightforward to produce, they can be used as an easy method of discovering candidate protein-protein interactions.

Installation

Environment Requirements

The pipeline should be compatible with Python version 3.7 or higher. R scripts were written for R 3.6.1 ("Action of the Toes") but are optional. Note that the pipeline was made to take advantage of multiple cores. We recommend at least 5 available cores and 16GB of RAM for large data sets.

Python Package Dependencies

The external Python dependencies can be installed using pip (most packages can also be found on conda/bioconda): pip install -r requirements.txt

Note that uvloop and pygraphviz are optional dependencies so if they fail to install on your system, you can still run the pipeline. Just note that pygraphviz is required for network diagrams.

R Package Requirements

All R packages are optional. They can be installed using the install.packages() function in base R.

  • If using time-corrected partial correlation calculations, ppcor
  • If using any of the generate_figs.R script code, ggplot2, readxl, writexl, ggvenn, dplyr, gridExtra, ape, picante, EnvStats, cowplot, patchwork, and Cairo

External Tooling Requirements

All of the following are required to be installed.

Preparation of Data

  1. Time-scaled phylogeny (Newick format)
    • If not specified, the pipeline will use the Mammalian time-scaled phylogeny generated using TimeTree (http://timetree.org/) for all the Mammalian taxa available in OrthoDB v10 (https://www.orthodb.org/). Implementation Note: The species names in this case are formatted in all caps with spaces replaced with underscores (i.e. Homo sapiens is formatted as HOMO_SAPIENS). You can find the full file in data/finished_mam_timetree.nwk.
  2. Protein sequences (FASTA format)
    • All proteins of interest should be provided as FASTA formatted sequences (these are assumed to be unaligned) in a directory. Note that the title of each sequence must be the taxon name for each sequence corresponding to the time-scaled phylogeny (i.e. if you have a human protein sequence, the FASTA sequence title must be exactly HOMO_SAPIENS if using the default tree).
    • This pipeline does not attempt to disambiguate paralogous sequences. So each protein sequence should be singe-copy, any multi-copy sequences in the FASTA files will be totally ignored.

Calculating ERCs on the command line

Assuming the sequences are in a directory called SEQUENCE_DIRECTORY and the time-scaled phylogeny is a file called PHYLOGENY.nwk and you wish to run calculations in a directory called OUTPUT_DIRECTORY.

python3 cli.py --timetree PHYLOGENY.nwk --sequences SEQUENCE_DIRECTORY --wd OUTPUT_DIRECTORY

  • Note that if you are using the default mammalian phylogeny, you can omit the --timetree PHYLOGENY.nwk argument. Additionally, you would be able to use the 20MY or 30MY ERC calculations described in Varela et al 2021 using either --erc-type 20my or --erc-type 30my respectively.

  • If you wish to use the time-corrected correlation method, you can add the --erc-type bt argument (requires R).

  • If you want to only run ERCs on specific pairs of proteins, you can pass the --align-pair /path/to/align1.fasta /path/to/align2.fasta option as many times as needed.

  • To specify the number of CPU cores to run with, use the -n argument.

  • If your sequence data are already prepared, you can use --skip-align and --skip-trim to skip alignment and trimming of alignments, respectively.

  • If your sequences are titled based on non-readable identifiers instead of protein names/symbols (for example: OrthoDB ids), you can create a tab-separated file with 2 columns: "alignment_identifier" and "readable_name". This will be used to replace the original alignment names with the readable names if you pass the path to this file to the argument: --id2name /path/to/file.tsv. The annotations used in the publication can be used like so: --id2name data/id2name.tsv

  • If you have previously run the ERC-Pipeline in another directory, you can include the proteins from the previous run in the current run with the --previous-run /path/to/previous/wd flag. You can pass it multiple times to include data from multiple runs. If you wish to include the (30MY) ERCs from the publication, you can do it like this: --previous-run data/30my_erc_results Note: If you wish to do this with the published previous 30MY ERC data, make sure to clone this repository with git-lfs installed (https://git-lfs.github.com/).

  • To automatically archive intermediate FASTA files, pass the --archive argument.

  • If you want to run ERCs along pieces of the alignments, add the --segment argument. This can be modified using the --slide argument, which rather than splitting the alignment into kmers, will normalize the data using a sliding window based on kmers. You can also use --kmer K to change the size of the kmers (replace "K" with the number).

  • If you want to just prepare all your data and not calculate ERCs, you can pass the --prepare flag.

Calculating ERCs in Python

The pipeline is easily accessible in Python. For practical examples of usage, check out the source of cli.py.

Setting up the environment:

from pipeline import ErcWorkspace

# The working dir and tree args are the only required arguments
workspace = ErcWorkspace("directory/to/run/in", "path/to/tree/topology.nwk")  

Note that this immediately changes the Python's runtime current working directory to the working directory passed as the first argument.

Definitions of the additional optional arguments;

  • segmented: If True, run ERC on pieces of the alignments
  • segment_size: The size of the pieces
  • internal_requirement: The minimum required size of internal branches to perform correlations with (only applies to normal ERCs and by default only all terminal branches are included in ERC calculations)
  • include_terminal: If you pass an internal_requirement argument, should terminal branches also be included in ERC calculations?
  • recalculate: If True, even if ERCs have been previously calculated for tree pairs in the environment, recalculate the correlation results.
  • sliding_window: If True, ERCs run on pieces of alignments are normalized with a sliding window.
  • skip_align: If True, skip alignment of the input sequences.
  • skip_trim: If True, skip trimming of alignments.
  • archive: If True, save intermediate files into archives.
  • taxon_set: Pass a list of taxa to perform calculations with, by default all taxa are considered.
  • time_corrected: If True, instead of standard Spearman's test, run Spearman's partial correlations controlling for time.
  • id2name: A dictionary that if passed, will be used to convert protein ids to readable names in outputs.
  • cores: Number of cores to use for calculations.
  • prepare: If True, calculate intermediate files and then stop.
  • skip_qc: If True, skip the QC step (still currently in development).

Include previously run ERCs (assuming you ran the pipeline in directory: old/erc_dir):

from pipeline import register_previous_run

register_previous_run("old/erc_dir/")

Add sequences to the workspace:

# Add alignments for all-by-all calculations
workspace.add_alignment("path/to/alignment.fasta")

# Add specific pairs of alignments to run calculations on
workspace.add_alignment("path/to/alignment1.fasta", "path/to/alignment2.fasta")

# Add alignments that should be concatenated together when run for ERCs
workspace.add_concatenated_alignment(["path/to/alignment1.fasta", "path/to/alignment2.fasta"], "concatenated_name")

# You can specify specific pairs of concatenated pairs as well
workspace.add_concatenated_alignment(["path/to/alignment1.fasta", "path/to/alignment2.fasta"], "concatenated_name",
                                     ["path/to/alignment3.fasta", "path/to/alignment4.fasta"], "concatenated_name2")

Run the ERCs:

# If running outside of a coroutine:
import asyncio
asyncio.get_event_loop().run_until_complete(workspace.run())

# If inside a coroutine:
await workspace.run()

Following calculations, you can load the data into a networkx Graph object using:

net = workspace.generate_network()

This network has each protein ran as nodes, with edges connecting the nodes based on ERCs. Each edge has the rho and p properties representing the ERC results.

Save data:

# Save the data to an excel spreadsheet
workspace.export_results("filename.xlsx")

Access Mammalian TimeTree:

from pipeline import _10mya_cutoff, _20mya_cutoff, _30mya_cutoff, prune_tree
from utilities import _self_path, safe_phylo_read
import os.path as osp

tree = safe_phylo_read("path/to/file.nwk")  # Read general newick trees
timetree = safe_phylo_read(osp.join(_self_path(), 'data', 'finished_mam_timetree.nwk'))  # Read the mammalian TimeTree

# Prune mammalian tree to the 10my, 20my, and 30my taxa sets respectively
_10my = prune_tree(timetree, _10mya_cutoff)
_20my = prune_tree(timetree, _20mya_cutoff)
_30my = prune_tree(timetree, _30mya_cutoff)

Misc. Features in Python

Improved async performance:

# If the uvloop package is installed (unix only), you can run the following to improve performance before running ERCs:
from utilities import try_hook_uvloop

try_hook_uvloop()

Extract rate data from trees:

from pipeline import get_rates

# Calculate rates for the trees passed. First argument is the PhyloTree object for the species topology.
# Second arg determines whether to prune the tree topologies so all trees have the same topologies
# Third arg is a list of taxa to limit calculations to
# The remaining args are the trees for proteins of interest
taxa, rates = get_rates(tree_topology_object, True, None, tree1, tree2)
# Taxa is a list of taxa names
# Rates is a list of lists. The first list are all the time units, the following lists are the rates. 
# With the indices of each element corresponding to the taxon in the matching index of the taxa list.

Convert rate data to correlations:

from pipeline import get_rates, rates_to_correlation

rate_info = get_rates(tree_topology_object, True, None, tree1, tree2)
rho, p = rates_to_correlation(rate_info)

Run an enrichment analysis on protein sets:

from pipeline import enrich_network

# Where protein_symbols is a list of protein identifiers (ids or protein symbols),
# background_protein_symbols is a list of protein identifiers from the background set (ids or protein symbols),
# id2name is a dictionary (can be empty) mapping ids to symbols, and the last argument is the base file name for 
# enrichment reports.
enrich_network(protein_symbols, background_protein_symbols, id2name, "enrichment_results")

The following features assume you generated a network object from your ERCs

Generate Reciprocal-Rank 20 Networks (RRNs)

from utilities import rrn_nets

# Each step represents an intermediate RRN result
step1, step2, step3 = rrn_nets(net, "central_protein")  # You can pass multiple central proteins

Export a network diagram (requires pygraphviz):

from utilities import graphviz_network_plot

# You can pass the following optional arguments:
# * highlight: A dictionary mapping proteins -> color code for nodes
# * circo: If True, use the circo graphviz layout instead of the default
# * highlight_by_font: If True, highlighted nodes will have the text highlighted instead of the background
graphviz_network_plot(my_network, "my_network.png")

Citations

Please cite this specific implementation of ERCs with the following reference:

Varela, Austin A., Sammy Cheng, and John H. Werren. 2021. "Novel ACE2 protein interactions relevant to COVID-19 predicted by evolutionary rate correlations." PeerJ 2021 9:e12159 https://doi.org/10.7717/peerj.12159

Please cite the general ERC method with the following reference:

Yan, Zhichao, Gongyin Ye, and John H. Werren. "Evolutionary rate correlation between mitochondrial-encoded and mitochondria-associated nuclear-encoded proteins in insects." Molecular biology and evolution 36.5 (2019): 1022-1036.