tree.comparison

This repository contains data and analytical code for the gene tree to organismal tree comparisons described by Laslo, Just and Angelini (2022 JEZ-B). Superfamily/order-level phylogenies constructed for each protein are compared to a consensus organismal tree (Misof et al. 2014). Tree distance is measured as the normalized Clustering Information Distance (Smith 2020), as implemented in the TreeDist R package.
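
As a minimal sketch of the distance measure named above, the snippet below compares two small trees with TreeDist. It assumes the ape and TreeDist packages are installed. The six-taxon trees and the variable names are hypothetical placeholders, not data or code from this repository.

```r
# Assumes the ape and TreeDist packages are installed.
library(ape)
library(TreeDist)

# Hypothetical six-taxon trees; A-F are placeholders, not taxa from the study.
organismal_tree <- read.tree(text = "((A,B),((C,D),(E,F)));")
gene_tree       <- read.tree(text = "((A,C),((B,D),(E,F)));")

# normalize = TRUE rescales the distance to run from 0 (identical trees) to 1.
d_same <- ClusteringInfoDistance(organismal_tree, organismal_tree, normalize = TRUE)
d_diff <- ClusteringInfoDistance(organismal_tree, gene_tree, normalize = TRUE)

print(c(identical = d_same, different = d_diff))
```

Both trees must carry the same set of tip labels for the comparison to be meaningful, which is why the repository condenses protein trees to the analysis-level taxa before measuring distance.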

Contents

In conceptual order:

  • duplication.pattern.R is an R script that examines phylogenetic patterns of gene duplication and loss.
  • get.orthologs.R is an R script file containing 3 functions.
    • get.orthologs is a function that performs BLASTp searches against a local BLAST database. The search query is supplied as an NCBI protein accession number. Searches are restricted to a list of taxa, identified by NCBI taxids, specified in a CSV file. BLAST output can be customized, but defaults to a single line reporting the best match.
    • create.alignment.from.acc.list is a function that takes as input a list of GenBank protein accession numbers, as supplied by get.orthologs, retrieves the protein sequences to create a multi-FASTA file, then aligns the sequences using Clustal-Omega with default parameters. It also edits sequence names to conform to RAxML expectations.
    • convert.fa.to.phy is a short function to convert FASTA files to PHYLIP (.phy) format. Mostly just a wrapper for phylotools::dat2phylip.
  • tree.comparisons.R is an R script file containing 7 functions and accompanying code to conduct the tree comparisons from the paper.
    • quantify.polytomy - Quantify polytomy in an unrooted tree as a value from 0 (complete resolution of the topology) to 1 (a star tree).
    • remove.redundant.parentheses - Remove unnecessary parentheses from e.g. a Newick tree character string.
    • vectorize.newick.tips is a function that takes a string in Newick tree format and can return the tip labels as a character vector, with or without the commas and parentheses that indicate topology. The function can also use a supplied taxonomy to substitute (in our case) superfamily/order-level taxon names for species and condense redundant branches.
    • clean.up.newick.vector is a helper function that facilitates condensing redundant branches.
    • compare.gene.trees.to.organismal.tree is the main function. It takes as input a path to a folder containing Newick-format tree files for protein-based phylogenies, a file name providing an organismal tree, and a table of taxonomy, which is passed to the sub-functions. Topologies condensed to a lower taxonomic level are saved to an output folder. Several tree metrics are calculated, and comparisons between protein and organismal trees are made using Smith's normalized clustering information distance. In cases of paraphyly, permutation is used to randomly prune the tree, because tree distance calculations require each taxon to appear only once. Permutation is also used for bootstrapping, which provides a 95% confidence interval.
    • parsimony.informative.sites calculates the proportion (from 0 to 1) of sites (characters) in a sequence alignment that are parsimony-informative. Defaults to ignoring gaps in this calculation.
    • mean.pairwise.identity takes an alignment in the form provided by phylotools::read.phylip and makes a pairwise comparison of character identity for all sequences in an alignment. Gaps are not counted as matches. (Gaps are not counted in the numerator or denominator.) The mean identity is returned. Values range from 0 (total disagreement) to 1 (universal sequence identity).
  • tree.comparisons.R continues with non-modular code that executes these functions on the input alignments and phylogenies, using different bootstrap cutoffs for collapsing nodes into polytomies. It also includes the code used to create the manuscript figures with ggplot2.
  • organismal.relationships.tre is a Newick file containing consensus relationships of insect superfamilies and orders (Misof et al. 2014; Johnson et al. 2018; Kawahara et al. 2019; McKenna et al. 2019; Peters et al. 2017; Wiegmann et al. 2011).

  • taxids.csv is a table of taxonomic classifications for species included in the search for gene orthologs. The column display.taxon is used to group species for the analysis.
  • tree.comparisons.bs??.csv are tables of metrics from ortholog alignments, protein trees, and distance comparisons to the organismal tree. The suffix (bs50, etc.) refers to the bootstrap support cutoff used to collapse nodes into polytomies. misc.comparisons.bs50.csv covers a small number of genes that were not included in the main analysis.
  • analysis.phy is a folder of PHYLIP-format alignments for protein orthologs included in the main analysis. misc.phy includes alignments for the miscellaneous gene set.
  • analysis.trees is a folder of Newick-format protein trees included in the main analysis. misc.trees includes trees for the miscellaneous gene set.
  • condensed.trees is a folder with Newick-format tree files that record the protein-based topologies condensed to the analysis-level taxa. In cases of paraphyly, all occurrences of taxa are included here. However, repeated taxa are randomly pruned in 10,000 iterations to estimate tree distances. condensed.misc.trees includes the output trees for the miscellaneous gene set.
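
The two alignment metrics described for parsimony.informative.sites and mean.pairwise.identity can be illustrated in base R. This is a sketch under stated assumptions, not the repository's implementation: the toy alignment, the function names informative_fraction and mean_identity, and the character-matrix input format are all hypothetical. It uses the usual definition of a parsimony-informative site (at least two states, each present in at least two sequences).

```r
# Toy alignment (hypothetical): one string per sequence, equal length, "-" = gap.
aln <- c(sp1 = "MKT-LV", sp2 = "MKTALV", sp3 = "MRTALI", sp4 = "MRSALI")

# Split into a character matrix: rows are sequences, columns are sites.
aln_matrix <- do.call(rbind, strsplit(aln, ""))

# A site is parsimony-informative when at least two states each occur in
# at least two sequences. Gaps are dropped before counting.
informative_fraction <- function(m, gap = "-") {
  is_informative <- apply(m, 2, function(site) {
    counts <- table(site[site != gap])
    sum(counts >= 2) >= 2
  })
  mean(is_informative)
}

# Mean identity over all sequence pairs. Positions with a gap in either
# sequence are left out of both the numerator and the denominator.
mean_identity <- function(m, gap = "-") {
  pairs <- combn(nrow(m), 2)
  identities <- apply(pairs, 2, function(p) {
    a <- m[p[1], ]
    b <- m[p[2], ]
    keep <- a != gap & b != gap
    sum(a[keep] == b[keep]) / sum(keep)
  })
  mean(identities)
}

print(informative_fraction(aln_matrix))
print(mean_identity(aln_matrix))
```

In this toy alignment, two of the six sites are parsimony-informative and the mean pairwise identity is 2/3. A pair of sequences with no shared ungapped positions would yield NaN in this sketch and would need separate handling.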
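
The random pruning of repeated taxa can be sketched in base R as follows. The tip labels and the helper pick_one_per_taxon are hypothetical and only show the sampling step: one occurrence of each repeated taxon is kept at random in every iteration. The repository's own code works on trees and follows each draw with a distance calculation, which this sketch omits.

```r
# Condensed tip labels (hypothetical); Hemiptera appears twice, as it might
# when a taxon is recovered as paraphyletic in a protein tree.
tips <- c("Diptera", "Hemiptera", "Coleoptera", "Hemiptera", "Lepidoptera")

# For each taxon, keep one randomly chosen occurrence and return the
# positions of the tips that survive pruning.
pick_one_per_taxon <- function(labels) {
  positions <- split(seq_along(labels), labels)
  kept <- vapply(
    positions,
    function(idx) idx[sample.int(length(idx), 1)],
    integer(1)
  )
  sort(unname(kept))
}

# Repeat the draw many times; each iteration yields one pruned tip set.
set.seed(1)
draws <- replicate(1000, paste(pick_one_per_taxon(tips), collapse = ","))
print(table(draws))
```

Each draw keeps four tips with unique labels, so a distance computed on the pruned tree compares like with like. Summarizing the distances across draws is what allows an interval to be reported rather than a single value.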