Applicability of phylogenetic network algorithms for representing the evolutionary history of SARS-CoV-2

This repository contains all code, data, results and Supplemental Information for the thesis 'Applicability of phylogenetic network algorithms for representing the evolutionary history of SARS-CoV-2', written by Rosanne Wallin for the Minor Research Project performed at CWI Amsterdam as part of the master's programme Bioinformatics and Systems Biology (VU and UvA). Each folder in this directory contains a README.txt file explaining its contents.

Abstract

Phylogenetic networks are used to display complex evolutionary histories involving so-called reticulation events, such as genetic recombination. Several methods have been developed to construct such networks, taking for example a multiple sequence alignment or multiple phylogenetic trees as input. Coronaviruses are known to recombine frequently, but phylogenetic networks have not yet been used extensively to describe their evolutionary history. Here, we created a workflow to construct phylogenetic networks representing the evolutionary history of SARS-CoV-2 using the TriLoNet, TriL2Net, Tree-Child, Semi-Temporal and Maximum Pseudo-Likelihood algorithms. The workflow includes filtering noise from sets of phylogenetic trees by contracting edges based on branch length and bootstrap support, followed by resolving multifurcations. We show that this filtering approach generally reduces the minimum reticulation number and minimum temporal distance of the phylogenetic networks constructed from these trees, while preserving the overall topology. The networks constructed by the TriLoNet, TriL2Net and Tree-Child algorithms show no signs that SARS-CoV-2 itself is a recombinant virus. They do indicate a recombination event from an ancestor of the SARS-CoV-2/RaTG13 lineage and the HKU3-1 lineage into the bat-SL-COVZC45 lineage, which was previously described in other research. Our results demonstrate that the TriLoNet, TriL2Net, Tree-Child and Semi-Temporal algorithms are applicable to coronavirus data. However, the constructed phylogenetic networks should be interpreted with care, taking the underlying network constraints and biological plausibility into account. Our workflow may serve as an example for pipelines that preprocess multiple sequence alignments into suitable input for phylogenetic network algorithms, providing a basis for easily testing and validating one or more algorithms with different input data and filtering options.
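
Illustrative sketch

The edge-contraction filter mentioned in the abstract can be illustrated with a minimal, self-contained Python sketch. This is not the code used in the thesis: the Node class, the function names, the threshold values and the choice to add a contracted edge's length to its children are all assumptions made for illustration. The subsequent step of resolving multifurcations is not shown.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical minimal tree node (not a class from this repository)."""
    name: str = ""
    length: float = 0.0              # length of the edge above this node
    support: Optional[float] = None  # bootstrap support of that edge, if any
    children: List["Node"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        return not self.children


def contract_weak_edges(node: Node, min_length: float, min_support: float) -> Node:
    """Contract internal edges that are too short or too poorly supported.

    The tree is modified in place. A contracted node is replaced by its
    children, which creates a multifurcation at the parent. Assumption:
    the contracted edge's length is added to each child edge so that
    root-to-leaf distances stay the same.
    """
    new_children = []
    for child in node.children:
        # Process the subtree first (post-order), then judge the edge above it.
        child = contract_weak_edges(child, min_length, min_support)
        weak = (not child.is_leaf()) and (
            child.length < min_length
            or (child.support is not None and child.support < min_support)
        )
        if weak:
            for grandchild in child.children:
                grandchild.length += child.length
                new_children.append(grandchild)
        else:
            new_children.append(child)
    node.children = new_children
    return node


def leaf_names(node: Node) -> List[str]:
    """Return the leaf names below a node, left to right."""
    if node.is_leaf():
        return [node.name]
    return [name for child in node.children for name in leaf_names(child)]


# Toy tree ((A,B)95,(C,D)40); the labels and numbers are made up.
tree = Node("root", children=[
    Node("AB", length=0.5, support=95, children=[Node("A", 0.25), Node("B", 0.25)]),
    Node("CD", length=0.25, support=40, children=[Node("C", 0.5), Node("D", 0.5)]),
])

# Example thresholds only; the values used in the thesis may differ.
filtered = contract_weak_edges(tree, min_length=0.01, min_support=70)
print([child.name for child in filtered.children])
print(leaf_names(filtered))
```

In this toy tree the edge above the (C,D) clade has support 40, below the example threshold of 70, so it is contracted and C and D attach directly to the root next to the well-supported (A,B) clade. The leaf set is unchanged; only the weakly supported grouping is removed.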