/SARS-CoV-2

Daily analyses of genomic SARS-CoV-2 data

This project is a part of a larger effort with the Galaxy team: covid19.galaxyproject.org

TL;DR

  1. Analysis of all current SARS-CoV-2 genomes for evidence of natural selection
  2. Divergence and diversity of SARS-CoV-2 genomes over time, both overall and by region

Analysis pipeline

  1. We collect data from the GISAID database daily. These are mostly full-genome sequences collected from different platforms and different regions. See here for a summary of the sequence data.

  2. We extract full-genome sequences from human hosts and map them to the reference genes using a simple codon-aware pipeline. At this step we also compress the data to retain a single copy of each unique haplotype in each gene, and filter out sequences that have too many (>0.5%) uncalled/unresolved (N) bases.

  3. We reconstruct maximum-likelihood (ML) phylogenies on the compressed data using raxml-ng.

  4. We estimate gene-by-gene distances using TN93 to compute diversity and divergence, summarized here.

  5. We run several HyPhy dN/dS-based selection analyses on each gene. We restrict these analyses to internal branches of the tree to filter out within-host evolution.

When analyzing intra-species or intra-host data, dN/dS estimates may be inflated because not all observed sequence variation reflects substitutions; some of it consists of mutations that have not yet been filtered by selection. In other words, dN/dS may be elevated by intra-species or intra-host polymorphism that need not be attributable to positive selection. One simple approach to mitigating this undesirable effect is to restrict site-specific analyses to internal branches only. Internal branches encompass at least one step that is visible to selection (transmission and/or multiple rounds of replication), and are therefore less likely to contain spurious polymorphic variants.
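To make the notion of an "internal branch" concrete, here is a minimal sketch in Python. It is not how HyPhy selects branches; the tree representation (a dictionary mapping each node to its children) and the names `internal_branches` and `toy_tree` are assumptions made purely for illustration. Each branch is identified by the node at its lower end, and a branch counts as internal when that node has children of its own.

```python
from typing import Dict, List


def internal_branches(children: Dict[str, List[str]], root: str) -> List[str]:
    """Return the nodes whose incoming branch is internal.

    A branch is named after the node at its child end. It is internal
    when that node is neither the root (which has no incoming branch)
    nor a leaf (which has no children).
    """
    internal = []
    stack = [root]
    while stack:
        node = stack.pop()
        kids = children.get(node, [])  # leaves are absent from the mapping
        if kids and node != root:
            internal.append(node)
        stack.extend(kids)
    return sorted(internal)


# Hypothetical toy tree, in Newick notation: ((A,B)n1,(C,D)n2,E)root
toy_tree = {"root": ["n1", "n2", "E"], "n1": ["A", "B"], "n2": ["C", "D"]}

selected = internal_branches(toy_tree, "root")
print(selected)
```

In this toy tree only the branches leading to `n1` and `n2` would be retained; the terminal branches leading to the sampled sequences `A` through `E`, where unfiltered within-host variants are most likely to appear, are excluded.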

  6. These analyses include SLAC, FEL, MEME, and PRIME (the latter allows testing for conservation or change of specific biochemical properties at a site) to identify which sites may be experiencing positive selection, and which properties may be important to preserve or change at those sites. The up-to-date summary is hosted here.
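The filtering and compression described in step 2 can be sketched in a few lines of Python. This is an illustrative approximation, not the project's actual code: the function name `filter_and_compress`, the `max_n_fraction` parameter, and the input format (a dictionary of sequence name to aligned gene sequence) are all assumptions. The sketch also treats any two strings that differ, including by an `N`, as distinct haplotypes, which the real pipeline may handle differently.

```python
from typing import Dict, List


def filter_and_compress(
    seqs: Dict[str, str], max_n_fraction: float = 0.005
) -> Dict[str, List[str]]:
    """Drop sequences with too many N bases, then keep one copy per haplotype.

    Returns a mapping from each unique sequence to the names of all
    input records that carry it, so copy numbers are not lost.
    """
    unique: Dict[str, List[str]] = {}
    for name, seq in seqs.items():
        seq = seq.upper()
        if not seq:
            continue  # skip empty records
        # Discard sequences with more than 0.5% uncalled (N) bases.
        if seq.count("N") / len(seq) > max_n_fraction:
            continue
        unique.setdefault(seq, []).append(name)
    return unique


# Hypothetical 300-base records for a single gene.
example = {
    "s1": "ATG" * 100,
    "s2": "atg" * 100,           # same haplotype as s1, different case
    "s3": "ATG" * 99 + "ATN",    # 1 N in 300 bases (about 0.33%): kept
    "s4": "NNN" + "ATG" * 99,    # 3 N in 300 bases (1%): filtered out
}

compressed = filter_and_compress(example)
for haplotype, names in compressed.items():
    print(len(names), names)
```

Keeping the list of names per haplotype means downstream steps can work on the compressed set while still being able to report how many original sequences each haplotype represents.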