PopGLen

Bioinformatics pipeline to process whole genome resequencing data and perform genotype likelihood based population genomic analyses using ANGSD and related software. Flexible enough to handle datasets that combine high/low coverage and historical/fresh samples.

Primary language: Python. License: GNU General Public License v3.0 (GPL-3.0).

Genotype likelihood population genomics pipeline

Note: This workflow has seen a number of changes that are not yet part of a numbered release. The most complete version is currently found on the develop branch.

This workflow is aimed at processing raw sequencing reads and calculating population genomic statistics within a genotype likelihood framework. As it is focused on GL methods, it has options to adapt the workflow for processing datasets with low or variable coverage and/or historical/ancient samples with degraded DNA. It is under active development, so new features will be added.

Getting Started

To run this workflow, you'll need paired-end raw sequencing data and an uncompressed reference genome to map it to.

You will also need to be working on a machine with the following installed:

  • Conda (or, preferably, its drop-in replacement mamba)
  • Singularity/Apptainer

Once this software is installed, you can follow the instructions provided in the Snakemake workflow catalog to deploy and configure the workflow. You can refer to the Snakemake documentation for additional information that may be relevant to your computing environment (running jobs through cluster job queues, setting default resources).

Notes on inputs

Reference genomes should be uncompressed, and contig names should be clear and concise. Currently, there are some issues parsing contig names with underscores, so please change these in your reference before running the pipeline. Alphanumeric characters and . in contig names have been tested and work; other symbols have not been tested.
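For example, a quick pre-flight check along these lines can flag contig names that might cause problems. This is only an illustrative sketch, not part of the pipeline, and the reference path is hypothetical:

```python
# Illustrative pre-flight check (not part of the pipeline): flag contig names
# in an uncompressed reference that contain characters other than the tested
# alphanumerics and ".".
import re

allowed = re.compile(r"^[A-Za-z0-9.]+$")
with open("reference.fa") as fasta:  # hypothetical path
    for line in fasta:
        if line.startswith(">"):
            name = line[1:].split()[0]
            if not allowed.match(name):
                print(f"Consider renaming this contig before running: {name}")
```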

Support for bgzipped genomes may be added once it has been confirmed to work with all underlying tools. Currently, it will not work, as calculating chunks is hard-coded to use an uncompressed genome.

Running on a cluster

Development was done on UPPMAX's Rackham cluster, and a simple profile is included in the rackham folder to simplify running this workflow through SLURM there. For other SLURM-based clusters, this profile should largely work with a few minor modifications of the defaults, primarily ensuring that the resources make sense for your system, i.e. changing the default partition and account. Leaving the default memory unchanged may result in over-reserving memory in some cases; a quick fix is to change 6400 to whatever memory is reserved per CPU by default on your HPC (though you might then need to increase the threads requested for some rules). See Snakemake's cluster support documentation for information on how to adapt the profile for your HPC environment.

Resources (and what to do if the workflow is over/underbooking resources)

As Rackham ties memory reservations to CPU reservations, resource allocation for rules is currently done mostly through threads. In the future, thread and memory resources will be defined more explicitly. For now, if you find that the workflow is over- or under-booking a given resource, you can adjust the resource reservations in your run profile; see the commented section in rackham/config.yaml for an example. Because memory is currently only ever defined through threads, you may need to lower the threads and add an explicit memory resource to some rules using this method in order to optimize them for your system.
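As a rough illustration of why this matters, the implied memory reservation on such a cluster is simply the threads requested multiplied by the memory reserved per CPU (6400 MB is Rackham's value; substitute your own cluster's default):

```python
# Back-of-the-envelope sketch: on clusters that tie memory to CPUs, a rule's
# implied memory reservation is threads * memory-per-CPU.
mem_per_cpu_mb = 6400   # Rackham's per-core memory; change to your HPC's value
threads = 4             # threads requested by some rule (example value)
print(f"implied memory reservation: {threads * mem_per_cpu_mb} MB")
```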

Features

Currently, the pipeline performs the following tasks:

Reference genome preparation

  • Indexing of reference for subsequent analyses

Raw read preparation

  • Trimming of paired-end reads from high quality libraries
  • Collapsing of paired-end reads from fragmented (aDNA/historical DNA) libraries

Read mapping

  • Mapping of prepared reads to the reference using bwa-mem, clipping of overlapping reads, and merging of multiple runs/libraries per sample
    • NOTE: For samples marked as historical (degraded), only short read pairs that overlap and collapse are mapped, to reduce mapping of likely contaminants.
  • Removal of PCR and sequencing duplicates separately for fresh (Picard) and fragmented (DeDup) DNA reads
  • Realignment around indels
  • Optional recalibration of base quality scores on degraded DNA bam files with MapDamage2
  • Indexing of deduplicated, realigned, mapped, and recalibrated reads

Sample quality control

  • Assess post-mortem DNA damage with DamageProfiler
  • Assess mapping quality stats with Qualimap
  • Assess endogenous content using the proportion of reads mapped to the reference before duplicates are removed (a minimal sketch of this calculation follows below)
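A minimal sketch of that calculation, assuming pysam and an indexed BAM produced before duplicate removal (the file name is hypothetical and the pipeline's exact implementation may differ):

```python
# Hedged sketch: endogenous content as the proportion of reads mapping to the
# reference, taken from an indexed BAM made before duplicate removal.
import pysam

def endogenous_content(bam_path: str) -> float:
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        mapped, unmapped = bam.mapped, bam.unmapped  # counts from the BAM index stats
    return mapped / (mapped + unmapped)

print(f"{endogenous_content('sample1.paired.bam'):.2%}")  # hypothetical file name
```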

Data quality filtering

  • Analyses can be set with minimum mapping and base quality thresholds
  • Exclusion of entire scaffolds (i.e. sex-linked, low quality) through user config (both list and contig size based)
  • Exclusion of repeat regions from analyses using RepeatModeler/RepeatMasker
  • Exclusion of low mappability regions with GenMap
  • Exclusion of sites with extreme global depth values (determined separately for the entire dataset and for subsets at certain coverage ranges, then merged); a sketch of this kind of filter is shown after this list
  • Exclusion of sites based on data missingness across dataset and/or per population
  • Filter using any number of additional user-defined BED files
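To illustrate the extreme global depth filter referenced above, here is a hedged sketch in which sites whose summed depth across all samples falls outside chosen percentile bounds are excluded. The file name, format, and bounds are hypothetical; the pipeline's actual method may differ:

```python
# Hedged sketch of an extreme global depth filter: exclude sites whose total
# depth across all samples falls outside chosen percentile bounds.
import numpy as np

depths = np.loadtxt("dataset.global.depth.txt")  # hypothetical: one summed depth per site
low, high = np.percentile(depths, [2.5, 97.5])   # example bounds; user configurable
keep = (depths >= low) & (depths <= high)
print(f"keeping {keep.sum()} of {depths.size} sites (depth {low:.0f}-{high:.0f})")
```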

GL based population genomic analyses

To speed up the pipeline, many of these analyses are done on part of the genome at a time, then later merged. This is only done for analyses where it is possible and where the time saved is appreciable. The chunks are a user-configured size, allowing tuning of run-times (i.e. more jobs with shorter runtimes vs. fewer jobs with longer runtimes).
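For illustration, chunking of this kind can be sketched from the reference's .fai index; the pipeline's actual implementation may differ, and the file names here are hypothetical:

```python
# Illustrative sketch (the pipeline's chunking code may differ): group contigs
# from a FASTA index (.fai) into chunks of roughly `chunk_size` bp, so each
# chunk can be analysed as a separate job and merged afterwards.
def make_chunks(fai_path: str, chunk_size: int) -> list[list[str]]:
    chunks, current, running = [], [], 0
    with open(fai_path) as fai:
        for line in fai:
            contig, length = line.split("\t")[:2]
            current.append(contig)
            running += int(length)
            if running >= chunk_size:
                chunks.append(current)
                current, running = [], 0
    if current:
        chunks.append(current)
    return chunks

print(make_chunks("reference.fa.fai", 100_000_000))  # ~100 Mbp chunks (example size)
```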

SAF based analyses are done on variable and non-variable sites passing quality filters. This set is the same across all populations in the dataset and is based on the positions passing all the requested filters. Beagle (SNP) based analyses are done on a SNP set that is constant across all populations, determined from the Beagle file produced for the whole dataset, with major and minor alleles inferred from the whole dataset. When relevant, pruned SNPs are used; pruning is done on the whole-dataset Beagle file and the same pruned sites are used for all populations.

Additionally, all analyses can be repeated with samples subsampled to a lower, user-configured depth. This helps ensure results are not simply due to variance in depth between groups.

Analyses:

  • Estimation of linkage disequilibrium across genome and LD decay using ngsLD
  • Linkage pruning where relevant with ngsLD
  • PCA with PCAngsd
  • Admixture with NGSAdmix
  • Relatedness using NgsRelate and IBSrelate
  • 1D and 2D Site frequency spectrum production with ANGSD
  • Neutrality statistics per population (Watterson's theta, pairwise pi, Tajima's D) in user defined sliding windows with ANGSD
  • Estimation of heterozygosity per sample from the 1D SFS with ANGSD (see the sketch after this list)
  • Pairwise $F_{ST}$ between all populations or individuals in user defined sliding windows with ANGSD
  • Inbreeding coefficients and runs of homozygosity per sample with ngsF-HMM (NOTE: this is currently only possible for samples that are within a population sampling, not for lone samples, which will always return an inbreeding coefficient of 0)
  • Identity by state (IBS) matrix between all samples using ANGSD
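As a minimal example of the per-sample heterozygosity estimate listed above, assuming a 1D SFS for a single diploid sample as produced by ANGSD/realSFS (file name hypothetical):

```python
# Hedged sketch: heterozygosity from a single diploid sample's 1D SFS, i.e.
# the proportion of sites falling in the heterozygous category.
import numpy as np

sfs = np.loadtxt("sample1.sfs")        # hypothetical realSFS output: 3 values for one diploid
heterozygosity = sfs[1] / sfs.sum()
print(f"heterozygosity: {heterozygosity:.4f}")
```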

Planned

Some additional components are planned for the pipeline, with the order below roughly corresponding to priority:

  • Add calculation of bootstrapped SFS
  • Manhattan plots in report for sliding window results
  • Allow starting with bam files - for those that want to process raw reads in their own way before performing analyses
  • Add calculation of Dxy
  • Add schema for configuration files to improve incorrect format handling and to enable defaults

Workflow directed acyclic graph

Below is a graph of the workflow with most of the analyses enabled. This is generated directly by Snakemake, though I have removed the all, popfile, and link_ref rules to improve readability, as they are not needed to understand the analysis flow. A more condensed, readable diagram will be added shortly.

In general, a few stages can be seen grouping here: a mapping stage at the top, ending with symlink_bams, followed by the creation of a set of filters for the dataset, ending with combine_beds. Afterwards, population genetic analyses are performed along two primary pathways - allele frequency based results (starting with angsd_doSaf_sample/pop) and SNP based results (starting with angsd_doGlf2 (beagle)).

A directed acyclic graph (DAG) of the main workflow

Acknowledgements

The computations required for developing and testing this workflow have been enabled by resources provided by the National Academic Infrastructure for Supercomputing in Sweden (NAISS) and the Swedish National Infrastructure for Computing (SNIC) at UPPMAX, partially funded by the Swedish Research Council through grant agreements no. 2022-06725 and no. 2018-05973.