Pipeline to build a large human reference panel, using publicly available genomes

refpanel

A fully automated and reproducible pipeline for building large reference panels of jointly-called and phased human genomes, aligned to GRCh38.

This pipeline was inspired by the alignment and SNP calling workflow used by the New York Genome Center (NYGC) in their recent paper "High-coverage whole-genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios". It is implemented here, with improvements, using snakemake and conda for full reproducibility.

⚠️ This pipeline is in active development and subject to ongoing improvements.

Installation

Download the refpanel source code

git clone git@github.com:ekirving/refpanel.git && cd refpanel

This pipeline uses the conda package manager (or the faster mamba front-end) to handle installation of all software dependencies. If you do not already have conda or mamba installed, then please install one first.

Once conda is set up, build and activate a new environment for the refpanel pipeline:

conda env create --name refpanel --file environment.yaml
conda activate refpanel
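Before creating the environment, it can help to confirm that one of the two package managers is actually on your `PATH`. The following is a minimal sketch, not part of refpanel itself; the helper name `find_package_manager` is hypothetical, and it prefers `mamba` only because the text above describes it as the faster front-end.

```python
import shutil


def find_package_manager(candidates=("mamba", "conda")):
    """Return the first candidate executable found on PATH, or None.

    Hypothetical helper for illustration; refpanel does not ship this function.
    """
    for name in candidates:
        if shutil.which(name) is not None:
            return name
    return None


if __name__ == "__main__":
    manager = find_package_manager()
    if manager is None:
        print("Neither mamba nor conda was found on PATH; please install one first.")
    else:
        print(f"Found {manager}; the environment can now be built from environment.yaml.")
```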

Data sources

This pipeline comes preconfigured to build a joint-callset, called refpanel-v2 (n=5,100), involving all publicly available samples from:

Plus additional public genomes from:

The data from these projects is hosted by the International Genome Sample Resource (IGSR) database (doi:10.1093/nar/gkw829) and the European Nucleotide Archive (ENA) (doi:10.1093/nar/gkq967).

If there are publicly available whole-genome sequencing data that you would like incorporated into refpanel-v3, please raise an issue on GitHub with the details of the publication, and the data will be considered for inclusion in future releases.

If you wish to build a customised joint-callset (e.g., including non-public samples), please refer to the configuration docs.

Downloading data

To ensure all data is processed consistently, refpanel downloads gVCF files for 1000G; CRAM files for 1000G, HGDP, SGDP and GGVP; and fastq files for all other data sources.

To (optionally) pre-fetch all the data dependencies, run:

./refpanel download_data &

All output will be automatically written to a log file refpanel-<YYYY-MM-DD-HHMM.SS>.log

⚠️ These files are very large: Please make sure you have sufficient disk space to store them!
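One way to act on this warning is to check the free space on the target filesystem before starting the download. The sketch below uses only the Python standard library. The threshold `REQUIRED_TIB` is a hypothetical placeholder, not a figure published by the project; replace it with an estimate that suits the data sources you plan to fetch.

```python
import shutil

# Hypothetical placeholder: the project does not state a required size here.
REQUIRED_TIB = 1


def has_free_space(path, required_bytes):
    """Return True if the filesystem holding `path` has at least `required_bytes` free."""
    usage = shutil.disk_usage(path)
    return usage.free >= required_bytes


if __name__ == "__main__":
    required = REQUIRED_TIB * 1024**4  # convert TiB to bytes
    free_tib = shutil.disk_usage(".").free / 1024**4
    if has_free_space(".", required):
        print(f"OK: {free_tib:.2f} TiB free in the current directory.")
    else:
        print(f"Only {free_tib:.2f} TiB free; the placeholder threshold is {REQUIRED_TIB} TiB.")
```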

Ancestry composition

Superpopulation assignments are based on the original 1000G, HGDP and SGDP metadata.

| Superpopulation | Code | Samples |
| --- | --- | --- |
| African Ancestry | AFR | 1,460 |
| American Ancestry | AMR | 589 |
| Central Asian and Siberian Ancestry | CAS | 66 |
| Central and South Asian Ancestry | CSA | 199 |
| East Asian Ancestry | EAS | 826 |
| European Ancestry | EUR | 790 |
| Middle Eastern Ancestry | MEA | 407 |
| Oceanian Ancestry | OCE | 38 |
| South Asian Ancestry | SAS | 678 |
| West Eurasian Ancestry | WEA | 47 |
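As a quick consistency check, the per-superpopulation counts above can be summed and compared with the panel size quoted earlier (n=5,100). The short sketch below copies the counts from the table and also prints each group's share of the panel; the variable names are illustrative only.

```python
# Sample counts copied from the ancestry composition table above.
samples = {
    "AFR": 1460,
    "AMR": 589,
    "CAS": 66,
    "CSA": 199,
    "EAS": 826,
    "EUR": 790,
    "MEA": 407,
    "OCE": 38,
    "SAS": 678,
    "WEA": 47,
}

total = sum(samples.values())
# The table should account for every sample in refpanel-v2 (n=5,100).
assert total == 5100

# Print the groups from largest to smallest, with their share of the panel.
for code, count in sorted(samples.items(), key=lambda item: item[1], reverse=True):
    print(f"{code}\t{count}\t{100 * count / total:.1f}%")
```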

Joint-calling pipeline

In brief, refpanel produces a jointly-called and phased callset via the following steps:

For more information, refer to the DAG of the rule graph or the code itself.

Running the pipeline

To execute the full pipeline, end-to-end, run:

./refpanel &

All output will be automatically written to a log file refpanel-<YYYY-MM-DD-HHMM.SS>.log
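Because the pipeline runs in the background, you will usually want to locate the most recent log file. The sketch below assumes the `<YYYY-MM-DD-HHMM.SS>` placeholder corresponds to the `strftime` pattern `%Y-%m-%d-%H%M.%S`. That mapping is an interpretation of the filename template, not something confirmed from the refpanel wrapper script, and the helper names are hypothetical.

```python
from datetime import datetime
from pathlib import Path

# Assumed strftime equivalent of refpanel-<YYYY-MM-DD-HHMM.SS>.log
LOG_FORMAT = "refpanel-%Y-%m-%d-%H%M.%S.log"


def log_name(when):
    """Build a log filename for the given datetime under the assumed format."""
    return when.strftime(LOG_FORMAT)


def parse_log_time(name):
    """Recover the timestamp from a log filename; raises ValueError on a mismatch."""
    return datetime.strptime(name, LOG_FORMAT)


def latest_log(directory="."):
    """Return the Path of the newest matching log in `directory`, or None if there is none."""
    logs = []
    for path in Path(directory).glob("refpanel-*.log"):
        try:
            logs.append((parse_log_time(path.name), path))
        except ValueError:
            continue  # skip files that do not follow the assumed pattern
    if not logs:
        return None
    return max(logs, key=lambda entry: entry[0])[1]


if __name__ == "__main__":
    newest = latest_log()
    print(newest if newest is not None else "No refpanel log files found here.")
```

Sorting on the parsed timestamp rather than on file modification time keeps the result stable if older logs are later touched or copied.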

⚠️ This will take a long time: Please make sure you run this on a server with as many CPUs, and as much RAM, as possible (e.g., this pipeline was developed and run on a cluster of nodes, each with 96 cores and 755 GB of RAM).

The pipeline can also be broken down into separate steps, for distribution across multiple nodes in a cluster.