A fully automated and reproducible pipeline for building large reference panels of jointly-called and phased human
genomes, aligned to GRCh38
.
This pipeline was inspired by the alignment and SNP calling workflow used by the New York Genome Center (NYGC) in their recent paper High-coverage whole-genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios; implemented here, with improvements, using snakemake and conda for full reproducibility.
Download the refpanel
source code
git clone git@github.com:ekirving/refpanel.git && cd refpanel
This pipeline uses the conda package manager (or the
faster mamba front-end) to handle installation of all software
dependencies. If you do not already have conda
or mamba
installed, then please install one first.
Once conda
is setup, build and activate a new environment for the refpanel
pipeline
conda env create --name refpanel --file environment.yaml
conda activate refpanel
This pipeline comes preconfigured to build a joint-callset, called refpanel-v2
(n=5,100), involving all publicly
available samples from:
- 1000 Genomes Project (1000G), 30x NYGC version (n=3,202; inc. 2,504 phase 3 samples + 698 additional related samples);
The 1000 Genomes Project Consortium. (2015) Nature doi:10.1038/nature15393
Byrska-Bishop et al. (2022) Cell doi:10.1016/j.cell.2022.08.004 - Simons Genome Diversity Project (SGDP) (n=256; excluding overlap with 1000G);
Mallick et al. (2016) Nature doi:10.1038/nature18964 - Gambian Genome Variation Project (GGVP) (n=400; excluding overlap with 1000G);
Band et al. (2019) Nature Communications doi:10.1038/s41467-019-13480-z - Human Genome Diversity Project (HGDP) (n=816; excluding overlap with SGDP);
Bergström et al. (2020) Science doi:10.1126/science.aay5012 - Arabian Peninsula Population Genomic Study (APPG) (n=137);
Almarri et al. (2021) Cell doi:10.1016/j.cell.2021.07.013
Plus additional public genomes from:
- Meyer et al. (2012) Science (n=8); doi:10.1126/science.1224344
- Mondal et al. (2016) Nature Genetics (n=60); doi:10.1038/ng.3621
- Rodriguez-Flores et al. (2016) Genome Research (n=108); doi:10.1101/gr.191478.115
- de Barros Damgaard et al. (2018) Science (n=41); 10.1126/science.aar7711
- McColl et al. (2018) Science (n=2); doi:10.1126/science.aat3628
- Lindo et al. (2018) Science Advances (n=25); 10.1126/sciadv.aau4921
- Gelabert et al. (2019) BMC Genomics (n=12); doi:10.1186/s12864-019-5529-0
- Serra-Vidal et al. (2019) Current Biology (n=21); doi:10.1016/j.cub.2019.09.050
- Lorente-Galdos et al. (2019) Genome Biology (n=9); doi:10.1186/s13059-019-1684-5
- Crooks et al. (2020) BMC Genetics (n=3); doi:10.1186/s12863-020-00917-4
The data from these projects is hosted by the International Genome Sample Resource (IGSR) database (doi:10.1093/nar/gkw829) and the European Nucleotide Archive (ENA) (doi:10.1093/nar/gkq967).
If there are publicly available whole-genome sequencing data that you would like incorporated into refpanel-v3
please
raise an issue on GitHub with the details of the publication and they will be considered for inclusion in future releases.
If you wish to build a customised joint-callset (e.g., including non-public samples), please refer to the configuration docs.
To ensure all data is processed consistently, refpanel
downloads gVCF
files for 1000G; CRAM
files for 1000G, HGDP,
SGDP and GGVP; and fastq
files for all other data sources.
To (optionally) pre-fetch all the data dependencies, run:
./refpanel download_data &
All output will be automatically written to a log file refpanel-<YYYY-MM-DD-HHMM.SS>.log
Superpopulation assignments are based on the original 1000G, HGDP and SGDP metadata.
Superpopulation | Code | Samples |
---|---|---|
African Ancestry | AFR | 1,460 |
American Ancestry | AMR | 589 |
Central Asian and Siberian Ancestry | CAS | 66 |
Central and South Asian Ancestry | CSA | 199 |
East Asian Ancestry | EAS | 826 |
European Ancestry | EUR | 790 |
Middle Eastern Ancestry | MEA | 407 |
Oceanian Ancestry | OCE | 38 |
South Asian Ancestry | SAS | 678 |
West Eurasian Ancestry | WEA | 47 |
In brief, refpanel
produces a jointly-called and phased callset via the following steps:
- Adapter trimming with
fastp
(v.0.23.2) - Alignment to
GRCh38
withbwa mem
(v0.7.15) - Fix-mate, merge, sort,
and mark duplicates with
picard
(v2.5.0) - Base recalibration with
gatk BaseRecalibrator
(v3.5) - Conversion to
cram
withsamtools
(v1.14) - Per-sample calling of
gVCFs
withgatk HaplotypeCaller
(with sex-dependent ploidy) - Merging samples with
gatk CombineGVCFs
- Joint-calling of all samples with
gatk GenotypeGVCFs
- Variant quality score recalibration with
gatk VariantRecalibrator
- Annotation with dbSNP build 155 with
bcftools
(v1.14) - Hard-filtering of SNPs and INDELs with
bcftools
:- VQSR PASS;
- GT missingness < 5%;
- HWE p-value > 1e-10 in at least one super-population;
- Mendelian error rate < 5%, using trios from 1000G (n=602) and GGVP (n=133);
- MAC ≥ 2 (i.e., no singletons);
- Read-based phasing with
whatshap
(v1.2.1) using:- Illumina paired-end reads from all projects;
- 10x Genomics linked-read sequencing from HGDP (n=26) and APPG (n=137);
- Pedigree phasing with
whatshap
using:- Trios from 1000G (n=602) and GGVP (n=130);
- Statistical phasing with
shapeit4
(v4.2.2) - Variant effect prediction with
ensembl-vep
(v105.0)
For more information, refer to the DAG of the rule graph or the code itself.
To execute the full pipeline, end-to-end, run:
./refpanel &
All output will be automatically written to a log file refpanel-<YYYY-MM-DD-HHMM.SS>.log
The pipeline can also be broken down into separate steps, for distribution across multiple nodes in a cluster.