Genotype imputation is a computational technique for estimating missing genotypes in SNP array data, using a reference panel of haplotypes. The same approach extends to low-coverage whole genome sequencing data, where it fills in missing genotypes and refines uncertain genotype calls made from sequencing reads.
We provide two pipelines for genotype imputation with the UK Biobank reference panel (>200,000 samples; 700M variants): one for SNP array data and one for low-coverage whole genome sequencing (WGS) data. To keep costs low, the pipelines rely on efficient state-of-the-art tools: IMPUTE5 (Rubinacci et al., 2020) for SNP array imputation and GLIMPSE2 (Rubinacci et al., 2023) for low-coverage WGS imputation.
The pipelines take as input either a multi-sample VCF/BCF file with SNP array genotypes or a set of low-coverage BAM/CRAM files. Imputation against the UK Biobank reference panel runs through applets and dx command jobs designed for the UKB RAP. Each pipeline produces a single multi-sample BCF file per chromosome, containing genotype posteriors, dosages, and phased best-guess genotypes. Further outputs, such as haploid dosages, can be obtained by specifying the appropriate options in the imputation software.
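To illustrate how the three output quantities relate for a biallelic variant, here is a minimal Python sketch. It is not part of the pipelines: the function names are hypothetical, and it assumes the genotype posteriors are given in the usual order P(0/0), P(0/1), P(1/1). The alternate-allele dosage is the expected number of alternate alleles, and the best-guess genotype is the one with the highest posterior. Phase information, which the pipelines also report, is not modelled here.

```python
from typing import Sequence


def dosage_from_posteriors(gp: Sequence[float], tol: float = 1e-3) -> float:
    """Expected alternate-allele count for one sample at a biallelic site.

    gp is assumed to hold (P(0/0), P(0/1), P(1/1)); this ordering is an
    assumption of the sketch, not something read from the pipeline output.
    """
    if len(gp) != 3:
        raise ValueError("expected three genotype posteriors")
    if abs(sum(gp) - 1.0) > tol:
        raise ValueError("genotype posteriors should sum to 1")
    # 0 * P(0/0) + 1 * P(0/1) + 2 * P(1/1)
    return gp[1] + 2.0 * gp[2]


def best_guess_genotype(gp: Sequence[float]) -> int:
    """Alternate-allele count (0, 1 or 2) of the most probable genotype."""
    return max(range(3), key=lambda g: gp[g])


if __name__ == "__main__":
    posteriors = (0.05, 0.90, 0.05)  # made-up values for illustration
    print(dosage_from_posteriors(posteriors))
    print(best_guess_genotype(posteriors))
```

A dosage close to an integer together with one dominant posterior indicates a confidently imputed genotype; intermediate dosages reflect residual uncertainty.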
Tutorials on how to use the pipelines can be found at:
https://srubinacci.gitbook.io/uk-biobank-imputation-pipelines/
If you use the pipelines in your research, please cite the following papers:
Reference panel
Low-coverage WGS imputation
SNP array imputation
The UK Biobank imputation pipelines are developed by Simone Rubinacci and Olivier Delaneau.
The UK Biobank imputation pipelines are distributed under the MIT license.