Some scripts to run QC on the UK Biobank 500k exome data
- This was developed & tested on RHEL 7
- This code was designed to run on the UK Biobank Research Analysis Platform (UKBB RAP)
- Read Depth & Allele Balance filters were run on RAP
- YMMV
- Requires either conda or mamba >= 4.12
$ mamba env create -f environment.yml
$ mamba activate UKB_500K_exome_QC
$ ./run_all.sh
There are 5 filters, each applied independently. For an (individual, variant) datum to be retained, its individual must pass the individual filter and its variant must pass every variant filter.
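The retention rule can be sketched in Python as follows. This is a minimal illustration, not code from this repository: the function `retained`, its arguments, and the toy IDs are all hypothetical. It assumes each variant filter produces its own set of passing variant IDs, and that the individual filter produces a set of passing individual IDs.

```python
from typing import Dict, Set, Tuple


def retained(
    datum: Tuple[str, str],
    passing_individuals: Set[str],
    variant_filters: Dict[str, Set[str]],
) -> bool:
    """Return True if an (individual, variant) datum passes every filter."""
    individual, variant = datum
    # Row filter: the individual must have passed individual missingness.
    if individual not in passing_individuals:
        return False
    # Column filters are independent, so the variant must be in every set.
    return all(variant in passed for passed in variant_filters.values())


# Hypothetical toy data, for illustration only.
passing_individuals = {"ind1", "ind2"}
variant_filters = {
    "missingness": {"var1", "var2", "var3"},
    "hwe": {"var1", "var2"},
    "read_depth": {"var1", "var3"},
    "allelic_balance": {"var1", "var2", "var3"},
}

print(retained(("ind1", "var1"), passing_individuals, variant_filters))  # True
print(retained(("ind1", "var2"), passing_individuals, variant_filters))  # False (fails read depth)
print(retained(("ind3", "var1"), passing_individuals, variant_filters))  # False (individual dropped)
```

Because the filters are independent, the order in which the sets are computed does not change the result.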
- Individual missingness (row filter)
  - combine the 22 chromosomes into 1 file
  - keep individuals missing fewer than 10% of variants
- Variant missingness (column filter)
  - keep variants missing from fewer than 10% of individuals
- Variant HWE (column filter)
  - compute the HWE p-value for every variant
  - keep variants with HWE p-value > 1e-15
- Variant read depth (column filter)
  - calculate the minimum read depth for each variant
  - determine whether the variant is a SNP or an INDEL
  - keep variants with sufficient read depth
    - >= 7 (for SNPs)
    - >= 10 (for INDELs)
- Variant allelic balance (column filter)
  - calculate the allelic balance for each sample at each variant
  - get the highest allelic balance for each variant
  - determine whether the variant is a SNP or an INDEL
  - keep variants with sufficient allelic balance
    - >= 0.15 (for SNPs)
    - >= 0.20 (for INDELs)
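As a rough illustration of the two missingness filters, the sketch below works on a tiny in-memory genotype matrix. It is not the repository's implementation: the matrix layout (rows are individuals, columns are variants, `None` marks a missing call), the function names, and the toy data are assumptions made for the example. Both filters are computed on the full matrix, since the filters are independent.

```python
from typing import List, Optional

MAX_MISSING = 0.10  # keep entries whose missingness is strictly below 10%

# Assumed layout: rows = individuals, columns = variants, None = missing call.
Genotypes = List[List[Optional[int]]]


def individual_missingness(genotypes: Genotypes) -> List[float]:
    """Fraction of variants with a missing call, per individual (row)."""
    return [sum(g is None for g in row) / len(row) for row in genotypes]


def variant_missingness(genotypes: Genotypes) -> List[float]:
    """Fraction of individuals with a missing call, per variant (column)."""
    n_individuals = len(genotypes)
    return [sum(g is None for g in col) / n_individuals for col in zip(*genotypes)]


# Hypothetical toy matrix: 3 individuals x 4 variants.
genotypes: Genotypes = [
    [0, 1, 2, 0],
    [0, None, None, 1],
    [1, 0, None, 2],
]

keep_individuals = [
    i for i, m in enumerate(individual_missingness(genotypes)) if m < MAX_MISSING
]
keep_variants = [
    j for j, m in enumerate(variant_missingness(genotypes)) if m < MAX_MISSING
]

print(keep_individuals)  # [0]
print(keep_variants)     # [0, 3]
```

At biobank scale the same counts would come from a dedicated genetics tool rather than Python lists; the sketch only shows which axis each filter summarises.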
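The HWE filter can be illustrated with a one-degree-of-freedom chi-square goodness-of-fit test on genotype counts. This is a swapped-in textbook approximation chosen because it needs only the standard library; the scripts in this repository may well use a different test (for example an exact test), and the function names below are hypothetical.

```python
import math

HWE_MIN_P = 1e-15  # keep variants with HWE p-value above this threshold


def hwe_chi2_pvalue(n_aa: int, n_ab: int, n_bb: int) -> float:
    """Chi-square (1 d.f.) HWE p-value from biallelic genotype counts."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        return 1.0  # no genotypes called: nothing to test
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 1.0  # monomorphic site: no deviation can be measured
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))


def passes_hwe(n_aa: int, n_ab: int, n_bb: int) -> bool:
    """Apply the p-value > 1e-15 rule from the filter list."""
    return hwe_chi2_pvalue(n_aa, n_ab, n_bb) > HWE_MIN_P


print(passes_hwe(25, 50, 25))   # True: counts match HWE expectations exactly
print(passes_hwe(500, 0, 500))  # False: no heterozygotes at all
```

The chi-square approximation is unreliable for rare variants with small expected counts, which is one reason exact tests are commonly preferred for this filter.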
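The read depth and allelic balance filters share the same shape: summarise the variant across samples, classify it as SNP or INDEL, then compare against a type-specific threshold. The sketch below shows that shape under several assumptions that are not confirmed by this README: a variant is treated as a SNP only when REF and ALT are both single bases (everything else counts as an INDEL), allelic balance is taken as alternate reads divided by total reads, and all function names are hypothetical.

```python
from typing import List, Tuple

# Thresholds from the filter list above.
MIN_DEPTH = {"SNP": 7, "INDEL": 10}
MIN_ALLELIC_BALANCE = {"SNP": 0.15, "INDEL": 0.20}


def variant_type(ref: str, alt: str) -> str:
    """Assumption: single-base REF and ALT is a SNP; anything else is an INDEL."""
    return "SNP" if len(ref) == 1 and len(alt) == 1 else "INDEL"


def passes_read_depth(ref: str, alt: str, depths: List[int]) -> bool:
    """Compare the variant's minimum read depth with its type-specific threshold."""
    return min(depths) >= MIN_DEPTH[variant_type(ref, alt)]


def passes_allelic_balance(
    ref: str, alt: str, allele_depths: List[Tuple[int, int]]
) -> bool:
    """Compare the variant's highest allelic balance with its threshold.

    allele_depths holds one (ref_reads, alt_reads) pair per sample.
    Assumption: allelic balance = alt_reads / (ref_reads + alt_reads).
    """
    balances = [a / (r + a) for r, a in allele_depths if r + a > 0]
    if not balances:
        return False  # no sample has any reads at this site
    return max(balances) >= MIN_ALLELIC_BALANCE[variant_type(ref, alt)]


# Hypothetical toy values, for illustration only.
print(passes_read_depth("A", "G", [12, 9, 8]))    # True: SNP, minimum depth 8 >= 7
print(passes_read_depth("A", "AT", [12, 9, 8]))   # False: INDEL, minimum depth 8 < 10
print(passes_allelic_balance("A", "G", [(18, 2), (16, 4)]))   # True: highest balance is 0.2
print(passes_allelic_balance("A", "AT", [(18, 2), (17, 3)]))  # False: highest balance is 0.15
```

Note that the two filters use opposite summaries: the depth filter is driven by the lowest value for the variant, while the allelic balance filter is driven by the highest.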
These steps are derived from the following publication:
Szustakowski, J.D., Balasubramanian, S., Kvikstad, E. et al. Advancing human genetics research and drug discovery through exome sequencing of the UK Biobank. Nat Genet 53, 942–948 (2021). https://doi.org/10.1038/s41588-021-00885-0
The general steps are outlined above, but the software is provided as is.
Feel free to raise issues, but this project will not be maintained regularly.