Whole genome analysis pipeline

Sections

  1. Dataset QC prior to manipulating VCF
  2. Dataset pre-filtering
  3. High quality hard call subset of the data
  4. Outlier sample QC part 1
  5. Sex check sample QC
  6. Identity-by-descent filtering
  7. Principal components analysis
  8. Principal components filtering
  9. Outlier sample QC part 2
  10. Variant QC
  11. Assessing variant QC




WGS QC workflow chart

WGS QC workflow chart sample QC

WGS QC workflow chart variant QC

Repository description

This repo provides a template for processing WGS data in Hail

  • README: Considerations of WGS data and QC pipeline steps
  • Resources: Links and other things relevant to WGS pipelines
  • Modules: Hail script templates for each pipeline step

Relevant gnomAD resources:

AllofUS QC report (Spring 2023)




WGS considerations

File size:

  • WGS data is orders of magnitdues larger than GWAS array (fixed variant size) and exome capture sequencing (~1% of genome)
  • File size will scale with sample size
  • Strategies to deal with large files:
  • GVS file format: Google Big Query solution to save space due to the size of the VCF.

File formats:

  • Per-sample gVCFs: VCF format data for each sample, where all variant sites and reference blocks are listed for all contigs
  • Ref-blocks: Summary depth/genotype quality information in the gVCF across a genomic interval where the sample has no variant sites

Common and rare variation:

  • WGS data contains SNP/indel calls across the allele frequency spectrum, whereas array (MAF > 0.1%)
  • Low coverage WGS (< 10x) aided by imputation
  • High coverage WGS (> 20x) robust across all SNP/indels
  • Multiple levels of analysis
    • Pulling out common variants to include in GWAS meta-analysis (GWAMA)
    • Pulling out rare coding variants to include in Exome meta-analysis
    • Looking at rare non-coding variation (good luck!)

Repetitive regions:

  • WGS contains challenging genomic regions often ignored in array and exome capture sequencing
  • Regions with highly repetitive sequence that is difficult for short read sequening to align properly to the reference genome library
    • Telomeres/centromeres of each chromosome
    • Segmental duplications - large repetitive chumks
    • Low complexity regions (LCRs)
  • These regions are often excluded early on in an analysis




QC pipeline steps

Index of steps:

  1. Dataset QC prior to manipulating VCF
  2. Dataset pre-filtering
  3. High quality hard call subset of the data
  4. Outlier sample QC part 1
  5. Sex check sample QC
  6. Identity-by-descent filtering
  7. Principal components analysis
  8. Principal components filtering
  9. Outlier sample QC part 2
  10. Variant QC
  11. Assessing variant QC



Dataset QC prior to manipulating VCF

  • GOAL: Understanding the project/phenotype data you are working with
    • List and understand all sample phenotypes provided
    • Match up phenotype file IDs with genetic data IDs
    • Resolve any inconsistencies before moving forward
    • Start spreadsheet/table of datasets, sequence platforms
    • Know what genome reference your sequence data is mapped (most WGS data is hg38 / GRCh38)
    • Determine whether you are using a dense or sparse matrix table
    • Look over the VCF meta-data at the top of the VCF file
    • Read VCF and phenotype/sample file into Hail
    • Generate sample QC metrics from raw VCF
    • Write up paragraph of sample collection and descriptives

Categorical variables often analyzed in datasets

  • Cohort
  • Sequencing wave
  • C-Project (Broad specific)
  • Sequencing Plate
  • Sex
  • Affection status
  • Continental ancestry groupings / reported ancestry

Quantitative parameters often analyzed in datasets

  • Age
  • Top genetic principal components
  • Number of variant sites (n_snps)
  • Singleton rate (n_singleton)
  • Het / hom variant ratio (r_het_hom_var)



Dataset pre-filtering

  • GOAL: Remove variants that are highly unlikely to be analyzed
    • Remove variants that fail VQSR (or AS-VQSR)
    • Remove variants in telomeres / centromeres
    • Remove variants in low-complexity regions
    • Remove variants in segmental duplication regions
    • Generate sample QC metrics from pre-filtered VCF
    • VEP annotate remaining sites
      • Usually just need gene name and consequence



High quality hard call subset of the data

  • GOAL: Use a smaller subset of variants (i.e. 50-500k variants) to analyze relatedness (IBD) and population ancestry (PCA)
    • Bi-allelic SNVs only
    • Common variants (MAF > 0.1%)
    • Call rate > 99%
    • LD-pruned with a cutoff of r2 = 0.1
  • PROTIP: Use gnomAD / CCDG hg38 PCA variant list
    • HQ calls selected from large multi-ancestry WGS cohort
    • 224,591 variants ld-pruned with gnomAD v3: gs://gnomad/sample_qc/ht/genomes_v3.1/ld_pruned_combined_variants.ht
    • 259,482 variants pre-ld-pruning: gs://gnomad/sample_qc/ht/genomes_v3.1/pre_ld_pruning_combined_variants_without_washu.ht
  • If running from raw VCF calls
    • Run split.multi to maximize bi-allelic sites
    • Filter to PASS sites and SNV
    • Run variant.qc to get MAF



Outlier sample QC part 1

  • GOAL: Remove samples that are contaminated or have poor sequencing levels
    • Use pre-filtered dataset
    • Plot values below before using DEFAULT filters to ensure you are not throwing away large amounts of samples
      • freemix contamination filtering (DEFAULT > 0.05)
      • Chimeric read filtering (DEFAULT > 0.05 )
      • Call rate filtering (DEFAULT < 0.85)
      • Mean Depth coverage filtering (DEFAULT < 15)
      • Small insert size: Median insert size < 250bp



Sex check sample QC

  • GOAL: remove samples where genotype sex does not equal reported sex
    • Filter out variants within PAR coordinates (https://en.wikipedia.org/wiki/Pseudoautosomal_region)
      • Reported males shoud have X chromosome F-statistic from 0.8 to 1
      • Reported females shoud have X chromosome F-statistic from -0.2 to 0.4
      • Remove samples with ambiguous imputed sex (X chromosome F-statistic between 0.4 and 0.8)
    • Large-scale sex check errors are indicative of ID mismatching or upstream data mishandling



Identity-by-descent filtering

  • GOAL: remove 1st and 2nd degree relatives from population based sample
    • Use HQ hardcall dataset
    • Consider which IBD algorithm to use (KING, PC-AiR, PC-relate)
    • Which algorithm you use will depend on sample size, ancestry composition, and knowledge of family pedigrees in data
    • Unsure? Start by trying genetic relatedness using Hail pc-relate
    • Plot proportion of 0 shared vs 1 shared alleles
    • IBD filtering on PI-HAT value (DEFAULT > 0.2)



Principal components analysis

  • GOAL: Determine general ancestry of cohort
    • Use HQ hardcall dataset
    • Consider which PCA strategy to use
      • PCA comparing against self-identified ancestry information (PROS: easy if you have this info CONS: dependent on quality of self-id info)
      • PCA including reference datasets with known ancestry (PROS: good truth dataset CONS: can be hard to harmonize variants)
      • PCA using gnomAD PCs (PROS: no need to merge variants CONS: shrinkage in PCs due to variant mismatch
    • Unsure? Start with Hail hwe_normalized_pca
    • Run with and without reference panel data (1KG or 1KG+HGDP genome reference panel)
    • Create plots of PCs
      • Case / control coloring
      • Cohort coloring
    • Assigning samples to a particular ancestry



Principal components filtering

  • GOAL: match case and controls within a common genetic ancestry
    • If retaining multiple ancestries, make sure to define ancestry groups in phenotype file
    • PCA filtering (no DEFAULT filtering parameters)
      • 2-dimensional centroid approach (using distance from 1KG ancestry mean PC; used by Chia-Yen Chen)
      • pair-matching cases to controls (R package: optmatch; R function: pairmatch(); used by Mitja Kurki)
      • Within each ancestry group, re-run PCA
        • Re-evaluate PC dispersion
        • Add these PCs as covariates to phenotype file



Outlier sample QC part 2

  • GOAL: remove samples within continental ancestry groupings that have unusual variant properties
    • Examine variation in:
    • TiTv ratio
    • Het/Hom ratio
    • Insertion/Deletion ratio
    • Plots with colors defined by assigned ancestry
      • Different ancestries have significant mean differences in these ratios
    • Filter out within cohort outliers (DEFAULT > 4 Std. deviations within a particular ancestry)



Variant QC

  • GOAL: Remove low quality/somatic variants
    • Use pre-filtered VCF with outlier samples removed
    • Filter variants with low call rate (DEFAULT < 95%)
      • Split variant call rate by capture, case/control status, or other category
    • Remove monoallelic variants: no alt or no ref calls
    • Genotype filtering (set to filtered variants to missing)
      • Filter by Depth (DEFAULT < 10)
      • Filter by GQ (DEFAULT < 25)
      • HET calls: Filter by allele depth/balance (AB/AD; DEFAULT < 0.25)
        • Additional pAB filter on higher depth (binomial test - DEFAULT p < 1e-9)
      • HOMREF calls: (AB/AD; DEFAULT > 0.1)
      • HOMALT calls: (AB/AD; DEFAULT < 0.9)
    • Remove variants not in HWE (DEFAULT p < 1e-6)
    • Generate sample QC metrics



Assessing variant QC

  • GOAL: Determine if more stringent variant QC is needed
    • Examine QC parameters across 3 filtering steps:
      • Pre-filtered VCF
      • Sample QC'ed VCF
      • Variant QC'ed VCF
    • QC parameters:
      • Number of SNPs / Indels
      • TiTv ratio
      • Het/Hom ratio
      • Ins/Del ratio
      • Singleton synonymous rate
    • Primary categories:
      • Case/control (should be equal between groups)
      • Cohort (should vary predictably)
    • Determine whether additional variant filtering needs to be done



Sample QC Parameters

From GATK/Picard metadata

  • Freemix Contamination
  • Chimeric read percentage

From Hail Sample QC Annotation Table Primary QC parameters

  • callRate
  • rTiTv
  • rHetHomVar
  • rInsertionDeletion
  • nSingleton
  • dpMean

Secondary QC parameters

  • nCalled
  • nNotCalled
  • nHomRef
  • nHet
  • nHomVar
  • nSNP
  • nInsertion
  • nDeletion
  • nTransition
  • nTransversion
  • dpStDev
  • gqMean
  • gqStDev
  • nNonRef




Variant QC Parameters

Variant Quality

  • VQSLOD - Variant Quality Score in LOD space (or new RF posterior probabilities)
  • pHWE - Hard-Weinberg Equilibrium
  • AC - allele count
  • Median depth
  • QD - quality by depth

Genotype Quality

  • Depth
  • PHRED likelihood (PL)
  • Genotype Quality (GQ - same as PL in joint-called VCF)
  • Allele Depth (AD)




Case/control Whole Genome Sequencing - Burden and Association pipeline

Basic Requirements:

  • A QC-ready annotated matrix table
  • Samples annotated with PCs

Whole Genome Burden

Primary Hypotheses:

  • Do cases have a higher overall burden of rare deleterious variants than controls?
    • Does this burden decrease as allele frequency increase?
  • Is the burden concentrated in genes with evidence of selective constraint?
    • Does the burden effect attenuate outside of these constrained genes?
  • Sanity check: Is the rate of rare variants similar in cases and controls?
    • Highly significant differences suggest QC issues still persist

General Method:

  • Aggregate per-individual counts on selected annotation/allele frequency
  • Testing deleterious coding variant count using logistic regression
  • Include first 10 PCs and sex as covariates

Comparisons:

  • Stratify by cohort
    • Require cases and controls within any cohort designation
  • Stratify by commonly assessed variant annotations
    • Genic / non-genic
    • Conservation/constraint
    • CADD severity prediction
    • Damaging missense (PolyPhen damaging, SIFT deleterious, MPC > 2)
    • Protein truncating variants (frameshift_variant, splice_acceptor_variant, splice_donor_variant, stop_gained)
  • Stratify by Allele frequency
    • Ultra-rare variation: non-gnomAD singletons
    • Dataset + gnomAD AC < 5
    • Doubleton distributions

Infomative Graphics:

  • Forest plots of Odds ratio, 95% CI, and p-value

Gene-based association

Primary Hypotheses:

  • Do any genes associate with our phenotype after correcting for multiple testing?
  • Is there inflation of the median test statistic (lambda)
    • Inflation suggests that there is underlying population stratification/QC not accounted for
    • Can also mean a polygenic signal in well-powered cohorts

General Method:

  • Aggregate per-individual counts on selected annotation/allele frequency
  • Testing per-gene variant count using logistic regression
  • Include first 10 PCs and sex as covariates

Infomative Graphics:

Portal information

  • Example Case/Control WES Portal: SCHEMA

Basic association tests:

  • Fisher's Exact test

  • Poisson rate test

    • Compares variant counts when the mean and variance are equal (generally for de-novo / ultra-rare variants only)
    • poisson.test() in R
  • Logistic (or Firth regression) in Hail

    • Predicts case/control status by allele frequency and allows for covariates
    • Requires some asymptotic assumptions, which can be unmet at low allele counts (< 20)

Mixed model association tests

  • SAIGE

    • Mixed-model assocation with saddle-point approximation (SPA) to control for imbalanced case/control ratio
    • Both individal variant and gene-based tests (using SKAT) available
  • Kernel association - SKAT test

    • R implementation: SKAT
    • Random effects model with covariates
    • Handles alleles at various frequencies
  • Example of logistic and fishers-exact test using Hail (courtesy of Mitja Kurki) Hail Topic

  • R packages for rare variant testing

burden_testing.Rmd

  • A walkthrough of various gene-based tests, strategies, and Hail commands for testing rare variant burden