EM-mosaic detects mosaic point mutations that contribute to congenital heart disease
Alexander Hsieh, Sarah U Morton, Jon AL Willcox, Joshua M Gorham, Angela C Tai, Hongjian Qi, Steven DePalma, David McKean, Emily Griffin, Kathryn B Manheimer, Daniel Bernstein, Richard W Kim, Jane W Newburger, George A Porter Jr., Deepak Srivastava, Martin Tristani-Firouzi, Martina Brueckner, Richard P Lifton, Elizabeth Goldmuntz, Bruce D Gelb, Wendy K Chung, Christine Seidman, J G Seidman, Yufeng Shen
https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-020-00738-1
Also available as a preprint: https://www.biorxiv.org/content/10.1101/733105v1
- to develop a method (EM-mosaic) to detect mosaic (post-zygotic) SNVs from exome sequencing data
- to estimate the contribution of mosaicism to congenital heart disease
- variant_calling/ - contains scripts used to call de novo variants from BAM files
- detection_pipeline/ - EM-mosaic pipeline used to QC de novo variants and call mosaic SNVs
- analysis/ - contains scripts used to analyze detected mosaics
- ADfile.example_minimal.txt - example de novo callset for testing (contains only columns required for mosaic detection: id, chr, pos, ref, alt, refdp, altdp)
- ADfile.example_full_annotation.txt - contains additional columns and annotations used to QC/filter variants
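
As a rough illustration of the minimal input layout, the Python sketch below parses a callset with the seven required columns (id, chr, pos, ref, alt, refdp, altdp) and derives total depth and variant allele fraction (VAF) for each site. The tab delimiter, the header row, the helper name `read_adfile`, and the two example rows are assumptions made for illustration; they are not taken from ADfile.example_minimal.txt or from the pipeline scripts.

```python
import csv
import io

# Hypothetical two-row callset; the values are made up for illustration.
EXAMPLE = (
    "id\tchr\tpos\tref\talt\trefdp\taltdp\n"
    "sampleA\t1\t123456\tC\tT\t40\t38\n"
    "sampleB\t7\t654321\tG\tA\t90\t12\n"
)

# Columns the README lists as required for mosaic detection.
REQUIRED = ["id", "chr", "pos", "ref", "alt", "refdp", "altdp"]


def read_adfile(handle):
    """Parse a tab-delimited callset and attach total depth and VAF to each row."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    rows = []
    for row in reader:
        refdp, altdp = int(row["refdp"]), int(row["altdp"])
        depth = refdp + altdp
        row["dp"] = depth
        # VAF is the fraction of reads supporting the alternate allele.
        row["vaf"] = altdp / depth if depth > 0 else float("nan")
        rows.append(row)
    return rows


variants = read_adfile(io.StringIO(EXAMPLE))
for v in variants:
    print(v["id"], v["dp"], round(v["vaf"], 3))
```

A site such as the second row, with a VAF well below 0.5, is the kind of de novo call that a mosaic detection step would examine more closely.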
- detection_pipeline/generate_candidates.POST.R - detects candidate mosaic SNVs by calculating posterior odds for each de novo SNV following annotation and QC
# Call mosaics from input de novo callset (ADfile.example_minimal.txt) with output prefix 'test', posterior odds cutoff '10', and cohort size '2400'
Rscript generate_candidates.POST.R ADfile.example_minimal.txt test 10 2400
# Output Files
* test.candidates.txt = contains mosaic variants with posterior odds > cutoff
* test.denovo.txt = contains all de novo variants in input passing filters, annotated with posterior odds score, etc.
* test.all_denovos.txt = contains all de novo variants (including sites failing filters), annotated with exclusion criteria
* test.outlier_samples.txt = contains IDs of all outlier samples (based on cohort size)
# Output Plots
* test.EM.pdf = histogram of variant allele fraction of input variants, with EM-estimated mosaic fraction
* test.dp_vs_vaf.pdf = scatterplot of DP vs. VAF, colored by germline/mosaic
* test.vaf_vs_post.df = scatterplot of VAF vs. log-scaled posterior odds, colored by germline/mosaic
* test.fdr_min_nalt.pdf = scatterplot of DP vs. Nalt, with line indicating FDR-based minimum Nalt value
* test.overdispersion.pdf = overdispersion plot
* test.QQ.pdf = QQ plot
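
The idea of scoring each de novo SNV by posterior odds of mosaic versus germline origin can be illustrated with a simplified Python sketch. It is not the implementation in generate_candidates.POST.R. It assumes that alternate read counts follow a beta-binomial distribution with concentration `theta`, that germline heterozygous sites have mean VAF 0.5, and that mosaic sites share a single mean VAF `mosaic_vaf`; the function names, the parametrization, and the example numbers are all hypothetical.

```python
from math import exp, lgamma


def log_betabinom(k, n, mean, theta):
    """Log PMF of a beta-binomial with mean `mean` and concentration `theta`.

    Assumed parametrization: alpha = mean * theta, beta = (1 - mean) * theta.
    """
    a, b = mean * theta, (1.0 - mean) * theta
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    log_beta_num = lgamma(k + a) + lgamma(n - k + b) - lgamma(n + a + b)
    log_beta_den = lgamma(a) + lgamma(b) - lgamma(a + b)
    return log_choose + log_beta_num - log_beta_den


def posterior_odds(refdp, altdp, mosaic_vaf, mosaic_fraction, theta):
    """Posterior odds of mosaic vs. germline = prior odds * likelihood ratio."""
    n = refdp + altdp
    log_lr = (log_betabinom(altdp, n, mosaic_vaf, theta)
              - log_betabinom(altdp, n, 0.5, theta))
    prior_odds = mosaic_fraction / (1.0 - mosaic_fraction)
    return prior_odds * exp(log_lr)


CUTOFF = 10  # same role as the posterior odds cutoff passed on the command line

# Hypothetical sites: one with VAF near 0.12, one with VAF near 0.5.
low_vaf_site = posterior_odds(refdp=90, altdp=12, mosaic_vaf=0.12,
                              mosaic_fraction=0.1, theta=59)
balanced_site = posterior_odds(refdp=40, altdp=38, mosaic_vaf=0.12,
                               mosaic_fraction=0.1, theta=59)
print(low_vaf_site > CUTOFF, balanced_site > CUTOFF)
```

In this sketch the prior mosaic fraction and the mosaic VAF are fixed by hand; the pipeline describes estimating the mosaic fraction by EM, which is not reproduced here.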
- analysis/power_analysis.R - estimates mosaic detection power as a function of sample average depth
# Plot detection power as a function of sample average depth with model parameters determined from generate_candidates.POST.R
# Estimate the true frequency of mosaics with VAF>0.1 using 'test.denovo.txt' (with parameters LR cutoff = 41, Theta estimate = 59, cohortsize = 2400, sample avg depth = 80)
Rscript power_analysis.R test.denovo.txt test 41 59 2400 80
# Output Files
* test.pwsite.log.txt = data used for estimating detection power given variant site DP
* test.pwsamp.log.txt = data used to plot detection power as a function of sample avg DP
* test.vaf_pw_adj.log.txt = data used in adjusting mosaic counts
# Output Plots
* test.power_sample_dp.pdf = plot of detection power curves as a function of sample avg DP
* test.vaf_pw_adj.pdf = histogram of adjusted mosaic counts, raw mosaic rate, adjusted mosaic rate
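
To make the notion of detection power concrete, the Python sketch below estimates, for a single site depth, the probability that a mosaic variant with a given VAF would produce an alternate read count that passes a likelihood-ratio cutoff. It is a simplified stand-in, not the calculation in power_analysis.R: the beta-binomial read model, the single-site view (rather than sample average depth), the `min_nalt` threshold of 6, and all function names are assumptions for illustration.

```python
from math import exp, lgamma, log


def log_betabinom(k, n, mean, theta):
    """Log PMF of a beta-binomial with mean `mean` and concentration `theta`."""
    a, b = mean * theta, (1.0 - mean) * theta
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    log_beta_num = lgamma(k + a) + lgamma(n - k + b) - lgamma(n + a + b)
    log_beta_den = lgamma(a) + lgamma(b) - lgamma(a + b)
    return log_choose + log_beta_num - log_beta_den


def detection_power(depth, vaf, lr_cutoff, theta, min_nalt=6):
    """Probability that a mosaic site at `depth` with true `vaf` is called.

    A site counts as called when it has at least `min_nalt` alternate reads
    and its mosaic-vs-germline likelihood ratio exceeds `lr_cutoff`.
    """
    log_cut = log(lr_cutoff)
    power = 0.0
    for k in range(min_nalt, depth + 1):
        log_mos = log_betabinom(k, depth, vaf, theta)
        log_germ = log_betabinom(k, depth, 0.5, theta)
        if log_mos - log_germ > log_cut:
            # Weight each passing read count by its probability under the mosaic model.
            power += exp(log_mos)
    return power


# LR cutoff 41 and theta 59 mirror the example command; depths are illustrative.
for depth in (40, 80, 200):
    print(depth, round(detection_power(depth, vaf=0.1, lr_cutoff=41, theta=59), 3))
```

Under these assumptions, power rises with depth because deeper sites are more likely to reach the minimum alternate read count while remaining clearly separated from the germline expectation of 0.5.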
- SAMtools (http://samtools.sourceforge.net/)
- GATK (https://software.broadinstitute.org/gatk/download/)
- ANNOVAR (http://annovar.openbioinformatics.org/en/latest/)
- IGV (https://software.broadinstitute.org/software/igv/)
- Alexander Hsieh - alexanderhsieh
- Yufeng Shen
Thanks to Jiayao and Hongjian for help with the variant calling and IGV automation.
- Jiayao Wang - explorerwjy
- Hongjian Qi - 7lagrange
README.md template from PurpleBooth (https://gist.github.com/PurpleBooth)