uab-cgds-worthey/quac

Samples with high GC content

Closed this issue · 5 comments

We receive WGS samples with high mean GC content (as reported by qualimap) fairly often, but it is not clear what causes the high GC. We also do not know what the consequences are for downstream analysis.
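For reference, a minimal sketch of pulling the mean GC value out of a qualimap report. It assumes the text of qualimap's `genome_results.txt` contains a line of the form `GC percentage = 41.52%`; the function name `mean_gc_percent` is hypothetical and not part of QuaC or qualimap.

```python
import re

# Assumption: the qualimap report has a line like "GC percentage = 41.52%".
_GC_LINE = re.compile(r"GC percentage\s*=\s*(\d+(?:\.\d+)?)\s*%")


def mean_gc_percent(report_text):
    """Return the mean GC percentage found in the report text, or None if absent."""
    match = _GC_LINE.search(report_text)
    return float(match.group(1)) if match else None
```

Reading the file contents and passing them to `mean_gc_percent` would give one number per sample, which can then be collected into a table.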

Note: This is not a QuaC issue; instead this has to do with sample QC.

I looked at chr1 coverage for Musc*** samples with high GC. Both showed variable coverage along the length of the chromosome, whereas the expected scaled coverage is ~1.0. The indexcov figure below shows only two high-GC samples (LW001647 and LW001654, both part of the Pad** samples), but the same pattern is common in other high-GC samples as well. LW001643, which has normal GC, is shown for reference, with coverage close to 1.0.

image

The same variability shows up in coverage across the reference. The plots below are from qualimap. Note how the coverage (red line) is shaky for the samples with high GC.

  • LW001643
    image

  • LW001647
    image

  • LW001657
    image
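To put a number on "shaky" instead of relying on the plots alone, here is a rough sketch that summarizes per-bin scaled coverage (where ~1.0 is expected, as in the indexcov view). The function name, the `tolerance` of 0.15, and the choice to skip zero-depth bins are all assumptions for illustration, not anything indexcov or qualimap prescribes.

```python
from statistics import mean, pstdev


def coverage_variability(scaled_depths, tolerance=0.15):
    """Summarize how far per-bin scaled depths stray from the expected 1.0."""
    # Assumption: zero-depth bins (gaps, centromere) are skipped, not counted as noise.
    depths = [d for d in scaled_depths if d > 0]
    if not depths:
        raise ValueError("no bins with coverage")
    avg = mean(depths)
    # Coefficient of variation: spread relative to the mean depth.
    cv = pstdev(depths) / avg
    # Fraction of bins further than `tolerance` from the expected value of 1.0.
    off_target = sum(abs(d - 1.0) > tolerance for d in depths) / len(depths)
    return {"mean": avg, "cv": cv, "frac_off_target": off_target}
```

A sample with stable coverage should give a `cv` near 0 and a small `frac_off_target`; the high-GC samples above would be expected to score noticeably higher on both.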

See this table for samples' mean GC content.
image
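One possible way to decide which samples in such a table count as "high GC" is to compare each sample to the rest of the cohort. The sketch below uses a median/MAD-based robust z-score; the function name and the cutoff of 3.0 are assumptions, and this is not how the samples above were actually picked.

```python
from statistics import median


def flag_high_gc(gc_by_sample, z_cutoff=3.0):
    """Return sorted sample IDs whose mean GC is unusually high for the cohort."""
    values = list(gc_by_sample.values())
    med = median(values)
    # Median absolute deviation: a spread estimate that outliers barely affect.
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread in the cohort, nothing to flag
    # 1.4826 rescales MAD so it is comparable to a standard deviation.
    scale = 1.4826 * mad
    return sorted(s for s, v in gc_by_sample.items() if (v - med) / scale > z_cutoff)
```

The input is a mapping of sample ID to mean GC percentage; only samples on the high side are flagged, since low GC is not the concern here.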

I did not have much success finding literature on this topic. The indexcov paper highlights a sample with high coverage variability and notes that "samples like this one will have many spurious CNV calls"; however, it doesn't discuss what causes the high coverage variability.

image

Btw, here is how LW001647 (red) and LW001654 (blue) compare to other Musc*** (Pad***) samples. Clearly they deviate from other samples' profiles.

  • Coverage histogram
    image

  • Cumulative genome coverage
    image

  • GC content distribution
    image

While I think atm that high GC content might not have a significant effect on small variant calling (not fully convinced though!), I expect it to cause issues with other types of variant calls. We need to revisit this topic at some point.

I was curious whether Musc** samples tend to have high %GC. I did some analysis, but the results don't support this notion (on a quick look, at least). Well, the UAB samples do, but not the Pad*** samples.
image

Code is here on Cheaha: /data/project/worthey_lab/projects/experimental_pipelines/mana/small_tasks/qc_highGC