Missed call due to high coverage
mbhall88 opened this issue · 2 comments
I have an example where we miss a very clear resistant call due to LOW_GT but the depth on alt vs ref is very obvious. I appreciate it is hard when we are drawing lines based on percentiles. Anyway, I thought this might be a nice example to do some testing/debugging with.
This is a sample from Sylvianne in Madagascar. It was one of 5 samples in a multiplexed nanopore run. It has a HUGE amount of reads (fastq file is 3.3Gb - compressed!) They are testing out MGIT in the lab and were using mykrobe to compare results. The MGIT and phenotyping for this sample show resistance to Isoniazid, Rifampicin, and Ethambutol.
When I run mykrobe ok the full set of reads, it returns resistant to INH only. When I randomly subsample the file to 150x depth mykrobe says resistant to INH and RIF. When I filter out anything (from the subsampled file) that doesn't map to H37Rv I also get resistant to RIF and INH.
The variant in question is rpoB_S450L-TCG761154TTG
. The information for this variant from the full and filtered runs are:
Full read set
"rpoB_S450L-TCG761154TTG": {
"variant": "ref-S450X?var_name=TCG761154TTG&num_alts=8&ref=NC_000962.3&enum=0&gene=rpoB&mut=S450X",
"genotype": [
1,
1
],
"genotype_likelihoods": [
-22153.337254734015,
-44.788636474841695
],
"info": {
"coverage": {
"reference": {
"percent_coverage": 75.0,
"median_depth": 1,
"min_non_zero_depth": 1,
"kmer_count": 20,
"klen": 21
},
"alternate": {
"percent_coverage": 100.0,
"median_depth": 166,
"min_non_zero_depth": 132,
"kmer_count": 3109,
"klen": 20
}
},
"expected_depths": [
160
],
"contamination_depths": [],
"filter": [
"LOW_GT_CONF"
],
"conf": 22109
},
"_cls": "Call.VariantCall"
Filtered read set
"Rifampicin": {
"predict": "R",
"called_by": {
"rpoB_S450L-TCG761154TTG": {
"variant": null,
"genotype": [
1,
1
],
"genotype_likelihoods": [
-7098.776446933491,
-9.626747418092155
],
"info": {
"coverage": {
"reference": {
"percent_coverage": 0.0,
"median_depth": 0,
"min_non_zero_depth": 0,
"kmer_count": 0,
"klen": 21
},
"alternate": {
"percent_coverage": 100.0,
"median_depth": 42,
"min_non_zero_depth": 37,
"kmer_count": 812,
"klen": 20
}
},
"expected_depths": [
41.0
],
"contamination_depths": [],
"filter": [],
"conf": 7089
},
"_cls": "Call.VariantCall"
}
}
},
All of the data from this run is stored on noah. I'll send through the paths on Slack.
Also, it looks like we've also missed a clear ETH call
Full read set
"embB_M306V-ATG4247429GTG": {
"variant": "ref-M306V?var_name=ATG4247429GTG&num_alts=1&ref=NC_000962.3&enum=0&gene=embB&mut=M306V",
"genotype": [
1,
1
],
"genotype_likelihoods": [
-14407.132009802384,
-426.5460799265036
],
"info": {
"coverage": {
"reference": {
"percent_coverage": 100.0,
"median_depth": 4,
"min_non_zero_depth": 3,
"kmer_count": 82,
"klen": 21
},
"alternate": {
"percent_coverage": 100.0,
"median_depth": 105,
"min_non_zero_depth": 82,
"kmer_count": 2203,
"klen": 21
}
},
"expected_depths": [
160
],
"contamination_depths": [],
"filter": [
"LOW_GT_CONF"
],
"conf": 13981
},
"_cls": "Call.VariantCall"
},
Filtered read set
"embB_M306V-ATG4247429GTG": {
"variant": "ref-M306V?var_name=ATG4247429GTG&num_alts=1&ref=NC_000962.3&enum=0&gene=embB&mut=M306V",
"genotype": [
1,
1
],
"genotype_likelihoods": [
-5175.289309876459,
-89.3627577744604
],
"info": {
"coverage": {
"reference": {
"percent_coverage": 0.0,
"median_depth": 0,
"min_non_zero_depth": 0,
"kmer_count": 0,
"klen": 21
},
"alternate": {
"percent_coverage": 100.0,
"median_depth": 25,
"min_non_zero_depth": 19,
"kmer_count": 515,
"klen": 21
}
},
"expected_depths": [
41.0
],
"contamination_depths": [],
"filter": [
"LOW_GT_CONF"
],
"conf": 5086
},
"_cls": "Call.VariantCall"
},
Maybe we need to have - in addition to the GT percentile stuff - some kind of overriding "hard-coded" rules that say something like if Ref depth is less than 2 and total depth is > 15 then ignore filters? (numbers are for illustration here)
There seem to be two issues here.
- Better results (sensitivity) when subsampling v high coverage sample
- Better results (sensitivity) when removing NON Mtb reads
1 I can believe and might be fixable by having a GCP threshold which drops when coverage is high. There is another issue related to that, can't find on my phone right now.
2 is surprising