Mykrobe-tools/mykrobe

Missed call due to high coverage

mbhall88 opened this issue · 2 comments

I have an example where we miss a very clear resistant call due to LOW_GT but the depth on alt vs ref is very obvious. I appreciate it is hard when we are drawing lines based on percentiles. Anyway, I thought this might be a nice example to do some testing/debugging with.

This is a sample from Sylvianne in Madagascar. It was one of 5 samples in a multiplexed nanopore run. It has a HUGE amount of reads (fastq file is 3.3Gb - compressed!) They are testing out MGIT in the lab and were using mykrobe to compare results. The MGIT and phenotyping for this sample show resistance to Isoniazid, Rifampicin, and Ethambutol.

When I run mykrobe ok the full set of reads, it returns resistant to INH only. When I randomly subsample the file to 150x depth mykrobe says resistant to INH and RIF. When I filter out anything (from the subsampled file) that doesn't map to H37Rv I also get resistant to RIF and INH.

The variant in question is rpoB_S450L-TCG761154TTG. The information for this variant from the full and filtered runs are:

Full read set

            "rpoB_S450L-TCG761154TTG": {
                "variant": "ref-S450X?var_name=TCG761154TTG&num_alts=8&ref=NC_000962.3&enum=0&gene=rpoB&mut=S450X",
                "genotype": [
                    1,
                    1
                ],
                "genotype_likelihoods": [
                    -22153.337254734015,
                    -44.788636474841695
                ],
                "info": {
                    "coverage": {
                        "reference": {
                            "percent_coverage": 75.0,
                            "median_depth": 1,
                            "min_non_zero_depth": 1,
                            "kmer_count": 20,
                            "klen": 21
                        },
                        "alternate": {
                            "percent_coverage": 100.0,
                            "median_depth": 166,
                            "min_non_zero_depth": 132,
                            "kmer_count": 3109,
                            "klen": 20
                        }
                    },
                    "expected_depths": [
                        160
                    ],
                    "contamination_depths": [],
                    "filter": [
                        "LOW_GT_CONF"
                    ],
                    "conf": 22109
                },
                "_cls": "Call.VariantCall"

Filtered read set

            "Rifampicin": {
                "predict": "R",
                "called_by": {
                    "rpoB_S450L-TCG761154TTG": {
                        "variant": null,
                        "genotype": [
                            1,
                            1
                        ],
                        "genotype_likelihoods": [
                            -7098.776446933491,
                            -9.626747418092155
                        ],
                        "info": {
                            "coverage": {
                                "reference": {
                                    "percent_coverage": 0.0,
                                    "median_depth": 0,
                                    "min_non_zero_depth": 0,
                                    "kmer_count": 0,
                                    "klen": 21
                                },
                                "alternate": {
                                    "percent_coverage": 100.0,
                                    "median_depth": 42,
                                    "min_non_zero_depth": 37,
                                    "kmer_count": 812,
                                    "klen": 20
                                }
                            },
                            "expected_depths": [
                                41.0
                            ],
                            "contamination_depths": [],
                            "filter": [],
                            "conf": 7089
                        },
                        "_cls": "Call.VariantCall"
                    }
                }
            },

All of the data from this run is stored on noah. I'll send through the paths on Slack.

Also, it looks like we've also missed a clear ETH call

Full read set

            "embB_M306V-ATG4247429GTG": {
                "variant": "ref-M306V?var_name=ATG4247429GTG&num_alts=1&ref=NC_000962.3&enum=0&gene=embB&mut=M306V",
                "genotype": [
                    1,
                    1
                ],
                "genotype_likelihoods": [
                    -14407.132009802384,
                    -426.5460799265036
                ],
                "info": {
                    "coverage": {
                        "reference": {
                            "percent_coverage": 100.0,
                            "median_depth": 4,
                            "min_non_zero_depth": 3,
                            "kmer_count": 82,
                            "klen": 21
                        },
                        "alternate": {
                            "percent_coverage": 100.0,
                            "median_depth": 105,
                            "min_non_zero_depth": 82,
                            "kmer_count": 2203,
                            "klen": 21
                        }
                    },
                    "expected_depths": [
                        160
                    ],
                    "contamination_depths": [],
                    "filter": [
                        "LOW_GT_CONF"
                    ],
                    "conf": 13981
                },
                "_cls": "Call.VariantCall"
            },

Filtered read set

            "embB_M306V-ATG4247429GTG": {
                "variant": "ref-M306V?var_name=ATG4247429GTG&num_alts=1&ref=NC_000962.3&enum=0&gene=embB&mut=M306V",
                "genotype": [
                    1,
                    1
                ],
                "genotype_likelihoods": [
                    -5175.289309876459,
                    -89.3627577744604
                ],
                "info": {
                    "coverage": {
                        "reference": {
                            "percent_coverage": 0.0,
                            "median_depth": 0,
                            "min_non_zero_depth": 0,
                            "kmer_count": 0,
                            "klen": 21
                        },
                        "alternate": {
                            "percent_coverage": 100.0,
                            "median_depth": 25,
                            "min_non_zero_depth": 19,
                            "kmer_count": 515,
                            "klen": 21
                        }
                    },
                    "expected_depths": [
                        41.0
                    ],
                    "contamination_depths": [],
                    "filter": [
                        "LOW_GT_CONF"
                    ],
                    "conf": 5086
                },
                "_cls": "Call.VariantCall"
            },

Maybe we need to have - in addition to the GT percentile stuff - some kind of overriding "hard-coded" rules that say something like if Ref depth is less than 2 and total depth is > 15 then ignore filters? (numbers are for illustration here)

There seem to be two issues here.

  1. Better results (sensitivity) when subsampling v high coverage sample
  2. Better results (sensitivity) when removing NON Mtb reads

1 I can believe and might be fixable by having a GCP threshold which drops when coverage is high. There is another issue related to that, can't find on my phone right now.
2 is surprising