Incorrect allele ("2")
jasperlinthorst opened this issue · 13 comments
Hi Robert, I'm playing with QUILT to impute some low coverage samples using the HRC reference data.
With version 0.1.9 I run into the problem that some variant records contain incorrect GT fields, for example:
chr22 18173093 . G A . PASS EAF=1.411;INFO_SCORE=0.001;HWE=1;ERC=0.99992;EAC=3e-05;PAF=0.99997 GT:GP:DS:HD 2|0:0.169,-1.16,1.991:2.822:1.575,0.407
I guess, it should be 1|1, but it now makes a reference to allele '2' which doesn't exist.
With version 0.1.8 the output is not phased, but all GT fields are correct.
chr22 18173093 . G A . PASS EAF=1.422;INFO_SCORE=0;HWE=1;ERC=0.99992;EAC=3e-05;PAF=0.99997 GT:GP:DS:HD 1/1:0.169,-1.182,2.013:2.844:1.602,1.296
I hope it's just a small bug with respect to how the phased output is written, but if I'm misinterpreting things please let me know!
Kind regards,
Jasper
Hi Jasper,
Thanks for the message. Somethings gone quite wrong here (on my end almost certainly), the GP is giving out of bound values (should be between 0 and 1 each and sum to 1). I suspect it’s something to do with the forward backward algorithm for the full run but I’m surprised I haven’t caught all those bugs yet through usage or tests. Beyond the GT values, looking at the GP value not summing to 1 and/or with negative values, is this issue prevalent throughout the file (other SNPs)? Is this just 1 sample with this problem or many? Does it reproduce if you rerun? What’s the sample coverage? Did you change any other settings beyond the obvious ones to give paths?
Thanks,
Robbie
Hi Robbie, thanks for your quick reply!
Yes, I also saw this problem with other imputed SNPs and throughout different samples.
I also noted the negative GP values, but I also noted that, while hard to interpret as a genotype probability, they do always add up to 1. Re-running yields slightly different results everytime of course, but throughout the entire file the problem remains.
The sample coverages are mostly between 0.1x and 1x.
I don't really use any special combination of options, but I dropped the genetic_map_file, posfile and phasefile options and added the output_filename. I do use the QUILT_prepare_reference setup, after which I impute all samples individually, so the bamlist just contains 1 sample everytime.
I'll do some more testing today, I'll let you know if I can figure out what goes wrong...
OK thanks for this, I'm going to dig into this, see if there are obvious things on my end, try running it through a lot of samples to see if I broke something recently, etc. Thanks for bringing this to my attention
One thing that might be useful, you can throw in a check right after here
https://github.com/rwdavies/QUILT/blob/master/QUILT/R/functions.R#L1647
and then re-build and re-install using
./scripts/build-and-install/R
that dosage has range between 0 and 1, and throw an error otherwise (something like the following)
if ((min(dosage) < -1e-5) | (1 + 1e-5) < max(dosage)) {
print(range(dosage))
stop("Dosage out of bounds on forward-backward full iteration")
}
I'll probably add that into master once I can figure out how to print better error messaging should that situation occur. Assuming that throws an error, something has indeed gone wrong with the full forward-backwards HMM
I think I have a more subtle version of this captured, I'll take a look and report back
I've found at least one problem that corrects my local problem. I'm hopeful that single bug explains it (it removes my bug), but I want to understand why my tests didn't catch this - something wrong or insufficient tests.
If you're curious, internally, I represent the reference haplotypes (0's and 1's) 32 as a time (so as an int). Because they're often the same across 50000 haplotypes, using nMaxDH (maximum number of distinct haplotypes, usually set around~250) most common and store those, in compressed and uncompressed format (the 250 of the,), and then the reference haplotypes for that 32 SNP are stored as an int of 1:nMaxDH. Now for some of those reference haplotypes, that won't work, as there are more than nMaxDH distinct reference haplotypes over a span of 32 SNPs, so those are analyzed separately in the code. For the paper, nMaxDH was set pretty high (the ~250), so there were very few haplotypes in the reference not captured, but I've since made it determined by QUILT and often use a lower value, so it runs faster. This error was in the HMM for the special case - I use a lot of speedups e.g. delaying multiplications in the C++ code to minimize the number of operations / performs operations more efficiently, but I forgot to port one of them over to the special cases. Hence, I think, why I didn't notice it during the paper but it's coming up pretty often for you now.
Anyway I hope that's it, I will investigate some more and get back once resolved. Thanks again for pointing this out.
OK I found another similar small bug in a similar vein, again only for special cases. These were not caught in the tests which weren't expansive enough (they catch it now). I'm going to re-run a bunch of samples and see if the issue comes up again. I've pushed what I think are the right changes with my last commit, so it hopefully should work as expected now!
OK I ran a bunch of jobs last night and they worked without issue. Are you able to pull the latest commits, install using ./scripts/build-and-install.R
, and re-try your imputation jobs, and see if the issue is solved? You don't need to rebuild the reference
OK thanks, I'll wait to hear from you, then if all goes well Monday I'll push a version change.
All seems to work fine now, so I'll close the issue. A new release would be appreciated as well. Thanks a lot for your help!
Jasper
Thanks I greatly appreciate it, I'll make a new release today or tomorrow