Sequencing Error Rate Estimation in F1 Populations
Closed this issue ยท 6 comments
{updog}
can sometimes have poor performance when all of the following occur:
model = "f1"
ormodel = "s1"
,- Offspring sequenced at low depth, but
- Parents are sequenced to large depth, and
- All offspring have the same genotype. So the parental genotypes could be AA and aa (for diploids).
The issue is that {updog}
can prefer to estimate a large sequencing error rate and place the parents to both have genotypes of 1.
This should only affect the model = "f1"
and model = "s1"
options.
eaa231c patches things a little bit, by placing an upper bound on the sequencing error rate.
However, full correction will need me to dive back into the code. Right now, I don't use the parental sequences when optimizing the sequencing error rate (just the offspring sequences). Adjusting the objective to include parental sequences should fix the problem.
@Cristianetaniguti, could you take a look at the f1 branch of {updog}
and see if this mostly resolves the issues that you discovered?
You can install that repo via:
remotes::install_github(repo = "dcgerard/updog", ref = "f1")
Hey @dcgerard,
Sorry for taking a long time to test it. But I finally did!
I also tested another idea that you gave me when we met in San Diego, about using a combination of the updog genotype probabilities with a global error of 5%. I included as an error rate in OneMap HMM: 1 - (1 - estimated genotype probability by updog)*(1-global error)
I built linkage maps for f1 populations using a subset of the data and the combinations:
- markers from freebayes and GATK
- using updog from the main branch and from the f1 branch
- two diploid datasets, one with a depth ~6x and another ~90x
- using the genotype probabilities from updog, a global error rate of 5%, and the combination of updog genotype probabilities and a global error rate of 5%
- Removing non-informative (when both parents are homozygous or at least one of them is missing) markers before running updog
My conclusions are:
- Removing the non-informative before performing the updog improves the map quality while using the f1 and main branch, mainly in the high-depth dataset.
- Except in one of the scenarios (map with freebayes markers and using updog probabilities + global error), using the f1 branch provided maps slightly worse than using the main branch
- Using the genotype probabilities + global error provided the same or better maps than using only the global error ๐
I hope this answers your question. I think it is still worth using the main branch and filtering the non-informative before using updog.
Thanks for the work @Cristianetaniguti! This is very interesting (and counterintuitive, since the f1 branch should give better results when using model = "f1"
). Some questions:
- You used
model = "f1"
in both cases, right? - These results are only for diploids, right?
Hey @dcgerard,
- Yes, I used the exact same code. I only changed the updog version.
- Yes, I am still adapting Reads2Map to polyploids.
These issues are only details, overall updog performs pretty well and it usually improves low-depth data genotypes. Let me know if you have other ideas to test. I can do it easily now.
Thank you so much! I really appreciate you working on this!
Closing for now. Don't think there is anything more to do here at the moment.