dcgerard/updog

Sequencing Error Rate Estimation in F1 Populations

Closed this issue ยท 6 comments

{updog} can sometimes have poor performance when all of the following occur:

  1. model = "f1" or model = "s1",
  2. Offspring sequenced at low depth, but
  3. Parents are sequenced to large depth, and
  4. All offspring have the same genotype. So the parental genotypes could be AA and aa (for diploids).

The issue is that {updog} can prefer to estimate a large sequencing error rate and place the parents to both have genotypes of 1.

This should only affect the model = "f1" and model = "s1" options.

eaa231c patches things a little bit, by placing an upper bound on the sequencing error rate.

However, full correction will need me to dive back into the code. Right now, I don't use the parental sequences when optimizing the sequencing error rate (just the offspring sequences). Adjusting the objective to include parental sequences should fix the problem.

@Cristianetaniguti, could you take a look at the f1 branch of {updog} and see if this mostly resolves the issues that you discovered?

You can install that repo via:

remotes::install_github(repo = "dcgerard/updog", ref = "f1")

Hey @dcgerard,

Sorry for taking a long time to test it. But I finally did!

I also tested another idea that you gave me when we met in San Diego, about using a combination of the updog genotype probabilities with a global error of 5%. I included as an error rate in OneMap HMM: 1 - (1 - estimated genotype probability by updog)*(1-global error)

I built linkage maps for f1 populations using a subset of the data and the combinations:

  • markers from freebayes and GATK
  • using updog from the main branch and from the f1 branch
  • two diploid datasets, one with a depth ~6x and another ~90x
  • using the genotype probabilities from updog, a global error rate of 5%, and the combination of updog genotype probabilities and a global error rate of 5%
  • Removing non-informative (when both parents are homozygous or at least one of them is missing) markers before running updog

My conclusions are:

  • Removing the non-informative before performing the updog improves the map quality while using the f1 and main branch, mainly in the high-depth dataset.
  • Except in one of the scenarios (map with freebayes markers and using updog probabilities + global error), using the f1 branch provided maps slightly worse than using the main branch
  • Using the genotype probabilities + global error provided the same or better maps than using only the global error ๐Ÿ‘

I hope this answers your question. I think it is still worth using the main branch and filtering the non-informative before using updog.

Thanks for the work @Cristianetaniguti! This is very interesting (and counterintuitive, since the f1 branch should give better results when using model = "f1"). Some questions:

  1. You used model = "f1" in both cases, right?
  2. These results are only for diploids, right?

Hey @dcgerard,

  1. Yes, I used the exact same code. I only changed the updog version.
  2. Yes, I am still adapting Reads2Map to polyploids.

These issues are only details, overall updog performs pretty well and it usually improves low-depth data genotypes. Let me know if you have other ideas to test. I can do it easily now.

Thank you so much! I really appreciate you working on this!

Closing for now. Don't think there is anything more to do here at the moment.