fgvieira/ngsF

Suggestions for dealing with batch effects

davidecarlson opened this issue · 2 comments

Dear Dr. Vieira,

I'm a new ngsF user, and I just ran the program on my samples. I found a very stark pattern in the resulting inbreeding coefficients:

0.000000
0.000000
0.034296
0.000000
0.000000
0.000000
0.475336
0.913737
0.620628
0.623424
0.397937
0.649332
0.731328
0.487009
0.480450
0.689157
0.706471
0.610795
0.577949
0.664756
0.868305
0.516195
0.503299

These data are ddRAD-seq from a plant species that undergoes a considerable amount of selfing, so I expect fairly high inbreeding coefficients. Notably, the first six samples all have very low F. These six samples were all sequenced in a separate run from the remaining samples and also have lower coverage (~5-10x) than the rest (~15-20x).

I am guessing that this batch effect is somehow responsible for the lower F scores in the first six samples. Does this seem plausible? If so, I figure that I should separate these first six samples and analyze them separately. Does that seem like a reasonable approach?
Thanks!
Dave

Hi Dave,

Are all these samples from the same population? ngsF infers inbreeding by looking at deviations from Hardy-Weinberg equilibrium (HWE), so the analysis should be done per population.
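
To illustrate why the per-population point matters: at a biallelic site, inbreeding lowers the expected heterozygote frequency from 2pq to 2pq(1 - F), so F can be read as the proportional deficit of heterozygotes. The sketch below applies that textbook estimator to called genotype counts. It is not ngsF's algorithm (ngsF works on genotype likelihoods), and the helper name `inbreeding_from_counts` and all counts are made up for illustration. Pooling two populations that are each in HWE but differ in allele frequency produces an apparent F well above zero.

```python
def inbreeding_from_counts(n_AA, n_Aa, n_aa):
    """Textbook F = 1 - H_obs / H_exp at one biallelic site.

    Hypothetical helper for illustration only; ngsF itself estimates F
    from genotype likelihoods rather than from called genotypes.
    """
    n = n_AA + n_Aa + n_aa
    if n == 0:
        raise ValueError("no genotypes supplied")
    p = (2 * n_AA + n_Aa) / (2 * n)   # frequency of allele A
    h_exp = 2 * p * (1 - p)           # expected heterozygosity under HWE
    if h_exp == 0:
        raise ValueError("site is monomorphic; F is undefined")
    h_obs = n_Aa / n                  # observed heterozygosity
    return 1 - h_obs / h_exp


# Two made-up populations, each exactly in HWE (p = 0.1 and p = 0.9).
pop1 = (1, 18, 81)
pop2 = (81, 18, 1)
pooled = tuple(a + b for a, b in zip(pop1, pop2))

print(inbreeding_from_counts(*pop1))    # close to 0
print(inbreeding_from_counts(*pop2))    # close to 0
print(inbreeding_from_counts(*pooled))  # clearly positive
```

Neither population is inbred here, yet the pooled sample shows a large heterozygote deficit, which is why mixed-population input can distort F estimates.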

That said, I have actually seen this before, where heterogeneous coverage can lead to some bias in the estimates.
To be sur

Since you have good coverage, the easiest approach would be to downsample the high-coverage samples to ~5-10x. That coverage should be enough to estimate F.
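
A minimal sketch of how the downsampling could be set up, assuming the reads are in BAM files and that `samtools view -s` is used for subsampling (its argument has the form SEED.FRACTION, where the integer part is the random seed and the decimal part is the fraction of reads to keep). The helper names, the depth values, and the file names are hypothetical; the actual mean depth per sample would need to be measured first.

```python
def subsample_fraction(mean_depth, target_depth):
    """Fraction of reads to keep so mean depth drops to target_depth."""
    if mean_depth <= 0 or target_depth <= 0:
        raise ValueError("depths must be positive")
    # Samples already at or below the target are left untouched (fraction 1.0).
    return min(1.0, target_depth / mean_depth)


def samtools_subsample_arg(fraction, seed=42):
    """Build the SEED.FRACTION value expected by `samtools view -s`."""
    frac = round(fraction, 4)
    if not 0 < frac < 1:
        raise ValueError("fraction must be strictly between 0 and 1")
    digits = f"{frac:.4f}".split(".")[1]  # decimal digits only, e.g. "4444"
    return f"{seed}.{digits}"


# Illustrative values: a sample at ~18x brought down to ~8x.
frac = subsample_fraction(mean_depth=18.0, target_depth=8.0)
arg = samtools_subsample_arg(frac)
# Hypothetical file names; the command is printed, not executed.
print(f"samtools view -b -s {arg} sample.bam > sample.sub.bam")
```

Using the same seed for every sample keeps the subsampling reproducible; the fraction is computed per sample from its own mean depth.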

Thanks for your response, and sorry for not following up earlier. It's a little bit hard to define what a population is in this system, but the samples should probably not be considered to be from the same population. I'll try your suggestions, thanks!