UW-GAC/GENESIS

What does correctKin do?

Hello! It looks like correctKin is only used when the sample size is small. What does it do, and what changes when the small-sample option is set to TRUE versus FALSE? Specifically, I'm looking at #54.

As mentioned in a previous issue, I'm running a single query against all the other samples, so I have two blocks: one with a single sample and one with many. I want to make sure correctKin is something I can safely skip, and that I'm setting the small-sample option correctly.

Hi! That is accurate: correctKin is only used when the option small.samp.correct = TRUE. This small sample correction is an adjustment made after the initial kinship estimates are calculated. It tries to keep a small number of samples with unique ancestry from having too much leverage in the PC-based ancestry adjustment. When small.samp.correct = TRUE, the returned kinship estimates are the values after this adjustment. I've only observed high-leverage samples causing over-adjustment in small samples. It's hard to give an exact number, but I would typically recommend small.samp.correct = TRUE with fewer than 5000 samples (if over-adjustment isn't an issue in the sample, the small sample adjustment just won't alter the estimates significantly). We recently made small.samp.correct = TRUE the default, because a lot of users seem to run PC-Relate on small numbers of individuals, and the small sample adjustment can be beneficial to them.

Note that if you have more than one sample block, you cannot use this adjustment. The code automatically sets the small.samp.correct parameter to FALSE (with a message printed to the console) when the number of sample blocks is more than 1.

Since you are running a single query against all other samples and have two blocks, you cannot use small.samp.correct (which calls the correctKin function). You can safely set this parameter to FALSE for your analysis.
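
To make the rule above concrete, here is a minimal sketch of the described behavior. It is written in Python for illustration only and is not the GENESIS R source; the function name, argument names, and message text are all hypothetical.

```python
# Illustrative restatement of the rule described in this thread, not GENESIS code.
# The function name, argument names, and message wording are hypothetical.

def resolve_small_samp_correct(requested, n_sample_blocks):
    """Return the setting that would actually be used for the small sample correction."""
    if requested and n_sample_blocks > 1:
        # The thread says the option is switched off with a console message
        # whenever there is more than one sample block.
        print("small.samp.correct set to FALSE: more than one sample block")
        return False
    return requested


# One query sample in its own block plus a block with everyone else = 2 blocks.
effective = resolve_small_samp_correct(True, 2)
```

Under this reading, asking for the correction with two blocks has the same effect as setting it to FALSE up front.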

Got it! That makes sense, and thanks for your detailed reply as always.

Regarding correctK2 and correctK0, are those also there to correct small sample sizes?

No, correctK2 and correctK0 should always be used if you are computing IBD probability estimates.

correctK2 adjusts for deviations from expected heterozygosity (as measured by the inbreeding coefficient) for each individual in the pair. (There's also an additional small sample size adjustment built into the function, but it won't run when small.samp.correct = FALSE.)

correctK0 is used to choose the "better" k0 estimator for each sample pair. In testing PC-Relate when it was written, we found that one estimator gave better results for 1st degree relatives, while another estimator (a function of the estimated kinship and k2 values) gave better results for more distant relatives. PC-Relate initially calculates the first estimator for all pairs (since we don't know the relatedness a priori). correctK0 then replaces the k0 value with (1 - 4*kin + k2) for every pair whose kinship estimate is < 2^(-5/2), which is about 0.177.
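
The replacement rule described above can be sketched as follows. This is a plain-Python illustration, not the GENESIS implementation: the function name is hypothetical and the input values are made up purely to show which pairs get replaced.

```python
# Illustrative only: restates the k0 replacement rule from this thread.
# Not the GENESIS source; names and example values are hypothetical.

KIN_THRESHOLD = 2 ** (-5 / 2)  # about 0.177


def correct_k0_sketch(kin, k0, k2):
    """Keep the initial k0 for pairs at or above the kinship threshold;
    otherwise replace it with 1 - 4*kin + k2."""
    corrected = []
    for kin_i, k0_i, k2_i in zip(kin, k0, k2):
        if kin_i < KIN_THRESHOLD:
            corrected.append(1 - 4 * kin_i + k2_i)
        else:
            corrected.append(k0_i)
    return corrected


# Made-up estimates for three pairs: kinship 0.25, 0.125, and 0.0.
pairs_kin = [0.25, 0.125, 0.0]
pairs_k0 = [0.02, 0.55, 0.97]
pairs_k2 = [0.0, 0.0, 0.0]
result = correct_k0_sketch(pairs_kin, pairs_k0, pairs_k2)
```

Here only the first pair keeps its initial k0, because its kinship estimate is above the threshold; the other two are recomputed from kin and k2.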

That is super helpful! Thank you!