pcrelateToMatrix slow on 50k samples
Closed this issue · 10 comments
Hello, I am running pcrelateToMatrix on 5k samples and it has been running for several days.
Could you advise what I can do to run it faster? thresh=NULL
doesn't seem to help. Thanks!
That seems unusual; it should not be taking that long.
- What version of GENESIS are you using?
- How many rows are in the pcrelate
kinBtwn
data frame? - Have you tried setting the threshold to a higher value, so there are fewer samples in each block? Note that if you are using the default setting of
scaleKin=2
, the threshold applies to the rescaled values, so if you want to make blocks of 3rd-degree relatives and above, your threshold would be2*2^(-9/2) = 0.088
Thank you for your help!
- GENESIS_2.22.2
> nrow(pcrel$kinBtwn)
[1] 1249425066
- No, I haven't. Just to be sure, do you suggest to run
pcrelateToMatrix(pcrel, scaleKin=2, thresh=0.088)
?
Yes, that's what I would suggest. With thresh=NULL
, the algorithm will consider any kinship value >0 a related pair. Since you have so many relationships, I think that's why it's taking so long. Setting a different threshold will eliminate the pairs with very low kinship values, and give you a block-diagonal matrix where the samples in each block are more closely related. You can experiment with different threshold values.
It made a progress! In the end, it encountered an error as below. And, it did finish successfully with a higher threshold. But I don't think I understood how to select a good threshold. Would it be 2*2^(-4/2) to make blocks of 2nd-degree relatives and above?
Creating block matrices for clusters...
Error in if (n > 0) c(NA_integer_, -n) else integer() :
missing value where TRUE/FALSE needed
In addition: Warning messages:
1: In rep.fac * nx : NAs produced by integer overflow
2: In .set_row_names(as.integer(prod(d))) :
NAs introduced by coercion to integer range
In addition, I encountered an error below while trying to build a null model using the GRM created from the high threshed in the step above. Could you help me understand and resolve the issues? Thanks much for your help!
Error in .local(x, ...) :
internal_chm_factor: Cholesky factorization failed
In addition: Warning message:
In .local(x, ...) :
Cholmod warning 'not positive definite' at file ../Cholesky/t_cholmod_rowfac.c, line 430
Hello - going back a couple of messages:
nrow(pcrel$kinBtwn)
[1] 1249425066
Something seems off here. If you have 5K samples, then kinBtwn
should only have 12,497,500 rows. Your number is off by a factor of 100. How many rows are in the kinSelf
object?
Ah, I have 50k samples. Thanks for checking and sorry for the confusion!
After trying with different parameters, I was able to build a null model successfully. I ended up using thresh = 0.3
. That's higher than what you recommended. What would be implications to use a high thresh in building GRM?
I'm a bit concerned that there is an issue with your PC-Relate results -- have you tried making any plots of the estimates? For example, a histogram of estimated pairwise kinship coefficients, or a hexbin plot of kinship coefficients vs K0 (if you estimated the IBD sharing probabilities), or a histogram of the estimated inbreeding coefficients. A threshold of 0.3 is really high, so unless you have a ton of 1st and 2nd degree relatives in your sample, you really shouldn't need the threshold to be that high.
This dataset is a subset of UKB exome and kin vs K0 is plotted below (sampled 10M rows). How would downstream analyses (e.g. fitting null model and association tests) be affected by the high threshold? Random effects from distantly related pairs are ignored? GRM gets orders of magnitude larger even with small changes (e.g. threshold of 2.7) and I had difficulties fitting null model. Thank you.
That kinship vs k0 plot is concerning. The maximum value k0 should take is 1 (it can be a little bit larger than 1 because the estimator is not truncated, but it certainly should not have values as large as you are seeing). I can think of a couple possibilities for why your relatedness estimates would be bad:
- You are adjusting for either too many or not enough ancestry principal components in the PC-Relate analysis. How many PCs did you adjust for and how did you go about selecting them?
- An issue with the genotype data in the GDS file. You might want to look at allele frequency and variant heterozygosity distributions to make sure those seem reasonable and that the genotype data is as expected.