tests lead to g.star=0 on all permutations
Closed this issue · 8 comments
Hello -
I've run varcomp.glob() successfully on my SNP data with three levels (prov, pop, site), but I'm having trouble testing the different levels. I've tried all tests at nperm=100, but each g.star value is 0 for all permutations and the p-value = 1. I'm not sure what is going wrong.
varcomp.glob(levels=levels1, loci=loci, diploid = TRUE)
test.within(loci, test=site, within = pop, nperm=100)
test.between.within(data.frame(loci), within = prov, rand.unit = site, test.lev = pop, nperm=100)
test.between(loci, rand.unit=pop, test=prov, nperm=100)
Any advice?
The syntax looks correct. Difficult to answer without a toy example, could you post an example? Do you get meaningful results with a subset of loci?
I ran the tests with a subset of 100 SNPs and it worked, I got plausible g-star values and a p-value. Thanks for the suggestion! Seems like my computer couldn't handle testing ~7k SNPs and 221 individuals.
Unfortunately, it seems that my processor/RAM combo can only handle testing 100 SNPs (anything higher and I get meaningless g-star/p-values), but the varcomp.glob f-stat coefficients at 100 SNPs are 0.02 to 0.05 off from the full dataset coefficients. Do you have suggestions for a minimum number of SNPs for accurate subsetting?
I cannot reproduce the behavior you described with simulated data. For instance with 1000 SNPs:
library(hierfstat)
dat<-sim.genot(size=100,nbloc=1000,nbal=2,nbpop=5,N=10000,mig=0.001,mut=1e-06)
wc(dat)
$`FST` [1] 0.02529191
$FIS [1] -0.001185896
t.b<-test.between(dat[,-1],test.lev=dat[,1],rand.unit=rep(1:50,each=10))
t.w<-test.within(dat[,-1],within=dat[,1],test.lev=rep(1:50,each=10))
str(t.w)
List of 2 $ g.star: num [1:100] 69845 69784 69028 69895 70473 ... $ p.val : num 0.53
str(t.b)
List of 2 $ g.star: num [1:100] 4939 5468 5380 5206 6102 ... $ p.val : num 0.01
I need a reproducible example to investigate further. Could you send one?
@nehasavant any heads up / progress? Can I close this issue? Cheers
Hi - Sorry for the delay. I'm not sure how to send a reproducible example without sending you my actual files. One difference from your example dataset is that I have some missing data (~3.4%).
Indeed, with 7k SNPs and 3-4% missing data, it is very likely that all individuals have at least one missing genotype, and the core of the testing is through g.stats.glob
, which filters out any individual with a missing genotype. As it is difficult /unclear how to impute properly missing genotypes, I recommend in this situation to use bootstrapping over loci (e.g. boot.vc
). If the CI does not overlap with 0, you can consider the statistic to differ significantly from 0.
Thank you for this recommendation. The boot.vc
worked well, even when using all ~7k SNPS:
boot_all_1000 <- boot.vc(levels=levels[, -1],loci=data2[, -1], diploid=TRUE, nboot=1000, quant=c(0.025,0.5,0.975))
Within boot_all_1000$ci
, I found that for all pairings the CI does not overlap with 0. To confirm, this means that all statistics at all levels are significantly different from zero and therefore levels 1, 2 and 3 effect on genetic structure are all significant?
This is interesting since with the subset of 100 loci, I found that only level 3 (which was tested with test.within()
) had a significant effect on genetic structure. I'm assuming more loci and permutations were needed.
Sounds correct. Difficult to say more without seeing the data. Best wishes