tshmak/lassosum

Error in splitvec.from.bfile(bfile) : length(pvec) == length(bfile) is not TRUE

rainajia opened this issue · 6 comments

Hi, I ran the pipeline chromosome by chromosome, using the same ref.file and test.file for each chromosome, then merged the outputs together using "merge" in a loop.
However, when I used "validate", it threw the error:
Error in splitvec.from.bfile(bfile) : length(pvec) == length(bfile) is not TRUE
Could you explain what might have caused the error?

Can you give me your entire script?

Hi, my original code is attached below. I have realised that merging the lassosum.pipeline outputs did not work when I merged them in a loop, but it did work when I called merge(out1, out2, out3...out22). However, validate is taking very long to run on the merged "out". My phenotype has a large sample size of ~400k. Which validation method would be the most efficient for large sample sizes?

for (i in 1:22) {
  print(paste0("now processing chromosome ", i))
  bfile <- paste0("./Chr", i)                 # test bfile for chromosome i
  rfile <- paste0("../Chr", i, "_Random25k")  # reference bfile for chromosome i

  tmp <- lassosum.pipeline(
    cor = cor,
    chr = ss$CHR,
    pos = ss$POS,
    A1 = ss$A1,
    A2 = ss$A2,
    ref.bfile = rfile,
    test.bfile = bfile,
    max.ref.bfile.n = 25000,
    LDblocks = LDblocks,
    cluster = cl
  )

  # Pairwise merge inside the loop (the pattern that preceded the error)
  if (i == 1) {
    out <- tmp
  } else {
    out <- merge(out, tmp)
  }
}
target.res <- lassosum::validate(out, pheno = as.data.frame(pheno), covar = as.data.frame(cov))
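
The single-call merge that worked, merge(out1, out2, out3...out22), does not require typing out 22 object names: the per-chromosome outputs can be collected in a list and passed to merge in one call with do.call. The sketch below only illustrates that calling pattern. It uses a stand-in S3 class (toy_pipeline, new_toy and merge.toy_pipeline are hypothetical names, not part of lassosum) so that it runs without the package, and it assumes that the real merge method accepts any number of objects through `...`, as the working call in this thread suggests.

```r
# Stand-in for a per-chromosome pipeline result (hypothetical, NOT lassosum's class).
new_toy <- function(chr) structure(list(chr = chr), class = "toy_pipeline")

# Stand-in merge method that takes any number of objects in a single call.
merge.toy_pipeline <- function(...) {
  parts <- list(...)
  new_toy(unlist(lapply(parts, `[[`, "chr")))
}

# Collect the per-chromosome results in a list instead of merging pairwise.
outs <- lapply(1:3, new_toy)

# One merge call over the whole list; same as merge(outs[[1]], outs[[2]], outs[[3]]).
merged <- do.call(merge, outs)
print(merged$chr)
```

With real results, the loop would store each result as outs[[i]] <- tmp and finish with out <- do.call(merge, outs). This is an untested assumption about lassosum, not something confirmed in this thread.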

So are you still getting this error splitvec.from.bfile(bfile) : length(pvec) == length(bfile) is not TRUE. And if so, at which stage?

I don't get this error anymore when I run validate(out) where "out <- merge(out1, out2, out3...out22)". The error occurred previously when I ran validate(out) where out was built by merging the lassosum.pipeline output from each chromosome in a for loop, as shown in the code above. Sorry about the confusion. My current question is which validation method to use for large samples. I have 400k samples with matched genotype and phenotype, and the previous run of validate(out, pheno, covar) has been running for over 9 hours with 40 cores. I was wondering if this is normal behaviour, or whether there is a better way to parallelise it?

Yes, calculating PGS can take a long time with a large sample size. One way to speed up the calculation is to use multiprocessing (see here). Another way is to ensure that the covar and the pheno are in exactly the same order as the samples in test.bfile. (Maybe you also need to ensure there are no missing values, but I can't remember if that's the case.) If everything matches exactly, you will not see the message Calculating PGS..., and it should be very fast.
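
As a rough sketch of the ordering step, the phenotype and covariate tables can be reordered to follow the sample order in the test bfile's .fam file before calling validate. The helper name align_to_fam is hypothetical, and the sketch assumes that pheno and covar are data frames with FID and IID columns. The example writes a small temporary .fam file so that it runs on its own.

```r
# Hypothetical helper: reorder a data frame that has FID and IID columns so
# its rows follow the sample order of a PLINK .fam file.
align_to_fam <- function(df, fam_path) {
  # .fam is whitespace-delimited with no header; columns 1 and 2 are FID and IID.
  fam <- read.table(fam_path, header = FALSE, colClasses = "character")
  fam_key <- paste(fam[[1]], fam[[2]], sep = "\t")
  df_key <- paste(df$FID, df$IID, sep = "\t")
  idx <- match(fam_key, df_key)
  if (anyNA(idx)) {
    stop(sum(is.na(idx)), " sample(s) in the .fam file are missing from the table")
  }
  out <- df[idx, , drop = FALSE]
  rownames(out) <- NULL
  out
}

# Toy data: a three-sample .fam file and a phenotype table in a different order.
fam_path <- tempfile(fileext = ".fam")
writeLines(c("F1 I1 0 0 1 -9", "F2 I2 0 0 2 -9", "F3 I3 0 0 1 -9"), fam_path)
pheno <- data.frame(FID = c("F3", "F1", "F2"),
                    IID = c("I3", "I1", "I2"),
                    pheno = c(0.3, 0.1, 0.2),
                    stringsAsFactors = FALSE)

pheno_aligned <- align_to_fam(pheno, fam_path)
print(pheno_aligned)
```

The same helper could be applied to the covariate table. With one bfile per chromosome, each .fam file should list the same samples in the same order, so aligning to any one of them would be enough under that assumption.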

Thanks very much. Calculating PGS... was exactly what I have been seeing. I will double-check these points.