tshmak/lassosum

Output the reweighted beta?

wavefancy opened this issue · 8 comments

Dear developers,

Is there a way to export the reweighted (LD-panel-adjusted) beta values? This would give us more flexibility in downstream analysis.

Thanks much!
Wallace

Hi, are you talking about the output from lassosum.pipeline (ls) or validate.lassosum.pipeline (vls)? For the former, I think you can find the beta in ls$beta. For the latter, the best beta (following validation) is given in vls$best.beta.
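
For illustration, a minimal sketch of where these live (ls$beta holds the reweighted betas; str() is just base R for inspecting the structure, which I'd suggest checking on your own object):

ls  <- lassosum.pipeline(...)   # as in your own run
str(ls$beta)                    # the reweighted (LD-adjusted) betas
vls <- validate(ls)
str(vls$best.beta)              # betas of the best model after validation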

Dear developers,

I am able to export the best validation betas. However, there is a problem when I use these betas to compute the PRS with plink. The PGS in plink is sum(beta * copies_of_effect_allele). I am not sure whether the betas exported from lassosum follow this convention.

I found that the number of SNPs in also.in.refpanel equals the number of best.beta values, so I extracted the 'valid' SNPs from the summary statistics and aligned the reweighted betas to them as follows:

# out <- lassosum.pipeline(...)
# vls <- validate(out)
r <- cbind(ss[out$also.in.refpanel, ]$snpid,
           ss[out$also.in.refpanel, ]$a1,
           ss[out$also.in.refpanel, ]$a2,
           summary(vls$best.beta))

However, when this output is used with plink2 --score, the result is not right: the scores are quite different from those produced by lassosum.

Can you please give me a clue about this? I really want to calculate the PGS based on lassosum's best validation betas, so that I can easily do this on UK Biobank-scale data.

Thank you very much for your help.

Best
Wallace

I suppose running lassosum.pipeline on the UKBB data wasn't a problem? The PGS should already be calculated when you run lassosum.pipeline, provided you specify test.bfile. Running validate on the resulting object should then be very fast, provided it is not recalculating the PGS; it is the recalculation of the PGS that can be slow on very large datasets. When you run validate (or pseudovalidate or splitvalidate), watch whether there is a message saying "Calculating PGS...".

There are a number of ways to speed up running lassosum.

  1. If lassosum.pipeline is slow, and your .bed files are organized by chromosome, you can run them separately by chromosome and then use merge to combine the resulting objects. Note that you must use the same ref.bfile throughout.
  2. Again for lassosum.pipeline only, you can reduce the size of the reference sample with the sample option. Note that if you run lassosum.pipeline across different chromosomes, you should set the same seed using set.seed(...) to make sure you select the same subsample every time.
  3. Use cluster=... to enable parallel processing. Note that the memory requirement can go up when this is set.
  4. (1), (2), and (3) together should make it possible to run lassosum within half an hour or so even for UKBB-scale data (see the sketch after this list). Note that you can use keep.test and keep.ref to specify the subsamples within your .bed files to use as your target sample and reference panel, respectively. These two may overlap if your summary statistics are from an external dataset.
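
Putting (1), (2), and (3) together, a rough sketch (the summary-statistics vectors cor, chr, pos, A1, A2 and the bfile/keep objects are placeholders to fill in from your own data):

library(lassosum)
library(parallel)
cl <- makeCluster(10, type = "FORK")  # (3) parallel processing; memory use goes up
set.seed(1234)                        # (2) same reference subsample on every chromosome
ls <- lassosum.pipeline(cor = cor, chr = chr, pos = pos,
                        A1 = A1, A2 = A2,
                        LDblocks = "EUR.hg19",
                        ref.bfile = ref.bfile, keep.ref = keep.ref,     # reference panel
                        test.bfile = test.bfile, keep.test = keep.test, # target sample
                        sample = 5000,  # (2) subsample the reference panel
                        cluster = cl)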

If you are unsure, please post your script here so I can have a look. Please specify the rough dimensions of the .bed files you are using. There should be no need to export the betas to plink to calculate the PGS.

Hi Timothy,

Thank you very much for your help and guidance.

I am able to apply lassosum to UK Biobank-scale data, provided that I split the data into batches of 10K samples.

  1. For determining the betas that give the best PGS, test.bfile is required; do you have any idea how many samples should be used in this step?
  2. Do you have any suggestions on how many samples should be used for the LD reference panel, to balance efficiency and accuracy?

Thanks much!
Wallace

Hi Wallace,

I tried running lassosum.pipeline on 487409 participants of the UKBB. Chromosome 22 (20,000 SNPs) took around 180 seconds. Chromosome 1 (~120,000 SNPs) took around 500 seconds. This is with 10-fold parallelization. So if you can further parallelize by chromosome, the whole thing should take less than half an hour. Here are my options:

ls <- lassosum.pipeline(cor = cor,
                        LDblocks = "EUR.hg19",
                        sample = 5000,
                        chr = chr, pos = pos,
                        snp = snp,
                        A1 = A1, A2 = A2,
                        test.bfile = bfile,  # 487409 participants
                        trace = 2,
                        cluster = cl)        # cl <- makeCluster(10, type="FORK")

Note that I was using a subset of 5000 for the reference panel.

Now running validate on ls took a similar amount of time:

v <- validate(ls, pheno=Pheno, covar=covar)

Note that length(Pheno) and nrow(covar) are the same as nrow.bfile(bfile), i.e. 487409. In this way, lassosum doesn't recompute the PGS, which can be costly if the number of SNPs is large.
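
As a quick sanity check before calling validate (nrow.bfile being the lassosum helper mentioned above):

# If these don't match, validate will fall back to recomputing the PGS
stopifnot(length(Pheno) == nrow.bfile(bfile))
stopifnot(nrow(covar) == nrow.bfile(bfile))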

Running lassosum by chromosomes

If you run lassosum by chromosome, you need to merge the results together at a later stage, e.g.:

ls1 <- lassosum.pipeline(...) # Chromosome 1
saveRDS(ls1, file="chr1.rds")
ls2 <- lassosum.pipeline(...) # Chromosome 2
saveRDS(ls2, file="chr2.rds")
ls1 <- readRDS("chr1.rds")
ls2 <- readRDS("chr2.rds")
ls <- merge(ls1, ls2)
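
To scale this to all 22 autosomes, you can fold the same pairwise merge over the saved files, e.g. (a sketch assuming the chr*.rds naming above):

files <- paste0("chr", 1:22, ".rds")
ls <- Reduce(merge, lapply(files, readRDS))  # pairwise merges, left to right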

It may be the case that you have all the chromosomes in one big .bed file. You can still run lassosum by chromosome. Say your summary statistics data.frame (call it SS) has a column called "CHR" and the first few lines look like this:

CHR   POS       pval
1     1234444   0.002
1     2222344   0.0045
2     12534444  0.0073
2     2212344   0.00241
...

You can separate SS into subsets and run lassosum separately, e.g.:

ss1 <- subset(SS, CHR==1)
ls1 <- lassosum.pipeline(chr=ss1$CHR, pos=ss1$POS, ...)
ss2 <- subset(SS, CHR==2)
ls2 <- lassosum.pipeline(chr=ss2$CHR, pos=ss2$POS, ...)
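
The same pattern can be wrapped in a loop. A sketch, where the cor, A1, and A2 column names in SS and the elided arguments are assumptions to adapt to your own data:

ls.list <- lapply(unique(SS$CHR), function(i) {
  ss <- subset(SS, CHR == i)
  lassosum.pipeline(chr = ss$CHR, pos = ss$POS,
                    cor = ss$cor, A1 = ss$A1, A2 = ss$A2,  # assumed column names
                    LDblocks = "EUR.hg19",
                    ref.bfile = ref.bfile, test.bfile = test.bfile)
})
ls <- Reduce(merge, ls.list)  # combine all chromosomes into one object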

I wouldn't suggest splitting your .bed file by samples. Better to split by SNPs/chromosomes.

Note for running splitvalidate/pseudovalidate

Unlike validate, splitvalidate/pseudovalidate involve recomputing the PGS, which can be costly in terms of time. Removing the need to recompute the PGS for splitvalidate is on my list of future work.

Hi Timothy,

Thank you very much for this detailed explanation.

I am very interested in running lassosum in parallel, chromosome by chromosome. There is no problem generating the chromosome-specific .rds files and merging them. However, for the final validation step and the export of the PRS, do we need one huge .bed file aggregating all chromosomes? It would be great if the validation step could also be run chromosome by chromosome, followed by an aggregation step. That way we could keep the .bed files split by chromosome and would not need to merge them, since merging .bed files with many samples is also time-consuming.

By the way, following your tutorial, I am able to handle the UKB project well: I ran lassosum.pipeline chromosome by chromosome, merged the results, and validated on one big merged .bed file (including all chromosomes).

Best regards
Wallace

Yes, validate doesn't support chromosome-by-chromosome operation, because the final PRS must be aggregated over all chromosomes. However, validate does not actually use the .bed file if it is not recalculating the PGS. Therefore, if the pheno (and covar) you specify are of the same dimension as the original test.bfile (optionally subsetted by keep.test), it shouldn't matter how big the .bed file is. Just do not include the test.bfile or the keep.test option when you run validate, i.e. include them only at the stage of running lassosum.pipeline.
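
Concretely, the final step then reduces to (with ls being the merged per-chromosome object from above):

# test.bfile / keep.test were supplied to lassosum.pipeline only;
# validate reuses the PGS already stored in ls and never touches the .bed file
v <- validate(ls, pheno = Pheno, covar = covar)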