PRS from lassosum does not tally with PRS calculated manually
wkhonottingham opened this issue · 9 comments
Hi, I have followed the suggestion to apply validated beta into new file as below
out2 = subset(out, out$s=v$best.s, out$lambda=v$best.lambda)
v2 = validate(out2, pheno=pheno.read[,3], covar = covar.read, test.bfile=newfile)
At the same time, we tried to generate the PRS using v$best.beta and the genotype in the "newflie" using
X %*% v$best.beta, where X is the genotype in the newfile, coded as 0/1/2.
However, v2$best.pgs is not the same as PRS generated using X %*% v$best.beta. Could you please shed some light on this issue? The "--fill-missing-a2" function seems to be no longer available in plink, the missing data in X is replaced by 0, which is consistent as what's being done in lassosum. We also make sure that the A2 in X is the same as the A2 in newfile.
We have tested our scripts on other PRS generation and encountered no problem. I am really grateful if you could give us some advice on this issue.
Many thanks!
Seems related to #17.
But I am not computing PRS using plink. I compute using the formula X %*% v$best.beta where X is the genotype data in 0/1/2. If I understand correctly, v$best.beta is the best weights, which can be merged with out$sumstats to get the snp information including allele information. I then make sure data my "X" aligned according to the "A1" in out$sumstats before compute PRS "manually" using script instead of lassosum. I expect the outcome to be similar to v$best.pgs but the two are very different. Not sure if I have misunderstood something here...=)
If possible, could you try it with the example data attached with lassosum, provided in the directory here: system.file("data", package="lassosum")
? Let us know the exact steps you took to obtain the different results.
Brilliant idea! Let me look into it. Thanks!
Hi all,
I do not know if this issue has been resolved, but I can confirm with own data. Without knowing lassosum internals and also, being new to polygenic risk score analysis, I was able to get a 99.2% correlation between v$best.pgs
and X %*% v$best.beta
by:
- Reversing the genotypes in X
- Centering (not scaling) the outcome of
X %*% v$best.beta
:
Code should look something like:
Y <- X
Y[X==2] <- 0
Y[X==0] <- 2
my.pgs <- scale(`Y %*% v$best.beta`,scale=FALSE)
I was able to achieve >99.9 correlation between manual PGS and v$best.pgs
with the same process using the package test data, although with some tweaking as v$best.beta
does not contain betas for all markers in the bfile
and there isn't some standard indexing (apart from marker locations) to make a mapping (would be great to add rownames
to out$sumstats
by the way!).
I am not sure if this makes sense or I am completely wrong (due to my first steps in PGS analysis), but it would be great to clarify this.
Thank you again @tshmak for all this great work!
Panos
Hi @pmoulos
As with the OP, why don't you try reproducing the results using the example data attached with lassosum? This would make it much clearer what the differences are.
To get the SNPs that are included in out$sumstats
, I believe you can use the out$test.extract
vector. See here for some limited documentation.
Hi @tshmak,
Thank you for the prompt reply!
Now, I believe that I have performed the test with the package data before my first post. Maybe I have done something wrong of course. Here is what I did briefly:
library(lassosum)
library(data.table)
library(snpStats)
# Prepare the run
cwd <- getwd()
setwd(system.file("data", package="lassosum"))
ss <- fread("summarystats.txt")
ref.bfile <- "refpanel"
test.bfile <- "testsample"
LDblocks <- "EUR.hg19"
# Convert p-values to SNP correlations
cor <- p2cor(p=ss$P_val,n=60000,sign=log(ss$OR_A1))
# Lassosum predictor
out <- lassosum.pipeline(
cor=cor,
chr=ss$Chr,
pos=ss$Position,
A1=ss$A1,
A2=ss$A2,
ref.bfile=ref.bfile,
test.bfile=test.bfile,
LDblocks = LDblocks
)
v <- validate(out,plot=FALSE)
hist(v$best.pgs)
Now let's try the manual case
geno <- read.plink(bed="refpanel.bed",bim="refpanel.bim",fam="refpanel.fam")
X <- geno$genotypes
Y <- as(X,"numeric") # Converts SnpMatrix object to 0/1/2 genotypes
a.pgs <- Y[,out$test.extract] %*% v$best.beta
hist(a.pgs)
The distribution looks quite different
cor(v$best.pgs,a.pgs)
# -0.3134254
Let's try with the "rownames" approach
sumstats <- out$sumstats
# SNP names in Y with first A2 and then A1
rownames(sumstats) <- paste0(sumstats$chr,":",sumstats$pos,":",
sumstats$A2,":",sumstats$A1)
Intersect colnames
of Y
and rownames
of sumstats
length(intersect(colnames(Y),rownames(sumstats)))
# 1193
I believe this should be 1253 == nrow(sumstats)
?
Let's try different naming based on location only
tmp <- strsplit(colnames(Y),":")
# New Y names not unique but ok for our purposes
locNamesY <- sapply(tmp,function(x) paste0(x[1],":",x[2]))
locNamesSS <- paste0(sumstats$chr,":",sumstats$pos)
length(intersect(locNamesY,locNamesSS))
# Still 1193
So let's forget for a moment the out$test.extract
and try with rownames
:
common <- intersect(colnames(Y),rownames(sumstats))
betas <- v$best.beta
names(betas) <- rownames(sumstats)
And recalculate PGS
b.pgs <- Y[,common] %*% betas[common]
hist(b.pgs)
Looks much better, but with less SNPs in the PGS calculation, but "reversed"
cor(b.pgs,v$best.pgs)
# -0.9999526 # Perfect but reversed
Now, if we reverse the genotypes in Y
Z <- Y
Z[Y==2] <- 0
Z[Y==0] <- 2
c.pgs <- Z[,common] %*% betas[common]
hist(c.pgs)
Much better and in the correct "order"
cor(c.pgs,v$best.pgs)
# 0.9999526 # Perfect
Now, there may be some issues with the reading/interpretation of PLINK files? I understand that there has been a previous discussion with @privefl and his PLINK writing function before in a closed issue. However, I believe that I made the calculations using the package test data.
It would be nice to clarify what may be going on or some suggestions. I apologize in advance if I have got the whole thing wrong.
Best,
Panos
Thanks for the detailed examples.
I can confirm that the way snpStats reads the plink file, at least with the "as.numeric" conversion, is different from lassosum. (You can also try the undocumented lassosum:::readbfile function.) So whereas lassosum counts the number of A1 alleles, snpStats counts the A2 alleles. Perhaps this is indicated by the fact that you need to put in "A2" before "A1" in the rowname construction for snpStats?
Furthermore, lassosum tries to match summarystats and test.bfile to determine the common set of SNPs used in the end. (If the SNP is in test.bfile
but not in ref.bfile
, it will just use the s=0 shrunk coefficient for that SNP. out
also has an object called also_in_refbfile
which indicates that the SNP is not only in summarystats but also in ref.bfile.) I think in your example, you only considered sumstats and ref.bfile
rather than test.bfile
.
Finally there may also be very minor numerical differences due to the fact that lassosum performs the multiplication using C++, possibly at the same time as estimation, and may yield slightly different results from R.