tshmak/lassosum

Negative results

privefl opened this issue · 15 comments

I'm using the lassosum.pipeline / validate / subset / validate strategy of the README (splitting the test set in 2).
I'm getting v2$best.validation.result equals to -0.491. And AUC(v2$best.pgs, v2$pheno - 1) equals to 0.205.
Is this normal? Should I just take the opposite?

Note that the binary phenotype is coded as 1/2 as in standard PLINK format.

That's not normal. However, I am unable to determine what might have gone wrong based on what you reported. If it's ok, can you PM me your dataset and the exact command you used?

I wanted to make a reproducible example with fake data but I'm getting a new error:

Coordinating summary stats with reference panel...
Splitting genome by LD blocks ...
Running lassosum ...
s =  0.2 
Error in lassosum(cor = cor2, bfile = ref.bfile, shrink = s, extract = ref.extract,  : 
  !any(is.na(cor)) is not TRUE

I've no missing value in cor...

Reproducible example:

# devtools::install_github("privefl/bigsnpr")
library(bigsnpr)

set.seed(1)
N <- 1000; M <- 1e4; m <- 20
snp <- snp_fake(N, M)
snp$genotypes[] <- sample(0:2, size = N * M, replace = TRUE)
snp$map$physical.pos <- seq(2, 2e8, length.out = M)
set <- sample(M, m)
beta <- rnorm(m) / sqrt(m)
s <- drop(scale(snp$genotypes[, set]) %*% beta)
y <- ((s + rnorm(m)) > quantile(s, 0.7)) + 0
AUC(s, y)
snp$fam$affection <- y + 1

bed <- snp_writeBed(snp, tempfile(fileext = ".bed"))
rds <- snp_readBed(bed, sub("\\.bed$", "", bed))

snp <- snp_attach(rds)
G <- snp$genotypes
CHR <- snp$map$chromosome
POS <- snp$map$physical.pos
A1 <- snp$map$allele1
A2 <- snp$map$allele2

set.seed(1)
ind.train <- sort(sample(nrow(G), size = N * 0.6))
ind.test <- setdiff(rows_along(G), ind.train)
ind.test.split <- split(ind.test, sample(1:2, length(ind.test), TRUE))

gwas <- big_univLogReg(G, y[ind.train], ind.train)
plot(gwas, type = "Manhattan")

# devtools::install_github("tshmak/lassosum")
library(lassosum)
bed2 <- snp_writeBed(snp, tempfile(fileext = ".bed"),
                     ind.row = ind.test.split[[1]])
bed3 <- snp_writeBed(snp, tempfile(fileext = ".bed"),
                     ind.row = ind.test.split[[2]])
cor <- p2cor(p = predict(gwas, log10 = FALSE), n = length(ind.train), 
             sign = gwas$estim)
out <- lassosum.pipeline(cor = cor, chr = CHR, pos = POS, A1 = A1, A2 = A2,
                         test.bfile = sub("\\.bed$", "", bed2),
                         LDblocks = "EUR.hg19")

v <- validate(out)
out2 <- subset(out, s = v$best.s, lambda = v$best.lambda)
v2 <- validate(out2, test.bfile = sub("\\.bed$", "", bed3))

v2$best.validation.result        ## -0.07595338
AUC(v2$best.pgs, v2$pheno - 1)   ## 0.4441773

Thanks for the example. The !any(is.na(cor)) is not TRUE error turned out to be due to the genotypes in your synthetic example being all A/T. These SNPs all got excluded because they were ambiguous. You can get around this by passing the exclude.ambiguous=F option to lassosum.pipeline and also to the second validate command.

I tried it with these options and found that the resulting v$best.validation.result was 0.05, and v2$best.validation.result was 0.007.

Hum, it was foolish of me to use A/T. I now use C/T -- please reinstall package {bigsnpr}.
I've updated the example above in order to get negative results.

I think I've determined that the problem is because your snp_writeBed function writes in the opposite way to how I interpret a plink .bed file. In particular, I treat 00 as homozygous A1, and 11 as homozygous A2. I think you did it the other way round.

I have

> rawToBits(readBin(bed, what = raw(), n = 5)[4:5])
 [1] 00 00 01 01 01 01 00 00 00 00 00 00 01 01 01 01
> snp$genotypes[1:8]
[1] 0 2 2 0 0 0 2 2

In Plink format, 0,1,2 should correspond to the number of A1 alleles (not A2). Therefore the 16 bits you showed there would mean 2 0 0 2 2 2 0 0 to plink, which is opposite to what you had in snp$genotypes[1:8]. Alternatively, you could swap the A1 and A2 when you write out the .bim file. See https://www.cog-genomics.org/plink2/formats#bed for details.

Hum, you may be right. Yet, it is weird to code homoziguous referent by 2...
I'm using the same convention as in package {gaston}.

I can simply swap A1 and A2 when I use lassosum().
Thanks for your help.

Hi Tim, I am having exactly the same problem when I try to calculate PRS using the Type II diabetes summary statistics for my sample using T2D_ALL_PRIMARY dataset from:
https://blog.nus.edu.sg/agen/summary-statistics/t2d-2020/
My cor does not contain Na. and I tried to add the exclude.ambiguous option. However both methods did not work. I tried lassosum on other summary statistics and this issue did not happen. May I ask for your help with this? Thank you very much.

It is not clear what your problem is. There are many possible reasons for having negative results. Please open a new issue and describe exactly what your codes are, what the output is, and what you expect the output to be. Preferably include a reproducible example.

I have the same problem with the Lassosum calculation PRS. I used Low density lipoprotein cholesterol Base Date available on https://humandbs.biosciencedbc.jp/en/hum0014-v21#58qt and Target Date provided by my laboratory. The Quality Control was performed using Tutorial (https://choishingwan.github.io/PRS-Tutorial/base/). But Lassosum gets $best.validation.result equals to -0.0380079. I think it is not a normal situation and something is wrong with my data, but I can not find the mistake.

Could you please help me to find out the problem?

Try reversing all effects.

It works! Thank you very much.