Negative results
privefl opened this issue · 15 comments
I'm using the lassosum.pipeline
/ validate
/ subset
/ validate
strategy of the README (splitting the test set in 2).
I'm getting v2$best.validation.result
equals to -0.491. And AUC(v2$best.pgs, v2$pheno - 1)
equals to 0.205.
Is this normal? Should I just take the opposite?
Note that the binary phenotype is coded as 1/2 as in standard PLINK format.
That's not normal. However, I am unable to determine what might have gone wrong based on what you reported. If it's ok, can you PM me your dataset and the exact command you used?
I wanted to make a reproducible example with fake data but I'm getting a new error:
Coordinating summary stats with reference panel...
Splitting genome by LD blocks ...
Running lassosum ...
s = 0.2
Error in lassosum(cor = cor2, bfile = ref.bfile, shrink = s, extract = ref.extract, :
!any(is.na(cor)) is not TRUE
I've no missing value in cor
...
Reproducible example:
# devtools::install_github("privefl/bigsnpr")
library(bigsnpr)
set.seed(1)
N <- 1000; M <- 1e4; m <- 20
snp <- snp_fake(N, M)
snp$genotypes[] <- sample(0:2, size = N * M, replace = TRUE)
snp$map$physical.pos <- seq(2, 2e8, length.out = M)
set <- sample(M, m)
beta <- rnorm(m) / sqrt(m)
s <- drop(scale(snp$genotypes[, set]) %*% beta)
y <- ((s + rnorm(m)) > quantile(s, 0.7)) + 0
AUC(s, y)
snp$fam$affection <- y + 1
bed <- snp_writeBed(snp, tempfile(fileext = ".bed"))
rds <- snp_readBed(bed, sub("\\.bed$", "", bed))
snp <- snp_attach(rds)
G <- snp$genotypes
CHR <- snp$map$chromosome
POS <- snp$map$physical.pos
A1 <- snp$map$allele1
A2 <- snp$map$allele2
set.seed(1)
ind.train <- sort(sample(nrow(G), size = N * 0.6))
ind.test <- setdiff(rows_along(G), ind.train)
ind.test.split <- split(ind.test, sample(1:2, length(ind.test), TRUE))
gwas <- big_univLogReg(G, y[ind.train], ind.train)
plot(gwas, type = "Manhattan")
# devtools::install_github("tshmak/lassosum")
library(lassosum)
bed2 <- snp_writeBed(snp, tempfile(fileext = ".bed"),
ind.row = ind.test.split[[1]])
bed3 <- snp_writeBed(snp, tempfile(fileext = ".bed"),
ind.row = ind.test.split[[2]])
cor <- p2cor(p = predict(gwas, log10 = FALSE), n = length(ind.train),
sign = gwas$estim)
out <- lassosum.pipeline(cor = cor, chr = CHR, pos = POS, A1 = A1, A2 = A2,
test.bfile = sub("\\.bed$", "", bed2),
LDblocks = "EUR.hg19")
v <- validate(out)
out2 <- subset(out, s = v$best.s, lambda = v$best.lambda)
v2 <- validate(out2, test.bfile = sub("\\.bed$", "", bed3))
v2$best.validation.result ## -0.07595338
AUC(v2$best.pgs, v2$pheno - 1) ## 0.4441773
Thanks for the example. The !any(is.na(cor)) is not TRUE
error turned out to be due to the genotypes in your synthetic example being all A/T. These SNPs all got excluded because they were ambiguous. You can get around this by passing the exclude.ambiguous=F
option to lassosum.pipeline
and also to the second validate
command.
I tried it with these options and found that the resulting v$best.validation.result
was 0.05, and v2$best.validation.result
was 0.007.
Hum, it was foolish of me to use A/T. I now use C/T -- please reinstall package {bigsnpr}.
I've updated the example above in order to get negative results.
I think I've determined that the problem is because your snp_writeBed
function writes in the opposite way to how I interpret a plink .bed
file. In particular, I treat 00 as homozygous A1, and 11 as homozygous A2. I think you did it the other way round.
I have
> rawToBits(readBin(bed, what = raw(), n = 5)[4:5])
[1] 00 00 01 01 01 01 00 00 00 00 00 00 01 01 01 01
> snp$genotypes[1:8]
[1] 0 2 2 0 0 0 2 2
In Plink format, 0,1,2 should correspond to the number of A1 alleles (not A2). Therefore the 16 bits you showed there would mean 2 0 0 2 2 2 0 0 to plink, which is opposite to what you had in snp$genotypes[1:8]
. Alternatively, you could swap the A1 and A2 when you write out the .bim file. See https://www.cog-genomics.org/plink2/formats#bed for details.
Hum, you may be right. Yet, it is weird to code homoziguous referent by 2...
I'm using the same convention as in package {gaston}.
I can simply swap A1
and A2
when I use lassosum()
.
Thanks for your help.
Hi Tim, I am having exactly the same problem when I try to calculate PRS using the Type II diabetes summary statistics for my sample using T2D_ALL_PRIMARY dataset from:
https://blog.nus.edu.sg/agen/summary-statistics/t2d-2020/
My cor does not contain Na. and I tried to add the exclude.ambiguous option. However both methods did not work. I tried lassosum on other summary statistics and this issue did not happen. May I ask for your help with this? Thank you very much.
It is not clear what your problem is. There are many possible reasons for having negative results. Please open a new issue and describe exactly what your codes are, what the output is, and what you expect the output to be. Preferably include a reproducible example.
I have the same problem with the Lassosum calculation PRS. I used Low density lipoprotein cholesterol Base Date available on https://humandbs.biosciencedbc.jp/en/hum0014-v21#58qt and Target Date provided by my laboratory. The Quality Control was performed using Tutorial (https://choishingwan.github.io/PRS-Tutorial/base/). But Lassosum gets $best.validation.result equals to -0.0380079. I think it is not a normal situation and something is wrong with my data, but I can not find the mistake.
Could you please help me to find out the problem?
Try reversing all effects.
It works! Thank you very much.