chr1swallace/coloc

prepare the data structure for coloc.abf and loop for multiple variants running together

manL23 opened this issue · 0 comments

manL23 commented

Hi,
Thanks for creating this package. I tried to run colocalization with GWAS sumstats and mQTL data extracted from GoDMC dataset. However, there're some issues about the data structure to run coloc.abf with multiple snps together in a loop.
Here're the details of my datasets and problematic outputs from coloc.abf

GWAS summary statistics as data frame

  • subset the GWAS data based on p value threshold p< 1x10-5
  • Select the variables for coloc.abf . then the dataset as below:

'data.frame': 41882 obs. of 8 variables:
$ snp : chr "rs573419147" "rs557642992" "rs7454868" "rs574909700" ...
$ chr : int 17 17 6 6 6 6 6 6 6 6 ...
$ position : int 44327370 44365374 26799828 28832788 28833101 28840885 28843311 28958399 28961100 29266673 ...
$ maf.gwas : num 0.787 0.769 0.914 0.922 0.929 ...
$ beta.gwas : num 1.07 1.07 1.25 1.25 1.28 ...
$ SE : num 0.0149 0.0142 0.0205 0.0173 0.0179 ...
$ p.gwas : num 8.07e-06 8.11e-06 2.80e-27 3.86e-38 1.96e-42 ...
$ varbeta.gwas: num 0.000223 0.000201 0.000422 0.000299 0.000322 ...

mQTL data from GoDMC

  • Select the variables for coloc.abf , then the dataset as below:

data.frame': 13234 obs. of 8 variables:
$ snp.mqtl : chr "rs7074491" "rs7534848" "rs7545236" "rs1783551" ...
$ name : chr "chr10:134813720:SNP" "chr1:169530093:SNP" "chr1:169530070:SNP" "chr11:75231212:SNP" ...
$ cpg : chr "cg00320980" "cg16054275" "cg16054275" "cg15526825" ...
$ beta.mqtl : num -0.349 0.411 0.411 0.496 0.45 ...
$ se.mqtl : num 0.0093 0.011 0.011 0.0132 0.012 ...
$ samplesize : num 21983 27748 27748 25694 26642 ...
$ p.mqtl : num 2.38e-308 3.17e-308 3.17e-308 3.71e-308 4.17e-308 ...
$ varbeta.mqtl: num 8.64e-05 1.20e-04 1.20e-04 1.75e-04 1.44e-04 ...

• Merge two datasets into a single one

  • eventually, 44 SNPs overlapped between gwas and mqtl datasets

• make lists of lists as required by coloc.abf () without loop function

D1= list(beta=df$beta.gwas, varbeta=df$varbeta.gwas, MAF=df$maf.gwas, type="cc", s=40675/105318, N=105318, snp=df$snp)

The D1 looks as below:

List of 7
$ beta : num [1:44] 1.12 1.05 1.12 1.13 1.12 ...
$ varbeta: num [1:44] 1.45e-04 9.29e-05 1.47e-04 1.45e-04 1.47e-04 ...
$ MAF : num [1:44] 0.811 0.559 0.812 0.806 0.809 ...
$ type : chr "cc"
$ s : num 0.386
$ N : num 105318
$ snp : chr [1:44] "rs1104871" "rs11223635" "rs11967852" "rs11967935"

D2 =list(beta=df$beta.mqtl, varbeta=df$varbeta.mqtl, MAF=df$maf.gwas, type="quant", N=27750, snp=df$snp)

List of 6
$ beta : num [1:44] 0.1109 -0.0869 0.1123 0.1096 0.1095 ...
$ varbeta: num [1:44] 1.13e-04 7.64e-05 1.09e-04 1.10e-04 1.11e-04 ...
$ MAF : num [1:44] 0.811 0.559 0.812 0.806 0.809 ...
$ type : chr "quant"
$ N : num 27750
$ snp : chr [1:44] "rs1104871" "rs11223635" "rs11967852" "rs11967935" .

coloc.abf(D1,D2)

nsnps PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf
44 0.000000e+00 6.668867e-72 0.000000e+00 1.000000e+00 3.177685e-52
However, I want to run the analysis on each SNP, and gives ~ different results per SNPs. So I wrote a loop as below
Set datalists with loop for coloc.abf () function
datalist1<- lapply(1:length(df$snp), function(i) {
d1[[i]] = list(snp =df$snp[i],beta=df$beta.gwas[i],varbeta=df$varbeta.gwas[i], MAF=df$maf.gwas[i], type="cc", s=40675/105318, N=105318)
})

datalist2 <- lapply(1:length(df$snp), function(i) {
d2[[i]] = list(snp =df$snp[i],beta=df$beta.mqtl[i],varbeta=df$varbeta.mqtl[i],
MAF=df$maf.gwas[i], type="quant", N=27750)
})

res <- lapply(1:length(df$snp),function(i)
coloc.abf(datalist1[[i]],datalist2[[i]])$summary)
do.call("rbind", res)

However, it seems that the code only run coloc with the same SNP per row

  nsnps PP.H0.abf    PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf

[1,] 1 0 4.108072e-22 0 0 1
[2,] 1 0 7.421940e-20 0 0 1
[3,] 1 0 1.642475e-23 0 0 1
[4,] 1 0 3.604846e-22 0 0 1
[5,] 1 0 6.957833e-22 0 0 1
.......
[40,] 1 0 1.937698e-23 0 0 1
[41,] 1 0 1.613592e-23 0 0 1
[42,] 1 0 1.571302e-23 0 0 1
[43,] 1 0 1.703198e-23 0 0 1
[44,] 1 0 7.755652e-23 0 0 1

Not sure if the issue cause by the data structures (it seemed fine as check_dataset function returned NULL) before running coloc.abf, or the loop function. Could you please provide more details about how to prepare the right datalists to coloc.abf for multiple SNPs and how to run multiple variants together? Many thanks!