"SBSLMM", refering to Summary statistics for Bayesian Sparse Linear Mixed Model (SBSLMM), which wrapped the Plink and Sbslmm code. SBSLMM both defines the indepnedent large effect SNPs and re-estimate the merginal effect with the reference LD panel and block information. The aim of SBSLMM can handle large-scale of SNPs with low memory and high running speed, expecailly UK Biobank scale (~9 million SNPs).
- PLINK
SBSLMM uses PLINK to select independent large effect SNPs. The link of Plink-1.9: https://www.cog-genomics.org/plink/1.9/. - sbslmm
SBSLMM uses sbslmm to re-restiamte the marginal effect (z-score). The executable file of sbslmm is included in the file folder SBSLMM. - block file
SBSLMM uses the block information. Reference: Berisa and Pickrell (2015) - Linkage Disequilibrium SCore regression (LDSC)
SBSLMM uses the estimated heritability. You can use any sofeware to estimate the heritability, such as LDSC, GEMMA and GCTB.
In the paper, we use LDSC to estimate heritability. Here is the code:
py=/your/python/path/python
ldsc=/your/ldsc/path/ldsc.py
summ=/your/summary/path/summary_pheno${PHENO}
ref=/your/reference/ldsc/path/
herit=/net/mulan/disk2/yasheng/GWAS/heritability/h2_pheno${PHENO}
## LDSC
${py} ${ldsc} --h2 ${summ}.ldsc.gz --ref-ld-chr ${ref} --w-ld-chr ${ref} --out ${herit}
rm ${summ}.ldsc.gz
## get herit
hstr=`sed -n '26p' ${herit}`
hse=`echo ${hstr#*:}`
h2=`echo ${hse%(*}`
- SBSLMM requires the following R packages: data.table, optparse. Install them by:
install.packages(c("data.table", "optparse"), dependencies=TRUE)
All the input files are in the test_dat
folder.
- summary statistics (GEMMA format)
The separate is tab (\t).
chr rs ps n_mis n_obs allele1 allele0 af beta se p_wald
1 rs554763599 41975332 0 2000 A G 0.485 3.651087e-02 3.413215e-02 2.848875e-01
1 rs678110 41976017 0 2000 C T 0.404 -3.192123e-02 3.518509e-02 3.643907e-01
1 rs72669005 41976217 32 1968 A G 0.042 -1.349646e-01 8.657531e-02 1.191719e-01
1 rs140839222 41976345 14 1986 C T 0.030 -1.107334e-01 1.014877e-01 2.753601e-01
1 rs377343544 41976500 0 2000 A G 0.061 3.474750e-02 7.228730e-02 6.307923e-01
1 rs11809423 41976529 0 2000 T C 0.038 -1.373579e-01 9.006960e-02 1.274125e-01
1 rs61775290 41977951 57 1943 A G 0.067 1.014252e-01 6.952261e-02 1.447549e-01
1 rs12744227 41979249 24 1976 T C 0.015 -2.832705e-01 1.448475e-01 5.064601e-02
1 rs72669009 41980147 20 1980 A G 0.061 -2.075229e-01 7.242865e-02 4.210971e-03
1 rs566707973 41981931 0 2000 C G 0.054 1.974115e-02 7.675284e-02 7.970476e-01
-
reference panel (plink format)
The same file name of bed, bim and fam files. -
block information (.bed) The block information is download from the website: https://bitbucket.org/nygcresearch/ldetect-data/src/master/EUR/. The
test_dat
folder inlcudes a block information of chromosome 1 (just for testing the code).
You can directly use the Rscript /SBSLMM/SBSLMM.R
. You can get the explaination of each parameter:
Rscript SBSLMM.R -h
Rscript SBSLMM.R --help
The details is:
--summary=CHARACTER
INPUT: the summary statistics (gemma output format)
--tmpPath=CHARACTER
INPUT: the temptorary path
--outPath=CHARACTER
INPUT: the output path
--plink=CHARACTER
INPUT: the perfix of Plink software
--ref=CHARACTER
INPUT: the perfix of reference panel
--r2=CHARACTER
INPUT: the cutoff of SNPs clumping (default:0.1)
--pv=CHARACTER
INPUT: the cutoff of SNPs pruning (default:1e-6)
--sbslmm=CHARACTER
INPUT: the perfix of sbslmm software
--mafMax=CHARACTER
INPUT: the maximium of the difference between reference panel and summary data
--nsnp=CHARACTER
INPUT: the number of SNPs in whole genome
--block=CHARACTER
INPUT: the block information (Berisa and Pickrell 2015)
--h2=CHARACTER
INPUT: the heritability of trait
--thread=CHARACTER
INPUT: the number of threads
The example of output file is:
rs79853901 T -0.147181 -0.482113 1
rs554763599 A 0.000182282 0.000257901 0
rs678110 C -9.0448e-05 -0.000130338 0
rs72669005 A -0.000189251 -0.000667139 0
rs140839222 C -0.000108468 -0.000449614 0
rs377343544 A 7.54451e-05 0.000222904 0
rs11809423 T -0.000177959 -0.000658152 0
rs61775290 A 0.000265844 0.000751854 0
rs12744227 T -0.000212508 -0.00123622 0
rs72669009 A -0.000273432 -0.000807861 0
The first column is SNP ID. The second column is allele code. The third code is scaled effect size. The forth is non-scaled effect size. Here, we use the MAF from the summary data. You can also use the testing data to transfer it. The fifth column is the index of whether it is large effect or not (1: large effect, 0: small effect). You can directly use the output file to plink-1.9 or plink2.
We use the data of test_dat
to show the SBSLMM method.
### estimate summary data (GEMMA)
### INPUT
## bfiletr: the 6th column of fam file is phenotype
### OUTPUT
## summary_gemma.assoc.txt: the summary data for SBSLMM
gemma=your/path/gemma/gemma-0.98-linux-static
bfiletr=/your/path/test_dat/train
cd /your/path/test_dat
summ=summary_gemma
${gemma} -bfile ${bfiletr} -notsnp -lm 1 -n 1 -o ${summ}
rm /your/path/test_dat/output/${summ}.log.txt
### SBSLMM
SBSLMM=/your/path/SBSLMM/SBSLMM.R
summf=/your/path/test_dat/summary_gemma
mkdir /your/path/test_dat/tmp
mkdir /your/path/test_dat/out
tmpPath=/your/path/test_dat/tmp
outPath=/your/path/test_dat/out
plink=/your/path/plink-1.9
ref=/your/path/test_dat/ref
sbslmm=/your/path/SBSLMM/sbslmm
m=`cat ${summf}.assoc.txt | wc -l`
## Default settings
Rscript ${SBSLMM} --summary ${summf}.assoc.txt --tmpPath ${tmpPath} --outPath ${outPath} --plink ${plink} --ref ${ref} \
--sbslmm ${sbslmm} --mafMax 0.2 --nsnp ${m} --block ${blockf}.bed --h2 0.1 --thread 2
## Your settings
Rscript ${SBSLMM} --summary ${summf}.assoc.txt --tmpPath ${tmpPath} --outPath ${outPath} --plink ${plink} --ref ${ref} \
--r2 0.1 --pv 1e-6 --sbslmm ${sbslmm} --mafMax 0.2 --nsnp ${m} --block ${blockf}.bed --h2 0.1 --thread 2
### Predict
bfilete=/your/path/test_dat/test
est=/your/path/test_dat/out/summary_gemma
pred=/your/path/pheno_pred
## plink 1.9
plink=/your/path/plink1.9
${plink} --bfile ${bfilete} --score ${est}.txt 1 2 4 sum --out ${pred}
## plink 2
plink=/your/path/plink2
${plink} --bfile ${bfilete} --score ${est}.txt 1 2 4 cols=+scoresums --out ${pred}
- P+T
To select the best cutoff, we ues a validate data to do P+T. We use the plink-1.9. Thetoplinkf
,clumpf
,simPred
andmax
functions are inPT
folder. Here, the code is as following:
toplinkf=/your/path/PT/toplinkf.R
clumpf=/your/path/PT/toclumpf.R
simpred=/your/path/PT/simPred.R
max=/your/path/PT/max.R
bfileval=/your/path/test_dat/validate
bfilete=/your/path/test_dat/test
## transfer gemma format to plink format
summf=/your/path/test_dat/summary_gemma
Rscript ${toplinkf} --gemma ${summf}.assoc.txt --plink ${summf}.plink.txt
##
clump1=/your/path/test_dat/summary
pred1=/your/path/test_dat/pheno
for p in 5e-8 1e-6 1e-4 1e-3 1e-2 5e-2 1e-1 2e-1 5e-1 1.0; do
Rscript ${clumpf1} --gemma ${summf}.assoc.txt --plink ${summf}.plink.txt --ref ${bfileval} --pth ${p} \
--clump ${clump1}_p${p} --pred ${pred1}
rm ${clump1}_p${p}.clumped
rm ${clump1}_p${p}.log
rm ${pred1}.log
rm ${pred1}.nopred
r21=/your/path/PT/r2
Rscript ${simpred} --pheno ${pred1}.profile --r2 ${r21}_p${p}.txt
rm ${pred1}.profile
done
pbestf1=/your/path/pbest1
pbestf2=/your/path/pbest2
Rscript ${max1} --r2 ${r21}_p --pbest1 ${pbestf1}.txt --pbest2 ${pbestf2}.txt
pbest1=`cat ${pbestf1}.txt`
pbest2=`cat ${pbestf2}.txt`
plink-1.9 --bfile ${bfilete} --score ${clump1}_p${pbest1}.txt 1 2 3 sum --out ${pred1}
mv ${clump1}_p${pbest1}.txt ${clump1}_best.txt
rm ${r21}_p*
rm ${pred1}.log
rm ${pred1}.txt
rm ${pred2}.txt
rm ${clump1}*
- SBLUP
SBLUP is performed by GCTA. The code is similar to the example of GCTA.
gcta=/your/path/gcta_1.91.7beta/gcta64
bfiletr=/your/path/tets_dat/train
summf=/your/path/tets_dat/summary_gemma
refld=/your/path/tets_dat/ref
herit=0.1
m=`cat ${summf}.assoc.txt | wc -l`
n=`cat ${bfiletr}.fam | wc -l`
cojo=$(echo "${m}*(1/${herit}-1)" | bc -l)
sed -i '1d' /your/path/test_dat/${summf}.assoc.txt
awk '{print $2,$6,$7,$8,$9,$10,$11,$5}' ${summf}.assoc.txt > ${summf}.ma
sed -i '1i\SNP A1 A2 freq b se p N' ${summf}.ma
est3=/your/path/esteff
${gcta} --bfile ${refld} --chr 1 --cojo-file ${summf}.ma --cojo-sblup ${cojo} --cojo-wind 200 --thread-num ${thread} --out ${est3}
pred3=/your/path/pheno
plink-1.9 --bfile ${bfilete} --score ${est}.sblup.cojo 1 2 4 sum --out ${pred3}
rm ${est3}.log
rm ${est3}*badsnps
rm ${est3}.sblup.cojo
rm ${pred3}.log
- LDpred LDpred is performed by manual of LDpred, including the cutoff and radius.
bfiletr=/your/path/train
bfileval=/your/path/validate
bfilete=/your/path/test
n=`cat ${bfiletr}.fam | wc -l`
herit=0.1
## transfer gemma format to LDpred format
Rscript ${toldpred} --gemma ${summf}.assoc.txt --ldpred ${summf}_LDpred.sumstat
## validate
coord1=/path/summary_cv.HDF5
${py} ${ldpred} coord --gf ${bfileval} --ssf ${summf}_LDpred.sumstat --out ${coord1} --N ${n} --ssf-format STANDARD --max-freq-discrep 0.2
ldest=/your/path/ld
infest=/your/path/esteff
${py} ${ldpred} gibbs --cf ${coord1} --ldr ${ldr} --ldf ${ldest} --out ${infest} --N ${n} --h2 ${herit} \
--f 1 0.3 0.1 0.03 0.01 0.003 0.001 0.0003 0.0001
pred41=/path/pheno
r24=/path/r2
for p in 1.0000e+00 3.0000e-01 1.0000e-01 3.0000e-02 1.0000e-02 3.0000e-03 1.0000e-03 3.0000e-04 1.0000e-04;do
plink-1.9 --bfile ${bfileval} --score ${infest}_LDpred_p${p}.txt 3 4 7 header sum --out ${pred41}_p${p}
Rscript ${simpred} --pheno ${pred41}_p${p}.profile --r2 ${r24}_p${p}.txt
rm ${pred41}_p${p}.log
rm ${pred41}_p${p}.nopred
rm ${pred41}_p${p}.profile
done
pbestf4=/your/path/r2_best.txt
Rscript ${max2} --r2 ${r24}_p --pbest ${pbestf4}
pbest4=`cat ${pbestf4}`
## test
coord2=/your/path/summary_ref.HDF5
${py} ${ldpred} coord --gf ${refld} --ssf ${summf}_LDpred.sumstat --out ${coord2} --N ${n} --ssf-format STANDARD --max-freq-discrep 0.2
${py} ${ldpred} gibbs --cf ${coord2} --ldr ${ldr} --ldf ${ldest} --out ${infest} --N ${n} --h2 ${herit} --f ${pbest4}
predInf4=/your/path/pheno_inf
predp4=/your/path/pheno_pbest
plink-1.9 --bfile ${bfilete} --score ${infest}_LDpred-inf.txt 3 4 7 sum --out ${predInf4}
plink-1.9 --bfile ${bfilete} --score ${infest}_LDpred_p${pbest4}.txt 3 4 7 sum --out ${predp4}
rm ${coord1}
rm ${coord2}
rm ${ldest}*.pkl.gz
rm ${infest}_LDpred*.txt
rm ${predInf4}.nopred
rm ${predInf4}.log
rm ${predp4}.nopred
rm ${predp4}.log
rm ${r24}*
- lassosum
We use the lassosum by the R packagelassosum
. Thelassosum
function is in thelassosum
file folder.
res5=/your/path/res
Rscript ${lassosum} --summ ${summf}.assoc.txt --ref ${refld} --valid ${bfileva} --test ${bfilete} --res ${res5}.txt
- lasso (sample size = 2,000)
We use the lasso algorithm in Plink-1.9. The code is as following:
## lasso
esttr3=/your/path/train
plink-1.9 --bfile ${bfiletr} --lasso ${herit} --out ${esttr3}
predt=/your/path/pheno
plink-1.9 --bfile ${bfilete} --score ${esttr3}.lasso 2 3 4 header sum --out ${predt}
rm ${predt}.log
rm ${esttr3}*
- BSLMM (sample size = 2,000)
We use the BSLMM in theGEMMA
software.bslmm.R
is in thesimulation_small_size
file folder.
## BSLMM
gemma=/your/path/gemma-0.98-linux-static
bslmmc=/your/path/simulation_small_size/bslmm.R
cd /your/path/
snpeff1=snpeff
${gemma} -bfile ${bfiletr} -bslmm 1 -o ${snpeff1} -w 6000 -s 2000 -rpace 1000
snpeff2=/your/path/output/snpeff
snpeff3=/your/path/snpeff
Rscript ${bslmmc} --bim ${bfiletr}.bim --eff ${snpeff2}.param.txt --effc ${snpeff3}.txt
predt=/your/path/pheno
plink-1.9 --bfile ${bfilete} --score ${snpeff3}.txt 1 2 3 sum --out ${predt}
rm ${snpeff2}*
rm ${snpeff3}*
rm ${predt}.log
All the simulation results are included in the manuscript.
For SBSLMM, we also process the UKB data. Following the Neale lab and official file of UKB, we perform qualtiy control of samples and SNPs.
We use snp_qc.R
and toplink.sh
to perform the SNP QC. The details are shown in the manuscript.
- MAF
We select SNPs with MAF>0.2. - INFO score
We select the SNPs with INFO score>0.8. - Duplicate
We delete the duplicated SNPs. - HWE
We delete the SNPs with HWE<1e-6. - Missing rate
We delete the SNPS with Pm>0.05.
We use mkpheno.R
to select the samples. The details are shown in the manuscript.
The flow chart is as following:
Here, we only show the code different to simulation, including P+T, LDpred and lassosum. Rather than simulation, real data analysis is should use the whole genome information. The code is as following:
### P+T
toplinkf=/your/path/PT/toplinkf.R
clumpf=/your/path/PT/toclumpf.R
sumpred=/your/path/PT/sumPred.R
max=/your/path/PT/max.R
res=/your/path/PT/res.R
## to plink format
summf=your/path/summary_gemma
for ((chr=1;chr<23;chr++));do
Rscript ${toplinkf} --gemma ${summf}_chr${chr}.assoc.txt --plink ${summf}_chr${chr}.plink.txt
done
### different cutoff values
clump=/your/path/PT/summary
pred=/your/path/PT/pheno
r2=/your/path/PT/r2
for p in 5e-8 1e-6 1e-4 1e-3 1e-2 5e-2 1e-1 2e-1 5e-1 1.0 ; do
for ((chr=1;chr<23;chr++)); do
## clumping
sub=your/path/subsample/chr${chr}
Rscript ${clumpf} --gemma ${summf}_chr${chr}.assoc.txt --plink ${summf}_chr${chr}.plink.txt --ref ${sub} --pth ${p} \
--clump ${clump}_p${p}_chr${chr} --pred ${pred}_p${p}_chr${chr}
rm ${clump}_p${p}_chr${chr}.clumped
rm ${clump}_p${p}_chr${chr}.log
rm ${pred}_chr${chr}.log
rm ${pred}_chr${chr}.nopred
done
## predict for each threshold
Rscript ${sumpred} --pred ${pred}_p${p}_chr --pheno ${PHENO} --r2 ${r2}_p${p}.txt
rm ${pred}_p${p}_chr*.profile
rm ${pred}_p${p}_chr*.log
done
### LDpred
toldpred=/your/path/ldpred/toldpred.R
py=/your/path/py3/bin/python
ldpred=/your/path/ldpred/LDpred.py
split=/your/path/ldpred/split_res2.R
calcr2=/your/path/ldpred/ldpred/r2.R
max=/your/path/ldpred/ldpred/max.R
ldr=200
summf=your/path/summary_gemma
cat ${summf}_chr*.assoc.txt > ${summf}.assoc.txt
Rscript ${toldpred} --gemma ${summf}.assoc.txt --ldpred ${summf}_LDpred.sumstat
## ldpred (cross validation)
n_file=/your/path/pheno_n_train.txt
n=`sed -n "${cross}, 1p" ${n_file}`
sub=/your/path/subsample/merge
coord1=/your/path/summary_cv_cross${cross}.HDF5
${py} ${ldpred} coord --gf ${sub} --ssf ${summ}_LDpred.sumstat --out ${coord1} --N ${n} --ssf-format STANDARD --max-freq-discrep 0.2
ldest=/your/path/ld_ldr${ldr}
infest=/your/path/esteff_ldr${ldr}
${py} ${ldpred} gibbs --cf ${coord1} --ldr ${ldr} --ldf ${ldest} --out ${infest} --N ${n} --h2 ${h2} --f 1 0.3 0.1 0.03 0.01 0.003 0.001 0.0003 0.0001
## cross validation
pred=/your/path/pheno
r2=/your/path/r2
for p in 1.0000e+00 3.0000e-01 1.0000e-01 3.0000e-02 1.0000e-02 3.0000e-03 1.0000e-03 3.0000e-04 1.0000e-04;do
plink-1.9 --bfile ${sub} --score ${infest}_LDpred_p${p}.txt 3 4 7 header sum --out ${pred}_p${p}
Rscript ${calcr2} --pred ${pred}_p${p}.profile --pheno ${PHENO} --r2 ${r2}_p${p}.txt
rm ${pred}_p${p}.log
rm ${pred}_p${p}.nopred
rm ${pred}_p${p}.profile
done
pbestf=/your/path/r2_pbest.txt
Rscript ${max} --r2 ${r2}_p --pbest ${pbestf}
pbest=`cat ${pbestf}`
## ldpred (reference data)
ref2=/your/path/frq/merge
coord2=/your/path/summary_ref_ldr${ldr}.HDF5
${py} ${ldpred} coord --gf ${ref2} --ssf ${summ}_LDpred.sumstat --out ${coord2} --N ${n} --ssf-format STANDARD --max-freq-discrep 0.2
${py} ${ldpred} gibbs --cf ${coord2} --ldr ${ldr} --ldf ${ldest} --out ${infest} --N ${n} --h2 ${h2} --f ${pbest}
## split to chr
Rscript ${split} --tot ${infest}_LDpred-inf.txt --sp ${infest}_inf_chr
Rscript ${split} --tot ${infest}_LDpred_p${pbest}.txt --sp ${infest}_pbest_chr
## remove
rm ${coord2}
rm ${ldest}*.pkl.gz
rm ${infest}_LDpred*.txt
rm ${summ}.assoc.txt
rm ${summ}_LDpred.sumstat
### lassosum
lassosum=/your/path/lassosum/lassosum_real.R
ref=/your/path/merge
val=/your/path/merge
thread=4
res=/your/path/pheno_cross${cross}_chr
Rscript ${lassosum} --summ ${summf}.assoc.txt --pheno ${PHENO} --thread ${thread} --cross ${cross} --res ${res}
rm ${summf}.assoc.txt