SBSLMM2: A Shell repository from biostatYu

SBSLMM

"SBSLMM", refering to Summary statistics for Bayesian Sparse Linear Mixed Model (SBSLMM), which wrapped the Plink and Sbslmm code. SBSLMM both defines the indepnedent large effect SNPs and re-estimate the merginal effect with the reference LD panel and block information. The aim of SBSLMM can handle large-scale of SNPs with low memory and high running speed, expecailly UK Biobank scale (~9 million SNPs).

Installation

PLINK
SBSLMM uses PLINK to select independent large effect SNPs. The link of Plink-1.9: https://www.cog-genomics.org/plink/1.9/.
sbslmm
SBSLMM uses sbslmm to re-restiamte the marginal effect (z-score). The executable file of sbslmm is included in the file folder SBSLMM.
block file
SBSLMM uses the block information. Reference: Berisa and Pickrell (2015)
Linkage Disequilibrium SCore regression (LDSC)
SBSLMM uses the estimated heritability. You can use any sofeware to estimate the heritability, such as LDSC, GEMMA and GCTB.
In the paper, we use LDSC to estimate heritability. Here is the code:

py=/your/python/path/python
ldsc=/your/ldsc/path/ldsc.py
summ=/your/summary/path/summary_pheno${PHENO}
ref=/your/reference/ldsc/path/
herit=/net/mulan/disk2/yasheng/GWAS/heritability/h2_pheno${PHENO}
## LDSC
${py} ${ldsc} --h2 ${summ}.ldsc.gz --ref-ld-chr ${ref} --w-ld-chr ${ref} --out ${herit}
rm ${summ}.ldsc.gz
## get herit
hstr=`sed -n '26p' ${herit}`
hse=`echo ${hstr#*:}`
h2=`echo ${hse%(*}`

SBSLMM requires the following R packages: data.table, optparse. Install them by:

install.packages(c("data.table", "optparse"), dependencies=TRUE)

Input file format

All the input files are in the test_dat folder.

summary statistics (GEMMA format)
The separate is tab (\t).

chr rs ps n_mis n_obs allele1 allele0 af beta se p_wald
1	rs554763599	41975332	0	2000	A	G	0.485	3.651087e-02	3.413215e-02	2.848875e-01
1	rs678110	41976017	0	2000	C	T	0.404	-3.192123e-02	3.518509e-02	3.643907e-01
1	rs72669005	41976217	32	1968	A	G	0.042	-1.349646e-01	8.657531e-02	1.191719e-01
1	rs140839222	41976345	14	1986	C	T	0.030	-1.107334e-01	1.014877e-01	2.753601e-01
1	rs377343544	41976500	0	2000	A	G	0.061	3.474750e-02	7.228730e-02	6.307923e-01
1	rs11809423	41976529	0	2000	T	C	0.038	-1.373579e-01	9.006960e-02	1.274125e-01
1	rs61775290	41977951	57	1943	A	G	0.067	1.014252e-01	6.952261e-02	1.447549e-01
1	rs12744227	41979249	24	1976	T	C	0.015	-2.832705e-01	1.448475e-01	5.064601e-02
1	rs72669009	41980147	20	1980	A	G	0.061	-2.075229e-01	7.242865e-02	4.210971e-03
1	rs566707973	41981931	0	2000	C	G	0.054	1.974115e-02	7.675284e-02	7.970476e-01

reference panel (plink format)
The same file name of bed, bim and fam files.
block information (.bed) The block information is download from the website: https://bitbucket.org/nygcresearch/ldetect-data/src/master/EUR/. The test_dat folder inlcudes a block information of chromosome 1 (just for testing the code).

Parameter Setting

You can directly use the Rscript /SBSLMM/SBSLMM.R. You can get the explaination of each parameter:

Rscript SBSLMM.R -h
Rscript SBSLMM.R --help

The details is:

--summary=CHARACTER
		INPUT: the summary statistics (gemma output format)
--tmpPath=CHARACTER
		INPUT: the temptorary path
--outPath=CHARACTER
		INPUT: the output path
--plink=CHARACTER
		INPUT: the perfix of Plink software
--ref=CHARACTER
		INPUT: the perfix of reference panel
--r2=CHARACTER
		INPUT: the cutoff of SNPs clumping (default:0.1)
--pv=CHARACTER
		INPUT: the cutoff of SNPs pruning (default:1e-6)
--sbslmm=CHARACTER
		INPUT: the perfix of sbslmm software
--mafMax=CHARACTER
		INPUT: the maximium of the difference between reference panel and summary data
--nsnp=CHARACTER
		INPUT: the number of SNPs in whole genome
--block=CHARACTER
		INPUT: the block information (Berisa and Pickrell 2015)
--h2=CHARACTER
		INPUT: the heritability of trait
--thread=CHARACTER
		INPUT: the number of threads

Output file format

The example of output file is:

rs79853901 T -0.147181 -0.482113 1
rs554763599 A 0.000182282 0.000257901 0
rs678110 C -9.0448e-05 -0.000130338 0
rs72669005 A -0.000189251 -0.000667139 0
rs140839222 C -0.000108468 -0.000449614 0
rs377343544 A 7.54451e-05 0.000222904 0
rs11809423 T -0.000177959 -0.000658152 0
rs61775290 A 0.000265844 0.000751854 0
rs12744227 T -0.000212508 -0.00123622 0
rs72669009 A -0.000273432 -0.000807861 0

The first column is SNP ID. The second column is allele code. The third code is scaled effect size. The forth is non-scaled effect size. Here, we use the MAF from the summary data. You can also use the testing data to transfer it. The fifth column is the index of whether it is large effect or not (1: large effect, 0: small effect). You can directly use the output file to plink-1.9 or plink2.

Example code for test data (SBSLMM)

We use the data of test_dat to show the SBSLMM method.

### estimate summary data (GEMMA)
### INPUT
## bfiletr: the 6th column of fam file is phenotype
### OUTPUT
## summary_gemma.assoc.txt: the summary data for SBSLMM
gemma=your/path/gemma/gemma-0.98-linux-static
bfiletr=/your/path/test_dat/train
cd /your/path/test_dat
summ=summary_gemma
${gemma} -bfile ${bfiletr} -notsnp -lm 1 -n 1 -o ${summ}
rm /your/path/test_dat/output/${summ}.log.txt

### SBSLMM
SBSLMM=/your/path/SBSLMM/SBSLMM.R
summf=/your/path/test_dat/summary_gemma
mkdir /your/path/test_dat/tmp
mkdir /your/path/test_dat/out
tmpPath=/your/path/test_dat/tmp
outPath=/your/path/test_dat/out
plink=/your/path/plink-1.9
ref=/your/path/test_dat/ref
sbslmm=/your/path/SBSLMM/sbslmm
m=`cat ${summf}.assoc.txt | wc -l`
## Default settings
Rscript ${SBSLMM} --summary ${summf}.assoc.txt --tmpPath ${tmpPath} --outPath ${outPath} --plink ${plink} --ref ${ref} \ 
                  --sbslmm ${sbslmm} --mafMax 0.2 --nsnp ${m} --block ${blockf}.bed --h2 0.1 --thread 2
## Your settings
Rscript ${SBSLMM} --summary ${summf}.assoc.txt --tmpPath ${tmpPath} --outPath ${outPath} --plink ${plink} --ref ${ref} \ 
                  --r2 0.1 --pv 1e-6 --sbslmm ${sbslmm} --mafMax 0.2 --nsnp ${m} --block ${blockf}.bed --h2 0.1 --thread 2

### Predict
bfilete=/your/path/test_dat/test
est=/your/path/test_dat/out/summary_gemma
pred=/your/path/pheno_pred
## plink 1.9
plink=/your/path/plink1.9
${plink} --bfile ${bfilete} --score ${est}.txt 1 2 4 sum --out ${pred}
## plink 2
plink=/your/path/plink2
${plink} --bfile ${bfilete} --score ${est}.txt 1 2 4 cols=+scoresums --out ${pred}

Simulation

P+T
To select the best cutoff, we ues a validate data to do P+T. We use the plink-1.9. The toplinkf, clumpf, simPred and max functions are in PT folder. Here, the code is as following:

toplinkf=/your/path/PT/toplinkf.R
clumpf=/your/path/PT/toclumpf.R
simpred=/your/path/PT/simPred.R
max=/your/path/PT/max.R

bfileval=/your/path/test_dat/validate
bfilete=/your/path/test_dat/test
## transfer gemma format to plink format
summf=/your/path/test_dat/summary_gemma
Rscript ${toplinkf} --gemma ${summf}.assoc.txt --plink ${summf}.plink.txt
## 
clump1=/your/path/test_dat/summary
pred1=/your/path/test_dat/pheno
for p in 5e-8 1e-6 1e-4 1e-3 1e-2 5e-2 1e-1 2e-1 5e-1 1.0; do 
Rscript ${clumpf1} --gemma ${summf}.assoc.txt --plink ${summf}.plink.txt --ref ${bfileval} --pth ${p} \
--clump ${clump1}_p${p} --pred ${pred1}
rm ${clump1}_p${p}.clumped
rm ${clump1}_p${p}.log
rm ${pred1}.log
rm ${pred1}.nopred
r21=/your/path/PT/r2
Rscript ${simpred} --pheno ${pred1}.profile --r2 ${r21}_p${p}.txt
rm ${pred1}.profile
done
pbestf1=/your/path/pbest1
pbestf2=/your/path/pbest2
Rscript ${max1} --r2 ${r21}_p --pbest1 ${pbestf1}.txt --pbest2 ${pbestf2}.txt
pbest1=`cat ${pbestf1}.txt`
pbest2=`cat ${pbestf2}.txt`
plink-1.9 --bfile ${bfilete} --score ${clump1}_p${pbest1}.txt 1 2 3 sum --out ${pred1}
mv ${clump1}_p${pbest1}.txt ${clump1}_best.txt
rm ${r21}_p*
rm ${pred1}.log
rm ${pred1}.txt
rm ${pred2}.txt
rm ${clump1}*

SBLUP
SBLUP is performed by GCTA. The code is similar to the example of GCTA.

gcta=/your/path/gcta_1.91.7beta/gcta64
bfiletr=/your/path/tets_dat/train
summf=/your/path/tets_dat/summary_gemma
refld=/your/path/tets_dat/ref
herit=0.1
m=`cat ${summf}.assoc.txt | wc -l`
n=`cat ${bfiletr}.fam | wc -l`
cojo=$(echo "${m}*(1/${herit}-1)" | bc -l)
sed -i '1d' /your/path/test_dat/${summf}.assoc.txt
awk '{print $2,$6,$7,$8,$9,$10,$11,$5}' ${summf}.assoc.txt > ${summf}.ma
sed -i '1i\SNP A1 A2 freq b se p N' ${summf}.ma
est3=/your/path/esteff
${gcta} --bfile ${refld} --chr 1 --cojo-file ${summf}.ma --cojo-sblup ${cojo} --cojo-wind 200 --thread-num ${thread} --out ${est3}
pred3=/your/path/pheno
plink-1.9 --bfile ${bfilete} --score ${est}.sblup.cojo 1 2 4 sum --out ${pred3}
rm ${est3}.log
rm ${est3}*badsnps
rm ${est3}.sblup.cojo
rm ${pred3}.log

LDpred LDpred is performed by manual of LDpred, including the cutoff and radius.

bfiletr=/your/path/train
bfileval=/your/path/validate
bfilete=/your/path/test
n=`cat ${bfiletr}.fam | wc -l`
herit=0.1
## transfer gemma format to LDpred format
Rscript ${toldpred} --gemma ${summf}.assoc.txt  --ldpred ${summf}_LDpred.sumstat
## validate
coord1=/path/summary_cv.HDF5
${py} ${ldpred} coord --gf ${bfileval} --ssf ${summf}_LDpred.sumstat --out ${coord1} --N ${n} --ssf-format STANDARD --max-freq-discrep 0.2
ldest=/your/path/ld
infest=/your/path/esteff
${py} ${ldpred} gibbs --cf ${coord1} --ldr ${ldr} --ldf ${ldest} --out ${infest} --N ${n} --h2 ${herit} \
--f 1 0.3 0.1 0.03 0.01 0.003 0.001 0.0003 0.0001 
pred41=/path/pheno
r24=/path/r2
for p in 1.0000e+00 3.0000e-01 1.0000e-01 3.0000e-02 1.0000e-02 3.0000e-03 1.0000e-03 3.0000e-04 1.0000e-04;do
plink-1.9 --bfile ${bfileval} --score ${infest}_LDpred_p${p}.txt 3 4 7 header sum --out ${pred41}_p${p}
Rscript ${simpred} --pheno ${pred41}_p${p}.profile --r2 ${r24}_p${p}.txt
rm ${pred41}_p${p}.log
rm ${pred41}_p${p}.nopred
rm ${pred41}_p${p}.profile
done
pbestf4=/your/path/r2_best.txt
Rscript ${max2} --r2 ${r24}_p --pbest ${pbestf4} 
pbest4=`cat ${pbestf4}`

## test
coord2=/your/path/summary_ref.HDF5
${py} ${ldpred} coord --gf ${refld} --ssf ${summf}_LDpred.sumstat --out ${coord2} --N ${n} --ssf-format STANDARD --max-freq-discrep 0.2
${py} ${ldpred} gibbs --cf ${coord2} --ldr ${ldr} --ldf ${ldest} --out ${infest} --N ${n} --h2 ${herit} --f ${pbest4}
predInf4=/your/path/pheno_inf
predp4=/your/path/pheno_pbest
plink-1.9 --bfile ${bfilete} --score ${infest}_LDpred-inf.txt 3 4 7 sum --out ${predInf4}
plink-1.9 --bfile ${bfilete} --score ${infest}_LDpred_p${pbest4}.txt 3 4 7 sum --out ${predp4}
rm ${coord1}
rm ${coord2}
rm ${ldest}*.pkl.gz
rm ${infest}_LDpred*.txt
rm ${predInf4}.nopred
rm ${predInf4}.log
rm ${predp4}.nopred
rm ${predp4}.log
rm ${r24}*

lassosum
We use the lassosum by the R package lassosum. The lassosum function is in the lassosum file folder.

res5=/your/path/res
Rscript ${lassosum} --summ ${summf}.assoc.txt --ref ${refld} --valid ${bfileva} --test ${bfilete} --res ${res5}.txt

lasso (sample size = 2,000)
We use the lasso algorithm in Plink-1.9. The code is as following:

## lasso
esttr3=/your/path/train
plink-1.9 --bfile ${bfiletr} --lasso ${herit} --out ${esttr3}
predt=/your/path/pheno
plink-1.9 --bfile ${bfilete} --score ${esttr3}.lasso 2 3 4 header sum --out ${predt}
rm ${predt}.log
rm ${esttr3}*

BSLMM (sample size = 2,000)
We use the BSLMM in the GEMMA software. bslmm.R is in the simulation_small_size file folder.

## BSLMM
gemma=/your/path/gemma-0.98-linux-static
bslmmc=/your/path/simulation_small_size/bslmm.R
cd /your/path/
snpeff1=snpeff
${gemma} -bfile ${bfiletr} -bslmm 1 -o ${snpeff1} -w 6000 -s 2000 -rpace 1000
snpeff2=/your/path/output/snpeff
snpeff3=/your/path/snpeff
Rscript ${bslmmc} --bim ${bfiletr}.bim --eff ${snpeff2}.param.txt --effc ${snpeff3}.txt
predt=/your/path/pheno
plink-1.9 --bfile ${bfilete} --score ${snpeff3}.txt 1 2 3 sum --out ${predt}
rm ${snpeff2}*
rm ${snpeff3}*
rm ${predt}.log

All the simulation results are included in the manuscript.

UK Biobank data process

For SBSLMM, we also process the UKB data. Following the Neale lab and official file of UKB, we perform qualtiy control of samples and SNPs.

SNP QC

We use snp_qc.R and toplink.sh to perform the SNP QC. The details are shown in the manuscript.

MAF
We select SNPs with MAF>0.2.
INFO score
We select the SNPs with INFO score>0.8.
Duplicate
We delete the duplicated SNPs.
HWE
We delete the SNPs with HWE<1e-6.
Missing rate
We delete the SNPS with Pm>0.05.

Sample QC

We use mkpheno.R to select the samples. The details are shown in the manuscript.
The flow chart is as following:

Real data analysis

Here, we only show the code different to simulation, including P+T, LDpred and lassosum. Rather than simulation, real data analysis is should use the whole genome information. The code is as following:

### P+T
toplinkf=/your/path/PT/toplinkf.R
clumpf=/your/path/PT/toclumpf.R
sumpred=/your/path/PT/sumPred.R
max=/your/path/PT/max.R
res=/your/path/PT/res.R
## to plink format
summf=your/path/summary_gemma
for ((chr=1;chr<23;chr++));do
Rscript ${toplinkf} --gemma ${summf}_chr${chr}.assoc.txt --plink ${summf}_chr${chr}.plink.txt
done
### different cutoff values
clump=/your/path/PT/summary
pred=/your/path/PT/pheno
r2=/your/path/PT/r2
for p in 5e-8 1e-6 1e-4 1e-3 1e-2 5e-2 1e-1 2e-1 5e-1 1.0 ; do 
for ((chr=1;chr<23;chr++)); do
## clumping
sub=your/path/subsample/chr${chr}
Rscript ${clumpf} --gemma ${summf}_chr${chr}.assoc.txt --plink ${summf}_chr${chr}.plink.txt --ref ${sub} --pth ${p} \
--clump ${clump}_p${p}_chr${chr} --pred ${pred}_p${p}_chr${chr}
rm ${clump}_p${p}_chr${chr}.clumped
rm ${clump}_p${p}_chr${chr}.log
rm ${pred}_chr${chr}.log
rm ${pred}_chr${chr}.nopred
done
## predict for each threshold
Rscript ${sumpred} --pred ${pred}_p${p}_chr --pheno ${PHENO} --r2 ${r2}_p${p}.txt
rm ${pred}_p${p}_chr*.profile
rm ${pred}_p${p}_chr*.log
done

### LDpred
toldpred=/your/path/ldpred/toldpred.R
py=/your/path/py3/bin/python
ldpred=/your/path/ldpred/LDpred.py
split=/your/path/ldpred/split_res2.R
calcr2=/your/path/ldpred/ldpred/r2.R
max=/your/path/ldpred/ldpred/max.R
ldr=200
summf=your/path/summary_gemma
cat ${summf}_chr*.assoc.txt > ${summf}.assoc.txt
Rscript ${toldpred} --gemma ${summf}.assoc.txt  --ldpred ${summf}_LDpred.sumstat
## ldpred (cross validation)
n_file=/your/path/pheno_n_train.txt
n=`sed -n "${cross}, 1p" ${n_file}`
sub=/your/path/subsample/merge
coord1=/your/path/summary_cv_cross${cross}.HDF5
${py} ${ldpred} coord --gf ${sub} --ssf ${summ}_LDpred.sumstat --out ${coord1} --N ${n} --ssf-format STANDARD --max-freq-discrep 0.2
ldest=/your/path/ld_ldr${ldr}
infest=/your/path/esteff_ldr${ldr}
${py} ${ldpred} gibbs --cf ${coord1} --ldr ${ldr} --ldf ${ldest} --out ${infest} --N ${n} --h2 ${h2} --f 1 0.3 0.1 0.03 0.01 0.003 0.001 0.0003 0.0001 
## cross validation
pred=/your/path/pheno
r2=/your/path/r2
for p in 1.0000e+00 3.0000e-01 1.0000e-01 3.0000e-02 1.0000e-02 3.0000e-03 1.0000e-03 3.0000e-04 1.0000e-04;do
plink-1.9 --bfile ${sub} --score ${infest}_LDpred_p${p}.txt 3 4 7 header sum --out ${pred}_p${p}
Rscript ${calcr2} --pred ${pred}_p${p}.profile --pheno ${PHENO} --r2 ${r2}_p${p}.txt
rm ${pred}_p${p}.log
rm ${pred}_p${p}.nopred
rm ${pred}_p${p}.profile
done
pbestf=/your/path/r2_pbest.txt
Rscript ${max} --r2 ${r2}_p --pbest ${pbestf} 
pbest=`cat ${pbestf}`
## ldpred (reference data)
ref2=/your/path/frq/merge
coord2=/your/path/summary_ref_ldr${ldr}.HDF5
${py} ${ldpred} coord --gf ${ref2} --ssf ${summ}_LDpred.sumstat --out ${coord2} --N ${n} --ssf-format STANDARD --max-freq-discrep 0.2
${py} ${ldpred} gibbs --cf ${coord2} --ldr ${ldr} --ldf ${ldest} --out ${infest} --N ${n} --h2 ${h2} --f ${pbest}
## split to chr
Rscript ${split} --tot ${infest}_LDpred-inf.txt --sp ${infest}_inf_chr
Rscript ${split} --tot ${infest}_LDpred_p${pbest}.txt --sp ${infest}_pbest_chr
## remove
rm ${coord2}
rm ${ldest}*.pkl.gz
rm ${infest}_LDpred*.txt
rm ${summ}.assoc.txt
rm ${summ}_LDpred.sumstat

### lassosum
lassosum=/your/path/lassosum/lassosum_real.R
ref=/your/path/merge
val=/your/path/merge
thread=4
res=/your/path/pheno_cross${cross}_chr
Rscript ${lassosum} --summ ${summf}.assoc.txt --pheno ${PHENO} --thread ${thread} --cross ${cross} --res ${res}
rm ${summf}.assoc.txt

biostatYu/SBSLMM2