BFGWAS_SS is an updated version of the BFGWAS tool as proposed by Yang J. et. al, AJHG, 2017. BFGWAS_SS implements a more efficient MCMC algorithm using only Pre-calculated or Public GWAS summary data.
- With individual-level GWAS data, this BFGWAS_SS version first compute single variant Z-score statistics (GWAS summary statistics) and LD correlation files from individual-level GWAS data, and then load these two summary statistics files to run MCMC.
- With summary-level GWAS data, this BFGWAS_SS version can read Z-score statistics file and LD correlation files generated from a reference panel to run MCMC.
- BFGWAS_SS only handles non-overlapped annotations, assuming each SNP can only be annotated with one annotation label. A new annotation file format with 6 columns:
#CHROM POS ID REF ALT Annotation
is used now. - BFGWAS_SS runs MCMC in parallel for multiple genome-blocks, which is handled through a
Makefile
generated by a PERL script./bin/gen_mkf.pl
. EM-MCMC algorithm is handled based on dependency jobs in the generatedMakefile
. Users will only need to submit the generatedMakefile
as a job for the analysis results. - BFGWAS_SS is based on a multivariable Bayesian variable selection model, which accounts for non-overlapped functional annotations and LD to fine-map and prioritize GWAS hits.
-
Install required C++ libraries C++ libraries zlib, gsl, eigen3, lapack, atlas, blas are used to develop this tool. Please install these libraries to your system and include the library path
-I[path to libraries]
accordingly in the C++ compilation command line in theMakefile
. -
Compile C++ library
./libStatGen/libStatGen.a
under your system by using the following commands:
cd BFGWAS_SS/libStatGen/;
make clean;
make
- Compile C++ source code for the executible file ./bin/Estep_mcmc that will be used to run the Estep MCMC algorithm, by using the following commands under
BFGWAS_SS/
directory:
cd BFGWAS_SS/;
make clean ;
make
- Even though a compiled executible file ./bin/Estep_mcmc from our cluster is provided on GITHUB, please still compile one for your own system. The
BFGWAS_SS/Makefile
might need to be adapted for one's own system.
- Tool TABIX
- R library data.table, tidyverse
Example data files are provided under ./1KG_example/ExData/
.
- VCF Genotype files are required for using individual-level GWAS data. The VCF genotype files should be one per genome block (variants of the same chromosome should be in the same block), sorted by position, and then zipped by
bgzip
(file names are of[filehead].vcf.gz
), e.g.,./1KG_example/ExData/VCFs/**.vcf.gz
. - All VCF genotype files should be put under the same parent directory.
- Genotype files are supposed to be segmented based on LD. Genome blocks are expected to be approximately independent with ~5K-10K SNPs per block.
- Example segmentation information derived by LDetect are provided in
./1KG_example/ExData/Segmentations/lddetect_1KG_*_hg19.bed
for EUR, AFR, and ASN populations.
- Genome Block Prefix File of VCF genotype files as in
./1KG_example/ExData/fileheads_4region.txt
is required. Each row of the list file is the file head of the VCF file of one genome block as in[filehead].vcf.gz
. Note that the VCF file extension suffix.vcf.gz
should not be included.
Two prior parameters need to have initial values given in the hyper parameter file, e.g., ./1KG_example/ExData/InitValues6.txt
.
- First row of header starts with a
#
sign and each row represents the prior causal probabilitypi
and a scale parameter for effect size variancetau_beta
per annotation. - Default initial
pi
is recomend to set as1e-6
for analyzing ~10M SNPs, or1e-6
for analyzing ~1M SNPs, which will be updated in the M-step. - Default
tau_beta
is recomend to set as1
, which will not be updated in the M-step. The magnitude of Bayesian effect size estimates are recomended to be set at about the same magnitude as marginal effect sizes. A largertau
value will shrink Bayesian effect size estimates more towards0
.
#pi | tau_beta |
---|---|
1e-6 | 1 |
1e-6 | 1 |
... | ... |
The annotation code file provides information about how we want to group unique annotation symbols. The first row will provide the total number of annotation categories we will consider, e.g., ./1KG_example/ExData/AnnoCode6.txt
. Annotation code will start from 0
, and increment by 1
up to n_type -1
.
#n_type | 6 |
---|---|
CODING | 0 |
UTR | 1 |
PROMOTOR | 2 |
... | ... |
The annotation files are corresponding to genome-blocks, and should be gzipped and named as Anno_${filehead}.txt.gz
. See example files under ./1KG_example/ExData/Annotations/
.
- There are 5 columns in the annotation file,
#CHROM POS ID REF ALT ANNO
. - Annotation symbols are stored one per SNP (per row) in the sixth column. Each annotation symbols should have a code value specified in the annotation code file.
- Missing annotations can be leave as blank.
- Phenotype file is required even if one only need to generate LD correlation files, e.g.,
./1KG_example/ExData/phenoAMD_1KG.txt
.- Only samples appear in the phenotype file will be used to generate summary statistics.
- A fake phenotype file with sample IDs in the first column and all
0
values in the second column can be used to generate LD correlation files.
- VCF genotype files are required for generating summary statistics, e.g.,
./1KG_example/ExData/VCFs/
. - LD correlation file are generated and stored per genome block, e.g.,
./1KG_example/Test_Wkdir/LDcorr/
. Upper triangular of the genotype correlation matrix among all nearby SNPs in the specified window are saved in the9th
column ofCORR
. LD correlation files has to be generated using the./bin/Estep_mcmc
tool. - GWAS summary data can be used, which should also be segmented into one file per genome block, e.g.,
./1KG_example/Test_Wkdir/Zscore/
.- Zscore statistic files are of the following format, and the first
7
columns are required in the following order:
- Zscore statistic files are of the following format, and the first
| #CHROM | POS | ID | REF| ALT | N|MAF|Z_SCORE| | ----------|:-------------:|:-------------:| :-------------:| :-------------:| :-------------:| :-------------:| :-------------:| :-------------:| | 19 | 5811718 | rs202128143| TTTG| T| 2504 | 1.438e-01 | -3.170e+00 | -6.335e-02| | ... | ... | ... | ... | ... | ... | ... |
- A Makefile is generated to wrap all jobs of running
E-step
per genome blocks with provided hyper parameter values, and then runM-step
with Bayesian CPP and effect size estiamtes across all genome blocks.
- User will only need to submit a job to run the Makefile and specify the number of jobs to be run in parallel.
-
Bayesian estimates of CPP and genetic effect sizes by last EM iteration are stored in the
${wkdir}/Eoutput/paramtemp${em}.txt
file.- Column
Pi
denotes the Bayesian estimates of CPP - Column
Beta
denotes the Bayesian estimates of genetic effect sizes - Column
mBeta
denotes the marginal genetic effect sizes by single variant analysis - Column
Pval_svt
denotes the p-value by single variant analysis
- Column
-
${wkdir}/Eoutput/EM_result.txt
contains regressionr2
estimate and hyper parameter estimates per EM iteration- Regression
r2
might fall out of the range of[0,1]
, when GWAS summary data were used.
- Regression
-
${wkdir}/Eoutput/hyptemp*.txt
files contain data need to be used in theM-step
- Load GWAS results
- The sum of Bayesian CPP estimates across all SNPs represents the expected number of causal SNPs, which might be considered as a metric to decide initial hyper parameter values.
- Plot Bayesian effect size estimates vs. marginal effect size estimates
- Make manhantton plot
- Plot CPP estimates per annotation group