Bayesian Functional Genome-wide Association Study (BFGWAS)

BFGWAS_SS is an updated version of the BFGWAS tool as proposed by Yang J. et. al, AJHG, 2017. BFGWAS_SS implements a more efficient MCMC algorithm using only Pre-calculated or Public GWAS summary data.

With individual-level GWAS data, this BFGWAS_SS version first compute single variant Z-score statistics (GWAS summary statistics) and LD correlation files from individual-level GWAS data, and then load these two summary statistics files to run MCMC.
With summary-level GWAS data, this BFGWAS_SS version can read Z-score statistics file and LD correlation files generated from a reference panel to run MCMC.
BFGWAS_SS only handles non-overlapped annotations, assuming each SNP can only be annotated with one annotation label. A new annotation file format with 6 columns: #CHROM POS ID REF ALT Annotation is used now.
BFGWAS_SS runs MCMC in parallel for multiple genome-blocks, which is handled through a Makefile generated by a PERL script ./bin/gen_mkf.pl. EM-MCMC algorithm is handled based on dependency jobs in the generated Makefile. Users will only need to submit the generated Makefile as a job for the analysis results.
BFGWAS_SS is based on a multivariable Bayesian variable selection model, which accounts for non-overlapped functional annotations and LD to fine-map and prioritize GWAS hits.

Software Installation
Input Files
Example Usage
Output Files
Analyze Results
Flags

Software Installation

1. Compile Estep_mcmc

Install required C++ libraries C++ libraries zlib, gsl, eigen3, lapack, atlas, blas are used to develop this tool. Please install these libraries to your system and include the library path -I[path to libraries] accordingly in the C++ compilation command line in the Makefile.
Compile C++ library ./libStatGen/libStatGen.a under your system by using the following commands:

cd BFGWAS_SS/libStatGen/;
make clean;
make

Compile C++ source code for the executible file ./bin/Estep_mcmc that will be used to run the Estep MCMC algorithm, by using the following commands under BFGWAS_SS/ directory:

cd BFGWAS_SS/;
make clean ;
make

Even though a compiled executible file ./bin/Estep_mcmc from our cluster is provided on GITHUB, please still compile one for your own system. The BFGWAS_SS/Makefile might need to be adapted for one's own system.

2. Additional Requirements

Tool TABIX
R library data.table, tidyverse

Input Files

Example data files are provided under ./1KG_example/ExData/.

Individual GWAS Input Files

1. Genotype VCF Files

VCF Genotype files are required for using individual-level GWAS data. The VCF genotype files should be one per genome block (variants of the same chromosome should be in the same block), sorted by position, and then zipped by bgzip (file names are of [filehead].vcf.gz), e.g., ./1KG_example/ExData/VCFs/**.vcf.gz.
All VCF genotype files should be put under the same parent directory.
Genotype files are supposed to be segmented based on LD. Genome blocks are expected to be approximately independent with ~5K-10K SNPs per block.
Example segmentation information derived by LDetect are provided in ./1KG_example/ExData/Segmentations/lddetect_1KG_*_hg19.bed for EUR, AFR, and ASN populations.

2. Phenotype File

Summary GWAS Input Files

1. GWAS Zscore File

2. Reference LD File

Other Input Files

1. Genome Block Prefix File

Genome Block Prefix File of VCF genotype files as in ./1KG_example/ExData/fileheads_4region.txt is required. Each row of the list file is the file head of the VCF file of one genome block as in [filehead].vcf.gz. Note that the VCF file extension suffix .vcf.gz should not be included.

2. Prior Parameter File

Two prior parameters need to have initial values given in the hyper parameter file, e.g., ./1KG_example/ExData/InitValues6.txt.

First row of header starts with a # sign and each row represents the prior causal probability pi and a scale parameter for effect size variance tau_beta per annotation.
Default initial pi is recomend to set as 1e-6 for analyzing ~10M SNPs, or 1e-6 for analyzing ~1M SNPs, which will be updated in the M-step.
Default tau_beta is recomend to set as 1, which will not be updated in the M-step. The magnitude of Bayesian effect size estimates are recomended to be set at about the same magnitude as marginal effect sizes. A larger tau value will shrink Bayesian effect size estimates more towards 0.

#pi	tau_beta
1e-6	1
1e-6	1
...	...

3. Annotation Code File

The annotation code file provides information about how we want to group unique annotation symbols. The first row will provide the total number of annotation categories we will consider, e.g., ./1KG_example/ExData/AnnoCode6.txt. Annotation code will start from 0, and increment by 1 up to n_type -1.

#n_type	6
CODING	0
UTR	1
PROMOTOR	2
...	...

4. Annotation File

The annotation files are corresponding to genome-blocks, and should be gzipped and named as Anno_${filehead}.txt.gz. See example files under ./1KG_example/ExData/Annotations/.

There are 5 columns in the annotation file, #CHROM POS ID REF ALT ANNO.
Annotation symbols are stored one per SNP (per row) in the sixth column. Each annotation symbols should have a code value specified in the annotation code file.
Missing annotations can be leave as blank.

Example Usage (`./test_script.sh`)

Step 1. Obtain Summary Statistics

Phenotype file is required even if one only need to generate LD correlation files, e.g., ./1KG_example/ExData/phenoAMD_1KG.txt.
- Only samples appear in the phenotype file will be used to generate summary statistics.
- A fake phenotype file with sample IDs in the first column and all 0 values in the second column can be used to generate LD correlation files.
VCF genotype files are required for generating summary statistics, e.g., ./1KG_example/ExData/VCFs/.
LD correlation file are generated and stored per genome block, e.g., ./1KG_example/Test_Wkdir/LDcorr/. Upper triangular of the genotype correlation matrix among all nearby SNPs in the specified window are saved in the 9th column of CORR. LD correlation files has to be generated using the ./bin/Estep_mcmc tool.
GWAS summary data can be used, which should also be segmented into one file per genome block, e.g., ./1KG_example/Test_Wkdir/Zscore/.
- Zscore statistic files are of the following format, and the first 7 columns are required in the following order:

| #CHROM | POS | ID | REF| ALT | N|MAF|Z_SCORE| | ----------|:-------------:|:-------------:| :-------------:| :-------------:| :-------------:| :-------------:| :-------------:| :-------------:| | 19 | 5811718 | rs202128143| TTTG| T| 2504 | 1.438e-01 | -3.170e+00 | -6.335e-02| | ... | ... | ... | ... | ... | ... | ... |

Step 2. Generate Make File

A Makefile is generated to wrap all jobs of running E-step per genome blocks with provided hyper parameter values, and then run M-step with Bayesian CPP and effect size estiamtes across all genome blocks.

Step 3. Run Make File

User will only need to submit a job to run the Makefile and specify the number of jobs to be run in parallel.

Output Files

Bayesian estimates of CPP and genetic effect sizes by last EM iteration are stored in the ${wkdir}/Eoutput/paramtemp${em}.txt file.
- Column Pi denotes the Bayesian estimates of CPP
- Column Beta denotes the Bayesian estimates of genetic effect sizes
- Column mBeta denotes the marginal genetic effect sizes by single variant analysis
- Column Pval_svt denotes the p-value by single variant analysis
${wkdir}/Eoutput/EM_result.txtcontains regression r2 estimate and hyper parameter estimates per EM iteration
- Regression r2 might fall out of the range of [0,1], when GWAS summary data were used.
${wkdir}/Eoutput/hyptemp*.txt files contain data need to be used in the M-step

Analyze Results (`/1KG_example/AnalyzeResults/Analysis.r`)

Load GWAS results
The sum of Bayesian CPP estimates across all SNPs represents the expected number of causal SNPs, which might be considered as a metric to decide initial hyper parameter values.
Plot Bayesian effect size estimates vs. marginal effect size estimates
Make manhantton plot
Plot CPP estimates per annotation group

yjingj/BFGWAS_SS