/mixer

Causal Mixture Model for GWAS summary statistics

Primary LanguageCGNU General Public License v3.0GPL-3.0

Contents

Introduction

This folder contains a Python port of MiXeR, wrapping the same C/C++ core as we previously used from MATLAB. This is work in progress, but eventually it should singificantly improve user experience, as Python allows much simpler installation procedures, makes it less error prone, allows to implement well-documented command-line interface (python mixer.py --help), and provide visualization.

Input data for MiXeR consists of summary statistics from a GWAS, and a reference panel. MiXeR format for summary statistics is compatible with LD Score Regression (i.e. the sumstats.gz files), and for those users who are already familiar with munge_sumstats.py script we recommend to use LD Score Regression pipeline to prepare summary statistics. At the same time, we encourage everyone to take a look our own pipeline for processing summary statistics. For the reference panel we recommend to use 1000 Genomes Phase3 data, pre-processed according to LD Score Regression pipeline, and available for download from LDSC website. Further details are given in Data downloads and Data preparation sections.

Once you have all input data in MiXeR-compatible format you may proceed with running univariate (fit1, test1) and cross-trait (fit2, test2) analyses, as implemented in mixer.py command-line interface. The results will be saved as .json files. To visualize the results we provide a script in python, but we encourage users to write their own scripts that understand the structure of .json files, process the results. Further details are given in Run MiXeR, MiXeR options and Visualize MiXeR results sections.

If you encounter an issue, or have further questions, please create a new issue ticket.

If you use MiXeR software for your research publication, please cite the following paper(s):

  • for univariate analysis: D. Holland et al., Beyond SNP Heritability: Polygenicity and Discoverability Estimated for Multiple Phenotypes with a Univariate Gaussian Mixture Model, PLOS Genetics, 2020, https://doi.org/10.1371/journal.pgen.1008612
  • for cross-trait analysis: O.Frei et al., Bivariate causal mixture model quantifies polygenic overlap between complex traits beyond genetic correlation, Nature Communications, 2019, https://www.nature.com/articles/s41467-019-10310-0

MiXeR versions:

  • v1.0 - public release for Matlab
  • v1.1 - internal release for Matlab and Python
  • v1.2 - internal release for Python (Matlab support removed)
  • v1.3 - internal release for Python (change HapMap3 to twenty random sets of ~600K SNPs each)

Install MiXeR

Singularity containers

Update May 2021: MiXeR v1.3 is now available as a singularity container. See mixer_simu.md and mixer_real.md for usage examples. If you are new to singularity containers, please refer to hello.md to get started.

The steps below provide an alternative setup (without singularity containers).

Prerequisites

  • Linux environment (tested on CentOS release 6.9, Ubuntu 18.04).
  • Python 3 (tested with Python 3.7), with numpy, scipy>=1.2.1, numdifftools, matplotlib_venn
  • Reasonably new GCC compiler (tested with gcc/6.1.0) or an Intel C++ compiler (tested with intel-2018b)
  • Boost libraries (https://www.boost.org/, tested with 1.68.0)

If you are an experienced C++ programmer it shouldn't be difficult to compile MiXeR core in MAC or Windows. If you've made it work, please share your changes via pull request.

Hardware requirements

MiXeR software is very CPU intensive. Minimal memory requirement is to have 32 GB of RAM available to MiXeR. MiXeR efficiently uses multiple CPUs. We recommend to run MiXeR on a system with at least 16 physical cores.

Install on Linux using pre-built binaries

Not available yet.

Build from source - Linux

The exact steps depend on your build environment.

  • If you work in HPC environment with modules system, you can load some existing combination of modules that include Boost libraries:
    module load CMake/3.15.3-GCCcore-8.3.0 Boost/1.73.0-GCCcore-8.3.0 Python/3.7.4-GCCcore-8.3.0 # TSD (gcc)
    module load Boost/1.71.0-GCC-8.3.0 Python/3.7.4-GCCcore-8.3.0 CMake/3.12.1                   # SAGA (gcc)  
    module load Boost/1.68.0-intel-2018b-Python-3.6.6 Python/3.6.6-intel-2018b CMake/3.12.1      # SAGA (intel)
    
  • Alternatively, you may download and compile Boost libraries yourself:
    cd ~ && wget https://dl.bintray.com/boostorg/release/1.69.0/source/boost_1_69_0.tar.gz 
    tar -xzvf boost_1_69_0.tar.gz && cd boost_1_69_0
    ./bootstrap.sh --with-libraries=program_options,filesystem,system,date_time
    ./b2 --clean && ./b2 --j12 -a
    
  • Clone and compile MiXeR repository
    cd ~ && git clone --recurse-submodules -j8 https://github.com/precimed/mixer.git
    mkdir mixer/src/build && cd mixer/src/build
    cmake .. && make bgmg -j16                                   # if you use GCC compiler
    CC=icc CXX=icpc cmake .. && make bgmg -j16                   # if you use Intel compiler
    cmake .. -DBOOST_ROOT=$HOME/boost_1_69_0 && make bgmg -j16   # if you use locally compiled boost
    
    

Data downloads

  • Summary statistics, for example

    • Schizophrenia GWAS by Ripke at al. (2014) (search for Download 49 EUR samples here)
    • Educational Attainment GWAS by Lee et al. (2018) (search for GWAS_EA_excl23andMe.txt here)
  • Download reference data from this URL

    wget https://data.broadinstitute.org/alkesgroup/LDSCORE/1000G_Phase3_plinkfiles.tgz
    wget https://data.broadinstitute.org/alkesgroup/LDSCORE/w_hm3.snplist.bz2
    tar -xzvf 1000G_Phase3_plinkfiles.tgz
    bzip2 -d w_hm3.snplist.bz2
    

Data preparation

  • Summary statistics (NIRD: /projects/NS9114K/MMIL/SUMSTAT/TMP/nomhc/)

    • We recommed to use our own scripts to pre-process summary statistcs (clone from here):
      python sumstats.py csv --sumstats daner_PGC_SCZ49.sh2_mds10_1000G-frq_2.gz --out PGC_SCZ_2014_EUR.csv --force --auto --head 5 --ncase-val 33640 --ncontrol-val 43456 
      python sumstats.py zscore --sumstats PGC_SCZ_2014_EUR.csv | \
      python sumstats.py qc --exclude-ranges 6:26000000-34000000 --max-or 1e37 | \
      python sumstats.py neff --drop --factor 4 --out PGC_SCZ_2014_EUR_qc_noMHC.csv --force 
      gzip PGC_SCZ_2014_EUR_qc_noMHC.csv
      
      python sumstats.py csv --sumstats GWAS_EA_excl23andMe.txt.gz --out SSGAC_EDU_2018_no23andMe.csv --force --auto --head 5 --n-val 766345
      python sumstats.py zscore --sumstats SSGAC_EDU_2018_no23andMe.csv | \
      python sumstats.py qc --exclude-ranges 6:26000000-34000000 --out SSGAC_EDU_2018_no23andMe_noMHC.csv --force
      gzip SSGAC_EDU_2018_no23andMe_noMHC.csv  
      
    • [DEPRECATED] If you use MiXeR v1.1 and v1.2, it will recognize summary statistics in LDSC format as described here. In brief, each trait must be represented as a single table containing columns SNP, N, Z, A1, A2. Thus, it is possible to use munge_sumstats.py script as described here. This might be convenient for users who are already familiar with LDSR functionality. In MiXeR v1.3 the format for GWAS summary statistics files is the same, but now I advice against using HapMap3 to constrain the set of SNPs for the fit procedure. Instead, MiXeR should receive a full set of summary statistics from a GWAS on imputed genotype data. Also, note that for case/control munge_sumstats.py from LD Score Regression generate sample size as a sum n = ncase + ncontrol. We recommend to use neff = 4 / (1/ncase + 1/ncontrol) to account for imbalanced classes.
  • Generate .ld and .snps files from the reference panel. Note that this step optional if you work with EUR-based summary statistics. To use EUR reference, simply download the files from here or take it from NIRD (/projects/NS9114K/MMIL/SUMSTAT/LDSR/1000G_EUR_Phase3_plink) or TSD ( ) if you have access. NB! Download size is around 24 GB.

    • Run python mixer.py ld to calculate linkage disequilibrium information in a genotype reference panel. The following command must be run once for each chromosome.
    python3 <MIXER_ROOT>/precimed/mixer.py ld \
       --lib <MIXER_ROOT>/src/build/lib/libbgmg.so \
       --bfile LDSR/1000G_EUR_Phase3_plink/1000G.EUR.QC.<chr_label> \
       --out LDSR/1000G_EUR_Phase3_plink/1000G.EUR.QC.<chr_label>.run4.ld \
       --r2min 0.05 --ldscore-r2min 0.05 --ld-window-kb 30000
    

    The output is written to LDSR/1000G_EUR_Phase3_plink/1000G.EUR.QC.<chr_label>.run4.ld file, and log details into LDSR/1000G_EUR_Phase3_plink/1000G.EUR.QC.<chr_label>.run4.ld.log file. The resulting files contain information about alleling LD r2 correlations, LD scores, and allele frequencies of the variants in the reference panel passed as --bfile argument. The files DO NOT contain any individual-level information. When you store the resulting .ld file, it is important to keep it along side with corresponding .bim file, as information about marker name (SNP rs#), chromosome, position, and alleles (A1/A2) is NOT encoded in .ld file.

    • To generate 1000G.EUR.QC.prune_maf0p05_rand2M_r2p8.repNN.snps files, repeat the following in a loop for
      export REP=1 # repeat for REP in 1..20
      python3 <MIXER_ROOT>/precimed/mixer.py snps \
         --lib <MIXER_ROOT>/src/build/lib/libbgmg.so \
         --bim-file LDSR/1000G_EUR_Phase3_plink/1000G.EUR.QC.@.bim \
         --ld-file LDSR/1000G_EUR_Phase3_plink/1000G.EUR.QC.@.run4.ld \
         --out LDSR/1000G_EUR_Phase3_plink/1000G.EUR.QC.prune_maf0p05_rand2M_r2p8.rep${SLURM_ARRAY_TASK_ID}.snps \
         --maf 0.05 --subset 2000000 --r2 0.8 --seed ${SLURM_ARRAY_TASK_ID}
      
      This can be done as a job array, see this script for an example. Note that in the code above the @ symbol does NOT need to be replace with an actual chromosome. It should stay as @ in your command.

Run MiXeR

Univariate analysis

Fit the model:

python3 <MIXER_ROOT>/precimed/mixer.py fit1 \
      --trait1-file SSGAC_EDU_2018_no23andMe_noMHC.csv.gz \
      --out SSGAC_EDU_2018_no23andMe_noMHC.fit.rep${SLURM_ARRAY_TASK_ID} \
      --extract LDSR/1000G_EUR_Phase3_plink/1000G.EUR.QC.prune_maf0p05_rand2M_r2p8.rep${SLURM_ARRAY_TASK_ID}.snps \
      --bim-file LDSR/1000G_EUR_Phase3_plink/1000G.EUR.QC.@.bim \
      --ld-file LDSR/1000G_EUR_Phase3_plink/1000G.EUR.QC.@.run4.ld \
      --lib  <MIXER_ROOT>/src/build/lib/libbgmg.so \

Apply the model to the entire set of SNPs, without constraining to LDSR/w_hm3.justrs:

python3 <MIXER_ROOT>/precimed/mixer.py test1 \
      --trait1-file SSGAC_EDU_2018_no23andMe_noMHC.csv.gz \
      --load-params-file SSGAC_EDU_2018_no23andMe_noMHC.fit.rep${SLURM_ARRAY_TASK_ID}.json \
      --out SSGAC_EDU_2018_no23andMe_noMHC.test.rep${SLURM_ARRAY_TASK_ID} \
      --bim-file LDSR/1000G_EUR_Phase3_plink/1000G.EUR.QC.@.bim \
      --ld-file LDSR/1000G_EUR_Phase3_plink/1000G.EUR.QC.@.run4.ld \
      --lib  <MIXER_ROOT>/src/build/lib/libbgmg.so \

The results will be saved <out_file>.json file. Repeat the above analysis for the second trait (PGC_SCZ_2014_EUR_qc_noMHC.csv.gz).

To visualize the results:

python precimed/mixer_figures.py combine --json PGC_SCZ_2014_EUR_qc_noMHC.fit.rep@.json --out combined/PGC_SCZ_2014_EUR.fit
python precimed/mixer_figures.py one --json PGC_SCZ_2014_EUR.json --out PGC_SCZ_2014_EUR --statistic mean std

Bivariate (cross-trait) analysis

Fit the model:

python3 <MIXER_ROOT>/python/mixer.py fit2 \
      --trait1-file PGC_SCZ_2014_EUR_qc_noMHC.csv.gz \
      --trait2-file SSGAC_EDU_2018_no23andMe_noMHC.csv.gz \
      --trait1-params-file PGC_SCZ_2014_EUR_qc_noMHC.fit.rep${SLURM_ARRAY_TASK_ID}.json \
      --trait2-params-file SSGAC_EDU_2018_no23andMe_noMHC.fit.rep${SLURM_ARRAY_TASK_ID}.json \
      --out PGC_SCZ_2014_EUR_qc_noMHC_vs_SSGAC_EDU_2018_no23andMe_noMHC.fit.rep${SLURM_ARRAY_TASK_ID} \
      --extract LDSR/1000G_EUR_Phase3_plink/1000G.EUR.QC.prune_maf0p05_rand2M_r2p8.rep${SLURM_ARRAY_TASK_ID}.snps \
      --bim-file LDSR/1000G_EUR_Phase3_plink/1000G.EUR.QC.@.bim \
      --ld-file LDSR/1000G_EUR_Phase3_plink/1000G.EUR.QC.@.run4.ld \
      --lib  <MIXER_ROOT>/src/build/lib/libbgmg.so \

Apply the model to the entire set of SNPs, without constraining to LDSR/w_hm3.justrs:

python3 <MIXER_ROOT>/python/mixer.py test2 \
      --trait1-file PGC_SCZ_2014_EUR_qc_noMHC.csv.gz \
      --trait2-file SSGAC_EDU_2018_no23andMe_noMHC.csv.gz \
      --load-params-file PGC_SCZ_2014_EUR_qc_noMHC_vs_SSGAC_EDU_2018_no23andMe_noMHC.fit.rep${SLURM_ARRAY_TASK_ID}.json \
      --out PGC_SCZ_2014_EUR_qc_noMHC_vs_SSGAC_EDU_2018_no23andMe_noMHC.test.rep${SLURM_ARRAY_TASK_ID} \
      --bim-file LDSR/1000G_EUR_Phase3_plink/1000G.EUR.QC.@.bim \
      --ld-file LDSR/1000G_EUR_Phase3_plink/1000G.EUR.QC.@.run4.ld \
      --lib  <MIXER_ROOT>/src/build/lib/libbgmg.so \

Note that these parameters point to the results of univariate analysis for both traits, so those must be generated first. The results will be saved <out_file>.json file.

To visualize the results (where <prefix> is, for example, PGC_SCZ_2014_EUR_qc_noMHC_vs_SSGAC_EDU_2018_no23andMe_noMHC):

python precimed/mixer_figures.py combine --json <prefix>.fit.rep@.json --out <prefix>.fit
python precimed/mixer_figures.py combine --json <prefix>.test.rep@.json --out <prefix>.test
python precimed/mixer_figures.py two --json-fit <prefix>.fit.json --json-test <prefix>.test.json --out <out_file> --statistic mean std

MiXeR options

Run --help commands to list available options and their description.

python3 mixer.py ld --help
python3 mixer.py snps --help
python3 mixer.py fit1 --help
python3 mixer.py test1 --help
python3 mixer.py fit2 --help
python3 mixer.py test2 --help
python3 mixer.py perf --help

Visualize MiXeR results

First step is to average the results across 20 runs with mixer_figures.py combine, which works for univariate (fit1, test1) and bivariate (fit2, test2) runs:

python precimed/mixer_figures.py combine --json <prefix>.fit.rep@.json --out <prefix>.fit

The resulting .json files can be converted to figures and .csv tables via the following commands (one for univariate, two for bivariate; each of these commands accept .json files from fit and test steps).

python precimed/mixer_figures.py one --json <out_file>.json --out <out_file>
python precimed/mixer_figures.py two --json <out_file>.json --out <out_file>
python precimed/mixer_figures.py two --json-fit <out_file_fit>.json --json-test <out_file_test>.json --out <out_file>

For the two command, instead of --json, it is possible to specify --json-fit and --json-test separately. This allows to combine negative log-likelihood plot (available in fit2 only) and QQ plots (available in test2 only). Note that all --json accept wildcards (*) or a list of multiple files. This allows to generate .csv tables containing results from multiple MiXeR runs.

MiXeR results format

MiXeR produces the following results, as described in the original publication.

  • Univariate .csv file, including model parameters and AIC/BIC values,
  • Univariate QQ plots
  • Univariate MAF- and LD-stratified QQ plots
  • Univariate Power curves
  • Bivariate .csv file, including model parameters, AIC/BIC values and Dice coefficient
  • Bivariate stratified QQ plots (cross-trait enrichment)
  • Bivariate density plots
  • Bivariate negative log-likelihood function

These output is described in the cross-trait MiXeR publication.

AIC BIC interpretation

.csv files generated by python precimed/mixer_figures.py commands contain AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) values. To generate AIC / BIC values you should point mixer_figures.py to json files produced by fit1 or fit2 steps (not those from test1 or test2 steps).

The idea of model selection criteia (both AIC and BIC) is to find whether the input data (in our case the GWAS summary statistics) have enough statistical power to warrant a more complex model - i.e. a model with additional free parameters that need to be optimized from the data. For example, the LDSR model has two free parameters - the slope and an intercept. The univariate MiXeR model has three parameters (pi, sig2_beta and sig2_zero). Naturally, having an additional free parameters allows MiXeR to fit the GWAS data better compared to LDSR model, however it needs to be substantially better to justify an additional complexity. AIC and BIC formalize this trade-off between model's complexity and model's ability to describe input data.

The difference between AIC and BIC is that BIC is a more conservative model selection criterion. Based on our internal use of MiXeR, it is OK discuss the resulsult if only AIC supports MiXeR model, but it's important point as a limitation that the input signal (GWAS) has borderline power to fit MiXeR model.

For the univariate model, AIC / BIC values are described in the cross-trait MiXeR paper (ref). A negative AIC value means that there is not enough power in the input data to justify MiXeR model as compared to LDSR model, and we do do not recommend applying MiXeR in this situation.

For the bivariate model the resulting table contains two AIC, and two BIC values, named as follows: best_vs_min_AIC, best_vs_min_BIC , best_vs_max_AIC, and best_vs_max_BIC. They are explained below - but you may need to develop some intuition to interpret these numbers. Consider taking a look at the figure shown below, containing negative log-likelihood plots. Similar plots were presented in ref, supplementary figure 19. We use such likelihood-cost plots to visualise the performance of the best model vs min and max.

First, let's interpret best_vs_max_AIC value. It uses AIC to compare two models: the best model with polygenic overlap that is shown in the venn diagram (i.e. fitted by MiXeR), versus the max model with maximum possible polygenic overlap given trait's genetic architecture (that is, in the max model the causal variants of the least polygenic trait form a subset of the causal variants in the most polygenic trait). A positive value of best_vs_max_AIC means that best model explains the observed GWAS signal better than max model, despite its additional complexity (here additional complexity comes from the fact that MiXeR has to find an actual value of the polygenic overlap, i.e. estimate the size of the grey area on the venn diagram).

Similarly, best_vs_min_AIC compares the best model versus model with minimal possible polygenic overlap. It is tempting to say that minimal possible polygenic overlap means the same as no shared causal variants, i.e. a case where circles on a venn diagram do not overlap. However, there is a subtle detail in our definition of the minimal possible polygenic overlap in cases where traits show some genetic correlation. Under the assumptions of cross-trait MiXeR model, the presence of genetic correlation imply non-empty polygenic overlap. In our AIC / BIC analysis, we constrain the best model to a specific value of the genetic correlation, obtained by the same procedure as in LDSR model (i.e. assuming an infinitesimal model, as described in ref. In this setting, minimal possible polygenic overlap corresponds to pi12 = rg * sqrt(pi1u * pi2u). best_vs_min_AIC is an important metric - a positive value indicates that the data shows support for existense of the polygenic overlap, beyond the minimal level need to explain observed genetic correlation between traits.

Finally, best_vs_max_BIC and best_vs_max_BIC have the same meaning as the values explained above, but using more stringent BIC criterion instead of AIC.

The last panel on the following figure shows negative log-likelihood plot as a function of polygenic overlap. For the general information about maximum likelihood optimization see here. In our case we plot negative log-likelihood, that's we search for the minimum on the curve (this sometimes is refered to as a cost function, with objective to find parameters that yield the lowest cost).

On the negative log-likelihood plot, the min model is represented by the point furthest to the left, max furthest to the right and best is the lowest point of the curve. (off note, in some other cases log-likelihood plot can be very noisy - then best is still the lowest point, but it doesn't make practical sence as it is just a very noisy estimate - and this is exactly what we want to clarify with AIC / BIC).

The minimum is obtained at n=1.3K shared causal variants - that's our best model, scores at about 25 points (lower is better). A model with least possible overlap has n=0 shared causal variants - that our min model, scored at 33 points. Finally, a model with largest possible overlap has n=4K shared causal variants - that our max model, scores at 50 points. We use AIC (and BIC) to compare best model versus the other models.

GIANT_HEIGHT_2018_UKB_vs_PGC_BIP_2016 json

Upgrade nodes from MiXeR v1.2 to v1.3

  • The source code is nearly identical, but I've changed the procedure to run MiXeR which is described in this README file. Still, you need to updated the code by git pull. If you updated MiXeR in early summer 2020 there will be no changes in the native C++ code (hense no need to re-compile the libbgmg.so binary), but to be safe it's best to execute cd <MIXER_ROOT>/src/build && cmake .. && make command (after module load relevant modules, as described in this section.)
  • If you previously downloaded 1kG reference, you need to download 20 new files called 1000G.EUR.QC.prune_maf0p05_rand2M_r2p8.repNN.snps. Otherwise you need to generate them with mixer.py snps command as described above.
  • If you previously generated your input summary statistics constraining them to HapMap3 SNPs, you need to re-generated them without constraining to HapMap3.
  • Assuming you have a SLURM script for each MiXeR analysis, you need to turn this script into a job array (#SBATCH --array=1-20) which executes 20 times, and use --extract 1000G.EUR.QC.prune_maf0p05_rand2M_r2p8.rep${SLURM_ARRAY_TASK_ID}.snps flag in mixer.py fit1 and mixer.py fit2, also adjusting input/output file names accordingly. Example scripts are available in scripts folder.
  • To process .json files produced by MiXeR, you now first need to run a mixer_figures.py combine step as shown above. This will combine 20 .json files by averaging individual parameter estimates, and calculating standard errors.
  • When you generate figures with mixer_figures.py one and mixer_figures.py two commands and use the "combined" .json files, you'll need to add --statistic mean std to your commands.
  • With MiXeR v1.3 you should expect a slightly lower polygenicity estimate, mainly because of --maf 0.05 constraint on the frequency of genetic variants used in the fit procedure. The rationale for --maf 0.05 filter is to still have a robust selection of SNPs in the fit procedure despite not having HapMap3 filter.
  • With MiXeR v1.3 you should expect a 10 fold increase in CPU resources needed per ran (20 times more runs, but each run is ~600K SNPs which is half the size of HapMap3).