This package contains the client software for an end-to-end multiparty computation (MPC) protocol for secure GWAS as described in: Secure genome-wide association analysis using multiparty computation Hyunghoon Cho, David J. Wu, Bonnie Berger Nature Biotechnology, 2018 This software is written in C++ and has been tested in Ubuntu 17.04. Dependencies: clang++ compiler (3.9; https://clang.llvm.org/) GMP library (6.1.2; https://gmplib.org/) libssl-dev package (1.0.2g-1ubuntu11.2) NTL package (10.3.0; http://www.shoup.net/ntl/) Notes on NTL: We made a few modifications to NTL for our random streams. Copy the contents of code/NTL_mod/ into the NTL source directory as follows before compiling NTL: Copy ZZ.cpp into NTL_PACKAGE_DIR/src/ Copy ZZ.h into NTL_PACKAGE_DIR/include/NTL/ In addition, we recommend setting NTL_THREAD_BOOST=on during the configuration of NTL to enable thread-boosting. Compilation: First, update the paths in code/Makefile: CPP points to the clang++ compiler executable. INCPATHS points to the header files of installed libraries. LDPATH contains the .a files of installed libraries. To compile, run ./make inside the code/ directory. This will create three executables of interest: bin/GenerateKey bin/DataSharingClient bin/GwasClient This process should take only a few seconds. How to run: Our MPC protocol consists of four entities: SP, CP0, CP1, and CP2. Study participants (SP) provide input data to the protocol, and three computing parties (CP0, CP1, CP2) interactively carry out GWAS. More detailed description is provided in our manuscript. Note we treat SP as a single entity that holds all input genomes and phenotypes, but it is straightforward to generalize this setup to the crowdsourcing scenario where multiple SPs securely share their data with the CPs. An instance of the client program is created for each involved party on different machines, where the ID of the corresponding party is provided as an input argument: 0=CP0, 1=CP1, 2=CP2, 3=SP. These multiple instances of the client will interact over the network to jointly carry out the MPC protocol. For testing purposes, some (or all) of the instances may be run on the same machine. ==== Step 1: Setup Shared Random Keys ==== Secure communication channels needed for the overall protocol are: CP0 <-> CP1, CP0 <-> CP2, CP1 <-> CP2, CP1 <-> SP, CP2 <-> SP Use GenerateKey to obtain a secret key for each pair, which should be named: P0_P1.key, P0_P2.key, P1_P2.key, P1_P3.key, P2_P3.key The syntax for running GenerateKey is as follows: ./GenerateKey OUTPUT_FILENAME.key In addition, generate global.key and share it with all parties. We provide pre-generated keys in key/ directory. In practice, each party will have the keys for only the channels they are involved in. ==== Step 2: Setup Parameters ==== We provide example parameter settings in: par/test.par.PARTY_ID.txt For more information about each parameter, please consult code/param.h and Supplementary Information of our publication. For a test run, update the following parameters according to your network environment and leave the rest: PORT_* IP_ADDR_* Make sure that the specified ports are not blocked by the firewall. ==== Step 3: Setup Input Data ==== On the machine where the SP instance will be running, the data set should be available in plaintext. We provide an example data set in test_data/ directory. The required format is as follows: geno.txt: a minor allele dosage matrix with NUM_INDS rows and NUM_SNPS columns. -1 indicates missing genotypes. pheno.txt: a phenotype vector with NUM_INDS rows. We assume the phenotype is binary, but our protocol can be extended to support continuous values. cov.txt: a covariate matrix with NUM_INDS rows and NUM_COVS columns. We assume all covariates are binary, but our protocol can be extended to support continuous values. In addition, a file (see pos.txt) containing the genomic positions of SNPs in geno.txt (one-per-line in the same order) is considered public and should be shared with all parties. The path to this file is provided in the parameter SNP_POS_FILE. ==== Step 4: Initial Data Sharing ==== On the respective machines, cd into code/ and run DataSharingClient for each party in the following order: CP0: bin/DataSharingClient 0 ../par/test.par.0.txt CP1: bin/DataSharingClient 1 ../par/test.par.1.txt CP2: bin/DataSharingClient 2 ../par/test.par.2.txt SP: bin/DataSharingClient 3 ../par/test.par.3.txt ../test_data/ During this step, SP securely shares its data with CP1 and CP2. The resulting shares are stored in the cache files: {CACHE_FILE_PREFIX}_input_geno.bin {CACHE_FILE_PREFIX}_input_pheno_cov.bin For the toy dataset, this process should take only a few seconds. ==== Step 5: GWAS ==== On the respective machines, cd into code/ and run GwasClient for each party (excluding SP) in the following order: CP0: bin/GwasClient 0 ../par/test.par.0.txt CP1: bin/GwasClient 1 ../par/test.par.1.txt CP2: bin/GwasClient 2 ../par/test.par.2.txt The final output including the association statistics ("assoc") and the QC filter results for individuals ("ikeep") and SNPs ("gkeep1", "gkeep2") are provided in: {OUTPUT_FILE_PREFIX}_*.txt Note these files are created only on the machine where CP2 is running. This step also takes only a few seconds on the toy dataset. Expected runtimes for realistic dataset sizes can be found in our manuscript. A note on logistic regression: We additionally provide a proof-of-concept implementation of secure logistic regression for obtaining effect size estimates (odds ratio) of the top 100 SNPs identified by our GWAS protocol. This is achieved by running LogiRegClient in the same manner as GwasClient. Note that this step should be performed after the GWAS computation as it uses the results and the cached data files from GWAS. Reproducing results in our manuscript: GWAS datasets used in our manuscript are available through dbGaP. Accession numbers for the datasets and associated preprocessing steps are provided in our manuscript. Contact for questions: Hoon Cho, hhcho@mit.edu
shreyanJ/secure-gwas
Software for "Secure genome-wide association analysis using multiparty computation", Nature Biotechnology, 2018
C++MIT