GWAS-Flow
was written and published in the hope that you might find it useful. If you do and use it for your research, please cite the paper published alongside the software, which is currently publicly accessible on the bioRxiv preprint server: https://www.biorxiv.org/content/10.1101/783100v1 (doi: 10.1101/783100)
GWAS_Flow
is an open-source, Python-based software package providing a GPU-accelerated framework for performing genome-wide association studies (GWAS), published under the MIT License.
GWAS comprises a set of major algorithms in quantitative genetics for finding associations between phenotypes and their respective genotypes, with a broad range of applications ranging from plant breeding to medicine.
In recent years, the data sets used for these studies have grown rapidly in size, and accordingly the time needed to analyze them on conventional CPU-powered machines has increased dramatically.
Here we use TensorFlow, a framework commonly used for machine-learning applications, to utilize graphics processing units (GPUs) for GWAS.
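As a rough illustration of the per-marker computation involved, below is a naive ordinary-least-squares scan in plain NumPy/SciPy. GWAS_Flow itself fits a linear mixed model that accounts for kinship and runs the tests batched on the GPU via TensorFlow; the function name and toy data here are illustrative only.

```python
import numpy as np
from scipy import stats

def marker_pvalues(X, y):
    """Naive per-marker association scan via ordinary least squares.

    X: (n_samples, n_markers) genotype matrix coded 0/1/2
    y: (n_samples,) phenotype vector
    Returns one p-value per marker. This sketch ignores relatedness;
    GWAS_Flow's mixed model additionally corrects for kinship.
    """
    pvals = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        pvals[j] = stats.linregress(X[:, j], y).pvalue
    return pvals

# toy example: 5 samples, 3 markers (illustrative data)
X = np.array([[0, 1, 2],
              [1, 0, 2],
              [2, 1, 0],
              [1, 2, 1],
              [0, 2, 1]], dtype=float)
y = np.array([0.1, 0.5, 1.2, 0.8, 0.3])
print(marker_pvalues(X, y))
```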
- tensorflow (v.1.14.0)
- numpy (v.1.16.4)
- pandas (v.0.24.2)
- scipy (v.1.3.0)
- h5py (v.2.9.0)
- matplotlib
- Docker (v.19.03.1)
- Singularity (v.2.5.2)
This has been tested on multiple Linux systems with Anaconda versions > 4.7.
Clone the repository directly with git:
git clone https://github.com/Joyvalley/GWAS_Flow
Create an anaconda environment and install the necessary packages using the gwas_flow_env.yaml configuration file:
conda env create -f gwas_flow_env.yaml
conda activate gwas_flow
For the installation with Docker, the only required software is Docker itself.
git clone https://github.com/Joyvalley/GWAS_Flow.git
cd GWAS_Flow
docker build -t gwas_flow docker
git clone https://github.com/Joyvalley/GWAS_Flow.git
docker build -t gwas_flow docker
!! make sure to change /PATH/TO/FOLDER
docker run -v /var/run/docker.sock:/var/run/docker.sock -v /PATH/TO/FOLDER:/output --privileged -t singularityware/docker2singularity:1.11 gwas_flow:latest
Change the name of the generated image, e.g. gwas_flow_latest-2019-08-19-8c98f492dd54.img, to gwas_flow_sing.img
GWAS_Flow is designed to work with several different input data formats. Sample data for all of them are available in the folder gwas_sample_data/. The minimal requirement is to provide a genotype and a phenotype file. If no kinship matrix is provided, a kinship matrix according to VanRaden is calculated from the provided marker information. Depending on the size of the marker matrix, this can take a while.
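The VanRaden kinship computation can be sketched in a few lines of NumPy. This is the textbook formula (VanRaden, 2008); GWAS_Flow's internal implementation may differ in details such as marker filtering.

```python
import numpy as np

def vanraden_kinship(G):
    """Kinship matrix following VanRaden (2008).

    G: (n_samples, n_markers) genotypes coded 0/1/2 (minor-allele dosage).
    Plain NumPy sketch of the standard formula, not GWAS_Flow's
    actual implementation.
    """
    p = G.mean(axis=0) / 2.0             # per-marker allele frequency
    W = G - 2.0 * p                      # center each marker column
    denom = 2.0 * np.sum(p * (1.0 - p))  # VanRaden scaling factor
    return W @ W.T / denom

# toy example: 3 samples, 4 markers (illustrative data)
G = np.array([[0, 1, 2, 1],
              [1, 1, 0, 2],
              [2, 0, 1, 1]], dtype=float)
K = vanraden_kinship(G)
print(K.shape)  # (3, 3)
```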
python gwas.py -x gwas_sample_data/AT_geno.hdf5 -y gwas_sample_data/phenotype.csv -k gwas_sample_data/kinship_ibs_binary_mac5.h5py
python gwas.py -x gwas_sample_data/G_sample.csv -y gwas_sample_data/Y_sample.csv -k gwas_sample_data/K_sample.csv
To use the PLINK data format, add .bed, .bim and .fam files with the same prefix to the folder. You can tell GWAS-Flow to use those files by passing prefix.plink as the option for the genotype file:
python gwas.py -x gwas_sample_data/my_plink.plink -y gwas_sample_data/pheno2.csv -k gwas_sample_data/kinship_ibs_binary_mac5.h5py
Flags and options are:
-x , --genotype : file containing marker information in csv or hdf5 format
-y , --phenotype : file containing phenotype information in csv format
-k , --kinship : file containing kinship matrix of size k X k in csv or hdf5 format
-m : name of column to be used in phenotype file. Default m='phenotype_value'
--cof: file with cofactor information (only one co-factor as of now)
-a , --mac_min : integer specifying the minimum minor allele count necessary for a marker to be included. Default a = 1
-bs, --batch-size : integer specifying the number of markers processed at once. Default -bs 500000
-p , --perm : perform n permutations
--plot : create a Manhattan plot
-o , --out : name of output file. Default -o results.csv
-h , --help : prints help and command line options
Use python gwas.py -h to see the command line options.
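The --batch-size option exists because, with millions of markers, the full genotype matrix may not fit into GPU memory at once, so the scan is run chunk by chunk. A minimal sketch of column-wise batching in plain NumPy (not GWAS_Flow's internal code):

```python
import numpy as np

def iter_marker_batches(X, batch_size=500_000):
    """Yield views of the marker matrix in chunks of batch_size columns.

    Mirrors what the --batch-size flag controls: each chunk of markers
    is processed independently, keeping peak memory bounded.
    """
    n_markers = X.shape[1]
    for start in range(0, n_markers, batch_size):
        yield X[:, start:start + batch_size]

# toy example: 10 samples, 1200 markers, batches of 500
X = np.zeros((10, 1200))
sizes = [b.shape[1] for b in iter_marker_batches(X, batch_size=500)]
print(sizes)  # [500, 500, 200]
```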
Execute the docker container with the sample data
docker run --rm -u $UID:$GID -v $PWD:/data gwas_flow:latest -x gwas_sample_data/AT_geno.hdf5 -y gwas_sample_data/phenotype.csv -k gwas_sample_data/kinship_ibs_binary_mac5.h5py
On Windows you can use something like this after activating the file sharing for the drive the repo is stored on:
cd c:\PATH\TO\REPO\GWAS_Flow
docker run -v c:/PATH/TO/REPO/GWAS_Flow:/data gwas_flow:latest -x gwas_sample_data/AT_geno.hdf5 -y gwas_sample_data/phenotype.csv -k gwas_sample_data/kinship_ibs_binary_mac5.h5py
!! The GPU versions of Docker and Singularity are still under development and might or might not work properly with your setup. To run GWAS-Flow on GPUs, as of now we recommend using Anaconda environments.
Execute the singularity image with the sample data
singularity run gwas_flow_sing.img -x gwas_sample_data/AT_geno.hdf5 -y gwas_sample_data/phenotype.csv -k gwas_sample_data/kinship_ibs_binary_mac5.h5py
So far, GWAS-Flow is capable of using one cofactor. The cofactor is added to the analysis with the flag --cof FILENAME, e.g.
python gwas.py -x gwas_sample_data/G_sample.csv -y gwas_sample_data/Y_sample.csv -k gwas_sample_data/K_sample.csv --cof gwas_sample_data/cof.csv
Add the flag --perm 100 to calculate a significance threshold based on 100 permutations. Change 100 to any integer n > 2 to perform n permutations.
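The idea behind a permutation-based threshold can be sketched as follows: shuffle the phenotype, re-run the marker scan, record the minimum p-value, and take the alpha-quantile of those minima as the genome-wide threshold. This sketch uses plain OLS for the scan; GWAS_Flow's exact procedure (which fits a mixed model) may differ.

```python
import numpy as np
from scipy import stats

def permutation_threshold(X, y, n_perm=100, alpha=0.05, seed=0):
    """Permutation-based significance threshold (illustrative sketch).

    For each permutation, the phenotype is shuffled to break any true
    marker-phenotype association, the scan is re-run, and the minimum
    p-value is recorded; the alpha-quantile of the minima is returned.
    """
    rng = np.random.default_rng(seed)
    minima = []
    for _ in range(n_perm):
        y_perm = rng.permutation(y)
        pvals = [stats.linregress(X[:, j], y_perm).pvalue
                 for j in range(X.shape[1])]
        minima.append(min(pvals))
    return float(np.quantile(minima, alpha))

# toy example: 5 samples, 3 markers (illustrative data)
X = np.array([[0, 1, 2],
              [1, 0, 2],
              [2, 1, 0],
              [1, 2, 1],
              [0, 2, 1]], dtype=float)
y = np.array([0.1, 0.5, 1.2, 0.8, 0.3])
print(permutation_threshold(X, y, n_perm=20))
```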
By default, no plot is generated. If you add the flag --plot True, a Manhattan plot is generated.
The dash-dotted line is the Bonferroni threshold of significance and the dashed line the permutation-based threshold.
The latter is only calculated if the flag --perm n
was used with n > 2.
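For reference, the Bonferroni threshold is simply the significance level divided by the number of tested markers, usually drawn on the -log10 scale of the Manhattan plot. A quick sketch:

```python
import numpy as np

# Bonferroni-corrected genome-wide threshold: alpha divided by the
# number of tested markers; n_markers = 10_000 is an arbitrary example.
alpha = 0.05
n_markers = 10_000
bonferroni = alpha / n_markers
print(bonferroni)              # 5e-06
print(-np.log10(bonferroni))   # ~5.30 on the Manhattan plot's y-axis
```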
The image displays the average time of 10 runs with 10,000 markers each and a varying number of phenotypes for GWAS_Flow on GPU and CPUs and for a standard R script for GWAS.
The computation time grows exponentially with an increasing number of phenotypes.
With lower numbers of phenotypes (< 800), the CPU version is faster than the GPU version.
This difference becomes more and more pronounced as more phenotypes are included.
All calculations were performed on 16 i9 vCPUs and an NVIDIA Tesla P100 graphics card.
The unit tests can be run on the console with:
python -m unittest test.py
All the necessary test data is stored in test_data.