GWAS data simulation

Human GWAS data simulator from Changalidis et al., 2023.

Examples of usage as well as experiments described in the article are presented in the tests/ directory (with descriptions in tests/readme.md).

Installation

To make this tool working, you need:

Python 3.9 (+ packages are listed in requirements.txt, installation: pip install -r requirements.txt)
R 4.1 (+ packages are listed in requirements-R.txt, quick installation: install_r_reqs.R)
PLINK, PLINK2, bedtools and HAPGEN2 be installed. Pathes to these tools have to be presented in dependencies.yaml (use -d parameter to specify). The example is in dependencies.yaml file.

Launch

The only required parameters are:

Genotypes, one of these:
- -il/--input_list - csv-file that describes one input file per line with structure: <abs_path_to_file>,<chromosome_number>.
- -gel/--geno_list and -chl/--chr_list and specify genotype file(s) and chromosome(s) respectively:

    --geno_list /wd/data/genotypes/file1.vcf,/wd/data/genotypes/file2.vcf
    --chr_list chr1,chr2

-af/--anno_file -- file with chromosome annotations in gtf-format.
Dependencies file with tools pathes (set by -d/--dependencies).
Output directories for data and images (default: ./data and ./images) -- they should be created.
For choosing causal SNPs/pathways:
- -ucs/--use_causal_snps - has to be set or not set.
- then -cp/--causal_pathways OR -cs/--causal_snps for specifying causal pathways/SNPs respectively.
- if we use causal pathways (i.e. use_causal_snps is not set), we also have to specify --gmt_file.

Full list of parameters with its description and default values can be found here:

./biogwas.py --help

Example of launching simulation:

./biogwas.py \
--data_dir <output_data_dir>  \
--img_dir <output_images_dir>  \
--input_list  <path_to_genofile_list>.list \
--anno_file <path_to_gtf>.gtf \
--gmt_file <path_to_gmt>.gmt  \
--causal_pathways <path_to_pathways>.csv \
--pattern PATTERN  \
--causal_id CAUSAL_ID  \
--sim_id SIM_ID

Or shortly:

./biogwas.py \
-dd <output_data_dir>  \
-imd <output_images_dir>  \
-il  <path_to_genofile_list>.list \
-af <path_to_gtf>.gtf \
-gf <path_to_gmt>.gmt  \
-cp <path_to_pathways>.csv \
-p PATTERN  \
-cid CAUSAL_ID  \
-sid SIM_ID

Docker

In order to build image use Docker file:

docker build -t biogwas .

Launch from CLI

After that you can easily run docker by something like:

docker run \
-v "<work_dir>:<work_dir_inside_container>" \
biogwas \
/bioGWAS/biogwas.py \
-d /dependencies.yaml \
-dd <output_data_dir_inside_container>  \
-imd <output_images_dir_inside_container>  \
-il  <path_to_genofile_list_inside_container>.list \
-af <path_to_gtf_inside_container>.gtf \
-gf <path_to_gmt_inside_container>.gmt  \
-cp <path_to_pathways_inside_container>.csv \
-p "dpat" \
-cid "dcid" \
-sid "dsid"

Your parameters for biogwas goes after 5th string, i.e. you have to leave first 5 strings as they are (except -v flag which shows how to mount directories inside container, you have to specify all directories you're using: the easiest way is to make just one directory with subdirectories for I/O, and specify this one directory):

All other flags can be changed. To read manual, simply run with flag --help:

docker run \
-v "<work_dir>:<work_dir_inside_container>" \
biogwas \
/bioGWAS/biogwas.py \
--help

Launch with docker-compose.yaml

Alternatively you can launch bioGWAS using docker-compose. Example of configuration is in docker-compose.yaml file. You have to change (again, it shows how to mount directories inside container, you have to specify all directories you're using: the easiest way is to make just one directory with subdirectories for I/O, and specify this one directory):

volumes:
  - "<path_to_workdir>:<path_inside_container>"

To point to the working dir.

You have to work with the command part:

command: >
/bioGWAS/biogwas.py
-d /dependencies.yaml
-dd "/data/gwassim_check/attempt_docker/data"
-imd "/data/gwassim_check/attempt_docker/images"
-il "/data/1000genomes/data2/chr.list"
-if "/data/1000genomes/data2/EUR_SAMPLES_ID.txt"
-af "/data/1000genomes/data2/gencode.v37.annotation.gtf"
--gmt_file "/data/1000genomes/data2/h.all.v2023.1.Hs.symbols.gmt"
--causal_pathways "/data/1000genomes/data2/pathways.csv"
-p "dpat"
-cid "dcid"
-sid "dsid"

You have to leave first three strings as they are:

command: >
/bioGWAS/biogwas.py
-d /dependencies.yaml

All other parameters should be changed to your settings.

Note: not all settings and flags are shown here, to display the full list, change all parameters to:

command: >
/bioGWAS/biogwas.py
--help

After editing docker-compose.yaml, run simulation using command:

docker-compose up

You can also go inside container (not recommended):

docker run --rm -it --entrypoint /bin/bash biogwas

Examples

Examples of usage as well as validation steps, described in the paper is located in tests directory.

Step-by-step tutorial

1. Install Docker

The simple way to launch our tool is using Docker. If you use docker, there is no need to install all other packages, everything will be installed for you!

2. Download source code

Download this repo source code:

git clone https://github.com/TohaRhymes/bioGWAS.git

Go to the directory with the source code:

cd bioGWAS

Build an image:

docker build -t biogwas .

3. Make working directory

Create a working directory and browse to it, in linux:

mkdir biogwas_test
cd biogwas_test

4. Prepare essential files

For this example, let all necessary files be in ./data/ directory (otherwise you have to mount all directories with data when launching docker).

Genotypes files

There are 2 ways to set genotypes files:

Method 1: using file

genotypes file(s) (in .vcf or PLINK's bfile format (the former is available only with skipping genotypes simulation)).
- you need to create a list of all necessary file(s) to be included, and the chromosome corresponding to this file, e.g.the example:
```
/wd/data/genotypes/file1.vcf,chr1
/wd/data/genotypes/file2.vcf,chr2
...
```
  Note: path should be written as files will appear in docker container. All other files path will be specified later, when we launch docker.
  Let this file be ./data/genotypes.list in our example.

Method 2: using flags

Instead of creating a file with genotypes and specifying it using --input_list, we can run bioGWAS using --geno_list and --chr_list and specify genotype file(s) and chromosome(s) respectively:

    ...
    --geno_list /wd/data/genotypes/file1.vcf,/wd/data/genotypes/file2.vcf
    --chr_list chr1,chr2
    ...

Samples to be included (optionally)

It should be txt-file with samples ids to be included in the analysis (one per line), the example: txt HG00096 HG00097 HG00099 HG00100 ... Let this file be ./data/samples.txt in our example.

In case you don't have this file, you can create it using bcftools: bcftools query -l your/vcf/file.vcf > ./data/samles.txt

Annotations

It should be annotation file in gtf format. In our test we used comprehensive gene annotation downloaded from gencode site. Let this file be ./data/gencodes.gtf in our example.

Specifying causal SNPs/pathways

* If you want to use specific causal SNPs, a list of these SNPs, one per line, example:
    ```txt
    1:172643220
    3:128435895
    4:76045432
    4:87976387
    5:7891402
    ```
    Let this file be `./data/snps.txt` in our example.
* If you want to use specific pathways, you need to:
	* use GMT files with your pathways (you can download hallmark, and other gmt files from [gsea-msigdb website](https://www.gsea-msigdb.org/gsea/msigdb/collections.jsp)). Let this file be `./data/pathways.gmt` in our example.
	* A list of causal pathway(s), e.g.:
        ```txt
        KEGG_PPAR_SIGNALING_PATHWAY
        KEGG_LONG_TERM_DEPRESSION
        
        ```
        Let this file be `./data/pathways_list.txt` in our example.

Summary

So, in total the necessary files are:

./data/genotypes/ - directory with genotypes files
./data/genotypes.list - list with genotypes files
./data/samples.txt - list with samples
./data/gencodes.gtf - annotation file
One of two:
- ./data/snps.txt - list of causal SNPs
- ./data/pathways.gmt and ./data/pathways_list.txt - gmt-file with pathways, and your causal pathways.

5. Run simulation

(Let working directory inside container be /wd, and our data will be in /wd/data, however it can be anything.)

If you want to run using causal pathways:

docker run \
-v "./:/wd" \
biogwas \
/bioGWAS/biogwas.py \
--data_dir /wd \
--img_dir /wd  \
--input_list  /wd/data/genotypes.list \
--ids_file  /wd/data/samples.txt \
--anno_file /wd/data/gencodes.gtf \
--gmt_file /wd/data/pathways.gmt  \
--causal_pathways /wd/data/pathways_list.txt

If you want to run using causal SNPs:

docker run \
-v "./:/wd" \
biogwas \
/bioGWAS/biogwas.py \
--data_dir /wd \
--img_dir /wd  \
--input_list  /wd/data/genotypes.list \
--ids_file  /wd/data/samles.txt \
--anno_file /wd/data/gencodes.gtf \
--gmt_file /wd/data/pathways.gmt  \
--use_causal_snps \
--causal_snps /wd/data/snps.txt

6. Wait!

After finishing, all output data will be in ./data dir.

Citation

When use, please cite:


@Article{biology13010010,
AUTHOR = {Changalidis, Anton I. and Alexeev, Dmitry A. and Nasykhova, Yulia A. and Glotov, Andrey S. and Barbitoff, Yury A.},
TITLE = {bioGWAS: A Simple and Flexible Tool for Simulating GWAS Datasets},
JOURNAL = {Biology},
VOLUME = {13},
YEAR = {2024},
NUMBER = {1},
ARTICLE-NUMBER = {10},
URL = {https://www.mdpi.com/2079-7737/13/1/10},
PubMedID = {38248441},
ISSN = {2079-7737},
ABSTRACT = {Genome-wide association studies (GWAS) have proven to be a powerful tool for the identification of genetic susceptibility loci affecting human complex traits. In addition to pinpointing individual genes involved in a particular trait, GWAS results can be used to discover relevant biological processes for these traits. The development of new tools for extracting such information from GWAS results requires large-scale datasets with known biological ground truth. Simulation of GWAS results is a powerful method that may provide such datasets and facilitate the development of new methods. In this work, we developed bioGWAS, a simple and flexible pipeline for the simulation of genotypes, phenotypes, and GWAS summary statistics. Unlike existing methods, bioGWAS can be used to generate GWAS results for simulated quantitative and binary traits with a predefined set of causal genetic variants and/or molecular pathways. We demonstrate that the proposed method can recapitulate complete GWAS datasets using a set of reported genome-wide associations. We also used our method to benchmark several tools for gene set enrichment analysis for GWAS data. Taken together, our results suggest that bioGWAS provides an important set of functionalities that would aid the development of new methods for downstream processing of GWAS results.},
DOI = {10.3390/biology13010010}
}