Human GWAS data simulator from Changalidis et al., 2023.
Usage examples and the experiments described in the article are in the tests/ directory (with descriptions in tests/readme.md).
To run this tool, you need:
- Python 3.9 (packages are listed in requirements.txt; installation: pip install -r requirements.txt)
- R 4.1 (packages are listed in requirements-R.txt; quick installation: install_r_reqs.R)
- PLINK, PLINK2, bedtools and HAPGEN2 installed. Paths to these tools must be given in dependencies.yaml (use the -d parameter to specify it). An example is the dependencies.yaml file.
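Before filling in dependencies.yaml, it can help to confirm that the external tools are reachable. The following is a minimal sketch, not part of bioGWAS: it only looks up executables on PATH, and the executable names used here (plink, plink2, bedtools, hapgen2) are assumptions that may differ on your system.

```python
import shutil


def find_tools(names):
    """Map each executable name to its absolute path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    # Assumed executable names; adjust them to match your installation.
    tools = find_tools(["plink", "plink2", "bedtools", "hapgen2"])
    for name, path in tools.items():
        print(f"{name}: {path or 'NOT FOUND'}")
```

The printed paths can then be copied into your own dependencies.yaml, following the layout of the example file shipped with the repository.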
The only required parameters are:
- Genotypes, one of these:
  - -il/--input_list: a csv file that describes one input file per line, with the structure <abs_path_to_file>,<chromosome_number>.
  - -gel/--geno_list and -chl/--chr_list, which specify genotype file(s) and chromosome(s) respectively:
    --geno_list /wd/data/genotypes/file1.vcf,/wd/data/genotypes/file2.vcf
    --chr_list chr1,chr2
- -af/--anno_file: file with chromosome annotations in gtf format.
- Dependencies file with tool paths (set by -d/--dependencies).
- Output directories for data and images (default: ./data and ./images); they should be created.
- For choosing causal SNPs/pathways:
  - -ucs/--use_causal_snps has to be set or not set.
  - Then -cp/--causal_pathways OR -cs/--causal_snps for specifying causal pathways/SNPs respectively.
  - If we use causal pathways (i.e. use_causal_snps is not set), we also have to specify --gmt_file.
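The rules above can be sketched as a small pre-flight check. This is an illustrative helper under stated assumptions, not bioGWAS's own argument validation: the function name check_params is hypothetical, parameters are passed as a plain dict keyed by the long flag names, and "one of these" is read as exactly one of the two genotype options.

```python
def check_params(params):
    """Return a list of problems with a bioGWAS-style parameter dict (empty if none)."""
    problems = []

    # Genotypes: either --input_list, or --geno_list together with --chr_list.
    has_list = bool(params.get("input_list"))
    has_pair = bool(params.get("geno_list")) and bool(params.get("chr_list"))
    if has_list == has_pair:
        problems.append("set either input_list, or both geno_list and chr_list")

    if not params.get("anno_file"):
        problems.append("anno_file is required")

    # Causal SNPs and causal pathways are alternative modes.
    if params.get("use_causal_snps"):
        if not params.get("causal_snps"):
            problems.append("use_causal_snps is set, so causal_snps is required")
    else:
        if not params.get("causal_pathways"):
            problems.append("causal_pathways is required when use_causal_snps is not set")
        if not params.get("gmt_file"):
            problems.append("gmt_file is required when use_causal_snps is not set")

    return problems
```

For example, a dict with input_list, anno_file, gmt_file and causal_pathways passes the check, while setting use_causal_snps without causal_snps is reported as a problem.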
The full list of parameters, with their descriptions and default values, can be displayed with:
./biogwas.py --help
Example of launching simulation:
./biogwas.py \
--data_dir <output_data_dir> \
--img_dir <output_images_dir> \
--input_list <path_to_genofile_list>.list \
--anno_file <path_to_gtf>.gtf \
--gmt_file <path_to_gmt>.gmt \
--causal_pathways <path_to_pathways>.csv \
--pattern PATTERN \
--causal_id CAUSAL_ID \
--sim_id SIM_ID
Or, using the short flags:
./biogwas.py \
-dd <output_data_dir> \
-imd <output_images_dir> \
-il <path_to_genofile_list>.list \
-af <path_to_gtf>.gtf \
-gf <path_to_gmt>.gmt \
-cp <path_to_pathways>.csv \
-p PATTERN \
-cid CAUSAL_ID \
-sid SIM_ID
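If you launch many simulations from a script, the two forms above can be generated from one parameter dict. The sketch below is only an illustration: build_command is a hypothetical helper, and the long-to-short flag mapping is taken from the two examples above rather than from the tool's source.

```python
# Long parameter name -> short flag, as used in the examples above.
SHORT_FLAGS = {
    "data_dir": "-dd",
    "img_dir": "-imd",
    "input_list": "-il",
    "anno_file": "-af",
    "gmt_file": "-gf",
    "causal_pathways": "-cp",
    "pattern": "-p",
    "causal_id": "-cid",
    "sim_id": "-sid",
}


def build_command(params, short=False, script="./biogwas.py"):
    """Build an argument list suitable for subprocess.run from a parameter dict."""
    cmd = [script]
    for name, value in params.items():
        flag = SHORT_FLAGS[name] if short else f"--{name}"
        cmd += [flag, str(value)]
    return cmd
```

The resulting list can be passed to subprocess.run(cmd, check=True); passing a list rather than a shell string avoids quoting problems with paths that contain spaces.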
To build the image, use the Dockerfile:
docker build -t biogwas .
After that, you can run the container with something like:
docker run \
-v "<work_dir>:<work_dir_inside_container>" \
biogwas \
/bioGWAS/biogwas.py \
-d /dependencies.yaml \
-dd <output_data_dir_inside_container> \
-imd <output_images_dir_inside_container> \
-il <path_to_genofile_list_inside_container>.list \
-af <path_to_gtf_inside_container>.gtf \
-gf <path_to_gmt_inside_container>.gmt \
-cp <path_to_pathways_inside_container>.csv \
-p "dpat" \
-cid "dcid" \
-sid "dsid"
Your bioGWAS parameters go after the fifth line, i.e. you have to leave the first five lines as they are (except the -v flag, which shows how to mount directories inside the container; you have to specify all directories you are using, and the easiest way is to make one directory with subdirectories for I/O and mount that one directory).
All other flags can be changed. To read the manual, simply run with the --help flag:
docker run \
-v "<work_dir>:<work_dir_inside_container>" \
biogwas \
/bioGWAS/biogwas.py \
--help
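The fixed part of the docker run command can be kept in one place and the bioGWAS flags appended to it. This is a minimal sketch, assuming the image tag biogwas and the in-container paths shown above; docker_command is a hypothetical helper name. The host directory is resolved to an absolute path, since some Docker versions reject relative paths in -v.

```python
from pathlib import Path


def docker_command(work_dir, container_dir, biogwas_args, image="biogwas"):
    """Return the docker run argument list with the fixed prefix and user flags."""
    host = Path(work_dir).resolve()  # absolute host path for the bind mount
    return [
        "docker", "run",
        "-v", f"{host}:{container_dir}",
        image,
        "/bioGWAS/biogwas.py",
        "-d", "/dependencies.yaml",
        *biogwas_args,  # everything after the fixed prefix is user-defined
    ]
```

For example, docker_command(".", "/wd", ["-il", "/wd/data/genotypes.list"]) mounts the current directory at /wd and passes the list file by its in-container path.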
Alternatively, you can launch bioGWAS using docker-compose. An example configuration is in the docker-compose.yaml file. You have to change the volumes entry (again, it shows how to mount directories inside the container; you have to specify all directories you are using, and the easiest way is to make one directory with subdirectories for I/O and mount that one directory):
volumes:
- "<path_to_workdir>:<path_inside_container>"
The volumes entry should point to your working directory.
You also have to edit the command part:
command: >
/bioGWAS/biogwas.py
-d /dependencies.yaml
-dd "/data/gwassim_check/attempt_docker/data"
-imd "/data/gwassim_check/attempt_docker/images"
-il "/data/1000genomes/data2/chr.list"
-if "/data/1000genomes/data2/EUR_SAMPLES_ID.txt"
-af "/data/1000genomes/data2/gencode.v37.annotation.gtf"
--gmt_file "/data/1000genomes/data2/h.all.v2023.1.Hs.symbols.gmt"
--causal_pathways "/data/1000genomes/data2/pathways.csv"
-p "dpat"
-cid "dcid"
-sid "dsid"
You have to leave the first three lines as they are:
command: >
/bioGWAS/biogwas.py
-d /dependencies.yaml
All other parameters should be changed to match your settings.
Note: not all settings and flags are shown here. To display the full list, replace all parameters with --help:
command: >
/bioGWAS/biogwas.py
--help
After editing docker-compose.yaml, run the simulation with:
docker-compose up
You can also open a shell inside the container (not recommended):
docker run --rm -it --entrypoint /bin/bash biogwas
Examples of usage, as well as the validation steps described in the paper, are located in the tests directory.
The simplest way to launch our tool is with Docker. If you use Docker, there is no need to install any other packages; everything will be installed for you.
Download this repo source code:
git clone https://github.com/TohaRhymes/bioGWAS.git
Go to the directory with the source code:
cd bioGWAS
Build an image:
docker build -t biogwas .
Create a working directory and change into it (on Linux):
mkdir biogwas_test
cd biogwas_test
For this example, let all necessary files be in the ./data/ directory (otherwise you have to mount every directory that contains data when launching docker).
There are two ways to specify genotype files:
Method 1: using a file
- Genotype file(s), in .vcf or PLINK's bfile format (the former is available only when genotype simulation is skipped).
- A list of all files to be included, with the chromosome corresponding to each file, for example:
/wd/data/genotypes/file1.vcf,chr1
/wd/data/genotypes/file2.vcf,chr2
...
Note: paths should be written as the files will appear inside the docker container. All other file paths will be specified later, when we launch docker.
Let this file be ./data/genotypes.list in our example.
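The list file for Method 1 can also be generated rather than typed by hand. The following is a sketch under stated assumptions: genotype_list_lines is a hypothetical helper, and /wd/data/genotypes is the in-container directory used in this example.

```python
from pathlib import PurePosixPath


def genotype_list_lines(files_by_chrom, container_dir="/wd/data/genotypes"):
    """Build '<path_inside_container>,<chromosome>' lines from a chromosome -> file name dict."""
    lines = []
    for chrom, file_name in files_by_chrom.items():
        # Paths are written as they will appear inside the container, not on the host.
        path = PurePosixPath(container_dir) / file_name
        lines.append(f"{path},{chrom}")
    return lines


if __name__ == "__main__":
    for line in genotype_list_lines({"chr1": "file1.vcf", "chr2": "file2.vcf"}):
        print(line)
```

Redirecting the printed lines to ./data/genotypes.list gives a file in the format shown above.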
Method 2: using flags
Instead of creating a file with genotypes and specifying it using --input_list, we can run bioGWAS using --geno_list and --chr_list, which specify genotype file(s) and chromosome(s) respectively:
...
--geno_list /wd/data/genotypes/file1.vcf,/wd/data/genotypes/file2.vcf
--chr_list chr1,chr2
...
The samples file should be a txt file with the IDs of the samples to be included in the analysis (one per line), for example:
HG00096
HG00097
HG00099
HG00100
...
Let this file be ./data/samples.txt in our example.
In case you don't have this file, you can create it using bcftools:
bcftools query -l your/vcf/file.vcf > ./data/samples.txt
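If bcftools is not available, the sample IDs can be read directly from the VCF header, because in the VCF format the sample names are the columns after FORMAT on the #CHROM line. This is a minimal sketch for uncompressed, plain-text VCF files only; vcf_sample_ids is a hypothetical helper name.

```python
def vcf_sample_ids(vcf_path):
    """Return the sample IDs listed on the #CHROM header line of a plain-text VCF."""
    with open(vcf_path, "r", encoding="utf-8") as handle:
        for line in handle:
            if line.startswith("#CHROM"):
                fields = line.rstrip("\n").split("\t")
                # The first 9 columns are fixed (CHROM ... FORMAT); the rest are samples.
                return fields[9:]
            if not line.startswith("#"):
                break  # reached data records without seeing a header line
    raise ValueError(f"no #CHROM header line found in {vcf_path}")
```

Writing the returned IDs one per line produces a file in the same format as ./data/samples.txt.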
The annotation file should be in gtf format. In our test we used the comprehensive gene annotation downloaded from the GENCODE site. Let this file be ./data/gencodes.gtf in our example.
* If you want to use specific causal SNPs, provide a list of these SNPs, one per line, for example:
```txt
1:172643220
3:128435895
4:76045432
4:87976387
5:7891402
```
Let this file be `./data/snps.txt` in our example.
* If you want to use specific pathways, you need:
  * A GMT file with your pathways (you can download hallmark and other gmt files from the [gsea-msigdb website](https://www.gsea-msigdb.org/gsea/msigdb/collections.jsp)). Let this file be `./data/pathways.gmt` in our example.
  * A list of causal pathway(s), e.g.:
```txt
KEGG_PPAR_SIGNALING_PATHWAY
KEGG_LONG_TERM_DEPRESSION
```
Let this file be `./data/pathways_list.txt` in our example.
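Both kinds of causal input can be sanity-checked before a run. The sketch below is an illustration, not bioGWAS code: the helper names are hypothetical, the SNP pattern `chromosome:position` is inferred from the example above, and the GMT parsing relies only on the standard layout of that format (tab-separated, pathway name in the first column).

```python
import re

# Assumed SNP identifier format, inferred from the example: <chromosome>:<position>
SNP_ID = re.compile(r"^[0-9A-Za-z]+:[0-9]+$")


def invalid_snp_lines(lines):
    """Return (line_number, text) for non-empty lines that do not look like chrom:pos."""
    bad = []
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if text and not SNP_ID.match(text):
            bad.append((number, text))
    return bad


def missing_pathways(causal_lines, gmt_lines):
    """Return causal pathway names that do not appear in the first column of the GMT file."""
    known = {line.strip().split("\t", 1)[0] for line in gmt_lines if line.strip()}
    wanted = [line.strip() for line in causal_lines if line.strip()]
    return [name for name in wanted if name not in known]
```

Both functions accept any iterable of lines, so open file handles for `./data/snps.txt`, `./data/pathways_list.txt` and `./data/pathways.gmt` can be passed in directly.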
So, in total, the necessary files are:
- ./data/genotypes/: directory with genotype files
- ./data/genotypes.list: list of genotype files
- ./data/samples.txt: list of samples
- ./data/gencodes.gtf: annotation file
- One of two:
  - ./data/snps.txt: list of causal SNPs
  - ./data/pathways.gmt and ./data/pathways_list.txt: gmt file with pathways, and your causal pathways.
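The checklist above can be turned into a quick check that everything is in place before launching docker. This is a sketch that assumes the file names chosen for this example; missing_inputs is a hypothetical helper.

```python
from pathlib import Path


def missing_inputs(data_dir, use_causal_snps=False):
    """Return the names of expected example inputs that are absent from data_dir."""
    data = Path(data_dir)
    required = ["genotypes", "genotypes.list", "samples.txt", "gencodes.gtf"]
    if use_causal_snps:
        required.append("snps.txt")
    else:
        required += ["pathways.gmt", "pathways_list.txt"]
    return [name for name in required if not (data / name).exists()]


if __name__ == "__main__":
    print(missing_inputs("./data"))
```

An empty list means all files from the checklist were found under the given directory.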
(Let the working directory inside the container be /wd, so our data will be in /wd/data; however, it can be anything.)
If you want to run using causal pathways:
docker run \
-v "./:/wd" \
biogwas \
/bioGWAS/biogwas.py \
--data_dir /wd \
--img_dir /wd \
--input_list /wd/data/genotypes.list \
--ids_file /wd/data/samples.txt \
--anno_file /wd/data/gencodes.gtf \
--gmt_file /wd/data/pathways.gmt \
--causal_pathways /wd/data/pathways_list.txt
If you want to run using causal SNPs:
docker run \
-v "./:/wd" \
biogwas \
/bioGWAS/biogwas.py \
--data_dir /wd \
--img_dir /wd \
--input_list /wd/data/genotypes.list \
--ids_file /wd/data/samples.txt \
--anno_file /wd/data/gencodes.gtf \
--gmt_file /wd/data/pathways.gmt \
--use_causal_snps \
--causal_snps /wd/data/snps.txt
After the run finishes, all output data will be in the ./data directory.
If you use bioGWAS, please cite:
@Article{biology13010010,
AUTHOR = {Changalidis, Anton I. and Alexeev, Dmitry A. and Nasykhova, Yulia A. and Glotov, Andrey S. and Barbitoff, Yury A.},
TITLE = {bioGWAS: A Simple and Flexible Tool for Simulating GWAS Datasets},
JOURNAL = {Biology},
VOLUME = {13},
YEAR = {2024},
NUMBER = {1},
ARTICLE-NUMBER = {10},
URL = {https://www.mdpi.com/2079-7737/13/1/10},
PubMedID = {38248441},
ISSN = {2079-7737},
ABSTRACT = {Genome-wide association studies (GWAS) have proven to be a powerful tool for the identification of genetic susceptibility loci affecting human complex traits. In addition to pinpointing individual genes involved in a particular trait, GWAS results can be used to discover relevant biological processes for these traits. The development of new tools for extracting such information from GWAS results requires large-scale datasets with known biological ground truth. Simulation of GWAS results is a powerful method that may provide such datasets and facilitate the development of new methods. In this work, we developed bioGWAS, a simple and flexible pipeline for the simulation of genotypes, phenotypes, and GWAS summary statistics. Unlike existing methods, bioGWAS can be used to generate GWAS results for simulated quantitative and binary traits with a predefined set of causal genetic variants and/or molecular pathways. We demonstrate that the proposed method can recapitulate complete GWAS datasets using a set of reported genome-wide associations. We also used our method to benchmark several tools for gene set enrichment analysis for GWAS data. Taken together, our results suggest that bioGWAS provides an important set of functionalities that would aid the development of new methods for downstream processing of GWAS results.},
DOI = {10.3390/biology13010010}
}