Evolutionary signatures from GWAS
This pipeline detects enrichment and depletion of evolutionary signatures from GWAS summary statistics.
Note: this pipeline is currently under development.
Install
Please install these dependecies.
1. PLINK 1.90 beta
2. Anaconda
- note: the shell should have path access to run the plink command
Run the commands
as shown.
- clone git repository:
git clone https://github.com/abraham-abin13/gsel_vec.git
- change directory:
cd gsel_vec
- *run download_data.sh:
./download_data.sh
- note: requires the the 'wget' utility; if on a mac, install wget using brew (
brew install wget
)
- create conda env:
conda env create -f conda_env_manual_dask.yml
- activate conda env:
source activate evo_dask
- install package:
pip install .
- activate conda env
source activate evo_dask
- run the command
top_gsel_vec.py
- outputs are saved to a folder called 'testing' in the current directory.
- See Outputs section and Details sections below for interpreting output files.
Run
Inputs
The main command is top_gsel_vec.py <analysis_name> <gwas_summary_file> <output_path>
:
- analysis_name: a label describing the analysis
- gwas_summary_file: full path to the gwas summary statistics file
- output_path: full path to create output directory which will have the same name as analysis_name.
Outputs
Outputs are saved to output_path in a folder with the name analysis_name. The final outputs of the analyses are stored in the folder final_outputs with the following folders:
- trait_enrichments/trait_enrichment.tsv: enrichment for each evolutionary measure at the trait level -pval_zscore_per_anno/*.tsv: Z-score and empirical p-value for each trait associated locus compared to matched background; one file per evolutionary measure
- plot_outputs/:
Details
A. GWAS Summary Statistics File Format
- should be a tab separated file with one row per SNP
- coordinates should be in GRChg37
- should have the following headers in this order: ['snp', 'chr', 'pos', 'p']
B. Interpreting trait_enrichment.tsv output
- one row per evolutionary measure
Column | Description |
---|---|
annotation | name of evolutionary measure |
mean_trait_loci | mean across all trait associated loci |
median_trait_loci | median across all trait associated loci |
emp_pvalue | p-value from comparing to matched background |
*matched* | matched distributions mean, median, std, 5th, & 95th percentiles |
n_lead_loci | number of trait associated regions after LD clumping |
enrich_per_mean_diff_by_genomstd | enrichment calculated as described below |
- For each evolutionary measure, enrichment is calculated by subtracting the mean across trait-associated loci from mean across matched loci and dividing this by the genome-wide standard deviation for that evolutoinary measure:
C. Interpreting pval_zscore_per_anno/*.tsv output
- one file per evolutionary measure
- one row per LD independent trait-assocaited region
Column | Description |
---|---|
lead_snp | SNP with lowest P-value for a trait-associated region |
z_score | z-score of trait-associated region compared to matched background |
lead_snp_anno | mean of evolutionary measure of SNPs in a trait-associated region |
mean_controls | mean of evolutionary measure of matched background |
pvalue | empirical p-value from comparing evolutinoary measure at a trait-associated region (lead_snp_anno) to matched background |
corrected_pval_benj_hoch | empirical p-value corrected for number of trait-associated regions |
annotation | name of evoluationary measure |
final_qc_include | if False, remove trait-associated region from downstream analysis |
D. Methods
- For details please see manuscript.
Running one chromosome at a time
For GWAS with a large number of trait-associated regions, this pipeline can be run on individual chromosomes.
A. Instructions
- split summary statistics into files for each chromosome
- per chromosome summary stats files should be named chr<#>_<analysis_name>
- create a folder for output_path named <analysis_name>
- run
top_gsel_vec.py -a <analysis_name> -g <path_to_per_chromosome_summary_stats_file> -o <output_path> --run_by_chr
for each chromosome- note the
--run_by_chr
flag is added - note the <output_path> should be the full path to the folder named <analysis_name> created in step 3.
- repeat this command with each chromosome file (changing the -g flag input)
- note the
- run
top_gsel_vec.py -a <analysis_name> -o <output_path> --run_by_chr_enrich
.- This will calculate trait wide enrichments.
- Trait-wide enrichment will saved to
<analysis_name>_extreme_regions_mean_enrichment_all_annotation.tsv
Adding a new evolutionary measure to GSEL
- Create a tab separated file with annotation values
- To add an evolutionary measure, create a tab separated file with two columns.
- Columns names have to be: “chr_pos” and “value”.
- "chr_pos" column is the position of SNPs formatted as “<chromosome#>:<basepair_position>” Note: The coordinates should be in GRChg37.
- "value" column should be the value of your evolutionary measure for each SNP
- Convert annotation value file to a python pickled dictionary
- Run the script below to convert annotation value file to a python pickled dictionary. It takes one input argument which is the full path to the file from step 1. This script will write to the current directory a pickled dictionary named: “<full_path_to_annotation_tsv_file>.pickle” Run the command as shown below:
python ./gsel_vec/gsel_vec/create_annotations/make_anno_dict.py <full_path_to_annotation_tsv_file>
- Move pickled file to ‘anno_dict” folder
- Move the pickled to the following location ./gsel_vec/gsel_vec/data/anno_dict. Here is an example command:
mv <pickled_file_full_path> ./gsel_vec/gsel_vec/data/anno_dict
- Modify the file anno_description.txt to include new evolutionary measure
- Open the file anno_description.txt located at “gsel_vec/gsel_vec/data/anno_dict/anno_description.txt”.
- Add a line at the bottom with three values separated by commas.
- The first value is the name of the evolutionary measure, the second value is the name of the pickled file, the third column can only be True or False.
- This third column specifies whether two-sided P-value or one-sided P-values should be calculated. See example of a line that should be added:
evo_measure_name,evo_measure_values.pickle,True
- This third column specifies whether two-sided P-value or one-sided P-values should be calculated. See example of a line that should be added:
- Reinstall GSEL.
- uninstall GSEL if you already have it installed. Run
pip uninstall gsel_vec
- make sure you are in the ./gsel_vec directory. Then run
pip install .
- GSEL can be run using the instruction given above and it will now include the new evolutionary measure.
Running GSEL on non-European populations
To run GSEL on non European population requires two modifications. (1) The 1000 Genomes SNP data used for LD clumping and (2) the SNPSnap database should be changed to the population of interest.
(1) 1000 Genomes Phase 3
- In PLINK (*.bim, *.bed, *.fam) format, you can download 1000 Genomes Phase 3 for the population of interest at this link: https://www.internationalgenome.org/data/.
- Make sure each file is named using this format: “chr<#>.bed”,“chr<#>.bim” or “chr<#>.fam”
- Move all files for chromosomes 1-22 to this location.
gsel_vec/gsel_vec/data/1kg
(2) SNPSnap provides a dataset used for matching in the GSEL pipeline. Currently, datasets for the European, East Asian, and West African super populations can be downloaded from https://data.broadinstitute.org/mpg/snpsnap/database_download.html
. If you require a different population, then you must create the SNSnap dataset yourself. See SNSnap website for details.
-
Download snpsnap dataset using distance type of “LD” and distance cut-off of r2=0.9 and your population of interest.
-
Ensure that the downloaded file is named “snpsnap_database_ld0.9.tab.gz”.
-
Move this file to
gsel_vec/gsel_vec/data/snpsnap_database
-
Run GSEL using instructions given above.