This pipeline colocalizes eQTL and GWAS signals using various colocalization methods. Currently, we have support for COLOC and for the eCAVIAR method, for which we use a time-efficient implementation, FINEMAP, for the fine-mapping step.
- Setup
- Preparing Input Files
- Creating the configuration file
- Running the pipeline
- Pipeline output
- Tips for power users
- Troubleshooting
- Colocalization software used
- Contributors and acknowledgements
To run this pipeline, first clone this repository into your desired directory.
You will need to install PLINK 1.9, FINEMAP,
and tabix. Binary executables for these tools should be placed in a
/bin
directory on the level under the project directory. The PLINK executable must be named plink
, the FINEMAP executable must be named finemap
,
and the tabix
executable must be named tabix
.
The following python2 modules must also be installed:
The following R packages (vN.N or higher) must also be installed:
coloc
to run COLOC
GWAS and eQTL input files must be tab-delimited lists of SNPs, containing at least the following columns:
chr
: Chromosomesnp_pos
: Position of SNP in hg19alt
: Effect alleleref
: Non-effect allelebeta
: Estimated effect size of thealt
allele, when compared with theref
allelese
: Standard error ofbeta
QTL files must additionally contain a gene
or feature
column that specifies which gene's expression
is being tested. (Multiple features can be given in the same file.)
QTL files may include the following additional optional columns:
effect_af_eqtl
: Allele frequency of the effect (alt
) allele
Formatting GWAS files may be easier using the scripts found at (https://github.com/mikegloudemans/gwas-download).
The column order does not matter, but the column names must be specified in the header of the file.
An example GWAS input file:
chr snp_pos alt ref beta se pvalue
1 751756 C T .013006 .017324 .4528019
1 752566 A G -.005243 .0157652 .7394597
1 752721 G A -.003032 .0156381 .8462652
1 752894 C T .00464 .0162377 .7750657
[...]
Depending on which tool(s) you're running, additional columns may be required. This will be described below.
GWAS files may include the following additional optional columns:
effect_af_gwas
: Allele frequency of the effect (alt
) allele
It is okay to have additional columns beyond the required or optional ones; these will be ignored when the program is run. What's important is that all the required columns are included.
The GWAS, eQTL, and VCF files must be sorted first by chromosome and then by snp position,
and finally zipped with bgzip
and indexed using tabix
. This ensures that the pipeline will run
at a reasonable speed.
The following two commands could be used to index a file named my_eqtl_file.txt
with the chromosome number
in the second column (corresponds to -s
parameter) and the chromosome position in the third column (-b
and -e
parameters):
bgzip -f /home/my_eqtl_file.txt
tabix -f -S 1 -s 2 -b 3 -e 3 /home/my_eqtl_file.txt.gz
FINEMAP requires a reference VCF to compute the LD between positions of interest.
COLOC requires a reference VCF to estimate MAFs for the variants, unless the MAFs are directly specified in the GWAS and/or eQTL input files.
We recommend using the 1000 Genomes VCF, which can be obtained freely from the 1000 Genomes website in its hg19 or hg38 form.
The configuration file specifies the specific colocalization methods to be performed, the parameter settings for these methods, the input and output file locations, and the basis for selecting the loci to be tested.
The file must be in JSON format. The top level is a JSON object in which the following parameters may be specified. To get an idea of what this file should look like, ask for an example config file.
The root directory for output files generated by this analysis.
Default: /users/mgloud/projects/brain_gwas
A directory in which to place all output files generated by this analysis
(relative to the out_dir
parameter).
If not specified, the output directory will simply be a combination of the config file
name a time stamp of the time at which the pipeline was started running.
The directory where you'll store temporary files (these files will be removed at the end of each run).
Default: /users/mgloud/projects/brain_gwas/tmp
The process by which the loci to test are selected. Can be eqtl
, gwas
, both
, snps_from_list
, or overlap_loci
. It takes the form of a dictionary:
"selection_basis":
{
"eqtl": {},
"gwas": {},
"both": {},
"snps_from_list": "/path/to/snps_from_list.txt",
"overlap_loci": "/path/to/coloc-tests.txt"
},
...
Note only one of the above methods would actually be included in the config file. For now, just ask me for help when you get to this part.
Plot all colocalization plots from LocusCompare.
Possible values are True
and False
.
Do not attempt to make colocalization plots from LocusCompare.
Possible values are True
and False
.
For now, if you are working with GWAS not included in LocusCompare,
this value must be set to True
to prevent errors.
Specifies a JSON object containing additional objects whose keys are the file locations of each eQTL file being tested. These files must be bgzipped and tabixed for the pipeline to run (see preparation instructions above in the "Preparing Input Files" section).
The eQTL file objects should contain two additional fields:
A name for the reference VCF to be used for aligning the directions of the eQTL SNPs and in
some cases for estimating the minor allele frequencies. For now, it's best to have the
same reference genome for all eQTL and all GWAS files (hopefully in the future this
will be changed). This name can be any valid string; you will later define the location
and the properties of this reference VCF file in the ref_genomes
parameter.
If no custom reference VCF is desired, we recommend using the 1K genomes VCF, which is freely available on the consortium website in an already-indexed form.
For now, there are two options available.
Set this parameter to "effect_size"
if your file contains a column with a signed beta
for each tested SNP and an se
column specifying the standard error
of the estimated effect.
Set this parameter to "pval_only"
if your file contains a p-value for each tested SNP
and one of the following:
- An
effect_direction
column with+
and-
values telling whether the direction of the alt allele's effect on gene expression is positive or negative. - A
beta
column with signed beta values for each tested SNP.
Specifies a JSON object containing additional objects whose keys are the file locations of each GWAS file being tested. These files must be bgzipped and tabixed for the pipeline to run (see preparation instructions above in the "Preparing Input Files" section).
The GWAS file objects should contain two additional fields:
A name for the reference VCF to be used for aligning the directions of the eQTL SNPs and in
some cases for estimating the minor allele frequencies. For now, it's best to have the
same reference genome for all eQTL and all GWAS files (hopefully in the future this
will be changed). This name can be any valid string; you will later define the location
and the properties of this reference VCF file in the ref_genomes
parameter.
If no custom reference VCF is desired, we recommend using the 1K genomes VCF, which is freely available on the consortium website in an already-indexed form.
For now, there are two options available.
Set this parameter to "effect_size"
if your file contains a column with a signed beta or with an
odds ratio for each tested SNP, and an se
column specifying the standard error
of the estimated effect.
Set this parameter to "pval_only"
if your file does not contain the above columns, but it contains
a p-value for each tested SNP, and a "direction" column telling whether the direction of the
alt allele's effect on gene expression is positive or negative.
Specifies a JSON object containing a list of all reference population VCFs used in this analysis.
The reference VCFs are used for calculating LD and sometimes for estimating minor allele frequencies,
and their names should correspond to the reference genomes specified in the eqtl_experiments
field.
Each reference VCF should be a JSON object with 3 required fields:
The bgzipped, tabixed file where the reference genome is located.
NOTE: Some reference VCFs are given in one long file, while others are given by chromosome. If yours
are given by chromosome, you can include the a {0}
placeholder in this string denoting the chromosome number
For example, in the filename 1KG_ALL.chr{0}.phase3.vcf.gz
, the colocalization script
will automatically replace the placeholder with the relevant chromosome number when running
the analysis.
The attribute in the INFO
field of the VCF that specifies the allele frequency.
This is done so we can filter out low-frequency variants to avoid generating files that
are impossible for colocalization methods to analyze due to edge cases.
For example, in 1K genomes VCFs the info field contains the following string:
AC=15;AF=0.00299521;AN=5008;NS=2504;DP=2036;EAS_AF=0.002;AMR_AF=0;AFR_AF=0.0091 ...
We want to extract the allele frequency, which is the second parameter listed here, so we
enter "AF"
for the af_attribute
field.
(For now, if you're having trouble figuring this out, just ask me. This part isn't very well-polished right now.)
The total number of individuals profiled in the reference population. (For the entire 1K genomes VCF, this number will be 2504.)
An object containing a list of the colocalization methods to use for this run. Each entry is itself a JSON object whose key is the method to run and whose value is an object specifying the parameter setting for this colocalization method.
Right now there are only two tested colocalization methods available: coloc
and finemap
.
Example: to run both FINEMAP and COLOC with no parameter modifications, we would specify
{
...,
"methods":
{
"finemap":{},
"coloc":{}
},
...
}
We now describe additonal parameters and setting needed to run each colocalization method:
The COLOC pipeline has no modifiable parameters; however, in order to run this method, some additional properties need to be specified.
Each GWAS in the gwas_experiments
section must specify not only a ref
and a
gwas_format
, but also the following parameters.
Either "cc"
or "quant"
, depending on whether the GWAS is a quantitative or case-control GWAS.
If the same type
applies to all of the files in gwas_experiments
, you may instead ignore
the gwas_experiments
-level type
value and instead assign "cc"
or "quant"
to the
global parameter "gwas_type"
.
The sample size of the GWAS. Currently, I'm not sure whether it's valid to estimate the sample size if it's not known; I hope to investigate this eventually.
Additionally, to run COLOC, the eqtl_experiments
need to have this parameter added:
The total sample size of the eQTL study.
Alternatively, if you are performing colocalizations for a large number of GWAS and QTL files
and do not want to specify "N"
within the config file, you may define the following global
parameters:
gwas_sample_sizes
eqtl_sample_sizes
Both of these parameters may to tab-delimited files with no header and two columns:
first, the file name (not the full path); second, the sample size of the corresponding study.
For example, gwas_sample_sizes
may point to a file that looks like this:
coloc_imputed_ADIPOGen_Adiponectin.txt.gz 29304
coloc_imputed_Astle_et_al_2016_Eosinophil_counts.txt.gz 173480
coloc_imputed_Astle_et_al_2016_Granulocyte_count.txt.gz 173480
coloc_imputed_Astle_et_al_2016_High_light_scatter_reticulocyte_count.txt.gz 173480
coloc_imputed_Astle_et_al_2016_Lymphocyte_counts.txt.gz 173480
The file names must exactly match the corresponding basename
of the files specified in
gwas_experiments
and eqtl_experiments
.
Look here for an example of a config file written for COLOC.
Currently, the FINEMAP pipeline has no modifiable parameters.
Once you have specified your desired options in the configuration file, it's simple to run the pipeline.
python ./dispatch.py [config_file] [num_threads]
If you are running the pipeline from SCG, load the following modules first:
module load r/3.6 # R v3.6
module load miniconda/2 # python2
module load plink/1.90b6.13 # PLINK v1.90b
module load tabix # tabix
The results of the pipeline are placed in a time-stamped folder in the output directory.
LocusCompareR visualizations of significant loci are
available in the plots
subdirectory. Please see the LocusCompareR website for more details on
how to interpret these plots if you are unfamiliar with LocusZoom-style plots.
The rest of the output will vary depending on which colocalization tools you ran. The outputs for specific tools are described below.
At the top level, the directory contains a file [gwas_name]_finemap_clpp_status.txt
.
This file contains a row for each locus tested. The sixth column is the CLPP score.
For more information on interpreting the CLPP score, see Hormozdiari et al.
We suggest filtering tested loci by the CLPP score and then visualizing the locus for further confirmation, to avoid
false positives.
A good starting filter is to select loci with at least 100 variants tested, a CLPP score of > 0.02, and a top -log(pvalue) of > 5 for both the eQTL and the GWAS studies. These filters can be made more stringent as needed if too many false positives are being detected.
The finemap
directory contains two file for each colocalization test, showing the posterior causal
probabilities for each SNP at the locus. These probabilities are shown in the snp_prob
column.
A file will be created in the top directory with the form [gwas_name]_coloc_h4pp_status.txt
.
There will be an h4pp
column in this file that contains the posterior probability of
H4 (colocalization) as described the COLOC model framework.
Regardless of which colocalization methods were selected, two files will be created
named ERROR_variants.txt
and skipped_variants.txt
that aid in troubleshooting
failed colocalization runs.
ERROR_variants.txt
contains a list of loci for which the pipeline crashed before
running to completion. A Python trackback report will be listed in the last column
whenever possible, and will hopefully be helpful in diagnosing the problem; if it
is unclear though, you're welcome to contact me or to file an issue report on the
Github page.
skipped_variants.txt
contains variants that were skipped for known reasons rather
than due to unhandled issues in the pipeline. Some common reasons for skipping variants
are when no variants overlap between the GWAS, eQTL, and VCF files, or when the width
of the overlapping region is too small to perform a proper colocalization analysis.
In some rare cases we also find variants that are skipped because no genes were tested
for the eQTL study in that region; when this happens the description is listed as
"Gene desert".
In some cases, if you want to test a very large number of GWAS files (e.g. UK Biobank), it can be cumbersome to actually create the config file required to analyze all the files. Similar, in some cases you might only want to test certain combinations of GWAS and eQTL files at certain loci, rather than every possible pairwise combination.
If this is true for you, I recommend that instead of manually assembling the config file,
you load in a template of the JSON file and construct it programmatically using Python's
json
library or your language of choice.
I hope to eventually make available some code I've written for performing this in a principled and automated manner on any arbitrary set of GWAS and eQTL files according to the combinations passing certain thresholds.
My goal is to make these pipelines as easily runnable and as intuitive as possible! Please contact me with any problems you encounter, or post an issue report, and I'll happily work with you to help you out or to fix any bugs you might find.
FINEMAP v1.1 cite
[add others before making public]
- Mike Gloudemans
- Abhiram Rao
- Brunilda Balliu
- Nicole Gay
- Boxiang Liu
- Jeremy Tien
A big thanks to the following people with help troubleshooting and testing the pipeline:
- Pagé Goddard
- Derek Klarin