A collection of scripts to estimate cell type estimates from bulk tissue gene expression data, perform interaction-eQTL (ieQTL) mapping and run colocalization analysis
Following dependencies should be installed before running the pipeline:
- R (>= 3.2.2)
- Rpackages: xCell, GSEABase, coloc, data.table
- Python (>= 3.6.4 and >=2.7.8)
- Python modules: pandas, numpy
- tensorqtl for eQTL mapping
- A modified eigenMT python script to read tensorQTL output directly. It also include postprocessing steps that need annotation.py to produce the summary stat output format from GTEx v8. The original work can be found here
- tabix
- Example input data from the Geuvadis project are available here
To illustrate each step of this pipeline we will use data from the Geuvadis project. Please download GEUVADIS.445_samples.GRCh38.20170504.maf01.filtered.bed/.bim/.fam
, GEUVADIS.445_samples.GRCh38.20170504.maf01.filtered.vcf.gz/.tbi
, GEUVADIS.445_samples.expression.bed.gz/.tbi
and GEUVADIS.445_samples.covariates.txt
from the link above and save in the example
directory to run each of the scripts.
NOTE: The purpose of this pipeline is to give an overview of analyses performed in the upcoming GTEx v8 cell type composition paper. It specifies individual parameters, provides helper functions and describes the environment each script was used in. To use it for your own data scripts will need to be adjusted accordingly. This workflow is not meant to be used as a standalone pipeline but as a template to reproduce GTEx v8 results or to use it in other projects.
The R-package xCell calculates cell type enrichment scores from bulk tissue gene expression data. Following script Run_xCell.sh
includes steps to read in the input file, add additional samples from GTEx v6p to ensure tissue-diversity, run xCell and outputs estimates of the cell type of interest to use as interaction covariate in tensorqtl.
./Run_xCell.sh
Tensorqtl can be used to map cis-eQTLs that interact with cell type estimates. Following script lists all tensorqtl parameters and parameters that are specific to the job scheduler slurm at NYGC.
celltype=B-cells; tissue=Cells_EBV-transformed_lymphocytes;
sbatch ./Run_tensorqtl_interaction.sh ${tissue} ${celltype}
Permutationschemes adapted for interaction-eQTLs were not available when GTEx v8 cell type i-eQTLs were generated. We used eigenMT as alternative multiple testing correction method for cis-eQTL studies. Gene-level corrected p-values were further adjusted using Benjamini-Hochberg procedure.
NOTE: To run
submit_eigenmt_py3.sh
genotype and phenotype input files need to be provided. These can be generated using the scriptRun_prepare_eigenmt.sh
, which only needs to be run once.
celltype=B-cells; tissue=Cells_EBV-transformed_lymphocytes;
./Run_eigenmt.sh ${tissue} ${celltype}
The output file contains summary statistics for the interaction term, the genotype and cell type main effects, the eigenMT + Benjamini-Hochberg-corrected pvalues for the interaction term, and gene names and biotypes extracted from Gencode:
variant_id gene_id gene_name biotype phenotype_id tss_distance maf ma_samples ma_count pval_g b_g b_g_se pval_i b_i b_i_se pval_gi b_gi b_gi_se pval_emt tests_emt pval_adj_bh
chr1_1075715_A_G_b38 ENSG00000187961.13 KLHL17 protein_coding ENSG00000187961.13 115128 0.0940299 119 126 0.0436386 0.056653 0.0280206 0.0791713 -0.089065 0.0506479 0.00106786 0.0940563 0.0286051 1 1034 1
We ran colocalization analysis only on significant cell type ieGenes (FDR < 0.05). Following code extracts gene-ids of these ieGenes.
mkdir -p example/coloc
celltype=B-cells; tissue=Cells_EBV-transformed_lymphocytes;
cat example/${celltype}_ieqtl/eqtls_final/${tissue}_${celltype}_ieqtl.eigenMT.annotated.txt | awk '$21 < 0.05 {print $2}' > example/${celltype}_ieqtl/phenolist_${tissue}_${celltype}_ieqtl_fdr5.txt
Extract summary statistics of the entire cis-window of significant cell type ieGenes. This script needs about 60-80G of memory.
celltype=Neutrophils; tissue=Whole_Blood;
p2f="example/${celltype}_ieqtl"
python3 extract_parquet2txt.py \
-p ${p2f}/${tissue}*.parquet \
-s ${p2f}/phenolist_${tissue}_${celltype}_ieqtl_fdr5.txt \
-o ${tissue}_${celltype}_ieqtl_fdr5.txt.gz \
-d ${p2f}
celltype=Neutrophils; tissue=Whole_Blood;
./Run_coloc_gtex.v8i_114traits_prior_tq.sh ${tissue}_${celltype}_ieqtl_fdr5.txt.gz GTEx_Analysis_2017-06-05_v8_samplesize_abbrv_1.${celltype}.tissues.txt 1
celltype=Neutrophils; tissue=Whole_Blood;
./Run_coloc_post.sh coloc_${tissue}_${celltype}_ieqtl_fdr5_HEIGHT.Rda ${tissue}_${celltype}_ieqtl_fdr5 114traits
celltype=Neutrophils; tissue=Whole_Blood;
./Run_locusplot.sh Coloc.res4locuszoom_${tissue}_${celltype}_ieqtl_fdr5.txt 2 ieqtl locusplots/ieqtl