/Cell-type-ieQTL-pipeline

A collection of scripts to estimate cell type estimates from bulk tissue gene expression data, perform ieQTL mapping and run colocalization analysis

Primary LanguageR

Cell type interaction-eQTL pipeline

A collection of scripts to estimate cell type estimates from bulk tissue gene expression data, perform interaction-eQTL (ieQTL) mapping and run colocalization analysis


Dependencies

Following dependencies should be installed before running the pipeline:

  • R (>= 3.2.2)
  • Rpackages: xCell, GSEABase, coloc, data.table
  • Python (>= 3.6.4 and >=2.7.8)
  • Python modules: pandas, numpy
  • tensorqtl for eQTL mapping
  • A modified eigenMT python script to read tensorQTL output directly. It also include postprocessing steps that need annotation.py to produce the summary stat output format from GTEx v8. The original work can be found here
  • tabix
  • Example input data from the Geuvadis project are available here

Workflow

To illustrate each step of this pipeline we will use data from the Geuvadis project. Please download GEUVADIS.445_samples.GRCh38.20170504.maf01.filtered.bed/.bim/.fam, GEUVADIS.445_samples.GRCh38.20170504.maf01.filtered.vcf.gz/.tbi, GEUVADIS.445_samples.expression.bed.gz/.tbi and GEUVADIS.445_samples.covariates.txt from the link above and save in the example directory to run each of the scripts.

NOTE: The purpose of this pipeline is to give an overview of analyses performed in the upcoming GTEx v8 cell type composition paper. It specifies individual parameters, provides helper functions and describes the environment each script was used in. To use it for your own data scripts will need to be adjusted accordingly. This workflow is not meant to be used as a standalone pipeline but as a template to reproduce GTEx v8 results or to use it in other projects.

1. Estimate cell type abundance from bulk tissue gene expression data

The R-package xCell calculates cell type enrichment scores from bulk tissue gene expression data. Following script Run_xCell.sh includes steps to read in the input file, add additional samples from GTEx v6p to ensure tissue-diversity, run xCell and outputs estimates of the cell type of interest to use as interaction covariate in tensorqtl.

./Run_xCell.sh

2. Map cell type interaction-eQTLs

2.1 Run eQTL interaction analysis

Tensorqtl can be used to map cis-eQTLs that interact with cell type estimates. Following script lists all tensorqtl parameters and parameters that are specific to the job scheduler slurm at NYGC.

celltype=B-cells; tissue=Cells_EBV-transformed_lymphocytes;
sbatch ./Run_tensorqtl_interaction.sh ${tissue} ${celltype}
2.2 Correct for multiple testing

Permutationschemes adapted for interaction-eQTLs were not available when GTEx v8 cell type i-eQTLs were generated. We used eigenMT as alternative multiple testing correction method for cis-eQTL studies. Gene-level corrected p-values were further adjusted using Benjamini-Hochberg procedure.

NOTE: To run submit_eigenmt_py3.sh genotype and phenotype input files need to be provided. These can be generated using the script Run_prepare_eigenmt.sh, which only needs to be run once.

celltype=B-cells; tissue=Cells_EBV-transformed_lymphocytes;
./Run_eigenmt.sh ${tissue} ${celltype}

The output file contains summary statistics for the interaction term, the genotype and cell type main effects, the eigenMT + Benjamini-Hochberg-corrected pvalues for the interaction term, and gene names and biotypes extracted from Gencode:

variant_id	gene_id	gene_name	biotype	phenotype_id	tss_distance	maf	ma_samples	ma_count	pval_g	b_g	b_g_se	pval_i	b_i	b_i_se	pval_gi	b_gi	b_gi_se	pval_emt	tests_emt	pval_adj_bh
chr1_1075715_A_G_b38	ENSG00000187961.13	KLHL17	protein_coding	ENSG00000187961.13	115128	0.0940299	119	126	0.0436386	0.056653	0.0280206	0.0791713	-0.089065	0.0506479	0.00106786	0.0940563	0.0286051	1	1034	1

3. Perform colocalization analysis of cell type i-eQTLs and GWAS traits

3.1 Extract significant cell type ieGenes

We ran colocalization analysis only on significant cell type ieGenes (FDR < 0.05). Following code extracts gene-ids of these ieGenes.

mkdir -p example/coloc
celltype=B-cells; tissue=Cells_EBV-transformed_lymphocytes;
cat example/${celltype}_ieqtl/eqtls_final/${tissue}_${celltype}_ieqtl.eigenMT.annotated.txt | awk '$21 < 0.05 {print $2}' > example/${celltype}_ieqtl/phenolist_${tissue}_${celltype}_ieqtl_fdr5.txt

Extract summary statistics of the entire cis-window of significant cell type ieGenes. This script needs about 60-80G of memory.

celltype=Neutrophils; tissue=Whole_Blood;
p2f="example/${celltype}_ieqtl"
python3 extract_parquet2txt.py \
    -p ${p2f}/${tissue}*.parquet \
    -s ${p2f}/phenolist_${tissue}_${celltype}_ieqtl_fdr5.txt \
    -o ${tissue}_${celltype}_ieqtl_fdr5.txt.gz \
    -d ${p2f}
3.1 Run coloc
celltype=Neutrophils; tissue=Whole_Blood;
./Run_coloc_gtex.v8i_114traits_prior_tq.sh ${tissue}_${celltype}_ieqtl_fdr5.txt.gz GTEx_Analysis_2017-06-05_v8_samplesize_abbrv_1.${celltype}.tissues.txt 1
3.2 Postprocessing
celltype=Neutrophils; tissue=Whole_Blood;
./Run_coloc_post.sh coloc_${tissue}_${celltype}_ieqtl_fdr5_HEIGHT.Rda ${tissue}_${celltype}_ieqtl_fdr5 114traits
3.3 Make locusplots
celltype=Neutrophils; tissue=Whole_Blood;
./Run_locusplot.sh Coloc.res4locuszoom_${tissue}_${celltype}_ieqtl_fdr5.txt 2 ieqtl locusplots/ieqtl