dominant-eQTL

This analysis pipeline can be used to detect loci having non-additive genetic assoications with gene expression levels.

Theory

We applied a multiple linear regression model to identify dominant associaions between SNPs and gene expression levels.

$G_A$ stands for the number of non-reference alleles. $G_A$ = 0 if genotype is reference homozygous; $G_A$ = 1 of genotype is heterozygous; $G_A$ = 2 if genotype is non-reference homozygous. This is the variable that captures the additive effects, which is commonly used in the analysis of gene expression quantitative loci(eQTLs).

$G_D$ captures dominant effects, where $G_D$ = 1 if genotype is heterozygous and $G_D$ = 0 if genotype is either reference or non-reference homozygous.

E denotes the observed gene expression levels. Our model assumes that the noise across samples is normally distributed with 0 mean and some variance value.

Our null hypothesis is that there is no evidence for a dominant association, which is equivalent to β2 not significantly different from zero. An ANOVA model can be used instead of a multiple linear regression to detect both additive and dominant effects of the genotype; however the F-test used in the ANOVA model aims to test if at least one beta is significantly different from zero. This does not fulfill our objective, as we specifically wish to test whether there is evidence showing that β2 is not equal to zero. Alternatively, a t-test can be used to test the effect sizes of Ga and Gd separately.

Here is an illustration of what dominant eQTL looks like.

Workflow

Getting Started

Dependencies

snakemake 3.13.3
python2.7
numpy
R

Snakemake Workflow

Input files: Genotype file (genotype.txt) and RNA-Seq counts matrix file (gene_expr.txt) downloaded from GTEx project

Run PCA analysis on the genotype matrix

Rscript --vanilla pca_genotype.R genotype genotype.txt genotype_pca.txt

Regress out sex, age, race and hidden covariates from gene expression matrix using PEER package

Rscript --vanilla peer_factor.R gene_expr.txt genotype_pca.txt phenotype_info.txt gene_expr_PEER.txt

Split jobs for parallele computation:1000 genes per job

python split_into_files.py gene_expr_PEER.txt {files}_peer_expr.txt

Merge SNPs that are within +/- 100kb of gene body with gene expression matrixes by column into a large matrix

python snps_counts_comb_by_chr_no_filter.py snp.h5 index.h5 genotype_pca.txt {files}_peer_expr.txt {files}_snps_counts_comb.txt

Adjust p-values using beta approximation and further speed up using matrix multiplication

Rscript --vanilla beta_adjust_P_matrix_multip_corrected.R {files}_snps_counts_comb.txt {files}_output.txt

Merge output files

cat {files}_output.txt > all_chr_output.txt

extract out snp-gene pair that shows dominant effects

Rscript --vanilla extract_snp_gene_pair_comb_matrix.R {files}_output.txt {files}_snps_counts_comb.txt dominant_snp_gene_pair.txt

Authors

Jing Gu

Acknowledgements

Dr. Graham McVicker Salk Institute for Biological Studies
Patrick Fiaux UC San Diego
Arko Sen Salk Institute for Biological Studies
Hsiuyi Chen Salk Institute for Biological Studies
Selene Tyndale Salk Institute for Biological Studies

j3gu/Dominant-eQTLs