This is an R package implementing the TWRCI algorithm for discovering root causal gene expression levels -- or root causal genes for short -- from observational data. Root causal genes correspond to the first gene expression levels that are disturbed in disease near the beginning of pathogenesis. In contrast, core genes lie at the end of pathogenesis and driver genes only account for the effects of somatic mutations primarily in cancer. TWRCI requires individual level data containing genetic variants, gene expression levels from the relevant tissue and the phenotype (variant-expression-phenotype data).
The academic article describing TWRCI in detail can be found here. Please cite the article if you use any of the code in this repository.
The Experiments folder contains any additional code needed to replicate the experimental results in the paper after downloading GTEx V8 protected access data and lifting over to hg19 with BCFtools. All code was tested in R version 4.3.1 and BCFtools version 1.18.
Install time is under 5 minutes for a typical desktop computer.
if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install(c("qvalue", "biomaRt", "AnnotationDbi", "org.Hs.eg.db", "gage", "RBGL", "Rgraphviz"))
Download the independence 1.0.1 package from the CRAN archive and install it from source:
install.packages(path_to_file, repos = NULL, type="source")
Then:
library(devtools)
install_github("ericstrobl/TWRCI")
library(TWRCI)
Installation of the cTWAS package is also required if you want to replicate the experiments:
install_github("xinhe-lab/ctwas",ref = "main")
Set the number of samples of the variant-expression-phenotype data to 500:
nsamps = 500
Instantiate 30 instrumental variables:
datat = matrix(rnorm(nsamps*30),nsamps,30)
Set the number of gene expression levels plus the phenotype to 10:
p = 10
Assign variants to each gene and the phenotype:
SNPs = generate_SNP_list(p,ncol(datat))
Generate the DAG:
G = generate_DAG(p,SNPs$SNPs,bulk_nbatch=1)
Generate the remaining expression-phenotype data:
RNAseq = sample_data(G,SNPs$SNPs,datat,p=p,nsamps=nsamps)
Run TWRCI:
out = TWRCI(RNAseq$X, RNAseq$SNP_data, RNAseq$Y, RNAseq$batch, RNAseq$batch, graph=TRUE, CRCEs=TRUE)
Run time is under 2 minutes for a typical desktop computer. Print the output of TWRCI:
print(out$K) # causal order
print(out$SNPs) # annotations
print(out$G_est) # causal graph
print(out$CRCEs) # CRCE estimates