TEffectR: An R package for studying the potential effects of transposable elements on gene expression with linear regression model
Citation information : Karakülah G, Arslan N, Yandım C, Suner A. 2019. TEffectR: an R package for studying the potential effects of transposable elements on gene expression with linear regression model. PeerJ 7:e8192 https://doi.org/10.7717/peerj.8192
Transposable elements (TEs) are DNA sequences that are able to translocate themselves along a host genome (Biemont & Vieira 2006). This R (https://www.r-project.org) package was developed for dissecting significant associations between TEs and nearby genes in a given RNA-sequencing (RNA-seq) data set by establishing a linear regression model (LM). Our R package, namely TEffectR, makes use of publicly available RepeatMasker TE (http://www.repeatmasker.org) and Ensembl gene annotations (https://www.ensembl.org/index.html) and calculate total unique read counts of TEs from sorted and indexed genome aligned BAM files. Then, it predicts the associations of TE expressions with the transcription of adjacent genes under diverse biological conditions. These associations could be made use of by biologists to understand potential influences of TE expression on the regulation of nearby genes.
- R version should be version 3.5+
- While using r programming, we suggest you to use Rstudio which is the R statistical computing environment to use and understand functions TEffectR well.
- Bedtools is required on your local computer.
- devtools is required to install TEffectR.
- TEffectR uses these R packages so you have to install all of them. You may visit the following websites to install them easily:
library(devtools)
devtools::install_github("karakulahg/TEffectR")
- Load the library:
library(TEffectR)
-
Download the most recent RepeatMasker annotation file for the organism of interest.
-
The following function takes RepeatMasker annotation file as input and extracts the genomic location of each TE along with the repeat class and family information. The output of rm_format() function is used while searching TEs overlapping in the upstream region of a given gene list. In our case, we use hg38 assembly:
repeatmasker.annotation <- TEffectR::rm_format(filepath = "~/Path2Directory/hg38.fa.out.gz" )
- Read raw gene counts. An example gene count matrix can be dowloaded from: here. In this step, we make use of a publicly available whole transcriptome sequencing dataset including normal and tumor tissue specimens obtained from 22 ER+/HER2-breast cancer patients (GEO Accession ID: GSE103001). The count matrix was generated with HISAT2 - StringTie pipeline.
exprs <- read.csv("gene_count_matrix.csv", row.names = 1, header=T, stringsAsFactors = F)
-
Retrieve the genomic locations of all genes in the given read count matrix.
-
The URL argument takes the version of Ensembl database used for gene expression quantification and can be listed here. Alternatively, you can list using the following command:
biomaRt::listEnsemblArchives()
-
ID.type must be ensembl_gene_name, ensembl_gene_id, or ensembl_transcript_id.
-
In our case, we use Ensembl gene IDs (e.g. ENSG00000000003, ENSG00000000005, ...):
gene.annotation <- get_intervals(x = rownames(exprs), assembly="hg38", ID.type = "ensembl_gene_id", URL="http://dec2014.archive.ensembl.org" )
-
-
The following function takes the genomic intervals of genes and TEs as input. Besides, the user also requires to provide three additional parameters: (i) the maximum distance allowed between the start sites of genes and TEs, (ii) whether genes and TEs must be located in same strand and (iii) TE family or subfamily name (e.g. SINE, LINE). This function helps to detect TE species that are localized within the upstream (the "distance" argument takes positive values e.g. 5000) or downstream (the "distance" argument takes negative values e.g. -5000) of genes of interest.
overlaps <- TEffectR::get_overlaps(g=gene.annotation, r=repeatmasker.annotation, strand = "strandness", distance = 5000, repeat_type = "LTR")
- Count uniquely mapped reads to the TEs that are located within 5kb upstream of the given gene list. This step returns a raw count matrix of the total number of reads originated from TE sequences. Only the reads exhibiting 100% overlap with given TE regions are considered and the user needs to specify individual paths for each BAM file as input. All BAM files used in this step can be downloaded from: here This step may take up to a few hours depending on the number of BAM files.
BAM.list <- c("~/Path2Directory/SRR5962198/SRR5962198_unique_sorted.bam",
"~/Path2Directory/SRR5962199/SRR5962199_unique_sorted.bam",
"~/Path2Directory/SRR5962200/SRR5962200_unique_sorted.bam",
"~/Path2Directory/SRR5962201/SRR5962201_unique_sorted.bam",
"~/Path2Directory/SRR5962202/SRR5962202_unique_sorted.bam",
"~/Path2Directory/SRR5962203/SRR5962203_unique_sorted.bam",
"~/Path2Directory/SRR5962204/SRR5962204_unique_sorted.bam",
"~/Path2Directory/SRR5962205/SRR5962205_unique_sorted.bam",
"~/Path2Directory/SRR5962206/SRR5962206_unique_sorted.bam",
"~/Path2Directory/SRR5962207/SRR5962207_unique_sorted.bam",
"~/Path2Directory/SRR5962208/SRR5962208_unique_sorted.bam",
"~/Path2Directory/SRR5962209/SRR5962209_unique_sorted.bam",
"~/Path2Directory/SRR5962210/SRR5962210_unique_sorted.bam",
"~/Path2Directory/SRR5962211/SRR5962211_unique_sorted.bam",
"~/Path2Directory/SRR5962212/SRR5962212_unique_sorted.bam",
"~/Path2Directory/SRR5962213/SRR5962213_unique_sorted.bam",
"~/Path2Directory/SRR5962214/SRR5962214_unique_sorted.bam",
"~/Path2Directory/SRR5962215/SRR5962215_unique_sorted.bam",
"~/Path2Directory/SRR5962216/SRR5962216_unique_sorted.bam",
"~/Path2Directory/SRR5962217/SRR5962217_unique_sorted.bam",
"~/Path2Directory/SRR5962218/SRR5962218_unique_sorted.bam",
"~/Path2Directory/SRR5962219/SRR5962219_unique_sorted.bam",
"~/Path2Directory/SRR5962220/SRR5962220_unique_sorted.bam",
"~/Path2Directory/SRR5962221/SRR5962221_unique_sorted.bam",
"~/Path2Directory/SRR5962222/SRR5962222_unique_sorted.bam",
"~/Path2Directory/SRR5962223/SRR5962223_unique_sorted.bam",
"~/Path2Directory/SRR5962224/SRR5962224_unique_sorted.bam",
"~/Path2Directory/SRR5962225/SRR5962225_unique_sorted.bam",
"~/Path2Directory/SRR5962226/SRR5962226_unique_sorted.bam",
"~/Path2Directory/SRR5962227/SRR5962227_unique_sorted.bam",
"~/Path2Directory/SRR5962228/SRR5962228_unique_sorted.bam",
"~/Path2Directory/SRR5962229/SRR5962229_unique_sorted.bam",
"~/Path2Directory/SRR5962230/SRR5962230_unique_sorted.bam",
"~/Path2Directory/SRR5962231/SRR5962231_unique_sorted.bam",
"~/Path2Directory/SRR5962232/SRR5962232_unique_sorted.bam",
"~/Path2Directory/SRR5962233/SRR5962233_unique_sorted.bam",
"~/Path2Directory/SRR5962234/SRR5962234_unique_sorted.bam",
"~/Path2Directory/SRR5962235/SRR5962235_unique_sorted.bam",
"~/Path2Directory/SRR5962236/SRR5962236_unique_sorted.bam",
"~/Path2Directory/SRR5962237/SRR5962237_unique_sorted.bam",
"~/Path2Directory/SRR5962238/SRR5962238_unique_sorted.bam",
"~/Path2Directory/SRR5962239/SRR5962239_unique_sorted.bam",
"~/Path2Directory/SRR5962240/SRR5962240_unique_sorted.bam",
"~/Path2Directory/SRR5962241/SRR5962241_unique_sorted.bam")
SampleName.list <- colnames(exprs)
TE.counts <- TEffectR::count_repeats(bamlist = BAM.list, namelist = SampleName.list, ranges=overlaps)
- The following command takes the output of count_repeats() function as input. It is used to calculate the total number of sequencing reads derived from each TE that is located within the upstream of a certain gene.
SumOfTEs <- TEffectR::summarize_repeat_counts(counts = TE.counts, namelist = SampleName.list)
- The following core function applies filtering, TMM normalization, voom transformation and LM to the given raw count expression values, respectively. It takes four arguments: (i) raw gene counts, (ii) raw TE counts, (iii) a data frame containing user-defined covariates (e.g. tissue type, disease state), and (iv) the output of get_overlaps() function. When covariates are determined, one may include all the biological factors to see if they could explain the expression of the gene in conjuction with TE expression. However, one may as well only use the TE expression as the single predictor without the inclusion of further covariates. In this case, the covariates parameter takes NULL value.
This function returns three outputs: (i) a tsv file containing the p-value of each model, significance level of covariates and associated adjusted R squared values, (ii) another tsv file containing log2(CPM) values of genes and TEs included in LM, and (iii) a group of diagnostic plots for each significant model (p < 0.05).
#Create a data frame containing user-defined covariates.
df.covariates <- data.frame( tissue_type=c(rep("Normal", 22), rep("Tumor", 22)), patient=c(c(1:22), c(1:22)) )
#Apply multiple linear regression models using the given list of covariates and TE counts.
results <- TEffectR::apply_lm(gene.annotation = gene.annotation, gene.counts = exprs, repeat.counts = SumOfTEs, covariates = df.covariates, prefix = "LTR-5kb")
-
The tab separated file containing the p-value of each model can be downloaded here and Log2(CPM) values of genes and repeats are available here
-
If you would like to run TEffectR pipeline with the TE counts generated by other quantification tools (e.g. SQuIRE or TEtranscripts), please make sure that the generated input files of such tools, which you use to run the TEffectR, must be in the following format:
gene.annotation :
chr start end strand geneID geneName
chrX 2691179 2741309 ENSG00000002586 CD99
chr4 11393150 11429765 - ENSG00000002587 HS3ST1
chr6 136256863 136289851 - ENSG00000029363 BCLAF1
...
gene.counts :
Sample1 Sample2 Sample3 Sample4 Sample5 Sample6 Sample7 Sample8 Sample9 Sample10 Sample11
ENSG00000002586 0 0 2 0 2 0 0 1 0 0 11
ENSG00000002587 491 243 745 1233 1004 592 3493 539 412 312 1403
ENSG00000029363 438 176 702 1266 1130 526 2304 579 480 331 2122
...
sum.repeat.counts :
geneName repeatClass repeatName Sample1 Sample2 Sample3 Sample4 Sample5 Sample6 Sample7 Sample8 Sample9 Sample10 Sample11
CD99 LINE CR1-11_Crp 24 0 12 0 0 0 24 0 12 0 12
HS3ST1 LINE CR1-3_Croc 24 54 12 35 89 103 24 67 12 47 12
BCLAF1 LINE HAL1 68 35 23 0 88 49 57 39 25 43 38
....
Sample files can also be downloaded here: gene.annotation, gene.counts, and repeat.counts and the following pipeline can be run with these files:
library(stringr)
library(biomaRt)
library(biomartr)
library(dplyr)
library(Rsamtools)
library(edgeR)
library(rlist)
library(limma)
library(TEffectR)
# read your gene annotation file
gene.annot<-read.table("gene.annotation.tsv", header= T, stringsAsFactors = F)
# read your gene expression file
gene.counts<-read.table("gene.counts.tsv", header= T, row.names=1, stringsAsFactors = F)
# read your summarised repeat annotation file
sum.repeat.counts<-read.table("sum.repeat.counts.tsv", header= T, stringsAsFactors = F)
# include covariates
covariates <- data.frame("TissueType" = c(rep("N",5), rep("T",6)) )
OR
# in case the TE expression is only single predictor
covariates <- NULL
prefix<-"SampleRun"
# apply linear modeling function of TEffectR
lm<-TEffectR::apply_lm(gene.annotation = gene.annot,
gene.counts = gene.counts,
repeat.counts = sum.repeat.counts,
covariates = covariates,
prefix = prefix)
> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.4 LTS
Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=tr_TR.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=tr_TR.UTF-8
[6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=tr_TR.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=tr_TR.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] bindrcpp_0.2.2 rlist_0.4.6.1 edgeR_3.24.3 limma_3.38.3 Rsamtools_1.34.0 Biostrings_2.50.2
[7] XVector_0.22.0 GenomicRanges_1.34.0 GenomeInfoDb_1.18.1 IRanges_2.16.0 S4Vectors_0.20.1 BiocGenerics_0.28.0
[13] dplyr_0.7.8 biomartr_0.8.0 biomaRt_2.38.0 stringr_1.3.1 TEffectR_0.1.0
loaded via a namespace (and not attached):
[1] progress_1.2.0 tidyselect_0.2.5 locfit_1.5-9.1 purrr_0.3.0 lattice_0.20-35
[6] yaml_2.2.0 blob_1.1.1 XML_3.98-1.16 rlang_0.3.1 pillar_1.3.1
[11] glue_1.3.0 DBI_1.0.0 BiocParallel_1.16.5 bit64_0.9-7 GenomeInfoDbData_1.2.0
[16] bindr_0.1.1 zlibbioc_1.28.0 memoise_1.1.0 Biobase_2.42.0 curl_3.3
[21] AnnotationDbi_1.44.0 Rcpp_1.0.0 readr_1.3.1 bit_1.1-14 hms_0.4.2
[26] digest_0.6.18 stringi_1.2.4 grid_3.5.1 tools_3.5.1 bitops_1.0-6
[31] magrittr_1.5 RCurl_1.95-4.11 RSQLite_2.1.1 tibble_2.0.1 crayon_1.3.4
[36] pkgconfig_2.0.2 data.table_1.12.0 prettyunits_1.0.2 assertthat_0.2.0 httr_1.4.0
[41] rstudioapi_0.9.0 R6_2.3.0 compiler_3.5.1