2020-06-12
To install WHISTLE from Github, use the following code:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(c("GenomicRanges","GenomicFeatures",
"BSgenome","IRanges",
"S4Vectors","GenomicScores",
"AnnotationDbi","Biostrings",
"TxDb.Hsapiens.UCSC.hg19.knownGene"))
if (!requireNamespace("devtools", quietly = TRUE))
install.packages("devtools")
devtools::install_github("ZW-xjtlu/WHISTLE", build_vignettes = TRUE)
If you want to annotate sequence and evolutionary conservation scores, please install the following packages (for hg19):
BiocManager::install(c("BSgenome.Hsapiens.UCSC.hg19",
"fitCons.UCSC.hg19",
"phastCons100way.UCSC.hg19"))
The WHISTLE R-package has been developed for the computational representation of the domain knowledge related to epitranscriptomic modifications, such as m6A, Pseudouridine, and m7G. The inputs of the main function transcriptomicFeatures are the GRanges
object for range based RNA modification locations and the TxDb
object for the transcript annotations with other optional annotation objects. The WHISTLE package has the following two key utilities:
-
Annotating the features that are powerful in the machine learning prediction of the mRNA modifications.
-
Elucidate the biological mechanisms of epitranscriptomic regulation by analyzing the statistical associations between the annotated features and other functional genomic features in field of epitranscriptomics.
Transcript annotation can be provided as a GTF/GFF file, a TxDb object, or automatically downloaded from UCSC through the internet using makeTxDbFromUCSC
.
We will in the next see how the annotation of the transcriptomic features can be accomplished in a single command.
Let us firstly load the WHISTLE package and the TxDb
for transcript annotations.
library(WHISTLE)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb_hg19 <- TxDb.Hsapiens.UCSC.hg19.knownGene
The main functioon of WHISTLE R-package is to annotate transcriptomic features based on a range based RNA modification sites input (Sites in single based resolution are highly recommended). The necessary inputs of this function are the RNA modification site in GRanges
object and the transcript annotation files in TxDb
object. In oder to extract features other than transcriptomic topologies, additional annotation objects such as BSgenome
and GSscores
are required. Please see ?transcriptomicFeatures
for detailed explainations.
feature_hg19 <- transcriptomicFeatures(
mod_gr = m6A_ex_hg19,
txdb = txdb_hg19
)
The output is a data.frame
object with the columns indicating the annoated features in dummy variables or real valued number data types. The data.frame
table can be used in prediction models directly as the feature/predictor matrix. The prediction models could be linear regression such as lm
and glm
in R base functions, or other advanced machine learning models such as svm and random forest implemented in R package caret
.
str(feature_hg19) # Feature list as data.frame table
## 'data.frame': 100 obs. of 22 variables:
## $ UTR5 : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ UTR3 : logi FALSE FALSE TRUE TRUE FALSE FALSE ...
## $ stop_codon : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ start_codon : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ TSS : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ exon_stop : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
## $ alt_exon : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ const_exon : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
## $ inter_exon : logi TRUE TRUE FALSE FALSE TRUE FALSE ...
## $ long_exon : logi FALSE TRUE TRUE TRUE FALSE TRUE ...
## $ last_ex400sc: logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ pos_UTR5 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ pos_UTR3 : num 0 0 0.685 0.319 0 ...
## $ pos_exon : num 0.201 0.445 0.69 0.799 0.217 ...
## $ dist_5sj_2k : num 126 1325 2000 2000 93 ...
## $ dist_3sj_2k : num 30 1060 2000 2000 24 277 8 98 19 192 ...
## $ len_UTR3 : num 643 514 3882 504 0 ...
## $ len_UTR5 : num 144 225 472 104 0 169 126 393 42 286 ...
## $ len_gene_ex : num 1549 16254 7625 1706 9073 ...
## $ ndist_m2h : num 200 200 200 200 200 200 200 200 200 200 ...
## $ sncRNA : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ lncRNA : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
In general, the features annotated by transcriptomicFeatures
belong to the following categories:
-
Topological Features Features constructed from the transcriptomic landmarks. The toplogical features were previously reported by the published studies to associate with the locations of RNA modifiations.
-
Sequence Features The sequence derived features that are related to epitranscriptomics, such as the sequence motifs and the predicted RNA secondary structures.
-
Functional Genomic Features Features of other functional genomic information that have proofed correlation with the mRNA modifications, such as the functional types of the transcript, evolutionary conservation of the sequence, and the RBP binding sites of RNA modification writers, readers, and erasers.
Please notice that, when only the TxDb object is provided, only the topological features will be completely annotated in the returned data.frame
table.
To generate the full feature set, please refer to the following code example for hg19.
library(BSgenome.Hsapiens.UCSC.hg19)
library(fitCons.UCSC.hg19)
library(phastCons100way.UCSC.hg19)
#Generate complete explanatory feature set for epitranscriptomic modifications
feature_hg19_complete <- transcriptomicFeatures(
mod_gr = m6A_ex_hg19,
# TxDb object for the transcript annotation.
txdb = txdb_hg19,
# BSgenome object for the genome sequence.
bsgnm = Hsapiens,
# GScores object for the Fitness Consequence scores.
fc = fitCons.UCSC.hg19,
# GScores object for the PhastCons scores.
pc = phastCons100way.UCSC.hg19,
# RNA fold predicted RNA secondary structures for hg19 exons.
struct_hybridize = Struc_hg19,
# Binding sites of m6A regulators.
genomic_annot = RBP_m6A_hg19,
# Entrez gene ids of house keeping genes.
hk_genes_id = hkGene_eid_hg19,
# Entrez gene ids of microRNA targeted genes.
miRtar_genes_id = miRtarGene_eid_hg19,
# Instances of DRACH motif for m6A modification.
motif = c("AAACA","AGACA","AAACT","AGACT","AAACC","AGACC",
"GAACA","GGACA","GAACT","GGACT","GAACC","GGACC",
"TAACA","TGACA","TAACT","TGACT","TAACC","TGACC"),
# The clustering effect on DRACH motifs.
motif_clustering = "DRACH",
# The clustering on the input modification Granges.
annot_clustering = m6A_ex_hg19
)
## __ __ __ _______________ ______
## / / / / / // / ___/_ __/ / / ___/
## / / / / / // /\__ \ / / / / / ___/
## / /_/ /_/ // /___/ // / / /___/ /___
## \___/\___//_/_____//_/ /_____/_____/ version 1.0.1
## Type 'citation("WHISTLE")' for citing this R package in publications.
## Extracting exons, transcripts, and genes ... OK
## Handeling transcript isoform ambiguities ... OK
## Extracing the ambiguity removed exonic landmarkers ... OK
## Constructing transcriptomic features ...
## Feature standardization: 'No Sandardization'.
## Isoform ambiguity method: 'Longest Mature RNA Transcript'.
## Gene ambiguity method: 'Average over Overlapping Genes'.
## The feature documentations are listed below:
## ============================================================================================================
## ID | Feature name | Variable type | Feature description
## ============================================================================================================
## 1 | UTR5 | Dummy variable | Site overlapping with 5'UTR.
## ------------------------------------------------------------------------------------------------------------
## 2 | UTR3 | Dummy variable | Site overlapping with 3'UTR.
## ------------------------------------------------------------------------------------------------------------
## 3 | stop_codon | Dummy variable | Site within 201 bp centered by the stop codon.
## ------------------------------------------------------------------------------------------------------------
## 4 | start_codon | Dummy variable | Site within 201 bp centered by the start codon.
## ------------------------------------------------------------------------------------------------------------
## 5 | TSS | Dummy variable | Site within 100bp downstream of TSS.
## ------------------------------------------------------------------------------------------------------------
## 6 | TSS_A | Dummy variable | Site within 100bp downstream of TSS that start with A.
## ------------------------------------------------------------------------------------------------------------
## 7 | exon_stop | Dummy variable | Site overlapping with exons containing stop codon.
## ------------------------------------------------------------------------------------------------------------
## 8 | alt_exon | Dummy variable | Site overlapping with alternatively spliced exons.
## ------------------------------------------------------------------------------------------------------------
## 9 | const_exon | Dummy variable | Site overlapping with constitutively spliced exons.
## ------------------------------------------------------------------------------------------------------------
## 10 | inter_exon | Dummy variable | Site overlapping with internal exons.
## ------------------------------------------------------------------------------------------------------------
## 11 | long_exon | Dummy variable | Site overlapping with long exons (exon length >= 400bp).
## ------------------------------------------------------------------------------------------------------------
## 12 | last_ex400sc | Dummy variable | Site within the 5' 400bp of the last exons containing stop codon.
## ------------------------------------------------------------------------------------------------------------
## 13 | pos_UTR5 | Real on [0,1] | Relative positioning of the site on 5'UTR.
## ------------------------------------------------------------------------------------------------------------
## 14 | pos_UTR3 | Real on [0,1] | Relative positioning of the site on 3'UTR.
## ------------------------------------------------------------------------------------------------------------
## 15 | pos_exon | Real on [0,1] | Relative positioning of the site on exon.
## ------------------------------------------------------------------------------------------------------------
## 16 | dist_5sj_2k | Positive integer | Distance from site to the upstream (5' end) splicing junction.
## ------------------------------------------------------------------------------------------------------------
## 17 | dist_3sj_2k | Positive integer | Distance from site to the downstream (3' end) splicing junction.
## ------------------------------------------------------------------------------------------------------------
## 18 | len_UTR3 | Positive integer | Length of the 3'UTR overlapped by the site.
## ------------------------------------------------------------------------------------------------------------
## 19 | len_UTR5 | Positive integer | Length of the 5'UTR overlapped by the site.
## ------------------------------------------------------------------------------------------------------------
## 20 | len_gene_ex | Positive integer | length of the gene exonic regions overlapped by the site.
## ------------------------------------------------------------------------------------------------------------
## 21 | AAACA | Dummy variable | Site on sequence motif: AAACA.
## ------------------------------------------------------------------------------------------------------------
## 22 | AGACA | Dummy variable | Site on sequence motif: AGACA.
## ------------------------------------------------------------------------------------------------------------
## 23 | AAACT | Dummy variable | Site on sequence motif: AAACT.
## ------------------------------------------------------------------------------------------------------------
## 24 | AGACT | Dummy variable | Site on sequence motif: AGACT.
## ------------------------------------------------------------------------------------------------------------
## 25 | AAACC | Dummy variable | Site on sequence motif: AAACC.
## ------------------------------------------------------------------------------------------------------------
## 26 | AGACC | Dummy variable | Site on sequence motif: AGACC.
## ------------------------------------------------------------------------------------------------------------
## 27 | GAACA | Dummy variable | Site on sequence motif: GAACA.
## ------------------------------------------------------------------------------------------------------------
## 28 | GGACA | Dummy variable | Site on sequence motif: GGACA.
## ------------------------------------------------------------------------------------------------------------
## 29 | GAACT | Dummy variable | Site on sequence motif: GAACT.
## ------------------------------------------------------------------------------------------------------------
## 30 | GGACT | Dummy variable | Site on sequence motif: GGACT.
## ------------------------------------------------------------------------------------------------------------
## 31 | GAACC | Dummy variable | Site on sequence motif: GAACC.
## ------------------------------------------------------------------------------------------------------------
## 32 | GGACC | Dummy variable | Site on sequence motif: GGACC.
## ------------------------------------------------------------------------------------------------------------
## 33 | TAACA | Dummy variable | Site on sequence motif: TAACA.
## ------------------------------------------------------------------------------------------------------------
## 34 | TGACA | Dummy variable | Site on sequence motif: TGACA.
## ------------------------------------------------------------------------------------------------------------
## 35 | TAACT | Dummy variable | Site on sequence motif: TAACT.
## ------------------------------------------------------------------------------------------------------------
## 36 | TGACT | Dummy variable | Site on sequence motif: TGACT.
## ------------------------------------------------------------------------------------------------------------
## 37 | TAACC | Dummy variable | Site on sequence motif: TAACC.
## ------------------------------------------------------------------------------------------------------------
## 38 | TGACC | Dummy variable | Site on sequence motif: TGACC.
## ------------------------------------------------------------------------------------------------------------
## 39 | ndist_m2h | Positive integer | Distance to the nearest neighbooring sites (maximum to 200bp).
## ------------------------------------------------------------------------------------------------------------
## 40 |dist_DRACH_m2h| Positive integer | Distance to the nearest DRACH motif (maximum at 200bp).
## ------------------------------------------------------------------------------------------------------------
## 41 | PC_1bp | Real on [0,1] | Phast cons (PC) score of the site.
## ------------------------------------------------------------------------------------------------------------
## 42 | PC_101bp | Real on [0,1] | Average PC scores within 101bp window centered by the site.
## ------------------------------------------------------------------------------------------------------------
## 43 | FC_1bp | Real on [0,1] | Fitness consequence (FC) score of the site.
## ------------------------------------------------------------------------------------------------------------
## 44 | FC_101bp | Real on [0,1] | Average FC scores within 101bp window centered by the site.
## ------------------------------------------------------------------------------------------------------------
## 45 | struct_sterm | Dummy variable | Site within predicted sterm regions of RNA secondary structure.
## ------------------------------------------------------------------------------------------------------------
## 46 | struct_loop | Dummy variable | Site within predicted loop regions of RNA secondary structure.
## ------------------------------------------------------------------------------------------------------------
## 47 | METTL3 | Dummy variable | Site overlapping with range based annotation: METTL3.
## ------------------------------------------------------------------------------------------------------------
## 48 | METTL14 | Dummy variable | Site overlapping with range based annotation: METTL14.
## ------------------------------------------------------------------------------------------------------------
## 49 | WTAP | Dummy variable | Site overlapping with range based annotation: WTAP.
## ------------------------------------------------------------------------------------------------------------
## 50 | RBM15 | Dummy variable | Site overlapping with range based annotation: RBM15.
## ------------------------------------------------------------------------------------------------------------
## 51 | RBM15B | Dummy variable | Site overlapping with range based annotation: RBM15B.
## ------------------------------------------------------------------------------------------------------------
## 52 | YTHDF1 | Dummy variable | Site overlapping with range based annotation: YTHDF1.
## ------------------------------------------------------------------------------------------------------------
## 53 | YTHDF2 | Dummy variable | Site overlapping with range based annotation: YTHDF2.
## ------------------------------------------------------------------------------------------------------------
## 54 | YTHDF3 | Dummy variable | Site overlapping with range based annotation: YTHDF3.
## ------------------------------------------------------------------------------------------------------------
## 55 | YTHDC1 | Dummy variable | Site overlapping with range based annotation: YTHDC1.
## ------------------------------------------------------------------------------------------------------------
## 56 | YTHDC2 | Dummy variable | Site overlapping with range based annotation: YTHDC2.
## ------------------------------------------------------------------------------------------------------------
## 57 | HNRNPA2B1 | Dummy variable | Site overlapping with range based annotation: HNRNPA2B1.
## ------------------------------------------------------------------------------------------------------------
## 58 | HNRNPC | Dummy variable | Site overlapping with range based annotation: HNRNPC.
## ------------------------------------------------------------------------------------------------------------
## 59 | ALKBH5 | Dummy variable | Site overlapping with range based annotation: ALKBH5.
## ------------------------------------------------------------------------------------------------------------
## 60 | FTO | Dummy variable | Site overlapping with range based annotation: FTO.
## ------------------------------------------------------------------------------------------------------------
## 61 | sncRNA | Dummy variable | Site overlapping with sncRNA (ncRNA <= 200bp).
## ------------------------------------------------------------------------------------------------------------
## 62 | lncRNA | Dummy variable | Site overlapping with lncRNA (ncRNA > 200bp).
## ------------------------------------------------------------------------------------------------------------
## 63 | HK_genes | Dummy variable | Site overlapping with house keeping genes.
## ------------------------------------------------------------------------------------------------------------
## 64 | miRtar_genes | Dummy variable | Site overlapping with microRNA targeted genes.
## ------------------------------------------------------------------------------------------------------------
## Transcriptomic features generation complete.
Please contact the maintainer of WHISTLE if you have encountered any problems:
ZhenWei: zhen.wei@xjtlu.edu.cn
sessionInfo()
## R version 3.6.2 (2019-12-12)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] phastCons100way.UCSC.hg19_3.7.2
## [2] fitCons.UCSC.hg19_3.7.1
## [3] GenomicScores_1.10.0
## [4] BSgenome.Hsapiens.UCSC.hg19_1.4.0
## [5] BSgenome_1.54.0
## [6] rtracklayer_1.46.0
## [7] Biostrings_2.54.0
## [8] XVector_0.26.0
## [9] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
## [10] GenomicFeatures_1.38.1
## [11] AnnotationDbi_1.48.0
## [12] Biobase_2.46.0
## [13] GenomicRanges_1.38.0
## [14] GenomeInfoDb_1.22.0
## [15] IRanges_2.20.2
## [16] S4Vectors_0.24.4
## [17] BiocGenerics_0.32.0
## [18] WHISTLE_0.99.1
##
## loaded via a namespace (and not attached):
## [1] httr_1.4.1 bit64_0.9-7
## [3] AnnotationHub_2.18.0 shiny_1.4.0
## [5] assertthat_0.2.1 interactiveDisplayBase_1.24.0
## [7] askpass_1.1 BiocManager_1.30.10
## [9] BiocFileCache_1.10.2 blob_1.2.1
## [11] GenomeInfoDbData_1.2.2 Rsamtools_2.2.2
## [13] yaml_2.2.1 progress_1.2.2
## [15] BiocVersion_3.10.1 pillar_1.4.3
## [17] RSQLite_2.2.0 lattice_0.20-38
## [19] glue_1.3.1 digest_0.6.24
## [21] promises_1.1.0 htmltools_0.4.0
## [23] httpuv_1.5.2 Matrix_1.2-18
## [25] XML_3.99-0.3 pkgconfig_2.0.3
## [27] biomaRt_2.42.0 zlibbioc_1.32.0
## [29] xtable_1.8-4 purrr_0.3.3
## [31] later_1.0.0 BiocParallel_1.20.1
## [33] tibble_3.0.1 openssl_1.4.1
## [35] ellipsis_0.3.0 SummarizedExperiment_1.16.1
## [37] magrittr_1.5 crayon_1.3.4
## [39] mime_0.9 memoise_1.1.0
## [41] evaluate_0.14 tools_3.6.2
## [43] prettyunits_1.1.1 hms_0.5.3
## [45] lifecycle_0.2.0 matrixStats_0.55.0
## [47] stringr_1.4.0 DelayedArray_0.12.3
## [49] compiler_3.6.2 rlang_0.4.5
## [51] grid_3.6.2 RCurl_1.98-1.1
## [53] rappdirs_0.3.1 bitops_1.0-6
## [55] rmarkdown_2.1 DBI_1.1.0
## [57] curl_4.3 R6_2.4.1
## [59] GenomicAlignments_1.22.1 knitr_1.28
## [61] dplyr_0.8.4 fastmap_1.0.1
## [63] bit_1.1-15.2 stringi_1.4.5
## [65] Rcpp_1.0.3 vctrs_0.2.4
## [67] dbplyr_1.4.2 tidyselect_1.0.0
## [69] xfun_0.12