This workflow uses droplet-based scRNA-seq data to retain highly expressed lncRNAs that are robustly detected by both Cell Ranger and Kallisto, and to uncover functionally relevant lncRNAs that are detected only by Kallisto.
The ELATUS package requires only a standard computer with enough RAM to support the in-memory operations.
ELATUS is supported on macOS and Linux. The package has been tested on the following systems:
- macOS: Catalina (10.15.7)
- Linux: Ubuntu 20.04.4 LTS, CentOS 7.5.1804
ELATUS has been tested in R versions 4.1.0 and 4.1.2.
Seurat and several Bioconductor packages are required but are not installed automatically, so they must be installed manually:
install.packages("Seurat")
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("scran")
BiocManager::install("BUSpaRse")
BiocManager::install("DropletUtils")
BiocManager::install("scDblFinder")
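Before moving on, it can be useful to confirm that the manual installs above succeeded. The following is a minimal sketch using only base R; the variable names are illustrative and not part of ELATUS.

```r
# Packages the README asks you to install by hand.
required <- c("Seurat", "scran", "BUSpaRse", "DropletUtils", "scDblFinder")

# system.file() returns "" for a package that is not installed,
# so this check does not load any of the packages.
is_installed <- vapply(
  required,
  function(p) nzchar(system.file(package = p)),
  logical(1)
)

missing_pkgs <- required[!is_installed]

if (length(missing_pkgs) == 0) {
  message("All manually installed dependencies were found.")
} else {
  message("Still missing: ", paste(missing_pkgs, collapse = ", "))
}
```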
ELATUS is installed with devtools. Installing all the dependencies takes around 15 minutes on a personal laptop. However, the most recent versions of the Seurat and Matrix packages can be incompatible with each other and cause compilation errors. This is easily solved by removing Matrix and installing an earlier release:
remove.packages("Matrix")
install.packages("https://cran.r-project.org/src/contrib/Archive/Matrix/Matrix_1.5-1.tar.gz", repos = NULL, type = "source")
Now you can install ELATUS with:
install.packages("devtools")
devtools::install_github("kikegoni/ELATUS")
Once all the dependencies are installed, installing ELATUS itself takes 2-3 minutes on a desktop computer. If you run into a problem that you believe is caused by software versioning, please open an Issue.
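When reporting a versioning problem, it usually helps to include the versions of R and of the packages named in this README. The sketch below collects them with base R only. The package list and the Matrix comparison are illustrative: 1.5-1 is simply the release pinned above, and whether a newer Matrix actually causes errors depends on your Seurat version.

```r
# Packages named in this README whose versions are most likely to matter.
pkgs <- c("Matrix", "Seurat", "scran", "BUSpaRse",
          "DropletUtils", "scDblFinder", "ELATUS")

# Installed version as a string, or NA when the package is absent.
pkg_versions <- vapply(pkgs, function(p) {
  if (nzchar(system.file(package = p))) {
    as.character(packageVersion(p))
  } else {
    NA_character_
  }
}, character(1))

version_report <- c(R = as.character(getRversion()), pkg_versions)
print(version_report)

# Flag a Matrix release newer than the one pinned in this README.
matrix_version <- pkg_versions[["Matrix"]]
if (!is.na(matrix_version) && package_version(matrix_version) > "1.5-1") {
  message("Matrix ", matrix_version, " is newer than the pinned 1.5-1.")
}
```

Pasting the printed vector, together with the output of sessionInfo(), into the Issue gives the maintainers the same information shown at the end of this README.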
ELATUS takes the raw, unfiltered count matrices from Cell Ranger and Kallisto as input. In particular, Cell Ranger count generates a barcodes.tsv.gz file, a features.tsv.gz file and a matrix.mtx.gz file, while Kallisto-Bustools generates a barcodes.txt file, a genes.txt file and an .mtx file. Guidelines for preprocessing FASTQ files with Cell Ranger and Kallisto to generate these input files are included here. ELATUS can also be applied to filtered count data using the ELATUS_filtered function.
ELATUS uses public CRISPRi data from 7 human cell lines to investigate exclusive lncRNAs whose function has been validated in the literature. This collection will be expanded in the future.
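To make the Kallisto-Bustools input layout described above concrete, here is a small self-contained sketch that writes a toy matrix triplet to a temporary directory and reads it back with the Matrix package. It is not ELATUS's own import code (the package provides import_kallisto_sc for that). The helper name read_kallisto_sketch, the `<name>.mtx` / `<name>.barcodes.txt` / `<name>.genes.txt` naming and the barcodes-as-rows orientation are assumptions made for illustration.

```r
library(Matrix)

# Hypothetical helper: read <name>.mtx, <name>.barcodes.txt and
# <name>.genes.txt from `dir` into a genes x cells sparse matrix.
read_kallisto_sketch <- function(dir, name) {
  m        <- readMM(file.path(dir, paste0(name, ".mtx")))
  barcodes <- readLines(file.path(dir, paste0(name, ".barcodes.txt")))
  genes    <- readLines(file.path(dir, paste0(name, ".genes.txt")))
  # Assumption: the .mtx file stores barcodes as rows, so transpose it.
  m <- as(t(m), "CsparseMatrix")
  dimnames(m) <- list(genes, barcodes)
  m
}

# Toy input: 3 barcodes (rows) x 2 genes (columns).
demo_dir <- tempfile("kallisto_demo")
dir.create(demo_dir)
toy <- sparseMatrix(i = c(1, 2, 3), j = c(1, 2, 1), x = c(4, 7, 1),
                    dims = c(3, 2))
writeMM(toy, file.path(demo_dir, "cells_genes.mtx"))
writeLines(c("AAAC", "AAAG", "AAAT"),
           file.path(demo_dir, "cells_genes.barcodes.txt"))
writeLines(c("Gm6209", "Snhg6"),
           file.path(demo_dir, "cells_genes.genes.txt"))

kallisto_counts <- read_kallisto_sketch(demo_dir, "cells_genes")
print(kallisto_counts)
```

The `name` argument plays the same role as kallisto_name in the ELATUS call: it is the file prefix shared by the three Kallisto-Bustools outputs.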
This example uses the mouse brain dataset of 1,000 cells that was analyzed in the paper ("Mm_brain_1k"). ELATUS takes as input the raw scRNA-seq count matrices generated by Kallisto and Cell Ranger. It then removes empty droplets and filters out low-quality cells. After a normalization step, it clusters the cells and identifies highly expressed lncRNAs. It retains not only highly expressed lncRNAs robustly detected by both Cell Ranger and Kallisto, but also lncRNAs that are detected only by Kallisto and whose functionality has been independently validated or that exhibit characteristics of functional lncRNAs. A vignette with a step-by-step walkthrough of the ELATUS processing is included here.
library("ELATUS")
functional_lncRNAs <- ELATUS(
  kallisto_path = system.file("extdata", "kallisto_example_raw_matrix", package = "ELATUS"),
  kallisto_name = "cells_genes_NO_multimapping",
  cellRanger_path = system.file("extdata", "cellRanger_example_raw_matrix", package = "ELATUS"),
  organism = "Mouse",
  lower_emptydrops = 1000,
  EmptyDrops_FDR_thres = 0.01,
  cells_mito_threshold = 15,
  cells_max_threshold = 30000,
  cells_min_genes_detected_threshold = 500,
  threshold_minumun_gene_counts = 250,
  threshold_cells_detected = 25,
  dimred_clustering = "PCA",
  k_neighbors = 5,
  ratio_threshold = 40,
  CR_threshold = 10,
  SI_threshold = 0.15
)
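The cell-level thresholds in the call above can be illustrated with a toy example. The sketch below is one plausible reading of cells_mito_threshold, cells_max_threshold and cells_min_genes_detected_threshold (maximum mitochondrial percentage, maximum total counts and minimum number of detected genes per cell). It is not ELATUS's implementation: the helper filter_cells_sketch, the toy matrix and the toy cut-offs are all hypothetical.

```r
# Hypothetical helper: keep cells (columns) that pass three per-cell cut-offs.
filter_cells_sketch <- function(counts, mito_genes,
                                mito_max_pct, max_counts, min_genes) {
  total_counts   <- colSums(counts)
  genes_detected <- colSums(counts > 0)
  mito_counts    <- colSums(counts[rownames(counts) %in% mito_genes, ,
                                   drop = FALSE])
  mito_pct       <- 100 * mito_counts / pmax(total_counts, 1)

  keep <- mito_pct <= mito_max_pct &
    total_counts <= max_counts &
    genes_detected >= min_genes
  counts[, keep, drop = FALSE]
}

# Toy genes x cells matrix; matrix() fills column by column, one cell each.
toy_counts <- matrix(
  c( 5,  0,  3, 1,   # cell_a: passes every cut-off
     1,  0,  0, 9,   # cell_b: 90% mitochondrial counts
    50, 40, 30, 2,   # cell_c: too many total counts
     0,  0,  4, 0),  # cell_d: only one gene detected
  nrow = 4,
  dimnames = list(c("Gm6209", "Snhg6", "Actb", "mt-Co1"),
                  c("cell_a", "cell_b", "cell_c", "cell_d"))
)

filtered <- filter_cells_sketch(
  toy_counts,
  mito_genes   = "mt-Co1",
  mito_max_pct = 15,   # same value as cells_mito_threshold above
  max_counts   = 100,  # toy stand-in for cells_max_threshold
  min_genes    = 2     # toy stand-in for cells_min_genes_detected_threshold
)
print(colnames(filtered))
```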
For documentation of the individual functions, see the help pages built into the package. They can be accessed as follows:
require(ELATUS)
?ELATUS
?Filtering
?SI
?biologically_relevant_lncRNAs
?crispr_info
?emptydrops_filt
?get_candidates
?import_CellRanger_sc
?import_kallisto_sc
?qc_metrics
?remove_doublets
?top_genes
This work provides a comprehensive collection of functional lncRNAs in scRNA-seq. The set includes both highly expressed lncRNAs robustly detected by Cell Ranger and Kallisto and exclusive lncRNAs, detected only by Kallisto, that are highly expressed and specific. For this example, ELATUS identifies 109 lncRNAs that exhibit characteristics of functional lncRNAs (running time < 10 minutes on a desktop computer).
> head(functional_lncRNAs, 5)
        candidates       ratio                 gene kallisto_total_expression cellRanger_total_expression         SI cell_type_SI crispr_intersection         category
Gm6209      Gm6209 219.7538052 ENSMUSG00000102715.2                  218.7538                     0.00000 0.23788233            7               FALSE Exclusive_lncRNA
Gm15637    Gm15637  96.4384075 ENSMUSG00000087386.2                  289.7927                     2.01532 0.20469014           15               FALSE Exclusive_lncRNA
Gm19938    Gm19938   0.9987093 ENSMUSG00000102331.2                  623.9760                   624.78372 0.13784393           11                  NA    Common_lncRNA
Snhg6        Snhg6   0.9946959 ENSMUSG00000098234.8                 1168.3728                  1174.60826 0.04570088            7                  NA    Common_lncRNA
Gm16152    Gm16152   1.4073071 ENSMUSG00000087131.8                  378.7060                   268.81030 0.14565333            3                  NA    Common_lncRNA
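The ratio column in the output above is numerically consistent with (kallisto_total_expression + 1) / (cellRanger_total_expression + 1). This formula is inferred from the five displayed rows and is not confirmed by the package documentation. The sketch below recomputes it from the printed values and then applies one assumed reading of ratio_threshold = 40, CR_threshold = 10 and SI_threshold = 0.15 to see which rows would pass; the column names reported_ratio, recomputed_ratio and passes_exclusive_thresholds are illustrative only.

```r
# Values copied from the head() output above.
example <- data.frame(
  candidates = c("Gm6209", "Gm15637", "Gm19938", "Snhg6", "Gm16152"),
  kallisto_total_expression   = c(218.7538, 289.7927, 623.9760,
                                  1168.3728, 378.7060),
  cellRanger_total_expression = c(0.00000, 2.01532, 624.78372,
                                  1174.60826, 268.81030),
  SI = c(0.23788233, 0.20469014, 0.13784393, 0.04570088, 0.14565333),
  reported_ratio = c(219.7538052, 96.4384075, 0.9987093,
                     0.9946959, 1.4073071)
)

# Inferred formula: a pseudocount of 1 on both sides avoids division by zero.
example$recomputed_ratio <- (example$kallisto_total_expression + 1) /
  (example$cellRanger_total_expression + 1)

# Assumed rule: high Kallisto/Cell Ranger ratio, low Cell Ranger
# expression and a high SI value.
example$passes_exclusive_thresholds <-
  example$recomputed_ratio > 40 &
  example$cellRanger_total_expression < 10 &
  example$SI > 0.15

print(example[, c("candidates", "reported_ratio", "recomputed_ratio",
                  "passes_exclusive_thresholds")])
```

Under these assumptions, the two rows that pass are the ones labelled Exclusive_lncRNA in the output, which is consistent with the thresholds but does not prove that the package applies exactly this rule.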
> sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS/LAPACK: /beegfs/easybuild/CentOS/7.5.1804/Skylake/software/FlexiBLAS/3.0.4-GCC-11.2.0/lib64/libflexiblas.so.3.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ELATUS_1.0.0
loaded via a namespace (and not attached):
[1] utf8_1.2.2 R.utils_2.12.0
[3] reticulate_1.26 tidyselect_1.2.0
[5] RSQLite_2.2.18 AnnotationDbi_1.56.2
[7] htmlwidgets_1.5.4 grid_4.1.2
[9] BiocParallel_1.28.3 Rtsne_0.16
[11] scDblFinder_1.8.0 DropletUtils_1.14.2
[13] ScaledMatrix_1.2.0 munsell_0.5.0
[15] codetools_0.2-18 ica_1.0-3
[17] statmod_1.4.37 scran_1.22.1
[19] xgboost_1.6.0.1 future_1.28.0
[21] miniUI_0.1.1.1 spatstat.random_3.1-3
[23] colorspace_2.0-3 progressr_0.11.0
[25] Biobase_2.54.0 filelock_1.0.2
[27] Seurat_4.0.1 stats4_4.1.2
[29] SingleCellExperiment_1.16.0 ROCR_1.0-11
[31] tensor_1.5 listenv_0.8.0
[33] MatrixGenerics_1.6.0 BUSpaRse_1.8.0
[35] GenomeInfoDbData_1.2.7 polyclip_1.10-0
[37] bit64_4.0.5 rhdf5_2.38.1
[39] parallelly_1.32.1 vctrs_0.6.5
[41] generics_0.1.3 BiocFileCache_2.2.1
[43] R6_2.5.1 GenomeInfoDb_1.30.1
[45] ggbeeswarm_0.6.0 rsvd_1.0.5
[47] locfit_1.5-9.6 AnnotationFilter_1.18.0
[49] bitops_1.0-7 rhdf5filters_1.6.0
[51] spatstat.utils_3.0-1 cachem_1.0.6
[53] DelayedArray_0.20.0 assertthat_0.2.1
[55] promises_1.2.0.1 BiocIO_1.4.0
[57] scales_1.2.1 beeswarm_0.4.0
[59] gtable_0.3.1 beachmat_2.10.0
[61] globals_0.16.1 goftest_1.2-3
[63] ensembldb_2.18.4 rlang_1.1.1
[65] zeallot_0.1.0 splines_4.1.2
[67] rtracklayer_1.54.0 lazyeval_0.2.2
[69] spatstat.geom_3.0-6 plyranges_1.14.0
[71] yaml_2.3.5 reshape2_1.4.4
[73] abind_1.4-5 GenomicFeatures_1.46.5
[75] httpuv_1.6.6 tools_4.1.2
[77] ggplot2_3.4.2 ellipsis_0.3.2
[79] spatstat.core_2.4-4 RColorBrewer_1.1-3
[81] BiocGenerics_0.40.0 ggridges_0.5.4
[83] Rcpp_1.0.9 plyr_1.8.7
[85] sparseMatrixStats_1.6.0 progress_1.2.2
[87] zlibbioc_1.40.0 purrr_1.0.2
[89] RCurl_1.98-1.9 prettyunits_1.1.1
[91] rpart_4.1.16 deldir_1.0-6
[93] viridis_0.6.2 pbapply_1.5-0
[95] cowplot_1.1.1 S4Vectors_0.32.4
[97] zoo_1.8-11 SeuratObject_4.1.3
[99] SummarizedExperiment_1.24.0 ggrepel_0.9.1
[101] cluster_2.1.4 magrittr_2.0.3
[103] data.table_1.14.2 scattermore_0.8
[105] lmtest_0.9-40 RANN_2.6.1
[107] ProtGenerics_1.26.0 fitdistrplus_1.1-8
[109] matrixStats_0.62.0 hms_1.1.2
[111] patchwork_1.1.2 mime_0.12
[113] xtable_1.8-4 XML_3.99-0.11
[115] IRanges_2.28.0 gridExtra_2.3
[117] scater_1.22.0 compiler_4.1.2
[119] biomaRt_2.50.3 tibble_3.1.8
[121] KernSmooth_2.23-20 crayon_1.5.2
[123] R.oo_1.25.0 htmltools_0.5.3
[125] mgcv_1.8-40 later_1.3.0
[127] tidyr_1.2.1 DBI_1.1.3
[129] dbplyr_2.2.1 MASS_7.3-58.1
[131] rappdirs_0.3.3 Matrix_1.5-1
[133] cli_3.6.2 R.methodsS3_1.8.2
[135] metapod_1.2.0 parallel_4.1.2
[137] igraph_1.3.5 GenomicRanges_1.46.1
[139] pkgconfig_2.0.3 GenomicAlignments_1.30.0
[141] sp_1.5-0 plotly_4.10.0
[143] scuttle_1.4.0 spatstat.sparse_3.0-0
[145] xml2_1.3.3 vipor_0.4.5
[147] dqrng_0.3.0 XVector_0.34.0
[149] stringr_1.4.1 digest_0.6.29
[151] sctransform_0.3.5 RcppAnnoy_0.0.20
[153] spatstat.data_3.0-0 Biostrings_2.62.0
[155] leiden_0.4.3 edgeR_3.36.0
[157] uwot_0.1.14 DelayedMatrixStats_1.16.0
[159] restfulr_0.0.15 curl_4.3.3
[161] shiny_1.7.2 Rsamtools_2.10.0
[163] rjson_0.2.21 lifecycle_1.0.3
[165] nlme_3.1-160 jsonlite_1.8.2
[167] Rhdf5lib_1.16.0 BiocNeighbors_1.12.0
[169] limma_3.50.3 viridisLite_0.4.1
[171] BSgenome_1.62.0 fansi_1.0.3
[173] pillar_1.8.1 lattice_0.20-45
[175] KEGGREST_1.34.0 fastmap_1.1.0
[177] httr_1.4.4 survival_3.4-0
[179] glue_1.6.2 png_0.1-7
[181] bluster_1.4.0 bit_4.0.4
[183] stringi_1.7.8 HDF5Array_1.22.1
[185] blob_1.2.3 BiocSingular_1.10.0
[187] memoise_2.0.1 dplyr_1.0.10
[189] irlba_2.3.5.1 future.apply_1.9.1