This repository contains all the data, functions, scripts to run simulations and analysis, and scripts to generate plots for the paper "Exponential-family embedding with application to cell developmental trajectories for single-cell RNA-seq data".
This code was developed and tested primarily on R 4.0.3 on a MacBook (macOS 11.1 Big Sur) equipped with an i5 processor.
This package can be installed through devtools in R.
library("devtools")
devtools::install_github("linnykos/esvd", subdir = "eSVD")
The package itself depends on several packages: MASS, foreach, doMC, princurve, igraph, clplite, softImpute, RSpectra, plot3D, np, org.Mm.eg.db, and DBI. See the last section of this README for where (i.e., CRAN, Bioconductor, or GitHub) to download each of these packages.
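As a quick check before installing, the following sketch reports which of the dependencies listed above are not yet available in the current R library. The helper `missing_packages` is a hypothetical name introduced here for illustration; it is not part of the eSVD package.

```r
# Dependencies of the eSVD package, as listed in this README.
esvd_deps <- c("MASS", "foreach", "doMC", "princurve", "igraph", "clplite",
               "softImpute", "RSpectra", "plot3D", "np", "org.Mm.eg.db", "DBI")

# Hypothetical helper (not part of eSVD): return the packages in `pkgs`
# that cannot be found in any library on the current library path.
missing_packages <- function(pkgs) {
  found <- vapply(pkgs, function(p) nzchar(system.file(package = p)),
                  logical(1))
  unname(pkgs[!found])
}

# Prints the names of the dependencies that still need to be installed.
print(missing_packages(esvd_deps))
```

An empty result means every listed dependency was found; it does not check package versions.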
Warning: On Windows, to install the doMC package, use the following code in R.
install.packages("doMC", repos="http://R-Forge.R-project.org")
The above installation is only for the R package. To reproduce the entire simulation and analysis, you will need to pull/fork this entire repository. You will need to install the Git Large File Storage system to do this (see below).
The data analysis and simulations themselves require the following additional packages: PMA, descend, vioplot, Rtsne, NMF, dimRed, destiny, umap, pCMF, SummarizedExperiment, zinbwave, fastICA, and Seurat.
The dataset used in this article is also included in the repository. This is the single-cell dataset collected by Marques et al. (2016). While the original dataset is publicly available on GEO (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE75330), we provide a locally preprocessed dataset, which was created to be amenable to our analysis in R. This dataset is a 21 MB .RData file and is synced onto GitHub using the Git Large File Storage system (https://git-lfs.github.com/). Please install this system before proceeding.
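If the repository is cloned without Git Large File Storage installed, the .RData files are typically replaced by small text pointer files, and `load()` will fail on them. The sketch below shows one way to detect this from R, relying on the fact that Git LFS pointer files begin with the line `version https://git-lfs.github.com/spec/v1`. The helper `is_lfs_pointer` is a hypothetical name, and the demonstration uses temporary files rather than the repository's actual data.

```r
# Hypothetical helper (not part of this repository): TRUE if `path` looks
# like a Git LFS pointer file rather than the actual data.
is_lfs_pointer <- function(path) {
  prefix <- charToRaw("version https://git-lfs.github.com/spec/v1")
  head_bytes <- readBin(path, what = "raw", n = length(prefix))
  identical(head_bytes, prefix)
}

# Self-contained demonstration on temporary files.
pointer_file <- tempfile()
writeLines(c("version https://git-lfs.github.com/spec/v1",
             "oid sha256:0000",  # placeholder, not a real object id
             "size 12345"),
           pointer_file)

real_file <- tempfile(fileext = ".RData")
x <- 1:10
save(x, file = real_file)

print(is_lfs_pointer(pointer_file))  # TRUE
print(is_lfs_pointer(real_file))     # FALSE
```

If a data file turns out to be a pointer, running `git lfs pull` inside the repository after installing Git LFS should fetch the real content.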
In the appendix, we investigate the single-cell dataset collected by Zeisel et al. (2015). Similarly, while the original dataset is publicly available on GEO (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE60361), we provide a locally preprocessed dataset, which was created to be amenable to our analysis in R. This dataset is a 13 MB .RData file and is synced onto GitHub using the Git Large File Storage system.
All the code below was run on a server with 15 cores. For a single fit of eSVD, the initialization function completes within a minute, and the fit_factorization function (on 15 cores) completes in about 30 minutes for our dataset of 5069 cells and 983 genes.
Note, however, that fitting multiple embeddings (in order to do parameter selection using the tuning_select_scalar function) is the most time-expensive step. This step happens in step3_scalar_tuning, where 16 different combinations of k and the curved Gaussian's nuisance parameter scalar are tried, each with cv_trials=3 subsamplings of the missing values.
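To make the cost of the tuning step concrete, the sketch below counts the embedding fits implied by 16 parameter combinations with cv_trials=3. The candidate values of k and scalar are hypothetical placeholders, and arranging the 16 combinations as a 4-by-4 grid is an assumption; the values actually used are defined in step3_scalar_tuning.

```r
# Hypothetical candidate values (assumption): the real ones are set in
# step3_scalar_tuning and may not form a 4-by-4 grid.
k_values <- c(3, 5, 10, 20)
scalar_values <- c(0.5, 1, 2, 4)
cv_trials <- 3

# One row per (k, scalar) combination.
param_grid <- expand.grid(k = k_values, scalar = scalar_values)

# Each combination is fitted once per subsampling trial.
n_fits <- nrow(param_grid) * cv_trials

print(nrow(param_grid))  # 16
print(n_fits)            # 48
```

Under these assumptions the tuning step amounts to 48 separate fits, which is why it dominates the running time compared with a single fit.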
To reproduce the Marques (i.e., oligodendrocyte) analysis (Section 2, Section 7, and Appendix H of our paper, including Figures 1-3, 6-8, S.1, S.8-S.17), navigate to the main folder. From this location, run the R scripts in the command window. All the results and figures in these sections are reproduced by running main.R, which calls 16 different R scripts in succession, each producing .RData files that the next uses as input. The figures are produced in the last 8 scripts, step8_figures_zz_data.R through step8_figures_zz_additional_analyses.R.
Specifically:
- step8_figures_zz_data.R produces Figures 3 and S.1.
- step8_figures_zz_training_testing.R produces Figures 2, 7 and S.13.
- step8_figures_zz_2D_densities.R produces Figures 1 and S.11.
- step8_figures_zz_2D_embedding.R produces Figures S.8 and S.10.
- step8_figures_zz_3D_embedding.R produces Figures 6, 8, S.9 and S.12.
- step8_figures_zz_cascade.R produces Figure S.14.
- step8_figures_zz_additional_analyses.R produces Figures S.15-S.17.
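The chaining described above, where each script writes an .RData file that the next one loads, can be sketched as follows. This is not the repository's main.R: `run_in_succession` is a hypothetical helper, the two scripts are temporary stand-ins, and evaluating each script in a fresh environment is an assumption made here to keep the steps isolated.

```r
# Hypothetical helper (not the repository's main.R): source each script in
# order, stopping with an error as soon as one is missing.
run_in_succession <- function(scripts) {
  for (s in scripts) {
    if (!file.exists(s)) stop("Missing script: ", s)
    message("Running ", s)
    source(s, local = new.env())
  }
  invisible(scripts)
}

# Self-contained demonstration with two temporary stand-in scripts.
work_dir <- tempfile("pipeline_")
dir.create(work_dir)
step1_out <- file.path(work_dir, "step1.RData")
step2_out <- file.path(work_dir, "step2.RData")
step1 <- file.path(work_dir, "step1.R")
step2 <- file.path(work_dir, "step2.R")

# Step 1 saves an object; step 2 loads it and saves a derived result.
# deparse() quotes and escapes the paths so they are valid R strings.
writeLines(sprintf("x <- 1:5; save(x, file = %s)", deparse(step1_out)), step1)
writeLines(sprintf("load(%s); total <- sum(x); save(total, file = %s)",
                   deparse(step1_out), deparse(step2_out)), step2)

run_in_succession(c(step1, step2))

result <- new.env()
load(step2_out, envir = result)
print(result$total)  # 15
```

Because every step persists its output to disk, a long pipeline can be resumed from an intermediate script as long as the earlier .RData files are still present.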
To reproduce the Zeisel analysis (Appendix H.5), navigate to the main_zeisel folder. From this location, run the R scripts in the command window. All the results and figures in this section are reproduced by running main_zeisel.R, which calls 5 different R scripts in succession, each producing .RData files that the next uses as input. Figure S.18 and Table S.1 are produced in the last script, step6_zeisel_metrics.R, and Figure S.19 is produced in the script step4_zeisel_analysis.R.
To reproduce the simulations (Section 6 and Appendix D of our paper), navigate to the simulation folder. Below, we describe which scripts are associated with which figures. Almost all these scripts depend on factorization_generator.R and factorization_methods.R. For each figure, the simulations needed to reproduce it should finish within half a day.
- Figure 4: Run illustration_example.R.
- Figure 5: Run factorization_suite_negbinom_esvd.R and factorization_suite_negbinom_rest.R, followed by factorization_suite_negbinom_postprocess.R.
- Figure S.2: Run consistency_simulation.R followed by consistency_simulation_plot.R.
- Figure S.3: Run factorization_suite_poisson_esvd.R and factorization_suite_poisson_rest.R, followed by factorization_suite_poisson_postprocess.R.
- Figure S.4: Run factorization_suite_curved_gaussian_esvd.R and factorization_suite_curved_gaussian_rest.R, followed by factorization_suite_curved_gaussian_postprocess.R.
- Figure S.5: Run factorization_suite_zinbwave_esvd.R and factorization_suite_zinbwave_rest.R, followed by factorization_suite_zinbwave_postprocess.R.
- Figure S.6: Run factorization_suite_tuning_zinbwave followed by factorization_suite_tuning_zinbwave_postprocess.R.
- Figure S.7: Run factorization_suite_pcmf_esvd.R and factorization_suite_pcmf_rest.R, followed by factorization_suite_pcmf_postprocess.R.
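Five of the figures in the list above (5, S.3, S.4, S.5 and S.7) follow the same naming pattern: an _esvd script and a _rest script, followed by a _postprocess script. The sketch below builds the script names for each of these suites from that pattern. The helper `suite_scripts` is hypothetical and only assembles file names; it does not run anything.

```r
# Hypothetical helper: build the three script names for a simulation suite,
# following the naming pattern in the list above.
suite_scripts <- function(suite) {
  paste0("factorization_suite_", suite,
         c("_esvd.R", "_rest.R", "_postprocess.R"))
}

# Suites and the figures they produce, as listed above.
suites <- c("Figure 5"   = "negbinom",
            "Figure S.3" = "poisson",
            "Figure S.4" = "curved_gaussian",
            "Figure S.5" = "zinbwave",
            "Figure S.7" = "pcmf")

# One character vector of script names per figure; the postprocess script
# comes last in each.
run_order <- lapply(suites, suite_scripts)
print(run_order[["Figure 5"]])
```

Figures 4, S.2 and S.6 use differently named scripts and are not covered by this pattern.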
The following shows the suggested package versions that the developer (GitHub username: linnykos) used when developing the eSVD package.
> session_info()
─ Session info ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
setting value
version R version 4.0.3 (2020-10-10)
os macOS Big Sur 10.16
system x86_64, darwin17.0
ui RStudio
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz America/New_York
date 2021-01-31
─ Packages ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
! package * version date lib source
abind 1.4-5 2016-07-21 [1] CRAN (R 4.0.2)
annotate 1.68.0 2020-10-27 [1] Bioconductor
AnnotationDbi 1.52.0 2020-10-27 [1] Bioconductor
askpass 1.1 2019-01-13 [1] CRAN (R 4.0.2)
assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.0.2)
Biobase * 2.50.0 2020-10-27 [1] Bioconductor
BiocGenerics * 0.36.0 2020-10-27 [1] Bioconductor
BiocParallel 1.24.1 2020-11-06 [1] Bioconductor
bit 4.0.4 2020-08-04 [1] CRAN (R 4.0.2)
bit64 4.0.5 2020-08-30 [1] CRAN (R 4.0.2)
bitops 1.0-6 2013-08-17 [1] CRAN (R 4.0.2)
blob 1.2.1 2020-01-20 [1] CRAN (R 4.0.2)
boot 1.3-25 2020-04-26 [1] CRAN (R 4.0.3)
callr 3.5.1 2020-10-13 [1] CRAN (R 4.0.2)
car 3.0-10 2020-09-29 [1] CRAN (R 4.0.2)
carData 3.0-4 2020-05-22 [1] CRAN (R 4.0.2)
cellranger 1.1.0 2016-07-27 [1] CRAN (R 4.0.2)
class 7.3-17 2020-04-26 [1] CRAN (R 4.0.3)
cli 2.2.0 2020-11-20 [1] CRAN (R 4.0.2)
clplite 0.1.0 2021-01-06 [1] Github (yixuan/clplite@cffcf11)
cluster * 2.1.0 2019-06-19 [1] CRAN (R 4.0.3)
codetools 0.2-18 2020-11-04 [1] CRAN (R 4.0.2)
colorspace 2.0-0 2020-11-11 [1] CRAN (R 4.0.2)
conquer 1.0.2 2020-08-27 [1] CRAN (R 4.0.2)
cowplot 1.1.1 2020-12-30 [1] CRAN (R 4.0.2)
crayon 1.3.4 2017-09-16 [1] CRAN (R 4.0.2)
cubature 2.0.4.1 2020-07-06 [1] CRAN (R 4.0.2)
curl 4.3 2019-12-02 [1] CRAN (R 4.0.1)
CVST * 0.2-2 2018-05-26 [1] CRAN (R 4.0.2)
data.table 1.13.6 2020-12-30 [1] CRAN (R 4.0.2)
DBI 1.1.0 2019-12-15 [1] CRAN (R 4.0.2)
DelayedArray 0.16.0 2020-10-27 [1] Bioconductor
deldir 0.2-3 2020-11-09 [1] CRAN (R 4.0.2)
DEoptimR 1.0-8 2016-11-19 [1] CRAN (R 4.0.2)
desc 1.2.0 2018-05-01 [1] CRAN (R 4.0.2)
descend * 1.0.0 2021-01-31 [1] Github (jingshuw/descend@b903fc8)
destiny * 3.4.0 2020-10-27 [1] Bioconductor
devtools * 2.3.2 2020-09-18 [1] CRAN (R 4.0.2)
digest 0.6.27 2020-10-24 [1] CRAN (R 4.0.2)
dimRed * 0.2.3 2019-05-08 [1] CRAN (R 4.0.2)
doMC 1.3.7 2020-10-14 [1] CRAN (R 4.0.2)
doParallel 1.0.16 2020-10-16 [1] CRAN (R 4.0.2)
dotCall64 1.0-0 2018-07-30 [1] CRAN (R 4.0.2)
dplyr 1.0.2 2020-08-18 [1] CRAN (R 4.0.2)
DRR * 0.0.4 2020-02-12 [1] CRAN (R 4.0.2)
e1071 1.7-4 2020-10-14 [1] CRAN (R 4.0.2)
edgeR 3.32.0 2020-10-27 [1] Bioconductor
ellipsis 0.3.1 2020-05-15 [1] CRAN (R 4.0.2)
R eSVD * 1.0.0.3 <NA> [?] <NA>
fansi 0.4.1 2020-01-08 [1] CRAN (R 4.0.2)
fastICA * 1.2-2 2019-07-08 [1] CRAN (R 4.0.2)
fastmap 1.0.1 2019-10-08 [1] CRAN (R 4.0.2)
fields 11.6 2020-10-09 [1] CRAN (R 4.0.2)
fitdistrplus 1.1-3 2020-12-05 [1] CRAN (R 4.0.2)
forcats 0.5.0 2020-03-01 [1] CRAN (R 4.0.2)
foreach 1.5.1 2020-10-15 [1] CRAN (R 4.0.2)
foreign 0.8-81 2020-12-22 [1] CRAN (R 4.0.2)
fs 1.5.0 2020-07-31 [1] CRAN (R 4.0.2)
future 1.21.0 2020-12-10 [1] CRAN (R 4.0.2)
future.apply 1.7.0 2021-01-04 [1] CRAN (R 4.0.2)
genefilter 1.72.0 2020-10-27 [1] Bioconductor
generics 0.1.0 2020-10-31 [1] CRAN (R 4.0.2)
GenomeInfoDb * 1.26.2 2020-12-08 [1] Bioconductor
GenomeInfoDbData 1.2.4 2020-12-14 [1] Bioconductor
GenomicRanges * 1.42.0 2020-10-27 [1] Bioconductor
ggplot.multistats 1.0.0 2019-10-28 [1] CRAN (R 4.0.2)
ggplot2 3.3.3 2020-12-30 [1] CRAN (R 4.0.2)
ggrepel 0.9.0 2020-12-16 [1] CRAN (R 4.0.2)
ggridges 0.5.2 2020-01-12 [1] CRAN (R 4.0.2)
ggthemes 4.2.0 2019-05-13 [1] CRAN (R 4.0.2)
globals 0.14.0 2020-11-22 [1] CRAN (R 4.0.2)
glue 1.4.2 2020-08-27 [1] CRAN (R 4.0.2)
goftest 1.2-2 2019-12-02 [1] CRAN (R 4.0.2)
gridBase 0.4-7 2014-02-24 [1] CRAN (R 4.0.2)
gridExtra 2.3 2017-09-09 [1] CRAN (R 4.0.2)
gtable 0.3.0 2019-03-25 [1] CRAN (R 4.0.2)
haven 2.3.1 2020-06-01 [1] CRAN (R 4.0.2)
hexbin 1.28.1 2020-02-03 [1] CRAN (R 4.0.2)
hms 0.5.3 2020-01-08 [1] CRAN (R 4.0.2)
htmltools 0.5.0 2020-06-16 [1] CRAN (R 4.0.2)
htmlwidgets 1.5.3 2020-12-10 [1] CRAN (R 4.0.2)
httpuv 1.5.4 2020-06-06 [1] CRAN (R 4.0.2)
httr 1.4.2 2020-07-20 [1] CRAN (R 4.0.2)
ica 1.0-2 2018-05-24 [1] CRAN (R 4.0.2)
igraph 1.2.6 2020-10-06 [1] CRAN (R 4.0.2)
IRanges * 2.24.1 2020-12-12 [1] Bioconductor
irlba 2.3.3 2019-02-05 [1] CRAN (R 4.0.2)
iterators 1.0.13 2020-10-15 [1] CRAN (R 4.0.2)
jsonlite 1.7.2 2020-12-09 [1] CRAN (R 4.0.2)
kernlab * 0.9-29 2019-11-12 [1] CRAN (R 4.0.2)
KernSmooth 2.23-18 2020-10-29 [1] CRAN (R 4.0.2)
laeken 0.5.1 2020-02-05 [1] CRAN (R 4.0.2)
later 1.1.0.1 2020-06-05 [1] CRAN (R 4.0.2)
lattice 0.20-41 2020-04-02 [1] CRAN (R 4.0.3)
lazyeval 0.2.2 2019-03-15 [1] CRAN (R 4.0.2)
leiden 0.3.6 2020-12-07 [1] CRAN (R 4.0.2)
lifecycle 0.2.0 2020-03-06 [1] CRAN (R 4.0.2)
limma 3.46.0 2020-10-27 [1] Bioconductor
listenv 0.8.0 2019-12-05 [1] CRAN (R 4.0.2)
lmtest 0.9-38 2020-09-09 [1] CRAN (R 4.0.2)
locfit 1.5-9.4 2020-03-25 [1] CRAN (R 4.0.2)
magrittr 2.0.1 2020-11-17 [1] CRAN (R 4.0.2)
maps 3.3.0 2018-04-03 [1] CRAN (R 4.0.2)
MASS 7.3-53 2020-09-09 [1] CRAN (R 4.0.3)
Matrix * 1.3-0 2020-12-22 [1] CRAN (R 4.0.2)
MatrixGenerics * 1.2.0 2020-10-27 [1] Bioconductor
MatrixModels 0.4-1 2015-08-22 [1] CRAN (R 4.0.2)
matrixStats * 0.58.0 2021-01-29 [1] CRAN (R 4.0.2)
memoise 1.1.0 2017-04-21 [1] CRAN (R 4.0.2)
mgcv 1.8-33 2020-08-27 [1] CRAN (R 4.0.3)
mime 0.9 2020-02-04 [1] CRAN (R 4.0.2)
miniUI 0.1.1.1 2018-05-18 [1] CRAN (R 4.0.2)
misc3d 0.9-0 2020-09-06 [1] CRAN (R 4.0.2)
munsell 0.5.0 2018-06-12 [1] CRAN (R 4.0.2)
nlme 3.1-151 2020-12-10 [1] CRAN (R 4.0.2)
NMF * 0.23.0 2020-08-01 [1] CRAN (R 4.0.2)
nnet 7.3-14 2020-04-26 [1] CRAN (R 4.0.3)
np 0.60-10 2020-02-06 [1] CRAN (R 4.0.2)
openssl 1.4.3 2020-09-18 [1] CRAN (R 4.0.2)
openxlsx 4.2.3 2020-10-27 [1] CRAN (R 4.0.2)
org.Mm.eg.db 3.12.0 2021-01-06 [1] Bioconductor
parallelly 1.23.0 2021-01-04 [1] CRAN (R 4.0.2)
patchwork 1.1.1 2020-12-17 [1] CRAN (R 4.0.2)
pbapply 1.4-3 2020-08-18 [1] CRAN (R 4.0.2)
pcaMethods 1.82.0 2020-10-27 [1] Bioconductor
pCMF * 1.2.1 2021-01-06 [1] Github (gdurif/pCMF@9a09a7a)
pillar 1.4.7 2020-11-20 [1] CRAN (R 4.0.2)
pkgbuild 1.2.0 2020-12-15 [1] CRAN (R 4.0.2)
pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.0.2)
pkgload 1.1.0 2020-05-29 [1] CRAN (R 4.0.2)
pkgmaker * 0.32.2 2020-10-20 [1] CRAN (R 4.0.2)
plot3D 1.3 2019-12-18 [1] CRAN (R 4.0.2)
plotly 4.9.2.2 2020-12-19 [1] CRAN (R 4.0.2)
plyr 1.8.6 2020-03-03 [1] CRAN (R 4.0.2)
PMA * 1.2.1 2020-02-03 [1] CRAN (R 4.0.2)
png 0.1-7 2013-12-03 [1] CRAN (R 4.0.2)
polyclip 1.10-0 2019-03-14 [1] CRAN (R 4.0.2)
prettyunits 1.1.1 2020-01-24 [1] CRAN (R 4.0.2)
princurve 2.1.5 2020-08-25 [1] CRAN (R 4.0.2)
processx 3.4.5 2020-11-30 [1] CRAN (R 4.0.2)
promises 1.1.1 2020-06-09 [1] CRAN (R 4.0.2)
proxy 0.4-24 2020-04-25 [1] CRAN (R 4.0.2)
ps 1.5.0 2020-12-05 [1] CRAN (R 4.0.2)
purrr 0.3.4 2020-04-17 [1] CRAN (R 4.0.2)
quadprog 1.5-8 2019-11-20 [1] CRAN (R 4.0.2)
quantreg 5.83 2021-01-22 [1] CRAN (R 4.0.2)
R6 2.5.0 2020-10-28 [1] CRAN (R 4.0.2)
ranger 0.12.1 2020-01-10 [1] CRAN (R 4.0.2)
RANN 2.6.1 2019-01-08 [1] CRAN (R 4.0.2)
RColorBrewer 1.1-2 2014-12-07 [1] CRAN (R 4.0.2)
Rcpp 1.0.6 2021-01-15 [1] CRAN (R 4.0.2)
RcppAnnoy 0.0.18 2020-12-15 [1] CRAN (R 4.0.2)
RcppEigen 0.3.3.9.1 2020-12-17 [1] CRAN (R 4.0.2)
RcppHNSW 0.3.0 2020-09-06 [1] CRAN (R 4.0.2)
RCurl 1.98-1.2 2020-04-18 [1] CRAN (R 4.0.2)
readxl 1.3.1 2019-03-13 [1] CRAN (R 4.0.2)
registry * 0.5-1 2019-03-05 [1] CRAN (R 4.0.2)
remotes 2.2.0 2020-07-21 [1] CRAN (R 4.0.2)
reshape2 1.4.4 2020-04-09 [1] CRAN (R 4.0.2)
reticulate 1.18 2020-10-25 [1] CRAN (R 4.0.2)
rio 0.5.16 2018-11-26 [1] CRAN (R 4.0.2)
rlang 0.4.10 2020-12-30 [1] CRAN (R 4.0.2)
rngtools * 1.5 2020-01-23 [1] CRAN (R 4.0.2)
robustbase 0.93-7 2021-01-04 [1] CRAN (R 4.0.2)
ROCR 1.0-11 2020-05-02 [1] CRAN (R 4.0.2)
rpart 4.1-15 2019-04-12 [1] CRAN (R 4.0.3)
rprojroot 2.0.2 2020-11-15 [1] CRAN (R 4.0.2)
RSpectra 0.16-0 2019-12-01 [1] CRAN (R 4.0.2)
RSQLite 2.2.1 2020-09-30 [1] CRAN (R 4.0.2)
rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.0.2)
rsvd 1.0.3 2020-02-17 [1] CRAN (R 4.0.2)
Rtsne * 0.15 2018-11-10 [1] CRAN (R 4.0.2)
S4Vectors * 0.28.1 2020-12-09 [1] Bioconductor
scales 1.1.1 2020-05-11 [1] CRAN (R 4.0.2)
scattermore 0.7 2020-11-24 [1] CRAN (R 4.0.2)
scatterplot3d 0.3-41 2018-03-14 [1] CRAN (R 4.0.2)
sctransform 0.3.2 2020-12-16 [1] CRAN (R 4.0.2)
sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.0.2)
Seurat * 3.9.9.9024 2021-01-04 [1] Github (satijalab/seurat@8004d7c)
shiny 1.5.0 2020-06-23 [1] CRAN (R 4.0.2)
SingleCellExperiment * 1.12.0 2020-10-27 [1] Bioconductor
sm * 2.2-5.6 2018-09-27 [1] CRAN (R 4.0.2)
smoother 1.1 2015-04-16 [1] CRAN (R 4.0.2)
softImpute 1.4 2015-04-08 [1] CRAN (R 4.0.2)
sp 1.4-4 2020-10-07 [1] CRAN (R 4.0.2)
spam 2.6-0 2020-12-14 [1] CRAN (R 4.0.2)
SparseM 1.78 2019-12-13 [1] CRAN (R 4.0.2)
spatstat 1.64-1 2020-05-12 [1] CRAN (R 4.0.2)
spatstat.data 1.7-0 2020-12-16 [1] CRAN (R 4.0.2)
spatstat.utils 1.17-0 2020-02-07 [1] CRAN (R 4.0.2)
stringi 1.5.3 2020-09-09 [1] CRAN (R 4.0.2)
stringr 1.4.0 2019-02-10 [1] CRAN (R 4.0.2)
SummarizedExperiment * 1.20.0 2020-10-27 [1] Bioconductor
survival 3.2-7 2020-09-28 [1] CRAN (R 4.0.3)
tensor 1.5 2012-05-05 [1] CRAN (R 4.0.2)
testthat * 3.0.1 2020-12-17 [1] CRAN (R 4.0.2)
tibble 3.0.4 2020-10-12 [1] CRAN (R 4.0.2)
tidyr 1.1.2 2020-08-27 [1] CRAN (R 4.0.2)
tidyselect 1.1.0 2020-05-11 [1] CRAN (R 4.0.2)
TTR 0.24.2 2020-09-01 [1] CRAN (R 4.0.2)
umap * 0.2.7.0 2020-11-04 [1] CRAN (R 4.0.2)
usethis * 2.0.0 2020-12-10 [1] CRAN (R 4.0.2)
uwot 0.1.10 2020-12-15 [1] CRAN (R 4.0.2)
vcd 1.4-8 2020-09-21 [1] CRAN (R 4.0.2)
vctrs 0.3.6 2020-12-17 [1] CRAN (R 4.0.2)
VIM 6.0.0 2020-05-08 [1] CRAN (R 4.0.2)
vioplot * 0.3.5 2020-06-15 [1] CRAN (R 4.0.2)
viridisLite 0.3.0 2018-02-01 [1] CRAN (R 4.0.1)
withr 2.3.0 2020-09-22 [1] CRAN (R 4.0.2)
XML 3.99-0.5 2020-07-23 [1] CRAN (R 4.0.2)
xtable 1.8-4 2019-04-21 [1] CRAN (R 4.0.2)
xts 0.12.1 2020-09-09 [1] CRAN (R 4.0.2)
XVector 0.30.0 2020-10-28 [1] Bioconductor
zinbwave * 1.12.0 2020-10-28 [1] Bioconductor
zip 2.1.1 2020-08-27 [1] CRAN (R 4.0.2)
zlibbioc 1.36.0 2020-10-28 [1] Bioconductor
zoo * 1.8-8 2020-05-02 [1] CRAN (R 4.0.2)
[1] /Library/Frameworks/R.framework/Versions/4.0/Resources/library
R ── Package was removed from disk.
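To compare a local installation against the suggested versions, a sketch along the following lines may help. The versions in `suggested` are copied from the session information above for a handful of the eSVD dependencies; the helper `compare_versions` is a hypothetical name and is not part of the eSVD package.

```r
# Suggested versions for a few dependencies, taken from the table above.
suggested <- c(MASS = "7.3-53", foreach = "1.5.1", igraph = "1.2.6",
               princurve = "2.1.5", RSpectra = "0.16-0", softImpute = "1.4")

# Hypothetical helper: report whether each installed package is older than,
# the same as, or newer than the suggested version, or missing altogether.
compare_versions <- function(suggested) {
  status <- vapply(names(suggested), function(p) {
    if (!nzchar(system.file(package = p))) return("missing")
    installed <- as.character(utils::packageVersion(p))
    switch(as.character(utils::compareVersion(installed, suggested[[p]])),
           "-1" = "older", "0" = "same", "1" = "newer")
  }, character(1))
  data.frame(package = names(suggested),
             suggested = unname(suggested),
             status = unname(status),
             stringsAsFactors = FALSE)
}

print(compare_versions(suggested))
```

A status of "newer" does not necessarily indicate a problem, since the versions above are only those the developer happened to use.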