Some suggestions for improved robustness
Closed this issue · 4 comments
Hi,
thanks for making this package. When trying it out on some data sets (e.g., the example data used in the velociraptor vignette), I ran into some hurdles, and I thought I'd bring them to your attention:
- When running with
pca = FALSE
, the matrices are not transposed:Lines 168 to 169 in 3bb83f5
- In several cases, I got an output coordinate matrix with fewer cells than present in my input data, which I think is because only connected components of the graph are retained, according to the default value in
veloviz/R/projectedNeighbors.R
Line 127 in 3bb83f5
- Because of the
log10(x+1)
transformation applied internally, negative values (at least any below -1) in the input matrices will cause problems and return an error message which is not very helpful for tracking down the actual problem. Adding some checks to ensure the validity of the input arguments would be helpful. - For some reason, some of the composite distances for one cell in my dataset were NA, which caused problems here:
veloviz/R/projectedNeighbors.R
Line 70 in 3bb83f5
min_dist_idx
had length 0. Addingna.rm=TRUE
inmin
atveloviz/R/projectedNeighbors.R
Line 54 in 3bb83f5
- It is also possible that there are multiple neighbors with the same composite distance, which will raise a warning at
veloviz/R/projectedNeighbors.R
Line 70 in 3bb83f5
- From R v4.0.0,
matrix
objects also inherits fromarray
, which invalidates code assuming thatclass(x)
has length 1 (https://mran.microsoft.com/snapshot/2020-04-29/doc/manuals/r-devel/NEWS.html), e.g.,Line 51 in 3bb83f5
Warning messages:
1: In if (!class(curr) %in% c("dgCMatrix", "dgTMatrix")) { :
the condition has length > 1 and only the first element will be used
Hello! Thank you for trying out the package and for your detailed suggestions.
We've implemented updates to address points 1, 2, and 3.
For points 4 and 5, we actually do not end up using min_dist_idx
or nn_idx
to construct the graph. We use k_min_dists_idx
and knn_idx
, which specify multiple nearest neighbors for each cell. (This is a remnant of an older version of the function.) k_min_dists_idx
is created using order()
to get the cells with the minimum distances and so this is not sensitive to NAs the same way that min()
is.
veloviz/R/projectedNeighbors.R
Line 56 in 4a851fc
That being said, I am not sure why NAs would be generated when computing composite distances. Could you please share the specific data that gave this result?
For point 6, thanks for pointing out the new behavior - we will address it shortly.
Best,
Lyla
Thanks! For example, I was trying with the example data from the velociraptor vignette mentioned above:
sce <- scRNAseq::HermannSpermatogenesisData()
sce <- sce[, 1:500]
sce <- scuttle::logNormCounts(sce, assay.type = 1)
dec <- scran::modelGeneVar(sce)
top.hvgs <- scran::getTopHVGs(dec, n = 2000)
velo.out <- velociraptor::scvelo(sce, subset.row = top.hvgs, assay.X = "spliced")
curr <- assay(velo.out, "Ms")
proj <- assay(velo.out, "Ms") + 1 * assay(velo.out, "velocity")
proj[proj <= 0] <- 0
veloviz = veloviz::buildVeloviz(
curr = curr, proj = proj,
normalize.depth = FALSE,
use.ods.genes = TRUE,
alpha = 0.05,
pca = TRUE,
nPCs = 5,
center = TRUE,
scale = TRUE,
k = 5,
similarity.threshold = 0,
distance.weight = 1,
distance.threshold = 0.25,
weighted = TRUE,
seed = 0,
verbose = TRUE
)
# Error in nn_idx[i] <- min_dist_idx : replacement has length zero
Session info
R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] parallel stats4 stats graphics grDevices utils
[7] datasets methods base
other attached packages:
[1] scRNAseq_2.4.0 SingleCellExperiment_1.12.0
[3] SummarizedExperiment_1.20.0 Biobase_2.50.0
[5] GenomicRanges_1.42.0 GenomeInfoDb_1.26.2
[7] IRanges_2.24.1 S4Vectors_0.28.1
[9] BiocGenerics_0.36.0 MatrixGenerics_1.2.1
[11] matrixStats_0.58.0
loaded via a namespace (and not attached):
[1] nlme_3.1-152 ProtGenerics_1.22.0
[3] bitops_1.0-6 bit64_4.0.5
[5] filelock_1.0.2 progress_1.2.2
[7] httr_1.4.2 tools_4.0.3
[9] R6_2.5.0 irlba_2.3.3
[11] mgcv_1.8-33 DBI_1.1.1
[13] lazyeval_0.2.2 zellkonverter_1.0.2
[15] withr_2.4.1 tidyselect_1.1.0
[17] prettyunits_1.1.1 bit_4.0.4
[19] curl_4.3 compiler_4.0.3
[21] BiocNeighbors_1.8.2 basilisk.utils_1.2.2
[23] xml2_1.3.2 DelayedArray_0.16.1
[25] rtracklayer_1.50.0 askpass_1.1
[27] rappdirs_0.3.3 stringr_1.4.0
[29] digest_0.6.27 Rsamtools_2.6.0
[31] basilisk_1.2.1 XVector_0.30.0
[33] pkgconfig_2.0.3 htmltools_0.5.1.1
[35] sparseMatrixStats_1.2.1 limma_3.46.0
[37] dbplyr_2.1.0 fastmap_1.1.0
[39] ensembldb_2.14.0 rlang_0.4.10
[41] rstudioapi_0.13 RSQLite_2.2.3
[43] shiny_1.6.0 DelayedMatrixStats_1.12.3
[45] generics_0.1.0 veloviz_0.0.0.9000
[47] jsonlite_1.7.2 BiocParallel_1.24.1
[49] dplyr_1.0.4 RCurl_1.98-1.2
[51] magrittr_2.0.1 BiocSingular_1.6.0
[53] GenomeInfoDbData_1.2.4 scuttle_1.0.4
[55] Matrix_1.3-2 Rcpp_1.0.6
[57] reticulate_1.18 lifecycle_0.2.0
[59] edgeR_3.32.1 stringi_1.5.3
[61] yaml_2.2.1 zlibbioc_1.36.0
[63] BiocFileCache_1.14.0 AnnotationHub_2.22.0
[65] grid_4.0.3 blob_1.2.1
[67] dqrng_0.2.1 promises_1.1.1
[69] ExperimentHub_1.16.0 crayon_1.4.0
[71] lattice_0.20-41 splines_4.0.3
[73] Biostrings_2.58.0 beachmat_2.6.4
[75] GenomicFeatures_1.42.1 hms_1.0.0
[77] locfit_1.5-9.4 pillar_1.4.7
[79] igraph_1.2.6 biomaRt_2.46.2
[81] XML_3.99-0.5 glue_1.4.2
[83] BiocVersion_3.12.0 scran_1.18.5
[85] BiocManager_1.30.10 vctrs_0.3.6
[87] httpuv_1.5.5 openssl_1.4.3
[89] purrr_0.3.4 velociraptor_1.0.0
[91] assertthat_0.2.1 cachem_1.0.3
[93] rsvd_1.0.3 mime_0.9
[95] xtable_1.8-4 AnnotationFilter_1.14.0
[97] RSpectra_0.16-0 later_1.1.0.1
[99] tibble_3.0.6 GenomicAlignments_1.26.0
[101] AnnotationDbi_1.52.0 memoise_2.0.0
[103] statmod_1.4.35 bluster_1.0.0
[105] ellipsis_0.3.1 interactiveDisplayBase_1.28.0
Hi, thanks for sharing your data. I tried it out and something quite interesting is happening. It seems that for some cells, their expression is similar enough that when only including the top 2000 (or 3000, or 5000..) highly variable genes for further analysis, they actually end up having the exact same gene expression. This happens for at least cells 30, 70, 216, and 380 in the data you included above. (They are not exact duplicates when looking at all genes though).
This presents a problem when computing composite distance, specifically the cosine correlation component. Here, we are calculating the cosine correlation between the a cell's velocity vector (proj[,i] - curr[,i]
) and the vector representing the change from a cell to its putative projected neighbor (curr[,j] - curr[,i]
). The cosine correlation is defined as the dot product normalized by the magnitudes of the vectors. If cells i and j have the same expression, then the magnitude of curr[,j] - curr[,i] = 0
, resulting in NaN when computing the cosine correlation and in turn, the composite distance.
I had encountered this problem with simulated data but didn't expect it to occur with real data. For the time being, I added na.rm = T
as you suggested above so that it doesn't throw an error. As it stands, when looking for a cell's projected neighbors, cells with the exact same expression are going to be ignored because of the way that order()
deals with NaNs. This should work (see embedding below) until I implement a more meaningful way of dealing with cases like this.
VeloViz embedding of data from velociraptor vignette with parameters k=50
, distance.weight = 10
, distance.threshold = 1
, similarity.threshold = 0
, weighted = TRUE
, remove.unconnected = TRUE
, colored by velocity pseudotime:
Great, thank you!