Some suggestions for improved robustness

Hi,

thanks for making this package. When trying it out on some data sets (e.g., the example data used in the velociraptor vignette), I ran into some hurdles, and I thought I'd bring them to your attention:

When running with pca = FALSE, the matrices are not transposed:

veloviz/R/main.R

Lines 168 to 169 in 3bb83f5

pca.curr = curr

pca.proj = proj

which means that the output coordinates are given for the genes rather than for the cells.

In several cases, I got an output coordinate matrix with fewer cells than present in my input data, which I think is because only connected components of the graph are retained, according to the default value in

veloviz/R/projectedNeighbors.R

Line 127 in 3bb83f5

    
           graphViz = function(observed, projected, k, distance_metric = "L2", similarity_metric = "cosine", distance_weight = 1, distance_threshold = 1, similarity_threshold = -1, weighted = TRUE, remove_unconnected = TRUE, return_graph = FALSE,  plot = TRUE, cell.colors = NA, title = NA){

Could this argument be exposed to the user?

Because of the log10(x+1) transformation applied internally, negative values (at least any below -1) in the input matrices will cause problems and return an error message which is not very helpful for tracking down the actual problem. Adding some checks to ensure the validity of the input arguments would be helpful.
For some reason, some of the composite distances for one cell in my dataset were NA, which caused problems here:

veloviz/R/projectedNeighbors.R

Line 70 in 3bb83f5

nn_idx[i] = min_dist_idx

since min_dist_idx had length 0. Adding na.rm=TRUE in min at

veloviz/R/projectedNeighbors.R

Line 54 in 3bb83f5

min_dist_idx = which(nn_dists_i==min(nn_dists_i))

allowed me to select non-NA values if they exist; however, it is not clear to me why the NA distances appeared in the first place - perhaps some more checks can be added to catch situations that may lead to such problems?
It is also possible that there are multiple neighbors with the same composite distance, which will raise a warning at

veloviz/R/projectedNeighbors.R

Line 70 in 3bb83f5

nn_idx[i] = min_dist_idx
From R v4.0.0, matrix objects also inherits from array, which invalidates code assuming that class(x) has length 1 (https://mran.microsoft.com/snapshot/2020-04-29/doc/manuals/r-devel/NEWS.html), e.g.,

veloviz/R/main.R

Line 51 in 3bb83f5

if (!class(curr) %in% c("dgCMatrix", "dgTMatrix")) {

Warning messages:
1: In if (!class(curr) %in% c("dgCMatrix", "dgTMatrix")) { :
  the condition has length > 1 and only the first element will be used

Hello! Thank you for trying out the package and for your detailed suggestions.

We've implemented updates to address points 1, 2, and 3.

For points 4 and 5, we actually do not end up using min_dist_idx or nn_idx to construct the graph. We use k_min_dists_idx and knn_idx, which specify multiple nearest neighbors for each cell. (This is a remnant of an older version of the function.) k_min_dists_idx is created using order() to get the cells with the minimum distances and so this is not sensitive to NAs the same way that min() is.

veloviz/R/projectedNeighbors.R

Line 56 in 4a851fc

k_min_dists_idx = order(nn_dists_i)[1:k]

That being said, I am not sure why NAs would be generated when computing composite distances. Could you please share the specific data that gave this result?

For point 6, thanks for pointing out the new behavior - we will address it shortly.

Best,
Lyla

Thanks! For example, I was trying with the example data from the velociraptor vignette mentioned above:

sce <- scRNAseq::HermannSpermatogenesisData()
sce <- sce[, 1:500]
sce <- scuttle::logNormCounts(sce, assay.type = 1)
dec <- scran::modelGeneVar(sce)
top.hvgs <- scran::getTopHVGs(dec, n = 2000)
velo.out <- velociraptor::scvelo(sce, subset.row = top.hvgs, assay.X = "spliced")
curr <- assay(velo.out, "Ms")
proj <- assay(velo.out, "Ms") + 1 * assay(velo.out, "velocity")
proj[proj <= 0] <- 0
veloviz = veloviz::buildVeloviz(
    curr = curr, proj = proj,
    normalize.depth = FALSE,
    use.ods.genes = TRUE,
    alpha = 0.05,
    pca = TRUE,
    nPCs = 5,
    center = TRUE,
    scale = TRUE,
    k = 5,
    similarity.threshold = 0,
    distance.weight = 1,
    distance.threshold = 0.25,
    weighted = TRUE,
    seed = 0,
    verbose = TRUE
)
# Error in nn_idx[i] <- min_dist_idx : replacement has length zero

Session info

R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils    
[7] datasets  methods   base     

other attached packages:
 [1] scRNAseq_2.4.0              SingleCellExperiment_1.12.0
 [3] SummarizedExperiment_1.20.0 Biobase_2.50.0             
 [5] GenomicRanges_1.42.0        GenomeInfoDb_1.26.2        
 [7] IRanges_2.24.1              S4Vectors_0.28.1           
 [9] BiocGenerics_0.36.0         MatrixGenerics_1.2.1       
[11] matrixStats_0.58.0         

loaded via a namespace (and not attached):
  [1] nlme_3.1-152                  ProtGenerics_1.22.0          
  [3] bitops_1.0-6                  bit64_4.0.5                  
  [5] filelock_1.0.2                progress_1.2.2               
  [7] httr_1.4.2                    tools_4.0.3                  
  [9] R6_2.5.0                      irlba_2.3.3                  
 [11] mgcv_1.8-33                   DBI_1.1.1                    
 [13] lazyeval_0.2.2                zellkonverter_1.0.2          
 [15] withr_2.4.1                   tidyselect_1.1.0             
 [17] prettyunits_1.1.1             bit_4.0.4                    
 [19] curl_4.3                      compiler_4.0.3               
 [21] BiocNeighbors_1.8.2           basilisk.utils_1.2.2         
 [23] xml2_1.3.2                    DelayedArray_0.16.1          
 [25] rtracklayer_1.50.0            askpass_1.1                  
 [27] rappdirs_0.3.3                stringr_1.4.0                
 [29] digest_0.6.27                 Rsamtools_2.6.0              
 [31] basilisk_1.2.1                XVector_0.30.0               
 [33] pkgconfig_2.0.3               htmltools_0.5.1.1            
 [35] sparseMatrixStats_1.2.1       limma_3.46.0                 
 [37] dbplyr_2.1.0                  fastmap_1.1.0                
 [39] ensembldb_2.14.0              rlang_0.4.10                 
 [41] rstudioapi_0.13               RSQLite_2.2.3                
 [43] shiny_1.6.0                   DelayedMatrixStats_1.12.3    
 [45] generics_0.1.0                veloviz_0.0.0.9000           
 [47] jsonlite_1.7.2                BiocParallel_1.24.1          
 [49] dplyr_1.0.4                   RCurl_1.98-1.2               
 [51] magrittr_2.0.1                BiocSingular_1.6.0           
 [53] GenomeInfoDbData_1.2.4        scuttle_1.0.4                
 [55] Matrix_1.3-2                  Rcpp_1.0.6                   
 [57] reticulate_1.18               lifecycle_0.2.0              
 [59] edgeR_3.32.1                  stringi_1.5.3                
 [61] yaml_2.2.1                    zlibbioc_1.36.0              
 [63] BiocFileCache_1.14.0          AnnotationHub_2.22.0         
 [65] grid_4.0.3                    blob_1.2.1                   
 [67] dqrng_0.2.1                   promises_1.1.1               
 [69] ExperimentHub_1.16.0          crayon_1.4.0                 
 [71] lattice_0.20-41               splines_4.0.3                
 [73] Biostrings_2.58.0             beachmat_2.6.4               
 [75] GenomicFeatures_1.42.1        hms_1.0.0                    
 [77] locfit_1.5-9.4                pillar_1.4.7                 
 [79] igraph_1.2.6                  biomaRt_2.46.2               
 [81] XML_3.99-0.5                  glue_1.4.2                   
 [83] BiocVersion_3.12.0            scran_1.18.5                 
 [85] BiocManager_1.30.10           vctrs_0.3.6                  
 [87] httpuv_1.5.5                  openssl_1.4.3                
 [89] purrr_0.3.4                   velociraptor_1.0.0           
 [91] assertthat_0.2.1              cachem_1.0.3                 
 [93] rsvd_1.0.3                    mime_0.9                     
 [95] xtable_1.8-4                  AnnotationFilter_1.14.0      
 [97] RSpectra_0.16-0               later_1.1.0.1                
 [99] tibble_3.0.6                  GenomicAlignments_1.26.0     
[101] AnnotationDbi_1.52.0          memoise_2.0.0                
[103] statmod_1.4.35                bluster_1.0.0                
[105] ellipsis_0.3.1                interactiveDisplayBase_1.28.0

Hi, thanks for sharing your data. I tried it out and something quite interesting is happening. It seems that for some cells, their expression is similar enough that when only including the top 2000 (or 3000, or 5000..) highly variable genes for further analysis, they actually end up having the exact same gene expression. This happens for at least cells 30, 70, 216, and 380 in the data you included above. (They are not exact duplicates when looking at all genes though).

This presents a problem when computing composite distance, specifically the cosine correlation component. Here, we are calculating the cosine correlation between the a cell's velocity vector (proj[,i] - curr[,i]) and the vector representing the change from a cell to its putative projected neighbor (curr[,j] - curr[,i]). The cosine correlation is defined as the dot product normalized by the magnitudes of the vectors. If cells i and j have the same expression, then the magnitude of curr[,j] - curr[,i] = 0, resulting in NaN when computing the cosine correlation and in turn, the composite distance.

I had encountered this problem with simulated data but didn't expect it to occur with real data. For the time being, I added na.rm = T as you suggested above so that it doesn't throw an error. As it stands, when looking for a cell's projected neighbors, cells with the exact same expression are going to be ignored because of the way that order() deals with NaNs. This should work (see embedding below) until I implement a more meaningful way of dealing with cases like this.

VeloViz embedding of data from velociraptor vignette with parameters k=50, distance.weight = 10, distance.threshold = 1, similarity.threshold = 0, weighted = TRUE, remove.unconnected = TRUE, colored by velocity pseudotime:

Great, thank you!

	pca.curr = curr
	pca.proj = proj