elbamos/largeVis

hdbscan-non-numeric argument to binary operator

Closed this issue · 24 comments

library(largeVis)
library(magrittr)  # provides %>%
set.seed(123)
ts_matrix_elec <- elect_data %>% scale() %>% t()
visObject <- largeVis(ts_matrix_elec, n_trees = 50,
K = 10)
plot(t(visObject$coords))

clusters <- hdbscan(visObject, verbose = FALSE) # failed
Error in stats::aggregate(probs, by = list(clusters), FUN = "max")$probs - :
non-numeric argument to binary operator

gplot(clusters, t(visObject$coords))

What happened? Any suggestions?

Hey, can you make elect_data available so I can take a look? Thanks.

Thank you! How can I make the data available to you?

https://www.dropbox.com/s/0v41q45yvn9ahzh/test_gm_data.csv?dl=0
I uploaded the data to Dropbox. Thank you!

It's monthly time series data from January 2012 to December 2015 for 11524 individuals. I want to cluster these individuals based on the time dimension.
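
For reference, a minimal sketch of what the `scale() %>% t()` step in the report does, run on made-up data. The `toy` matrix and its row and column names are assumptions for illustration (individuals as rows, months as columns), not the real dataset.

```r
# Toy stand-in for the real data: 5 individuals (rows) x 4 monthly values (columns).
set.seed(1)
toy <- matrix(rnorm(20), nrow = 5, ncol = 4,
              dimnames = list(paste0("id", 1:5), paste0("month", 1:4)))

# scale() centers each column to mean 0 and rescales it to standard deviation 1.
scaled <- scale(toy)

# t() transposes the result, so each individual becomes a column.
toy_t <- t(scaled)

dim(toy)    # 5 4
dim(toy_t)  # 4 5
```

Note that `scale()` standardizes per column, so in this layout each month is standardized across individuals, not each individual across months.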

I can't reproduce it. Can you try the version currently in branch hotfix/twobugs and confirm if the issue is now resolved?

Loading required package: Matrix
> library(readr)
> elect_data <- read_csv("~/Downloads/test_gm_data.csv")
Parsed with column specification:
cols(
  .default = col_double()
)
See spec(...) for full column specifications.
> str(elect_data)
<snip>
> library(largeVis)
Loading required package: Rcpp
> library(magrittr)
> ts_matrix <- elect_data %>% scale() %>% t()
> visObj <- largeVis(ts_matrix, n_trees = 50, K = 10, verbose = TRUE)
Searching for neighbors.
0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
**************************************************|
Calculating edge weights...
0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
**************************************************|
Estimating embeddings.
0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
**************************************************|
Warning message:
In largeVis(ts_matrix, n_trees = 50, K = 10, verbose = TRUE) :
  The Distances between some neighbors are large enough to cause the calculation of p_{j|i} to overflow. Scaling the distance vector.
> plot(t(visObj$coords))
> clusters <- hdbscan(visObj, verbose = TRUE)
0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
**************************************************|
> gplot(clusters, t(visObj$coords))
Warning message:
Removed 1337 rows containing missing values (geom_segment). 
> 


How do I install that version?

devtools::install_github("elbamos/largeVis", ref = "hotfix/twobugs")


The download is too slow. Is there a faster way to install it?

Yes, it works!

But why is there an "NA" in the plot? I can't upload the image. Do you see that in your plot?

Yes. Points will have cluster NA if the algorithm does not put them in a cluster. You can review the documentation on the algorithm for details if you'd like.
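
As a rough illustration of handling those unassigned points, here is a sketch that counts the `NA` entries and drops them before further plotting or summarizing. The `labels` and `coords` objects are hypothetical stand-ins made up for this example, not output from `hdbscan`.

```r
# Hypothetical cluster labels; NA marks points not assigned to any cluster.
labels <- factor(c(1, 1, 2, NA, 2, NA, 1))

# Hypothetical 2-D coordinates, one row per point.
coords <- cbind(x = c(0.1, 0.2, 1.1, 5.0, 1.0, -4.0, 0.15),
                y = c(0.0, 0.1, 1.2, 5.0, 0.9, 3.0, 0.05))

# Count points per cluster, including the unassigned ones.
counts <- table(labels, useNA = "ifany")
print(counts)

# Keep only the points that received a cluster label.
assigned <- !is.na(labels)
clustered_coords <- coords[assigned, , drop = FALSE]
clustered_labels <- droplevels(labels[assigned])

n_unassigned <- sum(!assigned)
```

Whether to drop the unassigned points or show them as a separate "noise" group depends on the analysis; dropping them is just one option.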

I'm going to close this now - feel free to reopen if anything comes up.

Thank you!

@bifeng There was a bug in the version of largeVis that you tested a week ago. The bug caused the hdbscan algorithm to fail to combine clusters that should be combined. If you try the version that I've just pushed, it should produce better results on your dataset.

I am also encountering the same problem.

> load('C:/lab/normdata.Rdata')

> library(largeVis)

> library(ggplot2)

> norm <- scale(norm)

> l <- largeVis(norm,verbose=T)
Searching for neighbors.
0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
**************************************************|
Calculating edge weights...
0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
**************************************************|
Estimating embeddings.
0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
**************************************************|

> clusters <- largeVis::hdbscan(l,verbose=T)
0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
**************************************************|
********Error in stats::aggregate(probs, by = list(clusters), FUN = "max")$probs -  : 
  non-numeric argument to binary operator
In addition: Warning message:
In largeVis(norm, verbose = T) :
  The Distances between some neighbors are large enough to cause the calculation of p_{j|i} to overflow. Scaling the distance vector.

I'd rather not publicly post the data. Can I email it to you?

I couldn't reproduce it. Are you sure you're using a current version?

[Two screenshots attached, taken 2017-06-30 at about 1:12 pm]

I installed it with

devtools::install_github("elbamos/largeVis", ref = "hotfix/twobugs")

Is this correct?

I've reinstalled from master and it's still throwing the same error.

> h <- hdbscan(vis, verbose=T)
0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
**************************************************|
********Error in stats::aggregate(probs, by = list(clusters), FUN = "max")$probs -  : 
  non-numeric argument to binary operator
In addition: Warning message:
In largeVis(norm, verbose = T) :
  The Distances between some neighbors are large enough to cause the calculation of p_{j|i} to overflow. Scaling the distance vector.

That's very odd. Can you send me the log or a screenshot of a complete session? Start from an empty environment, load largeVis, check the version, and try the commands in just the way I did them?
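
A sketch of one way to run that version check, meant for a freshly started R session. `report_version` is a hypothetical helper written for this example; it is shown with the base `stats` package so the snippet runs even where largeVis is not installed.

```r
# Hypothetical helper: return the installed version of a package as a string,
# or NA if the package is not installed.
report_version <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    return(NA_character_)
  }
  as.character(utils::packageVersion(pkg))
}

# Shown with "stats"; substitute "largeVis" to check the package under discussion.
ver <- report_version("stats")
print(ver)

# sessionInfo() lists the R version, platform, and attached packages.
info <- sessionInfo()
print(info$R.version$version.string)
```

Posting the printed version together with the full `sessionInfo()` output makes it easier to tell whether the installed build is the one that was expected.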

> load('C:/lab/normdata.Rdata')

> library(largeVis)

> library(ggplot2)

> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ggplot2_2.2.1  largeVis_0.2.2 Matrix_1.2-9  

loaded via a namespace (and not attached):
 [1] colorspace_1.3-2 scales_0.4.1     compiler_3.4.0   lazyeval_0.2.0   plyr_1.8.4      
 [6] tools_3.4.0      gtable_0.2.0     tibble_1.3.3     Rcpp_0.12.11     grid_3.4.0      
[11] rlang_0.1.1      munsell_0.4.3    lattice_0.20-35 

> norm <- scale(norm)

> vis <- largeVis(norm,verbose=T)
Searching for neighbors.
0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
**************************************************|
Calculating edge weights...
0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
**************************************************|
Estimating embeddings.
0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
**************************************************|

> plot(t(vis$coords))

> h <- hdbscan(vis, verbose=T)
0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
**************************************************|
********Error in stats::aggregate(probs, by = list(clusters), FUN = "max")$probs -  : 
  non-numeric argument to binary operator
In addition: Warning message:
In largeVis(norm, verbose = T) :
  The Distances between some neighbors are large enough to cause the calculation of p_{j|i} to overflow. Scaling the distance vector.