elbamos/largeVis

randomProjectionTreeSearch gets stuck and never returns

Closed this issue · 16 comments

Apologies in advance if this is not a "bug" but just something I am doing wrong.

I have a data set of 423K rows and 225 dimensions. I am running the different largeVis steps separately to debug ("randomProjectionTreeSearch", "buildEdgeMatrix", "buildWijMatrix", "projectKNNs"). The first step runs at full speed for a couple of seconds, then settles into a single-threaded load (about 15% CPU on a 4-core machine) and never returns.

I have had the same behaviour with a similar dataset (423K rows) but with 500 different dimensions. In that case changing the "K" parameter prevented the issue. I have gone over the various hyper parameters but have not been able to find a setting that works for my set of 225 dimensions.

Is there any way I can debug this, so that I do not have to search the hyperparameter space at random? I have tried setting the option "getOption("verbose", TRUE)" but this does not output anything.
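For reference, `getOption("verbose", TRUE)` only reads an option (the second argument is a fallback for when the option is unset); it does not change anything. A minimal base-R sketch of the difference follows. It assumes nothing about how largeVis itself consumes the option; the transcript further down the thread passes `verbose = TRUE` directly to `randomProjectionTreeSearch`.

```r
# Remember the current setting so it can be restored afterwards.
old <- options(verbose = FALSE)

# getOption() only reads. The second argument is a fallback that is used
# when the option is unset, so here it still returns the stored FALSE.
read_only <- getOption("verbose", TRUE)

# options() is what actually sets the value.
options(verbose = TRUE)
now_true <- getOption("verbose")

# Restore the previous setting.
options(old)
```
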

Any help would be appreciated. In any case, thanks for your wonderful package!

Spec:

  • Windows 10 pro
  • 16 GB RAM, Core i7 6700 HQ (4 cores)
  • largeVis 0.1.10 x64 (compiled against github, though I have also tried CRAN 32-bit version)
  • R 3.3.2 x86_64-w64-mingw32

That's odd. Is it possible for you to share your data?

Where the 32/64-bit issue comes in is actually with allocating the matrix indices, and I wouldn't expect it to be problematic on that dataset in that function. What matters is not whether the OS is 32- or 64-bit, but whether ARMA_64BIT_WORD is set, because that determines how big a sparse matrix Armadillo is willing to make. Are you sure that ARMA_64BIT_WORD was enabled when you compiled? (Just installing from github won't do it; you have to add -DARMA_64BIT_WORD to your R Makevars.)
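As a rough way to check the Makevars side of this, the sketch below scans a Makevars file for the flag. `has_arma_64bit_flag` is a hypothetical helper, not part of largeVis, and the variable name written into the demo file (`CXXFLAGS`) is an assumption made only for the demonstration; the only thing taken from this thread is the flag `-DARMA_64BIT_WORD` itself.

```r
# Hypothetical helper (not part of largeVis): does a Makevars file
# contain the -DARMA_64BIT_WORD flag on a non-commented line?
has_arma_64bit_flag <- function(path) {
  if (length(path) != 1L || !file.exists(path)) return(FALSE)
  lines <- readLines(path, warn = FALSE)
  # Drop lines that are commented out.
  lines <- lines[!grepl("^[[:space:]]*#", lines)]
  any(grepl("-DARMA_64BIT_WORD", lines, fixed = TRUE))
}

# Self-contained demonstration using a temporary file.
# The variable name on the left-hand side is an assumption.
demo <- tempfile("Makevars")
writeLines("CXXFLAGS += -DARMA_64BIT_WORD", demo)
flag_found <- has_arma_64bit_flag(demo)
flag_missing <- has_arma_64bit_flag(tempfile("no-such-Makevars"))
```

To use it on a real setup, pass the path of your own user Makevars file; if your R version provides `tools::makevars_user()`, that function should report the location. Finding the flag in the file only shows that it was requested, not that the installed package was actually rebuilt with it.
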

I really appreciate your help tracking this down - I haven't focused much on sparse matrices since I seemed to be the only one interested in using that functionality.

I took a look at your data and I'm not able to reproduce the error. The 32- vs. 64-bit build should not matter at this data size. Peak RAM use was ~ 8 GB.

> file = file("../test-skillizr.bin", "rb")
> numRows = readBin(file, integer(), 1, endian = "big")
> numCols = readBin(file, integer(), 1, endian = "big")
> docs = readLines(file, numRows, ok = FALSE, skipNul =  TRUE)
> index = readBin(file, double(), numCols * numRows, endian = "big")
> index = matrix(index, numRows, numCols, byrow = TRUE)
> str(index)
 num [1:423093, 1:225] 7.77e-05 8.56e-05 1.95e-04 1.28e-04 5.97e-05 ...
Warning message:
closing unused connection 3 (../test-skillizr.bin) 
> index = t(index)
> library(largeVis)
> neighobrs <- randomProjectionTreeSearch(index, max_iter = 1, verbose = TRUE, K = 100, n_trees = 50, tree_threshold = 100)
Searching for neighbors.
0                                                                                                   %
|----|----|----|----|----|----|----|----|----|----|
**************************************************|
> str(neighobrs)
 num [1:100, 1:423093] 331837 42204 308194 184452 211760 ...
> edges = buildEdgeMatrix(data = index, neighbors = neighobrs, distance_method = "Cosine")
> str(edges)
Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
  ..@ i       : int [1:42309300] 21237 21804 26710 42204 59873 60531 91230 93913 108898 122277 ...
  ..@ p       : int [1:423094] 0 28 52 157 202 432 462 597 935 1068 ...
  ..@ Dim     : int [1:2] 423093 423093
  ..@ Dimnames:List of 2
  .. ..$ : NULL
  .. ..$ : NULL
  ..@ x       : num [1:42309300] 0.00534 0.05356 0.00856 0.00369 0.00799 ...
  ..@ factors : list()
> wij = buildWijMatrix(edges)
> str(wij)
Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
  ..@ i       : int [1:65098192] 921 6379 20259 21237 21804 26710 36382 36384 42204 59873 ...
  ..@ p       : int [1:423094] 0 101 201 340 437 681 784 959 1338 1547 ...
  ..@ Dim     : int [1:2] 423093 423093
  ..@ Dimnames:List of 2
  .. ..$ : NULL
  .. ..$ : NULL
  ..@ x       : num [1:65098192] 3.24e-251 5.26e-12 2.34e-193 2.85e-02 1.79e-02 ...
  ..@ factors : list()
> wij = buildWijMatrix(edges, perplexity = 5)
> str(wij)
Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
  ..@ i       : int [1:52643056] 6379 21237 26710 42204 59873 60531 77421 84110 90655 93027 ...
  ..@ p       : int [1:423094] 0 47 140 249 335 516 573 746 1093 1293 ...
  ..@ Dim     : int [1:2] 423093 423093
  ..@ Dimnames:List of 2
  .. ..$ : NULL
  .. ..$ : NULL
  ..@ x       : num [1:52643056] 1.23e-267 2.04e-03 4.21e-17 2.33e-01 2.98e-13 ...
  ..@ factors : list()
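
The sizes in the `str()` output above can be cross-checked with simple arithmetic. This is only a base-R sketch using the numbers printed in the transcript; it does not show how Armadillo sizes its indices, which the discussion above attributes to ARMA_64BIT_WORD.

```r
# Numbers taken from the transcript above.
n <- 423093   # points (columns of the transposed input)
K <- 100      # neighbors requested per point

# One edge per (point, neighbor) pair; this matches the length of edges@i.
nnz_edges <- K * n

# Number of cells in the full n x n matrix that the sparse matrices represent.
cells <- n * n

# Largest value a signed 32-bit integer can hold in R.
int32_max <- .Machine$integer.max

# The stored entries fit comfortably in 32 bits; the full cell count does not.
nnz_fits_32bit <- nnz_edges <= int32_max
cells_fit_32bit <- cells <= int32_max
```

So the stored edges are nowhere near a 32-bit limit, while the nominal matrix size is beyond it. Whether that second number matters depends on how the index type was compiled.
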

Can you try from a clean install?

P.S.: Is there a reason you're setting perplexity to 5?

Actually, I take that back slightly -- some of the numbers in the wij matrix are so small that, if you're using a 32-bit build of R (not Armadillo, but R itself), they might underflow. I don't know that this would cause the error you're reporting, however.
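
One quick way to look into this from the R prompt, as a sketch: compare the smallest values printed in the `str(wij)` output against `.Machine$double.xmin`, the smallest normalized double R can represent, and check which build is running via `.Machine$sizeof.pointer`. This only inspects R's own double type; it says nothing about intermediate arithmetic inside the compiled code.

```r
# Smallest magnitudes visible in the str(wij) output above.
smallest_seen <- c(3.24e-251, 1.23e-267)

# Smallest positive normalized double on this build of R.
xmin <- .Machine$double.xmin

# TRUE for each value that is still representable as a normal double.
representable <- smallest_seen > xmin

# Pointer size in bytes: 8 on a 64-bit build of R, 4 on a 32-bit build.
pointer_bytes <- .Machine$sizeof.pointer
```

If `representable` is all `TRUE` on the build in question, the values shown above are not themselves underflowing in R.
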

Can you confirm this is working so I can close the Issue?

@avanwouwe You can try the version currently in the /develop branch here. The multithreading in the neighbor search was refactored, so that might help you.

@avanwouwe I'm going to close this for now. The neighbor search in the develop branch has had its threading almost completely refactored, so if you're still getting thread lock issues, I think it may have to do with your OpenMP library. If you're able to find anything out or make any progress, please let me know, and please feel free to reopen.