GregorySchwartz/too-many-cells

`--dense` does not work

grst opened this issue · 10 comments

grst commented

Thanks for adding the --dense option.
However, at the moment there still seems to be a problem:
on a dense matrix with ~70k cells and 50 dimensions, it runs for more than 30 minutes on 32 cores and uses a vast amount of memory (~120 GB).

On the other hand, without the dense option it completes within a couple of minutes.

Given that it is already pretty fast without dense, I'm not even sure if it's worth supporting this option.

Cheers,
Gregor

Does it complete or hang?

grst commented

It does complete on the test matrix (2,000 cells), as fast as without --dense.
On the 70k-cell matrix I aborted after a bit more than 30 minutes.

Hmm, this might be worth looking into rather than removing an option that may be valuable in the future.

Is it possible to give me sample data that causes this issue?

grst commented

I don't want to share the data publicly, but I can make it available to you via email.

Here are the two full commands:

without dense

sudo time docker run -it -v /storage/scratch/toomanycells:/scratch:Z \
 gregoryschwartz/too-many-cells:0.1.1.0 make-tree -m /scratch/harmony.csv \
  --no-filter --normalization NoneNorm -o /scratch/out +RTS -N8

10:03.04 elapsed, memory usage ~2GB

with dense
It now even fails with a memory error on a ~51×70,000 matrix.

sudo time docker run -it -v /storage/scratch/toomanycells:/scratch:Z \
  gregoryschwartz/too-many-cells:0.1.1.0 make-tree -m /scratch/harmony.csv \
  --no-filter --normalization NoneNorm --dense -o /scratch/out +RTS -N8       

Sketching tree [=================================>..................................................]  40% 
too-many-cells: internal error: Unable to commit 1353711616 bytes of memory
    (GHC version 8.4.3 for x86_64_unknown_linux)
    Please report this as a GHC bug:  http://www.haskell.org/ghc/reportabug

Would you happen to have a smaller example? Reproducing the error with the smallest example possible is the best way for me to find the issue.

grst commented

It only appears to happen at a certain size.
Here's a subset of 20,000 cells which I think I can share here.
test_matrix.csv.gz

It completes with --dense, but without it the run takes only about half the runtime and 10% of the memory.

Most of the time is spent building the dense diagonal matrix, which would be 70,000 × 70,000 in the crashing example, taking up a huge amount of space. The sparse path works fine because that diagonal matrix is mostly zeros. I can probably find a way around this, but for now, don't use --dense with large numbers of cells.
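To put that size in perspective, here is a back-of-the-envelope sketch (not from the too-many-cells source) comparing the memory needed to materialize an n × n dense diagonal matrix of 8-byte doubles versus storing only its n diagonal entries:

```python
# Back-of-the-envelope memory cost for a diagonal matrix of doubles.
def dense_diag_bytes(n: int) -> int:
    # Every entry is materialized, including all the off-diagonal zeros.
    return n * n * 8

def sparse_diag_bytes(n: int) -> int:
    # Only the n nonzero diagonal entries are stored.
    return n * 8

n = 70_000
print(f"dense:  {dense_diag_bytes(n) / 1e9:.1f} GB")   # ~39.2 GB
print(f"sparse: {sparse_diag_bytes(n) / 1e6:.2f} MB")  # ~0.56 MB
```

A ~39 GB allocation for the dense diagonal alone is consistent with the "Unable to commit" crash above.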

grst commented

As I said, I'm fine with using sparse mode on the low-dimensional dense matrices, since it works and is reasonably fast.

Feel free to close the issue.

I think I fixed the issue in GregorySchwartz/spectral-clustering@aa07ea5, so it should now take up less space. Featured in 0.1.3.1, which is also on Docker. I'll close for now; if the problem persists, please let me know!