GregorySchwartz/too-many-cells

index out of bounds error

grst opened this issue · 7 comments

grst commented

Hi,

I'm experiencing an index-out-of-bounds error when running too-many-cells on a csv file using the docker container:

sudo docker run -it -v /storage/scratch/toomanycells:/scratch gregoryschwartz/too-many-cells:0.1.0.0 make-tree -m /scratch/test.csv 

Sketching tree [=================================>..................................................]  40%
too-many-cells: ./Data/Vector/Generic.hs:245 ((!)): index out of bounds (0,0)
CallStack (from HasCallStack):
  error, called at ./Data/Vector/Internal/Check.hs:87:5 in vector-0.12.0.1-GGZqQZyzchy8YFPCF67wxL:Data.Vector.Internal.Check

Do you have an idea what might go wrong?

Here's the matrix: test.csv.gz. The data are the first 50 PCs for 2,000 immune cells.

GregorySchwartz commented

It's a csv file, so too-many-cells expects row and column names. Also, the initial column name should be empty, as that column names the rows. See the example section at https://gregoryschwartz.github.io/too-many-cells/. I'll try to add a better error message. Let me know if it works!
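A minimal sketch of the expected layout, with toy values (the real matrix would have cells as columns and features as rows): note that the first header cell is empty, because that column holds the row names.

```shell
# Toy csv in the shape too-many-cells expects: empty first header cell,
# column names for cells, row names in the first column.
cat > /tmp/test.csv <<'EOF'
,cell1,cell2,cell3
gene1,0,5,2
gene2,3,0,1
EOF
head -n 1 /tmp/test.csv
```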

GregorySchwartz commented

Wait, so that is one issue: you do need column and row names. However, the real issue is that you are using PCA output, so the default cell and gene filtering and normalization remove and scale entries incorrectly (the values aren't counts). The fix is to use --no-filter --normalization NoneNorm whenever the input isn't counts. I updated the documentation to reflect this option. Also, because the matrix is now dense, I can't make any guarantees about how long it will take to run.
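Putting that together with the original invocation, the adjusted command might look like the following sketch (the bind mount, image tag, and path are taken from the command in the issue; only the two flags are new):

```shell
# Hypothetical re-run of the original command with filtering and
# normalization disabled, since the matrix holds PCA values, not counts.
CMD="sudo docker run -it -v /storage/scratch/toomanycells:/scratch \
gregoryschwartz/too-many-cells:0.1.0.0 make-tree -m /scratch/test.csv \
--no-filter --normalization NoneNorm"
echo "$CMD"
```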

grst commented

Thanks, the filtering options did the trick! Now it does something :)

Concerning runtime:

  • Shouldn't it be a lot faster on the PCA matrix, even though it's "dense", simply because it has far fewer dimensions?
  • The docker container only uses one core, is that normal behaviour?

GregorySchwartz commented

It's optimized for sparse matrices, and the library I'm using stores them as an IntMap (IntMap Double), so it would probably be slower on dense data for multiplication and the like. I can see about allowing dense matrices (and thus using BLAS/LAPACK for much faster running). To use multiple cores, add +RTS -N${NUMCORES} to the end of the entire command; -N4, for instance, uses 4 cores, while just -N uses all available cores. Of course, mileage may vary depending on the nature of the data.
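As a concrete sketch, a 4-core run would append the GHC runtime flags after all the too-many-cells options (this assumes the flags pass through the docker entrypoint unchanged, as they did for the reporter):

```shell
# Hypothetical multicore invocation: the +RTS -N flags go at the very
# end of the command, after every too-many-cells option.
CORES=4
CMD="too-many-cells make-tree -m /scratch/test.csv \
--no-filter --normalization NoneNorm +RTS -N${CORES}"
echo "$CMD"
```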

grst commented

Thanks, the multiprocessing seems to work now!

I think using BLAS with dense matrices would be a great enhancement. My use-case is that I used external tools (scanpy) for filtering and removing confounding factors. Additionally, some recent tools for combining multiple datasets (Scanorama, Harmony) work in PCA-space, and I think it would be great to apply too-many-cells to large, integrated datasets.

Concerning runtime:
The documentation says it takes "some time" to generate the tree. What ballpark runtime (hours/days/weeks?) can I expect on a (sparse) dataset with 10k cells on a server with 32 cores?

GregorySchwartz commented

Thanks for using it! Let me know if you have any difficulties or want added features! A ballpark runtime for 10k cells in a sparse scRNA-seq count matrix would be around 20 minutes on a single core; 50k maybe a few hours. I have not tested multi-core; it may actually be slower depending on the fine-grain or coarse-grain nature of the beast.

GregorySchwartz commented

I'll close this since the initial issue has been resolved. If you have another issue, please open up a separate one.