imsb-uke/scGAN

memory and time needed to process 68kPBMCs.h5ad


How much memory and time should be needed to process the 68kPBMCs dataset? I converted it to h5ad using the notebook in this repository and have been trying to process it with the commands provided (and a Singularity image built from the Docker image), but it seems to hang and then run out of memory (and time). I was testing with 8 GB and 30 min on 1 GPU; maybe I need to increase the memory and time allocations?
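
For reference, the conversion step I ran amounts to roughly the following sketch (the input path is an assumption based on the 10x Genomics 68k PBMC download; the actual notebook may differ):

import scanpy as sc

# Read the 10x matrix directory and write it out as .h5ad
# (the path is a placeholder for wherever the 68k PBMC data was extracted)
adata = sc.read_10x_mtx("filtered_matrices_mex/hg19/")
adata.write("68kPBMCs.h5ad")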

Hello,
yes, usually not only the dataset but also the models need to be kept in memory. Thus, it would be good if you could try with more memory (at least double) and more time. If that does not help, please tell me at which point the processing stops and copy the command-line output; that could help us debug whether there is a bug. Thanks in advance.
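
If it helps to size the allocation, a quick sketch like this reports how much memory the counts matrix alone takes once loaded (assuming the .h5ad stores a scipy sparse matrix; the path is a placeholder):

import scanpy as sc
from scipy import sparse

adata = sc.read_h5ad("68kPBMCs.h5ad")  # placeholder path

# For a CSR/CSC matrix, the in-memory footprint of the counts is roughly
# the combined size of its data, indices, and indptr arrays.
X = adata.X
if sparse.issparse(X):
    nbytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
else:
    nbytes = X.nbytes
print("Counts matrix: {:.2f} GB".format(nbytes / 1e9))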

I figured it out; it was a memory issue. However, it seems not even 16 GB of memory is enough to process the 68kPBMCs file. Would you be able to tell me how much memory and how many CPUs you used for preprocessing? I was not able to find it in the paper.

On a different note, I am now getting this error on a much smaller dataset (I subsetted the 68kPBMCs file to just the first 100 rows and columns by running anndata[0:100, 0:100] in the notebook):

multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/users/aguang/scGAN/preprocessing/write_tfrecords.py", line 146, in read_and_serialize
    sc_data.apply_preprocessing()
  File "/users/aguang/scGAN/preprocessing/process_raw.py", line 362, in apply_preprocessing
    self.clustering()
  File "/users/aguang/scGAN/preprocessing/process_raw.py", line 142, in clustering
    sc.pp.recipe_zheng17(clustered)
  File "/usr/local/lib/python3.5/dist-packages/scanpy/preprocessing/recipes.py", line 107, in recipe_zheng17
    adata.X, flavor='cell_ranger', n_top_genes=n_top_genes, log=False)
  File "/usr/local/lib/python3.5/dist-packages/scanpy/preprocessing/simple.py", line 340, in filter_genes_dispersion
    np.percentile(df['mean'], np.arange(10, 105, 5)), np.inf])
  File "/usr/local/lib/python3.5/dist-packages/pandas/core/reshape/tile.py", line 136, in cut
    dtype=dtype)
  File "/usr/local/lib/python3.5/dist-packages/pandas/core/reshape/tile.py", line 234, in _bins_to_cuts
    "the 'duplicates' kwarg".format(bins=bins))
ValueError: Bin edges must be unique: array([      -inf, 0.00666499, 0.00938967, 0.00938967, 0.01185446,
       0.01408451, 0.01725352, 0.0230047 , 0.02577465, 0.03677063,
       0.04813716, 0.05450704, 0.06125084, 0.0664118 , 0.069777  ,
       0.08798122, 0.15368545, 0.22179408, 0.27368627, 0.3314386 ,
              inf]).
You can drop duplicate edges by setting the 'duplicates' kwarg
"""

Dear @aguang,
I discussed this with the colleagues who actually performed the experiments, but they did not keep track of the memory requirements.
For CPU requirements, it mainly depends on the computational speed; since most preprocessing operations are single-threaded, increasing the number of CPUs does not make much difference. For memory, please try a larger allocation, but unfortunately we cannot give you an exact number.

For the second issue: during preprocessing, the scanpy function recipe_zheng17 is called with default parameters, which involves selecting the 1000 most highly variable genes. In your example the anndata object contains only 100 genes, so the call fails.
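
If you just want a smaller file for testing, a workaround is to subset cells only and keep all genes, so that the default n_top_genes=1000 can still be satisfied (a sketch; paths are placeholders):

import scanpy as sc

adata = sc.read_h5ad("68kPBMCs.h5ad")  # placeholder path

# recipe_zheng17 keeps the 1000 most variable genes by default, so the
# object must contain at least 1000 genes; subset cells only
small = adata[0:100, :].copy()
small.write("68kPBMCs_small.h5ad")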