dpeerlab/ENVI

MemoryError

Closed this issue · 7 comments

Hi, thank you for creating such a great tool! It works perfectly well with the test data, but when I try to run it on my data, I get this memory allocation error. Do you maybe have any suggestions on how to run ENVI for bigger datasets?

Code: ENVI_Model = ENVI.ENVI(spatial_data = st_data, sc_data = sc_data)

Error: numpy.core._exceptions.MemoryError: Unable to allocate 392. GiB for an array with shape (217184, 492, 492) and data type float64
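For reference, the requested allocation matches the array shape exactly (one 492 x 492 float64 matrix per cell); a quick back-of-envelope check in plain Python arithmetic:

# Back-of-envelope check of the requested allocation (plain Python arithmetic, no ENVI code)
n_cells, n_genes = 217_184, 492
bytes_float64 = n_cells * n_genes * n_genes * 8   # 8 bytes per float64 entry
print(f"{bytes_float64 / 1024**3:.0f} GiB")       # ~392 GiB, matching the error above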

Sbatch job parameters:
#SBATCH --job-name=enVI
#SBATCH --output=logs/test-%j.out
#SBATCH --error=logs/test-%j.err
#SBATCH --time=05:00:00
#SBATCH --gres=gpu:1
#SBATCH --mem=180G
#SBATCH --partition=c18g
#SBATCH --cpus-per-task=30
#SBATCH --signal=2
#SBATCH --nodes=1
#SBATCH --export=ALL

UPDATE
I also tried to run it on a more powerful cluster:

#SBATCH --job-name=enVI
#SBATCH --output=logs/subset-%j.out
#SBATCH --error=logs/subset-%j.err
#SBATCH --time=10:00:00
#SBATCH --mem=1000G
#SBATCH --cpus-per-task=80
#SBATCH --signal=2
#SBATCH --nodes=1
#SBATCH --export=ALL
#SBATCH --no-requeue

And I got an OOM error:
tensorflow.python.framework.errors_impl.ResourceExhaustedError: {{function_node _wrapped__SelfAdjointEigV2_device/job:localhost/replica:0/task:0/device:CPU:0}} OOM when allocating tensor with shape[217184,491,491] and type double on /job:localhost/replica:0/task:0/device:CPU:0 by allocator mklcpu [Op:SelfAdjointEigV2]

I have also already subsetted my datasets to roughly half their original size. Unfortunately, I am afraid that further subsetting will lead to a loss of biological meaning. If someone else has a similar problem, I would be very grateful for any suggestions on how to solve it :)

Hi!

The OOM error comes from the computation of the COVET matrices based on all 492 genes. We've now updated ENVI; unless specified otherwise, it should now base COVET on only the top 64 highly variable genes.

Please try again with the new version and let us know if you're still having errors.
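A minimal re-run sketch after updating; the scenvi module name and the num_cov_genes keyword are inferred from the traceback in the next comment, not from documented API, so treat them as assumptions:

# Sketch only: re-run the constructor after updating the package. Omitting num_cov_genes
# should give the new default of 64 highly variable genes; the keyword name is taken from
# the __init__ call in the traceback below and may differ between versions.
import scenvi

envi_model = scenvi.ENVI(spatial_data=st_data, sc_data=sc_data, num_cov_genes=64)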

Hi, thank you for your answer. I updated ENVI and tried again, but unfortunately my problem remains the same, at the "Computing Niche Covariance Matrices" step:
Traceback (most recent call last):
File ".../enVI/test.py", line 9, in
envi_model = scenvi.ENVI(spatial_data = st_data, sc_data = sc_data)
File ".../enviroments/envi/lib/python3.9/site-packages/scenvi/ENVI.py", line 122, in init
self.spatial_data.obsm['COVET'], self.spatial_data.obsm['COVET_SQRT'], self.CovGenes = compute_covet(self.spatial_data, self.k_nearest, num_cov_genes, cov_genes, spatial_key = spatial_key, batch_key = batch_key)
File ".../enviroments/envi/lib/python3.9/site-packages/scenvi/utils.py", line 233, in compute_covet
CovMats = CalcCovMats(spatial_data, k, genes = CovGenes, spatial_key = spatial_key, batch_key = batch_key)
File ".../enviroments/envi/lib/python3.9/site-packages/scenvi/utils.py", line 198, in CalcCovMats
CovMats = np.matmul(DistanceMatWeighted.transpose([0,2,1]), DistanceMatWeighted) / (kNN - 1)
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 196. GiB for an array with shape (217184, 492, 492) and data type float32

Hi!

If you have already run highly-variable-gene selection (sc.pp.highly_variable_genes), ENVI by default uses those genes as the basis for COVET. Is it possible that all genes in your st_data are already marked as HVGs? If so, can you first re-run sc.pp.highly_variable_genes with a higher threshold or fewer genes and try again?
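A quick way to check this, using standard scanpy/AnnData fields (nothing ENVI-specific):

# Assumes sc.pp.highly_variable_genes has been run, so the 'highly_variable' column exists.
# If the count equals st_data.n_vars (492 here), COVET would still be computed over all genes.
print(st_data.var['highly_variable'].sum(), "of", st_data.n_vars, "genes flagged as HVG")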

Yes, I did run sc.pp.highly_variable_genes and selected only the top 1,000 genes, but the error is the same whether I use 1,000, 3,000 or 10,000 genes :( Should I subset further? Do you have an approximate value for how many genes can be imputed? And would splitting the reference scRNA-seq data into smaller objects and then running them one by one in a loop make sense, biologically speaking?

Thank you in advance for your answer! :)

We mean sc.pp.highly_variable_genes on the spatial data (st_data) in your case, not the scRNA-seq. For the scRNA-seq, even with 3,000 genes you should see no issue. If you ran sc.pp.highly_variable_genes on the st_data and selected the top 1,000, it would just select all the genes (since you only have 492 genes in that dataset). Before running ENVI, first try running:

sc.pp.highly_variable_genes(st_data, n_top_genes = 64, layer = 'log')

Also, make sure the data in .X is not log-transformed, since ENVI expects unlogged counts.
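One way to satisfy both constraints, raw counts in .X and a logged layer for HVG selection, is sketched below; the layer name 'log' just matches the call above and is not something ENVI requires:

# Keep unlogged counts in .X (what ENVI expects) and log-transform only a separate layer
# that is then used for highly-variable-gene selection.
import scanpy as sc

st_data.layers['log'] = st_data.X.copy()
sc.pp.log1p(st_data, layer='log')                                  # logs only the 'log' layer
sc.pp.highly_variable_genes(st_data, n_top_genes=64, layer='log')  # flags 64 HVGs for COVET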

I see; however, I don't think that was the problem, because I had not run HVG selection on my spatial dataset before. But after I converted both sc_data.X and st_data.X from sparse matrices to NumPy arrays, it worked regardless of whether I ran the highly_variable_genes function or not. With that I can now close the issue. Thank you for your help and the great update. I hope we will do a lot of nice analyses with ENVI :)
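For anyone hitting the same error, the workaround described above in standard scipy/AnnData terms (a sketch of the poster's fix, not an ENVI requirement):

# Densify .X before constructing the ENVI model, as described in the comment above.
import scipy.sparse as sp

for adata in (st_data, sc_data):
    if sp.issparse(adata.X):
        adata.X = adata.X.toarray()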