MouseLand/cellpose

Cellpose 3 runs out of GPU memory where Cellpose 2 didn't

Closed this issue · 9 comments

I've been using Cellpose 2.2.2 to segment rather large images with tens of thousands of cells. I have now upgraded to Cellpose 3.0.7 to try out the new denoising/deblurring features. However, Cellpose 3 does not seem to handle large files in the same way.

I am using a laptop with an Intel Core i7-13700H and an Nvidia RTX 4050 6GB to segment, for example, an image that is about 9000x7000 pixels and contains around 75,000 cells.

  • In Cellpose 2, if I remember correctly, this would trigger a message in the command line along the lines of "image too large, computing masks to flows on CPU" and switch to the slower CPU processing, but eventually complete successfully.
  • In Cellpose 3, it simply fails with the error torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.71 GiB. GPU 0 has a total capacty of 6.00 GiB of which 1.16 GiB is free. Of the allocated memory 3.69 GiB is allocated by PyTorch, and 90.44 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
    If I switch off GPU processing manually, it segments the image successfully using the CPU.

Is there any way to switch back to the old behaviour in Cellpose 3? For me, this new behaviour is more inconvenient - I would like to use the GPU, if possible, and usually I use Cellpose via CLI to batch-process lots of images. Without the automatic GPU/CPU switching, I'd always need to estimate or manually check if it is small enough to run on the GPU, and then change it on a per-run basis.

Thanks a lot!
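
For anyone calling Cellpose from Python rather than the CLI, the fallback described above can be approximated by hand. This is only a minimal sketch, not Cellpose's own logic: `make_model` is a hypothetical factory you would write yourself (for example, one returning `models.CellposeModel(gpu=use_gpu, pretrained_model=...)`), and recognising the error by its message text is an assumption based on the error quoted above.

```python
def is_cuda_oom(exc):
    """Heuristic check: does this exception look like a CUDA out-of-memory error?"""
    return "out of memory" in str(exc).lower()


def eval_with_cpu_fallback(make_model, image, **eval_kwargs):
    """Try the GPU first; rerun on the CPU only if the GPU runs out of memory.

    make_model(use_gpu) is a hypothetical user-supplied factory that returns an
    object with an eval() method, e.g. a CellposeModel built with gpu=use_gpu.
    Returns (result, device), where device is "gpu" or "cpu".
    """
    try:
        return make_model(True).eval(image, **eval_kwargs), "gpu"
    except RuntimeError as exc:  # torch's CUDA OOM error subclasses RuntimeError
        if not is_cuda_oom(exc):
            raise  # unrelated failure: do not hide it
    # Reached only after a GPU out-of-memory error.
    return make_model(False).eval(image, **eval_kwargs), "cpu"
```

The CPU retry sits outside the `except` block so that the failed exception, and anything its traceback still references, can be released before the second attempt starts.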

Mine is doing the same thing, but for much smaller images (2048x2048). It seems it does not clear memory between segmentations. I don't have a "real" fix, but this is the notebook I'm running for now:

import gc
import os
import re

import numpy as np
import skimage.io
from scipy.io import savemat
from cellpose import denoise, io, models
from cellpose.io import logger_setup

logger_setup()

diam = 50
minarea = 200
filetype = '.tif'
model = models.CellposeModel(gpu=True, model_type='CPx')
channels = [0, 0]  # IF YOU HAVE GRAYSCALE

folders = r'D:\CellImages'


def list_files(dir):
    # Walk the folder tree and keep the .tif files whose path contains "20x" and "C3".
    r = []
    for root, dirs, files in os.walk(dir):
        for name in files:
            if name.endswith(filetype):
                r0 = os.path.join(root, name)
                if "20x" in r0 and "C3" in r0:
                    r.append(r0)
    return r


namess = list_files(folders)


def partitioncellpose(namelist):
    # Segment the files in batches of `pace` images, freeing memory after each batch.
    pace = 5
    for lastindx in range(0, len(namelist), pace):
        names = namelist[lastindx:lastindx + pace]
        imgs = [skimage.io.imread(impath, as_gray=True) for impath in names]
        #model = denoise.CellposeDenoiseModel(gpu=True, model_type="cyto3", restore_type="denoise_cyto3")
        #masks, flows, styles, imgs_dn = model.eval(imgs, diameter=diam, channels=channels, flow_threshold=0.8, cellprob_threshold=-6, do_3D=False, min_size=minarea, resample=True)
        #io.masks_flows_to_seg(imgs_dn, masks, flows, names, diam, channels)
        masks, flows, styles = model.eval(imgs, diameter=diam, channels=channels, flow_threshold=0.8, cellprob_threshold=-6, do_3D=False, min_size=minarea, resample=True)
        io.masks_flows_to_seg(imgs, masks, flows, names, diam, channels)
        #masks_flows_to_seg(images, masks, flows, file_names, diams, channels, imgs_restore, restore_type, ratio)

        # Convert each saved _seg.npy file to a MATLAB .mat file.
        for file in names:
            name = re.sub(r'\.tif$', '_seg.npy', file)
            dat = np.load(name, fix_imports=True, allow_pickle=True).item()
            name = re.sub('npy$', 'mat', name)
            savemat(name, dat)
        print(names)
        del masks, flows, styles, imgs, names
        gc.collect()
    return 0


partitioncellpose(namess)

I left the commented stuff intentionally because if you use the denoising+cyto3, it returns the denoised image as well.
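
The batching idea in the notebook can also be written more compactly, with PyTorch's cached GPU memory released explicitly between batches. This is a minimal sketch: `batches` and `free_gpu_memory` are illustrative names, and whether `torch.cuda.empty_cache()` actually prevents the out-of-memory error in this case has not been verified in this thread.

```python
import gc


def batches(items, size):
    """Yield consecutive slices of at most `size` items, covering every item once."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def free_gpu_memory():
    """Run the garbage collector, then release PyTorch's cached CUDA memory.

    Returns True if a CUDA cache was emptied, False if torch or CUDA is unavailable.
    """
    gc.collect()
    try:
        import torch
    except ImportError:
        return False
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        return True
    return False
```

Calling `free_gpu_memory()` at the end of each batch only helps after the Python references to that batch's results have been deleted, since memory that is still referenced cannot be released.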

@yassinharim some of the denoising features in CP3 are achieved with a CNN. The weights for the CNN and the actual CPnet and the image data all have to be held in GPU memory while evaluating. So, I'm not surprised that large images exceed the memory capacity. To confirm, if you use the version 3 gui and then only run the cyto3 model do you get memory errors?

We will look into this issue
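
As a rough sense of scale, the raw size of the arrays involved can be computed directly. This is a back-of-the-envelope sketch assuming float32 data; how many full-resolution arrays Cellpose keeps on the GPU at once is not stated in this thread, so `n_arrays` is purely a placeholder.

```python
def array_gib(height, width, n_arrays=1, bytes_per_value=4):
    """Size in GiB of n_arrays full-resolution arrays (float32 by default)."""
    return height * width * n_arrays * bytes_per_value / 2**30


# One float32 plane of the 9000x7000 image from the original post:
print(f"{array_gib(9000, 7000):.2f} GiB")  # prints 0.23 GiB
```

On a 6 GB card, a handful of such arrays plus the network weights and intermediate activations add up quickly.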

Thanks a lot for your reply @mrariden! Actually I did not test this with any version 3 features like denoising or the cyto3 model - I was just executing Cellpose via CLI to run a model that I trained based on the nuclei model. And the thing that I'm wondering about is that it used to work on version 2 because it would automatically switch to CPU processing - but in version 3, it simply fails and stops.

Since I'm quantifying nuclei, I don't think the cyto3 model would yield good results. Do you still want me to run cyto3 in the version 3 GUI just for troubleshooting issues, or was it just to clarify the scenario where the error occurred?

I tend to have GPU VRAM problems as well when processing large images, but one issue is actually in the diameter calibration step rather than the mask segmentation step. See below for my setup (using cellpose v3.05, with an image 10000x21000 px in size), which triggers the same calculation of flow threshold on the CPU.

[image not included]

However, this is NOT triggered during diameter calibration; a memory error is thrown instead, and so far I only have theories as to why this is the case.

ah good catch, please turn off the diameter calculation then. In practice we recommend users don't use this option, because they know the diameters of their samples.

I am going to close this issue for now but please let us know if you are still having problems

@carsen-stringer I am indeed. I think @ian-coccimiglio is experiencing a different issue than me: the memory error I encountered occurs during the actual segmentation, since I perform that step with a pre-set diameter. Specifically, I'm using the following command:

cellpose --image_path <path> --pretrained_model <model> --chan <channel> --diameter <diameter> --flow_threshold <ft> --cellprob_threshold <cpt> --verbose --save_tif --in_folders --use_gpu

Of course I could remove --use_gpu, but then my pipeline would be much slower overall, and I don't know how to calculate the memory requirement myself in order to add logic to the pipeline that decides whether to use GPU processing. In Cellpose 2, this CLI command would just fall back to the CPU if the GPU memory was exceeded. In Cellpose 3, it instead fails with the error given in my first post.
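
One way to add such logic without estimating memory up front is to try the GPU run first and repeat it on the CPU only if it fails. A minimal sketch, under these assumptions: the CLI exits with a non-zero status when the error occurs, the text "out of memory" appears on stderr, and `run_with_cpu_fallback` is a hypothetical helper, not part of Cellpose.

```python
import subprocess


def run_with_cpu_fallback(cmd, gpu_flag="--use_gpu", runner=subprocess.run):
    """Run cmd (a list of arguments); on a GPU out-of-memory failure, retry without gpu_flag.

    Returns "gpu" or "cpu" depending on which run completed.
    Matching "out of memory" in stderr is an assumption about the error output.
    """
    first = runner(cmd, capture_output=True, text=True)
    if first.returncode == 0:
        return "gpu" if gpu_flag in cmd else "cpu"
    if gpu_flag not in cmd or "out of memory" not in first.stderr.lower():
        raise RuntimeError(f"command failed:\n{first.stderr}")
    # Same command with the GPU flag removed; check=True raises if this run fails too.
    cpu_cmd = [arg for arg in cmd if arg != gpu_flag]
    runner(cpu_cmd, capture_output=True, text=True, check=True)
    return "cpu"
```

In a batch pipeline this could be called once per image or folder, passing the same argument list as the command above. Note that `capture_output=True` hides the `--verbose` output from the console, and a failed GPU attempt still costs the time it took to fail.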

hmm, this code has not changed between cellpose2 and cellpose3 (code). Can you please specify the exact cellpose3 and cellpose2 versions you are using, and verify that this error occurs for the exact same images? Thanks

@carsen-stringer I set up a new environment to test cellpose3 again (3.0.11) alongside my normal installation of cellpose2 (2.3.2), with the exact same images and parameters. Indeed, it now just worked, with the same behaviour I am used to from cellpose2: falling back to CPU processing. So, solved from my side too, thank you!

One question that came up from this direct comparison: cellpose3 counted 57,293 cells whereas cellpose2 counted 57,072, again with the exact same image and parameters and running the same custom model (trained in cellpose2). It's a very small difference of course, but just to know: since I am not using the new features of cellpose3 like denoising, are there any other changes in cellpose3 (filtering, mask creation etc.) that could explain this difference?

@yassinharim going purely on the difference, some notes in this thread might be useful in understanding what could be going on: https://forum.image.sc/t/cellpose-gpu-returns-different-results-from-cpu/101199
Though it starts with CPU-GPU difference, some aspects there call out to the sensitivity of the result to floating point operations.

I suspect cellpose2 and 3 have different underlying library requirements (NumPy, PyTorch, etc.). These different versions could have different optimizations, resulting in different floating-point accumulations, which might round differently through the various threshold/comparison operations. This could happen even with exactly the same code in the Cellpose library implementation.
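
As a tiny illustration of that effect (plain Python, unrelated to Cellpose's actual code), summing the same three numbers in a different order gives results that differ in the last bit, which is enough to flip a threshold comparison:

```python
a = (0.1 + 0.2) + 0.3   # summed left to right
b = 0.1 + (0.2 + 0.3)   # same numbers, different grouping
threshold = 0.6

print(a == b)                        # False
print(a > threshold, b > threshold)  # True False
```

A pixel sitting right at a threshold could therefore end up on different sides of it under different library versions, which would be consistent with a small change in the final cell count.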