juglab/EmbedSeg

RuntimeError: CUDA out of memory.

Saharkakavand opened this issue · 8 comments

I have 4 images, and the batch size is only 1, but when I start
begin_training(train_dataset_dict, val_dataset_dict, model_dict, loss_dict, configs), I get: RuntimeError: CUDA out of memory. Tried to allocate 128.00 MiB (GPU 0; 31.75 GiB total capacity; 30.71 GiB already allocated; 62.50 MiB free; 12.93 MiB cached). Please let me know how I can solve it.
Thanks

Hello @Saharkakavand. Thank you for opening an issue and trying out EmbedSeg!
Could you share the size of the original images and the crop_size which you used in 01-data.ipynb?
A trivial reason for this out of memory error could just be that additional notebooks are open - if so, shutting them down should release the GPU memory. This is how the Running tab would look ideally:
[Screenshot of the Jupyter Running tab]
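If it is unclear whether something else is still holding the GPU, a quick check from a fresh notebook cell could look like the sketch below (plain PyTorch calls, not part of EmbedSeg):

```python
# Minimal sketch (plain PyTorch, not part of EmbedSeg): check how much of the
# GPU is already occupied before starting training.
import torch

device = torch.device("cuda:0")
total_gib = torch.cuda.get_device_properties(device).total_memory / 1024 ** 3
allocated_gib = torch.cuda.memory_allocated(device) / 1024 ** 3

print("Total GPU memory:          %.2f GiB" % total_gib)
print("Allocated by this process: %.2f GiB" % allocated_gib)

# empty_cache() only releases cached blocks held by *this* process; memory held
# by another notebook kernel is only freed by shutting that kernel down.
torch.cuda.empty_cache()
```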

@MLbyML, thank you for your reply.
Original 3D image size: 713 x 806 x 714; crop size: 256 x 256 x 256.
There is no other notebook running, and before running the last cell to train the network there is no process on the GPU according to the nvidia-smi -l command.

Sorry, I have 5 images with these shapes in the data directory:
img.shape[0] = [713, 748, 797, 791, 972]
img.shape[1] = [806, 787, 677, 798, 364]
img.shape[2] = [714, 772, 772, 783, 816]
The crops have this size:
crops.shape[0] = 256
crops.shape[1] = 256
crops.shape[2] = 256

Okay, so these look like confocal volume images, since the size of the z dimension appears almost the same as the x and y dimensions - is that correct?
Just for reference, this set of notebooks runs the pipeline on in-situ specimens imaged under confocal microscopy.
I would have to dig a bit more into how GPU memory scales with crop_size and will get back to you. For now, I would recommend halving the crop_size - so something like 128 x 128 x 128 (X x Y x Z) if the z voxel size is roughly the same as the x and y voxel size. In case the z dimension is downsampled, you could also try 256 x 256 x 64 (X x Y x Z) and set the anisotropy_factor appropriately (see the sketch at the end of this comment).
Since you may have to generate the crops again, you can increase the speed_up factor to 3 or higher to get these crops generated quicker. Let me know if you have questions. Thank you!
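As a rough illustration of how the crop size and the anisotropy_factor relate (the variable names below are hypothetical, not the exact ones used in 01-data.ipynb):

```python
# Hypothetical values for illustration only -- in practice these come from the
# metadata of your volumes / the settings in 01-data.ipynb.
pixel_size_z = 2.0   # voxel size along z (e.g. in microns)
pixel_size_xy = 1.0  # voxel size along x and y

# anisotropy_factor describes how much coarser the sampling along z is
# compared to x/y.
anisotropy_factor = pixel_size_z / pixel_size_xy

if anisotropy_factor > 1.0:
    # z is downsampled relative to x/y -> use a flatter crop (X x Y x Z)
    crop_size_x, crop_size_y, crop_size_z = 256, 256, 64
else:
    # roughly isotropic voxels -> use a smaller cubic crop to fit in GPU memory
    crop_size_x, crop_size_y, crop_size_z = 128, 128, 128

print(anisotropy_factor, (crop_size_x, crop_size_y, crop_size_z))
```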

Hello @MLbyML, thank you for your reply.
I changed the crop size and it works, but after 60 epochs the train loss and val loss are still at 1.03.
I just tried to use the prediction code for the two images I have as a test set, but I get this error: RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR.

3-D `test` dataloader created! Accessing data from ../../../data/fiber/test/
Number of images in `test` directory is 2
Number of instances in `test` directory is 2
Number of center images in `test` directory is 0


Creating branched erfnet 3d with [6, 1] classes

0%| | 0/2 [00:25<?, ?it/s]


RuntimeError Traceback (most recent call last)
<ipython-input-...> in <module>()
----> 1 begin_evaluating(test_configs, verbose = True, avg_bg = avg_background_intensity/normalization_factor)

/beegfs/desy/user/kakavs/miniconda3/envs/fiber/lib/python3.7/site-packages/EmbedSeg/test.py in begin_evaluating(test_configs, verbose, mask_region, mask_intensity, avg_bg)
67 grid_x=test_configs['grid_x'], grid_y=test_configs['grid_y'], grid_z=test_configs['grid_z'],
68 pixel_x=test_configs['pixel_x'], pixel_y=test_configs['pixel_y'],pixel_z=test_configs['pixel_z'],
---> 69 one_hot=test_configs['dataset']['kwargs']['one_hot'], mask_region= mask_region, mask_intensity=mask_intensity, avg_bg = avg_bg)
70
71

/beegfs/desy/user/kakavs/miniconda3/envs/fiber/lib/python3.7/site-packages/EmbedSeg/test.py in test_3d(verbose, grid_x, grid_y, grid_z, pixel_x, pixel_y, pixel_z, one_hot, mask_region, mask_intensity, avg_bg)
255 output = torch.from_numpy(output_average).float().cuda()
256 else:
--> 257 output = model(im)
258
259 instance_map, predictions = cluster.cluster(output[0],

/beegfs/desy/user/kakavs/miniconda3/envs/fiber/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
491 result = self._slow_forward(*input, **kwargs)
492 else:
--> 493 result = self.forward(*input, **kwargs)
494 for hook in self._forward_hooks.values():
495 hook_result = hook(self, input, result)

/beegfs/desy/user/kakavs/miniconda3/envs/fiber/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py in forward(self, *inputs, **kwargs)
150 return self.module(*inputs[0], **kwargs[0])
151 replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
--> 152 outputs = self.parallel_apply(replicas, inputs, kwargs)
153 return self.gather(outputs, self.output_device)
154

/beegfs/desy/user/kakavs/miniconda3/envs/fiber/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py in parallel_apply(self, replicas, inputs, kwargs)
160
161 def parallel_apply(self, replicas, inputs, kwargs):
--> 162 return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
163
164 def gather(self, outputs, output_device):

/beegfs/desy/user/kakavs/miniconda3/envs/fiber/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py in parallel_apply(modules, inputs, kwargs_tup, devices)
81 output = results[i]
82 if isinstance(output, Exception):
---> 83 raise output
84 outputs.append(output)
85 return outputs

/beegfs/desy/user/kakavs/miniconda3/envs/fiber/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py in _worker(i, module, input, kwargs, device)
57 if not isinstance(input, (list, tuple)):
58 input = (input,)
---> 59 output = module(*input, **kwargs)
60 with lock:
61 results[i] = output

/beegfs/desy/user/kakavs/miniconda3/envs/fiber/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
491 result = self._slow_forward(*input, **kwargs)
492 else:
--> 493 result = self.forward(*input, **kwargs)
494 for hook in self._forward_hooks.values():
495 hook_result = hook(self, input, result)

/beegfs/desy/user/kakavs/miniconda3/envs/fiber/lib/python3.7/site-packages/EmbedSeg/models/BranchedERFNet_3d.py in forward(self, input, only_encode)
36 output = self.encoder(input)
37
---> 38 return torch.cat([decoder.forward(output) for decoder in self.decoders], 1)
39
40

/beegfs/desy/user/kakavs/miniconda3/envs/fiber/lib/python3.7/site-packages/EmbedSeg/models/BranchedERFNet_3d.py in <listcomp>(.0)
36 output = self.encoder(input)
37
---> 38 return torch.cat([decoder.forward(output) for decoder in self.decoders], 1)
39
40

/beegfs/desy/user/kakavs/miniconda3/envs/fiber/lib/python3.7/site-packages/EmbedSeg/models/erfnet_3d.py in forward(self, input)
141
142 for layer in self.layers:
--> 143 output = layer(output)
144
145 output = self.output_conv(output)

/beegfs/desy/user/kakavs/miniconda3/envs/fiber/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
491 result = self._slow_forward(*input, **kwargs)
492 else:
--> 493 result = self.forward(*input, **kwargs)
494 for hook in self._forward_hooks.values():
495 hook_result = hook(self, input, result)

/beegfs/desy/user/kakavs/miniconda3/envs/fiber/lib/python3.7/site-packages/EmbedSeg/models/erfnet_3d.py in forward(self, input)
49
50 def forward(self, input):
---> 51 output = self.conv3x1x1_1(input)
52 output = F.relu(output)
53 output = self.conv1x3x1_1(output)

/beegfs/desy/user/kakavs/miniconda3/envs/fiber/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
491 result = self._slow_forward(*input, **kwargs)
492 else:
--> 493 result = self.forward(*input, **kwargs)
494 for hook in self._forward_hooks.values():
495 hook_result = hook(self, input, result)

/beegfs/desy/user/kakavs/miniconda3/envs/fiber/lib/python3.7/site-packages/torch/nn/modules/conv.py in forward(self, input)
474 self.dilation, self.groups)
475 return F.conv3d(input, self.weight, self.bias, self.stride,
--> 476 self.padding, self.dilation, self.groups)
477
478

RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR

[Screenshot from 2021-07-06 11-28-47: training output]

Hello @Saharkakavand, thanks for giving this a go.
To me, the loss after 60 epochs looks reasonable. A better indicator of whether the training is stagnating is the loss.png saved at experiment/$data$-demo/.
The error message you pointed out seems to be coming from running the inference on multiple GPUs in parallel and might be a bug in the code that we need to fix - not sure at the moment.
Is GPU 0 being used for training while you try to run the prediction notebook simultaneously? If so, stopping the training notebook before running the prediction notebook could help (you can always resume training later from the last checkpoint by setting the resume_path variable). If this doesn't help, I will dig deeper and let you know.
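If the multi-GPU path really is the culprit, one possible workaround (just an assumption on my part, not something documented in EmbedSeg) is to expose only a single GPU to the prediction notebook, in the very first cell, before torch is imported:

```python
# Workaround sketch: make only GPU 0 visible to this process, so the
# nn.DataParallel wrapper seen in the traceback has nothing to parallelize over.
# Must run before `import torch` (or any EmbedSeg import).
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
```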

I stopped training and ran the test on the same GPU. I attached the loss.png; it didn't change much, as you can see:
[Screenshot from 2021-07-06 13-16-27: loss.png]

The IoU profile appears strange to me - my understanding is that the IoU shouldn't really go down unless the validation and training images differ quite a bit in appearance.
Would it be possible for me to look at the images and instance masks which you are training the network on? Maybe I can try running them on my setup here, if that helps? (If sharing the original images is not possible, would sharing downsampled versions of the images work?)