Training Models with Custom Labeled Data
NickDiNapoli opened this issue · 11 comments
Hi there,
I am trying to train a model via the GUI. I started with SegResNet because I figured it would be the simplest to start with. I have a 3D labeled dataset that I converted to your required input .tif format; the images and labels are all of shape (7, 1150, 1150). After setting the training parameters (mostly the defaults for now), I receive the following error upon starting training.
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
10:51:09 INFO config model : SegResNet
10:51:09 WARNING Labels are not semantic, but instance. Converting to semantic, this might cause errors.
10:51:10 WARNING Warning : a very large dimension for automatic padding has been computed.
Ensure your images are of an appropriate size and/or that you have enough memory.The padding value is currently 1024.
10:51:10 WARNING Warning : a very large dimension for automatic padding has been computed.
Ensure your images are of an appropriate size and/or that you have enough memory.The padding value is currently 2048.
10:51:10 WARNING Warning : a very large dimension for automatic padding has been computed.
Ensure your images are of an appropriate size and/or that you have enough memory.The padding value is currently 1024.
10:51:10 WARNING Warning : a very large dimension for automatic padding has been computed.
Ensure your images are of an appropriate size and/or that you have enough memory.The padding value is currently 2048.
10:51:10 WARNING Warning : a very large dimension for automatic padding has been computed.
Ensure your images are of an appropriate size and/or that you have enough memory.The padding value is currently 4096.
10:51:10 WARNING Warning : a very large dimension for automatic padding has been computed.
Ensure your images are of an appropriate size and/or that you have enough memory.The padding value is currently 8192.
Loading dataset: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 85/85 [00:00<?, ?it/s]
Loading dataset: 0%| | 0/22 [00:03<?, ?it/s]
10:51:14 ERROR Error in training
.
.
.
RuntimeError: quantile() input tensor is too large
.
.
.
RuntimeError: applying transform <napari_cellseg3d.code_models.workers_utils.QuantileNormalizationd object at 0x000001BE118DB4C0>
10:51:14 INFO WORKER ERRORED at 10:51:14
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
It seems the preprocessing step is padding my images or labels for a reason I don't understand. So my questions are:
- Is there a required input shape, such that I can just do the padding myself?
- Is there a required array order that I am missing? E.g. (z, x, y), which I currently have, versus (z, x, y, channel), (channel, z, x, y), (x, y, z), etc.
Thank you for any insight.
Hello,
Thanks for trying the plugin! There are a couple of things that might help, based on the log:
- The input size is indeed too large. I'm not sure why the padding goes to such high numbers, but either way 7x1150x1150 is most likely too big to fit into memory (or to pass through the quantile normalization, which is why it errors). I would recommend cropping to volumes of around 7x64x64; you can use the Fragment utility of the plugin to quickly achieve this on a folder of images and labels.
- You do not need to worry about channels or the order of the axes; here the issue is purely the size of the images.
- Something to note is that SegResNet and SwinUNetR both have an input size of 64x64x64 by default. However, this is only the FoV of the model, not a strict requirement on input size, but please keep it in mind: if your objects are far larger than 64 pixels, you may want to downsample your images so that the model can capture your target objects fully in its FoV. This might help achieve better performance, depending on your data and labels (see the rough downsampling sketch after this list).
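If it helps, downsampling a stack could look roughly like this (an untested sketch, not plugin code; it assumes tifffile and scikit-image are installed, that the arrays are ordered (z, y, x), and that the paths and target size are placeholders):

from pathlib import Path

import tifffile
from skimage.transform import resize

# Hypothetical paths and target size -- adjust to your data
input_path = Path("path/to/tile_0.tif")
output_path = Path("path/to/tile_0_downsampled.tif")
target_xy = 128  # desired in-plane size after downsampling

volume = tifffile.imread(input_path)  # e.g. shape (7, 1150, 1150), ordered (z, y, x)
downsampled = resize(
    volume,
    (volume.shape[0], target_xy, target_xy),  # keep z, shrink y and x
    order=1,               # linear interpolation for image volumes
    anti_aliasing=True,
    preserve_range=True,
).astype(volume.dtype)
tifffile.imwrite(output_path, downsampled)

For label volumes you would use order=0 and anti_aliasing=False instead, so that the integer label values are preserved.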
I hope this helps; please let me know if you are able to solve this or if you need further help. If issues with padding reoccur, I can try to reproduce the error and fix it.
Have a great day !
Best,
Cyril
Hi Cyril,
Thanks for your prompt reply and detailed response -- this was tremendously helpful. I am going to give those approaches a go. Out of curiosity, I wanted to follow up quickly and ask: if I downsample by a factor of ~10 and train on 7x128x128 volumes, can I conduct inference on arbitrarily sized images? I assume the closer I can get to what the model was trained on, the better the performance will be. Our resolution is quite high (~0.108 um/px), so downsampling to 64 would likely be too risky for capturing entire cells.
Update: I used the Fragment utility to create a new folder with the fragmented images and a separate one with the corresponding labels. The output is a folder in which each initial tile becomes its own subfolder, and that subfolder contains a series of tif stacks (in my case 7x128x128x5). The problem is that when I go to train a model using the fragmented data, the plugin does not allow me to choose the outermost folder as the images folder, because the Fragment utility creates subfolders with the images instead of placing the tif stacks directly in the output folder. Is there a way to have Fragment not create subfolders? Otherwise it seems I will have to write a script to move the fragmented images into the master folder.
Here is an example of the file org that the Fragment utility created:
-->tiff_stacks_fragmented (the empty folder I created as output for Fragment utility)
------>tile_0_fragmented_13_05_16 (folder)
---------->tile_0_fragmented_0 (tif stack)
---------->tile_0_fragmented_1 (tif stack)
---------->tile_0_fragmented_2 (tif stack)
.
.
.
------>tile_1_fragmented_13_05_17 (folder)
---------->tile_0_fragmented_0 (tif stack)
---------->tile_0_fragmented_1 (tif stack)
---------->tile_0_fragmented_2 (tif stack)
.
.
.
Hello again! Glad to have helped; most likely training on 7x128x128 would be fine, yes.
You are correct that inference is a bit more flexible in terms of input size, but performance will depend on what the model has been trained on.
For downsampling, do be careful to keep any structures of interest large enough to be captured by the model - but what that means in practice is very specific to your data.
Regarding the Fragment output folders: yes, sorry, I grouped the files by source image to avoid confusion, but it's true that this layout is not directly compatible with training.
You could use something like this, though (I'm afraid I cannot test it right now, but hopefully it is close to what you need):
from pathlib import Path
import shutil

# Folder containing the per-tile subfolders created by the Fragment utility
source_folder = Path('path/to/source/folder')
folder_pattern = "_fragmented"
# All matching .tif files will be copied into this single flat folder
target_folder = source_folder / "all_files"
target_extension = {".tif", ".tiff"}

if not target_folder.exists():
    target_folder.mkdir()

for folder in source_folder.iterdir():
    if folder.is_dir() and folder_pattern in folder.name:
        for file in folder.iterdir():
            if file.suffix in target_extension:
                shutil.copy(file, target_folder)
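Running it once on the fragmented images folder and once on the fragmented labels folder (adjusting source_folder each time) should leave you with two flat folders you can select for training.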
I hope this helps, please let me know if I can help further.
Best,
Cyril
Thank you for your detailed responses, Cyril; I appreciate it. The snippet you provided is pretty much identical to what I had to do to restructure the training data. Once I do so, I still get errors upon starting training. They aren't very informative, so I was hoping you could give me the exact spec for the file structure required for training. Right now my file structure is as follows:
Images directory:
-->tiff_stacks_fragmented/all_files/ (folder)
---->tile_0_fragmented_0 (tif stack)
---->tile_0_fragmented_1 (tif stack)
.
.
.
Labels directory:
-->tiff_stack_labels_fragmented/all_files/ (folder)
---->tile_0_fragmented_0 (tif stack)
---->tile_0_fragmented_1 (tif stack)
.
.
.
Here is the log:
12:44:08 WARNING Image and label paths are not correctly set
12:44:08 WARNING Aborting, please set all required paths
Hello again, sorry for the late answer.
This is indeed strange; it seems the file discovery is not working, i.e. the plugin considers that the folders do not contain any images, hence the warning message.
There is no "strict" file structure: the plugin expects one "images" folder and one "labels" folder, with the same number of .tif files in each and matching order when sorted alphabetically (see the small check snippet after the list below).
What confuses me further is that it seemed to work previously with your original files.
Do you have any other output from the terminal that is not shown in the log? There may be more info there; if there isn't, perhaps we can enable debug logging to get more detailed info.
Cases where the plugin would fail to register image files that come to mind would be:
- Something went wrong with the Fragment utility: are you able to open the files in napari? Do they look to be the correct size? You can use viewer.layers.selection.active.data.shape in the napari console to check the shape of the selected layer (click on it first).
- Is the file type ".tif"? (I think there was an issue with ".tiff" files not being recognized some time ago, but if I'm not mistaken it should be fixed.)
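If useful, here is a quick way to check that both folders line up (an untested sketch; the folder paths are placeholders based on your structure above):

from pathlib import Path

# Hypothetical folder paths -- adjust to your setup
images_dir = Path("tiff_stacks_fragmented/all_files")
labels_dir = Path("tiff_stack_labels_fragmented/all_files")

image_files = sorted(p.name for p in images_dir.glob("*.tif"))
label_files = sorted(p.name for p in labels_dir.glob("*.tif"))

print(f"{len(image_files)} images, {len(label_files)} labels")
# Print the pairs the plugin would form when both lists are sorted alphabetically
for image_name, label_name in zip(image_files, label_files):
    print(f"{image_name}  <->  {label_name}")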
Please let me know if that helps; otherwise I can quickly work on enabling debug logging.
Sorry for the continued trouble,
Cyril
Hello @NickDiNapoli,
I will close this for now; please let me know if you need further help with your issue.
Best,
Cyril
Hi @C-Achard,
I did have a mismatch in the number of files because some of the stacks caused an error during fragmentation (a separate issue but I am ignoring for now and skipping the inclusion of those tiles).
I then went back and triple-checked the plugin requirements. One folder has the image stacks as .tif files of shape (7, 128, 128, 5) and the other has the label stacks of shape (7, 128, 128). Both folders now have the same number of files, and I can select them for model training. After starting the training, the tiles are successfully loaded according to the progress bars, and shortly after I receive another error. I am copying the logs below. Based on your experience, is there still something I am missing or doing wrong in my data preprocessing?
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
17:59:52 INFO Deterministic training is enabled
17:59:52 INFO Seed is 42
17:59:52 INFO Training for 50 epochs
17:59:52 INFO Loss function is : Dice
17:59:52 INFO Validation is performed every 2 epochs
17:59:52 INFO Batch size is 1
17:59:52 INFO Learning rate is 0.001
17:59:52 INFO Using whole images as dataset
17:59:52 INFO ----------
17:59:52 INFO Epoch 1/50
18:00:20 ERROR Error in training
Traceback (most recent call last):
File "c:\users\vizgen\venv\lib\site-packages\napari_cellseg3d\code_models\worker_training.py", line 1513, in train
outputs = model(inputs)
File "c:\users\vizgen\venv\lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "c:\users\vizgen\venv\lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "c:\users\vizgen\venv\lib\site-packages\napari_cellseg3d\code_models\models\model_SegResNet.py", line 29, in forward
res = SegResNetVAE.forward(self, x)
File "c:\users\vizgen\venv\lib\site-packages\monai\networks\nets\segresnet.py", line 335, in forward
vae_loss = self._get_vae_loss(net_input, vae_input)
File "c:\users\vizgen\venv\lib\site-packages\monai\networks\nets\segresnet.py", line 294, in _get_vae_loss
x_vae = x_vae.view(-1, self.vae_fc1.in_features)
File "c:\users\vizgen\venv\lib\site-packages\monai\data\meta_tensor.py", line 282, in torch_function
ret = super().torch_function(func, types, args, kwargs)
File "c:\users\vizgen\venv\lib\site-packages\torch_tensor.py", line 1443, in torch_function
ret = func(*args, **kwargs)
RuntimeError: shape '[-1, 0]' is invalid for input of size 65536
18:00:20 ERROR shape '[-1, 0]' is invalid for input of size 65536
Traceback (most recent call last):
File "c:\users\vizgen\venv\lib\site-packages\napari_cellseg3d\code_models\worker_training.py", line 1513, in train
outputs = model(inputs)
File "c:\users\vizgen\venv\lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "c:\users\vizgen\venv\lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "c:\users\vizgen\venv\lib\site-packages\napari_cellseg3d\code_models\models\model_SegResNet.py", line 29, in forward
res = SegResNetVAE.forward(self, x)
File "c:\users\vizgen\venv\lib\site-packages\monai\networks\nets\segresnet.py", line 335, in forward
vae_loss = self._get_vae_loss(net_input, vae_input)
File "c:\users\vizgen\venv\lib\site-packages\monai\networks\nets\segresnet.py", line 294, in _get_vae_loss
x_vae = x_vae.view(-1, self.vae_fc1.in_features)
File "c:\users\vizgen\venv\lib\site-packages\monai\data\meta_tensor.py", line 282, in torch_function
ret = super().torch_function(func, types, args, kwargs)
File "c:\users\vizgen\venv\lib\site-packages\torch_tensor.py", line 1443, in torch_function
ret = func(*args, **kwargs)
RuntimeError: shape '[-1, 0]' is invalid for input of size 65536
18:00:20 INFO WORKER ERRORED at 18:00:20
18:00:24 INFO ********************
18:00:24 INFO
Worker finished at 18:00:24
18:00:24 INFO Saving in C:\Users\Vizgen\Desktop\NickDiNapoli\SegResNet_Dice_50e_2024_07_12_17_55_57
18:00:24 INFO Saving last loss plot
18:00:24 INFO Saving log
18:00:24 INFO Done
18:00:24 INFO **********
18:00:24 WARNING Error while saving CSV report: 'Error when making csv. Check loss dict keys ?'
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Hello @NickDiNapoli ,
Sorry for the late answer. It looks like you got much further in training this time! I think the problem here is the dimensionality of the image stacks - the error is likely due to the fact that you're sending in 5 channels at once.
I think keeping only the channel on which you want to run the segmentation (I assume not all of the other channels are suitable/wanted for segmentation?) would be best, if possible.
Otherwise you could split each volume into five 3D volumes, one per channel (see the rough sketch below).
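Splitting could look roughly like this (an untested sketch; it assumes tifffile is installed, that the channel axis is last as in your (7, 128, 128, 5) files, and that the paths are placeholders):

from pathlib import Path

import tifffile

# Hypothetical paths -- adjust to your data
input_path = Path("all_files/tile_0_fragmented_0.tif")
output_dir = Path("all_files_single_channel")
output_dir.mkdir(exist_ok=True)

stack = tifffile.imread(input_path)  # expected shape (7, 128, 128, 5), channel axis last
for channel in range(stack.shape[-1]):
    # Write one 3D (z, y, x) volume per channel
    tifffile.imwrite(
        output_dir / f"{input_path.stem}_ch{channel}.tif",
        stack[..., channel],
    )

If only one channel is actually relevant for segmentation, keeping just stack[..., wanted_channel] and discarding the rest is even simpler.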
I hope this helps !
Best,
Cyril
Thank you @C-Achard -- I think you hit the nail on the head about what is going on. I initially thought the models were flexible enough to handle multi-channel segmentation or arbitrarily sized input, but I suppose that is not the case.