Inference with VNet not running - RuntimeError in aborted thread
thecobb opened this issue · 18 comments
I'm getting an error when trying to use VNet on a folder that has the TIFF files corresponding to a whole-brain scan of a cleared brain from a LaVision UltraMicroscope II, stitched together using Imaris and resaved as an image sequence using FIJI to shorten the filenames. I've also tried other models and have gotten different issues, but I'm less concerned about resolving those, since having VNet working would be sufficient.
The system I am using is a Windows 10 workstation with an AMD 5950X, an NVIDIA RTX 3090 Ti, 128 GB of RAM, and a 2 TB SSD with 1 TB available. In the error below it is using the CPU implementation, but I have encountered the same error using an install with GPU-enabled PyTorch.
The terminal output is pasted as text below, followed by screenshots of the settings and of the terminal itself.
(cellseg) C:\Users\free_\Documents\GitHub\CellSeg3d>napari
C:\Users\free_
C:/irfp_rename
Starting...
Using cpu device
Using torch :
1.11.0+cpu
Downloading the model from the M.W. Mathis Lab server http://deeplabcut.rowland.harvard.edu/cellseg3dmodels/VNet.tar.gz....
168828928B [00:15, 10931902.90B/s]
OS is Windows
Worker started at 14:27:39
Saving results to : C:/results
Worker is running...
TIFFReadDirectory: Warning, Unknown field with tag 50838 (0xc696) encountered.
TIFFReadDirectory: Warning, Unknown field with tag 50839 (0xc697) encountered.
TIFFReadDirectory: Warning, Unknown field with tag 50838 (0xc696) encountered.
TIFFReadDirectory: Warning, Unknown field with tag 50839 (0xc697) encountered.
C:\Users\free_\Anaconda3\envs\cellseg\lib\site-packages\superqt\utils\_qthreading.py:183: RuntimeWarning: RuntimeError in aborted thread: Can't downcast to a specialization of MetaDataObject!
Screenshots of the terminal output and CellSeg settings are below. I took the terminal screenshot after a stop request, because in the past the program has gotten stuck after the error related to the aborted thread.
After updating my PyTorch install to enable GPU support and running this on a folder of the same images, but without them having been renamed in FIJI (so they are .tif instead of .tiff), I get the following error instead: TypeError: 'ValueError' object is not subscriptable
I'm pasting my terminal output below. I've also noted that it says it's padding 2D data even though this is a 3D scan.
(cellseg) C:\Users\free_>napari
C:\Users\free_
C:\Users\free_
C:/irfp
Starting...
Using cuda device
Using torch :
1.11.0+cu113
Downloading the model from the M.W. Mathis Lab server http://deeplabcut.rowland.harvard.edu/cellseg3dmodels/VNet.tar.gz....
168828928B [00:15, 11092961.61B/s]
OS is Windows
Worker started at 16:59:15
Saving results to : C:/results
Worker is running...
Dimension of data for padding : 2D
Checking dimensions...
Image shape is (7993, 7382)
Padding sizes are [8192, 8192]
Parameters summary :
Model is : VNet
Window inference is disabled
Dataset loaded to CPU
Loading dataset...
Done
Loading weights...
Done
Inference started on image n°1...
TypeError Traceback (most recent call last)
File ~\Anaconda3\envs\cellseg\lib\site-packages\napari_cellseg3d\plugin_model_inference.py:589, in Inferer.start.<locals>.<lambda>(data=ValueError('expected 5D input (got 4D input)'))
570 self.window_inference_size = int(
571 self.window_size_choice.currentText()
572 )
574 self.worker = InferenceWorker(
575 device=device,
576 model_dict=model_dict,
(...)
586 stats_csv=self.stats_to_csv,
587 )
--> 589 yield_connect_show_res = lambda data: self.on_yield(
self = <napari_cellseg3d.plugin_model_inference.Inferer object at 0x000002E204BBF940>
data = ValueError('expected 5D input (got 4D input)')
590 data,
591 widget=self,
592 )
594 self.worker.started.connect(self.on_start)
595 self.worker.log_signal.connect(self.log.print_and_log)
File ~\Anaconda3\envs\cellseg\lib\site-packages\napari_cellseg3d\plugin_model_inference.py:659, in Inferer.on_yield(data=ValueError('expected 5D input (got 4D input)'), widget=<napari_cellseg3d.plugin_model_inference.Inferer object>)
648 """
649 Displays the inference results in napari as long as data["image_id"] is lower than nbr_to_show,
650 and updates the status report docked widget (namely the progress bar)
(...)
654 widget (QWidget): widget for accessing attributes
655 """
656 # viewer, progress, show_res, show_res_number, zoon, show_original
657
658 # check that viewer checkbox is on and that max number of displays has not been reached.
--> 659 image_id = data["image_id"]
data = ValueError('expected 5D input (got 4D input)')
660 model_name = data["model_name"]
661 total = len(widget.images_filepaths)
TypeError: 'ValueError' object is not subscriptable
Worker finished at 16:59:38
Empyting cache...
Cache emptied
Screenshots of the Napari Cellseg settings and terminal are included below
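From the traceback, it looks like the TypeError is secondary: the worker appears to yield the underlying ValueError('expected 5D input (got 4D input)') to on_yield, which then indexes it like a dict. A minimal sketch of what I think is happening (the names come from the traceback; the guard at the end is my own assumption, not the plugin's actual code):

```python
# Minimal reproduction of the secondary TypeError seen in the traceback:
# on_yield expects a result dict but receives the exception object itself.

def on_yield(data):
    # data is expected to look like {"image_id": 1, "model_name": "VNet", ...}
    return data["image_id"]  # fails if data is an exception instance

err = ValueError("expected 5D input (got 4D input)")
try:
    on_yield(err)
except TypeError as e:
    print(e)  # 'ValueError' object is not subscriptable

# A defensive guard (my assumption, not the plugin's actual fix):
def on_yield_guarded(data):
    if isinstance(data, Exception):
        raise data  # surface the real error instead of masking it
    return data["image_id"]
```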
Hello @thecobb, thank you for reporting these issues!
I'll look into the .tiff loading issues you mentioned in #21; otherwise, I'm intrigued as to why it considers your data to be 2D.
Just to confirm, regarding the images being "stitched together using Imaris and resaved as an image sequence using FIJI to shorten the filename": are you running inference on a folder of several 2D images, or on a folder with one or more 3D images?
Looking at the docs, I realise this might have been ambiguous: only inference on a folder of 3D images is currently supported. Loading a stack of 2D images is not supported for inference and training, unlike for review or cropping. I've updated the docs to reflect this; sorry if it was unclear.
If you'd find it useful to have the possibility to run inference on a 2D stack, please let me know, I can add this to the feature request list and work on implementing it.
(On a side note, if you are using a single 3D image, running inference on such a large volume might be problematic without window inference, as it would require a lot of memory to run 3D UNets on an image that size. If you do have access to a single 3D file, maybe try window inference, or crop a smaller piece of the volume, such as a 128- or 256-wide cube, and run inference on that. On Windows you can also monitor your RAM/VRAM usage with the Performance tab of the Task Manager to see whether memory is the issue, or use any similar monitoring utility on another OS.)
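For the cropping, a rough sketch with tifffile and numpy (the path and cube size here are placeholders, not something the plugin does for you):

```python
# Sketch: crop a 128-voxel-wide test cube from the centre of a large volume.
# The input path is a placeholder.
import numpy as np
import tifffile

volume = tifffile.imread("C:/irfp_multi/scan.tif")  # shape (Z, Y, X)
half = 64  # half-width of a 128-wide cube
centre = [s // 2 for s in volume.shape]
z0, y0, x0 = (max(c - half, 0) for c in centre)
cube = volume[z0:z0 + 2 * half, y0:y0 + 2 * half, x0:x0 + 2 * half]
tifffile.imwrite("C:/irfp_multi/scan_crop128.tif", cube)
```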
Hi C-Achard,
I was using a folder of 2D images that comprised a full z-stack. I have since switched to loading a folder containing only one 3D multi-page TIFF. Using the 3D .tif I get a new error, regardless of whether I use VNet or SegResNet; I'm including a [Dropbox Capture screencast](https://capture.dropbox.com/HSDHuEE3IS65lk6q) as well as screenshots below. I've also tried window inference; I had been using it previously, but purposely left it off in the examples above since execution never reached the point where either the RAM or the GPU memory was being filled. It would be useful to be able to run inference on a 2D stack, but since I can concatenate the images into a 3D TIFF file it's not a problem to modify my data for the inference function. Looking at my GPU and RAM usage, neither has increased since clicking Start in the inference menu.
Now the error is as follows:
C:\Users\free_\Anaconda3\envs\cellseg\lib\site-packages\superqt\utils\_qthreading.py:183: RuntimeWarning: RuntimeError in aborted thread: Can't downcast to a specialization of MetaDataObject!
The terminal output is as follows:
(cellseg) C:\Users\free_>napari
C:\Users\free_
C:/irfp_multi
Starting...
Using cuda device
Using torch :
1.11.0+cu113
Downloading the model from the M.W. Mathis Lab server http://deeplabcut.rowland.harvard.edu/cellseg3dmodels/TRAILMAP.tar.gz....
60538880B [00:05, 10415713.72B/s]
OS is Windows
Worker started at 13:47:15
Saving results to : C:/results
Worker is running...
TIFFReadDirectory: Warning, Unknown field with tag 50838 (0xc696) encountered.
TIFFReadDirectory: Warning, Unknown field with tag 50839 (0xc697) encountered.
TIFFReadDirectory: Warning, Unknown field with tag 50838 (0xc696) encountered.
TIFFReadDirectory: Warning, Unknown field with tag 50839 (0xc697) encountered.
C:\Users\free_\Anaconda3\envs\cellseg\lib\site-packages\superqt\utils\_qthreading.py:183: RuntimeWarning: RuntimeError in aborted thread: Can't downcast to a specialization of MetaDataObject!
Screenshots that might be relevant, including the settings, terminal, and .tif properties, are included below:
Given the error (and the fact that memory usage is not increasing), this seems to be a reader error, i.e. the reader chosen by MONAI's LoadImaged transform is unable to open your file. The unknown-field warnings also point towards this.
Do you know exactly which kind of .tiff file is being created by ImageJ, and are you able to open your file in the napari viewer?
The issue might be that it's a multi-page tiff; I do not think I have tested the plugin with this format.
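If you want to narrow it down, a quick check like this sketch (the path is a placeholder) would tell you whether tifffile itself can read the multi-page file, and whether MONAI's default reader can:

```python
# Sketch: check whether the multi-page tiff opens outside of the plugin.
# The path is a placeholder.
import tifffile
from monai.transforms import LoadImage

path = "C:/irfp_multi/scan.tif"

with tifffile.TiffFile(path) as tif:
    print("pages:", len(tif.pages))
    print("shape:", tif.series[0].shape)
    print("dtype:", tif.series[0].dtype)

# Then try MONAI's loader, which is what the plugin relies on:
img = LoadImage(image_only=True)(path)
print("MONAI loaded shape:", img.shape)
```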
If you are able to open your 2D images folder in napari (using "File > Open Folder"), you can then save the whole stack as a single 3D .tif file, which should work with the plugin (just tested with a folder of 2D .png files).
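For reference, the same stacking can also be done in a short script with tifffile (a sketch; the folder path and glob pattern are placeholders):

```python
# Sketch: stack a folder of 2D tiffs into a single 3D .tif file.
# Folder path and glob pattern are placeholders; sorting assumes
# zero-padded slice numbers in the filenames.
import glob
import numpy as np
import tifffile

files = sorted(glob.glob("C:/irfp/*.tif"))
volume = np.stack([tifffile.imread(f) for f in files], axis=0)  # (Z, Y, X)
tifffile.imwrite("C:/irfp/volume_3d.tif", volume)
```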
OK, I'll try using napari to save it.
Okay, great, let me know if this works
OK, I used tifffile within napari to save a subset of my scan as a .tif file (11 GB), which resolved the loading error. However, performing inference on that 11 GB file now uses 90-100 GB of RAM before inference starts, during the MONAI transforms stage, which makes using the full scan impossible (when I tried a 44 GB .tif file, it errored out during this stage saying it couldn't allocate 128 GB of RAM, which is the total amount on this workstation). Additionally, I've tried inference using a window size of 32 with VNet and a window size of 16 with SegResNet, and both errored out at the inference stage saying they tried to allocate 32 GB of VRAM when only 24 GB was available (screenshots below). I'm going to try a smaller window size again, and if that doesn't work I will use the "keep data on CPU" option.
(Concise answer: maybe try the "keep data on CPU" option, and reduce the image size.)
Okay, glad that using napari resolved the issue with the reader.
It seems it is erroring at the inference stage now, since you're getting the "Inference started on image n°1" message.
If it were crashing at the MONAI transform stage, it wouldn't get past the "Loading dataset" stage, which is where the transforms are used.
So it's most likely running out of memory when allocating space for the model itself, not the image.
An 11 GB image is still extremely large; from our earlier tests on a workstation with fairly similar specs (RTX A4000 + 128 GB RAM, iirc), you might be able to go up to a cube of ~256-512 pixels, but not much more.
3D UNets are very memory-intensive, especially VNet, so it's very easy to run out of memory; do you think you could try a much smaller volume? 8192x8192x128 is still an extremely large file for the models we provide, I'm afraid.
Sadly I do not have access to the workstation right now, so I cannot give you a more precise estimate of the upper bound on the image size I'm able to load successfully, but if you'd like I can do so later, once I'm able to.
What bothers me more is window inference not balancing the memory load more evenly, as I thought it would.
As mentioned in this thread on MONAI's GitHub page, you might get better results if you use the "Keep on CPU" option, as this does exactly what is suggested in the thread (i.e. keep the model in VRAM and the dataset in RAM).
It would still use the VRAM for the model, so it should be faster than running without CUDA; it simply uses the RAM rather than the VRAM for the dataset only.
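For context, this device split is the pattern MONAI exposes in sliding_window_inference; a rough sketch of it, with a placeholder model and sizes (not the plugin's exact call):

```python
# Sketch of MONAI's device split for window inference: each window is run
# on the GPU while the stitched output accumulates in RAM.
# The model, volume, and sizes are placeholders.
import torch
from monai.inferers import sliding_window_inference
from monai.networks.nets import VNet

model = VNet(spatial_dims=3, in_channels=1, out_channels=1).to("cuda")
volume = torch.rand(1, 1, 128, 512, 512)  # stays on the CPU

with torch.no_grad():
    output = sliding_window_inference(
        inputs=volume,
        roi_size=(64, 64, 64),
        sw_batch_size=1,
        predictor=model,
        sw_device="cuda",  # windows are moved to the GPU for the forward pass
        device="cpu",      # the full-size result is kept in RAM
    )
```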
Good news: it ran to completion using a window size of 16 with VNet, with both the "keep data on the CPU" option and the perform-thresholding option selected. GPU memory usage was low throughout (less than 3 GB of the 21 GB available), but CUDA utilization was constant at 88% for the roughly 4 hours the worker reports it took to complete inference. I would use the TRAILMAP_MS weights, since my guess is that model is less memory-intensive, but as mentioned in another issue there seems to be a mismatch between the weights loaded and the model. Is the window size selection not cubic? I can make the volume fewer than 100 planes in the z dimension, but I had thought it needed to be greater than 64 to allow a window size selection of 64.
On a side note, were you able to check whether the results seem satisfactory? It'd be very useful to get some feedback on model performance, if you wouldn't mind. No need for thorough benchmarking, just a quick visual check to see if the results make sense.
Quick update: TRAILMAP_MS is now functional on the #19 PR. If you'd like to try this model right now, you can grab the code by cloning the cy/log_weights_download branch and running pip install -e . in the cloned repo folder.
I'll try the TRAILMAP_MS model shortly. I've encountered a different error when running instance segmentation that seems to be driven by the "save stats to csv" option being selected. The error during this step is
TypeError: 'ValueError' object is not subscriptable
and I'm attaching a .txt file with the full terminal output, as well as screenshots of the terminal output and settings:
cellseg_segresnet_terminal_output.txt
In terms of the predicted output (using SegResNet), it seems there is both downsampling (by a factor of 2x) and a transpose that occurs, which makes the overlay of the predicted output not match the original .tif file. The predicted output has half as many z-steps and needs to be re-saved in napari using the transpose function to have the right orientation. I'm including screenshots of the predictions overlaid on the original image to show both the alignment problem (which may be fixable by upsampling the prediction using interpolation, so I will comment if that works; see the sketch after the screenshots below) and the difference in the number of z-planes between the original image and the prediction output. Overall I think the model is doing a decent job; I just need to get the predictions properly aligned with the original image.
Predicted output overlaid on original image in Napari:
Predicted output showing 48 z-steps:
Original image showing 101 z-steps:
Lastly, there appears to be a type of artifact near the top and bottom of the image, as seen in the screenshot below.
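The upsampling/transpose fix I have in mind would look something like this (a sketch; the zoom factors and axis order are guesses based on the shapes I'm seeing, and the paths are placeholders):

```python
# Sketch: upsample the prediction back onto the original grid and fix the
# axis order. Paths are placeholders; the transpose order is a guess.
import numpy as np
import tifffile
from scipy.ndimage import zoom

original = tifffile.imread("C:/results/original.tif")
pred = tifffile.imread("C:/results/prediction.tif")

pred = np.transpose(pred, (0, 2, 1))  # guessed axis swap
factors = [o / p for o, p in zip(original.shape, pred.shape)]
pred_up = zoom(pred, factors, order=1)  # linear interpolation
tifffile.imwrite("C:/results/prediction_aligned.tif", pred_up)
```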
One issue I'm having when comparing the outputs with the originals to see whether the predictions are satisfactory: if I have the "compare prediction with original in napari" option selected, it overloads the RAM at the display-in-napari step. I'll see if I can resolve this, starting by checking whether I can get the padding operation from MONAI to allow a 1:1 overlay of the original image stack with the predicted image stack.
Update: while the issue with saving the stats to .csv still exists, the downsampling issue appears to be resolved when I use a stack of 64 images instead of 101, so I think you're right that the .tif being processed needs a number of z-steps that is a power of 2.
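In case it helps anyone else, padding the z dimension to the next power of two before saving is a possible workaround (a sketch; that the plugin expects exactly this is my assumption, and the path is a placeholder):

```python
# Sketch: zero-pad the z dimension to the next power of two (e.g. 101 -> 128).
# The path is a placeholder.
import numpy as np
import tifffile

volume = tifffile.imread("C:/irfp/volume_3d.tif")
z = volume.shape[0]
z_target = 1 << (z - 1).bit_length()  # next power of two >= z
padded = np.pad(volume, ((0, z_target - z), (0, 0), (0, 0)))
tifffile.imwrite("C:/irfp/volume_3d_padded.tif", padded)
```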
Let's open a new issue for this then, since this thread is getting long and the original problem is resolved :) @thecobb, feel free to open a new one for the z-step problem.