Inference Error when denoising all volumes
anudeepk17 opened this issue · 7 comments
Hello,
First, thanks for this great paper and the detailed repo.
I am training the network on DCE-MRI data after explicitly adding noise to the 4D data. My images are 320x320. I was able to train stages 1 and 2 successfully.
In stage 3, I am facing an error in model.py, line 223:
self.optG.load_state_dict(opt['optimizer'])
The error is: loaded state dict contains a parameter group that doesn't match the size of optimizer's group
I dug deeper and found that "initial_lr" was missing from self.optG.state_dict()['param_groups'], while the loaded dict opt['optimizer']['param_groups'] had it.
I thought the issue was that a freshly initialized optimizer was being given a trained optimizer's state, so I added a line after line 65 in model.py.
```
65 | self.optG = torch.optim.Adam(
         optim_params, lr=opt['train']["optimizer"]["lr"])
```
Line added:
```
66 | self.scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(self.optG, opt['train']['n_iter'], eta_min=opt['train']["optimizer"]["lr"])
```
After this addition, self.optG and opt['optimizer'] had the same size and the same parameter-group keys, yet the error persists.
Am I missing something, or was my approach wrong?
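For reference, this is how I compared the two sets of param_groups (a temporary debugging snippet I added next to the load_state_dict call, not part of the repo):

```python
# Temporary debug snippet (not part of the repo): compare the freshly built
# optimizer's param_groups with the ones stored in the checkpoint. A scheduler
# adds 'initial_lr' to every group, and the number of params per group must
# also match for load_state_dict to succeed.
for g_new, g_ckpt in zip(self.optG.state_dict()['param_groups'],
                         opt['optimizer']['param_groups']):
    print(sorted(g_new.keys()), 'vs', sorted(g_ckpt.keys()))
    print(len(g_new['params']), 'params vs', len(g_ckpt['params']), 'params')
```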
The changes I made for my setup:
I changed image_size to 320 in the .json files and uncommented the resize line in the transform in mri_dataset.py, because I did not want to downsize my data, and I reduced the batch size to 2 for my training setup.
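For reference, those dataset-side changes boil down to something like this (a sketch using the option names I touched; the exact nesting in the repo's .json files may differ):

```python
# Sketch of the dataset options I changed (illustrative; check the repo's .json configs).
dataset_opts = {
    "image_size": 320,  # keep the native 320x320 resolution instead of downsizing
    "batch_size": 2,    # reduced so training fits in GPU memory
}
```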
I thank you in advance for your time.
I saw the fix for that was setting resume_state to null; now I am facing another issue:
```
/srv/home/kumar256/.conda/envs/ddm2/lib/python3.7/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '
Traceback (most recent call last):
  File "train_diff_model.py", line 76, in <module>
    diffusion.optimize_parameters()
  File "/srv/home/kumar256/DDM2/DDM2/model/model.py", line 92, in optimize_parameters
    total_loss.backward()
  File "/srv/home/kumar256/.conda/envs/ddm2/lib/python3.7/site-packages/torch/tensor.py", line 489, in backward
    self, gradient, retain_graph, create_graph, inputs=inputs
  File "/srv/home/kumar256/.conda/envs/ddm2/lib/python3.7/site-packages/torch/autograd/__init__.py", line 190, in backward
    grad_tensors_ = _make_grads(tensors, grad_tensors_, is_grads_batched=False)
  File "/srv/home/kumar256/.conda/envs/ddm2/lib/python3.7/site-packages/torch/autograd/__init__.py", line 85, in _make_grads
    raise RuntimeError("grad can be implicitly created only for scalar outputs")
RuntimeError: grad can be implicitly created only for scalar outputs
```
Issue resolved: the problem was that I was using 2 GPU ids.
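For anyone who hits the same traceback: with two GPU ids the model is wrapped in nn.DataParallel, which gathers the per-GPU scalar losses into a 1-D tensor (hence the UserWarning), and calling .backward() on a non-scalar raises the error. A minimal reproduction and the obvious workarounds (a sketch, not the repo's code):

```python
import torch

# Stand-in for the losses gathered from two GPU replicas by nn.DataParallel.
total_loss = torch.rand(2, requires_grad=True)

# total_loss.backward()        # RuntimeError: grad can be implicitly created only for scalar outputs
total_loss.mean().backward()   # reducing to a scalar first works

# In my case the simplest fix was to set a single GPU id in the config,
# so no gathering happens and the loss stays a scalar.
```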
I was able to train all stages. Now I want to denoise the whole dataset [320x320x128x28], so I set
dataset_opt['val_volume_idx'] = 'all'
But in the .json used for training, my validation mask was [10, 28], which I believe is why I was getting denoised data of size [320x320x128x18].
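For clarity, the relevant dataset options as I understand them (a sketch using the names from this thread; the actual config keys may differ):

```python
# valid_mask [10, 28] covers volumes 10..27, i.e. 28 - 10 = 18 volumes,
# which matches the [320x320x128x18] output I was getting.
dataset_opt['val_volume_idx'] = 'all'  # iterate over every volume during denoising
dataset_opt['valid_mask'] = [10, 28]   # only the last 18 of the 28 volumes
```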
I wanted to denoise the whole dataset, so I changed the valid mask to [0, 28], but I got this error:
```
2303 done 3584 to go!!
Traceback (most recent call last):
  File "denoise.py", line 76, in <module>
    for step, val_data in enumerate(val_loader):
  File "/srv/home/kumar256/.conda/envs/ddm2/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/srv/home/kumar256/.conda/envs/ddm2/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1333, in _next_data
    return self._process_data(data)
  File "/srv/home/kumar256/.conda/envs/ddm2/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1359, in _process_data
    data.reraise()
  File "/srv/home/kumar256/.conda/envs/ddm2/lib/python3.7/site-packages/torch/_utils.py", line 543, in reraise
    raise exception
KeyError: Caught KeyError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/srv/home/kumar256/.conda/envs/ddm2/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "/srv/home/kumar256/.conda/envs/ddm2/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 58, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/srv/home/kumar256/.conda/envs/ddm2/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 58, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/srv/home/kumar256/DDM2/DDM2/data/mri_dataset.py", line 133, in __getitem__
    ret['matched_state'] = torch.zeros(1,) + self.matched_state[volume_idx][slice_idx]
KeyError: 18
```
Am I missing a step for denoising, or are there changes I need to make in the .json?
I have kept:
- resume_state in the path section as the path to the stage 3 model
- resume_state in the noise_model section as the path to the stage 1 model
- stage2file as the path to the stage 2 file
Hi! Thanks for your interest in our work! The error simply indicates that the matched state for index 18 (the 19th slice) cannot be found in the stage 2 processed file. This is expected, since you trained on only 18 slices in all three stages (including stage 2).
A quick fix could be to rerun stage 2 with the correct validation mask [0, 28] instead of [10, 28]; however, the denoising quality for the first 10 slices may not be guaranteed (since they were not trained on in stages 1 and 3). Another solution is to train everything from scratch again, with the correct validation mask of course.
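If you want to see up front which volumes would trigger this, you could list the ones without a matched state before launching a full denoising run (a hypothetical snippet for mri_dataset.py, assuming matched_state behaves like a dict keyed by volume index):

```python
# Hypothetical sanity check (not part of the repo): list the volumes that have
# no stage 2 matched state; any volume listed here will raise the KeyError
# when denoising with val_volume_idx = 'all'.
num_volumes = 28  # total volumes in the 4D dataset
missing = [v for v in range(num_volumes) if v not in self.matched_state]
if missing:
    print('No stage 2 matched state for volumes:', missing)
```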
So I would just like to confirm:
- That we cannot use the model obtained in stage 3 to denoise any data other than what it was trained on; for any new data we need to train all three stages.
- Should I change resume_state in the noise_model section to the stage 3 model?
- For stage 3, we need to keep resume_state as null in the path section.
Thank you for your great work.
- Yes. Our algorithm is an optimization-based method; it cannot be generalized (or generalizes poorly) to unseen data points.
- For stage 3 training, you need to specify the stage 1 resume_state. For inference, you don't need to.
- For initiating stage 3 training, you can keep resume_state as null (which means training the stage 3 model from scratch). For resuming training or for inference, you need to change resume_state to your checkpoint.
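Concretely, the relevant entries look roughly like this (a sketch using the section names from this thread; please verify the exact keys against the example configs in the repo):

```python
# Illustrative only; key names and paths are placeholders, not copied from the repo's .json files.
stage3_training = {
    "path":        {"resume_state": None},                         # null -> train stage 3 from scratch
    "noise_model": {"resume_state": "path/to/stage1_checkpoint"},  # stage 1 state is required for training
}
stage3_inference = {
    "path":        {"resume_state": "path/to/stage3_checkpoint"},  # the trained stage 3 model used for denoising
    # the stage 1 resume_state is not needed for inference
}
```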
Thank you again for your time and help.