Stage3 Freeze
Hi Tiange,
I'm having an issue while running stage 3. The run has been stuck at the following step for 2 days now. I updated to PyTorch 2.0. Have you encountered this issue?
23-04-04 10:35:55.230 - INFO: Note: NumExpr detected 40 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
23-04-04 10:35:55.231 - INFO: NumExpr defaulting to 8 threads.
23-04-04 10:36:04.536 - INFO: MRI dataset [hardi] is created.
23-04-04 10:36:05.309 - INFO: MRI dataset [hardi] is created.
23-04-04 10:36:05.309 - INFO: Initial Dataset Finished
23-04-04 10:36:05.309 - INFO: ('2.0.0', '11.8')
23-04-04 10:36:12.078 - INFO: Initialization method [orthogonal]
Hi, this seems like a very weird problem. It looks like the model fails to forward the input data completely after initialization. I think it is probably due to a package version mismatch, which leads to errors in building the model or forwarding data.
Does the Stage 1 model run normally under the same environment? If so, the problem can be narrowed down to particular operators/functions (since their behaviors or hardware dependencies may differ between versions). A workaround may be to create a new environment from the provided environment file exactly as it is.
Please let me know if the problem still cannot be resolved.
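One way to narrow down where a frozen run is stuck, as a minimal sketch: Python's standard `faulthandler` module can periodically dump the stack of every thread. The helper names `arm_hang_watchdog` and `disarm_hang_watchdog` are hypothetical and not part of the repository, and the 600-second default is an arbitrary assumption.

```python
import faulthandler
import sys


def arm_hang_watchdog(timeout_s=600.0, stream=sys.stderr):
    """Dump the stack of every thread to `stream` each time `timeout_s` elapses.

    Hypothetical helper (not part of the repository). `stream` must be a real
    file with a file descriptor, such as sys.stderr or an open log file.
    """
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=stream)


def disarm_hang_watchdog():
    """Cancel the periodic dump once the suspicious section has finished."""
    faulthandler.cancel_dump_traceback_later()
```

Calling `arm_hang_watchdog()` at the top of the training script would print, every ten minutes, the line each thread is executing, which should show whether the freeze sits in model construction, data loading, or the forward pass.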
Hi, no, I didn't notice this in stage 1. I'll create the new environment from the file and let you know if the problem persists. Thanks again!
Hi again,
I'm not able to create the environment from the yml provided. Could you please clean it up, meaning keep only the main packages such as pytorch, numpy, etc.? It lists many dependencies that are downloaded by default anyway.
Here is the error report:
Collecting package metadata (repodata.json): done
Solving environment: failed
ResolvePackageNotFound:
- dipy==1.5.0
- scipy==1.8.0
- pydicom==2.3.0
- tqdm==4.63.1
- torch==1.8.0
- opencv-python==4.5.4.58
Hi, it would be risky for us to simply delete the packages that 'seem not to be used', since one package could be a dependency of another package even though it is not used directly. To avoid breaking the entire environment chain, I think we should leave them as they are for now.
That said, we did clean up the dependencies and updated the 'environment.yaml' file, and we have tested the new environment on two separate machines. Can you please try again and let me know if the problem still exists?
Thank you!
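To see which pins an existing environment fails to satisfy before rebuilding it, a small check along these lines may help. This is a sketch only: `find_pin_mismatches` is a hypothetical helper, the pins are copied from the error report above, and the check relies on the standard `importlib.metadata` module.

```python
from importlib import metadata


def find_pin_mismatches(pins, get_version=metadata.version):
    """Return {package: (wanted, found)} for every pin that is not satisfied.

    `pins` maps distribution names to exact versions; `found` is None when
    the package is not installed. Hypothetical helper, not from the repository.
    """
    mismatches = {}
    for name, wanted in pins.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


# Pins copied from the ResolvePackageNotFound report in this thread.
PINS = {
    "dipy": "1.5.0",
    "scipy": "1.8.0",
    "pydicom": "2.3.0",
    "tqdm": "4.63.1",
    "torch": "1.8.0",
    "opencv-python": "4.5.4.58",
}

if __name__ == "__main__":
    for name, (wanted, found) in find_pin_mismatches(PINS).items():
        print(f"{name}: wanted {wanted}, found {found}")
```

The comparison is an exact string match, so a build such as a CUDA-specific wheel with a local version suffix would also be reported as a mismatch.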
Ok, thank you! I'll give it a try.
Hi,
The problem is solved and I can run it, but I have other issues now related to GPU compatibility. I saw that you already have that issue reported. Perhaps I need to build PyTorch from source. I'll try this. Thanks again for your input.
Hi again!
Sorry for so many questions. I have another question regarding the number of gradients (usually the 4th dimension of raw_input) to regress. Is it controlled through the "datasets:padding" entry in the config? If I set padding = 5, then output = torch.Size([32, 1, 260, 260]) and input = torch.Size([32, 4, 260, 260]). For padding = 6, they are torch.Size([32, 1, 260, 260]) and torch.Size([32, 6, 260, 260]). My question is: only an odd padding value is allowed, right?
I have 6 gradient directions, and I can't get output = torch.Size([32, 1, 260, 260]) with input = torch.Size([32, 5, 260, 260]). Please correct me if I'm wrong.
Thanks again! :)
Hi!
I've another question regarding the number of gradients (usually 4th dimension of raw_input) to regress. it's controlled through "datasets:padding" entry in the config?
The padding value is the number of gradients passed to the network as inputs. In our paper, we mainly use padding=3, which means each slice is denoised based on the information provided by the surrounding 3 slices with 3 different gradients. So yes, the padding value equals the number of input channels. However, it does not directly correspond to the total number of gradients in the dataset (normally it should be smaller than the total).
My question here is only an odd number of padding is authorized right?
I think it's more reasonable to use an odd padding value. However, if an even padding value yields the correct input shape, that is also fine :)
I've 6 gradient directions, I can't do output = torch.Size([32, 1, 260, 260]) input = torch.Size([32, 5, 260, 260]).
Actually, I think it is legit to use padding=5 for 6 gradient directions. Is there an error message? Does it work for padding=3?
Thank you for the response.
Yes, it works with padding=3; only the input size changes. Now output = torch.Size([32, 1, 260, 260]) and input = torch.Size([32, 2, 260, 260]). Since you are using (padding//2), I can't obtain input = torch.Size([32, 5, 260, 260]), that is, an odd number of gradient directions as input. So it doesn't consider the slices from all gradients. I'm not sure if this will change the results.
Maybe it would be nice to add this info as a warning and as an additional note in your config :)
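The shapes reported in this thread (padding 3, 5, and 6 giving 2, 4, and 6 input channels) are consistent with the arithmetic below. This is a sketch inferred only from those shapes and the mention of `padding//2`; the function name `input_channels` is hypothetical and the formula has not been checked against the repository code.

```python
def input_channels(padding):
    """Input channels implied by the shapes reported in this thread.

    Assumption: the loader takes padding // 2 neighbours on each side of the
    target, so the count is always even. Not verified against the repository.
    """
    return 2 * (padding // 2)


for padding in (3, 5, 6):
    print(padding, input_channels(padding))
# 3 2
# 5 4
# 6 6
```

Under this assumption an odd channel count such as 5 can never be produced, whatever padding value is set.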
Yeah! This is a good point! Can you please open a pull request for the warning in the config files? I am afraid I might misunderstand you. Thank you!
I'm afraid I can't do the pull request, because I changed your code a little. I was having trouble running the dipy package, which was causing a GPU compatibility issue on my side, so I removed all the functions related to dipy and changed your code. We recently updated the cluster with the latest drivers, so I had to modify it.
Hi again! :)
Another question: why did you specifically choose transforms.Lambda(lambda t: (t * 2) - 1) for data augmentation?
Thanks again for answering.
Hey! Given that the raw data is normalized to the range [0, 1], this augmentation rescales the data to [-1, 1]. This is a common strategy used in most other diffusion models. Since the noise we inject during diffusion is sampled from a Gaussian distribution with zero mean, the range [-1, 1] makes more sense than [0, 1].
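As a small illustration of that rescaling in plain Python (no torch needed): `to_symmetric` mirrors the lambda from the question, while `to_unit` is a hypothetical inverse added here for mapping results back to [0, 1].

```python
def to_symmetric(t):
    """Map a value from [0, 1] to [-1, 1], as in lambda t: (t * 2) - 1."""
    return (t * 2) - 1


def to_unit(t):
    """Inverse map from [-1, 1] back to [0, 1] (hypothetical helper)."""
    return (t + 1) / 2


samples = [0.0, 0.25, 0.5, 0.75, 1.0]
rescaled = [to_symmetric(s) for s in samples]
print(rescaled)                       # [-1.0, -0.5, 0.0, 0.5, 1.0]
print(sum(rescaled) / len(rescaled))  # 0.0
```

The midpoint 0.5 of the original range lands on 0, which is also the mean of the injected Gaussian noise.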
Ok, makes sense now. :)
Ok, confession here: I'm reviewing your work thoroughly because it's very interesting and I'd like to include it in my PhD :) That explains my interest in your work. So I suppose more questions will come in the future. :)
Ok, while reviewing your dataset code, I noticed that your validation set is included in the train set. In lines 31-35 of data/mri_dataset.py, you use raw_data to get items for both the train set and the validation set. There is no check on val_index, which means the validation set is part of the train set. To my knowledge, in current ML methods the validation set shouldn't be included in the train set. Is this expected behaviour? Is it specific to the nature of MRI scans?
Yeah sure no problem :) Glad this work can be helpful for your Ph.D.!
For self-supervised learning tasks, there is no clear boundary between the training set and the validation set. Because there are no ground truths, it is not 'cheating' when we train on the validation/testing data directly. This is also a common protocol for all other self-supervised learning works including N2N, N2S, P2S etc. (not just for MRI).
The generalization of self-supervised algorithms to unseen validation/testing data is usually pretty difficult, and it is still an open research problem I presume.
Ok got it, thank you for your explanations and for being clear about your response :)
Hey, are you planning on releasing patch- vs image-based results for Stage I? By patches I mean 2D patches instead of what was originally done in patch2self, that is, mapping 2D patches to 2D patches instead of slice-to-slice mapping.
Hi, unfortunately, we didn't run any experiments with 2D patches, and we don't have plans to release any patch-based implementations. You can modify the data loader to make this happen, but I think this may require some additional work.
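For anyone attempting that data loader change, one possible starting point is to enumerate patch positions within each slice and crop them. This is only a sketch: `patch_origins` and `crop` are hypothetical helpers, the patch size and stride are free choices, and slices are represented as plain lists of rows rather than tensors.

```python
def patch_origins(height, width, patch, stride):
    """Yield (row, col) top-left corners of patch x patch windows in a slice.

    Windows that would run past the border are skipped. Hypothetical helper;
    the repository does not provide a patch-based loader.
    """
    for row in range(0, height - patch + 1, stride):
        for col in range(0, width - patch + 1, stride):
            yield row, col


def crop(slice_2d, row, col, patch):
    """Cut one patch out of a slice stored as a list of rows."""
    return [line[col:col + patch] for line in slice_2d[row:row + patch]]
```

For the 260 x 260 slices mentioned in this thread, a 64-pixel patch with a 64-pixel stride would give 16 non-overlapping patches per slice and leave a 4-pixel border uncovered; a smaller stride would produce overlapping patches instead.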
Ok, I see. Thanks again :) I'll try it.
Hi Tiange,
I've observed some behaviour where DDM2 denoises, but the reconstruction is very smooth. I observe high errors on high-frequency details. I used the setting n=1 with a volume that has only 48 slices. I believe it might be due to having too little data relative to the model size, but I'm not sure. Have you experienced this in your experiments? Is there any requirement on how many slices a volume should have?
Thanks :)
Hi, by n=1 do you mean you used only 1 prior image as input to train the network? In this way, the denoising results can be compromised. If it is possible, please use at least n=3!
Also, please double-check if the stage 1 model is appropriately trained (by checking the denoised results from stage 1). DDM2 can generate smooth results with little data, but it is at least better than stage 1 for sure.