JuliaWolleb/diffusion-anomaly

classifier training error

0Godness opened this issue · 10 comments

creating data loader...
dataset is chexpert
creating optimizer...
training classifier model...
step 0
classnames ['diseased', 'healthy', 'diseased', 'diseased', 'diseased', 'healthy', 'healthy', 'healthy']
len 8
len 8
len 8
lenloader 8
len 8
path patient01104_study1
path patient00485_study2
path patient00124_study2
IS CHEXPERT
labels tensor([1], device='cuda:0')
h1 torch.Size([1, 8192])
Traceback (most recent call last):
File "/home/Diffusion Models/diffusion-anomaly/scripts/classifier_train.py", line 191, in main
losses = forward_backward_log(data, step + resume_step)
File "/home/Diffusion Models/diffusion-anomaly/scripts/classifier_train.py", line 138, in forward_backward_log
logits = model(sub_batch, timesteps=sub_t)
File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/syg/Diffusion Models/diffusion-anomaly/./guided_diffusion/unet.py", line 955, in forward
return self.out(h)
File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/linear.py", line 103, in forward
return F.linear(input, self.weight, self.bias)
File "/usr/local/lib/python3.9/dist-packages/torch/nn/functional.py", line 1848, in linear
return torch._C._nn.linear(input, weight, bias)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (1x8192 and 256x2)

I just used your original code and the provided 'data' folder to train the classifier, but I got the errors above. Could you please help me solve them? Also, the list of class names printed has length 8, which seems inconsistent with the 2 classes described in the paper. I don't know whether I ran it with the wrong settings. Hoping for your detailed answer, thank you!

Hi
How did you set the path to your training data? It should point to data/chexpert/training. Then the length of the class-name list should be 6 instead of 8, because there are only 6 images in the training set of this mini-example. You still have only 2 classes, namely 'healthy' and 'diseased', so that should work fine.

Regarding your error, I made a change to the file guided_diffusion/unet.py. Let me know whether this fixed the issue.

Best,
Julia
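
For anyone who hits the same traceback before pulling the fix: the message means the pooled feature vector (1x8192 here, see "h1 torch.Size([1, 8192])" in the log) does not match the in_features of the final linear layer (256x2). A generic, hypothetical illustration of the mismatch and of one way such a head can be rebuilt; this is not necessarily the change that was made in guided_diffusion/unet.py:

import torch
import torch.nn as nn

h = torch.randn(1, 8192)               # pooled encoder features, as in the log above
broken_head = nn.Linear(256, 2)        # head built for 256 features -> raises the shape error
fixed_head = nn.Linear(h.shape[1], 2)  # head sized to the actual feature dimension, 2 classes
logits = fixed_head(h)                 # shape (1, 2)
print(logits.shape)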

Thanks for your kind answer. Yes, it works! So one .npy file represents one image? Could you provide the conversion code? I want to use a custom dataset.
Thanks again!
Thanks again!

Yes, one .npy file represents one image. It might be a better idea to store them in another format.
What do you mean by transfer code?
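
To illustrate the one-image-per-.npy convention, a minimal sketch of saving a single chest X-ray as a .npy file; the file name, grayscale conversion, target size, and scaling are assumptions, not the repository's official preprocessing:

import numpy as np
from PIL import Image

img = Image.open("patient00001_study1_view1_frontal.jpg").convert("L")  # hypothetical CheXpert image
img = img.resize((256, 256))                     # fix the resolution
arr = np.asarray(img, dtype=np.float32) / 255.0  # scale pixel values to [0, 1]
np.save("patient00001_study1.npy", arr)          # one .npy file per image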

I mean code to convert the CheXpert dataset to .npy files.

And if possible, could you provide the code for processing the BraTS dataset?
The format of the dataset I downloaded is not the same as the format required by the code.
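
In case it helps while waiting for an answer, a rough sketch of slicing one BraTS NIfTI volume into per-slice .npy files, assuming nibabel is installed; the paths, modality choice, normalization, and output layout are guesses and not necessarily the format this repository's loader expects:

import os
import numpy as np
import nibabel as nib

vol_path = "BraTS2021_00000/BraTS2021_00000_flair.nii.gz"  # hypothetical input volume
out_dir = "data/brats/training"                            # hypothetical output folder
os.makedirs(out_dir, exist_ok=True)

volume = nib.load(vol_path).get_fdata()   # shape (H, W, D)
volume = volume / (volume.max() + 1e-8)   # simple intensity normalization

for z in range(volume.shape[2]):
    slice_2d = volume[:, :, z].astype(np.float32)
    np.save(os.path.join(out_dir, f"slice_{z:03d}.npy"), slice_2d)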

Hello Professor, I also have a question. When I train the classifier module, the loss does not converge and fluctuates very strangely. Is this normal?

Hi

Yes, this is normal. It is due to the fact that the input images contain various levels of noise.
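
To make that concrete: each training batch is typically noised to a randomly drawn timestep before it reaches the classifier, so some batches are almost clean and others are close to pure noise, which makes the per-step loss jump around even when training is healthy. A toy, self-contained illustration (the cosine schedule and shapes below are placeholders, not the repository's code):

import math
import torch

x_start = torch.randn(8, 1, 256, 256)                        # stand-in for a batch of images
t = torch.randint(0, 1000, (x_start.shape[0],))              # a different noise level per image
alpha_bar = torch.cos(t.float() / 1000 * math.pi / 2) ** 2   # toy cosine noise schedule
noise = torch.randn_like(x_start)
x_t = (alpha_bar.sqrt().view(-1, 1, 1, 1) * x_start
       + (1 - alpha_bar).sqrt().view(-1, 1, 1, 1) * noise)   # what the classifier actually sees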

Thank you for your reply. May I ask you another question? How can I tell whether the classifier model has finished training? Thank you, Professor.

That is usually hard to tell. You can validate by computing the classification scores on non-noisy images.
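
A sketch of what that validation could look like, assuming a standard (image, label) data loader and that the classifier accepts a timesteps argument as in the traceback above; this is just one possible way to do it:

import torch

@torch.no_grad()
def clean_accuracy(model, data_loader, device="cuda"):
    # Evaluate the noise-conditional classifier on images without added
    # noise by conditioning on timestep 0.
    model.eval()
    correct, total = 0, 0
    for images, labels in data_loader:
        images, labels = images.to(device), labels.to(device)
        t = torch.zeros(images.shape[0], dtype=torch.long, device=device)
        logits = model(images, timesteps=t)
        correct += (logits.argmax(dim=-1) == labels).sum().item()
        total += labels.numel()
    return correct / total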

OK, thank you very much for sharing this and for the clarification.

Could you kindly share the train/test split (patient IDs for each set) and the hyperparameters used to train the classifier on the BraTS dataset (i.e., learning_rate, batch_size, anneal_lr, weight_decay, dropout)? I tried the training settings in the README, but they do not produce good results (much worse than the pretrained classifier checkpoint you provided). I referred to the classifier hyperparameters used in the openai/guided_diffusion repository, and they differ from the settings cited in your README. So I would like to know your exact parameters to reproduce your results. Thank you very much.