Training STAC summary_loss = nan

Question

klock18 opened this issue 4 years ago · 3 comments

Hi!

I am training your STAC model on top of a baseline that was trained with tf_efficientdet_d0.

However, summary_loss is nan at every training step, and I get the following output:

Warning: multi_tensor_applier fused unscale kernel is unavailable, possibly because apex was installed without --cuda_ext --cpp_ext. Using Python fallback. Original ImportError was: ModuleNotFoundError("No module named 'amp_C'")

2020-10-27T00:13:26.180480
LR: 1e-05
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 256.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.00390625
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.52587890625e-05

Would appreciate any help!
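
For context on the log above: a dynamic loss scaler lowers its scale whenever it sees non-finite gradients and skips that step. If the loss is already nan before scaling, every step looks like an overflow, so the scale keeps shrinking and never recovers. The snippet below is a minimal pure-Python simulation of that behaviour, not apex's actual code. The function name `simulate_loss_scale`, the starting scale of 2**16, and the halving factor are assumptions made for illustration.

```python
import math


def simulate_loss_scale(losses, init_scale=2.0 ** 16, factor=2.0):
    """Return the loss scale after each step of a simplified dynamic scaler.

    Assumption: the scale is divided by `factor` whenever the scaled loss
    (used here as a stand-in for the gradients) is non-finite, and the
    step is skipped. Scale growth after good steps is left out for brevity.
    """
    scale = init_scale
    history = []
    for loss in losses:
        scaled = loss * scale
        if not math.isfinite(scaled):
            scale /= factor  # overflow detected: skip the step, reduce the scale
        history.append(scale)
    return history


# A nan loss stays nan at any scale, so the scale shrinks on every step.
nan_history = simulate_loss_scale([float("nan")] * 32)
# A finite loss never triggers the overflow branch, so the scale is unchanged.
finite_history = simulate_loss_scale([0.5] * 32)

print(nan_history[-1])     # 1.52587890625e-05
print(finite_history[-1])  # 65536.0
```

Under these assumptions, a shrinking loss scale is a symptom rather than the cause: the thing to look for is where the loss first becomes nan.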

Answer 1 · 2020-11-04T10:59:48.000Z

I haven't trained on d0. Could you provide more details on your pipeline (e.g. batch size, data pipeline, resolution)?

Answer 2 · 2020-11-04T21:28:31.000Z

Thanks for your reply!

I moved to d2, which my computer can handle, but I am still getting the same issue.

I am using the original data to train your baseline, which works fine for d2. (Now that you mention it, the 1024x1024 images could very well have caused my earlier CUDA errors. Did you resize the images?) I had reduced the batch size to 2 to avoid those CUDA errors. I am then using the full-sized data that you uploaded to Kaggle to train your STAC model, which produces the error above.

Also, I was looking through your code to try to replicate STAC in YOLOv5. Was your approach simply to use the augmentations described in the paper and weight the pseudo loss, rather than using a consistency loss?

Answer 3 · 2021-06-06T12:13:32.000Z

I'm not sure whether d2 works, and reducing the batch size could degrade performance.

Regarding STAC, I did try a consistency loss, but it did not improve my model.
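
For anyone porting this to another detector, the pseudo-loss weighting discussed in this thread can be sketched roughly as follows. It illustrates the objective described in the STAC paper (a supervised loss plus a weighted loss on confidence-filtered pseudo-labels) and is not the code from this repository. The function names, the dictionary layout of a prediction, and the threshold and weight values are all placeholders for illustration.

```python
def filter_pseudo_labels(predictions, score_threshold=0.9):
    """Keep only teacher predictions whose confidence passes the threshold.

    `predictions` is a hypothetical list of dicts with a "score" key.
    """
    return [p for p in predictions if p["score"] >= score_threshold]


def combined_loss(supervised_loss, pseudo_loss, pseudo_weight=2.0):
    """Supervised loss plus a weighted loss computed on pseudo-labels."""
    return supervised_loss + pseudo_weight * pseudo_loss


# Hypothetical teacher output on one unlabeled image.
teacher_predictions = [
    {"box": (10, 20, 50, 80), "label": 1, "score": 0.95},
    {"box": (30, 40, 60, 90), "label": 1, "score": 0.40},
]

# Only the confident box survives and would be used as a training target
# for the strongly augmented copy of the image.
pseudo_labels = filter_pseudo_labels(teacher_predictions)
total = combined_loss(supervised_loss=1.0, pseudo_loss=0.25)

print(len(pseudo_labels), total)  # 1 1.5
```

In a real training loop the two loss terms would come from the detector's own loss function, evaluated on a labeled batch and on a pseudo-labeled batch respectively.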