yucornetto/MGMatting

Training loss does not converge

ilyaskhanbey opened this issue · 5 comments

Hello, I tried your training code with the Adobe, Distinction636, and real-world datasets. I kept only solid objects and removed all transparent objects such as glasses.

The composition and Laplacian losses do not seem to converge after the warmup steps (itr > 5000):
[9770/500000], REC: 0.0121, COMP: 0.2987, LAP: 0.1311, lr: 0.001000

I removed the test step while training; I think it has no impact on the training. Am I wrong?

Should I wait for more iterations, or should I adjust some hyperparameters when training with more data than you used?

Hi, could you give more info about your problem? E.g., how do you put all datasets together?

The test step should have no impact on training. Besides, it seems to me your loss looks normal?

Hello, I think I found the issue and why the loss does not converge.
I believe it is caused by the composition loss when real-world augmentation (adding noise to the input image) is enabled.
Currently, training does something like this:

import torch.nn.functional as F
alpha_pred = mg_net(image_noise)
# composition loss target is the *noisy* augmented input
loss_comp = F.l1_loss(alpha_pred * fg + (1 - alpha_pred) * bg, image_noise)

I think it should be:

alpha_pred = mg_net(image_noise)
# composition loss target is the original image *without* noise
loss_comp = F.l1_loss(alpha_pred * fg + (1 - alpha_pred) * bg, image_with_no_noise)

In the first configuration, the loss is dominated by the noise introduced by the augmentation, so it cannot go to zero even with a perfect alpha prediction.
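To make the comparison concrete, here is a minimal, self-contained sketch of the two variants (assuming PyTorch; mg_net, fg, bg, image_noise, and image_with_no_noise are placeholders for the corresponding tensors in the training loop, not the exact names from trainer.py):

import torch.nn.functional as F

def composition_losses(mg_net, image_with_no_noise, image_noise, fg, bg):
    # The network always sees the real-world-augmented (noisy) input.
    alpha_pred = mg_net(image_noise)                        # (B, 1, H, W) in [0, 1]
    composite = alpha_pred * fg + (1.0 - alpha_pred) * bg   # re-composited image

    # Current behaviour: the target still contains the injected noise,
    # so this term stays large even for a perfect alpha prediction.
    loss_comp_noisy = F.l1_loss(composite, image_noise)

    # Proposed fix: compare against the clean (un-augmented) image instead.
    loss_comp_clean = F.l1_loss(composite, image_with_no_noise)
    return loss_comp_noisy, loss_comp_clean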

To test this, I trained on a single image (it should converge within a few iterations).
After the fix, here are the losses:

Image tensor shape: torch.Size([2, 3, 512, 512]). Trimap tensor shape: torch.Size([2, 3, 512, 512])
[66/500000], REC: 0.0203, COMP: 0.0745, LAP: 0.1510, lr: 0.000100

Without the fix, the composition and Laplacian losses always hover between 0.15 and 0.7:

Image tensor shape: torch.Size([2, 3, 512, 512]). Trimap tensor shape: torch.Size([2, 3, 512, 512])
[69/500000], REC: 0.0241, COMP: 0.4077, LAP: 0.2623, lr: 0.000100

I think that when the composition loss does not converge, it impacts the Laplacian loss too.

Thanks for the explanation! The composition loss will indeed be affected if real-world noise is introduced; as you can see in https://github.com/yucornetto/MGMatting/blob/main/code-base/config/MGMatting-RWP-100k.toml, we disable the composition loss when real-world-aug is enabled.

As for the Laplacian loss, I am somewhat confused, since fg/image should not be involved in the Laplacian loss computation at all; see https://github.com/yucornetto/MGMatting/blob/main/code-base/trainer.py#L224
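For reference, here is a rough, self-contained sketch of a Laplacian pyramid loss computed purely on the predicted and ground-truth alpha maps, to show that fg/bg/image never enter it. This is only illustrative; the exact pyramid construction and level weighting in trainer.py differ:

import torch
import torch.nn.functional as F

def _gauss_kernel(channels, device):
    # 5x5 binomial kernel, one copy per channel (depthwise blur)
    k = torch.tensor([1., 4., 6., 4., 1.], device=device)
    k = torch.outer(k, k)
    k = k / k.sum()
    return k.expand(channels, 1, 5, 5).contiguous()

def laplacian_loss(alpha_pred, alpha_gt, levels=5):
    # Only alpha maps are involved in this loss.
    kernel = _gauss_kernel(alpha_pred.shape[1], alpha_pred.device)
    loss = 0.0
    for i in range(levels):
        blur_p = F.conv2d(F.pad(alpha_pred, [2] * 4, mode='reflect'), kernel, groups=alpha_pred.shape[1])
        blur_g = F.conv2d(F.pad(alpha_gt, [2] * 4, mode='reflect'), kernel, groups=alpha_gt.shape[1])
        # Laplacian band = detail removed by the blur at this level
        loss = loss + (2 ** i) * F.l1_loss(alpha_pred - blur_p, alpha_gt - blur_g)
        # move one level down the pyramid
        alpha_pred, alpha_gt = F.avg_pool2d(blur_p, 2), F.avg_pool2d(blur_g, 2)
    return loss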

I am not sure about your point that the comp loss not converging also affects the lap loss, but my suggestion would be to simply set comp_loss_weight to 0 when real-world-aug is enabled.
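Concretely, something along these lines in the loss assembly (the flag and weight names here are only illustrative, not necessarily the exact config keys):

def total_loss(loss_rec, loss_comp, loss_lap, real_world_aug,
               rec_w=1.0, comp_w=1.0, lap_w=1.0):
    # When real-world augmentation injects noise into the input, drop the
    # composition term entirely (equivalent to comp_loss_weight = 0).
    if real_world_aug:
        comp_w = 0.0
    return rec_w * loss_rec + comp_w * loss_comp + lap_w * loss_lap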

Thank you for your answer, I didn't notice that you don't use the comp loss when real-world-aug is enabled.
I tested your pretrained RWP model and it seems to work much better on real data than the DIM pretrained model.
I will try to retrain the RWP model with the comp loss enabled together with the comp-loss fix I made, and I will let you know whether performance improves.

One last question: is it a good idea to use a batch size of 40, as mentioned in your article? Marco Forte, in his FBA paper, said the batch size should be between 6 and 16 (with BN) for alpha prediction.

Thank you for your great work

Thanks for offering to run the experiments!

Regarding your question about the FBA Matting paper, I think they use batch size = 1 together with Weight Standardization (WS) + Group Norm (GN)? I am not sure about the 6 to 16 part. My personal experience is that if BN is used, a relatively large batch size usually leads to better performance, but I did not run experiments to verify this on the matting task.
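For reference, the batch-size sensitivity comes from BatchNorm computing statistics across the batch dimension, whereas GroupNorm normalizes within each sample, which is why batch size 1 works there (WS is additionally applied to the conv weights). A small, generic PyTorch illustration, not taken from either repo:

import torch
import torch.nn as nn

# BatchNorm statistics are estimated across the batch, so very small batches
# give noisy estimates; GroupNorm normalizes within each sample, so it behaves
# the same for batch size 1 or 40.
bn_block = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU())
gn_block = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.GroupNorm(8, 32), nn.ReLU())

x = torch.randn(1, 3, 64, 64)   # a single-image batch
bn_block.train()
gn_block.train()
y_bn = bn_block(x)   # batch statistics from one sample only -- unreliable in training
y_gn = gn_block(x)   # per-sample, per-group statistics -- independent of batch size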