lllyasviel/ControlNet

Loss Not Decreasing After 10K Steps


After reading the ControlNet paper and documentation, the architecture seemed well suited to generating ultrasound images from a given tumor mask.
As a starting point, a dataset of ~750 images with tumor segmentation masks as conditioning is used for training.

Training is performed with a batch size of 24 at 512x512 on an H100, taking ~2 s per step.
An epoch is only 32 steps, so 10K steps corresponds to roughly 310 epochs; overfitting would therefore be expected well before that point.
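
For reference, a quick check of that step arithmetic (numbers taken from above):

```python
import math

n_images, batch_size = 750, 24
steps_per_epoch = math.ceil(n_images / batch_size)  # 32 steps per epoch
epochs_at_10k = 10_000 / steps_per_epoch            # ~312 epochs at 10K steps
print(steps_per_epoch, round(epochs_at_10k))        # 32 312
```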

The Stable Diffusion backbone is frozen and only the newly added control_model is trained.
Input normalization has also been validated: images are in the range [-1, 1] and the segmentation masks are normalized to [0, 1] (see the sketch below for both checks).
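
A minimal sketch of both sanity checks, assuming the repo's tutorial_train.py setup (create_model/load_state_dict from cldm.model and a tutorial_dataset-style dataset yielding 'jpg' targets and 'hint' conditioning); `model` and `dataset` stand for the objects built there:

```python
import torch

# 1) Freeze check: in this codebase the SD backbone stays frozen because
#    ControlLDM.configure_optimizers() only hands the control_model params
#    to the optimizer (gated by model.sd_locked), not via requires_grad.
#    So snapshot a backbone weight and verify it is unchanged after training.
backbone_param = next(model.model.diffusion_model.parameters())
snapshot = backbone_param.detach().clone()
# ... run a handful of training steps here ...
assert torch.equal(backbone_param.detach().cpu(), snapshot.cpu()), \
    "SD backbone weights changed, so the backbone is not actually frozen"

# 2) Range check: targets should lie in [-1, 1], conditioning in [0, 1].
sample = dataset[0]
print("target range:", sample['jpg'].min(), sample['jpg'].max())
print("hint range:  ", sample['hint'].min(), sample['hint'].max())
```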

Here are example log lines from epochs 1 and 10; train/loss_epoch is essentially unchanged:

```text
Epoch 1:  1/1000 [00:21<5:57:00, 21.44s/it, loss=0.166, v_num=1.41e+7, train/loss_simple_step=0.180, train/loss_step=0.180, lr_step=1e-5, global_step=1e+3, train/loss_simple_epoch=0.168, train/loss_epoch=0.168, lr_epoch=1e-5]
Epoch 10: 1/1000 [00:21<6:00:32, 21.65s/it, loss=0.176, v_num=1.41e+7, train/loss_simple_step=0.189, train/loss_step=0.189, lr_step=1e-5, global_step=1e+4, train/loss_simple_epoch=0.166, train/loss_epoch=0.166, lr_epoch=1e-5]
```

As a sanity check, I used "a house" as the prompt together with an all-zero mask, and a high-quality image of a house was generated, verifying that the ControlNet model correctly inherited the pretrained Stable Diffusion weights.
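
A rough sketch of that sanity check, loosely following the repo's gradio demo scripts (the config and checkpoint paths are assumptions taken from tutorial_train.py; substitute your own files):

```python
import torch
from cldm.model import create_model, load_state_dict
from cldm.ddim_hacked import DDIMSampler

model = create_model('./models/cldm_v15.yaml').cuda()
model.load_state_dict(load_state_dict('./models/control_sd15_ini.ckpt', location='cuda'))
sampler = DDIMSampler(model)

with torch.no_grad():
    control = torch.zeros(1, 3, 512, 512).cuda()  # all-zero conditioning mask
    cond = {"c_concat": [control],
            "c_crossattn": [model.get_learned_conditioning(["a house"])]}
    un_cond = {"c_concat": [control],
               "c_crossattn": [model.get_learned_conditioning([""])]}

    # 50 DDIM steps, one sample, latent shape (4, 64, 64) for 512x512 output.
    samples, _ = sampler.sample(50, 1, (4, 64, 64), cond, verbose=False,
                                unconditional_guidance_scale=9.0,
                                unconditional_conditioning=un_cond)
    image = model.decode_first_stage(samples)  # latents -> image space, [-1, 1]
```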

What could cause the loss to stay flat like this, even when deliberately trying to overfit on such a small dataset?

Any starting points would be helpful!