duanyiqun/DiffusionDepth

About DDIM loss on custom datasets

Closed this issue · 9 comments

Hi author, thanks for sharing this amazing work!
I tried to train this network on a custom dataset (indoor scenarios) for the depth completion task. However, I found that the DDIM loss on val and test stayed near 1.
I modified the NYU dataloader to prepare RGB, GT depth, and a depth mask (for the region that needs to be completed). During training, I computed the L1 and L2 losses only over the masked region.
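
For concreteness, a minimal PyTorch sketch of what I mean by masking the pixel losses (the names pred, gt, mask are illustrative, not the repo's exact code):

```python
import torch

def masked_l1_l2(pred, gt, mask):
    # pred, gt: (B, 1, H, W) depth maps; mask: (B, 1, H, W) bool tensor
    # marking the region that needs to be completed.
    mask = mask.bool()
    valid = mask.sum().clamp(min=1)      # avoid division by zero for empty masks
    diff = (pred - gt) * mask
    l1 = diff.abs().sum() / valid        # mean absolute error over masked pixels
    l2 = diff.pow(2).sum() / valid       # mean squared error over masked pixels
    return l1, l2
```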

Network structure is as below:
backbone_module: mmbev_resnet
backbone name: mmbev_res50
head_specify: DDIMDepthEstimate_Res
loss: 1.0*L1+1.0*L2+1.0*DDIM

May I ask what the DDIM loss curve should look like for val and test?
training curve

Hi there, thank you so much for the interest.
The DDIM loss on val and test staying near 1 shouldn't be the right case. I think a normal case would be a drop from 1 to around 0.2-0.6 (depending on the dataset) and staying around there.
If it stays near 1 and doesn't descend, there might be an issue in how the loss is calculated; the loss probably needs to be modified a little for a new dataset.
In that case, do the val and test RMSE and MAE look alright?

Best regards
Yiqun

Hi, thanks for your reply.
I am wondering if it is because I only compute the L1 and L2 losses for the masked region but compute the DDIM loss for the whole image. I will try again, computing the loss for the whole depth image.
Curves of other metrics are as below, orange (train), black (val) and blue (test):
loss

Hi there, thanks so much for your detailed information.
I think calculating L1 and L2 only on regions with GT values is right. If the GT is sparse, calculating the loss over the whole image will bring a lot of noise.
But one interesting thing is that val and train seem to share the same trends while test does not. Do val and train have a closer distribution?

Yes, the val and train sets have similar scenes with a closer distribution, but the test set has novel scenes with different objects and backgrounds. May I ask whether this matters for the RGB encoder, the depth diffusion part, or both?

Besides, I trained on the NYU dataset yesterday, as this indoor dataset is closer to my custom dataset. The configs, training curves, and results are below. The results seem great even with a ResNet backbone, but I still found that the val and test DDIM losses didn't descend.
=== Arguments ===
affinity : TGASS | affinity_gamma : 0.5 | augment : True | backbone_module : mmbev_resnet |
backbone_name : mmbev_res50 | batch_size : 16 | betas : (0.9, 0.999) | conf_prop : True | data_name : NYU |
decay : 10,15,20 | dir_data : /home/zl/dataset/nyudepthv2 | epochs : 20 | epsilon : 1e-08 | force_maxdepth : False |
from_scratch : False | gamma : 1.0,0.2,0.04 | gpus : 0,1 | head_specify : DDIMDepthEstimate_Res | inference_steps : 20 |
legacy : False | loss : 1.0*L1+1.0*L2+1.0*DDIM | lr : 0.001 | max_depth : 10.0 | min_depth : 1e-06 |
model_name : Diffusion_DCbase_ | momentum : 0.9 | network : resnet34 | no_multiprocessing : False | num_gpus : 2 |
num_sample : 0 | num_summary : 4 | num_threads : 4 | num_train_timesteps : 1000 | opt_level : O0 |
optimizer : ADAM | patch_height : 240 | patch_width : 320 | port : 29500 | preserve_input : False |
pretrain : None | prop_kernel : 3 | prop_time : 18 | resume : False | save : trial |
save_dir : ../experiments/231219_201611_trial | save_full : False | save_image : False | save_raw_npdepth : False | save_result_only : False |
seed : 7240 | split_backbone_training : False | split_json : ../data_json/nyu.json | test_crop : False | test_only : False |
top_crop : 0 | warm_up : True | weight_decay : 0.0 | with_loss_chamfer : False |

Training curves: train (black), val (blue), test (pink)
nyu_curve

Visualization on training set
nvu_train

Visualization on test set
nyu_test

Hi there, thanks very much for the info. I did observe that the overfitting problem presents more in indoor scenarios than in outdoor sparse scenarios. The final version is optimized for sparse scenarios. Let me check the configs.

I could be wrong, but when extending to a custom dataset it is probably helpful to change diffusing on the refined depth into diffusing on the GT depth.
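
Roughly, the change is only which tensor plays the role of the clean sample x0 in the noise-prediction loss. A minimal hedged sketch (the names here are illustrative, and a diffusers-style scheduler.add_noise is assumed, not the repo's exact code):

```python
import torch
import torch.nn.functional as F

def ddim_noise_loss(denoiser, scheduler, x0, cond_feat, num_train_timesteps=1000):
    # Standard noise-prediction loss; x0 is whichever tensor we choose to diffuse.
    noise = torch.randn_like(x0)
    t = torch.randint(0, num_train_timesteps, (x0.shape[0],), device=x0.device)
    noisy = scheduler.add_noise(x0, noise, t)
    return F.mse_loss(denoiser(noisy, t, cond_feat), noise)

# current behaviour: diffuse on the refined depth latent
# loss_ddim = ddim_noise_loss(denoiser, scheduler, refined_depth_t, cond_feat)

# suggested change: diffuse on the GT depth (latent) instead
# loss_ddim = ddim_noise_loss(denoiser, scheduler, gt_depth_latent, cond_feat)
```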

For the RGB encoder part, I don't have a clear feeling about it.

Thanks for your reply. I will change to diffusing on the GT depth and update you later.

Hi @Zray26! Did you resolve the problem of the DDIM loss? I also tried to train this network on a custom dataset and found that the ddim_loss almost does not decrease (0.1 on the train set and 1 on the test set). Looking forward to your reply.
Hi author @duanyiqun! I have another question about ddim_loss from reading your code. I found that the DDIM loss is calculated in the ddim_loss() function with loss = F.mse_loss(noise_pred, noise) and is then divided by the batch size in the train() function. As I remember, F.mse_loss with reduction='mean' already accounts for the batch size, providing an average loss per element across the entire batch, so an additional division by the batch size should not be necessary. I think the division by batch size may reduce the loss weight of the DDIM loss.
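
A quick standalone check of that point (the shapes below are made up for illustration; this is not code from the repo):

```python
import torch
import torch.nn.functional as F

B = 16
noise_pred = torch.randn(B, 16, 30, 40)
noise = torch.randn(B, 16, 30, 40)

loss = F.mse_loss(noise_pred, noise)               # reduction='mean' averages over all B*C*H*W elements
per_sample = F.mse_loss(noise_pred, noise, reduction='none').mean(dim=(1, 2, 3))

print(loss.item() - per_sample.mean().item())      # ~0: already a per-element average across the batch
print((loss / B).item())                           # the extra division shrinks the term by another factor of B
```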

Hi @lhiceu! I did some experiments and found that the DDIM loss did descend, just slowly, while the pixel losses (in my case L1 and L2) descended much faster.
If you check how data flows during the training process, you will find two branches of data flow:

  1. A depth latent feature (refined_depth_t) is generated from random noise through the DDIM inference process, then passed through a CNN decoder to generate the predicted depth image, on which the pixel losses are computed.

  2. The generated depth latent feature (refined_depth_t) is also taken as the clean image (x0) of a vanilla DDIM training step, on which the MSE DDIM loss is computed.

In the first branch, the denoising model is actually called num_inference_steps (20) times. If my understanding is correct, the gradient chain of this branch is CNN decoder + denoising model * 20. That is to say, the pixel losses iteratively optimize the denoising model through 20 calls in each training step.

However, in the second branch, the vanilla DDIM loss tries to predict the noise added to refined_depth_t, so it only optimizes the denoising network through 1 call in each training step. That's why the DDIM loss descends much more slowly; a rough sketch of both branches is below.
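
My rough mental model of those two branches as pseudocode; all names are illustrative, and a diffusers-style DDIM scheduler interface (set_timesteps, step, add_noise) is assumed, so this is a sketch of the idea rather than the repo's actual implementation:

```python
import torch
import torch.nn.functional as F

def two_branch_losses(denoiser, decoder, scheduler, cond_feat, gt_depth, mask,
                      num_inference_steps=20, num_train_timesteps=1000):
    # --- Branch 1: DDIM inference from random noise to a refined depth latent. ---
    # The denoiser is called once per inference step (20 here), and every call
    # stays in the autograd graph, so the pixel losses backprop through all of them.
    latent = torch.randn_like(cond_feat)
    scheduler.set_timesteps(num_inference_steps)
    for t in scheduler.timesteps:
        noise_pred = denoiser(latent, t, cond_feat)
        latent = scheduler.step(noise_pred, t, latent).prev_sample
    refined_depth_t = latent

    pred_depth = decoder(refined_depth_t)                 # CNN decoder -> depth map
    l1 = F.l1_loss(pred_depth[mask], gt_depth[mask])      # pixel losses on masked/valid pixels
    l2 = F.mse_loss(pred_depth[mask], gt_depth[mask])

    # --- Branch 2: vanilla DDIM training loss, treating refined_depth_t as x0. ---
    # The denoiser is called exactly once here, so this loss only optimizes it
    # through a single call per training step.
    noise = torch.randn_like(refined_depth_t)
    t = torch.randint(0, num_train_timesteps, (refined_depth_t.shape[0],),
                      device=refined_depth_t.device)
    noisy = scheduler.add_noise(refined_depth_t, noise, t)
    ddim_loss = F.mse_loss(denoiser(noisy, t, cond_feat), noise)

    return l1, l2, ddim_loss
```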

Thanks for your detailed explanation. I understand your point. I tried changing ddim_loss() to ddim_loss_gt() so that noise is added to gt_map_t, but I did not find significant changes in the loss. Moreover, the DDIM loss value on the test set is close to 1 in both your experiment and mine, which is strange. I think the DDIM loss gap between the train set and the test set is too large.