ChenFengYe/motion-latent-diffusion

Problem with training results

Ying156209 opened this issue · 9 comments

Hi, pretty nice work!
I followed the instructions to train the model myself, i.e. train the VAE first, then load the pretrained VAE to train MLD, using the two config files. But the generated motion just turns around in place and trembles. The only thing I changed is that I completed the placeholder here by saving self.state_dict:

checkpoint["something_cool_i_want_to_save"] = my_cool_pickable_object
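Concretely, my completion looks roughly like the sketch below (the hook follows the standard PyTorch Lightning on_save_checkpoint signature; whether overwriting the "state_dict" key is what the placeholder was meant to become is exactly what I am unsure about):

import pytorch_lightning as pl

# Minimal sketch of my completion, assuming a plain LightningModule;
# the key name "state_dict" mirrors what Lightning stores by default
# and may not be what the repository intends here.
class MLDModule(pl.LightningModule):
    def on_save_checkpoint(self, checkpoint):
        # Overwrite the stored weights with the current module state.
        checkpoint["state_dict"] = self.state_dict()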
My training results are below. The visualization script is correct for sure, since I used the same scripts to visualize your pretrained model, which looks great.

https://user-images.githubusercontent.com/34014097/211010042-4393b5ea-da1c-4761-8463-780d4c22c785.mov
https://user-images.githubusercontent.com/34014097/211011274-76dfe00e-6595-456e-bbcd-b8bc66fd6117.mov

My logs:
vae_log.log
mld_step_log.log

Thanks!

Hi, thanks for your interest. Please feel free to ask questions; I will keep helping you until the training works correctly.

I checked the first .mov. The first thing that comes to mind is a mis-loaded VAE. Could you please check the VAE-loading part of your MLD training?
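A quick sanity check is to print the missing/unexpected keys when the pretrained VAE weights are loaded into the MLD model. The sketch below is only an illustration: the attribute name mld_model.vae and the "vae." key prefix are assumptions, not the exact names used in this repository.

import torch

def check_vae_loading(mld_model, vae_ckpt_path):
    # Report keys that failed to transfer from a pretrained VAE checkpoint.
    # `mld_model.vae` is an assumed attribute name for the VAE submodule;
    # adjust it to whatever your MLD model actually uses.
    ckpt = torch.load(vae_ckpt_path, map_location="cpu")
    vae_state = {k.replace("vae.", "", 1): v
                 for k, v in ckpt["state_dict"].items()
                 if k.startswith("vae.")}
    missing, unexpected = mld_model.vae.load_state_dict(vae_state, strict=False)
    print("missing keys:", missing)        # should be empty on a correct load
    print("unexpected keys:", unexpected)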

The FIDs in both training logs (VAE/MLD) are too large; the correct range is around 0.45~1.0.
Could you check the std and mean of the HumanML3D dataset, and provide some details about its setup?
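For example, you could verify that the normalization statistics load correctly and contain no zeros or NaNs. The file names Mean.npy / Std.npy and the default paths below are assumptions based on the usual HumanML3D layout, so point them at whatever your dataloader actually reads.

import numpy as np

def check_humanml_stats(mean_path="datasets/humanml3d/Mean.npy",
                        std_path="datasets/humanml3d/Std.npy"):
    # Sanity-check the normalization statistics used for HumanML3D features.
    mean, std = np.load(mean_path), np.load(std_path)
    print("shapes:", mean.shape, std.shape)   # typically (263,) for HumanML3D
    print("std min:", std.min())              # zeros or NaNs here break the metrics
    assert not np.isnan(mean).any() and not np.isnan(std).any()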

Another important thing is --nodebug for training; please check the new command in the README. Without this flag, only a small part of the dataset is used (for debugging).

@Ying156209, @ChenFengYe Hi, I have encountered the same problem. Have you solved it?

Hi, did you add the --nodebug flag for training? I think this might be the reason.

Yes, I added the flag by setting group.add_argument("--nodebug", type=bool, default=True, help="debug or not") in config.py line 61, but the problem still happens.
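(A side note on that change, as a general argparse remark rather than anything specific to this repository: type=bool converts any non-empty string to True, and default=True means debug mode is disabled even when --nodebug is not passed on the command line. A more conventional pattern is a store_true flag, sketched below purely for comparison.)

import argparse

# Purely illustrative; the option name and help text mirror the snippet
# above, but this is not the repository's actual config.py code.
parser = argparse.ArgumentParser()
group = parser.add_argument_group("training")
group.add_argument(
    "--nodebug",
    action="store_true",   # False unless --nodebug is passed explicitly
    help="disable debug mode (train on the full dataset)",
)
args = parser.parse_args(["--nodebug"])
print(args.nodebug)        # True only when the flag is given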

Hi Chenfeng, thanks for your reply! I added the flag and retrained. The loss decreases normally; I will check the results once training finishes. There are a few other small issues. Training speed is around 2 it/s for the VAE stage on an A100 (6 GPUs visible, only one used) with batch size 64 and 16 dataloader workers, which gives 235 epochs in 10 hours, so the VAE may take a few days to train. Is this the normal training speed? Also, by default no detailed log is printed (only the loss) after adding --nodebug, even when I set log_every_n_steps = cfg.LOGGER.LOG_EVERY_STEPS = 1. @GongMingCarmen, do you have these problems?
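For context, the change I tried corresponds to passing that value to the PyTorch Lightning Trainer, roughly as below; where exactly the Trainer is built in this repository may differ, so treat this as a sketch only.

import pytorch_lightning as pl

# log_every_n_steps controls how often scalar logs are written; the value 1
# mirrors cfg.LOGGER.LOG_EVERY_STEPS = 1 from the message above. The other
# Trainer arguments here are placeholders.
trainer = pl.Trainer(
    max_epochs=1,
    log_every_n_steps=1,
)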

I summarize your problems here:

  1. GPUs. You can list the IDs to use all of your GPUs (see the sketch after this list):
    DEVICE: [0] # Index of gpus eg. [0] or [0,1,2,3]
  2. Number of epochs. 1500~3000 epochs are enough for VAE or MLD. I suggest you use wandb (preferred) or TensorBoard to check the FID curve of your training.
  3. Training speed. 2000 epochs can take about 1 day on a single GPU, and around 12 hours on 8 GPUs. Training speed also depends on VAL_EVERY_STEPS (validation frequency) and data I/O speed. Your training is a little slow.
  4. Data log. Only the loss is printed by default. After each validation, more validation metrics are printed. More details are available in wandb (preferred) or TensorBoard.
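Regarding point 1, the DEVICE list ultimately selects which GPUs PyTorch Lightning uses. The mapping below is a rough, hedged illustration and not the exact wiring in this repository.

import pytorch_lightning as pl

# DEVICE: [0,1,2,3] in the config roughly corresponds to a Trainer built
# like this; on older Lightning versions the argument is gpus=[...] instead.
device_ids = [0, 1, 2, 3]
trainer = pl.Trainer(
    accelerator="gpu",
    devices=device_ids,
)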

More tips:

  1. Please use --nodebug for all your training.
  2. Please load your pre-trained VAE correctly for the MLD diffusion training.
  3. The validation FID will drop to 0.5~1 after 1500 epochs for both VAE and MLD training. By default, validation is on the test split...

Very detailed tips!
Could you provide your training logs? I want to make sure my training process is going correctly.

Please feel free to ask if you have any other problems.
vae_log.log mld_train.log

@ChenFengYe Is it correct to use FID as the metric for saving checkpoints? In train.py, lines 119-132, like this:

ModelCheckpoint(
    monitor="Metrics/FID",  # was "step"
    mode="min",             # was "max"
    # ... other arguments unchanged ...
)
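For reference, a fuller version of the callback I have in mind looks roughly like this; save_top_k and save_last are my own illustrative choices, and I am assuming the FID is logged under the key "Metrics/FID" as in the snippet above.

from pytorch_lightning.callbacks import ModelCheckpoint

# Keep the best checkpoints by validation FID (lower is better), plus the
# most recent one; these particular settings are not the repository defaults.
checkpoint_callback = ModelCheckpoint(
    monitor="Metrics/FID",
    mode="min",
    save_top_k=3,
    save_last=True,
)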