TencentARC/BrushNet

Doesn't converge when I train with my own data

Opened this issue · 10 comments

The loss keeps fluctuating. I wish I could see a plot of what a correct loss curve should look like.

Me too, and it confuses me a lot. Have you solved it, @zf-666? Or could the authors please help us? @juxuan27

How did you build your dataset?

hi @juxuan27 @yuanhangio, thanks for such great work!
Could you please share some details on training BrushNet SDXL, such as how many epochs, how long it takes, and how many GPUs were used?
I am training on just one zip package from BrushData to understand the training details, but the loss still fluctuates even after 11000+ steps (one zip has 10000 images, so with batch size 4 that is around 4 epochs).

I guess the loss fluctuation is related to the random timestep sampled at each training step, but how can I tell when the model has converged if the loss gives so little guidance?

[screenshots: training loss curves]

A similar issue is mentioned in #35, but I didn't find any explanation of the loss behavior there.
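Since the per-step loss is dominated by the randomly sampled timestep, a smoothed or fixed-timestep metric is usually a more readable convergence signal than the raw curve. Below is a minimal sketch (not from the BrushNet repo; `compute_diffusion_loss` is a placeholder you would wire up to your own training loop) of an EMA-smoothed training loss and a validation loss evaluated on a fixed batch/timestep grid:

```python
# Sketch: two convergence signals that remove the random-timestep noise
# from the raw training loss. `compute_diffusion_loss(batch, t)` is a
# placeholder for your own denoising-loss function.
import torch


class EmaLoss:
    """Tracks an exponentially smoothed training loss for logging."""

    def __init__(self, beta: float = 0.99):
        self.beta = beta
        self.value = None

    def update(self, loss: float) -> float:
        if self.value is None:
            self.value = loss
        else:
            self.value = self.beta * self.value + (1.0 - self.beta) * loss
        return self.value


@torch.no_grad()
def fixed_timestep_val_loss(compute_diffusion_loss, val_batches, timesteps=(100, 400, 700, 900)):
    """Average the denoising loss over a fixed batch/timestep grid.

    Because the batches and timesteps are always the same, this number is
    comparable across checkpoints, unlike the noisy per-step training loss.
    """
    losses = []
    for batch in val_batches:
        for t in timesteps:
            losses.append(compute_diffusion_loss(batch, t).item())
    return sum(losses) / len(losses)
```

Logging the EMA every step and the fixed-timestep validation loss every few thousand steps usually makes it much clearer whether training has plateaued.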

I fixed the problem by using fp16 and the fp16 VAE, but another problem arose: my dataset is mostly dark images, yet the generated results, while fitting the distribution, always come out too bright.
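For reference, switching to half precision with the numerically fixed SDXL VAE usually looks like the sketch below (assuming a diffusers-based setup; how the VAE is passed into the BrushNet pipeline or training script may differ in your code):

```python
# Sketch: load an fp16-safe SDXL VAE so the model can run in half precision
# without NaNs. Assumption: a diffusers-based setup; adapt the wiring to
# the BrushNet pipeline/training script you are using.
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained(
    "madebyollin/sdxl-vae-fp16-fix",  # community SDXL VAE patched for fp16
    torch_dtype=torch.float16,
)
# Pass `vae=vae` (and torch_dtype=torch.float16) when constructing the
# SDXL/BrushNet pipeline or when building the models for training.
```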

Hi, which resolution of images did you use for training? Only 1024x1024, or random resolutions? Appreciate the reply!

1024x1024 for SDXL
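If your images are not already square, a simple resize-then-center-crop to 1024x1024 matches the SDXL training resolution. A minimal sketch using torchvision follows; the actual preprocessing in the BrushNet training script may differ:

```python
# Sketch: resize the short side to 1024 and center-crop to 1024x1024 for SDXL.
# This mirrors common SDXL preprocessing; the real training script may differ.
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize(1024),             # short side -> 1024, keeps aspect ratio
    transforms.CenterCrop(1024),         # square 1024x1024 crop
    transforms.ToTensor(),
    transforms.Normalize([0.5], [0.5]),  # map pixels to [-1, 1] as diffusion models expect
])
```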

Regarding the fp16 fix and the brightness issue: could you share your training hyper-parameters and loss curve?

@yuanhangio @juxuan27 There are only about 5-10 images in my own dataset. Can BrushNet converge with that few? At least how many images should I prepare?

I use BrushData as my dataset, but some samples in the .tar files are missing the "width" field, so training fails. Does anyone know how to modify train_brushnet.py to skip such samples and continue training?
Thanks so much!
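One way to make the loader robust is to filter out samples with incomplete metadata before they reach the training step. Below is a minimal sketch assuming a webdataset-style pipeline where each sample carries a "json" metadata entry; the shard pattern and key names are placeholders, so adapt them to how train_brushnet.py builds its dataset:

```python
# Sketch: drop BrushData samples whose metadata lacks width/height before
# they reach the collate/training step. Assumptions: a webdataset pipeline
# and a "json" metadata entry per sample; the shard pattern is a placeholder.
import json
import webdataset as wds


def has_size_metadata(sample) -> bool:
    """Keep only samples whose JSON metadata contains width and height."""
    try:
        raw = sample["json"]
        meta = json.loads(raw) if isinstance(raw, (bytes, str)) else raw
        return "width" in meta and "height" in meta
    except (KeyError, ValueError):
        return False


dataset = (
    wds.WebDataset("BrushData/{00000..00009}.tar", handler=wds.warn_and_continue)
    .select(has_size_metadata)   # skip malformed samples instead of crashing
    .decode("pil")
)
```

Passing `handler=wds.warn_and_continue` additionally makes decoding errors print a warning and move on rather than stopping the run.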

Regarding the brightness issue after switching to fp16: did you solve it? Could you please share some result images? Thanks!