deepglugs/deep_imagen

About multi-GPU training on a big dataset.

Closed this issue · 0 comments

Hi, thanks for your work.
I am confused about multi-GPU training on a subset of LAION (about 7M images).
When I specify the GPU IDs like
`CUDA_VISIBLE_DEVICES=0,1,2,3,4 python3 imagen.py --train`
only the first GPU is used, and training is very slow: one epoch would take about 200 hours.
When I use `accelerate launch imagen.py`,
the data is processed, but training gets stuck on the first epoch.
In both cases, GPU utilization is 0%. With a small dataset, training runs and GPU utilization is normal.
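For reference, this is roughly how I would expect a multi-process launch to look (`--multi_gpu` and `--num_processes` are standard accelerate CLI options; setting `CUDA_VISIBLE_DEVICES` alone does not start extra worker processes):

```shell
# One process per visible GPU; accelerate handles device placement.
accelerate launch --multi_gpu --num_processes 5 imagen.py --train
```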
It seems the problem is in the `DataGenerator` class in data_generator.py.
Have you run into similar problems, or can you give any suggestions?
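As a quick check, here is a minimal, self-contained sketch for timing batch fetches (the dummy generator is my own stand-in, not the repo's `DataGenerator`; swapping in the real loader is an assumption on my part). If per-batch time is large with the real loader, the data pipeline rather than the GPUs is the bottleneck:

```python
import time

def time_batches(loader, n_batches=10):
    """Average seconds to fetch n_batches from any iterable loader."""
    it = iter(loader)
    start = time.perf_counter()
    for _ in range(n_batches):
        next(it)  # pull one batch; raises StopIteration if loader is too short
    return (time.perf_counter() - start) / n_batches

# Dummy stand-in for the real data loader; replace with DataGenerator(...) to test it.
dummy_loader = ([i] * 4 for i in range(100))
avg = time_batches(dummy_loader, n_batches=10)
print(f"avg seconds per batch: {avg:.6f}")
```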
Thanks again.