VRAM usage in multi-GPU training
1378dm opened this issue · 11 comments
I use --batch_size 2 and --config e to successfully train a 1024×1536 model on a single RTX 2080 Ti, but when I add --num_gpus 8, it fails. Does multi-GPU training need more VRAM? Or could you tell me the best settings for 11 GB of VRAM?
i usually train 1024*1024 config-f with batch 2 on 11gb, so the same batch at higher res with a smaller config sounds right. i also used this code recently on a cluster of 8 gpus with 32gb each, batch 8 per gpu (64 total), without any problem.
can you provide the exact and complete error log for that failure?
Here's my input:
train.bat final_1 --kimg 5000 --batch_size 2 --config e --num_gpus 8
Here's the console output:
custom init resolution [6, 4]
Batch size 2
Local submit :: train\000-final_1-1024x1536-e
dnnlib: Running training.training_loop.training_loop() on localhost...
Dataset shape = [4, 1536, 1024]
Label size = 0
model base resolution 1024
Constructing networks...
Building TensorFlow graph...
Training for 5000 kimg (5000 left)
Traceback (most recent call last):
File "C:\Users\vipuser\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow\python\client\session.py", line 1356, in _do_call
return fn(*args)
File "C:\Users\vipuser\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow\python\client\session.py", line 1341, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "C:\Users\vipuser\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow\python\client\session.py", line 1429, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op 'NcclAllReduce' used by {{node TrainG/Broadcast/NcclAllReduce}}with these attrs: [reduction="sum", shared_name="c0", T=DT_FLOAT, num_devices=8]
Registered devices: [CPU, GPU]
Registered kernels:
[[TrainG/Broadcast/NcclAllReduce]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "src/train.py", line 166, in
main()
File "src/train.py", line 162, in main
run(**vars(args))
File "src/train.py", line 131, in run
dnnlib.submit_run(**kwargs)
File "C:\Users\vipuser\Desktop\SG2\stylegan2-master\src\dnnlib\submission\submit.py", line 343, in submit_run
return farm.submit(submit_config, host_run_dir)
File "C:\Users\vipuser\Desktop\SG2\stylegan2-master\src\dnnlib\submission\internal\local.py", line 23, in submit
return run_wrapper(submit_config)
File "C:\Users\vipuser\Desktop\SG2\stylegan2-master\src\dnnlib\submission\submit.py", line 280, in run_wrapper
run_func_obj(**submit_config.run_func_kwargs)
File "C:\Users\vipuser\Desktop\SG2\stylegan2-master\src\training\training_loop.py", line 283, in training_loop
tflib.run([G_train_op, data_fetch_op], feed_dict)
File "C:\Users\vipuser\Desktop\SG2\stylegan2-master\src\dnnlib\tflib\tfutil.py", line 31, in run
return tf.get_default_session().run(*args, **kwargs)
File "C:\Users\vipuser\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow\python\client\session.py", line 950, in run
run_metadata_ptr)
File "C:\Users\vipuser\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow\python\client\session.py", line 1173, in _run
feed_dict_tensor, options, run_metadata)
File "C:\Users\vipuser\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow\python\client\session.py", line 1350, in _do_run
run_metadata)
File "C:\Users\vipuser\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow\python\client\session.py", line 1370, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op 'NcclAllReduce' used by node TrainG/Broadcast/NcclAllReduce (defined at C:\Users\vipuser\Desktop\SG2\stylegan2-master\src\dnnlib\tflib\optimizer.py:199) with these attrs: [reduction="sum", shared_name="c0", T=DT_FLOAT, num_devices=8]
Registered devices: [CPU, GPU]
Registered kernels:
[[TrainG/Broadcast/NcclAllReduce]]
Errors may have originated from an input operation.
Input Source operations connected to node TrainG/Broadcast/NcclAllReduce:
TrainG/Clean0/mul (defined at C:\Users\vipuser\Desktop\SG2\stylegan2-master\src\dnnlib\tflib\optimizer.py:191)
Also, my CUDA version is 10.0 and my Python version is 3.7.9; I have already installed cuDNN 7.6.1 and Visual Studio 2017.
i don't see any memory-related (OOM) errors here. instead, the errors concern a missing OpKernel for the 'NcclAllReduce' op, i.e. the multi-gpu communication itself. quick googling gave me these, for instance:
https://www.tensorflow.org/guide/gpu#using_multiple_gpus
tensorflow/tensorflow#33656
but those are about TF 2.x, while the original SG2 requires TF 1.14, which is not updated anymore (do you use that version, btw?)
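if you want to check whether your tensorflow build registers the nccl kernels at all, here's a minimal sketch (my assumptions: tf 1.14 api, at least 2 visible gpus, and the nccl_ops module that sg2's optimizer.py also imports); it builds the same NcclAllReduce op from your log:

import tensorflow as tf
from tensorflow.python.ops import nccl_ops  # same module sg2's dnnlib/tflib/optimizer.py uses

# place one tensor on each of two gpus; all_sum emits an NcclAllReduce node per input
with tf.Graph().as_default():
    per_gpu = []
    for i in range(2):
        with tf.device('/gpu:%d' % i):
            per_gpu.append(tf.constant(float(i + 1)))
    summed = nccl_ops.all_sum(per_gpu)
    with tf.Session() as sess:
        try:
            print(sess.run(summed))  # expect [3.0, 3.0] if the kernel is registered
        except tf.errors.InvalidArgumentError as err:
            print('NcclAllReduce kernel not registered:', err)

afaik the windows tensorflow-gpu wheels are built without nccl (nccl itself targets linux), so if this reproduces the 'No OpKernel' error on windows but passes on linux, that would explain your failure.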
Yes, I installed tensorflow-gpu==1.14 using pip. What could be the cause of this problem?
I have compiled the NCCL DLL for Windows; where should I put it to make the error go away?
i asked about 1.14 because i remember there were performance issues with multi-gpu on 1.15. tf 1.14 should be ok.
unfortunately, i can't help more with multi-gpu issues on your system. that sg2 part is kept exactly as in the original nvidia repo, so you'd better check the issues there (with the authors of the code), or try to google around, as i believe this is simply not stylegan2-related.
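if you just need runs to proceed in the meantime, you can pin training to a single gpu via the standard CUDA_VISIBLE_DEVICES variable (a plain cuda mechanism, nothing sg2-specific), e.g. on windows:

set CUDA_VISIBLE_DEVICES=0
train.bat final_1 --kimg 5000 --batch_size 2 --config e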
It seems I'll have to switch my server to Ubuntu to get it to work properly. By the way, does the choice of config have an effect on the quality of the model?
yes, config-f has more trainable weights than config-e, and is therefore more powerful (and more hungry for resources). that means you can get more correct and detailed imagery from config-f.
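if you want the exact numbers, here's a rough sketch to count the trainable weights once the networks are constructed (tf 1.x; i'm assuming the variables sit under the 'G/' and 'D/' scopes as in the original repo, adjust if yours differ):

import numpy as np
import tensorflow as tf

def count_params(scope):
    # tf.trainable_variables(scope) filters variable names with re.match
    return sum(int(np.prod(v.shape.as_list())) for v in tf.trainable_variables(scope))

print('G params:', count_params('G/'))
print('D params:', count_params('D/'))

iirc the print_layers() tables at the start of the training log show the same counts per layer.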
So does config-f take up twice as much VRAM as config-e? I tested this model with config-f and batch_size 4 on a borrowed V100 32GB, and it used up almost all of the VRAM.