How to train insgen with only 1 GPU
Johnson-yue opened this issue · 5 comments
thanks for your sharing, but I only have 1 GPU , it can not be trained
I see the reason why need multi-GPU is for ‘effect of disabling shuffle BN to MoCo’
but I can not understand why must shuffle batch data among all gpus not only in GPU?
would you provide a way to shuffle batch date on 1 GPU, it can be not ‘effect’ ?
I ran this on 2 gpus much slower than the baseline stylegan2, taking nearly twice as long.Then,I followed the solution in the issue1 and ran it on Colab. Again, it took twice as long.
If I apply this change I can run on a single GPU:
--- a/train.py
+++ b/train.py
@@ -413,7 +413,7 @@ def subprocess_fn(rank, args, temp_dir):
dnnlib.util.Logger(file_name=os.path.join(args.run_dir, 'log.txt'), file_mode='a', should_flush=True)
# Init torch.distributed.
- if args.num_gpus > 1:
+ if args.num_gpus > 0:
The key to the above patch is that even with 1 GPU the following code needs to run to init process groups via the torch.distributed.init_process_group
function
Lines 406 to 430 in 52bda7c
I changed the >
condition for brevity.
If I apply this change I can run on a single GPU:
--- a/train.py +++ b/train.py @@ -413,7 +413,7 @@ def subprocess_fn(rank, args, temp_dir): dnnlib.util.Logger(file_name=os.path.join(args.run_dir, 'log.txt'), file_mode='a', should_flush=True) # Init torch.distributed. - if args.num_gpus > 1: + if args.num_gpus > 0:The key to the above patch is that even with 1 GPU the following code needs to run to init process groups via the
torch.distributed.init_process_group
functionLines 406 to 430 in 52bda7c
I changed the
>
condition for brevity.
This method does not work on my mathine. Many different strange bugs prompted. I don't know why.
@jkla139 I actually used the copy from https://github.com/Zhendong-Wang/Diffusion-GAN . I cloned from the main branch of this repo and got:
:$ python train.py --gpus=1 ...
...
File "/home/giles/projects/insgen/training/training_loop.py", line 407, in training_loop
module = module.module
File "/home/giles/projects/stylegan2-ada-pytorch/.env/stylegan2-ada-pytorch/lib/python3.8/site-packages/torch/nn/modules/module.py", line 778, in __getattr__
raise ModuleAttributeError("'{}' object has no attribute '{}'".format(
torch.nn.modules.module.ModuleAttributeError: 'CLHead' object has no attribute 'module'
I tracked this down to the following:
--- a/training/training_loop.py
+++ b/training/training_loop.py
@@ -403,8 +403,6 @@ def training_loop(
snapshot_data = dict(training_set_kwargs=dict(training_set_kwargs))
for name, module in [('G', G), ('D', D), ('G_ema', G_ema), ('augment_pipe', augment_pipe), ('D_ema', D_ema), ('DHead', DHead), ('GHead', GHead)]:
if module is not None:
- if name in ['DHead', 'GHead']:
- module = module.module
if num_gpus > 1:
misc.check_ddp_consistency(module, ignore_regex=r'.*\.w_avg')
module = copy.deepcopy(module).eval().requires_grad_(False).cpu()
diffusion-gan doesn't seem to have this, so I just removed those lines. It seems to be working.
@jkla139 I actually used the copy from https://github.com/Zhendong-Wang/Diffusion-GAN . I cloned from the main branch of this repo and got:
:$ python train.py --gpus=1 ... ... File "/home/giles/projects/insgen/training/training_loop.py", line 407, in training_loop module = module.module File "/home/giles/projects/stylegan2-ada-pytorch/.env/stylegan2-ada-pytorch/lib/python3.8/site-packages/torch/nn/modules/module.py", line 778, in __getattr__ raise ModuleAttributeError("'{}' object has no attribute '{}'".format( torch.nn.modules.module.ModuleAttributeError: 'CLHead' object has no attribute 'module'
I tracked this down to the following:
--- a/training/training_loop.py +++ b/training/training_loop.py @@ -403,8 +403,6 @@ def training_loop( snapshot_data = dict(training_set_kwargs=dict(training_set_kwargs)) for name, module in [('G', G), ('D', D), ('G_ema', G_ema), ('augment_pipe', augment_pipe), ('D_ema', D_ema), ('DHead', DHead), ('GHead', GHead)]: if module is not None: - if name in ['DHead', 'GHead']: - module = module.module if num_gpus > 1: misc.check_ddp_consistency(module, ignore_regex=r'.*\.w_avg') module = copy.deepcopy(module).eval().requires_grad_(False).cpu()diffusion-gan doesn't seem to have this, so I just removed those lines. It seems to be working.
Yes, these two lines need to be deleted, now it's work.