zengxianyu/co-mod-gan-pytorch

multiple GPU training issue

Opened this issue · 1 comment

Hi, thanks for open-sourcing the code. I tried multi-GPU training with the command below


python train.py \
	--batchSize 8 \
	--nThreads 8 \
	--name "$exp_name" \
	--load_pretrained_g_ema "$pretrain_weight" \
	--train_image_dir "$dataset_root"/"img_512" \
	--train_image_list "$dataset_root"/"train_img_list.txt" \
	--train_image_postfix ".png" \
	--val_image_dir "$dataset_root"/"img_512" \
	--val_image_list "$dataset_root"/"val_mask_list.txt" \
	--val_mask_dir "$dataset_root"/"mask_512" \
	--val_image_postfix ".png" \
	--load_size 512 \
	--crop_size 512 \
	--z_dim 512 \
	--validation_freq 10000 \
	--niter 50 \
	--dataset_mode trainimage \
	--trainer stylegan2 \
	--dataset_mode_train trainimage \
	--dataset_mode_val valimage \
	--model comod \
	--netG comodgan \
	--netD comodgan \
	--no_l1_loss \
	--no_vgg_loss \
	--preprocess_mode scale_shortside_and_crop \
	--save_epoch_freq 10 \
	--gpu_id 0,1,2,3 \
	$EXTRA

and received the following error (this problem did not occur with single-GPU training):

(epoch: 1, iters: 9904, time: 0.171) GAN: 1.7399 path: 0.0003 D_real: 0.4633 D_Fake: 0.6500 r1: 0.2954
(epoch: 1, iters: 10000, time: 0.215) GAN: 1.4925 path: 0.0003 D_real: 0.3935 D_Fake: 0.9652 r1: 0.2954
saving the latest model (epoch 1, total_steps 10000)
Saved current iteration count at ./checkpoints/comod-ffhq-512-4gpus/iter.txt.
doing validation
warnings.warn('Was asked to gather along dimension 0, but all '
Traceback (most recent call last):
File "train.py", line 138, in
generated,_ = model(data_ii, mode='inference')
File "/binaries/anaconda3/envs/torch_py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/binaries/anaconda3/envs/torch_py36/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/binaries/anaconda3/envs/torch_py36/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/binaries/anaconda3/envs/torch_py36/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
output.reraise()
File "/binaries/anaconda3/envs/torch_py36/lib/python3.6/site-packages/torch/_utils.py", line 434, in reraise
raise exception
TypeError: Caught TypeError in replica 3 on device 3.
Original Traceback (most recent call last):
File "/binaries/anaconda3/envs/torch_py36/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/binaries/anaconda3/envs/torch_py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
TypeError: forward() missing 1 required positional argument: 'data'

Do you know what could be the problem?

I had no problem the last time I tried training on multiple GPUs. I don't have access to multiple GPUs at the moment; I'll look into this issue later.
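
A possible cause (a guess, not verified against this repo): nn.DataParallel scatters the positional data dict across the GPUs, while keyword arguments such as mode='inference' are replicated to every device. If the last validation batch contains fewer samples than GPUs, the trailing replicas receive mode but no positional data, which would match both the "forward() missing 1 required positional argument: 'data'" error and the gather warning above. Below is a minimal sketch of two workarounds, assuming a standard DataLoader setup; run_validation, val_loader and val_dataset are illustrative names, not identifiers from this repo.

import torch
from torch.utils.data import DataLoader

# Hypothetical sketch, not the repository's actual code.

# Option A: never feed DataParallel a batch smaller than the GPU count,
# e.g. drop the trailing partial batch when building the loader:
# val_loader = DataLoader(val_dataset, batch_size=8, drop_last=True)

def run_validation(model, val_loader):
    # Option B: run inference on the wrapped module directly, so the whole
    # batch stays on one device and no replica is left without the
    # positional 'data' argument.
    net = model.module if isinstance(model, torch.nn.DataParallel) else model
    results = []
    with torch.no_grad():
        for data_ii in val_loader:
            generated, _ = net(data_ii, mode='inference')
            results.append(generated.cpu())
    return results

Option A keeps validation multi-GPU; Option B trades a slower validation pass for not having to touch the data pipeline.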