training problem

Question

asmile201711360030 opened this issue 2 years ago · 8 comments

asmile201711360030 commented 2 years ago

Hi, when I was training on a dataset, I ran into the following error. I would be very grateful for any help!

:~/StyleLight-main$ CUDA_VISIBLE_DEVICES=3,4 python train.py --outdir=./training-runs-256x512 --data=/home/kf/LavalIndoorDataset/IndoorHDRDataset2018-debug2-256x512-data-splits2/train/IndoorHDRDataset2018 --gpus=2
batch size: 32
batch gpu: 16

Training options:
{
"num_gpus": 2,
"image_snapshot_ticks": 100,
"network_snapshot_ticks": 100,
"metrics": [
"fid50k_full"
],
"random_seed": 0,
"training_set_kwargs": {
"class_name": "training.dataset.ImageFolderDataset",
"path": "/home/kf/LavalIndoorDataset/IndoorHDRDataset2018-debug2-256x512-data-splits2/train/IndoorHDRDataset2018",
"use_labels": false,
"max_size": 255,
"xflip": false,
"resolution": 256
},
"data_loader_kwargs": {
"pin_memory": true,
"num_workers": 3,
"prefetch_factor": 2
},
"G_kwargs": {
"class_name": "training.networks.Generator",
"z_dim": 512,
"w_dim": 512,
"mapping_kwargs": {
"num_layers": 2
},
"synthesis_kwargs": {
"channel_base": 16384,
"channel_max": 512,
"channels_dict": {
"4": 512,
"8": 512,
"16": 512,
"32": 512,
"64": 512,
"128": 256,
"256": 128,
"512": 64
},
"num_fp16_res": 4,
"conv_clamp": 256
}
},
"D_kwargs": {
"class_name": "training.networks.Discriminator",
"block_kwargs": {},
"mapping_kwargs": {},
"epilogue_kwargs": {
"mbstd_group_size": 4
},
"channel_base": 16384,
"channel_max": 512,
"channels_dict": {
"4": 512,
"8": 512,
"16": 512,
"32": 512,
"64": 512,
"128": 256,
"256": 128,
"512": 64
},
"num_fp16_res": 4,
"conv_clamp": 256
},
"task_name": "StyleLight-training",
"G_opt_kwargs": {
"class_name": "torch.optim.Adam",
"lr": 0.0025,
"betas": [
0,
0.99
],
"eps": 1e-08
},
"D_opt_kwargs": {
"class_name": "torch.optim.Adam",
"lr": 0.0025,
"betas": [
0,
0.99
],
"eps": 1e-08
},
"loss_kwargs": {
"class_name": "training.loss.StyleGAN2Loss",
"r1_gamma": 0.4096
},
"total_kimg": 25000,
"batch_size": 32,
"batch_gpu": 16,
"ema_kimg": 10.0,
"ema_rampup": 0.05,
"ada_target": 0.6,
"augment_kwargs": {
"class_name": "training.augment.AugmentPipe",
"xflip": 1,
"rotate90": 1,
"xint": 1,
"scale": 1,
"rotate": 1,
"aniso": 1,
"xfrac": 1,
"brightness": 1,
"contrast": 1,
"lumaflip": 1,
"hue": 1,
"saturation": 1
},
"run_dir": "./training-runs-256x512/00002-IndoorHDRDataset2018-auto2"
}

Output directory: ./training-runs-256x512/00002-IndoorHDRDataset2018-auto2
Training data: /home/kf/LavalIndoorDataset/IndoorHDRDataset2018-debug2-256x512-data-splits2/train/IndoorHDRDataset2018
Training duration: 25000 kimg
Number of GPUs: 2
Number of images: 255
Image resolution: 256
Conditional model: False
Dataset x-flips: False

Creating output directory...
Launching processes...
Loading training set...
training_set_kwargs: {'class_name': 'training.dataset.ImageFolderDataset', 'path': '/home/kf/LavalIndoorDataset/IndoorHDRDataset2018-debug2-256x512-data-splits2/train/IndoorHDRDataset2018', 'use_labels': False, 'max_size': 255, 'xflip': False, 'resolution': 256}

Num images: 255
Image shape: [6, 256, 512]
Label shape: [0]

Constructing networks...
Setting up augmentation...
Distributing across 2 GPUs...
Setting up training phases...
Exporting sample images...

...........................

Traceback (most recent call last):
File "train.py", line 581, in <module>
main() # pylint: disable=no-value-for-parameter
File "/home/kf/miniconda3/envs/stylegan2/lib/python3.7/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/home/kf/miniconda3/envs/stylegan2/lib/python3.7/site-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
File "/home/kf/miniconda3/envs/stylegan2/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/kf/miniconda3/envs/stylegan2/lib/python3.7/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/home/kf/miniconda3/envs/stylegan2/lib/python3.7/site-packages/click/decorators.py", line 26, in new_func
return f(get_current_context(), *args, **kwargs)
File "train.py", line 576, in main
torch.multiprocessing.spawn(fn=subprocess_fn, args=(args, temp_dir), nprocs=args.num_gpus)
File "/home/kf/miniconda3/envs/stylegan2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/kf/miniconda3/envs/stylegan2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
while not context.join():
File "/home/kf/miniconda3/envs/stylegan2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
raise Exception(msg)
Exception:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/home/kf/miniconda3/envs/stylegan2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/home/kf/StyleLight-main/train.py", line 427, in subprocess_fn
training_loop.training_loop(rank=rank, **args)
File "/home/kf/StyleLight-main/training/training_loop.py", line 352, in training_loop
loss.accumulate_gradients(phase=phase.name, real_img=real_img, real_c=real_c, gen_z=gen_z, gen_c=gen_c, sync=sync, gain=gain)
File "/home/kf/StyleLight-main/training/loss.py", line 99, in accumulate_gradients
gen_logits = self.run_D(gen_img_ldr, gen_c, sync=False,isRealImage=False) ######## add isRealImage=False
File "/home/kf/StyleLight-main/training/loss.py", line 69, in run_D
logits = self.D(img, c)
File "/home/kf/miniconda3/envs/stylegan2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/kf/miniconda3/envs/stylegan2/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 619, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/home/kf/miniconda3/envs/stylegan2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/kf/StyleLight-main/training/networks.py", line 755, in forward
x, img = block(x, img, **block_kwargs)
File "/home/kf/miniconda3/envs/stylegan2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(input, **kwargs)
File "/home/kf/StyleLight-main/training/networks.py", line 580, in forward
misc.assert_shape(x, [None, self.in_channels, self.resolution, 2*self.resolution])
File "/home/kf/StyleLight-main/torch_utils/misc.py", line 97, in assert_shape
raise AssertionError(f'Wrong size for dimension {idx}: got {size}, expected {ref_size}')
AssertionError: Wrong size for dimension 2: got 64, expected 128

Answer 1 · 2023-04-22T11:49:59.000Z

Hi, what is the resolution of the dataset?

Answer 2 · 2023-04-22T12:42:23.000Z

Thanks. I used data_prepare_laval.py to process IndoorLavalDataset2018 and obtained images at a resolution of 256 * 512 using the parameters in your code. I only use 256 images. May I ask how you prepared and trained on IndoorLavalDataset?

Answer 3 · 2023-04-24T02:22:19.000Z

I prepared the data using python data_prepare_laval.py; please see readme.md for details.

Answer 4 · 2023-04-24T02:24:20.000Z

I think the problem is that the image size does not match the input size of the networks. Could you please print x.shape at line 96 of "/home/kf/StyleLight-main/torch_utils/misc.py"?
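To illustrate the check that is failing, here is a minimal sketch of a shape assertion with a debug print added. It is a simplified stand-in that works on plain tuples, not the actual misc.assert_shape from the repository; the function name check_shape, the batch size, and the channel count are assumptions chosen only to reproduce the message from the traceback.

```python
def check_shape(shape, ref_shape):
    """Simplified stand-in for a shape assertion; None in ref_shape matches any size."""
    # Debug print: show the actual shape next to the expected one before checking.
    print(f"actual shape: {list(shape)}, expected: {ref_shape}")
    if len(shape) != len(ref_shape):
        raise AssertionError(
            f"Wrong number of dimensions: got {len(shape)}, expected {len(ref_shape)}"
        )
    for idx, (size, ref_size) in enumerate(zip(shape, ref_shape)):
        if ref_size is not None and size != ref_size:
            raise AssertionError(
                f"Wrong size for dimension {idx}: got {size}, expected {ref_size}"
            )


# Hypothetical example: a block built for resolution 128 expects
# [N, C, 128, 256], but receives features of size 64 x 128.
resolution = 128
try:
    check_shape((16, 256, 64, 128), [None, 256, resolution, 2 * resolution])
except AssertionError as err:
    message = str(err)
    print(message)
```

Printing the shape this way shows which dimension disagrees with the resolution the networks were built for.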

Answer 5 · 2023-04-24T02:38:00.000Z

I initially used 256 * 512 images, but after experimenting with 128 * 256, I found that 128 * 256 images can be used for training. I don't know why. Have you ever trained with 256 * 512? I am a newbie; thank you very much for your patient answers.

Answer 6 · 2023-04-24T02:54:46.000Z

I have tried both 256 * 512 and 128 * 256 resolutions, and both work. However, you need to slightly modify the input size of the networks in training_loop.py, line 211, so that it matches the image size.

For 128x256 images:
common_kwargs_G = dict(c_dim=training_set.label_dim, img_resolution=128, img_channels=3, rank=rank)

For 256x512 images:
common_kwargs_G = dict(c_dim=training_set.label_dim, img_resolution=256, img_channels=3, rank=rank)

Could you please try to modify this line to match the image resolution?
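As an alternative to editing the literal by hand, the resolution could be derived from the image shape that the training log already reports ("Image shape: [6, 256, 512]"). The sketch below is only an illustration: the helper resolution_from_image_shape is hypothetical and not part of StyleLight, and it assumes the 2:1 width-to-height layout used in this thread.

```python
def resolution_from_image_shape(image_shape):
    """Return the network resolution (image height) for a [C, H, W] shape.

    Assumes a 2:1 width-to-height layout, e.g. 256x512 or 128x256.
    """
    _channels, height, width = image_shape
    if width != 2 * height:
        raise ValueError(f"expected width == 2 * height, got {height}x{width}")
    return height


# Shape taken from the training log in this issue.
img_resolution = resolution_from_image_shape([6, 256, 512])
print(img_resolution)  # 256
```

The resulting value could then be passed as img_resolution when building common_kwargs_G, so that the networks follow the dataset instead of a hard-coded number.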

Answer 7 · 2023-04-24T03:06:43.000Z

Thanks a lot. I finally understand where the problem lies. I'll take a good look at the source code again.

Answer 8 · 2023-04-24T03:55:00.000Z

Thanks. I will close this issue. If you run into any other problems, you can open a new issue.