jolibrain/joliGEN

Exception in dataloader.py

hsleiman1 opened this issue · 3 comments

Hello, I am trying to train clear2snowy and I get the following exception:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "train.py", line 125, in train_gpu

The training command is as follows:

python train.py --dataroot datasets/clear2snowy/ --checkpoints_dir checkpoints/clear2snowy --name clear2snowy --output_display_env clear2snowy --output_display_freq 20 --output_print_freq 20 --train_G_lr 0.0002 --train_D_lr 0.0001 --data_crop_size 512 --data_load_size 512 --data_dataset_mode unaligned_labeled_mask_online --model_type cut --train_batch_size 2 --train_iter_size 4 --model_input_nc 3 --model_output_nc 3 --f_s_net segformer --f_s_config_segformer models/configs/segformer/segformer_config_b0.py --train_mask_f_s_B --f_s_semantic_nclasses 11 --G_config_segformer models/configs/segformer/segformer_config_b0.json --data_online_creation_crop_size_A 512 --data_online_creation_crop_delta_A 64 --data_online_creation_mask_delta_A 64 --data_online_creation_crop_size_B 512 --data_online_creation_crop_delta_B 64 --dataaug_D_noise 0.01 --data_online_creation_mask_delta_B 64 --alg_cut_nce_idt --train_sem_use_label_B --D_netDs projected_d basic vision_aided --D_proj_interp 512 --D_proj_network_type vitsmall --train_G_ema --G_padding_type reflect --train_optim adam --dataaug_no_rotate --train_sem_idt --model_multimodal --train_mm_nz 16 --G_netE resnet_512 --f_s_class_weights 1 10 10 1 5 5 10 10 30 50 50 --output_display_aim_server 127.0.0.1 --output_display_visdom_port 8501 --gpu_id 0,1 --G_netG unet_256

beniz commented

Hi @hsleiman1, yes, this is a known "error" that occurs at startup. We have investigated it multiple times and it appears to be due to a filesystem temporary file lock.

However, it does not prevent training, since the dataloader has multiple workers. So in practice, until we find a way to work around it, you can treat it as a warning with no impact on the training run.

If it does prevent the training, please let us know.
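For illustration only (this is not joliGEN's actual dataloading code), the sketch below shows the general pattern of retrying a transient file-lock error inside a PyTorch Dataset so that a multi-worker DataLoader keeps running. The FlakyDataset class and its simulated lock are hypothetical stand-ins.

```python
# Illustrative sketch only, not joliGEN code: retry a transient
# file-lock error inside __getitem__ so a multi-worker DataLoader
# can keep running. FlakyDataset and its simulated lock are hypothetical.
import time

import torch
from torch.utils.data import DataLoader, Dataset


class FlakyDataset(Dataset):
    """Returns random image-sized tensors; the first access to each index
    raises a simulated "temporary file lock" OSError, then succeeds."""

    def __init__(self, n=16, retries=3):
        self.n = n
        self.retries = retries
        self.seen = set()  # per-worker state, since each worker gets its own copy

    def __len__(self):
        return self.n

    def _load(self, idx):
        if idx not in self.seen:
            self.seen.add(idx)
            raise OSError(f"simulated temporary file lock on sample {idx}")
        return torch.randn(3, 512, 512)

    def __getitem__(self, idx):
        for attempt in range(self.retries):
            try:
                return self._load(idx)
            except OSError:
                time.sleep(0.01 * (attempt + 1))  # short backoff, then retry
        raise RuntimeError(f"sample {idx} still locked after {self.retries} retries")


if __name__ == "__main__":
    loader = DataLoader(FlakyDataset(), batch_size=2, num_workers=2)
    for _ in loader:
        pass  # a training step would consume the batch here
    print("all batches loaded despite the simulated lock errors")
```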

beniz commented

--G_netG unet_256

As a side note, our multimodal training runs on this dataset used segformer_attn_conv here instead.

hsleiman1 commented

Thank you @beniz. I had actually added --G_netG unet_256 following the warning base_options.py:951: UserWarning: ResNet encoder/decoder architectures do not mix well with multimodal training, use segformer or unet instead, so I went with unet. I've now switched to --G_netG segformer_attn_conv and the training has started. Thanks!