
Training on Cityscapes

Yan1026 opened this issue · 4 comments

Sorry to bother you.
I trained with bash ./scripts/train_city.sh -l 372 -g 4 -b 50, but got this error:

availble_gpus= [0, 1, 2, 3]
  0%|                                                                                                           | 0/93 [00:00<?, ?it/s]
  0%|                                                                                                           | 0/93 [00:05<?, ?it/s]
wandb: Waiting for W&B process to finish... (failed 1).
wandb: - 0.000 MB of 0.000 MB uploaded (0.000 MB deduped)
wandb: \ 0.000 MB of 0.000 MB uploaded (0.000 MB deduped)
wandb: | 0.000 MB of 0.000 MB uploaded (0.000 MB deduped)
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /homedjy/PS-MT/wandb/offline-run-20220908_195002-2iykpb0m
wandb: Find logs at: ./wandb/offline-run-20220908_195002-2iykpb0m/logs
Traceback (most recent call last):
  File "CityCode/main.py", line 199, in <module>
    main(-1, 1, config, args)
  File "CityCode/main.py", line 116, in main
  File "/home/PS-MT/CityCode/Base/base_trainer.py", line 145, in train
    _ = self._warm_up(epoch, id=1)
  File "/homedjy/PS-MT/CityCode/train.py", line 173, in _warm_up
    curr_iter=batch_idx, epoch=epoch-1, id=id, warm_up=True)
  File "/home/.conda/envs/ps-mt/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/.conda/envs/ps-mt/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 155, in forward
    "them on device: {}".format(self.src_device_obj, t.device))
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cpu

I tried to fix it, but without success. I want to use GPUs 5, 6, 7, 8 because GPUs 0, 1, 2, 3 are occupied. But when I print availble_gpus, it is still [0, 1, 2, 3].
I can train the model on VOC in the same setup.
Do you have any ideas?
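
A note on the printed GPU list: if the GPUs are selected through the CUDA_VISIBLE_DEVICES environment variable (an assumption; the thread does not show how train_city.sh selects devices), CUDA renumbers the visible devices starting from 0, so a list of [0, 1, 2, 3] does not by itself mean the wrong GPUs are in use. A minimal sketch of that renumbering, using a hypothetical helper visible_gpu_mapping that is not part of the repository:

```python
import os


def visible_gpu_mapping(env=None):
    """Map logical CUDA indices (what the process sees) to physical GPU ids.

    Hypothetical helper for illustration only. Returns None when
    CUDA_VISIBLE_DEVICES is unset, i.e. no renumbering is requested.
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None
    # Entries are kept as strings because the variable may also hold UUIDs.
    physical = [p.strip() for p in raw.split(",") if p.strip()]
    # Logical index 0 is the first listed device, 1 the second, and so on.
    return dict(enumerate(physical))


if __name__ == "__main__":
    print(visible_gpu_mapping({"CUDA_VISIBLE_DEVICES": "4,5,6,7"}))
```

With the example value "4,5,6,7", logical device 0 corresponds to physical GPU 4, which is why the process reports indices starting at 0.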

I fixed it. Comparing the code of the VOC main and the Cityscapes main, I found that the Cityscapes main lacks a line of code for DDP:
args.ddp = True if args.gpus > 1 else False
After adding this line, the model can be trained. Maybe your code is a test version, or I made a mistake.
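
For anyone hitting the same error, here is a rough sketch of where such a line could sit. Only args.gpus, args.ddp and the "--ddp" flag come from this thread; the parser layout and the parse_args function are assumptions, and the real CityCode/main.py may define its arguments differently.

```python
import argparse


def parse_args(argv=None):
    # Assumed argument names; the repository's parser may differ.
    parser = argparse.ArgumentParser()
    parser.add_argument("--gpus", type=int, default=1,
                        help="number of GPUs to train on")
    parser.add_argument("--ddp", action="store_true",
                        help="use DistributedDataParallel")
    args = parser.parse_args(argv)
    # The line from this thread: enable DDP whenever more than one GPU
    # is requested, regardless of whether --ddp was passed by hand.
    args.ddp = True if args.gpus > 1 else False
    return args


if __name__ == "__main__":
    print(parse_args(["--gpus", "4"]).ddp)
```

Note that, written this way, the derived value overrides a manually passed --ddp when only one GPU is requested.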

Glad to hear you solved it.
In our experiments, I added the flag "--ddp" manually, and I missed this line when I reorganized the code.

Thanks a lot for reporting it.

Hi @yyliu01, I trained with bash ./scripts/train_city.sh -l 372 -g 4 -b 50, but got this error:

Saving a checkpoint: saved/final_test/372_mIoU_0.6137_model_e10.pth ... 
EVAL ID (Model 1) (10) | PixelAcc: 0.9311, Mean IoU: 0.6137 |
Traceback (most recent call last):
  File "CityCode/main.py", line 203, in <module>
    mp.spawn(main, nprocs=config['n_gpu'], args=(config['n_gpu'], config, args))
  File "/home/imu_zhengyuan/.conda/envs/ps-mt/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/imu_zhengyuan/.conda/envs/ps-mt/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home/imu_zhengyuan/.conda/envs/ps-mt/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/imu_zhengyuan/.conda/envs/ps-mt/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/ PS-MT/CityCode/main.py", line 120, in main
  File "/home/ PS-MT/CityCode/Base/base_trainer.py", line 171, in train
  File "/home/ PS-MT/CityCode/Base/base_trainer.py", line 191, in _save_checkpoint
    upload_checkpoint(local_path=self.checkpoint_dir, prefix=pvc_dir, checkpoint_filepath=ckpt_name)
NameError: name 'upload_checkpoint' is not defined

In CityCode/Base/base_trainer.py, lines 187-194, I found that the following code is commented out in VOC, but not in Cityscapes.

Do you have any ideas? Maybe it is a test version of the code?

         pvc_dir = os.path.join("yy", "exercise_1", self.args.architecture,
                                "resnet{}_ckpt".format(str(self.args.backbone)), "city_cvpr_final",
         upload_checkpoint(local_path=self.checkpoint_dir, prefix=pvc_dir, checkpoint_filepath=ckpt_name)
         self.logger.info("Uploading current ckpt: mIoU_{}_model.pth to {}".format(str(state['monitor_best']),

Hi @Yan1026, please comment out that line; it is for uploading to Google Cloud and shouldn't be used for your training.

I apologize for the inconvenience, please reopen the issue if you have any questions.
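
As an alternative to commenting the call out by hand, the upload could be made optional. This is only a sketch: the module name psmt_cloud_upload and the wrapper maybe_upload are hypothetical, and only the upload_checkpoint keyword arguments are taken from the traceback above.

```python
import logging

logger = logging.getLogger(__name__)

try:
    # Hypothetical module name; the real upload helper is not in the repository.
    from psmt_cloud_upload import upload_checkpoint
except ImportError:
    upload_checkpoint = None


def maybe_upload(checkpoint_dir, prefix, ckpt_name):
    """Upload a checkpoint only if an upload helper is available.

    Returns True when the upload was attempted, False when it was skipped.
    """
    if upload_checkpoint is None:
        logger.info("No upload helper found; keeping %s locally only.", ckpt_name)
        return False
    # Same keyword arguments as the call shown in the traceback.
    upload_checkpoint(local_path=checkpoint_dir, prefix=prefix,
                      checkpoint_filepath=ckpt_name)
    return True
```

With this guard, local training saves checkpoints as usual and simply skips the cloud step instead of raising NameError.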