Training on Cityscapes
Yan1026 opened this issue · 4 comments
Sorry to bother you.
I train with `bash ./scripts/train_city.sh -l 372 -g 4 -b 50`, but get this error:
```
availble_gpus= [0, 1, 2, 3]
  0%|          | 0/93 [00:00<?, ?it/s]
  0%|          | 0/93 [00:05<?, ?it/s]
wandb: Waiting for W&B process to finish... (failed 1).
wandb: - 0.000 MB of 0.000 MB uploaded (0.000 MB deduped)
wandb:
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /homedjy/PS-MT/wandb/offline-run-20220908_195002-2iykpb0m
wandb: Find logs at: ./wandb/offline-run-20220908_195002-2iykpb0m/logs
Traceback (most recent call last):
  File "CityCode/main.py", line 199, in <module>
    main(-1, 1, config, args)
  File "CityCode/main.py", line 116, in main
    trainer.train()
  File "/home/PS-MT/CityCode/Base/base_trainer.py", line 145, in train
    _ = self._warm_up(epoch, id=1)
  File "/homedjy/PS-MT/CityCode/train.py", line 173, in _warm_up
    curr_iter=batch_idx, epoch=epoch-1, id=id, warm_up=True)
  File "/home/.conda/envs/ps-mt/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/.conda/envs/ps-mt/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 155, in forward
    "them on device: {}".format(self.src_device_obj, t.device))
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cpu
```
I tried to fix it, but it had no effect. I want to use GPUs 5, 6, 7, and 8, because GPUs 0-3 are occupied, but when I print `availble_gpus` it is still [0, 1, 2, 3].
I can train the model on VOC in the same setup.
Do you have any ideas?
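(For context, the printed IDs staying [0, 1, 2, 3] is expected: CUDA renumbers whatever `CUDA_VISIBLE_DEVICES` exposes from zero inside the process. A minimal sketch, with the torch-side effects noted only in comments so the snippet stands alone:)

```python
import os

# Must be set BEFORE any CUDA initialisation (e.g. before `import torch`):
# only the physical GPUs 5-8 become visible to this process.
os.environ["CUDA_VISIBLE_DEVICES"] = "5,6,7,8"

# Inside the process the visible devices are renumbered from zero, so
# torch.cuda.device_count() would report 4 and `cuda:0` would be physical
# GPU 5 -- which is why availble_gpus still prints [0, 1, 2, 3].
visible = os.environ["CUDA_VISIBLE_DEVICES"].split(",")
logical_ids = list(range(len(visible)))  # -> [0, 1, 2, 3]
```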
I fixed it. Comparing the VOC code with the Cityscapes code, I found that the Cityscapes `main` lacks one line related to DDP:

```python
args.ddp = True if args.gpus > 1 else False
```

After adding this line, the model can be trained. Maybe your code is a test version, or I made a mistake.
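For context, a sketch of what this flag gates. The dispatch shape is inferred from the two tracebacks in this thread (`main(-1, 1, config, args)` for the single-process path, `mp.spawn(...)` for the DDP path); `args` and `config` here are stand-ins, not the exact PS-MT objects:

```python
from types import SimpleNamespace

# Stand-ins for the objects argparse builds in CityCode/main.py.
args = SimpleNamespace(gpus=4)
config = {"n_gpu": args.gpus}

# The missing line: use DistributedDataParallel whenever >1 GPU is requested.
args.ddp = True if args.gpus > 1 else False

if args.ddp:
    # Multi-GPU path, as in the second traceback:
    #   mp.spawn(main, nprocs=config['n_gpu'], args=(config['n_gpu'], config, args))
    launch = "mp.spawn"
else:
    # Single-process path, as in the first traceback:
    #   main(-1, 1, config, args)
    launch = "direct"
```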
Glad to hear you solved it.
In our experiments, I added the "--ddp" flag manually, and I missed this line when I re-organized the code.
Thanks a lot for reporting it.
Hi @yyliu01, I train with `bash ./scripts/train_city.sh -l 372 -g 4 -b 50`, but get this error:
```
Saving a checkpoint: saved/final_test/372_mIoU_0.6137_model_e10.pth ...
EVAL ID (Model 1) (10) | PixelAcc: 0.9311, Mean IoU: 0.6137 |
Traceback (most recent call last):
  File "CityCode/main.py", line 203, in <module>
    mp.spawn(main, nprocs=config['n_gpu'], args=(config['n_gpu'], config, args))
  File "/home/imu_zhengyuan/.conda/envs/ps-mt/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/imu_zhengyuan/.conda/envs/ps-mt/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home/imu_zhengyuan/.conda/envs/ps-mt/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/imu_zhengyuan/.conda/envs/ps-mt/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/ PS-MT/CityCode/main.py", line 120, in main
    trainer.train()
  File "/home/ PS-MT/CityCode/Base/base_trainer.py", line 171, in train
    self._save_checkpoint(epoch)
  File "/home/ PS-MT/CityCode/Base/base_trainer.py", line 191, in _save_checkpoint
    upload_checkpoint(local_path=self.checkpoint_dir, prefix=pvc_dir, checkpoint_filepath=ckpt_name)
NameError: name 'upload_checkpoint' is not defined
```
About `CityCode/Base/base_trainer.py`, lines 187-194: I found that the following code is commented out in the VOC version, but not in the Cityscapes version.
Do you have any ideas? Maybe it is a test version of the code?
```python
pvc_dir = os.path.join("yy", "exercise_1", self.args.architecture,
                       "resnet{}_ckpt".format(str(self.args.backbone)), "city_cvpr_final",
                       str(self.args.labeled_examples))
upload_checkpoint(local_path=self.checkpoint_dir, prefix=pvc_dir, checkpoint_filepath=ckpt_name)
self.logger.info("Uploading current ckpt: mIoU_{}_model.pth to {}".format(str(state['monitor_best']),
```
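Until that block is commented out as in the VOC version, one defensive workaround is to guard the call, since `upload_checkpoint` appears to be an internal upload helper that is not shipped with the public repo. A minimal sketch: only the helper's name comes from the traceback, everything else is assumed:

```python
def maybe_upload_checkpoint(local_path, prefix, checkpoint_filepath):
    """Call the internal upload helper only if it actually exists."""
    uploader = globals().get("upload_checkpoint")
    if uploader is None:
        # Public release: no uploader is defined; the checkpoint under
        # `local_path` has already been saved locally, so just skip.
        return False
    uploader(local_path=local_path, prefix=prefix,
             checkpoint_filepath=checkpoint_filepath)
    return True
```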