alibaba/EasyCV

Died with <Signals.SIGKILL: 9>. When first epoch ends, the program is killed

Cyanyanyan opened this issue · 1 comments

I am training on PAI with smart cache, using a config similar to metric_learning/imagenet_resnet50_1000kid_jpg.py. The program is always killed when the first epoch ends.
The error output is below:
Traceback (most recent call last):
  File "/home/pai/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/pai/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/pai/lib/python3.6/site-packages/easypai/torch/launch.py", line 428, in <module>
    main()
  File "/home/pai/lib/python3.6/site-packages/easypai/torch/launch.py", line 414, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/home/pai/lib/python3.6/site-packages/easypai/torch/launch.py", line 389, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/apsara/TempRoot/xxxxx/workspace/python_bin', '-u', 'tools/train.py', '--local_rank=7', 'configs/metric_learning/xxxxx.py', '--work_dir', 'oss://xxxxx/', '--load_from', '/data/oss_bucket_0/xxxxx/r50_imagenet_epoch_100.pth', '--launcher', 'pytorch', '--fp16']' died with <Signals.SIGKILL: 9>.

It seems some resource exceeded its quota, so the job was killed. Check your memory and CPU usage.
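
One rough way to check the memory side is to log the training process's peak resident memory at a few points, for example at the end of each epoch and before evaluation starts. The sketch below uses only the Python standard library and assumes a Unix worker. `peak_rss_mib` and `log_memory` are hypothetical helper names, not part of EasyCV or easypai. It reports the current process only, so memory held by data-loader worker processes is not included.

```python
import resource
import sys


def peak_rss_mib():
    """Return the peak resident set size of this process in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    if sys.platform == "darwin":
        return peak / (1024 * 1024)
    return peak / 1024


def log_memory(tag):
    """Print and return a one-line memory report (hypothetical helper)."""
    msg = f"[{tag}] peak RSS: {peak_rss_mib():.1f} MiB"
    print(msg, flush=True)
    return msg


if __name__ == "__main__":
    log_memory("startup")
```

If the logged value climbs towards the job's memory quota as the first epoch finishes, that would be consistent with the process being killed for exceeding it.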