Issues (?) when submitting jobs
wjlei1990 opened this issue · 4 comments
Hi, I keep running into this issue when using EnTK on Summit.
It does not happen every time, but it shows up intermittently and I am not sure what causes it. Could you help us figure it out?
EnTK session: re.session.login3.lei.018844.0009
Creating AppManagerSetting up RabbitMQ system ok
ok
Validating and assigning resource manager ok
Setting up RabbitMQ system n/a
new session: [re.session.login3.lei.018844.0009] \
database : [mongodb://hpcw-pr:****@129.114.17.185:27017/hpcw-pr] ok
create pilot manager ok
submit 1 pilot(s)
Execution failed, error: 'NoneType' object has no attribute '_uid'
Traceback (most recent call last):
File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/execman/rp/resource_manager.py", line 179, in submit_resource_request
self._pilot = self._pmgr.submit_pilots(pdesc)
File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/pilot/pilot_manager.py", line 603, in submit_pilots
pilot = Pilot(pmgr=self, descr=pd)
File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/pilot/pilot.py", line 109, in __init__
self._resource_sandbox = self._session._get_resource_sandbox(pilot)
File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/pilot/session.py", line 739, in _get_resource_sandbox
shell = self.get_js_shell(resource, schema)
File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/pilot/session.py", line 785, in get_js_shell
shell = rsup.PTYShell(js_url, self)
File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/saga/utils/pty_shell.py", line 247, in __init__
interactive=self.interactive)
File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/saga/utils/pty_shell_factory.py", line 206, in initialize
self._initialize_pty(info['pty'], info)
File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/saga/utils/pty_shell_factory.py", line 427, in _initialize_pty
raise ptye.translate_exception (e) from e
File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/saga/utils/pty_shell_factory.py", line 300, in _initialize_pty
raise rse.NoSuccess("Could not detect shell prompt (timeout)")
radical.saga.exceptions.NoSuccess: Could not detect shell prompt (timeout) (/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/saga/utils/pty_shell_factory.py +300 (_initialize_pty) : raise rse.NoSuccess("Could not detect shell prompt (timeout)"))
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/appman/appmanager.py", line 438, in run
self._rmgr.submit_resource_request()
File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/execman/rp/resource_manager.py", line 216, in submit_resource_request
raise EnTKError(ex) from ex
radical.entk.exceptions.EnTKError: Could not detect shell prompt (timeout) (/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/saga/utils/pty_shell_factory.py +300 (_initialize_pty) : raise rse.NoSuccess("Could not detect shell prompt (timeout)"))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "run_entk.py", line 102, in main
appman.run()
File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/appman/appmanager.py", line 468, in run
self.terminate()
File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/appman/appmanager.py", line 507, in terminate
write_session_description(self)
File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/utils/prof_utils.py", line 148, in write_session_description
tree[amgr._uid]['children'].append(wfp._uid)
AttributeError: 'NoneType' object has no attribute '_uid'
Another failed run:
EnTK session: re.session.login5.lei.018844.0010
Creating AppManagerSetting up RabbitMQ system ok
ok
Validating and assigning resource manager ok
Setting up RabbitMQ system n/a
new session: [re.session.login5.lei.018844.0010] \
database : [mongodb://hpcw-pr:****@129.114.17.185:27017/hpcw-pr] ok
create pilot manager ok
submit 1 pilot(s)
pilot.0000 ornl.summit 215040 cores 7680 gpus ok
closing session re.session.login5.lei.018844.0010 \
close pilot manager \
wait for 1 pilot(s)
0 ok
ok
session lifetime: 59.4s ok
wait for 1 pilot(s)
0 timeout
Execution failed, error: 'NoneType' object has no attribute '_uid'
Traceback (most recent call last):
File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/appman/appmanager.py", line 438, in run
self._rmgr.submit_resource_request()
File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/execman/rp/resource_manager.py", line 199, in submit_resource_request
self._pilot.wait([rp.PMGR_ACTIVE, rp.DONE, rp.FAILED, rp.CANCELED])
File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/pilot/pilot.py", line 558, in wait
time.sleep(0.1)
KeyboardInterrupt
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "run_entk.py", line 103, in main
appman.run()
File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/appman/appmanager.py", line 462, in run
self.terminate()
File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/appman/appmanager.py", line 507, in terminate
write_session_description(self)
File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/utils/prof_utils.py", line 148, in write_session_description
tree[amgr._uid]['children'].append(wfp._uid)
AttributeError: 'NoneType' object has no attribute '_uid'
radical-stack
python : /ccs/home/lei/.conda/envs/summit-entk/bin/python3
pythonpath : /sw/summit/xalt/1.2.1/site:/sw/summit/xalt/1.2.1/libexec
version : 3.7.6
virtualenv : summit-entk
radical.analytics : 1.6.7
radical.entk : 1.6.7
radical.gtod : 1.6.7
radical.pilot : 1.6.7
radical.saga : 1.6.10
radical.utils : 1.6.7
This issue seems to be related to my .bashrc file.
I had a very lengthy .bashrc that takes a while to load on Summit. I think Summit was slow today, so loading it took longer than usual.
After cleaning the bashrc file a bit, I can get most of the jobs submitted.
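For anyone who wants to check whether shell startup is the bottleneck, here is a minimal sketch that times how long an interactive bash takes to start and exit. The helper name `time_shell_startup` is made up for illustration, and the assumption is that `bash -i` loads the same startup files (e.g. `~/.bashrc`) that slow down the prompt detection; the exact way radical.saga spawns its shell may differ.

```python
import subprocess
import time


def time_shell_startup(cmd=("bash", "-i", "-c", "true"), timeout=120.0):
    """Return the wall-clock seconds `cmd` needs to start and exit.

    Hypothetical helper, not part of any RADICAL package. The default
    command starts an interactive bash (which reads ~/.bashrc), runs
    `true`, and exits, so the elapsed time is dominated by startup files.
    """
    start = time.perf_counter()
    subprocess.run(
        cmd,
        check=True,                  # raise if the shell exits non-zero
        timeout=timeout,             # give up if startup hangs
        stdin=subprocess.DEVNULL,    # no terminal input is needed
        stdout=subprocess.DEVNULL,   # discard anything .bashrc prints
        stderr=subprocess.DEVNULL,
    )
    return time.perf_counter() - start
```

Calling `print(time_shell_startup())` on the login node a few times, before and after trimming the `.bashrc`, gives a rough idea of how close the startup time comes to the prompt-detection timeout.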
Thanks for the help.
Let me know if you have updates in the future.
@wjlei1990 - you could try to set this env variable on the client side, before running your script:
export RADICAL_SAGA_PTY_SSH_TIMEOUT=60
The default timeout is 10 seconds, which may indeed be too short if your bash startup takes too long.
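If exporting the variable in the shell is inconvenient, the same thing can probably be done at the top of the driver script. This is only a sketch: the helper `ensure_pty_timeout` is hypothetical, and it assumes the variable has to be in the environment before the RADICAL modules read it, so it is set before `radical.entk` is imported.

```python
import os

# Name of the variable suggested above.
PTY_TIMEOUT_VAR = "RADICAL_SAGA_PTY_SSH_TIMEOUT"


def ensure_pty_timeout(seconds=60):
    """Set the PTY timeout variable unless the user already exported one.

    Hypothetical helper, not part of any RADICAL package. Returns the
    value that is in effect after the call, as a string.
    """
    # setdefault keeps a value exported in the shell and only fills in
    # the fallback when the variable is missing.
    return os.environ.setdefault(PTY_TIMEOUT_VAR, str(seconds))


effective = ensure_pty_timeout(60)

# Assumption: import the RADICAL packages only after this point, e.g.
# import radical.entk
```

Using `setdefault` rather than a plain assignment means a value exported on the command line still wins, which makes it easy to try a longer timeout without editing the script.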