
Issues (?) when submitting jobs

wjlei1990 opened this issue · 4 comments

Hi, I keep running into this issue when using EnTK on Summit.

It doesn't always happen, but sometimes it just pops up, and I am not sure what causes it. Could you help us figure it out?

EnTK session: re.session.login3.lei.018844.0009
Creating AppManagerSetting up RabbitMQ system                                 ok
Validating and assigning resource manager                                     ok
Setting up RabbitMQ system                                                   n/a
new session: [re.session.login3.lei.018844.0009]                               \
database   : [mongodb://hpcw-pr:****@]            ok
create pilot manager                                                          ok
submit 1 pilot(s)Execution failed, error: 'NoneType' object has no attribute '_uid'
Traceback (most recent call last):
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/execman/rp/resource_manager.py", line 179, in submit_resource_request
    self._pilot = self._pmgr.submit_pilots(pdesc)
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/pilot/pilot_manager.py", line 603, in submit_pilots
    pilot = Pilot(pmgr=self, descr=pd)
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/pilot/pilot.py", line 109, in __init__
    self._resource_sandbox = self._session._get_resource_sandbox(pilot)
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/pilot/session.py", line 739, in _get_resource_sandbox
    shell = self.get_js_shell(resource, schema)
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/pilot/session.py", line 785, in get_js_shell
    shell = rsup.PTYShell(js_url, self)
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/saga/utils/pty_shell.py", line 247, in __init__
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/saga/utils/pty_shell_factory.py", line 206, in initialize
    self._initialize_pty(info['pty'], info)
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/saga/utils/pty_shell_factory.py", line 427, in _initialize_pty
    raise ptye.translate_exception (e) from e
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/saga/utils/pty_shell_factory.py", line 300, in _initialize_pty
    raise rse.NoSuccess("Could not detect shell prompt (timeout)")
radical.saga.exceptions.NoSuccess: Could not detect shell prompt (timeout) (/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/saga/utils/pty_shell_factory.py +300 (_initialize_pty)  :  raise rse.NoSuccess("Could not detect shell prompt (timeout)"))

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/appman/appmanager.py", line 438, in run
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/execman/rp/resource_manager.py", line 216, in submit_resource_request
    raise EnTKError(ex) from ex
radical.entk.exceptions.EnTKError: Could not detect shell prompt (timeout) (/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/saga/utils/pty_shell_factory.py +300 (_initialize_pty)  :  raise rse.NoSuccess("Could not detect shell prompt (timeout)"))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "run_entk.py", line 102, in main
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/appman/appmanager.py", line 468, in run
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/appman/appmanager.py", line 507, in terminate
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/utils/prof_utils.py", line 148, in write_session_description
AttributeError: 'NoneType' object has no attribute '_uid'

Another failed run:

EnTK session: re.session.login5.lei.018844.0010
Creating AppManagerSetting up RabbitMQ system                                 ok
Validating and assigning resource manager                                     ok
Setting up RabbitMQ system                                                   n/a
new session: [re.session.login5.lei.018844.0010]                               \
database   : [mongodb://hpcw-pr:****@]            ok
create pilot manager                                                          ok
submit 1 pilot(s)
        pilot.0000   ornl.summit          215040 cores    7680 gpus           ok
closing session re.session.login5.lei.018844.0010                              \
close pilot manager                                                            \
wait for 1 pilot(s)
              0                                                               ok
session lifetime: 59.4s                                                       ok
wait for 1 pilot(s)
              0                                                          timeout
Execution failed, error: 'NoneType' object has no attribute '_uid'
Traceback (most recent call last):
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/appman/appmanager.py", line 438, in run
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/execman/rp/resource_manager.py", line 199, in submit_resource_request
    self._pilot.wait([rp.PMGR_ACTIVE, rp.DONE, rp.FAILED, rp.CANCELED])
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/pilot/pilot.py", line 558, in wait

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "run_entk.py", line 103, in main
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/appman/appmanager.py", line 462, in run
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/appman/appmanager.py", line 507, in terminate
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/utils/prof_utils.py", line 148, in write_session_description
AttributeError: 'NoneType' object has no attribute '_uid'

  python               : /ccs/home/lei/.conda/envs/summit-entk/bin/python3
  pythonpath           : /sw/summit/xalt/1.2.1/site:/sw/summit/xalt/1.2.1/libexec
  version              : 3.7.6
  virtualenv           : summit-entk

  radical.analytics    : 1.6.7
  radical.entk         : 1.6.7
  radical.gtod         : 1.6.7
  radical.pilot        : 1.6.7
  radical.saga         : 1.6.10
  radical.utils        : 1.6.7

This issue seems to be related to my .bashrc file.

I had a very lengthy .bashrc that takes a while to load on Summit. I think Summit was rather slow today, so it took longer than usual to load the .bashrc file.

After cleaning up the .bashrc file a bit, I can get most of the jobs submitted.

Thanks for the help.

Let me know if you have updates in the future.

@wjlei1990 - you could try to set this env variable on the client side, before running your script:


The default timeout is 10 seconds, which may indeed be too short if your bash startup takes too long.
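
To check whether shell startup is anywhere near that limit, something like the following rough sketch might help. It times an interactive bash start (which sources .bashrc) and compares it against the 10-second default mentioned above. Note the assumptions: `time_startup` and `PROMPT_TIMEOUT_S` are hypothetical names made up for this example, and `bash -i -c exit` only approximates what the PTY shell layer does when it waits for a prompt, so treat the number as an indication rather than an exact reproduction.

```python
import shutil
import subprocess
import time

# Default prompt-detection timeout quoted in this thread (seconds).
PROMPT_TIMEOUT_S = 10.0


def time_startup(cmd, limit=PROMPT_TIMEOUT_S):
    """Run `cmd` to completion and return (elapsed_seconds, within_limit).

    Hypothetical helper for illustration only. The command is abandoned
    after 3x the limit so that a hanging startup file cannot block forever.
    """
    start = time.monotonic()
    try:
        subprocess.run(
            cmd,
            stdin=subprocess.DEVNULL,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=3 * limit,
        )
    except subprocess.TimeoutExpired:
        pass  # still report the time spent waiting
    elapsed = time.monotonic() - start
    return elapsed, elapsed < limit


if __name__ == "__main__" and shutil.which("bash"):
    # Assumption: an interactive bash is a fair stand-in for the shell
    # that gets spawned on the login node.
    elapsed, ok = time_startup(["bash", "-i", "-c", "exit"])
    verdict = "within" if ok else "over"
    print(f"interactive bash startup: {elapsed:.2f}s "
          f"({verdict} the {PROMPT_TIMEOUT_S:.0f}s limit)")
```

Running this a few times on a login node, before and after trimming the .bashrc, should show how much headroom is left. If the startup time is close to the limit on a good day, a slow filesystem could easily push it over.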