ENTK seems to break after Summit OS Update
wjlei1990 opened this issue · 11 comments
Hi,
After the Summit OS update this week, I always get this error after submitting jobs:
EnTK session: re.session.login3.lei.018866.0003
Creating AppManagerSetting up RabbitMQ system ok
ok
Validating and assigning resource manager ok
Setting up RabbitMQ system n/a
new session: [re.session.login3.lei.018866.0003] \
database : [mongodb://hpcw-pr:****@129.114.17.185:27017/hpcw-pr] ok
create pilot manager ok
submit 1 pilot(s)
pilot.0000 ornl.summit 336 cores 12 gpus ok
closing session re.session.login3.lei.018866.0003 \
close pilot manager \
wait for 1 pilot(s)
0 ok
ok
session lifetime: 176.7s ok
wait for 1 pilot(s)
0 timeout
Execution failed, error: 'NoneType' object has no attribute '_uid'
Traceback (most recent call last):
File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/appman/appmanager.py", line 438, in run
self._rmgr.submit_resource_request()
File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/execman/rp/resource_manager.py", line 199, in submit_resource_request
self._pilot.wait([rp.PMGR_ACTIVE, rp.DONE, rp.FAILED, rp.CANCELED])
File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/pilot/pilot.py", line 558, in wait
time.sleep(0.1)
KeyboardInterrupt
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "entk.hrlee.py", line 184, in main
appman.run()
File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/appman/appmanager.py", line 462, in run
self.terminate()
File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/appman/appmanager.py", line 507, in terminate
write_session_description(self)
File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/utils/prof_utils.py", line 148, in write_session_description
tree[amgr._uid]['children'].append(wfp._uid)
AttributeError: 'NoneType' object has no attribute '_uid'
I re-installed erlang (Erlang/OTP 25 [DEVELOPMENT] [erts-12.0.3]) and rabbitmq_server-3.9.4. After the system update, the old versions installed on Summit broke and no longer work.
Any thoughts on the failure? Is it my issue or an entk issue?
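A note on reading the output above: the final AttributeError is raised while the KeyboardInterrupt from the pilot wait is still being handled, so it looks like a secondary failure during cleanup that hides the first one. The sketch below reproduces that pattern with hypothetical stand-ins (Manager, _worker and exception_chain are illustrative names, not EnTK code) and walks the exception chain back to the first error.

```python
class Manager:
    """Hypothetical stand-in for an application manager; not EnTK code."""

    def __init__(self):
        self._uid = "amgr.0000"
        self._worker = None  # would only be set later in a successful run

    def run(self):
        try:
            # Stands in for the interrupted wait on the pilot.
            raise KeyboardInterrupt("pilot wait interrupted")
        except KeyboardInterrupt:
            # Cleanup runs while the first exception is still being handled.
            self.terminate()
            raise

    def terminate(self):
        # Fails with AttributeError because _worker is still None.
        return [self._uid, self._worker._uid]


def exception_chain(exc):
    """Return exception type names from the last error back to the first."""
    names = []
    while exc is not None:
        names.append(type(exc).__name__)
        exc = exc.__cause__ or exc.__context__
    return names


try:
    Manager().run()
except AttributeError as err:
    chain = exception_chain(err)
    print(chain)  # ['AttributeError', 'KeyboardInterrupt']
```

Read this way, the '_uid' message is probably a symptom, and the part worth chasing is why the pilot wait ended early.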
(summit-entk) lei@login3 /gpfs/alpine/world-shared/geo111/lei/entk.small $ radical-stack
python : /ccs/home/lei/.conda/envs/summit-entk/bin/python3
pythonpath : /sw/summit/xalt/1.2.1/site:/sw/summit/xalt/1.2.1/libexec
version : 3.7.6
virtualenv : summit-entk
radical.analytics : 1.6.7
radical.entk : 1.6.7
radical.gtod : 1.6.7
radical.pilot : 1.6.7
radical.saga : 1.6.10
radical.utils : 1.6.7
New error occurred!
EnTK session: re.session.login3.lei.018871.0000
Creating AppManagerSetting up RabbitMQ system ok
ok
Validating and assigning resource manager ok
Setting up RabbitMQ system n/a
new session: [re.session.login3.lei.018871.0000] \
database : [mongodb://hpcw-pr:****@129.114.17.185:27017/hpcw-pr] err
Execution failed, error: 'NoneType' object has no attribute '_uid'
Traceback (most recent call last):
File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/pilot/session.py", line 183, in _initialize_primary
cfg=self._cfg, log=self._log)
File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/pilot/db/database.py", line 49, in __init__
self._mongo, self._db, _, _, _ = ru.mongodb_connect(str(dburl))
File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/utils/misc.py", line 135, in mongodb_connect
db.authenticate(user, pwd)
File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/pymongo/database.py", line 1471, in authenticate
connect=True)
File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/pymongo/mongo_client.py", line 750, in _cache_credentials
writable_preferred_server_selector)
File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/pymongo/topology.py", line 235, in select_server
address))
File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/pymongo/topology.py", line 193, in select_servers
selector, server_timeout, address)
File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/pymongo/topology.py", line 209, in _select_servers_loop
self._error_message(selector))
pymongo.errors.ServerSelectionTimeoutError: 129.114.17.185:27017: [Errno 111] Connection refused
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/execman/rp/resource_manager.py", line 157, in submit_resource_request
self._session = rp.Session(uid=self._sid)
File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/pilot/session.py", line 153, in __init__
self._initialize_primary(dburl)
File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/pilot/session.py", line 198, in _initialize_primary
dburl_no_passwd) from e
RuntimeError: session create failed [mongodb://hpcw-pr:****@129.114.17.185:27017/hpcw-pr]
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/appman/appmanager.py", line 438, in run
self._rmgr.submit_resource_request()
File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/execman/rp/resource_manager.py", line 216, in submit_resource_request
raise EnTKError(ex) from ex
radical.entk.exceptions.EnTKError: session create failed [mongodb://hpcw-pr:****@129.114.17.185:27017/hpcw-pr]
During handling of the above exception, another exception occurred:
Does this mean the MongoDB server is down?
database : [mongodb://hpcw-pr:****@129.114.17.185:27017/hpcw-pr] err
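"Connection refused" in the pymongo error means the TCP connection to the MongoDB port was rejected, which can be checked independently of EnTK. A minimal sketch using only the standard library; port_is_open is a hypothetical helper name, and the host and port are the ones from the log above.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False


if __name__ == "__main__":
    # Host and port taken from the database URL in the log.
    print(port_is_open("129.114.17.185", 27017))
```

A False result only says the port is unreachable from this machine; it does not distinguish a stopped server from a firewall rule.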
Hi @wjlei1990, unfortunately 129.114.17.185 is offline at the moment. I am working on bringing it back up ASAP. Meanwhile, could you try to use our services deployed at ORNL? @lee212, could you give Wenjie the details of the endpoints and how to use them?
@wjlei1990, I sent the information over Slack; let me know if you didn't receive the notification or have an issue with it.
Thanks for the mongodb update. This part now works.
However, EnTK still has some issues running the job on Summit:
EnTK session: re.session.login5.lei.018874.0003
Creating AppManagerSetting up RabbitMQ system ok
ok
Validating and assigning resource manager ok
Setting up RabbitMQ system n/a
new session: [re.session.login5.lei.018874.0003] \
database : [mongodb://rct:****@apps.marble.ccs.ornl.gov:32020/rct_test] ok
create pilot manager ok
submit 1 pilot(s)
pilot.0000 ornl.summit 336 cores 12 gpus ok
closing session re.session.login5.lei.018874.0003 \
close pilot manager \
wait for 1 pilot(s)
0 ok
ok
session lifetime: 156.5s ok
wait for 1 pilot(s)
0 timeout
Execution failed, error: 'NoneType' object has no attribute '_uid'
Traceback (most recent call last):
File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/appman/appmanager.py", line 438, in run
self._rmgr.submit_resource_request()
File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/execman/rp/resource_manager.py", line 199, in submit_resource_request
self._pilot.wait([rp.PMGR_ACTIVE, rp.DONE, rp.FAILED, rp.CANCELED])
File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/pilot/pilot.py", line 558, in wait
time.sleep(0.1)
KeyboardInterrupt
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "entk.hrlee.py", line 187, in main
appman.run()
File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/appman/appmanager.py", line 462, in run
self.terminate()
File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/appman/appmanager.py", line 507, in terminate
write_session_description(self)
File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/utils/prof_utils.py", line 148, in write_session_description
tree[amgr._uid]['children'].append(wfp._uid)
AttributeError: 'NoneType' object has no attribute '_uid'
This is what I got from the EnTK sandbox:
ls re.session.login5.lei.018874.0003/pilot.0000/
agent.0.cfg bootstrap_0.err bootstrap_0.out bootstrap_0.sh deactivate env.orig
The job seems to be running in the queue on Summit. However, entk seems to fail at launching tasks.
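When a pilot sandbox contains only bootstrap files, as in the listing above, the bootstrap error log is a natural first place to look. A small sketch for collecting non-empty error logs from a sandbox directory; nonempty_error_logs is a hypothetical helper, and the assumption that useful messages end up in *.err files comes only from the file names shown in the listing.

```python
from pathlib import Path


def nonempty_error_logs(sandbox, tail=20):
    """Return {file name: last `tail` lines} for each non-empty *.err file."""
    result = {}
    for path in sorted(Path(sandbox).glob("*.err")):
        text = path.read_text(errors="replace")
        if text.strip():
            result[path.name] = text.splitlines()[-tail:]
    return result


if __name__ == "__main__":
    # Sandbox path taken from the listing above.
    logs = nonempty_error_logs("re.session.login5.lei.018874.0003/pilot.0000")
    for name, lines in logs.items():
        print(name)
        for line in lines:
            print("   ", line)
```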
Update: a release has been made, and the PyPI installation now reflects this update. You can get the new radical.pilot with:
pip install --upgrade radical.pilot
This issue has been addressed and merged into the devel branch; the fix will be in the next release. In the meantime, would you be able to install radical.pilot from the GitHub repo?
To install the devel branch, remove your current installation first and then install from the git repo:
pip uninstall radical.pilot
pip install git+https://github.com/radical-cybertools/radical.pilot@devel
The specific PR merged to devel is radical-cybertools/radical.pilot#2439
After testing it on Summit, the task directory is now generated in pilot.0000. However, tasks are still not running properly. In the task.0000.err file, I found this:
Failed to bind process to ERF smt array, err: Invalid argument
Failed to bind process to ERF smt array, err: Invalid argument
Failed to bind process to ERF smt array, err: Invalid argument
Failed to bind process to ERF smt array, err: Invalid argument
Failed to bind process to ERF smt array, err: Invalid argument
This is my script for creating task:
from radical.entk import Task

def create_task(task_dir):
    # mpi_per_task is defined elsewhere in the script
    t1 = Task()
    t1.pre_exec = [
        'cd {}'.format(task_dir),
        'unset CUDA_VISIBLE_DEVICES',
        "export OMP_NUM_THREADS=1"
    ]
    t1.executable = './bin/xspecfem3D'
    t1.cpu_reqs = {
        'cpu_processes': mpi_per_task,
        'cpu_process_type': 'MPI',
        'cpu_threads': 4,
        'cpu_thread_type': 'OpenMP'}
    t1.gpu_reqs = {
        'gpu_processes': 1,
        'gpu_process_type': None,
        'gpu_threads': 1,
        'gpu_thread_type': 'CUDA'}
    return t1
Is anything wrong with the task config?
Hi @wjlei1990 - we were informed by another user that the ERF format is broken on Summit at the moment. Tickets have been opened with IBM, but I am afraid we have no ETA for a fix so far.
@andre-merzky thanks for the update!
My entk script is located here:
/gpfs/alpine/world-shared/geo111/lei/entk.small/entk.hrlee.py
@wjlei1990, please try to use or replicate this script; it should let you avoid the ERF error:
/gpfs/alpine/world-shared/geo111/hrlee/entk.hrlee.py
It uses jsrun arguments to specify resource sets as a temporary fix. We will eventually try pmix/prrte, though.
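For readers without access to that path, here is a rough sketch of what "specifying resource sets through jsrun arguments" can look like. It is not the content of the referenced script: jsrun_command is a hypothetical helper and the numbers are made-up example values. Only the flags themselves (-n for the number of resource sets, -a for tasks per resource set, -c for CPU cores per resource set, -g for GPUs per resource set) are standard jsrun options.

```python
import shlex


def jsrun_command(executable, n_resource_sets,
                  tasks_per_rs=1, cpus_per_rs=1, gpus_per_rs=0):
    """Build a jsrun command line that requests resource sets explicitly."""
    if n_resource_sets < 1:
        raise ValueError("need at least one resource set")
    return [
        "jsrun",
        "-n", str(n_resource_sets),  # number of resource sets
        "-a", str(tasks_per_rs),     # MPI tasks per resource set
        "-c", str(cpus_per_rs),      # CPU cores per resource set
        "-g", str(gpus_per_rs),      # GPUs per resource set
        executable,
    ]


# Example values only: 6 resource sets, each with 1 task, 4 cores and 1 GPU.
cmd = jsrun_command("./bin/xspecfem3D", n_resource_sets=6,
                    cpus_per_rs=4, gpus_per_rs=1)
print(" ".join(shlex.quote(part) for part in cmd))
# jsrun -n 6 -a 1 -c 4 -g 1 ./bin/xspecfem3D
```

How such a command is wired into the EnTK task (executable and arguments) is not shown in this thread, so that part is left out here.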
Thanks~
Issue resolved! Ticket can be closed in today's meeting.