EnTK does not terminate after user programs are complete
Weiming-Hu opened this issue · 2 comments
I have encountered this problem once. I'm using EnTK on NCAR Cheyenne. My radical stack info is as follows:
(venv) wuh20@cheyenne4:~> radical-stack
python : 3.7.5
pythonpath :
virtualenv : /glade/u/home/wuh20/venv
radical.entk : 1.5.5
radical.gtod : 1.5.0
radical.pilot : 1.5.7-v1.5.7-70-g03956fe3b@fix-mpiexec_mpt
radical.saga : 1.5.7
radical.utils : 1.5.7
My workflow consists of two stages. I requested 6480 CPUs for 480 minutes (8 hours). However, when the second stage was complete (roughly after 2 hours), EnTK did not terminate as expected; instead it hung.
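For context, here is a minimal sketch of how the two-stage workflow is put together. Only the resource label, core count, and walltime come from this run; the executables, arguments, per-task core layout, RabbitMQ endpoint, project, and queue are placeholders:

from radical.entk import Pipeline, Stage, Task, AppManager

p = Pipeline()

# Stage 1: analog generation. In this run every chunk file already existed,
# so no tasks were added and the stage was skipped entirely.

# Stage 2: power simulation, one task per domain chunk (placeholder details).
s_power = Stage()
s_power.name = 'stage-power'
for chunk in range(1, 31):
    t = Task()
    t.name       = 'power_domain-USA_chunk-{:02d}.cfg'.format(chunk)
    t.executable = 'simulate_power'              # placeholder executable
    t.arguments  = ['--config', t.name]          # placeholder arguments
    t.cpu_reqs   = {'processes': 216,            # placeholder core layout
                    'process_type': 'MPI',
                    'threads_per_process': 1,
                    'thread_type': None}
    s_power.add_tasks(t)
p.add_stages(s_power)

amgr = AppManager(hostname='my.rabbitmq.host', port=5672)   # placeholder broker
amgr.resource_desc = {'resource': 'ncar.cheyenne_mpt',
                      'walltime': 480,                       # 8 hours
                      'cpus'    : 6480,
                      'project' : 'PROJECT_ID',              # placeholder
                      'queue'   : 'regular'}                 # placeholder
amgr.workflow = [p]
amgr.run()

The console output from the run that hung: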
(venv) wuh20@cheyenne4:~/github/pv-workflow/02_simulate/use-entk> python python_me.py
EnTK session: re.session.cheyenne4.wuh20.018589.0005
Creating AppManagerSetting up RabbitMQ system ok
ok
Validating and assigning resource manager ok
/glade/scratch/wuh20/anen_solar_annual/analogs_domain-USA_chunk-01.nc exists. Skip generating this file.
/glade/scratch/wuh20/anen_solar_annual/analogs_domain-USA_chunk-02.nc exists. Skip generating this file.
/glade/scratch/wuh20/anen_solar_annual/analogs_domain-USA_chunk-03.nc exists. Skip generating this file.
/glade/scratch/wuh20/anen_solar_annual/analogs_domain-USA_chunk-04.nc exists. Skip generating this file.
/glade/scratch/wuh20/anen_solar_annual/analogs_domain-USA_chunk-05.nc exists. Skip generating this file.
/glade/scratch/wuh20/anen_solar_annual/analogs_domain-USA_chunk-06.nc exists. Skip generating this file.
/glade/scratch/wuh20/anen_solar_annual/analogs_domain-USA_chunk-07.nc exists. Skip generating this file.
/glade/scratch/wuh20/anen_solar_annual/analogs_domain-USA_chunk-08.nc exists. Skip generating this file.
/glade/scratch/wuh20/anen_solar_annual/analogs_domain-USA_chunk-09.nc exists. Skip generating this file.
/glade/scratch/wuh20/anen_solar_annual/analogs_domain-USA_chunk-10.nc exists. Skip generating this file.
/glade/scratch/wuh20/anen_solar_annual/analogs_domain-USA_chunk-11.nc exists. Skip generating this file.
/glade/scratch/wuh20/anen_solar_annual/analogs_domain-USA_chunk-12.nc exists. Skip generating this file.
/glade/scratch/wuh20/anen_solar_annual/analogs_domain-USA_chunk-13.nc exists. Skip generating this file.
/glade/scratch/wuh20/anen_solar_annual/analogs_domain-USA_chunk-14.nc exists. Skip generating this file.
/glade/scratch/wuh20/anen_solar_annual/analogs_domain-USA_chunk-15.nc exists. Skip generating this file.
/glade/scratch/wuh20/anen_solar_annual/analogs_domain-USA_chunk-16.nc exists. Skip generating this file.
/glade/scratch/wuh20/anen_solar_annual/analogs_domain-USA_chunk-17.nc exists. Skip generating this file.
/glade/scratch/wuh20/anen_solar_annual/analogs_domain-USA_chunk-18.nc exists. Skip generating this file.
/glade/scratch/wuh20/anen_solar_annual/analogs_domain-USA_chunk-19.nc exists. Skip generating this file.
/glade/scratch/wuh20/anen_solar_annual/analogs_domain-USA_chunk-20.nc exists. Skip generating this file.
/glade/scratch/wuh20/anen_solar_annual/analogs_domain-USA_chunk-21.nc exists. Skip generating this file.
/glade/scratch/wuh20/anen_solar_annual/analogs_domain-USA_chunk-22.nc exists. Skip generating this file.
/glade/scratch/wuh20/anen_solar_annual/analogs_domain-USA_chunk-23.nc exists. Skip generating this file.
/glade/scratch/wuh20/anen_solar_annual/analogs_domain-USA_chunk-24.nc exists. Skip generating this file.
/glade/scratch/wuh20/anen_solar_annual/analogs_domain-USA_chunk-25.nc exists. Skip generating this file.
/glade/scratch/wuh20/anen_solar_annual/analogs_domain-USA_chunk-26.nc exists. Skip generating this file.
/glade/scratch/wuh20/anen_solar_annual/analogs_domain-USA_chunk-27.nc exists. Skip generating this file.
/glade/scratch/wuh20/anen_solar_annual/analogs_domain-USA_chunk-28.nc exists. Skip generating this file.
/glade/scratch/wuh20/anen_solar_annual/analogs_domain-USA_chunk-29.nc exists. Skip generating this file.
/glade/scratch/wuh20/anen_solar_annual/analogs_domain-USA_chunk-30.nc exists. Skip generating this file.
Nothing to be done for the analog generation stage!
Setting up RabbitMQ system n/a
new session: [re.session.cheyenne4.wuh20.018589.0005] \
database : [mongodb://hpcw-psu:****@129.114.17.185:27017/hpcw-psu] ok
create pilot manager ok
submit 1 pilot(s)
pilot.0000 ncar.cheyenne_mpt 6480 cores 0 gpus ok
All components created
create unit managerUpdate: pipeline.0000 state: SCHEDULING
Update: pipeline.0000.stage-power state: SCHEDULING
Update: pipeline.0000.stage-power.task-power_domain-USA_chunk-06.cfg state: SCHEDULING
Update: pipeline.0000.stage-power.task-power_domain-USA_chunk-27.cfg state: SCHEDULING
[...]
Update: pipeline.0000.stage-power.task-power_domain-USA_chunk-16.cfg state: SCHEDULED
Update: pipeline.0000.stage-power.task-power_domain-USA_chunk-11.cfg state: SCHEDULED
Update: pipeline.0000.stage-power state: SCHEDULED
/glade/u/home/wuh20/venv/lib/python3.7/site-packages/pymongo/topology.py:162: UserWarning: MongoClient opened before fork. Create MongoClient only after forking. See PyMongo's documentation for details: https://pymongo.readthedocs.io/en/stable/faq.html#is-pymongo-fork-safe "MongoClient opened before fork. Create MongoClient only " ok
submit: ########################################################################
Update: pipeline.0000.stage-power.task-power_domain-USA_chunk-06.cfg state: SUBMITTING
Update: pipeline.0000.stage-power.task-power_domain-USA_chunk-27.cfg state: SUBMITTING
Update: pipeline.0000.stage-power.task-power_domain-USA_chunk-01.cfg state: SUBMITTING
Update: pipeline.0000.stage-power.task-power_domain-USA_chunk-22.cfg state: SUBMITTING
Update: pipeline.0000.stage-power.task-power_domain-USA_chunk-12.cfg state: SUBMITTING
[ctrl + C hit here]
close unit manager ok
Exception in thread Thread-12:
Traceback (most recent call last):
File "/glade/u/home/wuh20/venv/lib/python3.7/site-packages/radical/entk/execman/rp/task_manager.py", line 318, in _process_ta
sks
mq_connection.close()
File "/glade/u/home/wuh20/venv/lib/python3.7/site-packages/pika/adapters/blocking_connection.py", line 730, in close
self._flush_output(self._closed_result.is_ready)
File "/glade/u/home/wuh20/venv/lib/python3.7/site-packages/pika/adapters/blocking_connection.py", line 443, in _flush_output
raise exceptions.ConnectionClosed()
pika.exceptions.ConnectionClosed
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/glade/u/apps/ch/opt/python/3.7.5/gnu/8.3.0/lib/python3.7/threading.py", line 926, in _bootstrap_inner
self.run()
File "/glade/u/apps/ch/opt/python/3.7.5/gnu/8.3.0/lib/python3.7/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/glade/u/home/wuh20/venv/lib/python3.7/site-packages/radical/entk/execman/rp/task_manager.py", line 326, in _process_ta
sks
raise EnTKError(e) from e
radical.entk.exceptions.EnTKError
Process task-manager:
Traceback (most recent call last):
File "/glade/u/apps/ch/opt/python/3.7.5/gnu/8.3.0/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/glade/u/apps/ch/opt/python/3.7.5/gnu/8.3.0/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "/glade/u/home/wuh20/venv/lib/python3.7/site-packages/radical/entk/execman/rp/task_manager.py", line 177, in _tmgr
heartbeat_response(mq_channel)
File "/glade/u/home/wuh20/venv/lib/python3.7/site-packages/radical/entk/execman/rp/task_manager.py", line 113, in heartbeat_r
esponse
mq_channel.basic_get(queue=self._hb_request_q)
File "/glade/u/home/wuh20/venv/lib/python3.7/site-packages/pika/adapters/blocking_connection.py", line 2077, in basic_get
self._basic_getempty_result.is_ready)
File "/glade/u/home/wuh20/venv/lib/python3.7/site-packages/pika/adapters/blocking_connection.py", line 1292, in _flush_output
*waiters)
File "/glade/u/home/wuh20/venv/lib/python3.7/site-packages/pika/adapters/blocking_connection.py", line 458, in _flush_output
self._impl.ioloop.poll()
File "/glade/u/home/wuh20/venv/lib/python3.7/site-packages/pika/adapters/select_connection.py", line 495, in poll
self._poller.poll()
File "/glade/u/home/wuh20/venv/lib/python3.7/site-packages/pika/adapters/select_connection.py", line 1102, in poll
events = self._poll.poll(self._get_max_wait())
KeyboardInterrupt
wait for 1 pilot(s)
0 ok
closing session re.session.cheyenne4.wuh20.018589.0005 \
close pilot manager \
wait for 1 pilot(s)
0 ok
ok
+ re.session.cheyenne4.wuh20.018589.0005 (json)
+ pilot.0000 (profiles)
+ pilot.0000 (logfiles)
session lifetime: 26373.7s ok
All components terminated
EnTK was stuck at the submitting stage, but the tasks were complete with all the output ready. At some point, when I went back onto Cheyenne to check the progress, I found that my data had already been generated but EnTK was still running. When I ssh'd into one of the compute nodes, I noticed several rp.* processes sleeping. Please see the following output from top on one of the compute nodes while EnTK was stuck.
top - 21:45:05 up 41 days, 7:57, 1 user, load average: 0.07, 0.07, 0.04
Tasks: 738 total, 1 running, 737 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 65507008 total, 15499832 used, 50007176 free, 0 buffers
KiB Swap: 0 total, 0 used, 0 free. 3747756 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3162 wuh20 20 0 598648 19904 7476 S 0.664 0.030 0:32.48 rp.agent_stagin
3096 wuh20 20 0 1297296 39660 10192 S 0.332 0.061 2:21.38 rp.agent.0
3154 wuh20 20 0 598648 19812 7604 S 0.332 0.030 0:30.41 rp.agent_schedu
3182 wuh20 20 0 598648 19380 7192 S 0.332 0.030 0:33.35 rp.funcs_res_qu
3309 wuh20 20 0 1224428 36188 7676 S 0.332 0.055 0:34.67 rp.agent_schedu
3357 wuh20 20 0 1086900 36468 8280 S 0.332 0.056 0:38.32 rp.update.0000
45669 wuh20 20 0 16564 2400 1312 R 0.332 0.004 0:00.07 top
2447 wuh20 20 0 13128 2000 1648 S 0.000 0.003 0:00.00 bash
2453 wuh20 20 0 10708 472 356 S 0.000 0.001 0:00.02 pbs_demux
2457 wuh20 20 0 13260 1868 1512 S 0.000 0.003 0:00.00 bash
2459 wuh20 20 0 13928 2848 1740 S 0.000 0.004 0:02.54 bash
2552 wuh20 20 0 38724 5792 5000 S 0.000 0.009 0:01.37 ssh
3150 wuh20 20 0 603512 25532 7492 S 0.000 0.039 0:33.67 rp.agent_execut
3158 wuh20 20 0 598648 19656 7272 S 0.000 0.030 0:33.18 rp.agent_schedu
3166 wuh20 20 0 598904 19868 7368 S 0.000 0.030 0:32.62 rp.agent_stagin
3170 wuh20 20 0 598648 19656 7472 S 0.000 0.030 0:29.89 rp.agent_unsche
3174 wuh20 20 0 598648 20156 7668 S 0.000 0.031 0:30.26 rp.control_pubs
3178 wuh20 20 0 598648 19740 7524 S 0.000 0.030 0:33.52 rp.funcs_req_qu
3186 wuh20 20 0 598648 19824 7604 S 0.000 0.030 0:29.88 rp.state_pubsub
3293 wuh20 20 0 1163204 41516 7644 S 0.000 0.063 0:45.16 rp.agent_execut
3325 wuh20 20 0 1020596 35220 7732 S 0.000 0.054 0:32.64 rp.agent_stagin
3341 wuh20 20 0 951212 35368 7840 S 0.000 0.054 0:32.65 rp.agent_stagin
3431 wuh20 20 0 806604 36244 7504 S 0.000 0.055 0:20.43 rp.agent_schedu
45598 wuh20 20 0 75032 3196 2404 S 0.000 0.005 0:00.00 sshd
45599 wuh20 20 0 17204 5640 2092 S 0.000 0.009 0:00.04 bash
45693 wuh20 20 0 5764 1336 1244 S 0.000 0.002 0:00.00 sleep
I then terminated EnTK manually.
There are no error messages in the client sandbox when I run grep -R Error *, and there are no error messages in the server sandbox either when I run grep -R Error *.log.
I'm not sure whether this issue is related to another connection issue I opened in the EnTK repository. This is my second time finishing an entire workflow; the first time, I didn't have this problem.
Thank you all for your help.
@Weiming-Hu, I wanted to let you know that I will be following up on this along with the open ticket radical-cybertools/radical.entk#509, since this needs to be addressed in the use of pika. We anticipate this will be resolved no later than the January release.
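For illustration only (not necessarily the fix the developers have in mind): the traceback above shows mq_connection.close() in task_manager.py raising pika.exceptions.ConnectionClosed during shutdown, so one defensive pattern would be to tolerate a broker connection that has already dropped, roughly as sketched below. The helper name and its placement are hypothetical.

import pika

def close_quietly(mq_connection):
    # Close a pika BlockingConnection while tolerating a broker link that is
    # already gone; hypothetical helper, not actual EnTK code.
    try:
        if mq_connection.is_open:
            mq_connection.close()
    except (pika.exceptions.ConnectionClosed,
            pika.exceptions.AMQPConnectionError):
        pass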
Sounds good to me. I will be actively monitoring this thread. Thanks.