circus-tent/circus

quit --waiting command hang up

unkcpz opened this issue · 5 comments

I think there is a critical issue after upgrade to tornado>5 .

circusctl --endpoint=xxx quit --waiting 

will successfully stop the process but hang up in terminal.
After trying to debug this, I find the quit action stuck in

events = dict(self.poller.poll(self.timeout))
lead to timeout.

To my surprise, quit without --waitting is all fine.

This issue can simply reproduced by running circusd examples/example2.ini and then stop arbiter with:

circusctl --timeout 10000 --endpoint tcp://127.0.0.1:5555 quit --waiting

I find for command quit the property waiting is not actually used when executing quit procedure.

def execute(self, arbiter, props):
return arbiter.stop()

Can it be simply removed?

What makes me curious is what's the different between waiting and not waiting. When does synchronous CircusClient send the command over socket and who will receive and process the message?

Pinging @biozz @k4nar @sphuber for comments.

biozz commented

I spent several hours trying to debug this and unfortunately I have no clues why it hangs. It is obvious that there are some events that are not sent and poller is not receiving them, but I don't know which ones...

As for the removal of the waiting property - it seems that you are right, there is no point of waiting when using synchronous client.

What I am worried about is that even without waiting I get random zmq errors when circusd stops. Those appeared after upgrade to Tornado 5, ex.:

2020-07-25 23:27:09 asyncio[7773] [ERROR] Exception in callback BaseAsyncIOLoop._handle_events(14, 1)
handle: <Handle BaseAsyncIOLoop._handle_events(14, 1)>
Traceback (most recent call last):
  File ".../asyncio/events.py", line 81, in _run
    self._context.run(self._callback, *self._args)
  File ".../site-packages/tornado/platform/asyncio.py", line 139, in _handle_events
    handler_func(fileobj, events)
  File ".../site-packages/zmq/eventloop/zmqstream.py", line 452, in _handle_events
    zmq_events = self.socket.EVENTS
  File ".../site-packages/zmq/sugar/attrsettr.py", line 48, in __getattr__
    return self._get_attr_opt(upper_key, opt)
  File ".../site-packages/zmq/sugar/attrsettr.py", line 52, in _get_attr_opt
    return self.get(opt)
  File "zmq/backend/cython/socket.pyx", line 472, in zmq.backend.cython.socket.Socket.get
  File "zmq/backend/cython/socket.pyx", line 135, in zmq.backend.cython.socket._check_closed
zmq.error.ZMQError: Socket operation on non-socket
2020-07-25 23:27:09 circus[7773] [INFO] Stats streamer stopped

Thanks very much for the effort! @biozz
I spent several days in this issue 😿

What I am worried about is that even without waiting I get random zmq errors when circusd stops. Those appeared after upgrade to Tornado 5, ex

This happened to me also in tornado<5, so I guess this is not a critical issue. @sphuber Can you double check it in the circus.log of your aiida environment?
EDIT: this may relevant to this Fix ZeroMQ assertion error in the client.

As for the removal of the waiting property - it seems that you are right, there is no point of waiting when using synchronous client.

If is the case, do you agree to remove it to get around of the issue? I can make a PR for this.

biozz commented

If is the case, do you agree to remove it to get around of the issue? I can make a PR for this.

I agree, it is the best option right now. In my opinion it doesn't affect most of the "simple" usage scenarios, when you just want to quit regardless of waiting.

This happened to me also in tornado<5, so I guess this is not a critical issue. @sphuber Can you double check it in the circus.log of your aiida environment?

Sorry for the late reply. Can confirm that I regularly used to get these ZMQErrors even before the move to tornado>=5, so I don't think that is what causes it.