Yelp/zygote

AFUnixSender crash (EEXIST)

Closed this issue · 1 comments

During a random test:

evan@zeno ~/code/zygote (master) $ PYTHONPATH=.:$PYTHONPATH testify tests.test  ZygoteTests.test_hup
[5823] 2011-07-04 21:50:39,137 :: INFO    :: zygote.master - main started
[5823] 2011-07-04 21:50:39,139 :: INFO    :: zygote.master - started zygote 5828 pointed at base '/data/code/zygote/example'
[5823] 2011-07-04 21:50:39,140 :: INFO    :: zygote.master - Attempted to transition 0 workers from 0 zygotes
[5828] 2011-07-04 21:50:39,141 :: INFO    :: zygote.worker.zygote_process - new zygote started
[5829] 2011-07-04 21:50:39,242 :: DEBUG   :: zygote.worker.worker_process - new worker started
[5830] 2011-07-04 21:50:39,244 :: DEBUG   :: zygote.worker.worker_process - new worker started
[5823] 2011-07-04 21:50:39,245 :: DEBUG   :: zygote.master - received message of type MessageWorkerStart from pid 5829
[5831] 2011-07-04 21:50:39,245 :: DEBUG   :: zygote.worker.worker_process - new worker started
[5823] 2011-07-04 21:50:39,246 :: DEBUG   :: zygote.master - received message of type MessageWorkerStart from pid 5830
[5832] 2011-07-04 21:50:39,247 :: DEBUG   :: zygote.worker.worker_process - new worker started
[5823] 2011-07-04 21:50:39,248 :: DEBUG   :: zygote.master - received message of type MessageWorkerStart from pid 5831
[5823] 2011-07-04 21:50:39,249 :: DEBUG   :: zygote.master - received message of type MessageWorkerStart from pid 5832
[5823] 2011-07-04 21:50:40,018 :: INFO    :: zygote.master - started zygote 5833 pointed at base '/data/code/zygote/example'
[5823] 2011-07-04 21:50:40,019 :: INFO    :: zygote.master - Attempted to transition 4 workers from 1 zygotes
[5832] 2011-07-04 21:50:40,019 :: INFO    :: zygote.worker.worker_process - received SIGTERM, exiting
[5829] 2011-07-04 21:50:40,019 :: INFO    :: zygote.worker.worker_process - received SIGTERM, exiting
[5830] 2011-07-04 21:50:40,020 :: INFO    :: zygote.worker.worker_process - received SIGTERM, exiting
[5833] 2011-07-04 21:50:40,020 :: INFO    :: zygote.worker.zygote_process - new zygote started
[5831] 2011-07-04 21:50:40,021 :: INFO    :: zygote.worker.worker_process - received SIGTERM, exiting
[5828] 2011-07-04 21:50:40,052 :: INFO    :: zygote.worker.zygote_process - reaped worker 5829, status 0
[5828] 2011-07-04 21:50:40,052 :: INFO    :: zygote.worker.zygote_process - reaped worker 5831, status 0
[5828] 2011-07-04 21:50:40,054 :: INFO    :: zygote.worker.zygote_process - reaped worker 5830, status 0
Traceback (most recent call last):
  File "zygote/main.py", line 51, in <module>
    main()
  File "zygote/main.py", line 48, in main
    zygote.master.main(opts, args)
  File "/data/code/zygote/zygote/master.py", line 283, in main
    master.start()
  File "/data/code/zygote/zygote/master.py", line 243, in start
    z = self.create_zygote()
  File "/data/code/zygote/zygote/master.py", line 234, in create_zygote
    z.loop()
  File "/data/code/zygote/zygote/worker.py", line 120, in loop
    self.io_loop.start()
  File "/usr/lib/python2.7/site-packages/tornado/ioloop.py", line 261, in start
    event_pairs = self._impl.poll(poll_timeout)
  File "/data/code/zygote/zygote/worker.py", line 117, in reap_child
    notify(self.notify_socket, MessageWorkerExit, '%d %d' % (pid, status_code))
  File "/data/code/zygote/zygote/worker.py", line 35, in notify
    sock.send(msg_cls.emit(str(body)))
  File "/data/code/zygote/zygote/util.py", line 166, in send
    self._sendall()
  File "/data/code/zygote/zygote/util.py", line 161, in _sendall
    self.io_loop.add_handler(self.socket.fileno(), sender, self.io_loop.WRITE)
  File "/usr/lib/python2.7/site-packages/tornado/ioloop.py", line 164, in add_handler
    self._impl.register(fd, events | self.ERROR)
IOError: [Errno 17] File exists

This doesn't happen reliably, and I couldn't reproduce it, but there's a bug lurking somewhere in this code.

Since I couldn't reproduce this before, I'm not 100% sure that this actually fixes the issue. It seems plausible, though, that this would lead to the socket being added to the IOLoop twice (once on the async connect, then again on the send).

Anyway, hopefully I won't have to re-open this (but it's definitely possible).