AFUnixSender crash (EEXIST)
Closed this issue · 1 comments
eklitzke commented
During a random test:
evan@zeno ~/code/zygote (master) $ PYTHONPATH=.:$PYTHONPATH testify tests.test ZygoteTests.test_hup
[5823] 2011-07-04 21:50:39,137 :: INFO :: zygote.master - main started
[5823] 2011-07-04 21:50:39,139 :: INFO :: zygote.master - started zygote 5828 pointed at base '/data/code/zygote/example'
[5823] 2011-07-04 21:50:39,140 :: INFO :: zygote.master - Attempted to transition 0 workers from 0 zygotes
[5828] 2011-07-04 21:50:39,141 :: INFO :: zygote.worker.zygote_process - new zygote started
[5829] 2011-07-04 21:50:39,242 :: DEBUG :: zygote.worker.worker_process - new worker started
[5830] 2011-07-04 21:50:39,244 :: DEBUG :: zygote.worker.worker_process - new worker started
[5823] 2011-07-04 21:50:39,245 :: DEBUG :: zygote.master - received message of type MessageWorkerStart from pid 5829
[5831] 2011-07-04 21:50:39,245 :: DEBUG :: zygote.worker.worker_process - new worker started
[5823] 2011-07-04 21:50:39,246 :: DEBUG :: zygote.master - received message of type MessageWorkerStart from pid 5830
[5832] 2011-07-04 21:50:39,247 :: DEBUG :: zygote.worker.worker_process - new worker started
[5823] 2011-07-04 21:50:39,248 :: DEBUG :: zygote.master - received message of type MessageWorkerStart from pid 5831
[5823] 2011-07-04 21:50:39,249 :: DEBUG :: zygote.master - received message of type MessageWorkerStart from pid 5832
[5823] 2011-07-04 21:50:40,018 :: INFO :: zygote.master - started zygote 5833 pointed at base '/data/code/zygote/example'
[5823] 2011-07-04 21:50:40,019 :: INFO :: zygote.master - Attempted to transition 4 workers from 1 zygotes
[5832] 2011-07-04 21:50:40,019 :: INFO :: zygote.worker.worker_process - received SIGTERM, exiting
[5829] 2011-07-04 21:50:40,019 :: INFO :: zygote.worker.worker_process - received SIGTERM, exiting
[5830] 2011-07-04 21:50:40,020 :: INFO :: zygote.worker.worker_process - received SIGTERM, exiting
[5833] 2011-07-04 21:50:40,020 :: INFO :: zygote.worker.zygote_process - new zygote started
[5831] 2011-07-04 21:50:40,021 :: INFO :: zygote.worker.worker_process - received SIGTERM, exiting
[5828] 2011-07-04 21:50:40,052 :: INFO :: zygote.worker.zygote_process - reaped worker 5829, status 0
[5828] 2011-07-04 21:50:40,052 :: INFO :: zygote.worker.zygote_process - reaped worker 5831, status 0
[5828] 2011-07-04 21:50:40,054 :: INFO :: zygote.worker.zygote_process - reaped worker 5830, status 0
Traceback (most recent call last):
File "zygote/main.py", line 51, in <module>
main()
File "zygote/main.py", line 48, in main
zygote.master.main(opts, args)
File "/data/code/zygote/zygote/master.py", line 283, in main
master.start()
File "/data/code/zygote/zygote/master.py", line 243, in start
z = self.create_zygote()
File "/data/code/zygote/zygote/master.py", line 234, in create_zygote
z.loop()
File "/data/code/zygote/zygote/worker.py", line 120, in loop
self.io_loop.start()
File "/usr/lib/python2.7/site-packages/tornado/ioloop.py", line 261, in start
event_pairs = self._impl.poll(poll_timeout)
File "/data/code/zygote/zygote/worker.py", line 117, in reap_child
notify(self.notify_socket, MessageWorkerExit, '%d %d' % (pid, status_code))
File "/data/code/zygote/zygote/worker.py", line 35, in notify
sock.send(msg_cls.emit(str(body)))
File "/data/code/zygote/zygote/util.py", line 166, in send
self._sendall()
File "/data/code/zygote/zygote/util.py", line 161, in _sendall
self.io_loop.add_handler(self.socket.fileno(), sender, self.io_loop.WRITE)
File "/usr/lib/python2.7/site-packages/tornado/ioloop.py", line 164, in add_handler
self._impl.register(fd, events | self.ERROR)
IOError: [Errno 17] File exists
This doesn't happen reliably, and I couldn't reproduce it, but there's a bug lurking somewhere in this code.
eklitzke commented
Since I couldn't reproduce this before, I'm not 100% sure that this actually fixes the issue. It seems plausible, though, that this would lead to the socket being added to the IOLoop twice (once on the async connect, then again on the send).
Anyway, hopefully I won't have to re-open this (but it's definitely possible).