mongodb-labs/drivers-atlas-testing

Document the use of CTRL_BREAK_EVENT on Windows instead of SIGINT to interrupt workload executors

prashantmital opened this issue · 2 comments

There are many limitations with using signal.CTRL_C_EVENT to interrupt a subprocess on Windows. Consider, for example, the following scripts:

  1. pyscript.py (analogous to 'the framework', i.e. astrolabe):
import subprocess
import os
import signal
import sys
import time


cmd = subprocess.Popen([sys.executable, "bgproc.py"],
        creationflags=subprocess.CREATE_NEW_PROCESS_GROUP,
        stdout=subprocess.PIPE, stderr=subprocess.PIPE)
time.sleep(2)
os.kill(cmd.pid, signal.CTRL_C_EVENT)
stdout, stderr = cmd.communicate(timeout=10)

print("stdout: {}".format(stdout))
print("stderr: {}".format(stderr))
print("exit code: {}".format(cmd.returncode))
  2. bgproc.py (analogous to a driver workload executor script):
import signal

print("hello world")

try:
    while True:
        pass
except KeyboardInterrupt:
    print("caught ctrl-c!")
    exit(0)

Running python.exe pyscript.py, we'd expect bgproc.py's execution to be interrupted by the CTRL_C_EVENT signal, which is 'handled' in the except KeyboardInterrupt block. However, we actually find that the script is not interrupted at all by the signal, causing the call to communicate to time out:

$ C:/python/Python37/python.exe pyscript.py
Traceback (most recent call last):
  File "pyscript.py", line 13, in <module>
    stdout, stderr = cmd.communicate(timeout=10)
  File "C:\python\Python37\lib\subprocess.py", line 964, in communicate
    stdout, stderr = self._communicate(input, endtime, timeout)
  File "C:\python\Python37\lib\subprocess.py", line 1298, in _communicate
    raise TimeoutExpired(self.args, orig_timeout)
subprocess.TimeoutExpired: Command '['C:\\python\\Python37\\python.exe', 'bgproc.py']' timed out after 10 seconds

After observing this peculiar behavior, I investigated further and found that on Windows there are many deficiencies with the IPC APIs. The situation is further complicated by deficient/incorrect Python documentation (specifically, regarding the correct usage of CTRL_C_EVENT, CTRL_BREAK_EVENT, CREATE_NEW_PROCESS_GROUP, and os.kill on Windows). Some resources with pertinent information/discussions are:

In light of this, we need a new way to stop the Workload Executor on Windows.

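After some more digging, it seems that using the CTRL_BREAK_EVENT signal is the right way to kill process groups on Windows. (The Windows documentation for CREATE_NEW_PROCESS_GROUP notes that Ctrl+C signals are disabled for processes in the new group, which would explain the timeout above.) The following combination of scripts works: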

  1. pyscript.py (analogous to 'the framework', i.e. astrolabe):
import subprocess
import os
import signal
import sys
import time


cmd = subprocess.Popen([sys.executable, "bgproc.py"],
        creationflags=subprocess.CREATE_NEW_PROCESS_GROUP,
        stdout=subprocess.PIPE, stderr=subprocess.PIPE)
time.sleep(2)
os.kill(cmd.pid, signal.CTRL_BREAK_EVENT)
stdout, stderr = cmd.communicate(timeout=10)

print("stdout: {}".format(stdout))
print("stderr: {}".format(stderr))
print("exit code: {}".format(cmd.returncode))
  2. bgproc.py (analogous to a driver workload executor script):
import signal

print("hello world")

def cleanup(signum, frame):
    print("caught ctrl-break!")
    exit(0)

signal.signal(signal.SIGBREAK, cleanup)

while True:
    pass

exit(0)

This works as expected:

$ C:/python/Python37/python.exe pyscript.py
stdout: b'hello world\r\ncaught ctrl-break!\r\n'
stderr: b''
exit code: 0

With #32 and #30 done, this mostly boils down to documenting this change in the expected behavior of workload executors written for Windows. Adding the documentation tag and updating the issue description accordingly.
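
As a possible starting point for that documentation, here is a sketch of how a Python workload executor might listen for the stop signal on either platform. It is an illustration only: STOP_SIGNAL, request_stop, run_workload and do_one_operation are hypothetical names, and setting a flag (rather than calling exit(0) in the handler, as bgproc.py does) is a design choice made here so that the loop can finish its current operation before cleaning up.

```python
import signal

# SIGBREAK only exists on Windows; fall back to SIGINT on other platforms.
STOP_SIGNAL = getattr(signal, "SIGBREAK", signal.SIGINT)

stop_requested = False


def request_stop(signum, frame):
    """Signal handler: record the request instead of exiting immediately."""
    global stop_requested
    stop_requested = True


def run_workload(do_one_operation):
    """Run operations until asked to stop; return how many completed."""
    signal.signal(STOP_SIGNAL, request_stop)
    completed = 0
    while not stop_requested:
        do_one_operation()
        completed += 1
    # Any cleanup the executor needs would go here, before it exits.
    return completed
```

Executors written in other languages would need the equivalent of a SIGBREAK handler on Windows in their own runtime.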