ChildActorExited - Manually killing child PID does not result in ChildActorExited Message to Parent
Opened this issue · 9 comments
Code
Below is example code which has reproduced this issue for me.
Project structure
.
├── app
│ ├── __init__.py
│ ├── parent.py
│ └── child.py
├── start.py
└── stop.py
start.py
from thespian.actors import ActorSystem
from app import parent
if __name__ == '__main__':
asys = ActorSystem('multiprocTCPBase')
actor = asys.createActor(parent.Parent)
asys.tell(actor, parent.Initialize())
stop.py
from thespian.actors import ActorSystem
if __name__ == '__main__':
ActorSystem('multiprocTCPBase').shutdown()
parent.py
from thespian.actors import ActorTypeDispatcher
import os
class Initialize:
def __init__(self):
pass
class Parent(ActorTypeDispatcher):
def __init__(self, *args, **kwargs):
super(Parent, self).__init__(*args, **kwargs)
self._child = None
def receiveMsg_Initialize(self, msg, sender):
print(f'{self.myAddress} (P|:{os.getpid()}) Parent received Initialize command from {sender}')
self._child = self.createActor('app.child.Child')
self.send(self._child, Initialize())
def receiveMsg_ChildActorExited(self, msg, sender):
print(f'In Parent: ChildActorExited')
child.py
from thespian.actors import ActorTypeDispatcher
import os
class Child(ActorTypeDispatcher):
def __init__(self, *args, **kwargs):
super(Child, self).__init__(*args, **kwargs)
def receiveMsg_Initialize(self, msg, sender):
print(f'{self.myAddress} (P|:{os.getpid()}) Child received Initialize command from {sender}')
Procedure
- Navigate to the root of my project.
~$ cd ~/Documents/thespian/test
- Start the actor system using
start.py
~/Documents/thespian/test$ python start.py
ActorAddr-(T|:1111) (P|:5555) Parent received Initialize command from ActorAddr-(T|:0000)
ActorAddr-(T|:2222) (P|:6666) Child received Initialize command from ActorAddr-(T|:1111)
- Kill the child actor process by its PID.
~/Documents/thespian/test$ kill -9 6666
- After tests, shut down actor system.
~/Documents/thespian/test$ python stop.py
Problem
From the procedure above, upon killing the child process in step 3, it is to my understanding that the parent actor should immediately receive a ChildActorExited
message and my example program should print out ChildActorExited
to the terminal; this, however, does not happen. Instead the remaining parent actor will stay alive and report nothing.
I have tested this same functionality on MacOS and Windows with the same results. I also tried using multiprocUDPBase
but again got the same results.
Another thing to note is after killing the child process and running stop.py
, the actor system takes a bit longer than usual to shutdown however it does not print any additional information.
Environment
- These tests were performed in fresh virtual environments where the only pip package installed was
thespian
. - I have tested this on MacOS Monterey (12.1) as well as Windows 10 (20H2).
- On both operating systems, I have tested using various python versions (3.7.3, 3.8.6, 3.9.9).
- I have also tested on a couple versions of Thespian (3.9.11, 3.10.6).
If you need any additional information, please let me know.
Thank you for the clear report with samples to check this.
Unfortunately, I cannot reproduce the problem: the ChildActorExited
is delivered immediately for me when running the tests you describe.
Python versions tested: 3.7.12, 3.8.12, 3.9.6
The main difference is that I'm running a Linux system (I don't presently have access to either a Windows or a MacOS system).
07:06:42 issue85$ export PS0="running command at start time \t\n"
running command at start time 07:06:46
07:06:46 issue85$ python start.py
running command at start time 07:06:50
ActorAddr-(T|:38691) (P|:1135063) Parent received Initialize command from ActorAddr-(T|:43313)
ActorAddr-(T|:42113) (P|:1135064) Child received Initialize command from ActorAddr-(T|:38691)
07:06:50 issue85$ kill 1135064
running command at start time 07:06:58
07:06:58 issue85$ In Parent: ChildActorExited
In Parent: ChildActorExited
07:06:59 issue85$ python stop.py
running command at start time 07:07:09
07:07:09 issue85$
07:07:14 issue85$
07:07:15 issue85$ python start.py
running command at start time 07:07:21
ActorAddr-(T|:34143) (P|:1135204) Parent received Initialize command from ActorAddr-(T|:36281)
ActorAddr-(T|:32769) (P|:1135205) Child received Initialize command from ActorAddr-(T|:34143)
07:07:21 issue85$ kill -9 1135205
running command at start time 07:07:25
07:07:25 issue85$ In Parent: ChildActorExited
07:07:27 issue85$ python stop.py
running command at start time 07:07:30
07:07:30 issue85$
Can you please try two things:
- Repeat your tests but without the
-9
argument tokill
- Run the following and let me know the results with (a)
python subproc.py
and no additional actions, (b)python subproc.py
thenkill
of the reported child process, and (c)python subproc.py
+kill -9
of the reported child process. It should report "Parent got signal 17" immediately on thekill
commands.
subproc.py:
from multiprocessing import *
import signal
import time
import os
def sleeper(t):
print('Process %s sleeping for %d seconds' % (os.getpid(), t))
time.sleep(t)
print('Process %s awakened and exiting' % os.getpid())
def gotsig(signum, frame):
print('Parent got signal %s' % signum)
if __name__ == "__main__":
wait_time = 10
signal.signal(signal.SIGCHLD, gotsig)
p = Process(target=sleeper, args=(wait_time,))
p.start()
time.sleep(wait_time + 2)
print('Exiting')
Repeat your tests but without the -9 argument to kill
When I do this, the process never gets killed and the actor stays alive. Both OSs exhibit the same behavior where I am prevented from killing ActorSystem child actor processes without forcefully killing them.
Note: In MacOS the signal number for SIGCHLD
is 20 instead of 17.
Run the following and let me know the results with
python subproc.py
and no additional actions
Results:
~/test$ python subproc.py
Process 42158 sleeping for 10 seconds
Process 42158 awakened and exiting
Parent got signal 20
Exiting
~/test$
Run the following and let me know the results with
python subproc.py
then kill of the reported child process
Results:
~/test$ python subproc.py
Process 53352 sleeping for 10 seconds
~/test$ kill 53352
Parent got signal 20
Exiting
~/test$
I received the exit signal immediately after killing the process.
Run the following and let me know the results with
python subproc.py
+kill -9
of the reported child process
Results:
~/test$ python subproc.py
Process 56348 sleeping for 10 seconds
~/test$ kill -9 56348
Parent got signal 20
Exiting
~/test$
I received the exit signal immediately after killing the process.
I was unable to run the subproc.py
tests on Windows since windows does not have any SIGCHLD
signal or anything that I could use as a replacement.
I should have a Linux system up and running in the next couple days so I can get back to you with my test results on that, however it will be on an ARM Cortex processor so hopefully that doesn't throw a wrench into the mix. At least for now it seems like it should work and that this is more of a Mac and Windows specific issue.
This is certainly an odd issue... It almost feels like there may be a permissions issue or something. Are there any environment variables or permissions that I should verify on my end?
Additional environment info:
On the mac, I am using PyEnv to manage my multiple python versions.
On both mac and windows I am using python virtual environments created with python -m venv
but I doubt that is the source of the issue.
Interesting... and a bit unexpected:
- The
SIGCHLD
value is usually17
, not20
, although that signal is not one of the explicit values listed in the POSIX standard, so it's possible it's implementation dependent. I get<Signals.SIGCHLD 17>
when I run$ python -c 'import signal; print(repr(signal.SIGCHLD))
. On my system,20
isSIGTSTP
which is a terminal stop signal (usually associated withCtrl-Z
by default, although that's remappable elsewhere). What do you get for this python command? - Running kill without the
-9
should deliverSIGTERM
(usually value15
). What happens if you explicitly runkill -15 PID
? - Can you run
$ type kill
command? I would expect it to report "kill is a shell builtin". Also, what shell are you running?
There seem to be some sort of shenanigans going on with signals that's causing the problem. Normally, thespian will watch for a child process exit by registering for a number of signals (see https://github.com/kquick/Thespian/blob/master/thespian/system/multiprocCommon.py#L678-L694 and https://github.com/kquick/Thespian/blob/master/thespian/system/multiprocCommon.py#L29-L40). Your results indicate that signals are being delivered promptly, but not necessarily the signal that was expected. It may be possible to resolve this by simply adding the right additional signal to https://github.com/kquick/Thespian/blob/master/thespian/system/multiprocCommon.py#L36, but I'd like to try to understand what's happening a little better first.
This situation should not be affected by the hardware architecture or the python virtual environment.
- I checked the reference man pages for OSX which provided the following:
No Name Default Action Description
1 SIGHUP terminate process terminal line hangup
2 SIGINT terminate process interrupt program
3 SIGQUIT create core image quit program
4 SIGILL create core image illegal instruction
5 SIGTRAP create core image trace trap
6 SIGABRT create core image abort program (formerly SIGIOT)
7 SIGEMT create core image emulate instruction executed
8 SIGFPE create core image floating-point exception
9 SIGKILL terminate process kill program
10 SIGBUS create core image bus error
11 SIGSEGV create core image segmentation violation
12 SIGSYS create core image non-existent system call invoked
13 SIGPIPE terminate process write on a pipe with no reader
14 SIGALRM terminate process real-time timer expired
15 SIGTERM terminate process software termination signal
16 SIGURG discard signal urgent condition present on
socket
17 SIGSTOP stop process stop (cannot be caught or
ignored)
18 SIGTSTP stop process stop signal generated from
keyboard
19 SIGCONT discard signal continue after stop
20 SIGCHLD discard signal child status has changed
This is showing SICHLD on signal number 20. Here is the output of the requested command from your first point:
~/test$ python -c 'import signal; print(repr(signal.SIGCHLD))'
<Signals.SIGCHLD: 20>
-
Running with
kill -15 PID
gives the same results askill PID
since the default signal for the kill command in OSX is 15 (SIGTERM
). On that note, in OSX, thekill -9 PID
command will deliver aSIGKILL
to the process. -
type kill
returnskill is a shell builtin
. I'm using ZSH on my Mac but I'm just using standard Powershell on my Windows computer.
I'm looking through the code you linked and can't see why the exit signal isn't being caught since you're referencing it by name and not number.
I modified your test code in an attempt to view what signals are being received by the killed process.
from multiprocessing import *
import signal
import time
import os
def sleeper(t):
print('Process %s sleeping for %d seconds' % (os.getpid(), t))
time.sleep(t)
print('Process %s awakened and exiting' % os.getpid())
def gotsig(signum, frame):
print('Parent got signal %s' % signum)
if __name__ == '__main__':
for i in ['SIGTERM', 'SIGKILL', 'SIGQUIT', 'SIGABRT', 'SIGCHLD']:
try:
signum = getattr(signal, i)
signal.signal(signum, gotsig)
print(f'{i}: {signum}')
except (OSError, RuntimeError) as m: # os error
print(f'Skipping {i}: {m}')
wait_time = 10
p = Process(target=sleeper, args=(wait_time,))
p.start()
time.sleep(wait_time+2)
print('Exiting')
By running this code, I get the following output.
SIGTERM: 15
SIGKILL: 9
Skipping SIGKILL: [Errno 22] Invalid argument
SIGQUIT: 3
SIGABRT: 6
SIGCHLD: 20
Process 96152 sleeping for 10 seconds
Parent got signal 20
Exiting
Regardless of which signal I am killing the process with (SIGKILL: 9, SIGTERM: 15, SIGQUIT: 3, etc.) I get the same result signal in the output Parent got signal 20
. I also did a bit of digging and found that for both Linux and Mac, their man pages specify the following:
The signals SIGKILL and SIGSTOP cannot be caught, blocked, or ignored.
This explains the invalid argument error for SIGKILL in my outputs.
Please let me know if there is any other info that may help.
Looks like only some of the signal values are defined by POSIX and the others can be remapped by different OS vendors (see https://en.wikipedia.org/wiki/Signal_(IPC) and search for "Portable number"), which is what accounts for the different numbers.
Thanks for trying the explicit -15
on the kill: I just wanted to check in case kill
was mapped to a different default signal.
SIGKILL
is a pretty emphatic termination signal and usually just used as a last resort, since the default SIGTERM
is usually nicer in that it allows the application to perform any cleanup processing it might have setup via an associated signal handler. Regardless, however the child exits, the parent should always get the SIGCHLD
, which is what the latest script seems to verify.
I appreciate your patience in working through this with me. When you have a chance, the next set of tests is:
- Update your local thespian's
thespian/system/transport/TCPTransport.py
with https://gist.github.com/kquick/371519aa6cb0bea16961cc100eb7cf3f THESPLOG_THRESHOLD=DEBUG THESPLOG_MAX_FILESIZE=1024000 THESPLOG_FILE=$(pwd)/run1.log python start.py
and then complete your original test process.- Send both
run1.log
to me (via gist or attachment here, after you've perused and tweaked them to remove any sensitive data you don't want exposed).
That will give me some internal Thespian log information that should hopefully reveal what's different about running in your environment.
I've attached a few different logs here. All were generated by running the test code originally linked at the beginning of this issue.
kill_minus_15_SIGTERM.log
When running to generate kill_minus_15_SIGTERM.log
, I attempted to kill the child process using kill -15 PID
, which doesn't actually kill the process at all. After running the kill command, I ran stop.py
to gracefully shut down the actorsystem.
kill_minus_9_SIGKILL.log
When running to generate kill_minus_9_SIGKILL.log
, I used kill -9 PID
to kill the child process. The child process was killed successfully but as expected didn't report back to the parent with a ChildActorExited message. I then gracefully shut down the actorsystem using stop.py
.
I'm noticing in the kill_minus_15_SIGTERM.log
file, on line 38, we get
2022-02-11 07:34:12.256399 p27057 Warn Actor app.child.Child @ ActorAddr-(T|:61345) got shutdown signal: 15
whereas no such line exists in the kill_minus_9_SIGKILL.log
file.
I also tested using kill -3
and kill -6
, both of which exhibited the exact same behavior as kill -15
. They didn't result in the process being killed and the actor remained alive. They both did however show up in the log file saying got shutdown signal: 3
and got shutdown signal: 6
kill_minus_17_SIGSTOP.log
I then tested using kill -17
which is the SIGSTOP
signal on MacOSX, which just like kill -9
(SIGKILL
), "cannot be caught or ignored" according to the MacOSX and Linux man pages. This command behaved slightly differently from kill -9
. After stopping (not killing since the process never actually died) the child actor process with SIGSTOP
, then attempting to run stop.py
to gracefully shut down the actorsystem, the shutdown process took longer than expected and resulted in the following error at the end of the log:
2022-02-11 07:55:48.015610 p46550 ERR No response to Admin shutdown request; Actor system not completely shutdown
I think this is due to SIGSTOP
essentially freezing a process which thespian recognizes as still being an existing process. So when thespian tries to kill the frozen process it gets stuck trying to kill the process but can't actually kill it and eventually gives up.
Quick update. On my Debian system I receive the ChildActorExited message as anticipated.
Thanks for the info and update. The log files are not showing some events I'm expecting to see. I'm looking into getting access to a MacOS environment to see if I can do some local testing/reproduction.
If there's something in particular you're looking for, I could do a bit of digging to get some more detailed info. Thanks again for your help. :)