Add ability to manually restart the main child
baryluk opened this issue · 15 comments
Backstory: I am using Apache Airflow 2.1.2 on Kubernetes (AWS EKS). As I am migrating from Airflow 1.10.15, there are many issues that are hard to debug locally, because there are many interactions with extra resources (secrets, the exact environment variables used, k8s, the production database, etc.). I fixed many of these issues by trial and error on test instances, but there is one that is really hard to debug, related to S3 remote logging in Airflow. From everything I can see (local testing, emitted logs, triple-checking configs, and logging into the pod and manually doing import boto3, creating sessions with the same keys, and executing the same commands as the Airflow code), it should work, but it does not. So I want to instrument / modify the actual process to debug it deeper.
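For context, the manual check from a python shell inside the pod was roughly along these lines (the bucket and key below are placeholders; the real test used the bucket from remote_base_log_folder and the same credentials the webserver sees):

import os
import boto3

# Placeholders -- the real check used the bucket/prefix from
# remote_base_log_folder and the credentials already present in the pod.
bucket = "my-airflow-logs"
key = "debug/manual-test.log"

session = boto3.session.Session(
    aws_access_key_id=os.environ.get("AWS_ACCESS_KEY_ID"),
    aws_secret_access_key=os.environ.get("AWS_SECRET_ACCESS_KEY"),
    region_name=os.environ.get("AWS_DEFAULT_REGION"),
)
s3 = session.client("s3")

# Write and read back a test object, the same basic operations the remote
# log handler performs.
s3.put_object(Bucket=bucket, Key=key, Body=b"hello from inside the pod\n")
print(s3.get_object(Bucket=bucket, Key=key)["Body"].read())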
I wish I was able to log into the pod, modify some Python files (possibly prepared externally and then fetched with wget + tar xf, edited locally with something like nano, mcedit or vi, or transferred using kubectl cp ...) to add various debug features, and then ask dumb-init explicitly to restart the child without terminating the pod. I cannot just restart the pod, because if it terminates, k8s will recreate it from scratch, with any custom debug modifications gone.
This is the process structure in the pod:
$ eks exec -it --namespace airflow-uat airflow-uat-web-cf645648d-w5pwx -- /bin/bash
airflow@airflow-uat-web-cf645648d-w5pwx:/opt/airflow$ ps auxf
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
airflow 134 0.0 0.0 5752 3720 pts/0 Ss 11:11 0:00 /bin/bash
airflow 143 0.0 0.0 9392 3144 pts/0 R+ 11:45 0:00 \_ ps auxf
airflow 1 0.0 0.0 2144 756 ? Ss 10:31 0:00 /usr/bin/dumb-init -- bash -c exec airflow webserver
airflow 7 0.3 1.0 430136 162484 ? Ss 10:31 0:14 /usr/local/bin/python /home/airflow/.local/bin/airflow webserver
airflow 33 0.0 0.3 108132 57996 ? S 10:31 0:01 \_ gunicorn: master [airflow-webserver]
airflow 44 0.1 0.9 432288 158300 ? S 10:31 0:08 \_ [ready] gunicorn: worker [airflow-webserver]
airflow 45 0.1 1.0 436620 162216 ? S 10:31 0:07 \_ [ready] gunicorn: worker [airflow-webserver]
airflow 46 0.1 1.0 436932 162428 ? S 10:31 0:07 \_ [ready] gunicorn: worker [airflow-webserver]
airflow 47 0.1 0.9 432632 158676 ? S 10:31 0:08 \_ [ready] gunicorn: worker [airflow-webserver]
I want to restart process 7, so it can re-read the modified python files.
The alternatives are not so pleasant:
- Create an init container to patch files - hard to set up, especially in ad-hoc debug sessions.
- Using gdb or pydb - needs a priori preparation and modified images, and increases image sizes. gdb is also not great when gunicorn sub-processes are involved, or when you want to use the existing Python logging system for debugging.
- Create a custom image with patched files - hard and convoluted, as it requires significant changes in the Airflow image build process, is sometimes not even possible (i.e. we use many upstream images directly), and also carries the risk of polluting or breaking other pods / namespaces that use the same image.
- Starting a parallel process - not easily possible, because the webserver listens on a specific port (8080), so a new webserver will not start. If I start it on a different port, I cannot connect to it from outside without modifying the pod spec or doing a manual port-forward. Even then it is risky for some services (like the airflow scheduler, but even the webserver), because there is global state on the file system (some log files, pid lock files) that could interact badly when two instances run in the same pod.
Also note that the default uid is airflow (50000), which makes even doing apt install very tricky on the running pod.
So I think the ability to modify files ad hoc on the pod and ask dumb-init to restart the main process is the most straightforward option: it requires no extra preparation steps and is the safest.
If, after the modifications and an explicit manual restart request, the main process exits or crashes, dumb-init should behave the same as before, i.e. exit and terminate the pod.
I think the feature is desirable, and keeps dumb-init still relatively dumb.
PS. It is somewhat similar to #156, but different in purpose.
a reminder that the project is called dumb init -- intentionally not supporting things like reloaders, restarters, etc. but just the bare minimum features of an init system.
if you want a reloader, I'd recommend writing a small thing which responds to SIGUSR1 and runs os.exec* (this thing can easily run as a direct child of dumb-init and a direct parent of your application)
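For illustration, a minimal sketch of such a wrapper (untested; the shim path, the airflow webserver example command, and the exact signal handling are all just placeholders):

#!/usr/bin/env python3
# Hypothetical restart shim: direct child of dumb-init, direct parent of the
# real command, e.g.:  dumb-init -- /restart-shim.py airflow webserver
# SIGUSR1 stops the command and re-execs this shim (same PID), so a fresh
# child is started and picks up any files edited on disk in the meantime.
# Any other child exit is passed through, so the container still terminates.
import os
import signal
import subprocess
import sys

if len(sys.argv) < 2:
    sys.exit("usage: restart-shim.py COMMAND [ARGS...]")

child = subprocess.Popen(sys.argv[1:])
want_restart = False

def request_restart(signum, frame):
    # Only flag the restart and ask the child to stop; the re-exec happens
    # in the main flow once wait() returns.
    global want_restart
    want_restart = True
    child.terminate()

signal.signal(signal.SIGUSR1, request_restart)
# Belt and braces: forward termination signals in case they are delivered
# only to this process and not to the whole process group.
signal.signal(signal.SIGTERM, lambda s, f: child.terminate())
signal.signal(signal.SIGINT, lambda s, f: child.terminate())

status = child.wait()
if want_restart:
    os.execv(sys.executable, [sys.executable] + sys.argv)

# Map "killed by signal N" (negative returncode) to the usual 128+N code.
sys.exit(128 - status if status < 0 else status)

Doing the actual re-exec outside the signal handler keeps the handler trivial and avoids re-entering the child's wait() from inside it.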
I wish that was easy, but it is not.
- There are many docker images using dumb-init that I did not build, but that I use.
- There are many docker images using dumb-init that I did build, but there are dozens of them. It would be tricky to update them all, especially as some use different build systems.
- Even for the ones that I can update, some helm charts bypass the entrypoint and execute dumb-init -- bash exec service directly.
I even tried doing this for airflow, with airflow-git-tag-2.1.2/scripts/in_container/prod/entrypoint_prod.sh:360 modified to have this at the end instead of the normal exec airflow "$@":
# ...
AIRFLOW_PID=
RELOADING=0
doreload() {
  RELOADING=1
  kill "${AIRFLOW_PID}"
}

while :; do
  RELOADING=0
  "airflow" "${@}" &
  AIRFLOW_PID=$!
  trap doreload SIGUSR1
  wait ${AIRFLOW_PID}
  STATUS=$?
  if [ "${RELOADING}" != "1" ]; then
    exit "${STATUS}"
    break
  fi
done
and it does not work, because the helm chart I am using bypasses the airflow provided entrypoint ( https://github.com/airflow-helm/charts/blob/main/charts/airflow/templates/webserver/webserver-deployment.yaml#L106 ), and as far as I know, there is no mechanism for me to change it.
shouldn't be too hard to change the helm chart or write your own airflow binary prefixed on PATH?
shouldn't be too hard to change the helm chart or write your own airflow binary prefixed on PATH?
On a one-by-one basis, yes (and it is not trivial). But that does not help much with the other cases and other images.
Also, the PATH is "/home/airflow/.local/bin:/usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin", and airflow is in the first directory. So I also need to modify the PATH somehow, either in the helm release (no idea if this will even work properly) or by creating a custom image that moves things around. Fixable, but tedious.
In this particular case I figured out how to do it. In the Dockerfile, at the end:
USER root
ENV PATH="/reloader:${PATH}"
# Install reloader.sh as "airflow" as the first element in the PATH.
# This way when helm chart does 'exec bash -c airflow',
# it will be picked up first.
# The reloader later will run airflow via absolute path.
COPY reloader.sh /reloader/airflow
USER airflow
and reloader.sh like this:
#!/bin/bash
REAL_AIRFLOW="/home/airflow/.local/bin/airflow"
doreload15() {
  RELOADING=1
  kill "${AIRFLOW_PID}"
}

doreload9() {
  RELOADING=1
  kill -9 "${AIRFLOW_PID}"
}

while :; do
  RELOADING=0
  "${REAL_AIRFLOW}" "${@}" &
  AIRFLOW_PID=$!
  trap doreload15 SIGUSR1  # On SIGUSR1, send SIGTERM to airflow.
  trap doreload9 SIGUSR2   # On SIGUSR2, send SIGKILL to airflow.
  wait "${AIRFLOW_PID}"
  STATUS=$?
  if [ "${RELOADING}" != "1" ]; then
    exit "${STATUS}"
    break
  fi
  sleep 5  # Give some time for the child worker processes to terminate too.
done
It works. I still think it is a PITA, and it could be solved more universally in dumb-init.
I am not even sure if running kill inside a signal handler is allowed, or maybe bash does something special and just notes the signal, and the handler function is then run safely back in the main interpreter loop.
could be solved more universally in dumb-init.
it could, dumb-init could also be a process watcher or a service runner or systemd, but it's intentionally not
could be solved more universally in dumb-init.
it could, dumb-init could also be a process watcher or a service runner or systemd, but it's intentionally not
Because these things are better solved in other places. That does not hold for my use case.
it absolutely is in your case, you're looking for functionalities that a process supervisor would implement -- so use a process supervisor!
it absolutely is in your case, you're looking for functionalities that a process supervisor would implement -- so use a process supervisor!
How do I use a process supervisor in the ~30 different container images that we use, the majority of which I have zero extra control over? I honestly do not see how this can be done, especially when running on Kubernetes. I am open to suggestions.
tbh I'm concerned for your organization:
- you claim to have no control over the code you're running
- you want hot reloading of hand-edited code in production
I'd start by trying to fix those instead of demanding features on tools which are designed to not have them
So how would you do it then? I still do not see how it is possible.
dumb-init => (supervisor of your choice) => (entrypoint of your image)
but more realistically, I would not be debugging in production -- I'd use a development environment for that and run tools in the foreground, etc.
dumb-init => (supervisor of your choice) => (entrypoint of your image)
Not possible with any of the Docker images on hub.docker.com, or the hundreds of other images built from open source Dockerfiles.
but more realistically, I would not be debugging in production -- I'd use a development environment for that and run tools in the foreground, etc.
sometimes that is simply not possible. There are sometimes issues in production that cannot be reproduced in a local development environment, or even in a mirrored environment. Sometimes (once every few years, in my experience) there is a serious production issue that cannot be figured out quickly by other means and needs to be debugged in production. This is reality.
Thanks for the issue but I have to agree this is out-of-scope for dumb-init. If you need any kind of process supervision functionality, it would be better to use a different tool.
@chriskuehl I understand, but I should clarify: I do not actually need or want any kind of process supervision functionality. I do not want any automatic restarting or monitoring of the main process, and I do not find the requested functionality to really be "process supervision".
I guess I will find some workaround in the future. Thanks.