containers/podman

[RFE] Integrate sd_notify with podman health checks

Closed this issue · 65 comments

/kind feature

Description

Recently I have played around multiple times with starting podman containers using systemd. If a service consists of multiple containers, e.g. a web application and a database server, some applications require the start of the service to be orchestrated, i.e. the web application needs to be started after the database container is started and ready.

Dependencies can be modeled in systemd, but I didn't find a way to model dependencies based on the readiness of a service running within a container.

It occurred to me that maybe it would be possible to do such an integration by using the sd_notify mechanism in combination with podman health checks, i.e. mark a systemd service of type "notify" as started only when the health check reports "healthy".

Any thoughts on that?

Apologies for not replying earlier to the issue. We have been focused on getting the upcoming Podman v2.0.0 release into shape.

First, I love your idea as it matches our vision of extending Podman with systemd. What's important to mention is that Podman will mount the sd-notify socket into each container if it's set. The idea behind that behaviour is to delegate the messaging to the container itself. This implies that the "healthcheck" logic has to be implemented inside the container.

Let's assume we have a database container: when running inside a Type=notify systemd unit, the container needs to send the ready message. The logic might be implemented in a bash script running in the container, waiting for the database to be ready and eventually sending the message over the notify socket. This will mark the service as started/running, and systemd will continue starting dependent services.
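
A rough sketch of what such a wrapper could look like (assuming the image ships bash, mysqladmin and systemd-notify, that the notify socket is mounted, and with the original entrypoint path as a placeholder):

#!/bin/bash
# Hypothetical wrapper entrypoint: start the original database entrypoint,
# wait until the server answers, then report readiness over the notify socket.
set -euo pipefail

/usr/local/bin/original-entrypoint.sh mysqld &
MAIN_PID=$!

# Poll until the database responds, then tell systemd we are ready.
until mysqladmin ping --silent; do
    sleep 2
done
systemd-notify --ready

# Keep the wrapper alive as long as the database runs.
wait "$MAIN_PID"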

@rhatdan, that could be an interesting blog post.

I think there are two uses for notify. One, as @vrothberg pointed out, is managed by the container to signal when it is ready.

Another one, which could be handled with healthchecks, is the systemd watchdog: signaling systemd that the service/container is still working and doesn't need to be restarted. This second use case should not be delegated to the container.

Thanks for your feedback @vrothberg and @giuseppe

I am aware that the sd_notify socket is exposed to the container, but IMHO it is not easy to make use of it if the containerized application (e.g. a Spring Boot application) is not sd_notify aware. I was thinking about the bash script solution you mentioned, but I couldn't get my head around how to make it work if I am not in control of the container image build process.

At the end of the day, I would love to have something like Kubernetes readiness / liveness probes which can be tied to systemd, which controls the states of the containers. The health check mechanism is a great start and seems to work fine, but it would be great if systemd knew about the state of these health checks.

Another one, which could be handled with healthchecks, is the systemd watchdog: signaling systemd that the service/container is still working and doesn't need to be restarted. This second use case should not be delegated to the container.

I think this has to be a separate unit, as watchdogs are sd-notify based. Using that in the same unit would cause the container to mount the socket (and the runtime to wait for a ready message).

At the end of the day, I would love to have something like Kubernetes readiness / liveness probes which can be tied to systemd, which controls the states of the containers. The health check mechanism is a great start and seems to work fine, but it would be great if systemd knew about the state of these health checks.

@jritter, I think we can achieve that with a second unit running podman healthcheck on the container. Would a second unit work for you?
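
Roughly sketched (unit and container names are placeholders, untested), the second unit could retry the container's own health check until it passes, and dependent services would then order themselves After=/Requires= it:

# dbservice-healthy.service (hypothetical)
[Unit]
Description=Wait until the dbservice container reports healthy
After=dbservice.service
BindsTo=dbservice.service

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/bin/sh -c 'until /usr/bin/podman healthcheck run dbservice-mariadb; do sleep 5; done'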

I think this has to be a separate unit, as watchdogs are sd-notify based. Using that in the same unit would cause the container to mount the socket (and the runtime to wait for a ready message).

Wouldn't that be a problem only if the runtime sends a healthcheck before the container payload sends READY=1 (and assuming systemd doesn't handle this case)? I don't think there is a problem if both use the same socket.

Friendly ping.

Hi, sorry for the delay. I was thinking about a separate unit B monitoring the podman container service unit A, and then modeling dependencies based on the systemd status of unit B.

It just doesn't feel like a nice solution at the end of the day (lots of error-prone manual wiring and confusing dependencies), in my opinion, it should be possible to wire the service up in a way that systemd knows what's going on under the hood.

And to me, the mechanism by which this is achieved really doesn't matter; I just thought sd_notify might be an option. I had a chat with @mheon regarding this, and based on his reaction I thought this might be the way to go.

A friendly reminder that this issue had no activity for 30 days.

@jritter @mheon @vrothberg Any further thoughts on this?

I think that with the recent changes from @goochjj we are a huge leap closer. Need more time (which I currently don't have) to think it through though.

@vrothberg, I assume you are referring to this PR, right? #6693

That certainly looks interesting, I'll have a closer look at this. Based on this #6693 (comment) example I would run the readiness check after running podman run, and then send the systemd-notify --ready.

What I don't like about this approach is the fact that containers have to be started in detached mode, which makes log handling a bit trickier. When running in foreground mode, logs written to stdout are fed straight into the systemd journal, which makes it super easy to query. Any idea on that? Of course podman logs would still work, but I would love to be able to treat services running in containers just as any other systemd service.

Couldn't the systemd-notify --ready be called as part of an ExecStartPost? That way the container could still be started in foreground mode.

@duritong, according to the manpage systemd.service(5) ExecStartPost only runs after "READY=1" is sent for Type=notify, and that's exactly what systemd-notify --ready does.

By definition @jritter can't use ExecStartPost because that wouldn't check service health, it just verifies podman returned.

With the sdnotify options now, you can use sdnotify conmon and it'll send READY=1 when podman exits. No opportunity for a health check.

With sdnotify container, YOU send READY=1, so it's up to you to send it when appropriate, and signal other states as appropriate.

Either way, podman isn't "in charge" of sd-notify... I imagine health checking is an OCI runtime option, so it might be something that could be added in the OCI runtime, but right now it just passes READY through.

You could do your own health check at the systemd level. I.e. Script it, use a script w/ a coprocess for the health checking, or exec podman -d and have your "primary" process be one that polls health and returns ready and such.

You could do your own health check at the container level - just make sure the process in your container that does healthchecks chats with systemd. i.e.

  1. start your primary process (i.e. nginx) NOT sd-notify enabled
  2. Start your health checker process
  3. Health checker sends READY=1 the first time the check succeeds.
  4. Health checker sends STOPPING, or RELOADING, or WATCHDOG=trigger when health check fails, depending on what you want systemd to do. Or just have successful health-checks send WATCHDOG=1 - that'll probably do it.
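
A minimal sketch of such a health-checker process (the /health endpoint is an assumption; curl and systemd-notify must exist in the image and the notify socket must be mounted):

#!/bin/sh
# Hypothetical in-container health checker: report readiness once,
# then keep refreshing the watchdog while the check passes.
READY_SENT=0
while true; do
    if curl -fsS http://localhost:8080/health >/dev/null; then
        if [ "$READY_SENT" -eq 0 ]; then
            systemd-notify --ready
            READY_SENT=1
        fi
        systemd-notify WATCHDOG=1
    fi
    sleep 30
done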

I'm not sure if NOTIFY_SOCKET is available to health check processes when they're spawned - but I'm stabbing in the dark and assuming the health checks are spawned by the OCI runtime, not by podman. Making that smarter would be harder.

That's why really I'd like podman to take sd-notify out of the hands of the runtime entirely... because the BIGGEST problem with what you suggest is that podman will block until READY=1 is received. -d isn't going to fix that. And by block, I mean everything. podman ps, podman inspect, podman exec, podman anything will block until READY=1 is received - because the OCI runtime blocks until the container is "READY". Since you want it tied to health checking, you're assuming there's an init process with a non-trivial startup time, which means you WILL have this issue. See #6688

Ultimately, IMHO this isn't behavior podman is responsible for, and personally it's behavior I wouldn't WANT the OCI runtime or anything else doing. If you want to have a healthcheck that can speak sd-notify, then go right ahead. But I think the use case is too complicated and specialized for automagic behavior.

Can you define exactly how "health check response codes" would translate into sd-notify messages?

@vrothberg, I assume you are referring to this PR, right? #6693

That certainly looks interesting, I'll have a closer look at this. Based on this #6693 (comment) example I would run the readiness check after running podman run, and then send the systemd-notify --ready.

What I don't like about this approach is the fact that containers have to be started in detached mode, which makes log handling a bit trickier. When running in foreground mode, logs written to stdout are fed straight into the systemd journal, which makes it super easy to query. Any idea on that? Of course podman logs would still work, but I would love to be able to treat services running in containers just as any other systemd service.

I don't think that is true anymore.

Before it was true, because the service type needed it, and we needed Systemd to look for the pid file.

Now as long as Type=notify works with long running processes as well as daemons, it should make no difference whether you use -d or not. Otherwise your complaint is that "Systemd Type=notify services must fork" which isn't a podman problem.

You'll probably end up with an extra pid - i.e. podman sticking around in addition to conmon

Thanks for your feedback @vrothberg and @giuseppe

I am aware that the sd_notify socket is exposed to the container, but IMHO it is not easy to make use of it if the containerized application (e.g. a Spring Boot application) is not sd_notify aware. I was thinking about the bash script solution you mentioned, but I couldn't get my head around how to make it work if I am not in control of the container image build process.

  1. Create myentrypoint.sh in a place
  2. podman run --entrypoint /usr/local/bin/myentrypoint.sh -v $PWD/myentrypoint.sh:/usr/local/bin/myentrypoint.sh
  3. Include old container entrypoint in myentrypoint.sh, or move it to the cmd

Myentrypoint forks off your spring boot process and execs the health check.
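
For illustration only (the original entrypoint path and the health URL are assumptions), myentrypoint.sh could be as small as:

#!/bin/sh
# Hypothetical replacement entrypoint: run the image's original entrypoint
# in the background, then loop as the health checker. Repeated READY=1
# messages are harmless, so no state needs to be tracked.
/opt/app/original-entrypoint.sh "$@" &

while true; do
    curl -fsS http://localhost:8080/actuator/health >/dev/null && systemd-notify --ready
    sleep 30
done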

By definition @jritter can't use ExecStartPost because that wouldn't check service health, it just verifies podman returned.

With the sdnotify options now, you can use sdnotify conmon and it'll send READY=1 when podman exits. No opportunity for a health check.

With sdnotify container, YOU send READY=1, so it's up to you to send it when appropriate, and signal other states as appropriate.

Either way, podman isn't "in charge" of sd-notify... I imagine health checking is an OCI runtime option, so it might be something that could be added in the OCI runtime, but right now it just passes READY through.

You could do your own health check at the systemd level. I.e. Script it, use a script w/ a coprocess for the health checking, or exec podman -d and have your "primary" process be one that polls health and returns ready and such.

You could do your own health check at the container level - just make sure the process in your container that does health checks chats with systemd. i.e.

1. start your primary process (i.e. nginx) NOT sd-notify enabled

2. Start your health checker process

3. Health checker sends READY=1 the first time the check succeeds.

4. Health checker sends STOPPING, or RELOADING, or WATCHDOG=trigger when health check fails, depending on what you want systemd to do.  Or just have successful health-checks send WATCHDOG=1 - that'll probably do it.

I'm not sure if NOTIFY_SOCKET is available to health check processes when they're spawned - but I'm stabbing in the dark and assuming the health checks are spawned by the OCI runtime, not by podman. Making that smarter would be harder.

Hi @goochjj, thanks for chiming in on this. I think you are raising good points. The basis of my idea was to take two existing concepts (systemd-notify and podman health checks) and make them able to talk to each other. My ultimate goal, however, is to be able to model dependencies in systemd based on the service state: for instance, when a service A running in a container takes 3 minutes to start, service B, which depends on service A, should not start until service A is fully up. In an ideal world, all my services within the containers would understand the concept of SD_NOTIFY out of the box, but unfortunately this is not the case in real life. So I am looking for a solution to run some sort of readiness and liveness checks and tell systemd about the results, similar to how Kubernetes does its readiness and liveness checks.

That's why really I'd like podman to take sd-notify out of the hands of the runtime entirely... because the BIGGEST problem with what you suggest is that podman will block until READY=1 is received. -d isn't going to fix that. And by block, I mean everything. podman ps, podman inspect, podman exec, podman anything will block until READY=1 is received - because the OCI runtime blocks until the container is "READY". Since you want it tied to health checking, you're assuming there's an init process with a non-trivial startup time, which means you WILL have this issue. See #6688

Ultimately, IMHO this isn't behavior podman is responsible for, and personally it's behavior I wouldn't WANT the OCI runtime or anything else doing. If you want to have a healthcheck that can speak sd-notify, then go right ahead. But I think the use case is too complicated and specialized for automagic behavior.

Maybe you are right, and health checking is not the responsibility of podman in this scenario. In the Kubernetes world, health checking is done by the kubelet, which is part of the orchestrator and not the container runtime. I guess in the situation we are discussing, systemd takes the role of the orchestrator. Any thoughts on this @vrothberg @giuseppe @rhatdan @goochjj ? Should this maybe be a discussion on the systemd mailing list?

Can you define exactly how "health check response codes" would translate into sd-notify messages?

According to man podman-healthcheck-run(1), there are currently 3 return codes defined:

  • 0 = healthcheck command succeeded
  • 1 = healthcheck command failed
  • 125 = an error has occurred

So my approach would be to send READY=1 and WATCHDOG=1 if the health check returns 0, READY=0 otherwise.

Another approach that we might look into: podman health checks are triggered periodically by a systemd timer, which run in a separate cgroup. Maybe those could send a systemd-notify notification, this would require NotifyAccess to be set to "all" I guess.

@vrothberg, I assume you are referring to this PR, right? #6693
That certainly looks interesting, I'll have a closer look at this. Based on this #6693 (comment) example I would run the readiness check after running podman run, and then send the systemd-notify --ready.
What I don't like about this approach is the fact that containers have to be started in detached mode, which makes log handling a bit trickier. When running in foreground mode, logs written to stdout are fed straight into the systemd journal, which makes it super easy to query. Any idea on that? Of course podman logs would still work, but I would love to be able to treat services running in containers just as any other systemd service.

I don't think that is true anymore.

Before it was true, because the service type needed it, and we needed Systemd to look for the pid file.

Now as long as Type=notify works with long running processes as well as daemons, it should make no difference whether you use -d or not. Otherwise your complaint is that "Systemd Type=notify services must fork" which isn't a podman problem.

You'll probably end up with an extra pid - i.e. podman sticking around in addition to conmon

I'm not sure if I am following here. You are saying that logs also appear in the systemd journal when podman is forking? Here is an example where that was not the case, but maybe I am missing an option:


# Inspired by 
# - https://www.redhat.com/sysadmin/podman-shareable-systemd-services
# - https://fedoramagazine.org/systemd-unit-dependencies-and-order/

[Unit]
Description=MariaDB service

[Service]
Restart=on-failure
ExecStartPre=/usr/bin/rm -f %t/%n-pid %t/%n-cid
ExecStartPre=/bin/sh -c '/usr/bin/podman rm -f dbservice-mariadb || exit 0'
ExecStart=/usr/bin/podman run \
    --name dbservice-mariadb \
    --hostname dbservice-mariadb \
    --conmon-pidfile %t/%n-pid \
    --cidfile %t/%n-cid \
    -d \
    -e MYSQL_USER=user -e MYSQL_PASSWORD=pass -e MYSQL_DATABASE=db \
    -p 3306:3306 -p 8080:80 \
    -v /var/dbservice/db:/var/lib/mysql/data:Z \
    --health-cmd='/usr/bin/mysql -h localhost -u root -e "show databases;" || exit 1' \
    --health-interval=30s \
    --health-retries=3 \
    --health-start-period=0s \
    --health-timeout=30s \
    registry.redhat.io/rhel8/mariadb-103
ExecStop=/usr/bin/sh -c "/usr/bin/podman rm -f `cat %t/%n-cid`"
KillMode=none
Type=forking
PIDFile=%t/%n-pid

[Install]
WantedBy=multi-user.target

Thanks for your feedback @vrothberg and @giuseppe
I am aware that the sd_notify socket is exposed to the container, but IMHO it is not easy to make use of it if the containerized application (e.g. a Spring Boot application) is not sd_notify aware. I was thinking about the bash script solution you mentioned, but I couldn't get my head around how to make it work if I am not in control of the container image build process.

1. Create myentrypoint.sh in a place

2. podman run --entrypoint /usr/local/bin/myentrypoint.sh -v $PWD/myentrypoint.sh:/usr/local/bin/myentrypoint.sh

3. Include old container entrypoint in myentrypoint.sh, or move it to the cmd

Myentrypoint forks off your spring boot process and execs the health check.

I agree, this works in a perfect world where I am in charge of the container image build process as well as sysadmin. In my current situation, I have to operate under the assumption that I cannot modify the container images (e.g. third party software shipped in a container image). Of course, I could just slap a different configuration on top of the image, but this would cause raised eyebrows from a support perspective.

Thanks for your feedback @vrothberg and @giuseppe
I am aware that the sd_notify socket is exposed to the container, but IMHO it is not easy to make use of it if the containerized application (e.g. a Spring Boot application) is not sd_notify aware. I was thinking about the bash script solution you mentioned, but I couldn't get my head around how to make it work if I am not in control of the container image build process.

1. Create myentrypoint.sh in a place

2. podman run --entrypoint /usr/local/bin/myentrypoint.sh -v $PWD/myentrypoint.sh:/usr/local/bin/myentrypoint.sh

3. Include old container entrypoint in myentrypoint.sh, or move it to the cmd

Myentrypoint forks off your spring boot process and execs the health check.

I agree, this works in a perfect world where I am in charge of the container image build process as well as sysadmin. In my current situation, I have to operate under the assumption that I cannot modify the container images (e.g. third party software shipped in a container image). Of course, I could just slap a different configuration on top of the image, but this would cause raised eyebrows from a support perspective.

Nothing I suggested actually changed the image, you're just bindmounting a new entrypoint in place.

If that will raise eyebrows, then the people in charge of building the image should integrate operations concerns into the build. That's why it's DevOps, after all. :-D

Can you define exactly how "health check response codes" would translate into sd-notify messages?

According to man podman-healthcheck-run(1), there are currently 3 return codes defined:

  • 0 = healthcheck command succeeded
  • 1 = healthcheck command failed
  • 125 = an error has occurred

So my approach would be to send READY=1 and WATCHDOG=1 if the health check returns 0, READY=0 otherwise.

READY=0 isn't a thing.

https://www.freedesktop.org/software/systemd/man/sd_notify.html

Tells the service manager that service startup is finished, or the service finished loading its configuration. This is only used by systemd if the service definition file has Type=notify set. Since there is little value in signaling non-readiness, the only value services should send is "READY=1" (i.e. "READY=0" is not defined).

You could DELAY sending READY=1, but you can't send READY=0.

I'm against having the OCI runtime integrate health checks into SDNotify, mainly because the OCI runtime's behavior right now is to block until READY=1 is sent, which makes many other things worse. (i.e. podman locks, inability to exec into the container) Delaying the INITIAL READY=1 will deal with startup, but consider the case where the health check never succeeds. It'll lock podman forever. After that, you could have the health check send WATCHDOG=1 and let systemd deal with timeouts, or do WATCHDOG=trigger if... 1 health check fails? multiple health checks fail?

IMHO if your intention is to better integrate your containers with systemd... you...should.. do that. Bindmount the notify socket through and implement your own service startup notifications/health checks, as sdnotify was purpose built to allow services to report their own status, I think you should implement it that way.

Another approach that we might look into: podman health checks are triggered periodically by a systemd timer, which run in a separate cgroup. Maybe those could send a systemd-notify notification, this would require NotifyAccess to be set to "all" I guess.

Since you have a business need to not modify the build, that's another option - perhaps using ExecStartPre. Do be aware that systemd identifies which service it's speaking to by cgroup, so it WOULD have to be part of the same cgroup the unit uses, otherwise the messages you send to the notify socket would not be mapped back to the right unit. (sdnotify protocol uses a single unix socket, and has no protocol to identify which unit is speaking, it resolves the unit from the cgroup of the sending PID)

I'm not sure if I am following here. You are saying that logs also appear in the systemd journal when podman is forking? Here is an example where that was not the case, but maybe I am missing an option:

Fixed that for you - try below.


# Inspired by 
# - https://www.redhat.com/sysadmin/podman-shareable-systemd-services
# - https://fedoramagazine.org/systemd-unit-dependencies-and-order/

[Unit]
Description=MariaDB service

[Service]
Restart=on-failure
ExecStartPre=-/usr/bin/podman rm -f dbservice-mariadb
ExecStart=/usr/bin/podman run \
    --name dbservice-mariadb \
    --hostname dbservice-mariadb \
    --sdnotify conmon \
    -e MYSQL_USER=user -e MYSQL_PASSWORD=pass -e MYSQL_DATABASE=db \
    -p 3306:3306 -p 8080:80 \
    -v /var/dbservice/db:/var/lib/mysql/data:Z \
    --health-cmd='/usr/bin/mysql -h localhost -u root -e "show databases;" || exit 1' \
    --health-interval=30s \
    --health-retries=3 \
    --health-start-period=0s \
    --health-timeout=30s \
    registry.redhat.io/rhel8/mariadb-103
ExecStop=/usr/bin/podman stop -i -t 20 dbservice-mariadb
ExecStopPost=-/usr/bin/podman rm -f dbservice-mariadb
KillMode=none
Type=notify
NotifyAccess=all

[Install]
WantedBy=multi-user.target

You'll find that podman doesn't block. sdnotify receives communications from conmon, so the MAINPID is passed up without needing a pidfile. The notify socket is not passed into the container.

Note that doesn't prevent you from bindmounting it into the container and continuing to speak through it, as long as the cgroup resolves properly - I'm a fan of --cgroups split

If you don't want podman to send the READY=1, then you're really looking at --sdnotify container - because you still need the MAINPID broadcast from podman, even if it's not going to send the READY=1.

Not needed:

ExecStartPre=/usr/bin/rm -f %t/%n-pid %t/%n-cid
    --conmon-pidfile %t/%n-pid \
    --cidfile %t/%n-cid \
    -d \
Type=forking
PIDFile=%t/%n-pid

Another option, add ExecStartPre= lines to your dependent services.
i.e.

[Service]
Restart=on-failure
#I've found I need this otherwise I end up with failed services that don't autorestart causing issues with dep services
ExecStartPre=/usr/bin/systemctl is-active dbservice-mariadb
#Will fail if container isn't created
ExecStartPre=/usr/bin/podman inspect --type container dbservice-mariadb
# This is your health check
ExecStartPre=/usr/bin/podman exec dbservice-mariadb sh -c '/usr/bin/mysql -h localhost -u root -e "show databases;" || exit 1'
ExecStartPre=-/usr/bin/podman rm -f dependent-service
ExecStart=/usr/bin/podman run --name dependent-service etc...

If any of the pre's without a - fail, then the service startup fails - no need for sdnotify or anything. It does break encapsulation a bit, having dep services needing to know about parent containers. I'd get around that by creating a convention...i.e. /healthcheck.sh is always my healthchecker, that way I don't have to know how to call mysql for a mariadb container. It'll make both parent and dependent service units cleaner.

Further notes:

I prefer --log-driver journald with -d rather than having the podman process attached to systemd piping logs. With the log-driver, conmon handles it, and you can still filter by using CONTAINER_NAME=dbservice-mariadb, and/or the unit if you're using --cgroups split or --cgroups no-conmon. When running without -d, you have a pretty useless process sticking around that just taps into stdout and stderr.
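
For example (reusing the container name and environment from the earlier unit as placeholders):

podman run -d --log-driver journald --name dbservice-mariadb \
    -e MYSQL_USER=user -e MYSQL_PASSWORD=pass -e MYSQL_DATABASE=db \
    registry.redhat.io/rhel8/mariadb-103

# conmon tags each journal entry, so the container's logs can be queried with:
journalctl CONTAINER_NAME=dbservice-mariadb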

I guess you could just use podman healthcheck run dbservice-mariadb instead of what I did - but you get the point.

The other thing that supports less magic here is the healthcheck just sets the container to unhealthy. For instance, even Docker wouldn't kill and restart the container, nor does podman. It's left up to the orchestrator to determine the remedial behavior. Podman and OCI runtimes are lower level than container orchestration. So if you're using systemd as your orchestrator... yeah, not sure how podman can help.

Docker sends an event when the healthcheck fails, (actually, when a container transitions from healthy to unhealthy), so perhaps what you really need to do is tap into podman's events stream, and trigger systemd actions to remediate.

Upon looking at the code, the health checks appear to leverage systemd to create a timer to run podman healthcheck run.

I'd consider it in scope to provide "healthy" and "unhealthy" commands that the podman healthcheck run command would run on state transition, to work within the existing systemd timer. (Perhaps there needs to be a podman healthcheck periodic or 'scheduled' mode which triggers these hooks, and perhaps there are two versions, for triggering action outside the container, i.e. with systemctl, or inside the container, i.e. service mysqld restart.) This could be actionable as a feature request.

It still means scripting your "ops". Also, given the commands available to you you could just schedule a systemd timer to do that for you, if that's all the healthchecks are.

Additional notes: I did a contrived check here and I didn't see healthy or unhealthy events show up in the podman events stream... So either I'm doing something wrong or podman doesn't do that... Similarly, I don't see "healthy" and "unhealthy" in the podman ps output. These would be feature discrepancies between Docker and podman that are probably undesirable... That may be actionable as a bug report.

I don't think SDNotify is going to be the answer. (Which is probably why it isn't widely deployed)

@jritter Thoughts?

I'm not sure if I am following here. You are saying that logs also appear in the systemd journal when podman is forking? Here is an example where that was not the case, but maybe I am missing an option:

Fixed that for you - try below.


# Inspired by 
# - https://www.redhat.com/sysadmin/podman-shareable-systemd-services
# - https://fedoramagazine.org/systemd-unit-dependencies-and-order/

[Unit]
Description=MariaDB service

[Service]
Restart=on-failure
ExecStartPre=-/usr/bin/podman rm -f dbservice-mariadb
ExecStart=/usr/bin/podman run \
    --name dbservice-mariadb \
    --hostname dbservice-mariadb \
    --sdnotify conmon \
    -e MYSQL_USER=user -e MYSQL_PASSWORD=pass -e MYSQL_DATABASE=db \
    -p 3306:3306 -p 8080:80 \
    -v /var/dbservice/db:/var/lib/mysql/data:Z \
    --health-cmd='/usr/bin/mysql -h localhost -u root -e "show databases;" || exit 1' \
    --health-interval=30s \
    --health-retries=3 \
    --health-start-period=0s \
    --health-timeout=30s \
    registry.redhat.io/rhel8/mariadb-103
ExecStop=/usr/bin/podman stop -i -t 20 dbservice-mariadb
ExecStopPost=-/usr/bin/podman rm -f dbservice-mariadb
KillMode=none
Type=notify
NotifyAccess=all

[Install]
WantedBy=multi-user.target

You'll find that podman doesn't block. sdnotify receives communications from conmon, so the MAINPID is passed up without needing a pidfile. The notify socket is not passed into the container.

Note that doesn't prevent you from bindmounting it into the container and continuing to speak through it, as long as the cgroup resolves properly - I'm a fan of --cgroups split

If you don't want podman to send the READY=1, then you're really looking at --sdnotify container - because you still need the MAINPID broadcast from podman, even if it's not going to send the READY=1.

Not needed:

ExecStartPre=/usr/bin/rm -f %t/%n-pid %t/%n-cid
    --conmon-pidfile %t/%n-pid \
    --cidfile %t/%n-cid \
    -d \
Type=forking
PIDFile=%t/%n-pid

Thanks, this looks promising. I'll give this a shot once this new functionality has been released; 2.0.4 doesn't contain it yet as far as I can tell.

Another option, add ExecStartPre= lines to your dependent services.
i.e.

[Service]
Restart=on-failure
#I've found I need this otherwise I end up with failed services that don't autorestart causing issues with dep services
ExecStartPre=/usr/bin/systemctl is-active dbservice-mariadb
#Will fail if container isn't created
ExecStartPre=/usr/bin/podman inspect --type container dbservice-mariadb
# This is your health check
ExecStartPre=/usr/bin/podman exec dbservice-mariadb sh -c '/usr/bin/mysql -h localhost -u root -e "show databases;" || exit 1'
ExecStartPre=-/usr/bin/podman rm -f dependent-service
ExecStart=/usr/bin/podman run --name dependent-service etc...

If any of the pre's without a - fail, then the service startup fails - no need for sdnotify or anything. It does break encapsulation a bit, having dep services needing to know about parent containers. I'd get around that by creating a convention...i.e. /healthcheck.sh is always my healthchecker, that way I don't have to know how to call mysql for a mariadb container. It'll make both parent and dependent service units cleaner.

Well, running checks on the dependent service also occurred to me, and that's more or less how I do it at the moment. The goal behind this RFE is to be able to model these dependencies without the dependent service needing knowledge of the internals of the service it depends on, and to model everything using systemd dependencies.

Upon looking at the code, the health checks appear to leverage systemd to create a timer to run podman healthcheck run.

I'd consider it in scope to provide "healthy" and "unhealthy" commands that the podman healthcheck run command would run on state transition, to work within the existing systemd timer. (Perhaps there needs to be a podman healthcheck periodic or 'scheduled' mode which triggers these hooks, and perhaps there are two versions, for triggering action outside the container, i.e. with systemctl, or inside the container, i.e. service mysqld restart.) This could be actionable as a feature request.

It still means scripting your "ops". Also, given the commands available to you you could just schedule a systemd timer to do that for you, if that's all the healthchecks are.

Tying the states of the health check (see podman inspect <container-name>) with the systemd orchestration is what my original thought was. The 'ops' part in case of an unhealthy container could be done with the systemd watchdog functionality (i.e. the periodically triggered healthcheck would send an sd_notify keepalive if the health check finds the container to be healthy). systemd allows configuring a restart policy in case of a keepalive timeout (see man 5 systemd.service).
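
A minimal sketch of that keepalive, assuming the container unit sets WatchdogSec= and NotifyAccess=all, and that this helper runs where systemd can attribute it to the unit (see the cgroup caveat discussed above):

#!/bin/sh
# Hypothetical watchdog helper: refresh the systemd keepalive only while
# the container's own health check (exit code 0) keeps passing.
while true; do
    if /usr/bin/podman healthcheck run dbservice-mariadb; then
        systemd-notify WATCHDOG=1
    fi
    sleep 30
done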

Additional notes: I did a contrived check here and I didn't see healthy or unhealthy events show up in the podman events stream... So either I'm doing something wrong or podman doesn't do that... Similarly, I don't see "healthy" and "unhealthy" in the podman ps output. These would be feature discrepancies between Docker and podman that are probably undesirable... That may be actionable as a bug report.

That is also what I noticed. Health checks are available and the state is recorded, but at the moment it is hard to integrate state changes into other mechanisms.

I don't think SDNotify is going to be the answer. (Which is probably why it isn't widely deployed)

Maybe you are right. I just thought that we have two components (systemd and podman health checks) which are ready to use, but at the moment it is a bit tricky to make them talk with each other.

A friendly reminder that this issue had no activity for 30 days.

Sadly this issue never makes any progress.

Hi everybody, this is a small attempt to rekindle this discussion. I finally found some time to play around with the suggestion of @goochjj , which makes use of the new feature he implemented (--sdnotify conmon). This is indeed an improvement, as the service remains in state "activating" during the pull of the container. It transitions to state active after the container has sprung to life.

In my case this is still too early though, a dependent service should only start after the application itself is ready to serve, which might take slightly more time.

I don't know if you are interested, but I have implemented a small playground with a containerized example application which artificially delays the readiness of the application. The repository also contains two systemd unit files which start the containers according to the suggestion of @goochjj. It also integrates the podman healthcheck functionality.

My goal would be to delay the start of service-two until service-one is ready to serve, meaning the health endpoint reports "UP".

https://github.com/jritter/start-delay-playground

More details can be found in the README.md of the repository above.

Oh and there is another discussion going on in the systemd repository:
systemd/systemd#9075

If you're going to do that, why not just make your service sd-notify capable?

Integrate this
https://github.com/faljse/SDNotify

Before your waitforexit, advertise ready with SDNotify.sendNotify();

And spawn with --sdnotify=container

I created this as an example application to do some testing. Of course, in a perfect world I could influence the application, and this would be by far the best solution, but I'm operating under the assumption that I have finished container images that are not sd-notify capable, and I need to make these container images operations-ready. This is the scenario that I have encountered a couple of times out there at customers, and I'm trying to find a good solution for situations like this.

A friendly reminder that this issue had no activity for 30 days.

I hope I won't be hijacking this discussion, but in relation to this approach:

If you're going to do that, why not just make your service sd-notify capable?

Integrate this
https://github.com/faljse/SDNotify

Before your waitforexit, advertise ready with SDNotify.sendNotify();

And spawn with --sdnotify=container

I have been trying to do this - although with https://github.com/coreos/go-systemd, because my container is written in golang - and failing to make it work. I've tried many variations of the podman invocation, but when my container sends READY=1, either systemd doesn't see it at all, or sees a PID that it can't correlate to the service. I wonder if there is a trick that I'm missing?

This discussion seemed to me to be a possible cause - i.e. it doesn't work because podman/conmon move the container process into a different cgroup than the one that systemd expects. Am I correct that that's a fundamental problem, or is there an easy way to work around it?

I'm also not seeing MAINPID set in the container's environment, and in any case I'm not sure it helps for the container to specify MAINPID in its notification because - IIUC - the notification is swallowed and then re-emitted, with only READY=1, by the runc runtime.

So many layers here. If there is a known working way to do this, I'd appreciate seeing an example or documentation for it. This is all part of an OpenShift 4.7.9 rig, so the versions are those that OpenShift 4.7.9 provides. (I'll follow up with those a bit later once the rig is up again.)

You're probably right about the cgroups... but sdnotify=container will work around that if/when
#8508 is merged - because the process that sends the sd-notify message will be conmon itself, which IS in the right cgroup. Otherwise it's the job of the OCI runtime, which is probably in the container cgroup, not the supervisor cgroup. You can see this with systemd-cgls or systemctl status unitname

In the meantime you should add --cgroups split, which should organize the pod's cgroups as a sub-cgroup of the service's cgroup. You should add Delegate=yes to your [Service] section then as well.
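
Roughly, in the unit (excerpt; unit name and image are placeholders):

[Service]
Type=notify
NotifyAccess=all
Delegate=yes
ExecStart=/usr/bin/podman run --name myapp --cgroups split --sdnotify=container registry.example.com/myapp:latest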

Also FYI, when 8508 is merged Conmon will hijack the MAINPID and return itself. I don't remember if that happens with the OCI runtime or not.

Many thanks @goochjj !

A friendly reminder that this issue had no activity for 30 days.

A friendly reminder that this issue had no activity for 30 days.

It occurred to me that maybe it would be possible to do such an integration by using the sd_notify mechanism in combination with podman health checks, i.e. mark a systemd service of type "notify" as started only when the health check reports "healthy".

The sdnotify integration has been completed by commit c22f3e8. systemd units with Type=notify (now the default in podman-generate-systemd) and containers created with --sdnotify=container will only be marked as started after the container has sent a ready message.

I think this matches the desired behavior (without health checks). Please reopen if I am mistaken.

I cannot speak to the intentions of OP, but what I expected here was: if I'm running a systemd unit with Type=notify, and a container with --sdnotify=conmon and there's a health-check implemented, then conmon will wait for the health-check to pass before sending READY=1 (rather than sending it when the container initializes, as it currently does).

The above changes do indeed fix issues for applications with support for sd_notify (including MariaDB, thankfully), but the approach above would help bridge support where none exists in the application itself. It is, of course, possible to script this in the container entry-point, but getting this right isn't always as simple.

Thank you for clarifying, @deuill!

Changing the semantics of --sdnotify=conmon may break existing users who rely on the current behavior, but I think we can add a new policy --sdnotify=healthcheck to achieve what you desire. @rhatdan FYI

SGTM

A friendly reminder that this issue had no activity for 30 days.

@vrothberg any movement on this one?

No. Please let me know if this has priority and I will tackle it. Otherwise, my pipeline is very long.

@flouthoc Something you are interested in looking at?

Thanks for your feedback @vrothberg and @giuseppe

I am aware that the sd_notify socket is exposed to the container, but IMHO it is not easy to make use of it if the containerized application (e.g. a Spring Boot application) is not sd_notify aware. I was thinking about the bash script solution you mentioned, but I couldn't get my head around how to make it work if I am not in control of the container image build process.

At the end of the day, I would love to have something like Kubernetes readiness / liveness probes which can be tied to systemd, which controls the states of the containers. The health check mechanism is a great start and seems to work fine, but it would be great if systemd knew about the state of these health checks.

@rhatdan On revisiting the thread and in light of this comment: I think we already implemented liveness probes (#10956), and the whole idea of a multi-container approach goes along with the design of pods. Therefore I think we should invest time in hardening liveness and readiness probes for pods, which exist exactly for these use-cases.

@jritter You could create a pod with two containers, a webserver and a database.

  1. The webserver has a liveness probe which hits /health or runs a bash command that checks whether the database container is up; if it is down, the webserver container will be automatically restarted until your db is up. Afaik this is a standard architecture for a generic microservice on Kubernetes.
  • Plus: you can leverage additions like init containers, which can do migrations and other pre-flight checks for the database.

@jritter I feel using multi-container pods is a more organic solution to this problem than plumbing sd_notify, and if you see any issues with the liveness probe I feel we should spend more time fixing that.


TLDR:
I am not sure, but I think pods are made exactly for these use-cases and multi-container groups. I think users could be encouraged to use pods, and if something is broken in the pod liveness probe then we should spend time fixing that.

But I might be missing other special use-cases where this RFE is needed.

I don't see how this is any different from "improving liveness probes". This is improving healthchecks to integrate much more closely with systemd units (and systemd's much better dependency management). I still see a lot of value here.

I don't see how this is any different from "improving liveness probes". This is improving healthchecks to integrate much more closely with systemd units (and systemd's much better dependency management). I still see a lot of value here.

@mheon makes sense; my intention was only to point towards the problem statement described in the original issue, where the application itself is not aware of systemd, for example a webserver API and a database.

But sure, I agree the RFE seems useful where the system is already being orchestrated using systemd.

Thanks @flouthoc for pointing out the pod liveness / readiness probe feature, I'll definitely have a look at this! I agree though with @mheon: in some cases not everything is "containerized", and the startup order of components should be orchestrated. I had such a use case where periodic retries are just not an option (a customer-facing information kiosk), and I had to do some tedious plumbing to make it work properly. While doing some research I stumbled upon podman health checks and sd_notify, which both seem to do one part of the job, but the integration between the two did not feel great. Hence this RFE.

Just to understand, guys: I have the following systemd unit service

[Unit]
Description=Podman container-f68127644cb1df4e120815b760709be1bf19465dfca1033fc35a2fd8f0a6e129.service
Documentation=man:podman-generate-systemd(1)
Wants=network-online.target
After=network-online.target
RequiresMountsFor=%t/containers

[Service]
Environment=PODMAN_SYSTEMD_UNIT=%n
Restart=on-failure
TimeoutStopSec=70
ExecStartPre=/bin/rm -f %t/%n.ctr-id
ExecStart=/usr/bin/podman run --cidfile=%t/%n.ctr-id --cgroups=no-conmon --rm --sdnotify=conmon --replace --healthcheck-start-period 2m --healthcheck-retries 3 --healthcheck-command "CMD-SHELL curl https://google.com || exit 1" --cap-add=NET_ADMIN -d --network nordvpn --name nordvpn --dns=1.1.1.1 --hostname nordvpn --env-file /mnt/data/nordvpn/env ghcr.io/bubuntux/nordvpn
ExecStop=/usr/bin/podman stop --ignore --cidfile=%t/%n.ctr-id
ExecStopPost=/usr/bin/podman rm -f --ignore --cidfile=%t/%n.ctr-id
Type=notify
NotifyAccess=all

[Install]
WantedBy=multi-user.target default.target

When the container becomes unhealthy, it does not restart automatically, so it remains unhealthy. Is that the exact issue here?

A friendly reminder that this issue had no activity for 30 days.

A friendly reminder that this issue had no activity for 30 days.

A friendly reminder that this issue had no activity for 30 days.

A friendly reminder that this issue had no activity for 30 days.

Just to understand, guys: I have the following systemd unit service

[Unit]
Description=Podman container-f68127644cb1df4e120815b760709be1bf19465dfca1033fc35a2fd8f0a6e129.service
Documentation=man:podman-generate-systemd(1)
Wants=network-online.target
After=network-online.target
RequiresMountsFor=%t/containers

[Service]
Environment=PODMAN_SYSTEMD_UNIT=%n
Restart=on-failure
TimeoutStopSec=70
ExecStartPre=/bin/rm -f %t/%n.ctr-id
ExecStart=/usr/bin/podman run --cidfile=%t/%n.ctr-id --cgroups=no-conmon --rm --sdnotify=conmon --replace --healthcheck-start-period 2m --healthcheck-retries 3 --healthcheck-command "CMD-SHELL curl https://google.com || exit 1" --cap-add=NET_ADMIN -d --network nordvpn --name nordvpn --dns=1.1.1.1 --hostname nordvpn --env-file /mnt/data/nordvpn/env ghcr.io/bubuntux/nordvpn
ExecStop=/usr/bin/podman stop --ignore --cidfile=%t/%n.ctr-id
ExecStopPost=/usr/bin/podman rm -f --ignore --cidfile=%t/%n.ctr-id
Type=notify
NotifyAccess=all

[Install]
WantedBy=multi-user.target default.target

When the container becomes unhealthy, it does not restart automatically, so it remains unhealthy. Is that the exact issue here?

Because you don't understand what health checks are, or what they're meant to do. :-D

This is old but still relevant: https://developers.redhat.com/blog/2019/04/18/monitoring-container-vitality-and-availability-with-podman

You have actually proven that podman AND the health check are working, because the container status moved to unhealthy. That's all the healthcheck commands on podman run do - you define a command to be run, and podman creates a timer in systemd to run your healthcheck and updates the container status from healthy to unhealthy.

If the question is "how do I run related services after I'm sure my parent container is healthy", IMHO, add an ExecStartPre command that runs podman healthcheck run database from the webserver service file. This ensures the healthcheck command defined on the container (by the database service) runs successfully at least once before starting the dependent service. (subject to your restartsec value, it will retry, and therefore not complete startup before the first healthcheck passes)

If the question is "if the container is unhealthy and didn't get restarted".... Correct. Podman takes no action - but for that matter, what would you expect that action to be?

  1. Kill -TERM to the container runtime?
  2. Run podman stop %N?
  3. Issue a systemctl restart on the systemd unit? (Which podman doesn't even know)
  4. Issue some other signal to conmon/container runtime? (and if so which one)
  5. Some other graceful shutdown process

As long as podman run or podman healthcheck doesn't have options to define the above behavior, there can be no reasonable expectation that systemd and podman alone are going to restart your container.

You could create your own systemd unit, with your own timer, which can do this. Take the transient unit in the link above as an example and expand from there - except, define a reasonable OnFailure= action to happen in the transient unit service. (And actually define it as a non-transient unit)
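
Sketched out with hypothetical unit names (the OnFailure= target is whatever remediation unit you define):

# dbservice-healthcheck.service
[Unit]
Description=Periodic health check for dbservice-mariadb
OnFailure=podman-recover@dbservice-mariadb.service

[Service]
Type=oneshot
ExecStart=/usr/bin/podman healthcheck run dbservice-mariadb

# dbservice-healthcheck.timer
[Unit]
Description=Run the dbservice-mariadb health check periodically

[Timer]
OnUnitInactiveSec=30s

[Install]
WantedBy=timers.target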

If there were ANY RFE to come out of this, I'd expect it to be something along the lines of "these config options allow me to add an OnFailure to the healthcheck systemd unit" (i.e. --healthcheck-onfailure=podman-recover@%N) or "we allow a template or more options available to expand the healthcheck systemd unit". Or "Bake some other response into podman healthcheck run. or "allow the user to specify the name of the transient unit service+timer name so we can use it with dropins or other shenanigans", or "provide more user examples/documentation of how this is supposed to work, and example systemd units to handle these use cases")

I think as far as the tool, all the pieces are there to do whatever needs to be done... Expecting podman to just DO it seems beyond what I would expect of a container product that doesn't have any running daemons and makes no claims of being a job orchestrator :-D. Maybe some additional logic can be wrapped around containers in a pod - but unless it's configurable, I wouldn't even want that.

sdnotify=healthcheck isn't really feasible because conmon is the one doing the SD_NOTIFY, and the healthcheck is happening in the transient unit. This doesn't sound like a responsibility conmon should have (executing additional healthchecks and acting on the results), especially since a transient unit is already doing it.

So really, you'd want database.service to start database-container.service and database-ready.service: have database-ready.service Requires= and After= database-container.service, and have database.service require both (or make database-container and database-ready PartOf= database). database-ready.service calls podman healthcheck run; set RestartSec such that it just covers the interval between healthchecks at service start.

It would be a lot simpler (given the above suggestion) to just add the healthcheck to the dependent service in the first place to handle startup sequencing.

And finally, I feel this issue should be closed.

  1. It's not named properly (sd_notify is not the way to do this) which has been beaten to death
  2. If people want health checks to do things - that should be a separate issue and maybe choose from the ones I laid out above. Or maybe create a new issue to discuss a generic approach to using Systemd with health checks - this thread is too long.

Also, there appears to be active development on a health-on-failure action, so these ideas likely belong elsewhere (and this should be closed and/or linked to those issues)

This just merged, so is this enough to satisfy the issue?

Have a look at the following PR: #15687

This is how we envision healthchecks to be used in conjunction with systemd.

I think we have many pieces in place with the on-failure-actions. There's also a blog on the topic: https://www.redhat.com/sysadmin/podman-edge-healthcheck

One thing we can think of is adding a --sdnotify=healthy that will send the READY message once the health status is "healthy". However, this can already be done inside the container itself. I see health checks more on the availability side of things rather than initialization.

@rhatdan @giuseppe WDYT?

I agree.

I just had another look. We need to get #13627 in before tackling this issue here.

The idea when --sdnotify=healthy is specified is to wait for the container to turn healthy and then send the READY message.
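
If that lands, usage might look roughly like this (flag name as proposed here, not yet available; excerpt of a Type=notify unit reusing the earlier example):

[Service]
Type=notify
NotifyAccess=all
ExecStart=/usr/bin/podman run --name dbservice-mariadb --sdnotify=healthy \
    --health-cmd='/usr/bin/mysql -h localhost -u root -e "show databases;" || exit 1' \
    --health-interval=30s \
    registry.redhat.io/rhel8/mariadb-103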

I see health checks more on the availability side of things rather than initialization

There are two types of health checks: healthcheck-* and health-startup-*. I think that this case refers to the latter.