GoogleCloudPlatform/berglas

Signal sent after the process has finished

paalka opened this issue · 6 comments

First of all, thanks for the great work on improving and providing support for Berglas! :)

The issue I have is that Berglas seems to send signals after the process has finished running.
To reproduce this issue, the following script should hopefully be sufficient:

#!/bin/bash
for i in {1..25} 
do
        berglas exec /usr/bin/true
done

When running the script above (on an Ubuntu 20.04 machine), I get the following output:

failed to signal command os: process already finished
failed to signal command: os: process already finished
failed to signal command: os: process already finished
failed to signal command: os: process already finished
failed to signal command: os: process already finished

These errors only sometimes occur, so you might have to run it multiple times for the commands to fail.
As you suggested in #36 it seems to occur less frequently with commands that run longer (ex. using sleep 10 seems to on average yield less errors).

Just in case it could be useful – it seems like the error is exported in the os golang module.
I can try to create a PR with a suggested fix if you don't have time to look into it.

Hey @paalka this is race condition because we need to start the signal handler before starting the process, or else we risk losing a signal. However, doing so introduces the edge case where a process abruptly stops. In this case, the signal handler receives the exit and tries to forward to the child, but the child has already died.

In practice, most workloads do not encounter the race condition. Are you actually experiencing this in a production workload, or did you encounter the race theoretically?

Hey @paalka this is race condition because we need to start the signal handler before starting the process, or else we risk losing a signal. However, doing so introduces the edge case where a process abruptly stops. In this case, the signal handler receives the exit and tries to forward to the child, but the child has already died.

Aha! That makes sense! Thanks for the explanation :)

In practice, most workloads do not encounter the race condition. Are you actually experiencing this in a production workload, or did you encounter the race theoretically?

We encounter this relatively frequently when running Argo Cron Workflows on Kubernetes. The workflows primarily consists of simple python scripts that are run every 5-10 minutes or so. Berglas is in this case injected using the mutating webhook. I can try to provide an example of something similar to what we use it for, if that is useful.

In our case, this issue primarily just causes some noise in our monitors and logs.
I suppose we could have some issues with our setup, if this is something that doesn't really occur frequently elsewhere.

@paalka I think the best I could do in Berglas is suppress this particular error from appearing in logs. Would that help?

@sethvargo Hmm, I think that would help! Thanks a lot for the help btw :)

This issue is stale because it has been open for 14 days with no
activity. It will automatically close after 7 more days of inactivity.

This issue has been automatically locked since there has not been any
recent activity after it was closed. Please open a new issue for
related bugs.