jenkinsci/docker-agent

Jenkins Agent should wait for pipeline to complete on SIGTERM

ferhatguneri opened this issue · 11 comments

Jenkins and plugins versions report

Environment
Jenkins: 2.414.1
OS: Linux - 4.18.0-477.15.1.el8_8.x86_64
Java: 11.0.20 - Eclipse Adoptium (OpenJDK 64-Bit Server VM)

What Operating System are you using (both controller, and any agents involved in the problem)?

Linux

Reproduction steps

Run Kubernetes-hosted agents.
Drain a Kubernetes node or upgrade your Kubernetes cluster.
The agent pods will be deleted immediately.

Expected Results

When the Jenkins agent receives SIGTERM (e.g., when the Kubernetes cluster is upgraded or a node is drained), it does not wait for the job to complete. Kubernetes does wait for StatefulSets, Deployments, and other higher-level resources to be deleted safely if anything beyond bare pods is defined, but it just deletes the pod without checking whether a pipeline job is still running. The expectation is that the agent waits for the running pipeline to finish before terminating.

Actual Results

While draining a node or upgrading the Kubernetes cluster, all the running agents are deleted without waiting for their jobs to complete.

Anything else?

No response

jglick commented

Did you mean to file this against kubernetes-plugin, or do you see something amiss in this image itself?

I think the plugin itself should change. Raw pods will be deleted without any confirmation during any cluster upgrade. So maybe, instead of a pod, you could create a Job and let that Job manage the pod; until the Job finishes, the pod would stay within the grace period. Or maybe the Java code could handle signals (SIGTERM) and wait until the pipeline is done.

I hope it's clear.

jglick commented

So then this can be closed and an issue filed with the plugin. (I think the issue there could be closed as well, with some documentation, but that is not a discussion to have here.)

As I mentioned, there are a few ways of solving this issue, I guess.

  1. Once Kubernetes sends the signal to delete the pod (which means the Jenkins agent), this signal can be handled so that the agent waits until the actual pipeline has completed.
  • In practice, this means your application needs to handle the SIGTERM message and begin shutting down when it receives it. This means saving all data that needs to be saved, closing down network connections, finishing any work that is left, and other similar tasks. More info is here
  2. You can add a preStop hook to the agent pod definition to wait for the pipeline to complete (see the sketch after this list).
  • If your application doesn't gracefully shut down when receiving a SIGTERM, you can use this hook to trigger a graceful shutdown. Most programs gracefully shut down when receiving a SIGTERM, but if you are using third-party code or are managing a system you don't have control over, the preStop hook is a great way to trigger a graceful shutdown without modifying the application.
  3. When you run a pipeline in Jenkins, it creates a pod in Kubernetes, which is the most basic resource in K8s and can be deleted as soon as it receives SIGTERM. Instead of running the pipeline in a bare pod, you could create a Kubernetes Job and attach the pod to it. But this is just my thought; I am not sure it would cover 100% of this particular case.
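
For option 2, here is a minimal sketch of what such a preStop hook could look like in the agent pod template. This is only an illustration, not something the plugin generates: the sentinel file /tmp/build-running is a hypothetical convention that the pipeline (or a wrapper script) would have to create and remove itself, and the 30-minute grace period is an arbitrary example value.

```yaml
# Hypothetical pod template fragment: the preStop hook polls a sentinel file
# that the pipeline is assumed to create while a build is in progress.
# Kubernetes sends SIGTERM to the container only after this hook returns,
# so the grace period must be long enough to cover the hook itself.
apiVersion: v1
kind: Pod
metadata:
  name: jenkins-agent
spec:
  terminationGracePeriodSeconds: 1800   # >= the longest build you expect
  containers:
  - name: jnlp
    image: jenkins/inbound-agent:latest
    lifecycle:
      preStop:
        exec:
          command:
          - sh
          - -c
          - |
            # Block shutdown while the (hypothetical) sentinel file exists.
            while [ -f /tmp/build-running ]; do sleep 5; done
```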

So you decide whether this is a plugin matter or an agent matter, but it seems to me this is the right place to raise the issue.

jglick commented

i[n]stead of running pipeline in the pod you can create kubernetes job and attach pod to this job

This is essentially what happens if you use the recommended idiom: https://github.com/jenkinsci/kubernetes-plugin#retrying-after-infrastructure-outages

Anyway, any changes to the behavior of agent pods would very likely involve changes to K8s YAML, which is controlled by kubernetes-plugin, rather than this agent container image, used for various purposes, which is just a Dockerfile. (Simply trapping SIGTERM in the agent and waiting for a build to complete would not work, since you would exceed the pod's grace period.)

Hi @jglick, I did more tests to figure out whether this is plugin- or agent-related. Here is the result.

When I run a pipeline, it spins up a pod with two containers (the agent, and another one which does the actual work).

While both containers were running, I tried to delete the pod. The Jenkins agent immediately disconnects from Jenkins while the other container is still running, and both containers then wait out the termination grace period.

Stream closed EOF for jenkins-deployment-k8s-agents/dev-jenkins-agent-8-1cm3b-84sxp-rx5bw (jnlp)

So the question is: why does the Jenkins agent disconnect from Jenkins while the other container's process is still running? It could stay connected to Jenkins while waiting to be terminated at the end of the grace period.

I hope that clarifies my concern about the agent.

When you request the termination of the pod, Kubernetes sends a SIGTERM signal to all containers. If the grace period is reached, the processes are terminated abruptly (SIGKILL).

When the agent container receives a SIGTERM signal, its top-level process ("PID 1") starts propagating the signal to its children so they stop properly (hopefully before the grace period expires), and then stops itself.

So this is what happens: you ask for termination, then it terminates gracefully. It's not clear why this normal behavior should be changed.

The grace period is not a Jenkins concept and honestly should NOT be something to tune: it is a standard Linux concept, and a process receiving a SIGTERM is always expected to stop properly as soon as possible.

It looks like, in your case, the 2nd container is not handling the signal properly (or it takes time), but the pod's termination has already been requested: all containers will be stopped, that is the Kubernetes contract. It's not clear why you don't want this behavior.

Yes, I agree with you, but there are cases where you have critical jobs running in your second container. Even if Kubernetes sends SIGTERM, the job should keep running until it completes and only then terminate. That's the reason I'm saying the Jenkins agent should also wait, together with the second container, until the job is done.

The test case: say, in a shared cluster, some developers are running a DB migration that must not be interrupted while a DevOps engineer is upgrading the Kubernetes cluster. In that case the node will be drained, but it should wait for the pod to complete, which makes more sense. While the pod is in the Terminating state the DB migration keeps running, because we handle SIGTERM and ask Kubernetes to wait until the grace period is over. But the agent has effectively already disconnected after receiving SIGTERM, even with a long grace period.
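
For reference, here is a hedged sketch of the Kubernetes side of that scenario. The container name, image, and migrate.sh script are hypothetical; the point is that the working container keeps running through the grace period by ignoring SIGTERM, which is exactly the window during which the agent container has already dropped its connection.

```yaml
# Hypothetical pod fragment: the second (work) container ignores SIGTERM so a
# critical task keeps running until it finishes, or until SIGKILL arrives when
# the grace period expires.
apiVersion: v1
kind: Pod
metadata:
  name: dev-jenkins-agent
spec:
  terminationGracePeriodSeconds: 3600   # must cover the longest critical task
  containers:
  - name: migration                       # hypothetical work container
    image: example.org/db-migrate:latest  # hypothetical image
    command: ["/bin/sh", "-c"]
    args:
    - |
      # Kubernetes delivers SIGTERM to PID 1 (this shell). Ignoring it here
      # lets the migration below run until it completes or SIGKILL arrives.
      trap '' TERM
      ./migrate.sh                        # hypothetical long-running task
```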

Your description confirms that this has nothing to do with the container image here; it is a pure Kubernetes discussion: only pods have the concept of multiple containers. This confirms the issue is closed with no action.

I suggest you start a discussion on community.jenkins.io to get more attention and tips & tricks (you can mention me, same handle as GitHub, as I don't mind sharing pointers with you).

jglick commented

some developers running some DB migration which shouldn't be interrupted meanwhile devops engineer is upgrading kubernetes cluster

If you are running a critical, non-idempotent task, you should mark the pod as not eligible for deletion. This would be something to customize in the K8s pod template. Trying to stay alive after receiving a termination signal would not work anyway, if you run for another 30+s. Not an issue with the agent image.
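
For what it's worth, one way to express "not eligible for deletion" for voluntary disruptions such as `kubectl drain` is a PodDisruptionBudget selecting the agent pods. This is only a hedged sketch of the Kubernetes side, not something the plugin does for you: the jenkins/critical-build label is hypothetical and would have to be added to the pod template, and because agent pods are bare pods (no controller), the PDB has to use an integer minAvailable rather than maxUnavailable or a percentage.

```yaml
# Hypothetical PodDisruptionBudget: `kubectl drain` uses the eviction API,
# which refuses to evict a pod whose PodDisruptionBudget would be violated.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: jenkins-critical-builds
spec:
  minAvailable: 1   # bare pods only support an integer minAvailable
  selector:
    matchLabels:
      jenkins/critical-build: "true"   # hypothetical label on the agent pod template
```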

I've created a topic on the community forum (link below). I would appreciate it if you took a look at the example, which gives more information.

https://community.jenkins.io/t/jenkins-agent-should-wait-for-pipeline-to-complete-on-sigterm/10434