kubeflow/spark-operator

[FEATURE] prevent driver pod from being deleted before its status is processed by the operator

hguo25 opened this issue · 2 comments

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

What is the outcome that you are trying to reach?

Other services may delete a finished pod, for example as part of garbage collection. This can happen immediately after the job finishes, before the operator has had a chance to process the driver pod update and transition the application status, so the final status is not recorded correctly. We would like to prevent such premature deletion.

Describe the solution you would like

Attach a finalizer to the driver pods created by the operator, and remove it only once the operator has transitioned the application to a terminal state.
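
To make the proposal concrete, here is a minimal sketch of the finalizer bookkeeping, written with plain Go types rather than the real Kubernetes API objects. The finalizer name `driverFinalizer`, the `Pod` struct, the simplified `AppState` values, and all helper names are hypothetical and chosen only for illustration; they are not taken from the operator's code.

```go
package main

// driverFinalizer is a hypothetical finalizer name used only in this sketch.
const driverFinalizer = "sparkoperator.example.com/driver-status-processed"

// Pod models only the metadata this sketch needs (assumption: the real
// object would be a Kubernetes Pod with metadata.finalizers).
type Pod struct {
	Name       string
	Finalizers []string
}

// AppState is a simplified application state; the real state machine has more states.
type AppState string

const (
	StateRunning   AppState = "RUNNING"
	StateCompleted AppState = "COMPLETED"
	StateFailed    AppState = "FAILED"
)

// isTerminal reports whether the application has reached a final state.
func isTerminal(s AppState) bool {
	return s == StateCompleted || s == StateFailed
}

// hasFinalizer reports whether the pod carries the named finalizer.
func hasFinalizer(p *Pod, name string) bool {
	for _, f := range p.Finalizers {
		if f == name {
			return true
		}
	}
	return false
}

// addFinalizer adds the finalizer if it is missing and reports whether the pod changed.
func addFinalizer(p *Pod, name string) bool {
	if hasFinalizer(p, name) {
		return false
	}
	p.Finalizers = append(p.Finalizers, name)
	return true
}

// removeFinalizer drops the finalizer if present and reports whether the pod changed.
func removeFinalizer(p *Pod, name string) bool {
	kept := p.Finalizers[:0]
	removed := false
	for _, f := range p.Finalizers {
		if f == name {
			removed = true
			continue
		}
		kept = append(kept, f)
	}
	p.Finalizers = kept
	return removed
}

// releaseDriverIfTerminal removes the finalizer only once the application
// state is terminal, and reports whether the pod was released.
func releaseDriverIfTerminal(p *Pod, s AppState) bool {
	if !isTerminal(s) {
		return false
	}
	return removeFinalizer(p, driverFinalizer)
}
```

In this sketch, `addFinalizer(pod, driverFinalizer)` would be called when the driver pod is created, and `releaseDriverIfTerminal` after each state transition. In a real controller, controller-runtime's `controllerutil.AddFinalizer` and `controllerutil.RemoveFinalizer` provide similar helpers for API objects, followed by an update or patch call to persist the change.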

Describe alternatives you have considered

Additional context

before the operator has a chance to process the driver update and transition the application status, which results in the final status not being correctly recorded

Hi @hguo25, this sounds like an interesting issue. Can you give more detail about the situation you encountered?

We noticed that garbage collection can remove the driver pod almost immediately after it finishes.
The sparkPodEventHandler may have enqueued the application for processing on the status update event, but the application has not yet been dequeued and processed.
When a worker thread finally dequeues the application, its driver pod has already been removed, so the state machine transitions the application to failed because the driver pod cannot be found.
To fix this, we can add a finalizer to the driver pod at creation, to be removed by the operator only when the application transitions to a terminal state. This prevents garbage collection from deleting the pod before the operator has processed it.
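
The race can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: `Store` stands in for the API server, `Deleting` stands in for a set `metadata.deletionTimestamp`, `reconcile` is a hypothetical stand-in for the operator's state machine, and the operator's finalizer is assumed to be the only one on the pod. None of these names come from the operator's code.

```go
package main

// Pod models only the fields relevant to the race (hypothetical type).
type Pod struct {
	Phase      string   // pod phase, e.g. "Succeeded" or "Failed"
	Finalizers []string // assumed to hold at most the operator's finalizer
	Deleting   bool     // stands in for a set metadata.deletionTimestamp
}

// Store is a toy replacement for the API server, keyed by pod name.
type Store map[string]*Pod

// Delete mimics Kubernetes deletion semantics: an object that still has
// finalizers is only marked for deletion and stays visible.
func (s Store) Delete(name string) {
	p, ok := s[name]
	if !ok {
		return
	}
	if len(p.Finalizers) > 0 {
		p.Deleting = true
		return
	}
	delete(s, name)
}

// reconcile is a simplified stand-in for the worker that dequeues an
// application and derives its state from the driver pod.
func reconcile(s Store, driver string) string {
	p, ok := s[driver]
	if !ok {
		// Driver pod not found: the application is marked failed.
		return "FAILED"
	}
	state := "RUNNING"
	switch p.Phase {
	case "Succeeded":
		state = "COMPLETED"
	case "Failed":
		state = "FAILED"
	}
	if state != "RUNNING" {
		// Terminal state recorded: release the pod so deletion can proceed.
		p.Finalizers = nil
		if p.Deleting {
			delete(s, driver)
		}
	}
	return state
}
```

Without a finalizer, calling `Delete` before `reconcile` on a succeeded driver yields `FAILED`, since the pod is already gone. With a finalizer, the same sequence yields `COMPLETED`, and the pod is removed only after the state has been recorded.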