spotify/flink-on-k8s-operator

Job cancel is buggy

live-wire opened this issue · 0 comments

Job-cancel command does not work as expected in some cases.
Some cases observed:

  • If an unavailable Flink image is used, the job submitter pod never starts and stays stuck in a Pending state. This leads to a dangerous situation: the Kubernetes cluster cannot scale down because of that one unschedulable pod (see the sketch after this list).
  • The job finished (succeeded), but the cluster isn't torn down: the JobManager and TaskManagers keep running.
  • With a pod disruption budget in place, these unhealthy, should-have-been-killed clusters can also block node upgrades and similar operations.

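As a rough illustration of the first case, a check along these lines (a minimal client-go sketch, not the operator's actual code) could let the operator notice a submitter pod that will never start because its image cannot be pulled, and fail the job early instead of leaving the pod stuck:

```go
package podcheck

import (
	corev1 "k8s.io/api/core/v1"
)

// stuckOnImagePull reports whether a pod is Pending because its image cannot
// be pulled (ErrImagePull / ImagePullBackOff). An operator could use a check
// like this to fail the job submitter fast instead of leaving the pod stuck.
func stuckOnImagePull(pod *corev1.Pod) bool {
	if pod.Status.Phase != corev1.PodPending {
		return false
	}
	statuses := append(pod.Status.InitContainerStatuses, pod.Status.ContainerStatuses...)
	for _, cs := range statuses {
		if w := cs.State.Waiting; w != nil &&
			(w.Reason == "ErrImagePull" || w.Reason == "ImagePullBackOff") {
			return true
		}
	}
	return false
}
```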
When a job is cancelled, the operator should simply kill all of the cluster's pods within a reasonable amount of time.
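For example, a minimal client-go sketch of that behavior might look like the following. The `cluster` label key, namespace, cluster name, and five-minute deadline are assumptions for illustration, not the operator's actual implementation:

```go
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// forceDeleteClusterPods removes every pod belonging to the given Flink cluster,
// regardless of pod phase, so that Pending submitter pods and leftover
// JobManager/TaskManager pods do not linger after a cancel.
func forceDeleteClusterPods(ctx context.Context, client kubernetes.Interface, namespace, clusterName string) error {
	// Assumed label key; substitute whatever label the operator puts on cluster pods.
	selector := fmt.Sprintf("cluster=%s", clusterName)

	grace := int64(0) // skip the graceful shutdown period; these pods are already doomed
	return client.CoreV1().Pods(namespace).DeleteCollection(ctx,
		metav1.DeleteOptions{GracePeriodSeconds: &grace},
		metav1.ListOptions{LabelSelector: selector},
	)
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Give the normal cancel path a bounded window, then clean up unconditionally.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
	defer cancel()

	if err := forceDeleteClusterPods(ctx, client, "default", "my-flink-cluster"); err != nil {
		panic(err)
	}
}
```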