How to restore the cortex operator when too many jobs are requested
nellaG opened this issue · 1 comment
Hello.
I'm currently using cortex 0.40.0.
Sometimes I request thousands of jobs to a certain cortex API by mistake.
When that happens, I can't use the cortex CLI properly (responses are very slow, or just hang), and I guess the cortex operator is overloaded because of me
(the `operator-controller-manager` pod continuously goes to OOMKilled -> CrashLoopBackOff).
To resolve this, I have attempted the following so far, but it didn't work:

- Deleting the thousands of AWS SQS queues
- Deleting all of the enqueuer jobs and worker jobs created by mistake
- Deleting the cortex API in question and re-deploying it

In the end, I just took the cluster down and brought it back up (+ re-deployed all of the APIs) to get cortex working again.
If this happens again, what should I do to restore cortex without taking the cluster down and up?
I would be glad for your support. Thank you so much.
The `operator-controller-manager` is responsible for the cleanup of all the resources, so if it starts failing, a lot of manual intervention is required.
If the `operator-controller-manager` is getting OOMKilled, the first thing I would try is to increase its memory limits.
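Something along these lines should work; the namespace and memory values below are just placeholders (assumptions), so first check where the deployment actually lives in your cluster:

```bash
# Find the operator deployment; the namespace may differ in your cluster.
kubectl get deployments -A | grep operator-controller-manager

# Raise the memory request/limit (values here are only examples).
kubectl -n default set resources deployment operator-controller-manager \
  --requests=memory=512Mi --limits=memory=1Gi
```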
If that doesn't work, there are ways to "fix" that weird state, but they still require a lot of manual intervention, or eventually an automated script.
When you create a BatchAPI job, this happens:

- A `BatchJob` kubernetes resource is created
- The `operator-controller-manager` creates / updates / deletes the required resources referring to that `BatchJob` resource
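You can inspect those resources directly with kubectl; I'm assuming here that the CRD's plural name is `batchjobs`, so verify that first:

```bash
# Confirm how the BatchJob CRD is registered in your cluster.
kubectl get crds | grep -i batchjob

# List the BatchJob resources the operator is (trying to) reconcile.
kubectl get batchjobs --all-namespaces
```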
In order to fix that weird state you have to do the following (rough sketch of these steps below):

- Delete the created `BatchJob` resources from the cluster using `kubectl delete` with the `--force` flag
- Delete all the created SQS queues manually or with a script
- Delete S3 resources that might have been created for that `BatchJob` resource
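A minimal cleanup sketch, assuming the `default` namespace and placeholder names for the queue prefix, bucket, and key prefix; these are assumptions, not cortex's exact naming conventions, so adjust all of them to whatever your cluster actually uses:

```bash
#!/usr/bin/env bash
set -euo pipefail

NAMESPACE="default"            # where the BatchJob resources live (assumption)
QUEUE_PREFIX="my-queue-prefix" # prefix of the SQS queues created for the jobs (assumption)
BUCKET="my-cortex-bucket"      # cortex's S3 bucket (assumption)
JOBS_PREFIX="jobs/"            # key prefix of the job artifacts in that bucket (assumption)

# 1. Force-delete the stuck BatchJob resources.
kubectl -n "$NAMESPACE" delete batchjobs --all --force --grace-period=0

# 2. Delete the SQS queues created for those jobs (the AWS CLI paginates automatically).
queue_urls=$(aws sqs list-queues --queue-name-prefix "$QUEUE_PREFIX" \
  --query 'QueueUrls[]' --output text)
for url in $queue_urls; do
  [ "$url" = "None" ] && continue
  aws sqs delete-queue --queue-url "$url"
done

# 3. Remove the S3 objects written for those jobs.
aws s3 rm "s3://${BUCKET}/${JOBS_PREFIX}" --recursive
```

After that, the operator should have far fewer resources to reconcile when it restarts, so it has a much better chance of coming back up cleanly.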