Insufficient regional quota to satisfy request; Katib jobs are blocked.
Jeffwan opened this issue · 24 comments
Prow status: https://prow.k8s.io/?repo=kubeflow%2Fkatib
ERROR: (gcloud.beta.container.clusters.create) ResponseError: code=403, message=Insufficient regional quota to satisfy request: resource "CPUS": request requires
'48.0' and is short '40.0'. project has a quota of '500.0' with '8.0' available. View and manage quotas at
https://console.cloud.google.com/iam-admin/quotas?usage=USED&project=kubeflow-ci.
Reported by @andreyvelich
Issue-Label Bot is automatically applying the labels:
Label | Probability |
---|---|
area/katib | 0.77 |
kind/bug | 0.89 |
Please mark this comment with 👍 or 👎 to give our bot feedback!
Links: app homepage, dashboard and code for this bot.
Issue-Label Bot is automatically applying the labels:
Label | Probability |
---|---|
area/engprod | 0.54 |
Please mark this comment with 👍 or 👎 to give our bot feedback!
Links: app homepage, dashboard and code for this bot.
It seems the CPU quota is almost exhausted, which matches the error message.
gcloud compute regions describe us-east1 --project=kubeflow-ci
creationTimestamp: '1969-12-31T16:00:00.000-08:00'
description: us-east1
id: '1230'
kind: compute#region
name: us-east1
quotas:
- limit: 500.0
metric: CPUS
usage: 484.0
- limit: 200000.0
metric: DISKS_TOTAL_GB
usage: 13886.0
....
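For reference, a quick way to see how much CPU quota is left per region is something like the following (a minimal sketch; the region list is an assumption and jq is assumed to be installed):

# Sketch: print CPU quota usage for a couple of kubeflow-ci regions (region list is an assumption).
for region in us-east1 us-central1; do
  echo "== ${region} =="
  gcloud compute regions describe "${region}" --project=kubeflow-ci --format=json \
    | jq -r '.quotas[] | select(.metric == "CPUS") | "CPUS: \(.usage) / \(.limit)"'
done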
@Jeffwan Thank you for creating the issue.
Can we check which pods are currently running in the cluster and using CPUs?
I kicked off a one-off job to clean up deployments and release some resources.
I will try to re-run tests.
There are a few clusters under this project, and I am trying to clean up the kubeflow-ci cluster first; all workflows run there. It doesn't seem we have a group-by <cluster, CPU> utility (a rough sketch of summing requests per namespace follows). Once the cleanup is done, I can check the total CPU usage at the project level again.
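Something along these lines could give a rough per-namespace view of CPU requests (a sketch, not an existing util; it assumes jq and awk are available and only counts requests, not limits):

# Sketch: sum CPU requests of all pods, grouped by namespace, to spot the heavy consumers.
kubectl get pods --all-namespaces -o json \
  | jq -r '.items[] | .metadata.namespace as $ns
      | .spec.containers[]
      | [$ns, (.resources.requests.cpu // "0")] | @tsv' \
  | awk '{ cpu=$2; if (cpu ~ /m$/) { sub(/m$/, "", cpu); cpu/=1000 } sum[$1]+=cpu }
     END { for (ns in sum) printf "%-55s %.2f CPUs\n", ns, sum[ns] }'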
kubeflow-periodic-0-3-branch-tf-serving-3367-e0d1 Active 486d
kubeflow-periodic-0-5-branch-tf-serving-3856-4443 Active 335d
kubeflow-periodic-master-deployapp-878-a4a0 Active 536d
kubeflow-periodic-master-tf-serving-353-b738 Active 623d
kubeflow-periodic-master-tf-serving-721-fd6b Active 562d
kubeflow-periodic-master-tf-serving-913-2737 Active 530d
kubeflow-periodic-release-branch-tf-serving-227-9734 Active 617d
kubeflow-presubmit-deployapp-1817-e39b1d3-3922-6f80 Active 672d
kubeflow-presubmit-deployapp-1817-f1c14ea-3928-4442 Active 672d
kubeflow-presubmit-tf-serving-2338-9316696-5673-b788 Active 554d
kubeflow-presubmit-tf-serving-2449-e9ea4dd-5627-7ec9 Active 555d
kubeflow-presubmit-tf-serving-2474-1720719-5700-fc62 Active 553d
kubeflow-presubmit-tf-serving-2784-3488c99-6610-fdef Active 516d
kubeflow-presubmit-tf-serving-2991-7732038-6736-f518 Active 497d
kubeflow-presubmit-tf-serving-3464-2de5dd8-6288-31d2 Active 433d
kubeflow-presubmit-tf-serving-3464-7c4ef28-7168-3901 Active 433d
kubeflow-presubmit-tf-serving-3464-9165fce-1152-5c01 Active 432d
kubeflow-presubmit-tf-serving-3464-9165fce-2688-141d Active 433d
kubeflow-presubmit-tf-serving-3464-9165fce-8256-a10f Active 432d
I think these are leaked resources; I will delete them as well (a clean-up sketch follows the output below).
k get all -n kubeflow-presubmit-tf-serving-2338-9316696-5673-b788
NAME READY STATUS RESTARTS AGE
pod/mnist-cpu-bc4ddfd96-ssmtv 1/1 Running 33 512d
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/mnist-cpu ClusterIP 10.39.255.61 <none> 9000/TCP,8500/TCP 554d
service/mnist-gpu ClusterIP 10.39.244.227 <none> 9000/TCP,8500/TCP 554d
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/mnist-cpu 1/1 1 1 554d
NAME DESIRED CURRENT READY AGE
replicaset.apps/mnist-cpu-bc4ddfd96 1 1 1 554d
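A minimal sketch of how these leaked namespaces could be cleaned up once confirmed (the name pattern is taken from the listing above; the echo keeps it a dry run until the list looks right):

# Sketch only: print the delete commands for candidate leaked namespaces; drop "echo" to actually delete.
kubectl get namespaces -o name \
  | grep -E 'kubeflow-(presubmit|periodic)-.*(tf-serving|deployapp)' \
  | xargs -n1 echo kubectl delete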
Do we have any idea how this pod was deployed?
It has been running for 512 days, which is strange.
Not sure, it was a long time ago. :D The kubeflow-testing cluster doesn't have a resource leak; in total it uses fewer than 100 CPUs (only 7 nodes; I summed all the requests). It could be other clusters; I need some time to figure it out.
kubeflow-testing 2018-03-29T17:46:26+00:00 us-east1-d RUNNING
kf-vmaster-n00 2019-04-02T12:15:07+00:00 us-east1-b RUNNING
kf-ci-v1 2020-02-03T23:14:27+00:00 us-east1-d RUNNING
fairing-ci 2020-03-09T17:08:54+00:00 us-central1-a RUNNING
ztor-presubmit-v1-1150-21e7089-1683-8f6b 2020-04-08T12:34:44+00:00 us-east1-d RUNNING
kf-ci-management 2020-04-28T21:22:24+00:00 us-central1 RUNNING
ztor-presubmit-v1-1175-2f86c79-0370-4672 2020-06-26T03:42:13+00:00 us-east1-d RUNNING
zmit-e2e-v1alpha3-1235-c772f95-9616-2f11 2020-06-28T02:52:57+00:00 us-east1-d RUNNING
ztor-presubmit-v1-1171-232e11d-6400-180a 2020-07-24T14:17:23+00:00 us-east1-d RUNNING
ztor-presubmit-v1-1171-232e11d-2272-dcce 2020-07-25T01:20:32+00:00 us-east1-d RUNNING
ztor-presubmit-v1-1171-232e11d-1840-a12b 2020-07-25T11:21:13+00:00 us-east1-d RUNNING
ztor-presubmit-v1-1171-232e11d-7808-9113 2020-07-26T15:35:59+00:00 us-east1-d RUNNING
ztor-presubmit-v1-1171-232e11d-2752-cd98 2020-07-27T06:11:29+00:00 us-east1-d RUNNING
ztor-presubmit-v1-1171-232e11d-8096-3703 2020-07-29T08:55:21+00:00 us-east1-d RUNNING
ztor-presubmit-v1-1171-232e11d-7216-b6d9 2020-08-05T09:20:25+00:00 us-east1-d RUNNING
ztor-presubmit-v1-1175-56c27aa-5600-5bc3 2020-08-11T03:50:57+00:00 us-east1-d RUNNING
zbmit-e2e-v1beta1-1303-30e3e23-2896-853c 2020-08-19T20:29:44+00:00 us-east1-d RUNNING
zbmit-e2e-v1beta1-1305-9179667-8816-238b 2020-08-19T20:51:50+00:00 us-east1-d RUNNING
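To spot stale clusters like these, a sorted listing per project can help (a sketch; a --filter on createTime could additionally narrow it to old clusters):

# Sketch: list GKE clusters in kubeflow-ci sorted by creation time.
gcloud container clusters list --project=kubeflow-ci \
  --sort-by=createTime \
  --format="table(name, createTime, location, status)"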
It looks like some pretty large clusters are still being created inside the kubeflow-ci project for individual presubmits.
e.g. zbmit-e2e-v1beta1-1305-9179667-8816-238b was created on 08-19. I'm not sure which presubmit this is coming from (maybe Katib).
Some background:
Originally all of our test infrastructure ran in the kubeflow-ci project. This included ephemeral infrastructure such as GKE clusters spun up for the lifetime of the tests.
To enable better management of ephemeral infrastructure we started moving ephemeral clusters into separate projects, e.g. kubeflow-ci-deployments. The thinking was that this would make it easier to deal with resource leaks because everything in the ephemeral project could just be deleted.
Not all of the tests have been migrated to this model.
To remediate this I'm going to disable the ability of tests to create infrastructure in the kubeflow-ci project. This will break any tests that are still doing that and act as a forcing function for them to fix things.
I'm removing a bunch of permissions from the kubeflow-testing@kubeflow-ci.iam.gserviceaccount.com GSA. Notably:
- Kubernetes Engine Admin
I'm attaching the full policy before the modifications:
kubeflow-ci.policy.iam.txt
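For reference, removing a binding such as Kubernetes Engine Admin from the GSA looks roughly like this (a sketch; assumes the caller has IAM admin rights on the project):

# Sketch: remove the Kubernetes Engine Admin role (roles/container.admin) from the testing GSA.
gcloud projects remove-iam-policy-binding kubeflow-ci \
  --member="serviceAccount:kubeflow-testing@kubeflow-ci.iam.gserviceaccount.com" \
  --role="roles/container.admin"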
I deleted all the ephemeral clusters. This should free up significant CPU.
Tests that were using kubeflow-ci for ephemeral infrastructure will need to migrate to creating ephemeral infra in different projects. My initial guess is that this primarily impacts Katib (@andreyvelich @johnugeorge ).
Each WG should probably use its own GCP project for this to allow better isolation and quota management.
WG"s can projects using GitOPS by creating the project here:
https://github.com/kubeflow/community-infra/tree/master/prod
As part of #737 it would be nice to document this so that other WGs could follow a similar approach.
Related to: #650 - Organize projects into folder.
@kubeflow/kfserving-owners @andreyvelich @gaocegege @johnugeorge @Bobgy @rmgogogo @terrytangyuan see the previous comment, as it's possible tests for your WG were impacted.
It looks like some pretty large clusters are still being created inside the kubeflow-ci project for individual presubmits.
e.g. zbmit-e2e-v1beta1-1305-9179667-8816-238b was created on 08-19. I'm not sure which presubmit this is coming from (maybe Katib).
In the Katib test infra we create an individual cluster for each presubmit; after that we clean up this cluster under the kubeflow-ci project.
For example, one of our workflow: http://testing-argo.kubeflow.org/workflows/kubeflow-test-infra/kubeflow-katib-presubmit-e2e-v1beta1-1299-b2d713c-3236-d712?tab=workflow.
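For reference, the create/clean-up pattern described there is roughly the following (a sketch, not the actual workflow step; the cluster naming scheme and zone are placeholders):

# Sketch of the per-presubmit pattern: create an ephemeral cluster, run the e2e tests, delete it.
CLUSTER_NAME="katib-presubmit-${PULL_NUMBER}-${BUILD_ID}"   # placeholder naming scheme
gcloud container clusters create "${CLUSTER_NAME}" --project=kubeflow-ci --zone=us-east1-d
# ... run the e2e tests against the new cluster ...
gcloud container clusters delete "${CLUSTER_NAME}" --project=kubeflow-ci --zone=us-east1-d --quiet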
Some background:
Originally all of our test infrastructure ran in the kubeflow-ci project. This included ephemeral infrastructure such as GKE clusters spun up for the lifetime of the tests. To enable better management of ephemeral infrastructure we started moving ephemeral clusters into separate projects, e.g. kubeflow-ci-deployments. The thinking was that this would make it easier to deal with resource leaks because everything in the ephemeral project could just be deleted.
Not all of the tests have been migrated to this model.
I am fine with migrating the Katib test infra to an independent GCP project. What do you think @gaocegege @johnugeorge ?
To remediate this I'm going to disable the ability of tests to create infrastructure in the kubeflow-ci project. This will break any tests that are still doing that and act as a forcing function for them to fix things. I'm removing a bunch of permissions from the kubeflow-testing@kubeflow-ci.iam.gserviceaccount.com GSA. Notably
Does it affect the TF or PyTorch operators' test infra @johnugeorge @terrytangyuan @Jeffwan ?
Yes, it should be the same case with the operators as well.
@jlewi this is affecting KFServing CI and currently blocking multiple PRs. I have created PR kubeflow/community-infra#10 to set up a GCP project for KFServing, please help review, thanks!
/priority p0
I am fine with migrating the Katib test infra to an independent GCP project. What do you think @gaocegege @johnugeorge ?
SGTM.
I filed a PR for the training projects: kubeflow/community-infra#13
As I mentioned in yesterday's community meeting, another remediation would be to revert the IAM policy changes in
#749 (comment)
to grant an extension to the existing projects (e.g. Katib and KFServing) so that they can continue to create clusters in kubeflow-ci until they have successfully set up and migrated to WG-specific projects.
ci-team@kubeflow.org should have sufficient privileges to do this, so anyone in
https://github.com/kubeflow/internal-acls/blob/master/ci-team.members.txt
could do this.
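Reverting would essentially mean re-adding the binding that was removed earlier, e.g. (a sketch):

# Sketch: temporarily restore the Kubernetes Engine Admin binding for the testing GSA.
gcloud projects add-iam-policy-binding kubeflow-ci \
  --member="serviceAccount:kubeflow-testing@kubeflow-ci.iam.gserviceaccount.com" \
  --role="roles/container.admin"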
It looks like the membership of ci-team@ is outdated; a bunch of those folks are likely no longer active in the project. It might make sense to replace them with members from the respective WGs that depend on kubeflow-ci so that they can help administer and contribute.
cc @kubeflow/wg-automl-leads @kubeflow/wg-serving-leads @kubeflow/wg-training-leads
@jlewi do you have a view on which of the @google.com addresses are still active here?
https://github.com/kubeflow/internal-acls/blob/master/ci-team.members.txt
Additionally, I don't recognize who else is active from these names:
kam.d.kasravi@intel.com [Don't think he is active anymore]
hnalla@redhat.com [Who?]
scottleehello@gmail.com [Who?]
@animeshsingh you should be able to use history to see who committed the changes
@scottilee = scottleehello@
@harshad16 = harshad@
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in one week if no further activity occurs. Thank you for your contributions.