kubeflow/testing

Insufficient regional quota to satisfy request and katib job is blocked.

Jeffwan opened this issue · 24 comments

Prow status: https://prow.k8s.io/?repo=kubeflow%2Fkatib

ERROR: (gcloud.beta.container.clusters.create) ResponseError: code=403, message=Insufficient regional quota to satisfy request: resource "CPUS": request requires
'48.0' and is short '40.0'. project has a quota of '500.0' with '8.0' available. View and manage quotas at https://console.cloud.google.com/iam-admin/quotas?usage
=USED&project=kubeflow-ci.

Reported by @andreyvelich

Issue-Label Bot is automatically applying the labels:

Label Probability
area/katib 0.77
kind/bug 0.89

Please mark this comment with 👍 or 👎 to give our bot feedback!
Links: app homepage, dashboard and code for this bot.

Issue-Label Bot is automatically applying the labels:

Label Probability
area/engprod 0.54

Please mark this comment with 👍 or 👎 to give our bot feedback!
Links: app homepage, dashboard and code for this bot.

Seems the CPU quota is almost used up, which matches the error message.

gcloud compute regions describe us-east1 --project=kubeflow-ci         
creationTimestamp: '1969-12-31T16:00:00.000-08:00'
description: us-east1
id: '1230'
kind: compute#region
name: us-east1
quotas:
- limit: 500.0
  metric: CPUS
  usage: 484.0
- limit: 200000.0
  metric: DISKS_TOTAL_GB
  usage: 13886.0
....

@Jeffwan Thank you for creating the issue.
Can we check which pods are currently running in the cluster and using CPUs?
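For example, something along these lines could list the CPU requests per pod (a rough sketch; it assumes kubectl is pointed at the kubeflow-testing cluster and shows requests, not live usage):

# List every pod with its namespace and CPU request
kubectl get pods --all-namespaces \
  -o custom-columns='NAMESPACE:.metadata.namespace,POD:.metadata.name,CPU_REQ:.spec.containers[*].resources.requests.cpu'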

I kicked off a one-off job to clean up deployments and release some resources.

I will try to re-run tests.

@andreyvelich

There are a few clusters under this project and I am trying to clean up the kubeflow-ci cluster first. All workflows are running there. It doesn't seem we have a group-by <cluster, cpu> utility. Once the cleanup is done, I can check the total CPU usage at the project level again.

kubeflow-periodic-0-3-branch-tf-serving-3367-e0d1      Active   486d
kubeflow-periodic-0-5-branch-tf-serving-3856-4443      Active   335d
kubeflow-periodic-master-deployapp-878-a4a0            Active   536d
kubeflow-periodic-master-tf-serving-353-b738           Active   623d
kubeflow-periodic-master-tf-serving-721-fd6b           Active   562d
kubeflow-periodic-master-tf-serving-913-2737           Active   530d
kubeflow-periodic-release-branch-tf-serving-227-9734   Active   617d
kubeflow-presubmit-deployapp-1817-e39b1d3-3922-6f80    Active   672d
kubeflow-presubmit-deployapp-1817-f1c14ea-3928-4442    Active   672d
kubeflow-presubmit-tf-serving-2338-9316696-5673-b788   Active   554d
kubeflow-presubmit-tf-serving-2449-e9ea4dd-5627-7ec9   Active   555d
kubeflow-presubmit-tf-serving-2474-1720719-5700-fc62   Active   553d
kubeflow-presubmit-tf-serving-2784-3488c99-6610-fdef   Active   516d
kubeflow-presubmit-tf-serving-2991-7732038-6736-f518   Active   497d
kubeflow-presubmit-tf-serving-3464-2de5dd8-6288-31d2   Active   433d
kubeflow-presubmit-tf-serving-3464-7c4ef28-7168-3901   Active   433d
kubeflow-presubmit-tf-serving-3464-9165fce-1152-5c01   Active   432d
kubeflow-presubmit-tf-serving-3464-9165fce-2688-141d   Active   433d
kubeflow-presubmit-tf-serving-3464-9165fce-8256-a10f   Active   432d

I think these are leaked resources; I will delete them as well (see the cleanup sketch after the output below).

k get all -n kubeflow-presubmit-tf-serving-2338-9316696-5673-b788
NAME                            READY   STATUS    RESTARTS   AGE
pod/mnist-cpu-bc4ddfd96-ssmtv   1/1     Running   33         512d

NAME                TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)             AGE
service/mnist-cpu   ClusterIP   10.39.255.61    <none>        9000/TCP,8500/TCP   554d
service/mnist-gpu   ClusterIP   10.39.244.227   <none>        9000/TCP,8500/TCP   554d

NAME                        READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/mnist-cpu   1/1     1            1           554d

NAME                                  DESIRED   CURRENT   READY   AGE
replicaset.apps/mnist-cpu-bc4ddfd96   1         1         1       554d
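For reference, the cleanup itself is just a namespace deletion per entry (a sketch using one of the namespaces listed above; deleting a namespace removes everything inside it, so double-check the list first):

# Delete one leaked namespace and all workloads in it; repeat for the other
# kubeflow-periodic-*/kubeflow-presubmit-* namespaces listed earlier
kubectl delete namespace kubeflow-presubmit-tf-serving-2338-9316696-5673-b788 --wait=false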

Do we have any idea how this pod was deployed?
It has been running for 512d, which is strange.

Not sure, it was a long time ago. :D The kubeflow-testing cluster doesn't have resource leaks. In total, it uses less than 100 CPUs (only 7 nodes; I summed all the requests). It could be other clusters; I need some time to figure it out.

kubeflow-testing                          2018-03-29T17:46:26+00:00  us-east1-d     RUNNING
kf-vmaster-n00                            2019-04-02T12:15:07+00:00  us-east1-b     RUNNING
kf-ci-v1                                  2020-02-03T23:14:27+00:00  us-east1-d     RUNNING
fairing-ci                                2020-03-09T17:08:54+00:00  us-central1-a  RUNNING
ztor-presubmit-v1-1150-21e7089-1683-8f6b  2020-04-08T12:34:44+00:00  us-east1-d     RUNNING
kf-ci-management                          2020-04-28T21:22:24+00:00  us-central1    RUNNING
ztor-presubmit-v1-1175-2f86c79-0370-4672  2020-06-26T03:42:13+00:00  us-east1-d     RUNNING
zmit-e2e-v1alpha3-1235-c772f95-9616-2f11  2020-06-28T02:52:57+00:00  us-east1-d     RUNNING
ztor-presubmit-v1-1171-232e11d-6400-180a  2020-07-24T14:17:23+00:00  us-east1-d     RUNNING
ztor-presubmit-v1-1171-232e11d-2272-dcce  2020-07-25T01:20:32+00:00  us-east1-d     RUNNING
ztor-presubmit-v1-1171-232e11d-1840-a12b  2020-07-25T11:21:13+00:00  us-east1-d     RUNNING
ztor-presubmit-v1-1171-232e11d-7808-9113  2020-07-26T15:35:59+00:00  us-east1-d     RUNNING
ztor-presubmit-v1-1171-232e11d-2752-cd98  2020-07-27T06:11:29+00:00  us-east1-d     RUNNING
ztor-presubmit-v1-1171-232e11d-8096-3703  2020-07-29T08:55:21+00:00  us-east1-d     RUNNING
ztor-presubmit-v1-1171-232e11d-7216-b6d9  2020-08-05T09:20:25+00:00  us-east1-d     RUNNING
ztor-presubmit-v1-1175-56c27aa-5600-5bc3  2020-08-11T03:50:57+00:00  us-east1-d     RUNNING
zbmit-e2e-v1beta1-1303-30e3e23-2896-853c  2020-08-19T20:29:44+00:00  us-east1-d     RUNNING
zbmit-e2e-v1beta1-1305-9179667-8816-238b  2020-08-19T20:51:50+00:00  us-east1-d     RUNNING

Hmm. Seems the kf-ci-v1, fairing-ci, and kubeflow-testing usage is reasonable. @jinchihe @jlewi Any other clues?
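A rough per-cluster breakdown can help here (a sketch, since we don't have a proper group-by <cluster, cpu> utility; it assumes viewer access on kubeflow-ci and only shows node counts and machine types, from which CPU usage can be estimated):

# Show node count and machine type per cluster in the project
gcloud container clusters list --project=kubeflow-ci \
  --format='table(name,location,currentNodeCount,nodeConfig.machineType)'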

jlewi commented

It looks like some pretty large clusters are still being created inside project: kubeflow-ci for individual presubmits.

e.g. zbmit-e2e-v1beta1-1305-9179667-8816-238b was created at 08-19. I'm not sure which presubmit this is coming from (maybe Katib).

Some background:
Originally all of our test infrastructure ran in project kubeflow-ci. This included ephemeral infrastructure such as GKE clusters spun up for the lifetime of the tests.

To enable better management of ephemeral infrastructure we started moving ephemeral clusters into separate projects e.g. kubeflow-ci-deployments. The thinking was this would make it easier to deal with resource leaks because everything in the ephemeral project could just be deleted.

Not all of the tests have been migrated to this model.

To remediate this I'm going to disable the ability of tests to create infrastructure in project kubeflow-ci. This will break any tests that are still doing that, acting as a forcing function for them to fix things.

I'm removing a bunch of permissions from the kubeflow-testing@kubeflow-ci.iam.gserviceaccount.com GSA. Notably:

  • Kubernetes Engine Admin
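For reference, revoking that role amounts to something like the following (a sketch; roles/container.admin is the IAM role name behind "Kubernetes Engine Admin"):

# Remove Kubernetes Engine Admin from the test GSA so it can no longer create clusters in kubeflow-ci
gcloud projects remove-iam-policy-binding kubeflow-ci \
  --member='serviceAccount:kubeflow-testing@kubeflow-ci.iam.gserviceaccount.com' \
  --role='roles/container.admin'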

I'm attaching the full policy before the modifications
kubeflow-ci.policy.iam.txt

I deleted all the ephemeral clusters. This should free up significant CPU.
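Each leaked cluster can be removed with something like this (a sketch; the names in the listing above look truncated, so verify the full cluster name and zone first):

# Delete one ephemeral e2e cluster; repeat for the others
gcloud container clusters delete zbmit-e2e-v1beta1-1305-9179667-8816-238b \
  --zone us-east1-d --project kubeflow-ci --quiet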

Tests that were using kubeflow-ci for ephemeral infrastructure will need to migrate to creating that infra in different projects. My initial guess is that this primarily impacts Katib (@andreyvelich @johnugeorge ).

Each WG should probably use its own GCP project for this to allow better isolation and quota management.

WG"s can projects using GitOPS by creating the project here:
https://github.com/kubeflow/community-infra/tree/master/prod

As part of #737 it would be nice to document this so that other WGs could follow a similar approach.

Related to: #650 - Organize projects into folder.

jlewi commented

@kubeflow/kfserving-owners @andreyvelich @gaocegege @johnugeorge @Bobgy @rmgogogo @terrytangyuan see the previous comment as it's possible tests for your WG were impacted.

It looks like some pretty large clusters are still being created inside project: kubeflow-ci for individual presubmits.

e.g. zbmit-e2e-v1beta1-1305-9179667-8816-238b was created at 08-19. I'm not sure which presubmit this is coming from (maybe Katib).

In the Katib test infra we create an individual cluster for each presubmit; after that we clean up this cluster under the kubeflow-ci project.
For example, one of our workflows: http://testing-argo.kubeflow.org/workflows/kubeflow-test-infra/kubeflow-katib-presubmit-e2e-v1beta1-1299-b2d713c-3236-d712?tab=workflow.
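Roughly, the per-presubmit pattern looks like this (an illustrative sketch, not the exact workflow steps; the cluster name, node count, and zone are assumptions, and PULL_NUMBER is the PR number Prow injects):

# Create an ephemeral cluster for the presubmit, run the e2e tests, then tear it down
gcloud container clusters create "katib-e2e-${PULL_NUMBER}" \
  --zone us-east1-d --project kubeflow-ci --num-nodes 4
# ... run the Katib v1beta1 e2e workflow against this cluster ...
gcloud container clusters delete "katib-e2e-${PULL_NUMBER}" \
  --zone us-east1-d --project kubeflow-ci --quiet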

Some background:
Originally all of our test infrastructure ran in project kubeflow-ci. This included ephemeral infrastructure such as GKE clusters spun up for the lifetime of the tests.

To enable better management of ephemeral infrastructure we started moving ephemeral clusters into separate projects e.g. kubeflow-ci-deployments. The thinking was this would make it easier to deal with resource leaks because everything in the ephemeral project could just be deleted.

Not all of the tests have been migrated to this model.

I am fine with migrating the Katib test infra to an independent GCP project. What do you think @gaocegege @johnugeorge ?

To remediate this I'm going to disable the ability of tests to create infrastructure in project kubeflow-ci. This will break any tests that are still doing that, acting as a forcing function for them to fix things.

I'm removing a bunch of permissions from the kubeflow-testing@kubeflow-ci.iam.gserviceaccount.com GSA. Notably

Does it affect the TF or PyTorch operators' test infra, @johnugeorge @terrytangyuan @Jeffwan?

Yes, it should be the same case for the operators as well.

@jlewi this is affecting KFServing CI and currently blocking multiple PRs. I have created the PR kubeflow/community-infra#10 to set up a GCP project for KFServing, please help review, thanks!

/priority p0

I am fine with migrating the Katib test infra to an independent GCP project. What do you think @gaocegege @johnugeorge ?

SGTM.

I filed a PR for the training projects: kubeflow/community-infra#13

jlewi commented

As I mentioned in yesterday's community meeting, another remediation would be to revert the IAM policy changes in
#749 (comment)

This would grant an extension to the existing projects (e.g. Katib and KFServing) so that they can continue to create clusters in kubeflow-ci until they have successfully set up and migrated to WG-specific projects.

ci-team@kubeflow.org should have sufficient privileges to do this, so anyone in
https://github.com/kubeflow/internal-acls/blob/master/ci-team.members.txt
could do this.
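Concretely, the temporary extension would look something like this (a sketch; it assumes the person running it has IAM admin rights on kubeflow-ci):

# Re-grant Kubernetes Engine Admin to the test GSA until the WG-specific projects are ready
gcloud projects add-iam-policy-binding kubeflow-ci \
  --member='serviceAccount:kubeflow-testing@kubeflow-ci.iam.gserviceaccount.com' \
  --role='roles/container.admin'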

It looks like the membership of ci-team@ is outdated; a bunch of those folks are likely no longer active in the project. It might make sense to replace them with members from the respective WGs that depend on kubeflow-ci so that they can help administer and contribute.

cc @kubeflow/wg-automl-leads @kubeflow/wg-serving-leads @kubeflow/wg-training-leads

@jlewi do you have a view on which of the folks with @google.com addresses are still active here?
https://github.com/kubeflow/internal-acls/blob/master/ci-team.members.txt

Additionally, I don't recognize who else is active from these names:

kam.d.kasravi@intel.com [Don't think he is active anymore]
hnalla@redhat.com [Who?]
scottleehello@gmail.com [Who?]

jlewi commented

@animeshsingh you should be able to use history to see who committed the changes
@scottilee = scottleehello@
@harshad16 = harshad@
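One way to map each address to whoever added it (a sketch; run from a clone of kubeflow/internal-acls):

# Show who committed each change to the ci-team ACL file
git log --follow --date=short --pretty='%h %ad %an <%ae> %s' -- ci-team.members.txt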

stale commented

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in one week if no further activity occurs. Thank you for your contributions.