google/caliban

Cannot create cluster

eschnett opened this issue · 5 comments

I want to create a GKE cluster following your instructions. I have set up the cloud tools, authentication, etc. I receive this error message:

$ caliban cluster create --cluster_name einsteintoolkit-cluster --zone us-east1-b
I0801 23:07:37.349657 4416302528 cli.py:185] creating cluster einsteintoolkit-cluster in project fifth-curve-272318 in us-east1-b...
I0801 23:07:37.349900 4416302528 cli.py:186] please be patient, this may take several minutes
I0801 23:07:37.349989 4416302528 cli.py:188] visit https://console.cloud.google.com/kubernetes/clusters/details/us-east1-b/einsteintoolkit-cluster?project=fifth-curve-272318 to monitor cluster creation progress
E0801 23:07:37.582320 4416302528 util.py:68] exception in call <function Cluster.create at 0x7ffc08ba4a70>:
<HttpError 400 when requesting https://container.googleapis.com/v1beta1/projects/fifth-curve-272318/zones/us-east1-b/clusters?alt=json returned "Resource_limit.maximum must be greater than 0.">
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.7/site-packages/caliban-0.3.0+8.gaf9dd99-py3.7.egg/caliban/platform/gke/util.py", line 65, in wrapper
    response = fn(*args, **kwargs)
  File "/opt/anaconda3/lib/python3.7/site-packages/caliban-0.3.0+8.gaf9dd99-py3.7.egg/caliban/platform/gke/cluster.py", line 1178, in create
    rsp = request.execute()
  File "/opt/anaconda3/lib/python3.7/site-packages/google_api_python_client-1.10.0-py3.7.egg/googleapiclient/_helpers.py", line 134, in positional_wrapper
    return wrapped(*args, **kwargs)
  File "/opt/anaconda3/lib/python3.7/site-packages/google_api_python_client-1.10.0-py3.7.egg/googleapiclient/http.py", line 907, in execute
    raise HttpError(resp, content, uri=self.uri)
googleapiclient.errors.HttpError: <HttpError 400 when requesting https://container.googleapis.com/v1beta1/projects/fifth-curve-272318/zones/us-east1-b/clusters?alt=json returned "Resource_limit.maximum must be greater than 0.">

When I use --dry_run, I see these details:

$ caliban cluster create --cluster_name einsteintoolkit-cluster --zone us-east1-b --dry_run
I0801 23:08:27.893903 4585823680 cli.py:175] request:
{'cluster': {'autoscaling': {'autoprovisioningNodePoolDefaults': {'oauthScopes': ['https://www.googleapis.com/auth/compute',
                                                                                  'https://www.googleapis.com/auth/cloud-platform']},
                             'enableNodeAutoprovisioning': 'true',
                             'resourceLimits': [{'maximum': '24',
                                                 'resourceType': 'cpu'},
                                                {'maximum': '1536',
                                                 'resourceType': 'memory'},
                                                {'maximum': '1',
                                                 'resourceType': 'nvidia-tesla-k80'},
                                                {'maximum': '1',
                                                 'resourceType': 'nvidia-tesla-p100'},
                                                {'maximum': '1',
                                                 'resourceType': 'nvidia-tesla-v100'},
                                                {'maximum': '1',
                                                 'resourceType': 'nvidia-tesla-p4'},
                                                {'maximum': '1',
                                                 'resourceType': 'nvidia-tesla-t4'},
                                                {'maximum': '0',
                                                 'resourceType': 'nvidia-tesla-a100'}]},
             'enable_tpu': 'true',
             'ipAllocationPolicy': {'useIpAliases': 'true'},
             'locations': ['us-east1-b', 'us-east1-c', 'us-east1-d'],
             'name': 'einsteintoolkit-cluster',
             'nodePools': [{'config': {'oauthScopes': ['https://www.googleapis.com/auth/devstorage.read_only',
                                                       'https://www.googleapis.com/auth/logging.write',
                                                       'https://www.googleapis.com/auth/monitoring',
                                                       'https://www.googleapis.com/auth/service.management.readonly',
                                                       'https://www.googleapis.com/auth/servicecontrol',
                                                       'https://www.googleapis.com/auth/trace.append']},
                            'initialNodeCount': '3',
                            'name': 'default-pool'}],
             'releaseChannel': {'channel': 'REGULAR'},
             'zone': 'us-east1-b'},
 'parent': 'projects/fifth-curve-272318/locations/us-east1-b'}

The request does indeed contain a resource limit with a maximum of 0: the entry for nvidia-tesla-a100.

I am using the current master branch.
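
For anyone hitting the same 400 error, a quick way to spot the offending entry in a --dry_run dump is sketched below. This is only an illustration: the helper name `zero_resource_limits` is hypothetical (not part of Caliban), and the `request` dict is an abbreviated copy of the output above.

```python
def zero_resource_limits(request):
    """Return the resourceType of every limit whose maximum is not positive.

    Hypothetical helper for inspecting a dry-run request; not a Caliban API.
    """
    limits = (request.get('cluster', {})
                     .get('autoscaling', {})
                     .get('resourceLimits', []))
    # The dry-run output stores maxima as strings, so convert before comparing.
    return [entry['resourceType'] for entry in limits
            if int(entry['maximum']) <= 0]


# Abbreviated copy of the dry-run request shown above.
request = {
    'cluster': {
        'autoscaling': {
            'resourceLimits': [
                {'maximum': '24', 'resourceType': 'cpu'},
                {'maximum': '1', 'resourceType': 'nvidia-tesla-t4'},
                {'maximum': '0', 'resourceType': 'nvidia-tesla-a100'},
            ],
        },
    },
}

print(zero_resource_limits(request))
```

Any resource type it reports is one the GKE API would reject with "Resource_limit.maximum must be greater than 0."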

Thanks for the report, @eschnett! @ajslone, our resident GKE expert, should be able to help out here.

This change avoids the error:

diff --git a/caliban/platform/gke/util.py b/caliban/platform/gke/util.py
index 9a754ea..0d70c45 100644
--- a/caliban/platform/gke/util.py
+++ b/caliban/platform/gke/util.py
@@ -395,9 +395,10 @@ def resource_limits_from_quotas(
     gd = gpu_match.groupdict()
     gpu_type = gd['gpu']

-    limits.append({
-        'resourceType': 'nvidia-tesla-{}'.format(gpu_type.lower()),
-        'maximum': str(limit)
+    if limit > 0:
+      limits.append({
+          'resourceType': 'nvidia-tesla-{}'.format(gpu_type.lower()),
+          'maximum': str(limit)
     })

   return limits
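
The effect of the patch can be sketched in isolation as follows. This is not the actual body of `resource_limits_from_quotas`; the function name `gpu_resource_limits`, the `(metric, limit)` input shape, and the quota-metric regex are all assumptions made for illustration. Only the `if limit > 0` guard and the shape of the appended dict come from the diff above.

```python
import re

# Assumed quota metric naming (e.g. 'NVIDIA_K80_GPUS'); the real pattern
# used in caliban/platform/gke/util.py may differ.
_GPU_QUOTA_RE = re.compile(r'^NVIDIA_(?P<gpu>[A-Z0-9]+)_GPUS$')


def gpu_resource_limits(quotas):
    """Build GPU resource limits, skipping any GPU type with a quota of 0.

    `quotas` is assumed to be an iterable of (metric_name, limit) pairs.
    """
    limits = []
    for metric, limit in quotas:
        gpu_match = _GPU_QUOTA_RE.match(metric)
        if gpu_match is None:
            continue  # not a GPU quota entry
        gpu_type = gpu_match.groupdict()['gpu']
        # The guard from the patch: the API rejects a maximum of 0,
        # so such entries are omitted from the request.
        if limit > 0:
            limits.append({
                'resourceType': 'nvidia-tesla-{}'.format(gpu_type.lower()),
                'maximum': str(limit),
            })
    return limits


print(gpu_resource_limits([('NVIDIA_K80_GPUS', 1), ('NVIDIA_A100_GPUS', 0)]))
```

With a zero A100 quota, as in the project above, the A100 entry is simply left out of the request rather than being sent with a maximum of 0.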

Erik, thanks for the bug report, and for the fix. I'll merge that in shortly.

Erik, I have merged this change. Please let me know if it does not resolve your issue. Thanks again for the bug report and for using Caliban + GKE.

This change resolves the issue.