Cannot create certain GCP GPU instances
awendel-presien opened this issue · 11 comments
Hi everyone,
I'm getting the following error when trying to create A100 or L4 based instances on GCP using cml runner launch (a2-highgpu and g2-standard types respectively):
***"level":"error","message":"terraform error: Error: Failed creating the machine: googleapi: Error 400: Instances with guest accelerators do not support live migration., badRequest"***
I have no problem creating V100 and T4 instances (both n1 types).
I have found this discussion, which suggests the maintenance policy needs to be set to TERMINATE. Am I on the right track, and if yes, is there a way to do that using cml runner launch?
Regards,
Alex.
Hello, @awendel-presien! It looks like the instance maintenance behavior is already being set to TERMINATE when creating GPU instances.
Hi @0x2b3bfa0, thanks for having a look at this! Do you have any other ideas as to why it might return that error for the newer a2-highgpu and g2-standard instance types?
@0x2b3bfa0, unfortunately this error persists for us. As a test, I tried creating a g2-standard-4 instance using Terraform and the Iterative Terraform provider (so not using CML), and that worked without issue.
So the problem only occurs when trying to start g2 or a2 instances using cml runner launch.
Any ideas?
We managed to get this working by including the GPU type and number in the --cloud-type option, e.g. g2-standard-96+nvidia-l4*8 instead of g2-standard-96.
I think this is something that should at least be documented, because the GPU suffix is technically superfluous; i.e. g2-standard-96 instances only come with 8x Nvidia L4 GPUs. It's the same with a2 instances; for example, a2-highgpu-8g only comes with 8x A100 GPUs.
It's also not necessary to specify the number and type of GPUs when using the Terraform Provider Iterative directly: it works just fine when only providing the machine type. And cml runner launch does not require this for AWS instances; for example, you can launch a g4dn.metal instance without specifying the type and number of GPUs.
Hi @awendel-presien,
Did you manage to run any a2-highgpu using cml runner launch? I am getting the same error when I set --cloud-type=a2-highgpu-1g.
@hopeai can you try a2-highgpu+nvidia-a100*1 or a2-highgpu+nvidia-tesla-a100*1?
We'll see if we can address this in the near future. In the past you had to select GPUs explicitly, and the GCP types didn't have preselected GPU options the way the AWS instance types do, for example.
Thanks @dacbd, I was able to solve this problem by setting --cloud-type=a2-highgpu-1g+nvidia-tesla-a100*1. BTW, how do you deal with the resource availability problem? Is there a plan to address this in the near future?
error: terraform error: Error: Failed creating the machine: Operation error: compute.OperationErrorErrors{Code:"ZONE_RESOURCE_POOL_EXHAUSTED_WITH_DETAILS", ErrorDetails:[]*compute.OperationErrorErrorsErrorDetails{(*compute.OperationErrorErrorsErrorDetails)(0xc000336870), (*compute.OperationErrorErrorsErrorDetails)(0xc000336960), (*compute.OperationErrorErrorsErrorDetails)(0xc000336cd0)}, Location:"", Message:"The zone 'projects/MY_PROJECT/zones/us-central1-a' does not have enough resources available to fulfill the request. '(resource type:compute)'.", ForceSendFields:[]string(nil), NullFields:[]string(nil)}
My recommendation, if it's something that you encounter often, would be to try some kind of simple bash loop, something like this:
    # Zones to try, in order (bash array elements are separated by spaces, not commas)
    zones=("us-central1-a" "us-central1-b" "us-central1-c")
    for zone in "${zones[@]}"; do
        cml runner launch ... \
            --cloud-region="$zone" \
            ...
        # $? holds the exit status of the launch command above
        if [ $? -eq 0 ]; then
            echo "Deployed runner in $zone"
            break
        else
            echo "Runner in $zone failed, trying next zone"
        fi
    done
(I haven't explicitly tested the above)
@hopeai we aren't doing much active development on CML for the moment, but if you want to add this feature yourself, I'd be happy to prioritize testing any pull requests you make, and releasing any new additions.
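A retry loop like the one above moves on to the next zone after any failure, including ones that another zone cannot fix (such as the Error 400 at the top of this thread). A minimal sketch of telling the two apart, assuming you capture the output of cml runner launch yourself; the helper name is_zone_exhausted is hypothetical, and the only string it relies on is the ZONE_RESOURCE_POOL_EXHAUSTED code from the error quoted earlier.

```shell
# Hypothetical helper: succeed only if the captured launch output
# reports that the zone has run out of capacity.
is_zone_exhausted() {
  printf '%s\n' "$1" | grep -q 'ZONE_RESOURCE_POOL_EXHAUSTED'
}

# Abbreviated fragment of the error message quoted in this thread
log='Code:"ZONE_RESOURCE_POOL_EXHAUSTED_WITH_DETAILS"'

if is_zone_exhausted "$log"; then
  echo "capacity problem: trying another zone may help"
else
  echo "different failure: stop and inspect the error"
fi
```

Whether the message arrives on stdout or stderr is not confirmed here, so capturing both (for example with 2>&1) is the safer assumption.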
Thanks for the recommendation, @dacbd. At the moment I'm using a similar bash loop, but I'd like to know whether this is something that will be addressed in cml runner launch. It could be a --cloud-region-list option, or --cloud-region could accept more than one region to try.
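Until something like that exists, the idea can be approximated with a wrapper. This is only a sketch: --cloud-region-list is the option proposed above and does not exist in CML, and the names launch_in_first_available and LAUNCH_CMD are hypothetical. The wrapper takes a comma-separated list and passes each entry to the existing --cloud-region option in turn.

```shell
# LAUNCH_CMD is an assumption made so the wrapper can be tested with a
# stand-in command; by default it is the real CLI.
LAUNCH_CMD="cml runner launch"

# Hypothetical wrapper: $1 is a comma-separated list of regions or zones,
# the remaining arguments are passed through to the launch command.
launch_in_first_available() {
  region_list="$1"
  shift
  for region in $(printf '%s' "$region_list" | tr ',' ' '); do
    # LAUNCH_CMD is deliberately unquoted so it splits into words
    if $LAUNCH_CMD --cloud-region="$region" "$@"; then
      echo "Runner deployed in $region"
      return 0
    fi
    echo "Launch in $region failed, trying next region" >&2
  done
  echo "No region in the list could fulfil the request" >&2
  return 1
}

# Example call (commented out because it would create real resources):
# launch_in_first_available "us-central1-a,us-central1-b,us-central1-c" \
#   --cloud=gcp --cloud-type="a2-highgpu-1g+nvidia-tesla-a100*1"
```

Progress messages for failed attempts go to stderr, so stdout carries only the name of the region that worked.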
Check the quotas of your GCP account and try to provision resources accordingly via cml runner.