iterative/cml

Cannot create certain GCP GPU instances

awendel-presien opened this issue · 11 comments

Hi everyone,

I'm getting the following error when trying to create A100 or L4 based instances on GCP using cml runner launch (a2-highgpu and g2-standard types respectively):

***"level":"error","message":"terraform error: Error: Failed creating the machine: googleapi: Error 400: Instances with guest accelerators do not support live migration., badRequest"***

I have no problem creating V100 and T4 instances (both n1 types).

I have found this discussion, which suggests the maintenance policy needs to be set to TERMINATE. Am I on the right track, and if yes, is there a way to do that using cml runner launch?

Regards,
Alex.

Hello, @awendel-presien! It looks like the instance maintenance behavior is already being set to TERMINATE when creating GPU instances. 🤔

Hi @0x2b3bfa0, thanks for having a look at this! Do you have any other ideas as to why it might return that error for the newer a2-highgpu and g2-standard instance types?

@0x2b3bfa0, unfortunately this error persists for us. As a test, I tried creating a g2-standard-4 instance using Terraform and the Iterative Terraform provider (so not using CML), and that worked without issue.

So the problem only occurs when trying to start g2 or a2 instances using cml runner launch.

Any ideas?

We managed to get this working by including the GPU type and number in the --cloud-type option, e.g. g2-standard-96+nvidia-l4*8 instead of g2-standard-96.

I think this should at least be documented, because the extra information is technically superfluous: g2-standard-96 instances only come with 8x Nvidia L4 GPUs. The same goes for a2 instances; for example, a2-highgpu-8g only comes with 8x A100 GPUs.

It's also not necessary to specify the number and type of GPUs when using the Terraform Provider Iterative directly - it works just fine when only providing the machine type. And cml runner launch does not require this for AWS instances; for example you can launch a g4dn.metal instance without specifying the type and number of GPUs.
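To illustrate the workaround described above, here is a minimal sketch of a shell helper that assembles a --cloud-type value in the machine-type+gpu-model*gpu-count form. The function name gcp_cloud_type is hypothetical and not part of CML; only the value format is taken from this thread.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of CML): build a --cloud-type value of the
# form <machine-type>+<gpu-model>*<gpu-count>, the format reported to work
# in this thread for GCP GPU instances.
gcp_cloud_type() {
    local machine_type="$1" gpu_model="$2" gpu_count="$3"
    # Single-quoted format string, so the '*' is never treated as a glob.
    printf '%s+%s*%s\n' "$machine_type" "$gpu_model" "$gpu_count"
}

# Example: prints the value used earlier in this thread.
gcp_cloud_type g2-standard-96 nvidia-l4 8
```

The result can then be passed along as `--cloud-type="$(gcp_cloud_type g2-standard-96 nvidia-l4 8)"`. Quoting the value is a cautious habit, since an unquoted `*` could be expanded by the shell if a matching file name happened to exist.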

hopeai commented

Hi @awendel-presien,

Did you manage to run any a2-highgpu instance using cml runner launch? I am getting the same error when I set --cloud-type=a2-highgpu-1g.

dacbd commented

@hopeai can you try a2-highgpu+nvidia-a100*1 or a2-highgpu+nvidia-tesla-a100*1? We'll see if we can address this in the near future. In the past you had to select GPUs explicitly, and the GCP machine types didn't come with preselected GPU options the way, for example, the AWS instance types do.

hopeai commented

Thanks @dacbd, I was able to solve this problem by setting --cloud-type=a2-highgpu-1g+nvidia-tesla-a100*1. BTW, how do you deal with the resource availability problem? Is there a plan to address this in the near future?

error: terraform error: Error: Failed creating the machine: Operation error: compute.OperationErrorErrors{Code:"ZONE_RESOURCE_POOL_EXHAUSTED_WITH_DETAILS", ErrorDetails:[]*compute.OperationErrorErrorsErrorDetails{(*compute.OperationErrorErrorsErrorDetails)(0xc000336870), (*compute.OperationErrorErrorsErrorDetails)(0xc000336960), (*compute.OperationErrorErrorsErrorDetails)(0xc000336cd0)}, Location:"", Message:"The zone 'projects/MY_PROJECT/zones/us-central1-a' does not have enough resources available to fulfill the request. '(resource type:compute)'.", ForceSendFields:[]string(nil), NullFields:[]string(nil)}

dacbd commented

My recommendation, if it's something that you encounter often, would be to try some kind of simple bash loop, something like this:

# Bash array elements are separated by spaces, not commas.
zones=("us-central1-a" "us-central1-b" "us-central1-c")
for zone in "${zones[@]}"; do
    # Try each zone in turn; --cloud-region is the flag named later in this thread.
    cml runner launch ... \
        --cloud-region="$zone" \
        ...
    if [ $? -eq 0 ]; then
        echo "Deployed runner in $zone"
        break
    else
        echo "Runner in $zone failed, trying next zone"
    fi
done

(I haven't explicitly tested the above)


@hopeai we aren't doing much active development on CML for the moment, but if you want to add this feature yourself, I'd be happy to prioritize testing any pull requests you make, and releasing any new additions.

hopeai commented

Thanks for the recommendation, @dacbd. At the moment I'm using a similar bash loop, but I'd like to know whether this will be addressed in cml runner launch itself. It could be a --cloud-region-list option, or --cloud-region could accept more than one region to try.
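Until something like that exists in CML, the requested behaviour can be approximated with a small wrapper function. The following is only a sketch under stated assumptions: launch_with_fallback and the LAUNCHER variable are hypothetical names invented here, not CML features, and a launch is treated as failed whenever the command exits non-zero, whatever the cause.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper (not a CML feature): try each zone from a
# comma-separated list until one `cml runner launch` call succeeds.
# Usage: launch_with_fallback "zone-a,zone-b" [other cml runner launch flags]
launch_with_fallback() {
    local zone_list="$1"
    shift
    # LAUNCHER is an assumption made here so the command can be swapped out
    # (for example with a stub in a dry run); it defaults to the cml CLI.
    local launcher="${LAUNCHER:-cml}"
    local zone
    # Split the list on commas; zone names contain no spaces or glob characters.
    for zone in $(printf '%s' "$zone_list" | tr ',' ' '); do
        if "$launcher" runner launch --cloud-region="$zone" "$@"; then
            echo "Runner deployed in $zone"
            return 0
        fi
        echo "Runner in $zone failed, trying next zone" >&2
    done
    echo "All zones failed: $zone_list" >&2
    return 1
}
```

It would be called with the zone list first and the remaining launch flags after it, for example `launch_with_fallback "us-central1-a,us-central1-b,us-central1-c" --cloud=gcp --cloud-type='a2-highgpu-1g+nvidia-tesla-a100*1'`. Because any non-zero exit moves on to the next zone, a misconfigured flag would fail in every zone rather than stop early.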

Check the quotas of your GCP account and try to provision resources accordingly via cml runner.