iterative/cml

How to set the instance creation retry count ("exceeded maximum number of attempts" error on startup)?

Closed issue · 1 comment

Summary / Background

I want to run an 8xA100 instance (p4d.24xlarge) on EC2. Because this instance type is in high demand, I do not expect it to become available within 3 retries, or even within 100. I want the runner to keep retrying, as often as once per second, for X hours until the instance is ready (like a bot).

Error example:

{"level":"info","message":"iterative_cml_runner.runner: Creating..."}
{"level":"info","message":"iterative_cml_runner.runner: Creation errored after 10s"}
{"level":"error","message":"terraform error: Error: Failed creating the machine: Not able to decode: operation error EC2: RunInstances, exceeded maximum number of attempts, 3, https response error StatusCode: 500, RequestID: 78dbfe11, api error InsufficientInstanceCapacity: We currently do not have sufficient p4d.24xlarge capacity in the Availability Zone you requested (us-west-2a). Our system will be working on provisioning additional capacity. You can currently get p4d.24xlarge capacity by not specifying an Availability Zone in your request or choosing us-west-2b, us-west-2c."}
{"level":"info","message":"::error::Terraform exited with code 1."}

Scope

I would like an option to set the number of retry attempts to X, or to retry indefinitely, when starting an instance. Is there a hidden option for this, or at least a way to set the retry count to something like 99999999?

We don't provide any inbuilt mechanism to do that, but you can always retry at the shell level.

# Retry up to 100 times; stop as soon as cml runner exits successfully.
for attempt in {1..100}; do
  cml runner ... && break
  sleep 1
done
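
Since the original request was to keep retrying for X hours rather than a fixed number of attempts, here is a minimal sketch of a time-bounded variant of the same shell-level retry. The helper name `retry_for` and its arguments are hypothetical (not part of CML), and it assumes that `cml runner` exits with a non-zero status when instance creation fails, as the "Terraform exited with code 1" log line suggests.

```bash
#!/usr/bin/env bash

# retry_for MAX_SECONDS INTERVAL COMMAND [ARGS...]
# Hypothetical helper (not provided by CML): re-runs COMMAND until it
# exits 0 or until MAX_SECONDS have elapsed since the first attempt.
retry_for() {
  local max_seconds=$1 interval=$2
  shift 2
  # SECONDS is a bash builtin that counts seconds since the shell started.
  local deadline=$((SECONDS + max_seconds))
  local attempt=0
  while true; do
    attempt=$((attempt + 1))
    if "$@"; then
      echo "succeeded on attempt ${attempt}" >&2
      return 0
    fi
    if ((SECONDS >= deadline)); then
      echo "giving up after ${attempt} attempts" >&2
      return 1
    fi
    sleep "$interval"
  done
}

# Example: retry once per second for up to 6 hours
# (cml runner arguments elided, as in the thread):
# retry_for $((6 * 3600)) 1 cml runner ...
```

The command is always tried at least once, and the deadline is only checked after a failure, so a successful attempt is never discarded. A longer interval than one second may be kinder to API rate limits.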