cloudfoundry/bosh-cli

Feature Request: bosh ability to recreate multiple specified indexes for an instance group in one run

iamejboy opened this issue · 7 comments

As an operator, we would like bosh to be able to recreate specific instances of an instance group in one command, either by listing indexes or by giving a range of indexes:

bosh recreate -d cf-deployment cell/[1,339,123,24,250]
and
bosh recreate -d cf-deployment cell/[300-350]

From this discussion: https://cloudfoundry.slack.com/archives/C02HPPYQ2/p1561624294169800

The current API allows recreating one instance or all of them. Sometimes we don't need all, and recreating one by one (see the example after this list) has the following disadvantages:

  • Preparation for each recreation step takes 5-7 minutes
  • Max-in-flight is not respected (we cannot run more than one recreate task in parallel)
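
For context, what "one by one" means today is roughly the following (a sketch, reusing the indexes from the example above; each invocation plans the whole deployment before touching the instance):

for a in 1 24 123 250 339; do bosh -d cf-deployment -n recreate "cell/${a}"; done;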

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/166970717

The labels on this github issue will be updated when the story is started.

Hi @iamejboy,

We recently cut v270.4.0 which contains endpoints for new start/stop/restart/recreate behavior. These new endpoints act only on the instance in question and don't plan the entire deployment as a prerequisite. One of the benefits is that they go through a different planning flow that should be much quicker.

You can access these endpoints by running the following:

bosh curl -X POST /deployments/<deployment-name>/instance_groups/<instance-group-name>/<index-or-id>/actions/recreate

Could you try that out and let us know how it improves your workflow?
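
For illustration (a sketch, using a zookeeper deployment on bosh-lite as a stand-in for real names), the call returns a JSON task description; its id can be followed to completion with bosh task once you substitute the placeholder:

bosh curl -X POST /deployments/zookeeper/instance_groups/zookeeper/0/actions/recreate
bosh task <task-id-from-the-json-response>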

Best,

@jfmyers9

Hi @jfmyers9 ,

Thanks for getting back to us on this feature request. I tried the new bosh version in bosh-lite to verify the endpoints mentioned. Recreating via bosh curl against the specified endpoint was obviously quicker to operate, and it does not go through the usual preparation the way bosh recreate does. But based on my observation, it is only nice when recreating a single instance. If I want to recreate many instances of an instance group by index, which is the main purpose of this request, say with a simple iteration like the one below:

for a in 0 2 4; do bosh curl -X POST /deployments/zookeeper/instance_groups/zookeeper/${a}/actions/recreate; done;
Using environment '192.168.50.6' as client 'admin'

{"id":37,"state":"processing","description":"recreate instance","timestamp":null,"started_at":null,"result":null,"user":"admin","deployment":"zookeeper","context_id":""}
Succeeded
Using environment '192.168.50.6' as client 'admin'

{"id":38,"state":"processing","description":"recreate instance","timestamp":null,"started_at":null,"result":null,"user":"admin","deployment":"zookeeper","context_id":""}
Succeeded
Using environment '192.168.50.6' as client 'admin'

{"id":39,"state":"queued","description":"recreate instance","timestamp":null,"started_at":null,"result":null,"user":"admin","deployment":"zookeeper","context_id":""}
Succeeded

ID  State       Started At                    Finished At                   User   Deployment  Description        Result  
39  error       Fri Aug 16 04:01:44 UTC 2019  Fri Aug 16 04:01:54 UTC 2019  admin  zookeeper   recreate instance  Failed to acquire lock for lock:deployment:zookeeper uid: 0725f592-f44e-4ac7-a903-95691c2fc240. Locking task id is 37,...  
38  error       Fri Aug 16 04:01:43 UTC 2019  Fri Aug 16 04:01:54 UTC 2019  admin  zookeeper   recreate instance  Failed to acquire lock for lock:deployment:zookeeper uid: cd05f3bf-b340-4258-b2a1-1b848eddee25. Locking task id is 37,...  
37  processing  Fri Aug 16 04:01:43 UTC 2019  -                             admin  zookeeper   recreate instance  -

vs

 for a in 0 2 4; do bosh -d zookeeper -n recreate "zookeeper/${a}"; done;
Using environment '192.168.50.6' as client 'admin'

Using deployment 'zookeeper'

Task 40

Task 40 | 04:04:10 | Preparing deployment: Preparing deployment (00:00:01)
Task 40 | 04:04:11 | Preparing deployment: Rendering templates (00:00:00)
Task 40 | 04:04:11 | Preparing package compilation: Finding packages to compile (00:00:00)
Task 40 | 04:04:12 | Updating instance zookeeper: zookeeper/cfb2dcd6-76e1-41c0-9d06-87c3246b3d7e (0) (canary) (00:00:34)

Task 40 Started  Fri Aug 16 04:04:10 UTC 2019
Task 40 Finished Fri Aug 16 04:04:46 UTC 2019
Task 40 Duration 00:00:36
Task 40 done

Succeeded
Using environment '192.168.50.6' as client 'admin'

Using deployment 'zookeeper'

Task 41

Task 41 | 04:04:48 | Preparing deployment: Preparing deployment (00:00:00)
Task 41 | 04:04:48 | Preparing deployment: Rendering templates (00:00:01)
---------------CUT--------------------------------

The latter was still better and more reliable, even if it is slow. But I believe this feature request will help operators like us who run hundreds of instances of a particular instance group, such as cell, by easing the pain of recreating several of them. For some incidents we need to recreate a certain number of cell VMs due to an IaaS issue. Being able to do it in one pass, with the liberty to specify indexes or an index range without having to go through every preparation flow, would be awesome.
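
As an interim workaround for both the lock contention above and the index-range use case, the endpoint calls could be serialized by waiting for each task before submitting the next one. A rough sketch, assuming the task id can be pulled out of the JSON response with standard shell tools:

for a in $(seq 300 350); do
  # submit the recreate for this index and capture the numeric id from the JSON task response
  task_id=$(bosh curl -X POST "/deployments/cf-deployment/instance_groups/cell/${a}/actions/recreate" \
    | grep -o '"id":[0-9]*' | head -1 | cut -d: -f2)
  # follow the task to completion so the next request does not fail to acquire the deployment lock
  bosh task "${task_id}"
done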

Regards
@iamejboy

Hi @iamejboy,

In the use case you described above, you say you need to recreate certain VMs due to IAAS issues. Are these VMs becoming unresponsive? If so, have you considered using the bosh cck command to repair unresponsive VMs?
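
For reference, cck can also be run non-interactively; a rough sketch (exact flags may vary by CLI version):

bosh -d cf-deployment cck --report   # only list the problems the director detects
bosh -d cf-deployment cck --auto     # apply the default resolutions automatically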

How are you detecting the VMs that you would like to recreate? Is there a common defining characteristic, e.g. AZ, instance state, etc.?

At this point we don't have an endpoint on the director side to expose mass recreation of instances, so any features that we added in the CLI would be a convenience over the bash script that you have implemented above.

Best,

@jfmyers9 && @xtreme-conor-nosal

Hi @jfmyers9 @xtreme-conor-nosal ,

In the use case you described above, you say you need to recreate certain VMs due to IAAS issues. Are these VMs becoming unresponsive? If so, have you considered using the bosh cck command to repair unresponsive VMs?

It's decided on a case-by-case basis, depending on what we conclude from root cause analysis. We sometimes use bosh cck (which also takes some time), but more often we recreate.

How are you detecting the VMs that you would like to recreate? Is there a common defining characteristic, e.g. AZ, instance state, etc.?

Operational requirements such as updates, upgrades, unresponsive VMs, etc. And yes, what we are looking for here is convenience, especially for operators with very large deployments.

I hope this feature might be considered.

Regards,

Using a blocking form of the new endpoints:

for a in 0 2 4; do bosh -d zookeeper -n recreate "zookeeper/${a}" --no-converge; done;

does alleviate the pain of the preparing stage. This is similar to the curl command but called through the CLI. It blocks, and thus avoids the deployment lock issues.

Allowing multiple recreates to run in parallel while honoring max-in-flight is a bigger change. Are you using the resurrector? I would expect that to catch many of the unresponsive issues before operator intervention is even required. I'm also not sure what upgrade/update requirements a recreate would fix, as that usually falls under the purview of a bosh deploy action.

This issue was closed because it has been labeled Stale for 7 days without subsequent activity. Feel free to re-open this issue at any time by commenting below.