cloudfoundry/bosh

Dry-run flag is causing the BOSH director DB out of sync with infrastructure

madamkiwi opened this issue · 2 comments

Describe the bug
Running bosh deploy --dry-run for a stemcell change caused IPs to be created in the director database, and there doesn't seem to be a way to remove them.

To Reproduce
Steps to reproduce the behavior:

  1. Deploy a bosh deployment
  2. SSH onto the BOSH director and run the console to count the IPs that have no VM attached
bosh/0:~$ sudo su -
bosh/0:~# /var/vcap/jobs/director/bin/console
irb(main):003:0> puts Bosh::Director::Models::IpAddress.all.filter {|x| x.vm_id == nil}.count

  3. Upload a new stemcell
  4. Redeploy with the --dry-run flag to see that the new stemcell is being picked up
  5. SSH onto the BOSH director and run the console again to count the IPs that have no VM attached
bosh/0:~$ sudo su -
bosh/0:~# /var/vcap/jobs/director/bin/console
irb(main):003:0> puts Bosh::Director::Models::IpAddress.all.filter {|x| x.vm_id == nil}.count
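
To see which IP records a dry run left behind, rather than only how many, the two console checks above can be compared record by record. This is a minimal Ruby sketch, not director code: IpRecord is a hypothetical stand-in for rows of Bosh::Director::Models::IpAddress, and its address field name is an assumption. Only vm_id appears in the console commands above.

```ruby
# Hypothetical stand-in for Bosh::Director::Models::IpAddress rows.
# The `address` field name is an assumption; `vm_id` is taken from the console check.
IpRecord = Struct.new(:address, :vm_id)

# Returns the records that have no VM attached (same filter as the console check).
def orphaned_ips(records)
  records.select { |ip| ip.vm_id.nil? }
end

# Returns orphaned records that appear in `after` but were not orphaned in `before`.
def leaked_ips(before, after)
  known = orphaned_ips(before).map(&:address)
  orphaned_ips(after).reject { |ip| known.include?(ip.address) }
end

# Made-up sample data: one extra unattached record shows up in the second snapshot.
snapshot_before = [IpRecord.new("10.0.0.5", 1), IpRecord.new("10.0.0.6", nil)]
snapshot_after  = snapshot_before + [IpRecord.new("10.0.0.7", nil)]

puts leaked_ips(snapshot_before, snapshot_after).map(&:address).inspect
# prints ["10.0.0.7"]
```

In a real director console, the two snapshots would presumably come from the model itself before and after the dry run; the sample data here is invented for illustration.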

Expected behavior
The number of IPs in the second console check should be the same as in the first. Instead, the second count is greater.

Logs
Eventually we hit an error that we are running out of IPs

Error: Failed to reserve IP for 'diego-cell/<guid>' for manual network 'pws-diego-public': no more available

Versions (please complete the following information):

  • Infrastructure: AWS
  • BOSH version: 271.1.0
  • BOSH CLI version: 6.3.1
  • Stemcell version: ubuntu-xenial/621.81

Deployment info:
https://github.com/pivotal-cloudops/pws-cf/blob/master/manifest.yml

Additional context
It seems like the dry-run flag does a persistent write to the BOSH director DB, causing it to be out of sync with the current state of the infrastructure. We have tried recreating VMs and redeploying the same manifest, but we still cannot reclaim the lost IPs. Interestingly, we noticed that after a failed deploy, the total number of IPs decreased. It seems to us that BOSH may clean up IPs after a deploy failure.

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/174556922

The labels on this github issue will be updated when the story is started.

This is an unfortunate known issue: --dry-run has side effects that need a deploy to resolve. Our current expectation around its use is that it's used to validate what will happen, and then a changed manifest is deployed.

This issue is somewhat similar to #2226. At the moment we're not focusing on making improvements to the --dry-run behavior, as it would be a very significant refactor to our codebase and the benefit is not that clear.

At the very least, we should add a warning to the bosh.io docs, and possibly print one after a --dry-run deploy command is run.