Dry-run flag causes the BOSH director DB to get out of sync with infrastructure
madamkiwi opened this issue · 2 comments
Describe the bug
Running bosh deploy --dry-run on a stemcell change caused IP records to be created in the director database, and there is no way to remove them.
To Reproduce
Steps to reproduce the behavior:
- Deploy a bosh deployment
- SSH onto the BOSH director and run the console to count the IPs that are not attached to a VM
bosh/0:~$ sudo su -
bosh/0:~# /var/vcap/jobs/director/bin/console
irb(main):003:0> puts Bosh::Director::Models::IpAddress.all.filter {|x| x.vm_id == nil}.count
- Upload new stemcell
- Redeploy with the --dry-run flag to see that the new stemcell is being picked up
- SSH onto the BOSH director and run the console again to count the IPs that are not attached to a VM
bosh/0:~$ sudo su -
bosh/0:~# /var/vcap/jobs/director/bin/console
irb(main):003:0> puts Bosh::Director::Models::IpAddress.all.filter {|x| x.vm_id == nil}.count
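To see which records appeared between the two counts, and not only how many, the two snapshots can be compared by id. The following is a minimal sketch under stated assumptions: IpRow is a hypothetical stand-in for rows of Bosh::Director::Models::IpAddress, only vm_id is taken from the query above, and id is assumed to be the table's primary key. In the director console the rows would come from the model itself rather than from hand-built structs.

```ruby
# Hypothetical stand-in for a row of Bosh::Director::Models::IpAddress.
# Only the fields used below are modeled; `id` is an assumed primary key.
IpRow = Struct.new(:id, :vm_id)

# Rows with no VM attached (the same filter as in the console steps).
def unassigned(rows)
  rows.select { |row| row.vm_id.nil? }
end

# Ids of unassigned rows that exist in `after` but not in `before`.
def leaked_ids(before, after)
  unassigned(after).map(&:id) - unassigned(before).map(&:id)
end

# Made-up sample data: one attached IP and one unassigned IP before the
# dry run, two extra unassigned IPs afterwards.
before = [IpRow.new(1, 10), IpRow.new(2, nil)]
after  = before + [IpRow.new(3, nil), IpRow.new(4, nil)]

puts "unassigned before: #{unassigned(before).count}"
puts "unassigned after:  #{unassigned(after).count}"
puts "new unassigned ids: #{leaked_ids(before, after).inspect}"
```

With the sample data this reports 1 unassigned IP before, 3 after, and ids 3 and 4 as the new ones. The sample values are illustrative only and do not come from our director.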
Expected behavior
The number of IPs counted after the dry run (step 5) should equal the number counted before it. Instead, it is greater.
Logs
Eventually we hit an error saying that we have run out of IPs:
Error: Failed to reserve IP for 'diego-cell/<guid>' for manual network 'pws-diego-public': no more available
Versions (please complete the following information):
- Infrastructure: AWS
- BOSH version: 271.1.0
- BOSH CLI version: 6.3.1
- Stemcell version: ubuntu-xenial/621.81
Deployment info:
https://github.com/pivotal-cloudops/pws-cf/blob/master/manifest.yml
Additional context
It seems like the dry-run flag does a persistent write to the BOSH director DB, causing it to be out of sync with the current state of the infrastructure. We have tried recreating VMs and redeploying the same manifest, but still cannot reclaim the lost IPs. Interestingly, we noticed that after a failed deploy the total number of IPs decreased, so it seems to us that BOSH may clean up IPs after a deploy failure.
We have created an issue in Pivotal Tracker to manage this:
https://www.pivotaltracker.com/story/show/174556922
The labels on this github issue will be updated when the story is started.
This is an unfortunate known issue: --dry-run has side effects that need a deploy to resolve. Our current expectation is that it is used to validate what will happen, and that the changed manifest is then deployed.
This issue is somewhat similar to #2226. At the moment we're not focusing on improving the --dry-run behavior, as it would require a very significant refactor of our codebase and the benefit is not that clear.
At the very least we should add a warning to the bosh.io docs, and possibly print one after a dry-run deploy command is run.