bosh-nats-sync fails as long as no UAA is available
max-soe opened this issue · 2 comments
Describe the bug
With the new NATS version in BOSH 274, we have a problem deploying BOSH. Sometimes the deploy fails with:
Task 12439 | 13:26:49 | L starting jobs: bosh/ad445f92-5fab-4070-bdd4-1071258ba02d (0) (canary)
Updating deployment: Expected task '12439' to succeed but state is 'error' (00:08:12)
L Error: 'bosh/ad445f92-5fab-4070-bdd4-1071258ba02d (0)' is not running after update. Review logs for failed jobs: health_monitor
Task 12439 | 13:32:51 | Error: 'bosh/ad445f92-5fab-4070-bdd4-1071258ba02d (0)' is not running after update. Review logs for failed jobs: health_monitor
We found that the bosh-nats-sync job cannot authenticate as long as the co-deployed UAA is not running:
[2022-10-13T14:12:56.206749 #647762] INFO : Nats Sync starting...
[2022-10-13T14:13:06.290402 #647762] INFO : Executing NATS Users Synchronization
[2022-10-13T14:13:06.522845 #647762] ERROR : Failed to obtain token from UAA: #<CF::UAA::BadTarget: error: Failed to open TCP connection to 192.168.1.11:8443 (Connection refused - connect(2) for 192.168.1.11:8443)>
[2022-10-13T14:13:06.602752 #647762] FATAL : 401 Unauthorized
As a result, the health_monitor cannot use NATS. Once the UAA had started, about 5 minutes later, everything worked fine.
Expected behavior
The bosh-nats-sync job waits until the UAA has started. All jobs that depend on NATS, such as the health_monitor, wait until bosh-nats-sync has started.
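To illustrate the "wait until the UAA has started" behavior, here is a minimal sketch of polling a TCP endpoint until it accepts connections. It is not how bosh-nats-sync is implemented; the function name `wait_for_port` and its `timeout` and `interval` parameters are made up for this example.

```python
import socket
import time


def wait_for_port(host, port, timeout=300.0, interval=5.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout.

    Hypothetical helper for illustration only; not part of BOSH.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: the service is not up yet.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

With the address from the log above, a startup script could call `wait_for_port("192.168.1.11", 8443)` before attempting to fetch a token. A real check would probably also want to confirm that the UAA answers requests, not just that the port is open.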
Versions:
- Infrastructure: AWS
- BOSH version 274.4
- Stemcell version [e.g. ubuntu-jammy/1.18]
Hi @max-soe, thanks for the details of the problem. Currently the bosh-nats-sync job needs to access the BOSH API to get the list of all running VMs and write the NATS authentication file, so that the VMs are authorized to access NATS. If the BOSH API is not up for whatever reason (UAA being down, for example), it does not write the authentication file. That is a problem for the BOSH Monitor and the BOSH Director because, as you mention, they can't use NATS.
We plan to change this behavior. If the BOSH API is not up, we will write a basic authentication file for NATS that gives access to the BOSH Monitor and the Director. This will let them send messages to NATS and will probably fix the issue you are experiencing.
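The planned fallback can be sketched roughly as follows. This is only an illustration of the idea, not the code from the pull request: the function `build_nats_users`, the `fetch_vm_subjects` callable, and the `{"user": ...}` entry shape are all assumptions and do not reflect the real NATS authentication file format.

```python
def build_nats_users(fetch_vm_subjects, director_subject, monitor_subject):
    """Build the list of subjects allowed to use NATS.

    fetch_vm_subjects is a hypothetical callable that asks the BOSH API for
    the running VMs; it is expected to raise when the API (or UAA) is down.
    """
    # The Director and the Monitor always get access, even without the API.
    users = [{"user": director_subject}, {"user": monitor_subject}]
    try:
        vm_subjects = fetch_vm_subjects()
    except Exception:
        # API unavailable: fall back to the basic list so that the
        # Director and the Monitor can still talk to NATS.
        return users
    # API reachable: also authorize every running VM.
    users.extend({"user": subject} for subject in vm_subjects)
    return users
```

The point of this shape is that a UAA outage degrades to "Director and Monitor only" instead of leaving NATS with no authentication file at all; the VM entries would be added on a later sync once the API is reachable.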
These changes are in a pull request; we will merge it once it is approved.
Closing; this is fixed in v275.1.0.