cloudfoundry/bosh

The bosh-agent client certificate expired and the VM became unresponsive

rluan opened this issue · 1 comment

rluan commented

Hi,
My BOSH director version is 264.3.0.

I know the BOSH director issues the NATS client certificate for the downstream VMs. With my director version, the bosh-agent client certificate is issued with a two-year validity period.
-> pwd
/var/vcap/bosh
-> jq -r .env.bosh.mbus.cert.certificate settings.json | openssl x509 -noout -subject -issuer -dates
subject= /C=USA/O=Cloud Foundry/CN=d1e7fc20-ff50-4a43-baa3-fd94daac4a8b.agent.bosh-internal
issuer= /CN=default.nats-ca.bosh-internal
notBefore=Mar 23 20:56:11 2021 GMT
notAfter=Mar 23 20:56:11 2023 GMT
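
The dates printed above can also be checked programmatically. The following is a minimal sketch, not part of BOSH: it assumes the text output of `openssl x509 -noout -dates` is available as a string, and the helper names `parse_openssl_dates` and `days_remaining` are hypothetical. It uses `ssl.cert_time_to_seconds` from the Python standard library to parse the OpenSSL date format.

```python
import ssl
from datetime import datetime, timezone


def parse_openssl_dates(text):
    """Extract notBefore/notAfter from `openssl x509 -noout -dates` output.

    Hypothetical helper: lines such as subject= and issuer= are ignored.
    """
    dates = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        key = key.strip()
        if sep and key in ("notBefore", "notAfter"):
            # cert_time_to_seconds expects e.g. "Mar 23 20:56:11 2023 GMT".
            seconds = ssl.cert_time_to_seconds(value.strip())
            dates[key] = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return dates


def days_remaining(not_after, now):
    """Whole days until expiry; negative once the certificate has expired."""
    return (not_after - now).days


if __name__ == "__main__":
    sample = (
        "subject= /C=USA/O=Cloud Foundry/CN=example.agent.bosh-internal\n"
        "issuer= /CN=default.nats-ca.bosh-internal\n"
        "notBefore=Mar 23 20:56:11 2021 GMT\n"
        "notAfter=Mar 23 20:56:11 2023 GMT\n"
    )
    parsed = parse_openssl_dates(sample)
    print(days_remaining(parsed["notAfter"], datetime.now(timezone.utc)))
```

A check like this could run periodically and warn when the remaining days fall below a chosen margin; what margin is appropriate depends on how long a recreate of the deployment takes.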

The issue I ran into recently is that once the validity period of the bosh-agent client certificate has passed, the VM goes into the "unresponsive agent" state and can no longer be operated on, for example via SSH to the VM.
See the output below:
Deployment 'pp_bosh'

Instance                                      Process State       AZ  IPs
pp_bosh/b7fc4c14-84e2-4ac5-8b97-9226fed836ab  unresponsive agent  -   10.97.224.22

And the client certificate on the affected VM has expired:
Issuer: CN=default.nats-ca.bosh-internal
Validity
Not Before: Feb 13 10:25:53 2019 GMT
Not After : Feb 12 10:25:53 2021 GMT
Subject: C=USA, O=Cloud Foundry, CN=5100880f-38f4-4f8a-8940-9be10fffdc26.agent.bosh-internal

I ran into this because I have disabled the BOSH resurrector, so the unresponsive agent is not repaired automatically by a recreate.
The only way I could recover the VM was to recreate it with the --fix option.

Since the BOSH director dynamically monitors and operates the VMs, I expected the certificate to be renewed automatically to avoid this kind of issue, but it is not.
So does this mean that either
#1 we need to manually recreate the VMs under a BOSH director periodically, within the validity period of the client certificate, or
#2 the affected VM will be recreated automatically by the BOSH director (resurrection = on) once the director finds the VM unresponsive because of the certificate issue?
For scenario #1, I think we can schedule a regular maintenance window to recreate the VMs and renew their certificates. For scenario #2, I am concerned that it may cause an outage in our deployments: we always create a batch of VMs at the same time when we run a bosh deployment, so their certificates expire at nearly the same time. We may also have a singleton deployment, such as an NFS server.
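
To illustrate scenario #1, here is a rough sketch of how recreates could be planned in small batches ahead of expiry, so that VMs created together are not all recreated at once. It is only an assumption about how such planning might look, not something BOSH provides: the function `plan_recreate_batches`, the instance names, and the expiry dates are all hypothetical, and the expiry dates would have to be collected from each VM separately.

```python
from datetime import datetime, timedelta, timezone


def plan_recreate_batches(expiries, now, horizon_days=60, batch_size=1):
    """Group instances whose certificate expires within the horizon.

    expiries maps an instance name to its certificate notAfter datetime.
    Instances are ordered by expiry (earliest first) and split into
    batches of batch_size, so each batch can be recreated separately.
    """
    horizon = now + timedelta(days=horizon_days)
    due = sorted(
        (not_after, name)
        for name, not_after in expiries.items()
        if not_after <= horizon
    )
    names = [name for _, name in due]
    return [names[i:i + batch_size] for i in range(0, len(names), batch_size)]


if __name__ == "__main__":
    utc = timezone.utc
    # Hypothetical instances and expiry dates, for illustration only.
    expiries = {
        "nfs/0": datetime(2023, 2, 1, tzinfo=utc),
        "web/0": datetime(2023, 1, 20, tzinfo=utc),
        "web/1": datetime(2023, 1, 21, tzinfo=utc),
        "db/0": datetime(2024, 1, 1, tzinfo=utc),
    }
    now = datetime(2023, 1, 1, tzinfo=utc)
    for batch in plan_recreate_batches(expiries, now, batch_size=2):
        print(batch)
```

A batch size of 1 keeps singleton deployments such as an NFS server in their own step, where the downtime can be scheduled explicitly.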

P.S
I know there is the "percent_threshold" option in the director configuration, which can guard against a large-scale outage within a single deployment, but it cannot cover every case, and it would still require manual operations here.

Is there a preventive way to automatically renew the client certificates ahead of expiration? Or is there an option to sign longer-lived client certificates?

P.S
I thought about this kind of issue a few years ago (at that time we had an issue rotating the NATS server certificates), but today I finally ran into the client certificate issue ....

Currently these certificates have to be rotated manually; this link shows how to do that: https://bosh.io/docs/nats-ca-rotation/. We have upcoming work to make it possible to rotate these certificates without recreating all the VMs (we haven't fully scoped out the work yet and don't have an estimate for when it will be done).