Connector fails to sign the CSR when no cert in cache
Szymongib opened this issue · 3 comments
Description
Sometimes Compass Connector fails to sign the CSR when the certificate is not found in the cache. This may cause critical issues on the Runtime or Application side, as the one-time token is already consumed and a new token needs to be issued.
Example log from Compass Runtime Agent that encountered the issue:
Error while establishing a connection with Compass: Failed to sign CSR: Failed to generate certificate: graphql: Error while signing Certificate Signing Request: Certificate data not found in the cache
Expected result
The certificate is read directly from the Secret if it is not found in the cache.
Actual result
SignCSR fails when the certificate is not found in the cache.
Steps to reproduce
The issue seems to be random; it may be connected to Connector restarts.
Troubleshooting
@gvachkov I think your team can try to fix it :)
Hello,
A colleague of mine and I have been working on this issue, and we have come to the following findings:
Firstly, we went through the flow covering the communication between the in-memory cache and the Secrets. We noticed that the state of the Secrets is synced to the cache every minute, meaning that the cache is supposed to have the same content as the Secrets. This raised the question of what might have caused the situation where a certificate present in the Secret is missing from the cache.
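For reference, the sync mechanism we looked at behaves roughly like the sketch below. The interfaces and names are hypothetical, not the actual Connector code; the point is that a failed iteration leaves the cache empty until the next tick:

```go
package certsync

import (
	"context"
	"log"
	"time"
)

// secretsRepository and certCache are hypothetical interfaces standing in for
// the Connector's real abstractions over Kubernetes Secrets and the in-memory cache.
type secretsRepository interface {
	Get(ctx context.Context, namespace, name string) (map[string][]byte, error)
}

type certCache interface {
	Put(key string, data map[string][]byte)
}

// syncLoop copies the Secret content into the in-memory cache once per minute.
// If an iteration fails (e.g. the API server is unreachable right after the pod
// starts), the cache stays empty until the next successful iteration.
func syncLoop(ctx context.Context, repo secretsRepository, cache certCache, namespace, name string) {
	ticker := time.NewTicker(time.Minute)
	defer ticker.Stop()

	for {
		data, err := repo.Get(ctx, namespace, name)
		if err != nil {
			log.Printf("Failed to load secret %s/%s to cache: %v", namespace, name, err)
		} else {
			cache.Put(name, data)
		}

		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
	}
}
```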
So we searched the logs of the connector pod on our dev cluster, as well as on a local Compass installation. We found that in each setup there is a consistently occurring behaviour: during the initial start-up of the connector pod (fresh installation or a restart) there is always the same error about fetching the Secrets content into the cache, which happens at the API server level:
level=error msg="Failed to load secret compass-system/connector-service-app-ca to cache: failed to get compass-system/connector-service-app-ca secret, Get "https://<ip>/api/v1/namespaces/compass-system/secrets/connector-service-app-ca": dial tcp <ip>: connect: connection refused"
However, once the first minute passes (the first loading/fetching iteration, so to speak), all certificates present in the Secrets are successfully loaded into the cache, and from that point on the issue does not occur again. Everything starts working as expected.
To us, it seemed that some “initial preparation time” is needed (for whatever reason) so that the loading process can begin successfully without an initial failure. So, two approaches came to mind:

- start the goroutine responsible for the fetching process only after a successful Get from the cache is executed (we would call the cache every second, for example, until it has the certificate available), or
- go a step back and include the secretsRepository alongside the cache in the service layer. That way, if the call to the cache does not find the certificate, we fall back to reading it directly from the Secret through the secretsRepository (see the sketch after this list).
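A minimal sketch of what the second approach could look like, assuming hypothetical cache and secretsRepository interfaces (names and signatures are illustrative, not the actual Connector code):

```go
package certservice

import (
	"context"
	"fmt"
)

// certCache and secretsRepository are illustrative interfaces; the real
// Connector abstractions may look different.
type certCache interface {
	Get(key string) ([]byte, bool)
	Put(key string, data []byte)
}

type secretsRepository interface {
	Get(ctx context.Context, namespace, name string) ([]byte, error)
}

type certificateService struct {
	cache     certCache
	secrets   secretsRepository
	namespace string
	name      string
}

// caCertificate returns the CA certificate used to sign CSRs. On a cache miss
// it falls back to the Secret, which still fails if the API server is
// unreachable at that moment.
func (s *certificateService) caCertificate(ctx context.Context) ([]byte, error) {
	if data, found := s.cache.Get(s.name); found {
		return data, nil
	}

	data, err := s.secrets.Get(ctx, s.namespace, s.name)
	if err != nil {
		return nil, fmt.Errorf("certificate data not found in the cache and reading secret %s/%s failed: %w", s.namespace, s.name, err)
	}

	// Populate the cache so subsequent calls do not hit the API server again.
	s.cache.Put(s.name, data)
	return data, nil
}
```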
To sum up, during this gap - from starting the connector and immediately failing to connect to the API server (to fetch the certificates from the Secret) to the point where they are fetched successfully - whatever action we take, we will never be able to obtain the certificate, simply because there is a connectivity issue at the API server level. From what we described above, it seems that the issue is caused by something outside of the connector: if the connection to the API server fails, then even if we try to read the Secrets directly when the certificates are not present in the cache, that request might fail again.
We might be missing something as well, so any suggestions from your side, or knowledge/experience regarding the Kubernetes-level error, will be more than welcome.
Thanks for doing the investigation @nyordanoff.
I suspected that the issue might be exactly as you described, as we often observe problems with the connection to the API Server when the pod starts, most likely caused by the Istio Envoy sidecar not being initialized yet; hence we often use functions like that in our tests.
In this case, I guess we could either wait for it before starting the API (health checks would indicate that the pod is not ready yet, so the old instance would not be removed) or, as you mentioned, try to fetch the certificate directly from the Secret if it is not found in the cache.
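For the first option, a rough sketch of waiting for the Secret to become readable before marking the pod ready could look like this (assuming a recent client-go where Get takes a context; the helper name and timings are illustrative):

```go
package startup

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// waitForSecret blocks until the CA Secret can be read, retrying every second
// until the timeout expires. While it blocks, the readiness probe keeps
// failing, so the previous Connector instance is not removed yet.
func waitForSecret(ctx context.Context, client kubernetes.Interface, namespace, name string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		_, err := client.CoreV1().Secrets(namespace).Get(ctx, name, metav1.GetOptions{})
		if err == nil {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("giving up waiting for secret %s/%s: %w", namespace, name, err)
		}
		time.Sleep(time.Second)
	}
}
```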