New version of connaisseur 3.4.0 not working in calico cluster

Question

harangar opened this issue 8 months ago · 4 comments

Describe the bug
Because the Connaisseur Helm chart does not expose a hostNetwork parameter, we modified the chart and added hostNetwork to deployment.yaml.
This modified chart worked on Calico clusters up to Connaisseur v3.3.3.
With 3.4.0, however, the Connaisseur pod fails when it tries to connect to Redis.
In the Connaisseur pod, we see the following error:

{
  "level": "error",
  "msg": "redis ping failed: dial tcp: lookup connaisseur-redis-service on 10.1.56.2:53: no such host",
  "time": "2024-04-16T00:41:42Z"
}

And in the connaisseur-redis pod:

Error accepting a client connection: error:0A00010B:SSL routines::wrong version number (addr=192.xx.xx.xx:60752 laddr=192.xx.xx.xx:6379

We also tried enabling hostNetwork on the Redis pod, but that did not resolve the issue; we still saw the same error.

Following are the details of the pod running in connaisseur namespace:

Optional: Versions (please complete the following information as relevant):

Kubernetes Cluster: EKS 1.28
Connaisseur: 3.4.0

Any help here would be greatly appreciated. Thank you

Answer 1 · 2024-04-19T08:50:01.000Z

HOWDY @harangar.

So your hostNetwork change is because of this, right?:

Calico networking cannot currently be installed on the EKS control plane nodes. As a result the control plane nodes will not be able to initiate network connections to Calico pods. (This is a general limitation of EKS's custom networking support, not specific to Calico.) As a workaround, trusted pods that require control plane nodes to connect to them, such as those implementing admission controller webhooks, can include hostNetwork:true in their pod spec. See the Kubernetes API pod spec definition for more information on this setting.
quote from calico docs

With the Connaisseur pods in the hostNetwork, they now use the host's DNS resolver to resolve names instead of the one from your cluster. On startup, Connaisseur tries to reach the Redis service as a check that everything works, but since the Redis service isn't registered in the host's DNS, Connaisseur startup fails.

Now the solution seems to be similar to this. The Connaisseur pods need to use the cluster's DNS again so that the lookup of the Redis service works.

For now I'd say, try to modify the Helm chart so that the Connaisseur pods use the right DNS resolver and see if that works. In the meantime I'll try to figure out how best to make this use case configurable in the chart, so you don't have to make changes manually in the future.
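A minimal sketch of what such a change could look like, assuming the chart's deployment.yaml was already modified to set hostNetwork as described in the question. Kubernetes provides the dnsPolicy value ClusterFirstWithHostNet for pods that run on the host network but still need the cluster's resolver. Where exactly this lands in the Connaisseur template is an assumption, and the surrounding fields are omitted:

```yaml
# Fragment of the pod template in the modified deployment.yaml (other fields omitted).
spec:
  template:
    spec:
      # Already added manually, per the question.
      hostNetwork: true
      # Without this, a hostNetwork pod uses the node's resolver and
      # cannot look up cluster Services such as connaisseur-redis-service.
      dnsPolicy: ClusterFirstWithHostNet
```

Whether this alone is enough in a given cluster is not confirmed here; it only addresses the "no such host" lookup failure.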

Cheers.

Answer 2 · 2024-05-08T15:16:14.000Z

Hi @phbelitz ,

Thank you for your response.
We tried adding dnsConfig in Connaisseur deployment.yaml.

We still saw this error on the Redis pod:

Error accepting a client connection: error:0A00010B:SSL routines::wrong version number (addr=192.xx.xx.xx:60752 laddr=192.xx.xx.xx:6379

After checking, we found that the Redis pod was trying to connect with Datadog, which was running on our cluster. After removing Datadog, we no longer see that error on the Redis pod.

However, Connaisseur is still not working as expected in the Calico cluster even after these changes: it does not block unsigned container images.

Answer 3 · 2024-05-24T15:10:05.000Z

@harangar Hm. I don't know how to fix this problem yet... However, we released a new version of Connaisseur in which the Redis cache is no longer required on startup. That means you can run Connaisseur again, although without the caching capabilities and thus with worse performance. In this case you can disable caching entirely by setting the cache expirySeconds to zero (described here).
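As a rough illustration, the override in a Helm values file might look like the fragment below. The nesting under application.features.cache is an assumption about the chart's layout and is not confirmed in this thread, so check the values.yaml of your chart version:

```yaml
# Assumed key path; verify against the values.yaml shipped with your chart version.
application:
  features:
    cache:
      # Zero disables caching, as described above.
      expirySeconds: 0
```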

Answer 4 · 2024-06-14T19:57:47.000Z

@phbelitz ,

Thank you, setting expirySeconds to 0 worked for Calico as well as vpc-cni clusters, and it didn't deploy the Redis pod.

However, we now hit the following issue when trying to update other deployments after Connaisseur is enabled on the EKS cluster. It occurs specifically when resourceValidationMode is set to podsOnly:

Error: cannot patch "datadog" with kind DaemonSet: Internal error occurred: failed calling webhook "connaisseur-svc.connaisseur.svc": received invalid webhook response: webhook returned response.patchType but not response.patch && cannot patch "datadog-cluster-agent" with kind Deployment: Internal error occurred: failed calling webhook "connaisseur-svc.connaisseur.svc": received invalid webhook response: webhook returned response.patchType but not response.patch

We are able to update the deployments when resourceValidationMode is set to all.

But we want to validate signatures only for the pods in the EKS cluster, hence we set the
resourceValidationMode: "podsOnly"

Could you please help us here?