submariner-io/lighthouse

service discovery not working without gateways

MartijnStraatman opened this issue · 9 comments

What happened:

I deployed only the Lighthouse agents (no gateways) for service discovery by running:

subctl deploy-broker --components service-discovery

I followed the user-guide and deployed the nginx-test pods.
https://submariner.io/operations/usage/

The following connectivity test results in an unknown host exception when the deployment on the local cluster is scaled to zero:

curl nginx.nginx-test.svc.clusterset.local:8080

What you expected to happen:

Since the local cluster is not responding, I expect a response from the remote cluster.

Enabling debug logging on the Lighthouse CoreDNS pods gives more specific output:

[DEBUG] plugin/lighthouse: Request received for "nginx.nginx-test.svc.clusterset.local."
[DEBUG] plugin/lighthouse: Couldn't find a connected cluster or valid IPs for "nginx.nginx-test.svc.clusterset.local."

How to reproduce it (as minimally and precisely as possible):
Just follow the user guide and scale down the deployment on one of the clusters to zero. Run the curl command on the cluster with the scaled-down deployment. Install only the Lighthouse agents, no gateways!

To get it working, apply this workaround on all clusters:

kubectl delete crd gateways.submariner.io

We have code that checks whether the gateways resource is available. If it is not, we assume clusters are always connected. This is the code:

https://github.com/submariner-io/lighthouse/blob/devel/coredns/gateway/controller.go#L85
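The behaviour described above can be sketched roughly as follows. This is a simplified illustration, not the code behind the link: the type `connectivityChecker` and its fields are hypothetical, and it assumes that when the gateways CRD is present but no Gateway resources exist, no cluster is reported as connected.

```go
package main

import "fmt"

// connectivityChecker is a hypothetical stand-in for the plugin's gateway
// tracking; it is not the type used in Lighthouse.
type connectivityChecker struct {
	gatewaysCRDPresent bool            // whether the gateways CRD was found
	connected          map[string]bool // cluster ID -> connected, as derived from Gateway resources
}

// IsConnected reports whether DNS answers for clusterID may be returned.
// Without the gateways CRD every cluster is assumed to be connected.
func (c *connectivityChecker) IsConnected(clusterID string) bool {
	if !c.gatewaysCRDPresent {
		return true
	}
	// Reading a missing key (or a nil map) yields false.
	return c.connected[clusterID]
}

func main() {
	// Service-discovery-only deployment where the CRD was still installed:
	// there are no Gateway resources, so nothing is marked connected.
	sdOnly := &connectivityChecker{gatewaysCRDPresent: true}
	fmt.Println(sdOnly.IsConnected("cluster2")) // false

	// After deleting the CRD, the check is skipped.
	workaround := &connectivityChecker{gatewaysCRDPresent: false}
	fmt.Println(workaround.IsConnected("cluster2")) // true
}
```

Under these assumptions, the first case matches the "Couldn't find a connected cluster" log line, and the second matches the effect of deleting the CRD.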

Likely, we're installing the gateways CRD even on service-discovery-only deployments. @skitt Did we change anything in operator/subctl regarding how CRDs are installed?

The gateways.submariner.io CRD is created when submariner-operator initializes and reconciles. So before deleting the gateways.submariner.io CRD (and enabling debug logging for lighthouse-coredns), submariner-operator must be stopped:

kubectl -n submariner-operator scale --replicas=0 deploy submariner-operator
kubectl delete crd gateways.submariner.io
kubectl -n submariner-operator rollout restart deploy submariner-lighthouse-coredns

Talking about this with @tpantelis, it seems like this will be a bit tricky. But it seems useful for folks who only want SD.

We have to install all the submariner CRDs on operator startup - at that point we don't know what will be installed. We should check for existence of the Submariner resource instead of the gateways CRD. @vthapar I can look into it if you want.
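To make the difference between the two checks concrete, here is a minimal sketch. The struct and function names are hypothetical, and it assumes that a service-discovery-only deployment has the gateways CRD installed but no Submariner resource.

```go
package main

import "fmt"

// clusterState is a hypothetical summary of what exists in a cluster.
type clusterState struct {
	gatewaysCRDInstalled bool // the operator installs all CRDs on startup
	submarinerExists     bool // assumed true only when gateways are actually deployed
}

// checkGatewaysByCRD mirrors the current decision: track gateway
// connectivity whenever the gateways CRD is present.
func checkGatewaysByCRD(s clusterState) bool {
	return s.gatewaysCRDInstalled
}

// checkGatewaysBySubmariner mirrors the proposed decision: track gateway
// connectivity only when a Submariner resource exists.
func checkGatewaysBySubmariner(s clusterState) bool {
	return s.submarinerExists
}

func main() {
	sdOnly := clusterState{gatewaysCRDInstalled: true, submarinerExists: false}
	fmt.Println(checkGatewaysByCRD(sdOnly))        // true: clusters look disconnected
	fmt.Println(checkGatewaysBySubmariner(sdOnly)) // false: clusters assumed connected
}
```

With the proposed check, a service-discovery-only cluster would skip gateway tracking without anyone having to delete the CRD.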

Hi,

The suggested workaround (deleting the gateways CRD) is not stable. It seems that over time the lookups start failing again. After restarting the coredns and lighthouse agent pods, lookups succeed again. Any suggestions for a workaround that keeps things working reliably?

We have to install all the submariner CRDs on operator startup - at that point we don't know what will be installed. We should check for existence of the Submariner resource instead of the gateways CRD. @vthapar I can look into it if you want.

In order for this to work, the CoreDNS plugin cluster role needs permission to list the Submariner resource. This seems OK to me - @vthapar @skitt WDYT?

The suggested workaround (deleting the gateways CRD) is not stable. It seems that over time the lookups start failing again. After restarting the coredns and lighthouse agent pods, lookups succeed again. Any suggestions for a workaround that keeps things working reliably?

The lighthouse CoreDNS plugin only checks for the presence of the gateways CRD on startup, so that sounds like a separate issue not related to the gateways CRD. I'd suggest opening a new issue with the relevant information (e.g. lighthouse CoreDNS pod logs).

skitt commented

In order for this to work, the CoreDNS plugin cluster role needs permission to list the Submariner resource. This seems OK to me - @vthapar @skitt WDYT?

That seems OK to me too!