kubernetes-retired/service-catalog

Logging should be clearer when service catalog fails to connect to etcd

Closed this issue · 10 comments

I encountered a scenario where the service catalog (SC) failed to connect to etcd over TLS because the name in the etcd certificate did not match the hostname used for the connection.

There was no obvious logging even at loglevel 10, and the SC /healthz and /healthz/etcd endpoints did not indicate a problem. The SC itself just seemed to hang, and API requests returned errors like Error from server (Forbidden): clusterservicebrokers.servicecatalog.k8s.io "foo" is forbidden: not yet ready to handle request.

It would be good to add logging -

  • to show etcd connection failures and their details (a rough sketch of such a check is included below)
  • on /healthz
  • to make it clearer which admission controller is failing, when one is failing

@pmorie @jboyd01
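
For illustration only (this is not code from this repo, and the endpoint and CA path are placeholders), a startup check along these lines would have surfaced the failure immediately by doing a real TLS handshake and logging the x509 error, instead of staying silent:

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"flag"
	"fmt"
	"io/ioutil"
	"net"
	"time"

	"github.com/golang/glog"
)

// checkEtcdTLS performs a real TLS handshake against an etcd endpoint and
// logs the underlying error (for example an x509 hostname mismatch) instead
// of leaving the caller to guess why the apiserver is "not yet ready".
func checkEtcdTLS(endpoint, caFile string) error {
	caCert, err := ioutil.ReadFile(caFile)
	if err != nil {
		glog.Errorf("etcd check: cannot read CA file %s: %v", caFile, err)
		return err
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caCert) {
		err := fmt.Errorf("no certificates parsed from %s", caFile)
		glog.Errorf("etcd check: %v", err)
		return err
	}

	dialer := &net.Dialer{Timeout: 5 * time.Second}
	conn, err := tls.DialWithDialer(dialer, "tcp", endpoint, &tls.Config{RootCAs: pool})
	if err != nil {
		// A certificate-name mismatch shows up here, e.g.
		// "x509: certificate is valid for <name>, not <hostname>".
		glog.Errorf("etcd check: TLS handshake with %s failed: %v", endpoint, err)
		return err
	}
	return conn.Close()
}

func main() {
	flag.Parse()
	// Placeholder values; in the apiserver these would come from the etcd
	// storage flags rather than being hard-coded.
	if err := checkEtcdTLS("etcd.example.svc:2379", "/etc/etcd/ca.crt"); err != nil {
		glog.Errorf("etcd is not reachable: %v", err)
	}
}
```

Even just logging the handshake error at the default loglevel would have made the certificate mismatch obvious.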

We're using the upstream checker, and that has no logging.

Did you see any of the log messages in https://github.com/kubernetes-incubator/service-catalog/blob/v0.1.12/cmd/apiserver/app/server/run_server.go#L135-L149 or did it just hang completely with no details?

I didn't, no. I wonder if the etcd client logging is just disabled entirely. Perhaps enabling it when the -v loglevel is above a certain threshold, or something along those lines, might help?
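
For example (only a sketch - it assumes the etcd v3 client's dial and TLS failures surface through the gRPC logger, and the verbosity threshold of 6 is arbitrary):

```go
package server

import (
	"os"

	"github.com/golang/glog"
	"google.golang.org/grpc/grpclog"
)

// enableEtcdClientLogging routes the gRPC client logger (which the etcd v3
// client uses for dial and TLS failures) to stderr, but only when the
// apiserver is running at a high -v level. The threshold of 6 is arbitrary.
func enableEtcdClientLogging() {
	if glog.V(6) {
		grpclog.SetLoggerV2(grpclog.NewLoggerV2(os.Stderr, os.Stderr, os.Stderr))
	}
}
```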

I may be hitting this issue myself; still debugging. However, I do see that the check under CheckEtcdServers() is pretty lame - it just dials the address:port, and if the dial is accepted the connection is closed and the check passes. It could certainly be stronger.
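
Something along these lines would be stronger (just a sketch with made-up parameter names and timeouts): create a short-lived etcd v3 client with the apiserver's TLS config and ask the endpoint for its status, so TLS and auth errors are returned instead of being swallowed by a bare TCP dial.

```go
package server

import (
	"context"
	"crypto/tls"
	"time"

	"github.com/coreos/etcd/clientv3"
	"github.com/golang/glog"
)

// checkEtcdServer is a sketch of a stronger health check: instead of a bare
// TCP dial, it builds an etcd v3 client with the same TLS config the
// apiserver uses and issues a Status request, so certificate and connection
// errors are logged and returned rather than hidden behind "not yet ready".
func checkEtcdServer(endpoint string, tlsConfig *tls.Config) error {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{endpoint},
		DialTimeout: 5 * time.Second,
		TLS:         tlsConfig,
	})
	if err != nil {
		glog.Errorf("etcd health check: cannot create client for %s: %v", endpoint, err)
		return err
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	if _, err := cli.Status(ctx, endpoint); err != nil {
		glog.Errorf("etcd health check: status request to %s failed: %v", endpoint, err)
		return err
	}
	return nil
}
```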

/assign @jboyd01

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.