MS3Inc/tavros

Heartbeat monitoring broken for camel web service 0.2.7+

Opened this issue · 8 comments

Reference to related issue in charts repo.

Upgrading to 0.2.7 of the Tavros Helm chart breaks heartbeat monitoring. The way the virtual port was re-exposed caused duplicate listener errors, which in turn caused a lot of pod uptime problems, so that exposure needs to be removed. However, since Heartbeat runs in the elastic-system namespace, it sits outside both the prod and sandbox meshes. Heartbeat needs to be able to reach the actuator port of each service (example: http://api-test-camel-web-service.dev.svc.cluster.local:8080/actuator/health/liveness) within both the prod and sandbox meshes.

Some ideas from the folks at Kong/Kuma:

  • gateway for each mesh
  • deploy heartbeat inside of each mesh and then somehow pull in the results back into the elastic-system namespace (maybe as ExternalService?)

Acceptance tests:

Given an API and a curl shell deployed in each namespace (dev, test, prod):

| Namespace | Test | Expected result | Actual result | PASS/FAIL |
|---|---|---|---|---|
| PROD shell | curl 'http://api-test-camel-web-service.prod.svc.cluster.local:8080/actuator/health/liveness' | {"status":"UP"} | | |
| PROD shell | curl 'http://api-test-camel-web-service.dev.svc.cluster.local:8080/actuator/health/liveness' | Empty reply from server | | |
| PROD shell | curl 'http://api-test-camel-web-service.test.svc.cluster.local:8080/actuator/health/liveness' | Empty reply from server | | |
| DEV shell | curl 'http://api-test-camel-web-service.prod.svc.cluster.local:8080/actuator/health/liveness' | Empty reply from server | | |
| DEV shell | curl 'http://api-test-camel-web-service.dev.svc.cluster.local:8080/actuator/health/liveness' | {"status":"UP"} | | |
| DEV shell | curl 'http://api-test-camel-web-service.test.svc.cluster.local:8080/actuator/health/liveness' | {"status":"UP"} | | |
| TEST shell | curl 'http://api-test-camel-web-service.prod.svc.cluster.local:8080/actuator/health/liveness' | Empty reply from server | | |
| TEST shell | curl 'http://api-test-camel-web-service.dev.svc.cluster.local:8080/actuator/health/liveness' | {"status":"UP"} | | |
| TEST shell | curl 'http://api-test-camel-web-service.test.svc.cluster.local:8080/actuator/health/liveness' | {"status":"UP"} | | |

&

| Test | Expected result | Actual result | PASS/FAIL |
|---|---|---|---|
| Log into Kibana -> Observability -> Uptime | Uptime monitoring works for prod pods | | |
| Log into Kibana -> Observability -> Uptime | Uptime monitoring works for dev pods | | |
| Log into Kibana -> Observability -> Uptime | Uptime monitoring works for test pods | | |

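
The curl matrix in these acceptance tests follows one rule: a request succeeds only when the shell and the target service are in the same mesh. Below is a rough Python sketch of that rule, assuming dev and test share the sandbox mesh and prod has its own mesh, as described in the comments on this issue. The names `MESH_BY_NAMESPACE`, `liveness_url`, `expected_result` and `acceptance_matrix` are made up for illustration and are not part of Tavros.

```python
# Assumption: namespace-to-mesh mapping inferred from this issue
# (dev and test share the sandbox mesh, prod has its own mesh).
MESH_BY_NAMESPACE = {"dev": "sandbox", "test": "sandbox", "prod": "prod"}


def liveness_url(namespace, service="api-test-camel-web-service", port=8080):
    """Build the in-cluster actuator liveness URL used in the curl tests."""
    return (
        f"http://{service}.{namespace}.svc.cluster.local:{port}"
        "/actuator/health/liveness"
    )


def expected_result(shell_namespace, target_namespace):
    """Expected curl output: UP inside the same mesh, empty reply across meshes."""
    same_mesh = (
        MESH_BY_NAMESPACE[shell_namespace] == MESH_BY_NAMESPACE[target_namespace]
    )
    return '{"status":"UP"}' if same_mesh else "Empty reply from server"


def acceptance_matrix():
    """Return (shell namespace, URL to curl, expected result) for every pair."""
    namespaces = ["prod", "dev", "test"]
    return [
        (shell, liveness_url(target), expected_result(shell, target))
        for shell in namespaces
        for target in namespaces
    ]


if __name__ == "__main__":
    for shell, url, expected in acceptance_matrix():
        print(f"{shell.upper()} shell: curl '{url}' -> {expected}")
```

Running the script prints the nine shell/target combinations so they can be compared against the actual curl output by hand.
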
jam01 commented

Hey @rlratcliffe I believe the kong DPs are already configured to be gateways. Perhaps there's a good way to create a route/service that proxies the request to the probes (wonder if the host can be dynamic). Then possibly make that service only internal to the cluster, maybe IP whitelisting...?

The other solution may also be possible, though I don't exactly remember how/where the heartbeat component runs

hey @jam01 decided to go with a totally different solution for now, as it seemed easier/safer to configure this way, which is to create 3 different heartbeat instances:

  • dev instance with a sidecar in the sandbox mesh
  • test instance with a sidecar in the sandbox mesh
  • prod instance with a sidecar in the prod mesh

this way heartbeat stays inside of the same namespace and each instance looks only at the specific namespace of the pods so there's no conflicts for the instances. they don't seem to take up too many resources, although I only have 1 API in my test cluster. done a lot of tests in my personal cluster and it seems ok. the person I talked to in the kuma slack thought it was an ok approach. I'll create a PR at some point, although #102 would need to be merged first.
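
A rough sketch of the layout described above, written as plain Python data rather than real chart values. Assumptions: the instance names (`heartbeat-dev` etc.) are hypothetical, the instances are taken to live in elastic-system, and the mesh is selected per pod with Kuma's `kuma.io/mesh` annotation. How the actual PR wires this up may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HeartbeatInstance:
    name: str               # hypothetical release/deployment name
    namespace: str          # namespace Heartbeat itself runs in
    mesh: str               # mesh the instance's sidecar joins
    watched_namespace: str  # the only namespace this instance monitors


def plan_instances():
    """One Heartbeat instance per watched namespace, each joined to that namespace's mesh."""
    # Mapping taken from the list above: dev/test -> sandbox, prod -> prod.
    mesh_by_namespace = {"dev": "sandbox", "test": "sandbox", "prod": "prod"}
    return [
        HeartbeatInstance(
            name=f"heartbeat-{ns}",      # assumption: naming is illustrative only
            namespace="elastic-system",  # assumption: instances stay in elastic-system
            mesh=mesh,
            watched_namespace=ns,
        )
        for ns, mesh in mesh_by_namespace.items()
    ]


def pod_annotations(instance):
    """Pod annotations selecting the mesh for the sidecar.

    Only the mesh annotation is shown; how sidecar injection is enabled
    depends on the Kuma version and is left out of this sketch.
    """
    return {"kuma.io/mesh": instance.mesh}


if __name__ == "__main__":
    for inst in plan_instances():
        print(inst.name, inst.watched_namespace, pod_annotations(inst))
```

Each instance watches exactly one namespace, so no two instances should report on the same pods.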

I will make both 0.2.7 and 0.2.8 chart releases pre-releases with notes related to this issue in the meantime.

jam01 commented

Don't quite remember the distinction between dev and test... But yeah, if the resources taken by the sidecars are not significant then no worries, though they may become significant if there are thousands of pods.

Though if the deployments work as sidecars, that means they're deployed in namespaces other than elastic... Which means that daemonsets in the sandbox and production namespaces could also work somehow.

Either way, you obviously already have a functional solution :)

thanks for chiming in :)

might not be an important distinction but, my understanding is the heartbeat instances are still in elastic-system. it's similar, I think, to how in the kong namespace there isn't a mesh defined, but prod and sandbox releases each have sidecars and so each release can communicate with the necessary prod/dev/test namespaces. just by defining the mesh per sidecar.

jam01 commented

decided to keep with stated plan for now.

PR ready: #104