Add `/heartbeat` style HTTP endpoint for load-balancer health checks
Closed this issue · 10 comments
Many public cloud load balancers expect a service to provide an HTTP(S)-accessible endpoint that verifies its healthy operation. This allows the load balancer to automatically add or remove the servers it is aware of from the load-balancer rotation.
Because roughenough is a UDP-only server, it cannot respond to these health checks, and therefore a bad server cannot be removed from rotation.
It would be very helpful if roughenough could (optionally?) also expose an endpoint on HTTP port 80 (e.g. `/heartbeat`) which would return a 200 response if the server is operating normally. This endpoint might even expose a tiny JSON response with additional info, such as the output of a current roughtime timestamp in JSON form, e.g.

```json
{"midpoint": 123, "radius": 1000000000}
```
Thanks.
I have been toying with this idea for a while now. It's genuinely useful. Thanks for bringing it up.
Two approaches come to mind:
- Use a Rust HTTP server crate and run the health check outside the main event loop, or
- Have a listening TCP socket in the `mio` event loop; parse HTTP requests and respond in the event loop
I'm leaning towards approach 2:
- I don't want to add a project dependency for an entire HTTP server (and its downstream dependencies) just to answer 200 to `/heartbeat`
- Roughenough should only answer 200 to `/heartbeat` if the event loop is actually operating, and by extension, more likely than not able to respond to roughtime requests
- The health-check endpoint is for internal use only and will not be exposed to the internet, have to serve more than one request every few seconds, or face DoS attacks.
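The second approach could be sketched roughly as below. This is an illustrative sketch only, using just the standard library for brevity: the actual implementation would register the listener with `mio`'s event loop alongside the UDP socket, and all names here (`HEALTH_RESPONSE`, `serve_one_health_check`) are hypothetical, not roughenough's real API. The key idea is answering a canned HTTP 200 to every connection from inside the running loop:

```rust
use std::io::{Read, Write};
use std::net::{TcpListener, TcpStream};
use std::thread;

// Canned response: no HTTP parsing at all, every connection gets a 200.
const HEALTH_RESPONSE: &[u8] =
    b"HTTP/1.1 200 OK\r\nContent-Length: 0\r\nConnection: close\r\n\r\n";

// In the real server this body would run when the event loop delivers a
// readiness event for the health-check listener's token.
fn serve_one_health_check(listener: &TcpListener) -> std::io::Result<()> {
    let (mut stream, peer) = listener.accept()?;
    println!("health check from {}", peer);
    stream.write_all(HEALTH_RESPONSE)?;
    Ok(()) // dropping `stream` closes the connection
}

fn main() -> std::io::Result<()> {
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let addr = listener.local_addr()?;

    // Simulate a load balancer probing the endpoint once.
    let probe = thread::spawn(move || {
        let mut stream = TcpStream::connect(addr).unwrap();
        let mut reply = String::new();
        stream.read_to_string(&mut reply).unwrap();
        reply
    });

    serve_one_health_check(&listener)?;
    let reply = probe.join().unwrap();
    assert!(reply.starts_with("HTTP/1.1 200 OK"));
    Ok(())
}
```

Because the response is sent from the same loop that serves roughtime traffic, a wedged or crashed event loop stops answering health checks, which is exactly the signal the load balancer needs.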
Thoughts?
I think this satisfies what you described.
Few more thoughts:
- To keep the attack surface minimal, I would prefer not to include an HTTP parser at all and instead have the HTTP health check port unconditionally squirt back an HTTP 200 response to all that connect.
- Health check functionality will be an optional build-time feature, not compiled by default. Keeping with the idea that the health check endpoint is only to be exposed to non-internet, non-hostile networks, the feature will be opt-in, requiring both build-time and runtime steps to enable. The intent is to reduce the chance of unintentionally deploying roughenough with the health check exposed to the internet.
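An opt-in build-time feature along these lines might look like the following in `Cargo.toml`; the feature name `health-check` here is hypothetical, not necessarily what roughenough ships:

```toml
[features]
# Opt-in: not in the default feature set, so a plain `cargo build`
# produces a binary with no health-check code compiled in.
health-check = []
```

Building with `cargo build --features health-check` would then compile the endpoint in, and a runtime config setting would still be required to actually enable it.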
I think that returning an HTTP 200 for all TCP requests on that port would do the trick for our situation. And I think that it does make sense to handle that in the main event loop so the response is in some way tied to the healthy operation of the server.
https://cloud.google.com/load-balancing/docs/health-checks#legacy-health-checks
Having this functionality as a runtime option definitely makes sense to me. I feel less strongly about it being a build time option (not compiled in by default). I don't see much harm in including it in a standard build as long as it requires positive action to enable it in the config. But for our use case we could work with either approach (as we build a docker image of roughenough from source to host from).
I look forward to trying it out!
v1.0.7? ;-)
@grempe look at what just landed, your health check has arrived. ;)
Some polish still required, but give it a go. I'm planning on releasing 1.1 this weekend if time permits.
Great @int08h !
I'll try to look today, but if not, I'll pick up 1.1 and let you know if I have any issues with the LB integration.
Cheers.
So far, so good. I see:
```
❯ curl -i http://127.0.0.1:8000
HTTP/1.1 200 OK
Content-Length: 0
Connection: close
```
I don't think I really have a way to test the negative case (other than shutting down roughenough). When the 1.1 release version comes out I can test it against the real load balancer health checks.
Thanks for implementing.
👍
I did note that the logging output is different for health check logs vs previously existing logs (`_` vs `::`).
```
chainpoint-roughtime | 2018-10-27 01:34:09 INFO [roughenough_server] Max response batch size : 64
chainpoint-roughtime | 2018-10-27 01:34:09 INFO [roughenough_server] Status updates every : 600 seconds
chainpoint-roughtime | 2018-10-27 01:34:09 INFO [roughenough_server] Server listening on : 0.0.0.0:2002
chainpoint-roughtime | 2018-10-27 01:34:09 INFO [roughenough_server] TCP health check : 0.0.0.0:8000
chainpoint-roughtime | 2018-10-27 01:34:13 INFO [roughenough::server] health check from 172.21.0.1:59608
chainpoint-roughtime | 2018-10-27 01:34:13 INFO [roughenough::server] health check from 172.21.0.1:59610
chainpoint-roughtime | 2018-10-27 01:34:22 INFO [roughenough::server] health check from 172.21.0.1:59612
chainpoint-roughtime | 2018-10-27 01:34:47 INFO [roughenough::server] health check from 172.21.0.1:59614
chainpoint-roughtime | 2018-10-27 01:34:48 INFO [roughenough::server] health check from 172.21.0.1:59616
chainpoint-roughtime | 2018-10-27 01:34:50 INFO [roughenough::server] health check from 172.21.0.1:59618
chainpoint-roughtime | 2018-10-27 01:34:52 INFO [roughenough::server] health check from 172.21.0.1:59620
chainpoint-roughtime | 2018-10-27 01:34:53 INFO [roughenough::server] health check from 172.21.0.1:59622
chainpoint-roughtime | 2018-10-27 01:34:55 INFO [roughenough::server] health check from 172.21.0.1:59624
chainpoint-roughtime | 2018-10-27 01:35:02 INFO [roughenough::server] health check from 172.21.0.1:59626
```
Committed in 1.1.0
> I did note that the logging output is different for health check logs vs previously existing logs (`_` vs `::`).
I noticed that too. Artifact of the logging library being used. It violates my aesthetic sensibilities too, but I haven't had time to look more deeply into it.