timdaman/check_docker

Thorough check of services

owk opened this issue · 9 comments

owk commented

Hi @timdaman

I was wondering if it would be possible to expand your service checks in check_swarm? The script only checks if the service is available in the swarm, but it doesn't take into account if any containers are actually scheduled for the service? I hastily glanced over the API reference, and I think it would be possible to use the tasks resource for this. Maybe add a 'tasks'-argument where you can filter on the service or the container name?

Thanks for the great scripts!

owk commented

I added some functions I'm gonna use for our monitoring setup. I shared them in the gist below:

https://gist.github.com/owk/bab9d45cc60eba6295fe4a3cb8176c6f

It's not the cleanest code, but it works :)

Thanks, that looks interesting. One thing I am noticing is the logic for selecting the status. I don't work with swarm much so I may be missing something. The way I see it, if the service is not running we are in a critical state. The way I see it, OK means things are working, WARNING means things are working but with less performance margin than desired, and CRITICAL means things are not working. If a service is 'preparing' would it not be non-functional? Normally services start quickly but if took a very long time it would clearly be critical until it was running.

owk commented

Good point, maybe I ought to set 'preparing' as a CRITICAL status. I expect one would only see this status if the check interval is quite low, or if there is indeed a problem. Thanks for that!

If I understand the issue right, there's also the possibility to use --running check from check_swarm.py, but only after significant change of logic, to allow old instances of the same service in not-running state - see operasoftware@2b1fc15 for example implementation.

FWIW, after I'm done cleaning up these changes, I'm going to open another issue to review all the changes I've made, because they can be useful in general for Swarm setups.

from what i can see in the script, you're only determining service health by the HTTP return code, which in fact only signals swarm health (and even that is a rather limited swarm health check). Consider this example with a failed service:

$ docker service ls | grep testservice

wapxwn0h5i2l        testservice                         replicated          0/1                 testimage:latest

(the container failed to start due to an unsatisfied node constraint, but that could be anything).

You're doing a simple HTTP API check, similar to this:

$ curl -v --unix-socket /var/run/docker.sock http://localhost/services/testservice
> GET /services/testservice HTTP/1.1
< HTTP/1.1 200 OK
(normal service details follow)

200 here doesn't mean the service is green - it is in fact completely down - 200 only means that swarm was able to serve your request.

The service check is somewhat misleading the way it is now; putting that in a production environment as the primary health check is obviously the technician's fault and not yours, but still this will inevitably happen (as it did on the system i'm investigating right now). Perhaps it would be best to remove the service check option until it's fixed to help people avoid getting hurt?

@nksupport So sorry to hear this slowed your investigation down, that is exactly not what i had hoped it would do. Indeed I see the issue you mentioned, not good.

This was written for a previous job so I don't actually use this code much anymore but given the seriousness of the issue I will see if I can get a test environment up to test fixes.

Thank you such a detailed explanation of the problem, it really helps.

@nksupport , I have a pushed an updated swarm check. It is not fully tested but I figured I should get it to you quickly given the impact.
https://github.com/timdaman/check_docker/tree/swarm_fix

Below is a direct link to the updated swarm check, you can run it like this by hand, python3 check_swarm.py <Your options>

https://github.com/timdaman/check_docker/blob/edcacedef5ae8e6354962d151dce1cbe50483240/check_docker/check_swarm.py

@nksupport, update, that check had another bug which I found when I added unit tests. Here is the updated version which I intend to ship in the next few days. If you have a moment would love to know if it looks good to you.

https://github.com/timdaman/check_docker/raw/swarm_fix/check_docker/check_swarm.py

I believe version 2.2.1 should resolve this issue. Let me know if you see otherwise.