linsomniac/nanomon

Recovery of one service coinciding with the failure of another is not reported.

ChrisHeerschap opened this issue · 3 comments

This is a bit of an edge case, but I've found an issue where a service that recovers at the same time another service fails produces neither a recovery alert nor a new-failure alert, since the total number of failures doesn't change.
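
To illustrate with a hypothetical sketch in Python (not nanomon's actual code): if only the total failure count is compared between runs, the two simultaneous changes cancel out.

prev = {'exit1': 1, 'exit2': 0, 'exit3': 0}  # exit1 is failing
curr = {'exit1': 0, 'exit2': 1, 'exit3': 0}  # exit1 recovered, exit2 now failing

if sum(curr.values()) != sum(prev.values()):
    print('state changed, alert')   # never reached: both totals are 1
else:
    print('no change detected')     # the bug: two real changes cancel out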

I confirmed this via testing, using the following config file:

#statusfile('/var/lib/nanomon.status')
statusfile('/home/cmh/git/nanomon/status')
mailto('nanomon@example.com')
mailfrom('nanomon@example.com')

command('/home/cmh/git/nanomon/testexit /home/cmh/git/nanomon/exit1',
    success = 0)
command('/home/cmh/git/nanomon/testexit /home/cmh/git/nanomon/exit2',
    success = 0)
command('/home/cmh/git/nanomon/testexit /home/cmh/git/nanomon/exit3',
    success = 0)

The "testexit" script simply exits with the status found in the file listed in arg1. To start, all three files (exit1, exit2, exit3) have "0", so several runs of nanomon result in no errors, but a good status file. Changing one of the files to "1" and re-running causes nanomon to note the failure but not alert. Several more runs cause the number of failures to get to the alerting threshold, at which point it mails out. More runs after that don't produce output since the alert has been sent.

At this point, changing the file containing "1" back to "0" (so that service "recovers") while changing another file from "0" to "1", then re-running nanomon, produces neither a recovery email nor any notice of the new failure.

I know this is an edge case and would probably require a fairly significant rewrite to address, but I thought you might like to know about it. I logged my testing in a text file (with diffs of status.dat after each run), if that would help.

I've committed a new test that I think reproduces this. Can you eyeball it and see if it looks like what you are talking about?

I think that test correctly identifies the problem you are reporting. I've disabled it for now since, as you surmise, nanomon's design identifies failures in the SYSTEM, not in the SERVICES. The easiest way to make this change would probably be to keep a status file for every service rather than a single file, perhaps by hashing the command into a signature and using that signature in the file name; failure checking would then move from the system to the service.
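
Something like this, purely as a hypothetical sketch (the function name and status directory here are made up for illustration, not code from nanomon):

import hashlib

def status_filename(command, status_dir='/var/lib/nanomon'):
    # Hash the command line into a stable signature so each monitored
    # service gets its own status file rather than sharing one.
    sig = hashlib.sha256(command.encode()).hexdigest()[:16]
    return '%s/status-%s.dat' % (status_dir, sig)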

However, this is not a change I'm prepared to make. I have refactored a lot of the code to make such a change easier, though.

Haven't had a chance to look at the change yet, but based on what you're saying, that's exactly what I was guessing, and I'd guessed the same type of approach to address it. It's an edge case, so I'm not tremendously worried about it; going too deep down that rabbit hole would get away from the "nano" part of nanomon. :D