hvisage/metalog

If defined script in metalog.conf is missing, metalog can make the system unusable

Closed this issue · 9 comments

Hi!
Let's asume, you define a section in metalog.conf, which references an external script.
It's defined as last section for a catch-all section.

System:
        break           = 1
        command         = "/sbin/consolelog.sh"
        facility        = "*"
        logdir          = "/var/log/metalog"

If that script is missing, for example wrong path, metalog can go very bad and make the whole system unbootable . Unbootable means, that the system hangs on various daemons to start during boot. It affects every service, which needs to log into the system logger.

The only difference I see, that "ps aux" reports a defunct on the new metalog version on those systems, where it hangs. So it makes sense, that somehow every service hangs forever, when it wants to log into the defunct logger.

System working:
root      8029  0.0  0.0   3020  1224 ?        Ss   12:21   0:00 metalog [MASTER]
root      8030  0.0  0.0   3020   120 ?        S    12:21   0:00 metalog [KERNEL]
System failing:
root      4403  0.0  0.0   3020  1356 ?        Ss   12:20   0:00 metalog [MASTER]
root      4404  0.0  0.0   3020   120 ?        S    12:20   0:00 metalog [KERNEL]
root      4558  0.0  0.0      0     0 ?        Z    12:20   0:00 [metalog] <defunct>

After running strace (Log: https://pastebin.com/3dkWNT04), you can see, that metalog locks complety and tries all the time to run the script, which fails, as it does not exist.

If it does not lock up the system boot, it seems still to have problems, as it fails to log most of the stuff.

Feb 15 03:19:30 [metalog] Child [1091] exited with return code 127.
Feb 15 03:19:30 [metalog] Child [1092] exited with return code 127.
Feb 15 03:20:50 [su] pam_unix(su:session): session closed for user man
Feb 15 03:20:50 [metalog] Child [13244] exited with return code 127.
Feb 15 10:15:35 [uptimed] moving up to position 40: 0 days, 10:56:01
Feb 15 10:15:35 [metalog] Child [6662] exited with return code 127.
Feb 15 12:00:01 [start-stop-daemon] Will stop /usr/bin/lcd4linux
Feb 15 12:00:01 [start-stop-daemon] Will stop PID 3593
Feb 15 12:00:01 [metalog] Child [32369] exited with return code 127.
Feb 15 12:00:01 [metalog] Child [32370] exited with return code 127.
Feb 15 12:00:01 [start-stop-daemon] Sending signal 15 to PID 3593
Feb 15 12:00:01 [metalog] Child [32371] exited with return code 127.
Feb 15 12:00:01 [start-stop-daemon] Will stop /usr/bin/lcd4linux
Feb 15 12:00:01 [metalog] Child [32458] exited with return code 127.

I don't think, that metalog should lock up, if target script is not found.

Could you please pastebin your complete metalog.conf?

The part of the metalog.conf you pasted above, says: "Log everything that gets reported and start the script /sbin/consolelog.sh"
Metalog will do exactly that. Because of the failing start of the script, metalog will produce a log entry "Child XYZ exited ..." which will lead to another attempt to start the script - which will fail again :-)

You could avoid this circle by adding a metalog section for the logging of metalog itself, that doesn't try to start this non existing script:
Metalog:
program = "metalog"
logdir = "/var/log/metalog"
break = 1

Metalog could even log into the same directory as your System section does.

Could you please pastebin your complete metalog.conf?

# Metalog

# Einstellungen
maxfiles                = 7
maxsize                 = 1048576
maxtime                 = 86400

# System
System:
        break           = 1
        command         = "/sbin/consolelog.sh"
        facility        = "*"
        logdir          = "/var/log/metalog"

The part of the metalog.conf you pasted above, says: "Log everything that gets reported and start the script /sbin/consolelog.sh"
Metalog will do exactly that. Because of the failing start of the script, metalog will produce a log entry "Child XYZ exited ..." which will lead to another attempt to start the script - which will fail again :-)

Yes, this is intended, to log everything into file and output in console. It was just an mistake, that I did places the script into a wrong name. But I don't think, that if that script is missing, that every daemon, which tries to log into system log, just freezes and does nothing, since it's somehow blocked due metalog. Isn't that basically a bug? Only a kill helps to stop that daemon, which tries to log.

You have to kill the processes that produce the logs, so that the system is usable again?
I could not reproduce your problem, even when I installed a "command" with a non existing script. If metalog really is responsible for your non responsive system, then my guess is it's about the load, the recursive logging causes.
Does my suggestion about the metaglog section prevent the problem? If this is not the case, we have to dig further....

You have to kill the processes that produce the logs, so that the system is usable again?

No. I have to kill metalog.

Does my suggestion about the metaglog section prevent the problem?

Haven't tried. Will do and report back.

To completely avoid this my only idea would be, to limit the allowed rate of spawning a configured command per default. That is not a perfect solution, as a sensible rate can be quite different depending on the command you want to spawn.

Have you had the chance to avoid this problem?
Do you have another idea, how this could be solved?

I thought about checking for the existance of the program given by "command", e.g. with stat(). That would only make sense when the config gets parsed. If only checked immediatelly before executing the command it would not help very much, because, the resulting system load would be about the same.
In case the programm doesn't exist at metalog start time, there are two options:

  • simply ignore this part of the config. This could be weired in debugging, if the programm comes to existance at a later time.
  • exit metalog completely. This will deny execution of scripts, that get created at a later time.

What poison should it be? Or should we just leave it as it is right now?

Thanks for your thoughts.

simply ignore this part of the config. This could be weired in debugging, if the programm comes to existance at a later time.

Personally, I would like to have this option. Maybe an additional one-time log entry at start would be nice, so you know, that this section is being skipped.

exit metalog completely. This will deny execution of scripts, that get created at a later time.

But If you prefer that option, I would be also fine with that.

What poison should it be? Or should we just leave it as it is right now?

I think, we need some solution, I would not leave it as it is.

Please review the merge request!

I guess we also should think about introducing SIGHUP, which should reread the config file. Currently metalog must be restarted after the missing command (script) has been created. A SIGHUP would do the same trick. It would also be the first step to multiple config files, as suggested / wished by vaygr (include support).

Eventually hvisage could also comment? It really feals weird to merge own requests on a project, where I just where given the rights to do so a few years ago :-)

BTW: How would you think about a new release (doing that feels even more weird...)?

Please review the merge request!

Looks good on my first sight. Please note, I am not an expert in programing ;-)

I guess we also should think about introducing SIGHUP, which should reread the config file. Currently metalog must be restarted after the missing command (script) has been created. A SIGHUP would do the same trick. It would also be the first step to multiple config files, as suggested / wished by vaygr (include support).

I like the idea! Include support would be awesome!

BTW: How would you think about a new release (doing that feels even more weird...)?

Why not? Still using after more than 10 years metalog :-)