scalyr/scalyr-agent-2

Monitors DEBUG level logging breaks / stops after reloading agent config in the main loop

Kami opened this issue · 0 comments

Kami commented

While looking into to the memory leak issue today I observed weird behavior / bug with logging from the monitors.

To be able to track down issues with tcollector, I added some log statements to the main loops and other methods which stopped showing up in the agent debug log after the agent main loop iteration (aka after we check the config file on disk and verify if there are any changes).

I was able to track down the issue to the __verify_config method (https://github.com/scalyr/scalyr-agent-2/blob/master/scalyr_agent/agent_main.py#L355), more specifically a place where we create new manager classes.

I haven't track down the exact root cause yet, but it's likely related to some place where we reload some module and potentially re-create the log handlers or loggers which become regular loggers instead of AgentLogger class instances (possibly related to this comment - https://github.com/scalyr/scalyr-agent-2/blob/master/scalyr_agent/agent_main.py#L61).

At first I thought it's related to self.__update_debug_log_level method, but this doesn't appear to be the case.


Here is an example change which replicates this issue - 305790b

If you run the agent with that change, you will see "[tcollector.py:853] inside tcollector main loop" log messages until the config is reloaded aka "[agent_main.py:1084] Config was not different than previous" message is logged.

As soon as that message is logged (and MonitorsManger class is re-created) DEBUG logging from all the monitors will stop showing up in the agent log files (monitors will continue to run though).

Example output:

==> /home/kami/scalyr-agent-dev/log/agent_debug.log <==
2020-02-26 12:40:06.103Z DEBUG [monitor:linux_system_metrics()] [tcollector.py:325] collector loop run
2020-02-26 12:40:06.193Z DEBUG [monitor:linux_system_metrics()] [tcollector.py:848] inside tcollector main loop
2020-02-26 12:40:07.104Z DEBUG [monitor:linux_system_metrics()] [tcollector.py:325] collector loop run
2020-02-26 12:40:08.105Z DEBUG [monitor:linux_system_metrics()] [tcollector.py:325] collector loop run
2020-02-26 12:40:08.195Z DEBUG [monitor:linux_system_metrics()] [tcollector.py:848] inside tcollector main loop
2020-02-26 12:40:09.080Z DEBUG [core] [agent_main.py:1008] Checking for any changes to config file
2020-02-26 12:40:09.085Z DEBUG [core] [agent_main.py:1084] Config was not different than previous
....
# No more monitor DEBUG log lines after that

You can confirm that the issue is indeed related to re-creating of MonitorsManager class by using this change - d45c566.

With this change, you will continue to see "[tcollector.py:853] inside tcollector main loop" messages in the log file even after reloading the config.

Example output:

2020-02-26 12:41:13.846Z DEBUG [monitor:linux_system_metrics()] [tcollector.py:848] inside tcollector main loop
2020-02-26 12:41:14.737Z DEBUG [core] [agent_main.py:1008] Checking for any changes to config file

==> /home/kami/scalyr-agent-dev/log/agent.log <==
2020-02-26 12:41:14.747Z INFO [core] [agent_main.py:373] verify_config - creation of monitors manager disabled

==> /home/kami/scalyr-agent-dev/log/agent_debug.log <==
2020-02-26 12:41:14.749Z DEBUG [monitor:linux_system_metrics()] [tcollector.py:325] collector loop run
2020-02-26 12:41:15.750Z DEBUG [monitor:linux_system_metrics()] [tcollector.py:325] collector loop run
# Debug level log messages from monitoring appear normally after reloading the config

I had a brief look at the commit history and changes to that file and it appears this bug hasn't been introduced recently and has been there for a while already.