microsoft/windows-container-tools

File watcher hangs

sumantfordev opened this issue · 21 comments

I have a windows docker image and started using this to stream logs from my application log foler to std out. It all work fines but sometimes it stops pushing logs to std out.
At first i thought it might be no new logs but when i open the files , new lines are continuously getting added by the actual application but Logmonitor doesnot puts those new lines anymore. Everytime i have to restart the container. My configuration json file is pulling logs from Eventvwr and log folder. The eventvwr keeps working though.

What could be the possible reasons? Can there be a possibility of file locking which is not allowing the logmonitor to read new lines? Need some urgent guidance on this.

We have the same issue. Did you get any solution? or a workaround?

No solution yet. Are you facing this in docker or kubernets environment?
For us in local we use plain docker deployment we have not faced the issue, The issue is in production where we use kubernets.

In all the environments. When we are not using Kubernetes in my local machine for example LogMonitor stops sending the log content to the standard output, therefore, we don't see anything in the docker logs. We changed the log4net config to use small files etc but still. We are not able to fully reproduce but it happens after running for a couple of days.

Are you using any docker logging driver https://docs.docker.com/config/containers/logging/configure/ ?
For us what we are seeing is windows container stucks writing log to the file ( using the json logging file driver). And that's when logs stops coming in the std out. So its not actually teh file watcher that is hanging but the stdout is stuck.
I can somewhat relate this issue to this windows container issue- microsoft/Windows-Containers#106 .

I think we have an issue with LogMonitor reading the log4net log file.

In our case, we are using a log4net.Appender.RollingFileAppender. Sometimes when the file gets the maximum size and it's replaced LogMonitor stops reading the file and logs are not shipped to the STDOUT.

I know LogMonitor still working because I can see entries from the EventLog. Sometimes randomly LogMinitor is able to read the file and send the logs to the STDOUT.

    <appender name="JsonFileAppender" type="log4net.Appender.RollingFileAppender">
	    <file type="log4net.Util.PatternString" value="%property{ApplicationPath}\logs\application-logs.json"/>
	    <appendToFile value="true" />
	    <rollingStyle value="Size" />
	    <datePattern value="yyyy-MM-dd.lo\g" />
	    <maxSizeRollBackups value="20" />
	    <maximumFileSize value="40KB" />

	    <!-- logs to the same file always and rolls the others -->
	    <staticLogFileName value="true" />
	    <layout type="log4net.Layout.SerializedLayout, log4net.Ext.Json">
                         ..
	    </layout>
    </appender>
{
  "LogConfig": {
    "sources": [      
      {
        "type": "File",
        "directory": "******\\logs",
        "filter": "*.json",
        "includeSubdirectories": false
      },
      {
        "type": "EventLog",
        "startAtOldestRecord": true,
        "eventFormatMultiLine": false,
        "channels": [
          {
            "name": "system",
            "level": "Information"
          },
          {
            "name": "application",
            "level": "Error"
          },
          {
            "name": "application",
            "level": "Warning"
          }

        ]
      },
      {
        "type": "ETW",
        "providers": [
          {
            "providerName": "IIS: WWW Server",
            "ProviderGuid": "3A2A4E84-4C21-4981-AE10-3FDA0D9B0F83",
            "level": "Information"
          },
          {
            "providerName": "Microsoft-Windows-IIS-Logging",
            "ProviderGuid ": "7E8AD27F-B271-4EA2-A783-A47BDE29143B",
            "level": "Information",
            "keywords": "0xFF"
          }
        ]
      }
    ]
  }
}

@profnandaa review #26 as well

Is the file watcher not safe to use until this issue is fixed? @iankingori

Hey @knunery, we are actively working to resolve this as it also impacts log rotation. As for whether it is safe, I haven't seen an instance where the logs fail to write to file due to log monitor, we only fail to read and write to stdout so the logs can still be accessed on file.

We are also experiencing this. We've logs set up to rotate hourly and so far there doesn't seem to be a pattern to when or why LogMonitor stops sending new logs to stdout.
If there are any settings we can use to get debug output from LogMonitor or anything I can do to assist troubleshooting, let me know. I don't enough know C++ to be of programming help :).

Hey @seamustheseagull, we don't currently have a debug output option but that would be great to add. A couple of questions to help us repro the issue:

  • Are you using version 1.1 or 1.2 of log monitor?
  • What OS Version are you running on?
  • What are you using for your log rotation (e.g log4net, log4rs....) and what is your log rotation config beyond the hourly cycles?

Hi @iankingori

  • We're running v1.1 but in the process of moving to 1.2
  • Host & Containers are Windows Server 20H2 (I know)
  • We're using NLog. Rotates hourly, and retains the last six log files. On rotation it's renamed from file.log.json to file.{#}.archive.json

One of the curious things I've noticed in the LogMonitor output is every so often there'll be a message like this:

[2022-09-23T13:00:10.000Z][LOGMONITOR] ERROR: Error in log file monitor. Failed to open file \\?\c:\AppLogs\<appname>\<appname>.75.archive.json. Error: 32

Error 32 is a file access error, and given it happens on the hour, then this error likely occurs when the log rotation deletes the file. So that's to be expected.

What it's unexpected for me is that the file filter in LogMonitorConfig.json is "*.log.json".

So LogMonitor shouldn't be monitoring "*.archive.json". But I don't know the nuances of file handles - perhaps when the file is renamed on rotation, LogMonitor's handle to it is retained? Maybe that's relevant to the issue, maybe not.

Anything other questions, fire away.

The problem in our case seems to be caused by massive number of writes to log files. From what we understood from the source code, LogMonitor captures all events related to file system in a given folder to specific files (if the filter is defined)
Generating thousands writes a minute to a single file seems to lead to some sort of overflow in the internal queue of LogMonitor causing it to stop reading the file at all.
We have worked it around by tweaking the configuration of log4net to generate a new file every minute. We also had to introduce our own appender due to a bug in log4net. So our solution looks like this:

namespace IntApp.Wilco.Common.Logs
{
    using System;
    using System.IO;
    using System.Linq;

    using log4net.Appender;
    using log4net.Util;

    public class CloudFileAppender : RollingFileAppender
    {
        /// <summary>
        /// The adjust file before append.
        /// </summary>
        protected override void AdjustFileBeforeAppend()
        {
            base.AdjustFileBeforeAppend();

            var logsDirectory = Directory.GetParent(this.File).FullName;
            var logFiles = Directory.GetFiles(logsDirectory, "*.log");

            if (logFiles.Length > this.MaxSizeRollBackups)
            {
                logFiles
                    .OrderBy(System.IO.File.GetLastWriteTime)
                    .Take(logFiles.Length - this.MaxSizeRollBackups)
                    .ToList()
                    .ForEach(DeleteFile);
                LogLog.Debug(typeof(CloudFileAppender), $"Deleting {logFiles.Length} old log files");
            }
        }
    }
}
  <appender name="CloudTimeBasedRollingFileAppender" type="Intapp.Wilco.Common.Logs.CloudFileAppender, Intapp.Wilco.Common">
    <file value="logs\" />
    <appendToFile value="true" />
    <maxSizeRollBackups value="180" />
    <datePattern value=".MM.dd.yyyy.HH.mm'.log'" />
    <preserveLogFileNameExtension value="true" />
    <rollingStyle value="Date" />
    <staticLogFileName value="false" />
    <lockingModel type="log4net.Appender.FileAppender+ExclusiveLock"/>
    <layout type="IntApp.Wilco.Common.Logs.JsonLayout, IntApp.Wilco.Common">
    </layout>
  </appender>

After delivering all of this to Prod, we discovered that now generating tons of log records leads to terrible performance of our app. We have not figure out whether it was caused by application writes to the file system or LogMonitor furiously reading from generated files but after we rolled back to our old approach with a filebeat sidecar, things came to normal.
So I hope besides fixing the problem with File Watcher hanging from time to time, performance will also be checked

Thanks for the detailed info @DmitrySenin. We've had trouble getting this to repro on our end before. I'll try use your setup and see whether I can repro. Please share a sample of your FileMonitor configuration and base image.

@iankingori , here it is

{
    "LogConfig": {
      "logFormat": "custom",
      "sources": [
        {
          "type": "EventLog",
          "eventFormatMultiLine": false,
          "startAtOldestRecord": true,
          "channels": [
            {
              "name": "application",
              "level": "Information",
              "higher": true
            },
            {
              "name": "system",
              "level": "Information",
              "higher": true
            }
          ],
          "customLogFormat": "{'timestamp':'%TimeStamp%', 'message':'%Message%', 'eventLog':{'eventId':'%EventId%'}, 'level':'%Severity%'}|json"
        },
        {
          "type": "File",
          "eventFormatMultiLine": false,
          "directory": "c:\\logs",
          "filter": "*.log",
          "customLogFormat": "%Message%"
        },
        {
          "type": "ETW",
          "eventFormatMultiLine": false,
          "providers": [
            {
              "providerName": "IIS: WWW Server",
              "providerGuid": "3A2A4E84-4C21-4981-AE10-3FDA0D9B0F83",
              "level": "Information",
              "higher": true
            },
            {
              "providerName": "Microsoft-Windows-IIS-Logging",
              "providerGuid": "7E8AD27F-B271-4EA2-A783-A47BDE29143B",
              "level": "Information",
              "higher": true
            }
          ],
          "customLogFormat": "{'timestamp':'%TimeStamp%', 'message':'%EventData%', 'etw':{'eventId':'%EventId%', 'providerId':'%ProviderId%', 'providerName':'%ProviderName%', 'executionProcessId':'%ExecutionProcessId%', 'executionThreadId':'%ExecutionThreadId%', 'keyword':'%Keyword%'}, 'level':'%Severity%'}|json"
        }
      ]
    }
  }

From what I recall, 1k log records a second was enough to replicate the problem

Which base container image are you using and is this happening on docker locally or just on Kubernetes in production?

@iankingori ,
I just realised that for my experiments I was using a different config

{
  "LogConfig": {
    "logFormat": "custom",
    "sources": [
      {
        "type": "File",
        "directory": "C:\\projects\\Fiddle\\Fiddle\\bin\\Debug\\net7.0\\Logs",
        "filter": "*.log",
        "includeSubdirectories": false,
        "includeFileNames": false,
        "customLogFormat": "%Message%"
      }
    ]
  }
}

I reproduced the problem locally without docker or anything. Just LogMonitor with the config above started from a terminal, and a side app generating tons of logs to a single file.
We didn't try that with Docker locally and observed the problem on Kubernetes in Production. Our based image is mcr.microsoft.com/dotnet/framework/aspnet:4.7.2-20210209-windowsservercore-ltsc2019

@iankingori , I found the project which I used to repro the problem locally. Here it is Fiddle.zip. It's a C# app generating 1k of beefy log records a second via log4net to a file which rolls over hourly. You will also find LogMonitor config inside but you will need to correct the value of directory
I run LogMonitor first then the app. After some time, LogMonitor stops outputting anything to the console but the app keeps writing logs to the file

bchew commented

We also have encountered the same issue similar to what @DmitrySenin has reported (one of our app processes generated >1k log records and the LogMonitor process would stop outputting anything to stdout after that).

Tried it with Docker locally and the same issue was present. Have tried a few workarounds as mentioned for log4net - log rotation being set to small files, disabling log rotation altogether and having it output to a single log file and it does seem that the large volume of records would be able to reproduce this issue.

Our base image is mcr.microsoft.com/dotnet/framework/aspnet:4.8-20231010-windowsservercore-ltsc2022.

Just checking if any work is going on for this problem?

Hey @DmitrySenin. we were able to successfully repro using the steps you shared and are working on fixing the issue.

Hi, any updates on this issue?