openshift/origin-aggregated-logging

Logging - possibility of losing logs

Closed this issue · 7 comments

Let's say I have a container that logs heavily. The configuration supported by Red Hat uses JSON files under /var/log/containers, but these will eventually eat the whole filesystem, because the logs are only deleted after pod deletion. One way to combat this is to set max-size.
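For reference, with the json-file driver the cap is set through log-opts in docker's daemon.json; a minimal sketch (the 50m value is just this scenario's number, not a recommendation):

```json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "50m"
  }
}
```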

Let's imagine this scenario (for demonstration, each log entry is 1 MB and max-size is 50 MB):

  1. the container's log on the node is 49.5 MB; fluentd's position is at EOF
  2. the container logs 1 MB
  3. the log on the node is now 50.5 MB; fluentd reads it and tries to forward to ES, but some problem occurs (network failure, ES down, whatever) -> no data has been sent
  4. the container logs 1 MB
  5. the docker daemon checks the log file against max-size before writing => it truncates the file to 0 (https://github.com/moby/moby/blob/77faf158f5265711dbcbff0ffb855eed2e3b6ccd/daemon/logger/loggerutils/logfile.go#L174)
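The race in steps 1-5 can be reproduced outside of docker in a few lines of Python; `append` plays the daemon (truncate-before-write) and `collect` plays fluentd with a position file. All names and sizes here are mine, purely illustrative:

```python
import os
import tempfile

MAX_SIZE = 50  # bytes; stands in for docker's max-size (tiny for the demo)

fd, path = tempfile.mkstemp()
os.close(fd)

def append(line):
    # Writer side: like the docker daemon, check the size *before* writing
    # and truncate the file to 0 when max-size is exceeded.
    if os.path.getsize(path) > MAX_SIZE:
        open(path, "w").close()  # rotation by truncation: unread bytes are gone
    with open(path, "a") as f:
        f.write(line + "\n")

pos = 0  # collector's saved byte offset, like a fluentd pos file

def collect():
    # Reader side: resume from the saved offset and return any new lines.
    global pos
    with open(path) as f:
        if pos > os.path.getsize(path):  # file shrank underneath us
            pos = 0                      # best effort: restart from the top
        f.seek(pos)
        data = f.read()
        pos = f.tell()
    return data.splitlines()

append("line-1" * 5)  # 31 bytes on disk
append("line-2" * 5)  # 62 bytes; the collector is behind (say, ES is down)
append("line-3" * 5)  # size check trips -> truncate to 0, then write line-3
got = collect()       # collector recovers, but line-1 and line-2 never made it
os.unlink(path)
```

After the last `append`, `collect()` returns only the line-3 entry: everything written while the collector was behind is unrecoverable, which is exactly the loss window described above.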

The same idea applies to dead containers: the k8s GC could delete a dead container before its data has been sent to ES (maximum-dead-containers-per-container defaults to 1).
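As a partial mitigation (not a real fix), the GC window can be widened with the kubelet's garbage-collection flags; an illustrative sketch (values are arbitrary):

```
--maximum-dead-containers-per-container=2
--minimum-container-ttl-duration=5m
```

This only lowers the probability of losing a dead container's logs; it does not tie deletion to collector acknowledgment.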

Is there any way to truncate/rotate/delete logs on nodes based on an acknowledgment from fluentd that the data has been successfully sent, or any idea how to make this work 100% so that not a single log line is lost?

richm commented

Is there any way to truncate/rotate/delete logs on nodes based on an acknowledgment from fluentd that the data has been successfully sent, or any idea how to make this work 100% so that not a single log line is lost?

I don't know, but I would like to know. Have you tried asking the upstream fluentd community?

Note that OpenShift 4.x uses CRI-O instead of docker. CRI-O has max-size and rotation parameters, but I'm not sure how to configure them.
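For what it's worth, on the kubernetes side the CRI-O log cap appears to be driven by the kubelet rather than the runtime; an unverified KubeletConfiguration sketch (field values are illustrative):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
containerLogMaxSize: 50Mi
containerLogMaxFiles: 3
```

Note this is still rotation by size, with no acknowledgment from the collector, so the loss window described above remains.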

Note that logging 4.2 will support rsyslog in addition to fluentd.

richm commented

@portante I think this is related to what you have been investigating.

@alanconway is there work here to be done on the collector side to resolve this or is this purely related to the runtime work you started?

@camabeh

Is there any way to truncate/rotate/delete logs on nodes based on an acknowledgment from fluentd that the data has been successfully sent, or any idea how to make this work 100% so that not a single log line is lost?

We strive to collect all logs from the system, but we make no guarantees.

This problem would likely be solved by a solution like the one proposed for conmon [1].

[1] containers/conmon#84

Closing; this issue is to be resolved by the implementation of containers/conmon#84.