BUG: 0.20.2 chart is stuck at some point and cannot recover
Closed this issue · 2 comments
Hi,
I upgraded to 0.20.2 a week ago and since then logs haven't come through. When I checked the logz.io agent logs I saw this error coming up all the time:
```
2023-06-01 22:34:28 +0000 [error]: #0 [out_logzio] Error while sending POST to https://listener.logz.io:8071?token=REDACTED: {"malformedLines":0,"oversizedLines":1,"successfulLines":233}
2023-06-01 22:34:28 +0000 [debug]: #0 [out_logzio] taking back chunk for errors. chunk="..."
2023-06-01 22:34:28 +0000 [warn]: #0 [out_logzio] failed to flush the buffer. retry_times=178 next_retry_time=2023-06-01 22:35:01 +0000 chunk="5fc71397291af524ebf9f3ce631679d6" error_class=RuntimeError error="Logzio listener returned (400) for https://listener.logz.io:8071?token=REDACTED: {\"malformedLines\":0,\"oversizedLines\":1,\"successfulLines\":233}"
2023-06-01 22:34:28 +0000 [warn]: #0 suppressed same stacktrace
```
This happened in 3 different clusters. I wasn't sure what the cause was, so I set daemonset.logzioLogLevel=debug and saw that the logs being sent were a week old (from May 24-25th, compared to June 1st). One log was indeed problematic because it was too long; however, it came from a pod that no longer exists, and it did reach Logz.io tagged with logzio-invalid-log. So even though that log was problematic, it should not have caused this issue.
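For anyone who wants to reproduce the debugging step above: the log level can be set through a helm values override. A minimal sketch, assuming the release and chart names are `logzio-monitoring` and `logzio-helm/logzio-monitoring` (substitute your own; only the `daemonset.logzioLogLevel` key comes from this report):

```shell
# Turn on debug logging for the logging daemonset (release/chart names are assumptions)
helm upgrade logzio-monitoring logzio-helm/logzio-monitoring \
  --reuse-values \
  --set daemonset.logzioLogLevel=debug
```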
Downgrading to version 0.20.1 seems to help, and I suspect this is related to the changelog entry "Use fluentd's retry instead of retry in code (raise exception on non-2xx response)".
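The downgrade that worked around the problem can be pinned explicitly. A minimal sketch, again assuming hypothetical release and chart names:

```shell
# Roll back to the last known-good chart version (release/chart names are assumptions)
helm upgrade logzio-monitoring logzio-helm/logzio-monitoring \
  --reuse-values \
  --version 0.20.1
```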
The outcome of this issue is that we have either lost this week's logs, or we will be sending them over the upcoming hours and will most likely exceed our daily capacity.
Shaked