logzio/logzio-helm

BUG: 0.20.2 chart is stuck at some point and cannot recover

Closed this issue · 2 comments

Shaked commented

Hi,

I upgraded to 0.20.2 a week ago, and no logs have come through since. When I checked the logzio agent logs, I saw these errors and warnings repeating:

2023-06-01 22:34:28 +0000 [error]: #0 [out_logzio] Error while sending POST to https://listener.logz.io:8071?token=REDACTED: {"malformedLines":0,"oversizedLines":1,"successfulLines":233}
2023-06-01 22:34:28 +0000 [debug]: #0 [out_logzio] taking back chunk for errors. chunk="..."
2023-06-01 22:34:28 +0000 [warn]: #0 [out_logzio] failed to flush the buffer. retry_times=178 next_retry_time=2023-06-01 22:35:01 +0000 chunk="5fc71397291af524ebf9f3ce631679d6" error_class=RuntimeError error="Logzio listener returned (400) for https://listener.logz.io:8071?token=REDACTED: {\"malformedLines\":0,\"oversizedLines\":1,\"successfulLines\":233}"
2023-06-01 22:34:28 +0000 [warn]: #0 suppressed same stacktrace
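
To make the numbers in that response easier to read, here is a minimal Python sketch that tallies the counters from the body in the error line. It assumes the field names mean what they suggest (lines accepted vs. lines rejected); `summarize_listener_response` is a hypothetical helper, not part of the plugin or the chart.

```python
import json

# Response body copied from the [error] line in the agent log.
body = '{"malformedLines":0,"oversizedLines":1,"successfulLines":233}'


def summarize_listener_response(raw: str) -> dict:
    """Hypothetical helper: tally accepted vs. rejected lines in one response.

    Assumes malformedLines and oversizedLines count rejected lines and
    successfulLines counts accepted ones.
    """
    counts = json.loads(raw)
    rejected = counts.get("malformedLines", 0) + counts.get("oversizedLines", 0)
    accepted = counts.get("successfulLines", 0)
    return {"accepted": accepted, "rejected": rejected, "total": accepted + rejected}


summary = summarize_listener_response(body)
print(summary)  # {'accepted': 233, 'rejected': 1, 'total': 234}
```

Under that reading, a single oversized line out of 234 is enough to produce the 400, and with retry_times=178 the same chunk has been sent back to the listener many times over.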

This happened in 3 different clusters. I wasn't sure what the cause was, so I set daemonset.logzioLogLevel=debug and saw that the logs being sent were a week old (from May 24-25, compared to June 1). One log was indeed problematic because it was too long. However, it came from a pod that no longer existed, and it did reach logzio marked as logzio-invalid-log, so even though it was problematic, it should not have caused this issue.

Downgrading to version 0.20.1 seems to help, and I suspect this is related to the following entry in the change log: "Use fluentd's retry instead of retry in code (raise exception on non-2xx response)".
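
To illustrate the suspicion, here is a rough Python sketch of the two policies. It is not the plugin's code (the plugin is a fluentd output, and its actual logic is not shown in this thread); `is_retryable`, `flush`, and the attempt limit are hypothetical names and values chosen only for the example.

```python
from typing import Callable


def is_retryable(status: int, raise_on_any_non_2xx: bool) -> bool:
    """Hypothetical retry policy for a failed POST.

    raise_on_any_non_2xx=True models the suspected behaviour: every
    non-2xx response sends the chunk back for another attempt.
    raise_on_any_non_2xx=False retries only when a later attempt could
    plausibly succeed (timeouts, throttling, server errors).
    """
    if 200 <= status < 300:
        return False  # delivered, nothing to retry
    if raise_on_any_non_2xx:
        return True
    return status in (408, 429) or status >= 500


def flush(send: Callable[[], int], raise_on_any_non_2xx: bool, max_attempts: int = 5) -> tuple:
    """Call send() until it succeeds, is judged non-retryable, or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        status = send()
        if 200 <= status < 300:
            return ("delivered", attempt)
        if not is_retryable(status, raise_on_any_non_2xx):
            return ("dropped", attempt)
    return ("still_retrying", max_attempts)


# A listener that always answers 400 for this chunk, as in the log above.
always_400 = lambda: 400

print(flush(always_400, raise_on_any_non_2xx=True))   # ('still_retrying', 5)
print(flush(always_400, raise_on_any_non_2xx=False))  # ('dropped', 1)
```

If the 400 is caused by the chunk's own contents, resending the same chunk cannot succeed, which would match the retry_times=178 seen in the log and the week-old logs still sitting in the buffer.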

As a result, we have either lost this week's logs, or we will be sending them over the coming hours and will most likely exceed our daily capacity.

Shaked

Hi @Shaked,
Thank you for reporting this, we'll work on a fix.

Hi @Shaked, version 1.1.0 of the chart is out and should handle this. Can you please confirm that it solves your issue?