linkedin/brooklin

Producer is closed forcefully loop

RagingPuppies opened this issue · 0 comments

After a broker restart or broker failure (anything that triggers a leader election), it seems that some Brooklin TransportProviders can't self-heal and get stuck in a loop.

Brooklin is configured with "pausePartitionOnError": "true" (a flag indicating whether to auto-pause a topic partition if dispatching its data for delivery to the destination system fails).

When a Brooklin producer (AKA TransportProvider) receives an error, it pauses for the duration set by "pauseErrorPartitionDurationMs": "180000" (3 minutes).
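To make the pause behaviour concrete, here is a minimal sketch of how a pause-on-error with a timed auto-resume could work. This is only an illustrative model under my own assumptions, not Brooklin's actual code: the class `PartitionPauseTracker` and its methods are hypothetical names, and only the 180000 ms value comes from our configuration.

```python
import time

# Value of "pauseErrorPartitionDurationMs" quoted in this issue.
PAUSE_ERROR_PARTITION_DURATION_MS = 180_000


class PartitionPauseTracker:
    """Hypothetical model of pause-on-error with timed auto-resume.

    Not Brooklin code; it only illustrates the configured behaviour.
    """

    def __init__(self, pause_ms, clock=time.monotonic):
        self._pause_s = pause_ms / 1000.0
        self._clock = clock          # injectable so the model can be tested
        self._resume_at = {}         # (topic, partition) -> resume time in seconds

    def on_send_error(self, topic, partition):
        # A failed dispatch (re)starts the pause window for that partition.
        self._resume_at[(topic, partition)] = self._clock() + self._pause_s

    def is_paused(self, topic, partition):
        key = (topic, partition)
        resume_at = self._resume_at.get(key)
        if resume_at is None:
            return False
        if self._clock() >= resume_at:
            # Pause window elapsed: the partition is eligible again.
            del self._resume_at[key]
            return False
        return True
```

In this model, a send that fails again after the pause expires simply starts another pause, so a persistent error would show up as a repeating pause/resume cycle.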

Looking at the Brooklin logs, I found the following errors at the time of the issue:
"Flush interrupted."
"This server is not the leader for that topic-partition."
"Partition rewind failed due to"
This means that at that moment our Brooklin producer was trying to write to a broker that is not the leader for the partition.
Roughly 5 minutes later, I saw the following error message:
"Expiring 227 record(s) for <topic_name>-12: 302797 ms has passed since last append"
Comparing this with the Brooklin configuration, I spotted "request.timeout.ms": "300000", which is 5 minutes.
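The comparison between the expiry message and the timeout can be checked mechanically. Below is a small sketch that parses the log line quoted above and compares the elapsed time against "request.timeout.ms". The helper names `parse_expiry` and `exceeds_timeout` are my own, and the regular expression assumes the message has exactly the wording shown in our logs.

```python
import re

# Matches the producer expiry message as it appears in our logs.
EXPIRY_RE = re.compile(
    r"Expiring (?P<records>\d+) record\(s\) for (?P<tp>\S+): "
    r"(?P<elapsed_ms>\d+) ms has passed since last append"
)


def parse_expiry(line):
    """Return the record count, topic-partition and elapsed ms, or None."""
    m = EXPIRY_RE.search(line)
    if m is None:
        return None
    return {
        "records": int(m["records"]),
        "topic_partition": m["tp"],
        "elapsed_ms": int(m["elapsed_ms"]),
    }


def exceeds_timeout(line, request_timeout_ms):
    """True if the line is an expiry message whose elapsed time is past the timeout."""
    parsed = parse_expiry(line)
    return parsed is not None and parsed["elapsed_ms"] > request_timeout_ms


if __name__ == "__main__":
    line = ("Expiring 227 record(s) for <topic_name>-12: "
            "302797 ms has passed since last append")
    info = parse_expiry(line)
    # 302797 ms elapsed vs. the configured 300000 ms timeout.
    print(info["elapsed_ms"] - 300_000, "ms over request.timeout.ms")
```

For the message above, the elapsed time is 2797 ms past the 300000 ms timeout, which is consistent with the records expiring right after "request.timeout.ms" ran out.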

For the next 20 minutes we received NotLeaderForPartitionException, which means we did not produce data, and it seems we did not consume either.
Later on there is only one exception: "Producer is closed forcefully."
From reading a bit online, someone suggested the cause may be that the producer can't keep up with the consumer, but
"producersPerTask" and "numProducersPerConnector" in our configuration should handle that.
I also looked at the consumer group info, and it seems to have stopped consuming as well.

At the same time, we have another Datastream that replicates to the SAME cluster and topics with the same configuration, except that the failing Datastream has 8 more in maxTasks.
The source of the failing Datastream is a remote Kafka cluster, while the source of the working one is a local Kafka cluster, and the local one does not fail at all, not even a single exception.

[local]Cluster A ---> Brooklin ---> Cluster C
[remote]Cluster B -----^
On the remote cluster's Datastream, some (2~3) TransportProviders are failing.

brooklin configurations:
https://pastebin.com/raw/kHACqwcA

Your environment

  • Ubuntu 18.04
  • Brooklin 1.1.0
  • Java 1.8.0_152
  • Kafka 2.5.0
  • ZK 3.4.5

Ideas?