Yolean/kubernetes-kafka

Java based producer fails to re-connect after upgrade

solsson opened this issue · 6 comments

Tested #228 and #227. Kafkacat recovered nicely from downtime, bu the "produce-consume" test from 4.3.0 did badly. Producer logs:

$ kubectl -n test-kafka logs -f produce-consume-vwqtq -c producer
++ date --iso-8601=ns -u
+ echo '--- start produce-consume-vwqtq 2018-12-02T13:43:26,379429655+00:00 ---'
+ ./bin/kafka-topics.sh --zookeeper zookeeper.kafka:2181 --describe --topic test-produce-consume
Topic:test-produce-consume	PartitionCount:3	ReplicationFactor:2	Configs:
	Topic: test-produce-consume	Partition: 0	Leader: 0	Replicas: 1,0	Isr: 0,1
	Topic: test-produce-consume	Partition: 1	Leader: 0	Replicas: 0,1	Isr: 0,1
	Topic: test-produce-consume	Partition: 2	Leader: 0	Replicas: 1,0	Isr: 0,1
+ ./bin/kafka-console-producer.sh --broker-list bootstrap.kafka:9092 --topic test-produce-consume --producer-property acks=-1
+ tail -f /shared/produce.tmp
[2018-12-02 13:44:56,627] WARN [Producer clientId=console-producer] Got error produce response with correlation id 6 on topic-partition test-produce-consume-0, retrying (2 attempts left). Error: NOT_LEADER_FOR_PARTITION (org.apache.kafka.clients.producer.internals.Sender)
[2018-12-02 13:44:56,628] WARN [Producer clientId=console-producer] Received invalid metadata error in produce request on partition test-produce-consume-0 due to org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.. Going to request metadata update now (org.apache.kafka.clients.producer.internals.Sender)
[2018-12-02 13:44:56,729] WARN [Producer clientId=console-producer] Got error produce response with correlation id 8 on topic-partition test-produce-consume-0, retrying (1 attempts left). Error: NOT_LEADER_FOR_PARTITION (org.apache.kafka.clients.producer.internals.Sender)
[2018-12-02 13:44:56,730] WARN [Producer clientId=console-producer] Received invalid metadata error in produce request on partition test-produce-consume-0 due to org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.. Going to request metadata update now (org.apache.kafka.clients.producer.internals.Sender)
[2018-12-02 13:50:26,623] WARN [Producer clientId=console-producer] Connection to node 0 (kafka-0.broker.kafka.svc.cluster.local/10.36.0.28:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-12-02 13:50:26,735] WARN [Producer clientId=console-producer] Connection to node 0 (kafka-0.broker.kafka.svc.cluster.local/10.36.0.28:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-12-02 13:50:26,913] WARN [Producer clientId=console-producer] Connection to node 0 (kafka-0.broker.kafka.svc.cluster.local/10.36.0.28:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-12-02 13:50:27,380] WARN [Producer clientId=console-producer] Connection to node 0 (kafka-0.broker.kafka.svc.cluster.local/10.36.0.28:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-12-02 13:50:28,123] WARN [Producer clientId=console-producer] Connection to node 0 (kafka-0.broker.kafka.svc.cluster.local/10.36.0.28:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-12-02 13:50:39,230] WARN [Producer clientId=console-producer] Connection to node 0 (kafka-0.broker.kafka.svc.cluster.local/10.36.0.28:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-12-02 13:50:43,625] WARN [Producer clientId=console-producer] Connection to node 0 (kafka-0.broker.kafka.svc.cluster.local/10.36.0.28:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-12-02 13:50:43,625] WARN [Producer clientId=console-producer] Connection to node 1 (kafka-1.broker.kafka.svc.cluster.local/10.36.1.22:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-12-02 13:50:46,950] WARN [Producer clientId=console-producer] Connection to node 1 (kafka-1.broker.kafka.svc.cluster.local/10.36.1.22:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-12-02 13:50:49,725] WARN [Producer clientId=console-producer] Connection to node 0 (kafka-0.broker.kafka.svc.cluster.local/10.36.0.28:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-12-02 13:50:50,023] WARN [Producer clientId=console-producer] Connection to node 1 (kafka-1.broker.kafka.svc.cluster.local/10.36.1.22:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-12-02 13:50:53,095] WARN [Producer clientId=console-producer] Connection to node 1 (kafka-1.broker.kafka.svc.cluster.local/10.36.1.22:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-12-02 13:50:55,868] WARN [Producer clientId=console-producer] Connection to node 0 (kafka-0.broker.kafka.svc.cluster.local/10.36.0.28:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-12-02 13:50:56,166] WARN [Producer clientId=console-producer] Connection to node 1 (kafka-1.broker.kafka.svc.cluster.local/10.36.1.22:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-12-02 13:50:59,239] WARN [Producer clientId=console-producer] Connection to node 1 (kafka-1.broker.kafka.svc.cluster.local/10.36.1.22:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-12-02 13:51:02,310] WARN [Producer clientId=console-producer] Connection to node 1 (kafka-1.broker.kafka.svc.cluster.local/10.36.1.22:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-12-02 13:51:05,384] WARN [Producer clientId=console-producer] Connection to node 1 (kafka-1.broker.kafka.svc.cluster.local/10.36.1.22:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-12-02 13:51:08,455] WARN [Producer clientId=console-producer] Connection to node 1 (kafka-1.broker.kafka.svc.cluster.local/10.36.1.22:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-12-02 13:51:08,523] WARN [Producer clientId=console-producer] Connection to node 0 (kafka-0.broker.kafka.svc.cluster.local/10.36.0.28:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-12-02 13:51:11,526] WARN [Producer clientId=console-producer] Connection to node 1 (kafka-1.broker.kafka.svc.cluster.local/10.36.1.22:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-12-02 13:51:14,598] WARN [Producer clientId=console-producer] Connection to node 1 (kafka-1.broker.kafka.svc.cluster.local/10.36.1.22:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-12-02 13:51:17,670] WARN [Producer clientId=console-producer] Connection to node 1 (kafka-1.broker.kafka.svc.cluster.local/10.36.1.22:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-12-02 13:51:20,742] WARN [Producer clientId=console-producer] Connection to node 1 (kafka-1.broker.kafka.svc.cluster.local/10.36.1.22:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-12-02 13:51:20,828] WARN [Producer clientId=console-producer] Connection to node 0 (kafka-0.broker.kafka.svc.cluster.local/10.36.0.28:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-12-02 13:51:23,814] WARN [Producer clientId=console-producer] Connection to node 1 (kafka-1.broker.kafka.svc.cluster.local/10.36.1.22:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-12-02 13:51:26,886] WARN [Producer clientId=console-producer] Connection to node 1 (kafka-1.broker.kafka.svc.cluster.local/10.36.1.22:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-12-02 13:51:26,923] WARN [Producer clientId=console-producer] Connection to node 0 (kafka-0.broker.kafka.svc.cluster.local/10.36.0.28:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-12-02 13:51:29,958] WARN [Producer clientId=console-producer] Connection to node 1 (kafka-1.broker.kafka.svc.cluster.local/10.36.1.22:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-12-02 13:51:33,031] WARN [Producer clientId=console-producer] Connection to node 1 (kafka-1.broker.kafka.svc.cluster.local/10.36.1.22:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-12-02 13:51:33,124] WARN [Producer clientId=console-producer] Connection to node 0 (kafka-0.broker.kafka.svc.cluster.local/10.36.0.28:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-12-02 13:51:36,103] WARN [Producer clientId=console-producer] Connection to node 1 (kafka-1.broker.kafka.svc.cluster.local/10.36.1.22:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-12-02 13:51:39,175] WARN [Producer clientId=console-producer] Connection to node 1 (kafka-1.broker.kafka.svc.cluster.local/10.36.1.22:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-12-02 13:51:39,224] WARN [Producer clientId=console-producer] Connection to node 0 (kafka-0.broker.kafka.svc.cluster.local/10.36.0.28:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-12-02 13:51:42,246] WARN [Producer clientId=console-producer] Connection to node 1 (kafka-1.broker.kafka.svc.cluster.local/10.36.1.22:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-12-02 13:51:45,318] WARN [Producer clientId=console-producer] Connection to node 1 (kafka-1.broker.kafka.svc.cluster.local/10.36.1.22:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-12-02 13:51:45,340] WARN [Producer clientId=console-producer] Connection to node 0 (kafka-0.broker.kafka.svc.cluster.local/10.36.0.28:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-12-02 13:51:48,390] WARN [Producer clientId=console-producer] Connection to node 1 (kafka-1.broker.kafka.svc.cluster.local/10.36.1.22:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-12-02 13:51:51,462] WARN [Producer clientId=console-producer] Connection to node 1 (kafka-1.broker.kafka.svc.cluster.local/10.36.1.22:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-12-02 13:51:51,484] WARN [Producer clientId=console-producer] Connection to node 0 (kafka-0.broker.kafka.svc.cluster.local/10.36.0.28:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-12-02 13:51:54,534] WARN [Producer clientId=console-producer] Connection to node 1 (kafka-1.broker.kafka.svc.cluster.local/10.36.1.22:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-12-02 13:51:54,623] WARN [Producer clientId=console-producer] Connection to node 0 (kafka-0.broker.kafka.svc.cluster.local/10.36.0.28:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-12-02 13:51:57,606] WARN [Producer clientId=console-producer] Connection to node 1 (kafka-1.broker.kafka.svc.cluster.local/10.36.1.22:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-12-02 13:52:00,679] WARN [Producer clientId=console-producer] Connection to node 1 (kafka-1.broker.kafka.svc.cluster.local/10.36.1.22:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-12-02 13:52:00,723] WARN [Producer clientId=console-producer] Connection to node 0 (kafka-0.broker.kafka.svc.cluster.local/10.36.0.28:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-12-02 13:52:03,750] WARN [Producer clientId=console-producer] Connection to node 1 (kafka-1.broker.kafka.svc.cluster.local/10.36.1.22:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-12-02 13:52:06,822] WARN [Producer clientId=console-producer] Connection to node 1 (kafka-1.broker.kafka.svc.cluster.local/10.36.1.22:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-12-02 13:52:06,924] WARN [Producer clientId=console-producer] Connection to node 0 (kafka-0.broker.kafka.svc.cluster.local/10.36.0.28:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-12-02 13:52:09,894] WARN [Producer clientId=console-producer] Connection to node 1 (kafka-1.broker.kafka.svc.cluster.local/10.36.1.22:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-12-02 13:52:12,967] WARN [Producer clientId=console-producer] Connection to node 1 (kafka-1.broker.kafka.svc.cluster.local/10.36.1.22:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-12-02 13:52:13,126] WARN [Producer clientId=console-producer] Connection to node 0 (kafka-0.broker.kafka.svc.cluster.local/10.36.0.28:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-12-02 13:52:16,038] WARN [Producer clientId=console-producer] Connection to node 1 (kafka-1.broker.kafka.svc.cluster.local/10.36.1.22:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-12-02 13:52:19,111] WARN [Producer clientId=console-producer] Connection to node 1 (kafka-1.broker.kafka.svc.cluster.local/10.36.1.22:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-12-02 13:52:22,182] WARN [Producer clientId=console-producer] Connection to node 1 (kafka-1.broker.kafka.svc.cluster.local/10.36.1.22:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-12-02 13:52:22,224] WARN [Producer clientId=console-producer] Connection to node 0 (kafka-0.broker.kafka.svc.cluster.local/10.36.0.28:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-12-02 13:52:25,254] WARN [Producer clientId=console-producer] Connection to node 1 (kafka-1.broker.kafka.svc.cluster.local/10.36.1.22:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-12-02 13:52:25,535] ERROR Error when sending message to topic test-produce-consume with key: null, value: 62 bytes with error: (org.apache.kafka.clients.producer.internals.ErrorLoggingCallback)
org.apache.kafka.common.errors.TimeoutException: Expiring 5 record(s) for test-produce-consume-0:120001 ms has passed since batch creation
[2018-12-02 13:52:25,625] ERROR Error when sending message to topic test-produce-consume with key: null, value: 62 bytes with error: (org.apache.kafka.clients.producer.internals.ErrorLoggingCallback)
org.apache.kafka.common.errors.TimeoutException: Expiring 5 record(s) for test-produce-consume-0:120001 ms has passed since batch creation
[2018-12-02 13:52:25,625] ERROR Error when sending message to topic test-produce-consume with key: null, value: 62 bytes with error: (org.apache.kafka.clients.producer.internals.ErrorLoggingCallback)
org.apache.kafka.common.errors.TimeoutException: Expiring 5 record(s) for test-produce-consume-0:120001 ms has passed since batch creation
[2018-12-02 13:52:25,625] ERROR Error when sending message to topic test-produce-consume with key: null, value: 62 bytes with error: (org.apache.kafka.clients.producer.internals.ErrorLoggingCallback)
org.apache.kafka.common.errors.TimeoutException: Expiring 5 record(s) for test-produce-consume-0:120001 ms has passed since batch creation
[2018-12-02 13:52:25,626] ERROR Error when sending message to topic test-produce-consume with key: null, value: 62 bytes with error: (org.apache.kafka.clients.producer.internals.ErrorLoggingCallback)
org.apache.kafka.common.errors.TimeoutException: Expiring 5 record(s) for test-produce-consume-0:120001 ms has passed since batch creation
[2018-12-02 13:52:28,326] WARN [Producer clientId=console-producer] Connection to node 1 (kafka-1.broker.kafka.svc.cluster.local/10.36.1.22:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-12-02 13:52:31,398] WARN [Producer clientId=console-producer] Connection to node 1 (kafka-1.broker.kafka.svc.cluster.local/10.36.1.22:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-12-02 13:52:34,470] WARN [Producer clientId=console-producer] Connection to node 1 (kafka-1.broker.kafka.svc.cluster.local/10.36.1.22:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-12-02 13:52:34,524] WARN [Producer clientId=console-producer] Connection to node 0 (kafka-0.broker.kafka.svc.cluster.local/10.36.0.28:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-12-02 13:52:35,437] ERROR Error when sending message to topic test-produce-consume with key: null, value: 62 bytes with error: (org.apache.kafka.clients.producer.internals.ErrorLoggingCallback)
org.apache.kafka.common.errors.TimeoutException: Expiring 5 record(s) for test-produce-consume-2:120001 ms has passed since batch creation
[2018-12-02 13:52:35,438] ERROR Error when sending message to topic test-produce-consume with key: null, value: 62 bytes with error: (org.apache.kafka.clients.producer.internals.ErrorLoggingCallback)
org.apache.kafka.common.errors.TimeoutException: Expiring 5 record(s) for test-produce-consume-2:120001 ms has passed since batch creation
[2018-12-02 13:52:35,438] ERROR Error when sending message to topic test-produce-consume with key: null, value: 62 bytes with error: (org.apache.kafka.clients.producer.internals.ErrorLoggingCallback)
org.apache.kafka.common.errors.TimeoutException: Expiring 5 record(s) for test-produce-consume-2:120001 ms has passed since batch creation
[2018-12-02 13:52:35,438] ERROR Error when sending message to topic test-produce-consume with key: null, value: 62 bytes with error: (org.apache.kafka.clients.producer.internals.ErrorLoggingCallback)
org.apache.kafka.common.errors.TimeoutException: Expiring 5 record(s) for test-produce-consume-2:120001 ms has passed since batch creation
[2018-12-02 13:52:35,438] ERROR Error when sending message to topic test-produce-consume with key: null, value: 62 bytes with error: (org.apache.kafka.clients.producer.internals.ErrorLoggingCallback)
org.apache.kafka.common.errors.TimeoutException: Expiring 5 record(s) for test-produce-consume-2:120001 ms has passed since batch creation
[2018-12-02 13:52:37,542] WARN [Producer clientId=console-producer] Connection to node 1 (kafka-1.broker.kafka.svc.cluster.local/10.36.1.22:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-12-02 13:52:40,614] WARN [Producer clientId=console-producer] Connection to node 1 (kafka-1.broker.kafka.svc.cluster.local/10.36.1.22:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-12-02 13:52:43,723] WARN [Producer clientId=console-producer] Connection to node 1 (kafka-1.broker.kafka.svc.cluster.local/10.36.1.22:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-12-02 13:52:43,725] WARN [Producer clientId=console-producer] Connection to node 0 (kafka-0.broker.kafka.svc.cluster.local/10.36.0.28:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-12-02 13:52:45,434] ERROR Error when sending message to topic test-produce-consume with key: null, value: 62 bytes with error: (org.apache.kafka.clients.producer.internals.ErrorLoggingCallback)
org.apache.kafka.common.errors.TimeoutException: Expiring 5 record(s) for test-produce-consume-1:120001 ms has passed since batch creation
[2018-12-02 13:52:45,434] ERROR Error when sending message to topic test-produce-consume with key: null, value: 62 bytes with error: (org.apache.kafka.clients.producer.internals.ErrorLoggingCallback)
org.apache.kafka.common.errors.TimeoutException: Expiring 5 record(s) for test-produce-consume-1:120001 ms has passed since batch creation
[2018-12-02 13:52:45,434] ERROR Error when sending message to topic test-produce-consume with key: null, value: 62 bytes with error: (org.apache.kafka.clients.producer.internals.ErrorLoggingCallback)
org.apache.kafka.common.errors.TimeoutException: Expiring 5 record(s) for test-produce-consume-1:120001 ms has passed since batch creation
[2018-12-02 13:52:45,435] ERROR Error when sending message to topic test-produce-consume with key: null, value: 62 bytes with error: (org.apache.kafka.clients.producer.internals.ErrorLoggingCallback)
org.apache.kafka.common.errors.TimeoutException: Expiring 5 record(s) for test-produce-consume-1:120001 ms has passed since batch creation
[2018-12-02 13:52:45,435] ERROR Error when sending message to topic test-produce-consume with key: null, value: 62 bytes with error: (org.apache.kafka.clients.producer.internals.ErrorLoggingCallback)
org.apache.kafka.common.errors.TimeoutException: Expiring 5 record(s) for test-produce-consume-1:120001 ms has passed since batch creation
[2018-12-02 13:52:46,758] WARN [Producer clientId=console-producer] Connection to node 1 (kafka-1.broker.kafka.svc.cluster.local/10.36.1.22:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)

It seems to never recover.

These podIPs do not exist anymore, after the replacement of the statefulset.

Edit: I realized later that I didn't check kubectl -n kafka get endpoints broker -o yaml here. Maybe the IPs were endpoints.

I scaled up the test and the new pod's producer has no issues. The old one hasn't recovered. This is the Kafka 2.1.0 image.

Unfortunately the failing pod got OOMKilled after about an hour so I will never know if it would have recovered. The resterted container connected successfully after a few seconds.

After another 5 minutes the producer container logged the following:

[2018-12-02 15:03:41,827] WARN [Producer clientId=console-producer] Connection to node 1 (kafka-1.broker.kafka.svc.cluster.local/10.36.1.25:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-12-02 15:03:41,924] WARN [Producer clientId=console-producer] 2 partitions have leader brokers without a matching listener, including [test-produce-consume-0, test-produce-consume-2] (org.apache.kafka.clients.NetworkClient)
[2018-12-02 15:08:56,526] WARN [Producer clientId=console-producer] Got error produce response with correlation id 236 on topic-partition test-produce-consume-0, retrying (2 attempts left). Error: NOT_LEADER_FOR_PARTITION (org.apache.kafka.clients.producer.internals.Sender)
[2018-12-02 15:08:56,528] WARN [Producer clientId=console-producer] Received invalid metadata error in produce request on partition test-produce-consume-0 due to org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.. Going to request metadata update now (org.apache.kafka.clients.producer.internals.Sender)

That might be explained by my current 500Mi memory limit on Kafka. I see two restarts on kafka-1.

I might have some kind of name resolution problem in my test cluster. For example zookeeper pods repeatedly log java.net.UnknownHostException: zoo-1.zoo for all of the zoo pods. Pods are Ready and endpoints exist.

Edit: Nope, that was #231. I can't see a relation between that issue to the kafka broker problem here, but maybe there is one.

Got this error again now after testing another upgrade. Maybe it's reproducible by simply deleting and recreating the kafka statefulset.

[2018-12-03 07:13:20,827] WARN [Producer clientId=console-producer] Connection to node 2 (kafka-2.broker.kafka.svc.cluster.local/10.36.0.59:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-12-03 07:13:21,474] WARN [Producer clientId=console-producer] Connection to node 1 (kafka-1.broker.kafka.svc.cluster.local/10.36.3.34:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-12-03 07:13:21,659] WARN [Producer clientId=console-producer] Connection to node 0 (kafka-0.broker.kafka.svc.cluster.local/10.36.0.56:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
$ kubectl -n kafka get endpoints broker
NAME     ENDPOINTS                                         AGE
broker   10.36.0.73:9092,10.36.0.74:9092,10.36.3.43:9092   49m
$ kubectl -n kafka get pod -l app=kafka -o yaml | grep IP
    hostIP: 10.0.0.2
    podIP: 10.36.0.73
    hostIP: 10.0.0.5
    podIP: 10.36.3.43
    hostIP: 10.0.0.2
    podIP: 10.36.0.74
$ kubectl -n kafka describe service bootstrap | grep Endpoints
Endpoints:         10.36.0.73:9092,10.36.0.74:9092,10.36.3.43:9092

Edit: A comment has been deleted right above this one. That's what the ping was referring to. It was interesting, suggesting that maybe brokers report their IP address to Zk and not a DNS name. I assumed that it was the client that had resolved the IP addresses and didn't refresh them when brokers restarted.

@qbast That sounds like a plausible explanation for the error, but I don't know where that POD_IP is set. Where did you find the export? Or is it that PLAINTEXT://:9092 somehow defaults to the "host" IP?

If we end up changing listerners or advertised.listeners we should also take #192 (comment) into account.