Issue with running kafka (without zookeeper)
shavo007 opened this issue · 11 comments
Description
Error connecting to the broker when running KRaft
cp-all-in-one/cp-all-in-one-kraft
https://github.com/confluentinc/cp-all-in-one/tree/6.2.0-post/cp-all-in-one-kraft
Troubleshooting
When I run the sample producer, I get an exception:
2021-09-22 12:39:35 WARN NetworkClient:1060 - [Producer clientId=producer-1] Bootstrap broker localhost:9092 (id: -1 rack: null) disconnected
I checked the logs and the broker appears to be up, but I can't connect to it.
The other examples, which use ZooKeeper, work fine; only this one fails.
Environment
- GitHub branch:
6.2.0-post
- Operating System: mac os
- Version of Docker: Version: 20.10.8
- Version of Docker Compose: docker-compose version 1.29.2
Same here! I'm trying to connect with kafka-topics:
$ kafka-topics --bootstrap-server localhost:9092 --list
Error while executing topic command : Timed out waiting for a node assignment. Call: listTopics
[2021-10-18 15:38:56,506] ERROR org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node assignment. Call: listTopics
Same issue here, tested with the v7.0.0 container.
I am able to run kafka-topics via docker exec, but my Kafka client on the host is not able to connect and produces the same error.
Interestingly, the console consumer (kafka-console-consumer.sh) running on my host machine is able to connect; no idea why. Maybe the error is just not logged.
I meant the console consumer provided with Kafka itself; sorry for the typo, I will edit my comment.
Does this fix help?
#84
@mbreevoort Not for me. Made no difference, unfortunately.
Facing the same issue, which I first discovered using librdkafka; the same happens with a Java producer, too.
What's happening is that the API version request (v3) gets cut off before a response is actually received.
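Before digging into the broker config, it can help to confirm whether anything is accepting TCP connections on the mapped port at all. Below is a minimal sketch; `can_connect` is a hypothetical helper written for this illustration, not part of any Kafka or Docker tooling:

```python
import socket

def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Run from the host while the compose stack is up, e.g.:
# can_connect("localhost", 9092)
```

If the TCP connect itself succeeds but the Kafka handshake is then dropped (as with the cut-off ApiVersions request above), the port mapping is fine and the problem is inside the broker's listener configuration.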
Output of netstat -ano -p from inside the broker container:
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name Timer
tcp 0 0 0.0.0.0:9101 0.0.0.0:* LISTEN - off (0.00/0/0)
tcp 0 0 0.0.0.0:41777 0.0.0.0:* LISTEN - off (0.00/0/0)
tcp 0 0 127.0.0.11:46301 0.0.0.0:* LISTEN - off (0.00/0/0)
tcp 0 0 127.0.0.1:9092 0.0.0.0:* LISTEN - off (0.00/0/0)
tcp 0 0 192.168.48.2:29092 0.0.0.0:* LISTEN - off (0.00/0/0)
tcp 0 0 192.168.48.2:29093 0.0.0.0:* LISTEN - off (0.00/0/0)
tcp 0 0 192.168.48.2:29092 192.168.48.3:45354 ESTABLISHED - keepalive (6592.04/0/0)
tcp 0 0 192.168.48.2:29092 192.168.48.3:45348 ESTABLISHED - keepalive (6591.66/0/0)
tcp 0 0 192.168.48.2:29092 192.168.48.3:45360 ESTABLISHED - keepalive (6592.52/0/0)
tcp 0 0 192.168.48.2:29093 192.168.48.2:37916 TIME_WAIT - timewait (52.04/0/0)
tcp 0 0 192.168.48.2:29093 192.168.48.2:37902 ESTABLISHED - keepalive (6587.84/0/0)
tcp 0 0 192.168.48.2:29092 192.168.48.3:45350 ESTABLISHED - keepalive (6591.73/0/0)
tcp 0 0 192.168.48.2:37902 192.168.48.2:29093 ESTABLISHED - keepalive (6587.83/0/0)
udp 0 0 127.0.0.11:32893 0.0.0.0:* - off (0.00/0/0)
Whereas when running the standard image with ZooKeeper, I get this result:
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name Timer
tcp 0 0 0.0.0.0:9092 0.0.0.0:* LISTEN - off (0.00/0/0)
tcp 0 0 0.0.0.0:29092 0.0.0.0:* LISTEN - off (0.00/0/0)
tcp 0 0 0.0.0.0:34405 0.0.0.0:* LISTEN - off (0.00/0/0)
tcp 0 0 0.0.0.0:9101 0.0.0.0:* LISTEN - off (0.00/0/0)
tcp 0 0 127.0.0.11:37105 0.0.0.0:* LISTEN - off (0.00/0/0)
tcp 0 0 0.0.0.0:8090 0.0.0.0:* LISTEN - off (0.00/0/0)
tcp 0 0 192.168.80.3:41014 52.85.187.45:443 TIME_WAIT - timewait (27.25/0/0)
tcp 0 0 192.168.80.3:50126 192.168.80.3:29092 ESTABLISHED - keepalive (7159.45/0/0)
tcp 0 0 192.168.80.3:50106 192.168.80.3:29092 TIME_WAIT - timewait (18.20/0/0)
tcp 0 0 192.168.80.3:50138 192.168.80.3:29092 TIME_WAIT - timewait (19.78/0/0)
tcp 0 0 192.168.80.3:50096 192.168.80.3:29092 ESTABLISHED - keepalive (7155.10/0/0)
tcp 0 0 192.168.80.3:50160 192.168.80.3:29092 TIME_WAIT - timewait (20.39/0/0)
tcp 0 0 192.168.80.3:29092 192.168.80.4:35756 ESTABLISHED - keepalive (7159.82/0/0)
tcp 0 0 192.168.80.3:50116 192.168.80.3:29092 TIME_WAIT - timewait (18.42/0/0)
tcp 0 0 192.168.80.3:29092 192.168.80.3:50090 ESTABLISHED - keepalive (7156.31/0/0)
tcp 0 0 192.168.80.3:29092 192.168.80.3:50204 ESTABLISHED - keepalive (7170.32/0/0)
tcp 0 0 192.168.80.3:50098 192.168.80.3:29092 TIME_WAIT - timewait (17.51/0/0)
tcp 0 0 192.168.80.3:41016 52.85.187.45:443 TIME_WAIT - timewait (27.37/0/0)
tcp 0 0 192.168.80.3:29092 192.168.80.3:50096 ESTABLISHED - keepalive (7156.32/0/0)
tcp 0 0 192.168.80.3:29092 192.168.80.3:50126 ESTABLISHED - keepalive (7159.45/0/0)
tcp 0 0 192.168.80.3:50164 192.168.80.3:29092 TIME_WAIT - timewait (20.53/0/0)
tcp 0 0 192.168.80.3:50112 192.168.80.3:29092 TIME_WAIT - timewait (18.34/0/0)
tcp 0 0 192.168.80.3:50158 192.168.80.3:29092 TIME_WAIT - timewait (20.39/0/0)
What gets my attention is:
tcp 0 0 127.0.0.1:9092 0.0.0.0:* LISTEN - off (0.00/0/0)
vs.
tcp 0 0 0.0.0.0:9092 0.0.0.0:* LISTEN - off (0.00/0/0)
I'm guessing that could be part of the problem (I remember running into trouble with an HTTP server in Docker, for instance, and having to use 0.0.0.0 as the listen host).
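The 127.0.0.1 vs. 0.0.0.0 distinction above is exactly about which interfaces a listening socket accepts connections on: a socket bound to 127.0.0.1 only accepts traffic arriving via loopback, so Docker's published-port forwarding (which reaches the container on its bridge-network interface) never gets through, while 0.0.0.0 accepts on all interfaces. A small sketch of the bind semantics, using a hypothetical helper:

```python
import socket

def bound_address(host: str) -> str:
    """Bind an ephemeral TCP socket to `host` and report the address it listens on."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind((host, 0))  # port 0 = let the OS pick an ephemeral port
    addr = s.getsockname()[0]
    s.close()
    return addr

# 127.0.0.1 -> loopback only; 0.0.0.0 -> wildcard, all interfaces.
print(bound_address("127.0.0.1"))
print(bound_address("0.0.0.0"))
```

This matches the netstat difference shown above: the KRaft image was listening on 127.0.0.1:9092, the ZooKeeper image on 0.0.0.0:9092.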
So I changed my config to:
broker:
  image: confluentinc/cp-kafka:7.0.1
  hostname: broker
  container_name: broker
  ports:
    - "9092:9092"
    - "9101:9101"
  environment:
    KAFKA_BROKER_ID: 1
    KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: 'CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT'
    KAFKA_ADVERTISED_LISTENERS: 'PLAINTEXT://broker:29092,PLAINTEXT_HOST://localhost:9092'
    KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
    KAFKA_GROUP_INITIAL_REBALANCE_DELAY_MS: 0
    KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: 1
    KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 1
    KAFKA_JMX_PORT: 9101
    KAFKA_JMX_HOSTNAME: localhost
    KAFKA_PROCESS_ROLES: 'broker,controller'
    KAFKA_NODE_ID: 1
    KAFKA_CONTROLLER_QUORUM_VOTERS: '1@broker:29093'
    KAFKA_LISTENERS: 'PLAINTEXT://broker:29092,CONTROLLER://broker:29093,PLAINTEXT_HOST://0.0.0.0:9092'
    KAFKA_INTER_BROKER_LISTENER_NAME: 'PLAINTEXT'
    KAFKA_CONTROLLER_LISTENER_NAMES: 'CONTROLLER'
    KAFKA_LOG_DIRS: '/tmp/kraft-combined-logs'
    KAFKA_LOG4J_LOGGERS: "kafka.controller=TRACE,kafka.server=TRACE,kafka.broker=TRACE,kafka.server.IncrementalFetchContext=WARN"
  volumes:
    - ./update_run.sh:/tmp/update_run.sh
  command: "bash -c 'if [ ! -f /tmp/update_run.sh ]; then echo \"ERROR: Did you forget the update_run.sh file that came with this docker-compose.yml file?\" && exit 1 ; else /tmp/update_run.sh && /etc/confluent/docker/run ; fi'"
the change being:
KAFKA_LISTENERS: 'PLAINTEXT://broker:29092,CONTROLLER://broker:29093,PLAINTEXT_HOST://0.0.0.0:9092'
It looks OK so far; I need to experiment more and make sure everything still works both from inside the Docker Compose network and from the outside.
For now:
- topic creation is OK (from a kafka-setup container inside docker-compose)
- a producer (on the local machine, outside Docker) can connect via localhost:9092 and produce messages
- Control Center shows everything properly, including the messages sent by the producer
- the Connect instance looks OK so far
- ksqlDB is still to be tested
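The fix works because KAFKA_LISTENERS controls where the broker binds, while KAFKA_ADVERTISED_LISTENERS controls the addresses the broker hands back to clients in metadata responses; the two can differ per listener name. A small illustrative sketch (this parser is a simplification written for this comment, not Kafka's actual config parser):

```python
def parse_listeners(value: str) -> dict:
    """Split a Kafka listener string 'NAME://host:port,...' into {name: (host, port)}."""
    result = {}
    for entry in value.split(","):
        name, rest = entry.split("://", 1)
        host, port = rest.rsplit(":", 1)
        result[name] = (host, int(port))
    return result

# Values from the proposed config above.
listeners = parse_listeners(
    "PLAINTEXT://broker:29092,CONTROLLER://broker:29093,PLAINTEXT_HOST://0.0.0.0:9092"
)
advertised = parse_listeners(
    "PLAINTEXT://broker:29092,PLAINTEXT_HOST://localhost:9092"
)

# PLAINTEXT_HOST now *binds* on all interfaces but is *advertised*
# to external clients as localhost:9092.
print(listeners["PLAINTEXT_HOST"])
print(advertised["PLAINTEXT_HOST"])
```

With the old config, PLAINTEXT_HOST both bound and advertised `localhost:9092`, so inside the container it bound only 127.0.0.1 and Docker's port mapping could never reach it.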
Hopefully this helps. I can file a PR if this is indeed the appropriate solution.
@aesteve thank you for tracking down the issue. I have verified in my environment that kafka-topics --bootstrap-server localhost:9092 --list
fails with the current config and works with the proposed change. If you could please file a PR, that would be excellent! Note: please base/merge on 6.2.0-post
(not the latest 7.0.1-post), since the problem exists there as well. Once the PR is merged, I'll propagate the fix to all recent branches.
Now fixed in latest release: https://github.com/confluentinc/cp-all-in-one/blob/7.0.1-post/cp-all-in-one-kraft/docker-compose.yml#L25