Issue with running kafka (without zookeeper)
shavo007 opened this issue · 11 comments
Description
Error connecting to the broker when running KRaft
cp-all-in-one/cp-all-in-one-kraft
https://github.com/confluentinc/cp-all-in-one/tree/6.2.0-post/cp-all-in-one-kraft
Troubleshooting
When I run the sample producer, I get an exception:
2021-09-22 12:39:35 WARN NetworkClient:1060 - [Producer clientId=producer-1] Bootstrap broker localhost:9092 (id: -1 rack: null) disconnected
I checked the logs and the broker appears to be up, but I can't connect to it.
The other examples, which use ZooKeeper, work fine; only this one fails.
Environment
- GitHub branch:
6.2.0-post
- Operating System: mac os
- Version of Docker: Version: 20.10.8
- Version of Docker Compose: docker-compose version 1.29.2
Same here! I'm trying to connect with kafka-topics:
$ kafka-topics --bootstrap-server localhost:9092 --list
Error while executing topic command : Timed out waiting for a node assignment. Call: listTopics
[2021-10-18 15:38:56,506] ERROR org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node assignment. Call: listTopics
Same issue here, tested with the v7.0.0 container.
I am able to run kafka-topics via docker exec, but my Kafka client on the host is not able to connect and produces the same error.
Interestingly, the console consumer (kafka-console-consumer.sh) running on my host machine is able to connect; no idea why. Maybe the error is just not logged.
I meant the console consumer provided with Kafka itself; sorry for the typo, I will edit my comment.
Does this fix help?
#84
@mbreevoort Not for me. Made no difference, unfortunately.
Facing the same issue, which I first discovered using librdkafka; the same happens with a Java producer, too.
What's happening is that the API version request (v3) gets cut off before a response is actually received.
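Before digging into the broker config, it can help to confirm whether anything is accepting TCP connections on the mapped port at all. Below is a minimal sketch; `can_connect` is a hypothetical helper written for this illustration, not part of any Kafka or Docker tooling:

```python
import socket

def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Run from the host while the compose stack is up, e.g.:
# can_connect("localhost", 9092)
```

If the TCP connect itself succeeds but the Kafka handshake is then dropped (as with the cut-off ApiVersions request above), the port mapping is fine and the problem is inside the broker's listener configuration.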
Output of netstat -ano -p from inside the broker container:
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name Timer
tcp 0 0 0.0.0.0:9101 0.0.0.0:* LISTEN - off (0.00/0/0)
tcp 0 0 0.0.0.0:41777 0.0.0.0:* LISTEN - off (0.00/0/0)
tcp 0 0 127.0.0.11:46301 0.0.0.0:* LISTEN - off (0.00/0/0)
tcp 0 0 127.0.0.1:9092 0.0.0.0:* LISTEN - off (0.00/0/0)
tcp 0 0 192.168.48.2:29092 0.0.0.0:* LISTEN - off (0.00/0/0)
tcp 0 0 192.168.48.2:29093 0.0.0.0:* LISTEN - off (0.00/0/0)
tcp 0 0 192.168.48.2:29092 192.168.48.3:45354 ESTABLISHED - keepalive (6592.04/0/0)
tcp 0 0 192.168.48.2:29092 192.168.48.3:45348 ESTABLISHED - keepalive (6591.66/0/0)
tcp 0 0 192.168.48.2:29092 192.168.48.3:45360 ESTABLISHED - keepalive (6592.52/0/0)
tcp 0 0 192.168.48.2:29093 192.168.48.2:37916 TIME_WAIT - timewait (52.04/0/0)
tcp 0 0 192.168.48.2:29093 192.168.48.2:37902 ESTABLISHED - keepalive (6587.84/0/0)
tcp 0 0 192.168.48.2:29092 192.168.48.3:45350 ESTABLISHED - keepalive (6591.73/0/0)
tcp 0 0 192.168.48.2:37902 192.168.48.2:29093 ESTABLISHED - keepalive (6587.83/0/0)
udp 0 0 127.0.0.11:32893 0.0.0.0:* - off (0.00/0/0)
Whereas when running the standard image with ZooKeeper, I get this result:
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name Timer
tcp 0 0 0.0.0.0:9092 0.0.0.0:* LISTEN - off (0.00/0/0)
tcp 0 0 0.0.0.0:29092 0.0.0.0:* LISTEN - off (0.00/0/0)
tcp 0 0 0.0.0.0:34405 0.0.0.0:* LISTEN - off (0.00/0/0)
tcp 0 0 0.0.0.0:9101 0.0.0.0:* LISTEN - off (0.00/0/0)
tcp 0 0 127.0.0.11:37105 0.0.0.0:* LISTEN - off (0.00/0/0)
tcp 0 0 0.0.0.0:8090 0.0.0.0:* LISTEN - off (0.00/0/0)
tcp 0 0 192.168.80.3:41014 52.85.187.45:443 TIME_WAIT - timewait (27.25/0/0)
tcp 0 0 192.168.80.3:50126 192.168.80.3:29092 ESTABLISHED - keepalive (7159.45/0/0)
tcp 0 0 192.168.80.3:50106 192.168.80.3:29092 TIME_WAIT - timewait (18.20/0/0)
tcp 0 0 192.168.80.3:50138 192.168.80.3:29092 TIME_WAIT - timewait (19.78/0/0)
tcp 0 0 192.168.80.3:50096 192.168.80.3:29092 ESTABLISHED - keepalive (7155.10/0/0)
tcp 0 0 192.168.80.3:50160 192.168.80.3:29092 TIME_WAIT - timewait (20.39/0/0)
tcp 0 0 192.168.80.3:29092 192.168.80.4:35756 ESTABLISHED - keepalive (7159.82/0/0)
tcp 0 0 192.168.80.3:50116 192.168.80.3:29092 TIME_WAIT - timewait (18.42/0/0)
tcp 0 0 192.168.80.3:29092 192.168.80.3:50090 ESTABLISHED - keepalive (7156.31/0/0)
tcp 0 0 192.168.80.3:29092 192.168.80.3:50204 ESTABLISHED - keepalive (7170.32/0/0)
tcp 0 0 192.168.80.3:50098 192.168.80.3:29092 TIME_WAIT - timewait (17.51/0/0)
tcp 0 0 192.168.80.3:41016 52.85.187.45:443 TIME_WAIT - timewait (27.37/0/0)
tcp 0 0 192.168.80.3:29092 192.168.80.3:50096 ESTABLISHED - keepalive (7156.32/0/0)
tcp 0 0 192.168.80.3:29092 192.168.80.3:50126 ESTABLISHED - keepalive (7159.45/0/0)
tcp 0 0 192.168.80.3:50164 192.168.80.3:29092 TIME_WAIT - timewait (20.53/0/0)
tcp 0 0 192.168.80.3:50112 192.168.80.3:29092 TIME_WAIT - timewait (18.34/0/0)
tcp 0 0 192.168.80.3:50158 192.168.80.3:29092 TIME_WAIT - timewait (20.39/0/0)
What gets my attention is:
tcp 0 0 127.0.0.1:9092 0.0.0.0:* LISTEN - off (0.00/0/0)
vs.
tcp 0 0 0.0.0.0:9092 0.0.0.0:* LISTEN - off (0.00/0/0)
I'm guessing that could be part of the problem (I remember running into trouble with an HTTP server in Docker, for instance, and having to use 0.0.0.0 as the listen host).
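The 127.0.0.1 vs. 0.0.0.0 distinction above is exactly about which interfaces a listening socket accepts connections on: a socket bound to 127.0.0.1 only accepts traffic arriving via loopback, so Docker's published-port forwarding (which reaches the container on its bridge-network interface) never gets through, while 0.0.0.0 accepts on all interfaces. A small sketch of the bind semantics, using a hypothetical helper:

```python
import socket

def bound_address(host: str) -> str:
    """Bind an ephemeral TCP socket to `host` and report the address it listens on."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind((host, 0))  # port 0 = let the OS pick an ephemeral port
    addr = s.getsockname()[0]
    s.close()
    return addr

# 127.0.0.1 -> loopback only; 0.0.0.0 -> wildcard, all interfaces.
print(bound_address("127.0.0.1"))
print(bound_address("0.0.0.0"))
```

This matches the netstat difference shown above: the KRaft image was listening on 127.0.0.1:9092, the ZooKeeper image on 0.0.0.0:9092.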
So I changed my config to:
broker:
  image: confluentinc/cp-kafka:7.0.1
  hostname: broker
  container_name: broker
  ports:
    - "9092:9092"
    - "9101:9101"
  environment:
    KAFKA_BROKER_ID: 1
    KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: 'CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT'
    KAFKA_ADVERTISED_LISTENERS: 'PLAINTEXT://broker:29092,PLAINTEXT_HOST://localhost:9092'
    KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
    KAFKA_GROUP_INITIAL_REBALANCE_DELAY_MS: 0
    KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: 1
    KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 1
    KAFKA_JMX_PORT: 9101
    KAFKA_JMX_HOSTNAME: localhost
    KAFKA_PROCESS_ROLES: 'broker,controller'
    KAFKA_NODE_ID: 1
    KAFKA_CONTROLLER_QUORUM_VOTERS: '1@broker:29093'
    KAFKA_LISTENERS: 'PLAINTEXT://broker:29092,CONTROLLER://broker:29093,PLAINTEXT_HOST://0.0.0.0:9092'
    KAFKA_INTER_BROKER_LISTENER_NAME: 'PLAINTEXT'
    KAFKA_CONTROLLER_LISTENER_NAMES: 'CONTROLLER'
    KAFKA_LOG_DIRS: '/tmp/kraft-combined-logs'
    KAFKA_LOG4J_LOGGERS: "kafka.controller=TRACE,kafka.server=TRACE,kafka.broker=TRACE,kafka.server.IncrementalFetchContext=WARN"
  volumes:
    - ./update_run.sh:/tmp/update_run.sh
  command: "bash -c 'if [ ! -f /tmp/update_run.sh ]; then echo \"ERROR: Did you forget the update_run.sh file that came with this docker-compose.yml file?\" && exit 1 ; else /tmp/update_run.sh && /etc/confluent/docker/run ; fi'"
the change being:
KAFKA_LISTENERS: 'PLAINTEXT://broker:29092,CONTROLLER://broker:29093,PLAINTEXT_HOST://0.0.0.0:9092'
It looks OK so far; I need to experiment more and make sure everything still works both from inside the Docker Compose network and from the outside.
For now:
- topic creation is OK (from a kafka-setup container inside docker-compose)
- a producer (on the local machine, outside Docker) can connect via localhost:9092 and produce messages
- Control Center shows everything properly, including the messages sent by the producer
- the Connect instance looks OK so far
- ksqlDB is still to be tested
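The fix works because KAFKA_LISTENERS controls where the broker binds, while KAFKA_ADVERTISED_LISTENERS controls the addresses the broker hands back to clients in metadata responses; the two can differ per listener name. A small illustrative sketch (this parser is a simplification written for this comment, not Kafka's actual config parser):

```python
def parse_listeners(value: str) -> dict:
    """Split a Kafka listener string 'NAME://host:port,...' into {name: (host, port)}."""
    result = {}
    for entry in value.split(","):
        name, rest = entry.split("://", 1)
        host, port = rest.rsplit(":", 1)
        result[name] = (host, int(port))
    return result

# Values from the proposed config above.
listeners = parse_listeners(
    "PLAINTEXT://broker:29092,CONTROLLER://broker:29093,PLAINTEXT_HOST://0.0.0.0:9092"
)
advertised = parse_listeners(
    "PLAINTEXT://broker:29092,PLAINTEXT_HOST://localhost:9092"
)

# PLAINTEXT_HOST now *binds* on all interfaces but is *advertised*
# to external clients as localhost:9092.
print(listeners["PLAINTEXT_HOST"])
print(advertised["PLAINTEXT_HOST"])
```

With the old config, PLAINTEXT_HOST both bound and advertised `localhost:9092`, so inside the container it bound only 127.0.0.1 and Docker's port mapping could never reach it.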
Hopefully this helps. I can file a PR if this is indeed the appropriate solution.
@aesteve thank you for tracking down the issue. I have verified in my environment that kafka-topics --bootstrap-server localhost:9092 --list
fails with the current config and works with the proposed change. If you could please file a PR, that would be excellent! Note: please base/merge on 6.2.0-post
(not the latest 7.0.1-post), since the problem exists there as well. Once the PR is merged, I'll propagate the fix to all recent branches.
Now fixed in latest release: https://github.com/confluentinc/cp-all-in-one/blob/7.0.1-post/cp-all-in-one-kraft/docker-compose.yml#L25