confluentinc/cp-all-in-one

Issue with running kafka (without zookeeper)

shavo007 opened this issue · 11 comments

Description
Error connecting to the broker when running in KRaft mode

cp-all-in-one/cp-all-in-one-kraft
https://github.com/confluentinc/cp-all-in-one/tree/6.2.0-post/cp-all-in-one-kraft

Troubleshooting
When I run the sample producer, I get this warning:

2021-09-22 12:39:35 WARN  NetworkClient:1060 - [Producer clientId=producer-1] Bootstrap broker localhost:9092 (id: -1 rack: null) disconnected

I checked the logs and the broker appears to be up, but I can't connect to it.

The other examples, which use ZooKeeper, work fine, but this one does not.

Environment

  • GitHub branch: 6.2.0-post
  • Operating System: macOS
  • Version of Docker: Version: 20.10.8
  • Version of Docker Compose: docker-compose version 1.29.2

Same here! I'm trying to connect with kafka-topics:

$ kafka-topics --bootstrap-server localhost:9092 --list
Error while executing topic command : Timed out waiting for a node assignment. Call: listTopics
[2021-10-18 15:38:56,506] ERROR org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node assignment. Call: listTopics

bjrke commented

Same issue here, tested with the v7.0.0 container.
I am able to run kafka-topics via docker exec, but my Kafka client on the host is not able to connect and produces the same error.
Interestingly, the console consumer (kafka-console-consumer.sh) running on my host machine is able to connect; no idea why. Maybe the error is just not logged.

@bjrke Can you please clarify what you mean by "the console consume"?

bjrke commented

The console consumer provided with Kafka itself. Sorry for the typo; I will edit my comment.

Does this fix help?
#84

@mbreevoort Not for me. Made no difference, unfortunately.

Facing the same issue. I first discovered it using librdkafka, but the same happens with a Java producer, too.
What's happening is that the ApiVersions (v3) request gets "cut" before a response is actually received.

Output of netstat -ano -p from the broker container:

Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State	PID/Program name     Timer
tcp        0	  0 0.0.0.0:9101            0.0.0.0:*               LISTEN	-                    off (0.00/0/0)
tcp        0	  0 0.0.0.0:41777           0.0.0.0:*               LISTEN	-                    off (0.00/0/0)
tcp        0	  0 127.0.0.11:46301        0.0.0.0:*               LISTEN	-                    off (0.00/0/0)
tcp        0	  0 127.0.0.1:9092          0.0.0.0:*               LISTEN	-                    off (0.00/0/0)
tcp        0	  0 192.168.48.2:29092      0.0.0.0:*               LISTEN	-                    off (0.00/0/0)
tcp        0	  0 192.168.48.2:29093      0.0.0.0:*               LISTEN	-                    off (0.00/0/0)
tcp        0	  0 192.168.48.2:29092      192.168.48.3:45354      ESTABLISHED -                    keepalive (6592.04/0/0)
tcp        0	  0 192.168.48.2:29092      192.168.48.3:45348      ESTABLISHED -                    keepalive (6591.66/0/0)
tcp        0	  0 192.168.48.2:29092      192.168.48.3:45360      ESTABLISHED -                    keepalive (6592.52/0/0)
tcp        0	  0 192.168.48.2:29093      192.168.48.2:37916      TIME_WAIT   -                    timewait (52.04/0/0)
tcp        0	  0 192.168.48.2:29093      192.168.48.2:37902      ESTABLISHED -                    keepalive (6587.84/0/0)
tcp        0	  0 192.168.48.2:29092      192.168.48.3:45350      ESTABLISHED -                    keepalive (6591.73/0/0)
tcp        0	  0 192.168.48.2:37902      192.168.48.2:29093      ESTABLISHED -                    keepalive (6587.83/0/0)
udp        0	  0 127.0.0.11:32893        0.0.0.0:*                           -                    off (0.00/0/0)

Whereas when running the standard image with ZooKeeper, I get this result:

Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State	PID/Program name     Timer
tcp        0	  0 0.0.0.0:9092            0.0.0.0:*               LISTEN	-                    off (0.00/0/0)
tcp        0	  0 0.0.0.0:29092           0.0.0.0:*               LISTEN	-                    off (0.00/0/0)
tcp        0	  0 0.0.0.0:34405           0.0.0.0:*               LISTEN	-                    off (0.00/0/0)
tcp        0	  0 0.0.0.0:9101            0.0.0.0:*               LISTEN	-                    off (0.00/0/0)
tcp        0	  0 127.0.0.11:37105        0.0.0.0:*               LISTEN	-                    off (0.00/0/0)
tcp        0	  0 0.0.0.0:8090            0.0.0.0:*               LISTEN	-                    off (0.00/0/0)
tcp        0	  0 192.168.80.3:41014      52.85.187.45:443        TIME_WAIT   -                    timewait (27.25/0/0)
tcp        0	  0 192.168.80.3:50126      192.168.80.3:29092      ESTABLISHED -                    keepalive (7159.45/0/0)
tcp        0	  0 192.168.80.3:50106      192.168.80.3:29092      TIME_WAIT   -                    timewait (18.20/0/0)
tcp        0	  0 192.168.80.3:50138      192.168.80.3:29092      TIME_WAIT   -                    timewait (19.78/0/0)
tcp        0	  0 192.168.80.3:50096      192.168.80.3:29092      ESTABLISHED -                    keepalive (7155.10/0/0)
tcp        0	  0 192.168.80.3:50160      192.168.80.3:29092      TIME_WAIT   -                    timewait (20.39/0/0)
tcp        0	  0 192.168.80.3:29092      192.168.80.4:35756      ESTABLISHED -                    keepalive (7159.82/0/0)
tcp        0	  0 192.168.80.3:50116      192.168.80.3:29092      TIME_WAIT   -                    timewait (18.42/0/0)
tcp        0	  0 192.168.80.3:29092      192.168.80.3:50090      ESTABLISHED -                    keepalive (7156.31/0/0)
tcp        0	  0 192.168.80.3:29092      192.168.80.3:50204      ESTABLISHED -                    keepalive (7170.32/0/0)
tcp        0	  0 192.168.80.3:50098      192.168.80.3:29092      TIME_WAIT   -                    timewait (17.51/0/0)
tcp        0	  0 192.168.80.3:41016      52.85.187.45:443        TIME_WAIT   -                    timewait (27.37/0/0)
tcp        0	  0 192.168.80.3:29092      192.168.80.3:50096      ESTABLISHED -                    keepalive (7156.32/0/0)
tcp        0	  0 192.168.80.3:29092      192.168.80.3:50126      ESTABLISHED -                    keepalive (7159.45/0/0)
tcp        0	  0 192.168.80.3:50164      192.168.80.3:29092      TIME_WAIT   -                    timewait (20.53/0/0)
tcp        0	  0 192.168.80.3:50112      192.168.80.3:29092      TIME_WAIT   -                    timewait (18.34/0/0)
tcp        0	  0 192.168.80.3:50158      192.168.80.3:29092      TIME_WAIT   -                    timewait (20.39/0/0)

What gets my attention is:

tcp        0	  0 127.0.0.1:9092          0.0.0.0:*               LISTEN	-                    off (0.00/0/0)

vs.

tcp        0	  0 0.0.0.0:9092            0.0.0.0:*               LISTEN	-                    off (0.00/0/0)

I'm guessing that could be part of the problem (I remember having trouble when running an HTTP server in Docker, for instance, and having to use 0.0.0.0 as the listen host).
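The difference between those two lines can also be spotted mechanically. Below is a minimal Python sketch, not part of the repository, that classifies the local address of a netstat LISTEN line as loopback-only, all interfaces, or a single interface. The function name listening_scope is hypothetical, and the sketch assumes the column layout shown in the output above (local address in the fourth column, state in the sixth).

```python
import ipaddress

def listening_scope(netstat_line):
    """Return (port, scope) for a netstat LISTEN line, or None for any other line."""
    fields = netstat_line.split()
    # Assumed layout: Proto Recv-Q Send-Q Local Foreign State ...
    if len(fields) < 6 or fields[5] != "LISTEN":
        return None
    host, _, port = fields[3].rpartition(":")
    addr = ipaddress.ip_address(host)
    if addr.is_unspecified:      # 0.0.0.0 (or ::): every interface
        scope = "all interfaces"
    elif addr.is_loopback:       # 127.0.0.0/8 (or ::1): only reachable locally
        scope = "loopback only"
    else:
        scope = "single interface"
    return int(port), scope

# The two lines compared above (whitespace normalized).
KRAFT_LINE = "tcp 0 0 127.0.0.1:9092 0.0.0.0:* LISTEN - off (0.00/0/0)"
ZOOKEEPER_LINE = "tcp 0 0 0.0.0.0:9092 0.0.0.0:* LISTEN - off (0.00/0/0)"

for line in (KRAFT_LINE, ZOOKEEPER_LINE):
    print(listening_scope(line))
```

A socket that reports "loopback only" inside the container accepts connections only from within that container's own network namespace, which would be consistent with docker exec working while a client on the host fails.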

So I tried and changed my config to:

  broker:
    image: confluentinc/cp-kafka:7.0.1
    hostname: broker
    container_name: broker
    ports:
      - "9092:9092"
      - "9101:9101"
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: 'CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT'
      KAFKA_ADVERTISED_LISTENERS: 'PLAINTEXT://broker:29092,PLAINTEXT_HOST://localhost:9092'
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
      KAFKA_GROUP_INITIAL_REBALANCE_DELAY_MS: 0
      KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: 1
      KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 1
      KAFKA_JMX_PORT: 9101
      KAFKA_JMX_HOSTNAME: localhost
      KAFKA_PROCESS_ROLES: 'broker,controller'
      KAFKA_NODE_ID: 1
      KAFKA_CONTROLLER_QUORUM_VOTERS: '1@broker:29093'
      KAFKA_LISTENERS: 'PLAINTEXT://broker:29092,CONTROLLER://broker:29093,PLAINTEXT_HOST://0.0.0.0:9092'
      KAFKA_INTER_BROKER_LISTENER_NAME: 'PLAINTEXT'
      KAFKA_CONTROLLER_LISTENER_NAMES: 'CONTROLLER'
      KAFKA_LOG_DIRS: '/tmp/kraft-combined-logs'
      KAFKA_LOG4J_LOGGERS: "kafka.controller=TRACE,kafka.server=TRACE,kafka.broker=TRACE,kafka.server.IncrementalFetchContext=WARN"
    volumes:
      - ./update_run.sh:/tmp/update_run.sh
    command: "bash -c 'if [ ! -f /tmp/update_run.sh ]; then echo \"ERROR: Did you forget the update_run.sh file that came with this docker-compose.yml file?\" && exit 1 ; else /tmp/update_run.sh && /etc/confluent/docker/run ; fi'"

The change being:

      KAFKA_LISTENERS: 'PLAINTEXT://broker:29092,CONTROLLER://broker:29093,PLAINTEXT_HOST://0.0.0.0:9092'

And it looks OK so far. I need to experiment more and make sure everything is still working both from inside Docker Compose and from the outside.
For now:

  • topic creation is OK (from a kafka-setup container inside docker-compose)
  • a producer (from the outside, local machine) can connect using localhost:9092 and produce messages
  • control center shows everything properly, including the messages sent by the producer
  • connect instance looks ok so far
  • ksqldb still to test

Hopefully this helps. Can file a PR if this is indeed the appropriate solution.
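To double-check which listener in a KAFKA_LISTENERS value binds to the wildcard address, the string can be split mechanically. The following is a minimal Python sketch, not part of the repository: parse_listeners and wildcard_listeners are hypothetical helper names, and the input is the KAFKA_LISTENERS value from the config above. It only inspects the string; it does not contact a broker.

```python
def parse_listeners(value):
    """Split a comma-separated listeners string into {listener_name: (host, port)}."""
    listeners = {}
    for entry in value.split(","):
        name, _, address = entry.strip().partition("://")
        host, _, port = address.rpartition(":")
        listeners[name] = (host, int(port))
    return listeners

def wildcard_listeners(value):
    """Return the names of listeners whose bind host is 0.0.0.0."""
    return [name for name, (host, _) in parse_listeners(value).items()
            if host == "0.0.0.0"]

# Value taken from the modified config above.
LISTENERS = ("PLAINTEXT://broker:29092,"
             "CONTROLLER://broker:29093,"
             "PLAINTEXT_HOST://0.0.0.0:9092")

print(wildcard_listeners(LISTENERS))  # ['PLAINTEXT_HOST']
```

Only the host-facing PLAINTEXT_HOST listener changes its bind address here; the advertised listener for it stays localhost:9092, so clients on the host keep using the same bootstrap address.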

@aesteve thank you for tracking down the issue. I have verified in my environment that kafka-topics --bootstrap-server localhost:9092 --list fails with the current config and works with the proposed change. If you could please file a PR, that would be excellent! Note: please base/merge on 6.2.0-post (not latest 7.0.1-post) since the problem exists there. Once the PR is merged, I'll propagate the fix to all recent branches.

Thanks much for your fix, @aesteve, and for the quick incorporation of that fix, @ybyzek.

bjrke commented

It's working! Thx @aesteve and @ybyzek