miguno/wirbelsturm

Kafka nodes don't reference zookeeper

I tried setting up a cluster of 3 Kafka brokers and a single Zookeeper node. The provisioning process appears to work just fine, but the Kafka brokers don't connect to Zookeeper and shut down after a short timeout.

I looked at the configuration on one of them in /opt/kafka/config/server.properties, and found this: zookeeper.connect=localhost:2181. It looks like the Kafka nodes are not referencing the Zookeeper node.

Have I made a mistake setting this up? Or is there another step I need to follow to connect them to Zookeeper? The only changes I made to the default wirbelsturm.yaml were to comment out the storm_slave and storm_master nodes and enable the commented-out kafka_broker ones.
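
For reference, the relevant part of my wirbelsturm.yaml now looks roughly like this (field names paraphrased from memory, so they may not match the shipped file exactly):

# Excerpt from my wirbelsturm.yaml; field names approximated, double-check
# them against the shipped default file
nodes:
  zookeeper_server:
    count: 1
    hostname_prefix: zookeeper
    ip_range_start: 10.0.0.240
    node_role: zookeeper_server
  kafka_broker:
    count: 3
    hostname_prefix: kafka
    ip_range_start: 10.0.0.20
    node_role: kafka_broker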

Update: the server.properties file produced by Puppet looks like this:

###
### This file is managed by Puppet.
###

# See http://kafka.apache.org/documentation.html#brokerconfigs for default values.

# The id of the broker. This must be set to a unique integer for each broker.
broker.id=0

# The port the socket server listens on
port=9092

# A comma-separated list of directories under which to store log files
log.dirs=/app/kafka/log

# Zookeeper connection string (see zookeeper docs for details).
# This is a comma-separated list of host:port pairs, each corresponding to a zk
# server, e.g. "127.0.0.1:3000,127.0.0.1:3001,127.0.0.1:3002".
# You can also append an optional chroot string to the urls to specify the
# root directory for all kafka znodes.
zookeeper.connect=localhost:2181

# Additional configuration options may follow here
controlled.shutdown.enable=true
log.retention.hours=48
log.roll.hours=48

The broker.id setting was the same across all 3 Kafka nodes (it's supposed to be a different number for each broker in the cluster).

Correction: it looks like the zookeeper.connect setting was in fact set correctly, but only for the first broker. I tried deleting the file and reprovisioning with vagrant provision kafka1, and it filled the value in as zookeeper1:2181.
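
For comparison, I would expect a correctly provisioned kafka2 to end up with values along these lines (broker.id 1 is just an example; it only has to be unique per broker):

# Expected on kafka2 (sketch)
broker.id=1
zookeeper.connect=zookeeper1:2181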

Is it possible that this is a bug that's only triggered by creating multiple Kafka nodes?

Rob, there's no bug -- you simply needed to add the appropriate Hiera data. That being said, I agree it was not straightforward how to do that given the existing Hiera data, so I made some changes which I hope will improve the status quo.

See the three commits above for details if you are interested.

To resolve your issue you need to do the following:

  • Pull the latest Wirbelsturm changes to your local checkout.
  • Create a copy of puppet/manifests/hieradata/environments/default-environment/hosts/kafka1.yaml for each additional Kafka broker you need, e.g. kafka2.yaml and kafka3.yaml. In each kafka<N>.yaml, make sure you set a unique kafka::broker_id (see the sketch after this list).
  • Redeploy your cluster.
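
For example, after copying kafka1.yaml to kafka2.yaml, the broker-id line would look something like this (the value 1 is only an example; any integer that is unique within the cluster works):

# In puppet/manifests/hieradata/environments/default-environment/hosts/kafka2.yaml
kafka::broker_id: 1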

That should do it! :-)

And thanks for the detailed bug report!

Thanks for the explanation and changes, that's really helpful! I haven't worked with Hiera before, so I assumed it was auto-generating the configuration for each Kafka broker node in the same way the Vagrant machines are generated.

I'll try out the method you outlined above.

Looks like that solved the problem above with the server.properties configuration.

However, I'm seeing one more issue with multiple Kafka brokers. The /etc/hosts files don't appear to get entries for all the other hosts. When I recreated the 3-node cluster, the hosts files looked like this:

# kafka1
127.0.0.1 localhost
127.0.1.1 kafka1
10.0.0.21 kafka1
10.0.0.241 zookeeper1

# kafka2
127.0.0.1 localhost
127.0.1.1 kafka2
10.0.0.21 kafka1
10.0.0.22 kafka2
10.0.0.241 zookeeper1

# kafka3
127.0.0.1 localhost
127.0.1.1 kafka3
10.0.0.21 kafka1
10.0.0.22 kafka2
10.0.0.23 kafka3
10.0.0.241 zookeeper1

It looks like each node only gets host entries for the nodes that had already been provisioned at that point? The end result is that a replicated topic throws errors for partitions on kafka1 or kafka2, because those brokers can't contact the other nodes.

I ran vagrant provision again, and it updated the hosts files correctly this time. Could this be an issue with the vagrant-hosts plugin?
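
After that second pass, each broker should have the full set of entries; kafka1, for example, presumably ends up with something like:

# kafka1 (after re-provisioning)
127.0.0.1 localhost
127.0.1.1 kafka1
10.0.0.21 kafka1
10.0.0.22 kafka2
10.0.0.23 kafka3
10.0.0.241 zookeeper1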

It seems unlikely that you're meant to start up all your servers first and then provision them in a separate step, though.

Yes, this is indeed an issue with the vagrant-hosts plugin. The latest version of the plugin should have improved the situation but apparently it does not. We've also tried to switch to vagrant-hostmanager (a different plugin) but that resulted in other issues.

At the moment you need to either use the workaround you described above or simply run the included ./deploy script. The latter will do exactly what you described: first it launches all the machines, and only after that completes does it provision them -- this means all the machines will know about each other. Also, ./deploy is typically much faster than manually running vagrant up --no-provision && vagrant provision, because ./deploy provisions machines in parallel whereas vagrant provision does not.
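
In other words, the two approaches look roughly like this:

# Workaround: launch everything first, then provision in a second step
vagrant up --no-provision
vagrant provision

# Recommended: let the bundled script do both, provisioning in parallel
./deploy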

That makes sense. Thanks again for your help - I'm going to close this issue since this solution works for me and there isn't a bug in Wirbelsturm itself.