Rolling restart unable to restart broker
Hi
I am trying to use the rolling restart script (latest) along with Jolokia (jolokia-jvm-1.6.2-agent.jar), which is attached to the Kafka service running on the broker nodes (passed via KAFKA_OPTS):
KAFKA_OPTS="-javaagent:/home/kafka/prometheus/jmx_prometheus_javaagent-0.3.1.jar=8080:/home/kafka/prometheus/kafka-0-8-2.yml -javaagent:/home/kafka/jolokia/jolokia-agent.jar=host=*"
I am able to get Jolokia metrics from the remote broker nodes using the following curl command:
curl bro1:8778/jolokia/read/kafka.server:name=UnderReplicatedPartitions,type=ReplicaManager/Value | jq
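For reference, the same read can be sketched in Python. This is only an illustration of the Jolokia read request shown in the curl command, not the rolling restart script's own code; the function names are hypothetical, and the port 8778 and the "jolokia" prefix are taken from the command and the log output in this issue.

```python
import json
from urllib.request import urlopen

# MBean and attribute copied from the curl command above.
URP_MBEAN = "kafka.server:name=UnderReplicatedPartitions,type=ReplicaManager"


def jolokia_read_url(host, mbean, attribute, port=8778, prefix="jolokia"):
    """Build a Jolokia 'read' URL (hypothetical helper)."""
    return "http://{}:{}/{}/read/{}/{}".format(host, port, prefix, mbean, attribute)


def parse_jolokia_value(body):
    """Return the 'value' field of a Jolokia JSON response.

    Jolokia reports success with "status": 200 in the response body.
    """
    payload = json.loads(body)
    if payload.get("status") != 200:
        raise RuntimeError("Jolokia error: {}".format(payload.get("error")))
    return payload["value"]


def under_replicated_partitions(host, timeout=5.0):
    """Query one broker; raises an OSError subclass if the agent is unreachable."""
    url = jolokia_read_url(host, URP_MBEAN, "Value")
    with urlopen(url, timeout=timeout) as resp:
        return parse_jolokia_value(resp.read().decode("utf-8"))
```

Calling `under_replicated_partitions("bro1")` while the broker is down fails with a connection error, which corresponds to the "Connection refused" messages in the output further down.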
When I run the rolling restart script, it detects all the brokers and, after confirmation, stops the first broker. It then waits forever for broker 1 to restart, printing the following messages:
[kfk@admin-node ~]$ kafka-rolling-restart --cluster-type kafka --start-command "/home/kfk/bin/kafka-server-start -daemon /home/kfk/etc/kafka/server.properties " --stop-command "/home/kfk/bin/kafka-server-stop" --check-count 3
Will restart the following brokers in cluster-1:
1: bro1
2: bro2
3: bro3
Do you want to restart these brokers? y
Execute restart
Under replicated partitions: 0, missing brokers: 0 (1/1)
The cluster is stable
Stopping bro1 (1/3)
Starting bro1 (1/3)
Cannot find the key, Kafka is probably still starting up
Under replicated partitions: 40, missing brokers: 1 (0/3)
Broker bro1 is down: HTTPConnectionPool(host='bro1', port=8778): Max retries exceeded with url: /jolokia//read/kafka.server:name=UnderReplicatedPartitions,type=ReplicaManager/Value (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f6de814bb50>: Failed to establish a new connection: [Errno 111] Connection refused',)).This maybe because it is starting up
Under replicated partitions: 68, missing brokers: 1 (0/3)
Broker bro1 is down: HTTPConnectionPool(host='bro1', port=8778): Max retries exceeded with url: /jolokia//read/kafka.server:name=UnderReplicatedPartitions,type=ReplicaManager/Value (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f6de80d8890>: Failed to establish a new connection: [Errno 111] Connection refused',)).This maybe because it is starting up
Under replicated partitions: 68, missing brokers: 1 (0/3)
Broker bro1 is down: HTTPConnectionPool(host='bro1', port=8778): Max retries exceeded with url: /jolokia//read/kafka.server:name=UnderReplicatedPartitions,type=ReplicaManager/Value (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f6de80d8a50>: Failed to establish a new connection: [Errno 111] Connection refused',)).This maybe because it is starting up
Under replicated partitions: 68, missing brokers: 1 (0/3)
Broker bro1 is down: HTTPConnectionPool(host='bro1', port=8778): Max retries exceeded with url: /jolokia//read/kafka.server:name=UnderReplicatedPartitions,type=ReplicaManager/Value (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f6de80e9810>: Failed to establish a new connection: [Errno 111] Connection refused',)).This maybe because it is starting up
Under replicated partitions: 68, missing brokers: 1 (0/3)
Broker bro1 is down: HTTPConnectionPool(host='bro1', port=8778): Max retries exceeded with url: /jolokia//read/kafka.server:name=UnderReplicatedPartitions,type=ReplicaManager/Value (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f6de8974b90>: Failed to establish a new connection: [Errno 111] Connection refused',)).This maybe because it is starting up
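The "(0/3)" counter in the output above appears to count consecutive healthy checks against --check-count. A minimal sketch of that kind of loop follows, assuming the counter resets whenever a check is unhealthy; this is an illustration, not the tool's actual implementation, and `wait_for_stable` and `read_status` are hypothetical names.

```python
import time


def wait_for_stable(read_status, check_count, check_interval=0.0,
                    max_checks=100, sleep=time.sleep):
    """Poll until `check_count` consecutive healthy checks are seen.

    `read_status` is a hypothetical callable returning
    (under_replicated_partitions, missing_brokers).
    Returns the number of polls performed; raises TimeoutError
    if the cluster is not stable after `max_checks` polls.
    """
    stable = 0
    for attempt in range(1, max_checks + 1):
        under_replicated, missing = read_status()
        if under_replicated == 0 and missing == 0:
            stable += 1
        else:
            stable = 0  # any unhealthy check restarts the count
        print("Under replicated partitions: {}, missing brokers: {} ({}/{})".format(
            under_replicated, missing, stable, check_count))
        if stable >= check_count:
            return attempt
        sleep(check_interval)
    raise TimeoutError("cluster not stable after {} checks".format(max_checks))
```

With a broker that never comes back, the counter stays at 0, which matches the repeated "(0/3)" lines above.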
Tried with the following command as well:
[kfk@admin-node ~]$ kafka-rolling-restart --cluster-type kafka --start-command "sudo service kafka start " --stop-command "sudo service kafka stop" --check-count 3
On inspecting broker node 1, I found that Kafka was stopped. After I restarted broker 1 manually, the rolling restart script stopped the second broker and again waited forever for broker 2 to come up. I have tested all the Kafka service commands (start, stop, restart) manually on the broker node, and all of them work.
It looks like the rolling restart script is able to stop the Kafka broker but unable to restart it.
Where could the issue be?
Kafka version: confluent-5.2.1-2.12
From the broker's SSH log, it looks like the start and stop commands actually reach the broker, but the start command is unable to start the broker.
Oct 29 03:37:45 bro1 sudo: kfk : TTY=pts/4 ; PWD=/home/kfk ; USER=root ; COMMAND=/home/kfk/bin/kafka-server-stop
Oct 29 03:37:45 bro1 sshd[5465]: pam_unix(sshd:session): session closed for user kfk
Oct 29 03:37:45 bro1 sshd[5507]: Accepted publickey for kfk from 172.16.10.161 port 40698 ssh2
Oct 29 03:37:45 bro1 sshd[5507]: pam_unix(sshd:session): session opened for user kfk by (uid=0)
Oct 29 03:37:46 bro1 sudo: kfk : TTY=pts/4 ; PWD=/home/kfk ; USER=root ; COMMAND=/home/kfk/bin/kafka-server-start -daemon /home/kfk/etc/kafka/server.properties
I think the problem with the rolling restart script is that the Kafka stop and start commands are executed almost back to back, so the start command runs before the stop command has finished its work. This is my assumption.
This behavior can be reproduced by executing the following command from the terminal:
kafka-server-stop && kafka-server-start -daemon /home/kfk/etc/kafka/server.properties
Is it possible to inject a time delay between the execution of the two commands (without using sleep)?
Hi @DwijadasDey, what you are saying seems reasonable to me. I'll try to reproduce your issue internally so we can come up with a good fix.
@DwijadasDey I think we never ran into this issue internally because we use the default start/stop commands (i.e. systemd, or upstart previously), which probably have some logic to avoid the situation you are describing.
Is this possible for your deployment? I think using a service manager such as systemd would be better overall than running the command daemonized. Otherwise, you might want to include a prestarttask in your invocation that waits until the Kafka process has stopped.
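A wait of that kind could look roughly like the sketch below. It polls until no matching process is left instead of sleeping for a fixed time. The names are hypothetical, and matching on "kafka.Kafka" with `pgrep -f` is an assumption about how the broker's java process appears in the process list. How the script is hooked into the restart (as a task or as part of the start command) depends on the deployment.

```python
import subprocess
import time


def kafka_is_running(pattern=r"kafka\.Kafka"):
    """Return True if any process command line matches `pattern`.

    Assumption: the broker JVM shows its main class kafka.Kafka on the
    command line. `pgrep -f` exits with 0 when at least one process matches.
    """
    result = subprocess.run(["pgrep", "-f", pattern], stdout=subprocess.DEVNULL)
    return result.returncode == 0


def wait_until_stopped(is_running=kafka_is_running, timeout=120.0, interval=1.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Poll `is_running` until it returns False.

    Returns True once the process is gone, or False if it is still
    running after `timeout` seconds.
    """
    deadline = clock() + timeout
    while is_running():
        if clock() >= deadline:
            return False
        sleep(interval)
    return True
```

The polling interval is still a short sleep, but the total wait is bounded by the broker's actual shutdown time rather than by a guessed delay. A wrapper script could exit non-zero when `wait_until_stopped()` returns False, so that the start command is not run against a broker that is still shutting down.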
What happens to your SSH connection after the broker is stopped?
Based on the log line "pam_unix(sshd:session): session closed for user kfk", it looks like you lose the SSH session, unless you are killing it yourself.
If you are losing the session, then kafka-rolling-restart definitely won't be able to start the broker.
I have the same problem
/usr/lib/python2.7/site-packages/paramiko/kex_ecdh_nist.py:111: CryptographyDeprecationWarning: encode_point has been deprecated on EllipticCurvePublicNumbers and will be removed in a future version. Please use EllipticCurvePublicKey.public_bytes to obtain both compressed and uncompressed point encoding.
Broker 172.26.0.250 is down: HTTPConnectionPool(host='172.26.0.250', port=8778): Max retries exceeded with url: /jolokia//read/kafka.server:name=UnderReplicatedPartitions,type=ReplicaManager/Value (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f3e90c30850>: Failed to establish a new connection: [Errno 111] Connection refused',)).This maybe because it is starting up