Scheduler can get into a zombie state when trying to shut down
mishamo opened this issue · 2 comments
mishamo commented
We have observed the following behaviour when the Scheduler cannot communicate with Kafka (e.g. Kafka is down):
Sometimes the restart works as expected, i.e. we see the following in the logs (newest entries first):
| February 15th 2018, 14:03:47.646 | Reader stream has died | ERROR
| February 15th 2018, 14:03:47.641 | Message [akka.kafka.KafkaConsumerActor$Internal$Stop$] without sender to Actor[akka://kafka-message-scheduler/system/kafka-consumer-1#-104435897] was not delivered. [1] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'. | INFO
| February 15th 2018, 14:03:47.580 | WakeupException limit exceeded, stopping. | ERROR
| February 15th 2018, 14:03:44.484 | Consumer interrupted with WakeupException after timeout. Message: null. Current value of akka.kafka.consumer.wakeup-timeout is 3000 milliseconds | WARN
| February 15th 2018, 14:03:41.393 | Consumer interrupted with WakeupException after timeout. Message: null. Current value of akka.kafka.consumer.wakeup-timeout is 3000 milliseconds | WARN
| February 15th 2018, 14:03:38.304 | Consumer interrupted with WakeupException after timeout. Message: null. Current value of akka.kafka.consumer.wakeup-timeout is 3000 milliseconds | WARN
However, sometimes, after an indeterminate number of restarts, we observe the following:
| February 15th 2018, 14:04:56.974 | WakeupException limit exceeded, stopping. | ERROR
| February 15th 2018, 14:04:53.878 | Consumer interrupted with WakeupException after timeout. Message: null. Current value of akka.kafka.consumer.wakeup-timeout is 3000 milliseconds | WARN
| February 15th 2018, 14:04:50.788 | Consumer interrupted with WakeupException after timeout. Message: null. Current value of akka.kafka.consumer.wakeup-timeout is 3000 milliseconds | WARN
After this, no more messages are processed. Unlike the successful case, the "Reader stream has died" error is never logged, and the application is left in what is essentially a zombie state: it is still running but not processing any traffic.
lacarvalho91 commented
lacarvalho91 commented
I managed to replicate this locally and looked into the link above. I tried removing Kamon, and with that change the app shut down properly.
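As a stopgap for the zombie state described above, one option is to put a deadline on the shutdown and force the JVM to exit if the deadline passes. The following is only a minimal sketch under stated assumptions, not the scheduler's actual code: `ShutdownWatchdog` and `stopWithin` are hypothetical names, and the stop action stands in for whatever the application really does to terminate its stream and actor system.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: not part of the scheduler, Akka, or Kamon.
public class ShutdownWatchdog {

    /**
     * Runs stopAction on a daemon thread and waits up to timeoutMillis for it.
     * Returns true if the action completed in time, false if it timed out,
     * threw an exception, or the waiting thread was interrupted.
     */
    public static boolean stopWithin(Runnable stopAction, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "shutdown-watchdog");
            t.setDaemon(true); // the watchdog must not keep the JVM alive itself
            return t;
        });
        Future<?> result = executor.submit(stopAction);
        try {
            result.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            result.cancel(true); // interrupt the stuck stop action
            return false;
        } catch (ExecutionException e) {
            return false; // the stop action itself failed
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return false;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

A caller could then do something like `if (!ShutdownWatchdog.stopWithin(stop, 10_000)) Runtime.getRuntime().halt(1);`, so that an external supervisor restarts the process instead of leaving it running idle. The 10 second deadline is an arbitrary illustrative value. This only masks the symptom; finding what keeps the application alive is still the real fix.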