sky-uk/kafka-message-scheduler

Scheduler can get into a zombie state when trying to shut down

mishamo opened this issue · 2 comments

We have observed the following behaviour when the Scheduler cannot communicate with Kafka (e.g. Kafka is down):

Sometimes the restart works as expected, i.e. we see the following in logs:


  | February 15th 2018, 14:03:47.646 | Reader stream has died | ERROR

  | February 15th 2018, 14:03:47.641 | Message [akka.kafka.KafkaConsumerActor$Internal$Stop$] without sender to Actor[akka://kafka-message-scheduler/system/kafka-consumer-1#-104435897] was not delivered. [1] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'. | INFO

  | February 15th 2018, 14:03:47.580 | WakeupException limit exceeded, stopping. | ERROR

  | February 15th 2018, 14:03:44.484 | Consumer interrupted with WakeupException after timeout. Message: null. Current value of akka.kafka.consumer.wakeup-timeout is 3000 milliseconds | WARN

  | February 15th 2018, 14:03:41.393 | Consumer interrupted with WakeupException after timeout. Message: null. Current value of akka.kafka.consumer.wakeup-timeout is 3000 milliseconds | WARN

  | February 15th 2018, 14:03:38.304 | Consumer interrupted with WakeupException after timeout. Message: null. Current value of akka.kafka.consumer.wakeup-timeout is 3000 milliseconds | WARN

However, sometimes, after an indeterminate number of restarts we observe the following:


  | February 15th 2018, 14:04:56.974 | WakeupException limit exceeded, stopping. | ERROR

  | February 15th 2018, 14:04:53.878 | Consumer interrupted with WakeupException after timeout. Message: null. Current value of akka.kafka.consumer.wakeup-timeout is 3000 milliseconds | WARN

  | February 15th 2018, 14:04:50.788 | Consumer interrupted with WakeupException after timeout. Message: null. Current value of akka.kafka.consumer.wakeup-timeout is 3000 milliseconds | WARN

After this, no more messages are processed, and unlike in the first case no "Reader stream has died" entry follows. The Scheduler is then essentially in a zombie state: the application is still running but is not processing any traffic.

I managed to replicate this locally and looked into the link above. After removing Kamon, the app shut down properly.
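
For anyone trying to reproduce this, one thing that may be worth checking is which threads are still alive once the stream has stopped. A common reason for a JVM to stay up after the main work has ended is a lingering non-daemon thread. Whether that is what happens here with Kamon is an assumption, not something confirmed in this issue. The sketch below is a minimal, standalone diagnostic using only the JDK; the class name `NonDaemonThreads` and the thread name `hypothetical-lingering-worker` are made up for illustration and are not part of the Scheduler.

```java
import java.util.List;
import java.util.stream.Collectors;

public class NonDaemonThreads {

    // Names of all live threads that are NOT daemon threads.
    // The JVM will not exit on its own while any of these are running.
    static List<String> liveNonDaemonThreadNames() {
        return Thread.getAllStackTraces().keySet().stream()
            .filter(t -> t.isAlive() && !t.isDaemon())
            .map(Thread::getName)
            .sorted()
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for a library thread that outlives shutdown.
        Thread lingering = new Thread(() -> {
            try {
                Thread.sleep(60_000);
            } catch (InterruptedException e) {
                // Interrupted: let the thread finish.
            }
        }, "hypothetical-lingering-worker");
        lingering.start();

        // Print the threads that would keep the JVM alive at this point.
        System.out.println(liveNonDaemonThreadNames());

        // Stop the stand-in thread so this demo itself can exit.
        lingering.interrupt();
    }
}
```

Calling something like `liveNonDaemonThreadNames()` from a shutdown hook or a debugger while the app is in the zombie state, or taking a thread dump with `jstack`, should show whether any such threads remain and which library owns them.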