sky-uk/kafka-message-scheduler

Scheduler does not recover from Kafka downtime.

mishamo opened this issue · 6 comments

To reproduce:

  • Run scheduler
  • Stop Kafka
  • Start Kafka
  • Produce to scheduler topic

Expected result:

  • Message produced back to Kafka as per usual scheduler behaviour.

Actual result:

  • Message never produced back to Kafka.

I have been able to reproduce this issue in a test within SchedulerIntSpec as follows:

"recover from kafka being down" in withRunningSchedulerStream {
      kafkaServer.close()
      kafkaServer.startup()

      val (scheduleId, schedule) = (UUID.randomUUID().toString, random[Schedule].secondsFromNow(2))

      writeToKafka(ScheduleTopic, (scheduleId, schedule.toAvro))

      val cr = consumeFromKafka(schedule.topic, keyDeserializer = new ByteArrayDeserializer).head

      cr.key() should contain theSameElementsInOrderAs schedule.key
      cr.value() should contain theSameElementsInOrderAs schedule.value
      cr.timestamp() shouldBe schedule.timeInMillis +- tolerance.toMillis

      val latestMessageOnScheduleTopic = consumeLatestFromScheduleTopic

      latestMessageOnScheduleTopic.key() shouldBe scheduleId
      latestMessageOnScheduleTopic.value() shouldBe null
    }

This is likely caused by the Akka stream being killed and not restarting (further investigation required). A possible fix would be to use a BackoffSupervisor, as documented here: https://doc.akka.io/docs/akka-stream-kafka/current/errorhandling.html#failing-consumer
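
Something along these lines might work — a minimal sketch of restarting the consumer source with backoff via RestartSource (the stream-level analogue of a BackoffSupervisor), assuming a plain Alpakka Kafka consumer; the bootstrap address, group id, and topic are illustrative, not the scheduler's actual wiring:

import scala.concurrent.duration._
import akka.actor.ActorSystem
import akka.kafka.scaladsl.Consumer
import akka.kafka.{ConsumerSettings, Subscriptions}
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{RestartSource, Sink}
import org.apache.kafka.common.serialization.StringDeserializer

object RestartingConsumerSketch extends App {
  implicit val system: ActorSystem = ActorSystem("scheduler")
  implicit val mat: ActorMaterializer = ActorMaterializer()

  val consumerSettings =
    ConsumerSettings(system, new StringDeserializer, new StringDeserializer)
      .withBootstrapServers("localhost:9092")
      .withGroupId("kafka-message-scheduler")

  // Recreate the consumer source with exponential backoff whenever it fails,
  // e.g. while the Kafka brokers are down, instead of letting the stream die
  val restartingSource = RestartSource.withBackoff(
    minBackoff = 3.seconds,
    maxBackoff = 30.seconds,
    randomFactor = 0.2 // adds jitter so restarts don't synchronise
  ) { () =>
    Consumer.plainSource(consumerSettings, Subscriptions.topics("scheduler-topic"))
  }

  restartingSource.runWith(Sink.foreach(record => println(record.value())))
}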

Actually, having looked into the test example more, it looks like kafkaServer.startup() never completes in this case. Ignore the test above for the time being; we have, however, been able to reproduce this manually.

I think the intended behaviour is for the application to shut down completely and to let the Docker orchestrator keep restarting the app until Kafka is back up. Happy to discuss this.
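
That would look roughly like the following — a minimal sketch assuming the scheduler's main stream materialises a completion Future (streamDone here is hypothetical):

import scala.concurrent.Future
import scala.util.{Failure, Success}
import akka.Done
import akka.actor.{ActorSystem, CoordinatedShutdown}

object ShutdownOnStreamFailure {
  implicit val system: ActorSystem = ActorSystem("scheduler")
  import system.dispatcher

  // Hypothetical completion future of the scheduler's main stream
  val streamDone: Future[Done] = ???

  // If the stream fails (e.g. Kafka is unreachable), shut the whole app down
  // so Docker/Kubernetes can restart the container. With
  // akka.coordinated-shutdown.exit-jvm = on, this also exits the JVM.
  streamDone.onComplete {
    case Failure(e) =>
      system.log.error(e, "Scheduler stream failed, shutting down")
      CoordinatedShutdown(system).run(CoordinatedShutdown.UnknownReason)
    case Success(_) =>
      system.terminate()
  }
}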

I think we would be fine with that behaviour, but that isn't quite what we've observed. Will investigate further on our end and present our findings.

Based on our basic manual test, it looks like this hasn't resolved the issue. Steps to reproduce:

  • Run Kafka (we run this in a Docker container)
  • Run the scheduler (again, in a Docker container)
  • Have some kind of produce/consume test continuously running against the scheduler (our app healthcheck does this for us; a rough sketch follows this list)
  • Stop Kafka
  • Check the healthcheck (it should show Kafka and the scheduler as unhealthy)
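
For reference, a rough sketch of the produce/consume round trip such a healthcheck might perform, using plain kafka-clients 2.x; the bootstrap address, topic, and timeouts are illustrative, and a real check against the scheduler would produce a schedule and await its delivery:

import java.time.Duration
import java.util.concurrent.TimeUnit
import java.util.{Collections, Properties, UUID}
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.{StringDeserializer, StringSerializer}

object KafkaHealthcheckSketch {
  // Round-trip check: produce a probe message and try to consume it back.
  // Returns false if either step fails or times out.
  def kafkaRoundTripHealthy(bootstrapServers: String, topic: String): Boolean = {
    val probe = UUID.randomUUID().toString

    val producerProps = new Properties()
    producerProps.put("bootstrap.servers", bootstrapServers)
    val producer = new KafkaProducer(producerProps, new StringSerializer, new StringSerializer)

    val consumerProps = new Properties()
    consumerProps.put("bootstrap.servers", bootstrapServers)
    consumerProps.put("group.id", s"healthcheck-$probe")
    consumerProps.put("auto.offset.reset", "earliest")
    val consumer = new KafkaConsumer(consumerProps, new StringDeserializer, new StringDeserializer)

    try {
      consumer.subscribe(Collections.singletonList(topic))
      producer.send(new ProducerRecord(topic, probe, probe)).get(5, TimeUnit.SECONDS)
      // A real check would poll in a loop until a deadline; one poll keeps the sketch short
      consumer.poll(Duration.ofSeconds(5)).asScala.exists(_.key == probe)
    } catch {
      case _: Exception => false
    } finally {
      producer.close()
      consumer.close()
    }
  }
}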

Expected result:

  • Eventually the scheduler container should die, or the process should begin shutdown. (We would expect Kubernetes to then restart it in real environments; locally we would restart it manually.)

Actual result:

  • The scheduler does not die or begin shutdown.

  • We then brought Kafka back up and the scheduler was in an unhealthy state (i.e. producing to and consuming from it no longer worked).

Strange, I tested this locally and killing Kafka stopped the container. Will look into it; maybe the release didn't include the fix for some reason.

OK, let us know if you want us to look into it further or need any more details.