Scheduler does not recover from Kafka downtime.
mishamo opened this issue · 6 comments
To reproduce:
- Run scheduler
- Stop Kafka
- Start Kafka
- Produce to scheduler topic
Expected result:
- Message produced back to Kafka as per usual scheduler behaviour.
Actual result:
- Message never produced back to Kafka.
I have been able to reproduce this issue in a unit test within SchedulerIntSpec
as follows:
"recover from kafka being down" in withRunningSchedulerStream {
kafkaServer.close()
kafkaServer.startup()
val (scheduleId, schedule) = (UUID.randomUUID().toString, random[Schedule].secondsFromNow(2))
writeToKafka(ScheduleTopic, (scheduleId, schedule.toAvro))
val cr = consumeFromKafka(schedule.topic, keyDeserializer = new ByteArrayDeserializer).head
cr.key() should contain theSameElementsInOrderAs schedule.key
cr.value() should contain theSameElementsInOrderAs schedule.value
cr.timestamp() shouldBe schedule.timeInMillis +- tolerance.toMillis
val latestMessageOnScheduleTopic = consumeLatestFromScheduleTopic
latestMessageOnScheduleTopic.key() shouldBe scheduleId
latestMessageOnScheduleTopic.value() shouldBe null
}
This is likely caused by the Akka stream being killed and not restarted (further investigation required). A possible fix would be to use a BackoffSupervisor, as documented here: https://doc.akka.io/docs/akka-stream-kafka/current/errorhandling.html#failing-consumer
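The restart-with-backoff behaviour that page recommends can be sketched, independently of Akka, as a plain exponential-backoff retry loop. This is only an illustration of the principle (the names and parameters below are made up for the sketch and are not from the scheduler's codebase):

```scala
import scala.annotation.tailrec
import scala.util.{Failure, Success, Try}

object BackoffRetry {
  // Retry `op` with exponentially growing delays, capped at `maxBackoffMs`.
  // This mirrors the minBackoff/maxBackoff idea of Akka's BackoffSupervisor,
  // but as a self-contained sketch rather than the real supervisor API.
  def retryWithBackoff[A](minBackoffMs: Long, maxBackoffMs: Long, maxRetries: Int)(op: () => A): A = {
    @tailrec
    def loop(attempt: Int, backoffMs: Long): A =
      Try(op()) match {
        case Success(a) => a
        case Failure(_) if attempt < maxRetries =>
          Thread.sleep(backoffMs) // back off before retrying, doubling up to the cap
          loop(attempt + 1, math.min(backoffMs * 2, maxBackoffMs))
        case Failure(e) => throw e // give up after maxRetries and surface the error
      }
    loop(0, minBackoffMs)
  }
}

// Simulate Kafka being down for the first two attempts, then recovering.
var calls = 0
val value = BackoffRetry.retryWithBackoff(minBackoffMs = 1, maxBackoffMs = 8, maxRetries = 5) { () =>
  calls += 1
  if (calls < 3) throw new RuntimeException("kafka down")
  "connected"
}
```

In the real application the retried "operation" would be (re)materialising the consumer stream, which a BackoffSupervisor or `RestartSource` handles for you.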
Actually, having looked into the test example more, it looks like kafkaServer.startup() never completes in this case. Ignore the unit test for the time being; we have, however, been able to reproduce this manually.
I think the intended behaviour is for the application to shut down completely and to let the Docker orchestrator keep restarting the app until Kafka is back up. Happy to discuss this.
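That fail-fast behaviour can be sketched as follows, assuming the stream exposes a completion future. The `terminate` callback is hypothetical and only a parameter so the sketch can be exercised without killing the JVM; in the real app it would simply be `code => sys.exit(code)`, leaving the restart to Kubernetes/Docker:

```scala
import scala.concurrent.{Await, ExecutionContext, Future, Promise}
import scala.concurrent.duration._
import scala.util.{Failure, Success}
import ExecutionContext.Implicits.global

object FailFast {
  // When the stream's completion future fails (e.g. the Kafka consumer dies),
  // invoke `terminate` with a non-zero code so the orchestrator restarts the
  // container; on clean completion, exit with zero.
  def exitOnStreamFailure(streamDone: Future[Unit])(terminate: Int => Unit)(implicit ec: ExecutionContext): Unit =
    streamDone.onComplete {
      case Failure(_) => terminate(1)
      case Success(_) => terminate(0)
    }
}

// Simulate a stream that dies because Kafka went away.
val exited = Promise[Int]()
FailFast.exitOnStreamFailure(Future.failed(new RuntimeException("kafka down")))(code => exited.trySuccess(code))
val exitCode = Await.result(exited.future, 5.seconds)
```

The key design point is that the process never tries to limp along with a dead stream: it exits and lets the orchestrator's restart policy provide the retry loop.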
I think we would be fine with that behaviour, but that isn't quite what we've observed. Will investigate further on our end and present our findings.
Our basic manual test suggests this hasn't resolved the issue. Steps to reproduce:
- Run kafka (we run this in a docker container)
- Run scheduler (again, we run this in a docker container)
- Have some kind of produce/consume test continuously running against the scheduler (our app healthcheck does this for us)
- Stop kafka
- Check healthcheck (should show kafka and scheduler unhealthy)
Expected result:
- Eventually the scheduler container should die, or the process should begin shutdown. (We would expect Kubernetes to then restart it in real environments; locally we would do that manually.)
Actual result:
- Scheduler does not die or begin shutdown.
- We then brought kafka back up and the scheduler was in an unhealthy state (i.e. producing/consuming to it no longer worked).
Strange, I tested this locally and killing Kafka stopped the container. Will look into it; maybe the release didn't include the fix for some reason.
Ok, let us know if you want us to look into it further or need any more details.