sky-uk/kafka-message-scheduler

Scheduler does not recover from Kafka downtime.

mishamo opened this issue · 6 comments

To reproduce:

  • Run scheduler
  • Stop Kafka
  • Start Kafka
  • Produce to scheduler topic

Expected result:

  • Message produced back to Kafka as per usual scheduler behaviour.

Actual result:

  • Message never produced back to Kafka.

I have been able to reproduce this issue in a test within SchedulerIntSpec as follows:

"recover from kafka being down" in withRunningSchedulerStream {
      kafkaServer.close()
      kafkaServer.startup()

      val (scheduleId, schedule) = (UUID.randomUUID().toString, random[Schedule].secondsFromNow(2))

      writeToKafka(ScheduleTopic, (scheduleId, schedule.toAvro))

      val cr = consumeFromKafka(schedule.topic, keyDeserializer = new ByteArrayDeserializer).head

      cr.key() should contain theSameElementsInOrderAs schedule.key
      cr.value() should contain theSameElementsInOrderAs schedule.value
      cr.timestamp() shouldBe schedule.timeInMillis +- tolerance.toMillis

      val latestMessageOnScheduleTopic = consumeLatestFromScheduleTopic

      latestMessageOnScheduleTopic.key() shouldBe scheduleId
      latestMessageOnScheduleTopic.value() shouldBe null
    }

This is likely caused by the Akka stream being killed and not restarting (further investigation required). A possible fix would be to use a BackoffSupervisor, as documented here: https://doc.akka.io/docs/akka-stream-kafka/current/errorhandling.html#failing-consumer
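
Something along these lines might work — a minimal sketch of restarting the consumer source with backoff via RestartSource (the stream-level analogue of a BackoffSupervisor), assuming a plain Alpakka Kafka consumer; the bootstrap address, group id, and topic are illustrative, not the scheduler's actual wiring:

import scala.concurrent.duration._
import akka.actor.ActorSystem
import akka.kafka.scaladsl.Consumer
import akka.kafka.{ConsumerSettings, Subscriptions}
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{RestartSource, Sink}
import org.apache.kafka.common.serialization.StringDeserializer

object RestartingConsumerSketch extends App {
  implicit val system: ActorSystem = ActorSystem("scheduler")
  implicit val mat: ActorMaterializer = ActorMaterializer()

  val consumerSettings =
    ConsumerSettings(system, new StringDeserializer, new StringDeserializer)
      .withBootstrapServers("localhost:9092")
      .withGroupId("kafka-message-scheduler")

  // Recreate the consumer source with exponential backoff whenever it fails,
  // e.g. while the Kafka brokers are down, instead of letting the stream die
  val restartingSource = RestartSource.withBackoff(
    minBackoff = 3.seconds,
    maxBackoff = 30.seconds,
    randomFactor = 0.2 // adds jitter so restarts don't synchronise
  ) { () =>
    Consumer.plainSource(consumerSettings, Subscriptions.topics("scheduler-topic"))
  }

  restartingSource.runWith(Sink.foreach(record => println(record.value())))
}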

Actually, having looked into the test example more, it looks like kafkaServer.startup() never completes in this case. Ignore the test above for the time being; we have, however, been able to reproduce this manually.

I think the intended behaviour is for the application to shut down completely and to let the Docker orchestrator keep restarting the app until Kafka is back up. Happy to discuss this.
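
That would look roughly like the following — a minimal sketch assuming the scheduler's main stream materialises a completion Future (streamDone here is hypothetical):

import scala.concurrent.Future
import scala.util.{Failure, Success}
import akka.Done
import akka.actor.{ActorSystem, CoordinatedShutdown}

object ShutdownOnStreamFailure {
  implicit val system: ActorSystem = ActorSystem("scheduler")
  import system.dispatcher

  // Hypothetical completion future of the scheduler's main stream
  val streamDone: Future[Done] = ???

  // If the stream fails (e.g. Kafka is unreachable), shut the whole app down
  // so Docker/Kubernetes can restart the container. With
  // akka.coordinated-shutdown.exit-jvm = on, this also exits the JVM.
  streamDone.onComplete {
    case Failure(e) =>
      system.log.error(e, "Scheduler stream failed, shutting down")
      CoordinatedShutdown(system).run(CoordinatedShutdown.UnknownReason)
    case Success(_) =>
      system.terminate()
  }
}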

I think we would be fine with that behaviour, but that isn't quite what we've observed. Will investigate further on our end and present our findings.

Based on our basic manual test, it looks like this hasn't resolved the issue. Steps to reproduce:

  • Run Kafka (we run this in a Docker container)
  • Run the scheduler (again, in a Docker container)
  • Have some kind of produce/consume test continuously running against the scheduler (our app healthcheck does this for us; a rough sketch follows this list)
  • Stop Kafka
  • Check the healthcheck (it should show Kafka and the scheduler as unhealthy)
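
For reference, a rough sketch of the produce/consume round trip such a healthcheck might perform, using plain kafka-clients 2.x; the bootstrap address, topic, and timeouts are illustrative, and a real check against the scheduler would produce a schedule and await its delivery:

import java.time.Duration
import java.util.concurrent.TimeUnit
import java.util.{Collections, Properties, UUID}
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.{StringDeserializer, StringSerializer}

object KafkaHealthcheckSketch {
  // Round-trip check: produce a probe message and try to consume it back.
  // Returns false if either step fails or times out.
  def kafkaRoundTripHealthy(bootstrapServers: String, topic: String): Boolean = {
    val probe = UUID.randomUUID().toString

    val producerProps = new Properties()
    producerProps.put("bootstrap.servers", bootstrapServers)
    val producer = new KafkaProducer(producerProps, new StringSerializer, new StringSerializer)

    val consumerProps = new Properties()
    consumerProps.put("bootstrap.servers", bootstrapServers)
    consumerProps.put("group.id", s"healthcheck-$probe")
    consumerProps.put("auto.offset.reset", "earliest")
    val consumer = new KafkaConsumer(consumerProps, new StringDeserializer, new StringDeserializer)

    try {
      consumer.subscribe(Collections.singletonList(topic))
      producer.send(new ProducerRecord(topic, probe, probe)).get(5, TimeUnit.SECONDS)
      // A real check would poll in a loop until a deadline; one poll keeps the sketch short
      consumer.poll(Duration.ofSeconds(5)).asScala.exists(_.key == probe)
    } catch {
      case _: Exception => false
    } finally {
      producer.close()
      consumer.close()
    }
  }
}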

Expected result:

  • Eventually the scheduler container should die, or the process should begin shutdown. (We would expect Kubernetes to then restart it in real environments; locally we would restart it manually.)

Actual result:

  • The scheduler does not die or begin shutdown.

  • We then brought Kafka back up and the scheduler was in an unhealthy state (i.e. producing to and consuming from it no longer worked).

Strange, I tested this locally and killing Kafka stopped the container. Will look into it; maybe the release didn't include the fix for some reason.

OK, let us know if you want us to look into it further or need any more details.