NiFi scheduler restarts when Mesos master restarts
a-nldisr opened this issue · 3 comments
Today the leading Mesos master in our test cluster restarted. NiFi was the only framework scheduler that restarted as a result: it disconnected and did not attach to one of the other two master/ZooKeeper nodes. I asked on the DC/OS Slack, and mattj.mesosphere told me to file the issue here since NiFi was developed by a partner team.
Stderr logs:
E1121 09:04:23.426849 61 scheduler.cpp:701] Failed to decode the stream of events: Pipe::Reader failure: failed to decode body
I1121 09:04:23.432196 61 scheduler.cpp:470] Re-detecting master
I1121 09:04:23.432495 60 scheduler.cpp:496] New master detected at master@master01:5050
Scheduler exiting immediately with code: 5
I1121 09:04:23.793958 63 scheduler.cpp:470] Re-detecting master
I1121 09:04:23.794062 63 scheduler.cpp:496] New master detected at master@master01:5050
I1121 09:04:24.168836 3 executor.cpp:938] Command exited with status 5 (pid: 12)
I1121 09:04:25.170900 2 checker_process.cpp:247] Stopped HTTP health check for task 'nifi.xxxxxx-xxxx-xxxxx-xxxxxxxxxxxxx'
I1121 09:04:25.171465 11 process.cpp:887] Failed to accept socket: future discarded
stderr:
INFO 2018-11-21 09:04:21,014 [pool-7-thread-1] com.mesosphere.sdk.scheduler.AbstractScheduler:processQueuedOffers(398): Waiting for queued offers...
INFO 2018-11-21 09:04:23,469 [Thread-27710] com.mesosphere.mesos.HTTPAdapter.MesosToSchedulerDriverAdapter:disconnected(163): Disconnected!
INFO 2018-11-21 09:04:23,469 [Thread-27710] com.mesosphere.mesos.HTTPAdapter.MesosToSchedulerDriverAdapter:cancelHeartbeatTimer(365): Cancelling heartbeat timer upon disconnection
ERROR 2018-11-21 09:04:23,470 [Thread-27710] com.mesosphere.sdk.scheduler.AbstractScheduler:disconnected(370): Disconnected from Master, shutting down.
Scheduler exiting immediately with code: 5
INFO 2018-11-21 09:04:23,495 [Thread-1] com.mesosphere.sdk.scheduler.SchedulerRunner:lambda$run$0(110): Shutdown initiated, releasing curator lock
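To make the sequence easier to spot in a longer log, here is a minimal sketch that pulls out only the lines belonging to the disconnect, re-detect, and exit sequence. The patterns are copied from the log lines quoted above; the function name `disconnect_timeline` is hypothetical and not part of the scheduler or the SDK.

```python
import re

# Message fragments taken from the log excerpts in this issue.
PATTERNS = [
    r"Failed to decode the stream of events",
    r"Re-detecting master",
    r"New master detected at",
    r"Disconnected from Master, shutting down",
    r"Scheduler exiting immediately with code: \d+",
]
MATCHER = re.compile("|".join(PATTERNS))


def disconnect_timeline(lines):
    """Return, in original order, the lines that match any pattern above."""
    return [line.rstrip("\n") for line in lines if MATCHER.search(line)]
```

It accepts any iterable of strings, so an open file handle for a saved copy of the task's stderr can be passed in directly.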
I wanted to look into this and see if I could help out. I am also missing some configuration options (specifically the proxy settings in the properties file). It would help to have the GitHub repo the scheduler originates from, so that I could contribute, or at least debug and point at the cause.
Today we had a node failure in our test cluster; the failed node was running the NiFi scheduler. The scheduler stayed marked as healthy for more than 10 minutes, and the task was not staged until I marked the node as gone. The health checks should be looked at too.
Any update? This was related to NiFi 1.5. I wonder whether the upgrade to 1.7 fixed it, but I cannot check since there is no public repo for this.