Nifi scheduler restarts when mesos master restarts

Question

a-nldisr opened this issue 6 years ago · 3 comments

Today the leading Mesos master in our test cluster restarted. NiFi was the only framework scheduler that restarted as a result: it disconnected and did not attach to one of the other two master/ZooKeeper nodes. I asked on the DC/OS Slack, and mattj.mesosphere suggested filing the issue here, since the NiFi framework was developed by a partner team.

Stderr logs:

E1121 09:04:23.426849    61 scheduler.cpp:701] Failed to decode the stream of events: Pipe::Reader failure: failed to decode body
I1121 09:04:23.432196    61 scheduler.cpp:470] Re-detecting master
I1121 09:04:23.432495    60 scheduler.cpp:496] New master detected at master@master01:5050
Scheduler exiting immediately with code: 5
I1121 09:04:23.793958    63 scheduler.cpp:470] Re-detecting master
I1121 09:04:23.794062    63 scheduler.cpp:496] New master detected at master@master01:5050
I1121 09:04:24.168836     3 executor.cpp:938] Command exited with status 5 (pid: 12)
I1121 09:04:25.170900     2 checker_process.cpp:247] Stopped HTTP health check for task 'nifi.xxxxxx-xxxx-xxxxx-xxxxxxxxxxxxx'
I1121 09:04:25.171465    11 process.cpp:887] Failed to accept socket: future discarded

stderr:

INFO  2018-11-21 09:04:21,014 [pool-7-thread-1] com.mesosphere.sdk.scheduler.AbstractScheduler:processQueuedOffers(398): Waiting for queued offers...
INFO  2018-11-21 09:04:23,469 [Thread-27710] com.mesosphere.mesos.HTTPAdapter.MesosToSchedulerDriverAdapter:disconnected(163): Disconnected!
INFO  2018-11-21 09:04:23,469 [Thread-27710] com.mesosphere.mesos.HTTPAdapter.MesosToSchedulerDriverAdapter:cancelHeartbeatTimer(365): Cancelling heartbeat timer upon disconnection
ERROR 2018-11-21 09:04:23,470 [Thread-27710] com.mesosphere.sdk.scheduler.AbstractScheduler:disconnected(370): Disconnected from Master, shutting down.
Scheduler exiting immediately with code: 5
INFO  2018-11-21 09:04:23,495 [Thread-1] com.mesosphere.sdk.scheduler.SchedulerRunner:lambda$run$0(110): Shutdown initiated, releasing curator lock
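
To make the timing explicit, here is a minimal sketch that parses the scheduler log lines quoted above and measures the gap between the disconnect and the shutdown message. The regular expression and the helper names (`parse`, `first`) are my own and only assume the line layout visible in this excerpt; they are not part of the SDK.

```python
import re
from datetime import datetime

# Lines copied from the scheduler log quoted in this issue.
LOG = """\
INFO  2018-11-21 09:04:23,469 [Thread-27710] com.mesosphere.mesos.HTTPAdapter.MesosToSchedulerDriverAdapter:disconnected(163): Disconnected!
ERROR 2018-11-21 09:04:23,470 [Thread-27710] com.mesosphere.sdk.scheduler.AbstractScheduler:disconnected(370): Disconnected from Master, shutting down.
Scheduler exiting immediately with code: 5
INFO  2018-11-21 09:04:23,495 [Thread-1] com.mesosphere.sdk.scheduler.SchedulerRunner:lambda$run$0(110): Shutdown initiated, releasing curator lock
"""

# Assumed layout: LEVEL, timestamp, [thread], class:method(line): message
LINE_RE = re.compile(
    r"^(?P<level>[A-Z]+)\s+(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"\[(?P<thread>[^\]]+)\] (?P<source>\S+): (?P<message>.*)$"
)


def parse(log):
    """Return one dict per line that matches the assumed layout."""
    events = []
    for line in log.splitlines():
        m = LINE_RE.match(line)
        if m is None:
            continue  # skips the bare "Scheduler exiting..." line
        events.append({
            "level": m["level"],
            "time": datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S,%f"),
            "source": m["source"],
            "message": m["message"],
        })
    return events


def first(events, needle):
    """Return the first event whose message contains `needle`."""
    return next(e for e in events if needle in e["message"])


events = parse(LOG)
disconnected = first(events, "Disconnected!")
shutdown = first(events, "shutting down")
gap_ms = (shutdown["time"] - disconnected["time"]).total_seconds() * 1000
print(f"shutdown logged {gap_ms:.0f} ms after disconnect")
```

On this excerpt the shutdown message follows the disconnect within the same millisecond range and on the same thread, with no reconnect attempt logged in between.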

Answer 1 · 2018-11-21T12:03:50.000Z

I wanted to look into this and see if I could help out. I am also missing some configuration options, specifically the proxy settings in the properties file. It would help if I could get the GitHub repo where the scheduler originates, either to contribute or at least to debug and point at the cause.

Answer 2 · 2018-12-03T12:27:10.000Z

Today we had a node failure in our test cluster; the failed node was running the NiFi scheduler. The scheduler was still marked as healthy for more than 10 minutes, and the task was not staged again until I marked the node as gone. The health checks should be looked at too.

Answer 3 · 2019-03-15T16:01:57.000Z

Any update? This was with NiFi 1.5. I wonder whether the upgrade to 1.7 fixed it, but I cannot check since there is no public repo for this.