mesosphere/universe

Nifi scheduler restarts when mesos master restarts

a-nldisr opened this issue · 3 comments

Today the leading Mesos master in our test cluster restarted. Nifi was the only framework scheduler that restarted as well: it disconnected and did not reattach to either of the other two master/ZooKeeper nodes. I asked on the DC/OS Slack, and mattj.mesosphere suggested filing the issue here since Nifi was developed by a partner team.

Stderr logs:

E1121 09:04:23.426849    61 scheduler.cpp:701] Failed to decode the stream of events: Pipe::Reader failure: failed to decode body
I1121 09:04:23.432196    61 scheduler.cpp:470] Re-detecting master
I1121 09:04:23.432495    60 scheduler.cpp:496] New master detected at master@master01:5050
Scheduler exiting immediately with code: 5
I1121 09:04:23.793958    63 scheduler.cpp:470] Re-detecting master
I1121 09:04:23.794062    63 scheduler.cpp:496] New master detected at master@master01:5050
I1121 09:04:24.168836     3 executor.cpp:938] Command exited with status 5 (pid: 12)
I1121 09:04:25.170900     2 checker_process.cpp:247] Stopped HTTP health check for task 'nifi.xxxxxx-xxxx-xxxxx-xxxxxxxxxxxxx'
I1121 09:04:25.171465    11 process.cpp:887] Failed to accept socket: future discarded

stderr:

INFO  2018-11-21 09:04:21,014 [pool-7-thread-1] com.mesosphere.sdk.scheduler.AbstractScheduler:processQueuedOffers(398): Waiting for queued offers...
INFO  2018-11-21 09:04:23,469 [Thread-27710] com.mesosphere.mesos.HTTPAdapter.MesosToSchedulerDriverAdapter:disconnected(163): Disconnected!
INFO  2018-11-21 09:04:23,469 [Thread-27710] com.mesosphere.mesos.HTTPAdapter.MesosToSchedulerDriverAdapter:cancelHeartbeatTimer(365): Cancelling heartbeat timer upon disconnection
ERROR 2018-11-21 09:04:23,470 [Thread-27710] com.mesosphere.sdk.scheduler.AbstractScheduler:disconnected(370): Disconnected from Master, shutting down.
Scheduler exiting immediately with code: 5
INFO  2018-11-21 09:04:23,495 [Thread-1] com.mesosphere.sdk.scheduler.SchedulerRunner:lambda$run$0(110): Shutdown initiated, releasing curator lock
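
To line up the two logs above on one timeline, something like the following might help. This is only a sketch: `parse_line` and `merge_timeline` are hypothetical helper names, the regexes assume the two line formats shown above (glog-style lines from the Mesos library and the Java scheduler's log lines), and since glog lines carry no year, the year is passed in as an assumption.

```python
import re
from datetime import datetime

# glog-style line, e.g. "E1121 09:04:23.426849    61 scheduler.cpp:701] message"
GLOG = re.compile(
    r"^([IWEF])(\d{2})(\d{2}) (\d{2}:\d{2}:\d{2}\.\d{6})\s+\d+ (\S+)\] (.*)$"
)
# Java scheduler line, e.g. "ERROR 2018-11-21 09:04:23,470 [Thread-1] Class:method(1): message"
JAVA = re.compile(
    r"^(INFO|WARN|ERROR|DEBUG)\s+(\d{4})-(\d{2})-(\d{2}) "
    r"(\d{2}:\d{2}:\d{2}),(\d{3}) \[[^\]]+\] (\S+): (.*)$"
)


def parse_line(line, year=2018):
    """Return (timestamp, level, source, message), or None if the line matches neither format."""
    m = GLOG.match(line)
    if m:
        level, month, day, clock, source, msg = m.groups()
        # glog omits the year, so the caller has to supply it (assumption).
        ts = datetime.strptime(f"{year}-{month}-{day} {clock}", "%Y-%m-%d %H:%M:%S.%f")
        return ts, level, source, msg
    m = JAVA.match(line)
    if m:
        level, y, mo, d, clock, millis, source, msg = m.groups()
        ts = datetime.strptime(f"{y}-{mo}-{d} {clock}.{millis}", "%Y-%m-%d %H:%M:%S.%f")
        return ts, level[0], source, msg
    return None


def merge_timeline(*logs, year=2018):
    """Parse every line of every log text and return the events sorted by timestamp."""
    events = []
    for log in logs:
        for line in log.splitlines():
            parsed = parse_line(line.strip(), year=year)
            if parsed is not None:  # unparsed lines (e.g. "Scheduler exiting ...") are skipped
                events.append(parsed)
    return sorted(events, key=lambda event: event[0])


if __name__ == "__main__":
    mesos_log = (
        "E1121 09:04:23.426849    61 scheduler.cpp:701] Failed to decode the stream "
        "of events: Pipe::Reader failure: failed to decode body"
    )
    java_log = (
        "ERROR 2018-11-21 09:04:23,470 [Thread-27710] "
        "com.mesosphere.sdk.scheduler.AbstractScheduler:disconnected(370): "
        "Disconnected from Master, shutting down."
    )
    for ts, level, source, msg in merge_timeline(java_log, mesos_log):
        print(ts.isoformat(timespec="microseconds"), level, source, msg)
```

Lines that match neither format, such as "Scheduler exiting immediately with code: 5", carry no timestamp and are dropped by this sketch.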

I wanted to look into this and see if I could help. I am also missing some configuration options, specifically the proxy settings in the properties file. It would help to have the GitHub repo where the scheduler originates, so I could contribute or at least debug and point at the cause.

Today we had a node failure in our test cluster; the failed node was running the Nifi scheduler. The scheduler was marked as healthy for more than 10 minutes, and the task was not staged until I marked the node as gone. The health checks should be looked at too.

Any update? This was with Nifi 1.5. I wonder whether the upgrade to 1.7 fixed it, but I cannot check since there is no public repo for this.