Intel-bigdata/SSM

Restarting the active server occasionally leaves it marked as dead

lipppppp opened this issue · 3 comments

After restarting the active server on ssm1, the service started normally. However, the node info page shows the status of ssm1 as dead, and cmdlets cannot run on ssm1. The problem is intermittent: it appears only after the restart has been repeated many times. When ssm1 was stopped, some error messages appeared in the log.

In this case, ssm1 is still shown as dead after the service on ssm1 is restarted, and the log shows no errors. It returns to normal only after the active server is restarted.

I cannot reproduce this issue. You can try to debug it. I don't think the exception reported during shutdown matters. HazelcastExecutorService#addMember adds the newly started SSM server and delivers a message to CmdletDispatcherHelper for further handling, so it may be a helpful starting point for your debugging.

OK, I will try to debug this process. I also found that the state of the standby server is sometimes normal, but in that case all tasks time out when there is no agent node in the cluster.