ClusterLabs/fence-agents

[Question] About adjusting stonith-timeout of fence_kdump.

HideoYamauchi opened this issue · 2 comments

Hi All,

We are building a configuration for fence_kdump.

The stonith-timeout of fence_kdump may be too short at 60s, because how long the Linux second (kdump) kernel takes to start communicating after boot depends on the hardware environment.
When that happens, fence_kdump times out, and the node is fenced in the middle of kdump acquisition by the next fence agent in the topology configuration.

I suspect many other people also have to adjust the stonith-timeout of fence_kdump because it is too short.
What are your criteria for determining the value of stonith-timeout?
Do you derive it from the boot log of the Linux second kernel? Or do you actually trigger a kdump on the cluster and check that the stonith-timeout value is sufficient?

I would love to hear your opinions.

Best Regards,
Hideo Yamauchi.

I'm not sure I've ever run into a situation (personally or with a customer) where the default fence_kdump timeout (60s) was insufficient. The panicked node should start sending fence_kdump_send messages to the surviving/listening node almost immediately upon booting into the kdump kernel. By default, the panicked node re-sends the message every 10 seconds until the vmcore dump is complete and the panicked node reboots.
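The timing described above can be sketched as a small model. This is only an illustration, not fence_kdump's actual logic: the function name and parameters are hypothetical, and it assumes that messages sent before the surviving node starts listening are simply lost and that no messages are dropped on the network.

```python
import math

def first_message_seen(send_start, listen_start, timeout, interval=10.0):
    """Return the time (seconds after the panic) at which the listener would
    first receive a message, or None if the listening window closes first.

    send_start   -- when the kdump kernel sends its first message (assumed)
    listen_start -- when the surviving node starts listening (assumed)
    timeout      -- how long the listener waits, e.g. 60
    interval     -- re-send interval of the panicked node, 10s by default
    """
    deadline = listen_start + timeout
    if send_start >= listen_start:
        # The very first message already falls inside the listening window.
        first = send_start
    else:
        # Earlier messages are missed; take the first re-send at or after
        # the moment the listener starts.
        k = math.ceil((listen_start - send_start) / interval)
        first = send_start + k * interval
    return first if first <= deadline else None

# Hypothetical slow machine: the kdump kernel needs 75s before it can send.
print(first_message_seen(send_start=75, listen_start=5, timeout=60))  # None
print(first_message_seen(send_start=75, listen_start=5, timeout=90))  # 75
```

With these made-up numbers, a 60s window closes at 65s, before the first message at 75s, while a 90s window catches it.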

With all that in mind, I think trial-and-error is the best approach (i.e., panic a node and see if fence_kdump "works" in time). There are sometimes other complications involving long corosync token timeouts (where the surviving node doesn't start listening for the fence_kdump_send message for a while), but that doesn't seem to be what you're asking about.

It is also not inherently harmful to increase the fence_kdump timeout to an arbitrarily high value, keeping in mind that you may also need to increase pcmk_reboot_timeout (or pcmk_off_timeout?) so that pacemaker doesn't kill the action prematurely. The only risk I can think of is that fencing/recovery will take longer if the failed node is not panicking, since fence_kdump listens until it either receives a message or the timeout expires.
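As a rough sanity check of the relationship between the two timeouts, something like the following could be used. It is a sketch under assumptions: the helper name and the 5-second safety margin are hypothetical, and it only encodes the idea that pacemaker's action timeout (pcmk_reboot_timeout or pcmk_off_timeout) should not expire before fence_kdump has finished listening.

```python
def pacemaker_timeout_ok(agent_timeout, pcmk_timeout, margin=5):
    """Return True if pacemaker waits at least `margin` seconds longer than
    fence_kdump listens, so the action is not killed prematurely.

    agent_timeout -- fence_kdump timeout in seconds
    pcmk_timeout  -- pcmk_reboot_timeout or pcmk_off_timeout in seconds
    margin        -- hypothetical safety margin, not a documented value
    """
    return pcmk_timeout >= agent_timeout + margin

# Default 60s agent timeout with a 90s pacemaker action timeout.
print(pacemaker_timeout_ok(agent_timeout=60, pcmk_timeout=90))  # True
# Agent timeout raised to 90s without raising the pacemaker timeout.
print(pacemaker_timeout_ok(agent_timeout=90, pcmk_timeout=90))  # False
```

The trade-off mentioned above still applies: when the failed node is not panicking, fence_kdump listens for the full agent timeout before the next fence level is tried.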

Hi Reid,

Thank you for your valuable opinion.

We are encountering this situation with two types of HP servers.
The problem seems to be that the kdump side takes a long time to start sending.
For now, fence_kdump works properly after extending pcmk_off_timeout to 90 seconds.

In the end, it seems there is no choice but to proceed by trial and error.

Many thanks,
Hideo Yamauchi.