[Question] About adjusting stonith-timeout of fence_kdump.

HideoYamauchi opened this issue 3 years ago · 2 comments

Hi All,

We are building a configuration for fence_kdump.

The stonith-timeout of fence_kdump may be too short at 60s, because the time it takes the second (kdump) kernel to start communicating after it boots varies with the hardware environment.
When that happens, fence_kdump times out, and the node is fenced in the middle of kdump acquisition by the next fence agent in the topology configuration.

I suspect many other people are also adjusting the stonith-timeout of fence_kdump because the default is too short.
What are your criteria for determining the value of stonith-timeout?
Do you derive it from the boot log of the second kernel? Or do you actually trigger a kdump on the cluster and check that the stonith-timeout value is sufficient?

I would love to hear your opinions.

Best Regards,
Hideo Yamauchi.

Answer 1 · 2021-10-11T18:12:24.000Z

I'm not sure I've ever run into a situation (personally or with a customer) where the default fence_kdump timeout (60s) was insufficient. The panicked node should start sending fence_kdump_send messages to the surviving/listening node almost immediately upon booting into the kdump kernel. By default, the panicked node re-sends the message every 10 seconds until the vmcore dump is complete and the panicked node reboots.

With all that in mind, I think trial-and-error is the best approach (i.e., panic a node and see if fence_kdump "works" in time). There are sometimes other complications involving long corosync token timeouts (where the surviving node doesn't start listening for the fence_kdump_send message for a while), but that doesn't seem to be what you're asking about.
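
To make the timing concrete, here is a minimal Python sketch of the model described above. It is a simplification, not fence_kdump's actual logic: it assumes the timeout is counted from the moment the surviving node starts listening, and the function names and all numbers are hypothetical.

```python
import math


def seconds_until_message_seen(kdump_boot_delay, listen_start, send_interval=10.0):
    """Seconds the listener waits before it can see a fence_kdump_send message.

    kdump_boot_delay: seconds from the panic until the kdump kernel sends
        its first message (hypothetical, hardware dependent).
    listen_start: seconds from the panic until the surviving node starts
        listening (hypothetical, e.g. delayed by a long corosync token timeout).
    send_interval: resend interval; 10 s is the default mentioned above.
    """
    if kdump_boot_delay >= listen_start:
        # The listener is already waiting when the first message goes out.
        first_seen = kdump_boot_delay
    else:
        # Earlier messages were missed; the listener catches the next resend.
        resends = math.ceil((listen_start - kdump_boot_delay) / send_interval)
        first_seen = kdump_boot_delay + resends * send_interval
    return first_seen - listen_start


def fits_timeout(kdump_boot_delay, listen_start, timeout=60.0, send_interval=10.0):
    """True if a message arrives before the listener's timeout expires."""
    waited = seconds_until_message_seen(kdump_boot_delay, listen_start, send_interval)
    return waited < timeout


# Hypothetical slow-booting kdump kernel: first message 75 s after the panic,
# listener started 5 s after the panic.
slow_boot_ok_at_60 = fits_timeout(75, 5, timeout=60)
slow_boot_ok_at_90 = fits_timeout(75, 5, timeout=90)
```

In this model the only inputs that matter are how long the kdump kernel takes to start sending and when the listener starts, and both have to be measured on the real hardware by panicking a node.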

It is also not inherently harmful to increase the fence_kdump timeout to an arbitrarily high value (keeping in mind that you may also need to increase pcmk_reboot_timeout, or perhaps pcmk_off_timeout, so that pacemaker doesn't kill the action prematurely). The only risk I can think of is that fencing/recovery will take longer when the failed node is not panicking, since fence_kdump will listen until either it receives a message or the timeout expires.
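
A small Python sketch of that trade-off, under stated assumptions: the helper names are hypothetical, the next-level duration is a made-up figure, and the model ignores any other delays pacemaker may add.

```python
def action_timeout_is_sufficient(fence_kdump_timeout, pcmk_action_timeout):
    """True if pacemaker's action timeout (pcmk_reboot_timeout or
    pcmk_off_timeout, whichever applies) is at least as long as the time
    fence_kdump may spend listening, so the action is not killed early."""
    return pcmk_action_timeout >= fence_kdump_timeout


def worst_case_fencing_delay(fence_kdump_timeout, next_level_seconds):
    """Delay when the failed node is NOT panicking: fence_kdump listens for
    its whole timeout, then the next agent in the topology has to run."""
    return fence_kdump_timeout + next_level_seconds


# Hypothetical values: a 90 s fence_kdump timeout and a next-level agent
# that needs 20 s to fence the node.
sufficient = action_timeout_is_sufficient(fence_kdump_timeout=90, pcmk_action_timeout=90)
delay = worst_case_fencing_delay(fence_kdump_timeout=90, next_level_seconds=20)
```

Raising the timeout costs nothing when the node really is dumping, because fence_kdump returns as soon as a message arrives. The extra seconds are only paid in the non-panic case.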

Answer 2 · 2021-10-11T23:41:18.000Z

Hi Reid,

Thank you for your valuable opinion.

We are encountering this situation with two types of HP servers.
The problem seems to be that it takes a long time for the kdump side to start sending.
For now, fence_kdump is working properly by extending pcmk_off_timeout to 90 seconds.

In the end, it seems there is no choice but to work it out by trial and error.

Many thanks,
Hideo Yamauchi.