voxpupuli/puppet-corosync

Constantly failing beaker tests

Closed this issue · 3 comments

Current pull requests keep failing due to broken or highly unstable beaker tests.

For example:

  • The ubuntu1604 test worked on travis-ci.org but failed on travis-ci.com for the same branch/commit.
  • The centos7 test fails on travis-ci.org and travis-ci.com but works locally, so I cannot even debug this.
  • The debian9 test fails everywhere, for which I seem to have a workaround.

I cannot figure out the root cause, but I now think that the current debian9 workaround just got lucky. This is not an sd_notify issue but rather bad timing: switching from Type=notify to Type=simple merely shifted certain events which, under certain circumstances, lead to service blocking. I am also no longer sure that this is a docker-only issue.
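For context, the debian9 workaround in question amounts to overriding the service type via a systemd drop-in, roughly like this (the exact file path, and that corosync.service is the unit carrying Type=notify on debian9, are my assumptions):

```ini
# /etc/systemd/system/corosync.service.d/type.conf  (path assumed)
[Service]
Type=simple
```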

I was able to reproduce the error locally under centos7 once or twice. The symptoms are the same as under debian9, even though on centos7 neither pacemaker.service nor corosync.service uses Type=notify. The details are as follows:
The pacemaker process tree looks like this:

root      4655  0.0  0.0 104280  9416 ?        Ss   22:02   0:00 /usr/sbin/pacemakerd -f
haclust+  4660  0.3  0.1 108996 22452 ?        Ss   22:02   0:03  \_ /usr/libexec/pacemaker/cib
root      4661  0.0  0.0 107188 10928 ?        Ss   22:02   0:00  \_ /usr/libexec/pacemaker/stonithd
root      4662  0.0  0.0  98832  8560 ?        Ss   22:02   0:00  \_ /usr/libexec/pacemaker/lrmd
haclust+  4663  0.0  0.0 128840 11992 ?        Ss   22:02   0:00  \_ /usr/libexec/pacemaker/attrd
haclust+  4664  0.2  0.1 118624 25768 ?        Ss   22:02   0:03  \_ /usr/libexec/pacemaker/pengine
haclust+  4665  0.1  0.1 182836 20980 ?        Ss   22:02   0:01  \_ /usr/libexec/pacemaker/crmd

During the error condition, the pacemaker service is deactivating (stop-sigterm). When one sends pacemakerd additional SIGTERMs, either manually or via pacemakerd -S, they are logged like this:

Dec 16 22:27:39 centos7-64-1 crmd[22951]:   notice: Transition 40 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemak
Dec 16 22:27:39 centos7-64-1 crmd[22951]:   notice: State transition S_TRANSITION_ENGINE -> S_IDLE
Dec 16 22:27:39 centos7-64-1 pacemakerd[22938]:   notice: Shutting down in response to ticket quit-pacemakerd-1544999259-87 (main)
Dec 16 22:27:41 centos7-64-1 pacemakerd[22938]:   notice: Shutting down in response to ticket quit-pacemakerd-1544999261-88 (main)
Dec 16 22:27:43 centos7-64-1 pacemakerd[22938]:   notice: Shutting down in response to ticket quit-pacemakerd-1544999263-89 (main)
Dec 16 22:27:45 centos7-64-1 pacemakerd[22938]:   notice: Shutting down in response to ticket quit-pacemakerd-1544999265-90 (main)
Dec 16 22:27:45 centos7-64-1 pacemakerd[22938]:   notice: Still waiting for crmd to terminate
Dec 16 22:27:47 centos7-64-1 pacemakerd[22938]:   notice: Shutting down in response to ticket quit-pacemakerd-1544999267-91 (main)
Dec 16 22:27:49 centos7-64-1 pacemakerd[22938]:   notice: Shutting down in response to ticket quit-pacemakerd-1544999269-92 (main)
Dec 16 22:27:51 centos7-64-1 pacemakerd[22938]:   notice: Shutting down in response to ticket quit-pacemakerd-1544999271-93 (main)
Dec 16 22:27:53 centos7-64-1 pacemakerd[22938]:   notice: Shutting down in response to ticket quit-pacemakerd-1544999273-94 (main)
Dec 16 22:27:55 centos7-64-1 pacemakerd[22938]:   notice: Shutting down in response to ticket quit-pacemakerd-1544999275-95 (main)
Dec 16 22:28:13 centos7-64-1 pacemakerd[22938]:   notice: Caught 'Terminated' signal
Dec 16 22:28:27 centos7-64-1 pacemakerd[22938]:   notice: Caught 'Terminated' signal
Dec 16 22:28:27 centos7-64-1 pacemakerd[22938]:   notice: Still waiting for crmd to terminate

The SIGTERMs are not propagated by pacemakerd to crmd. Instead, crmd seems to wait for corosync: if one sends a SIGTERM to the corosync process, both corosync and pacemaker (along with crmd and the other subprocesses) terminate "properly". However, systemd will not stop corosync before pacemaker, because pacemaker.service specifies After=corosync.service. The result is a cycle: on the systemd level, corosync waits for pacemaker to stop, pacemaker waits for crmd, and crmd waits for corosync to stop. In all my tests, blocking conditions were accompanied by pengine errors like

Dec 16 22:27:39 centos7-64-1 pengine[22950]:     crit: Cannot shut down node 'centos7-64-1' because of pgsql: unmanaged failed (pgsql_stop_0)
Dec 16 22:27:39 centos7-64-1 pengine[22950]:   notice: Calculated transition 40, saving inputs in /var/lib/pacemaker/pengine/pe-input-230.bz2
Dec 16 22:27:39 centos7-64-1 crmd[22951]:   notice: Transition 40 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemak
Dec 16 22:27:39 centos7-64-1 crmd[22951]:   notice: State transition S_TRANSITION_ENGINE -> S_IDLE

or

Dec 16 22:02:02 centos7-64-1 pengine[4443]:    error: Resource start-up disabled since no STONITH resources have been defined
Dec 16 22:02:02 centos7-64-1 pengine[4443]:    error: Either configure some or disable STONITH with the stonith-enabled option
Dec 16 22:02:02 centos7-64-1 pengine[4443]:    error: NOTE: Clusters with shared data need STONITH to ensure data integrity

However, the pengine errors alone did not imply a blocking condition; it seems to take a combination of bad timing and those errors.
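The wait-for cycle described above can be sketched as a tiny graph check. This is only an illustration of the reasoning, not pacemaker code; the three edges are the relationships observed in the logs:

```ruby
# Illustration only: the shutdown wait-for relationships described above,
# modeled as a directed graph. A cycle means no participant ever finishes
# stopping, which matches the observed hang.
WAITS_FOR = {
  'corosync.service' => 'pacemakerd',       # systemd stops pacemaker first (After=corosync.service)
  'pacemakerd'       => 'crmd',             # pacemakerd waits for its crmd child
  'crmd'             => 'corosync.service', # crmd appears to wait on corosync
}.freeze

# Follow the wait-for edges from `start`; revisiting a node means a cycle.
def deadlocked?(graph, start)
  seen = []
  node = start
  while graph.key?(node)
    return true if seen.include?(node)
    seen << node
    node = graph[node]
  end
  false
end
```

With the edges above, every participant is deadlocked; removing any one edge (e.g. by SIGKILLing pacemakerd, as the workaround below does) breaks the cycle.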

What is the purpose of the beaker tests? Should they reveal corosync/pacemaker errors? Currently they do not: there are several errors concerning failed actions while the corresponding beaker tests still appear successful, e.g.

Dec 16 22:55:31 ubuntu1604-64-1 pengine[2745]:  warning: Processing failed op start for nginx_vip on ubuntu1604-64-1: unknown error (1)
Dec 16 22:55:31 ubuntu1604-64-1 pengine[2745]:  warning: Processing failed op start for nginx_service on ubuntu1604-64-1: unknown error (1)
Dec 16 22:55:31 ubuntu1604-64-1 pengine[2745]:  warning: Forcing nginx_vip away from ubuntu1604-64-1 after 1000000 failures (max=1000000)
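
If such failures should instead fail the acceptance tests, one option would be to assert on `crm_mon --as-xml` output. A sketch (the XML shape, `<failures><failure op_key=... node=.../></failures>`, is an assumption based on the clusters tested here):

```ruby
require 'rexml/document'

# Sketch: extract failed resource operations from `crm_mon --as-xml`
# output, so a beaker test could assert that the list is empty.
def failed_ops(crm_mon_xml)
  doc = REXML::Document.new(crm_mon_xml)
  doc.get_elements('//failures/failure').map do |f|
    { op: f.attributes['op_key'], node: f.attributes['node'] }
  end
end
```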

If those errors are considered irrelevant, then I propose a last-resort workaround:

# Issue 455: There are recurring problems with the pacemaker systemd service
# refusing to stop its crmd subprocess, leading to test timeouts. Force a
# fast SIGKILL via a systemd drop-in.
def override_timeout_on(host)
  on host, 'mkdir -p /etc/systemd/system/pacemaker.service.d'
  on host, 'printf "[Service]\nSendSIGKILL=yes\nTimeoutStopSec=60s\n" > /etc/systemd/system/pacemaker.service.d/10-timeout.conf'
  # Make systemd pick up the new drop-in for an already-loaded unit.
  on host, 'systemctl daemon-reload'
end

override_timeout_on(host) if fact('os.family') == 'Debian' && ['9', '16.04'].include?(fact('os.release.major'))
override_timeout_on(host) if fact('os.family') == 'RedHat' && fact('os.release.major') == '7'

Alternatively, some higher TimeoutStopSec that is still lower than 10 minutes could be used.
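That alternative drop-in could look like this (the 5-minute value is just an example, not a tested recommendation):

```ini
# /etc/systemd/system/pacemaker.service.d/10-timeout.conf
[Service]
TimeoutStopSec=300s
```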

towo commented

FWIW, timing is definitely an issue: I can manually trigger builds and have different sets of beaker targets fail, while at other times the tests run through completely without problems.

My initial assumption when looking into this was that if pacemaker is restarted while systemd still considers it to be activating, something breaks in the dependency management, since a regular pacemaker restart works well enough in practice. I haven't verified that, though.

towo commented

This should all have been fixed by now; the current tests are definitely working, especially after moving to GitHub Actions. I'll take the liberty of closing this. Thanks for the input!