kubernetes-retired/kube-aws

etcdctl 3.2.7 endpoint health output defaults to stderr instead of stdout, breaking etcdadm checks

pedrobizzotto opened this issue · 4 comments

Hello,
I am creating a new cluster with kube-aws 0.14.2 and trying etcd v3.2.7.
Problem: no etcd snapshots are written to S3.
Checking the logs on the etcd instances, I saw that the cluster was not marked as healthy, despite working quite well. Since I have a cluster with kube-aws 0.14.1 and etcd v3.2.6 running well, I compared the sha1sum of etcdadm on both machines and confirmed that the file is the same.

Running some more tests, I discovered that etcdctl v3.2.7 sends the output of the endpoint health command to stderr, whereas previous versions sent it to stdout, as follows:

etcdctl 3.2.6 (Old behavior)

Redirecting stdout to /dev/null

# ETCDCTL_API=3 etcdctl --key="/etc/ssl/certs/etcd-client-key.pem" --cert="/etc/ssl/certs/etcd-client.pem" --cacert="/etc/ssl/certs/etcd-trusted-ca.pem" --endpoints="https://etcd0.mydomain.local:2379,https://etcd1.mydomain.local:2379,https://etcd2.mydomain.local:2379" endpoint health 1>/dev/null
2019-10-21 18:06:22.997387 W | pkg/flags: recognized environment variable ETCDCTL_CERT, but unused: shadowed by corresponding flag
2019-10-21 18:06:22.997452 W | pkg/flags: recognized environment variable ETCDCTL_CACERT, but unused: shadowed by corresponding flag
2019-10-21 18:06:22.997476 W | pkg/flags: unrecognized environment variable ETCDCTL_CA_FILE=/etc/ssl/certs/etcd-trusted-ca.pem
2019-10-21 18:06:22.997493 W | pkg/flags: unrecognized environment variable ETCDCTL_KEY_FILE=/etc/ssl/certs/etcd-client-key.pem
2019-10-21 18:06:22.997512 W | pkg/flags: unrecognized environment variable ETCDCTL_ENDPOINT=https://etcd2.mydomain.local:2379
2019-10-21 18:06:22.997529 W | pkg/flags: unrecognized environment variable ETCDCTL_CERT_FILE=/etc/ssl/certs/etcd-client.pem
2019-10-21 18:06:22.997544 W | pkg/flags: recognized environment variable ETCDCTL_KEY, but unused: shadowed by corresponding flag

Redirecting stderr to /dev/null

# ETCDCTL_API=3 etcdctl --key="/etc/ssl/certs/etcd-client-key.pem" --cert="/etc/ssl/certs/etcd-client.pem" --cacert="/etc/ssl/certs/etcd-trusted-ca.pem" --endpoints="https://etcd0.mydomain.local:2379,https://etcd1.mydomain.local:2379,https://etcd2.mydomain.local:2379" endpoint health 2>/dev/null
https://etcd2.mydomain.local:2379 is healthy: successfully committed proposal: took = 2.339582ms
https://etcd0.mydomain.local:2379 is healthy: successfully committed proposal: took = 3.21403ms
https://etcd1.mydomain.local:2379 is healthy: successfully committed proposal: took = 3.698586ms

Output of journalctl -u etcdadm-save.service

Oct 21 18:12:32 redacted.compute.internal systemd[1]: Starting etcd snapshot...
Oct 21 18:12:32 redacted.compute.internal systemctl[26860]: active
Oct 21 18:12:32 redacted.compute.internal etcdadm[26863]: declare -x ETCDCTL_CACERT="/etc/ssl/certs/etcd-trusted-ca.pem"
Oct 21 18:12:32 redacted.compute.internal etcdadm[26863]: declare -x ETCDCTL_CA_FILE="/etc/ssl/certs/etcd-trusted-ca.pem"
Oct 21 18:12:32 redacted.compute.internal etcdadm[26863]: declare -x ETCDCTL_CERT="/etc/ssl/certs/etcd-client.pem"
Oct 21 18:12:32 redacted.compute.internal etcdadm[26863]: declare -x ETCDCTL_CERT_FILE="/etc/ssl/certs/etcd-client.pem"
Oct 21 18:12:32 redacted.compute.internal etcdadm[26863]: declare -x ETCDCTL_KEY="/etc/ssl/certs/etcd-client-key.pem"
Oct 21 18:12:32 redacted.compute.internal etcdadm[26863]: declare -x ETCDCTL_KEY_FILE="/etc/ssl/certs/etcd-client-key.pem"
Oct 21 18:12:34 redacted.compute.internal etcdadm[26863]: true
Oct 21 18:12:36 redacted.compute.internal etcdadm[26863]: https://etcd0.mydomain.local:2379 is healthy: successfully committed proposal: took = 4.153001ms
Oct 21 18:12:38 redacted.compute.internal etcdadm[26863]: https://etcd1.mydomain.local:2379 is healthy: successfully committed proposal: took = 3.997093ms
Oct 21 18:12:40 redacted.compute.internal etcdadm[26863]: https://etcd2.mydomain.local:2379 is healthy: successfully committed proposal: took = 3.790349ms
Oct 21 18:12:40 redacted.compute.internal etcdadm[26863]: /opt/bin/etcdadm: info: cluster_is_unhealthy: quorum=2 healthy=3
Oct 21 18:12:40 redacted.compute.internal etcdadm[26863]: /opt/bin/etcdadm: info: cluster_is_unhealthy: cluster is healthy
Oct 21 18:12:41 redacted.compute.internal etcdadm[26863]: Snapshot saved at snapshots/etcd2.db
Oct 21 18:12:42 redacted.compute.internal etcdadm[26863]: 8033ef11, 12831188, 2557, 9.1 MB
Oct 21 18:12:43 redacted.compute.internal etcdadm[26863]: /opt/bin/etcdadm: info: member_upload_snapshot: uploading /var/run/coreos/etcdadm/snapshots/etcd2.db to s3://redacted/snapshot.db
Oct 21 18:12:45 redacted.compute.internal etcdadm[26863]: [1.9K blob data]
Oct 21 18:12:46 redacted.compute.internal etcdadm[26863]: /opt/bin/etcdadm: info: member_upload_snapshot: verifying the upload...
Oct 21 18:12:46 redacted.compute.internal etcdadm[26863]: /opt/bin/etcdadm: info: member_remote_snapshot_exists: checking existence of s3://redacted/snapshot.db
Oct 21 18:12:48 redacted.compute.internal etcdadm[26863]: 2019-10-21 18:12:45    9080864 snapshot.db
Oct 21 18:12:48 redacted.compute.internal etcdadm[26863]: /opt/bin/etcdadm: info: member_remote_snapshot_exists: s3://redacted/snapshot.db exists
Oct 21 18:12:48 redacted.compute.internal etcdadm[26863]: /opt/bin/etcdadm: info: member_remove_snapshot: removing write protected local snapshot file: /var/run/coreos/etcdadm/snapshots/etcd2.db
Oct 21 18:12:48 redacted.compute.internal systemd[1]: etcdadm-save.service: Succeeded.

etcdctl 3.2.7 (New behavior)

Redirecting stdout to /dev/null

# ETCDCTL_API=3 etcdctl --key="/etc/ssl/certs/etcd-client-key.pem" --cert="/etc/ssl/certs/etcd-client.pem" --cacert="/etc/ssl/certs/etcd-trusted-ca.pem" --endpoints="https://etcd0.mydomain.local:2379,https://etcd1.mydomain.local:2379,https://etcd2.mydomain.local:2379" endpoint health 1>/dev/null
2019-10-21 17:51:10.175228 W | pkg/flags: recognized environment variable ETCDCTL_CERT, but unused: shadowed by corresponding flag
2019-10-21 17:51:10.175275 W | pkg/flags: recognized environment variable ETCDCTL_CACERT, but unused: shadowed by corresponding flag
2019-10-21 17:51:10.175286 W | pkg/flags: unrecognized environment variable ETCDCTL_CA_FILE=/etc/ssl/certs/etcd-trusted-ca.pem
2019-10-21 17:51:10.175294 W | pkg/flags: unrecognized environment variable ETCDCTL_KEY_FILE=/etc/ssl/certs/etcd-client-key.pem
2019-10-21 17:51:10.175300 W | pkg/flags: unrecognized environment variable ETCDCTL_ENDPOINT=https://etcd2.mydomain.local:2379
2019-10-21 17:51:10.175305 W | pkg/flags: unrecognized environment variable ETCDCTL_CERT_FILE=/etc/ssl/certs/etcd-client.pem
2019-10-21 17:51:10.175308 W | pkg/flags: recognized environment variable ETCDCTL_KEY, but unused: shadowed by corresponding flag
https://etcd2.mydomain.local:2379 is healthy: successfully committed proposal: took = 11.617275ms
https://etcd0.mydomain.local:2379 is healthy: successfully committed proposal: took = 16.422628ms
https://etcd1.mydomain.local:2379 is healthy: successfully committed proposal: took = 15.911099ms

Redirecting stderr to /dev/null

# ETCDCTL_API=3 etcdctl --key="/etc/ssl/certs/etcd-client-key.pem" --cert="/etc/ssl/certs/etcd-client.pem" --cacert="/etc/ssl/certs/etcd-trusted-ca.pem" --endpoints="https://etcd0.mydomain.local:2379,https://etcd1.mydomain.local:2379,https://etcd2.mydomain.local:2379" endpoint health 2>/dev/null
[no output]

Output of journalctl -u etcdadm-save.service

Oct 21 17:54:39 redacted.compute.internal systemd[1]: Starting etcd snapshot...
Oct 21 17:54:39 redacted.compute.internal systemctl[3330]: active
Oct 21 17:54:39 redacted.compute.internal etcdadm[3332]: declare -x ETCDCTL_CACERT="/etc/ssl/certs/etcd-trusted-ca.pem"
Oct 21 17:54:39 redacted.compute.internal etcdadm[3332]: declare -x ETCDCTL_CA_FILE="/etc/ssl/certs/etcd-trusted-ca.pem"
Oct 21 17:54:39 redacted.compute.internal etcdadm[3332]: declare -x ETCDCTL_CERT="/etc/ssl/certs/etcd-client.pem"
Oct 21 17:54:39 redacted.compute.internal etcdadm[3332]: declare -x ETCDCTL_CERT_FILE="/etc/ssl/certs/etcd-client.pem"
Oct 21 17:54:39 redacted.compute.internal etcdadm[3332]: declare -x ETCDCTL_KEY="/etc/ssl/certs/etcd-client-key.pem"
Oct 21 17:54:39 redacted.compute.internal etcdadm[3332]: declare -x ETCDCTL_KEY_FILE="/etc/ssl/certs/etcd-client-key.pem"
Oct 21 17:54:40 redacted.compute.internal etcdadm[3332]: true
Oct 21 17:54:40 redacted.compute.internal etcdadm[3332]: https://etcd0.mydomain.local:2379 is healthy: successfully committed proposal: took = 3.500538ms
Oct 21 17:54:41 redacted.compute.internal etcdadm[3332]: https://etcd1.mydomain.local:2379 is healthy: successfully committed proposal: took = 3.8024ms
Oct 21 17:54:41 redacted.compute.internal etcdadm[3332]: https://etcd2.mydomain.local:2379 is healthy: successfully committed proposal: took = 1.681459ms
Oct 21 17:54:41 redacted.compute.internal etcdadm[3332]: /opt/bin/etcdadm: info: cluster_is_unhealthy: quorum=2 healthy=0
Oct 21 17:54:41 redacted.compute.internal etcdadm[3332]: /opt/bin/etcdadm: info: cluster_is_unhealthy: cluster is unhealthy
Oct 21 17:54:41 redacted.compute.internal etcdadm[3332]: /opt/bin/etcdadm: info: member_save_snapshot: cluster is not healthy. skipped taking snapshot because the cluster can be unhealthy due to the corrupted etcd data of members, including this member
Oct 21 17:54:41 redacted.compute.internal systemd[1]: etcdadm-save.service: Succeeded.

This behavior seems to affect the member_is_healthy function of etcdadm: the pipe only captures stdout, so once the health lines move to stderr the grep never matches and every member is counted as unhealthy (healthy=0 above):

member_is_healthy() {
  member_etcdctl endpoint health | grep "is healthy" 1>&2
}

Downgrading etcd to 3.2.6 solved the snapshot issue.
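
For reference, a minimal sketch of a possible workaround (just an illustration, not the fix adopted by kube-aws): merging stderr into the pipeline lets grep see the health lines whether etcdctl prints them to stdout (3.2.6) or stderr (3.2.7), while keeping grep's exit status as the health signal:

# sketch only: same function as above, with stderr merged into the pipe
member_is_healthy() {
  member_etcdctl endpoint health 2>&1 | grep "is healthy" 1>&2
}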

Please see this pull request and this pull request; I think those were the relevant changes in etcd.
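
As an aside, here is an illustrative way (not part of etcdadm) to check which stream a given etcdctl build uses for the health lines, by splitting stdout and stderr into separate files:

ETCDCTL_API=3 etcdctl \
  --key="/etc/ssl/certs/etcd-client-key.pem" \
  --cert="/etc/ssl/certs/etcd-client.pem" \
  --cacert="/etc/ssl/certs/etcd-trusted-ca.pem" \
  --endpoints="https://etcd0.mydomain.local:2379" \
  endpoint health 1>/tmp/health.out 2>/tmp/health.err
# per-file match counts show where the "is healthy" lines ended up
grep -c "is healthy" /tmp/health.out /tmp/health.err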

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.