kubernetes-retired/kube-aws

0.13-rc1 to 0.13-rc2 upgrade failed

paalkr opened this issue · 30 comments

etcd did not start, and the cfn-init service never sent the OK signal. I would imagine the updated etcd version in rc2 might be the issue, failing to upgrade?

+00:17:55       ip-10-1-43-193.eu-west-1.compute.internal: "etcdadm-reconfigure.service: Failed with result 'exit-code'."
+00:17:55       ip-10-1-43-193.eu-west-1.compute.internal: "Dependency failed for etcd (System Application Container)."
+00:17:55       ip-10-1-43-193.eu-west-1.compute.internal: "Dependency failed for etcdadm update status."
+00:17:55       ip-10-1-43-193.eu-west-1.compute.internal: "cfn-signal.service: Failed with result 'exit-code'."
+00:19:05       ip-10-1-43-193.eu-west-1.compute.internal: "etcdadm-reconfigure.service: Failed with result 'exit-code'."
+00:19:05       ip-10-1-43-193.eu-west-1.compute.internal: "Dependency failed for etcd (System Application Container)."
+00:19:05       ip-10-1-43-193.eu-west-1.compute.internal: "Dependency failed for etcdadm update status."
+00:19:05       ip-10-1-43-193.eu-west-1.compute.internal: "cfn-signal.service: Failed with result 'exit-code'."

The rollback also failed, never sending the cfn-init signal.

Why did this happen, and what can be done to recover?

So it seems impossible to recover this stack now.
I'm not allowed to continue a rollback on a nested stack, and I cannot skip a nested stack that is in a rollback-failed status when trying to continue the rollback on the parent stack. So a catch-22...


Hi, sorry to hear that - I haven't experienced this issue with rc2. Perhaps you could take a look at the etcd-member journalctl logs to see whether etcd is starting, or what it is erroring with? If you can manually get etcd to start then you can tell CloudFormation to continue the rollback.
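
For example, something along these lines (just a sketch, assuming the standard etcd-member unit name used on Container Linux):

$ journalctl -u etcd-member.service -n 200 --no-pager
$ sudo systemctl start etcd-member.service
$ journalctl -u etcd-member.service -f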

Thanks. First of all, this is only a test cluster, so no harm done. Second, the etcd node that failed was terminated by CloudFormation when the rollback started, so I have no way of examining that failed instance. The replacement/rollback etcd instance is of course running, but etcd has probably not started because cfn-init never executed.

How do I proceed to debug and get etcd started manually?

This is the log from the rollback instance

-- Logs begin at Fri 2019-06-21 05:30:25 UTC, end at Fri 2019-06-21 07:01:49 UTC. --
Jun 21 07:01:45 ip-10-1-43-240.eu-west-1.compute.internal systemd[1]: Stopped etcd (System Application Container).
Jun 21 07:01:45 ip-10-1-43-240.eu-west-1.compute.internal systemd[1]: etcd-member.service: Scheduled restart job, restart counter is at 2502.
Jun 21 07:01:45 ip-10-1-43-240.eu-west-1.compute.internal systemd[1]: etcd-member.service: Service RestartSec=10s expired, scheduling restart.
Jun 21 07:01:35 ip-10-1-43-240.eu-west-1.compute.internal systemd[1]: Failed to start etcd (System Application Container).
Jun 21 07:01:35 ip-10-1-43-240.eu-west-1.compute.internal systemd[1]: etcd-member.service: Failed with result 'exit-code'.
Jun 21 07:01:35 ip-10-1-43-240.eu-west-1.compute.internal systemd[1]: etcd-member.service: Main process exited, code=exited, status=1/FAILURE
Jun 21 07:01:35 ip-10-1-43-240.eu-west-1.compute.internal etcd-wrapper[15974]: 2019-06-21 07:01:35.372226 C | etcdmain: member dab84c2963a2830f has already been bootstrapped
Jun 21 07:01:35 ip-10-1-43-240.eu-west-1.compute.internal etcd-wrapper[15974]: 2019-06-21 07:01:35.361388 I | pkg/netutil: resolving ip-10-1-43-133.eu-west-1.compute.internal:2380 to 10.1.43.133:2380
Jun 21 07:01:35 ip-10-1-43-240.eu-west-1.compute.internal etcd-wrapper[15974]: 2019-06-21 07:01:35.361082 I | pkg/netutil: resolving ip-10-1-43-133.eu-west-1.compute.internal:2380 to 10.1.43.133:2380
Jun 21 07:01:35 ip-10-1-43-240.eu-west-1.compute.internal etcd-wrapper[15974]: 2019-06-21 07:01:35.349844 I | embed: listening for client requests on 10.1.43.133:2379
Jun 21 07:01:35 ip-10-1-43-240.eu-west-1.compute.internal etcd-wrapper[15974]: 2019-06-21 07:01:35.349792 I | embed: listening for peers on https://10.1.43.133:2380
Jun 21 07:01:35 ip-10-1-43-240.eu-west-1.compute.internal etcd-wrapper[15974]: 2019-06-21 07:01:35.348982 I | embed: peerTLS: cert = /etc/ssl/certs/etcd.pem, key = /etc/ssl/certs/etcd-key.pem, ca = , trusted-ca>
Jun 21 07:01:35 ip-10-1-43-240.eu-west-1.compute.internal etcd-wrapper[15974]: 2019-06-21 07:01:35.348961 N | etcdmain: the server is already initialized as member before, starting as etcd member...
Jun 21 07:01:35 ip-10-1-43-240.eu-west-1.compute.internal etcd-wrapper[15974]: 2019-06-21 07:01:35.348914 I | etcdmain: setting maximum number of CPUs to 2, total number of available CPUs is 2
Jun 21 07:01:35 ip-10-1-43-240.eu-west-1.compute.internal etcd-wrapper[15974]: 2019-06-21 07:01:35.348901 I | etcdmain: Go OS/Arch: linux/amd64
Jun 21 07:01:35 ip-10-1-43-240.eu-west-1.compute.internal etcd-wrapper[15974]: 2019-06-21 07:01:35.348893 I | etcdmain: Go Version: go1.8.5
Jun 21 07:01:35 ip-10-1-43-240.eu-west-1.compute.internal etcd-wrapper[15974]: 2019-06-21 07:01:35.348889 I | etcdmain: Git SHA: 95a726a
Jun 21 07:01:35 ip-10-1-43-240.eu-west-1.compute.internal etcd-wrapper[15974]: 2019-06-21 07:01:35.348884 I | etcdmain: etcd Version: 3.2.13
Jun 21 07:01:35 ip-10-1-43-240.eu-west-1.compute.internal etcd-wrapper[15974]: 2019-06-21 07:01:35.348844 W | pkg/flags: unrecognized environment variable ETCD_IMAGE_TAG=v3.2.13
Jun 21 07:01:35 ip-10-1-43-240.eu-west-1.compute.internal etcd-wrapper[15974]: 2019-06-21 07:01:35.348838 W | pkg/flags: unrecognized environment variable ETCD_USER=etcd
Jun 21 07:01:35 ip-10-1-43-240.eu-west-1.compute.internal etcd-wrapper[15974]: 2019-06-21 07:01:35.348818 I | pkg/flags: recognized and used environment variable ETCD_TRUSTED_CA_FILE=/etc/ssl/certs/etcd-trusted>
Jun 21 07:01:35 ip-10-1-43-240.eu-west-1.compute.internal etcd-wrapper[15974]: 2019-06-21 07:01:35.348802 I | pkg/flags: recognized and used environment variable ETCD_PEER_TRUSTED_CA_FILE=/etc/ssl/certs/etcd-tr>
Jun 21 07:01:35 ip-10-1-43-240.eu-west-1.compute.internal etcd-wrapper[15974]: 2019-06-21 07:01:35.348795 I | pkg/flags: recognized and used environment variable ETCD_PEER_KEY_FILE=/etc/ssl/certs/etcd-key.pem
Jun 21 07:01:35 ip-10-1-43-240.eu-west-1.compute.internal etcd-wrapper[15974]: 2019-06-21 07:01:35.348783 I | pkg/flags: recognized and used environment variable ETCD_PEER_CERT_FILE=/etc/ssl/certs/etcd.pem
Jun 21 07:01:35 ip-10-1-43-240.eu-west-1.compute.internal etcd-wrapper[15974]: 2019-06-21 07:01:35.348775 I | pkg/flags: recognized and used environment variable ETCD_NAME=etcd0

I managed to get the rollback to proceed. But the etcd0 node is now in a bad state.

I imagine that the upgrade service tried to upgrade the etcd0 node to a new version of etcd (part of 0.13-rc2), then failed to execute cfn-init (maybe because of underlying etcd problems, I don't know). The etcd data was probably upgraded. A rollback was executed by CloudFormation due to the failed cfn-init service. Then the etcd0 node created by the rollback refuses to start because the data has already been upgraded to a new etcd version. Is it possible to wipe all etcd data on a node, and let the node catch up by reading data from the two other etcd nodes in the cluster?

This is the log when I tried the upgrade again.

core@ip-10-1-43-119 ~ $ journalctl -xe
-- The job identifier is 6110 and the job result is dependency.
Jun 21 07:24:32 ip-10-1-43-119.eu-west-1.compute.internal systemd[1]: etcdadm-update-status.service: Job etcdadm-update-status.service/start failed with result 'depende>
Jun 21 07:24:32 ip-10-1-43-119.eu-west-1.compute.internal systemd[1]: etcd-member.service: Job etcd-member.service/start failed with result 'dependency'.
Jun 21 07:24:32 ip-10-1-43-119.eu-west-1.compute.internal systemd[1]: Starting cfn-signal.service...
-- Subject: A start job for unit cfn-signal.service has begun execution
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- A start job for unit cfn-signal.service has begun execution.
--
-- The job identifier is 6020.
Jun 21 07:24:32 ip-10-1-43-119.eu-west-1.compute.internal systemctl[6570]: inactive
Jun 21 07:24:32 ip-10-1-43-119.eu-west-1.compute.internal systemd[1]: cfn-signal.service: Control process exited, code=exited, status=3/NOTIMPLEMENTED
-- Subject: Unit process exited
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- An ExecStartPre= process belonging to unit cfn-signal.service has exited.
--
-- The process' exit code is 'exited' and its exit status is 3.
Jun 21 07:24:32 ip-10-1-43-119.eu-west-1.compute.internal systemd[1]: cfn-signal.service: Failed with result 'exit-code'.
-- Subject: Unit failed
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- The unit cfn-signal.service has entered the 'failed' state with result 'exit-code'.
Jun 21 07:24:32 ip-10-1-43-119.eu-west-1.compute.internal systemd[1]: Failed to start cfn-signal.service.
-- Subject: A start job for unit cfn-signal.service has failed
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- A start job for unit cfn-signal.service has finished with a failure.
--
-- The job identifier is 6020 and the job result is failed.

The upgrade looks pretty inconspicuous (fc29a9f), but this feels like a blocker if we're going to include the etcd upgrade in the v0.13.x release, especially if it's reproducible.

I agree. Etcd updates that are not backwards compatible are always dangerous, given the way CloudFormation handles rollbacks. I think etcd upgrades are something that a user should have to specifically initiate by specifying the new version in the cluster.yaml file.

Any tips on how to recover the etcd instance?

Is there only one out of the three etcd instances that has failed?

Yes. The stack update failed on etcd0, so etcd1 and etcd2 are untouched and still report as healthy.

The following should probably fix etcd0.

$ set -a; source /var/run/coreos/etcdadm-environment; source /etc/etcd-environment; source /var/run/coreos/etcdadm-environment; set +a
$ sudo systemctl stop etcd-member.service
$ /opt/bin/etcdadm replace
$ sudo systemctl start etcd-member.service
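
You can then confirm the member has rejoined and the cluster is healthy, e.g. (assuming etcdctl is available on the node and picks up the TLS settings from the environment files sourced above):

$ etcdctl cluster-health
$ etcdctl member list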

I thought that with the upgrade staying within the 3.2.x etcd branch it should remain compatible with the previous 3.2.x version. I had been told that etcd runs in compatibility mode until it sees that all members have been updated, and only then enables the new features. You can safely reclone etcd0 from the others by deleting the data and rejoining it to the cluster. I will keep an eye out for issues; so far I've not seen it when upgrading legacy versions to 0.13.0-rc3.
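
The generic etcd procedure for that looks roughly like this (a sketch, not kube-aws specific; the host data dir path is an assumption taken from the unit's rkt --volume coreos-data-dir mount, so verify it on your nodes first):

$ etcdctl member remove <old-member-id>      # run against a healthy member
$ sudo systemctl stop etcd-member.service    # on the broken node
$ sudo rm -rf /var/lib/etcd2/member          # wipe the stale WAL/snapshot state on the host
$ etcdctl member add etcd0 https://<etcd0-peer-hostname>:2380
$ sudo systemctl start etcd-member.service   # the member must now come up with ETCD_INITIAL_CLUSTER_STATE=existing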

On 21 Jun 2019, at 10:44, Dominic Gunn wrote: The upgrade looks pretty inconspicuous (fc29a9f), but this feels like a blocker if we're going to include the etcd upgrade in the v0.13.x release. Especially if it's reproducible.

Yeah, I thought the same @davidmccormick, which is why I thought it was inconspicuous. It looks like it shouldn't cause an issue, but I also trust @paalkr's reports. I'd be interested in finding out what exactly caused this.

You've had successful upgrades from v0.13.0 to v0.13.0-rc3, right? What version have you tried to update from, @paalkr?

This issue happened when trying to upgrade from 0.13.0-rc1 to 0.13.0-rc2. A colleague of mine will explain the actions he has carried out trying to recover the etcd node.

@dominicgunn , you are referring to 0.13.0-rc3? I may have missed something, but I can't seem to find that release anywhere.

I will also try to upgrade another dev cluster from 0.13.0-rc1, to see if the same thing happens to this cluster as well.

Apologies, I referred to 0.13.0-rc3 which I haven’t released - I have been testing by building the HEAD of the v0.13.x branch. I’m just waiting for a couple of changes before making it a release.

Let’s see if it happens on any other clusters

@dominicgunn I'm a colleague of @paalkr that has been looking into this. I've tried the commands you've listed above (basically the same as those listed here: https://kubernetes-incubator.github.io/kube-aws/advanced-topics/etcd-backup-and-restore.html#manually-restoring-a-permanently-failed-etcd-node-from-etcd-snapshot), and it does not fix the issue.

The log from etcd-member.service after running those commands is:

Jun 26 07:36:01 ip-10-1-43-48.eu-west-1.compute.internal systemd[1]: Stopped etcd (System Application Container).
Jun 26 07:36:01 ip-10-1-43-48.eu-west-1.compute.internal systemd[1]: etcd-member.service: Succeeded.
Jun 26 07:36:01 ip-10-1-43-48.eu-west-1.compute.internal rkt[15547]: pod "b52ab21d-4c61-4842-8bf2-6d519681dc7d" is already stopped
Jun 26 07:36:01 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:01.055355 E | etcdmain: forgot to set Type=notify in systemd service file?
Jun 26 07:36:01 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:01.053696 I | rafthttp: stopped peer c8a02416947a5853
Jun 26 07:36:01 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:01.053691 I | rafthttp: stopped streaming with peer c8a02416947a5853 (stream Message reader)
Jun 26 07:36:01 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:01.053677 I | rafthttp: stopped streaming with peer c8a02416947a5853 (stream MsgApp v2 reader)
Jun 26 07:36:01 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:01.053645 I | rafthttp: stopped HTTP pipelining with peer c8a02416947a5853
Jun 26 07:36:01 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:01.053628 I | rafthttp: stopped streaming with peer c8a02416947a5853 (writer)
Jun 26 07:36:01 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:01.053619 I | rafthttp: stopped streaming with peer c8a02416947a5853 (writer)
Jun 26 07:36:01 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:01.053609 I | rafthttp: stopping peer c8a02416947a5853...
Jun 26 07:36:01 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:01.053604 I | rafthttp: stopped peer 268d5cbd6382d8f2
Jun 26 07:36:01 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:01.053573 I | rafthttp: stopped streaming with peer 268d5cbd6382d8f2 (stream Message reader)
Jun 26 07:36:01 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:01.053565 I | rafthttp: stopped streaming with peer 268d5cbd6382d8f2 (stream MsgApp v2 reader)
Jun 26 07:36:01 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:01.053556 I | rafthttp: stopped HTTP pipelining with peer 268d5cbd6382d8f2
Jun 26 07:36:01 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:01.053539 I | rafthttp: stopped streaming with peer 268d5cbd6382d8f2 (writer)
Jun 26 07:36:01 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:01.053531 I | rafthttp: stopped streaming with peer 268d5cbd6382d8f2 (writer)
Jun 26 07:36:01 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:01.053496 I | rafthttp: stopping peer 268d5cbd6382d8f2...
Jun 26 07:36:01 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:01.053469 I | etcdserver: aborting publish because server is stopped
Jun 26 07:36:01 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:01.053454 E | etcdserver: publish error: etcdserver: request cancelled
Jun 26 07:36:01 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:01.053396 I | etcdserver: the data-dir used by this member must be removed.
Jun 26 07:36:01 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:01.053373 E | etcdserver: the member has been permanently removed from the cluster
Jun 26 07:36:01 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:01.049609 I | etcdserver/membership: added member dab84c2963a2830f [https://ip-10-1-43-133.eu-west-1.compute.internal:2380] to cluster e146e62db33b7661
Jun 26 07:36:01 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:01.049527 I | etcdserver/membership: added member c8a02416947a5853 [https://ip-10-1-45-117.eu-west-1.compute.internal:2380] to cluster e146e62db33b7661
Jun 26 07:36:01 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:01.049368 I | etcdserver/membership: added member 268d5cbd6382d8f2 [https://ip-10-1-44-127.eu-west-1.compute.internal:2380] to cluster e146e62db33b7661
Jun 26 07:36:01 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:01.042771 I | rafthttp: started streaming with peer c8a02416947a5853 (stream MsgApp v2 reader)
Jun 26 07:36:01 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:01.035834 I | rafthttp: started streaming with peer c8a02416947a5853 (stream Message reader)
Jun 26 07:36:01 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:01.035651 I | rafthttp: started streaming with peer c8a02416947a5853 (writer)
Jun 26 07:36:01 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:01.032264 I | embed: ClientTLS: cert = /etc/ssl/certs/etcd.pem, key = /etc/ssl/certs/etcd-key.pem, ca = , trusted-ca = /etc/ssl/certs/etcd-trusted-ca.pem, client-cert-auth = true
Jun 26 07:36:01 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:01.027408 I | etcdserver: starting server... [version: 3.2.13, cluster version: to_be_decided]
Jun 26 07:36:01 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:01.027316 I | rafthttp: added peer c8a02416947a5853
Jun 26 07:36:01 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:01.027214 I | rafthttp: started peer c8a02416947a5853
Jun 26 07:36:01 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:01.027001 I | rafthttp: started streaming with peer c8a02416947a5853 (writer)
Jun 26 07:36:01 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:01.026600 I | rafthttp: started streaming with peer 268d5cbd6382d8f2 (stream MsgApp v2 reader)
Jun 26 07:36:01 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:01.020326 I | rafthttp: started streaming with peer 268d5cbd6382d8f2 (stream Message reader)
Jun 26 07:36:01 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:01.007598 I | rafthttp: started HTTP pipelining with peer c8a02416947a5853
Jun 26 07:36:01 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:01.007467 I | rafthttp: starting peer c8a02416947a5853...
Jun 26 07:36:01 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:01.007316 I | rafthttp: added peer 268d5cbd6382d8f2
Jun 26 07:36:01 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:01.006977 I | rafthttp: started peer 268d5cbd6382d8f2
Jun 26 07:36:01 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:01.005147 I | rafthttp: started streaming with peer 268d5cbd6382d8f2 (writer)
Jun 26 07:36:01 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:01.000964 I | rafthttp: started streaming with peer 268d5cbd6382d8f2 (writer)
Jun 26 07:36:01 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:01.000618 I | rafthttp: started HTTP pipelining with peer 268d5cbd6382d8f2
Jun 26 07:36:01 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:00.999942 I | rafthttp: starting peer 268d5cbd6382d8f2...
Jun 26 07:36:00 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:00.989195 W | auth: simple token is not cryptographically signed
Jun 26 07:36:00 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:00.981813 I | raft: dab84c2963a2830f became follower at term 1
Jun 26 07:36:00 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:00.981806 I | raft: newRaft dab84c2963a2830f [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]
Jun 26 07:36:00 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:00.981793 I | raft: dab84c2963a2830f became follower at term 0
Jun 26 07:36:00 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:00.981754 I | etcdserver: starting member dab84c2963a2830f in cluster e146e62db33b7661
Jun 26 07:36:00 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:00.977731 I | etcdserver: initial cluster = etcd0=https://ip-10-1-43-133.eu-west-1.compute.internal:2380,etcd1=https://ip-10-1-44-127.eu-west-1.compute.internal:2380,etcd2=https://ip-10-1-45-117.eu-west-1.compute.internal:2380
Jun 26 07:36:00 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:00.977716 I | etcdserver: initial advertise peer URLs = https://ip-10-1-43-133.eu-west-1.compute.internal:2380
Jun 26 07:36:00 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:00.977690 I | etcdserver: advertise client URLs = https://ip-10-1-43-133.eu-west-1.compute.internal:2379
Jun 26 07:36:00 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:00.977679 I | etcdserver: snapshot count = 100000
Jun 26 07:36:00 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:00.977674 I | etcdserver: election = 1000ms
Jun 26 07:36:00 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:00.977670 I | etcdserver: heartbeat = 100ms
Jun 26 07:36:00 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:00.977665 I | etcdserver: member dir = /var/lib/etcd/member
Jun 26 07:36:00 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:00.977658 I | etcdserver: data dir = /var/lib/etcd
Jun 26 07:36:00 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:00.977630 I | etcdserver: name = etcd0
Jun 26 07:36:00 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:00.969205 I | pkg/netutil: resolving ip-10-1-43-133.eu-west-1.compute.internal:2380 to 10.1.43.133:2380
Jun 26 07:36:00 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:00.968280 I | pkg/netutil: resolving ip-10-1-43-133.eu-west-1.compute.internal:2380 to 10.1.43.133:2380
Jun 26 07:36:00 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:00.960243 I | embed: listening for client requests on 10.1.43.133:2379
Jun 26 07:36:00 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:00.960189 I | embed: listening for peers on https://10.1.43.133:2380
Jun 26 07:36:00 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:00.959354 I | embed: peerTLS: cert = /etc/ssl/certs/etcd.pem, key = /etc/ssl/certs/etcd-key.pem, ca = , trusted-ca = /etc/ssl/certs/etcd-trusted-ca.pem, client-cert-auth = false
Jun 26 07:36:00 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:00.959283 I | etcdmain: setting maximum number of CPUs to 2, total number of available CPUs is 2
Jun 26 07:36:00 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:00.959277 I | etcdmain: Go OS/Arch: linux/amd64
Jun 26 07:36:00 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:00.959273 I | etcdmain: Go Version: go1.8.5
Jun 26 07:36:00 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:00.959268 I | etcdmain: Git SHA: 95a726a
Jun 26 07:36:00 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:00.959263 I | etcdmain: etcd Version: 3.2.13
Jun 26 07:36:00 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:00.959237 W | pkg/flags: unrecognized environment variable ETCD_IMAGE_TAG=v3.2.13
Jun 26 07:36:00 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:00.959229 W | pkg/flags: unrecognized environment variable ETCD_USER=etcd
Jun 26 07:36:00 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:00.959205 I | pkg/flags: recognized and used environment variable ETCD_TRUSTED_CA_FILE=/etc/ssl/certs/etcd-trusted-ca.pem
Jun 26 07:36:00 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:00.959184 I | pkg/flags: recognized and used environment variable ETCD_PEER_TRUSTED_CA_FILE=/etc/ssl/certs/etcd-trusted-ca.pem
Jun 26 07:36:00 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:00.959177 I | pkg/flags: recognized and used environment variable ETCD_PEER_KEY_FILE=/etc/ssl/certs/etcd-key.pem
Jun 26 07:36:00 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:00.959164 I | pkg/flags: recognized and used environment variable ETCD_PEER_CERT_FILE=/etc/ssl/certs/etcd.pem
Jun 26 07:36:00 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:00.959155 I | pkg/flags: recognized and used environment variable ETCD_NAME=etcd0
Jun 26 07:36:00 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:00.959140 I | pkg/flags: recognized and used environment variable ETCD_LISTEN_PEER_URLS=https://10.1.43.133:2380
Jun 26 07:36:00 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:00.959132 I | pkg/flags: recognized and used environment variable ETCD_LISTEN_CLIENT_URLS=https://10.1.43.133:2379
Jun 26 07:36:00 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:00.959122 I | pkg/flags: recognized and used environment variable ETCD_KEY_FILE=/etc/ssl/certs/etcd-key.pem
Jun 26 07:36:00 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:00.959110 I | pkg/flags: recognized and used environment variable ETCD_INITIAL_CLUSTER_STATE=new
Jun 26 07:36:00 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:00.959098 I | pkg/flags: recognized and used environment variable ETCD_INITIAL_CLUSTER=etcd0=https://ip-10-1-43-133.eu-west-1.compute.internal:2380,etcd1=https://ip-10-1-44-127.eu-west-1.compute.internal:2380,etcd2=https://ip-10-1-45-117.eu-west-1.compute.internal:2380
Jun 26 07:36:00 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:00.959037 I | pkg/flags: recognized and used environment variable ETCD_INITIAL_ADVERTISE_PEER_URLS=https://ip-10-1-43-133.eu-west-1.compute.internal:2380
Jun 26 07:36:00 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:00.959006 I | pkg/flags: recognized and used environment variable ETCD_DATA_DIR=/var/lib/etcd
Jun 26 07:36:00 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:00.958991 I | pkg/flags: recognized and used environment variable ETCD_CLIENT_CERT_AUTH=true
Jun 26 07:36:00 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:00.958983 I | pkg/flags: recognized and used environment variable ETCD_CERT_FILE=/etc/ssl/certs/etcd.pem
Jun 26 07:36:00 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:00.958973 I | pkg/flags: recognized and used environment variable ETCD_AUTO_COMPACTION_RETENTION=1
Jun 26 07:36:00 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: 2019-06-26 07:36:00.958883 I | pkg/flags: recognized and used environment variable ETCD_ADVERTISE_CLIENT_URLS=https://ip-10-1-43-133.eu-west-1.compute.internal:2379
Jun 26 07:36:00 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: + exec /usr/bin/rkt run --uuid-file-save=/var/lib/coreos/etcd-member-wrapper.uuid --trust-keys-from-https --mount volume=coreos-systemd-dir,target=/run/systemd/system --volume coreos-systemd-dir,kind=host,source=/run/systemd/system,readOnly=true --volume coreos-data-dir,kind=host,source=/var/lib/etcd2,readOnly=false --volume coreos-etc-ssl-certs,kind=host,source=/etc/ssl/certs,readOnly=true --volume coreos-usr-share-certs,kind=host,source=/usr/share/ca-certificates,readOnly=true --volume coreos-etc-hosts,kind=host,source=/etc/hosts,readOnly=true --volume coreos-etc-resolv,kind=host,source=/etc/resolv.conf,readOnly=true --mount volume=coreos-data-dir,target=/var/lib/etcd --mount volume=coreos-etc-ssl-certs,target=/etc/ssl/certs --mount volume=coreos-usr-share-certs,target=/usr/share/ca-certificates --mount volume=coreos-etc-hosts,target=/etc/hosts --mount volume=coreos-etc-resolv,target=/etc/resolv.conf --inherit-env --stage1-from-dir=stage1-fly.aci quay.io/coreos/etcd:v3.2.13 --user=232 --
Jun 26 07:36:00 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[15517]: ++ id -u etcd
Jun 26 07:36:00 ip-10-1-43-48.eu-west-1.compute.internal systemd[1]: Started etcd (System Application Container).
Jun 26 07:36:00 ip-10-1-43-48.eu-west-1.compute.internal systemctl[15513]: active
Jun 26 07:36:00 ip-10-1-43-48.eu-west-1.compute.internal systemctl[15511]: active
Jun 26 07:36:00 ip-10-1-43-48.eu-west-1.compute.internal rkt[15504]: rm: failed to remove one or more pods
Jun 26 07:36:00 ip-10-1-43-48.eu-west-1.compute.internal rkt[15504]: rm: cannot get pod: no matches found for "55fef84a-b746-4e04-bfe1-8157513251fd"
Jun 26 07:36:00 ip-10-1-43-48.eu-west-1.compute.internal systemd[1]: Starting etcd (System Application Container)...

What I notice from the logs above is that this node (the failing one) still lists its old etcd member id (dab84c2963a2830f), even after running the replace command. I would assume it should get a new id?

Also, I did some searching around for etcd cluster recovery, and as far as I could tell, when adding a node into an existing cluster we should set ETCD_INITIAL_CLUSTER_STATE to existing. I've also tried to do that, but I've been unable to pass that in to the command, since the cfn scripts pull down the config on service start and set the env variables, overwriting any changes I've made.
I guess one way to bypass this would be to temporarily remove the cfn config part from the etcd-member.service start, but I haven't looked more into ETCD_INITIAL_CLUSTER_STATE=existing yet.
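
If anyone wants to experiment with that, one way to find where the unit picks the value up, and to override it temporarily, might be (a sketch; I'm assuming the variable ends up in one of the environment files below, which systemctl cat should confirm):

$ systemctl cat etcd-member.service    # shows which EnvironmentFile/ExecStartPre lines the cfn config injects
$ grep ETCD_INITIAL_CLUSTER_STATE /etc/etcd-environment /var/run/coreos/etcdadm-environment
$ sudo sed -i 's/^ETCD_INITIAL_CLUSTER_STATE=.*/ETCD_INITIAL_CLUSTER_STATE=existing/' /etc/etcd-environment   # edit whichever file the grep found it in
$ sudo systemctl restart etcd-member.service   # note: the next etcdadm/cfn run may rewrite the file again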

Hey @andersosthus.

Yes, it should definitely get a new id. I'm not massively concerned about the initial cluster state flag, but what you could try doing is removing the bad id from the cluster completely, something along these lines perhaps.

On the broken node, source all the env configs, and follow the steps up until stopping etcd. After that, remove the broken node from the cluster using its old id, $ etcdctl member remove e6e3c5cbabef648a, then follow the replace & start procedure.

I've replaced a handful of broken nodes with these steps, so I'd be surprised if it didn't work.

Fwiw, the commands I gave you differ a little from the doc you shared; there's some additional source'ing going on to pull in some env vars that are otherwise missed.
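
Putting the whole sequence together, it looks roughly like this (same commands as above, with a placeholder for the stale id):

$ set -a; source /var/run/coreos/etcdadm-environment; source /etc/etcd-environment; set +a
$ sudo systemctl stop etcd-member.service
$ etcdctl member remove <old-member-id>    # the id the broken node keeps announcing
$ /opt/bin/etcdadm replace
$ sudo systemctl start etcd-member.service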

@dominicgunn Yeah, I noticed that there was an extra env file that you included in your command :)

So, following the steps, and running etcdctl member remove dab84c2963a2830f gives the output: Couldn't find a member in the cluster with an ID of dab84c2963a2830f.

Then continuing with etcdadm replace and starting etcd-member.service, the logs show the same id for this broken node

...
Jun 26 09:01:07 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[17055]: 2019-06-26 09:01:07.108451 I | etcdserver/membership: added member dab84c2963a2830f [https://ip-10-1-43-133.eu-west-1.compute.internal:2380] to cluster e146e62db33b7661
Jun 26 09:01:07 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[17055]: 2019-06-26 09:01:07.108368 I | etcdserver/membership: added member c8a02416947a5853 [https://ip-10-1-45-117.eu-west-1.compute.internal:2380] to cluster e146e62db33b7661
Jun 26 09:01:07 ip-10-1-43-48.eu-west-1.compute.internal etcd-wrapper[17055]: 2019-06-26 09:01:07.108245 I | etcdserver/membership: added member 268d5cbd6382d8f2 [https://ip-10-1-44-127.eu-west-1.compute.internal:2380] to cluster e146e62db33b7661
...

I'm not that familiar with etcd, but to me, it kinda looks like the id is cached on the node so the etcd-member.service thinks it's still the old node trying to join.

But then again, I'm just guessing here.
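
For what it's worth, etcd persists the member identity in its data dir, so if stale WAL/snapshot files survive the replace, the node will keep announcing the old id. You can check the host side of the data dir (path taken from the --volume coreos-data-dir line in the etcd-wrapper invocation above):

$ ls -la /var/lib/etcd2/member/wal /var/lib/etcd2/member/snap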

I created a fresh 0.13-rc1 cluster; the cluster launched without errors.
Then I tried to upgrade to 0.13-rc2 by:
- replacing the kube-aws binary
- running ./kube-aws render stack
- running ./kube-aws apply --pretty-print

Same problem. etcd0 fails to execute cfn-signal and a rollback starts. The rollback etcd0 instance then fails to start.

I've tried the replace procedure you listed above on the new cluster that also failed @dominicgunn, and the output is as follows:

Jun 26 12:28:48 ip-10-1-43-221.eu-west-1.compute.internal systemd[1]: Stopped etcd (System Application Container).
Jun 26 12:28:48 ip-10-1-43-221.eu-west-1.compute.internal systemd[1]: etcd-member.service: Succeeded.
Jun 26 12:28:48 ip-10-1-43-221.eu-west-1.compute.internal rkt[31888]: pod "4352a8f3-3476-4d55-acf5-ebc5438273c5" is already stopped
Jun 26 12:28:48 ip-10-1-43-221.eu-west-1.compute.internal systemd[1]: Started etcd (System Application Container).
Jun 26 12:28:48 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:48.016861 I | rafthttp: stopped peer 70f28abc02f5566f
Jun 26 12:28:48 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:48.016855 I | rafthttp: stopped streaming with peer 70f28abc02f5566f (stream Message reader)
Jun 26 12:28:48 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:48.016848 I | rafthttp: stopped streaming with peer 70f28abc02f5566f (stream MsgApp v2 reader)
Jun 26 12:28:48 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:48.016839 I | rafthttp: stopped HTTP pipelining with peer 70f28abc02f5566f
Jun 26 12:28:48 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:48.016793 I | rafthttp: stopped streaming with peer 70f28abc02f5566f (writer)
Jun 26 12:28:48 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:48.016785 I | rafthttp: stopped streaming with peer 70f28abc02f5566f (writer)
Jun 26 12:28:48 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:48.016776 I | rafthttp: stopping peer 70f28abc02f5566f...
Jun 26 12:28:48 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:48.016770 I | rafthttp: stopped peer 123c3adfe7b1a0db
Jun 26 12:28:48 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:48.016764 I | rafthttp: stopped streaming with peer 123c3adfe7b1a0db (stream Message reader)
Jun 26 12:28:48 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:48.016753 I | rafthttp: stopped streaming with peer 123c3adfe7b1a0db (stream MsgApp v2 reader)
Jun 26 12:28:48 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:48.016689 I | rafthttp: stopped HTTP pipelining with peer 123c3adfe7b1a0db
Jun 26 12:28:48 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:48.013897 I | rafthttp: stopped streaming with peer 123c3adfe7b1a0db (writer)
Jun 26 12:28:48 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:48.013870 I | rafthttp: stopped streaming with peer 123c3adfe7b1a0db (writer)
Jun 26 12:28:48 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:48.013851 I | rafthttp: stopping peer 123c3adfe7b1a0db...
Jun 26 12:28:48 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:48.013807 I | etcdserver: aborting publish because server is stopped
Jun 26 12:28:48 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:48.013762 I | etcdserver: the data-dir used by this member must be removed.
Jun 26 12:28:48 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:48.013727 E | etcdserver: the member has been permanently removed from the cluster
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:47.989592 E | rafthttp: streaming request ignored (ID mismatch got 1a62be873a3c997c want 34082a3956203c13)
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:47.989347 E | rafthttp: streaming request ignored (ID mismatch got 1a62be873a3c997c want 34082a3956203c13)
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:47.989134 E | rafthttp: streaming request ignored (ID mismatch got 1a62be873a3c997c want 34082a3956203c13)
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:47.988684 I | raft: raft.node: 34082a3956203c13 elected leader 123c3adfe7b1a0db at term 9
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:47.988675 I | raft: 34082a3956203c13 became follower at term 9
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:47.988646 I | raft: 34082a3956203c13 [term: 1] received a MsgHeartbeat message with higher term from 123c3adfe7b1a0db [term: 9]
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:47.988358 E | rafthttp: streaming request ignored (ID mismatch got 1a62be873a3c997c want 34082a3956203c13)
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:47.987546 I | rafthttp: started streaming with peer 70f28abc02f5566f (stream Message reader)
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:47.987330 I | rafthttp: started streaming with peer 70f28abc02f5566f (stream MsgApp v2 reader)
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:47.983359 I | embed: ClientTLS: cert = /etc/ssl/certs/etcd.pem, key = /etc/ssl/certs/etcd-key.pem, ca = , trusted-ca = /etc/ssl/certs/etcd-trusted-ca.pem, client-cert-auth = true
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:47.982484 I | rafthttp: started streaming with peer 70f28abc02f5566f (writer)
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:47.982372 I | etcdserver: starting server... [version: 3.2.13, cluster version: to_be_decided]
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:47.982283 I | rafthttp: added peer 70f28abc02f5566f
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:47.982188 I | rafthttp: started peer 70f28abc02f5566f
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:47.982051 I | rafthttp: started streaming with peer 70f28abc02f5566f (writer)
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:47.981874 I | rafthttp: started streaming with peer 123c3adfe7b1a0db (stream MsgApp v2 reader)
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:47.981446 I | rafthttp: started streaming with peer 123c3adfe7b1a0db (stream Message reader)
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:47.981085 I | rafthttp: started HTTP pipelining with peer 70f28abc02f5566f
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:47.981002 I | rafthttp: starting peer 70f28abc02f5566f...
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:47.980865 I | rafthttp: added peer 123c3adfe7b1a0db
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:47.977715 I | rafthttp: started peer 123c3adfe7b1a0db
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:47.960965 I | rafthttp: started streaming with peer 123c3adfe7b1a0db (writer)
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:47.954896 I | rafthttp: started streaming with peer 123c3adfe7b1a0db (writer)
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:47.942874 I | rafthttp: started HTTP pipelining with peer 123c3adfe7b1a0db
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:47.942675 I | rafthttp: starting peer 123c3adfe7b1a0db...
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:47.922510 W | auth: simple token is not cryptographically signed
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:47.907041 I | mvcc: restore compact to 95296
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:47.900823 I | etcdserver/membership: added member 123c3adfe7b1a0db [https://ip-10-1-45-203.eu-west-1.compute.internal:2380] to cluster 766b2c59c152b9ce from store
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:47.900808 I | etcdserver/membership: added member 70f28abc02f5566f [https://ip-10-1-44-92.eu-west-1.compute.internal:2380] to cluster 766b2c59c152b9ce from store
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:47.900786 I | etcdserver/membership: added member 34082a3956203c13 [https://ip-10-1-43-74.eu-west-1.compute.internal:2380] to cluster 766b2c59c152b9ce from store
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:47.900597 I | raft: newRaft 34082a3956203c13 [peers: [123c3adfe7b1a0db,34082a3956203c13,70f28abc02f5566f], term: 1, commit: 3, applied: 3, lastindex: 3, lastterm: 1]
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:47.900569 I | raft: 34082a3956203c13 became follower at term 1
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:47.900504 I | etcdserver: restarting member 34082a3956203c13 in cluster 766b2c59c152b9ce at commit index 3
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:47.839875 I | etcdserver: advertise client URLs = https://ip-10-1-43-74.eu-west-1.compute.internal:2379
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:47.839783 I | etcdserver: snapshot count = 100000
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:47.839717 I | etcdserver: election = 1000ms
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:47.839650 I | etcdserver: heartbeat = 100ms
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:47.839573 I | etcdserver: member dir = /var/lib/etcd/member
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:47.839494 I | etcdserver: data dir = /var/lib/etcd
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:47.839315 I | etcdserver: name = etcd0
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:47.828274 I | mvcc: restore compact to 95296
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:47.812532 I | etcdserver: recovered store from snapshot at index 3
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:47.799192 I | embed: listening for client requests on 10.1.43.74:2379
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:47.799009 I | embed: listening for peers on https://10.1.43.74:2380
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:47.794017 I | embed: peerTLS: cert = /etc/ssl/certs/etcd.pem, key = /etc/ssl/certs/etcd-key.pem, ca = , trusted-ca = /etc/ssl/certs/etcd-trusted-ca.pem, client-cert-auth = false
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:47.793859 N | etcdmain: the server is already initialized as member before, starting as etcd member...
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:47.793738 I | etcdmain: setting maximum number of CPUs to 2, total number of available CPUs is 2
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:47.793669 I | etcdmain: Go OS/Arch: linux/amd64
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:47.793602 I | etcdmain: Go Version: go1.8.5
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:47.793529 I | etcdmain: Git SHA: 95a726a
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:47.793455 I | etcdmain: etcd Version: 3.2.13
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:47.793358 W | pkg/flags: unrecognized environment variable ETCD_IMAGE_TAG=v3.2.13
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:47.793289 W | pkg/flags: unrecognized environment variable ETCD_USER=etcd
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:47.793191 I | pkg/flags: recognized and used environment variable ETCD_TRUSTED_CA_FILE=/etc/ssl/certs/etcd-trusted-ca.pem
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:47.793097 I | pkg/flags: recognized and used environment variable ETCD_PEER_TRUSTED_CA_FILE=/etc/ssl/certs/etcd-trusted-ca.pem
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:47.793015 I | pkg/flags: recognized and used environment variable ETCD_PEER_KEY_FILE=/etc/ssl/certs/etcd-key.pem
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:47.792922 I | pkg/flags: recognized and used environment variable ETCD_PEER_CERT_FILE=/etc/ssl/certs/etcd.pem
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:47.792847 I | pkg/flags: recognized and used environment variable ETCD_NAME=etcd0
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:47.792760 I | pkg/flags: recognized and used environment variable ETCD_LISTEN_PEER_URLS=https://10.1.43.74:2380
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:47.792673 I | pkg/flags: recognized and used environment variable ETCD_LISTEN_CLIENT_URLS=https://10.1.43.74:2379
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:47.792590 I | pkg/flags: recognized and used environment variable ETCD_KEY_FILE=/etc/ssl/certs/etcd-key.pem
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:47.791386 I | pkg/flags: recognized and used environment variable ETCD_INITIAL_CLUSTER_STATE=new
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:47.791319 I | pkg/flags: recognized and used environment variable ETCD_INITIAL_CLUSTER=etcd0=https://ip-10-1-43-74.eu-west-1.compute.internal:2380,etcd1=https://ip-10-1-44-92.eu-west-1.compute.internal:2380,etcd2=https://ip-10-1-45-203.eu-west-1.compute.internal:2380
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:47.791248 I | pkg/flags: recognized and used environment variable ETCD_INITIAL_ADVERTISE_PEER_URLS=https://ip-10-1-43-74.eu-west-1.compute.internal:2380
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:47.791161 I | pkg/flags: recognized and used environment variable ETCD_DATA_DIR=/var/lib/etcd
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:47.791074 I | pkg/flags: recognized and used environment variable ETCD_CLIENT_CERT_AUTH=true
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:47.791002 I | pkg/flags: recognized and used environment variable ETCD_CERT_FILE=/etc/ssl/certs/etcd.pem
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:47.790919 I | pkg/flags: recognized and used environment variable ETCD_AUTO_COMPACTION_RETENTION=1
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: 2019-06-26 12:28:47.790675 I | pkg/flags: recognized and used environment variable ETCD_ADVERTISE_CLIENT_URLS=https://ip-10-1-43-74.eu-west-1.compute.internal:2379
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: + exec /usr/bin/rkt run --uuid-file-save=/var/lib/coreos/etcd-member-wrapper.uuid --trust-keys-from-https --mount volume=coreos-systemd-dir,target=/run/systemd/system --volume coreos-systemd-dir,kind=host,source=/run/systemd/system,readOnly=true --mount volume=coreos-notify,target=/run/systemd/notify --volume coreos-notify,kind=host,source=/run/systemd/notify --set-env=NOTIFY_SOCKET=/run/systemd/notify --volume coreos-data-dir,kind=host,source=/var/lib/etcd2,readOnly=false --volume coreos-etc-ssl-certs,kind=host,source=/etc/ssl/certs,readOnly=true --volume coreos-usr-share-certs,kind=host,source=/usr/share/ca-certificates,readOnly=true --volume coreos-etc-hosts,kind=host,source=/etc/hosts,readOnly=true --volume coreos-etc-resolv,kind=host,source=/etc/resolv.conf,readOnly=true --mount volume=coreos-data-dir,target=/var/lib/etcd --mount volume=coreos-etc-ssl-certs,target=/etc/ssl/certs --mount volume=coreos-usr-share-certs,target=/usr/share/ca-certificates --mount volume=coreos-etc-hosts,target=/etc/hosts --mount volume=coreos-etc-resolv,target=/etc/resolv.conf --inherit-env --stage1-from-dir=stage1-fly.aci quay.io/coreos/etcd:v3.2.13 --user=232 --
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal etcd-wrapper[31864]: ++ id -u etcd
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal systemctl[31860]: active
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal systemctl[31858]: active
Jun 26 12:28:47 ip-10-1-43-221.eu-west-1.compute.internal rkt[31847]: "f13fba88-4a37-4e99-94e9-9c3e9ccd98ba"
Jun 26 12:28:46 ip-10-1-43-221.eu-west-1.compute.internal systemd[1]: Starting etcd (System Application Container)...

From the logs, etcd is adding the node with its old id, 34082a3956203c13. It then gets the new id 1a62be873a3c997c, but rejects it since it thinks it's looking for the old id 34082a3956203c13.

I think I can finally throw some light on this issue

Jun 26 15:01:10 ip-10-1-43-147.eu-west-1.compute.internal etcdadm[20903]: Unable to find image 'quay.io/coreos/etcd:vv3.2.26' locally

This is from the node that fails to start during the update, and that eventually triggers a rollback. Look at the etcd image: note the double v in the version tag. This only happens during upgrade, because launching a new cluster with kube-aws 0.13-rc2 does work.

Jun 26 15:10:37 ip-10-1-43-147.eu-west-1.compute.internal systemd[1]: Failed to start etcdadm reconfigure runner.
Jun 26 15:10:37 ip-10-1-43-147.eu-west-1.compute.internal systemd[1]: etcdadm-reconfigure.service: Failed with result 'exit-code'.
Jun 26 15:10:37 ip-10-1-43-147.eu-west-1.compute.internal systemd[1]: etcdadm-reconfigure.service: Main process exited, code=exited, status=125/n/a
Jun 26 15:10:37 ip-10-1-43-147.eu-west-1.compute.internal etcdadm[3610]: See '/run/torcx/bin/docker run --help'.
Jun 26 15:10:37 ip-10-1-43-147.eu-west-1.compute.internal etcdadm[3610]: /run/torcx/bin/docker: Error response from daemon: manifest for quay.io/coreos/etcd:vv3.2.26 not found.
Jun 26 15:10:36 ip-10-1-43-147.eu-west-1.compute.internal etcdadm[3610]: Unable to find image 'quay.io/coreos/etcd:vv3.2.26' locally
Jun 26 15:10:36 ip-10-1-43-147.eu-west-1.compute.internal etcdadm[3610]: /opt/bin/etcdadm: info: member_restore_from_local_snapshot: restoring etcd0
Jun 26 15:10:36 ip-10-1-43-147.eu-west-1.compute.internal etcdadm[3610]: /opt/bin/etcdadm: info: member_clean_data_dir: no files found in /var/lib/etcd2
Jun 26 15:10:36 ip-10-1-43-147.eu-west-1.compute.internal etcdadm[3610]: /opt/bin/etcdadm: info: member_clean_data_dir: data dir /var/lib/etcd2 exists. finding files to remove
Jun 26 15:10:36 ip-10-1-43-147.eu-west-1.compute.internal etcdadm[3610]: /opt/bin/etcdadm: info: member_clean_data_dir: cleaning data dir of etcd0
Jun 26 15:10:36 ip-10-1-43-147.eu-west-1.compute.internal etcdadm[3610]: /opt/bin/etcdadm: info: member_bootstrap: backup found. restoring etcd0...
Jun 26 15:10:36 ip-10-1-43-147.eu-west-1.compute.internal etcdadm[3610]: /opt/bin/etcdadm: info: member_local_snapshot_exists: checking existence of file /var/run/coreos/etcdadm/snapshots/etcd0.db
Jun 26 15:10:36 ip-10-1-43-147.eu-west-1.compute.internal etcdadm[3610]: /opt/bin/etcdadm: info: member_local_snapshot_exists: checking existence of file /var/run/coreos/etcdadm/snapshots/etcd0.db
Jun 26 15:10:36 ip-10-1-43-147.eu-west-1.compute.internal etcdadm[3610]: [1.0K blob data]
Jun 26 15:10:35 ip-10-1-43-147.eu-west-1.compute.internal etcdadm[3610]: /opt/bin/etcdadm: info: member_download_snapshot: downloading /var/run/coreos/etcdadm/snapshots/etcd0.db from s3://gdo-kubernetes/k8s7/kube-aws/clusters/k8s7/instances/ca3b0cf0-981a-11e9-9b11-0ae78d106d3e/etcd-snapshots/snapshot.db
Jun 26 15:10:35 ip-10-1-43-147.eu-west-1.compute.internal etcdadm[3610]: /opt/bin/etcdadm: info: member_remote_snapshot_exists: s3://gdo-kubernetes/k8s7/kube-aws/clusters/k8s7/instances/ca3b0cf0-981a-11e9-9b11-0ae78d106d3e/etcd-snapshots/snapshot.db exists
Jun 26 15:10:34 ip-10-1-43-147.eu-west-1.compute.internal etcdadm[3610]: 2019-06-26 15:09:26    4382752 snapshot.db
Jun 26 15:10:34 ip-10-1-43-147.eu-west-1.compute.internal etcdadm[3610]: /opt/bin/etcdadm: info: member_remote_snapshot_exists: checking existence of s3://gdo-kubernetes/k8s7/kube-aws/clusters/k8s7/instances/ca3b0cf0-981a-11e9-9b11-0ae78d106d3e/etcd-snapshots/snapshot.db
Jun 26 15:10:34 ip-10-1-43-147.eu-west-1.compute.internal etcdadm[3610]: /opt/bin/etcdadm: info: member_reconfigure: only 0 of 3 nodes for etcd members are running, which means cluster is still in bootstrap process. searching for a etcd snapshot to recover this member
Jun 26 15:10:34 ip-10-1-43-147.eu-west-1.compute.internal etcdadm[3610]: /opt/bin/etcdadm: info: member_set_unit_type: setting etcd unit type to "simple". `systemctl daemon-reload` required afterwards
Jun 26 15:10:34 ip-10-1-43-147.eu-west-1.compute.internal etcdadm[3610]: /opt/bin/etcdadm: info: member_reconfigure: 3 more nodes are required until the quorum is met
Jun 26 15:10:34 ip-10-1-43-147.eu-west-1.compute.internal etcdadm[3610]: /opt/bin/etcdadm: error: cluster_num_running_nodes:  not found
Jun 26 15:10:34 ip-10-1-43-147.eu-west-1.compute.internal etcdadm[3610]: /opt/bin/etcdadm: line 153: tester_num_running_nodes_file: command not found
Jun 26 15:10:34 ip-10-1-43-147.eu-west-1.compute.internal etcdadm[3610]: /opt/bin/etcdadm: info: member_reconfigure: observing cluster state: quorum=2 healthy=0
Jun 26 15:10:34 ip-10-1-43-147.eu-west-1.compute.internal etcdadm[3610]: See '/run/torcx/bin/docker run --help'.
Jun 26 15:10:34 ip-10-1-43-147.eu-west-1.compute.internal etcdadm[3610]: /run/torcx/bin/docker: Error response from daemon: manifest for quay.io/coreos/etcd:vv3.2.26 not found.
Jun 26 15:10:33 ip-10-1-43-147.eu-west-1.compute.internal etcdadm[3610]: Unable to find image 'quay.io/coreos/etcd:vv3.2.26' locally
Jun 26 15:10:33 ip-10-1-43-147.eu-west-1.compute.internal etcdadm[3610]: See '/run/torcx/bin/docker run --help'.
Jun 26 15:10:33 ip-10-1-43-147.eu-west-1.compute.internal etcdadm[3610]: /run/torcx/bin/docker: Error response from daemon: manifest for quay.io/coreos/etcd:vv3.2.26 not found.
Jun 26 15:10:32 ip-10-1-43-147.eu-west-1.compute.internal etcdadm[3610]: Unable to find image 'quay.io/coreos/etcd:vv3.2.26' locally
Jun 26 15:10:31 ip-10-1-43-147.eu-west-1.compute.internal etcdadm[3610]: See '/run/torcx/bin/docker run --help'.
Jun 26 15:10:31 ip-10-1-43-147.eu-west-1.compute.internal etcdadm[3610]: /run/torcx/bin/docker: Error response from daemon: manifest for quay.io/coreos/etcd:vv3.2.26 not found.
Jun 26 15:10:30 ip-10-1-43-147.eu-west-1.compute.internal etcdadm[3610]: Unable to find image 'quay.io/coreos/etcd:vv3.2.26' locally
Jun 26 15:10:30 ip-10-1-43-147.eu-west-1.compute.internal sudo[3638]: pam_unix(sudo:session): session closed for user root
Jun 26 15:10:30 ip-10-1-43-147.eu-west-1.compute.internal sudo[3638]: pam_unix(sudo:session): session opened for user root by (uid=0)
Jun 26 15:10:30 ip-10-1-43-147.eu-west-1.compute.internal sudo[3638]:     root : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/bin/[ -w /var/lib/etcd2 ]
Jun 26 15:10:30 ip-10-1-43-147.eu-west-1.compute.internal sudo[3633]: pam_unix(sudo:session): session closed for user root
Jun 26 15:10:30 ip-10-1-43-147.eu-west-1.compute.internal sudo[3633]: pam_unix(sudo:session): session opened for user root by (uid=0)
Jun 26 15:10:30 ip-10-1-43-147.eu-west-1.compute.internal sudo[3633]:     root : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/bin/[ -w /var/run/coreos/etcdadm/snapshots ]
Jun 26 15:10:30 ip-10-1-43-147.eu-west-1.compute.internal sudo[3616]: pam_unix(sudo:session): session closed for user root
Jun 26 15:10:30 ip-10-1-43-147.eu-west-1.compute.internal sudo[3616]: pam_unix(sudo:session): session opened for user root by (uid=0)
Jun 26 15:10:30 ip-10-1-43-147.eu-west-1.compute.internal sudo[3616]:     root : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/bin/[ -w /var/run/coreos/etcdadm ]
Jun 26 15:10:30 ip-10-1-43-147.eu-west-1.compute.internal etcdadm[3610]: declare -x ETCDCTL_KEY_FILE="/etc/ssl/certs/etcd-client-key.pem"
Jun 26 15:10:30 ip-10-1-43-147.eu-west-1.compute.internal etcdadm[3610]: declare -x ETCDCTL_KEY="/etc/ssl/certs/etcd-client-key.pem"
Jun 26 15:10:30 ip-10-1-43-147.eu-west-1.compute.internal etcdadm[3610]: declare -x ETCDCTL_CERT_FILE="/etc/ssl/certs/etcd-client.pem"
Jun 26 15:10:30 ip-10-1-43-147.eu-west-1.compute.internal etcdadm[3610]: declare -x ETCDCTL_CERT="/etc/ssl/certs/etcd-client.pem"
Jun 26 15:10:30 ip-10-1-43-147.eu-west-1.compute.internal etcdadm[3610]: declare -x ETCDCTL_CA_FILE="/etc/ssl/certs/etcd-trusted-ca.pem"
Jun 26 15:10:30 ip-10-1-43-147.eu-west-1.compute.internal etcdadm[3610]: declare -x ETCDCTL_CACERT="/etc/ssl/certs/etcd-trusted-ca.pem"
Jun 26 15:10:30 ip-10-1-43-147.eu-west-1.compute.internal systemctl[3606]: active
Jun 26 15:10:30 ip-10-1-43-147.eu-west-1.compute.internal systemctl[3604]: active
Jun 26 15:10:30 ip-10-1-43-147.eu-west-1.compute.internal systemd[1]: Starting etcdadm reconfigure runner...

I think this line is the source of the problem:
https://github.com/HotelsDotCom/kube-aws/blob/fc29a9f5f46872cdb72ccab9f9b767807ac8eadc/build#L9

should be
ETCD_VERSION="3.2.26"
not
ETCD_VERSION="v3.2.26"

The etcd version in cluster.yaml under etcd.version is specified without a v:
https://github.com/HotelsDotCom/kube-aws/blob/fc29a9f5f46872cdb72ccab9f9b767807ac8eadc/builtin/files/cluster.yaml.tmpl#L765

This does explain why the etcdadm-reconfigure.service fails to execute during a cluster upgrade, but it does not explain why the rollback operation fails.

This is how the etcd version is specified on line 88 of etcdadm:
etcd_version=${ETCD_VERSION:-3.2.13}

The member_restore_from_local_snapshot() function in etcdadm will eventually be executed by the /opt/bin/etcdadm reconfigure command in etcdadm-reconfigure.service. This function tries to download the etcd docker image, defined on line 514 as quay.io/coreos/etcd:v$etcd_version.
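
The interaction is easy to reproduce in a shell (a sketch; the variable names follow the snippets above):

$ ETCD_VERSION="v3.2.26"                      # what the rc2 build script exports (with a leading v)
$ etcd_version=${ETCD_VERSION:-3.2.13}
$ echo "quay.io/coreos/etcd:v$etcd_version"   # the image reference etcdadm builds
quay.io/coreos/etcd:vv3.2.26
$ ETCD_VERSION="3.2.26"                       # without the leading v, matching etcd.version in cluster.yaml
$ etcd_version=${ETCD_VERSION:-3.2.13}
$ echo "quay.io/coreos/etcd:v$etcd_version"
quay.io/coreos/etcd:v3.2.26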

Should be fixed in #1645

@davidmccormick, is this more in line with your desired way of handling etcd versions?
#1646

Closing this issue as resolved, following these merged PRs, thanks for the work @paalkr and @andersosthus!

#1646
#1658
#1657

There is still the issue this thread was originally about: when the upgrade failed, it did not auto-revert etcd. Also, we haven't been able to get the failing etcd node back into the etcd cluster.

@dominicgunn I managed to fix the broken etcd node a few weeks ago.

I basically did the same as etcdadm replace, but ran each step in the script manually instead of running the script itself. That worked, and the node recovered. Based on this, I think there might be a subtle bug in etcdadm replace, but I haven't looked for it.

I could try to create a new cluster using 0.13-rc1 and upgrade it to 0.13-rc2 to reproduce this issue, and see if I can figure out what the problem with the script is.

I appreciate you continuing to look into this @andersosthus; if there is a bug in the script, getting it fixed would be great!