githubixx/ansible-role-etcd

Problems bootstrapping the etcd cluster

Hello again,

I'm now at part 5 of the tutorial and I'm stuck getting the etcd cluster up and running. The etcd install seems to have succeeded and the service started on all 3 controllers; however, I cannot list the cluster members:
[Screenshot: 2018-05-11 at 17:22:09]

Also, it seems the controllers are unable to communicate with one another (the logs are identical on all controllers):
[Screenshot: 2018-05-11 at 17:21:43]

The VPN seems ok:
[Screenshot: 2018-05-11 at 17:27:48]

[Screenshot: 2018-05-11 at 17:28:28]

And the firewall rules as well:
[Screenshot: 2018-05-11 at 17:27:36]

Any help would be more than welcome!

Thanks

Hard to tell from here, but in general the command for checking the etcd members is:

ETCDCTL_API=3 etcdctl member list
645277c31f2e59fe, started, k8s-controller1, https://10.3.0.201:2380, https://10.3.0.201:2379
a81925033e34d269, started, k8s-controller2, https://10.3.0.202:2380, https://10.3.0.202:2379
ecf70543fa3a5935, started, k8s-controller3, https://10.3.0.203:2380, https://10.3.0.203:2379

I would first check network connectivity beginning with a ping to all etcd member IPs. If that works then the basic VPN connectivity between the nodes should be ok at least.
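
As a minimal sketch of that first check, the loop below probes each member once and reports the result. The IPs are the ones from the member list above; the helper names `ping_host` and `check_hosts` are made up for this example, and the `-c`/`-W` flags assume the Linux iputils `ping`.

```shell
# IPs taken from the member list above; replace with your own VPN addresses.
ETCD_HOSTS="${ETCD_HOSTS:-10.3.0.201 10.3.0.202 10.3.0.203}"

# ping_host HOST: send one ICMP echo and wait at most 2 seconds for a reply.
# (-c and -W as used here are the Linux iputils flags; other pings may differ.)
ping_host() {
    ping -c 1 -W 2 "$1" >/dev/null 2>&1
}

# check_hosts PROBE HOST...: run PROBE against every host, print one line per
# host, and return non-zero if any probe failed.
check_hosts() {
    probe=$1
    shift
    failed=0
    for host in "$@"; do
        if "$probe" "$host"; then
            echo "ok   $host"
        else
            echo "FAIL $host"
            failed=1
        fi
    done
    return "$failed"
}
```

Run it on every controller with `check_hosts ping_host $ETCD_HOSTS`. A FAIL line on any node points at the VPN or firewall rather than at etcd itself.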

Next, check whether etcd is really listening on ports 2379 and 2380 on all etcd nodes, e.g.:

sudo netstat -tlpn | grep -E "23[0-9]{2}"
tcp        0      0 10.3.0.201:2379         0.0.0.0:*               LISTEN      21091/etcd      
tcp        0      0 127.0.0.1:2379          0.0.0.0:*               LISTEN      21091/etcd      
tcp        0      0 10.3.0.201:2380         0.0.0.0:*               LISTEN      21091/etcd

If that is also ok I would try to connect to the etcd peer port 2380 (which is used for server-to-server communication) via telnet e.g.:

telnet 10.3.0.201 2380
Trying 10.3.0.201...
Connected to 10.3.0.201.
Escape character is '^]'.

Or with netcat, checking the exit code (which should be 0), e.g.:

nc -w 2 10.3.0.201 2380
echo $?
0

If that works, then etcd server-to-server communication should at least be possible.
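
To run the same port test against every node in one go, something like the following could be used. It is only a sketch: `port_open` and `check_ports` are hypothetical helper names, and it assumes an `nc` that supports `-z` (connect and close without sending data), which is not the exact invocation shown above.

```shell
# port_open HOST PORT: succeed if a TCP connection can be opened within 2 seconds.
# Assumes an nc that supports -z (connect, then close without sending data).
port_open() {
    nc -z -w 2 "$1" "$2" >/dev/null 2>&1
}

# check_ports "HOSTS" "PORTS": probe every host/port pair and report each one.
# "closed" here means "could not connect", which includes firewalled ports.
# Returns non-zero if at least one port could not be reached.
check_ports() {
    unreachable=0
    for host in $1; do
        for port in $2; do
            if port_open "$host" "$port"; then
                echo "open   $host:$port"
            else
                echo "closed $host:$port"
                unreachable=1
            fi
        done
    done
    return "$unreachable"
}
```

For the three controllers from the example output this would be `check_ports "10.3.0.201 10.3.0.202 10.3.0.203" "2379 2380"`, run from each node in turn so that every direction is covered.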

The next possible problem would be the certificates, but I would first check the things mentioned above.
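
If it does come down to certificates, one way to test them is to pass the TLS files to etcdctl explicitly. The wrapper below is a sketch under assumptions: the function name `etcdctl_tls` and the variables `ENDPOINTS`, `CA_FILE`, `CERT_FILE` and `KEY_FILE` are invented for this example, and the actual file locations depend on how the role was configured. The `--endpoints`, `--cacert`, `--cert` and `--key` flags are regular etcdctl v3 options.

```shell
# etcdctl_tls ARGS...: run an etcdctl v3 command against a TLS endpoint.
# ENDPOINTS, CA_FILE, CERT_FILE and KEY_FILE are names made up for this sketch;
# the three file variables must be set, otherwise the shell aborts with a hint.
etcdctl_tls() {
    ETCDCTL_API=3 etcdctl \
        --endpoints="${ENDPOINTS:-https://10.3.0.201:2379}" \
        --cacert="${CA_FILE:?set CA_FILE to the CA certificate}" \
        --cert="${CERT_FILE:?set CERT_FILE to the client certificate}" \
        --key="${KEY_FILE:?set KEY_FILE to the client key}" \
        "$@"
}
```

With the three file variables set, `etcdctl_tls endpoint health` or `etcdctl_tls member list` should either succeed or fail with a TLS error that names the problem.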

Thank you for your answer. I've been working on this issue for a while now and I got really tired of debugging PeerVPN, as there isn't much documentation available to work from. Anyway, I've started over on LXD for now (easier to snapshot / debug), and I'll give PeerVPN another go later on, once I get everything else working and move back to Scaleway cloud instances (the creation of new instances was really buggy all weekend, which might be related to their ongoing migration).

The good news is I seem to have a working etcd cluster running now! I did run into some other issues which might be related to the use of a btrfs filesystem instead of ext4. I'll document them even if they might not be worth modifying your code for (I'll let you be the judge of that). It might at least help someone else at some point.