Missing ports in nomad security group
gthieleb opened this issue · 1 comment
I have an issue when using the "security group" module: it breaks when incoming_cidr is adapted to a custom IP range (something other than 0.0.0.0/0).
My ASG is created with the help of the terraform-aws-modules/terraform-aws-autoscaling
module, using custom userdata and Ubuntu 20.04. The userdata adds the HashiCorp repositories and performs a default installation of Nomad and Consul:
userdata script:
#!/bin/sh
apt update
apt install -y \
software-properties-common \
curl \
vim-tiny \
netcat \
file \
bash-completion
curl -fsSL https://apt.releases.hashicorp.com/gpg | sudo apt-key add -
sudo apt-add-repository "deb [arch=amd64] https://apt.releases.hashicorp.com $(lsb_release -cs) main"
apt update
apt install -y consul
apt install -y nomad
/etc/nomad.d/nomad.hcl:
datacenter = "us-east-1"
data_dir = "/opt/nomad/data"
bind_addr = "0.0.0.0"
# Enable the server
server {
enabled = true
bootstrap_expect = 3
}
consul {
address = "127.0.0.1:8500"
token = "***************************"
}
/etc/consul.d/consul.hcl:
datacenter = "us-east-1"
server = true
bootstrap_expect = 3
data_dir = "/opt/consul/data"
client_addr = "0.0.0.0"
log_level = "INFO"
ui = true
# AWS cloud join
retry_join = ["provider=aws tag_key=Nomad-Cluster tag_value=dev-nomad"]
# Max connections for the HTTP API
limits {
http_max_conns_per_client = 128
}
performance {
raft_multiplier = 1
}
acl {
enabled = true
default_policy = "allow"
enable_token_persistence = true
tokens {
master = "***************************************"
}
}
encrypt = "************************"
When opening the UI in the browser I see the following message:
No Cluster Leader
The cluster has no leader. Read about Outage Recovery.
The Nomad logs show:
sudo journalctl -t nomad:
Oct 02 11:43:51 ip-10-10-10-93 nomad[3802]: 2021-10-02T11:43:51.616Z [ERROR] worker: failed to dequeue evaluation: error="No cluster leader"
Oct 02 11:43:57 ip-10-10-10-93 nomad[3802]: 2021-10-02T11:43:57.320Z [ERROR] http: request failed: method=GET path=/v1/agent/health?type=server error="{"server":{"ok":false,"message":"No cluster leader"}}" code=500
It seems that communication on port 4647 is currently not allowed within the security group.
Trying to reach that port on one server node from another server node times out:
nc -zv -w 5 10.10.10.48 4647
nc: connect to 10.10.10.48 port 4647 (tcp) timed out: Operation now in progress
After allowing port 4647 communication within the security group, the cluster server nodes start replicating with each other:
Oct 02 11:44:06 ip-10-10-10-93 nomad[3802]: 2021-10-02T11:44:06.257Z [INFO] nomad: serf: EventMemberJoin: ip-10-10-10-48.global 10.10.10.48
Oct 02 11:44:06 ip-10-10-10-93 nomad[3802]: 2021-10-02T11:44:06.257Z [INFO] nomad: serf: EventMemberJoin: ip-10-10-10-12.global 10.10.10.12
Oct 02 11:44:06 ip-10-10-10-93 nomad[3802]: 2021-10-02T11:44:06.257Z [INFO] nomad: adding server: server="ip-10-10-10-48.global (Addr: 10.10.10.48:4647) (DC: us-east-1)"
Oct 02 11:44:06 ip-10-10-10-93 nomad[3802]: 2021-10-02T11:44:06.265Z [INFO] nomad: found expected number of peers, attempting to bootstrap cluster...: peers=10.10.10.93:4647,10.10.10.48:4647,10.10.10.12:4647
Oct 02 11:44:06 ip-10-10-10-93 nomad[3802]: 2021-10-02T11:44:06.270Z [INFO] nomad: adding server: server="ip-10-10-10-12.global (Addr: 10.10.10.12:4647) (DC: us-east-1)"
Oct 02 11:44:06 ip-10-10-10-93 nomad[3802]: 2021-10-02T11:44:06.725Z [ERROR] worker: failed to dequeue evaluation: error="No cluster leader"
Oct 02 11:44:07 ip-10-10-10-93 nomad[3802]: 2021-10-02T11:44:07.151Z [WARN] nomad.raft: heartbeat timeout reached, starting election: last-leader=
Oct 02 11:44:07 ip-10-10-10-93 nomad[3802]: 2021-10-02T11:44:07.151Z [INFO] nomad.raft: entering candidate state: node="Node at 10.10.10.93:4647 [Candidate]" term=2
Oct 02 11:44:07 ip-10-10-10-93 nomad[3802]: 2021-10-02T11:44:07.162Z [INFO] nomad.raft: election won: tally=2
Oct 02 11:44:07 ip-10-10-10-93 nomad[3802]: 2021-10-02T11:44:07.162Z [INFO] nomad.raft: entering leader state: leader="Node at 10.10.10.93:4647 [Leader]"
Oct 02 11:44:07 ip-10-10-10-93 nomad[3802]: 2021-10-02T11:44:07.163Z [INFO] nomad.raft: added peer, starting replication: peer=10.10.10.48:4647
Oct 02 11:44:07 ip-10-10-10-93 nomad[3802]: 2021-10-02T11:44:07.163Z [INFO] nomad.raft: added peer, starting replication: peer=10.10.10.12:4647
Oct 02 11:44:07 ip-10-10-10-93 nomad[3802]: 2021-10-02T11:44:07.163Z [INFO] nomad: cluster leadership acquired
Oct 02 11:44:07 ip-10-10-10-93 nomad[3802]: 2021-10-02T11:44:07.165Z [INFO] nomad.raft: pipelining replication: peer="{Voter 10.10.10.12:4647 10.10.10.12:4647}"
Oct 02 11:44:07 ip-10-10-10-93 nomad[3802]: 2021-10-02T11:44:07.166Z [WARN] nomad.raft: appendEntries rejected, sending older logs: peer="{Voter 10.10.10.48:4647 10.10.10.48:4647}" next=1
Oct 02 11:44:07 ip-10-10-10-93 nomad[3802]: 2021-10-02T11:44:07.168Z [INFO] nomad.raft: pipelining replication: peer="{Voter 10.10.10.48:4647 10.10.10.48:4647}"
Oct 02 11:44:07 ip-10-10-10-93 nomad[3802]: 2021-10-02T11:44:07.186Z [INFO] nomad.core: established cluster id: cluster_id=c40704d5-7b77-0ea7-9da2-eef39a58b4bb create_time=1633175047177920656
My question is whether port 4647 is new, or simply missing from the security group module?
The config from an installation using the root module differs slightly, but I can't see anything pinning Nomad to a different port:
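As a workaround until the module is fixed, I added a self-referencing ingress rule for the server RPC port next to the module. A minimal sketch (the security group output name `module.nomad_security_group.security_group_id` is an assumption, adjust it to your module instance):

```hcl
# Hypothetical workaround: allow Nomad server RPC (4647) between all
# instances attached to the same security group.
resource "aws_security_group_rule" "nomad_server_rpc" {
  type              = "ingress"
  from_port         = 4647
  to_port           = 4647
  protocol          = "tcp"
  self              = true # only from members of this security group
  security_group_id = module.nomad_security_group.security_group_id # assumed output name
}
```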
/opt/nomad/config/default.hcl:
datacenter = "us-east-1c"
name = "i-06382f65cc9495792"
region = "us-east-1"
bind_addr = "0.0.0.0"
advertise {
http = "172.31.84.5"
rpc = "172.31.84.5"
serf = "172.31.84.5"
}
server {
enabled = true
bootstrap_expect = 3
}
consul {
address = "127.0.0.1:8500"
}
Update: It seems port 4648 was missing too. I did not notice that in my previous tests because I had previously enabled an allow-all rule inside the security group.
Oct 02 12:44:49 ip-10-10-10-69 nomad[4129]: 2021-10-02T12:44:49.396Z [ERROR] nomad: error looking up Nomad servers in Consul: error="contacted 0 Nomad Servers: 2 errors occurred:
Oct 02 12:44:49 ip-10-10-10-69 nomad[4129]: * Failed to join 10.10.10.38: dial tcp 10.10.10.38:4648: i/o timeout
Oct 02 12:44:49 ip-10-10-10-69 nomad[4129]: * Failed to join 10.10.10.14: dial tcp 10.10.10.14:4648: i/o timeout
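For completeness, the serf/gossip port can be opened with a similar self-referencing rule; since gossip uses both TCP and UDP on 4648, I open both. Again a sketch, with an assumed security group output name:

```hcl
# Hypothetical workaround: allow Nomad serf gossip (4648, TCP and UDP)
# between all instances attached to the same security group.
resource "aws_security_group_rule" "nomad_serf" {
  for_each          = toset(["tcp", "udp"])
  type              = "ingress"
  from_port         = 4648
  to_port           = 4648
  protocol          = each.value
  self              = true # only from members of this security group
  security_group_id = module.nomad_security_group.security_group_id # assumed output name
}
```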